
The main benefit of using permutation methods is that we can use any test statistic as long as its value does not remain constant under permutations. As the test using Fisher’s G-statistic performs consistently well and has the lowest run time, we create a new test statistic based on Fisher’s G-statistic.

In Chapter 4 we saw that, under the null hypothesis, Fisher’s G-statistic remains constant under permutations. For the new test statistic we leverage this property and define it as follows:

\[
T_{NS} := \max_{j \in J'_n} I^{(X,n)}(\omega_j) - I^{(X,n)}\!\left(\frac{2\pi}{\tau_0}\right),
\]

where 2π/τ0 is the frequency corresponding to τ0 and I is the extended periodogram of the data X.

As we are using permutation methods, we do not need to determine its distribution analytically, which would in any case only be possible if an exact distribution can be found for it. Here we see one of the main benefits of using permutation methods.

If τ0 is the true period of f then

\[
I^{(X,n)}\!\left(\frac{2\pi}{\tau_0}\right) = \max_{j \in J'_n} I^{(X,n)}(\omega_j)
\]

and so $T_{NS} = 0$. If τ0 is not the period of f then $T_{NS} > 0$. Thus, the hypotheses are

\[
H_0: T_{NS} = 0 \quad \text{against} \quad H_1: T_{NS} > 0.
\]
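The definition of the new statistic can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the report: the periodogram is normalised by 1/n, the maximum runs over the Fourier frequencies 2πj/n for j = 1, ..., ⌊n/2⌋, and the value at 2π/τ0 is obtained by evaluating the DFT sum directly at that frequency. The helper names `periodogram_at` and `t_ns` are hypothetical.

```python
import numpy as np

def periodogram_at(x, omega):
    """Periodogram value |sum_t x_t exp(-i omega t)|^2 / n at any frequency omega."""
    n = len(x)
    t = np.arange(n)
    return np.abs(np.sum(x * np.exp(-1j * omega * t))) ** 2 / n

def t_ns(x, tau0):
    """Largest periodogram value at the Fourier frequencies minus the value at 2*pi/tau0."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Fourier frequencies 2*pi*j/n for j = 1, ..., floor(n/2) (assumed index set)
    fourier_values = np.abs(np.fft.rfft(x)[1:]) ** 2 / n
    return fourier_values.max() - periodogram_at(x, 2 * np.pi / tau0)

t = np.arange(70)
x = 2 * np.sin(2 * np.pi * t / (7 / 3))   # noise-free sine with period 7/3
stat_true = t_ns(x, 7 / 3)                # hypothesised period equals the true period
stat_wrong = t_ns(x, 5)                   # hypothesised period differs from the true period
```

In this noise-free example, where 2π/τ0 coincides with a Fourier frequency, `stat_true` should vanish up to rounding error while `stat_wrong` is clearly positive, which mirrors the two hypotheses.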

Due to the nature of the permutations defined in Chapter 4, we know that under the null hypothesis the test statistic will remain constant under permutations. Under the alternative hypothesis, the periodogram values will be large and the averaging causes the value of the test statistic to decrease under permutations.

Based on this description, we can see the parallels between the test using this new test statistic and the test using Fisher’s G-statistic. Thus, the results should be similar to those of Fisher’s G-statistic.

To evaluate the performance of this test we compute the type I error and use ROC curves to compare it to the two best tests found before: the tests using Fisher’s G-statistic and Bartlett’s method. First, we consider the sine function with period 7/3. For all permuting methods the type I error is 0.0000, and the ROC curves can be found in Figures 5.29, 5.30, and 5.31. The figures show that the test is even slightly better than the tests using Fisher’s G-statistic and Bartlett’s method. For permuting method 3, the area under the curve is 0.9818, which is the highest value found overall. Moreover, the run time for this test statistic is O(n log n), with the fast Fourier transform as the bottleneck.
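As a rough illustration of this kind of evaluation, the sketch below computes an empirical type I error and an ROC curve from two sets of p-values, one simulated under the null hypothesis and one under the alternative. It is an assumption about how such curves can be built, not a description of the procedure behind the figures; the function name `roc_from_p_values` and the toy p-values are hypothetical.

```python
import numpy as np

def roc_from_p_values(p_null, p_alt, levels=None):
    """Rejection rates under H0 (false positives) and H1 (true positives) per level."""
    if levels is None:
        levels = np.linspace(0.0, 1.0, 101)
    p_null = np.asarray(p_null, dtype=float)
    p_alt = np.asarray(p_alt, dtype=float)
    fpr = np.array([(p_null <= a).mean() for a in levels])
    tpr = np.array([(p_alt <= a).mean() for a in levels])
    # Area under the curve by the trapezoidal rule
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
    return fpr, tpr, auc

# Toy p-values: the test never rejects under H0 and always rejects under H1.
p_null = np.ones(200)
p_alt = np.zeros(200)
fpr, tpr, auc = roc_from_p_values(p_null, p_alt)
type_one_error = (p_null <= 0.05).mean()   # empirical type I error at level 0.05
```

A test statistic that stays constant under permutations when the null hypothesis holds produces p-values equal to one, which is why a type I error of zero is possible in this setting.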

Due to these promising results, the test was also applied to functions with multiple frequencies. There, too, the test statistic performed better than the test using Fisher’s G-statistic. However, the test behaves similarly to the test using Fisher’s G-statistic with a composite null hypothesis considering all the frequencies contributing to the period of the function. As a result, the limitations have a greater effect on this test than on the test using Bartlett’s method. Therefore, this was not investigated further.

CHAPTER 5. RESULTS & COMPARISON OF METHODS

Figure 5.29: ROC Curves for Sine Function with Period 7/3 for permuting method 1.

Figure 5.30: ROC Curves for Sine Function with Period 7/3 for permuting method 2.

Figure 5.31: ROC Curves for Sine Function with Period 7/3 for permuting method 3.

Chapter 6

Conclusion

Determining the period of repetitive behaviour in data is important for accurate predictions. This has high practical value in fields such as business forecasting. Therefore, the goal of this report is to investigate whether permutation methods can be used to create exact tests that determine the period of a function. The tests currently available are mainly used to detect periodicity rather than to test for a specific period. Moreover, many are only approximate rather than exact. Therefore, this report investigates this topic with permutation methods, seemingly for the first time, and forms a base for possible future research.

Using permutation methods results in flexible, simple, and exact tests under minimal constraints. The constraints are distribution invariance for the sample data and that the value of the test statistic cannot stay constant under permutations. To ensure the distribution of the sample data is invariant, specific permuting methods are designed in Chapter 4. Any test statistic that is not constant under these permutations can then be used to test the competing hypotheses: τ0 is the period of the function f (null hypothesis) against its complement. By computing the value of this test statistic for each permutation, we can construct its distribution. Therefore, we do not need to derive the distribution of the chosen test statistic, which makes the test simple. Moreover, the fact that we can use any test statistic whose value varies under permutations results in a flexible test.
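The general recipe of building the distribution of a statistic from permutations can be sketched as follows. The permuting scheme used here, a plain shuffle of consecutive blocks of length p, is a hypothetical stand-in and not one of the permuting methods of Chapter 4; the example statistic, the number of permutations R = 999 and the direction of the p-value (large observed values count as evidence against the null) are likewise assumptions.

```python
import numpy as np

def max_periodogram(x):
    """Example statistic: largest periodogram value over j = 1, ..., floor(n/2)."""
    return np.max(np.abs(np.fft.rfft(x)[1:]) ** 2) / len(x)

def block_shuffle(x, p, rng):
    """Hypothetical permuting scheme: shuffle the floor(n/p) blocks of length p."""
    m = len(x) // p
    blocks = x[: m * p].reshape(m, p)
    return blocks[rng.permutation(m)].ravel()

def permutation_p_value(x, statistic, p, R=999, seed=0):
    """Monte Carlo p-value: share of permuted statistics at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    x = x[: (len(x) // p) * p]              # drop the incomplete last block
    t_obs = statistic(x)
    t_perm = np.array([statistic(block_shuffle(x, p, rng)) for _ in range(R)])
    # Small tolerance so that floating-point ties count as "at least as large".
    return (1 + np.sum(t_perm >= t_obs - 1e-9)) / (R + 1)

t = np.arange(70)
x = 2 * np.sin(2 * np.pi * t / 7)                        # noise-free signal with period 7
p_null = permutation_p_value(x, max_periodogram, p=7)    # tau0 = 7 is the period
p_alt = permutation_p_value(x, max_periodogram, p=5)     # tau0 = 5 is not
```

When the block length matches the period, every shuffle reproduces the same sequence, so the statistic stays constant and the p-value equals one. With a wrong block length the shuffles break up the periodic structure and the observed value stands out.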

Firstly, we used test statistics found in existing tests and checked that their values do not remain constant under permutations. Four test statistics, Fisher’s G-statistic, Bartlett’s method, Welch’s method, and the Lomb-Scargle periodogram, satisfied this requirement. The tests using these four statistics were evaluated for functions with a single frequency and functions with multiple frequencies. Note that these functions are still quite basic; however, most periodic functions can be written as a sum of cosines and sines, and thus this forms a solid base for more complicated functions. Preliminary testing for these more complicated functions already looks very promising.

For both functions with single and multiple frequencies, two tests performed consistently well: the tests using Fisher’s G-statistic and Bartlett’s method. Although Fisher’s G-statistic has a known distribution, the test using permutation methods leverages the fact that Fisher’s G-statistic remains constant under permutations under the null hypothesis and only varies under permutations under the alternative hypothesis. Due to this contrasting behaviour under the two hypotheses, the test performs exceptionally well with a negligible type I error. On the other hand, the distribution of the statistic in Bartlett’s method can only be approximated analytically, whereas permutation methods give an exact distribution for it.

For single frequency functions, the test using Fisher’s G-statistic performs consistently well. However, for functions with multiple frequencies the test using Bartlett’s method is preferred as the limitation has less of an influence on the test, despite the similar performance compared to the test using Fisher’s G-statistic. In terms of efficiency, Fisher’s test performs much better as it has a run time of O(n log n) compared to O(n²) for Bartlett’s method. Thus, for practical use, the test using Fisher’s G-statistic is recommended.
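The O(n log n) run time comes from the fast Fourier transform. As a minimal sketch, assuming the classical form of Fisher’s G-statistic (largest periodogram ordinate divided by the sum of the ordinates over j = 1, ..., ν with ν = [(n − 1)/2]) and a 1/n normalisation that cancels in the ratio, the statistic can be computed as follows. The function name `fisher_g` is hypothetical.

```python
import numpy as np

def fisher_g(x):
    """Largest periodogram ordinate divided by the sum of all ordinates, j = 1, ..., nu."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    nu = (n - 1) // 2
    # One FFT of length n gives all ordinates, hence the O(n log n) cost.
    ordinates = np.abs(np.fft.rfft(x)[1 : nu + 1]) ** 2 / n
    return ordinates.max() / ordinates.sum()

t = np.arange(70)
g_sine = fisher_g(2 * np.sin(2 * np.pi * t / (7 / 3)))          # all power at one frequency
g_noise = fisher_g(np.random.default_rng(0).normal(size=70))    # power spread over frequencies
```

For the pure sine the ratio is essentially one, while for white noise it lies between 1/ν and one and is typically much closer to the lower end.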

In an attempt to further improve the test using Fisher’s G-statistic, we constructed a new test statistic. As we are using permutation methods, it was not necessary to find the distribution of this statistic. Furthermore, we based this test statistic on Fisher’s G-statistic to retain the same contrasting behaviour under the different hypotheses. We found that, for functions with a single frequency, this test statistic performed better than both the test using Bartlett’s method and the test using Fisher’s G-statistic, as it is less affected by the limitations. However, Bartlett’s method is still better for functions with multiple frequencies. Lastly, the run time for the new statistic is O(n log n), making it suitable for practical use as well.

There are two main limitations for the tests. Firstly, based on the definition of periodicity, the tests fail to reject the null hypothesis for multiples of the period. This results in issues of identifiability. Secondly, based on how the permuting methods are defined, it was found that only the numerator of the hypothesised period had to be a multiple of the period for the test to incorrectly fail to reject the null hypothesis. It is believed that the first issue cannot be fixed as it is based on the definition of a periodic function. However, some measures can be taken to avoid it (see Section 5.5.3). For the second limitation, it is believed that a solution can be found. This is suggested as future research. Investigating the sampling rate used may be a good starting point. If the second limitation is solved, it is believed that the tests using Fisher’s G-statistic and the new test statistic would be near perfect, as in almost all cases they incorrectly failed to reject the null hypothesis due to this limitation. For Bartlett’s method, this is not always the case. Therefore, the tests using Fisher’s G-statistic and the new test statistic seem to be the most promising for future research. Furthermore, there is a wide range of possible test statistics that can also be considered for future research due to the flexibility when testing using permutation methods. For example, the test statistic could be as simple as the periodogram itself.
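The first limitation follows directly from the definition of periodicity, as a short numerical check shows. The sketch reuses the sine function g from Appendix C with period 7/3; the evaluation grid and the shifts are chosen for illustration only.

```python
import numpy as np

def g(x, tau):
    """Sine with period tau, mirroring g in Appendix C."""
    return 2 * np.sin(2 * np.pi * x / tau)

tau = 7 / 3
x = np.arange(50, dtype=float)

# Shifting by the period or by any integer multiple of it leaves the function unchanged.
same_after_tau = np.allclose(g(x + tau, tau), g(x, tau))
same_after_3tau = np.allclose(g(x + 3 * tau, tau), g(x, tau))   # 3 * tau = 7
# Shifting by half the period does not.
same_after_half = np.allclose(g(x + tau / 2, tau), g(x, tau))
```

Since a shift by 3τ = 7 reproduces the function exactly, no test based on periodicity alone can distinguish the hypothesised period 7 from the true period 7/3.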

It is worth noting that the tests still have useful applications, despite the aforementioned limitations. If the tests reject the null hypothesis, we are confident that the value is not the period. Thus, the test can accurately determine if a value is not the period of the function. This can be used to determine whether a specific frequency band is free before sending a signal transmitting data through it. For the test using Fisher’s G-statistic, if the test rejects the null hypothesis, we are confident that the frequency band is free. Moreover, if a company wants to determine where to invest over a year based on previous data, in many cases it is sufficient to check whether the data is yearly or not (i.e. periodic per year). Thus, the limitations do not affect this case.

References

Abramovich, F., & Ritov, Y. (2013). Statistical Theory: A Concise Introduction.

Anderson, T. W. (1971). The statistical analysis of time series. New York: John Wiley & Sons.

Bartlett, M. S. (1950, 6). Periodogram Analysis and Continuous Spectra. Biometrika, 37 (1-2), 1–16. doi: 10.1093/BIOMET

Brandes, O., Farley, J., Hinich, M., & Zackrisson, U. (1968). The Time Domain and the Frequency Domain in Time Series Analysis. The Swedish Journal of Economics, 70 (1), 25–42. doi: 10.2307/3438983

Brockwell, P. J., & Davis, R. A. (2006). Time Series: Theory and Methods - Second Edition. In Springer series in statistics (pp. 331–396). Springer Science & Business Media, LLC. doi: 10.1007/978-1-4419-0320-4

Cowpertwait, P., & Metcalfe, A. (2009). Introductory time series with R. New York.

Craymer, M. R. (1998). The least squares spectrum, its inverse transform and autocorrelation function: Theory and some applications in geodesy (Unpublished doctoral dissertation). University of Toronto.

Ernst, M. D. (2004). Permutation methods: A basis for exact inference. Statistical Science, 19 (4), 676–685. doi: 10.1214/088342304000000396

Fisher, R. A. (1929). Tests of significance in harmonic analysis. Proceedings of the Royal Society Series A, 125 (796), 54–59.

Fisher, R. A. (1936). Design of Experiments (Vol. 1) (No. 3923). doi: 10.1136/bmj.1.3923.554-a

Fisher, R. A. (1939). The Sampling Distribution of some Statistics Obtained From Non-linear Equations. Annals of Eugenics, 9 (3), 238–249. doi: 10.1111/j.1469-1809.1939.tb02211.x

Fuller, W. A. (1976). Introduction to statistical time series. New York: John Wiley & Sons.

Grenander, U., & Rosenblatt, M. (1957). Statistical Analysis of Stationary Time Series. New York: John Wiley.

Newton, H. J. (1997). A periodogram-based test for white noise. Stata Technical Bulletin, 6 (34).

Novick, S. J. (1994). Analysis of Fisher’s test for hidden periodicities (Unpublished doctoral dissertation). Lehigh University.

Pitman, E. J. G. (1937). Significance Tests Which May be Applied to Samples From any Populations. Journal of the Royal Statistical Society, 4 (1), 119–130.

Scargle, J. D. (1982). Statistical aspects of spectral analysis of unevenly spaced data. Astrophysical Journal , 263 , 835–853. doi: 10.1086/160554


Shimshoni, M. (1971). On Fisher’s Test of Significance in Harmonic Analysis. Geophysical Journal of the Royal Astronomical Society, 23 (4), 373–377. doi: 10.1111/j.1365-246X.1971.tb01829.x

Siegel, A. F. (1980). Testing for Periodicity in a Time Series. Journal of the American Statistical Association, 75 (370), 345–348. doi: 10.1214/17-AOS1645

Van der Plas, J. (2015). Fast Lomb-Scargle Periodograms in Python. Pythonic Perambulations. Retrieved from https://jakevdp.github.io/blog/2015/06/13/lomb-scargle-in-python/

Welch, P. D. (1967). The Use of the Fast Fourier Transform for the Estimation of the Power Spectra: A Method Based on Time Averaging Over Short, Modified Periodograms. IEEE Trans. Audio and electroacoustic, 15 , 70–73.

Appendix A

Notation

Notation | Meaning | Definition
--- | --- | ---
τ | True period of f. | Introduction.
f_t | True value from the periodic function f(t) at time t. | Introduction.
ε_t | Random error at time t. | Introduction.
τ0 | Hypothesised period of f. | Introduction.
X = (X_1, ..., X_n) | Data. | Definition 2.1.1.
x = (x_1, ..., x_n) | Observations of the data X. | -
Θ | Parameter set. | Definition 2.1.2.
θ | Unknown parameter in Θ. | Definition 2.1.2.
H0 | Null hypothesis. | Definition 2.1.2.
H1 | Alternative hypothesis. | Definition 2.1.2.
ψ | Test function. | Definition 2.1.2.
C | Critical region. | Definition 2.1.4.
α_ψ(θ) | Probability of Type I error for test ψ as a function of θ. | Equation 2.1.
β_ψ(θ) | Probability of Type II error for test ψ as a function of θ. | Equation 2.2.
α | Significance level. | Definition 2.1.5.
T | Test statistic. | Definition 2.1.6.
t | Observed value of test statistic T. | -
ψ(θ) | Power function of test ψ as a function of θ. | Definition 2.1.9.
p | p-value. | Definition 2.1.11.
µ | Mean. | -
σ | Standard deviation. | -
σ² | Variance. | -
X̄ | Mean of the data X. | -
Z | Standard normal random variable. | -
n | Length of data. | -
ω_j | Fourier frequencies for j ∈ J_n. | Definition 2.3.1.
e_j | Basis vector for C^n for j ∈ J_n. | Equation 2.8.
J_n | Set of integers corresponding to the Fourier frequencies. | Equation 2.9.
{f̃_i : i = 1, ..., n} | Discrete Fourier Transform of f. | Equation 2.11.
I^f(ω_j) | Periodogram value for f at ω_j. | Definition 2.4.1.
c_j | Cosine basis vector for R^n. | Equation 2.17.
s_j | Sine basis vector for R^n. | Equation 2.18.
γ | Autocovariance function. | Definition 2.5.3.
ρ | Autocorrelation function. | Definition 2.5.4.
h | Difference between times. | Definition 2.5.7.
ω | Frequency. | -
J'_n | Real counterpart of J_n. | Section 3.3.
ν | [(n − 1)/2] | Section 3.3.1.
P | Projection matrix. | Equation 3.10.
G_r | r-th order Fisher G-statistic. | Equation 3.17.
A | Basis matrix for R^n. | Equation 3.22.
s | Least squares spectrum. | Equation 3.30.
ℓ | Time delay. | Definition 3.3.1.
P_X(ω) | Lomb-Scargle periodogram. | Definition 3.3.2.
p | Numerator of τ0, i.e. τ0 = p/q. | Equation 4.1.
q | Denominator of τ0, i.e. τ0 = p/q. | Equation 4.1.
m | ⌊n/p⌋ | Equation 4.2.
B_i | Block i for i = 1, ..., m. | Equation 4.3.
B_ij | Sub-block j in block i for i = 1, ..., m and j = 1, ..., q. | Equation 4.4.
R | Number of Monte Carlo iterations. | Section 5.2.
K | Set of contributing frequency components. | Section 5.6.1.

Table A.1: Notation used in the report.

Appendix B

Proof of Proposition 3.2.1

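Proposition 3.2.1 Suppose that $X_t = f_t + \varepsilon_t$ where $\varepsilon_t \sim N(0, \sigma^2)$. Let the sets $\{\tilde{X}_j\}_{j \in J_n}$, $\{\tilde{f}_j\}_{j \in J_n}$, and $\{\tilde{\varepsilon}_j\}_{j \in J_n}$ denote the DFT for $X$, $f$, $\varepsilon$ respectively and let $I^{(X,n)}(\omega_j)$, $I^{(f,n)}(\omega_j)$, $I^{(\varepsilon,n)}(\omega_j)$ denote the periodogram for $X$, $f$, $\varepsilon$ for $j \in J_n$ as defined in equation 2.27.

(i) $\tilde{X}_j \sim N(\tilde{f}_j, \sigma^2)$, with the $\tilde{X}_j$ independent.

(ii) $I^X(\omega_j) = I^f(\omega_j) + I^\varepsilon(\omega_j) + C_j$ where $C_j \sim N(0, 4\sigma^2 \ldots_j^{2})$, $I^\varepsilon(\omega_j) \sim \mathrm{Exp}(1/\sigma^2)$, and $I^f(\omega_j)$ is deterministic.

(iii) $E[I^{(X,n)}(\omega_j)] = I^f(\omega_j) + \sigma^2$.

(iv) If $E[X_1^4] < \infty$ and $\omega_j = 2\pi j/n \in [0, \pi]$, then

\[
\mathrm{Cov}\big(I^{(X,n)}(\omega_j), I^{(X,n)}(\omega_k)\big) = \frac{4\sigma^2 d}{n^2} \to 0 \quad \text{as } n \to \infty
\]

and

\[
\mathrm{Var}\big(I^{(X,n)}(\omega_j)\big) = \frac{2\sigma^2}{n} I^f(\omega_j) + \frac{2\sigma^2}{n^2} d' +
\begin{cases}
2\sigma^4 & \text{if } \omega_j = 0 \text{ or } \pi, \\
\sigma^4 & \text{else,}
\end{cases}
\]

where $d, d' \in \mathbb{R}$.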
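Part (iii) can be checked numerically with a small Monte Carlo experiment. The sketch below assumes the unitary normalisation of the DFT (division by √n), so that the periodogram is the squared modulus of the transformed data; the signal, the sample size n = 70, σ = 1 and the number of repetitions R are chosen for illustration only.

```python
import numpy as np

def periodogram(x):
    """Periodogram |X~_j|^2 with the unitary DFT, X~ = fft(x) / sqrt(n) (assumed normalisation)."""
    n = x.shape[-1]
    return np.abs(np.fft.fft(x, axis=-1)) ** 2 / n

rng = np.random.default_rng(0)
n, sigma, R = 70, 1.0, 20000
t = np.arange(n)
f = 2 * np.sin(2 * np.pi * t / (7 / 3))        # deterministic signal with period 7/3

eps = rng.normal(0.0, sigma, size=(R, n))      # R independent noise vectors
I_X_mean = periodogram(f + eps).mean(axis=0)   # Monte Carlo estimate of E[I^X(omega_j)]
I_f = periodogram(f)                           # deterministic part

# Largest deviation from the claimed mean I^f(omega_j) + sigma^2 over all j
max_gap = np.max(np.abs(I_X_mean - (I_f + sigma ** 2)))
```

The value of `max_gap` should be small relative to the peak of `I_f` and should shrink further as R grows.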

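Proof. (i) From the linearity of the discrete Fourier transform (DFT) we have

\[
\tilde{X}_j = \tilde{f}_j + \tilde{\varepsilon}_j \tag{B.1}
\]

for $j \in J_n$. Then, from Corollary 2.3.2, $\tilde{\varepsilon}_j = \langle \varepsilon, e_j \rangle$. Thus, $\tilde{\varepsilon}_j$ is a linear combination of normally distributed variables and therefore is also normally distributed. We claim that $\tilde{\varepsilon}_j \sim N(0, \sigma^2)$.

Let $A$ be the complex $n \times n$ matrix with the basis $\{e_j : j \in J_n\}$ as its columns. Then $\tilde{\varepsilon} = A^T \varepsilon$. Thus,

\[
E[\tilde{\varepsilon}] = E[A^T \varepsilon] = A^T E[\varepsilon] = 0 \tag{B.2}
\]

and

\[
\mathrm{Cov}(\tilde{\varepsilon}) = A^T \mathrm{Cov}(\varepsilon) A = A^T \sigma^2 I A. \tag{B.3}
\]

Then

\[
\mathrm{Var}(\tilde{\varepsilon}_j) = (A^T \sigma^2 I A)_{(j,j)} = \sigma^2 A_j^T A_j = \sigma^2. \tag{B.4}
\]

Thus, $\tilde{\varepsilon}_j \sim N(0, \sigma^2)$ because $A_j^T A_j = 1$ as we have an orthonormal basis. Since $\tilde{f}_j$ is deterministic, we can conclude that $\tilde{X}_j \sim N(\tilde{f}_j, \sigma^2)$.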


Lastly, the $\tilde{X}_j$ are independent because the $\tilde{\varepsilon}_j$ are independent as they are orthogonal transforms of an orthonormal basis.
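The covariance step in (B.3) and (B.4) can be checked numerically. The sketch below uses a real orthonormal basis obtained from a QR decomposition as a stand-in for the basis in the proof (an assumption made to keep the example simple; for a complex basis the transpose would become a conjugate transpose). The values of n and σ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 8, 1.5

# Hypothetical real orthonormal basis: the Q factor of a random square matrix.
A, _ = np.linalg.qr(rng.normal(size=(n, n)))

cov_eps = sigma ** 2 * np.eye(n)       # Cov(eps) = sigma^2 I
cov_eps_tilde = A.T @ cov_eps @ A      # Cov(A^T eps), as in (B.3)

columns_orthonormal = np.allclose(A.T @ A, np.eye(n))
covariance_unchanged = np.allclose(cov_eps_tilde, sigma ** 2 * np.eye(n))
```

The transformed covariance is again σ² times the identity: every component keeps variance σ² and the components stay uncorrelated, which for jointly normal variables implies independence.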

(ii) For this proof we consider values in the real plane. The proof follows from equations 2.27 and B.1. Note that Re(a) denotes the real part of a complex number a.

\[
I^X(\omega_j) = |\tilde{X}_j|^2 = |\tilde{f}_j + \tilde{\varepsilon}_j|^2 \tag{B.5}
\]

As we are considering values in the real plane, we know that $\{e_0, c_1, s_1, \ldots, c_{[(n-1)/2]}, s_{[(n-1)/2]}, e_{n/2}\}$ is an orthonormal basis for $\mathbb{R}^n$, where $c_j$ and $s_j$ are defined in equations 2.17 and 2.18. Then we can define the periodogram for $\varepsilon$ for $\omega_j$, $j = 1, \ldots, [(n-1)/2]$ as follows:

as the components corresponding to cj and sj are combined to produce the component for ωj. Both η and ζ are linear combinations of normally distributed variables and therefore are also normally distributed. They are also independent due to the independence between cj and sj. Then we claim that η, ζ ∼ N (0, σ2).

Clearly, the mean of both η and ζ is zero. Then we compute their variance as follows:

\[
\mathrm{Var}(\eta(\omega_j)) = E[\eta^2(\omega_j)] = \ldots
\]


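The equality marked with (∗) follows from the fact that

\[
E[\varepsilon_t \varepsilon_s] =
\begin{cases}
\sigma^2 & \text{if } t = s, \\
0 & \text{otherwise.}
\end{cases}
\]

Therefore, $\eta, \zeta \sim N(0, \sigma^2)$. Then from this result and Equation B.10 we can conclude that

\[
I^{(\varepsilon,n)}(\omega_j) \overset{\text{i.i.d.}}{\sim} \mathrm{Exp}\!\left(\frac{1}{\sigma^2}\right). \tag{B.19}
\]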

(iii) From equation 2.27 we can express the periodogram of X as follows:

\[
I^{(X,n)} = n^{-1} \ldots
\]

From Proposition 10.3.2 of Brockwell & Davis (2006) we know that

\[
E[I^{(f,n)}(\omega_j) I^{(f,n)}(\omega_k)] = \sigma^4
\]


Appendix C

Python Code

import matplotlib.pyplot as plt
import numpy as np
from scipy.fft import fft
from scipy import signal
import math
from itertools import permutations
import itertools
from pprint import pprint
import random
import time

# User defined variable: Value of the variance of the noise. Default is 1.
sigma = 1

# Functions we are observing. Used to generate data given a period.
# Signal to Noise ratio = 2 when f(x, 5/2), g(x, 7/3), a(x, 20), st(x, 10), tri(x, 4).

# Single frequency functions
def f(x, tau):
    return 2 * np.cos(2 * np.pi * x / tau)

def g(x, tau):
    return 2 * np.sin(2 * np.pi * x / tau)

def frac(x):
    return x - math.floor(x)

def st(x, tau):
    return np.sqrt(6) * 2 * (x / tau - math.floor(1 / 2 + x / tau))

def tri(x, tau):
    return np.sqrt(6) * (4 / tau * (x - (tau / 2) * math.floor(2 * x / tau + 1 / 2)) * (-1) ** math.floor(2 * x / tau + 1 / 2))

# Multiple frequency function
def a(x, tau):
    return np.sqrt(2) * (np.cos(2 * np.pi * x / 5) + np.sin(2 * np.pi * x / 6))

# Functions to aid in the handling of data.

# Flattens a list of lists into a single list.
def flattenList(lst):
    flatList = []
    # The rest of the original body is not available here; the loop below is an assumed completion.
    for subList in lst:
        flatList.extend(subList)
    return flatList
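As a usage illustration for the functions above, the following sketch generates noisy data of the form X_t = f_t + ε_t and checks the stated signal-to-noise ratio of 2 for f(x, 5/2). The helper `simulate`, sampling at the integer times t = 0, ..., n − 1, and reading the signal-to-noise ratio as average signal power divided by noise variance are assumptions that do not come from the listing.

```python
import numpy as np

def f(x, tau):
    """Cosine with period tau and amplitude 2, as in the listing above."""
    return 2 * np.cos(2 * np.pi * x / tau)

def simulate(func, tau, n, sigma=1.0, seed=0):
    """Return X_t = func(t, tau) + eps_t for t = 0, ..., n - 1 with N(0, sigma^2) noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    return func(t, tau) + rng.normal(0.0, sigma, size=n)

n, tau, sigma = 100, 5 / 2, 1.0
signal_power = np.mean(f(np.arange(n), tau) ** 2)   # average of f_t^2 over whole periods
snr = signal_power / sigma ** 2                     # assumed definition of the ratio
X = simulate(f, tau, n, sigma)
```

With amplitude 2 the average signal power is 2, so a noise variance of 1 gives the ratio of 2 mentioned in the comment of the listing.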
