Short-term forecasting of the settlement price of the Dutch electricity imbalance market using ARMA-GARCH models

Academic year: 2021



M.J.S. Neisingh


Master’s thesis Econometrics, Operations Research and Actuarial Studies
Specialization: Econometrics

University of Groningen

Supervisors:

Prof. Dr. R.H. Koning (University of Groningen)
J.M.A. Sijtsma MSc (GDF SUEZ Energie Nederland)

Co-assessor:


Abstract


Contents

1 Introduction

2 The Dutch imbalance market and settlement price
  2.1 The imbalance market
  2.2 The settlement price

3 Theory on ARMA-GARCH processes
  3.1 ARMA processes
  3.2 GARCH processes
  3.3 ARMA-GARCH processes
  3.4 Prediction in ARMA-GARCH processes

4 Data
  4.1 Data description
  4.2 Stationarity
  4.3 Distribution

5 Procedure to model and forecast the settlement price
  5.1 Modelling procedure
  5.2 Forecasting procedure

6 Results for the separate datasets
  6.1 Modelling results
    6.1.1 Weekdays
    6.1.2 Weekends/Holidays
  6.2 Empirical forecasting performance
    6.2.1 Weekdays
    6.2.2 Weekends/holidays

7 Results for the unseparated dataset
  7.1 Modelling results
    7.1.1 External covariates
  7.2 Empirical forecasting performance
    7.2.1 External covariates
  7.3 The NIG distribution for the innovations

8 Conclusion

A Distributions for the innovations
  A.1 Normal distribution
  A.2 Student-t distribution
  A.3 Skewed Student-t distribution


1 Introduction

A discrepancy between demand and supply of electricity within the electricity grid is called imbalance. Since imbalance can damage the grid, it is important to minimize it. A regulating authority is responsible for this and asks parties connected to the grid to submit their expected demand from the grid and supply to the grid, called the E-Programme, one day ahead. However, due to unforeseen events such as demand forecast errors, (wind) power forecast errors or a major event such as a plant outage, some imbalance inevitably remains.

When an imbalance arises the supply to the grid and/or demand from the grid must change. The authority in the Netherlands responsible for maintaining and monitoring the system balance is TenneT TSO B.V. (hereinafter referred to as TenneT). TenneT does not own production capacity and therefore relies on parties that do have this capacity. The change in supply is arranged through the imbalance market in which the producing parties have an incentive to adjust production when necessary. This incentive is the settlement price, which is explained in Section 2 in detail. The parties that are causing the imbalance are penalized through the settlement price.

The imbalance market has received some attention in the literature. These studies review the imbalance market as a whole or focus on long-term developments. Van der Veen and Hakvoort (2009) conclude that the characteristics of the mechanisms used in the Netherlands result in a moderately low accuracy of the E-Programmes, moderately high imbalance volumes and a high and volatile settlement price. IPA Energy Consulting (2006) gives a brief overview of the organisation of the electricity markets in Europe, including the balancing mechanism. The increasing share of renewable sources, such as wind power, has its effect on the imbalance market. Van der Veen and de Vries (2009) conclude that the overall effects of large-scale penetration of microgeneration on the operational performance of the Dutch balancing market will probably be manageable and give suggestions to improve the operational performance. Jaehnert and Doorman (2012) show that the system imbalances will increase up to 2020 due to wind power forecast errors. Farahmand et al. (2012) expect that integration of the balancing markets in Northern Europe increases the flexibility that is needed to cope with more uncertainty due to an increased share of renewable sources.

Insight into the short-term behaviour of the imbalance market can be informative for parties, for example for a connected party to know during the day when financial risks will be high. However, to our knowledge, there is little existing literature on the short-term behaviour of the imbalance market. This research therefore focuses on short-term forecasts for the settlement price, an important price on the imbalance market.


Time series analysis, and specifically Autoregressive Moving Average (ARMA) and Generalized Autoregressive Conditionally Heteroskedastic (GARCH) processes, has been used several times to model the short-term dynamics of the electricity price. Contreras et al. (2003) use Autoregressive Integrated Moving Average (ARIMA) models to predict next-day electricity prices and Garcia et al. (2005) find that the GARCH methodology outperforms a general ARIMA model when volatility and price spikes are present. Zhou et al. (2006) use an extended ARIMA approach to produce forecasts and confidence intervals for the electricity price with a satisfactory accuracy. We build on the existing literature by applying an ARMA-GARCH approach to forecast the settlement price.

This research examines and analyses the use of three ARMA-GARCH type models for short-term forecasts for the settlement price: short-term ARMA-GARCH models, models with longer lags and ARMA-GARCH models in which external variables are used as external covariates, resulting in an ARMAX-GARCH process. The Normal and (skewed) Student-t distribution are considered as the innovation distribution to capture potential excess kurtosis and skewness. The interest of the forecasts lies both in the point forecasts and the prediction intervals associated with the point forecasts.

For each forecast we examine the prediction performance of the point forecasts and prediction intervals based on 1 Programme Time Unit (PTU, 1 PTU equals 15 minutes) and 4 PTU forecasts. The method of backtesting is used to compare the point forecasts and the prediction intervals to the realized prices.

The accuracy of the point forecasts is measured by commonly used sample statistics for the forecast error: the root mean squared error, the mean absolute error and the mean absolute percentage error.
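These three statistics are straightforward to compute; as a sketch (the function name and the toy prices are our own, not from the thesis):

```python
import numpy as np

def forecast_error_stats(realized, forecast):
    """Root mean squared error, mean absolute error and mean
    absolute percentage error of a series of point forecasts."""
    realized = np.asarray(realized, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    err = realized - forecast
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    # MAPE is undefined for realized prices of exactly zero, which
    # can matter for a price series that changes sign.
    mape = np.mean(np.abs(err / realized)) * 100.0
    return rmse, mae, mape

# Toy example with made-up prices (EUR/MWh)
rmse, mae, mape = forecast_error_stats([50.0, 40.0, -20.0, 60.0],
                                       [45.0, 42.0, -10.0, 55.0])
```

Note that the MAPE is sensitive to realized prices close to zero, which does occur for a settlement price that changes sign.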

The accuracy of the prediction interval is measured by the coverage of the prediction interval. By this we mean both the nominal coverage and the conditional coverage as in Christoffersen (1998). This means that the accuracy of the prediction interval depends both on the coverage of the interval and on the independence of the violations over time.
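The nominal-coverage part of such a backtest can be sketched as follows (the likelihood-ratio form follows Christoffersen (1998); the function name and toy inputs are our own). A hit indicator is computed per observation and the empirical coverage is compared with the nominal level q; the same hit series could then be used to test independence of the violations via its first-order transition counts, which is not shown here:

```python
import numpy as np
from math import log

def coverage_backtest(realized, lower, upper, q):
    """Empirical coverage of q-level prediction intervals and a
    likelihood-ratio statistic for unconditional coverage."""
    realized = np.asarray(realized, float)
    hit = (realized >= np.asarray(lower, float)) & \
          (realized <= np.asarray(upper, float))
    n = len(hit)
    n1 = int(hit.sum())          # covered observations
    n0 = n - n1                  # violations
    pi_hat = n1 / n              # empirical coverage

    def loglik(p):
        # Bernoulli log-likelihood of the hit series with coverage p
        if p in (0.0, 1.0):
            return 0.0 if (n0 == 0 or n1 == 0) else float("-inf")
        return n1 * log(p) + n0 * log(1.0 - p)

    # LR statistic for H0: true coverage equals q
    lr_uc = -2.0 * (loglik(q) - loglik(pi_hat))
    return pi_hat, lr_uc

pi_hat, lr_uc = coverage_backtest([1.0, 2.0, 3.0, 4.0],
                                  [0.0, 0.0, 0.0, 0.0],
                                  [10.0, 10.0, 10.0, 2.0], q=0.95)
```

Under the null of correct unconditional coverage the statistic is asymptotically chi-squared with one degree of freedom.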

We also examine for which coverage level and forecast horizon, 1 PTU or 4 PTU, the prediction intervals are most accurate.

The main research question examined in this thesis is:

Which of three ARMA-GARCH type models (short-term, models with longer lags and ARMAX-GARCH models) can best be used for accurate short-term point forecasts and prediction intervals for the settlement price of the Dutch electricity imbalance market and for which coverage level and forecast horizon are the prediction intervals most accurate?


First we separate the dataset into weekdays and weekends/holidays, examine which ARMA-GARCH type model can best be used for accurate short-term forecasts for these separated datasets and examine for which coverage level and forecast horizon the prediction intervals are most accurate. Second we examine the forecast performance of the ARMA-GARCH type models for the unseparated dataset, in which we include a dummy variable to take the difference between weekdays and weekends/holidays into account. By comparing the forecasts from the separated datasets with the forecasts from the unseparated dataset we examine whether separating the datasets can lead to a better prediction performance. Finally we use external variables as external covariates and examine whether this improves the accuracy of the forecasts.

The structure of this thesis is as follows. In Section 2 we discuss the Dutch imbalance market and the settlement price. Section 3 gives the theory of the ARMA-GARCH processes. The data is first analysed in Section 4 and the modelling and forecasting procedure is given in Section 5. The results of this procedure for the separated datasets are given in Section 6, the results for the unseparated dataset in Section 7.

2 The Dutch imbalance market and settlement price

2.1 The imbalance market

Imbalance on the grid can damage the grid and therefore there is a constant need for balancing power supply and demand. In Europe, Transmission System Operators (TSO's) are responsible for maintaining this balance. To settle imbalances correctly and to maintain the system's balance, any party that is connected to the grid with its own production or sales has balance responsibility. A party accountable for this balance is called a Balance Responsible Party (BRP) and must be recognized by TenneT as a BRP. Suppliers have balance responsibility, but they may outsource this to another BRP. The role of the BRP is to balance the purchased, produced and sold volumes of electricity of their portfolio (Energiekamer and NMa, 2013).

When there is a discrepancy between the purchased/produced and sold volumes of a BRP, the BRP has an imbalance. A BRP is said to be long if the BRP purchased/produced more power than sold. A BRP is said to be short if the BRP purchased/produced less power than sold. BRP’s are penalized through the settlement price if they deviate from their planned net volumes. At the same time, a BRP is a party that can help to restore the balance on the grid. If they restore the balance they are rewarded through the settlement price.


When an imbalance arises, TenneT deploys flexible production capacity offered by the BRP’s. When the grid is long in power, it requests one or more BRP’s to regulate its production in downward direction: producing less than planned. When the grid is short in power, TenneT requests BRP’s to regulate in upward direction: producing more than planned. This flexible production capacity is divided into categories.

• R2: Regulating and reserve power. TenneT has contracts with several BRP’s to make sure that there is regulating and reserve power available to restore the balance on the grid. At the time of writing, total size of contracted R2 capacity between TenneT and the BRP’s is 300 MW. These BRP’s must have the contracted capacity in reserve at all times, both in upward and downward direction. For this the BRP’s receive a fixed compensation, furthermore an amount is paid depending on the volume activated by the TSO.

• R3: Emergency power. This power is contracted as well, with both BRP's and consumers. Contracted consumers are willing to reduce their demand or start decentralized generation when necessary to maintain the balance on the grid. At the time of writing the size of contracted R3 capacity between TenneT and BRP's equals 350 MW.

• IGCC capacity. As of 1 February 2012, there is a collaboration between Germany and the Netherlands in which imbalances can be compensated real-time between the countries. This collaboration is called the International Grid Control Cooperation (IGCC). A capacity on the grid of 300 MW is reserved for transportation of power between the Netherlands and Germany (TenneT, 2012a).

First, if possible, the TSO’s of the Netherlands and Germany settle imbalances between the countries. The remaining imbalance, which is the majority, is settled by allocating the scarce capacities R2/R3 using the imbalance price.

2.2 The settlement price

The imbalance price determines the settlement price. The determination of the imbalance price is now discussed, where TenneT (2010) and TenneT (2012b) are used.


One day ahead, the R2/R3 bids must be sent to TenneT. These bids may be changed up to 4 PTU before operation. They consist of prices (€/MWh) per PTU and are offered in blocks of 5 MW. The bids are made in two directions: for upward regulation and for downward regulation. As additional production costs more money (fuel costs etc.), the upward regulation bids increase as the volume increases. Hence, the first 5 MW is the cheapest, and as more electricity is offered, the price of these bids increases. The downward regulation bid, also offered in blocks of 5 MW, decreases as volume increases. Suppliers are willing to pay if they have to produce less than planned, as they will save fuel costs when they don't have to produce. However, the more a supplier has to regulate in the downward direction, the scarcer available downward power becomes and the more costly it will be for the power plant (for example because the power plant has to start up again and power plants run less efficiently). The downward regulation bid may become negative.

For each PTU TenneT constructs the bid ladder from these bids. The bid ladder consists of two prices: the upward regulation price and the downward regulation price. In the bid ladder all bids of the Dutch BRP's are ordered, starting with the lowest bid for the upward direction and the highest bid for the downward direction. Figure 1 displays an example of (the first part of) a bid ladder for a PTU. Note that the horizontal dotted line represents the marginal costs for producing one MWh (approximately €50). TenneT publishes the price of the total volume offered and of the 100th, 300th and 600th MW of the bid ladder.

Figure 1: Example of (first part of) bid ladder, with volume (MWh) on the horizontal axis, price (€/MWh) on the vertical axis, and the marginal costs as a horizontal dotted line


The imbalance price for upward regulation is the highest price that is deployed in that PTU; the imbalance price for downward regulation is the lowest price that is deployed in that PTU. The imbalance price holds for every supplier that is requested to regulate in that PTU. Because of this mechanism the reward, and hence the incentive for the BRP's to resolve an imbalance, is higher. At the same time this is an incentive for the BRP's to minimize their imbalance, since the penalty for causing an imbalance is higher as well.

After the operational day, during the settlement stage, the settlement price is determined. This price depends on several components:

• The imbalance price as explained above.

• The regulation state. This is a state that is determined per PTU by TenneT as follows. If TenneT in any PTU regulates:

– Neither downward nor upward: 0. In this case the midprice holds. The midprice is the average of the lowest upward regulating bid and the highest downward regulating bid and is always positive. Prices are approximately between €20 and €70.

– Only in upward direction: +1. The imbalance price for upward regulation holds. Prices are always positive and approximately between €30 and €500.

– Only in downward direction: −1. The imbalance price for downward regulation holds. Prices can be both positive and negative and are approximately between −€200 and +€100.

– Both upward and downward: 2. In this case the imbalance price for upward regulation holds for regulating in downward direction, the price for downward regulation holds for regulating in upward direction. Prices can be both positive and negative and are approximately between −€200 and +€300.

TenneT settles the deployed R2/R3 capacity and the imbalance position of each BRP during the settlement stage. The imbalance position of a BRP can be determined through the Energy Programme (E-Programme) that each BRP has to submit one day ahead, in which expected electricity supply to the grid and expected consumption from the grid of their portfolio per PTU is given. When an imbalance on the grid arises and is resolved, there are several parties involved: the parties causing the imbalance and the parties helping to restore the balance. The BRP that helps to resolve the imbalance gains money (except when the regulation state is 2, which is explained shortly), the BRP that causes the imbalance loses money. To illustrate this we assume that there is one party causing the imbalance and one party resolving the imbalance. This can easily be extended to the general case in which there are more parties causing and resolving the imbalance.


Table 1 gives the amount that the BRP causing the imbalance loses and that the BRP resolving the imbalance gains, where

m = |marginal costs − imbalance price|. (1)

                     Position of a BRP
Price        Causes the imbalance    Resolves the imbalance
Positive             −m                      +m
Negative¹            −M                      +M

¹ Only occurs when the regulation state is −1.

Table 1: The position of the BRP and the gain/loss

In this table M is also defined by (1), with 0 < m < M. Note that in case of regulating in the upward direction, the imbalance price will always be above the marginal cost of one MWh. Otherwise the costs of producing one extra MWh would be higher than the reward for this production. Reasoning similarly, in case of regulating in the downward direction, the imbalance price will always be below the marginal cost of one MWh. Otherwise the supplier helping the grid would have more costs from reducing capacity by one MWh than it would earn. The effect of the settlement price on the gain or loss of the BRP's is next illustrated by two examples: in the first example the regulation state is 1, in the second example the regulation state is 2.

Example 2.1. Consider a PTU where the Netherlands is short in power and there is only regulation in the upward direction to restore the balance in the grid: the regulation state is 1. Assuming an imbalance volume of 1 MW, if the highest deployed bid for the upward direction is €70, then the party causing the imbalance must pay TenneT €70. Since this party saves €50, the marginal costs, from producing one MWh less than planned, this means that the loss of this BRP is €20. On the other hand, the party that helps to restore the balance receives €70 from TenneT but had €50 costs from producing one MWh more than planned, meaning that this BRP gains €20.

Note that in this example the amount that the BRP that causes the imbalance loses is exactly the amount that the BRP that resolves the imbalance gains.
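The arithmetic of Example 2.1 can be made explicit; a small sketch (the function name and the sign convention are our own):

```python
def settlement_result(imbalance_price, marginal_costs, causes_imbalance):
    """Gain (+) or loss (-) in EUR for a BRP with a 1 MWh position,
    under upward regulation (regulation state 1), as in Example 2.1."""
    # m = |marginal costs - imbalance price| as in equation (1)
    m = abs(marginal_costs - imbalance_price)
    return -m if causes_imbalance else +m

loss = settlement_result(70.0, 50.0, causes_imbalance=True)   # causing party
gain = settlement_result(70.0, 50.0, causes_imbalance=False)  # resolving party
```

With the numbers of the example the loss of the causing BRP (−€20) exactly offsets the gain of the resolving BRP (+€20).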


Note that in this case both BRP’s lose money: both the BRP that causes the imbalance and the BRP that resolves the imbalance. By these examples we have illustrated the effect of the imbalance price and the regulation state, which determine the settlement price, on the gain or loss of the BRP’s.

The settlement price and time series

As mentioned, knowledge on future behaviour of the settlement price can be informative. Note that through the settlement price the BRP causing an imbalance is penalized and the BRP resolving the imbalance can earn money. Therefore when future uncertainty or prices are known for the short-term, the BRP can better estimate the short-term risk.

The focus of this research is on short-term predictions of the settlement price using time-series models, specifically ARMA-GARCH models. With this approach, historical prices at PTU t are used to forecast the settlement price. Hence imbalance volume is left out of scope and it is implicitly assumed that for a given price any imbalance volume could hold.

An underlying reason for choosing time-series models is the many complicating factors involved. The settlement price is determined to a large extent by the imbalance price, which in turn depends on both the volume of R2/R3 necessary in a PTU and the corresponding bid ladder in that PTU. Not only the imbalance volume but also the bid ladder depends on many factors. These complicating factors are assumed to be captured in the historical prices.

We are not only interested in point forecasts, but also in the prediction intervals associated with these point forecasts. Since volatility clustering in the price is found, the error terms of the ARMA process are modelled by a GARCH process. First we only use the historical price to compute forecasts, later we add external variables as external covariates.

3 Theory on ARMA-GARCH processes

This Section introduces theory on ARMA-GARCH processes. First some basic definitions are introduced. Then the ARMA, GARCH and ARMA-GARCH processes are defined after which prediction in ARMA-GARCH processes is discussed. For this, McNeil et al. (2005) and Hamilton (1994) are used.


The time series (Pt)t∈Z is strictly stationary if

(Pt1, . . . , Ptn) =d (Pt1+k, . . . , Ptn+k), ∀t1, . . . , tn, k ∈ Z, n ∈ N, (2)

where =d denotes equality in distribution. The time series (Pt)t∈Z is covariance stationary (or weakly stationary) if the first two moments, the mean function µ(t) and the autocovariance function γ(t, s), do not depend on time t:

µ(t) = E(Pt) = µ, t ∈ Z, (3)

γ(t, s) = E((Pt − µ(t))(Ps − µ(s))) = γ(t + k, s + k), t, s, k ∈ Z. (4)

This definition implies that the covariance between Pt and Ps only depends on the distance |s − t|. Thus the autocovariance function can be written as a function of one variable, γ(h) := γ(h, 0). Note that γ(0) = Var(Pt).

The autocorrelation function (ACF) ρ(h) of a covariance stationary process (Pt)t∈Z is defined by

ρ(h) = γ(h)/γ(0), ∀h ∈ Z. (5)

An ARMA process is a covariance stationary process whose basic building block is a white noise process. A stochastic process (Pt)t∈Z is a white noise process if it is covariance stationary with ACF

ρ(h) = 1 if h = 0, ρ(h) = 0 if h ≠ 0. (6)

A process (Pt)t∈Z is a strict white noise process if it is a series of independent and identically distributed (i.i.d.) finite-variance random variables.
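The ACF in (5) can be estimated from a sample; a minimal sketch (the estimator choice, with the biased 1/n covariance estimator, and all names are our own):

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation rho(h) = gamma(h)/gamma(0)
    for h = 0, ..., max_lag, using the biased 1/n estimator."""
    x = np.asarray(x, float)
    x = x - x.mean()
    n = len(x)
    gamma0 = np.dot(x, x) / n            # sample variance, gamma(0)
    acf = [1.0]                          # rho(0) = 1 by definition
    for h in range(1, max_lag + 1):
        gamma_h = np.dot(x[:-h], x[h:]) / n
        acf.append(gamma_h / gamma0)
    return np.array(acf)

# Toy deterministic alternating series: strong negative lag-1 correlation
acf = sample_acf([1.0, -1.0] * 50, max_lag=2)
```

For a white noise process all estimated autocorrelations at h ≠ 0 should be close to zero.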

3.1 ARMA processes

The family of Autoregressive Moving Average (ARMA) models is widely used in time series analysis. A process (Pt)t∈Z with mean µ is an ARMA(p,q) process if it is a covariance stationary process that satisfies

Pt − µ = Σ_{i=1}^p φi (Pt−i − µ) + Σ_{j=1}^q θj εt−j + εt, ∀t ∈ Z, (7)

where εt is a white noise process with mean 0 and variance σ². Denote the information available at time t by It. The conditional mean and variance of Pt are given by

E(Pt | It−1) = µ + Σ_{i=1}^p φi (Pt−i − µ) + Σ_{j=1}^q θj εt−j, ∀t ∈ Z, (8)

Var(Pt | It−1) = σ², ∀t ∈ Z. (9)

An extension of the ARMA(p,q) process is the ARMA process with exogenous variables, the ARMAX(p,q) process. In an ARMAX(p,q) process with k ≥ 1 exogenous variables the constant mean µ is replaced by a mean λt that depends on the exogenous variables:

λt = µ + Σ_{i=1}^k δi xi,t, ∀t ∈ Z, (10)

where xi,t is the value of the i-th external regressor at time t and δi is a coefficient that has to be estimated for the external covariates (i = 1, . . . , k), just like the coefficients φi and θj (i = 1, . . . , p, j = 1, . . . , q). An ARMAX(p,q) process with k ≥ 1 exogenous variables is therefore defined by

Pt − λt = Σ_{i=1}^p φi (Pt−i − λt−i) + Σ_{j=1}^q θj εt−j + εt, ∀t ∈ Z. (11)
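The exogenous mean (10) is a simple linear shift of the constant ARMA mean; as a sketch, with a single hypothetical weekend/holiday dummy (k = 1) of the kind used later for the unseparated dataset (names and parameter values are our own):

```python
import numpy as np

def armax_mean(mu, delta, x):
    """lambda_t = mu + sum_i delta_i * x_{i,t} for each time t.
    x has shape (T, k): one column per external covariate."""
    x = np.asarray(x, float)
    return mu + x @ np.asarray(delta, float)

# mu = 50; one dummy covariate: 1 on weekends/holidays, 0 on weekdays,
# with an illustrative coefficient delta_1 = -10
lam = armax_mean(50.0, [-10.0], [[0.0], [1.0], [1.0], [0.0]])
```

Here the mean drops by 10 on weekend/holiday PTU's and equals µ otherwise.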

Following the approach of McNeil et al. (2005) we use Maximum Likelihood Estimation (MLE) to estimate the coefficients for the processes.

Using the ARMA(p,q) process we can model the conditional mean of Pt given information from the past. Furthermore, prediction intervals can be constructed using the theoretical distribution of the error terms (εt)t∈Z. However, the conditional variance of Pt does not change over time. It might be that if the settlement price is very volatile now, the price will be volatile in the near future as well. The presence of volatility clustering in the residuals can be tested by the Lagrange Multiplier (LM) test proposed by Engle (1982). The null hypothesis of no volatility clustering is tested by regressing the T squared residuals of the ARMA(p,q) process on a constant and h ≥ 1 lagged values, where T is the number of observations in the regression. The test statistic is given by

LM = T R², (12)

where R² is the coefficient of determination of this auxiliary regression; under the null hypothesis LM is asymptotically χ² distributed with h degrees of freedom.
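The LM test only needs an auxiliary least-squares regression; a sketch (all names are our own, and in practice the statistic would be compared with a χ²(h) critical value):

```python
import numpy as np

def engle_lm_test(residuals, h):
    """Engle's LM test for ARCH effects: regress squared residuals
    on a constant and h lagged squared residuals; LM = T * R^2."""
    e2 = np.asarray(residuals, float) ** 2
    y = e2[h:]
    # Design matrix: constant plus h lags of the squared residuals
    X = np.column_stack([np.ones(len(y))] +
                        [e2[h - i:-i] for i in range(1, h + 1)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return len(y) * r2

# For i.i.d. residuals the statistic should be small (chi-squared, 4 df)
rng = np.random.default_rng(0)
lm_stat = engle_lm_test(rng.standard_normal(2000), h=4)
```

A large statistic relative to the χ²(h) critical value indicates volatility clustering in the residuals.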


3.2 GARCH processes

Engle (1982) introduced Autoregressive Conditional Heteroskedasticity (ARCH) models, which were extended by Bollerslev (1986) to Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models. For some data, for example financial data, volatility is not constant but time-varying. (G)ARCH processes are used to model the conditional variance given information from the past.

The strictly stationary process (Xt)t∈Z is a GARCH(m,n) process if it satisfies, ∀t ∈ Z,

Xt = σt Zt, (13)

σt² = ω + Σ_{i=1}^m αi Xt−i² + Σ_{j=1}^n βj σt−j², (14)

where (Zt)t∈Z is a strict white noise process with mean 0 and variance 1. Furthermore, ω > 0, αi ≥ 0, βj ≥ 0, i = 1, . . . , m, j = 1, . . . , n and

Σ_{i=1}^m αi + Σ_{j=1}^n βj < 1. (15)

This last condition is necessary for the process to be covariance stationary, as shown by Bollerslev (1986). Note that the conditional mean and variance of Xt are given by

E(Xt | It−1) = 0, ∀t ∈ Z, (16)

Var(Xt | It−1) = σt², ∀t ∈ Z. (17)

The distribution for the innovations Zt can be any distribution with mean zero and variance 1. It can be shown that the kurtosis of Xt, if it exists, is greater than the kurtosis of Zt when the kurtosis of the innovations, E(Zt⁴), is larger than 1, as is the case for the Normal and (skewed) Student-t distributions. Therefore the distribution of a GARCH process has fatter tails than the innovation distribution and the unconditional distribution of Xt is different from the innovation distribution.
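The recursion (13)-(14) is easy to simulate; a sketch of a GARCH(1,1) path with Normal innovations (the function name and parameter values are illustrative only, not from the thesis):

```python
import numpy as np

def simulate_garch11(omega, alpha, beta, T, seed=0):
    """Simulate X_t = sigma_t * Z_t with
    sigma_t^2 = omega + alpha * X_{t-1}^2 + beta * sigma_{t-1}^2,
    started from the unconditional variance omega / (1 - alpha - beta)."""
    assert omega > 0 and alpha >= 0 and beta >= 0 and alpha + beta < 1
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(T)           # standard Normal innovations
    sigma2 = np.empty(T)
    x = np.empty(T)
    sigma2[0] = omega / (1.0 - alpha - beta)
    x[0] = np.sqrt(sigma2[0]) * z[0]
    for t in range(1, T):
        sigma2[t] = omega + alpha * x[t - 1] ** 2 + beta * sigma2[t - 1]
        x[t] = np.sqrt(sigma2[t]) * z[t]
    return x, sigma2

x, sigma2 = simulate_garch11(omega=0.1, alpha=0.1, beta=0.85, T=5000)
```

A plot of such a path shows the volatility clustering that motivates the GARCH specification: large shocks raise the conditional variance of subsequent observations.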


The densities of these three distributions (the Normal, Student-t and skewed Student-t) are given in Appendix A.

3.3 ARMA-GARCH processes

The two models are combined by modelling the error terms of the ARMA(p,q) process by a GARCH(m,n) process. The covariance stationary process (Pt)t∈Z is said to be an ARMA(p,q) process with GARCH(m,n) errors if it satisfies

Pt − µ = Σ_{i=1}^p φi (Pt−i − µ) + Σ_{j=1}^q θj εt−j + εt, ∀t ∈ Z, (18)

where the error terms are modelled, ∀t ∈ Z, by the GARCH(m,n) process

εt = σt Zt, (19)

σt² = ω + Σ_{i=1}^m αi εt−i² + Σ_{j=1}^n βj σt−j², (20)

where (Zt)t∈Z is strict white noise with mean 0 and unit variance, ω > 0, αi ≥ 0, βj ≥ 0, i = 1, . . . , m, j = 1, . . . , n and Σ_{i=1}^m αi + Σ_{j=1}^n βj < 1. The conditional mean and variance of Pt are given by

E(Pt | It−1) = µ + Σ_{i=1}^p φi (Pt−i − µ) + Σ_{j=1}^q θj εt−j, ∀t ∈ Z, (21)

Var(Pt | It−1) = σt², ∀t ∈ Z. (22)

Note that both the conditional mean and the conditional variance of Pt are time-varying.

The ARMA-GARCH process can be extended to an ARMAX-GARCH process. Similarly to the extension of the ARMA process to an ARMAX process, it is assumed that the mean of the process (Pt)t∈Z depends on external covariates. An ARMAX(p,q)-GARCH(m,n) process is therefore defined by an ARMAX(p,q) process as in (11), with mean (10), where the error terms are modelled as in (19) and (20).

3.4 Prediction in ARMA-GARCH processes

Forecasts for an ARMA-GARCH process consist of two parts: the forecast for the conditional mean and the forecast for the conditional variance.


The s-step forecast of the price, conditional on the information available at time t, is given by the conditional expectation, i.e.:

P̂t+s = E(Pt+s | It). (23)

Suppose the parameters of an ARMA(p,q)-GARCH(m,n) process have been estimated by MLE. Assume that we have an infinite number of observations of the process up to time t. The 1-step prediction for Pt+1, P̂t+1, is given by

P̂t+1 = µ̂ + φ̂1 (Pt − µ̂) + φ̂2 (Pt−1 − µ̂) + . . . + φ̂p (Pt+1−p − µ̂) + θ̂1 εt + θ̂2 εt−1 + . . . + θ̂q εt+1−q. (24)

In practice not all historical values of the process are known. The predictor can therefore not be evaluated exactly, but an accurate approximation can be obtained if the sample size T is reasonably large by substituting the model residual

ε̂t = Pt − E(Pt | It−1) (25)
   = Pt − (µ̂ + Σ_{i=1}^p φ̂i (Pt−i − µ̂) + Σ_{j=1}^q θ̂j ε̂t−j) (26)

for εt, where the presample ε̂'s are set equal to zero. Then (26) is used for t = 1, . . . , T.

By recursive substitution, for an s-step prediction based on It, we get

P̂t+s = µ̂ + φ̂1 (P̂t+s−1 − µ̂) + . . . + φ̂p (P̂t+s−p − µ̂) + θ̂s ε̂t + . . . + θ̂q ε̂t+s−q,  for 1 ≤ s ≤ q;
P̂t+s = µ̂ + φ̂1 (P̂t+s−1 − µ̂) + . . . + φ̂p (P̂t+s−p − µ̂),  for s > q, (27)

where P̂τ = Pτ if τ ≤ t. By a similar approach the prediction for the conditional variance (22) is given by

σ̂t+s² = E(σt+s² | It). (28)

Recursive substitution yields the s-step predictor

σ̂t+s² = ω̂ + Σ_{i=1}^{s−1} (α̂i + β̂i) σ̂t+s−i² + Σ_{j=s}^{max(m,n)} [α̂j εt+s−j² + β̂j σt+s−j²], (29)

with the convention that α̂i = 0 for i > m and β̂j = 0 for j > n.


As with the conditional mean, this predictor cannot be evaluated exactly because not all historical values of the process are known. By substituting (26) for εt and

σ̂t² = ω̂ + Σ_{i=1}^m α̂i ε̂t−i² + Σ_{j=1}^n β̂j σ̂t−j², (30)

for σt², where the presample ε̂'s are set equal to zero and the presample σ̂'s are set equal to zero or to the sample variance, the prediction formula (29) can be approximated.

The predictions of an ARMA-GARCH process can be extended to an ARMAX-GARCH process, where the mean depends on exogenous variables as in (10), by substituting λ̂t for µ̂ in equations (24) and (27).

Prediction intervals can be calculated based on (23) and (28) and the theoretical distribution of Zt. A q-level prediction interval, q ∈ [0, 1], for the price at time t + s, conditional on It, is given by

[ P̂t+s + σ̂t+s · z(1−q)/2 , P̂t+s + σ̂t+s · z(1+q)/2 ], (31)

where za denotes the a-quantile of the theoretical distribution of Zt with zero mean and unit variance. A q-level prediction interval is an estimated interval in which the future price is expected to lie with probability q.

4 Data

The settlement price can consist of two prices: the price for upward regulation and the price for downward regulation. In most situations, when the regulation state is 0, 1 or −1, these prices are equal to each other, as the loss of the causing party is exactly the amount that the helping party gains; recall Table 1 and the examples in Section 2.2. When the regulation state is 2 the two prices differ. Under the assumption that over time, on average, the 'side' on which the helping party is balancing the grid (short or long) occurs equally often, we consider the average of the upward and downward price. Note that there is no upfront knowledge on the final regulation state that TenneT decides on.

Using this construction, the settlement price in PTU t, pt, is given by

pt = Upward regulation price,                 if rt = 0, 1 in PTU t;
pt = Downward regulation price,               if rt = −1 in PTU t;
pt = ½ (Upward + Downward regulation price),  if rt = 2 in PTU t,

where rt denotes the regulation state in PTU t.

The prices and regulation states per PTU are available at the TenneT website (TenneT, 2013).
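The price construction above is a simple mapping; a sketch (function and variable names are our own):

```python
def settlement_price(regulation_state, up_price, down_price):
    """Settlement price p_t for a PTU given the regulation state r_t
    and the upward/downward regulation prices, as defined above."""
    if regulation_state in (0, 1):
        return up_price
    if regulation_state == -1:
        return down_price
    if regulation_state == 2:
        # Average of the two prices when there was regulation
        # in both directions
        return 0.5 * (up_price + down_price)
    raise ValueError("unknown regulation state")

p = settlement_price(2, up_price=80.0, down_price=-40.0)
```

For regulation states 0, 1 and −1 the two prices coincide, so the mapping simply selects the applicable one.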

4.1 Data description

Prices over the period 1 March 2012 - 28 February 2013 are considered. The reason for this time frame is twofold. First, starting 1 February 2012, the IGCC contribution was implemented (TenneT, 2012a), which caused a change on the imbalance market in the Netherlands. Second, February 2012 is not considered because of the combination of an imbalance market that was adjusting to this new situation and very low temperatures. Therefore this month might not be representative for the situation as it is now. One day has 96 PTU's, except the two days with a shift between standard time and daylight saving time. The data consists of T = 35,040 observations.

The first thing to notice from the data plotted over time, see Figure 2, is that the settlement price is very volatile. Second, the price −€200 occurs many times. The regulation state was −1 at these moments and hence the Netherlands was long in power. The price occurs most frequently at PTU 25 and 29, which are both hour transitions: 6:00-6:15 and 7:00-7:15.

An explanation for these extreme prices may lie in the fact that electricity can only be bought in blocks of one hour. During a morning hour a BRP knows that demand for electricity will increase over time. However, because electricity on the market can only be bought in blocks of one hour, the BRP buys electricity such that it meets average demand for that hour. Because of the increasing pattern in demand, the volume of electricity bought is therefore higher than true demand during the first part of the hour and lower than true demand in the last part of the hour. This can explain why the Netherlands is often long on these hour transitions: volume bought by BRP's exceeds true demand during these moments.

Furthermore, the offered R2/R3 capacity at these morning hours is probably low. First, because the power plants of the BRP's have to start, little flexible capacity can be offered. Second, the BRP's need the flexible capacity that they have for their own portfolios as well, because there are many uncertainties during these morning hours (start deviations of the power plants, a demand pattern that increases differently than forecasted, etc.). A BRP wants to minimize imbalance costs and therefore needs flexible capacity to balance its portfolio. The long situation of the Netherlands in combination with the little offered downward power might explain the high prices during these hour transitions.


Figure 2: The settlement price (€/MWh) plotted from 1 March 2012 - 28 February 2013

The histogram in Figure 3 shows that the empirical distribution has a higher peak and fatter tails than the Normal distribution. The null-hypothesis of Normality from the Jarque-Bera test is, not surprisingly, rejected (p-value < 2.2e−16). Table 2 also gives a description of the prices during weekdays and during weekends/holidays separately. Note the larger mean and standard deviation for the weekdays compared to the weekends/holidays.

                     All        Weekdays   Weekends
Mean                 50.60      53.89      43.06
Median               43.01      45.29      37.90
Standard deviation   63.61      69.12      47.89
Minimum             −449.60    −449.60    −298.60
Maximum              500.70     500.60     500.70
1st quartile         30.55      32.69      26.00
3rd quartile         58.55      60.55      53.41
Kurtosis             16.58      14.82      21.26
Skewness              1.58       1.35       2.44
T                    35,040     24,384     10,656

Table 2: Data description of the settlement price (€/MWh)

In Figure 4 the mean and standard deviation of the price for each day are plotted. The outlier of approximately −€50/MWh occurred on 31 December 2012. A possible explanation of this low price is that, because of low demand, only few power plants were running, and they were running at their minimum level due to technical constraints. Therefore little R2/R3 capacity is offered to TenneT for the downward direction and extreme prices arise if there is a need for balancing in that direction.


Figure 3: Histogram of prices with Normal and Normal mixture density curve

The weekday and weekend/holiday prices differ mainly during the morning hours (approximately between 5:00 and 9:00). This can be partly explained by a different allocation of running and starting power plants. When a power plant has to start and change production quickly, load forecast errors can arise due to technical difficulties. These load forecast errors can cause an imbalance. During morning hours, power plants have to start because demand for electricity during the day is higher than during the night. Since demand for electricity is higher on weekdays than on weekends/holidays, more power plants must start and increase production quickly on weekdays than on weekends/holidays. This can therefore cause more imbalance at morning hours during the week compared to a weekend or holiday. In Section 4.3 we formally test whether the prices during weekdays and weekends differ. Now that the data has been briefly described, we test for covariance stationarity of the data.

4.2 Stationarity

From the time series plot of the data we do not immediately see evidence against stationarity. To make inference about the serial dependence structure of the data we use the ACF as in (5). The sample ACF is calculated by

$$\hat\rho(h) = \hat\gamma(h)/\hat\gamma(0), \qquad 0 \le h < T, \qquad (33)$$


Figure 4: Mean settlement price (a) and standard deviation (b) for each day in the sample

Figure 5: Mean settlement price (a) and standard deviation (b) for each PTU, for weekdays and weekends


where $\bar P = \sum_{t=1}^{T} P_t / T$.

The sample ACF indicates a dependence on prices 24 hours (96 PTU) earlier. The ACF does not decay very quickly, which can indicate non-stationarity or stationarity with long-memory dependence.
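The sample ACF in (33) is straightforward to compute directly. A minimal sketch, applied to a synthetic seasonal AR series (a stand-in for the price data, not the thesis data) that exhibits the same 96-PTU daily dependence:

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample ACF: rho_hat(h) = gamma_hat(h) / gamma_hat(0), as in (33)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    n = len(x)
    gamma0 = d @ d / n
    return np.array([(d[h:] @ d[:n - h]) / n / gamma0 for h in range(max_lag + 1)])

# Synthetic stand-in: a seasonal AR process with a 96-lag (daily) dependence
rng = np.random.default_rng(0)
x = np.zeros(2000)
for t in range(96, 2000):
    x[t] = 0.8 * x[t - 96] + rng.normal()
rho = sample_acf(x, 100)
print(rho[96])  # pronounced spike at the daily lag
```

The spike at lag 96 mirrors the pattern described above for the settlement price.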

To formally test for stationarity we use the ADF test (Dickey and Fuller, 1979) and the KPSS test (Kwiatkowski et al., 1992). The lag length as suggested by Schwert (1989), in our case p = 51, yields ADF = −19.1453 and KPSS = 1.5921. The 1% asymptotic critical value for the ADF test is −3.43 and the null-hypothesis of a unit root is rejected for small values; hence the null-hypothesis of the ADF test is strongly rejected. For the KPSS test the 1% critical value is 0.739 and hence the null-hypothesis of stationarity is rejected as well. Because the two tests contradict each other, this is examined in further detail.

It is a stylized fact that the KPSS test has severe size distortions in finite samples, for example shown by Caner and Kilian (2001). Furthermore, Amsler et al. (2009) show that the KPSS test has substantial size distortions (overrejections). They suggest that this problem could be solved by using a larger number of lags than traditionally used. Using their lag order, p = 216, we find ADF = −9.5331 and KPSS = 0.7579. Hence, the null-hypothesis of the ADF test is still strongly rejected whereas the KPSS test statistic is now close to the 1% critical value. Müller (2005) states that if the sampling frequency of stationary data is very high and the ratio of the chosen lag length to the sample size is small, then the null-hypothesis will be rejected with probability one as the sample size goes to infinity. As our sample is very large with a high sampling frequency, this could explain the (potentially false) rejection by the KPSS test.

To get an impression of the behaviour of the ADF and KPSS tests in large, highly correlated samples, we simulate 1,000 stationary processes that have a high autocorrelation and sample size 35,000. We consider an ARMA(98,4)-GARCH(98,4) process with a Normal distribution. Using the two lag lengths p = 51 and p = 216, the null-hypothesis of the ADF test is always rejected in favour of stationarity. The null-hypothesis of the KPSS test is rejected in 21% of the cases when using p = 51 and in 7% of the cases when using p = 216. We use the 1% critical value for rejection and hence the KPSS test is overrejecting for these simulated samples. This could indicate that the null-hypothesis of the KPSS test is falsely rejected in our case.
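The simulation design can be sketched as follows. For brevity the sketch uses an AR(1)-GARCH(1,1) process rather than the ARMA(98,4)-GARCH(98,4) of the actual study, and it only generates the path; in practice one would then apply the ADF and KPSS tests (for instance `adfuller` and `kpss` from statsmodels) with p = 51 and p = 216 lags:

```python
import numpy as np

def simulate_ar_garch(n, phi=0.95, omega=0.1, alpha=0.1, beta=0.85, seed=0):
    """Simulate a stationary AR(1)-GARCH(1,1) path with Normal innovations:
    a scaled-down stand-in for the ARMA(98,4)-GARCH(98,4) processes."""
    assert alpha + beta < 1 and abs(phi) < 1        # second-order stationarity
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    sigma2 = omega / (1 - alpha - beta)             # start at unconditional variance
    eps = 0.0
    for t in range(1, n):
        sigma2 = omega + alpha * eps**2 + beta * sigma2
        eps = np.sqrt(sigma2) * rng.normal()
        x[t] = phi * x[t - 1] + eps
    return x

x = simulate_ar_garch(35_000)
lag1_corr = float(np.corrcoef(x[:-1], x[1:])[0, 1])   # highly autocorrelated
print(lag1_corr)
```

The parameter values here are illustrative; the point is that the path is covariance stationary yet so persistent that unit-root and stationarity tests can disagree in finite samples.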

The two tests for the weekdays and weekends/holidays separately confirm each other in favour of stationarity for the weekdays, but for the weekends/holidays there is a contradiction in the conclusion of both tests.


This, together with the fact that for the weekdays (approximately 70% of the entire dataset) both tests conclude stationarity, leads us to continue the analysis with the original prices.

4.3 Distribution

We fit a mixture of two Normal density functions to describe the data, so that both the high peak and the fat tails of the distribution can be captured. The density function of each observation is

$$f(p_t \mid \theta) = \lambda\,\frac{1}{\sigma_1}\,\phi\!\left(\frac{p_t - \mu_1}{\sigma_1}\right) + (1-\lambda)\,\frac{1}{\sigma_2}\,\phi\!\left(\frac{p_t - \mu_2}{\sigma_2}\right), \qquad (35)$$

where $\phi(\cdot)$ is the standard Normal density function and $\theta = (\mu_1, \sigma_1, \mu_2, \sigma_2, \lambda)$ is a 5-vector of parameters (Koning, n.d.). The estimated parameters are given in the left column of Table 3, where T is the number of observations in the sample. In Figure 3 the grey line is the fitted density of the Normal mixture distribution, which fits much better than the Normal distribution.
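A mixture like (35) is typically estimated by maximum likelihood, e.g. via the EM algorithm. A minimal EM sketch follows; the synthetic data only loosely mimics the parameters in Table 3 and is not the thesis data, and the initialization and iteration count are illustrative choices:

```python
import numpy as np

def em_normal_mixture(p, n_iter=300):
    """Fit the two-component Normal mixture (35) by EM.
    Returns (mu1, s1, mu2, s2, lam)."""
    mu1, mu2 = np.quantile(p, [0.25, 0.75])        # crude starting values
    s1 = s2 = p.std()
    lam = 0.5
    norm = lambda x, m, s: np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    for _ in range(n_iter):
        w = lam * norm(p, mu1, s1)
        w = w / (w + (1 - lam) * norm(p, mu2, s2))   # E-step: responsibilities
        lam = w.mean()                               # M-step: weighted updates
        mu1, mu2 = np.average(p, weights=w), np.average(p, weights=1 - w)
        s1 = np.sqrt(np.average((p - mu1) ** 2, weights=w))
        s2 = np.sqrt(np.average((p - mu2) ** 2, weights=1 - w))
    return mu1, s1, mu2, s2, lam

# Synthetic prices with a sharp peak and fat tails (loosely mimicking Table 3)
rng = np.random.default_rng(1)
p = np.where(rng.random(20_000) < 0.75,
             rng.normal(42, 15, 20_000), rng.normal(69, 107, 20_000))
mu1, s1, mu2, s2, lam = em_normal_mixture(p)
```

The narrow first component captures the peak and the wide second component the fat tails, which is exactly the role division of the two components in Table 3.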

Figure 5 shows that the prices during weekdays and weekends/holidays behave slightly differently. To test whether the prices are significantly different we perform a likelihood-ratio test, where the value of the restricted log-likelihood is given in the left column of Table 3 and the unrestricted log-likelihood is the sum of the log-likelihoods for the weekdays and the weekends/holidays. Note the difference in the estimated parameters in Table 3 between the weekdays and the weekends/holidays.

                  All         Weekdays    Weekends
$\hat\mu_1$       41.65       43.88       37.75
                  (0.11)      (0.14)      (0.22)
$\hat\sigma_1$    15.51       16.07       16.90
                  (0.11)      (0.14)      (0.23)
$\hat\mu_2$       68.77       84.08       71.28
                  (1.14)      (1.76)      (2.29)
$\hat\sigma_2$   106.87      132.11      100.29
                  (0.76)      (1.36)      (2.01)
$\hat\lambda$      0.74        0.76        0.78
                  (0.00)      (0.00)      (0.01)
T                 35,040      24,384      10,656
Log-likelihood   176,067.2   123,197.5   52,220.1

Table 3: Fitted parameters for a two-point Normal mixture distribution (standard errors in parentheses)


Let $\theta_1$ denote the parameters of the density for the weekdays and $\theta_2$ those for the weekends. Let $\theta_r$ be the true parameters of the density for the data containing both weekdays and weekends. We want to test

$$H_0: (I, -I)\begin{pmatrix}\theta_1\\ \theta_2\end{pmatrix} = 0 \qquad \text{against} \qquad H_1: (I, -I)\begin{pmatrix}\theta_1\\ \theta_2\end{pmatrix} \neq 0,$$

where $I$ is the $5 \times 5$ identity matrix. Denote the log-likelihood by $\ell(\cdot)$. The LR test statistic is given by

$$\Lambda = -2\left[\left(\ell(\hat\theta_1) + \ell(\hat\theta_2)\right) - \ell(\hat\theta_r)\right]. \qquad (36)$$

Under the null the test statistic approximately follows a $\chi^2_5$ distribution because of the 5 restricted parameters in the restricted model, hence the 1% critical value is 15.09. We find $\Lambda = 1299.33$ and therefore reject the null-hypothesis of equal parameters. This difference between weekdays and weekends/holidays can be used as information to forecast the settlement price.

5 Procedure to model and forecast the settlement price

The theory on ARMA-GARCH processes and basic characteristics of the data have been discussed. We now discuss the procedure that is used to fit the models and to compute the point forecasts and prediction intervals.

5.1 Modelling procedure

Within the class of time-series models we use three ARMA-GARCH type models to compute short-term point forecasts and prediction intervals for the settlement price: short-term ARMA-GARCH models, ARMA-GARCH models with longer lags, and ARMAX-GARCH models.

We first separate the dataset into a dataset containing weekdays and a dataset containing weekends/holidays and fit a short-term ARMA-GARCH model to each dataset. These short-term models are then extended with longer lags.


Last, we fit an ARMAX-GARCH model to the unseparated dataset using several external covariates and examine whether the forecasting performance can be improved by using these external covariates.

The data is split into a training set and a test set for each of the three datasets (unseparated dataset, only weekdays and only weekends/holidays). Nine months, March 2012 - November 2012, are used for the training set. The remaining three months, December 2012 - February 2013, are used to examine the accuracy of the point forecasts and prediction intervals. For every dataset we use the following procedure, which is to a large extent based on the approach described in McNeil et al. (2005).

We first test for serial dependence in the data by the Ljung-Box test (Ljung and Box, 1978). The null-hypothesis of randomness is tested by the test statistic

$$Q_{LB}(h) = T(T+2)\sum_{j=1}^{h}\frac{\hat\rho_j^2}{T-j}, \qquad (37)$$

where $T$ is the number of observations in the sample and $\hat\rho_j$ is the sample ACF as in (33). Under the null-hypothesis that a sample exhibits no autocorrelation for $h \ge 1$ lags, the test statistic has an asymptotic $\chi^2_h$ distribution. We use the 1% significance level.

After concluding that there is serial dependence in the data, we fit the ARMA-GARCH model. We first select the orders of the ARMA(p,q) (p, q = 1, ..., 5) process for the conditional mean by the Bayesian information criterion (BIC) (Schwarz, 1978) and the Akaike information criterion (AIC) (Akaike, 1974). If these criteria select different orders, we first continue with both models. From the selected ARMA(p,q) processes we obtain the residuals and test for the presence of ARCH effects by the LM test as in (12), using the 1% significance level. The ARCH effects that are present in the residuals are modelled by a GARCH(m,n) (m, n = 1, ..., 5) process, where we consider three zero-mean, unit-variance distributions for Zt: the Normal, Student-t and skewed Student-t distribution. The orders of the GARCH(m,n) process are selected by the BIC and AIC as well. Based on the sample ACF, the short-term ARMA-GARCH models are extended by longer lags in the ARMA and GARCH parts. Based on the short-term and extended models the orders of the ARMA-GARCH model are selected. If the AIC and BIC do not select the same model, we use the model selected by the BIC. This is because our main goal is to produce forecasts and the BIC selects more parsimonious models than the AIC (Koehler and Murphree, 1988).

The ARMA-GARCH model fitted to the unseparated dataset is extended to an ARMAX-GARCH model by including external covariates.


The standardized residuals are used for model diagnostics and comparison. They are given by

$$\hat Z_t = \frac{\hat\epsilon_t}{\hat\sigma_t}, \qquad (38)$$

where $\hat\sigma_t^2 = \hat\omega + \sum_{i=1}^{m}\hat\alpha_i\,\hat\epsilon_{t-i}^2 + \sum_{j=1}^{n}\hat\beta_j\,\hat\sigma_{t-j}^2$ and $\hat\epsilon_t = p_t - \hat p_t$, with $\hat p_t$ the fitted value for the price at PTU $t$. QQ-plots are used to examine the fit of the distributions. We test whether the standardized residuals are strict white noise by inspecting their sample ACF, which is formally tested by the Ljung-Box test. The sample ACF of the squared standardized residuals is used to examine whether there are still ARCH effects in the residuals; this is formally tested by the LM test as in (12).

From the selected and estimated short-term model, models with longer lags and ARMAX-GARCH model we compute point forecasts and prediction intervals.

5.2 Forecasting procedure

The computation time of estimating the coefficients is large; therefore we do not re-estimate the coefficients after every s-step forecast. The test set is used to update the information set and to evaluate the point forecasts and prediction intervals.

The main scheme to compute forecasts that is used in this research is the fixed scheme. The parameters are estimated once using the R observations in the original training set. The data to produce the forecast is updated with new information from the test set after an s-step prediction. Hence, the first s-step forecast is based on the R observations in the training set, the second s-step forecast is based on R observations from the training set and the first s observations from the test set, etc.

For the separated datasets we consider a version of the rolling scheme to examine whether updating the coefficients leads to better forecasting results. In the 'original' rolling scheme, the training set consists of a fixed sample size and the coefficients are re-estimated after every s-step forecast. The approach that we use is slightly different because of the computation time: the coefficients are updated after each month, and within each month we use the fixed scheme to compute the forecasts. Since there are three months in the test set, this means that we update the coefficients twice.
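The fixed scheme described above can be sketched as a simple loop; `fit` and `forecast` are hypothetical placeholders, not functions from the thesis:

```python
# Fixed scheme: coefficients are estimated once on the training set; after
# each s-step forecast only the information set is extended with s new
# observations from the test set.
def fixed_scheme_forecasts(train, test, s, fit, forecast):
    model = fit(train)                          # estimated once, never updated
    history = list(train)
    preds = []
    for start in range(0, len(test), s):
        preds.extend(forecast(model, history, s))
        history.extend(test[start:start + s])   # update information set only
    return preds[:len(test)]

# Toy check with a naive "repeat the last observation" forecaster
out = fixed_scheme_forecasts([1, 2, 3], [4, 5, 6, 7], s=2,
                             fit=lambda train: None,
                             forecast=lambda m, hist, s: [hist[-1]] * s)
print(out)  # [3, 3, 5, 5]
```

The rolling variant used for the separated datasets differs only in that `fit` is re-run at the start of each month of the test set.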

The forecasts consist of a forecast for the price, based on (23), and a forecast for the conditional variance of the price, based on (28).


The accuracy of the point forecasts is evaluated by the root mean squared error (RMSE), the mean absolute error (MAE) and the mean absolute percentage error (MAPE). These are given by

$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(\hat p_{t+s} - p_{t+s}\right)^2}, \qquad (39)$$

$$\text{MAE} = \frac{1}{N}\sum_{t=1}^{N}\left|\hat p_{t+s} - p_{t+s}\right|, \qquad (40)$$

$$\text{MAPE} = \frac{100}{N}\sum_{t=1}^{N}\left|\frac{\hat p_{t+s} - p_{t+s}}{p_{t+s}}\right|, \qquad (41)$$

where $N$ is the number of observations in the test set, $\hat p_{t+s}$ is the forecast based on (23) and $p_{t+s}$ is the realized price at time $t+s$. For all measures, the lower the value, the better the forecast. The MAPE divides by $p_{t+s}$ and it is therefore not possible to compute this measure when the realized price is equal to zero. In the test set there are three cases where this happens; when the MAPE is computed, we remove these values from both the realized and the forecasted prices.
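The three measures (39)-(41), including the exclusion of zero realized prices from the MAPE, can be sketched as:

```python
import numpy as np

def forecast_errors(forecast, realized):
    """RMSE (39), MAE (40) and MAPE (41); observations with a realized price
    of zero are dropped from the MAPE, as in the text."""
    f, p = np.asarray(forecast, float), np.asarray(realized, float)
    e = f - p
    rmse = float(np.sqrt(np.mean(e**2)))
    mae = float(np.mean(np.abs(e)))
    nz = p != 0
    mape = float(100 * np.mean(np.abs((f[nz] - p[nz]) / p[nz])))
    return rmse, mae, mape

# Toy check: the third pair (realized price 0) enters RMSE/MAE but not MAPE
rmse, mae, mape = forecast_errors([50, 40, 10], [40, 40, 0])
print(rmse, mae, mape)
```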

The accuracy of the prediction interval, which is computed using (31), depends on both the nominal and the conditional coverage. For this we construct an indicator variable for a q-level prediction interval. Let the indicator variable for time $t+s$, made at time $t$, be defined by

$$I_{t+s} = \begin{cases} 1 & \text{if } p_{t+s} \in \left[L_{t+s|t}(q),\, U_{t+s|t}(q)\right];\\ 0 & \text{if } p_{t+s} \notin \left[L_{t+s|t}(q),\, U_{t+s|t}(q)\right], \end{cases} \qquad (42)$$

where $L_{t+s|t}(q)$ is the lower limit and $U_{t+s|t}(q)$ the upper limit of the prediction interval for time $t+s$, made at time $t$, for coverage probability $q$:

$$L_{t+s|t}(q) = \hat p_{t+s} + \hat\sigma_{t+s}\, t_{\frac{1}{2}(1-q)}, \qquad (43)$$

$$U_{t+s|t}(q) = \hat p_{t+s} + \hat\sigma_{t+s}\, t_{\frac{1}{2}(1+q)}. \qquad (44)$$

Following Baillie and Bollerslev (1992) we compare the nominal coverage rate

$$\frac{1}{N}\sum_{h=1}^{N} I_h, \qquad (45)$$

where $h > t$, to the true coverage rate $q$.


A joint test for conditional coverage takes this into account. This test consists of the sum of a Likelihood Ratio (LR) test statistic for unconditional coverage and an LR test statistic for independence.

Consider the sequence $\{I_h\}_{h=1}^{N}$ based on a coverage level $q$. Again $h > t$, so that for $h = 1$ we consider the first forecasted value $p_{t+1}$. The LR test for unconditional coverage tests the null-hypothesis that $E(I_h) = q$ against the alternative that $E(I_h) \neq q$. This corresponds to testing whether the nominal coverage rate corresponds to the true coverage rate, given that the indicators are independent. The LR test for unconditional coverage is given by

$$LR_{uc} = -2\log\left[\frac{L(q; I_1, \ldots, I_N)}{L(\hat\pi; I_1, \ldots, I_N)}\right] \overset{a}{\sim} \chi^2_1, \qquad (46)$$

where

$$L(q; I_1, \ldots, I_N) = (1-q)^{n_0} q^{n_1} \qquad (47)$$

and $n_0$ is the number of zeroes and $n_1$ the number of ones in the indicator sequence. Furthermore,

$$L(\hat\pi; I_1, \ldots, I_N) = (1-\hat\pi)^{n_0} \hat\pi^{n_1}, \qquad (48)$$

where $\pi$ is estimated by $\hat\pi = \frac{n_1}{n_0 + n_1}$.

The LR test for independence tests the null-hypothesis of independence of the sequence $\{I_h\}_{h=1}^{N}$ against the alternative of a first-order Markov process. Let $\pi_{ij} = P(I_t = j \mid I_{t-1} = i)$ and let $n_{ij}$ be the number of observations with value $i$ followed by value $j$ ($i, j = 0, 1$). Under the alternative, the likelihood for a binary first-order Markov chain with transition probability matrix

$$\Pi_1 = \begin{pmatrix} 1-\pi_{01} & \pi_{01} \\ 1-\pi_{11} & \pi_{11} \end{pmatrix} \qquad (49)$$

is

$$L(\Pi_1; I_1, \ldots, I_N) = (1-\pi_{01})^{n_{00}}\,\pi_{01}^{n_{01}}\,(1-\pi_{11})^{n_{10}}\,\pi_{11}^{n_{11}}. \qquad (50)$$

This likelihood is maximized by taking

$$\hat\Pi_1 = \begin{pmatrix} \frac{n_{00}}{n_{00}+n_{01}} & \frac{n_{01}}{n_{00}+n_{01}} \\[4pt] \frac{n_{10}}{n_{10}+n_{11}} & \frac{n_{11}}{n_{10}+n_{11}} \end{pmatrix}. \qquad (51)$$


Under the null-hypothesis of independence the transition probability matrix is given by

$$\Pi_2 = \begin{pmatrix} 1-\pi_2 & \pi_2 \\ 1-\pi_2 & \pi_2 \end{pmatrix}. \qquad (52)$$

The likelihood under the null is

$$L(\Pi_2; I_1, \ldots, I_N) = (1-\pi_2)^{n_{00}+n_{10}}\,\pi_2^{n_{01}+n_{11}}. \qquad (53)$$

The ML estimate for $\Pi_2$, $\hat\Pi_2$, is given by the matrix in (52) with $\pi_2$ estimated by $\hat\pi_2 = \frac{n_{01}+n_{11}}{n_{00}+n_{10}+n_{01}+n_{11}}$. The LR test for independence is

$$LR_{ind} = -2\log\left[\frac{L(\hat\Pi_2; I_1, \ldots, I_N)}{L(\hat\Pi_1; I_1, \ldots, I_N)}\right] \overset{a}{\sim} \chi^2_1. \qquad (54)$$

The joint test for conditional coverage combines the test for unconditional coverage and the test for independence. Its test statistic is given by

$$LR_{cc} = -2\log\left[\frac{L(q; I_1, \ldots, I_N)}{L(\hat\Pi_1; I_1, \ldots, I_N)}\right] \overset{a}{\sim} \chi^2_2. \qquad (55)$$

When ignoring the first observation, the three LR tests are related by

$$LR_{cc} = LR_{uc} + LR_{ind}. \qquad (56)$$
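The three statistics (46), (54) and (55) can be sketched from an indicator sequence as follows. The first observation is ignored throughout, so that the identity (56) holds exactly; the example sequence is synthetic and chosen so that all transition counts are nonzero:

```python
from math import log

def christoffersen(ind, q):
    """LR tests for unconditional coverage (46), independence (54) and
    conditional coverage (55), computed on the transition pairs so that
    LR_cc = LR_uc + LR_ind holds exactly, as in (56)."""
    I = [int(v) for v in ind]
    pairs = list(zip(I[:-1], I[1:]))
    n = {(i, j): pairs.count((i, j)) for i in (0, 1) for j in (0, 1)}
    ll = lambda p, zeros, ones: zeros * log(1 - p) + ones * log(p)
    n0, n1 = n[0, 0] + n[1, 0], n[0, 1] + n[1, 1]
    pi_hat = n1 / (n0 + n1)                      # MLE of P(I_h = 1)
    pi01 = n[0, 1] / (n[0, 0] + n[0, 1])         # Markov transition MLEs (51)
    pi11 = n[1, 1] / (n[1, 0] + n[1, 1])
    ll_markov = ll(pi01, n[0, 0], n[0, 1]) + ll(pi11, n[1, 0], n[1, 1])
    lr_uc = -2 * (ll(q, n0, n1) - ll(pi_hat, n0, n1))
    lr_ind = -2 * (ll(pi_hat, n0, n1) - ll_markov)
    lr_cc = -2 * (ll(q, n0, n1) - ll_markov)
    return lr_uc, lr_ind, lr_cc

# Clustered violations: coverage is too low and hits are serially dependent
ind = [1] * 40 + [0] * 10 + [1] * 40 + [0] * 10
lr_uc, lr_ind, lr_cc = christoffersen(ind, 0.95)
```

For this clustered sequence both components are large, so both the nominal coverage and the independence of the interval hits are rejected.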

Using both methods (comparing nominal coverage to the true coverage and testing for conditional coverage) we examine the accuracy of the prediction intervals. We first examine which model has the best prediction performance by comparing the accuracy of 60%, 80% and 95% prediction intervals, that is we use q = 0.60, 0.80, 0.95.

Second, we examine for which coverage level q the prediction interval has the best nominal and conditional coverage. For this we compute prediction intervals for q = 0.50, 0.55, 0.60, ..., 0.95. Results for this last analysis are only reported for the fixed scheme, since the results for the rolling scheme are very similar to those for the fixed scheme.

6 Results for the separate datasets


6.1 Modelling results

6.1.1 Weekdays

For the weekdays the training set consists of 18,029 observations and the test set of 5,856 observations.

The Ljung-Box test of randomness is rejected for lags up to h = 200 and hence this dependence structure is used as information to forecast the settlement price. For the conditional mean the BIC selects the ARMA(3,3) model whereas the AIC selects the ARMA(5,4) model. For both models the null-hypothesis of no ARCH effects in the residuals is rejected for all lags up to a maximum lag length of 200.

For both ARMA processes the ARCH effects in the residuals are modelled by a GARCH(m,n) process. We find, in both cases, that the process with the skewed Student-t distribution outperforms the models with a Normal and Student-t distribution for all combinations of m,n. The GARCH(4,4) model leads, for both ARMA models, to the lowest BIC and AIC. In combination with this model the ARMA(3,3) is slightly preferred by the BIC (9.5310 vs. 9.5311), whereas the ARMA(5,4) is preferred by the AIC (9.5232 vs. 9.5220). For all fitted extended models (introduced in detail later) both selection criteria select the models with the ARMA(5,4) process for the conditional mean. As the ARMA(5,4) is preferred by the BIC in four out of five cases and by the AIC in five out of five cases, this model is selected for the conditional mean.

Figure 6 displays QQ-plots of the standardized residuals of the ARMA(5,4)-GARCH(4,4) models against the theoretical innovation distribution, which displays unsatisfactory results. None of the standardized residuals matches the theoretical distribution, hence the assumptions on the distribution for the innovations are not met. Even the (skewed) Student-t distribution, which can capture fatter tails than the Normal distribution, does not capture the fat tails of the data.

A slightly surprising result is that the residuals of the (skewed) Student-t distribution are almost on a straight line, however not on the 45-degree line. We consider the two-step approach as suggested in McNeil et al. (2005) to improve the fit of the distribution: first the ARMA-GARCH process is fitted by Quasi-Maximum Likelihood (QML) where we falsely assume the Normal distribution for the innovations, second a (skewed) Student-t distribution is fitted to the standardized residuals. Unfortunately this does not improve the QQ-plot for the standardized residuals and the (skewed) Student-t distribution does not capture the fat tails of the data. To offer a suggestion for future research we fit a different distribution, the Normal Inverse Gaussian distribution, in Section 7.3.


Taking this into account, the short-term model that is selected according to the BIC and AIC is the ARMA(5,4)-GARCH(4,4) model with a skewed Student-t distribution for the innovations.

(a) Normal (b) Student-t (c) skewed Student-t

Figure 6: QQ-plots of the standardized residuals against the theoretical distribution of the innovations

The ACF of the (squared) standardized residuals of the ARMA(5,4)-GARCH(4,4) model with a skewed Student-t distribution, shown in Figure 7, has a peak around lag 96, which is exactly 24 hours earlier. According to Figure 7a the residuals are not white noise, which is confirmed by the Ljung-Box test up to lag 200. The ACF of the squared standardized residuals is plotted in Figure 7b. Although not all estimated correlations lie within the 95% confidence bounds, there is a big improvement in capturing the ARCH effects of the residuals compared to the ARMA(5,4) model. The LM test of no ARCH effects is now only rejected for lags after lag 47, which is an improvement compared to the ARMA(5,4) model where the null-hypothesis is rejected for all lags. Based on the ACF of the (squared) standardized residuals we expect that we can improve the fit by including longer lags.

Four extensions of the ARMA(5,4)-GARCH(4,4) model are fitted, in which longer lags are included. Including the original ARMA(5,4)-GARCH(4,4) model, the estimated models are:

1. ARMA(5,4)-GARCH(4,4);
2. ARMA(96,4)-GARCH(4,4);
3. ARMA(96,4)-GARCH(96,4);
4. ARMA(98,4)-GARCH(4,4);
5. ARMA(98,4)-GARCH(98,4),

all with a skewed Student-t distribution for the innovations. In the sequel this numbering is used for the models.


Figure 7: ACF of the standardized residuals (a) and the squared standardized residuals (b) of the selected ARMA(5,4)-GARCH(4,4) model, with a skewed Student-t distribution for the innovations, for the weekdays

The reported standard errors are robust against general misspecification. The size and significance of the estimated AR coefficients φ for lags 1 and 4 indicate that the conditional price is related to the prices from 1 PTU and 4 PTU earlier; lag 96 is highly significant as well. The estimated parameter α96 is also highly significant, and therefore we expect that models 3 and 5 have a better prediction performance. The necessary and sufficient condition for second-order stationarity as in (15) holds for all fitted models. The estimated parameters for the skewness ξ and the shape ν of the skewed Student-t distribution for the innovations are similar in all cases. We find very thick tails and, as log(ξ) > 0, the density is slightly skewed to the right.

The QQ-plots of the standardized residuals of the extended models are not displayed as they are all very similar to the QQ-plot of the standardized residuals of model 1: the standardized residuals do not match the theoretical innovation distribution. Comparing the BIC and AIC of the extended models to those of the short-term model, see Table 4, we find an improvement compared to the short-term model 1. Model 5 has the lowest BIC and AIC, followed by model 3. In Figure 8 the ACF's of the standardized residuals of the extended models are plotted. Although the standardized residuals are not white noise, when these ACF's are compared to the ACF of model 1 (Figure 7a) the fit has improved. Especially for models 5 and 3 the peak around lag 96 has decreased. The Ljung-Box test confirms that the residuals are not white noise, but quite surprisingly the p-values of the extended models are not smaller than those of the original models, indicating that according to this test there is no improvement. The correlation in the residuals could be reduced by including more lags in the model.


        Model 1   Model 2   Model 3   Model 4   Model 5
BIC     9.5311    9.4997    9.4632    9.4970    9.4591
AIC     9.5220    9.4901    9.4533    9.4858    9.4456

Table 4: AIC and BIC for the fitted models for the weekdays

Including longer lags improves the GARCH part of the model as well. The LM test for ARCH effects confirms this: for models 3 and 5 the null-hypothesis of no ARCH effects is not rejected for most lags up to a lag length of 200. For models 2 and 4 the rejections are similar to those for the short-term model 1.

Figure 8: ACF's of the standardized residuals of the extended models for the weekdays

Figure 9: ACF's of the squared standardized residuals of the extended models for the weekdays


Table 5: Fitted coefficients for the ARMA-GARCH models for the weekdays (Model 1 - Model 5)

Robust standard errors in parentheses. ∗p < 0.1; ∗∗p < 0.05; ∗∗∗p < 0.01.

6.1.2 Weekends/Holidays

To have regularly sampled data we exclude from the dataset the two days on which there is a switch between standard time and daylight saving time. After this adjustment, the training set for the weekends contains 7,481 observations and the test set 2,784. The results for the weekends/holidays are very similar to those for the weekdays and therefore we describe the model diagnostics only briefly.

The Ljung-Box test of randomness is rejected for lags up to h = 200. For the conditional mean we find that an ARMA(3,4) model has the lowest BIC and an ARMA(4,3) model has the lowest AIC. The null-hypothesis of the LM test for ARCH effects is rejected for both models. The errors are then modelled by a GARCH process and the orders of the ARMA and GARCH processes are selected by a procedure similar to that for the weekdays.


Figure 10: ACF of the standardized residuals (a) and the squared standardized residuals (b) of the selected ARMA(4,3)-ARCH(4) model, with a skewed Student-t distribution for the innovations, for the weekends/holidays

The skewed Student-t distribution leads, for almost all combinations of m,n, to a lower BIC and AIC than when the Normal and Student-t distribution are used. The QQ-plots of the three ARMA-GARCH models are very similar to the QQ-plots for the weekdays and are therefore not displayed. The finding is that the standardized residuals do not match the theoretical Normal or (skewed) Student-t distribution.

According to the ACF the standardized residuals of the ARMA(4,3)-ARCH(4) model are not white noise, as displayed in Figure 10a. This is confirmed by the Ljung-Box test for a maximum lag length of 200, where the null-hypothesis of no autocorrelation is not rejected only for the first three lags and is rejected for all higher lags. The ACF of the squared standardized residuals is plotted in Figure 10b. Similar to the weekdays there is a peak around lag 96. The null-hypothesis of no ARCH effects of the LM test (up to lag 200) is not rejected for most lags. Comparing this result with the test results for the ARMA(4,3) model, where the null-hypothesis was rejected for all lags, we find that the ARCH effects are much better captured by the short-term ARMA(4,3)-ARCH(4) model.

We expect that we can improve the model by including longer lags. We use the same numbering for the models for the weekends as for the weekdays:

1. ARMA(4,3)-ARCH(4);
2. ARMA(96,3)-ARCH(4);
3. ARMA(96,3)-ARCH(96);
4. ARMA(98,3)-ARCH(4);
5. ARMA(98,3)-ARCH(98),

where the skewed Student-t distribution is used for the innovations.


The estimated coefficient for lag 96, both in the ARMA and the GARCH part, is highly significant for every extended model. The estimated parameters for the skewness and shape are similar for all fitted models and, as log(ξ) > 0, the distribution is slightly skewed to the right. Comparing the distribution for the weekends/holidays to that for the weekdays, we find that the distribution for the weekdays is more skewed and has fatter tails.

According to both selection criteria model 3 is the best model in terms of a minimum BIC and AIC, followed by model 4. This is different from the selection for the weekdays, where model 5 was the preferred model. The BIC and AIC for all models for the weekends/holidays are given in Table 6.

        Model 1   Model 2   Model 3   Model 4   Model 5
BIC     9.0064    8.9697    8.9536    8.9589    8.9629
AIC     8.9925    8.9549    8.9379    8.9404    8.9397

Table 6: AIC and BIC for the fitted models for the weekends/holidays

Table 7: Fitted coefficients for the ARMA-ARCH models for the weekends/holidays (Model 1 - Model 5)

Robust standard errors in parentheses. ∗p < 0.1; ∗∗p < 0.05; ∗∗∗p < 0.01.


Figure 11: ACF's of the standardized residuals of the extended models for the weekends/holidays

Figure 12: ACF's of the squared standardized residuals of the extended models for the weekends/holidays

Conclusion

The conclusions for the models fitted to the weekdays and the weekends/holidays are similar. First, we conclude that, according to the model diagnostics, not all model assumptions are met, but the extended models show an improvement. Models 3 and 5, which include longer lags in both the ARMA and the GARCH part, are found to have the best fit: they have the lowest BIC and AIC, and the ARCH effects are captured best. Taking this into account, we expect model 5 to have the best prediction performance for the weekdays and model 3 for the weekends/holidays.

6.2 Empirical forecasting performance


6.2.1 Weekdays

In the case of the weekdays, the number of observations in the test set is N = 5,856. For computing the MAPE we do not consider the two prices that are equal to zero, so this computation is based on 5,854 observations.
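The three performance measures can be computed as in the following sketch. The function name and the example values are illustrative, not taken from the thesis; the non-zero filter mirrors the exclusion of the two zero settlement prices described above.

```python
import numpy as np

def point_forecast_errors(actual, forecast):
    """RMSE, MAE and MAPE (in percent) for a series of point forecasts.

    The MAPE is computed only over non-zero actual prices, since a zero
    price makes the percentage error undefined.
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    err = actual - forecast
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    nz = actual != 0  # drop observations with a zero settlement price
    mape = 100.0 * np.mean(np.abs(err[nz] / actual[nz]))
    return rmse, mae, mape

# illustrative numbers: the zero price is skipped in the MAPE only
rmse, mae, mape = point_forecast_errors([50.0, 0.0, 40.0, 60.0],
                                        [45.0, 5.0, 44.0, 54.0])
# rmse ≈ 5.05, mae = 5.0, mape = 10.0
```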

Fixed scheme

In the upper part of Table 8 the RMSE, MAE and MAPE for the two forecast horizons (4 PTU and 1 PTU) are given. For both forecast horizons, all three statistics select the most extended model, model 5. Forecasts for 1 PTU ahead are better than forecasts for 4 PTU ahead in terms of RMSE and MAE, but according to the MAPE the 4 PTU forecasts are better. The MAPE indicates that the point forecasts are not very accurate: the error is approximately 80-90%.

                    Model 1   Model 2   Model 3   Model 4   Model 5
Fixed scheme
  4 PTU   RMSE      75.5094   74.5442   74.0294   74.4160   73.7951
          MAE       38.5281   38.0189   37.9658   37.8747   37.8010
          MAPE      82.1911   79.2295   78.0304   78.4303   76.9439
  1 PTU   RMSE      72.6141   71.6802   71.1761   71.5348   70.9232
          MAE       37.8064   37.2353   37.1140   37.1125   36.9874
          MAPE      89.0103   85.4957   84.5337   84.8689   83.5664
Rolling scheme
  4 PTU   RMSE      75.4337   74.4956   74.0091   74.3436   73.7920
          MAE       38.5200   38.0017   37.9455   37.8399   37.7864
          MAPE      82.7263   79.6628   78.4745   78.8078   77.5305
  1 PTU   RMSE      72.5449   71.6400   71.1637   71.4781   70.9239
          MAE       37.7894   37.2263   37.1110   37.0931   36.9797
          MAPE      89.3278   85.9151   84.9773   85.1624   83.9976

Table 8: Performance measures for the point forecasts with the fixed and rolling scheme (weekdays)
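The two schemes in Table 8 differ only in whether the parameters are re-estimated as the forecast origin moves through the test set. The loop below is a minimal sketch under the usual definitions (fixed: estimate once on the training sample; rolling: re-estimate each step on a window of constant length); the `fit` and `predict` callables are placeholders, not the ARMA-GARCH routines of the thesis.

```python
import numpy as np

def one_step_forecasts(series, n_test, refit, fit, predict):
    """Produce one-step-ahead out-of-sample forecasts for the last
    n_test observations of `series`.

    refit=False : fixed scheme, parameters estimated once
    refit=True  : rolling scheme, re-estimated on a rolling window
    """
    series = np.asarray(series, dtype=float)
    n_train = len(series) - n_test
    params = fit(series[:n_train])  # fixed scheme: estimate once
    forecasts = []
    for t in range(n_train, len(series)):
        if refit:
            # rolling scheme: window of constant length n_train
            params = fit(series[t - n_train:t])
        forecasts.append(predict(params, series[:t]))
    return np.array(forecasts)
```

With a toy `fit` that returns the sample mean and a `predict` that forecasts it, the fixed scheme reuses the initial estimate while the rolling scheme tracks the moving window.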

For the prediction intervals, the models with longer lags perform better in most cases. Table 9 displays the nominal coverage of the 60%, 80% and 95% prediction intervals (q = 0.60, 0.80, 0.95). Model 5 almost always has the best nominal coverage.


This agrees with the QQ-plot of the standardized residuals, which indicated that the fat tails are not fully captured by the skewed Student-t distribution.
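A central prediction interval follows directly from the conditional mean and standard deviation forecasts together with a quantile of the innovation distribution. In the sketch below the Gaussian quantile is used as a stand-in for the skewed Student-t of the thesis, whose quantile function is not reproduced here; the numbers in the example are illustrative.

```python
from statistics import NormalDist

def prediction_interval(mu, sigma, q, quantile=NormalDist().inv_cdf):
    """Central prediction interval with nominal coverage q.

    mu, sigma : conditional mean and standard deviation forecasts
                from the ARMA-GARCH model
    quantile  : quantile function of the standardized innovations
                (Gaussian here; the thesis uses the skewed Student-t)
    """
    a = (1.0 - q) / 2.0  # tail probability on each side
    return mu + sigma * quantile(a), mu + sigma * quantile(1.0 - a)

# illustrative numbers, not taken from the thesis
lo, hi = prediction_interval(40.0, 25.0, 0.95)
```

A fatter-tailed or skewed quantile function widens or tilts the interval accordingly, which is exactly why the choice of innovation distribution shows up in the coverage results.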

                          Model 1   Model 2   Model 3   Model 4   Model 5
True coverage q
Fixed scheme
  4 PTU   0.95            0.8880    0.8880    0.8946    0.8893    0.8952
          0.80            0.7654    0.7666    0.7790    0.7679    0.7816
          0.60            0.6253    0.6255    0.6313    0.6279    0.6317
  1 PTU   0.95            0.8970    0.8981    0.9020    0.8979    0.9042
          0.80            0.7683    0.7664    0.7737    0.7666    0.7744
          0.60            0.5864    0.5859    0.5890    0.5871    0.5890
Rolling scheme
  4 PTU   0.95            0.8880    0.8887    0.8945    0.8899    0.8943
          0.80            0.7672    0.7686    0.7789    0.7695    0.7804
          0.60            0.6226    0.6252    0.6293    0.6289    0.6294
  1 PTU   0.95            0.8982    0.8998    0.9011    0.8986    0.9039
          0.80            0.7713    0.7695    0.7753    0.7702    0.7766
          0.60            0.5854    0.5886    0.5897    0.5897    0.5927

Table 9: Nominal coverage Σ_{h=1}^{N} I_h / N compared to the true coverage (weekdays)
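The nominal coverage in Table 9 is simply the mean of the hit series I_h, where I_h indicates whether the realized price fell inside the prediction interval. A minimal sketch (the example values are illustrative):

```python
import numpy as np

def empirical_coverage(actual, lower, upper):
    """Hit series I_h (1 if the realization lies inside [L_h, U_h])
    and its mean, the empirical counterpart of the nominal coverage q."""
    actual = np.asarray(actual, dtype=float)
    inside = ((actual >= np.asarray(lower, dtype=float)) &
              (actual <= np.asarray(upper, dtype=float)))
    hits = inside.astype(int)
    return hits, hits.mean()

# two of four illustrative realizations fall inside their intervals
hits, coverage = empirical_coverage([1, 5, 9, 3], [0, 0, 0, 4], [2, 4, 10, 6])
# hits = [1, 0, 1, 0], coverage = 0.5
```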

The LR test statistics for conditional coverage are displayed in Table 10 for q = 0.60, 0.80, 0.95. The null hypothesis of randomness and correct coverage is rejected in all cases. In most cases the separate tests for independence and unconditional coverage are rejected as well, except for the 1 PTU forecasts with q = 0.95, where the null hypothesis of independence is not rejected, and with q = 0.60, where the null hypothesis of correct unconditional coverage is not rejected. Keep in mind that the computations are based on a large dataset, so even a small deviation from the null hypothesis shows up as significant.

Models 3 and 5 show similar values of the LR test statistics for both forecast horizons. These models perform best for q = 0.95 and q = 0.80; for 4 PTU forecasts and q = 0.60, however, they are outperformed by models 2 and 4. Furthermore, the 1 PTU forecast horizon always shows (much) better results than forecasting 4 PTU ahead. This is not a surprising result, since there is more uncertainty at a longer horizon.


4 PTU forecasts

  q      Test     Model 1          Model 2          Model 3          Model 4          Model 5
  0.95   LRuc   356.33 (0.0000)  356.33 (0.0000)  290.79 (0.0000)  342.45 (0.0000)  285.97 (0.0000)
         LRind  148.20 (0.0000)  150.81 (0.0000)  132.26 (0.0000)  147.40 (0.0000)  137.37 (0.0000)
         LRcc   504.53 (0.0000)  507.14 (0.0000)  423.05 (0.0000)  489.85 (0.0000)  423.34 (0.0000)
  0.80   LRuc    42.25 (0.0000)   39.43 (0.0000)   15.75 (0.0001)   36.33 (0.0000)   12.18 (0.0005)
         LRind  357.85 (0.0000)  337.73 (0.0000)  323.84 (0.0000)  352.00 (0.0000)  319.61 (0.0000)
         LRcc   400.10 (0.0000)  377.16 (0.0000)  339.59 (0.0000)  388.33 (0.0000)  331.78 (0.0000)
  0.60   LRuc    15.92 (0.0001)   16.13 (0.0001)   24.32 (0.0000)   19.30 (0.0000)   24.86 (0.0000)
         LRind  433.09 (0.0000)  398.47 (0.0000)  412.55 (0.0000)  387.92 (0.0000)  423.10 (0.0000)
         LRcc   449.01 (0.0000)  414.60 (0.0000)  436.87 (0.0000)  407.22 (0.0000)  447.96 (0.0000)

1 PTU forecasts

  0.95   LRuc   268.59 (0.0000)  259.30 (0.0000)  224.93 (0.0000)  260.84 (0.0000)  206.40 (0.0000)
         LRind    1.73 (0.1888)    3.59 (0.0583)    2.43 (0.1192)    2.62 (0.1054)    2.79 (0.0950)
         LRcc   270.32 (0.0000)  262.88 (0.0000)  227.36 (0.0000)  263.46 (0.0000)  209.19 (0.0000)
  0.80   LRuc    35.58 (0.0000)   39.83 (0.0000)   24.54 (0.0000)   39.43 (0.0000)   23.30 (0.0000)
         LRind   20.44 (0.0000)   14.81 (0.0001)   22.89 (0.0000)   19.83 (0.0000)   24.26 (0.0000)
         LRcc    56.02 (0.0000)   54.64 (0.0000)   47.43 (0.0000)   59.26 (0.0000)   47.56 (0.0000)
  0.60   LRuc     4.42 (0.0354)    4.77 (0.0290)    2.91 (0.0882)    3.99 (0.0458)    2.91 (0.0882)
         LRind   72.82 (0.0000)   69.60 (0.0000)   77.15 (0.0000)   75.02 (0.0000)   76.21 (0.0000)
         LRcc    77.24 (0.0000)   74.37 (0.0000)   80.06 (0.0000)   79.01 (0.0000)   79.11 (0.0000)

p-values in parentheses; LRuc ∼ χ²(1), LRind ∼ χ²(1), LRcc ∼ χ²(2); LRcc = LRuc + LRind.

Table 10: LR test for conditional coverage (weekdays, fixed scheme)
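The three statistics in Table 10 follow the standard Christoffersen-type construction on the hit series: LRuc compares the observed hit rate to q, LRind tests a first-order Markov alternative against an iid hit series, and LRcc is their sum. The sketch below shows the construction; degenerate transition counts (which would require taking the log of zero) are not handled.

```python
import numpy as np

def christoffersen_tests(hits, q):
    """LR statistics for unconditional coverage, independence and
    conditional coverage of a 0/1 hit series with target coverage q."""
    hits = np.asarray(hits, dtype=int)
    n = len(hits)
    n1 = hits.sum()
    n0 = n - n1
    pi = n1 / n
    # unconditional coverage: H0 says the hit probability equals q
    lr_uc = -2 * (n0 * np.log(1 - q) + n1 * np.log(q)
                  - n0 * np.log(1 - pi) - n1 * np.log(pi))
    # independence: first-order Markov alternative vs. iid hits
    prev, curr = hits[:-1], hits[1:]
    n00 = np.sum((prev == 0) & (curr == 0))
    n01 = np.sum((prev == 0) & (curr == 1))
    n10 = np.sum((prev == 1) & (curr == 0))
    n11 = np.sum((prev == 1) & (curr == 1))
    pi01 = n01 / (n00 + n01)      # P(hit | previous miss)
    pi11 = n11 / (n10 + n11)      # P(hit | previous hit)
    pi1 = (n01 + n11) / (n - 1)   # pooled hit probability
    lr_ind = -2 * ((n00 + n10) * np.log(1 - pi1) + (n01 + n11) * np.log(pi1)
                   - n00 * np.log(1 - pi01) - n01 * np.log(pi01)
                   - n10 * np.log(1 - pi11) - n11 * np.log(pi11))
    # conditional coverage combines the two components
    return lr_uc, lr_ind, lr_uc + lr_ind
```

When the observed hit rate equals q exactly, LRuc is zero and only the dependence in the hit series contributes to LRcc.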

For low values of q the nominal coverage is very close to the true coverage. For both forecast horizons, the difference between nominal and true coverage is largest for high values of q. This is again a sign that the skewed Student-t distribution does not capture the fat tails of the data.

The conditional coverage of the prediction intervals is best for q = 0.80, 0.85, as can be seen from Figure 14, where the black dashed line marks the critical value of the test. The prediction intervals based on 1 PTU forecasts show better results than those based on 4 PTU forecasts. The LR test statistics for unconditional coverage and independence, LRuc and LRind, are plotted in Figure 24 in Appendix B. From this we see that the independence of the violations is better for higher values of q, whereas the unconditional coverage is better for lower values of q.


Figure 13: Nominal coverage compared to the true coverage q (weekdays); panel (a) 4 PTU forecasts, panel (b) 1 PTU forecasts, models 1-5. The black solid line presents the true coverage.

Figure 14: LR test for conditional coverage for coverage level q (two forecast horizons, weekdays)

Rolling scheme
