
University of Amsterdam

Faculty of Economics & Business

MSc Thesis: Financial Econometrics

Modeling the Term Structure of Interest Rates: using the Generalized Autoregressive Score Framework

Student: Guido Jonker (10457615)
Supervisor: dr. N.P.A. van Giersbergen
Second Reader: prof. dr. H.P. Boswijk


Abstract

In this thesis the term structure of interest rates is modeled with the purpose of fitting and forecasting. For this the Dynamic Nelson-Siegel (DNS) model is used, which is estimated using the Generalized Autoregressive Score (GAS) framework. Within the GAS framework, some new extensions of the DNS are proposed and some existing extensions are evaluated. We propose a new time-varying volatility specification. Also, an extension with student-t disturbances is proposed, but found unfit for modeling. Further, extensions with nonlinearities and an additional fourth factor are investigated. We find that more flexible models lead to a better in-sample fit of the data. Moreover, the GAS estimated models lead to a better in-sample fit than comparable standard models estimated by the Kalman filter. However, out-of-sample predictability of the term structure is not proven for the new estimation method and model extensions. Sub-sample analysis indicates that a naive random walk is difficult to beat using both the GAS and Kalman modeling framework.


Contents

Abstract

1 Introduction

2 Theory
   2.1 The Yield Curve
   2.2 Zero-Coupon Yields
   2.3 Why Model?
   2.4 Nelson-Siegel Model
      2.4.1 Dynamic Nelson Siegel
   2.5 Generalized Autoregressive Score
      2.5.1 The Modeling Framework

3 Model Specifications
   3.1 General Model
      3.1.1 Gaussian
      3.1.2 Student-t
      3.1.3 Variable Lambda
      3.1.4 Time-Varying Volatility
      3.1.5 Common Disturbance with Time-Varying Volatility
      3.1.6 Common Volatility

4 Data

5 Estimation
   5.1 Initial Estimates
      5.1.1 Lambda
      5.1.2 Two-Step Estimation
      5.1.3 One-Step State-Space Estimation
   5.2 GAS Estimation
      5.2.1 Maximum Likelihood Estimation
      5.2.2 Initial Factors
      5.2.3 Smoothing
      5.2.4 Inference
   5.3 Scores and Scaling
      5.3.1 Gaussian
      5.3.2 Student-t
      5.3.3 Variable Lambda
      5.3.4 Common Disturbance with Time-Varying Volatility
      5.3.5 Common Time-Varying Volatility

6 Results
   6.1 In-Sample Performance
      6.1.1 Two-Step
      6.1.2 Kalman
      6.1.3 Gaussian
      6.1.4 Student-t
      6.1.5 Variable Lambda
      6.1.6 Common Disturbance with Time-Varying Volatility
      6.1.7 Common Time-Varying Volatility
      6.1.8 Björk and Christensen Four-Factor Model
      6.1.9 Estimation Robustness
      6.1.10 In-Sample Conclusion
   6.2 Out-of-Sample Performance
      6.2.1 Forecast Procedure
      6.2.2 Forecast Measures
      6.2.3 Forecast Results
      6.2.4 Out-of-Sample Conclusion

7 Conclusion

8 Further Research

A Tables

B Kalman Filter

Bibliography


Chapter 1

Introduction

The term structure of interest rates gives the relation between interest rates or bond yields at different terms (maturities). In general, yields increase with maturity, giving rise to an upward sloping yield curve. One basic explanation for this phenomenon is that lenders demand higher interest rates for longer-term loans as compensation for the greater risk associated with them, in comparison to short-term loans. The yield curve plays a central role in an economy. Modeling and forecasting the term structure of interest rates is therefore of great importance in many ways: pricing derivatives, portfolio management, valuation of assets, risk management and monetary policy.

Researchers have produced a vast literature addressing this issue. Many approaches are theoretically rigorous, but produce empirically unsatisfying results and especially poor out-of-sample forecasting performance. Because bonds trade in well-organized, deep and liquid markets, it is logical and appealing to impose the absence of arbitrage. The US treasury market, for example, is very large and liquid, with a total outstanding debt of almost 12 trillion, a yearly issuance of over 2 trillion and approximately 500 billion traded every day.1 Because of this liquid market it is unlikely that arbitrage opportunities exist: the risk-adjusted returns of different maturity bonds should be the same. Or differently stated: the yields are internally consistent. Therefore, a large amount of literature holds this as its grounding. The associated arbitrage-free (AF) models started with Vasicek (1977) and Cox et al. (1985), who introduced the so-called 'affine' models, which are functions of the instantaneous short rate (r). These models became even more popular when Duffie and Kan (1996) generalized them. Unfortunately these models have a poor fit and are difficult to estimate, many having multiple likelihood maxima (Kim and Orphanides, 2005). Besides these estimation difficulties, forecasting performance of these affine models is also poor (Duffee, 2002).

1. http://www.sifma.org/research/statistics.aspx


As a result of these difficulties, the focus of many researchers has been on empirically attractive models, the most important being the Nelson-Siegel (NS) curve (1987). The NS curve is widely used among central banks and others in the financial industry. The reasons for this are its relative simplicity, its ease of estimation and its tractable empirical forecasting results. It addresses the problem of modeling the yield curve by summarising, at any point in time, the information in a large number of bonds. It does so by expressing the large set of yields of various maturities as a function of a small set of unobserved factors. The three factors it uses have the economic interpretation of the level (long-term), slope (short-term) and curvature (medium-term) of the yield curve. The Nelson-Siegel curve is reasonably flexible, allowing for various shapes (monotonic, inverted, humped, S-shaped). Therefore, the NS curve ensures a good fit to the data. As an evolution, the Dynamic Nelson Siegel (DNS) model was developed by Diebold and Li (2006). This model imposes the structural restrictions of the NS with time-varying factors (level, slope, curvature). These time-varying factors are modeled using (Vector) Autoregressive specifications. Their paper shows that their forecasts outperform standard time series models, which has brought research back to the Nelson-Siegel class. Several researchers have extended and investigated the DNS. Diebold et al. (2006) put the DNS in state-space form and include macro-economic factors. Pooter (2007) examines various extensions of the DNS with the purpose of fitting and forecasting. He concludes that an extension with a fourth factor (Svensson, 1995) forecasts very well, especially with a one-step state-space estimation approach using a Kalman filter. Koopman et al. (2010) extend the DNS in two directions. First, they impose that the factor loadings in the DNS depend on an additional loading parameter, which they treat as the fourth latent variable (λt). Second, they introduce time-varying volatility to the DNS using a standard GARCH specification.

Besides the DNS, the Arbitrage-Free Nelson-Siegel (AFNS) model was developed by Christensen et al. (2011), which takes the DNS and imposes the absence of arbitrage. This partly closes the gap between theoretically rigorous and statistical term structure models. It adds an additional time-invariant "yield-adjustment term", which accounts for the differences between the AFNS and DNS. Because the DNS generally fits well, and market yields are assumed to be arbitrage-free, it is arguable that the DNS is arbitrage-free up to an accurate approximation: the no-arbitrage constraint should be largely non-binding. Coroneo et al. (2011) support this claim by finding that the normal Nelson-Siegel is compatible with the no-arbitrage constraints in the US interest rate market. Duffee and Stanton (2012) come to similar conclusions. Joslin et al. (2011) conclude that for their JSZ model forecasts are invariant to the imposition of no-arbitrage restrictions. Furthermore, they state that the AFNS model is a constrained special case of the JSZ normalization. Despite these considerations a lot of recent research has focused on the AFNS. For example, Christensen et al. (2009a) extend their AFNS model to form the Arbitrage-Free Generalized Nelson-Siegel (AFGNS): a Svensson extension (a fourth factor). Also, Christensen et al. (2013) incorporate stochastic volatility in the AFNS, but conclude that much observed stochastic volatility cannot be associated with the spanned term structure factors.

Because of these arbitrage-free considerations, this thesis will focus on the DNS. The novel feature of this thesis is the relatively new estimation method. Time-series models with time-varying parameters can be categorized into two classes: parameter driven models and observation driven models (Cox et al., 1981).

In parameter driven models, the time-varying parameters are stochastic processes subject to their own source of error. Therefore, the parameters are not perfectly predictable given the past: the likelihood function is not known in closed form.

The alternative is an observation driven model, where the time-varying parameters are dependent on (functions of) lagged dependent values, exogenous variables and past observations. Parameters are stochastic, but predictable given the past. This approach simplifies likelihood evaluation, because the likelihood function is known in closed form. Creal et al. (2008, 2012) introduce Generalized Autoregressive Score (GAS) models, a class of observation driven time series models. The GAS model uses the scaled score function as the driving mechanism of the time-varying parameters.

The usage of the GAS framework here is different, as most papers use a parameter driven approach, in which the NS model parametrization is used for the observation equation, combined with a transition equation for the unobserved factors, which are modeled as (Vector) Autoregressive processes. Together with the NS observation equation this forms a state-space structure. Restrictions are imposed on this state-space in order to estimate the model: mostly, Gaussian disturbances are assumed in both equations. This way it is possible to use a Kalman filter (1960). Some propose Bayesian estimation using Markov Chain Monte Carlo (MCMC) in order not to make these assumptions (Laurini and Hotta, 2010). With the use of the GAS framework we are not restricted to such assumptions or methods. This leads to the main question that is answered in this thesis:

How does the GAS framework perform in the term structure setting?

Which model specification performs best? Different specifications are compared with regard to in-sample fit and out-of-sample forecasting, as well as estimation ease and robustness.

This main question results in different sub-questions:

• Is the assumption of normality valid?

As mentioned, previous applications of the DNS/AFNS have used Gaussian errors. Different distributions could be used, such as the student-t distribution.


• Can heteroskedasticity be included in the model?

Is the assumption of cross-sectional and longitudinal independence of the disturbances valid? Can the model be extended to incorporate other possibilities?

• How do GAS estimated models perform in relation to Kalman estimated models? The performance of the GAS estimated models is compared to traditional Kalman estimated models, regarding in-sample fit and out-of-sample predictions.

Methodology and Techniques

In this thesis new specifications of the DNS are proposed. For these new model specifications analytical derivations of the likelihood are determined. Furthermore, all models considered are programmed in MATLAB.

The model specifications are assessed on the basis of in-sample fit and out-of-sample forecasts. For the in-sample fit, measures such as the Root Mean Squared Error (RMSE) are used. More decisive measures such as the Akaike and Bayesian information criteria (AIC/BIC) are also used, which judge the effect of adding additional variables: increasing the fit of the model by making it more complex versus keeping it more parsimonious. Likelihood ratio (LR) tests are used to test the significance of more elaborate nested models, such as a model with student-t errors vs. Gaussian errors and specifications with time-varying volatility. For the non-nested models we use the Rivers-Vuong (RV) test: the GAS estimated vs. Kalman filter estimated models.

For out-of-sample forecasts, comparisons are made between the forecast errors. This is done using the (trace) Root Mean Square Forecast Error ((t)RMSFE). Besides this relatively subjective comparison a more formal test is used to compare the predictive accuracy of different models, namely the Diebold-Mariano (DM) test statistic.
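As a purely illustrative sketch of these forecast measures (not the thesis' MATLAB code; the array names errs_a and errs_b are hypothetical T × N panels of one-step forecast errors from two competing models), in Python:

import numpy as np
from scipy import stats

def trace_rmsfe(errors):
    # Trace RMSFE: root of the mean squared forecast error pooled over all
    # forecast dates (rows) and maturities (columns).
    return np.sqrt(np.mean(errors ** 2))

def diebold_mariano(errors_a, errors_b):
    # Simple one-step-ahead Diebold-Mariano test under squared-error loss:
    # returns the DM statistic and its asymptotic two-sided p-value.
    d = (errors_a ** 2 - errors_b ** 2).mean(axis=1)   # loss differential per date
    dm = d.mean() / np.sqrt(d.var(ddof=1) / len(d))
    return dm, 2 * (1 - stats.norm.cdf(abs(dm)))

# usage with placeholder forecast errors
rng = np.random.default_rng(0)
errs_a, errs_b = rng.normal(size=(2, 120, 17))
print(trace_rmsfe(errs_a), diebold_mariano(errs_a, errs_b))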


Chapter 2

Theory

In this section the theory behind modeling the term structure is explained. First we explain what interest rate curves are and how these curves relate. We then explain how they are derived. Subsequently, we give reasons for modeling them and for using the Nelson-Siegel method. Finally, we explain the Generalised Autoregressive Score (GAS) framework and how it can be adopted for the term structure.

2.1 The Yield Curve

We start by defining what we try to model: the yield curve or the term structure of interest rates. The yield curve describes the relationship between the yield (i.e. the return) of bonds and the time to maturity. This term structure is given for bonds with the same credit risk, for example US treasury bonds. This is done because it is assumed that these bonds have similar dynamics or the same factors driving their dynamics, and because we want to be able to compare different countries or companies over the course of time. We start by defining the relationships between the different interest rate curves: the yield curve, the discount curve and the forward curve (Hull, 1999).

Define P(τ) as the price of a τ-period discount bond (a discount bond meaning that a discounted price is paid now for an amount to be received in the future). Hence P(τ) denotes the present value of a risk-free contract that pays one unit at its maturity τ periods ahead, with τ = T − t the time to maturity. If y(τ) is the continuously compounded yield to maturity, then by definition the discount curve is given by:

P(\tau) = e^{-\tau y(\tau)}. \qquad (2.1)

Hence the yield curve and the discount curve are fundamentally related. Knowing one of the curves enables you to calculate the other curve immediately.


The forward rate curve gives the forward rate as a function of the maturity. It is similarly related:

f(\tau) = -\frac{P'(\tau)}{P(\tau)}. \qquad (2.2)

Equations (2.1) and (2.2) imply the relationship between the yield and forward curve:

y(\tau) = \frac{1}{\tau}\int_{0}^{\tau} f(u)\,du. \qquad (2.3)

So the yield is the equally weighted average of the forward rates. This shows that once we have a representation of any one of the three curves we can immediately derive the others; that is, they are interchangeable (Diebold and Rudebusch, 2011; Piazzesi, 2010).
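To make the relations (2.1)-(2.3) concrete, here is a small illustrative Python sketch (function names are ours, not from the thesis) that maps a yield into a discount factor and recovers yields from a forward curve by numerical integration:

import numpy as np

def discount_from_yield(tau, y):
    # Discount curve of eq. (2.1): P(tau) = exp(-tau * y(tau)).
    return np.exp(-tau * y)

def yield_from_forward(tau_grid, fwd):
    # Yield curve of eq. (2.3): y(tau) = (1/tau) * integral_0^tau f(u) du,
    # approximated with the trapezoidal rule on a grid of maturities starting at 0.
    integral = np.concatenate(([0.0], np.cumsum(0.5 * (fwd[1:] + fwd[:-1]) * np.diff(tau_grid))))
    y = np.empty_like(tau_grid)
    y[0] = fwd[0]                       # limiting value y(0) = f(0)
    y[1:] = integral[1:] / tau_grid[1:]
    return y

# usage: a flat 5% forward curve implies a flat 5% yield curve
tau = np.linspace(0.0, 10.0, 121)
print(yield_from_forward(tau, np.full_like(tau, 0.05))[:3])
print(discount_from_yield(tau[1:4], 0.05))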

2.2 Zero-Coupon Yields

Although the yield curve is the most used in practice, yields are not observed directly. Instead, yields need to be estimated from a large set of bond prices with an approximation method. In practice, bonds of many different maturities exist at any point in time. The difficulty is that we cannot simply apply the formulas (2.1)-(2.3) to the observed market prices of the bonds. This is not possible because most bonds bear coupon payments: mostly, semi-annual payments are received. Because of these coupon payments the prices of these bonds bear a so-called 'coupon effect', as analyzed by Caks (1977). Therefore bonds with the same maturity, but with different coupon rates, will have different yields. Because of these effects, 'zero-coupon' yields have to be estimated from the large pool of bonds that are traded. Different researchers have come up with methods to estimate these zero-coupon yields.

McCulloch (1975) introduces a cubic splines method to estimate the zero-coupon yields. The disadvantage of this method is that it has some trouble in fitting flat curves, as a result of a diverging discount curve at long maturities. Still, the Federal Reserve presents its yield curves using this method. Vasicek and Fong (1982) overcome this problem by using exponential splines, but their method has the drawback that the forward rate is not strictly positive. The last method is introduced by Fama and Bliss (1987). They construct the yields from estimated forward rates: "unsmoothed Fama-Bliss" rates. These forward rates are created using the prices of the coupon-bearing bonds, which are averaged using a bootstrap method, assuming a constant forward rate between the different maturities. The forward rates are then converted to a yield curve using formula (2.3). The created yields exactly price the bonds used to create them.


This method is regarded as the most accurate and therefore many studies use it: for example Diebold and Li (2006), Pooter (2007), Koopman et al. (2010). Eventually fitting a parametric model to these yields will create smoothed yields. This will be discussed in chapter 3.
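As a stylized illustration only (the actual Fama-Bliss procedure strips discount prices recursively from coupon-bearing bonds; here we assume the discount bond prices are already available), piecewise-constant forward rates and the implied zero-coupon yields can be computed as follows:

import numpy as np

def forwards_from_discount_prices(tau, P):
    # Piecewise-constant forward rate between consecutive maturities:
    # f_i = -(ln P_{i+1} - ln P_i) / (tau_{i+1} - tau_i).
    return -np.diff(np.log(P)) / np.diff(tau)

def yields_from_discount_prices(tau, P):
    # Continuously compounded zero-coupon yields y(tau) = -ln P(tau) / tau,
    # i.e. the maturity-weighted average of the piecewise-constant forwards (eq. 2.3).
    return -np.log(P) / tau

# usage: prices implied by a flat 4% curve give back 4% forwards and yields
tau = np.array([0.25, 0.5, 1.0, 2.0, 5.0, 10.0])
P = np.exp(-0.04 * tau)
print(forwards_from_discount_prices(tau, P))
print(yields_from_discount_prices(tau, P))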

2.3 Why Model?

Now that we have defined what we try to model, we elaborate on why we would want to model this yield curve. Piazzesi (2010) mentions that there are at least four reasons for modeling the yield curve.

The first reason is forecasting. Yields of long maturity bonds are the expected values of the average future short yields, after any risk adjustments. Therefore the current yield curve tells something about future directions of the economy. Besides the use for forecasting future yields, it can be used to forecast real activity and inflation (Diebold et al., 2005; Fama, 1990). All these forecasts can be used for investment decisions, savings decisions and policy decisions.

A second reason is the assessment of monetary policy. Central banks of most industrialized countries seem to be capable of moving the short end of the yield curve. However, what mostly matters for long-term economic growth and developments are the long-term yields. For example, decisions to buy or rent a house are driven by long-term mortgage rates (so long-term yields), not the short-term central bank driven rate (e.g. the Federal Funds Rate). Modeling the yield curve can help to understand how moving the short end of the yield curve affects long-term yields. The research comprises both the understanding of the mechanisms in this process as well as understanding how central banks conduct policy. A recent example is the Quantitative Easing (QE) by the Federal Reserve (Christensen et al., 2009b). Also the European Central Bank has recently decided to use such unconventional monetary policy to stimulate the economy.

A third reason is debt policy. Governments issue debt in the form of bonds and need to decide about the maturities of the issued bonds. The supply of bonds with different maturities influences the yields. Governments can actively manage the maturity structure of their public debt, for example by selling short maturity debt and buying long maturity bonds.

The fourth reason is the use for pricing and hedging. For example, coupon-bearing bonds can be priced using a model of the term structure: each payment is weighted by the price of a zero-coupon bond that matures at the date of the coupon. Also, prices of futures, options, swaps, caps and floors can be computed from a yield curve model. Furthermore, some parties may need to manage risks associated with differences in received or paid interest. Hedging strategies can be computed by determining the prices of the derivatives, depending on different states of the economy. It can be argued that for purposes such as pricing and hedging it is more important to use a model that imposes the restrictions implied by the absence of arbitrage, whereas the temporal dynamics are often of less value.

2.4 Nelson-Siegel Model

Because of the nature of the data a multivariate model is needed. As mentioned, the focus of this thesis is on forecasting. Because we are focusing on forecasting we use an empirically attractive model and stay away from traditional financial theory that imposes the restrictions of the absence of arbitrage. To compress the information that is in the bond data, a model with a factor structure can be used. We want to compress the information in the yields for statistical reasons: a more parsimonious model will result in a worse in-sample fit, but generally better forecasts. Fortunately, financial theory often suggests a factor structure. Successful financial models that use a factor structure are, for example, the CAPM (one factor) and the Fama-French model (three factors) (Fama and French, 1993; Jensen and Scholes, 1972). Risk premiums are often driven by a small number of risk factors. Yields also have a factor structure: the first three principal components explain almost all variation (97%) (Litterman and Scheinkman, 1991). This means that the high-dimensional set of yields is driven by a lower dimensional set of factors.

The Nelson-Siegel is the most used parametric model for the interest rate curve. The model is popular because of its robustness: it provides a smooth fit and is relatively flexible, thereby ensuring a good fit. Above all, it provides statistically useful results that are economically meaningful.

The original model is in the class of exponential affine three-factor term structure models, 'affine' in this context meaning constant plus a linear term: a function of a vector of observable or unobservable (latent) factors. It was developed by Nelson and Siegel (1987) for the static cross-section of the term structure. Their model was designed to fit the forward rate. The model is deduced from the observation that the typical yield curve shapes are associated with differential or difference equations. The function consists of a product between a polynomial and an exponential decay term: a Laguerre function. A Laguerre function is used to approximate functions on the domain [0, ∞). Because maturities lie in this domain it will give a good fit. In this thesis the re-factorization of Diebold and Li (2006) is used, because it gives a more intuitive interpretation of the factors (level, slope and curvature). The representation of the instantaneous forward rate is given by:

f(\tau) = \beta_1 + \beta_2 e^{-\lambda\tau} + \beta_3\lambda\tau e^{-\lambda\tau}. \qquad (2.4)

Integrating this forward rate from 0 to τ as in equation (2.3) gives the Nelson-Siegel yield curve:

y(\tau) = \beta_1 + \beta_2\left(\frac{1 - e^{-\lambda\tau}}{\lambda\tau}\right) + \beta_3\left(\frac{1 - e^{-\lambda\tau}}{\lambda\tau} - e^{-\lambda\tau}\right). \qquad (2.5)

The Nelson-Siegel form is not arbitrary; it has some desirable characteristics. Namely, the price of a bond is one unit at execution, because of the absence of risk, and the price of a bond goes to zero when the time to maturity (τ) goes to infinity:

P(0) = 1. \qquad (2.6)

\lim_{\tau \to \infty} P(\tau) = 0. \qquad (2.7)

The interpretation of the β’s can be deduced by examining the limiting properties of the parametrisation:

\lim_{\tau \to \infty} y(\tau) = \beta_1. \qquad (2.8)

\lim_{\tau \to 0} y(\tau) = \beta_1 + \beta_2 = r. \qquad (2.9)

This shows that β1 gives the level of the yield curve: it provides the long-run component and is constant for all maturities. β2 is the short-term component, as its loading starts at 1 and decays fast to zero with maturity. Together β1 and β2 form the instantaneous short rate (r). β3 is the medium-term component, because its loading starts at zero, then increases and finally decays again with longer maturities. This means that β3 provides the curvature of the yield curve: it affects neither the short nor the long end very much, but mostly affects the middle of the curve. The properties are illustrated by figure 2.1.

λ determines the decay speed of the loadings. A smaller λ produces slow decay and fits longer maturities better; a larger λ produces fast decay and fits shorter maturities better.
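As an illustrative Python sketch of the Nelson-Siegel curve (2.5) and the loadings plotted in figure 2.1 (the thesis implementation is in MATLAB; the function names here are ours):

import numpy as np

def ns_loadings(tau, lam):
    # Nelson-Siegel factor loadings of eq. (2.5) for maturities tau (in months) and
    # decay parameter lambda: columns are the level, slope and curvature loadings.
    x = lam * tau
    slope = (1 - np.exp(-x)) / x
    curvature = slope - np.exp(-x)
    return np.column_stack([np.ones_like(tau), slope, curvature])

def ns_yield(tau, beta, lam):
    # Yield curve implied by the factor vector beta = (beta1, beta2, beta3).
    return ns_loadings(tau, lam) @ beta

# usage: an upward-sloping, humped curve with lambda = 0.062 as in figure 2.1
tau = np.array([3, 6, 12, 24, 60, 120], dtype=float)
print(ns_yield(tau, np.array([6.0, -2.0, 1.5]), 0.062))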


Figure 2.1: Factor loadings. Blue depicts the first loading, given by 1; red the second loading, (1 − e^{−λτ})/(λτ); and yellow the third loading, (1 − e^{−λτ})/(λτ) − e^{−λτ}, where λ = 0.062.

The flexibility of the NS curve is also a desirable property. It is flexible because it can assume a variety of shapes through different values of the latent factors: it can be an increasing or a decreasing function, both at a decreasing and at an increasing rate; it can assume an S-shaped curve; it can be a flat line; and it can also adopt a U-shape or an inverted U-shape (humped or inverted humped). Figure 2.2 illustrates the flexibility of the Nelson-Siegel curve. The limitation of the NS is that it can have only one optimum. Luckily this constraint is mostly non-binding: the yield curve usually does not move jaggedly with maturity.

Although the NS curve is relatively flexible, it still uses a parsimonious approximation. The sparse use of factors results in a smooth curve. This smoothness is preferred because it protects against over-fitting, which is undesirable because it results in estimation difficulties and frequently leads to bad forecasting capabilities.

Figure 2.2: Shapes the NS model can assume, constructed by fixing β1 = 1, β2 + β3 = ...


2.4.1 Dynamic Nelson Siegel

The DNS is an evolution of the NS model. Diebold and Li (2006) convert the factor model for the cross-section into a dynamic factor model. This is done by extending the model with time-varying latent factors. The latent factors determine the cross-section of the yield curve, as shown in the previous section. The dynamics of the factors subsequently determine the longitudinal dynamics of the yields.

The introduction of this dynamic factor structure has notable advantages. It converts the high-dimensional situation (a cross-section of many yields over different time periods) into an easier low-dimensional one.

y_t(\tau) = \beta_{1t} + \beta_{2t}\left(\frac{1 - e^{-\lambda\tau}}{\lambda\tau}\right) + \beta_{3t}\left(\frac{1 - e^{-\lambda\tau}}{\lambda\tau} - e^{-\lambda\tau}\right). \qquad (2.10)

Diebold and Li (2006) introduce two benchmark models: one with the three latent factors modeled as univariate AR(1) processes, and one with the factors modeled as a first-order vector autoregressive model, VAR(1).

Subsequently, Diebold et al. (2006) put the model in state-space form, adding stochastic errors to the Nelson-Siegel observation curve. This produces a measurement equation, which relates a set of N yields with times to maturity τ ∈ {τ1, τ2, . . . , τN} to the three unobservable factors (β1t, β2t, β3t). This gives:

\begin{pmatrix} y_t(\tau_1) \\ y_t(\tau_2) \\ \vdots \\ y_t(\tau_N) \end{pmatrix} =
\begin{pmatrix}
1 & \frac{1-e^{-\lambda\tau_1}}{\lambda\tau_1} & \frac{1-e^{-\lambda\tau_1}}{\lambda\tau_1} - e^{-\lambda\tau_1} \\
1 & \frac{1-e^{-\lambda\tau_2}}{\lambda\tau_2} & \frac{1-e^{-\lambda\tau_2}}{\lambda\tau_2} - e^{-\lambda\tau_2} \\
\vdots & \vdots & \vdots \\
1 & \frac{1-e^{-\lambda\tau_N}}{\lambda\tau_N} & \frac{1-e^{-\lambda\tau_N}}{\lambda\tau_N} - e^{-\lambda\tau_N}
\end{pmatrix}
\begin{pmatrix} \beta_{1t} \\ \beta_{2t} \\ \beta_{3t} \end{pmatrix} +
\begin{pmatrix} \varepsilon_t(\tau_1) \\ \varepsilon_t(\tau_2) \\ \vdots \\ \varepsilon_t(\tau_N) \end{pmatrix}. \qquad (2.11)

The factor dynamics are then specified in the state equation as:

\begin{pmatrix} \beta_{1t} \\ \beta_{2t} \\ \beta_{3t} \end{pmatrix} =
\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix} +
\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}
\begin{pmatrix} \beta_{1,t-1} \\ \beta_{2,t-1} \\ \beta_{3,t-1} \end{pmatrix} +
\begin{pmatrix} \eta_{1t} \\ \eta_{2t} \\ \eta_{3t} \end{pmatrix}. \qquad (2.12)

The state space formed by (2.11) and (2.12) can be put in a more convenient vector notation:

y_t = X(\lambda)\beta_t + \varepsilon_t, \qquad (2.13)

\beta_t = \mu + A\beta_{t-1} + \eta_t, \qquad (2.14)


where yt and εt are (N × 1) vectors, X is an (N × 3) matrix, and βt, µ and ηt are (3 × 1) vectors.

The matrix A can be estimated in full or in diagonal form: VAR(1) or AR(1). However, it is argued by Diebold and Li (2006) that using a VAR (correlated factors) will result in bad forecasting capabilities.

In order to estimate the model using a Kalman filter (1960) (the procedure is given in appendix B), assumptions are made on the error decomposition. Most papers assume that εt and ηt are uncorrelated Gaussian white noise:

\begin{pmatrix} \varepsilon_t \\ \eta_t \end{pmatrix} \sim
N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} \Sigma_\varepsilon & 0 \\ 0 & \Sigma_\eta \end{pmatrix}\right). \qquad (2.15)

Furthermore it is assumed that the errors are orthogonal to the initial state vector:

E[\beta_0 \varepsilon_t'] = 0, \qquad E[\beta_0 \eta_t'] = 0. \qquad (2.16)

Extensions can also be made using this framework, such as a fourth factor (Björk and Christensen, 1999; Svensson, 1995). They argue that this improves the fit for longer maturities, especially those longer than 10 years. Pooter (2007) even concludes that this improves forecast performance, and that the extension of Björk and Christensen (1999) gives similar results. The Björk and Christensen specification is easier to estimate because it only assumes one λ instead of two and therefore reduces the estimation space. The resulting loading matrix X(λ) then has dimensions (N × 4) and βt, µ, ηt become (4 × 1) vectors. The loading matrix is then given by:

X(\lambda) = \begin{pmatrix}
1 & \frac{1-e^{-\lambda\tau_1}}{\lambda\tau_1} & \frac{1-e^{-\lambda\tau_1}}{\lambda\tau_1} - e^{-\lambda\tau_1} & \frac{1-e^{-2\lambda\tau_1}}{2\lambda\tau_1} \\
1 & \frac{1-e^{-\lambda\tau_2}}{\lambda\tau_2} & \frac{1-e^{-\lambda\tau_2}}{\lambda\tau_2} - e^{-\lambda\tau_2} & \frac{1-e^{-2\lambda\tau_2}}{2\lambda\tau_2} \\
\vdots & \vdots & \vdots & \vdots \\
1 & \frac{1-e^{-\lambda\tau_N}}{\lambda\tau_N} & \frac{1-e^{-\lambda\tau_N}}{\lambda\tau_N} - e^{-\lambda\tau_N} & \frac{1-e^{-2\lambda\tau_N}}{2\lambda\tau_N}
\end{pmatrix}. \qquad (2.17)

Another direction in which the model is extended is through the inclusion of macro-variables. Diebold et al. (2006) extend the model by adding macro-variables to the state vector.

Furthermore, some have proposed models in which the state equation has regime-switching properties. This can be used when yields structurally change for long periods, for example due to a change in fiscal policy or a recession. Bernadell et al. (2005) and Xiang and Zhu (2013), for example, make the means of the state equation dependent on the regime. Such models can be estimated using a Kalman filter combined with a Hamilton model (1990) for regime-switching.


Another way of estimating the models in state-space form is the use of Bayesian inference using Markov Chain Monte Carlo (MCMC). These methods allow for more flexible disturbance specifications, though they are computationally very intensive. Hautsch and Yang (2012) and Caldeira et al. (2010) estimate models with stochastic volatility using these Bayesian methods.

The main disadvantage of this state-space framework is the assumption of disturbances in both the measurement equation and the state equation, and moreover the assumptions that are made in order to estimate it with a Kalman filter, or the estimation disadvantages when these assumptions are avoided through the use of Bayesian methods. A clear advantage is the theory that has accumulated on these subjects, especially the theory on the Kalman filter with its elegant recursive estimation routine: it is simple, intuitive, straightforward and powerful. The clear disadvantages of the Kalman filter are its sensitivity to deviations from the Gaussian distribution and its limited adaptive capacities. We therefore proceed to introduce the GAS model.

2.5 Generalized Autoregressive Score

The GAS model of Creal et al. (2008, 2012) gives a different framework to estimate the DNS introduced before. To model the dynamics of the time-varying parameters, it does not assume a state-space framework with separate disturbances in the observation equation and in the state equation. Consequently we do not have to make assumptions on the distribution of the disturbance in the unobserved state equation.

The GAS model approaches the modeling of these time-varying parameters differently, but also shares some similarities. Like the Kalman filter, the GAS model links past observations with future parameters, and it does so in a way that keeps likelihood evaluation straightforward.

The observation-driven GAS model is chosen because its main advantage is that it exploits the full observation density; it is not simply limited to a first or second moment. Also important is that it can be used for all kinds of probability distributions. Furthermore, it can be applied to linear regressions, but also to nonlinear regressions with time-varying coefficients (we will come to that in section 3.1.3). Consequently, the GAS model nests many econometric models, such as the Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models of Bollerslev (1986), the autoregressive conditional duration and intensity (ACD and ACI, respectively) models of Engle and Russell (1998) and Russell (1999), the dynamic conditional correlation (DCC) model of Engle (2002) and the dynamic copula models of Creal et al. (2008). We will now introduce the modeling framework.


2.5.1 The Modeling Framework

Let yt denote the dependent variable of interest (an N × 1 vector), ft the time-varying parameter vector, xt a vector of exogenous variables, and θ a vector of static parameters. Define Yt = {y1, ..., yt}, Ft = {f1, ..., ft} and Xt = {x1, ..., xt}. The available information set at time t consists of {ft, \mathcal{F}_t}, where

\mathcal{F}_t = \{Y_{t-1}, F_{t-1}, X_t\}. \qquad (2.18)

yt is assumed to be generated by the observation density

y_t \sim p(y_t \mid f_t, \mathcal{F}_t; \theta). \qquad (2.19)

The updating mechanism for the time-varying parameter vector ft is given by the autoregressive updating equation

f_{t+1} = \omega + \sum_{i=1}^{p} A_i s_{t-i+1} + \sum_{j=1}^{q} B_j f_{t-j+1}, \qquad (2.20)

where ω is a vector of constants, Ai and Bj are coefficient matrices, and st = st(yt, ft, \mathcal{F}_t; θ) is given by:

s_t = S_t \cdot \nabla_t, \quad \text{with} \quad \nabla_t = \frac{\partial \ln p(y_t \mid f_t, \mathcal{F}_t; \theta)}{\partial f_t}, \quad S_t = S(t, f_t, \mathcal{F}_t; \theta). \qquad (2.21)

So the updating equation (2.20) consists of a constant (ω), a part that uses the scaled score (st−i+1) and a part that uses the lagged factors (ft−j+1).

The scaled score (st) consists, as said, of the score (∇t) and a scaling matrix (St). The use of the score (∇t) for updating the factors is intuitive, as it gives the direction (steepest ascent) in which the factors must be changed to increase the local likelihood, given the current factors (ft). St is the scaling matrix; through the choice of this scaling, the model allows for more flexibility. However, it is often a natural consideration to use a scaling that depends on the variance of the score, that is, the inverse Fisher information:

S_t = I_{t|t-1}^{-1} = \left(E_{t-1}\left[\nabla_t \nabla_t'\right]\right)^{-1} = \left(-E_{t-1}\left[\frac{\partial^2 \ln p(y_t \mid f_t, \mathcal{F}_t; \theta)}{\partial f_t \partial f_t'}\right]\right)^{-1}. \qquad (2.22)

Together, equations (2.20)-(2.22) form a GAS(p,q): a Generalised Autoregressive Score model with orders p and q. The order q gives the number of lags of the factors that are considered: the autoregressive part of the factors. The order p gives the number of lags of the (scaled) score that are considered. The updating mechanism (2.20) can be interpreted as a Gauss-Newton algorithm.
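As a minimal illustration of one GAS(1,1) step, assume the Gaussian DNS observation density yt ~ N(X ft, Σ) (the corresponding score and Fisher information are derived in section 5.3.1). The Python sketch below, with hypothetical names, applies the scaled-score update of equation (2.20):

import numpy as np

def gas_update(y_t, f_t, X, Sigma, omega, A, B):
    # One GAS(1,1) step for y_t ~ N(X f_t, Sigma):
    # score nabla_t = X' Sigma^{-1} (y_t - X f_t), Fisher information I = X' Sigma^{-1} X,
    # scaled score s_t = I^{-1} nabla_t, update f_{t+1} = omega + A s_t + B f_t (eq. 2.20).
    Sigma_inv = np.linalg.inv(Sigma)
    nabla = X.T @ Sigma_inv @ (y_t - X @ f_t)
    info = X.T @ Sigma_inv @ X
    s_t = np.linalg.solve(info, nabla)
    return omega + A @ s_t + B @ f_t

# usage with a toy 5-maturity, 3-factor example
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
f = np.array([6.0, -2.0, 1.5])
y = X @ f + 0.1 * rng.normal(size=5)
print(gas_update(y, f, X, 0.01 * np.eye(5), np.zeros(3), 0.5 * np.eye(3), np.eye(3)))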


An important feature of the model is that under the right specifications the scaled score (st) forms a martingale difference series: Et−1[st] = 0. This is a property of the score. For the variance we get Et−1[st st'] = St It|t−1 St'. When scaling with St = It|t−1^{-1} this reduces to It|t−1^{-1}; when scaling with St = I we get It|t−1 as variance. As suggested, scaling is preferably done with the inverse Fisher information matrix (It|t−1^{-1}). Alternatively, the score could be scaled with a unit matrix, so that the unscaled score is used as updating mechanism. This makes updating similar to a steepest-ascent optimization of the likelihood, but according to Creal et al. this updating mechanism is often less stable. Koopman (2012) suggests to use St = It|t−1^{-1/2}. For this choice of scaling, st has constant unit variance and is invariant under non-degenerate parameter transformations g(ft). They state that the constant unit variance property that results from this scaling choice is a useful device for detecting model mis-specification in applications.

Additionally, the updating equation can be extended to include exogenous variables xt. Besides this, the coefficient matrices can be functions of the static parameters: ω(θ), Ai(θ), Bj(θ).

This chapter is concluded by the observation that the state equation of the DNS in equations (2.13)-(2.14) can be replaced by the GAS updating mechanism of equation (2.20), as also proposed by Creal et al. (2008). Furthermore, Creal et al. (2011a) even suggest to keep using the state-space framework and the Kalman filter, but to model parameters other than βt using the GAS framework. Although this possibility exists, Koopman (2012) conduct a Monte Carlo study to compare parameter driven models with observation driven models and conclude that observation-driven GAS models have similar predictive accuracy to correctly specified parameter-driven models. Therefore, we proceed to model the time-varying factors of the NS using the GAS updating equation.


Chapter 3

Model Specifications

In the following section different model specifications are introduced. As a result of the GAS adoption we are no longer constrained by the requirements of a Kalman filter. We can thus assume different disturbance specifications and nonlinearities. First, we consider the standard Gaussian error specification. We then extend this to a specification with Student-t distributed errors. Subsequently, a model with a variable lambda is proposed. Finally, we extend the model with a disturbance specification with time-varying volatility.

3.1 General Model

As the general model we have:

y_t = X(\lambda)\beta_t + \varepsilon_t, \qquad (3.1)

where X(λ) is as given in equation (2.11), or (2.17) for the four-factor extension of Björk and Christensen (1999). We add βt to the time-varying parameter vector ft, which is updated using equation (2.20) as proposed in the modelling framework. So as time-varying factor vector we have at least ft = βt = (β1t, β2t, β3t)'.

3.1.1 Gaussian

At first we assume Gaussian disturbances, as done in the state-space framework with the Kalman filter. This specification is also estimated in Creal et al. (2008). For this specification we have the disturbances εt given by:


\varepsilon_t \sim N(0, \Sigma), \qquad (3.2)

where εt is the N × 1 disturbance vector and Σ an N × N positive definite covariance matrix.

3.1.2 Student-t

Instead of the Gaussian disturbances that are used by most researchers, we propose to adopt multivariate student-t distributed disturbances. The student-t distribution is symmetric and bell-shaped, like the Gaussian distribution, but as a distinctive feature it has heavier tails, meaning that it is more prone to producing values that fall far from its mean. The student-t distribution is suggested for many variables in finance. It does not, however, capture the asymmetry we often see in financial returns. But as a start we suggest to use the symmetric student-t distribution for the DNS.

Compared to the Gaussian distribution the student-t adds an additional parameter to the probability function, namely the degrees of freedom v. The degrees of freedom determine how fat the tails are. In particular, the higher the degrees of freedom, the closer the distribution resembles a normal distribution: for v → ∞ it converges to the Gaussian distribution, and for values of v > 30 it already closely resembles the Gaussian distribution. So the student-t is a family of distributions that nests the Gaussian distribution; hence, this generalizes the Gaussian model. For the disturbances we have:

\varepsilon_t \sim \text{Student-}t(0, \Sigma, v), \qquad (3.3)

where εt is an N × 1 disturbance vector, Σ an N × N positive definite covariance matrix, and v gives the degrees of freedom.

3.1.3 Variable Lambda

So far the decay parameter λ is assumed to be fixed over time. For example, Diebold and Li (2006) fix λ at 0.0609 and Diebold et al. (2006) estimate λ = 0.077. λ determines the location of the maximum of the curvature loading. It may be too restrictive to fix this parameter, as the characteristics of the yield curve may have changed over time. So we allow for a variable λt, as proposed by Koopman et al. (2010) and adopted by Creal et al. (2008). Instead of the Kalman framework used by Koopman, which requires local linearization, we use the observation driven approach of the GAS as used by Creal. Our time-varying factor vector is extended to ft = (βt', λt)'. The matrix X(λ) in equation (3.1) is replaced by a time-varying Xt dependent on λt:

X_t = \begin{pmatrix}
1 & \frac{1-e^{-\lambda_t\tau_1}}{\lambda_t\tau_1} & \frac{1-e^{-\lambda_t\tau_1}}{\lambda_t\tau_1} - e^{-\lambda_t\tau_1} \\
1 & \frac{1-e^{-\lambda_t\tau_2}}{\lambda_t\tau_2} & \frac{1-e^{-\lambda_t\tau_2}}{\lambda_t\tau_2} - e^{-\lambda_t\tau_2} \\
\vdots & \vdots & \vdots \\
1 & \frac{1-e^{-\lambda_t\tau_N}}{\lambda_t\tau_N} & \frac{1-e^{-\lambda_t\tau_N}}{\lambda_t\tau_N} - e^{-\lambda_t\tau_N}
\end{pmatrix}. \qquad (3.4)

3.1.4 Time-Varying Volatility

Interest rates are subject to financial market trading and therefore sensitive to market sentiment and market movements; changes in volatility may therefore emerge. The models investigated so far have assumed constant volatility. We propose to adopt some time-varying volatility specifications. We first propose an adapted version of the specification of Koopman et al. (2010). Further, we propose a completely new specification.

3.1.5 Common Disturbance with Time-Varying Volatility

At first the disturbance decomposition proposed by Koopman et al. (2010) is adopted. They argue that volatilities vary across yields of different maturities; they find that shorter maturity yields are more sensitive to a common shock than longer maturity yields. Therefore they assume that the disturbance is composed of a common disturbance εt* and an individual disturbance εt+, distributed as:

\begin{pmatrix} \varepsilon_t^{*} \\ \varepsilon_t^{+} \end{pmatrix} \sim
N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} h_t & 0 \\ 0 & \Sigma_{+} \end{pmatrix}\right). \qquad (3.5)

The combined disturbance is then defined as

\varepsilon_t = \Gamma\varepsilon_t^{*} + \varepsilon_t^{+}. \qquad (3.6)

This leads to a variance given by:

\Sigma(h_t) = h_t\Gamma\Gamma' + \Sigma_{+}, \qquad (3.7)

where ht is the time-varying variance of the common disturbance (εt*) and Γ is an N × 1 loading vector that passes the effect of the common disturbance onto the yields of the different maturities.

The GAS model is used to update ht, as opposed to the GARCH specification used by Harvey et al. (1994). In our case the variance ht is modeled as one of four latent factors, ft* = (β1t, β2t, β3t, ht)', where the factors are again updated using equation (2.20).

In this specification restrictions are required to overcome identification problems. Koopman et al. (2010) propose the normalization Γ'Γ = 1, but choose to fix the constant of the common variance at a value close to zero. We choose to fix the first element of Γ at 1, as this also prevents identification issues.
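A small sketch of how the covariance of equation (3.7) can be assembled under these identification choices (illustrative Python; the function name is ours, and Σ+ is taken diagonal here as a simplifying assumption):

import numpy as np

def common_disturbance_cov(h_t, gamma_rest, sigma_plus_diag):
    # Sigma(h_t) = h_t * Gamma Gamma' + Sigma_plus (eq. 3.7), with the first element
    # of Gamma fixed at 1 for identification and Sigma_plus taken diagonal here.
    gamma = np.concatenate(([1.0], gamma_rest))
    return h_t * np.outer(gamma, gamma) + np.diag(sigma_plus_diag)

# usage for a toy N = 4 system
print(common_disturbance_cov(0.5, np.array([0.8, 0.6, 0.3]),
                             np.array([0.02, 0.02, 0.03, 0.05])))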

3.1.6 Common Volatility

Another convenient approach is to use a diagonal parametrization for the covariance matrix. We therefore propose to use the following specification:

\Sigma(h_t) = h_t\Phi, \qquad (3.8)

where Φ is an N × N symmetric positive definite matrix of loadings passing the effect of the common volatility on to the different maturity yields. Φ is chosen as a diagonal matrix to save on parameters to be estimated; this means that the diagonal contains N values. ht is again modeled using the GAS updating equation. Again we need a normalization, as multiplying ht by an arbitrary constant and dividing Φ by the same constant would yield the same variance matrix Σ(ht), resulting in identification issues. Again we choose to fix the first diagonal element of Φ at 1.


Chapter 4

Data

In this section the zero-coupon interest rate data are presented and the characteristic properties of the yields are discussed. We discuss how the yield curves behave cross-sectionally and temporally: across different maturities and over time. Both are important as we want to capture the dynamics of the yield curve.

In this thesis we use unsmoothed end-of-month US zero-coupon yields. The data can be downloaded from the Center for Research in Security Prices (CRSP).1 These unsmoothed yields are constructed using the Fama-Bliss method (1987). The data-set consists of continuously compounded interest rates which are presented on an annualized basis. The method gets rid of the coupon effects discussed by Caks (1977). For this method filtered bond prices (average bid/ask) are used, eliminating bonds with special features (such as option features). Using these bond prices, forward rates are generated using the Fama-Bliss (1987) bootstrap method. The data-set consists of observations in the period from January 1970 to December 2009 and has T = 480 observations. It consists of yields of N = 17 maturities: τi = 3, 6, 9, 12, 15, 18, 21, 24, 30, 36, 48, 60, 72, 84, 96, 108, 120 months. Together this forms a panel of 8160 data points. Figure 4.1 presents a 3D plot of the observations.

1. http://www.crsp.com


Figure 4.1: 3D plot of the panel of zero-coupon yields. The figure shows the yields in the period from January 1970 till December 2009. Yield data of 17 maturities, between 3 months and 10 years, are used.

From the plot it can be seen that yields differ substantially as a result of major economic events and economic policy. There are periods of extremely high interest rates: in the early 80's interest rates were high due to economic policy. But we also see the recent extremely low interest rates after the financial crisis of 2008. Interest rates can be seen rising and declining, for example before and after the burst of the 2000 dot-com bubble and the 2006 housing bubble. Further, we observe that the long-term trend in interest rates is downward.

Also observable is that the yield curve differs in shape. A lot of different shapes can be seen: increasing or decreasing both at increasing rate or at decreasing rate, flat, S-shaped, U-shaped (inverted humped), inverted U-shaped (humped).

From the descriptive statistics in table 4.1 we see that the average yield curve is upward sloping. This would mean that term premia exist; a logical explanation for these premia can be risk aversion or liquidity preference. Another stylised fact that is shown is that the short end of the yield curve is more volatile than the long end: volatilities of yields tend to decrease with maturity. This can be seen as a confirmation that long-term rates are averages of expected future short rates. We also see that all maturities have high autocorrelations. But the short end of the yield curve is less persistent; it has lower autocorrelations for longer lags than the long end. Autocorrelations of longer maturities are still strong for longer lags (2 years).

It can also be seen that the yields are skewed to the right, which means that more mass is in the right tail than in the left tail. The median yield curve with quantiles in figure 4.3 confirms that the yields are right skewed. Another fact is that yields are leptokurtic, which might suggest thick tails and hence a student-t distribution.

Furthermore we specify proxies for the level, slope and curvature as proposed by Diebold and Li (2006). The proxy for the level is simply given by the longest maturity yield (10 year). The slope is estimated as the yield of the longest maturity minus the yield of the shortest maturity (10 year yield - 3 month yield). Finally, the curvature is defined as 2 times the 2 year yield minus the sum of the 3 month and the 10 year yield. We can deduce from these that the yield curve is concave because the slope and curvature are on average positive. Moreover we see that the level is highly persistent, as opposed to the autocorrelation of the slope which goes to zero. The stylised fact that long rates are more persistent than the short rates is indicated by the higher persistence of the level than the persistence of the slope and curvature (β2 and β3).

The sample autocorrelations indicate that yields might be integrated of order one. If that were the case the underlying process would be non-stationary and we would need to take first differences. Fortunately, economic theory dictates that yields cannot be integrated and must have a non-negative, finite expected value, so we may proceed with modelling in levels. Finally, a Principal Component Analysis (PCA) confirms that the first three principal components indeed capture most of the yield variation: together they capture almost 99% of it. As can be seen from figure 4.2, the loadings of the principal components show similarities with the Nelson-Siegel loadings. The loadings of the first principal component almost exactly match the inverted shape of the Nelson-Siegel slope loading. The loadings of the second principal component correspond to the shape of the curvature loading. Only the loadings of the third principal component are not exactly flat like their Nelson-Siegel counterpart; instead they have, to some extent, a sinusoid shape.
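The empirical proxies and the principal component analysis mentioned above are straightforward to compute. An illustrative Python sketch, assuming a T × N array yields with the maturities of this data-set (names are ours, not the thesis' MATLAB code):

import numpy as np

def level_slope_curvature(yields, maturities):
    # Proxies of Diebold and Li (2006): level = 10-year yield, slope = 10-year minus
    # 3-month, curvature = 2 * 2-year - (3-month + 10-year); maturities are in months.
    m = list(maturities)
    y3m, y2y, y10y = (yields[:, m.index(k)] for k in (3, 24, 120))
    return y10y, y10y - y3m, 2 * y2y - (y3m + y10y)

def principal_components(yields, k=3):
    # First k principal components of the yield panel from the eigendecomposition of
    # the sample covariance matrix; returns loadings and explained variance shares.
    eigval, eigvec = np.linalg.eigh(np.cov(yields, rowvar=False))
    order = np.argsort(eigval)[::-1]
    return eigvec[:, order[:k]], (eigval[order] / eigval.sum())[:k]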


Figure 4.2: Loadings of the first three principal components (PC1, PC2, PC3) against time to maturity (in months), where the inverse of the loading of the first principal component is depicted.

Figure 4.3: Median yield curve (yield in %, against time to maturity in months) with the 5, 25, 75 and 95 percentiles. The graph indicates that yields are skewed to the right.


Table 4.1: Descriptive statistics of the yields

Maturity      Mean   Std. Dev.  Median  Minimum  Maximum  Skewness  Kurtosis  ρ1     ρ12    ρ24
3             5.766  3.071      5.327   0.041    16.019   0.711     3.996     0.979  0.749  0.489
6             5.969  3.098      5.515   0.150    16.481   0.665     3.821     0.980  0.763  0.517
9             6.083  3.089      5.692   0.193    16.394   0.632     3.712     0.981  0.771  0.538
12            6.166  3.053      5.831   0.245    16.101   0.573     3.588     0.981  0.777  0.552
15            6.253  3.029      5.992   0.377    16.055   0.519     3.487     0.982  0.785  0.571
18            6.324  3.009      6.070   0.438    16.219   0.519     3.463     0.983  0.792  0.585
21            6.387  2.990      6.131   0.532    16.173   0.534     3.462     0.983  0.797  0.598
24            6.418  2.943      6.183   0.532    15.814   0.518     3.400     0.983  0.799  0.609
30            6.512  2.878      6.274   0.819    15.429   0.496     3.322     0.983  0.808  0.627
36            6.600  2.832      6.347   0.978    15.538   0.531     3.350     0.984  0.814  0.642
48            6.756  2.755      6.571   1.019    15.599   0.567     3.335     0.984  0.822  0.664
60            6.852  2.671      6.650   1.556    15.129   0.611     3.277     0.985  0.832  0.685
72            6.964  2.638      6.732   1.525    15.108   0.635     3.259     0.987  0.842  0.702
84            7.026  2.573      6.843   2.179    15.024   0.709     3.302     0.987  0.841  0.709
96            7.069  2.536      6.805   2.105    15.052   0.748     3.293     0.988  0.850  0.721
108           7.095  2.519      6.775   2.152    15.114   0.800     3.327     0.988  0.853  0.724
120 (Level)   7.067  2.465      6.683   2.679    15.194   0.863     3.409     0.988  0.843  0.717
Slope         1.301  1.362      1.338   -3.191   3.954    -0.454    3.036     0.934  0.418  0.024
Curvature     0.003  0.863      0.112   -2.174   2.905    -0.126    3.318     0.877  0.441  0.242

Note: maturities are in months; ρk denotes the sample autocorrelation at lag k months.


Chapter 5

Estimation

In this section the estimation methods are introduced. First we give the methods for obtaining initial parameter estimates, used to initialize the optimization procedures of the specific models. We need sensible initial parameter estimates to avoid estimation difficulties, because the models are highly parametrized. We first estimate a good initial value for lambda. We then give the procedure to estimate the model using the two-step approach introduced by Diebold and Li (2006). Subsequently we give the method to estimate the model using the Kalman filter (1960), as proposed by Diebold et al. (2006). Finally, we give the estimation framework of the GAS.

5.1 Initial Estimates

5.1.1 Lambda

For each cross-section, at each moment in time, we can estimate a (D)NS model. The estimates of these cross-section models will play an important role later on in the estimation of the models. We are especially interested in the estimation of λ, as the optimization over λ is nonlinear. This nonlinearity may result in difficulties.

For any cross-section we minimize the sum of squared errors. We want to solve the following optimization:

\min_{\lambda,\beta} \; (y_t - X(\lambda)\beta)'(y_t - X(\lambda)\beta), \qquad (5.1)

where yt is the 17 × 1 vector of yields, X(λ) a 17 × 3 matrix of factor loadings depending on λ, and β a 3 × 1 vector of factors.


Given some λ this reduces to:

\min_{\beta} \; (y_t - X_\lambda\beta)'(y_t - X_\lambda\beta). \qquad (5.2)

This is a simple OLS, hence we get:

\hat{\beta}_\lambda = (X_\lambda'X_\lambda)^{-1}X_\lambda'y_t. \qquad (5.3)

We substitute this in equation (5.1) and get:

\min_{\lambda} \; \left(y_t - X_\lambda(X_\lambda'X_\lambda)^{-1}X_\lambda'y_t\right)'\left(y_t - X_\lambda(X_\lambda'X_\lambda)^{-1}X_\lambda'y_t\right) \qquad (5.4)

= \min_{\lambda} \; y_t'\left(I - X_\lambda(X_\lambda'X_\lambda)^{-1}X_\lambda'\right)'\left(I - X_\lambda(X_\lambda'X_\lambda)^{-1}X_\lambda'\right)y_t.

Because I − Xλ(Xλ'Xλ)⁻¹Xλ' is an orthogonal projector (symmetric and idempotent), this reduces to:

\min_{\lambda} \; y_t'y_t - y_t'X_\lambda(X_\lambda'X_\lambda)^{-1}X_\lambda'y_t. \qquad (5.5)

The minimum sum of squares problem over all cross-sections is then given by:

\min_{\lambda} \; \sum_{t=1}^{T}\left(y_t'y_t - y_t'X_\lambda(X_\lambda'X_\lambda)^{-1}X_\lambda'y_t\right). \qquad (5.6)

Optimization of this function is done using the MATLAB optimization routine fminunc, which gives λ = 0.062. Furthermore, we estimate equation (5.1) for each cross-section using Nonlinear Least Squares (NLS) to compare with our model with time-varying λt.
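An illustrative Python version of this profiled optimization (the thesis uses MATLAB's fminunc; here scipy's bounded scalar minimizer stands in, Y is a T × N matrix of yields, and all function names are ours):

import numpy as np
from scipy.optimize import minimize_scalar

def ns_loading_matrix(tau, lam):
    # Nelson-Siegel loading matrix X(lambda) as in eq. (2.11).
    x = lam * tau
    slope = (1 - np.exp(-x)) / x
    return np.column_stack([np.ones_like(tau), slope, slope - np.exp(-x)])

def profiled_ssr(lam, Y, tau):
    # Objective of eq. (5.6): for a given lambda the betas are concentrated out by OLS,
    # leaving the residual y'y - y'X(X'X)^{-1}X'y summed over all cross-sections.
    X = ns_loading_matrix(tau, lam)
    hat = X @ np.linalg.solve(X.T @ X, X.T)        # X (X'X)^{-1} X'
    resid = Y - Y @ hat.T
    return np.sum(resid ** 2)

def estimate_lambda(Y, tau):
    # Minimize the profiled sum of squares over lambda on a bounded interval.
    return minimize_scalar(profiled_ssr, bounds=(1e-3, 0.3), args=(Y, tau),
                           method="bounded").x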

5.1.2 Two-Step Estimation

Next we can turn to an estimate of the model using the two-step approach as proposed by Diebold and Li (2006):

1. For some fixed λ we fit a static Nelson-Siegel to each cross-section (t = 1, ..., T) using Ordinary Least Squares, equation (5.3). This results in three time series of estimated latent factors (β̂1t, β̂2t, β̂3t) and estimated residual errors, the measurement disturbances (ε̂t).

2. A dynamic model is fitted to the estimated factors using equation (2.12). In the paper of Diebold and Li (2006), AR(1) models are fitted to each of the factors. It is also possible to fit a VAR(1) to the factors.
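An illustrative sketch of this two-step procedure (Python; the thesis implementation is in MATLAB and the names here are ours), with Y a T × N panel of yields and tau the maturities in months:

import numpy as np

def two_step_dns(Y, tau, lam=0.062):
    # Step 1: fit the static NS by OLS to every cross-section (eq. 5.3).
    x = lam * tau
    slope = (1 - np.exp(-x)) / x
    X = np.column_stack([np.ones_like(tau), slope, slope - np.exp(-x)])
    betas = np.linalg.solve(X.T @ X, X.T @ Y.T).T          # T x 3 estimated factors
    # Step 2: fit a univariate AR(1), beta_t = c + phi * beta_{t-1} + eta_t, per factor.
    ar_params = []
    for j in range(betas.shape[1]):
        regressors = np.column_stack([np.ones(len(betas) - 1), betas[:-1, j]])
        c, phi = np.linalg.lstsq(regressors, betas[1:, j], rcond=None)[0]
        ar_params.append((c, phi))
    return betas, ar_params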


The advantage of this procedure is that it is simple and numerically stable, as it only uses linear regressions. In this approach it is also possible to additionally estimate λ in the first step. This results in four factors in the first step and a four-dimensional dynamic model in the second step. In this procedure the parameter estimation error from the first step is ignored in the second step. This may affect the second step and create a bias or distort results. Consequently, it is difficult to conduct statistical inference.

5.1.3 One-Step State-Space Estimation

Using the initial values estimated in the two-step approach, the state-space model can be estimated in one step. Estimation of the measurement (observation) equation (2.13) and the state (transition) equation (2.14) is done with the Kalman filter (1960). The Kalman filter accounts for all the uncertainty in the framework.

The Kalman filter is an iterative estimation algorithm consisting of two steps: a prediction and an update step. The filter gives a minimum mean squared error prediction of the latent factors. The Kalman algorithm is given in appendix B.

The likelihood of the Kalman filter is maximized using an optimization routine of MATLAB. For the optimization 'fmincon' is used, which is a constrained optimization routine. The routine is initialized with the estimates of the two-step model, with reasonable constraints given. The likelihood is optimized using the interior-point algorithm with numerical derivatives and Broyden-Fletcher-Goldfarb-Shanno (BFGS) Hessians. BFGS is the Quasi-Newton method used to approximate the Hessians of the likelihood needed for the optimization.

5.2 GAS Estimation

Next we introduce the estimation procedure of the model in the GAS framework. First the Maximum Likelihood estimation is introduced. We then give the procedure for the initial factors. Subsequently we introduce a smoothing scheme for the Fisher information matrix. We then show how we may conduct statistical inference. Finally, we derive the model-specific scores and Fisher information matrices.

5.2.1 Maximum Likelihood Estimation

The GAS models are, like the Kalman filter, estimated by Maximum Likelihood (MLE). For the maximum likelihood we need a fully specified probability density function. For our fully specified parametric model we have:

\arg\max_{\theta} \; L(\theta) = \log p(y_1, \ldots, y_T \mid \theta) = \log \prod_{t=1}^{T} p(y_t \mid \theta) = \sum_{t=1}^{T} \ell_t(\theta), \quad \text{where } \ell_t = \log p(y_t \mid \theta). \qquad (5.7)

The likelihood of the GAS can be evaluated in an iterative manner, as can be seen from equation (5.7) above: the local log-likelihood ℓt is determined for each time period (t = 1, . . . , T) and summed to a total.

To determine the factors for each time period we need to derive st from the updating equation (2.20). For this we need at least the score (∇t) and preferably the Fisher information It|t−1: so we need the derivative w.r.t. the dynamic factors (ft), as given in equations (2.21) and (2.22). For each model specification this gives different results; the scores and Fisher information matrices are derived in section 5.3 below.

The optimization routine optimizes the model over the static parameter space θ: for each θ the likelihood is evaluated and θ is adjusted in a direction that increases the likelihood. Maximization is done using the MATLAB routine fmincon with the interior-point algorithm and, again, BFGS Hessians.
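To make the recursion concrete, a sketch of one likelihood evaluation for the Gaussian model of section 5.3.1 is given below; the objects omega, A, B, Sigma and the loading matrix X are assumed to have been constructed from the current θ, and f1 is the initial factor vector discussed in section 5.2.2.

    % One evaluation of the GAS(1,1) log-likelihood (Gaussian measurement density)
    [T, N] = size(yields);
    f = f1;                                           % initial factor vector
    loglik = 0;
    for t = 1:T
        e = yields(t,:)' - X*f;                       % prediction error
        loglik = loglik - 0.5*(N*log(2*pi) + log(det(Sigma)) + e'*(Sigma\e));
        score  = X'*(Sigma\e);                        % gradient w.r.t. the factors
        Info   = X'*(Sigma\X);                        % Fisher information (inverse of the scaling matrix)
        s      = Info \ score;                        % scaled score
        f      = omega + A*s + B*f;                   % GAS(1,1) updating equation
    end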

5.2.2 Initial Factors

To start each likelihood evaluation we need to specify initial values for the dynamic factors (f1). A couple of options are considered:

• A natural consideration is the unconditional expectation of the factors (a one-line sketch of this computation is given after this list). For the GAS(1,1) we have:
\[
\begin{aligned}
f_{t+1} &= \omega + A s_t + B f_t \\
\mathrm{E}[f_{t+1}] &= \omega + A\,\mathrm{E}[s_t] + B\,\mathrm{E}[f_t] \\
(I - B)\,\mathrm{E}[f_t] &= \omega + A \cdot 0 \\
\mathrm{E}[f_t] &= (I - B)^{-1}\omega.
\end{aligned} \tag{5.8}
\]

• Another possibility is to initialize the iterative procedure at the optimal DNS of the cross-section at time t=1. For some fixed λ this could be estimated using OLS with equation (5.3). For fixed λ the model is less sensitive to the initial value. For


a variable λt initial values are more important as completely wrong initial values

will influence the estimates.

• The final, and most correct, possibility is to first apply the forward GAS filter (compute f_t for t = 2, . . . , T + 1) with arbitrary initial values (f_1), then filter backward (compute f_t for t = T, . . . , 0), and then filter forward again.

So the GAS updating recursion can be started at different values, but in theory it should approach the optimal values after some 'learning' period, even with wrong initial values.
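A one-line MATLAB sketch of the unconditional-mean initialization from the first option above, with omega and B taken from the current θ (and assuming I − B is invertible):

    f1 = (eye(size(B,1)) - B) \ omega;    % unconditional mean (I - B)^(-1) * omega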

5.2.3 Smoothing

As proposed by Creal et al. (2008, 2012) we try to scale with the inverse Fisher information (I_{t|t−1}^{-1}). A difficulty of scaling with (an approximation of) the inverse information matrix is that this information matrix must be inverted. This can be a problem if the information matrix is ill-behaved, i.e. if it is not of full rank or is numerically unstable for some model.

A way to reduce the chance of problems with non-invertible matrices is the use of a smoothing scheme. Instead of the information matrix itself, a smoothed information matrix (I_{t−1}^{s})^{-1} is used as scaling. Creal et al. (2008, 2012) propose to use an Exponentially Weighted Moving Average (EWMA):

\[
I_{t-1}^{s} = \alpha I_{t-2}^{s} + (1 - \alpha) I_{t-1}, \tag{5.9}
\]

for some 0 ≤ α ≤ 1. For α → 1 the scheme averages over all past observations; for α → 0 it reduces to scaling with the unsmoothed information matrix. The parameter α is initially fixed at a safe value of α = 0.2. Eventually it can be added to the unknown parameter vector θ and optimized using MATLAB's optimization routine.
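A minimal sketch of this smoothing step inside the Gaussian GAS recursion of the earlier sketch; I_smooth would be initialized at, for example, the first information matrix, and alpha is the smoothing parameter.

    % EWMA smoothing of the Fisher information used for scaling (alpha = 0.2 initially)
    I_t      = X'*(Sigma\X);                       % current information matrix (Gaussian case)
    I_smooth = alpha*I_smooth + (1 - alpha)*I_t;   % smoothed information matrix
    s        = I_smooth \ score;                   % scaled score based on the smoothed matrix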

5.2.4 Inference

To conduct statistical inference we apply the standard limiting result. The estimated vector θ contains all the static parameters of the models. We use the Hessian at the optimum to compute standard errors and t-values. Under standard regularity conditions the MLE is consistent and we have:

\[
\sqrt{T}\,(\hat\theta - \theta) \xrightarrow{d} N\!\left(0, H^{-1}\right), \qquad \text{with } H = -\mathrm{E}\!\left[\partial^2 \ell / \partial\theta\,\partial\theta'\right]. \tag{5.10}
\]

The Hessian is calculated numerically, and optimization is terminated when the tolerance between iterations is smaller than 10^{-6}.
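A small sketch of this computation; hessian_fd stands for a hypothetical finite-difference Hessian routine applied to the negative log-likelihood at the optimum.

    H    = hessian_fd(obj, theta_hat);   % numerical Hessian of the negative log-likelihood
    V    = inv(H);                       % asymptotic covariance matrix of theta_hat
    se   = sqrt(diag(V));                % standard errors
    tval = theta_hat(:) ./ se;           % t-values of the static parameters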


5.3 Scores and Scaling

We now proceed to derive the score vectors and scaling matrices of the proposed models. In order to do this, analytical derivatives and expectations of the log-likelihood functions are determined.

5.3.1 Gaussian

First we assume Gaussian errors as proposed in section 3.1.1. The errors ε_t are given by an N × 1 vector; in our example N = 17. Hence we have:
\[
\varepsilon_t = y_t - X_t f_t \sim N(0, \Sigma_\varepsilon). \tag{5.11}
\]
For estimation convenience we assume a diagonal Σ_ε.

The probability density for each observation is given by:

\[
p(y_t \mid \theta) = (2\pi)^{-\frac{N}{2}} |\Sigma_\varepsilon|^{-\frac{1}{2}} \exp\!\left( -\tfrac{1}{2}(y_t - X_t\beta_t)'\Sigma_\varepsilon^{-1}(y_t - X_t\beta_t) \right). \tag{5.12}
\]
\[
\ell_t(\theta) = -\tfrac{N}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma_\varepsilon| - \tfrac{1}{2}(y_t - X_t\beta_t)'\Sigma_\varepsilon^{-1}(y_t - X_t\beta_t). \tag{5.13}
\]

Taking derivatives w.r.t. the factor f_t = β_t leads to the gradient and to the scaling matrix, the inverse information matrix:
\[
\nabla_t(\theta) = \frac{\partial \ell_t}{\partial \beta_t} = -\tfrac{1}{2}\cdot(-2)\cdot X_t'\Sigma_\varepsilon^{-1}(y_t - X_t f_t) = X_t'\Sigma_\varepsilon^{-1}(y_t - X_t f_t). \tag{5.14}
\]
\[
S_t = \mathrm{E}_{t-1}\!\left[ X_t'\Sigma_\varepsilon^{-1}\varepsilon_t\varepsilon_t'\Sigma_\varepsilon^{-1}X_t \right]^{-1} = \left( X_t'\Sigma_\varepsilon^{-1}X_t \right)^{-1}. \tag{5.15}
\]

Combined, this leads to the scaled score given by:
\[
s_t = \left( X_t'\Sigma_\varepsilon^{-1}X_t \right)^{-1} X_t'\Sigma_\varepsilon^{-1}(y_t - X_t f_t). \tag{5.16}
\]
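Note that the scaled score (5.16) has a convenient interpretation. Writing the cross-sectional GLS estimate of the factors as
\[
\hat\beta_t^{GLS} = \left( X_t'\Sigma_\varepsilon^{-1}X_t \right)^{-1}X_t'\Sigma_\varepsilon^{-1}y_t,
\]
and expanding (5.16) gives \(s_t = \hat\beta_t^{GLS} - f_t\), so the GAS(1,1) update moves the filtered factors in the direction of the GLS fit of the current cross-section of yields.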


5.3.2 Student-t

We now derive the score and information matrix of the student-t distribution.

\[
p(y_t \mid \theta) = \frac{\Gamma\!\left(\frac{v+N}{2}\right)}{\Gamma\!\left(\frac{v}{2}\right)\left[(v-2)\pi\right]^{N/2}|\Sigma_\varepsilon|^{1/2}} \left( 1 + \frac{\varepsilon_t'\Sigma_\varepsilon^{-1}\varepsilon_t}{v-2} \right)^{-(v+N)/2}, \qquad \text{with } \varepsilon_t = y_t - X_t\beta_t. \tag{5.17}
\]

This leads to the log-likelihood given by

\[
\ell_t = \log\Gamma\!\left(\tfrac{v+N}{2}\right) - \log\Gamma\!\left(\tfrac{v}{2}\right) - \tfrac{N}{2}\log\left[(v-2)\pi\right] \tag{5.18}
\]
\[
\phantom{\ell_t =}\; - \tfrac{1}{2}\log|\Sigma_\varepsilon| - \tfrac{v+N}{2}\log\!\left( 1 + \frac{\varepsilon_t'\Sigma_\varepsilon^{-1}\varepsilon_t}{v-2} \right). \tag{5.19}
\]

Taking derivatives w.r.t. the factor β_t yields the score:
\[
\nabla_t = \frac{\partial \ell_t}{\partial \beta_t} = -\frac{v+N}{2}\left( 1 + \frac{\varepsilon_t'\Sigma_\varepsilon^{-1}\varepsilon_t}{v-2} \right)^{-1}\cdot(-2)\cdot\frac{X_t'\Sigma_\varepsilon^{-1}\varepsilon_t}{v-2} \tag{5.20}
\]
\[
= (v+N)\left( 1 + \frac{\varepsilon_t'\Sigma_\varepsilon^{-1}\varepsilon_t}{v-2} \right)^{-1}\frac{X_t'\Sigma_\varepsilon^{-1}\varepsilon_t}{v-2}. \tag{5.21}
\]

Taking derivatives again gives:
\[
\frac{\partial^2 \ell_t}{\partial\beta_t\partial\beta_t'} = (v+N)\left[ X_t'\Sigma_\varepsilon^{-1}\varepsilon_t \, \frac{\partial\left( (v-2) + \varepsilon_t'\Sigma_\varepsilon^{-1}\varepsilon_t \right)^{-1}}{\partial\beta_t'} + \left( (v-2) + \varepsilon_t'\Sigma_\varepsilon^{-1}\varepsilon_t \right)^{-1}\frac{\partial X_t'\Sigma_\varepsilon^{-1}\varepsilon_t}{\partial\beta_t'} \right] \tag{5.22}
\]
\[
= (v+N)\left[ 2\left( (v-2) + \varepsilon_t'\Sigma_\varepsilon^{-1}\varepsilon_t \right)^{-2} X_t'\Sigma_\varepsilon^{-1}\varepsilon_t\,\varepsilon_t'\Sigma_\varepsilon^{-1}X_t - \left( (v-2) + \varepsilon_t'\Sigma_\varepsilon^{-1}\varepsilon_t \right)^{-1} X_t'\Sigma_\varepsilon^{-1}X_t \right].
\]

The problem with this expression is that its expectation is difficult to determine. We therefore take the scaling derived for the Gaussian specification as an approximation of the scaling. In addition, the Hessian derived above can be used as an approximation, combined with the smoothing scheme introduced in section 5.2.3. This leads to an approximation of the true information matrix.
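To illustrate the robustness this induces, a minimal MATLAB sketch of the score (5.21) is given below; X, Sigma, beta and v are assumed to be available from the current parameter values.

    % Student-t score w.r.t. beta_t: outlying cross-sections are down-weighted
    e = yields(t,:)' - X*beta;
    N = numel(e);                             % number of maturities
    w = (v + N) / ((v - 2) + e'*(Sigma\e));   % weight shrinks when the residuals are large
    score_t = w * (X'*(Sigma\e));

The weight w shows how the fat-tailed density down-weights time periods with large cross-sectional residuals relative to the Gaussian score (5.14).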

5.3.3 Variable Lambda

We now proceed to extend the factor vector f_t with λ_t, as proposed by Creal et al. (2008):
\[
f_t^{+} = (\beta_t', \lambda_t)' = (\beta_{1t}, \beta_{2t}, \beta_{3t}, \lambda_t)'. \tag{5.23}
\]


The derivative w.r.t. β_t stays unchanged; only the derivative w.r.t. λ_t needs to be derived, which is given by:
\[
\frac{\partial \ell_t}{\partial \lambda_t} = \left( \frac{\partial X_t}{\partial \lambda_t}\beta_t \right)'\Sigma_\varepsilon^{-1}(y_t - X_t\beta_t). \tag{5.24}
\]

Together with the derivative given in (5.14) this forms:

\[
\begin{aligned}
\nabla_t^{+}(\theta) = \frac{\partial \ell}{\partial f_t^{+}} &= \left( \frac{\partial \ell}{\partial \beta_t'}, \frac{\partial \ell}{\partial \lambda_t} \right)'
= \begin{pmatrix} X_t'\Sigma_\varepsilon^{-1}(y_t - X_t\beta_t) \\[4pt] \left( \frac{\partial X_t}{\partial \lambda_t}\beta_t \right)'\Sigma_\varepsilon^{-1}(y_t - X_t\beta_t) \end{pmatrix} \\
&= \left[ X_t, \; \frac{\partial X_t}{\partial \lambda_t}\beta_t \right]'\Sigma_\varepsilon^{-1}(y_t - X_t\beta_t)
= \dot{X}_t'\Sigma_\varepsilon^{-1}(y_t - X_t\beta_t),
\end{aligned} \tag{5.25}
\]
with \(\dot{X}_t = \left[ X_t, \; \frac{\partial X_t}{\partial \lambda_t}\beta_t \right]\), and
\[
\frac{\partial x_i(\tau_i)}{\partial \lambda_t} = \left( 0, \;\; \frac{e^{-\lambda_t\tau_i}}{\lambda_t} - \frac{1 - e^{-\lambda_t\tau_i}}{\lambda_t^2\tau_i}, \;\; -\frac{1 - e^{-\lambda_t\tau_i}}{\lambda_t^2\tau_i} + \tau_i e^{-\lambda_t\tau_i} + \frac{e^{-\lambda_t\tau_i}}{\lambda_t} \right).
\]

Next, the information matrix is derived using the gradient:
\[
I_t = \mathrm{E}\!\left[ \nabla_t^{+}\nabla_t^{+\prime} \right] = \mathrm{E}\!\left[ \dot{X}_t'\Sigma_\varepsilon^{-1}\varepsilon_t\varepsilon_t'\Sigma_\varepsilon^{-1}\dot{X}_t \right] \tag{5.26}
\]
\[
= \dot{X}_t'\Sigma_\varepsilon^{-1}\,\mathrm{E}\!\left[\varepsilon_t\varepsilon_t'\right]\Sigma_\varepsilon^{-1}\dot{X}_t = \dot{X}_t'\Sigma_\varepsilon^{-1}\Sigma_\varepsilon\Sigma_\varepsilon^{-1}\dot{X}_t = \dot{X}_t'\Sigma_\varepsilon^{-1}\dot{X}_t. \tag{5.27}
\]

So the scaling matrix is given by:
\[
S_t = I_t^{-1} = \left( \dot{X}_t'\Sigma_\varepsilon^{-1}\dot{X}_t \right)^{-1}. \tag{5.28}
\]

Combined with the score this leads to the scaled score:

\[
s_t = \left( \dot{X}_t'\Sigma_\varepsilon^{-1}\dot{X}_t \right)^{-1}\dot{X}_t'\Sigma_\varepsilon^{-1}(y_t - X_t\beta_t). \tag{5.29}
\]
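For concreteness, a MATLAB sketch of the unrestricted scaled score (5.29) is shown below (the restricted, parametrized version follows next); the current factors beta and lam, the maturity vector tau and the covariance matrix Sigma are assumed workspace variables.

    % Loadings, their derivative w.r.t. lambda, and the scaled score (5.29)
    N    = numel(tau);
    u    = exp(-lam*tau(:));
    X    = [ones(N,1), (1-u)./(lam*tau(:)), (1-u)./(lam*tau(:)) - u];
    dX   = [zeros(N,1), ...
            u/lam - (1-u)./(lam^2*tau(:)), ...
            u/lam - (1-u)./(lam^2*tau(:)) + tau(:).*u];   % derivative of the loadings w.r.t. lambda
    Xdot = [X, dX*beta];                                  % N x 4 matrix (X-dot)
    e    = yields(t,:)' - X*beta;
    s    = (Xdot'*(Sigma\Xdot)) \ (Xdot'*(Sigma\e));      % scaled score for (beta_t, lambda_t)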

We now specify f_t^{+} = (β_t', λ_t)', a 4 × 1 vector, as a parametrization of f_t, as proposed by Creal et al. (2008):
\[
f_t^{+} = \phi_0 + \Phi f_t. \tag{5.30}
\]


Here f_t = β_t is the 3 × 1 vector of factors. This imposes a three-factor structure on the dynamics of (β_t', λ_t)'. Using these restrictions, the performance of the GAS model can be assessed with the non-linearity included but with restrictions on the dynamics of the parameters; imposing no restrictions at all would result in a highly non-linear system, which leads to estimation difficulties. For identification purposes the upper 3 × 3 block of Φ is set equal to the identity matrix and the upper three elements of φ_0 are set equal to zero. λ_t is now a linear function of β_t:
\[
f_t^{+} = \phi_0 + \Phi f_t: \qquad
\begin{pmatrix} \beta_{1t} \\ \beta_{2t} \\ \beta_{3t} \\ \lambda_t \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ 0 \\ c_0 \end{pmatrix}
+
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ c_1 & c_2 & c_3 \end{pmatrix}
\begin{pmatrix} \beta_{1t} \\ \beta_{2t} \\ \beta_{3t} \end{pmatrix}
=
\begin{pmatrix} \beta_{1t} \\ \beta_{2t} \\ \beta_{3t} \\ c_0 + c_1\beta_{1t} + c_2\beta_{2t} + c_3\beta_{3t} \end{pmatrix}. \tag{5.31}
\]

Creal et al. (2008) do not mention that this is a parametrization of their model and that the scaled score (5.29) that they depict is therefore not complete. Since it is a parametrization, a new scaled score must be derived, as this scaled score is the driving mechanism of the updating equation:
\[
\nabla_t = \frac{\partial \ell_t}{\partial f_t} = \left( \frac{\partial f_t^{+}}{\partial f_t'} \right)'\frac{\partial \ell_t}{\partial f_t^{+}} = \Phi'\nabla_t^{+}. \tag{5.32}
\]

The inverse Fisher information is then given by:

\[
I_{t|t-1}^{-1} = \left( \mathrm{E}_{t-1}\!\left[ \Phi'\nabla_t^{+}\nabla_t^{+\prime}\Phi \right] \right)^{-1}
= \left( \Phi'\,\mathrm{E}_{t-1}\!\left[ \nabla_t^{+}\nabla_t^{+\prime} \right]\Phi \right)^{-1}
= \left( \Phi'\dot{X}_t'\Sigma_\varepsilon^{-1}\dot{X}_t\Phi \right)^{-1}. \tag{5.33}
\]


Hence this results in the scaled score:

\[
s_t = I_{t|t-1}^{-1}\cdot\nabla_t = \left( \Phi'\dot{X}_t'\Sigma_\varepsilon^{-1}\dot{X}_t\Phi \right)^{-1}\Phi'\nabla_t^{+}. \tag{5.34}
\]

5.3.4 Common Disturbance with Time-Varying Volatility

We now derive the score and information matrix for the model with the common disturbance specification. For our factor vector we have f_t^{*} = (β_{1t}, β_{2t}, β_{3t}, h_t)'. For each time period

the log-likelihood is given by:

\[
\ell_t(\theta) = -\tfrac{N}{2}\ln(2\pi) - \tfrac{1}{2}\ln|\Sigma_\varepsilon(h_t)| - \tfrac{1}{2}\varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\varepsilon_t, \tag{5.35}
\]

with ε_t = y_t − X_tβ_t. Taking the derivative with respect to h_t gives:
\[
\frac{\partial \ell_t}{\partial h_t} = -\tfrac{1}{2}\frac{\partial \ln|\Sigma_\varepsilon(h_t)|}{\partial h_t} - \tfrac{1}{2}\frac{\partial\left( \varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\varepsilon_t \right)}{\partial h_t}. \tag{5.36}
\]

Derivations of the parts are given below:

\[
\frac{\partial \Sigma_\varepsilon(h_t)}{\partial h_t} = \Gamma\Gamma'. \tag{5.37}
\]
\[
\frac{\partial \ln|\Sigma_\varepsilon(h_t)|}{\partial h_t} = \mathrm{Tr}\!\left( \Sigma_\varepsilon(h_t)^{-1}\frac{\partial \Sigma_\varepsilon(h_t)}{\partial h_t} \right) = \mathrm{Tr}\!\left( \Sigma_\varepsilon(h_t)^{-1}\Gamma\Gamma' \right). \tag{5.38}
\]
\[
\frac{\partial\left( \varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\varepsilon_t \right)}{\partial h_t}
= -\varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\frac{\partial \Sigma_\varepsilon(h_t)}{\partial h_t}\Sigma_\varepsilon(h_t)^{-1}\varepsilon_t
= -\varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\Gamma\Gamma'\Sigma_\varepsilon(h_t)^{-1}\varepsilon_t
= -\left( \varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\Gamma \right)^2. \tag{5.39}
\]

Hence, together these form the derivative w.r.t. h_t, the fourth element of the score vector:
\[
\frac{\partial \ell_t}{\partial h_t} = -\tfrac{1}{2}\mathrm{Tr}\!\left( \Sigma_\varepsilon(h_t)^{-1}\Gamma\Gamma' \right) + \tfrac{1}{2}\left( \varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\Gamma \right)^2. \tag{5.40}
\]

Combined with the derivative w.r.t. βt derived in (5.14) this forms:

\[
\frac{\partial \ell_t}{\partial f_t^{*}} = \left( \frac{\partial \ell_t}{\partial \beta_t'}, \frac{\partial \ell_t}{\partial h_t} \right)'. \tag{5.41}
\]


To derive the scaling matrix, the inverse information matrix, all terms are differentiated w.r.t. the factors once more. This results in:
\[
\frac{\partial^2 \ell_t}{\partial f_t^{*}\partial f_t^{*\prime}} =
\begin{pmatrix}
\dfrac{\partial^2 \ell_t}{\partial \beta_t\partial \beta_t'} & \dfrac{\partial^2 \ell_t}{\partial \beta_t\partial h_t} \\[8pt]
\dfrac{\partial^2 \ell_t}{\partial h_t\partial \beta_t'} & \dfrac{\partial^2 \ell_t}{\partial h_t^2}
\end{pmatrix}. \tag{5.42}
\]

Further derivations of the parts are given by:

\[
\frac{\partial\,\mathrm{Tr}\!\left(\Sigma_\varepsilon(h_t)^{-1}\Gamma\Gamma'\right)}{\partial h_t}
= \mathrm{Tr}\!\left( \frac{\partial \Sigma_\varepsilon(h_t)^{-1}}{\partial h_t}\Gamma\Gamma' \right)
= \mathrm{Tr}\!\left( -\Sigma_\varepsilon(h_t)^{-1}\frac{\partial \Sigma_\varepsilon(h_t)}{\partial h_t}\Sigma_\varepsilon(h_t)^{-1}\Gamma\Gamma' \right)
= -\mathrm{Tr}\!\left( \left(\Sigma_\varepsilon(h_t)^{-1}\Gamma\Gamma'\right)^2 \right). \tag{5.43}
\]
\[
\frac{\partial\left( \varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\Gamma \right)}{\partial h_t}
= \varepsilon_t'\frac{\partial \Sigma_\varepsilon(h_t)^{-1}}{\partial h_t}\Gamma
= -\varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\frac{\partial \Sigma_\varepsilon(h_t)}{\partial h_t}\Sigma_\varepsilon(h_t)^{-1}\Gamma
= -\varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\Gamma\Gamma'\Sigma_\varepsilon(h_t)^{-1}\Gamma. \tag{5.44}
\]
\[
\frac{\partial\left( \varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\Gamma \right)^2}{\partial h_t}
= 2\left( \varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\Gamma \right)\frac{\partial\left( \varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\Gamma \right)}{\partial h_t}
= -2\left( \varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\Gamma \right)^2\Gamma'\Sigma_\varepsilon(h_t)^{-1}\Gamma. \tag{5.45}
\]

Together these parts form:

\[
\frac{\partial^2 \ell_t}{\partial h_t^2} = \tfrac{1}{2}\mathrm{Tr}\!\left( \left(\Sigma_\varepsilon(h_t)^{-1}\Gamma\Gamma'\right)^2 \right) - \left( \varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\Gamma \right)^2\Gamma'\Sigma_\varepsilon(h_t)^{-1}\Gamma. \tag{5.46}
\]

Taking expectations gives:
\[
I_{t|t-1} = -\mathrm{E}_{t-1}\!\left[ \frac{\partial^2 \ell_t}{\partial h_t^2} \right] = -\tfrac{1}{2}\mathrm{Tr}\!\left( \left(\Sigma_\varepsilon(h_t)^{-1}\Gamma\Gamma'\right)^2 \right) + \left( \Gamma'\Sigma_\varepsilon(h_t)^{-1}\Gamma \right)^2. \tag{5.47}
\]


For the cross-derivative we get:
\[
\frac{\partial\left( \varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\Gamma \right)}{\partial \beta_t'} = -\Gamma'\Sigma_\varepsilon(h_t)^{-1}X_t. \tag{5.48}
\]
\[
\frac{\partial^2 \ell_t}{\partial h_t\partial \beta_t'} = \tfrac{1}{2}\frac{\partial\left( \varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\Gamma \right)^2}{\partial \beta_t'}
= \tfrac{1}{2}\cdot 2\cdot\left( \varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\Gamma \right)\frac{\partial\left( \varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\Gamma \right)}{\partial \beta_t'}
= -\varepsilon_t'\Sigma_\varepsilon(h_t)^{-1}\Gamma\,\Gamma'\Sigma_\varepsilon(h_t)^{-1}X_t. \tag{5.49}
\]

Taking expectations, since E_{t−1}[ε_t] = 0, we get:
\[
\mathrm{E}_{t-1}\!\left[ \frac{\partial^2 \ell_t}{\partial h_t\partial \beta_t'} \right] = 0. \tag{5.50}
\]

Because h_t is a variance it needs to be positive. We therefore use h_t = exp(α_t) as parametrization, and α_t is taken to be the modeled factor. We need to derive the new score and information matrix w.r.t. our new factor vector f̃_t = (β_{1t}, β_{2t}, β_{3t}, α_t)'. Because this is an invertible mapping, the following rule can be applied for the derivation of the scaled score (Creal et al., 2008). Let:

\[
\tilde{f}_t = g(f_t), \tag{5.51}
\]
\[
\dot{g}_t = \frac{\partial g(f_t)}{\partial f_t'}, \tag{5.52}
\]

and let ġ_t be invertible. We then have:
\[
\tilde{\nabla}_t = \frac{\partial \ell_t}{\partial \tilde{f}_t}
= \left( \frac{\partial f_t}{\partial \tilde{f}_t'} \right)'\frac{\partial \ell_t}{\partial f_t}
= \left( \left( \frac{\partial \tilde{f}_t}{\partial f_t'} \right)' \right)^{-1}\nabla_t
= \left( \dot{g}_t' \right)^{-1}\nabla_t. \tag{5.53}
\]

Thus the scaled score is given by:
\[
\tilde{s}_t = \left( \mathrm{E}_{t-1}\!\left[ \left(\dot{g}_t'\right)^{-1}\nabla_t\nabla_t'\,\dot{g}_t^{-1} \right] \right)^{-1}\left(\dot{g}_t'\right)^{-1}\nabla_t
= \dot{g}_t\, I_{t|t-1}^{-1}\nabla_t = \dot{g}_t\, s_t. \tag{5.54}
\]
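As an illustration of how these pieces combine, a MATLAB sketch of the score for f̃_t = (β_t', α_t)' with h_t = exp(α_t) is given below. It assumes, consistent with (5.37), that Σ_ε(h_t) = Σ_ε + h_t ΓΓ' with Γ an N × 1 vector; Sigma_eps, Gamma, X, beta and alpha are hypothetical workspace variables.

    % Score for the common-disturbance model under the parametrization h_t = exp(alpha_t)
    h   = exp(alpha);
    Sh  = Sigma_eps + h*(Gamma*Gamma');              % Sigma_eps(h_t), assuming this affine form
    e   = yields(t,:)' - X*beta;
    g_b = X'*(Sh\e);                                 % score w.r.t. beta_t, as in (5.14)
    g_h = -0.5*trace(Sh\(Gamma*Gamma')) ...
          + 0.5*(e'*(Sh\Gamma))^2;                   % score w.r.t. h_t, equation (5.40)
    g_a = h * g_h;                                   % chain rule: d h_t / d alpha_t = exp(alpha_t)
    score = [g_b; g_a];                              % score w.r.t. (beta_t, alpha_t)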
