GPU-based Hedging of a Mortgage Portfolio Using Monte-Carlo and Finite Difference Methods


Alexios eiakos

Faculty of Science

University of Amsterdam

ING Nederlands - MRMB Research

A thesis submied for the degree of

MSc. Computational Science

June 2013

Supervised by Drona Kandhai, Jurgen Tas and Han van der Lem


We present GPU-based approaches for estimating the market risk of a portfolio of fixed-rate mortgages with embedded prepayment options. The value of a mortgage is calculated by evaluating a path integral or, alternatively, by solving a forward PDE. We simulate the path integral with a Monte Carlo (MC) method and numerically solve the PDE with an implicit finite difference (FD) method. Subsequently, we compare the accuracy and speed-up of these two parallel methods against a serial implementation. In conclusion, the FD approach achieves the greatest speed-up and the MC approach is the most adaptable to accommodate model adjustments.


I would like to thank my supervisors Drona Kandhai, Jurgen Tas and Han van der Lem for their guidance and insight. Furthermore, I am grateful to Tim Wood and ING for allowing me to use their state of the art GPU hardware. I would like to acknowledge all my colleagues in ING for their support and for the friendly atmosphere they provided throughout this project. Finally, I thank my friend John Tyree for his optimized PCR algorithm implementation and ideas.


1 Introduction
1.1 Mortgages
1.1.1 Risks and Mortgage Hedging
1.1.2 Problem Description
1.1.3 Data distribution
1.1.3.1 Research questions
1.1.4 Parallelization of the scenario tool
1.1.4.1 Research questions regarding a Monte Carlo simulation
1.1.4.2 Research questions regarding a finite difference framework
1.2 Overview of the research

2 Modeling
2.1 Short rate dynamics
2.1.1 The Hull White One Factor dynamics
2.2 Prepayment function
2.3 Monte Carlo
2.3.1 Transformation of the dynamics
2.3.2 Valuing a Mortgage
2.4 Finite difference
2.4.1 Arrow-Debreu pricing and formulation of Kolmogorov equation
2.4.2 Valuing a mortgage
2.4.3 Advantages over the backwards Kolmogorov PDE

3 Implementation Approach
3.1 GPU Introduction
3.1.1 Hardware design
3.2 Data transfer
3.3 Data decomposition
3.4 Computation algorithms
3.4.1 Monte Carlo procedure
3.4.1.1 Path generation kernels
3.4.1.2 Cash flows kernel
3.4.2 Finite difference procedure

4 Numerical Results
4.1 Monte Carlo
4.1.1 Convergence analysis
4.1.2 Scenario Tool analysis
4.1.2.1 NPV cash flows
4.1.2.2 BPV sensitivities
4.2 Finite Difference
4.2.1 Convergence to bond prices
4.2.2 NPV cash flows
4.2.3 BPV sensitivities
4.3 Timing results

5 Discussion & Recommendations
5.1 Recommendations & further research


Introduction

The research presented in this document is part of my thesis project for the MSc Computational Science at the University of Amsterdam. The project was conducted in the MRMB Retail/Research group of ING Netherlands.

1.1 Mortgages

A financial institution issuing mortgage contracts lends an amount of money to a client to buy a house. When a client enters the contract, an agreement has been reached on the total notional amount lent, the redemption schedule, the (entire) contractual period, the interest rate paid by the client (client rate) and the period during which the client rate is fixed. The period can range from a few months to 20-30 years. Note that the entire contractual period can consist of several fixed interest rate periods. In that case, the corresponding rates are set at a market level at the beginning of each period.

1.1.1 Risks and Mortgage Hedging

Consider a mortgage contract, where the client repays according to the contractual agreement. We can think of this contract as a coupon-bearing bond whose value changes when the interest rates change. If the rates go up, the value will decrease, while if the rates go down, the value increases. When the bond/mortgage has a long tenor, the sensitivity towards interest rate movements is higher as well.

Every month we require new funding for new mortgages and renewed mortgage contracts. To receive the funding, deposit contracts are internally arranged within the bank. It appears consistent to determine that the period of the funding be equal to the periods of fixed interest of the mortgages.

However, mortgages embed the option that the client can repay (part of) the notional amount before the fixed interest rate period ends. For this investigation, we assume that the incentive to prepay is the spread between the client rate and the market rate. When this spread is favourable, prepayment will enable a client to refinance against a lower interest rate. Generally, when a prepayment occurs, the period of arranged funding will be too long. If a prepayment occurs when interest rates have dropped, the money returned to the bank will have to be invested at a rate lower than the rate for which it has been attracted, which generates a loss. On the other hand, it is possible that no loss is incurred, i.e. when the new investment for the remaining months can be made at a higher rate than the original one.

The net present value (NPV) of a mortgage portfolio is based on predictions of future prepayments, interest rate payments and repayments. The interest rates, needed to determine future prepayments (and to discount future cash flows), are simulated by means of a stochastic interest rate model. Although the contractual payments and interest payments are known a priori, they require adjustments after a prepayment. The NPV provides the basis for hedging the interest rate risk of a mortgage portfolio. If the portfolio can be valued, it is also possible to calculate the sensitivity of the value to a 1 basis point interest rate curve increase (BPV).

The idea behind the mortgage hedge is to transfer the exposure to (small) interest rate curve changes of a mortgage portfolio to a portfolio containing internally arranged funding contracts. This funding portfolio mimics the mortgage portfolio with respect to its interest rate risk and hence changes in the interest rate curve will affect the funding portfolio and the mortgage portfolio in a similar but opposite way.

1.1.2 Problem Description

For research purposes, we want to be able to perform a scenario analysis in order to test the quality of the hedge method. This means that we want to test how well a hedge in the future would perform. For example, if we run a hedge simulation for the current month at time t0, we would like to test the performance of the hedge at times t1, t2, ..., tn. This is called a roll-in-the-future hedge. Assuming that we require n = 60 rolls (months) in the future as well as 50 different initial interest rate scenarios to test, we need to simulate 3000 different scenarios in total. Ideally, we would like to simulate even more than 50 interest rate scenarios in order to have a result with a higher degree of statistical significance. A schematic representation of the scenario analysis is depicted in figure 1.1.

Figure 1.1: The scenario analysis for a mortgage portfolio, where N is the number of hedge rolls in the future we wish to perform and M the number of different interest rate scenario developments. Note that we could add another dimension L of different prepayment functions.

The steps performed by such a scenario analysis tool can be summarized as:

• Collect the market data from the database.

• Insert stylized or randomly generated scenarios we wish to test.

• Perform short interest rate simulations by use of an appropriate model.

• Calculate cash flows (NPVs) and sensitivities (BPVs) for the portfolio (which typically consists of 100 clusters).

• Compute the notional profiles.

• Perform the hedge calculations.


• Store the results in the respective database for further analysis.

By profiling the current tool it is clear that the computationally heavy processes are simulating the short rates and calculating the cash flows and sensitivities. These processes alone account for almost 98% of the total runtime.

This research aims to replace the computationally heavy operations with a GPU (Graphics Processing Unit) implementation in order to make the program more efficient and thus make the analysis of the scenarios less cumbersome. The main goal is to provide a functional prototype, demonstrating that this type of simulation can be parallelized, and to report the amount of speed-up we can acquire.

The project also has two further goals: we wish to investigate numerical methods for simulating the short rates and valuing the portfolios other than the current lattice implementation (trinomial tree). This is because the trinomial tree does not map well onto a highly parallel GPU environment, as we will discuss later.

We would also like to investigate the possibility of a transparent connection of the parallelized implementation with the current library. This is a requirement because the current library performs a number of administrative operations, like database parsing, market data calibration etc., that we wish to preserve. These operations are well tested and debugged.

This project is a proof of concept that will give us insight into the performance gain in a GPU setting, as well as the advantages of using more advanced numerical techniques than a lattice method when pricing a mortgage portfolio.

In the following sections these topics are described and we formulate the research questions that require investigation in order to provide a functional prototype.

1.1.3 Data distribution

The profiling analysis, as mentioned above, demonstrated that there is no requirement for a complete refactoring of the current code base. The current Java library client is well tested and efficient in performing the non-critical parts of the tool. Rewriting the whole program in a GPU-"friendly" language like C/C++ in order to work seamlessly with the GPU is not advised, since it would require a lot of development time and testing. A more detailed view of the tool modules is shown in figure 1.2.

We aim to distribute the relevant data of a scenario analysis to a cluster of GPUs on the local network, where we can parallelize and efficiently perform the cash flow calculations.

[Figure 1.2: Overview of the scenario tool modules: input interfaces (configuration, swap rates, fund data, mortgage contracts), pre-simulation components (parser, interpolation and clustering algorithms, scenario build tool), the HW1F simulation core (non-equidistant trinomial trees, finite difference PDE with boundaries and discretization schemes, vanilla Monte Carlo with variance reduction algorithms; Java and CUDA), and post-simulation output (hedging algorithms, profile curves, plots).]

1.1.3.1 Resear questions

Regarding the data distribution between a client and the GPU cluster, we want to:

• Investigate the communication of a client/server module in an efficient, scalable and transparent way, while avoiding excessive overhead of data transfer.

An efficient tool for communication between the cluster equipped with the GPU hardware and the clients is an important first step towards speeding up the process. If the data is not handled fast enough, even if the actual scenario tool implementation is very efficient, a bottleneck will arise. The above research question requires investigation in order to determine the limits of the tool and in order to design the cluster's specifications, i.e. how many machines are required for a given dataset analysis and how many GPUs per machine.

1.1.4 Parallelization of the scenario tool

When the limitations of the data distribution are known, the next step in designing such a tool is the parallelization of the computationally expensive part, which was identified as the valuation of the portfolio. As mentioned above, one of the project's goals is the investigation of different numerical methods for the calculation of the cash flows. The most common approaches are Monte Carlo simulations and finite difference methods. In this research both approaches are considered and compared.

1.1.4.1 Resear questions regarding a Monte Carlo simulation

If we look closer at the problem, we can identify the financial product we are trying to price as a path-dependent one. Our payoff, the NPV, depends on the short rate path through the discount factors and the prepayment rates. The path dependency is an indication that a Monte Carlo simulation is appropriate for pricing this security. Coupled with the fact that a Monte Carlo method is considered embarrassingly parallel, we conclude that this technique is a suitable candidate for the parallelization of the scenario tool on a GPU. The research question we want to address:


• Investigate the convergence (vanilla and variance reduction methods) in order to use these results for the design of the tool.

1.1.4.2 Resear questions regarding a finite difference framework

In order to apply a finite difference discretization, we need to convert the dynamics of the short rate model (a stochastic differential equation) into a partial differential equation (PDE) and solve it numerically in a discretized domain. We want to:

• Investigate the applicability of a PDE solver in the mortgage valuation setting.

• Investigate and design a parallel solution for the solver, given the sequential nature of the discretized domain and the limitations of GPU thread synchronization.

Finally, we would like to compare the two methods in terms of speed-up, accuracy and flexibility. In theory we know that a finite difference method is faster than a Monte Carlo model, but we wish to investigate how these algorithms map onto the GPU setting given the limitations of the platform. The flexibility of the method is also a crucial factor in designing such a tool at a production level, compared to a prototype, as we may wish to extend the functionality of the model.

1.2 Overview of the research

In the next sections we will answer the research questions that were formulated in section 1.1.2. In chapter 2 we discuss model specific topics such as the short rate model, the prepayment function, and the mathematical formulation of the pricing under the Monte Carlo and finite difference methods. In chapter 3 we discuss implementation details and the algorithms used for the computer simulations. Furthermore, a small introduction to the GPU and how it achieves the speed-up is given. In chapter 4 we provide a scenario tool test case and we give the numerical results and speed-ups for the methods considered. Finally, in chapter 5 we summarize the results with a small discussion and provide useful recommendations for a production level system.


Modeling

In this chapter we discuss in detail all the models that are currently used for the mortgage valuation. This involves a model for simulating the instantaneous short rates (section 2.1), a model for simulating the clients' behaviour, i.e. the prepayment function (section 2.2), the path generation and the cash flow calculation under a Monte Carlo framework (section 2.3), and finally the discretized short rate and the cash flow calculation under a finite difference framework (section 2.4).

2.1 Short rate dynamics

The literature is rich in short rate models, e.g. the Vasicek [24], Cox-Ingersoll-Ross [7], Hull-White one-factor (HW1F) [13] and Black-Karasinski [4] models. For a complete overview we refer to [5].

When choosing a short rate model one has to answer some questions before deciding which one is suitable. For example, some models imply negative rates (r(t) < 0), which may or may not be acceptable. Additionally, models like the HW1F model assume that short rates are normally distributed. Consequently, this provides analytically tractable bond prices. For this investigation we consider the HW1F short rate model with constant mean reversion and constant volatility.

2.1.1 The Hull White One Factor dynamics

The HW1F model is considered an extension of the Vasicek model. The idea behind the Vasicek model is that the short rate evolves as an Ornstein-Uhlenbeck process with dynamics:

\[ dr(t) = (\theta - k\,r(t))\,dt + \sigma\,dW(t), \qquad (2.1.1) \]

where r(t) is the instantaneous short rate at time t, k is the rate at which the spot rate is pushed to the long-term level, θ is a constant chosen to make the model consistent with the initial term structure, σ is the volatility and W(t) is a Brownian motion under the risk-neutral measure. Integrating equation (2.1.1) yields:

\[ r(t) = r(s)e^{-k(t-s)} + \theta\left(1 - e^{-k(t-s)}\right) + \sigma\int_s^t e^{-k(t-u)}\,dW(u), \]

with r(t) normally distributed with first and second moments conditional on time s given by:

\[ \mathbb{E}[r(t)\,|\,r(s)] = r(s)e^{-k(t-s)} + \theta\left(1 - e^{-k(t-s)}\right), \qquad \mathrm{Var}[r(t)\,|\,r(s)] = \frac{\sigma^2}{2k}\left(1 - e^{-2k(t-s)}\right). \qquad (2.1.2) \]

The Vasicek model has the downside that the term θ is constant and cannot capture the movement of the term structure. Another drawback of the model is the theoretical probability of r becoming negative (although this is also the case with the HW1F model, as we will see later). The extended Vasicek model, proposed by Hull and White, focuses on improving the fit of the model to market data by introducing a deterministic function of time for the θ term, as well as a deterministic function of time for the σ term in the more general form. This transforms (2.1.1) into:

\[ dr(t) = (\theta(t) - k\,r(t))\,dt + \sigma(t)\,dW(t). \qquad (2.1.3) \]

In a follow-up paper, Hull and White [14] advise not to calibrate the volatility function, because the volatilities implied by (2.1.3) are likely unrealistic, and thus:

\[ dr(t) = (\theta(t) - k\,r(t))\,dt + \sigma\,dW(t). \qquad (2.1.4) \]

Now θ(t)/k is the level to which the spot rate is moving. It can be shown (see e.g. [13]) that θ(t) equals:

\[ \theta(t) = \frac{\partial f(0,t)}{\partial t} + k\,f(0,t) + \frac{\sigma^2}{2k}\left(1 - e^{-2kt}\right), \qquad (2.1.5) \]

with f(0, t) the market instantaneous forward rate at time 0 with maturity t, given by

\[ f(0,t) = -\frac{\partial \ln P(0,t)}{\partial t}. \qquad (2.1.6) \]

Here P(0, t) is the market-observed zero-coupon bond price at time 0 with maturity t. Integrating (2.1.4) yields:

\[ r(t) = r(s)e^{-k(t-s)} + \alpha(t) - \alpha(s)e^{-k(t-s)} + \sigma\int_s^t e^{-k(t-u)}\,dW(u), \qquad (2.1.7) \]

where:

\[ \alpha(t) = f(0,t) + \frac{\sigma^2}{2k^2}\left(1 - e^{-kt}\right)^2. \]

The distribution of r(t) in the Hull-White model is Gaussian with first and second moments conditional on time s given by:

\[ \mathbb{E}[r(t)\,|\,r(s)] = r(s)e^{-k(t-s)} + \alpha(t) - \alpha(s)e^{-k(t-s)}, \qquad \mathrm{Var}[r(t)\,|\,r(s)] = \frac{\sigma^2}{2k}\left(1 - e^{-2k(t-s)}\right). \qquad (2.1.8) \]

Under the dynamics of (2.1.4) bond prices are analytically tractable, see e.g. [15]:

\[ P(t,T) = A(t,T)\,e^{-B(t,T)\,r(t)}, \qquad (2.1.9) \]

where

\[ B(t,T) = \frac{1 - e^{-k(T-t)}}{k}, \qquad A(t,T) = \frac{P(0,T)}{P(0,t)}\exp\!\left( B(t,T)\,f(0,t) - \frac{\sigma^2}{4k}\left(1 - e^{-2kt}\right)B(t,T)^2 \right). \]

In Hull and White's original paper [13], a lattice method is proposed to simulate the short rate dynamics. More specifically, they use a trinomial tree, where forward induction is used in order to calibrate the initial term structure. Two other methods were used in this project: a Monte Carlo simulation and a finite difference discretization. Note that an explicit finite difference scheme is an extension of the trinomial tree. However, using more advanced schemes, such as the implicit or the Crank-Nicolson scheme, we can achieve better results in terms of numerical stability and rate of convergence.

2.2 Prepayment function

The prepayment rate as a function of the interest rate spread is determined by means of a continuous parametrized deterministic functional form, ĝ(r(t)). Here the interest rate spread is defined as the difference between the fixed client rate and the market rate. The functional form ĝ(r(t)) is defined in such a way that the prepayment rate increases with the interest rate spread. Thus, the more economically favourable the spread, the more clients are motivated to prepay.

Note that not all clients display this rational behaviour. This means that even if the spread is substantially positive, clients are not triggered to prepay more. Moreover, irrational behaviour occurs when the prepayment spread is negative. Therefore, the prepayment rate is assumed to be bounded between an upper and a lower level.

Although the prepayment function is not stochastic, the input for this function, i.e. the short rates, is. This makes the prepayment rates a path-dependent stochastic process.
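The exact parametrization of ĝ is not reproduced in this document. As a minimal sketch of a function with the stated properties (monotone in the spread, clamped between a lower and an upper level), assuming a logistic shape and hypothetical parameter names:

#include <math.h>

// Hypothetical bounded prepayment function g_hat(spread); illustrative only,
// not the proprietary form used in the scenario tool.
__host__ __device__
double prepayment_rate(double client_rate, double market_rate,
                       double lower, double upper, double steepness)
{
    double spread = client_rate - market_rate;          // incentive to prepay
    double s = 1.0 / (1.0 + exp(-steepness * spread));  // logistic, in [0, 1]
    return lower + (upper - lower) * s;                 // bounded between the two levels
}

Any monotone, bounded shape would serve the same purpose; the logistic form is only used here to make the bounds and the monotonicity explicit.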

2.3 Monte Carlo

When dealing with path-dependent problems, such as valuing future mortgage cash flows, the Monte Carlo simulation is the first method to consider because of its simplicity compared to other methods like a lattice or a finite difference solver. In short, the chain of thought for a Monte Carlo simulation is:

• Simulate a number of paths for the underlying asset.

• Compute the discounted value of a derivative for a given sample path.

• Average the payoff over all the paths.

It is easy to see that Monte Carlo is a sampling technique where we sample the domain of all possible outcomes. By the law of large numbers, the estimate is bound to converge to the correct value. Because of the Central Limit Theorem (CLT), the error of a Monte Carlo estimate is normally distributed with mean 0 and standard deviation σ/√n. This implies a convergence rate of O(n^{-1/2}), which seems like a big drawback since, in order to reduce the error of the estimate by half, we would need 4 times the number of samples.

Recent advances in technology (i.e. powerful CPUs and GPUs) allow us to reduce the error by simply taking more sample paths, because of the available computational power. Moreover, a wide range of variance reduction techniques is available in the literature in order to reduce the error while keeping the number of sample paths constant.

2.3.1 Transformation of the dynamics

In order to simulate the dynamics of (2.1.4) we need a discretization method. The term θ(t) in the dynamics is given by (2.1.5) and involves the numerical approximation of the derivative of the forward rate, ∂f(0,t)/∂t.

To avoid the approximation of this potentially discontinuous derivative, we can perform the following coordinate transformation:

\[ x(r(t), t) := r(t) - \alpha(t). \qquad (2.3.1) \]

In this case x follows an Ornstein-Uhlenbeck process, given by

\[ dx(t) = -k\,x(t)\,dt + \sigma\,dW(t), \qquad x(0) = 0. \qquad (2.3.2) \]

By applying Itô's lemma to f(x(t), t) = x(t)e^{kt} and integrating with respect to t, we arrive at:

\[ x(t) = x(s)e^{-k(t-s)} + \sigma\int_s^t e^{-k(t-u)}\,dW(u), \qquad s < t. \]

Next, by starting from time s = 0:

\[ x(t) = x(0)e^{-kt} + \sigma\int_0^t e^{-k(t-u)}\,dW(u). \]

In practice this means that for the Monte Carlo simulation we first discretize and simulate the x(t) process and, in order to get the short rates, we use equation (2.3.1).

Note that, by performing this transformation, the process x now only depends on the Hull-White parameters k and σ. Every scenario in the tool can now share the same process x, since scenarios differ only in the interest rate curves and/or the prepayment function. We consider this an implementation optimization.

For a time schedule 0 < t1 < t2 < ... < tN we can advance x(t) in a bias-free manner:

\[ x_{k+1} = \mathbb{E}[x(t_{k+1})\,|\,x(t_k)] + \sqrt{\mathrm{Var}[x(t_{k+1})\,|\,x(t_k)]}\,Z_{k+1} = e^{-k\Delta t}x_k + \sigma\sqrt{\frac{1 - e^{-2k\Delta t}}{2k}}\,Z_{k+1}, \]

where k = 0, 1, ..., N-1 and the Z_k ~ N(0,1) are a sequence of independent standard Gaussian random variables. Note that an Euler-Maruyama discretization [18] of the x process would yield the same bias-free result.

As mentioned at the beginning of the section, a vanilla Monte Carlo method has an order of convergence O(n^{-1/2}). Variance reduction techniques, such as antithetic sampling, control variates and importance sampling, cannot improve this order of convergence, but they can reduce the constant in front of it (the variance of the estimator). Henceforth, antithetic sampling is used [9].


2.3.2 Valuing a Mortgage

Under the risk-neutral measure Q, in a complete market and in the absence of arbitrage opportunities, any interest rate claim V(r, t) can be priced by:

\[ V(r, t) = \mathbb{E}^{\mathbb{Q}}\!\left[ e^{-\int_t^T r(u)\,du}\,V(T; \{r(t) : 0 \le t \le T\}) \,\Big|\, \mathcal{F}_t \right], \qquad (2.3.3) \]

where the payoff V(T) denotes the dependency (if any) on the path of r(t) and \(\mathcal{F}_t\) is a σ-algebra of the sample space. The Monte Carlo estimate of the payoff V(T) is given by:

\[ \mathbb{E}^{\mathbb{Q}}[V(T)] \approx \frac{1}{n}\sum_{k=1}^{n} V_k(T). \]

Notice that the discount factor in (2.3.3) is now stochastic. There are three common methods to compute the discount factor. First, we can approximate the stochastic integral by:

\[ \int_0^T r(u)\,du \approx \sum_{k=1}^{T} r(t_{k-1})(t_k - t_{k-1}). \qquad (2.3.4) \]

Note that this approximation will render the expectation biased, because the estimator contains the discretization error, even though we use bias-free advancement of the x process. Second, we can simulate the integral with a Monte Carlo method as well; the joint simulations will have a correlation that needs to be taken care of. The third method is to change the numeraire to the T-forward measure Q^T. This removes the discount factor from the expectation but changes the dynamics of r and consequently of x. For the rest of this project the simple approximation (2.3.4) was used.

As mentioned in the introduction, the mortgage cash flows depend on the repayments R_t, the prepayments P_t and the coupon payments C_t. Even though contractual repayments and contractual coupon payments are known, the possibility of a prepayment renders them stochastic as well. When a prepayment occurs, the remaining mortgage notional needs to be re-adjusted. To this end, we define:

\[ \omega_t := \frac{N^c_t}{N^c_{t-1}}, \qquad d_{t-1} := e^{-r_{t-1}\Delta t}, \qquad \hat{c}_t := \frac{r_{\text{client}}}{12}, \qquad (2.3.5) \]

where N^c_t is the contractual notional, d_t the discount factor and ĉ_t the monthly client rate at time t. The cash flows can then be written as:

\[ P_t = d_{t-1}\cdot \omega_t \cdot A_{t-1}\cdot \hat{g}(r, t), \qquad (2.3.6) \]
\[ R_t = d_{t-1}\cdot (1 - \omega_t)\cdot A_{t-1}, \qquad (2.3.7) \]
\[ C_t = d_{t-1}\cdot A_{t-1}\cdot \hat{c}_t, \qquad (2.3.8) \]
\[ A_t = d_{t-1}\cdot A_{t-1}\cdot \omega_t \cdot (1 - \hat{g}(r, t)), \qquad (2.3.9) \]

where A_t is the corrected notional and ĝ(r, t) the prepayment function as defined in section 2.2.

The net present value (NPV) of a mortgage is then given by:

\[ NPV = \mathbb{E}^{\mathbb{Q}}\!\left[\sum_{i=1}^{T} R_i + P_i + C_i\right]. \qquad (2.3.10) \]

Since the portfolio consists of various contracts, the total value is given by:

\[ NPV_{\text{total}} = \mathbb{E}^{\mathbb{Q}}\!\left[\sum_{j=1}^{C}\sum_{i=1}^{T} R_i + P_i + C_i\right]. \qquad (2.3.11) \]

Note that the above equations work for all types of mortgages.
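As an illustration only (this is not the thesis implementation), the recursion (2.3.5)-(2.3.9) along a single simulated short-rate path could be coded as below; the per-step prepayment rates g[t] (coming from the prepayment function of section 2.2) and the contractual notionals are assumed to be given:

#include <math.h>

// Discounted cash flows of one mortgage cluster along one short-rate path,
// following the recursions (2.3.5)-(2.3.9); illustrative sketch only.
__host__ __device__
double path_npv(const double* r,        // simulated short rates r_0 .. r_{T-1}
                const double* Nc,       // contractual notionals N^c_0 .. N^c_T
                const double* g,        // prepayment rates g(r,t) for t = 1 .. T
                int T, double dt, double client_rate)
{
    double A    = Nc[0];                // corrected notional A_0
    double chat = client_rate / 12.0;   // monthly client rate          (2.3.5)
    double npv  = 0.0;
    for (int t = 1; t <= T; ++t) {
        double omega = Nc[t] / Nc[t - 1];    // omega_t                 (2.3.5)
        double d     = exp(-r[t - 1] * dt);  // discount factor d_{t-1} (2.3.5)
        double P = d * omega * A * g[t];     // prepayment              (2.3.6)
        double R = d * (1.0 - omega) * A;    // repayment               (2.3.7)
        double C = d * A * chat;             // coupon                  (2.3.8)
        A = d * A * omega * (1.0 - g[t]);    // corrected notional      (2.3.9)
        npv += R + P + C;
    }
    return npv;
}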

The first-order interest rate sensitivity of a portfolio, given a 1 basis point shift of the interest rate curve (BPV), is computed with a finite difference approximation. In this case we use a central approximation to reduce the bias in the expectation:

\[ BPV = \frac{\mathbb{E}^{\mathbb{Q}}[NPV^{+}_{\text{total}}] - \mathbb{E}^{\mathbb{Q}}[NPV^{-}_{\text{total}}]}{2}, \qquad (2.3.12) \]

where NPV^{+}_{total} is the total value given an up shift (by 1bp) of the interest rate curve and NPV^{-}_{total} the total value given a down shift (by 1bp) of the interest rate curve. In practice this means that we require 3 simulations for every interest rate path. Furthermore, we require the profile of the notionals or, in other words, how the notional value decreases over time given that prepayments occurred during the life of the contract. To compute the notional profile, we can use the same equations as before, sans the discount factors.

2.4 Finite difference

A contract whose future cash flows depend on future states of the world is called a state-contingent claim, or state claim. In this investigation, we consider a mortgage contract with an embedded prepayment option. Moreover, we assume that the prepayment is modelled as a path-dependent function of the interest rate in the market. Note that this contract is a state claim. Consequently, by applying the Arrow-Debreu pricing theory [2] we can value the cash flows. This valuation involves solving a parabolic partial differential equation (PDE), to be specified below. The finite difference discretization method discussed in this section is a numerical method for solving PDEs.

2.4.1 Arrow-Debreu pricing and formulation of Kolmogorov equation

Below, we give the formal definition of the value Q of an Arrow-Debreu security:

Definition 1. Q(s, t; y, T) is the value of a security with a payoff of 1 if we reach state y at time T > 0 and zero otherwise, given that we started at state s at time t (0 < t < T).

Assuming no arbitrage, under the risk-neutral measure Q, we value the Arrow-Debreu security as the discounted conditional expectation of the payoff:

\[ Q(t, s; T, y) = \mathbb{E}^{\mathbb{Q}}\!\left[ e^{-\int_t^T r_u\,du}\,\delta(y) \,\Big|\, r_t = s \right], \qquad t < T, \qquad (2.4.1) \]

where δ(y) is the Dirac delta function, which represents the payoff of the Arrow-Debreu security, and r_t satisfies the SDE (2.1.4).

By applying the Feynman-Kac theorem we can link this probabilistic solution with a Kolmogorov equation [16]. The Arrow-Debreu security satisfies both the forward and the backward Kolmogorov equation [3]. In this project we are interested in solving the forward equation, for reasons stated in section 2.4.3. The forward Kolmogorov equation reads:

\[ \frac{\partial Q(t,s;T,y)}{\partial T} - k\,Q(t,s;T,y) - k\,y\,\frac{\partial Q(t,s;T,y)}{\partial y} - \frac{1}{2}\sigma^2\,\frac{\partial^2 Q(t,s;T,y)}{\partial y^2} = -r\,Q(t,s;T,y), \qquad (2.4.2) \]

with (t, s) fixed and domain t ∈ R+. Equation (2.4.2) is a parabolic PDE and thus an initial boundary value problem. In order to solve this equation, an initial value and boundary conditions are needed. As initial condition we use the Dirac delta payoff:

\[ Q(0, s; 0, y) := \delta(s). \]

Furthermore, we use Dirichlet boundary conditions, i.e. Q is set to zero at the upper and lower boundaries of the spatial domain (given below in the transformed variable).


In our investigation, we are considering Arrow-Debreu prices for different states of the short rate. Thus we can rewrite Q as Q(t, r(t); T, r(T)), where r(t) satisfies (2.1.4). For the same reasons as stated in section 2.3.1 we perform the transformation:

\[ x(t) := r(t) - \alpha(t). \]

Note that, on the discretized rectangular grid, the fit to the yield curve when using α(t) is not exact. A calibration to bond prices is required for an exact fit. Nevertheless, the approximation error is small.

Next, by taking the transformation into account, PDE (2.4.2) is rewritten. To simplify the notation, we write Q(t, x(t); T, x(T)) = Q(T, x):

\[ \frac{\partial Q(T,x)}{\partial T} - k\,Q(T,x) - k\,x\,\frac{\partial Q(T,x)}{\partial x} - \frac{1}{2}\sigma^2\,\frac{\partial^2 Q(T,x)}{\partial x^2} = -(x + \alpha(T))\,Q(T,x). \qquad (2.4.3) \]

Note that the right-hand side of the PDE is the discounting term. Thus, we need to add α(T) in order to transform the process x back to the short rate r. From the dynamics of x we know that x(0) = 0, and thus the initial Dirac delta condition is:

\[ Q(0, x(0); T, x(T)) = \delta(x(0)) = \mathbb{1}_{\{x=0\}}, \]

with \(\mathbb{1}_{\{x=0\}}\) the indicator function. The following Dirichlet boundaries are applied:

\[ Q(t, x(t); T, x_{M/2}(T)) = Q(t, x(t); T, x_{-M/2}(T)) = 0. \]

We discretize the PDE on an (N+1) × (M+1) rectangular grid, as illustrated in figure 2.1. Let (i, j) denote a grid point at time t = iΔt with an associated x value equal to jΔx; i varies from 0 to N and j from -M/2 to M/2.

We can employ probabilistic arguments to determine the extent of the discretized domain of x on the rectangular grid:

\[ x_{-M/2} = \mathbb{E}[x(T)] - \alpha\sqrt{\mathrm{Var}[x(T)]}, \qquad x_{M/2} = \mathbb{E}[x(T)] + \alpha\sqrt{\mathrm{Var}[x(T)]}, \]

where α is a user-specified level of confidence (usually 3-4 is adequate) and \(\mathbb{E}[x(T)]\), \(\mathrm{Var}[x(T)]\) are the first and second moments of the process x (2.3.1), respectively.

In order to obtain a numerical solution of the PDE, we use an implicit scheme, see Appendix A. The implicit scheme is unconditionally stable, in contrast to the explicit scheme. The Crank-Nicolson scheme should be used with caution even though it has better convergence (second order in time, compared with first order for the implicit scheme). The initial Dirac pulse condition is highly discontinuous and the Crank-Nicolson scheme is known to suffer from spurious oscillations for such payoffs [8]. To avoid those oscillations, the initial condition requires further treatment.

Figure 2.1: Rectangular grid of size (N+1) × (M+1) for the numerical discretization of the PDE, with time levels t_0 = 0, t_1, ..., t_N = T, state levels from x_{-M/2} to x_{M/2}, the Dirac initial condition δ(x(0)) at t_0, and upper and lower boundary conditions at the edges of the domain.
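For illustration only (the precise scheme is given in Appendix A, which is not reproduced here), a fully implicit discretization of (2.4.3), with a backward difference in T and central differences in x, would couple the unknowns at the new time level as:

\[ \frac{Q^{i+1}_j - Q^i_j}{\Delta T} - k\,Q^{i+1}_j - k\,x_j\,\frac{Q^{i+1}_{j+1} - Q^{i+1}_{j-1}}{2\Delta x} - \frac{\sigma^2}{2}\,\frac{Q^{i+1}_{j+1} - 2Q^{i+1}_j + Q^{i+1}_{j-1}}{\Delta x^2} = -(x_j + \alpha(T_{i+1}))\,Q^{i+1}_j, \]

which, collected per time step, yields a tridiagonal system of the form A_{t+1} Q_{t+1} = Q_t used in the next section.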

2.4.2 Valuing a mortgage

In this section, we apply the Arrow-Debreu equilibrium framework [2]. This framework states that if markets are complete, any arbitrarily complex security can be replicated by a portfolio of Arrow-Debreu securities. Market completeness essentially means that prices exist for all the Arrow-Debreu securities. Moreover, the framework assumes no arbitrage opportunities.

Specifically, we use the framework to value a mortgage contract. To illustrate the pricing principle, consider a mortgage cash flow at time t. In a discrete setting, the value of this cash flow is calculated as the Euclidean inner product of a vector of Arrow-Debreu prices and a vector of cash flows. Note that in a continuous setting (i.e. when the sample space is an uncountable set) the value is given by a Lebesgue integral. The implicit finite difference scheme implies that we need to solve a system of equations for each time step, i.e.

\[ A_{t+1}\cdot Q_{t+1} = Q_t, \qquad (2.4.4) \]


where R_t denotes the repayments, C_t the monthly coupon payments and P_t the prepayments:

\[ R_t = \mathbf{1}\,(1 - \omega_t)\,N^c_{t-1}, \qquad P_t = \Lambda_t\,\mathbf{1}\,N^c_t, \qquad C_t = \mathbf{1}\,N^c_{t-1}\,\hat{c}_t, \qquad (2.4.5) \]

where ω_t is defined in (2.3.5), N^c_t is the contractual notional at time t and \(\mathbf{1}\) is the identity vector. Λ_t is a diagonal matrix of the prepayment rates for all levels of the interest rate at time t, with Λ_0 the null matrix.

Because of the option of prepaying, we need to adjust the state prices Q_t for the possibility of being prepaid before we reach the next state:

\[ A_{t+1}\cdot Q_{t+1} = (I - \Lambda_t)\,Q_t, \qquad (2.4.6) \]

with I the identity matrix. The value of the mortgage portfolio, corrected for prepayments, is:

\[ NPV = \sum_{i=0}^{C}\sum_{t=1}^{T} Q_t\cdot(R_t + P_t + C_t), \qquad (2.4.7) \]

where C is the number of clusters in the portfolio.

The correction of the state prices for prepayment implies that for every cluster in the portfolio we need to solve the PDE. This is because the prepayment correction depends on the cluster-specific fixed rate.

2.4.3 Advantages over the bawards Kolmogorov PDE

In the case of mortgage valuation, we are required to correct the state prices obtained from the Kolmogorov PDE for prepayments. This introduces a dependency on the solution of the previous steps. In order to use the backwards Kolmogorov equation we would need to reformulate the algorithm with forward-from-backward induction, see [1] page 460. The reformulation is necessary to achieve O(n^2) computational complexity. The same applies if we wish to perform an exact calibration to the yield curve by pricing bonds. The forward equation achieves that complexity without any further modifications.

This is also advantageous when we are interested in pricing a portfolio of similar securities which differ only in terms of maturity. With the forward PDE we can price all the securities on one grid, since we can stop the calculation of certain securities of the portfolio once we pass their expiration. In our case this is not feasible, since we need to correct for prepayments and each contract of the portfolio has a different prepayment rate.


2.4.4 Advantages over the classical trinomial tree

In Hull and White's original paper, the authors propose a modified trinomial tree for solving the SDE numerically. The modification ensures that the solution is numerically stable, which does not hold in general for explicit schemes.

By using a general PDE solver, we can take advantage of more robust schemes. Implicit or Crank-Nicolson schemes are unconditionally stable (in a numerical sense). Furthermore, with a Crank-Nicolson scheme the convergence of the solution is second order in both time and space, as opposed to first order for an explicit scheme (in time and space) and second order in space/first order in time for an implicit scheme. Moreover, the application of a non-equidistant discretization is easier on a rectangular grid than on a trinomial tree.

Even though we did not apply these features (the CN scheme, a non-equidistant grid) in this investigation, a finite difference scheme is more general and more flexible for exploring new possibilities.

The main reason to choose a general PDE solver rather than the trinomial tree in a parallel setting is the superior mapping of a grid solver to the GPU compared to a trinomial lattice. As we will see later in chapter 3, we want all the spawned threads of a GPU kernel to do computations. A lattice would require threads to be idle in the first steps, since only a few nodes have actual values. In short, the better mapping to the GPU, the faster convergence of some schemes and the much better flexibility of a general solver justify the use of partial differential equations.


Implementation Approach

In this chapter we discuss the implementation of both solvers. More specifically, a small introduction to GPUs and how they achieve massive parallelization is given in section 3.1. The implementation of the data transfer between a client machine and a server machine is discussed in section 3.2. Section 3.3 is devoted to the decomposition of the scenario tool data for parallel calculations, suitable for a GPU setting. We continue with the specifics of the algorithms for the Monte Carlo method in section 3.4.1 and for the finite difference method in section 3.4.2.

3.1 GPU Introduction

A Graphics Processing Unit (GPU) is a dedicated or integrated chip in a computer that can very efficiently handle graphics-related computations. During the last decade, the performance of GPUs increased dramatically, which led researchers to investigate the use of these chips for purposes other than visualization. By utilizing the shaders of the GPU, one could load data into textures and perform number-crunching operations in a highly parallel fashion [17]. These methods were very cumbersome and time consuming. The hardware manufacturers that developed these devices acknowledged the potential and developed tools and libraries to exploit this power. This initiative gave birth to frameworks and APIs like CUDA [20] from NVIDIA and the open standard OpenCL [22] from the Khronos group. The field of using GPUs to parallelize computations is described as General Purpose computing on Graphics Processing Units (GPGPU). In figure 3.1a the floating-point operations per second of GPUs and CPUs are compared, and in figure 3.1b they are compared in terms of memory bandwidth.

Figure 3.1: (a) FLOPS for GPUs and CPUs over the last years; (b) memory bandwidth for GPUs and CPUs over the last years. Source: [20].

In this project we use the CUDA framework coupled with NVIDIA hardware. In the following sections, details about the hardware and how it is utilized are specific to CUDA/NVIDIA. In principle, the same approach can be used with the OpenCL framework, but no tests were performed.

3.1.1 Hardware design

The main difference between the hardware designs of a CPU and a GPU is that the GPU is designed for highly parallel processes; it therefore has a lot of transistors devoted to data crunching rather than flow control and data caching. A schematic of the GPU design compared to that of a CPU is given in figure 3.2.


The hardware design follows the Single Instruction Multiple Data (SIMD) paradigm [21]: the same program (instruction) is executed by multiple threads. Coupled with high memory bandwidth, the performance of a scalable algorithm can increase substantially.

The atomic computation unit in a GPU is the thread. Threads are organized in blocks, where blocks can be of one, two or three dimensions in order to facilitate computations on vectors, matrices or volumes. Blocks are organized in grids that can have up to three dimensions as well. This hierarchy is demonstrated in figure 3.3 for 2D blocks and a 2D grid.

Figure 3.3: read hierarchy on a GPU. Source [20]

GPUs also have multiple on-device memory spaces. Every thread has its own local memory that is only visible to itself. The next level of memory is the shared memory, which is visible to the entire block of threads. As the name suggests, all threads in a block can read/write from the shared memory. At the highest level we have the global memory, where all the threads in a grid can read/write. A schematic of the memory hierarchy is shown in figure 3.4.

Because shared memory is closer to the ALUs, it is also orders of magnitude faster than global memory. The drawback is that shared memory is limited and one has to program with it explicitly. This means that data is not cached in the shared memory automatically; rather, the programmer is responsible for transferring data in and out in order to take advantage of the high bandwidth.

Figure 3.4: Memory hierarchy on a GPU. Source [20]

The instructions to the threads are issued from a special function called a kernel. The kernel is like any other function, with the difference that it cannot return any value, i.e. the return type has to be void. Before we pass arguments to the kernel (like arrays or constants), we need to copy the data to the device global memory. Return values are also passed as arguments.
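As a minimal, generic illustration of this workflow (this is not code from the thesis; the kernel, sizes and values are made up):

#include <cuda_runtime.h>

// A kernel returns void; results are written to device buffers passed as arguments.
__global__ void scale(double* data, double factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    double* h = new double[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0;

    double* d;
    cudaMalloc(&d, n * sizeof(double));                           // allocate device global memory
    cudaMemcpy(d, h, n * sizeof(double), cudaMemcpyHostToDevice); // copy inputs to the device

    scale<<<(n + 255) / 256, 256>>>(d, 2.0, n);                   // launch the kernel

    cudaMemcpy(h, d, n * sizeof(double), cudaMemcpyDeviceToHost); // copy the result back
    cudaFree(d);
    delete[] h;
    return 0;
}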


3.2 Data transfer

When we posed our research question, in section 1.1.3, we mentioned the importance of being able to connect a client running the current Java implementation with a cluster of GPUs on the local network to perform heavy duty computations.

There are several ways to accomplish this task. Even though Java has the Java Native Interface (JNI) for calling native shared libraries implemented in C, that approach is not flexible enough. A more robust solution is the use of sockets to distribute serialized data and tasks over the network to a cluster of GPUs. This solution is illustrated in figure 3.5.

Figure 3.5: Connection of a Java client with a C/CUDA server over the local network using Google Protocol Buffers (GPB) and ZeroMQ sockets. The client queries the database, prepares and serializes the data; the server deserializes it, computes with CUDA and serializes the results back; the client deserializes them and performs the hedge.

The Java client is responsible for querying the database and preparing the data. Once the data are organized in appropriate objects, they are serialized in binary format using Google Protocol Buffers (GPB) [10]. Once we serialize the data, we send the packages over the local network using ZeroMQ sockets [11]. For data and task distribution through the local network we use two types of sockets:

• Publisher/Subscriber (asynchronous): used for data distribution.

• Request/Reply (synchronous): used for task distribution.

More specifically, every client (workstation) has its socket in publisher mode, bound to a specific IP address. Each active server on the GPU cluster has its socket in subscriber mode, where it listens to specific IP addresses. Once the client sends data, the server captures it. This transfer type is asynchronous and is used for data distribution. The Request/Reply socket is used to synchronize the client and the server and to issue tasks. This type is blocking, i.e. the process only continues when it gets back a reply. The data/task distribution is shown in figure 3.6.

Figure 3.6: Communication with ZeroMQ sockets between a client and the cluster of GPUs. The workstation's PUB socket feeds the SUB sockets of the servers, while its REQ socket issues tasks to the servers' REP sockets.

The advantages of this setup are numerous: firstly, the implementation is language-agnostic; client and server can be programmed in whatever language we wish. More importantly, the distributed system allows us to decompose large problem sizes and distribute them to multiple servers in order to utilize multiple GPUs. Also, if multiple workstations require computations, they can connect to different servers and thus avoid queueing up tasks on a single server.
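As an illustration only, a minimal server-side setup with the libzmq C API could look as follows; the endpoints, buffer handling and message contents are hypothetical placeholders, not the thesis implementation:

#include <zmq.h>
#include <stdio.h>

int main()
{
    void* ctx = zmq_ctx_new();

    // SUB socket: receives the serialized scenario data published by the workstation.
    void* sub = zmq_socket(ctx, ZMQ_SUB);
    zmq_connect(sub, "tcp://workstation:5556");     // hypothetical endpoint
    zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "", 0);      // subscribe to all messages

    // REP socket: receives task requests and sends results back (blocking pair).
    void* rep = zmq_socket(ctx, ZMQ_REP);
    zmq_bind(rep, "tcp://*:5557");                  // hypothetical endpoint

    char buf[1024];
    int n = zmq_recv(sub, buf, sizeof(buf), 0);     // data message (e.g. a GPB payload)
    printf("received %d bytes of scenario data\n", n);

    zmq_recv(rep, buf, sizeof(buf), 0);             // task request
    // ... deserialize, launch the CUDA kernels, serialize the results ...
    zmq_send(rep, "done", 4, 0);                    // the reply unblocks the client

    zmq_close(sub); zmq_close(rep); zmq_ctx_destroy(ctx);
    return 0;
}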

3.3 Data decomposition

Once we transfer the data over to a machine with GPU hardware, we need to decompose the scenario data in order to proceed with parallel calculations on the GPU. In section 1.1.2 we mentioned that we wish to replace computationally heavy operations in the scenario tool, such as computing cash flows (NPV) and sensitivities (BPV). By taking a closer look at the scenario tool, we can identify that, given a yield curve as input to every scenario, the NPV and BPV calculations for each scenario are independent.

Given this information, we notice that the scenario tool resembles a grid of thread blocks on the GPU. Figure 3.7 illustrates this similarity.

Figure 3.7: Decomposition of the whole scenario tool into a GPU grid: (a) the scenario tool; (b) the GPU grid of scenarios.

Only the hedge calculation is a serial calculation, i.e. to compute hedge t1 we require hedge t0. Once we compute the BPVs, we can return to the client m BPV arrays of size n in order to compute the final hedge.

Since the calculations are independent, we can further decompose the data for a linear speed increase over multiple GPUs and multiple servers in the cluster. This decomposition is illustrated in figure 3.8.


Figure 3.8: Decomposition of the scenario tool for multiple servers with multiple GPUs for each server.
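As a sketch only (device counts, kernel and buffer names are made up), the scenario grid of one server could be split across its GPUs by partitioning the scenario rows and launching on each device:

#include <cuda_runtime.h>

// Hypothetical kernel: one thread block per scenario (n hedge rolls x m rate paths).
__global__ void scenario_kernel(double* results, int n, int m)
{
    // Placeholder body: each block (one scenario) would compute its NPV/BPV here.
    if (threadIdx.x == 0) results[blockIdx.y * n + blockIdx.x] = 0.0;
}

// d_results[dev] is assumed to be allocated on device 'dev' beforehand.
void launch_on_all_gpus(double** d_results, int n, int m)
{
    int devices = 0;
    cudaGetDeviceCount(&devices);
    int rows_per_gpu = (m + devices - 1) / devices;  // split rate-path rows across GPUs

    for (int dev = 0; dev < devices; ++dev) {
        cudaSetDevice(dev);
        dim3 grid(n, rows_per_gpu);                  // scenarios map to blocks
        dim3 block(256);                             // paths / grid nodes map to threads
        scenario_kernel<<<grid, block>>>(d_results[dev], n, rows_per_gpu);
    }
}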

3.4 Computation algorithms

In order to compute the BPV sensitivity, we calculate the central finite difference approximation:

\[ BPV = \frac{NPV^{+} - NPV^{-}}{2}. \]

In practice this requires running a simulation 3 times, since NPV^+ and NPV^- are the cash flows after we bump the yield curve by ±1bp. These 3 simulations are independent of each other and we can parallelize them by utilizing CUDA's concurrent streams. Each cash flow is computed with a kernel call and all 3 kernels can run concurrently as long as we have available resources, namely registers, local and shared memory.


Procedure ComputeGPU
Input: Configuration, Scenarios, Clusters, Yieldcurves
Output: NPV, BPV

    // Construct yield curves, allocate memory
    configure(...)
    // Scenarios map to a GPU grid
    grid.x = n, grid.y = m
    block.x = threads_per_block
    shift <- {-0.0001, 0.0, 0.0001}
    if MC then
        generate_paths<<<...>>>()
        for i in shift do
            MC_GPU<<<grid, block>>>(shift[i], stream[i], ...)
    else if FD then
        for i in shift do
            FD_GPU<<<grid, block>>>(shift[i], stream[i], ...)
    return NPV, BPV


Figure 3.9: Overlap of kernel computations

Procedure ComputeGPU shows pseudocode for the parallelized kernel calls, either for the Monte Carlo or the finite difference method. The MC_GPU and FD_GPU calls issued on the different streams are the kernels that run concurrently. Figure 3.9 illustrates the overlap of the kernel calls.

3.4.1 Monte Carlo procedure

The Monte Carlo procedure is divided into two kernel calls, as we can observe from procedure ComputeGPU. Every scenario in the tool uses the same random paths. In section 2.3.1 we mentioned that the transformed Hull-White paths x are the same for every scenario, since they depend only on the Hull-White parameters. The purpose of the first kernel call is to generate and store these paths for later usage in the cash flow calculation kernel.

For this project we use two different path generation kernels, one for vanilla paths and one for antithetic paths. For the vanilla kernel, we spawn as many threads as the required simulations. For the antithetic kernel, we spawn half the threads. In essence, every thread will generate two paths, one normal and one antithetic. For example, if we require 3000 paths, 1500 of them are random draws and 1500 are their antithetic counterparts.

In order to draw normal pseudo-random numbers on the GPU, we use the CURAND library [19] that is distributed with CUDA Toolkit 5.0.


3.4.1.1 Path generation kernels

Procedure generateNormalPaths demonstrates the kernel for normal paths. The paths are written into a flattened 2D array in a way that achieves coalesced access for all the threads.

Procedure generateNormalPaths
Input: Configuration, states, paths

    // Compute current thread's id
    double *x
    for i = thread_id to sims do
        // Shift the pointer
        x = paths + i
        *x = 0.0
        for t = 1 to max_tenor do
            // Shift the pointer
            x += sims
            rnd = curand_normal_double()
            *x = x_process(x_{t-1}, k, σ, dt, rnd)
        i <- i + blockDim.x
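For reference, a self-contained CUDA sketch of such a path generation kernel with CURAND could look as follows; the seeding, launch configuration and storage layout are simplified assumptions, not the thesis code:

#include <curand_kernel.h>

// One thread per simulated path; paths are stored time-major so that at each
// time step consecutive threads write to consecutive addresses (coalesced access).
__global__ void generate_normal_paths(double* paths, int sims, int max_tenor,
                                      double k, double sigma, double dt,
                                      unsigned long long seed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= sims) return;

    curandState state;
    curand_init(seed, i, 0, &state);

    double x = 0.0;                  // x(0) = 0
    paths[i] = x;
    double decay = exp(-k * dt);
    double stdev = sigma * sqrt((1.0 - exp(-2.0 * k * dt)) / (2.0 * k));

    for (int t = 1; t < max_tenor; ++t) {
        double z = curand_normal_double(&state);
        x = decay * x + stdev * z;   // exact OU advancement (section 2.3.1)
        paths[t * sims + i] = x;     // coalesced write
    }
}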

Procedure generateAntitheticPaths demonstrates the kernel for the antithetic parts. The main difference is that each thread computes two paths, one normal and its antithetic part. Each pair is stored in memory consecutively.

3.4.1.2 Cash flows kernel

The cash flow calculation kernel is independent of the method we used to generate the paths, as long as we do not wish to compute the standard error of our Monte Carlo simulation. This is because, in order to compute the standard error and the corresponding 95% confidence interval, we need to reduce the NPV result for the antithetic pairs. For testing purposes, we implement a kernel to do that, but for pure calculations, in order to optimize, we omit this.

Procedure MonteCarloGPUOptimized demonstrates pseudo-code for the cash flows under the Monte Carlo method. Some notes about the Monte Carlo method:


Procedure generateAntitheticPaths
Input: Configuration, states, paths

    // Compute current thread's id
    double *x
    for i = thread_id to sims/2 do
        // Shift the pointer
        x = paths + 2*i
        *x = 0.0
        *(x+1) = 0.0
        for t = 1 to max_tenor do
            // Shift the pointer
            x += sims
            rnd = curand_normal_double()
            *x = x_process(x_{t-1}, k, σ, dt, rnd)
            *(x + 1) = -*x
        i <- i + blockDim.x


Procedure MonteCarloGPUOptimized
Input: Configuration, Scenarios, Clusters, Yieldcurves, shift
Output: Result

    // Point to the correct data depending on the block
    __shared__ double npv[threads]
    for i = thread_id to sims do
        for t = 1 to max_tenor do
            r = x + α(f(0,t), x, k, σ, t)
            swap_rate = swap(r, ...)
            for c in clusters do
                NPV += e^{-r Δt} (R_t + P_t(swap_rate) + C_t)
    npv[thread_id] = NPV
    Reduction(npv)
    if thread_id == 0 then
        Result[block] = npv[0] / sims

• The NPV accumulated by each thread is stored in shared memory and is reduced in parallel after all the simulations are performed. The parallel reduction is useful because it eliminates the need for atomic operations or the need to constantly write to slow global memory.

• Observing the pseudocode, we notice that we first loop through the remaining months of the tenor and then through all the clusters in the portfolio. This is because we want to minimize the function calls for transforming the x process to the short rate r and for computing the swap rates required for the prepayment rates. This optimization requires the use of a local array for each thread (not shown in the pseudocode) in order to store the corrected notional of each cluster from the previous time step. The size of the local array is equal to the number of clusters in the portfolio. This implies that the run time is not linear as the number of clusters increases, since at some point (depending on hardware capabilities) we will spill data from local (fast) memory into global (slow) memory.
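The Reduction(npv) step is not spelled out in the pseudocode; a standard shared-memory tree reduction, assuming the block size is a power of two, is sketched below (the function name is illustrative):

// Standard shared-memory tree reduction over the per-thread NPV values;
// assumes 'threads' (the block size) is a power of two.
__device__ void block_reduce_sum(double* npv, int threads)
{
    __syncthreads();                            // make all per-thread NPVs visible
    for (int stride = threads / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            npv[threadIdx.x] += npv[threadIdx.x + stride];
        __syncthreads();
    }
}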


3.4.2 Finite difference procedure

In section 2.4 we described the PDE for the Arrow-Debreu prices and we used an implicit scheme for its numerical solution. The implicit scheme requires the solution of a system of equations at every time step. A tridiagonal system is solved efficiently with the Thomas algorithm [23] in O(n) operations. The Thomas algorithm is inherently sequential because of the Gaussian elimination and thus it is not suitable for a parallel GPU solver, since every node of the grid is essentially a GPU thread. For this purpose we use the Parallel Cyclic Reduction (PCR) algorithm [12]. This algorithm requires O(n log n) operations, but since it runs in parallel, it is faster than Thomas on the GPU.
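For contrast, the sequential Thomas algorithm for a tridiagonal system (sub-diagonal l, diagonal c, super-diagonal u, right-hand side q) is sketched below; the PCR solver used on the GPU replaces this forward/backward sweep with log2(n) parallel reduction steps. The function and argument names are illustrative:

// Sequential Thomas algorithm: solves a tridiagonal system in O(n);
// the solution overwrites q, and c is modified in place.
__host__ __device__
void thomas_solve(const double* l, double* c, const double* u, double* q, int n)
{
    // Forward elimination.
    for (int i = 1; i < n; ++i) {
        double w = l[i] / c[i - 1];
        c[i] -= w * u[i - 1];
        q[i] -= w * q[i - 1];
    }
    // Backward substitution.
    q[n - 1] /= c[n - 1];
    for (int i = n - 2; i >= 0; --i)
        q[i] = (q[i] - u[i] * q[i + 1]) / c[i];
}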

Pseudocode for the calculation of the cash flows with the finite difference solver is given in procedure FDGPU.

Notes on the finite difference solver:

• In the finite difference kernel, we cannot loop over the tenor period first as we did in the Monte Carlo kernel. This is because every cluster requires its own solution of the system, since the Arrow-Debreu prices are corrected for prepayments.

• We do not require the local array to store notionals as in Monte Carlo.

• The same parallel reduction optimization is used in the finite difference kernel as in the Monte Carlo kernel.


Procedure FDGPU
Input: Configuration, Scenarios, Clusters, Yieldcurves, shift
Output: Result

    // Point to the correct data depending on the block
    __shared__ double npv[threads]
    for c in clusters do
        Q_0 = δ(x)
        for t = 1 to max_tenor do
            r = x(t) + α(t)
            swap_rate = swap(r, ...)
            Λ_t = ĝ(swap_rate)
            // Construct matrix of coefficients
            c_{t+1} = (...), u_{t+1} = (...), l_{t+1} = (...)
            if t != 1 then
                Q_t = (1 - Λ_t) Q_t
            solve_tridiagonal(c, u, l, Q)
            NPV += Q_t · (R_t + C_t + Λ_t · N_t)
    npv[thread_id] = NPV
    Reduction(npv)
    if thread_id == 0 then
        Result[block] = npv[0]


Numerical Results

In this chapter the results of the project are presented. Section 4.1 demonstrates the results of the Monte Carlo method. This includes a convergence analysis for normal and antithetic samples. Section 4.2 demonstrates the results of the scenario tool for the finite difference method. Both methods are tested against the current proprietary tool, implemented in Java. Finally, we test the speed-up achieved by both the Monte Carlo and the finite difference method.

In order to test the GPU implementation, in terms of correctness and optimization results, the following test case was created in the database:

Seing Value

clusters 100

mortgage type bullet

future hedge months 60 interest rate paths 49 prepayment functions 1

Table 4.1: Test case values for the scenario tool.

The total number of scenarios is future hedge months × interest rate paths × prepayment functions = 2989. The hardware used for both the reference and the GPU implementation is summarized in table 4.2.


Seing Reference GPU Tool Notes

CPU 2.47 GHz 2 cores i7xX990 12 cores reference multithreaded

RAM 4GB 32GB

GPU - 2 Kepler K20c One GPU is utilized

Table 4.2: Hardware used for reference and GPU implementations

The implementation of the cash flow calculation for the portfolio in the reference system is based on a trinomial tree approach, as in the original Hull and White paper. The reference system is multithreaded with Windows threads on the two cores; its GPU is not utilized at any point. On the GPU system the host code is sequential. Only one of the two Keplers is used for the results.

4.1 Monte Carlo

Before testing the full scenario tool, the Monte Carlo method requires a convergence analysis. This is helpful in order to identify how many paths we require for acceptable convergence and accuracy.

Finally, the full scenario case, as presented in table 4.1, is tested. The portfolio's cash flows (NPV), sensitivity (BPV) and notional profile are computed and tested against the reference trinomial tree implementation.

4.1.1 Convergence analysis

For testing the convergence of the Monte Carlo method, a set-up similar to that of the introduction was used, see table 4.3:


Seing Value

clusters 100

mortgage type bullet

future hedge months 1 interest rate paths 1 prepayment functions 1

Table 4.3: Test case values for the convergence analysis.

The Hull-White parameters used are k = 0.03, σ = 0.01 and T = 120M. Note that not all clusters had the same tenor; the maximum tenor was T = 120M. Other than implementation details, this does not affect the result.

From figure 4.1 we notice the superiority of the antithetic variance reduction. Antithetics achieve convergence much faster. This is more noticeable when we look at the convergence of the standard error in figure 4.2.

In order to understand the convergence of the antithetic samples, we decompose an arbitrary payoff f(z), z ~ N(0,1), into a symmetric (even) and an antisymmetric (odd) part:

\[ f_e(z) = \frac{f(z) + f(-z)}{2}, \qquad f_o(z) = \frac{f(z) - f(-z)}{2}, \]

so that clearly f(z) = f_e(z) + f_o(z). The symmetric and antisymmetric parts are uncorrelated, see [9] p. 208, and thus the variance is equal to:

\[ \mathrm{Var}[f(z)] = \mathrm{Var}[f_e(z)] + \mathrm{Var}[f_o(z)]. \]

From the variance of f, we notice that if f is a linear function with respect to the Gaussian increments, V ar[fe(z)] = 0and the antithetic sampling will remove all the variance. More

generally, if f is antisymmetric. f = fothe antithetic sampling becomes extremely aractive.

In the case of the mortgage valuation, the payoff depends on the parametric function ˆg(r(t)). If this function is nearly linear most of the variance will be removed.

In the case studied in the convergence analysis, we compute V ar[f(z)] = 42.75 and

V ar[fe(z)] = 0.07. Solving for the odd variance:

V ar[f (z)] = V ar[fe(z)] + V ar[fo(z)]

42.75 = 0.07 + V ar[fo(z)]


Figure 4.1: NPV convergence for normal and antithetic samples: (a) NPV convergence, (b) NPV convergence on a log10 scale.


Figure 4.2: Standard error convergence for normal and antithetic samples: (a) standard error convergence, (b) standard error on a log10 scale.


Figure 4.3: NPV convergence for normal and antithetic variables. The shaded area represents the 95% confidence interval for the normal variables. Both methods converge to the same result.


Solving gives Var[f_o(z)] ≈ 42.68, i.e. the antisymmetric part of our payoff provides about 99.85% of the variance. These results agree with those obtained for the linear example demonstrated in [6]. In order to test the effect of the prepayment function, further research is required with carefully structured prepayment functions that represent a less linear case, such as Heaviside functions.
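To make the antithetic construction concrete, the sketch below shows how each GPU thread can generate one Gaussian draw with cuRAND and average the payoff over the pair (z, −z). It is a minimal single-increment illustration, not the kernel used in this project; payoff() is a hypothetical device function standing in for the discounted mortgage cash flows along a path.

#include <curand_kernel.h>

// Hypothetical payoff of one path driven by the standard normal draw z.
__device__ double payoff(double z);

__global__ void antithetic_kernel(unsigned long long seed, int n_paths, double *estimates)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_paths) return;

    curandState state;
    curand_init(seed, tid, 0, &state);

    double z = curand_normal_double(&state);       // one standard normal draw
    // Averaging over (z, -z) cancels the odd (antisymmetric) part of the payoff,
    // so only the variance of the even part remains in the estimator.
    estimates[tid] = 0.5 * (payoff(z) + payoff(-z));
}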

The convergence results allow us to identify that 3000–5000 antithetic paths are a good starting point for the scenario tool.

Figure 4.3 illustrates the convergence of the two methods and the 95% confidence level (shaded area) for the normal variables. This plot demonstrates that both methods converge to the same value.

As mentioned in the modeling section, along with the BPV we also require the notional profile in order to perform the final hedging calculation. We test the convergence of the notional profile for five discrete cases: N = 10^3, 10^4, 2 × 10^4, 6 × 10^4 and 10^5.

Figure 4.4: Notional profile decay over the time steps for N = 1000, 10000, 20000, 60000 and 100000 sample paths.


Figure 4.5: Scaling of the computational time of one scenario with an increasing number of sample paths.

Figure 4.4 reveals something interesting: even with a low number of paths, the notional profile has almost converged, and increasing the number of samples does not noticeably change the result. This allows us to implement the scenario tool as a hybrid model: the notional profile can be computed on the CPU using a small number of samples, while the NPVs and BPVs are calculated asynchronously on the GPU.
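A minimal sketch of this hybrid split is given below, assuming a hypothetical GPU kernel npv_bpv_kernel, a hypothetical CPU routine notional_profile() and a Portfolio type; the actual scenario tool may organize the overlap differently.

// Hypothetical names used only for illustration.
struct Portfolio;
__global__ void npv_bpv_kernel(const double *scenarios, double *results);
void notional_profile(const Portfolio &portfolio, int paths, double *out);

void run_hybrid_scenario(const double *d_scenarios, double *d_results,
                         const Portfolio &h_portfolio, double *h_notionals)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // NPVs and BPVs on the GPU, launched asynchronously on the stream.
    npv_bpv_kernel<<<256, 128, 0, stream>>>(d_scenarios, d_results);

    // Meanwhile the CPU computes the notional profile with few sample paths,
    // which figure 4.4 shows is already sufficient for convergence.
    notional_profile(h_portfolio, 1000, h_notionals);

    // Wait for the GPU before the final hedge calculation combines both results.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}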

Finally, we wish to test how one scenario scales with an increasing number of sample paths N. Figure 4.5 shows a linear increase in computational time with respect to the number of sample paths N for both methods. Moreover, both methods take essentially the same time, i.e. neither method is faster than the other.

4.1.2 Scenario Tool analysis

From the previous section we identified that 3000–5000 antithetic samples provide an accurate result while still being computationally efficient. For the remainder of the scenario analysis we use antithetic samples with 3000 paths, although we also perform a short convergence analysis of the whole scenario tool for a small number of different sample sizes. The numerical results obtained from the scenario analysis are compared against the production system implementation (sequential trinomial tree). We have to keep in mind that this is not an analytical result, but is used only as a guideline. We also want to compare the speed-up against that implementation, and thus comparing the numerical results as well makes sense.

4.1.2.1 NPV cash flows

Consider the case presented in table 4.1. We compare the NPV of the mortgage cash flows in figure 4.6.

Figure 4.6: NPV of the scenario tool under the GPU and the reference system. Every path on the plot represents one of the 49 interest rate paths. On the x axis we have the roll into the future over the 60 hedge months.

Every one of the paths in figure 4.6 represents one of the 49 interest rate paths in the test case. These paths are not to be confused with Monte Carlo paths; they represent the y dimension of the scenario grid as seen in 1.1. Additionally, on the x axis we have the roll into the future hedge months, i.e. the x dimension of the scenario grid. For every one of these time steps a full simulation is performed (three simulations when the BPV calculation is included). To observe any differences between the two systems, we plot the relative difference in figure 4.7.
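As a hedged illustration of why a hedge month can require several valuations, the sketch below computes the NPV and BPV by bump-and-revalue. YieldCurve and value_portfolio are hypothetical names, value_portfolio standing for one full GPU valuation under a parallel shift of the curve, and the use of a central difference (rather than a one-sided one) is an assumption, not a statement about the production code.

struct YieldCurve;                                             // hypothetical curve type
double value_portfolio(const YieldCurve &curve, double shift); // hypothetical full valuation

struct NpvBpv { double npv; double bpv; };

NpvBpv value_with_sensitivity(const YieldCurve &curve)
{
    const double bp = 1e-4;                                    // one basis point
    double base = value_portfolio(curve,  0.0);                // valuation 1: base NPV
    double up   = value_portfolio(curve, +bp);                 // valuation 2: curve shifted up
    double down = value_portfolio(curve, -bp);                 // valuation 3: curve shifted down
    return NpvBpv{ base, (up - down) / 2.0 };                  // central-difference BPV
}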


Figure 4.7: Relative difference in NPV between the GPU and the reference system (minimum, mean and maximum over the scenarios).

From the difference plot, we observe that the two implementations yield very similar results: the mean over all scenarios differs by less than about 1 basis point. We observe a spike at hedge time T = 11 for all scenario paths. This could be attributed to the discontinuity of the forward curve, due to the linear interpolation, and to the way the two systems calculate the time fraction.

As mentioned in the beginning, the standard error of the MC simulations was tested for four discrete sample sizes, N = 10^2, 10^3, 3 × 10^3 and 5 × 10^3, in order to test the convergence and identify a good balance between efficiency and accuracy. The standard error plots are given in figures 4.8 and 4.9.


Figure 4.8: Standard error of the NPV over all scenarios for the four sample sizes (N = 100, 1000, 3000 and 5000).

Figure 4.9: Breakdown of the standard error per sample size for all the discrete cases; the mean is also provided.

As expected, the more paths we use, the lower the error of the simulation. The simulation time required for these cases is reported in table 4.4.
