
University of Amsterdam

Ortec Finance

Master Thesis Computational Science

Scalable GPU Accelerated Framework for Risk-Neutral Model Calibration and (Nested) Simulation

dhr. A.A (Alex) de Geus 10071539


Scalable GPU Accelerated Framework for Risk-Neutral Model Calibration and (Nested) Simulation

Author:

dhr. A.A (Alex) de Geus

Date:

August 18, 2016

Ortec Finance supervisors:

dhr. J. (Joris) Cramwinckel MSc.

dhr. ir. S. (Stefan) Singor

Ortec Finance BV

Boompjes 40

3011 XB, Rotterdam

The Netherlands

Academic supervisors:

Supervisor:

dhr. dr. B.D. (Drona) Kandhai

Second reader:

dhr. dr. M.H. (Mike) Lees

Third reader:

dhr. C.S.L. (Kees) de Graaf MSc.

University of Amsterdam

Faculty of Sciences

Science Park 904

1090 GE Amsterdam

The Netherlands


Abstract

In this thesis we present a scalable high performance computing platform for heterogeneous computation problems utilizing Graphical Processing Units (GPUs) and demonstrate its use in the acceleration of nested Monte-Carlo simulations (outer real world scenarios with embedded risk-neutral valuation) and model parameter calibration based on repeated Monte-Carlo valuation. The framework makes use of job schedulers with a simple task submission interface. We demonstrate near perfect CPU and GPU scalability, GPU utilization and CPU-GPU concurrency, with accelerations of up to ≈ 180 times in comparison with a vectorised Python implementation on the CPU. Furthermore, we apply the framework to the calibration of the Heston model for several asset classes and investigate the effects of introducing a linear time dependency in the model parameters on the goodness of fit and parameter stability.


Acknowledgements

First of all, I would like to thank my daily supervisor at Ortec Finance, Joris Cramwinckel, for providing invaluable feedback and support throughout the thesis. Additionally, I would like to thank my other colleagues at Ortec Finance, with special mentions to Martin van der Schans, Patrick Tuijp and Stefan Singor for their support and suggestions. Furthermore, I would also like to thank Kees de Graaf and my supervisor Drona Kandhai from the University of Amsterdam for their support and provision of reading materials. Lastly, I would like to thank my family, friends and fellow students.


Contents

1 Introduction

2 Risk-Neutral Valuation with the Heston Model
   2.1 Multi-Asset Heston Model
   2.2 Monte-Carlo Simulation Techniques
   2.3 Model Calibration
      2.3.1 Time Dependency and Stability
      2.3.2 Multi-Asset Correlation Structures
   2.4 Conclusions

3 High Performance Computing with GPUs
   3.1 Python and PyCuda
   3.2 GPU Architecture
   3.3 Advanced CUDA Features
      3.3.1 CUDA Streams and HyperQ
      3.3.2 CUDA Multi Processing Service

4 Framework: Design, Requirements and Performance
   4.1 Simulators
   4.2 Calibrators
   4.3 Framework Solution
   4.4 Framework Validation
      4.4.1 Kernel Validation
      4.4.2 Simulator Performance
      4.4.3 Framework Overhead
      4.4.4 Nested Simulation Performance
      4.4.5 Calibration Performance

5 Applications
   5.1 Calibration of Single-Asset Valuation Models
   5.2 Calibration of Single-Asset Valuation Models: Time Dependent Structure
      5.2.1 Monthly Calibration and Backtesting

6 Conclusion

7 Future Works

Appendices

A Single-Asset Calibration


1 Introduction

Following the financial crisis and the subsequent introduction of the European Union's new risk-based insurance regulatory regime Solvency II, increased focus has been given to the regulation of risk. A key aspect of risk management for insurance companies and pension funds is the determination of the required capital to cover against unforeseen losses. A typical definition of required capital is based around a 1-year Value at Risk (VaR) metric. Here the capital requirements are defined in terms of a tail percentile of the market value of the balance sheet. An institution is considered solvent if the available capital is strictly positive for some future time T (e.g. 1 year) with probability X% (e.g. 99.5%). Here, the available capital is the amount of financial resources that can serve as a buffer against potential losses and is determined by a market consistent valuation of the balance sheet as the difference between the market value of assets and the market value of liabilities.

Future cash flows and the value of the assets of an institution's portfolio can be simulated by generating a large number of so called real world scenarios and determining the value of the balance sheet at each point. Real world scenarios are possible states of the economy generated based on an assumed joint distribution of all potential risk drivers, current market positions and investment strategies. This assumed distribution will be denoted as the real world probability measure P, which should provide realistic scenarios and capture as many stylized facts of the market as possible, where stylized facts are empirical observations of market behavior.

Most asset classes on the balance sheet, such as real estate or positions on the money market, are straightforward to value. However, some items, such as options, cannot be explicitly priced. An option is a financial derivative based on an underlying (e.g. a stock) giving the right, but not the obligation, to buy or sell an asset before or at a specified time. The simplest kind of options give the right to buy or sell a single share of common stock for a predetermined price, also named the exercise price or strike price. To price these options at market value, one has to rely on risk-neutral valuation models. The pricing of options in a risk-neutral model is done under the objective martingale measure Q rather than a subjective real world measure P. For a more thorough explanation of the risk-neutral measure we refer to (Black and Scholes, 1973) and (Gisiger, 2010).

Ortec Finance provides, as part of its core business, asset liability studies for financial institutions, such as insurance companies, pension funds and housing corporations. Typical financial products related to these institutions, such as issued mortgages, life insurance policies and pension plans, carry characteristics of an option. Such optionalities are referred to as embedded options. Pricing embedded options can be difficult as analytic solutions for risk-neutral models may not be readily available. In such cases, one has to rely on numerical simulation techniques. In this thesis, we will only consider Monte-Carlo simulation techniques. Though computationally expensive, they are a straightforward and relatively easy way to value complex contracts.

Calculations of future risk exposure using numerical simulation techniques as described above are so called nested simulations. The valuation of the balance sheet in a real world scenario requires a risk-neutral simulation, thus resulting in simulations within simulations. The concept of nested simulations is illustrated in Figure 1.1. Here, real world or outer scenarios are modeled under measure P and risk-neutral or inner scenarios are modeled under measure Q. Nested Monte-Carlo simulations scale poorly with respect to the number of (inner and outer) sample paths. Common real world simulations consist of 2,000 scenarios with a 10 year horizon and a risk-neutral balance sheet valuation for each year. Here, each risk-neutral valuation consists of 10,000 scenarios with over 10,000 time points each. The computational requirements are enormous. Furthermore, risk-neutral valuations rely on calibrated parameter sets, which are produced by an iterative optimization procedure based on repeated (Monte-Carlo) valuation, which may take a significant number of iterations. Due to the high computational requirements of these calculations, we aim to improve performance by accelerating the risk-neutral Monte-Carlo simulations in a nested simulation or calibration using Graphical Processing Units (GPUs), in order to enable Ortec Finance to provide its services within acceptable time.
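To make the scale concrete, the following back-of-the-envelope count uses only the figures quoted above (2,000 outer scenarios, one risk-neutral valuation per year over a 10 year horizon, 10,000 inner paths of 10,000 steps each); it is an illustration of the workload, not a measurement.

# Rough work count for the nested simulation described above.
outer_scenarios = 2_000       # real world scenarios
valuations_per_scenario = 10  # one risk-neutral valuation per year over a 10 year horizon
inner_paths = 10_000          # risk-neutral scenarios per valuation
steps_per_path = 10_000       # time points per risk-neutral scenario

path_steps = outer_scenarios * valuations_per_scenario * inner_paths * steps_per_path
print(f"{path_steps:.2e} inner path-steps")  # 2.00e+12 path-steps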

Figure 1.1: A nested Monte-Carlo simulation with outer real world scenarios (blue) and inner embedded risk-neutral scenarios (orange). Source: Ortec Finance

GPUs were originally designed to render images, but are increasingly popular in High Performance Computing (HPC) due to their parallel nature and large number of processing cores compared to Central Processing Units (CPUs). Due to the independence of each path, Monte-Carlo simulations are a prime candidate for massively parallel acceleration.


In (Cramwinckel, 2015) and (Cramwinckel et al., 2015) a framework is introduced to improve the performance of model calibration and nested simulations by offloading risk-neutral simulations to a GPU. We add to this concept by extending the modeling capabilities for multi-asset risk-neutral modeling and model parameter calibration, which further increases the required computational effort. We improve on accessibility, flexibility and concurrency by implementing an asynchronous framework using job schedulers. Furthermore, we improve the scalability of the framework by supporting multi-process and multi-GPU computing, where it is possible to flexibly impose the number of CPU processes that offload work to a single GPU.

Alternatives to high performance GPU implementations for nested Monte-Carlo simulations, such as Least Squares Monte Carlo (LSMC) regression, Stochastic Grid Bundling Methods (SGBM) and Option Interpolation Models (OIM), aim to reduce the workload rather than improve performance. LSMC methods aim to reduce the number of inner risk-neutral scenarios through an approximation of the conditional expectation by regressing on inaccurate option values. Similarly, SGBM approximates the conditional expectation by regressing on bundles of stochastic grid points. Alternatively, OIM reduces the number of required risk-neutral valuations by determining accurate option values in function space and interpolating and extrapolating between them. Such methods have originally been used for pricing so called early-exercise or American style options (Longstaff and Schwartz, 2001) (Jain and Oosterlee, 2015), where the problem of nested simulations also emerges, but have also been adapted for risk management calculations (Bauer et al., 2010) (Feng, 2016) (Ortec Finance, 2015). Here, an American style option may be exercised at any time or at predefined exercise dates before maturity. However, for the case of LSMC, many outer scenarios are required for accurate results. Additionally, for LSMC and SGBM the choice of regression function is not always straightforward. For the case of OIM, accurate risk-neutral valuations by Monte-Carlo simulation are still required. Using high performance GPU implementations, optimal overlap between real world and risk-neutral computations and scalable hardware, we can achieve the same effect without modeling difficulties: the additional work of risk-neutral valuations does not have an impact on the runtime of real world simulations. Additionally, high performance GPU implementations are also suitable for computationally intensive model parameter calibrations, which are also necessary for the use of LSMC, SGBM and OIM methods.

The structure of this thesis is as follows: first we will elaborate on risk-neutral valuation models and model parameter calibration methodologies (Section 2). Second, we will elaborate on the concept of High Performance Computing (HPC) using GPUs, the CUDA programming model and its features (Section 3). Third, we will introduce a scalable HPC framework using GPUs and validate our design choices by demonstrating the capabilities of the framework with respect to performance acceleration, CPU-GPU concurrency, CPU scaling, GPU scaling and GPU utilization (Section 4). Lastly, we apply the framework to the calibration of several asset classes at monthly intervals in the period 30/9/2008 to 31/12/2015 (Section 5). Ortec Finance provides, as a service, such calibrated parameter sets for a variety of asset classes each month to its customers. We investigate the goodness of fit for the obtained calibrated parameter sets. Furthermore, parameter stability through time is desirable for consistent market valuations. In general, model parameters are calibrated as if independent of previous realizations, which can lead to over-fitting and volatile parameter evolution through time. We investigate the improvement of parameter stability by enforcing a linear time dependency, in comparison to an expected loss in goodness of fit due to reduced degrees of freedom.


2 Risk-Neutral Valuation with the Heston Model

To price financial derivatives (in e.g. balance sheet valuations or model parameter calibrations) one has to rely on so called risk-neutral valuation models. Based on the principle that it should not be possible to make guaranteed profits larger than the risk-free rate, (Black and Scholes, 1973) show that it is possible to determine a theoretical fair price for an option as the discounted expected payoff V of the option at maturity T. This assumption is also known as the principle of no-arbitrage.

The fair option price at current time t is given by

V(S, t) = \mathbb{E}^{\mathbb{Q}}\!\left[ e^{-\int_t^T r(s)\,ds}\, V(S, T) \,\middle|\, \mathcal{F}_t \right]    (2.1)

with risk-free rate r and (\mathcal{F}_t)_{t \ge 0} a filtration generated by the price process S. The risk-free rate is often coupled to the interest on US treasury bills due to the nearly bulletproof credit rating of the US government. Note that the expectation is taken with respect to the martingale measure Q.

The value of the conditional expectation depends on the chosen dynamics of the price process S. In this thesis, we will mainly consider the Heston model (Heston, 1993) with stochastic volatility and dynamics

\begin{cases} dS_t = r S_t\,dt + \sqrt{\nu_t}\, S_t\, dW_t^S, & S_0 \ge 0 \\ d\nu_t = \kappa(\theta - \nu_t)\,dt + \gamma \sqrt{\nu_t}\, dW_t^\nu, & \nu_0 \ge 0 \end{cases}    (2.2)

with constant risk-free rate r, stochastic price variance ν_t with initial value ν_0, long term average price variance θ, mean reversion rate κ, volatility of the volatility γ, and where dW_t^S, dW_t^ν are Q Brownian motions with mean 0, variance dt and instantaneous correlation ρ, such that E[dW_t^S dW_t^ν] = ρ dt. We denote the full Heston parameter set as Ω_Heston = {κ, θ, γ, ν_0, ρ}. The model describes the evolution of a single asset through time.

2.1 Multi-Asset Heston Model

Research has mostly been focused on risk-neutral pricing models for single assets. However, in risk management one is interested in the value of the entire balance sheet, which may contain a variety of financial products based on different underlyings. Furthermore, traders and investors have become increasingly interested in multi-asset products, since they are cost-efficient and provide natural diversification across assets and market segments. Multi-asset options are derivatives with payoffs based on more than one underlying. One could model each asset individually, however correlations between assets exist and should be considered in these cases to determine accurate market values.

Correlations are a measure of linear dependency between random variables. For a set of Brownian motions dW_i ∈ {dW_1, . . . , dW_N} with mean 0 and variance dt, the Pearson correlation is defined as

\mathrm{Corr}(dW_i, dW_j) = \frac{E[(dW_i - E[dW_i])(dW_j - E[dW_j])]}{\sqrt{\mathrm{Var}(dW_i)\,\mathrm{Var}(dW_j)}} = \rho_{i,j}, \qquad \mathrm{Var}(dW_i) = E[(dW_i - E[dW_i])^2]    (2.3)

Substituting E[dW] = 0 and Var(dW) = dt, we retrieve E[dW_i dW_j] = ρ_{i,j} dt. Note that |ρ_{i,j}| ≤ 1, ρ_{i,j} = ρ_{j,i} and ρ_{i,i} = 1.

From this, we can define a real-valued symmetric N × N correlation matrix Σ with diagonal entries equal to 1. For the single-asset Heston model defined in Equation (2.2), we have

\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}    (2.4)

Σ can also be shown to always be positive definite. A matrix M is considered positive definite if a^T M a > 0 for any arbitrary nonzero column vector a. Equivalently, matrix M only has positive eigenvalues.

Consider N underlying assets, each with dynamics specified by the Heston model (Equation (2.2)), with i, j ∈ {1, . . . , N}. We can now define a 2N × 2N correlation matrix Σ with entries ρ^{SS}_{i,j}, ρ^{Sν}_{i,j} and ρ^{νν}_{i,j}, which denote stock-stock, stock-variance and variance-variance correlations between the ith and jth asset respectively. Here the diagonal entries ρ^{SS}_{i,i} and ρ^{νν}_{i,i} are equal to 1, the ρ^{Sν}_{i,i} entries are the original stock-variance correlations defined in the single-asset Heston model and the ρ^{SS}_{i,j}, ρ^{Sν}_{i,j} and ρ^{νν}_{i,j} entries for i ≠ j are additional cross correlation terms between the N single-asset models. Thus we have, in block form,

\Sigma = \begin{pmatrix} \Sigma^{SS} & \Sigma^{S\nu} \\ (\Sigma^{S\nu})^T & \Sigma^{\nu\nu} \end{pmatrix}_{(2N \times 2N)}    (2.5)

where \Sigma^{SS}, \Sigma^{S\nu} and \Sigma^{\nu\nu} are the N × N blocks with entries ρ^{SS}_{i,j}, ρ^{Sν}_{i,j} and ρ^{νν}_{i,j} respectively.

Correlation structures Σ, such as specified in Equations (2.4) and (2.5), can be imposed upon normally distributed random variables by using a Cholesky decomposition. Consider N Brownian motions with valid correlation matrix Σ, where Σ is an N × N, real-valued, symmetric and positive definite matrix. We can now apply a Cholesky decomposition, such that Σ = LL^T, where L is a lower triangular matrix. Take Z = {W_1, . . . , W_N}^T a vector of N independent Brownian motions following a standard normal distribution with mean 0 and variance 1. Let LZ be a new vector of random variables. It is now easy to show that the random variables in LZ have correlation matrix Σ.


\mu_{LZ} = E[LZ] = L\,E[Z] = 0
\Sigma_{LZ} = E[(LZ - E[LZ])(LZ - E[LZ])^T] = E[LZ(LZ)^T] = E[LZZ^TL^T] = L\,E[ZZ^T]\,L^T = LIL^T = LL^T = \Sigma    (2.6)

For the single-asset correlation structure of the Heston model defined in Equation (2.4), we have

LL^T = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad \text{where} \quad L = \begin{pmatrix} 1 & 0 \\ \rho & \sqrt{1-\rho^2} \end{pmatrix}

Thus Equation (2.2) is equivalent to

\begin{cases} dS_t = r S_t\,dt + \sqrt{\nu_t}\, S_t\, dW_t^S, & S_0 \ge 0 \\ d\nu_t = \kappa(\theta - \nu_t)\,dt + \gamma \sqrt{\nu_t}\left(\rho\, dW_t^S + \sqrt{1-\rho^2}\, dW_t^\nu\right), & \nu_0 \ge 0 \end{cases}    (2.7)

where W_t^S and W_t^ν are two independent Q Brownian motions.
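As a concrete illustration of Equation (2.6), the following NumPy sketch imposes the single-asset Heston correlation structure of Equation (2.4) on independent standard normal draws. The value ρ = −0.7 is an arbitrary example, not a calibrated parameter.

import numpy as np

rho = -0.7                                    # example correlation, not a calibrated value
sigma = np.array([[1.0, rho], [rho, 1.0]])    # correlation matrix of Equation (2.4)
L = np.linalg.cholesky(sigma)                 # lower triangular L with L @ L.T == sigma

rng = np.random.default_rng(seed=1)
Z = rng.standard_normal(size=(2, 1_000_000))  # independent standard normal draws
dW = L @ Z                                    # correlated draws (scale by sqrt(dt) for increments)

print(np.corrcoef(dW))                        # sample correlation matrix, close to sigma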

The addition of cross model correlations makes the evaluation of the conditional expectation for balance sheet valuations and multi-asset option payoffs increasingly complex. Furthermore, when the number of models N increases, the number of entries in correlation matrix Σ grows at a rate ∼ N^2. This increases the methodological challenges of determining appropriate values for each correlation parameter. Additionally, the computational effort required for producing correlated random numbers with Equation (2.6) in Monte-Carlo simulations increases as well. We will further elaborate on this in Section 2.3.2 and Section 2.2 respectively.

2.2 Monte-Carlo Simulation Techniques

When pricing options using complex underlying dynamics or payoff functions, analytic solutions to Equation (2.1) may not be available. In such cases, one has to rely on numerical methods. A Monte-Carlo simulation is a numeric approximation technique based on repeated random sampling of stochastic processes. By the strong law of large numbers: if a set of random variables X_i ∈ {X_1, . . . , X_N} are independent and identically distributed (i.i.d.) with mean μ and variance σ^2, then the sample average \bar{X} almost surely converges to μ when N → ∞. This property can be used to evaluate the expectation in Equation (2.1) by random sampling of V(S, T), such that

V(S, t) = \mathbb{E}^{\mathbb{Q}}\!\left[ e^{-\int_t^T r(s)\,ds}\, V(S, T) \,\middle|\, \mathcal{F}_t \right] \approx \frac{1}{N}\sum_{i=1}^{N} e^{-r(T-t)}\, V_i(S, T)    (2.8)

with N the number of samples, which should be sufficiently large, constant risk-free rate r and V_i the ith payoff sample.

The central limit theorem states that for N → ∞, the sum of i.i.d. random variables X_i with finite variance will approximate a normal distribution, such that

\sum_{i=1}^{N} X_i \sim \mathcal{N}(N\mu, N\sigma^2) \;\rightarrow\; \bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{N}\right) \;\rightarrow\; \frac{\sqrt{N}}{\sigma}(\bar{X} - \mu) \sim \mathcal{N}(0, 1)    (2.9)

where \mathcal{N}(\cdot) is a normal distribution. Using this property, a confidence interval for the true value of μ can be established from

\mathrm{Pr}\!\left(-a \le \frac{\sqrt{N}}{\sigma}(\bar{X} - \mu) \le a\right) = X\% \;\rightarrow\; \mathrm{Pr}\!\left(\bar{X} - a\frac{\sigma}{\sqrt{N}} \le \mu \le \bar{X} + a\frac{\sigma}{\sqrt{N}}\right) = X\%    (2.10)

where constant a (e.g. 1.96) corresponds to a confidence degree of X% (e.g. 95%). The unknown variance σ^2 can be estimated with

\sigma^2 \approx s^2 = \frac{\sum_{i=1}^{N}(X_i - \bar{X})^2}{N - 1}    (2.11)

where s^2 is the sample variance.
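The estimator of Equation (2.8) and the confidence interval of Equations (2.10)-(2.11) can be written down in a few lines. The sketch below uses arbitrary discounted payoff samples as a stand-in for actual Monte-Carlo output.

import numpy as np

def mc_estimate(discounted_payoffs, a=1.96):
    """Sample mean of discounted payoffs with an a-sigma (e.g. 95%) confidence interval."""
    x = np.asarray(discounted_payoffs, dtype=float)
    n = x.size
    mean = x.mean()
    s = x.std(ddof=1)                  # sample standard deviation, Equation (2.11)
    half_width = a * s / np.sqrt(n)    # confidence interval half-width, Equation (2.10)
    return mean, (mean - half_width, mean + half_width)

rng = np.random.default_rng(0)
samples = np.maximum(rng.lognormal(mean=0.0, sigma=0.2, size=10_000) - 1.0, 0.0)  # stand-in payoffs
value, ci = mc_estimate(samples)
print(value, ci)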

A sample V_i(S, T) can be obtained by generating paths based on the dynamics of the model. For the dynamics of the Heston model in Equation (2.2), paths can be generated by taking n steps from time t to maturity T with step size δt = (T − t)/n. For Heston parameter set Ω_Heston = {κ, θ, γ, ν_0, ρ} and initial stock price S_t, the next value S_{t+1} can be determined by

\begin{cases} S_{t+1} = S_t + r S_t\,\delta t + \sqrt{\nu_t}\, S_t\, \delta W_t^S \\ \nu_{t+1} = \nu_t + \kappa(\theta - \nu_t)\,\delta t + \gamma \sqrt{\nu_t}\, \delta W_t^\nu \end{cases}    (2.12)

where δW_t^S and δW_t^ν are random samples from two normal distributions with mean 0, variance δt and correlated with correlation parameter ρ. This simulation scheme is known as a forward Euler scheme. The approximation error of a forward Euler scheme increases with step size; therefore, for accurate results, one should take n → ∞ and δt → 0. Correlated random samples δW can be generated using a Cholesky decomposition as shown in Equation (2.6), by matrix multiplication of lower triangular matrix L and vector Z consisting of N independent random samples, where N is the number of Brownian motions. For the case of multi-asset modeling, N is equal to the sum of the number of Brownian motions over all models and can grow significantly large.

Due to the square root process √ν_t, the variance should remain strictly positive for all t. (Feller, 1951) showed that the boundary at the origin for Equation (2.2) is unattainable and non-attracting for 2κθ > γ^2, which is called the Feller condition. In practice the Feller condition is often not fulfilled, which implies that certain simulation schemes, such as an Euler scheme, can break down due to the square root process. Therefore, an appropriate scheme should be selected to ensure that the variance process never becomes negative.

Many alternative simulation schemes exist. In (Wadman, 2010), an overview is given of appropriate and efficient schemes for the Heston model, with an extension for multi-asset simulation of the popular Quadratic Exponential (QE) scheme introduced in (Andersen, 2007). In this thesis, only a truncated Euler scheme will be considered, which deals with the square root process by truncating ν at 0 using ν_{t+1} = max(ν_{t+1}, 0) after each step.
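A minimal NumPy sketch of the truncated Euler scheme of Equation (2.12), with the variance floored at zero after each step, is given below. The parameter values are placeholders for illustration, not calibrated ones; the GPU kernels of Section 4 implement the same recursion per thread.

import numpy as np

def heston_truncated_euler(s0, v0, r, kappa, theta, gamma, rho, T, n_steps, n_paths, seed=0):
    """Generate Heston asset paths with a truncated forward Euler scheme."""
    dt = T / n_steps
    rng = np.random.default_rng(seed)
    s = np.full(n_paths, s0, dtype=np.float64)
    v = np.full(n_paths, v0, dtype=np.float64)
    for _ in range(n_steps):
        z1 = rng.standard_normal(n_paths)
        z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n_paths)  # correlation as in Equation (2.7)
        sqrt_v = np.sqrt(v)
        s = s + r * s * dt + sqrt_v * s * np.sqrt(dt) * z1
        v = v + kappa * (theta - v) * dt + gamma * sqrt_v * np.sqrt(dt) * z2
        v = np.maximum(v, 0.0)   # truncation keeps the variance non-negative
    return s

paths = heston_truncated_euler(s0=100.0, v0=0.04, r=0.01, kappa=1.5, theta=0.04,
                               gamma=0.5, rho=-0.7, T=2.0, n_steps=240, n_paths=10_240)
print(paths.mean())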

The advantage of Monte-Carlo methods lies in the straightforward computation of complex contracts. Alternatives, such as finite difference or Fourier methods, do not scale well with the number of free variables or rely on the existence of a characteristic function. The Monte-Carlo simulation approach has no such issues. However, the expectation of a Monte-Carlo estimator typically converges with a rate of O(1/√N), as follows from the central limit theorem. This is relatively slow compared to other numerical methods. To obtain accurate results, many sample paths are required, which can require significant computational effort. However, in (Shapiro, 2001) it is noted that, given certain assumptions, the convergence may approach exponential rates. Additionally, variance reduction techniques, such as antithetic variables and control variates, exist to improve the convergence of the Monte-Carlo estimator using negative covariance between two random variables. We will not focus on such techniques in this thesis, but rather on the acceleration of path generation in general to overcome the required computational effort. For more information on Monte-Carlo methods and their application in finance we refer to (Glasserman, 2003).

2.3 Model Calibration

Valuation of derivatives using risk-neutral models relies on a model parameter set. Appropriate parameter values are crucial for accurate valuation of options. Model calibration is the process of finding such a set of model parameters that best fits some given target data.

In finance, model calibration can be divided into two main categories: the historic approach and the market implied approach. Of the two, the market implied approach is much more prevalent. The market implied approach considers current market prices to be most relevant. The model parameters are chosen such that relevant option prices are replicated by the model as closely as possible. Observable option prices in sufficiently liquid markets are considered arbitrage-free, since arbitrage opportunities will eventually be discovered and corrected accordingly. The market implied approach therefore produces by definition arbitrage-free prices. In contrast, the historic approach uses historic asset time series to best estimate the current parameters. The issue with this approach is that not all parameters are visible in the market. Furthermore, this approach is very model dependent and will almost always result in option prices different from those currently on the market, which are thus not arbitrage-free. Therefore, the market implied approach is preferred, though the historic approach is used when no liquid option data is available. An example for the Heston model is given in (Fatone et al., 2008), where filtering and maximum likelihood methods are used to keep track of the stochastic volatility process.

We will calibrate the single-asset Heston model using the preferred market implied approach. We calibrate the parameter set to Black-Scholes European option prices using historic implied volatility surfaces, asset and interest rate time series. To avoid putting emphasis on options that are far in or out of the money, we calibrate to both put and call options. The market implied parameter set Ω is the parameter set that minimizes

\min \sum_i \left| V_t^M(K_i, T_i) - V_t^{\Omega}(K_i, T_i) \right|    (2.13)

where V_t^M and V_t^Ω are the option values on the market and based on Ω at time t respectively, (K_i, T_i) the ith strike-maturity pair and |·| is some norm. The results of these calibrations are presented in Section 5.1.

2.3.1 Time Dependency and Stability

The market implied parameter set Ω determined by Equation (2.13) is modeled as if independent of past realizations and is calibrated solely to the current market. This can lead to over-fitting and volatile parameter evolution through time. In practice, stable model parameters are desirable for consistent market valuations. In (Singor et al., 2014) a linear mapping Φ_t : VIX_t → Ω^Heston_t is proposed for the Heston model calibration of the S&P 500 asset class using the VIX index. We apply a similar, more generic linear parametrization to the Heston model, which is not immediately linked to a volatility index, such that

Figure 2.1: Linear parametrization of Heston parameter set with static components a and b and dynamic time component x.
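Written componentwise, and assuming the linear form suggested by the caption (static components a and b, dynamic time component x_t) and by the notation Ω_t = f(a, x_t, b) used below, the parametrization of Figure 2.1 reads approximately

\Omega_t = a\, x_t + b, \qquad \text{i.e.} \qquad \Omega_t^{(k)} = a_k\, x_t + b_k \quad \text{for each Heston parameter } k \in \{\kappa, \theta, \gamma, \nu_0, \ldots\}

where this reconstruction is an assumption based on the caption, not a verbatim copy of the figure.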

Here, a and b are column vectors and x_t a time component with t ∈ {0, . . . , n} = t_0. The optimal parameter set defined by a = {a_κ, . . . , a_ν}, x = {x_0, . . . , x_n} and b = {b_κ, . . . , b_ν} is the parameter set that minimizes

\min \sum_{t \in t_0} \sum_i \left| V_t^M(K_i, T_i) - V_t^{(a, x_t, b)}(K_i, T_i) \right|    (2.14)

for all time points t in set t_0, where V_t^M and V_t^{(a, x_t, b)} are the option values on the market and based on Ω_t = f(a, x_t, b) (Figure 2.1) at time t respectively, (K_i, T_i) the ith strike-maturity pair and |·| is some norm. In Section 5.2 we will confirm that the time component x is indeed highly correlated with the respective volatility index of the considered asset class, as was noted in (Singor et al., 2014).

Such a time dependent structure is beneficial for creating stability in the parameters over time. In exchange, it is expected to lose some goodness of fit, since the number of degrees of freedom has been reduced. We investigate the trade-off between goodness of fit and model parameter stability. Additionally, in the case of (Singor et al., 2014), a calibrated parameter set can be obtained from the observable VIX index. We investigate if comparable behavior occurs for other asset classes besides the S&P 500 and VIX indices. The results are presented in Section 5.2.

2.3.2 Multi-Asset Correlation Structures

Multi-asset models are an extension of N single-asset models with additional specification of cross correlation parameters between the single-asset models (Section 2.1). General practice dictates that the single-asset model parameters are calibrated beforehand (e.g. Equations (2.13) and (2.14)) and then combined to calibrate the additional correlation parameters.

Consider N calibrated single-asset models with combined parameter set Ω_MA = {Ω_1, . . . , Ω_N}. Now consider additional multi-asset cross correlation terms ρ_MA. The implied calibrated multi-asset correlation structure is given by the multi-asset correlations ρ_MA that minimize

\min \sum_i \left| V_t^M(K_i, T_i) - V_t^{\Omega_{MA}, \rho_{MA}}(K_i, T_i) \right|    (2.15)

where V_t^M and V_t^{Ω_MA, ρ_MA} are the multi-asset option values on the market and based on {Ω_MA, ρ_MA} at time t respectively, (K_i, T_i) the ith strike-maturity pair and |·| is some norm.

Multi-asset options are options with payoffs dependent on more than one underlying asset. Such correlation sensitive options are required to accurately capture the correlation parameters. However, when liquid market data on correlation sensitive options is not available, a market implied approach is not suitable. In these cases, one can resort to a historic calibration approach. The equivalent historic calibrated multi-asset correlation structure is given by the multi-asset correlations ρ_MA that minimize

\min \left| \hat{\Sigma}^{(S,S)}(M) - \hat{\Sigma}^{(S,S)}(\Omega_{MA}, \rho_{MA}) \right|    (2.16)

where \hat{\Sigma}^{(S,S)}(M) and \hat{\Sigma}^{(S,S)}(\Omega_{MA}, \rho_{MA}) are the empiric log return correlations between asset price time series observable in the market and based on the model dynamics and model parameters respectively, and |·| some norm. Alternatively, \hat{\Sigma}^{(S,S)}(M) can be replaced by expert opinion.

Note that \hat{\Sigma}^{(S,S)}(M) is observed under the real world measure P. Girsanov's theorem tells us that the instantaneous correlations between Brownian motions are invariant under an equivalent measure change, however the empiric correlations are not. Therefore, \hat{\Sigma}^{(S,S)}(\Omega_{MA}, \rho_{MA}) should also be modeled under the real world measure by an equivalent measure change.
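The market-side quantity \hat{\Sigma}^{(S,S)}(M) in Equation (2.16) is simply the sample correlation matrix of log returns. A sketch of how it could be computed from aligned price histories is given below; the price array is a placeholder, not market data used in this thesis.

import numpy as np

def empirical_log_return_correlation(prices):
    """prices: array of shape (n_assets, n_observations) with aligned price histories."""
    log_returns = np.diff(np.log(prices), axis=1)   # log returns per asset
    return np.corrcoef(log_returns)                 # empiric correlation matrix

prices = np.array([[100.0, 101.0, 99.5, 102.0, 103.5],
                   [ 50.0,  50.4, 49.9,  51.0,  51.2]])   # placeholder data
print(empirical_log_return_correlation(prices))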

In both cases the combined single-asset model parameters Ω_MA remain fixed, based on their prior single-asset calibration.

In (Dimitroff et al., 2010) and (Wadman, 2010) a historic and an implied calibration approach is presented respectively for the calibration of the multi-asset Heston model for the cases described above, where

\rho^{\nu\nu}_{i,j} = 0 \quad \text{and} \quad \rho^{S\nu}_{i,j} = 0, \qquad \text{for } i \ne j

with i, j ∈ {1, . . . , N}. Here the assumption is made that all cross correlations between models are carried through the stock-stock correlations, such that

\Sigma = \begin{pmatrix} \Sigma^{SS} & \Sigma^{S\nu} \\ (\Sigma^{S\nu})^T & \Sigma^{\nu\nu} \end{pmatrix}, \qquad \Sigma^{S\nu} = \begin{pmatrix} \rho^{S\nu}_{1,1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \rho^{S\nu}_{N,N} \end{pmatrix} \quad \text{and} \quad \Sigma^{\nu\nu} = I

where Σ^{Sν} is a diagonal matrix containing the single-asset Heston stock-variance correlation parameters and Σ^{νν} is the identity matrix.

When the number of models N increases, the number of multi-asset correlation parameters in correlation matrix Σ grows significantly. For large numbers of free parameters, determining appropriate values through calibration becomes increasingly difficult. Both papers deal with the scaling complexity by performing the calibration in N(N−1)/2 two-dimensional sub-models. After the calibration of all sub-models, the obtained correlation sub-matrices can be combined into the complete correlation matrix Σ. However, there is no guarantee that Σ will be a valid correlation matrix (i.e. positive definite), even if all sub-matrices are valid correlation matrices. Therefore, some regularization algorithm is necessary to find the nearest valid correlation matrix. Some examples are given in (Dimitroff et al., 2010). Additionally, we refer to (Rapisarda et al., 2007).
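One simple regularization, not necessarily the one used in the cited papers, is to clip negative eigenvalues of the combined matrix and rescale the result back to unit diagonal. A sketch:

import numpy as np

def nearest_valid_correlation(sigma, eps=1e-8):
    """Project a symmetric matrix onto a (strictly) positive definite correlation matrix
    by clipping its eigenvalues and restoring the unit diagonal."""
    sym = 0.5 * (sigma + sigma.T)
    eigval, eigvec = np.linalg.eigh(sym)
    clipped = (eigvec * np.maximum(eigval, eps)) @ eigvec.T   # V diag(max(lambda, eps)) V^T
    d = np.sqrt(np.diag(clipped))
    return clipped / np.outer(d, d)                           # rescale to unit diagonal

sigma = np.array([[ 1.0, 0.9, -0.9],
                  [ 0.9, 1.0,  0.9],
                  [-0.9, 0.9,  1.0]])       # combined sub-matrices, not positive definite
fixed = nearest_valid_correlation(sigma)
print(np.linalg.eigvalsh(fixed))            # all eigenvalues now strictly positive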

2.4 Conclusions

We use the methodologies described in this section to perform risk-neutral Monte-Carlo valuations and to obtain calibrated parameter sets based on repeated Monte-Carlo valuation. Since both cases rely on computationally expensive Monte-Carlo simulations, we aim to accelerate the Monte-Carlo simulations using Graphical Processing Units (GPUs). We will elaborate on what GPUs are and how to use them in the next section.


3 High Performance Computing with GPUs

High-performance computing (HPC) considers the use of parallel processing techniques to improve application performance. In this thesis we will employ such techniques using Graphical Processing Units (GPUs) and the CUDA GPU programming model to accelerate (nested) Monte-Carlo simulations, which are computationally expensive.

A GPU is a specialized piece of hardware, originally used to simultaneously render each pixel on a screen many times per second. Driven by the need for 3D imaging, GPUs are able to handle advanced computational tasks, which has made them an attractive solution for parallel computations. With the release of General Purpose GPUs (GPGPU), such as the NVidia Tesla cards in 2006, using GPUs for HPC applications has become increasingly popular.

In this section we will give an introduction to GPU programmability and hardware architecture to introduce the concepts used in constructing our framework. For a more in depth discussion we refer to (Wilt, 2013).

3.1 Python and PyCuda

At Ortec Finance it is customary to implement research prototypes using Python. Python is an open source interpreted programming language designed with emphasis on code readability. Though Python is a high level programming language, it is increasingly accepted in the scientific community for use in high performance computing (HPC) applications. Python functions are often implemented in lower level programming languages such as C or FORTRAN. Additionally, Python offers bindings of libraries often used in HPC applications, such as CUDA, OpenCL, OpenMP and more.

For our GPU implementations, we choose to work with CUDA, a GPU programming model developed by NVidia, and its corresponding Python binding PyCuda, developed by Andreas Klöckner. Though other options are available, such as PyOpenCL from the same author, we choose to use PyCuda because it is the more mature programming model. Specifically, we will make use of the Multi Processing Service (MPS) module, for which currently no alternatives are available on other platforms. PyCuda offers useful abstractions for GPU implementations, such as GPU arrays. However, we choose to make limited use of them and to program as close to the hardware as possible, resulting in regular CUDA kernels which could easily be adapted to other CUDA bindings (e.g. in C++, .NET and Java).
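As a minimal illustration of the PyCuda workflow used throughout this thesis (compile a CUDA C kernel from a Python string, then launch it over a grid of threads), consider the following sketch. It is a toy kernel for scaling an array, not one of the pricing kernels of Section 4.

import numpy as np
import pycuda.autoinit                     # creates a context on the default GPU
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread identifier
    if (i < n)
        x[i] *= a;
}
""")
scale = mod.get_function("scale")

x = np.arange(1024, dtype=np.float32)
scale(drv.InOut(x), np.float32(2.0), np.int32(x.size),
      block=(256, 1, 1), grid=(4, 1))      # 4 blocks of 256 threads
print(x[:4])                               # [0. 2. 4. 6.]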

3.2 GPU Architecture

Unlike general Central Processing Units (CPUs), a GPU can consist of thousands of cores. GPUs are made up of so called streaming multiprocessors. A single GPU can contain up to several dozen streaming multiprocessors. Each streaming multiprocessor contains a number of streaming processors (or cores). A visual representation is given in Figure 3.1.

Figure 3.1: CPU (left) and GPU (right) processing core architecture. Source: http://blog.goldenhelix.com/wp-content/uploads/2010/10/cpu vs gpu.png

Each GPU core can execute a single thread at a time. Threads are the finest granularity in a GPU computation and are instances of a so called kernel function. Kernel functions are sets of instructions for computations on the GPU. A kernel function can be executed (or launched) on the GPU N times using N threads.

As a software abstraction, all threads are grouped into so called blocks and grids. A block contains some number of threads, and a grid contains some number of blocks. Each thread (in a block), block (in a grid) and grid are assigned unique identifiers, which can be used to select the data on which to operate. Threads in the same block are all executed on the same streaming multiprocessor.

In Figure 3.2 we illustrate the memory hierarchy on the GPU. The host device or CPU can copy data from and to the GPU through the global and constant memory. Additionally, each streaming multiprocessor contains a shared memory space. Threads in the same thread block are all executed on a single streaming multiprocessor and thus have access to the same shared memory space, in addition to a local register per thread. Threads can be used to operate on these memory spaces by use of kernel launches. However, the speed of memory access is not equal for each memory type. Access to shared memory can be up to 10 times faster than access to global memory, and access to the local register can be up to 10 times faster than access to shared memory. This fact is useful for performance optimizations.


Figure 3.2: GPU memory hierarchy and possible interactions between CPU (host), GPU (device), threads and memory.

To read data from memory one has to access the respective memory space. The number of simultaneous memory accesses is constrained by the memory bandwidth of the GPU. Therefore, a GPU contains numerous caches to accelerate computation when multiple threads require access to the same memory locations. This is particularly important due to the fact that threads are executed in so called warps. Warps are groups of threads (typically 32). All GPU instructions are issued per warp. That is, each line of code in the kernel function is executed 32 times before moving to the next. Deep conditional branching in kernel functions typically results in performance loss due to reduced concurrency, since threads in a warp can only pass through a single branching path at a time. Similarly, memory is accessed through coalesced memory access, where each thread accesses a memory space at the same time. Adapting the memory indexing, using the unique identifiers of each thread to access (nearly) the same memory space, can greatly improve performance. That is, one can potentially get 32 memory accesses for the price of 1.

3.3 Advanced CUDA Features

In the framework that will be introduced in Section 4, we will make use of a few CUDA specific features, which are beneficial for kernel programmability, CPU-GPU concurrency and GPU utilization. Namely, we will consider CUDA streams, HyperQ and the Multi Processing Service. We will introduce each feature and elaborate on its purpose in our framework. For a more thorough overview we refer to (Wende et al., 2014).

3.3.1 CUDA Streams and HyperQ

CUDA streams are an abstraction for series of tasks used to enable more coarse-grained concurrency. Streams can be used to execute memory copies and kernel launches asynchronously, allowing for CPU-GPU concurrency. Furthermore, GPUs with a high enough compute capability (3.5 or higher; NVidia denotes the GPU compute capability by number, e.g. 1.x for the Tesla architecture, 2.x for Fermi and 3.x for Kepler), such as Kepler GPUs, are able to execute streams concurrently using NVidia's HyperQ feature. With HyperQ, each stream has its own work queue, which will be processed in order, but different streams can operate in parallel. GPUs with the Fermi architecture contain only a single work queue, so workload offloaded by multiple streams is executed mostly serially: due to false dependencies created within the single queue, only the last job of one stream and the first job of the next are executed concurrently. A visual representation of stream execution is given in Figure 3.3.

Stream concurrency is crucial for GPU utilization and programmability. It allows concurrent kernel execution and overlap between kernel execution and memory copies. Without it, programmers are forced to combine multiple calculations into a single kernel in order to fully utilize the GPU, which may come at the cost of kernel programmability. With concurrent streams, one can easily launch multiple small and efficient kernel functions, which can properly utilize the GPU due to the launch concurrency.
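In PyCuda this pattern looks roughly as follows: page-locked host buffers, a stream per simulator, and asynchronous copies and kernel launches issued on that stream. This is a schematic sketch around the toy kernel of Section 3.1, not the framework code itself.

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
""")
scale = mod.get_function("scale")

stream = drv.Stream()                                   # one stream per simulator

host_in = drv.pagelocked_empty(1024, dtype=np.float32)  # page-locked host buffers allow
host_in[:] = np.arange(1024, dtype=np.float32)          # truly asynchronous copies
host_out = drv.pagelocked_empty(1024, dtype=np.float32)
dev_buf = drv.mem_alloc(host_in.nbytes)

drv.memcpy_htod_async(dev_buf, host_in, stream)         # copy, launch and copy-back are
scale(dev_buf, np.float32(2.0), np.int32(1024),         # queued on the stream; with HyperQ
      block=(256, 1, 1), grid=(4, 1), stream=stream)    # they overlap with other streams
drv.memcpy_dtoh_async(host_out, dev_buf, stream)

stream.synchronize()                                    # block only when the result is needed
print(host_out[:4])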


Figure 3.3: Stream execution without (left) and with (right) HyperQ. Only the circled kernels on the Fermi architecture are executed concurrently. Source: http://images.nvidia.com/content/pdf/tesla/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf

3.3.2 CUDA Multi Processing Service

Communications between the CPU and GPU are handled through a so called GPU context. A context is a container that encapsulates the state of all CUDA related objects for a specific GPU, such as allocated resources and actions performed within the API (Application Programming Interface). GPU contexts can be created, activated, deactivated or destroyed through simple API commands. In case of the latter, the system automatically cleans up all the active resources. A GPU can only have a single active context at a time. In order to enable multiple parallel CPU processes to offload work to a single GPU, a software layer for managing contexts is required. Context management based on context swapping (activation and deactivation of contexts) requires scheduling resources to be swapped on and off the GPU. GPUs that include the HyperQ feature handle this process under the hood. However, context swapping removes any possible concurrency between workloads from different clients (CPU processes). Therefore, we use the CUDA Multi Processing Service (MPS) module.

MPS is an unsupported feature developed by NVidia (Nvidia, 2015). The module allows the creation of an MPS server (or daemon), which is able to relay work of up to 16 clients to any number of GPUs. A visual representation is given in Figure 3.4. The MPS server shares one set of scheduling resources between all of its clients, eliminating the overhead of swapping when the GPU is scheduling between those clients. Furthermore, the MPS module allows concurrency between different clients, such that they can all provide workload for the GPU at the same time. This is crucial for GPU utilization in cases when a single client provides insufficient or periodic workloads that do not fully utilize the GPU at all times.

The MPS service is only available on Linux platforms for GPUs with compute capability 3.5 (Kepler architecture) or higher. Though these restrictions limit the applicability, MPS is a relatively simple way to enable multiple concurrent CPU processes to offload work to a single GPU.
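Starting and stopping the MPS control daemon is done outside of Python; a typical pattern, assuming a standard CUDA installation with nvidia-cuda-mps-control on the PATH and with pipe/log directories chosen here purely as examples, can be wrapped in subprocess calls as sketched below.

import os
import subprocess

# Example pipe/log locations; any writable directories will do.
os.environ["CUDA_MPS_PIPE_DIRECTORY"] = "/tmp/mps-pipe"
os.environ["CUDA_MPS_LOG_DIRECTORY"] = "/tmp/mps-log"
os.makedirs("/tmp/mps-pipe", exist_ok=True)
os.makedirs("/tmp/mps-log", exist_ok=True)

# Start the MPS control daemon; client processes inheriting the environment
# variables above will have their GPU work relayed through the MPS server.
subprocess.run(["nvidia-cuda-mps-control", "-d"], check=True)

# ... run the CPU worker processes that offload work to the GPU ...

# Shut the daemon down again.
subprocess.run(["nvidia-cuda-mps-control"], input=b"quit\n", check=True)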

Figure 3.4: Multiple MPS clients (OS/CPU processes) which are executing work on a single GPU. The solid lines represent the communication of the clients through the MPS server process as a single GPU context. The dashed lines represent the same scenario without the MPS module. Source: (Nvidia, 2015)


4 Framework: Design, Requirements and Performance

In this section we will describe the design of a framework which can be used to accelerate risk-neutral Monte-Carlo valuations using GPUs. We use this in the acceleration of both nested simulations with outer real world scenarios and embedded inner risk-neutral valuations, as well as model parameter calibrations based on repeated risk-neutral valuation. Different functionality of the framework will be illustrated and each design choice, its purpose and requirements will be covered. Using our design, we demonstrate the capabilities of the framework with respect to performance acceleration, CPU-GPU concurrency, CPU scaling, GPU scaling and GPU utilization for computationally intensive tasks encountered in risk management and risk-neutral valuation.

4.1 Simulators

The basic purpose of our framework is the acceleration of risk-neutral Monte-Carlo valuations. For this, we require functionality to perform Monte-Carlo path generation and payoff evaluation on the GPU. Therefore we implement a custom simulator Python class using PyCuda, which can be used to execute Monte-Carlo simulations on a single specified GPU (or CPU). The simulator can handle any model type for which kernel functions have been implemented. Currently only an implementation of the Heston model is available.

Upon creation, a simulator selects the specified kernel functions and allocates a chunk of global memory on the GPU to store the generated Monte-Carlo paths. Additionally, the simulator creates its own CUDA streams. Memory copies (CPU to GPU, GPU to CPU) and kernel launches are executed asynchronously on these streams using the HyperQ feature in order to provide stream and CPU-GPU concurrency. Memory allocations on the CPU are page-locked to attain the highest bandwidth for memory transfers between the CPU and the GPU.

After initialization a simulator can be (re)used indefinitely to launch any number of Monte-Carlo simulations (sequential on the CPU, concurrent on the GPU), given sets of model parameters Ω and a data container M with relevant market information, such as initial stock prices, interest rates and strike-maturity pairs. After execution, the data container is returned with the calculated option values and confidence margins. A visual representation is given in Figure 4.1. The reusability of a simulator is key to avoid wasting time on GPU initialization and memory allocations.


Figure 4.1: A Monte-Carlo valuation on the GPU based on market data M and a parameter set Ω. The calculated option values are returned after calculation.

4.2 Calibrators

For the parameter set calibration described in Section 2.3, we make use of the scipy.optimize Python module. Using this module, we can use the minimize function to minimize some objective function f(), with a choice of several well known optimization algorithms.

For initial parameter set Ω and market data M, containing relevant option values, initial stock prices, interest rates and strike-maturity pairs, an error function f() (i.e. Equation (2.13), (2.16) or (2.14)) can be minimized by an iterative procedure. Since f() requires (multiple) Monte-Carlo simulations at each iteration, we can use the simulator specified in Section 4.1 for acceleration. A visual representation is given in Figure 4.2.


Figure 4.2: A parameter calibration on the GPU based on market data M, initial parameter set Ω and repeated Monte-Carlo valuation on the GPU. Calculated option values are fed into the calibrator to calculate an error measure. This process is repeated until the error function has been minimized. The calibrated parameter set is then returned.

We use a Weighted Sum of Squared Mean Relative Errors (weighted SSE MRE) norm defined as

\epsilon = \sum_i \left( \frac{y_i - f(x_i)}{y_i^{w}} \right)^2    (4.1)

with weight w = 1/2, target y and input x. Using a relative measure ensures that errors are counted in proportion to their target value. Additionally, the weight term w = 1/2 lowers the impact of relatively large errors on small target values, which can otherwise dominate the error measure.

An important aspect of calibration is the ability to impose parameter bounds. Model parameters often have some real world interpretation and should be bound to some (finite) interval. However, not all optimization algorithms contain intrinsic properties to impose parameter bounds. Furthermore, difficult restrictions, such as the positive-definite property of correlation matrix Σ, are not easily specified and, if violated, will result in nonsensical results. An easy solution is to return a (very) large function value when parameter restrictions are violated. Note however that this may have adverse effects for some gradient based optimizers.

From the available optimization algorithms, we choose to use a deterministic minimizer based on Powell's method (Powell, 1964) due to its robustness and good performance in our general case. Powell's method is an unconstrained, non-gradient based optimization algorithm. The fact that it does not depend on the function gradient is important, as the estimation error of a Monte-Carlo valuation may have adverse effects on the optimization, in addition to our solution for enforcing difficult parameter restrictions.
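Putting the pieces of this section together, a calibration loop could look like the sketch below. The function simulate_option_values is a hypothetical stand-in for the GPU simulator of Section 4.1, and the bound checks are illustrative, not the exact restrictions used in the thesis.

import numpy as np
from scipy.optimize import minimize

def weighted_sse_mre(model_values, market_values, w=0.5):
    """Weighted sum of squared mean relative errors, Equation (4.1)."""
    y = np.asarray(market_values, dtype=float)
    f = np.asarray(model_values, dtype=float)
    return np.sum(((y - f) / y**w) ** 2)

def objective(params, market_values, simulate_option_values):
    kappa, theta, gamma, v0, rho = params
    # Return a large value when parameter restrictions are violated,
    # since Powell's method is unconstrained.
    if min(kappa, theta, gamma, v0) <= 0.0 or abs(rho) >= 1.0:
        return 1e10
    model_values = simulate_option_values(kappa, theta, gamma, v0, rho)
    return weighted_sse_mre(model_values, market_values)

def calibrate(initial_params, market_values, simulate_option_values):
    result = minimize(objective, x0=np.asarray(initial_params, dtype=float),
                      args=(market_values, simulate_option_values),
                      method="Powell")
    return result.x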


4.3 Framework Solution

In this thesis, we design a framework which can be used to accelerate risk-neutral Monte-Carlo valuations in (nested) simulations and model parameter calibrations using GPUs. For the framework, we choose to implement a job based system using process pools and pool queues. A process pool is a set of CPU processes (of variable size), which can be used to execute concurrent computations by submitting a set of tasks. These tasks are loaded into one of the processes in the pool and executed. When all processes are currently executing a task, excess tasks are placed into a queue. Once a process has finished executing a task, the next task from the queue is loaded into the process. Here, a task is nothing more than a function and its input. In our case, the function input consists of a parameter set and market data, where the function either runs Monte-Carlo simulations using a simulator, or optimizes a given objective function using a calibrator. Results of each task can be retrieved individually once they are finished or at any time thereafter. This job based system is further illustrated in Figure 4.3.

Figure 4.3: A pool queue with process pool. The process pool currently has an open slot; the next task will be loaded into the process to be executed. New tasks can be added to the queue at any time. Results can be retrieved once they are finished.

Python comes included with the standard multiprocessing module. This module enables the creation and management of parallel CPU processes through helpful classes. The process pool and pool queue can easily be set up using the Pool class. This pool object automatically includes an implementation of the Queue class, which handles the queuing of tasks. The number of processes in the process pool can be specified upon creation. Processes in the process pool are initialized with the creation of a simulator object, which is configured to execute tasks on a specific GPU (or CPU). Processes are distributed evenly across the available GPUs. This is illustrated in Figure 4.4. Since the number of processes in the process pool (or pool size) is variable, it is possible to flexibly impose the average number of processes that offload work to a single GPU. As mentioned in Section 3.3.2, to enable multiple CPU processes to offload work concurrently to a single GPU, we require the CUDA MPS module. Therefore, in addition to the creation of the process pool, we launch a so called MPS daemon to relay up to 16 CPU processes. To use this module, it is required to run on a Linux operating system with GPUs of compute capability 3.5 or higher. A stripped-down sketch of such a job based layer is given below.
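The sketch uses a per-process simulator created in a pool initializer and round-robin GPU assignment; the Simulator class, its run method and the task list are placeholders for the PyCuda simulator and workload described in Section 4.1, not the framework implementation itself.

import multiprocessing as mp

N_GPUS = 2          # number of available GPUs (example value)
POOL_SIZE = 8       # CPU processes that offload work, spread evenly over the GPUs

class Simulator:
    """Placeholder for the PyCuda simulator class of Section 4.1."""
    def __init__(self, gpu_id):
        self.gpu_id = gpu_id
    def run(self, parameters, market_data):
        return {"gpu": self.gpu_id, "parameters": parameters}

_simulator = None   # one simulator per worker process, created once and then reused

def _init_worker(counter):
    global _simulator
    with counter.get_lock():
        worker_id = counter.value
        counter.value += 1
    _simulator = Simulator(gpu_id=worker_id % N_GPUS)   # round-robin GPU assignment

def run_task(task):
    parameters, market_data = task
    return _simulator.run(parameters, market_data)

if __name__ == "__main__":
    tasks = [({"kappa": 1.5}, None) for _ in range(32)]      # placeholder tasks
    counter = mp.Value("i", 0)
    with mp.Pool(processes=POOL_SIZE, initializer=_init_worker,
                 initargs=(counter,)) as pool:
        handles = [pool.apply_async(run_task, (task,)) for task in tasks]
        results = [h.get() for h in handles]                 # retrieve results when finished
    print(len(results))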

In (Cramwinckel, 2015) it is noted that memory allocations are not shared between MPS clients. In particular, this refers to the memory allocations of random numbers used to generate the sample paths, which are a significant portion of the required memory. Allocating this memory N times for N clients is therefore quite inefficient. The Inter Process Communication (IPC) feature is proposed as a possible solution. Instead, we choose to bypass these memory allocations entirely by generating the random numbers directly on the GPU using the curand CUDA library. An additional benefit is that generating the random numbers directly on the GPU increases the operational intensity, since the random numbers no longer have to be read from global memory. The operational intensity is the ratio between floating point operations and memory accesses. According to roofline theory (Williams et al., 2009), in multi-core architectures, when the operational intensity is lower than the ratio of peak performance and memory bandwidth, the kernel performance is bounded by memory throughput instead of the computational capability. In (Williams et al., 2009) it is noted that off-chip memory bandwidth will often be the constraining resource for system performance in the foreseeable future. Having a high operational intensity ensures that the application performance scales with newer hardware with improved peak performance.

We use a pseudo-random number algorithm from the curand library based on bit generation with XORWOW generators to create the random numbers on the GPU. All random numbers are generated using the same seed. Additionally, each sample path is assigned a sequence number equal to its path number (the ith path in a Monte-Carlo simulation has path number i). Sequences for a specific seed are guaranteed to be statistically uncorrelated, while this may not be the case for different seeds. In this way we can make sure that all Monte-Carlo paths are independent from each other. Furthermore, a pseudo-random number generator with state defined by a specific seed and sequence number always generates the exact same set of random numbers. When running multiple Monte-Carlo simulations, we can impose correlations between simulations with Equation (2.6), since every Monte-Carlo path with equal path number generates the same random numbers.
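A sketch of how such path-local generators can be set up inside a kernel with the curand device API, compiled through PyCuda as before, is shown below. The seed and the use of the path index as the sequence number mirror the scheme described above; the kernel body itself is a simplified stand-in for the Heston kernels, not the production code.

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# no_extern_c=True so that curand_kernel.h is not wrapped in extern "C".
mod = SourceModule(r"""
#include <curand_kernel.h>

extern "C" __global__ void generate_normals(float *out, int n_paths, int n_steps,
                                            unsigned long long seed)
{
    int path = blockIdx.x * blockDim.x + threadIdx.x;
    if (path >= n_paths) return;

    // Same seed for every path, path index as sequence number: sequences of one
    // seed are statistically uncorrelated, and a (seed, sequence) pair is reproducible.
    curandState state;
    curand_init(seed, path, 0, &state);

    for (int step = 0; step < n_steps; ++step)
        out[path * n_steps + step] = curand_normal(&state);
}
""", no_extern_c=True)
generate_normals = mod.get_function("generate_normals")

n_paths, n_steps = 10_240, 240
out = np.empty(n_paths * n_steps, dtype=np.float32)
generate_normals(drv.Out(out), np.int32(n_paths), np.int32(n_steps), np.uint64(1234),
                 block=(256, 1, 1), grid=((n_paths + 255) // 256, 1))
print(out.mean(), out.std())   # approximately 0 and 1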


Figure 4.4: 3 CPU processes (blue circles) are evenly distributed over the available GPUs for 1 GPU (A), 2 GPUs (B) or 3 GPUs (C). The number of CPU processes and GPUs is variable.

The advantage of a job based framework lies in its accessibility, flexibility of task execution and scalability of computational resources. By providing a middle layer, work supplying threads no longer have to implement functionality for GPU communication and resource allocation. Once established, the allocated resources can be used indefinitely without any effort through a simple entry level task submission interface. Using the variable pool size in combination with the HyperQ and MPS modules enables full CPU-CPU and CPU-GPU concurrency and allows complete control over the number of CPU processes, the number of connected GPUs and the average number of CPU processes that can concurrently offload work to each of those GPUs. The full framework is illustrated in Figure 4.5. An equivalent CPU variant exists, where equivalent kernel functions are executed on the CPU rather than a GPU. We use this variant for result validation and benchmarking.

The resulting framework facilitates the acceleration of risk-neutral Monte-Carlo valuations in nested simulations and model parameter calibrations. Though currently only an implementation of the Heston model and European options is available, the framework's flexibility can support any risk-neutral model or payoff function if kernel functions are implemented. We will demonstrate the capabilities of the framework in the next section.


Figure 4.5: Full framework with process pool containing 8 CPU processes (blue circles) and 3 GPUs. Tasks can be submitted to the pool queue for execution. Each CPU process contains a simulator object bound to a specific GPU. GPU communications are relayed through the MPS daemon. The number of CPU processes and GPUs is variable.

4.4 Framework Validation

In this section we will demonstrate performance measures of risk-neutral Monte-Carlo valuations, model parameter calibrations and nested simulations and the effects of the implemented features on these performance measures, in order to validate our choices. Additionally, we will validate the martingale properties of the Monte-Carlo path generation and the option values computed on the GPU.

Performance results are produced on the UvA and VU DAS5 clusters using dual Intel Xeon (E5-2630 v3, 16-core, 20M Cache, 2.40 GHz) CPUs and NVidia GTX TitanX GPUs (Maxwell architecture), with Monte-Carlo simulations with a 2 year horizon, 10,240 scenarios and step size δt = 1/120 unless stated otherwise. Calibrations are performed using monthly historic market data on asset time series, interest rates and implied volatility surfaces in the period 30/9/2008 to 31/12/2015 (88 time points). For the risk-free interest rates, we use Overnight Index Swap (OIS) rates with 1 year maturity per asset class from its respective region. We calibrate to European put and call options with maturities of 1/4, 1/2, 1, 3/2 and 2 years and strikes of 90, 95, 97.5, 100, 102.5, 105 and 110 percent. For performance measures we only consider market data of the S&P 500 index. We demonstrate the capabilities of the framework with respect to performance acceleration, CPU-GPU concurrency, CPU scaling, GPU scaling and GPU utilization.


4.4.1 Kernel Validation

In order to validate the computed option values, each kernel is subjected to a set of tests. First off, we evaluate the arbitrage-free condition of the path generation by performing a martingale test. Under the martingale measure, the expected returns should equal the risk-free rate. Equivalently, the expected discounted asset price should equal the current asset price, such that

\mathbb{E}^{\mathbb{Q}}\!\left[ e^{-\int_t^T r(s)\,ds}\, S_T \,\middle|\, \mathcal{F}_t \right] = S_t \qquad (4.2)

By performing Monte-Carlo simulations we can check if the expectation holds for multiple maturities T . Validation of the martingale property of the asset price process is crucial as it is the foundation of risk-neutral pricing.
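A minimal sketch of such a martingale test is given below. It uses geometric Brownian motion with a constant rate as a stand-in for the Heston paths produced by the GPU kernels, and all parameter values are illustrative.

    import numpy as np

    def martingale_test(S0=100.0, r=0.01, sigma=0.2, T=2.0, n_paths=10_240, seed=1):
        """Check that the discounted terminal asset price averages back to S0."""
        rng = np.random.default_rng(seed)
        Z = rng.standard_normal(n_paths)
        ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
        disc = np.exp(-r * T) * ST                          # discounted terminal prices
        est, se = disc.mean(), disc.std(ddof=1) / np.sqrt(n_paths)
        return est, se                                      # expect |est - S0| within ~2*se

    estimate, stderr = martingale_test()
    print(f"E[exp(-rT) S_T] = {estimate:.3f} +/- {1.96 * stderr:.3f} (target 100.0)")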

In (Cramwinckel, 2015) it is shown that the use of single precision floating point values does not have adverse effects on the statistical properties of the path generation. Our tests are in agreement with these results. Single precision is preferred due to its significantly higher performance and lower hardware costs. Additionally, we find that the pseudo-random number algorithm used to generate random numbers directly on the GPU is sufficiently accurate: tests for normality, zero mean and zero correlation between sequences all pass, and no significant effect on the martingale property of the asset price process is found.
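The sketch below indicates how such checks can be scripted. The specific tests chosen here (Kolmogorov-Smirnov for normality, a one-sample t-test for zero mean, and the sample cross-correlation) are assumptions standing in for the exact test battery used in the thesis.

    import numpy as np
    from scipy import stats

    def rng_checks(x, y):
        """Return p-values for normality and zero mean, plus the cross-correlation of x and y."""
        x64, y64 = x.astype(np.float64), y.astype(np.float64)
        p_norm = stats.kstest(x64, 'norm').pvalue        # H0: samples are standard normal
        p_mean = stats.ttest_1samp(x64, 0.0).pvalue      # H0: the mean is zero
        corr = float(np.corrcoef(x64, y64)[0, 1])        # should be close to zero
        return p_norm, p_mean, corr

    rng = np.random.default_rng(0)
    a = rng.standard_normal(100_000).astype(np.float32)  # emulate single-precision GPU output
    b = rng.standard_normal(100_000).astype(np.float32)
    print(rng_checks(a, b))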

Additionally, we compare the option values computed by the CPU and GPU variants, as well as any available closed-form solutions. The CPU, GPU and closed-form option prices all lie within each other's respective confidence intervals.

4.4.2 Simulator Performance

In this section we evaluate the performance of plain risk-neutral Monte-Carlo simulations on the GPU using a simulator, by comparing it to its CPU counterpart. Although we do not have an optimized C implementation, our vectorized CPU implementation based on the highly optimized numpy Python module should be sufficient for a meaningful comparison. Note that we perform the computation using only a simulator object, without the framework middle layer (i.e. pool queue, process pool or MPS server).
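For reference, a minimal sketch of such a vectorized numpy path generator is shown below. The full-truncation Euler scheme and all parameter values are assumptions for illustration; the actual discretization used in the thesis is the one described in Section 2.2.

    import numpy as np

    def heston_paths_cpu(S0, v0, r, kappa, theta, sigma_v, rho,
                         T=2.0, n_steps=240, n_paths=10_240, seed=7):
        """Vectorized (full-truncation) Euler simulation of terminal Heston asset prices."""
        dt = T / n_steps
        rng = np.random.default_rng(seed)
        S = np.full(n_paths, float(S0))
        v = np.full(n_paths, float(v0))
        for _ in range(n_steps):
            z1 = rng.standard_normal(n_paths)
            z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n_paths)
            v_pos = np.maximum(v, 0.0)                        # full truncation of the variance
            S *= np.exp((r - 0.5 * v_pos) * dt + np.sqrt(v_pos * dt) * z1)
            v += kappa * (theta - v_pos) * dt + sigma_v * np.sqrt(v_pos * dt) * z2
        return S

    ST = heston_paths_cpu(S0=100.0, v0=0.04, r=0.01, kappa=1.5,
                          theta=0.04, sigma_v=0.5, rho=-0.7)
    print(np.exp(-0.01 * 2.0) * np.mean(np.maximum(ST - 100.0, 0.0)))  # ATM call estimate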

In Table 4.1 and 4.2 we present the runtimes of 100 single-asset simulations and of 1 multi-asset simulation consisting of 100 assets, respectively, for a variable number of sample paths. The reported computation times are averages of 5 runs; standard deviations are denoted in brackets. Both computations are equivalent, except that correlations between all 100 assets are additionally enforced in the multi-asset case.

We find accelerations of up to 178 times for the single-asset case. For the multi-asset case we observe a relative performance loss, with lower accelerations of up to 45.3 times. This is due to the fact that each asset simulation must generate its own random numbers, so the random numbers are generated 100 times. This is not the case for the CPU variant, where all random numbers are pre-generated once and simply stored in memory, which explains the observed relative performance loss. A kernel that is highly optimized with respect to coalesced access and shared memory, at the cost of readability, may achieve higher performance when the random numbers are read from memory, but it will likely become constrained by the GPU memory bandwidth on newer hardware with higher peak performance. A non-optimized kernel that reads the random numbers from global memory will almost surely result in even worse performance than generating the random numbers on the GPU.

The increased computation time between the single-asset and multi-asset case for both variants can be attributed to the larger number of required random numbers. The number of random samples that must be generated per asset increases proportionally to the number of assets in the multi-asset case, due to the increased number of Brownian motions. Therefore, there is a significant increase in computational load.

(Time in seconds)

# sample paths        CPU            1 GPU          speedup
512                   1.80 (0.017)   0.011 (0.000)  167×
1.024                 2.85 (0.031)   0.016 (0.000)  178×
10.240                19.2 (0.036)   0.130 (0.005)  148×

Table 4.1: Compute time in seconds of 100 single-asset Heston Monte-Carlo simulations using a simulator object on the CPU or GPU. Standard deviations of the computation time are denoted in brackets.

(Time in seconds)

# sample paths        CPU            1 GPU          speedup
512                   2.88 (0.008)   0.074 (0.006)  38.9×
1.024                 5.48 (0.014)   0.124 (0.005)  44.2×
10.240                43.6 (0.016)   0.951 (0.017)  45.8×

Table 4.2: Compute time in seconds of 1 multi-asset Heston Monte-Carlo simulation consisting of 100 assets using a simulator object on the CPU or GPU. Standard deviations of the computation time are denoted in brackets.

4.4.3 Framework Overhead

In this section we evaluate the framework latency (or overhead) by running a number of simple tasks that sleep for a specified amount of time. We define the overhead as the difference between the expected and realized runtimes.
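The measurement itself can be sketched as follows, using sleep-only tasks in a plain multiprocessing pool; the task duration and pool size used here are illustrative.

    import multiprocessing as mp
    import time

    def sleep_task(seconds):
        time.sleep(seconds)

    def measure_overhead(n_procs=16, n_tasks=128, task_time=0.01):
        """Overhead = realized wall time minus the ideal runtime of the sleep-only tasks."""
        expected = task_time * n_tasks / n_procs
        with mp.Pool(processes=n_procs) as pool:
            start = time.perf_counter()
            pool.map(sleep_task, [task_time] * n_tasks)   # submit the tasks to the pool queue
            realized = time.perf_counter() - start
        return realized - expected

    if __name__ == "__main__":
        print(f"overhead: {measure_overhead():.4f} s")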

In Table 4.3 we present the overhead for an empty process pool and pool queue for a variable pool size and number of submitted tasks. The reported computation times are averages of 5 runs; standard deviations are denoted in brackets. We find a negligible amount of latency for tasks to enter a process from the queue: ≈ 0.2 milliseconds per task for a single process and ≈ 0.04 milliseconds per task for 16 processes.


In Table 4.4 we present the equivalent case, but where each process contains an initialized GPU simulator instance. We find a non-negligible amount of overhead that does not scale with the number of tasks or pool processes. However, this latency only occurs on first execution; repeated task computation shows no sign of this overhead on subsequent runs. The reason is that the initialization of the simulator instance in each pool process is deferred until first use, and the latency on first execution is caused by these additional initialization procedures. The initialization time scales with the amount of resources that are to be allocated. By reusing initialized simulator instances, this overhead is completely avoided after first use and should be negligible for large computations. Equivalent latency is found for CPU simulator instances.

(Time in seconds)

# pool processes      32 tasks          64 tasks          128 tasks
1                     0.008 (0.0000)    0.015 (0.0000)    0.029 (0.0000)
16                    0.002 (0.0000)    0.004 (0.0001)    0.005 (0.0001)

Table 4.3: Pool queue and process pool overhead in seconds for empty processes. Standard deviations of the computation time are denoted in brackets.

(Time in seconds)

# pool processes      16 tasks        64 tasks        128 tasks
1                     0.79 (0.001)    0.80 (0.002)    0.83 (0.003)
16                    0.83 (0.007)    0.81 (0.007)    0.82 (0.008)

Table 4.4: Pool queue and process pool overhead in seconds for processes containing an initialized GPU simulator object. Standard deviations of the computation time are denoted in brackets.

4.4.4 Nested Simulation Performance

In (Cramwinckel, 2015), Tables 2 and 3, the performance of nested simulations (real world simulations with embedded risk-neutral valuations) is estimated using a mock-up model. This mock-up model is used because the current models of Ortec Finance do not yet contain an implementation of nested simulations. We use an equivalent setup in order to make a fair comparison. The mock-up model emulates a nested simulation with the following dimensions:

• real world simulation
  – number of scenarios: 2.000
  – number of periods: 5
• risk-neutral simulation
  – number of assets: 1
  – number of scenarios: 1.024


The computation time for a period in the real world simulation is sampled from a normal distribution with mean µ and variance σ². The real world simulation is divided into Light, Medium and Heavy classes with average computation times µ of 75, 150 and 300 milliseconds respectively, and with σ = 0.05. Given a parameter set (parameter calibrations can be performed before the nested simulation and are therefore not taken into account here), a risk-neutral simulation can be performed concurrently with the computation of its real world period. With optimal acceleration and computational overlap, it may be possible to finish all risk-neutral computations within the runtime of their real world periods, such that no performance loss occurs. At Ortec Finance, real world computations are executed in a multi-process setting, so we evaluate each class for a variable number of pool processes. The real world scenarios are then submitted to the pool queue. The expected computation times of plain real world simulations are given in Table 4.5, computed as t = (5 · 2000 · µ_case) / (# pool processes). We compare the expected real world simulation time with the realized computation time including the inner risk-neutral simulations.
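The mock-up can be sketched as below. A ThreadPoolExecutor and sleep calls stand in for the framework's pool queue and the GPU valuations, and the dimensions are reduced so the example runs quickly; none of these choices reflect the actual benchmark code.

    import time
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def real_world_period(mu, sigma=0.05):
        """Emulated real world computation for one period (times in seconds)."""
        time.sleep(max(np.random.normal(mu, sigma), 0.0))

    def risk_neutral_valuation(scenario, period):
        """Stand-in for a risk-neutral Monte-Carlo valuation offloaded to a GPU."""
        time.sleep(0.01)
        return scenario, period

    def nested_simulation(n_scenarios=4, n_periods=5, mu=0.150):
        futures = []
        with ThreadPoolExecutor(max_workers=4) as pool:            # stand-in for the pool queue
            for s in range(n_scenarios):
                for p in range(n_periods):
                    futures.append(pool.submit(risk_neutral_valuation, s, p))  # offload valuation
                    real_world_period(mu)                          # overlaps with the valuations
            return [f.result() for f in futures]

    if __name__ == "__main__":
        start = time.perf_counter()
        nested_simulation()
        print(f"combined runtime: {time.perf_counter() - start:.2f} s")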

(Time in seconds)

                      Benchmark case
# pool processes      Light    Medium    Heavy
1                     750.0    1500.0    3000.0
2                     375.0    750.0     1500.0
4                     187.5    375.0     750.0
8                     93.8     187.5     375.0
16                    46.9     93.8      187.5

Table 4.5: Expected computation time in seconds for a plain real world simulation.

In Table 4.6 we find that, due to the kernel optimization and the asynchronous design of the framework, no performance loss is incurred for even the most extreme cases presented in (Cramwinckel, 2015). We therefore extend the nested simulation to a multi-asset simulation consisting of 20 assets. The combined realized computation times and the percentage increases are presented in Table 4.7 and 4.8.

(Time in seconds)

                      1 GPU (benchmark case)
# pool processes      Light    Medium    Heavy
16                    49.2     96.2      189.6

Table 4.6: Realized computation time in seconds for a nested simulation of 1 asset.

We find that the CPU and GPU computations overlap near perfectly using asynchronous stream execution, the HyperQ feature and the MPS module. The combined computation time is determined solely by whichever computation takes the longest. Here, the risk-neutral computations take approximately 210 and 110 seconds for a single GPU and two GPUs respectively. Furthermore, for cases where the risk-neutral calculations dominate the computation time (i.e. the risk-neutral computation time is more than double the real world computation time), simply doubling the number of GPUs approximately doubles the acceleration, which is near perfect GPU scaling. Equivalently, doubling the number of CPU processes approximately doubles the acceleration for cases where the real world calculations dominate the computation time. The transitions are observable in Figure 4.7.

(Time in seconds)

                      1 GPU                          2 GPUs
# pool processes      Light    Medium    Heavy       Light    Medium    Heavy
1                     753.8    -         -           -        -         -
2                     382.6    752.3     -           381.2    754.1     -
4                     213.7    377.2     752.3       190.2    377.8     753.2
8                     211.0    213.9     376.5       107.4    190.0     377.4
16                    214.9    213.3     212.0       106.2    106.5     189.6

Table 4.7: Realized computation time in seconds for nested simulation of 20 assets.

                      1 GPU                           2 GPUs
# pool processes      Light     Medium    Heavy       Light     Medium    Heavy
1                     0.5%      -         -           -         -         -
2                     2.0%      0.3%      -           1.7%      0.5%      -
4                     14.0%     0.6%      0.3%        1.4%      0.7%      0.4%
8                     124.9%    14.1%     0.4%        14.5%     1.3%      0.6%
16                    358.2%    127.4%    13.1%       126.4%    13.5%     1.1%

Table 4.8: Percentage performance loss for a nested simulation of 20 assets.

Lastly, in Table 4.9 we present the runtimes of an equivalent simulation using 2 dual-chip NVidia Tesla K80 cards (4 GPUs, Kepler architecture), merely to illustrate the scalability with respect to the number of GPUs. In every case the risk-neutral computations dominate the simulation, and by doubling the number of GPUs the acceleration is once again doubled. Furthermore, we observe that a single NVidia Tesla K80 card (2 chips/GPUs) is comparable with a single NVidia GTX TitanX GPU (Maxwell architecture) for single precision calculations.


(Time in seconds)

# pool processes      1 GPU    2 GPUs    4 GPUs
8                     425.5    228.1     116.3

Table 4.9: Realized computation time in seconds for nested simulation of 20 assets with Tesla K80 GPUs.

4.4.5 Calibration Performance

We demonstrate the calibration performance for 88 monthly time points in the period 30/9/2008 to 31/12/2015 using S&P 500 market data, where we vary the number of CPU processes and GPUs. The calibrated parameter sets are determined by iterative minimization of Equation (2.13) using a calibrator, where V_t^Ω is determined by Monte-Carlo simulation using a simulator. Since we use a deterministic optimization algorithm and the same initial parameters, the calibration is equivalent for all cases. The results are presented in Table 4.10.
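A sketch of such a per-time-point calibration loop is given below. The toy pricing function, the Nelder-Mead optimizer and the weighted sum of squared relative errors are stand-ins for the GPU Monte-Carlo simulator call and the objective of Equation (2.13); the sketch only illustrates how a deterministic optimizer repeatedly triggers Monte-Carlo valuations.

    import numpy as np
    from scipy.optimize import minimize

    def toy_mc_prices(params, strikes):
        """Toy 'model prices' per strike; in the framework this is a GPU Monte-Carlo valuation."""
        level, skew = params
        return level * np.exp(-skew * (strikes / 100.0 - 1.0))

    def objective(params, strikes, market_prices, weights):
        model_prices = toy_mc_prices(params, strikes)
        rel_err = (model_prices - market_prices) / market_prices
        return float(np.sum(weights * rel_err**2))                 # weighted SSE of relative errors

    strikes = np.array([90.0, 95.0, 100.0, 105.0, 110.0])
    market = toy_mc_prices((10.0, 2.0), strikes)                   # synthetic "market" quotes
    weights = np.ones_like(strikes)

    result = minimize(objective, x0=np.array([8.0, 1.0]),
                      args=(strikes, market, weights),
                      method="Nelder-Mead",                        # derivative-free and deterministic
                      options={"xatol": 1e-6, "fatol": 1e-10})
    print(result.x, result.fun)                                    # recovers (10, 2) approximately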

We find that increasing the number of CPU processes that offload work to a single GPU significantly increases the performance of the calibration process. Using multiple CPU processes in combination with the MPS module allows multiple calibrations to be performed concurrently. Because a calibration alternates periodically between CPU and GPU computations, additional CPU processes help fill in the gaps while other calibration processes are performing their CPU computations. In our case, the GPU utilization saturates at approximately 4 CPU processes; however, no performance loss is introduced by adding further processes. Furthermore, doubling the number of GPUs approximately doubles the acceleration when the GPUs are doubly saturated (i.e. the workload is sufficient to fully utilize twice the number of GPUs, approximately 8 CPU processes in the transition from 1 to 2 GPUs in our case), which is near perfect GPU scaling.

(Time in seconds)

# pool processes      1 GPU    2 GPUs
1                     78.08    -
2                     55.98    39.40
4                     35.05    33.53
8                     34.32    18.65
16                    33.70    17.33

Table 4.10: Compute time in seconds for the calibration of 88 independent time points (30/9/2008 to 31/12/2015) for a single-asset Monte-Carlo simulation with 10.240 sample paths using S&P 500 market data. Total number of function evaluations: 28.850.


5 Applications

In this section we discuss the calibration results for several asset classes. We consider the S&P 500, Euro Stoxx 50, FTSE 100, Nikkei, and Swiss Market indices. Calibrations are performed using monthly historic market data on asset time series, interest rates and implied volatility surfaces in the period 30/9/2008 to 31/12/2015. For the risk-free interest rates, we use Overnight Index Swap (OIS) rates with 1 year maturity per asset class from its respective region. We calibrate to European put and call options with maturities of 1/4, 1/2, 1, 3/2 and 2 years and strikes of 90, 95, 97.5, 100, 102.5, 105 and 110 percent. Monte-Carlo simulations for the Heston model are performed with 30.720 scenarios, a 2 year horizon and step size δt = 1/120, unless stated otherwise.

We perform single-asset calibrations of the Heston model with time-independent and time-dependent parameters using the market implied approach introduced in Sections 2.3 and 2.3.1 with the Weighted Sum of Squared Mean Relative Error (weighted SSE MRE) norm (Equation 4.1). We compare both approaches with respect to goodness of fit and parameter stability, and investigate the correlation between the time component x and its respective volatility index for the time-dependent case.

5.1 Calibration of Single-Asset Valuation Models

Ortec Finance provides, as a service, calibrated parameter sets for a variety of asset classes to its customers each month. We emulate this process by performing monthly market implied calibrations of the single-asset Heston model based on historic market data as described in Section 2.3 and analyze the goodness of fit.

Calibrated Heston parameter sets for the 88 monthly time points in the period 30/9/2008 to 31/12/2015 for the S&P 500 and Euro Stoxx 50 indices are shown in Figures 5.1 and 5.2. These results, additional figures and the results for the FTSE 100, Nikkei, and Swiss Market indices can be found in Appendix A.
