Efficient estimation of the Solvency Capital Requirement using Neural Networks

Author: S.P.H.M. Frerix

University of Amsterdam, Amsterdam, the Netherlands
Supervisor: Dr. P.J.C. Spreij
Second reader: Dr. A.J. van Es

Deloitte Financial Risk Management, Amsterdam, the Netherlands
Supervisors: T.D. Kraaij, M. Westra

Master’s Thesis

Abstract

Since the introduction of Solvency II, insurers have to value assets and liabilities according to market consistent principles. One of the key risk metrics in the Solvency II framework is the Solvency Capital Requirement ("SCR"). The SCR is defined as the 99.5% Value-at-Risk of an insurer's loss distribution. Market consistent estimation of the SCR involves valuation of assets and liabilities under many economic scenarios. Due to the complex pay-off structure of some insurance liabilities, these values need to be estimated using Monte Carlo simulation for each economic scenario. This nested simulation scheme is computationally expensive and introduces the need for proxy models. Currently it is market practice to use polynomials as proxy models. The volatility adjustment described by the Solvency II regulation introduces non-differentiability in the market value of liabilities. This non-differentiability is not captured accurately by polynomials.

This thesis proposes two novel proxy models based on neural networks. These two models provide more accurate estimates of the SCR than the polynomials while still reducing the computational complexity. This thesis also shows that neural networks are appropriate functions for curve-fitting and do not fit the simulation noise obtained in the Monte Carlo procedure. It also shows that neural networks are able to approximate functions on small data sets.

Title: Efficient estimation of the Solvency Capital Requirement using Neural Networks
Author: S.P.H.M. Frerix, bastiaanfrerix@gmail.com, 11218053
Examination date: August 21, 2018
Supervisor: Dr. P.J.C. Spreij
Second examiner: Dr. A.J. van Es
Korteweg-de Vries Institute for Mathematics, University of Amsterdam
Science Park 105-107, 1098 XG Amsterdam

Supervisors: T.D. Kraaij, M. Westra
Financial Risk Management, Deloitte Risk Advisory B.V.

Contents

Abstract
Contents
Preface
1 Introduction
  1.1 Problem background
  1.2 Research questions
  1.3 Thesis outline
2 Literature review
  2.1 Solvency II technical documentation
  2.2 Calculation of the solvency capital requirement
  2.3 Neural networks
3 Neural networks
  3.1 Neural networks as approximators
  3.2 Introduction to statistical learning theory
  3.3 Model complexity of neural networks
  3.4 Training neural networks
4 Methodology
  4.1 Mathematical framework
  4.2 Real-world scenario generator
  4.3 Risk-neutral scenario generator
  4.4 Solvency II discount curve
  4.5 Volatility adjustment
  4.6 Estimating the market value of liabilities
5 Data
  5.1 Simulation methods
  5.2 Deterministic pension liabilities
6 Results
  6.1 Deterministic pension liabilities
  6.2 Variable annuities
  6.3 Validation of the direct model
7 Conclusion and discussion
Popular summary
Bibliography
Appendices
A Construction of the discount curve
B Economic scenario generator
C Supporting theorems

Preface

Six months ago I started this project with the ambition to master the concept of neural networks. It turned out to be a challenging, sometimes tedious, but also a highly rewarding and successful process. This text is the report of that process. While it is impossible to convey the frustration of searching through thousands of lines of code to find that missing minus sign, the tears of waking up next to another failed simulation attempt, or the joy of finally obtaining an accurate model, I have tried to the best of my abilities to write a clear and concise text. While many people have been involved in my achievements in academia, I would like to thank the following in particular. First, my supervisors Thom Kraaij (Deloitte) and Martijn Westra (Deloitte), for providing me with critical questions, comments and support throughout the process. I would like to thank Dr. Peter Spreij (University of Amsterdam) for the help in getting all the mathematical details right. Above all, I want to thank my parents and my brother and sister for their support during the process of writing this thesis and my time in academia in general.

Bastiaan Frerix


1 Introduction

This introductory chapter provides an overview of insurance and the Solvency II framework. It motivates the need for proper regulation of the insurance market and provides an introduction to the Solvency Capital Requirement. It illustrates the research problem central to this text and defines the research questions. At the end of this chapter the structure of the thesis is motivated.

1.1 Problem background

The origin of the insurance business dates back many centuries. While the first signed insurance contract originates from 1347 Genoa, Italy, the Babylonian empire already had guaranteed shipping loans around 1800 BC. But insurance precedes even these examples: think of prehistoric times, where men, women and children formed groups and alliances to exploit the benefits of collectivity. Since a historical overview of insurance is not the scope of this text, let us fast-forward to 1666, the great city fire in London. This disaster led to the founding of the first insurance company, Lloyd's. Since then, insurance has grown into one of the major businesses of the modern world. In 2016 the global amount of written premium in the insurance business was €3.66 trillion (Allianz, [1]). The enormous capital involved makes insurers some of the most important investors for both corporates and governments in need of funding. Another important aspect of the insurance business is on the side of the policyholders (i.e. the customers). Insurance undertakings provide financial protection against unexpectedly large losses of the policyholders. An example is health insurance. In the Netherlands, health insurance is mandatory: the government forces people to be a customer of an insurance company. Therefore it is only fair that the government also protects these customers against default of an insurer. Hence both the companies in which insurers invest and the customers of the insurance business benefit from proper risk management. Current guidelines for risk management in insurance are prescribed by the Solvency II framework (European Parliament and Council, [29]).


The Solvency II framework came into effect on January 1, 2016. It encompasses regulation for determining capital requirements, management of the balance sheet, construction of internal models, financial reporting and much more. Solvency II replaces Solvency I, which originated in the '70s. A major flaw of Solvency I was that it did not focus properly on the actual risks an insurer is subject to, resulting in a lack of motivation for proper risk management. Furthermore, Solvency I did not provide sufficient insight into the actual financial position of insurers and their risk sensitivities. Where Solvency I was a more country-focussed framework, Solvency II provides Europe-wide regulation, with the purpose of creating a European-wide level playing field (Verbond van verzekeraars, [37]). Solvency II can be decomposed into three pillars.

• Pillar 1 focusses on the quantifiable risks and capital requirements;
• Pillar 2 focusses on the risk management and business strategy of an insurer;
• Pillar 3 focusses on the publishing and reporting standards.

Pillar 1 describes risk metrics for insurance undertakings. When financial reserves fall below such metrics, this triggers regulatory actions to prevent default of an insurer. One of the key risk metrics in pillar 1 is the Solvency Capital Requirement (“SCR”). The SCR is the amount of capital an insurer needs to cover a 1-year loss which is expected to happen once in 200 years.

Pillar 1 also encompasses quantitative requirements. In Solvency II, assets and liabilities need to be valued according to market consistent valuation principles. This means that assets and liabilities need to be valued under different economic scenarios, which increases the complexity of estimating the SCR a great deal. The focus of this thesis is calculation of the SCR as prescribed by Solvency II regulation. Technical details regarding Solvency II are described by the European Insurance and Occupational Pensions Authority ("EIOPA"), which is a supervisory institution for insurance and pension undertakings. Solvency II prescribes a standard model for calculation of the SCR. If the business of an insurer is suitable for the standard model, they are allowed to use this standard model. If their business is not suitable for the standard model, they need to develop a (partial) internal model (see for example Figure 1.1). This model needs to be approved by the home state regulator of the insurance undertaking (for the Netherlands, this is the Dutch Central Bank or "DNB"). The focus of this thesis is developing an internal model for estimation of the SCR.


[Figure 1.1 shows a schematic internal model: marginal risk factor distributions and a correlation matrix feed an economic scenario generator that produces scenarios $x_t^{(i)}$. Each scenario yields a market value of assets $\mathrm{MVA}^{(i)}$ and of liabilities $\mathrm{MVL}^{(i)}$, hence a market value surplus $\mathrm{MVS}^{(i)} = \mathrm{MVA}^{(i)} - \mathrm{MVL}^{(i)}$. The loss per scenario is $\Delta^{(i)} = \mathrm{MVS}_{\text{today}} - \mathrm{MVS}^{(i)}_{\text{one year}}/(1+r)$ with $\mathrm{MVS}_{\text{today}} = \mathrm{MVA}_{\text{today}} - \mathrm{MVL}_{\text{today}}$, and the SCR is the 99.5% percentile of the distribution of the losses $\Delta$.]

Figure 1.1: Example of an internal model. This thesis is focussed particularly on estimation of the market value of liabilities.

1.2 Research questions

The main challenge with calculation of the SCR arises in calculating the market value of liabilities. Insurers typically have large balance sheets, for which market values need to be calculated under many different economic scenarios (also called shocks). Since the products on the liability side of the balance sheet might have a complex pay-off structure, there might not exist a closed-form solution for the market value of these products. Since insurance liabilities are not traded in the market either, there are no observable market prices. This means that the market value of liabilities needs to be estimated using Monte Carlo simulation under many economic scenarios. This is not feasible even with the powerful computing clusters of today (IBM, [4]). To overcome this computational complexity, insurers tend to use proxy methods for estimating the market value of liabilities. These methods include curve-fitting, replicating portfolios and least squares Monte Carlo. The research in this thesis investigates the curve-fitting approach.

One of the challenges in the curve-fitting approach is capturing the volatility adjustment. The volatility adjustment is a parallel shift to the liquid part of the discount curve. This shift dampens the effect of non-credit-related fluctuations in bond spreads (e.g. due to illiquidity of the bond market). Insurers are allowed to make this adjustment because for long-term liabilities the illiquidity in the bond markets is not relevant. Determining the size of the volatility adjustment is a complex calculation which depends on different market parameters such as interest rates and credit spreads. The volatility adjustment has a floor at 0 basis points, which causes it to be a non-differentiable function of its parameters.


Since the volatility adjustment is not differentiable, the market value of liabilities becomes non-differentiable as well. Currently it is market practice to estimate the market value of liabilities by means of polynomial regression. In contrast to polynomials, neural networks are able to fit non-differentiable functions. This thesis aims to improve on the polynomial fit by using neural networks as best estimate liability curves.

The research question central to this thesis can be formulated as follows: are neural networks able to outperform polynomial regression for estimation of the Solvency Capital Requirement by capturing the non-differentiability in the market value of liabilities introduced by the volatility adjustment? This research question is divided into the following sub-questions.

1. Are neural networks suitable for curve-fitting on small data sets?
2. Are the results obtained by neural networks stable?
3. Are neural networks able to capture non-differentiability better than polynomials?
4. Are neural networks able to approximate functions when the training data is subject to simulation noise?
5. Are neural networks suitable for curve-fitting in terms of computation time?

1.3 Thesis outline

The answers to the research questions in the preceding section are given from a theoretical and a practical point of view. To isolate the research questions, some simplifications are made. This thesis studies the impact of three market risk factors: credit spread risk, interest rate risk and equity risk. Assumptions have been made on the underlying distributions of the risk factors. The volatility adjustment calculated in this thesis is a slightly simplified variant, in that it does not contain the "cost of downgrade" of bonds.

The framework of this thesis is as follows. Chapter 2 consists of a literature study. The technical documentation published by EIOPA is described briefly. The definition for the Solvency Capital Requirement and the volatility adjustment are given. Different approaches for estimating the SCR have been studied. Chapter 3 describes theoretical results for neural networks. Chapter 4 describes the internal model that is built for answering the research questions from a practical point of view. Chapter 5 describes the procedures for generating the data that is used in the internal models. Chapter 6 provides the obtained results and Chapter 7 gives the main conclusions of the thesis and further research possibilities.

To keep the text as concise as possible, conventional mathematics has been moved to the appendix. Although these may be skipped without missing essential knowledge on the practical subject matter, the reader is highly encouraged to read through the appendix in order to get a crisp picture of the mathematical details that are in play.


2 Literature review

This chapter outlines the literature study conducted for this research. Section 2.1 describes the essential concepts from the technical documentation published by EIOPA. Section 2.2 contains different subsections that outline various approaches for estimation of the SCR. One of the things contributing to the difficulty of this thesis is the lack of literature. This is explained at the end of this chapter.

2.1 Solvency II technical documentation

One of the important aspects of Solvency II is the market consistent valuation of the balance sheet. This means that insurers have to use the market value of their assets and liabilities instead of book values for evaluating their economic capital position. This section discusses two major subjects of Solvency II that are of importance in estimation of the SCR. First, the definition of the SCR as provided by the European Parliament and Council in Directive 2009/138/EC [29] is described. Since the original definition of the SCR is provided in writing rather than mathematics, it is important to make sure that the mathematical definition is consistent with the textual definition. To this end, a text from Niemeyer et al. [8] is discussed.

A major driver of market values is the risk-free term structure of interest rates. The risk-free term structure of interest rates is published by EIOPA in the technical documentation [10]. Risk-free rates for insurance undertakings are somewhat different from the usual risk-free rates. Insurance undertakings are allowed to make an adjustment to the liquid part of the term structure. This adjustment is called the volatility adjustment. The volatility adjustment was introduced to reduce the impact of fluctuations in market prices of assets that are not due to credit-related movements. Capturing the behaviour of the volatility adjustment and its impact on the SCR is part of the main goal of this thesis.


2.1.1 Definition of the Solvency Capital Requirement

The definition of the SCR in the directive of the European Parliament and Council [29] is the binding definition. This definition is given in writing rather than in mathematics.

Definition 2.1.1. "It [Solvency Capital Requirement] shall correspond to the Value-at-Risk of the basic own funds of an insurance or reinsurance undertaking subject to a confidence level of 99.5% over a one-year period."

The Value-at-Risk is defined as the amount of losses that is expected at a certain probability level. In the definition of the SCR, this means that the SCR is the amount of funds an insurance undertaking needs to cover 99.5% of the losses over a 1-year horizon. As discussed in Niemeyer et al. [8], this definition leaves various interpretations of the SCR open. One of the interpretations they discuss is the following:

Definition 2.1.2. The Solvency Capital Requirement ("SCR") is defined as the 1-in-200 one-year loss of a financial institution:
$$\mathrm{SCR} := \mathrm{VaR}_{99.5\%}\left( \mathrm{MVS}_{\text{today}} - \frac{\mathrm{MVS}_{\text{one year}}}{1 + r} \right),$$
where MVS is the market value surplus, i.e. the difference between the market value of assets and the market value of liabilities, and r is the interest rate.
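In an internal model this definition is typically evaluated empirically: simulate the one-year market value surplus under many real-world scenarios, form the discounted losses, and take the 99.5% percentile. A minimal sketch (the function name and inputs are illustrative, not the thesis's implementation):

```python
import numpy as np

def estimate_scr(mvs_today, mvs_one_year, r, alpha=0.995):
    """Empirical SCR estimate in the spirit of Definition 2.1.2.

    mvs_today    : market value surplus at t = 0 (scalar)
    mvs_one_year : simulated one-year market value surplus per scenario, shape (n,)
    r            : one-year risk-free rate used for discounting
    alpha        : Value-at-Risk confidence level (99.5% under Solvency II)
    """
    losses = mvs_today - mvs_one_year / (1.0 + r)   # loss per scenario
    return float(np.quantile(losses, alpha))        # 99.5% percentile of the losses
```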

2.1.2 The volatility adjustment

What makes the Solvency II curve special is the volatility adjustment. The volatility adjustment is defined in EIOPA's technical documentation for construction of the Solvency II risk-free term structures [10]. The volatility adjustment is a constant that is added on top of the liquid part of the risk-free curve. This leads to a decrease in the market value of liabilities. Although EIOPA calculates the volatility adjustment for the current economic scenario each month, in the internal model of the insurer the volatility adjustment needs to be calculated under different economic scenarios. Calculation of the volatility adjustment will be discussed in Section 4.5. For now, it is enough to note that the volatility adjustment has a floor at 0 basis points, which means that the volatility adjustment is not differentiable. This non-differentiability motivates the use of neural networks to estimate the SCR.
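The only feature of the volatility adjustment needed for the argument in this thesis is the floor at 0 basis points. The toy function below (the 65% application ratio and the single spread input are illustrative assumptions, not the full EIOPA calculation of Section 4.5) shows how the floor introduces a kink, i.e. a point of non-differentiability:

```python
def volatility_adjustment(risk_corrected_spread_bp, application_ratio=0.65):
    """Illustrative volatility adjustment with a floor at 0 basis points.

    The floor max(0, .) makes the adjustment, and hence the market value of
    liabilities, a non-differentiable function of the underlying spread.
    """
    return max(0.0, application_ratio * risk_corrected_spread_bp)

# volatility_adjustment(-20.0) == 0.0 and volatility_adjustment(20.0) == 13.0:
# the kink sits at a risk-corrected spread of 0 bp.
```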

2.2 Calculation of the solvency capital requirement

As seen in Definition 2.1.2, calculating the SCR is a percentile estimation of the loss distribution of the insurer's balance sheet. Market consistent valuation under Solvency II means that assets and liabilities need to be valued under a large set of different scenarios of next year's economy. This economy is described by a set of risk factors. An example of a risk factor is the level of the current interest rates.


2.2.1 Nested Monte Carlo

Upon introduction of Solvency II, many insurers struggled with estimation of the SCR. Although the directive prescribes a standard formula for estimation of the SCR, many insurers find themselves in positions where the standard model does not suffice (e.g. see Lozano et al. [2]). A paper from Bauer et al. [5] describes a nested simulations approach for estimation of the SCR. The need for nested simulations arises from the fact that the market value of liabilities for insurers usually does not admit an analytic formula. This means that for each state of next year's economy, the market value of liabilities needs to be estimated by means of a risk-neutral simulation. This nested simulation approach is not feasible in practice (see IBM, [4]) due to the size of an insurer's balance sheet and the complexity of its products.
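The nested scheme can be summarised as an outer loop over real-world scenarios and an inner risk-neutral valuation per scenario; its cost is (number of outer scenarios) times (number of inner paths) liability valuations, which is what makes it infeasible in practice. A schematic sketch with placeholder inputs (the scenario objects and payoff function are hypothetical):

```python
import numpy as np

def nested_mc_liability_values(outer_scenarios, inner_paths, discounted_payoff):
    """Nested simulation sketch: one inner risk-neutral valuation per outer scenario.

    outer_scenarios   : list of real-world states of next year's economy
    inner_paths       : number of risk-neutral paths per outer scenario
    discounted_payoff : function (scenario, rng) -> discounted liability cash flow
    """
    rng = np.random.default_rng(0)
    mvl = np.empty(len(outer_scenarios))
    for i, scenario in enumerate(outer_scenarios):
        samples = [discounted_payoff(scenario, rng) for _ in range(inner_paths)]
        mvl[i] = np.mean(samples)        # inner Monte Carlo estimate of MVL
    return mvl
```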

2.2.2 Curve-fitting

One approach that attempts to solve the computational problem is curve-fitting. The curve-fitting approach tries to find curves that define a mapping between risk factors and market values of liabilities. These curves are called best estimate liability curves. To obtain such curves, the market value of liabilities is calculated using nested Monte Carlo simulation for a small number of values of each risk factor. A regression analysis on these calculated values then provides a best estimate liability curve.

Although curve-fitting is popular in the insurance industry (see IBM, [4]), it does present its own problems. It is a challenge to find curves for products with complex guarantees, or when management actions in certain scenarios result in highly non-linear loss behaviour (IBM, [4]). Furthermore, it is important to cover a sufficiently wide range of risk factor values. If it is determined that a 1-in-200 year shock for an equity index is 40%, the calibration set to which the curves are fitted should cover at least the 40% shock. The curve-fitting methodology has been used for a number of years by many of the leading insurance firms in the UK, according to the research conducted by IBM [4]. One method of estimating the best estimate liability curves is polynomial regression. The problem with regression based on polynomials is that polynomials of finite degree are not able to approximate non-differentiable functions to high accuracy.
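In its simplest single-risk-factor form, the procedure above amounts to running the expensive nested valuation on a small calibration grid and fitting a polynomial to the results; the fitted curve is then evaluated on the full real-world scenario set. A sketch under these assumptions (names and degree are hypothetical; the calibration grid should at least cover the 1-in-200 shock):

```python
import numpy as np

def fit_bel_curve(calibration_shocks, mvl_calibration, degree=3):
    """Fit a polynomial best estimate liability curve on a small calibration set.

    calibration_shocks : risk-factor values on which nested Monte Carlo was run
    mvl_calibration    : corresponding market values of liabilities
    """
    coefficients = np.polyfit(calibration_shocks, mvl_calibration, deg=degree)
    return np.poly1d(coefficients)   # cheap proxy, evaluated on the full scenario set
```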

2.2.3 Replicating portfolios

Another proxy model is the replicating portfolio approach. This approach aims to replicate the cash flows from the liabilities as well as possible by selecting an appropriate set of financial instruments (e.g. bonds, derivatives) for which the market value has a closed-form expression. The market value of liabilities can then be expressed through this set of financial instruments. While replicating portfolios might provide an appropriate proxy for the market value of some insurance products, they are inappropriate for others. For example, the text by Milliman [6] describes that replicating portfolios add no value when managing insurance risk: there are no financial instruments related to, for example, the swine flu pandemic. Another possible problem is estimation of the accuracy. A replicating portfolio approach might work well on the economic scenarios on which it is calibrated, but might fail for other scenarios (see Milliman, [6]).
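Constructing a replicating portfolio is usually posed as a least-squares problem: choose instrument weights that minimise the cash-flow mismatch with the liabilities across scenarios and time steps. A minimal sketch (the array layout is an illustrative assumption, not a prescribed format):

```python
import numpy as np

def replicating_weights(instrument_cashflows, liability_cashflows):
    """Least-squares replicating portfolio sketch.

    instrument_cashflows : shape (n_scenarios * n_timesteps, n_instruments)
    liability_cashflows  : shape (n_scenarios * n_timesteps,)
    Returns the weights minimising the squared cash-flow mismatch.
    """
    weights, *_ = np.linalg.lstsq(instrument_cashflows, liability_cashflows, rcond=None)
    return weights

# The liability proxy is then the weighted sum of the instruments' closed-form
# market values, e.g. mvl_proxy = instrument_values @ weights.
```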


2.2.4 Neural networks

Little literature on applying machine learning to estimate the SCR is available. A paper by Hejazi and Jackson [18] displays promising results. In their research, Hejazi and Jackson estimate the value of a large portfolio of variable annuities with neural networks. This portfolio is constructed based on a text written by Gan et al. [14]. Hejazi and Jackson obtain an error smaller than 4% for estimation of the SCR, while also decreasing the runtime by a factor of 5. They note that the runtime can be decreased even further by implementing a parallel training procedure for the neural networks.

2.3 Neural networks

The non-differentiability introduced by the volatility adjustment calls for a flexible model. Since polynomials are differentiable on the real line, it might be hard to capture points at which the market value of liabilities is not differentiable. Neural networks provide the flexibility to capture such functions, since they enjoy the universal approximation property (see e.g. [19], [22]). But neural networks also have their drawbacks. One of the main problems with neural networks is explaining their ability to generalize well. Due to their large number of parameters they are prone to over-fitting. Neyshabur et al. [27] show a strange behaviour that neural networks exhibit. In their text two image classification problems are considered. For both problems, the data is split into a training and a testing set. Neural networks are fitted to the training set, and the scores of the networks on the testing set are compared. It is remarkable to see that even though the networks have an error of zero on the training set, adding more hidden layers decreases the prediction error of the neural networks on the testing set. This means that neural networks must incorporate some inductive bias. How this inductive bias is incorporated still remains an open problem. Feinman et al. [12] illustrate another peculiar characteristic of neural networks. Their paper [12] considers state-of-the-art convolutional neural networks. These networks include so many parameters that they are even able to fit random noise. Classical arguments for statistical techniques, such as imposing a restriction on the number of parameters, do not seem to work for neural networks. It (largely) still remains an open problem to explain the generalization performance of neural networks and how they arrive at an appropriate prediction. This knowledge gap also prevents neural networks from being used at large in business. According to Taylor [20], model validation has to be re-invented for machine learning techniques, and regulatory bodies such as the Dutch Central Bank need to define new regulation regarding machine learning models.

Besides validation, another open question is the dependency of neural network performance on the size of the data set. It is a common idea that neural networks only provide satisfying performance on large data sets (see for example Mao et al. [24]). There is not much research about neural networks in environments where data is scarce. One of the fields where such research actually is done is biostatistics. This is due to the fact that it may be expensive to gather data because of the cost of the procedures needed to obtain it (e.g. expensive scans). To this end, Shaikhina [34] investigates the performance of neural networks when data is scarce. In the paper, Shaikhina proposes a combination of neural networks with different activation functions. This combination provides satisfying results but may well be regarded as a black box solution due to its complexity. Pasini [30] also proposes a neural network approach for small data sets. Some research has been done in the area of low-data neural network analysis, but it still remains a largely undiscovered field.

A note on the literature study

As seen in this chapter, one of the challenges in this thesis is the lack of literature. There are different causes for this lack. First, there is the business motivation. Since estimation of the SCR involves calculating sensitivities of the balance sheet of an insurer with respect to different risk factors, it is hard to write papers on this subject without revealing business specifics (this is partly the reason for Gan et al. [14] to write their paper on variable annuity simulation: to accelerate research in this area). Publishing information regarding the internal models of an insurer provides competitors with insights into the strategy and risk management direction in which an insurer is heading. Therefore, insurers usually do not publish any specifications regarding their internal models.

A second challenge is the lack of literature regarding neural networks in low-data environments. Neural networks for large data sets have been extensively researched; performance on small data sets seems to attract less interest. Small sample size performance of neural networks is not only relevant in the context of estimating the SCR, it is relevant in every statistical problem where obtaining large samples is impossible due to time or budget constraints.


3 Neural networks

Artificial neural networks have become a hot topic over the last few years. The explosive gain in computing power made artificial neural networks a feasible statistical method. Although neural networks have only recently become popular, they have been around for a long time. It started with McCulloch and Pitts [26] in 1943, with a description of a mathematical structure based on the biological neural network of the brain. After this description, it was Rosenblatt [33] in 1958 who first described the single-layer perceptron for classification problems. This perceptron would later be called "Rosenblatt's perceptron" and has become the basic building block for the multilayer perceptron ("MLP"). The MLP is a chain of Rosenblatt perceptrons, and it will be the main model of this thesis. Although there exist many different types of neural networks, the MLP suffices for the modelling purpose of this thesis.

Figure 3.1: Diagram of a neural network with two hidden layers consisting of 4 and 3 hidden nodes, respectively.


The multilayer perceptron is a function $f : \mathbb{R}^r \to \mathbb{R}^s$ that consists of the following elements.

1. Input layer. The first layer in the neural network is the input layer. Throughout the text, $r$ will denote the dimension of the input layer. Inputs are defined as the vectors $x^{(0)} \in \mathbb{R}^r$;

2. Hidden layers. In a neural network with $l$ hidden layers, hidden layer $h \in \{1, \dots, l\}$ is characterised through a weight matrix $W^{(h)} \in \mathbb{R}^{n_{h-1} \times n_h}$ and a bias vector $b^{(h)} \in \mathbb{R}^{n_h}$. The number $n_h \in \mathbb{N}$ denotes the number of hidden nodes in layer $h$, which take as input the outcomes $x^{(h-1)}$ of the nodes from the preceding layer $h-1$. The outputs $x_j^{(h)}$, $j \in \{1, \dots, n_h\}$, of layer $h$ are calculated by
$$x_j^{(h)} = \sigma\Big( \sum_{i=1}^{n_{h-1}} w_{ji}^{(h)} x_i^{(h-1)} + b_j^{(h)} \Big),$$
where the function $\sigma$ is called the activation function. There are many types of activation functions. The most common are the sigmoid functions (e.g. hyperbolic tangent, logistic) and the rectified linear unit $f(x) = \max(0, x)$;

3. Output layer. The output layer is the final layer of the network. It takes the outputs of hidden layer $l$ as input and applies a linear combination of the weights. This means that the output layer is characterised through a weight matrix $W^{(l+1)} \in \mathbb{R}^{n_l \times s}$ and a bias vector $b \in \mathbb{R}^s$, where the output is calculated by
$$y_j = \sum_{i=1}^{n_l} w_{ji} x_i^{(l)} + b_j. \tag{3.1}$$

For the class of neural networks with $l$ hidden layers we write $\mathcal{NW}_l$. Neural networks enjoy the universal approximation property, which means that any measurable function can be approximated by a neural network of the aforementioned form (we will see this in Section 3.1). However, their large number of parameters makes neural networks prone to over-fitting. This will be discussed in Sections 3.2 and 3.3. An algorithm for finding the best parameters for the multilayer perceptron will be discussed in Section 3.4. For the remainder of this thesis, artificial neural networks will be referred to as "neural networks" or "ANN".
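The forward pass above can be written down in a few lines. Below is a minimal sketch (not the implementation used in this thesis); the ReLU activation, the random weights and the function name are illustrative choices, and the weight matrices are stored in (output x input) orientation so that `W @ a` matches the sums in the formulas above.

```python
import numpy as np

def mlp_forward(x, weights, biases, sigma=lambda z: np.maximum(z, 0.0)):
    """Forward pass of a multilayer perceptron as described above.

    weights/biases: lists [W^(1), ..., W^(l), W^(l+1)] and [b^(1), ..., b^(l), b^(l+1)].
    Hidden layers apply a <- sigma(W a + b); the output layer is linear, cf. (3.1).
    """
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = sigma(W @ a + b)                 # hidden layer h
    return weights[-1] @ a + biases[-1]      # linear output layer

# Example: r = 3 inputs, hidden layers of 4 and 3 nodes (as in Figure 3.1), s = 1 output.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(3, 4)), rng.normal(size=(1, 3))]
bs = [np.zeros(4), np.zeros(3), np.zeros(1)]
y = mlp_forward(rng.normal(size=3), Ws, bs)
```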

3.1 Neural networks as approximators

Hornik et al. [19] were among the first to prove the universal approximation theorem for neural networks with a sigmoid activation function. More recently the rectified linear unit ("ReLU") has become more popular (see e.g. Ramachandran et al. [31]). Leshno [22] generalized the result of Hornik to neural networks whose activation function is not a polynomial. This section studies the proof from Leshno [22] in detail. At the end of this section, arguments from the article by Hornik et al. [19] are used to extend the results derived by Leshno from denseness in the space of continuous functions to denseness in the space of measurable functions.


Definition 3.1.1. Let $f$ be a function defined almost everywhere with respect to the Lebesgue measure on a measurable set $\Omega \subset \mathbb{R}^n$. If $f$ is bounded almost everywhere on $\Omega$, we write $f \in L^\infty(\Omega)$. If $f \in L^\infty(K)$ holds for every compact subset $K \subset \Omega$, the function $f$ is said to be locally essentially bounded on $\Omega$ and we write $f \in L^\infty_{\mathrm{loc}}(\Omega)$.

Throughout this chapter, we assume that $\sigma \in L^\infty_{\mathrm{loc}}(\mathbb{R})$, and we require that the closure (in $\mathbb{R}$) of the set of points of discontinuity of $\sigma$ has Lebesgue measure zero.

Definition 3.1.2. The set $C(\mathbb{R}^r)$ is defined as the set of continuous functions from $\mathbb{R}^r$ to $\mathbb{R}$. The set $M(\mathbb{R}^r)$ is defined as the set of all Borel measurable functions from $\mathbb{R}^r$ to $\mathbb{R}$. The set $C^\infty(\mathbb{R})$ denotes the functions which are infinitely many times continuously differentiable. The set $C_0^\infty(\mathbb{R})$ denotes the $C^\infty(\mathbb{R})$ functions that have compact support. Let $a < b \in \mathbb{R}$. The set of $C_0^\infty$ functions with support in the interval $[a, b]$ is denoted by $C_0^\infty[a, b]$.

Definition 3.1.3. Let $F \subset L^\infty_{\mathrm{loc}}(\mathbb{R}^n)$. The set $F$ is called dense on compacta in $C(\mathbb{R}^n)$ if for every function $g \in C(\mathbb{R}^n)$ and for every compact set $K \subset \mathbb{R}^n$ there exists a sequence of functions $(f_j)_{j=1}^\infty$ with $f_j \in F$ such that
$$\lim_{j\to\infty} \|g - f_j\|_{L^\infty(K)} = 0. \tag{3.2}$$
If equation (3.2) holds, the sequence $f_j$ is said to converge uniformly on compacta to $g$.

Definition 3.1.4. Let $A : \mathbb{R}^n \to \mathbb{R}$ be a function. The class of affine functions $\mathcal{A}$ is defined as
$$\mathcal{A} := \{A : \mathbb{R}^n \to \mathbb{R} \mid A(x) = a_1 x_1 + \ldots + a_n x_n + b, \; a_i, b \in \mathbb{R}\}.$$

Definition 3.1.5. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be a measurable function, and let $r \in \mathbb{N}$. The class of neural networks with a single hidden layer $\Sigma^r(\sigma)$ is defined as
$$\Sigma^r(\sigma) := \Big\{ f : \mathbb{R}^r \to \mathbb{R} \;\Big|\; f(x) = \sum_{i=1}^{n_h} w_i\,\sigma(A_i(x)),\; x \in \mathbb{R}^r,\; w_i \in \mathbb{R},\; A_i \in \mathcal{A} \Big\}.$$

Lemma 3.1.6. Let $\sigma \in C^\infty(\mathbb{R})$. Suppose that for each $x \in \mathbb{R}$ there exists $k \in \mathbb{N}$ such that the $k$-th order derivative $\sigma^{(k)}(x) = 0$. Then $\sigma$ is a polynomial.

Proof. Suppose $\sigma$ is not a polynomial. Define the sets $V_k$ and $W$ as follows:
$$V_k := \{x : \sigma^{(k)}(x) = 0\},$$
$$W := \{x : \forall a, b \text{ with } a < b \text{ and } x \in (a,b), \ \sigma|_{(a,b)} \text{ is not a polynomial}\}.$$
The set $W$ is non-empty by the assumption that $\sigma$ is not a polynomial. Furthermore, the set $W$ is closed in $\mathbb{R}$. It also holds that $W$ does not contain any isolated points. Indeed, suppose that $x_0 \in W$ is an isolated point. Then $\sigma$ is a polynomial on $(a, x_0)$ and on $(x_0, b)$. Considering the Taylor expansions around $x_0$, we obtain that $\sigma$ is a polynomial on $(a, b)$. Hence $W$ cannot contain isolated points, and since it is closed as well, it is a perfect set.

By assumption the union $\bigcup_k V_k$ covers $\mathbb{R}$, so it also provides a covering of $W$. Application of Baire's Category Theorem C.5.2 gives that a countable covering of $W$ has to contain a set that is not nowhere dense. Hence there exists a $k$ such that $V_k \cap W$ is not nowhere dense. This means that the interior of the closure of $V_k \cap W$ is not empty. Because $W$ is a non-empty perfect subset of the real line, $W$ is not countable. Then there must exist an interval $(a, b)$ such that $(a, b) \cap W$ is non-empty and
$$(a, b) \cap W \subset V_k$$
for some $k$. By noting that $W$ does not contain isolated points, we obtain that $\sigma^{(k')}(x) = 0$ for all $k' \geq k$ and $x \in (a, b) \cap W$. Consider any maximal interval $(c, e) \subset (a, b) \setminus W$. Then $\sigma$ is a polynomial of some degree $d$ on $[c, e]$. This means that $\sigma^{(d)}$ is a non-zero constant on $[c, e]$. Since either $c$ or $e \in W$ (otherwise $(c, e)$ would not have been a maximal interval), it holds that $d < k$. Then $\sigma^{(k)} = 0$ on any maximal interval $[c, e]$. Hence $\sigma^{(k')} = 0$ on $(a, b) \setminus W$ for all $k' \geq k$. Combining this with the fact that $\sigma^{(k')} = 0$ on $(a, b) \cap W$ for all $k' \geq k$, we obtain that $\sigma^{(k')} = 0$ on $(a, b)$ for all $k' \geq k$. But then $\sigma$ is a polynomial on $(a, b)$, which contradicts the fact that $(a, b) \cap W$ is non-empty.

Lemma 3.1.7. Suppose $\sigma \in C^\infty(\mathbb{R})$ and that $\sigma$ is not a polynomial. Then $\Sigma^1(\sigma)$ is dense on compacta in $C(\mathbb{R})$.

Proof. Let $\sigma \in C^\infty(\mathbb{R})$. Because $\sigma$ is differentiable, the limit
$$\frac{d}{dw}\sigma(wx + b) = \lim_{h \downarrow 0} \frac{\sigma((w+h)x + b) - \sigma(wx + b)}{h} \tag{3.3}$$
is well defined. The functions $\frac{\sigma((w+h)x+b)}{h}$ and $\frac{\sigma(wx+b)}{h}$ appearing on the right hand side of equation (3.3) are elements of the class $\Sigma^1(\sigma)$ for each $h > 0$. Since the limit in equation (3.3) exists, the derivative $\frac{d}{dw}\sigma(wx + b)$ lies in $\overline{\Sigma^1(\sigma)}$, where the closure is taken with respect to $C(\mathbb{R})$. This process can be iterated arbitrarily many times. For the $k$-th order derivatives the following expression holds:
$$\frac{d^k}{dw^k}\sigma(wx + b) = x^k \sigma^{(k)}(wx + b). \tag{3.4}$$
By equation (3.4) it follows that $x^k \sigma^{(k)}(y) \in \overline{\Sigma^1(\sigma)}$. Under the assumption that $\sigma$ is not a polynomial, there exists $y \in \mathbb{R}$ such that $\sigma^{(k)}(y) \neq 0$ (see Lemma 3.1.6). This means that $x^k \in \overline{\Sigma^1(\sigma)}$ holds for any $k$. By taking linear combinations, $\overline{\Sigma^1(\sigma)}$ contains all polynomials. It follows from Weierstrass' Theorem (see Appendix C.3.1) that $\Sigma^1(\sigma)$ is dense on compacta in $C(\mathbb{R})$.

Definition 3.1.8. Let $u, v \in M(\mathbb{R}^r)$. The convolution of $u$ and $v$ is denoted by $u * v$ and is defined as
$$(u * v)(t) := \int u(s)\,v(t - s)\,ds = \int u(t - s)\,v(s)\,ds,$$
provided the integral is well-defined.

Lemma 3.1.9. If $\varphi \in C_0^\infty$, then $\sigma * \varphi \in \overline{\Sigma^1(\sigma)}$, where the closure is taken with respect to $C(\mathbb{R})$.

Proof. Suppose that $\varphi$ has support contained in $[-M, M]$ for some $M > 0$. The aim is to prove that $\sigma * \varphi$ can be uniformly approximated by functions of the class $\Sigma^1(\sigma)$ on $[-M, M]$. This means proving that for any $\varepsilon > 0$ there exist points $y_i$ such that the following bound holds:
$$\Big| \int \sigma(x - y)\varphi(y)\,dy - \sum_{i=1}^{m} \sigma(x - y_i)\varphi(y_i)\,\Delta y_i \Big| \leq 3\varepsilon. \tag{3.5}$$
Fix $\varepsilon > 0$ arbitrary. Let $-2M - 1 \leq z_1 < \ldots < z_r \leq 2M + 1$ denote the (possibly infinitely many) points of discontinuity of $\sigma$ in the interval $[-2M - 1, 2M + 1]$. Since $\sigma$ and $\varphi$ have bounded norms, it is possible to choose $\delta$ such that
$$10\,\delta\,\|\sigma\|_{L^\infty[-2M,2M]}\,\|\varphi\|_{L^\infty} \leq \varepsilon.$$
By assumption the closure of the set of points of discontinuity of $\sigma$ has measure zero. It is possible to cover it by open intervals $(a_i, b_i)$ such that the measure of $U' = \bigcup_{i=1}^{\infty}(a_i, b_i)$ is at most $\delta$. Since the closure of the points of discontinuity is closed and bounded, there exists a finite cover $U$ consisting of intervals of $U'$ such that $U = \bigcup_{i=1}^{r(\delta)}(a_i, b_i)$ still covers the closure of the points of discontinuity. This means that $\sigma$ is uniformly continuous on the set $[-2M, 2M] \setminus U$. Now choose $m$ such that $m\delta > M r(\delta)$, and define $y_i = -M + \frac{2iM}{m}$ and $\Delta y_i = y_i - y_{i-1}$ for $i = 1, \ldots, m$, with $\Delta_i$ the corresponding subintervals. We break the proof of equation (3.5) down into two steps. We start with proving
$$\Big| \sum_{i=1}^{m} \sigma(x - y_i)\varphi(y_i)\,\Delta y_i - \sum_{i=1}^{m} \int_{\Delta_i} \sigma(x - y_i)\varphi(y)\,dy \Big| \leq \varepsilon. \tag{3.6}$$
By uniform continuity of $\varphi$, the following inequality holds for any $|s - t| \leq \frac{2M}{m}$:
$$|\varphi(s) - \varphi(t)| \leq \frac{\varepsilon}{2M\,\|\sigma\|_{L^\infty[-2M,2M]}}. \tag{3.7}$$
The following chain of inequalities leads to the bound (3.6):
$$\begin{aligned}
\Big| \sum_{i=1}^{m} \int_{\Delta_i} \sigma(x - y_i)\varphi(y)\,dy - \sum_{i=1}^{m} \sigma(x - y_i)\varphi(y_i)\,\Delta y_i \Big|
&= \Big| \sum_{i=1}^{m} \int_{\Delta_i} \sigma(x - y_i)\big(\varphi(y) - \varphi(y_i)\big)\,dy \Big| \\
&\leq \sum_{i=1}^{m} \int_{\Delta_i} |\sigma(x - y_i)|\,|\varphi(y) - \varphi(y_i)|\,dy \\
&\leq \sum_{i=1}^{m} \int_{\Delta_i} |\sigma(x - y_i)|\,\frac{\varepsilon}{2M\,\|\sigma\|_{L^\infty[-2M,2M]}}\,dy \\
&\leq \sum_{i=1}^{m} \int_{\Delta_i} \frac{\varepsilon}{2M}\,dy
\;\leq\; \sum_{i=1}^{m} \frac{2M}{m}\,\frac{\varepsilon}{2M} \;\leq\; \varepsilon,
\end{aligned}$$
where equation (3.7) is used in the third step. The second step in the proof of equation (3.5) is the derivation of the bound
$$\Big| \int \sigma(x - y)\varphi(y)\,dy - \sum_{i=1}^{m} \int_{\Delta_i} \sigma(x - y_i)\varphi(y)\,dy \Big| \leq 2\varepsilon. \tag{3.8}$$
The derivation of bound (3.8) makes use of the uniform continuity of $\sigma$ on the set $[-2M, 2M] \setminus U$. Suppose that $|s - t| \leq \frac{2M}{m}$ and $s, t \in [-2M, 2M] \setminus U$. By uniform continuity of $\sigma$ the following inequality holds:
$$|\sigma(s) - \sigma(t)| \leq \frac{\varepsilon}{2M\,\|\varphi\|_{L^1}}. \tag{3.9}$$
Because $\varphi$ has bounded support, the following inequality holds:
$$\Big| \int \sigma(x - y)\varphi(y)\,dy - \sum_{i=1}^{m} \int_{\Delta_i} \sigma(x - y_i)\varphi(y)\,dy \Big| \leq \sum_{i=1}^{m} \int_{\Delta_i} |\sigma(x - y) - \sigma(x - y_i)|\,|\varphi(y)|\,dy.$$
Consider the two cases $(x - y_{i-1}, x - y_i) \cap U = \emptyset$ and $(x - y_{i-1}, x - y_i) \cap U \neq \emptyset$.

• Case $(x - y_{i-1}, x - y_i) \cap U = \emptyset$. By invoking the uniform continuity of $\sigma$, i.e. equation (3.9), the following inequality holds:
$$\int_{\Delta_i} |\sigma(x - y) - \sigma(x - y_i)|\,|\varphi(y)|\,dy \leq \frac{\varepsilon}{2M\,\|\varphi\|_{L^1}} \int_{\Delta_i} |\varphi(y)|\,dy \leq \frac{\varepsilon}{m},$$
so that the sum over these intervals is at most $\varepsilon$.

• Case $(x - y_{i-1}, x - y_i) \cap U \neq \emptyset$. Since $U$ has measure at most $\delta$, the total length of the intervals under this case is at most $\delta + \frac{4M}{m} r(\delta)$. By the choice of $m$, it holds that $\delta + \frac{4M}{m} r(\delta) \leq 5\delta$. This means that the following chain of inequalities holds:
$$\begin{aligned}
\sum_{\Delta_i : (x - y_{i-1}, x - y_i) \cap U \neq \emptyset} \int_{\Delta_i} |\sigma(x - y) - \sigma(x - y_i)|\,|\varphi(y)|\,dy
&\leq \sum_{\Delta_i : (x - y_{i-1}, x - y_i) \cap U \neq \emptyset} \int_{\Delta_i} 2\,\|\sigma\|_{L^\infty[-2M,2M]}\,|\varphi(y)|\,dy \\
&\leq 2\,\|\sigma\|_{L^\infty[-2M,2M]}\,\|\varphi\|_{L^\infty} \sum_{\Delta_i : (x - y_{i-1}, x - y_i) \cap U \neq \emptyset} \int_{\Delta_i} dy \\
&\leq 2\,\|\sigma\|_{L^\infty[-2M,2M]}\,\|\varphi\|_{L^\infty}\,\cdot 5\delta \;\leq\; \varepsilon.
\end{aligned}$$

Combining the two cases and splitting the sum over $i$ accordingly yields
$$\sum_{i=1}^{m} \int_{\Delta_i} |\sigma(x - y) - \sigma(x - y_i)|\,|\varphi(y)|\,dy \leq \varepsilon + \varepsilon,$$
which proves (3.8). Combining equations (3.6) and (3.8) and invoking the triangle inequality proves inequality (3.5).

3.1.1 Denseness w.r.t. continuous functions

The main result of this section is the proof of denseness of the class $\Sigma^r(\sigma)$ in the space of continuous functions $C(\mathbb{R}^r)$ for neural networks whose activation function is not a polynomial.

Theorem 3.1.10. Let $\sigma$ be a measurable function. Then $\Sigma^r(\sigma)$ is dense on compacta in $C(\mathbb{R}^r)$ if and only if $\sigma$ is not a polynomial.

The proof of Theorem 3.1.10 is given in different parts, which have been formulated as various lemmas. Lemma 3.1.11 proves Theorem 3.1.10 from left to right. The remaining lemmas are combined at the end of this section to give the proof of Theorem 3.1.10.

Lemma 3.1.11. If $\sigma$ is a polynomial, $\Sigma^r(\sigma)$ is not dense on compacta in $C(\mathbb{R}^r)$.

Proof. Let $\sigma$ be a polynomial of degree $k$, i.e. $\sigma(x) = c_0 + c_1 x + \ldots + c_k x^k$, and let $A \in \mathcal{A}$ be an affine function. Then the function $\sigma(A(x)) = c_0 + c_1 A(x) + \ldots + c_k A(x)^k$ is a polynomial of degree at most $k$ as well. The set $\Sigma^r(\sigma)$ is the set of linear combinations of functions $\sigma(A)$, hence it consists of polynomials of degree at most $k$ as well. These are not dense on compacta in $C(\mathbb{R}^r)$.

Lemma 3.1.11 proves the implication from left to right in Theorem 3.1.10. For the statement in the converse direction, we assume that σ is not a polynomial.

Lemma 3.1.12. If $\Sigma^1(\sigma)$ is dense on compacta in $C(\mathbb{R})$, then $\Sigma^r(\sigma)$ is dense on compacta in $C(\mathbb{R}^r)$.

Proof. Suppose $\Sigma^1(\sigma)$ is dense on compacta in $C(\mathbb{R})$. From Micchelli et al. [9] it follows that the space of finite sums of ridge functions
$$\Big\{ x \mapsto \sum_{i=1}^{k} f_i(a_i \cdot x) \;\Big|\; f_i \in C(\mathbb{R}),\; a_i \in \mathbb{R}^r,\; k \in \mathbb{N} \Big\}$$
is dense on compacta in $C(\mathbb{R}^r)$. Let $g \in C(\mathbb{R}^r)$ and let $K \subset \mathbb{R}^r$ be a compact set. Then for any $\varepsilon > 0$ there exist functions $f_i \in C(\mathbb{R})$ and vectors $a_i \in \mathbb{R}^r$, for $i = 1, \ldots, k$ and some $k \in \mathbb{N}$, such that
$$\Big| g(x) - \sum_{i=1}^{k} f_i(a_i \cdot x) \Big| \leq \frac{\varepsilon}{2} \tag{3.10}$$
for all $x \in K$. Because $K$ is compact, the image of $K$ under $x \mapsto a_i \cdot x$ is contained in an interval $[A_i, B_i] \subset \mathbb{R}$ for each $a_i$ such that equation (3.10) holds. By assumption $\Sigma^1(\sigma)$ is dense in $C(\mathbb{R})$, so it is dense in $C([A_i, B_i])$ as well. Therefore, there exist constants $w_{ij}$ and affine functions $A_{ij}$, for $j = 1, \ldots, m_i$, such that
$$\Big| f_i(y) - \sum_{j=1}^{m_i} w_{ij}\,\sigma\big(A_{ij}(y)\big) \Big| \leq \frac{\varepsilon}{2k} \tag{3.11}$$
for all $y \in [A_i, B_i]$. Combining inequalities (3.10) and (3.11) leads to the inequality
$$\Big| g(x) - \sum_{i=1}^{k} \sum_{j=1}^{m_i} w_{ij}\,\sigma\big(A_{ij}(a_i \cdot x)\big) \Big| < \varepsilon.$$
Since $A_{ij}(a_i \cdot x)$ is again an affine function of $x$, this means that $\Sigma^r(\sigma)$ is dense on compacta in $C(\mathbb{R}^r)$.

Lemma 3.1.13. If there exists a function $\varphi \in C_0^\infty$ such that $\sigma * \varphi$ is not a polynomial, then $\Sigma^1(\sigma)$ is dense on compacta in $C(\mathbb{R})$.

Proof. Lemma 3.1.9 shows that $\varphi \in C_0^\infty$ implies $\sigma * \varphi \in \overline{\Sigma^1(\sigma)}$. But then we also have that $(\sigma * \varphi)(wx + \theta) \in \overline{\Sigma^1(\sigma)}$ for any $w, \theta \in \mathbb{R}$. Moreover, for $\sigma$ as above and $\varphi \in C_0^\infty$ we have $\sigma * \varphi \in C^\infty$. It follows from Lemma 3.1.7 that if $\sigma * \varphi$ is not a polynomial, then $\Sigma^1(\sigma)$ is dense on compacta in $C(\mathbb{R})$.

Lemma 3.1.14. Suppose $\sigma * \varphi$ is a polynomial for all $\varphi \in C_0^\infty$. Then there exists $m \in \mathbb{N}$ such that $\sigma * \varphi$ is a polynomial of degree at most $m$ for all $\varphi \in C_0^\infty$.

Proof. Assume first that $\varphi \in C_0^\infty[a, b]$. Define the metric
$$\rho(\varphi_1, \varphi_2) = \sum_{n=0}^{\infty} 2^{-n}\,\frac{\|\varphi_1 - \varphi_2\|_n}{1 + \|\varphi_1 - \varphi_2\|_n},$$
where $\|\varphi\|_n = \sum_{j=0}^{n} \sup_{x \in [a,b]} |\varphi^{(j)}(x)|$ (the functions $\varphi^{(j)}$ denote the $j$-th order derivatives of $\varphi$). The space $C_0^\infty[a, b]$ endowed with the metric $\rho$ is a complete metric vector space. Define the sets
$$V_k = \{\varphi \in C_0^\infty[a, b] \mid \deg(\sigma * \varphi) \leq k\}.$$
Then each $V_k$ is a closed subspace, $V_k \subset V_{k+1}$ and
$$\bigcup_{k=0}^{\infty} V_k = C_0^\infty[a, b].$$
Application of Baire's Category Theorem (see Appendix C.5.2) enables us to find an integer $m$ such that $V_m$ has non-empty interior; being a closed linear subspace, $V_m$ then equals $C_0^\infty[a, b]$, which proves the claim for $\varphi \in C_0^\infty[a, b]$. Now let $[A, B]$ be any interval. For $\varphi \in C_0^\infty[A, B]$ we can find $\varphi_i \in C_0^\infty[a_i, b_i]$ such that $[A, B] \subset \bigcup_{i=1}^{k}[a_i, b_i]$ and $\varphi = \sum_{i=1}^{k} \varphi_i$. But then $\sigma * \varphi = \sum_{i=1}^{k} \sigma * \varphi_i$, and for each $i$ the convolution $\sigma * \varphi_i$ is a polynomial of degree at most $m$. Then $\sigma * \varphi$ is a polynomial of degree at most $m$ as well.

Lemma 3.1.15. Suppose $\sigma * \varphi$ is a polynomial of degree at most $m$ for every $\varphi \in C_0^\infty$. Then $\sigma$ is a polynomial of degree at most $m$.

Proof. Define a sequence of mollifiers $\varphi_\epsilon$ (see Definition C.6.1). By Theorem C.6.2 we have that $\sigma * \varphi_\epsilon \to \sigma$ as $\epsilon \downarrow 0$. Since polynomials of degree at most $m$ form a closed linear space and the functions $\sigma * \varphi_\epsilon$ are polynomials of degree at most $m$, $\sigma$ must be a polynomial of degree at most $m$ as well.

The proof of Theorem 3.1.10 can now be completed by the following argument. Suppose $\Sigma^r(\sigma)$ is not dense. By Lemma 3.1.12 it follows that $\Sigma^1(\sigma)$ is not dense. Application of Lemma 3.1.13 then shows that $\sigma * \varphi$ is a polynomial for any $\varphi \in C_0^\infty$. By Lemma 3.1.14 it holds that $\sigma * \varphi$ is a polynomial of degree at most $m$ for any $\varphi \in C_0^\infty$. Finally, application of Lemma 3.1.15 gives that $\sigma$ is a polynomial of degree at most $m$.

3.1.2 An extension to measurable functions

In this section, the continuity result is extended to measurable functions. The proofs are based on the text by Hornik et al. [19].

Definition 3.1.16. Given a probability measure $\mu$ on $(\mathbb{R}^r, \mathcal{B}(\mathbb{R}^r))$, define the Ky Fan metric $\rho_\mu : M(\mathbb{R}^r) \times M(\mathbb{R}^r) \to \mathbb{R}_+$ by
$$\rho_\mu(f, g) = \inf\{\varepsilon > 0 : \mu\{x : |f(x) - g(x)| > \varepsilon\} < \varepsilon\}.$$

Lemma 3.1.17. The following are equivalent.
1. $\rho_\mu(f_n, f) \to 0$.
2. For every $\varepsilon > 0$, $\mu\{x : |f_n(x) - f(x)| > \varepsilon\} \to 0$.
3. $\int \min\{|f_n(x) - f(x)|, 1\}\,\mu(dx) \to 0$.

Proof.
$1 \to 2$. Let $\varepsilon, \varepsilon' > 0$ be arbitrary. By assumption, there exists $M$ such that for all $m \geq M$ it holds that $\mu\{x : |f_m(x) - f(x)| > \varepsilon\} < \varepsilon$. Without loss of generality, assume that $\varepsilon' < \varepsilon$ (otherwise we are done). Again by assumption, there exists $M'$ such that for all $m \geq M'$ the inequality $\mu\{x : |f_m(x) - f(x)| > \varepsilon'\} < \varepsilon'$ holds. Since $\mu\{x : |f_m(x) - f(x)| > \varepsilon\} \leq \mu\{x : |f_m(x) - f(x)| > \varepsilon'\}$, the inequality $\mu\{x : |f_m(x) - f(x)| > \varepsilon\} < \varepsilon'$ holds. Then by choosing $N = \max(M, M')$ we see that for all $n \geq N$ the inequality $\mu\{x : |f_n(x) - f(x)| > \varepsilon\} < \varepsilon'$ holds.

$2 \to 1$. Let $\varepsilon > 0$. By assumption there exists $N \in \mathbb{N}$ such that for all $n \geq N$ the inequality $\mu\{x : |f_n(x) - f(x)| > \varepsilon\} < \varepsilon$ holds. Then $\rho_\mu(f_n, f) \to 0$.

$2 \to 3$. Suppose $\mu\{x : |f_n(x) - f(x)| > \frac{\varepsilon}{2}\} < \frac{\varepsilon}{2}$; then the following inequality holds:
$$\int \min\{|f_n(x) - f(x)|, 1\}\,\mu(dx) < \varepsilon.$$

$3 \to 2$. Let $\varepsilon > 0$ be arbitrary. The integral can be decomposed into the following terms:
$$\int \min\{|f_n(x) - f(x)|, 1\}\,\mu(dx) = \mu\{x : |f_n(x) - f(x)| > 1\} + \int_{|f_n(x) - f(x)| \leq 1} |f_n(x) - f(x)|\,\mu(dx).$$
Consider the case $\varepsilon > 1$. By assumption, for each $\varepsilon' > 0$ there exists $N \in \mathbb{N}$ such that for all $n \geq N$ the inequality $\int \min\{|f_n(x) - f(x)|, 1\}\,\mu(dx) \leq \varepsilon'$ holds. Then the inequality $\mu\{x : |f_n(x) - f(x)| > 1\} \leq \varepsilon'$ holds for all $n \geq N$. Now consider the case $\varepsilon \leq 1$. Fix $\varepsilon' < \varepsilon$. By assumption, there exists $N \in \mathbb{N}$ such that for all $n \geq N$ it holds that $\int \min\{|f_n(x) - f(x)|, 1\}\,\mu(dx) \leq \frac{\varepsilon'^2}{2}$. Hence $\mu\{x : |f_n(x) - f(x)| > 1\} \leq \frac{\varepsilon'^2}{2} < \frac{\varepsilon'}{2}$. For the second term the inequality $\int_{|f_n(x) - f(x)| \leq 1} |f_n(x) - f(x)|\,\mu(dx) \leq \frac{\varepsilon'^2}{2}$ holds. The following inequality holds (by virtue of Markov's inequality):
$$\mu\{x : \varepsilon \leq |f_n(x) - f(x)| \leq 1\} \leq \frac{\int_{|f_n(x) - f(x)| \leq 1} |f_n(x) - f(x)|\,\mu(dx)}{\varepsilon},$$
which can be bounded by $\frac{\varepsilon'^2}{2\varepsilon} < \frac{\varepsilon'}{2}$. Hence we obtain the inequality
$$\mu\{x : |f_n(x) - f(x)| > 1\} + \mu\{x : \varepsilon < |f_n(x) - f(x)| \leq 1\} \leq \varepsilon',$$
which is the desired result.

Lemma 3.1.18. If $\{f_n\}$ is a sequence of functions in $M(\mathbb{R}^r)$ that converges uniformly on compacta to the function $f$, then it holds that $\rho_\mu(f_n, f) \to 0$.

Proof. By Lemma 3.1.17 it is sufficient to show that $\int \min\{|f_n(x) - f(x)|, 1\}\,\mu(dx) \to 0$. By Halmos [16], Theorem 52.G, it follows that $\mu$ is a regular measure because $\mathbb{R}^r$ is a locally compact metric space. Hence there exists a compact subset $K \subset \mathbb{R}^r$ such that $\mu(K) \geq 1 - \frac{\varepsilon}{2}$. By assumption, there exists an $N$ such that for all $n \geq N$ it holds that $\sup_{x \in K} |f_n(x) - f(x)| < \frac{\varepsilon}{2}$. The following chain of inequalities holds:
$$\begin{aligned}
\int \min\{|f_n(x) - f(x)|, 1\}\,\mu(dx)
&= \int_K \min\{|f_n(x) - f(x)|, 1\}\,\mu(dx) + \int_{\mathbb{R}^r \setminus K} \min\{|f_n(x) - f(x)|, 1\}\,\mu(dx) \\
&\leq \frac{\varepsilon}{2}\,\mu(\mathbb{R}^r) + \int_{\mathbb{R}^r \setminus K} 1\,\mu(dx) \\
&\leq \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.
\end{aligned}$$
Application of Lemma 3.1.17 yields the result.

Lemma 3.1.19. For any finite measure $\mu$, the space of continuous functions $C(\mathbb{R}^r)$ is dense in $M(\mathbb{R}^r)$ with respect to the metric $\rho_\mu$.

Proof. Let $f \in M(\mathbb{R}^r)$ be a measurable function and let $\varepsilon > 0$. Because $\mu$ is a finite measure, there exists an $M$ such that the following inequality holds:
$$\int \min\{|\mathbf{1}_{\{|f| < M\}}f(x) - f(x)|, 1\}\,\mu(dx) < \frac{\varepsilon}{2}.$$
By combining Theorems 55.C and 55.D from Halmos [16] we obtain that the continuous functions are dense in the space of integrable functions. This means that we can find a continuous function $g$ such that the following inequality holds:
$$\int \min\{|\mathbf{1}_{\{|f| < M\}}f - g|, 1\}\,\mu(dx) < \frac{\varepsilon}{2}.$$
Application of the triangle inequality yields the desired result.

By combining Lemma 3.1.19 and Theorem 3.1.10 we obtain that the class of neural networks with a single hidden layer is dense in the space of measurable functions. Although this is a hopeful result for function approximation, Theorem 3.1.10 does not provide any results on the amount of data that is needed for neural networks to approximate a function well. The ability to learn from data is examined in the next section.
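As a small practical illustration of this denseness result (not part of the thesis's own experiments), the snippet below fits a single-hidden-layer network to the kinked target $f(x) = \max(0, x)$, the type of non-differentiability the volatility adjustment introduces. It assumes scikit-learn is available; all names and settings are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(500, 1))
y = np.maximum(x[:, 0], 0.0)                      # non-differentiable target

net = MLPRegressor(hidden_layer_sizes=(25,), activation="tanh",
                   solver="lbfgs", max_iter=5000, random_state=0)
net.fit(x, y)

grid = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
print(net.predict(grid))                          # approximately max(0, x), kink included
```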

3.2 Introduction to statistical learning theory

As stated in the introduction of this chapter, there are many types of neural networks. The coarsest distinction between the types of networks can be made based on the learning task they perform. Statistical learning can be divided into two classes: supervised learning and unsupervised learning. Unsupervised learning is concerned with pattern detection and feature extraction. This means that there is no target variable to predict. An unsupervised learning task uses a set of unlabelled data $\{x_i\}$ as the domain set $\mathcal{X}$ and the model tries to extract patterns from the data. These types of networks can be used, for instance, in pattern detection for credit card fraud or covariate selection. In supervised learning, the data set consists of labelled data $\{(x_i, y_i)\}$. The goal is to predict the value of the label $y$ given the observation $x$. For example, in credit card fraud detection, a credit card transaction with characteristics described by the vector $x$ is observed, and the goal is to predict whether this transaction is legitimate (i.e. $y = 0$) or fraudulent ($y = 1$). Regression analysis is also a form of supervised learning. For example, let $x$ be a vector describing the weather conditions observed for the last few weeks. The target value or label $y$ might be the expected temperature tomorrow. The learning task in this thesis is a supervised learning task.

The statistical learning framework is defined along the lines of Shalev and Schwartz [35] and extended by Maurer [25] where necessary. Notational conventions are used from Shalev and Schwartz [35]. The statistical learning framework consists of the following elements.

• Input: The input for a learning task can be divided into the following classes:
  1. Domain set: A set $\mathcal{X}$ that consists of unlabelled data. In this thesis each $x \in \mathcal{X}$ is a vector containing information about credit spreads, interest rates and equity returns;
  2. Label set: A set $\mathcal{Y}$ describing the output space of the learning task. In this thesis, these are the market values of liabilities under the economic conditions described by $x$;
  3. Training data: A set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \subset \mathcal{X} \times \mathcal{Y}$ that consists of labelled data. In this thesis the training data are the vectors of risk factors for which the market value of liabilities is calculated using Monte Carlo simulation;
• The learner's output: The goal is to find a mapping $h : \mathcal{X} \to \mathcal{Y}$. This mapping $h$ is called a hypothesis. In this thesis, this is the function that estimates the market value of liabilities based on the state of the economy $x$. These functions are called the best estimate liabilities. The space of possible hypotheses is written as $\mathcal{H}$;
• Data generation model: In statistical learning, it is assumed that the data is generated according to a probability distribution $\mathcal{D}$ over the domain set $\mathcal{X}$. Prior knowledge about this distribution may not be available. In this thesis, the distribution $\mathcal{D}$ is the distribution generating the economic scenarios as described in Section 4.2;
• Measures of success: The error of a statistical learning problem is defined in the following way. Let $l : \mathcal{H} \times Z \to \mathbb{R}_+$ be a function called the loss function, where $Z$ is the space consisting of pairs $(x, y)$. The error of a hypothesis $h$ is defined as
$$L_{\mathcal{D}}(h) = \mathbb{E}_{x \sim \mathcal{D}}\, l\,[h, (x, y)].$$
The function $L_{\mathcal{D}}$ is called the risk function; it is the expected loss of a given hypothesis $h$.

Throughout this section, the set of hypotheses $\mathcal{H}$ is the class of multilayer neural networks.
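In practice the distribution $\mathcal{D}$ is unknown and $L_{\mathcal{D}}$ is approximated by its empirical counterpart on the training set $S$. A minimal sketch, using the squared loss as an illustrative choice for the regression task in this thesis:

```python
import numpy as np

def empirical_risk(hypothesis, xs, ys):
    """Empirical counterpart of L_D(h) on a training set S = {(x_i, y_i)},
    with the squared loss l(h, (x, y)) = (h(x) - y)^2 as an illustrative choice."""
    predictions = np.array([hypothesis(x) for x in xs])
    return float(np.mean((predictions - np.asarray(ys)) ** 2))
```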

3.2.1 Probably approximately correct learning

Having defined the learning framework and the set of hypotheses $\mathcal{H}$, we are interested in whether it is at all possible to arrive at a hypothesis $h \in \mathcal{H}$ that has small prediction error. In the case of neural networks the question is whether it is possible to find weight matrices such that the regression task is performed well. Statistical learning theory investigates this question by defining the agnostic probably approximately correct (PAC) learning paradigm.

Definition 3.2.1. A hypothesis class $\mathcal{H}$ is called agnostic PAC-learnable with respect to a set $Z$ and loss function $l : \mathcal{H} \times Z \to \mathbb{R}_+$ if there exists a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ (which is called the complexity function) and a learning algorithm with the following property: for every $\varepsilon, \delta \in (0, 1)$ and for every distribution $\mathcal{D}$ over $Z$, when running the algorithm on $m \geq m_{\mathcal{H}}(\varepsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$, the algorithm returns $h \in \mathcal{H}$ such that
$$\mathbb{P}\Big[ L_{\mathcal{D}}(h) \leq \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \varepsilon \Big] \geq 1 - \delta.$$


A class $\mathcal{H}$ is called probably approximately correct (PAC) learnable if in addition there exists $h' \in \mathcal{H}$ such that $L_{\mathcal{D}}(h') = 0$. Classes that are (agnostic) PAC-learnable have the desirable property that there exists an algorithm to select a hypothesis that has low approximation error. To prove whether a class is PAC-learnable, the notion of model complexity is defined. In the book of Shalev and Schwartz [35] PAC-learnability is connected to model complexity. A full description of this connection is beyond the scope of this thesis, but it is worth noting that a smaller model complexity yields a smaller complexity function (as defined in Definition 3.2.1).

3.2.2 Model complexity

Since PAC-learnability is connected to model complexity, this section concerns itself with the model complexity of neural networks. One of the widely used complexity measures is the Rademacher complexity.

Definition 3.2.2. Let $\mathcal{F} \subset \mathbb{R}^{\mathcal{X}}$ be a set of real-valued functions and let $x \in \mathcal{X}^m$ be a vector. The empirical Rademacher complexity of $\mathcal{F}$ with respect to $x$ is defined as
$$\hat{R}_x(\mathcal{F}) := \mathbb{E}_\epsilon\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \epsilon_i f(x_i) \right], \tag{3.12}$$
where $\mathbb{E}_\epsilon$ denotes the expectation over the independent random variables $\epsilon_i \in \{-1, 1\}$ with uniform distribution (i.e. $\mathbb{P}[\epsilon_i = 1] = \mathbb{P}[\epsilon_i = -1] = \frac{1}{2}$). If the $x_i$ are random variables $X_i$ distributed according to $\mathcal{D}(\mathcal{X})$, the Rademacher complexity of $\mathcal{F}$ with respect to $\mathcal{D}$ is given by
$$R_m(\mathcal{F}) := \mathbb{E}_x\Big[ \hat{R}_x(\mathcal{F}) \Big]. \tag{3.13}$$

The random variables $\epsilon$ in the definition are called "Rademacher variables". Intuitively, the Rademacher complexity makes sense: the richer the class $\mathcal{F}$, the greater the Rademacher complexity. Consider the following difference between the empirical and theoretical complexity. The Rademacher complexity (3.13) is defined as an expectation over a (possibly complicated and unknown) distribution. Since the empirical Rademacher complexity (3.12) is defined on a sample of $m$ observations written as the vector $x \in \mathcal{X}^m$, it is easier to bound. Theorem 3.2.4 provides a result which connects the empirical Rademacher complexity to the theoretical Rademacher complexity. To prove this theorem, let us first state a lemma which is of major importance in statistical learning theory, called McDiarmid's inequality.
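For intuition, the expectation in (3.12) can be approximated by Monte Carlo over the Rademacher signs. The sketch below (illustrative, not from the thesis) replaces the supremum over $\mathcal{F}$ by a maximum over a finite grid of candidate functions evaluated on the fixed sample $x$, which is an approximation of the true supremum.

```python
import numpy as np

def empirical_rademacher(function_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity (3.12).

    function_values : array of shape (n_functions, m) with entries f(x_i)
                      for a finite collection of functions f in F.
    """
    rng = np.random.default_rng(seed)
    _, m = function_values.shape
    total = 0.0
    for _ in range(n_draws):
        eps = rng.choice([-1.0, 1.0], size=m)        # Rademacher variables
        total += np.max(function_values @ eps) / m   # sup over the (finite) class
    return total / n_draws
```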

Lemma 3.2.3 (McDiarmid's inequality). Let $V$ be some set and let $f : V^m \to \mathbb{R}$ be a function of $m$ variables such that for some $c > 0$, for all $i \in \{1, \ldots, m\}$ and for all $x_1, \ldots, x_m, x_i' \in V$ we have
$$|f(x_1, \ldots, x_m) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_m)| \leq c.$$
Let $X_1, \ldots, X_m$ be $m$ independent random variables taking values in $V$. Then the following inequality holds:
$$\mathbb{P}\left[ |f(X_1, \ldots, X_m) - \mathbb{E} f(X_1, \ldots, X_m)| \leq c\sqrt{\frac{m \ln\!\big(\tfrac{2}{\delta}\big)}{2}} \right] \geq 1 - \delta.$$


Proof. See e.g. Shalev and Schwartz [35].

Theorem 3.2.4. Let $\mathcal{F} \subset [-M, M]^{\mathcal{X}}$ be a set of bounded, real-valued functions. For every $\varepsilon > 0$ and any product probability measure $\mathcal{D}^m$ on $\mathcal{X}^m$ the following inequality holds:
$$\mathbb{P}\Big[ |\hat{R}_x(\mathcal{F}) - R_m(\mathcal{F})| \geq \varepsilon \Big] \leq \exp\left( -\frac{m\varepsilon^2}{2M^2} \right).$$

Proof. Let $\varphi : \mathcal{X}^m \to \mathbb{R}$ be defined as $\varphi(x) := \hat{R}_x(\mathcal{F})$. By virtue of Definition 3.2.2, $\mathbb{E}[\varphi(X)] = R_m(\mathcal{F})$. Let $x, x' \in \mathcal{X}^m$ be two vectors differing in one component (i.e. as in McDiarmid's inequality). Since the image of every function $f \in \mathcal{F}$ is bounded in $[-M, M]$, the quantity $\sup_{f \in \mathcal{F}} \sum_i \epsilon_i f(x_i)$ changes by at most $2M$ if $x$ is replaced by $x'$. This means that the following inequality holds:
$$|\varphi(x) - \varphi(x')| = |\hat{R}_x(\mathcal{F}) - \hat{R}_{x'}(\mathcal{F})| \leq \frac{2M}{m}.$$
This means that we can apply McDiarmid's inequality (Lemma 3.2.3) to the function $\varphi$, which results in the following:
$$\mathbb{P}\left[ |\hat{R}_x(\mathcal{F}) - R_m(\mathcal{F})| > \frac{2M}{m}\sqrt{\frac{m \ln\!\big(\tfrac{2}{\delta}\big)}{2}} \right] < \delta.$$
The desired result follows by taking $\delta = 2\exp\left( -\frac{m\varepsilon^2}{2M^2} \right)$.

Theorem 3.2.4 may be used to bound the actual Rademacher complexity in terms of the empirical Rademacher complexity. Theoretical results in this thesis are based on the empirical Rademacher complexity.
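As a rough numerical illustration with hypothetical values: for a class of functions bounded by $M = 1$, a sample of size $m = 10{,}000$ and tolerance $\epsilon = 0.05$, Theorem 3.2.4 gives
\[
P\left[ \left| \hat{R}_x(\mathcal{F}) - R_m(\mathcal{F}) \right| \ge 0.05 \right] \le 2\exp\left( -\frac{10{,}000 \cdot 0.05^2}{2} \right) = 2 e^{-12.5} \approx 7.5 \cdot 10^{-6},
\]
so for moderately large samples the empirical Rademacher complexity is a reliable proxy for its theoretical counterpart.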

3.3 Model complexity of neural networks

This section inspects the model complexity of neural networks. Results are inspired by the paper of Golowich et al. [15]. Theorem 3.3.1 concerns the model complexity of real-valued neural networks with a sigmoid activation function. The proof of this theorem has been omitted. Lemma 3.3.5 and Theorem 3.3.6 are two extensions of results from Golowich et al. [15]. The definitions necessary for these extensions are given in Section 3.3.2, where the concept of Rademacher complexity is generalized to multivariate learning problems.

3.3.1 A bound for sigmoid activation functions

Theorem 3.3.1. Let $\mathcal{H}_l$ be the function class of vector-valued networks with $l$ hidden layers over the domain $\mathcal{X}$. Assume that the weight matrix $W$ of each hidden layer has Frobenius norm of at most $M$. Assume that the activation function has Lipschitz constant 1 and has the property that $-\sigma(x) = \sigma(-x)$. Then the empirical Rademacher complexity is bounded by
\[
m \hat{R}(\mathcal{H}_l) \le 2 M^l \sqrt{l + 1 + \log r} \; \sqrt{\max_{j \in \{1, \ldots, r\}} \sum_{i=1}^m x_{i,j}^2}.
\]

Proof. See Golowich et al. [15].

3.3.2 Rademacher complexity for vector-valued learning problems

Definition 3.2.2 is only able to deal with real-valued functions. An extension of the Rademacher complexity to multivariate problems is given in the following definition, along the lines of Maurer [25].

Definition 3.3.2. Let $\mathcal{F}$ be a class of hypotheses with vector-valued output, i.e. $f : \mathcal{X} \to \mathbb{R}^K$. The empirical Rademacher complexity of $\mathcal{F}$ w.r.t. $x$ is defined as
\[
\hat{R}_x(\mathcal{F}) = \frac{1}{m} E_\epsilon \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^m \sum_{k=1}^K \epsilon_{ik} f_k(x_i) \right],
\]
where $E_\epsilon$ denotes the expectation over the random matrix with i.i.d. elements $\epsilon_{ik} \in \{-1, 1\}$ with uniform distribution. The multivariate Rademacher complexity of $\mathcal{F}$ w.r.t. $D$ is given by
\[
R_m(\mathcal{F}) := E_x\left( \frac{1}{m} E_\epsilon \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^m \sum_{k=1}^K \epsilon_{ik} f_k(x_i) \right] \right).
\]

3.3.3 A bound for multivariate neural networks

Definition 3.3.3. Let $W \in \mathbb{R}^{h \times n}$ be a matrix. The Frobenius norm is defined as $\|W\|_F^2 = \sum_{j=1}^h \|w_j\|^2$, where $w_1, \ldots, w_h$ denote the rows of $W$.

This section concerns a bound on the empirical Rademacher complexity for neural networks with a rectified linear activation function. First, two lemmas are formulated, which will be the cornerstones of the proof.

Lemma 3.3.4. Let $\mathcal{F}$ be a class of vector-valued functions and let $\lambda > 0$. The following inequality holds:
\[
m \hat{R}(\mathcal{F}) \le \frac{1}{\lambda} \log\left( E_\epsilon \sup_{f \in \mathcal{F}} \exp\left( \lambda \sum_{i=1}^m \sum_{k=1}^K \epsilon_{ik} f_k(x_i) \right) \right).
\]

Proof. The proof is an application of Jensen's inequality and monotonicity of the exponential function:
\begin{align*}
m \hat{R}(\mathcal{F}) &= E_\epsilon\left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^m \sum_{k=1}^K \epsilon_{ik} f_k(x_i) \right] \\
&= \frac{1}{\lambda} \log \exp\left( \lambda \cdot E_\epsilon \sup_{f \in \mathcal{F}} \sum_{i=1}^m \sum_{k=1}^K \epsilon_{ik} f_k(x_i) \right) \\
&\le \frac{1}{\lambda} \log\left( E_\epsilon \exp\left( \lambda \cdot \sup_{f \in \mathcal{F}} \sum_{i=1}^m \sum_{k=1}^K \epsilon_{ik} f_k(x_i) \right) \right) \\
&= \frac{1}{\lambda} \log\left( E_\epsilon \sup_{f \in \mathcal{F}} \exp\left( \lambda \sum_{i=1}^m \sum_{k=1}^K \epsilon_{ik} f_k(x_i) \right) \right),
\end{align*}
where the inequality follows from Jensen's inequality applied to the convex function $\exp$, and the last step uses that $\exp$ is increasing.

Lemma 3.3.5. Let $\sigma$ be a function that is nonnegative, homogeneous and has Lipschitz constant 1. Furthermore, let $g : \mathbb{R}^K \to [0, \infty)$ be the function defined by $g(x) = \exp\left( C \sum_{k=1}^K x_k \right)$ for some $C > 0$, and let $W \in \mathbb{R}^{h \times n}$. Then for any vector-valued class $\mathcal{F}$ the following inequality holds:
\[
E_\epsilon \sup_{f \in \mathcal{F},\; W : \|W\|_F \le R} g\left( \left\| \sum_{i=1}^m \epsilon_{i1} \sigma(W f(x_i)) \right\|, \ldots, \left\| \sum_{i=1}^m \epsilon_{iK} \sigma(W f(x_i)) \right\| \right)
\le 2^K E_\epsilon \sup_{f \in \mathcal{F}} g\left( R \cdot \left\| \sum_{i=1}^m \epsilon_{i1} f(x_i) \right\|, \ldots, R \cdot \left\| \sum_{i=1}^m \epsilon_{iK} f(x_i) \right\| \right), \tag{3.14}
\]
where the function $\sigma$ applied to a vector is defined as $\sigma(x) = (\sigma(x_1), \ldots, \sigma(x_n))$.

Proof. Suppose $w_1, \ldots, w_h$ are the rows of the matrix $W$. The norm of each argument of the function $g$ on the left hand side of equation (3.14) can be rewritten as
\[
\left\| \sum_{i=1}^m \epsilon_{ik} \sigma(W f(x_i)) \right\|^2 = \sum_{j=1}^h \|w_j\|^2 \left( \sum_{i=1}^m \epsilon_{ik} \sigma\!\left( \frac{w_j^T}{\|w_j\|} f(x_i) \right) \right)^2,
\]
because the activation function $\sigma$ is homogeneous. The supremum in equation (3.14) is attained once $\|w_j\| = R$ holds for some $j \in \{1, \ldots, h\}$ and $\|w_i\| = 0$ for all $i \ne j$. The value $j$ for which this holds might depend on $k$, so by defining the set of indices $J = \{j_1, \ldots, j_K\}$, the following inequality is derived:
\[
E_\epsilon \sup_{f \in \mathcal{F},\; W : \|W\|_F \le R} g\left( \left\| \sum_{i=1}^m \epsilon_{i1} \sigma(W f(x_i)) \right\|, \ldots, \left\| \sum_{i=1}^m \epsilon_{iK} \sigma(W f(x_i)) \right\| \right)
\le E_\epsilon \sup_{f \in \mathcal{F},\; w_j, j \in J,\; \|w_j\| = R} g\left( \Big| \sum_{i=1}^m \epsilon_{i1} \sigma(w_{j_1}^T f(x_i)) \Big|, \ldots, \Big| \sum_{i=1}^m \epsilon_{iK} \sigma(w_{j_K}^T f(x_i)) \Big| \right). \tag{3.15}
\]
Consider the following chain of inequalities:
\begin{align*}
g(|x_1|, \ldots, |x_K|) &\le g(x_1, |x_2|, |x_3|, \ldots, |x_K|) + g(-x_1, |x_2|, |x_3|, \ldots, |x_K|) \\
&\le g(x_1, x_2, |x_3|, \ldots, |x_K|) + g(-x_1, x_2, |x_3|, \ldots, |x_K|) \\
&\quad + g(x_1, -x_2, |x_3|, \ldots, |x_K|) + g(-x_1, -x_2, |x_3|, \ldots, |x_K|) \\
&\le \ldots \\
&\le \sum_{(i_1, \ldots, i_K) \in \{-1, 1\}^K} g(i_1 x_1, i_2 x_2, \ldots, i_K x_K).
\end{align*}
Because the Rademacher random variables $\epsilon_{ij}$ are independent random variables with uniform distribution on $\{-1, +1\}$, it is possible to bound the expectation over the Rademacher variables in equation (3.15) by
\[
2^K E_\epsilon \sup_{f \in \mathcal{F},\; w_j, j \in J,\; \|w_j\| = R} g\left( \sum_{i=1}^m \epsilon_{i1} \sigma(w_{j_1}^T f(x_i)), \ldots, \sum_{i=1}^m \epsilon_{iK} \sigma(w_{j_K}^T f(x_i)) \right).
\]
By convexity of the function $g$ and independence of the Rademacher variables, equation (4.20) from Ledoux and Talagrand [21] (see Appendix C.2.1) can be applied iteratively for each of the $K$ components, to bound the above by
\begin{align*}
2^K E_\epsilon &\sup_{f \in \mathcal{F},\; w_j, j \in J,\; \|w_j\| = R} g\left( \sum_{i=1}^m \epsilon_{i1} w_{j_1}^T f(x_i), \ldots, \sum_{i=1}^m \epsilon_{iK} w_{j_K}^T f(x_i) \right) \\
&\le 2^K E_\epsilon \sup_{f \in \mathcal{F},\; w_j, j \in J,\; \|w_j\| = R} g\left( \|w_{j_1}\| \left\| \sum_{i=1}^m \epsilon_{i1} f(x_i) \right\|, \ldots, \|w_{j_K}\| \left\| \sum_{i=1}^m \epsilon_{iK} f(x_i) \right\| \right) \\
&\le 2^K E_\epsilon \sup_{f \in \mathcal{F}} g\left( R \cdot \left\| \sum_{i=1}^m \epsilon_{i1} f(x_i) \right\|, \ldots, R \cdot \left\| \sum_{i=1}^m \epsilon_{iK} f(x_i) \right\| \right).
\end{align*}
This concludes the proof of Lemma 3.3.5.

With Lemmas 3.3.4 and 3.3.5 in place for multivariate neural networks, we are able to deduce a bound on the Rademacher complexity of the class of neural networks. This bound will be used to motivate the choice of some parameters of the network.

Theorem 3.3.6. Let $\mathcal{H}_l$ be the function class of vector-valued networks with $l$ hidden layers over the domain $\mathcal{X}$. Assume that the weight matrix $W$ of each hidden layer has Frobenius norm of at most $M$. Assume that the activation function satisfies the conditions of Lemma 3.3.5. Then the empirical Rademacher complexity is bounded by
\[
m \hat{R}(\mathcal{H}_l) \le K M^l \left( \sqrt{2 l \log 2} + 1 \right) \sqrt{\sum_{i=1}^m \|x_i\|^2}.
\]

Proof. Let $\lambda > 0$ be a constant to be specified later. By Lemma 3.3.4 we obtain
\begin{align*}
m \hat{R}(\mathcal{H}_l) &= E_\epsilon\left[ \sup_{N_{W_{l-1}},\, W_l} \sum_{k=1}^K \sum_{i=1}^m \epsilon_{ik} w_k^l \sigma(N_{W_{l-1}}(x_i)) \right] \\
&\le \frac{1}{\lambda} \log\left( E_\epsilon \sup \exp\left( \lambda \sum_{k=1}^K \sum_{i=1}^m \epsilon_{ik} w_k^l \sigma(N_{W_{l-1}}(x_i)) \right) \right) \\
&\le \frac{1}{\lambda} \log\left( E_\epsilon \sup \exp\left( \lambda M \sum_{k=1}^K \left\| \sum_{i=1}^m \epsilon_{ik} \sigma(N_{W_{l-1}}(x_i)) \right\| \right) \right),
\end{align*}
where $w_k^l$ denotes the $k$-th row of the output weight matrix $W_l$ and the last inequality holds because the Frobenius norms of the weight matrices are bounded by $M$. By application of Lemma 3.3.5 with $g(x) = \exp\left( \lambda M \sum_{k=1}^K |x_k| \right)$ and $x$ equal to
\[
x = \left( \left\| \sum_{i=1}^m \epsilon_{i1} \sigma(N_{W_{l-1}}(x_i)) \right\|, \ldots, \left\| \sum_{i=1}^m \epsilon_{iK} \sigma(N_{W_{l-1}}(x_i)) \right\| \right),
\]
we obtain that
\begin{align*}
m \hat{R}(\mathcal{H}_l) &\le \frac{1}{\lambda} \log\left( E_\epsilon \sup_{f,\; \|W_{l-1}\|_F \le M} \exp\left( \lambda M \sum_{k=1}^K \left\| \sum_{i=1}^m \epsilon_{ik} \sigma(W_{l-1} f(x_i)) \right\| \right) \right) \\
&\le \frac{1}{\lambda} \log\left( 2^K E_\epsilon \sup_f \exp\left( M \cdot M \cdot \lambda \sum_{k=1}^K \left\| \sum_{i=1}^m \epsilon_{ik} f(x_i) \right\| \right) \right),
\end{align*}
where $f$ is a function in the class $\sigma \circ N_{W_{l-2}}(x)$, i.e. neural networks with $l-2$ hidden layers. This process can now be iterated over the remaining hidden layers, such that the following expression for the Rademacher complexity is derived:
\begin{align*}
m \hat{R}(\mathcal{H}_l) &\le \frac{1}{\lambda} \log\left( 2^{Kl} E_\epsilon \exp\left( M^l \lambda \sum_{k=1}^K \left\| \sum_{i=1}^m \epsilon_{ik} x_i \right\| \right) \right) \\
&= \frac{Kl \log 2}{\lambda} + \frac{1}{\lambda} \log\left( E_\epsilon \exp\left( M^l \lambda \sum_{k=1}^K \left\| \sum_{i=1}^m \epsilon_{ik} x_i \right\| \right) \right).
\end{align*}
Define the random variable $Z = \sum_{k=1}^K Z_k$, where $Z_k = M^l \left\| \sum_{i=1}^m \epsilon_{ik} x_i \right\|$. Note that $Z_k$ and $Z_j$ are i.i.d. random variables for $k \ne j$. The bound on the Rademacher complexity may be rewritten in terms of $Z$ in the following way:
\begin{align*}
m \hat{R}(\mathcal{H}_l) &\le \frac{Kl \log 2}{\lambda} + \frac{1}{\lambda} \log\left( E \exp(\lambda Z) \right) \\
&= \frac{Kl \log 2}{\lambda} + \frac{1}{\lambda} \log\left( E \exp\left( \lambda (Z - EZ) \right) \right) + EZ.
\end{align*}
To bound the expectation $EZ$, apply Jensen's inequality to $E Z_k$:
\[
E Z_k \le M^l \sqrt{ E \left\| \sum_{i=1}^m \epsilon_{ik} x_i \right\|^2 } = M^l \sqrt{ E \sum_{i,i'=1}^m \epsilon_{ik} \epsilon_{i'k} x_i^T x_{i'} } = M^l \sqrt{ \sum_{i=1}^m \|x_i\|^2 },
\]
so that $EZ \le K M^l \sqrt{ \sum_{i=1}^m \|x_i\|^2 }$. For the logarithmic term the concentration inequality from Appendix C.4 is used. To see why this inequality applies, note that the random variable $Z$ is a deterministic function of the i.i.d. random variables $\epsilon_{ik}$ and that $Z$ satisfies the following bounded-difference condition:
\[
Z(\ldots, \epsilon_{ik}, \ldots) - Z(\ldots, -\epsilon_{ik}, \ldots) \le 2 M^l \|x_i\|_2.
\]
Application of Lemma C.4.2 yields the following inequality:
\[
\frac{1}{\lambda} \log\left( E \exp\left( \lambda (Z - EZ) \right) \right) \le \frac{1}{\lambda} \cdot \frac{\lambda^2 M^{2l} K \sum_{i=1}^m \|x_i\|^2}{2} = \frac{\lambda M^{2l} K \sum_{i=1}^m \|x_i\|^2}{2},
\]
which leads to the following bound on the empirical Rademacher complexity:
\[
m \hat{R}(\mathcal{H}_l) \le \frac{Kl \log 2}{\lambda} + \frac{\lambda M^{2l} K \sum_{i=1}^m \|x_i\|^2}{2} + K M^l \sqrt{ \sum_{i=1}^m \|x_i\|^2 },
\]
valid for any $\lambda > 0$. This bound is now minimized by taking the derivative of the right hand side with respect to $\lambda$ and solving for the $\lambda$ at which the derivative equals zero:
\[
-\frac{Kl \log 2}{\lambda^2} + \frac{M^{2l} K \sum_{i=1}^m \|x_i\|^2}{2} = 0 \quad \Longrightarrow \quad \lambda = \frac{\sqrt{2 l \log 2}}{M^l \sqrt{ \sum_{i=1}^m \|x_i\|^2 }},
\]
which leads to the bound
\[
m \hat{R}(\mathcal{H}_l) \le K M^l \left( \sqrt{2 l \log 2} + 1 \right) \sqrt{ \sum_{i=1}^m \|x_i\|^2 }.
\]

Theorems 3.3.1 and 3.3.6 provide the theoretical foundation for choosing the simplest models. We see that the bounds on the Rademacher complexity are smaller for neural networks with a small number of hidden layers (i.e. small $l$). Furthermore, the bounds are smaller for small norms of the weight matrices. This can be compared with regularization for regression (see for instance Hastie et al. [17]). The results from the theorems in this section are used in this thesis in the sense that neural networks are chosen with a small number of hidden layers and a small number of hidden nodes.
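To make the dependence on the depth $l$ and the norm bound $M$ tangible, the short Python sketch below (hypothetical sample; not part of the original analysis) evaluates the bound of Theorem 3.3.6 for a fixed sample and several choices of $l$ and $M$.
\begin{verbatim}
import numpy as np

def rademacher_bound(x, K, l, M):
    """Evaluate the bound of Theorem 3.3.6 on the empirical Rademacher
    complexity: (1/m) * K * M**l * (sqrt(2*l*log 2) + 1) * sqrt(sum ||x_i||^2)."""
    m = x.shape[0]
    data_term = np.sqrt(np.sum(np.linalg.norm(x, axis=1) ** 2))
    return K * M**l * (np.sqrt(2 * l * np.log(2)) + 1) * data_term / m

# Hypothetical sample: m = 1000 risk-driver scenarios in dimension 6.
x = np.random.default_rng(0).standard_normal((1000, 6))
for l in (1, 2, 4):
    for M in (1.0, 2.0):
        print(f"l={l}, M={M}: bound = {rademacher_bound(x, K=1, l=l, M=M):.3f}")
# The bound grows roughly like M**l * sqrt(l), which motivates shallow,
# small-norm networks.
\end{verbatim}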

3.4 Training neural networks

The flexibility of neural networks as seen in Theorem 3.1.10 is of little practical use if there is no algorithm that teaches the neural network how the weights should be chosen. Fortunately, such algorithms exist. Backpropagation is an example of such an algorithm. Based on gradient descent, backpropagation adjusts the weights of the neural network by calculating the gradient of a loss function with respect to the weight matrices that describe the neural network. This section outlines the backpropagation algorithm and discusses the resilient backpropagation algorithm, which is used in this thesis. Please note that, while it is important to define algorithms which enable the neural network structure to learn from the data, a discussion of such algorithms is not the main focus of this thesis. Therefore, this section only discusses the essentials.
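As a minimal illustration of the principle (a sketch in Python with hypothetical data, layer sizes and learning rate, not the training procedure used in this thesis), the following code trains a one-hidden-layer network on a squared-error loss with plain gradient-descent backpropagation.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression data: 200 samples, 3 inputs, 1 output.
X = rng.standard_normal((200, 3))
y = np.sin(X[:, :1]) + 0.1 * rng.standard_normal((200, 1))

# One hidden layer with 8 nodes; weights initialised from a standard normal.
W1, b1 = rng.standard_normal((3, 8)), np.zeros(8)
W2, b2 = rng.standard_normal((8, 1)), np.zeros(1)
eta = 0.01  # learning rate (hypothetical)

for epoch in range(2000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)          # hidden activations
    y_hat = h @ W2 + b2               # linear output layer
    err = y_hat - y                   # residuals

    # Backward pass: gradients of the mean squared error w.r.t. the weights.
    grad_out = 2 * err / len(X)               # dL/dy_hat
    gW2 = h.T @ grad_out
    gb2 = grad_out.sum(axis=0)
    grad_h = grad_out @ W2.T * (1 - h**2)     # backprop through tanh
    gW1 = X.T @ grad_h
    gb1 = grad_h.sum(axis=0)

    # Gradient-descent update.
    W1 -= eta * gW1; b1 -= eta * gb1
    W2 -= eta * gW2; b2 -= eta * gb2

print("final MSE:", float(np.mean(err**2)))
\end{verbatim}
The resilient backpropagation algorithm discussed below replaces the fixed learning rate in this sketch by per-weight, sign-based step sizes.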

3.4.1 Weight initialization

There are various ways to initialize the weights of neural networks (see e.g. Nguyen and Widrow [28], Yam and Chow [38]). A detailed treatment is beyond the scope of this thesis. Initial weights are chosen independently from a standard normal distribution.
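In code this initialization amounts to the following minimal sketch (layer sizes are hypothetical): every entry of every weight matrix is drawn independently from $N(0, 1)$.
\begin{verbatim}
import numpy as np

def init_weights(layer_sizes, seed=0):
    """Draw each weight matrix entry independently from a standard normal."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((n_in, n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

weights = init_weights([3, 8, 8, 1])  # e.g. two hidden layers of 8 nodes
\end{verbatim}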

3.4.2 Weight optimization

The flexibility of neural networks comes from the many parameters (or weights) by which they are defined. An important factor in the performance of a neural network is the algorithm by which these weights are optimized.
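As an illustration of the resilient backpropagation (Rprop) idea mentioned above, the sketch below shows the standard sign-based update rule with the usual step-size factors $\eta^+ = 1.2$ and $\eta^- = 0.5$; this is a generic sketch and not necessarily the exact variant used in this thesis.
\begin{verbatim}
import numpy as np

def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One resilient-backpropagation update for a single weight array.

    Only the sign of the gradient is used: the per-weight step size grows by
    eta_plus while the gradient keeps its sign and shrinks by eta_minus when
    the sign flips (indicating that the minimum was overshot)."""
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    # On a sign change the gradient is zeroed so the next step is not adapted again.
    grad = np.where(sign_change < 0, 0.0, grad)
    w = w - np.sign(grad) * step
    return w, grad, step
\end{verbatim}
In a full training loop this update would be applied to every weight matrix of the network once per epoch, with prev_grad initialised to zero, step initialised to a small constant (e.g. 0.1) and both carried over between epochs.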
