Clustering over the BeNeLux: modelling dependent wind storms

Modelling dependent wind storms

J.P. Kunst

Master’s Thesis to obtain the degree in Actuarial Science and Mathematical Finance
University of Amsterdam
Faculty of Economics and Business
Amsterdam School of Economics

Author: J.P. Kunst
Student nr: 6065082
Email: jasperkunst00@gmail.com
Date: July 13, 2015

Supervisor: Dhr. Dr. S.U. Can
Supervisor: MSc. Y. Arnoldus AAG
Second reader: Prof. Dr. R. Kaas


Abstract

This thesis investigates the effect of clustering of windstorms over the BeNeLux. This weather feature has recently gained the attention of catastrophe modellers and is important in the determination of capital requirements. To gain insight into this type of modelling, the process of storm formation and the definition of clustering are assessed. Storms arriving within seven, fourteen and thirty days are discussed. Based on Dutch and Belgian weather data, a model for the arrival rate of severe storms is constructed. To measure the dependence of these storms, the overdispersion of the number of storms per winter is studied. The parameters for (zero-adjusted) Poisson, Negative Binomial and Poisson-Binomial distributions are fitted to the empirical data and compared with earlier literature on the subject. Also the storm tracks of some severe clusters are investigated and used to illustrate the causes of dependent storm arrivals. Storms arriving within short time intervals do not necessarily have the same origin. To describe the clustered arrivals, a distinction is made between secondary cyclogenesis, clustering due to large scale pressure variations, and storms that appear to arrive independently.

Keywords: Catastrophe modelling, Clustering, Extra-tropical cyclones, Insurance,


Acknowledgement

This thesis used the data of the Royal Dutch Meteorological Institute, which was kindly and freely supplemented with Belgian data by their colleagues of the KMI. Also free and very useful were R, LaTeX, several packages and all the online questions and answers of people who struggled before me to get their code working. I would like to thank Dr. Umut Can of the University of Amsterdam for supervising and helping me with statistical issues, and Nico de Boer for emphasising that every programmed variable should be variable and for reading and feedback. Also my parents, who are always supportive and available for critical notes. Finally, many thanks to Yoeri Arnoldus, who suggested the topic and provided reflection and discussion, which made this thesis a very pleasant way to finish my Master's degree in Actuarial Science and Mathematical Finance.


Contents

Introduction

1 Background
1.1 Insurance of windstorm risk
1.2 Modelling of catastrophe risk
1.3 Vendor models
1.4 Climate models
1.5 Storm development
1.6 Clustering

2 Data

3 Method
3.1 Windstorm model
3.2 Identifying clusters
3.3 Stochastic model fitting

4 Modelling results
4.1 Number of storms
4.2 Modelling overdispersion
4.3 Estimating clusters
4.4 Comparison of models
4.5 Dispersion modelling for insurers

5 Storm tracks
5.1 Tracking method
5.2 Visual interpretation

6 Conclusions and discussion
6.1 Suggestions for further research

References

Appendix A
Figures and tables discussed in section 5
Figures by section 5

Appendix B
R codes


Glossary

AIC Akaike Information Criterion, a model selection statistic that penalises more complex models.

AIR Applied Insurance Research, a vendor of catastrophe models.

AEP Aggregated exceedance probability.

BFGS algorithm Quasi-Newton method to solve non-linear optimisation.

background points Measurements below the specified threshold in the algorithm of Hodge (1994).

cost function A function that maps variables into a real number that can be minimised by an algorithm.

cyclogenesis Cyclone or storm development.

eddy Static or transient obstacle that results in a swirl of wind.

ECMWF European Centre for Medium-Range Weather Forecasts.

EIOPA European Insurance and Occupational Pensions Authority, the European insurance supervisor.

ELT ‘Event-loss’ table.

EQECAT WorldCat Enterprise, a vendor of catastrophe models.

ERA-Interim European reanalysis of climate data since 1979.

feature points Measurements that are part of the storm tracks in the algorithm of Hodge (1994).

frontogenesis Storm development between the warm and cold air around the polar jet stream.

IFS Cy31R2 Integrated Forecasting System cycle 31R2, a General Circulation Model.

GCM General Circulation Model, meteorological model that predicts the state of the atmosphere.

NAO North Atlantic Oscillation.

object points Measurements above a specified threshold in the algorithm of Hodge (1994).

OEP Occurrence exceedance probability.

overdispersion The variance of count data is larger than its mean.

loss amplification When a loss occurs as part of a major catastrophic event, the initial damage can be amplified due to, for example, a lack of prompt repairs.

Neyman-Scott process Model to describe clustering in space (as opposed to clustering in time).

relative vorticity Curl of the wind relative to a fixed frame of reference (often the earth’s surface).

return period Estimate of the likelihood of an event.

RMS Risk Management Solutions, a vendor of catastrophe models.

secondary cyclogenesis Storm development where the conditions due to the passage of a first storm are favourable for the development of another storm.

vendor model Model created by a company that specialises in selling risk models as a service to the market.


XL Common abbreviation for excess-of-loss reinsurance contracts.

XWS Catalogue Website of Meteorological Offices from different universities, containing a database of extreme wind storms.


Introduction

The Dutch coast is known for its engineering against wind and water and its flatlands are famous for their windmills. Not surprisingly, one of the main risk drivers for property insurers in the Netherlands is windstorm risk. Multiple times per winter, coastal areas are exposed to heavy wind gusts that sometimes cause big losses. Insurers have large amounts of insured value that are exposed to this catastrophe risk. Parts of those risks are often transferred to a reinsurer, who starts paying when claims become very large. Recent meteorological research suggests that winter storms tend to cluster and arrive successively at the Dutch coast. Insurers and regulators struggle with defining and measuring the risks of clustering, but it is likely that clustering does have an effect on the insured value that is exposed to these catastrophe risks. Thus, for both the insurer and the reinsurer the determination of premiums and capital requirements depends on an adequate modelling of those risks (note: where 'insurer' is written in this thesis, it can be read as 'reinsurer' unless explicitly stated otherwise). This modelling of catastrophe risk is usually done by a third party that calculates the event loss tables for the catastrophic events. Unfortunately those parties rarely disclose the precise details of their methods for generating the loss tables. If clustering is incorporated into these models this is specified, but the details of the effect of clustering are not easy to come by. This thesis aims to add value in the area between the meteorological and insurance modelling of windstorms. Meteorological climate models are freely accessible and constantly updated by research groups from all over the world, in contrast to the insurance models, which are by nature undisclosed and built for commercial purposes. This leads to the investigation of the effect of clustering on the scale of the BeNeLux as the topic of this research. The main research question is:

What is the effect of clustering of storms on the scale of the BeNeLux for a (re)insurance company?

Existing literature is combined and compared with a windstorm model to answer the following sub-questions:

• How is windstorm risk modelled?

• How do storms and storm clusters develop?
• How should clusters be defined?
• Which distribution should be used to estimate the number of storms and clusters?
• What is the effect of clusters for the BeNeLux compared to other regions?

The necessary information to understand the development of windstorms and clusters, and also an introduction to (re)insurance, is given in the background section. Then a windstorm model is created from raw weather data. With this model, return rates for storms are derived for each winter in the dataset. The next step is the fitting of parameters of a theoretical distribution to the empirical data. The estimated parameters are compared with existing literature on clustering. Finally, the tracks of some major storms are discussed.

Chapter 1 gives the background. Chapter 2 describes the data that is used for the modelling set out in the method chapter, chapter 3. The results of the modelling are given in chapter 4. Chapter 5 dives deeper into the storm tracks and chapter 6 contains the conclusions and a discussion of suggested future research.


1 Background

The topics relevant and illustrative for further chapters of this thesis are described below. Catastrophe risk modelling for insurers and reinsurers requires an understanding of windstorms. Therefore an overview of the basic meteorological background on windstorms and clusters is given in this section. First the insurance and reinsurance business that is exposed to catastrophe risk is explained briefly, then the modelling of catastrophe risk is described in sections 1.2 and 1.3. Sections about climate and weather follow in 1.4 and 1.5 before the theory on clustering is discussed in section 1.6.

1.1 Insurance of windstorm risk

The insurance business is based on taking over risks from parties that do not want to suffer the losses from extreme events. Insurers often transfer part of their risks to a reinsurance company. This section describes the size of the Dutch non-life insurance market, some insurance contracts for catastrophe losses and the capital requirements of EIOPA for the catastrophe risk of the non-life business.

1.1.1 Size of the storm insurance business

The total gross premium for non-life business in the Netherlands is reported annually by the Dutch Association of Insurers (Verbond van Verzekeraars, 2014) and fluctuates around 12 billion euros per year. The most important natural catastrophes causing property damage in the Netherlands and Belgium are windstorms. Examples are the storms named Jeannette¹,² with a loss of 75 million euro and Kyrill³ with 170 million euro (Verbond van Verzekeraars, 2014). For Belgium, Kyrill caused a loss of 213 million (Assuralia. Beroepsvereniging van verzekeringsondernemingen, 2013). This can be seen in data on the most damaging days of the last fifteen years for private houses in the Netherlands. The number of reported claims for home insurance in 2013 in figure 1 shows the difference in the proportion of storm damage compared to other damage, which is clearly reflected in the damage per day. Underlining the fact that those storms are rare events is the fact that the storm of 28-10-2013 was the first big storm since 2007.

Figure 1: Private home insurance damage per day in 2013. Average is 100% (Verbond van Verzekeraars, 2013).

¹ The naming of storms is done by the Institute of Meteorology of the Freie Universität of Berlin, a well-accepted authority on winter storms.
² Jeannette arrived over Europe on 27-10-2002.


1.1.2 Reinsurance

There are two common types of reinsurance contracts, namely proportional and non-proportional. For the first type, the part of the claims that the reinsurer takes over is defined for each separate policy as quota-share or surplus share. For quota-share a certain percentage is for the reinsurer and for surplus share the reinsurer pays the amount above a retention level. For the non-proportional reinsurance contract the liability of the reinsurer is based on the aggregate claims that the insurer incurs. This type of excess-of-loss (XL) reinsurance can be subdivided into stop-loss reinsurance (aggregate XL), excess-of-loss per occurrence or event (Cat XL) or per risk (working XL). The reinsurance is triggered when the loss of the insurer exceeds the predetermined retention level. The ceded risk is paid by the reinsurer and the retention remains a loss for the cedent/insurer. Often the losses above the retention level are divided into layers that are covered by different reinsurers. After reinsurance is triggered, the contract sometimes asks for a reinstatement premium. After the payment of this premium the contract is restored and will cover future potential losses.
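To make the difference between these contract types concrete, the following sketch in R computes the reinsurer's share of a loss under a quota-share contract and under a single excess-of-loss layer, following the descriptions above. The cession percentage, retention and limit are invented for the example and are not taken from the thesis.

```r
# Quota-share: the reinsurer takes a fixed percentage of every claim.
quota_share <- function(claim, cession = 0.30) {
  cession * claim
}

# Excess-of-loss (XL) layer "limit xs retention": the reinsurer pays the part
# of the (event) loss between the retention and retention + limit.
xl_layer <- function(loss, retention = 5e6, limit = 10e6) {
  pmin(pmax(loss - retention, 0), limit)
}

event_loss <- 12e6                      # hypothetical storm loss in euros
xl_layer(event_loss)                    # ceded to the reinsurer: 7e6
event_loss - xl_layer(event_loss)       # retained by the cedent: 5e6
```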

1.1.3 Capital requirements

Regulation like Solvency II asks for an adequate and prudent estimation of the expected losses to determine capital requirements. Most insurers in the Netherlands use the standard model proposed by the European insurance authority (EIOPA) to calculate their capital requirements and only a few big insurers have their own internal models. A short overview is given of the steps to calculate the capital for the Catastrophe module for Non-Life business according to the standard model. Catastrophe risk is defined by EIOPA as: “the risk of loss, or of adverse change in the value of insurance liabilities, resulting from significant uncertainty of pricing and provisioning assumptions related to extreme or exceptional events.” So an insurer tries to quantify the risk of a big storm on the price and provisions for its outstanding policies. The capital requirements are calculated for four sub-modules:

• the natural catastrophe risk sub-module;
• the sub-module for catastrophe risk of non-proportional property reinsurance;
• the man-made catastrophe risk sub-module;
• the sub-module for other non-life catastrophe risk.

Windstorm risk is part of the first sub-module, among other natural catastrophes like earthquakes, flooding and hail. Of those catastrophes, windstorms are the most damaging in the BeNeLux. The calculations of the capital requirements are based on the sums insured and the estimated gross premium for those contracts. For all storm zones and regions the change in basic own funds is considered for two pre-specified loss scenarios which describe a shock. Those shocks are severe events that may have a bad outcome for the insurer and can appeal to different layers of reinsurance. The windstorm loss in the scenarios is calculated by weighting the sums of insured values with a factor that is specified by EIOPA for each zone and region. The capital requirements are then calculated taking into account the correlation between geographical regions. The correlation between regions should be chosen by the insurer. The shocks are described in the technical specifications for the calculation of capital requirements and are given in the box below.

EIOPA states explicitly that the calculations of the capital requirements should follow the assumption that two consecutive events in the scenarios are independent. No new reinsurance contracts are entered but existing reinsurance can be reinstated according to normal procedures. The second loss in the scenarios can be seen as a storm but also as a loss amplification of a severe event (see section 1.2.3). When losses occur on a big scale this may result in damage that is not adequately repaired. This leads in turn to new losses which amplify the loss of the first event. The standard model states as well that during the calculation steps of the scenarios no new reinsurance contracts are entered but existing reinsurance can be reinstated.

The shock is defined as:

Scenario 1

An instantaneous loss of an amount that, without deduction of the amounts recoverable from reinsurance contracts and special purpose vehicles, is equal to 100% of the specified windstorm loss in region r followed by a loss of an amount that, without deduction of the amounts recoverable from reinsurance contracts and special purpose vehicles, is equal to 20% of the specified windstorm loss in region r.

Scenario 2

An instantaneous loss of an amount that, without deduction of the amounts recoverable from reinsurance contracts and special purpose vehicles, is equal to 80% of the specified windstorm loss in region r followed by a loss of an amount that, without deduction of the amounts recoverable from reinsurance contracts and special purpose vehicles, is equal to 40% of the specified windstorm loss in region r. (Directive 2009/138/EC, 2009)
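A minimal sketch of the two scenarios in R, assuming only a specified windstorm loss L_r for region r; the reinsurance recoveries and the aggregation over zones and regions that the standard formula also prescribes are ignored here.

```r
# Gross losses of the two prescribed windstorm scenarios for one region,
# given the specified windstorm loss L_r (illustration only).
windstorm_scenarios <- function(L_r) {
  list(
    scenario_1 = c(first_event = 1.00 * L_r, second_event = 0.20 * L_r),
    scenario_2 = c(first_event = 0.80 * L_r, second_event = 0.40 * L_r)
  )
}
windstorm_scenarios(100e6)   # e.g. a specified windstorm loss of 100 million euro
```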

As an alternative to the standard model insurers can use an internal model to calculate capital requirements. When an internal model is used, the insurance company should comply with the rules set by EIOPA to get the model approved. Articles 120 to 126 of Directive 2009/138/EC set standards for the statistical quality, calibration, validation and documentation of the internal model. When using an internal model for windstorm risk a thorough understanding of the windstorm risk is required. One of the questions the insurer needs to address is how to incorporate the effect of clustering. A specific article is included in the Directive to state that if external data or models are used, the insurer should still have sufficient knowledge of the methods used to get approval of the internal model. An example of such external models are the vendor models that insurers often use to calculate their catastrophe risk. The next section describes the modelling of catastrophe risk and the existing vendor models. The correct application of the standard or internal model leads to a required capital that totalled almost 18 billion in 2013 for 217 insurance companies, of which 149 were non-life insurance companies (Verbond van Verzekeraars, 2013). The available capital that insurers currently have to cover the requirements is around 2.5 times this capital. This ratio is known as the solvency ratio.

1.2 Modelling of catastrophe risk

Catastrophe risk, and windstorm risk in particular, is modelled from input through different modules to an output that can help to estimate losses. This section explains in more detail the modules that are used to model catastrophe risk. The models combine input information about weather, insured values and reinsurance. This information goes into a hazard, an exposure, a vulnerability and a financial module, which results in return rates and estimated loss tables. The return rates are inversely related to the intensity of events: when the storms observed in n years are ranked with respect to intensity, the storm with rank r has a return period of n/r years, i.e. a return rate of r/n per year (see the Method, chapter 3). This section describes the basic steps of catastrophe modelling.

The input consists of the insurance contracts, specifying the insured values and giving information about the location, age, construction type and other key risk drivers of economic damage. Apart from the insured values, the main input is weather information. Based on historical data, weather models (see section 1.4) provide thousands of simulated years of weather and their resulting events (storms in this case). The modules of a catastrophe risk model that use this input are described below. They answer the main questions that an insurer might ask about the catastrophe.

1.2.1 Hazard and exposure module

Where, when and how intensely are insured values hit?

In the hazard module the distribution of storms in time and place is derived from the data produced by weather models. The outcomes of those stochastic events are simulated to expand the weather data to a very large list of probabilities for specified events. The exposure module gives all the relevant characteristics of the insured values. Exposed to windstorm risk are office, public and private property. Also, for example, greenhouses are typically found in the portfolio of an insurer. The geographic situation of those greenhouses makes them vulnerable to storms (see section 1.5 about storm development). The hazard module should be able to generate events that are consistent with meteorological observations. Therefore the predicted return period and intensity of storms should be sensible. From the inter-arrival time and severity of a storm the identified storms can be listed and ordered based on rarity, and a return rate can be derived. Of interest for insurers is the precise estimation of the return period (long) and the return rate (small) of rare storms that cause high material damage. In this part of the model clustering can be incorporated as well, by making the arrival pattern of the storms dependent on the number of storms that have already arrived.

1.2.2 Vulnerability module

What is the resulting damage?

In the vulnerability module the events from the hazard module are released on the exposures that the insurer has. Events and intensities are simulated and show how vulnerable the insured values are. Typically more vulnerable are buildings with much open space around them. Some examples are agricultural buildings like greenhouses due to their location in the fields, schools, which usually have playgrounds around them and large windows, or airports with large parking lots where the wind blows freely.

1.2.3 Financial module

What does it cost?

Taking into account the insurance contracts and reinsurance of losses, this module derives the financial consequences of the simulated events. All possible simulated outcomes are returned in ‘event-loss’ tables (ELTs). If the worst case scenarios are considered, the occurrence exceedance probability curve (OEP) and the aggregated exceedance probability curve (AEP) can be calculated from these tables. The OEP is the distribution curve of the maximum loss that occurs in a given year. Similarly, the AEP is the distribution of the aggregated loss per year.
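As an illustration of how these curves follow from an event-loss table, the R sketch below computes empirical OEP and AEP exceedance probabilities from simulated events. The table layout and loss distribution are invented, and simulation years without any event are ignored for brevity.

```r
set.seed(1)
# Hypothetical event-loss table: one row per simulated event.
elt <- data.frame(
  year = sample(1:10000, 30000, replace = TRUE),   # simulation year of the event
  loss = rlnorm(30000, meanlog = 15, sdlog = 1)    # loss caused by the event (EUR)
)

max_by_year <- tapply(elt$loss, elt$year, max)   # largest single event per year
agg_by_year <- tapply(elt$loss, elt$year, sum)   # aggregate loss per year

# Exceedance probability of a loss level x.
oep <- function(x) mean(max_by_year > x)   # occurrence exceedance probability
aep <- function(x) mean(agg_by_year > x)   # aggregate exceedance probability

oep(5e7)
aep(5e7)
```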

Another part that falls under the financial module is post loss amplification. When a loss occurs and this is part of a major catastrophic event, it can be the case that the initial damage is amplified. Clustering of storms can be accounted for in this loss amplification. An example is a storm that causes damage to many roofs in a city. When not enough roof repair technicians are available to repair the damage sufficiently before the next rainy day, more property damage will occur only because of the first event. This example illustrates deterioration of vulnerability and economic demand surge, i.e. the price of the reconstruction will rise because of high demand. Other causes of loss amplification are long business interruption, claim inflation and coverage expansion. Claim inflation arises for example when the insurer gets so many claims that it is impossible to follow all the procedures to check the claims for exaggeration and fraud. Also changes in units of exposure, exposure characteristics and social inflation are sources of claim inflation (Gesmann et al., 2014). Coverage expansion is often a result of political pressure after a catastrophe (Souch, 2010). When a catastrophic event happens insurers are forced to expand the coverage beyond the specification of the contracts.

1.2.4 Output of the catastrophe model

The modules above return the expected loss tables and the distribution of the expected losses. With these tables the insurer has the information to calculate premiums, provisions and capital requirements. The output of the catastrophe models comprises two types of uncertainty: primary uncertainty in the likelihood that events occur and secondary uncertainty in the possible losses that these events cause. This thesis is of interest for understanding the uncertainty in the arrival of storms and how well these can be predicted. Besides this primary uncertainty, secondary uncertainty is also related to clustering by means of the definition of an event. When two storms arrive within a very short period they can be considered the same event by insurers, but will probably lead to more damage than one single storm.

1.3 Vendor models

Most insurers use one or more of the existing vendor models that implement the modules of the previous section. Vendor models are created by companies that specialise in selling (the outcomes of) risk models as a service in the market. Insurers can also get the estimates of loss tables through a broker that has bought several licences for those vendor models. This gives the possibility to use more models for a lower price. The downside of using a broker is the increased separation between the insurer and the actual model. This makes the process of risk modelling less transparent. Three main US based companies offer vendor models for catastrophe and windstorm risks. Applied Insurance Research (AIR), WorldCat enterprise (EQECAT) and Risk Management Solutions (RMS) all use historical data and climate models (see section 1.4) to calibrate their parameters for the simulation of storm tracks. Since those vendors often do not disclose the specific techniques in their models, they are difficult to compare. A way to evaluate them is to look at the assumptions and input they use, but more important is to compare how the models perform for the risks and areas of interest. Evaluation of the vendor models' output can be done for the industry losses, portfolio specific losses, losses by zone, etcetera. Output can be compared with other versions of the same model and models from different suppliers. The model assumptions could be compared to scientific data and checked for biases or limitations. Also the fit with the exposure of the user of the model should be evaluated. It is beyond the scope of this thesis to dive into the different versions of all those three models. A way to model dependency in a vendor model is discussed in section 4.5. That even the choice of the version of a model is already important is shown by the press release of the world's third largest insurance broker Willis Re: “UK insurers may see increases of up to 97% in their capital requirement for catastrophe exposures under Solvency II rules when making calculations under the new RMS model version 11, in comparison to under version 10”. This new version of RMS included, among others, a more severe effect of clustering (Willis Re. press release, June 14, 2012).

A possibility to make use of the vendor models for catastrophe modelling is to blend the different outcomes. Blending is an illustrative example of the balance an insurer needs to find between model enhancement and increase in complexity. Calder et al. (2012) describe the steps necessary when using multiple models for pricing. First, outcomes of the models need to be obtained and compared. After this assessment they can be combined with given weights. This last step, however, is often not carried out by the insurer because of technical and operational complexity. In that case just one model is used. Blending can have its merits by reducing the vulnerability to uncertainty in a specific model. However, it cannot reduce all uncertainty, and the implementation and governance play a major role in the choice to use multiple models. Two ways to blend the models are severity and frequency blending. For severity blending the weighted average of the output (estimated loss) of the models is taken at each exceedance probability. For frequency blending the probabilities that give the same output (loss) are weighted. The choice to overweight or underweight a model is based on the characteristics of the model. It may have been updated more recently or have access to better historical data. Less transparency can be a reason to underweight a model. Severity blending requires easier calculations, which is an advantage. However, frequency blending is assumed to be better because it handles the non-linearities that are usual in insurance contracts (excesses and deductibles for example). The use of vendor models or a blended combination of vendor models is common practice for insurers exposed to catastrophe risk. Even though the models are not perfect, they can be very valuable if used in an appropriate manner.
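A small sketch of the two blending approaches, using two invented sets of simulated annual losses as stand-ins for the output of two vendor models and an arbitrary weight vector:

```r
set.seed(2)
loss_A <- rlnorm(10000, meanlog = 15, sdlog = 0.8)   # annual losses, model A
loss_B <- rlnorm(10000, meanlog = 15, sdlog = 1.0)   # annual losses, model B
w <- c(A = 0.6, B = 0.4)                             # chosen model weights

# Severity blending: weighted average of the losses at fixed exceedance probabilities.
probs <- c(0.10, 0.04, 0.02, 0.01, 0.005)
sev_blend <- w["A"] * quantile(loss_A, 1 - probs) +
             w["B"] * quantile(loss_B, 1 - probs)

# Frequency blending: weighted average of the exceedance probabilities
# at a common grid of loss levels.
loss_grid  <- seq(1e6, 2e8, length.out = 200)
freq_blend <- sapply(loss_grid, function(x)
  w["A"] * mean(loss_A > x) + w["B"] * mean(loss_B > x))
```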

1.4 Climate models

The following section briefly describes the data used to calibrate catastrophe models. The first and most obvious input data is historical data, but as this data has been structurally collected only since 1950 or later for the Netherlands (and most other countries), there is a need to expand this data. Another source of climate data can therefore be obtained from General Circulation Models, which are meteorological models that calculate the state of the atmosphere at a detailed level for multiple time steps to predict the weather. Those models are calibrated based on the historical data and can generate thousands of years of simulated weather. The nature of severe storms with long periods between arrivals (e.g. once-in-200-years events) asks for the use of these simulations to extend the possible time dimension. However, it should be noted that those simulations are always an extrapolation of the small available historical dataset. The dataset used for this thesis is the ERA-Interim reanalysis of the ECMWF. This dataset is a result of an Integrated Forecast System (Dee et al., 2011). It solves, for every time step and grid point (position on the earth) of the prediction, the Euler equations for continuity (mass conservation), conservation of momentum and thermal energy. These equations describe the state of the atmosphere in terms of velocity (relative to the earth's surface), humidity and temperature. The state of the atmosphere causes the weather and possible storms. Perfect understanding of those climate models is not required to use the resulting weather patterns, but it is useful to understand the principles of this modelling to evaluate the assumptions as in Dee et al. (2011).

1.5 Storm development

The basics about weather and wind are included below to understand the processes underlying storm development. Investigating this subject is useful and sometimes necessary to understand the literature about clustering which follows in section 1.6.1. It is also nice to show some pictures of clouds to flavour the physics in this section and the theory and formulas to come.

The storms that arrive each winter from the Atlantic Ocean and the North Sea in the Netherlands are examples of extra-tropical storms. They are different from hurricanes (tropical storms) in their formation characteristics, intensity, source of energy and occurrence. This thesis focuses on the winter months to look for extra-tropical cyclones that follow a path from the Atlantic Ocean to the European mainland.

The atmosphere is filled with air of different temperatures. The easiest way to look at this was proposed by meteorologists of the Bergen School in 1922. Between areas with different temperatures, polar and warm fronts cause low- and high-pressure areas and the flow of air through the atmosphere. Along these fronts cyclones can form. However, the existence of a polar front has never been observed and is not a necessary condition for storm development. This way of describing the weather patterns is more qualitative than quantitative (Joly, 1995) and in many ways the vision on storm development has changed since 1922.

Another theory following the idea of a polar front is that of baroclinic zones. Instead of a discontinuous border between air of different temperatures, the density and humidity vary smoothly over the baroclinic zone (Eady, 1957). The flow of air in this area is caused by four forces on the air in the baroclinic zone (Wieringa & Rijkoort, 1983):

• the gradient force;
• the Coriolis force;
• the frictional force;
• the centrifugal force.

The first one is the result of the pressure gradient between high- and low-pressure areas. The force is perpendicular to the isobars (lines connecting the points with equal pressure) and causes, combined with the Coriolis force, the so-called geostrophic wind that is parallel to the isobars (figure 2).

Figure 2: Gradient pressure and Coriolis force are balanced and make wind flow parallel to isobars (Pidwirny, 2006). PGF: Pressure gradient force, CF: Coriolis force.
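In formula form, this balance between the pressure gradient force and the Coriolis force gives the standard textbook expression for the geostrophic wind; the equation below is added here for reference and is not taken from the thesis itself. With air density $\rho$, Coriolis parameter $f$ and pressure $p$:

\[
u_g = -\frac{1}{\rho f}\,\frac{\partial p}{\partial y},
\qquad
v_g = \frac{1}{\rho f}\,\frac{\partial p}{\partial x},
\]

so that the geostrophic wind $(u_g, v_g)$ blows parallel to the isobars, with the low pressure to its left on the Northern Hemisphere.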

On the Northern Hemisphere the Coriolis force causes the (counter-)clockwise flow around (low-) high-pressure areas (and on the Southern Hemisphere the other way around). Closer to the earth's surface the frictional force makes the geostrophic wind deflect away from the isobars and this causes the typical spiral of wind around a low-pressure area (figure 3). When the isobars are curved, the centrifugal force causes the non-geostrophic gradient wind. This last force is one of the main reasons that the gradient around depressions (low-pressure areas) is usually bigger than around high-pressure areas. This causes the more extreme weather associated with a low-pressure area.

Figure 3: The effect of the Coriolis force and friction of the earth's surface in a low-pressure system between Iceland and Greenland (NASA/GSFC, 4th of September, 2003, Denmark Strait)

In higher layers of the atmosphere the winds are mainly geostrophic, whereas in the planetary boundary layer (between 50 and 2000 m above the earth's surface, depending on location) the effect of friction is more present. When modelling the wind speeds over the BeNeLux one should therefore also take into account the surface roughness. Urban areas or, for example, forests decrease the wind speeds, whilst on flat parts of the surface, like pastures or spacious greenhouses, less friction lets wind gusts maintain their speed. On a bigger scale, at the edge of the troposphere (approximately the lowest 17 km of the earth's atmosphere) another important cause of wind is found. Strong polar jet streams arise here in the baroclinic zone. A steep pressure gradient (difference in pressure) makes air start flowing into sub-tropical air masses over the colder layers of the atmosphere. This jet stream follows the path of least resistance and is one of the phenomena associated with extra-tropical cyclones. The Coriolis force then makes this wind deflect and gives it a meandering pattern. In the relatively narrow area around the upper level of the polar jet stream the process of frontogenesis takes place. Here storms develop between the warm and the cold air.

Extra-tropical storms sometimes follow the cyclogenesis that is described above. Another way they can develop is as a transition from tropical storms. Those storms usually arise between 30 and 40 degrees latitude. At the end of their existence their prime energy source changes from condensation of water vapour to a baroclinic process. In a baroclinic process the density depends on temperature and pressure, instead of only on pressure as is the case in a barotropic process (Hart & Evans, 2000). During this process the storms are generally driven in a west-to-east direction. When the extra-tropical storm reaches the European mainland the generated wind speeds are affected by the roughness of the surface of the land. At sea there is very little that breaks the winds, but over land, terrain characteristics have a high influence on the decrease of wind speed. Mountains, but also urban areas with high buildings, decrease wind speeds, but they can create some curling of the wind as well. This may result in high peak gusts. Another example of the effect of the roughness of the surface are greenhouses. If those cover a large area, this surface is relatively flat and wind speed will not decrease much due to friction. On top of that, their structure is mostly glass. The combination of those two characteristics makes them quite vulnerable to wind.

Figure 4: Relative strength of the Icelandic Low and Azores High controls the storm tracks (Thompson Higher Education, 2007)

Besides peak gusts, several indicators are available to identify windstorms from meteorological data. For example, the mean sea level pressure is an indicator that is used to locate the eye of the storm. Another possibility is the relative vorticity. This is the curl of the wind relative to the earth's surface. Measurements of those indicators are available for points all over the earth. Therefore those indicators are useful to track a storm when it arises above the Atlantic Ocean as an (extra-)tropical cyclone.

One last important aspect of the global climate is large scale teleconnection. These are climate anomalies related over large distances. As an example the North Atlantic Oscillation (NAO) will be described. A low-pressure area between Iceland and Greenland and a high-pressure area south of the Azores control the tracks of storms that move in an eastward direction over the Atlantic Ocean (Bachman, 2007). The relative position and strength of the Icelandic Low and Azores High fluctuate over the years and are known as the North Atlantic Oscillation. This oscillation is one of the main drivers of the fluctuation of the weather in western Europe (Mailier, 2005, p. 2237 and Villarini et al., 2010, p. 2686). The oscillation has a positive and a negative phase that cause the storm tracks to follow a general direction as shown in figure 4.

1.6 Clustering

Literature investigating extra-tropical cyclones suggests that the arrival rate of storms depends on the number of storms that have already arrived. If a storm has occurred it is more likely that another storm will develop. This phenomenon is called the clustering of windstorms. Below some meteorological research is discussed that looks into clustering.

1.6.1 Literature on clustering

With the background about storm development in section 1.5 the conditions in which clusters form can be understood better. Mailier (2005) suggests possible mechanisms that cause the clusters of extra-tropical cyclones. He estimates the effect of large scale variables with a Poisson regression. One of the significant variables that helps predict areas where clustering occurs is the North Atlantic Oscillation. The polar jet stream is very important for the weather in Western Europe and is very sensitive to the changing phases of the North Atlantic Oscillation. Two approaches to identify the tracks of storms are the Lagrangian and the Eulerian viewpoint. The first focuses on individual cyclones with characteristics like intensity, growth and velocity that change over the lifetime of the cyclone. The second describes storms as the result of a swirl of wind created by obstacles called eddies. These obstacles can be static, like islands, or transient, like an instability in pressure. Figure 5 shows this effect of (static) eddies around the Canary Islands. When the eddy disturbs the flow of the wind it can lead to storms or even a large jet stream. The Eulerian approach is less useful than the Lagrangian approach for studying the individual characteristics of cyclones.

Figure 5: Downwind of eddies the wind is curling in turbulent patterns (NASA/GSFC, 5th of July, 2002 at the Canary Islands)

Pinto et al. (2014) suggest that when a favourable eddy-driven jet lasts for more than one week, those conditions are optimal for the development of storm clusters. The location to look for storms developing as a result of an earlier storm is slightly south of the primary cyclogenesis (Rivals et al., 1998). To investigate clustering, first the question of defining a cluster should be addressed. Windstorms are produced under favourable conditions over the Atlantic Ocean and move during a period of days to weeks over the ocean before reaching the European mainland. There are two theories about the dependent arrival of storms in clusters. First, it is possible that storms develop more separated in time, but arrive closer to each other in Europe. This can be due to variations in large scale patterns like the NAO that affect the individual storm track (Mailier et al., 2005). The thesis of Mailier (2005, p. 92) gives a theoretical framework for the development of storms and describes the two ways that clusters can form in a comprehensive way. The comparison is made with buses in traffic. Buses depart from a bus station according to a schedule with regular time intervals between them. But due to traffic jams they tend to arrive close behind each other. The traffic jams are an analogy for the large scale flow patterns like jet streams. An illustration of some such storms is given in section 5.2.2. The other explanation for series of storms is known under the name secondary cyclogenesis (Rivals et al., 1998). Some regions are favourable for storm development because of the large scale conditions at these locations. Behind a cyclone its offspring can develop. These new storms have the same characteristics as the primary storm but are in an earlier stage of storm development (see section 5.2.1 for some examples of this phenomenon).

Other examples that show the recent interest in clustering are studies of storm clustering and the effect of climate change (Pinto et al., 2007 and 2013), clustering in Germany (Karremann et al., 2014) and the use of dependent distributions to model return rates (Cannelle, 2011). Karremann et al. (2014) estimate the effect of clustering for winter storms over Germany. They use weather data from the Deutscher Wetterdienst and 4000 years of simulated data from a GCM and apply the method of Klawa and Ulbrich (2003) to identify windstorms. This windstorm model looks at the return rate of storms per winter and is described in more detail in the method chapter 3. Karremann et al. conclude that there is clearly dependence in the inter-arrival rate. Nevertheless, the modelling with Poisson or Negative Binomial distributions overestimates the return period of storm series. They calculate the variation of the number of storms compared to the mean of the number of storms predicted by the model. This ratio is called overdispersion and it tends to decrease for more rare events in their study. This is contrary to what Pinto et al. (2013, p. 12479) and Vitolo and Stephenson (2009, p. 6) found. They say that there is a positive relation between clustering and more intensive events.
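A minimal sketch of how such an overdispersion measure can be computed from storm counts per winter; the counts below are invented, and the statistic shown is the variance-to-mean ratio that the cited studies compare to 1 (the value implied by a Poisson, i.e. independent, arrival process).

```r
# Hypothetical number of severe storms per extended winter (Oct-Mar).
storms_per_winter <- c(3, 1, 0, 4, 2, 2, 6, 1, 0, 3, 5, 2)

# Variance-to-mean ratio of the counts:
#  == 1 : consistent with independent (Poisson) arrivals
#   > 1 : overdispersion, i.e. clustered arrivals
#   < 1 : underdispersion, i.e. more regular arrivals
dispersion <- var(storms_per_winter) / mean(storms_per_winter)
dispersion
```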

Mailier (2005) uses a more sophisticated storm track model to look for overdispersion. At the US west coast where storms develop he calculates underdispersion. However at the east coast of the United Kingdom overdispersion tends to be bigger for storms with stronger vorticity (an indicator of intensity).

Pinto et al. (2007) use the simulations of a GCM with different sets of input parameters for future climate scenarios to look at the effect on storm loss potentials. They use the model of Klawa and Ulbrich (2003) to identify the most severe storms. For Germany, the United Kingdom and France the variability of potential losses may be expected to increase in the coming years, but their predictions exclude the effect of social adaptation to climate change. As the weather gets more severe over the coming decades it is likely that people will adapt to the new conditions and that buildings and other vulnerable insured values will be able to withstand these weather conditions. Another study using reanalysis datasets and simulations of GCMs followed in 2013 (Pinto et al.), in which the effect of the intensity of storms on clustering is quantified. They model cyclone clustering from the cyclone tracks and find that clustering is mainly found on the flank and downstream area under the dominant North Atlantic storm track. The results from simulation under future climate conditions suggest, although not statistically significant, that clustering decreases over the dominant storm track area and parts of Western Europe, while on the Atlantic Ocean south of Newfoundland clustering increases.

1.6.2 Time frame of clustering

To identify the number of clusters, assumptions need to be made about the time frame in which storms need to occur to be in the same cluster. Keeping in mind the method that will be used to identify storms, it makes sense to start with the assumptions of Karremann et al. (2014, p. 2043). They use a three day window to identify storms. This time frame of three days is justified because for insurance 72 hours is often used as a maximum duration of one event. However, notice that within these three days two separate storms can arrive. The wind gusts after the first storm need to be lower than on the first day before increasing again on the last of the three days to separate those storms. This also led to the assumption made here requiring the wind to decrease only one day between the two peak days. This assumption is based on the fact that storms cannot be too close to each other because of their size, which is 1000-2000 km in diameter (Mailier, 2005, p. 31). Comparing the list of storms found by Karremann et al. (p. 2046) with storms reported by insurers shows that the method of Klawa and Ulbrich is a good way to identify storms. Cannelle (2011, p. 47) assumes - after consulting meteorology experts from Munich Re on their opinion about independent storms - that when storms arrive within three to five days they are not independent. Vitolo et al. (2009) studied the dependence of storms on intensity, the length of the period over which the cyclones are counted and the width of the area over which the storms are counted. They find a positive dependence between clustering (characterised as dispersion) and the intensity of extra-tropical storms. For the time frame in which they investigate clustering, they look from a minimum of four days up to three consecutive months. The dispersion factor increases quasi-linearly with the logarithm of the time frame.

Figure 6: Daily mean wind speed (m/s) at IJmuiden, 20-26 January 1993

To illustrate the assumptions above, two examples follow. First, the wind speeds are shown for a weather station on the Dutch coast in figure 6. If a window of five days is chosen, then three separate peaks occurred in 1993 between January 20 and 24 (on the 20th, 22nd and 24th). If the requirement is set that the wind has to decrease for two days, one sees just two storms, on the 20th and the 24th of this period.

Another example of a suggestion for bigger time windows to define clusters is December 2013. The daily mean wind speeds for another Dutch weather station (Vlissingen⁴) show several peaks in figure 7. Looking at the paths of the extrema of the vorticity measures on these days (indicating the track of a cyclone) in figure 8, one can see three consecutive storms developing on 22, 26 and 29 December above the Atlantic Ocean. The tracks approach Western Europe, but do not necessarily reach the BeNeLux. The storms developing in the area between Iceland and Greenland arrive in the same period through the channel into the North Sea. The combination of those storms hits the BeNeLux during this period.

The examples above show the complexity of the choices around clustering and the necessary assumptions. Even though the origin of storms is not always the same, they affect each other in their tracks and the independence of arrivals needs to be investigated.


Figure 7: Daily mean wind speed (m/s) at Vlissingen, 21-31 December 2013

This background chapter gives information about insurance and catastrophe models for storm risk. This information is useful to understand the importance of the modelling of the dependency of storms. Also, the time frames in which clusters will be investigated are discussed. This comes back in sections 4.3 and 5.

Figure 8: Illustration of storm tracks in the last week of 2013 (UQAM-Montreal Weather Centre). The numbers at the starting points of the tracks are the day and time of the first identification of the tracks. The coloured numbers denote the relative intensity. Every segment of a track corresponds to a six hour time interval.


2 Data

This chapter describes the two sources of data that are used in the storm model of section 3. The daily wind gust data from Dutch and Belgian weather stations are used for a windstorm model. In addition, a sample of the reanalysis data from the General Circulation Model IFS Cy31R2 is used to compare the results of the BeNeLux data and to track storm paths to Europe. Below, the quality and necessary adjustments of the data are discussed.

The Royal Dutch Meteorological Institute (KNMI) offers weather data on its website for 149 stations, starting January 1st, 1950, when the first stations started delivering weather data. Since then stations have been added and 61 of them measure wind gusts on an hourly basis. To be able to compare storm data it is necessary to have a set of stations that delivered wind data for the whole period of interest. Since 1950 the number of stations that delivered data about wind speeds increased slowly. In the winter of 1971/1972 sixteen stations were in use. For the first identification of clusters those stations are used and supplemented with six Belgian weather stations of the KMI (Royal Meteorological Institute) that also deliver data since 1971. The Belgian weather stations sometimes showed strange outliers (wind gusts from 50 to over 90 m/s). If this was the case for one station while the other stations did not measure a high wind speed, it is assumed that the data point is a measurement error and not a very local wind storm. Stations that were moved or had a change in the way they measure wind gusts are left out of this research. Stepek and Wijnant (2011) use the KNMI data to interpolate the wind speeds to a higher resolution to predict the wind speeds between stations; they are therefore very strict in checking the stations for heterogeneity in measurements. They use only 25 stations, which are also used here. For this thesis the number is narrowed down even further to get a homogeneous dataset consisting of as many years as possible.

Figure 9: Weather stations in the Netherlands and Belgium. The red dots indicate the 22 stations used in this thesis.


Figure 10: Resolution of ERA-Interim grid points.

As discussed in the meteorological background, extra-tropical cyclones develop only in the winter months. The small number of storms that occurs during summer will be neglected since these will mostly be small scale events like thunderstorms (Kasperski, 2002). Therefore the dataset is reduced to the extended winter periods from October to March for each of the years since October 1971. Figure 9 shows the spatial distribution of the stations over the Netherlands and Belgium. The black dots are all the stations in use in 2015. The red dots are the sixteen Dutch stations and the six stations in Belgium used in this thesis. The list of the stations with corresponding longitude and latitude can be found in the appendix.

Besides the data from local weather stations, data from a GCM is studied as well. Commonly used is the ERA-Interim dataset from the European Centre for Medium-Range Weather Forecasts (ECMWF). This dataset is a result of the Integrated Forecast System IFS Cy31R2 (Dee et al., 2011) that solves, for every time step and grid point of the model, the Euler equations of motion. In this way the state of the atmosphere is described and the weather corresponding to this state is derived. The calibration of this forecast is done with virtually all relevant atmospheric data available in 2002. Uppala et al. (2005) describe this data thoroughly in their paper on the configuration and performance of this model. Examples of input are data from satellites, weather balloons, buoys and land stations that deliver radiance, surface pressure, ozone concentration, wind speed, wave height data and much more. These data are reanalysed and output of the weather variables since 1979 is produced and updated with new data on an ongoing basis.

For this thesis the maximum number of winters (1979-2014) available in the ERA-Interim data is taken into account and the results are compared to the raw data from Dutch and Belgian weather stations since 1971. Instead of measurements from stations, the ERA-Interim data consists of grid points with a resolution of 0.125 degrees. One degree of latitude corresponds to roughly 111 km (north-south direction). The distance between longitude coordinates depends on the latitude and is roughly 70 km at the latitude of the BeNeLux. This means that the grid points in figure 10 are quite finely meshed (respectively about fourteen and nine kilometres between the points). To cover the BeNeLux only points between 48.5°N, 2.5°E and 53.5°N, 7.5°E are used.
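These grid spacings can be reproduced with a small back-of-the-envelope calculation, assuming roughly 111 km per degree of latitude and scaling the east-west distance with the cosine of the latitude:

```r
res_deg <- 0.125   # ERA-Interim grid resolution used here, in degrees
lat     <- 51      # representative latitude of the BeNeLux, degrees north

north_south_km <- res_deg * 111                       # ~13.9 km between grid latitudes
east_west_km   <- res_deg * 111 * cos(lat * pi / 180) # ~8.7 km between grid longitudes
c(north_south_km, east_west_km)
```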

This chapter described the steps applied to the data used for the storm model in this thesis. First, the daily maximum wind gusts of the KNMI (available from https://data.knmi.nl/) and the KMI are obtained. The number of stations is narrowed down to 22 (16 + 6). The Belgian data are corrected for measurement errors, i.e. cases where wind speeds were extremely high for just one station. Finally, only the data between the 1st of October and the 31st of March are used for each year between 1971 and 2015.
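A sketch of these preparation steps in R, assuming the daily gust observations have already been read into a data frame `gusts` with columns `station`, `date` (class Date) and `gust` (m/s), and that `selected_stations` holds the 22 retained station names; the column names and the outlier thresholds are assumptions for the illustration, not the exact values used in the thesis.

```r
# Keep only the selected stations and the extended winters Oct 1971 - Mar 2015.
winter_months <- c(10, 11, 12, 1, 2, 3)
keep <- gusts$station %in% selected_stations &
        as.integer(format(gusts$date, "%m")) %in% winter_months &
        gusts$date >= as.Date("1971-10-01") &
        gusts$date <= as.Date("2015-03-31")
winters <- gusts[keep, ]

# Flag suspect outliers: an extreme gust at a single station while the second
# highest gust measured on the same day stays moderate (thresholds assumed;
# days with only one reporting station are kept as-is).
second_highest <- ave(winters$gust, winters$date, FUN = function(v) {
  if (length(v) > 1) sort(v, decreasing = TRUE)[2] else v
})
winters <- winters[!(winters$gust > 50 & second_highest < 30), ]
```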


3 Method

A windstorm model is built for this thesis based on the KNMI/KMI data, and with the derived list of windstorms the occurrence of clusters is investigated. The effect of clustering is investigated by first identifying the storms per winter. Ways to do this range from tracking windstorms from their genesis to simply looking at measured wind speeds on the European mainland. The simplest approach, deriving the number of storms per winter from the daily wind gusts in local weather data, is described in this chapter. After describing the storm model, the distributions that will be estimated are introduced. The homogeneous Poisson process is used as a starting point. To model dependency between storms, parameters for the Negative Binomial distribution are estimated. As a third class of models, the Poisson-Binomial distribution is explained in subsection 3.3.4. The methods to compare these models are discussed in the last subsection, 3.3.5.

3.1 Windstorm model

The daily maximum wind gusts v measured at ten metres above the ground are used to build a windstorm model. All stations in Belgium and the Netherlands deliver this data. A prediction for this variable is also an output of the GCM that is used. To model univariate extremes like wind gusts, Gilleland and Ribatet (2015, p. 258) suggest two approaches. The first is analysing block maxima, by characterising the distribution of the extremes over a large period. This could be done by finding a distribution for the storms in a winter. The second, more appropriate in this context, is looking at exceedances of the extremes above a suitable threshold. This last approach is used here. For each station (grid point) the 98% quantile of the daily average wind speeds is calculated to determine the wind speeds that are above this calculated threshold. Weather stations near the coast obviously measure higher wind gusts when storms arrive over sea. In contrast, land stations surrounded by buildings or forest measure on average lower wind speeds. To identify usable wind gusts, the local characteristics of stations are accounted for by taking the gusts above the 98% quantile. This quantile gives the minimum wind speed that is expected to produce any loss (Klawa and Ulbrich, 2003, p. 728). Taking 95% or 99.5% as a threshold would lead to an over- or underestimation, respectively, of the number of storms that actually result in storm damage. The next step, to make the wind gusts above this quantile comparable, is normalisation of the wind speeds of all stations by their quantiles according to the method of Klawa and Ulbrich (2003). The loss caused by wind is proportional to the cubed excess of wind above the 98% threshold.

Loss ∝  v v98% − 1 3 (1) Most other studies into storm losses assume that the losses resulting from a storm are proportional to the cubed value of the wind speed. Therefore the relation in formula (1) is a good way to reflect the difference in severity of storms. Justification for the choice of the 3rd power relation is both theoretical and empirical. From laws of physics the kinetic energy of moving air is proportional to the cubed wind speed7. Besides the theoretical reasoning, Klawa and Ulbrich summarise empirical findings of the relation between wind speed and loss. Empirical findings lie between v2.7and v5 so here we chose for v3.

Following the method of Karremann et al. (2014) to identify storms, a meteorological

^7 The mass of air per second that hits a building follows from basic physics: mass/sec = wind speed · surface of the building · density of air. The kinetic energy of this air per second is 1/2 · mass/sec · (wind speed)^2 = 1/2 · surface · density · (wind speed)^3; therefore the loss is expected to be proportional to v^3.


index (MI) is calculated for each day. The index is the sum over all stations of the normalised cubed excess wind speeds, whenever these are above zero. The indicator function in the following formula ensures that only values above the 98% quantile are taken into account^8.

MI = \sum_{\text{stations}} \left( \frac{v}{v_{98\%}} - 1 \right)^3 \cdot I_{\{v > v_{98\%}\}} \quad (2)

From this list of MI values per date, the storms can be ranked and a return period can be computed. For the 44 years of data (October 1971 till March 2015) the return period is calculated as follows: the most severe storm is expected to return once every 44 years, a storm at least as severe as the second-ranked storm once every 22 years, et cetera.
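
The calculation of the station thresholds, the daily meteorological index of formula (2) and the empirical return periods can be sketched in R as follows. This is a minimal illustration under assumptions: a data frame `gusts` with columns `date`, `station` and `v` (daily maximum gust) is assumed, and all object and column names are chosen for this example rather than taken from the thesis.

```r
# Minimal sketch (assumed input): daily maximum gusts per station
# gusts: data.frame with columns date, station, v

# 98% quantile per station, used as the local damage threshold
thresholds <- tapply(gusts$v, gusts$station, quantile, probs = 0.98)

# Normalised cubed excess per observation, zero below the threshold (formula (1))
gusts$excess <- pmax(gusts$v / thresholds[as.character(gusts$station)] - 1, 0)^3

# Meteorological index MI per day: sum over all stations (formula (2))
mi <- tapply(gusts$excess, gusts$date, sum)

# Rank the days by MI and compute empirical return periods:
# the most severe day returns once per 44 winters, the second once per 22, ...
n_years <- 44
mi_sorted <- sort(mi[mi > 0], decreasing = TRUE)
return_period <- n_years / seq_along(mi_sorted)
head(data.frame(date = names(mi_sorted), MI = mi_sorted, return_period))
```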

3.2 Identifying clusters

From the resulting list of indices (MI) per date the clusters can be found. For each winter a window of chosen size n (in days) is moved over the list to identify consecutive storms within this time frame. If the meteorological index for a given day is a local maximum, meaning the wind speed lies above the 98% quantile and is higher than on both the preceding and the following day, this day is denoted as an event. From that day onwards the following n days are checked for further local maxima as defined above. This procedure gives a list of clusters with a certain number of events per cluster.

The first part of the identification of clustering is checking the storm data for independent arrivals of storms (no clustering), where the wind decreases for at least one day between two storms. Clustering may occur with two storms in a time frame of three days, but this might not have an impact for the insurer, since contracts usually define losses within 72 hours as losses caused by the same event. Discussion with an insurer led to the choice of three time frames that seem reasonable and practical, so windows of seven, fourteen and thirty days are used to investigate clusters. In the literature, Vitolo et al. (2009, p. 6) look at weekly, monthly and three-monthly time frames, but as the local weather variables that are favourable for storm development stay the same from one day up to a week (Malier, 2005, p. 11), a maximum time frame of thirty days is chosen here. Investigating larger time frames requires a better understanding of the large-scale atmospheric teleconnections. It should be noted that with a thirty-day time frame it is not possible to say at what exact moment the conditions above the ocean start to be favourable for storms and when they end: a thirty-day period can have favourable storm conditions, which change and then turn favourable again. A cluster under the definition used here starts with the first severe storm of a winter and lasts at most the number of days of the window used. After those days a new cluster can arrive. This allows for the possibility that a storm that is not part of a cluster starts the time frame, no storm arrives in the following days, and at the end of the frame an actual cluster is split over two periods. This shows that the assumption of the window size is strong and not very flexible, unless manually checked and corrected for every period. However, for smaller window sizes it captures clusters quite well (see section 4.3).
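
The window procedure described above can be sketched as follows, again as an illustration under assumptions rather than the exact implementation used in the thesis. Here `mi_winter` is assumed to be the vector of daily MI values for one winter (ordered by date), an event is a day whose MI is positive and a local maximum, and clusters are formed greedily from the first event onwards with a window of `n_days` days.

```r
# Minimal sketch (assumptions): mi is a vector of daily MI values for one winter,
# ordered by date; n_days is the chosen window size (7, 14 or 30).
find_clusters <- function(mi, n_days) {
  # an event: MI above zero and higher than on the day before and the day after
  is_event <- mi > 0 &
    mi > c(-Inf, head(mi, -1)) &   # higher than the previous day
    mi > c(tail(mi, -1), -Inf)     # higher than the next day
  event_days <- which(is_event)

  clusters <- list()
  while (length(event_days) > 0) {
    start <- event_days[1]
    # all events within n_days of the first event form one cluster
    in_window <- event_days[event_days < start + n_days]
    clusters[[length(clusters) + 1]] <- in_window
    event_days <- setdiff(event_days, in_window)
  }
  clusters  # list of clusters, each a vector of event-day indices
}

sizes <- lengths(find_clusters(mi_winter, n_days = 7))  # number of storms per cluster
```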

3.3 Stochastic model fitting

When assuming that clustering does not exist (or has no significant impact), the arrival of storms can be modelled with a stochastic Poisson process. Since the literature suggests that clustering does occur, the variance of the number of storm arrivals is expected to be larger than the variance predicted by the Poisson model. To model this overdispersion, the Negative Binomial and the Poisson-Binomial distributions are also fitted in the next sections.

^8 Karremann et al. (2014) add the MI of the days before and after the event date to the MI of this peak day. Here it is chosen not to do this, as those wind gusts do not necessarily belong to the storm on the event day. MI values can therefore not be compared directly.


Figure 11: The pattern of storm arrivals in the winter of 1990 (a) compared to a regular point process (b).


3.3.1 Homogeneous Poisson process

The arrival of storms can be modelled with a point process, which describes the arrival times of events, the times between events, or the number of events in an interval. In this thesis the number of events N(t) is modelled. The nature of the arrival of storms suggests the use of a Poisson process, with the following (commonly used) definition (Kaas et al., 2008). Given the history N(s), 0 ≤ s ≤ t, the probabilities for the increment of the number of events are

Pr[N(t + dt) - N(t) = 0 \mid N(s), 0 \le s \le t] = e^{-\lambda\, dt} \approx 1 - \lambda\, dt
Pr[N(t + dt) - N(t) = 1 \mid N(s), 0 \le s \le t] = e^{-\lambda\, dt}\, \lambda\, dt \approx \lambda\, dt \quad (3)
Pr[N(t + dt) - N(t) \ge 2 \mid N(s), 0 \le s \le t] \approx 0

Assuming that not more than one storm can arrive at any single instant is very reasonable from a meteorological perspective. Under the assumptions above, the number of storms per unit of time N(t) is Poisson distributed with intensity rate λ.

When the occurrence of storms is modelled with a homogeneous Poisson process, the number of events is assumed to be completely serially independent. The parameter λ of the Poisson distribution specifies the distribution of the number of storms per unit of time. The Poisson parameter λ̂ estimated with the maximum likelihood method or the method of moments is simply the average number of storms per winter. Figure 11 shows the pattern of storms in the winter of 1990 (a) compared to a completely regular point process with no dispersion (b), which illustrates the dependent arrivals. Storms are denoted with a black dot at a certain date. Under a regular point process the dots have an equal interval between them. When there is dependence in the arrival rate, the chance of another storm increases after a storm has occurred; this happens in the empirical data, where groups of dots are clearly present. After estimating the rate parameter for the Poisson process it can be checked whether there is overdispersion and whether an additional parameter is required to fit the distribution. If this is the case, estimating the parameters of a Negative Binomial distribution is suitable. The measure often used for dispersion is φ = Var[N]/E[N] (used for example by Karremann et al., 2014, Vitolo et al., 2012, and Malier, 2005).
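
A minimal sketch of this first step is given below, assuming a vector `storms_per_winter` with the 44 annual storm counts (a hypothetical name for the counts derived from the storm model above): the Poisson rate is the sample mean and the dispersion φ is the ratio of the sample variance to the sample mean.

```r
# Assumed input: storms_per_winter, integer vector with one count per winter (44 values)
lambda_hat <- mean(storms_per_winter)            # Poisson rate (ML = method-of-moments estimate)
phi_hat <- var(storms_per_winter) / lambda_hat   # dispersion phi = Var[N] / E[N]
phi_hat  # values clearly above 1 indicate overdispersion relative to the Poisson model
```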

3.3.2 Negative Binomial model

To model the dependence between the arrival of a storm and the history of arrivals, the Negative Binomial distribution can be used for the annual storm distribution. The probability of n storms is given by the following formula, in the notation of Hastings and Peacock (1975)^9:

Pr[N = n] = \binom{n + r - 1}{n} (1 - p)^n\, p^r \quad (4)

^9 A comprehensive article about the maximum likelihood estimation and the notation of the Negative Binomial distribution is Piegorsch (1990), who describes, among other things, the different notations for the Negative Binomial distribution.


Here the method of moments is the most straightforward way to estimate r and p. The empirical mean m_1 and variance m_2 of the data should be set equal to their theoretical counterparts:

E[N] = r\, \frac{1 - p}{p}
Var[N] = E[N^2] - E[N]^2 = r\, \frac{1 - p}{p^2} = \frac{E[N]}{p} \quad (5)

The two parameters can be estimated jointly using the following equations:

\hat{p} = \frac{m_1}{m_2}, \qquad \hat{r} = m_1\, \frac{\hat{p}}{1 - \hat{p}} \quad (6)

The overdispersion in the Negative Binomial model can be written as φ = m_2 / m_1 = 1 / \hat{p}.

The maximum likelihood estimator can be calculated numerically with, for example, the R package MASS. This package uses the BFGS algorithm (developed by Broyden, Fletcher, Goldfarb and Shanno in 1970). This is a quasi-Newton method: like a Newton method it iteratively improves the approximation of the optimum, but it replaces the matrix of second derivatives by an approximation that is easier to compute (Broyden, 1968). This increases the speed of the computation. A usual problem with numerically computed derivatives is the trade-off between computational speed and precision: errors may accumulate in each step of the division algorithm that is needed to compute the derivatives (Goldberg, 1991, p. 41). The BFGS algorithm avoids this problem by updating its approximation at each step of the iteration.
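
As an illustration, the method-of-moments and maximum likelihood fits of the Negative Binomial distribution could be obtained as follows. The vector `storms_per_winter` is the same assumed input as above; `MASS::fitdistr` is used here as one possible implementation of the numerical likelihood maximisation, as the thesis does not specify the exact function.

```r
library(MASS)

# Method of moments (formula (6)), with m1 the sample mean and m2 the sample variance
m1 <- mean(storms_per_winter)
m2 <- var(storms_per_winter)
p_hat <- m1 / m2
r_hat <- m1 * p_hat / (1 - p_hat)

# Maximum likelihood fit of the Negative Binomial distribution
# (fitdistr parameterises the distribution by size and mu)
fit_nb <- fitdistr(storms_per_winter, densfun = "negative binomial")
fit_nb$estimate  # estimated size and mu
```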

3.3.3 Zero-adjusted models

The models described above appear to underestimate the probability of zero storms. To improve the estimation, four other distributions are fitted: first the zero-inflated Poisson and Negative Binomial distributions are described, and thereafter the zero-adjusted Poisson and Negative Binomial distributions.

The zero-inflated Poisson (ZIP) model is a combination of a binary distribution that generates zeros and a Poisson distribution that produces the storms (Yee, 2008). The probability distribution function is:

Pr[N = 0] = \pi + (1 - \pi)\, e^{-\lambda}
Pr[N = n] = (1 - \pi)\, \frac{\lambda^n e^{-\lambda}}{n!} \quad \text{for } n = 1, 2, 3, \ldots \quad (7)

The mean and variance of this distribution are (1 − π)λ and λ(1 − π)(1 + λπ), respectively. Note that the Poisson distribution has just one parameter for both the mean and the variance; with this two-parameter model the dispersion can be modelled more precisely. The estimated λ is expected to be close to the λ of the Poisson distribution, and the more likely zero storms are in the observations, the higher the expected estimate of π. The method of moments estimators for the ZIP, again with m_1 and m_2 the empirical mean and variance, are:

\hat{\lambda} = \frac{m_2 + m_1^2 - m_1}{m_1}, \qquad \hat{\pi} = \frac{m_2 - m_1}{m_2 + m_1^2 - m_1}
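
A small sketch of these moment estimators, under the same assumed `storms_per_winter` input:

```r
# Method-of-moments estimators for the zero-inflated Poisson model
m1 <- mean(storms_per_winter)
m2 <- var(storms_per_winter)
lambda_zip <- (m2 + m1^2 - m1) / m1
pi_zip     <- (m2 - m1) / (m2 + m1^2 - m1)
c(lambda = lambda_zip, pi = pi_zip)
```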

The Negative Binomial distribution can also be zero-inflated. Compare the Anscombe (1950) notation, also mentioned in Piegorsch (1990):

Pr[N = n] = \frac{\Gamma(\theta + n)}{\Gamma(\theta)\, n!} \left( \frac{\mu}{\mu + \theta} \right)^n \left( 1 + \frac{\mu}{\theta} \right)^{-\theta} \quad (8)


which is adjusted with a probability π for excess zeros:

Pr[N = 0] = \pi + (1 - \pi) \left( 1 + \frac{\mu}{\theta} \right)^{-\theta}
Pr[N = n] = (1 - \pi)\, \frac{\Gamma(\theta + n)}{\Gamma(\theta)\, n!} \left( \frac{\mu}{\mu + \theta} \right)^n \left( 1 + \frac{\mu}{\theta} \right)^{-\theta} \quad \text{for } n = 1, 2, 3, \ldots \quad (9)

A problem with the zero-inflated models is that the probability of N = 0 is always equal to or larger than the probability under the simple Poisson or Negative Binomial distribution. A possible improvement that keeps the same number of parameters but also allows for a smaller probability of zero is found in zero-adjusted models (Rigby et al., 2014). These estimate the probability of zero counts completely independently from the nominal probability. To obtain this, the Poisson and Negative Binomial distributions are zero-truncated^10.

For the zero-adjusted Poisson, let N = 0 with probability π and let N follow a zero-truncated Poisson(λ) distribution with probability (1 − π):

Pr[N = 0] = \pi
Pr[N = n] = (1 - \pi)\, \frac{\lambda^n e^{-\lambda}}{n!\,(1 - e^{-\lambda})} \quad \text{for } n = 1, 2, 3, \ldots \quad (10)

And similarly for the zero-adjusted Negative Binomial distribution:

Pr[N = 0] = \pi
Pr[N = n] = (1 - \pi)\, \frac{\Gamma(\theta + n)}{\Gamma(\theta)\, n!} \left( \frac{\mu}{\mu + \theta} \right)^n \Big/ \left[ \left( \frac{\mu + \theta}{\theta} \right)^{\theta} - 1 \right] \quad \text{for } n = 1, 2, 3, \ldots \quad (11)

The maximum likelihood estimator is found with the help of the gamlss R package, which gives the estimated mean, size and zero probability for the fitted zero-inflated Negative Binomial distribution.
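
A sketch of such fits is shown below. It assumes the same `storms_per_winter` vector and uses the family names ZIP, ZINBI, ZAP and ZANBI as commonly provided by the gamlss.dist package; this is an illustration of the approach, not necessarily the exact call used in the thesis.

```r
library(gamlss)

# Intercept-only fits of the zero-inflated and zero-adjusted count models
d <- data.frame(n = storms_per_winter)
fit_zip  <- gamlss(n ~ 1, family = ZIP(),   data = d)  # zero-inflated Poisson
fit_zinb <- gamlss(n ~ 1, family = ZINBI(), data = d)  # zero-inflated Negative Binomial
fit_zap  <- gamlss(n ~ 1, family = ZAP(),   data = d)  # zero-adjusted Poisson
fit_zanb <- gamlss(n ~ 1, family = ZANBI(), data = d)  # zero-adjusted Negative Binomial

# Back-transform the fitted parameters of, e.g., the zero-adjusted Negative Binomial
exp(coef(fit_zanb, what = "mu"))     # fitted mean of the truncated part (log link)
plogis(coef(fit_zanb, what = "nu"))  # fitted probability of zero storms (logit link)
```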

3.3.4 Poisson-Binomial model

An alternative way of looking at the storms is to see the process of cluster arrivals as a point process. The same definition of the homogeneous Poisson process as above is used, but now simply for the number of clusters per year. For all window sizes the number of observed clusters is given in table 5 on page 28. In this subsection the derived empirical numbers of clusters and cluster sizes are combined to model the dependency between storms.

Let N denote the number of storms in a given winter, and let Y denote the number of clusters. Suppose cluster y has C_y storms, for y = 1, \ldots, Y. Note that an arriving cluster can also contain zero storms; such a cluster will not be observed. Thus, observed clusters are truncated above zero according to C' = C \mid C > 0 to be included in the data. The number of storms for each winter can then be written as:

N = \sum_{y=1}^{Y} C'_y \quad (12)

Here N is the number of storms and follows a compound Poisson-Binomial model: Y is the number of clusters and has a Poisson(λ) distribution, and C_y is the number of storms

^10 The zero-adjusted model is equivalent to the hurdle-at-zero model, which is a special case of the general hurdle models. This terminology was introduced by Cragg (1971, p. 831).


inside cluster y. The C_y follow a Binomial(m, p) distribution, are mutually independent and independent of Y, and have a maximum of m storms, where m is assumed to be known. The expectation and variance of N are, respectively:

E[N] = E[E[N \mid Y]] \qquad \text{(tower rule)}
     = E[Y\, E[C_y]]
     = E[Y]\, E[C_y] \qquad \text{(independence of } C_y \text{ and } Y\text{)}
     = m \lambda p

Var[N] = E[Var[N \mid Y]] + Var[E[N \mid Y]]
       = E[Y\, Var[C]] + Var[Y\, E[C]]
       = E[Y]\, Var[C] + Var[Y]\, E[C]^2
       = m \lambda p (1 - p) + \lambda (m p)^2
       = m \lambda p\, (1 - p + m p) \quad (13)

The dispersion can be written as Var[N]/E[N] = 1 − p + mp. Setting this expression equal to the empirical dispersion s^2 / \bar{N} leads straight to the method of moments estimators

\hat{p} = \frac{s^2 / \bar{N} - 1}{m - 1}, \qquad \hat{\lambda} = \frac{\bar{N}}{m \hat{p}} \quad (14)

where \bar{N} is the mean number of storms in the empirical data and s^2 the variance.

3.3.5 Goodness of fit

The evaluation of the models above is done with the chi-square goodness-of-fit test, which is suitable for count data such as the number of storms or clusters. The residuals of the prediction are assumed to be independent and normally distributed with zero mean. Under this assumption it is possible to test whether the observed frequencies come from a completely specified distribution X ∼ F(x).

The sample space is divided into k cells. The null hypothesis is that the expected number of storms or clusters in the r-th cell is e_r = n p_r, where n is the number of years in the dataset (44) and p_r is defined as P(X \in \text{cell } r). The test statistic is

\chi^2 = \sum_{r=1}^{k} \frac{(\mathrm{Obs}_r - e_r)^2}{e_r} \quad (15)

where Obs_r denotes the observed number of storms or clusters in cell r. The corresponding p-value is then P[Z_{k-1} > \chi^2], where Z_{k-1} denotes a chi-square random variable with k − 1 degrees of freedom. This is the probability that any deviation from the expected distribution is due to chance, assuming that the null hypothesis holds; therefore a higher p-value indicates a better fit of the model. To ensure that the chi-square approximation is fairly accurate, the degrees of freedom should be as high as possible under the condition that the expected number e_r ≥ 5. Therefore the values in the tails of the estimated distribution are added together to make sure the expected counts are at least five. Given the relatively small sample size (44 years), the zero inflation that is modelled is not very well reflected in the test statistic: combining the probabilities of zero and one storm ignores the special shape of the zero-inflated models. A visual representation in those cases shows whether the fit improves compared to the usual Poisson and Negative Binomial distributions.
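
A sketch of this procedure for one fitted model is given below, assuming a vector `probs` of fitted probabilities for 0, 1, 2, … storms and the observed counts per winter; the right-tail cells are merged until every expected count there is at least five (the head of the distribution would be handled analogously if needed).

```r
# Assumed inputs: storms_per_winter (observed counts) and probs, the fitted
# probabilities for 0, 1, 2, ... storms under the model being tested.
n <- length(storms_per_winter)  # 44 winters
obs <- tabulate(storms_per_winter + 1, nbins = length(probs))
expected <- n * probs

# Merge cells from the right tail until the last expected count is at least 5
while (length(expected) > 2 && tail(expected, 1) < 5) {
  k <- length(expected)
  expected[k - 1] <- expected[k - 1] + expected[k]; expected <- expected[-k]
  obs[k - 1] <- obs[k - 1] + obs[k]; obs <- obs[-k]
}

chi2 <- sum((obs - expected)^2 / expected)  # test statistic (formula (15))
p_value <- pchisq(chi2, df = length(obs) - 1, lower.tail = FALSE)
c(chi2 = chi2, p_value = p_value)
```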


Another possibility is to compare the Akaike Information Criterion (AIC), which penalises more complex models. It is possible to compute the AIC from the χ^2 test statistic. The AIC is defined as 2K − 2 ln(L), where K is a penalty that depends on the sample size n and the number N_p of estimated parameters, and L denotes the maximised likelihood of the model. The log-likelihood can be related to the χ^2 test statistic by the approximation below:

\ln(L) = \ln\!\left( \prod_{i=1}^{I} \left( \frac{1}{2 \pi \hat{\sigma}^2} \right)^{1/2} \right) - \frac{1}{2} \sum_{r=1}^{k} \frac{(\mathrm{Obs}_r - e_r)^2}{e_r} \quad (16)

\ln(L) = \text{Constant} - \frac{1}{2}\, \chi^2 \quad (17)

Here \hat{\sigma}^2 is the sample variance, which makes the constant independent of the model used; it can therefore be ignored while comparing models. So the AIC based on the χ^2 test statistic is

2K - 2\ln(L) = \frac{2 N_p (N_p + 1)}{n - N_p - 1} - 2 \cdot \text{Constant} + \chi^2 \quad (18)
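
As a final sketch under the same assumptions, the χ²-based AIC of formula (18), with the model-independent constant dropped, could be compared across models as follows; the `chi2` values passed in below are purely illustrative.

```r
# chi-square based AIC of formula (18); the model-independent constant is dropped
aic_from_chi2 <- function(chi2, n_params, n = 44) {
  2 * n_params * (n_params + 1) / (n - n_params - 1) + chi2
}

aic_from_chi2(chi2 = 12.3, n_params = 1)  # e.g. Poisson (illustrative chi2 value)
aic_from_chi2(chi2 = 8.1,  n_params = 2)  # e.g. Negative Binomial (illustrative chi2 value)
```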

After first introducing the storm model to obtain a list of winter storms and clusters, this chapter described the model specification and the goodness-of-fit measures. The statistics described above will be used in the results chapter (4) to compare the fit of the Poisson, Negative Binomial and Poisson-Binomial models with the observed storm data.
