earthquakes in Groningen as a Poisson process

(1)

Modeling the induced

earthquakes in Groningen as a Poisson process

using GLM and GAM

Bachelorthesis Mathematics July 2014

Student: E. Geerdink

Primary Supervisor: E.C. Wit

(2)

1 Introduction 5

2 Methodology 6

2.1 Goal . . . . 6

2.2 Study design and data collection . . . . 7

3 Explanatory Data Analysis 12 3.1 Earthquake registrations in the Netherlands . . . . 12

3.2 Empirical distributions and stationarity . . . . 13

3.3 Production and pressure . . . . 17

3.4 Structural Domains . . . . 18

3.5 Subsidence . . . . 20

3.6 Collinearity . . . . 21

4 Model validation and evaluation 23 4.1 Sojourn time models . . . . 23

4.1.1 Base model and diagnosis . . . . 23

4.1.2 Sojourn time model expansion . . . . 30

4.2 Magnitude models . . . . 31

4.2.1 Base model and diagnosis . . . . 31

4.2.2 Magnitude model expansion . . . . 35

4.3 Model Selection . . . . 36

5 Results and Conclusions 37 6 Discussion 42 A Dataset 46 A.1 Annual and cumulative production in the Groningen Field . . . . 46

A.2 Global pressure levels in the Groningen Field . . . . 47

B R source code 50 B.1 Data collection . . . . 50

B.1.1 Subsidence forecast 2025 pixel data . . . . 50

(3)

B.1.2 Subsidence forecast 2050 pixel data . . . . 52

B.1.3 Subsidence forecast 2070 pixel data . . . . 55

B.1.4 Pressure measurement locations 2003 pixel data . . . . . 57

B.1.5 Structural domains pixel data . . . . 58

B.1.6 Generate RD data . . . . 59

B.1.7 Transform RD data . . . . 59

B.1.8 Function: grid generation . . . . 60

B.1.9 Point in Polygon algorithm . . . . 61

B.1.10 Function: test if a point is inside a polygonal area . . . . 62

B.1.11 Generate dataset . . . . 63

B.2 Explanatory Data Analysis . . . . 65

B.2.1 Function: generate mass plots . . . . 65

B.2.2 Function: compute aggregated group statistics . . . . . 66

B.2.3 Function: compute sojourn times for groups . . . . 67

B.2.4 EDA plots . . . . 68

B.3 Model fitting and diagnostics . . . . 79

B.3.1 Function: Marginal model plots with smoothers (wrapper) 79 B.3.2 Function: GLM diagnostics . . . . 80

B.3.3 GLM models . . . . 82

B.3.4 GAM models . . . . 85

B.3.5 Model selection using AIC and BIC . . . . 87

B.3.6 Hazard maps . . . . 88

(4)

List of Figures

2.1 Causal diagram for the effects on seismicity . . . . 8

3.1 Number of induced earthquake registrations . . . . 13

3.2 Histograms of sojourn times . . . . 14

3.3 Annual moving quantiles for sojourn times . . . . 14

3.4 Histograms of magnitudes . . . . 15

3.5 Annual moving quantiles for magnitudes . . . . 15

3.6 Historical production rates, pressure level differences and number of earthquakes . . . . 18

3.7 Structural domains, earthquakes and 2003 pressure measurement locations . . . . 19

3.8 Counts per Structural Domain and subsidence region . . . . 20

3.9 Subsidence regions, earthquakes and 2003 pressure measurement locations . . . . 21

3.10 Correlation plot for continuous regressors . . . . 22

4.1 Residual plots for sojourn time base model . . . . 25

4.2 Marginal model plots for sojourn time base model . . . . 27

4.3 Diagnostic plots for sojourn time base model: Cook’s distances, Studentized residuals, Bonferroni p-values and hat-values . . . . 28

4.4 CERES plots for sojourn time base model . . . . 29

4.5 Residual plots for magnitude base model . . . . 31

4.6 Diagnostic plots for magnitude base model: Cook’s distances, Studentized residuals, Bonferroni p-values and hat-values . . . . 33

4.7 CERES plots for magnitude base model . . . . 35

4.8 AIC and BIC for sojourn time and magnitude models . . . . 36

5.1 Component smooth functions for preferred sojourn time model . 40 5.2 Component smooth functions for preferred magnitude model . . 40

5.3 Hazard map with earthquake rates . . . . 41

5.4 Magnitude hazard map . . . . 41

(5)

Chapter 1 Introduction

The province of Groningen has been dealing with seismicity since 1991 due to gas extraction. In recent years the number of earthquakes has increased steadily, and perhaps an even stronger reason for concern is the upward trend of maximum magnitudes [20]. The largest earthquake to date occurred near Huizinge on August 16, 2012, with a moment magnitude of 3.6 and caused considerable damage and worry among the population. A reliable estimate of an upper bound for magnitudes is yet to be determined.

The Groningen Field is one of the world’s largest gas fields, spanning over 900 km

²

. It was discovered in 1959, and gas extraction commenced in 1963.

The gas reservoirs are located at approx. 3 km depth and store a tremendous volume of 2800.000.000.000 N m

³

, i.e. 2.8 trillion cubic metres. Nearly 60%

has been extracted so far [6]. These numbers leave no doubt about the mone- tary gains for beneficiary parties, including the Government of the Netherlands [15]. Hence, simply reducing production rates to counteract the impact and consequences of seismicity is not feasible, and needs to be carefully balanced against financial effects. Scientific research is therefore essential for quantify- ing seismic hazard due to gas extraction. Geological and statistical research is primarily conducted by the Nederlandse Aardolie Maatschappij (NAM) and the Nederlandse Organisatie voor toegepast-natuurwetenschappelijk onderzoek (TNO).

In this report, we develop a stochastic model for the earthquakes in Gronin-

gen using publicly available data. Assuming that the occurrences of events can

be described by a Poisson process, inference is made on the sojourn times using

Generalized Linear and Additive Models. The results of this research include

hazard maps for the rates and magnitudes of earthquakes in the Groningen Field.

(6)

Chapter 2 Methodology

2.1 Goal

The goal of this study is to develop an explanatory statistical model for the induced earthquakes in the Groningen field. In particular, the effect size of available variables such as annual and cumulative gas production, pressure levels and extraction activity across the field is investigated. Statistical conclusions are inferred from the analysis and compared against the underlying theoretical framework. The results are visualised in a hazard map for the Groningen field.

It is essential to note that these results are not intended for prediction of new earthquakes. Explaining and predicting are different goals and generally lead to different models [23]. Explanatory models are primarily focused on testing causal explanations while predictive models aim for the most accurate prediction of future observations. This leads to different decisions in the modeling process, particularly regarding data requirements, variable selection, model evaluation, inference and the use of interpretable methods (contrary to uninterpretable al- gorithmic data-mining methods). Hence, predictive power cannot, and should not be inferred from explanatory power.

For example, predicting future induced seismic activity is complicated by the fact that human behaviour needs be taken into account, in contrast to tectonic earthquakes which are out of human control. Market demand and corresponding production scenarios [2], legal authorities enforcing production ceilings (e.g. [21, pp. 8-9]) and resistance from residents and environmental organizations may all affect the production rates and hence pressure levels, subsequent compaction, subsidence and eventually the frequency and magnitudes of future earthquakes.

Clearly, these prospective considerations need to be accounted for and require a very different approach from retrospective explanatory modeling.

Previous research based on probability and historical data has been successful

to some degree. According to the NAM, the lack of historical earthquake data

from the Groningen field makes statistical analysis alone infeasible [5]. KNMI

derived the same conclusion in a report on the Huizinge earthquake [12, pp. 21].

(7)

On the other hand, TNO states that the statistical estimation of the probability of induced seismic activity in the Dutch gas fields is adequate [26, pp.3].

Successful application of statistical and probabilistic methods depends to a large extent on the availability of appropriate data. Unfortunately, the current amount of knowledge about the causes, the severity and the effects of future earthquakes is insufficient to draw conclusions with a reasonable degree of cer- tainty [9, pp. 12]. Additional reasearch has to be done in order to make more reliable statements about the future earthquakes. The complex geological and physical processes taking place at large depth complicate the aquisition of data.

However, even if further research and measurements yield more insight in the geological and physical processes involved, there will inevitably remain sources of (yet) unknown and unpredictable effects. The reasoning that any possible effect can be measured and be accounted for, and uncertainty bounds can be reduced to insignificant proportions is not realistic. This rather evident observation is also expressed in a report by the State Supervision of Mines [20, pp. 28]:

Full deterministic prediction of the induced seismicity based on mod- elling or monitoring is not considered possible. Any predictions will remain of a statistical nature, at best providing the probability/fre- quency of tremors of a given magnitude as a function of time.

The use of randomness in models for the induced earthquakes as a conscious choice [19, pp. 187-189] can therefore be justified, and statistical methods serve the purpose of quantification of the unknown effects within confidence boundaries using currently available data.

2.2 Study design and data collection

The first step is to establish a theoretical foundation for the underlying mech- anism that leads to induced earthquakes. The theory leads to hypotheses ac- cording to which data is collected and inference is done; statistical conclusions are justified by the underlying theory.

Figure 2.1 shows a causal diagram for the relationships between gas produc- tion, subsidence and seismicity [25, pp. 20]. The most important aspect and challenge in data collection is to find quantities that realistically and accurately represent the theoretical components from the causal diagram in Figure 2.1.

There is a gap between theoretical ’constructs’ and measurable observations

that intend to represent those constructs. Shmueli calls this the ’operational-

ization’ of theoretical constructs. [23, pp. 2-3]. The suitability of selected

variables and the accuracy of the measurements of these variables, determine

the quality of the dataset on which a statistical study is based. For example, ide-

ally extensive, accurate continuous data for the historical decline of fine-grained

local pressure levels througout the field would be used, but in reality this is

infeasible.

(8)

Figure 2.1: Causal diagram for the effects on seismicity

Measurements in geologic science are complicated by the fact that processes and quantities of interest can be hard to operationalize and measure. For in- stance, the locations, size and stability of faults is difficult to capture in a single variable. And even if a suitable variable is found, accurate measurements may be impossible and estimation is required. In addition measurements and model information may be confidential, restricting the public availability. Also models may be very sophisticated such that output can only be interpreted by profes- sionals/geologists, constraining the use by researchers from other fields even more. Numerical model output is usually visualised in plots and graphs for eas- ier interpretation. However, this means data has to be extracted from figures, which is time consuming and may lead to errors.

Despite these limiting factors, an attempt has been made to collect as much relevant and complete information from available publications:

Gas production

Total annual and cumulative gas production rates from the Groningen field since 1962 [8]. NAM is responsible for gas extraction from the Groningen field. They publish various production statistics on their website in graph- ical form: gas production rates per production location and region since 2010, and total annual production rates since the discovery of the Gronin- gen field in 1962 [8]. The former lacks sufficient historical data. The latter is useful, and can be used to compute total cumulative production rates.

Pressure decrease

Historical pressure decline for the total Groningen field. Minimum and maximum values for the field are given [3, pp. 27], local pressure mea- surements in 2003 [24, pp. 12], and the number of production boreholes [13] in an area which could be related to pressure decrease.

Reservoir compaction

(9)

No suitable data was found that could accurately represent reservoir com- paction in the theoretical model. However, in very large gas field such as the Groningen Field, subsidence level close to the center will be nearly equal to the reservoir compaction at large depth, while for smaller gas fields this will only be a small proportion [3, pp.6].

Subsidence

Subsidence forecasts for 2025, 2050 and 2070 by NAM [3, pp. 38-43].

Stress accumulation

Fault zones, called Structural Domains [25, pp. 28,211] [7, pp. 25] may capture the local effects of faults in the gas field. The theoretical model in Figure 2.1 states the seismicity is a consequence of stress accumu- lation due to reservoir compaction. Rock formations are ’squeezed’ to- gether (compacted), such that stress builds up. Because natural faults are present in the sand stone layers, the powerful forces acting upon the large rock formations result in movement along the faults, a process in which a large amount of energy is released. Consequential shocks are perceived as seismicity [4].

Since natural faults have various properties such as length and stability, it not not straightforward to capture this information in one or several variables that can be incorporated in a model. Instead, the gas field is divided into several fault zones.

Seismicity

KNMI catalog of all induced earthquakes in the Netherlands, containing date, time, location, spatial coordinates (latitude/longitude and Rijsks- driehoek), intensity, magnitude, depth [18].

KNMI maintains and publishes a catalog of all induced earthquakes in the Netherlands [18]. At the time of writing the catalog contains 1034 entries, each representing the registration of an event and associated data, as displayed in Table 2.1. The entries are in chronological order. In addition to the earth- quake catalog, data from the Groningen field such as annual and cumulative production numbers, pressure differences between areas in the field and the dis- tribution of faults are of interest. Although numerous research reports have been publicly disclosed, numerical data is not readily available, and needs to be extracted from figures scattered throughout the articles. The results are shown in Table 2.2. A few notes regarding the data:

• Missing values: since 1996 the KNMI measurement network has been

complete for earthquakes with magnitude M ≥ 1.5 [20, pp. 30]. Larger

earthquakes before 1996, and current smaller quakes may not have been

recorded

(10)

Table 2.1: Induced earthquake catalog in the Netherlands

Variable Description Units

YYMMDD Date of occurrence TIME Time of occurrence

LOCATION Name of nearest city or town

LAT Latitude of epicentre Degrees

LON Longitude of epicentre Degrees

X RD Rijksdriehoek X coordinate of epicentre m Y RD Rijksdriehoek Y coordinate of epicentre m INT Intensity on Mercalli’s scale

MAG Magnitude on Richter’s scale

DEPTH Estimated depth of hypocentre km

• The intensity variable INT contains many missing values, such that it is of no value

• Virtually all values of Depth are equal to 3km, so it contains no informa- tion

• MAG, SUB2025, SUB2050, SUB2070, PROD, CUMPROD, PMIN, PMAX, PDIFF are continuous variables in theory, but take a discrete set of values in the dataset due to the way they are measured (e.g. only annual values or limited accuracy of instruments). They will be treated as continuous (quantitative) variables in our models

• P2003 is factor with three levels: L: 130 − 134 bar, M: 135 − 139 bar, H:

≥ 140 bar.

• SD is a factor with 10 levels: 9 Structural Domains and 1 for other earth- quakes outside any domain

The Poisson process involves counts of events, which at first thought de- mands for Poisson regression and grouping the dataset into count groups. How- ever, with the data under consideration, grouping events and their explanatory variables inevitably leads to data loss: how should we relate the explanatory variables of each quake to a count? Fortunately, the one-dimensional Poisson process can also be defined in terms of exponentially distributed sojourn times [16, pp. 39], which can be computed without data loss. The exponential distri- bution is obviously a member or the exponential family of distributions, hence inference can be made using Generalized Linear and Additive Models.

This approach allows us to estimate a mean rate for each earthquake, which

we see as a point in the Groningen Field stochastically independent of other

points, under the assumptions of the Poisson process. The mean rate is as-

sumed to be dependent on the explanatory variables in the linear component

(11)

Table 2.2: Data from the Groningen field

Variable Description Units

SUB2025 NAM subsidence forecast for 2025 cm

SUB2050 NAM subsidence forecast for 2050 cm

SUB2070 NAM subsidence forecast for 2070 cm

SD NAM designated Structural Domains for fault intensity

PROD Annual production for Groningen field 10

⁹

N m

³

CUMPROD Cumulative production for Groningen field since 1962 10

⁹

N m

³

P2003 NAM pressure measurements in 2003 bar

PMIN Minimum pressure across Groningen Field bar PMAX Minimum pressure across Groningen Field bar PDIFF Maximum pressure difference (P M AX − P M IN ) bar BORE PROD Number of production boreholes within specified radius BORE ALL All boreholes (exploration etc.) within specified radius GRO FIELD Indicator for occurrence in the Groningen Field

of the model, including the spatial coordinates X RD and Y RD. Thus, the rate associated to a point is modeled as a deterministic function of location and other explanatory variables, and in this context the rates of adjacent points are related.

Since the sojourn times depend on the chronological order of the data entries,

they should be computed separately when groups are involved, in constrast to

other variables which just have a single value per row independent of its position

in the dataset. For instance, computing sojourn times for the Groningen Field

earthquakes requires exclusion of quakes outside the field, which are spread all

accross the data set. Otherwise the sojourn times mistakenly appear shorter

than they actually are.

(12)

Chapter 3 Explanatory Data Analysis

To get a better understanding of the locations and magnitudes of earthquakes in the Groningen Field, and the relationships between these features and the explanatory variables, we first conduct an explanatory data analysis to establish a set of variables suitable for further modeling. In addition, we need to identify the distributions of the time between earthquakes and the magnitudes, so that appropriate models can be developed. We also need to check whether our assumption about the exponential distribution of the sojourn times holds.

3.1 Earthquake registrations in the Netherlands

On 26 december 1986 the first induced earthquake in the Netherlands was recorded near Assen and had a magnitude of 2.8. Two other quakes of magni- tudes 2.5 and 2.7 were registered near Hooghalen and Purmerend in 1987 and 1989, respectively. No earthquakes had been observed in the Groningen gas field prior to 1991. During subsequent years between 10 and 25 earthquakes were registered annually, until a sharp increase of 47 events in 2003 as shown in Figure 3.1. From 2003 onwards, an increasing trend can be observed for the frequencies in the Groningen field, while the frequencies outside the Groningen field have remained relatively stable, except for a burst of 49 events in 2009.

Groningen has seen a particularly strong increase of smaller quakes in 2011, 2012 and 2013. The number of larger earthquakes has increased only slightly in recent years. The quake intensity in Groningen is much higher compared to other areas in the Netherlands, which also includes a few offshore earthquakes.

The largest earthquake in the Groningen field to date happened near Huizinge

on 16 august 2012 and had a moment magnitude of 3.6 [12]. Importantly,

the data under consideration is subject to measurement instrument accuracy

and completeness of the recording network. Hence, an increase in the the

number of observed earthquakes does not necessarily imply that the number of

earthquakes increases. Since 1996 the measurement network has been complete

for magnitudes of 1.5 and higher, so smaller earthquakes might have occurred

(13)

Figure 3.1: Number of induced earthquake registrations

before with no registration due to an incomplete recording network [20, pp. 30].

Smaller quakes can be observed in the vicinity of measurement locations, but not in a wider area. Therefore KNMI recommends to only include earthquakes of magnitudes 1.5 and larger for temporal and spatial analysis [17].

3.2 Empirical distributions and stationarity

Developing appropriate models for the sojourn times and magnitudes requires knowledge about the distributions of these random variables. If we consider our observations to be a sample (or realization) of some unknown distribution, then a sufficiently large sample should give us some information about the underlying ’true’ distribution that generated this sample. Since it is likely that the distribution changes over time, i.e. it is nonstationary, we should also try to visualize the change in distribution using observations from different time periods. This is done for the sojourn times in Figures 3.2, 3.3 and for magnitudes in Figures 3.4 and 3.5.

Six disjoint time intervals were chosen in such a way that the counts in each

time period are approximately equal, insofar this is feasible: 1991-2000 (126

obs.), 2001-2005 (119 obs.), 2006-2008 (110 obs.), 2009-2011 (158 obs.), 2012

(14)

Figure 3.2: Histograms of sojourn times

Figure 3.3: Annual moving quantiles for sojourn times

(15)

Figure 3.4: Histograms of magnitudes

Figure 3.5: Annual moving quantiles for magnitudes

(16)

(93 obs.), 2013-2014 (140 obs.). This allows for easier comparison and avoids issues due to large differences in sample sizes.

For each time interval, histograms representing empirical probability densi- ties (such that each histogram has a total area of one) are plotted in Figures 3.2 and 3.4. The histograms for the sojourn times resemble an exponential density, while the magnitudes appear to be approximately normally distributed.

To compare the empirical densities with the theoretical densities, two curves are added to each plot. The parameters for the theoretical distributions are estimated using Maximum Likelihood. For the red curve, all observations are used, while for each blue curve only the observations from the corresponding time period are considered. Hence, the red curves designate an overall approx- imation (ignoring nonstationarity) and are identical across the six plots, while the blue curves are different from each other and should match the histograms best.

Looking at Figure 3.2, we see that the blue curves accurately match the empirical densities in each of the six plots. This endorses the validity of our assumption of exponential sojourn times. However, as the red curve shows, stationarity does not hold. Between 1991 and 2000, higher sojourn times (40- 70 days) happened more frequently compared to other time periods. Therefore the histogram densities exceed the red curve at this point. As time progresses, sojourn times gradually decrease. Between 2006 and 2008 both curves match and fit the empirical distribution well, but more recently, the density of higher sojourn times is overestimated.

Thus, the sojourn times show a decreasing trend. That implies that the annual number of earthquakes tends to increase, which we already observed from Figure 3.1. It appears that the sojourn times are well described by an exponential distribution, if nonstationarity is taken into account.

Nonstationarity is also observable in Figure 3.5 which depicts annual sample quantiles. The median sojourn times showed large fluctuations between 1994 and 2004. In recent years this variablity somewhat stabilized into a gradually decreasing trend. A similar statement holds for the spread of the distribution, visualized by the changes of the Interquantile range (IQR = |Q3 − Q1|) and the Mean Absolute Deviation (MAD, which is a consistent estimator of the standard deviation, but more robust against outliers). A few outliers (larger than 1.5 · IQR) are present, particularly in 2013, but those sojourn times are not extremely high compared to previous years.

Turning to the distribution of magnitudes, we observe that the empirical

densities also suffer from nonstationarity. In particular, the increasing positive

skewness (’leaning to the left’) with time cannot be captured by the symmetric

normal approximation. Moreover, the empirical pdf appears to more pronounced

around the mean, as seen from the density peak exceeding the mode of the nor-

mal distribution. Alternative exponential family distributions, such as Gamma

with large shape parameter, or skewed log-normal, unfortunately do not really

provide a better fit. The mean magnitudes for different time periods vary be-

(17)

tween 1.1 and 1.4, which is close to the overall mean of 1.3. Hence the overall and period-specific approximations tend to match quite well.

Skewness can also be observed from the empirical quantile plot for magni- tudes, Figure 3.5. For a symmetric distribution, the median is the center of the interquantile range. This is not the case most years, as the median is generally closer to the lower quantile Q1 than to the upper quantile Q3, most clearly seen for 1995, 1996, 2001-2003 and 2012.

Power transformation could be used to address the problem of positive skew and non-normality [14, pp. 303]. Since some magnitudes have negative val- ues, this would either involve adding a constant to obtain positive values only, or using the Yeo-Johnson family of transformation or similar. Transformation introduces its own set of issues however (e.g. interpretability), so we will not make use of them, and leave the magnitudes unmodified.

The annual median fluctuates around the overall median of 1.2, which at a quick glance could indicate that the magnitudes have not changed that much during the years. This is not true, as the upward trend of maximum magnitudes demonstrates, and should be reason for concern. Moreover, the last decade has seen an increase of unusually large quakes, as can be clearly seen from the number of orange triangles. These outliers show that larger earthquakes have become more common.

Note that negative magnitudes were measured both in Groningen and else- where. These are not errors; magnitudes are measured on a logarithmic scale, hence minor quakes may exhibit negative magnitudes.

3.3 Production and pressure

According to a letter from the State Supervision of Mines, based on a report by the same agency, there might exist a nearly linear relationship between the production rate and the probability of seismic activity [21, pp.2-3] [20, pp.18- 22]. To this end, the annual production rates since the start of production in 1963 are depicted in Figure 3.6, together with the global pressure level difference across the whole field and the annual number of earthquakes.

Initially pressure difference increased almost linearly with production rate, reaching a maximum of 68.1 bar across the field in 1975. After 1976, annual production steadily decreased from 87 billion N m

³

to 27 billion N m

³

in 1989.

Until that point, no seismicity had been observed in the Groningen Field.

After 1990, the annual earthquake frequency shows a striking correspon-

dence to the movements of the annual production graph. Two small production

peaks in 1993 and 1996 are followed by peaks in the number of earthquakes one

year later, in 1994 and 1997. Both graphs then show a small decrease in sub-

sequent years until 2001. Since 2001 production rate have gradually increased

again, followed by a burst of seismic activity lagging 2 years behind. Similar

observation are mentioned in the report by the State Supervision of Mines,

(18)

Figure 3.6: Historical production rates, pressure level differences and number of earthquakes

restricted to earthquakes of magnitude M ≥ 1.5 [20, pp.18,30].

The association between global pressure difference and seismicity is less evident. In 1993 pressure difference was reduced to 11.4 bar, and subsequently increased steadily to a level of 24.9 bar in 2005. This incline was accompanied by more seismic activity, however from 2005 onwards the simultaneous increase does not hold, and the association is not clear.

The above observations suggest that annual production may be a relevant variable in our model, and should therefore be included.

3.4 Structural Domains

As mentioned in section 2.2, natural faults present in the rock formations are described using fault zones, called Structural Domains. Based on structural geological features, the NAM distinguishes nine Structural Domains for which the the fault density and seismicity features are described [25, pp.25]. Table 3.1 shows the most important features of these Structural Domains [25, pp.119].

Most seismicity occurred in domains 1, 4, 5, and 7, which is very clear from

the barplot in Figure 3.8. A map of the Groningen Field [1], together with the

(19)

Table 3.1: Overview of most important features of Structural Domains SD Fault size

(km)

Fault density (km/km

²

)

Fault reactivation potential (-)

1 370 1.6 0.17

2 68 1.1 0.25

3 150 1.2 0.20

4 100 0.8 0.22

5 150 1.6 0.21

6 130 0.7 0.22

7 270 1.2 0.21

8 110 1.6 0.20

9 210 1.4 0.23

Figure 3.7: Structural domains, earthquakes and 2003 pressure measurement

locations

(20)

structural domains and the earthquake occurrences are displayed in Figure 3.7.

Domains 1, 5, 8 and 9 are characterised by a relatively high fault intensity, while the opposite is true for domains 4 and 6, which have a relatively low fault density [25, pp.119]. Also, an analysis by TNO suggests that there exists a correlation between seismicity and certain fault specific features [24].

3.5 Subsidence

The degree of reservoir compaction depends on several factors such as rock material composition, the amount of pressure decline and the thickness of the exploited (”depleted”) reservoir [3, pp.6] Compaction leads to subsidence, which affects a large area, is a gradual process and therefore does not cause damage to buildings [4]. According the the theoretical model, subsidence does not cause seismicity, but compaction does. Since the Groningen Field is a very large gas field, subsidence near the center of the field will be nearly equal to the reservoir compaction deep below the surface [3, pp.6]. The high level subsidence regions correspond to areas of higher avarage porosity, while zones with lower porosity in the southern part of the field are less affected by seismicity.

State Supervision of Mines therefore suggests a relationship between reser- voir compaction, which is higher in high porosity areas, and seismicity [20, pp.10] Since we did not find any compaction data that can be used in our models, subsidence data may serve as a substitute.

Figure 3.8: Counts per Structural Domain and subsidence region

Figure 3.9 shows the subsidence regions for the 2025 forecast by NAM [3,

pp. 38-43] The correlation between subsidence level and earthquake frequency

seems evident: most earthquakes occurred in regions most severely affected by

subsidence. In southern direction of the field, a fair number of earthquakes

also happened in lower level regions. These numbers are clearly reflected in the

bar plot of Figure 3.8 Notably, large earthquakes (M ≥ 2.5) occurred almost

exclusively in high level subsidence regions.

(21)

Figure 3.9: Subsidence regions, earthquakes and 2003 pressure measurement locations

3.6 Collinearity

When two or more explanatory variables are correlated with each other, this is called multicollinearity. Collinearity affects the precision of estimated coeffi- cients [14, pp.325] and can therefore be problematic if [23, pp. 12]

1. Individual regression coefficients are of interest, or

2. Contribution of one explanatory variable, without influence of other ex- planatory variables, is of interest.

Strongly correlated variables are redundant and should be omitted from the model. For this purpose, a correlation plot as in Figure 3.10 allows us to assess the degree of association among variables. It is actually a visual representation of a correlation matrix.

Very strong positive correlation obviously exists between the spatial coordi-

nates X RD, LON and Y RD, LAT. The subsidence forecasts for 2025, 2050 and

(22)

2070 are also highly correlated and therefore redundant; since the forecast for 2025 will likely be the most accurate, we will include this variable and omit the other two. CUMPROD has a strong negative relationship with global pressure level variables PMIN and PMAX; as more gas is extracted, the pressure level across the whole gas field declines. The latter two variables are also related to annual production rates PROD. Another evident correlation exists between ty, the time in years, and CUMPROD (and consequently PMIN, PMAX).

The correlation plot also shows that the sojourn times st are (linearly) cor- related with cumulative production, and magnitudes MAG with spatial coordinate Y RD and SUB2025. These result will be confirmed in the next chapter.

Figure 3.10: Correlation plot for continuous regressors

(23)

Chapter 4 Model validation and evaluation

Based on the explanatory data analysis we decide to construct models using the following variables: X RD, Y RD, SUB2025, SD, PROD, CUMPROD, P2003, PDIFF and BORE PROD. The models are implemented in the R programming language [22].

4.1 Sojourn time models

4.1.1 Base model and diagnosis

We will assume that the ith sojourn times S

_i

has the exponential distribution with rate λ

i

. This allows us to construct a Generalized Linear Model with a log link; this link function allows for convenient interpretation and ensures that ˆ λ

_i

> 0 as desired. Consequently, the base model for the sojourn times can be expressed as follows:

log(λ

i

) = β

0

+ β

1

X RD

i

+ β

2

Y RD

i

+ β

3

SU B2025

i

+ β

4

SD

i

+ β

5

P ROD

i

+ β

6

CU M P ROD

i

+ β

7

P 2003

i

+ β

8

P DIF F

i

+ β

₉

BORE P ROD

_i

= η

_i

, i = 1, . . . , n.

(4.1)

The regression output is shown in Table 4.1. CUMPROD is highly significant and has a negative slope, which corresponds to our intuitive expectation that the earthquake rate increases, and the time between consecutive events decreases, as more gas is extracted from the Groningen Field. P2003 and BORE PROD are significant at the 5% level.

The first 9 plots in Figure 4.1 shows the Pearson residuals against the ex-

planatory variables. The Pearson residuals should be independent of the ex-

planatory variables, hence ideally these plots should be null plots without any

(24)

Model 1 (Intercept) 8.957 (10.498)

X RD 0.000 (0.000)

Y RD 0.000 (0.000)

SUB2025 0.005 (0.012)

SD1 −0.126 (0.445)

SD2 −0.257 (0.636)

SD3 −0.575 (0.520)

SD4 −0.459 (0.517)

SD5 −0.372 (0.558)

SD6 −0.597 (0.623)

SD7 −0.370 (0.558)

SD8 0.149 (0.517)

SD9 −0.696 (0.518)

PROD −0.006 (0.006)

CUMPROD −0.003 (0.000)

^∗∗∗

P2003M −0.257 (0.148)

P2003H −0.424 (0.196)

^∗

PDIFF 0.015 (0.019)

BORE PROD −0.013 (0.005)

^∗

AIC 4581.803

BIC 4674.071

Log Likelihood -2270.902

Deviance 1370.915

Num. obs. 745

∗∗∗p < 0.001,^∗∗p < 0.01,^∗p < 0.05

Table 4.1: Sojourn time base model coefficients

patterns. A correctly specified model requires that the conditional mean func- tion in any plot be constant as we move accross the plot. Nonlinear trends, trends in variation across plots and isolated points could indicate a violation of the model assumptions [14, pp. 288].

The curve for X RD is fairly constant while the curve for Y RD increases slightly. The clouds of points look random and the plots do not suggest any problems.

The residual plot for SUB2025 reveals a clear increase of variation as we

move towards the right along the horizontal axis; the phenomenon is called

heteroscedasticity. This plot implies that SUB2025 is endogenous because it

is not independent of the error terms. According to the theoretical model in

Figure 2.1, subsidence is a consequence of gas production, pressure level decline

and compaction, not a causal effect on seismicity. Thus, correlation with the

residuals is likely due to incorrectly omitting an explanatory variable which is also

(25)

Figure 4.1: Residual plots for sojourn time base model

correlated with SUB2025. Looking at the theoretical model again, compaction might be the missing link. However, we currently have no opportunity to test this presumption as we lack compaction data. It does tell us that care should be taken when examining the coefficients for this variable, because endogeneity results in biased parameter estimates [23, pp. 10].

Non-constant error variance can also be observed in the PROD, CUMPROD,

PDIFF and BORE PROD plots. The variability appears to increase with increasing

(26)

values of PROD, CUMPROD, PDIFF, while the opposite holds for BORE PROD. A pronounced peak is present at a pressure difference of 20 bar.

The plots for SD and P2003 contains boxplots for the residuals at each factor level. The boxplots should all have approximately the same center and spread [14, pp. 288], which is the case for all three P2003 levels and most Structural Domains, except for SD2 and earthquakes outside any of the Structural Domains (indicated by SD0). The number of outliers does vary per factor level.

Each plot features residuals of values larger than 4, which are relatively isolated. The influence of those potential outliers is considered shortly. The smooth in the plot for BORE PROD appears to be ’pulled down’ by a few negative residuals at the right end of the horizontal axis, which may exaggerate the nonconstant behaviour of the curve.

The bottom left plot in Figure 4.1 shows the Pearson residuals plotted against the linear component η

_i

of the Generalized Linear Model. Since we used a log link function in this model, the fitted values on response scale are obtained by taking the natural exponential function e

^ηⁱ

. Despite some small variation in the smooth curve, particularly towards larger fitted values, no clear patterns can be observed.

The bottom middle plot in Figure 4.1 shows the Deviance residuals in chronological order, corresponding to the order of observations. Any patterns may indicate serial correlation and a lack of independence among the observa- tions [11, pp. 35]. Despite a tiny bump of the smoother in the center of the plot, the residuals seem randomly scattered throughout the plot. There is no sign of serial correlation.

Lastly, the Normal quantile plot of the deviance residuals does not suggest any problematic features.

Marginal model plots for CUMPROD and fitted values are depicted in Fig- ure 4.2. These plots display the conditional distribution of the response given CUMPROD and the fit of the model, respectively, ignoring other variables in the model [14, pp. 291].

The solid blue smooth shows the conditional expectation of the response.

If we replace the reponse on the vertical axis by the fitted values and redraw the smooth in the modified plot, we obtain the red dashed curve. This smooth should estimate the conditional expectation of the response when the model is suitable for the data. Therefore, mismatch between any pair of red and blue smooths indicates that the model does not fit the data well.

For smaller values of CUMPROD the actual conditional expectation exhibits more variation which is not captured by the estimated smooth. This variation is primarily due to a very high sojourn time of 367 days between the first and second registered earthquake in the Groningen Field. The same observation distorts the conditional expectation of the response given the fitted values, causing a peak in the curve. Omitting this observation yields a better match.

The marginal model plots for other explanatory variables (not shown) do

(27)

Figure 4.2: Marginal model plots for sojourn time base model

not reveal any problems, the model matches the marginal relationships well as the smooths are nearly identical.

As we have seen for the marginal model plots, outliers may negatively affect estimation accuracy. We therefore need to analyse the influence of unusual observation on the estimated coefficients. Figure 4.3 shows an influence index plot for the current sojourn time model, containing Cook’s distances, studentized residuals, Bonferroni p-values and hat-values. The Cook’s distances for the majority of observations are small; for 9 observations it is larger than 0.02.

Noteworthy are earthquakes 2 and 125 which have the largest distances.

There are quite a few Studentized residuals (which are approximations for GLM in order to avoid refitting [14, pp. 318]) exceeding 2 in absolute value, which may be reason for concern. Observation 2 and 125 also have higher standardized residuals, but the largest residuals correspond to observations 66, 130, 133, 414 and 602. The associated Bonferroni-adjusted p-values all exceed 0.3 so they are not significant. For the other studentized residuals there is no evidence that they are extreme, according to the Bonferroni adjustment.

The relationships between the response variable and the explanatory vari- ables in the current model is likely to be more complicated than we have ac- counted for so far (i.e. only first order terms in the linear predictor, related to sojourn times using a straightforward log link function). For instance, the par- tial relationship may be descibed more accurately by adding higher order terms, or perhaps by some complex nonlinear smooth function. In this case a variable transformation would lead to a better fitting model. The question of interest is how to figure out whether the model would benefit from such a transformation, and what this transformation should look like.

CERES plots (Combining conditional Expectation and RESiduals), developed

by Cook, can be used to detect the need for transforming an explanatory variable

[14, pp. 311]. They can also provide a visualization of a suitable transformation.

(28)

Figure 4.3: Diagnostic plots for sojourn time base model: Cook’s distances, Studentized residuals, Bonferroni p-values and hat-values

CERES plots are a generalization of partial residual plots frequently deployed for studying regressor transformations, but which may not perform as well for GLMs [10, pp. 731] or if the relationships among the variables are strongly nonlinear [14, pp. 311].

Figure 4.4 shows CERES plots for all quantitative explanatory variables (ex-

cept SUB2025) in our model. Each plot has an explanatory variable on the

horizontal axis, and the CERES residuals on the vertical axis (see [10, pp. 731-

732] for the definition of CERES residuals). The dashed red line represents a

simple linear regression line and the solid green curve is a nonparametric smooth

(loess curve) which visualizes a possible transformation. The smooths for X RD

and Y RD seem slightly quadratic while the other plots indicate nontrivial, non-

monotone transformations, particularly for PROD, CUMPROD and PDIFF which

are the most curved. From these plots it is clear that a more sophisticated

(29)

Figure 4.4: CERES plots for sojourn time base model

approach is required, as straightforward transformations do not suffice. Gen- eralized Additive Models, to be discussed shortly, will be used to address the problem of necessary variable transformations.

Since we are interested in the individual coefficients in our explanatory model, we need to verify that the regressors are not correlated by checking the Variance Inflation Factors (VIF) [14, pp. 325]. Since SD is a factor with 9 degrees of freedom (1 reference group and 9 dummy variables in the linear com- ponent) and likewise P2003 is a factor with 2 degrees of freedom, we compute the Generalized VIF which accounts for related sets of regressors. The values in Table 4.2 are no reason for concern.

GVIF Df GVIF^(1/(2*Df))

X RD 5.52 1.00 2.35

Y RD 6.01 1.00 2.45

SUB2025 4.25 1.00 2.06

SD 79.42 9.00 1.28

PROD 2.24 1.00 1.50

CUMPROD 2.44 1.00 1.56

P2003 4.18 2.00 1.43

PDIFF 1.63 1.00 1.28

BORE PROD 2.31 1.00 1.52

Table 4.2: Variance Inflation Factors

(30)

4.1.2 Sojourn time model expansion

The CERES plots in Figure 4.4 indicated the need for transformations of explana- tory variables. Therefore we will deploy Generalized Additive Models to account for the nonlinear relationships. Equation 4.2 states the base GAM model.

log(λ

_i

) = β

₀

+ β

₁

SD

_i

+ β

₂

P 2003

_i

+ f

₁

(X RD

_i

, Y RD

_i

) + f

₂

(SU B2025

_i

) + f

₃

(P ROD

_i

) + f

₄

(CU M P ROD

_i

) + f

₅

(P DIF F

_i

) + f

₆

(BORE P ROD

_i

)

= η

i

, i = 1, . . . , n.

(4.2)

In addition to the base models, we constructed several other models by carefully excluding insignificant variables which do not contribute much to the estima- tion of the response. This was done with the underlying theoretical model in mind, meanwhile ensuring that exclusion of these variables did not introduce misspecification. We omit the vast number of diagnostics plots as discussed for the base model, instead we present an overview of resulting models.

Model 1 GLM base model given by equation 4.1 Model 2 Model 1 excluding SUB2025

Model 3 Model 2 excluding PDIFF

Model 4 Model 3 including interaction PROD:SD

Model 5 Model 4 including interaction SUB2025:P2003 Model 6 GAM base model given by equation 4.2 Model 7 Model 6 excluding f (SU B2025) Model 8 Model 7 excluding f (P DIF F )

Model 9 Model 8 including interaction f (P ROD, by = SD)

Model 10 Model 9 including interaction f (SU B2025, by = P 2003)

(31)

4.2 Magnitude models

4.2.1 Base model and diagnosis

A similar analysis as above can be conducted for the earthquake magnitudes in the Groningen Field. From Figure 3.4 in the Explanatory Data Analysis we observed that the magnitudes are approximately Normally distributed. If we use an identity link to relate the mean magnitude to the linear component, the GLM reduces to ordinary linear regression. The base model can be expressed as follows:

µ

i

= β

0

+ β

1

X RD

i

+ β

2

Y RD

i

+ β

3

SU B2025

i

+ β

4

SD

i

+ β

5

P ROD

i

+ β

6

CU M P ROD

i

+ β

7

P 2003

i

+ β

8

P DIF F

i

+ β

9

BORE P ROD

i

= η

_i

, i = 1, . . . , n.

(4.3)

Table 4.3 shows the coefficients of the fitted model. The intercept and spatial coordinate Y RD are highly significant. There is some evidence that subsidence is associated with magnitude, and is significant at the 5% level. Remarkably, in contrast to the sojourn time models, there is no evidence that cumulative production has any effect on magnitudes. Likewise, this model shows no evi- dence that local and global pressure differences, and the number of production boreholes, contribute to the reponse.

Figure 4.5: Residual plots for magnitude base model

Selected Pearson residual plots are shown in Figure 4.5. Residuals against

Y RD do not show any clear pattern. The smooth is mostly constant along the

(32)

Model 1 (Intercept) −13.061 (4.561)

^∗∗

X RD 0.000 (0.000)

Y RD 0.000 (0.000)

^∗∗∗

SUB2025 0.011 (0.005)

^∗

SD1 0.249 (0.194)

SD2 −0.015 (0.276)

SD3 0.262 (0.226)

SD4 0.137 (0.225)

SD5 0.159 (0.242)

SD6 0.022 (0.271)

SD7 0.027 (0.242)

SD8 0.345 (0.225)

SD9 0.169 (0.225)

PROD −0.002 (0.003)

CUMPROD 0.000 (0.000)

P2003M −0.091 (0.064)

P2003H 0.030 (0.085)

PDIFF 0.002 (0.008)

BORE PROD 0.000 (0.002)

AIC 1123.296

BIC 1215.590

Log Likelihood -541.648

Deviance 186.604

Num. obs. 746

∗∗∗p < 0.001,^∗∗p < 0.01,^∗p < 0.05

Table 4.3: Magnitude base model coefficients

range of Y RD except for smaller values (i.e. towards the south), where a slight change of variation can be observed.

Residuals plotted against SUB2025 display a cone-shaped expansion along the horizontal axis, and is a typical pattern for errors with non-constant variance.

As previously mentioned in the residual analysis of the sojourn time models, SUB2025 appears to be endogenous and this can be explained by the theoretical model. The PDIFF plot again shows a pronounced peak, the green smooth now has a peculiar bump around a pressure difference of 18 bar, visualizing a sudden shift of the locations of residuals.

Looking at the residual cloud in the linear predictor plot (i.e. fitted values because of the identity link), heteroscedasticity is not directly evident, although the green curve does suggest some variability as it is not entirely constant.

Pearson residuals in chronological order show little sign of serial correlation.

Nonetheless the red smooth slowly fluctuates around zero in a seemingly cyclic,

(33)

or seasonal, manner. It is possible that such an effect exists (e.g. due to varying production rates), but the trend, as it appears in the graph, is weak and can probably safely be ignored.

The same cannot be said about the distribution of the residuals. If the model fits the data well, the residuals should have a Normal distribution with mean zero and a constant variance. The bottom-right Normal quantile plot in Figure 4.5 clearly shows that the residuals are positively skewed, which corresponds to prior observations from the explanatory data analysis. This feature may be corrected by applying a power transformation to the magnitude values [14, pp. 304]. Since the magnitudes can be negative because of the logarithmic scale, the Yeo-Johnson family of transformations could be used instead of Box- Cox. This may improve model fit but comes at the expense of interpretability;

we opted for models without transformations.

Figure 4.6: Diagnostic plots for magnitude base model: Cook’s distances, Stu-

dentized residuals, Bonferroni p-values and hat-values

(34)

Diagnostics plots for the magnitude model are depicted in Figure 4.6. The most prominent feature is the very low Bonferroni-adjusted p-values in the third graph. Earthquakes 582, 283, 479 have magnitudes 3.6, 3.5, 3.2 and p-values 0.02, 0.07 and 0.27, respectively. The presence of such outliers was already noted in the explanatory data analysis, Figure 3.5. Despite the evidence that these magnitudes are outliers, their influence on the estimated coefficients is not very large, since we do not find them back in the leverage nor Cook’s distance plot; points are influential when they combine outlyingness and high leverage.

Observations 65, 746 and 740 have the largest Cook’s distances and correspond to magnitudes -0.8, 2.3 and 3.0 respectively. The negative magnitude belongs to a tiny tremor and can be treated as noise.

The CERES plots in Figure 4.7 clearly indicate the need for transformations of the variables X RD, Y RD, CUMPROD and PDIFF. The visualized transformation for Y RD resembles a quadratic curve while the others are not simple. This partly contrasts the CERES plots for the sojourn time model, which show nearly flat smoothers (e.g. no transformation) for the spatial coordinates. The suggested smooths for CUMPROD are very similar for both models. The notable bump in the proposed PDIFF transformations matches exactly with the one shown in the Pearson residual plot in Figure 4.5. A similar correspondence is found for SUB2025. This variable, PROD and BORE PROD (not shown) likely do not benefit from tranformations.

Lastly, a VIF check shows no collinearity issues:

GVIF Df GVIF^(1/(2*Df))

X RD 5.53 1.00 2.35

Y RD 6.02 1.00 2.45

SUB2025 4.25 1.00 2.06

SD 79.29 9.00 1.28

PROD 2.24 1.00 1.50

CUMPROD 2.45 1.00 1.57

P2003 4.19 2.00 1.43

PDIFF 1.64 1.00 1.28

BORE PROD 2.32 1.00 1.52

Table 4.4: Variance Inflation Factors

(35)

Figure 4.7: CERES plots for magnitude base model 4.2.2 Magnitude model expansion

Analogous to the sojourn time models, we attempt to improve the base model by removing unimportant variables, adding interpretable interactions and adding smooth transformations while paying attention to the diagnostics presented above.

The GAM models with variable transformations using thin-plate regression splines are given by

µ

i

= β

0

+ β

1

SD

i

+ β

2

P 2003

i

+ f

₁

(X RD

_i

, Y RD

_i

) + f

₂

(SU B2025

_i

) + f

₃

(P ROD

_i

) + f

₄

(CU M P ROD

_i

) + f

₅

(P DIF F

_i

) + f

₆

(BORE P ROD

_i

)

= η

i

, i = 1, . . . , n.

(4.4)

This leads to the following set of candidate models:

Model 1 GLM base model given by equation 4.3 Model 2 Model 1 excluding BORE PROD

Model 3 Model 2 excluding PDIFF

Model 4 Model 3 including interaction SUB2025:SD

Model 5 Model 3 including interaction Y RD:SD

(36)

Model 6 GAM base model given by equation 4.4 Model 7 Model 6 excluding f (BORE P ROD) Model 8 Model 7 excluding f (P DIF F )

Model 9 Model 8 including interaction f (SU B2025, by = SD) Model 10 Model 8 including interaction f (SU B2025, by = SD)

4.3 Model Selection

We now have established two sets of ten models each for sojourn times and magnitudes, and we are interested which models show the best performance.

Since these models are not all nested, we select our preferred models based on the information criteria AIC and BIC.

AIC attempts to select the best approximating model to the unknown ’true’

data generating process, while the BIC provides a measure of evidence that the model is the ’true’ model, based on the data [27, pp.2-8]. Generally speaking, AIC measures predictive accuracy while BIC measures goodness of fit [23, pp.13].

Practically, the BIC imposes a larger penalty on model complexity and tends to favor parsimony.

In Figure 4.8 the AIC values are plotted against the BIC. Models 8, 7 and 6 have the smallest AIC values both for sojourn times and magnitudes. The smallest BIC is assigned to model 3 which is the GLM with the least parameters, hence the most parsimoneous.

Considering both criteria, we select Model 8 as our best model, both in the case of sojourn times and magnitudes.

Figure 4.8: AIC and BIC for sojourn time and magnitude models

(37)

Chapter 5 Results and Conclusions

Table 5.1 shows a summary of the final models obtained after model selection in the previous section.

First looking at the goodness of fit statistics at the bottom of the table, we note that both models have fairly low R

²

values of 0.17 and 0.20 for the sojourn time model and magnitude model, respectively. Similarly the amount of deviance explained is limited to 25% and 23% which is not very high. Therefore the explanatory power of the models is low, and care should be taken when inferring conclusions from these models.

Both models have a statistically significant intercept. Since we did not center the explanatory variables (e.g. at their mean), there is no case that alle explanatory variables would be zero, because cumulative production CUMPROD is included in both models and cannot sensibly attain zero values for the time period of interest. Consequently the intercept has no directly relevant meaning.

There is statistical evidence that cumulative production contributes to the mean sojourn time. This observation corresponds to our initial assumptions and particularly serves as a sanity check. The estimated relationship between cumulative production and sojourn times (on the scale of the linear predictor) is shown in Figure 5.1. Assuming all other factors remain equal (although annual production is necessarily related to comulative production), increasing cumulative production implies decreasing sojourn times, and consequently a higher mean number of occurrences in a given period of time. We note that this effect appears to have accellerated somewhat since a cumulative production level of 1850 billion N m

³

, attained in 2009.

The direct relationship between annual production and sojourn time is less

obvious. The relationship is negative for annual production rates lower than 35

billion N m

³

, but then shows a positive relationship for higher rates. That is,

if all other factors remain equal, production rates higher than 35 billion N m

³

lead to higher mean sojourn time in the same year. Notice that shape of the

estimated smooth function for PROD matches the suggested transformation by

the CERES plots from Figure 4.4.

(38)

Model 8 (st) Model 8 (mag)

(Intercept) 2.502

^∗∗∗

0.881

^∗∗∗

(0.533) (0.251)

SD1 0.028 0.607

^∗∗

(0.481) (0.234)

SD2 −0.155 0.473

(0.743) (0.357)

SD3 −0.476 0.600

^∗

(0.581) (0.274)

SD4 −0.254 0.461

(0.508) (0.263)

SD5 −0.130 0.419

(0.545) (0.272)

SD6 −0.393 0.320

(0.684) (0.298)

SD7 −0.158 0.378

(0.568) (0.278)

SD8 0.273 0.677

^∗

(0.562) (0.266)

SD9 −0.569 0.156

(0.596) (0.233)

P2003M −0.277 −0.127

(0.174) (0.067)

P2003H −0.424 −0.028

(0.229) (0.089) EDF: s(X RD,Y RD) 2.000 9.849

^∗∗∗

(2.001) (14.031)

EDF: s(PROD) 2.926 1.000

(3.642) (1.000)

EDF: s(CUMPROD) 2.583

^∗∗∗

2.847 (3.196) (3.557) EDF: s(BORE PROD) 1.000

(1.000)

EDF: s(SUB2025) 1.000

(1.000)

AIC 4567.348 1082.432

BIC 4666.580 1210.242

Log Likelihood -2262.165 -513.520

Deviance 1345.481 173.049

Deviance explained 0.253 0.226

Dispersion 1.857 0.241

R

²

0.171 0.199

GCV score 2261.977 535.568

Num. obs. 745 746

Num. smooth terms 4 4

∗∗∗p < 0.001,^∗∗p < 0.01,^∗p < 0.05

38

(39)

The magnitude model contains a highly significant smooth function of the spatial coordinates X RD and Y RD. The topleft contour plot in Figure 5.2 shows how the expected magnitudes vary across different locations. As we move away from the center of the Groningen Field towards the North or West, we observe that the estimated magnitudes increase. The mean magnitude is the lowest near the center of the field.

Also present are several significant Structural Domains SD1, SD3, SD8 with positive coefficients between 0.6 and 0.7. These values should be interpreted against the reference domain SD0, which is equivalent to ”none of the structural domains”. Finally we note that, based on this model, there is no evidence that subsidence level affects magnitudes, despite the seemingly obvious relationship visualized in Figure 3.9 from the explanatory analysis.

In addition to the models just discussed, we present two hazard maps for

the Groningen Field: one for the average arrival time (in days) and one for the

average magnitude across the field. The maps are shows in Figures 5.3 and

5.4. The arrival times are a linear function of the location coordinates X RD

and Y RD and indicates that the interarrival time is not well described by spatial

coordinates only. This agrees with our preferred model for the sojourn times,

in which the smooth component of location was not significant. Arrival times

are nearly equal across the field, averaging 10 or 11 days between consecutive

earthquakes. The highest magnitudes should be expected near the towns of

Loppersum and Stedum.

(40)

Figure 5.1: Component smooth functions for preferred sojourn time model

Figure 5.2: Component smooth functions for preferred magnitude model

(41)

Figure 5.3: Hazard map with earthquake rates

Figure 5.4: Magnitude hazard map

(42)

Chapter 6 Discussion

The models developed and discussed in this report can be improved and extended in various ways.

First of all, a larger set of explanatory variables could likely lead to more accurate estimations. The collected data had to be extracted from reports, generally not in numerical form. Not only was this manual task time-consuming, more importantly the accuracy and amount of information that can be obtained in this manner is limited. For example, annual and cumulative production rates for the whole Groningen Field are available, but we could not find fine-grained historic production data of individual production boreholes. Possibly relevant local differences across the large gas field requires more specific data than we used. The complexity of geological processes that lead to seismicity require more sophisticated models.

Compaction and stress accumulation, two important causal effects in the theoretical model (Figure 2.1), were approximated using subsidence levels and fault zones (structural domains). This approach is too simplistic and we cannot expect these variables to be accurate representations of very complex geological processes.

We decided to use all earthquake registrations in the Groningen Field, in- cluding many smaller earthquake with magnitudes less than 1.5, which can only be observed in the vicinity of measurement locations. KNMI, operator of the recording network, therefore recommends only earthquakes larger than 1.5 should be used for analysis [17]. Many smaller quakes may introduce noise into the dataset, deteriorating model fit.

The explanatory data analysis revealed that the empirical distribution of magnitudes exhibits positive skew. We modeled the magnitudes using a normal (symmetric) distribution, without correcting for the skewness. Power transfor- mations may improve the model fit, at the expense of interpretability.

The presented models could be extended by introducing random effects. For

instance, earthquake in distinct structural domains could be modeled as different

(43)

’subjects’. Earthquakes within the same structural domain (i.e. cluster) may be more similar than quakes from distinct domains, and share features absent elsewhere. The models developed so far did not include this refinement, and are based on a ’naive pooled’ data approach.

Lastly, predictive performance quality should be evaluated also for explana-

tory models, even if prediction is not the primary goal [23, pp. 18]. This can be

done using new records in the induced earthquake catalog.

(44)

Bibliography

[1] Groninger Bodem Beweging. Kaartenportaal gasbevingen en bodemdaling [2014-06-23], jun 2014.

[2] J. N. Breunese and K. van Thienen-Visser. Effecten van verschillende productiescenario’s op de verdeling van de compactie in het groningen veld in de periode 2014 tot en met 2016. Technical report, TNO, mar 2014.

[3] Nederlandse Aardolie Maatschappij B.V. Bodemdaling door aardgaswin- ning (ep201006302236). Technical report, sep 2010.

[4] Nederlandse Aardolie Maatschappij B.V. Aardbevingen in groningen (deel 1), mar 2013.

[5] Nederlandse Aardolie Maatschappij B.V. Actualisatie seismologische inzichten groningenveld. (ep201301221641), jan 2013.

[6] Nederlandse Aardolie Maatschappij B.V. Gasveld groningen, 2013.

[7] Nederlandse Aardolie Maatschappij B.V. Technical addendum to the win- ningsplan groningen 2013. subsidence, induced earthquakes and seismic hazard analysis in the groningen field. Technical report, nov 2013.

[8] Nederlandse Aardolie Maatschappij B.V. Nam platform feiten en cijfers:

Gaswinning [2014-04-29], apr 2014.

[9] Technische commissie bodembeweging. Interim-advies tcbb mijn- bouwschade en winningsplan groningen 2013, dec 2013.

[10] R. D. Cook and R. Croos-Dabrera. Partial residual plots in generalized linear models. Journal of the American Statistical Association, 93(442):pp.

730–739, 1998.

[11] A. J. Dobson and A. G. Barnett. An Introduction to Generalized Linear Models. Chapman and Hall/CRC, Boca Raton, FL, 3 edition, 2008.

[12] B. Dost and D. Kraaijpoel. The august 16, 2012 earthquake near huizinge

(groningen). Technical report, KNMI, De Bilt, jan 2013.

earthquakes in Groningen as a Poisson process

Modeling the induced

earthquakes in Groningen as a Poisson process

using GLM and GAM

Bachelorthesis Mathematics July 2014

Student: E. Geerdink

Primary Supervisor: E.C. Wit

Contents

1 Introduction 5

2 Methodology 6

2.1 Goal . . . . 6

2.2 Study design and data collection . . . . 7

3 Explanatory Data Analysis 12 3.1 Earthquake registrations in the Netherlands . . . . 12

3.2 Empirical distributions and stationarity . . . . 13

3.3 Production and pressure . . . . 17

3.4 Structural Domains . . . . 18

3.5 Subsidence . . . . 20

3.6 Collinearity . . . . 21

4 Model validation and evaluation 23 4.1 Sojourn time models . . . . 23

4.1.1 Base model and diagnosis . . . . 23

4.1.2 Sojourn time model expansion . . . . 30

4.2 Magnitude models . . . . 31

4.2.1 Base model and diagnosis . . . . 31

4.2.2 Magnitude model expansion . . . . 35

4.3 Model Selection . . . . 36

5 Results and Conclusions 37 6 Discussion 42 A Dataset 46 A.1 Annual and cumulative production in the Groningen Field . . . . 46

A.2 Global pressure levels in the Groningen Field . . . . 47

B R source code 50 B.1 Data collection . . . . 50

B.1.1 Subsidence forecast 2025 pixel data . . . . 50

B.1.2 Subsidence forecast 2050 pixel data . . . . 52

B.1.3 Subsidence forecast 2070 pixel data . . . . 55

B.1.4 Pressure measurement locations 2003 pixel data . . . . . 57

B.1.5 Structural domains pixel data . . . . 58

B.1.6 Generate RD data . . . . 59

B.1.7 Transform RD data . . . . 59

B.1.8 Function: grid generation . . . . 60

B.1.9 Point in Polygon algorithm . . . . 61

B.1.10 Function: test if a point is inside a polygonal area . . . . 62

B.1.11 Generate dataset . . . . 63

B.2 Explanatory Data Analysis . . . . 65

B.2.1 Function: generate mass plots . . . . 65

B.2.2 Function: compute aggregated group statistics . . . . . 66

B.2.3 Function: compute sojourn times for groups . . . . 67

B.2.4 EDA plots . . . . 68

B.3 Model fitting and diagnostics . . . . 79

B.3.1 Function: Marginal model plots with smoothers (wrapper) 79 B.3.2 Function: GLM diagnostics . . . . 80

B.3.3 GLM models . . . . 82

B.3.4 GAM models . . . . 85

B.3.5 Model selection using AIC and BIC . . . . 87

B.3.6 Hazard maps . . . . 88

List of Figures

2.1 Causal diagram for the effects on seismicity . . . . 8

3.1 Number of induced earthquake registrations . . . . 13

3.2 Histograms of sojourn times . . . . 14

3.3 Annual moving quantiles for sojourn times . . . . 14

3.4 Histograms of magnitudes . . . . 15

3.5 Annual moving quantiles for magnitudes . . . . 15

3.6 Historical production rates, pressure level differences and number of earthquakes . . . . 18

3.7 Structural domains, earthquakes and 2003 pressure measurement locations . . . . 19

3.8 Counts per Structural Domain and subsidence region . . . . 20

3.9 Subsidence regions, earthquakes and 2003 pressure measurement locations . . . . 21

3.10 Correlation plot for continuous regressors . . . . 22

4.1 Residual plots for sojourn time base model . . . . 25

4.2 Marginal model plots for sojourn time base model . . . . 27

4.3 Diagnostic plots for sojourn time base model: Cook’s distances, Studentized residuals, Bonferroni p-values and hat-values . . . . 28

4.4 CERES plots for sojourn time base model . . . . 29

4.5 Residual plots for magnitude base model . . . . 31

4.6 Diagnostic plots for magnitude base model: Cook’s distances, Studentized residuals, Bonferroni p-values and hat-values . . . . 33

4.7 CERES plots for magnitude base model . . . . 35

4.8 AIC and BIC for sojourn time and magnitude models . . . . 36

5.1 Component smooth functions for preferred sojourn time model . 40 5.2 Component smooth functions for preferred magnitude model . . 40

5.3 Hazard map with earthquake rates . . . . 41

5.4 Magnitude hazard map . . . . 41

Chapter 1

Introduction

The Groningen Field is one of the world’s largest gas fields, spanning over 900 km

. It was discovered in 1959, and gas extraction commenced in 1963.

The gas reservoirs are located at approx. 3 km depth and store a tremendous volume of 2800.000.000.000 N m

, i.e. 2.8 trillion cubic metres. Nearly 60%

In this report, we develop a stochastic model for the earthquakes in Gronin-