
Forecasting laboratory earthquake’s time to failure from acoustic data making use of a recurrent neural network

H.B.P. van Laatum
10812431

Bachelor thesis
Credits: 18 EC
Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor:
Dr. D. K. Gupta
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 28, 2019


Abstract

Seismologists have attempted to forecast earthquake events for a long time. Only recently has research demonstrated that machine learning techniques can extract laboratory earthquake characteristics from acoustic data. This thesis aims to predict a laboratory earthquake’s time to failure, which is the time left until the next earthquake event, from an acoustic signal. The signal comes from recording laboratory-simulated earthquakes, which produce repeating earthquake cycles. An RNN, constructed of LSTM cells, is used to learn pattern sequences in the earthquake cycles and link such a pattern sequence to a time to failure. The predictions are based on statistical features extracted from segments of the acoustic signal. Although the network learned a pattern sequence in the signal, the model’s R²-score did not exceed 0.56. This, however, corresponds with research predicting similar earthquake data. The lack of features corresponding to varying times to failure is the cause of this model performance. To increase the model’s accuracy, better features must be found.


Contents

1 Introduction
2 Background
2.1 Laboratory earthquake data
2.2 Recurrent Neural Networks
2.3 Related work
3 Method and Approach
3.1 The earthquake data set
3.2 Feature extraction
3.3 Feature analysis
3.4 Feature selection
3.5 LSTM configuration
3.6 Evaluation
4 Results
4.1 Mean absolute error results
4.2 R-squared results
5 Discussion
6 Conclusion
References


1 Introduction

In the field of geophysics, the prediction of earthquakes is a problem seismologists have been dealing with for over 40 years (Scholz et al., 1973). The prediction of earthquakes using seismic data has always been a time- and memory-consuming task, and attempts at doing so have not been successful so far (Geller, 1997). Recently, however, research has demonstrated how the use of machine learning techniques has led to improvement in the prediction of earthquakes (Hulbert et al., 2019; Lubbers et al., 2018; Perol et al., 2018; Rouet-Leduc et al., 2017, 2018). These machine learning techniques produce significantly faster and more accurate results than previous earthquake prediction methods, owing to an increased volume of data, a broader variety of data, improved machine learning models and more computational power (Kong et al., 2018).

When these machine learning methods are applied to seismic waves, earthquake event characteristics can be derived. Seismic waves are waves of energy caused by natural friction in the earth. These waves are recorded and converted to an electrical signal. This signal might contain information about earthquake events such as the magnitude and the time to failure. Time to failure represents the time that is left until the next earthquake event occurs. In the field of earthquake prediction, forecasting earthquake characteristics from seismic waves is a recent development. Before any correlation between these two factors was discovered, geophysicists thought seismic waves consisted of noise (Jordan et al., 2011; Kong et al., 2018; Rouet-Leduc et al., 2017, 2018). Rouet-Leduc et al. (2017) were the first to extract earthquake characteristics from these kinds of waves. This research will attempt to contribute to forecasting earthquake characteristics from seismic data.

This thesis focuses on the application of machine learning techniques to seismic waves to forecast an earthquake’s time to failure. The stated earthquake event is a laboratory-simulated earthquake: an experiment in a laboratory that closely mimics real earthquakes. Section 2.1 will explain laboratory earthquakes in more detail. The data recorded during such an experiment is a form of seismic waves called acoustic data, and this thesis will refer to it as such throughout. Figure 1 shows a slice of the data upon which the predictions in this thesis are based. The figure clearly shows laboratory earthquake cycles represented by the time to failure. At the moment the time to failure becomes zero, a laboratory earthquake occurs, succeeded by a new earthquake cycle. A few characteristics of the recorded data can be noticed. First, the acoustic signal peaks extremely moments before an earthquake event; these peaks seem to be an indicator of an earthquake event. Second, activity in the signal increases as the time to failure approaches zero. Finally, the two visualised earthquake cycles vary in duration: the first cycle lasts less than 12 seconds, versus over 14 seconds for the other. Although the signal appears to hold presages of an earthquake event, it does not emerge as a precursor for the whole time to failure cycle, because of the small differences in signal activity throughout the earthquake cycles.

Figure 1: First 106M data points of the data: the acoustic signal plotted versus the TTF. Two earthquake cycles are clearly visible; the extreme peaks seem to be an indicator of a laboratory earthquake.

Despite the apparent lack of correlation between the signal and the time to failure, Rouet-Leduc et al. (2017, 2018) presented an advanced technique for forecasting earthquakes making use of a machine learning model based solely upon acoustic data. Their method involves taking segments of acoustic data and forecasting the time to failure at the end of such a segment. The results were surprising, because their model was able to predict the time to failure from the acoustic signal with an R²-score of 0.89. Rouet-Leduc et al. (2017) was the first study to successfully extract this kind of information from an acoustic signal. Lubbers et al. (2018) applied a similar machine learning model to a different data set, with a different way of pre-processing the signal, and found an R²-score of 0.612. The difference between these two scores indicates that one model does not always have the ability to predict the time to failure correctly. What might have caused this discrepancy between the two studies will be further discussed in Section 2.3.

The goal of this thesis is to forecast a laboratory earthquake’s time to failure through application of a Recurrent Neural Network (RNN), constructed of long short-term memory cells, to a seismic signal. This kind of RNN, known as an LSTM network, has been proven to recognise patterns in continuous time series data and to use its memory to predict future events in the data (Hochreiter and Schmidhuber, 1997; Malhotra et al., 2015). Due to this ability of the LSTM, and given the fact that the acoustic signal is continuous, the LSTM should be capable of forecasting the time to failure. The network could, for example, recognise the pattern of increasing activity in the acoustic signal and link this to a decreasing time to failure. Therefore, this thesis aims to answer the following question:

“How accurately can a Recurrent Neural Network, constructed of long short-term memory cells, forecast a laboratory earthquake’s time to failure (TTF) from continuous acoustic data?”

As mentioned before, it is expected that the LSTM is able to learn pattern sequences in the acoustic signal which describe varying cycles of the laboratory earthquake’s time to failure. If the LSTM succeeds in learning these patterns in the signal, the network could outperform the state-of-the-art machine learning techniques in forecasting time to failure. This research is interesting because LSTM networks have never been used for forecasting earthquakes based on acoustic data, yet they have been proven to successfully forecast time series data.

Before the stated question can be answered, some theoretical framework concerning laboratory earthquakes, RNNs and LSTMs will be provided. Furthermore, the state-of-the-art methods that forecast an earthquake’s time to failure are discussed. In Section 3, the methodology of the thesis is explained and discussed. This section will explore and process the earthquake data set used, which includes feature creation, analysis and selection, as well as the definition of the model and the statement of evaluation metrics. In Section 4, the results of this thesis will be presented and evaluated by the mentioned evaluation metrics. This research and its results will be discussed in Section 5. Moreover, in this section, a comparison will be made to the earlier mentioned state-of-the-art methods. Finally, the conclusion of this thesis will be drawn and further recommendations will be made.

2 Background

2.1 Laboratory earthquake data

As stated in the introduction, the acoustic data used for this research emanates from simulated laboratory earthquakes. In laboratories, it is possible to simulate repeating earthquakes that closely mimic natural earthquakes. This simulation results in earthquake cycles described by the earthquake’s time to failure, as can be observed in figure 1. The end of such an earthquake cycle can be visualised as a micro-earthquake. This entails small amounts of friction from laboratory shearing layers, which results in earthquakes with minuscule magnitudes that last for microseconds (McLaskey et al., 2012). During these laboratory earthquake experiments, the shearing layers, which cause the friction, emanate an acoustic signal. This acoustic emission is the seismic data from which the time to failure must be predicted (Lubbers et al., 2018; Rouet-Leduc et al., 2017, 2018). The difficulty of the time to failure predictions depends on several aspects of the earthquake data. These aspects will be addressed next.

If the aforementioned earthquake cycles vary in duration from one another, forecasting the time to failure from acoustic data becomes more difficult. Varying earthquake cycles mean that the earthquake data to predict is aperiodic. Such aperiodic cycles complicate the model’s predictions. If earthquake cycles do demonstrate periodicity, finding a general pattern sequence describing all these cycles individually becomes a lot easier, since one pattern sequence probably describes them all. When earthquake data is aperiodic, multiple pattern sequences must be learned that correctly describe each varying earthquake cycle individually. Moreover, forecasting the earthquake’s time to failure becomes more challenging as the time to failure is higher, since the more time is left until the next earthquake event, the less information the acoustic signal contains about the time to failure.

In Section 3.1, a closer description of the data, and of the complexity of forecasting the time to failure from the used data, will be provided.

2.2 Recurrent Neural Networks

Recurrent neural networks (RNNs) differ from classical feedforward neural networks in that they are capable of processing sequences of inputs, by making use of loops in their network layers. Each RNN layer consists of a number of hidden cells, and the output of such a cell is a combination of the data input and the output of the previous hidden cell. This architecture ensures that the output of the previous hidden cell influences the output of the next cell, which means RNNs are capable of integrating previously learned information into the task of learning and predicting new data (Rumelhart et al., 1988). Therefore, RNNs are a good fit for the problem of learning patterns in continuous data, such as the acoustic signal, and linking these learned patterns to a time to failure. Although RNNs are capable of integrating recent information, they are not capable of learning patterns across long time lags, because RNNs tend to forget information learned further in the past. Furthermore, RNNs make use of backpropagation through time (BPTT), which can lead to the vanishing gradient problem. BPTT entails propagating the error from the output layer back to the input layer. As the network gets bigger and more complex, this long path of backpropagating the error can cause an insufficient weight update based on the error, which leads to the problem of a vanishing gradient. These two disadvantages make RNNs incapable of learning long-term dependencies (Bengio et al., 1994; Hochreiter, 1998).

LSTM networks provide a solution to these error backpropagation problems, as LSTMs have the ability to learn long-time-interval dependencies without losing short-term dependencies (Hochreiter and Schmidhuber, 1997). This feature results from the construction of an LSTM layer and its cells. Each LSTM cell has a specific cell state. This cell state can be envisioned as the information that a cell contains, and it is inherited from the previous cell. In each LSTM cell, information can be added to or removed from the current cell state. Firstly, the forget gate f_t calculates which information can be discarded, based on the output of the previous cell, h_{t−1}, and the current time sequence input, x_t. Secondly, the input gate i_t decides what information to add to the cell state. Thirdly, the new cell state C_t is computed using the previous cell state and the forget and input gates. Finally, the output of the cell, h_t, is calculated based on the new cell state (Graves, 2013; Hochreiter and Schmidhuber, 1997).

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)   (1)

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)   (2)

C_t = f_t · C_{t−1} + i_t · tanh(W_C · [h_{t−1}, x_t] + b_C)   (3)

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)   (4)

h_t = o_t · tanh(C_t)   (5)
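To make equations 1 to 5 concrete, the following is a minimal numpy sketch of a single LSTM cell step. It is not the thesis’ actual implementation (which presumably relied on a deep learning framework); the function name, weight layout and shapes are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM cell update implementing equations (1)-(5).

    W holds the four weight matrices (forget, input, candidate, output),
    each of shape (hidden_dim, hidden_dim + input_dim); b holds the biases.
    """
    z = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])                       # forget gate, eq. (1)
    i_t = sigmoid(W["i"] @ z + b["i"])                       # input gate, eq. (2)
    C_t = f_t * C_prev + i_t * np.tanh(W["C"] @ z + b["C"])  # cell state, eq. (3)
    o_t = sigmoid(W["o"] @ z + b["o"])                       # output gate, eq. (4)
    h_t = o_t * np.tanh(C_t)                                 # cell output, eq. (5)
    return h_t, C_t
```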

Due to the above-stated construction of LSTMs, using gates to regulate their information flow, LSTMs have gained the ability to learn long-term dependencies. Consequently, LSTMs seem fit for the task of predicting the time to failure from a continuous acoustic signal. As mentioned earlier, by learning pattern sequences over a long time, an LSTM network could learn which pattern sequence corresponds to which part of an earthquake cycle and, based on this knowledge, forecast the time to failure. Therefore, this thesis has chosen to implement an LSTM network to predict a laboratory earthquake’s time to failure from acoustic data.

2.3 Related work

Not much research has been done into predicting an earthquake’s time to failure from acoustic/seismic data. Rouet-Leduc et al. (2017) performed a pioneering study in the field of earthquake prediction when they applied a machine learning technique to acoustic data. The goal of this research was to forecast the time left until the next laboratory earthquake event, with only the acoustic data as input (Rouet-Leduc et al., 2017). To achieve this, the researchers created segments of acoustic data and attempted to predict the time to failure at the end of each segment. From each segment, multiple statistical features of the continuous acoustic signal were computed, such as the variance, kurtosis, standard deviation and autocorrelation. A random forest was used to predict the related time to failure. The random forest averaged the predictions of multiple regression decision trees, with these predictions based on the computed features, which resulted in a remarkably high R²-score of 0.89.

Lubbers et al. (2018) used a different model to predict a laboratory earthquake’s time to failure. Instead of computing statistical features directly from the acoustic signal, an event catalog was generated from the acoustic data. The statistical features were then extracted from the generated event catalog. Even though the setup was the same as Rouet-Leduc et al. (2017), this study only managed to obtain an R²-score of 0.612 for the time to failure predictions.

An explanation for these differences in accuracy can be found in the differences between the earthquake data both studies used (see Section 2.1). The earthquake cycles Rouet-Leduc et al. (2017) aimed to predict all start with a time to failure no greater than 12 seconds, while some of the cycles Lubbers et al. (2018) attempted to predict start at 16 seconds. Obviously, the more time left until an earthquake event, the less information the acoustic signal emits about the time to failure, which in turn makes the earthquake’s time to failure more difficult to predict. Furthermore, the difference in periodicity of the earthquake cycles played a role in the studies’ accuracy, since the earthquake cycles of Rouet-Leduc et al. (2017) all demonstrated a lot of similarity, whereas the data Lubbers et al. (2018) used showed much less periodicity. As mentioned before, aperiodic cycles complicate forecasting an earthquake’s time to failure from acoustic data.

3 Method and Approach

In this section, the model’s design is described and discussed. As mentioned in Section 2.2, because of its ability to learn sequence patterns in continuous data, an RNN with LSTM structure was chosen to predict the time to failure from continuous acoustic data. First, the data set used is described, as well as the manner in which the data is preprocessed before the time to failure can be forecast from it. Second, the statistical features extracted from the acoustic signal are presented and discussed. Third, a more profound analysis of the features and their relation to the time to failure is provided. Fourth, the feature selection method is discussed and substantiated. Fifth, the configuration of the LSTM network is presented. Finally, the model’s evaluation metrics, the mean absolute error (MAE) and the R²-score, are discussed. A simple support-vector regression model is also introduced, which is used to produce a baseline result. This thesis’ results will be compared to these baseline results.

3.1 The earthquake data set

The earthquake data set used in this research comes from a Kaggle competition organised by the Los Alamos National Laboratory. For the purpose of this competition, the Los Alamos National Laboratory distributed an earthquake data set (LANL, 2018). This set contains 630 million acoustic data points, in which each acoustic value corresponds to a time to failure. Figure 2 shows a plot of the first ten million data points. When the time to failure hits its minimum, an earthquake has occurred, and immediately afterwards it jumps on to the next earthquake cycle. As figure 2 exhibits, there is no direct correlation perceptible between the acoustic signal and the time to failure. However, just moments before the earthquake event there does seem to be more activity in the acoustic signal, as well as a huge spike milliseconds before the laboratory earthquake.

Figure 2: Graph of the first 10 million recorded acoustic data points and their corresponding time to failure.

The acoustic data is recorded at a 4 MHz frequency, and the measuring unit is voltage. The goal is to design a model that is capable of predicting the time to failure based on an input of 150K continuous acoustic data points. The time to failure at the end of each segment of 150K data points is the value to predict. Due to the sample frequency of 4 MHz, the input of the model corresponds to a 0.0375-second time window of the acoustic signal. If the time to failure of each segment is plotted, clearly varying earthquake cycles are visualised; see figure 3.


The figure also demonstrates the lack of periodicity in the cycles, as one cycle has a duration of 16 seconds while the next takes no longer than 8 seconds. Due to the fact that the acoustic signal must contain patterns describing both the large cycles and the smaller ones, aperiodic earthquake data complicates forecasting the earthquake’s time to failure.

Figure 3: A plot of the TTF of the laboratory earthquake cycles.

Dividing the acoustic signal into segments of 150K data points results in 4194 segments, each with a corresponding TTF to predict. To forecast these times, several statistical features are extracted from each segment of the acoustic signal; which features these are, and why they were chosen, is discussed next. A sketch of the segmentation itself is given below.
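The following sketch illustrates the segmentation, under the assumption that the training data is the Kaggle file with columns `acoustic_data` and `time_to_failure`; the variable names and loading details are my own.

```python
import numpy as np
import pandas as pd

SEG_LEN = 150_000  # 150K points = 0.0375 s at 4 MHz

# Kaggle training file: one acoustic value and one TTF per row.
train = pd.read_csv("train.csv",
                    dtype={"acoustic_data": np.int16,
                           "time_to_failure": np.float64})

n_segments = len(train) // SEG_LEN          # 630M / 150K = 4194
signal = train["acoustic_data"].values
ttf = train["time_to_failure"].values

# Each segment is one model input; its target is the TTF at its last point.
segments = signal[:n_segments * SEG_LEN].reshape(n_segments, SEG_LEN)
targets = ttf[SEG_LEN - 1 : n_segments * SEG_LEN : SEG_LEN]
```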

3.2 Feature extraction

In the field of signal processing, there are hundreds of possible features that could be extracted from a signal like the presented acoustic signal. In this section, the features extracted for this thesis are presented and discussed. During the process of feature extraction, the following criteria are taken into account: does the feature correlate with the response variable, does the feature describe most times to failure correctly, and does the feature describe the earthquake cycles individually? Because standard features such as the mean, standard deviation or variance did not satisfy any of the criteria sufficiently, other feature creation techniques were required.

First, the autocorrelation of each segment was computed. The autocorrelation of a signal is the correlation of the signal with itself over a set time lag: the higher the autocorrelation, the better a precursor the signal is for its own future behaviour. Equation 6 shows the formula to compute the autocorrelation. Here, l is the time lag over which the signal’s correlation is computed, µ is the mean of the signal, n is the length of the signal and σ² is the signal’s variance.

(1 / ((n − l) σ²)) · Σ_{t=1}^{n−l} (X_t − µ)(X_{t+l} − µ)   (6)

The plot in figure 4 exhibits a slight correlation between the autocorrelation of a segment and the time to failure. Because only the autocorrelations with a time lag of 1 and 10 demonstrated this slight correlation, only these two autocorrelation features were computed. However, there do not seem to be any differences between the autocorrelation values describing high times to failure and those describing low times to failure: all earthquake cycles appear to be characterised by the same autocorrelation values.
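A minimal numpy sketch of equation 6 follows; the function name and the vectorised form are my own.

```python
import numpy as np

def autocorrelation(x, lag):
    """Lag-l autocorrelation of a 1-D signal, following equation (6)."""
    n = len(x)
    mu = x.mean()
    var = x.var()
    # Sum of products of mean-centred values one lag apart, normalised.
    return np.sum((x[: n - lag] - mu) * (x[lag:] - mu)) / ((n - lag) * var)

# Per-segment features for the two lags used in the thesis, e.g.:
# ac1 = np.array([autocorrelation(seg, 1) for seg in segments])
# ac10 = np.array([autocorrelation(seg, 10) for seg in segments])
```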

Figure 4: Autocorrelation value per segment, plotted versus the time to failure corresponding to that segment.

In the fields of signal processing and pattern recognition in continuous time series data, the standard deviation can be a powerful tool for detecting patterns in signals (Seyrafi, 2009). An even more powerful tool is the moving standard deviation of a signal, because it demonstrates how the standard deviation changes over time in a continuous signal. One would assume that as an earthquake cycle approaches a time to failure of zero, the standard deviation of a segment would increase. Equation 7 shows the standard deviation of a given signal X.

std(X) = √( Σ_{i=1}^{N} (X_i − µ)² / N )   (7)

moving_std(X_j, w) = √( Σ_{i=j−w}^{j} (X_i − µ)² / w )   (8)


Figure 5: The moving standard deviation feature, with a window size of 50, from which the 20th percentile is taken, plotted against the time to failure.

In equation 8, the standard deviation at a given point j, for a specified time window w, is computed. Equation 8 can be used to compute the moving-window standard deviation for each point in each segment. After computing the moving standard deviation, the p-th percentile is taken for each segment to extract a single value from the moving standard deviation feature. This process is repeated for several time windows and several percentiles, which results in 24 features of varying windows and percentiles. Figure 5 shows a plot of the moving standard deviation feature. This plot visualises a much greater correlation between the response variable and the created feature in comparison to the autocorrelation feature.
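A sketch of this percentile-of-rolling-std feature using pandas; the window size and percentile shown are just one of the 24 combinations, and the function name is my own. Note that pandas uses the per-window mean rather than a global µ, a common variant of equation 8.

```python
import numpy as np
import pandas as pd

def moving_std_percentile(segment, window, pct):
    """p-th percentile of the rolling standard deviation of one segment,
    in the spirit of equation (8)."""
    rolled = pd.Series(segment).rolling(window).std().dropna()
    return np.percentile(rolled, pct)

# The feature of figure 5: window 50, 20th percentile, per segment.
# feat = np.array([moving_std_percentile(seg, 50, 20) for seg in segments])
```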

Another feature created for forecasting the time to failure is the short-term average/long-term average ratio (STA/LTA-ratio). This ratio feature, introduced by Allen (1978), is still widely used in the field of earthquake detection (Chen, 2017). Therefore, as it is assumed to be a precursor of an earthquake event, it might also be a precursor for the TTF in earthquake data. The STA/LTA-ratio is constructed from the average over a specified short-term window divided by the average over a specified longer window. Equations 9 and 10 demonstrate the construction of the STA and the LTA of a single data point; equation 11 combines these two into the STA/LTA-ratio. This technique is applied to each of the 4194 segments, which results in new segments holding an STA/LTA-ratio per point; for use of this feature in the LSTM, the mean is extracted from the newly created segments. As it appears, the window sizes of the STA and the LTA matter a lot; therefore, this research has chosen to combine several STA- and LTA-window sizes, which resulted in six STA/LTA-ratio features. Figure 6 shows a clear negative correlation between the response variable and the created feature. However, the feature appears to contain quite some noise; how this is dealt with is discussed later.

STA(X_j) = (1 / N_STA) Σ_{i=j−N_STA}^{j} X_i   (9)

LTA(X_j) = (1 / N_LTA) Σ_{i=j−N_LTA}^{j} X_i   (10)

STA/LTA-ratio = STA / LTA   (11)
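A minimal sketch of equations 9 to 11 over one segment, where the per-segment feature is the mean of the resulting ratio series; the rolling-mean formulation and the use of absolute amplitudes are my own assumptions.

```python
import numpy as np
import pandas as pd

def sta_lta_mean(segment, sta_window, lta_window):
    """Mean STA/LTA-ratio of a segment, following equations (9)-(11).

    Absolute values are taken so positive and negative swings both count
    as activity (an assumption; the thesis does not spell this out).
    """
    x = pd.Series(np.abs(segment), dtype=float)
    sta = x.rolling(sta_window).mean()   # eq. (9)
    lta = x.rolling(lta_window).mean()   # eq. (10)
    ratio = (sta / lta).dropna()         # eq. (11)
    return ratio.mean()

# The feature of figure 6: STA window 20, LTA window 300.
# feat = np.array([sta_lta_mean(seg, 20, 300) for seg in segments])
```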

Figure 6: The mean of each STA/LTA-ratio segment plotted against the time to failure. In this plot, a short-term window of 20 and a long-term window of 300 were used.

Finally, if the acoustic signal is explored in more detail, figure 2 shows how more activity can be visualised moments before the time to failure hits zero. This difference in activity can be captured in a feature that counts the number of peaks in a segment. In figure 7, a graph of this feature is presented. Again, a correlation between the feature and the time to failure is visualised.
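A sketch of the peak count, read here as the number of samples whose amplitude exceeds a threshold, matching the “greater than 10” wording of the figure captions; a stricter local-maximum definition would also be defensible.

```python
import numpy as np

def count_peaks(segment, threshold=10):
    """Number of samples whose absolute amplitude exceeds the threshold."""
    return int(np.sum(np.abs(segment) > threshold))

# The feature of figure 7:
# n_peaks = np.array([count_peaks(seg, 10) for seg in segments])
```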


Figure 7: Number of peaks greater than 10 per segment, plotted against the time to failure.

As mentioned before, the segments of acoustic data were recorded at a very high frequency. The problem with this kind of high-frequency signal is that a lot of noise is recorded. During the recording of a signal, background noise, which is everything recorded from an unidentifiable source, is superimposed on the real acoustic signal (Curtis et al., 2006). Therefore, all the described feature extraction techniques have also been applied to a denoised signal. The purpose of denoising the acoustic signal is to filter out the part of the signal whose source is uncertain, so that as much of the signal’s information relevant to the variable to forecast as possible is kept. Signal denoising has proven to be a powerful tool in other signal processing fields, such as EEG signal classification and image processing (Vantuch, 2018). Figure 8 displays the first acoustic signal segment with its denoised component next to it. This plot demonstrates that most low-amplitude values have been filtered out and mainly the peaks remain, which could result in a better correlation with the response variable. Figure 9 is an example of the number-of-peaks feature extracted from the denoised signal.
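The thesis does not state which denoising technique was used. The sketch below shows one common option, wavelet soft-thresholding with PyWavelets, whose effect (low amplitudes suppressed, peaks kept) matches the description of figure 8; the wavelet choice and threshold rule are assumptions.

```python
import numpy as np
import pywt

def denoise(segment, wavelet="db4"):
    """Wavelet soft-threshold denoising: suppresses low-amplitude content
    while keeping the large peaks, as in figure 8 (assumed method)."""
    coeffs = pywt.wavedec(segment, wavelet)
    # Universal threshold estimated from the finest detail coefficients.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(segment)))
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    # Reconstruction can be one sample longer; trim to the input length.
    return pywt.waverec(coeffs, wavelet)[: len(segment)]
```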

In figures 4, 5, 6, 7 and 9, the features demonstrate correlations with the time to failure variable. Despite the spotted correlation, all the presented features also contain a lot of fluctuation around their trend, which could complicate predictions based on such features. Therefore, it was chosen to extract each regularly extracted feature’s trend. This trend might show a better correlation between a feature and the earthquake’s time to failure, which could result in more accurate predictions. The most common manner to extract a continuous feature’s trend is the moving average smoothing technique (Box et al., 2015). Although a basic statistical method, see equation 12, it is of great efficiency. Figure 10 demonstrates the transformation of the number-of-peaks feature after applying the moving average technique.


Figure 8: The left graph presents a plot of the previously shown first segment of the acoustic signal; on the right, the first segment’s denoised component is plotted.

Figure 9: Number of peaks greater than 10 per segment of the denoised signal.

The smoothed feature shows a much clearer trend than the regularly extracted features, and with this clearer trend the correlation between the features and the response variable also increases; see table 1. The next section gives a deeper analysis of the features’ characteristics and their relations to the laboratory earthquake’s time to failure.

MA(X_j) = (1 / w) Σ_{i=j−w}^{j} X_i   (12)
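Equation 12 is a plain trailing moving average over a per-segment feature series; a one-line pandas sketch follows, where the window size is a free parameter the thesis does not specify.

```python
import pandas as pd

def moving_average(feature_series, window):
    """Trailing moving average of a per-segment feature, equation (12)."""
    return pd.Series(feature_series).rolling(window, min_periods=1).mean()
```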


(a) Number of peaks feature extracted from the acoustic signal. (b) Number of peaks feature after moving average smoothing.

Figure 10: Two plots demonstrating the feature’s decrease in fluctuations after applying the moving average to it; a much clearer trend is visualised by the feature.

Feature                        | Correlation TTF | MA Correlation TTF
Autocorrelation time lag: 1    | 0.475           | 0.605
Moving std w:50 p:20           | 0.629           | 0.736
STA/LTA-ratio STA:20 LTA:300   | 0.600           | 0.715
Number of peaks > 10           | 0.638           | 0.732
Number of peaks > 10 (DS)      | 0.669           | 0.723

Table 1: Increase in correlation between five features and the time to failure (TTF) when applying the moving average of equation 12 to the features. (DS) means the feature is extracted from the denoised signal.

3.3 Feature analysis

As mentioned at the beginning of the previous section, three feature characteristics were taken into account during feature creation: correlation with the response variable, correspondence to all specific times to failure, and the ability to describe earthquake cycles individually. The first one is covered by the features, as demonstrated in the previous plots and tables. The second characteristic will be discussed next.


Figure 11a shows the distribution of a regularly extracted feature (the moving std) versus the time to failure’s distribution, and figure 11b shows that same feature’s distribution, after applying the moving average, versus the time to failure’s distribution. Two feature characteristics become clear from this plot. The most striking characteristic is the broader distribution after applying the moving average to the feature, indicating a better correlation between the feature’s values and specific times to failure. Second, the higher the time to failure gets, the fewer feature values describe those times to failure. A value of 2.5 for this feature seems to correspond to a time to failure of about 9 seconds, but it is also the value correlating with times to failure greater than 9 seconds. Consequently, the higher the time to failure gets, the worse a precursor the features become. Hence, the feature characteristic of corresponding to most times to failure is only partly covered by the features.

(a) Distribution of the feature versus distribution of the TTF. (b) Distribution of the feature after moving average versus distribution of the TTF.

Figure 11: Joint distribution of the moving standard deviation feature versus the time to failure. On the left the regularly extracted feature, and on the right the same feature after applying the moving average.

The third feature characteristic, the ability to describe earthquake cycles individually, will be discussed next. Like the previous characteristic, this one is only partly encompassed, since the features demonstrate great similarity in their distributions when describing unequal earthquake cycles. If the joint distribution of a feature for a specific earthquake cycle is compared with the joint distribution of that same feature for a completely different earthquake cycle, great similarity can be found. Figure 12 demonstrates this similarity in a feature’s distribution for two differing earthquake cycles. Having this similarity across different time to failure cycles again demonstrates that similar feature values correspond to different times to failure. This limitation in the features could cause inaccurate predictions for aperiodic earthquake data. Nevertheless, the features do demonstrate a clear linear correlation with the response variable; therefore an LSTM network, based on these features, should be able to forecast the earthquake’s time to failure.

Figure 12: Distribution plots of the moving standard deviation feature, after applying the moving average to it, in comparison to the distribution of the time to failure. Two specific earthquake cycles are presented: on the left the biggest earthquake cycle, with a duration of 16 seconds; the right one is the following earthquake cycle.

After applying the described feature extraction techniques, their parameter variations, and their denoised components, each segment now consists of 66 features. As figures 4, 5, 6 and 7 demonstrate, there is a correlation between the features and the response variable. Although one would presume that more features correlating with the response variable indicate more accurate predictions, research has demonstrated this is not the case. For example, when two features are enough of a predictor for a response variable y, then no more than these two features should be used (Hawkins, 2004). When a model is too complex, for example when it is based on too many variables, the possibility of overfitting exists. To counteract this possibility, statistical feature selection was applied to the 66 created features.

3.4 Feature selection

Feature selection is of huge importance in building a predictive model. It can be compared to the previously mentioned signal denoising: while all the features could provide valuable information, some might hold none. Feature selection is a process in which the number of features is brought down while trying to keep as much information relevant to the response variable as possible. In this section, a description will be given of the feature selection algorithm this thesis used.

As mentioned before, the addition of a feature does not imply a more accurate prediction (Hawkins, 2004). Features can be highly related to one another, meaning the features are highly correlated with each other. If this is the case for multiple explanatory variables in a linear regression problem, this relationship is called multicollinearity. Multicollinearity can result in less precise predictions than if the features were uncorrelated with one another (Farrar and Glauber, 1967). Furthermore, the usage of too many features could result in a complex model that leads to overfitting. The phenomenon of overfitting occurs when a model is trained too closely on a set of data/features and its parameters, and is therefore not capable of accurately predicting unseen data.

For these reasons, the features that correlate with one another were discarded first. Of each pair of features with a mutual correlation over 0.9, one is discarded: the feature with the highest correlation with the time to failure is kept. This results in nine features with reduced multicollinearity. Figure 13 presents a heatmap of the nine remaining features, their correlation with one another and their correlation with the time to failure. This selection has been made based upon the correlation between the features. However, a feature’s impact on the time to failure is at least of equal importance: if a feature does not have any relation to or influence on the prediction, it should be discarded, because it could influence the prediction accuracy in a bad way.

To discard features based on their influence on the time to failure variable, hypothesis testing for regression models is used. First, a least squares regression (LSR) model is built using all nine features to capture the linear relation to the time to failure; see equation 13. This model estimates the coefficient β_i of each feature x_i. Second, the following null hypothesis is assumed: ‘The selected set of features does not affect the time to failure variable to predict.’ Third, for each estimated coefficient β_i in the LSR model, the p-value is calculated. The p-value of a coefficient β_i corresponds to the probability of that coefficient being zero; a coefficient β_i with value zero means the corresponding feature x_i has no influence on the response variable. Therefore, all features with a p-value above 0.05 are discarded. Consequently, four features remain after this statistical process of feature selection. Figure 14 presents the four features after feature selection, with their correlation with one another as well as with the response variable (TTF). In the next section, the LSTM configuration, such as the number of hidden cells and the layer sizes, will be discussed.

TTF = β_0 + Σ_{i=1}^{N} β_i x_i   (13)
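A sketch of the two-stage selection (multicollinearity filter, then OLS p-values) with pandas and statsmodels; the thresholds follow the text, while the ordering heuristic and function name are assumed implementation details.

```python
import pandas as pd
import statsmodels.api as sm

def select_features(X, y, corr_thresh=0.9, p_thresh=0.05):
    """Two-stage feature selection: multicollinearity filter, then
    discarding coefficients whose p-values exceed p_thresh (eq. 13)."""
    # Stage 1: walk features in order of |correlation with TTF|; keep one
    # of every pair whose mutual |correlation| exceeds corr_thresh.
    order = X.corrwith(y).abs().sort_values(ascending=False).index
    kept = []
    for col in order:
        if all(abs(X[col].corr(X[k])) <= corr_thresh for k in kept):
            kept.append(col)

    # Stage 2: least squares regression on the remaining features;
    # drop every feature whose coefficient's p-value exceeds p_thresh.
    ols = sm.OLS(y, sm.add_constant(X[kept])).fit()
    final = [c for c in kept if ols.pvalues[c] <= p_thresh]
    return X[final]
```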

Figure 13: Correlation matrix of the nine features after selection based on their multicollinearity. Their correlation with each other is displayed, as well as their correlation with the time to failure (TTF). (DS) means the feature is extracted from the denoised signal.

3.5 LSTM configuration

Since this thesis attempts to forecast the time to failure from continuous acoustic time series data, the built LSTM needs to take previously seen data into account in order to predict unseen data. Due to this structure, an LSTM has the capability to learn pattern sequences in the acoustic data (Section 2.2). In figure 15 the structure of the LSTM is shown. X_t is the input of the current features; X_{t−1} and X_{t+1} are the previous and the next inputs. The corresponding h_i symbols are the outputs of the LSTM cells, which are also passed on to the next cell. The parameters are defined as follows:

• LSTM sequence length: 350
• Batch size: 32
• Epochs: 200


Figure 14: Correlation matrix of the features after feature selection; displayed is their correlation with each other as well as with the time to failure (TTF). (DS) means the feature is extracted from the denoised signal.

The LSTM sequence length parameter is set to 350 because this is approximately the length of one earthquake cycle (350 segments of 0.0375 s correspond to roughly 13 seconds). For this reason, the LSTM might learn from previously seen data what an unseen data point’s position in the earthquake cycle is, relate this to a time to failure, and thereby give more accurate predictions. Furthermore, this research has chosen 200 epochs because at that point the model starts overfitting the training data. The two other presented parameters had no significant influence on the model’s performance.
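The thesis does not list the full network definition. A minimal Keras sketch consistent with the stated parameters (sequence length 350, batch size 32, 200 epochs, MAE loss) might look as follows; the hidden size and the single-layer choice are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN, N_FEATURES = 350, 4   # ~one earthquake cycle, 4 selected features

model = keras.Sequential([
    # One LSTM layer reading 350 consecutive feature vectors;
    # the hidden size (64) is an assumption, not stated in the thesis.
    layers.LSTM(64, input_shape=(SEQ_LEN, N_FEATURES)),
    layers.Dense(1),               # regression output: the TTF
])
model.compile(optimizer="adam", loss="mae")

# X: (n_windows, 350, 4) sliding windows of feature sequences,
# y: (n_windows,) TTF at the end of each window.
# model.fit(X, y, batch_size=32, epochs=200, validation_data=(X_val, y_val))
```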

3.6 Evaluation

For the evaluation of the model, its average performance over a 5-fold cross-validation is taken. Due to the small number of input segments, 4194, the choice was made not to split off a separate test set; the testing is done on the five validation sets. Two evaluation metrics are used for testing the model’s performance: the mean absolute error (MAE) and the coefficient of determination, the R²-score.


Figure 15: Three LSTM cells in an LSTM sequence layer and their inputs and outputs.

Before the model’s performance is analysed, its results are compared to a baseline model. The baseline results are produced by a simple support-vector regression model (Smola and Schölkopf, 2004). This model makes use of four standard statistical features: the mean, standard deviation, minimum and maximum value. If a model’s performance is worse than the baseline results, the results are considered useless and discarded.

During the model’s simulations, the model is evaluated through the mean absolute error (MAE); see equation 14. The MAE gives a first impression of the model’s performance: the lower its error/loss, the more accurate its performance. During training, the model validates its performance on one of the 5-fold cross-validation sets. The cross-validation score is an indicator of which model performs best on every part of the data set. Furthermore, through the use of validation sets, a model’s tipping point can be detected: the moment from which a model starts to overfit.

MAE = (1 / N) Σ_{i=1}^{N} |y_i − ŷ_i|   (14)

The second evaluation metric used to evaluate the model’s predictions is the R²-score; see equations 15, 16 and 17, where ȳ is the population mean, y_i is the observed value and ŷ_i is the predicted value. The R²-value gives an interpretation of the linear correlation between the observed time to failure and the predicted time to failure. The R²-score ranges from 0 to 1, with one indicating a perfect fit. Although the R²-score is not the most appropriate metric to evaluate the goodness-of-fit between y_i and ŷ_i (Legates and McCabe, 1999), it can be used to compare the model’s predictions with the ones from Lubbers et al. (2018) and Rouet-Leduc et al. (2017), as they used this metric during their research.

SS_tot = Σ_{i=1}^{N} (y_i − ȳ)²   (15)

SS_res = Σ_{i=1}^{N} (y_i − ŷ_i)²   (16)

R² = 1 − SS_res / SS_tot   (17)
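A sketch of the 5-fold evaluation loop using scikit-learn’s metrics, which implement equations 14 to 17; `build_model` stands in for the Keras definition above, the window-building step is omitted, and keeping the folds contiguous (no shuffling) is an assumption consistent with the per-split cycle descriptions in Section 4.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error, r2_score

def evaluate_5fold(X, y, build_model):
    """Average MAE and R^2 over a 5-fold split, as in Section 3.6."""
    maes, r2s = [], []
    for train_idx, val_idx in KFold(n_splits=5).split(X):
        model = build_model()
        model.fit(X[train_idx], y[train_idx],
                  batch_size=32, epochs=200, verbose=0)
        y_pred = model.predict(X[val_idx]).ravel()
        maes.append(mean_absolute_error(y[val_idx], y_pred))
        r2s.append(r2_score(y[val_idx], y_pred))
    return np.mean(maes), np.mean(r2s)
```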

4 Results

In this section, the results of the various models will be evaluated through the MAE and the R²-scores. First, the obtained results will be compared to the baseline model results. This baseline model is a simple support-vector regression model, which was briefly described in the previous section. Anything below the baseline result is considered useless and discarded. Second, a comparison based on the MAE will be made between the LSTM networks making use of the regularly extracted features and those using the features to which the moving average (equation 12) is applied. Furthermore, the differences between the models using all 66 features, the one using nine features (figure 13) after the first selection step and the model using the four features (figure 14) after the whole selection process are presented. Third, the R²-scores are presented and compared for the six models. Moreover, the predictions from the best-performing model will be compared to the real time to failure. Finally, the best model’s R²-score per K-th split of the 5-fold validation splits will be shown, and their differences in accuracy will come forward.

4.1 Mean absolute error results

In table 2, the MAE scores for the six simulated models are introduced. The baseline model achieves an MAE score of 2.314. Since all the found results outperform the baseline result, they can be considered accurate predictions by the proposed models. What is most striking about these results is the increase in accuracy when the moving average is applied to the extracted features. The moving average seems to decrease the error of the model by approximately 0.2 seconds, an improvement of about 10%. Figure 16 visualises the average 5-fold validation error of the six models during training. The figure clarifies what the results in table 2 already implied: throughout the training of the models, using the moving average features led to more accurate predictions of the earthquake’s time to failure. Another characteristic that can be observed is that the models using all 66 features tend to overfit a lot faster than the models using fewer features; see figure 16.

            | MAE   | MAE (MA)
66 features | 2.030 | 1.879
9 features  | 2.074 | 1.852
4 features  | 2.072 | 1.851

Table 2: Minimum average 5-fold MAE validation loss of the six models. The left column displays the regular features’ loss, the right one the features’ loss after applying the MA to them.

Figure 16: Validation losses of each model during training. Each graph displays two models using N features: one using the regular features, the other after applying the MA.

4.2 R-squared results

In table 3, the R²-scores of the six models are introduced. Comparing these scores to the 0.363 R²-score produced by the baseline model indicates that all models perform more accurately than the baseline model. This observation indicates that the models can act as predictors of the earthquake’s time to failure. Two characteristics stand out the most in these results. First, the increase in accuracy when applying the moving average to the features: as with the MAE results, an increase in accuracy is noticed after applying moving average smoothing to the features. A second characteristic is the improvement of the model’s performance when the number of features is brought down. Especially from 66 to 9 features, the R²-values increase significantly. On the other hand, the differences between the use of nine and four features are negligible.

In figure 17, the time to failure predictions of the model using four moving average features are visualised and plotted against the observed time to failure.


            | R²    | R² (MA)
66 features | 0.386 | 0.468
9 features  | 0.424 | 0.560
4 features  | 0.425 | 0.557

Table 3: R²-scores per model; the left column shows the regular scores and the right one the scores after applying the MA.

Although the model is trained on aperiodic earthquake cycles, the LSTM is not able to predict these aperiodic cycles. Times to failure greater than 10 seconds are not predicted at all, while as the time to failure approaches zero, the model’s forecasting improves.

Figure 17: Plot of the TTF predictions made by the four-MA-feature model versus the real TTF.

In table 4, the results per K-th split of the 5-fold validation splits are shown. The results are from the models that used the moving average features. A large variation in accuracy between the K-splits can be noticed: the validation set of split K=4 gives the best score, while split K=3 gives the worst.

Figures 18 and 19 demonstrate the predictions on the validation sets of the two mentioned splits. Due to the model’s incapability of predicting large times to failure, the first cycle in the validation set of split K=3 is predicted very poorly. Furthermore, somewhere halfway through the cycle, the model forecasts that the time to failure is going to hit zero, while this is not the case. Therefore, the R² accuracy in this split does not reach a high value.


            | R² K=1 | R² K=2 | R² K=3 | R² K=4 | R² K=5
66 features | 0.461  | 0.496  | 0.334  | 0.606  | 0.461
9 features  | 0.582  | 0.564  | 0.398  | 0.792  | 0.524
4 features  | 0.586  | 0.539  | 0.378  | 0.811  | 0.541

Table 4: The rows present the number of features used by the model, the columns their matching R²-score per K-th split. The MA was applied to all features.

The opposite holds for the validation set of split K=4, in which the earthquake cycles are much more periodic and therefore easier to predict. Here, an R²-score of 0.811 is found, which is over twice as high as that of the previous validation split.

Figure 18: Plot of the TTF predictions on the K=3 split validation set, made by the four-MA-feature model, versus the real TTF.

Figure 19: Plot of the TTF predictions on the K=4 split validation set, made by the four-MA-feature model, versus the real TTF.


5 Discussion

The results from Section 4 indicate that an RNN constructed as an LSTM network, making use solely of continuous acoustic/seismic time series data, is, to a certain extent, able to forecast a laboratory earthquake’s time to failure. Besides the LSTM network’s ability to forecast the time to failure, a few more interesting results were obtained in this research. The most striking result is the improvement in accuracy when applying moving average smoothing to the extracted features: in both evaluation metrics, MAE and R²-score, an increase in accuracy occurred (tables 2 and 3). This increase is mainly caused by the features’ broader distribution visualised in figure 11, since a broader range of feature values corresponds to more unique times to failure. Secondly, the feature selection process seems to have some impact on the preciseness of the predictions. The models using a smaller number of features seem less susceptible to overfitting (figure 16). Moreover, bringing the number of features down from 66 to nine had the most impact on model performance (table 3).

Some more interesting interpretations can be derived from this thesis’ results. As the time to failure approaches zero, the model’s accuracy improves (figure 17). This is a result of feature values that correspond strongly to low times to failure; the model seems to have learned which feature values relate to which times to failure. Besides the accurate prediction of low times to failure, better accuracy is noticed when forecasting periodic earthquake cycles. The differences in accuracy across the 5-fold validation demonstrate this best. Figures 17, 18 and 19 support this claim, since the R²-score for the time to failure cycles demonstrating periodicity, validation split K=4, is significantly higher compared to the four other parts of the data (table 4).

Furthermore, the LSTM models appear to have learned pattern sequences from the acoustic signal. However, these patterns do not appear to be ones that describe each earthquake’s time to failure cycle individually, since the earthquake cycles with a length over 10 seconds were covered very poorly. Analysing the predictions for the time to failure in figure 17 indicates that the learned pattern is more of a precursor for earthquake cycles with a periodicity of about 10 seconds.

Although Rouet-Leduc et al. (2017) reported a much higher R² accuracy of 0.89, this thesis’ results are not bad, since the results of the two studies are hard to compare. The data Rouet-Leduc et al. (2017) used concerned more periodic earthquake data than the data used in this thesis, which was aperiodic. Moreover, the result on the periodic validation split, K=4, comes close to the result of Rouet-Leduc et al. (2017), at 0.811. This thesis’ results are more in line with Lubbers et al. (2018), since their data parallels the data of this research more closely. Likewise, the results of their research correspond better to those of this thesis: 0.612 versus 0.56.


As the hypothesis of this research implied, an LSTM network has proven to be able to learn pattern sequences in the continuous acoustic signal to predict a laboratory earthquake’s time to failure. The learned patterns appear to be an accurate prognosticator for times to failure smaller than 10 seconds and for time to failure cycles that demonstrate similar periodicity. As mentioned before, the model used in this thesis is unable to forecast high times to failure and aperiodic time to failure cycles accurately. Several factors have caused these inabilities of the LSTM model, first among them the lack of features corresponding to times to failure greater than 10 seconds: when the time to failure rises above this threshold, the feature values stay more or less the same (figures 4, 5, 6, 7, 9, 10, 11 and 12).

Future research should seek broader data sets, with more varied earthquake times to failure. Furthermore, such studies should try to find characteristics in the continuous signal that do describe times to failure over 10 seconds, which this thesis failed to do. If these data sets become available and features describing the high times to failure can be found, breakthroughs could be made in forecasting an earthquake’s time to failure solely from acoustic/seismic data. When this is the case, an LSTM model like the one constructed for the purpose of this study might be capable of predicting laboratory earthquake time to failure cycles more precisely.

6 Conclusion

This thesis endeavoured to forecast a laboratory earthquake’s time to failure as accurately as possible by applying an RNN constructed of LSTM cells to a continuous acoustic signal. Eventually, an MAE of 1.851 and an R²-score of 0.56 were reached. It was expected that an LSTM network, making use of statistical features extracted from segments of the signal, would be able to learn pattern sequences that relate to specific times to failure. The results in this thesis imply such patterns were learned during training of the model, since the model demonstrates high accuracy when forecasting times to failure below 10 seconds. Furthermore, the pattern sequence that was learned by the LSTM model appears to describe periodic earthquake cycles. However, the high times to failure, as well as the aperiodic cycles, were not accurately captured by the model’s predictions. Although the R²-score did not exceed 0.56, this score is no worse than research forecasting similar earthquake data (Lubbers et al., 2018).

Two modifications to the model’s implementation increased its performance: first, the application of moving average smoothing to the regularly extracted statistical features; second, reducing the number of features, which demonstrated a similar increase in the model’s accuracy. However, a lot of improvement is still to be made in forecasting an earthquake’s time to failure from acoustic data, since in some cases the LSTM network showed very poor performance. These cases mainly include predicting high times to failure, over 10 seconds, as well as predicting the time to failure of earthquake cycles with large differences in periodicity.

To cover these disabilities of the model, features that describe high earthquake times to failure must be found and extracted from the acoustic signal. Only if these features can be extracted will the model constructed in this thesis be able to improve and to forecast aperiodic earthquake time to failure cycles more accurately.


References

Allen, R. V. (1978). Automatic earthquake recognition and timing from single traces. Bulletin of the Seismological Society of America, 68(5):1521–1532.

Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.

Box, G. E., Jenkins, G. M., Reinsel, G. C., and Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control. John Wiley & Sons.

Chen, Y. (2017). Automatic microseismic event picking via unsupervised machine learning. Geophysical Journal International, 212(1):88–102.

Curtis, A., Gerstoft, P., Sato, H., Snieder, R., and Wapenaar, K. (2006). Seismic interferometry—turning noise into signal. The Leading Edge, 25(9):1082–1092.

Farrar, D. E. and Glauber, R. R. (1967). Multicollinearity in regression analysis: the problem revisited. The Review of Economics and Statistics, pages 92–107.

Geller, R. J. (1997). Earthquake prediction: a critical review. Geophysical Journal International, 131(3):425–450.

Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Hawkins, D. M. (2004). The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1):1–12.

Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(2):107–116.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Hulbert, C., Rouet-Leduc, B., Johnson, P. A., Ren, C. X., Rivière, J., Bolton, D. C., and Marone, C. (2019). Similarity of fast and slow earthquakes illuminated by machine learning. Nature Geoscience, 12(1):69.

Jordan, T. H., Chen, Y.-T., Gasparini, P., Madariaga, R., Main, I., Marzocchi, W., Papadopoulos, G., Sobolev, G., Yamaoka, K., and Zschau, J. (2011). Operational earthquake forecasting: state of knowledge and guidelines for utilization. Annals of Geophysics, 54(4).

Kong, Q., Trugman, D. T., Ross, Z. E., Bianco, M. J., Meade, B. J., and Gerstoft, P. (2018). Machine learning in seismology: Turning data into insights. Seismological Research Letters, 90(1):3–14.

LANL (2018). LANL earthquake data set. https://www.kaggle.com/c/LANL-Earthquake-Prediction/data [Accessed: 2019-04-12].

Legates, D. R. and McCabe, G. J. (1999). Evaluating the use of "goodness-of-fit" measures in hydrologic and hydroclimatic model validation. Water Resources Research, 35(1):233–241.

Lubbers, N., Bolton, D. C., Mohd-Yusof, J., Marone, C., Barros, K., and Johnson, P. A. (2018). Earthquake catalog-based machine learning identification of laboratory fault states and the effects of magnitude of completeness. Geophysical Research Letters, 45(24):13,269–13,276.

Malhotra, P., Vig, L., Shroff, G., and Agarwal, P. (2015). Long short term memory networks for anomaly detection in time series. In Proceedings, page 89. Presses universitaires de Louvain.

McLaskey, G. C., Thomas, A. M., Glaser, S. D., and Nadeau, R. M. (2012). Fault healing promotes high-frequency earthquakes in laboratory experiments and on natural faults. Nature, 491(7422):101.

Perol, T., Gharbi, M., and Denolle, M. (2018). Convolutional neural network for earthquake detection and location. Science Advances, 4(2):e1700578.

Rouet-Leduc, B., Hulbert, C., Bolton, D. C., Ren, C. X., Rivière, J., Marone, C., Guyer, R. A., and Johnson, P. A. (2018). Estimating fault friction from seismic signals in the laboratory. Geophysical Research Letters, 45(3):1321–1329.

Rouet-Leduc, B., Hulbert, C., Lubbers, N., Barros, K., Humphreys, C. J., and Johnson, P. A. (2017). Machine learning predicts laboratory earthquakes. Geophysical Research Letters, 44(18):9276–9282.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1.

Scholz, C. H., Sykes, L. R., and Aggarwal, Y. P. (1973). Earthquake prediction: a physical basis. Science, 181(4102):803–810.

Seyrafi, A. (2009). Developing real time automatic step detection in the three dimensional accelerometer signal implemented on a microcontroller system.

Smola, A. J. and Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3):199–222.

Vantuch, T. (2018). Analysis of time series data.
