Quantifying the predictive value of soil moisture for vegetation growth using neural networks


MSc Artificial Intelligence

Track: Machine learning

Master Thesis

Quantifying the predictive value of soil moisture for vegetation growth using neural networks

by

Robert Leenders

10811548 42 ECTS April 2016 – September 2016

Supervisor:

Dr R de Jeu

Assessor:

Dr M Welling

Machine Learning Group

University of Amsterdam


Abstract

Soil moisture is a crucial constraint for vegetation growth and therefore potentially has predictive value. However, the strength of this predictive value is still largely unknown. This thesis quantifies the predictive value of soil moisture for vegetation growth. New methods are introduced to predict vegetation growth using satellite-based soil moisture observations. These methods are based on neural networks and are evaluated over mainland Australia. Analysis of the predictions of our three-layer neural network revealed that (a) soil moisture provides a strong predictive value for vegetation, (b) soil moisture can be used to reliably predict vegetation up to two months in advance, and (c) soil moisture has a strong local spatial relation with vegetation. The accuracy of the vegetation predictions depends on the magnitude of soil moisture: the quality of the prediction is higher in dry regions than in wet areas.


Chapter 1 Introduction

Vegetation is the assemblage of plant species and their ground cover. It plays an important role in our ecosystem, where it regulates various biogeochemical cycles such as those of water, carbon, and nitrogen. It converts carbon dioxide to oxygen, converts solar energy into biomass, is the basis of all food chains, and provides wildlife habitat and food. Understanding the impact of climate change on vegetation dynamics is crucial for understanding ecosystem dynamics. This is also the reason why vegetation dynamics are observed when analyzing climate change. Besides its importance for our ecosystem, vegetation is used in a wide range of important problems, most notably in forecasting and monitoring. Examples of such problems are climate change monitoring [Bounoua et al., 2000], agricultural productivity (crop yield [Teal et al., 2006]), drought monitoring [Peters et al., 2002], and forest fire detection [Illera et al., 1996].

The effect of climate change on vegetation dynamics is complex and is influenced by a wide range of climatic constraints. The three strongest climatic constraints are water availability, solar radiation, and temperature [Stephenson, 1990, Churkina and Running, 1998, Nemani et al., 2003]. The impact of these three components on vegetation is relatively well studied, with water availability being the least well studied [Lotsch et al., 2003, Mercado et al., 2009]. This is peculiar, as more than half of the world's ecosystems are substantially limited by the availability of water [Heimann and Reichstein, 2008]. A decrease in water availability reduces the ability of vegetation to convert carbon dioxide to oxygen due to a restriction in stomatal conductance and a limited availability of root water [van der Molen et al., 2011].

The climatic constraint of water availability on vegetation consists mainly of precipitation and soil moisture, with soil moisture being more strongly related to plant growth dynamics than precipitation. There are three important factors that make soil moisture crucial to plants. First, it provides water and nutrients to the plants, allowing them to grow. Second, it creates a buffer and ensures water availability to plants, even in the absence of precipitation. Finally, it enhances the soil chemical processes which aid the availability of macro-nutrients such as nitrogen. Besides its influence on vegetation, soil moisture is also of fundamental importance to many other hydrological and biological processes.

Observed soil moisture is arguably the key variable for modulating the complex dynamics of the climate-soil-vegetation system and controlling the spatial and temporal patterns of vegetation [Porporato and Rodriguez-Iturbe, 2002]. However, instead of using soil moisture observations to study the relation between vegetation dynamics and water availability, proxies such as model-based soil moisture and drought indices are often used [Hirschi et al., 2011, Lotsch et al., 2003]. Near-surface soil moisture can be accurately observed at a regional and global scale using passive and active microwave sensing instruments [Owe et al., 2008, Liu et al., 2012, 2011, Miralles et al., 2010]. The combination of passive and active observations gives a robust observed satellite-based soil moisture product [De Jeu et al., 2008, Dorigo et al., 2010].

Satellite observed soil moisture has been used to show a strong positive relation between soil moisture and vegetation at large spatial and long-term temporal scales over mainland Australia [Chen et al., 2014], with dry regions that have low vegetation density being more sensitive to soil moisture and with vegetation lagging about one month behind soil moisture. However, the details of the relationship between soil moisture and vegetation are not yet clear.

The main objective of this thesis is to quantify the predictive value of satellite-based soil moisture for vegetation by forecasting vegetation maps. To forecast vegetation maps, machine learning techniques are used to model the relationship between satellite-based soil moisture and vegetation. The predictive value of satellite-based soil moisture for vegetation will be analyzed for different spatial and temporal regions, and for different soil moisture levels. Additionally, it will be analyzed how far into the future soil moisture has predictive value.

Long-term satellite soil moisture data from ESA CCI [Liu et al., 2011] and satellite vegetation proxies as described by the normalized difference vegetation index (NDVI) [Rouse, 1973] are used. Neural networks are used as our machine learning model to forecast vegetation maps. The neural networks take satellite soil moisture as input and produce NDVI maps as output. Neural networks are a powerful set of models that can model complex non-linear relationships between input and output. A deep neural network is a neural network which is composed of multiple hidden layers. By stacking layers, which represent linear and non-linear transformations, deep neural networks can learn increasingly complex abstractions of the data. Deep neural networks have become very popular over the last couple of years, especially under the term deep learning [Hinton et al., 2012, Collobert and Weston, 2008, LeCun et al., 2015].

To quantify the predictive value of soil moisture for vegetation, the accuracy of the neural networks is analyzed. The analysis is done over different spatial regions as well as different temporal regions. To quantify how far into the future soil moisture has predictive value for vegetation, a lag period between input and output samples is introduced, and the accuracy of the models with different lag periods is compared. To quantify the spatial relation between soil moisture and vegetation, locally connected neural networks are used and their accuracies are analyzed. Finally, a measure of uncertainty is introduced to the models, thereby providing another possible way to quantify the predictive value of soil moisture for vegetation. Having a measure of uncertainty is also useful for the practical applicability of our models.

Several studies have set up methodologies to predict vegetation (NDVI). However, none of them used soil moisture as input. Indeje et al. [2006] predicted NDVI in Kenya using the seasonal rainfall from global climate models (GCM). It is assumed that climate variability, especially precipitation, drives variability in NDVI. The authors apply a correction to the GCM output using the model output statistics (MOS) approach, and then predict NDVI using a combination of empirical orthogonal function (EOF), singular value decomposition (SVD), or canonical correlation analysis (CCA), and multiple linear regression. They report that NDVI can be skillfully predicted (with ≥ 0.6 correlation); however, they do not report any error characterization such as the mean squared error (MSE).

Jiang et al. [2016] studied the spatiotemporal variability and predictability of NDVI in Alberta, Canada. They showed that vegetation in southern Alberta is predominantly driven by precipitation. Instead of predicting NDVI they predict smoothed NDVI (sNDVI). The authors use a linear regression model and an artificial neural network model calibrated by a genetic algorithm (ANN-GA) to predict sNDVI. Similar to our findings, they found that the non-linear model (ANN-GA) performed better than the linear model. This study takes a similar approach, but with a direct focus on soil moisture, using more advanced neural networks over Australia.

In this study the focus will be on both the influence of soil moisture on NDVI (as already investigated by Chen et al. [2014]) and the predictive value of soil moisture. This allows for a deeper analysis of the relationship between soil moisture and vegetation and allows for a better look at the predictive value of soil moisture for vegetation. The main contribution of this thesis, presented in chapter 3, is the analysis and prediction of vegetation with high accuracy using neural networks. The focus is on analyzing the predictability of vegetation in different temporal and spatial regions. Also analyzed is the effect of introducing a lag period between the soil moisture observations and the vegetation predictions. Furthermore, the performance of a regular neural network and an ensemble of small locally connected neural networks is compared. Finally, chapter 4 will focus on improving the predictions by adding a measure of uncertainty to them. This is done using a Bayesian approach: the neural network is replaced with a Bayesian neural network based on the work of Louizos and Welling [2016].


Chapter 2 Background

2.1 Data

2.1.1 Soil moisture

The soil moisture dataset [Liu et al., 2012, 2011, Wagner et al., 2012] is provided by the CCI project, which is part of the ESA programme on global monitoring of essential climate variables. The dataset was retrieved from http://www.esa-soilmoisture-cci.org/node/145 in April 2016. It provides surface soil moisture maps at a 0.25° resolution from 1972 to 2014. It uses active as well as passive microwave sensors and combines these two data streams into one final dataset. Observations are available daily; however, not every area has a valid soil moisture observation on every day. In other words, daily maps are incomplete. To mitigate this issue, and to make the data consistent with the NDVI dataset, the observations are averaged over the first fifteen days of a month and over the remaining days of that month. This results in two soil moisture maps per month. Figure 2.1 shows an example of a 15 day soil moisture map.
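The half-month averaging described above can be sketched as follows. This is a minimal illustration and not the code used for the thesis: the function name `half_month_means`, the array layout, and the use of NaN to mark missing observations are assumptions made for the example.

```python
import numpy as np

def half_month_means(daily_maps, days_of_month):
    """Average daily maps into two composites: days 1-15 and days 16-end.

    daily_maps: array (n_days, height, width); NaN marks a missing observation
                (assumed encoding, the dataset itself may use another flag).
    days_of_month: array (n_days,) with the calendar day (1..31) of each map.
    """
    daily_maps = np.asarray(daily_maps, dtype=float)
    days_of_month = np.asarray(days_of_month)
    composites = []
    for mask in (days_of_month <= 15, days_of_month > 15):
        block = daily_maps[mask]
        valid = ~np.isnan(block)
        counts = valid.sum(axis=0)
        totals = np.where(valid, block, 0.0).sum(axis=0)
        # Cells that were never observed in the period stay NaN.
        mean = np.full(counts.shape, np.nan)
        np.divide(totals, counts, out=mean, where=counts > 0)
        composites.append(mean)
    return np.stack(composites)
```

Averaging only over valid observations fills many gaps, but cells without any valid observation in a half-month period remain empty, which matches the white areas mentioned in the caption of figure 2.1.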

2.1.2 NDVI

To quantify vegetation the normalized difference vegetation index (NDVI) [Rouse, 1973] is used. NDVI is an index that captures the amount of live green vegetation or photosynthetic activity in an area and was first introduced in 1973 by Rouse et al. It is a popular index that has found a wide range of applications in areas such as vegetation dynamics, biomass production, and crop yield prediction.

Figure 2.1: Example of a soil moisture map. White indicates no soil moisture information is available for that area, blue indicates dry areas, and red indicates wet areas. Even when averaged there are still areas without data (e.g. the white areas in South America).

NDVI uses visible light and near infrared light to distinguish between healthy and unhealthy vegetation. It uses the concept that in general healthy vegetation will absorb most of the visible light while it reflects more of the near infrared light. In contrast, unhealthy vegetation reflects more visible light and less near infrared light. This leads to the following fraction:

$$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{RED}}{\mathrm{NIR} + \mathrm{RED}}$$

where NIR is the near infrared reflectance value for a cell and RED is the red reflectance value for that cell. The near infrared reflectance and red reflectance values for cells are captured using satellite instruments. In general NDVI values range from -1 to +1, with larger values indicating more vegetation.
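As a small worked example of this fraction, the sketch below computes NDVI per cell from two reflectance arrays. The function name `ndvi` and the choice to return NaN where both reflectances are zero are assumptions of the example, not part of the GIMMS processing chain.

```python
import numpy as np

def ndvi(nir, red):
    """Per-cell NDVI = (NIR - RED) / (NIR + RED)."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    # Assumption: cells with NIR + RED == 0 are undefined and returned as NaN.
    out = np.full(denom.shape, np.nan)
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out
```

For a cell with NIR = 0.5 and RED = 0.1 this gives 0.4 / 0.6 ≈ 0.67, a value typical of dense green vegetation, while equal reflectances give 0.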

NDVI is not the only index that measures live green vegetation. Other indices such as the soil-adjusted vegetation index (SAVI) or the enhanced vegetation index (EVI) also try to measure live green vegetation. For this study NDVI is chosen due to its wide recognition within the science community.

The NDVI data is obtained from the GIMMS AVHRR Global NDVI dataset [Pinzon and Tucker, 2014]. The dataset was retrieved from https://ecocast.arc.nasa.gov/data/pub/gimms/3g.v0/ in April 2016. The dataset is assembled from a collection of observations from NOAA's Advanced Very High Resolution Radiometers. It provides bimonthly (two per month) observations at a 1/8° resolution from 1981 to 2014. To avoid resolution incompatibility with the soil moisture data, the NDVI dataset is resampled to the coarser 0.25° resolution of the soil moisture dataset. Figure 2.2 shows an example of a 15 day NDVI map.

Figure 2.2: Example of an NDVI map. Blue indicates an NDVI value of -1 and red indicates an NDVI value of +1.

2.1.3 Preprocessing

Before feeding the data to the models three preprocessing transformations were performed. The first transformation is the aggregation of 10 observations (2 per month) into a single observation, effectively introducing a notion of history to all our observations. This is motivated by the work of Chen et al. [2014], which shows that a soil moisture observation influences the NDVI for up to the following 5 months. Performing this preprocessing step results in a small improvement in performance of 2-3%.

The second transformation is normalizing the input data. This is commonly done as it often leads to faster convergence and better local optima. To normalize the input data the mean is subtracted from the input data and the result is divided by the standard deviation of the input data.
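A minimal sketch of this normalization is given below. The text does not state whether the statistics are computed per pixel or over the whole input, nor on which split; the sketch assumes a single global mean and standard deviation estimated on the training inputs, and the function names are hypothetical.

```python
import numpy as np

def fit_normalizer(x_train):
    """Return the global mean and standard deviation of the training inputs."""
    return float(np.mean(x_train)), float(np.std(x_train))

def normalize(x, mean, std):
    """Subtract the mean and divide by the standard deviation."""
    return (np.asarray(x, dtype=float) - mean) / std
```

Reusing the training statistics for the test inputs keeps both splits on the same scale and avoids leaking test information into the preprocessing.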

The third transformation is removing the seasonality so our neural network can focus on learning anomalies. The data contains a strong seasonality; in other words, vegetation maps of the same months (or adjacent months) look similar. Note that every month is represented by two samples, such that there are 24 samples in a year. Each sample then spans a period of roughly two weeks. To remove the seasonality the mean is computed for each period and this mean is then subtracted from each sample within that same period. An example of this transformation on Australia can be seen in figure 2.3.

Figure 2.3: The first map is the original output, the second map is the average of all samples of that same time period, and the third map is the difference between the first two maps. Orange indicates a difference of zero, red means an increase of vegetation, blue means a decrease of vegetation.

This transformation is also applied on the input data. Note that this might remove some important information, most importantly the scale. Consider two samples from different months, that have completely different averages. It is then possible that after applying this transformation two (input) samples have the same difference maps but have completely different output maps.
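The seasonality removal can be sketched as below. It assumes the samples are ordered in time, that the first sample belongs to period 0, and that every one of the 24 periods occurs at least once; the function name is hypothetical, and in practice the per-period means would presumably be estimated on the training data only.

```python
import numpy as np

def remove_seasonality(samples, periods_per_year=24):
    """Subtract the per-period mean map from every sample.

    samples: array (n_samples, height, width), ordered in time.
    Returns the anomaly maps and the per-period mean maps (climatology).
    """
    samples = np.asarray(samples, dtype=float)
    period = np.arange(len(samples)) % periods_per_year
    climatology = np.stack(
        [samples[period == p].mean(axis=0) for p in range(periods_per_year)]
    )
    anomalies = samples - climatology[period]
    return anomalies, climatology
```

Adding `climatology[period]` back to a predicted anomaly map recovers a prediction on the original scale, which also makes the loss of scale information discussed above explicit: two samples with identical anomaly maps differ only through their climatology.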

To evaluate this transformation two neural networks were trained, one on a dataset without this transformation ('non-anomaly') and one with this transformation applied ('anomaly'). Figure 2.4 shows the test error of both neural networks. The non-anomaly neural network has an error of 0.002310 and the anomaly neural network has an error of 0.002075, an improvement of about 10%. The differences are small but the anomaly neural network outperforms the non-anomaly neural network consistently. Therefore, this transformation will be used as an extra preprocessing step on the input dataset.

2.2 Machine learning models

In this study a few machine learning models are used including ridge regression, neural networks, and Bayesian neural networks. Familiarity with ridge regression is assumed so that the next two sections can focus on giving an overview of neural networks and Bayesian neural networks.

Figure 2.4: Mean squared error over time (2009 to 2013) for the original dataset (non-anomaly) and the transformed dataset (anomaly).

2.2.1 Neural Networks

To explain what a neural network is, it is important to first understand what a perceptron is, which is a type of artificial neuron. The perceptron takes several inputs $x_1, x_2, \ldots, x_n$ and multiplies each input by a weight $w_1, w_2, \ldots, w_n$. It then sums all these values, and if the sum is larger than a certain threshold it outputs a 1 and otherwise a 0. To be more precise:

$$
\text{output} =
\begin{cases}
0, & \text{if } \sum_{i=1}^{n} x_i w_i \le \text{threshold} \\
1, & \text{if } \sum_{i=1}^{n} x_i w_i > \text{threshold}
\end{cases}
\tag{2.1}
$$
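Equation 2.1 translates almost directly into code. The sketch below is only an illustration; the function name and the example weights, which make the perceptron behave like a logical AND of two binary inputs, are chosen for this example.

```python
import numpy as np

def perceptron(x, w, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    return 1 if np.dot(x, w) > threshold else 0

# Illustrative weights: with w = (1, 1) and threshold 1.5 the unit fires
# only when both binary inputs are 1.
and_weights = np.array([1.0, 1.0])
and_threshold = 1.5
```

Changing the threshold to 0.5 with the same weights turns the unit into a logical OR, which shows how different weights and thresholds lead to different decisions.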

By varying the weights and the threshold the perceptron will learn to make different decisions. In recent years other artificial neurons are often used instead of perceptrons. They still use the same idea of weights, except that they often apply a non-linearity such as the sigmoid function to the resulting sum instead of comparing it to some threshold. By stacking these perceptrons a more powerful model called the multilayer perceptron is obtained.

One example of a multilayer perceptron is shown in figure 2.5. The first column of nodes is usually the input, the second column is the first layer of perceptrons, the third column is the second layer of perceptrons, etcetera. By having multiple layers of perceptrons, increasingly difficult decisions can be made. A multilayer perceptron is a certain type of neural network, one where the artificial neurons are perceptrons; however, as will become clear in the next section, it is possible to have different kinds of neurons. A neuron is often called a (hidden) unit. A neural network then has input units, hidden units (in a multilayer perceptron case the perceptrons), and output units. To be clear, a multilayer perceptron is a neural network, but a neural network is not necessarily a multilayer perceptron.

Figure 2.5: An example of a multilayer perceptron with an input layer (inputs #1 to #3), two hidden layers, and an output layer.

Equation 2.1 can be rewritten in a more general form:

$$\text{output} = f(w \cdot x + b) \tag{2.2}$$

where x and w are vectors of inputs and weights, and b is a bias term. The bias term corresponds to the threshold moved to the left-hand side of the inequality, so b = −threshold. The function f defines what kind of artificial neuron it is. Given the function:

$$
f(x) =
\begin{cases}
0, & \text{if } x \le 0 \\
1, & \text{if } x > 0
\end{cases}
\tag{2.3}
$$

the neuron corresponds to a perceptron and the equation is equal to equation 2.1. However, f can be any kind of function, such as a sigmoid, tanh, or a rectified linear one. It is important to note that f is often a non-linear function, as this makes the neural network more powerful. Below three non-linearities are highlighted. Firstly, the sigmoid function, which squashes inputs to a value between 0 and 1, as can be seen in figure 2.6. The equation is as follows:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Secondly, the tanh function, which is similar to the sigmoid function except that it squashes inputs to a value between -1 and +1. It is plotted in figure 2.6, and the equation is as follows:

$$\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}$$

Finally, the rectified linear function. Units using this function are often called rectified linear units or ReLU units. This function returns the input if it is larger than zero, otherwise it returns zero. It is plotted in figure 2.7 and the equation is as follows:

$$
\mathrm{relu}(x) =
\begin{cases}
x, & \text{if } x > 0 \\
0, & \text{if } x \le 0
\end{cases}
\tag{2.4}
$$

Figure 2.6: The sigmoid and tanh functions
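The three non-linearities can be written down directly from their equations. The sketch below follows the formulas as stated in the text; it is illustrative only, and a practical implementation would rather rely on library routines such as `np.tanh`, which avoid overflow of the exponential for large negative inputs.

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + exp(-x)), output in (0, 1)."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """tanh(x) = (1 - exp(-2x)) / (1 + exp(-2x)), output in (-1, 1)."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-2.0 * x)
    return (1.0 - e) / (1.0 + e)

def relu(x):
    """relu(x) = max(0, x), unbounded for positive inputs."""
    return np.maximum(0.0, np.asarray(x, dtype=float))
```

The two squashing functions are closely related, since tanh(x) = 2σ(2x) − 1, whereas the ReLU passes positive inputs through unchanged.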

Figure 2.7: The ReLU function, relu(x) = max(0, x)

The output of a neural network could be a unit in which f is the identity function, which is often used for regression problems. There are other possible options such as a softmax, which is often used for classification problems. The problem considered in this thesis has as many output units as there are pixels in the NDVI map that is being predicted. Each output unit has to predict a real value between -1 and 1 (the range of NDVI values), so for this problem f is set to the identity function for the output units.

By changing the weights of the neural network, the decisions made by the neural network change, but how should one change these weights? Basically, one would like to have an algorithm that changes the weights and the biases of the neural network so that it outputs correct answers, based on some training data. To quantify how correct an answer is, a cost function (or error measure) is defined. One example of a cost function is the mean squared error. The learning algorithm then tries to minimize this cost function by changing the weights and biases. One of the most common learning algorithms is gradient descent, which computes the gradient of the error with respect to the weights and biases, and then updates the weights and biases in the direction that decreases the error. Computing this gradient is often done using backpropagation [Rumelhart et al., 1985].
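To make the update rule concrete, the sketch below applies full-batch gradient descent to a single linear unit with a mean squared error cost, for which the gradient can be written out by hand instead of using backpropagation. The function name, learning rate, and number of steps are illustrative assumptions.

```python
import numpy as np

def gradient_descent(x, y, lr=0.1, steps=500):
    """Fit y ~ x @ w + b by minimizing the mean squared error.

    x: array (n_samples, n_features), y: array (n_samples,).
    """
    n, d = x.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        err = x @ w + b - y                 # prediction error per sample
        grad_w = 2.0 / n * (x.T @ err)      # d(MSE)/dw
        grad_b = 2.0 / n * err.sum()        # d(MSE)/db
        w -= lr * grad_w                    # step against the gradient
        b -= lr * grad_b
    return w, b
```

For a multilayer network the update step is the same; only the computation of the gradients changes, which is where backpropagation comes in.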

As gradient descent requires computation of the gradient over the complete dataset, which is expensive, stochastic gradient descent (SGD) is often used. SGD is a stochastic approximation of gradient descent that, instead of computing the gradient over the complete dataset, computes the gradient over a subset of the dataset. SGD is widely used but is inefficient when it comes to optimizing objectives that contain other sources of noise than data subsampling. Adam [Kingma and Ba, 2014] is a learning algorithm that tries to be efficient at optimizing these stochastic objectives. The advantage of using Adam over SGD is that it is invariant to rescaling of gradients and robust to noisy and sparse gradients while having little memory and computational overhead. In this thesis Adam is used as optimizer for all our experiments.

Another crucial part of a neural network is its architecture. The architecture of a neural network consists of layers, each containing a number of units. Usually, the first layer is the input, the final layer is the output, and all layers in between are hidden layers. For example, the neural network in figure 2.5 has 4 layers. The first layer has 3 input units, the two following layers are hidden layers one with 4 units and one with 5 units, and the final layer has a single output unit.
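The architecture of figure 2.5 (3 inputs, hidden layers of 4 and 5 units, 1 output) can be sketched as a forward pass through a list of weight matrices. The helper names are hypothetical; the ReLU hidden units, identity output, and the initialization values (standard deviation 0.01, bias 0.1) follow choices reported elsewhere in this thesis but are used here purely for illustration.

```python
import numpy as np

def init_layers(sizes, rng, std=0.01, bias=0.1):
    """Create one (weights, biases) pair per pair of consecutive layers."""
    return [
        (rng.normal(0.0, std, size=(n_in, n_out)), np.full(n_out, bias))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def forward(x, layers):
    """Apply ReLU hidden layers followed by an identity output layer."""
    h = np.asarray(x, dtype=float)
    for i, (w, b) in enumerate(layers):
        z = h @ w + b
        is_output = i == len(layers) - 1
        h = z if is_output else np.maximum(0.0, z)
    return h

layers = init_layers([3, 4, 5, 1], np.random.default_rng(0))
```

This small network has 3·4 + 4 + 4·5 + 5 + 5·1 + 1 = 47 trainable parameters; a batch of inputs with shape (n, 3) yields outputs with shape (n, 1).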

Finally, dropout [Srivastava et al., 2014] is briefly discussed. Dropout is a regularization technique that randomly drops units during training. To be precise, during training the output of randomly selected units is set to zero, while during testing all outputs are scaled by some factor (this factor depends on the probability that a unit drops). Usually a drop probability is set per layer. The hope is that this prevents units from co-adapting too much. Dropout is a very simple technique but surprisingly effective.
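The two phases of dropout can be sketched as follows, assuming the formulation in which units are zeroed during training and outputs are scaled by the retain probability (1 minus the drop probability) at test time. The function names are hypothetical, and many libraries instead use the equivalent "inverted" variant that rescales during training.

```python
import numpy as np

def dropout_train(h, p_drop, rng):
    """Zero each unit output independently with probability p_drop."""
    mask = rng.random(h.shape) >= p_drop
    return h * mask

def dropout_test(h, p_drop):
    """Keep all units but scale outputs by the retain probability."""
    return h * (1.0 - p_drop)
```

With this scaling the expected output of a unit at training time equals its output at test time, so the next layer sees inputs of comparable magnitude in both phases.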

2.2.2 Bayesian Neural Networks

A disadvantage of neural networks is that they do not provide any kind of uncertainty measure with their output; in other words, they do not provide any confidence intervals. This is especially important for problems where key decisions are being made based on the output.

To introduce confidence levels to neural networks, Bayesian methods are applied to them. In the Bayesian treatment of neural networks we marginalize over the distribution of parameters in order to make a prediction. In other words, a probability distribution is put over the weights w. As a neural network is highly non-linear and complex, an exact Bayesian treatment is practically impossible. Therefore, approximation methods are used to approximate the distribution. This section focuses on a family of approximation methods called variational inference. Another family of approximation methods are the Markov Chain Monte Carlo (MCMC) methods. The advantage of variational inference methods over MCMC methods is that variational inference methods do not require any sampling and hence are fast and deterministic.

Variational inference is a family of methods that cast inference in a distribution as an optimization problem. This is done by minimizing the Kullback-Leibler (KL) divergence [Kullback and Leibler, 1951] between the approximate posterior and the true posterior. In other words, the distribution p(y|x), which is too complicated to evaluate directly, is approximated by a simpler distribution q(y). Usually this simpler distribution q makes more independence assumptions than p. The problem then becomes which simpler distribution q to select. This is done by defining a family of distributions Q that are all simple enough to evaluate; then the q in Q that best approximates p is selected (this is the optimization part). To evaluate how well q approximates p the KL divergence is often used.
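For reference, the optimization problem can be written out using the notation of this section. The direction of the divergence shown here, KL(q ‖ p), is the one commonly used in variational inference and is stated as an assumption, since the text does not fix it explicitly:

$$
\mathrm{KL}(q \,\|\, p) = \int q(y) \log \frac{q(y)}{p(y \mid x)} \, dy,
\qquad
q^{*} = \arg\min_{q \in Q} \mathrm{KL}(q \,\|\, p)
$$

The divergence is non-negative and equals zero only when q and p agree (almost everywhere), and it is not symmetric in its arguments, so the choice of direction affects which q in Q is selected.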

There exist many variational inference methods, such as loopy belief propagation, mean-field approximation, and expectation propagation. Recent research in this area, with a focus on applications in neural networks, includes work from [Graves, 2011, Hernández-Lobato and Adams, 2015, Blundell et al., 2015, Kingma et al., 2015, Louizos and Welling, 2016]. This thesis will use the variational Bayesian neural network method defined in this last paper, called VMG (Variational Matrix Gaussian).

All recent approaches mentioned above, besides the approach of Louizos and Welling [2016], assume a fully factorized posterior distribution over the neural network weights. In other words, they treat each weight of the weight matrix independently. In contrast, Louizos and Welling [2016] treat the whole weight matrix as one using a matrix variate Gaussian distribution [Gupta and Nagar, 1999]. This leads to a reduction in the number of variance parameters to estimate, better weight posterior uncertainty estimation, more information sharing between weights, and an easier learning task.


Chapter 3 Predicting NDVI

In this chapter NDVI is predicted using soil moisture. The performance of different neural network architectures is analyzed, as well as the influence of different areas and time periods on performance. Afterwards a lag period is introduced between input and prediction to quantify how far into the future soil moisture has predictive value. Finally, the performance and benefits of using ensembles of locally connected neural networks over a single large neural network are discussed.

All our models take a map with soil moisture levels as input, and produce a map with vegetation levels as output. To represent vegetation the normalized difference vegetation index (NDVI) is used; throughout this chapter vegetation and NDVI are used somewhat interchangeably. Both the CCI Soil Moisture dataset and the GIMMS NDVI dataset provide maps that cover the entire world. Many areas are not interesting to look at since water is always readily available there, and so soil moisture has a small impact on vegetation. Furthermore, predicting vegetation for the entire world instead of a single country is computationally more expensive. Hence, in the experiments only a single country will be considered, namely Australia. Australia was chosen, and not for example South Africa, because it is a well studied area and because the data available for Australia is of high quality. To focus on Australia, image patches of 140 by 180 pixels containing just Australia were extracted and used as new input and output datasets.

Figure 3.1 shows a few examples of input, output, and prediction maps (predictions were done by our best performing model). Visually the predictions look similar to the expected output, with areas bordering water looking more similar, while a few spots in the middle of Australia prove more difficult to predict.

Figure 3.1: Input, output and prediction for a few samples in the test set: (a) the first month in the test set, (b) the sixth month in the test set, (c) the twelfth month in the test set, and (d) the wettest (and worst) month in the test set. For the first column blue represents dry areas, while red represents wet areas. For the other two columns blue represents no vegetation, while red represents high vegetation.


3.1 Using neural networks

In this thesis the focus is on using neural networks as a prediction model; however, it is helpful to establish a baseline using a few simpler models. This is done by applying two linear models to the problem: linear regression and ridge regression. The linear regression model has no hyperparameters and the ridge regression model uses a weight penalty of α = 100. Figure 3.2 shows that the ridge regression model performs significantly better than linear regression, suggesting that overfitting is a problem. The fact that ridge regression performs similarly to our best neural network suggests that the relation between soil moisture and NDVI might be of a linear nature.
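The ridge regression baseline can be sketched with its closed-form solution. This is not the implementation used for the experiments: the function names are hypothetical, and the sketch assumes flattened maps as rows and no intercept term (plausible here because the seasonal mean has already been removed from the data).

```python
import numpy as np

def fit_ridge(x, y, alpha=100.0):
    """Solve W = (X^T X + alpha * I)^{-1} X^T Y.

    x: array (n_samples, n_inputs), y: array (n_samples, n_outputs).
    """
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + alpha * np.eye(d), x.T @ y)

def predict(x, w):
    """Linear prediction for every output pixel."""
    return x @ w
```

Setting `alpha=0.0` recovers ordinary linear regression when X^T X is invertible. With far more input pixels than training samples this is not the case, which is one way to see why the unpenalized model overfits and the weight penalty helps.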

A variety of neural network architectures were tested on our problem. The performance of the most interesting ones, together with the performance of the two baseline models, can be found in table 3.1. The neural network with the best performance has 2 hidden layers, each with 2500 ReLU units, and uses heavy $\ell_1$ and $\ell_2$ regularization. It performs about 39% better than the ridge regression model.

Table 3.1 also contains the performance of a model that simply predicts all zeros. In other words, it predicts that there are no anomalies, so the predicted vegetation map is equal to the average of all vegetation maps in the training data of that same time period. The best neural network model performs about 20% better than this model. The next few paragraphs elaborate on how these neural networks were initialized and trained, how their architectures impacted performance, and which properties worked best and why.

All neural network weights were initialized by drawing from a zero-mean normal distribution with standard deviation 0.01, as described by Krizhevsky et al. [2012]. All biases were initialized to a constant value of 0.1.

Each neural network was optimized using the Adam [Kingma and Ba, 2014] optimizer with a learning rate of 0.0001. The Adam optimizer was initialized with the following parameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. The minibatch size was set to 24 at the beginning of training and was slowly increased to the size of the complete training set. In total there are about 520 samples in the dataset, of which 390 (75%) are used for training purposes. The training dataset is small enough to allow for a full non-stochastic gradient update.
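A single Adam update with the hyperparameters listed above can be sketched as follows. The sketch follows the update rule as published by Kingma and Ba [2014]; the function name and calling convention are hypothetical and do not reflect the library used for the experiments.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter.

    m and v are running averages of the gradient and the squared gradient.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)          # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

Because the step is the ratio of the two moment estimates, multiplying all gradients by a constant leaves the update essentially unchanged, which is the invariance to gradient rescaling mentioned in section 2.2.1. On the first step the update has a magnitude of about the learning rate for every parameter with a non-zero gradient.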

So far no details have been given on what exactly is being optimized. The output data (the vegetation maps) are essentially matrices of real values, and so any matrix similarity measure might be used as an error measure. This work uses the squared $\ell_2$ norm $\|A - B\|_2^2$ averaged over the entries, which is simply the mean of the squared differences between two matrices:

$$E(A, B) = \frac{1}{n} \sum_{i=1}^{n} (A_i - B_i)^2,$$

where $n$ is the number of entries in each matrix.
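As a concrete illustration of this error measure, the sketch below computes the mean squared error between two small maps with NumPy. The function name and the toy 2 by 2 maps are assumptions made for illustration only.

```python
import numpy as np

def mean_squared_error(A, B):
    """Mean of the squared element-wise differences between two maps."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return float(np.mean((A - B) ** 2))

target = np.array([[0.0, 0.1], [0.2, -0.1]])      # toy NDVI anomaly map
prediction = np.array([[0.0, 0.0], [0.1, -0.1]])  # toy prediction
error = mean_squared_error(target, prediction)
```

Two of the four pixels are off by 0.1, so the error is (0.01 + 0.01) / 4 = 0.005.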

Figure 3.2: Time series with the mean squared error of linear regression (LR), ridge regression (Ridge), and the best neural network (NN), on a period from Aug 2008 to 2014. The x-axis shows time and the y-axis the error.

| Model | Parameters | Error |
|---|---|---|
| All zeros | - | 0.00234371 |
| Linear regression | - | 0.0100653 |
| Ridge regression | α = 100 | 0.00305129 |
| Neural network | 1x500 ReLU | 0.00192698 |
| Neural network | 2x2500 ReLU, dropout | 0.00186956 |
| Neural network | 2x2500 tanh | 0.0021368 |
| Neural network | 2x2500 sigmoid | 0.00222824 |
| Neural network | 2x3500 ReLU, dropout | 0.00186964 |

Table 3.1: The performance of several models. The error column contains the mean squared error for each model. The best performing model is an NN with parameters 2x2500 ReLU, meaning it has 2 hidden layers each with 2500 ReLU units and does not use dropout.


regularization was important. This is probably due to the strong spatial relation present in the data, i.e. vegetation growth in the south hardly depends on the soil moisture levels in the north. For our models, using regularization rates of $\ell_1 = 10^{-6}$ and $\ell_2 = 10^{-5}$ worked well as a general rule. These were also the exact values used as regularization constants in our best performing neural network. For smaller networks the regularization parameters were decreased by a factor of 10 to 100 and for larger networks they were increased by a factor of 10 to 100.
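To make the role of the two rates concrete, the sketch below adds an $\ell_1$ and an $\ell_2$ weight penalty to a data error. The exact form of the penalty (plain sums over all weights, no factor of one half) is an assumption, as are the function names; only the two rates are taken from the text.

```python
import numpy as np

L1_RATE = 1e-6  # rate reported for the best network
L2_RATE = 1e-5  # rate reported for the best network

def regularization_penalty(weights, l1=L1_RATE, l2=L2_RATE):
    """Sum of L1 and L2 penalties over a list of weight matrices."""
    l1_term = sum(np.abs(W).sum() for W in weights)
    l2_term = sum((W ** 2).sum() for W in weights)
    return l1 * l1_term + l2 * l2_term

def total_loss(data_error, weights):
    """Data error (e.g. mean squared error) plus the weight penalty."""
    return data_error + regularization_penalty(weights)

weights = [np.full((2, 2), 0.5), np.full((2, 1), -1.0)]  # toy weight matrices
penalty = regularization_penalty(weights)
```

The $\ell_1$ term pushes individual weights to exactly zero, which fits the observation that an output pixel depends on only a small neighbourhood of input pixels.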

Out of the three non-linearities tested, the ReLU non-linearity performed best. As table 3.1 shows, it performs about 15% to 20% better than the sigmoid or tanh non-linearities. The problem of predicting vegetation using soil moisture can be seen as a problem where essentially one map is morphed into another, both containing real values that are somehow correlated. By using a non-linearity that squashes values, such as the sigmoid or tanh (see figure 2.6), the model loses valuable information about the input signal.

Table 3.1 shows that having a neural network with one hidden layer only decreases performance by 5%, again suggesting that the relation between soil moisture and NDVI is largely linear. Having three or more hidden layers did not improve performance, and having more than five hidden layers actually decreased performance. This is probably due to the small amount of available data; deep neural networks require many data samples to train on. Experiments showed that 2500 hidden units per layer was the sweet spot: adding more hidden units did not increase performance. Neural networks with fewer hidden units (e.g. 500) were able to predict general vegetation levels of large areas correctly, but could not accurately predict smaller areas.

Finally, adding dropout [Srivastava et al., 2014] to the neural networks hardly impacted the test error, which is surprising. It did, however, improve the training error. Dropout makes a lot of sense because, as mentioned above, there is a strong spatial relation, where an output pixel only depends on the surrounding input pixels. By dropping a lot of pixels from the input image the model effectively gets rid of a lot of noise. One possible reason for the lack of improvement in test error is the strong $\ell_1$ regularization that all neural networks have. Another possible explanation is that our neural network architectures are not very deep, which is often a requirement for dropout to work satisfactorily.

One architecture that has not been discussed so far is that of convolutional networks. Convolutional networks did not perform as well as conventional neural networks, and due to their high computational cost no further research was done in this direction. There are two problems with convolutional neural networks: the first is that filters are only applicable locally and therefore a lot of them are a waste of time and space, and the second is that pooling removes a lot of important information. The latter is a problem as the model needs to learn exactly which pixels have increased or decreased vegetation, and by what amount, and not just the general areas which have increased or decreased vegetation. Another problem with using convolutional neural networks is that they are inherently deep, which, given the small amount of available data, is a problem.

3.2 Analyzing the performance

In this section the performance of our best model, a 3 layer neural network with 2500 ReLU units per hidden layer, is analyzed. First, the performance of the prediction method over different areas in Australia is analyzed. Second, the impact of timing on the predictive skill is investigated. As will be shown in the following two sections, the prediction performance depends on the quantity of soil moisture available, where more soil moisture means worse performance. Finally, in the last subsection an attempt is made to explain why an increase in soil moisture results in a decrease in test accuracy.

3.2.1 Performance on different areas

In this section different regions of Australia are analyzed to see if vegetation is more difficult to predict in any of the regions. Measuring the difficulty of predicting vegetation of a region will be done by aggregating the mean squared error per pixel over the test data.

The resulting map can be seen in figure 3.3a, and shows that there are a few small spots in the middle of Australia where the errors are concentrated. These seem to be spots where the model generalizes poorly. To get a more realistic look at the areas that are troublesome, another map, in which the maximum error is bounded by 0.02, is shown in figure 3.3b.
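The two error maps can be sketched as follows, assuming the test targets and predictions are stored as arrays of shape (samples, height, width). The function names, the small toy map size, and the artificially bad pixel are illustrative assumptions.

```python
import numpy as np

def error_map(targets, predictions):
    """Per-pixel mean squared error over all test samples."""
    return np.mean((targets - predictions) ** 2, axis=0)

def bounded(error, max_error=0.02):
    """Clip the map so a few extreme pixels do not dominate the colour scale."""
    return np.minimum(error, max_error)

rng = np.random.default_rng(0)
targets = rng.normal(scale=0.05, size=(130, 14, 18))  # toy anomaly maps
targets[:, 0, 0] += 1.0                               # one pixel with a very large error
predictions = np.zeros_like(targets)                  # toy predictions

errors = error_map(targets, predictions)
errors_bounded = bounded(errors)
```

Bounding only changes the pixels above the threshold, so the remaining structure of the map becomes visible without altering it.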

Figure 3.3b shows that the errors concentrate in east Australia and that the north-west of Australia contains the least amount of error. It is probable that this happens due to sudden changes in wetness in these areas. Figure 3.4 shows the average soil moisture over the entire dataset and for the year 2011, which is the wettest period. Looking at these figures one can see that, for example, in 2011 there is a spike in soil moisture in south Australia, which is an area with high error. Similar reasoning can be applied to other spots where there is a large difference between average soil moisture and soil moisture in 2011, for example, the spots in mid Australia.

(a) Error map with unbounded errors.


(b) Error map with the error bounded to a maximum of 0.02

Figure 3.3: Average error maps where each pixel represents the average mean squared error for that pixel.


(a) Average soil moisture over the entire dataset


(b) Average soil moisture over the wettest year (2011)

Figure 3.4: Maps with the average soil moisture for Australia

However, this does not explain the lack of error in the north west of Australia, which also has increased soil moisture during 2011. The average NDVI over the entire dataset and the average NDVI over 2011 are shown in figure 3.5. From these two figures one can clearly see that even though there were increased soil moisture levels in the north west, there was hardly any increase in NDVI. Intuitively, this can be explained by the idea that different areas respond differently to an increase of soil moisture. As our neural network has not encountered anything this extreme before, it has to guess which areas respond strongly to the increased soil moisture and which do not respond at all. Furthermore, areas with less NDVI response have less variation and are thus easier to predict.


(a) Average NDVI over the entire dataset


(b) Average NDVI over the wettest year (2011)

Figure 3.5: Maps with the average NDVI for Australia

3.2.2 Performance on different time periods

The analysis in the previous section hinted at a strong relation between soil moisture levels and prediction performance. This section further explores this theory and analyzes the driest and wettest periods and their performances.

The five driest periods in the test set are the samples from the periods 1-15 Aug 2008, 1-15 Nov 2009, 1-15 Oct 2009, 16-31 Oct 2009, and 16-30 Nov 2009; these are the 1st, 32nd, 30th, 31st, and 33rd samples in the test set. These samples have the 3rd, 34th, 20th, 21st and 26th best performances (out of 130 samples). These performances are not spectacular, but nearly all of them are in the top 20% of performances. In contrast, the five wettest periods are all in the bottom 8% of performances. The five wettest periods are 1-15 Mar 2011, 16-28 Feb 2011, 16-30 Mar 2011, 1-15 Apr 2011, and 1-15 Feb 2011. They have the 3rd, 1st, 2nd, 5th and 10th worst prediction performances; four of them are even among the 5 worst performances.

To further solidify the relation between soil moisture and prediction performance, two time series are shown in figure 3.6. The green line represents the error of the neural network and is scaled to be on the same scale as the blue line, which represents soil moisture. Besides the clear relation between the two time series that one can see visually, they have a Pearson correlation of 0.816. In other words, 66.5% of the error variance can be explained by a simple linear regression on the average soil moisture. A scatter plot with the error on the y-axis and the average soil moisture on the x-axis can be seen in figure 3.7. This scatter plot shows a clear positive relation between the average soil moisture and the prediction error of the neural network.
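The link between the correlation coefficient and the explained variance is that, for a simple linear regression, the share of variance explained equals the squared Pearson correlation. The sketch below shows this with NumPy; the toy series are made-up stand-ins for the per-sample average soil moisture and error, and squaring the rounded value 0.816 reproduces the quoted share only up to rounding.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    return float(np.corrcoef(x, y)[0, 1])

# Share of variance explained by a simple linear fit: r squared.
r_reported = 0.816
explained = r_reported ** 2  # roughly two thirds

# Toy series (hypothetical values, not the thesis data).
soil_moisture = np.array([0.10, 0.12, 0.20, 0.30, 0.18])
error = np.array([0.0010, 0.0012, 0.0030, 0.0060, 0.0020])
r_toy = pearson_r(soil_moisture, error)
```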

Figure 3.6: The average soil moisture (Average SM) and (scaled) test error (Error) per sample over a period from Aug 2008 to 2014. The graphs move similarly, suggesting a relation between soil moisture levels and errors.

Combining the observations of this section with those of the previous section, one can conclude that there is a strong negative relation between wetness and prediction performance. The next section attempts to explain why wetness decreases performance.

3.2.3 Why wetness decreases performance

The previous two sections showed that an increase of soil moisture often results in an increase of test error or a decrease in test accuracy. This section discusses two reasons why this is.

Firstly, an increase in soil moisture often results in an increase in vegetation, which leads to greater variability. This variability is difficult to predict as each area reacts differently and as there is little data available. Another possible issue is that the soil moisture differential maps fed to the neural networks are relative to the average soil moisture for that month. In other words, the inputs are differences from the monthly average, not, say, percentages of change, which means that the same input value might represent a 5% increase in soil moisture for one month but a 15% increase for another month. Further experiments using percentages of change as input to the models did not yield any performance improvements, suggesting that this is not an issue.

Figure 3.7: Scatter plot comparing average soil moisture (x-axis) and error (y-axis).
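The difference between the two input representations can be illustrated with a small example. The monthly means below are hypothetical values chosen so that the same absolute anomaly corresponds to the 5% and 15% changes mentioned in the text; the function names are likewise assumptions.

```python
def anomaly(value, monthly_mean):
    """Absolute deviation from the long-term mean of that month."""
    return value - monthly_mean

def percentage_change(value, monthly_mean):
    """Deviation expressed as a fraction of the monthly mean."""
    return (value - monthly_mean) / monthly_mean

# Hypothetical long-term means for a wet and a dry month.
wet_mean, dry_mean = 0.30, 0.10

same_anomaly = 0.015
wet_change = percentage_change(wet_mean + same_anomaly, wet_mean)
dry_change = percentage_change(dry_mean + same_anomaly, dry_mean)
```

Here an identical anomaly of 0.015 is a 5% increase in the wet month but a 15% increase in the dry month.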

Secondly, vegetation growth is a complex process that is influenced by a lot of different factors. Soil moisture, which is part of water availability, is only one such factor. Other factors include sunlight availability, the geography of the area, and temperature. During drier periods, when water is scarce, soil moisture is one of the most important factors, because water availability is the limiting factor. During wetter periods, when water availability is high, soil moisture becomes a less important factor and other factors such as sunlight availability become more important.

3.2.4 Adaptability on anomalies

In the experiments conducted above the models are simply trained on 75% of the available data and tested on the remaining 25%. However, this is not a very realistic scenario. In practice, the machine learning models would be continuously re-trained whenever new data becomes available. Therefore, for practical applicability of the models, it is interesting to see whether the model could have predicted the anomaly (the wettest year) better had it been trained on all data available up to just before this anomaly. Furthermore, it is interesting to see when the neural network adapts and learns how vegetation reacts to these wet circumstances.

To investigate how quickly the neural network adapts to wet circumstances, several neural networks have been trained. Each was given one more training sample (from 2011, the wettest year) than the previous one. Figure 3.8 shows the performance of four of these neural networks on the data for 2011 - 2012. The blue line corresponds to the error of the original neural network that has not seen any extra data, the green line to the neural network trained up until 2011, the red line to the neural network trained up to the first vertical black line, and the cyan line to the neural network trained up until the second vertical black line.
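The retraining scheme amounts to an expanding training window over chronologically ordered samples. The sketch below generates such splits; the function name and the toy numbers are assumptions, and the actual cut-off points used in the experiments are those marked in figure 3.8.

```python
def expanding_splits(n_samples, first_cutoff, last_cutoff):
    """Yield (train, test) index lists, adding one training sample per step.

    Samples are assumed to be in chronological order; everything before the
    cutoff is used for training and everything from the cutoff on for testing.
    """
    for cutoff in range(first_cutoff, last_cutoff + 1):
        train = list(range(cutoff))
        test = list(range(cutoff, n_samples))
        yield train, test

# Hypothetical numbers: 10 samples, cutoffs moving from 6 to 8.
splits = list(expanding_splits(n_samples=10, first_cutoff=6, last_cutoff=8))
```

One network would be trained per split, so that each network has seen exactly one more sample than the previous one.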

Figure 3.8: Error (y-axis) over time (x-axis) on the 2011 - 2012 period for different training datasets. Original is the original 75% training dataset, End 2011 is trained until the end of 2011, BL 1 is trained until the first vertical black line, and BL 2 is trained until the second vertical black line.

The results show that training until 2011 gives a performance boost for the first two months of 2011, which was expected; however, it does not improve performance for the wettest period from March to May. Training until just before the wettest period results in a 28.5% performance improvement over the original neural network for the wettest period. Extending the training data by including the first peak of the wettest period results in a 45% performance improvement over the original neural network for the second peak in the wettest period. One can conclude that the neural network has trouble predicting the anomaly until just before it happens, and that once it reaches that peak it quickly adapts. This is useful for the practical applicability of the neural network because one can be confident that it quickly adapts to current trends.

3.3 Predicting further into the future

In the previous experiments the models were trying to predict vegetation for the same period as the given soil moisture. In other words, the lag between an input and output sample was zero. To quantify the predictive value of soil moisture more thoroughly, a lag period between input and output samples is introduced. This lag period ranges from zero to four months and allows for analysis of where relevant soil moisture information is stored. In other words, it allows for quantification of the future predictive value of soil moisture for vegetation.

Multiple neural networks, each with the same architecture, parameters and optimization methods as described in the previous section, but each with a different lag period, have been trained. In total four neural networks were trained: one with a lag of 1 month, one with a lag of 2 months, one with a lag of 3 months, and one with a lag of 4 months.
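Constructing the training pairs for a given lag comes down to shifting the output series relative to the input series. The sketch below assumes two samples per month (one per roughly 15-day period) and uses string labels in place of maps; the constant and function name are illustrative assumptions.

```python
SAMPLES_PER_MONTH = 2  # assumption: two (roughly 15-day) composites per month

def lagged_pairs(soil_moisture, ndvi, lag_months):
    """Pair each soil moisture map with the NDVI map lag_months later."""
    shift = lag_months * SAMPLES_PER_MONTH
    # zip stops at the shorter sequence, so trailing inputs without a
    # matching future NDVI map are dropped automatically.
    return list(zip(soil_moisture, ndvi[shift:]))

# Toy example with labels instead of maps.
sm = [f"sm{t}" for t in range(6)]
veg = [f"ndvi{t}" for t in range(6)]
pairs = lagged_pairs(sm, veg, lag_months=1)
```

With a lag of one month, the soil moisture map at time step 0 is paired with the NDVI map at time step 2, and the last two soil moisture maps have no target.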

The mean squared error per lag has been plotted in figure 3.9, and table 3.2 shows the combined mean squared error per lag. Looking at the wet anomalies in 2011, figure 3.9 clearly shows that the most important information is contained within the first month. For time periods where the behavior is relatively normal, all lags seem to perform equally well; this is probably due to the strong seasonality present in the problem. Looking at the mean squared error, there seems to be a roughly linear increase in error per increased lag period.

A model which always predicts all zeros is added to table 3.2; all zeros means that it predicts the average NDVI recorded for that biweekly period. This clearly shows that the neural network with a lag of 3 months is only marginally better than predicting all zeros, and the neural network with a lag of 4 months performs even worse than predicting all zeros. This suggests that all relevant information is present in the three months preceding the month one wants to predict.

To further analyze the future predictive value of soil moisture, the average vegetation per time period is shown in figure 3.10, with table 3.2 showing the corresponding Pearson correlation between the averages of the true vegetation maps and the averages predicted for each lag period. The Pearson correlation coefficients show that neural networks with a lag of up to two months can still predict the average vegetation trends well; however, with a lag of 3 months or more these predictions become inaccurate. This reinforces the idea that all relevant information is present in the three months preceding the month one wants to predict.

Figure 3.9: The mean squared error per lag period from 2008 to 2014.

| Lag | Error | Pearson correlation |
|---|---|---|
| Zero lag | 0.00186817 | 0.800 |
| Lag of 1 month | 0.00205393 | 0.581 |
| Lag of 2 months | 0.00213289 | 0.234 |
| Lag of 3 months | 0.00230171 | -0.032 |
| Lag of 4 months | 0.00237467 | -0.095 |
| All zeros | 0.00234371 | - |

Figure 3.10: The correct average vegetation (target averages) and the predicted average vegetation for each lag period from 2008 to 2014. The x-axis shows time and the y-axis the average NDVI.

3.4 Locally connected methods

The previous sections showed there is a strong spatial relation present in the data. This section will take advantage of this by using multiple neural networks that each predict a single pixel of the NDVI map. Instead of having one neural network that predicts the entire vegetation map at once, 140 · 180 = 25200 neural networks were trained each predicting one pixel of the vegetation map. Instead of receiving the entire soil moisture map as input each neural network will receive a 16 by 16 patch surrounding the pixel it tries to predict as input. A 16 by 16 pixel patch is an area of 400km by 400km which should contain all the important information.
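Extracting the input patch for a given output pixel can be sketched as follows. How pixels near the border are handled and how an even-sized patch is aligned around the pixel are not specified in the text; the zero padding and the alignment used here are assumptions, as are the function and variable names.

```python
import numpy as np

PATCH = 16  # patch side length in pixels (16 * 25 km = 400 km)

def extract_patch(sm_map, row, col, size=PATCH):
    """Return the size x size window around (row, col), zero-padded at the edges.

    With an even size the window cannot be exactly centred; here it spans
    size // 2 pixels before the target pixel and size // 2 - 1 after it.
    """
    half = size // 2
    padded = np.pad(sm_map, half, mode="constant", constant_values=0.0)
    # Padding shifts every index by `half`, so the window that starts `half`
    # pixels before (row, col) in the original map starts at (row, col) here.
    return padded[row:row + size, col:col + size]

sm_map = np.arange(140 * 180, dtype=float).reshape(140, 180)  # toy map
patch = extract_patch(sm_map, row=70, col=90)
corner = extract_patch(sm_map, row=0, col=0)
```

Each of the 25200 small networks would then receive the flattened patch (256 values) for its own pixel as input.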

The first experiment will use a one layer neural network, essentially a perceptron, as architecture. These neural networks are initialized and trained similarly to previous experiments. There are two notable differences in the hyperparameters: the first is that the regularization parameters have been decreased to $\ell_1 = 10^{-3}$ and $\ell_2 = 10^{-3}$, and the second is that the learning rate has been decreased by a factor of 10 to $10^{-5}$. These hyperparameters have been decreased because the neural networks are much smaller.

| Model | Error |
|---|---|
| Neural network | 0.00186817 |
| Ensemble of 1 layer neural networks | 0.00194662 |
| Ensemble of 2 layer neural networks | 0.00184438 |

Table 3.3: The mean squared error of our best neural network and of the two ensembles.

The second experiment changes the architecture to a two layer neural network with 16 ReLU units. Again, these neural networks are initialized and trained as before. The learning rate is the same as in the first experiment, $10^{-5}$; however, the regularization parameters have changed to $\ell_1 = 10^{-5}$ and $\ell_2 = 10^{-2}$.

Table 3.3 shows the test error of our standard neural network, our most optimized neural network and these two ensembles of neural networks. Both ensemble models perform better than the standard neural network, with the one layer neural network ensemble performing worse than the optimized neural network and the two layer neural network ensemble performing a tiny bit better. This reinforces the intuition that there is a strong spatial relation in the data. The fact that the ensemble of one layer neural networks is able to predict with such high accuracy again suggests that there are a lot of areas where there is a linear relation between soil moisture and vegetation. The fact that the 2 layer neural network ensemble performs better suggests that there are at least a few areas where there is a non-linear relation between soil moisture and vegetation. Regardless of what kind of relation it is, it is clear that these ensembles perform equally well or better than a single large neural network.

A plot containing the average error per sample can be seen in figure 3.11. It is interesting to see that although the ensemble of 2 layer neural networks outperforms the other neural networks on average, it does not outperform them on the wet periods in 2011-2012. In fact, the ensemble of one layer neural networks performs best there. One disadvantage of having to train so many neural networks is that manual tuning of the hyperparameters is infeasible, most importantly of the regularization parameters. In a single neural network, as all the weights are regularized together, the neural network is able to make intelligent decisions about which neurons require more weight, and which can do with less. This is in contrast with the ensemble of neural networks, where it is highly likely that some neural networks will overfit and some will underfit, as the hyperparameters are not optimized per neural network separately. This makes selecting a good set of hyperparameters for the neural networks in the ensemble extra important. In our experiments above, due to computational constraints, we gave each neural network in the ensemble the same hyperparameters. The performance of these ensembles could be improved by applying techniques such as grid search or random search to select a good set of hyperparameters for each neural network in the ensemble individually.

Figure 3.11: The mean squared error of our best neural network (NN) and of the two ensembles. The green line (LC 1 layer) represents the error of the ensemble of 1 layer neural networks, and the red line (LC 2 layers) the ensemble of 2 layer neural networks.

Having an ensemble of neural networks also offers some advantages. Besides the improved performance, the neural networks are a lot smaller and can be trained very quickly, and as they are all independent they can be trained in parallel. For our datasets, where the resolution of the input maps is still manageable, this might not seem very important, especially with the support for multi-CPU/GPU and distributed training in deep learning frameworks. However, when this resolution scales up by a large factor, training a single large network becomes a problem. This is mainly due to memory constraints and the computational overhead. Remember that our input maps contain 25200 pixels and that one pixel corresponds to an area of 25km by 25km. There are efforts, for example by Vandersat, to improve the quality of these maps such that one pixel corresponds to an area of 100m by 100m. Since each side of a pixel then shrinks by a factor of 250, this leads to input maps of 25200 · 250² = 1575000000 pixels. For input maps of this magnitude these locally connected methods might be highly beneficial.


Chapter 4 Predicting NDVI with uncertainty

The previous chapter focused on predicting vegetation using soil moisture. In this chapter vegetation will again be predicted using soil moisture; however, this time models are used that also try to assign a level of uncertainty to the predictions. A disadvantage of the neural networks used in the previous chapter is that they do not provide any kind of uncertainty measure with their output; in other words, they do not provide any confidence intervals. Such a measure is especially important for problems where key decisions are being made based on the predictions.

To add a level of confidence to the predictions, Bayesian methods are used. The focus here will be on Bayesian neural networks. Many different kinds of Bayesian neural network models exist; in this work the focus will be on a variational one called the Variational Matrix Gaussian, introduced by Louizos and Welling [2016].

Again the GIMMS NDVI dataset is used for vegetation and the CCI SM dataset is used for soil moisture. The same data preprocessing steps as in the previous chapter have been applied. Figure 4.1 shows a few examples of input, output, and prediction maps (predictions were done by the best performing Bayesian neural network). Again the predictions look very accurate, albeit a bit worse than the non-Bayesian neural network.

The next section will focus on different architectures and hyperparameters and investigate how each one affects performance on our problem. The remaining sections will focus on the analysis of uncertainty levels for different areas and time regions.

4.1 Bayesian neural networks

This section presents the results of the Variational Matrix Gaussian model for various architectures and hyperparameters. The performance of various models is analyzed and the performance of non-Bayesian neural networks and Bayesian neural networks is compared.

Figure 4.1: Input, output and prediction for a few samples in the test set: (a) the first month in the test set, (b) the sixth month in the test set, (c) the twelfth month in the test set, (d) the wettest (and worst) month in the test set. For the first column blue represents dry areas, while red represents wet areas. For the other two columns blue represents no vegetation, while red represents high vegetation.

All models are trained with the Adam optimizer using the following parameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. A learning rate of 0.01 and a batch size of 72 were used. Each Bayesian neural network was trained for 100 epochs. The same initialization and parameterization as described in the regression experiments of Louizos and Welling [2016] were used. This means all models were initialized using the default he2 initialization scheme [He et al., 2015] for the mean of each matrix variate Gaussian. A Gamma prior $p(\tau) = \mathrm{Gamma}(a_0 = 6, b_0)$ was introduced, as was a posterior $q(\tau) = \mathrm{Gamma}(a_1, b_1)$ for the precision of the Gaussian likelihood. The matrix variate Gaussian prior for each layer was parametrized as $p(W) = \mathcal{MN}(0, \tau_r^{-1} I, \tau_c^{-1} I)$, where $p(\tau_r)$ and $p(\tau_c)$ equal $\mathrm{Gamma}(a_0 = 1, b_0 = 0.5)$, and $q(\tau_r) = \mathrm{Gamma}(a_r, b_r)$ and $q(\tau_c) = \mathrm{Gamma}(a_c, b_c)$. The pseudo-data was initialized using samples from the entries of A, B. One difference is that instead of using one posterior sample to estimate the expected log-likelihood used to update the parameters, five posterior samples were used.

Table 4.1 shows the test error for the most interesting architecture and hyperparameter combinations. All models perform similarly to their non-Bayesian counterparts, albeit a little bit worse. The best Bayesian neural network has 3 hidden layers, each with 2500 ReLU units, uses 500 pseudo-data pairs, and has a variational dropout rate of 0.05.

The performance of the various architectures for Bayesian neural networks follows a similar pattern to that of non-Bayesian neural networks. For example, increasing the number of hidden units past 2500 did not result in any performance improvement. One difference is that, in contrast with the non-Bayesian neural networks, the Bayesian neural network with 3 hidden layers performs better than the Bayesian neural network with 2 hidden layers.

Selecting the number of pseudo-data pairs is a trade-off between increased performance and increased training time. In a sense it is a limiting factor, as increasing the number of pseudo-data pairs never decreases performance. By manual search, 500 pseudo-data pairs was selected as a good balance. Similarly, by a simple linear search the variational dropout rate resulting in the best performance was found to be 0.05.

It is interesting to look at the errors of the best performing Bayesian neural network and the best non-Bayesian neural network. Therefore, both time series are shown in figure 4.2. The plot shows that both graphs are very similar: they perform well on the same time periods and perform poorly on the same time periods. The Bayesian neural network almost always performs a little bit worse than the non-Bayesian neural network; however, it never performs a lot worse.

| Model | Parameters | pdp | vdr | Error |
|---|---|---|---|---|
| Neural network | 2x2500 ReLU | - | - | 0.00186817 |
| Bayesian neural network | 1x500 ReLU | 5 | 0.1 | 0.002008188 |
| Bayesian neural network | 1x500 ReLU | 25 | 0.1 | 0.001999300 |
| Bayesian neural network | 1x1500 ReLU | 5 | 0.1 | 0.001994170 |
| Bayesian neural network | 1x1500 ReLU | 50 | 0.1 | 0.001982113 |
| Bayesian neural network | 2x1500 ReLU | 50 | 0.1 | 0.001962147 |
| Bayesian neural network | 2x1500 ReLU | 150 | 0.1 | 0.00195913 |
| Bayesian neural network | 2x2500 ReLU | 150 | 0.1 | 0.00195690 |
| Bayesian neural network | 2x2500 ReLU | 500 | 0.1 | 0.00195016 |
| Bayesian neural network | 2x2500 ReLU | 500 | 0.05 | 0.00192934 |
| Bayesian neural network | 3x2500 ReLU | 500 | 0.01 | 0.00193218 |
| Bayesian neural network | 3x2500 ReLU | 500 | 0.05 | 0.00191544 |
| Bayesian neural network | 3x3500 ReLU | 500 | 0.05 | 0.00191622 |
| Bayesian neural network | 3x2500 ReLU | 500 | 0.1 | 0.00194690 |
| Bayesian neural network | 3x2500 ReLU | 1000 | 0.1 | 0.00194662 |
| Bayesian neural network | 3x2500 ReLU | 500 | 0.2 | 0.00198915 |

Table 4.1: The mean squared error of several models. pdp stands for pseudo data pairs, and vdr stands for variational dropout rate. The best Bayesian NN has 3 hidden layers each with 2500 ReLU units.

Figure 4.2: The mean squared error for the best Bayesian neural network (Bayes NN) and best regular neural network (NN) on a period from Aug 2008 to 2014.

To show how similar the predictions are, figure 4.3 contains the expected output, the prediction of the non-Bayesian neural network, and the prediction of the Bayesian neural network, for three samples in the test dataset. All images have the same scale, with red representing higher than average vegetation and blue representing lower than average vegetation. With the Bayesian neural network performing similarly to the non-Bayesian neural networks, and with the added benefit of providing confidence levels, it might be a suitable alternative to non-Bayesian neural networks.

Figure 4.3: The output, non-Bayesian NN prediction, and Bayesian NN prediction for a few samples in the test set: (a) period of 1-15 August 2009, (b) period of 16-30 November 2009, (c) period of 16-31 March 2013. Blue represents no vegetation, while red represents high vegetation.


4.2 Analyzing the uncertainty

This section contains the analysis of the performance of the best performing Bayesian neural network. As a strong relationship between soil moisture and the models’ accuracy was already established in the previous chapter, the focus of this section is on the relationship between soil moisture and uncertainty, and between error levels and uncertainty. A strong relationship between soil moisture and uncertainty would further support the claim that soil moisture has a strong predictive value for vegetation, and a strong relationship between error levels and uncertainty would support the effectiveness of Bayesian neural networks. Additionally, the difference in uncertainty between different areas and time periods is analyzed.

4.2.1 Uncertainty in different areas

This section presents the analysis on the average error, uncertainty, soil moisture, and vegetation maps. The maps can be seen in figure 4.4. The average error, soil moisture, and vegetation maps are created the same way as in the previous chapter. The average uncertainty map represents the standard deviation of 1000 drawn samples for each pixel.
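The per-pixel uncertainty described above can be illustrated with a minimal sketch. It assumes that the 1000 predictions drawn from the Bayesian neural network for one test input are stacked in a NumPy array of shape (n_samples, height, width). The array layout, the function names `uncertainty_map` and `average_map`, and the toy data are illustrative assumptions and are not taken from the thesis code.

```python
import numpy as np


def uncertainty_map(samples):
    """Per-pixel standard deviation over sampled predictions.

    samples: array of shape (n_samples, height, width), holding one
    predicted vegetation map per draw (assumed layout).
    """
    samples = np.asarray(samples, dtype=float)
    return samples.std(axis=0)


def average_map(maps):
    """Average a sequence of (height, width) maps, e.g. over the test period."""
    return np.mean(np.stack(maps), axis=0)


# Toy data standing in for samples drawn from the network (assumption).
rng = np.random.default_rng(0)
n_samples, height, width = 1000, 4, 5
true_std = np.linspace(0.01, 0.05, height * width).reshape(height, width)
samples = rng.normal(loc=0.0, scale=true_std, size=(n_samples, height, width))

unc = uncertainty_map(samples)          # one uncertainty map
avg_unc = average_map([unc, unc])       # average over (here identical) maps
```

With 1000 draws the estimated standard deviation per pixel is typically within a few percent of the spread used to generate the toy samples, which is why a large number of draws is useful for a stable map.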

Ideally, the error map and the uncertainty map would be roughly equal; unfortunately, this is not the case. Many areas in mid and south Australia with high error have low uncertainty, an undesirable result. Similarly, many areas with low error have relatively high uncertainty, for example south west Australia.

The average soil moisture map bears more resemblance to the average uncertainty map, where higher soil moisture levels correspond to more uncertainty. This suggests that soil moisture has a stronger predictive value when soil moisture levels are low, which is in line with the results seen so far. Areas with high soil moisture levels and low error generally also have low uncertainty. This suggests that soil moisture is the leading factor for uncertainty unless the models’ predictions are accurate. The vegetation map and the uncertainty map share the same general structure but do not seem to have any clear relation.

Although these maps seem to confirm soil moisture’s predictive value, they fail to show that the uncertainty estimates work well. The next section focuses on uncertainty in different time periods and shows a clearer relationship between soil moisture levels, error levels, and vegetation levels.


(a) The average error map bounded by 0.02

(b) The average uncertainty map

(c) The average soil moisture

(d) Average vegetation

Figure 4.4: The error, uncertainty, average soil moisture, and average vegetation maps for the best Bayesian neural network. All maps are averaged over a period from Aug 2008 to 2014.

4.2.2 Uncertainty in different time periods

The average maps in the previous section failed to show a clear relation between the uncertainty map and the error and soil moisture maps. This section presents the analysis of the time series of the average soil moisture levels, error levels, and uncertainty levels, as well as the analysis of specific time periods. The time series are shown in figure 4.5. All three time series are scaled to be between 0 and 1 to provide a better comparison.
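Scaling a time series to lie between 0 and 1 can be sketched as follows. This is a minimal example assuming plain min-max scaling; the thesis does not state the exact scaling method, and the function name `min_max_scale` and the sample values are made up for illustration.

```python
import numpy as np


def min_max_scale(series):
    """Linearly rescale a 1-D series so its minimum maps to 0 and its maximum to 1."""
    series = np.asarray(series, dtype=float)
    lo, hi = series.min(), series.max()
    if hi == lo:
        # A constant series has no range; return zeros instead of dividing by zero.
        return np.zeros_like(series)
    return (series - lo) / (hi - lo)


# Made-up average soil moisture values for four consecutive maps (assumption).
avg_soil_moisture = [0.08, 0.12, 0.20, 0.16]
scaled = min_max_scale(avg_soil_moisture)
```

Because this scaling is a positive linear transformation, it changes the plotted range of each series but not the shape of its curve.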

[Figure 4.5 plot: x-axis Time (2009 to 2013); series: Average SM, Error, Uncertainty]

Figure 4.5: The average uncertainty, average error, and average soil moisture levels of each map from Aug 2008 to 2014.

The average soil moisture levels and error levels seem to be similar to those of a non-Bayesian neural network, and are highly correlated. The uncertainty levels can be roughly categorized into three groups: large uncertainty, medium uncertainty, and low uncertainty. Visually, uncertainty seems to be more related to soil moisture than to error levels. This is confirmed by the Pearson correlation coefficients, which are 0.724 for soil moisture and uncertainty, and 0.525 for error and uncertainty. This provides extra evidence that soil moisture’s predictive value becomes stronger as soil moisture becomes scarce. When soil moisture becomes readily available, other ecological factors such as temperature and solar radiation become more important for vegetation levels. As this information is unavailable to the models, the uncertainty increases, and potentially also the error. This explains why uncertainty is more closely related to soil moisture than to the error levels.
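The Pearson correlation coefficient used in this comparison can be computed as in the sketch below. The series values are made up for illustration and do not reproduce the coefficients reported above; the helper names `pearson` and `min_max_scale` are assumptions, not thesis code.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two equally long 1-D series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both series
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


def min_max_scale(series):
    """Rescale a non-constant series to the range [0, 1]."""
    series = np.asarray(series, dtype=float)
    return (series - series.min()) / (series.max() - series.min())


# Made-up per-map averages (assumption, for illustration only).
soil_moisture = np.array([0.10, 0.14, 0.22, 0.18, 0.12])
uncertainty = np.array([0.004, 0.006, 0.009, 0.007, 0.006])

r = pearson(soil_moisture, uncertainty)
r_scaled = pearson(min_max_scale(soil_moisture), min_max_scale(uncertainty))
```

The coefficient is unchanged by scaling each series to the range 0 to 1, so `r` and `r_scaled` agree up to floating-point rounding; the scaling affects only how the curves are plotted, not the reported correlations.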

From the analysis it is clear that there is a strong positive relation between soil moisture and uncertainty. However, as shown in the previous section, it is not clear that the uncertainty is concentrated in the right areas. To further investigate the relations between uncertainty and soil moisture, and between uncertainty and error, a few sample cases are analyzed. Differential soil moisture, error, and uncertainty maps for a few selected samples are shown in figure 4.6.

Although the Bayesian neural network is able to capture general uncertainty levels well, it is unable to consistently estimate high uncertainty for areas with high error. In other words, the model is able to detect the existence of areas with relatively high error, but it is not able to estimate exactly where they are. Analysis of the error maps and uncertainty maps for all test cases showed that uncertainty was most accurately estimated when there was little overall uncertainty. There are only a handful of test cases where this is the case; one such example is shown in figure 4.6d. The model correctly estimates high uncertainty in mid west Australia, the area where there is also high error. However, more commonly the model is not able to estimate where these areas are. The result is a very grainy uncertainty map where areas with high error have low uncertainty and areas with low error have high uncertainty. A few such examples are shown in figures 4.6a, 4.6b, and 4.6c.
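The comparison above is visual. One way such localization could be quantified, offered only as a sketch and not as the method used in this thesis, is to measure which share of the highest-error pixels also falls among the highest-uncertainty pixels. The function name `top_fraction_overlap`, the 10% threshold, and the toy maps are all assumptions.

```python
import numpy as np


def top_fraction_overlap(error_map, uncertainty_map, fraction=0.1):
    """Share of the highest-error pixels that are also among the highest-uncertainty pixels.

    Returns a value in [0, 1]; 1 means the top `fraction` of pixels coincide
    exactly. Pixels that are NaN in either map (e.g. masked ocean) are ignored.
    """
    err = np.asarray(error_map, dtype=float).ravel()
    unc = np.asarray(uncertainty_map, dtype=float).ravel()
    valid = ~(np.isnan(err) | np.isnan(unc))
    err, unc = err[valid], unc[valid]
    k = max(1, int(round(fraction * err.size)))
    top_err = np.argsort(err)[-k:]   # indices of the k largest errors
    top_unc = np.argsort(unc)[-k:]   # indices of the k largest uncertainties
    return np.intersect1d(top_err, top_unc).size / k


# Toy maps (assumption): one uncertainty map that follows the error map,
# and one whose values are shuffled across pixels.
rng = np.random.default_rng(1)
error = rng.random((10, 10))
well_localized = 0.5 * error
poorly_localized = rng.permutation(error.ravel()).reshape(error.shape)

good = top_fraction_overlap(error, well_localized)
poor = top_fraction_overlap(error, poorly_localized)
```

For an uncertainty map unrelated to the error map, the expected overlap is roughly equal to `fraction`, so values near that level would indicate that uncertainty is not concentrated where the error is.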

The relation between uncertainty and soil moisture suffers from the same problem as the relation between uncertainty and error levels: the model is unable to consistently estimate high uncertainty for areas with large amounts of soil moisture, which would be expected given the strong positive relation between soil moisture and uncertainty levels. It could have been that these areas simply have low error; however, as can be seen in figure 4.6, this is not the case.

In conclusion, the Bayesian neural network is able to capture the general uncertainty, but it cannot pinpoint the location of this uncertainty. Further research is required to investigate why the uncertainty estimates are so poorly localized.

