
Testing Different Independent Component Analysis Methods in the Earth Sciences



Universiteit van Amsterdam

Future Planet Studies

BSc Thesis

Testing Different Independent

Component Analysis Methods in the

Earth Sciences

Author

Tamara Stoof

Supervisor

Dr. Ir. Emiel van Loon

June 30, 2020

Amsterdam


Contents

1 Introduction
  1.1 Independent Component Analysis
  1.2 Earth Sciences and Blind Source Separation
2 Theoretical Framework
  2.1 Blind Source Separation
  2.2 Independence of the Original Components
  2.3 Centering
  2.4 Whitening
  2.5 Linear Transformation Matrix
  2.6 Non-Gaussianity and Estimation of the Original Components
  2.7 Independent Component Analysis Methods
  2.8 Contrast functions
  2.9 Optimization Algorithms
3 Methods
  3.1 The ICA methods
  3.2 Constructing the Scenarios
  3.3 Data Analysis
4 Results & Discussion
  4.1 Earth Sciences and ICA
  4.2 The ICA methods
  4.3 The Contrast Functions
  4.4 The Optimization Algorithms
  4.5 The Effect Sizes
  4.6 The Synthetic Data
5 Conclusion
6 References
7 Appendix
  7.1 Appendix 1
  7.2 Appendix 2
  7.3 Appendix 3


Abstract

This BSc thesis analyses the effect of different contrast functions and optimization algorithms on the estimative ability of Independent Component Analysis (ICA) in solving the Blind Source Separation (BSS) problem. This was done by testing 19 different ICA methods that were constructed using available R functions. The accuracy of the ICA methods in estimating the independent components was tested for six models using the signal-to-interference ratio (SIR) function in R. These SIR values indicate the correlation between the sources and the estimates. The Kruskal-Wallis test was used to test whether there was a significant difference in the SIR values of, respectively, different ICA methods, contrast functions, and optimization algorithms. The tests showed that the choice of ICA method, contrast function, and optimization algorithm does influence the accuracy of the estimations. The effect sizes for all three group types showed that the chosen optimization algorithm has a greater impact on the SIR values than the chosen contrast function.

1 Introduction

Independent Component Analysis (ICA) is a widely used method in signal processing to separate mixed signals into the individual original components when there is no known information concerning the original components and how they are mixed. The only known information is that the original components are mixed linearly, without losing any information regarding the original components. This problem is better known as Blind Source Separation (BSS) (Hyvärinen, Karhunen & Oja, 2001; Yu, Hu & Xu, 2014). ‘Blind Source’ acknowledges the lack of available information on the original components, i.e. the source signals. The difficulty now lies in the fact that both the mixing matrix and the source signals are unknown. Only the information available in the mixed signals can be used to find an estimate of the sources (Hyvärinen & Oja, 2000; Matsuoka, 2002). Fortunately, there are multiple methods available that can solve the BSS problem, with ICA being the most widely used (Cao & Liu, 1996; Matsuoka, 2002; Yu et al., 2014).

1.1 Independent Component Analysis

In order to be able to solve the BSS problem, ICA uses two simple assumptions: the original components are 1) independent, and 2) non-Gaussian (Hyvärinen et al., 2001). ICA can solve the BSS problem by searching for a transformation matrix that maximizes these properties in order to find the estimated components (see section 2.5). The ICA method consists of two parts: a contrast function and an optimization algorithm (Hyvärinen et al., 2001; Yu et al., 2014). The contrast function is commonly used as a measure of independence or non-Gaussianity. The optimization algorithm either maximizes or minimizes the contrast function, depending on the contrast function used, in order to find the most independent and non-Gaussian components from the mixed components (Comon, 1994; Hyvärinen et al., 2001; Yu et al., 2014).
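To make this concrete, the sketch below runs one complete ICA estimation in R using the fastICA package, one of several CRAN implementations (the sources and the mixing matrix here are hypothetical stand-ins, not the data used in this thesis):

    library(fastICA)
    set.seed(1)

    # Two non-Gaussian sources: a sine wave and uniform noise
    S <- cbind(sin(seq(0, 20, length.out = 500)), runif(500))
    A <- matrix(c(0.6, 0.4, 0.3, 0.7), 2, 2)   # mixing matrix
    X <- S %*% A                               # observed mixed signals

    # Contrast function (log cosh) and optimization (fixed-point FastICA,
    # parallel estimation) are chosen through the function arguments
    est <- fastICA(X, n.comp = 2, alg.typ = "parallel", fun = "logcosh")
    str(est$S)   # the estimated independent components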

1.2 Earth Sciences and Blind Source Separation

The BSS problem can be found in many Earth Science related areas. Take, for example, the reconstruction of vegetation based on mixed biomarker data (Jansen, van Loon, Hooghiemstra & Verstraten, 2010), the identification of the sources responsible for variations in hydroecological systems (e.g. identifying water pollution sources) (Alexandrov & Vesselinov, 2014), or separating measured waves from each other and filtering out any noise stemming from the seismometer itself (Trauth, Gebbers, Marwan & Sillmann, 2007). To solve these Earth Scientific BSS problems, it is necessary to have an understanding of the available methods and of the situations in which they can be efficiently applied. This paper will therefore also provide a quick overview of the statistical and mathematical principles of ICA in the theoretical framework, to provide the necessary background to understand ICA.

This paper strives to make the ICA method more approachable for usage in the Earth Sciences. To do this, several different ICA methods are tested on two artificial scenarios that simulate real problems within the Earth Sciences: a vegetation reconstruction model and the separation of seismic waves. The ICA methods are tested on their accuracy in separating the original components from the mixtures in each scenario. Nineteen ICA methods will be tested. These methods differ from each other in the chosen contrast function, optimization algorithm, and, if applicable, the chosen nonlinearities and orthogonalization techniques. Six contrast functions (i.e. maximum likelihood, kurtosis, negentropy, mutual information, joint entropy, and cumulant tensors) and eight optimization algorithms (i.e. FastICA, the natural gradient algorithm, Newton’s method, GAMs, Pearson’s system, Infomax, FOBI, and JADE) will be tested. This paper will answer the following research question: “How does the choice of contrast function and optimization algorithm influence the estimative ability of independent component analysis in solving the blind source separation problem?” To answer this question, this paper will look at the influence of the contrast function and the optimization algorithm on the accuracy of estimating the independent components.

This paper will first discuss, in the theoretical framework, the statistical and mathematical background that explains the validity of ICA as a means to solve the BSS problem. The methods section will outline how the different ICA methods are constructed and tested on two Earth Science related scenarios. The results and discussion will further explain which methods prove to be more efficient in both scenarios.

2 Theoretical Framework

The theoretical framework will provide the essential statistical and mathematical background to understand the basic principles of ICA. Additionally, an explanation of the different contrast functions and optimization algorithms used in this BSc thesis is given.

2.1 Blind Source Separation

BSS, as was mentioned before, is a problem defined by its lack of available information on the original components s_i and the mixing matrix A. Only the mixed components x_i are known. The mixed components are generated after the original components have undergone a linear transformation, as described by the mixing matrix. A linear transformation can be seen as the transformation (e.g. matrix multiplication) of one or two matrices into another matrix, as long as the following requirements hold for the resulting matrix (Rowland & Weisstein, n.d.):

1. T(v1 + v2) = T(v1) + T(v2)
2. T(av) = aT(v)

Where T(·) is the used linear transformation (e.g. the used matrix multiplication), v1 and v2 are vectors, and a is a scalar. The linear transformation resulting in the matrix containing the mixed signals is the matrix multiplication of the mixing matrix with the matrix containing the source signals. The mathematical expression is as follows (Cao & Liu, 1996; Yu et al., 2014):

AS = X

Where A is the mixing matrix with dimensions m-by-n, S is an n-by-T matrix containing the original components, and X is an m-by-T matrix containing the mixed components. The matrices S and X consist of vectors s_i and x_i respectively. The vector s_i contains the information pertaining to source signal i, and the vector x_i contains the information concerning mixed signal i. These vectors make up the rows of their respective matrices, so that the matrices contain all the information of the source and mixed signals. In short, each row of S and X corresponds to one source or mixed signal, and the columns contain the information on each component (i.e. the different samples) (Trauth et al., 2007).

The BSS problem strives to find S based solely on X. Usually, the equation would be rewritten as follows (Hyvärinen et al., 2001):

Y = WX

Where Y is the estimate of the original components S, with the rows of Y representing the estimated independent components. X is the mixed signal matrix. W is the de-mixing matrix, which is the inverse of A, so that W = A⁻¹. However, since A is also unknown, other methods are needed to solve the BSS problem. Fortunately, there are several methods available, with ICA being the most popular. ICA is able to solve the BSS problem because of two simple assumptions: the original components are 1) independent and 2) have a non-Gaussian distribution (Hyvärinen et al., 2001; Yu et al., 2014).

It must be stressed that, while ICA is able to estimate the source signals, it is unable to retrieve the scale and the sign of the source signals. This is due to the simple fact that both A and S are unknown: the multiplication of S by a certain scalar can be nullified by dividing A by the same scalar (Hyvärinen et al., 2001). Since the scalar itself does not necessarily matter, it is possible to assume unit variance in order to establish a pre-set magnitude of the independent components (Hyvärinen et al., 2001; Trauth et al., 2007). This means two things: ICA is unable to estimate the variance/scale of the independent components, and it is therefore also unable to accurately estimate the sign of the independent components (Hyvärinen et al., 2001; Yu et al., 2014).

2.2 Independence of the Original Components

It is important that the original components are independent in order to retrieve them from the mixed components. The mixed components are dependent and correlated due to the linear transformation they underwent (A in Figure 1). After all, the matrix multiplication of A with S results in a matrix X in which every entry is a weighted sum of the source signals, each multiplied by its respective scalar. The mixed signals are therefore mixtures of the source signals and hence dependent and correlated.

The original components can be retrieved from the mixed components in two steps (Hyvärinen et al., 2001; Trauth et al., 2007; Yu et al., 2014). First, the data needs to be pre-processed by centering and whitening (as indicated by V in Figure 1). Second, a linear transformation matrix needs to be found that rotates the resulting orthogonal data in the right direction (Hyvärinen et al., 2001). This transformation matrix is found using an optimization (or learning) algorithm (Figure 1) (Yu et al., 2014). Figure 1 shows a detailed overview of the steps involved in solving a linear BSS problem. These steps will be discussed in greater detail in the next two sections.



Figure 1: A detailed overview of a linear BSS problem and how to solve it. The input arguments are the source signals s_i. The source signals are linearly mixed by a mixing matrix A, resulting in the mixed signals x_i. The mixed signals are then whitened by a whitening matrix V, leading to the whitened signals z_i. These whitened signals are then used to find a de-mixing matrix W through a learning algorithm that checks whether W converges. If not converged, the algorithm is run again; if converged, the resulting W is used to compute the estimated signals y_i.

2.3 Centering

In order to reduce the complexity of the computations, the data is always centered around a zero mean (Hyvärinen et al., 2001; Trauth et al., 2007). If the variables are not already centered, some simple pre-processing is necessary. Hyvärinen et al. (2001) show how this can be done simply by subtracting the mean of the variable:

centered X = X − E{X}

By centering X, S is also automatically centered. After all,

centered S = A⁻¹ centered X

Note that the mixing matrix A is not affected by the centering of X. It is therefore still possible to find an estimate of A without any interference from the centering. Hence, the original components can be recovered from the estimated centered components by adding A⁻¹E{X} to them:

S = centered S + A⁻¹E{X}
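As an illustration, the centering step can be carried out in R as follows (a minimal sketch; the matrix X and the variable names are hypothetical stand-ins):

    # X: a 100-by-3 matrix of mixed signals, one mixed signal per column
    set.seed(1)
    X <- matrix(runif(300), nrow = 100, ncol = 3)

    # Centering: subtract the column means, i.e. X - E{X}
    X_centered <- scale(X, center = TRUE, scale = FALSE)

    colMeans(X_centered)   # each column now has (numerically) zero mean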

2.4 Whitening

Independent components are uncorrelated. However, uncorrelatedness of the components does not imply that the components are independent (Hyvärinen, 2001; Lee, 2013; Yu et al., 2014). Whiteness, while not the same as independence, comes closer to independence than uncorrelatedness alone. Whitening is a process in which the correlated mixed components, after being made zero-mean, are decorrelated and scaled so that their covariance matrix equals the identity matrix, i.e. E{xxᵀ} = I (Lee, 2013; Yu et al., 2014). The latter is an important characteristic that greatly simplifies the BSS problem.

There are several whitening methods, such as PCA-based whitening using singular value decomposition (SVD) for non-square matrices, or eigenvalue decomposition (EVD) for square matrices (Lee, 2013; Trauth et al., 2007). PCA-based whitening is used mostly in overdetermined situations, i.e. where there are more mixed signals than source signals. PCA can be used to reduce the dimensions of the matrix, resulting in a square mixing matrix A (Hyvärinen et al., 2001). It is preferable to have a square mixing matrix, as it greatly reduces the computations needed to estimate the components. This paper will therefore only consider situations with square mixing matrices A. However, this does not imply that the matrices S or X need to be square.

Whichever method is chosen, the result is a whitening matrix V (Figure 1). The whitening of the mixed components occurs as follows (Hyvärinen et al., 2001; Lee, 2013):

1. AS = X
2. VAS = VX
3. Z = VX, U = VA
4. Z = US

Where Z is the matrix containing the whitened mixed components, and U = VA. Both sides of the equation AS = X undergo a matrix multiplication with a projection matrix V. V is chosen in such a way that VX and VA, thus Z and U, are orthogonal.
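To illustrate, an EVD-based whitening matrix can be computed in R roughly as follows (a sketch continuing the hypothetical centered matrix from the centering example):

    # X_centered: centered 100-by-3 matrix of mixed signals (one per column)
    set.seed(1)
    X_centered <- scale(matrix(runif(300), 100, 3), center = TRUE, scale = FALSE)

    C <- cov(X_centered)                             # sample covariance matrix
    E <- eigen(C)                                    # EVD: C = E D E^T
    V <- diag(1 / sqrt(E$values)) %*% t(E$vectors)   # whitening matrix V = D^(-1/2) E^T
    Z <- X_centered %*% t(V)                         # whitened signals

    round(cov(Z), 10)                                # (numerically) the identity matrix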

The usefulness of whitening as a pre-processing step for ICA now lies in the orthogonality of the new matrix U. Orthogonal matrices have the characteristic that UUᵀ = I (Giesbertz, 2020; Hyvärinen et al., 2007; Koldovský & Tichavský, 2018). This characteristic, along with the restriction of the source components to unit variance, E{ssᵀ} = I, can be used to prove that U is orthogonal (Hyvärinen et al., 2001):

E{ZZᵀ} = U E{SSᵀ} Uᵀ = U I Uᵀ = UUᵀ = I

The orthogonality of U means that there are fewer parameters to estimate when trying to find the original mixing matrix A (Hyvärinen et al., 2001). After all,

A⁻¹X = WX = S

can be rewritten, after whitening, as

U⁻¹Z = BZ = Y

Where B = U⁻¹. The inverse of an orthogonal matrix is also orthogonal (Giesbertz, 2020; Hyvärinen et al., 2007), so ICA tries to find an orthogonal de-mixing matrix B that produces the estimated independent components. If B matches W, and thus A, the separation is perfect. Whitening, in short, limits the search for a matrix B to matrices with an orthogonal base. This means that, instead of having to estimate the n² parameters of the original mixing matrix A, only n(n − 1)/2 parameters need to be estimated. The reduction of the parameters is due to the degrees of freedom of an orthogonal matrix (i.e. n(n − 1)/2) (Hyvärinen et al., 2001). The validity of using Y as an approximation of S, even after whitening the data, can be proven as follows:

X = AS
VX = VAS
(VA)⁻¹VX = S
A⁻¹V⁻¹VX = S
A⁻¹X = S = Y

In short, while whitening is not a prerequisite of ICA, it reduces the number of parameters by almost half, greatly simplifying the computations needed in ICA to solve the BSS problem (Hyvärinen et al., 2001).


2.5 Linear Transformation Matrix

The second step in ICA is to find a matrix B (or W) corresponding to the linear transformation (Figure 1) that the whitened mixed components need to undergo in order to find the estimated components. However, the search for matrix B has been limited to finding an orthogonal matrix that changes the direction of the edges of the joint distribution of the whitened mixed components (Hyvärinen et al., 2001). Figure 2 shows the uniform distribution of two independent original components, and the uniform joint distribution of two whitened mixed components. The figure clearly shows that the only difference between the original components and the whitened mixed components is a linear transformation that simply changes the direction of the edges of the distribution. In short, ICA searches for a matrix B that finds the rotation of the whitened mixed components in which the independence of the estimated components is maximized (Hyvärinen et al., 2001).

Figure 2: The uniform distribution of two independent original components (left) and two uncorrelated mixed components (right) (Hyvärinen et al., 2001).

2.6 Non-Gaussianity and Estimation of the Original Components

Non-Gaussianity of the source signals is another prerequisite for ICA (Cao & Liu, 1996; Hyvärinen et al., 2001; Roberts & Everson, 2001; Stone, 2004; Yu et al., 2014). The mixed signals are a linear sum of the non-Gaussian original components. The central limit theorem states that the sum of independent random variables is more Gaussian than the original variables (Hyvärinen et al., 2001). In other words, a linear combination such as y = Σ_i w_i x_i would be maximally non-Gaussian if it equaled one of the original components (Hyvärinen et al., 2001). After all, the original components are maximally non-Gaussian themselves. The de-mixing matrix will need to be formulated in such a way that local maxima of non-Gaussianity are found, with each maximum representing a different original component (Hyvärinen et al., 2001).

Moreover, jointly Gaussian variables have the property that, if they are uncorrelated, they are also independent (Hyvärinen et al., 2001). This makes it impossible to find the orthogonal matrix describing the linear transformation that separates the original components from the whitened mixed components. After all, finding the estimated mixing matrix depends on finding the direction that maximizes the independence of the whitened mixed components (Hyvärinen et al., 2001). If the whitened mixed components are already independent after whitening, it will be impossible to find the original components. Note that whitening alone will not result in finding the original components. Gaussian variables therefore cannot be used in ICA.

Finally, a visualization of the distribution of two Gaussian independent variables, as shown in Figure 3, further explains why Gaussian variables cannot be used to find the orthogonal linear transformation matrix. This distribution clearly shows a distinct lack of edges. After whitening the mixed components, resulting in a similar distribution, it will be impossible to find a rotational orthogonal matrix that shifts the distribution in such a way that the edges of the estimated components have the same position as the edges of the original components (Hyvärinen et al., 2001). After all, there are no edges. In short, ICA does not allow the components to have a Gaussian distribution.

Figure 3: The distribution of two Gaussian independent variables (Hyvärinen et al., 2001).

2.7 Independent Component Analysis Methods

The ICA method consists of two parts: a contrast function and an optimization algorithm. There are many different contrast functions and optimization algorithms (Table 1) available to solve the BSS problem through ICA (Hyvärinen et al., 2001; Hyvärinen & Oja, 2000; Wang, Wang, Zhang & Zhong, 2016; Yu et al., 2014). The contrast functions are usually a measure of independence or non-Gaussianity, while the optimization algorithms are used to maximize either the non-Gaussianity or the independence of the whitened mixed components in order to find the orthogonal matrix B (Hyvärinen et al., 2001; Yu et al., 2014). Each contrast function and each optimization algorithm has its own advantages and disadvantages (Hyvärinen et al., 2001). The detailed workings of all contrast functions and optimization algorithms are beyond the scope of this paper. However, Hyvärinen et al. (2001) and Yu et al. (2014) offer a more in-depth exploration of several contrast functions and optimization algorithms. This paper will only provide a short overview of the contrast functions and optimization algorithms tested in this study. Note that more contrast functions and optimization algorithms exist to solve ICA.


2.8 Contrast functions

Kurtosis is used as a measure of the non-Gaussianity of a variable (Hyvärinen et al., 2001). Kurtosis checks the form of the distribution (i.e. flat or peaked) and can thus be used to test whether the distribution of a variable resembles the normal distribution or not. Kurtosis is a fourth-order cumulant. Higher-order (i.e. order > 2) cumulants have an important property that makes them useful for determining non-Gaussianity (Cardoso, 1990): the higher-order cumulants of a variable are zero if, and only if, the variable comes from a Gaussian distribution (Hyvärinen et al., 2001). Therefore, the kurtosis of a variable indicates how far removed the variable is from a Gaussian distribution. The optimization algorithm would therefore seek to maximize the (absolute) kurtosis of a variable to find the estimated independent components. However, kurtosis is sensitive to outliers and does not make a robust measure of non-Gaussianity (Hyvärinen et al., 2001).
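For a centered variable of unit variance, the excess kurtosis reduces to E{y⁴} − 3, which is zero for Gaussian data. A minimal R sketch (the function name is hypothetical):

    # Excess kurtosis of a centered, unit-variance sample y: zero for Gaussian
    # data, negative for flat (sub-Gaussian) distributions, positive for
    # peaked (super-Gaussian) distributions
    excess_kurtosis <- function(y) mean(y^4) - 3

    set.seed(1)
    excess_kurtosis(rnorm(1e5))                      # approximately 0
    excess_kurtosis(runif(1e5, -sqrt(3), sqrt(3)))   # approximately -1.2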

Negentropy is used as a measure of non-Gaussianity. Negentropy is a normalized version of differential entropy, which, in information theory, describes how unpredictable and unstructured a variable is (Novey & Adali, 2008). The higher the entropy, the more unpredictable and unstructured, and hence the more random the variable is (Chaitin, 1977). A Gaussian distribution is considered the most random and thus has the highest entropy of all distributions of a given variance (Hyvärinen et al., 2001). The entropy of a discrete variable X can be calculated as follows (Hyvärinen et al., 2001):

H(X) = − Σ_{i=1}^{n} P(x_i) log_b P(x_i)

Where H(X) is the entropy of variable X, P(x_i) is the probability of outcome x_i, n is the number of possible outcomes of X, and b is the base used in the logarithm (e.g. 10 or e).
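As a small illustration, this formula can be evaluated in R for a discrete probability vector (a minimal sketch; the function name is hypothetical):

    # Shannon entropy of a discrete probability vector p (base 2 by default)
    entropy <- function(p, base = 2) {
      p <- p[p > 0]   # zero-probability outcomes contribute nothing
      -sum(p * log(p, base = base))
    }

    entropy(rep(1/4, 4))   # 2 bits: a uniform four-outcome variable
    entropy(c(0.9, 0.1))   # about 0.47 bits: far more predictable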

Negentropy is then defined as (Hyvärinen, 1997; Novey & Adali, 2008):

J(Y) = H(Y_gaussian) − H(Y)

Where J(Y) is the negentropy of a variable Y, and H(Y_gaussian) is the entropy of Y_gaussian, a Gaussian random vector with the same covariance as Y, while H(Y) is the entropy of the non-Gaussian variable Y itself. Maximizing the negentropy maximizes the non-Gaussianity of the components. After all, if Gaussian distributions have the highest entropy, then the most non-Gaussian distributions will have the lowest entropies (Chaitin, 1977; Hyvärinen et al., 2001). According to the equation for negentropy, if Y is as non-Gaussian as possible, say H(Y) equals zero, then J(Y) = H(Y_gaussian), which is the highest negentropy value obtainable for variable Y. However, if Y is a Gaussian variable, then:

J(Y) = H(Y_gaussian) − H(Y_gaussian) = 0

In other words, to find the orthogonal matrix that yields the estimated components corresponding to the original components using a measure of non-Gaussianity, one needs to maximize the negentropy of the whitened mixed components.

Joint entropy is a term used in information theory. Entropy is used to determine the uncertainty or unpredictability of a variable. The joint entropy of two independent variables is always larger than the joint entropy of two dependent variables with the same marginal entropies (Bavaud, Chappelier & Kohlas, 2005; Yu et al., 2014). If there is any overlap in the entropies of X and Y, as is the case for dependent variables, then the joint entropy will naturally be lower than the joint entropy of two independent variables. After all,

H(X, Y) = H(X|Y) + H(Y)

Where H(X|Y) is the entropy of X without any information that is also contained in Y. The more information in X that is also part of Y (i.e. the larger the information overlap), the lower the value of H(X|Y) and the lower the joint entropy H(X,Y). Joint entropy can therefore be used as a measure of independence. The optimization algorithm therefore seeks to maximize the joint entropy to compute the estimated independent components.

Mutual information is also used in information theory. Mutual information measures the dependence of variables. In the case of two variables, mutual information represents the overlap of the entropies of the two variables. The mutual information I(X,Y) of two variables X and Y can be calculated as follows (Hyvärinen, 1997; Novey & Adali, 2008; Yu et al., 2014):

I(X, Y) = H(X) + H(Y) − H(X, Y)

Where H(X) and H(Y) are the marginal entropies of X and Y, and H(X,Y) is the joint entropy of X and Y. For two independent variables the joint entropy is H(X,Y) = H(X) + H(Y) (Yu et al., 2014). Therefore, if the two variables are independent, their mutual information is zero. The optimization algorithm therefore seeks to minimize the mutual information to compute the estimated independent components.
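Building on the entropy sketch above, the empirical mutual information of two discrete samples can be computed directly from this identity (function and variable names are hypothetical):

    # Empirical mutual information of two discrete samples x and y, in bits
    mutual_info <- function(x, y) {
      p_xy <- table(x, y) / length(x)   # joint probabilities
      H <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
      H(rowSums(p_xy)) + H(colSums(p_xy)) - H(p_xy)   # I = H(X) + H(Y) - H(X,Y)
    }

    set.seed(1)
    x <- sample(0:1, 1000, replace = TRUE)
    mutual_info(x, x)           # equals H(x): fully dependent
    mutual_info(x, sample(x))   # approximately 0: independent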

Maximum likelihood estimation is used to find the most likely values of the parameters of a model that explain the observed data (Brooks-Bartlett, 2018). In the case of ICA, the model being used is the matrix B, the inverse of the whitened mixing matrix. Each entry in B is a different parameter whose value needs to be computed. Essentially, the values of the parameters in B are estimated in such a way that the likelihood that B could have generated the observed data is maximized (Hyvärinen et al., 2001). The likelihood is calculated in ICA using the log-likelihood function:

log L(B) = Σ_{t=1}^{T} Σ_{i=1}^{n} log p_i(b_iᵀ x(t)) + T log |det B|

Where log L(B) is the log-likelihood of matrix B. Matrix B is the whitened de-mixing matrix (i.e. the inverse of the whitened mixing matrix, (VA)⁻¹). T is the number of samples in the mixed signal matrix X, x(t) is the vector of mixed signals at sample t, n is the number of components, b_i is a row in B, and log p_i(·) is the log probability density function (pdf) (Yu et al., 2014). As the independent components, and therefore their pdf's, are unknown, an approximation of the pdf's is necessary to avoid the nonparametric (i.e. infinite-dimensional) problem of estimating the densities. This approximation is based on a simple choice between one of two densities, which greatly simplifies the problem (Hyvärinen et al., 2001; Yu et al., 2014).

After matrix B has been found, it is used to calculate the estimated independent components through matrix multiplication of B with the observed data in matrix X. The optimization algorithm therefore computes the independent components by maximizing the likelihood.
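The log-likelihood itself is straightforward to evaluate once a density approximation is chosen. A minimal R sketch, using −log cosh(s) as a stand-in for the unknown log-densities (B and X here are hypothetical inputs):

    # Log-likelihood of a candidate de-mixing matrix B for whitened mixtures X
    # (X: n-by-T matrix with one mixed signal per row; B: n-by-n matrix)
    loglik <- function(B, X) {
      S_hat <- B %*% X   # candidate source estimates
      # -log cosh(s) approximates log p(s) for super-Gaussian sources
      sum(-log(cosh(S_hat))) + ncol(X) * log(abs(det(B)))
    }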


Cumulant tensor methods make use of the eigenvalue decomposition (EVD) of the fourth-order joint cumulants to find the independent components (Cardoso, 1990; Hyvärinen et al., 2001). The EVD of the whitened matrix Z produces eigenvalues, which are used to approximate the kurtoses of the independent components (after all, the fourth-order cumulant tensors are related to the fourth-order moment, kurtosis (Weisstein, n.d.)), and eigenvectors, which form the columns of the eigenmatrices (Cardoso, 1990; Hyvärinen et al., 2001; McCullagh, 2018). The eigenmatrices of the fourth-order cumulants are obtained as follows (Hyvärinen et al., 2001):

M_i = b_m b_mᵀ

where M_i is the i-th eigenmatrix, with i running over the number of eigenmatrices, and b_m is row m of the estimated whitened mixing matrix B.

If the non-zero eigenvalues are all distinct, then the kurtoses of the independent components are all distinct as well, meaning that the independent components can easily be computed (Hyvärinen et al., 2001). In fact, each eigenmatrix corresponds to one column of the unknown whitened mixing matrix B. B can then be used to compute the independent components.

If the kurtoses are not all distinct, then the eigenmatrices can be decomposed through EVD once more (Hyvärinen et al., 2001). The resulting eigenvectors then form the columns of B. Additional, more complex computations are needed if not all of the non-zero eigenvalues from this last EVD are distinct.

The optimization algorithms using cumulant tensors differ from the other optimization algorithms, since the cumulant tensors undergo EVD instead of being minimized or maximized (Cardoso, 1990; Hyvärinen et al., 2001).

2.9 Optimization Algorithms

FastICA tries to maximize the non-Gaussianity of the whitened matrix Z in order to find the independent components. According to Hyvärinen et al. (2001), the creators of FastICA, the contrast function is either maximized (e.g. negentropy) or minimized (e.g. mutual information), depending on the contrast function used, through a certain orthogonal rotation of the matrix Z. FastICA uses a fixed-point iteration method to find the direction of this rotation, and nonquadratic functions to measure the non-Gaussianity. The choice of nonquadratic function determines which contrast function is used.

Newton’s method is an algorithm used to minimize or maximize the contrast function. Newton’s method is a numerical iteration method that uses derivatives to find the zeros of a function; applied to the derivative of the contrast function, it finds the minima and maxima, thus minimizing or maximizing the contrast function (Hyvärinen et al., 2001).

The natural gradient method uses the first derivative of the contrast function, i.e. the slope and its sign, to find the direction that minimizes the contrast function (Hyvärinen et al., 2001; Joseph, 2020). The natural gradient will repeatedly update the parameters in such a way that this leads to the minimization of the contrast function:

X = X − a · δf(X)/δX

Where X is the parameter (here, the de-mixing matrix) being updated, f is the contrast function, δf(X)/δX is its gradient (Yu et al., 2014), and a is a learning rate that constrains the updates of the parameter, so that the algorithm eventually converges (Joseph, 2020).
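The update rule is ordinary gradient descent at heart. A generic R sketch (all names are hypothetical, with a one-dimensional stand-in for the contrast function):

    # Gradient descent: repeatedly step against the gradient of f
    gradient_descent <- function(x0, grad_f, a = 0.01, n_iter = 1000) {
      x <- x0
      for (i in seq_len(n_iter)) {
        x <- x - a * grad_f(x)   # the update rule x <- x - a * df/dx
      }
      x
    }

    # Example: f(x) = (x - 2)^2 has gradient 2(x - 2); the descent converges to 2
    gradient_descent(0, function(x) 2 * (x - 2))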


FOBI (Fourth-Order Blind Identification) uses the EVD of the weighted correlation matrix of the whitened matrix Z. The resulting eigenvectors are the columns of the estimated unknown mixing matrix B. FOBI is computationally very simple, but works under the assumption that the kurtoses of all the independent components are distinct. If the independent components have non-distinct kurtoses (e.g. if they come from the same distribution, which is often the case), then FOBI does not work (Cardoso, 1990; Hyvärinen et al., 2001).

JADE (Joint Approximate Diagonalization of Eigenmatrices) uses, as the name implies, the eigenmatrices of the cumulant tensors. These eigenmatrices are diagonalized by the estimated unknown mixing matrix B (Cardoso, 1990; Hyvärinen et al., 2001), as

Q = B M_i Bᵀ

Where Q is a diagonal matrix, and M_i is any of the eigenmatrices of the cumulant tensor. The minimization of the sum of squares of the off-diagonal entries maximizes the sum of squares of the diagonal entries (Hyvärinen et al., 2001). When minimizing the JADE criterion, the sum of squared cross-cumulants of the whitened matrix Z is also minimized. This can be seen as the minimization of nonlinear correlations and will therefore lead to finding independent components. The estimate of the unknown mixing matrix B can thus be found by minimizing the sum of squares of the off-diagonal entries in the matrix Q (Hyvärinen et al., 2001).

GAMs (generalized additive models) are generalized linear models in which the linear predictor is a sum of smooth functions of the predictor variables. GAMs are used in ICA in the Product Density ICA algorithm (ProDenICA) (Hastie & Tibshirani, 2010). This method estimates a non-Gaussian density by multiplying a cubic smoothing spline by the Gaussian density (Hastie & Tibshirani, 2003; Hastie, Tibshirani & Friedman, 2009). The contrast function is then used to measure the non-Gaussianity of the estimated density. Maximizing the contrast function then leads to the independent components.

Infomax (information maximization) maximizes the output entropy, i.e. maximizes mutual information, of the whitened matrix Z using nonlinear scalar functions. These nonlinear functions are used as an approximation of the unknown probability density functions of the source signals (Yu et al., 2014). Infomax and maximum likelihood estimation are similar, as they both estimate the densities of the independent components to solve the BSS problem (Hyvärinen et al., 2001; Yu et al., 2014).

PearsonICA, based on the Pearson system, is a special case. The Pearson system is a family of distributions obtained by generalizing the differential equation of the normal distribution (Weisstein, n.d.). The Pearson system is not a real optimization algorithm on its own; it needs one of the above-mentioned algorithms to actually find the independent components. However, PearsonICA adds an extra step to the optimization algorithm by making use of the score functions (derivatives of the log densities) of the contrast function to estimate the distribution of the whitened matrix Z (Karvanen, Eriksson & Koivunen, 2000; Karvanen & Koivunen, 2002). Finding the maximally non-Gaussian distribution then leads to the estimated independent components.


3 Methods

This study tested which ICA methods can best be used to solve the BSS problem. To this end, nineteen ICA methods were chosen. Each of these methods was tested on two different scenarios. Finally, the accuracy results of every method were tested for statistical significance. This section will first discuss how the ICA methods were chosen, then how the two scenarios were generated, and finally how the data were analysed.

3.1 The ICA methods

The influence of the chosen contrast function and optimization algorithm on the estimation of the independent components was tested. This was done using nineteen different ICA methods. These methods differ from each other in the chosen contrast function, optimization algorithm, nonlinearities, or orthogonalization technique (Table 1). The nonlinearities and orthogonalization techniques can be seen as a subcategory within the chosen optimization algorithm. Therefore, not all algorithms have the same nonlinearities or orthogonalization techniques, if any, to choose from. The nineteen ICA methods were based on the packages available in the CRAN R repository that solve the BSS problem using ICA. Consequently, more ICA methods can be constructed than just the nineteen presented in this paper; however, these are not readily available and would need to be constructed manually.

Moreover, multiple packages were available that solve the BSS problem using the same ICA method. Instead of testing the same ICA method twice or more using different R functions, only one function was chosen to represent each ICA method. The R packages and functions, and how they are used in the construction of the nineteen ICA methods, are displayed in Table 1; a construction sketch follows the table. If an R function allowed for multiple different input arguments (e.g. to differentiate between the used nonlinearities), then these input arguments were used to construct the different ICA methods.


Table 1: The R packages and functions used in the construction of the 19 ICA methods through different combinations of the contrast function, optimization algorithm, nonlinearities, and orthogonalization. See references for the citation of each package.

Package      Function     Contrast function    Optimization     Nonlinearity   Orthogonalization   ICA method
ProDenICA    ProDenICA    Negentropy           GAMs*            log(cosh(y))   SVD**               1
ProDenICA    ProDenICA    Kurtosis             GAMs             y^3            SVD                 2
steadyICA    infomax      Maximum likelihood   infomax          -              -                   3
ica          icaimax      Joint entropy***     newton           log(cosh(y))   -                   4
ica          icaimax      Joint entropy***     newton           log            -                   5
ica          icaimax      Joint entropy***     newton           exp            -                   6
ica          icaimax      Joint entropy***     gradient         log(cosh(y))   -                   7
ica          icaimax      Joint entropy***     gradient         log            -                   8
ica          icaimax      Joint entropy***     gradient         exp            -                   9
ica          icafast      Negentropy           fastica          log(cosh(y))   Parallel            10
ica          icafast      Negentropy           fastica          log(cosh(y))   Deflation           11
ica          icafast      Negentropy           fastica          exp            Parallel            12
ica          icafast      Negentropy           fastica          exp            Deflation           13
ica          icafast      Kurtosis             fastica          y^3            Parallel            14
ica          icafast      Kurtosis             fastica          y^3            Deflation           15
mlica2       mlica2       Maximum likelihood   fastica          log(cosh(y))   EVD****             16
PearsonICA   PearsonICA   Mutual information   Pearson system   log(cosh(y))   -                   17
JADE         FOBI         Cumulant tensor      FOBI             -              EVD                 18
JADE         JADE         Cumulant tensor      JADE             -              EVD                 19

* Generalized Additive Models using the Poisson distribution
** Singular Value Decomposition
*** Joint entropy is used to calculate Mutual Information and is therefore included
**** Eigenvalue Decomposition
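To make the table concrete, the sketch below shows how two of the nineteen methods could be constructed from the listed packages (the calls follow the packages' documented interfaces, but treat the exact argument names as an assumption and consult each package manual):

    # X: a 100-by-3 matrix of mixed signals (constructed as in section 3.2)
    set.seed(1)
    X <- matrix(runif(300), 100, 3) %*% matrix(runif(9), 3, 3)

    library(ica)    # icafast(), icaimax()
    library(JADE)   # JADE(), FOBI()

    # Method 10: negentropy contrast, FastICA algorithm,
    # log(cosh(y)) nonlinearity, parallel orthogonalization
    m10 <- icafast(X, nc = 3, fun = "logcosh", alg = "par")

    # Method 19: cumulant tensors diagonalized by the JADE algorithm
    m19 <- JADE(X, n.comp = 3)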

3.2 Constructing the Scenarios

Two different scenarios were constructed: the vegetation reconstruction scenario and the seismic wave scenario. The two scenarios differed from each other in the structure (i.e. smoothness, predictability, and periodicity) of the source and mixed signals. Figure 4 shows that the seismic wave scenario was more structured than the vegetation reconstruction model. After all, the seismic wave scenario was used to represent measured seismic waves, while the vegetation reconstruction scenario was used to represent possible samples taken from a field area. In other words, the former scenario is meant to resemble actual waves, while the latter is a series of random measured values. Each scenario is represented by a model, and each model has a few variants: two model variants for the vegetation reconstruction scenario, and four model variants for the seismic wave scenario. The variation between the models of one scenario is due to slight differences in the generation of the models; the overall structure (i.e. predictability and periodicity) remains similar. Different variants of the models are used so that the variability of the ICA methods can be tested.

Figure 4: The source signals of one of the vegetation reconstruction scenarios (top) and one of the seismic wave scenarios (bottom). The source signals of the vegetation reconstruction scenario are more random and less structured than the source signals of the seismic wave scenario.

All models consisted of three independent source signals. Each of the source signals contains 100 values. For the vegetation model, these 100 values represent 100 samples taken from the study area. For the seismic wave model, the 100 values represent the length of the measured waves. The values in the vegetation reconstruction model are generated randomly from a uniform distribution using the runif() function. The seismic wave scenario has a non-random functional relationship, leading to a structured and periodic wave of length 100.

The models were constructed in such a way that it became possible to test whether or not the structure/randomness of the source signals affects the estimative ability of ICA. There are many fields within the Earth Sciences that might profit from the usage of ICA. However, the data used in real-life experiments will be less ideal than the artificially generated data used in this study. Therefore, it was pertinent to construct several models ranging from random to structured. Naturally, there are countless other models that could be constructed following the same types of structure as the models used in this study. Even so, the specific models that were used are not of great importance, as long as they range from random to structured.

Additionally, each source signal is centered as a preprocessing step, by subtracting the mean from each respective source signal. The individual source signals are then put together in a matrix S, with one source signal per row and the measurements on the columns.

The mixing matrix A is a square 3-by-3 matrix; its dimensions are based on the number of source signals. The matrices A and S are then used to construct the matrix X, which contains the mixed signals. Finally, the transpose of X is taken, so that the mixed signals are on the columns and the “measurements” are on the rows. The resulting X is a 100-by-3 matrix, which is used as the input argument for all the functions in Table 1. Additional input arguments are chosen in such a way that the different ICA methods in Table 1 are generated.
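An illustrative version of this construction in R (the exact signals of the thesis models are not reproduced here; the waves below are hypothetical stand-ins of the same general structure):

    set.seed(42)
    n <- 100

    # Vegetation reconstruction scenario: random, unstructured sources
    S_veg <- matrix(runif(3 * n), nrow = 3)        # one source per row

    # Seismic wave scenario: structured, periodic sources
    tt <- seq_len(n)
    S_seis <- rbind(sin(2 * pi * tt / 10),
                    sin(2 * pi * tt / 23),
                    sign(sin(2 * pi * tt / 17)))   # smooth and square waves

    # Center each source, mix with a square 3-by-3 matrix, transpose to 100-by-3
    S <- S_seis - rowMeans(S_seis)
    A <- matrix(runif(9), nrow = 3)                # mixing matrix
    X <- t(A %*% S)                                # input for the ICA functions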

Moreover, most functions need the number of components to extract as an additional input argument. This value was generated using the PriorNormPCA() and proposeNCP() functions from the mlica2 package (Teschendorff, 2012). This value is, naturally, equal to the number of constructed source signals. However, in real-life situations the number of source signals is unknown. PriorNormPCA() and proposeNCP() perform PCA on the data to extract a square subspace whose dimensions equal the number of estimated signals to extract. Therefore, to simulate a real-life scenario, the PriorNormPCA() and proposeNCP() functions were used.

Lastly, some functions give the option to whiten the data if it has not already been whitened. While PriorNormPCA() and proposeNCP() do whiten the data using PCA, only the mlica2() function actually makes use of the matrix output of these functions; the other functions only use the information concerning the number of estimated signals. Hence, the whiten argument is set to TRUE if the function offers to whiten the data.

3.3 Data Analysis

The output of the constructed ICA methods is a matrix Y containing the estimated signals. The accuracy of this estimation was tested using the SIR() function from the JADE package (Miettinen, Nordhausen & Taskinen, 2017). The SIR, or signal-to-interference ratio, uses the mean square error (MSE) between Y and S, transformed to a logarithmic scale, to determine the accuracy of the ICA methods (Eriksson, Karvanen & Koivunen, 2000; Nordhausen, Ollila & Oja, 2011). The MSE between Y and S can be computed using the correlation matrix R of the two variables according to the following formula (Nordhausen et al., 2011):

MSE(Y, S) = 1 − (1/d) Tr(RRᵀ) + MD(R)²

Where d is the number of independent components, Tr(·) is the matrix trace (i.e. the sum of the elements on the main diagonal of the matrix) (Weisstein, n.d.), and MD(·) is the minimum distance. Both the Tr(·) and MD(·) of the correlation matrix R can be computed using the sums of the maximum correlations between y_d and s_d (Miettinen et al., 2017; Nordhausen et al., 2011). Both Y and S are scaled to unit variance in order to have meaningful comparisons between the two matrices (Nordhausen et al., 2011).

The SIR() function in the JADE package hence uses the correlation between the scaled matrix S and matrix Y to compute the SIR value (units in dB) (Miettinen et al., 2017). The higher the correlation, the higher the resemblance between S and Y, and the higher the SIR value. The SIR values of all ICA methods for all models are then used to analyse whether or not the choice of contrast function and/or optimization algorithm influences the accuracy of the estimation.
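A sketch of this evaluation step (assuming, per the JADE documentation, that SIR() takes the true sources first and the estimates second, and that the JADE() output stores the estimated sources in $S):

    library(JADE)
    set.seed(1)
    S <- cbind(sin(seq_len(100) / 2), runif(100), rexp(100))   # true sources
    X <- S %*% matrix(runif(9), 3, 3)                          # mixed signals
    Y <- JADE(X, n.comp = 3)$S                                 # estimated signals
    SIR(S, Y)                                                  # accuracy in dB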

Three hypothesis tests are formulated: one to test whether the choice of ICA method influences the estimation, one to test whether the contrast function influences the estimation, and one to test whether the optimization algorithm influences the estimation. All three hypotheses follow the same structure. The general formal notation of these hypotheses is

H0 : µ1 = µ2 = ... = µk

H1 : at least one of the group means differs from the others

The lillietest() function in MATLAB was used to determine whether or not the SIR values came from a normal distribution. As this was not the case, one of the assumptions of ANOVA was violated. Therefore, the non-parametric equivalent of ANOVA, the Kruskal-Wallis test, was used instead. The Kruskal-Wallis test was used to determine whether or not there was a difference in the ranks of the SIR values between groups, indicating whether or not the groups differ significantly from each other (i.e. come from different distributions). The multcompare() function was used to determine which groups differed significantly from one another. The multcompare() function produces an interactive graph of the estimated SIR values of each group with their corresponding comparison intervals; two groups differ significantly if their intervals do not overlap. Three group types were tested: the ICA methods (i.e. every method is a separate group), the contrast functions (i.e. every contrast function is a separate group), and the optimization algorithms (i.e. every algorithm is a separate group).
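The thesis ran these tests in MATLAB; for reference, an analogous non-parametric check in R could look as follows (a sketch on a hypothetical long-format data frame of SIR values):

    # results: one row per (model, method) combination; illustrative values
    results <- data.frame(
      sir    = c(20.3, 2.5, 25.8, 32.2, 1.9, 33.1),
      method = factor(c(1, 3, 19, 1, 3, 19))
    )

    kruskal.test(sir ~ method, data = results)          # global group difference
    pairwise.wilcox.test(results$sir, results$method,   # pairwise follow-up
                         p.adjust.method = "bonferroni")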

Finally, the effect sizes of the contrast functions and the optimization algorithms were used to determine the impact of the choice of contrast function/optimization algorithm on the accuracy of the estimations.

4 Results & Discussion

4.1 Earth Sciences and ICA

The SIR values of all nineteen ICA methods for the six different models of the two scenarios are given in Table 2. The first and third models represent the vegetation reconstruction scenarios, the other models correspond to the seismic wave separation scenarios. At first glance, the SIR values of the seismic wave models overall seem higher than the SIR values of the vegetation reconstruction models.

Moreover, there are four ICA methods whose SIR values are considerably lower than those of the other methods: ICA methods 3, 7, 8, and 18. In addition, many methods do not accurately (i.e. SIR value < 20) estimate the independent components for all models. Models 1, 3, and 5 show significantly lower SIR values than models 2, 4, and 6 (the p-value obtained from the Kruskal-Wallis test is 6.08742e-06). The lower values for models 1 and 3 compared to the other models could indicate that periodic waves (as used in the seismic wave models) are easier to separate using ICA than the “waves” created by the different measurements per independent component. These measurements create a more random, and thus less structured and periodic, wave than the seismic waves. Note that model 5 is also more randomized and less structured and periodic than the other wave models, indicating that the complexity of the structure of the original independent components also influences the accuracy of the estimation. However, all SIR values over 20 are considered to be good estimations, so even somewhat more complex structures can be solved accurately.

It must be noted that less structured models are more likely in real Earth Science research. After all, taking, for example, multiple soil samples from a study area will most often not produce a smooth periodic “wave”. The goodness of fit of the estimations will therefore probably be more akin to the results of models 1, 3, and 5 than to those of models 2, 4, and 6. It is therefore expected that it will not always be possible to find a good estimation (i.e. SIR value > 20) of the source signals. After all, most ICA methods in models 1, 3, and 5 oscillate around a SIR value of 20. However, even when the ICA methods do not offer a good estimation but oscillate around a SIR value of 20, the rudimentary general form of the source signals can still be found. The Earth Sciences could therefore profit from ICA, regardless of the precise accuracy of the estimations, as long as the general form of the source signals is also of interest. Of course, there are situations where good estimations can be obtained, but this is not guaranteed.


Table 2: The SIR values of the nineteen tested ICA methods for six different models. Models 1 and 3 (i.e. SIR1 and SIR3) correspond to the vegetation reconstruction scenario, and models 2, 4, 5, and 6 represent the seismic wave separation scenarios.

ICA method   SIR1        SIR2        SIR3        SIR4        SIR5        SIR6
1            20.28858    32.19076    19.54831    33.61597    19.61961    26.46394
2            20.81054    33.06685    19.24506    36.04927    22.00329    32.25103
3             2.466783    1.885558    1.042227    1.717847    2.759003    1.818014
4            20.52624    32.29327    19.52894    33.33929    20.27813    26.96456
5            20.55703    32.7204     19.35672    34.96531    22.30527    30.39886
6            19.86933    32.05005    19.57331    33.87134    19.18729    25.78057
7             2.506013    1.915498    0.9334796   1.754805    3.405712    1.820738
8             2.644087    1.145524    1.231908    1.741565    2.355433    1.87154
9            19.69274    31.48636    19.59583    32.17022    23.59187    29.58872
10           20.22857    32.17058    19.55056    33.57041    19.75135    26.47703
11           14.20311    37.30067    22.42341    30.50288    12.99584    22.84245
12           20.12672    32.13892    19.58315    33.4595     19.94099    25.74442
13           14.11941    37.1571     22.2626     30.41415    12.96699    22.55109
14           20.75552    33.06031    19.24542    35.94757    22.00765    32.23807
15           22.39881    37.59373    23.99535    31.72896    16.549      25.14747
16           17.25457    31.75275    19.71589    36.1767     17.3093     15.57936
17           13.84742    32.39083    19.58409    34.42462    20.87665    26.28811
18           18.32876     1.266668    3.397268    6.998567    9.213224    7.733034
19           25.80542    33.09531    19.46112    33.90585    25.29321    33.10462

4.2 The ICA methods

A boxplot visualizing the results in Table 2 is shown in Figure 5. Again, the SIR values of methods 3, 7, 8, and 18 are noticeably lower than those of the other methods. Additionally, every ICA method has produced a SIR value below 20 at least once. Figure 5 also shows that the range of SIR values differs between the ICA methods, with methods 11 and 13 having the largest range, and method 19 the smallest. Method 19 is, moreover, located in the section with the highest SIR values. That, combined with a small range, and hence little variability in the ability of the method to accurately estimate the independent components, makes method 19 the seemingly best ICA method.


Figure 5: A boxplot visualizing the relation between the ICA method and the SIR values for the 6 tested models. Methods 3, 7, 8, and 18 have lower SIR values than the other methods. Methods 11, 13, and 16 also generate low SIR values (i.e. < 20) more often than the other methods. The range of method 19 indicates that its SIR values are high in every model, with less variation than the other methods.

In order to test whether there was a significant difference between the ICA methods, a Kruskal-Wallis test was conducted. This test resulted in a p-value of 3.18728e-06, which is much lower than the chosen significance level of 0.05. Therefore, the null hypothesis can be rejected: there is a difference in the estimation of the independent components between the different ICA methods. Figure 6 visualizes the results of the Kruskal-Wallis test. It shows that methods 3, 7, and 8 differ significantly from method 19. Moreover, while not a significant difference, method 18 is also lower in rank than the other methods (excluding methods 3, 7, and 8).


Figure 6: The estimated mean SIR values of each ICA method with their corresponding comparison intervals. The blue and red lines clearly indicate that the SIR values of ICA method 19 differ significantly from those of method 3. Moreover, although not colorized, method 19 also differs significantly from methods 7 and 8. All other combinations of methods do not differ significantly. However, method 18 is noticeably lower in rank than most methods.

Figure 7 visualizes the source signals, the estimated independent components (or estimated signals) obtained from ICA method 19, and the estimated signals obtained from ICA method 3, using the data generated in model 6, which represents the seismic wave scenario. The visualization of the ability of the two ICA methods to solve the vegetation reconstruction scenario as described by model 3 is shown in Figure 11 in Appendix 2. Method 19 is one of the ICA methods with the highest SIR values; method 3 has one of the lowest SIR values. There is a stark difference in the ability of the methods to accurately estimate the independent components. While the results from method 19 show a clear resemblance between the source signals and the estimated signals, the results from method 3 show no resemblance whatsoever. This corresponds naturally with the SIR values. After all, the SIR() function calculates the correlation between the source signals and the estimated signals (Miettinen et al., 2017). The higher the SIR value, the higher the correlation between the sources and the estimated signals, and the higher the resemblance. The choice of ICA method is therefore clearly important to the accuracy of estimating the independent components.


Figure 7: The source signals are visualized in the three graphs on the first row, with each graph representing a separate source signal. The second row shows the estimated independent components obtained from ICA method 19, which uses the JADE algorithm. The third row shows the estimated independent components obtained from ICA method 3, which uses InfoMax. The estimated signals result from applying the methods to the data in model 6.

Figure 8 visualizes the correlations between the source signals and the estimated signals computed using ICA methods 3 and 19 on the data generated in model 6. Figure 12 in Appendix 2 shows the correlation plot of the source signals and estimated signals computed using the same methods but with the data generated in model 3. The correlation plots of both models clarify why ICA method 3 has less success in accurately estimating the independent components than method 19. ICA is able to find the independent components by maximizing the independence (or non-Gaussianity) of the variables. It is therefore to be expected that the correlation plot between the source and estimated signals should result in a clear relationship between one of the sources and one of the estimates without any overlapping correlations between one of the sources and two or more of the estimates. Large overlaps would imply that the estimated components are not independent from each other, and that the ICA method has failed. After all, if the independent source signals correlate with more than one of the estimated signals, then there is still some mixing present. Figures 8 and 12 clearly show that ICA method 3 resulted in a lot of overlapping correlations, while ICA method 19 hardly has any overlap. ICA method 19 is therefore better able to separate the independent components compared to method 3, which was to be expected based on the SIR values and the visualizations of the estimates in Figure 7 and Figure 11.


Figure 8: The correlation plots of the correlations between the source signals and the estimated signals computed using method 3 (left) and method 19 (right) on the data generated in model 6. The source signals are on the y-axis, and the estimated signals are on the x-axis. Method 3 shows a large overlap of correlations between all sources and estimates, while method 19 shows a clear relationship between one of the sources and one of the estimates.

4.3 The Contrast Functions

Now that the ICA methods have been shown to differ significantly in their estimative abilities, it is important to test whether it is the contrast function, the optimization algorithm, or both that cause this difference. The influence of the contrast functions and the optimization algorithms was also tested using the Kruskal-Wallis test, where the groups being tested are either the contrast functions or the optimization algorithms instead of the ICA methods.

First, the contrast functions will be discussed. The p-value of the Kruskal-Wallis test is 0.0055, which is below the significance level of 0.05, so the null hypothesis is rejected. The choice of contrast function therefore does influence the average SIR values, and thus the ability of the ICA method to estimate the independent components. Figure 9 shows which contrast functions differ significantly from one another. The SIR values on the x-axis show the mean SIR value and the range of SIR values per contrast function. This shows, for example, that negentropy estimates the independent components significantly better than maximum likelihood, but does not differ significantly from the other contrast functions.
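In R, this amounts to a call to kruskal.test(); the data frame below is a small illustrative stand-in for the per-run SIR values and their contrast-function labels, not the thesis's actual objects.

    results <- data.frame(
      sir      = c(18.2, 21.5, 19.9, 2.1, 3.4, 17.8, 20.3, 2.8),    # example SIR values
      contrast = factor(c("n", "n", "k", "l", "l", "k", "n", "l"))  # contrast-function labels
    )
    kruskal.test(sir ~ contrast, data = results)  # p < 0.05 rejects equal mean ranks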


Figure 9: The estimated mean SIR values of each contrast function with their corresponding comparison intervals. The contrast functions are abbreviated: n = negentropy, k = kurtosis, l = maximum likelihood estimation, j = joint entropy, i = mutual information, and c = cumulant tensor. The only significant differences are between l and n, and l and k.

4.4 The Optimization Algorithms

Finally, the optimization algorithms are considered. The Kruskal-Wallis test of whether the choice of optimization algorithm influences the estimation accuracy of ICA produced a p-value of 1.76248e-07, which is smaller than the significance level of 0.05. The null hypothesis can therefore be rejected in favour of the alternative hypothesis: the choice of optimization algorithm influences the estimative ability of ICA. Figure 10 visualizes the results of the Kruskal-Wallis test. It clearly shows two sets of algorithms, with the exception of Pearson's system, that differ significantly from one another but show no significant differences within each set. The first set, on the lower side of the SIR values, consists of the Infomax, natural gradient, and FOBI algorithms; the second set, on the higher side, consists of the GAMs, Newton's method, FastICA, and JADE algorithms. The algorithms in the latter set are therefore significantly better at estimating the independent components than those in the first. Note that the algorithms with the lowest SIR values correspond directly with the four ICA methods that have significantly lower SIR values.


Figure 10: The estimated mean SIR values of each optimization algorithm with their corresponding comparison intervals. The optimization algorithms are abbreviated: G = GAMs, I = Infomax, N = Newton's method, g = natural gradient, F = FastICA, P = Pearson's system, O = FOBI, and J = JADE. There are two groups that differ significantly from each other: the group containing the algorithms I, g, and O produces significantly lower SIR values than the G, N, F, and J algorithms. P does not differ significantly from either group.

The FOBI, natural gradient, and Infomax optimization algorithms performed worse than the other algorithms. Possible explanations are given below.

First, the FOBI algorithm. FOBI fails to estimate the source signals accurately for the simple reason that it assumes the kurtoses of the independent components to be distinct, which is the case if the source signals come from different distributions (Novey & Adali, 2008). The source signals used in this study, however, were all drawn from a uniform distribution, resulting in the low performance of FOBI. Source signals from the same distribution are not uncommon (Hyvärinen et al., 2001), so FOBI can only be used accurately in specific circumstances. Moreover, as the distributions of the source signals are unknown in practice, FOBI is unreliable.
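This kurtosis requirement is easy to illustrate. The hedged sketch below (assuming JADE::FOBI() and SIR()) mixes two identically distributed uniform sources, which share one kurtosis value, and then a uniform source with an exponential one, whose kurtoses are distinct:

    library(JADE)                            # FOBI() and SIR()
    set.seed(2)
    S_same <- cbind(runif(500), runif(500))  # two uniform sources: equal kurtoses
    S_diff <- cbind(runif(500), rexp(500))   # uniform plus exponential: distinct kurtoses
    A2 <- matrix(c(1, 0.6, 0.4, 1), 2, 2)    # illustrative 2 x 2 mixing matrix

    SIR(S_same, FOBI(S_same %*% t(A2))$S)    # expected: low, FOBI cannot separate these
    SIR(S_diff, FOBI(S_diff %*% t(A2))$S)    # expected: clearly higher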

Second, Infomax is useful when the mixed signals contain additional noise that needs to be filtered out; it needs a certain amount of information loss to function properly (Hyvärinen et al., 2001). The models used here, however, contain no noise, which may be why the Infomax algorithm failed to produce accurate estimations of the source signals.


Lastly, the natural gradient. The natural gradient algorithm is described in the literature as a reliable optimization algorithm (Yu et al., 2014). However, two of the three versions of the algorithm failed to accurately estimate the sources. The only difference between the versions was the choice of nonlinearities. It is therefore possible that the choice of nonlinearity has a significant effect on the effectiveness of the algorithm, but this should be tested further.

It would also have been interesting to test whether the interaction between the contrast functions and optimization algorithms has a significant impact. Unfortunately, the available R functions did not offer all possible combinations, making it impossible to test this interaction. Future research on this topic would be possible using either fewer contrast functions and optimization algorithms or by constructing more ICA methods.

Finally, it must be noted that the significant differences are only an indication of the possible impacts of the contrast functions and optimization algorithms on the estimative ability of ICA. After all, computer experiments provide neither an independent sample nor a guarantee that the generated data resemble real data (Soltana, Sabetzadeh & Briand, 2017). One therefore cannot simply use statistical tests to infer conclusions from them. The results do, however, serve as a strong indicator of the types of data that can be solved using ICA. Additional research using real-life data is needed to confirm the results.

4.5 The Effect Sizes

To test whether the optimization algorithm or the contrast function has the greater influence on the estimative ability of the ICA method, the effect sizes of the ICA methods, the contrast functions, and the optimization algorithms were calculated and compared. The effect size of a Kruskal-Wallis test can be computed as follows:

\[
E_R^2 = \frac{H}{(n^2 - 1)/(n + 1)}
\]

with $E_R^2$ the effect size, $H$ the Kruskal-Wallis test statistic (i.e. the chi-squared value), and $n$ the total number of observations. The effect size ranges from 0 to 1, with 0 indicating no relationship and 1 indicating a perfect relationship (Tomczak & Tomczak, 2014).

The chi-squared value obtained from the Kruskal-Wallis test of the ICA method groups is 58.8, which results in an effect size of 0.524. The chi-squared value of the contrast function groups is 16.52, with a resulting effect size of 0.1462. The chi-squared value of the optimization algorithm groups is 44.43, with an effect size of 0.3932.
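As a worked check of the formula (the total number of observations is not stated here; n = 114, i.e. 19 methods times 6 models, is an assumption):

    effect_size <- function(H, n) H / ((n^2 - 1) / (n + 1))  # equivalently H / (n - 1)
    effect_size(58.80, 114)   # ICA methods: ~0.520 (reported as 0.524)
    effect_size(16.52, 114)   # contrast functions: ~0.146 (reported as 0.1462)
    effect_size(44.43, 114)   # optimization algorithms: ~0.393 (reported as 0.3932)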

The effect size of the ICA methods group clearly shows that the choice of ICA method has a large influence on the estimative ability of ICA. Of course, the ICA methods consist of different combinations of contrast functions and optimization algorithms. As the effect size of the optimization algorithms is approximately twice as large as that of the contrast functions, the optimization algorithm clearly has a bigger influence on the accuracy of the estimations than the contrast function. This would explain why the three optimization algorithms that ranked significantly lower corresponded with the four lowest-scoring ICA methods, while the contrast functions showed no such clear relationship. Moreover, given the observed limitations of the FOBI, Infomax, and natural gradient algorithms, it is unsurprising that the optimization algorithms have the larger impact on the estimation accuracy. After all, the optimization algorithm dictates the manner of finding the estimations, while the contrast function is simply a measuring tool, albeit with its own (dis)advantages (Comon, 1994; Hyvärinen et al., 2001; Yu et al., 2014).


4.6 The Synthetic Data

The models used to test the accuracy of the ICA methods' estimations were all examples of severely overdetermined systems: the number of rows in the matrices S and X exceeds the number of columns. In a strictly mathematical sense, each row represents a different dimension. This research, however, used a slightly different interpretation: the rows of the matrices S and X represent the values of the source/mixed signals at a certain time (e.g. the seismic wave model) or location (e.g. the samples in the vegetation reconstruction model). While the synthetic data used did not contain any zero-valued entries, one would expect, especially in the vegetation reconstruction models, to find zero-valued as well as non-zero entries. It would be interesting to test whether zero-valued entries influence the accuracy of the estimations or perhaps change the general ranking of the ICA methods.

In this research, a heavily overdetermined system therefore means more sampling points than a less overdetermined (or evenly determined or underdetermined) system. It is, however, not uncommon to have fewer than 100 sampling points (as were used in the models) in real-life experimental setups. It was therefore interesting to test how a less overdetermined system, consisting of only ten measurements, affects the estimative ability of the ICA methods. Figure 13 in Appendix 3 shows that, once again, ICA method 19 surpasses method 3 in its ability to separate and accurately estimate the independent components. However, the SIR value of ICA method 19 is reduced tremendously, from an average SIR value above 20 to a mere 7.603081; the SIR value of method 3 is 1.111976, which hardly differs from the values obtained with the original models. The drastic decrease in the estimative ability of method 19 suggests that a larger dataset greatly improves the estimative ability of ICA: method 19 still retrieves the general outline of the source signals (Figure 13 in Appendix 3), but many of the nuances are lost.
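A hedged sketch of this small-sample check, under the same assumptions as the earlier snippets:

    library(JADE)
    set.seed(3)
    S10 <- cbind(runif(10), runif(10), runif(10))  # only ten sampling points
    A10 <- matrix(runif(9, 0.5, 2), 3, 3)
    SIR(S10, JADE(S10 %*% t(A10), n.comp = 3)$S)   # expected: far lower than with 100 points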

Another noteworthy difference between the synthetic data used and possible real-life experiments is the mixing matrix. All six models used a square mixing matrix in which all entries were non-zero and nonnegative. In practice, however, a mixing matrix may contain a certain number of zero values. To test whether ICA can still be applied with sparse mixing matrices, several different mixing matrices were constructed. The null matrix was obviously not considered, as it would imply that there are no mixed signals. The sparse mixing matrices tested contained one, two, three, and four non-zero values. The most noteworthy outcome is that the number of independent components that can be estimated equals the number of rows containing a non-zero value. If a row in the mixing matrix contains only zeros, the corresponding mixed signal in X will also contain only zeros, leaving one fewer mixed signal from which to obtain an independent component. Thus, the more empty rows a mixing matrix has, the fewer independent components the ICA methods can find.

Additionally, if there is only one non-zero entry per row, the only difference between the source signals and the estimated signals is their scale, which ICA cannot compute. ICA could then only be used superfluously to show that no mixing beyond scaling is present, based solely on the fact that the source signals, mixed signals, and estimated signals all have the same visual form. ICA can therefore only be effectively (and usefully) implemented if at least one row contains two or more non-zero values. The number of independent components that can be found depends on the number of filled rows; the empty rows, however, do not impede the ability of ICA to correctly estimate the independent components that can be separated.
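The observation can be sketched as follows (same illustrative setup as before): an all-zero row in the mixing matrix produces an all-zero mixed signal, leaving one fewer component to recover.

    set.seed(4)
    S3 <- cbind(runif(100), runif(100), runif(100))
    A_sparse <- matrix(0, 3, 3)
    A_sparse[1, 1:2] <- c(1, 0.5)   # two non-zero entries: genuine mixing, separable
    A_sparse[2, 3]   <- 1           # single non-zero entry: only a scaled copy of source 3
    X_sparse <- S3 %*% t(A_sparse)  # third row of A_sparse is empty...
    colSums(abs(X_sparse)) == 0     # ...so the third mixed signal is identically zero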

(28)

Using a square mixing matrix also greatly reduces the complexity of the computations needed in each ICA method. However, square mixing matrices are not strictly necessary, as long as the number of columns in the mixing matrix equals the number of columns (i.e. the number of individual source signals) in the source signal matrix S. Otherwise, matrix multiplication between the two matrices is impossible, and the mixed signal matrix X cannot be constructed. The number of columns in the mixing matrix is therefore restricted, but the number of rows is not. Of course, if there are fewer rows than source signals, not all source signals can be retrieved, as there will not be enough mixed signals to match the sources. Whether the number of rows in the mixing matrix equals or exceeds the number of source signals, however, does not matter for the estimation. After all, PCA is used to reduce the dimensions of the mixed signals, so the mixed signal matrix can be reduced in such a way that the number of estimated signals equals the number of source signals. A square mixing matrix is therefore not a requirement for the ICA methods to work correctly.
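A sketch of such a non-square (tall) mixing matrix, assuming fastICA::fastICA(), whose internal whitening/PCA step reduces five mixed signals back to three components:

    library(fastICA)
    set.seed(5)
    S3 <- cbind(runif(100), runif(100), runif(100))
    A_tall <- matrix(runif(15, 0.5, 2), 5, 3)  # five mixed signals from three sources
    X_tall <- S3 %*% t(A_tall)                 # 100 x 5 mixed-signal matrix
    est <- fastICA(X_tall, n.comp = 3)         # whitening/PCA reduces the dimensions
    dim(est$S)                                 # 100 x 3 estimated components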

Finally, it was tested whether ICA really cannot separate the mixed signals if not all sources are independent. The problem with non-independent source signals is that the rank of the matrix is reduced. In that case, the source signal matrix S and the mixed signal matrix X effectively have a number of columns equal to the number of independent components, which is less than the number of source signals. The maximum number of components that ICA can estimate equals the number of columns in X. Hence, if non-independent sources are present, ICA is unable to estimate all the source signals. It must be noted that not all ICA methods can even compute the source signals when non-independent sources are present; however, enough ICA methods are able to run and accurately calculate the separable sources.
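The rank argument can be verified directly: if the third source is a linear combination of the first two (an illustrative worst case), the mixed-signal matrix has rank two, so at most two components can be estimated.

    set.seed(6)
    s1 <- runif(100); s2 <- runif(100)
    S_dep <- cbind(s1, s2, s1 + s2)         # third source depends on the first two
    A_dep <- matrix(runif(9, 0.5, 2), 3, 3)
    qr(S_dep %*% t(A_dep))$rank             # 2, not 3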

In short, while ICA is a useful tool for separating mixed signals and retrieving the independent components, it does not work in every situation. As the source signals and mixing matrices are unknown in real-life experimental setups, it is impossible to verify whether ICA has separated the signals accurately. However, if the source signals are likely to be independent and non-Gaussian, and the mixing matrix contains no rows with only zeros, one can assume that most ICA methods can accurately estimate the independent components.

5 Conclusion

The research question was: "How does the choice in contrast function and optimization algorithm influence the estimative ability of independent component analysis in solving the blind source separation problem?" The Kruskal-Wallis tests showed that there is a significant difference in the SIR values of different ICA methods, contrast functions, and optimization algorithms. The choice of contrast function and optimization algorithm therefore influences the estimation accuracy. The effect sizes of the contrast functions and optimization algorithms showed that the choice of optimization algorithm has a more pronounced impact on the estimation accuracy than the choice of contrast function. Three optimization algorithms are significantly worse at accurately estimating the independent components: FOBI, natural gradient, and Infomax. The corresponding ICA methods are methods 3, 7, 8, and 18; consequently, methods 3, 7, and 8 also scored significantly lower than the other methods. The choice of contrast function and optimization algorithm thus determines the accuracy of the estimations, and thereby the estimative ability of ICA in solving the BSS problem. However, the unpredictability (i.e. lack of structure and periodicity) of the independent source signals also negatively influences the estimation accuracy. The ICA methods have been shown to be able to estimate the independent components with varying degrees of accuracy.
