Application of autoencoders on the analysis of Raman Hyperspectral Imagery

Academic year: 2021

Application of autoencoders on the analysis of Raman Hyperspectral Imagery

Layout: typeset by the author using LaTeX.

Application of autoencoders on the analysis of Raman Hyperspectral Imagery

with splitting Raman and photoluminescence signal components

Jonne J. Goedhart 11890169

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors
Dr. Vassilis M. Papadakis1, MSc. Ivan Sosnovik2

1Institute of Molecular Biology and Biotechnology
Foundation for Research and Technology Hellas
Nikolaou Plastira 100
GR-70013, Heraklion, Crete, Greece

&

2Institute of Informatics, UvA-Bosch Delta Lab
Faculty of Science, University of Amsterdam
Science Park 904
1098 XH Amsterdam
North-Holland, The Netherlands


Abstract

Raman spectroscopy is a very sensitive, non-invasive optical technique and therefore a popular technique to determine what a material or substance is composed of. Although Raman spectroscopy is useful, its signals are rather complex and require a lot of knowledge and skill to understand their meaning. Complexity increases even more when it is combined with imaging, creating Raman hyperspectral imagery, which contains spatial information beneficial to the end user. The common methodology for analysing such a data set is to use a non-negative least square algorithm combined with reference data. If reference data is not available, unsupervised algorithms are used to perform spectral clusterisation of similar areas. This can be achieved using PCA and k-means. One problem of such an approach is that a Raman signal consists of three sources: Raman, photoluminescence and noise. This creates the need for preprocessing methods that split this raw signal. The goal of this project is to improve the unmixing performance and successfully split the raw signal into a Raman and a photoluminescence signal. In this study, various unsupervised methods are explored to understand which methods work best for which type of microscopic RHSI data sets. Here, the focus is on using autoencoders for the data analysis and a Gaussian smoothing filter combined with polynomial fitting to split the data. The results show that the best method depends greatly on the application, but when a large dataset is available and accuracy and creating a model are the goals, the new splitting technique combined with an autoencoder performs best.

Keywords: Raman hyperspectral imaging, autoencoder, spectral signal separation


Acknowledgement

Foremost, I want to express my gratitude towards my two supervisors, Dr. Vassilis M. Papadakis and MSc. Ivan Sosnovik, for their effort, patience and for making open source research on this topic possible. Special thanks to Vassilis for the data acquisition, and with that providing the project with open source datasets, and for the help with understanding all the physics behind Raman hyperspectral imaging. Special thanks to Ivan for explaining the more complicated parts of machine learning and helping me find resources about them as well.

I also want to thank Vassilis and Ivan, as well as my mother and my friends – Sandra and Julius – for helping me correct my writing and supporting me during my bachelor thesis.


Table of Contents

List of Abbreviations vi

1 Introduction 1

2 Spectroscopy and Imaging 3

2.1 Raman Spectroscopy . . . 3

2.2 Hyperspectral Imaging . . . 5

2.3 Raman Hyperspectral Imaging . . . 7

2.3.1 Microscopic RHSI, its Spatial and Spectral Resolution . . . 7

2.3.2 Signal properties . . . 8

3 Machine Learning 9

3.1 General Machine Learning . . . 9

3.1.1 Machine Learning Core Problems . . . 9

3.1.2 Supervised, Unsupervised and Semi-Supervised Learning . . . 10

3.1.3 Distance Functions . . . 11

3.2 Clustering . . . 12

3.2.1 K-means . . . 12

3.3 Matrix Decomposition Methods . . . 15

3.3.1 PCA . . . 15

3.3.2 NMF . . . 18

3.3.3 NMF as a K-means Algorithm . . . 19

3.4 Neural Networks . . . 20

3.4.1 Layers . . . 21

3.4.2 Loss Function and Back Propagation . . . 22

3.4.3 Autoencoders . . . 23

4 Method 24

4.1 Signal Separation . . . 26

4.1.1 Polynomial Separation . . . 26

4.1.2 GSP . . . 26

4.2 Normalisation . . . 28

4.3 NFS . . . 28

4.4 PCA Method . . . 28

4.5 NMF Methods . . . 28

4.6 Autoencoder Methods . . . 29

4.7 Error Function . . . 30

5 Related work 31

6 Implementation 32

6.1 Required software . . . 32

6.2 Method Implementations . . . 32

7 Experiments 33

7.1 Datasets Explanation . . . 33

7.2 Splitting Algorithms . . . 33

7.3 Choosing the Hyperparameters for the Autoencoder . . . 35

7.4 Choosing the Hyperparameters of the NMF . . . 37

7.5 Data Processing Methods . . . 37

7.6 Cross Validation in a Data set and between Data sets . . . 39

7.7 Raman vs Photoluminescence and a Combination of Both . . . 40

8 Conclusion and Discussion 42

References 45


A Removing Noise from the Raw Signal Code 47

B GSP Code 48

C Autoencoder Class Code 49

D Instrument Details 50

E NFS Results with Different Preprocessing Methods 51

F Data Processing Method Results Using GSP Preprocessing 52

G Alina4 Component Spectra Results Using GSP Preprocessing 54


List of Abbreviations

AE autoencoder

AI artificial intelligence

DBSCAN density-based spatial clustering of applications with noise

EM expectation-maximisation

FTIR Fourier-transform infrared spectroscopy

FWHM full width at half maximum

GSP Gaussian smoothing polynomial fitting

HSI hyperspectral imaging

NFS no feature selection

NMF non-negative matrix factorisation

NN neural network

NNDSVD non-negative double singular value decomposition

NNLS non-negative least square

PCA principal component analysis

RGB red green blue

RHSI Raman hyperspectral imaging

SVD singular value decomposition

TIPP tensor image processing platform

VQ vector quantisation

1 Introduction

Raman spectroscopy is a non-invasive optical technique and therefore a popular technique to study e.g. mining samples and determine what material or substance they are composed of. The principle of Raman spectroscopy is to excite electrons in a molecule with a laser and measure the emitted electromagnetic waves – light – that those electrons release when going back to a lower energy state. When electrons return to a different excitation level than the one they originally came from, the light has a different frequency, which is called Stokes Raman scattering. These different frequencies together form a spectrum containing information about what molecules the sample is made of.

Although Raman spectroscopy is useful, for some samples, e.g. artworks, it would be better to have this spectrum for small areas or points in the painting rather than one spectrum for the whole painting. To achieve this, Raman spectroscopy can be combined with imaging to produce a hyperspectral imaging (HSI) data set. HSI is a technique where an image is taken, but instead of three colours, hundreds of colours are measured. Here, a colour should be seen as a measurement of a small part of the electromagnetic spectrum, which does not even have to be visible light. When Raman spectroscopy is combined with HSI, Raman frequencies are measured instead of colours, creating Raman hyperspectral imaging (RHSI).

The best way of analysing such a data set would be to know the reference Raman spectra that are present in the image beforehand and then perform unmixing. The goal of unmixing is to find, for each pixel, what it is composed of. This is useful because the user gets detailed information about which substance can be found where in the image. When there is no reference spectral information beforehand, unsupervised algorithms are needed to find the general components — component spectra — in the image, from which each pixel can be reconstructed.

Until now, the tensor image processing platform (TIPP) could be used to analyse HSI data and RHSI data [1]. TIPP provides an environment to study those datasets. The idea of TIPP is based on a sandbox architecture, where the researcher can apply various algorithms to explore their data and find appropriate solutions to their problems. For data processing, TIPP uses principal component analysis (PCA) for feature selection and then k-means for clustering, which gives the mean vectors — component spectra — that are then used in non-negative least square (NNLS) for unmixing. While this is a robust method, it does not use feedback loops, which means that the feature selection and clustering are independent but also do not learn from each other. One problem of such an approach is that a Raman signal consists of three sources: Raman, photoluminescence and noise. This creates the need for preprocessing methods that split this raw signal. TIPP does not provide any preprocessing algorithms for Raman hyperspectral data or signal splitting algorithms.

The goal of this project is to improve unmixing performance and successfully split the raw signal into a Raman signal and a photoluminescence signal. In this project, the focus is on a microscopic RHSI dataset. Various unsupervised methods are explored to understand which methods work best for which type of microscopic RHSI data sets, or for which applications of microscopic RHSI. The most widely used splitting method is a polynomial fitting method [2]. However, this method does not recover the photoluminescence signal and is not very accurate. Therefore, a more complicated but more accurate solution like multi-polynomial fit splitting has been suggested [2]. With the same idea, the Gaussian smoothing polynomial fitting (GSP) algorithm was developed. To improve the unmixing process, non-negative matrix factorisation (NMF) has been suggested instead of PCA and k-means [3], which alternates optimisation between the feature selection and the clustering. However, as neural networks (NNs) are increasingly popular in machine learning nowadays, an NN-based solution should also be included in this project. As demonstrated in [4], an autoencoder (AE), which is a semi-supervised NN, can be used for feature selection in hyperspectral images. Thus, several different methods, including an autoencoder, are proposed.
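The polynomial-fitting idea behind [2] can be sketched briefly. The snippet below is only an illustrative numpy sketch, not the thesis's GSP implementation: it repeatedly fits a low-degree polynomial and clips the signal down to the fit, so narrow Raman spikes stop pulling the baseline upwards and the fit settles on the broad background. The function name, degree and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def poly_baseline(signal, degree=5, n_iter=50):
    """Iteratively fit a polynomial baseline under a spectrum.

    After each fit, points above the current fit are clipped down to it,
    so narrow spikes stop influencing the polynomial and the fit
    converges towards the broad (photoluminescence-like) background.
    """
    x = np.linspace(-1.0, 1.0, len(signal))  # normalised axis for stable fitting
    y = np.asarray(signal, dtype=float).copy()
    for _ in range(n_iter):
        coeffs = np.polyfit(x, y, degree)
        fit = np.polyval(coeffs, x)
        y = np.minimum(y, fit)  # clip everything above the current baseline
    baseline = fit
    return baseline, signal - baseline  # (background estimate, spike residual)
```

The residual keeps the narrow spikes, while the returned baseline approximates the broad component; more elaborate schemes such as multi-polynomial fitting refine this basic loop.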


To find which method is most suitable for which application, the various unsupervised methods — including the original TIPP methods — are tested against several datasets. The tests conducted are: which splitting method works best, what the optimal hyperparameters are for an NMF and an AE, which data processing method is best, how well the methods work as a model for future data, and what information can be found in the Raman signal versus the photoluminescence signal. In section 3, a detailed explanation is given of all the machine learning algorithms used in this project.

2 Spectroscopy and Imaging

In this section, general information about spectroscopy and imaging is provided. Specifically, in section 2.1 Raman spectroscopy is explained in terms of the properties and the different components of the signal, and the advantages as well as the problems encountered when using Raman spectroscopy. Next, in section 2.2 the difference between HSI and normal RGB imaging is discussed, including how the signal is composed and how measurements are taken. Lastly, in section 2.3 the combination of Raman spectroscopy and hyperspectral imaging, and the properties of microscopic RHSI, like spatial and spectral resolution, are explained.

2.1 Raman Spectroscopy

Raman spectroscopy is a non-invasive optical technique whereby a laser with a specific frequency is used to excite electrons in the measured matter. This leads to four possible physical reactions that emit electromagnetic waves, which can be measured [5], see figure (1). Most of the laser energy is converted into Rayleigh scattering. Rayleigh scattering happens when an electron gets excited to a virtual energy state and then releases a photon with the same frequency as the laser to go back to the original energy state. This signal is always filtered out before measuring, because it does not contain any information about the sample. The rest is measured as one signal that consists of three components.

Figure 1: Energy-level diagram showing the possible reactions that emit electromagnetic waves used in Raman spectroscopy [6].

The source of the main component of this signal is absorption [5]. When absorption takes place, an electron goes to a higher energy state. Later, when the electron falls back to its initial vibrational energy state, this stored energy can be released in steps, in the form of photons of various lower-energy frequencies. This is known as fluorescence in living organisms or as photoluminescence in materials, and it forms the main component of the signal.

The other two components of the signal are the Stokes and anti-Stokes Raman scattering. Both are the result of the same reaction, whereby an electron gets excited to a virtual energy state but, instead of falling back to the original state, falls back to a lower (anti-Stokes) or higher (Stokes) vibrational energy state [5]. This gives the photon a different energy level, thus a different frequency than the frequency of the laser. The shift in frequency is called the Raman shift and is expressed in wavenumbers. This leads to a spectrum plot where Raman shift numbers are plotted against intensity. The Raman shift can also be converted to wavelengths using equation (1).

\[
\Delta\tilde{\nu} = \frac{1}{\lambda_0} - \frac{1}{\lambda_1}
\;\Rightarrow\;
\lambda_1 = \left(\frac{1}{\lambda_0} - \Delta\tilde{\nu}\right)^{-1}
\tag{1}
\]

Where ∆˜ν stands for the Raman shift in cm−1, λ0 is the wavelength of the excitation laser and λ1 is the Raman spectrum wavelength in nm. To use equation (1) the units need to be equalised, therefore the Raman shift should first be divided by 10^7 to convert it from cm−1 to nm−1.
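Equation (1), including the unit conversion, fits in a few lines of code. The function name is a hypothetical one chosen for illustration:

```python
def raman_shift_to_wavelength(shift_cm1: float, laser_nm: float) -> float:
    """Convert a Raman shift (cm^-1) to an absolute wavelength (nm), eq. (1).

    The shift is divided by 1e7 to convert cm^-1 to nm^-1 before combining
    it with the excitation wavelength.
    """
    return 1.0 / (1.0 / laser_nm - shift_cm1 / 1e7)

# e.g. a 1000 cm^-1 Stokes shift of a 532 nm laser lands near 561.9 nm
```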

The advantage of using Raman spectroscopy is the detailed information it provides regarding the sample’s chemical composition. A sample consists of many molecular structures, where each molecular structure contains electrons that have different energies and polarisabilities. This polarisability determines how a molecular structure reacts to the laser, and the frequency of the emitted photons determines the energy of the laser interaction, which is Raman scattering. These different frequencies lead to detailed information about the different molecular structures in the sample, making it possible to distinguish or identify different materials. Examples of molecular structures are double or triple carbon bonds, which both give Raman spikes in the signal at different wavenumbers [7].

One of the problems of Raman spectroscopy is differentiating the component signals from each other. This is even more difficult because the photoluminescence is usually stronger in intensity than the Raman signal; fortunately, the characteristics of the signals are different. Theoretically, a pure Raman signal consists of high-intensity spikes with a full width at half maximum (FWHM) of generally less than 30 wavenumbers, which can be formulated as seen in equation (2). A theoretical pure photoluminescence signal, however, consists of a high-intensity signal with a broad FWHM of hundreds or thousands of wavenumbers. This makes a rough estimate of a split possible, see figure (2) [2].

\[
r(\Delta\tilde{\nu}) =
\begin{cases}
G(\Delta\tilde{\nu}), & \text{if there is a spike} \\
0, & \text{otherwise}
\end{cases}
\tag{2}
\]

Where r(∆˜ν) is the Raman spectroscopy signal in wavenumbers and G(∆˜ν) is a Gaussian function with a small σ, creating an individual spike, and with a µ in the middle of the spike. Normally a Lorentzian distribution is used to model the spikes, but in this case the instrument used for Raman signal acquisition had a spectral resolution of less than 1 cm−1. This makes a Gaussian distribution, which is very similar to a Lorentzian distribution, a valid model.

Figure 2: Example of splitting a raw signal without noise into the Raman and photoluminescence signal components. The difference in FWHM is clearly visible, where the Raman has high spikes and a small FWHM and the photoluminescence is a broad signal with a more constant intensity.
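The FWHM values discussed above relate to the σ of the Gaussian in equation (2) through FWHM = 2√(2 ln 2) σ ≈ 2.355 σ. The sketch below illustrates this; the function names are illustrative, not taken from the thesis code:

```python
import numpy as np

FWHM_PER_SIGMA = 2.0 * np.sqrt(2.0 * np.log(2.0))  # ≈ 2.355

def gaussian_spike(wavenumbers, center, fwhm, amplitude=1.0):
    """Model a single band as a Gaussian with a given FWHM, as in eq. (2)."""
    sigma = fwhm / FWHM_PER_SIGMA
    return amplitude * np.exp(-((wavenumbers - center) ** 2) / (2 * sigma**2))

def measure_fwhm(wavenumbers, signal):
    """Numerically measure the full width at the half-maximum level."""
    above = wavenumbers[signal >= signal.max() / 2]
    return above[-1] - above[0]
```

A narrow Raman spike (FWHM of e.g. 20 wavenumbers) and a broad photoluminescence band (FWHM of hundreds of wavenumbers) can both be generated this way, which is how the example split in figure (2) can be reproduced qualitatively.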

2.2 Hyperspectral Imaging

HSI is a noninvasive optical technique similar to making an RGB image with a photo camera or the human eye. RGB images and human eyes perceive three colours or colour bands: short (blue), medium (green) and long (red) wavelengths. These bands consist of broad spectral curves that overlap. For example, the red spectral signature is a bell-shaped curve ranging from around 500 to 700 nm, while the green spectral signature is a bell-shaped curve ranging from approximately 450 to 650 nm, which overlaps heavily with the red [8].

There are four main differences between HSI and RGB. First, HSI uses significantly more bands, commonly at least 30. Second, these bands are narrower, 1 to 10 nm wide, and third, they form a continuous spectrum. When the bands do not form a continuous spectrum and the number of bands is lower than 30, the technique is no longer called HSI but multispectral imaging [9]. The last difference is that HSI is not bound to the visible spectrum; it can, for example, also take measurements in the ultraviolet or infrared part of the electromagnetic spectrum.

Figure 3: Example of a pixel’s spectrum in an HSI. Each point in the plot is a measurement, which makes the bands in this plot approximately 6 nm wide. This spectrum covers the visible spectrum and a part of the near-infrared spectrum.

Both RGB imaging and HSI have the purpose of finding, identifying or detecting different objects or materials, but RGB images rely mostly on spatial information whereas HSI focuses mostly on the spectral information. This is possible due to the differences between HSI and RGB. The smaller bandwidth of each band means that with HSI, small increasing or decreasing spikes in intensity can be measured. These spikes can often be correlated to the presence or absence of certain particles, which can be used e.g. in mineralogy [10]. The smaller bandwidth combined with the higher number of bands leads to a more detailed description of the spectral features. This means that, for example, different kinds of red pigments can be distinguished from each other, which would not be possible in an RGB image and which is useful for many applications including art conservation [11]. Extending the range of the electromagnetic spectrum could also help with such a task.


There are four theoretical methods to create a hyperspectral image: non-scanning, spectral scanning, spatial scanning and spatio-spectral scanning. Non-scanning is equivalent to a "normal" camera, where the whole image is taken at once, or in other words, for each pixel the whole spectrum is measured at the same time. For HSI it captures the variables (x, y, λ) at the same time, where x is the width, y is the height and λ is the wavelength. With spectral scanning, each wavelength is measured separately, thus it captures the variables (x, y) at the same time. The third possibility is spatial scanning. This can be done in two ways: a point scan (λ) or a line scan (x, λ), the latter being similar to a "normal" RGB scanner. Here the spectrum for each measured pixel is captured at the same time [9]. Finally, spatio-spectral scanning is a combination of spatial and spectral scanning, whereby measurements are done along each axis, but not over the whole image at once like non-scanning.
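The four acquisition modes can be pictured as slices of the (x, y, λ) datacube. A hypothetical numpy cube makes the variables captured at one moment explicit (the 64 × 64 × 300 dimensions are invented for illustration):

```python
import numpy as np

# Hypothetical datacube: 64 x 64 pixels, 300 spectral bands, axes (x, y, lambda)
cube = np.zeros((64, 64, 300))

frame    = cube[:, :, 0]   # spectral scanning: one band for all pixels -> (x, y)
point    = cube[0, 0, :]   # spatial point scan: full spectrum of one pixel -> (lambda,)
line     = cube[0, :, :]   # spatial line scan: spectra of one row -> (y, lambda)
snapshot = cube            # non-scanning would capture the whole cube at once
```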

Only three of the four methods mentioned above are used in practice. The reason is that non-scanning is too complex to realise given the number of optical parts that are necessary; for example, x · y · λ sensors are already needed. The easiest system to realise, in terms of optical components, is a spatial scanning device, which is just a spectrometer that takes multiple measurements in the x and/or y direction.

Figure 4: The different acquisition techniques for hyperspectral imaging. Each datacube (x, y, λ) represents the section of the cube that is acquired at the same moment in time [12].

HSI is successfully applied in various research fields such as agriculture, medicine, food processing, mineralogy and art conservation, where it is used instead of RGB for the reasons mentioned in the examples above. The main disadvantages of HSI are the cost of the acquisition system, the time needed to scan and analyse the data, and the storage and (pre)processing of the data because of its large volume.

The measured intensity for hyperspectral or RGB images always comes from reflected electromagnetic waves (light). While this is useful to find out certain properties of the measured substance, there are also other noninvasive optical technologies to measure a surface, such as Raman spectroscopy, Fourier-transform infrared spectroscopy (FTIR) [13] or any other similar analytical technique.

2.3 Raman Hyperspectral Imaging

While Raman spectroscopy is a useful tool to get general information about a specimen, combining it with imaging gives additional information about where the different substances can be found in the specimen. Combining both methods seems quite straightforward. HSI refers to an image where each pixel has a spectral signature: reflected light intensity plotted against a part of the electromagnetic spectrum. With RHSI, an image is made whereby each pixel contains a spectral signature of Raman scattering intensity plotted against a part of the electromagnetic spectrum or wavenumbers. In general, imaging can be combined with many different types of spectroscopy, especially when using point scans.

2.3.1 Microscopic RHSI, its Spatial and Spectral Resolution

A microscope can also be used with RHSI, creating a microscopic image [5]. This gives a higher spatial resolution – pixels per inch (ppi) – and a smaller spot size – the size of the area measured – depending on the magnification of the microscope. The pixel size itself is not important for data analysis; what matters is the area that is measured. This is why spatial resolution is used, which stands for independent/non-overlapping pixels per unit length. For most systems, the spot size is bigger than the pixel size. This creates partly shared information between pixels. However, when using point scans the spot size can be smaller than the pixel size. This means that there is space between the pixels that is not measured, see figure 5 [14].

Figure 5: Spot size vs pixel size. Left, the spot size, in red, is smaller than the pixel size, leaving unmeasured gaps. Right, the spot size, in green, is larger than the pixel size. The dark green shows where spot sizes overlap, causing partly shared information between pixels.

When analysing the data it is important to keep this in mind, because when there is unmeasured space between pixels there is no overlapping information between neighbouring pixels, and therefore neighbouring pixels cannot be used to extract information from an area. Another important aspect to keep in mind is the size of the substance that needs to be distinguished versus the pixel size. When the substance size is similar to the pixel size, neighbouring pixels cannot be used to determine what type of substance a pixel contains. Together these concepts create a problem where neighbouring pixels cannot be used, which means that for this type of dataset spatial information cannot be used during data analysis.

Another key aspect is the spectral resolution, especially when spatial information cannot be used, as this leaves only the spectral information for analysis. The spectral resolution consists of two parts: the width of the bands and the number of bands. With a smaller band width, the system is better at detecting spikes in the data, which is important for RHSI. The number of bands must be high to create a continuous spectrum with small bands that still covers a broad range of wavenumbers.


2.3.2 Signal properties

The raw data from microscopic RHSI is a hypercube with three axes (x, y, ∆˜ν), which gives an intensity for each pixel (x, y) and wavenumber (∆˜ν). This raw signal consists of the photoluminescence and Raman signals, as explained in section 2.1, and noise. This leads to equation (3), where r is the Raman signal, a the photoluminescence signal and ε the noise.

\[
f(x, y, \Delta\tilde{\nu}) = r(x, y, \Delta\tilde{\nu}) + a(x, y, \Delta\tilde{\nu}) + \varepsilon(x, y, \Delta\tilde{\nu})
\tag{3}
\]

\[
r, a \geq 0
\tag{4}
\]

\[
\varepsilon \ll f, \quad \forall x, y, \Delta\tilde{\nu}
\tag{5}
\]

\[
\varepsilon \ll a, \quad \forall x, y, \Delta\tilde{\nu}
\tag{6}
\]

This splitting of the raw signal has some constraints, because the measured intensity should be, from a physical point of view, the number of photons emitted by the sample at a certain place. Photons cannot have negative energy, which means that a and r should both be non-negative, as stated in equation (4). Furthermore, the raw signal and the photoluminescence should be orders of magnitude larger than the noise, see equations (5) and (6). This is also the case for the Raman signal, but because of equation (2) this results in equation (7) [2].

\[
\begin{cases}
\varepsilon \geq r, & \text{if } r(x, y, \Delta\tilde{\nu}) = 0 \\
\varepsilon \ll r, & \text{otherwise}
\end{cases}
\tag{7}
\]

The last properties of r, a and ε are their FWHMs, explained in section 2.1 and illustrated in figure (2): a has a high FWHM, r a medium FWHM and ε a small FWHM. A medium FWHM is generally less than 30 wavenumbers, while a high FWHM is more than 100 wavenumbers and a small FWHM is in the neighbourhood of one wavenumber.
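The composition in equation (3) can be made concrete with a small synthetic example. All band positions, widths and amplitudes below are invented for illustration; they merely respect the stated constraints (non-negative r and a, noise far below f and a, and the FWHM scales of the three components):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.linspace(0, 3000, 3000)  # wavenumber axis (cm^-1)

# a: broad photoluminescence band (FWHM of hundreds of wavenumbers)
a = 100.0 * np.exp(-((w - 1400) ** 2) / (2 * 400.0**2))

# r: a few narrow Raman bands (FWHM < 30 wavenumbers), zero elsewhere
r = sum(amp * np.exp(-((w - c) ** 2) / (2 * (fwhm / 2.355) ** 2))
        for c, fwhm, amp in [(520, 10, 30.0), (1350, 25, 20.0)])

# eps: low-amplitude noise, far below f and a (equations (5) and (6))
eps = rng.normal(0.0, 0.5, w.shape)

f = r + a + eps                           # equation (3)
assert (r >= 0).all() and (a >= 0).all()  # equation (4)
```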

Not only these properties of the raw signal, but also the dimensions of an RHSI make it hard to process the data by hand. Therefore, several machine learning algorithms are commonly used to assist researchers in analysing RHSI.

3 Machine Learning

In this section, general information about machine learning is provided, along with an overview of the algorithms used in this project. Firstly, a general description of machine learning, its core problems and distance functions is given. Then, an explanation is provided of when to use supervised or unsupervised learning and what the differences between both are. Next, two machine learning tasks, clustering and matrix decomposition, are discussed, as well as a few algorithms that can be used to solve them. Finally, neural networks are explained, together with their relation to clustering and matrix decomposition.

3.1 General Machine Learning

Machine learning is a part of artificial intelligence that focuses on making models which can make predictions (regression) or decisions (classification) based on some input, while not being specifically programmed in such a way. More mathematically, a machine learning algorithm can be described as a function f(x) = y′, with an input vector x and an output vector y′, while the exact form of f is formed during the training phase and is not preprogrammed [15]. Three phases can be considered when making a machine learning algorithm. The first phase is to build or program the algorithm and set the hyperparameters. These are parameters that determine the architecture of the algorithm and do not change during the training phase. Later in this chapter, they will be discussed for each algorithm; an example could be the size of the input vector x. After setting the hyperparameters, an instance of the algorithm can be made. This instance cannot predict anything at this point, because only the architecture of the algorithm is made, containing random values for all the parameters.

The second phase is the training or learning phase. Here training data, a set of input vectors, is put into the algorithm to change the variables of f such that the algorithm can predict y. The problem here is that the training data only contains a small part of all the possible input vectors. If the algorithm only learns to predict the training data correctly, then the model will be useless. Therefore, the goal in this phase is to find general information in the training data, which can be used to predict all the data correctly. The last phase is to use the algorithm and to verify it. The verification can be done with test data, a different set of input vectors which are not used during training. The test data is put through the algorithm; if the algorithm predicts the outcomes correctly, the algorithm is verified.

While this is the general structure for each machine learning algorithm, there are many types of machine learning models. In this project, two approaches are considered: a closed-form approach and a learning approach. The difference between them is that a closed-form approach calculates the parameters in the second phase, whereas a learning approach starts with randomly assigned parameter values and learns the optimal parameter values iteratively in the second phase.

3.1.1 Machine Learning Core Problems

Two main problems of learning models are overfitting and underfitting [16]. Overfitting happens when the algorithm does not generalise enough: it learns the training data perfectly but does not predict the test data correctly. Underfitting is the opposite: the algorithm does not learn enough from the data and performs worse on both the training and test data. Later in this chapter, it will be explained how, assuming the problem is not in the data, both can be solved by choosing the correct hyperparameters.

Two main problems can occur in the data. The first one is that there is no pattern in the data. It is important to understand that all machine learning algorithms in essence search for patterns in the data [17]; when there are none, the algorithm will just randomly choose an outcome. The second problem is that the data is not suited for the algorithm. This can be solved by preprocessing the data [18], which on its own can be a machine learning algorithm.


3.1.2 Supervised, Unsupervised and Semi-Supervised Learning

Generally, there are two approaches to learning when making a machine learning model: supervised and unsupervised [19]. The main difference between them is how they know whether what they are learning is correct, and therefore how they learn. This makes the two approaches suitable for different situations. With supervised learning, the training and test data are annotated or labelled, meaning that not only the data (x) is given but also the label (y); thus it is known what the output should be. This makes it possible to let the algorithm learn what the correct outcome is by comparing y and the outcome of the model (y′). In general, the goal is to optimise the outcome of this comparison. For example, the comparison can be the distance between both vectors: \(\sqrt{\sum (y' - y)^2}\).

When using unsupervised algorithms, y is not known and therefore it is harder to check whether the algorithm works correctly. These algorithms cannot use the same metrics as supervised learning, so they have to train on different metrics. For example, when the data needs to be clustered, a useful metric can be to minimise the distance between the data points within one cluster.

Supervised learning is most of the time the preferred learning model, because the learning is more straightforward and it is clear during verification whether the algorithm works or not. However, the problem is that labelled data cannot always be used, which can have two reasons. The first reason is that the goal is not about predicting but about finding some general features in the data. For example, suppose a new clothing line of small, medium and large T-shirts has to be designed. To find these T-shirt sizes, three measurements have to be found with the constraint that most people will fit one of the sizes, or in other words, the distance between the data – people's T-shirt sizes – and the closest measurement should be minimised. The second reason is that labelled data is not available, because labelling data is expensive and takes a lot of time. If experts are needed to label the data, the costs rise even more.

This can be a problem for some tasks where supervised learning would be the ideal choice, but is not possible due to the second reason above. In that case another method is sometimes possible: a supervised algorithm is used, but instead of y, the input x is also used as the output vector. This is called semi-supervised learning.


3.1.3 Distance Functions

In this project, several different distance functions or metrics are used. For clarity, each metric and the differences between them are discussed here. We use metrics induced by vector and matrix norms, which cover a large set of widely used metrics.

In this project, two distance functions are used for vectors. The first is the L1 norm, also called the Manhattan distance, see equation (7). The second is the L2 norm, also called the Euclidean distance, see equation (8).

$$L_1(x) = \|x\|_1 = \sum_{i=1}^{N} |x_i| \qquad (7)$$

$$L_2(x) = \|x\|_2 = \sqrt{\sum_{i=1}^{N} x_i^2} \qquad (8)$$

There are two types of norms for matrices: matrix norms (with the same naming as vector norms) and "entrywise" matrix norms. Only "entrywise" matrix norms are used in this project. These norms can be generalised to the $L_{p,q}$ norm, see equation (9). The "entrywise" matrix norm should not be confused with the vector norm; the two can be distinguished by the subscripts in the equations.

$$L_{p,q}(x) = \|x\|_{p,q} = \sqrt[q]{\sum_{i=1}^{N} \left( \sum_{j=1}^{M} x_{ij}^p \right)^{q/p}} \qquad (9)$$

In this project, two versions are mainly used: the Frobenius norm or $L_{2,2}$, see equation (10), and the $L_{1,1}$ norm, see equation (11).

$$L_F(x) = \|x\|_F = \|x\|_{2,2} = \sqrt{\sum_{i=1}^{N} \sum_{j=1}^{M} x_{ij}^2} \qquad (10)$$

$$L_{1,1}(x) = \|x\|_{1,1} = \sum_{i=1}^{N} \sum_{j=1}^{M} |x_{ij}| \qquad (11)$$
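As a concrete illustration (a sketch, not code from this project), these norms can be computed directly in NumPy:

```python
import numpy as np

# Vector norms from equations (7) and (8).
def l1(x):
    """Manhattan distance: sum of absolute components."""
    return np.sum(np.abs(x))

def l2(x):
    """Euclidean distance: square root of the sum of squares."""
    return np.sqrt(np.sum(x ** 2))

# Generalised entrywise matrix norm from equation (9); (10) and (11)
# are the special cases p = q = 2 and p = q = 1.
def lpq(x, p, q):
    return (np.sum(np.sum(np.abs(x) ** p, axis=1) ** (q / p))) ** (1 / q)

x = np.array([3.0, -4.0])
A = np.array([[1.0, -2.0], [2.0, 2.0]])
print(l1(x))          # 7.0
print(l2(x))          # 5.0
print(lpq(A, 2, 2))   # Frobenius norm of A
print(lpq(A, 1, 1))   # L_{1,1} norm of A
```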


3.2 Clustering

Clustering is one of the basic tasks of machine learning. The goal of clustering is to find data points that are related, according to a distance metric, and divide the dataset into k groups/clusters. Clustering is always an unsupervised learning model, although the final result can be used for classification or regression. The reason clustering is always unsupervised is that the goal is to find a new set of labels, which would not be possible if the data already had labels. However, labelled data can be used to check whether a clustering algorithm works for a certain type of dataset.

There are many types of clustering algorithms, each with its own purpose and goal. For most algorithms, two things are important: how the clusters are formed and which distance metric is used. Examples of clustering models are density models like density-based spatial clustering of applications with noise (DBSCAN) [20], centroid models like k-means, see section 3.2.1, and distribution models like expectation-maximisation (EM) [15].

Some clustering algorithms use hard clustering or one-hot encoding, meaning a data point either belongs to a cluster or it does not. Other models use soft clustering or fuzzy clustering, where each data point belongs to a cluster to a certain degree, expressed as probabilities or weights. DBSCAN and k-means are both hard clustering, while EM is soft clustering. Another difference between clustering algorithms is how clusters are defined: k-means and EM, for example, need just a couple of parameters to define a cluster, while DBSCAN needs all the original data points.

3.2.1 K-means

K-means is a clustering algorithm based on finding data points that lie close together, which form a cluster with a centre. This explains the name k-means: find k means. In short, k-means works as follows: each data point is assigned to the closest cluster mean, and the mean of a cluster is calculated as the mean of all data points currently assigned to that cluster. The goal is then to minimise the sum of the distance metric between all data points and their cluster means. A few variations are possible when implementing k-means; here the simplest variant is discussed, with the squared Euclidean distance as distance metric.

There are four steps in k-means to minimise J, given by equation (12). Here the number of data points is N, the number of dimensions is D and the number of clusters is K. $x \in \mathbb{R}^{N \times D}$ is a matrix containing the data points and $\mu \in \mathbb{R}^{K \times D}$ contains the cluster means. $r \in \mathbb{R}^{N \times K}$ is a one-hot matrix, see equation (14). Each row of r contains one 1 and K − 1 zeros, because each row belongs to a data point, which belongs to exactly one cluster. All the equations in this subsection are based on [15].

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|_2^2 \qquad (12)$$


Figure 6: K-means, with k = 3, applied to the iris plants dataset [21]. On the x-axis is the petal length in cm and on the y-axis the petal width in cm. Panels: (a) original data before k-means; (b) initialisation of the means; (c) assigning each data point to the closest mean; (d) results after the first iteration, after checking convergence; (e) results after the second iteration; (f) final result of k-means, when convergence is achieved. From subfigure (c) on, the data is coloured according to the matrix r, where each cluster is represented by one colour, see equation (14).


The k-means algorithm works as follows. First, the means $\mu$ are initialised for each cluster, which can be done by randomly taking K different data points. Next, the distance from each cluster mean to each data point is calculated and stored in a matrix M, see equation (13). Then r is updated by placing, for each data point, a one at the position of the lowest value in that data point's row of M, see equation (14).

$$M = \begin{pmatrix} \|x_0 - \mu_0\| & \|x_0 - \mu_1\| & \dots & \|x_0 - \mu_k\| \\ \|x_1 - \mu_0\| & \|x_1 - \mu_1\| & \dots & \|x_1 - \mu_k\| \\ \vdots & \vdots & \ddots & \vdots \\ \|x_n - \mu_0\| & \|x_n - \mu_1\| & \dots & \|x_n - \mu_k\| \end{pmatrix} \qquad (13)$$

$$r_{nk} = \begin{cases} 1 & \text{if } k = \operatorname{argmin} M_n \\ 0 & \text{otherwise} \end{cases} \qquad (14)$$

The next steps are run until J converges or the number of iterations exceeds the preset maximum. They consist of alternately optimising $\mu$ and r. $\mu$ is optimised by fixing r and minimising J. This can be done by setting the derivative of J with respect to $\mu_k$ to zero as follows:

$$0 = \frac{\partial J}{\partial \mu_k} = 2 \sum_{n=1}^{N} r_{nk}(x_n - \mu_k) = 2 \sum_{n=1}^{N} r_{nk} x_n - 2 \sum_{n=1}^{N} r_{nk} \mu_k = 2 \sum_{n=1}^{N} r_{nk} x_n - 2 \mu_k \sum_{n=1}^{N} r_{nk} \qquad (15)$$

This leads to the expression for the cluster mean:

$$\mu_k = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}} \qquad (16)$$

Therefore, the cluster mean is the mean of all data points that belong to that cluster. Next, r is recalculated as described above: first M is updated, see equation (13), then r, see equation (14). K-means is a good algorithm when data points form ball-shaped clusters. An example of a dataset appropriate for k-means is one consisting of three substances and some noise; in such a case the means of k-means can even be used to reconstruct the original three substances. K-means does not work as well when the data follows a pattern that is not ball-shaped, see figure 7.
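The four steps above can be sketched in NumPy as follows. This is an illustrative implementation, not the exact code used in this project; the `init` parameter is an addition for reproducibility (the text initialises with K random data points).

```python
import numpy as np

def kmeans(X, K, n_iter=100, init=None, seed=0):
    """Minimal k-means following equations (12)-(16)."""
    rng = np.random.default_rng(seed)
    # Initialise the means with K distinct data points (or a given init).
    mu = init if init is not None else X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # M: squared Euclidean distance from every point to every mean (13).
        M = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        # r: one-hot assignment of each point to its closest mean (14).
        r = np.zeros_like(M)
        r[np.arange(len(X)), M.argmin(axis=1)] = 1
        # New means: average of the points assigned to each cluster (16).
        # (Assumes no cluster ends up empty.)
        new_mu = (r.T @ X) / r.sum(axis=0)[:, None]
        if np.allclose(new_mu, mu):   # convergence of J
            break
        mu = new_mu
    return mu, r

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
mu, r = kmeans(X, K=2, init=np.array([[0.0, 0.0], [10.0, 10.0]]))
print(mu)   # means of the two ball-shaped groups
```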

Figure 7: Three examples of the outcome of k-means. Circles with a black outer rim are the cluster means and each colour represents a cluster. In case A, the shape of the clusters that are found is incorrect because the pattern in the data should be an outer and inner circle. In case B, the data points are not that badly clustered because most of them are in the crescent shape, but the means are not on the crescent. C is a case where k-means works and there are no problems.


3.3 Matrix Decomposition Methods

Another task of machine learning is handling datasets with high dimensionality and making them more accessible. Accessible in this context means that the data becomes easier to visualise, because data with high dimensionality is difficult to display. High-dimensional data can also be hard to label, and when machine learning algorithms process high-dimensional data, the time and complexity increase. Therefore algorithms are needed to compress the data into a lower number of dimensions. Other terms commonly used instead of data compression are feature selection or noise reduction. They all share the same idea: take the important information from the data, store it in fewer dimensions and remove the rest, such as noise.

One possibility to achieve this task is to use matrix decompositions, whereby one matrix of the matrix product contains a low number of dimensions and all the information. PCA and NMF are two algorithms that are often used to solve this task.

3.3.1 PCA

PCA is a matrix decomposition algorithm that rotates the coordinate system in such a way that most of the information is stored in the first axis. Here, the term information means where data points differ the most. Thus PCA rotates the coordinate system so that the data has the largest possible standard deviation (σ) along the first axis. The second axis then has the largest standard deviation possible after the first axis is determined. This continues for all axes, leaving the last axes with the smallest possible standard deviation in the data. Rotating the coordinate system this way serves two purposes. First, the information in the data gets compressed into the first few axes. This makes feature selection possible, because only a couple of axes/features are then needed to represent the data. However, this only works under the assumption that the data is highly correlated, because correlated features can be combined into one feature. For example, in figure (9a) x and y have a correlation of almost one and can be expressed as z + ε = 2x − y. When the coordinate system is rotated in the direction of z, only one axis contains information; the other axis contains only errors. A typical distribution of the information after PCA is applied can be seen in figure (8).

This leads to the second purpose, which is removing noise. Almost all measured data contains errors and/or noise, which can be found in each axis/feature. Noise can be removed with PCA under two circumstances: the signal-to-noise ratio should be high, meaning that only a small percentage of each feature is noise, and the noise should be independent of any features or variables in the data. This is important because the coordinate system after PCA contains independent axes, so the noise will be gathered in separate axes. Moreover, the noise will not end up in random axes but specifically in the last axes, because the noise has a low standard deviation and the signal a high standard deviation, which means the algorithm will place the axes with noise last.


Let $X \in \mathbb{R}^{N \times D}$ be a dataset with N data points, where $X_i$ is a data point with dimensionality D, indexed by j. The PCA algorithm consists of four parts. The first part is mean-normalising the data, see equation (17), where $X_j$ is one feature or axis. This is needed for later parts of the algorithm.

$$X_j = X_j - \mu_{X_j} \quad \forall j \in D \qquad (17)$$

In the second part, the covariance matrix is calculated, see equation (18).

$$\operatorname{Cov}(X, X) = E[(X - E[X])(X - E[X])^T] = E[(X - \mu_X)(X - \mu_X)^T]$$
$$\operatorname{Cov}(X, X) = E[XX^T], \text{ see equation (17)}$$
$$\operatorname{Cov}(X, X) \sim XX^T \qquad (18)$$

This gives the underlying structure of the data and how the data is correlated, because the covariance matrix contains all the variances between the features of the data. The covariance matrix is also tightly connected to the correlation matrix, see equation (19), where Corr(x, y) is the correlation between x and y, and Cov(x, y) is the covariance between x and y.

$$\operatorname{Corr}(x, y) = \frac{\operatorname{Cov}(x, y)}{\sigma_x \sigma_y} \qquad (19)$$

Now the eigenvalues and eigenvectors of the proportional covariance can be calculated. The eigenvectors give the direction of the correlation in the data, while the eigenvalues are proportional to the standard deviation and can therefore be used to reorder the axes from features with a large standard deviation to features with a low standard deviation. The rotation matrix ($V \in \mathbb{R}^{D \times D}$) is the result of reordering the eigenvectors in a matrix.

The last step is slicing the rotation matrix by extracting the desired number of dimensions (K), creating the reduced transformation matrix W, see equation (20). By multiplying the data with the reduced transformation matrix, the end result $H \in \mathbb{R}^{N \times K}$ is calculated, where $H_i$ is a data point with the new dimensionality K, see equation (21). An example of PCA can be found in figure (9).

$$W = (V_{jk})_{j \in D,\; 1 \le k \le K} \qquad (20)$$

$$H = XW \qquad (21)$$

There are many more algorithms to calculate PCA; for example, singular value decomposition (SVD) can also be used. All these algorithms share the same restrictions: the data gets centred around zero, the new axes are all orthonormal and independent from each other, and PCA focuses on the largest differences in the data. Because the data is centred and the new axes are orthonormal, the new data will in general contain negative numbers, which can also be observed in figure (9).
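The four parts described above can be sketched in NumPy (taking the eigendecomposition route; an SVD-based implementation would work equally well). This is an illustration, not the project's code:

```python
import numpy as np

def pca(X, K):
    """PCA via eigendecomposition of the proportional covariance matrix."""
    X = X - X.mean(axis=0)                   # (17) mean-normalise each feature
    C = X.T @ X                              # (18) proportional covariance (features x features)
    eigval, eigvec = np.linalg.eigh(C)       # eigh: C is symmetric
    V = eigvec[:, np.argsort(eigval)[::-1]]  # rotation matrix, largest variance first
    W = V[:, :K]                             # (20) reduced transformation matrix
    return X @ W, W                          # (21) H = XW

# Two perfectly correlated features: y = 2x exactly.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
H, W = pca(X, K=1)
print(W[:, 0])   # direction proportional to [1, 2] (up to sign)
```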


Figure 9: Data plotted in blue containing two variables x and y with a correlation of almost 1. The red line is the line z = 2x − y. Panels: (a) the original data in the original coordinate system x and y; (b) the same data plotted in the rotated PCA coordinate system; (c) the same data after PCA has removed the second component, leaving only the first principal component. When PCA is applied, this data should contain an axis with the combined variable of x and y (z) and an axis with an error, because there is no dependency between the error and z in this dataset. The error is Gaussian noise with µ = 0 and σ = 0.5, which is clearly visible in subfigure (b).

The Representations of W. To get a deeper understanding of what PCA does, it is important to investigate the matrix W. The column space of W represents how the original data is transformed into the new data H. For example, for the data in figure (9), $W_1$ is the vector $\begin{bmatrix} \frac{-2}{\sqrt{5}} & \frac{1}{\sqrt{5}} \end{bmatrix}^T$, which corresponds to the function y = 2x. When PCA is used on facial images, the columns of W are also called eigenfaces [22]; when PCA is used on HSI, the columns of W are spectral curves. This means that when PCA is used as a feature selection tool, a description of the features can be found in the columns of W, which helps to get a better understanding of the dependencies and structures in the data.


3.3.2 NMF

While PCA is a valuable matrix decomposition algorithm, many datasets are non-negative, which makes PCA unsuitable. This is where NMF algorithms are useful. Examples where such datasets come from and where NMF is successfully applied are chemometrics [23], text mining [24] and spectrophotometry [25]. In an NMF algorithm there are still two matrices, one transforming the data and one containing the new data or weights, but now all matrices must be non-negative [26], resulting in equation (22).

$$X \approx HW^T, \quad \text{where } X \in \mathbb{R}_+^{N \times D},\; W \in \mathbb{R}_+^{D \times K} \text{ and } H \in \mathbb{R}_+^{N \times K} \qquad (22)$$

While both PCA and NMF solve the same task, they do it quite differently. The basic idea of NMF is to initialise the two matrices and gradually solve the optimisation problem of minimising the distance between X and $HW^T$ [27], where H and W serve the same purpose as H and W in PCA. Half the Frobenius norm is most commonly used as the distance metric for NMF, see equation (23). The matrices are optimised iteratively, alternating between optimising W and H while maintaining the non-negativity constraints for both matrices.

$$D_F(X \| \tilde{X}) = \frac{1}{2} \|X - \tilde{X}\|_F^2 = \frac{1}{2} \sum_{n=1}^{N} \sum_{j=1}^{D} (X_{nj} - \tilde{X}_{nj})^2, \quad \tilde{X} = HW^T \qquad (23)$$

This results in the following algorithm. Initialise W randomly or, to speed up the algorithm, with non-negative double singular value decomposition (NNDSVD) [28]. The next steps are repeated until a minimum of the cost function (23) has been reached. It should be noted that this cost function is only convex with respect to W or H, but not both at the same time [27], which means that W or H must always be fixed during optimisation. This makes the cost function (23) an extension of the non-negative least squares problem (NNLS) [29], which is formulated as minimising the cost function (24), where $y \in \mathbb{R}^D$ is the data, $A \in \mathbb{R}^{D \times K}$ is a matrix and $x \in \mathbb{R}_+^K$.

$$\min_x J(x) = \frac{1}{2} \|y - Ax\|_F^2 \qquad (24)$$

While (23) is the basic cost function for NMF, most algorithms also include sparseness constraints, which consist of an added $L_{1,1}$ norm and Frobenius norm [30], where the $L_{1,1}$ norm gives sparseness to the model and the Frobenius norm adds regularisation. In this paper the cost function (25) is used [31], where α controls how much regularisation and ρ controls how much sparseness.

$$\min_{W,H} J(W, H) = D_F(X \| \tilde{X}) + \alpha\rho \|W^T\|_{1,1} + \alpha\rho \|H\|_{1,1} + \frac{\alpha(1-\rho)}{2} \|W^T\|_F + \frac{\alpha(1-\rho)}{2} \|H\|_F \qquad (25)$$

The repeating steps start with estimating H by minimising (25) while W is fixed [27]. Next, all negative components of H are set to zero or ε ($10^{-16}$). Then the same is done for W, by fixing H and optimising (25) for W. Lastly, all negative components of W are set to zero or ε.

There are many links with PCA; for example, in the first initialisation step W is calculated using SVD for both PCA and NMF. However, NMF has different restrictions: all the matrices are non-negative and W is no longer orthonormal or orthogonal. Another difference is that the representation of W in PCA spans all dimensions D of X, while when sparseness is used in NMF the representation of W contains localised features of X [22].
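For illustration, a minimal NMF can be written with the classic Lee–Seung multiplicative updates for the plain Frobenius cost (23). Note that this sketch omits the sparseness and regularisation terms of (25) and uses random initialisation instead of NNDSVD:

```python
import numpy as np

def nmf(X, K, n_iter=500, seed=0):
    """Lee-Seung multiplicative updates minimising ||X - H W^T||_F^2."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    H = rng.random((N, K)) + 1e-3   # random non-negative initialisation
    W = rng.random((D, K)) + 1e-3   # (NNDSVD would be the faster option)
    eps = 1e-12                     # guards against division by zero
    for _ in range(n_iter):
        # Multiplicative updates keep H and W non-negative automatically.
        H *= (X @ W) / (H @ W.T @ W + eps)
        W *= (X.T @ H) / (W @ H.T @ H + eps)
    return H, W

# An exactly factorable non-negative test matrix (illustrative values).
H_true = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W_true = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
X = H_true @ W_true.T
H, W = nmf(X, K=2, n_iter=3000)
print(np.linalg.norm(X - H @ W.T))   # small reconstruction error
```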


3.3.3 NMF as a K-means Algorithm

While generally NMF is used as a matrix decomposition method, it can also be used for clustering [22, 26]. to make this possible the right norm in the optimisation should be chosen, resulting in the same algorithm as k-means. The mathematical proof can be found in [26]. Here an intuitive explanation will be given. First, the constrain that the rows of H must be unary vectors must be enforced, making H diagonalisable. This constrained can also be seen in vector quantisation (VQ) [22], which is solved using k-means. This would mean that H becomes the same as r from equation (12). When H contains no longer weights or probabilities, the transformation matrix W transforms all the data points with the same cluster/u-nary vector to the same point. Thus, to minimise the distance between X and eX all points that belong to the same cluster should be transformed to a point in the middle of those points. This means that the columns of W contain the means of the points that are in the same cluster, which is similar to k-means. More mathematically, when optimizing W the optimisation functions J are similar for k-means and NMF, under the above mentioned constrains. Starting from the objective function of k-means (12), the objective function (26) is derived. J = N X n=1 K X k=1 rnk||xn− µk||22= N X n=1 K X k=1 D X j=1 rnk(xnj− µkj)2= N X n=1 D X j=1 K X k=1 rnk(x2nj− 2xnjµkj+ µ2kj) = N X n=1 D X j=1 x2nj K X k=1 rnk− 2xnj K X k=1 rnkµkj+ K X k=1 rnkµ2kj ! (26)

Given the fact that rnkis either 0 or 1 andP K

k=1rnk= 1 we simplify equation (26) to the following form

(27). J = N X n=1 D X j=1 x2nj− 2xnj K X k=1 rnkµkj+ K X k=1 rnkµ2kj ! = N X n=1 D X j=1 x2nj− 2xnj K X k=1 rnkµkj+ K X k=1 (rnkµkj)2 ! = N X n=1 D X j=1 xnj− K X k=1 (rnkµkj) !2 = N X n=1 D X j=1 (xnj− rn· µj)2= ||X − HWT||2F (27) Where X = x, H = r and W = µT. Which makes the objective function of k-means identical to the objective function of NMF.
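The equality in (27) can be verified numerically on a toy example (illustrative values): with a one-hot H (= r) and $W = \mu^T$, the k-means objective (12) equals the NMF objective $\|X - HW^T\|_F^2$.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
r = np.array([[1, 0], [1, 0], [0, 1]])           # points 0, 1 -> cluster 0
mu = np.array([[0.0, 1.0], [5.0, 5.0]])          # cluster means

# k-means objective (12): sum of squared distances to assigned means.
J_kmeans = sum(r[n, k] * np.sum((X[n] - mu[k]) ** 2)
               for n in range(3) for k in range(2))
# NMF objective (27): squared Frobenius norm with H = r, W^T = mu.
J_nmf = np.linalg.norm(X - r @ mu) ** 2
print(J_kmeans, J_nmf)   # both equal 2.0
```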


3.4 Neural Networks

While the algorithms discussed so far are designed for a specific task, NNs are designed to tackle a wider range of machine learning tasks. The idea of a NN is that a very general structure and optimisation algorithm are provided, which can learn many things, instead of a more precise and dedicated algorithm. This is also the downside of NNs, because for each combination of a task and a type of dataset a new NN must be designed. Thus, there is no general NN implementation.

Most people explain a NN using the analogy with the human brain, where many neurons work together to transmit signals and process them from input to output [32]. While this is an intuitive explanation, neural networks in their most simple form do not differ that much from matrix decomposition methods or clustering algorithms. Matrix decomposition methods and NNs are designed to take a matrix with a certain dimensionality and transform it into a new matrix, usually with a much lower dimensionality, while clustering methods are used to group data, which can also be done with a NN. Even on a lower level, NNs are almost the same as the earlier mentioned methods. Take the simplest NN possible, which consists of one neuron, see figure (10). This NN consists of weights $W \in \mathbb{R}^{D \times 1}$ and a bias $b \in \mathbb{R}$. Mathematically, this NN can be represented with equation (28), so W is just a transformation matrix and the activation function (30) acts as a threshold such that $Y \in \mathbb{R}_+$. Now, if b = 0 and Y is known, and this NN were optimised using the optimisation function (24), then this NN would in fact be NNLS.

$$Y = \max(0, A), \quad \text{where } A = b + \sum_{j=1}^{D} X_j W_{j,1} \text{ and } A \in \mathbb{R}^{N \times 1} \qquad (28)$$

Figure 10: A representation of a NN with a single neuron.

There are many more activation functions, of which the most important are: Relu (30), Sigmoid (31), Tanh (32) and Softmax (33). The reason to add activation functions is to go from a linear system to a non-linear system, because otherwise two neurons would just be the same as one neuron, see equation (29). This also gives the designer of a NN the option to add restrictions to the network, such as non-negativity (Relu), probabilities/weights or values between 0 and 1 (Sigmoid), a kind of ternary operator or values between −1 and 1 (Tanh), or making the output Y a probability mass function (Softmax).
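For reference, each of these activation functions takes only a line in NumPy. The stability shift in the softmax is a common implementation detail, not part of equation (33):

```python
import numpy as np

def relu(a):                       # (30): enforces non-negativity
    return np.maximum(0, a)

def sigmoid(a):                    # (31): squashes values into (0, 1)
    return 1 / (1 + np.exp(-a))

def tanh(a):                       # (32): squashes values into (-1, 1)
    return np.tanh(a)

def softmax(a):                    # (33): probability mass function over a
    e = np.exp(a - a.max())        # shift for numerical stability
    return e / e.sum()

print(relu(np.array([-1.0, 2.0])))     # [0. 2.]
print(softmax(np.array([0.0, 0.0])))   # [0.5 0.5]
```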

The reason for adding a bias to each neuron has two explanations that come down to the same idea. Firstly, when a neuron is seen as a switch that is on or off, a bias increases or decreases the probability that the neuron is on or off. Secondly, a neuron can be seen as a linear regression solution for the data, and a bias makes it possible to have an equation that does not go through the origin. Thus, adding a bias term makes a NN more flexible.

$$XW_1W_2 = XW, \quad \text{where } W = W_1W_2 \qquad (29)$$


$$\text{Relu}(a) = \max(0, a) \qquad (30)$$
$$\text{Sigmoid}(a) = \frac{1}{1 + e^{-a}} = \frac{e^a}{e^a + 1} \qquad (31)$$
$$\text{Tanh}(a) = \frac{e^{2a} - 1}{e^{2a} + 1} \qquad (32)$$
$$\text{Softmax}(A_i) = \frac{e^{A_i}}{\sum_{n=1}^{N} e^{A_n}} \qquad (33)$$

3.4.1 Layers

While a single neuron can be used for binary classification (in other words: does a data point belong to a class, yes/no), layers need to be added to make a more complex NN. For example, PCA can be recreated using a NN when multiple neurons are used in one layer. This creates a NN with one layer given $W \in \mathbb{R}^{D \times K}$, but the whole system is still linear. This changes when hidden layers are added, see figure (11). When using hidden layers, NNs are put in front of each other, and because of the non-linearity of each layer, the whole system learns non-linear patterns in the data.

Figure 11: Representation of a neural network with 3 features as input, one hidden layer with 4 neurons and 2 outputs. Assuming Relu activation is used, the equation for the NN would be $Y = \max(0, \max(0, XW_1 + b_1) \cdot W_2 + b_2)$, with $X \in \mathbb{R}^{N \times 3}$, $Y \in \mathbb{R}_+^{N \times 2}$, $W_1 \in \mathbb{R}^{3 \times 4}$, $W_2 \in \mathbb{R}^{4 \times 2}$, $b_1 \in \mathbb{R}^4$ and $b_2 \in \mathbb{R}^2$ [33].

Until now, the layers in the NNs have been fully connected, which means each input is connected to each neuron in the next layer. There are also other possibilities, like convolution layers, where convolutions are applied to an input, creating multiple outputs with approximately the same shape as the input [15]. This creates an extra dimension, namely channels (C), so that $X \in \mathbb{R}^{C \times N \times D}$. In each layer, D can change, but also the number of channels.

A NN can then be summarised as equation (34) with l layers, where $f_k$ is a layer that applies a certain function (35), parameterised by $\theta_k$ and with activation function $\sigma_k$. For example, for a fully connected NN, $\theta = W$.

$$Y = f_l(f_{l-1}(\dots f_1(X) \dots)) \qquad (34)$$
$$f_k(x) = \sigma_k(x\theta_k + b_k) \qquad (35)$$

This also makes sense looking at figure (12): it is known from propositional calculus/logic that an XOR-gate can be made with two AND-gates followed by an OR-gate (and some negations), so it is not surprising that the neural network needs this structure if a NN is just equation (34).


Choosing the right hyperparameters, the number of layers (depth) and the number of neurons in each layer (height), is crucial for making a working NN. In general, a too-large total number of neurons makes a NN harder to train, because it can lead to overfitting; countering this requires more training data. On the other hand, a too-small total number of neurons cannot capture all the patterns in the data, which will likely lead to underfitting.

The best way to determine the size of a fully connected NN is to do a hyperparameter grid search, where the NN is fitted and tested with a set of combinations of depth and height. To speed up this process, a rough estimate can be made intuitively: each neuron in a layer is able to learn one feature of its input, for example whether a certain $X_n$ is below or above a certain value. With each extra hidden layer, a more complex pattern can be learned. For example, an AND-gate does not need a hidden layer, but an XOR-gate does, because an XOR-gate is not linear and needs one hidden layer. In figure (12) both gates are visualised with all the possible inputs and the regression lines of a NN.

Figure 12: The dots are the input (X), where the colour determines their class (Y: red = 1 or blue = 0). The NN is visualised with two colours: green is a hidden neuron and cyan is the output neuron. (a) AND-gate: no hidden layers are needed, the data can be split with one neuron. (b) XOR-gate: two hidden neurons (1 layer) are needed to first split the input into the white and cyan areas; an output neuron then determines whether a point lies in the cyan or the white area.
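The XOR construction of figure 12b can be demonstrated with hand-picked weights; the exact values below are illustrative assumptions, not weights learned by a network:

```python
import numpy as np

def relu(a):
    return np.maximum(0, a)

# Hand-picked weights for an XOR network with one hidden layer of two
# Relu neurons, matching the structure of figure 12b.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0], [-2.0]])
b2 = np.array([0.0])

# All four possible binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = relu(relu(X @ W1 + b1) @ W2 + b2)
print(Y.ravel())   # [0. 1. 1. 0.] -- the XOR of the two inputs
```

The first hidden neuron fires for any active input, the second only when both inputs are active; the output neuron subtracts twice the second from the first, reproducing the two-region split of figure 12b.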

3.4.2 Loss Function and Back Propagation

The structure of a NN determines what the network could potentially learn, but to train the network two things are needed: an indication of how good the NN currently is (a loss function) and a method to optimise the current NN (backpropagation). A loss function should be chosen depending on what the NN should learn; this can be, for instance, the $L_{1,1}$ norm, the Frobenius norm or a cross-entropy loss function [34]. The name backpropagation comes from the fact that at the "back" of the NN an error is calculated, and from there the weights of the layers are changed, propagating the error to the front of the NN. Once the error or loss is calculated, backpropagation tunes the NN to make the error smaller in the next iteration, which can be expressed as follows:

$$L(Y, Y') \to \min_{\theta_k} \qquad (36)$$

This is done by calculating the gradient of the loss function with respect to $\theta_k$.

$$\frac{\partial L}{\partial \theta_k} = \frac{\partial L}{\partial Y'} \frac{\partial Y'}{\partial f_l} \frac{\partial f_l}{\partial f_{l-1}} \dots \frac{\partial f_{k+1}}{\partial f_k} \frac{\partial f_k}{\partial \theta_k} \qquad (37)$$


Now that the gradient is known, θ can be updated using equation (38), where α is the learning rate, which determines how fast the network learns. It is important to note that when training a NN, not the whole set of training data is used per iteration, because that would take too long: a NN generally goes through thousands of iterations, and calculating the error over the whole training set at each iteration would not be feasible. The fastest option would be to use just one data point per iteration, but this gives a gradient only applicable to that data point, which is not accurate. Therefore the best training method uses batches, which commonly contain around 32 to 512 data points.

$$\theta_{new} = \theta_{old} - \alpha \frac{\partial L}{\partial \theta_k} \qquad (38)$$

3.4.3 Autoencoders

NNs are generally a supervised method with an input X and an output Y, but as discussed in section 3.1.2, Y is sometimes not available and then a semi-supervised version can be used. One of the possibilities for a semi-supervised NN is an AE. The goal of an AE is to find general features in the data (H), similar to PCA or NMF, but an AE can also learn more complex non-linear features.

It is called an autoencoder because the idea is that it learns to encode and decode the data based on the data. First, the data is encoded to a certain number of features and then the features are decoded to the original data, see figure (13). Both encoder and decoder are NNs. An AE works the same as a "normal" NN in terms of backpropagation and loss functions.

Figure 13: A schematic drawing of an autoencoder. Both the encoder and decoder are a NN, and H has a lower dimension than X. The idea is to minimise the loss function L(X, X′) to find general features H in the data.
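As an illustration only, the encode/decode/minimise loop can be shown with a linear autoencoder trained by plain gradient descent. Real AEs, including those discussed in this project, stack non-linear layers as in section 3.4.1; all sizes and rates below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.random((200, 2))                    # hidden structure of the data
X = Z @ rng.random((2, 4))                  # 4 correlated features, rank 2

We = rng.normal(scale=0.1, size=(4, 2))     # encoder weights (D=4 -> K=2)
Wd = rng.normal(scale=0.1, size=(2, 4))     # decoder weights (K=2 -> D=4)
alpha = 0.02                                # learning rate

for _ in range(5000):
    H = X @ We                              # encode
    E = H @ Wd - X                          # reconstruction error X' - X
    # Gradients of the mean squared loss L(X, X') for both matrices.
    Wd -= alpha * 2 * H.T @ E / len(X)
    We -= alpha * 2 * X.T @ (E @ Wd.T) / len(X)

# Final reconstruction error; it should shrink far below ||X||,
# because X has rank 2 and the code H also has 2 dimensions.
print(np.linalg.norm(X - X @ We @ Wd))
```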


4 Method

In this section, the goal of the project and the problems that are encountered when trying to reach the goal will be explored. Next, several solutions will be summarised to solve these problems after which the proposed solutions will be explained in detail. The proposed solutions are discussed in the order in which they were encountered. Lastly, a measurement of effectiveness, the error, will be discussed, such that the solutions can be compared.

The goal of this project is to solve the problem of unmixing pixels given a Raman hyperspectral image. Here, the task is to find general component spectra and to find how they contribute to each pixel. A component spectrum is a spectrum that can be seen as a building block from which a pixel can be made. For example, in a painting this could be a mixture of colours or just one colour.

Mathematically speaking, the goal is, given a function f of spatial parameters (x, y) and spectral parameter (δν̃), to find a set of functions $f_1, f_2, \dots, f_K$ such that equation (39) is satisfied, where a function $f_k$ represents one component spectrum. The function f can be modelled as the sum of a set of functions because, when the experimental variables stay constant, the Raman signal can be seen as a linear combination of Raman signals. In other words, if there is a ratio of two kinds of molecules that produce a Raman signal, the ratio can be found as the weighted sum of Raman signals in the total signal.

$$f = f_1 + f_2 + \dots + f_K \qquad (39)$$

So far, there is no closed-form solution for this problem or for producing f from the data. Therefore a numerical model (40) is used to describe equation (39). Here, X is the Raman data, W contains the component spectra and H contains the weights or contributions for each pixel. Currently, there are several methods to find the variables in this model, which have proved sufficiently accurate in various HSI/RHSI processing tasks. However, they still do not perfectly describe the observed data.

$$X = HW^T, \quad \text{where } X \in \mathbb{R}_+^{N \times D},\; W \in \mathbb{R}_+^{D \times K} \text{ and } H \in \mathbb{R}_+^{N \times K} \qquad (40)$$

There are two challenges when obtaining X. The first major challenge is that the raw signal from the camera does not contain just X, but also noise and photoluminescence, as explained in section 2.1. This means that equation (3) has to be solved first. Next, to make each pixel as important as any other and to make it easier to find W, the data needs to be normalised. The reason is that raw data can vary from values in the hundreds or thousands up to a maximum of $2^{16}$, but those smaller values are not less important.
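The unit-vector normalisation used here amounts to scaling every pixel spectrum (row of X) to unit L2 norm, so weak and strong signals contribute equally; a sketch with illustrative values:

```python
import numpy as np

def normalise(X, eps=1e-12):
    """Scale each row (pixel spectrum) of X to unit Euclidean length."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / (norms + eps)               # eps guards all-zero pixels

X = np.array([[300.0, 400.0],              # weak pixel
              [30000.0, 40000.0]])         # strong pixel, same spectral shape
print(normalise(X))                        # both rows become [0.6, 0.8]
```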

There are several challenges when finding W and H. The first is that there are rarely reference spectra in X. Reference spectra are spectra used in supervised learning, where each spectrum contains a known substance, like an oil or a pigment. Therefore, data points with similar characteristics have to be found; using those data points the component spectra can be calculated. Then, once W and X are obtained, H can be calculated.

The last challenge does not arise from the data, but from some of the algorithms used to solve the previously mentioned challenges. The problem is that some algorithms do not handle high-dimensional data well, which can mean two things: the algorithm performs worse with high-dimensional data, or the computing power needed increases exponentially with more dimensions, making it not cost-effective to run the algorithm on high-dimensional data.

In this project, the focus lies on solving the task numerically using the model (40). Based on the challenges, two separate sets of stages are defined; each combination gives a general solution for the task. The first set is preprocessing and the second set is data processing, consisting of feature selection, clustering, component spectra calculation and unmixing.


As will be discussed later in section 4.1, smoothing can be used to remove noise from the signal, and the signal can be split using a polynomial function or the GSP algorithm. When the signal is split into all three components (see section 2.1), each general solution can be run using only the Raman signal or using both the Raman and photoluminescence signals. Before the signals can be used, they need to be normalised, see section 4.2. The possible combinations that can be used for preprocessing are summarised in table (1).

         Noise Removal   Signal Separation   Normalisation   Related Work
Raw      —               —                   unit vector
Poly1    —               polynomial          unit vector     [2, 35]
Poly2*   smoothing       polynomial          unit vector     [35]
GSP*     smoothing       GSP                 unit vector     [2]

Table 1: The different possibilities to split the raw signal and the related work in which a similar approach was taken. *In these sets either the Raman signal alone can be used, or both the Raman and the photoluminescence signal.

As discussed in sections 3.2, 3.3 and 3.4: PCA, NMF and AE proved to be effective feature selectors, k-means is suitable for clustering, NNLS can be used for unmixing and a weighted mean can be used to calculate the component spectra. The weighted mean calculation uses H and X̂ (see equation (47)) to calculate W, similar to how µ is calculated in k-means from r and x (16), but with weights instead of a one-hot matrix, see equation (41). Based on these algorithms, several methods are proposed for the data processing stages, which are summarised in table (2).

W_k = (Σ_{n=1}^{N} H_{nk} X̂_n) / (Σ_{n=1}^{N} H_{nk}) (41)
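Equation (41) is a single matrix expression in practice. The sketch below, on assumed toy dimensions and random data, computes the weighted mean both vectorised and per component to show they agree.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 200, 64, 4   # assumed toy dimensions

X_hat = rng.random((N, D))   # preprocessed, normalised spectra
H = rng.random((N, K))       # soft weights per pixel, one column per component

# Equation (41): every component spectrum is the H-weighted mean of all
# pixel spectra (here W is stored with one component per row, shape K x D).
W = (H.T @ X_hat) / H.sum(axis=0)[:, None]

# The same thing written per component, mirroring the sums in eq. (41):
W_loop = np.stack(
    [(H[:, k:k + 1] * X_hat).sum(axis=0) / H[:, k].sum() for k in range(K)]
)
print(np.allclose(W, W_loop))   # True
```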

        Features Selection     Clustering   Component Spectra   Unmixing   Related Work
NFS     —                      k-means*     —                   NNLS       [1, 36]
PCA     PCA                    k-means      weighted mean*      NNLS       [1, 36]
NMF1    NMF (k-means cons.)*   —            —                   NNLS
NMF2    NMF                    —            weighted mean*      NNLS
NMF3    NMF                    —            —                   —          [3]
AE1     AE                     k-means      weighted mean*      NNLS       [4]
AE2     AE                     —            weighted mean*      NNLS
AE3     AE                     —            —                   —

Table 2: All the proposed general methods summarised, as well as where similar methods can be found in related work. If a cell contains a '—', the algorithm to its left is used; if there is no algorithm to the left, no algorithm is used for that step. *The component spectra should be unit-vector normalised.
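The PCA row of the table can be sketched end-to-end on synthetic data. This is only an illustration: the dimensions, the number of principal components and the scikit-learn calls are assumptions, not the exact configuration used in this project.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
N, D, K = 300, 80, 3   # assumed toy dimensions

# Stand-in for preprocessed spectra, unit-vector normalised per pixel.
X = rng.random((N, D))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# 1. Feature selection: project onto the first principal components.
Z = PCA(n_components=10).fit_transform(X)

# 2. Clustering in the reduced space.
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(Z)

# 3. Component spectra: mean spectrum per cluster (a one-hot weighted mean),
#    unit-vector normalised as the table note requires.
W = np.stack([X[labels == k].mean(axis=0) for k in range(K)])
W /= np.linalg.norm(W, axis=1, keepdims=True)

# 4. Unmixing: non-negative contribution of each component in every pixel.
H = np.array([nnls(W.T, pixel)[0] for pixel in X])
print(H.shape)   # (300, 3)
```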


4.1 Signal Separation

The goal of signal separation is to solve equation (3). To achieve this, the properties of each signal explained in section 2.3.2 can be used. The main property used to differentiate the signals is the FWHM. The FWHM can be interpreted as the frequency of a spectrum: if the spectrum were a signal, the FWHM of each component would correspond to its frequency content. This should not be confused with the wavenumber or frequency variable in the data. In this project, two methods are proposed to split the raw signal: polynomial separation and GSP separation.

4.1.1 Polynomial Separation

It is possible to model the photoluminescence spectrum as a polynomial function because of the low frequency of this spectrum [2, 35]. It is important to keep the polynomial order low to prevent overfitting, which would mean that not only the photoluminescence signal is fitted but also part of the Raman signal. Good estimates of the required polynomial order are between four and six. The polynomial fit is calculated for each data point using a least squares solution. If all wavenumbers are used, the polynomial fit will include the Raman signal; therefore it is important to use only wavenumbers in the least squares solution that do not contain Raman signal. These wavenumbers are commonly found by calculating the minima of a raw spectrum. A way to avoid minima that are part of the Raman signal is to take only the lowest of each group of three minima. In table (1), this is referred to as Poly1.
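The Poly1 idea can be sketched on a synthetic spectrum. The baseline, the peak positions, the noise level and the minima-selection rule below are illustrative assumptions, not values from this project.

```python
import numpy as np
from scipy.signal import argrelmin

rng = np.random.default_rng(0)

# Synthetic spectrum (illustrative values): a smooth photoluminescence
# baseline, two narrow Raman peaks, and a little noise.
x = np.linspace(0.0, 1.0, 500)
baseline = 2.0 + 1.5 * x - 2.0 * x**2
raman = np.exp(-((x - 0.3) / 0.01) ** 2) + 0.7 * np.exp(-((x - 0.7) / 0.01) ** 2)
spectrum = baseline + raman + 0.02 * rng.standard_normal(x.size)

# Find local minima, then keep only the lowest of each group of three,
# which avoids minima sitting on the flanks of a Raman peak.
idx = argrelmin(spectrum)[0]
groups = idx[: len(idx) // 3 * 3].reshape(-1, 3)
keep = groups[np.arange(len(groups)), np.argmin(spectrum[groups], axis=1)]

# A low-order (4-6) polynomial least squares fit through the kept minima
# approximates the photoluminescence; subtracting it leaves the Raman signal.
coeffs = np.polyfit(x[keep], spectrum[keep], 4)
photoluminescence = np.polyval(coeffs, x)
raman_est = np.clip(spectrum - photoluminescence, 0.0, None)
```

Note that, as discussed next, fitting through the lowest minima biases the baseline toward the most negative noise, which motivates the smoothed Poly2 variant.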

The problem here is that this makes a fit through points containing photoluminescence plus maximally negative noise, which almost certainly splits the raw signal into a photoluminescence signal and a Raman signal plus noise plus some leftover photoluminescence. This is not desirable; therefore Gaussian smoothing can be applied before the splitting. This reduces the noise in the signal but also lowers the Raman spikes, resulting in some loss of the Raman signal, although the signal will be cleaner. In table (1), this is referred to as Poly2.

4.1.2 GSP

GSP is an algorithm that uses the FWHM property to remove spikes with a medium FWHM from the data. This leaves a signal that contains only photoluminescence; by subtracting it from the raw data, the Raman signal can be recovered. Therefore, this technique only works when the noise is removed from the data first.

When removing noise, it is important that the Raman signal stays intact. The problem, however, is that the noise within Raman spikes is hard to determine. Thus, the assumption is made that there is no noise within a Raman spike. Noise can then be removed with a Gaussian filter with a standard deviation of around five, but this leaves the question of how to determine whether there is a Raman spike in the data. To determine this, the gradient is calculated. The gradient created by noise can momentarily be as high as the gradient created by a Raman spike; therefore the Gaussian-smoothed absolute gradient is calculated, which can be seen in figure (14). Next, the mean of the absolute gradient is calculated. If a gradient is higher than the mean plus some correction, it is part of a spike; otherwise it is not. This indicator is then smoothed to create the certainty measurement p(r), which can also be seen in figure (14). The last step is to smooth where there is no spike and to keep the data as-is where there is a spike, based on p(r), see equation (42). The spectrum is not smoothed where there is a spike because, as the signal-to-noise ratio becomes higher, the sensitivity of the system to noise reduces. Also, the purpose of separating the Raman signal is to make the Raman spikes more distinct, and part of the Raman signal gets lost when smoothing a spike. The code for the smoothing can be found in appendix A.

X_no_noise = p(r) · X + (1 − p(r)) · Gaussian_filter(X, 5) (42)
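A minimal sketch of this spike-aware denoising on a synthetic spectrum is shown below. The spike shape, noise level and the correction factor of 1.5 on the mean gradient are assumptions for the example; the thesis code in appendix A is the reference implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(3)

# Synthetic raw spectrum (illustrative): one Raman spike plus noise.
x = np.linspace(0.0, 1.0, 500)
X = np.exp(-((x - 0.5) / 0.01) ** 2) + 0.03 * rng.standard_normal(x.size)

# Gaussian-smoothed absolute gradient: stays near its mean where there is
# only noise, and rises clearly on the flanks of a Raman spike.
grad = gaussian_filter1d(np.abs(np.gradient(X)), 5)

# Certainty p(r) that a channel belongs to a spike: threshold at the mean
# plus a correction (the factor 1.5 is an assumed correction), then smooth
# the indicator into a soft 0..1 measure.
p = np.clip(gaussian_filter1d((grad > 1.5 * grad.mean()).astype(float), 5), 0.0, 1.0)

# Equation (42): keep spike regions intact, smooth everywhere else.
X_no_noise = p * X + (1.0 - p) * gaussian_filter1d(X, 5)
```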


Figure 14: A raw spectrum of a data point with a high noise contribution (blue). In orange, the Gaussian-smoothed absolute values of the gradients are displayed, multiplied by five, and in green the certainty (in %) that there is a Raman spike.

Now only the Raman and photoluminescence signals remain, which are then split based on the FWHM and on the fact that the Raman signal is either high or zero, see equation (2). Next, the Raman signal is filtered out, leaving behind the photoluminescence signal. This is done in three steps, where each step is repeated ten times. Firstly, the tops of the spikes are removed by a Gaussian smoothing filter with a σ equal to the maximum FWHM of the Raman spikes. To satisfy constraint (4), the element-wise minimum of the original signal and the smoothed signal is taken, making sure that the new signal is not higher than the old signal, resulting in equation (43).

X_new = min(X_old, Gaussian_filter(X_old, σ)) (43)

Secondly, a polynomial is fitted through the remaining signal and again, for the same reason as before, the element-wise minimum of the original signal and the polynomial fit is taken, resulting in equation (44). This removes the Raman signal completely from the data, leaving only the photoluminescence signal.

X_new = min(X_old, g(∆ν̃, n)), where g(∆ν̃, n) = polyfit(X_old, n) (44)

Lastly, the photoluminescence signal is cleaned up with a Gaussian smoothing filter, but with a σ of ten so that only the remaining noise is smoothed out and nothing else, see equation (45).

X_photoluminescence = Gaussian_filter(X_old, σ) (45)

Now, the Raman signal can be calculated by subtracting the photoluminescence signal from the original signal without noise. To resolve the problem that the Raman signal can become slightly negative at some wavenumbers, due to the last step in filtering out the Raman signal, the Raman signal is cut off at zero, see equation (46). The full code can be found in appendix C.
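The three-step split of equations (43)-(46) can be sketched on a synthetic denoised spectrum. The baseline, spike widths, σ values and polynomial order below are illustrative assumptions; appendix C holds the actual implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Synthetic denoised spectrum (illustrative values): a slowly varying
# photoluminescence baseline plus two narrow Raman spikes.
x = np.linspace(0.0, 1.0, 500)
pl = 1.0 + 0.8 * np.sin(2.0 * x)
raman = np.exp(-((x - 0.4) / 0.008) ** 2) + 0.6 * np.exp(-((x - 0.8) / 0.008) ** 2)
X = pl + raman

sigma = 8   # roughly the maximum FWHM of the Raman spikes, in samples
order = 5   # polynomial order for the baseline fit

est = X.copy()
for _ in range(10):
    # Step 1, eq. (43): smoothing lowers the spike tops; the element-wise
    # minimum keeps the estimate from rising above the original signal.
    est = np.minimum(est, gaussian_filter1d(est, sigma))
    # Step 2, eq. (44): low-order polynomial fit, again clipped from above.
    est = np.minimum(est, np.polyval(np.polyfit(x, est, order), x))

# Step 3, eq. (45): a final smoothing cleans up the photoluminescence estimate.
photoluminescence = gaussian_filter1d(est, 10)

# Eq. (46): the Raman signal is the non-negative remainder.
raman_est = np.clip(X - photoluminescence, 0.0, None)
```

The min-with-smoothing and min-with-fit steps only ever lower the estimate, so after a few iterations the spikes are shaved away while the baseline is preserved.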
