Exploring the possibility to employ a consumer-grade EEG-device as an educational tool in a machine learning course for physicists

THESIS

submitted in partial fulfillment of the requirements for the degree of

BACHELOR OF SCIENCE in PHYSICS

Author: T.S. Pool
Student ID: S0141453
Supervisor: Dr. S. Semrau
2nd corrector: Prof. dr. K.E. Schalm


Exploring the possibility to employ a consumer-grade EEG-device as an educational tool in a machine learning course for physicists

T.S. Pool

Huygens-Kamerlingh Onnes Laboratory, Leiden University P.O. Box 9500, 2300 RA Leiden, The Netherlands

April 2018

Abstract

With data generation becoming increasingly complex and automated as a result of technological developments, using computers to perform data selection, preprocessing and data analysis has become indispensable in many fields of physics and astronomy. Hence, acquiring some basic knowledge of machine-learning techniques should be an essential part of the curriculum of these subjects. However, courses on the subject are mainly aimed at future computer scientists. In this study, we explore the potential of using the Emotiv EPOC+, a consumer-grade EEG-device, as an educational tool in a hands-on machine learning course, tailor-made for physics and astronomy students. For this, we perform various experiments with a single subject, and use elementary neural networks to perform binary classification to identify events in the self-produced EEG data. We find that the Emotiv is capable of producing data with sufficient consistency within a single recording to detect blinks and full-arm motion with more than 90% accuracy. However, these results are not reproducible with the same neural network once the headset has been removed from the head between recordings. This means the networks have to be trained anew in order to classify events in new data. For the Emotiv to serve as an educational tool in a machine learning course, a better understanding of this difference in noise between recordings is necessary, and a standardized preprocessing procedure to reduce noise should be developed.


Contents

1 Introduction 7

2 Machine learning and Physics 9

3 Machine learning and EEG-data 11

4 Emotiv EPOC+ and MNE-Python 13

4.1 Emotiv EPOC+ headset 14

4.2 EmotivPRO software 14

4.3 MNE-Python 15

4.3.1 The Raw and Epochs data structures 15

5 Experiments performed with Emotiv EPOC+ 17

5.1 Additional software and instruments used 17

5.2 General setup of experiments 18

5.2.1 Goal of the experiments 18

5.2.2 Making a recording 19

5.2.3 Training the neural networks 20

5.2.4 General properties of neural networks 20

5.2.5 Elementary workings of neural networks 22

5.3 Detecting Blinks 27
5.3.1 Creating data 27
5.3.2 Bayesian classifier 29
5.3.3 Training perceptrons 31
5.3.4 Convolutional networks 34
5.3.5 Comparison of classification 36

5.3.6 Reading in a continuous data-stream 36

5.4 Detecting movements 37

5.4.1 Detecting single hand movements: optimizing parameters 37

5.4.2 Testing consistency between files 48

5.4.3 Discerning left- and right-hand movements 50

5.4.4 Analyze limits of detectable movements 52

5.4.5 Detecting continuous movement 55

5.5 Detecting sounds 57

5.6 Suggestions for further experiments 63

5.6.1 Explore noise reduction 63

5.6.2 Compare the performances of more types of neural networks 64

5.6.3 Explore unsupervised learning techniques 64

5.6.4 Explore multi-class classification 65

6 Conclusion 67

Appendices 71


Chapter 1

Introduction

In this research project, we explore the possibility to use a consumer-oriented EEG device as an educational tool in a machine learning course. For this, the Emotiv EPOC+ is used, which comes with an easy-to-use EEG headset, a USB device that reads the data into your personal computer via a Bluetooth connection, and supporting software to visualize, record and export data. This way we construct original data sets of EEG recordings. We then employ machine learning techniques to train neural networks to recognize patterns in the recorded EEG data.

This thesis contains 3 main components:

• an account of creating a hands-on machine learning course for physics students, and an argument for choosing EEG-data as practice material.

• a description of the Emotiv EPOC+ and the Python packages used to analyze the data and build the neural networks.

• a description of the experiments performed with the EEG headset, and the results obtained by several kinds of neural networks.

This report is meant to give insight into the potential of the Emotiv EPOC+, as well as to serve as a practical guide for future users of the device.


Chapter 2

Machine learning and Physics

Even though here and there some paradigm shifts may take place, overall physics can be considered a cumulative enterprise. Using prior achievements as building blocks, it succeeds in creating more and more complex systems of knowledge. This complexity grows in theory construction (in particular mathematically) as well as in experimental undertakings, requiring evermore advanced instruments. Developments in technology allow scientists to probe deeper and deeper into the subatomic world as well as further and further into the universe, going far beyond detecting what the human eye can see. Making 'observations' is outsourced to machines. Besides having more sensitive and more varied sensory systems than humans, a great benefit of exploiting machines is that they allow these observations to be automated, collecting data at high speed, 24/7.

Because of the tremendous amount of data and the complexity of the obtained information, not only the observation itself, but also the interpretation of the data now has to be outsourced to machines. This can take the form of preprocessing as well as classification.

The Large Hadron Collider at CERN produces about a million gigabytes of data every second, making it impossible (and insofar as possible, very costly) to process by human labour alone. Machine learning algorithms are used to decide in real time which data is potentially interesting for further analysis and which data to toss out. Beyond just preprocessing, computer vision similar to facial recognition, trained by machine learning algorithms, can be used to identify particle jets (narrow sprays of particles originating from collisions), and even to identify features within these jets [1]. Likewise, machine learning has important applications in quantum mechanics [2] and condensed matter physics [3], e.g. accelerating the search for new materials.


Many advancements in machine learning are driven by tech giants' commercial applications and the data explosion generated by them. Science can benefit from these developments. For example, a neural network used by NOvA (NuMI Off-Axis νe Appearance), an experiment designed to detect neutrinos, is inspired by the architecture of GoogleNet [4]. Exploiting these deep learning tools is also met with skepticism, since these machine learning algorithms work mostly like a "black box"; it is increasingly difficult to keep an intuitive insight into how exactly certain conclusions are reached. Therefore, the growing employment of machine learning techniques on physical problems requires a constant effort, not only from computer scientists, but also from physicists themselves, to better understand the inner workings of such algorithms and to keep doing cross-checks on real data. Both the growing importance and the increasing complexity make it vital for any physics student to be at least familiar with the basic techniques and capabilities of the machine learning enterprise.

Now, courses on machine learning for physicists exist [5] [6], but most of these employ existing data sets, available in bite-size chunks in many internet databases. This way, however, a very important step in exploiting the power of machine-learning algorithms is omitted, namely producing data that is actually suitable to feed to a neural network. The way data-sets are constructed affects the performance of different neural networks in a distinctive way [7]. Creating training-sets entails producing consistent data to ensure equivalence between training and test samples, labeling samples of data without influencing the content of the samples itself (preferably in an automated way, in order to be able to produce large data-sets), and preprocessing the data. Also, when producing the data yourself, it is up to you to control what the neural network is actually training on, and to make sure that it does not classify different kinds of noise, for example.

Hence, to enable students to produce their own data in a hands-on machine learning course, this research project explores the potential of the EEG-device Emotiv EPOC+ as an educational tool.


Chapter 3

Machine learning and EEG-data

One of the scientific fields where machine learning is extensively utilized is the preprocessing, analysis and classification of EEG data. Electroencephalography (EEG) is a method to monitor electrophysiological signals. It is a noninvasive method that uses electrodes placed on the scalp to record the electrical activity of the brain. EEG measures voltage fluctuations resulting from ionic currents within the neurons of the brain. Understanding how measured EEG data relates to certain quantities of the brain is of substantial value for medical purposes, for example to give insight into the effect of psychiatric conditions and to predict the effectiveness of possible treatments [8]. Also, because of its non-invasive nature and the fact that nowadays data recordings are easy to obtain, also for non-experts, EEG is a highly promising medium for creating a brain-computer interface. Applications of this are the direct control of prosthetics and exoskeletons, and even partial recovery of spinal cord injuries by long-term training on a brain-machine interface-based gait protocol [9], but also more commercial applications like controlling gaming interfaces [10].

Because of the very small signal-to-noise ratio, it is often very hard to identify specific features by the naked eye, even with the help of more direct analysis techniques like using a Bayesian classifier or applying a power-spectral-density algorithm. On the other hand, with an EEG it is easy to collect a lot of information because of the high resolution and the use of multiple channels. This makes the data very suitable for analysis with the help of machine-learning algorithms.

So just like machine learning can be applied to preprocess and analyze EEG data, EEG data can also be used to understand machine learning [7]. Because it is relatively easy to manipulate and rich in information, it is suitable for comparing the performances of different types of networks.


Again, in this research we want to know whether a consumer-grade EEG-device such as the Emotiv EPOC+ is suited as an educational tool in a machine-learning course. The most obvious advantage of this tool is that it is affordable and has a plug-and-play setup. One of the goals of this research is to explore the disadvantages.

For this it is necessary to first investigate whether the data obtained by this device is consistent enough to classify certain basic events, like blinks or bodily movements. Reaching basic results will be necessary prior to any preprocessing, in order to determine the correctness and effectiveness of any preprocessing steps. Only thereafter can preprocessing like noise reduction (of both extra-cranial noise and perturbation of the data by unintended thoughts and movements) and artifact removal be examined, to improve previously obtained results and possibly lead to EEG data suitable for pursuing more sophisticated classifications like identifying specific intentional states. In this thesis, however, no machine-learning techniques applied to preprocessing, denoising and artifact detection will be discussed. The only preprocessing used during this research is the employment of functions implemented in the MNE-package itself that require virtually no programming to apply to the data.


Chapter 4

Emotiv EPOC+ and MNE-Python

This chapter aims to introduce the reader to the Emotiv EPOC+ headset and supporting software, and explains the basic features of MNE-Python, a free software package designed for handling human neurophysiological data in an accessible and efficient way. It can be read as a manual, enabling anyone new to the Emotiv and the MNE software to quickly produce original data and manipulate it in such a way that it can be used to train neural networks and explore machine learning techniques.

Figure 4.1: The Emotiv EPOC+ consists of 14 channels. The channels connect to the head via felt pads, saturated with a salt solution.


4.1 Emotiv EPOC+ headset

The Emotiv EPOC+ is a 14 channel EEG headset, allowing the user to produce high resolution raw EEG data (fig. 4.1) [11]. Next to the EEG channels, it has 7 gyroscopic channels to monitor the tilt and movements of the head. The headset is connected via Bluetooth to a USB-stick that reads the recorded data into your personal computer. The sampling rate of the recording is 128 Hz, which means that for each of the 14 channels we collect 128 data-points per second. Additional specifications can be found in appendix A.

4.2 EmotivPRO software

With the EEG-device comes a user interface, called EmotivPRO. This software allows the user to read in the data and visualize in real time the electric potential recorded by the 14 channels, the movements of the gyroscopic channels, and the frequency spectrum of the data via a Fast Fourier Transform (fig. 4.2).

Figure 4.2: The EmotivPRO software interface. It can show the Fourier transform of a specific channel (front), the electric potential of every channel (in µV) and a listing of the recordings to play back.

At any time a recording of the data can be started and stopped. To these recordings you can add certain 'markers'. To any button on the keyboard you can assign a marker value. Pushing this button will store the time and the value (the assigned marker id) of the marker in a separate marker file, which will automatically be exported along with the recorded EEG data. Recordings can be exported as a '.csv' file or with a '.edf' extension (European Data Format). We will choose the latter, because MNE-Python comes with a straightforward read function that recognizes this format.

4.3 MNE-Python

MNE-Python, as part of the MNE software suite, is a free software package created for exploring, visualizing, and analyzing human neurophysiological data. It provides state-of-the-art algorithms implemented in Python and builds upon core libraries for scientific computation such as Numpy and Scipy. What follows below is just a description of the bare necessities for working with MNE, to sketch an idea of the structure of MNE-Python. Full documentation and detailed class descriptions can be found at the Martinos website https://martinos.org/mne/. More detailed descriptions of how to apply MNE-Python can be found in the numerous articles on the topic available on the internet [12].

4.3.1 The Raw and Epochs data structures

Analysis of EEG-data with MNE-Python typically involves the two basic data-structures in this library: the Raw and Epochs objects.

An object of the Raw class is automatically created when the read function is employed to load raw EEG data. The core structure of the object is a 2D Numpy array of shape (n channels, n data-points). The number of data-points is equal to the length of the recording multiplied by the sampling rate (n seconds x 128 Hz). It has several attributes, such as an info-object, which is a dictionary containing the measurement info, a list of strings containing the channel names, and an array of time points. Furthermore, the class contains ample methods for manipulating and plotting the Raw object.
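As a minimal sketch of this structure (the filename recording.edf is a hypothetical example of a file exported from EmotivPRO), the attributes described above can be inspected like this:

import mne

# load an exported recording into a Raw object
raw = mne.io.read_raw_edf('recording.edf', preload=True)

data = raw.get_data()     # 2D Numpy array of shape (n channels, n data-points)
print(raw.info['sfreq'])  # the sampling rate stored in the info-object
print(raw.ch_names)       # the list of channel-name strings
print(raw.times[:5])      # the first entries of the array of time points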

From the Raw object you can extract a collection of time-locked trials (events) and store these in an Epochs object. The basic structure of this object is a 3D Numpy array of shape (n events, n channels, n data-points). This can be used to create training and test sets to feed to the neural network to be trained.

Because the Emotiv software does not produce an events list that is recognized by the Epochs class, we have to create this by hand. Next to the channels that contain the actual EEG data, the Raw object also includes a so-called marker channel, which gives us the times the events occurred and the corresponding event ids (the value assigned to the event). Using this information an events-object can be created, which is basically a 2D Numpy array, optionally accompanied by a dictionary containing the values of the event ids and the corresponding meaning of each value, i.e. the events assigned to the id. Having done this, an Epochs object is created by calling the class, with the events-object passed as a variable.
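The sketch below illustrates how such an events array could be built by hand; the helper function and the name of the marker channel are assumptions for illustration, not part of MNE itself (the actual create_events function used in this research is given in the Appendix):

import numpy as np

def create_events(raw, marker_channel='MARKER'):
    # Hypothetical helper: build the (n events, 3) integer array that
    # mne.Epochs expects, from the values stored in the marker channel.
    marker_data = raw.get_data(picks=[marker_channel])[0]
    sample_idx = np.nonzero(marker_data)[0]  # samples where a marker was placed
    events = np.zeros((len(sample_idx), 3), dtype=int)
    events[:, 0] = sample_idx                           # time of the event, in samples
    events[:, 2] = marker_data[sample_idx].astype(int)  # the assigned event id
    return events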

Furthermore, MNE-Python encompasses a tremendous amount of implemented functionality to manipulate and visualize the imported data that is stored in a Raw or Epochs object, which will not be employed during this research. Some examples of the effects of preprocessing will be given, but in a machine-learning course we would want the preprocessing to be done by neural networks as well, like the use of a denoising auto-encoder. The implemented methods in the MNE-library are basically a black box for the user, and exploiting these too much just for the sake of improving the accuracy of the network might not be very instructive.


Chapter 5

Experiments performed with Emotiv EPOC+

To explore the possibility of using this EEG-headset as a tool to create original data that can be utilized in a course on machine learning, some practical research is done to investigate the workings and the limits of the device. The experiments are also meant to serve as instructive examples.

5.1 Additional software and instruments used

The code to process and manipulate the raw EEG data imported from the EmotivPRO software should be written in Python 2.7 [13]. The reason for this is that the visualization of the data using the MNE-package is only supported in Python 2, and not yet available in Python 3.

For creating models of different kinds of networks, it is very efficient to use Keras, an open-source library designed to easily build and train neural networks, also written in Python [14]. Examples of such networks exploiting Keras can be found in the next sections, which describe several experiments with machine learning in detail.

For these experiments, the JupyterLab interface is used as a computational environment [15].

All the computation is done on an Apple notebook (1.8 GHz Intel Core i7 processor, 4 GB 1333 MHz DDR3 memory). Although the computation time for training the networks was not excessive (up to several minutes), it is advisable to use a more powerful computer, especially when multiple networks are to be trained in sequence while step-wise varying one or more parameters in search of an optimum.


5.2 General setup of experiments

The main goal of this research project is to examine the possibility of using the Emotiv EPOC+ to personally produce data-sets that are suitable for use in a machine learning course, i.e. data-sets that can be fed to a relatively basic neural network and produce significant results in a straightforward way. It is important that this EEG-device allows the user (i.e. students) to create easy and intuitive results that serve as a starting point for possible improvements and more complex experiments that also require more refined networks. Therefore all analysis and results presented below are based on data that comes from experiments done by and on the researcher himself. No other professionals, participants or data from open data-sets were used.

The experiments were performed at two different locations: the office location at the Gorlaeus Laboratories, and a home environment. In analyzing the experiments, the location where the data was taken is not taken into account, although it cannot be excluded that differences in background noise might have influenced the results. In general no effort is made to minimize background noise produced by EM radiation from electronic devices. However, a basic notch filter is applied to all the data to remove the powerline hum that was clearly present at 50 Hz.
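In MNE this notch filter is a single method call on the Raw object; a minimal sketch, assuming the recording has been loaded into a variable raw:

# remove the 50 Hz powerline hum in place
raw.notch_filter(freqs=50)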

All experiments were done with the subject in a similar sitting position, cautiously minimizing bodily movements (that were not part of the specific experiment) while recording data, to prevent unnecessary noise or contamination of the data. It should be mentioned, however, that this research was performed by, and hence the presented data was taken on, a person suffering from a complete high spinal cord injury, which means that all muscles in trunk and lower limbs are paralyzed. This might have caused potential noise from involuntary muscular activity to be lower than if the data had been taken on an able-bodied subject. Therefore it could in principle be possible that repeating the performed experiments on data taken on other subjects could lead to slightly different results. This, however, should not influence the general conclusions drawn from the results presented below.

5.2.1 Goal of the experiments

The practical goal in performing the experiments and creating data is to produce data-sets containing samples of labeled data in which a chosen event is known to have occurred. A neural network can then be trained either to distinguish data representing an event from a sample of blank data (in which no specific event has taken place, at least not knowingly), or to distinguish two different events. So in all experiments a network is trained on a binary classification, i.e. distinguishing samples from just two different classes. If a network is trained to recognize the occurrence of a chosen event, a continuous data-stream can then be read in window by window, with a step size that can be optimized after investigation. Every window can be fed to a trained network as a data-sample, to determine whether the specific event took place within that time-window.

Alas, the software accessible during this research does not allow for real-time analysis of recorded data. The data has to be recorded by the EmotivPRO software and, after saving, exported in a .csv or, in the current research, .edf format, for further analysis at a later time.

5.2.2 Making a recording

After making sure the felt pads of the sensors are fully saturated with a salt solution (a basic physiological salt solution or any contact lens solution), the headset can be slid over the skull until the reference pads are at the correct location behind the ears. After this, the individual sensors should be repositioned or moved under the hair until the connectivity help of the Emotiv software confirms that the connectivity is 100%.

When starting a recording there is always the option to include a baseline recording, implemented in the software. If chosen, the user is instructed to sit still for 15 seconds, first with eyes open and then again with eyes closed. Markers are automatically placed by the software, so no manual labelling of the baseline recording is needed. Although in the current preliminary research this convenient tool is not exploited, the results presented below indicate that for any noise reduction it is of paramount importance to always start any new recording with a baseline recording. This is because the noise seems to differ significantly with every recording. Consequently, when feeding a new recording into a network to search for the occurrence of events, the network first has to learn again what a blank sample of that particular recording looks like. This also means that in actively reducing noise, the noise has to be sampled within every single recording, and cannot be fully removed by a standard preprocessing procedure.


5.2.3 Training the neural networks

Using the Epochs class an array containing the samples can be created. As mentioned, the samples stored in the Epochs object each have a 2D shape: (n channels, n data-points). The samples are then divided in an 80/20 proportion to make a training set and a test set. For training the neural networks, two approaches are chosen:

1. The 2D shape of the samples can be linearized. For this, the data of all 14 channels are put front to back. The result is a 1D array of length (14 x sample-length). This array can be fed to a 1D perceptron (see the sketch after this list).

2. The 2D-samples can serve as input in a 2D-convolutional network.
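A minimal sketch of both approaches, assuming samples holds the 3D Numpy array obtained from the Epochs object:

import numpy as np

# approach 1: flatten each (14, sample-length) event into one long 1D array
x_flat = samples.reshape(len(samples), -1)  # shape: (n events, 14 * sample-length)

# approach 2: keep the 2D shape and add a trailing axis for Conv2D
x_conv = samples[..., np.newaxis]           # shape: (n events, 14, sample-length, 1)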

5.2.4 General properties of neural networks

To give a basic idea, the components and features of the most commonly used neural networks built with the Keras library are explained here. In code, the network in general looks like this example:

import keras
from keras.models import Sequential
from keras.layers import Conv2D, Dense, Dropout, Flatten

batch_size = 32
epochs = 40

model = Sequential()
model.add(Conv2D(10, kernel_size=(14, 20), activation='relu',
                 input_shape=(14, 129, 1)))
model.add(Dropout(0.45))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.45))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.45))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
model.summary()

histories_3 = Histories()  # custom callback (defined elsewhere) that records the accuracy per epoch
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          callbacks=[histories_3],
          validation_data=(x_test, y_test))

To go step by step through the terminology:

Batchsize:
The batch size is the number of training samples that is fed to the network before the weights are updated. So choosing a higher batch size will increase the speed of the training process. However, because the weights are then updated less frequently, you might need more training rounds (epochs).

Epochs:
In Keras, the epochs denote the number of times that the entire dataset is passed through the network during the full training of the network. It is unfortunate that in the terminology of machine learning 'epoch' refers to two things: in MNE-Python it is also the name for the data recorded around an event.

Conv2D:
This denotes that we are using a convolutional network.

kernel_size:
This is the size of the kernels or 'filters' that slide over the data. During the forward pass, each filter is 'convolved' across the width and height of the input data, computing the dot product between the entries of the filter and the input and producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when they detect some specific type of feature in the input.

activation:
The type of activation function that is used. The output of a layer is passed through this function to determine the amplitude of the output. 'Relu' stands for rectified linear unit:

R(x) = x⁺ = max(0, x)

For the output of the final layer the sigmoid function is used because we are training the network to do a binary classification.

Dropout:
The dropout rate is the fraction of node activations that is randomly set to zero during each training update. This can be used to prevent overtraining and to avoid the network getting stuck in local minima.

Flatten:
Converts the 2D structure of the network to a single dimension, so it can serve as input for the dense layers that follow. Dense layers are the standard building blocks of a 1D perceptron.

loss:
The loss function is used to measure the inconsistency between the predicted value (ŷ) and the actual label (y). The network learns by minimizing the loss function. During this research, the binary cross-entropy function is used:

L(y, ŷ) = −( y·log(ŷ) + (1 − y)·log(1 − ŷ) )

The prediction ŷ = f(x) is computed from the dot products of the input vector x with the weights of the neurons in the network. This will become more clear in the example below. The network learns by adjusting the weights, after feeding one batch of samples to the network, in such a way as to minimize the loss function. This is done by calculating its gradient.

optimizer:
The learning rate of the network is a parameter that determines by how much the weights are adjusted after the gradient of the loss function has been calculated. An 'optimizer' just adjusts this learning rate in a smart way. Usually, a high gradient means the loss function is not yet close to a minimum, and a big learning rate can be used without the risk of overshooting the minimum. Adadelta ('Ada' stands for 'Adaptive') also takes previous gradients into account when adjusting the learning rate.

5.2.5 Elementary workings of neural networks

An artificial neural network is a circuit composed of artificial neurons, or nodes. The connections between these nodes are modelled as weights; a large positive weight represents an excitatory connection. When feeding an input to a neural network, all input values are modified by the weights of the nodes and summed. A single bias value can then be added as a last parameter to determine the single output value. Finally, an activation function determines the amplitude of the output.

To explain these basic workings of neural networks, it might be instructive to go through a simple example. Let's say we want to build a neural network that works as an AND gate. This means we have only two input nodes, and the possible combinations and required output can be represented in the following table:

input    output
0 0      0
0 1      0
1 0      0
1 1      1

So in this network there are just two input nodes that can only take binary values. We have to find two weights that give us the desired output for every possible combination of input. The input nodes can be represented in a vector of length two, just as the weights. In formula, going from input to output:


f(x) = H(w1·x1 + w2·x2 + b) = y

In this formula, b is the bias parameter, and H(n) our activation function, in this case the Heaviside function:

H(n) = 0 if n < 0
H(n) = 1 if n ≥ 0

It is easy to see that good values of the parameters would be:

w1 = 0.5,  w2 = 0.5,  b = −0.75
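These values are easy to verify with a few lines of numpy (a quick check for illustration, not part of the thesis code):

import numpy as np

w = np.array([0.5, 0.5])  # the two weights
b = -0.75                 # the bias

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    n = np.dot(w, x) + b
    print(x, '->', int(n >= 0))  # Heaviside output: 1 only for input (1, 1)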

Now, in a more complex network, an algorithm is used to calculate the values of the weights and the bias. The initial values of the weights are randomly generated with a value between 0 and 1. As explained, the batch size determines after how many samples the weights are updated. For this optimization of the weights, the loss function and its gradient are used, as explained above.

For another example, using the building blocks of Keras, we can train a neural network to function as an XOR gate. Going through this example might give some insight into how a network is coded in Keras.

The possible inputs and corresponding outputs of the XOR-gate are shown in the table below.

input    output
0 0      0
0 1      1
1 0      1
1 1      0

Now, in contrast to the simpler AND gate, this problem cannot be solved by a single-layer perceptron, so we need to add a so-called 'hidden' layer. Schematically, the network we use can be pictured as in figure 5.1.


Figure 5.1: Schematic picture of the XOR network. The biases are written as W00 and W10, with input value 1.

In Keras, this network is coded as follows:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

x_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_train = np.array([0, 1, 1, 0])
x_test = x_train
y_test = y_train

batch_size = 4
epochs = 500

model = Sequential()
model.add(Dense(2, activation='relu', input_shape=(2,)))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='AdaDelta',
              loss='binary_crossentropy',
              metrics=['accuracy'])

We use all the possible inputs as training inputs as well as a test set. The batch size is set to 4, so every time all the possible inputs have gone through the network the weights are updated. Running this code gives the following output:

4 train samples
4 test samples
Sample length: 2, i.e. 0 seconds
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_65 (Dense)             (None, 2)                 6
_________________________________________________________________
dense_66 (Dense)             (None, 1)                 3
=================================================================
Total params: 9
Trainable params: 9
Non-trainable params: 0
_________________________________________________________________


As you can see there are 6 parameters in the first layer, and 3 parameters in the second layer. These are the weights as shown in figure 5.1. The effect of the training process on the performance of the network is shown in figure 5.2.

Figure 5.2: Performance of the double-layer perceptron training on the XOR-problem

We can print the final weights of the network, and recalculate by hand the outputs from the four possible inputs.

The weights are:

W00 = −0.52,  W01 = 0.0,  W02 = 0.52,  W03 = 0.35,  W04 = 0.52,  W05 = 0.35

For the second layer:

W10 = −0.22,  W11 = −2.10,  W12 = 1.4

Let's say every input is extended to x = (1, x1, x2): the first element is added as input for the bias. The four possible inputs are then:

x1 = (1, 0, 0)
x2 = (1, 0, 1)
x3 = (1, 1, 0)
x4 = (1, 1, 1)

To check the result of the network by hand we calculate the output, starting with x1. First we have to determine the values of a1 and a2 as in figure 5.1:

a1 = R(W00·1 + W02·0 + W04·0) = R(−0.52·1 + 0.52·0 + 0.52·0) = R(−0.52) = 0

a2 = R(W01·1 + W03·0 + W05·0) = R(0·1 + 0.35·0 + 0.35·0) = R(0) = 0

Then the final output, using the sigmoid function S, is:

y1 = S(W10·1 + W11·0 + W12·0) = S(−0.22) = 0.44 → 0

We can do the same for the other possible inputs:

y2 = S(W10·1 + W11·0 + W12·0.35) = S(0.27) = 0.57 → 1

y3 = S(W10·1 + W11·0 + W12·0.35) = S(0.27) = 0.57 → 1

And for the final one:

y4 = S(W10·1 + W11·0.52 + W12·0.7) = S(−0.22 − 2.1·0.52 + 1.4·0.7) = S(−0.11) = 0.47 → 0

So you can see this network actually learns quite slowly. Only after more than 250 epochs does it produce the correct output, and when we check the result by hand after 500 epochs, the output is only just within the limits. Now of course this is the simplest network that can learn to function as an XOR gate. We could adjust the hidden layer to have 64 nodes. Then the network has in total (3 x 64) + 65 = 257 trainable parameters, and it learns much faster, as is shown in figure 5.3. Also, how fast the network trains towards a perfect performance depends on the initial values of the weights, which have random values when the network starts training.


That is why we find different performances over multiple runs.
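Only the hidden layer changes in this wider variant; a minimal sketch (the training code stays the same as above):

# the same XOR model, but with a 64-node hidden layer
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(2,)))  # (2 weights + 1 bias) x 64 = 192 parameters
model.add(Dense(1, activation='sigmoid'))                  # 64 weights + 1 bias = 65 parameters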

Figure 5.3: Performance of the double-layer perceptron with 64 nodes over 3 runs, training on the XOR-problem

5.3 Detecting Blinks

For a first attempt to create data suitable for a network to train on, a recording is made in which blinks are labeled. The main reason to choose blinks as a first event to train on is that their occurrences are easily discernible by the naked eye as well, so we can check by visualization whether the labeling process, and the code that manipulates the data into a stack of training samples to serve as input for a network, is performing as it should. The goal is to train a network to recognize a blink when it occurs in a continuous stream of data. For this the network has to learn to distinguish a blink from a 'blank' piece of EEG data where no blinking took place.

5.3.1 Creating data

Two markers are assigned: the '1' button of a regular computer keyboard marks a (conscious) blink, and the '2' button is used to mark a piece of EEG data where no blinking occurred. We created a recording of 294 seconds, containing in total 55 events: 28 blinks and 27 marked fragments of data where no blinking occurred. This can, for example, be done as follows:


import mne

# creating raw objects by reading in EEG data
blinks_raw = mne.io.read_raw_edf(filename)

# Define the time points before and after the marker
# where the raw file will be cut. Here every epoch
# will contain 2 seconds of data, i.e. 257 data-points
tmin = -1
tmax = 1

# define by hand the values assigned to the events
event_id = dict(blinks=0, no_blinks=1)

# call function to create events-object (full code in Appendix)
events = create_events(markers)

# pass events-object as a variable to the Epochs class, to create an array of samples
epochs = mne.Epochs(blinks_raw, events=events, event_id=event_id,
                    tmin=tmin, tmax=tmax, preload=True)

This gives the following output:

64 matching events found
0 projection items activated
Loading data for 64 events and 257 original time points ...
0 bad epochs dropped

From this Epochs object we can, with some manipulation, create a training set and a test set, and a 1D Numpy array that contains the correct labels for the supervised learning that we intend to do. For example:

# convert objects to numpy arrays
samples = np.array(epochs)
labels = np.array(events[:, 2])  # third column contains the labels

# shuffle the samples and labels in unison, before dividing into train and test set
np.random.seed(SEED)
np.random.shuffle(samples)
np.random.seed(SEED)
np.random.shuffle(labels)

# divide samples into a training and a test set (80/20)
split = int(len(samples) * 0.8)
x_train = samples[:split]
x_test = samples[split:]
y_train = labels[:split]
y_test = labels[split:]


(a) 3 samples from the set of 'blinks'. (b) 3 samples of 'blank' data.

Figure 5.4: Samples from both categories to be distinguished by the neural network. Clearly the data shows quite some differences between the recorded blinks, but they show enough similarity, and difference with the 'blank' data, that we may expect a network to recognize this.

5.3.2 Bayesian classifier

Since the blinks are easily discernible by eye as well, the most straightforward way of identifying a blink in a sample of data is to apply a Bayesian classifier. In general, a Bayesian classifier is a rule that assigns to an observation a 'best guess' or estimate of what the unobserved label corresponding to that observation actually was. In the case of identifying blinks in a sample of data, we can, for example, just look at the maximum value of the EEG signal of both the samples where a blink occurred and the blank samples taken in between blinks.


Figure 5.5: Some samples of the signal recorded by channel T7

Plotting the maximum values of all 272 samples of the training set in a histogram then suggests a value to choose as a classifier, as shown in figure 5.6.

Figure 5.6: Histogram of maximum values of samples of channel T7

By inspecting the plots, we conclude that a maximum value of 6 will serve well as a Bayesian classifier. The samples in the test set can now be classified. The results are shown in the following confusion matrix:


                    blinks   blank
blinks predicted      34       3
blanks predicted       3      29
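In code this classifier is nothing more than a threshold test; a minimal sketch, with the value of 6 µV read off from figure 5.6:

def classify_max(sample, threshold=6):
    # sample: 1D Numpy array with the signal of channel T7, in microvolts
    return 'blink' if sample.max() > threshold else 'blank'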

5.3.3 Training perceptrons

The samples are now linearized so a perceptron can be trained. We choose to take a time window of 0.5 seconds around every blink, which equals 65 data-points. After putting all the channels behind each other, the total length of a sample is 14 x 65 = 910 data-points. This will be the shape of the input layer of the perceptron. The following model is trained on a recording containing ±100 blinks and 100 labeled blank samples:

# Single layer perceptron
batch_size = 8
epochs = 15  # number of times the model is trained on the total training set

print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

model = Sequential()
model.add(Dense(1, activation='sigmoid', input_shape=(910,)))
model.summary()
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          epochs=epochs,
          batch_size=batch_size,
          verbose=1)
score = model.evaluate(x_test, y_test, verbose=1)

print('Test loss:', score[0])
print('Test accuracy:', score[1])

In the fit method of the model class, the variable verbose is set to 1. The same is done for the evaluate method. This makes it possible to quickly examine the properties of the model, and to follow the improvements after every trained epoch. To illustrate, the above model gives the following output over the first couple of epochs:

173 train samples
44 test samples
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_38 (Dense)             (None, 1)                 911
=================================================================
Total params: 911
Trainable params: 911
Non-trainable params: 0
_________________________________________________________________
Train on 173 samples, validate on 44 samples
Epoch 1/15
173/173 [==============================] - 1s 4ms/step - loss: 0.5908 - acc: 0.7572 - val_loss: 0.5293 - val_acc: 0.7045
Epoch 2/15
173/173 [==============================] - 0s 457us/step - loss: 0.3886 - acc: 0.8439 - val_loss: 0.6281 - val_acc: 0.7273
Epoch 3/15
173/173 [==============================] - 0s 498us/step - loss: 0.3025 - acc: 0.8902 - val_loss: 0.6578 - val_acc: 0.7727
Epoch 4/15
173/173 [==============================] - 0s 461us/step - loss: 0.2654 - acc: 0.9191 - val_loss: 0.5696 - val_acc: 0.7045
Epoch 5/15
173/173 [==============================] - 0s 504us/step - loss: 0.2057 - acc: 0.9538 - val_loss: 0.6657 - val_acc: 0.7273

The validation accuracy of 72.73% is calculated by dividing the number of correct classifications by the total number of classifications (i.e. the number of samples in the test set).

The following two models are also trained and their performance is compared to see if adding layers improves the accuracy (taking into account that training a multi-layer perceptron (MLP) takes more computation time, and might be less effective even though it might need fewer epochs to reach sufficient accuracy):


# double layer perceptron
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(910,)))
model.add(Dense(1, activation='sigmoid'))

# multi-layer perceptron
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(910,)))
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

Over 15 epochs the following results are obtained:

Figure 5.7: The accuracy of 3 perceptrons plotted against the increasing number of epochs.

Clearly the multi-layer perceptron performs significantly better than the single-layer one. Now the next step is to train the network on one or more recordings of blinks, and then test whether the network is capable of recognizing blinks when reading in a separate file on which it has not been trained. Two other files, both containing ±120 samples, are used to check the performance.


Figure 5.8: Training on one file, testing on 2 other files

Clearly the performance is not as good as when training and testing on blinks within the same file. But considering the perfect accuracy on the training set, it is reasonable to assume that the model over-trains fairly quickly.

5.3.4 Convolutional networks

Keeping the samples in their 2D shape, the following convolutional network is trained:

model = Sequential()
model.add(Conv2D(4, kernel_size=(14, 20), activation='relu',
                 input_shape=(14, 65, 1)))
model.add(Flatten())
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

The width of the kernel is an important variable here. As can be seen in the convolutional layer, this model trains 4 kernels of width 20.

First, a network is trained on each of the 3 files individually, and the performance is checked on test samples taken from the same file as the training samples. In all three cases the accuracy is 100%.


Figure 5.9: The performance of the convolutional network on 3 independent files

Again we train the model on one recording, and test the performance on the 2 other files. The results are plotted in figure 5.10.

Figure 5.10: The performance of the convolutional network plotted against the number of epochs. The network is trained on file 1, and then tested on file 1, file 2 and file 3.

Since the convolutional network performs better than a perceptron, this type of network is used in trying to locate blinks when reading in a continuous data-stream.


5.3.5 Comparison of classification

Now, using the samples from all the available recordings, we can compare the performance of the Bayesian classifier and the different neural networks that were used. We train on 270 samples and test on 69 samples. Both networks are trained for 5 epochs.

                        misclassifications    accuracy (%)
Bayesian classifier              6                 91
Perceptron                       2                 97
Convolutional network            0                100

5.3.6 Reading in a continuous data-stream

The next step is to train the network on one or more recordings of blinks, and then test whether the network is capable of recognizing blinks when reading in a separate file on which it is not trained.

The network is trained using the convolutional network as shown in the previous section. Next, a raw file containing blinks that were not used as training samples is read in, window by window. Every window is evaluated by the trained network and classified as a blink or a blank sample. After classifying, the window advances 5 data-points, and another sample of the data is taken and classified. This means that the algorithm checks for the occurrence of a blink about 25 times a second. The assigned class, 0 or 1, is marked with an orange cross (x) and plotted in figure 5.11. For visualization, the actual data of one channel (channel AF4) is also shown.


Figure 5.11: Classification of data on the occurrence of blinks. The orange line at zero is in fact a row of crosses, which indicate pieces of data classified as a ’blank’.

We see that the algorithm detects the blinks as expected.
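A minimal sketch of this sliding-window loop; model is the trained convolutional network and data a (14, n data-points) array with the new recording (both variable names are assumptions):

import numpy as np

window = 65  # 0.5 s of data, matching the input shape of the network
step = 5     # advance 5 data-points per step: about 25 checks per second at 128 Hz

classes = []
for start in range(0, data.shape[1] - window, step):
    sample = data[:, start:start + window].reshape(1, 14, window, 1)
    prob = model.predict(sample)[0, 0]  # sigmoid output between 0 and 1
    classes.append(int(prob > 0.5))     # 1: blink detected, 0: blank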

5.4 Detecting movements

To investigate the ability to detect patterns in EEG data stemming from bodily movements, several experiments are done, with different kinds of movements.

First we discuss one experiment, and explore the effects of changing parameters of the networks. Then several other experiments are done, with more or less the same network.

5.4.1 Detecting single hand movements: optimizing parameters

The basic experiment for this first exploration is as follows: one hand is resting on the keyboard, with the index finger on the '1' button and the middle finger placed on the '2' button, labeling a movement. The free arm makes a sharp outward movement when the '1' button is pressed with the other hand. The reason for choosing this setup is that it makes it possible to do the same experiment with eyes closed, because the fingers of one hand are already on the labeling buttons.

Some samples of this experiment are shown in figure 5.12. It is hard to identify by eye a pattern in this data that would correspond to a movement of the arm. In general, surprisingly enough, there seems to be more random noise in the blank samples than in the samples where an event occurred.

Figure 5.12: Raw data of all channels. Two samples taken around a movement and 2 blank samples are shown for comparison.

Perceptron

The 14 channel signals corresponding to a single event are linearized and several perceptrons are trained to compare the performance. A time window of 1 second around the markers is used to capture the events: tmin = -0.5, tmax = 0.5. The input shape for the network is then 14 channels x 129 data-points = 1806.

The very first significant thing to do is to remove the power-line hum and the slow drifts. This can easily be done by adding a high-pass filter that removes all frequencies below 1 Hz, and a notch filter of width 2 at 50 Hz.
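Both filters are available as methods on the Raw object; a minimal sketch:

raw.filter(l_freq=1.0, h_freq=None)         # high-pass: remove the slow drifts below 1 Hz
raw.notch_filter(freqs=50, notch_widths=2)  # notch of width 2 at the 50 Hz powerline hum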

Figure 5.13: Plot of the power spectral densities of a recording of actual EEG data (left) and of a recording made on a dummy head (right). On both recordings the power-line hum at 50 Hz is clearly visible.

To see the difference in performance on raw and filtered data, a fairly basic 2-layer network is trained:

model = Sequential()
model.add(Dense(256, activation='relu', input_shape=(len(x_train[0]),)))
model.add(Dense(1, activation='sigmoid'))

(a) Result of perceptron training on raw data. (b) Result of perceptron training on filtered data.

Figure 5.14: 2 plots comparing the difference in performance of a network trained on detecting movements in raw and in filtered data

Several things can be deduced from figure 5.14. First of all, it is important to notice there are some deviations between the different runs, which is due to the fact that the initial weights of the network are assigned random values between 0 and 1. From that starting point the network improves by adjusting the weights after every epoch, until the error function reaches a (local) minimum. Also, when a perfect accuracy is reached on the training set, the network does not improve anymore, and the performance on the test set stabilizes as well. This is something that can in principle be overcome by creating more data, so that it becomes harder for the network to reach a perfect classification on the training set.

Figure 5.15: Performance of 3 perceptrons with multiple layers

Although the differences are not that significant, the above plot again indicates several things: in minimizing the error function, the algorithm seems to get stuck in a local minimum, after which the improvement is put to a halt. Also, although the training performance of the 4-layer perceptron is best, its accuracy in classifying the test set is the lowest, which indicates over-training. Also, the accuracy on the training set reaches 100% around 15 epochs, after which it cannot improve anymore, and this limits the accuracy on the test set as well. Both these problems can be avoided by adding one or more drop-out layers to the network, which randomly deactivate a fraction of the nodes during every training update. In finding the optimal parameters for a network, however, it is important to stress here that the data, although containing 300 samples, is still quite cheap. The test set consists of just over 50 samples, which means that a misclassification of a single sample leads to a drop in the accuracy of 2%. Before adding the drop-out layers, a sanity check is done to see whether we are actually training on the movements. A network is trained to distinguish the first half of the movement samples from the second half, and also to distinguish alternating samples. The result is shown in figure 5.16.


Figure 5.16: A perceptron is trained to distinguish samples all labelled as data corresponding to a movement of the arm.

The network still trains very well (figure 5.16), which again indicates that the data is 'cheap': the network simply learns to recognize every individual sample instead of training on general features.

Adding 2 dropout layers to the multi-layer perceptron also makes it possible to train over more epochs, since it takes the network more epochs before it starts over-training. The following 3-layer perceptron is trained over 120 epochs, varying the drop-out between 0.25 and 0.75:

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(len(x_train[0]),)))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

Local minima are now mostly avoided. Although the differences are not that significant, the network still over-trains when using a small drop-out value. When using a high drop-out, the network no longer trains efficiently, as is clear from figure 5.17 on the next page.


Figure 5.17: Performance of 3-layer perceptrons with 3 different values for the drop-out

We look at the confusion matrices corresponding to the performances shown in the graph above.

             dropout 0.25      dropout 0.5       dropout 0.75
             move   no move    move   no move    move   no move
move pred.    22       2        25       2        24       2


Because the test set contains only 59 samples, the difference in misclassifications is in fact marginal. Of course, the limit on the accuracy could originate from man-made mistakes in labeling the events. Another possibility is the occurrence of some random peaks in the noise. A way to investigate this is to plot all the samples in a single graph.

Figure 5.18: All samples in the recording, containing blank samples as well as samples where a movement occurred. For every time point where an event is labeled, one second of data from all 14 channels is put into a 1D array. The length of every sample is therefore 14 x 129 = 1806 data-points.

From figure 5.18, it seems reasonable to consider all samples with an amplitude of over 30 µV as noisy or as containing outliers. After removing these, the total sample set is reduced from 293 to 285. The remaining samples are plotted in figure 5.19.


Figure 5.19: Samples after removing all samples containing an absolute value above 30 µV. Samples with data above this threshold are considered noisy or to contain outliers and are excluded from the training of the networks to improve the performance.

A 3-layer network with a drop-out of 0.5 is trained again over 125 epochs to show the difference in performance between the complete set and the 'cleaned' dataset (figure 5.20).

Figure 5.20: Performance of a perceptron on all samples and on the 'cleaned' set, where all noisy samples are removed.

There are of course still many more parameters we could try to optimize here. As a last example, we look at different lengths of the time-window wherein we look for the events. The results are shown in figure 5.21.
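Varying the time-window amounts to changing the tmin and tmax arguments when the data is epoched; a sketch with MNE-Python, assuming a Raw object raw and an events array are already available:

import mne

# a one-second window starting at each event marker; vary tmax to change the window length
epochs = mne.Epochs(raw, events, tmin=0.0, tmax=1.0, baseline=None, preload=True)
x = epochs.get_data()  # shape: (n_events, n_channels, n_timepoints)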

Figure 5.21: Performance of a perceptron on several different time-windows.

Convolutional network

We can also choose to preserve the 2-dimensional shape of the events. The same perceptron as in the previous section is used, but now combined with a convolutional network:

from keras.layers import Conv2D, Flatten

model = Sequential()
model.add(Conv2D(5, kernel_size=(14, 20), activation='relu',
                 input_shape=(14, 129, 1)))  # 5 filters spanning all 14 channels, 20 timepoints wide
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
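Note that the Conv2D input shape requires the flattened samples to be reshaped back into (channels, timepoints, 1); a short sketch, assuming x_train and x_test hold the 1806-point samples:

x_train_2d = x_train.reshape(-1, 14, 129, 1)
x_test_2d = x_test.reshape(-1, 14, 129, 1)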

We see that the performance of this network does not improve on the accuracy already reached by the perceptron (figure 5.22).


(a) Result of training a perceptron on 3 runs.

(b) Result of training a convolutional network combined with a perceptron.

Figure 5.22: Two plots comparing the performance of the perceptron and of the convolutional network combined with a perceptron.

Once this apparent maximum in the training result is reached, the exact values of the parameters are not that relevant (as long as the network does not over-train), as shown in figure 5.23.


(a) Performance of a convolutional network trained with 15 filters (instead of 5, as in figure 5.22b).

(b) Performance of a convolutional network trained with filters of width 40 (instead of 20, as in figure 5.22b).

Figure 5.23: Two plots comparing the difference in performance after varying parameters.

Now the trained convolutional network is used to read in a piece of continuous data. The data belongs to the same recording as the one used to train the network; however, the events in this fragment were used in neither the training- nor the test-set, to make sure that the network has not simply learned the individual events. The step-size of the window wherein we look for an event is 20 data-points, which means that about every 0.15 seconds a snapshot is taken from the data and fed to the network to test whether a movement occurred.
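A sketch of this sliding-window classification, assuming data is a (14, n_timepoints) NumPy array of the continuous recording and model is the trained network (names illustrative):

window, step = 129, 20  # a 1 s window, moved in steps of ~0.15 s at 128 Hz
detections = []
for start in range(0, data.shape[1] - window, step):
    snapshot = data[:, start:start + window].reshape(1, 14, window, 1)
    detections.append(model.predict(snapshot)[0, 0] > 0.5)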

Figure 5.24: Result of reading in a continuous data stream. The stars represent marked movements; the horizontal blue line at zero is in fact a collection of blue stars, marking the periods without movement.


The result is comparable to the performance on the test-set: most movements are detected at the correct location, but some misclassifications are made. Of course, it is always possible that some movement was made unwittingly, without consciously marking an event. Interesting to see is also that the movement is mostly already detected in the data before the marker-button is actually pressed. This would indicate that the brain activity corresponding to a physical movement is mostly present at the very start of the movement.

Now the same routine is repeated for a different recording (figure 5.25): a network is trained on a specific recording and then used to try and detect movements in another recording.

Figure 5.25: Result of reading in a continuous data stream from a recording on which the network was not trained.

Clearly, the network is not capable of recognizing the movements in EEG-data belonging to a file on which it was not trained. To investigate why this is, the experiment described in the next section is done.

5.4.2 Testing consistency between files

One important goal we want to achieve with training a network is that, after being trained, the network will be able to recognize the same events in other files as well. To explore this possibility, the following experiment is done:

• File 1: a recording of ±100 samples is made (50 movements and 50 fragments of blank data).

• The headset is taken off and placed back on the head.

• File 2.1: a recording of ±300 samples is made (150 movements and 150 fragments of blank data).

• File 2.2: a recording of ±100 samples is made.

• The headset is taken off and placed back on the head.

• File 3: a recording of ±100 samples is made.

• The headset is taken off and placed back on the head.

• File 4: a recording of ±100 samples is made.

The convolutional network from the previous section is trained on file 2.1. The trained network is then used to classify all the samples from file 1, file 2.2, file 3 and file 4. The results are shown in the confusion matrices below.

              file 1          file 2.2        file 3          file 4
              move    blank   move    blank   move    blank   move    blank
move pred.     37      43      43      19      45      38      48       1
blank pred.     6       1       9      32       3       4       5      50

In general, the network seems to perform better if the headset has not been moved. The network especially misclassifies blanks as movements. This would indicate that the background noise is quite specific and changes with a slightly different positioning of the headset. For file 4, however, the headset was apparently replaced on the head with the sensors in a similar position. In this case the noise turned out to be similar enough to the noise in the training file (file 2.1) to be able to distinguish the blanks from the movements.

The same experiment is repeated, but now with closed eyes, to see if this gives a more stable result. The training result on file 2.1 (eyes closed) is shown in figure 5.26.

A similar trend is observed: it is mostly the noise (or baseline) that changes when the headset is removed and placed back on. This makes it difficult to train on a variety of events collected over multiple experiments, or to evaluate a new recording for possible events, without having to train the network again from scratch.

In an attempt to generalize over the noise, the network is now trained on 3 files in which movements were labeled with closed eyes, and then tested on a 4th file (figure 5.27). The combined set contains 307 training-samples and 79 test-samples.
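Combining the labeled samples of several recordings is a straightforward concatenation; a minimal sketch, with illustrative array names:

import numpy as np

x_train = np.concatenate([x_file1, x_file21, x_file3])
y_train = np.concatenate([y_file1, y_file21, y_file3])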

Indeed, the network is now better capable of detecting the blank samples, but performs significantly worse in detecting the movements. One way to resolve this problem could be to make a baseline recording at the start of every new recording: taking blank samples from this, the network can learn to recognize the basic noise in that specific file.
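A sketch of that idea, assuming the new recording starts with 30 seconds of guaranteed blank data and is available as a (14, n) NumPy array raw_new (all names and numbers here are illustrative, not part of the performed experiments):

import numpy as np

fs, window = 128, 129  # sampling rate and 1 s window
baseline = raw_new[:, :30 * fs]  # the blank baseline fragment
blanks = [baseline[:, i:i + window]
          for i in range(0, baseline.shape[1] - window, window)]
x_blank = np.stack(blanks).reshape(-1, 14, window, 1)  # extra blank samples for (re)training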


Figure 5.26: Performance of a convolutional network on movements from a recording made with closed eyes.

The confusion matrices for the eyes-closed experiment (network again trained on file 2.1):

              file 1          file 2.2        file 3          file 4
              move    blank   move    blank   move    blank   move    blank
move pred.     50      50      39       7      48      31     100      49
blank pred.     1       0       3      31       2      18       5      54

              files 1 + 2.1 + 3 (test-set)    file 4
              move       blank                move       blank
move pred.     40          1                   83         25
blank pred.     4         34                   22         78

Figure 5.27: Performance of a convolutional network in detecting movements when samples of 3 separate files are put together.

5.4.3 Discerning left- and right-hand movements

One possible way to overcome the problem of the varying background signal is to not discern an event from a blank sample, but instead to classify two different events. Therefore, a network is trained to distinguish left-hand from right-hand movements. A similar experiment is done as in the previous section: the network is trained on file 2.1, and file 2.2 is recorded after some time, but without moving the position of the sensors of the headset.


Figure 5.28: Performance of a convolutional network in discerning a left-hand movement from a right-hand movement.

              file 1          file 2.2        file 3
              left    right   left    right   left    right
left pred.     31      23      33       3      19      14
right pred.    16      23       1      30       5      10

Results are again not consistent between files. To examine further what is happening here, a network is trained to recognize blank samples of data from a recording that also contains marked occurrences of left- and right-hand movements. Another network is trained to discern left- and right-hand movements from the previous files. A fragment of data is then read in as a continuous stream and classified window by window: first, one network classifies each window as blank data or as containing an event; when classified as an event, the second network classifies the sample as a left-hand or a right-hand movement (figure 5.29).
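A sketch of this two-stage classification for a single window, where event_model and lr_model stand for the two trained networks and snapshot is a (1, 14, 129, 1) array (these names are illustrative):

# stage 1: blank data vs. event
if event_model.predict(snapshot)[0, 0] > 0.5:
    # stage 2: left- vs. right-hand movement
    label = 'right' if lr_model.predict(snapshot)[0, 0] > 0.5 else 'left'
else:
    label = 'blank'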

It is interesting to see that a right-hand movement is first detected as a left-hand movement. This could be simply because left- and right-hand movements are not so easily distinguished, but it could also originate in the actual movement. The movements are made by a left-handed person, leaning with the left elbow on the table to compensate for a lack of trunk-stability due to a spinal-cord injury and resulting paralysis. This means the subject is also leaning on the left arm when reaching forward with the right arm to push the marker-button. More research would have to be done, however, to draw any definite conclusions.


Figure 5.29: Result of classifying a continuous data stream in two stages: detecting events and discerning left-hand from right-hand movements.


5.4.4 Analyze limits of detectable movements

To investigate the limits of the network's ability to detect changes in the EEG-data due to subtle movements, two experiments are done:

1. Discerning movements of the index- and middle-finger.

The hand that does the labeling rests on the marker-buttons. Alternately, a button is pressed by the index-finger and by the middle-finger. The experiment is done with the left hand, with eyes open. There are 200 labeled events in the recording. The chosen time-window is 1 second.

The following 3 perceptrons are trained:

Network 1:

model = Sequential()
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

Network 2:

model = Sequential()
model.add(Dense(1028, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

Network 3:

model = Sequential()
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.75))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.75))
model.add(Dense(1, activation='sigmoid'))

Figure 5.30: Performance of 3 multi-layer perceptrons in discerning movements of the index- and middle-finger.

For the multi-layer perceptron, neither adding an extra layer nor increasing the drop-out improves the performance. In addition, 2 convolutional networks are trained:

Network 1:

model = Sequential()
model.add(Conv2D(5, kernel_size=(14, 20), activation='relu',
                 input_shape=(14, 129, 1)))
model.add(Dropout(0.75))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.75))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

Network 2:

model = Sequential()
model.add(Conv2D(10, kernel_size=(14, 20), activation='relu',
                 input_shape=(14, 129, 1)))
model.add(Dropout(0.75))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.75))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

Figure 5.31: Performance of 2 convolutional networks in discerning movements of the index- and middle-finger.

For these networks too, the performance cannot be improved by altering the network parameters, as shown in figure 5.31.

2. Discerning movements of the left and right index-finger.

Both the left and the right arm rest on the marker-buttons. Alternately, the left and the right index-finger press a marker-button. For this experiment, the first of the convolutional networks shown in the previous experiment is used (figure 5.32).
