
MSc Artificial Intelligence

Master Thesis

Intelligent Data Augmentation for Physiological Signals using Conditional Generative Attention Models

by

Andrei Furdui

12256439

July, 2020

48 ECTS
November 2019 - July 2020
Supervisors: Dr. Abdallah El Ali, Prof. Dr. Marcel Worring

Assessor: Dr. Shaodi You

Centrum Wiskunde & Informatica
Distributed and Interactive Systems Group


ABSTRACT

Computational recognition of human emotion has recently become a major field of study, following advances in Deep Learning (DL) methods of learning from large collections of data. Emotions are internal subjective states that are accompanied by certain autonomous bodily reactions, also known as physiological signals. The relation between physiological data and subjective emotional states is well understood, which affords their use in the task of emotion recognition. Major difficulties arise from the fact that physiological signals are highly individual for each human organism. The complex processes involved in collecting and annotating physiological data lead to datasets with a small sample size of subjects. Models trained on such limited data often have the problem of not generalizing well when used on new subjects. In order to obtain more robust emotion recognition models, we aim to overcome the issue of data scarcity by learning a generative distribution from a dataset of physiological data, with the purpose of intelligent data augmentation. To achieve this goal, we use an Auxiliary Conditioned Wasserstein Generative Adversarial Network with Gradient Penalty (AC-WGAN-GP) to generate synthetic data. We study the use of Self-Attention in conjunction with Convolutional layers within the Generator and Critic architectures. The generative process is conditioned by arousal and subject information, which we use to explicitly encode the differences in physiological manifestations between different subjects. Moreover, in the attempt to produce more varied synthetic datasets, we study different approaches to generative sampling. We use the synthetic data to train simple classification models in the task of binary arousal classification, and we show that generative data augmentation significantly improves the models' accuracy in subject-independent testing. Additionally, we show that our subject-conditioning approach increases the quality and diversity of generated samples, while our proposed sampling methods allow us to bridge the differences in physiological manifestation between subjects.


ACKNOWLEDGEMENTS

Many thanks to Abdallah El Ali and Tianyi Zhang for wisely guiding me along the right paths during my research. As many thanks to prof. Worring and prof. You, for making sure this thesis progressed, and was completed, successfully.

It has been a great pleasure and privilege to conduct this research as part of the Distributed and Interactive Systems Group at CWI. I am deeply grateful to the entire team, for all the good advice, and the good memories. Your spirit and motivation have been highly infectious, and made this work more enjoyable than I could have ever expected.

Last, but not least, I want to thank my family, friends, and the universe, without whom all this would not have been possible.


CONTENTS

1 Introduction
1.1 Conceptual Approach Motivation
1.2 Technical Approach Motivation
1.3 Research Question
2 Background & Related Work
2.1 Emotion and Physiology
2.1.1 Emotional Models of Affect
2.1.2 Physiological Signals
2.1.3 Affect Recognition from Physiological Data
2.2 Generative Adversarial Networks
2.2.1 Theoretical Background
2.2.2 Generative Models for Time-Series Data
2.3 Self-Attention
2.3.1 Technical Background
2.3.2 Self-Attention for Time-Series Data
3 Approach
3.1 Problem Statement
3.2 Dataset & Preprocessing
3.2.1 Dataset Selection
3.2.2 Data Preprocessing
3.3 The AC-WGAN-GP Architecture
3.4 The Generator Network
3.4.1 The Temporal Self-Attention Block
3.4.2 The UpRes Layer
3.4.3 Generator Pipeline
3.5 The Critic Network
3.5.1 The DownRes Block
3.5.2 The Critic Pipeline
3.6 Classification Stage
4 Experimental Setup
4.1 The Generative Framework
4.2 The Augmented Classification Framework
4.2.1 Data Augmentation Techniques
4.2.2 Leave-One-Subject-Out Cross Validation
4.2.3 Classification Metrics
4.3 Quantitative Evaluation Metrics
4.3.1 Wasserstein Distance
4.3.2 DTW Distance
4.4 Implementation Details
5 Results and Analysis
5.1 Data Overview
5.2 Ablation Study
5.3 Arousal Classification Experiment
5.3.1 On Data Augmentation Techniques
5.3.2 On The Benefits of Self-Attention
5.3.3 On Generative Conditioning Methods
5.3.4 On Sampling Procedures
5.4 Wasserstein Distance Measure
5.5 DTW Distance
5.6 Qualitative Analysis
6 Conclusion
6.1 Further Work
Bibliography
A Appendix


1 INTRODUCTION

Massive advancements in our ability to collect, store and process large amounts of data have resulted in great strides in our understanding of the world. An important benefit afforded by data-driven research is that it allows us to study those more elusive, subjective aspects of our experience. A prime example of this sort is the study of the manifestation of human emotion.

Emotions can be defined as subjectively observed internal states that are usually accompanied by autonomous bodily reactions [1]. However, further identification and categorization of individual emotional states is a difficult process. Several models have been proposed, which aim to guide the research process by offering clearly delineated dimensions along which subjective states can be mapped. One such model has been of great use, especially due to its applicability in computational models: the circumplex model proposed by James Russell [2]. It aligns emotional states along two axes, valence and arousal. Valence corresponds to the level of pleasantness of an affective state, while arousal corresponds to the level of excitation it generates.

Given any emotion model, there still remains the problem of reliably mapping a specific subjective state to the said model. To this purpose, we turn towards the second element of our definition of the emotional state: the autonomous bodily reactions. Another term for these is physiological signals. Emotional states are accompanied by certain involuntary responses in different parts of the body, such as in the brain, the heart (and the entire circulatory system), or at the skin level. These responses can be measured using physiological sensors, thus giving us an objective window into the realm of emotions. The interrelation between the circumplex emotional model, composed of its valence and arousal dimensions, and these physiological signals, is very well understood [3][4]. This provides a solid basis upon which research can be conducted. Datasets collating physiological data, along with emotional annotation on the circumplex model, are widely available [5][6][7].

In order to take full advantage of the available data, computational methods of learning correlations between physiological data and emotional annotations have recently been used with promising results. Physiological features, such as heart rate and associated statistics, or even more complex information combining multiple physiological modalities, can be extracted from the raw physiological data using manually crafted methods [3]. These features have been processed and used to recognize emotional states using traditional statistical and machine learning methods [8].

Deep Learning techniques afford the researcher the ability to work directly with the raw physiological data. This not only makes the job easier, since it no longer requires an expert's knowledge for processing the physiological data; Deep Learning methods have also revolutionized the computational modelling field, having significantly advanced the state-of-the-art in many topics where large amounts of data are available for processing.


Our work aims at applying Deep Learning methods to the field of affect recognition from physiological sensor data. In our research, we aim to tackle several of the major difficulties which concern this domain; most of these difficulties concern the physiological data itself. Of greatest importance to our work is the lack of sufficient amounts of diverse data: physiological data patterns are highly specific to each individual. If the available data is collected from a limited number of subjects, as is currently the case, models constructed using such data can suffer from low degrees of generalizability. When this happens, we notice that models do not perform well when applied to data coming from new, unseen subjects. To this purpose, we propose the use of Generative Deep Learning approaches, popularized by the Generative Adversarial Network (GAN) of Goodfellow et al. [9]. Instead of learning to discriminate among the data along a specific dimension, generative networks learn to explicitly model the distribution of a dataset. From this generative distribution we can then sample new, unseen data instances. Using intelligent learning and sampling techniques, we can thus increase the size and variety of our datasets. We use this procedure in an attempt to obtain more accurate affect recognition models, which would better generalize when applied to new subjects.

The main contribution of this work lies in the generative framework we describe in chapter 3 and section 4.2. We combine the Wasserstein GAN with Gradient Penalty (WGAN-GP) [10][11] with a double Auxiliary Classifier network (AC-WGAN-GP) [12] in order to introduce conditioning information within the generative framework. As conditioning variables we use the arousal information, which allows us to sample labeled synthetic signals to use in the process of supervised learning; moreover, we condition the AC-WGAN-GP using subject information, with the aim of preserving, and potentially increasing, the variety of generated signals. In order to fully exploit the particularities of our physiological data, we investigate the use of Self-Attention, alongside Convolutional layers, in the Generator networks. To reliably measure the level of generalizability of our approach, we test the entire ensemble in the task of arousal recognition, using a Leave-One-Subject-Out cross-validation strategy [13]; further quantitative and qualitative analysis is performed in order to assess the quality of the generative methods.

The rest of this chapter will delve deeper into the difficulties which we face in dealing with this topic. We offer a motivation for the chosen approach in facing these problems, and we finally state the research questions which define this work. In Chapter 2, we explore the theoretical foundations of our problem; we give an in-depth presentation of emotion and its relation to physiology, as well as provide a theoretical introduction to the main technical building blocks of our research: Attention and Generative Adversarial Networks. We also discuss the current literature on the intersection of these topics. In Chapter 3, we dissect the generative approach which we shall employ in answering our research questions; we present the AC-WGAN-GP architecture and its functioning, and the supervised classification models we use in quantitative testing. Chapter 4 contains our experimental setup. Here, we first detail our selection and handling of the physiological data. We then present the framework of the augmented classification experiment, which forms the bulk of our quantitative analysis, and finally the remainder of the quantitative analysis methods. In Chapter 5, we discuss the outcomes of each individual experiment, and finally provide a qualitative visual analysis of the synthetic signals. Chapter 6 condenses the observed outcomes into a conclusive report in which we answer the research questions, and finally offers a short view of possible research directions opened by our work.

1.1 Conceptual Approach Motivation

This work is situated at the intersection of the fields of Human Computer Interaction (HCI) and Artificial Intelligence (AI). One of the major concerns of this interdisciplinary field is the ability of computers to understand human emotions and act accordingly: Human Emotion Recognition (HER). The ability of computers to objectively and correctly assess emotion could be of high potential use in clinical work with people that are not able to express their emotions explicitly, such as comatose patients, newborn infants, or autistic persons. It can improve communication from machines to users, giving feedback to calm an angry or distracted driver, or providing emotion-tailored recommendations for entertainment purposes. Deep learning based systems could bring efficient solutions for such applications. However, we identify four critical points for obtaining efficient solutions in this domain:

1) The use of non-invasive, wearable sensors, that can be used in the wild;
2) A practical and efficient approach to learning from affect-annotated data;
3) A focus on practices that can generalize well when applied to new subjects;
4) Using intelligent data augmentation methods to improve learning from small annotated datasets.

In the next sections we will explore these aspects in more detail.

Non-Invasive Sensors

First is the difficulty of deploying practical applications for HER purposes due to sensor intrusiveness. This includes the use of video cameras for facial or pose recognition, microphones for speech recording and analysis, or electrodes that capture electroencephalogram (EEG) signals. A rich and successful literature is available for such use-cases; however, the sensors mentioned above are invasive and/or not portable, and so their use is usually restricted to the lab environment. Since technology is ubiquitous, we would like ubiquitous sensor monitoring as well. Hence our focus on those wearable sensors that could be embedded in daily used gadgets and machinery. Such applications could be easily deployed in the wild, and thus be of real practical use.


Annotation Acquisition Difficulties

The data-hungry nature of supervised deep learning based systems requires a massive amount of information to harness their full potential. Collecting such massive amounts of physiological measurements from sensors such as those described above is a trivial task. Reliably annotating these large datasets is, unfortunately, not so trivial [6][14]. Dataset collection protocols devised for such purposes follow a common setup. A number of subjects are subjected to visual and/or auditory cues designed to induce specific emotional responses, such as boredom, amusement or fright. Sensors are connected to the subjects and their physiological signals are tracked. The annotation process, in which subjects are asked to report their subjective emotional state, is usually performed at the end of each video [7][5], or continuously over the course of the experiment [6]. Both techniques bring about their own caveats; what they share in nature is that they both limit the potential size of the datasets: the collection and annotation processes are complex procedures, which require significant time and resources. Moreover, the annotation process itself is highly subjective in nature; it depends on a myriad of factors that current methods are unable to account for, such as the degree of emotional involvement of the user taking part in the experiment, or their ability to correctly and objectively observe their emotional response and translate it to arousal and valence scores. We can understand then why deep learning models trained on such data achieve poor performance when compared against other tasks such as object recognition, or even human activity recognition, which can be more objectively assessed. The issue, we argue, much rather lies in developing intelligent means of using potentially unreliable annotated data, than in using more complex models.

Subject Independence

A key aspect of any research focused on physiological or medical data is that the results can be generalized to other human subjects. As we have seen, the datasets we are concerned with are collected from a small sample of people, typically less than 50. Moreover, physiological signals are highly individual to each of us, which makes it more difficult to learn recognition models that generalize well to new subjects.

We aim to explicitly handle the variance in subjects' physiology in two ways. First is a fair, unbiased evaluation methodology. We have to ensure that our models are trained, validated, and that results are reported based on a strict subject-independent data split. Model selection must also be made without any data leakage involved, so that the final results are completely free of any subject-dependent bias. Secondly, we have to take into account the variance in the data distribution between subjects when devising our approach. We propose an explicit integration of subject information within our learning method, by feeding this information as conditional input to the generative process.


Intelligent Data Augmentation

It is well known that deep recognition models trained on limited data are prone to overfitting, which leads to poor generalization in novel situations. This fact lies at the core of the fourth and last important aspect we aim to tackle in this thesis. A fundamental way to increase the robustness of overfitting systems is to provide them with more training data. But since real data is hard to come by, as we have argued above, we have to resort to intelligent means of expanding existing datasets. The simplest technique involves data augmentation, a process through which we apply random transformations to the input data, with the hard condition that the transformations be invariant with respect to the labels we are trying to model [15]. For example, we might apply rotation, scaling and noise addition to images in order to obtain a more robust object detection model; but if our labels depend on the object's size, using scaling transformations could prove counterproductive.

A more advanced data augmentation method involves the use of Generative models; by learning a probability distribution function over the available labeled data, we can generate new, unseen instances by sampling from the learned distribution. We could then augment the real dataset with a subset of synthetic sensor sequences, which should allow us to train a more robust model. We hold this goal as the main purpose of the work: the generation of sensor data of high authenticity.

In order to generate plausible synthetic biosignals using deep learning methods, and under the limitation of data scarcity, we aim to take full advantage of the characteristics and particularities of our input. If generation of authentic sensor data is what we set out to achieve, learning a powerful representation of our data is how we propose to reach our goal.

1.2 Technical Approach Motivation

Our work is focused on multivariate time-series data, with a structure characterized by long sequences of measurements taken from multiple sensors which track signals that may differ in periodicity, phase or sensitivity. Recurrent architectures such as LSTM [16] models have been used with great success for learning patterns in sequential data. However, such models are known to suffer from vanishing/exploding gradients and training instabilities when sequences become too long, and thus may fail to capture very long-term interdependencies. Moreover, they take significantly longer to train than non-recurrent methods of equivalent size, which, in combination with the already long generative network training time, we consider to be impractical for our purpose.

In addition to the high computational requirements of an LSTM, the output of a recurrent layer is a context vector, typically a one-dimensional vector which describes the entire input sequence; this output may be enough for short-term forecasting or sequence classification, but it may prove difficult to generate a realistic sequence of any significant length from such a lossy summary. These observations have been at the core of the recent rise of the Attention models, albeit in the natural language processing context of sequence to sequence translation [17][18].

The Attention mechanism [18] allows for dynamic relevant context selection; it permits the learning of interrelations between time-steps invariant of distance or (non)periodicity. Moreover, an attention layer outputs a rich representation of a sequence, in the sense that the signal, its temporality and its locality are dynamically condensed. This richer representation may prove to be helpful in the generative stage. We contrast this against RNN methods, which, as we have pointed out, tend to "forget" old information; but also, importantly, against convolutional layers, which operate based on a limited, local spatial window. And since we considered RNNs to be outside our scope for the task of biosignal generation, it is convolutional methods that we shall use as a frame of reference.

Convolutional neural networks (CNN) [19] are a biologically inspired method of learning patterns from data. They rely on learned filters of small size which convolve across the input data to detect patterns. These patterns are then hierarchically combined through the stacking of several such convolutional layers, resulting in the ability to learn complex patterns. This occurs at a relatively low computational cost due to the small-scale reusable filters which are shared across the entire input space in each layer.

Convolutional layers can be seen as a natural precursor for a Self-Attention layer, and can serve as a powerful base for constructing the attention mechanism. A convolutional layer operates locally, individually detecting features at each local spatial patch of input; the size of the patch is defined by the size of the feature filters, usually a number several orders of magnitude smaller than the dimensionality of the input. In contrast, a Self-Attention layer can further combine these individual patches; it can modulate the learned patterns at each individual position by learning a joint representation with the other patterns detected across the input space. Moreover, the use of the positional encoding commonly used in natural language processing (NLP) self-attention literature could provide additional benefits. The positional encoding is an explicit representation of the dimensional structure of the input which is combined with the input signal; this adds a concrete sense of locality to the extracted features, whereas a convolutional layer has to rely on a hierarchical composition of features from which to infer such locality.

Given the reasons above, we propose to investigate the use of Self-Attention in conjunction with convolutional layers for learning a powerful representation of physiological signal segments. A better representation should lead to generating signals of higher authenticity, that better preserve the particularities of the input data and thus are harder to distinguish as synthetic. And most importantly, we aim to achieve a level of synthetic signal quality that would allow for their successful use in training better affect recognition models.


1.3 Research Question

In light of the motives laid out in the previous sections, we formulate our research question as follows: Can we use intelligent data augmentation through generative techniques for the purpose of learning more robust models of affect recognition from multivariate physiological sensor data?

In order to arrive at such "intelligent generative data augmentation techniques", we must take into account the specificity of our data and of our problem. Therefore, to obtain an empirical measure of the "intelligence" of our proposed generative method, we further extend our research question with three important aspects:

(a) Does the use of Self-Attention lead to an improvement in the quality of the generative process?

(b) Can we use subject information as a conditioning variable during the generative process to better preserve data multi-modality, and thus produce more diverse synthetic data?

(c) Does the sampling procedure from the generative distributions affect the quality and variety of the synthetic datasets?

In the next chapter, we will explore the related literature and introduce the technical building blocks of the model design we employ in answering these questions.

2 BACKGROUND & RELATED WORK

In this chapter we shall attempt to provide a foundational understanding of the principal topics with which our work is concerned, and see how current literature approaches these fields. We begin with an overview of the concept of emotions and how they manifest in the human body through physiological responses. We then explore the current approaches to learning emotional aspects from physiological data. The following sections are concerned with introducing the core technical concepts employed in our research. In section 2.2 we present the theoretical building blocks used to evolve the original Generative Adversarial Network [9] into our proposed AC-WGAN-GP architecture; additionally, we take a look at how current literature applies generative methods to time-series data, and physiological data in particular. Finally, in section 2.3, we offer a similar treatment of Attention mechanisms, focusing especially on their application to time-series data.

2.1 Emotion and Physiology

At the basis of what we call subjective emotional states can be found the notion of affect. Russell [20] defines affect as a raw, primitive neuropsychological state that appears before any conscious self-reflective thought. Emotion, on the other hand, results from cognitive processing of a stimulus, such as a thought or an affect [3].

Emotional experience is tightly linked with physiological arousal. The nervous system can be split into the central nervous system (CNS) (the brain and the spinal cord) and the peripheral nervous system (PNS). In the CNS, the area most heavily implicated in handling emotion is the limbic system. Its counterpart in the PNS is the autonomic nervous system (ANS), composed of sensory and motor neurons. The limbic system gives rise to the conscious concept of emotion in the amygdala and categorizes it as either pleasant or unpleasant. Depending on this categorization, it regulates the release of neurochemicals in the brain (e.g. serotonin, dopamine etc.) that directly affect bodily gestures, poses and movement. Moreover, it regulates the activity of the ANS, which autonomously (unconsciously) affects processes such as pulse, blood pressure, sweat or breathing. The signals picked up from these bodily functions are named physiological signals and will constitute the focus of our work. Studies suggest that affect and emotions have "certain physiological fingerprints" [3][21] which should, within certain limits, allow for the recognition of affect from physiological signals.


Figure 1: The circumplex model of emotion. The valence and arousal axes form a continuous 2D emotion space, allowing specific emotions to be mapped and related across the two dimensions. Taken from Valenza et al. [22].

2.1.1 Emotional Models of Affect

Over time, scientists have proposed numerous ways of classifying and differentiating between different types of affect and/or emotion. Their methods fall into two types: categorical and dimensional models. Categorical models propose the division of emotions into distinct categories, such as Ekman and Friesen's popular split [23] into joy, sadness, fear, anger, disgust and surprise. Dimensional models map emotion onto a multidimensional, continuous space. Often used is the circumplex model proposed by Russell [2] (figure 1), a two-dimensional space defined by the arousal and valence axes. Arousal indicates the level of excitement a person feels, from low (calm, tired) to high (excited, alarmed). Valence measures how negative (upset, depressed) or positive (serene) the affective state is. The circumplex model is widely used in the Human Emotion Recognition field, as it can be easily assessed using the Self-Assessment Manikins scale [24]. Moreover, the two-dimensional space can be easily fragmented in diverse ways and used in classification problems.

In this work we will be using the circumplex model as a method to classify affective states. In particular, we focus our study on the arousal axis. Arousal is tightly linked with the stress response, and its correlation with sympathetic nervous system (SNS) activity is well understood [3]. The SNS is primarily involved in the "fight-or-flight" response, which translates into high arousal levels. This allows us to specifically pinpoint those physiological signals that are of interest, and thus narrow our focus. We shall do this in the next section.

2.1.2 Physiological Signals

Physiological responses can be measured and recorded using sensors. In figure 2, the major physiological sensors and their placement can be seen.

Figure 2: Overview of main physiological sensors and their placement. From Shu et al. [4].

Electroencephalography (EEG) measures the ionic current of brain neurons using electrodes placed on the scalp [7][3]. Electrooculography (EOG) measures eye movement using electrodes placed around the eye [3]. Electromyogram (EMG) also uses electrodes placed on specific muscles in order to pick up their electric activity. These sensors are intrusive, as their use during day-to-day activities might cause inconveniences.

There are several physiological signals that can be easily picked up using non-invasive sensors. Electrodermal activity (EDA) sensors measure the electrical characteristics of the skin, i.e. galvanic skin response (GSR), skin conductance etc. EDA is directly controlled by the SNS, and so is particularly sensitive to high arousal states [3][4]. Heart rate and variability can be measured using the electrocardiogram (ECG), which attaches three electrodes to the torso and measures the polarization of the heart tissue during each heartbeat [3]. This is an intrusive technique as well. However, the same signals can be picked up by photoplethysmograph (PPG) sensors, which are attached to the finger and use optical measurements to track the cardiac cycle through the blood vessels. High arousal states are strongly correlated with higher heart rate and lower heart rate variability [4]. These two sensor modalities show the strongest correlation with arousal levels and so will be the focus of our thesis; figure 3 illustrates several instances of ECG and GSR signals, in the form used as input in our work.

2.1.3 Affect Recognition from Physiological Data

The AI revival of the past decade has brought a lot of interest to the field of emotion recognition. As we already mentioned, affect recognition methods have been applied to multiple physical and behavioural cues such as speech [25][26] or facial expressions [27][28]. Nonetheless, using physiological data, either by itself or in combination with speech or facial data, is a well documented approach, which we will explore here.


Figure 3: Examples of 5-second segments of ECG and GSR signals, corresponding to different subjects. Physiological data taken from the CASE dataset [6].

While a large number of studies in the field focus their attention on electroencephalogram (EEG) signals [29], a well documented literature can be found on the topic of affect recognition from what we define as wearable sensors: sensors that can be embedded in daily used gadgets or machinery, and which do not interfere in the day-to-day life of the user. Two wearable sensor signals which are often used in practice are the galvanic skin response (GSR) and the electrocardiogram (ECG) [30][31][32][3]. We identify two distinct approaches to affect recognition from mobile physiological data, which we will elaborate next.

The first approach involves extracting descriptive features from signal segments, which are then used instead of the raw data for classification. The type of features which can be extracted depends on the signals used, and ranges from time domain statistical features (e.g. mean and median values, heart rate and heart rate variability statistics, first derivatives etc.) to very complex features, such as frequency domain statistics, or medical statistics computed from multimodal or non-linear features. Schmidt et al. [3] give an overview of several feature extraction methods, which are well documented in medical-adjacent literature. Extracting such features is the focus of important early work, which applied them in analogously named tasks such as emotion recognition [33][34], valence and arousal recognition [22] or negative emotion detection [35]. However, implementing these methods is a resource-expensive and time-consuming process, which has to be specifically adapted to both the data and the classification models used [4]. Moreover, it requires an expert in the field who is familiar with the interactions between signals, signal features and the emotion space we aim to predict.

The above issues are explored by Martinez et al. [36], in which the authors propose the second approach: the use of Deep Learning methods to automatically construct features from raw signal data. The same study reports that their two-layer 1D CNN model easily beats the state-of-the-art of the time in the task of detecting emotional manifestations of relaxation, anxiety, excitement, and fun. The authors focus on skin conductance and blood volume pulse signals, achieving between 70 and 75% accuracy. Santamaria et al. [8] use a similar 1D CNN model on a dataset of galvanic skin response (GSR) and electrocardiogram (ECG) signals for binary class prediction of arousal and valence. They compare their method against manual feature extraction used in conjunction with common machine learning algorithms (random forest, AdaBoost, KNN etc.) and report a 5 to 10% improvement in arousal (76% accuracy) and valence (75% accuracy) classification performance using their CNN method. Temporal pattern deep learning methods such as LSTMs are also, more rarely, used for this task. Ma et al. [37] use a residual LSTM network for arousal/valence classification from EEG data, reporting very good results of 92% accuracy in both tasks. All of these results however are reported using subject-dependent testing, which, as argued by Saeb et al. [13], leads to over-optimistic results that are less relevant in a real use-case.

While a review of related works in the affect recognition field offers some insight into the problem of emotion and its manifestation through physiological signals, for a strong technical background more suited to our work we must look beyond the field of affect recognition. Our two main interests are, respectively, the use of generative and attention models in sensor data modelling, which we shall investigate next.

2.2 Generative Adversarial Networks

2.2.1 Theoretical Background

From GAN to WGAN-GP

The first formulation of the generative adversarial network (GAN) appears in Goodfellow et al.'s seminal work [9]. It consists of two neural networks. The generator G(z) takes as input the noise vector z and learns to map it to the data space; in effect, it learns a data distribution $P_g$ as an approximation of the real data distribution $P_r$. To achieve this, it is trained in tandem with a second network D(x), the discriminator, which takes as input a data sample and must distinguish whether it comes from the real distribution $P_r$ or the synthetic one $P_g$. The mathematical model of the training process is formulated as a minimax game between the two networks:

$$\min_G \max_D \; \mathbb{E}_{x \sim P_r}[\log(D(x))] + \mathbb{E}_{\hat{x} \sim P_g}[\log(1 - D(\hat{x}))].$$
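To make the two-player objective concrete, the following PyTorch sketch (ours, not part of the thesis) shows one alternating training step; the networks G and D, their optimizers and noise_dim are assumed to exist, and D is assumed to end in a sigmoid so that it outputs probabilities.

import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, x_real, noise_dim=100):
    batch = x_real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: ascend log D(x) + log(1 - D(G(z))).
    z = torch.randn(batch, noise_dim)
    x_fake = G(z).detach()  # block gradients into G during the D update
    loss_D = F.binary_cross_entropy(D(x_real), ones) + \
             F.binary_cross_entropy(D(x_fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: descend log(1 - D(G(z))); in practice implemented
    # as the non-saturating variant, i.e. ascending log D(G(z)).
    z = torch.randn(batch, noise_dim)
    loss_G = F.binary_cross_entropy(D(G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()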

The minimization of this value function is theoretically equivalent to minimizing the Jensen-Shannon divergence between the two distributions $P_r$ and $P_g$. In practice, this method notoriously suffers from training instabilities resulting from poor convergence properties. In [10], the authors argue that this is caused by a discontinuity between the generator's parameters and the divergence function used by the GAN. They propose the use of the Wasserstein distance instead, which is continuous everywhere, as long as the discriminator D(x) is formulated as a compact function. This is accomplished by restricting D(x) to the set of K-Lipschitz functions, i.e. functions which have the absolute value of the first derivative bounded by K (usually 1). We can see how this leads to a smoother loss function that can be more easily navigated using gradient descent, and which should aid with the convergence issues. The authors enforce this constraint by clipping the weights of the discriminator function within a small range, [−0.01, 0.01], a method they admit as being naive. In practice, the rest of the changes involve modifying the discriminator, which is renamed to critic. Instead of discriminating between real and synthetic inputs, the critic now outputs a scalar value; it is the 1-Lipschitz function that maximizes the objective function

$$\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{\hat{x} \sim P_g}[D(\hat{x})].$$
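In code, the change from the GAN sketch above is small: the critic's raw scores replace the cross-entropy terms, and each critic update is followed by the naive weight clipping. A minimal illustration (ours, reusing the hypothetical names from the previous sketch):

# WGAN critic step: minimize E[D(G(z))] - E[D(x_real)],
# i.e. maximize the objective above with respect to D.
z = torch.randn(batch, noise_dim)
loss_C = D(G(z).detach()).mean() - D(x_real).mean()
opt_D.zero_grad(); loss_C.backward(); opt_D.step()

# Naive Lipschitz enforcement from the original WGAN: clip all weights.
for p in D.parameters():
    p.data.clamp_(-0.01, 0.01)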

Under optimal training regimes, minimizing the WGAN value function is equivalent to minimizing the Wasserstein distance between the two distributions $P_r$ and $P_g$. Moreover, the authors show that the value of the loss function correlates with sample quality. This affords the use of the loss function as an additional metric for comparing, under a fixed discriminator, the sample quality of different generator architectures, a method that we will employ in our work.

Finally, Gulrajani et al. [11] provide an improved method of enforcing the 1-Lipschitz constraint on the critic function: they add a penalty term to the loss function. They argue that in an optimal 1-Lipschitz critic, two points from $P_r$ and $P_g$ are connected by a straight line, and along the space traversed by the line, the gradient norm of the critic must be 1; and that by enforcing this constraint, a 1-Lipschitz critic can be obtained. Since enforcing the constraint everywhere at each step is intractable, they sample a random point along this connecting line and apply a penalty if the gradient norm of the critic's output with respect to this point strays from 1. The objective function is then defined as:

$$L = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]$$

where $\hat{x}$ is the sampled interpolated point and λ is a hyperparameter used as a scaling factor for the gradient penalty.
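The penalty term maps directly onto automatic differentiation. The sketch below is our reading of the published procedure, assuming (batch, channels, time) tensors x_real and x_fake of identical shape and a critic D returning one scalar score per sample:

import torch

def gradient_penalty(D, x_real, x_fake, lam=10.0):
    # Sample one random point on the line between each real/fake pair.
    eps = torch.rand(x_real.size(0), 1, 1)  # broadcast over channels and time
    x_hat = (eps * x_real + (1.0 - eps) * x_fake).requires_grad_(True)

    # Gradient of the critic's output with respect to the interpolate.
    d_hat = D(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]

    # Penalize deviations of the gradient norm from 1.
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()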

The authors report results comparable with the state-of-the-art using this improved technique, and a remarkable ease of use regarding hyperparameter setup and convergence properties, which we, as well, found to be the case.

Conditional GANs

Since we require labeled synthetic data for classification purposes, we have to introduce conditional information to guide the learning process. We identify two main families of conditional GANs, namely the cGAN and the AC-GAN. The cGAN [38], or conditional GAN, simply concatenates the label vector y to the inputs of both the generator and discriminator, with no changes to the loss function. The authors report that this simple technique can successfully learn multi-modal data and conditionally generate realistic samples. The AC-GAN, which stands for auxiliary classifier GAN, comes as an overhaul of the simple conditioning of the cGAN. The generator stays the same, taking the noise vector along with the label information y. The discriminator is modified; it takes as input only the real or synthetic data instance, and uses a dual output architecture to produce both a probability distribution over the data sources and a probability distribution over the class labels. The log-likelihood of the correct class is then added to the loss function to be minimized.
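Architecturally, the dual output amounts to a shared feature extractor with two heads, one scoring the data source and one predicting the class. The PyTorch sketch below is our own illustration; the layer sizes, the two-channel input (e.g. ECG and GSR) and the binary class count are assumptions, not the thesis's actual Critic:

import torch.nn as nn

class AuxiliaryCritic(nn.Module):
    def __init__(self, n_channels=2, n_classes=2, hidden=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, hidden, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, hidden, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten())
        self.source_head = nn.Linear(hidden, 1)         # real/fake score
        self.class_head = nn.Linear(hidden, n_classes)  # auxiliary class logits

    def forward(self, x):  # x: (batch, channels, time)
        h = self.features(x)
        return self.source_head(h), self.class_head(h)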

On the fusion of WGAN-GP and AC-GAN

We have presented these conditional generation methods using the vanilla GAN as a basis, since there is no theoretical work on the interaction of conditional GANs and the WGAN-GP architecture. We only managed to find a single paper, with no citations [39], of a more practical nature, which uses the two methods in conjunction with good final results. This work provides us a good basis from which to start our work of merging the two paradigms. However, our implementation is original, and represents one of the principal contributions of this thesis.

2.2.2 Generative Models for Time-Series Data

GANs [9] have been successfully used in the recent past for generating realistic data samples in a number of different applications, the most common of which reside in the field of computer vision [40]. However, their application in the context of temporal data, not to mention signal and sensor modelling, is far less studied. A large section of these studies is concerned with audio synthesis; van den Oord et al. [41] use dilated causal convolutions as the basis of their auto-regressive model, and use it to generate raw speech signals with great success. However, the auto-regressive nature results in long generation times, as predictions are performed for each time-step and used recursively to generate the next predictions. Donahue et al. [42] propose instead to re-purpose the DCGAN architecture [43], a deep 2D convolutional network, to work on speech signals. They replace the 2D convolutions with 1D filters, and use this model to generate entire 1-second audio waveforms.

In the field of multi-variate time-series sensor modelling however, authors claim that these convolutional methods, which were adapted from image generation tasks, are not suitable for fully capturing the temporal properties of their data [44][45]. Instead, their work heavily relies on recurrent architectures.

Esteban et al. [46] used an LSTM network to drive both the generator and the discriminator of their conditional GAN. They feed an independently drawn latent noise sample at each time-step, along with the conditioning label, into the generator; the discriminator outputs a real/fake prediction at each time-step, with the signal's average cross-entropy as a loss. They used their RCGAN to generate synthetic datasets for 4 physiological signals (oxygen saturation, heart rate, respiratory rate and mean arterial pressure). Zec et al. [45] expand Esteban et al.'s approach by using separate generator LSTM networks for the latent noise and conditioning variables and adding some skip connections. They applied their method to the generation of sensor outputs used in autonomous driving, reporting great success. Yoon et al. [47] further develop this architecture by incorporating a parallel embedding/recovery network which maps an input sequence to a latent representation and is trained to encode specific features of a time-series sample. They add this as a reconstruction loss, along with the GAN classification loss, reporting that it adds greater generation quality to the model. In human activity recognition, Wang et al. [48] use a mix of CNN and LSTM models, depending on the presumed difficulty of the task, for generating raw accelerometer data corresponding to three activities, "stay", "walk", and "run". However, their research makes it difficult to compare the two methods, as they were used in different tasks.

Recurrent methods are also used for generating physiological signals. Haradal et al. [49] use LSTMs as generator/discriminator for generating EEG and ECG signals; theirs is the only work aiming at generating such signals that we could find.

Our study proposes a change from the norm, by shifting the comparison from recurrent architectures to attention-based ones, for the purpose of sensor data generation. As this constitutes a novel approach in the signal modelling field, in the next section we review a few works in which self-attention is applied to time-series data for prediction purposes.

2.3 Self-Attention

2.3.1 Technical Background

The mechanism of Attention, specifically self-attention [18], lies at the core of the Transformer model that has recently surpassed recurrent methods in the task of efficiently capturing long-term temporal dependencies. Transformer-based models achieved state-of-the-art results in various tasks, such as question answering and language inference [50][51] or music/long sequence generation [52]. The Transformer discards the sequential processing of temporally linked input common to the RNN/LSTM architecture, instead going over the entire sequence in one pass. By doing so, self-attention can capture dependencies independent of temporal distance within the sequence, at the expense of greater computational requirements.

The first step of a self-attention module is the computation of the Key (K), Query (Q) and Value (V) matrices. In its original formulation, the three matrices are obtained by passing the layer input through three distinct linear layers; however, convolutional layers can also be used for this purpose, if the nature of the data calls for it [53][54]. The three matrices are then used to compute the attention score. There are multiple approaches to doing this; the most commonly used method is the Scaled Dot-Product self-attention [18], which is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V.$$
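As a minimal sketch (ours, with illustrative shapes), the formula above amounts to three batched tensor operations:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, time, d_k) tensors produced by linear or conv layers.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (batch, time, time)
    weights = F.softmax(scores, dim=-1)            # one distribution per time-step
    return weights @ V                             # (batch, time, d_k)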

Two major mechanisms are used to, in theory, improve the capabilities of a self-attention layer: multi-headed attention and positional encoding. In multi-headed self-attention, multiple attention scores are computed in parallel. The Q, K and V matrices are split into distinct matrices corresponding to the number of attention heads, which allows the independent computation of multi-headed attention. Vaswani et al. [18] claim that multi-headed attention allows the model to pay attention to different sets of features in each attention head. Positional encodings were proposed specifically in the context of NLP tasks, where each instance consists of a word embedding. Since the attention mechanism itself learns a non-local representation of its input, Vaswani et al. found that explicitly encoding a temporal series pattern within the sequence of word embeddings helps in modelling the order of the words in a sentence. They proposed the use of a combination of sine and cosine waves of decreasing frequency, which they found to work just as well as more complex learned positional encodings [55]. The benefits of using positional encoding in conjunction with time-series data are however less clear. Therefore, we must investigate this matter in the context of our generative attention model.
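For reference, the fixed sinusoidal encoding of Vaswani et al. can be generated as in the sketch below (ours; it assumes an even d_model). Whether adding such an encoding to physiological signal segments actually helps is precisely what remains to be tested.

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # pe[t, 2i] = sin(t / 10000^(2i / d_model)); pe[t, 2i+1] uses cosine.
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe  # added element-wise to the input sequence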

2.3.2 Self-Attention for Time-Series Data

So far, attention models have mostly been used in NLP tasks. Research into their application to time-series data has only recently appeared. Rajan et al. [56] use multi-headed self-attention fed by a 1D convolutional layer for channel expansion to classify multivariate medical time-series data. Their contribution results in a simplified positional encoding used in conjunction with the self-attention layer. The authors report state-of-the-art results in multiple tasks such as length of stay forecasting or mortality rate prediction. A more interesting approach is discussed in [57], where the authors developed a Cross-Dimensional Self-Attention module for time-series imputation. They expand the self-attention layer to not only operate on the temporal dimension, but also on the channel dimension. This allows them to model relations both across time-steps and across channels, which they argue offers a more precise focus of attention and thus more powerful modelling capabilities. However, this method introduces another quadratic increase in computational complexity, in addition to the quadratic dependency on the sequence length. Another approach is taken in [54], which proposes an adaptation of the self-attention mechanism: the authors use a larger filter width in the convolutional layer acting as input to the self-attention layer. In this way, they claim, the convolutional layer will model (short) temporal patterns in the time-series data instead of relying on individual values at each time-step. They apply this model to the task of time-series forecasting, using different datasets of traffic or electricity consumption/generation measurements.

We note that at the time of writing, Self-Attention has not been used in conjunction with physiological sensor data. In light of its use in the adjacent fields mentioned previously, we believe it is well worth investigating its use in our specific case.

Finally, we mention an important work in the context of this thesis, which serves as a foundational inspiration for constructing our proposed generative architecture. At the intersection of self-attention and generative models lies the SAGAN model [53]. Zhang et al. propose the use of self-attention in conjunction with convolutional layers. They aim to improve image generation through the ability of attention layers to connect information across the entire input space, a limitation when considering pure convolutional layers.


As noted, this model was used for image generation, but its main insights can be adapted to time-series generation as well.

By combining the ideas behind the SAGAN model with a self-attention mechanism specifically adapted to physiological sensor data, we believe that a strong contribution to time-series generation can be obtained.

3 APPROACH

3.1 Problem Statement

The central research question tackled in our work is whether data augmentation through generative means can improve classification performance. For this reason we refrain from conducting an extensive investigation into comparing different generative architectures. Instead, we will be using the Wasserstein GAN with Gradient Penalty as the basic generative framework. We prefer this variant due to its reported stability, good convergence properties and results compared with the traditional GAN architecture [11]. If we can successfully learn and generate realistic signals using this architecture, we believe that the results should translate across to other generative approaches with differences only in magnitude of performance; thus we motivate the sufficiency of our singular generative approach with respect to our research question.

For the study on the benefits of Self-Attention in our context (RQ a), we propose the use of two generative architectures: the temporal self-attention GAN (TAGAN) and the temporal convolutional GAN (TCGAN). The two different architectures are employed in the Generator network, while the Critic uses a common architecture based on the TAGAN model. This should allow us to isolate the performance of self-attention in the Generator and obtain more conclusive answers from our experiments.

The final two research questions which constitute the purpose of this thesis, namely RQ b, the effects of subject-conditioning, and RQ c, the investigation of sampling procedures, will be answered on the basis of the architectural framework we set up in this chapter. We shall further discuss the experimental setup required to study these questions in Chapter 4.

In the next parts, we present a detailed technical exploration of the generative stage. First, however, we shall discuss the dataset we employ in our research, and the processing steps required to arrive at what constitutes the input of our networks.

3.2 Dataset & Preprocessing

As we argued in the motivation section above, the choice and use of data is of great importance for our work. In this section, we will present how our dataset was selected and pre-processed under the guidelines imposed by our research direction.


3.2.1 Dataset Selection

A limited number of datasets containing affect-annotated physiological signals are available in the literature. Of note, we mention here MAHNOB-HCI [7], DEAP [5] and CASE [6], which constituted our initial shortlist. They were selected based on our need for data coming from non-intrusive sensors and labeled on an arousal-valence scale. However, there is a significant distinction which led us to our final choice.

While in DEAP and MAHNOB-HCI the subjects were asked to annotate each video once, after viewing it, in CASE we have continuous annotation taken during the video. The difference is significant, especially in the case where window segmentation is applied to the data and arousal/valence labels have to be assigned to each individual data segment. We consider that this fine-grained approach to annotation allows us to be more confident in the correlation between the data segment and the label we assign to it. Affect, and its manifestation in physiology, have a short-term duration [3]; much shorter, in fact, than the length of the videos used by these experiments. A person can thus experience different levels of arousal during the course of a 1-minute video. To describe the entire video with a single arousal annotation is thus only a summary, which does a poor job at describing affective states from moment to moment. Since at the time of this research CASE was the only such continuously annotated database containing physiological signals, we have selected it as the sole source of data for this thesis.

CASE [6] is a continuously annotated emotion assessment dataset. Physiological measurements were taken from 30 participants who were asked to watch 11 videos corresponding to emotional states labeled as amusing, boring, scary or sad, and to quantify their emotional state on a 2D valence-arousal system in real-time using a joystick. Three of the videos were neutral videos used for sensor calibration; we excluded them from the final dataset as they were noisy and randomly annotated.

Given our focus on non-intrusive sensor tracking methodologies, we restricted our attention only to those measurements provided by such sensors. CASE contains all the common ones: electrocardiogram (ECG), blood volume pulse (BVP), galvanic skin response (GSR), respiration (RSP) and skin temperature (SKT). However, given the exploratory nature of our work, as we shall see in the following sections, we decided to further reduce the amount of information used. First, we reduce the annotation dimension so that we are only concerned with arousal annotations. Secondly, we restrict the number of sensor modalities to two: electrocardiogram (ECG) and galvanic skin response (GSR). These two biosignals, in conjunction with the arousal scale, were selected based on existing neurophysiological research on the interaction between emotion and physiology [4][58][3]; a more in-depth motivation was given in section 2.1. While a thorough study of arousal recognition from physiological signals would consider several combinations of sensor modalities, we consider this outside the scope of this work; if improvements in arousal recognition can be obtained through intelligent data augmentation, we can show this on a single set of manually selected modalities.


To sum up, our data is extracted from the CASE dataset and is composed of ECG and GSR measurements taken from 30 participants over the course of watching 8 videos of duration between 119 and 197 seconds; the data is continuously annotated on an arousal scale with continuous values between 0.5 and 9.5.

3.2.2 Data Preprocessing

Downsampling and Segmentation

We first downsample our data from 1000Hz to 100Hz using an order 8 Chebyshev type I filter from the scipy.signal Python library. This can be done without any loss of information [3][7], and the smaller size of the input data is practically convenient. Next, we segment the signals using a sliding window of length 500, corresponding to 5 seconds of measurement. The choice of window size is highly dependent on the biosignals that are used and on the time lag between the occurrence of emotions and their manifestation in physiology [59][60]. This response delay is insignificant in the case of the ECG signal, but changes in GSR appear with a delay of up to 4 seconds [60], which led to our choice of 5-second segments. We use a stride of 100, or one second, to obtain data segments which partially overlap, a process which not only generates more data instances, but also acts as a source of phase invariance. While two consecutive segments share 80% of their data, they differ greatly in phase, which should in theory increase the robustness of any model trained on such data [61].
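To make this step concrete, the following is a minimal sketch of the downsampling and segmentation procedure, assuming the raw recordings are available as 1-D NumPy arrays at 1000Hz; the function and variable names are illustrative and not taken from the thesis implementation.

```python
import numpy as np
from scipy.signal import decimate

FS_RAW, FS_TARGET = 1000, 100      # Hz
WINDOW, STRIDE = 500, 100          # 5 s windows, 1 s stride at 100 Hz

def downsample(signal: np.ndarray) -> np.ndarray:
    # scipy.signal.decimate applies an order 8 Chebyshev type I filter
    # by default before subsampling by the given factor.
    return decimate(signal, FS_RAW // FS_TARGET)

def segment(signal: np.ndarray) -> np.ndarray:
    # Overlapping sliding windows; consecutive windows share 80% of samples.
    starts = range(0, len(signal) - WINDOW + 1, STRIDE)
    return np.stack([signal[s:s + WINDOW] for s in starts])

# windows = segment(downsample(raw_ecg))   # shape: (num_windows, 500)
```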

Normalization and Outlier Removal

After the first step of the preprocessing stage, the data is normalized and a basic outlier rejection scheme is employed. The targets of the outlier removal are segments which contain abnormal signal measurements, commonly known in ECG studies as baseline drift [62]. The normalization process is applied subject- and modality-independently, and consists of centering the data and applying a modified min-max scaling. Since mean-centred ECG signals are not symmetrical with respect to the axis, and the nature of the non-symmetry is highly individual to each person, we must preserve the non-symmetry. As regular min-max normalization would stretch the signals to fill the desired range, we must adapt the algorithm.

The procedure described in algorithm 1 has the effect of normalizing the instances such that the statistical median of a signal, characterized by its absolute peaks, is bounded in the range [−1, 1]. Naturally, some normalized signals will show peaks exceeding this range; however, we cannot reject all of them as outliers, since some are still within bounds considered normal. By computing the standard deviation of the maximum peaks in each window, and considering all instances showing peaks lying outside 2 standard deviations from the median as outliers, we arrive at an empiric bounding range of [−1.2, 1.2]. We have found that this value works well in practice, rejecting most invalid signals whilst showing a false positive rate close to 0.


Figure 4: Sample of signals which pass the outlier test (left), and signals which do not pass the test (right).

Figure 4 shows a sample of signals which our outlier detection algorithm labels as valid (left column) and invalid (right column). We can see an example of a false negative outcome in the fourth signal of the left column: as the signal is bounded within a "normal" range, it passes the outlier test, although it is a noisy signal.
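A minimal sketch of this rejection rule, assuming the windows have already been normalized by algorithm 1, and using the empirical bound of 1.2 reported above; the helper name is illustrative.

```python
import numpy as np

ABS_BOUND = 1.2  # empirical bound, roughly 2 std deviations from the median peak

def reject_outliers(windows: np.ndarray) -> np.ndarray:
    # `windows` has shape (num_windows, window_length) for one subject/modality.
    # A window is an outlier if its absolute peak exceeds the empirical bound.
    peaks = np.abs(windows).max(axis=1)
    return windows[peaks <= ABS_BOUND]
```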

One observation that we made while devising this procedure is that highly noisy signals such as those shown in figure 4 are highly correlated with very high arousal values. High arousal corresponds to moments of intense emotion, be it amusement or fright, which can be accompanied by sudden body movements. We speculate that such movements led to imperfect contact of the physiological sensors, which is usually the cause of moving baseline noise in ECG signals. Unfortunately, some subjects in the CASE dataset were more prone to this sort of behaviour, as evidenced by large values in the standard deviation of the peaks, computed as described above, and by the larger number of rejected samples. In those cases, our outlier removal is prone to false negatives, meaning it does not reject all denatured signals. Devising a proper method of detecting and fixing this type of noise is outside the scope of our work, and is in itself a current research topic [63][62]. For the purpose of this work, we consider a summary removal of noisy signals to be sufficient, leaving improved baseline drift removal methods for future work.

Instance Labeling

After obtaining our segments, we label them using the average arousal value of the instance, which should offer an accurate estimate of the arousal level of each 5-second window. Next, we perform a discretization of the arousal space, such that each segment is associated with a categorical label y.


Algorithm 1: Data normalization procedure

Data: Windowed multivariate time-series dataset x with instances indexed by i and defined by attributes: subject ID s, sensor modality m and time-step t;

for every subject s and signal modality m do
    for every windowed data instance i do
        max_val_{s,m,i} ← max_t (x_{s,m,i,t})
        min_val_{s,m,i} ← min_t (x_{s,m,i,t})
    end
    median_max_{s,m} ← median_i (max_val_{s,m,i})
    median_min_{s,m} ← median_i (min_val_{s,m,i})
    abs_max_{s,m} ← max{ |median_max_{s,m}|, |median_min_{s,m}| }
    min_x_{s,m} ← −abs_max_{s,m}
    max_x_{s,m} ← abs_max_{s,m}
    for every windowed data instance i and time-step t do
        x_{s,m,i,t} ← 2 · (x_{s,m,i,t} − min_x_{s,m}) / (max_x_{s,m} − min_x_{s,m}) − 1
    end
end
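For concreteness, a NumPy sketch of algorithm 1 could look as follows; the data layout (one array of windows per subject-modality pair) is an assumption made for illustration, and mean-centering is assumed to have been applied beforehand.

```python
import numpy as np

def normalize(windows: np.ndarray) -> np.ndarray:
    # `windows` has shape (num_windows, window_length), already mean-centered.
    # Per-window extrema, then the medians across windows.
    median_max = np.median(windows.max(axis=1))
    median_min = np.median(windows.min(axis=1))
    # Symmetric bound that preserves the signal's asymmetry around zero.
    abs_max = max(abs(median_max), abs(median_min))
    min_x, max_x = -abs_max, abs_max
    # Modified min-max scaling into (roughly) [-1, 1].
    return 2 * (windows - min_x) / (max_x - min_x) - 1

# Applied subject- and modality-independently, e.g.:
# normalized = {key: normalize(w) for key, w in data.items()}
```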

end

As commonly done in related literature [4][3], we split the arousal plane into 2 discrete classes, binning the continuous labels into low arousal (interval [0.5, 4.95]) and high arousal (interval [5.05, 9.5]). Instances within a very close range of the separation value are discarded, so as to remove instances that occur near the start of the experiment, before the subject has given any input concerning their subjective state. The problem is thus reduced to a binary classification task. Moreover, we make available the ID of the subject, encoded as a one-hot vector, for use in the generative stage.
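A sketch of this binning rule, assuming the 0.05 margin around the separation value of 5.0 implied by the bin edges above; the helper name is hypothetical.

```python
import numpy as np

def label_window(arousal: np.ndarray, margin: float = 0.05):
    # `arousal` holds the continuous annotations within one 5-second window.
    mean_arousal = arousal.mean()
    if abs(mean_arousal - 5.0) < margin:
        return None                     # discard near-boundary instances
    return int(mean_arousal > 5.0)      # 0 = low arousal, 1 = high arousal
```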

The output of the preprocessing stage is a tensor X of N instances, each consisting of a sequence of physiological measurements (x_1, ..., x_T), x_t ∈ R^M, where M represents the number of signals and T is the sequence (or window) length. Each instance is associated with its corresponding arousal label and subject ID information.
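Illustratively, the output arrays have the following shapes; N is hypothetical, while T = 500 and M = 2 follow from the preprocessing described above.

```python
import numpy as np

N, T, M = 10000, 500, 2                      # N is a hypothetical instance count
X = np.zeros((N, T, M), dtype=np.float32)    # physiological measurements
y = np.zeros((N,), dtype=np.int64)           # binary arousal labels
s = np.zeros((N, 30), dtype=np.float32)      # one-hot subject IDs
```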

3.3 the ac-wgan-gp architecture

We begin by presenting the overall AC-WGAN-GP architecture. For the purposes of this section, the Generator and the Critic will be considered as black boxes. A schematic representation of the AC-WGAN-GP framework is shown in figure 5.

The implementation of the fusion between the WGAN-GP and the AC-GAN represents one of the main technical contributions of our work. To achieve this, we need to make some specific architectural changes to the WGAN-GP architecture. For the generator input, we concatenate the noise vector z of length 128 with the one-hot encoded arousal and subject class labels. The critic, parameterized by w, takes as input a real or a synthetic data point and outputs 3 values: a scalar Dw:f(x), the critic score corresponding to the 1-Lipschitz function; a probability distribution P(C|x) over the arousal classes C (hereon denoted as Dw:c(x)); and a probability distribution P(S|x) over the subject ID S (hereon denoted as Dw:s(x)).

Figure 5: AC-WGAN-GP main architecture.

The whole network is trained end-to-end, using an objective function that combines the Wasserstein loss with the Gradient Penalty and the two classification losses. The most important changes brought about by the fusion of the two methods lie in the training procedure.
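As a sketch of this conditioning interface, the following assumes a PyTorch implementation (the framework choice and all layer sizes other than the 128-dimensional noise vector are assumptions); the network bodies remain black boxes, as in this section.

```python
import torch
import torch.nn as nn

NOISE_DIM, N_CLASSES, N_SUBJECTS = 128, 2, 29  # 29 subjects under LOSO

class Generator(nn.Module):
    def __init__(self, body: nn.Module):
        super().__init__()
        self.body = body  # black box producing a (batch, T, M) signal

    def forward(self, z, c_onehot, s_onehot):
        # Concatenate noise with one-hot arousal and subject conditioning.
        return self.body(torch.cat([z, c_onehot, s_onehot], dim=1))

class Critic(nn.Module):
    def __init__(self, body: nn.Module, feat_dim: int):
        super().__init__()
        self.body = body                              # shared feature extractor
        self.score = nn.Linear(feat_dim, 1)           # D_{w:f}: critic score
        self.cls_c = nn.Linear(feat_dim, N_CLASSES)   # D_{w:c}: arousal logits
        self.cls_s = nn.Linear(feat_dim, N_SUBJECTS)  # D_{w:s}: subject logits

    def forward(self, x):
        h = self.body(x)
        return self.score(h), self.cls_c(h), self.cls_s(h)
```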

The Training Procedure

The network is trained end-to-end using an algorithm similar to the one proposed in [11], with the necessary changes implied by the addition of the auxiliary classifier conditioning. See algorithm 2 for a full description.

While the Generator and the Critic are trained simultaneously, we train the Critic ncritic times for each training step of the Generator. We use a value of 5 for the ncritic hyperparameter, as suggested in [11], which we found to work well in practice.

For each training step of the Critic, we sample a mini-batch of real data, as well as an equally sized mini-batch of latent variables z, c̃, s̃. Here, as for the rest of this section, the latent conditioning inputs are independently sampled and processed in batches of size 128. z represents the latent noise, which is sampled from a Normal distribution with mean 0 and standard deviation 1. c̃ and s̃ are the two latent conditioning inputs, corresponding to the arousal class and subject ID respectively. Both are constructed by sampling uniformly from the set of possible conditioning values, represented by C and S respectively. In our case, C = {0, 1}, corresponding to the low and high arousal classes, and S = {0, ..., 29}, corresponding to the 30 subjects of the CASE dataset. In practice, as the GANs are trained using the Leave-One-Subject-Out cross validation strategy we detail in section 4.2.2, the subject ID set is reduced to S = {0, ..., 28}, where the ID of the left-out subject is deleted from the set. The two latent conditioning variables are then one-hot encoded and concatenated with the latent noise z, resulting in the latent conditioning inputs we pass to the Generator network to obtain the synthetic signal x̃.

In order to compute the Gradient Penalty term, we then compute x̂ as a random interpolated point along the line connecting the real signal x and the synthetic signal x̃. The location of the point is given by a coefficient ε sampled uniformly from [0, 1].
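A sketch of this interpolation step, under the same PyTorch assumption and assuming batches of shape (m, T, M):

```python
import torch

def interpolate(x, x_tilde):
    # One epsilon ~ U[0, 1] per instance, broadcast over time-steps/modalities.
    eps = torch.rand(x.size(0), 1, 1, device=x.device)
    x_hat = eps * x + (1 - eps) * x_tilde
    # Gradients w.r.t. x_hat are needed for the Gradient Penalty term.
    return x_hat.requires_grad_(True)
```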

We next pass the real, synthetic and interpolated data batches through the Critic.


Algorithm 2: AC-WGAN-GP training procedure

Data: The number of critic iterations per generator iteration ncritic, the gradient penalty coefficient λ, Adam hyperparameters α, β1 and β2, classification coefficients γc and γs, batch size m;
Data: Generator G parameterized by θ and Critic D parameterized by w, with 3 outputs for critic score, arousal class prediction and subject ID prediction, symbolized by Dw:f, Dw:c and Dw:s respectively;
Data: C the set of arousal class labels and S the set of subject IDs, as one-hot vectors;
Hyperparameter values: ncritic = 5, m = 128, λ = 10, γc = 0.75, γs = 0.25, α = 0.0008, β1 = 0.5, β2 = 0.99, train_steps = 200000;

for train_steps do
    for t = 1 to ncritic do
        Sample a batch of real data (x, c, s) ∼ Pr
        Sample a batch of latent variables z ∼ N(0, 1), and conditioning variables c̃ ∼ U{C} and s̃ ∼ U{S}
        Sample a batch of random numbers ε ∼ U[0, 1]
        x̃ ← Gθ(z, c̃, s̃)
        x̂ ← εx + (1 − ε)x̃
        Lcritic ← (1/m) Σ_{i=1}^{m} [ Dw:f(x̃_i) − Dw:f(x_i) + λ (‖∇_{x̂_i} Dw:f(x̂_i)‖₂ − 1)² ]
        Lcls ← −(1/m) Σ_{i=1}^{m} [ γc log P(c_i | Dw:c(x_i)) + γs log P(s_i | Dw:s(x_i)) ]
        Ltotal ← Lcritic + Lcls
        w ← Adam(∇_w Ltotal, w, α, β1, β2)
    end
    Sample a batch of latent variables z ∼ N(0, 1), and conditioning variables c̃ ∼ U{C} and s̃ ∼ U{S}
    x̃ ← Gθ(z, c̃, s̃)
    Lgen ← −(1/m) Σ_{i=1}^{m} [ Dw:f(x̃_i) ]
    Lgcls ← −(1/m) Σ_{i=1}^{m} [ γc log P(c̃_i | Dw:c(x̃_i)) + γs log P(s̃_i | Dw:s(x̃_i)) ]
    Lgen_total ← Lgen + Lgcls
    θ ← Adam(∇_θ Lgen_total, θ, α, β1, β2)
end


This yields the value function outputs Dw:f, along with the arousal and subject predictions Dw:c and Dw:s, for each data type. The value function outputs are used to compute the WGAN-GP Critic loss as

$$\mathcal{L}_{critic} = \frac{1}{m}\sum_{i=1}^{m}\left[ D_{w:f}(\tilde{x}_i) - D_{w:f}(x_i) + \lambda \left( \left\| \nabla_{\hat{x}_i} D_{w:f}(\hat{x}_i) \right\|_2 - 1 \right)^2 \right],$$

where λ is a hyperparameter which weighs the Gradient Penalty term in the WGAN-GP loss. We use λ = 10, as suggested by Gulrajani et al. [11].

The arousal and subject predictions are used to compute the Critic classification loss as

$$\mathcal{L}_{cls} = -\frac{1}{m}\sum_{i=1}^{m}\left[ \gamma_c \log P(c_i \mid D_{w:c}(x_i)) + \gamma_s \log P(s_i \mid D_{w:s}(x_i)) \right],$$

where γc = 0.75 and γs = 0.25 are hyperparameters that control the weight of each classification head. We set these two hyperparameters empirically, such that the associated losses converge at the same rate; this prevents one classification head from overpowering the other, which we observed leads to poorer performance in both the underpowered classification head and the Wasserstein Critic loss. We use a categorical cross-entropy function to compute the two classification log-probabilities. We also use an inverse time decay to reduce the learning rate over time; while the Adam optimizer uses an adaptive learning rate that should make this redundant, we found that, in practice, using a learning rate decay helps reduce variance, leading to more stable late convergence behaviour.

Finally, we compute the total Critic loss as

$$\mathcal{L}_{total} = \mathcal{L}_{critic} + \mathcal{L}_{cls}.$$
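Putting the terms together, a minimal PyTorch sketch of the total Critic loss could look as follows, assuming the three-headed Critic from the earlier sketch and an interpolated batch x_hat that carries gradients, as in the interpolation sketch.

```python
import torch
import torch.nn.functional as F

lambda_gp, gamma_c, gamma_s = 10.0, 0.75, 0.25

def critic_loss(critic, x, x_tilde, x_hat, c, s):
    f_real, c_logits, s_logits = critic(x)
    f_fake, _, _ = critic(x_tilde)
    f_hat, _, _ = critic(x_hat)
    # Gradient Penalty: push gradient norms along the interpolation line to 1.
    grads = torch.autograd.grad(f_hat.sum(), x_hat, create_graph=True)[0]
    gp = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    l_critic = f_fake.mean() - f_real.mean() + lambda_gp * gp
    # Auxiliary classification losses (categorical cross-entropy on real data).
    l_cls = gamma_c * F.cross_entropy(c_logits, c) + gamma_s * F.cross_entropy(s_logits, s)
    return l_critic + l_cls
```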

We use this loss to train the Critic using the Adam optimizer [64], which is commonly used in the literature we studied when devising this approach [11][53]. We use an initial learning rate of α = 0.0008 and set the Adam exponential momentum decay hyperparameters to β1 = 0.5, β2 = 0.99. We arrived at these values by slightly modifying the values used by Zhang et al. [53] for the SAGAN model, after empirical experimentation and observation.
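A sketch of this optimizer configuration under the same PyTorch assumption; the inverse time decay is written as lr(t) = α / (1 + decay_rate · t), where the decay_rate value is illustrative, as its exact value is not specified here.

```python
import torch

# `critic` is the network from the earlier sketch.
opt = torch.optim.Adam(critic.parameters(), lr=0.0008, betas=(0.5, 0.99))
decay_rate = 1e-5  # illustrative value
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda t: 1.0 / (1.0 + decay_rate * t))
# Calling sched.step() after each training step applies the decay factor.
```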

The Generator is trained once after every ncritic = 5 training steps of the Critic. Here, we use an identical procedure to sample the latent conditioning inputs z, c̃, s̃, which we pass to the Generator to obtain the synthetic signals x̃. We pass the synthetic data x̃ through the Critic to obtain the Critic value function Dw:f, along with the arousal and subject predictions Dw:c and Dw:s. The value function output is used to compute the standard WGAN Generator loss:

$$\mathcal{L}_{gen} = -\frac{1}{m}\sum_{i=1}^{m} D_{w:f}(\tilde{x}_i).$$

The arousal and subject predictions are used to compute the Generator classification loss as

$$\mathcal{L}_{gcls} = -\frac{1}{m}\sum_{i=1}^{m}\left[ \gamma_c \log P(\tilde{c}_i \mid D_{w:c}(\tilde{x}_i)) + \gamma_s \log P(\tilde{s}_i \mid D_{w:s}(\tilde{x}_i)) \right],$$

yielding, as in algorithm 2, the total Generator loss Lgen_total = Lgen + Lgcls, which is used to update the Generator parameters θ with the same Adam configuration.
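A sketch of one full Generator update under the same assumptions, reusing the constants and networks from the previous sketches; the 0.75 and 0.25 weights are γc and γs.

```python
import torch
import torch.nn.functional as F

def generator_step(gen, critic, opt_g, m=128, device="cpu"):
    # Sample latent noise and conditioning variables, as for the Critic steps.
    z = torch.randn(m, NOISE_DIM, device=device)
    c = torch.randint(0, N_CLASSES, (m,), device=device)
    s = torch.randint(0, N_SUBJECTS, (m,), device=device)
    x_tilde = gen(z, F.one_hot(c, N_CLASSES).float(), F.one_hot(s, N_SUBJECTS).float())
    f_fake, c_logits, s_logits = critic(x_tilde)
    # L_gen + L_gcls: fool the critic and match the sampled conditioning labels.
    loss = (-f_fake.mean()
            + 0.75 * F.cross_entropy(c_logits, c)
            + 0.25 * F.cross_entropy(s_logits, s))
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()
```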
