
Finding Traces of Neurophenomenology

Predicting self-reported arousal trajectories using LSTM recurrent neural networks on SSD-extracted alpha-components of the EEG signal recorded during virtual roller coaster rides

Research Report

Simon Hofmann | Master Brain & Cognitive Sciences, Cognitive Science Track | Feb 02, 2018

Author: Simon Hofmann | University of Amsterdam | Student ID: 10865284
Supervisor: Dr. Michael Gaebler | Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig | MindBrainBody Institute at the Berlin School of Mind and Brain, Humboldt-Universität zu Berlin
Co-Assessor: Dr. H. Steven Scholte | University of Amsterdam | Department of Brain & Cognition | Spinoza Centre


Abstract

A fundamental question in cognitive neuroscience concerns the relationship between physiological states and subjectively experienced mental phenomena such as emotions. Arousal is a key feature of both subjective experience and bodily activation and prepares an agent to respond behaviourally to events in the environment. The dynamic and non-linear characteristics of arousal are not easily captured in classic laboratory designs, which often employ static and simplified stimuli.

We used virtual reality (VR) to investigate arousal under ecologically valid conditions. 45 subjects experienced virtual roller coaster rides while their neural (EEG) and peripheral physiological (ECG) responses were recorded. Afterwards, they rated their arousal during the rides on a continuous scale while viewing a recording of their experience.

Particularly under naturalistic settings, continuous measurements and the analysis of the resulting multimodal time series of dynamic phenomena such as arousal require new analytical methods. Therefore, we applied a Long Short-Term Memory (LSTM) recurrent neural network (RNN), a deep neural network able to transform high-dimensional data into target output variables by finding statistical invariances and hidden representations in time-sequential input data. The network was trained on alpha-frequency (8-12 Hz) band-passed neural components of the recorded EEG signal that were generated via Spatio-Spectral Decomposition (SSD) or Source Power Comodulation (SPoC). The neural network was able to predict retrospective reports of our subjects above chance level on the unseen validation sets. Thus, the model is able to find meaningful features in the neural signal that reflect the phenomenal experience of the subjects. Furthermore, by extending the neural components with the recorded ECG signal to train the LSTM, we tested whether peripheral physiological responses, here the cardiac information, increase the performance of the model and therefore encode additional information about the subjective experience of arousal.

The study demonstrates that i) VR is a valuable research tool to investigate not just peripheral physiological but also neurophysiological and subjective phenomena, since users can become fully immersed in their experience while experimental control is retained. Moreover, it suggests that ii) LSTM models are a valuable analytical method for more ecologically valid research designs.


CONTENTS

Abstract ... 2
1. Introduction ... 4
1.1. Subjective and physiological components of human ecology ... 4
1.2. In light of complexity: tools from machine learning for time-sequential data ... 6
1.3. Approaching experiences of arousal with a novel research design ... 8
2. Methods ... 8
2.1. Participants ... 8
2.2. Materials ... 9
2.2.1. Questionnaires ... 9
2.2.2. Neural and peripheral physiological recording devices ... 9
2.2.3. Virtual Reality Setup ... 9
2.3. Experimental Design and Procedure ... 11
2.4. Analysis ... 12
2.4.1. Subjective Ratings of Arousal ... 13
2.4.2. Cardiac Responses ... 13
2.4.3. Neurophysiological Data: EEG ... 14
2.4.4. LSTM recurrent neural network ... 16
3. Results ... 21
3.1. Binary classification ... 21
3.2. Continuous prediction ... 28
4. Discussion ... 31
5. Conclusion ... 34
Acknowledgements ... 36
Literature ... 37


1. Introduction

1.1. Subjective and physiological components of human ecology

On a macroscopic and microscopic level, biological systems are inherently complex. Evolutionary forces pose a continuous flux of modulations, which shapes the dynamic relationship between such systems and their physical environment. While being inevitably subject to these dynamics, as humans we have the capacity to reflect upon the contributing processes. In the beginning of the 20th century this reflection was formalized in the scientific discipline of (human) ecology1 (Gross, 2004; Hawley, 1944), which systemically shaped modern psychology (e.g., Gibson, 1979) and cognitive neuroscience (Bruineberg, Kiverstein, & Rietveld, 2016; Friston & Stephan, 2007; Friston, 2015)2. Despite the widespread acknowledgment of the subject, we still lack sufficient empirical approaches to tackle the triadic dynamical relationship between the natural environment, the biological nature of humans and their reflective capability (Gallagher & Zahavi, 2016), i.e. their subjective experience of themselves and their surroundings.

This deficit is mirrored in our vague concept of primary states of an organism such as arousal (Schachter & Singer, 1962). Russell and Feldman Barrett (1999) frame arousal as one of two (orthogonal) core affects, a state of activation in contrast to rest or sleep (see also Duffy, 1957). The second affect dimension is constituted by the concept of valence and describes how pleasant or unpleasant the subjective state is experienced (Duffy, 1957). The functional role of arousal is to prepare an agent to respond to specific environmental events (Cannon, 1929; James, 1884, 1890). Despite this understanding, the link between associated physiological states and phenomenologically accessible mental states remains unclear. This deficit has been attributed to i) the complexity of the non-linear relationship between subjective experience and bodily states – a relationship that is mediated by neural processes (Başar, 2011; Freeman, 1999), and ii) inter-individual variability given the same environmental condition, i.e. stimuli (Barrett, Quigley, Bliss-Moreau, & Aronson, 2004; Blascovich, 1990). Moreover, particularly in the field of cognitive neuroscience, there are major experimental constraints on capturing the multifaceted phenomenon of arousal. Virtually all neuroscientific research tools, whether invasive or not, demand controlled laboratory environments, primarily due to their immobility but also as a result of their sensitivity towards (head) movements of tested subjects. Even though controlling for such confounding factors, i.e. reducing experimental complexity, comes with advantages such as the isolation of putative fundamental neurocognitive mechanisms (Banaji & Crowder, 1989; Rust & Movshon, 2005), it also raises a central question about the ecological validity of the common practice3 (Neisser, 1978, 1982; Olshausen & Field, 2005).

Ecological validity is of great importance to empirically capture the manifold aspects of arousal, since this core affect is a key feature of subjective and bodily preparation for events of the natural environment.

1 In 1866 Ernst H.P.A. Haeckel introduced the term 'ecology' in his book "Generelle Morphologie der Organismen".

2 Despite the conflicting interlude of the era of (radical) behaviourism (Schneider & Morris, 1987; Skinner, 1971; Watson, 1913), the entry of ecological thinking into the mind and brain sciences is best captured in modern embodied cognition theories (Rietveld & Kiverstein, 2014; Varela et al., 1991; Wilson, 2006).

3 With some exceptions, simplified experimental setups have low levels of ecological validity, since they rarely capture the full


However, ecologically valid research designs in neurocognitive science remain a big challenge. It is not surprising that in a major corpus of neuroscientific studies the concept of arousal is often used in terms of mild bodily activations during sleep or relaxed wakefulness (e.g., Barry et al., 2005; Cantero, Atienza, Salas, & Gómez, 1999; Chang et al., 2016; Huang et al., 2015; Sforza, Jouny, & Ibanez, 2000). This understanding is incompatible with its initial definition. However, there are also attempts to probe stronger emotional responses, for example via the depiction of fearful scenes in the form of movie clips or photos (e.g., Williams et al., 2001; Young et al., 2017). Nevertheless, watching clips of exciting or distressing scenes on a 2D display4 while being surrounded by laboratory equipment can only be a first step towards more realistic paradigms.

In contrast, measuring the subjective side5 of arousal has been done by interviewing the experiencer (Petitmengin, 2006; Petitmengin et al., 2007), or in more quantitative designs by using questionnaires (e.g., Leventhal & Niles, 1964) or graphical depictions (e.g., Lang, 1985; Desmet, 2003). Showing video recordings of their previous experience allows subjects to rate their emotional states continuously (e.g., Ickes et al., 1990; Levenson & Ruef, 1992). All these methods have in common that subjects can go through naturalistic experiences while their reports are given retrospectively; thus, these self-assessments do not interfere with the experience itself (cf. Mauss et al., 2005).

Psychophysiological studies of arousal are well suited to test bodily activation via robust peripheral physiological measurements such as of heart rate or skin conductance (e.g., Borkovec & O'Brien, 1977; Kreibig et al., 2007). Consequently, peripheral physiological components are well investigated in natural or at least realistic settings of activation, such as in situations of excitement (e.g., sports: Landers, 1980) or of distress (e.g., Sapolsky et al., 2000)6.

Recently, McCall et al. (2015) merged the research paradigms on physiological and subjective components of arousal in a study employing virtual reality (VR). Participants underwent threatening sceneries within a 3D computer-generated world, while their cardiac activity and skin conductance were recorded. Afterwards, they were asked to rate their emotional arousal during that experience on a continuous scale while watching a video recording of that particular experience. The authors found a correlation between the retrospective ratings and the physiological responses during the experience; in addition, they also demonstrated that VR is a valuable tool for naturalistic research designs. In particular, VR head-mounted displays (HMDs), which found their way onto the market in recent years7, allow for a fully immersive experience, while experimental control is maintained (Wilson & Soranzo, 2015). VR goggles cover a large part of the visual field, while the user can actively look around and explore the presented 3D content, which is adapted to her or his head movements. Therefore, also with regard to neuroscientific studies, such novel VR displays allow for more naturalistic stimulation (Hofmann, Scholte, van Gerven, in prep; Parsons, 2015; Tromp, Peeters, Meyer, & Hagoort, 2017).

4 We should consider the omnipresence of media, since early studies have already demonstrated desensitization towards presented movie contents (Lazarus et al., 1962; Linz et al., 1989).

5 Phenomenologically speaking, we need to distinguish between the experience of the affective state itself (pre-reflective self-consciousness) and its meta-experience (reflective introspection), that is, a reflective process that people are normally able to verbalise in their daily life (Gallagher & Zahavi, 2016; Husserl, 1959; Mayer & Gaschke, 1988).

6 The review also sheds light on the endocrine (hormonal) and social components of negative forms of arousal, here, stress.
7 It is noteworthy that the computational and technological research regarding HMDs has a long history (e.g., Sutherland, 1968;


Taken together, investigating arousal promises to connect neuronal, physiological and phenomenological dimensions of experience. As a core affect, it describes the activation of an agent ready to respond to relevant, i.e. ecologically determinant, events in the environment. Therefore, the importance of ecologically valid research designs becomes salient. Although phenomenological as well as physiological studies of arousal have been conducted in realistic settings (e.g., McCall et al., 2015), the key to a better understanding of the multifaceted phenomenon is to extend novel VR research designs with neuroscientific methods in order to scrutinize the role of the brain in this ecological ensemble.

1.2. In light of complexity: tools from machine learning for time-sequential data

More complex experimental paradigms do not come without obstacles. Particularly under naturalistic settings, continuous measurements and the analysis of resulting multimodal time series of dynamic phenomena such as arousal require new analytical methods. This challenge is even more pronounced for multichannel brain recordings – which might explain why only a handful of studies probe neural processes in experimental settings with VR (e.g., Banaei, Hatami, Yazdanfar, & Gramann, 2017; Snider, Plank, Lee, & Poizner, 20138; Tromp et al., 2017). Yet, recent developments in the field of machine learning promise to bridge the gap between multidimensional data and reports of subjective experience. In particular, deep neural networks are able to transform high-dimensional data into target output variables by finding statistical invariances and hidden representations in the input (Goodfellow et al., 2016; LeCun, Bengio, & Hinton, 2015; Schmidhuber, 2015). When it comes to time-sequential data, Long Short-Term Memory (LSTM) recurrent neural networks (RNNs; Greff et al., 2017; Hochreiter & Schmidhuber, 1995, 1997) are the computational models of choice9. With its in-built nonlinear gating units, the LSTM regulates which information flows in and out of the memory cell (see Figure 1). Consequently, the model is able to find long- as well as short-term dependencies over time.
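The gating mechanism described above can be made concrete with a minimal NumPy sketch of a single LSTM step, using the standard gates (f, i, g, o) as depicted in Figure 1. This is a toy illustration with arbitrary dimensions and random weights, not the network architecture used in the study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the stacked parameters of the
    four gates (f, i, g, o), each of hidden size n."""
    n = h_prev.size
    z = W @ x + U @ h_prev + b          # pre-activations, shape (4n,)
    f = sigmoid(z[0*n:1*n])             # forget gate: what to erase from c
    i = sigmoid(z[1*n:2*n])             # input gate: how much to write
    g = np.tanh(z[2*n:3*n])             # candidate cell content
    o = sigmoid(z[3*n:4*n])             # output gate: what to expose
    c = f * c_prev + i * g              # updated memory cell state
    h = o * np.tanh(c)                  # new hidden state (the cell's output)
    return h, c

# toy dimensions: 4 input features per time step, hidden size 8
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
W = rng.standard_normal((4 * n_hid, n_in)) * 0.1
U = rng.standard_normal((4 * n_hid, n_hid)) * 0.1
b = np.zeros(4 * n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(250):                    # unroll over a toy sequence
    x_t = rng.standard_normal(n_in)
    h, c = lstm_step(x_t, h, c, W, U, b)
```

In a setting like the one described here, x at each time step might hold the extracted neural components, and the hidden states would feed a read-out layer that predicts the target variable.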

LSTMs demonstrate state-of-the-art performance in various domains such as phoneme classification (Graves & Schmidhuber, 2005), speech recognition (Graves, Jaitly, & Mohamed, 2013), image scene labelling (Byeon, Breuel, Raue, & Liwicki, 2015), language modelling and translation10 (Luong et al., 2015; Zaremba, Sutskever, & Vinyals, 2015) and video analysis (Donahue et al., 2016). Crucially, LSTMs have been successfully applied in naturalistic settings for the detection of

8 To the best of our knowledge, Snider et al. (2013) provide the first published study on the successful combination of a VR HMD system with EEG.

9 See also Colah’s blog post for a neat description of LSTMs (Olah, 2015).

10 In the language domain, LSTMs are also used for a wide range of commercial technology such as Google Translate (Wu et al., 2016), Siri by Apple or Amazon's Alexa.

Figure 1: An LSTM cell reads from and writes to its memory c through its internal 'gates' (f, i, g, o). For a more detailed description see Section 2.4.4.


emotions from speech and facial expressions (Wöllmer et al., 2008, 2010), and for continuous predictions in the orthogonal affect dimensions of arousal and valence (Nicolaou, Gunes, & Pantic, 2011; see also Gunes & Schuller, 2013). Recently, Soleymani et al. (2014, 2016) combined these approaches with EEG to continuously detect levels of valence in participants watching various movie scenes. They show that LSTMs outperform classic machine learning models such as support vector regression (SVR) or multi-linear regression (MLR) when trained on multimodal fusion features, in this case a combination of facial features generated from point clouds and power spectral features of different frequency bands in the EEG signal. More generally, there is an increasing interest in exploring the temporal properties of LSTMs when it comes to EEG-based classification and regression tasks such as motor imagery classification (Li, Zhang, Luo, & Yang, 2016; Zhang et al., 2017), micro-sleep detection (Davidson et al., 2005, 2007) or workload estimation (Bashivan et al., 2014, 2016; Hefron, Borghetti, Christensen, & Schubert, 2017).

Using prediction instead of classical statistical hypothesis testing comes with major benefits. Yarkoni and Westfall (2017) recently argued that classical statistical approaches are rarely able to predict unseen data due to their proneness towards overfitting, i.e. approximating the given data too closely. This deficit of generalizability is reflected in the widespread replication crisis (Baker, 2015, 2016; Button et al., 2013; Open Science Collaboration, 2015; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012) – a phenomenon which is accompanied by the problem of p-hacking (Simmons, Nelson, & Simonsohn, 2011) and other controversial research practices (John, Loewenstein, & Prelec, 2012).
Using prediction models in combination with established machine learning routines such as cross-validation or regularization methods for model complexity provides the means for augmented generalization performance; this holds even, or rather specifically, for multivariate data recorded under natural conditions without clear-cut onsets and offsets of stimulus events (Holdgraf et al., 2017; Nishimoto et al., 2011; Yarkoni & Westfall, 2017). Moreover, there are increasing efforts towards the interpretability of machine learning models11. For instance, comparing the predictive performances of models fed with distinctive sets of features of the data allows for a ranking of those features with respect to the explanandum, i.e. the phenomenon to be explained (Yarkoni & Westfall, 2017). Regarding deep learning models, new gradient-based visualisation techniques shed light on what parts of the input data elicit different feature maps, i.e. activation patterns in specific network layers (Erhan, Bengio, Courville, & Vincent, 2009; Samek, Binder, Montavon, Lapuschkin, & Müller, 2017; Yosinski, Clune, Nguyen, Fuchs, & Lipson, 2015).
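The cross-validation routine invoked above can be sketched as follows. The fold construction and the ridge-regression stand-in model are illustrative assumptions, not the study's LSTM pipeline; contiguous (unshuffled) folds are one common way to respect temporal order in time-series data.

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds.
    Contiguous folds preserve temporal order and limit leakage
    between neighbouring time-series samples."""
    fold_sizes = np.full(k, n_samples // k)
    fold_sizes[: n_samples % k] += 1
    start = 0
    for size in fold_sizes:
        val = np.arange(start, start + size)
        train = np.concatenate([np.arange(0, start),
                                np.arange(start + size, n_samples)])
        yield train, val
        start += size

def ridge_fit_predict(X_tr, y_tr, X_va, lam=1.0):
    """Hypothetical stand-in model: closed-form ridge regression.
    lam is the L2 regularization strength limiting model complexity."""
    d = X_tr.shape[1]
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)
    return X_va @ w

# toy data: 200 time points, 10 features, noisy linear target
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(200)

scores = []
for tr, va in kfold_indices(len(y), k=5):
    y_hat = ridge_fit_predict(X[tr], y[tr], X[va])
    scores.append(np.corrcoef(y_hat, y[va])[0, 1])
mean_score = float(np.mean(scores))     # out-of-sample performance estimate
```

Reporting only the held-out fold performance, as in `mean_score`, is what distinguishes a generalization estimate from in-sample model fit.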

In short, novel machine learning approaches bring forth solutions for multidimensional and time-sequential data, while avoiding common methodological flaws of classical explanatory approaches (Holdgraf et al., 2017; Yarkoni & Westfall, 2017). The most recent attempts to use LSTMs on EEG recordings promise to detect the relevant neural features of core affects such as arousal even under naturalistic conditions.

11 There is the common critique of machine learning models, particularly those falling in the category of deep learning, being a


1.3. Approaching experiences of arousal with a novel research design

The major aim of the presented study is to contribute to objectively measuring subjective experience under ecologically valid settings by means of a novel and at the same time complex research design that demands sophisticated machine learning techniques. Ultimately, this should broaden our understanding of the triadic interrelationship of the mind, the body, and its environment.

McCall et al. (2015) have demonstrated that VR is an effective stimulation tool to probe the continuous dynamics between peripheral physiological and subjective aspects of arousal. We extended this approach by recording neurophysiological responses with a 30-channel EEG setup, while participants experienced arousing virtual roller coaster rides. Studies that were conducted under rather artificial conditions suggested an association between a decrease of alpha-frequency band neural oscillations12 (8-12 Hz) and an increase of autonomic measures, specifically peripheral physiological components of arousal (e.g., Cantero et al., 1999; Di Bernardi Luft & Bhattacharya, 2015; Moosmann et al., 2003). Therefore, we used alpha-frequency components of the EEG recordings to train LSTM recurrent neural networks to predict subjective ratings of arousal. If the model is able to predict these levels, it would show i) that the previous findings of arousal-related alpha-oscillations extend to more ecologically valid settings, ii) that memory-recalled emotional experiences encode not just peripheral physiological (bodily) but also neurophysiological information, and iii) on the methodological side, that LSTM recurrent neural networks are an adequate computational model to capture the multifaceted phenomenon of arousal. More generally, the exploratory approach of the presented study is twofold: first, we attempt to push forward the often stated but rarely realized concept of ecological validity as a guiding research principle; and second, we aim to contribute to more data-driven analysis methods by applying advanced machine learning algorithms to accommodate the complexity of the recorded data and to guarantee a higher degree of generalizability.

2. Methods

2.1. Participants

We tested 45 healthy participants (23 women; age 20-32 years, M = 22.73, SD = 3.81), who were recruited via the Online Recruitment System of the Berlin School of Mind and Brain (ORSEE, an adaptation of Greiner, 2015). Requirements for participation were being right-handed, having normal or corrected-to-normal vision (preferably contact lenses), being fluent in German (at least C1 level), having no psychiatric or neurological history in the past ten years, and having less than three hours of (lifetime) experience with VR. Participants were requested to drink no coffee or equivalents one hour before the experiment. The experiment lasted about 2.5 hours, and participants were reimbursed with nine euros per hour. The Ethics Committee of the Institute of Psychology at the Humboldt-Universität zu Berlin approved the research (number 2017-22).

12 Cortical neural oscillations that can be measured via EEG are regarded as being caused by the aligned firing of neural populations (e.g., Wang, 2010) and are commonly attributed to the following frequency ranges: delta (1-4 Hz), theta (4-8 Hz), alpha (8-12 Hz), beta (12-30 Hz), gamma (>30 Hz). Although the functional role of cortical oscillations is still debated (e.g., cross-region communication), they robustly correlate with specific behavioural and cognitive phenomena (Wang, 2010) such as arousal.
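Isolating one of the bands listed above, e.g. alpha (8-12 Hz), amounts to band-pass filtering. A minimal sketch assuming SciPy follows; this generic zero-phase filter illustrates the band selection only and is not the SSD/SPoC decomposition applied in the analysis.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(data, low, high, fs, order=4):
    """Zero-phase Butterworth band-pass filter along the last axis."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, data, axis=-1)

fs = 500                                # EEG sampling rate used in the study
t = np.arange(0, 10, 1 / fs)            # 10 s of toy signal
# toy 'EEG': 10 Hz alpha component plus 2 Hz drift plus 50 Hz line noise
x = np.sin(2*np.pi*10*t) + np.sin(2*np.pi*2*t) + 0.5*np.sin(2*np.pi*50*t)
alpha = bandpass(x, 8, 12, fs)          # keeps essentially the 10 Hz component
```

Applying `filtfilt` (forward and backward filtering) avoids phase distortion, which matters when filtered signals are later aligned with continuous ratings.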


2.2. Materials

2.2.1. Questionnaires

At the beginning of the experiment, participants were asked to answer three questionnaires about arousal-related personality traits and about their experience with virtual reality: German versions of i) the 'Trait' subsection of the State-Trait Anxiety Inventory (STAI-T; Spielberger, 1983, 1989), ii) the sensation seeking subsection of the UPPS (Schmidt, Gay, D'Acremont, & van der Linden, 2008; Whiteside & Lynam, 2001), and iii) a customized version of the Simulator Sickness Questionnaire (SSQ; Bouchard, Robillard, & Renaud, 2007) with items from the nausea and oculomotor subscales. At the end of the experiment, participants were asked to fill out the SSQ again in order to capture VR-induced symptoms and (side) effects (see also: Sharples, Cobb, Moody, & Wilson, 2008). Furthermore, they were requested to answer two questions on 7-point Likert scales addressing the degree of presence (e.g., Heeter, 1992; Robinett, 1992; Steuer, 1992) and valence (Frijda, 1986) they experienced during each of the virtual roller coaster rides.

2.2.2. Neural and peripheral physiological recording devices

Peripheral physiological (electrocardiography, ECG; galvanic skin response, GSR) and neurophysiological (EEG) measurements were synchronously recorded (sampling frequency f_s = 500 Hz) via a multi-channel recording system (fabricated by Brain Products GmbH, Germany). For the EEG recordings, 30 active Ag/AgCl electrodes (actiCAP) were arranged according to the International 10-20 system (Sharbrough et al., 1991); the reference electrode was set at FCz. For electrooculography (EOG), two electrodes were placed below the lower lid and next to the external canthus of the right eye, respectively. EOG measurements allow controlling for eye movements, which are known to interfere with the neural recordings. Impedances of these 32 electrodes were kept below 10 kΩ. The modules of ECG (BIP2AUX adapter) and GSR (BrainVision GSR sensor) were connected to the system via BrainProducts' TriggerBox. Two electrodes for the heart rate measurement were placed on the lowest rib of the left and right side of the thorax, while the corresponding grounding electrode was attached at a central position above the right clavicle (collarbone). The two GSR electrodes were taped on the distal phalanxes of the left ring finger and index finger. All signals were transmitted via Bluetooth from a mobile amplifier (LiveAmp) to the BrainVision Recorder.

2.2.3. Virtual Reality Setup

We used the HTC Vive VR system (HTC, Taiwan; OLED display panels, 2160×1200 resolution, 90 Hz refresh rate, 110° field of view) combined with a set of headphones for the presentation of the virtual roller coaster rides. Critical was the placement of the HMD on top of the EEG and EOG electrodes. To avoid pressure artefacts as well as to increase wearing comfort, small cushions were placed between the straps of the HMD and the subject's head (see Figure 2). Besides the virtual content, the HTC Vive system provides information about the user's movements (via the HTC Lighthouse base stations) and head rotations (via a built-in gyroscope and accelerometer). The data was captured (f_s = 250 Hz) with a customized script available on GitHub (OpenVR-Tracking-Example, Thor, 2016) and contains the position in three-dimensional space as well as the corresponding quaternions, which are a measure of spatial rotation (Hamilton, 1843).
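As an illustration of working with such quaternion data, the rotation between two head orientations can be summarized as a single angle, e.g. to quantify how much the head turned between samples. The (w, x, y, z) component order below is an assumption about the data layout, not a documented property of the tracking script.

```python
import numpy as np

def quat_angle_deg(q1, q2):
    """Angular distance in degrees between two unit quaternions (w, x, y, z).
    Uses |<q1, q2>| because q and -q encode the same rotation."""
    dot = min(1.0, abs(float(np.dot(q1, q2))))
    return float(np.degrees(2.0 * np.arccos(dot)))

identity = np.array([1.0, 0.0, 0.0, 0.0])
# 90° rotation about the vertical (y) axis: w = cos(45°), y = sin(45°)
yaw90 = np.array([np.cos(np.pi / 4), 0.0, np.sin(np.pi / 4), 0.0])
angle = quat_angle_deg(identity, yaw90)   # a quarter turn of the head
```

Summed over successive samples, such angular distances give a simple scalar index of head movement, e.g. to verify compliance in the non-movement condition.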

The roller coaster rides come in a bundle from the SteamVR store (Russian VR Coasters by Funny Twins Games, Russia, 2016). In order to increase the variance of the emotional experience, two roller coaster rides (Space Coaster, Andes Coaster; see Figure 3) were chosen from the bundle, which were classified by an unrepresentative group of volunteers as least and most exciting, respectively. The rationale behind employing virtual roller coasters for the presented study is twofold. First, the intrinsic purpose of roller coasters, whether real or virtual, is to induce arousing experiences13. Second, subjects are seated during the rides. This comes with the benefit of reducing motion-related artefacts without interfering with the nature of the experience itself.

We screen-recorded both roller coaster rides (via OBS Studio, 2017), including the intermediate break (about 30 s), from the ego-perspective of the participant. The resulting 2D videos were the foundation for the retrospective subjective ratings of arousal (see Section 2.3.). These recordings were presented on a 2D virtual screen in the HMD via the view desktop option of SteamVR. Next to the video, a vertical colour bar was shown (implemented via Processing 3.0), ranging from low to high emotional arousal (in 50 steps). While watching the video recording of their experience, participants could continuously manipulate this bar according to their arousal rating with a programmable shuttle wheel in their right hand (PowerMate by Griffin Technology, USA; f_s = 50 Hz).
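A continuous rating stream of this kind can later be aligned with coarser analysis windows by simple binning. The following is a hypothetical sketch assuming non-overlapping one-second bins; the study's actual rating preprocessing (Section 2.4.1) may differ.

```python
import numpy as np

def downsample_ratings(ratings, fs_in=50, bin_s=1.0):
    """Average a continuous rating stream into non-overlapping bins.
    ratings: 1-D array sampled at fs_in Hz; returns one value per bin."""
    samples_per_bin = int(fs_in * bin_s)
    n_bins = len(ratings) // samples_per_bin
    trimmed = ratings[: n_bins * samples_per_bin]
    return trimmed.reshape(n_bins, samples_per_bin).mean(axis=1)

# toy stream: 10 s of ratings on the 1-50 scale, sampled at 50 Hz
rng = np.random.default_rng(2)
stream = np.clip(25 + np.cumsum(rng.standard_normal(500)), 1, 50)
labels = downsample_ratings(stream)     # ten one-second rating labels
```

Averaging within bins, rather than taking single samples, smooths over small jitters of the shuttle wheel.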

13 The major remaining differences between real roller coasters and the virtual versions are the absence of physical forces and the contextual knowledge of participants. However, studies demonstrated that despite the user's conscious knowledge about being in a safe real environment, the simultaneous experience of height in VR can lead to an overwhelming fear of falling (e.g., Seinfeld et al., 2016).

Figure 2: Combined EEG-VR setup.

Figure 3: Space-Coaster (left) and Andes-Coaster (right). Subjects experienced both rides in succession, interrupted by a break of about 30 s.


Finally, all digital procedures were controlled and executed via a unified script written in MATLAB (version R2016a), in order to guarantee synchronized multi-channel data-recordings and a standardized presentation order across all subjects.

2.3. Experimental Design and Procedure

Introduction and questionnaires

The participants were welcomed and informed about the procedure, while remaining uninformed about the goal of the study. After filling in the consent form and the first three questionnaires, the subject was guided to the main seat, where she or he spent the rest of the experiment. The ECG electrodes were attached and the subject was instructed about the subsequent heart beat perception task (HBP; Garfinkel, Seth, Barrett, Suzuki, & Critchley, 2015; Schandry, 1981).

Heart Beat Perception Task

The HBP provides the means to measure the subject's interoceptive accuracy (IC), i.e. the ability to consciously access internal bodily signals. Our procedure is a slightly customized version of the HBP reported in McCall et al. (2015). The main goal of the task is to count one's own heartbeats in varying time intervals of unknown length (45, 15, 55, 35 s), without touching one's limbs or torso or using any other aid. Simultaneously, the cardiac activity is recorded. After each interval, the participant noted the count and indicated how well she or he was able to feel the beats on a 7-point Likert scale (ranging from not at all to very precisely), without receiving feedback about the performance.

Resting State Measurements

Next, all other electrodes (EEG, EOG, GSR) and devices, including the VR HMD, were carefully attached to the subject and activated. This was followed by two five-minute resting state measurements (eyes open and eyes closed, respectively). While having their eyes open, participants saw a neutral wide grid on a dark background in the VR goggles. Wearing the HMD already during the resting state measurements warrants consistency regarding electrode placement and other experimental confounds. Subsequent to the resting state measurements, the experiment approached its main phase.
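The report does not state the HBP scoring formula at this point; a common choice for this task is the Schandry-style accuracy index, sketched below under the assumption that a comparable score is used here.

```python
def interoceptive_accuracy(counted, recorded):
    """Schandry-style heartbeat-perception score, averaged over intervals:
    1 - mean(|recorded - counted| / recorded). A value of 1.0 means the
    subject counted every recorded beat exactly."""
    errors = [abs(r - c) / r for c, r in zip(counted, recorded)]
    return 1.0 - sum(errors) / len(errors)

# hypothetical counts for the four intervals (45, 15, 55, 35 s)
counted = [40, 12, 50, 30]
recorded = [44, 14, 56, 33]             # beats detected in the ECG
score = interoceptive_accuracy(counted, recorded)
```

The recorded beat counts would come from R-peak detection in the simultaneously acquired ECG.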

Virtual Roller Coaster Rides

All subjects experienced both roller coasters two times (two runs) and gave their arousal rating after each run. Beforehand, they were semi-randomly assigned to one of two movement conditions (2×2 factorial design), which determined whether they were allowed to move their head freely during the first run (Condition 1: movement in the first run, but not in the second; n = 21; 11 women) or the second run (Condition 2: no movement in the first run, but in the second; n = 24; 12 women). During the non-movement run, participants were requested to keep their head still, but were encouraged to look around with their eyes. General instructions were to keep the eyes open (except for eye blinks), to avoid clenching one's teeth and to prevent rapid movements of one's arms or legs. Subsequent to these instructions, the Russian VR Coasters application was activated and participants were guided to choose from the application menu the first ride, namely the Space Coaster. Having finished this ride and being back in the menu, subjects centred their view and rested for 30 s (Break). Next, they started the Andes Coaster. Before subjects did the second run, they entered the rating phase.


Rating phase

Subjects were instructed to continuously rate their whole experience, including the break. Watching the 2D-video recordings of their VR experience aided the subjects in recalling the course of their emotional arousal with a higher temporal resolution. Conceptualizing the experience as a whole, rather than framing both roller coasters separately, aims to enlarge the rating space. In other words, the participant could reference arousing and non-arousing moments to each other across both roller coasters and the break. After the ratings and a brief relaxation phase, the participant entered the second run. That is, she or he experienced the two roller coaster rides in the alternative movement condition. Having given the second ratings thereafter, the VR goggles were removed and the subject filled in the final two questionnaires before she or he was eventually freed from all other experimental tools.

2.4. Analysis

The research presented here is part of a larger study designed to examine multiple research questions addressing i) the multifaceted phenomenon of emotional arousal and ii) the VR experience in neuroscientific studies in general14. However, the primary goal of the analysis presented in this report is to explore new analytical tools from machine learning for predicting subjective experience from multichannel time series of neural and cardiac responses. Consequently, at this stage we restricted the analysis to data acquired during the non-movement condition, since head movements are a major source of artefacts in neurophysiological recordings15. If the modelling is successful, the approach can be extended to the more demanding subset of the data in future work. The main objective is to train LSTM recurrent neural networks on neural and physiological (here: cardiac) components of single subjects in order to predict the target variable, i.e. subjective levels of arousal. Training the model with different sets of these components informs us which of them lead to better performance and consequently encode the phenomenal experience of the subject. For this purpose, we implemented two prediction tasks: first, a binary classification task in which the model has to distinguish between states of high and low arousal; and second, a regression task in which the LSTM is trained to continuously predict the ratings.

Dropouts From the initial 45 participants, the experiment was stopped for one subject because of problems with an experimental device. Six other subjects were removed from data analysis due to incomplete datasets. Among those, only one subject stopped the experiment due to motion sickness; another subject's recordings were dismissed because of a violation of the participation requirements discovered in hindsight, leaving N = 38.

14 For instance, the factorial design (two conditions) allows investigating the effect of constraining head movements on the VR experience and/or physiological responses; or the questionnaires could inform how different personality types modulate the dynamic relationship of neural, physiological and subjective responses.

15 On one hand, head movements can lead to physical disruptions such as displacements of the EEG electrodes or pulling of cables. On the other hand, they can cause voltage fluctuations on the participant's skull due to muscle activations. For the young computational modelling approach it is reasonable to avoid these confounds at this stage.


Data preparation All recordings of the main part of the experiment were cut (via BrainVision Analyzer 2.1.1 by BrainProducts GmbH, Germany) according to the time-markers of the two roller coaster rides and the intermediate break. In order to control for artefacts related to the transition phases, we removed 2.5 sec at the beginning and the end of each roller coaster (via a script implemented in Python 3.5.1). This resulted in concatenated data arrays in all recording modalities corresponding to 270 sec (148 sec Space Coaster + 30 sec Break + 92 sec Andes Coaster) of the whole experience. It should be noted that the actual break was longer than 30 sec, depending on individual circumstances. However, in the time-window we refer to as Break, all subjects rested and centred their view before continuing with the Andes Coaster.

2.4.1. Subjective Ratings of Arousal

The continuous ratings of subjective levels of arousal serve as ground-truth for all subsequently introduced analysis methods. In order to compare the ratings to different data modalities and across subjects, we first applied a non-overlapping sliding-window average to downsample the ratings to 1 Hz. Subsequently, similar to McCall et al. (2015), we calculated the z-score of the variable (r_z), i.e. a normalization that centres the input by subtracting the mean and preserves its variance. For the LSTM model, which will be described in Section 2.4.4., the ratings were additionally rescaled such that they lie in the range between -1 and 1 (min-max scaling, r_mm: see Formula 1)16.
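The pipeline of this subsection — the 1-Hz downsampling and z-scoring described above, the min-max rescaling of Formula 1, and the tercile split used for the binary classification task — can be sketched as follows. This is an illustration, not the study's actual code: the raw rating sampling rate is an assumed value, and boundary entries are split by fixed percentiles here, whereas the study assigned them semi-randomly.

```python
import numpy as np

def preprocess_ratings(ratings, sfreq=50, mm_min=-1.0, mm_max=1.0):
    """Sketch of the rating pipeline: non-overlapping window average to
    1 Hz, z-scoring, and min-max rescaling to [mm_min, mm_max].
    `sfreq` (raw rating sampling rate) is an assumed value."""
    n_sec = len(ratings) // sfreq
    r_1hz = ratings[:n_sec * sfreq].reshape(n_sec, sfreq).mean(axis=1)
    r_z = (r_1hz - r_1hz.mean()) / r_1hz.std()           # z-score
    r_mm = (mm_max - mm_min) * (r_1hz - r_1hz.min()) / (
        r_1hz.max() - r_1hz.min()) + mm_min              # Formula 1
    # tercile split for the binary classification task
    low, high = np.percentile(r_1hz, [100 / 3, 200 / 3])
    labels = np.where(r_1hz <= low, -1, np.where(r_1hz >= high, 1, 0))
    return r_z, r_mm, labels
```

With 270 one-second rating values, this yields three classes of roughly 90 samples each, of which the medium class is later discarded.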

!!" = !!",!"#− !!",!"# ⋅ ! − !!"# !!"#− !!"#+ !!",!"# (1) Finally, for the binary classification task of the LSTM model we split the ratings into equally sized classes of low, medium and high arousal (!!"# = !!"#= !!!"!= 90) and labelled all elements accordingly (!!"#→ −1 , !!"#→ 0 , !!!"!→ 1 ). Entries on the tercile boundaries were semi-randomly assigned to one of the adjacent classes. The medium class (!!"#) was then removed. This split of data aims to maximally separate the two target classes in order to aid the model prediction, and consequently, to find the corresponding neural as well physiological components. In sum, we use the z-scored ratings (!!) for comparison to other modalities and between subjects, the rescaled variant (!!") for the regression task of the deep learning model, and the classes of the lower and upper tercile of the rating data (!!"#, !!!"!) for the binary classification task. 2.4.2. Cardiac Responses The analysis of cardiac activity is a central feature of research on emotional arousal. For an accelerated pre-processing, the ECG recordings of the whole experiment per subject were re-concatenated and then imported in Kubios HRV 2.0 (Finland, 2008; Tarvainen, Niskanen, Lipponen, Ranta-aho, & Karjalainen,


2014). Kubios HRV allows for automated and robust R-peak detection. After visual inspection, misclassified or missing R-peaks were corrected (on average 1.23 erroneous and 1.02 corrected R-peaks per subject). The output vector of subject i containing the timestamps of each R-peak was used to derive the RR-intervals, i.e. the time (in seconds) between two successive R-peaks, and consequently to calculate the heart rate (HR) in beats per minute (bpm; HR := 60/RR). Finally, the data was resampled to 1 Hz and re-cropped into the corresponding phases of the experiment.

2.4.3. Neurophysiological Data: EEG

EEG pre-processing Pre-processing of the EEG data was done with the MATLAB package EEGLAB 14.1.0.b (Delorme & Makeig, 2004). All EEG recordings were downsampled to 250 Hz and then imported into the EEGLAB extension PREP pipeline (Bigdely-Shamlo, Mullen, Kothe, Su, & Robbins, 2015). With the objective of reproducible and bias-free pre-processing, the PREP pipeline enables standardized and automated i) high-pass filtering (at 1 Hz) and removal of line noise (at 50 Hz), ii) referencing of the signal with respect to an estimate of the 'true' mean reference, and iii) detection and spherical interpolation of bad channels. The rejection of bad channels follows a standardized set of criteria (for details see Bigdely-Shamlo et al., 2015), and the PREP pipeline provides exhaustive reports of the processing steps for each subject. Channels close to the participants' mastoids (T7, T8, TP9, TP10) had the highest rejection rates, which can most likely be attributed to muscle-related and/or pressure-induced artefacts caused by the VR headset. On average, 2.18 channels were removed per subject (range 0-6). Subsequent to the PREP pipeline, we used the EEGLAB plug-in MARA (Winkler et al., 2014; Winkler, Haufe, & Tangermann, 2011) to automatically remove independent components – obtained by the Infomax ICA transformation (Bell & Sejnowski, 1995) – that were related to blinks and saccades of the eyes, and to movements of the head and the neck (on average 16.22 components per subject; range 8-25).
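Returning briefly to the cardiac data of Section 2.4.2: the derivation of heart rate from R-peak timestamps can be sketched as below. The 1-Hz resampling scheme (previous-interval assignment) is an assumption for illustration; the exact Kubios/analysis procedure may differ.

```python
import numpy as np

def heart_rate(r_peaks, duration=None):
    """Sketch: instantaneous HR (bpm) from R-peak timestamps (seconds),
    resampled to 1 Hz by assigning each 1-s bin the HR of the RR
    interval covering its midpoint (an assumed resampling choice)."""
    r_peaks = np.asarray(r_peaks, dtype=float)
    rr = np.diff(r_peaks)             # RR intervals in seconds
    hr = 60.0 / rr                    # HR := 60 / RR, in bpm
    if duration is None:
        duration = int(np.ceil(r_peaks[-1]))
    t = np.arange(duration) + 0.5     # 1-s bin midpoints
    idx = np.searchsorted(r_peaks[1:], t).clip(max=len(hr) - 1)
    return hr[idx]
```

For R-peaks spaced 1 s apart this returns a constant 60 bpm; peaks every 0.5 s yield 120 bpm.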

Fostering the Good Guys, Neglecting the Bad Guys: Spatio-Spectral Decomposition As mentioned in Section 1.3., a large body of literature indicates a negative correlation between alpha-band oscillations (8-12 Hz) and levels of arousal (e.g., Cantero et al., 1999; Di Bernardi Luft & Bhattacharya, 2015; DiFrancesco et al., 2008; Moosmann et al., 2003). Consequently, we concentrated the search for neural components that code for arousal on the alpha-frequency range. Besides following these previous findings, limiting the search space of neural components eases the computational load for the evaluation of the decoding model described in Section 2.4.4.

A recently introduced linear decomposition method is well suited for the extraction of neuronal oscillations in a frequency range of interest from multi-channel EEG recordings. Spatio-spectral decomposition (SSD) maximizes the signal power in a defined frequency band ('peak', here alpha: 8-12 Hz), while attenuating adjacent frequency bins (here: 6-7 Hz and 13-14 Hz) that are referred to as 'noise' (Nikulin, Nolte, & Curio, 2011). In contrast to traditional ICA decomposition approaches, SSD provides a matrix of neural components ordered by their signal-to-noise ratio (SNR). In other words, while the first components, i.e. columns of the decomposition matrix, contain more information regarding the bandwidth of interest, later components hold less information about that particular range. Analysing SSD neural components


with a high SNR in the alpha-frequency band ('good guys') comes with an additional benefit, namely, reducing the risk of signal contamination by oscillations related to muscular activity ('bad guys'), which normally lie in higher frequency ranges (~20-300 Hz; Muthukumaraswamy, 2013)17. Consequently, if the following analysis reveals an association between these 'good guys' and the reports of subjective experience (target variable), it most likely reflects neuronal processes and not muscular activity. For the presented study, we decomposed the EEG signal via the SSD EEGLAB plug-in.

The LSTM model, described in the following section, was fed with four variants of the SSD components. First, the SSD output was either narrow band-pass filtered in the alpha-frequency range (8-12 Hz), so that it primarily contained the alpha-'peak' information (see above), or it remained in its broadband, spectrally non-filtered state (1-125 Hz)18. Second, the model was trained either on these raw but z-scored SSD components (narrow-/broadband), or on the z-scored alpha-band power of each component, i.e. the squared amplitude of the alpha frequencies in the EEG signal (see Formula 2).

P_alpha(t) = |H_Hilbert(x_alpha(t))|^2 (2)

A common way to obtain the corresponding amplitude is to extract the analytic signal of the component via the Hilbert transformation H_Hilbert (Cohen, 2014). This step was implemented with the Python3 package SciPy (Jones, Oliphant, & Peterson, 2001).
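Formula 2 maps directly onto SciPy's `hilbert` function; a minimal sketch (the function name is illustrative, and the trailing z-scoring mirrors the normalization described in the text):

```python
import numpy as np
from scipy.signal import hilbert

def alpha_power(x):
    """Alpha-band power time course (Formula 2): squared magnitude of the
    analytic signal. `x` is assumed to be an SSD component that is
    already band-pass filtered to 8-12 Hz."""
    analytic = hilbert(x)             # x + i * H(x)
    power = np.abs(analytic) ** 2     # squared amplitude envelope
    # z-score, as done before feeding the components to the LSTM
    return (power - power.mean()) / power.std()
```

For a 10-Hz carrier with slow amplitude modulation, the resulting power time course tracks the squared modulation envelope.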

Showing the Way: the Supervised Decomposition Method Source Power Comodulation Finally, in succession to the SSD, we applied the Source Power Comodulation (SPoC) algorithm (Dähne et al., 2014; Haufe, Dähne, & Nikulin, 2014) to supervise the decomposition of the EEG signal with respect to the target variable (subjective ratings). SPoC maximizes the correlation or decorrelation, i.e. it optimizes the comodulation, between the target variable and the power time course of the spatially filtered neural signal. The algorithm assumes neural data that is band-pass filtered in the frequency range of interest (here alpha). The SSD narrow-band components are ideal for this purpose due to their contrastive frequency ‘peak’ (see Haufe et al., 2014a). SPoC was implemented via the SPoC EEGLAB plug-in.

Training LSTM recurrent neural networks on SPoC-filtered data aims to obtain an estimate of how well the decoding model should be able to predict the target variable from alpha-range neural components. In other words, training models on the supervised SPoC components sets a (performance) benchmark proxy for decoding models trained only on unsupervised SSD-transformed neural data. Note that no systematic comparison between the four above-mentioned SSD variants and SPoC was conducted, since approximating established decomposition algorithms with the applied neural network was not the focus of the presented research19. More specifically, all transformations of EEG data were treated as network hyperparameters among many others, as we will see in the following section.

17 It should be noted that an absolute avoidance of leakage from higher frequencies into lower bands is not guaranteed. However, the effect only becomes significant at frequencies of 20 Hz or higher (Muthukumaraswamy, 2013).

18 The broadband SSD data, too, retains the emphasis at the alpha frequency range ('peak').

19 A systematic comparison is not straightforward, since it is to be expected that each (input) data type requires different


2.4.4. LSTM recurrent neural network

In recent years, neuroscientific research has made major leaps forward in decoding neural information by means of deep learning models (e.g., Agrawal et al., 2014; Khaligh-Razavi & Kriegeskorte, 2014; see also Hofmann et al., in prep). By employing Long Short-Term Memory recurrent neural networks (Hochreiter & Schmidhuber, 1995, 1997), we can extend these successful approaches to the non-stationary neural time series of EEG recordings (e.g., Bashivan et al., 2014). LSTMs can store and control relevant information over time and are therefore able to find adjacent as well as distant patterns in time-sequential data20. Box 1 offers a brief introduction to the basic mechanisms of an LSTM unit. The models used for the presented research were implemented in the Python-based deep learning library TensorFlow 1.4.1. (Google Inc., USA; Abadi et al., 2015; Zaremba et al., 2015). The code of the prediction models is open-source and available on GitHub21.

As mentioned in the introduction of Section 2.4., two prediction tasks were implemented: i) a binary classification task of high versus low arousal, whose dichotomous targets aid the extraction of corresponding neural and physiological features, and ii) a regression task in which the ratings are predicted continuously. The latter is expected to be more demanding due to its finer time-serial resolution. The configurations of the corresponding models differ only in the way the target variable, i.e. the subjective ratings, is fed into the model, as described in Section 2.4.1.

Before we could train an LSTM model for each subject separately, the best hyperparameter (HP) settings for each task had to be found. Finding the right HPs for a deep learning model, such as the model architecture or its weight-update behaviour and regularization, comes with heavy computational costs. Therefore, it is necessary to restrict the search space. We applied a random search strategy (Bergstra & Bengio, 2012) to find optimal model settings22 for a random subset of ten subjects (40 different combinations of hyperparameter settings). Then, the two best HP settings per subject were taken and applied to the datasets of all subjects. Importantly, these final sets of HPs were trained separately for both the SSD and the SPoC data. As mentioned in the previous section, training the LSTMs on the SPoC components approximates a benchmark for the SSD-trained models, since SPoC is an established

the (debatable) premise that similar performances entail that the model approximates a decomposition step (e.g., SPoC) within its computations.

20 Despite some propitious adaptations of the LSTM cell in recent years (e.g., Neil, Pfeiffer, & Shih-Chii Liu, 2016), we employed the vanilla, i.e. classical, LSTM cell since it has turned out to be the best model choice for a wide range of tasks (Greff et al., 2017).

21https://github.com/SHEscher/NeVRo

22 Random search eases the computational load by restricting the sets of hyperparameters to a specific number of randomly drawn combinations. Especially for higher-dimensional search spaces (here: our hyperparameter space), the results are often better than those of a classic grid search (Bergstra & Bengio, 2012).

Figure 4 Different depths of the network (varying between 2 and 4 layers) were tested during the hyperparameter search. There was at least one LSTM and one FC layer in each setup.


supervised decomposition algorithm that is guided by the comodulation of target variable and EEG signal, which is the relationship we aim to find.
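The random search over HP combinations described above can be sketched as follows. This is a minimal illustration under assumptions: the dictionary keys are hypothetical names, and the exact learning-rate candidates are reconstructed, not taken from the study's code.

```python
import random

# Hypothetical HP space mirroring the ranges reported in the text;
# keys and the learning-rate set are illustrative assumptions.
HP_SPACE = {
    "lstm_sizes": [(s,) for s in (10, 15, 20, 25, 30, 40, 50, 65, 80, 100)],
    "learning_rate": [1e-2, 1e-3, 1e-4, 5e-4],
    "regularizer": ["l1", "l2"],
    "reg_strength": [0.0, 0.18, 0.36, 0.72, 1.44],
    "activation": ["relu", "elu"],
    "hilbert_power": [True, False],
    "band_pass": [True, False],
    "n_components": list(range(1, 11)),
    "add_hr": [True, False],
}

def random_search(n_draws=40, seed=42):
    """Draw n_draws random HP combinations (Bergstra & Bengio, 2012)."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in HP_SPACE.items()}
            for _ in range(n_draws)]
```

Each of the 40 draws is then trained and evaluated, and the best-performing settings are carried over to all subjects.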

Hyperparameters Similar to Hefron et al. (2017), our model architecture was constrained to one or two LSTM layers followed by one or two fully connected layers (FC)23 (see Figure 4). Each layer size s_l varied between 10 and 100 nodes24 (s_l ∈ {10, 15, 20, 25, 30, 40, 50, 65, 80, 100}), and a successive layer had to be equal or smaller in size. The output of each layer was squashed through either rectified linear units (ReLU) or exponential linear units (ELU), which both allow for steeper learning curves in contrast to conventional activation functions such as the sigmoid nonlinearity (Clevert, Unterthiner, & Hochreiter, 2016). The output of the last network layer was fed into a tangens hyperbolicus (tanh), which ranges between -1 and 1. As mentioned in Section 2.4.1., the ratings were normalized to the same range for the regression task, or, in the case of the binary classification task, they were assigned to one of the two classes (low, high) and labelled with -1 or 1, respectively. This allowed for a direct comparison between the model prediction and the target variable. We applied a mean-squared error loss (MSE) to calculate the difference between this prediction and the ground-truth25. Moreover, different regularization methods (L1, L2) with various regularization strengths (λ ∈ {0.0, 0.18, 0.36, 0.72, 1.44})26 were tested in order to control and penalize overly large model weights. Similar to Bashivan et al. (2016), the weights were updated according to the loss using the Adam optimizer (learning rate lr ∈ {1e-2, 1e-3, 1e-4, 5e-4}) because of its fast convergence (Kingma & Ba, 2015; see also Ruder, 2017)27.

As mentioned in the previous section, the number of neural components (N_comp ∈ {1, …, 10}) and their transformation (narrow- or broadband, Hilbert-extracted alpha-band power) were treated as HPs. The specific N_comp components were either randomly chosen or sorted from 1 to N_comp. Another hyperparameter option was to feed in the best-correlating component. In the case of SPoC data, this was the first column of the filter matrix, while for SSD components we calculated the cross-correlation28 between the subject's rating and each column of the decomposition matrix. Moreover, the cardiac responses (see Section 2.4.2.), here the heart-rate information of each subject, were either concatenated to the neural input component(s) or left out29. In other words, the random selection of HPs (random search) determined not only, among others, the learning rate or weight regularization, but also how the input was modulated before being fed into the LSTM.

Training procedure Consequently, the final dataset per subject was a three-dimensional tensor of size 270 (sec) × 250 (Hz) × N_comp(+1), the optional +1 being the appended HR channel. Following the k-fold cross-validation approach, a common solution when training data is sparse (Bishop, 2006), the dataset of a subject was randomly split into ten folds

23 The FC output is defined as y_FC = φ(Wx + b), where each node of the FC receives all input x through weighted connections W, b denotes the added bias term, and φ represents the activation function of choice.

24 In the case of an LSTM unit, the number of nodes or neurons equals the size of the hidden state h_t.
25 In the case of the binary classification, the ground-truth was either -1 (low arousal) or 1 (high arousal).
26 A regularization strength of zero equals applying no regularization.

27 The paper is also available as a blog post: http://ruder.io/optimizing-gradient-descent/index.html#minibatchgradientdescent
28 Sliding window with a lag of 10.


(k = 10) of equal size. Nine of the folds served as training dataset (243 × 250 × N_comp(+1)), while one was kept out for validation (27 × 250 × N_comp(+1)) of the trained model. This was repeated k times, and for each trial the model was reinitialized30. That is, at each iteration, the training of the model started from scratch. Finally, averaging the prediction accuracies across the k validation sets leads to a robust and bias-free performance value.
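The 10-fold split can be sketched as below (a hypothetical helper, assuming 270 one-second samples per subject):

```python
import numpy as np

def kfold_indices(n_samples=270, k=10, seed=0):
    """Sketch of the random k-fold split: yields (train_idx, val_idx)
    pairs; with 270 samples and k=10, each validation fold holds 27
    one-second samples and each training set 243."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, val
```

Averaging the validation accuracy over the ten resulting models gives the per-subject performance value reported in the Results.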

After some initial test runs and visual observation of the convergence behaviour, the number of training iterations was set to 30. That is, the model ran 30 times through the whole training dataset. At each training step, the model was fed with slices of the selected neural, and if applicable physiological, components corresponding to 1 sec of the experience. These were randomly drawn from the training set. Consequently, the time-dependent information the LSTM model was able to extract from the input components was limited to 1 sec. This choice was a trade-off between a preferably high number of samples from the sparse datasets and the sample length, which is necessary to extract neural trends and corresponding changes in the phenomenal experience31. These samples were fed into the LSTM in random mini-batches of size 9 (bs = 9), since training on batches allows for faster and more robust feature learning (Ruder, 2017), leading to an input tensor of size 9 × 250 × N_comp(+1) at each training step. TensorFlow automatically deals with input data coming in batches. To better understand what the model reads out at a training step, let us ignore the mini-batching for now and consider the case of a single sample (bs = 1). Here, the LSTM cell successively slides over 250 data arrays of components (x_t=0, x_t=1, …, x_t=249), computing at each step t its hidden state h_t (see Box 1). Only at time-step t = 249 do we take the LSTM output (h_t) and feed it into the FC, i.e. the next layer of the

30 Since the hyperparameter search and the final model performances reported here were conducted independently, no nested cross-validation was employed. Also, random batching made this step redundant.

31 It should be noted that training a prediction model across all subjects could overcome this limitation and allow for longer samples (> 1 sec). However, this was beyond the scope of the presented research.

Figure 5 Each sample corresponds to 1 sec of data. Since we have a sampling frequency of 250 Hz, the LSTM cell slides over 250 data arrays of neural, and cardiac (if applicable), components. Only the last LSTM output h_t is fed into successive layers and leads to the prediction of the model.


network. Thus, only after the LSTM has seen the whole data stream corresponding to one second do we take its output into consideration for the final prediction of the target variable (rating; see Figure 5).
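The random drawing of 1-sec mini-batches from the training folds might look like this (an illustrative sketch; array and parameter names are assumptions):

```python
import numpy as np

def draw_batch(data, train_idx, batch_size=9, rng=None):
    """Draw a random mini-batch of 1-s samples from the training folds.
    `data` is assumed to have shape (270, 250, n_components[+1]);
    the result has shape (batch_size, 250, n_components[+1])."""
    rng = rng or np.random.default_rng()
    pick = rng.choice(train_idx, size=batch_size, replace=False)
    return data[pick]
```

Each drawn sample is one second (250 time-steps) of input, over which the LSTM slides before its final hidden state is passed on.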


Box 1 Here, we provide a brief description of the functional constitution of an LSTM unit that receives time-serial input data. For a graphical depiction of the corresponding processes see Figure 6. The internal operations and the output of an LSTM unit are described by the equations in Formula 3.

(i, f, o, g) = (σ, σ, σ, tanh)(W · [h_{t-1}; x_t])
c_t = f ⊙ c_{t-1} + i ⊙ g
h_t = o ⊙ tanh(c_t) (3)

At each time-step t, the LSTM receives an external input vector x_t and its previous hidden state h_{t-1}. By taking the dot product between this stacked input and the weight matrix W32, and by applying the respective nonlinearities (sigmoid function σ, tangens hyperbolicus tanh), the four gating units (i, f, o, g) are computed. While the forget gate f controls how much is deleted from the previous memory cell state c_{t-1}, the input gate i regulates how much is written into the new state c_t. The gate g modulates how the new input flows into the cell. Lastly, the output gate o dictates how much of the current LSTM state is emitted. How do the gates control this information flow? The functionality is more easily understood if we consider the extremes of the corresponding nonlinearities. The sigmoid function σ ranges between 0 and 1 (σ: ℝ → (0, 1)); hence the gates i, f, and o serve as forms of binary on/off switches. For instance, in the ideal case, the forget gate f decides for each block of the memory cell whether it is kept (1) or deleted (0). Each block is reached via the element-wise multiplication denoted as ⊙. The tangens hyperbolicus tanh ranges between -1 and 1 (tanh: ℝ → (-1, 1)). Consequently, after the input gate i switches on, the gate g regulates whether the new input is added to or subtracted from the memory. We can think of these modulations of the memory cell as up- and down-scalable counters. The LSTM output, that is the hidden state h_t, is based on the memory cell state c_t and the output gate o; the modulations follow the same principles as just described. The vector h_t can be read out externally at each time-step or, for instance, be picked up only at the end of a sequence at time-step T33. The output vectors can be fed into subsequent network layers, such as another LSTM or a fully connected layer, or they can directly be interpreted as the model prediction.

In comparison to other RNN types, the LSTM performs very efficient and fast weight updates due to its numerical properties, copes better with long input sequences, and has fewer issues with exploding or vanishing gradients (for details see Hochreiter & Schmidhuber, 1997). For more introductory information on the mechanisms of LSTMs, see Colah's blog post "Understanding LSTM Networks" (Olah, 2015).
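As a minimal numerical sketch of Formula 3, a single bias-free vanilla LSTM step in NumPy (illustrative only; the gate ordering inside W is our convention, and the study's actual models were built in TensorFlow):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step per Formula 3. W stacks the weights of the four
    gates (shape: 4*n_hidden x (n_hidden + n_input)); biases are
    omitted for brevity, and the i/f/o/g slicing order is assumed."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t])   # stacked pre-activations
    i = sigmoid(z[:n])                      # input gate
    f = sigmoid(z[n:2 * n])                 # forget gate
    o = sigmoid(z[2 * n:3 * n])             # output gate
    g = np.tanh(z[3 * n:])                  # candidate input
    c_t = f * c_prev + i * g                # new memory cell state
    h_t = o * np.tanh(c_t)                  # new hidden state
    return h_t, c_t
```

With all weights at zero, every sigmoid gate sits at 0.5 and g at 0, so the memory cell simply halves, illustrating the scalable-counter behaviour described above.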

32 For notational convenience, we summarize the weight matrices of the single gates in W; hence the weight matrix W is a concatenated version of W_i, W_f, W_o, and W_g.

33 The different RNN read-outs are well described in the blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" (Karpathy, 2015).

Figure 6 An LSTM RNN fed with a time-sequential input vector X calculates for each time-step a hidden state h_t that is fed into the network at the successive step (recurrent connection) and can simultaneously function as the network's output. Furthermore, the LSTM has a memory cell c, which can be modulated as a function of the network's external input and previous state. Consequently, this type of RNN can detect even long-term dependencies and non-stationary patterns in time-serial input data.


3. Results

For both tasks, binary classification (Section 3.1.) and continuous prediction (Section 3.2.), we compared the performances of the SSD-trained models with those trained on the SPoC components. As described in the previous section, after an initial broad random search of HPs on a subset of ten random subjects, we narrowed the search space for the whole dataset down to the two best HP settings per subject of this subset. This led to 18 (out of 20) unique HP sets for the classification task, and ten (out of 20) unique sets for the regression. Later, we discuss why some of the HP sets differ per subject and what can be derived from the best settings. For an illustration of the rating variables, i.e. the ground-truths, of all subjects see Figure 7.

3.1. Binary classification

SSD-trained LSTMs The average prediction accuracy on the validation sets over all subjects and the 18 tested HP sets was .634 (range .514-.816, SD = .068)34, and was significantly above chance level

34 The accuracy ranges from 0 to 1, where 0 depicts the worst prediction and 1 represents perfect performance. In the case of the binary classification task, .5 corresponds to chance-level accuracy.

Figure 7 Ratings of all subjects (mean rating in black). For illustration, the two classification bins of low and high arousal of the mean rating are coloured in purple and red, respectively. For the actual training process, each subject’s rating trajectory was individually split into the according bins. Vertical (dotted) lines indicate the beginning of the different phases (Space Coaster, Break, Andes Coaster) and arousing events during the VR rollercoasters.


(permutation test with 3000 permutations, p < .0004). In order to test whether the selection of the best performance out of the 18 applied HP sets per subject was biased, we ran a permutation test with 3000 biased selections35.
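The permutation test against a best-of-18 chance baseline can be approximated as follows. This is a simplified sketch in Python (the study used the R package perm), drawing independent binary guesses per sample and ignoring the per-fold structure described in footnote 35; the function name and defaults are illustrative.

```python
import numpy as np

def best_of_k_null(n_samples=270, k=18, n_perm=3000, observed=0.634, seed=1):
    """Sketch of the bias check: null distribution of the *best* of k
    chance-level accuracies (random binary guesses over n_samples).
    Returns the permutation p-value for the observed mean accuracy."""
    rng = np.random.default_rng(seed)
    best = np.empty(n_perm)
    for p in range(n_perm):
        accs = rng.integers(0, 2, size=(k, n_samples)).mean(axis=1)
        best[p] = accs.max()
    # p-value: fraction of permutations reaching the observed accuracy
    return (best >= observed).mean()
```

Even after selecting the best of 18 chance-level runs, an average accuracy of .634 over 270 samples remains far outside the null distribution, consistent with the reported p < .0004.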

To explore which HPs played a crucial role for the outcome of the prediction, the best HP settings across all subjects are listed in Table 1.

Table 1 Six best HP sets for SSD-trained LSTM models. LSTM size/FC size: layers (comma-separated) and their number of hidden units/nodes. Note that the first FC layer always has the same size as the previous LSTM layer. weight regul.: applied weight regularizer. regul. strength: regularization strength (λ). activat. function: between-layer activation function. Hilbert power: alpha-band power transformation. band pass: band-pass filtered input signal. SSD comps: SSD input components, i.e. column(s) of the SSD filter matrix; xcorr_best represents the best cross-correlating component with respect to the target variable (see Section 2.4.4.). HR: heart-rate input component. Ø val acc: average accuracy on the validation sets across subjects. SD: corresponding standard deviation. All listed HP sets are significantly above chance level (*: permutation test, 3000 permutations, p < .0004).

All best settings were significantly above chance level. The results in Table 1 suggest that neither the cardiac information (HR) nor the precise transformation of the EEG input (Hilbert power, band pass) was pivotal for the performance outcome. The same goes for the weight regularization (weight regul.) and its corresponding strength λ (regul. strength). In contrast, the size of the LSTM layers (LSTM size) seemed to have an effect on the validation prediction. All best-performing models had layer sizes equal to or smaller than 30 (max s_l = 100, see Section 2.4.4.). Inspection of networks with wider layers showed that these often tend to overfit (high training accuracy, low validation accuracy). Concerning the depth of the network, a single FC layer seems to be advantageous with respect to model performance. As an illustration, the validation and training predictions of the model built on the dataset with the best performance across all subjects are depicted in Figure 8 (SSD, Subject 23, Ø_val = .816). The corresponding learning progress can be found in Figure 9. The worst of the best model performances across all subjects is shown in Figure 10 (SSD, Subject 34, Ø_val = .514).

35 That is, the best out of 18 average accuracies over N × 270 samples randomly drawn from the binary set {0, 1}. Likewise, the LSTM model receives 270 samples from the validation sets across all 10 folds, which constitute the model's final prediction accuracy. The permutation test was implemented via the R package perm (Version 1.0, Fay & Shaw, 2010). Note that the maximal precision of the probability estimate is given by the inverse of the number of permutations (Legendre & Legendre, 1998, p. 25).

subject | LSTM size | FC size | learning rate | weight regul. | regul. strength | activat. function | Hilbert power | band pass | SSD comps | HR | Ø val acc | SD
all | 30 | 30 | 5E-04 | L2 | 0.36 | ELU | False | True | 2, 4 | False | .592* | .071
all | 20 | 20 | 1E-03 | L2 | 1.44 | ELU | False | True | 2, 3, 5 | True | .585* | .07
all | 30, 10 | 10 | 5E-04 | L1 | 0.72 | ELU | False | False | xcorr_best | False | .585* | .08
all | 25 | 25 | 5E-04 | L2 | 0.18 | ReLU | True | False | 1, 2 | True | .584* | .09
all | 10, 10 | 10 | 1E-03 | L2 | 0.36 | ReLU | False | False | 1, 2 | False | .583* | .073


Irrespective of the arrangement of the performances (best across subjects, best across HP settings), the results demonstrate that the LSTM models are able to predict retrospectively reported subjective levels of emotional arousal from neurophysiological data that was recorded during the VR experience.

Figure 8 10-fold validation performance over the time-course of the VR experience (270 sec) for the best SSD dataset (Subject 23, Ø_val = .816). Samples in green (training set) and cyan (validation set) are correctly classified; samples in red (training set) and orange-red (validation set) are misclassified. Continuous (grey line) and dichotomized (black squares) ground-truth rating of Subject 23. As described in Section 2.4.1., only rating data falling into the upper (high arousal, top squares) and lower tercile (low arousal, bottom squares) are considered for the classification task.


Figure 9 Average performance and learning progress of the best SSD dataset (Subject 23, Ø val acc = .816). Top row: average training accuracy. Second row: concatenated validation accuracy over all folds. Bottom rows: training progress over 810 iterations. For both the accuracy and the loss (bottom), we recognize an early convergence of the training procedure around 200 training iterations. Both trajectories (validation, training) stay on top of each other, which suggests that the model is neither overfitting nor underfitting the data of Subject 23 (Rank 5 HP set, see Table 1).
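The over-/underfitting diagnosis in Figures 9 and 10 rests on comparing the training and validation loss trajectories. As a minimal illustration of that check (function name, window size, and the toy trajectories are assumptions of this sketch, not the study's code):

```python
import numpy as np

def loss_divergence(train_loss, val_loss, window=50):
    """Mean gap between validation and training loss over the last `window`
    iterations. A gap near zero suggests neither over- nor underfitting
    (as for Subject 23); a large positive gap suggests overfitting
    (as for Subject 34)."""
    return float(np.mean(val_loss[-window:]) - np.mean(train_loss[-window:]))

# toy trajectories over 810 training iterations
it = np.arange(810)
train = np.exp(-it / 200)          # training loss converging towards 0
val_ok = train + 0.01              # validation tracking training closely
val_bad = 0.5 + 0.3 * (it / 810)   # validation loss drifting upwards
```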


Figure 10 Average performance and learning progress of the worst SSD dataset (Subject 34, Ø val acc = .514). The model tends to overfit (see, e.g., the high training accuracy, the low validation accuracy, and the diverging loss trajectories of validation and training set). The large first LSTM layer in this setup (LSTM size 100, 30; Rank 14) supports the conclusion drawn from other observations, namely that wider networks tend to overfit. Inspecting the worst HP-set performance on the SSD dataset of Subject 34 (Rank 4; Ø val acc significantly below chance level, permutation test with 3000 permutations) indicates that the corresponding model does find a meaningful representation in the data but updates its weights in the wrong direction.
