Deep Physiological Arousal Detection in a Driving Simulator

Master Thesis

by

Aaqib Saeed

1651854

MSc Computer Science

Data Science and Smart Services

Enschede, August 2017

Graduation Committee:

Dr.Ir. Maurice van Keulen, University of Twente

Dr. Stojan Trajanovski, Philips Research

Prof. Dr. Jan B.F. van Erp, University of Twente

Dr. A.K. Ramakrishnan, University of Twente

Acknowledgments

I am thankful to many people for their support, contribution and trust which made two years of master studies very exciting, and an experience in itself. I would like to thank my advisers Stojan Trajanovski, Maurice van Keulen and Jan van Erp for their guidance, critical feedback, thought-provoking discussions and above all giving me freedom to pursue my own ideas. Their unfailing help not just enabled me to get a better understanding of applied Machine Learning but helped me develop a research mindset. I also want to thank my internship supervisors (in Fit2Perform project) Adrienne Heinrich and Victor Kallen for sparking my interest in biometrics data analytics.

I owe special thanks to my friends back home in Pakistan, Nouman Farooq, Inam Akbar and Mudasir Ali, for all their support and motivation. I am also very grateful to my teachers from undergraduate studies, especially M. Qasim Pasta, Faraz Zaidi and Husain Parvez, for their encouragement and for being an inspiration.

But most importantly, I am very thankful to my parents for all their love, sacrifices and countless efforts in providing new opportunities to me and my siblings for a better future.

A special word of gratitude to my fiancée, Zaharah A. Bukhsh, for always being there for me and being my emotional support in making this journey wonderful.

Abstract

Driving is an activity that requires considerable alertness. Insufficient attention, imperfect perception, inadequate information processing, and sub-optimal arousal are possible causes of poor human performance. Understanding these causes and implementing effective remedies is of key importance for increasing traffic safety and improving drivers' well-being.

For this purpose, we used deep learning algorithms to detect the arousal level (under-aroused, normal or over-aroused) of professional truck drivers in a simulated environment.

The physiological signals were collected from 11 participants using wrist-worn wearable devices. We present a cost-effective ground-truth generation scheme for arousal based on a subjective measure of sleepiness and the score of stress stimuli. On this dataset, we evaluated a range of deep neural network models for representation learning as an alternative to handcrafted feature extraction. Our results show that a 7-layer convolutional neural network trained on raw physiological signals (such as heart rate, skin conductance and skin temperature) outperforms a baseline neural network and denoising autoencoder models, with a weighted F-score of 0.82 vs. 0.75 and a Kappa of 0.64 vs. 0.53, respectively. The proposed convolutional model not only improves the overall results but also enhances the detection rate for every driver in the dataset, as determined by leave-one-subject-out cross-validation.

Contents

Acknowledgments
Abstract
Contents
List of Figures
List of Tables
List of Acronyms

1 Introduction

2 Background Study and Related Work
2.1 Preliminaries
2.2 Recognizing Physiological Arousal
2.2.1 Fatigue
2.2.2 Stress
2.3 Deep Learning for Sequence Classification
2.4 Analysis of Existing Approaches

3 Data and Methodology
3.1 Experimental Protocol
3.2 Data Collection and Analysis
3.3 Arousal Ground Truth Annotation
3.4 Pre-processing and Segmentation

4 Deep Neural Networks for Arousal Classification
4.1 Problem Definition
4.2 Neural Network and Denoising Autoencoder
4.3 Convolutional Neural Network
4.4 Recurrent Neural Network
4.5 Hybrid Models
4.6 Training Method
4.7 Tackling Data Imbalance
4.7.1 Threshold-Moving
4.7.2 Over-Sampling
4.7.3 Weighted Categorical Cross Entropy

5 Experiments and Discussion
5.1 Experimental Setup
5.1.1 Dataset Preparation
5.1.2 Synthetic Data Generation
5.1.3 Evaluation Strategy
5.1.4 Performance Measures
5.1.5 Implementation
5.1.6 Baselines
5.2 Results
5.2.1 Validation of Baseline
5.2.2 Convolutional and Recurrent Neural Networks
5.2.3 Evaluation of Hybrid Models
5.2.4 Effect of techniques to solve Data Imbalance
5.2.5 Application
5.2.6 Summary
5.3 Discussion

6 Conclusion and Future Work

Appendix
A1 Hyperparameters
A2 Oversampled classes per driver using SMOTE
A3 Cost matrices for Threshold-Moving method
A4 Illustration of ground truth unavailability

Bibliography

List of Figures

1.1 Illustration of Yerkes-Dodson law (Coughlin et al., 2009).
1.2 Data sources for driver state detection (Coughlin et al., 2009).
2.1 Illustration of Locus Coeruleus in the Brain Stem.
3.1 Sequence and duration of events in a simulator study.
3.2 Mean heart rate by arousal level.
3.3 Mean skin conductance by arousal level.
3.4 Mean skin temperature by arousal level.
3.5 Derivation of arousal ground truth from stress and KSS scores.
3.6 Class label distribution by drivers.
3.7 Overall class label distribution.
4.1 Baseline neural network architecture.
4.2 Baseline architecture of denoising autoencoder for unsupervised pre-training with fully connected neural network.
4.3 Convolutional neural network architecture.
4.4 GRU based recurrent neural network architecture.
4.5 Hybrid architectures.
4.6 Illustration of z-normalization.
4.7 Effect of feature normalization on error surface.
4.8 Illustration of class overlap and identified Tomek links.
5.1 Class label distribution by drivers after applying SMOTE.
5.2 Overall class label distribution after applying SMOTE.
5.3 Illustration of leave-one-subject-out cross validation.
5.4 Neural Network - LOOCV and Test set confusion matrix.
5.5 Denoising Autoencoder - LOOCV and Test set confusion matrix.
5.6 Convolutional Neural Network - Training and Validation Loss.
5.7 Convolutional Neural Network - LOOCV and Test set confusion matrix.
5.8 Out of sample performance evaluation of Convolutional Neural Network.
5.9 t-SNE Visualization - Last layer features from CNN model.
5.10 GRU - Recurrent Neural Network - LOOCV and Test set confusion matrix.
5.11 Illustration of physiological parameters after applying SMOTE and Tomek Links.
5.12 Oversampling using SMOTE - LOOCV and Test set confusion matrix.
5.13 Threshold Moving - LOOCV and Test set confusion matrix.
5.14 Weighted Cross Entropy - LOOCV and Test set confusion matrix.
5.15 Android prototype application screenshots.
5.16 Summarized results of all the classifiers for optimal window size of 60 seconds.
A4 Illustration of ground truth unavailability labels for one of the arousal classes.

List of Tables

3.1 The Karolinska Sleepiness Scale
3.2 Stress Scores
4.1 Evaluated design choices in Convolutional Neural Network.
4.2 Sample Cost Matrix
5.1 Baseline - Neural Network Results
5.2 Evaluation of non-linear activation functions with Neural Network.
5.3 Baseline - Denoising Autoencoder Results
5.4 Convolutional Neural Network Results.
5.5 Impact of Architectural Design Choices on Convolutional Neural Network Results.
5.6 GRU - Recurrent Neural Network Results
5.7 Hybrid Model A Results
5.8 Hybrid Model B Results
5.9 Oversampling using SMOTE Results.
5.10 Optimal Cost Matrix
5.11 Threshold Moving Results.
5.12 Weighted Cross Entropy Results
A1 Hyperparameter Configuration.
A2 Oversampled Classes by Driver.
A3 Cost Matrices

List of Acronyms

FNN Feed-Forward Neural Network
DAE Denoising Autoencoder
CNN Convolutional Neural Network
RNN Recurrent Neural Network
GRU Gated Recurrent Unit
LSTM Long Short-Term Memory
ELU Exponential Linear Unit
ReLU Rectified Linear Unit
SGD Stochastic Gradient Descent
WCE Weighted Categorical Cross Entropy
PCA Principal Component Analysis
SVM Support Vector Machine
RBF Radial Basis Function
HMM Hidden Markov Models
SMOTE Synthetic Minority Over-sampling Technique
LOOCV Leave-One-subject-Out Cross-Validation
IID Independent and Identically Distributed
PPG Photoplethysmogram
ANS Autonomic Nervous System
SNS Sympathetic Nervous System
PNS Parasympathetic Nervous System
HRV Heart Rate Variability
ECG Electrocardiography
EEG Electroencephalography
GSR Galvanic Skin Response
IBI Inter-beat Interval
HR Heart Rate
SC Skin Conductance
ST Skin Temperature
EHR Electronic Health Records
HAR Human Activity Recognition
KSS Karolinska Sleepiness Scale

Chapter 1

Introduction

Driving is a complex task involving several motor and cognitive abilities. Inadequate human performance is a major cause of injuries in road traffic accidents. Imperfect perception, insufficient attention, inadequate information processing, and sub-optimal arousal have all been mentioned as possible causes for poor human performance. For instance, driver drowsiness or fatigue caused by extended hours of driving as well as situations of cognitive overload can significantly impair a driver’s ability to react appropriately to relevant events (Coughlin et al., 2009; Vicente et al., 2011). Understanding of these causes and the implementation of effective remedies is of key importance to increase traffic safety and driver’s well-being.

The physiological arousal level can be described as the available capacity to perform a task in a timely and effective manner. The potential threat of both under-arousal and over-arousal is reflected in many human performance models (see van Erp et al. (2010) for an overview). The more complex models take at least the relationship between task demands, workload, effort and performance into account, where workload is considered a multidimensional, multifaceted concept that is difficult to define and quantify using a single representative measure (Gopher and Donchin, 1986). However, in the context of driving, a simple model consisting of a single dimension, here referred to as arousal, may suffice.

Therefore, the ability to detect a driver's state in a timely manner provides an opportunity to help the driver move towards an optimal state of arousal, making driving safer and eventually contributing to improved well-being.

Over a century ago, Yerkes and Dodson (1908) established a law stating that the relationship between performance and level of arousal has an inverted U-shape (see Figure 1.1). If physiological signals reliably reflect a possible threat of under-arousal or over-arousal before a decline in driving performance becomes noticeable, they may form the basis for an effective remedy. The rapid development of wearable physiological sensors over the past years makes the development of effective solutions more feasible than ever before. Moreover, a model built on raw signals collected from non-invasive wearable devices (such as a Photoplethysmogram [PPG] sensor for heart rate) is more feasible to use in everyday situations than one that relies on Electroencephalography (EEG) or Electrocardiography (ECG).

[Figure 1.1: plot of performance (weak to strong) against arousal (low to high). Labelled regions: fatigue and sleepiness at low arousal; optimal arousal in the middle; strong stress, anxiety and impaired performance at high arousal.]

Figure 1.1: Illustration of Yerkes-Dodson law (Coughlin et al., 2009). The graph depicts the relationship between arousal and performance defined by the Yerkes-Dodson law. Both extremes (yellow and red shaded regions) represent areas of concern, where performance diminishes due to either fatigue or high stress. The middle green section is the desired zone, showing the optimal arousal level required to complete the task efficiently.

In earlier research, significant work has been done to detect stress and fatigue using machine learning and signal processing methods, on both simulator and on-road datasets (see Sharma and Gedeon (2012)). The majority of this work addresses either stress or drowsiness detection, but not both. An exception is Coughlin et al. (2009), who have done significant work on physiological arousal detection. They showed that a driver's state can be detected using overt and covert measures such as driving style, vehicle performance, environmental conditions, and emotional and physiological signals (see Figure 1.2). Furthermore, they found that changes in physiological signals can be seen before any noticeable difference becomes apparent in driving behaviour. The collection and fusion of heterogeneous data sources has the potential to improve model performance, but it can be quite challenging because it requires sophisticated sensors in the vehicle. In contrast, physiological signals from consumer wearable devices can be collected easily, and solutions built around them are more feasible for real-life usage.

The techniques proposed in the literature for driver state detection using traditional machine learning algorithms mainly rely on hand-crafted features to classify physiological signal segments, e.g. as either stressed or fatigued. The process of manual feature engineering is cumbersome and requires extensive domain knowledge. Furthermore, the generated features are not guaranteed to be optimally discriminative for the task at hand and hence require feature selection or dimensionality reduction techniques, such as principal component analysis. However, several recent works have shown that better performance can be achieved when feature extraction is performed jointly with model training in an end-to-end fashion. For instance, Sutskever et al. (2014) proposed an end-to-end approach to sequence learning that extracts discriminative features for a machine translation task. The recent success of deep learning in extracting discriminative features from raw data has led to state-of-the-art results in several domains, such as speech recognition (Xiong et al., 2016), clinical diagnosis (Lipton et al., 2015b) and image classification (Krizhevsky et al., 2012).

[Figure 1.2: diagram centred on "Detecting Driver State", surrounded by the data-source groups Driving Style, Driving Behavior, Vehicle Performance, Visual Attention, Emotion, Environment and Biometrics. The Biometrics group lists heart rate, brain waves, skin conductance, skin temperature and respiration.]

Figure 1.2: Data sources for driver state detection (Coughlin et al., 2009). The figure illustrates a variety of data sources that can be utilized for driver state detection. The brown circle under the gray shaded patch (Biometrics) indicates the focus of this study. We propose to use solely physiological signals, instead of complex driving behavior and environmental data, for arousal detection.

Deep learning has the potential to have a significant impact on problems involving multivariate time series datasets. It can substitute manually designed feature extraction procedures and automatically learn variations in the signal. In particular, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have achieved state-of-the-art results on problems involving time series data (Xiong et al., 2016; Lipton et al., 2015b). RNNs are powerful in discovering dependencies and nonlinear dynamics in sequential data. A variant of the RNN called Long Short-Term Memory (LSTM) captures long-term dependencies well using a gating mechanism (Hochreiter and Schmidhuber, 1997).

Similarly, CNNs exploit local connectivity between neurons, which allows the extraction of effective local salience representations from raw time series data (Hammerla et al., 2016). Other sequence modelling techniques, such as Hidden Markov Models (HMM) and Kalman filters, are not suitable for learning long-range dependencies in time series and become computationally expensive for large datasets (Lipton et al., 2015b). Likewise, Bengio and Delalleau (2011) showed that deep neural networks can perform representation learning without compromising the accuracy of the obtained models.
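To make the idea of local connectivity concrete, the following minimal sketch applies a single one-dimensional filter to a short signal. It is an illustration only and not the architecture evaluated in this thesis; the function name `conv1d_valid`, the filter weights and the sample values are assumptions chosen for readability.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` without padding and return the responses.

    Each output value depends only on len(kernel) neighbouring samples,
    which is the local connectivity a convolutional layer relies on.
    As in most deep learning libraries, the kernel is not flipped
    (strictly speaking this is cross-correlation).
    """
    k = len(kernel)
    if k == 0 or k > len(signal):
        raise ValueError("kernel must be non-empty and no longer than the signal")
    return [
        sum(w * x for w, x in zip(kernel, signal[i:i + k]))
        for i in range(len(signal) - k + 1)
    ]


# Hypothetical heart-rate samples (beats per minute) with a sudden rise.
heart_rate = [62, 62, 63, 62, 75, 76, 76, 75]
# Hypothetical difference filter that responds to local increases.
edge_filter = [-1, 0, 1]
responses = conv1d_valid(heart_rate, edge_filter)
print(responses)
```

The largest responses appear around the rise in the signal. In a trained network the filter weights would be learned from data rather than fixed by hand.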

We find that in experimental studies on sleepiness detection during driving, participants exhibit different levels of drowsiness, e.g. some tend to be sleepier than others. Likewise, the duration of different trials within a study also varies: some tests last only a few minutes, whereas others can take an hour. Hence, the data collected during each trial can differ in size and, when combined for an end goal (e.g. arousal detection from data collected during stress and sleepiness trials), can result in a disparate class distribution. It has been found that supervised classification techniques tend to work well with reasonably balanced datasets but perform poorly when the dataset is imbalanced. Moreover, in many real-world applications the cost of misclassification is high, e.g. mistakenly classifying a sleepy driver as normal can have severe consequences. Class imbalance is recognized as a fundamental problem in machine learning, and various techniques have been proposed for learning with skewed datasets (see He and Garcia, 2009). Keeping this in mind, we also propose to empirically explore widely used techniques for handling imbalanced datasets with deep neural networks, such as over-sampling, threshold-moving and using a cost-sensitive loss function.
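As a small illustration of the cost-sensitive idea, the sketch below derives per-class weights that are inversely proportional to class frequency, so that errors on rare classes count more in a weighted loss. The weighting formula (number of samples divided by number of classes times class count) is a common heuristic and an assumption here, not necessarily the scheme used later in this thesis; the label names and counts are invented for the example.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    A perfectly balanced dataset gives every class a weight of 1.0;
    under-represented classes receive weights above 1.0.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


# Hypothetical, deliberately skewed label distribution.
labels = ["normal"] * 6 + ["under-aroused"] * 3 + ["over-aroused"] * 1
weights = inverse_frequency_weights(labels)
for cls, w in sorted(weights.items()):
    print(f"{cls}: {w:.3f}")
```

Such weights would typically multiply the per-sample loss during training, which leaves the data untouched, unlike over-sampling.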

We identified three major limitations in driver state detection systems: a) stress and drowsiness are mostly treated independently, b) hand-crafted feature extraction procedures are applied and c) models are learned from complex heterogeneous data sources, making realistic usage challenging. To alleviate these issues, we focus on exploring deep learning techniques for classifying the level of arousal (i.e. under-aroused, optimal and over-aroused) in a simulated driving environment. We hypothesize that learning deep models in an end-to-end fashion can eliminate the need for ad-hoc feature extraction and selection approaches for arousal detection. In addition, a model built on raw signals collected from non-invasive wearable devices is more feasible to use in everyday situations. Another reason for using physiological signals rather than driving behaviour and vehicle data is that they have been found to be more indicative of the driver's state (Mehler et al., 2009).

The main contributions of this thesis consist of the following:

• Tackling stress and drowsiness together as a problem of arousal detection, using only physiological data collected from wearable devices.

• A ground truth generation scheme for physiological arousal that combines a self-assessment questionnaire of sleepiness with the score of task-induced stimuli from a stressful task.

• Applying various deep learning techniques to find a robust architecture for arousal classification.

• Empirically exploring the effect of techniques to resolve data imbalance with a deep neural network.

The rest of the thesis is structured as follows: Chapter 2 presents the background and related studies for stress and fatigue detection in professional drivers and outlines a review of the deep learning methods for supervised sequential learning. The experimental setting for data collection, arousal ground truth generation, and signal segmentation is provided in Chapter 3. The arousal classification problem formulation, explanation of widely used deep neural networks and model architectures are presented in Chapter 4. Subsequently, the results of performed experiments and discussion are provided in Chapter 5. The conclusion of this thesis is provided in Chapter 6 along with the limitations of the current work and the direction of future research work.

Chapter 2

Background Study and Related Work

In this chapter, we provide background information on measuring physiological parameters to determine mental state. We also describe the difficulties of ad-hoc feature extraction methods that arise with traditional machine learning algorithms and how deep learning algorithms are applied to time series datasets. Moreover, as we are interested in the detection of both stress and sleepiness in professional drivers, we review the literature on related techniques.

2.1 Preliminaries

It is important to highlight the functional control of the human body by the Autonomic Nervous System (ANS) and its role in stress and fatigue. A review of this area is needed to understand the relationship between physiological signals and arousal, and how measuring biometrics provides an indirect, non-invasive way of estimating ANS activity.

The ANS is responsible for regulating the internal functioning of the human body, which includes respiration, blood pressure, secretion of glands and the heart's electrical activity (Cacioppo et al., 2007). It is composed of two parts: a) the Sympathetic Nervous System (SNS) and b) the Parasympathetic Nervous System (PNS). The SNS allocates the resources required for action during a stressful state, while the PNS works as a stabilizer that relaxes the body and brings it back to its normal state. A state in which these two divisions are in balance is called homoeostasis. Increased sympathetic activity or lower vagal activity represents wakefulness characteristics, whereas a lower heart rate or increased parasympathetic activity signifies sleepy characteristics (Lal and Craig, 2002). Similarly, skin conductance and body temperature are also important physiological parameters that depend on the person's state, reflecting the autonomic responses (Rogado et al., 2009). Since the heart and other biological processes in the body are controlled by the ANS, measuring cardiac and physiological activity is a non-invasive way of determining the state of the autonomic nervous system.

The heart rate measures the number of heartbeats per unit of time. It describes the heart activity when the ANS attempts to cope with the human body's demands depending on the received stimuli (Healey and Picard, 2005). Heart Rate Variability (HRV) refers to the oscillation of the interval between consecutive heartbeats, i.e. the beat-to-beat variation. In a stressful situation, the SNS increases heart rate and HRV is suppressed. During the recuperation phase, the PNS mobilizes resources to normalize the body and HRV increases. As mentioned earlier, the domination of sympathetic activity describes wakefulness characteristics, which is reflected by a decrease in HRV. Similarly, a drop in heart rate or an increase in HRV can occur at the beginning of a drowsy state (Lal and Craig, 2001; Vicente et al., 2011).
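The relationship between inter-beat intervals (IBI), heart rate and HRV can be illustrated with a short sketch. SDNN and RMSSD are two widely used time-domain HRV measures; their use here is an assumption for illustration and is not taken from the studies cited above. The function name `hrv_summary` and the interval values are hypothetical.

```python
import math
import statistics


def hrv_summary(ibi_ms):
    """Summarise a list of inter-beat intervals given in milliseconds.

    mean_hr_bpm: heart rate implied by the mean interval (60000 / mean IBI).
    sdnn_ms:     sample standard deviation of the intervals.
    rmssd_ms:    root mean square of successive interval differences.
    """
    if len(ibi_ms) < 2:
        raise ValueError("at least two inter-beat intervals are required")
    mean_ibi = statistics.fmean(ibi_ms)
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    return {
        "mean_hr_bpm": 60000.0 / mean_ibi,
        "sdnn_ms": statistics.stdev(ibi_ms),
        "rmssd_ms": math.sqrt(statistics.fmean([d * d for d in diffs])),
    }


# Hypothetical intervals: a steady rhythm versus a more variable one
# with the same mean interval, and therefore the same mean heart rate.
steady = hrv_summary([800, 802, 798, 801, 799])
variable = hrv_summary([800, 810, 790, 805, 795])
print(steady)
print(variable)
```

Both series imply the same mean heart rate of 75 beats per minute, yet the second has clearly higher variability, which is why heart rate and HRV are treated as separate indicators.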

HRV analysis has been used for stress and fatigue detection as well as for cardiovascular disease monitoring (Abe et al., 2016). However, it is important to note that the baseline heart rate largely depends on cardiovascular fitness and on the type of activity a person is performing (Sharma and Gedeon, 2012). Hence, heart activity measurements cannot be compared directly across people unless they are normalized using a baseline measurement.
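One simple way to make measurements comparable across people is to express them relative to each person's own resting baseline. The sketch below standardises task-time values by the mean and standard deviation of a baseline recording; this particular choice of normalization, the function name `normalize_to_baseline` and all values are assumptions for illustration.

```python
import statistics


def normalize_to_baseline(values, baseline):
    """Express `values` in units of standard deviations from the baseline mean."""
    mu = statistics.fmean(baseline)
    sigma = statistics.pstdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has zero variance and cannot be used for scaling")
    return [(v - mu) / sigma for v in values]


# Two hypothetical drivers with different resting heart rates (beats per minute).
driver_a = normalize_to_baseline([66], baseline=[60, 62, 58, 60])
driver_b = normalize_to_baseline([86], baseline=[80, 82, 78, 80])
print(driver_a, driver_b)
```

Although the raw heart rates differ by 20 beats per minute, both drivers show the same elevation relative to their own baseline after normalization.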

2.2 Recognizing Physiological Arousal

In this section, we give an overview of the relevant work on fatigue and stress detection using machine learning techniques. To the best of our knowledge, the majority of the work has addressed either stress or fatigue (sleepiness) detection using an ad-hoc feature extraction process, and does not take both into account. For detecting the level of arousal, it is very important to recognize both stress and fatigue because, as discussed earlier, under-arousal and over-arousal both lead to a decline in performance and increase the chance of driver distraction.

The next section highlights related work on fatigue recognition, followed by details of stress detection.

2.2.1 Fatigue

Fatigue is the transitory period between the awake and sleepy states which, if uninterrupted, can lead to sleep. It can be divided into two categories, mental and physical fatigue. The former is believed to be psychological in nature, whereas the latter is related to muscle fatigue, characterized by reduced muscular power and movement (Lal and Craig, 2001). The terms sleepiness and fatigue are used synonymously to refer to sleep resulting from a neurobiological process that controls circadian rhythm (Rosekind et al., 1994). In comparison with fatigue, sleepiness has a more precise definition. The term fatigue is largely used to indicate a state of reduced mental alertness due to prolonged activity, which results in impairment of performance. Due to the overlapping nature of fatigue and sleepiness, we use the two terms interchangeably in this thesis.

Significant work on arousal detection in professional drivers has been done by Coughlin et al. (2009) in developing a system for the AwareCar concept: a car that is aware of its driver's state and can help the person stay at an optimal arousal level. They propose a system to detect the driver's state based on physiological signals, driving behaviour, vehicle information, etc. Moreover, they discuss in detail how such a system can be beneficial even in self-driving cars. This work largely provides the basis for our research on detecting the arousal level using only physiological signals provided by consumer wearable devices. This makes it possible to avoid fusing complex vehicle and environment information, which requires a significant number of sensors in the vehicle.

A critical review of the psycho-physiology of driver fatigue is provided by Lal and Craig (2001). The authors cover different types of fatigue and their impact on performance. In particular, they discuss how fatigue can affect professional drivers and what its indicators are. They show that the large body of work on detecting driver fatigue is based on EEG, eye movement and video analysis. Regarding heart rate (ECG), they state that further controlled investigation of the autonomic changes that occur during driving fatigue is needed to draw firm conclusions. This further motivates us to explore heart rate and other physiological signals for arousal detection in a controlled environment.

To counter the interpersonal variation in heart rate, Abe et al. (2016) took a personalization approach to drowsiness detection, using HRV features for each driver individually. They applied a "multivariate statistical process control" technique, which uses principal component analysis and defines the normal operating condition with two monitored indexes, namely the T² and Q statistics. They suggested optimizing the threshold value for each driver independently for better results. The detection rate of the proposed technique was 68%. The sensitivity for 12 participants was 80% and for the rest around 20%. The recommended threshold optimization method motivates us to employ weighted threshold-moving (Zhou and Liu, 2006) and early stopping of model training when the loss on the validation set starts to increase.
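The early-stopping rule mentioned above can be sketched as follows. The `patience` and `min_delta` parameters are assumptions commonly found in practice and are not specified in the text; the validation losses are invented for the example.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve on the best value
    by more than `min_delta` for `patience` consecutive epochs. If that
    never happens, the last epoch is returned as the stopping point.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses that start to rise after the third epoch.
losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=2)
print(stop_epoch, best_epoch)
```

In practice the model weights from the best epoch would be restored after stopping, so the extra epochs spent waiting do not degrade the final model.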

Michail et al. (2008) performed power spectral analysis of drivers' HR along with a variation of fractal dimension. The ECG and EEG features were extracted from the data of sleep-deprived drivers exposed to real-life driving conditions. Their analysis shows that the combination of spectral and non-linear characteristics of EEG and ECG can reveal the physiological differences that occur due to lack of sleep during driving, and more specifically a decrease in arousal and loss of control. They found that a lower ratio of the low-frequency to the high-frequency component and lower low-frequency values were correlated with the occurrence of driving errors.
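The ratio of low-frequency to high-frequency power can be illustrated with a minimal periodogram sketch. The band limits (0.04 to 0.15 Hz for LF and 0.15 to 0.40 Hz for HF) are the conventional HRV bands and are an assumption here, as is the requirement that the interval series has already been resampled to a uniform rate; this is not the procedure of the cited study.

```python
import cmath
import math


def band_power(samples, fs, f_low, f_high):
    """Sum periodogram power for DFT bins with f_low <= frequency < f_high."""
    n = len(samples)
    mean = sum(samples) / n
    centred = [x - mean for x in samples]  # remove the constant offset
    power = 0.0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        if f_low <= freq < f_high:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(centred))
            power += abs(coeff) ** 2
    return power


def lf_hf_ratio(samples, fs, lf=(0.04, 0.15), hf=(0.15, 0.40)):
    """Ratio of low-frequency to high-frequency power of a uniform series."""
    hf_power = band_power(samples, fs, *hf)
    if hf_power == 0:
        raise ValueError("no power in the high-frequency band")
    return band_power(samples, fs, *lf) / hf_power


# Synthetic interval series sampled at 4 Hz for 120 s: a 0.10 Hz component
# with amplitude 40 ms and a 0.25 Hz component with amplitude 20 ms.
fs = 4.0
samples = [800
           + 40 * math.sin(2 * math.pi * 0.10 * t / fs)
           + 20 * math.sin(2 * math.pi * 0.25 * t / fs)
           for t in range(480)]
ratio = lf_hf_ratio(samples, fs)
print(round(ratio, 3))
```

Because power scales with the square of amplitude, the synthetic series yields a ratio close to 4. Real recordings would additionally need artefact removal and a spectral estimator that is more robust than a raw periodogram.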

Vicente et al. (2011) used HRV features extracted from ECG data for driver drowsiness detection. Two datasets were used for training a linear discriminant classifier. In the first, participants underwent a sleep deprivation protocol (between 7 and 26 hours) before driving in a simulator for 2 hours. In the second, participants drove without any sleep deprivation protocol. The presented results suggest that the classifier identified the global state of the driver, i.e. whether the driver is awake enough to drive. However, the beginning and end of isolated drowsy states for non-sleep-deprived subjects were not detected precisely. Furthermore, Patel et al. (2011) applied a neural network for fatigue detection to frequency-domain HRV features extracted from an ECG dataset collected from 12 participants in a laboratory study. They reported 90% accuracy on a binary classification problem, i.e. alert or fatigued.

Krajewski et al. (2009) presented a method for fatigue detection using steering wheel behaviour. Fatigue- or drowsiness-induced changes in steering behaviour are related to slow drifting and fast corrective counter-steering. They extracted more than 1200 features from the time, frequency and state space domains to train an ensemble of five classifiers. On simulator driving data from 12 drivers, their approach achieved an 86.1% recognition rate in classifying slight to strong fatigue.

2.2.2 Stress

Stress can be described as a physiological response to emotional, mental and physical challenges (Schneiderman et al., 2005). The immediate behavioural response of a human to stress is usually described as fight or flight, which is kick-started by so-called sympathetic arousal. In response to a potential threat to one's physical well-being, the sympathetic branch of the ANS stimulates specified target organs via efferent (nor-adrenaline mediated) neuron tracts, initiated in the Locus Coeruleus (see Figure 2.1), located in the brain stem (Cacioppo et al., 2007). The immediate effect of this stimulation is a measurable change in physiological status, such as an increased heart rate and skin conductance level.

[Figure 2.1: drawing of the brain with the labels cortex, thalamus, locus coeruleus, cerebellum and spinal cord.]

Figure 2.1: Illustration of Locus Coeruleus in the Brain Stem. In a potential threat situation, the sympathetic branch of the ANS stimulates specified target organs via efferent neuron tracts, initiated in the Locus Coeruleus. This instantaneous stimulation results in measurable changes in physiological parameters such as skin conductance.

In stress recognition, Healey and Picard (2005) presented a detailed methodology for collecting and analyzing physiological data during a real-world driving task to detect stress. They collected 50 minutes of data, including skin conductance, respiration rate and heart rate, from twenty-four drivers. For three-level stress classification, their system achieved 97% accuracy on features calculated from 5-minute intervals. In addition, for most drivers they found that skin conductance and heart rate are most closely correlated with the driver's stress level.

Zhai and Barreto (2006) proposed a stress detection system based on Galvanic Skin Response (GSR), blood volume pulse, pupil diameter and skin temperature to differentiate affective states (stressed or relaxed) in computer users. The “Paced Stroop Test” was used to induce stress in thirty-two participants, while emotionally neutral pictures were presented as a baseline (relaxed condition). A Support Vector Machine (SVM) was trained on various features, such as mean IBI and mean pupil diameter. Their system achieved 90% accuracy by incorporating all the features, and pupil diameter was found to be the most significant indicator of affective state.

Kurniawan et al. (2013) used GSR together with speech data in a controlled environment for stress prediction. They found features extracted from speech data to be more informative than GSR but more person-dependent, i.e. they do not allow classifiers to generalize well to unseen data. They used five classifiers: Decision Trees, SVM (with RBF kernel), Gaussian Mixture Model, Logistic Regression (LR) and K-Means. According to their analysis, SVM outperformed the other classifiers, with an accuracy of 80% when using both GSR and speech and 92% when only features extracted from speech were used. They found the accuracy of K-Means to be the lowest, making it unsuitable for stress detection. Likewise, the ensemble classifier trained by combining both feature sets did not improve accuracy.

Sun et al. (2010) presented an activity-aware stress classification method using GSR, ECG and accelerometer data. They argue that physical activity can hide the physiological changes caused by mental stress; hence, it is important to consider the activity performed by the person. Sixty minutes of data were collected from 20 participants while they performed three activities, i.e. sitting, standing and walking. They employed a 60-second window for feature extraction and trained Decision Tree, Bayesian Network and SVM models. Their results indicate that physiological signals are highly user-dependent and hence, for developing stress monitoring applications, it is important to also use personalized data for model training.

We find that the aforementioned studies on fatigue recognition widely use EEG and ECG signals. Similarly, stress detection methods mainly rely on ECG, respiration rate and GSR. However, an arousal detection system that could be used in everyday situations should rely on physiological parameters that can be collected easily, e.g. from widely available consumer wearable devices. Therefore, in this work, we apply deep pattern recognition methods to see whether heart rate, skin conductance and skin temperature recorded during a simulation study can be used for arousal classification.

Another crucial step in the development of a classification model is the extraction of features from raw signals that can optimally represent the data. For this purpose, the aforementioned studies used ad-hoc procedures: features are calculated manually by subject experts, which is not only time-consuming but also limited by the researcher’s creativity. Moreover, the created features are not guaranteed to be discriminative for solving the end task. A better approach is to learn informative features directly from the data. This idea is at the core of deep learning algorithms, which can extract complex non-linear features automatically. The next section and Chapter 4 provide a detailed discussion of how deep neural networks can be used for supervised sequence classification.

2.3 Deep Learning for Sequence Classification

This section reviews recent developments in deep learning for time-series problems. Deep learning has shown state-of-the-art results in several domains such as computer vision and machine translation, as well as in time-series and sequence classification tasks. Contemporary research on applying deep learning to sequential data has resulted in different model architectures and data modeling techniques to capture temporal information. Hence, the work reviewed in this section comes from disparate domains but relates to problems involving sequence classification.

Miotto et al. (2016) applied a denoising autoencoder (DAE) for unsupervised representation learning of patients’ characteristics from Electronic Health Records (EHR). An EHR is composed of medications, diagnoses, procedures, lab tests, etc. collected for patients over different time periods. The proposed technique was evaluated using 76,214 test patients comprising 78 diseases from diverse clinical domains and temporal windows. They claimed that their results significantly outperformed those achieved using representations based on raw EHR data and alternative feature learning strategies such as principal component analysis. The results achieved by this simple model inspired us to apply a DAE along with a feed-forward neural network as a baseline.

Lipton et al. (2015b) applied RNN and LSTM models with a target replication cost function for multi-label time series classification. Instead of computing the cost only from the prediction at the last time step of the LSTM, they calculated the loss at each time step and combined it with the loss at the last time step. The proposed architecture was employed for classification of 128 ICU phenotypes from EHR. They compared its performance with a multilayer neural network trained on hand-crafted features and achieved better results. Likewise, De Baets et al. (2016) used a bi-directional LSTM for predicting blood culture test outcomes of patients. Clinical parameters such as temperature, heart rate and blood leukocyte count are presented as time series, collected from 2177 ICU admissions. Their preliminary results show an area of 71.95% under the precision-recall curve.
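The target replication idea attributed to Lipton et al. above can be illustrated with a minimal sketch. It assumes a multi-label log loss and a mixing weight alpha between the mean per-step loss and the final-step loss; the function names and the exact weighting are illustrative assumptions, not taken from the reviewed paper.

```python
import math


def log_loss(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over the labels of one prediction."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(targets)


def target_replication_loss(step_probs, targets, alpha=0.5):
    """Combine the mean per-step loss with the loss at the last time step.

    step_probs: one list of predicted label probabilities per time step.
    alpha: assumed mixing weight (0 uses only the last step, 1 only the mean).
    """
    per_step = [log_loss(p, targets) for p in step_probs]
    mean_loss = sum(per_step) / len(per_step)
    return alpha * mean_loss + (1.0 - alpha) * per_step[-1]


# Two time steps, two labels; the prediction improves at the last step.
step_probs = [[0.5, 0.5], [0.9, 0.1]]
targets = [1, 0]
loss = target_replication_loss(step_probs, targets, alpha=0.5)
```

With alpha set to zero the sketch reduces to the usual last-step loss, which makes the effect of replicating the target at every step easy to compare.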

Sathyanarayana et al. (2016) compared several deep learning architectures for sleep quality prediction using actigraphy sensor data collected from 92 adolescents over one full week. They showed that a CNN-based model achieved the highest precision and recall. Furthermore, Choi et al. (2016) applied a GRU-based recurrent neural network model for heart failure prediction using EHR, with heart failure and control cases. They compared the model’s performance to traditional machine learning approaches such as LR and SVM and showed that their model exhibited superior performance. Martinez et al. (2013) applied a deep convolutional autoencoder and a perceptron for affective modelling. They used the physiological signals (i.e. skin conductance and blood volume pulse) of 36 game players and subjective self-reports of affect. They showed that the deep learning methods outperformed manual feature extraction and selection for affective modelling. This work further motivates us to use not only a deep convolutional neural network but also a subjective measure for ground truth annotation.

Zheng et al. (2014) proposed a deep CNN architecture for activity recognition and congestive heart failure detection with ECG. Instead of representing signal data as a 2D image, as is usually done with CNNs, they applied separate convolution and pooling operations on each input channel (e.g. the x, y and z components of an accelerometer) and at a later stage flattened and concatenated the outputs of the convolution layers as input to a fully connected neural network layer to get classification results. In comparison with distance-based classifiers and variations of dynamic time warping, their proposed method achieved good performance as well as improved prediction time. Similarly, Cui et al. (2016) proposed a multi-scale CNN (MCNN) tailored for time series classification problems. The suggested architecture contains multiple branches that perform various transformations of the time series, which extract features at different frequencies and time scales. In extensive experiments, they showed superior performance of MCNN compared to dynamic time warping, distance-based and ensemble classifiers. These works motivate us to model the input as 1-D data, with each signal as one channel of the CNN model, and to apply depth-wise convolution for independently learning features from each signal, followed by general convolution to extract interactional features.
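The idea of convolving each signal independently and only then mixing the channels can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: kernels are fixed rather than learned, and the mixing step is a per-time-step weighted sum (a 1x1 convolution) standing in for a general convolution. All function names are hypothetical.

```python
def conv1d_valid(signal, kernel):
    """Valid-mode 1-D cross-correlation of one channel with one kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def depthwise_conv1d(channels, kernels):
    """Apply a separate kernel to each channel, with no mixing across signals."""
    return [conv1d_valid(ch, ker) for ch, ker in zip(channels, kernels)]


def pointwise_mix(feature_maps, weights):
    """Weighted sum of the per-channel feature maps at each time step."""
    length = len(feature_maps[0])
    return [
        sum(w * fm[t] for w, fm in zip(weights, feature_maps))
        for t in range(length)
    ]


# Two toy channels (e.g. heart rate and skin conductance), one kernel each.
channels = [[1, 2, 3, 4], [0, 1, 0, 1]]
kernels = [[1, 1], [1, -1]]
per_channel = depthwise_conv1d(channels, kernels)
mixed = pointwise_mix(per_channel, [1, 1])
```

Keeping the two stages separate mirrors the motivation in the text: the first stage sees one signal at a time, and only the second stage can express interactions between signals.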

Hammerla et al. (2016) applied deep learning methods to accelerometer data for Human Activity Recognition (HAR). Across thousands of experiments, they investigated the suitability of different models for HAR. On diverse benchmark datasets, CNN and LSTM outperformed hand-crafted features by a significant margin. Moreover, they suggested using LSTM in real-time applications on a sample-by-sample basis. Alsheikh et al. (2015) proposed a hybrid approach combining deep learning and HMM for sequential activity recognition. They showed the importance of unsupervised training of models for weight tuning and optimization. Moreover, they performed spectrogram analysis on raw accelerometer data and found it helpful for capturing variation in the input data for activity recognition.

The studies discussed in this section highlight a growing interest in deep learning for sequence classification problems. An important thing to recognize is that deep models are learned completely from scratch with minimal pre-processing and no feature engineering. This shows the capability of deep learning algorithms to learn highly discriminative features directly from data. In particular, convolutional neural networks stand out with strong performance in affective computing and human activity recognition problems. Likewise, recurrent neural networks and denoising autoencoders show superior performance on electronic health record data. However, designing an optimal deep architecture (e.g. deciding on the number of layers, filters, stride size, pooling type, etc. in a CNN) is an open research problem. This is usually done by trial and error or by transferring models that work well in other domains. Recently, there has been growing interest in applying evolutionary algorithms to automatically design deep neural network architectures (see Miikkulainen et al. (2017)), but this requires significant computing resources. In this study, we took inspiration from existing architectural designs that work well in other domains for the purpose of arousal classification.

2.4 Analysis of Existing Approaches

From the above discussion, it can be seen that very few works investigated stress and sleepiness detection together. Likewise, the majority of the works used ad-hoc feature extraction techniques to develop classification models. The prime limitation of these techniques is that they rely on the collection and usage of ECG, EEG, electromyogram and speech data, which is not feasible in real-life situations. Wearable devices with a PPG sensor for heart rate measurement can be used easily during day-to-day activities. Likewise, skin conductance, skin temperature and accelerometer data can also be acquired from similar non-invasive wearable devices. In this work, we specifically aim to explore stress and drowsiness detection together and develop an arousal classification model that can leverage the sensors of widely available wearable devices.

In the mentioned studies on stress and fatigue detection, the employed machine learning algorithms can be grouped into three categories. First, algorithms that use distance-based measures and are generally known as lazy learners, such as k-Nearest Neighbours. These methods become computationally infeasible as the dataset size grows and require a carefully chosen similarity metric. Second, classification algorithms such as Random Forest applied to high-level extracted features. In this case, the extraction is performed either by a human researcher or using highly distributed libraries¹, which calculate hundreds of widely used time-series features. However, the produced features are not guaranteed to be optimal for the end task. Third and last, dynamic probabilistic models, e.g. HMM, applied to capture the temporality of the data. These methods also become computationally expensive, as their state space grows exponentially with the context window size. They have also been found to be impractical for modeling long-range dependencies in sequential data (Graves et al., 2014).

To overcome the limitations of the preceding techniques, we intend to use deep neural networks for representation learning directly from raw physiological signals. This avoids complicated feature extraction and selection procedures, which traditionally require domain-expert knowledge. Moreover, it enables important discriminative features to be extracted automatically through end-to-end training. In particular, we expect that employing temporal convolutional and recurrent neural networks can significantly improve the detection rate, owing to their ability to extract local and global dependencies in sequential data. Furthermore, a learned model can be used in near real time to detect the driver’s arousal level and hence help the user achieve an optimal state.

¹ tsfresh is a Python package that automatically calculates and selects a large number of time series features for regression and classification tasks. It can be found at: http://tsfresh.readthedocs.io/en/latest/

Chapter 3

Data and Methodology

This chapter describes the experimental environment, the data analysis, the arousal ground truth generation approach and the segmentation strategy of the dataset.

3.1 Experimental Protocol

Thirty drivers participated in the study, only one of whom was female. The average age and weight of the participants were 43±13 years and 91±17 kg, respectively. All participants were professional drivers, and each went through a health check-up to avoid collecting data from unhealthy subjects. The driving task was realized using the SILAB¹ driving simulation software, and participants received standardized instructions from an audiotape. Figure 3.1 shows a breakdown of the experimental protocol activities. The participants completed a practice trial to get used to the simulator setup and the driving task. Just before the start, participants filled in the Karolinska Sleepiness Scale (KSS) and other related forms. The first experimental trial consisted of normal driving (baseline condition) for 15 minutes. After that, each subject was asked twice to count from 1 to 60 as a moderate stress activity, with some interval between the two moderate stress activities. After a one-minute period of normal driving, stress was induced by an arithmetic task: counting backwards in steps of 7 from a random number was used as a high stressor. The subject was asked to complete the count in approximately 30 seconds and was then asked to count backwards again from another random number; this process was repeated for approximately 5 minutes. The length of the stress simulation task was 25 minutes, including baseline. In the break period (which varied from driver to driver), participants filled in the KSS form for the second time. Then the second experimental (sleepiness or fatigue) trial started, which lasted for 90 minutes. In this phase, no secondary tasks were applied. Every 10 minutes, a KSS prompt was given to the drivers (on a tablet) to collect their sleepiness level. At the end of the experiment, drivers filled in the KSS and other required forms, and the devices were removed.

¹ https://wivw.de/en/silab

[Figure: timeline of study phases: Pre-driving (15 mins), Practice trial (15.5 mins), Stress trial (25 mins), Break (15 mins), Fatigue trial (90 mins), Post driving (21 mins).]

Figure 3.1: Sequence and duration of events in the simulator study. At the beginning, participants were welcomed, an overview of the study activities was given and the device setup was performed. The practice trial was executed to make participants comfortable with and aware of the simulator environment. Afterwards, instructions were given for the baseline and stress tasks, followed by a break. Subsequently, a long, dull driving task was carried out to induce sleepiness in drivers. At the end, devices were removed, study feedback was collected and participants were compensated.

3.2 Data Collection and Analysis

We collected heart rate, skin conductance, skin temperature and accelerometer data from each participant using wrist-worn devices. The heart rate was derived from PPG sensor data at a frequency of 1 Hz, whereas the rest of the physiological signals were recorded at a frequency of 10 Hz.

To derive the ground truth labels for arousal, we followed the experimental protocol and used stress and KSS scores. The data collected during the baseline, moderate stress and high stress tasks were simply assigned the corresponding labels. The stress was induced by means of a secondary arithmetic subtraction task. It is a component of the widely used Trier Social Stress Test (Kudielka et al., 2007; Birkett, 2011), in which the user has to perform serial subtraction aloud and has to start over from the last correct answer if a mistake is made. The baseline, moderate stress and high stress conditions were assigned labels of 1, 2 and 3, respectively. The data points during instruction periods were assigned a label of zero to avoid incorrect labelling.

For the fatigue trial, the KSS was used to evaluate the subjective sleepiness of each driver. The KSS spans nine levels and asks the user to provide the number that most closely represents their sleepiness level at the moment. Table 3.1 provides the ratings and descriptions of the KSS. It appears to be the most widely used, sensitive and reliable measure of sleepiness (Gillberg et al., 1994). Likewise, studies show a significant correlation between the KSS and objective measures of driving performance, such as the standard deviation of the lateral position, and blink duration (Ingre et al., 2006). This makes the KSS a feasible measure for deriving ground truth for supervised models compared to video coding, which requires substantial human effort and carries a high risk of bias in the generated labels. In the 90-minute fatigue trial, drivers were asked for a KSS score every 10 minutes. Furthermore, the two KSS scores provided by the drivers at the start of the experiment and before the break were also used. Values between two reported scores are linearly interpolated to obtain a discrete range of KSS scores that we later used for arousal ground truth labelling. KSS scores from 1 to 5 were considered to be the “alert” state, whereas scores from 6 to 9 were considered the “sleepy” state.
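The interpolation and thresholding of KSS scores described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the one-minute step, the rounding of interpolated values to the nearest integer level, and the function names are all assumptions.

```python
def interpolate_kss(times_min, scores, step_min=1.0):
    """Linearly interpolate KSS ratings between consecutive prompts.

    times_min: increasing prompt times in minutes; scores: reported KSS values.
    Returns (time, level) pairs, with levels rounded to the nearest integer
    (rounding is an assumption made to obtain discrete KSS levels).
    """
    start, end = times_min[0], times_min[-1]
    n_steps = int((end - start) / step_min + 1e-9)
    result = []
    seg = 0  # index of the interval [times_min[seg], times_min[seg + 1]]
    for n in range(n_steps + 1):
        t = start + n * step_min
        while seg < len(times_min) - 2 and t > times_min[seg + 1]:
            seg += 1
        t0, t1 = times_min[seg], times_min[seg + 1]
        frac = (t - t0) / (t1 - t0)
        value = scores[seg] + frac * (scores[seg + 1] - scores[seg])
        result.append((t, int(value + 0.5)))  # round half up
    return result


def sleepiness_state(kss):
    """Map a KSS level to the binary state used in the text."""
    return "alert" if kss <= 5 else "sleepy"


# Three prompts ten minutes apart, sampled every five minutes.
levels = [k for _, k in interpolate_kss([0, 10, 20], [3, 5, 8], step_min=5.0)]
states = [sleepiness_state(k) for k in levels]
```

In practice the step would match the sampling rate of the physiological signals, so that every sample receives a KSS level.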

Table 3.1: The Karolinska Sleepiness Scale. It is a subjective measure of sleepiness, used to collect the sleepiness level of drivers in the study.

Rating k   Description
1          Extremely alert
2          Very alert
3          Alert
4          Rather alert
5          Neither
6          Some signs of sleepiness
7          Sleepy but no effort to stay awake
8          Sleepy but some effort to stay awake
9          Very sleepy, great effort to stay awake

Table 3.2: Stress Score. Explicit labels are assigned to each activity, as stress was introduced by means of a secondary task, i.e. 1-60 counting for moderate stress and a backward subtraction task for high stress.

Score s    Description
1          Normal
2          Moderate stress
3          High stress

3.3 Arousal Ground Truth Annotation

We defined a ground truth generation strategy for the under-aroused, over-aroused and normal class labels based on the stress and KSS scores combined (see Table 3.1 and Table 3.2). Let s denote the stress label and k the KSS score. The arousal label l can be specified as follows:

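\[
l =
\begin{cases}
\text{under-aroused (1)}, & \text{if } s \in \{1, 2, 3\} \text{ and } k \in \{6, 7, 8, 9\} \\
\text{normal (2)}, & \text{if } s \in \{1, 2\} \text{ and } k \in \{1, 2, 3, 4, 5\} \\
\text{over-aroused (3)}, & \text{if } s = 3 \text{ and } k \in \{1, 2, 3, 4, 5\}
\end{cases}
\]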

For data points to be in the under-aroused class, the stress label must be 1 and the KSS score between 6 and 9, inclusive. Likewise, for a normal or optimal level of arousal, the stress label needs to be 1 or 2 and the KSS score between 1 and 5. Similarly, for the over-aroused class, the stress label has to be 3 and the KSS score between 1 and 5, inclusive. Figure 3.5 provides a graphical illustration of the arousal ground truth based on stress and KSS labels.
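The labelling rule can be written as a small function. The sketch below follows the case definition of l given above, in which any stress label from 1 to 3 combined with a KSS score of 6 to 9 yields the under-aroused class. The function name and the choice to return None for unlabelled points (e.g. instruction periods with stress label 0) are assumptions made for illustration.

```python
def arousal_label(stress, kss):
    """Return 1 (under-aroused), 2 (normal), 3 (over-aroused) or None.

    stress: stress label s in {1, 2, 3}; kss: KSS score k in {1, ..., 9}.
    Points outside these ranges are left unlabelled (assumption).
    """
    if stress not in (1, 2, 3) or not 1 <= kss <= 9:
        return None
    if kss >= 6:
        return 1  # under-aroused: sleepy, regardless of stress label
    if stress == 3:
        return 3  # over-aroused: high stress while alert
    return 2      # normal: baseline or moderate stress while alert


labels = [arousal_label(s, k) for s, k in [(1, 7), (2, 3), (3, 4), (0, 4)]]
```

Checking the KSS condition first keeps the three cases mutually exclusive, which the case definition relies on.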

During analysis, we found several drivers with clipped physiological signals, which happens due to bad skin contact with the sensor. Similarly, arousal ground truth for one of the three classes was not available for some drivers. This happens because some drivers did not feel drowsy and provided KSS scores of less than 6. Hence, such drivers were filtered out to avoid model training on poor-quality data, leaving 11 drivers for further analysis. The criteria defined to discard drivers are as follows:

• Clipped physiological signal

• Unavailability of arousal ground truth for one of the classes

The mean heart rate, skin conductance and skin temperature by arousal level for all drivers can be seen in Figures 3.2, 3.3 and 3.4, respectively. Note that the duration of each trial was different, but the changes in physiological signals are still apparent. Skin conductance was found to be highest in the under-aroused state, whereas skin temperature was found to be almost constant. Furthermore, the class label distribution of the raw data for each of the 11 drivers can be seen in Figure 3.6. Figure 3.7 provides the overall class distribution of the dataset.

3.4 Pre-processing and Segmentation

Physiological signals vary significantly from person to person and depend on several factors such as age, gender, sleep and diet (Picard et al., 2001). Hence, it becomes important to minimize the effect of these variations in the raw signal. Normalization rescales the values to a defined range; in the case of mean normalization, the signal will have zero mean and unit variance, as in a standard normal distribution. This is important not only when raw signals have different measurement units; it is also a general requirement of optimization-based learning algorithms such as logistic regression. We minimally preprocess the dataset to let deep nets extract the key non-linear features. The heart rate, skin conductance and skin temperature of each driver are mean-normalized by baseline to have zero mean and unit variance (see Equation 3.1). The mean and standard deviation are calculated from the 15 minutes of normal baseline driving. Day-to-day variations, e.g. mood fluctuations, are not considered in this work, as the total duration of the simulation task was approximately 2 hours.

\[
x' = \frac{x - x_b}{\sigma_b} \tag{3.1}
\]
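Equation 3.1 can be illustrated with a short sketch that takes the mean and standard deviation from the baseline segment only and applies them to the whole signal. The use of the population standard deviation and the function name are assumptions made for illustration; the original implementation is not shown in the text.

```python
from statistics import mean, pstdev


def normalize_by_baseline(signal, baseline):
    """Normalize a signal with the mean and standard deviation of its baseline.

    signal: all samples of one physiological signal for one driver.
    baseline: the samples recorded during normal baseline driving.
    """
    mu_b = mean(baseline)
    sigma_b = pstdev(baseline)  # population standard deviation (assumption)
    if sigma_b == 0:
        raise ValueError("baseline segment has zero variance")
    return [(x - mu_b) / sigma_b for x in signal]


# Toy baseline with mean 5 and standard deviation 2.
baseline = [2, 4, 4, 4, 5, 5, 7, 9]
normalized = normalize_by_baseline([5, 7, 9], baseline)
```

Because the statistics come from the baseline alone, later segments are expressed as deviations from each driver's own resting level rather than from a population average.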

As mentioned earlier, the heart rate and the other physiological signals had different sampling rates, so we upsampled the heart rate using linear interpolation to match the frequency of the rest, i.e. 10 Hz. The upsampling is performed to keep the dataset size large enough and to avoid losing meaningful information. Hence, the dataset of each driver i can be represented as:

\[
D_i = \{(x^d_1, y_1), (x^d_2, y_2), \ldots, (x^d_{N_i}, y_{N_i})\}
\]
