Heat exposure, heat stress and the prediction with physiological signals


Layout: typeset by the author using LaTeX.


Elise Sterk 10790330

Bachelor thesis Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Dr. V.V. Krzhizhanovskaya
MSc. V.R. Melnikov

Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

Jun 26th, 2020


Abstract

Data was provided from an experiment on behavioral adaptations in an outdoor thermal environment. The main goal of the experiment was to give more insight into how participants respond to dynamic environmental conditions. The goal of this thesis was to investigate whether physiological signals (heart rate, electrodermal activity, blood volume pulse and skin temperature) could predict sun exposure and its heat intensity during the experiment and could consequently be indicators of heat stress in individuals. Features were extracted from the physiological signals and provided the input for a machine learning approach involving classification and regression. The results are promising. Classification resulted in an accuracy of 75%, and an accuracy of 81% was reached when creating more distinction between the higher and lower target values. Moreover, the results of this thesis indicate that an even higher accuracy could be reached with parameter optimization. The results of regression were lower, with an R² score of 0.33. The most important feature for the prediction of heat exposure was the frequency of the skin conductance response peaks. This is especially interesting, since it indicates that it would be promising to further investigate the application of the electrodermal activity signal in the prediction of outdoor heat stress. In general, this thesis was successful in using physiological signals for the prediction of sun exposure and heat in participants. These results are relevant because, with urbanization and climate change, it becomes increasingly important to find indicators that can detect heat stress in real time and help people mitigate it, in order to prevent heat-related illnesses.


4.4 Evaluation
4.5 Two groups
5 Results
5.1 Survey responses
5.2 Features
5.3 Sun exposure
5.3.1 Logistic Regression
5.3.2 Random Forest
5.4 Heat score
5.4.1 Linear Regression
5.4.2 Logistic Regression
5.4.3 Random Forest
6 Conclusion
6.1 Minor data conclusions
6.2 Overall conclusions
7 Discussion
8 Future research
Bibliography
Appendices
A Dataframes
B Parameters
C Features
D Results data analysis
E Results Percentage_sun


Chapter 1 Introduction

Heat and heat-related events are among the leading weather-related causes of mortality (Borden & Cutter, 2008; National Weather Service, n.d.). Extreme heat and heat waves result in a large number of deaths across the world. For example, in 2003 a heat wave occurred in Europe which resulted in heat-related illnesses and more than 70,000 fatalities (Stone Jr, 2012). Furthermore, 700 people died in Chicago in 1995 as a result of heat-related illnesses (Semenza et al., 1996), and the results of Gabriel and Endlicher (2011) indicate that the mortality rate is significantly higher during extreme heat conditions.

Furthermore, heat exposure could become even more dangerous because, as a result of climate change, the global temperature is expected to increase (Sherwood & Huber, 2010; Collins et al., 2013). This increase in temperature could result in an increase in heat stress. Indeed, heat stress and extreme heat situations are expected to occur more frequently and with increased magnitude and duration (Collins et al., 2013). Consequently, global warming could result in more heat-related illnesses and fatalities. Sherwood and Huber (2010) acknowledge the dangers of global warming and increasing heat and argue that humans will not be able to adapt to the rising temperature caused by global warming. They believe that global warming will result in more heat stress and, consequently, that conditions could become unbearable for humans. In conclusion, Sherwood and Huber (2010) state that there should be more awareness of the impact of heat stress.

Heat stress occurs when the human body is unable to cool itself and discard excess heat (IOWA, n.d.). Heat stress is dangerous and can result in health issues such as heat stroke and fatigue. Sweating, dizziness and headache are among the symptoms of heat-related health issues (IOWA, n.d.). Furthermore, specific demographic groups appear to be more at risk of heat-related illnesses and fatalities. For example, during the heat wave in Europe in 2003, the majority of the fatalities were elderly people (Stone Jr, 2012). In addition, the results of Gabriel and Endlicher (2011) imply that elderly women are especially vulnerable to heat stress. Overall, heat stress can result in fatalities, and with the rising risk of heat-related events due to global warming it becomes especially important to investigate when humans experience heat.

Moreover, the urban heat island is a relevant concept when investigating heat stress. An urban heat island is an urban environment where temperatures tend to be higher than in the surrounding rural environments (Stone Jr, 2012). The temperature in an urban environment could be as much as 4 °C higher. For example, during the 2003 heat wave in Europe, the temperature of urban heat islands was higher than that of the surrounding areas (Stone Jr, 2012). The urban heat island is relevant because the increase in temperature, as a result of global warming, could also have a significant effect on heat in urban environments. A study by Tan et al. (2010) examined the relation between health, heat waves and urban heat islands in Shanghai. Their results indicate that the temperature in urban environments tends to increase at a higher rate than the temperature in rural environments. Urban heat islands also seem to intensify heat waves (Tan et al., 2010; Stone Jr, 2012), which are expected to occur more frequently (Collins et al., 2013). Tan et al. (2010) also investigated the relation between urban heat islands and mortality. They found that the number of heat-related deaths is higher in urban environments than in rural environments and that higher temperatures in urban heat islands could increase the risk of heat-related illnesses (Tan et al., 2010).

This thesis examines heat stress and heat exposure in an urban environment, namely Singapore. In previous research, V. R. Melnikov, Krzhizhanovskaya, Lees, and Sloot (2020) examined the relation between walking speed and heat stress in Singapore. They found that living in Singapore results in extra heat stress in humans, as a consequence of the pace of life in urban environments. Hence, further investigating heat stress in Singapore is interesting, since Singaporeans already experience extra heat stress and heat stress in general is expected to increase as a consequence of global warming and urbanization. This is especially relevant since more than half of the world population already lives in urban environments and this number is expected to increase in the upcoming decades (United Nations Department of Economic and Social Affairs, May 16, 2018).

Overall, extreme heat conditions pose an increasing threat to human health, especially considering global warming and urbanization. For these reasons, investigating heat stress is essential and relevant. Therefore, the research question for this thesis is: Can continuously measured physiological signals of humans predict heat exposure in a dynamic outdoor thermal environment and consequently be indicators of heat stress? The context and theoretical background for this question are explained in the following chapter.


Chapter 2 Literature review

The data that forms the foundation of this thesis was collected during an experiment in Singapore about heat stress and decision making. The participants of the experiment had to walk in a thermal environment and choose between two paths. During the experiment, data from multiple sources was collected, namely climate data, socio-demographic data and physiological data. Considering global warming, urbanization and the results of V. R. Melnikov et al. (2020), it would be interesting to examine the outdoor thermal comfort of participants in Singapore and whether they experienced heat stress during the experiment. Here, thermal comfort is defined as the subjective evaluation of how people experience the surrounding thermal environment (Abdel-Ghany, Al-Helal, & Shady, 2013), in other words, how comfortable people are with the environment. Heat stress can make people feel uncomfortable with the environment, so when heat stress occurs people could experience thermal discomfort. Thus, thermal discomfort can be an indication of heat stress.

Moreover, as mentioned, heat stress can occur when the human body cannot discard excess heat. Normally, the human body maintains a healthy heat balance and thermoregulates through heat loss and heat generation. To lose excess heat, the human body starts sweating and the process of vasodilatation is activated, which increases the blood flow from the interior of the body to the skin (Arens & Zhang, 2006). Sweat then evaporates from the skin, which removes heat. To generate extra heat, the human body starts shivering and the blood flow from the interior to the skin decreases, which is called vasoconstriction. Overall, there are multiple processes that transfer heat between the human body and its surrounding environment and within the body itself: shivering, sweat loss through evaporation, respiration, convection and radiation. Together with the metabolic rate and the external work, these processes comprise the heat balance equation of a human body (V. Melnikov, Krzhizhanovskaya, Lees, & Sloot, 2018). In general, when the human body cannot transfer excess heat to the surrounding environment, the heat balance might be disturbed and heat stress could occur.

Overall, the factors that can cause heat stress are divided into climate factors and factors related to the individual (Abdel-Ghany et al., 2013; Epstein & Moran, 2006). Six factors are particularly important for the heat balance of the human body and can result in heat stress: air temperature, radiant temperature, wind speed, humidity, metabolic rate and clothing insulation (Lundgren, Kuklane, Gao, & Holmer, 2013; Epstein & Moran, 2006). Furthermore, individual factors, such as gender, age and ethnic background, also influence heat stress. Individual factors influence the metabolic heat generation rate of a person (Abdel-Ghany et al., 2013) and consequently affect heat stress indirectly.


Furthermore, the climate parameters can affect the heat transfer between the human body and its environment. For example, when the radiant temperature is higher than the skin temperature, heat will be transferred from the environment to the body through radiation (Havenith, 2005). And when the moisture in the air is higher than on the skin, sweat evaporation will become difficult (Havenith, 2005), which can result in excessive heat.

Moreover, sun exposure and solar radiation are factors that affect heat stress, cause thermal discomfort and directly affect the radiant temperature. BetterHealthChannel (n.d.) specifically identifies sun exposure as a cause of heat stress, and part of the advice for people who experience heat stress is to move to a shaded area. Furthermore, many studies have investigated the effect of sun exposure on human thermal comfort and heat stress, and the impact of shade on heat stress has proven to be significant. For example, blockage of direct solar radiation through building shade can decrease the mean radiant temperature (Rakha, Zhand, & Reinhart, 2017). Consequently, the heat stress experienced by an individual could decrease, since the mean radiant temperature is lower. Also, Zhang and He (2012) suggest that tree shading can improve the thermal environment in urban pedestrian streets, and Ali-Toudert and Mayer (2007) identified the significance of sun exposure for thermal discomfort and conclude that shading is of great importance to relieve heat stress in pedestrians. Moreover, tree shading results in a decrease in direct solar radiation, consequently reducing heat stress and improving thermal comfort (Ali-Toudert & Mayer, 2006), and tree shade has a cooling effect and significantly reduces the globe temperature in urban environments (Armson, Stringer, & Ennos, 2012). In addition, Tung et al. (2014) investigated the thermal comfort differences between men and women. They found that, to adapt to heat, the main behavioural adjustment for both men and women was to move to a shaded area. Hence, the results of numerous studies imply that humans could move to a shaded area to relieve and prevent heat stress. Thus, shading and sun exposure could be of great significance in assessing heat stress.

Overall, the amount of heat exposure that an individual experiences is, among other things, dependent on the duration of the exposure, the socio-demographics of an individual, the physical activity of an individual and if the individual is in the sun or in the shade (Kjellstrom et al., 2016). And, as mentioned, climate parameters other than solar radiation, such as relative humidity and wind speed, also affect the heat balance of an individual.

To assess and measure heat stress, many thermal indices have been created. The authors of different articles discuss and compare thermal indices and mention the formulas and parameters that are used to calculate thermal comfort and heat stress (Blazejczyk, Epstein, Jendritzky, Staiger, & Tinz, 2012; Epstein & Moran, 2006; Havenith & Fiala, 2011; Zare et al., 2018). Among the thermal indices there are some, such as wet bulb globe temperature (WBGT) and heat index (HI), that measure heat stress only with climate parameters (Blazejczyk et al., 2012). Whereas other thermal indices, such as predicted mean vote (PMV), also include individual parameters (Zare et al., 2018).

In general, there are many climate parameters that could potentially cause heat stress in an individual, and there are many thermal indices that include different parameters to assess heat stress. Which thermal index to use depends on the available data and the problem. However, not all indices can be generalized to every individual. Some people might experience heat stress under different conditions than other people or might react differently to extreme heat. The individual parameters can therefore be important in the prediction and assessment of heat stress. For example, Lan, Lian, Liu, and Liu (2008) found that the highest comfortable temperature for a woman is slightly higher than for a man, and obesity was found to have an influence on the thermal comfort of an individual (Wang et al., 2018). Overall, as Chaudhuri, Zhai, Soh, Li, and Xie (2018) state, thermal comfort is subjective and differs per individual. Individual factors could thus be important for a more accurate indication of heat stress in individuals.


However, not all studies identified a significant difference in individual perception of thermal comfort with different ages and gender (Wang et al., 2018). Wang et al. (2018) conducted a literature review about the individual differences in thermal comfort and concluded that there is no clear conclusion about the significance of gender and age on individual thermal comfort. A solution to address the individual differences in how thermal comfort is perceived and to really examine thermal comfort on an individual level, is to collect physiological signal data through wearable devices and to use machine learning to predict individual thermal comfort with the physiological signal data (Wang et al., 2018). With this method, heat stress can be assessed and predicted on an individual level.

Thus, including physiological signal data might be an interesting approach. The relation between physiological signals and heat stress has, to a certain extent, already been investigated and detected. Specifically, the influence of heat stress on certain physiological signals has been examined. For example, heat stress appears to affect an individual’s heart rate variability and heart rate, and results indicate that heart rate increases under heat stress (Tatterson, Hahn, Martini, & Febbraio, 2000; Jagomägi et al., 2013). Also, Madaniyazi et al. (2016) detected a V-shape for heart rate and blood pressure in which the heart rate and blood pressure increase when the temperature is above 28 degrees Celsius. Lastly, the results of Kaiser, Nimbarte, Davari, Gopalakrishnan, and He (2017) indicate that skin conductance is higher with a higher temperature and that temperature has an influence on skin conductance.

Overall, heat stress can be dangerous for people, hence it is important to research when people might experience heat stress and how heat stress affects an individual. Physiological signals are specific to an individual and could possibly be used to assess heat stress and serve as indicators of it. The results of the studies mentioned in the previous paragraph indicate that there is a relation between heat stress and physiological signals, and that heat and heat stress have an effect on physiological signals. Therefore, it would be interesting to investigate whether physiological signals can be an indicator of heat stress, that is, whether physiological signals can predict heat stress. If so, heat stress could potentially be assessed not only based on climate parameters and metabolic rate, but also based on physiological signals. This would make it possible to detect heat stress on an individual level in real time, which could help alert individuals about their current state and give advice on how to proceed to mitigate heat stress.

Some research has already been successful in predicting thermal comfort with physiological signals. For example, Chaudhuri et al. (2018) used a Random Forest classifier to predict the thermal comfort of an individual with physiological signal data (e.g. skin temperature, skin conductance) collected with a wearable device. However, they focused on a cold environment and cold discomfort. They mention that for future research a parallel analysis could be performed with the focus on hot discomfort. Burzo et al. (2014) tried to automatically detect comfort levels of participants based on physiological signals and climate parameters, and Abouelenien, Burzo, Mihalcea, Rusinek, and Van Alstine (2017) extracted features from physiological signals similar to those used in this research and were successful in detecting different thermal comfort levels with machine learning. Lastly, Liu (2018) conducted an experiment in which physiological signals of participants were collected for two weeks with wearable sensors. The author combined the physiological signals and climate parameters to successfully predict the thermal preference of participants.

In general, predicting thermal comfort with machine learning has proven to be effective (Chaudhuri et al., 2018; Abouelenien et al., 2017; Burzo et al., 2014; Liu, 2018). Thus, machine learning could be a valid method to predict heat stress. Furthermore, physiological signals have effectively been used to predict thermal discomfort of participants, which implies that physiological signals could be an indicator of heat stress. However, experiments were often conducted in an indoor environment without constant fluctuations of the climate parameters and with limited physical activity, and for most studies participants had to rate their thermal experience and did not have to complete outdoor tasks (Chaudhuri et al., 2018; Burzo et al., 2014; Abouelenien et al., 2017). Also, climate parameters were often included as predictors and, to the best of the author's knowledge, sun exposure as an assessment of heat stress has not been predicted from physiological signals in a dynamic outdoor environment.

Therefore, this research will investigate if sun exposure and its heat intensity can be predicted with machine learning, based on continuously measured physiological signals that were collected from participants walking in a hot outdoor dynamic environment. Sun exposure appears to be an important factor which affects heat stress and has an influence on the heat exposure of an individual. Consequently, heat stress could perhaps be assessed through sun exposure and the intensity of the exposure. Therefore, it is specifically interesting to investigate if physiological signals can be indicators of heat exposure, with a focus on the exposure to sun.

For this thesis heat stress will be examined indirectly by investigating the relation between physiological signals and exposure to heat. Here, heat exposure will be assessed through sun exposure and its intensity. Consequently, if physiological signals prove to be capable of predicting heat exposure, the signals could be indicators of heat stress in participants. As a result, heat stress can, to a certain extent, be assessed on an individual level through physiological signals. This leads to the research question: Can continuously measured physiological signals of humans predict heat exposure in a dynamic outdoor thermal environment and consequently be indicators of heat stress?

To clarify the research question, heat stress is examined through how participants respond to conducting tasks in a dynamic hot urban environment, specifically whether and how the participants respond to heat exposure on a physiological level. If the participants respond physiologically, it is assumed that they experienced heat stress. However, by predicting heat exposure with physiological signals, heat stress will only be identified to a certain extent. To assess heat stress more comprehensively, the survey responses of participants will be included in this research. The survey responses are used to examine the subjective thermal comfort that a participant experienced; they can be compared with heat exposure and serve as a verification that increased exposure to heat leads to more thermal discomfort.

The hypothesis of this research is that physiological signals can be indicators of sun exposure and its heat intensity and, consequently, of heat stress. It is also expected that participants who were exposed to sunnier conditions expressed more thermal discomfort during the experiment. This would confirm the relation between sun exposure, heat and heat stress.


Chapter 3 Data description and analysis

The data for this thesis was provided by Valentin Melnikov. This chapter describes the data that was included in this research, explains which preprocessing steps were performed, and visualizes the data to give a better understanding of it.

The data that was provided for this thesis originates from a larger research project, called ‘Cooler Calmer Singapore’ (V. R. Melnikov et al., 2020). The aim of this project is to examine thermal comfort in an urban environment and to research options for reducing thermal discomfort in Singapore. The provided data is intended for assessing the behavioral choices of participants in a hot urban environment. Specifically, path adaption and speed adaption for mitigating thermal discomfort and heat stress are part of the overall research. In previous research, speed adaption and outdoor thermoregulation were investigated (V. R. Melnikov et al., 2020; V. Melnikov et al., 2018; V. Melnikov, Krzhizhanovskaya, & Sloot, 2017). The current data gives the opportunity to specifically include physiological signals in the assessment of thermal comfort in outdoor environments.

The data was collected during an experiment in Singapore and provides insight into the behavioural choices of participants. During the experiment participants had to complete outdoor decision tasks in a hot urban environment. For every task the participant had to choose between two different paths leading to the next task. The participant could either take the shorter and often sunnier path, or the longer and often more shaded path. Thereafter, the participant had to walk the selected path to the start of the next decision task. During the entire experiment climate data and physiological data were collected. Participants also had to complete survey questions before the start and at the end of the experiment. The survey questions included socio-demographic questions, such as age and height, and questions about how the participants experienced the climate during the experiment and whether they were comfortable with the environmental conditions.

Data of 70 participants was provided. One participant was excluded, since both the physiological signals and the climate data of that participant were incomplete.

3.1 Wearable devices

During the experiment the Empatica E4 wristband (Empatica, n.d.) was used as a wearable device to collect the physiological signals of participants in real time. A wearable wristband gives real-time insight into a participant's physiological state. However, it should be noted that the data could contain noise and might not be accurate for every timestamp, because the wristband was used during physical activity and not in a laboratory setting. Nevertheless, wearable devices for physiological signals have been used successfully in other studies. For example, Chaudhuri et al. (2018) and Liu (2018) were successful in predicting thermal comfort with physiological signals collected with a wearable device, and physiological data collected by wearable devices has been used for detecting physiological stress in a real-life environment (Kyriakou et al., 2019; Can, Chalabianloo, Ekiz, & Ersoy, 2019).

Thus, although the data collected by the Empatica E4 wristband might contain some noise and inaccuracies, the physiological signal data is expected to be of sufficient quality for predicting heat exposure.

3.2 Stages data

For every participant a stages data file was provided. This data file includes the different stages of the experiment, as well as the walking duration, the decision duration and the start and end timestamps of each stage in seconds. The different stages are explained below.

• Story

The first stage of the experiment was the story. During this stage participants were sitting in a shaded environment and were asked to read a story to relax. The story stage provides a baseline where the physiological signals of participants are measured without physical activity or external stimuli of the sun.

• Instructions

The second stage was the instructions stage, where participants were given the instructions for the decision tasks and were also sitting in shaded conditions with minimal activity.

• Task 0

During this task participants walked around the area.

• Task 1 to 13

Tasks 1 to 13 are the decision tasks. At the start of each task participants had to decide which path to take and then walk the chosen path towards the decision point for the next task. During the 13 tasks participants were exposed to different environmental conditions. The walking time and decision time for the tasks differ per participant, which is illustrated in figure 3.1.

• Task 14

This is the end of the experiment, where the wearable sensor is removed. Data of task 14 is not included in this research.

Figure 3.1 illustrates the average time in seconds per stage over the 69 participants. The figure shows the time differences between stages and between participants. However, these time differences are not only caused by individual differences between participants. Two task booklets were used to conduct the experiment. The tasks differed between the two booklets, which resulted in different task lengths between participants.

3.3 Events data

The events data file includes the environmental conditions that a participant was exposed to during the experiment as well as the start and end timestamps of a stage. The start of each environmental condition is given in a timestamp column in seconds. Participants were exposed to multiple environmental conditions, which are explained in table A.1 in appendix A.

Figure 3.1: Time in seconds per stage

Figure 3.2 visualizes the different event labels per participant and provides an overview of the environmental conditions that a participant was exposed to during the entire experiment. All the event labels in the events data were used until the end of task 13. The data illustrates that some participants were not exposed to the sun during the experiment, whereas others were exposed to both sunny and shaded environmental conditions.
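As an illustration of how the events data can be turned into exposure durations, the sketch below sums the time spent under each event label. It is not the code used for this thesis: the function name, the `(start_time, label)` layout and the labels `shade` and `sun` are hypothetical (the actual labels are listed in table A.1), and it assumes that a condition lasts until the next event starts.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def seconds_per_condition(events: List[Tuple[float, str]],
                          end_time: float) -> Dict[str, float]:
    """Sum the time in seconds spent under each event label.

    `events` holds (start_time_s, label) pairs. Assumption: a condition
    lasts until the next event starts, and the last one until `end_time`.
    """
    ordered = sorted(events)
    totals: Dict[str, float] = defaultdict(float)
    # Pair every event with the start of the following one.
    following = ordered[1:] + [(end_time, "")]
    for (start, label), (next_start, _) in zip(ordered, following):
        totals[label] += next_start - start
    return dict(totals)


# Hypothetical labels and timestamps, for illustration only.
events = [(0.0, "shade"), (60.0, "sun"), (150.0, "shade"), (200.0, "sun")]
totals = seconds_per_condition(events, end_time=260.0)
print(totals)  # {'shade': 110.0, 'sun': 150.0}
```

Dividing the time under a sunny label by the total time would give a fraction of sun exposure per participant under the same assumptions.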


3.4 Climate data

The climate data file includes different climate parameters collected during the experiments as well as thermal indices. The climate parameters and thermal indices were recorded in a shaded and sunny environment. From the climate file only the Wet Bulb Globe Temperature (WBGT) in both the shaded and sunny environment was included for this research.

The WBGT is a heat stress index and combines all the collected climate parameters during the experiment, namely wind speed, relative humidity, air temperature and globe temperature. The WBGT index is the most used heat stress index around the world and the formula for the WBGT index in outdoor conditions is shown in equation 3.1 (Epstein & Moran, 2006).

$$\mathrm{WBGT} = 0.7\,T_w + 0.1\,T_a + 0.2\,T_g \tag{3.1}$$

To clarify, $T_a$ is the air temperature, $T_g$ is the globe temperature and $T_w$ is the wet bulb temperature.
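Equation 3.1 can be written as a small function. This is a minimal sketch: the function and parameter names are chosen here for illustration, and the input values in the example are made up rather than taken from the experiment.

```python
def wbgt_outdoor(t_wet_bulb: float, t_air: float, t_globe: float) -> float:
    """Outdoor WBGT in °C according to equation 3.1.

    All three inputs are temperatures in °C.
    """
    return 0.7 * t_wet_bulb + 0.1 * t_air + 0.2 * t_globe


# Illustrative inputs only (not measurements from the experiment).
example = wbgt_outdoor(t_wet_bulb=25.0, t_air=30.0, t_globe=40.0)
print(round(example, 1))  # 28.5
```

The weights show that the wet bulb temperature dominates the index, while the globe temperature, which captures solar radiation, contributes a fifth of the value.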

Furthermore, the WBGT in the sun is significantly higher (p < 0.001) than the WBGT in the shade; the mean and standard deviation of both are shown in table 3.1. The large difference between the WBGT in the sun and in the shade indicates that sun exposure and solar radiation could be important factors in assessing the heat that a participant was exposed to and whether he or she experienced heat stress.

WBGT index    Mean (°C)   STD (°C)
WBGT_sun      31.1        2.38
WBGT_shade    27.5        1.00

Table 3.1: The mean and standard deviation of the WBGT values

Lastly, there were some participants with missing climate data. The missing values were replaced with the previous value before the missing value, or in case of missing values in the beginning, with the first available value after the missing value.
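The replacement rule described above corresponds to a forward fill followed by a backward fill for leading gaps. The sketch below shows this in plain Python; it is an illustration under the assumption that missing values are represented as `None`, not the preprocessing code of this thesis, and the numbers are made up.

```python
from typing import List, Optional


def fill_missing(values: List[Optional[float]]) -> List[Optional[float]]:
    """Forward-fill gaps, then back-fill any gap at the start."""
    filled = list(values)
    # Forward fill: carry the last observed value into each gap.
    last = None
    for i, value in enumerate(filled):
        if value is None:
            filled[i] = last
        else:
            last = value
    # Back fill: only leading gaps remain, they take the first available value.
    first = next((v for v in filled if v is not None), None)
    return [first if v is None else v for v in filled]


# Made-up WBGT readings with gaps at the start, in the middle and at the end.
print(fill_missing([None, None, 27.4, None, 27.9, None]))
# [27.4, 27.4, 27.4, 27.4, 27.9, 27.9]
```

With pandas, the same effect is usually obtained by calling `Series.ffill()` followed by `Series.bfill()`.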

3.5 Choices data

The choices data file includes the length of a path in meters, the walking duration and decision duration for a task in seconds, and which path a participant chose. For this thesis the choices data was only used to calculate the average walking speed of a participant per task in m/s.

The average walking speed per participant is plotted in figure 3.3. The figure illustrates that the average walking speed per participant differs.
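A minimal sketch of the walking speed calculation, assuming that the speed per task is the path length divided by the walking duration of that task. The function name and the example values are hypothetical.

```python
def average_walking_speed(path_length_m: float, walking_duration_s: float) -> float:
    """Average walking speed in m/s for one task."""
    if walking_duration_s <= 0:
        raise ValueError("walking duration must be positive")
    return path_length_m / walking_duration_s


# Hypothetical task: a 150 m path walked in 120 s.
print(average_walking_speed(150.0, 120.0))  # 1.25
```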

3.6 Physiological signals

For this research four physiological signals were analyzed and included. Those four signals are explained below.

3.6.1 Skin temperature

Skin temperature is important for maintaining a healthy heat balance of the human body and for the thermoregulation process. An increase in skin temperature results in more heat loss through sweating, whereas a decrease in skin temperature makes the human body generate more heat by shivering (Charkoudian, 2003). Besides sweating and shivering, vasodilatation and vasoconstriction influence skin temperature by increasing or decreasing the amount of blood that flows from the interior of the body to the skin (Arens & Zhang, 2006). Consequently, the skin is an integral part of thermoregulation and responds to heat exposure. The importance of the skin in the thermoregulation process, together with the results of Blazejczyk, Nilsson, and Holmér (1993), which indicate that higher solar radiation increases the skin temperature, makes skin temperature interesting to include in the prediction of heat exposure and to investigate as an indicator of heat stress.

Figure 3.3: Average walking speed per participant

The skin temperature (TEMP) in the physiological signal data is in °C and was sampled at a rate of 4 Hz. A sampling rate of 4 Hz means that 4 samples are collected per second.

3.6.2 Blood volume pulse

The blood volume pulse (BVP) was sampled at a rate of 64 Hz. The BVP data was collected using a photoplethysmography (PPG) sensor (Empatica, January 23, 2020), which can detect changes in the blood volume (Kamal, Harness, Irving, & Mearns, 1989). Features derived from the data of a PPG signal have been successful in detecting heat stress (Elgendi et al., 2015). Thus, the BVP could be interesting to include for predicting heat exposure. However, the BVP signal data is not completely reliable because of walking motions during the experiment. As explained on the Empatica E4 website, movements can result in corrupted BVP data, which can lead to wrong peaks (Empatica, January 24, 2020d). This is illustrated in the inter-beat interval (IBI) data, which the Empatica wristband calculates from the BVP data (Empatica, January 24, 2020a). Wrong peaks in the BVP data will lead to missing values in the IBI data, since no two consecutive beats could be detected (Empatica, January 24, 2020d). This was true for the current data, since participants completed walking tasks. As an example, figure 3.4 shows the IBI data for one participant. Long straight lines can be identified in the plot; these lines represent the missing values. Therefore, the IBI data was not included in this thesis, and it should be noted that the BVP data might not be reliable.

3.6.3 Heart rate

The heart rate (HR) was sampled at a rate of 1 Hz. Furthermore, the HR was not directly measured by the Empatica E4 wristband; just like the IBI data, it was computed from the BVP signal (Empatica, January 24, 2020a). The HR data does not have missing values; however, it is not completely reliable either (Empatica, January 24, 2020b). Still, because the HR does not have missing values and research indicates that the HR increases under heat stress (Jagomägi et al., 2013; Madaniyazi et al., 2016), the HR data is included for this research.

Figure 3.4: The IBI data for one participant

3.6.4 Electrodermal activity

As mentioned, sweating is a process in thermoregulation. When the human body needs to lose heat, eccrine sweat glands are activated and the human body starts to sweat (Arens & Zhang, 2006). Stimulation of the sweat glands results in changing electrodermal activity (EDA) values (Dawson, Schell, & Filion, 2017), and EDA is often used in research to assess the emotional state of an individual. For example, de Santos Sierra, Ávila, Casanova, and del Pozo (2011) were successful in detecting stress in individuals by assessing the HR and EDA of participants in a laboratory environment, and Can et al. (2019) used HR, EDA and accelerometer signals to detect stress during a programming contest.

EDA is measured by assessing the current flow between two electrodes on the peripheral skin (Dawson et al., 2017), and consists of two components, namely the tonic skin conductance level (SCL) and the phasic skin conductance response (SCR) (Empatica, January 24, 2020c). The SCL responds more gradually, whereas the SCR responds rapidly after exposure to stimuli. Figure 3.5 illustrates the SCL compared to the EDA signal. The figure demonstrates that the SCL changes gradually, whereas the zoomed-in view in figure 3.6 shows that the EDA signal consists of many small waves. These waves are the SCR peaks, which increase in response to stimuli.

Figure 3.5: The SCL signal

Figure 3.6: The SCR waves

a stimulus. The latency is the time in seconds from the onset of the stimulus to the onset of the waveform. The amplitude, in microSiemens, is measured from the onset to the peak of the wave and reflects the increase in skin conductance. The rise time is the time in seconds from the onset of the waveform to the peak, and the half recovery time is the time in seconds from the peak to the point of 50% of the amplitude (Dawson et al., 2017).

Figure 3.7: SCR waveform

These measures can indicate the psychological state of an individual. However, the EDA signal was measured during physical activity with a wristband and outside a laboratory environment, without planned stimuli. And, as mentioned by Dawson et al. (2017), it could be possible that sweat glands at the wrist are activated for thermoregulation and not for psychological arousal.

Therefore, for this thesis, changes in the SCL and SCR do not refer to the psychological state of a participant but to the thermoregulation process. Sweat evaporation is a critical and effective process for the body to lose heat in hot environments (Havenith, 2005; Lundgren et al., 2013). Therefore, changes in the EDA signal can indicate that a participant is exposed to more heat and needs to lose heat to maintain a healthy heat balance. The expectation for this research is that the EDA will then change, because the eccrine sweat glands are stimulated. Consequently, the expectation is that, with increased heat exposure, the amplitude and frequency of the peaks of the SCR will increase, which can indicate that a participant is experiencing heat stress and thermal discomfort.

The electrodermal activity (EDA) was sampled at a frequency of 4 Hz. The decomposition of the EDA signal into its SCL and SCR components was part of the provided data set. Two data files per participant were provided, together with some code on how to read the files. One data file contained the decomposition into the SCL component. The other data file identified the peaks of the SCR waves and included the impulse onset, impulse peak time, impulse amplitude, onset of the peak, amplitude of the peak and the time of the peak. The onsets and peak times are in seconds and the amplitudes in microSiemens. For the decomposition of the EDA signal into the SCL and SCR components the Ledalab software for Matlab was used (Ledalab, n.d.).

Chapter 4 Method

The first few steps in the method were analysing, preprocessing and visualising the data, which was mostly explained in the previous chapter. However, a preprocessing step not yet mentioned was the alignment of the physiological signals with their corresponding sampling frequencies. It was decided to include all the physiological signal samples, to preserve the most data for feature extraction. Therefore, the sampling frequency of every signal was preserved, and a timestamp column was created for every signal with the correct timestamp, in seconds, per sample.
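
The timestamp column can be derived directly from the sampling rate of a signal. The sketch below assumes that the first sample of a recording is taken at 0 seconds; the function name and example values are hypothetical.

```python
import numpy as np
import pandas as pd

# Sampling rates in Hz as reported in chapter 3.
SAMPLING_RATES = {"TEMP": 4, "EDA": 4, "BVP": 64, "HR": 1}


def add_timestamps(values, signal_name):
    """Return a dataframe with one row per sample and its time in seconds."""
    rate = SAMPLING_RATES[signal_name]
    # Assumption: sample i is recorded at i / rate seconds after the start.
    timestamps = np.arange(len(values)) / rate
    return pd.DataFrame({"timestamp": timestamps, signal_name: values})


# Eight hypothetical skin temperature samples cover two seconds at 4 Hz.
temp = add_timestamps([33.1, 33.1, 33.2, 33.2, 33.3, 33.3, 33.4, 33.4], "TEMP")
```

Keeping every signal at its own rate means that a window of the same duration contains a different number of samples per signal, for example 720 EDA samples and 11520 BVP samples in 180 seconds.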

Next, to answer the research question, a supervised machine learning approach was selected, where a target feature was calculated from the available data. Features were extracted from the physiological signals and used to predict the target feature. Supervised learning specifically was chosen because the intention of this research is to try and predict if a participant is exposed to heat, which can be formulated as a binary classification problem where a threshold is selected for low and high heat exposure. Therefore, the main approach was to create a binary classification algorithm, which predicts a target feature based on features extracted from the physiological signals. If the features extracted from the physiological signals are successful in predicting the target feature, it implies that there is a relation between physiological signals, sun exposure and heat, and that physiological signals could be indicators of heat exposure and, consequently, of heat stress. In general, the method for this thesis is to test if physiological signals can predict heat exposure, by extracting features from the signals and using those features to try and predict the target feature.

As a note, all the code was written in Python 3.6 in a Jupyter notebook environment (Jupyter, n.d.), and Pandas (McKinney et al., 2010), Matplotlib (Hunter, 2007) and Bokeh (Bokeh Development Team, 2020) were used for creating visualizations.

4.1 Approach

The initial approach was to focus mainly on the sun exposure: to calculate, per task, the percentage of time that a participant was exposed to the sun, and to predict whether a participant was exposed to a higher or lower percentage of sun. The Percentage_sun target feature ($P_s$) was calculated with equation 4.1, where $T_s$ is the time in seconds that a participant was in the sun during the task and $T_t$ is the duration in seconds of that particular task.

$$P_s = \frac{T_s}{T_t} \cdot 100 \qquad (4.1)$$

To illustrate the Percentage_sun values, figure C.1 in appendix C depicts the distribution of the percentage of sun for all the participants. The distribution illustrates that there are many 0% sun exposure samples. A disadvantage of this approach is illustrated in figure 3.1, namely, the duration of the tasks differs per participant and per task. This resulted in unequally sized physiological samples, as well as a short duration for some samples.
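
Equation 4.1 can be written as a small function. The sketch below is illustrative only; the function name and the example task are assumptions.

```python
def percentage_sun(seconds_in_sun, task_duration):
    """Equation 4.1: share of the task duration spent in the sun, in percent."""
    if task_duration <= 0:
        raise ValueError("task_duration must be positive")
    return seconds_in_sun / task_duration * 100


# Hypothetical task: 45 s in the sun during a 120 s task.
p_s = percentage_sun(45, 120)
```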

A solution was to create a Pandas dataframe where the experiment data of participants was divided into window segments without overlap, which resulted in a large dataframe with multiple windows of different time ranges per participant. The physiological signals were then appended to the dataframe rows with a matching time range and participant. An example of a dataframe with the physiological signals and timestamps is illustrated in figure A.1 in appendix A. To still focus on the tasks, the window segments ran from the beginning of task 1 until the end of task 13. The disadvantage of this approach is that decision stages are included in the dataframe, which could influence the results, since physiological signals might respond to the change in physical activity rather than the change in sun exposure. Consequently, it might become difficult to predict the percentage of sun during a window. As a solution, the decision time was included as a predictor feature for the prediction of the percentage of sun.
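
A minimal sketch of such a segmentation is given below. It assumes fixed-length windows counted from the first timestamp of each participant; the column names and the data are hypothetical, and the original dataframe layout (figure A.1) may differ.

```python
import pandas as pd


def assign_windows(df, window_length):
    """Label every sample with the index of its non-overlapping window."""
    df = df.copy()
    start = df.groupby("participant")["timestamp"].transform("min")
    df["window"] = ((df["timestamp"] - start) // window_length).astype(int)
    return df


# Two hypothetical participants with one HR sample per second.
signal = pd.DataFrame({
    "participant": [1] * 400 + [2] * 200,
    "timestamp": list(range(400)) + list(range(200)),
    "HR": [80.0] * 600,
})
windowed = assign_windows(signal, window_length=180)

# One row per participant and window, ready for feature extraction.
hr_per_window = windowed.groupby(["participant", "window"])["HR"].mean()
```

Because every sample falls into exactly one window, the resulting windows do not overlap in time. The last window of a participant can be shorter than the others.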

However, while conducting the research it was realized that the intensity of the sun exposure should also be included, and not only if a participant was in the sun or in the shade. Therefore, it was decided to not use the percentage of sun as target feature anymore. A new target feature was created together with the daily supervisor, Valentin Melnikov. This target feature is called the Heat_score and is more comprehensive.

The Heat_score was designed so that it accounts for the walking speed, the climate parameters, the sun exposure and the accumulated heat experienced previously in the experiment. The sun exposure was included in the Heat_score because the goal of this thesis is to predict heat exposure with sun exposure and its heat intensity; therefore, sun exposure should be represented. The climate parameters were included to account for the intensity of the sun exposure. Namely, it can be expected that the combination of a high temperature and being exposed to the sun would add more heat in participants than the combination of a low temperature and being exposed to the sun. Also, as mentioned in chapter 2, the climate parameters can affect the body heat balance and consequently cause heat stress, which could be reflected in the physiological signals. Thus, it makes sense to include the climate parameters in the target feature.

Moreover, the walking speed was included, since walking is a physical activity that increases the metabolic heat production which, consequently, influences the heat balance of a human body (V. Melnikov et al., 2018). As a result, the human body needs to restore balance by heat loss, which subsequently could be represented in the physiological signals. Therefore, including the walking speed might lead to improved results. Additionally, figure 3.3 illustrates that the walking speed per participant differs. These different walking speeds could influence the heat balance and should therefore be included. Furthermore, the variation between walking speeds in figure 3.3 might be a consequence of the environmental conditions that a participant was exposed to. Namely, in previous research, V. Melnikov et al. (2017) simulated speed adaptation results for different environmental conditions and their results indicate that walking speed increases during heat conditions, which consequently resulted in less heat stress. It is possible that participants in the current data also adapted their speed to mitigate heat stress during the more sunny and hot environmental stages. However, in follow-up research V. R. Melnikov et al. (2020) did not find a significant difference in walking speed between participants for different air temperatures. Consequently, the influence of heat on walking speed is undecided.

Thermal experience prior to the moment for which the score is calculated was included in the Heat_score calculations, since it might be important to account for heat experienced previously in the experiment. Namely, it could be expected that participants who have experienced more heat already have different physiological signal values than participants that have not experienced any heat or engaged in any physical activity. Therefore, the Heat_score should represent the heat and activity already experienced before the current window segment over which the Heat_score is calculated.

Additionally, for the Heat_score a different approach was selected regarding which data to include. Namely, the entire experiment was included and not just tasks 1 to 13. The motivation for this approach is that the story and instructions stages are a good baseline, in which the participants were not yet exposed to the sun and did not yet have to perform physical activity. Therefore, including the entire experiment could allow for more distinct physiological data between being in the sun or shade. Additionally, it would add more samples for the classification task. For the window segments a window length of 180 seconds was selected, because it creates windows that are not too long, which would result in fewer samples for classification, and not too short, which would result in less physiological data per window. Besides 180 seconds, windows of length 120 and 240 seconds were also tested to find the optimal duration. In addition, a window length of 180 seconds starting at task 1 was tested to validate the decision to include the entire experiment.

In general, the main approach for this thesis is to create the Heat_score target feature and calculate the Heat_score per window segment for windows with a length of 180 seconds. Consequently, with the Heat_score approach the more comprehensive heat exposure is assessed, instead of only the sun exposure. Additionally, for comparison, the results of the initial Percentage_sun approach are also included in the results chapter. However, these results are less emphasized; the main focus is on the Heat_score.

4.1.1 Target feature - Heat score

Thus, the main target feature for this research is the Heat_score, which consists of two components: the heat index and the window decay. The heat index is comprised of the sun exposure, walking speed and climate parameters, and the window decay accounts for the accumulated heat experienced previously in the experiment. First the heat index will be explained, next the window decay and lastly how the two components are combined into one target feature, the Heat_score.

Heat index

The heat index was calculated per second per participant, resulting in a dataframe including all the participants, where the rows represent the seconds of the experiment.

Climate parameters

The climate parameters are included in the heat index through one component, the WBGT temperature. This temperature was chosen specifically because it includes all the measured climate parameters, including the climate parameters that can cause heat stress. The WBGT temperature was measured in both the sun and the shade. For the heat index the average of the sun and shade temperatures was calculated per second for every participant, resulting in a WBGT_average feature. This feature was normalized by first identifying the highest WBGT_average over all the participants, which resulted in one value, the WBGT_max_average. Thereafter, for every timestamp the WBGT_average was divided by the WBGT_max_average, which resulted in a normalized WBGT temperature per second per participant, the WBGT_norm_average.

Sun exposure

Per second, per participant it was checked if the participant was in the sun or in the shade. The event labels CN and SN were recognized as sun exposure and the event labels NN, SB, SG, CB and CG were recognized as in the shade. It was decided to identify CN also as sun exposure, because during that event the participant was exposed to the sun, even though the sun was occluded. To validate this decision, only identifying the SN label as sun exposure was also tested. The value 1 was assigned if the participant was in the sun, 0 if he or she was in the shade.

Walking speed

The walking speed of a participant was calculated as explained in section 3.5, and assigned per second, per participant. This resulted in similar walking speeds for timestamps during the same tasks. Consequently, the heat index consisted of 13 different walking speeds per participant. For the timestamps that a participant was not walking, namely, during the decision stages, the story and the instructions stage, a 0 was added as walking speed. The walking speed was normalized just as the WBGT temperature, resulting in a Walking_speed_norm feature.

Subsequently, with the three components of the heat index calculated, the final heat index feature, named Heat_index_final, was calculated per timestamp with equation 4.2.

$$HI = W_c \cdot C_n + W_s \cdot E + W_w \cdot S_n \qquad (4.2)$$

To explain equation 4.2: $W_c$, $W_s$ and $W_w$ are the weights of the three components. $C_n$ is the WBGT_norm_average. $E$ is the environmental condition, meaning whether the participant is in the sun or not. And $S_n$ is the Walking_speed_norm. The weights were chosen intuitively and were initially specified as $W_c = 0.7$, $W_s = 0.2$ and $W_w = 0.1$. These weights were selected to prevent the heat index from being too sensitive to the large changes in the walking speed and sun exposure, since the values of those components change significantly between timestamps. The WBGT parameter changes more gradually during the experiment, therefore it has a higher weight. However, to assess the influence of the three components, different weight combinations were tested with an increased weight for $W_s$ and $W_w$. The different weight combinations are depicted in table B.1 in appendix B.
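
The calculation of the heat index can be sketched as follows. The column names event and walking_speed are hypothetical, as is the example data. The normalization divides by the maximum of the dataframe that is passed in, so the dataframe is assumed to contain the seconds of all participants.

```python
import pandas as pd

SUN_LABELS = {"CN", "SN"}  # event labels counted as sun exposure in the text


def heat_index(df, w_c=0.7, w_s=0.2, w_w=0.1):
    """Equation 4.2 per timestamp; df needs WBGT_average, event and walking_speed."""
    c_n = df["WBGT_average"] / df["WBGT_average"].max()    # WBGT_norm_average
    e = df["event"].isin(SUN_LABELS).astype(int)           # 1 in the sun, 0 in the shade
    s_n = df["walking_speed"] / df["walking_speed"].max()  # Walking_speed_norm
    return w_c * c_n + w_s * e + w_w * s_n


# Three hypothetical seconds: standing in shade, walking in sun, walking in shade.
seconds = pd.DataFrame({
    "WBGT_average": [28.0, 30.0, 29.0],
    "event": ["NN", "SN", "SB"],
    "walking_speed": [0.0, 1.5, 1.2],
})
seconds["Heat_index_final"] = heat_index(seconds)
```

Since the three weights sum to 1 and every component lies between 0 and 1, the heat index itself also lies between 0 and 1. Other weight combinations can be tested by passing different values for w_c, w_s and w_w.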

Figure A.2 in appendix A is a screenshot of the heat index dataframe as an illustration of the heat index for one participant over ten seconds. Also, figures 4.1 and 4.2 illustrate the heat index values of one participant during the entire experiment. The figures demonstrate the difference in heat index values between walking and standing and between in the sun and in the shade.

Figure 4.1: Heat index walking vs standing

Figure 4.2: Heat index sun vs shade

Window decay

For every window in a dataframe the heat index values of the timestamps before the start of the current window were discounted, which is referred to as the window decay. These values are discounted, since the accumulated heat that a participant experienced before the start of the current window is important to include. However, heat experienced during the current window is more important, since that heat will have more influence on the physiological signals during the current window. Therefore, no discounting was applied for heat index values during the current window.

To apply discounting, a linear window was selected with weights that decrease from 1 to 0.5. The length of the window decay is 900 seconds, not including the length of the current window. For every window segment the time-discounted heat index values were calculated per second by multiplying the heat index values with the window weights. The time-discounted heat index values are eventually integrated into the heat score of the current window.

Figures 4.3 and 4.4 demonstrate the calculation. Figure 4.3 depicts the decaying weights window in red. The decaying weights window is a linear window of length 900 + 180. 180 is the length of the current window segment, which ranges from timestamp 1260-1440. The figure illustrates that no discounting was applied for the current window, since the discounting starts at timestamp 1260. Figure 4.4 is calculated by multiplying the decaying weights line and the heat index values of figure 4.3. The final Heat_score value for the window segment of timestamp 1260-1440, is the sum of the time-discounted heat index values in figure 4.4.
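
The calculation can be sketched as below. The sketch assumes that the weight is 1 at the start of the current window and decreases linearly to 0.5 for the second that lies 900 seconds earlier, and that the heat index is stored as one value per second from the start of the experiment. The function name and the example input are hypothetical.

```python
import numpy as np


def heat_score(heat_index, window_start, window_length=180, decay_length=900):
    """Sum of the time-discounted heat index values for one window segment."""
    heat_index = np.asarray(heat_index, dtype=float)
    window_end = window_start + window_length
    decay_start = max(0, window_start - decay_length)

    # Assumption: weights rise linearly from 0.5 (decay_length seconds before
    # the window) towards 1.0 at the start of the window.
    full_decay = np.linspace(0.5, 1.0, decay_length, endpoint=False)
    n_history = window_start - decay_start
    decay_weights = full_decay[decay_length - n_history:]

    history = heat_index[decay_start:window_start] * decay_weights
    current = heat_index[window_start:window_end]  # weight 1, no discounting
    return history.sum() + current.sum()


# Constant hypothetical heat index of 1.0 for the first 1440 seconds,
# evaluated for the window segment of timestamps 1260-1440.
score = heat_score(np.ones(1440), window_start=1260)
```

For windows that start less than 900 seconds into the experiment, only the available history is used, with the weights closest to the window.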

Figure 4.3: Heat index and decaying weights

Figure 4.4: Time-discounted heat index values

Figure 4.5 depicts the values of the Heat_score of one participant in black, compared to the heat index values of that participant. The figure demonstrates that the Heat_score increases when the participant starts to walk and/or is exposed to the sun. Hence, the Heat_score responds well to the walking speed and sun exposure.

The above steps result in one target value, the Heat_score, per window segment per participant. A distribution of all the Heat_score values for the 180 seconds window dataframe is plotted in figure 4.6. The Heat_score values are continuous and the distribution of the Heat_score looks more evenly spread than the distribution of the Percentage_sun in figure C.1 in appendix C. Therefore, a regression model will first be applied to try and predict the continuous Heat_score values, before implementing classification.

The window decay parameter of 900 seconds was selected intuitively and other lengths were tested. Table B.2 in appendix B includes the lengths that were tested.

4.2 Features

Overall, there are two target features for this research, which have slightly different approaches. However, the predictor features for both approaches are exactly the same. Furthermore, this research is mainly interested in predictor features extracted from the EDA signal. As mentioned, the HR and BVP data might not be reliable; therefore, those signals will not be included extensively. Moreover, using EDA signal data to predict outdoor heat stress is a relatively novel approach. EDA is often used to assess the psychological state of an individual, but not to operate as an indicator of outdoor heat stress. Therefore, predicting heat stress with features extracted from the EDA signal would be a contribution to the current research.

Table C.1 in appendix C is an overview of all the features that were extracted from the physiological signal data for every window segment per participant. The HR and TEMP features are general statistics and need no further explanation. Some of the EDA features do need some more information, which will be provided in the next few paragraphs.

Figure 4.5: The Heat_score of one participant

Figure 4.6: Distribution of the Heat_score

Code was provided by Valentin Melnikov for the calculation and the plotting of the EDA_alpha and the BVP_alpha. The code is similar for both features, therefore only the EDA_alpha will be explained. The EDA_alpha is the slope of the fitted line of the power spectrum of the EDA signal, as visualized in figure 4.7. Figure 4.7 demonstrates the EDA_alpha of the story stage of one participant. In the EDA_alpha code two parameters could be altered, the Tukey window value and the detrending of the EDA signal. For both parameters the defaults of SciPy's spectrogram module (scipy.signal.spectrogram, n.d.) were selected, which are a Tukey window with a shape parameter of 0.25 and constant detrending.

Figure 4.7: EDA_alpha

To test the error of the EDA_alpha and BVP_alpha, the coefficient of determination ($R^2$) was calculated with the r_value from SciPy's linregress module. Equation 4.3 shows how the $R^2$ was calculated.

$$R^2 = r_{value}^2 \qquad (4.3)$$
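
The sketch below illustrates how a slope and its $R^2$ can be obtained with linregress. It is not the provided code: the fit on logarithmic axes, the function name and the synthetic spectrum are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import linregress


def spectral_slope(frequencies, power):
    """Fit a line to the log-log power spectrum; return its slope and R^2."""
    frequencies = np.asarray(frequencies, dtype=float)
    power = np.asarray(power, dtype=float)
    keep = (frequencies > 0) & (power > 0)  # logarithms need positive values
    fit = linregress(np.log10(frequencies[keep]), np.log10(power[keep]))
    return fit.slope, fit.rvalue ** 2  # equation 4.3


# Synthetic spectrum that follows power = frequency ** -2 exactly.
freqs = np.linspace(0.01, 2.0, 200)
alpha, r_squared = spectral_slope(freqs, freqs ** -2.0)
```

For this noise-free synthetic spectrum the fitted slope is close to -2 and the $R^2$ is close to 1. For measured signals the $R^2$ indicates how well a straight line describes the spectrum.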

the window segments of the 180 seconds window dataframe. Both the figures show a distribution with high $R^2$ values. Table C.2 in appendix C includes the mean and standard deviation of the distributions. From the distributions it can be concluded that the EDA_alpha and BVP_alpha explain the variance of the data relatively well.

Furthermore, the Peaks_freq itself is explained in table C.1. However, the peaks are called significant for all the features which include the SCR peaks, because there was noise in the EDA signal which first had to be removed. The noise was removed by excluding peaks with an amplitude below a specific threshold. It was decided to not specify one threshold for all the participants and all the window segments, but to specify a threshold per window segment per participant. The selected approach for removing noise is equation 4.4, where AmpAvg is the average amplitude of the peaks in a window segment. All the peaks with an amplitude below the threshold are removed. As a result, only the significant peaks are included for calculating the features in appendix C, table C.1.

$$\mathit{threshold} = 0.5 \cdot \mathit{AmpAvg} \qquad (4.4)$$

The weight of 0.5 in equation 4.4 was selected by plotting and examining the distributions of the peak amplitudes. Figures C.3 and C.4 in appendix C illustrate the distributions of the amplitudes of one window segment and one participant before and after removing the noise. Figure C.4 shows that the peaks with the lower amplitudes were removed and the number of peaks was reduced from 182 in figure C.3 to 112 in figure C.4.
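
The thresholding of equation 4.4 can be sketched as follows for the peaks of one window segment. The function name and the amplitudes are hypothetical.

```python
import numpy as np


def significant_peaks(amplitudes, weight=0.5):
    """Keep peaks whose amplitude is not below weight * mean amplitude (eq. 4.4)."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    if amplitudes.size == 0:
        return amplitudes  # a window without peaks stays empty
    threshold = weight * amplitudes.mean()
    return amplitudes[amplitudes >= threshold]


# Hypothetical SCR peak amplitudes (microSiemens) within one window segment.
peaks = significant_peaks([0.02, 0.40, 0.05, 0.60, 0.33])
```

Because the threshold is derived from the peaks of the window itself, it adapts to participants and windows with generally lower or higher amplitudes.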

Overall, the motivation for calculating EDA features that include the SCR peaks is the expectation that, with increased heat exposure, the amplitude and frequency of the peaks will increase. This increase is expected, since the skin conductance increases as a result of the activation of sweat glands. Therefore, features related to the peaks are extremely relevant. Furthermore, as mentioned in Dawson et al. (2017), high amplitudes and the frequency of peaks are sometimes grouped with a high SCL. The SCL represents the more gradual change, therefore, with extended heat exposure, the SCL could gradually increase. Thus, the slope and average value of the SCL might be informative of a participant's exposure.

Furthermore, the SCL feature Crossings_freq_scl_200 needs some more explanation. For this feature the smoothed version of the SCL component was calculated and subtracted from the regular SCL component. The SCL was smoothed by applying a moving average rolling window with a window length of 200. Window lengths of 50, 100, 300 and 500 were also tested, and 200 was eventually selected after visualizing and examining the plots for different window lengths. Figure 4.8 shows the SCL component, the smoothed SCL component and in black the subtraction of the smoothed SCL from the SCL. The feature Crossings_freq_scl_200 is the frequency of the number of zero crossings of the black line, which was calculated with BioSPPy (biosppy.signals, n.d.).
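
The idea behind this feature can be sketched with Pandas and NumPy instead of BioSPPy. The centered rolling window, the division by the window duration and the synthetic SCL signal are assumptions of this sketch and not necessarily the choices made in the original code.

```python
import numpy as np
import pandas as pd


def crossings_freq(scl, smooth_window=200, duration_s=180):
    """Zero crossings per second of the SCL minus its moving average."""
    scl = pd.Series(scl, dtype=float)
    smoothed = scl.rolling(window=smooth_window, min_periods=1, center=True).mean()
    residual = (scl - smoothed).to_numpy()
    signs = np.sign(residual)
    signs = signs[signs != 0]  # ignore samples that lie exactly on zero
    n_crossings = int(np.count_nonzero(np.diff(signs)))
    return n_crossings / duration_s


# Hypothetical SCL: a slow drift plus a faster oscillation,
# 720 samples (180 s at 4 Hz).
t = np.arange(720) / 4.0
scl = 2.0 + 0.001 * t + 0.05 * np.sin(2 * np.pi * 0.1 * t)
freq = crossings_freq(scl)
```

Subtracting the smoothed signal removes the slow drift, so the zero crossings reflect only the faster fluctuations of the SCL.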

Lastly, a correlation matrix was plotted with Seaborn (Waskom et al., 2020) to identify correlated features. Figure 4.9 illustrates the correlation matrix of features with a high negative or positive correlation. Since the Crossings_freq_scl_200 and the Crossings_diff_scl_200 are correlated, it was decided to only include the Crossings_freq_scl_200. In addition, from the STD_amp and Inter_peak_avg only the Inter_peak_avg was included, since those two features have a high correlation. And the N_peaks, Peaks_freq, Inter_peak_avg and STD_inter_peak also correlate, therefore only the Peaks_freq was included. In general, the correlations between the last few features could be expected, since they all compute the frequency of the peaks with slightly different methods.

The features in bold in table C.1 in appendix C are the eventual features that were included for the machine learning models.

Figure 4.8: Example of the composition of the Crossings_freq_scl_200 feature

Figure 4.9: Correlation matrix

4.3 Machine learning

The target feature and predictor features were added to the dataframes, which resulted in a comprehensive dataframe with multiple window segments per participant and features for every window segment. As a next step, machine learning for the prediction of the target features was performed. Scikit-learn modules were used for all the machine learning models and related tasks for this research (Pedregosa et al., 2011).

4.3.1 Preparation

As a first step, labels were assigned to the target features for the classification models. For the Percentage_sun approach, the feature was divided into two groups, depending on a selected threshold. The initial threshold value was 20%, where the samples with a Percentage_sun value lower than 20% were assigned label 0 and the samples with a higher value label 1. Multiple thresholds were tested. A consequence of this approach is that the data becomes imbalanced for some thresholds, which will be addressed later.

For the main approach the median of the Heat_score was selected as the threshold, which resulted in a balanced data set. However, simply splitting the Heat_score in two groups and assigning Low_heat and High_heat might not be the best approach, since it is not clear which Heat_score values are really low and high. Therefore, a second approach was added, where a percentage of the middle Heat_score values was removed, so that the classifiers could try to distinguish between truly Low_heat and High_heat labels. The percentage of values to remove from the Heat_score is a parameter that was tested. Initially 25% was removed. Figure C.2 in appendix C illustrates the distribution of the Heat_score after removing the middle 25% values. After the label assignment, the dataframe was shuffled and split into a training and a testing set, with a test size of 0.3. For the train and test split Scikit-learn’s train_test_split was implemented. Additionally, for the classification tasks, the parameter stratify=target feature was included in the train and test split, to ensure that the two classes are split uniformly between the training and testing data.
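
A possible implementation of the label assignment and the stratified split is sketched below. Removing the middle values with quantiles around the median is an assumption about how the removal was done, and the dataframe with a single predictor feature is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def label_heat_score(df, remove_middle=0.25):
    """Drop the middle share of Heat_score values; label the rest 0 (low) or 1 (high)."""
    lower = df["Heat_score"].quantile(0.5 - remove_middle / 2)
    upper = df["Heat_score"].quantile(0.5 + remove_middle / 2)
    median = df["Heat_score"].median()
    kept = df[(df["Heat_score"] <= lower) | (df["Heat_score"] >= upper)].copy()
    kept["label"] = (kept["Heat_score"] > median).astype(int)
    return kept


# Synthetic window dataframe with one predictor feature.
rng = np.random.default_rng(0)
windows = pd.DataFrame({
    "Peaks_freq": rng.random(200),
    "Heat_score": rng.random(200) * 500,
})
labelled = label_heat_score(windows)

X_train, X_test, y_train, y_test = train_test_split(
    labelled[["Peaks_freq"]], labelled["label"],
    test_size=0.3, shuffle=True, stratify=labelled["label"], random_state=0,
)
```

Setting remove_middle=0 reproduces the plain median split. The random_state argument is only added here to make the sketch reproducible.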

A next step in machine learning is often to scale the features with standardization. This approach was not selected for this research, since the range of different physiological features might be informative and relevant to include. This decision was tested by applying standardization for the 180 second windows dataframe. Scikit-learn's StandardScaler module was used for implementing standardization.

Lastly, every dataframe contains multiple samples per participant, which are independent from each other and do not overlap in time. The aim of this research is to examine if differences in physiological signals can predict heat exposure, and participants are not compared to each other. Therefore, samples of the same participant can be in both the training and the testing set. This approach was validated by also dividing the participants between the training and testing set, so that samples of the same participant are not in both sets. For this split Scikit-learn's GroupShuffleSplit was used with the parameters n_splits=1, train_size=0.7. Furthermore, with the GroupShuffleSplit approach, overlapping sliding windows can be tested, since the windows with time overlap will be in either the training or the testing set. Therefore, one dataframe with a sliding window length of 180 seconds and a 10 seconds window overlap was tested for the classification problem.
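
The participant-wise split can be sketched as follows. The parameters n_splits=1 and train_size=0.7 are taken from the text; the data, the number of participants and the random_state are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 10 participants with 12 window segments each.
participants = np.repeat(np.arange(10), 12)
X = np.random.default_rng(0).random((120, 3))
y = np.tile([0, 1], 60)

splitter = GroupShuffleSplit(n_splits=1, train_size=0.7, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=participants))

# Every participant ends up in exactly one of the two sets.
train_participants = set(participants[train_idx])
test_participants = set(participants[test_idx])
```

Note that train_size refers to the share of groups, so roughly 70% of the participants, not 70% of the samples, are assigned to the training set.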

4.3.2 Linear Regression

For predicting the continuous Heat_score values, Scikit-learn’s LinearRegression module was used with its default parameters. Linear Regression was chosen as the regression algorithm because it is a simple model with few parameters, and because some linearity is expected between the Heat_score and the independent predictor features.

4.3.3 Logistic Regression

Logistic Regression is a supervised classification model and was implemented with Scikit-learn’s LogisticRegression module. Logistic Regression was selected because it is a fast and simple classifier, whose results can be compared with those of the more complex Random Forest classifier. The parameter solver='liblinear' was specified, which is an appropriate solver for small datasets (Scikit-learn, n.d.). The other parameters were defined with a combination of trial and error and Scikit-learn’s RandomizedSearchCV module, which searches for the optimal parameters with cross validation. This combination resulted in the following parameters: penalty='l1', max_iter=300, C=5. Additionally, the parameter class_weight='balanced' was added when the input data was imbalanced. Thus, no oversampling or undersampling techniques were used for imbalanced datasets, which only occurred for the Percentage_sun approach.
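As an illustration, the reported Logistic Regression configuration can be written out as below. This is a sketch on synthetic data: the search space for C, the number of folds and the variable names are assumptions, as the thesis does not list the candidate values that were searched.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature dataframe and the binary heat label.
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Parameters as reported in the text; class_weight only matters for imbalanced data.
clf = LogisticRegression(
    solver="liblinear", penalty="l1", max_iter=300, C=5, class_weight="balanced"
)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Hypothetical search space: the candidate values for C are an assumption.
search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear", penalty="l1", max_iter=300),
    param_distributions={"C": [0.1, 1, 5, 10]},
    n_iter=4,
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)
best_C = search.best_params_["C"]
```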

4.3.4 Random Forest classification

The Random Forest classifier is also a supervised classification model and was implemented with Scikit-learn’s RandomForestClassifier module. Random Forest is a more complex and slower algorithm than Logistic Regression. However, Random Forest is transparent and returns the Gini importance of the input features, which is very useful for assessing which features are more important. Random Forest was also selected because previous research was successful in using it to predict thermal comfort (Chaudhuri et al., 2018; Liu, 2018).

The parameters for the Random Forest classifier were selected with a combination of trial and error and RandomizedSearchCV. The parameters that were tuned are: n_estimators, min_samples_split, min_samples_leaf, max_features and max_depth. The selected values differ between the two approaches and between the different parameter settings. The two parameters that are the same throughout are: n_estimators=100 and max_depth='auto'. Additionally, the parameter class_weight='balanced' was, similar to Logistic Regression, included when the input dataset was imbalanced.
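A minimal sketch of a Random Forest classifier with its Gini importances is given below. The data and the feature names are synthetic placeholders, and apart from n_estimators=100 and class_weight='balanced' the settings are Scikit-learn defaults rather than the tuned values of this thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: only the first feature carries information about the label.
feature_names = ["informative", "noise_1", "noise_2"]   # placeholder names
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)

forest = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
forest.fit(X, y)

# feature_importances_ holds the impurity-based (Gini) importance per feature.
importances = dict(zip(feature_names, forest.feature_importances_.tolist()))
most_important = max(importances, key=importances.get)
```

The importances are normalized so that they sum to one, which makes them comparable across features within one fitted forest.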

4.4 Evaluation

For the evaluation of the machine learning models a cross validation approach was applied by iterating 100 times. On every iteration the dataframe was shuffled and split into a training and a testing set, the machine learning model was fitted on the training set and the results were calculated on the testing set. On every iteration the results were added to a list, and at the end of the 100 iterations the mean and standard deviation of the list were returned as the final score. This method makes the results more reliable and less dependent on one particular split, since the performance is tested 100 times with different combinations of samples in the training and testing sets.
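The evaluation loop described above is a form of repeated random sub-sampling and can be sketched as follows. The function name repeated_holdout, the synthetic data and the use of a default Logistic Regression model are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def repeated_holdout(X, y, n_iterations=100, test_size=0.3):
    """Return mean and standard deviation of the accuracy over random splits."""
    scores = []
    for seed in range(n_iterations):
        # A new shuffled, stratified split on every iteration.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, stratify=y, shuffle=True, random_state=seed
        )
        model = LogisticRegression(solver="liblinear")
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return float(np.mean(scores)), float(np.std(scores))


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

mean_accuracy, std_accuracy = repeated_holdout(X, y)
```

Reporting the standard deviation next to the mean shows how much the score depends on which samples happen to end up in the testing set.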

Evaluation metrics of Scikit-learn were used to evaluate the performance of the machine learning models. The evaluation metric for the classification problems with balanced input data is the accuracy score. The accuracy score is reliable for balanced problems and returns the percentage of correctly classified labels. For imbalanced data the accuracy score together with the F1 score are the evaluation metrics. The F1 score returns the average of the F1 scores per label. With imbalanced data the F1 score is more representative of the performance of the algorithm, since it includes the performance per label. However, with the class_weight='balanced' parameter the accuracy score alone might be sufficient. The higher the accuracy and the F1 score, the better the performance of the algorithm.
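A small worked example shows why the F1 score is more informative than the accuracy for imbalanced data. The functions below compute both metrics by hand. Treating the reported F1 score as the unweighted (macro) average over the labels is an assumption based on the description above; in Scikit-learn this corresponds to accuracy_score and f1_score with average='macro'.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """F1 score of one label: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted average of the per-label F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Imbalanced toy example: 8 Low_heat and 2 High_heat samples,
# where one of the two High_heat samples is misclassified.
y_true = ["Low_heat"] * 8 + ["High_heat"] * 2
y_pred = ["Low_heat"] * 9 + ["High_heat"] * 1

acc = accuracy(y_true, y_pred)   # 9 of 10 correct: 0.9
f1 = macro_f1(y_true, y_pred)    # (16/17 + 2/3) / 2, approximately 0.80
```

The accuracy of 0.9 hides that half of the minority class is missed, whereas the per-label F1 score of High_heat (2/3) pulls the averaged F1 score down to about 0.80.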

For Linear Regression three evaluation metrics from Scikit-learn were calculated. The first metric is the R² score, which indicates how well the target feature is predicted by the predictor features. The highest possible score is 1. The second metric is the mean absolute error (MAE), which is the average of the absolute errors between the target values and the predicted values. The lower the MAE, the smaller the difference between the target and predicted values and the better the performance of the Linear Regression. The third metric is the mean squared error (MSE), which is similar to the MAE, except that it squares the errors between the target values and the predicted values and therefore penalizes large errors more heavily.
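The three regression metrics can be computed as sketched below, using LinearRegression with its default parameters. The data are synthetic and the coefficients used to generate them are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic continuous target that depends linearly on two of three features.
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)              # at most 1, higher is better
mae = mean_absolute_error(y_test, y_pred)  # average absolute error
mse = mean_squared_error(y_test, y_pred)   # average squared error
```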

Furthermore, the feature importance for the Random Forest classifier is evaluated with the Gini importance per feature. The Gini importance per feature is calculated for all 100 iterations and the average importance score is returned. For the feature importance of the Linear Regression and Logistic Regression models, the p-values were evaluated, where a p-value below 0.050 indicates that a predictor feature is significant for the prediction of the target feature. For the p-values of Linear Regression Statsmodels’ ordinary least squares regression model was used, and for Logistic Regression Statsmodels’ Logit model (Seabold & Perktold, 2010).

Moreover, the coefficients of the Linear and Logistic Regression models were plotted and analyzed for feature importance. Standardization of the features was necessary before plotting the coefficients, since the different feature ranges would otherwise lead to very high coefficients. Therefore, for both Linear Regression and Logistic Regression, standardization was applied to the main dataframes and important features were identified. Thereafter, the important features were used as input for the dataframes with different parameters. A consequence of this approach could be that the important features for one parameter might not be as informative for another parameter. Therefore, the important features were also tested for the different parameters. Overall, all the results from the machine learning tasks are calculated with only the informative features, unless otherwise mentioned.

For the presentation of the results it was decided to report two significant digits for the standard deviation and to include enough decimal places in the mean to match these two significant digits (Cole, 2015).
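This reporting convention can be expressed as a small helper. The function name format_mean_sd is hypothetical, and the sketch assumes a positive standard deviation below 100, which holds for scores such as accuracy and R².

```python
import math


def format_mean_sd(mean, sd):
    """Format sd with two significant digits and mean with the same decimals.

    Sketch only: for sd >= 100 no decimals are shown, so more than two
    significant digits would remain.
    """
    if sd <= 0:
        return f"{mean:g}", "0"
    # Decimal places needed to show two significant digits of sd.
    decimals = max(0, 1 - math.floor(math.log10(sd)))
    return f"{mean:.{decimals}f}", f"{sd:.{decimals}f}"


mean_text, sd_text = format_mean_sd(0.75321, 0.03456)   # ("0.753", "0.035")
```

For example, a mean of 12.3456 with a standard deviation of 1.234 is reported as 12.3 and 1.2.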

4.5 Two groups

Lastly, for the visualization of the survey responses a High_sun and a Low_sun group were created. For the creation of the two groups, the percentage of sun per participant was calculated from the start of the experiment until the end of task 13, which resulted in one value per participant. Subsequently, the 26 participants with the lowest Percentage_sun values were included in the Low_sun group and the 26 participants with the highest Percentage_sun values in the High_sun group. This means that the 17 participants in the middle, approximately 25% of the data, were removed. Removing precisely 25% is not possible, since that would mean that 17.25 of the 69 participants have to be removed. Table D.1 in appendix D depicts the mean, minimum, maximum and standard deviation of the two groups, which illustrate the differences in the sun exposure experienced per group. The difference in Percentage_sun between the two groups is significant, with p < 0.001.

Furthermore, for the Heat_score values a similar approach was implemented, where the average Heat_score per participant was calculated from the 180-second window dataframe and used to split the participants into two groups, with 26 participants in the High_heat group and 26 in the Low_heat group. The difference between the groups is significant (p < 0.001), and Table D.2 depicts the statistics for the High_heat and Low_heat groups.
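The construction of the two groups can be sketched as follows, here for Percentage_sun. The same steps apply to the average Heat_score per participant. The participant identifiers and values are synthetic, and the function name split_extreme_groups is hypothetical.

```python
import random


def split_extreme_groups(value_per_participant, group_size=26):
    """Return the lowest group, the dropped middle and the highest group."""
    ranked = sorted(value_per_participant, key=value_per_participant.get)
    low = ranked[:group_size]
    middle = ranked[group_size:-group_size]
    high = ranked[-group_size:]
    return low, middle, high


rng = random.Random(0)
# One synthetic Percentage_sun value for each of the 69 participants.
percentage_sun = {f"P{i:02d}": rng.uniform(0, 100) for i in range(1, 70)}

low_sun, dropped, high_sun = split_extreme_groups(percentage_sun)
share_dropped = len(dropped) / len(percentage_sun)   # 17 / 69, about 0.246
```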

Chapter 5 Results

5.1 Survey responses

Figure 5.1 depicts the survey responses of the participants for the question: At these stages the climate was [Task 1]. Participants could select from predefined responses how comfortable they found the climate during the specified task. The question was asked for all 13 tasks, hence there are 13 responses per participant. The figure indicates that the participants in the Low_sun group experienced the climate as more comfortable than the participants in the High_sun group.

Figure 5.1: The survey responses for the High_sun and Low_sun groups

Figure 5.2: The survey responses for the High_heat and Low_heat groups
