Feature Importance of Behavioural Adaptation to Heat Stress


Layout: typeset by the author using LaTeX.


Marten H. Rozema 11735651

Bachelor thesis Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam Faculty of Science

Science Park 904 1098 XH Amsterdam

Supervisors

dr. Valeria V. Krzhizhanovskaya, MSc Valentin Melnikov
Informatics Institute, Faculty of Science
University of Amsterdam
Science Park 904, 1098 XH Amsterdam

June 26th, 2020


Acknowledgements

I would like to sincerely thank my supervisors Valentin Melnikov and dr. Valeria Krzhizhanovskaya for taking the time to meet weekly and support me with feedback and insights, especially Valentin, who would often discuss my results with me late at night. I also want to thank my fellow students who did their research within the same project for their insights and support.


Abstract

Global temperatures are rising, and this heat could cause health risks and a decrease in social development and economic growth worldwide. To mitigate these risks, current building strategies and designs have to improve, which can be achieved by researching the key factors that cause humans to adapt their behaviour to heat stress. Hence, this thesis aims to answer the following question: Could adaptive path choices to heat stress be predicted with the use of socio-demographic, physiological and climate data and could the most important features influencing the shadow-adaptive path choices of people be determined? The shadow-adaptive path choices reflect the behavioural adaptation of people to heat stress.

Based on previous literature about outdoor thermal comfort and heat stress, classification algorithms were used to predict adaptive behaviour. The data is retrieved from an experiment conducted by the project Cooler Calmer Singapore, which studies outdoor thermal comfort. The data is used with a random forest classifier, principal component analysis and logistic regression to determine the important features. The most important features according to the random forest classifier, which has an accuracy of 80%, are wet bulb globe temperature, heart rate, skin temperature, electro-dermal activity peak frequency, time lived in Singapore and time elapsed during the experiment. The random forest with PCA has a 78% accuracy and suggests that insulation from clothing is an important feature as well. The logistic regression, with a 71% accuracy, indicates whether higher or lower values of a feature direct the prediction towards adaptive behaviour. Further research should use a larger data set with more variety in the socio-demographic features to determine whether their importance is actually lower than that of the physiological and climate features.

Keywords: Heat Stress, Behavioural Adaptation, Feature Importance, Outdoor Thermal Comfort


2.2.4 Logistic Regression
2.2.5 XGBoosting Classifier
3 Methods
3.1 Acquiring data
3.2 Data pre-processing
3.3 Classification
4 Results
4.1 Distribution of data
4.2 Random Forest
4.3 Decision Tree
4.4 PCA
4.5 Logistic Regression
4.6 XGBoosting Classifier
5 Discussion
5.1 Findings
5.2 Limitations
5.3 Future work
6 Conclusion


Abbreviations

ASHRAE: The American Society of Heating, Refrigerating and Air-Conditioning Engineers

BMI: Body mass index
BVP: Blood volume pulse
EDA: Electrodermal activity
HR: Heart rate
OTC: Outdoor thermal comfort
WBGT: Wet bulb globe temperature


Chapter 1 Introduction

Temperatures around the globe are rising due to global warming. This leads to many issues, such as the melting of ice caps and decreasing biodiversity, which have implications for human health. The temperatures estimated for 2050 due to global warming could cause an increase in heat stress and health risks (Sailor, 2014). Temperatures that provide thermal comfort are 23-27 degrees Celsius in the summer and 20-25 degrees Celsius in the winter. Above these thresholds, the human body has difficulty maintaining a core temperature of 37 ± 1 degrees Celsius. To maintain normal body function, the body needs to exchange heat with its environment (Epstein & Moran, 2006). Heat stress is challenging to define because different disciplines study the phenomenon. Different definitions and indices exist, without consensus on a single one (Moran, Shitzer, & Pandolf, 1998). However, most disciplines agree that heat stress occurs when the body needs to adapt on a physiological and psychological level due to heat.

To date, several medical studies have investigated the higher morbidity and mortality rates during heatwaves. Due to heat stress, people are more vulnerable during periods of increased temperatures (Loughnan, Nicholls, & Tapper, 2012). On a physiological level, heat stress can cause dehydration, heat stroke, or even death. This does not only affect people who are severely ill or old, but has implications for public health as well (Kovats & Hajat, 2008). On a psychological level, it has been reported that there is an increase in mental health problems and a decrease in key psychological performance features (Tawatsupa et al., 2012). Heat stress also has an effect on social life. The heat causes fatigue and reduced performance during sports, social activities, and work (M.-L. Chen, Chen, Yeh, Huang, & Mao, 2003). This presents a problem not only to human health, but also to social development and economic growth (Kjellstrom, Holmer, & Lemke, 2009).


To overcome difficulties such as heat, humans are capable of considerable metabolic and behavioural adaptation. This behavioural adaptation is also seen in the field of outdoor thermal comfort studies. Adaptations could include a change of clothing, a change in the amount of exertion, and avoidance of the sun.

Behavioural adaptation is not the only adaptation humans are capable of. Throughout the years, humans have needed to improve their environment to support their way of living. An improvement in the environment may make the need for behavioural adaptation obsolete. Because of this, it is important to understand the key features causing this adaptation, so that adjusted building strategies can improve climate conditions and may help to reduce the change in behaviour. Current improvements to the environment are costly and use much energy, especially in cities, where building structures keep heat trapped and air-conditioning is used for indoor cooling (Gago, Roldan, Pacheco-Torres, & Ordóñez, 2013). Building strategies may be targeted more precisely at the key features that are present in an environment, so that the extensive use of energy can be avoided.

The data is provided by the project Cooler Calmer Singapore which strives to enhance outdoor thermal comfort in Singapore. The project develops dynamic models to eventually develop a tool for assessing thermal comfort in urban spaces and designs (Melnikov, Krzhizhanovskaya, Lees, & Sloot, 2018). The project is from the ETH Zürich Future Cities Laboratory, a program that researches the development of sustainable cities.

The thesis aims to answer the question: Could adaptive path choices to heat stress be predicted with the use of socio-demographic, physiological and climate data and could the most important features influencing the shadow-adaptive path choices of people be determined? The answer to this question may help us avoid heat stress. But to answer this question we need to determine what behavioural adaptation is and how we can predict an adaptation with the data provided.

In this thesis, the theoretical framework describes existing knowledge to familiarize the reader with the subject. The methods section that follows explains how the results were obtained. The discussion gives an overview of the possible debates that could arise from the methods and results. In the conclusion, the results are evaluated and used to answer the questions at hand.


Chapter 2 Theoretical Framework

This chapter elaborates on the theories and algorithms involved in this research. It serves as background knowledge for the reader to better comprehend the content of the thesis. Two fields are combined: outdoor thermal comfort (OTC) and machine learning. The first section below explains OTC and the second explains the machine learning algorithms used.

2.1 Outdoor Thermal Comfort

The field of thermal comfort focuses on the physiological, psychological and behavioural changes caused by thermal sensation. A number of indices have been proposed to determine thermal comfort based on climate factors. These climate factors are linked to signals in different fields to determine a standard for evaluating the thermal comfort level. Thermal comfort is defined by ASHRAE as “the condition of mind that expresses satisfaction with the thermal environment” (ANSI/ASHRAE Standard 55-2013, 2013).

However, in this thesis thermal comfort is researched in an outdoor environment. More climate factors are involved outdoors than indoors, such as wind speed and solar radiation. In addition, people indoors are more often in an unvarying physical condition than people outdoors (Höppe, 2002).

Outside the state of satisfaction described in the definition of thermal comfort above, heat stress may be experienced. This is caused by an imbalance between the heat dissipation of the body and its heat production (Epstein & Moran, 2006). The body shows numerous responses, ranging from an increase in core temperature to a decrease in the performance of key mental activities (Havenith, 2001; Tatterson, Hahn, Martini, & Febbraio, 2000).


2.1.1 Behavioural science

Behavioural science is getting more involved in the field of OTC. One of the earlier articles about behaviour and heat stress is from 2001 and uses the subjective evaluation of participants (Nikolopoulou, Baker, & Steemers, 2001).

Many studies after this have taken more signals into account than subjective evaluation. Through the years, research has taken place at different locations with different micro-climates. However, it has also been argued that assessing behaviour in outdoor comfort requires physical, physiological, psychological and social factors to all be included. Even when all of these are included, there is still a need for a predictive model to help designers adjust their building strategies based on the climate factors (L. Chen & Ng, 2012).

2.2 Machine Learning

2.2.1 PCA

PCA stands for principal component analysis. It is a mathematical algorithm for dimensionality reduction. It is often used to reduce the number of features while retaining the variation in the data set. A large number of features can make visualization problematic, and seeing patterns or groups in the data may become difficult. PCA constructs principal components from the original data by finding directions of variation. Each component is a linear combination of the original features: the first principal component follows the direction with the most variation, the second component is found by repeating the process on the remaining variation, and so on. The weights of the features within a principal component form a normalized eigenvector of the covariance matrix of the data, and the corresponding eigenvalue gives the amount of variation that component captures.

PCA is often used for exploring large data sets. The principal axes in feature space are often plotted as well, to see the contribution of the original features to the transformation. It is useful to do this before a classification task to determine which features contribute to a principal component and are important for the classification task at hand (Ringnér, 2008).

2.2.2 Decision Tree Classifier

A decision tree is a supervised machine learning algorithm that is able to classify and predict the class of data. The algorithm begins at a root node. All the data passes through this first node and is split into two. It carries on making partitions in new nodes extending from the root node to form a tree structure (Friedl & Brodley, 1997). The nodes split based on predictor variables, values within the features of the training set. When splitting, the entropy is calculated so that the split with the highest information gain is chosen. The more the entropy decreases, the higher the information gain, so the features with the highest information gain split the data first. The ID3 algorithm continues doing this recursively until all the data is classified (Yu, Haghighat, Fung, & Yoshino, 2010).
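The relation between entropy and information gain described above can be made concrete with a minimal sketch. The function names and the toy labels are illustrative assumptions and are not taken from the thesis code; the entropy is computed in bits.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted entropy of its two children."""
    total = len(parent)
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical toy labels: 1 = adaptive choice, 0 = non-adaptive choice.
parent = [1, 1, 0, 0]
print(information_gain(parent, [1, 1], [0, 0]))  # split that separates the classes completely
print(information_gain(parent, [1, 0], [1, 0]))  # split that leaves the class mix unchanged
```

A split that separates the classes completely removes all entropy and therefore has the highest possible gain, while a split that leaves the class mix unchanged has a gain of zero.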

2.2.3 Random Forest Classifier

The random forest classifier is a supervised machine learning technique that is used to classify data and predict classes in data. In a descriptive manner, it can be used to give an overview of how the classes are distributed in the data. In a predictive manner, it may be used to predict the class or category of new data based on the data it was trained on (Kulkarni & Sinha, 2012).

A random forest classifier is an algorithm that uses a large number of tree-based classifiers. It generates N decision trees based on random vectors of parameters. Each random vector is independent of the previous vectors but has the same distribution. To classify, an input vector goes through each of the decision trees. Each tree casts a vote for the class it assigns to the input. The random forest chooses the class with the most votes for the record (Breiman, 2001). By default, each decision tree is built from a bootstrapped sample of the same size as the number of input records, but other sizes are also possible. A random subset of the input variables is considered for splitting each node, and the variable that splits best is chosen for the node (Kulkarni & Sinha, 2012).

Random Forest comes with some hyper-parameters. The max depth of the decision trees can be determined to reduce overfitting and make the classifier less complex. The number of decision trees and the number of maximum sample split are also to be set beforehand. Another important hyper-parameter is class weight. This is used to improve the prediction of a class when a data set is imbalanced.
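The two mechanisms that make up a random forest, bootstrapped samples and majority voting, can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names and the example votes are hypothetical, and a real forest such as the scikit-learn implementation also randomizes the features considered at each split.

```python
import random
from collections import Counter


def bootstrap_sample(records, rng):
    """Draw len(records) records with replacement (the default sample size described above)."""
    return [rng.choice(records) for _ in records]


def majority_vote(tree_predictions):
    """Return the class that received the most votes for a single record.

    Ties are resolved in favour of the class that was seen first (an assumption).
    """
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
sample = bootstrap_sample(list(range(10)), rng)
print(len(sample))  # same size as the input

# Hypothetical votes of five trees for one decision record.
votes = ["adaptive", "non-adaptive", "adaptive", "adaptive", "non-adaptive"]
print(majority_vote(votes))
```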

2.2.4 Logistic Regression

Logistic regression can be used to classify binary classes. Some implementations accept class values of 1 and 2, or simply two names, but internally the classes are handled as 0 and 1 to fit the Bernoulli distribution (Hilbe, 2009). In its basic form, the logistic regression algorithm uses the logit, the natural logarithm of an odds ratio. For each feature given in the training set, the odds ratio is computed from the class labels. Transforming this ratio into the logit gives a coefficient of the logistic regression. The sign of the coefficient, positive or negative, denotes the relationship between the specified input value and the logit of the odds of the outcome of a class. If a coefficient is below zero, a larger input value is related to a smaller probability of the binary class 1. If a coefficient is above zero, a larger input value is related to a higher probability of the binary class 1. The reverse holds for the binary class 0. When multiple variables are used, there are more coefficients. The logit of the probability of the outcome of interest is obtained by adding the Y-intercept to the sum of the variables multiplied by their coefficients (Peng, Lee, & Ingersoll, 2002).

$$\text{Probability of outcome} = \frac{e^{\alpha + \beta_1 X_1 + \beta_2 X_2}}{1 + e^{\alpha + \beta_1 X_1 + \beta_2 X_2}}$$

Where α is the Y-intercept, β are the coefficients and X are the specified values. Based on these probabilities the logistic regression can be used to predict classes.
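As a minimal sketch of how the probability expression above is evaluated, the function below computes it for an arbitrary number of features. The function name and the example coefficients are illustrative assumptions; the form 1 / (1 + e^(-z)) used in the code is algebraically identical to e^z / (1 + e^z).

```python
import math


def outcome_probability(alpha, betas, xs):
    """Probability of class 1 given intercept alpha, coefficients betas and input values xs."""
    logit = alpha + sum(b * x for b, x in zip(betas, xs))
    # 1 / (1 + e^(-z)) equals e^z / (1 + e^z).
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical coefficients: a positive coefficient raises the probability of class 1
# as the input value grows, a negative coefficient lowers it.
print(outcome_probability(0.0, [0.8], [1.0]) > outcome_probability(0.0, [0.8], [0.0]))
print(outcome_probability(0.0, [-0.8], [1.0]) < outcome_probability(0.0, [-0.8], [0.0]))
```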

2.2.5 XGBoosting Classifier

XGBoosting stands for eXtreme Gradient Boosting and is a supervised machine learning algorithm for regression and classification. The classifier is known for its speed compared to other gradient boosting algorithms. It is a solution for models that predict poorly, such as decision trees on an imbalanced data set. It builds itself in a stage-wise manner, generalizing the model by minimizing the error of the model. When it uses the weak learners to predict, it gives the instances that are predicted correctly a lower weight and the wrongly predicted instances a higher weight, and retries (Gu et al., 2012). It is similar to Random Forest in that it is a tree ensemble. However, the XGBoosting classifier trains in a different way: when optimizing, the classifier adds a new tree at every step. The tree that is added is chosen based on regularization and training loss (Dhaliwal, Nahid, & Abbas, 2018).


Chapter 3 Methods

3.1 Acquiring data

The following pages explain how the data was acquired and what the data files contain. The experiment was conducted for the project Cooler Calmer Singapore from the ETH Zürich Future Cities Laboratory. The Future Cities Laboratory program researches the development of sustainable cities (https://fcl.ethz.ch/research/responsive-cities/cooler-calmer-singapore.html). The project Cooler Calmer Singapore aims to enhance outdoor thermal comfort in Singapore. The project does this by developing a design tool for assessing outdoor thermal comfort in urban spaces and designs, to eventually find strategies and principles to improve thermal comfort in urban design (Berger, Buš, Cristie, Kumar, & Lauener, n.d.). The experiment was conducted at the Nanyang Technological University campus in Singapore and aims to construct human components in outdoor thermal comfort modelling (Melnikov, Krzhizhanovskaya, & Sloot, 2017).

Participants of the behavioural experiment were first asked to fill in a pre-survey; the questions asked are explained below under survey.xlsx. After this, a bio-feedback wristband called the Empatica E4 was set up. The participants could then walk around to familiarize themselves with the study area where they would participate in the experiment. The participants received a booklet containing a specification of the tasks. At each task the goal for the participant was to reach the target destination from an origin location by walking one of the two specified options. Both of these options were paths leading to the same target location. The participant would walk the chosen path and reach the target location, which also served as the origin for the next task. After all tasks were completed, participants filled in a post-survey and the Empatica E4 wristband was detached. The data extracted from the experiment contained:


climate.csv

The climate file contains the weather conditions and signals. Each participant has a separate file covering the time they were participating. It contains two signals from the two measurement stations set in the sun and in the shade. These signals were measured every second during the experiment. The sensors were set in the sun and in the shade because some types of signals differ in measurement when exposed to sunlight. All the temperature measurements are in degrees Celsius. The climate signals carry the information about the level of heat the participants are subjected to, which is why physiological signals cannot be processed separately.

1. Air Temperature

2. Globe Temperature: Hollow copper matt black sphere with sensor inside to capture radiant heat

3. Wind Speed

4. Relative Humidity

5. Heat Stress Index: Index for establishing heat stress based on climate conditions

6. Wet Bulb Temperature: Thermometer with a soaked cloth wrapped around it to capture relative humidity and wind speed

7. Psychro Wet Bulb Temperature: Includes radiative heat flux as well, compared to the wet bulb temperature

8. Wet Bulb Globe Temperature: Index for establishing heat stress based on climate conditions

Empatica E4 wristband

The Empatica E4 wristband served to record the physiological data and therefore produced multiple files. Each of these files describes a different physiological characteristic, as there are many that can indicate heat stress. The physiological signals are important because they contain information about changes within the body. They therefore tell us about the participant specifically and could differ with the participant's physicality.

1. ACC.csv, contains the 3-axis accelerometer signal to determine the movement and movement speed of the participant

2. BVP.csv, contains the Blood Volume Pulse of the participant from the photoplethysmograph

3. HR.csv, contains the average Heart Rate of the participant from the BVP signal

4. TEMP.csv, contains the Skin Temperature of the participant in degrees Celsius

5. EDA.csv, contains the ElectroDermal Activity of the participant in microsiemens

6. IBI.csv, contains the Inter Beat Interval of the participant computed from the BVP signal

stages.csv

This file gives the timestamps of the start and end of each task. It also contains the options that were taken by the participant and the conditions during the decisions between the two optional paths. These timestamps are important for splitting the rest of the data based on the moments a task or decision was taking place.

survey.xlsx

This file contains the survey answers that the participants were asked to fill in. Most of the questions are socio-demographic, along with other relevant information that needs to be taken into account, such as the clothing the participants were wearing. Some of the questions are about the participants' physicality and some about their perception of the climate conditions. There is a possibility that people behave differently to heat stress due to their physicality, which is why it is important to include these questions. A full list of the questions in the survey is shown in Appendix B.

choices.xlsx

This file contains a summary of the option properties and the choices the participants made. It also contains the length of each path and the fractions of sun the paths were exposed to in segments. These segments are included so the sun fraction does not have to be considered for the whole width of the path, but for example only for the least sunny segment.

events.csv

Contains the information about the events during the tasks. This file contains the approximate location of the participant while an event occurred, the type of event in terms of sun exposure and the option they chose between the two paths.


3.2 Data pre-processing

The previous section stated how the data was acquired and the signals it contains. To classify the data, the signals have to be pre-processed. Some features need to be converted to numerical values and some signals need to be dropped because of a high correlation with others or to avoid too high a dimensionality.

This section explains which features are taken into account and why. As described in the theoretical framework, saying something about adaptive behaviour to heat stress requires physical, physiological, psychological and social features. The answers the participants gave in the survey are used for the social and physical attributes.

Nikolopoulou and Lykoudis, 2001, state that distinct age groups differ in their sensitivity to heat. They state that the age group of 65 and above is more sensitive to heat exposure and shows more adaptive behaviour to the sun. In the survey, the participants had to fill in their age category. These categories are 20-25, 25-35, 35-45, 55-65 and >65. To use these groups in classification algorithms, each group was denoted by the median of the possible ages within the group. This is done because a continuous numerical feature could tell us more than a categorical binary feature for each group. The gender of participants was transformed to a categorical binary feature, so that a classifier would not treat gender as a continuous value. The added feature is 1 if the participant is male and 0 if the participant is female. According to Mehnert, Bröde, and Griefahn, 2002, there is a difference in sweat rate between women and men when exposed to heat. This could make a difference in the behavioural adaptations each gender makes.

In the same manner, ethnicity is processed in a categorical binary way. Each ethnicity is used as a feature, with 1 denoting that it is the participant's ethnicity. It has been reported that the effects of heat on tourists, migrants and refugees need to be researched (Hansen, Bi, Saniotis, & Nitschke, 2013). The ethnicity of the participants could therefore be interesting to use in classification, along with the time they have lived in Singapore. In the survey answers, the participants also filled in how long they have lived in Singapore, which is processed in years. Months and days are added to this number as fractions of a year.
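The survey encodings described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: all function names and category labels are hypothetical, the midpoint of a closed age range stands in for the median of the possible ages, and a year is taken as 12 months or 365 days. The open-ended category >65 is deliberately not handled, because the text does not state which value was used for it.

```python
def age_midpoint(category):
    """Midpoint of a closed age range such as '25-35' (open-ended ranges are not handled)."""
    low, high = category.split("-")
    return (float(low) + float(high)) / 2.0


def one_hot(value, categories):
    """One binary feature per category: 1 if it matches the participant's answer, else 0."""
    return {f"is_{c}": int(value == c) for c in categories}


def years_lived(years, months=0, days=0):
    """Time lived in Singapore in years; months and days are added as fractions of a year."""
    return years + months / 12.0 + days / 365.0


print(age_midpoint("25-35"))
print(one_hot("male", ["male"]))                         # single binary gender feature
print(one_hot("group_b", ["group_a", "group_b"]))        # placeholder ethnicity labels
print(years_lived(2, months=6))
```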

In the sensation of heat stress, the ability of the body to dissipate heat is an important factor. This could be influenced by the types of clothing the participants were wearing. The insulation clothing provides is calculated in clo units. One clo is the amount of insulation that allows a person to maintain thermal equilibrium in an indoor, near-sedentary situation, and equals $0.155\ \mathrm{K \cdot m^2 \cdot W^{-1}}$, where K is kelvin, m is metres and W is watts (Schiavon & Lee, 2013). ASHRAE evaluated types of clothing and the number of clo units each offers. The insulation feature is obtained by adding the clo values of each type of clothing a participant was wearing (Al-Ajmi, Loveday, Bedwell, & Havenith, 2008).
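Adding up the clo values of an outfit can be sketched as below. The garment names and the clo values in the lookup table are illustrative placeholders only, not the table used in the thesis; the actual values should be taken from the ASHRAE garment tables.

```python
CLO_TO_SI = 0.155  # thermal resistance of one clo in K·m²/W, as stated above

# Placeholder garment values for illustration (assumption, not the thesis table).
GARMENT_CLO = {
    "t_shirt": 0.08,
    "shorts": 0.08,
    "thin_trousers": 0.15,
    "shoes": 0.02,
}


def total_clo(garments, table=GARMENT_CLO):
    """Clothing insulation of an outfit: the sum of the clo values of its garments."""
    return sum(table[g] for g in garments)


def clo_to_si(clo):
    """Convert clo units to K·m²/W."""
    return clo * CLO_TO_SI


outfit = ["t_shirt", "shorts", "shoes"]
print(total_clo(outfit))
print(clo_to_si(total_clo(outfit)))
```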

Many behavioural science studies have focused on the subjective evaluation of participants. Most of the time this is done on a 5-point scale. It is an important factor in behavioural science and needs to be included. In the survey, the participants could give their opinion about the climate in Singapore overall and on the day of the experiment. This subjective evaluation could be indicative in classifying and predicting behavioural adaptation (L. Chen & Ng, 2012).

It is thought that the amount of liquid people drink could influence their productivity in hot climates. Productivity decreases when a smaller amount of liquid is consumed before performing in hot areas (Delgado Cortez, 2009). The amount of liquid the participants drank that day is used as a feature, in litres.

To capture the physical aspect, BMI is included to incorporate the height and weight of the participant. The BMI tells us more about the body than height and weight separately; it is used by the WHO as an index of body weight in relation to body height (Sperrin, Marshall, Higgins, Renehan, & Buchan, 2016). It has also been reported that body fat is one of the bigger factors in individual heat storage (Havenith & van Middendorp, 1990). There is some evidence to suggest that fit people have a higher tolerance for heat than people who exercise less (Kuennen, Gillum, Dokladny, Schneider, & Moseley, 2013). The amount of exercise is included as the number of calories that half an hour of the activity would burn, depending on the weight of the participant. This number came from the July 2004 edition of the Harvard Heart Letter. The survey does not contain answers stating how often the participants practise activities, which is why the burn rate is used. If a participant did more than one type of activity, the burn rates are added together.

The climate part is also needed to make a complete predictive model. In the prediction of heat stress, many different indices are in use. Heat index and Humidex are examples, but the most widely used index to determine heat stress is the wet bulb globe temperature. The wet bulb globe temperature is composed of the wet-bulb temperature, air temperature and globe temperature. The wet-bulb temperature also reflects relative humidity and wind speed (Gaspar & Quintela, 2009), so these do not need to be included separately from the climate file. To prevent a high dimensionality, the other climate features are left out.
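Two of the derived quantities mentioned above follow simple formulas, sketched here for reference. The BMI is weight divided by height squared. For the wet bulb globe temperature, the weights 0.7, 0.2 and 0.1 are those of the standard outdoor formula with solar load; whether the thesis recomputes the index or uses the reading from the climate file directly is not stated, so this is an assumption for illustration. The function names and example values are hypothetical.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m²."""
    return weight_kg / height_m ** 2


def wbgt_outdoor(natural_wet_bulb, globe, air):
    """Outdoor wet bulb globe temperature (°C) from its three component temperatures (°C)."""
    return 0.7 * natural_wet_bulb + 0.2 * globe + 0.1 * air


print(round(bmi(70.0, 1.75), 1))
print(round(wbgt_outdoor(natural_wet_bulb=25.0, globe=40.0, air=30.0), 1))
```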

To describe the physiological part, the features skin temperature, EDA and heart rate are used. The two features are processed as the mean over the minute before the decision. An elevated skin temperature causes an increase in blood flow through the skin. This helps the body release metabolic heat through the skin. In the long term, acclimatization to the heat makes a person sweat more and thus have a lower skin temperature (Zhou & Yamamoto, 1997). The Empatica files include EDA.csv, containing the electro-dermal activity of a participant. EDA measures the conductivity changes of the skin. It is a measure of the current flow between two points in contact with the skin. The EDA signal has been closely linked to the activation of the sympathetic nervous system and emotional processing. It is commonly used as an index for emotional state (Braithwaite, Watson, Jones, & Rowe, 2013). So while being a physiological signal, it possibly carries psychological information. It could serve as a predictor of people's stress responses (Sioni & Chittaro, 2015).
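Taking the mean of a signal over the minute before a decision can be sketched as below. The representation of a signal as (timestamp in seconds, value) pairs, the half-open window and the function name are assumptions for illustration; the decision timestamps themselves would come from stages.csv.

```python
def mean_before(samples, decision_time, window=60.0):
    """Mean of the values with decision_time - window <= t < decision_time.

    samples is an iterable of (timestamp_in_seconds, value) pairs.
    Returns None when no sample falls inside the window.
    """
    values = [v for t, v in samples if decision_time - window <= t < decision_time]
    if not values:
        return None
    return sum(values) / len(values)


# Hypothetical skin temperature samples: (seconds since start, degrees Celsius).
skin_temp = [(10, 30.0), (70, 32.0), (100, 34.0), (120, 36.0)]
print(mean_before(skin_temp, decision_time=120))  # uses the samples at t = 70 and t = 100
```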

The heart rate signal and the inter-beat interval signal are both computed from the blood volume pulse signal. For predicting behavioural adaptation, there is no need to use all three because they are highly correlated. According to Yamamoto, Iwamoto, Inoue, and Harada, 2007, the heart rate of people who are subjected to heat exposure for over 30 minutes increases significantly, so it could be a useful predictor.

3.3 Classification

Before classification, behavioural adaptation still has to be defined in terms of the experiment. There were two paths for the participant to choose from to get from origin to destination. One path could have a bigger fraction of sun exposure than the other, and one could be longer than the other. That leaves us with four different types of decisions for classification.

Decision types | Shorter | Longer
More sun | No behavioural adaptation | Contradicting
Less sun | Rational | Behavioural adaptation

Table 3.1: Path types to illustrate determination of behavioural adaptation

There were instances where a path was shorter and had the least sun exposure. The least sun exposure is calculated by taking the path length and multiplying it with the minimum of the sun fractions of the segments within the path. As the path did not have a uniform exposure to sun, the exposure to sun is calculated in 5 segments across the width of the path. Taking this path could be seen as a rational decision, since it optimizes for both least distance and least sun exposure. The participant would get from origin to destination in the fastest way, with the least exposure to heat. Such a decision does not carry any information about whether the participant was acting based on heat exposure.

The paths that were longer and had more sun exposure are labelled as contradicting. There is no explanation for this decision that could be meaningful to establish sun evading behaviour.

Therefore, the other two options are used to label the data set. These options do carry information about behaviour in terms of sun exposure. The longer and least sunny option tells us that the participant was willing to walk a longer way, hypothetically to evade the sun. The shorter and sunnier route tells us that the participant is not bothered by the sun and does not attempt to evade it. To make sure the distinction between sunnier and least sunny was noticed by the participants, the data is filtered on the presence of any sun during the conditions. In some cases the participants were not subjected to sun. The decisions taken by the participants when there was no sun at all are not used for the classification task. The decisions taken in full sun or cloudy sun are used.
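The labelling scheme of Table 3.1 can be sketched as a small function. The dictionary layout, the label strings and the handling of ties are assumptions for illustration, since the text does not say what happens when both paths have the same length or the same exposure. The least sun exposure follows the description above: path length multiplied by the minimum sun fraction of the segments.

```python
def least_sun_exposure(length, sun_fractions):
    """Path length multiplied by the minimum sun fraction of the path's segments."""
    return length * min(sun_fractions)


def label_decision(chosen, other):
    """Label a decision by comparing the chosen path with the alternative path.

    Both arguments are dicts with the keys 'length' and 'sun_fractions'.
    """
    c_sun = least_sun_exposure(chosen["length"], chosen["sun_fractions"])
    o_sun = least_sun_exposure(other["length"], other["sun_fractions"])
    if chosen["length"] == other["length"] or c_sun == o_sun:
        return "undetermined"  # ties are not covered by the text (assumption)
    shorter = chosen["length"] < other["length"]
    less_sun = c_sun < o_sun
    if shorter and less_sun:
        return "rational"
    if not shorter and not less_sun:
        return "contradicting"
    if less_sun:
        return "behavioural adaptation"   # longer path with less sun
    return "no behavioural adaptation"     # shorter path with more sun


# Hypothetical paths with 5 sun-fraction segments each.
long_shady = {"length": 120.0, "sun_fractions": [0.1, 0.2, 0.0, 0.3, 0.1]}
short_sunny = {"length": 80.0, "sun_fractions": [0.9, 0.8, 1.0, 0.7, 0.9]}
print(label_decision(long_shady, short_sunny))
print(label_decision(short_sunny, long_shady))
```

Only decisions labelled as behavioural adaptation or no behavioural adaptation would be kept for the classification task; the rational and contradicting decisions are discarded as described above.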

Various classifiers were used and evaluated on their results. Each classifier was evaluated on a test set with a 0.2 ratio to the whole data set, and the test scores can be seen in the Results chapter. The best results came from the random forest classifier of scikit-learn, a Python package. The random forest was used to determine the important features that cause behavioural adaptation. The important features are shown in the results section, sorted by their importance. To improve the random forest algorithm, PCA was implemented. PCA could reduce the risk of overfitting and would also reduce the collinearity among features. The random forest classifier was optimized with cross-validation, which preferred the class weights longer and least sunny: 1.86, shorter and more sun: 0.68, to counteract the imbalanced data set. The class weights were computed with the scikit-learn function compute_class_weight. The XGBoosting classifier was also used to counteract the imbalanced data set, but it did not match the results of the random forest with the set class weights. The decision tree classifier was used with the same hyperparameters as the random forest classifier to extract a tree. This tree could give more insight into how the random forest was classifying the data. As stated in the results section, the tree is not used for determining the importance of the features. It is shown to visualise the random forest classifier. The tree-based algorithms can tell us the Gini-importance of the features. This tells us the mean decrease in impurity, defined as the total decrease in node impurity averaged over all the trees (Pedregosa et al., 2011). It does not tell which feature contributes to classifying which of the two classes. That is why logistic regression is implemented. The coefficients of the logistic regression tell the direction of the features towards the two classes, so the contribution of each feature can be visualized in Section 4.5 independently of the other features. The results from the method described above are presented in the next chapter.
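The reported class weights are consistent with the "balanced" heuristic, in which each class receives the weight n_samples / (n_classes * class_count). The sketch below reproduces that arithmetic in plain Python. It assumes the weights were computed over all N=201 decisions with 54 in the adaptive class (the counts of Table 4.1); the function name and label strings are illustrative.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Assumed class counts: 54 adaptive decisions out of N = 201.
labels = ["longer and least sunny"] * 54 + ["shorter and more sun"] * 147
weights = balanced_class_weights(labels)
for cls, weight in weights.items():
    print(f"{cls}: {weight:.2f}")
```

Under these assumptions the smaller class receives a weight of about 1.86 and the larger class about 0.68, which matches the values stated above.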

Chapter 4 Results

This chapter presents the results of the methods described above. It starts with a short description of the data set, followed by the performance of the classifiers.

4.1 Distribution of data

As described in the methods section, the features were filtered on correlation and chosen because of their ability to predict according to the literature. After the data is filtered on correlation or features are left out on purpose, the distributions of the data differ from before. The figures and the table for all features can be found in Appendix A.

Feature                            Percentage of whole data set    # of decisions made
Longer and least sunny option      26.9%                           54
Participant is male                55.2%                           111
Participant is Chinese             55.7%                           112
Participant is Indian              23.9%                           48
Participant is Other (Asian)       9.0%                            18
Participant is Other (Caucasian)   7.0%                            14
Participant is Malay               3.0%                            6
Participant is Eurasian            1.5%                            3

Table 4.1: Distribution of nationality and gender data. The total number of decisions of the classes of interest in the data set is N=201.

Figure 4.1: Distribution of gender and ethnicity in percentages.

4.2 Random Forest

In this section the F-1 scores for both classes and their micro-average are shown. The F-1 score is the harmonic mean of precision and recall.

\[ F_1 = \frac{2 \cdot (\text{precision} \cdot \text{recall})}{\text{precision} + \text{recall}} \]

The micro-average is the number of true positives divided by the sum of the true positives and the false positives.

\[ \text{Micro-average} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]

The scores were determined based on a 0.2 test set ratio that randomly picks a subset from the data set. F-1 score is chosen because of the imbalance in the classes.
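As a small worked example of these two scores, the sketch below computes the F-1 score from a precision and recall pair and a micro-average from per-class counts. The function names are illustrative, the counts are summed over the classes as is usual for micro-averaging, and the true and false positive counts are hypothetical values chosen for illustration, not the confusion matrix of the thesis.

```python
from typing import Sequence


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_average(true_pos: Sequence[int], false_pos: Sequence[int]) -> float:
    """Total true positives over total true plus false positives, across classes."""
    tp = sum(true_pos)
    fp = sum(false_pos)
    return tp / (tp + fp)


# When precision equals recall, the F-1 score equals both (as in Table 4.2).
print(round(f1_score(0.64, 0.64), 2))
print(round(f1_score(0.87, 0.87), 2))

# Hypothetical per-class counts for a two-class test set of 41 decisions.
print(round(micro_average(true_pos=[7, 26], false_pos=[4, 4]), 2))
```

Because every test decision counts once in the micro-average, the larger class dominates it, which is why the per-class F-1 scores are reported alongside it for an imbalanced data set.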

Below that, Figure 4.2 and Table 4.3 visualize and describe the Gini-importance of the features in the predictive model. The features with the highest Gini-importance have the highest total decrease in node impurity. The random forest used a class weight as described in the methods section to counteract the imbalance.

Class                     Precision    Recall    F-1
Longer and least sunny    0.64         0.64      0.64
Shorter and more sun      0.87         0.87      0.87
Micro average             0.80         0.80      0.80

Table 4.2: F-1 score and the micro-average score of the random forest classifier

Feature                                 Gini-importance
EDA peak frequency                      0.138
Skin temperature                        0.118
Time lived in Singapore                 0.118
Duration tasks (time elapsed)           0.116
Heart rate                              0.108
WBGT sun average                        0.099
Liquid drank                            0.066
BMI                                     0.062
Exercise                                0.058
Insulation from clothing                0.0343
Singapore climate overall experience    0.0247
Singapore climate today experience      0.0212
Age                                     0.0138
Is male                                 0.0092
Is Chinese                              0.0069
Is Malay                                0.00327
Is other (Caucasian)                    0.00284
Is Indian                               0.000381
Is Eurasian                             0.0000
Is other (Asian)                        0.0000

Table 4.3: Gini-importance of the features within the random forest classifier
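To make the notion of a decrease in node impurity concrete, the following sketch computes the Gini impurity of a node and the decrease achieved by a single split. It is a simplified illustration with hypothetical class counts. In a forest, these decreases are additionally weighted by the share of samples reaching the node, summed per feature, normalized and averaged over the trees.

```python
from typing import Sequence


def gini(counts: Sequence[int]) -> float:
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(
    parent: Sequence[int], left: Sequence[int], right: Sequence[int]
) -> float:
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = sum(parent)
    return (
        gini(parent)
        - sum(left) / n * gini(left)
        - sum(right) / n * gini(right)
    )


# Hypothetical node with 10 samples per class.
print(impurity_decrease([10, 10], [10, 0], [0, 10]))  # perfect split
print(impurity_decrease([10, 10], [5, 5], [5, 5]))    # uninformative split
```

A feature that is used in splits with large decreases across many trees ends up with a high Gini-importance, while a feature that is never selected for a split receives an importance of zero.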

4.3 Decision Tree

As stated in the methods section, the decision tree is shown here to visualize the process of the random forest and to give a better understanding of classification in trees. Note that a random forest uses more than one tree to make a prediction. The class 'shadier' denotes the longer and least sunny path as mentioned in Section 3.3, and the class 'shorter' denotes the non-adaptive class.

Figure 4.3: Decision tree as visualisation of the random forest classifier process with shadier denoting the adaptive class

4.4 PCA

This section shows that, after using PCA, the random forest classifier had the highest accuracy score at 8 principal components. The Gini-importance of the principal components is shown in Figure 4.6.

Below that, Figure 4.7 shows the principal axes in feature space. These represent the directions of maximum variance in the original data set and are visualized by the magnitude of the corresponding values in the eigenvectors. This gives a sense of the correlation between the features and the principal components. A higher value indicates that the feature has a more significant relation with the principal component.

Figure 4.4: PCA cumulative explained variance with the red dotted line at x=13 denoting 95% of cumulative variance explained
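The number of components needed to reach a given share of explained variance can be read off a cumulative sum. The sketch below shows that step in plain Python. The function name is an assumption and the variance ratios are made-up values for illustration, not the ratios obtained in the thesis.

```python
from itertools import accumulate
from typing import Optional, Sequence


def components_for_variance(
    ratios: Sequence[float], threshold: float = 0.95
) -> Optional[int]:
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        # Small tolerance guards against floating-point rounding in the sum.
        if cumulative >= threshold - 1e-12:
            return k
    return None  # threshold is never reached


# Made-up explained variance ratios, sorted from largest to smallest.
example_ratios = [0.5, 0.3, 0.15, 0.05]
print(components_for_variance(example_ratios, threshold=0.95))
```

Applied to the per-component explained variance ratios of a fitted PCA, the same function would return the x-position of the dotted line in a cumulative variance plot such as Figure 4.4.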

Figure 4.5: Accuracy score of the random forest classifier with different amounts of principal components as training set

Figure 4.7: Principal axes in feature space

4.5 Logistic Regression

The evaluation of the logistic regression on a test set of ratio 0.2 is shown in Table 4.4. The F-1 score was chosen because of the imbalance in the classes. The class weight used for the random forest was not used with the logistic regression, as leaving it out returned better results.

Figure 4.8 and Table 4.5 visualize and show the direction of the coefficients in the logistic regression model. The negative coefficients direct towards the behavioural adaptation class and the positive coefficients direct towards the non-adaptive class.
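How the sign of a coefficient steers the prediction can be illustrated with the logistic function. The sketch below is a simplification under stated assumptions: the non-adaptive class is treated as the positive class, the intercept is set to zero, all other features are held fixed, and the feature is assumed to be measured in standardized units. Only the two coefficient values are taken from Table 4.5.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def p_non_adaptive(coefficient: float, value: float, intercept: float = 0.0) -> float:
    """Probability of the non-adaptive class for one feature, others held fixed."""
    return sigmoid(intercept + coefficient * value)


# Insulation from clothing has a negative coefficient (-0.78 in Table 4.5):
# raising the feature lowers the probability of the non-adaptive class.
low = p_non_adaptive(-0.78, 0.0)
high = p_non_adaptive(-0.78, 1.0)
print(f"insulation: {low:.2f} -> {high:.2f}")

# BMI has a positive coefficient (0.48): raising it has the opposite effect.
print(f"BMI: {p_non_adaptive(0.48, 0.0):.2f} -> {p_non_adaptive(0.48, 1.0):.2f}")
```

The size of the shift depends on the scale of the feature, so coefficients are only comparable in magnitude when the features are on a common scale.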

Figure 4.8: Coefficients of the logistic regression, negative directing towards the adaptive class and positive directing towards the non-adaptive class

Feature                                 Coefficients
BMI                                     0.48
Heart Rate                              0.46
Singapore climate today experience      0.44
Is Malay                                0.43
Is male                                 0.42
Liquid drank                            0.158
Duration tasks (time elapsed)           0.113
EDA peak frequency                      0.0303
Is other (Asian)                        0.0186
Is other (Caucasian)                    0.00053
Time lived in Singapore                 -0.0390
Is Eurasian                             -0.059
Is Indian                               -0.063
Is Chinese                              -0.074
Exercise                                -0.151
Skin temperature                        -0.288
WBGT sun average                        -0.290
Singapore climate overall experience    -0.306
Age                                     -0.46
Insulation from clothing                -0.78

Table 4.5: Coefficients of the logistic regression negative directing towards adaptive class and positive directing towards non-adaptive

4.6 XGBoosting Classifier

As stated in the methods section, the XGBoosting classifier was tried to see whether it gives better results than the random forest, given the imbalance in the data set. The F-1 score and micro-average of the classification task using this classifier are shown in Table 4.6 below.

Chapter 5 Discussion

5.1 Findings

This thesis aims to find important features in predicting heat adaptive behaviour. By pre-processing the data from different types of signals and survey answers, the random forest classifier was able to predict the classes with an accuracy of 80%. The F-1 score for the class of interest, the behavioural adaptation class, is 0.64. The XGBoosting classifier was also used for the classification task to counteract the imbalance in the data set. However, it did not match the results of the random forest classifier. That is why the feature importance of the random forest could better describe the important features for predicting the two classes. Based on the random forest classification, the important features have been found to be the WBGT, the time that has elapsed during the experiment, the time the participants lived in Singapore, the EDA peak frequency, skin temperature and heart rate. Their Gini-importance is amongst the highest, as seen in Table 4.3. This means that these features have the highest total decrease in node impurity and are most predictive in classifying the data.

The logistic regression has a prediction accuracy of 71%, which is lower than that of the random forest. However, the logistic regression offers the direction of the features, as covered in the theoretical framework section. As stated in the results, the negative coefficients direct the prediction towards the class of behavioural adaptation.

Transforming the feature space through PCA did not result in significantly improved classification performance on the task of interest. The highest accuracy, 78%, was found at 8 principal components. The cumulative explained variance only reached 95% at 13 principal components. At 8 principal components, the PCA loses a lot of the information in the variance of the original features. The principal components had almost the same Gini-importance. The most important among them was component 6 (numbered 5 in Figures 4.6 and 4.7). As seen in Figure 4.7, this principal component has the highest magnitude of corresponding values in the eigenvector for the feature insulation from clothing. The features with the highest magnitude of the corresponding values in the eigenvectors across all principal components were mostly physical, such as age, BMI and gender, along with insulation from clothing. These are not important according to the random forest without PCA. However, because the random forest without PCA predicts more accurately than the random forest with PCA, the important features from the random forest classifier without PCA seem more accurate.

When combining the feature importance of the random forest with the coefficients of the logistic regression, the results state that a higher value for heart rate would direct the prediction towards non-adaptive behaviour, according to the logistic regression results in Figure 4.8. This result contradicts the findings of Yamamoto et al. (2007) and does not appear to be logical. The same holds for the EDA peak frequency, which contradicts the findings of Sioni and Chittaro (2015). The importance of the two features seems to be in accordance with the literature, but the direction of prediction is not. The time elapsed during the experiment, or duration of tasks, directs the prediction of the logistic regression towards the non-adaptive class. If the findings of Yamamoto et al. (2007) are accurate, this could indicate that the longer the participant is exposed to the climate, the better the participant gets acclimatized and regains an unvarying physical condition. There is also the possibility that the participant wanted to finish the tasks faster due to the longer exposure to heat. The coefficient of time lived in Singapore is negative, which does not fit the view that the participant is more acclimatized to the heat. It may be that these participants are more used to evading the sun.

The WBGT directs the prediction to the behavioural adaptation class, as it has a negative coefficient in the results of the logistic regression. This appears to be logical, since it has been used as an index to determine heat stress in other research. According to Gaspar and Quintela (2009), a higher WBGT value means the person is subjected to more heat stress by the climate conditions. The skin temperature also has a negative coefficient, meaning the higher the skin temperature, the more likely the logistic regression classifies the data as behaviourally adaptive. According to Zhou and Yamamoto (1997), when skin is exposed to heat, its temperature decreases due to sweating. However, Huizenga, Zhang, Arens, and Wang (2004) state that skin temperatures increased during an 80-minute warm condition test.

The most important finding regarding the random forest with PCA is that the insulation from clothing has high feature importance. Its logistic regression coefficient is the most negative of all the coefficients and steers the prediction towards the class of behavioural adaptation. This seems logical, since a higher level of insulation from clothing, as mentioned in the data pre-processing section, leads to more insulation of the body.

5.2 Limitations

In this research, the reliability of the data is affected by several limitations. These could stem from the measurements, the data pre-processing or the use of the classification algorithms. This section describes the limitations the research encountered.

During the experiment, different types of measurements were taken. The measurements rely on different kinds of sensors that could introduce small errors. The Empatica E4 bracelet has to be worn fairly tight around the wrist to get accurate measurements. There could have been instances where the wristband was not worn tight enough, which may have caused some measurement errors. The research used heart rate as a feature, which is extracted from the blood volume pulse (BVP) signal as mentioned in chapter 3. When the bracelet is not worn tight enough, the BVP signal could contain small errors, which affect the heart rate as well. A loose fit could also mean that the skin temperature was wrong at times, because friction between the skin and the sensor caused the sensor to measure more heat than the skin temperature itself. The features heart rate and skin temperature are included as the average over the 60 seconds before the decision. This could be done in other ways, for example by taking the value at the moment of the decision or over another amount of time leading up to the decision. The EDA signal could be affected by the recording time, since it takes about 10 minutes to couple the electrodes with the skin. The EDA signal could also be affected by the climate conditions temperature and humidity and by the placement on the body (https://support.empatica.com/hc/en-us/categories/200023126-E4-wristband).
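The averaging of a signal over the 60 seconds before a decision can be sketched as follows. This is a minimal illustration with assumed names and made-up sample values. It presumes that timestamps and decision times are expressed in seconds on the same clock, and it excludes the sample taken exactly at the decision.

```python
from typing import Optional, Sequence


def mean_before(
    timestamps: Sequence[float],
    values: Sequence[float],
    decision_time: float,
    window_s: float = 60.0,
) -> Optional[float]:
    """Mean of the samples in [decision_time - window_s, decision_time)."""
    window = [
        v
        for t, v in zip(timestamps, values)
        if decision_time - window_s <= t < decision_time
    ]
    return sum(window) / len(window) if window else None


# Made-up heart rate samples, one every 30 seconds.
times = [0.0, 30.0, 60.0, 90.0, 120.0]
heart_rate = [100.0, 104.0, 108.0, 112.0, 116.0]
print(mean_before(times, heart_rate, decision_time=120.0))
```

Changing `window_s` gives the alternative aggregation windows mentioned above, which makes it straightforward to check how sensitive a feature is to that choice.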

The pre-processing of the data has a number of limitations. To use the features, some data had to be interpreted in a way that could have been done differently. The experience of the climate in Singapore is recorded on a five-point scale. Such a scale has been commonly used in other research according to L. Chen and Ng (2012), but it is still debatable since it relies on the subjective evaluation of the participant. The amount of exercise was processed as calories burned per half an hour, because the survey did not ask how often and how long the participant would exercise. The age, weight and height answers were given in groups, not as precise values. In this thesis, the median within the specified group is used, but the participant could still be above or below that. The ethnicity and gender are encoded in a binary way, which could alter their importance in comparison with a non-binary method. For the insulation from clothing, clo units were used as described in the methods chapter. This unit is mainly based on an indoor environment. The tightness and thickness of clothing are generalized in order to include them in the data set. Wind is not accounted for in this unit, as it is based on indoor conditions. The ethnicity of the participants did not vary much. The survey offers only a few different answers, and some generalize over many countries, not giving a precise description.

The overall accuracy is 80% for the random forest and 71% for the logistic regression. However, both have lower scores for the class of behavioural adaptation. This could cause some of the feature importance to be more predictive for the non-adaptive class than for the class this research is interested in.

The random forest algorithm builds the decision trees in its ensemble from randomly picked subsets of the data, causing the models to vary each time it is run. The data is labelled as described in the methods chapter. The label for behavioural adaptation could be simplistic and could be extended with other factors such as movement speed.

5.3 Future work

As described in the section above, the research has some limitations that could have altered the results. The measurements could be done in other ways to decrease the possibility of mistakes. The survey for the participants could be more detailed to allow more precise data processing. The research stated that the important features were mostly physiological and climate-related. However, socio-demographic and physical features appeared more important when using PCA. There is still a need for predictive models that implement these features in a non-binary way to see if they are more important than they appear in this thesis. The predictive model should include more variety in ethnicity to determine if there is a difference between ethnic groups. It would also help if the research were done on a larger scale, so that the smaller class has enough records to no longer form an obstacle for classifiers to predict it.

Chapter 6 Conclusion

This thesis aims to find important features in predicting shadow adaptive path choices. Based on classification through a random forest classifier and the use of physiological, social, physical attribute and climate data, the important features have been found to be the WBGT, the time that has elapsed during the experiment, the EDA peak frequency, skin temperature and heart rate. These are mostly physiological and climate features. Combining the important features of the classifier with logistic regression gives direction to the features using coefficients. In essence, this is how certain values steer the probability of predicting one of the two classes. In this way, higher values for a feature with a negative coefficient direct the prediction to the adaptive class. The coefficients of heart rate, EDA peak frequency and duration of tasks are positive and direct the prediction to the non-adaptive class. The heart rate increases when subjected to stress and remains steady when acclimatized, so its coefficient does not seem to fit previous literature. EDA has a positive coefficient, which contradicts the commonly held view that higher values make it more likely that people have a higher level of stress. A higher duration of tasks could mean the participant got more acclimatized to the heat and could also mean that the participant wanted to finish the tasks faster. Higher values for WBGT, the time lived in Singapore and skin temperature direct the prediction towards the adaptive class. As WBGT is commonly used as a heat stress index, this appears to be logical. The negative coefficient of the time lived in Singapore could be caused by the participants being used to the heat and avoiding the sun. The skin temperature tends to get higher when exposed to heat, therefore it has a negative coefficient as well. The most important feature according to PCA in combination with random forest is the insulation from clothing. However, the accuracy in prediction with PCA was lower than without PCA, so the features that are important without PCA are more accurate.

The features that resulted from this thesis as important could be of use when implementing building strategies and designs. There is still a need for predictive models to analyze behavioural adaptation to heat stress. An experiment on a larger scale could allow classifiers to predict more accurately and could lead to more variation in features such as ethnicity. This could give a better overview of the importance of social and physical attribute features. This research is a step in the analysis of behavioural adaptation to heat stress that does not rely on subjective evaluation of participants.

References

Al-Ajmi, F. F., Loveday, D. L., Bedwell, K., & Havenith, G. (2008). Thermal insulation and clothing area factors of typical arabian gulf clothing ensembles for males and females: measurements using thermal manikins. Applied ergonomics, 39 (3), 407–414.

ANSI/ASHRAE Standard 55-2013, A. (2013). Thermal environmental conditions for human occupancy. American Society of Heating, Refrigerating and Air-Conditioning Engineers . . . .

Berger, M., Buš, P., Cristie, V., Kumar, A., & Lauener, J. (n.d.). Cooler calmer singapore.

Braithwaite, J. J., Watson, D. G., Jones, R., & Rowe, M. (2013). A guide for analysing electrodermal activity (eda) & skin conductance responses (scrs) for psychological experiments. Psychophysiology, 49 (1), 1017–1034.

Breiman, L. (2001). Random forests. Machine learning, 45 (1), 5–32.

Chen, L., & Ng, E. (2012). Outdoor thermal comfort and outdoor activities: A review of research in the past decade. Cities, 29 (2), 118–125.

Chen, M.-L., Chen, C.-J., Yeh, W.-Y., Huang, J.-W., & Mao, I.-F. (2003). Heat stress evaluation and worker fatigue in a steel plant. Aiha Journal , 64 (3), 352–359.

Delgado Cortez, O. (2009). Heat stress assessment among workers in a nicaraguan sugarcane farm. Global health action, 2 (1), 2069.

Dhaliwal, S. S., Nahid, A.-A., & Abbas, R. (2018). Effective intrusion detection system using xgboost. Information, 9 (7), 149.

Epstein, Y., & Moran, D. S. (2006). Thermal comfort and the heat stress indices. Industrial health, 44 (3), 388–398.

Friedl, M. A., & Brodley, C. E. (1997). Decision tree classification of land cover from remotely sensed data. Remote sensing of environment, 61 (3), 399–409.

Gago, E. J., Roldan, J., Pacheco-Torres, R., & Ordóñez, J. (2013). The city and urban heat islands: A review of strategies to mitigate adverse effects. Renewable and Sustainable Energy Reviews, 25, 749–758.

Gaspar, A. R., & Quintela, D. A. (2009). Physical modelling of globe and natural wet bulb temperatures to predict wbgt heat stress index in outdoor environments. International journal of biometeorology, 53 (3), 221–230.

Gu, J., Weber, K., Klemp, E., Winters, G., Franssen, S. U., Wienpahl, I., . . . others (2012). Identifying core features of adaptive metabolic mechanisms for chronic heat stress attenuation contributing to systems robustness. Integrative Biology, 4 (5), 480–493.

Hansen, A., Bi, L., Saniotis, A., & Nitschke, M. (2013). Vulnerability to extreme heat and climate change: is ethnicity a factor? Global health action, 6 (1), 21364.

Havenith, G. (2001). Individualized model of human thermoregulation for the simulation of heat stress response. Journal of Applied Physiology, 90 (5), 1943–1954.

Havenith, G., & van Middendorp, H. (1990). The relative influence of physical fitness, acclimatization state, anthropometric measures and gender on individual reactions to heat stress. European journal of applied physiology and occupational physiology, 61 (5-6), 419–427.

Hilbe, J. M. (2009). Logistic regression models. CRC press.

Höppe, P. (2002). Different aspects of assessing indoor and outdoor thermal comfort. Energy and buildings, 34 (6), 661–665.

Huizenga, C., Zhang, H., Arens, E., & Wang, D. (2004). Skin and core temperature response to partial-and whole-body heating and cooling. Journal of thermal biology, 29 (7-8), 549–558.

Kjellstrom, T., Holmer, I., & Lemke, B. (2009). Workplace heat stress, health and productivity–an increasing challenge for low and middle-income countries during climate change. Global health action, 2 (1), 2047.

Kovats, R. S., & Hajat, S. (2008). Heat stress and public health: a critical review. Annu. Rev. Public Health, 29 , 41–55.

Kuennen, M., Gillum, T., Dokladny, K., Schneider, S., & Moseley, P. (2013). Fit persons are at decreased (not increased) risk of exertional heat illness. Exercise and Sport Sciences Reviews, 41 (2), 134–135.

Kulkarni, V. Y., & Sinha, P. K. (2012). Pruning of random forest classifiers: A survey and future directions. In 2012 international conference on data science & engineering (icdse) (pp. 64–68).

Loughnan, M., Nicholls, N., & Tapper, N. J. (2012). Mapping heat health risks in urban areas. International Journal of Population Research, 2012 .

Mehnert, P., Bröde, P., & Griefahn, B. (2002). Gender-related difference in sweat loss and its impact on exposure limits to heat stress. International journal of industrial ergonomics, 29 (6), 343–351.

Melnikov, V., Krzhizhanovskaya, V. V., Lees, M. H., & Sloot, P. M. (2018). System dynamics of human body thermal regulation in outdoor environments. Building and Environment, 143, 760–769.

Melnikov, V., Krzhizhanovskaya, V. V., & Sloot, P. M. (2017). Models of pedestrian adaptive behaviour in hot outdoor public spaces. Procedia Computer Science, 108, 185–194.

Moran, D. S., Shitzer, A., & Pandolf, K. B. (1998). A physiological strain index to evaluate heat stress. American Journal of Physiology-Regulatory, Integrative and Comparative Physiology, 275 (1), R129–R134.

Nikolopoulou, M., Baker, N., & Steemers, K. (2001). Thermal comfort in outdoor urban spaces: understanding the human parameter. Solar energy, 70 (3), 227–235.

Nikolopoulou, M., & Lykoudis, S. (2007). Use of outdoor spaces and microclimate in a mediterranean urban area. Building and environment, 42 (10), 3691–3707.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12 , 2825–2830.

Peng, C.-Y. J., Lee, K. L., & Ingersoll, G. M. (2002). An introduction to logistic regression analysis and reporting. The journal of educational research, 96 (1), 3–14.

Ringnér, M. (2008). What is principal component analysis? Nature biotechnology, 26 (3), 303–304.

Sailor, D. J. (2014). Risks of summertime extreme thermal conditions in buildings as a result of climate change and exacerbation of urban heat islands. Building and Environment , 78 , 81–88.

Schiavon, S., & Lee, K. H. (2013). Dynamic predictive clothing insulation models based on outdoor air and indoor operative temperatures. Building and Environment, 59, 250–260.

Sioni, R., & Chittaro, L. (2015). Stress detection using physiological sensors. Computer , 48 (10), 26–33.

Sperrin, M., Marshall, A. D., Higgins, V., Renehan, A. G., & Buchan, I. E. (2016). Body mass index relates weight to height differently in women and older adults: serial cross-sectional surveys in england (1992–2011). Journal of Public Health, 38 (3), 607–613.

Tatterson, A. J., Hahn, A. G., Martini, D. T., & Febbraio, M. A. (2000). Effects of heat stress on physiological responses and exercise performance in elite cyclists. Journal of Science and Medicine in Sport , 3 (2), 186–193.

Tawatsupa, B., Yiengprugsawan, V., Kjellstrom, T., Seubsman, S.-a., Sleigh, A., Team, T. C. S., et al. (2012). Heat stress, health and well-being: findings from a large national cohort of thai adults. BMJ open, 2 (6).

Yamamoto, S., Iwamoto, M., Inoue, M., & Harada, N. (2007). Evaluation of the effect of heat exposure on the autonomic nervous system by heart rate variability and urinary catecholamines. Journal of occupational health, 49 (3), 199–204.

Yu, Z., Haghighat, F., Fung, B. C., & Yoshino, H. (2010). A decision tree method for building energy demand modeling. Energy and Buildings, 42 (10), 1637–1646.

Zhou, W., & Yamamoto, S. (1997). Effects of environmental temperature and heat production due to food intake on abdominal temperature, shank skin temperature and respiration rate of broilers. British Poultry Science, 38 (1), 107–114.

Figure 6.1: Distribution of features between two classes where shadier denotes the adaptive class and the error lines are the standard deviation and the red lines show the average of decision type over all features

Feature                            Mean shorter class    Mean shadier    std shorter    std shadier    p-value
Time lived in Singapore            15.0                  15.1            12.1           13.5           0.95
Age                                27.0                  28.5            5.6            8.2            0.16
BMI                                22.5                  22.1            3.33           2.83           0.43
Singapore climate is on overall    4.05                  4.14            0.82           0.76           0.47
Singapore climate is today         4.12                  4.09            0.94           0.87           0.80
Exercise                           604.9                 510.4           316.2          264.6          0.053
Liquid drank                       0.92                  0.87            0.63           0.72           0.67
Insulation from clothing           0.209                 0.244           0.084          0.107          0.0204
WBGT sun average                   32.1                  32.0            1.91           2.02           0.54
EDA peak frequency                 0.378                 0.375           0.292          0.194          0.93
Heart rate                         110.2                 109.2           15.9           14.5           0.70
Skin temperature                   33.5                  33.6            1.18           0.94           0.64
Duration tasks                     1587.3                1592.9          500.8          472.7          0.94

Table 6.1: The means of the distribution of the numeric features between the two classes

Feature                 % took adaptive option    % took non-adaptive option    % adaptive decision compared over all features
Is male                 76.6%                     23.4%                         +3.4%
Is Chinese              72.3%                     27.7%                         -0.8%
Is other (Asian)        66.7%                     33.3%                         -6.5%
Is Malay                100.0%                    0.0%                          +26.9%
Is Indian               72.9%                     27.1%                         -0.2%
Is Eurasian             66.7%                     33.3%                         -6.5%
Is other (Caucasian)    78.6%                     21.4%                         +5.4%

Table 6.2: The means of the distribution of the categorical features between the two classes of the classification task

Appendix B

1. Your gender

2. Your age

3. Your nationality, race

4. Your weight

5. Your height

6. How long have you lived in Singapore?

7. My mood now is

8. My focus now is

9. I am feeling now

10. How long ago was your last meal?

11. How long ago was your last drink?

12. How much liquid did you drink today?

13. Upper body

15. Feet

16. Singapore climate is on overall

17. Singapore climate is today

18. I regard the climate in Singapore as (On overall)

19. I regard the climate in Singapore as (Today)

20. On overall I would prefer these parameters of Singapore environment to be (Air temperature)

21. On overall I would prefer these parameters of Singapore environment to be (Sun radiation)

22. On overall I would prefer these parameters of Singapore environment to be (Wind)

23. On overall I would prefer these parameters of Singapore environment to be (Humidity)

24. On overall I would prefer these parameters of Singapore environment to be (Sound)

25. On overall I would prefer these parameters of Singapore environment to be (Crowdedness)

26. On overall I would prefer these parameters of Singapore environment to be (Pollution)

27. What type of activities are you performing in the outdoor environments of Singapore (Walking for commute / studies)?

28. What type of activities are you performing in the outdoor environments of Singapore (Walking for leisure)?

29. What type of activities are you performing in the outdoor environments of Singapore (Hiking)?

30. What type of activities are you performing in the outdoor environments of Singapore (Physical exercises)?

31. What type of activities are you performing in the outdoor environments of Singapore (Running (jogging))?

32. What type of activities are you performing in the outdoor environments of Singapore (Sport games)?

33. How much time a day do you usually spend outdoors on weekdays?

34. How much time a day do you usually spend outdoors on weekends?

35. I use air conditioner

36. I took this route for task .. because it was

37. I feel that at task ... climate was

38. At task ... the climate was
