
MSc Artificial Intelligence

Intelligent Systems Track

Decision-Making under Uncertainty for Social Robots

by Richard Pronk (6382541)
Master's thesis, 42 EC

Master's programme Artificial Intelligence, University of Amsterdam

A thesis submitted in conformity with the requirements for the degree of MSc. in Artificial Intelligence

Supervisors:

João Messias and Maarten van Someren
Informatics Institute, Faculty of Science,
Universiteit van Amsterdam
Science Park 904, 1098 XH Amsterdam


Abstract

In this thesis, we explore the problem of learning social behaviour for a semi-autonomous telepresence robot. We extract social cost functions from instantaneous and delayed feedback obtained from real-world experiments, so as to establish a performance metric for the behaviour of our autonomous agent in a social setting. Based on the identified social cost functions, we extract additional insight regarding the normative behaviour of the agent as a response to perceived social cues, and discuss the relevant state features to be used for autonomous control in the task of social positioning during conversation. We also propose a discretized proxemics-based state space representation for this task given the feedback collected during our experiments, and perform a statistical analysis on the relevance of the state features in a group conversation setting.

In the second part of this work, we compare and discuss different approaches to the problem of obtaining a control policy for our task of positioning the robot during social conversation. This is a challenging task due to the limited information available to the agent during execution. Specifically, we study the performance as well as the ease of implementation of Partially Observable Markov Decision Processes (POMDPs) against a Long Short-Term Memory (LSTM) recurrent neural network. We show, by validating both methods in real-world experiments, that the LSTM outperforms the POMDP in our task, and we offer explanations for that result which can help guide the deployment of these frameworks for more general problems of decision-making under uncertainty in the field of Robotics.


Contents

1 Introduction
  1.1 The TERESA project
  1.2 Robot hardware and software
  1.3 Notation conventions

2 Social cost functions
  2.1 Background
  2.2 Defining the state space
  2.3 Initial experiment
    2.3.1 Experimental set-up
    2.3.2 Results
  2.4 Group conversations
    2.4.1 Experiment overview
    2.4.2 Results

3 Decision making frameworks
  3.1 Background
    3.1.1 Markov Decision Processes
    3.1.2 Recurrent Neural Networks
  3.2 Defining the state and action space
  3.3 Semi-asynchronous execution
  3.4 Initial experiment (data-gathering)
  3.5 Implementation of frameworks
    3.5.1 Partially Observable Markov Decision Process (POMDP)
    3.5.2 Long Short Term Memory (LSTM)
  3.6 Comparing POMDP and LSTM

4 Conclusion

5 Future Work

A Deploying the TERESA code
  A.1 Simulator


1 Introduction

Robotics is becoming more and more prevalent in real-world settings [27], and robots have shown the potential to assist in many real-world tasks, e.g. helping the elderly [5]. In these kinds of tasks, robots act in open, human-populated environments, which has been shown to be difficult due to the highly unpredictable structure of such environments [28]. Next to the issue of acting in a noisy environment, behaving in a socially intelligent manner has also been shown to be an important factor for the acceptance of an autonomous agent in a human-populated environment [31, 40].

This thesis revolves around the TERESA¹ project, a project aiming to develop a telepresence system meant to assist the elderly in maintaining their social activities despite their reduced mobility. To achieve this, a telepresence robot is used which is equipped with semi-autonomous behaviours (i.e. behaviours for which the autonomous agent takes over some of the low level tasks) to assist the elderly in controlling the robot. It has been shown that elderly users have difficulties with controlling the robot while simultaneously conversing [11], but semi-autonomy could relieve them of the cognitive effort of controlling the robot.

The autonomous behaviour of the TERESA robot must be socially intelligent, meaning that the robot should behave in a socially acceptable manner. Social acceptability, however, is a subjective measure of performance, and it is therefore hard to define in a way that can be used as a basis for the autonomous decision-making of a robot. Real-world experiments are therefore needed in order to obtain quantitative measures of socially acceptable behaviour for our robot.

Next to the issue of defining a performance measure for the robot, the robot itself also has incomplete information regarding its situation. This means that the robot does not know the true state of its environment, and it is therefore possible for the robot to receive false or incomplete information. The algorithms created for the autonomous agent should therefore be able to handle uncertainty due to the imperfect sensors of the robot.

This thesis is divided into two main parts: in the first part, we explore the definition of social correctness for our robot, from which we will extract social cost functions. In the second part, we discuss the development of a decision-theoretic controller for the behaviour of the robot during conversation and compare it to a state-of-the-art recurrent neural network on the same problem.

In this section, we first give a general introduction to the TERESA project in Section 1.1, where we explain the motivation and contributions of this thesis to the TERESA project. Next, we give an overview of the robot's hardware and software in Section 1.2, and finally we provide an overview of the notation used regarding the robot in Section 1.3.

¹ TERESA: TElepresence REinforcement-learning Social Agent. Project reference:

1.1 The TERESA project

The TERESA project aims to develop a socially intelligent, semi-autonomous telepresence system that allows a remote controller to be more physically present than with standard teleconferencing, while minimizing the amount of input required from the controller.

The idea is that the TERESA robot will take over low level tasks such as navigation and positioning, so that the controller only has to indicate who he/she wants to communicate with. Socially acceptable behaviour for the robot is, however, difficult to define manually and should therefore be learned from real-world experiments.

The TERESA project contains many challenges, such as controlling the motion of the robot in a socially intelligent manner, as well as perceptual problems such as the robust detection of the persons around the robot, the detection of social cues such as hearing problems, and the extraction of feedback from facial expressions.

This thesis focuses mainly on the social conversation aspect of the project, where we explore the social correctness of the positioning and behaviours of the robot during conversation by performing real-world experiments.

This problem could in principle be addressed with solutions from the fields of Robotics or Control Theory, combined with a hard-coded or expert-defined measure of social correctness. The TERESA project, however, aims to use reinforcement learning to overcome these challenges, as these methods are especially suited to learning policies for decision-making under uncertainty that are difficult to prescribe manually.

The main contributions of this thesis to the TERESA project are:

- Social cost functions which are required to define the social correctness during conversation.

- A comparison between two different types of low level controllers, to assist future development of such a controller for the TERESA project.


(a) Front view of the robot (b) Side view of the robot

Figure 1: Front and side view of our robot

1.2 Robot hardware and software

The robot that we are using is built upon the commercially available Giraff platform, with an additional computer mounted on the side of the robot. In our experiments, the computer present within the Giraff robot is only used for the video-conferencing software; this computer does not control the robot, although it is capable of doing so.

The Giraff robot is by default equipped with a fish-eye camera and an on-board microphone, which are again only used by the video-conferencing software and are not used by our algorithms directly. Next to the camera and microphone of the Giraff robot, the following additional sensors are connected to the TERESA robot:

• 2 Hokuyo UST 10X laser range-finders
• Asus Xtion PRO LIVE (RGBD sensor)
• VoiceTracker II array microphone
• XSense MTi-30 AHRS gyroscope

The two additional laser range-finders provide a 360-degree field of view around the robot, whereas the additional RGBD sensor only faces forward (see Figure 1). The additional computer mounted on the side of the robot runs Ubuntu 14.04 LTS (Trusty) and uses the Robot Operating System (ROS) [25], version Indigo, as the framework for our software.

In our experiments, third-party detection software from the SPENCER project³ [19, 20] is used for the detection of the interaction targets around the robot.

The self-localization of the robot within the room is done using Hector SLAM [18], which simultaneously maps the room. This mapping and self-localization are mainly used for the post-hoc annotation of the experiments and are not used by the algorithms discussed throughout this thesis.

1.3 Notation conventions

In this section, we provide some of the notation used throughout this thesis related to the positioning and orientation of the robot and the interaction target. Throughout this thesis, the robot is denoted as $R$ and the interaction target as $T$. Furthermore, we denote positions as $p \in \mathbb{R}^2 := [x\ y]^T$ and postures as $q \in \mathbb{R}^2 \times [-\pi, \pi] := [p^T\ \theta]^T$. The position of the interaction target, for example, can therefore be represented as $p_T$. This representation, however, does not yet specify in which frame of reference the position of the interaction target is given.

There are two frames of reference used to represent the positions of objects of interest in our control tasks: the world frame w and the robot frame r. The world frame is fixed to a static point within the operational environment of the robot (e.g. an arbitrarily chosen point within a room).

The robot frame, on the other hand, is fixed to the robot itself (with the x-axis facing forward), where the position of the robot is represented within the world frame. The position of the interaction target $T$ can therefore be represented in both the world frame w and the robot frame r (i.e. $p^w_T$ for the world frame and $p^r_T$ for the robot frame).

In Figure 2, an overview of the world frame is given. The posture of the robot and the position of the interaction target relative to the world frame w and to the robot frame r are defined as follows:

$$q^w_R = \begin{bmatrix} x^w_R \\ y^w_R \\ \theta^w_R \end{bmatrix}, \qquad p^r_T = \begin{bmatrix} x^r_T \\ y^r_T \end{bmatrix}, \qquad p^w_T = \begin{bmatrix} \cos(\theta) & -\sin(\theta) & x^w_R \\ \sin(\theta) & \cos(\theta) & y^w_R \end{bmatrix} \begin{bmatrix} x^r_T \\ y^r_T \\ 1 \end{bmatrix}$$

Figure 2: Overview of the world frame (top-down view)

Given this representation, the distance between the interaction target $T$ and the robot $R$ can be calculated as follows:
$$d(T, R) = \sqrt{(x^w_T - x^w_R)^2 + (y^w_T - y^w_R)^2} = \sqrt{(x^r_T)^2 + (y^r_T)^2}. \tag{1}$$

³ SPENCER: Social situation-aware perception and action for cognitive robots. Project

In this thesis, the orientation of the interaction target is not used. Rather, we are interested in the bearing of the robot with respect to the position of the interaction target, $\varphi$, which can be calculated as follows:
$$\varphi = \arctan(y^w_T - y^w_R,\ x^w_T - x^w_R) - \theta = \arctan(y^r_T,\ x^r_T). \tag{2}$$
As can be seen in Equation 2, the bearing with respect to the interaction target in the robot frame can be calculated by a straightforward (two-argument) arctangent based on the position of the interaction target. In the world frame, however, the position $[x_R, y_R]$ and orientation $\theta_R$ of the robot also need to be taken into consideration during the calculation.
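To make the notation concrete, the short Python sketch below (illustrative only, not part of the TERESA codebase; the function and variable names are our own) computes the distance of Equation 1 and the bearing of Equation 2 from a world-frame robot posture and target position.

```python
import numpy as np

def target_in_robot_frame(q_R_w, p_T_w):
    """Transform the target's world-frame position p_T^w into the robot frame."""
    x_R, y_R, theta = q_R_w
    c, s = np.cos(theta), np.sin(theta)
    dx, dy = p_T_w[0] - x_R, p_T_w[1] - y_R
    # Inverse of the rotation used in the posture equation above (world -> robot frame).
    return np.array([c * dx + s * dy, -s * dx + c * dy])

def distance_and_bearing(q_R_w, p_T_w):
    """Equation 1 (distance) and Equation 2 (bearing), evaluated in the robot frame."""
    x_r, y_r = target_in_robot_frame(q_R_w, p_T_w)
    d = np.hypot(x_r, y_r)          # d(T, R) = sqrt((x_T^r)^2 + (y_T^r)^2)
    phi = np.arctan2(y_r, x_r)      # bearing of the target with respect to the robot
    return d, phi

# Robot at (1, 1) facing +90 degrees, target at (1, 3): distance 2.0, bearing 0.0.
print(distance_and_bearing(np.array([1.0, 1.0, np.pi / 2]), np.array([1.0, 3.0])))
```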

2 Social cost functions

One of the objectives of the TERESA project is to find the optimal position of the robot with respect to the interaction target (i.e. the distance and bearing) during conversation. In order for an agent to be able to behave in an optimal manner, however, it requires a performance measurement that describes the relative utility of the different situations that the robot may encounter. For example, a performance measurement might describe that reaching the end of a maze is a desirable state to be in, whereas being in any other state of the maze is undesired. Knowing this, the agent can optimize its behaviour by taking the actions that lead to the end of the maze the quickest while at the same time minimizing the number of visited states that are undesired. Note that a performance measurement describing the undesirability of states essentially describes the cost of being in a certain state. These performance measurements are therefore also called cost functions, as they provide a mapping from states to costs (see Section 2.1).

In our task, the agent acts in a human-populated environment where it must behave in a socially acceptable manner. This means that the performance measurement for our agent should include a measurement of social acceptability. Cost functions that include this measurement of social acceptability to derive the cost of a state are also called social cost functions. Although various studies have already been performed on finding the optimal position of a robot during human-robot interaction [29, 33], we cannot assume that the same methods can be applied to our telepresence robot. With a telepresence robot it is, for example, important that the interaction target as well as the pilot are able to clearly see and hear each other during the teleconference. This requirement might not be present for different types of robots, leading to completely different social cost functions.

In this section, we provide a method for extracting these social cost functions from real-world experiments in order to obtain quantitative measures of socially acceptable behaviour for our telepresence robot. First, we extract the social cost functions given instantaneous negative feedback for a single interaction target setting in Section 2.3. Next, in Section 2.4, we extract social cost functions for the positioning of the robot within a group conversation setting (i.e. situations with more than one interaction target). Here, we extract the social cost functions from delayed feedback provided by the interaction target at the end of each episode, consisting of an evaluation of the overall quality of the behaviours of the robot.

2.1 Background

A cost function provides a mapping from a state s ∈ S to a cost, $C : S \to D$, where this cost D could have either a discrete (e.g. D ∈ {low, normal, high}) or a continuous (e.g. $D \in \mathbb{R}$) representation. In our particular setting, we have two features that influence the cost of each state: a continuous cost for the bearing as well as a continuous cost for the distance with respect to the interaction target. This means that we have a cost function mapping from the bearing state $s_b \in S_b$ to a cost $C_b : S_b \to \mathbb{R}$ and a mapping from the distance state $s_d \in S_d$ to a cost $C_d : S_d \to \mathbb{R}$, with $S = S_d \times S_b$. The cost for a state s ∈ S is then given by some combination of the costs for $s_b \in S_b$ and $s_d \in S_d$ (e.g. the average of the two costs). See Section 2.2 for a definition of this state space.


(a) The TERESA robot in a simulation environment

(b) Obstacle avoidance cost function. Higher costs are shown in red, and lower costs in blue.

Figure 3: Obstacle avoidance cost function used for navigation within the TERESA project.

These cost functions can be used by decision-theoretic solvers to derive an optimal policy, where the cost functions provide the cost of being in a certain state. The task of navigating a robot through a room, for example, uses a cost function for obstacle avoidance, where there is a high cost around objects and a low cost in areas without any obstacles (see Figure 3). In this figure, the colours surrounding the detected objects (i.e. the walls and the poles of the table) indicate the cost of the position of the robot. Given this cost function, a straightforward path-planning algorithm can then be applied to calculate the route with the lowest cost.

When reasoning within a human-populated environment, however, things get significantly more complicated, as we cannot fully prescribe the rules according to which the agent should behave around humans. For obstacle avoidance, an obstacle buffer cost can be defined so that the autonomous agent can safely navigate around obstacles without hitting anything. When working in a human-populated environment, however, the social acceptability of the robot's behaviour to the surrounding humans should also be taken into consideration.

Social navigation is a very active research area that also incorporates social cost functions to allow for safe and socially intelligent navigation in a human-populated environment.

Social cost functions could, for example, be used to incorporate human interactions into the planning phase, where methods that learn from expert examples, such as inverse reinforcement learning, could be used [23]. In inverse reinforcement learning, demonstrations by an expert are used to learn the robot's behaviour when faced with different state features. Here, we try to extract a policy that mimics the observed behaviour (i.e. the demonstrations, which are considered to be the optimal behaviour) and use this to represent it as a cost function that describes a measurement of social correctness (i.e. what behaviours should be performed when acting in a human-populated environment and facing the different state features).

Another method could be to add a personal-space cost function to assist the planner in avoiding the personal space of a person, where, for example, Gaussian functions centered on the person can be used to define the social cost [17].

In our problem, however, we are not dealing with a navigation task but rather with the positioning of the robot during conversation. Here, not only the distance $d(T, R)$ is of importance, but also the bearing $\varphi$ of the robot $R$ with respect to the interaction target $T$.

The social cost functions will in our particular case be learned from instantaneous feedback as well as delayed feedback given over an episode. In the instantaneous feedback case, we have key presses indicating negative feedback for a specific state. This negative feedback over states is then used to provide a mapping from states to costs. In the delayed feedback case, on the other hand, a grade based on the social correctness of the agent is given over an episode containing multiple states. In that case, we assign the grade that was given in that episode to all of the visited states, in order to provide a mapping from states to costs.

2.2 Defining the state space

Before we can extract the social cost functions, we first need to define the state space for our task. Previous research by our partners at the University of Twente [34] has already shown that the relevant features for the positioning of the TERESA robot during conversation are given by the bearing and distance relative to the interaction target. Using these features, a valid state representation of the robot in the conversation setting could in principle be defined as the real-valued distance and bearing towards the interaction target. It has, however, been shown that, due to the so-called "curse of dimensionality", working in a continuous state space in real-world robotic applications can lead to intractable problems for decision-theoretic solvers [10].

A solution to this problem is to discretize the state space. An example of a popular discretized environment is the grid world, where the state space consists of the cells within a grid (see Figure 4). In this world, the task of the robot R might be to approach the goal G where the agent can only be located in one of the cells.


Figure 4: Example of a discretized environment (grid world)

Take, for example, the phrase "You are invading my personal space", which indicates that there is a boundary around what you consider to be your personal space. The distance boundary of what we consider to be comfortable depends on the nature of the relation between the participants in a conversation. Two very good friends, for example, might want to conduct a conversation while standing very close together. When the participants do not know each other, on the other hand, this same distance might be uncomfortable. The study of proxemics [12] has identified a discrete set of distance intervals for different types of interaction between humans. These distances are defined as the personal, social, and public proxemic zones.

This lends support to the hypothesis that a similar discrete representation of the distance space would be a natural basis for the domain of this variable within our social conversation task. These identified proxemic zones, however, rely on the interaction taking place between people, rather than with a user interacting through a robot. To overcome this issue, we have attempted to learn a discretized proxemics-based representation of the distance space for a human interacting with a telepresence robot.

During our experimental runs (see Section 2.3), we collected the distances between the robot and the interaction target, as measured by the robot's laser range-finder. For the purpose of state space identification, the relevant portion of this dataset concerns the segments of the experimental runs where we asked the subjects to approach the robot to what they considered to be the normal, minimum and maximum interaction distances. Note that these interaction distances will be used as a reference for the autonomous control of the robot. In order to identify the interaction ranges, we extracted the set of final distances of the subjects after their approaches to the robot as $P = \{p_1, \dots, p_N\}$ with $p_i \in \mathbb{R}$. Here, a final distance is defined as the distance between the interaction target and the robot after the participant finished his or her approach towards the robot. Based on the final distances of these approaches $P$, we ran K-means clustering [13] with $K = 3$ on the dataset $P$ to identify the centroids of these interaction distances. The centroids of the resulting clusters, based on 31 approaches, are as follows:
$$\rho_{\text{close}} = 0.71\,\text{m}, \qquad \rho_{\text{normal}} = 1.16\,\text{m}, \qquad \rho_{\text{far}} = 1.68\,\text{m}.$$

We then define the discrete distance state of the robot during social conversation as $S_d = \{d_{\text{personal}}, d_{\text{social}}, d_{\text{public}}\}$. Let $F_\rho : \{d_{\text{personal}}, d_{\text{social}}, d_{\text{public}}\} \to \{\rho_{\text{close}}, \rho_{\text{normal}}, \rho_{\text{far}}\}$ be a bijection where $F_\rho(d_{\text{personal}}) = \rho_{\text{close}}$, $F_\rho(d_{\text{social}}) = \rho_{\text{normal}}$ and $F_\rho(d_{\text{public}}) = \rho_{\text{far}}$. Then, during execution, the distance state of the robot can be determined as:
$$\arg\min_{s_d \in S_d} |\rho_t - F_\rho(s_d)| \tag{3}$$
where $\rho_t$ is the real-valued distance at time $t$.
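As an illustration of the two steps above, the following sketch (assuming scikit-learn and NumPy are available; the approach distances used here are made up, not the experimental data) clusters a set of final approach distances with K-means and then applies Equation 3 to discretize a measured distance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative final approach distances (metres), not the experimental data.
approach_distances = np.array([0.70, 0.72, 0.69, 1.10, 1.20, 1.15, 1.70, 1.65, 1.72])

kmeans = KMeans(n_clusters=3, n_init=10).fit(approach_distances.reshape(-1, 1))
rho = np.sort(kmeans.cluster_centers_.ravel())       # [rho_close, rho_normal, rho_far]
F_rho = dict(zip(["d_personal", "d_social", "d_public"], rho))

def distance_state(rho_t):
    """Equation 3: pick the state whose centroid is closest to the measured distance."""
    return min(F_rho, key=lambda s_d: abs(rho_t - F_rho[s_d]))

print(rho)                    # centroids of the three interaction ranges
print(distance_state(0.95))   # a measured distance of 0.95 m maps to the nearest centroid's state
```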

Next to the discretized distance state space based on the identified interaction ranges, we also discretize the bearing relative to the interaction target, to extract a discrete social cost function for the bearing that can be used by decision-theoretic solvers during planning. A straightforward solution would be to divide the full angular range into equally-sized intervals. This would, however, lead (depending on the size of these intervals) to either many states with high precision or a small number of states with low precision. One could argue that, because we are working with a telepresence robot, the front of the robot should have a high precision because of the viewability of the screen. The social cost of the interaction target standing beside or behind the robot, on the other hand, should not differ much, as in both cases the screen cannot be viewed, which makes teleconferencing impossible.

Based on this idea, we have divided the angular range in front of the robot into steps of 10 degrees in width, up to $50°$. That is, let
$$\Phi := \{-50°, -40°, -30°, -20°, -10°, 0°, 10°, 20°, 30°, 40°, 50°\}$$
and $F_\varphi : \{s_{b_i}\}_{i=0,\dots,10} \to \Phi$, where $\{s_{b_i}\}_{i=0,\dots,10} =: S_b$ is the discrete bearing state space. Then the bearing state at time $t$ is defined as:
$$\arg\min_{s_b \in S_b} |F_\varphi(s_b) - \deg(\varphi_t)| \tag{4}$$
where $\varphi_t$ is the bearing at time $t$, and $\deg(\varphi_t) = \varphi_t \times \frac{180}{\pi}$.

Another important fact regarding the bearing is that the robot is symmetric and that the conversation task, a priori, has no reason to favour one angular direction over the other. Therefore, we assume that the social cost should also be symmetric with respect to the bearing (i.e. the cost at the left of the robot should equal the cost at the right of the robot). This requires less experimental data and removes possible biases during the experiments. During execution, however, the actions of the robot naturally should depend on the actual direction of the interaction target. In Figure 5, the different bearing states are represented in grayscale, and this symmetry can also be seen.

Figure 5: Angular state representation in different shades of gray

In order to obtain the full state space of the robot, the state spaces based on distance $S_d$ and bearing $S_b$ are combined to form the state space $S$. This combination of the possible distance states in $S_d$ and bearing states in $S_b$ gives us a total of 18 states in our state space $S = S_d \times S_b$.
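A minimal sketch of how such a discretized state could be computed in practice is given below; the helper names are illustrative, and clamping bearings larger than 50° to the last bin is our own assumption based on the symmetry argument above.

```python
import numpy as np

DISTANCE_STATES = ["d_personal", "d_social", "d_public"]
BEARING_STATES_DEG = [0, 10, 20, 30, 40, 50]     # one side only, by the symmetry assumption

def bearing_state(phi_t):
    """Equation 4 with symmetry: fold the bearing onto [0, 50] degrees and snap to the nearest bin."""
    deg = abs(np.degrees(phi_t))
    deg = min(deg, 50.0)                         # beside/behind the robot is treated as the last bin
    return min(BEARING_STATES_DEG, key=lambda b: abs(b - deg))

def full_state(s_d, phi_t):
    """Index into the 18 states of S = S_d x S_b."""
    return DISTANCE_STATES.index(s_d), BEARING_STATES_DEG.index(bearing_state(phi_t))

print(full_state("d_social", np.radians(-23.0)))   # -> (1, 2): social distance, 20-degree bearing bin
```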

2.3 Initial experiment

First, an initial experiment was conducted to learn social cost functions for the TERESA robot based on instantaneous negative feedback. These social cost functions are used to provide a performance measurement for the robot in a social setting. For instance, we want to learn at what distance the robot should approach the interaction target in different situations, e.g. whether or not hearing problems are detected. The detection of these kinds of situations is part of the research of our partners at the University of Twente. We, however, want to learn what the optimal distance with respect to the interaction target is when such a situation occurs. Another important feature for social positioning, as indicated by our partners [34], is the bearing of the robot with respect to the interaction target during a conversation.

Given these two features (i.e. distance and bearing), we want to learn for each state s ∈ S how socially acceptable it is to be in that state. This measurement of correctness will be extracted from negative feedback provided by the pilot, and will be turned into cost functions that can be used by decision-theoretic controllers.

Figure 6: Layout of the initial experiment

2.3.1 Experimental set-up

In the initial experiment, a "Wizard of Oz" setup is used. In this type of setup, the robot is controlled by an expert located in an isolated area. In our experiment, the expert acts with different degrees of quality in order to trigger a response of the pilot to the behaviour of the robot. The layout of the experiment (two rooms with an additional isolated area for the expert) can be seen in Figure 6. In this figure and throughout the rest of the thesis, the following definitions are used:

- pilot: the person sitting behind the computer and communicating through the robot
- interaction target: the person physically present with the robot who is communicating with the pilot
- experimenter: the person setting up the robot and instructing the participants
- expert: the person controlling the robot

As can be seen in the experiment layout, an additional camera is installed in the interaction room; this camera is used to provide the expert with a better overview of the room and is additionally used for later validation of the recorded experimental data.

The pilot is asked to give feedback when the robot is doing something which is for him/her socially unacceptable. This feedback is given by pressing a button on the keyboard located in front of the pilot.

During the experiment, we alternate between two different situations. In the first of these, the interaction target is asked to approach the robot as described in Section 2.2. Alternatively, the expert controlling the robot approaches the interaction target with different degrees of quality. After the approaches, the pilot and the interaction target continued the conversation, during which the expert executed various socially acceptable and unacceptable behaviours. Acceptable behaviours include maintaining the correct body pose and correctly changing the distance based on social cues (e.g. hearing problems), whereas the unacceptable behaviours include suddenly turning away and randomly changing the distance to the interaction target.

In order to explore how the robot should react when a social cue is triggered (e.g. hearing problems), artificial background noise (i.e. sounds of a cafeteria) was introduced during the experiment to simulate such a situation. At the end of the experiment, an interview was also conducted to obtain some additional insight into the behaviour of the robot.

2.3.2 Results

The first result from our initial experiment is the set of identified centroids ρ of the distance ranges for our distance states $s_d \in S_d$, as discussed in Section 2.2. Here, the final distances of the approaches by the interaction targets, $P$, were used to define our state space based on the Euclidean distance relative to the interaction target, $S_d$. This process was required because we did not know at what distance the robot should approach the interaction target for the different interaction ranges (i.e. close, normal and far).

Given these centroids ρ, the distance states $s_d \in S_d$ can be defined and visualized as shown in Figure 7a. In this figure, the interaction target is indicated by the purple human-like shape, where the arrow indicates its orientation. The different distance states $s_d \in S_d$, with $S_d = \{d_{\text{personal}}, d_{\text{social}}, d_{\text{public}}\}$, are indicated by the circles surrounding the interaction target in orange, yellow and red respectively. Note that the red area extends throughout the entire operational environment of the robot, but that the plot has been cut off to a square of $\rho_{\text{far}}$ by $\rho_{\text{far}}$ for visualization purposes.

(a) Normal situation (b) Hearing problems detected

Figure 7: State representation for the distance

In the presence of background noise, the interaction targets approached the robot significantly closer than their normal approach distance. Here, the participants did not receive any explicit instructions but rather acted on the presence of the artificially introduced background noise (i.e. sounds of a cafeteria). All of the interaction targets (re-)approached the robot at a distance between their normal approach distance and the distance they defined as close but still socially acceptable. Also, because we did not instruct the participants to approach the robot at a certain distance range, we know that the distance at which they approached the robot (while the background noise was being played) should be their normal approach distance.

Given the final distances of these approaches in the presence of background noise, $P_{\text{noise}} = \{p_1, \dots, p_N\}$ with $p_i \in \mathbb{R}$, we can update the centroid $\rho_{\text{normal}}$ to identify the proxemics-based distance ranges when hearing problems are detected. The centroid $\rho'_{\text{normal}}$ is calculated as the average distance of the approaches in $P_{\text{noise}}$ (i.e. $\rho'_{\text{normal}} = \text{avg}(P_{\text{noise}}) = 0.75\,\text{m}$), whereas the centroids $\rho_{\text{close}}$ and $\rho_{\text{far}}$ are not altered (i.e. $\rho'_{\text{close}} = \rho_{\text{close}}$ and $\rho'_{\text{far}} = \rho_{\text{far}}$):
$$\rho'_{\text{close}} = 0.71\,\text{m}, \qquad \rho'_{\text{normal}} = 0.75\,\text{m}, \qquad \rho'_{\text{far}} = 1.68\,\text{m}.$$

When visualizing the distance ranges (see Figure 7b), we can see that the ranges are located significantly closer to the interaction target in the presence of background noise. This agrees with our intuition that the robot should approach the interaction target closer when the interaction target has difficulties hearing the pilot. These distance ranges can again be used by the robot to know at which distance it should approach the interaction target when hearing problems are detected.

For the second result of our initial experiment, we focus on extracting social cost functions given the negative feedback provided by the pilot. In our experiments, the test participants pressed the negative feedback button a total of 65 times. As not every test subject pressed the feedback button an equal number of times, normalization is applied to distribute the weights of the key presses more fairly. This means that the 2 key presses of one participant are assumed to be as informative as the 8 key presses of another participant.
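The following sketch (with a made-up press log) illustrates one way to implement this normalization: each participant's presses are down-weighted by that participant's total number of presses before the per-state feedback distribution is computed.

```python
from collections import Counter, defaultdict

# Illustrative (participant_id, state) pairs, one per negative-feedback key press.
presses = [(1, "d_public"), (1, "d_public"), (2, "d_public"), (2, "d_social"),
           (2, "d_personal"), (2, "d_public"), (3, "d_personal")]

presses_per_participant = Counter(pid for pid, _ in presses)

feedback = defaultdict(float)
for pid, state in presses:
    # Each participant contributes a total weight of 1, however often they pressed.
    feedback[state] += 1.0 / presses_per_participant[pid]

total = sum(feedback.values())
normalised = {state: weight / total for state, weight in feedback.items()}
print(normalised)   # share of normalised negative feedback per state
```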

(a) Normalised number of key presses (b) Cost function based on distance

Figure 8: Negative feedback from the pilot

In Figure 8a, the normalised negative feedback (button presses) of the pilots is plotted against the states of the robot during those key presses. As can be seen from this figure, there is a very sparse distribution of key presses over the states. The absence of key presses would indicate that the robot is performing in a socially appropriate way. However, our intuition tells us that having a larger angular offset towards the interaction target should correlate with negative feedback. This intuition is also backed up by the findings of our partners [34] and will also be shown in Section 2.4.2.

However, when we look purely at the distribution of explicit feedback over the different distances (the key presses for the different angular offsets summed together), a clear distribution that matches our intuition can be observed. In Figure 8b, the distance of the robot relative to the interaction target is plotted against the normalised explicit feedback. Here we can see that, based on the explicit feedback, being in the interaction range identified as far yields a relatively high probability of being in a socially unacceptable situation. Being in the interaction range identified as the normal conversation distance, on the other hand, yields the smallest probability of being in a socially unacceptable situation.

This distribution can be used as a cost function for the Euclidean distance of the interaction target relative to the robot, and is visualized in Figure 9. In Figure 9a, we show an approach to the interaction target that was performed while the robot was being controlled by the expert. The distance at which the expert finishes the approach can be seen in Figure 9b, and lies within the normal distance state, where our defined cost function also gives the lowest cost.

(a) Pre-approach (b) Approach finished

Figure 9: Social cost based on the distance from the interaction target

To conclude, we obtained the following two results from our initial experiment: 1) we identified the centroids of the distance ranges for our distance state space $S_d$, and 2) we extracted a social cost function based on the Euclidean distance relative to the interaction target. Given these results, we obtained a performance measurement with respect to the distance relative to the interaction target, from which we can see that we obtain the highest reward (i.e. lowest cost) when the distance lies within the normal interaction range.

2.4 Group conversations

Our partners at the University of Twente have already conducted an experiment that focused on the social positioning of the TERESA robot within a group conversation setting [34]. During this experiment, our partners focused on studying the proxemics in the social conversation task through the theory of F-formations [9] (i.e. different spatial arrangements of persons during group conversations) for situations with multiple interaction targets. Although these experiments did not contain instantaneous feedback from the participants (as we had in our experiments described in Section 2.3), they did include an a posteriori evaluation of the quality of the interaction behaviours at the end of each episode.

One of their findings is that, for a single interaction target setting, a larger bearing towards the interaction target negatively correlates with the social correctness of the robot. This would mean that for a single interaction target, maintaining as small a bearing as possible would always be the socially optimal behaviour with respect to the bearing. The bearing alone, however, does not describe the full cost function for the social correctness of the positioning of the robot (i.e. it also includes the cost for the distance). Here, the utility of different bearing states will influence the approach policy of the robot. For example, with a high cost on the bearing and a low cost on the distance, we expect that the robot will first turn towards the interaction target (i.e. to minimize its bearing cost) and afterwards drive to the optimal distance towards the interaction target. When we have a high cost on the distance and a low cost on the bearing, on the other hand, the robot might first drive towards the interaction target and afterwards minimize its bearing, so as to obtain a faster trajectory at the expense of being turned away from the interaction target for a short period of time.

In this section, we focus on extracting social cost functions for the bearing within a group conversation setting, given the dataset provided by our partners. Additionally, we perform a statistical analysis on the relevance of the state features.

A detailed description of the experimental setup and results that were extracted from the experiment by our partners can be found in Deliverable 3.1 [34] of the TERESA project.

2.4.1 Experiment overview

The experiment performed by our partners revolved around the social positioning of the TERESA robot in a group conversation setting. The group conversation consisted of 3 interaction targets and the robot itself. The interaction targets were located at pre-indicated positions, whereas the robot was allowed to move around freely. In this experiment, the participants received the task of solving a murder case, where the controller of the robot could retrieve additional clues by driving to indicated areas in the room. An overview of the interaction area, including the different areas, is given in Figure 10. During the experiment, 9 approach/converse/retreat cycles per group were performed, with a total of 14 groups. At the end of each cycle, the users scored the quality of the interaction on a 1-7 numerical scale, where 7 represents an interaction that is completely socially acceptable and 1 a socially unacceptable interaction. This means that we do not have explicit feedback for the social correctness of a certain state and have to generalize the grade over the states the robot visited during that interaction phase.

Figure 10: Overview of the interaction area. On the circle in the middle, the positions of the interaction targets are indicated (IT1, IT2, IT3); these were projected using a projector mounted to the ceiling, but only in between the approach/converse/retreat cycles. The rectangles near the border of the interaction area indicate the positions of the markers A-H. C1 and C2 indicate the positions of the cameras. Note: figure and description are reprinted from Deliverable 3.1 [34] of the TERESA project.

The tracking of the interaction targets and the robot was done by an OptiTrack motion capture system using 8 infrared cameras, giving very high-accuracy localization. The available data includes the positions as well as the orientations of the interaction targets and the robot at a high frequency throughout the whole experiment. This data is also annotated, indicating the time steps of the different cycles and the grades belonging to those cycles.

2.4.2 Results

In our research we are interested in the social correctness during the conversation phase; the other two interaction phases of the cycle (i.e. the approach and retreat phases) will therefore not be processed in this section. Also, because the grades are given over an entire interaction phase, we have to generalize each grade over the states visited during that interaction phase. This generalization is done by creating a dictionary over the states, where each time a state was visited in an interaction phase, the grade for that interaction phase (which was given after the cycle had ended) is added to the dictionary entry for that state. The social correctness values of different states are compared based on this dictionary.
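A small sketch of this generalization step is shown below (with illustrative episode data, not the experimental dataset); each visited state collects the grade of its episode, and a per-state cost can then be derived as the highest possible grade minus the expected grade, as used later in this section.

```python
from collections import defaultdict

# Illustrative episodes: the grade given at the end of a cycle and the states
# visited by the robot during the conversation phase of that cycle.
episodes = [
    {"grade": 6, "visited_states": ["b0", "b0", "b1"]},
    {"grade": 3, "visited_states": ["b3", "b4", "b4"]},
    {"grade": 5, "visited_states": ["b1", "b2"]},
]

grades_per_state = defaultdict(list)
for episode in episodes:
    for state in episode["visited_states"]:
        # The episode grade is generalized to every state visited in that episode.
        grades_per_state[state].append(episode["grade"])

MAX_GRADE = 7
expected_grade = {s: sum(g) / len(g) for s, g in grades_per_state.items()}
cost = {s: MAX_GRADE - eg for s, eg in expected_grade.items()}
print(expected_grade)
print(cost)   # e.g. a state only seen in the episode graded 3 gets cost 7 - 3 = 4
```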

This experiment was performed on group conversations, whereas our state representation $S_b$ expects a single interaction target. Therefore, in order to extract a social cost function for the positioning of the robot with respect to a single interaction target within a group conversation, the group conversation setting should be converted to our state space $S_b$. In this section, we discuss and compare two possible methods for this conversion:

1. Process the position and grade for each person individually

2. Use the position of the center of the group and combine their grades

(a) Individual bearing state (b) Group bearing state

Figure 11: Cumulative probabilities over grades as a function of the bearing state (a) of each individual participant and (b) with respect to the geometric center of the group. The probabilities are represented in grayscale, where a value of 1 corresponds to white (the cumulative probability of score 7 is always 1) and a value of 0 corresponds to black. The expected value of the grade for each class is denoted by a blue diamond.

In Figure 11, the cumulative probabilities for the grades are plotted against the angular states for both methods, where the expected value of the grade for each bearing state is indicated by the blue diamond shapes.

The first method assumes that the participants graded the conversation based purely on self-interest. However, because we are dealing with a group conversation, it might be possible that the participants graded the social correctness of the robot with respect to the well-being of the group. To test this, we set the following null hypothesis:

H₀ := The bearing with respect to the position of an individual participant makes no difference in the grade regarding the social correctness of the positioning of the robot in a group conversation setting.

More concretely, H₀ states that the grades given for each bearing state in $S_b$ have the same mean when considering each participant individually.

To test this null hypothesis, we ran the Kruskal-Wallis test; this test was chosen due to the differing sample sizes of grades given per state. The test checks whether samples originate from the same distribution and provides a p-value which indicates the significance level of the results. A null hypothesis is rejected when the p-value < 0.05, which indicates that the samples came from different distributions and that the difference was not due to random sampling. A p-value ≥ 0.05, on the other hand, indicates that there is no reason to conclude that we can reject the idea of the samples originating from the same distribution. Note that when the null hypothesis is not rejected, this does not imply that the samples must therefore originate from the same distribution.
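For reference, this is how such a test can be run with SciPy's implementation of the Kruskal-Wallis H-test; the grade lists below are illustrative, not the experimental data.

```python
from scipy.stats import kruskal

# Illustrative grades collected per bearing state; unequal sample sizes are fine for this test.
grades_per_bearing_state = {
    "b0": [6, 7, 5, 6, 6],
    "b1": [5, 6, 6, 4],
    "b2": [4, 5, 3, 4, 5, 4],
}

statistic, p_value = kruskal(*grades_per_bearing_state.values())
print(p_value, "-> reject H0" if p_value < 0.05 else "-> cannot reject H0")
```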

The result of the Kruskal-Wallis test based on hypothesis H₀ is p = 0.14, which means that hypothesis H₀ cannot be rejected. We can therefore say that the participants in this experiment might not have graded the social correctness of the robot purely based on self-interest.

The second method already takes into consideration that the participants were not grading solely based on self-interest. However, one could argue that perhaps the bearing of the robot relative to the center of the group has no influence on the grade at all. To test this, we set the following hypothesis:

H₁ := The bearing of the robot relative to the position of the center of the group makes no difference in the grades given by the participants.

The Kruskal-Wallis test based on this hypothesis gives $p = 1.36 \times 10^{-4}$, which means that this hypothesis is rejected and that the bearing relative to the center of the group does make a difference in the grades given by the participants.

In the case of method 2, the center of the group can be used, as the theory of F-formations suggests that, in group conversations involving three or more persons, the participants tend to position themselves facing the geometric center of the convex hull of their formation [16]. In Figure 12, the positions of the participants and the robot are plotted with an overlay of the three social spaces of a circular F-formation: the o-space, p-space and r-space. Note that the social spaces are not calculated, but rather illustrate the fact that the participants were positioned in a circular F-formation during this experiment.

(a) Raw positioning data (b) Social spaces of an F-formation overlaid on the positions of the participants and the robot

Figure 12: F-formation during the experiment (robot in red, participants in blue, green and yellow; the center of the group is denoted by a cross)

(a) Angular cost for method 1 (b) Angular cost for method 2

Figure 13: Social cost based on the bearing of the robot

In Figure 13, the social costs for both methods, based on the bearing, are plotted. The cost is calculated as the expected value of the grade subtracted from the highest possible grade for that bearing state. Figure 13a displays the cost based on the grades for the bearing states of each participant individually. Here, the cost seems to go down as the bearing towards the interaction target becomes larger; this is very counter-intuitive and supports hypothesis H₀, where we show that the participants might not have graded the performance of the robot purely based on self-interest. In Figure 13b, on the other hand, we can see that the cost goes up as the bearing towards the center of the group becomes larger. This agrees with our own intuition and shows that the bearing relative to the center of the group is a good state feature in the group conversation setting.


3 Decision making frameworks

In previous experiments of the TERESA project [35], we observed that the low level controller in charge of maintaining the correct bearing towards the interaction target showed unexpected behaviour due to noise in the detection and identification of the interaction target. The initial controller assumed the interaction target to be the person closest to the robot. This assumption, however, led to the issue of the robot turning away whenever it received a false positive detection closer than its previous detection. This issue became especially apparent when someone walked closely behind the robot and the robot started turning away from an optimal bearing relative to the interaction target to an optimal bearing towards the person behind the robot.

In order to overcome this issue, we explore and compare two different frameworks which allow for more robust decision making while maintaining the assumption of the interaction target being defined as the closest person to the robot. Note that this issue could in principle be resolved by installing an external high-accuracy tracking system (e.g. the OptiTrack motion capture system) with which the interaction target is always correctly detected and identified. In this project, however, we want to find a solution that uses the sensors of the robot to detect the interaction target. These sensors are simply not perfect, leading to uncertainty during decision making and requiring us to use more robust decision-making frameworks to solve this task.

The first framework that we will discuss is the Markov Decision Process (MDP) [2], a mathematically well-defined model for sequential decision making under uncertainty. The second framework that will be discussed is the recurrent neural network, which is trained on expert examples. The idea is that these frameworks are capable of filtering out the previously mentioned false positives and allow for a more robust low level controller for the task of maintaining the correct bearing towards the interaction target.

The main difference between these two frameworks is that the first framework first learns a model of the environment (for a partially observable setting usually represented as a Partially Observable Markov Decision Process) and then solves this model to find the optimal policy. The recurrent neural network, on the other hand, does not first learn a model of the environment, but rather learns a model end-to-end by minimizing an objective function (i.e. the difference between its own prediction and the expert example). Our initial assumption is that the recurrent neural network will perform better, as it does not require the difficult task of building a model of the environment and can therefore learn a model more optimised for the task to be solved. Note that although we suspect the RNN to outperform the POMDP, learning a model of the environment has the benefit that it can be used to solve multiple tasks by adjusting the reward function.

3.1 Background

3.1.1 Markov Decision Processes

When we make choices, we often need to make a trade-off between immediate rewards and possible future gain. These kinds of trade-offs are present in our everyday lives, like deciding to eat at your local snack bar (high immediate reward) or eating a healthy salad instead (better in the long run). These decisions, however, also bring uncertainty with them, which makes reasoning about which decision to take significantly more complex: you might, for example, get an unexpected allergic reaction from one of the ingredients of the salad.

Markov Decision Processes (MDPs) are designed to model these kinds of problems in uncertain environments and to find an optimal policy which balances immediate and future rewards given state transition probabilities. The policy (i.e. the solution of an MDP) provides a mapping from states to actions. An optimal policy provides the best action to take for each state.

The MDP model contains a discrete set of states S, a discrete set of actions A, a reward function R, a transition function T and a discount factor γ. The model in its original form assumes full knowledge of the state of the environment; however, extensions to the model have been made which also allow for partial observability, as will be discussed later on. The reward function $R : S \times A \to \mathbb{R}$ specifies the immediate reward that the agent receives for performing action a in state s. The transition function $T : S \times A \times S \to [0, 1]$ specifies the probability distribution over the next states, given the current state and the action that is performed. Finally, there is the discount factor $\gamma \in [0, 1)$, which represents the difference in importance between future rewards and present rewards.

The idea is to find a policy π for the decision maker that specifies a mapping from the state space S to the action space A (i.e. π(s) = a for a ∈ A, s ∈ S). Due to the uncertainty over future states, this policy typically maximizes the expected discounted reward (see Equation 5):

$$\mathbb{E}_\pi \left\{ \sum_{t=1}^{\infty} \gamma^t R(s_t, \pi(s_t)) \right\} \tag{5}$$
where $t$ indicates the time step and $t = 1$ refers to the current time step. Note that in our task we are working with an infinite horizon, as we do not know at what time step our interaction with the participant will be finished. The horizon $h$ is defined here as the total number of decisions that the agent is allowed to take; in a finite-horizon setting we know the fixed number of decisions left until termination (i.e. $h - t$ with $t \in [1, h]$). In our task, we consider an infinite horizon, since the task does not have a fixed time limit and the robot should continue to take actions for as long as the interaction is taking place. For simplicity and without loss of generality, we provide all equations assuming an infinite horizon.

In order to measure the reward accumulated by policy π, a value function V is defined that associates the expected objective value with each state s ∈ S while following policy π (i.e. $V^\pi : S \to \mathbb{R}$):
$$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s') V^\pi(s') \tag{6}$$

The Bellman optimality equation [1] states that the optimal value of a state is given by the expected return of the best action in that state. Using this expression, the optimal value function $V^*(s)$ can be calculated as shown in Equation 7:
$$V^*(s) = \max_{a \in A} \left\{ R(s, a) + \gamma \sum_{s'} T(s, a, s') V^*(s') \right\} \tag{7}$$

This optimal value function $V^*(s)$ can then be used to extract the optimal policy $\pi^*$ by taking, for each state, the action with the highest expected reward. This provides us with a policy $\pi^*$ that maps each state s ∈ S to an action a ∈ A that leads to the highest expected reward (see Equation 8):
$$\pi^*(s) = \operatorname*{argmax}_{a \in A} \left\{ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') V^*(s') \right\} \tag{8}$$
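A compact sketch of value iteration implementing Equations 7 and 8 is shown below; the tiny two-state, two-action MDP used in the example is made up for illustration only.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """T[s, a, s']: transition probabilities, R[s, a]: immediate rewards."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V(s')   (Equation 7)
        Q = R + gamma * np.einsum("sap,p->sa", T, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)       # optimal values and greedy policy (Equation 8)
        V = V_new

T = np.array([[[0.9, 0.1], [0.2, 0.8]],          # transitions from state 0, per action
              [[0.0, 1.0], [0.5, 0.5]]])         # transitions from state 1, per action
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V_star, pi_star = value_iteration(T, R)
print(V_star, pi_star)
```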

We can now find the optimal policy π* of a problem that is defined as an MDP in a fully observable domain using dynamic programming. In our task, however, we are working with a robot in a real-world environment, and due to the imperfect and noisy sensors of the robot we are unable to observe the true state of the environment. Luckily, there is an extension to the previously defined MDP framework that allows for decision making without access to the true state, called a partially observable Markov decision process (POMDP) [15]. To do this, the POMDP maintains a belief over the possible states based on the observations that it receives. This belief state is a probability distribution over the states the robot could currently be in, and is updated after receiving a new observation.


More formally, a POMDP contains the same elements as an MDP (i.e. a set of states S, a discrete set of actions A, a reward function R, a transition function T and a discount factor γ), but in addition also contains a discrete set of observations O and an observation function $O : O \times S \times A \to [0, 1]$ that gives the probability of an observation o ∈ O given the next state s' ∈ S and the action a ∈ A.

As the true state is unknown, we maintain a belief state b which, for each state s, gives the probability that state s is the true state (i.e. $b(s) \in [0, 1]$ for $s \in S$ with $\sum_{s \in S} b(s) = 1$). Another difference with respect to the original MDP is that the policy π is now a mapping from beliefs to actions rather than a mapping from states to actions (i.e. π(b) = a for a ∈ A).

The calculation of the expected discounted reward for an MDP depends on the fact that we know the true state. In the case of a POMDP, however, as the true state is not observable, the probability of each state s ∈ S is multiplied with its expected reward to provide an expected discounted reward over the belief state b (see Equation 9):
$$\mathbb{E}_\pi \left\{ \sum_{t=1}^{\infty} \gamma^t \sum_{s \in S} b_t(s) R(s_t, \pi(s_t)) \right\} \tag{9}$$

When defining a problem as a POMDP, we start with some a priori belief $b_0$ over the state in which the agent will start (in our task this is a uniform distribution over all the possible states). Upon receiving a new observation, this belief distribution is then updated using the state transition and observation probabilities, given the action a ∈ A that we were executing and the observation o ∈ O that we received (see Equation 10):
$$b^a_o(s') = \Pr(s' \mid b, a, o) = \frac{O(a, s', o) \sum_{s \in S} b(s) T(s, a, s')}{\sum_{u' \in S} O(a, u', o) \sum_{u \in S} b(u) T(u, a, u')} \qquad \forall s' \in S \tag{10}$$
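The belief update of Equation 10 translates almost directly into code. The following sketch (with illustrative arrays, not our task's model) assumes the transition and observation models are stored as NumPy tensors indexed as T[s, a, s'] and O[a, s', o].

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """b: belief over states, T[s, a, s']: transition model, O[a, s', o]: observation model."""
    predicted = b @ T[:, a, :]               # sum_s b(s) T(s, a, s')
    unnormalised = O[a, :, o] * predicted    # weight by the observation likelihood O(a, s', o)
    return unnormalised / unnormalised.sum() # renormalize over s'

# Tiny illustrative model: two states, one action, two observations.
T = np.array([[[0.9, 0.1]],
              [[0.2, 0.8]]])
O = np.array([[[0.8, 0.2],
               [0.3, 0.7]]])
b0 = np.array([0.5, 0.5])                    # uniform initial belief, as in our task
print(belief_update(b0, a=0, o=1, T=T, O=O)) # belief after observing o = 1
```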

The value function V for a POMDP is also given over the belief state b, as we do not have access to the true state. The value for a belief state is calculated as the sum of the values for each state s ∈ S multiplied with their probabilities b(s) (see Equation 11). Note that in Equation 11 the observation function O is also included to calculate the expected reward for the possible future states s' ∈ S, as the future state now depends on the received observation:
$$V^\pi(b) = \sum_{s \in S} b(s) \left( R(s, \pi(s)) + \gamma \sum_{s' \in S} \sum_{o \in O} T(s, \pi(s), s') O(\pi(s), s', o) V^\pi(b^{\pi(s)}_o) \right) \tag{11}$$
The optimal value of a belief state is then given by the value obtained by taking the action that maximizes the expected return for that belief state (see Equation 12):
$$V^*(b) = \max_{a \in A} \left\{ \sum_{s \in S} b(s) \left( R(s, a) + \gamma \sum_{s' \in S} \sum_{o \in O} T(s, a, s') O(a, s', o) V^*(b^a_o) \right) \right\} \tag{12}$$
In the MDP case, we generated our policy $\pi^*$ by applying value iteration for each state in our discrete state space S, providing us with the optimal action a ∈ A for each state s ∈ S. The POMDP, on the other hand, acts on belief states, where we want to find the best action a given a belief state b. However, because the belief space is continuous, it would be computationally intractable to find an exact solution using value iteration in an infinite-horizon setting [21] (i.e. an infinite-horizon POMDP has an infinite number of possible belief points, making it impossible to calculate the value for all of these points). Note that in the finite-horizon case, this problem is not present, as we could plan starting at our initial belief $b_0$ and enumerate over the possible future belief points until we reach the plan horizon, and therefore do not require the entire belief simplex to define the optimal value function.

It has been shown that in a finite-horizon setting V∗ is a piecewise-linear and convex (PWLC) function [7], and that it can be well approximated by a PWLC function in an infinite-horizon setting [6]. This value function can therefore be represented by a set of α-vectors V, where each vector α ∈ V gives a value for each state s ∈ S and is associated with an action a (see Equation 13).

V^*(b) = \max_{\alpha \in V} \left\{ \sum_{s \in S} b(s) \cdot \alpha(s) \right\} \qquad (13)

Here, a single α-vector encodes the value of a state s ∈ S for a particular action a ∈ A and observation o ∈ O over the belief space. Note that the value of a single belief point on the α-vectors can be calculated with the same equation as used in Equation 11.

Given the value function represented by α-vectors, the policy based on the belief state can be found by taking the α-vector that maximizes the value function V∗(b) and mapping it to its associated action Γ : V → A (see Equation 14).

\pi(b) = \Gamma\!\left( \underset{\alpha \in V}{\arg\max}\; b \cdot \alpha \right) \qquad (14)

An example of the α-vectors for a two-state POMDP is given in Figure 14. In this example, we can see that when we are certain that we are in state s0 (i.e. b = [1, 0]), we should take the action associated with the α-vector α1, as this vector maximizes the value function V for that particular belief point. Note that the same principle also works with more states, but that the α-vectors then become hyperplanes over the belief space.
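To make Equations 13 and 14 concrete, the sketch below evaluates a hand-written set of α-vectors at a belief point and returns the action associated with the maximizing vector; the vector values and the action labels are invented for the example and do not come from a solved model.

```python
import numpy as np

# Three hand-crafted alpha-vectors for a two-state POMDP (illustrative values),
# each paired with the action it was generated for.
alpha_vectors = np.array([[ 2.0, -1.0],   # alpha_1
                          [ 0.5,  0.5],   # alpha_2
                          [-1.0,  2.0]])  # alpha_3
actions = ['rotate_left', 'stay', 'rotate_right']   # hypothetical action labels

def value(b):
    """V*(b) = max_alpha sum_s b(s) * alpha(s)   (Equation 13)."""
    return np.max(alpha_vectors @ b)

def policy(b):
    """pi(b): action associated with the maximizing alpha-vector (Equation 14)."""
    return actions[int(np.argmax(alpha_vectors @ b))]

b = np.array([1.0, 0.0])     # certain to be in state s0
print(value(b), policy(b))   # alpha_1 attains the maximum here
```

With b = [1, 0] the first vector attains the maximum, matching the situation sketched in Figure 14.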


Figure 14: Linear support of α-vectors for a two-state POMDP.

To find the policy, we could in principle generate all possible vectors for every possible action a ∈ A and observation o ∈ O. Because for our policy we only need the α-vectors that maximize the value function, an α-vector that is completely dominated⁵ by other α-vectors can be pruned away, as it does not influence the resulting policy. This provides us with the set of α-vectors that represents our policy; this method is known as Monahan's enumeration algorithm [22]. Although this sounds good in theory, this method would not work for our task, as we are working in an infinite-horizon setting and not all reachable belief points can be calculated in a finite amount of time.

In an infinite-horizon setting, an ε-approximate value function (i.e. a V such that V∗(b) − V(b) ≤ ε for all b) could in principle be calculated by exact methods such as value iteration, but these have been shown to scale poorly with respect to the number of states and observations [24]. Point-based methods [30], on the other hand, perform much better than optimal methods given these scalability issues, as they approximate the optimal value function by computing it only for a finite and representative set of belief points. These methods exploit the fact that, given an initial belief, there is only a finite set of reachable belief states, making the calculation of the optimal value function for the entire belief space often unnecessary. Point-based methods select a finite subset of beliefs B, and iteratively update the value function V over B. The various point-based methods differ in how the belief set B is selected and how the value function V is updated.

⁵An α-vector is completely dominated when for each point in the belief space there is another α-vector with a value at least as high.

In this thesis, we use the Perseus point-based algorithm [32], which selects B by random exploration of the belief space. This exploration is done by sampling actions and observations, leading to new beliefs for our belief set B. Based on this belief set B, we iteratively update the value function V, where each iteration brings us closer to V∗.
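As a rough sketch of the point-based idea behind Perseus, the code below performs randomized value-update sweeps over a sampled belief set for the same kind of toy model as before; the model arrays, the belief sampling and the fixed number of sweeps are simplifications for illustration and do not reflect the actual solver used in this thesis.

```python
import numpy as np

# Toy model (illustrative numbers): 2 states, 2 actions, 2 observations.
# T[a, s, s'] = Pr(s'|s,a), Z[a, s', o] = Pr(o|s',a), R[s, a] = reward.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
Z = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.8, 0.2], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
gamma = 0.95
B = np.random.dirichlet(np.ones(2), size=100)   # randomly sampled belief set B

def backup(b, V):
    """Point-based Bellman backup of belief b against the alpha-vector set V."""
    best_val, best_alpha = -np.inf, None
    for a in range(T.shape[0]):
        g_a = R[:, a].copy()
        for o in range(Z.shape[2]):
            # g_{a,o}^alpha(s) = sum_s' T(s,a,s') Z(a,s',o) alpha(s')
            g_ao = np.array([T[a] @ (Z[a, :, o] * alpha) for alpha in V])
            g_a = g_a + gamma * g_ao[np.argmax(g_ao @ b)]
        if b @ g_a > best_val:
            best_val, best_alpha = b @ g_a, g_a
    return best_alpha

def perseus_iteration(V):
    """One Perseus sweep: improve the value of every belief in B at least once."""
    V_new, todo = [], list(range(len(B)))
    while todo:
        i = todo[np.random.randint(len(todo))]
        alpha = backup(B[i], V)
        old = np.max(np.array(V) @ B[i])
        V_new.append(alpha if B[i] @ alpha >= old
                     else V[np.argmax(np.array(V) @ B[i])])
        # Drop every belief whose value has already improved under V_new
        todo = [j for j in todo
                if np.max(np.array(V_new) @ B[j]) < np.max(np.array(V) @ B[j])]
    return V_new

V = [np.zeros(2)]            # initial value function (a single zero vector)
for _ in range(30):          # fixed number of sweeps, purely for illustration
    V = perseus_iteration(V)
```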

As will be discussed in more detail in Section 3.3, we have decided to work with an asynchronous execution rate for this task, to avoid having to manually tune the execution rate. The main reason is that, especially in robotics applications, a synchronous execution rate generally requires a large amount of fine-tuning to get right. Practically, this means that we do not perform an action at fixed time intervals but rather act in response to event triggers (i.e. observation changes). The previously defined POMDP should therefore be extended so that these event changes can be modelled within the POMDP framework.

The event-driven POMDP is structurally almost identical to the original POMDP, the main difference being its observation function O. This function, rather than giving the probability of an observation o ∈ O given the state s and action a, now gives the probability of an observation change given the corresponding state change and action. It is, however, possible that we receive an observation change without a corresponding state change (i.e. observation changes caused by the imperfect sensors of the robot). These are the false positives that we want to ignore, which is possible because such events yield a low probability of actually occurring (e.g. the interaction target cannot suddenly appear on the opposite side of the robot). There is, however, another possibility, namely that the interaction target actually changed its position but the sensors did not detect it. Here, we would like to model the possible sequences of intermediate states that might have been visited unknowingly due to the false-negative detection.

To model this, we add an additional symbol o_f for false-negative event detections to our observation set, O = O ∪ {o_f}. Also note that the observations o ∈ O in the event-driven POMDP refer to observation changes rather than direct observations, as was the case with the synchronous POMDP. The observation function for the event-driven POMDP therefore becomes O : O × S × S × A → [0, 1], where this function gives the probability of perceiving observation o ∈ O given states s ∈ S, s′ ∈ S and action a ∈ A, such that O(a, s, s′, o) = Pr(o | s, s′, a). Note that the states s ∈ S and s′ ∈ S indicate the state transition event from state s to state s′.


Although by definition the agent will never receive a false-negative observation, value iteration does take this additional symbol into consideration during planning, effectively allowing the agent to reason over the intermediate states that might have been visited unknowingly due to false-negative detections. During planning, a new action is selected at every event trigger. However, because the agent cannot receive false-negative observations, the agent is also incapable of changing its action at those event triggers.

This leads to an additional constraint on the system, namely that when we receive o_f we cannot change our action. By adding a constraint generation function C(a, o) to our planner, which for a given action a and observation o returns a constrained action set (see Equation 15), we can plan over observation change events containing the false-negative observation event o_f.

C(a, o) = \begin{cases} a, & \text{if } o = o_f \\ A, & \text{if } o \in O \setminus \{o_f\} \end{cases} \qquad (15)

To solve the event-driven POMDP, we use the Constraint-Compliant Perseus (CC-Perseus) algorithm [10], which is an adaptation of the previously defined Perseus algorithm to event-driven POMDPs.
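Equation 15 amounts to a simple lookup. The sketch below assumes string labels for the actions and uses 'o_f' for the false-negative symbol; both are naming choices for the example rather than the labels used on the robot.

```python
# Hypothetical discrete sets for illustration; the real action and observation
# labels of the robot are defined elsewhere in the system.
ACTIONS = ['rotate_left', 'stay', 'rotate_right']
FALSE_NEGATIVE = 'o_f'

def constraint_set(a, o):
    """C(a, o) from Equation 15: on a false-negative event the planner may only
    keep its current action; on any real observation change it may choose
    freely from the full action set."""
    return [a] if o == FALSE_NEGATIVE else ACTIONS
```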

3.1.2 Recurrent Neural Networks

First, we provide a short introduction to artificial neural networks (ANNs), describing their architecture and showing how such networks can be trained. Once we have a clear understanding of neural networks, we describe how they can be extended into recurrent neural networks (RNNs), and motivate why we work with a recurrent network rather than a standard neural network. Finally, we describe Long Short Term Memory (LSTM), which is in turn an extension of recurrent neural networks and is the network architecture used throughout this thesis.

Neural networks are typically organized in layers, where each layer contains nodes that are connected to the nodes of the adjacent layers. The connections between these nodes are all weighted; the initial weights are initialized arbitrarily and later updated using backpropagation (i.e. performing gradient descent based on backward propagation of the error). In this type of network, information is always fed forward and never backwards (i.e. we do not allow loops), which is why it is also called a feedforward neural network (see Figure 15) [4].

In order to get a better understanding of how the different parts of a neural network interact, we give a step-by-step overview of the steps for calculating the output ŷ (i.e. applying forward propagation) and then apply backpropagation to update the weights (w and w′) of the network based on the error of the previously computed output.

Figure 15: An example of a feedforward neural network

The architecture of the neural network shown in Figure 15 will be used as an example for our calculations. The input layer of this network (indicated in green) consists of 3 nodes, where each node corresponds to an element of the input vector x (i.e. x = [x1, x2, x3]). The nodes in the hidden and output layers also contain an activation function that introduces non-linearity into the network and helps to stabilize it.⁶ The first step in calculating the output of the network is to compute the pre-activity of the hidden layer by multiplying the input x by the weights w (see Equation 16).

\begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} \times \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} \\ w_{21} & w_{22} & w_{23} & w_{24} \\ w_{31} & w_{32} & w_{33} & w_{34} \end{bmatrix} = \begin{bmatrix} z_1 & z_2 & z_3 & z_4 \end{bmatrix} \qquad (16)

Note that we go through the network for a single data point only; in principle, the input x could also be an n × 3 matrix, where n is the number of data points, to process multiple data points at once. Given the pre-activity of the hidden layer z, an activation function is applied individually to each z_i ∈ z to obtain the activation of the hidden layer (see Equation 17).

f(z) = \begin{bmatrix} f(z_1) & f(z_2) & f(z_3) & f(z_4) \end{bmatrix} = \begin{bmatrix} h_1 & h_2 & h_3 & h_4 \end{bmatrix} \qquad (17)

Common choices for activation functions are hyperbolic tangent (tanh) functions, sigmoid functions, and rectified linear units (ReLUs). In this example, we use a sigmoid activation function, as shown in Equation 18. The sigmoid function is bounded (i.e. it limits the output to a value between 0 and 1), easily differentiable (see Equation 19) and monotonically increasing, making it easy to handle and the activation function of choice for this example.

⁶The nodes in the hidden and output layers can also contain a bias term b to shift the activation function to either the left or right. This term is arbitrarily initialized and trained in the same manner as the weights of the connections between nodes.

f(x) = \sigma(x) = \frac{1}{1 + e^{-x}} \qquad (18)

\frac{d}{dx} f(x) = \frac{d}{dx} \sigma(x) = \sigma(x)\,(1 - \sigma(x)) \qquad (19)
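In code, the sigmoid of Equation 18 and its derivative of Equation 19 are one-liners (repeated in the later sketches so that each stays self-contained):

```python
import numpy as np

def sigmoid(x):
    # Equation 18: squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # Equation 19: derivative expressed in terms of the sigmoid itself
    s = sigmoid(x)
    return s * (1.0 - s)
```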

Now that we know the activation of the hidden layer h, we can calculate the pre-activity of the output layer z′ by multiplying the activation of the hidden layer h with the weights w′ (see Equation 20).

\begin{bmatrix} h_1 & h_2 & h_3 & h_4 \end{bmatrix} \times \begin{bmatrix} w'_{11} & w'_{12} \\ w'_{21} & w'_{22} \\ w'_{31} & w'_{32} \\ w'_{41} & w'_{42} \end{bmatrix} = \begin{bmatrix} z'_1 & z'_2 \end{bmatrix} \qquad (20)

Finally, the output ŷ can be calculated by applying the activation function to the pre-activity of the output layer z′:

f(z') = \begin{bmatrix} f(z'_1) & f(z'_2) \end{bmatrix} = \begin{bmatrix} \hat{y}_1 & \hat{y}_2 \end{bmatrix}. \qquad (21)
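As an illustration, the forward pass through the 3-4-2 network of Figure 15 (Equations 16 to 21) reduces to two matrix products and two element-wise sigmoids; the weights and the input vector below are arbitrary values chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
w  = rng.normal(size=(3, 4))    # input  -> hidden weights (Equation 16)
w2 = rng.normal(size=(4, 2))    # hidden -> output weights (Equation 20)

def forward(x):
    z  = x @ w                  # pre-activity of the hidden layer
    h  = sigmoid(z)             # hidden activation (Equation 17)
    z2 = h @ w2                 # pre-activity of the output layer
    y_hat = sigmoid(z2)         # network output (Equation 21)
    return z, h, z2, y_hat

x = np.array([0.2, 0.5, 0.1])   # a single illustrative input vector
print(forward(x)[-1])
```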

We have now completed a forward pass through the network and obtained the output ŷ given arbitrarily initialized weights w and w′. At this point the output ŷ would not make much sense, as the weights are not yet trained. In order to get a sensible output from the network, we need to update the weights based on the error of the forward pass. This error J is given as the sum of squared differences between the prediction ŷ and the true label y:

J = \frac{1}{2} \sum_{j=1}^{|y|} (y_j - \hat{y}_j)^2. \qquad (22)


Here, a quadratic term is used to make sure the error function is convex, so that gradient descent can lead us to the global minimum of the error. Additionally, a factor of 1/2 is added to simplify the equations later on; it does not influence the learning process, because we are interested in minimizing the error function rather than in its actual value.

In order to know how much each individual weight contributed to the final error, we need to calculate the partial derivatives of the error J with respect to the weights w and w′. We start with the partial derivatives for the weights closest to the output (i.e. w′), which can be calculated as shown in Equation 23. Note that ◦ here refers to the Hadamard (entry-wise) product.

\frac{\partial J}{\partial w'} = \frac{\partial\, \frac{1}{2} \sum_{j=1}^{|y|} (y_j - \hat{y}_j)^2}{\partial w'} = h^{\top} \left( (\hat{y} - y) \circ \frac{d}{dz'} f(z') \right) \qquad (23)

Next, we calculate the gradients for the weights between the input layer and the hidden layer, as shown in Equation 24.

\frac{\partial J}{\partial w} = \frac{\partial\, \frac{1}{2} \sum_{j=1}^{|y|} (y_j - \hat{y}_j)^2}{\partial w} = x^{\top} \left( \left( (\hat{y} - y) \circ \frac{d}{dz'} f(z') \right) w'^{\top} \circ \frac{d}{dz} f(z) \right) \qquad (24)

Now that we know the partial derivatives of the cost J with respect to the weights, we can update the weights as shown in Equations 25 and 26. Here, η ∈ [0, 1] indicates the learning rate of the network and influences the speed and quality of learning.

w_{i,j} = w_{i,j} - \eta \frac{\partial J}{\partial w} \qquad (25)

w'_{i,j} = w'_{i,j} - \eta \frac{\partial J}{\partial w'} \qquad (26)

We have now successfully applied a single step of the iterative process of gradient descent to minimize the objective function J and learn the weights of this network. With this, we can alternately apply forward propagation to calculate the output and backward propagation to update the weights of our neural network.
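Continuing the same toy network, a single gradient-descent step corresponding to Equations 22 to 26 can be sketched as follows; the target vector y and all numerical values are again invented for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
w  = rng.normal(size=(3, 4))          # input  -> hidden weights
w2 = rng.normal(size=(4, 2))          # hidden -> output weights
eta = 0.1                             # learning rate

x = np.array([[0.2, 0.5, 0.1]])       # 1 x 3 input (a single data point)
y = np.array([[1.0, 0.0]])            # 1 x 2 illustrative target

# Forward pass
z  = x @ w;  h  = sigmoid(z)
z2 = h @ w2; y_hat = sigmoid(z2)
J = 0.5 * np.sum((y - y_hat) ** 2)    # squared error, Equation 22

# Backward pass (Equations 23 and 24); '*' is the Hadamard product
delta_out    = (y_hat - y) * sigmoid_prime(z2)   # error signal at the output
grad_w2      = h.T @ delta_out                   # dJ/dw'  (Equation 23)
delta_hidden = (delta_out @ w2.T) * sigmoid_prime(z)
grad_w       = x.T @ delta_hidden                # dJ/dw   (Equation 24)

# Gradient-descent update (Equations 25 and 26)
w2 -= eta * grad_w2
w  -= eta * grad_w
```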

In our task of maintaining the correct bearing towards the interaction target in a noisy environment, however, reasoning over only the input (i.e. observation) at the current time step x_t would not be enough to solve the task: we would not be able to disambiguate between noise and true observations. This distinction can only be made by also looking at previously obtained observations, as will be discussed in more detail in Section 3.3.

Recurrent neural networks allow for loops within the network architecture, essentially allowing information to persist across time steps.
