
A Gaze Detecting Tracker - Master Thesis -

Tijs Zwinkels 1558218

June 30, 2009

Internal Supervisors:
drs. T. van der Zant, Artificial Intelligence, University of Groningen
Prof. dr. L.R.B. Schomaker, Artificial Intelligence, University of Groningen

External Supervisor:
Dr. P.E. Rybski, The Robotics Institute, Carnegie Mellon University


Abstract

For robots to be accepted as members of our lives, it is not enough for them to be aware only of the unmoving, static parts of their environment. Desirable social behavior requires that they are aware of the people nearby and of the attitude of these people towards the robot. The aim of this research is to implement a multi-person tracking and gaze-detection system that can be used as a basis for detecting human interaction interest. The system has been designed to be deployed on the Carnegie Mellon University Snackbot, which has recently been completed. A two-stage system is proposed. The first stage performs person tracking and answers the question 'Where are persons in the robot's environment?'. Sensor data from a SICK laser rangefinder mounted at leg height is used by a leg detector to detect possible person positions, and the leg detections are fed to a Kalman-filter-based tracker. Several leg detectors and tracking strategies have been compared. The second stage performs gaze detection. A novel approach to the problem of gaze detection is employed: instead of building a highly specific and complex multi-layered system, a relatively simple and elegant system based on simple features performs binary gaze classification, answering the question 'Is someone looking at the robot?'. An indication that a person is looking at the robot could eventually be used as an indication that the person wants or expects something from the robot. The system has been implemented on a MobileRobots inc. PeopleBot robot platform. We compare performance under differing conditions, angles, and distances for k-nearest-neighbor, perceptron, and naive Bayes machine learners. Experiments show good performance for the person tracker, which tracks up to 90% of multiple-person movements correctly. In a difficult environment, at ranges of up to 3.6 meters, the gaze detector has a per-frame classification accuracy of 72%. Results could probably be improved by combining results from multiple frames.


Contents

1 Introduction
1.1 The relevance of gaze detection
1.2 Biological gaze detection
1.3 Goal and Organization of the thesis

2 Platform
2.1 Hardware
2.1.1 Peoplebot
2.1.2 Snackbot
2.2 Software
2.2.1 Work on the CMAssist software repository
2.2.2 ARIA
2.2.3 Intra-process communication
2.2.4 Inter-process communication

3 Tracking
3.1 Introduction
3.2 Background
3.2.1 Previous work
3.2.2 Tracking with the Kalman filter
3.3 Model
3.3.1 Perception
3.3.2 Tracking
3.4 Implementation
3.4.1 Architecture
3.4.2 Visualization
3.5 Leg Detection Experiments
3.5.1 Method
3.5.2 Results
3.6 Tracking Experiments
3.6.1 Method
3.6.2 Results

4 Gaze detection
4.1 Introduction
4.2 Background
4.2.1 Interaction interest
4.2.2 Previous Work
4.3 Model
4.3.1 Simple Features-based Approach
4.3.2 Perception
4.3.3 Features
4.3.4 Classification
4.4 Experiments
4.4.1 Method
4.4.2 Results

5 Discussion
5.1 Applicability to a System in the Real World
5.1.1 Performance
5.1.2 Generalizability of the Dataset
5.1.3 What is still needed?
5.2 Future Work
5.2.1 Dataset
5.2.2 Tracking
5.2.3 Gaze-detection

6 Conclusion
6.1 Summary
6.2 Summary of Results
6.3 Benefits to the field
6.4 Final Conclusion

Appendices
A Dataset
A.0.1 Environment
A.0.2 Training set and validation set
A.0.3 Technical Information
A.0.4 Tracking
A.0.5 Gaze direction


Chapter 1 Introduction

As the field of robotics progresses, robots are becoming more and more adept at common household tasks. Commercial vacuum-cleaning robots have existed for years, and more advanced behavior can be seen in robotics laboratories around the globe. In order to take home robotics out of the laboratory and have robots accepted as desirable members of our lives, robots need to be aware not only of the static environment, but also of us, the people who live in these environments.

This is mainly a study in the field of home robotics. Research has been performed at the CMAssist research group at the Robotics Institute of Carnegie Mellon University.

This research group was originally founded to perform at the Robocup@Home competition [32][43], which is focused solely on robotics in realistic home situations. Furthermore, the team has recently introduced its Snackbot robot platform [23]. This robot will serve as a human-robot interaction research platform, make autonomous food deliveries, and function as an advanced snack vending machine. The systems developed for this research have been designed mainly with this last function in mind.

Currently, many modalities are used for human-robot interaction. In its simplest form, buttons and switches are used. Switches can be placed on a remote or on the robot body, among other places, and are sometimes hidden, for example in toy robots where you have to 'scratch their back' to activate certain behavior. More advanced systems include visual recognition of specially prepared cards such as on the AIBO, gesture recognition, or voice recognition. However, most systems require their users to initiate interaction and require user knowledge of the specific interactions that the system expects. We propose the use of human gaze cues in human-robot interaction. Gaze cues could be used as an indication of human interaction interest, and an indication of interest could be used to let the robot initiate and guide interaction. This could allow for more natural interaction that doesn't require any prior knowledge on the part of the user of the system. For a detailed background on the use of gaze in human-robot interaction, see section 4.2.1.


1.1 The relevance of gaze detection

We ask ourselves: why would we want gaze detection on a robot? Why is gaze detection important? Langton reasons: "Humans and the majority of primate species are social animals, living in groups comprising as many as 200 individuals. Thriving in such an environment requires a particular kind of 'social intelligence'; an ability to make sense of another individual's actions and crucially, to predict what they are about to do next." [18]

Evidently, social intelligence and prediction are important in person-to-person interaction.

As has been argued by Baron-Cohen[3], gaze direction detection is likely an important cue for social prediction.

Premack claims that humans have evolved a mechanism to attribute mental states in order to interpret and predict action [29]. He refers to this mechanism as the 'mindreading' system. Baron-Cohen differentiates the workings of this mindreading system into four modular components, based on experiments with infants. He distinguishes an Intentionality Detector (ID), akin to Premack's suggestion, whose function is to represent behaviour in terms of volitional states (desire and goal); an Eye Direction Detector (EDD), whose function is initially to detect the presence of eye-like stimuli, and later to represent their direction as an Agent 'seeing' the Self or something else; a Shared Attention Mechanism (SAM), whose function is to represent whether the Self and another Agent are attending to the same object or event; and a Theory of Mind Mechanism (ToMM), whose function is to represent the full range of mental states, and to integrate mental state knowledge into a coherent and usable theory for interpreting action [3]. These modules become active in different stages of infant development. A graphical representation of this model is presented in figure 1.1. Baron-Cohen places specific emphasis on the Eye Direction Detector and its interaction with the Shared Attention Mechanism. According to him, the ability to detect eye direction, and thus predict gaze direction, is of extreme importance in mindreading [18]. If we are aiming for social robot behavior, gaze detection seems a necessity.

1.2 Biological gaze detection

How do we know that gaze direction detection exists in humans? Most of us have experienced the tendency to look where others are looking. In the middle of your next conversation, for instance, suddenly shift your gaze and observe your partner's behavior [18]. Anecdotally, then, there is ample evidence that humans do perform gaze detection, and might indeed have dedicated neural mechanisms for this. We will highlight an experiment by Langton and Bruce [19]. In this experiment, subjects were asked to press the space bar as soon as a target letter appeared at one of four locations on a computer screen. Either 100 ms or 1000 ms before the target letter appeared, a face would appear in the center of the screen, oriented towards one of the four possible target locations. Subjects were told, truthfully, that following the appearance of a face, the letter was equally likely to appear in each of the four locations. The cue of the appearing face was entirely uninformative and could safely be ignored. However, subjects were unable to comply with these instructions: whenever a gaze cue appeared 100 ms before a target letter, detection times were faster if the target appeared in the cued location. This effect had disappeared when the gaze cue was offered 1000 ms before the appearance of the target.


Figure 1.1: The modules of the 'mindreading' system as proposed by Baron-Cohen: the Intentionality Detector (ID), the Eye Direction Detector (EDD), the Shared Attention Module (SAM), and the Theory of Mind Module (ToMM). Baron-Cohen puts specific emphasis on the Eye Direction Detector, and thus on the importance of gaze detection, as a very important function in social interaction.

Langton and Bruce conclude that the face cues trigger a kind of reflexive or exogenous shift of visual attention. Other research reaches similar conclusions [18].

There is evidence that a gaze-detection system could be present in primates as well.

Research by Barth [6] implies that chimpanzees can exploit social gaze cues: in an experiment where the experimenter gazed at the location of hidden food, the subjects scored at levels far exceeding chance when finding the food. Scores were at chance level when the experimenter glanced or pointed at the hidden food. Additionally, they show that previous conflicting results were likely due to a different experimental procedure, in which the subjects would still be present while the experimenter was hiding the food.

Little is known about the exact neural underpinnings of gaze detection in humans. An indirect but enthralling hypothesis can be found in the structure of our eyes. The white outer sclera, which is in stark contrast with the dark cornea of the eye, provides a particularly powerful signal to the direction of another person's gaze [18]. It has been hypothesized that this feature has evolved to make it easier for members of the same species to follow an individual's gaze direction in close-range joint attentional and communicative interactions. This is called the cooperative eye hypothesis [38]. As can be seen in figure 1.2, for instance, this feature isn't present in other primates such as chimpanzees.

Langton demonstrates the cooperative eye hypothesis by plotting the response of vertically oriented simple cells from the striate cortex for a subject looking at an eye. These cells react to vertically oriented cues in the image, and respond vigorously to the whole eye. The response comes in three spatially separate parts: one to each of the visible parts of the white sclera left and right of the cornea, and one to the dark iris and pupil. As the eye turns, the responses to the two parts corresponding to the white sclera change in relative strength.


Figure 1.2: A chimpanzee eye (left) and a human eye (right). For a human eye, the white outer sclera in contrast with the dark cornea provides a particularly powerful signal to the direction of another person's gaze. It has been hypothesized that this feature has evolved to make it easier to follow an individual's gaze direction.

Thus, the contrast between the responses of the two scleral parts is a monotonic function of eye direction [18].

Tomasello et al. tested the cooperative eye hypothesis by comparing the gaze-following behavior of human infants to that of great apes. A human experimenter 'looked' at the ceiling with either his eyes only, his head only (eyes closed), both head and eyes, or neither. The results support the cooperative eye hypothesis: great apes followed gaze to the ceiling based mainly on the human's head direction, although eye direction played some role as well. In contrast, human infants relied almost exclusively on eye direction. This shows that humans are especially reliant on eyes in gaze following [38]. While this is no direct evidence that the cooperative eye hypothesis is true, it does provide compelling support.

Perrett and his colleagues have suggested that the eye-direction detector as proposed by Baron-Cohen [3] forms only part of a system to compute the direction of social attention. As suggested by the previous article, their research indicates that head orientation in primates plays an important role as well. Their single-cell studies indicate that individual cells in the superior temporal sulcus region of the macaque temporal cortex are sensitive to conjunctions of eye, head, and body position. Accordingly, they postulate the existence of a direction-of-attention detector (DAD) which combines information from eye-direction, head-direction, and body-pose detectors [27]. It is suggested that this information is combined in a hierarchical fashion, where eye-direction information largely overrides head-direction information, and head-direction information largely overrides body-orientation information. This happens through a network of inhibitory connections.


Figure 1.3: Bird's-eye view of the system. The SICK laser rangefinder feeds a leg detector and person tracker, which answer the question 'Where are persons in the robot's environment?'; the Bumblebee 2 stereo camera feeds the gaze detector, which answers the question 'Is someone looking at the robot?'.

1.3 Goal and Organization of the thesis

As we've seen, gaze detection is a key component in human social behavior. The aim of this research is to implement a multi-person tracking and gaze-detection system that can be used as a basis for human interaction interest detection and other socially desirable behavior on a robot. This system has been designed to be deployed on the Carnegie Mellon University Snackbot, which has recently been completed. A bird's-eye view of the system is given in figure 1.3. Firstly, a person tracker has been developed. The robot is equipped with a SICK laser rangefinder attached at ankle height. A leg detector detects person legs and heuristically estimates an orientation. This data is fed to a Kalman-filter-based tracker to track persons in the robot's environment. This allows the system to answer the question: 'Where are persons in the robot's environment?'

Secondly, a human gaze detection system has been developed. A novel approach to the problem of gaze detection is employed: instead of building a highly specific and complex multi-layered system, a relatively simple, elegant system is used. Simple, mostly holistic features are extracted and used in a supervised machine learning system performing binary classification. We've christened this the simple feature approach. Multiple classifiers are used for comparison. Raw data received from the Bumblebee 2¹ stereo camera can be roughly segmented using person-position information from the tracker, which is mainly useful if there are multiple persons present in the robot's environment. Consequently, segmented data from the camera, tracker, and leg detector is used to generate simple features.

¹ More information about the Bumblebee 2 camera is available at http://www.ptgrey.com/products/stereo.asp.


Previously seen data of persons looking at the robot and looking beside the robot is used to train a number of classifiers. This allows the system to answer the question 'Is someone looking at the robot?'.

This system has partly been inspired by the work of Perrett et al. on the workings of a proposed biological direction-of-attention detector (DAD), touched upon in section 1.2. As suggested by that research, the system presented in this thesis contains separate features for body direction and head direction, and eye-direction features could be added. Combining these features is done implicitly by the machine learners. Of course, there are many differences as well. For instance, the system presented here contains no explicit eye-direction measurement, and no explicit feature overriding is used, even though most supervised machine learners can model this implicitly as well.

This thesis is organized as follows. In chapter 2, the hardware and software platform used is described, covering both the originally available or purchased hardware and software and some architectural software developed during this project. In chapter 3, the person tracker system is detailed. Background is provided as well, with an analysis of previous work and an introduction to the probabilistic tracking theory that is the basis for Kalman filter systems. Specific implementation details and a screenshot with an explanation of the developed tracker visualization software are provided. Detailed experiments are run to ascertain the performance of the tracking system; since the tracking system depends on the leg-detector system, the performance of this system has been measured separately. In chapter 4, the gaze detection system is detailed. We start with a discourse on the use of gaze detection in measuring human interaction interest, and an analysis of other gaze-detecting systems. Then the gaze detection system is explained, looking in detail at the perceptual system, the feature extractors, and the machine-learning classifiers. A dataset is recorded and experiments are run to compare the performance of a few different implementations in differing circumstances. Detailed information on the dataset and experimental setup can be found in appendix A. The discussion in chapter 5 deals with remaining thoughts and suggestions concerning both general and more specific issues. The conclusion in chapter 6 closes the thesis with a summary, a summary of results, and final thoughts.


Chapter 2 Platform

2.1 Hardware

2.1.1 Peoplebot

Figure 2.1: Picture of the PeopleBot robot platform.

The PeopleBot¹ is a commercially available robot platform sold by Mobile Robots inc.² The robot is about 112 cm tall. This is taller than most Mobile Robots bases, and makes it more suitable for human-robot interaction purposes. The CMAssist team at Carnegie Mellon University has put together this platform to be nearly sensor- and actuator-compatible with the recently introduced Snackbot platform. Details about the Snackbot platform are presented in section 2.1.2.

Sensors

SICK Lidar The robot is equipped with a SICK LMS200³ laser range-finder device. This device provides distance measurements of surfaces at a height of 26.8 cm above the ground. It measures over an angle of -90° to +90°, with the 0° point being straight ahead relative to the robot. With an angular resolution of 0.5°, the device provides 360 angle-distance tuples at a rate of up to 10 Hz. Due to an as-of-yet undiagnosed problem, every second frame would be identical to the previous one when these maximum settings were used. Therefore, the sensor has been used at a rate of 5 Hz.

¹ The PeopleBot operations manual can be downloaded at http://www-ee.ccny.cuny.edu/www/web/jxiao/P2-manual.pdf.
² The website of Mobile Robots inc. is http://www.activmedia.com/.
³ The SICK LMS200 manual can be downloaded at http://www.mysick.com/saqqara/get.aspx?id=IM0012759.


The maximum distance at which surfaces can be detected depends on the reflectivity of the surface. The manual states that every surface with a reflectivity of more than 2% can be detected at a distance of at least three meters. Matte black cardboard, which has a reflectivity of about 10%, can be detected at a distance of at least ten meters. For our purposes, no surfaces need to be detected beyond a range of 5 m, and we never encountered a surface that the device didn't detect at these distances.

According to the manual, the device has a statistical error of about 5 mm for ranges up to 8 meters. Moreover, the device can have a systematic error of up to 15 mm over the same range. Summing the two errors, the device should be accurate to within 2 centimeters for ranges up to 8 meters.

Hokuyo Lidar The PeopleBot has a URG-04LX⁴ rotating laser range-finder from Hokuyo mounted at chest height. It measures over an angle of -135° to +135° with a maximum angular resolution of 0.36° and a scan rate of 10 Hz. The scanner turned out to develop blind spots if all of these values were set to their maximum, so an optimal balance needs to be found.

The maximum range of the scanner is four meters, but this is dependent on the reflectivity of the surface. Preliminary testing shows that bright clothing is detectable almost up to the maximum distance of four meters, but that subjects need to be close, no further away than about 2.5 meters, when wearing dark clothing.

Compared to the SICK laser, the Hokuyo laser is small, cheap and draws little power, but it’s also notably less powerful, with reliable object and person detection only possible at close range. Due to these problems, the Hokuyo laser hasn’t been used in the systems that are presented in this thesis.

LIDAR Both the SICK and the Hokuyo laser rangefinders use the lidar principle: the time of flight of a reflected laser beam is used to compute the distance to the reflecting surface, in a manner highly similar to radar. Lidars are often employed in robotic applications, and the SICK lidars in particular enjoy a high degree of popularity [7][11][17][32]. Because these devices employ a fixed angular resolution, positioning resolution and accuracy fall linearly with distance to the robot. Both laser range-finders are class I laser devices, and are safe for use near humans without any special precautions.

Bumblebee Stereo Camera Vision on the PeopleBot is provided by a Bumblebee 2 high-resolution stereo camera from Point Grey Research⁵. The camera is mounted on a pan-tilt unit on top of the robot. The point of view of the robot is approximately 125 cm above the ground.

The camera is equipped with two ICX204 CCD sensors with a maximum resolution of 1024x768. Each sensor is mounted in its own lens assembly. The separation between the centers of the two sensors, the baseline, is 12 cm. Frame rates of up to 18 fps are supported, but in practice the frame rate is mostly limited by the available processing power or the capacity of the recording equipment. The camera is connected to a laptop using an IEEE 1394 FireWire bus. The camera comes with the Triclops SDK from Point Grey Research.

⁴ More details are available in the URG-04LX specification sheet, which can be downloaded at http://www.hizook.com/files/publications/HokuyoURG_Datasheet.pdf.
⁵ More details can be found in the Bumblebee manual, which can be downloaded at http://www.ptgrey.com/support/downloads/documents/Bumblebee2GettingStartedManual.pdf.


This software allows for advanced near-real-time stereo-image processing to provide depth maps of the environment.

Actuators

Locomotion motors The PeopleBot has two driven wheels, one on each side of the robot. These wheels are driven by two stepper motors, which allow the robot to move around in its environment. Robot movement can be initiated by requesting a movement of a certain number of centimeters from the wheels, or by marking a movement waypoint on a map in the ARNL software library. This actuator hasn't been used in the systems presented in this thesis.

Pan-tilt unit The Bumblebee stereo camera is mounted on a PTU-D46⁶ pan-tilt unit (PTU) by Directed Perception. This allows the robot to change its viewing direction without moving the whole robot platform around. The pan-tilt unit is connected to the internal PeopleBot PC with an RS-232 serial bus. This actuator hasn't been used in the systems presented in this thesis.

Computing The robot is equipped with an on-board PC with a 1500 MHz Pentium D processor and 1 GB of RAM. All actuators and sensors are connected to this computer by default. Since processing or even recording the two high-resolution image streams from the Bumblebee stereo camera can take a lot of processing power, two laptops are available, each with a 2.4 GHz Core 2 Duo CPU and 2 GB of RAM. These two laptops can both be mounted on the robot at the same time, and can be connected to the on-board computer using an on-board gigabit ethernet switch. The stereo camera was attached to one of these laptops when recording the dataset. All computers run a variant of the GNU/Linux operating system. Clocks on the computers are synchronized to within a few milliseconds using the Network Time Protocol (NTP) [21].

2.1.2 Snackbot

Figure 2.2: Picture of the Snackbot.

The PeopleBot has been purchased as a development platform for the Carnegie Mellon University Snackbot. The Snackbot has been designed to serve as a human-robot interaction platform, and to provide a continuing service delivering snacks in the university buildings. The systems described in this thesis have been developed on the PeopleBot, but many contributions will eventually be added to the Snackbot.

The Snackbot is largely similar to the PeopleBot. It has been built on a MobileRobots inc. Pioneer 3 base, and has the same sensors and actuators in approximately the same locations, even though these are housed in a custom-built and visually attractive shell.

⁶ More details can be found in the PTU-D46 manual.



There are some additional sensors as well. The Snackbot has two non-functional arms holding a tray for snacks. This tray has a sensitive pressure-sensor array to detect which compartments of the tray are holding snacks. Moreover, the robot is equipped with an omnicamera on top of its head. See the article by M. Lee et al. [23] for more details.

2.2 Software

2.2.1 Work on the CMAssist software repository

A lot of time was invested in writing software for general shared use in the CMAssist software repository. Together with others, I have worked on extracting color images from the stereo camera, logging of images, threaded logging, translation and rotation of point clouds relative to the position of the PTU, a generic configuration system, and socket-messaging wrappers, among other things.

2.2.2 ARIA

Robot platforms by MobileRobots inc. are mainly controlled using the freely available ARIA (ActivMedia Robotics Interface) software library. This library handles interfacing with the robot's internal microcontroller, interfacing with most sensors, and control of the actuators.

2.2.3 Intra-process communication

Modular programming increases the clarity and reusability of software by separating divisible functionality into easily attachable, detachable, and re-attachable 'building blocks'. To this end, a push-message-based architecture has been devised. A schematic overview is presented in figure 2.3. Every module can send messages to an address, and every module can subscribe to addresses and receive every message sent to those addresses. Since data in messages is sent as a C++ void pointer, it's up to every module to interpret the messages in a meaningful way.

All processing is done sequentially: a sending module calls the sendMessage() function to send a piece of data, referenced by a void pointer, to a certain address. The function calls the messageReceived() function of all modules subscribed to that address. When the sendMessage() function returns, all data processing is done and the referenced data can safely be freed. Any module that keeps a memory of past messages or does asynchronous processing needs to copy the referenced memory from the message. This greatly simplifies object lifetime management.

This approach has a number of advantages. One-to-many messaging is part of the design. Since modules are only woken up if there's data that can be processed, no polling is needed. This allows for computationally and complexity-efficient software. Moreover, coupling between modules is low, which allows modules to be added, removed, and shared easily between programs. For example, removing a module might cause other modules to stop working, since they won't receive the data they are waiting for, but it will never result in a compilation error.

Figure 2.3: The intra-process messaging architecture. MessageSender modules call MessageHandler::send(message, address); MessageReceiver modules register with MessageHandler::subscribe(address, receiver) and are notified through newMessage(message). A Message carries dp_sender, dp_address, dp_data, d_timeStamp, and dp_header fields. As an example, a Camera module sends messages that a CameraDisplayer module receives.
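To make the pattern concrete, the following is a minimal Python sketch of the synchronous publish/subscribe mechanism described above. The real system is written in C++ and passes data as void pointers; the class and method names below mirror the description only loosely and should be read as assumptions, not as the actual CMAssist API.

```python
class MessageHandler:
    """Synchronous one-to-many message dispatch, as described in section 2.2.3."""

    def __init__(self):
        self.subscribers = {}  # address -> list of receiver callbacks

    def subscribe(self, address, receiver):
        self.subscribers.setdefault(address, []).append(receiver)

    def send_message(self, address, data):
        # All delivery happens before send_message() returns, so the sender may
        # free or reuse 'data' afterwards; receivers that keep the data around
        # (or process it asynchronously) must copy it first.
        for receiver in self.subscribers.get(address, []):
            receiver(address, data)


# Example: a camera module publishes a frame, a display module consumes it.
handler = MessageHandler()
handler.subscribe("camera/frames", lambda addr, frame: print("received frame on", addr))
handler.send_message("camera/frames", b"...raw image bytes...")
```

Because dispatch is a plain function call, removing a subscriber never breaks the build; the publisher simply ends up with an empty subscriber list for that address.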

2.2.4 Inter-process communication

For inter-process communication, existing CMAssist software relies on a socket-based system detailed in the article by Jeremy Stolarz [37]. Sockets are platform- and programming-language-independent and are especially designed for communication between networked machines. This allows for easy cross-platform and cross-programming-language distribution among multiple and potentially very different machines. Moreover, instead of a single program gaining exclusive access to a sensor or actuator, this allows access to be distributed in an efficient one-to-many and many-to-one fashion. I have written wrapper libraries and modules to allow for easy use of this socket-based messaging system. This system was used whenever inter-process communication was needed, such as for communication with the sensors and the logging system.
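As an illustration of the general idea only (not of the actual CMAssist wire format, which is defined in Stolarz [37]), a length-prefixed message can be exchanged over a socket with a few lines of Python; the helper names here are assumptions.

```python
import socket
import struct

def send_message(sock: socket.socket, payload: bytes) -> None:
    # Prefix the payload with its length (4-byte big-endian unsigned int),
    # so the receiver knows how many bytes belong to this message.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_message(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return _recv_exact(sock, length)

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    # recv() may return fewer bytes than requested, so loop until done.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before message was complete")
        buf += chunk
    return buf
```

Any process that speaks such a framing can publish or consume sensor data, regardless of the machine, operating system, or programming language it runs on.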


Chapter 3 Tracking

3.1 Introduction

The aim of this research is to implement a multi-person tracking and gaze-detection system that can be used as a basis for human interaction interest detection. In this chapter, the tracking portion of the system is detailed. As can be seen in figure 3.1, this system can be used as a segmentation step for the gaze detector, which is described in chapter 4.

In section 3.2.1 we look at other tracker implementations. We are especially interested in methods for perception and classification in the context of tracking. In section 3.2.2, the theoretical background of probabilistic tracking is explained, with a short intuitive explanation of how the Kalman filter solves the problem of probabilistic tracking. This knowledge is needed to understand the model as presented in section 3.3. The model section describes the workings of the system in enough detail to be able to rebuild the system. This section is split into a perception part (3.3.1), explaining the developed leg detectors, and a tracking part (3.3.2), explaining the exact model used for tracking.

The chapter closes with an implementation section (3.4), elaborating on the developed re-usable tracking software framework and details about the visualization for the tracker.

3.2 Background

3.2.1 Previous work

As detecting persons is a necessary skill in any human-robot interaction task, many person detecting and tracking systems for robots already exist. Often, those systems are demonstrated by having the robot follow a person [12]. Since the Snackbot will mostly be operating in 'vending machine mode', this functionality was not necessary for our system. In this section, existing approaches to person tracking are explored. We will mainly concentrate on the perceptual modules of existing systems, and on how those perceptions are classified into tracks of persons.


Figure 3.1: Bird's-eye view of the overall system: the SICK laser rangefinder feeds the leg detector and person tracker ('Where are persons in the robot's environment?'), and the Bumblebee 2 stereo camera feeds the gaze detector ('Is someone looking at the robot?'). The green bar indicates the portion of the system that is detailed in this chapter.

Perception

The vast majority of perceptive systems with the goal of person tracking use laser-rangefinder-based leg detection [17][12], visual face detection [24], or a combination of both methods [7][11]. The older, much-cited article describing the Pfinder system [42] uses color blobs to detect persons.

When a laser rangefinder is mounted at leg height, the legs of persons form a characteristic pattern [11]. Usually simple heuristics or simple classification techniques are used to extract leg pairs and their positions from the range-finder measurement data.

There are more differences in the methods for face detection. The Haar-like-features-based face detector [20], as available in the OpenCV computer vision library [13], is often used in the surveyed literature [24][7]. This classifier is also used in the system presented here for gaze detection, as described in section 4.3.2. The article by Fritsch et al. [11] uses an eigenfaces-based method [39].

Tracking

The Kalman filter, described in great detail in section 3.2.2, is a popular choice for tracking implementations [7][24][9]. An old tracker written for a decommissioned robot was available in the CMAssist code base, as described in the article by Stolarz [37]; a particle-filter-based tracking approach as described by Gockley [12] was implemented in this code. Other methods are employed as well. The articles by Fritsch et al. and Lang et al. [11][17] employ a so-called anchoring framework, borrowing some jargon from probabilistic tracking theory, such as the predict and update steps, which will be explained later.

3.2.2 Tracking with the Kalman filter

Tracking

‘Estimation is the process of inferring the value of a quantity of interest from indirect, inaccurate, and uncertain observations.’ [5] Tracking is a special case of estimation, where an estimation of a quantity is based on previous measurements as well as current measurements. As is often the case, these quantities of interest can be a representation of location. This is true for the presented system, where the physical location of a person is being tracked. It’s also possible that the quantities of interest have nothing to do with the location of the entity. An example of this could be the tracking of the state-of-charge of an electronic battery.

The Kalman filter

The Kalman filter was invented by Rudolf Kalman in 1960 [15]. Since then, Kalman filters have been applied to a multitude of problems, including but not limited to person tracking [7][24], location estimation for robots [25][33], tracking of missiles and objects, and tracking the state of charge of portable computer batteries.

We will not fully explain, derive, or even state the complete equations by R. Kalman here, as there is already a wealth of documentation available. Recommended are the original paper [15] and the publicly available thesis by Negenborn [25], which thoroughly explains the theory in an accessible manner. Instead, we try to offer a global understanding of the uses and limitations of Kalman filters for the problem of person tracking. Details of the Kalman filter that are relevant to understanding the specific implementation presented in this thesis will be visited in more detail.

State

In general, the purpose of a Kalman filter is to estimate some unknown set of variables, the 'state' xt, based on available but noisy measurements. In the following sections, we will apply the Kalman filter to the problem of estimating a person's location, and the examples that we use will be tailored to this specific problem. However, the general principles of a Kalman filter can be applied to the broader problem of estimating a true (partial) world state based on inaccurate measurements.

Person-location as a Belief

We want to have a good estimation, a belief close to the truth, of the location of a person, but all we have is current and previous noisy sensor measurements. We can represent the current estimated location of a person as a belief.

Bel(xt) = P (xt|d0...t) (3.1)


For a tracking problem, this represents the probability that the person is at location xt at time t, given all the available data up to that time, d0...t. The location with the highest probability given this data, the peak in the probability distribution, is the most likely location of the person. We want this peak to be as close as possible to the actual, true location of the person. There are two processes, acting and sensing, that can change the estimate of the state, the belief.

Acting

Acting is when the system itself performs an action, or if it’s known to the system that an action is performed, that changes the state in a predictable way. Let at be the action performed at time t.

P (xt|xt−1, at−1) (3.2)

This probability density gives the probability that the state is xt, given that the previous state was xt−1 and action at−1 was performed. This is called the action or motion model.

For example, if a robot wants to estimate its location, and gives its actuators the command to move one meter forward, that’s an instance of acting. Please note that equation 3.2 describes a probability density of outcomes of the action. In this example, the robot might not move forward by one meter exactly. Probably it moves a little more, or a little less. This is modeled by the probability density.

In the case of person tracking, the notion of an action is less intuitive. The system doesn't perform any actions that influence the state, since our state estimates the location of a person. However, we still use the notion of acting in our system: we use a model of the previous movement of the tracked subject to estimate its current actions, and use this to update our beliefs. More details about this action model can be found in section 3.3.2.

Sensing

Sensing is when the system receives a measurement giving information about the state of the system. Let st be the sensing data received at time t.

P (st|xt) (3.3)

This probability density gives the probability that the sensor observes st when the state is xt at time t. This is called the perceptual or sensor model. Often the perceptual model is time-invariant, in which case we can omit the t. In our case, the data received from the SICK laser scanner at leg height (see section 2.1.1) provides sensing data.

Feature Extraction

Feature extraction is employed to simplify the sensor-model. The sensor model is often large and hard to compute, for example because the data is of high dimensionality. In our case, we would have to build a model for every possible SICK laser measurement for every possible configuration of people in the environment, which wouldn’t be feasible.

For this reason, feature extraction is employed to reduce complex measurements to a much simpler and lower dimensionality feature vector.

σ : S → Z (3.4)


Instead of the sensing data S, the feature vector Z can be used in equation 3.3. In our case, the leg detectors explained in section 3.3.1 reduce the sensing data from the SICK laser to one or more feature vectors containing a hypothesized x,y location of a person.

Since this corresponds largely to the state that we want to estimate, we can use a simple sensor model.

Bayesian approach

This interpretation of the tracking problem enables a Bayesian approach. So how do we put this together to track a person? We have to start with an initial belief about the location of the person. It is possible to use a uniform distribution as an initial belief, but in our case we adopt the estimated location from our first measurement as the initial belief.

Now the belief has to be updated to keep on tracking the person. This is done by applying the action model and the sensor model in turn. The reader is encouraged to look up the formal derivation of these processes[25]. We will suffice with giving the final formula to calculate the posterior belief or the belief after receiving the most recent measurement.

\[ Bel^{+}(x_t) = \eta_t\, P(z_t \mid x_t) \int P(x_t \mid x_{t-1}, a_{t-1})\, Bel^{+}(x_{t-1})\, dx_{t-1} \qquad (3.5) \]

With Bel+(xt) being the posterior belief (as denoted by the +), that is, the belief after applying the acting step, and ηt being a probability density normalizer that scales the outcome between zero and one. This simplified formula is valid after applying a Markov assumption [25] to the action model. The Markov assumption states that given the current state, the past is independent of the future and vice versa. We make a first-order Markov assumption, assuming that the current state depends only on the previous state and the actions that were performed while in the previous state. Moreover, the Markov assumption is applied to the sensor model: here, it states that the current sensor reading depends only on the current state, i.e. the sensor itself doesn't have any 'memory effect'.
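For intuition, the Python sketch below applies equation 3.5 to a 1-D grid of candidate positions, i.e. one of the discretized implementations discussed further on. The grid, the Gaussian motion and sensor models, and the noise values (0.5 m and 0.3 m, the same order of magnitude as the Q and R choices made later in section 3.3.2) are illustrative assumptions, not the tracker actually used in this thesis.

```python
import numpy as np

# 1-D grid of candidate person positions (metres) and a uniform prior belief.
positions = np.linspace(0.0, 5.0, 51)
belief = np.full(positions.shape, 1.0 / len(positions))

def predict(belief, motion_std=0.5):
    """Acting step: spread the belief with a Gaussian motion model."""
    dx = positions[:, None] - positions[None, :]
    motion = np.exp(-0.5 * (dx / motion_std) ** 2)   # P(x_t | x_{t-1})
    motion /= motion.sum(axis=0, keepdims=True)      # columns sum to one
    new_belief = motion @ belief
    return new_belief / new_belief.sum()

def update(belief, z, sensor_std=0.3):
    """Sensing step: weight by the likelihood P(z | x_t) and renormalise (eta_t)."""
    likelihood = np.exp(-0.5 * ((positions - z) / sensor_std) ** 2)
    posterior = likelihood * belief
    return posterior / posterior.sum()

belief = update(predict(belief), z=2.1)     # one predict/update cycle
print(positions[np.argmax(belief)])         # most likely person location
```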

Complexity

The size and complexity of the belief about the state, the sensor model, and the action model quickly surpass constraints posed by the available hardware.

Even for a small number of variables in the state, the whole state-space can get very large. Since the belief needs to contain a probability for every possible combination of values in the state, storage requirements quickly surpass available memory space. If the state contains continuous variables, the state-space is infinite. This is the problem with representational complexity.

There is a similar problem for the sensor and action model: especially when the actions or sense data are high-dimensional, it quickly becomes intractable to compute a model. This is called modeling complexity. We've already seen one solution for high-dimensional measurement vectors, namely feature extraction. In the remainder of this section, a number of practical solutions to the complexity problem are presented, leading to practical implementations of a state estimator. One of these implementations is the Kalman filter.


Implementations

We've seen that probabilistic models often exceed practical limits on available memory space and computing power. Real probabilistic state estimators must address these problems while retaining the advantages of a probabilistic model.

Discrete belief Instead of using a continuous or otherwise large belief space, the belief space can be discretized or factorized into a finite, possibly small number of areas. The associated probabilities can then be computed and stored explicitly. This way of modeling a belief space is used by hidden Markov models.

Grids are a different form of discretization that would be practical for the problem of person tracking. The location space that we want to model can be discretized into a finite number of squares, and the belief then consists of the probabilities associated with each of these squares.

Particle filters Another interesting way of dealing with the complexity associated with a continuous or otherwise large state space is embodied by particle filters [2]. The belief can be modeled by taking n weighted samples of the belief space. Each sample is weighted by the probability of that specific location in state space. These samples can be used to obtain an estimated position, for example by calculating a weighted average among the samples or by taking the values of the sample with the highest probability. This approach is often applied to person tracking and robot localization.

Belief function Sampling and discretizing are solutions for storing a belief distribution in a finite memory space. Another solution is to use a function that describes the belief distribution as closely as possible. Instead of values, the parameters of the probability function are stored. Often, the Gaussian probability density function is used. Since a Gaussian function can be described by only its mean and variance, this is very efficient. Naturally, this approach can only be used if the belief can be approximated by a Gaussian function.

Kalman filter The Kalman filter or KF is an example of a belief-function-based approach. KFs assume that the sensor and motion model are subject to zero-mean Gaussian noise. Moreover, it is assumed that the belief itself can be modeled by a Gaussian function. If the system can be described by linear equations and the assumptions hold, it has been proven that the Kalman filter is an optimal state estimator [15]. Even if the belief and sensors are only close to Gaussian and the system can be linearized, the Kalman filter is often an efficient and well-performing state estimator. Moreover, the Kalman filter remains effective even for very high-dimensional state vectors. This can, for example, be used in robot-localization scenarios, where a Kalman filter variant can be used to track the location of every single landmark in one large state vector [33].


Limitations of the Kalman filter

There are a number of limitations to Kalman filters. As follows from the description in section 3.2.2, the sensor and motion models must be able to be modeled by a Gaussian. Any non-zero-mean or non-independent model noise cannot be modeled properly by a Kalman filter. Secondly, the belief probability distribution must be appropriately modeled by a Gaussian too. Since a Gaussian is a unimodal function, i.e. a function with one peak, a single Kalman filter cannot model multiple very different hypotheses, and cannot be applied to multiple-person tracking.

Luckily, there are ways to circumvent this last limitation. In our case, we use a Kalman filter for every person in the environment, and we use a least-error heuristic to assign incoming measurements to the right Kalman filter, as described in section 3.3.2.

Another interesting approach to this problem is to use a particle-filter-like system as described in section 3.2.2. Instead of just storing the state-vector variables for every particle, each particle becomes a small Kalman filter. This allows the use of an order of magnitude fewer particles compared to traditional particle filters, and allows for the modeling of multi-modal belief distributions [30].

3.3 Model

3.3.1 Perception

Leg detection

The tracker can't deal with the complex data from the SICK laser range-finder directly, so a leg detector is used as a feature extractor. During the project, multiple leg detectors have been written. These leg detectors can be divided into two main categories. The mobile category works without a background model, and can therefore be used unchanged on a mobile robot with changing background conditions. Two leg detectors in this category have been made: one based on existing code, and one based on the article by Fritsch et al. [11], written when the performance of the existing leg detector was judged insufficient. The static category uses a background model. This type of leg detector gives excellent performance for a static robot or otherwise static sensor, but would lose its advantage or even perform worse on a moving robot. We have one leg detector in this last category.

All leg detectors use data from the SICK laser range finder, which has been mounted on the robot at a height of approximately 25 cm above the floor. More details about this laser scanner can be found in section 2.1.1. We receive measurements from the range finder with an angular resolution of 0.5° at a rate of 5 Hz. Leg detections are generated for every measurement; therefore tracker updates are 0.2 s apart. Experiments have determined the best leg detector in each category, and these leg detectors have been used for the tracker.

Stolarz's Leg Detector The first leg detector was borrowed from existing code by Jeremy Stolarz of the CMAssist project [32]. Rombouts¹ and I have rewritten the software and ported it to our architecture for our own purposes.

¹ I'm referring to my friend and colleague Jaldert Rombouts from the University of Groningen.

The leg detector works by grouping measurements from the laser scanner that are less than ten centimeters apart. Groups that are no larger than 25 cm and no smaller than 5 cm are considered leg candidates. Two candidates that are no more than 50 cm apart are reported as a pair of legs, and the average location of the measurements in the pair of legs is reported as the center.
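The following Python sketch shows one way to implement this grouping-and-pairing heuristic under the thresholds just given; the function name, the polar-to-Cartesian preprocessing, and the use of the first-to-last point distance as the group size are assumptions layered on the description above.

```python
import numpy as np

def detect_leg_pairs(scan):
    """scan: iterable of (angle_rad, range_m) tuples from the SICK scanner."""
    points = np.array([[r * np.cos(a), r * np.sin(a)] for a, r in scan])
    # group consecutive measurements that are less than 10 cm apart
    clusters, current = [], [points[0]]
    for prev, pt in zip(points[:-1], points[1:]):
        if np.linalg.norm(pt - prev) < 0.10:
            current.append(pt)
        else:
            clusters.append(np.array(current))
            current = [pt]
    clusters.append(np.array(current))
    # groups between 5 cm and 25 cm across are leg candidates
    candidates = [c for c in clusters
                  if 0.05 <= np.linalg.norm(c[-1] - c[0]) <= 0.25]
    # two candidates less than 50 cm apart form a pair of legs; the reported
    # center is the average of all measurements in the pair
    detections = []
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            ci, cj = candidates[i], candidates[j]
            if np.linalg.norm(ci.mean(axis=0) - cj.mean(axis=0)) < 0.50:
                detections.append(np.vstack((ci, cj)).mean(axis=0))
    return detections
```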

Leg Detector without Background Model This leg detector is a direct implementation of the leg detector described in the article by Fritsch et al. [11]. Its workings are somewhat similar to those of the first leg detector. First, consecutive measurements are separated into segments, as follows: starting with the rightmost point, points are sequentially added to the current segment; whenever two consecutive points are more than 75 cm apart, a new segment is started. The following features are extracted from each segment:

• number of reading points (n)

• mean distance (µ)

• standard deviation of the distances (σ)

• width (w)

After this, the segments are classified with a number of simple decision boundaries.

A segment is considered a leg if and only if the following is true.

(n > 4) ∧ (σ < 0.04) ∧ (µ < 5) ∧ (0.05 < w < 0.25)   (3.6)

These values have been taken directly from the article by Fritsch et al. [11]. Please note that no special care has been taken to group leg detections into leg pairs; the tracker is equipped to average detections that are close together. This has the additional advantage that a person can still be detected if only one leg is detected, for example because one leg is occluded by the other.
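A possible implementation of this segmentation and decision rule is sketched below in Python; the function name and the assumption that µ, σ, and w are expressed in meters are mine, while the thresholds are those of equation 3.6.

```python
import numpy as np

def detect_legs(scan):
    """scan: list of (angle_rad, range_m) tuples, ordered from right to left."""
    pts = np.array([[r * np.cos(a), r * np.sin(a)] for a, r in scan])
    ranges = np.array([r for _, r in scan])
    # split the scan into segments at gaps larger than 75 cm
    segments, start = [], 0
    for i in range(1, len(pts)):
        if np.linalg.norm(pts[i] - pts[i - 1]) > 0.75:
            segments.append((start, i))
            start = i
    segments.append((start, len(pts)))
    # extract the four features and apply the decision rule of equation 3.6
    detections = []
    for s, e in segments:
        n = e - s                                  # number of reading points
        mu = ranges[s:e].mean()                    # mean distance
        sigma = ranges[s:e].std()                  # std. dev. of the distances
        w = np.linalg.norm(pts[e - 1] - pts[s])    # width of the segment
        if n > 4 and sigma < 0.04 and mu < 5 and 0.05 < w < 0.25:
            detections.append(pts[s:e].mean(axis=0))
    return detections
```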

Leg Detector with Background Model This leg detector is an extension of the leg detector without a background model. Before the initial segment separation, data points are compared against a background model. Points that have moved less than 10 centimeters compared to the background model are pruned. Since this approach nearly eliminates false positives, less stringent decision boundaries are used when classifying the segments. A segment is considered a leg if and only if the following is true.

(n > 3) ∧ (µ < 5) ∧ (0.05 < w < 0.5)   (3.7)

These values have been determined by experimentation. For our experiments, the background model is extracted from, and identical to, the first frame. Please note that this method will only provide better performance than the second leg detector on a stationary robot. More advanced background modeling methods could be employed to provide robustness for a robot that regularly changes location.


3.3.2 Tracking

Kalman filter

A quick overview of probabilistic tracking as used by the Kalman filter was given in section 3.2.2. The Kalman filter estimates a state stored in a state vector. After acquiring an initial belief, the belief is updated by applying the prediction model and the update model whenever a new feature vector arrives from the feature extractor; the feature extractor is, in this case, a leg detector, as described in section 3.3.1. Both a Kalman filter without and a Kalman filter with a linear motion model have been implemented.

Theoretically, the prediction model and update model needn't be applied in turn. The prediction model can be applied as many times as desired, as long as it is up to date when a new measurement is received. However, in the implementation presented in this thesis, predictions are run only when new measurements are received, and therefore the models are applied in turn.

The original Kalman filter can only be applied to linear prediction and update functions. Since trigonometric functions are used in the motion model of the prediction function, the plain Kalman filter cannot be used. Instead, the extended Kalman filter (EKF) is employed [22]. For this variant, non-linear functions are linearized with the use of Jacobian matrices.

State Vector For each person that is being tracked, the following state vector is estimated.

\[ \hat{X} = (x, y, v, \theta) \qquad (3.8) \]

With x and y being the location of the person relative to the robot in meters, v being the velocity of the person, θ being the direction of movement, and X being the real state. Since we observe the world through noisy sensors, we can never know the true world state. Instead, we estimate the world state, which is denoted by the hat on the X: X̂.

Sensor and model error A Kalman filter reduces the complexity of probabilistic tracking by assuming zero-mean Gaussian sensor noise and model noise. This allows the noise to be modeled in its entirety by the sensor-noise variance R and the model-noise variance Q, both being matrices with the independent noise variances along the diagonal, with dimensions matching the sensor vector and state vector respectively.

\[ R = \begin{pmatrix} 0.3^2 & 0 \\ 0 & 0.3^2 \end{pmatrix} \qquad (3.9) \]

Please note that the SICK laser rangefinder itself should be more accurate than this. However, since a leg detector is used, the error of the leg detector is what is estimated here. In section 3.5.1, the expected error of the leg detector is analyzed in great detail; here we suffice to say that 0.3 m is a reasonable upper bound on the leg-detector noise.


\[ Q = \begin{pmatrix} 0.5^2 & 0 & 0 & 0 \\ 0 & 0.5^2 & 0 & 0 \\ 0 & 0 & 0.5^2 & 0 \\ 0 & 0 & 0 & \pi^2 \end{pmatrix} \qquad (3.10) \]

The Q matrix describes the estimated model error. In this case, the model error is mainly determined by the unpredictability of human movement. It is estimated that the location of a person usually doesn't differ by more than 0.5 m between two updates, after the motion model has been applied. Moreover, it is estimated that the velocity of a person doesn't change at a rate higher than 0.5 m/s²/update. The absolute maximum deviation possible in angle is 2π; π has been chosen as a reasonable upper bound, because movement-direction estimates become unreliable and differ wildly at low velocities.

Initial belief The Kalman filter needs an initial belief before it can start tracking a person. The initial belief consists of an initial state vector with an initial variance per vector element describing the accuracy of the state vector. The variance σ² is the square of the estimated error σ.

A new Kalman filter is created and initialized the first time a new person is perceived. The x and y locations are initialized to the location of the first leg detection. Since we can't compute a velocity v and movement angle θ from a single measurement, the velocity and angle are both initialized to zero. The initial location variance is equal to the sensor variance R. The initial velocity variance is 0.5², with 0.5 m/update being a rough upper bound on indoor human velocity. The initial angle variance is π², with π being an upper bound on the error for an angle varying between −π and +π.

Update model The update model or sensor model describes the expected sensor data based on the current state estimate. This allows the Kalman filter to update its belief at the update step, based on the difference between the estimated sensor readings and the actual sensor readings.

\[ \hat{Z}_t = H(\hat{X}_t) \qquad (3.11) \]

With Ẑt being the estimated sensor data at time t, H being the update model, and X̂t being the estimated state at time t. Since the feature extractor reduces the measurement information to x,y location information in the same frame of reference as the state vector, the sensor model is straightforward.

\[ x = \hat{X}_x \qquad (3.12) \]
\[ y = \hat{X}_y \qquad (3.13) \]

The software expects Jacobian matrices for the update and prediction models as well. Jacobian matrices can be used to linearize a non-linear function at a certain point on its curve. Since the update model is linear and straightforward, its Jacobian is simply the identity matrix.
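For illustration, a standard EKF measurement update with this sensor model might look as follows in Python; the function name is an assumption, and H is written here as a 2x4 selection of the x and y components of the state, playing the role of the identity Jacobian mentioned above.

```python
import numpy as np

def ekf_update(state, P, z, R):
    """One measurement update: state is [x, y, v, theta], P its covariance,
    z the (x, y) leg detection, R the sensor-noise covariance of equation 3.9."""
    H = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])     # the sensor observes x and y directly
    innovation = z - H @ state               # measured minus expected measurement
    S = H @ P @ H.T + R                      # residual covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    new_state = state + K @ innovation
    new_P = (np.eye(len(state)) - K @ H) @ P
    return new_state, new_P
```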


Prediction model The prediction model or motion model describes the expected current state based on the previous state. This is used by the Kalman filter to update the belief at the prediction step.

\hat{X}_{t+1} = f(\hat{X}_t)   (3.14)

With X̂_t being the current state, X̂_{t+1} being the next state, and f being the prediction model.

We have two prediction models, the performance of which will be evaluated in section 3.6.2. The first is a static prediction model, i.e. it doesn’t have a motion model. This means that the state vector doesn’t change at a prediction step; only the uncertainty estimates are updated. It also means that the v and θ in the state vector aren’t used. Since this model is trivial and largely analogous to the update model, we won’t present it in any more detail.

The second prediction model is a linear motion model. The assumption is that the movement velocity and direction of a person between the previous and the current measurement are likely to be similar to the velocity and direction between the current and the next measurement.

x_{t+1} = x_t + \cos(\theta)\,v\,\Delta t   (3.15)

y_{t+1} = y_t + \sin(\theta)\,v\,\Delta t   (3.16)

v_{t+1} = v_t   (3.17)

\theta_{t+1} = \theta_t   (3.18)

With ∆t being the time difference between the current and the previous prediction step. Note that v and θ aren’t truly constant: at every step, the software calculates the angle and displacement between the previous and the current estimate, and updates v and θ accordingly.

The Kalman filter requires linear prediction and update functions. Equation 3.15 clearly isn’t linear. Therefore, the extended Kalman filter, which uses a Jacobian matrix φ to linearize this system of equations, is used [22].

\phi = \begin{bmatrix} 1 & 0 & \cos(\theta)\Delta t & -\sin(\theta)v\Delta t \\ 0 & 1 & \sin(\theta)\Delta t & \cos(\theta)v\Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}   (3.19)
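
A sketch of the resulting extended-Kalman-filter prediction step is given below, combining the motion model of equations 3.15–3.18 with the Jacobian of equation 3.19. This is an assumed NumPy formulation for illustration, not the Bayes++-based code used in the thesis.

def predict(state, cov, dt, Q):
    """EKF prediction step with the linear motion model."""
    x, y, v, theta = state
    # Motion model f: move along the current heading at the current velocity.
    state_pred = np.array([x + np.cos(theta) * v * dt,
                           y + np.sin(theta) * v * dt,
                           v,
                           theta])
    # Jacobian phi of f, linearized around the current state estimate (eq. 3.19).
    phi = np.array([[1.0, 0.0, np.cos(theta) * dt, -np.sin(theta) * v * dt],
                    [0.0, 1.0, np.sin(theta) * dt,  np.cos(theta) * v * dt],
                    [0.0, 0.0, 1.0, 0.0],
                    [0.0, 0.0, 0.0, 1.0]])
    # The uncertainty grows by the model noise Q at every prediction step.
    cov_pred = phi @ cov @ phi.T + Q
    return state_pred, cov_pred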

Multiple Person Tracking A single Kalman filter is not suitable for tracking multiple persons, as described in section 3.2.2. A system to create Kalman-filter-based tracks, delete tracks, and assign measurements to the right track has been devised.

The basic task of the system is to assign measurements to the right track. This is done by calculating the distance between the incoming measurement and all currently present tracks. The Mahalanobis distance is commonly used for this purpose [9].

D_m = (z - \hat{z})^T S^{-1} (z - \hat{z})   (3.20)

With D_m being the (scalar) Mahalanobis distance, ẑ being the estimated measurement as computed by the update function, and z being the incoming measurement. S is the residual covariance matrix containing the state-estimation uncertainty. To take the sensor variance into account as well, it is added along the diagonal of the covariance matrix S.

The measurement is assigned to the track that has the lowest Mahalanobis distance to the measurement. This track (and only this track) will use the measurement for its update step.

If the minimum Mahalanobis distance to all present tracks is greater than 2 standard deviations, it is deemed too unlikely that the measurement belongs to any of the existing tracks. Since the Mahalanobis distance is a z-value, this corresponds to a confidence of roughly 95%.

If this happens, a new track is created, which will be initialized with the current measurement as described in the ‘initial belief’ paragraph.

If the uncertainty of a track surpasses ten times the model uncertainty Q, the track is deleted. This value has been determined by experimentation.
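
A sketch of the measurement-to-track assignment described above, in the same assumed NumPy style as the earlier listings (H and R are the matrices from those sketches; each track is represented as a (state, covariance) pair, and the names are hypothetical):

def associate(measurement, tracks, H, R):
    """Return the index of the closest track, or None if a new track should be created.

    tracks is a list of (state, cov) tuples; measurement is the (x, y) leg detection.
    """
    best_idx, best_dist = None, float("inf")
    for i, (state, cov) in enumerate(tracks):
        z_hat = H @ state                       # expected measurement for this track
        S = H @ cov @ H.T + R                   # residual covariance plus sensor variance
        diff = measurement - z_hat
        d = diff @ np.linalg.inv(S) @ diff      # Mahalanobis distance of eq. 3.20
        if d < best_dist:
            best_idx, best_dist = i, d
    if best_idx is None or best_dist > 2.0**2:  # beyond roughly two standard deviations
        return None
    return best_idx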

3.4 Implementation

3.4.1 Architecture

The tracking architecture has been based on the Bayes++ Bayesian filtering software library². We have developed a number of wrapper classes that allow this library to be employed quickly for a tracking problem. The wrapper framework has been designed to use reasonable defaults, so that a tracker can be built quickly with very little code, while still retaining the possibility to override these defaults if more control is needed.

3.4.2 Visualization

Screenshots of the visualization system for the tracker have been provided in figure 3.2.

The screen represents a top-down view of the 5x5m area immediately in front of the robot.

The robot origin is at the bottom of the screen, facing forward. The little green dots represent the measurements from the SICK laser at leg height. When there is a person in the area, the leg detector detects the person’s legs, which is visualized by drawing little red circles around them. The little green line in these circles represents an estimate of the direction the person is facing. As soon as a leg is detected, a track is initialized on the position of the first detection. The tracks are represented by a small colored square, with a same-colored circle around it representing the uncertainty (sd = 1) of the track.

Text at the right of the track shows the id, the x and y coordinates of the track, the x and y uncertainty, and an estimation of the velocity and the angle of the track. To show the effects of the motion model, a blue line is drawn from the current location to the estimated next location.

²The Bayes++ software can be downloaded at http://bayesclasses.sourceforge.net/Bayes++.html. The documentation can be found on this webpage as well.


(a) Subjects walking towards the robot. (b) Subjects swapping positions.

Figure 3.2: The Visualization of the Tracking system. The background has been colored white for clarity when printed. The green dots represent the measurements from the SICK laser rangefinder. Groups of measurements that are detected as legs, are surrounded by a red circle. Tracks are visible as small colored squares, surrounded by a circle representing the uncertainty of the track. A blue line originating in the tracks shows current movement velocity and angle, and ends in the estimated next location of the track. More details about the visualization can be found in section 3.4.2.


3.5 Leg Detection Experiments

3.5.1 Method

The leg detectors are tested separately from the trackers. Both are tested on the tracking dataset. The dataset has been described in appendix section A.0.4.

Annotations describe when and at which locations persons are present. The software automatically compares leg detections with the known locations of persons. For performance measurements, the leg detector is treated as a binary classification system, detecting whether legs are present within an acceptable error boundary around the known location of a person. The following rates are computed by comparing expected presence with actual detections for each of the nine locations, in every frame for which the locations of persons are annotated. The true positive rate is the fraction of locations where a person is expected and at least one detection is present within the acceptable error boundary. The false positive rate is the fraction of locations for which leg detections have occurred even though no person was expected there.
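
A sketch of how these per-frame rates could be computed from the annotations (illustrative only; the function name and data layout are assumptions, not the thesis evaluation code):

def detection_rates(frames, boundary):
    """Per-location true- and false-positive rates over all annotated frames.

    Each frame is (locations, occupied, detections): the nine annotated locations,
    a flag per location telling whether a person is expected there, and the list
    of leg detections for that frame, all as (x, y) coordinates in meters.
    """
    tp = fp = n_occupied = n_empty = 0
    for locations, occupied, detections in frames:
        for loc, expected in zip(locations, occupied):
            hit = any(np.hypot(d[0] - loc[0], d[1] - loc[1]) <= boundary
                      for d in detections)
            if expected:
                n_occupied += 1
                tp += hit
            else:
                n_empty += 1
                fp += hit
    return tp / n_occupied, fp / n_empty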

Acceptable error-boundaries

An acceptable error boundary needs to be defined, where an error boundary is defined as the area around a known location of a person within which we can reasonably expect a leg-detection to occur for a correctly working leg detector. Leg-detections within this area count positively towards the performance-score of the leg detector.

We have tried to estimate and justify a reasonable value for the error boundary, but data to ground these assumptions on is scarce. To get a good sense of an acceptable error boundary, we will test different boundaries against the dataset. An acceptable boundary will be determined based on these results. The error boundary is a cumulative error:

• E1: The error in the known location of persons.

There are two sources of error in the known location of a person. The first is the error in the placement of the person relative to the sensor: persons will never stand at exactly the same location relative to the robot. The second is caused by errors in the placement of the robot relative to the markers on the floor.

Displacements of the robot along the x and y axes in the robot’s frame of reference are expected to be on par with the errors in the placement of persons, since the robot has been placed with similar human judgement. A bigger error is caused by small differences in the angle of the robot: small errors in angle propagate to large measurement errors in the robot’s frame of reference as distance increases, as described by:

E_y = dist \cdot \sin(\Delta\alpha)   (3.21)

E_x = dist - dist \cdot \cos(\Delta\alpha)   (3.22)

E_{ang} = \sqrt{E_x^2 + E_y^2}   (3.23)

Another possible source of error lies in small errors in the placement of the subjects’ position markers. This error is, however, expected to be relatively small and has therefore been excluded from this discussion.


• E2: The error in the locations of legs relative to the center of a person.

The legs of a person are not directly below the center of the body of that person. Based on metrics on human sizes³, we estimate the average distance from the center of the body to the center of a leg to be around twenty centimeters. This is only relevant if one leg is detected and the other isn’t, which is relatively rare.

• E3: The error in estimating the center of legs by the leg detector.

The leg detector uses the average location of the points in the point cloud that makes up the detected leg. The disappearance of a single measurement from the leg detection, for example because of tiny movements of the subject, causes the estimate of the leg position to shift slightly. Since the number of measurements for a single leg varies with distance and leg size, this effect again amplifies at greater distances.

• E4: The error in the SICK laser data.

The SICK laser has an independent error of 5 mm per beam. Moreover, the device could have a systematic error of up to 15 mm. More details about the SICK laser and its accuracy can be found in section 2.1.1.

Where E_{tot} = E_1 + E_2 + E_3 + E_4.

Experiments

For most experiments, the three different leg detectors described in section 3.3.1 will be compared. For reliable and comparable results, we need to define an acceptable error boundary. This is the goal of the first experiments, which compare the true positive and false positive rates as a function of the error boundary. After a usable error boundary has been established, the performance of the leg detectors will be compared for increasing distance, and for each of the two environments.

3.5.2 Results

Error Boundary

What would be a good error boundary to use for the rest of our leg-detection experiments? Leg detections will rarely occur at the exact position of the target, because of inaccuracies in the measurement, the positions of persons, the positions of the targets on the floor, and the way persons stand on the targets. We need an acceptable error boundary around our experimental targets, within which radius we consider leg detections to correctly belong to that target. See section 3.5.1 for more details.

An error boundary that is too small would classify correct detections as misses, decreasing our perceived performance for the leg detector below the actual performance. On the other hand, an error boundary that is too large might inadvertently inflate our perceived performance, by counting actual false detections as true positives for a target far away.

Moreover, an error boundary that is too large might decrease our perceived performance as well;

The minimum distance between two targets in our testing scenario is 92 cm, namely for targets one and three, closest to the robot. The experimental software is naive as to where we expect persons to be: targets are queried sequentially, and if a person is within

³The data on 95-percentile hip breadth at http://www.roymech.co.uk/Useful_Tables/Human/Human_sizes.html has been used for a quick estimation of leg-to-leg distance.
