
University of Twente
Computer Science, Human Media Interaction
Enschede, Overijssel, The Netherlands

University of Otago
Department of Computer Science, Graphics and Vision Research Laboratory
Dunedin, Otago, New Zealand

Body pose tracking in the Watching Window.

A system that tracks both hands in a virtual reality environment.

K.G.F. Herms
January 2007

Examination committee:
dr. D.K.J. Heylen
ir. R. Poppe
prof. dr. ir. A. Nijholt
dr. B. McCane


Abstract

This thesis is written to conclude the master programme Human Media Interaction at the University of Twente. It describes a research project related to the Watching Window (WW) project of the University of Otago.

The WW is a Virtual Reality (VR) environment in which the user can interact with a VR application without the use of body suits, markers or gloves. To achieve this interaction, the user's head is tracked by several cameras and these head coordinates are triangulated to estimate the head position in 3D world space. With this 3D position, the correct perspective of a VR application is drawn by a projector on a screen. This enables the user to see the 3D scenery in the right perspective continually, giving the user the correct motion parallax while moving around. With the help of stereo glasses, the user can even experience a 3D effect. The current applications only use the motion parallax and 3D effect, while more interesting and challenging applications are possible if the user can physically interact with the VR world. This thesis describes a system that tracks both hands of the user, making more interaction with the VR world possible.

A hand-tracking module is presented that uses all available cameras to estimate the current upper body pose of the user. Both hands are derived from the estimated pose and used in the WW applications. The module uses a model-based pose estimation technique, and a particle filter is used to track the body pose of the user. A 3D model is used that can be projected onto every camera observation, making a generic evaluation possible. This enables the use of all available cameras for the evaluation, which makes more information available and resolves camera observation ambiguity. The dimensionality of the model is reduced by decomposing the model into separate body-part estimation problems. Each of these body-part estimation problems is solved by a model-based particle filter.

The hand-tracking module is evaluated with two tests: an accuracy test and a performance test. From these tests it can be concluded that the hand-tracking module proposed in this thesis works well enough for pointing at and manipulating objects. However, for detailed hand tracking and small movements the method described in this thesis is not sufficient. Future work proposes a redesign of the image feature and the projection of the particles, and an addition to the model to enable more detailed tracking.


Acknowledgements

During my Human Media Interaction study at the University of Twente I came into contact with the University of Otago, in Dunedin, New Zealand. To conclude the HMI programme, a few interesting research projects at the University of Otago were presented. Of all these assignments, the hand tracking system in the Watching Window was the most motivating and appealing.

I want to thank the University of Otago for the opportunity to be part of such an interesting and motivating project. I would like to thank my supervisor from the University of Otago, Brendan McCane, and from the Netherlands Dirk Heylen, Anton Nijholt and Ronald Poppe in particular. I also want to thank Geoff Wyvill for his help and fascinating ideas, and Damon Simpson, Phil McLeod and Sui-Ling Ming Wong for their help and support.

I also want to thank the other exchange students who made Dunedin a perfect place to study, and of course my family and friends for their support and understanding during my seven months of absence.

Koen Herms, Huizen, January 2007


Table of Contents

Part I Introduction
Chapter 1 Introduction
1.1 The Watching window
1.2 Problem statement
1.3 Thesis outline
Part II Research
Chapter 2 Literature overview
2.1 Introduction
2.2 Body pose estimation
2.3 Tracking
Part III Design
Chapter 3 Main Approach
3.1 Current state
3.2 Main approach
Chapter 4 Design
4.1 Image Feature
4.2 Modeling
4.3 Particle filter
Part IV Conclusions
Chapter 5 Evaluation
5.1 Accuracy
5.2 Speed results
Chapter 6 Conclusions
6.1 Summary
6.2 Conclusions
6.3 Future work
Part V Appendices
Appendix A The Watching Window definition
Appendix B Camera calibration
Appendix C The user model
Appendix D Hardware Statistics
Appendix E Architecture
Appendix F Results
Part VI References


List of Figures

Figure 1 - Watching Window Application
Figure 2 - System architecture
Figure 3 - Images from the three different cameras
Figure 4 - 3D coordinate estimation
Figure 5 - New Watching Window architecture
Figure 6 - Simple Motion Filter
Figure 7 - Edge detection and resulting AND'ed image
Figure 8 - Line Representation
Figure 9 - Hough Space Representation of a Line
Figure 10 - Hough Line detection results
Figure 11 - Bad Hough Line detection
Figure 12 - H-Anim standard joint structure
Figure 13 - Quadratic surface of an ellipsoid
Figure 14 - Rotation of an Ellipsoid
Figure 15 - Joint plane constraint problem
Figure 16 - Projection of a quadric
Figure 17 - Particle transition function
Figure 18 - Example body pose tracking
Figure 19 - Mock-up of the WW screen
Figure 20 - Camera model
Figure 21 - Distortion a) pillow distortion b) barrel distortion
Figure 22 - Kinematics Constraint
Figure 23 - Model of the user
Figure 24 - Hand-tracking module
Figure 25 - Communication model


Part I Introduction

Part I gives an introduction to this thesis. First a global introduction to the thesis subject is given, followed by the problem statement with all requirements and research questions. Finally, an overview of the rest of the thesis is presented.

Chapter 1 Introduction

1.1 The Watching window

The Watching Window (WW) is an ongoing project at the Computer Graphics and Vision laboratory at the University of Otago. Over time the WW project has evolved from a single computer application into a complete networked system. The WW is an environment that enables a user to interact with Virtual Reality (VR) applications without the use of any physical hardware (e.g. markers, gloves, HMD, etc.). To achieve this, 3 cameras are directed at the user and Computer Vision (CV) techniques are used to determine the coordinates and the movement of the user's body-parts. These coordinates and movements are used as the input of VR applications, which are drawn on a big screen in the center of the cameras, see Figure 1.

Currently, the environment is situated in a booth to get the best CV results, and a long-term goal of the WW is to get the WW out of the controlled environment, making it available anywhere with a few computers, a projector and cameras. Appendix A gives a specification of the booth and the computers used in the WW.

Figure 1 - Watching Window Application

The WW environment is named after the main application for which it is built, the Watching Window. This application defines the framework for other applications created for this environment. The WW application is a simulation of a window with an interesting background. In this virtual window, the background is drawn in the right perspective according to the position of the user's head, which is determined by analyzing the camera observations. Because the head position is updated in "real-time" and the perspective is redrawn in "real-time", the user experiences the right motion parallax1. This enables the user to walk around and look through the virtual window as if it were a real window with a real view behind it.

Besides the parallax view, the WW environment has the possibility of creating a 3D experience. This is achieved by drawing the right perspective for each individual eye in a different color, red and blue, and overlaying these views to define the final perspective view. When the user wears red-blue filtering glasses, the correct view is filtered for each eye, so each eye sees a different view specifically projected for it.

This trick enables the user to experience real depth in the VR world. In combination with the motion parallax view, any application created for this environment can give the user a real VR experience without the use of any hardware (except for the glasses).

1.2 Problem statement

The subject of this thesis is the extension and improvement of the current WW environment. When I started this project, the WW environment was only capable of tracking the head of the user and using this information in the 3D and motion parallax view. These applications give the user a great VR experience, which becomes even more realistic when physical interaction with the virtual world is possible.

1 Motion Parallax is the change of angular position of two stationary points relative to each other as seen by an observer, due to the motion of the observer. Objects that are far in the background move more slowly when the user moves his head than objects that are close to the virtual window.


For the user, the most natural interaction with the WW environment would be to use his/her hands. When the user can use both hands it is possible to create more interesting applications and games, e.g. the virtual clay ball created by D. Simpson. As the name suggests, a virtual ball of clay is projected in the WW and the user can shape this ball into any form. This implies that the WW environment needs to know the coordinates of the user's hands. In several previous versions of the WW environment, several hand tracking algorithms [33, 34, 35] have been tried. However, none of these hand tracking algorithms is currently working and a completely new solution needs to be found.

1.2.1 Goal

The goal of the project is to track the hands of the user in order to enable more interesting and challenging applications. This means that a system needs to be created that fits into the current WW system and tracks both hands of the user. The coordinates of both hands need to be transmitted to the VR application that uses this information. In short:

Track both hands of the user in world space using the current available setup of the WW.

1.2.2 Requirements

There are several requirements which define the boundaries of this research project.

1. Good accuracy and speed.

The tracking of the coordinates of both hands should be accurate enough to manipulate and move objects in the Watching Window environment, and the system needs to operate in "real-time" in order to facilitate smooth interaction with the VR application.

2. No physical changes to the user.

The current WW environment does not use any equipment and this property has to be maintained.

This means that the user should be able to walk in and use the system without putting on any hardware.

3. User independent.

It should not matter how the user looks. This means that the size, clothes and skin-color of the user make no difference to the tracking.

4. Intuitive to use.

The user should be able to come into the booth and use the system without any prior knowledge of how to use it.

1.2.3 Research questions

To achieve the main goal according to the requirements, some research questions are defined, divided into two main questions: a research question and an evaluation question. The research question defines the system that can track both hands, and the evaluation question determines how well the system works.

How can both hands of the user be tracked in the WW?

1. What projects are proposed in literature and are any of these projects useful?

a. Which techniques do they use and which of these techniques are useful?

b. Which of these techniques can be combined to form a new solution?

2. What does the current WW look like?

a. What does the current WW architecture look like?

b. What are the current CV techniques used in the WW?

When the system is defined, what is the performance of the system?

3. How accurately does the system track the hands of the user?

a. How accurate are the estimated hand positions in relation to the real coordinates?

4. What is the operating speed of the system?

a. Is the operating speed sufficient for user interaction?

1.3 Thesis outline

The rest of this thesis is subdivided into three parts: the research, design and evaluation parts. The research part consists of a literature study, elaborated in the next section, and gives an overview of different hand tracking and body pose tracking techniques. In addition, some interesting and useful techniques are explained in detail.

The design part starts with an overview and motivation of the methods used. First an evaluation of the WW and the currently used computer vision techniques is given.


From the literature study and the WW information, the hand tracking system is defined globally. The design part finishes with a detailed description of each of the components of the system.

In the last part, the module is evaluated in relation to the requirements stated in the previous section. First the evaluation setups, techniques and results are presented. Then the goal and requirements, described in the previous section, are related to the results to give a conclusion on the performance of the system.


Part II Research

This part describes the research of several Computer Vision techniques. First an introduction into the field of Virtual Reality and Computer Vision is given. From this introduction the field of pose estimation and tracking is explained in detail.

Chapter 2 Literature overview

2.1 Introduction

2.1.1 Virtual Reality

Ivan Edward Sutherland is the inventor of Sketchpad, an innovative program that influenced alternative forms of interaction with computers. Sketchpad was the first program that supported interaction in the way we currently interact with computers. Later, in 1968, Sutherland created what is widely considered to be the first VR and Augmented Reality (AR) Head Mounted Display (HMD) system. It was primitive both in terms of user interface and realism. The HMD worn by the user was so heavy it had to be suspended from the ceiling, and the graphics of the virtual environment were simple wire frame rooms.

Since then, VR has matured and new systems and programs have been created that aid us in many ways. The army uses VR to train its pilots and soldiers for warfare. NASA uses the same techniques to train astronauts for every possible situation before they even leave earth. Even industry uses VR to design and test new products for their ergonomics. For example, in the airplane industry the designers can virtually sit in and fly the plane before it is even built.

Still, the majority of VR applications are made for the amusement industry. Here the main goal of VR is to entertain users in the form of games and movies. These games and movies become more realistic as new VR techniques are designed. One technique that enables more realistic and natural interaction comes from the combination of VR and Computer Vision.

2.1.2 Computer Vision

Computer vision is the study and application of algorithms that allow computers to "understand" images. What this understanding entails depends on the task for which it is developed: it can be very superficial, e.g. color information of the image, or very specific, e.g. recognizing objects. Many of the methods and applications are still in the state of basic research. However, more and more methods have found their way into commercial products, where they are part of a larger system that can solve complex tasks (e.g. in the area of medical images, quality control or measurements in industrial processes).

One way to describe computer vision is by describing some of the application areas in which computer vision is used. One of the most common and prominent application fields is medical computer vision. In medical applications, computer vision techniques are used to analyze noisy image data (e.g. reconstructing 3D models from a series of noisy microscopic images [14]). A second big field in which computer vision is used is industry, where computer vision techniques are used to control specific processes in a factory (e.g. quality checks on a factory belt or robot arm position and orientation). However, maybe most applications are used in the military. Even though most of the work done in the military is not open to the public, one can guess where computer vision might be used, e.g. missile guidance, soldier detection from satellite images, enemy missile tracking, etc.

Another big CV research field is focused on detecting and tracking humans and their poses. Applications made for this field include surveillance systems, advanced user interfaces, motion analysis and VR [5, 32]. If computer vision is used in these applications, instead of other tracking devices, people and their poses are estimated from the camera observation. Some of these applications only need to know if there is a human present (surveillance systems) while others need to know the complete human body configuration (motion analysis).

In VR and advanced user interfaces, body pose estimation is used to remove the cumbersome hardware that is currently needed. As explained in the introduction, this thesis is about a system that enables the user to have a whole VR experience without the help of a suit, gloves, HMD or any other hardware; instead, computer vision is used.

2.1.3 Watching Window

The WW is a VR environment that uses CV techniques to estimate the head position of the user and uses this information in VR applications. This thesis is about a system that uses CV techniques to track both hands of the user in several camera observations. The problem of estimating both hands is characterized by the difficulty of observing the hand. Firstly, a hand may be observed in an infinite number of shapes and sizes.


This has several reasons. First of all, the camera observation is ambiguous. This means that one hand pose can be represented by several camera observations; on the other hand, multiple hand poses can result in the same camera observation. The other reason for the number of different observations comes from the high number of body-parts in the hand. All these body-parts can move in many directions, each occluding other body-parts, making up an infinite number of possible shapes and creating a huge dimensionality in the search space.

Another problem in observing the hand is the size of the hands in the camera observation. The hand is only a small percentage of the complete human body. Since the hand tracking system must be able to observe all the possible positions that the hands can be in, the hands only make up a small part of the camera observation. This creates a huge search space in which the hands can be anywhere in the observation.

Because the hand is hard to detect, and the literature about hand pose estimation only addresses the estimation of the hand pose instead of finding the hand in an image, the field of body pose estimation is investigated. The pose of the user can determine the coordinates of the user's hands and can thus be used to create a system that tracks both hands. The next section addresses body pose estimation, while the section that follows addresses the tracking problem.

2.2 Body pose estimation

In body pose estimation, the pose of the user is estimated from one or more camera observations directed at the user. The detail of the estimated body depends on the application in which the pose estimation is used. One can imagine that for motion analysis, the amount of detail must give the best possible appearance of the human body, while for surveillance cameras a detection of people and their global movement is sufficient. The detail is described by the number of body-parts, each of which is connected to a joint. Each joint can rotate in zero or more directions and each of these directions is called a Degree of Freedom (DOF). The number of DOF determines the detail of the body pose.

Like hand pose estimation, body pose estimation suffers from the dimensionality of the body. The humanoid representation of the H-Anim standard [29] has 94 joints in total. This representation includes a representation of the head, the hands, the vertebral column and all the limbs of the user. Each of these joints can rotate in one or more directions, creating at least 94 DOF. Gavrila and Davis [32] and Sidenbladh et al. [18] use 22 and 25 DOF respectively for their model estimation. However, pose estimation can go up to 54 DOF [31].

There are many example algorithms that successfully track complete body poses of users in video sequences. All of these body pose estimation systems need to find a relationship between the camera observation and the pose of the user. Therefore the analysis of the image is an important part of body pose estimation, and this is described first in the next section. The field of body pose estimation can be divided into two main methods: model-based and model-free pose estimation. In model-based approaches, a model of the user is used to estimate the pose of the user. The model-free approach uses the camera observation to directly infer the user's pose. Sections 2.2.2 and 2.2.3 give an overview of each of these methods.

2.2.1 Image features

When analyzing the images, a decision needs to be made about which image features describe the estimated object best. Image features, also called background subtractions, are useful abstractions of the camera observation. In background subtraction all irrelevant information is removed from the camera observation; only the subject of the body pose estimation algorithm remains. The problem in image feature analysis is the varying appearance of the user: the appearance differs in each camera observation because of lighting changes, different camera angles, the user's clothes, etc. For this reason, image features need to be extracted from the camera observation that are useful abstractions of the estimated object. In the next paragraphs an overview of several of the most used image features is presented.

The most common image features are the motion, edge, color and shape features. When a feature needs to be extracted from a video sequence, the easiest usable feature is the motion feature. The motion feature assumes that the object that needs to be estimated is the only continually moving object in the scene. When this assumption holds, the motion feature is an efficient and simple feature to compute. The motion feature determines the difference between two consecutive frames. The resulting image feature only represents pixels that did not correspond in the previous frame. This comes from the assumption that the intensity of a pixel on the object does not change; the object only moves. Therefore, the image feature only represents the moving objects and (if the assumption holds) the Object of Interest (OI). One problem, besides the movement assumption, is the fact that light changes and shadows produce noise in the motion feature. This noise needs to be filtered out to get a clear image feature of the OI.
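A minimal sketch of this frame-differencing motion feature, assuming OpenCV and NumPy are available and frames arrive as BGR images; the threshold and kernel size are illustrative, not values used in the WW:

```python
import cv2
import numpy as np

def motion_feature(prev_frame, curr_frame, threshold=25):
    """Return a binary mask of pixels that changed between two consecutive frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    # Absolute per-pixel intensity difference between consecutive frames.
    diff = cv2.absdiff(prev_gray, curr_gray)
    # Keep only pixels whose intensity changed more than the threshold;
    # smaller differences are treated as noise (lighting changes, sensor noise).
    _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
    # Morphological opening removes isolated noise pixels from the mask.
    kernel = np.ones((3, 3), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```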

The edge feature uses the edges between high and low intensity to remove the background. The edges determine the outline of an object, which is useful for the shape feature.


The major advantage of the edge feature is that it is invariant to light changes. On the other hand, one problem from which the edge feature suffers is highly textured, noisy backgrounds. When the background texture is very noisy, a lot of intensity edges are represented in the resulting edge feature, from which an object is hard to determine.

The color feature is used in a lot of applications in a variety of ways. One method used in many applications uses the color distribution of the OI. This color distribution is compared with the camera observation and all pixels that do not fit this color distribution are filtered out [3, 35]. The remaining pixels are determined to be the OI. The main problem from which the color distribution method suffers is the lighting condition, e.g. a normal lamp, a fluorescent tube or the sun. If the lighting condition changes, then the color distribution of the OI also changes. Yet another problem is that many objects are hard to generalize, e.g. skin color.

Another method, described by Horprasert et al. [8], uses color to remove noise, shadows and highlights in the motion feature. Their algorithm uses a Background Model (BM) that describes the irrelevant background information. Their method filters out moving objects by comparing a new frame with the BM. The algorithm defines a pixel in the new frame to be background if it has both brightness and chromaticity similar to the BM pixel. It defines a pixel as shadow if it has similar chromaticity but lower brightness than the BM pixel, as highlighted background if it has similar chromaticity but higher brightness, and as a moving object if its chromaticity differs from the background model pixel. In addition, the BM is updated every time step to handle lighting condition changes. With this model they successfully perform motion filtering, removing all shadows and highlights in indoor and outdoor scenes, even when the global lighting changes.
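A rough sketch of this per-pixel classification idea, under the simplifying assumption that the background model stores one expected RGB value per pixel; the brightness/chromaticity decomposition and the thresholds below are illustrative and not Horprasert et al.'s exact formulation:

```python
import numpy as np

def classify_pixel(pixel, bg_pixel, cd_thresh=15.0, low=0.6, high=1.2):
    """Classify an RGB pixel against its background-model (BM) pixel."""
    I = np.asarray(pixel, dtype=float)      # observed pixel
    E = np.asarray(bg_pixel, dtype=float)   # expected background pixel
    # Brightness distortion: scale factor that best aligns I with the BM color direction.
    alpha = I.dot(E) / max(E.dot(E), 1e-6)
    # Chromaticity distortion: remaining distance after brightness is factored out.
    cd = np.linalg.norm(I - alpha * E)
    if cd > cd_thresh:
        return "moving object"   # chromaticity differs from the BM pixel
    if alpha < low:
        return "shadow"          # similar chromaticity, lower brightness
    if alpha > high:
        return "highlight"       # similar chromaticity, higher brightness
    return "background"          # similar brightness and chromaticity
```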

In many projects a combination of image features is used to extract high-level information from the camera observation. For example, the shape feature is mostly used in combination with other image features to extract high-level information. In [13] lines are detected from an edge feature with a Hough transformation.

From these lines only the parallel lines are selected and shapes are created from these parallel lines. In contrast, Yao et al. [3] propose a system that uses a combination of image features to detect a raised arm in a classroom. Their system filters the camera observation with a motion filter and filters the resulting motion feature with an erosion and dilation filter to remove noise. They combine this motion feature with an edge feature to get good edges around the moving objects in the motion feature. Finally, a skin-color check is performed to determine which objects are arms. Haritaoglu et al. [9] propose a general-purpose surveillance system that determines human motion and body pose. They use shape analysis to define points of interest and use corresponding points in stereo camera observations to determine the depth of the points. From these points of interest the human pose is successfully determined in outdoor scenes with multiple people.
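As an illustration of the edge-plus-line combination mentioned above, the sketch below extracts Canny edges and detects straight lines with a standard Hough transform; the parameter values are illustrative only:

```python
import cv2
import numpy as np

def detect_lines(frame, canny_low=50, canny_high=150, votes=80):
    """Detect straight lines in a frame via an edge feature and the Hough transform."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Edge feature: intensity edges are largely invariant to lighting changes.
    edges = cv2.Canny(gray, canny_low, canny_high)
    # Hough transform: every edge pixel votes for the (rho, theta) lines passing through it.
    lines = cv2.HoughLines(edges, 1, np.pi / 180, votes)
    return [] if lines is None else [tuple(l[0]) for l in lines]  # list of (rho, theta) pairs
```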

2.2.2 Model-based body pose estimation

In model-based pose estimation a virtual model that represents the user is used to estimate the user's pose. This model is described by its appearance, state representation, kinematic constraints and evaluation function.

The appearance of a model can range from a simple 2D plane represented by specific image features to a complete 3D surface model. Each of these appearances has its own advantages and disadvantages, and the application for which it is designed determines the appearance. E.g. a 3D surface model of the user would represent the user in a perfect way, and detailed tracking is possible with this model. However, a 3D surface model has two major problems. Firstly, the model is computationally expensive to manipulate and secondly, the surface model is user (or clothes) dependent. This means that for each user (or the clothes that the user is wearing) a new surface model needs to be designed. In the case of motion analysis this would not be a problem, because speed is not an issue and a detailed model can be created for each test person. However, when the application needs real-time estimation of completely different users, this model would be impractical. In [32, 16] super quadrics, e.g. cylinders, spheres, ellipsoids, cones and hyper rectangles, are used, which are connected to each other to form a simple, generic but realistic appearance of the user.

In the literature there are two significant model-based body-pose estimation techniques, namely top-down and bottom-up. In the top-down approach the model is matched with the image features to estimate the pose, while in the bottom-up approach the individual body-parts are found and reassembled to form the estimated pose. Currently some systems use a combination of both techniques to overcome the disadvantages of each. Each of these techniques is described in the next sections.

2.2.2.1 Top down

The top-down approach matches a model of the user with the camera observation. The model parameters are changed until a pose is found that matches the camera observation best. Because this brute force method is not applicable in real-time, most top-down approaches use an initial pose to estimate the next pose. When the system needs to track the user over time, this naturally implies an initialization procedure that determines the user's initial pose.

Gavrila and Davis [32] use a top-down approach with a 3D appearance created by connected super quadrics.


They use a 3D appearance because 2D appearances assume that the user moves parallel to the image plane, while they want to be able to track more unconstrained human movement.

Also, the ability to project the same model onto multiple cameras makes it possible to track the body pose more stably, because poses that are ambiguous in one view might not be ambiguous in another view. They use search space decomposition to overcome the dimensionality of the search space: first the head and torso are found, and then the other body-parts are estimated. Sidenbladh et al. [18] use a top-down method with a set of models, because they assume that an analytical expression of the likelihood value cannot be determined. This implies that one model cannot give a clear answer as to whether it is the right representation; therefore they use a set of models and let the best model be the estimated pose of the user. With this method they successfully use a set of 10,000 models to estimate the user's pose at 5 minutes per frame.

Mikić et al. [7] use voxel data to estimate the user's body pose. In this approach multiple cameras are used to determine the movement surface in 3D space. For each of the cameras the movement surface is computed by a motion feature. Each motion pixel is triangulated to a 3D coordinate (voxel) to form the 3D surface of the user's movement. From this 3D surface the user's pose is inferred by a top-down modeling approach in 3D space: a model of the user is matched in this 3D space to estimate the pose that creates this movement surface. The big advantage of this method is that the model matching is done in 3D, which enables the use of a good likelihood function. A big problem with this approach is that the estimation of the 3D movement surface is computationally expensive. Mikić et al. resolve this issue by pre-calculating all the possible combinations of pixels offline and transforming the triangulation calculation into a database search.

Bregler et al. [1] demonstrate a visual motion estimation technique that is able to recover human body configurations in video sequences. They use a single prior pose of the human body configuration and integrate a mathematical technique, the product of exponential maps and twist motions, into differential motion estimation. This results in solving simple linear systems, and enables them to recover the DOF in noisy and complex configurations.

2.2.2.2 Bottom up approach

In the bottom-up approach the body model usually consists of 2D body-parts. Each limb is modeled by image features, and in the estimation phase each limb model is matched with the camera observation. The great advantage of the bottom-up approach is that it does not need an initialization procedure. When the model is designed well, the bottom-up approach estimates the body pose quickly and accurately. However, the design of the model turns out to be very difficult. The main problem is that the user's appearance changes because of lighting condition changes and rotation of the body-parts.

Ramanan and Forsyth [2, 12] propose a system that uses a bottom-up approach with some assumptions. First they assume that all human body-parts are cylinder shaped, and secondly that the body-parts in the video sequence do not change dramatically in appearance. The second assumption is in contrast with the problem that the body-parts do change due to rotations, but it works in short videos because they assume that the user does not change clothes over time. Their approach detects the user's body-parts with the first assumption, i.e. it finds all the cylinder shaped objects in the camera observation. These cylinder shaped objects are detected by convolving the image with parallel lines of contrast. This is only done for a small part of the video sequence, in which all the candidate body-parts are followed. The candidate body-parts are matched with a kinematic structure and movement model to form a complete body. The candidate body-parts that do not fit these models are discarded. The color distributions and textures of the human model's limbs are learned and used to track the body in the rest of the video sequence. With this method Ramanan and Forsyth successfully track the user's body pose in several video sequences.

Sigal et al. [19] propose a method that uses a loosely-connected model. They use a graph model in which each node represents a limb model, described by its appearance, likelihood function and current state. Each node is connected to the other nodes by a conditional probability. Together with a temporal evaluation probability of each limb, the pose estimation is solved with an adjusted particle filter.

2.2.2.3 Middle way

The middle way uses the advantages of the bottom-up and top-down approaches described in the previous sections to overcome their disadvantages. Navaratnam et al. [4] use a model that is a collection of separate body-parts linked according to a kinematic model. Each body-part is represented by a set of 2D templates created from a 3D model, which encodes the 3D joint angles. In the first step all body-parts are searched for in the camera observation with the 2D templates (bottom-up). Then the most reliably detected part is chosen to be the anchor for the rest of the estimation process. The remaining parts are searched for, based on the detected location of this anchor, in kinematic order (top-down). The big advantages of this approach are that it does not need an initialization procedure and that the 3D rotations of the users can be tracked.


2.2.3 Model-free body pose estimation

In model-free body-pose estimation the body pose is directly inferred from the camera observation. This means there is a need for a direct relation between the pose and the camera observation. This direct mapping is not always obvious or easy to design, because an analytical function that represents the likelihood of the user is impossible to design.

Agarwal and Triggs [31] use a learning-based approach to directly infer a pose from a monocular camera observation. They chose a learning-based method to avoid the use of explicit initialization and 3D modeling, which needs computationally expensive rendering. They use a set of motion capture data to learn a reconstruction function. The motion capture data images are reduced to 100-dimensional image feature vectors which represent the human movement. Given this set of vectors, a non-linear regression algorithm learns a smooth reconstruction function which is used to determine the pose given a set of input images. However, monocular camera observations are ambiguous. This means that multiple hypotheses are possible given the camera observation. This problem is solved by either a regression on the image and the previous pose, or the use of a mixture of regressions in a multiple hypothesis tracking scheme. With this method they successfully recover a 54 DOF model in a monocular test sequence with real users. The advantage of this method is that it does not need an initialization procedure and is not computationally expensive. However, only users that are similar to the user in the motion capture data can be used in this system.

2.3 Tracking

Most pose estimation systems are designed to track the user's pose over time and use a prior state to estimate the posterior state. In order to track the body pose of the user, several methods are possible. Some methods use a single hypothesis while other methods use multiple hypotheses to determine the state. In this section the most common tracking techniques are discussed in detail.

2.3.1 Kalman filter

The Kalman filter [24, 25] is a generic estimation technique that estimates the new state of a system given the current state and noisy sensor data. The Kalman filter is proven to work and is applied in many applications used today. The most common application is airplane tracking on radar. In this application the sensor data is the noisy radar data and the state of the system is the airplane's position and velocity. The Kalman filter estimates the posterior state of the system in two steps: the predictor and corrector steps. In the predictor step the next state of the system is estimated, and in the corrector step the predicted state is compared with the sensor data to refine the estimated state.

The prediction step depends on a linear transition function and transition noise described by a white Gaussian distribution. The corrector step depends on the noisy sensor data, sensor noise described by a white Gaussian distribution, and the Kalman gain. The Kalman gain is determined by the noise of the sensor data and the estimated error covariance from the prediction step. The Kalman gain's goal is to minimize the estimated error covariance after the corrector step and thus get the most probable state of the system.

The Kalman algorithm begins with an update of the state of the system, which is determined by the linear transition function and the previous state of the system. The prior error covariance is estimated according to the posterior error covariance of the previous state and the transition noise distribution. In the corrector step the Kalman gain is estimated according to the prior error covariance and the sensor noise. The estimated state is updated according to the sensor measurement and the Kalman gain. Finally, the posterior error covariance is calculated for the next filter cycle.
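A minimal sketch of this predict/correct cycle for a one-dimensional constant-velocity state (position and velocity); the matrices and noise values are illustrative and not taken from the WW system:

```python
import numpy as np

def kalman_step(x, P, z, dt=1.0, q=1e-3, r=0.5):
    """One Kalman filter cycle: predict the new state, then correct it with measurement z."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # linear transition function (constant velocity)
    H = np.array([[1.0, 0.0]])              # the sensor only measures position
    Q = q * np.eye(2)                       # transition (process) noise covariance
    R = np.array([[r]])                     # sensor noise covariance

    # Predictor step: propagate the state and the error covariance forward.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q

    # Corrector step: the Kalman gain weighs the prediction against the noisy measurement.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (np.array([z]) - H @ x_pred)
    P_new = (np.eye(2) - K @ H) @ P_pred    # posterior error covariance for the next cycle
    return x_new, P_new
```

One such step would be executed per camera frame, with x holding the current position and velocity estimate.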

The Kalman filter assumes that the transition function is linear in order to predict the posterior state of the system. If this assumption holds, there is no better solution than the Kalman filter. However, human body pose movement is far from linear because of joint acceleration. If the transition function is non-linear, the Extended Kalman Filter (EKF) can be used to estimate the state of the system. The EKF estimates a linear representation of the non-linear function every time step by calculating the Jacobian matrix of the non-linear function and using the resulting matrix in the Kalman filter equations. An alternative to the EKF is particle filtering, which has the advantage that it can represent the Probability Density Function (PDF) when enough particles are used. The main advantage of the particle filter is that it maintains multiple hypotheses; when only a single hypothesis is maintained, as in the Kalman filter and the EKF, the estimation can get stuck or drift in the wrong direction.

2.3.2 Particle filter

A particle filter is a sequential Monte Carlo method [23, 24], used in many applications, that estimates the values of hidden variables according to noisy sensor data. In order to explain particle filtering, some notation is needed. A particle filter has a population of N samples (particles), each of which resembles a state of the system that needs to be estimated. Let $X_k = \{x_k^i,\ i = 0 \ldots N\}$ denote the set of N particles at time k and let $Z_k$ denote the noisy sensor data. The goal of the particle filter is to estimate the posterior PDF denoted in Equation 1.

$P(X_k \mid Z_k)$    (Equation 1)

2.3.3 SIS particle filter

The standard particle filter is called the Sequential Importance Sampling (SIS) particle filter. The SIS particle filter works in two steps: a prediction step and an evaluation step. The first step estimates the next state of each particle, while the second step evaluates each particle according to the sensor data.

In the first step all the particles are propagated forward according to the current state of the particle and a transition function. The transition function can be a non-linear function which determines the behavior of the system, see Equation 2.

$x_k^i = f(x_{k-1}^i)$    (Equation 2)

In the second step a weight is assigned to each particle, denoted by $W_k = \{w_k^i,\ i = 0 \ldots N\}$, which is determined by the likelihood function, see Equation 3. The weight population is normalized according to Equation 4.

$w_k^i = P(Z_k \mid x_k^i)$    (Equation 3)

$\sum_{i=1}^{N_s} w_k^i = 1$    (Equation 4)

If the number of particles approaches infinity, the complete population, each particle with its weight, resembles the PDF and Equation 5 determines the state of the system at time k.

$P(X_k \mid Z_k) \approx \sum_{i=0}^{N} w_k^i\, x_k^i$    (Equation 5)

However, when the number of particles does not approach infinity, the population is a poor resemblance of the PDF and the state estimation is difficult. Often there exist many local maxima in the likelihood surface, and when for example two local maxima exist, each tracked by a separate group of particles, the state calculated by Equation 5 is placed in between the local maxima. A solution to this problem divides the particles into different local maxima groups. The local maximum that best resembles the state of the system is chosen to be the state of the system.

Another problem from which the particle filter suffers is the degeneracy problem. This problem occurs when, after a few iterations, all but one particle have a negligible weight. This implies that a lot of particles use a lot of computing time while their contribution to the PDF is almost zero.

2.3.4 SISR particle filter

The most common method of eliminating degeneracy is called resampling. In resampling a new set of particles, $X'_k$, is constructed from the old particle set. The idea behind resampling is to eliminate the particles with low weights and concentrate on the particles with high weights, while the new particle weight set approximates the old particle weight set as well as possible. The particle filter population is resampled when the variance of the weights is high and therefore degeneracy is taking place. For implementation reasons, the Sequential Importance Sampling with Resampling (SISR) particle filter is often used [24]. In the SISR particle filter the resampling step is added to the particle filter.

While resampling is frequently used, it suffers from the sample impoverishment problem. Sample impoverishment occurs when all particles follow the same trajectory and there is no diversity in the particle population. There are several resampling schemes, e.g. multinomial, stratified, residual and systematic resampling [30]. Residual, stratified and systematic resampling give comparable results, but the most common resampling algorithm is systematic resampling because of its simplicity of implementation. In systematic resampling the new particle population $X'_k$ is determined by drawing N samples from a uniform distribution according to Equation 6, in which U is drawn from a uniform distribution on the interval [0, 1/N].

$U_i = (i - 1)/N + U$    (Equation 6)

The new particle set is determined according to the Cumulative Density Function (CDF), whose value for each particle is calculated according to Equation 7.

$c_j = c_{j-1} + w_k^j$    (Equation 7)

The particle $x'^i_k$ in the new particle set is chosen to be the particle $x_k^j$ from the old particle set for which $c_{j-1} \le U_i \le c_j$. The resulting particle set focuses on the particles that have a higher weight, by copying the particles with a high weight multiple times into the new particle set and removing the particles that have a low weight. In order to maintain diversity and reduce sample impoverishment in the population, some particles with a low weight are also copied into the new particle set.
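A compact sketch of the SISR cycle and of systematic resampling as described by Equations 2-7; the transition and likelihood functions are assumed to be supplied by the caller, and all names are illustrative:

```python
import numpy as np

def systematic_resample(particles, weights):
    """Draw a new particle set using systematic resampling (Equations 6 and 7)."""
    n = len(weights)
    # U_i = (i - 1)/N + U, with U drawn once from a uniform distribution on [0, 1/N).
    u = (np.arange(n) + np.random.uniform()) / n
    c = np.cumsum(weights)                    # CDF over the particle weights
    return np.asarray(particles)[np.searchsorted(c, u)]

def sisr_step(particles, observation, transition, likelihood):
    """One SISR cycle: propagate, weight, estimate, resample."""
    # Prediction: propagate each particle with the (possibly non-linear) transition function (Eq. 2).
    particles = np.array([transition(p) for p in particles])
    # Evaluation: weight each particle by the observation likelihood (Eq. 3) and normalize (Eq. 4).
    weights = np.array([likelihood(observation, p) for p in particles])
    weights /= weights.sum()
    # State estimate: weighted mean of the particle population (Eq. 5).
    estimate = np.average(particles, axis=0, weights=weights)
    # Resampling concentrates the population on high-weight particles to counter degeneracy.
    return systematic_resample(particles, weights), estimate
```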

2.3.5 Transition function

The transition function describes the behavior of the system and determines the most likely new state of each particle given the prior state of the particle. In the case of body pose estimation, the transition function is an estimate of the next probable pose given the current pose. However, the human body can perform a massive array of movement. Therefore a transition function that gives the next likely pose for every possible pose becomes infeasible. However, in most cases the human movement is restricted to some class of movement, e.g. walking, grabbing, etc. In this case, a probabilistic model can be created from the training data of this class of movement. This probabilistic model can then determine the next probable pose given a window of previous poses of the system.

Sidenbladh et al. [6] propose a probabilistic motion model for pose synthesis and tracking. They recast the problem from learning a probabilistic model into a search problem. They propose a system that uses a large set of human motion data. With a time window of d frames, this motion set is divided into a number of different movement samples, each of d frames. The dimensionality of the sample space is reduced with principal component analysis (PCA) and each sample is projected onto this low-dimensional subspace. The resulting sample vectors are put in a binary tree for low-cost searching. The input of the system, also of length d, is then projected onto the same low-dimensional space. The resulting vector is compared with the database, which returns the most likely sample given the input sample. From this sample, the next pose in the original set of human motion is chosen to be the next probable pose of the system. The assumption made is that the higher-order statistics are implicitly represented in the database, which removes the need for learning a motion model.
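A rough sketch of this search-based motion model, using PCA for the dimensionality reduction and a KD-tree in place of the binary tree used by Sidenbladh et al.; the class and parameter names are illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

class MotionDatabase:
    """Stores flattened windows of d consecutive poses and returns the most similar one."""

    def __init__(self, windows, n_components=10):
        # windows: array of shape (num_samples, d * pose_dim), one row per movement sample.
        self.mean = windows.mean(axis=0)
        centered = windows - self.mean
        # PCA via SVD: keep the first n_components principal directions as the low-dimensional subspace.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        self.basis = vt[:n_components]
        # Index the projected samples for low-cost nearest-neighbour search.
        self.tree = cKDTree(centered @ self.basis.T)

    def most_likely_sample(self, query_window):
        """Project a query window of the last d poses and return the index of the closest sample."""
        q = (query_window - self.mean) @ self.basis.T
        _, index = self.tree.query(q)
        return index   # the next probable pose is then read from the original motion set
```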


Part III Design

In this part the current system and some previous hand tracking attempts are evaluated. Together with the information from the literature study a new hand tracking approach is introduced. Chapter 3 gives a global overview of the approach and Chapter 4 explains the hand tracking system in detail.

Chapter 3 Main Approach

3.1 Current state

In order to create a system that tracks the hands of the user, the current WW system needs to be investigated. The architecture and the computer vision algorithms that have been tried previously need to be explored. Even the current refinement of the head tracking component is relevant to the design of the hand tracking algorithm.

3.1.1 Current Architecture

The current WW architecture uses 3 different modules that work in a chain, across several computers, to get an end result. These are the computer vision, 3D server and application modules. Figure 2 gives an overview of the modules. Each camera is connected to a computer vision module which performs the computer vision tasks. The results from the computer vision modules are transmitted to the 3D server module, which estimates the 3D coordinates of these body-parts. These results are then transmitted to the application module, which draws a VR application on the screen according to the information about the user. The WW architecture is set up in such a way that new cameras can always be added to improve the tracking.

Because each camera has its own computer vision module which tracks the user's body-parts, more cameras result in more 2D coordinates of the user's body-parts, which in turn results in a more detailed 3D coordinate estimation. In the current architecture the computer vision modules only execute computer vision calculations. The 3D server only handles the 3D data and the application module only draws the output on the screen.

Figure 2 - System architecture

3.1.1.1 Computer vision module

The computer vision module's task is to track the coordinates of the user's body-parts. Currently the only body-part tracked by the computer vision module is the head. Figure 3 shows the images from all 3 cameras currently used by the system. The cameras are installed in the booth in such a way that the complete booth is visible for each camera. Because of this installation, the camera observations are rotated.

Figure 3 - Images from the three different cameras

The computer vision module’s tracking algorithm is made as general as possible so the computer vision module becomes camera independent. When the computer vision module is camera independent more cameras can be used which results in more information. Because the top camera has a completely different camera observation, in relation to the side cameras, a camera independent tracking algorithm is impossible.

Therefore, in the current head tracking scheme, only the side cameras are used. The hand tracking system, however, needs to be as general as possible in order to use all possible cameras and thus use all available information to determine the coordinates of the hands.

3.1.1.2 3D server

The task of the 3D server is to estimate the 3D coordinates of the user's body-parts given the 2D coordinates obtained from the computer vision module(s).


This is only possible when there are multiple cameras for which the intrinsic and extrinsic camera parameters are known. Figure 4 shows how the 3D coordinate of the point X is obtained from two observations. To determine the 3D point, a ray is traced for each camera, from the centre point of the camera through the point on the image plane, U1 and U2, into world space. The point where both rays intersect denotes the point in 3D space. See Appendix B for a more detailed description of the 3D point estimation.
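Because two back-projected rays rarely intersect exactly, a common approximation is to take the midpoint of their closest approach as the 3D point. A small sketch of that idea, with each camera contributing its centre and a unit ray direction through the observed image point (the function name is illustrative):

```python
import numpy as np

def triangulate(c1, d1, c2, d2):
    """Estimate the 3D point closest to two rays, each given by a camera centre c and direction d."""
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    w = c1 - c2
    a, b, c = d1.dot(d1), d1.dot(d2), d2.dot(d2)
    d, e = d1.dot(w), d2.dot(w)
    denom = a * c - b * b                 # close to zero when the rays are (nearly) parallel
    s = (b * e - c * d) / denom           # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom           # parameter of the closest point on ray 2
    p1 = c1 + s * d1
    p2 = c2 + t * d2
    return (p1 + p2) / 2.0                # midpoint: the estimated 3D coordinate
```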

Figure 4 - 3D coordinate estimation

3.1.1.3 Application

The application module is the VR application that uses the WW environment. This application can use the parallax and 3D visualization. Examples of applications are interactive games such as a maze, a Rubik's cube, a virtual clay ball, etc.

3.1.1.4 Networking

The protocol that connects all the modules to each other is the XML Remote Procedure Call (RPC) protocol.

This protocol is good at sending complex structured data and works with remote method invocation on the server. The data is transmitted as XML in order to make it readable for the user and easy to use. This communication protocol works well, but it is not suitable for fast communication of massive amounts of data.
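As an illustration of such an XML-RPC setup, the sketch below shows a 3D-server endpoint and a client call using Python's standard xmlrpc library; the method name, port and arguments are invented for illustration and are not the actual WW interface:

```python
# Server side: the 3D server exposes a remote method that a computer vision
# module can call to submit the 2D coordinates of a tracked body-part.
from xmlrpc.server import SimpleXMLRPCServer

def submit_2d_position(camera_id, part, x, y):
    # In the real system the per-camera observations would be collected here
    # and triangulated into a 3D coordinate for the application module.
    print(f"camera {camera_id}: {part} at ({x}, {y})")
    return True

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(submit_2d_position)
# server.serve_forever()  # blocks; run in its own process

# Client side: a computer vision module invoking the remote method.
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://localhost:8000")
# proxy.submit_2d_position(0, "head", 312, 187)
```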

A problem in the current communication setup is that the 3D server acts as an RPC server. This means that the other modules can only call methods on the 3D server; the 3D server has no way to send requests to the other modules.

3.1.1.5 Conclusion

From the current architecture it can be concluded that the hand tracking system needs to present the coordinates of the hands to the 3D server, which communicates them to the application module. The hand tracking system can be part of the computer vision module and send the 2D coordinates to the 3D server.

However, a general hand tracking algorithm that can be used in any camera observation is hard to design, and a new algorithm would have to be designed for each camera observation.

3.1.2 Current Computer Vision Algorithms

3.1.2.1 Head tracking

Currently only the head of the user is tracked by the computer vision module. Over time, the head tracking algorithm has evolved from skin-color filtering to a particle filter that uses Eigenfaces as a likelihood function.

The head tracking algorithm used in the beginning of the project used a particle filter method in which the state is the location and size of the user's head. Each computer vision module tracks the user's head with this particle filter. The appearance of a head is chosen to be an ellipse [35], and the likelihood function of the particle filter then becomes an ellipse shape matching procedure. Each computer vision module communicates the tracked coordinates of the head to the 3D server, which transforms them into 3D coordinates. The shape matching procedure of this particle filter was later revised into a likelihood function that uses Eigenfaces.

The Eigenface likelihood function [35] uses an average face created by averaging over a database of faces.

This average face is subtracted from the other faces in the database and PCA is used to define a set of eigenvectors that best describe the database set. In tracking mode the average face is subtracted from the camera observation and the resulting image is compared with the eigenvectors. The distance in this eigenvector space gives the likelihood of observing a face. This new likelihood function enables the computer vision
