3D pose tracking using optical flow

Vincent van Megen (0513482)

Internal Supervisors: Tom Heskes and Iris van Rooij

External Supervisors (Vicar Vision): Marten den Uyl and Ed Lebert

September 14, 2010


Abstract

An action recognition system has many applications, ranging from security to sign language interpretation. If a subject’s body pose can be tracked in streams of video, a wealth of information about possible actions can be extracted. In this thesis a three dimensional pose tracking system is presented, designed so that its output can be used as input for an action recognition system. By presenting a classifier with optical flow and low resolution greyscale images, the system is able to detect and track a subject’s pose over an extended period of time. Optical flow is the distribution of motion over an image, and was hypothesized to be an important factor in the system’s ability to perceive depth. Investigations into the effect of optical flow reveal it to be important for the system’s performance, yet less important than other inputs when it comes to depth perception.


Acknowledgements

I would like to thank Tom Heskes and Iris van Rooij for their guidance during the writing of this thesis. My thanks also go to Marten den Uyl for allowing me to work at Vicar Vision, and Ed Lebert for his guidance during my work there. I am also grateful for all the help Mark de Greef has given me with debugging my code, and Paul Ivan for getting me started on my annotation algorithm. And finally I would like to thank all other colleagues at SMR group, for the fun we had during the lunch breaks.


Contents

1 Introduction
  1.1 Pose Recognition
  1.2 Research goals
  1.3 Depth perception
    1.3.1 Depth Cues
    1.3.2 Optical flow
  1.4 Similar work
    1.4.1 Real time tracking
    1.4.2 Large dataset creation
    1.4.3 Dining activity analysis
    1.4.4 From 2D to 3D with action recognition

2 System description
  2.1 Face Detection
  2.2 Determining Grid Point Locations
  2.3 Optical Flow
  2.4 GreyScale
  2.5 Classification
    2.5.1 Input
    2.5.2 Output
  2.6 SequenceMatcher
    2.6.1 Local hand tracking improvement

3 Evaluation
  3.1 Assumptions
    3.1.1 Assumptions related to movements
    3.1.2 Assumptions related to appearance
  3.2 Database creation
  3.3 Network Results
    3.3.1 Pixel Distance
    3.3.2 Performance Results
    3.3.3 Effect of the SequenceMatcher
    3.3.4 Local Hand Tracking performance
  3.4 Importance of Optical Flow

4 Conclusion
  4.1 Creating an upper body pose tracking system
  4.2 Using optical flow for depth perception
  4.3 Further Research
    4.3.1 Sample frames at smaller intervals
    4.3.2 Silhouette model
    4.3.3 Using 3D pose to create an action classifier
    4.3.4 Scaling before applying the SequenceMatcher
    4.3.5 Optical flow density and sensitivity

A Optical Flow Parameters
  A.1 Parameter description
  A.2 Parameter values
    A.2.1 Parameters for vector creation
    A.2.2 Parameters used to check motion vectors

B Network Parameters
  B.1 Network
  B.2 Data
  B.3 Scaling
  B.4 Optimize


Chapter 1

Introduction

1.1

Pose Recognition

We derive a great deal of information from body language. We can often estimate a familiar person’s mental state from his or her pose, without ever looking at the face. Tracking the hands and arms of someone we are looking at can also give us clues about that person’s goals or actions. Pose recognition and gesture detection can be invaluable tools for scientists from a great variety of research fields, ranging from psychology to sports science. The disadvantage of using human motion and pose to study behavior is that these movements are often fast and not repeated often. A subject’s reaction to stimuli can be so quick that a researcher would have to possess a keen memory to analyze all the complex joint configurations that make up a person’s pose. Fortunately these reactions can be recorded on video.

By using video recordings, the observer gains the ability to rewind and watch an event multiple times, if needed even in slow motion. However, having recordings of human and animal behavior analyzed by human observers can be time consuming. A human observer would have to watch a piece of video, document what he/she observes, and possibly watch the same section of video again and again, to fully document the relevant information in that piece of video. With the advances of computer and camera hardware, automated analysis of behavior became possible, and due to its many possible applications, automatic analysis of human actions in video footage by action recognition systems is a very active research area [15].

As can be seen in figure 1.1, there are four phases common to most action recognition systems. In the first phase, initialization, subjects are often required to take a certain pose before the camera, so the system can get a lock on their joints or limbs. These points, which I will refer to as anchor points, will then be tracked in the second phase. Tracking can occur in 2D (only x,y coordinates) or in 3D (x, y, and z coordinates). Using the result from the tracking phase, a pose estimation can be created. By tracking such a pose over time, the system can try to recognize the action a subject is performing. In this thesis I will focus on the second and third phase of action recognition. The goal is to create a system that can, given a series of frames, locate and track a person’s upper body pose.

1.2

Research goals

This thesis is the result of research done at Vicar Vision, a machine vision company in Amsterdam which specializes in the automatic analysis of human emotions and behavior. This company, along with several others, participates in the Inside Consumer Experience (ICE) project [1]. The ICE project aims to develop new techniques for the study of food selection and consumption in real-life contexts. One example is the Restaurant of the Future in Wageningen. This restaurant is filled with cameras that psychologists use to study eating behavior [2]. Unfortunately, as mentioned in the previous section, the process of studying these images can be time consuming. To speed up this process, Vicar Vision wanted to create a system that can automatically recognize actions related to eating. This thesis has two research goals: first, to create a system that can track a person’s upper body pose, whose output can be used as input to an action recognition system designed to recognize eating actions; and second, to investigate a novel technique for depth perception in tracking the anchor points used in such a pose tracking system. My hypothesis is that this technique will provide the system with the necessary depth perception, while the other inputs will provide the system with the information it needs to calculate the x and y coordinates. This technique is called optical flow, and is discussed in section 1.3.2. If my hypothesis is correct, removing optical flow from the finished system will severely decrease the system’s performance on the z dimension, while maintaining its level of performance on both the x and the y dimension.

1.3

Depth perception

1.3.1

Depth Cues

An important aspect of pose recognition is the number of dimensions. Although limited pose recognition in two dimensions is possible (see section 1.4.1), three dimensions give a far more accurate picture and resolve ambiguities. But depth perception is no easy task, and humans use many depth cues at the same time to create our relatively stable three dimensional view of the world [16]. A commonly used depth cue in 3D vision is convergence. The difference in angle between two sensors (or cameras) can provide a clue about the depth of an object. By calculating the difference in angle between two cameras, the distance of an object to the cameras can be estimated. Unfortunately, these calculations can be very hard and time consuming [17]. Another depth cue humans use is movement, or optical flow. When an object moves and it is far away from the viewer, the object’s speed will seem relatively slow. If an object is close, the movement will appear to be very fast. In this way optical flow provides information about the three dimensional location of objects. In this thesis I will investigate the effectiveness of optical flow for depth perception in a pose tracking system.

1.3.2

Optical flow

Optical flow can be described as a distribution of velocities over an image. There are different methods for computing this distribution [13], but the basic idea is as follows. When given two images, one can match regions of the first image i to regions of the second image i+1. The method of comparison can differ between approaches, but they generally return some form of distance measure δ between those two regions. The region in image i+1 that has the smallest δ to the region of image i is chosen as the new location of that region of pixels. By comparing the coordinates of the region in i and i+1, a motion vector can be computed. These regions are chosen at a grid of locations in the image, called grid points. Computing the movement of regions at these grid points results in a distribution of velocities. This distribution contains the displacement in x and y direction for each grid point. The more grid points in an image, the denser the resulting velocity distribution.

The approaches for computing optical flow can be divided into three main streams of research, discussed below [13]. For a performance comparison of recent methods, see [3], and for a more complete overview of techniques see [4] or [18].

Gradient Based

By calculating the gradient of the image intensity at specific points in an image, one can compute the most likely direction of the motion. This solution is very efficient and easy to implement. Unfortunately, this method works under the assumption that pixel intensities are translated between frames, meaning that the light intensity stays the same at all times. Although this is an unrealistic assumption, Gradient Based methods have proven to work well in practice [3], and there has been research into eliminating the assumptions that go with this approach while maintaining the efficiency [12].

Feature Based

The Feature Based approach uses features to calculate optical flow. These features are distinctive points in the image, such as edges and corners. This approach consists of two stages. In the first stage, two consecutive frames are compared to locate distinctive features. Once these features have been found, their respective locations in two frames can be used to construct the velocity distribution. Correspondence can be a big problem in feature based optical flow, since it can be quite hard to determine which features have to be matched. For example in an area with a repeating pattern, many features can be found which are very similar. A common solution to this problem is setting a threshold for the maximum displacement of a feature between two frames [13]. The disadvantage of this threshold is that fast movements can no longer be detected. Other approaches have tried to eliminate the correspondence problem while keeping the ability to detect fast movements [5].

Region Based

The idea behind the Region Based method is simple: match regions of pixels between two successive images. Determining which region in the second frame resembles the first region the closest is often done with the sum of squared differences (SSD). In most approaches, only a small and limited set of displacements is tried, and a penalty is added to the SSD for large displacements. Because the regions are compared directly, instead of first being transformed into features or gradients, this approach works quite fast [13].
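To make the region matching concrete, the sketch below (in Python, purely illustrative and not the CBA implementation used in this thesis) matches a small region around a grid point of frame i against displaced regions in frame i+1 using the SSD and returns the displacement with the lowest cost; the region and search sizes are assumptions.

    import numpy as np

    def region_flow(prev, curr, gx, gy, half=5, max_disp=7):
        # Template around the grid point (gx, gy) in the previous frame.
        template = prev[gy - half:gy + half + 1, gx - half:gx + half + 1].astype(np.float32)
        best_ssd, best_vec = np.inf, (0, 0)
        # Try a small, limited set of displacements, as in region based matching.
        for dy in range(-max_disp, max_disp + 1):
            for dx in range(-max_disp, max_disp + 1):
                y, x = gy + dy, gx + dx
                candidate = curr[y - half:y + half + 1, x - half:x + half + 1].astype(np.float32)
                if candidate.shape != template.shape:  # candidate falls off the image
                    continue
                ssd = np.sum((template - candidate) ** 2)  # sum of squared differences
                if ssd < best_ssd:
                    best_ssd, best_vec = ssd, (dx, dy)
        return best_vec  # motion vector (vx, vy) at this grid point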

Selected approach

In this thesis I will use the last approach, Region Based Matching. The reason for this choice is that Vicar Vision developed an algorithm called Correlation Based Accumulators (CBA) [13]. The advantage of this algorithm, besides being readily available, is speed. The algorithm moves templates across the image, matching the template to each position. Once the best match has been determined, the amount of motion can be computed. When the number of grid points is set to a reasonable value, the CBA algorithm works in real time (30 frames per second). Note that in [3] more accurate methods than CBA are presented. But because the system presented in this thesis should serve as a basis for a second system, and should be as computationally cheap as possible, the CBA method, which sacrifices accuracy for efficiency, is a good choice.


1.4

Similar work

Although the use of optical flow for pose recognition is new, there have been many other approaches to pose recognition. A comprehensive review of all relevant literature is beyond the scope of this thesis; for a full overview see [15, 19]. Haritaoglu et al. were one of the first to perform real-time tracking of human limbs [9]. The paper by Micilotta et al. presents an innovative approach to creating large datasets, and a small dataset is one of the bottlenecks of my approach [14]. I will discuss Gao et al. because their goals were similar to mine (limb tracking for the study of eating behavior), and their work shows possible applications of my system [8]. And Gall et al. show that instead of using pose tracking to recognize actions, one can also use action recognition to improve pose estimation [7].

1.4.1

Real time tracking

One of the first systems to perform real-time tracking of human poses, the system by Haritaoglu et al. is based on edge detection, object detection and tracking, and a temporal model [9]. Written in C++, this system works on 320 x 240 greyscale images instead of color images, combined with an infrared camera. Capable of reaching a speed of 20 frames per second on a dual 200 MHz Pentium processor, the system is quite fast. It is important to note, however, that it only tracks in two dimensions.

The first step they perform is foreground-background segmentation. In case this fails to completely identify the humans in the frame as foreground objects, Haritaoglu et al. also use an object detection algorithm. This object detection algorithm detects previously found objects (such as hands) in new frames. This ensures that, in the case of occlusion, the visible body parts can still be found. These parts are then merged into a single hypothesized person, while the different body parts are found using template matching.

1.4.2

Large dataset creation

Micilotta et al. present a possible solution to the dataset problem (to be discussed in section 3.2). Micilotta et al. use Haar wavelets and AdaBoost to create robust part detectors (hands, elbows, face) [14]. Like Viola and Jones, they use this approach for their robust face detection algorithm, discussed in section 2.1. These separate parts, along with a silhouette found by edge detection, are matched to a database of 5000 images. These images are created with a 3D model of a human body. This 3D model is programmed to take on various poses, which results in images that can be matched to incoming frames. The drawback of this method is that no color or texture information can be used to match with the database, since color and texture are not represented in the database. The authors claim a performance of 16 frames per second, while in reality they process 8 frames per second, thus skipping every other frame.

1.4.3

Dining activity analysis

Gao et al. focus on the recognition step of the four-step model in figure 1.1, more specifically the recognition of hand-to-mouth movements. Gao et al. present a computer vision system that can count the number of hand-to-mouth movements in a video clip [8]. They use this for dining activity analysis in a nursing home. By using motion segmentation they can compute motion vectors for the hands of their subjects. By also detecting faces, they can detect when the arms are moving towards the face. This motion information is passed on to a hidden Markov model, which classifies the observed subject’s actions as moving towards the head (start of an eating event), moving away from the head (end of an eating event), or the "don't care" state, in which the subject is moving his or her hands in a manner unrelated to eating. The paper by Gao et al. is an example of an action recognition system that could use the output of the system developed in this thesis to achieve better results.


1.4.4

From 2D to 3D with action recognition

Three dimensional pose perception using a monocular camera is fundamentally ambiguous. Gall et al. present an interesting solution to this problem: they use 2D information to recognize actions, which in turn facilitates 3D pose recognition [7]. The 2D information comes from multiple cameras tracking a person through time, and from the subject’s silhouette movement in this timeframe they derive an action prediction. Although their action classification from 2D silhouette information is not perfect, it is still capable of limiting the possible 3D poses. This not only eliminates some of the ambiguity, but also speeds up the 3D recognition system. An approach like this could also serve our needs, since the neural network used in our approach can also predict two dimensional poses, as described in section 2.5. This prediction could then be used as input for a network to predict the final three dimensional pose.


Chapter 2

System description

In this section the various parts of the anchor point tracking system are explained. In figure 2.1, a schematic overview of the entire system can be seen. Each of the steps represented in the figure will be explained in the following sections.

2.1

Face Detection

My system starts the pose recognition process by trying to find a subject’s face. The reason for this approach is that finding the head first will put constraints on the locations of the other points, because people that are eating usually have their hands, elbows, and shoulders below their face. Unfortunately, finding faces in images is not a trivial problem. Besides normal object detection problems such as occlusion and varying lighting conditions, face finding presents us with some unique challenges. Most importantly: every face is different. Although there is a certain similar structure for every face (eyes between the ears, nose above the mouth, etc), the exact placement and properties of each of these features are unique for everyone. And to make matters worse, these features can vary within-subjects due to facial expressions, differences in facial hair, and wearing glasses. For an overview of face detection methods see [11].

The face-finding method used in this project was developed by Vicar Vision and is based on the Viola-Jones algorithm [21]. This method was the first face-finding method to work in real time [22], and it is still up to 8 times faster than the second fastest algorithm, that of Heisele et al. [10]. It is important to note, though, that the Viola-Jones algorithm takes weeks to train, while the SVM approach of Heisele et al. takes only a day [10].

Once a face has been found, a search window is selected. Because the goal of the system is to track the upper body pose of eating subjects, all relevant motion occurs below the faces. So to speed up the calculation of optical flow, a smaller portion of the image is selected in which to perform optical flow. In this thesis a window of 600 by 400 pixels was used, on images with a resolution of 720 by 576. This search window is big enough to contain the subject’s entire body, but as small as possible to speed up processing.
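As a rough sketch of this step (the exact placement of the window relative to the detected face is not specified here, so the horizontal centring below is an assumption), the search window can be computed and clamped to the image as follows.

    def search_window(face_x, face_y, img_w=720, img_h=576, win_w=600, win_h=400):
        # Centre the 600 x 400 window horizontally on the face, start it at the
        # face vertically, and clamp it so it always fits inside the frame.
        left = min(max(face_x - win_w // 2, 0), img_w - win_w)
        top = min(max(face_y, 0), img_h - win_h)
        return left, top, win_w, win_h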

2.2

Determining Grid Point Locations

As explained in section 1.3.2, optical flow is computed at selected points in the image, called grid points. The smaller the distance between the grid points, the denser the distribution of optical flow over the image. The distance between grid points used in this thesis is given in Appendix B.


[Figure 2.1: Schematic overview of the system. The FaceFinder yields a search window, in which grid point locations are determined; the x,y-displacements from optical flow and the greyscale values at these grid points are fed to the classifier, which outputs anchor point coordinates; the SequenceMatcher, aided by skin counting, selects the best matching sequences and returns the best pose.]


2.3

Optical Flow

Once the search window and the grid points have been established, optical flow can be computed. The sensitivity of the optical flow algorithm is regulated by several parameters, which can be found in appendix A. The resulting velocity distribution is used as input for the classifier, where the displacements are normalized between 0 and 1.

2.4

GreyScale

To be able to handle those situations in which there is no movement, the classification step also receives a greyscale version of the image. This image is computed from RGB at the grid point coordinates, using the following equation.

Greyscale = 0.3 · red + 0.59 · green + 0.11 · blue     (2.1)

These greyscale values are then divided by 255 before being given as input to the classifier, so that they are also normalized between 0 and 1.
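A minimal sketch of this conversion, assuming the frame is stored as an RGB array indexed as frame[y, x]:

    def greyscale_inputs(frame_rgb, grid_points):
        values = []
        for (x, y) in grid_points:
            r, g, b = frame_rgb[y, x]
            grey = 0.3 * r + 0.59 * g + 0.11 * b  # equation 2.1
            values.append(grey / 255.0)           # normalize to [0, 1]
        return values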

2.5

Classification

In this step the information from the optical flow distribution and the greyscale image is used to determine the anchor point coordinates. Although there are many different classification algorithms, Vicar Vision has a lot of experience with neural networks, and network training software was readily available. Therefore in this thesis a multi-layered perceptron is used to perform the classification step [6]. Able to learn (when given enough examples) almost any input-output relation, neural networks are a popular method of machine learning. By tweaking the weights between layers of neurons, the network tries to find a mapping between the information in its inputs and its desired output. To do this, it has to be provided with enough training samples to generalize the important pieces of information in the input. This process can be sped up by preprocessing: filtering the relevant pieces of information from the input before it is presented to the network. For an overview of the parameters of the network, see appendix B.

2.5.1

Input

In our case, we present our network with the information from the optical flow described above. The idea behind this preprocessing step is that the neural network cannot be presented with the images themselves: there is too much information in these frames for the network to learn from a reasonably sized database. But subjects move whenever they eat. So not only is the movement a cue for the eating classification that could be the next step for our system, it also provides the network with a silhouette of the subject (as long as there is no movement in the background). So for each grid point the network receives the displacement in the x and y direction, along with a greyscale value. The search window of 600 by 400 pixels contains 1176 grid points, which means that the network receives a total of 3528 (1176 times 3) inputs.

2.5.2

Output

The desired output of the network is an x,y,z coordinate for each of the anchor points: hands, elbows, shoulders and the neck. The output is encoded in binary output nodes, where each output node represents a section of the image in a certain dimension, relative to the face. So a node N would for example represent x = 100 to x = 300. If the x coordinate of the target anchor point lies in that region, N should become active. In my case, each of the 8 anchor point coordinates is represented by 75 output nodes; 25 for each dimension. In figure 2.2 a frame divided into 25 sections can be seen. Each of these sections corresponds to an x-output node. All three dimensions are relative to the face. So x = 0 would represent the section directly below the face, while x = -100 would represent the section left of the face. This normalization is required because the network would otherwise learn coordinates on the image, instead of coordinates relative to the face.

The reason this output representation was chosen was twofold: first, the scoring mechanism of the training software could only handle binary outputs. Second, this representation was more robust. If an output between 0 and 1 were transformed into a coordinate, tiny differences in activation could make large differences on screen. However, in my encoding, small differences did not matter, as long as the right output node had the highest activation.

The reason why I only chose 25 output nodes for each dimension, instead of an output node for each pixel on the bitmap, is that the output layer would become too large, and the resulting input-output relation too hard for the network to learn. The downside of just using 25 output nodes, is that effectively the output works on a 25 x 25 frame. So minor changes in position for an anchor point could lead to large differences in the output. Therefore I decided to smooth the activation. This was done so that the activation over several nodes contained more information about the precise coordinate of an anchor point than the activation of a single node.

I implemented this smoothing by spreading the target activation out over several output nodes. The node N that represents the coordinate closest to the target coordinate gets an activation of 1. But the target coordinate almost never lies exactly on the coordinate represented by an output node. The neighbors of N therefore receive activation based on the respective distances of their represented coordinate to the target coordinate. So if node 1 represents x = 50, node 2 represents x = 75 and node 3 represents x = 100, and the target is (85,10), then node 2 would get an activation of 1. But node 3 is closer to (85,10) than node 1, so node 3 gets a higher activation than node 1. This distributed activation always sums up to 2. Remember that in this example the nodes only represent x coordinates. There are other output nodes which receive their activation based on the distance of their represented y coordinate to the y coordinate of the target. Once the coordinates relative to the face were known, they were de-normalized so their coordinates in the image could be represented in an output image.

Algorithm 1 Smoothing algorithm for section activation
    leftCoordinate ← sectionCoordinate − (0.5 × sectionLength)
    rightCoordinate ← sectionCoordinate + (0.5 × sectionLength)
    leftActivation ← 1 − ((targetCoordinate − leftCoordinate) / sectionLength)
    rightActivation ← 1 − ((rightCoordinate − targetCoordinate) / sectionLength)
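A Python sketch of this target encoding for one dimension, assuming 25 equally spaced section centres (the spacing and coordinate range are assumptions of the sketch):

    import numpy as np

    def smooth_activation(target, section_coords):
        # section_coords: the centre coordinate represented by each of the 25
        # nodes, equally spaced and sorted. The closest node gets activation 1,
        # and its two neighbours share another unit of activation according to
        # algorithm 1, so the total activation always sums to 2.
        section_coords = np.asarray(section_coords, dtype=float)
        section_length = section_coords[1] - section_coords[0]
        activation = np.zeros(len(section_coords))
        n = int(np.argmin(np.abs(section_coords - target)))
        activation[n] = 1.0
        left_edge = section_coords[n] - 0.5 * section_length
        right_edge = section_coords[n] + 0.5 * section_length
        if n > 0:
            activation[n - 1] = 1.0 - (target - left_edge) / section_length
        if n < len(section_coords) - 1:
            activation[n + 1] = 1.0 - (right_edge - target) / section_length
        return activation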

Algorithm 2 Main network loop
    gridPoints ← getAllGridPointCoordinates()
    for all gridPoint in gridPoints do
        gridPoint.vX ← getXDisplacement()
        gridPoint.vY ← getYDisplacement()
        gridPoint.greyScale ← getGreyScale(gridPoint.x, gridPoint.y)
    end for
    binaryOutput ← Network.getOutput(gridPoints)
    relativeCoordinates ← transformBinaryTo3D(binaryOutput)
    coordinates ← deNormalize(relativeCoordinates)


2.6

SequenceMatcher

While the network from section 2.5 receives most of its input information from optical flow, the system is designed to work on videos. So besides the information in the frames themselves, we can retrieve extra information by looking at a sequence of poses. Such a sequence can give us a prediction about where certain points in our current pose are going to be, which can help interpret the results from the network.

In my system the sequence is used to constrain the output generated by the network. Occasionally, the network returns a coordinate for an anchor point (usually one of the hands) which is highly unlikely. For example, the network may return the coordinates of the hands so that the left hand is on the right side of the face, and the right hand on the left side of the face. The more likely result would be the left hand on the left side, and the right hand on the right side. In order to find the most likely solution, we can look at the previous results.

The first step for this SequenceMatcher was to create a second database from the annotations created in section 3.2. Small sequences were created from these annotations and stored in a sequence database. Each sequence consisted of three frames, each of which contained x,y,z coordinates for each anchor point. When a new frame is analyzed, the result of the current frame is added to the results of the two previous frames, thus forming a sequence of three poses. This sequence is then compared to the sequence database, simply by calculating the Euclidean distance between the respective anchor points, for each frame in the sequence. Once a best matching sequence is found in the sequence database, the pose from the best matching sequence is selected to be the current pose. This improves the performance of the system, as can be seen in section 3.3; the disadvantage is that the system can now only recognize poses that are in the sequence database.
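A sketch of this matching step, assuming each sequence is stored as a (3, 8, 3) array of three frames with x, y, z coordinates for the eight annotated anchor points:

    import numpy as np

    def best_matching_pose(current_sequence, sequence_database):
        # Compare the current three-frame sequence with every database sequence
        # by summing the Euclidean distances between corresponding anchor points
        # in corresponding frames, and return the final pose of the best match.
        best_seq, best_dist = None, np.inf
        for seq in sequence_database:
            dist = np.linalg.norm(seq - current_sequence, axis=-1).sum()
            if dist < best_dist:
                best_seq, best_dist = seq, dist
        return best_seq[-1]  # pose taken from the best matching sequence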

2.6.1

Local hand tracking improvement

As can be seen in section 3.3, the network has some trouble finding the correct positions for the hands. In order to specifically improve the hand tracking results, the SequenceMatcher was modified to return the two best matching sequences. To find better coordinates for the hands, the system compares the two best matching sequences, by counting the amount of skin color found in a small window around the hand positions in the poses returned by those sequences. For the pseudo code of this algorithm see algorithm 3 below.

Skin color is found by finding pixels with certain values in HSV color space. If a pixel has HSV-values within a certain range, that pixel is counted as having skin color. The position of the hands is then set at those coordinates at which the windows around the hands contain the most skin-colored pixels. If no skin color can be found in either sequence, the sequence that matches the best on the other points is chosen.
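A sketch of this skin counting, written with OpenCV purely for illustration; the HSV bounds below are illustrative assumptions, not the values actually used in the thesis.

    import cv2
    import numpy as np

    def count_skin_pixels(frame_bgr, hand_x, hand_y, half=15,
                          lower=(0, 40, 60), upper=(25, 180, 255)):
        # Count skin-coloured pixels in a small window around a candidate hand
        # position by thresholding in HSV color space.
        patch = frame_bgr[max(hand_y - half, 0):hand_y + half + 1,
                          max(hand_x - half, 0):hand_x + half + 1]
        hsv = cv2.cvtColor(patch, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, np.array(lower, np.uint8), np.array(upper, np.uint8))
        return int(cv2.countNonZero(mask))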


Algorithm 3 Updating output with SequenceMatcher
    currentSequence.add(coordinates)
    for all databaseSequence in database do
        distance ← getDistance(currentSequence, databaseSequence)
        if distance < shortestDistance then
            secondDistance ← shortestDistance
            shortestDistance ← distance
            secondSequence ← bestSequence
            bestSequence ← databaseSequence
        else if distance < secondDistance then
            secondDistance ← distance
            secondSequence ← databaseSequence
        end if
    end for
    if bestSequence.numSkin > secondSequence.numSkin then
        return bestSequence
    else
        return secondSequence
    end if


Chapter 3

Evaluation

In this section the system is evaluated. In order to provide a fair comparison to other pose tracking systems, the assumptions on which our system operates are discussed. Then the database is described which was used to evaluate the system’s performance, along with the results of the system on that database. And finally the importance of optical flow on the depth perception of the network is evaluated.

3.1

Assumptions

There are many assumptions used in the action recognition literature. The most typical assumptions in the computer vision literature can be found in figure 3.1, ranked by frequency, where 1 is the most frequent and 10 the least frequent. Obviously, the goal was to keep the number of assumptions to a minimum, so the system could operate in the most diverse conditions. Unfortunately, some could not be avoided. The numbers used in front of the assumptions in the following subsections refer to the numbers used in figure 3.1.

Figure 3.1: Common assumptions in action recognition [15]

3.1.1

Assumptions related to movements

2: None or constant camera motion

By choosing to have a static camera, I have made the use of optical flow a lot easier. Although optical flow can be used with a moving camera (it can even detect the exact movement of the camera), some form of foreground/background separation would be needed to discern between optical flow caused by camera movement and optical flow caused by the subject’s movement.

3: Only one person in the workspace at a time

Although the system can handle people sitting in the background, a lot of extra movement will lead to the same problem as with assumption two: another algorithm will be needed to discern between foreground and background movement.

4: The subject faces the camera at all times

It is possible to train neural networks for different viewpoints; in this thesis I only train a network with the subjects facing the camera. This is of course the best position to study eating behavior, since it provides a view of a subject’s food, movements, and expressions. Unfortunately it might not always be possible to place the camera in front of the subject.

6: No occlusions

As long as the occlusions remain relatively small (for example a hand occluding a shoulder), the system is able to perform as usual. If the occlusions are too large (like someone standing in front of the entire left arm), the system is no longer able to estimate the correct pose. This is logical, because most of the time the right arm does not provide a lot of information about where the left arm should be.

7: Slow and continuous movement

This assumption comes from the use of sequences (see section 2.6). In our database subjects move at a certain speed; if a new subject moves at a speed that differs a lot from our database, the sequences might not be matched correctly.

9: The motion pattern of a subject is known

There are some strange motions in our database, but the basic motion patterns of an eating person are usually the same: the hand goes from the table to the mouth, and the hand goes down again. So although the exact pattern of motion is not known, the number of patterns is so restricted that I still count this as one of my assumptions.

3.1.2

Assumptions related to appearance

2: Static background

Like assumptions 2 and 3 from the assumptions related to movements, this assumption is made because otherwise a system to discern between foreground and background movement would have to be added.

3.2

Database creation

In order to train and test the system, I created an annotated frame database. First, a number of videos were made of people eating. There were no restrictions on the type of clothing or lighting conditions, although all movies were shot at the same location at (approximately) the same distance to the subject. These movies were then separated into frames, at 0.1 second intervals. For training purposes the frames were divided into a training set and a test set. For a sample frame see figure 3.2. It is important to note in figure 3.2 that the screen in the background is there to prevent the movements of people in the area from interfering with the optical flow (see section 1.3.2). The system is able to handle cluttered backgrounds, as long as they are relatively static.


Figure 3.2: Sample frame of a subject eating

Once the database of frames was ready, I created a separate piece of software called 3D Annotator. This software can be used to create models consisting of three dimensional points. These points can also be connected to each other, either with or without a constraint on the maximum distance they can be apart. As can be seen in figure 3.3, the subjects in this study were annotated with a model that consisted of 8 points: the hands, elbows, shoulders, base of the neck and mouth. The hands are constrained to be within a certain distance of the elbows, and the elbows are constrained to the shoulders. The maximum distance that they can be apart is relative to the width of the shoulders. So the distance from the hands to the elbows can be at most 0.6 times the distance between the shoulders. All the other points are connected, but unconstrained.

Figure 3.3: The 3D Annotator

The right picture shows the same model, only now in top view. So the left window displays x,y coordinates, while the right displays x,z coordinates. It is important to note that there were no top view images of the subjects, so the exact distance on the z-axis remained guesswork. Using the 3D Annotator, I annotated over 2000 frames, at an average rate of 30 frames per hour. To create a large database, every frame was flipped horizontally and added to the database. So a sequence of frames of a right-handed person eating became a sequence of a left-handed person eating. This way the database was doubled. An even larger database would most likely have resulted in better performance of our system. Unfortunately annotating the images is a time consuming task, so creating a large database can be a challenge. An alternative would be to create the database using a 3D model, as mentioned in section 1.4.2.
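The horizontal flipping could look roughly like the sketch below; the ordering of the anchor points in the pose array is an assumption made purely for illustration.

    import numpy as np

    # Assumed anchor point order (illustrative only):
    # 0 left hand, 1 left elbow, 2 left shoulder, 3 neck, 4 mouth,
    # 5 right shoulder, 6 right elbow, 7 right hand
    SWAP = [7, 6, 5, 3, 4, 2, 1, 0]

    def mirror_sample(frame, pose, img_width=720):
        # Flip the frame horizontally, mirror the x coordinates of the
        # annotation, and swap left/right anchor points so labels stay correct.
        flipped_frame = frame[:, ::-1].copy()
        flipped_pose = np.asarray(pose, dtype=float).copy()
        flipped_pose[:, 0] = img_width - 1 - flipped_pose[:, 0]
        return flipped_frame, flipped_pose[SWAP]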

3.3

Network Results

In this section I will discuss the performance of the neural network. To do this I will first introduce my own performance measure, which reports not the number of output nodes with the correct activation, but the average distance of the output poses to the target coordinates. I will then discuss the performance of various sizes of the network, and the effect the SequenceMatcher and the local hand tracking improvement have on the performance.

3.3.1

Pixel Distance

In order to get an idea of the performance of the network, I measure the distance of each anchor point returned by the network to its respective target location. This distance is measured in pixels. Because I work at the specific resolution of 720 by 576, I will also describe the distance as a percentage of the image. The Z distance was annotated on a black frame of 720 by 576, so the maximum Z distance is 576, just like the maximum Y distance.
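One plausible reading of this measure is sketched below; whether the distance is taken per dimension as an absolute difference, as assumed here, is not fully specified in the text.

    import numpy as np

    def pixel_distance(predicted, target, img_w=720, img_h=576):
        # predicted, target: (n_anchor_points, 3) arrays of x, y, z coordinates.
        # Returns the mean absolute distance per dimension in pixels, and the
        # same value as a percentage of the image (x relative to the width,
        # y and z relative to the height, since z was annotated on a 720 by 576
        # frame).
        diff = np.abs(np.asarray(predicted, float) - np.asarray(target, float)).mean(axis=0)
        scale = np.array([img_w, img_h, img_h], dtype=float)
        return diff, 100.0 * diff / scale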

3.3.2

Performance Results

The results of the network in various sizes can be found in Table 3.1. Each of these configurations was trained for 300 epochs with the default settings described in appendix B. What can clearly be seen here is that the network has the most difficulty learning Y and Z coordinates, which can be explained by the fact that the X coordinate does not vary all that much during eating. What can also be seen is the relatively small difference in performance between the network sizes. There is an especially small difference between the networks with 3000 and 1500 hidden units. Therefore I did all further testing with a network of 1500 hidden units, since that had a high score, but a relatively small size.

Table 3.1: Performance without the SequenceMatcher. The average distance to the target is displayed in pixels and image percentage.

Hidden units   Set     X             Y             Z
3000           Train   8.5 (1.2%)    8.4 (1.5%)    23.7 (4.1%)
3000           Test    14.2 (1.9%)   23.8 (4.1%)   29.6 (5.1%)
1500           Train   9.6 (1.3%)    8.3 (1.4%)    23.6 (4.1%)
1500           Test    16.7 (2.3%)   17.7 (3.1%)   29.6 (5.1%)
1000           Train   11.3 (1.6%)   11.7 (2.0%)   24.4 (4.2%)
1000           Test    15.0 (2.1%)   20.7 (3.6%)   30.4 (5.2%)
500            Train   14.1 (1.9%)   16.7 (2.9%)   26.3 (4.6%)
500            Test    14.6 (2.0%)   21.0 (3.6%)   34.3 (6.0%)
Average        Train   10.9 (1.5%)   11.3 (2.0%)   24.5 (4.3%)
Average        Test    15.0 (2.1%)   20.8 (3.6%)   31.0 (5.4%)

3.3.3

Effect of the SequenceMatcher

In table 3.2 the effect of adding the SequenceMatcher to the system can be seen. There is a remarkable improvement in the Z values, where the distance is less than half of the average distance without the SequenceMatcher. The X and Y distances are also significantly smaller, with the distance on the test set Y axis being one third of the distance without the SequenceMatcher.

Table 3.2: Performance with the SequenceMatcher. The average distance to the target is displayed in pixels and image percentage.

Hidden units   Set     X             Y            Z
1500           Train   7.9 (1.1%)    7.5 (1.3%)   9.5 (1.6%)
1500           Test    13.8 (1.9%)   7.1 (1.2%)   19.0 (3.3%)

These results seem promising. The average distance on the test set is 2.1%. However, when we look at Table 3.3, we can see that the distances between the different anchor points vary. The system is able to estimate the shoulders so well that mistakes made on the hand coordinates are evened out. But it is the y coordinates of the hands that are the most interesting when studying most tasks, since the hands are used to grasp and move objects. One would think that the y coordinates would be significantly harder than the x coordinates, since during eating most of the motion is in the y direction, but these results show that the system is able to estimate x and y coordinates equally well.

3.3.4

Local Hand Tracking performance

As mentioned in section 2.6.1, I decided to implement a local skin-counting algorithm to get better coordinates for the hands. In Table 3.4 you can see the results. Although the local hand tracking improves the hand coordinates, their distance to the target coordinates still leaves something to be desired. Table 3.5 shows that longer training time does not further improve the results.


Table 3.3: Performance with the SequenceMatcher. The distance to the target for each anchor point is displayed in pixels and image percentage.

Anchor Point     Set     X             Y             Z
Left Hand        Train   13.7 (1.9%)   14.4 (2.5%)   20.6 (3.6%)
Left Hand        Test    15.7 (2.2%)   25.4 (4.4%)   47.0 (8.15%)
Left Elbow       Train   7.8 (1.1%)    8.4 (1.5%)    13.0 (2.3%)
Left Elbow       Test    7.1 (1.0%)    3.1 (0.5%)    14.9 (2.6%)
Left Shoulder    Train   5.3 (0.7%)    3.9 (0.7%)    0.0 (0.0%)
Left Shoulder    Test    2.0 (0.3%)    0.2 (0.0%)    0.0 (0.0%)
Neck             Train   5.3 (0.7%)    3.8 (0.7%)    0.1 (0.0%)
Neck             Test    1.6 (0.2%)    0.2 (0.0%)    0.0 (0.0%)
Right Shoulder   Train   5.4 (0.8%)    3.8 (0.7%)    0.2 (0.0%)
Right Shoulder   Test    1.7 (0.2%)    2.3 (0.4%)    0.0 (0.0%)
Right Elbow      Train   7.0 (1.0%)    7.0 (1.2%)    12.7 (2.2%)
Right Elbow      Test    13.2 (1.8%)   4.6 (0.8%)    3.2 (0.6%)
Right Hand       Train   10.8 (1.5%)   11.2 (1.9%)   20.1 (3.5%)
Right Hand       Test    55.7 (7.7%)   13.2 (2.3%)   68.2 (11.8%)

Table 3.4: Performance after local hand tracking. The distance to the target for both hands is displayed in pixels and image percentage.

Anchor Point   Set     X             Y             Z
Left Hand      Train   13.2 (1.8%)   13.4 (2.3%)   20.6 (3.6%)
Left Hand      Test    11.9 (1.7%)   16.3 (2.8%)   39.3 (6.8%)
Right Hand     Train   10.6 (1.5%)   10.5 (1.8%)   20.4 (3.5%)
Right Hand     Test    54.4 (7.5%)   9.8 (1.7%)    66.5 (11.5%)

3.4

Importance of Optical Flow

In this section the importance of optical flow for depth perception is evaluated. In Table 3.6, the effect each section of the input has on the output is shown. These numbers are obtained by multiplying the weights from the inputs to the hidden layer with the weights from the hidden layer to the output. This method is based on the data driven method described in [20]. The second number displays what fraction of the input a dimension receives from the different input sections. These results show that depth perception is consistently less accurate than 2D perception, and that greyscale represents about 4/6 of the input for each dimension. This suggests that optical flow does not contribute more to depth perception than greyscale, although the x and y displacements together do provide 1/3 of the information needed.
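A sketch of this weight-based importance calculation, following the idea in [20]; taking absolute values before multiplying, and the exact grouping of inputs, are assumptions of the sketch rather than details from the thesis.

    import numpy as np

    def input_group_importance(w_input_hidden, w_hidden_output, groups):
        # w_input_hidden:  (n_inputs, n_hidden) weight matrix
        # w_hidden_output: (n_hidden, n_outputs) weights to the output nodes of
        #                  one dimension (e.g. all x-output nodes)
        # groups: dict mapping an input group ('xDisplacement', 'yDisplacement',
        #         'greyscale') to the indices of its input nodes
        contribution = np.abs(w_input_hidden) @ np.abs(w_hidden_output)
        per_input = contribution.sum(axis=1)  # total weight per input node
        totals = {name: float(per_input[idx].sum()) for name, idx in groups.items()}
        grand_total = sum(totals.values())
        return {name: (value, value / grand_total) for name, value in totals.items()}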

The importance of optical flow can also be established by attempting to estimate poses without optical flow. Table 3.7 shows that test set performance in the x and z dimensions does not decrease when optical flow is removed. This shows that the system does not need the optical flow information to establish these dimensions. When the y dimension is considered, a different effect can be seen. In this case, removing optical flow nearly halves the performance of the system, and as explained above, the y dimension is the most interesting when we consider studying eating behavior. The movement of the hands towards the face gives the most information, and this movement consists almost solely of movement in the y direction. The performance on the training set can be explained by overfitting: with greyscale only, the network only receives 1/3 of its normal inputs, and will therefore start overfitting to the training data faster than the normal network.


Table 3.5: Effect of training time on average pixel distance with 1500 hidden units. The average distance to the target is displayed in pixels and image percentage.

Epochs   Set     X             Y            Z
300      Train   7.8 (1.1%)    7.3 (1.3%)   9.6 (1.6%)
300      Test    13.2 (1.8%)   5.3 (0.9%)   17.7 (3.1%)
600      Train   7.8 (1.1%)    7.2 (1.3%)   9.5 (1.6%)
600      Test    13.2 (1.8%)   5.3 (0.9%)   17.7 (3.1%)

Table 3.6: Relative importance of different kinds of inputs. Results are displayed as the total sum of weights leading to the output nodes of a dimension, and as the fraction of total inputs for that dimension. The fractions are approximations.

Input            X                  Y                  Z
xDisplacement    44,247.49 (1/6)    45,386.34 (1/6)    37,006.65 (1/6)
yDisplacement    45,050.47 (1/6)    46,330.72 (1/6)    37,655.22 (1/6)
Greyscale        169,171.9 (4/6)    172,689.3 (4/6)    141,624 (4/6)

Table 3.7: Network performance when optical flow is removed. The network has 1500 hidden units and the average distance to the target is displayed in pixels and image percentage.

Epochs   Set     X             Y             Z
300      Train   8.6 (1.2%)    8.6 (1.4%)    10.7 (1.9%)
300      Test    16.6 (2.3%)   10.0 (1.7%)   18.56 (3.2%)


Chapter 4

Conclusion

4.1

Creating an upper body pose tracking system

The first goal was to create a system capable of tracking a person’s upper body pose. This system is to become the basis of an action recognition system able to recognize eating actions. Section 3.3 shows that the system developed in this thesis is reasonably successful at tracking a person’s upper body pose. Furthermore, tests have also revealed that it performs at a speed of over 10 frames per second, which means that with some optimization the system can work in real time. These optimizations can either be in the use of optical flow (by selecting a smaller area, as mentioned in section 4.3.2) or in faster implementations of the algorithms. Although the tracking of the hands leaves something to be desired, the system shows that our approach works, and with further optimization it can provide a fast pose estimation system to serve as input for an action recognition system. Suggestions for optimizations can be found in section 4.3. The results in section 3.3 have been obtained under fairly strict assumptions, and a bigger, more diverse database will have to be created before the system can perform under real-life conditions.

4.2

Using optical flow for depth perception

The second goal of this thesis was investigating the effects of optical flow on depth perception. Table 3.6 shows that optical flow did not have the hypothesized effect, showing an equal contribution to all dimensions, and a significantly smaller contribution than greyscale. Although there are ways to improve the performance of optical flow, as mentioned in section 4.3, I have come to believe there is a more fundamental problem with optical flow. As described in section 1.3.1, humans use optical flow for depth perception. However, humans use optical flow primarily when they themselves are moving, and the world around them is relatively static. This gives the most reliable optical flow information. Unfortunately, objects in the world move at different speeds, which would make optical flow less reliable than with a moving camera and static objects, yet I believed the optical flow would still contain enough information for a relatively stable depth perception. Table 3.6 shows that this hypothesis was not correct. Even in the z dimension, the network makes more use of the greyscale information than of the information from the optical flow algorithm. Table 3.7 shows that although optical flow does not contribute more to depth perception than greyscale, it does contribute to the results, especially in the y dimension.

Since optical flow does not contribute to depth perception, it is surprising that the system performs relatively well on the z dimension, as can be seen in Table 3.5. One explanation for this result is that section 1.3.1 did not mention a third depth cue: relative size. If an object of constant size moves away from the viewer, it appears smaller. If it comes closer, it will appear bigger. This information is contained within the greyscale input, and the network might have learned to take advantage of it.


Another thing that might explain the relatively low contribution of the optical flow is the speed of movement when people are eating. As mentioned in section 3.2, the frames were selected at 0.1 second intervals. But people move their hands so fast that sometimes optical flow is no longer able to detect the movement, since the regions of the two images that have to be matched are so far apart that optical flow concludes they do not belong together. This might explain why the SequenceMatcher is so successful: it is not dependent on the grid points, and can deal with large jumps in coordinates between frames. In the next section a possible solution to this problem, and other possible improvements, are described.

4.3

Further Research

4.3.1

Sample frames at smaller intervals

As mentioned in the previous section, optical flow sometimes cannot detect motion when the displacement of a region between two images is too big. In this thesis the frames were sampled at 0.1 second intervals. If frames were sampled more often, for example every 0.03 seconds, optical flow might be able to detect more motion, and provide more information to the network.

4.3.2

Silhouette model

As explained in section 3.1, the system cannot handle moving backgrounds. If the background is moving, the system cannot differentiate between the moving background and the moving person. By first applying a silhouette detection algorithm, which can detect the contours of the subject, and only performing optical flow within that contour, this problem can be largely eliminated. The added advantage of this approach is that the system would only need to compute optical flow within this small silhouette, which means the system would become faster. The result of this silhouette could also be provided as input to the network, which might also increase performance.

4.3.3

Using 3D pose to create an action classifier

As mentioned in section 1.2, one of the goals of this thesis was to develop a system that could serve as input for an action recognition system. Due to time constraints this recognition system was not implemented, but it would still be interesting to see if the 3D pose estimations provided by the system are accurate enough for action classification.

4.3.4

Scaling before applying the SequenceMatcher

The system currently compares the current sequence with sequences from the SequenceMatcher without normalizing the size of the body. The effect of the SequenceMatcher might be improved if all sequences in the SequenceMatcher were normalized. Since the body model used in the 3D Annotator is normalized on the width of the shoulders, this would make a good normalization standard. All the sequences in the SequenceMatcher could be scaled so that the width of the shoulders has a certain value, and every current sequence could be normalized before being compared with the sequences from the SequenceMatcher.
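A sketch of such a normalization, assuming poses are stored relative to the face so that scaling about the origin is meaningful; the anchor point indices and target width are illustrative assumptions.

    import numpy as np

    def normalize_shoulder_width(pose, left_shoulder=2, right_shoulder=5, target_width=100.0):
        # Scale all coordinates of a pose so that the distance between the two
        # shoulder points equals target_width.
        pose = np.asarray(pose, dtype=float)
        width = np.linalg.norm(pose[left_shoulder] - pose[right_shoulder])
        return pose * (target_width / width)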

4.3.5

Optical flow density and sensitivity

A way in which the performance of optical flow might be increased, is by adapting the parameters of the system so there are more grid points, and the algorithm is more sensitive to motion. Although the disadvantage of this solution is that the system will be slower, and more susceptible to false movement, it should increase the overall performance of the optical flow. This might be especially useful in situations where offline video has to be analyzed, in which case the speed of the algorithm is less important than accuracy.


Appendix A

Optical Flow Parameters

These parameters are used in the CBA algorithm. This algorithm is used by the program as can be seen in figure 2.1. All variables are default values as given in [13], except for the matchTH and mpegTH. Since movements related to eating (for example opening the mouth) can be quite small, matchTH and mpegTH were modified to provide a more sensitive optical flow algorithm, capable of picking up these smaller movements.

A.1

Parameter description

matchTH: threshold for matching two templates; higher = more similarity required

numSL: the number of scan lines; higher = fewer lines and faster performance

lengthTemplate: length of the template to be matched; must be odd

mpegTH: a threshold for the difference between pixels in greyscale (0-255); lower = less motion detected

sizeGC: size of the patch used for mpegTH

A.2

Parameter values

A.2.1

Parameters for vector creation

matchTH = 8
lengthTemplate = 11
numSL = 14

A.2.2

Parameters used to check motion vectors

mpegTH = 20
sizeGC = 30
minFraction = 0.05f


Appendix B

Network Parameters

These parameters are used to train the network, using the FamProp program developed by Vicar Vision. Except for the number of nodes in each layer and the number of epochs, all variables were either kept at the default values or given by the database (such as the number of train samples).

B.1

Network

networktype=backpropagation
learningtype=supervised
firstepoch=0
lastepoch=600
evaluationepoch=5
ninput=3528
noutput=525
nhidden=1500
learnrate=0.05
learnrate0=0.05
learnrate1=0.01
momentum0=0.15
momentum1=0.15
seed=12345

B.2

Data

nlearnsetsamples=2992
nstopsetsamples=360

B.3

Scaling

class scale constant=0.2
input scale type=none

B.4

Optimize

performance=midpoint
midpoint value=0.5


Bibliography

[1] www.ice-project.org.

[2] www.restaurantvandetoekomst.wur.nl.

[3] S. Baker, S. Roth, D. Scharstein, M.J. Black, J.P. Lewis, and R. Szeliski. A database and evaluation methodology for optical flow. Technical report, Microsoft Research, 2007.

[4] S.S. Beauchemin and J.L. Barron. The computation of optical flow. ACM Comput. Surv., 27(3):433–466, 1995.

[5] A. Besinger, T. Sztynda, S. Lal, C. Duthoit, J. Agbinya, B. Jap, D. Eager, and G. Dissanayake. Optical flow based analyses to detect emotion from human facial image data. Expert Systems with Applications, 2010.

[6] C.M. Bishop. Neural networks for pattern recognition. Oxford University Press, USA, 1995.

[7] J. Gall, A. Yao, and L. van Gool. 2d action recognition serves 3d human pose estimation. ECCV, 2010.

[8] J. Gao, A.G. Hauptmann, and H.D. Wactlar. Combining motion segmentation with tracking for activity analysis. In The Sixth International Conference on Automatic Face and Gesture Recognition, pages 699–704, 2004.

[9] I. Haritaoglu, D. Harwood, and L.S. Davis. W4: A real time system for detecting and tracking people. In CVPR. IEEE Computer Society, 1998.

[10] B. Heisele, T. Serre, S. Prentice, and T. Poggio. Hierarchical classification and feature reduction for fast face detection with support vector machines. Pattern Recognition, 36(9):2007–2017, 2003.

[11] Z. Jin, Z. Lou, J. Yang, and Q. Sun. Face detection using template matching and skin-color information. Neurocomputing, 70:794–800, 2007.

[12] H.L. Kennedy. Gradient operators for the determination of optical flow. In dicta, pages 346–351. IEEE Computer Society, 2007.

[13] I. Lieberwerth. Presenting two time slice based optical flow algorithms. Internal Report: Vicar Vision, 2007.

[14] A.S. Micilotta, E.J. Ong, and R. Bowden. Real-time upper body detection and 3d pose estimation in monoscopic images. Computer Vision–ECCV 2006, pages 139–150, 2006.

[15] T.B. Moeslund, A. Hilton, and V. Kruger. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, pages 90–126, 2006.

[16] S.E. Palmer. Vision science: Photons to phenomenology. MIT press Cambridge, MA., 1999.

[17] S. Roy and I.J. Cox. A maximum-flow formulation of the n-camera stereo correspondence problem. In IEEE Proceedings of International Conference on Computer Vision, 1998.


[18] S.M. Smith. Reviews of optical flow, motion segmentation, edge finding and corner finding. Technical report, Oxford Centre for Functional Magnetic Resonance Imaging of the Brain, 1997.

[19] P. Turaga, R. Chellappa, V.S. Subrahmanian, and O. Udrea. Machine recognition of human activities: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 18(11):1473, 2008.

[20] F.Y. Tzeng and K.L. Ma. Opening the black box: Data driven visualization of neural networks. In Proceedings of IEEE Visualization ’05 Conference, pages 383–390. IEEE Computer Society, 2005.

[21] P. Viola and M.J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

[22] L. Xiaohua, K.M. Lam, S. Lansun, and Z. Jiliu. Face detection using simplified gabor features and hierarchical regions in a cascade of classifiers. Pattern Recognition Letters, 30(8):717–728, 2009.
