
UNIVERSITY OF AMSTERDAM TNO

BING3D: Fast Spatio-Temporal Proposals for Action Localization

by

Ela Gati

January 2015

Supervisors:

Dr. Jan van Gemert (UvA)

Dr. John Schavemaker (TNO)

Master of Science in Artificial Intelligence

in the

Faculty of Science, Informatics Institute


Abstract

The goal of this thesis is realistic action localization in video with the aid of spatio-temporal proposals. Action localization involves finding the spatial and temporal location of an action, as well as its label. Spatio-temporal proposals are video regions that are likely to contain an action, which can be fed into a classifier to determine the action class. Generating a small set of quality action proposals improves localization as well as classification, since it reduces the amount of background noise and allows the use of more advanced video features and classification techniques.

Current proposal generation methods are computationally demanding and are not practical for large-scale datasets. The main contribution of this work is a novel and fast alternative. Our method uses spatio-temporal gradient computations as its basic video features, as these are simple and fast to compute, while still capable of capturing spatial and temporal action boundaries. This is an extension of the BING method of proposal generation for object detection in still images. We generalize BING to the temporal domain by adding temporal features and a spatio-temporal approximation method. We call our method BING3D. The method is orders of magnitude faster than current methods and performs on par with or above the localization accuracy of current proposals on the UCF Sports and MSR-II datasets. Furthermore, due to our efficiency, we are the first to report action localization results on the large and challenging UCF 101 dataset.

Another contribution of this work is our Apenheul case study, in which we created a novel and challenging dataset and tested our proposals' performance on it. The Apenheul dataset is large-scale, as it contains full high definition videos featuring gorillas in a natural environment, with uncontrolled background, lighting conditions and quality.


Contents

Abstract

1 Introduction

2 Apenheul Case Study
   2.1 Gorillas
   2.2 Apenheul
   2.3 Gorilla Behaviours
   2.4 Dataset Construction

3 Related Work

4 Methodology
   4.1 Proposals Generation
       4.1.1 Multi-scale resize
       4.1.2 Video Gradient
       4.1.3 BING3D
             Learning a classifier model
             Approximate model
             Generating BING3D features
             Proposals Generation
       4.1.4 Non-Maximum Suppression
       4.1.5 Objectness Measure
   4.2 Action Localization
       4.2.1 Improved Dense Trajectories
             TRAJ
             HOG
             HOF
             MBH
       4.2.2 Fisher Encoding
             Principal Component Analysis
             Gaussian Mixture Model
             Fisher vector
             Normalization
             Local Coordinate Encoding
       4.2.3 Support Vector Machine

5 Experiments
   5.1 Experimental Setup
       5.1.1 Data Sets
             UCF Sports Action Data Set
             UCF 101 action recognition data set
             Microsoft Research Action Data Set II
             Apenheul Gorilla Data Set
       5.1.2 Evaluation methods
             Proposals quality
             Action localization
   5.2 Experiments
       5.2.1 Proposals quality
             Effect of NG3D feature depth (D)
             Effect of model approximation (Nw)
             Effect of using temporal scaling on UCF 101
             Effect of ranking
             Cross-dataset model transfer
             Qualitative analysis
             Versus state of the art
       5.2.2 Action Localization
             Feature comparison
             Effect of number of components in GMM
             Effect of normalization
             Local coordinate encoding
             Localization scores versus proposals score
             Versus state of the art
   5.3 Apenheul Case Study
       5.3.1 Proposals quality
             Cross-dataset model transfer
             Qualitative analysis
       5.3.2 Action Localization

6 Conclusions

A Apenheul Example Frame
B Apenheul Ethogram


Chapter 1

Introduction

Figure 1.1: Action localization aims to find where, when and what action is taking place in a video. The red tubelet is the ground truth, the blue cuboid is our best proposal. The action label ("Golf swing") indicates what action is taking place.

Action localization is an active area of research in computer vision. Unraveling when, where and what happens in a video is a great step towards the challenge of understanding a video. The literature in this field defines three fundamental and generic problems: detection (and tracking), recognition and action localization. Recognition deals with deciding whether a specific action is present in a video segment or not. It assumes one action per video, and that videos are trimmed to fit the action; thus it only answers the question of what happens in the video. The goal of the detection task is to find objects in a video, so it answers the where and when questions. Action localization is the most general problem, as it aims at answering all three questions at once. Action localization involves finding all the spatio-temporal windows that contain a specific action. The desired output for action localization is a tubelet (a list of bounding boxes, one per frame) and an action label. Figure 1.1 illustrates a proposal, such as the ones generated by our method. The blue tubelet represents the proposal, and the red one the ground truth. While the proposal demonstrates where and when the action takes place, the action label indicates what action is being performed.


In this work we propose a method for action localization using spatio-temporal proposals, which is fast and achieves state-of-the-art results.

The naive approach to action localization is a sliding sub-volume, the 3D extension of the sliding-window approach for static images, in which all possible tubelets in the video are evaluated. While effective for static images [1], sliding-window approaches become computationally intractable even for modestly sized videos. Searching the spatio-temporal video space is more complex than searching the image space, since on top of determining the spatial location and scale, we also have to find the temporal location and duration. That is, for a sub-volume search (i.e. a cuboid whose size and location are the same for all frames) in a video of size w × h × l, the complexity is O(w²h²l²), compared to O(w²h²) for images. Furthermore, if we relax the cuboid constraint of the sliding windows, the search space becomes exponential in the number of pixels [2].

More recent methods for action localization [3, 4] are proposal based. This is inspired by successful object proposals in static images [5–8]. The idea in proposal based methods is to first reduce the search space to a small set of spatio-temporal tubes with a high likelihood of containing an action. Compared to sliding-subvolume approaches, such as [9–11], proposals for action localization are more efficient and allow using bigger datasets. Another advantage of proposal based methods is that the small number of proposals that has to be classified makes it possible to use more computationally expensive features and more advanced classifiers, which would be impractical otherwise, to achieve state-of-the-art localization accuracy.

Current action proposal algorithms [3, 4] consist of three steps: 1. pre-processing the video by segmentation, 2. generating the proposals by grouping segments, and 3. representing the tube with dense trajectory features [12, 13] so it can be fed to an action recognition classifier. These steps are computationally expensive: the pre-processing (step 1) easily takes several minutes for a modest 720x400 video of 55 frames [4, 14] and can take days for a realistic full-HD movie. The computational demands of current action proposals are not practical for large-scale video processing. This is the main motivation for creating a fast, large-scale action localization method.

In this work, we present BING3D, a generalization of BING [6] from image to video for high-speed 3D proposals, in which we use spatio-temporal image gradients instead of video segmentation. We chose BING because of its impressive speed and small number of quality object proposals. The strength of BING's efficiency lies in simple gradient features and an approximation method for fast proposal selection. We generalize to the temporal domain by adding temporal features and a spatio-temporal approximation method, leading to BING3D.

In chapter 5 we demonstrate that our spatio-temporal proposal method is orders of magnitude faster than current methods and performs on par with or above the localization accuracy of current proposals on the UCF Sports and MSR-II datasets. Moreover, due to our efficiency, we report action localization results on the large and challenging UCF 101 dataset for the first time. We also present our results on a case study, a novel and highly challenging dataset, namely the Apenheul dataset. The videos in the dataset feature a group of gorillas living in the Apenheul zoo in the Netherlands. All videos are high-definition realistic videos, with uncontrolled natural background, lighting conditions and quality. The objective of this dataset is to recognize the different behaviours of gorillas. A summary of this thesis, along with the work of Jan van Gemert and Mihir Jain, was submitted to the International Conference on Computer Vision and Pattern Recognition (CVPR) 2015, the highest ranked computer science publication according to Google Scholar. The paper is currently under review; the confidential submission is supplied in appendix ??.

In the next chapter (2) we present our case study in detail. Chapter 3 gives a short review of the related research in the field. The method is described and explained in chapter 4. We present the experimental setup, as well as experiment results and analysis, in chapter 5. Finally, we conclude our work in chapter 6.


Chapter 2

Apenheul Case Study

Automatic detection and classification of gorilla actions is a fascinating and challenging task. To the best of our knowledge there are no published results which demonstrate success in doing so. The results of automatic recognition of these behaviours can be applied in ethology, genetics, conservation management, tracking the health and well-being of the gorillas, and for park/zoo visitor education and entertainment purposes, among other applications.

As the case study of this research we apply our localization and classification algorithms to automatically detect gorilla actions in videos taken in the Apenheul. The Apenheul is a unique zoo, in which the gorillas live in a natural environment and are free to behave as they like. The Apenheul provides a unique opportunity to observe natural behaviour of gorillas, since gorillas are an endangered species and, furthermore, the remaining ones live in remote and often inaccessible areas. In addition, the social structure of the gorilla group in the Apenheul resembles the one usually found in nature [15], with one adult male and multiple females and offspring. The gorilla sanctuary is surrounded by eleven high definition cameras covering most of its area; see section 2.2 for more details. The videos from these cameras are the base for a new action dataset, which will be referred to as the Apenheul dataset from now on.

This unique dataset presents two main challenges: one is more technical in nature, and the other comes from the gorillas' looks and behaviours. The technical issues are due to the size of the videos, which are at least two orders of magnitude larger than any other action dataset (see section 5.1 for a comparison of the properties of the different datasets used in this project). Furthermore, although the videos are high-definition, the great distance of the cameras from the gorillas causes the resolution of each ground truth track to be quite low. The second challenge is unique to gorillas: gorilla colours range from black to dark grey or silver, which are not discriminative colours, especially against a natural background containing rocks, trees, trunks, shades and shadows. In many cases the gorillas really blend into the background, and are hard to detect even for a human eye. On top of that, gorillas are not especially active. They spend most of their time sitting down and eating, and rarely make grand movements. This, in combination with the long distance from the cameras, makes behaviour detection harder. An example frame from one of the Apenheul videos is presented in figure 2.1. There are eleven gorillas in this frame; can you spot them all? (Answer in appendix A.)

Figure 2.1: Example frame from one of the Apenheul videos (All gorilla images in this thesis are courtesy of Apenheul).

Though the social interaction and hierarchy of gorillas is of great interest and importance, this study focuses on individual actions, mainly due to the complexity of the task. The next sections provide some background information about the gorillas and the Apenheul, and detail the creation of the dataset used in our experiments.

2.1 Gorillas

Gorillas are the largest extant genus of primates by size. They are ground-dwelling, predominantly herbivorous apes that inhabit the forests of central Africa. Gorillas share 98.3% of their genetic code with humans and they are the next closest living relatives to humans after the bonobo and common chimpanzee. The natural habitats of the gorillas cover tropical or subtropical forests in Africa. Wild male gorillas weigh 135 to 180 kg while adult females usually weigh half as much, 65-90 kg. Adult males are 1.7 to 1.8 m tall. Gorillas move around by knuckle-walking, although they sometimes walk bipedally for short distances while carrying food or in defensive situations. Adult male gorillas are known as silverbacks due to the characteristic silver hair on their backs reaching to the hips.

Gorillas live in groups of 6-12, with the oldest and largest silverback leading a family of females, their young and younger males called blackbacks. The silverback makes the decisions on when his group wakes up, eats, moves and rests for the night. Because he must protect his family at all times, the silverback tends to be the most aggressive. In such situations, he will beat his chest and charge at the perceived threat [15]. Gorillas are shy animals that are most active during the day. At dusk, each gorilla constructs a nest of leaves and plant material in which it will sleep. Mothers usually share their nests with nursing infants. Young males may leave their family groups as they become older and either live as solitary silverbacks or create their own family groups. The silverback has the exclusive rights to mate with the females in his group. The gorilla world population is estimated to contain 100,000-200,000 individuals. They are classified as critically endangered by the IUCN Red List of Threatened Species, because of a population reduction of more than 80% over three generations [16].

2.2 Apenheul

The Apenheul was opened in 1971 as a small but revolutionary zoo. It is the first and only zoo in the world where monkeys live free in the forest but can also walk around among the visitors. The zoo began with woolly monkeys, spider monkeys and a few other small species. Before long the concept became a proven success among monkeys and visitors alike. The freedom given to the animals allowed them to form ideal social groups and to reproduce successfully. In Dutch, apen means apes or monkeys and heul is old Dutch for refuge or safe zone. The concept of the Apenheul is simple: people enjoy primates most when the primates are enjoying themselves and behaving naturally. So the monkeys do not live in cages with bars, but in large, natural enclosures in the forest.

The gorilla, the biggest of all apes, arrived in the Apenheul in 1976. Three years later, in 1979, the first gorilla babies were born, followed by many more. The Apenheul has the most close-to-nature group of gorillas in captivity in the world. As in most gorilla groups in nature [15], the Apenheul group also consists of the silverback 'Jambo' as group leader, with 14 other females and young males.

'Gorillas in the Cloud' is a cooperative project between the Apenheul Foundation and TNO. As part of the project, eleven high definition cameras were placed around the gorilla island; figure 2.2 shows their positions and coverage.

Figure 2.2: Map of camera positions and coverage on the gorilla island in the Apenheul. Image courtesy of the Apenheul.

Most scientific research regarding gorilla behaviour was done either on wild gorillas or on caged ones. The Apenheul is somewhere in between, as the area the gorillas live in is bigger and more nature-like than in most zoos, but the gorillas are still being fed and treated by human care-takers.

2.3 Gorilla Behaviours

Appendix B details a list of scientifically defined behaviours of the gorillas, created by the Apenheul team. Another list of behaviour definitions can be found in [15]. Out of these lists we had to choose which actions to detect. Taking into account the resolution, which does not allow detection of fine details, and the lack of professional annotators, we chose to focus on simple actions that are easy to recognise by any human viewer. The list of actions finally selected is as follows: Sitting, Walking, Standing, Tree climbing, Running, Bipedal walking, Foraging. See figures 2.3 and 2.4 for some examples of these actions.

Table 2.1 details the properties of the different action classes. Averages and standard deviations are stated for the duration and spatial change. It stands out that the distribution of actions is uneven, as are their properties, such as duration and how dynamic they are (demonstrated by the average change in x and y positions). For example, the sitting action tends to be long and to have nearly no spatial movement, while the running action is short (usually because the gorillas leave the camera view after a few frames of running) and has a large spatial displacement.


Figure 2.3: Examples of different gorilla actions (Standing, Walking, Tree Climbing).

Figure 2.4: Examples of different gorilla actions (Bipedal Walking, Tree Climbing, Standing, Walking).

It is also obvious that sitting and walking are the most common actions, with a big gap to all other actions. This makes action localization even more challenging.

2.4 Dataset Construction

Since the original videos have no annotations, the first stage in constructing the dataset involved annotating the videos. This was done semi-automatically using the VATIC [17] annotation tool.


Action           | #Instances | Duration (frames) | x change (pixels) | y change (pixels)
Sitting          | 201        | 133 ± 90          | 12 ± 37           | 5 ± 9
Walking          | 219        | 44 ± 34           | 360 ± 332         | 89 ± 110
Standing         | 28         | 50 ± 58           | 20 ± 18           | 8 ± 8
Tree climbing    | 72         | 139 ± 94          | 69 ± 58           | 100 ± 79
Running          | 9          | 16 ± 8            | 433 ± 368         | 43 ± 40
Bipedal walking  | 17         | 38 ± 24           | 261 ± 295         | 66 ± 58
Foraging         | 6          | 41 ± 26           | 28 ± 40           | 12 ± 23

Table 2.1: Action statistics for the Apenheul dataset; averages and standard deviations are stated.

This dataset was constructed from a large collection of videos. The raw data contains thousands of videos, from different days of the year, different times of the day and different cameras. Each video has HD (1920x1080) resolution and is about 4000 frames long (2-2.5 minutes at 30 fps).

Since annotating the videos is extremely time consuming, we chose only two videos to annotate. The videos were chosen out of a subset of videos which contain many actions. Both videos are from the same camera, so the viewpoint is similar, but from different days and hours, so lighting and shading differ. Each video was cut into segments of at most 300 frames and annotated semi-automatically using the VATIC software [17].


Chapter 3

Related Work

Action recognition and localization is an active research topic in computer vision with many important applications, including human-computer interfaces, content-based video indexing, video surveillance, and robotics, among others. Weinland et al. [18] published a comprehensive survey on action recognition methods. The surveyed methods focus on human body model estimation, silhouette extraction and global image models. Such methods estimate the pose and position of a body in each frame to recover the action performed. On top of being computationally expensive, these methods often fail when occlusions occur, which is usually the case in realistic videos. Lan et al. [19] showed that action localization is an important component of action recognition, which can lead to better action classification accuracy. Therefore, we focus on localization in this work.

Local features can be used instead of complex body models. Local features can be computed around salient spatio-temporal interest points [20, 21], at densely sampled points [22, 23] or along dense trajectories [12, 13]. Feature descriptors capture shape and motion in the neighborhoods of selected points using image measurements such as spatial or spatio-temporal image gradients and optical flow. Examples of local features are spatio-temporal gradient information [24, 25], optical flow [26], and color and brightness information. Robustness to camera motion is either directly modeled from the video [13, 27] or dealt with at the feature level by the motion boundary histogram descriptor [12, 26]. Local features provide invariance to occlusions and background clutter, as well as to spatio-temporal shifts and scales. Therefore, local features provide a solid base for action recognition and action localization. We chose to use the improved dense trajectory features [12, 13] in our work due to their excellent performance in action localization [3].

The local descriptors are then aggregated into a global video representation. This can be done using the basic bag-of-visual-words model [3, 12], or using more advanced encoding methods such as VLAD [27] or Fisher vectors [13, 28, 29], which encode additional information about the distribution of the descriptors. Motivated by their high performance [13], we use Fisher vectors as our video representation for action localization. For the final classification we use a linear SVM, as do most other methods.

Several action localization methods apply an action classifier directly on the video. Examples include sliding 3D subvolume methods like spatio-temporal template matching [10], a 3D boosting cascade [9] and spatio-temporal deformable 3D parts (SDPM) [11]. Other methods maximize a temporal classification path of 2D boxes through static frames [2, 30] or search for the optimal classification result with a branch and bound scheme [31]. The benefit is that these methods do not require an intermediate representation and directly apply a classifier to densely sampled parts of the video. The disadvantage of such methods, however, is that they have to perform the same dense sampling for each individual action class separately. Due to the computational complexity of the sampling, this is impractical for larger numbers of action classes. Instead, spatio-temporal proposal based methods first generate a small set of bounding-box tubes that are likely to contain any type of action. Since classification only has to be done for the small proposal set, more robust and computationally expensive features can be used, to achieve higher performance.

Current spatio-temporal proposal methods are inspired by 2D object proposals in static images. A version of objectness [5] is extended to video [32], where super-pixel segmentation is computed online and tubelet hypotheses are proposed accordingly. This is achieved by computing super-pixels per frame, thus incurring a loss of temporal information. Selective search [8] is based on super-pixel segmentation, but diversifies the search by using a variety of complementary image partitionings to deal with as many image conditions as possible. An extension of selective search to 3D was proposed by Jain et al. [3], where super-pixels are replaced with super-voxels and the independent motion evidence feature is introduced to characterize how the action motion deviates from the background motion. Another popular 2D proposal method is randomized Prim [7], which uses the connectivity graph of an image's super-pixels, with weights modelling the probability that neighbouring super-pixels belong to the same object, to generate random partial spanning trees with a large expected sum of edge weights. This idea was generalized to a spatio-temporal variant by Oneata et al. [4].

Several 2D object proposal methods and their 3D generalizations are based on a super-pixel segmentation pre-processing step [3–5, 7, 8, 32, 33]. Trying to compute segmentation for our HD Apenheul videos failed, mainly due to memory constraints and excessively long computation times (up to a week for a single video); hence we argue that the above mentioned methods are computationally too demanding for large-scale video processing.


Other 2D proposal methods, such as edge boxes [34], use edge detection, and BING [6] uses gradient computation as its pre-processing step. Following this trend, we also propose a spatio-temporal proposal based action localization method which extends a current 2D technique. Since we aim at large-scale, fast proposal generation, and gradients are the fastest to compute, we propose a 3D extension of BING for spatio-temporal proposals. Our BING3D method is orders of magnitude faster than any other method that discloses its processing time.


Chapter 4

Methodology

This chapter discusses the methods and algorithms used in proposal generation and action localization. It is divided into two main parts: the first is the proposal generation, which is done using BING3D. The second part is the action localization, for which we use a standard state-of-the-art pipeline with improved dense trajectories as descriptors, encoded as Fisher vectors and classified using a linear SVM.

4.1 Proposals Generation

Figure 4.1: Overview of the BING3D pipeline. Multi-scale resizing of the input video is done using trilinear interpolation. The video gradient and BING3D features are extracted per scale, and proposal scores are computed. We use non-maximum suppression (NMS) to reduce the number of redundant proposals, ending up with a small set of quality proposals.

The generation of action proposals is done using BING3D, our extension of the BING [6] algorithm from static images to videos. BING stands for 'BInarized Normed Gradients', as it is based on image gradients as its basic features. We chose to extend BING because of its low computation time and excellent performance for images. Image derivatives, as well as their three-dimensional extension for videos, are simple features that can be computed efficiently. It has been shown that objects tend to have well-defined boundaries [6], which are captured correctly by the spatial derivative magnitudes. Adding the temporal derivative to the gradient is imperative to capture the temporal extent of an action. We implicitly assume that actions have clear temporal boundaries, which is not necessarily the case in realistic scenarios (e.g. in group sports, players often run and then gradually slow down and start walking, so when does the running action end and the walking start?), but as we will show later, we still manage to properly localize actions temporally.

BING3D contains two main stages. An overview of the first stage pipeline is shown in figure 4.1. The input video is first resized to a pre-defined set of scales. For each scale we compute the video gradient, which is then binarized and approximated, leading to our BING3D features. Like the BING algorithm, BING3D is a supervised method: we use the BING3D features to learn a classifier model. The model is also binarized to accelerate the computation of the proposal confidence scores. As the last stage we use non-maximum suppression to reduce the amount of redundant proposals. All the different parts of the above pipeline are explained in detail in the following sections.

An 'objectness' measure (or 'actionness' measure) is a function that determines how likely a proposal is to contain an action. As the second stage of our algorithm, we learn such a measure from the training data, which is later used to rank and reorder the set of final proposals. The objectness measure parameters are learned independently per scale, to reflect the fact that some scales are more likely to contain an object instance than others. More details about this stage are given in section 4.1.5.

4.1.1 Multi-scale resize

In order to generate diverse proposals in terms of width, height and length, we first resize our videos to a set of pre-defined scales (1/2, 1/4, 1/8, 1/16, 1/32). The resizing of the videos is done using trilinear interpolation, an extension of linear interpolation to three dimensions. The main idea is to approximate the value of a point c = (x, y, z) linearly using the values of the lattice points.

Let the corners of the enclosing cell be c_000 = (x_0, y_0, z_0), c_001 = (x_0, y_0, z_1), etc., as shown in figure 4.2, and let x_d, y_d, z_d be the ratios between the distances from c to c_000 and the corresponding edge sizes. That is:

x_d = (x − x_0) / (x_1 − x_0)
y_d = (y − y_0) / (y_1 − y_0)
z_d = (z − z_0) / (z_1 − z_0)

Figure 4.2: Trilinear interpolation.

First we compute the linear interpolation along the x axis:

c_00 = c_000 (1 − x_d) + c_100 x_d
c_01 = c_001 (1 − x_d) + c_101 x_d
c_10 = c_010 (1 − x_d) + c_110 x_d
c_11 = c_011 (1 − x_d) + c_111 x_d

Using the values computed in the first stage, we can now interpolate linearly along the y axis:

c_0 = c_00 (1 − y_d) + c_10 y_d
c_1 = c_01 (1 − y_d) + c_11 y_d

Finally, we obtain the value of c by using the previous results to interpolate along the z axis:

c = c_0 (1 − z_d) + c_1 z_d

If we denote $\hat{x}_d = 1 - x_d$, and define $\hat{y}_d, \hat{z}_d$ similarly, the process can be summarized as:

$c = \hat{z}_d\,(c_{000}\hat{x}_d\hat{y}_d + c_{100}x_d\hat{y}_d + c_{010}\hat{x}_d y_d + c_{110}x_d y_d) + z_d\,(c_{001}\hat{x}_d\hat{y}_d + c_{101}x_d\hat{y}_d + c_{011}\hat{x}_d y_d + c_{111}x_d y_d)$   (4.1)
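A minimal NumPy sketch of this interpolation, following the steps above; the function names trilinear_sample and resize_video, the (frames, height, width) array layout and the slow reference loop are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def trilinear_sample(volume, x, y, z):
    """Sample a float (frames, height, width) volume at fractional (x, y, z),
    where z indexes frames, using the interpolation steps above."""
    Z, Y, X = volume.shape
    x = min(max(x, 0.0), X - 1); y = min(max(y, 0.0), Y - 1); z = min(max(z, 0.0), Z - 1)
    x0, y0, z0 = int(x), int(y), int(z)
    x1, y1, z1 = min(x0 + 1, X - 1), min(y0 + 1, Y - 1), min(z0 + 1, Z - 1)
    xd, yd, zd = x - x0, y - y0, z - z0
    # interpolate along x (corner naming c_{xyz}, as in the text)
    c00 = volume[z0, y0, x0] * (1 - xd) + volume[z0, y0, x1] * xd
    c01 = volume[z1, y0, x0] * (1 - xd) + volume[z1, y0, x1] * xd
    c10 = volume[z0, y1, x0] * (1 - xd) + volume[z0, y1, x1] * xd
    c11 = volume[z1, y1, x0] * (1 - xd) + volume[z1, y1, x1] * xd
    # interpolate along y, then z
    c0 = c00 * (1 - yd) + c10 * yd
    c1 = c01 * (1 - yd) + c11 * yd
    return c0 * (1 - zd) + c1 * zd

def resize_video(video, scale):
    """Resize a video volume by `scale` (e.g. 1/2 ... 1/32) in all dimensions.
    A slow reference loop, purely for illustration."""
    video = np.asarray(video, dtype=np.float32)
    Z, Y, X = (max(1, int(round(s * scale))) for s in video.shape)
    out = np.empty((Z, Y, X), dtype=np.float32)
    for zi in range(Z):
        for yi in range(Y):
            for xi in range(X):
                out[zi, yi, xi] = trilinear_sample(video, xi / scale, yi / scale, zi / scale)
    return out
```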

4.1.2 Video Gradient

Figure 4.3: Visualization of BING3D features from 3D Normalized Gradients (NG3D). Top: the red boxes are on non-action parts of the video, the green box covers a Running action. Bottom: visualisation of the spatio-temporal NG3D features in the red and green boxes from the top, at 8 × 8 spatial resolution and D = 4 temporal frames. The action is clearly described by the NG3D feature, while random blocks from the same video do not display a similar pattern, illustrating that the NG3D feature can be used for discriminating actions from non-actions.

Our method uses efficient features based on 3D Normalized Gradients (referred to as NG3D). The gradient of a video v is defined by the partial derivatives along each dimension, |∇v| = |(v_x, v_y, v_z)^T|, where v_x, v_y, v_z are the partial derivatives along the x, y, z axes respectively. The partial derivatives are efficiently computed by convolving the video v with a 1D mask [-1 0 1], an approximation of the Gaussian derivative, in each dimension separately. For each pixel the gradient magnitude is computed and then clipped at 255 to fit the value in a byte, as min(|v_x| + |v_y| + |v_z|, 255). The final feature vector is the L1-normalized, concatenated gradient magnitudes of a pixel block. The block is 8x8 pixels spatially, so it fits in a single int64 variable, which allows easy use of bitwise operations, and we vary the temporal depth D of the feature, resulting in an 8 × 8 × D block. In chapter 5 we evaluate the performance when varying the temporal depth D. Figure 4.3 illustrates the NG3D features. The top row shows a sequence of random frames from one of the training videos. The red boxes are random non-action boxes, while the green boxes cover a Running action. The bottom boxes illustrate the spatio-temporal NG3D features of the boxes drawn on top. The action is clearly described with D = 4 temporal layers of the NG3D feature, while random blocks from the same video do not display a similar pattern, illustrating that the NG3D feature can be used for discriminating actions from non-actions.
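A short NumPy sketch of the clipped gradient magnitude and the L1-normalized block feature described above; the function names ng3d and ng3d_feature are illustrative assumptions (the thesis code additionally packs the blocks into int64 words).

```python
import numpy as np

def ng3d(video):
    """Clipped 3D gradient magnitude per pixel; `video` is a
    (frames, height, width) uint8 array."""
    v = video.astype(np.int16)
    vx = np.zeros_like(v); vy = np.zeros_like(v); vz = np.zeros_like(v)
    # central differences, i.e. convolution with the 1D mask [-1 0 1]
    vx[:, :, 1:-1] = v[:, :, 2:] - v[:, :, :-2]
    vy[:, 1:-1, :] = v[:, 2:, :] - v[:, :-2, :]
    vz[1:-1, :, :] = v[2:, :, :] - v[:-2, :, :]
    mag = np.abs(vx) + np.abs(vy) + np.abs(vz)
    return np.minimum(mag, 255).astype(np.uint8)

def ng3d_feature(grad, x, y, t, depth):
    """L1-normalized 8 x 8 x depth block of gradient magnitudes at (x, y, t)."""
    block = grad[t:t + depth, y:y + 8, x:x + 8].astype(np.float32)
    return (block / max(block.sum(), 1e-8)).ravel()
```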

4.1.3 BING3D

The main stage of our method is divided into three parts. The first involves using the training data to learn a classifier model and compute its approximation. In the second part we compute the binarized features, which we call BING3D. In the last part we use the computed features and the approximated model to compute proposal scores.


Learning a classifier model  To learn our model we use our training data to generate positive and negative training samples. For the positive samples, we use approximations of the ground truth tracks. For each ground truth track the tubelet is first enlarged to a cuboid and then resized with different scales. Cuboids that overlap more than a threshold with the ground truth cuboid (posSamplesThreshold = 0.25) are used as positive samples. In addition, to enrich the set of positive samples, we flip each positive cuboid. Flipping is done by mirroring each temporal slice horizontally and reversing the temporal order, to get a consistent action (otherwise, we would get a person walking backwards, for example). When generating the positive samples, we count how many samples are generated for each scale. The scales that have enough training samples are saved and later used for generating the cuboid proposals. For the negative samples, a set of cuboids is generated randomly, and each cuboid that overlaps less than the threshold with all ground truth tracks is added as a negative sample. The negative sample set is produced to match the positive samples in size.

We use a support vector machine (SVM) as our classifier (see section 4.2.3 for a short explanation of support vector machines). We compute the NG3D features for all training samples and then use them in a linear SVM to learn the model w. It is important to note that our features are cuboids (i.e. the bounding boxes have the same size and location in all frames), which can only be scaled to generate the proposals; thus we learn from and generate cuboid proposals and not tubelets.

Approximate model  Efficient proposal classification is achieved by approximating the SVM model w in a binary embedding [6, 35], which allows fast bitwise operations in the evaluation. The learned model w ∈ R^{8×8×D} is approximated by a set of binary basis vectors a_ij ∈ {−1, 1}^{8×8} and their coefficients β_ij ∈ R. The approximation becomes

$w \approx \sum_{i=1}^{D} \sum_{j=1}^{N_w} \beta_{ij}\, a_{ij}$   (4.3)

In chapter 5 we evaluate the quality of the approximation for different numbers of components N_w. Pseudo code for computing the binary embedding is given in algorithm 1.

Generating BING3D features  In addition to the approximation of the model, we also approximate the normed gradient values using their top N_g binary bits.

Algorithm 1  Binary approximation of w
Input: w, N_w, D
Output: {{β_ij}_{j=1}^{N_w}}_{i=1}^{D}, {{a_ij}_{j=1}^{N_w}}_{i=1}^{D}

for i = 1 to D do
    ε = w_i
    for j = 1 to N_w do
        a_ij = sign(ε)
        β_ij = ⟨a_ij, ε⟩ / ||a_ij||²
        ε ← ε − β_ij a_ij
    end for
end for
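A NumPy sketch of Algorithm 1; the function name binary_approximate and the (D, 8, 8) array layout of w are assumptions for illustration.

```python
import numpy as np

def binary_approximate(w, n_w):
    """Greedily approximate each temporal slice w_i of the model by n_w
    binary basis vectors a_ij in {-1, +1} and coefficients beta_ij, so that
    w_i ~= sum_j beta_ij * a_ij (Algorithm 1). `w` has shape (D, 8, 8)."""
    D = w.shape[0]
    a = np.empty((D, n_w, 8, 8), dtype=np.int8)
    beta = np.empty((D, n_w), dtype=np.float64)
    for i in range(D):
        eps = w[i].astype(np.float64).copy()          # residual epsilon
        for j in range(n_w):
            a_ij = np.where(eps >= 0, 1, -1)          # sign(eps), ties to +1
            b_ij = float((a_ij * eps).sum()) / a_ij.size   # <a, eps> / ||a||^2 (= 64)
            a[i, j], beta[i, j] = a_ij, b_ij
            eps -= b_ij * a_ij
    return a, beta
```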

Figure 4.4: Illustration of BING features (image from [6])

Each NG3D feature g_l is thus approximated by its N_g binarized normed gradient features b_{k,l} as:

$g_l = \sum_{k=1}^{N_g} 2^{8-k}\, b_{k,l}$   (4.4)

where l = (i, x, y, z) is the scale and location of the feature. The 8 × 8 × D patches of approximated gradient are the BING3D features. As with the approximation of w, we approximate each temporal slice independently. We use the fast algorithm proposed in [6], presented in algorithm 2, to compute the 8 × 8 feature for each of the D temporal slices. Thanks to the cumulative relation between adjacent BING3D features and their last rows, we can avoid looping over the 8 × 8 region by using BITWISE SHIFT and BITWISE OR operations. This is illustrated in figure 4.4.

Algorithm 2  BING [6] algorithm to compute BING features for W × H positions
Input: binary normed gradient map b_{W×H}
Output: BING feature matrix b_{W×H}
Initialize: b_{W×H} = 0, r_{W×H} = 0

for each position (x, y) in scan-line order do
    r_{x,y} = (r_{x−1,y} << 1) | b_{x,y}
    b_{x,y} = (b_{x,y−1} << 8) | r_{x,y}
end for
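A Python sketch of Algorithm 2 for a single bit plane of a single temporal slice; the function name bing_features and the use of a separate output array (instead of overwriting b in place, as the original does) are illustrative assumptions.

```python
def bing_features(bits):
    """`bits` is an H x W array/list of 0/1 values (one binarized-gradient bit
    plane of one temporal slice). Returns an H x W grid of integers, each
    packing the 8x8 window ending at that position into 64 bits via the
    shift/or recurrence of Algorithm 2, with no inner 8x8 loop."""
    H, W = len(bits), len(bits[0])
    r = [[0] * W for _ in range(H)]   # packed last row of the window (8 bits)
    g = [[0] * W for _ in range(H)]   # packed 8x8 window (64 bits)
    for y in range(H):
        for x in range(W):
            left = r[y][x - 1] if x > 0 else 0
            up = g[y - 1][x] if y > 0 else 0
            r[y][x] = ((left << 1) | int(bits[y][x])) & 0xFF                 # keep 8 bits
            g[y][x] = ((up << 8) | r[y][x]) & 0xFFFFFFFFFFFFFFFF             # keep 64 bits
    return g
```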


Proposals Generation  The proposal generation process involves computing an approximated classifier score (or 'proposal score') s_l for each scale and location in the video and then keeping only the top-scored proposals. To do so for a test video, it is first resized to each of the scales that were saved before, and the NG3D features are computed. Using the learned model w (4.3) and the binarized NG3D features (4.4) we can efficiently compute s_l using atomic bitwise operations.

The approximated classifier score is defined as

$s_l = \langle w, g_l \rangle$   (4.5)

and can be efficiently evaluated as

$s_l \approx \sum_{i=1}^{D} \sum_{j=1}^{N_w} \beta_{ij} \sum_{k=1}^{N_g} 2^{8-k} \left( 2 \langle a_{ij}^{+}, b_{k,l} \rangle - |b_{k,l}| \right)$   (4.6)

with BITWISE SHIFT and POPCNT SSE (a fast built-in C function that counts the number of 1 bits) operations. See [35] for more details. Each location represents a cuboid proposal, with its corner at that location and with width, height and duration according to the current scale.
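A Python sketch of equation (4.6); the data layout (a_plus[i][j] and bing_feats[i][k] as 64-bit packed integers) and the function name approx_score are assumptions, and int.bit_count() (Python 3.10+) stands in for the POPCNT instruction.

```python
def approx_score(bing_feats, a_plus, beta, n_g=4):
    """Approximated classifier score of equation (4.6) for one location.
    bing_feats[i][k]: packed BING3D feature of temporal slice i, bit plane k
    (k = 0 is the most significant of the top n_g gradient bits);
    a_plus[i][j]: packed positive part of binary basis a_ij; beta[i][j]: its
    coefficient."""
    score = 0.0
    for i, slices in enumerate(a_plus):                 # temporal slices
        for j, a in enumerate(slices):                  # binary basis vectors
            inner = 0.0
            for k in range(n_g):                        # binarized gradient bits
                b = bing_feats[i][k]
                inner += (2 ** (7 - k)) * (2 * (a & b).bit_count() - b.bit_count())
            score += beta[i][j] * inner
    return score
```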

4.1.4 Non-Maximum Suppression

Non-maximum suppression (NMS) is often used along with edge detection algorithms: the image is scanned along the image gradient direction, and pixels that are not part of a local maximum are set to zero. This has the effect of suppressing all image information that is not part of local maxima. We use it slightly differently, to reduce the amount of redundant proposals. The process is done by first sorting the proposals according to their approximate classification score, and then including high-scored proposals in the final set while removing their neighbouring proposals (in a 2x2x2 neighbourhood).
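A minimal sketch of this greedy suppression over scale-space grid positions; the proposal representation as (scale, x, y, z) tuples and the exact neighbourhood test are assumptions, and the thesis code may handle the neighbourhood slightly differently.

```python
def nms_3d(proposals, scores, radius=1):
    """Greedy NMS: keep a proposal only if no higher-scored kept proposal of
    the same scale lies within `radius` grid cells in x, y and z."""
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        si, xi, yi, zi = proposals[i]
        clash = any(
            sj == si and abs(xj - xi) <= radius and abs(yj - yi) <= radius and abs(zj - zi) <= radius
            for sj, xj, yj, zj in (proposals[j] for j in kept)
        )
        if not clash:
            kept.append(i)
    return [proposals[i] for i in kept]
```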

4.1.5 Objectness Measure

As explained before, we learn an objectness function that indicates how likely a proposal is to contain an action. To rank proposals on the likelihood of containing an action we learn scale-specific parameters v_i, t_i, to be used in the objectness function

$o_l = v_i \cdot s_l + t_i$   (4.7)

where i is the scale, l = (i, x, y, z) is the location, s_l is the proposal score for this location, and o_l is the objectness score of the proposal.

Figure 4.5: Pipeline for action localization. From the input video we extract both the Improved Dense Trajectories (IDT) features and the 3D normed gradients (NG3D). NG3D is used to generate the action proposals. For each proposal, all IDT trajectories that lie inside the proposal cuboid are aggregated into a Fisher vector, which is then fed to an SVM classifier to produce the final proposal+label output.

To learn the parameters, we generate proposals for the training videos, compute the approximated classifier score using the w learned in the first stage, and use that as a 1D feature. The label is determined according to the overlap score (positive if the overlap score is higher than a threshold, negative otherwise). After generating the final set of proposals, we sort it according to the objectness score of each proposal.

4.2 Action Localization

As a final step of our work, we use the action proposals to classify the actions. This is done using a state-of-the-art pipeline, which includes Improved Dense Trajectories (IDT) [13] as low-level feature descriptors, encoded with Fisher vectors [36] and classified using a linear SVM. An overview of the whole pipeline is shown in figure 4.5. From the input video we extract both the IDT descriptors and the BING3D features. The latter are used to generate the action proposals, as explained in the previous section. For each proposal, all the descriptors associated with IDT trajectories that lie inside the proposal cuboid are aggregated into a Fisher vector, which is then fed to an SVM classifier to produce the final proposal+label output.

For a given action proposal, we first find all trajectories that are strictly inside the proposal cuboid (using the trajectory coordinates per frame), and then aggregate all descriptors associated with these trajectories into a Fisher vector, so we have one Fisher vector per proposal. This can be done on a single descriptor (TRAJ, HOF, HOG, MBH) or on a concatenation of them all. All details of the different parts of the pipeline follow.

4.2.1 Improved Dense Trajectories

Figure 4.6: Illustration of the approach that extracts and characterizes dense trajectories (image from [37]).

Improved Dense Trajectories (IDT) [12, 13] are state-of-the-art video features [37]. This approach densely samples points at several spatial scales. Points in homogeneous areas are suppressed, as it is impossible to track them reliably. Tracking is achieved by median filtering in a dense optical flow field: each densely sampled point at frame t is tracked to the next frame t + 1. Points of subsequent frames are concatenated to form a trajectory, so the shape of a trajectory encodes local motion patterns. Trajectories tend to drift from their initial location during tracking; to avoid drifting, the feature points are only tracked for L frames (L = 15 by default) and new points are sampled to replace them. Static trajectories are removed as they do not contain motion information, and trajectories with sudden large displacements are pruned as well, since they are assumed to be erroneous.

For each trajectory, we compute several descriptors (TRAJ, HOG, HOF and MBH), as detailed below. The Trajectory descriptor (TRAJ) is a concatenation of normalized displacement vectors. The other descriptors are computed in the space-time volume aligned with the trajectory.

The method is illustrated in figure 4.6. On the left of the figure, feature points are densely sampled on a grid for each spatial scale. The middle part shows how tracking is carried out in the corresponding spatial scale for L frames by median filtering in a dense optical flow field. On the right, the trajectory shape is represented by relative point coordinates, and the descriptors (HOG, HOF, MBH) are computed in the space-time volume aligned with the trajectory, which is subdivided into spatio-temporal cells. All histogram-based descriptors are normalized using L1 normalization followed by power normalization (i.e. square-rooting each dimension).

The improved version includes camera motion estimation, which is used to remove trajectories consistent with camera motion and to cancel out camera motion from the optical flow. This is done by matching feature points between frames using SURF descriptors and dense optical flow, which are shown to be complementary. These matches are then used to robustly estimate a homography with RANSAC. Camera motion estimation significantly improves motion-based features, such as HOF and MBH, which are computed using the warped optical flow (recomputed using the estimated homography).

In the next paragraphs we explain in detail the different descriptors extracted by IDT.

TRAJ  The Trajectory descriptor (TRAJ) is a vector containing the normalized displacements between subsequent frames. The displacement is the direction and magnitude of movement of a tracked pixel between two frames. This descriptor encodes local motion patterns. Formally, given a trajectory of length L, the TRAJ descriptor is described by the sequence

$S = (\Delta P_t, \ldots, \Delta P_{t+L-1})$   (4.8)

of displacement vectors, $\Delta P_t = P_{t+1} - P_t = (x_{t+1} - x_t, y_{t+1} - y_t)$. All trajectories are L frames long and each displacement is represented by its two (x, y) coordinates, resulting in a 2L-dimensional vector (30 dimensions by default).
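A small sketch of the TRAJ descriptor; the normalization by the total displacement magnitude follows the dense trajectories paper, but the exact normalization in the IDT code may differ, and the function name traj_descriptor is an assumption.

```python
import numpy as np

def traj_descriptor(points):
    """`points` is an (L+1, 2) array of (x, y) positions of a tracked point in
    consecutive frames. Returns the 2L-dimensional vector of displacements."""
    points = np.asarray(points, dtype=np.float32)
    disp = points[1:] - points[:-1]                      # Delta P_t = P_{t+1} - P_t
    norm = np.linalg.norm(disp, axis=1).sum()            # total displacement magnitude
    return (disp / max(norm, 1e-8)).ravel()              # shape (2L,)
```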

HOG  The Histograms of Oriented Gradients (HOG) descriptor focuses on static appearance information. The video 3D gradient is computed within a spatio-temporal region around a trajectory, and then quantized into eight bins using full orientations. The size of the HOG descriptor is 8x[spatial cells]x[spatial cells]x[temporal cells], which defaults to 96 dimensions.

HOF  The Histogram of Optical Flow (HOF) descriptor captures local motion information. HOF is computed in a similar manner to HOG by taking a spatio-temporal patch around a trajectory and quantizing the warped optical flow fields into eight bins plus an additional zero bin, using full orientations. The size of the HOF descriptor is 9x[spatial cells]x[spatial cells]x[temporal cells], which defaults to 108 dimensions.

MBH  The Motion Boundary Histogram (MBH) descriptor was proposed by Dalal et al. [26]; derivatives are computed separately for the horizontal and vertical components of the optical flow. Spatial derivatives are computed for each of the x and y flow components, and orientation information is quantized into 8-bin histograms. Since MBH represents the gradient of the optical flow, constant motion information is suppressed and only information about changes in the flow field (i.e., motion boundaries) is kept. Each of MBHx and MBHy has 8x[spatial cells]x[spatial cells]x[temporal cells] dimensions, which defaults to 2x96 = 192 dimensions for the concatenated descriptor.

4.2.2 Fisher Encoding

The Fisher encoding is an extension of the bag-of-visual-words model. It is not limited to the number of occurrences of each visual word, but also encodes additional information about the distribution of the descriptors. The Fisher kernel was introduced by Jaakkola and Haussler [36] and applied by Perronnin and Dance [38] to image classification. It assumes that the generative process of the features can be modeled by a Gaussian Mixture Model (GMM). Before learning the GMM and extracting the Fisher vectors, the feature dimensionality is usually reduced using Principal Component Analysis (PCA). The Fisher encoding then captures the average first and second order differences between the reduced video features and the centres of a GMM. We used the YAEL [39] implementation of PCA, GMM and Fisher encoding to obtain our Fisher vectors.

Principal Component Analysis  Principal Component Analysis (PCA) is a technique that is widely used for applications such as dimensionality reduction, lossy data compression, feature extraction, and data visualization. The goal of PCA is to project data with dimensionality D onto a space of dimensionality M < D while maximizing the variance of the projected data. It involves evaluating the mean and the covariance matrix S of the data set and then finding the M eigenvectors of S corresponding to the M largest eigenvalues. This ensures the principal components are sorted according to their variance in descending order, and are orthogonal to each other.

Gaussian Mixture Model  The Gaussian Mixture Model (GMM) is, as the name suggests, a mixture of Gaussian distributions, defined as:

$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$   (4.9)

Each Gaussian density N(x | µ_k, Σ_k) is called a component of the mixture and has its own mean µ_k and covariance Σ_k. The parameters π_k are called mixing coefficients and sum to one. A common way to learn a GMM from the data is the Expectation-Maximization (EM) algorithm, which updates the parameter values iteratively while maximizing the log-likelihood function. It is also common practice to first run K-means clustering and use the resulting clusters' parameters to initialize the EM.

Fisher vector  After extracting our video descriptors and learning a GMM with K components, each descriptor of a video V is soft quantized to the GMM. First and second order differences between each descriptor x_i and its Gaussian cluster mean µ_k are accumulated in corresponding blocks u_k, v_k of the vector FV(V) ∈ R^{2KD}, appropriately weighted by the Gaussian soft-assignments and covariance, leading to a 2KD-dimensional video representation:

$FV(V) = [u_1^T, v_1^T, \ldots, u_K^T, v_K^T]^T$   (4.10)
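A NumPy sketch of the Fisher vector computation for a diagonal-covariance GMM, following the standard first/second-order formulation of Perronnin et al.; the exact scaling used by the YAEL library may differ, and the function name fisher_vector is an assumption.

```python
import numpy as np

def fisher_vector(X, weights, means, variances):
    """X: (N, d) descriptors; weights: (K,); means, variances: (K, d)
    diagonal-covariance GMM parameters. Returns a 2*K*d vector."""
    N, d = X.shape
    K = weights.shape[0]
    # log( pi_k * N(x | mu_k, Sigma_k) ) for every descriptor and component
    log_prob = np.empty((N, K))
    for k in range(K):
        diff = (X - means[k]) / np.sqrt(variances[k])
        log_prob[:, k] = (np.log(weights[k])
                          - 0.5 * np.sum(np.log(2 * np.pi * variances[k]))
                          - 0.5 * np.sum(diff ** 2, axis=1))
    # soft assignments gamma_{nk} (softmax over components, done stably)
    log_prob -= log_prob.max(axis=1, keepdims=True)
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)

    fv = []
    for k in range(K):
        diff = (X - means[k]) / np.sqrt(variances[k])
        g = gamma[:, k][:, None]
        u_k = (g * diff).sum(axis=0) / (N * np.sqrt(weights[k]))                  # 1st order
        v_k = (g * (diff ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * weights[k]))   # 2nd order
        fv.extend([u_k, v_k])
    return np.concatenate(fv)
```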

Normalization  The standard way to normalize Fisher vectors is L2 normalization, that is, dividing by the L2 norm $\|v\| = \sqrt{\sum_i v_i^2}$. Perronnin et al. showed that doing so removes the dependence on the proportion of image-specific information. That is, similar objects with different scales (and therefore different amounts of background information) will still have similar Fisher vectors after performing L2 normalization.

In [40], the authors suggest using power normalization, which in its general form can be expressed by the function

$f(z) = \mathrm{sign}(z)\,|z|^{\alpha}$   (4.11)

applied in each dimension, with 0 ≤ α ≤ 1. The motivation behind this normalization method is that as the number of Gaussian components K increases, the Fisher vectors become sparser, meaning the distribution of features in a given dimension becomes more peaked around zero. The power function balances the distribution and thus has an "unsparsifying" effect. Finally, the power normalization is followed by standard L2 normalization, to still benefit from the discarding of image-independent information. We followed the paper's results and used α = 0.5 in all of our experiments.
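The two normalization steps in a few lines of NumPy; the function name normalize_fisher_vector is an assumption.

```python
import numpy as np

def normalize_fisher_vector(fv, alpha=0.5):
    """Power normalization f(z) = sign(z)|z|^alpha, followed by L2 normalization."""
    fv = np.sign(fv) * np.abs(fv) ** alpha
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv
```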

Local Coordinate Encoding  Local Coordinate Encoding (LCE) was introduced by Sanchez et al. [41] (also referred to as spatially-extended local descriptors). The idea is to append spatial information to each video feature. Formally, for a video feature v_i, computed with respect to trajectory t_i, we first apply the PCA projection, and then append the mean values of the normalized coordinates of t_i, to get an extended feature [mean(x_{t_i}), mean(y_{t_i}), mean(z_{t_i}), v_i^T]^T. The extended features are then used in the soft quantization process. This method is used instead of spatial pyramids, since it is significantly more memory-efficient and yields similar results [42].
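A short sketch of the coordinate extension; the normalization of coordinates by the video dimensions and the function name extend_with_coordinates are assumptions, and the exact convention in [41] may differ.

```python
import numpy as np

def extend_with_coordinates(desc_pca, trajectory_xyt, video_shape):
    """Append the mean normalized (x, y, t) position of a trajectory to its
    PCA-reduced descriptor. `trajectory_xyt` is an (L, 3) array of per-frame
    (x, y, t) coordinates; `video_shape` is (width, height, n_frames)."""
    mean_xyt = np.asarray(trajectory_xyt, dtype=np.float32).mean(axis=0)
    mean_xyt /= np.asarray(video_shape, dtype=np.float32)   # scale to roughly [0, 1]
    return np.concatenate([mean_xyt, desc_pca])
```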

4.2.3 Support Vector Machine

As the final step of the classification process, we learn a linear Support Vector Machine (SVM). SVM selects the separating hyperplane for which the distance from the hyperplane to the closest data points (the margin) is as large as possible. Formally, for training data D containing n d-dimensional samples with positive or negative labels, $D = \{(x_i, y_i) \mid x_i \in \mathbb{R}^d, y_i \in \{-1, 1\}\}_{i=1}^{n}$, SVM aims to find the maximum-margin hyperplane that divides the points having y_i = 1 from those having y_i = −1. Any hyperplane can be written as the set of points x satisfying w · x − b = 0, where · is the dot product and w is the normal vector to the hyperplane. SVM is a popular classifier with multiple efficient implementations. For our purposes we chose to use the LibLinear [43] package.


Chapter 5

Experiments

In this chapter we present the results of our experiments. In the first section we detail the experimental setup and introduce the datasets and evaluation methods in detail. Next, we give a thorough analysis of the different method parameters and their effect on performance, for both proposal generation and action localization, evaluated on benchmark datasets. We compare our results to previous work where possible. In the last section we present our results on the Apenheul case study dataset.

5.1 Experimental Setup

For all experiments in this section we use a train-test split and state results obtained on the test set. For UCF Sports and UCF 101 we use the standard split, and for MSR-II a random split of 50% train and 50% test videos. Since UCF Sports and UCF 101 are trimmed, BING3D outputs full-length proposals for them. Both in BING3D and in the localization training, we set the positive samples threshold to 0.25 in all experiments. We used liblinear [43] everywhere an SVM is used, and the SVM parameter is set using cross validation. We used default parameters in the extraction of the improved dense trajectories. For the Fisher encoding, we always reduced the descriptors' dimensionality to half, as suggested in [37].

5.1.1 Data Sets

In the experiments and evaluation of the algorithm we used three benchmark action localization datasets, namely UCF Sports, UCF 101 and MSR-II, as well as our case study dataset, the Apenheul dataset. The datasets differ in number of videos, video sizes, number of classes, temporal segmentation, number of actions per video, average size of bounding boxes, how realistic the videos are, etc.

Dataset                     | UCF Sports | UCF 101   | MSR-II    | Apenheul
#Videos                     | 150        | 3,204     | 54        | 29
#Classes                    | 10         | 24        | 3         | 7
Width (pixels)              | 686 ± 83   | 320       | 320       | 1920
Height (pixels)             | 448 ± 74   | 240       | 240       | 1080
Length (frames)             | 63 ± 25    | 173 ± 79  | 764 ± 199 | 300
Avg. action length (frames) | 63 ± 25    | 170 ± 80  | 320 ± 134 | 87 ± 82
% Trimmed                   | 100%       | 74.6%     | 0%        | 0%
#Actions/video              | 1          | 1         | 3.8 ± 1.5 | 19 ± 7
Realistic                   | yes        | yes       | no        | yes

Table 5.1: Statistics of the evaluated action localization datasets; averages and standard deviations are stated when appropriate.

Figure 5.1: Example frames from the various datasets (MSR-II, UCF 101, UCF Sports, Apenheul), with ground truth annotations.

All of the above variations make big differences in difficulty levels. Table 5.1 summarizes the main differences, and next we give the full details for each data set. Figure 5.1 illustrates the main differences between the datasets by showing one example frame taken from a random video, together with the ground truth annotations for that frame. It clearly shows the frame size differences, as well as the number of actions in a frame (one for UCF Sports and UCF 101, multiple for MSR-II and Apenheul) and the ratio between the ground truth size and the frame size (UCF Sports and MSR-II tend to be relatively close to the camera, while actions in the other two datasets are mainly zoomed-out).

UCF Sports Action Data Set  The UCF Sports action data set [44] consists of a set of actions collected from various sports which are typically featured on broadcast television channels such as the BBC and ESPN. The video sequences were obtained from a wide range of stock footage websites, including BBC Motion Gallery and GettyImages. The data set contains 150 video sequences at a maximum resolution of 720x480 and a length of 2-3 seconds at 25 fps. The collection represents a natural pool of actions featured in a wide range of scenes and viewpoints. It has a single action per video, and the videos are trimmed to fit the action (so each action lasts through the whole duration of the video).

UCF 101 action recognition data set  UCF 101 is an action recognition data set of realistic action videos [45], collected from YouTube, having 101 action categories. As part of the Thumos challenge [46], bounding box annotations were provided for 24 of these action categories, so we used only these categories for our action localization. With 3204 videos from 24 action categories, UCF 101 gives the largest diversity in terms of actions, with large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc. As most of the available action recognition data sets are not realistic and are staged by actors, UCF 101 aims to encourage further research into action recognition by learning and exploring new realistic action categories. The video resolution is lower than UCF Sports, but the videos are longer. This is the largest and most challenging action localization benchmark dataset, and we are the first to report localization results for it.

Microsoft Research Action Data Set II The Microsoft Research Action Data

Set II (MSR-II) consists of 54 video sequences recorded in a crowded environment. The videos are low resolution, but quite long and untrimmed. Each video sequence consists of multiple actions. There are in total 203 action instances, divided to three action types: hand waving, hand clapping, and boxing. These action types are overlapped with the

KTH dataset [47], which is an older dataset, containing 600 videos with 2391 action

instances. KTH is considered an easy dataset since the backgrounds are homogeneous, apart from the zooming scenario there is only slight camera movement, and the actions are acted out deliberately. KTH is often used for cross-dataset action detection by using

the KTH dataset for training while using MSR-II dataset for testing [3,48,49]. We also

report results for cross-dataset experiments in section

Apenheul Gorilla Data Set The Gorilla data set is composed of high-definition

videos taken in the Apenheul zoo. The dataset contains 29 videos, taken from 2 different video streams. The videos were annotated by hand with bounding boxes and action labels. Each video is full HD (1920x1080 resolution) and 300 frames long. For more details on the dataset and its construction, see chapter 2.


5.1.2 Evaluation methods

We use several measures to quantify the performance of our algorithms. For the proposal quality evaluation we use the ABO, MABO and best overlap recall measures, explained in more detail next. The action localization is evaluated using average precision and AUC.

Proposal quality The quality of a proposal P with respect to a ground truth tube G is evaluated with a spatio-temporal tube overlap, measured as the average "intersection-over-union" score of the 2D boxes over all frames where there is either a ground truth box or a proposal box. More formally, for a video V of F frames, a tube of bounding boxes is given by $(B_1, B_2, \ldots, B_F)$, where $B_f = \emptyset$ if there is no action in frame $f$, and $\phi$ denotes the set of frames where at least one of $G_f$, $P_f$ is not empty. The localization score between G and P is

$$L(G, P) = \frac{1}{|\phi|} \sum_{f \in \phi} \frac{|G_f \cap P_f|}{|G_f \cup P_f|}. \qquad (5.1)$$
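To make the overlap computation concrete, the sketch below is a minimal, hypothetical Python version of equation 5.1 (not the thesis implementation); it assumes tubes are given as dictionaries mapping frame indices to boxes in (x1, y1, x2, y2) format.

def box_iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def tube_overlap(gt, prop):
    # Localization score L(G, P) of equation 5.1: average per-frame IoU over
    # the set phi of frames where at least one tube has a box.
    frames = set(gt) | set(prop)
    if not frames:
        return 0.0
    score = sum(box_iou(gt[f], prop[f]) for f in frames if f in gt and f in prop)
    return score / len(frames)  # frames with only one box contribute zero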

The Average Best Overlap (ABO) score is computed by averaging the localization score of the best proposal for each ground truth action. The Mean Average Best Overlap (MABO) is the mean of the per-class ABO scores. The recall is the percentage of ground truth actions whose best overlap score exceeds a threshold. It is worth mentioning that although other papers often use 0.2 as the threshold (see e.g. [19] for an argumentation for the 0.2 threshold), we chose a stricter criterion; unless stated otherwise we report recall at a 0.5 threshold.

Action localization The localization performance is measured in terms of average precision (AP) and mean average precision (mAP). To compute average precision, the proposals are sorted according to their classification score. A proposal is considered relevant if its label is predicted correctly and its overlap score with the ground truth tubelet exceeds a threshold. We present plots of AP and mAP scores for different overlap thresholds.
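For clarity, a minimal sketch of this average precision computation is given below (a hypothetical helper, not the evaluation code used here); it assumes each proposal of a class comes with its classification score and a flag saying whether it is relevant.

import numpy as np

def average_precision(scores, relevant):
    # scores: classification scores of one class's proposals (higher is better).
    # relevant: 1 if the proposal's predicted label is correct and its overlap
    # with a ground truth tube exceeds the threshold, 0 otherwise.
    order = np.argsort(-np.asarray(scores, dtype=float))
    rel = np.asarray(relevant)[order]
    precision = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    n_rel = max(int(rel.sum()), 1)
    # non-interpolated AP: mean precision at the ranks of the relevant proposals
    return float((precision * rel).sum() / n_rel)

# mAP is the mean of the per-class AP values over all action classes.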

For comparability with previous work, we also provide an AUC plot, computed as in [19]: each action class is taken in turn as the positive class, and the classification scores are used to produce a ROC curve for that positive class. A video is considered correctly predicted if both the predicted action label and the localization match the ground truth. After computing the average action localization performance over all action categories in terms of ROC curves (reported at an overlap threshold of 0.2), we evaluate the area under the ROC curve (AUC), with the overlap threshold varying from 0.1 to 0.6 in steps of 0.1.
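A rough sketch of this per-class ROC/AUC evaluation could use scikit-learn as below; this is an assumption about tooling rather than the protocol code of [19], and the variable names are hypothetical.

import numpy as np
from sklearn.metrics import roc_curve, auc

def class_auc(scores, correct):
    # scores: per-video detection scores when treating one action class as positive.
    # correct: 1 if the video is a true positive, i.e. the predicted label and the
    # localization (at the chosen overlap threshold) match the ground truth, else 0.
    fpr, tpr, _ = roc_curve(np.asarray(correct), np.asarray(scores))
    return auc(fpr, tpr)

# The reported AUC is the average of class_auc over all action classes,
# recomputed for overlap thresholds 0.1, 0.2, ..., 0.6.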

5.2 Experiments

5.2.1 Proposals quality

We start by investigating the effects of the different BING3D parameters on the proposals quality.

Effect of NG3D feature depth (D). We vary the temporal NG3D feature depth D ∈ {1, 2, 4, 8, 16} while keeping Nw = 4 fixed (see 4.3). In figure 5.2 (left) we report the average time per video in seconds; higher D values are slower. Next, we show the effect on the recall in figure 5.3 (left). The feature depth does not matter much for UCF Sports and UCF 101; even disregarding the temporal scale (D = 1) works well, which is due to the trimmed nature of these datasets. For the untrimmed MSR-II, where temporal localization is required, the best performance is obtained with higher D, which illustrates the need for temporal modeling in untrimmed videos.

Effect of model approximation (Nw). We vary Nw ∈ {2, 4, 8, 16} while fixing D to its best value (4 for UCF Sports and UCF 101, 8 for MSR-II). In figure 5.2 (right) we report the average time per video in seconds, showing that Nw has barely any effect on the computation time. The effect on recall is illustrated in figure 5.3 (right). The approximation quality does not affect accuracy for the trimmed UCF Sports and UCF 101, where even Nw = 2 components work well. For the untrimmed MSR-II more than 2 components are needed, while Nw = 16 components is too much, which suggests Nw can be re-interpreted as a regularization parameter.

Effect of temporal scaling on UCF 101. All UCF Sports videos are trimmed, so there is no need for temporal scaling. For MSR-II no videos are trimmed, so temporal scaling is required. UCF 101 has most (74.6%) of its videos trimmed, so it can benefit from temporal scaling (but this increases the training time and the number of proposals). In this experiment we tested the performance with and without temporal scaling, as well as with a combination of the proposals from both. Temporal scaling improves results, but the improvement is relatively small (0.6% in MABO, 2.4% in recall), mainly due to the fact that most of the videos are trimmed, so temporal scaling is not needed. When combining the proposals from both runs, the improvement is more pronounced, with


[Figure: bar charts of average computation time per video (seconds) on UCF Sports, UCF 101 and MSR-II; left panel "Effect of NG3D feature depth on time" for D = 1, 2, 4, 8, 16; right panel "Effect of number of components on time" for Nw = 2, 4, 8, 16.]

Figure 5.2: Evaluating BING3D parameters D (left) and Nw (right) on computation time (s). The feature depth has a strong impact on the generation time.

[Figure: bar charts of recall (≥ 0.5 overlap) on UCF Sports, UCF 101 and MSR-II; left panel "Effect of NG3D feature depth on recall" for D = 1, 2, 4, 8, 16; right panel "Effect of number of components on recall" for Nw = 2, 4, 8, 16.]

Figure 5.3: Evaluating BING3D parameters D (left) and Nw (right) on recall. The untrimmed MSR-II dataset is the most sensitive to parameter variations, illustrating the need for temporal modeling.

almost a 6% improvement in recall over no temporal scaling and 3.5% over temporal scaling. Since we look at the best overlapping proposals, the combination is always at least as good as the better of the two sets. The improvement shows that the two options are complementary, i.e. for some ground truth tracks the proposals of the first experiment are better and for others those of the second experiment. Since the improvement is not large enough to justify the additional computation time, we did not use temporal scaling in the forthcoming experiments on UCF 101.

Effect of ranking In this experiment we evaluate the quality of the learned objectness

measure. The same set of proposals was ranked once according to the approximate classification score, and once using the learned objectness measure. As explained in


Temporal scaling   ABO    MABO   Recall   #Proposals
Without            43.9   43.7   39.2     1,678
With               44.6   44.3   41.6     2,959
Combined           46.1   45.8   45.1     4,637

Table 5.2: Proposal quality for UCF 101, with and without temporal scaling, as well as for a combination of both.

[Figure: recall (≥ 0.5 overlap) on UCF Sports as a function of the number of proposals (0-200), comparing ranking by SVM score with ranking by the BING3D learned objectness measure.]

Figure 5.4: Recall per number of proposals, using different ranking methods.

section 4.1.5, we learn scale-specific parameters to capture the more likely scales. Figure 5.4 shows that the learned ranking performs better, reaching the maximum recall after 85 proposals on average, while ranking by the SVM score needs over 150 proposals. A higher recall for the same number of proposals indicates that the good proposals (i.e. proposals with a high overlap score) are ranked higher.
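For illustration, recall as a function of the number of top-ranked proposals could be computed as in the following hypothetical sketch, reusing tube_overlap from the earlier snippet and assuming the proposals are already sorted by the ranking under evaluation.

def recall_at_k(ground_truths, ranked_proposals, k, threshold=0.5):
    # Fraction of ground truth tubes whose best overlap among the top-k
    # ranked proposals reaches the given threshold.
    top_k = ranked_proposals[:k]
    hits = sum(
        max((tube_overlap(gt, p) for p in top_k), default=0.0) >= threshold
        for gt in ground_truths
    )
    return hits / max(len(ground_truths), 1)

# Plotting recall_at_k for k = 1, ..., 200 under the two rankings gives
# curves of the kind shown in figure 5.4.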

Cross-dataset model transfer Recently it was suggested that the supervision in the original BING is not crucial to its success [50]. To test how this affects BING3D, we evaluate the quality of the learned model w by training on one dataset and evaluating on another. For training, we include the spatio-temporal annotations of the KTH dataset [51]; KTH is commonly used as a training set for MSR-II [3]. We show the cross-dataset results in figure 5.5. For UCF Sports and UCF 101 the results are similar for all models. For MSR-II however, the models learned on the untrimmed MSR-II and KTH sets outperform the models trained on the trimmed datasets. We conclude that for trimmed videos the model has limited influence, yet for untrimmed videos a model trained on untrimmed data is essential.

Qualitative analysis To get a better understanding of the strengths and weaknesses

of BING3D, we analyze success and failure cases for each dataset. We visualize below the ground truth tracks with highest and lowest best overlap score. In all the figures


[Figure: recall for cross-dataset training, shown as a matrix of recall (≥ 0.5 overlap) values:

Trained on \ Tested on   UCF Sports   UCF 101   MSR-II
UCF Sports               0.68         0.39      0.31
UCF 101                  0.68         0.40      0.41
MSR-II                   0.66         0.39      0.52
KTH                      0.66         0.39      0.52   ]

Figure 5.5: Cross-dataset training: a model is trained on set A and applied to set B. Note the robustness on UCF Sports and UCF 101. The untrimmed MSR-II set is sensitive to model variations.

Figure 5.6: UCF Sports: visualization of the best overlap proposals with the highest (overlap 0.87, left) and lowest (overlap 0.12, right) overlap score.

the blue cuboid illustrates the proposal and the red one the ground truth. The overlap score is stated under each figure.

The highest scored proposal for UCF Sports is from the Lifting class (figure 5.6, left). This class is characterized by cuboid-like ground truth annotations, which makes it easier for BING3D to generate quality proposals. The lowest scored proposal (figure 5.6, right) is from the Running class. Here we can see the weak point of generating only cuboids and not tubelets: even though the proposal captures almost the full range of the action (most of the ground truth tubelet lies inside the proposal cuboid), the overlap score is low, because per frame there is a large difference in bounding box size between the proposal and the ground truth.


Figure 5.7: UCF 101: visualization of the best overlap proposals with the highest (overlap 0.81, left) and lowest (overlap 0.05, right) overlap score.

The highest scored proposal for UCF 101 (figure 5.7, left) has ground truth bounding boxes that fit nicely in a cuboid, thus yielding a high-scoring best proposal. On the right we encounter again the disadvantage of generating only cuboid proposals: whenever an action contains large movements within the frame, the overlap scores drop. A few other ground truth tubelets with low overlap scores were not visualized because they are too short (up to 20 frames), which makes the visualization unclear. Since we treated UCF 101 as a trimmed dataset, all proposals were generated with the full video length, and therefore we get low overlap scores for the few untrimmed videos.

For MSR-II the big challenge is temporal localization. The highest scored proposal (figure 5.8, left) demonstrates an impressive success: in a video of 907 frames, the temporal localization is only 4% off (126 frames are shared between the proposal and the ground truth, out of a union of 131 frames, where the ground truth tubelet is 129 frames long). Encouragingly, even for the lowest scored proposal (figure 5.8, right) the temporal localization is relatively good: 21 out of 32 frames are shared. The poor performance in this case is probably again due to the short ground truth track; with an average action length of 320 frames, BING3D learns to generate longer proposal cuboids, and thus fails to fit this outlier ground truth track temporally.

Versus the state of the art In this section we compare BING3D to other action localization methods: the Tubelets method by Jain et al. [3] and Prim3D by Oneata et al. [4]. For both we obtained the raw proposals and computed all evaluation metrics ourselves, so as to have a fair comparison.

First of all, we compare the computation time of BING3D to that of the other methods. The strongest point of BING3D is its speed: it is orders of magnitude faster than the other


Figure 5.8: MSR-II: visualization of the best overlap proposals with the highest (overlap 0.84, left) and lowest (overlap 0.29, right) overlap score.

           Computation time (s)
           Pre-processing   Generation   Total
Prim3D     840              38           878
Tubelets   185              59           244
BING3D     1                0.6          2

Table 5.3: Computation times for pre-processing, proposal generation, and their combined total on a 400x720 video of 55 frames with 12,852 trajectories. Note the speedup of our proposals.

methods, as can be seen in table 5.3. We compare the processing time for one video from the UCF Sports dataset, for which timing results of the other methods are available. Our timing was measured on a single core of a 2.93 GHz Intel Xeon processor.

Next, we compare the performance using three evaluation metrics (ABO, MABO and recall) on the three benchmarks. We also state the number of proposals each method generated. Note that the number of proposals generated by BING3D is significantly lower. For UCF Sports our performance is lower than that of Tubelets, but we still outperform Prim3D on all metrics, and we do so with 10 to 15 times fewer proposals. UCF 101 has no previously reported results to compare to, and on MSR-II we significantly outperform Tubelets with about half the number of proposals. It is also important to remember that since BING3D outputs cuboids rather than tubelets like the other methods, its performance is bounded.

Figure 5.9 shows the recall for different overlap thresholds on all datasets. As mentioned before, BING3D is dominated by Tubelets on UCF Sports. We can also see that although BING3D performs better than Prim3D for low thresholds (up to 0.5), it degrades for higher thresholds. Note that for the far more challenging
