Meta Fisher Vector for event recognition in generative encoded visual streams


MSc Artificial Intelligence

Track: Intelligent Systems

Meta Fisher Vector for event recognition in generative encoded visual streams

by

Markus Nagel

10407308

January 2015, 42 EC
Supervisors: Dr. Thomas Mensink, Dr. Cees Snoek
Assessors: Dr. Henk Zeevat, Dr. Jan van Gemert

Intelligent Sensory Information Systems, Intelligent Systems Lab Amsterdam

Abstract

In this thesis we focus on event recognition in visual image streams. High-level dynamic semantics such as events can often not be inferred from a single still image, but can be discovered within visual streams such as collections of images and video clips. Therefore we aim for recognition from a visual stream, rather than a single image. More specifically, we aim to construct a compact representation which encodes the diversity of the visual stream from just a few observations. For this purpose, we introduce the Meta Fisher Vector, a Fisher kernel based representation to describe a collection of images or the sequential frames of a video. We explore different generative models beyond the Gaussian Mixture Model as underlying probability distribution: first, the Student's-t Mixture Model, which captures the heavy tails of the small sample size of a visual stream; second, Hidden Markov Models, to explicitly capture the temporal ordering of the observations in a stream. For all our models we derive analytical approximations of the Fisher information matrix, which significantly improves recognition performance. We extensively evaluate the properties of our proposed method on three recent datasets for event recognition in photo collections and web videos, leading to an efficient compact visual stream representation which achieves state-of-the-art performance on all these datasets.

Contents

Abstract

1 Introduction

2 Related work
   2.1 Image Collections
   2.2 Video Event Recognition
   2.3 Fisher Vector Representation

3 Preliminary: The Fisher Vector
   3.1 Fisher Kernel
   3.2 Fisher Vector image representation
   3.3 Fisher Information Matrix
       3.3.1 Analytic approximation for the GMM

4 Meta Fisher Vector
   4.1 Student's-t Mixture Model
       4.1.1 Fisher scores for the StMM
       4.1.2 Analytic FIM approximation for the StMM
   4.2 Hidden Markov Model
       4.2.1 Fisher scores for the HMM
       4.2.2 Analytic FIM approximation for the HMM
   4.3 Alternative forms for temporal encoding

5 Experimental evaluation
   5.1 Experimental setup
       5.1.1 Datasets
       5.1.2 The Meta Fisher Vectors pipeline
       5.1.3 Experiments
   5.2 Results and evaluation
       5.2.1 Properties of Meta Fisher Vectors
       5.2.2 Performance of different generative encodings
       5.2.3 Comparison to alternative temporal encodings
       5.2.4 Combination with mean features
       5.2.5 Comparison with state-of-the-art

6 Conclusions

1 Introduction

For humans it is easy to detect objects in images, to recognize events in videos, or to understand the theme of image collections. However, computers interpret images as a matrix of pixels, and automatic detection of abstract concepts such as cows, bicycle tricks and weddings from this digital representation is far from trivial. In Computer Vision we focus on the automatic processing and interpretation of visual information. Apart from the societal benefit of unlocking the value of the enormous amount of visual content uploaded to the internet every minute, this may one day reveal how the human brain is able to see.

FIGURE 1.1: The example image of people having a drink (blue) may belong to different events. When seen in the context of the image collection, its event type is much easier to determine. In this thesis we study how to encode visual streams for event recognition.

No wonder many have studied the understanding of still images, in the form of detecting objects like planes and cars and static scenes like office and mountain. However, it is often difficult to infer high-level dynamic semantics, such as events like birthday party and rock climbing, from a single still image. This is the case for computers, but even humans may have difficulty assigning the appropriate event to a static image. For example, a photo of people having a drink might be from a wedding ceremony, birthday party or Christmas dinner, see Figure 1.1.


FIGURE 1.2: 2D visualization [41] of the image features in a collection (left, one feature per single image) and the stream encoding that we propose (right, one Meta-FV per event). The color indicates the event type to which the image or visual stream belongs. Already in this 2D space the event types form better clusters using the stream representation.

The central theme of this thesis is to consider recognition not from a single image, but from a stream of images, such as photos in a folder on a user's computer, images in Facebook albums, or multiple frames from YouTube videos. If the photo of people having a drink is now seen in a stream of images which also includes dancing, eating dinner and exchanging rings, then it will most likely be from a wedding ceremony. Figure 1.2 illustrates in an empirical way that a visual stream is more discriminative for an event than a single still image.

The goal of this thesis is to design an effective representation of visual streams, such as photo collections [13,2] and video clips [17,30]. This is challenging, since a compact (vectorial) representation is preferred for efficient recognition, while this representation should still capture the visual semantics and temporal diversity of the stream from only a few samples. This leads to the following research question, which guides the direction of this work:

How can we encode visual streams efficiently?

For single images an effective representation is the Bag-of-Words model [39,5]. It is inspired by text retrieval, where each document can be efficiently described using a vector containing the counts of how often each word is present in the document. Similarly, an image can be represented as a vector of occurrence counts of local image features, which capture the visual appearance of a small region in the image. Note, unlike words in text, local image features do not necessarily have discrete values and often have to be discretized before they can be counted.


FIGURE 1.3: Comparison between an image and a visual stream. An image (left) consists of thousands of local SIFT features, whereas a visual stream (right) only contains a small number of images.

A natural extension of the Bag-of-Words model from images to visual streams would be to count the occurrences of image features in the whole stream. However, since there are many different image appearances and a rather small number of images per stream, this is unlikely to be an effective encoding.

Our approach to encoding visual streams is inspired by the success of Fisher Vectors [36] for the encoding of images [36,38] and videos [43,32]. The Fisher Vector is an extension of the Bag-of-Words model which captures higher order statistics instead of simple occurrence counts. It encodes local patches from a single image, or trajectories from a video, using the Fisher kernel [14] with a Gaussian Mixture Model as the underlying generative probability function. A visual stream, however, behaves significantly differently than local patches or trajectories, as we illustrate in Figure 1.3. Most notably, streams may consist of just tens to hundreds of images, while dense sampling methods for patches and trajectories extract 10K-100K local observations per image or video. Moreover, in contrast to low-level local descriptors, e.g. SIFT [23] or MBH [6], an image in a stream can be described by more discriminative features, e.g. based on pre-trained DeepNets [19,47]. Finally, the temporal structure of an image collection might be less well defined than the explicit sequential ordering of the frames of a video. These observations lead to the following two sub-questions:

• Can we exploit the temporal structure?

• How can we deal with a small number of observations?

Our main contribution is a Fisher kernel encoding for a visual stream. Our encoding method extracts a single representation per collection, is independent of the number of images in the stream, and is agnostic to the underlying input features. This is advantageous, since it allows efficient learning of event classifiers and leveraging of discriminative DeepNet features. We coin our encoding the Meta Fisher Vector, and illustrate its pipeline in Figure 1.4.


FIGURE 1.4: The proposed Meta Fisher Vector pipeline for a collection of images or a video stream. In the first step we extract DeepNet features of each image in a collection. From these features we train a generative model, e.g. a Gaussian Mixture Model or Hidden Markov Model, which is then used in the Fisher kernel framework to encode the collection. The resulting vector is called the Meta Fisher Vector.

As our second contribution, we propose alternatives beyond the Gaussian Mixture Model as the underlying distribution of a visual stream: i) We replace the Gaussian Mixture Model by a Student’s-t Mixture Model, which is more robust for our small sample size. ii) We explicitly encode the sequential ordering of a stream using Hidden Markov Models, with both the Gaussian and the Student’s-t distribution as the emission probability function. As a third contribution, we derive analytical approximations of the Fisher information matrix for all of these models.

The remainder of this thesis is organized as follows. We first summarize related work on encoding of image collections, event recognition in video, and the Fisher Vector representation in Chapter 2. In Chapter 3 we provide a detailed introduction to the general Fisher Vector as it is used for image representation. We present our Meta Fisher Vector and the proposed generative models in Chapter 4. We evaluate the properties of the Meta Fisher Vectors and its generative models on three challenging event recognition datasets in Chapter 5. Lastly, we conclude in Chapter 6. In the confidential Appendix A we provide a report from a case study focused on predicting personality types of social network users. In this experiment, we use the Meta Fisher Vector for encoding visual streams consisting of the images uploaded by users. Finally, in Appendix B we provide the summary of this thesis as submitted to the International Conference on Computer Vision and Pattern Recognition.

2 Related work

In this chapter we discuss some of the most relevant work on representing photo collections, event recognition in videos, and the Fisher Vector representation.

2.1 Image Collections

The work on image collections is limited and not as extensively studied as other computer vision challenges such as object recognition in images or event recognition in videos. Early work focused on combining high-level image features in various ways [16,7]. Jiang et al. [16] introduced a Bag-of-Features event-level representation which is conceptually similar to our Meta Fisher Vector, though implementation-wise it differs significantly. They use a spectral clustering algorithm based on the Earth Mover's Distance to create a visual vocabulary and then create a histogram, whereas we use a Gaussian Mixture Model with Fisher encoding. Das et al. [7], on the other hand, base their work on a Bayesian belief network which also includes temporal features to increase the performance. They obtain an impressive performance of 70%, though their dataset only consists of four very general and easy to distinguish events, namely vacation, sports, party and family moments.

More recently, recognition of events in streams of images is commonly achieved by a representation consisting of simple averaging of image features [13], or similarly by a majority vote of single-image classification scores [25]. Imran et al. [13] use a page ranking algorithm in order to select informative low-level features from each image. They then follow a standard Bag-of-Words approach in encoding each image and average the image features in order to represent a collection. Experiments show that their feature selection brings an improvement of 4% compared to a random feature selection, though it is unclear why they do not use all low-level features.


Mattivi et al. [25] classify events and sub-events. For the latter they use unsupervised clustering combined with single-image classification. To perform their event-level classification they apply single-image classification to all images in the collection: they use Bag-of-Words features and an SVM classifier and combine the results by majority voting. However, we deem it unlikely that a simple average or majority voting can capture the variability of the visual semantics of an image stream. A notably different approach is the Stopwatch Hidden Markov Model by Bossard et al. [2], proposed in conjunction with the challenging Photo Event Collection benchmark dataset. They propose a discriminative Hidden Markov Model that models the transitions between states as a function of the time gap between consecutive images in a collection of personal photos. This allows modeling the sequential nature of the image stream, an advantageous property which we adopt in our representation as well. However, their final model requires evaluating several per-image features and computing HMM potentials per event and per collection, which is computationally and memory inefficient. Our proposed method extracts a single vectorial representation per collection, allowing efficient event recognition. For a fair comparison, we also evaluate our method on the Photo Event Collection dataset, with the author-provided features.

2.2 Video Event Recognition

Similar to collections of images, videos have often been encoded as the mean visual feature of the sampled frames [22,24]. Further, several works focus on semantic representations based on prediction scores of pre-trained attribute classifiers [27,26,11], though such methods rely on additional training data for the attribute classifiers. Alternatively, Fisher Vectors over several low-level video descriptors, such as SIFT, STIP and HOG, have been used [29,40,28]. A notable exception is the method of Lai et al., where a video is treated as a Bag-of-Frames and event recognition is handled as a multi-instance learning problem [20].

The current state-of-the-art video representation is the Fisher Vector over motion boundary histograms from improved dense trajectories [43]. For few-example recognition on challenging web videos from the TrecVID multimedia event detection dataset [30], the approach by Wang and Schmid [43] was further improved by learning a compact semantic embedding [10]. This method has been shown effective for recognition using only 10 examples, but the performance difference decreases for recognition using more examples. While we also obtain a compact representation, we do not require additional training data to learn the embedding. Moreover, our approach can model the variation over time explicitly.


Besides reporting results on the TrecVID multimedia event detection dataset [30], we also report on the frequently used Columbia Consumer Video dataset introduced by [17]. To the best of our knowledge these are the largest publicly available video corpora in the literature for event recognition containing user-generated videos with a large variation in quality, length and content.

2.3 Fisher Vector Representation

The Fisher Vector representation was introduced as an alternative to the Bag-of-Words image representation [33,36]. It is based on the Fisher kernel framework [14], a general principle to derive a kernel from a generative probabilistic model which can be used in discriminative classifiers.

Over the years, many extensions have been proposed to the Fisher Vector framework, from which we highlight a few directions. First, the idea of using multiple layers of Fisher Vectors [38,32] is similar in spirit to the proposed Meta Fisher Vector. Indeed, while we use DeepNet features as input for our Meta Fisher Vector, we could equally well have used a sequence of Fisher Vectors as input. Further, many ideas for encoding the spatial layout have been proposed. Spatial Pyramids [21] are used to partition an image into increasingly fine sub-regions and compute one Fisher Vector for each sub-region. In contrast, Sanchez et al. [35] augment the local patches to encode the spatial layout, and the Spatial Fisher Vector [18] defines a location model inside the Fisher Vector framework. We adapt these ideas to model the temporal layout of a visual stream and compare them to the explicit temporal encoding using a Hidden Markov Model. Finally, the idea of using non-iid generative models has been explored for image classification in [3]. While we also use non-iid models, we base them on the temporal structure of our visual streams. In the next chapter we will give a detailed introduction to the Fisher Vector representation.

3 Preliminary: The Fisher Vector

In this chapter we describe the Fisher Vector image representation, which has been introduced as an extension of the Bag-of-Words (BoW) model for image representations [36]. The BoW model is a popular approach to describe images for image retrieval and classification. It has been simultaneously introduced by Sivic and Zisserman [39] in the context of image retrieval and by Csurka et al. [5] for image classification. Later, various extensions and adaptations have been proposed. The BoW model is inspired by text retrieval, where each document can be efficiently described using a vector containing the counts of how often each word is present in the document. The equivalent of words in images are local features, e.g. densely sampled SIFT. However, unlike in text, these local features do not have discrete values. Thus we first train a visual vocabulary by clustering and then assign each local feature to its nearest cluster; the clusters are also called visual words. Using the visual vocabulary we can create a histogram for each image by counting the assignments of the image's local features to their closest visual words. Such a histogram representation of an image can then be used for image classification or retrieval. The concept of the BoW representation is also illustrated in Figure 3.1. Note, the size of such a histogram representation is independent of the image size; it only depends on the size of the visual vocabulary. In summary, the standard pipeline of most BoW models consists of three steps. For the basic model above these are:

Local feature extraction Densely sampled SIFT features.

Visual vocabulary training Cluster centers obtained by clustering the SIFT features.

Local feature encoding Creating a histogram of SIFT features assigned to clusters.
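To make these three steps concrete, the following minimal sketch implements the basic BoW pipeline in Python with scikit-learn. It is an illustration only, not the implementation used in this thesis; the descriptor arrays (`train_descriptors`, `image_descriptors`) and the vocabulary size are hypothetical placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

# Visual vocabulary training: cluster local descriptors pooled from training images.
# train_descriptors: (N, 128) array of e.g. densely sampled SIFT features (assumed given).
def train_vocabulary(train_descriptors, vocab_size=1000):
    return KMeans(n_clusters=vocab_size, n_init=1).fit(train_descriptors)

# Local feature encoding: a histogram of hard assignments to the visual words.
def bow_histogram(image_descriptors, vocabulary):
    words = vocabulary.predict(image_descriptors)  # nearest visual word per local feature
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / hist.sum()  # normalization makes the histogram independent of image size
```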

The Fisher Vector image representation makes two main improvements over the traditional BoW model:


FIGURE 3.1: An illustration of the Bag-of-Words approach. Each of the images consists of a set of local features (top), each of which gets assigned to its closest visual word. Then the histogram of these assignments (bottom) is used to represent the image. Image courtesy of Li Fei-Fei.

• It replaces the hard assignment of local features to a cluster by a soft assignment, i.e. one local feature can partly contribute to more than one visual word.

• In addition to counting the visual word assignments, it also includes higher order statistics, i.e. how much and in which direction a local feature deviates from its assigned cluster.

In the following we first introduce the Fisher kernel [14] in Section 3.1, the general theory behind the Fisher Vector. Then in Section 3.2 we describe the Fisher Vector image representation. Finally, in Section 3.3 we discuss various approximations of the Fisher information matrix to improve the classification performance.

3.1 Fisher Kernel

The Fisher Vector is based on the Fisher kernel [14], a general principle to derive a kernel from a generative probabilistic model which can be used in discriminative classifiers. This combines the advantages of both sides. On the one hand, discriminative classifiers like Support Vector Machines (SVM) show excellent classification performance. On the other hand, generative models describe the data generation process and are therefore able to deal with missing data and variable length inputs.

The Fisher kernel is derived from the gradient of the log-likelihood of a generative model p(·; θ) w.r.t. its parameters θ. This gradient describes how the parameters of the model contribute to the generative process of creating a sample. Let X = {x_1, ..., x_n} denote a set of n observations, where each observation i is described by the vector x_i ∈ R^{1×D}; D is the dimensionality of the image descriptors and x_{id} is the d-th entry of the observation x_i. Then the Fisher score of the observations X is given by the gradient of the log-likelihood w.r.t. the parameters:

G^X_\theta = \nabla_\theta \log p(X; \theta).    (3.1)

The dimensionality of the Fisher score depends only on the number of parameters in θ, not on the number of observations n.

In order to measure the similarity between two samples X and Y, Jaakkola and Haussler [14] proposed using the Fisher kernel, which is defined as the normalized inner product of two Fisher scores:

K_{FK}(X, Y) = {G^X_\theta}^\top F_\theta^{-1} G^Y_\theta,    (3.2)

where F_\theta = E_{X \sim \theta}\left[ G^X_\theta {G^X_\theta}^\top \right] is the Fisher information matrix. This transformation ensures invariance w.r.t. re-parametrization of the probabilistic model. However, calculating F_\theta exactly is in general not feasible. We will discuss in Section 3.3 several approximations which have shown improved classification performance. Since F_\theta is positive semi-definite, and so is its inverse, we can decompose F_\theta^{-1} = {F_\theta^{-1/2}}^\top F_\theta^{-1/2}. Thus the Fisher kernel can be written as the inner product between two Fisher Vectors:

\mathcal{G}^X_\theta = F_\theta^{-1/2} \nabla_\theta \log p(X; \theta).    (3.3)

Using the Fisher Vectors \mathcal{G}^X_\theta in a linear classifier is equivalent to using K_{FK}(X, Y) in a non-linear kernel machine. This formulation has a clear benefit, because linear classifiers can be learned efficiently, even on big datasets.

Rewriting of the Fisher score formulation Before we introduce the Fisher Vector image representation, let us first consider a general rewriting of the Fisher scores. In any graphical model with latent variables the marginal probability is defined as p(X; \theta) = \sum_Z p(X, Z; \theta), where Z is a configuration of hidden states and the sum is over the space of all possible configurations of hidden states. For such models we can rewrite the Fisher score as:

G^X_\theta = \nabla_\theta \log p(X; \theta) = \frac{1}{p(X; \theta)} \nabla_\theta p(X; \theta)    (3.4)
           = \frac{\sum_Z \nabla_\theta p(X, Z; \theta)}{p(X; \theta)}    (3.5)
           = \sum_Z \frac{p(X, Z; \theta)}{p(X; \theta)} \frac{\nabla_\theta p(X, Z; \theta)}{p(X, Z; \theta)}    (3.6)
           = \sum_Z p(Z; X, \theta) \nabla_\theta \log p(X, Z; \theta).    (3.7)

In the first line we applied the chain rule to resolve the logarithm, and in the last line we applied it in reverse to get back to the logarithm of the joint probability. This rewriting provides us an efficient way of calculating the Fisher scores.

3.2 Fisher Vector image representation

The Fisher Vector image representation [36] applies the Fisher kernel to represent images. It uses a Gaussian Mixture Model (GMM) to model local image patches (e.g. SIFT, SURF) of a single image. This assumes that each observation x_i is an independent sample from the probability function. The probability function of an image X modeled with a GMM is given by:

p(X; \theta) = \prod_{i=1}^{n} \sum_{k}^{K} \pi_k \, \mathcal{N}(x_i; \theta_k),    (3.8)

where \mathcal{N}(\cdot; \theta_k) is a Gaussian distribution with parameters θ_k = {µ_k, σ_k}, the mean and variance. Since the size of the Fisher Vector representation depends on the number of parameters in θ, we assume a diagonal covariance matrix. This reduces the size of the covariance from a D × D matrix to a single D-dimensional variance vector, resulting in a final Fisher Vector of size K(2D). (For clarity we omit the gradient w.r.t. the weights; Sanchez et al. [36] have shown that the weights do not add much additional information.) The Gaussian distribution \mathcal{N}(\cdot; \theta_k) with a diagonal covariance is parametrized by its mean µ_k and variance σ_k:

\mathcal{N}(x_i; \theta_k) = \frac{1}{(2\pi)^{D/2} |\sigma^2_k|^{1/2}} \exp\left\{ -\frac{1}{2} \sum_{d}^{D} \frac{(x_{id} - \mu_{kd})^2}{\sigma^2_{kd}} \right\}.    (3.9)

In the case of the GMM all observations X = {x_1, ..., x_n} are independent, and thus the Fisher scores for the k-th mixture component are given by:

\nabla_{\theta_k} \log p(X; \theta) = \sum_{i=1}^{n} \sum_{Z} p(Z; x_i, \theta) \nabla_{\theta_k} \log p(x_i, Z; \theta)    (3.10)
  = \sum_{i=1}^{n} \sum_{l=1}^{K} \gamma_i(l) \nabla_{\theta_k} \log \mathcal{N}(x_i; \theta_l)    (3.11)
  = \sum_{i=1}^{n} \gamma_i(k) \nabla_{\theta_k} \log \mathcal{N}(x_i; \theta_k),    (3.12)

where γ_i(k) = p(z_i = k; x_i) is the posterior probability (responsibility) of the k-th Gaussian mixture component for the observation x_i, defined by the soft assignment of x_i to component k:


\gamma_i(k) = \frac{\pi_k \mathcal{N}(x_i; \theta_k)}{\sum_{l=1}^{K} \pi_l \mathcal{N}(x_i; \theta_l)}.    (3.13)

Before deriving the final Fisher scores for the GMM, let us first recall that the logarithm of the Gaussian distribution with a diagonal covariance matrix is:

\log \mathcal{N}(x_i; \theta_k) = -\frac{1}{2} \sum_{d=1}^{D} \left[ \log(2\pi) + \log(\sigma^2_{kd}) + \frac{(x_{id} - \mu_{kd})^2}{\sigma^2_{kd}} \right].    (3.14)

Combining the logarithm of the Gaussian distribution with Eq. (3.12), we obtain the Fisher score of the image X w.r.t. the mean µ_k:

G^X_{\mu_k} = \frac{1}{\sigma^2_k} \sum_{i=1}^{n} \gamma_i(k) \, (x_i - \mu_k),    (3.15)

and w.r.t. the variance σ_k:

G^X_{\sigma_k} = \sum_{i=1}^{n} \gamma_i(k) \left[ \frac{(x_i - \mu_k)^2}{\sigma^3_k} - \frac{1}{\sigma_k} \right],    (3.16)

where all vector divisions should be read element-wise.

We have now derived the Fisher scores for the GMM. By concatenating all gradients, i.e. G^X_\theta = [G^X_{\mu_1}, G^X_{\sigma_1}, \ldots, G^X_{\mu_K}, G^X_{\sigma_K}]^\top, we have a single vectorial representation for the image X. This representation can be used in a classifier directly, or after normalization with the Fisher information matrix, which we discuss in the next section. For the Fisher Vector model the three steps of the BoW pipeline become:

Local feature extraction This does not depend on the Fisher Vector model; often densely sampled SIFT features are used.

Visual vocabulary training The visual vocabulary for the Fisher Vector is the GMM. It is trained by maximum likelihood estimation of its parameters using a set of local features.

Local feature encoding The encoding consists of calculating the responsibilities per local feature (i.e. the assignment to the clusters) and then calculating the Fisher Vector per image.
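The following sketch makes the encoding step explicit: it computes Eqs. (3.13), (3.15) and (3.16) on top of scikit-learn's diagonal-covariance GaussianMixture. This is an illustrative rendering of the formulas, not the thesis code; `train_descriptors` and `image_descriptors` are hypothetical inputs, and the gradients w.r.t. the mixing weights are omitted, as in the text.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Visual vocabulary training: maximum likelihood fit of a diagonal-covariance GMM.
gmm = GaussianMixture(n_components=16, covariance_type='diag').fit(train_descriptors)

def fisher_vector_gmm(X, gmm):
    """Fisher scores of observations X (n x D) under a diagonal GMM,
    concatenated per component as [G_mu_1, G_sigma_1, ..., G_mu_K, G_sigma_K]."""
    gamma = gmm.predict_proba(X)            # (n, K) responsibilities, Eq. (3.13)
    mu, var = gmm.means_, gmm.covariances_  # both (K, D) for 'diag' covariance
    sigma = np.sqrt(var)
    parts = []
    for k in range(gmm.n_components):
        diff = X - mu[k]
        parts.append((gamma[:, [k]] * diff / var[k]).sum(axis=0))  # Eq. (3.15)
        parts.append((gamma[:, [k]] *
                      (diff ** 2 / sigma[k] ** 3 - 1.0 / sigma[k])).sum(axis=0))  # Eq. (3.16)
    return np.concatenate(parts)            # one vector of size 2*K*D per image

fv = fisher_vector_gmm(image_descriptors, gmm)
```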

3.3 Fisher Information Matrix

The Fisher Information Matrix (FIM) transforms the Fisher scores into a representation which is invariant w.r.t. re-parametrization of the probabilistic model.


Previous research has shown that such a transformation significantly improves classification performance [36]. However, only for a few simple parametric models can the Fisher information matrix be calculated exactly. Alternatively, the Fisher information matrix can be approximated. Three common approximations are:

• The identity matrix, since asymptotically the Fisher information matrix is immaterial [14].

• The diagonal empirical approximation, which results in a whitening of the signal along the diagonal [1], i.e. each dimension will have zero-mean and unit-variance.

• The analytic approximation, which has been proposed by [33,36] for the GMM. In the following we outline the derivation of the analytical approximation, focusing on the parts which are relevant and reused in Chapter 4. A detailed derivation can be found in Appendix A of [36].

3.3.1 Analytic approximation for the GMM

The analytic approximation of the Fisher information matrix for the Gaussian Mixture Model [33,36] is based on the assumption that the posterior distribution γ_i(k) is sharply peaked, i.e. ∀i ∃k : γ_i(k) ≈ 1, which is also known as the hard assignment assumption. This means that each local descriptor of an image is assigned to a single state k.

The entries of the FIM are given by the expected value of the negative second order derivative:

[F]_{i,j} = E\left[ -\frac{\partial^2 \log p(x; \theta)}{\partial \theta_i \partial \theta_j} \right],    (3.17)

where θ_i and θ_j should be read as specific parameters (e.g. the variance of component k and the mean of component k'). The partial derivative of the posteriors w.r.t. the mean and variance parameters is:

\frac{\partial \gamma_i(k)}{\partial \theta_{k'}} = \gamma_i(k) \left( [[k = k']] - \gamma_i(k') \right) \frac{\partial \log p(x; \theta_{k'})}{\partial \theta_{k'}}.    (3.18)

The hard assignment implies that γ_i(k)γ_i(k') ≈ 0 for k ≠ k', and γ_i(k)γ_i(k') ≈ γ_i(k) when k = k'. In both cases the partial derivative becomes zero, leading to the conclusion that the partial derivatives are zero if:

• They involve mean or variance parameters corresponding to different mixture components (k ≠ k').


• They involve a mixing weight parameter and a mean or variance parameter (possibly from the same component).

When using the hard assignment assumption to approximate the second order derivative w.r.t. the mean and variance, the cross terms between mean and variance from the same mixture component become zero:

[F]_{\sigma_k \mu_k} = -\int_x p(x_i; \theta) \frac{\partial^2 \log p(x_i; \theta)}{\partial \sigma_k \partial \mu_k} dx \approx 0.    (3.19)

Together with the conclusion from the hard assignment, this means that all terms of the FIM outside its diagonal are zero. Thus the FIM takes the form of a diagonal matrix, i.e. a single scaling term per component of the Fisher score. Using the hard assignment assumption again in the second-order derivative w.r.t. the mean (or variance) results in the following analytical approximation for the FIM entry of the mean µ_k parameter:

F^{-1/2}_{\mu_k} = \sigma_k \, \pi_k^{-1/2},    (3.20)

and of the variance σ_k parameter:

F^{-1/2}_{\sigma_k} = \sigma_k \, (2\pi_k)^{-1/2}.    (3.21)

Sanchez et al. showed that this analytical approximation for the Gaussian Mixture Model outperforms both the identity and the empirical approximation, for image classification using Fisher Vectors of local descriptors [36]. In Section 5.2.1 we will experimentally compare this analytical approximation to the identity and empirical approximations for Fisher Vectors of visual streams.
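Because the approximated FIM is diagonal, applying it reduces to an element-wise scaling of the Fisher scores. A small sketch, continuing the hypothetical `fisher_vector_gmm` example above and assuming its per-component [µ_k, σ_k] ordering:

```python
def normalize_analytic_fim(fv, gmm):
    """Element-wise scaling of the concatenated Fisher scores by the diagonal
    analytic FIM approximation, Eqs. (3.20) and (3.21)."""
    sigma = np.sqrt(gmm.covariances_)  # (K, D)
    pi = gmm.weights_                  # (K,)
    scales = []
    for k in range(gmm.n_components):
        scales.append(sigma[k] / np.sqrt(pi[k]))        # mean block, Eq. (3.20)
        scales.append(sigma[k] / np.sqrt(2.0 * pi[k]))  # variance block, Eq. (3.21)
    return fv * np.concatenate(scales)
```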

4 Meta Fisher Vector

In this chapter we describe the Meta Fisher Vector (Meta-FV) encoding for a visual stream. Our encoding bears a resemblance to the Fisher Vector image representation detailed in Chapter 3. The Meta-FV encodes a visual stream of images, where each image is described by a single feature, in our case extracted from a DeepNet [19,47]. This means we encode a visual stream X where each observation x_i is a descriptor of an image. In general we could apply the same Fisher Vector as described for image encoding to the visual stream X. However, as illustrated in Figure 1.3, there are two significant differences in the observations of a visual stream compared to a single image:

• The number of observations: an image usually consists of approximately 10K local features, whereas a visual stream typically has 50-100 images.

• The dimensionality of the observations: local features like SIFT have around 128 dimensions, whereas most image representations have several thousand dimensions.

Both the higher dimensionality of the features and the smaller number of observations make the encoding of a visual stream a harder problem, which may require different techniques.

For the Meta-FV, we also explore using a GMM to model a stream of images. This assumes that each stream can be modeled as a collection of independently distributed images. We then change the underlying generative model in two directions:

• We replace the GMM by a Student's-t Mixture Model (StMM), which is more robust against outliers.

• We explicitly model the sequential ordering of a stream using Hidden Markov Models (HMM).


FIGURE 4.1: Overview of the relation between our four generative models as the underlying probability distribution of the Meta Fisher Vector, and how they interpret a visual stream.

FIGURE 4.2: Example data (green) and how they are fit by the Gaussian (blue) and the Student's-t (red) distribution. Indeed, the Student's-t distribution is less affected by the outliers and fits the data more tightly, a desirable property.

Finally, we also combine both extensions, which results in a Student's-t Hidden Markov Model (StHMM). An overview of our different generative models and their relationship is illustrated in Figure 4.1.

In the remainder of this chapter, we first describe the different probabilistic models we use to encode the images of a stream: in Section 4.1 the Meta-FV for the Student's-t Mixture Model, and in Section 4.2 the sequential modeling of a stream using a Hidden Markov Model. Finally, we discuss in Section 4.3 alternative forms of temporal encoding.

4.1 Student's-t Mixture Model

A problem with the Gaussian Mixture Model is that it is highly affected by the presence of (a small number of) outliers.


FIGURE 4.3: The Gaussian and Student's-t distribution: the heavy tails of the pdf for various degrees-of-freedom ν (top), and their reflection in the responsibility values of the StMM and GMM (bottom). Indeed, the responsibility values of the StMM are less sparse, a desirable property.

While this might be less of a problem when encoding a set of approximately 10K local descriptors, as in the Fisher Vector image encoding, it becomes a problem when encoding an image stream consisting of only around 50-100 images. In order to make our model more robust against outliers, we replace the Gaussian distribution with a Student's-t distribution, which is known for its heavier tails [31]. The Student's-t distribution, being more resistant against outliers, is beneficial in two ways:

• A small number of outliers has a less significant influence during training of the mixture model, thus we can fit the distribution more tightly (more accurately) to the data. We illustrate this in Figure 4.2.

• The distribution is less peaked (i.e. has more probability mass in the tails), thus the responsibility values are less sparse, as we illustrate in Figure 4.3.

The multivariate Student's-t distribution of mixture component k is defined by:

St(x_i; \theta_k) = \frac{\Gamma\left(\frac{\nu_k + D}{2}\right)}{\Gamma\left(\frac{\nu_k}{2}\right) (\pi \nu_k)^{D/2} |\sigma^2_k|^{1/2}} \left( 1 + \frac{1}{\nu_k} \delta_k(x_i) \right)^{-\frac{\nu_k + D}{2}},    (4.1)

where θ_k denotes the parameters: mean µ_k, variance σ_k and degrees-of-freedom ν_k; Γ is the gamma function; |σ²_k| is the determinant of the diagonal covariance matrix; and δ_k(x_i) denotes the Mahalanobis distance:

\delta_k(x_i) = \sum_{d=1}^{D} \frac{(x_{id} - \mu_{kd})^2}{\sigma^2_{kd}}.    (4.2)

In the sequel we will first derive the Fisher scores for the Student’s-t Mixture Model and then propose an analytic approximation for the Fisher information matrix in order to perform efficient classification.

4.1.1 Fisher scores for the StMM

The StMM is a mixture model of the same form as the GMM. This means all observations X = {x_1, ..., x_n} are independent, and by replacing the Gaussian distribution in Eq. (3.12) with the Student's-t distribution from Eq. (4.1), the Fisher score for the k-th mixture component is:

\nabla_{\theta_k} \log p(X; \theta) = \sum_{i=1}^{n} \gamma_i(k) \nabla_\theta \log St(x_i; \theta_k),    (4.3)

where γ_i(k) is now the responsibility of the k-th Student's-t mixture component for observation x_i:

\gamma_i(k) = \frac{\pi_k \, St(x_i; \theta_k)}{\sum_{l=1}^{K} \pi_l \, St(x_i; \theta_l)}.    (4.4)

Now let us first consider the logarithm of the Student's-t distribution with a diagonal covariance:

\log St(x_i; \theta_k) = \log\left( \frac{\Gamma\left(\frac{\nu_k + D}{2}\right)}{\Gamma\left(\frac{\nu_k}{2}\right) (\pi \nu_k)^{D/2}} \right) - \frac{1}{2} \log(|\sigma^2_k|) - \frac{\nu_k + D}{2} \log\left( 1 + \frac{1}{\nu_k} \delta_k(x_i) \right).    (4.5)

Combining this with Eq. (4.3), we obtain the Fisher score of the collection X w.r.t. the mean µ_k:

\nabla_{\mu_k} \log p(X; \theta) = \sum_{i=1}^{n} \gamma_i(k) \nabla_{\mu_k} \log St(x_i; \theta_k)    (4.6)
  = \sum_{i=1}^{n} \gamma_i(k) \frac{\nu_k + D}{1 + \frac{1}{\nu_k} \delta_k(x_i)} \frac{(x_i - \mu_k)}{\sigma^2_k \nu_k}    (4.7)
  = \frac{\nu_k + D}{\sigma^2_k \nu_k} \sum_{i=1}^{n} \gamma_i(k) \frac{(x_i - \mu_k)}{1 + \frac{1}{\nu_k} \delta_k(x_i)},    (4.8)


and w.r.t. the variance σ_k:

\nabla_{\sigma_k} \log p(X; \theta) = \sum_{i=1}^{n} \gamma_i(k) \nabla_{\sigma_k} \log St(x_i; \theta_k)    (4.9)
  = \sum_{i=1}^{n} \gamma_i(k) \left[ \frac{\nu_k + D}{1 + \frac{1}{\nu_k} \delta_k(x_i)} \frac{(x_i - \mu_k)^2}{\sigma^3_k \nu_k} - \frac{1}{\sigma_k} \right]    (4.10)
  = \frac{\nu_k + D}{\sigma^3_k \nu_k} \sum_{i=1}^{n} \gamma_i(k) \frac{(x_i - \mu_k)^2}{1 + \frac{1}{\nu_k} \delta_k(x_i)} - \frac{1}{\sigma_k} \sum_{i=1}^{n} \gamma_i(k),    (4.11)

where again all vector divisions should be read element-wise. These equations are similar to Eq. (3.15) and Eq. (3.16), since they are also weighted averages of the observations, where each observation is weighted by its responsibility. However, two important differences are (i) that the responsibility values are now based on the Student's-t distribution with its heavier tails, and (ii) that each observation is also weighted by the Mahalanobis distance w.r.t. component k and the degrees-of-freedom ν_k, and therefore the dimensions are no longer independent.
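To illustrate Eqs. (4.2), (4.4) and (4.8), the sketch below implements the diagonal-covariance Student's-t log-density and the mean gradients in Python. The mixture parameters (`pi`, `mu`, `var`, `nu`) are assumed to be already fitted, e.g. by EM with a fixed ν as in Section 5.1.2; this is an illustrative rendering of the formulas, not the thesis code.

```python
import numpy as np
from scipy.special import gammaln

def log_student_t(X, mu_k, var_k, nu_k):
    """Log of the diagonal-covariance multivariate Student's-t, Eq. (4.1)."""
    D = X.shape[1]
    delta = ((X - mu_k) ** 2 / var_k).sum(axis=1)  # Mahalanobis distance, Eq. (4.2)
    return (gammaln((nu_k + D) / 2.0) - gammaln(nu_k / 2.0)
            - 0.5 * D * np.log(np.pi * nu_k) - 0.5 * np.log(var_k).sum()
            - 0.5 * (nu_k + D) * np.log1p(delta / nu_k))

def stmm_scores_mu(X, pi, mu, var, nu):
    """Responsibilities of Eq. (4.4) and the mean gradients of Eq. (4.8)."""
    D = X.shape[1]
    K = len(pi)
    log_p = np.stack([np.log(pi[k]) + log_student_t(X, mu[k], var[k], nu[k])
                      for k in range(K)], axis=1)
    log_p -= log_p.max(axis=1, keepdims=True)  # numerically stable soft-max
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)  # (n, K), Eq. (4.4)
    G_mu = []
    for k in range(K):
        delta = ((X - mu[k]) ** 2 / var[k]).sum(axis=1)
        w = gamma[:, k] / (1.0 + delta / nu[k])  # per-observation down-weighting
        G_mu.append((nu[k] + D) / (var[k] * nu[k])
                    * (w[:, None] * (X - mu[k])).sum(axis=0))  # Eq. (4.8)
    return gamma, G_mu
```

The weight w makes the role of the heavy tails explicit: observations far from a component (large δ_k) are down-weighted, which is exactly the robustness argued for above.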

4.1.2 Analytic FIM approximation for the StMM

In this section we propose an analytic FIM approximation for the Student's-t Mixture Model. This is similar to the approximation for the GMM [33,36] which we discussed in Section 3.3. Unlike in the GMM, the Fisher scores of the StMM depend on all dimensions, due to the Mahalanobis distance in the denominator of Eq. (4.8). However, in any high-dimensional dataset, the relative contrast between the farthest-point distance D_max and the closest-point distance D_min vanishes [48]. This leads to our two main assumptions:

• Sharply peaked posterior assumption, which says that the distribution γ_i(k) is sharply peaked, i.e. ∀i ∃k : γ_i(k) ≈ 1, as used for the GMM approximation.

• Constant distance assumption: in addition we assume that the Mahalanobis distance δ_k(x_i) between a descriptor x_i and its assigned state k converges to a constant, i.e. ∀i ∀k : δ_k(x_i) ≈ c.

Since the StMM is a mixture model of the same form as the GMM, the implications of the hard assignment assumption in Eq. (3.18) are also valid here. Thus in the following we show (i) that the cross terms between mean and variance of the same component are also zero for the StMM, and (ii) the scaling factors for the components µ_k and σ_k.

To see that the cross terms between mean and variance of the same component are also zero for the StMM model, we rely on the constant distance assumption. Using this, the second-order partial derivative w.r.t. the mean and variance of the same component becomes:

\frac{\partial^2 \log p(x_i; \theta)}{\partial \sigma_k \partial \mu_k} \approx \frac{\nu_k + D}{\nu_k (1 + c)} \gamma_i(k) (x_i - \mu_k) \frac{\partial \sigma_k^{-2}}{\partial \sigma_k}    (4.12)
  = -2 \frac{\nu_k + D}{\nu_k (1 + c)} \gamma_i(k) (x_i - \mu_k) \sigma_k^{-3}.    (4.13)

Then by integrating over it we get:

[F]_{\sigma_k \mu_k} = -\int_x p(x_i; \theta) \frac{\partial^2 \log p(x_i; \theta)}{\partial \sigma_k \partial \mu_k} dx    (4.14)
  \approx 2 \frac{\nu_k + D}{\nu_k (1 + c)} \sigma_k^{-3} \int_x p(x_i; \theta) \gamma_i(k) (x_i - \mu_k) dx    (4.15)
  = 2 \frac{\nu_k + D}{\nu_k (1 + c)} \sigma_k^{-3} \pi_k \int_x p(x_i; \theta_k) (x_i - \mu_k) dx = 0.    (4.16)

Now we derive in a similar way the analytic approximation of the scaling factors in the FIM. The second-order derivative w.r.t. the mean reads:

\frac{\partial^2 \log p(x_i; \theta)}{(\partial \mu_k)^2} \approx \frac{\nu_k + D}{\sigma^2_k \nu_k (1 + c)} \gamma_i(k) \frac{\partial (x_i - \mu_k)}{\partial \mu_k}    (4.17)
  = -\frac{\nu_k + D}{\sigma^2_k \nu_k} \frac{1}{1 + c} \gamma_i(k).    (4.18)

Integration then gives:

[F]_{\mu_k \mu_k} = -\int_x p(x_i; \theta) \frac{\partial^2 \log p(x_i; \theta)}{(\partial \mu_k)^2} dx    (4.19)
  \approx \frac{\nu_k + D}{\sigma^2_k \nu_k} \frac{1}{1 + c} \int_x p(x_i; \theta) \gamma_i(k) dx    (4.20)
  = \frac{\nu_k + D}{\sigma^2_k \nu_k} \frac{1}{1 + c} \pi_k \int_x p(x_i; \theta_k) dx    (4.21)
  = \sigma_k^{-2} \pi_k \frac{\nu_k + D}{\nu_k} \frac{1}{1 + c} \approx \sigma_k^{-2} \pi_k \frac{\nu_k + D}{\nu_k},    (4.22)

where in the last step we ignored \frac{1}{1+c}, since it is a constant scaling term present in every entry of the diagonal FIM. This results in the following analytical approximation for the FIM entry of the mean µ_k parameter:

F^{-1/2}_{\mu_k} = \sigma_k \left( \pi_k \frac{\nu_k + D}{\nu_k} \right)^{-1/2}.    (4.23)

Similarly, we can show that the FIM entry for the variance σ_k parameter becomes:

F^{-1/2}_{\sigma_k} = \sigma_k \left( 2 \pi_k \frac{\nu_k + D}{\nu_k} \right)^{-1/2}.    (4.24)


To the best of our knowledge, this is the first closed-form approximation of the Fisher information matrix for the Student's-t Mixture Model.

4.2 Hidden Markov Model

In this section we propose to model the temporal relationship among images in a stream, using a Hidden Markov Model (HMM). While the independent mixture models, discussed above, ignore the temporal structure of the visual stream and treat each image as an independent observation, the HMM models encode the temporal relation explicitly.

The temporal relation in the HMM is modeled by the latent state z_i, which depends not only on the observation x_i, but also on the latent variable z_{i-1} of the previous observation. The probability of a sequence X is given by:

p(X; \theta) = \sum_Z \prod_{i=1}^{n} p(z_i; z_{i-1}) \, p(x_i; z_i),    (4.25)

where p(z_i; z_{i-1}) models the transition probability, parametrized by the transition matrix A ∈ R^{K×K} and the vector π ∈ R^{1×K} for the initial state distribution; and p(x_i; z_i) models the emission probability, for which we use either the Gaussian distribution \mathcal{N}(x_i; \theta_k) or the Student's-t distribution St(x_i; \theta_k). Note that the summation over Z sums over the K^n possible latent sequences of the stream.

4.2.1 Fisher scores for the HMM

To derive the Fisher score for the HMM, we will use a notation similar to Bishop [1], where z_{i,k} is a 1-of-K encoding scheme, i.e. z_{i,k} = 1 if z_i = k, otherwise z_{i,k} = 0. By combining the HMM definition of Eq. (4.25) and the general Fisher score of Eq. (3.7), we obtain:

\nabla_\theta \log p(X; \theta) = \sum_Z p(Z; X, \theta) \nabla_\theta \sum_{i=1}^{n} \left[ \log p(z_i; z_{i-1}, A) + \log p(x_i; z_i, \theta) \right]    (4.26)
  = \sum_Z p(Z; X, \theta) \sum_{i=1}^{n} \nabla_\theta \log p(x_i; z_i, \theta)    (4.27)
  = \sum_Z p(Z; X, \theta) \sum_{i=1}^{n} \sum_{k=1}^{K} z_{i,k} \nabla_\theta \log p(x_i; \theta_k)    (4.28)
  = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma_i(k) \nabla_\theta \log p(x_i; \theta_k),    (4.29)


where γ_i(k) is the responsibility of the observation x_i for the hidden state k. Note, these are no longer the soft-max assignments as in the mixture models. Instead they are now defined by the HMM:

\gamma_i(k) = E[z_{ik}] = p(z_i = k; X)    (4.30)
  = \frac{p(X; z_i = k) \, p(z_i = k)}{p(X)}    (4.31)
  = \frac{p(\{x_j\}_{j=1}^{i}, z_i = k) \, p(\{x_j\}_{j=i+1}^{n}; z_i = k)}{p(X)}.    (4.32)

The responsibility values γ_i(k) can be efficiently computed using the forward-backward algorithm (see Bishop [1], Section 13.2).

For the final step, we demonstrate using the Gaussian distribution as emission probability; note, this can be done in exactly the same way using the Student's-t distribution. Using the logarithm of the Gaussian distribution from Eq. (3.14) in Eq. (4.29), the Fisher score for the visual stream X w.r.t. the mean µ_k is:

\nabla_{\mu_k} \log p(X; \theta) = \sum_{i=1}^{n} \sum_{l=1}^{K} \gamma_i(l) \nabla_{\mu_k} \log \mathcal{N}(x_i; \theta_l)    (4.33)
  = \sum_{i=1}^{n} \gamma_i(k) \frac{x_i - \mu_k}{\sigma^2_k},    (4.34)

and w.r.t. the variance σ_k:

\nabla_{\sigma_k} \log p(X; \theta) = \sum_{i=1}^{n} \gamma_i(k) \left[ \frac{(x_i - \mu_k)^2}{\sigma^3_k} - \frac{1}{\sigma_k} \right].    (4.35)

Comparing these to the Fisher scores for the GMM in Eq. (3.15) and Eq. (3.16), we observe that they are identical, albeit with γ_i(k) computed differently: it is now given by Eq. (4.32) instead of the soft assignment probabilities.

Similarly, for the Fisher scores of the HMM using the Student's-t emission distribution from Eq. (4.1) we obtain:

\nabla_{\mu_k} \log p(X; \theta) = \frac{\nu_k + D}{\sigma^2_k \nu_k} \sum_{i=1}^{n} \gamma_i(k) \frac{(x_i - \mu_k)}{1 + \frac{1}{\nu_k} \delta_k(x_i)},    (4.36)

and

\nabla_{\sigma_k} \log p(X; \theta) = \frac{\nu_k + D}{\sigma^3_k \nu_k} \sum_{i=1}^{n} \gamma_i(k) \frac{(x_i - \mu_k)^2}{1 + \frac{1}{\nu_k} \delta_k(x_i)} - \frac{1}{\sigma_k} \sum_{i=1}^{n} \gamma_i(k).    (4.37)

Indeed, these are identical to Eq. (4.8) and Eq. (4.11), however using the responsibility values γi(k) as given by the HMM model in Eq. (4.32).
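In practice this means the Meta-FV computation only changes in how the responsibilities are obtained. Below is a sketch using the hmmlearn package (an assumption; the thesis does not prescribe a library) that also mirrors the training procedure of Section 5.1.2, where the emission parameters are taken from the independent GMM and only the transition parameters are re-estimated; `train_features` and `stream` are hypothetical arrays of image features.

```python
from sklearn.mixture import GaussianMixture
from hmmlearn.hmm import GaussianHMM

# Emission parameters from the independent model (cf. Section 5.1.2).
gmm = GaussianMixture(n_components=16, covariance_type='diag').fit(train_features)

# Only the (s)tart and (t)ransition parameters are initialized and re-estimated.
hmm = GaussianHMM(n_components=16, covariance_type='diag',
                  init_params='st', params='st')
hmm.means_, hmm.covars_ = gmm.means_, gmm.covariances_
hmm.fit(stream)                      # stream: (n, D) image features in temporal order

gamma = hmm.predict_proba(stream)    # (n, K) posteriors, Eq. (4.32), via forward-backward
# Plugging gamma into the per-component gradients of Eqs. (3.15)/(3.16) yields
# Eqs. (4.34)/(4.35); the Student's-t emission case follows analogously.
```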


4.2.2 Analytic FIM approximation for the HMM

In this section we introduce an analytical FIM approximation for the proposed HMM models. The approximation is based on the observation that the Fisher scores of the HMM models (Eqs. (4.34)-(4.37) for the HMM and StHMM) are identical to those of the GMM and StMM models, except for the responsibility function γ_i(k), which is now given by:

\gamma_i(k) = \frac{p(\{x_j\}_{j=1}^{i}, z_i = k) \, p(\{x_j\}_{j=i+1}^{n}; z_i = k)}{p(X)},

instead of the soft-assignment posterior probabilities of Eq. (4.4).

For the analytical approximation, however, the exact method of computing γ_i(k) is not important, since only the hard-assignment assumption is crucial for the analytical approximation of the FIM. The assumption is thus that every observation is assigned to a single state, i.e. ∀i ∃k : γ_i(k) ≈ 1, independent of the method for computing the responsibilities.

In conclusion, for both HMM models we obtain the same closed-form approximations as used in the independent image models (Eq. (3.20) and Eq. (4.23)). We are aware that this is a crude approximation, since the observations are no longer independent. However, our experimental evaluation (see Section 5.2) showed that this approximation is sufficient and clearly outperforms the identity and empirical approximations.

4.3 Alternative forms for temporal encoding

For the Fisher Vector image representation several extensions have been proposed in order to model the spatial layout of an image. Here we discuss a few of these approaches and how they can be used to model the temporal location of images in a stream.

Temporal Pyramids Temporal Pyramids (TP) [29] are an adaptation of the Spatial Pyramids [21] for the Bag-of-Words model, one of the earliest spatial encoding methods. The Spatial Pyramids partition an image into increasingly fine sub-regions and compute one Fisher Vector for each sub-region. Then the Fisher Vectors from all sub-regions are concatenated into one descriptor.

Similarly, in Temporal Pyramids we partition the visual stream into sub-regions, compute the Meta-FV from all images inside each sub-region, and concatenate them all in the end. This is a simple and computationally efficient extension of the Meta-FV, though one main drawback is that the dimensionality of the resulting Meta-FV using Temporal Pyramids is significantly higher, namely c × d, where c is the number of temporal regions and d is the dimensionality of the original Meta-FV without Temporal Pyramids.

Temporal augmentation of the image features Temporal Augmentation (TA) of image features is inspired by augmenting local SIFT features to model their location in the image [35]. The main advantage of this approach compared to the Spatial Pyramids is that it does not (significantly) increase the dimensionality of the final descriptor. They propose to augment the SIFT feature x_i ∈ R^{1×D} with its relative spatial location m_i = [m_{ix}, m_{iy}]^\top and the patch size σ_i.

For the temporal augmentation of the image features we augment each image feature x_i ∈ R^{1×D} with its relative temporal location t_i ∈ [0, 1]. This results in the augmented feature vector \hat{x}_i ∈ R^{1×(D+1)}:

\hat{x}_i = [x_i, t_i - 0.5]^\top.    (4.38)

By using these augmented image descriptors instead of the original ones, the underlying distribution p(X; θ) not only reflects the generative process of the image descriptors, but also the temporal location where they are likely to be generated. In order to fairly compare this augmentation to the original Meta-FV, we follow [35] and leave out the least significant principal component when applying PCA. Then we have x'_i ∈ R^{1×(D-1)} and \hat{x}_i ∈ R^{1×D}, which means the augmented features have exactly the same dimensionality as the features used in the original Meta-FV.
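A few lines suffice to implement Eq. (4.38); in this sketch the relative temporal locations are assumed to be equally spaced, which is appropriate for frames sampled at a fixed rate (for photo collections, normalized time stamps would take their place):

```python
import numpy as np

def temporally_augment(stream):
    """Append the relative temporal location t_i - 0.5 to each feature, Eq. (4.38).
    stream: (n, D) array of (PCA-reduced) image features in temporal order."""
    n = stream.shape[0]
    t = np.linspace(0.0, 1.0, n)[:, None]  # t_i in [0, 1], assumed equally spaced
    return np.hstack([stream, t - 0.5])    # (n, D+1) augmented features
```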

Temporal Fisher Vectors The Temporal Fisher Vector (TFV) is based on the Spatial Fisher Vector [18], which models the spatial location within the Fisher kernel framework. There the Fisher Vector, which is based only on appearance, is combined with a location model. The appearance-location tuple F = (X, M), where M = {m_1, ..., m_n}, is modeled by:

p(F; \theta) = \prod_{i=1}^{n} \sum_{k}^{K} \pi_k \, p(x_i; \theta_k) \, p(m_i; \theta_k),    (4.39)

where p(m_i; θ_k) is the spatial location model for component k. This location model can either be described by a single Gaussian distribution \mathcal{N}(m_i; \theta_k) or by a mixture of Gaussians. The final Fisher Vector includes the gradients from both the appearance and the location model. Figure 4.4 illustrates the difference between the Spatial Fisher Vector encoding and the Spatial Pyramids.


FIGURE 4.4: Illustration of different spatial encodings. The Spatial Pyramids (left) concatenate the Fisher Vectors of different spatial cells, whereas the Spatial Fisher Vector (right) models the spatial layout by the means and variances of the occurrences of each visual word. Image courtesy of [18].

For the Temporal Fisher Vector we model the temporal location T = {t_1, ..., t_n} instead of the spatial location. Thus the appearance-location tuple is F = (X, T), modeled by:

p(F; \theta) = \prod_{i=1}^{n} \sum_{k}^{K} \pi_k \, p(x_i; \theta_k) \, p(t_i; \theta_k),    (4.40)

where p(t_i; θ_k) is the temporal location model for component k, for which we use the Gaussian distribution \mathcal{N}(t_i; \theta_k). The final Temporal Fisher Vector then includes the gradients from both the appearance and the temporal location model. Furthermore, the calculation of the posterior probability γ_i(k) now also depends on the temporal location model.

Note, this approach is very similar to the temporal augmentation discussed above. If the appearance model is a GMM with a diagonal covariance, all dimensions are independent and can be expressed as a product of univariate Gaussian distributions:

\mathcal{N}(x_i; \mu_k, \sigma_k) = \prod_{d=1}^{D} \mathcal{N}(x_{id}; \mu_{kd}, \sigma_{kd}).    (4.41)

Using the augmented image feature \hat{x}_i ∈ R^{1×(D+1)} this becomes:

\mathcal{N}(\hat{x}_i; \mu_k, \sigma_k) = \prod_{d=1}^{D+1} \mathcal{N}(\hat{x}_{id}; \mu_{kd}, \sigma_{kd})    (4.42)
  = \prod_{d=1}^{D} \mathcal{N}(x_{id}; \mu_{kd}, \sigma_{kd}) \, \mathcal{N}(t_i; \mu_{kt}, \sigma_{kt})    (4.43)
  = p(x_i; \theta_k) \, p(t_i; \theta_k).    (4.44)

This means the temporally augmented Fisher Vector is exactly the same as the Temporal Fisher Vector using a GMM as appearance model and a single Gaussian for the location model. However, when using a different appearance model, e.g. a mixture of Student's-t distributions, this equivalence no longer holds.


Above we introduced three temporal encodings which are inspired by modeling the spatial location in images. Such approaches can be used as an alternative to explicitly modeling the sequential ordering, as we do with the HMM. We will experimentally compare all three alternative temporal encodings to the explicit modeling using an HMM in Section 5.2.3.

5 Experimental evaluation

In this chapter we perform an in-depth experimental evaluation of our proposed Meta-FV. We first describe our experimental setup, including the datasets we use and how we extract the Meta-FV, in Section 5.1. Then in Section 5.2 we perform the experiments and discuss their results.

5.1 Experimental setup

5.1.1 Datasets

We evaluate our models on three recent datasets of photo and video events. Basic statistics are given in Table 5.1.

Photo Event Collection (PEC) [2] This dataset was introduced in 2013 as a benchmark for event classification from Flickr photo collections. It consists of 14 social event classes, e.g. Birthday, Christmas, Hiking and Halloween, and 807 photo collections with over 61K photos in total. We use the suggested experimental setup: per event, 30 collections are selected for training and 10 for testing. Performance is evaluated using mean class accuracy (MCA) over the 14 events.

TrecVID Media Event Detection (MED13) [30] This dataset was part of the 2013 TrecVID benchmark task on Media Event Detection. We follow the 100Ex evaluation procedure in our experiments, which contains 100 positive training videos per event. With over 10K training and 27K testing videos, this is one of the biggest datasets for event detection in videos.


                            PEC [2]   MED13 [30]   CCV [17]
Nr. train streams               420        10461       4659
Nr. test streams                140        27033       4658
Nr. events                       14           20         20
Avg. nr. images / stream       70.0         58.5       40.8
Avg. nr. streams / label       40.0        174.4      393.9
Avg. pos. train streams        30.0        100.0      195.6

TABLE 5.1: Overview of the datasets used in our experiments.

In our experiments we focus on the visual aspect of the videos and therefore use only visual frame-based features. Performance is evaluated using mean average precision (MAP) over the 20 events.

Columbia Consumer Video (CCV) [17] This dataset consists of over 9K user-generated videos from YouTube, with an average video length of 80 seconds. The dataset comes with video level ground-truth annotations for 20 semantic categories, 15 of which are events while the other 5 are object or scene classes. We use the split suggested by the authors which consists of 4,659 training videos and 4,658 test videos. Performance is evaluated using mean average precision (MAP) over the 20 categories.

5.1.2 The Meta Fisher Vectors pipeline

Extracting the Meta Fisher Vectors consists of four consecutive steps, similar to the general steps present in any BoW model. First we extract the visual features from the images/frames and then reduce their dimensionality. Using the dimension-reduced features we train the generative model, and finally we extract the Meta-FV using the learned parameters.

Visual features For each of the images in a stream we extract visual features from a pre-trained DeepNet [19,47]. For the PEC dataset we use all photos belonging to a collection, while for the video streams we sample a frame every 2 seconds. Our DeepNet is an in-house implementation of [47], trained on 15K ImageNet classes from the fall 2012 release [8]. As is common practice, we use the output of the final fully connected layer of the network. This results in a 4,096-dimensional vector, which we whiten so that each dimension has zero-mean and unit-variance. In preliminary experiments we also tested other image features, like traditional Fisher Vectors [36]; though, as expected, the DeepNet features showed significantly better results.
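The network itself is in-house and cannot be reproduced verbatim; as a rough, publicly available stand-in, the following sketch extracts one pooled feature per image with a torchvision ResNet-50 (assuming torchvision >= 0.13) and whitens the result. `stream_images` is a hypothetical list of PIL images, and in a faithful setup the whitening statistics would come from the training set.

```python
import torch
from torchvision import models, transforms

# Stand-in for the in-house DeepNet of [47]: a pre-trained ResNet-50 whose final
# classification layer is replaced so the pooled features are exposed.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    feats = torch.stack([model(preprocess(img).unsqueeze(0)).squeeze(0)
                         for img in stream_images])  # one feature vector per image/frame
feats = (feats - feats.mean(0)) / feats.std(0)       # whiten: zero-mean, unit-variance
```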


Dimension reduction Before using the visual features in the Meta-FV, we reduce their dimensionality. This has two benefits:

• The size of the Meta-FV depends linearly on the size of the visual features; thus reducing their dimensionality leads to a more compact representation of the visual stream.

• In our models we assume a diagonal covariance matrix; to match this assumption the dimensions of the visual features need to be decorrelated.

Such a dimension reduction can be learned in a supervised or an unsupervised manner.

The most common unsupervised dimension reduction is Principal Component Analysis (PCA), which is also used in the Fisher Vector image representation. It defines an orthogonal transformation that converts a set of observations with possibly correlated dimensions into a set of values of linearly uncorrelated dimensions, called principal components. The number of principal components can be less than or equal to the number of original dimensions. PCA is defined such that the first principal component has the highest possible variance, and each further component has the highest possible variance under the constraint that it is uncorrelated (orthogonal) to the previous components. The reduced feature x'_i ∈ R^{1×P} is obtained by:

x'_i = V x_i,    (5.1)

where V is a P × D matrix containing the first P principal components. This means our new feature x' is reduced from dimensionality D to P by leaving out the D − P least significant principal components. Furthermore, all dimensions of x' are linearly uncorrelated.
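With scikit-learn, this unsupervised variant amounts to a few lines; `P` and `train_features` are hypothetical placeholders:

```python
from sklearn.decomposition import PCA

# Learn V from the training features, then apply Eq. (5.1): x' = V x.
# The transformed dimensions are linearly uncorrelated, matching the
# diagonal-covariance assumption of the generative models.
pca = PCA(n_components=P).fit(train_features)
reduced = pca.transform(features)
```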

Alternatively, recent literature on Fisher kernels applied to high-level features [32,38] proposes to use supervised dimension reduction. This follows the intuition that a high-level descriptor of a part of the image or collection is already to some extent discriminative on its own. Inspired by this, we consider a supervised dimension reduction based on a max-margin formulation similar to [45]. As input we have the visual features x_i ∈ R^{1×D} and their labels y ∈ {1, ..., C}. The goal is to learn a feature map V ∈ R^{P×D} that transforms x_i into x'_i ∈ R^{1×P}, where P ≪ D. We can learn such a feature map within the multiclass SVM framework. However, instead of having one linear map for the discriminant function f_i(x), we split the projection matrix in two parts; the mapping to the reduced image feature space:

\Phi_I(x) : R^{1×D} \to R^{1×P},    (5.2)

and the mapping to the labels:

\Phi_W(i) : \{1, \ldots, C\} \to R^{P×1}.    (5.3)

For both we use a linear mapping, i.e. Φ_I(x) = V x and Φ_W(i) = w_i, where w_i is the i-th column of the P × C dimensional matrix W. Combining both feature maps, our model is defined by the following discriminant function:

f_i(x) = \Phi_W(i)^\top \Phi_I(x) = w_i^\top V x.    (5.4)

Using this in the multiclass SVM formulation from Crammer and Singer [4], we obtain the following loss function:

\operatorname*{argmin}_{W,V} \; \frac{1}{2} R(W, V) + \frac{C}{n} \sum_{i=1}^{n} \max_{c \neq y_i} \left[ 1 - w_{y_i}^\top V x_i + w_c^\top V x_i \right]_+,    (5.5)

where R(W, V) is the regularization term, which we define as the l2 norm of a vector containing all parameters (i.e. all elements in W and V), and [z]_+ is the hinge loss, i.e. [z]_+ = max(0, z). This function is jointly optimized using gradient-based optimization. To have a reasonably good starting point for the optimization we initialize V with the result of the PCA dimension reduction. After the optimization is finished we discard the classification matrix W and obtain our reduced features x' with the linear transformation V as in Eq. (5.1) for PCA. Since the dimensions of the transformed features x'_i might still be correlated after the transformation, we follow [38, 32] and apply a final PCA to decorrelate the transformed features before using them in the Meta-FV. Note, this final PCA only decorrelates the feature space, it does not further reduce the dimensionality of x'_i. Using this max-margin formulation, we reduce the feature dimensionality while optimizing the discriminative power of the features. Furthermore, due to the final PCA these reduced features are also decorrelated and match the diagonal covariance assumption of our models.

Simonyan et al. [38] also proposed an efficient approach for supervised dimension reduction. This approach learns a multiclass SVM and projects the original input features into a space of classifier scores. However, such an approach is not applicable to the Meta-FV. For our problem of event classification we only have up to 20 classes in each dataset, whereas [38] considers image classification with 1,000 classes. Clearly, a 20-dimensional vector is too small to capture the differences between two images.

Training the generative model The parameters of our generative models are obtained by maximum likelihood estimation on the training set, using standard expectation maximization (EM) algorithms. For the sequential models, we obtain the parameters of the emission distribution from the corresponding independent model, and then only estimate the transition parameters. In this way we obtain similar performance to estimating all parameters with the sequential expectation maximization. However, we reduce the computational cost and are able to compare the sequential and independent models more directly, since all components have the same parameters.

For training the mixture models we follow various best practices from [34]. Since training a GMM is non-convex, different initializations might lead to different solutions. Therefore we incrementally train our mixture models. We initialize the procedure with a single component, for which a closed-form solution exists, then split it into two components by slightly perturbing the mean and re-estimating the parameters. We repeat splitting the components and re-estimating the model parameters until we have reached the desired number of components. Further, we apply a prior on the posterior distribution which favours equally weighted mixture components.
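As an illustration, a minimal sketch of this split-and-retrain procedure is given below, assuming scikit-learn's GaussianMixture for the EM re-estimation; the perturbation scale eps, the doubling schedule, and the omission of the equal-weight prior are simplifications of our actual training.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def incremental_gmm(X, n_components, eps=0.1):
    """Split-and-retrain GMM training; assumes n_components is a power of 2.
    (The prior favouring equally weighted components is omitted here.)"""
    # Closed-form single-component solution: the sample mean and variance.
    means = X.mean(axis=0, keepdims=True)
    sigma = X.std(axis=0, keepdims=True)
    gmm, k = None, 1
    while k < n_components:
        # Split every component by perturbing its mean along the std. dev.
        means = np.vstack([means + eps * sigma, means - eps * sigma])
        k *= 2
        gmm = GaussianMixture(n_components=k, covariance_type='diag',
                              means_init=means, max_iter=100).fit(X)
        means, sigma = gmm.means_, np.sqrt(gmm.covariances_)
    return gmm
```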

In the maximization step of EM for the StMM, there exists a closed-form solution for the mean, variance and weight. However, for the degrees of freedom ν no closed-form solution exists, and it can only be estimated using a computationally expensive iterative procedure. Preliminary experiments showed that the influence of the degrees of freedom ν on the final results is small. In order to perform efficient training we fix ν = 100 for all our experiments.
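For illustration, the sketch below shows the standard latent-scale weight of the E-step for Student's-t mixtures under a diagonal covariance, which is cheap to evaluate once ν is fixed; the function and its default ν = 100 mirror the choice above, while the implementation details are our own and not taken from the derivation.

```python
import numpy as np

def student_t_weights(X, mu, var, nu=100.0):
    """Latent-scale weights of one Student's-t component (diagonal cov.)."""
    # Squared Mahalanobis distance of each sample to the component mean.
    delta = (((X - mu) ** 2) / var).sum(axis=1)
    D = X.shape[1]
    # For large nu the weights tend to 1 and the component becomes Gaussian.
    return (nu + D) / (nu + delta)
```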

Extracting the Meta-FV Finally, we use the parameters of the trained model to extract the Meta-FVs for both the training and the test set. Following common practice we then apply power-normalization and ℓ2 normalization. For the experiments on the PEC dataset we use the multi-class SVM implementation of Liblinear [9]. For our retrieval experiments on the MED13 and CCV datasets, we train binary SVMs for each event using the VLFeat Pegasos implementation [37, 42] and use the SVM scores to rank the videos.
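A minimal sketch of the power- and ℓ2-normalization applied to each extracted Meta-FV follows; the exponent α = 0.5 (the signed square root) is the common choice in the Fisher Vector literature, stated here as an assumption.

```python
import numpy as np

def normalize_meta_fv(phi, alpha=0.5):
    # Power-normalization: signed |.|^alpha (square root for alpha = 0.5) ...
    phi = np.sign(phi) * np.abs(phi) ** alpha
    # ... followed by l2 normalization.
    norm = np.linalg.norm(phi)
    return phi / norm if norm > 0 else phi
```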

5.1.3 Experiments

In this section we give a brief overview of the experiments we designed to evaluate the Meta-FV. The experimental details, results and evaluation are described in Section 5.2.

Experiment 1: Properties of Meta Fisher Vectors In the first experiment we evaluate the properties of the Meta-FV. The goal is to find a set of parameters which perform well for each model and dataset.

Experiment 2: Performance of different generative models In this experiment we perform an in-depth comparison of the different generative models proposed for the Meta-FV.


Experiment 3: Comparison to alternative temporal encodings Here we compare the alternative encoding methods to the proposed sequential encoding of the HMM and to the independence assumption.

Experiment 4: Combination with mean features The fourth experiment addresses the question of how the Meta-FV performs in comparison to using the mean feature, and whether the mean feature and Meta-FV are complementary.

Experiment 5: Comparison with state-of-the-art In this last experiment we compare the Meta-FV with alternative state-of-the-art event recognition approaches.

5.2 Results and evaluation

5.2.1 Properties of Meta Fisher Vectors

In the first set of experiments we evaluate some of the basic properties of the Meta-FV. We take the GMM as our baseline model and explore how performance is influenced by the number of PCA dimensions, the number of mixture components, the gradients used and the different approximations for the Fisher information matrix. To keep the overview concise, we use a set of default parameters and vary only the parameter of interest. The default parameters are d = 256 PCA dimensions and k = 8 mixture components; moreover, we extract the Meta-FV w.r.t. the mean only, using the analytical approximation for the Fisher information matrix.

Influence of PCA and number of mixture components In the first experiment we consider the number of PCA dimensions d and the number of mixture components k. The results for all models are shown for each dataset in Tables 5.2-5.4. We observe that the optimal parameters for d and k depend on the dataset, which was to be expected since the statistics (in terms of the amount of training data) of the datasets vary significantly. Further, the optimal parameters are stable across our four models; only on the smaller datasets is there some variation, though we attribute this to the small number of test examples.

In general it holds that PCA is helpful, and that modest dimensionality reduction is beneficial. There seems to be a correlation between the dataset size and the optimal number of dimensions and components. For example, on the biggest dataset (MED13), performance saturates at d = 2,048, while for the other datasets saturation starts at lower d values.

For the number of components k in the mixture model a similar pattern is visible in the results. On the largest dataset, the optimal value is the highest.


(A) GMM
        d=128   d=256   d=512   d=1024  d=2048  d=4096
k=4     81.4    83.6    81.4    83.6    80.0    77.1
k=8     81.4    80.7    80.7    82.9    78.6    77.1
k=16    81.4    81.4    80.0    82.1    77.9
k=32    81.4    82.1    82.9    78.6
k=64    80.7    84.3    82.1
k=128   79.3    80.7

(B) StMM
        d=128   d=256   d=512   d=1024  d=2048  d=4096
k=4     78.6    82.1    81.4    81.4    81.4    77.9
k=8     83.6    85.7    82.9    82.1    79.3    77.9
k=16    85.0    82.1    82.9    82.9    81.4
k=32    80.0    83.6    78.6    82.9
k=64    80.7    79.3    79.3
k=128   81.4    80.0

(C) HMM
        d=128   d=256   d=512   d=1024  d=2048  d=4096
k=4     82.1    82.1    81.4    82.1    78.6    77.1
k=8     80.7    83.6    82.1    82.9    78.6    77.9
k=16    82.1    81.4    82.1    80.7    77.9
k=32    77.9    81.4    78.6    81.4
k=64    79.3    79.3    82.1
k=128   75.0    75.7

(D) StHMM
        d=128   d=256   d=512   d=1024  d=2048  d=4096
k=4     80.7    82.1    80.0    81.4    79.3    78.6
k=8     80.7    80.7    83.6    80.7    78.6    78.6
k=16    84.3    78.6    85.0    82.9    80.7
k=32    80.0    82.1    77.1    80.7
k=64    77.1    77.9    80.7
k=128   74.3    78.6

TABLE 5.2: Detailed analysis of the influence of the PCA dimension reduction d and the number of components k for all four generative models on the PEC dataset.

(A) GMM
        d=128   d=256   d=512   d=1024  d=2048  d=4096
k=8     66.2    66.8    67.1    66.9    66.2    66.8
k=16    67.4    67.9    67.4    67.1    65.9
k=32    63.4    64.5    66.8    67.4
k=64    64.1    64.9    66.2
k=128   63.9    64.8

(B) StMM
        d=128   d=256   d=512   d=1024  d=2048
k=8     66.7    68.0    68.1    68.6    67.4
k=16    66.0    68.1    69.0    68.7    67.5
k=32    63.3    66.1    66.6    68.3
k=64    64.7    64.0    66.4
k=128   64.3    64.7

(C) HMM
        d=128   d=256   d=512   d=1024  d=2048
k=8     63.6    65.4    66.4    66.6    66.0
k=16    64.2    66.0    66.2    66.6    65.7
k=32    59.2    61.6    65.4    66.6
k=64    59.6    62.0    64.5
k=128   59.7    61.9

(D) StHMM
        d=128   d=256   d=512   d=1024  d=2048
k=8     63.1    64.8    66.5    66.5    66.1
k=16    61.9    65.3    66.5    66.5    65.8
k=32    58.9    62.8    63.9    65.8
k=64    60.3    60.4    63.8
k=128   59.1    61.1

TABLE 5.3: Detailed analysis of the influence of the PCA dimension reduction d and the number of components k for all four generative models on the CCV dataset.

(A) GMM
        d=128   d=256   d=512   d=1024  d=2048  d=4096
k=8     31.4    32.5    33.7    35.2    36.8    36.3
k=16    32.7    33.4    34.1    34.5    36.3
k=32    34.0    34.5    34.8    35.0
k=64    34.3    35.0    35.6
k=128   35.3    35.0

(B) StMM
        d=512   d=1024  d=2048  d=4096
k=8     33.8    36.3    37.2    34.6
k=16    35.3    37.0    37.0
k=32    34.9    36.8
k=64    35.9

(C) HMM
        d=1024  d=2048
k=8     34.8    36.6
k=16    33.9    36.4

(D) StHMM
        d=1024  d=2048
k=8     35.4    36.8
k=16    35.7    35.9

TABLE 5.4: Detailed analysis of the influence of the PCA dimension reduction d and the number of components k for all four generative models on the MED13 dataset.


Dimension reduction   d=128   d=256   d=512
PCA                   66.2    66.8    67.1
Supervised            66.3    66.5    66.3

TABLE 5.5: Comparison of unsupervised dimension reduction (PCA) and supervised dimension reduction (Eq. (5.5)) on the CCV dataset. We vary the dimensionality d and fix the number of components k = 8.

It is important to notice that the dimensionality of the final Meta-FV is k × d, i.e. linear in both the number of mixture components and the input dimensions; for example, with k = 8 and d = 2,048 the Meta-FV w.r.t. the mean has 16,384 dimensions. In general it holds that higher-dimensional Meta-FVs are beneficial only when significant training data is available, both for estimating the parameters of the generative model and for training the SVM classifiers.

Dimension reduction: PCA versus supervised We explore the influence of the dimension reduction technique on the classification performance. Discriminative dimension reduction has been shown to be advantageous over PCA for large-scale image classification [38], using a similar approach of defining a Meta Fisher Vector over various image regions. Therefore we compare PCA dimension reduction to the supervised dimension reduction described by Eq. (5.5).

The results are presented in Table 5.5. We observe that the supervised dimension reduction is on par with PCA. For a small dimensionality d the supervised dimension reduction slightly improves over PCA; however, this advantage vanishes for higher dimensions, where PCA even outperforms the supervised dimension reduction. This contrasts with the recent literature. The reason could be that our DeepNet features are already highly discriminative for the task at hand, especially compared to the more generic Fisher Vectors used in [38].

Assessing the components of the Meta-FV Finally, to explore the properties of the different components of the Meta-FV, we evaluate the influence of the different gradients, i.e. with respect to the weights, mean and/or variances. We once again use the GMM with k = 8 as a baseline and vary the number of PCA dimensions d; in addition we fix d = 256 and vary the number of components k.

In Figure 5.1 we show the performance on the MED13 dataset. We observe that the mean clearly shows the strongest performance with a MAP of over 30%, whereas the variance only reaches about 15%. As expected, the weight shows a very low performance when varying d, since it contributes only k = 8 dimensions to the resulting Meta-FV; its performance steadily increases when k is varied. Combining all gradients could increase the performance, but at the cost of computational complexity.


[Figure: MAP on MED13 as a function of the PCA dimension d (left) and the number of components k (right), for the Meta-FV w.r.t. µ, σ, π, and all gradients combined.]

FIGURE 5.1: Comparison of using the Meta-FV w.r.t. the weight, mean and variance parameters of the GMM on MED13.

FIM approximation   GMM    StMM   HMM    StHMM
Identity            27.6   27.2   28.2   28.5
Empirical           34.7   34.6   34.5   33.8
Closed form         36.8   37.2   36.6   36.8

TABLE 5.6: Overview of different approximations for the Fisher information matrix on MED13.

We conclude that good results are obtained by using only the gradient w.r.t. the mean.
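To make this concrete, the sketch below computes a Meta-FV w.r.t. the means only for a single visual stream, in the standard Fisher Vector form with the analytical diagonal FIM normalization; the use of scikit-learn's GaussianMixture and the variable names are illustrative assumptions, not our exact implementation.

```python
import numpy as np

def meta_fv_mean(X, gmm):
    """Meta-FV of one stream X (n x d) w.r.t. the means of a fitted
    sklearn GaussianMixture with covariance_type='diag'."""
    n = X.shape[0]
    q = gmm.predict_proba(X)                      # n x k posteriors
    mu, var, pi = gmm.means_, gmm.covariances_, gmm.weights_
    G = np.empty_like(mu)                         # k x d gradient blocks
    for j in range(mu.shape[0]):
        # (1 / (n sqrt(pi_j))) * sum_i q_ij (x_i - mu_j) / sigma_j
        G[j] = (q[:, [j]] * (X - mu[j]) / np.sqrt(var[j])).sum(axis=0)
        G[j] /= n * np.sqrt(pi[j])                # analytical FIM scaling
    return G.ravel()                              # k*d dimensional Meta-FV
```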

Approximations of the Fisher information matrix In this experiment we study the influence of the Fisher information matrix approximations, and evaluate the analytical approximation, the identity approximation and the empirical approximation as discussed in Section 3.3. The results are presented in Table 5.6 for all four generative models on the MED13 dataset, using k = 8 and d = 2048. First, we observe that the behavior is very similar for all generative models; approximating the Fisher information matrix with the identity matrix performs worst (MAP ±27%), the empirical estimation brings a strong improvement to ±34%, and finally the analytical approximation increases performance to ±37%. For the GMM model, our results are in line with the findings of [36], where the analytical approximation also obtains the best performance for the Fisher Vector image representation. While we have made strong additional assumptions for the closed-form approximations of our generative models, the resulting approximations work well in practice.
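For reference, a sketch of the diagonal empirical FIM normalization is given below: each Meta-FV dimension is scaled by the inverse square root of the average squared (unnormalized) score over the training streams. The small epsilon for numerical stability is an assumption on our part.

```python
import numpy as np

def empirical_fim_normalize(S_train, S):
    """Normalize scores S with the diagonal empirical FIM estimated from
    the unnormalized training scores S_train (n x dim)."""
    # Diagonal empirical FIM: F_dd ~= E[s_d^2] over training streams.
    fim_diag = (S_train ** 2).mean(axis=0)
    return S / np.sqrt(fim_diag + 1e-12)   # epsilon avoids division by zero
```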

For the StMM we made the constant distance assumption, which assumes that the Mahalanobis distance in a high-dimensional space converges to a constant c.


[Figure: histograms of the distances of all datapoints to their assigned cluster on (A) MED13 and (B) CCV.]

FIGURE 5.2: Histogram showing the distance of all datapoints to their assigned cluster on MED13 (left) and CCV (right).

In Figure 5.2 we show the distance of all datapoints to their assigned cluster. Indeed, we see that the distance is fairly similar for most datapoints. On the MED13 dataset, 87% of the datapoints lie between 14 and 28, i.e. the assumed constant c changes by less than a factor of two, and even 99% lie between 10 and 40, where c changes by less than a factor of four. Similarly, on the CCV dataset 89% of the datapoints lie between 4 and 8 and 99% between 3 and 12, changing the constant c by less than a factor of two and four, respectively. The absolute difference of the constant c between both datasets is due to the usage of the optimal dimensionality: d = 2048 for MED13 and d = 512 for CCV.

Furthermore, we note that the sequential encoding models (HMM and StHMM) outperform the independent image models when the identity approximation is used. Unfortunately, these improvements vanish when the empirical or analytical approximations are used. Likely, this is caused by the fact that the diagonal Fisher information approximations are not valid in the sequential models, due to the dependencies between the frames (or images).

Conclusions from this exploration From this section we conclude the following properties of the Meta-FV:

• The dimensions d and mixture components k depend on (the size of) the dataset, though not on the generative model. Furthermore, supervised dimension reduction does not improve over PCA. From now on we use PCA with the parameters (d, k) = (256, 8) for PEC, (512, 16) for CCV, and (2048, 8) for MED13, for all models.

• For the high-dimensional Meta-FV the gradient w.r.t. the mean performs close to optimal; therefore we use this for the remaining experiments.
