
Improving Semantic Query Generation through User Feedback on Word Embeddings

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Douwe Knook

10285261

MASTER INFORMATION STUDIES

HUMAN-CENTERED MULTIMEDIA

FACULTY OF SCIENCE

UNIVERSITY OF AMSTERDAM

July 17, 2016


Improving Semantic Query Generation through User Feedback on Word Embeddings

Douwe Knook

University of Amsterdam

douweknook@gmail.com

ABSTRACT

General-purpose video retrieval systems allow users to search using an unlimited range of semantic textual queries. However, such systems can only be trained to detect a limited set (i.e. thousands) of concepts. Semantic query generation (SQG) seeks to bridge this semantic gap through query expansion. The presented method employs automatic SQG based on a model of machine learned word embeddings called Word2Vec. Additionally, users are asked to provide relevance feedback on the selected concepts. The system uses this feedback to learn an improved word embeddings model with the goal of improving retrieval results continuously. The method is evaluated on the TRECVID MEDTRAIN data set and compared against other relevance feedback methods such as re-weighting and query point modification. We find that the presented method improves upon the baseline system without feedback, but cannot outperform re-weighting. However, we conclude that with further optimization of our method competitive performance is possible.

1. INTRODUCTION

Central to the development of concept-based video retrieval systems is the notion of the semantic gap: the issue of translating low-level features (i.e. colors, textures, shapes) of a video to the semantic concepts they denote [25]. Closing this gap is among the main objectives in many state-of-the-art video retrieval systems. However, a constraining factor is the number of concepts a system can be trained to detect. While current state-of-the-art systems are able to detect an increasingly large number of concepts (i.e. thousands), this number still falls far behind the near infinite range of possible (textual) queries general-purpose heterogeneous video search systems need to be able to handle [2].

An area of video retrieval in which the semantic gap proves remarkably hard to bridge is that of event detection. Events can be defined as complex queries that consist of a multitude of concepts (objects, actions and scenery among others). Examples of event queries are Attempting a bike trick, Working on a woodworking project and Giving directions to a location. To bridge the semantic gap, semantic query generation (SQG) can be used to select a set of concept detectors that represent the event. For instance, Attempting a bike trick can be represented by concepts such as bike trick, attempt, trick, bike and flipping bike.

SQG works well when there is a direct match between a concept detector and the full query. When no such match exists however, query expansion can be used to create a list of detectors that best represents the query [17]. In automatic SQG this introduces a problem as sometimes non-relevant, or less representative, concept detectors are included in the query expansion besides the relevant detectors. This occurs especially in cases when queries are complex and relevant detectors are sparse.

One approach to improve performance when less or non-relevant detectors are selected is the use of relevance feedback. Users annotate the relevance of the retrieved videos and the system uses this input to create an improved set of results. However, in video retrieval such judgment can be hard for users as it is often too time consuming to watch all retrieved videos. While most systems show several key-frames, these can easily fail to accurately communicate the entire content of the video. Rather than feedback on the selected videos, feedback on the selected concepts in SQG could solve this problem.

This paper therefore focuses on the use of relevance feedback on automatically selected concepts to improve system performance. The paper considers two main research questions: (1) How can system performance be improved through user feedback on the semantic relevance of automatically selected concept detectors? (2) How can the semantic space be altered based on relevance feedback such that more accurate retrieval results are achieved?

To answer these questions, we performed several experiments on the TRECVID MEDTRAIN data set. The system used in these experiments employs automatic SQG based on a model of machine learned word embeddings (a semantic space) to create an initial list of concept detectors that represent the query. Next, users are asked to annotate any selected concepts that are non-relevant, i.e. that the user believes do not represent the query well. We present a method of using this feedback to alter the word embeddings model to make it better reflect the semantic meaning of the trained concepts. We compare this method against the traditional relevance feedback approaches of re-weighting and query point modification.

Our experiments show improved performance when relevance feedback is applied to the automatic SQG, demonstrating that users are capable of identifying semantically less relevant detectors. We find that re-weighting performs best of the explored methods in terms of Mean Average Precision (MAP). However, none of the applied methods outperforms manual SQG by an expert. This result can largely be attributed to the limitations of the word embeddings model. With regard to altering the semantic space we find improvement compared to the baseline system, but that performance differs greatly depending on the amount and order of alterations. The results indicate that with further parameter optimization, competitive results could be reached.

This paper is structured as follows: Section 2 contains a discussion of related work, Section 3 covers the implementation of the base system, the SQG method and the application of feedback. Section 4 explains the method of the experiments while Section 5 contains the results. Finally, Sections 6 and 7 provide the discussion and conclusion.

2. RELATED WORK

This section discusses work related to semantic query generation (SQG) and relevance feedback. Three general approaches to SQG are examined, including the use of word embedding models. Additionally, several methods of implementing relevance feedback are discussed.

2.1 Semantic Query Generation

The goal of SQG is to translate a textual query into a set of low-level features that can be detected within a video. According to De Boer et al. [3] SQG can be considered a two-step approach. First, the gap between low-level feature descriptors (the concept detectors) and concept labels needs to be bridged. These concept labels and the related concept detectors are stored in a vocabulary of a limited size. This brings us to the second step: the gap between the concept labels in the vocabulary and the full semantics of textual queries. Full semantics includes the meaning of the query words, but can even denote intent of the user [3]. Liu et al. [12] identify five categories of approaches to SQG: using object ontologies, using machine learning methods, using relevance feedback, generating semantic templates, and making use of additional textual and visual information in Web image retrieval. In the following subsections the first three approaches are discussed in more detail.

2.1.1 Knowledge bases

Approaches in the first category defined by Liu et al. [12] use ontologies or knowledge bases to select the relevant concepts. Such knowledge bases map concepts and their relations and can be created by experts or consist of common knowledge. Expert knowledge bases have the advantage of containing more detailed and uniform information and have shown good performance. However, their creation requires the dedicated effort of experts. For events, a recent expert knowledge base is EventNet [28]. Examples of freely available common knowledge bases are Wikipedia [16] and WordNet [15], which are both regularly used in concept-based video retrieval [27, 9]. De Boer et al. [3] compare performance between an expert knowledge base and two common knowledge bases, ConceptNet and Wikipedia, in query expansion. They find that SQG using relevance feedback or through manual selection generally works better than using a knowledge base. Also, expert knowledge bases do not necessarily perform better than common knowledge bases [3]. SQG with knowledge bases is often done by expanding the query with concepts most similar or otherwise related (such as hypo- or hypernyms in WordNet) to the query. Elements like the similarity measure and the number of concepts to be used for query expansion differ per publication and experiment. Conclusive results on which method works best have not yet been found.

Knowledge bases have proven useful in cases of semantic reasoning [2]. SQG using knowledge bases tends to work best for dedicated image or video retrieval systems where a clear ontology can be defined. For general-purpose video retrieval systems however, SQG using knowledge bases has several disadvantages. First, knowledge bases are limited in the number of terms and relations they contain. If a query word does not appear in the knowledge base, finding related terms becomes problematic. Secondly, the creation of semantic knowledge bases is rather subjective. The more complex and general a knowledge base becomes, the harder it is to find consensus on how to define relations within the ontology. Since the presented methods are intended for general-purpose video retrieval, SQG using knowledge bases is not used in our approach.

2.1.2 Machine learning methods

The second category Liu et al. [12] distinguish is SQG using machine learning techniques to expand the query by automatically selecting good concepts for expansion. Such approaches are often used in tasks where sample query images or videos are provided as this directly gives a (small) set of data to train the model. However, machine learning can be used in zero example video retrieval as well. In current state-of-the-art systems machine learning approaches in the form of semantic word embeddings have become increasingly popular [18, 5, 9, 24].

An often used set of models is Word2Vec, which uses a deep learning approach to produce semantic word embeddings. However, the term deep learning can be disputed here as the models are trained using only a two-layer neural network. The approach was developed by Mikolov et al. and is detailed in two research papers [13, 14]. Widely used Word2Vec models are the Skip-gram and Continuous Bag-of-Words (CBOW) model. The Skip-gram model seeks, given a word w and a window size of n words, to predict the contextual words c. Alternatively, this can be notated as the probability P(c|w). Conversely, CBOW tries to predict a word given the context, or probability P(w|c).

The Word2Vec models operate on the hypothesis that words with similar meanings occur in similar contexts [6]. Supporting this hypothesis is the logic found in relationships between terms within the model. This logic follows examples such as King - Man + Woman = Queen and Paris - France + Italy = Rome. It is this semantic logic within Word2Vec that makes it also suitable for query expansion.
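To make this concrete, the following is a minimal sketch of reproducing such analogy arithmetic with the gensim library, assuming the pre-trained Google News vectors that are also used in Section 3.1; the file path and all names are illustrative assumptions rather than code from the presented system.

# Sketch: Word2Vec analogy arithmetic with gensim; illustrative only.
from gensim.models import KeyedVectors

# Assumed local copy of the pre-trained Google News vectors.
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# King - Man + Woman ~ Queen
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Paris - France + Italy ~ Rome (this vocabulary is case-sensitive)
print(model.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1))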

Several systems [1, 9, 17, 24] have been shown to effectively use Word2Vec deep learning vectors for query expansion in image retrieval. AlMasri et al. [1] use a word embeddings approach in comparison to a pseudo-relevance feedback and a mutual information approach. They show significant improvement on multiple CLEF collections. As Word2Vec is a vector-space model using high dimensional vectors, simple distance measures such as cosine and Euclidean distance are used to calculate similarity between words. A set of words with a high similarity is then used to expand the query.

A problem of Word2Vec with regard to SQG is that it only contains vector representations for single words. Lev et al. [11] explore the pooling of a multi-set of Word2Vec single word vectors to represent sentences. They find that simple mean pooling already performs well but that in some situations Fisher Vector pooling can further improve results. In general they note that conventional pooling techniques are powerful enough to obtain competitive performance.

2.1.3 Relevance feedback

The use of relevance feedback stems from the dynamic nature of information seeking [23]: information needs can be continuously changing and be unique to each user. Relevance feedback allows users to adjust or expand their query so that the results better reflect their information need compared to results initially returned by the system. In SQG this is useful as semantic meaning can be rather subjective and thus unique to a specific user.

Zhou and Huang [30] describe the general approach of relevance feedback in concept-based image retrieval as a three-step scenario. (1) The retrieval system provides an initial ranking of results based on the user input (i.e. the query). (2) The user provides judgment on the retrieved documents in terms of whether, and to what degree, they are relevant. (3) Based on these judgments, the system learns and provides a new (improved) ranking.

In SQG using relevance feedback the approach is similar. An initial set of concepts is selected using an ontology or machine learning techniques. Next, the user is asked to delete non-relevant concepts and/or adjust the importance weighting of concepts. An important difference is that in SQG using relevance feedback, users do not review the retrieved results (the videos) but rather the query formulation presented by the system. Therefore, it is not so much the general performance of the system they judge, but the extent to which the system understands the query correctly semantically.

In step 3, where the system learns and provides a new ranking, there are two typical approaches: re-weighting and query point modification (QPM) [12]. In re-weighting, the weights embedded in the query are updated. In SQG this means altering the weight of concept detectors to optimize the relative importance of each concept for this specific query. The goal of QPM is to alter the query representation towards the relevant samples and away from the negative samples in the user feedback. An often used method for this is Rocchio's algorithm [21]:

Q' = Q + \alpha \left( \frac{1}{|D_R|} \sum_{i \in D_R} D_i \right) - \beta \left( \frac{1}{|D_N|} \sum_{j \in D_N} D_j \right)

where Q and Q' are the initial and modified query respectively and D_R, D_N are the collections of relevant and non-relevant samples that are obtained through relevance feedback. α and β are tuning parameters. Both re-weighting and QPM are widely used in state-of-the-art video retrieval systems that employ relevance feedback [22, 10, 8, 26].

Zhou and Huang [30] also list three constraints of relevance feedback. First, there is a small sample size as users generally provide only a small number of relevance judgments. Additionally, there exists an asymmetry in the training samples. While positive samples tend to share the same characteristics that make them positive, negative samples can be considered negative for a range of different reasons, i.e. they do not necessarily share the same negative features. Lastly, relevance feedback requires real time execution, limiting the amount of computation that can be done before returning the new ranking.

Figure 1: The retrieval pipeline including SQG combining machine learning and relevance feedback.

With regard to commonly used relevance feedback methods, He et al. [7] provide a useful dichotomy between short and long-term learning. Most systems use a short-term approach, only using the relevance feedback during the current query session. He et al. [7] however use a long-term model to construct a semantic space which gradually improves performance over time as more relevance knowledge is accumulated. Our method of continuously updating the model also aims for long-term learning.

3. IMPLEMENTATION

To implement the proposed methods an adjusted version of the AVES (Automatic Video Event Search) system [20] developed at TNO was used. The original module for relevance feedback on retrieved videos is removed, while the automatic SQG module is extended with relevance feedback. Figure 1 provides an overview of the adjusted system. The system has as input the event query, a concept-bank of detectors and their labels, and the Word2Vec word embedding model. Next, it uses automatic SQG (explained in Section 3.1) to select a set of best detectors for this particular query. The user is asked to provide feedback on these concepts by labeling them relevant or non-relevant. This feedback is then used to update the Word2Vec model after which the automatic SQG selects a new set of concepts. This new, improved set of concepts serves as the input for the scoring function, which then provides a ranked list of video results.

A concept-bank of a total of 2048 concept detectors was created. These detectors have been trained using deep convolutional neural networks and SVMs. Table 1 details the data sets and their number of detectors.

Table 1: An overview of the data sets used and their number of concept detectors.

Dataset            #Detectors
ImageNet [4]       1000
Places [29]        205
SIN [19]           346
TRECVID MED [19]   497
Total              2048


Figure 2: The Automatic SQG is a two-step system to translate the query into a set of concept detector labels. In step 1, mean pooling is used to merge the vectors. In step 2, cosine similarity is used as the similarity measure. Event query Attempting a board trick is used as an example.

For evaluation, the system is implemented on the MEDTRAIN data set from the TRECVID Multimedia Event Detection benchmark [19]. This data set consists of 5594 videos (75.45 key-frames on average per video). After extracting 1 key-frame every 2 seconds for each video in the data set, the trained detectors are used to score each of these key-frames in the form of a 2048-dimensional vector representation. Additionally, a background score is calculated for each detector using the TRECVID Event Background collection (5000 videos) [19] in order to decrease noise. The scoring function itself is discussed in Section 3.3.

3.1 Automatic SQG

To translate the event query automatically into an initial set of concept detectors, a Word2Vec Skip-gram model is used. This form of SQG using machine learning employs word embeddings to create a feature vector for each word in the model. As discussed in Section 2.1.2, these word embeddings also convey a certain level of semantic relations. The Word2Vec model used contains pre-trained vectors for 3 million words from the Google News data set (https://code.google.com/archive/p/word2vec/). The embedding of each word is expressed in a 300-dimensional feature vector.

The SQG approach taken consists of two steps: (1) the translation from the event query to a single Word2Vec vector representation and (2) finding the concept labels closest related to the query (see Figure 2). The first step is important as our Word2Vec model only contains embeddings for single words. A query like Attempting a board trick is broken down into three separate words (no vector for stopword 'a'), each with its own feature vector. Mean pooling is used to merge these separate vectors into a single feature vector to represent the query. As shown by Lev et al. [11], mean pooling is a simple pooling method that performs well.

Once a single vector representation of the query is created, the model is searched for those words closest to the query. However, only words that are part of the concept labels of the detectors in the concept-bank are considered. This severely limits the number of possibilities from 3 million words to 2048 concept labels. To measure similarity between the vector of each word and the query vector, cosine similarity is used. This is the same distance measure used by Mikolov et al. in their original outline of the Word2Vec model [13] and it performs well at finding semantic similarities. After we score each word with a similarity score w (where w is bounded by 0 ≤ w ≤ 1), we select the top n concept detectors whose labels have the highest similarity scores.
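To make the two steps concrete, the sketch below shows one possible implementation using gensim and numpy; the model path, the handling of concept labels and all names are illustrative assumptions rather than the exact AVES code.

# Sketch of the two-step automatic SQG (mean pooling + cosine similarity); illustrative only.
import numpy as np
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)  # assumed local path

def query_vector(query, stopwords={"a", "an", "the", "of"}):
    # Step 1: mean-pool the vectors of the query words present in the model.
    words = [w for w in query.lower().split() if w not in stopwords and w in model]
    return np.mean([model[w] for w in words], axis=0)

def top_detectors(query, concept_labels, n=5):
    # Step 2: rank concept labels by cosine similarity to the pooled query vector.
    q = query_vector(query)
    q = q / np.linalg.norm(q)
    scored = []
    for label in concept_labels:
        words = [w for w in label.lower().split() if w in model]
        if not words:
            continue
        v = np.mean([model[w] for w in words], axis=0)
        v = v / np.linalg.norm(v)
        scored.append((label, float(np.dot(q, v))))
    return sorted(scored, key=lambda x: x[1], reverse=True)[:n]

Multi-word labels are mean-pooled here as well, which corresponds to the approach chosen in Section 4.5.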

Algorithm 1: Algorithm for updating word vectors

Data: Query vector q, Concept vector w
Result: Updated Word2Vec feature vector
if concept is relevant then
    for each element i in w do
        w_i = w_i + α * (q_i − w_i)
    end
end
if concept is non-relevant then
    for each element i in w do
        w_i = w_i − α * (q_i − w_i)
    end
end

3.2 Relevance feedback

After the automatic selection of the top n most similar concept detectors from the concept-bank, the user is asked to provide relevance feedback on these selected concepts. This gives a set of relevant and a set of non-relevant detectors. The collected relevance feedback is used to adjust the Word2Vec model so that it better reflects the semantic relations between query and concepts. Feedback is applied using an approach somewhat similar to that of query point modification, but rather than altering the query vector, the vectors in the model are altered. Relevant concepts are moved closer to the query while non-relevant concepts are moved further away in vector space. Algorithm 1 details the formula used to update the concept vectors. Since the updated vectors are no longer guaranteed to have unit length, which the similarity computation assumes, a normalization is applied to make sure altering the vectors does not impact the accuracy of the cosine similarity measure.
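A minimal sketch of this update step is given below, assuming plain numpy arrays for the query and concept vectors; the function name is an illustrative assumption and α = 0.1 follows Section 4.6.1.

# Sketch of the Algorithm 1 update with re-normalization; illustrative only.
import numpy as np

def update_concept_vector(w, q, relevant, alpha=0.1):
    # Move the concept vector towards (relevant) or away from (non-relevant)
    # the query vector, then re-normalize to unit length.
    w = np.asarray(w, dtype=float)
    q = np.asarray(q, dtype=float)
    if relevant:
        w = w + alpha * (q - w)
    else:
        w = w - alpha * (q - w)
    return w / np.linalg.norm(w)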

By continuously updating the model, we aim to fine-tune the semantic relations between words. Since the Word2Vec model is trained on 3 million single words and our concept-bank only contains 2048 concept labels, we aim to optimally position these labels in the semantic space.

3.3 Scoring function

After updating the model a new set of top concept detectors is retrieved using the same method of automatic SQG presented in Section 3.1. These top n concept detectors (D) are then used to score and rank the videos. Each video is scored (s_v) using the following scoring function

s_v = \sum_{d \in D} w_d (s_{v,d} - b_d)

where w_d is the Word2Vec similarity score of detector d, b_d is the background score of detector d and s_{v,d} is the score for detector d on video v. s_{v,d} is calculated as max(s_{k_0,d}, s_{k_1,d}, ..., s_{k_n,d}), where s_{k_n,d} is the score for the nth key-frame for detector d. The videos are returned in descending order of their score s_v.
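The following sketch illustrates this scoring and ranking step, assuming the per-video key-frame scores and the background scores are already available; the data layout and names are illustrative assumptions.

# Sketch of the video scoring and ranking step; illustrative data layout.
import numpy as np

def score_video(keyframe_scores, top_detectors, background):
    # keyframe_scores: dict mapping detector label -> array of key-frame scores for this video
    # top_detectors:   list of (label, similarity weight w_d) from automatic SQG
    # background:      dict mapping detector label -> background score b_d
    s_v = 0.0
    for label, w_d in top_detectors:
        s_vd = float(np.max(keyframe_scores[label]))  # max over the video's key-frames
        s_v += w_d * (s_vd - background[label])
    return s_v

def rank_videos(videos, top_detectors, background):
    # videos: dict video_id -> keyframe_scores; returns ids in descending score order.
    scores = {vid: score_video(kf, top_detectors, background) for vid, kf in videos.items()}
    return sorted(scores, key=scores.get, reverse=True)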


4. METHOD

To evaluate our implementation of SQG using word embeddings and relevance feedback and to answer our research questions, several experiments were set up. This section discusses these experiments. First we consider the general set-up including the data set, user feedback, performance measure, number of top detectors and how to treat multi-word and multi-concept labels. Next, Subsection 4.6 details the relevance feedback methods we compare in our experiments.

4.1 Data set

For evaluation we use the MEDTRAIN data set from the TRECVID Multimedia Event Detection (MED) benchmark [19]. This data set contains 5594 videos (75.45 key-frames on average) of publicly available, user-generated content. The events included in the data set are all complex activities involving interactions with people and objects.

The data set also includes relevance judgments for 40 events (i.e. queries) with 100 true positives for each event. In the experiments only 32 of these events are used as in the other 8 cases a direct match exists between the query and one of the concept detectors. As this results in (near) perfect retrieval results, user feedback is unnecessary, making these cases irrelevant to the purpose of the experiments. Examples of included events are Attempting a board trick, Changing a vehicle tire and Non-motorized vehicle repair. A list of all 32 events included and the 8 events excluded can be found in Appendix A.

In the original MED evaluation, events are provided as event kits, including the name, a definition, an explication, and a set of illustrative videos. In our experiments, however, the input consists only of the event name as a textual query.

4.2 User feedback

Fifteen participants were asked to volunteer in providing relevance feedback. The participants were presented with a list of the 32 events with the top 15 concepts per event as provided by the initial automatic SQG (see Appendix D). They were asked to evaluate these concepts semantically and provide relevance judgment by flagging the non-relevant concepts.

As some concept labels include a multitude of concepts and would become quite long, for clarity's sake it was decided to only use the first concept of the label (a concept ashcan, trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin was thus displayed simply as ashcan). In cases where the relevance of a concept was unclear, decisions on how to judge it were left entirely up to the participants.

Afterwards, feedback was split into two sets: (1) a set of non-relevant concepts as indicated by the user, and (2) a set of relevant concepts consisting of the top 15 concepts minus the ones flagged by the participant as non-relevant. These two sets for the 32 events were then used in the experiments.

4.3 Mean Average Precision

To judge performance in the experiments Mean Average Precision (MAP) [19] is measured. This metric takes into account both precision and recall and thus provides a solid general evaluation of retrieval performance. It should be noted however that an improvement in MAP does not necessarily mean an improvement in the system's judgment of semantic relations. However, as semantic relevance and improvement are rather subjective to judge and the goal is ultimately to improve retrieval results, we believe MAP provides an accurate measurement.
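For reference, a minimal sketch of average precision and MAP as commonly computed for ranked retrieval is given below; it illustrates the metric and is not the official TRECVID evaluation tooling.

# Sketch of (Mean) Average Precision for ranked retrieval; illustrative only.
def average_precision(ranked_ids, relevant_ids):
    relevant_ids = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, vid in enumerate(ranked_ids, start=1):
        if vid in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    # runs: list of (ranked_ids, relevant_ids) pairs, one per event.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)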

4.4 Number of detectors

In all experiments the number of top detectors used for scoring the videos is set to 5 (n = 5). Experiments on the baseline system showed that this number of detectors results in good performance. Appendix C shows we observe a peak in Mean Average Precision for n = 5 compared to other values (n = 1, 3, 10, 15, 30). Also, user feedback was only collected for the top 15 detectors. By using only part of this feedback each iteration, we are able to also perform some experiments over multiple iterations.

4.5 Multi-word and multi-concept labels

The concept labels in our concept-bank can be largely divided into three categories: single word labels (board), multi-word labels (skateboard trick) and multi-concept labels (attempt, attempting). Concept labels can also contain combinations of categories (boardwalk, wood board walk). As the Word2Vec model only contains vectors for single words, the issue arises of how to treat concept labels that consist of multiple words and/or concepts.

Generally, there are two approaches that can be taken here. (1) Calculate the similarity between each single word and the query first and then fuse these similarity scores. (2) Pool the single word vectors first and calculate the similarity between the pooled vector and the query vector. We implemented the latter for our experiments as we found this approach to perform best (see Appendix B).

4.6 Experiments

Similar to our research questions, our experiments are two-fold. First, we are interested in the effects of relevance feedback on automatically selected concepts. With regard to this we want to see if users are capable of identifying semantically non-relevant concepts and if implementing such relevance feedback improves general performance of the system.

Secondly, our experiments consider three different approaches to the method explained in Section 3.2. A factor that we expect to be of influence on the performance of the method is the way the word embeddings are used to create and alter the semantic space. Therefore, three variations (WordChange, ConceptChange and EventChange) of this method are considered that each create the semantic space differently. These methods are further detailed below. To measure the effects of implementing relevance feedback we compare our method against the baseline system without relevance feedback. A method of manual selection by an expert is also included to see how automatic SQG with relevance feedback compares to this. To further judge the effectiveness of our method the traditional relevance feedback methods of re-weighting (AlterWeights) and QPM (ModifyQuery) are also included. The rest of this section explains each of the methods.

4.6.1 Semantic space methods

Below the three variations on the presented method are discussed. The general approach for these methods is detailed in Algorithm 1. The methods differ in their use of word embeddings to create a semantic space.

WordChange.

In WordChange we use the original single word embeddings as our semantic space. The feedback is used to update the vectors of each word separately. For example, if a concept washer, automatic washer, washing machine is considered non-relevant, the vector for each individual word is moved away from the query vector. After filtering the concept-bank for terms not present in the Word2Vec model, this results in a vocabulary of 2269 individual single words with unique vectors.

ConceptChange.

Method ConceptChange splits a concept label into separate concepts. A vocabulary is created in which multi-word concepts are merged into a single vector. Thus, a concept sewing machine is considered as a new word sewing machine in the semantic space with its own vector. This vector is the mean of the vectors of the separate words (sewing and machine). We then update these concept vectors according to the user feedback.

This approach results in a total vocabulary size of 2532 concepts after filtering for words not present in the Word2Vec model. The size of the vocabulary is larger than for WordChange as a multi-word vector can still be created if one (or more) words are missing in the original Word2Vec model. Imagine two concepts amphibious vehicle and armored combat vehicle where only the word vehicle is present (i.e. has a vector) in the Word2Vec model. In WordChange this results in one vocabulary concept vehicle with a single vector. In ConceptChange however two concepts are created (amphibious vehicle and armored combat vehicle) that have the same vector. Although the original vectors are the same, this can prove useful as each concept vector is continuously altered according to its respective semantic meaning.

EventChange.

The EventChange method takes ConceptChange one step further and creates a single vector for each label in the concept-bank. This means a concept label boardwalk, wood board walkway becomes boardwalk wood board walkway in the vocabulary, whose vector is the mean of the separate word vectors boardwalk, wood, board and walkway. This results in a vocabulary of 1774 individual labels after filtering out labels for which none of the words are present in the model.
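The sketch below illustrates how the three vocabularies could be built, assuming labels are comma-separated concepts made up of space-separated words as in the examples above; everything else is an illustrative assumption rather than the exact experimental code.

# Sketch of building the vocabularies for the three semantic space variants; illustrative only.
import numpy as np

def mean_vector(words, model):
    vecs = [model[w] for w in words if w in model]
    return np.mean(vecs, axis=0) if vecs else None

def build_vocabulary(concept_labels, model, variant):
    vocab = {}
    for label in concept_labels:
        if variant == "WordChange":        # one entry per single word
            for word in label.replace(",", " ").split():
                if word in model:
                    vocab[word] = np.array(model[word])
        elif variant == "ConceptChange":   # one entry per comma-separated concept
            for concept in label.split(","):
                v = mean_vector(concept.split(), model)
                if v is not None:
                    vocab[concept.strip()] = v
        elif variant == "EventChange":     # one entry per full label
            v = mean_vector(label.replace(",", " ").split(), model)
            if v is not None:
                vocab[label] = v
    return vocab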

Parameter value for α.

As shown in Algorithm 1, an α parameter is used to tune the impact of the relevance feedback on the word embedding vectors. In our experiments this value is set to α = 0.1 for all three methods. Although this value has not been extensively optimized, we note that smaller values tend to result in better performance.

4.6.2 Baseline methods

Below we list the methods against which we compare the results of our presented method.

Baseline.

The baseline we compare against consists of the system described in Section 3 but without the relevance feedback module. It thus uses automatic SQG with the original word embeddings model to select a set of detectors.

Manual.

An expert familiar with the concept-bank and data set was asked to select a set of relevant concepts for each event. The expert also provided a weight for each concept. While the number of concepts differs per event, their weights always add up to one. As manual selection tends to outperform automatic SQG for complex events (noted in related work [3]), this method was included to evaluate the system in this respect.

AlterWeights.

Since we use the similarity score between query and concept label as a weight for that detector, re-weighting could lead to improved performance. The AlterWeights method increases the weights for relevant concepts while lowering them for non-relevant concepts. Weights are altered following an approach inspired by Rocchio's algorithm [21]. Relevant detectors are adjusted following Equation 1 while non-relevant detectors are adjusted following Equation 2.

For parameters α and β values ranging from 0.0 to 1.0 were considered with step-size 0.1. Additionally, more extreme values of 5.0 and 10.0 were included. For the values ranging from 0.0 to 1.0 we found that best performance was achieved for α = 0.4 and β = 0.9 (see Appendix E). Therefore, these are the values used in our experiments. Alternatively, we observe similar precision with α = 10.0 and β values ranging from 0.1 to 0.9.

w_d' = w_d + \alpha w_d \quad (1)

w_d' = w_d - \beta w_d \quad (2)

With values α = 0.0 and β = 1.0, AlterWeights produces results identical to removing the concepts annotated as non-relevant. This special case is useful as it provides clear insight into whether users are able to identify non-relevant concepts correctly. As we expect users to be able to identify the non-relevant concepts, we expect to also see an increase in performance for this special case.
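A minimal sketch of this re-weighting step is shown below, assuming the detector weights are the SQG similarity scores; α = 0.4 and β = 0.9 follow the values reported above and all names are illustrative.

# Sketch of AlterWeights re-weighting (Equations 1 and 2); illustrative only.
def alter_weights(detectors, non_relevant, alpha=0.4, beta=0.9):
    # detectors: dict label -> weight; non_relevant: set of labels flagged by the user.
    updated = {}
    for label, w in detectors.items():
        if label in non_relevant:
            updated[label] = w - beta * w   # Equation (2): decrease non-relevant weights
        else:
            updated[label] = w + alpha * w  # Equation (1): increase relevant weights
    return updated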

ModifyQuery.

The ModifyQuery method consists of query point modification using Rocchio's algorithm. We assume an ideal query vector exists, which results in an optimal set of concept detectors. Using the vector representations of both the relevant and non-relevant detectors provided by relevance feedback, we update the initial query vector q according to Equation 3:

q' = q + \alpha \left( \frac{1}{|C_r|} \sum_{d \in C_r} d \right) - \beta \left( \frac{1}{|C_{nr}|} \sum_{d \in C_{nr}} d \right) \quad (3)

where q' is the modified query vector, C_r and C_{nr} are the sets of relevant and non-relevant detectors from the relevance feedback respectively and d is the Word2Vec vector representation of detector d. Values ranging from 0.1 to 1.0 were considered for parameters α and β, with step size 0.1 and including some extreme values of 5.0 and 10.0. We found best performance at α = 0.6 and β = 0.7 (see Appendix F), which were used in all experiments. The extreme values all resulted in much lower performance.
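A minimal sketch of this query point modification on the Word2Vec vectors is given below, with α = 0.6 and β = 0.7 as reported above; the function and argument names are illustrative assumptions.

# Sketch of ModifyQuery (Equation 3) on Word2Vec vectors; illustrative only.
import numpy as np

def modify_query(q, relevant_vecs, non_relevant_vecs, alpha=0.6, beta=0.7):
    # q: query vector; relevant_vecs / non_relevant_vecs: lists of detector vectors.
    q = np.asarray(q, dtype=float)
    if relevant_vecs:
        q = q + alpha * np.mean(relevant_vecs, axis=0)
    if non_relevant_vecs:
        q = q - beta * np.mean(non_relevant_vecs, axis=0)
    return q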

To summarize, Table 2 lists all methods and where adjustments based on the relevance feedback are made.

Table 2: An overview of what is adjusted in each of the methods we compare

Method            Adjustment
Baseline/Manual   None
AlterWeights      Concept weights
ModifyQuery       Query vector
WordChange        Single word vectors
ConceptChange     Single concept vectors
EventChange       Merged concept vectors

5. RESULTS

In this section the results of our experiments are presented. First, we detail the demographics of the participants in our experiments. Next, we show the results of the relevance feedback experiments in terms of general performance. Also, we split these results per event to more extensively consider their differences. Finally, results for the three semantic space methods are presented in more detail.

5.1 User demographics

Of the 15 participants that took part in the experiments (N=15), 12 are male and 3 are female. The mean age of the participants was 24.87 (σ = 3.739). Nearly all participants are Dutch, except for 1 German participant, and have an education of Bachelor level or higher (µ = 3.47, σ = 0.64, on a scale of 1-5, 1 being None and 5 being Doctorate). The nationality and education level of participants is worth noting as the concept labels were provided in English. On average, participants marked 6.2 out of 15 concept detectors as non-relevant (σ = 1.494). However, the average number of detectors marked as non-relevant differed greatly per event (minimum 0.5 to maximum 11.7) and per user (minimum 3.7 to maximum 8.7).

A Fleiss’ Kappa test was performed to determine user agreement in the flagging of non-relevant concepts, which resulted in κ = 0.514. According to the Landis and Koch scale, this indicates a moderate agreement among users.
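Such an agreement score could be computed as sketched below with statsmodels; the data layout (rows are event-concept pairs, columns are raters, values are binary relevance flags) is an illustrative assumption, and the random data merely stands in for the actual annotations.

# Sketch of computing Fleiss' kappa for the relevance annotations; illustrative only.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# judgments[i, j] = 1 if rater j flagged item i as non-relevant, else 0 (placeholder data).
judgments = np.random.randint(0, 2, size=(32 * 15, 15))
table, _ = aggregate_raters(judgments)  # counts per item and category
print(fleiss_kappa(table))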

Of the 15 participants, 3 can be regarded as experts as they already were familiar with the system and data sets. Among this subgroup of 2 males and 1 female the mean age is 24.7 (σ = 0.58), all nationalities are Dutch and the education levels are 2 Bachelors and 1 Master (µ = 3.3, σ = 0.58). The experts marked 6.6 concepts as non-relevant on average (σ = 1.957). A Fleiss’ Kappa test among the experts showed that κ = 0.600, also indicating moderate agreement.

5.2 Mean average precision

Table 3 shows the results of each method in the user experiments. We find that of the relevance feedback methods, AlterWeights performs best. However, Manual selection of concepts by an expert outperforms even the best relevance feedback method.

The results reported for WordChange, ConceptChange and EventChange are found over 5 runs where the order of users and events is randomized. The word embedding model was reloaded after each run.

Table 3: Mean (MAP), standard deviation, minimum and maximum per method across all users. MAP for methods WordChange, ConceptChange and EventChange was averaged over 5 iterations with randomized user and event order, where the model was reloaded after each iteration.

Method          MAP (µ)   STD (σ)   Min      Max
Manual          0.1856    -         -        -
Baseline        0.1598    -         -        -
AlterWeights    0.1718    0.0053    0.1603   0.1793
ModifyQuery     0.1607    0.0062    0.1516   0.1724
WordChange      0.1646    0.0034    0.1580   0.1694
ConceptChange   0.1589    0.0019    0.1552   0.1618
EventChange     0.1566    0.0019    0.1529   0.1590

Although we observed some higher performances depending on when the model was reloaded (see Table 4), this set-up was found to be most representative as feedback was only collected for one full iteration. Figure 3 shows the performance in each of the 5 runs for WordChange, ConceptChange and EventChange. We find that only WordChange consistently outperforms the baseline system in each iteration and even outperforms AlterWeights in one (µ = 0.1730). For ConceptChange we see an improvement upon the baseline system in two iterations while in three we note worse performance. EventChange only outperforms the baseline system in one of the five runs. The minimum and maximum reported are for individual users. We find that AlterWeights is the only method that always performs higher than or equal to the baseline system. The other methods all have performed below the baseline system for at least one user.

5.3 MAP per Event

An overview of the MAP for each method per event can be found in Appendix G. We note that none of the methods outperforms the others consistently, although Manual selection performs best in most cases (18 out of 32 events). If we exclude manual selection, we find that AlterWeights performs best in 12 out of 32 events. Furthermore we observe large differences between events. For Attempting a bike trick (E021) we see the highest MAP scores, ranging from µ = 0.4556 (Baseline) to µ = 0.5948 (Manual). The lowest MAP scores are observed for Parking a vehicle (lowest µ for all relevance feedback methods from 0.0112 to 0.0133) and Tailgating (lowest µ for Manual of 0.0213).

Figure 3: The MAP of the semantic space methods split out per run. The black line indicates the MAP of the baseline system.


Table 4: An overview of the semantic space methods where the model is reloaded at different moments (after each query, user, run or not at all). Results are measured in MAP (standard deviation).

Method          Query (1 run)     User (5 runs)     Run (5 runs)      None (5 runs)     None (15 runs)
WordChange      0.1647 (0.0055)   0.1613 (0.0059)   0.1646 (0.0034)   0.1489 (0.0018)   0.1459 (0.0007)
ConceptChange   0.1558 (0.0031)   0.1517 (0.0049)   0.1589 (0.0019)   0.1561 (0.0011)   0.1581 (0.0004)
EventChange     0.1622 (0.0043)   0.1580 (0.0060)   0.1566 (0.0019)   0.1592 (0.0006)   0.1721 (0.0006)

To further investigate these differences and get better insight into the semantic performance of each method, Appendix H shows the selected concepts for events Horse riding competition (E035), Winning a race without a vehicle (E029) and Writing (E020). These results make apparent the importance of weights and their relative distribution. Furthermore they show how certain concepts impact performance. In the next section we will provide a more in depth discussion of these results.

5.4 Semantic space performance

To examine how the performance of the methods WordChange, ConceptChange and EventChange differs per run, several iterations with variations on when the model was reloaded were performed. In all cases the order of users and events was randomized. These results are presented in Table 4.

These results show that WordChange tends to perform better in the short-term. For EventChange on the other hand we observe the highest performance after several iterations without reloading the model. Figure 4 shows the performance results per run for 15 iterations without reloading the model. We see that already in the first iteration EventChange outperforms the other two methods. This suggests that the starting position and the order of users and queries are of large influence on the performance. Furthermore, this indicates that the performance fluctuates and that (local) minimums and maximums can be found.

We also observe that ConceptChange and EventChange mainly change performance in the first iterations and later seem to somewhat converge. Also, performance of these methods never drops below the performance in the first iteration again. The performance of WordChange on the other hand is constantly changing and no convergence seems to occur (yet). Also the performance is continuously lower than in the first iteration.

Figure 4: An overview of the MAP over 15 iterations (runs) without reloading the model.

5.5 Expert vs non-expert

As mentioned above, three of our users could be regarded as experts. Although all relevance feedback methods performed slightly better in terms of MAP, we did not find any substantial differences (see Table 5).

Table 5: Mean (MAP), standard deviation, minimum and maximum per method for expert users.

Method          MAP (µ)   STD (σ)   Min      Max
Manual          0.1856    -         -        -
Baseline        0.1598    -         -        -
AlterWeights    0.1748    0.0060    0.1680   0.1793
ModifyQuery     0.1637    0.0076    0.1586   0.1724
WordChange      0.1666    0.0023    0.1646   0.1691
ConceptChange   0.1600    0.0012    0.1591   0.1613
EventChange     0.1578    0.0021    0.1554   0.1590

6. DISCUSSION

In this section the results presented in the previous section will be discussed in more detail. The section follows the same structure: first the general results are considered, followed by the results per event. Next, we look at the semantic space methods. The section ends with some broader points of discussion.

6.1 Mean Average Precision

In terms of general MAP, our results show two interesting observations: first, manual selection outperforms any form of relevance feedback, and second, the only method that does not search for any new concepts after the initial automatic SQG but only alters the weights performs best.

The good performance of Manual is largely related to the fact that for several queries the expert selected concepts none of the other methods was able to find. A good example of this is Marriage proposal (E025). The expert selected only the concept putting ring on finger while the automatic methods all select concepts related to marriage (e.g. marry, wedding, bride, wedding ceremony). The semantic relation between the event and the concept putting ring on finger is hard to map using Word2Vec as the concept consists of multiple individual words with different semantic meanings. Only in this specific composition are the individual words related to the event.

In relation to this we find that the selected concepts sometimes seem to reflect only part of the event query. For example event Fixing musical instrument (E034), where the selected concepts are all related to instruments rather than fixing (e.g. musical instrument, piano, acoustic guitar). Although in this case this strategy outperforms manual selection by the expert, it does not accurately reflect the semantics of the query.

With regard to AlterWeights outperforming all other relevance feedback methods we can conclude two things. First, we observe that the relative weighting of the concepts is of great influence on the performance. By increasing the relative differences between relevant and non-relevant concepts we can already gain large performance improvements. It should be noted however that the absolute value of the weights also matters. One of the reasons ModifyQuery performs worse than other methods is that for some events it finds the same concepts but with lower absolute weights.

Furthermore, AlterWeights performs well because it sharply decreases the weights of non-relevant concepts. The other methods tend to replace non-relevant concepts with other, possibly relevant, concepts rather than only lower their weights. On the other hand, simply removing the complete concepts (AlterWeights with α = 0.0 and β = 1.0) performs slightly worse. This can likely be attributed to the fact that on a few occasions concepts have been annotated as non-relevant that actually perform well for the event.

Secondly, the good performance of AlterWeights indicates that our automatic SQG module already performs well at concept selection. Alternatively it could mean that relevance feedback might be an ineffective approach to find any alternative, better concept detectors. The latter also relates to the weakness of Word2Vec in detecting certain semantic relations that are more complex or contain negation.

6.1.1 MAP per event

As mentioned before, we observe large differences between events in terms of performance. We observe especially high performance (MAP > 0.3000) for all methods for events Landing a fish (E003), Flash mob gathering (E008) and Attempting a bike trick (E021). These events have in common that they contain one specific element (fish, flash mob and bike trick) for which we have a set of representative concept detectors in the concept-bank. In these cases the system performs as we expected.

Low performance (MAP < 0.0750) for all methods is found for events Feeding an animal (E002), Giving directions to a location (E024), Renovating a home (E026), Wedding shower (E032), Non-motorized vehicle repair (E033), Playing fetch (E038) and Tailgating (E039). However, the reasons for this bad performance differ per event.

For Feeding an animal semantically relevant concepts are retrieved such as animal, feed, feeding and pet, yet performance remains low. This can probably be attributed to the detectors not being able to detect these concepts well. In such a case the semantic query generation and relevance feedback perform as expected as semantically relevant detectors are selected. A solution would be to retrain these detectors on videos more representative of the concepts.

In the case of Giving directions to a location the problem is twofold: (1) the event is very complex, consisting of many words with broad semantic meaning and (2) the concept-bank does not contain very accurate concepts (e.g. direction, map and traffic are the most relevant). With regard to Playing fetch the problem is that the semantic model is unable to understand the relation between this event and concepts such as dog, park, ball and throw, throwing. However, the expert picked these concepts and while we observe performance nearly three times higher, it is still very low at MAP = 0.0470, indicating these detectors do not work well for this event.

Word2Vec does not handle negation very well. While it contains relations to similar words with similar context, the reverse does not work in most cases. The low performance for Non-motorized vehicle repair was therefore expected. Interesting however is that the performance for Winning a race without a vehicle is rather average. If we look at the semantic concepts found by each method for this event (Appendix H), we see that the good performance of AlterWeights and ConceptChange is largely dependent on the concepts racer, race car, racing car, win, winning and race, racing. Although not entirely clear, it could be that the negated part of the query is actually less important in the evaluation. Alternatively, detectors race, racing and win, winning could be trained on videos without vehicles and therefore perform well.

Appendix H also contains overviews for events Horse riding competition and Writing. For the first, we note that all relevance feedback methods except ConceptChange find the same concepts as included in the manual selection. Yet, manual selection still outperforms these methods as the extra concepts selected decrease performance. This shows that, if a few good detectors are available, selecting more (still relevant) concepts does not further increase performance. This is further strengthened by what we see for event Writing. The concept read, reading is selected by all relevance feedback methods. While thematically similar, it is actually not semantically similar and proves to be a bad detector for the event (shown by the much better performance of AlterWeights).

We gather that it would be advantageous to let the number of concepts selected depend on the query, rather than setting this number up front. A simple approach would be to set a similarity threshold, although we regularly observe less relevant detectors with a higher similarity than highly relevant ones.
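A sketch of what such query-dependent selection could look like is given below, operating on the (label, similarity) pairs produced by the SQG step; the threshold value and the fallback are purely illustrative.

# Sketch of query-dependent concept selection with a similarity threshold; illustrative only.
def select_detectors(scored_labels, threshold=0.5, max_n=15):
    # scored_labels: list of (label, cosine similarity), sorted in descending order.
    selected = [(label, s) for label, s in scored_labels if s >= threshold]
    return selected[:max_n] if selected else scored_labels[:1]  # fall back to the best match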

6.1.2 Semantic space

With regard to the semantic space methods it is hard to determine which of the variations performs best. We find that WordChange performs better in the short-term. This is also emphasized by the fact that it still performs well when we reset the model after each query. In this respect we find that adjusting the model with the approach of ConceptChange is not very effective as this results in performance below the baseline system. According to these results we thus conclude that only WordChange and EventChange are good methods of altering the semantic space, as both result in increased performance in the short-term.

Yet, if we continuously update the model (as intended) and reset it only after an entire iteration of all users and all events, we find the results to differ a lot. Figure 3 already shows that the performance can change depending on the randomized order of users and events. Figure 4 contradicts these results even further as the order of best performance is fully reversed in the first iteration with EventChange performing exceptionally well and WordChange rather poorly.

In the continuous 15 iterations without reloading the model, we find that WordChange shows much larger differences per run than the other two methods. ConceptChange and EventChange make a notable improvement in the first iteration, but after several iterations seem to somewhat converge. The constant changes of WordChange are likely due to the fact that single words can be used in many different labels with many different semantic meanings. This makes it hard to find a good position in the semantic space that works well for many cases. As ConceptChange and EventChange take the meaning of an entire concept or label, their semantics become less ambiguous and likely a good position in the vector space exists.

For these long-term iterations some side notes need to be made as feedback was only collected for one full iteration over all users and events. After one iteration two issues can occur in further iterations: (1) concepts considered by all users as relevant are left with their similarity scores maxed out to 1 (i.e. their vectors are nearly identical to the query vector), (2) new concepts are found for which we do not have any relevance feedback, thus they are not updated.

Taken together, the results indicate that the methods can result in good performance. For both WordChange and EventChange we have observed results that outperform AlterWeights. As we observe a slightly higher performance for EventChange and much more stable results, we are inclined to believe this variation on the presented method works best. However, to reach these good performances the model needs to be optimized less randomly, as the randomization of users and events currently has too big an impact on the performance. A better optimization approach (e.g. gradient descent) could find a (local) maximum in performance and train a model that consistently performs well.

7. CONCLUSION

This paper focused on the use of relevance feedback for semantic query generation using word embeddings. We explored two research questions: (1) How can system performance be improved through user feedback on the semantic relevance of automatically selected concept detectors? (2) How can the semantic space be altered based on relevance feedback such that more accurate retrieval results are achieved?

In our results we find first of all that it is indeed possible to improve performance through user feedback on the semantic relevance of concepts. We observe that users are indeed capable of identifying non-relevant concepts independent of their level of expertise or familiarity with the system.

We explored several methods of implementing the relevance feedback and compared these to manual selection by an expert and the baseline system without relevance feedback. We found that manual selection performed best due to the limitations of the word embedding model to express complex semantic relations. This means manual selection mainly performed better as the expert was able to pick concepts none of our methods was able to consider relevant to the query. With regard to the feedback methods, we found re-weighting to be highly effective and to result in the best performance. These results show that the baseline system is already quite good at selecting the relevant concepts, but that the relative weight distribution can be optimized.

Secondly we explored how the semantic space (i.e. the word embeddings model) could be altered using the relevance feedback to better reflect the semantic relations between concepts. We compared three variations of creating and altering the model which differed in how they treated multi-word and multi-concept detector labels. We found that a method considering a separate vector for each single word performed best in the short-term. In the long-term however a model merging each concept label into one single vector seemed to perform better as it could more accurately represent each concept's semantic meaning. In our experiments we were not able to alter the semantic space such that it continuously improved performance in all cases. However, we observed results highly competitive with re-weighting. We believe these results indicate that with further optimization a semantic space can be created that will outperform re-weighting and achieves continuously high performance for all queries.

8. FUTURE WORK

Our experiments are limited in the optimization of the semantic space methods. Suggestions for future work are therefore to further explore optimization of the semantic space models in the long-term. Different algorithms could be considered such as gradient descent to find local or global maximums. Furthermore, other similarity measures could be attempted (e.g. Euclidean or Tanimoto distance) as well as different methods to pool word embedding vectors (e.g. Fisher Vector).

Lastly, for long-term experiments we found our use of statically collected user feedback to be a constraint. A suggestion for future research is therefore to set up an experiment where relevance feedback is collected dynamically. Finally, to broaden the results, experiments could be done including more events or queries or exploring different data sets.

9. REFERENCES

[1] M. ALMasri, C. Berrut, and J.-P. Chevallet. A comparison of deep learning based query expansion with pseudo-relevance feedback and mutual information. In Advances in Information Retrieval, pages 709–715. Springer, 2016.
[2] M. de Boer, L. Daniele, P. Brandt, and M. Sappelli. Applying semantic reasoning in image retrieval. Proc. ALLDATA, 2015.
[3] M. de Boer, K. Schutte, and W. Kraaij. Knowledge based query expansion in complex multimedia event detection. Multimedia Tools and Applications, pages 1–19, 2015.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR 2009, pages 248–255. IEEE, 2009.
[5] M. Elhoseiny, J. Liu, H. Cheng, H. Sawhney, and A. Elgammal. Zero-shot event detection by multimodal distributional semantic embedding of videos. arXiv preprint arXiv:1512.00818, 2015.
[6] Y. Goldberg and O. Levy. word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.
[7] X. He, O. King, W.-Y. Ma, M. Li, and H.-J. Zhang. Learning a semantic space from user's relevance feedback for image retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 13(1):39–48, 2003.
[8] L. Jiang, T. Mitamura, S.-I. Yu, and A. G. Hauptmann. Zero-example event search using multimodal pseudo relevance feedback. In Proceedings of International Conference on Multimedia Retrieval, page 297. ACM, 2014.
[9] L. Jiang, S.-I. Yu, D. Meng, T. Mitamura, and A. G. Hauptmann. Bridging the ultimate semantic gap: A semantic search engine for internet videos. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, pages 27–34. ACM, 2015.
[10] L. Kaliciak, D. Song, N. Wiratunga, and J. Pan. Combining visual and textual systems within the context of user feedback. In Advances in Multimedia Modeling, pages 445–455. Springer, 2013.
[11] G. Lev, B. Klein, and L. Wolf. In defense of word embedding for generic text representation. In Natural Language Processing and Information Systems, pages 35–50. Springer, 2015.
[12] Y. Liu, D. Zhang, G. Lu, and W.-Y. Ma. A survey of content-based image retrieval with high-level semantics. Pattern Recognition, 40(1):262–282, 2007.
[13] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[15] G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
[16] D. Milne and I. H. Witten. An open-source toolkit for mining Wikipedia. Artificial Intelligence, 194:222–239, 2013.
[17] C. Ngo, Y. Lu, H. Zhang, M. de Boer, K. Schutte, and W. Kraaij. VIREO-TNO TRECVID 2015: Multimedia event detection. In Proceedings of TRECVID 2015. NIST, USA, 2015.
[18] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
[19] P. Over, G. Awad, M. Michel, J. Fiscus, G. Sanders, W. Kraaij, A. F. Smeaton, G. Quénot, and R. Ordelman. TRECVID 2015 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proc. TRECVID 2015, page 52. NIST, USA, 2015.
[20] G. Pingen, M. de Boer, and R. Aly. Relevance feedback in automatic event search using semantics. Paper to be published.
[21] J. J. Rocchio. Relevance feedback in information retrieval. 1971.
[22] R. Rocha, P. T. Saito, and P. H. Bugatti. A novel framework for content-based image retrieval through relevance feedback optimization. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pages 281–289. Springer, 2015.
[23] I. Ruthven and M. Lalmas. A survey on the use of relevance feedback for information access systems. The Knowledge Engineering Review, 18(02):95–145, 2003.
[24] C. G. Snoek, S. Cappallo, D. Fontijne, D. Julian, D. C. Koelma, P. Mettes, K. Sande, A. Sarah, H. Stokman, R. B. Towal, et al. Qualcomm Research and University of Amsterdam at TRECVID 2015: Recognizing concepts, objects, and events in video. 2015.
[25] C. G. Snoek and M. Worring. Concept-based video retrieval. Foundations and Trends in Information Retrieval, 2(4):215–322, 2008.
[26] C.-F. Tsai, Y.-H. Hu, and Z.-Y. Chen. Factors affecting Rocchio-based pseudorelevance feedback in image retrieval. Journal of the Association for Information Science and Technology, 66(1):40–57, 2015.
[27] Y. Yan, Y. Yang, H. Shen, D. Meng, G. Liu, A. Hauptmann, and N. Sebe. Complex event detection via event oriented dictionary learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[28] G. Ye, Y. Li, H. Xu, D. Liu, and S.-F. Chang. EventNet: A large scale structured concept library for complex event detection in video. In Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, pages 471–480. ACM, 2015.
[29] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.
[30] X. S. Zhou and T. S. Huang. Relevance feedback in image retrieval: A comprehensive review. Multimedia Systems, 8(6):536–544, 2003.

APPENDIX

A. LIST OF EVENTS FOR EXPERIMENTS

Each event is listed as EventName (EventID). The names are entered as the query; the ID is only used for easy tracking and short notation. The events included in the experiments are:

Attempting a board trick (E001), Feeding an animal (E002), Landing a fish (E003), Working on a woodworking project (E005), Changing a vehicle tire (E007), Flash mob gathering (E008), Getting a vehicle unstuck (E009), Grooming an animal (E010), Parade (E012), Parkour (E013), Repairing an appliance (E014), Working on a sewing project (E015), Doing homework or studying (E016), Hide and seek (E017), Installing flooring (E019), Writing (E020), Attempting a bike trick (E021), Cleaning an appliance (E022), Giving directions to a location (E024), Marriage proposal (E025), Renovating a home (E026), Rock climbing (E027), Winning a race without a vehicle (E029), Working on a metal crafts project (E030), Wedding shower (E032), Non-motorized vehicle repair (E033), Fixing musical instrument (E034), Horse riding competition (E035), Felling a tree (E036), Parking a vehicle (E037), Playing fetch (E038), Tailgating (E039).

Events excluded from the experiments:

Wedding ceremony (E004), Birthday party (E006), Making a sandwich (E011), Hiking (E018), Dog show (E023), Town hall meeting (E028), Beekeeping (E031), Tuning musical instrument (E040).


B. MULTI-WORD AND MULTI-CONCEPT LABELS

We compared performance for three methods of handling multi-word and multi-concept labels. FullSplit considers the label as separate single words and calculates the cosine similarity between each individual word and the query; the maximum cosine similarity is taken as the similarity for the label. FullFusion first fuses the separate words and concepts of the label into a single vector using mean pooling; next, the cosine similarity is measured between the concept label vector and the query. ConceptSplit splits the label per concept, fuses multi-word concepts using mean pooling and then calculates the cosine similarity between each concept and the query; here we also select the maximum similarity of the concepts as the similarity for the entire label. Table 6 shows the results these methods achieve on the baseline version of the system in terms of MAP.

Table 6: MAP results for three different methods of handling multi-word and multi-concept labels. Experiment run on the baseline version of the system.

Method         MAP
FullFusion     0.1598
FullSplit      0.1589
ConceptSplit   0.1551
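To make the three strategies above concrete, the sketch below shows one possible implementation. It assumes word vectors can be looked up in a dict-like object wv (e.g. a gensim KeyedVectors model), that multi-concept labels are comma-separated, and that the query has already been pooled into a single vector; the helper names are ours and only illustrative.

# Illustrative sketch of the three label-handling strategies described above.
# Assumptions: `wv` maps a single word to a NumPy vector (e.g. gensim KeyedVectors),
# multi-concept labels are comma-separated, and `query_vec` is an already pooled vector.
import numpy as np

def _cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def _mean_pool(vectors):
    return np.mean(np.stack(vectors), axis=0)

def full_split(label, query_vec, wv):
    """FullSplit: max cosine similarity over every individual word in the label."""
    words = [w for part in label.split(',') for w in part.split() if w in wv]
    return max(_cosine(wv[w], query_vec) for w in words)

def full_fusion(label, query_vec, wv):
    """FullFusion: mean-pool all words of the label into one vector, then compare."""
    vecs = [wv[w] for part in label.split(',') for w in part.split() if w in wv]
    return _cosine(_mean_pool(vecs), query_vec)

def concept_split(label, query_vec, wv):
    """ConceptSplit: pool each comma-separated concept, then take the max similarity."""
    scores = []
    for concept in label.split(','):
        vecs = [wv[w] for w in concept.split() if w in wv]
        if vecs:
            scores.append(_cosine(_mean_pool(vecs), query_vec))
    return max(scores)

For the label "ride horse, riding horse", for example, ConceptSplit pools "ride horse" and "riding horse" separately and keeps the higher of the two cosine scores.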

C. NUMBER OF DETECTORS

Appendix C: An overview of the results in MAP for different numbers of detectors used in the scoring function. We observe a peak at n = 5.
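The appendix itself only reports the effect of n. As a rough, hypothetical illustration of how a top-n scoring function could combine concept weights with detector outputs, consider the sketch below; the weighted-sum form is an assumption made for illustration and not necessarily the exact scoring function used in the system.

# Hypothetical sketch of a top-n scoring function (the weighted-sum form is an
# assumption for illustration; the system's actual scoring may differ).
def event_score(detector_scores, concept_weights, n=5):
    """Combine the n concepts most similar to the query into a single video score.

    detector_scores: concept label -> detector confidence for one video.
    concept_weights: concept label -> semantic similarity of the concept to the query.
    """
    top = sorted(concept_weights.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return sum(weight * detector_scores.get(label, 0.0) for label, weight in top)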

D. FORM TO COLLECT USER FEEDBACK

Participants were asked to annotate the non-relevant concepts for each of the 32 events. The figure shows the overview participants were presented with for the first five events. They were asked to cross off the non-relevant concepts using a pencil.

Appendix D: The overview of events and concepts used to collect user feedback.


E. PARAMETER OPTIMIZATION FOR ALTERWEIGHTS

Appendix E: MAP for method AlterWeights relative to α and β values. Values ranging from 0.0 to 1.0 were considered with step-size 0.1.
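The plot in this appendix was produced by an exhaustive sweep over α and β. A minimal sketch of such a grid search is given below; the evaluate callable is a placeholder standing in for running AlterWeights with the given parameters and computing MAP, and is not part of the released code.

# Minimal grid-search sketch for the (alpha, beta) sweep; `evaluate` is a placeholder
# for "run AlterWeights with these parameters and return MAP".
import itertools
import numpy as np

def grid_search(evaluate, step=0.1):
    values = np.round(np.arange(0.0, 1.0 + step / 2, step), 2)
    results = {(a, b): evaluate(a, b) for a, b in itertools.product(values, repeat=2)}
    best = max(results, key=results.get)
    return best, results

# Example with a dummy evaluation function peaking at alpha=0.6, beta=0.3:
# best, _ = grid_search(lambda a, b: -((a - 0.6) ** 2 + (b - 0.3) ** 2))

The same sweep applies to the ModifyQuery parameters in Appendix F.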

F. PARAMETER OPTIMIZATION FOR MODIFYQUERY

Appendix F: MAP for method ModifyQuery relative to α and β values. Values ranging from 0.0 to 1.0 were considered with step-size 0.1.

G. COMPARISON OF ALL METHODS SPLIT PER QUERY

The figure shows the experiment results for each event query, measured in MAP, including standard deviations.


H. RETRIEVED SEMANTIC CONCEPTS PER EVENT

Table 7: Overview of the concepts selected for event Horse riding competition (E035) per method. Concepts for all relevance feedback methods are from a single user, representative of the general performance. Each entry lists weight | concept label.

Manual (MAP 0.4968):
  0.5 | ride horse, riding horse
  0.5 | racecourse, horse race course

Baseline (MAP 0.3892):
  0.86 | ride horse, riding horse
  0.73 | horse
  0.71 | ride, riding
  0.65 | ride bike, riding bike
  0.65 | racecourse, horse race course

AlterWeights (MAP 0.3957):
  1.20 | ride horse, riding horse
  1.03 | horse
  1.00 | ride, riding
  0.06 | ride bike, riding bike
  0.91 | racecourse, horse race course

ModifyQuery (MAP 0.4040):
  0.80 | ride horse, riding horse
  0.71 | horse
  0.65 | racecourse, horse race course
  0.63 | ride, riding
  0.59 | race, racing

WordChange (MAP 0.4001):
  1.00 | horse
  0.98 | ride horse, riding horse
  0.94 | ride, riding
  0.86 | racecourse, horse race course
  0.70 | horse cart, horse-cart

ConceptChange (MAP 0.1619):
  0.98 | ride horse, riding horse
  0.96 | ride, riding
  0.96 | horse
  0.84 | race, racing
  0.77 | person ride, person riding

EventChange (MAP 0.4025):
  0.98 | ride horse, riding horse
  0.95 | horse
  0.93 | ride, riding
  0.84 | horse cart, horse-cart
  0.83 | racecourse, horse race course


Table 8: Overview of the concepts selected for event Winning a race without a vehicle (E029) per method. Concepts for all relevance feedback methods are from a single user, representative of the general performance. Each entry lists weight | concept label.

Manual (MAP 0.1609):
  0.125 | win, winning
  0.125 | race, racing
  0.125 | run, running
  0.125 | track outdoor
  0.125 | swim, swimming
  0.125 | swimming pool outdoor

Baseline (MAP 0.1169):
  0.66 | racer, race car, racing car
  0.60 | win, winning
  0.81 | race, racing
  0.56 | sports car, sport car
  0.55 | car
  0.25 | racecourse, horse race course

AlterWeights (MAP 0.2976):
  0.93 | racer, race car, racing car
  0.84 | win, winning
  0.81 | race, racing
  0.06 | sports car, sport car
  0.05 | car

ModifyQuery (MAP 0.1645):
  0.72 | race, racing
  0.71 | win, winning
  0.64 | racecourse, horse race course
  0.50 | racer, race car, racing car
  0.44 | contest

WordChange (MAP 0.1598):
  0.95 | race, racing
  0.91 | win, winning
  0.83 | racer, race car, racing car
  0.77 | racecourse, horse race course
  0.61 | car

ConceptChange (MAP 0.2502):
  0.92 | win, winning
  0.88 | racer, race car, racing car
  0.88 | race, racing
  0.56 | car wheel
  0.54 | truck

EventChange (MAP 0.1427):
  0.91 | win, winning
  0.91 | race, racing
  0.86 | racecourse, horse race course
  0.82 | racer, race car, racing car
  0.58 | tow truck, tow car, wrecker

Table 9: Overview of the concepts selected for event Writing (E020) per method. Concepts for all relevance feedback methods are from a single user, representative of the general performance. Each entry lists weight | concept label.

Manual (MAP 0.1687):
  0.5 | write
  0.25 | paper
  0.125 | desk
  0.125 | classroom

Baseline (MAP 0.1527):
  0.69 | write
  0.61 | read, reading
  0.39 | quill, quill pen
  0.38 | pen
  0.37 | comic book

AlterWeights (MAP 0.2976):
  0.97 | write
  0.06 | read, reading
  0.55 | quill, quill pen
  0.53 | pen
  0.04 | comic book

ModifyQuery (MAP 0.1151):
  0.67 | write
  0.57 | pen
  0.55 | quill, quill pen
  0.52 | read, reading
  0.51 | ballpoint pen, ball pen

WordChange (MAP 0.1245):
  0.97 | pen
  0.96 | write
  0.93 | quill, quill pen
  0.84 | read, reading
  0.64 | fountain pen

ConceptChange (MAP 0.1169):
  0.97 | write
  0.94 | pen
  0.94 | typewriter keyboard
  0.81 | read, reading
  0.75 | work

EventChange (MAP 0.1200):
  0.96 | write
  0.92 | pen
  0.84 | read, reading
  0.74 | quill, quill pen
  0.47 | teach, teaching
