
Automatic Image Annotation From Noisy Data Using Deep Neural Representations


MSc Computational Science

Specialization: Big Data & Urgent Computing

Master Thesis

Automatic Image Annotation From Noisy Data Using Deep Neural Representations

by

Anton O. Radice

11386959

May 18, 2017

42 credits, January 2017 - June 2017

Supervisor: Dr. Valeria Krzhizhanovskaya (UvA, ITMO)

Co-supervisor: Dr. Nikolay Butakov (ITMO)

Daily supervisor: Maarten Stol, MSc. (BrainCreators)


Acknowledgments

I would first and foremost like to thank my wonderful supervisors Maarten Stol, Valeria Krzhizhanovskaya, and Nikolay Butakov for their expert guidance, patience, and encouragement. It was only through many long, insightful conversations with my supervisors that I was able to produce the ideas found in this thesis. Furthermore, I want to thank all of my colleagues at BrainCreators for providing me with the opportunity to perform my research in such a stimulating and supportive environment, giving me the freedom to guide my own research while also providing me with the necessary resources to succeed. On a personal note, I am deeply grateful to my parents and sister, Katya, Glenn, and Veronica, for their unwavering support throughout my academic career. Lastly, I would like to give thanks to the rest of my family, especially my grandparents Margaret and Richard, and my close friends for their invaluable moral support and faith in my success.


Abstract

Assembling large-scale, accurately labeled image datasets is a common, yet significant problem faced by many researchers and commercial companies in the field of computer vision. Manual annotation is often too costly or time consuming for most applications, and suffers from biases inherent in the annotation task. This leads to the following fundamental question: how can large datasets of labeled images be constructed without resorting to full manual annotation? Luckily, the exponential growth of information on the web, in tandem with recent advancements in the representational ability of deep neural networks, enables the design of intelligent methods that can discern visually perceptible semantic entities like objects and attributes occurring in images from noisy web data in order to suggest labels for unseen images, thus minimizing the labor requirement and bias influence of full manual annotation.

In this thesis, a novel method for automatic image annotation is proposed that combines neural network-based automatic attribute discovery with an image retrieval framework to form a semi-supervised image annotation system that can label images from a vocabulary of relevant concepts and attributes derived from user-generated image and text data. The proposed approach exploits deep representations of images embedded from the intermediate layers of convolutional neural networks (CNN) to discover feature spaces unique to tags with discernible visual properties, and treats annotation as an image retrieval task by finding the K-nearest neighbors to the query image in each of the discovered feature spaces. The proposed method results in a feature representation that is up to 95% smaller than traditional CNN features and is tested on a challenging real-world dataset to demonstrate its effectiveness for the image annotation task.


Contents

1 Introduction
  1.1 Motivation
  1.2 Contributions
2 Related Work
  2.1 Automatic Image Annotation
  2.2 Deep Representations
  2.3 Learning From Noisy Data
3 Automatic Image Annotation Using Deep Neural Representations
  3.1 Preliminaries
    3.1.1 Convolutional Neural Networks
    3.1.2 K-Nearest Neighbors For Image Annotation
  3.2 Annotation Model
  3.3 Pipeline
    3.3.1 Overview
    3.3.2 Description
4 Experiments
  4.1 Experimental Setup
    4.1.1 Dataset
    4.1.2 Data Preprocessing
    4.1.3 Seed Tag Generation
    4.1.4 Implementation
  4.2 Experimental Results
    4.2.1 Qualitative Evaluation
    4.2.2 Quantitative Evaluation
  4.3 Discussion
5 Conclusion
  5.1 Future Work
6 Appendix
Bibliography


List of Figures

1.1 An example of different annotations for a single image
1.2 An example of an image and accompanying text found on the marketplace Etsy.com
3.1 Typical convolutional neural network architecture
3.2 An illustration of the global max pooling procedure
3.3 An illustration of the feature space construction procedure
4.1 A random sample of images from the clothing category of the Etsy dataset
4.2 VGG-19 architecture
4.3 Positive annotation examples
4.4 Negative annotation examples
6.1 A visualization of activation unit divergences in the intermediate layers of the network for tag "ruffle"


List of Tables

4.1 Tags with highest and lowest visualness
4.2 Results of the human judgment task
6.1 VGG-19 network details
6.2 List of the top 50 seed tags by frequency


Glossary

API Application programming interface.
CNN Convolutional neural network.
DMIL Deep multiple instance learning.
GPU Graphics processing unit.
IR Information retrieval.
JEC Joint equal contribution.
KL Kullback-Leibler divergence.
KNN K-nearest neighbors algorithm.
MLP Multi-layer perceptron.
SVM Support vector machine.
UGC User-generated content.


Chapter 1

Introduction

“Data is the sword of the 21st century, those who wield it well, the Samurai.”

— Jonathan Rosenberg, Google

1.1 Motivation

The quote highlighted above is from the year 2009. Although that was only eight years ago, a lot has changed in the years since, especially in the field of artificial intelligence, which has seen remarkable progress mainly due to advances in deep learning research. Nonetheless, many of the same trends from the past few decades continue, namely the exponential growth in the data being generated (see http://insidebigdata.com/2017/02/16/the-exponential-growth-of-data) and the constant march of increased computing power popularly referred to as Moore's law (https://en.wikipedia.org/wiki/Moore%27s_law). Advances in deep learning research have benefited greatly from the latter phenomenon, more specifically from cheaper graphics processing units (GPU) that speed up the training of deep learning algorithms, but have yet to catch up to and fully exploit the former trend, the explosion of information growth online. Considering that daily uploads of images alone from five of the top social networks are estimated to have exceeded 1.8 billion in 2014 [1], there is a clear gap between the rate at which data is being generated and the rate at which data is being transformed into a usable form so that valuable intelligence can be extracted via analytics and modeling.

Aside from the largest players in the technology space, with millions and millions of users constantly generating data that can be intelligently captured, labeled, and used for analysis and modeling, researchers without large grants and smaller organizations without expansive consumer-facing platforms are often faced with the same issue of accessing the labeled data they need to achieve results of similar performance and capability. Normally, to circumvent this issue, some sacrifices have to be made, such as spending time and money performing manual data cleaning or annotation, which in practice severely limits the size of the dataset that can be produced. However, this approach is suboptimal, costing the researcher or organization precious time and resources that could be spent improving the models themselves rather than obtaining the necessary input data. Fortunately, there is an alternative to full manual annotation: developing semi-supervised methods that utilize a human-in-the-loop to aid in data cleaning or annotation.

In the field of computer vision, the problem of collecting annotated data is especially pronounced due to the ambiguity and subjectivity that surround labeling the visual content of images. Take for example an image of a busy city scene such as the one in figure 1.1. There are numerous ways to describe its content. Perhaps you might focus on annotating the finer details in the image, such as the appearance of individual people, cars, and buildings. Maybe you choose to describe the more abstract, higher-level concepts, such as the fact that it is rush hour, that there is gridlock traffic, or that the sunset over the buildings looks beautiful. In either case, it is clear that describing the visual content of an image is a laborious task that depends not only on the complexity of the image, but also on the diligence of the annotator. These complexities lead to two well-known issues associated with manual image annotation: class-imbalance and weak-labeling [2]. Class-imbalance refers to the high variance in the number of images corresponding to different labels, and weak-labeling occurs when a large proportion of images are not annotated with all relevant labels. Thus, the process of manually annotating images is fundamentally problematic.

Figure 1.1: An example of different annotations for a single image

Luckily, the explosion of information growth online includes so-called user-generated content (UGC) that can be leveraged for the purpose of automatic image annotation. UGC is the core source of content creation for many online platforms such as Facebook, Instagram, Twitter, Flickr, eBay, and Etsy. Though these platforms differ in purpose (i.e. Facebook and Twitter are social networks whereas eBay and Etsy are marketplaces), they all depend on users uploading and sharing images, text, videos, and other forms of media. While some of this UGC is proprietary, more often than not it is open for access via application programming interfaces (APIs), mainly to inspire third parties to develop their own applications using the service, thus expanding its reach. However, this data can also be collected and used to create datasets for a variety of purposes, including image datasets for vision tasks. While the strengths of UGC include its large size (billions of photos with their accompanying text), availability (typically publicly accessible for free or minimal cost), and diversity (spanning numerous topics, styles, and time periods), the difficulty in utilizing UGC lies in the fact that it is often too noisy to use in its raw form. Thus, in order to harness these strengths, intelligent methods are needed to filter and separate the signal from the noise.

In this thesis, a novel approach to making sense of such user-generated image data is described. The automatic annotation method proposed here is capable of suggesting relevant labels for unseen images from a corpus of weakly-supervised data by leveraging deep representations of visual concepts mined from noisy image tags. This method can serve as a tool for researchers and organizations in the field of computer vision to build large, high quality datasets from weakly supervised data found online by maximizing the accuracy and efficiency of human-in-the-loop strategies, hence minimizing the cost associated with full manual annotation.

1.2 Contributions

In the previous section, the motivation was provided for an automated method to annotate images from large, noisy, user-generated image datasets that are openly available online but pragmatically difficult to use. To address this need, we focus on synthesizing recent advances in convolutional neural networks (CNN) in computer vision with a simple, yet proven image annotation framework. Following research into intermediate representations learned by CNNs [3, 4, 5], we hypothesize that the discovery of specific patterns of activation unit responses generated by deep CNN architectures can be used to improve the performance of automatic annotation systems. Specifically, we are interested in identifying patterns of neurons that correspond to visually identifiable semantic concepts present in images and quantifying their discriminative power, allowing us to filter out irrelevant, non-visual tags found in weakly supervised data.

Typically, automatic image annotation methods are developed and evaluated on benchmark datasets specifically designed for this task. These include popular image annotation datasets such as Corel 5K [6], ESP Game [7], and IAPR TC-12 [8]. While these datasets allow for the direct comparison of different automatic image annotation methods and are challenging due to the diversity of their visual content, class-imbalance, and weak-labeling, the ground truth label sets are human annotated and inspected to ensure quality, resulting in artificially clean datasets. As a result, these datasets are useful for benchmarking but do not address the issue of noisy labels that is commonly found in real-world, user-generated image datasets.

Figure 1.2: An example of an image and accompanying text found on the marketplace Etsy.com

In this thesis, we limit our scope to the problem of automatic image annotation from noisy, user-generated images and their accompanying text (see figure 1.2 for an example). Aside from adopting an image retrieval framework based on the k-nearest neighbors algorithm, a typical component of automatic image annotation systems, we utilize results from two other fields of research, namely the study of deep image representations extracted from CNNs and methods that learn image classifiers in the presence of image and/or label noise (i.e. semi-supervised). While we try to review most of the related work in these fields, we focus specifically on research that is related to our task of interest, that is, the automatic annotation of unseen images from user-generated data.

In chapter 2, we offer a broad literature survey on the three main fields of research related to our goal. In chapter 3, the main contribution of this thesis, a novel automatic image annotation method that draws inspiration from recent discoveries in the representational ability of deep CNNs for image recognition and classic image annotation frameworks, is described. We focus on first defining our method mathematically, then offer a straightforward step-by-step description of the image annotation pipeline. In chapter 4, the design and main results of our experimental study on a challenging, real-world dataset are presented. Lastly, in chapter 5, we draw conclusions on the main findings of this research and offer suggestions for future work.


Chapter 2

Related Work

“The possession of knowledge does not kill the sense of wonder and mystery. There is always more mystery.”

— Anaïs Nin

We now provide a thorough survey of the literature from three major fields related to our research. The first field is automatic image annotation, which includes the body of work on the specific task of automatically annotating images with tags. This research area draws parallels to the multi-class image classification problem, since the common goal is to assign unseen images a set of labels. However, automatic image annotation approaches are usually designed with certain objectives in mind, such as increasing the total number of correctly assigned labels. The second field of research we cover is deep representations, or embeddings, of images produced by CNNs. CNNs have been shown to be extremely effective on image classification tasks and can be used to learn robust features for a variety of vision tasks. Lastly, the final research area we draw inspiration from is the corpus of papers related to learning image classifiers from noisy data, which is only tangentially related to our task at hand, but shares many of the same difficulties in dealing with image and label noise.

2.1 Automatic Image Annotation

Automatic image annotation is a subfield of research that largely arose from developments in the field of information retrieval (IR) research. This is due to the fact that the task of automatically annotating images closely relates to tasks central to IR. For example, given a query image or query term describing some aspect of an image, visual IR systems are tasked with finding the closest image or set of images that match the query and return them. In order to accomplish this, candidate images for retrieval must be first annotated and then retrieved based on either text features, visual features, or a combination of both. Thus, research on automatic image annotation predominantly grew out of the need to produce large sets of annotated images that are candidates for retrieval in search systems.

Most commonly, automatic image annotation follows a general structure which can be summarized by two main steps: 1) low-level image features such as colors and textures are computed via global or region-based methods and used as image representations; 2) using these image representations, higher-level semantic concepts are learned from the low-level image features of the labeled training images using either single- or multi-label classification methods, or transferred via neighbor-based retrieval techniques [9]. Research in this area is then focused on improving both steps in this process, namely finding a better bag of visual features to describe images and the best machine learning or retrieval algorithm to effectively assign an unseen input image a set of relevant labels.

We refer to the excellent survey by Zhang et al. [10] for an in-depth summary of earlier approaches to automatic image annotation. Most of these methods are based on computing features using traditional computer vision techniques and use standard classifiers like support vector machines (SVM), decision trees, or k-nearest neighbors for retrieval. In contrast to that survey, in this section we focus on more recent papers that use advanced techniques such as deep neural networks to extract features and to build classifiers or retrieval systems.

One of the first applications of features generated by a convolutional neural network (CNN) to automatic image annotation was proposed by Murthy et al. [11]. The authors use image features extracted by the network and word embeddings derived from the metadata associated with the images to train a CNN-based linear regressor to predict the word embedding associated with a given input image. In essence, this approach tries to learn correlations between image features and word embeddings, but does not attempt to isolate which specific dimensions of each feature are activating the most for a specific meta tag.

In a related study, Bai et al. [12] use a deep neural network to learn image-to-image and image-to-word representations for the task of automatic image annotation. Their approach depends on interactions from search engine click-through logs to find similar words for each category they consider and to remove noisy images that are of low similarity for a given category. While the experimental results show high levels of accuracy and good cross-dataset generalization, the approach relies on proprietary data that most researchers and organizations cannot access.

Wu et al. [13] propose a semi-supervised Deep Multiple Instance Learning (DMIL) framework to learn correspondences between image regions and keywords. For an input image, their end-to-end learning framework aims to predict and localize relevant keywords in the image. Though their method is shown to work on noisy data, it requires manually annotated bounding boxes for several salient objects in each image in the training set. This manual input is simply infeasible for web-scale noisy datasets.


Another image annotation approach offered by Yu et al. [14] combines deep learning with a human-in-the-loop strategy to amplify the effectiveness of manual annotation. The system works in an iterative process where a small sample of images is labeled manually, multi-layer perceptron (MLP) classifiers are trained on a random train-test split of this now labeled sample and then applied to the full collection of images in order to rank them based on whether they are “hard” or “easy” to classify. The “easy” images are automatically labeled if they exceed a certain threshold and “hard” images are then fed to the next iteration, where they may be selected in the sampling procedure for manual annotation. The authors show that this approach results in a high label precision of about 90% while also amplifying human efforts by 40 times on average. In the end, however, this approach clearly requires human effort which is not feasible for many situations.

An automatic attribute and characterization discovery system that learns from noisy web data is developed by Berg et al. [15]. Their approach works by modeling the relationship between images and their descriptions, more specifically, how well a particular substring of the image description can be predicted from visual features of the image. The authors use shape, color, and texture features of images to rank attributes of images and determine whether or not they are localizable. The authors show that this system performs well in identifying visual attributes of images in the context of an e-commerce dataset and introduce the notion of visualness, which we adopt in our annotation system. However, this automatic attribute discovery system relies on computing classical computer vision image features, whereas ours computes features from diverging neurons deep within neural networks, similarly to [3], reviewed next.

Most closely related to our work is the research done by Vittayakorn et al. [3], which serves as an inspiration for our proposed system. The authors develop a system that can detect and localize visual attributes in images from noisy image-tag sets. More specifically, positive and negative sets of images sourced from the web are constructed, where positive means the inclusion of a particular user-generated keyword and negative means the absence of the same keyword, and fed through pre-trained and fine-tuned CNNs in order to compare distributions of activation responses for each neuron in the network. Then, for a given tag, Kullback-Leibler (KL) divergence scores between the positive and negative distributions are computed for each neuron, resulting in a measure of importance for each neuron in the network. Using the set of most divergent neurons, which the authors call prime units, they show it is possible to train a simple logistic classifier in two stages to not only determine the "visualness" of a particular keyword, but also use it in a two-step process to identify false positives and false negatives in the dataset, acting as a noise-reduction mechanism. We borrow the general approach behind identifying prime units corresponding to specific visual keywords and apply it to the task of performing automatic image annotation.

We adopt the general baseline automatic image annotation approach described by Makadia et al. [16]. The authors show that a simple framework using what they call joint equal contribution (JEC) that computes low level image features, finds the nearest neighbors to the query image in that feature space, and applies a heuristically chosen label-transfer mechanism from the discovered neighbors to the query image performs well compared to many more complex methods. We specifically choose this framework for automatic image annotation because it allows for a nearest neighbor search in any feature space, or combination of feature spaces. This allows us to use feature spaces we discover for specific visual keywords to find nearest neighbors via distance calculations.

As a follow-up to [16], Verma et al. offer an improved automatic image annotation framework called the 2-pass k-nearest neighbors algorithm that overcomes issues of class imbalance and weak labeling [2]. The authors develop the notion of a semantic neighborhood, or set of nearest neighbors for a given class, to act as a sort of "bottom-up pruning" in order to shrink the size of the search for the final label prediction. We also use a similar notion of semantic neighborhood, but within a feature space we intelligently construct from divergent neurons deep within a neural network.

Guillaumin et al. propose a novel discriminative metric learning approach for nearest neighbor models for the task of automatic image annotation, called TagProp [17]. TagProp takes a weighted combination of tag absence/presence among the neighbors of an input image to predict its tags. Weights are related either to neighbor distance or to rank and are optimized by maximizing the likelihood of annotations in the set of training images. In addition, the authors introduce a tag-specific logistic discriminant model that either boosts or suppresses the tag presence probabilities for frequent or rare words, thus increasing the number of words that are assigned to images. The TagProp system is shown to be more effective than JEC in precision, recall, and the count of tags with non-zero recall. This approach, however, like many other image annotation frameworks, relies on feature engineering using 15 distinct image descriptors, namely one Gist descriptor, 6 color histograms, and 8 bag-of-features descriptors, whereas we extract features from within the CNN layers using KL divergence.

2.2 Deep Representations

Following the success of the AlexNet model of Krizhevsky et al. [18] in the ImageNet competition in 2012, research into CNNs exploded, especially in the computer vision domain. The remarkable advantage of CNN-learned features over hand-crafted features has been demonstrated in many powerful applications, such as handwriting and face recognition. Since our proposed automatic image annotation system depends on the representational ability of CNNs to capture both high- and low-level image features, we discuss a few key papers in this area to motivate our methods.

Bengio et al. motivate the importance of data representation, or feature design, for machine learning algorithms [19]. They note the shortcomings of traditional approaches to feature engineering and the ability of deep learning algorithms to act as representation learning procedures that encapsulate multiple levels of data representation, with higher-level features representing more abstract forms of the data. This paper serves as motivation to pursue automatic image annotation using deep CNN features.

Yosinski et al. [20] offer one of the first and most comprehensive studies into the inner workings of CNNs in image recognition. The authors develop visualization tools that allow users to see which features in which layers activate the most for a given input image in real time. Their DeepVis framework allows researchers to build an intuition as to what the deep layers of a CNN are doing for different input images, as well as observe the top images for each activation unit in the network from the training set.

Ozeki et al. dig deeper into understanding how CNNs develop an internal representation of images by examining what they call category-level attributes [4]. The authors show that there are small numbers of activation units that can predict semantic attributes relatively accurately, using an approach similar to ours. However, they only consider activation units from the fully connected layers and do not study activation patterns in the convolutional layers.

Escorcia et al. similarly aim to discover the relationship between activation patterns and visual attributes by studying what they call Attribute Centric Nodes (ACNs) [5]. The authors empirically show that there exists an unevenly and sparsely distributed set of activation units within a CNN that collectively encodes relevant information relating to particular visual attributes. These so-called ACNs are shown to be important to tasks such as object recognition and offer more predictive power than hand-crafted visual features. Nonetheless, the authors do not suggest how to incorporate this discovery of ACNs in an image annotation framework.

2.3 Learning From Noisy Data

A wide variety of research has been conducted on the topic of learning from noisy data. Learning directly from noisy data circumvents the problem of obtaining a high quality ground truth training set from which to build classifiers. In this way, it can be considered an alternative approach to automatic image annotation, which fundamentally aims at constructing a high quality training set of images. Thus, we mention the key results in this area of research, focusing mostly on work that draws parallels to the automatic image annotation goal.

Izadinia et al. propose a technique for learning image classifiers from "wild" tags appearing in a dataset of 100 million Flickr images [21]. The logistic regression model the authors develop is specifically robust to label noise, and they show it can be applied to image annotation and retrieval. However, this method does not incorporate convolutional neural networks, which have been shown to be especially robust at image classification tasks.

Joulin et al. also train a multi-label classifier on the 100 million Flickr image dataset, but they use a CNN to learn robust image features [22]. In this semi-supervised setting, the authors perform a series of experiments to show that the features learned by the network are powerful for word prediction (i.e. multi-label classification), transfer learning applied to other image datasets, as well as constructing word embeddings not from word-to-word co-occurrences like in word2vec [23], but from word-to-image co-occurrences.

Wang et al. similarly attempt to learn strong CNN models in an unsupervised setting, but instead focus on developing a procedure to improve the transferability of pre-trained CNNs to work on smaller training sets [24].


The authors achieve this by introducing a low-density separator module that learns decision boundaries between successive layers that traverse regions of as low density as possible and avoid intersecting high-density regions in the activation space. In the end, the paper shows that this modification to CNNs can be used to recognize novel categories from very few examples.

Xiao et al. develop a technique to train large-scale CNNs from only a small set of cleanly annotated images and a large set of noisily annotated images [25]. The authors explicitly model label noise by using a probabilistic graphical model that aims to distinguish labels of three types: noise-free labels, confusing noise labels, and pure random noise labels. Confusing noise labels are those that may potentially have some representation in the image, but for which it is difficult to say so with certainty. Pure random noise labels, on the other hand, are labels which are entirely wrong. Using two independent CNNs, the authors model these noise types and show reasonable effectiveness on a large-scale, real-world dataset.


Chapter 3

Automatic Image Annotation Using Deep Neural Representations

“The voyage of discovery is not in seeking new landscapes but in having new eyes.”

— Marcel Proust

In the motivation section in chapter 1, we demonstrated the need for new methods for automatic image annotation that perform not just on benchmark datasets commonly used in the literature, but on noisy datasets found in the real world, in order to turn the streams of images and accompanying texts that are continually uploaded online into usable datasets. The inspiration for our proposed automatic image annotation system comes from the recent successes of deep neural networks [26], specifically CNNs in representation learning [27], in addition to classic image annotation techniques [2], [16]. We aim to discover visually observable semantic entities like objects and attributes from noisy user-generated image captions and focus on these entities when annotating unseen images. To the best of our knowledge, we are the first to propose an automatic annotation system that uses deep CNN activation unit responses to automatically discover and construct feature spaces corresponding to different tags, quantify the degree to which a semantic concept present in a tag is visual, and perform image retrieval in the feature spaces of tags that are determined to be visually perceptible, filtering out irrelevant tags in the process. In this way, the resulting approach is not only robust to noisy input data, but also produces more compact and interpretable features.

3.1 Preliminaries

We derive the basis for our method from the current best practices in the automatic image annotation and deep learning communities.

3.1.1 Convolutional Neural Networks

CNNs have demonstrated remarkable performance for detection, segmentation, and classification on data that takes a grid-like shape, such as images (2D grid) or time series data (1D grid). Somewhat surprisingly, CNNs trained to classify targets from large image datasets like ImageNet [28], a collection of 1.2 million images from 1000 object classes, have been shown to be effective even when reapplied to classification tasks on a new image dataset not explicitly seen by the network during training [29]. The general intuition behind this phenomenon, commonly referred to as transfer learning, is that CNNs are able to learn rich low- and mid-level features, ranging from edge and color blob detectors to shape detectors and other higher-level abstractions, which are common to different vision tasks and are thus transferable. Feature maps from later layers in the network, on the other hand, become more and more specific to the distribution of the original dataset, thus diminishing their transferability. Nonetheless, a portion of the learned filters can serve as a useful tool to use as a fixed feature extractor or as a starting point to fine-tune specific layers of the original network on new input images and targets. We utilize the former mechanism as the basis for the construction of our features from the intermediate layers of the network.

Figure 3.1: Typical convolutional neural network architecture

We start by feeding input images through a pre-trained CNN (described in section 4.1.4) and extract activation unit responses from different layers in the network. A typical CNN architecture is shown in figure 3.1. Note that the convolution, nonlinearity, and pooling operations are typically repeated some number of times. Since our focus is to go beyond the "black box" view of neural networks and discover how semantic entities are encoded throughout the network, we opt to record neural responses from the deep, hidden layers of the network, specifically the convolutional layers. In general, each set of activation units or neurons can be considered as a non-linear mapping of an input image to some higher-dimensional feature space. For neurons from convolutional layers, this mapping is achieved by a multi-dimensional array of parameters called a kernel, typically smaller than the size of the input. The kernel is slid across different regions of the input, producing a set of linear activations for each region. Then, the activations are fed to a nonlinear activation function and pooled to produce a condensed representation that is invariant to local translations. These three operations (convolution, nonlinear activation, and pooling) are performed in sequence for each convolutional layer in the network.

In order to construct informative features from the convolutional layers for our task of encoding image representations corresponding to specific tags, we must extract the activation units from the layers of interest after the non-linearity has been applied, but before they are pooled, unless the pooling that is performed is global. The standard pooling function is max or average pooling (i.e. taking the max or average of the responses in each window), thus mapping a volume of size $(W_1 \times H_1 \times D_1)$ to $(W_2 \times H_2 \times D_2)$, where $W_2 < W_1$, $H_2 < H_1$, and $D_1 = D_2$. Instead, we perform global max pooling over the spatial dimensions in all channels, effectively taking the maximum response of each feature map, resulting in a $D$-dimensional feature where $D$ is the number of feature maps corresponding to a particular convolutional layer. An illustration of this procedure is shown in figure 3.2.
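For concreteness, a minimal sketch of this pooling step is given below. It assumes a channels-last activation volume of shape (H, W, D), as produced by a TensorFlow-backed Keras model; the array contents are purely illustrative.

```python
import numpy as np

def global_max_pool(activation_volume):
    """Collapse a convolutional activation volume of shape (H, W, D) into a
    D-dimensional vector by taking the maximum response of each of the D
    feature maps over the spatial dimensions."""
    return activation_volume.max(axis=(0, 1))

# Example: a hypothetical 14x14 activation volume with 512 feature maps.
responses = np.random.rand(14, 14, 512)
feature = global_max_pool(responses)  # shape: (512,)
```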

3.1.2 K-Nearest Neighbors For Image Annotation

K-nearest neighbors (KNN) is a simple, yet powerful algorithm used frequently in IR research and automatic image annotation [10]. Unlike other discriminative models like logistic regression or SVM, KNN does not require explicit training or the tuning of parameters, and it can immediately accommodate new examples by simply adding them to the set of existing examples. However, KNN is expensive and slow, achieving a computational complexity of $O(n \cdot d)$ in the worst case, when pairwise distances to the input $x$ must be computed for $n$ examples in $d$ dimensions.

Figure 3.2: An illustration of the global max pooling procedure

In situations where the runtime performance of KNN-based methods is an issue, a variety of optimizations can be made, usually by introducing advanced data structures [30] or approximating distance calculations [31]. Alternatively, if the search space is decreased by an order of magnitude or more, naive KNN is usually sufficiently fast. Since our goal is to construct low-dimensional feature spaces corresponding to specific image tags, we know we only need to perform the KNN search in the union of a positive set (i.e. images annotated with a specific image tag) and a negative set (i.e. images not annotated with that tag) of examples. Of course, performing this search for each tag in our vocabulary becomes expensive, but in practice it is much cheaper than computing pairwise distances between all examples.
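For reference, a sketch of the naive search just described is shown below; the Euclidean distance is an illustrative placeholder, not a statement about the metric used in the experiments.

```python
import numpy as np

def nearest_neighbors(query, examples, k=10):
    """Naive k-nearest neighbor search: compute all n pairwise distances
    between the d-dimensional query and the n example vectors (O(n*d)),
    then keep the indices of the k smallest."""
    distances = np.linalg.norm(examples - query, axis=1)
    order = np.argsort(distances)[:k]
    return order, distances[order]
```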


High-Dimensional Spaces

Since the effectiveness of KNN-based methods rests on the ability to compute meaningful distance metrics between points in high-dimensional spaces, it is worth highlighting some of the difficulties in dealing with such spaces. The phrase "curse of dimensionality" was coined by the mathematician Richard Bellman in 1957 [32] to describe the problem caused by the exponential increase in volume associated with adding extra dimensions to Euclidean space [33]. More generally, this phrase refers to the numerous difficulties that result when trying to reason about high-dimensional spaces using intuitions that are grounded in a three-dimensional world. We appeal to the authors of [34] for a succinct formulation of the main concerns:

1. Problem 1: Poor Discrimination of Distances

Concepts such as proximity, distance, or neighborhood become less meaningful with increasing dimensionality due to a loss of contrast of distances.

2. Problem 2: Presence of Irrelevant Attributes

Among the features of a high dimensional data set, for any given query object, many attributes can be expected to be irrelevant to that object. Irrelevant attributes can interfere with the performance of similarity queries for that object.

3. Problem 3: Presence of Redundant Attributes

As with Problem 2, in a data set containing many attributes, there may be correlations or redundancies among subsets of attributes that also lead to special difficulties for data mining, indexing, or similarity search applications.

Problem 1 as described above is a fundamental concern when dealing with high-dimensional spaces. With the aid of specially chosen distance metrics, which we will discuss in the next subsection, we can compute similarity measures that appear to have some meaning, but whose magnitudes are difficult to interpret and that fail entirely to retrieve relevant examples in some cases. Problem 2, however, is indirectly addressed by the attribute discovery step in our annotation model. By eliminating irrelevant dimensions corresponding to globally max pooled feature maps generated from deep convolutional layers, we are effectively boosting the signal of an input image for discriminating whether or not it corresponds to a specific semantic entity. Nevertheless, even if we discover the optimal combination of dimensions that are relevant in describing a semantic entity, Problem 1 is still of concern because of the loss of contrast of distances. Lastly, Problem 3, the presence of redundant attributes, is problematic, but can be addressed by examining the correlation between dimensions for a specific semantic entity and dropping highly correlated dimensions above a certain threshold. This falls outside the scope of this work but is a promising direction for future work.

3.2 Annotation Model

We begin by describing our automatic image annotation model mathematically. Let $\mathcal{I} = \{I_1, \ldots, I_n\}$ be a collection of images and $\mathcal{L} = \{l_1, \ldots, l_n\}$ be a collection of their corresponding tags, where each $l_i \in \mathcal{L}$ is a set. Our goal is to produce a relevance ranking of tags for an unseen image $J \notin \mathcal{I}$ and annotate $J$ with the top $m$ ranked tags.

We define the full tag vocabulary to be the union $\mathcal{V} = \bigcup_{i=1}^{n} l_i$. Given a subset of tags $\mathcal{T} \subseteq \mathcal{V}$ (referred to as seed tags hereafter), which can be constructed by any function satisfying the mapping $\mathcal{V} \mapsto \mathcal{T} \subseteq \mathcal{V}$, we construct positive and negative sets of images from $\mathcal{I}$ for each $t \in \mathcal{T}$, corresponding to images that are labeled with the tag $t$ and images that are not labeled with the tag $t$. We denote these sets as $P_t^+$ and $P_t^-$. We hold out a fixed proportion of $P_t^+$ and $P_t^-$ to act as a validation set on which to evaluate visualness, which will be described later.

Once the sets $P_t^+$ and $P_t^-$ are constructed for each $t \in \mathcal{T}$, we feed each $I_j \in P_t^+$ and $I_k \in P_t^-$ to our network, denoted as the function $f_A: \mathcal{I} \mapsto a$ mapping images to activation vectors, where $A$ is the set of activation units extracted from different layers of the network (i.e. $|a| = |A|$), and store the activation vectors $a$ for each $I_j \in P_t^+$ and $I_k \in P_t^-$. Next, for both $P_t^+$ and $P_t^-$, and for each activation unit $u_i \in A$, we construct histograms from the empirical distribution of activation responses and call these histograms $D_i^+$ and $D_i^-$. Then, we compute the symmetric KL divergence score, similarly to [3], from $D_i^+$ and $D_i^-$ for each activation unit $i \in A$:

$$S_i(t \mid \mathcal{I}) = D_{KL}(D_i^+ \,\|\, D_i^-) + D_{KL}(D_i^- \,\|\, D_i^+) = \sum_x D_i^+(x) \log \frac{D_i^+(x)}{D_i^-(x)} + \sum_x D_i^-(x) \log \frac{D_i^-(x)}{D_i^+(x)} \qquad (3.1)$$

Here $x$ denotes the histogram bins of $D_i^+$ and $D_i^-$. It is also worth noting that $D_{KL}(P \,\|\, Q)$ is defined only when $Q(i) > 0$ for any $i$ such that $P(i) > 0$. Although activation unit responses from globally pooled feature maps produced by convolutional layers are denser than those produced by fully connected layers, computing histograms of their distributions can still result in zero-valued bins. To avoid division-by-zero errors, we add a small non-zero value to each bin before computing $S_i(t \mid \mathcal{I})$. The intuition is that as the padding we apply to each bin approaches zero, the change in the magnitude of the KL divergence also approaches zero. Thus, with a sufficiently small non-zero value, we can safely pad each bin without drastically modifying the outcome of the computation.
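A minimal sketch of this computation for a single activation unit is shown below; the bin count follows the value reported in section 4.1.4, while the padding constant and shared bin edges are illustrative choices.

```python
import numpy as np

def symmetric_kl(pos_responses, neg_responses, bins=50, eps=1e-10):
    """Symmetric KL divergence (equation 3.1) between the histograms of one
    activation unit's responses over the positive and negative image sets.
    Each bin is padded with a small constant to avoid division by zero."""
    edges = np.histogram_bin_edges(
        np.concatenate([pos_responses, neg_responses]), bins=bins)
    p, _ = np.histogram(pos_responses, bins=edges)
    q, _ = np.histogram(neg_responses, bins=edges)
    p = p.astype(float) + eps
    q = q.astype(float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```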

Once KL divergence scores are computed for each $u_i \in A$, we construct a histogram $H_t$ from their empirical distribution and select the activation units whose KL divergence falls into the top $n$-th percentile of this distribution ($n$ can be treated as a tunable hyperparameter). The selected activation units correspond to our discovered feature space $F_t$. An illustration of this procedure can be found in figure 3.3. The intuition behind this selection criterion is that we would like the dimensionality of our feature space for tag $t$ to vary according to the density of its KL divergence throughout the network. This allows tags whose KL divergence distribution exhibits a heavy right tail to be represented by a lower-dimensional feature, whereas tags whose distribution is tightly bound to zero are assigned a higher-dimensional feature.
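A sketch of the selection step is given below, assuming the per-unit divergence scores of equation 3.1 have already been computed; the 90th-percentile default mirrors the top-10% choice used in our experiments.

```python
import numpy as np

def select_prime_units(kl_scores, percentile=90):
    """Keep the activation units whose KL divergence falls in the top n-th
    percentile of the empirical score distribution; the returned indices
    define the discovered feature space F_t for the tag."""
    threshold = np.percentile(kl_scores, percentile)
    return np.where(kl_scores >= threshold)[0]
```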

Figure 3.3: An illustration of the feature space construction procedure

After feature spaces $F_t$ are discovered for each $t \in \mathcal{T}$, we proceed to evaluate a visualness score following the definition in [15], [3]. Visualness is defined as:

$$\mathcal{V}(t \mid f) \equiv \mathrm{accuracy}(f, P_t^+, P_t^-) \qquad (3.2)$$

where $f$ is a binary classification function and accuracy is evaluated on the validation set corresponding to $P_t^+$ and $P_t^-$. Following [3], we balance the training sets of $P_t^+$ and $P_t^-$ by taking a random sample in order to allow easier interpretability of the balanced classification accuracy. In theory, if $\mathcal{V}(t \mid f) \leq 50\%$, the binary classification accuracy of $f$ in distinguishing a member of $P_t^+$ from a member of $P_t^-$ is no better than a random guess, or worse in the case that $\mathcal{V}(t \mid f)$ is strictly less than 50%, suggesting that there is no connection between the features and the tags. Thus, we would like to set a threshold by which to distinguish tags with high visualness from tags with low visualness. We relax the assumption that a 50% visualness score is most capable of discriminating tags with high and low visualness and introduce a data-dependent threshold $\theta$. We suggest setting $\theta$ after inspecting $\mathcal{V}(t \mid f)$ for each $t \in \mathcal{T}$, but it can also be set a priori.
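As an illustration, visualness can be estimated with any off-the-shelf binary classifier restricted to the discovered feature space; the sketch below uses scikit-learn logistic regression, which is an assumption for illustration rather than the exact implementation used in our experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def visualness(train_X, train_y, val_X, val_y, prime_units):
    """Visualness (equation 3.2): validation accuracy of a binary classifier
    separating P_t^+ from P_t^- using only the discovered feature
    dimensions (prime units) for tag t. The label arrays are 1 for the
    positive set and 0 for the negative set."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_X[:, prime_units], train_y)
    predictions = clf.predict(val_X[:, prime_units])
    return accuracy_score(val_y, predictions)
```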

Define $S := \{t \mid \mathcal{V}(t \mid f) > \theta\} \subseteq \mathcal{T}$ to be the set of tags with visualness exceeding $\theta$ and $H := \{F_t \mid \mathcal{V}(t \mid f) > \theta\}$ to be the set of all feature spaces corresponding to tags in $S$. Given an unannotated image $J$, we extract the feature $a_J$ from our network $f_A: \mathcal{I} \mapsto a$ and project $a_J$ onto each feature space $F_t \in H$. Given a projection of $a_J$ onto $F_t \in H$, denoted as $a_{J,t}$ hereafter, we compute pairwise distances between $a_{J,t}$ and $a_{I,t} \; \forall I \in P_t^+$ and between $a_{J,t}$ and $a_{I,t} \; \forall I \in P_t^-$ using a metric $D(I, J)$, and select the $k$-nearest neighbors to $a_{J,t}$ from both $P_t^+$ and $P_t^-$. Denote these sets of nearest neighbors as $N_t^+$ and $N_t^-$. Effectively, we are selecting the most informative images for predicting the presence of a tag $t$ in a feature space constructed from activation units deep within a CNN that are best at characterizing the visual composition of tag $t$.

Once the neighbors corresponding to image $J$ are selected from both positive and negative sets, we assign a relevance score to each tag $t \in S$ by computing the difference between the mean pairwise distances between $a_{J,t}$ and $a_{I,t} \; \forall I \in N_t^+$ and between $a_{J,t}$ and $a_{I,t} \; \forall I \in N_t^-$. We can express this by the equation below:

$$R(J \mid t) = \frac{1}{k} \left( \sum_{I \in N_t^+} D(I, J) - \sum_{I \in N_t^-} D(I, J) \right) \qquad (3.3)$$

Thus, we have produced a relevance ranking of each $t \in S$ for image $J$ and can assign the top $m$ tags to $J$.
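A sketch of the relevance computation for a single tag is shown below; Euclidean distance stands in for the generic metric $D(I, J)$, and the feature matrices are assumed to already hold the projections $a_{I,t}$ restricted to $F_t$.

```python
import numpy as np

def relevance_score(query_feat, pos_feats, neg_feats, k=10):
    """Relevance of tag t for image J (equation 3.3): the difference between
    the mean distance to the k nearest positive neighbors and the mean
    distance to the k nearest negative neighbors in feature space F_t.
    Lower (more negative) scores indicate closer proximity to the positive set."""
    d_pos = np.sort(np.linalg.norm(pos_feats - query_feat, axis=1))[:k]
    d_neg = np.sort(np.linalg.norm(neg_feats - query_feat, axis=1))[:k]
    return (d_pos.sum() - d_neg.sum()) / k
```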


3.3 Pipeline

In this section, we offer an overview and step-by-step walkthrough of our proposed automatic annotation method.

3.3.1 Overview

The life cycle of the system is composed of two distinct parts: the discovery cycle and the annotation cycle. The discovery cycle of the annotation system requires as input a set of images, their corresponding tag sets, and a set of seed tags. The set of seed tags can be generated from the full tag vocabulary using any heuristic (such as tag frequency). After feature spaces corresponding to the highly diverging activation units for each seed tag are constructed and visualness scores are calculated from equation 3.2 using a binary classifier (e.g. logistic regression) on the test set, the system is ready to graduate to the annotation cycle. The annotation cycle of the annotation system takes as input an untagged image and generates a ranked list of suggested tags for the input image.

3.3.2 Description

All steps of the proposed system are described in detail below:

Discovery Cycle

1. Seed tag selection: A set of seed tags is generated according to some user-defined function. For example, a naive approach is to select a set of seed tags by their frequency in the full tag vocabulary. Other approaches can use domain-specific heuristics or natural language processing techniques to select relevant seed tags.

2. Set construction: For each tag from the set of seed tags, create positive and negative sets of images. The positive set should contain images that have the specified tag and the negative set should contain a random selection of images that do not have the specified tag. The positive and negative sets of images are then split into training, validation, and test sets, with the training and validation sets used in the discovery cycle and the test set used in the annotation cycle (described in the next subsection).

3. Feature extraction: For each tag, iterate through the positive and negative sets of images created in step 2, feed each image through a pre-trained CNN, and save the activation unit responses for each image from each layer in the pre-trained network (or a subset of all the layers in the network).

4. Distribution construction: For each tag, iterate through the activation unit responses of the images from the positive and negative sets generated in step 3 and create distributions (i.e. histograms) of the responses for each activation unit in the network for both sets.

5. KL divergence computation: For each tag, iterate through the distributions of activation unit responses of the images from the positive and negative sets created in step 4, compute the Kullback-Leibler (KL) divergence corresponding to each activation unit in the network, and save these divergence scores. We provide a visualization of this step in appendix 6.1.

6. Feature space selection: For each tag, iterate through the KL divergence scores corresponding to each activation unit in the network and select the top n-th percentile of diverging units (we choose the top 10th percentile, but this can be treated as an optimizable hyperparameter) as the feature space corresponding to the tag of interest.

7. Visualness evaluation: For each tag, use the discovered feature space from step 6 to train a binary classifier on the validation set corresponding to the specified tag. The accuracy score, or visualness, is saved. A compositional sketch of steps 4-7 for a single tag is given after this list.
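The sketch below shows how steps 4-7 compose for a single tag. It reuses the symmetric_kl, select_prime_units, and visualness helpers sketched in chapter 3, operates on activation matrices that are assumed to be already extracted and split into training and validation portions, and is illustrative rather than a description of the exact implementation.

```python
import numpy as np

def discover_feature_space(acts_pos, acts_neg, val_pos, val_neg,
                           percentile=90, bins=50):
    """One pass of the discovery cycle for a single tag. Rows are images,
    columns are globally max pooled activation units."""
    n_units = acts_pos.shape[1]
    # Steps 4-5: per-unit histograms and symmetric KL divergence scores.
    kl_scores = np.array([symmetric_kl(acts_pos[:, i], acts_neg[:, i], bins=bins)
                          for i in range(n_units)])
    # Step 6: keep the most divergent units as the tag's feature space.
    prime_units = select_prime_units(kl_scores, percentile=percentile)
    # Step 7: visualness as the validation accuracy of a binary classifier.
    X_train = np.vstack([acts_pos, acts_neg])
    y_train = np.concatenate([np.ones(len(acts_pos)), np.zeros(len(acts_neg))])
    X_val = np.vstack([val_pos, val_neg])
    y_val = np.concatenate([np.ones(len(val_pos)), np.zeros(len(val_neg))])
    score = visualness(X_train, y_train, X_val, y_val, prime_units)
    return prime_units, score
```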

Annotation Cycle

1. Feature extraction: For each image in the test set, feed the image through the same pre-trained network used in the discovery cycle, record activation unit responses, and perform steps 2-4.

2. Nearest neighbor search: For each tag in the set of seed tags whose visualness score exceeds some threshold, find the k-nearest neighbors to the input image from the positive and negative sets in the feature space corresponding to the prime units discovered in the discovery cycle. We set k = 10 in our experiments, but the choice of k can be treated as an optimizable hyperparameter.

3. Relevance computation: For each seed tag, calculate the average distance from the input image to its k-nearest neighbors from the positive set and the average distance to its k-nearest neighbors from the negative set, and take the difference between the two.

4. Annotation: Take the ranked list of seed tags and assign the top m tags to the input image.


Chapter 4

Experiments

“No amount of experimentation can ever prove me right; a single experiment can prove me wrong.”

— Albert Einstein

To test the effectiveness of our automatic annotation system, we perform a series of experiments on a small subset of a real-world dataset that is specifically chosen because of its particularly noisy characteristics. We hypothesize that if the performance of our method is satisfactory on such a dataset, it will improve on datasets of larger size and cleaner composition.

4.1 Experimental Setup

4.1.1 Dataset

Our proposed method is tested on the Etsy dataset provided by Kota Yamaguchi of Tohoku University (vision.is.tohoku.ac.jp/~kyamagu/research/etsy-dataset/). The Etsy dataset comprises a collection of 2.8 million product listings sold in September 2014 on Etsy.com, an online peer-to-peer marketplace focused on handmade, unique, and vintage goods. Similarly to the authors in [3], we choose the subset of listings from the clothing category to focus our experiments on, totaling approximately 175,000 listings from 247 sub-categories. Each product listing includes an image and associated metadata such as the product title, description, category, and tags.

We show a sample of images from the dataset under consideration in figure 4.1 below. It is evident that the images users take of their products are captured from a variety of orientations and have varying levels of background noise. In addition, watermarks are often added, contaminating the clarity of the images further. We also find a non-trivial number of images that are difficult for even domain experts to annotate. This elevates the challenge of working with this dataset in developing an automatic annotation system.

Figure 4.1: A random sample of images from the clothing category of the Etsy dataset


4.1.2 Data Preprocessing

For our task, we choose to rescale the full-sized images to 224x224 pixels before sending the images to the network for feature extraction. We drop all metadata except for the user-generated tags, pre-processing them only to remove special characters and spaces. The dataset is then split into 80% training and 20% testing sets corresponding to the discovery and annotation cycles of the system. Our minimal preprocessing follows from our goal of designing an annotation system that can automatically label unseen images with relevant tags from a corpus of noisy image and text data.
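A minimal sketch of this preprocessing is shown below; the file handling and the exact normalization applied to multi-word tags are assumptions made for illustration.

```python
import re
from PIL import Image

def preprocess_listing(image_path, raw_tags):
    """Rescale a listing image to the 224x224 input size used for feature
    extraction and normalize its user-generated tags by lowercasing and
    stripping special characters."""
    image = Image.open(image_path).convert("RGB").resize((224, 224))
    tags = [re.sub(r"[^a-z0-9 ]", "", tag.lower()).strip() for tag in raw_tags]
    return image, [tag for tag in tags if tag]
```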

4.1.3 Seed Tag Generation

We experiment with several different seed tag generators in order to extract a meaningful vocabulary from the full set of user-generated tags. We operate with the goal of choosing tags that will allow the construction of positive and negative sets of images large enough to allow our system to discriminate tags with low visualness from tags with high visualness. We experiment with different thresholds, but find that a minimum set size of 50 is able to produce meaningful results.

The simplest seed tag generation algorithm is based on counting the frequency of tags in the full set. Specifically, by counting tag frequencies and taking any tag whose frequency exceeds a predetermined minimal set size, we are left with a subset of tags that we can test for visualness. Applying this heuristic to the Etsy dataset, we are left with a vocabulary of 319 seed tags. The top 50 tags ordered by frequency generated in this way are shown in table 6.2 in the appendix.
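The frequency-based generator amounts to a simple count over all tag sets; a sketch is given below, with the default threshold matching the minimum set size of 50 used in our experiments.

```python
from collections import Counter

def seed_tags_by_frequency(tag_sets, min_set_size=50):
    """Count how often each tag occurs across all listings and keep the
    tags whose frequency meets the minimum positive-set size."""
    counts = Counter(tag for tags in tag_sets for tag in tags)
    return sorted(tag for tag, n in counts.items() if n >= min_set_size)
```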

In total, this process produces a set of 319 seed tags. After the discovery cycle is complete, we inspect the visualness score of each tag and find that a threshold of 0.6 is most suitable to filter out noisy, non-visual tags, such as free shipping, sale, and one of a kind. After filtering out these tags with low visualness, we are left with 269 tags in our vocabulary. Table 4.1 below displays the most and least visual tags before filtering.

Table 4.1: Tags with highest and lowest visualness

Rank  Tag            Visualness
1     funny tshirt   0.89
2     converse       0.88
3     tutu           0.88
4     flip flops     0.88
5     shoes          0.87
315   free shipping  0.53
316   fun            0.53
317   teen           0.53
318   custom made    0.53
319   trendy         0.52

4.1.4 Implementation

We implement our automatic annotation system in Python with the deep learning framework Keras [35] using the TensorFlow [36] backend. We use the VGG-19 network [37], a specific CNN architecture characterized by its simple arrangement of 3x3 convolutional layers stacked on top of each other. See table 6.1 in the appendix for more details. We load weights for VGG-19 that were pre-trained on ImageNet [28] and write a custom routine to extract features from the convolutional layers. We do not fine-tune any portion of the network on our dataset and instead use the weights as is.

The network consists of sixteen convolutional layers arranged in five blocks, followed by fully connected layers and a softmax layer, as shown in figure 4.2. We ignore the fully connected and softmax layers and extract features from select convolutional layers only. We take the last convolutional layer in each block, pool its activations over the spatial dimensions, and concatenate the results to form a 1472-dimensional feature vector.
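As an illustration, the sketch below extracts such a descriptor with Keras; the layer names follow Keras' VGG-19 implementation, and the use of average pooling over the spatial dimensions is an assumption:

    import numpy as np
    from keras.applications.vgg19 import VGG19, preprocess_input
    from keras.models import Model
    from keras.preprocessing import image

    # Last convolutional layer of each of the five blocks (64+128+256+512+512 = 1472 channels)
    LAYERS = ['block1_conv2', 'block2_conv2', 'block3_conv4', 'block4_conv4', 'block5_conv4']

    base = VGG19(weights='imagenet', include_top=False)
    extractor = Model(inputs=base.input,
                      outputs=[base.get_layer(name).output for name in LAYERS])

    def extract_features(img_path):
        # Returns a 1472-dimensional descriptor for one image
        img = image.load_img(img_path, target_size=(224, 224))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        activations = extractor.predict(x)
        # Pool each feature map over its spatial dimensions and concatenate the channel vectors
        pooled = [a.mean(axis=(1, 2)).squeeze() for a in activations]
        return np.concatenate(pooled)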

Figure 4.2: VGG-19 architecture

We run our experiments using the following hyperparameters: minimum number of samples to construct positive sets = 50, train proportion = 0.9, negative set size factor = 1, number of histogram bins = 50, KL divergence PDF threshold = 0.1, and k = 10.
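For reference, the same settings expressed as a configuration dictionary; the key names are illustrative, not the ones used in our code:

    HYPERPARAMS = {
        'min_positive_set_size': 50,     # minimum samples to construct a positive set
        'train_proportion': 0.9,         # train proportion (presumably within the discovery cycle)
        'negative_set_size_factor': 1,   # negative set size as a multiple of the positive set size
        'num_histogram_bins': 50,        # bins for the activation histograms
        'kl_divergence_pdf_threshold': 0.1,
        'k': 10,                         # neighbours retrieved per feature space
    }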


4.2 Experimental Results

In order to measure the effectiveness of our method, we evaluate our system both qualitatively and quantitatively. Specifically, for the quantitative evaluation in section 4.2.2, we test 500 images that were not used by the system in the discovery cycle, and record the top 5 suggested tags for each image from the output of the annotation cycle.

4.2.1 Qualitative Evaluation

We pick a few anecdotal examples of annotations produced by our method to highlight some of its strengths and weaknesses. In figure 4.3 below, we highlight instances where our system accurately assigned relevant tags to unseen images. It is interesting to note the linguistic variety in the assigned tags. For example, the shoes in the bottom middle are tagged exclusively with words describing the physical object in the image, whereas the denim shorts in the top middle are assigned three tags describing the object and two tags describing its attributes. In both cases the images are relatively clean (i.e. minimal background noise), yet the semantic composition of the generated tags differs. It is also interesting to analyze the example in the top right corner of figure 4.3. Although a human annotator might guess that this is an image of the front of a T-shirt, the scale at which the image was taken makes this inference difficult. Even so, the system assigns this tag correctly, in addition to one clearly visible attribute of the image, namely the decoration of the shirt with rhinestones.

Figure 4.3: Positive annotation examples

Given a dataset with such noisy characteristics, we expect our method to fail to correctly assign tags to unseen images in some cases. In order to gain insight into why this happens, we inspect a few examples of poor annotations and offer theories as to why they occur. First and foremost, we recognize that our method will not be able to correctly annotate images that are totally unrecognizable, or unrecognizable to a degree that a human annotator would have difficulty describing their content. For images that meet this criterion, which as we noted earlier is a non-trivial proportion of the entire dataset, we find that the system makes what appear to be random, nonsensical annotations. However, in certain cases where images are relatively easy to interpret, we find that our method suggests tags that seem completely out of place. Take the image of the furry coat at the far right of figure 4.4. While the coat certainly appears “warm”, the suggested tags that precede it bear no relation to the content of the image. It is difficult to determine precisely why this happens, but our first intuition suggests that these images, despite containing what appear to be relatively discernible semantic entities and attributes, have little to no similarity to images from the positive sets of those same entities and attributes. Thus, our relevance ranking score for the tags corresponding to these semantic entities and attributes is not noticeably higher than the ranking score for unrelated tags, and as a result the suggested tags appear random.

Figure 4.4: Negative annotation examples

4.2.2 Quantitative Evaluation

Since the tag sets of images in our dataset are user-generated, we cannot use them as ground truth to evaluate the quality of the tags suggested by our method. Thus, we must appeal to human annotators to determine whether the assigned tags are relevant to some aspect of the image.

Human Judgment Task

We pick 500 random images from our test set and design the following task for annotators to judge the quality of our generated tags:

1. Consider each tag in relation to the image in the clothing domain³. If there is a tag whose meaning is not understood, look up its definition in relation to this domain.

2. For each tag, pick one of the following to score its relevancy to the given image:

• RE: the tag is relevant to the image

For example, if the tag is directly discernible in the image (i.e. an object or visually perceptible attribute)

• PR: the tag is possibly relevant to the image

For example, if the tag is indirectly discernible in the image (i.e. an abstract attribute)

• NR: the tag is not relevant to the image

For example, if the tag is not discernible or related to the image in any way

Results

Since the output of our annotation system is a ranked list of the top 5 most relevant tags, each tag in this list can be considered either a true positive or a false positive. Thus, we infer the quality of annotations from the human judgment task by computing precision and precision at k (for k = 1) for the 500 test images. More specifically, we compute these measures for two cases: when considering tags marked with RE as true positives, and when considering tags marked with RE or PR as true positives. Additionally, we compute N+, the number of tags that are correctly assigned to at least one test image. Table 4.2 below summarizes the main results, and a short sketch of this computation follows the table.

³ We emphasize the clothing domain to eliminate biases due to semantic differences of tags in different domains.

Table 4.2: Results of the human judgment task

Case  Precision  Precision@1  N+
1     0.34       0.43         197
2     0.58       0.61         225
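A minimal sketch of how these measures can be computed from the collected judgments; the data format shown is hypothetical:

    # For each test image: the 5 suggested tags paired with a human label
    # in {'RE', 'PR', 'NR'}, ordered by the system's ranking.
    toy_judgments = [
        [('dress', 'RE'), ('lace', 'PR'), ('vintage', 'NR'), ('white', 'RE'), ('sexy', 'NR')],
        [('shoes', 'RE'), ('converse', 'RE'), ('blue', 'PR'), ('fun', 'NR'), ('gift', 'NR')],
    ]

    def evaluate(judgments, accepted):
        # Precision over all suggested tags, precision@1, and N+ (distinct tags
        # correctly assigned to at least one image)
        hits = total = top1_hits = 0
        correct_tags = set()
        for suggestions in judgments:
            total += len(suggestions)
            for rank, (tag, label) in enumerate(suggestions):
                if label in accepted:
                    hits += 1
                    correct_tags.add(tag)
                    if rank == 0:
                        top1_hits += 1
        return hits / total, top1_hits / len(judgments), len(correct_tags)

    print(evaluate(toy_judgments, accepted={'RE'}))        # case 1: only RE counts
    print(evaluate(toy_judgments, accepted={'RE', 'PR'}))  # case 2: RE or PR counts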

4.3 Discussion

While the precision and precision at 1 scores for relevant tags alone are fairly low (case 1), they increase substantially when possibly relevant tags are also counted as true positives (case 2). Specifically, from case 1 to case 2 the precision rises from 0.34 to 0.58, a relative increase of roughly 70% ((0.58 - 0.34)/0.34 ≈ 0.71), and the precision at 1 rises from 0.43 to 0.61, a relative increase of roughly 42%. Since some of our seed tags are abstract attributes that do not have a concrete visual manifestation, we consider the scores for case 2, where both relevant and possibly relevant suggested tags count as true positives, to be the better indicator of general performance.

In addition, we see that N+, the number of distinct tags correctly assigned to at least one test image, covers a large fraction of the 269-tag vocabulary: 73% in case 1 and 83% in case 2. This indicates that our method assigns a diverse set of tags to images rather than placing preference on a small subset of the vocabulary.

Considering the challenges associated with the Etsy dataset, we find the performance of our method suitable for use in human-in-the-loop annotation strategies. In addition to suggesting relevant or possibly relevant tags almost 60% of the time, we also find that our system is able to recognize abstract attributes present in images, going beyond the recognition of mere objects and physical entities.


Chapter 5

Conclusion

In this thesis, we motivated the need for new automatic image annotation methods that can operate on the abundance of noisy, user-generated image and text data available today, in order to aid researchers and organizations in constructing image datasets. Through an extensive literature study, we demonstrated that this particular task is not well represented in the automatic image annotation community. In addition, we exposed a considerable lack of deep learning-based methods, indicating room for improvement in this field. With this in mind, we proposed a novel automatic image annotation method that utilizes the representational ability of deep neural networks and is designed to work with weakly supervised data.

We find that our proposed method for automatic image annotation performs favorably in experiments on a noisy, real-world dataset. Furthermore, our feature space discovery process shows that a CNN pre-trained on one domain can be applied to another domain without fine-tuning, and that a small number of its dimensions is sufficient to capture image-to-image similarity with respect to abstract semantic entities that have discernible visual properties.


5.1 Future Work

As is common in the deep learning field, investigating the impact of deeper networks on the performance of the system is a natural next research question. In particular, inspecting the impact of larger, deeper features derived from the intermediate layers of a CNN on the encoding of semantic entities that exhibit some degree of visualness is a promising direction for future research. In addition, there are several possible improvements on the natural language processing side. In particular, it remains an open question whether constructing positive and negative sets of images more intelligently, for example by examining label-to-label dependencies, can offer more informative feature spaces in which to perform retrieval. Lastly, testing this image annotation method on datasets of different sizes and compositions would reveal whether its performance depends on the quantity and quality of the data it receives.


Chapter 6

Appendix


Table 6.1: VGG-19 network details

Block  Layer type  # filters  Kernel/pool size  Activation  Output shape
1      Conv        64         3x3               ReLU        (224, 224, 64)
       Conv        64         3x3               ReLU        (224, 224, 64)
       MaxPool     -          2x2               -           (112, 112, 64)
2      Conv        128        3x3               ReLU        (112, 112, 128)
       Conv        128        3x3               ReLU        (112, 112, 128)
       MaxPool     -          2x2               -           (56, 56, 128)
3      Conv        256        3x3               ReLU        (56, 56, 256)
       Conv        256        3x3               ReLU        (56, 56, 256)
       Conv        256        3x3               ReLU        (56, 56, 256)
       Conv        256        3x3               ReLU        (56, 56, 256)
       MaxPool     -          2x2               -           (28, 28, 256)
4      Conv        512        3x3               ReLU        (28, 28, 512)
       Conv        512        3x3               ReLU        (28, 28, 512)
       Conv        512        3x3               ReLU        (28, 28, 512)
       Conv        512        3x3               ReLU        (28, 28, 512)
       MaxPool     -          2x2               -           (14, 14, 512)
5      Conv        512        3x3               ReLU        (14, 14, 512)
       Conv        512        3x3               ReLU        (14, 14, 512)
       Conv        512        3x3               ReLU        (14, 14, 512)
       Conv        512        3x3               ReLU        (14, 14, 512)
       MaxPool     -          2x2               -           (7, 7, 512)
6      FC1         -          -                 ReLU        (4096)
       FC2         -          -                 ReLU        (4096)


Table 6.2: List of the top 50 seed tags by frequency

Tag        Count    Tag            Count
dress      8983     tank           3232
women      8374     boho           3186
handmade   8023     sweatshirt     3117
clothing   7842     hippie         3061
shirt      7147     personalized   2973
black      7003     womens         2940
baby       5301     shorts         2913
shoes      5261     green          2776
custom     4979     plus size      2763
vintage    4817     toddler        2695
blue       4467     lace           2667
cotton     4452     pants          2621
pink       4424     unique         2535
white      4346     top            2500
tank top   4150     purple         2471
summer     4096     monogram       2420
tshirt     3870     upcycled       2380
t shirt    3760     sexy           2374
girl       3709     halloween      2371
jacket     3542     cute           2318
gift       3502     denim          2297
girls      3485     children       2287
red        3358     tie dye        2201
wedding    3350     birthday       2165


Figure 6.1: A visualization of activation unit divergences in the intermediate layers of the network for tag “ruffle”


Bibliography

[1] Mary Meeker. 2014 Internet Trends Report. Report. KPCB, 2014.

[2] Yashaswi Verma and CV Jawahar. “Image annotation using metric learning in semantic neighbourhoods”. In: European Conference on Computer Vision. Springer. 2012, pp. 836–849.

[3] Sirion Vittayakorn et al. “Automatic attribute discovery with neural activations”. In: European Conference on Computer Vision. Springer. 2016, pp. 252–268.

[4] Makoto Ozeki and Takayuki Okatani. “Understanding convolutional neural networks in terms of category-level attributes”. In: Asian Conference on Computer Vision. Springer. 2014, pp. 362–375.

[5] Victor Escorcia, Juan Carlos Niebles, and Bernard Ghanem. “On the relationship between visual attributes and convolutional networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1256–1264.

[6] SourceForge. Mulan: A Java Library for Multi-Label Learning. 2002. URL: http://mulan.sourceforge.net/datasets-mlc.html (visited on 05/14/2017).

[7] Luis Von Ahn and Laura Dabbish. “Labeling images with a computer game”. In: Proceedings of the SIGCHI conference on Human factors in computing systems. ACM. 2004, pp. 319–326.
