
Fine-grained classification of street objects and scenes in satellite images

BY

J. Tressel, MSc

11151129

SUPERVISED BY

dr. S. Rudinac

Amsterdam Business School


Fine-grained classification of street objects and scenes in satellite images

Postal services are looking for new business models as revenues from mail delivery are diminishing due to decreasing mail volumes. Because of this declining revenue PostNL, the leading postal services company in The Netherlands, is investigating new business models that leverage the current postal delivery workforce. Their existing data brokerage services can be extended using observations of street objects or scenes of interest by postal delivery persons to create value added services.

To aid the postal delivery workforce in locating these objects or scenes of interest, a computer vision model can efficiently classify satellite imagery and identify the likely locations. A relevance feedback approach is proposed to allow PostNL to quickly classify objects or scenes with a limited number of user interactions. An interactive dashboard has been created to allow a user to visually explore and search the available images for street objects or scenes. To enable the classification and search functionality, features are automatically extracted from the images.

This paper shows that a computer vision task can learn to classify previously unknown street objects and scenes in satellite images. Different classifiers have been trained and show acceptable performance. More importantly, it is shown that the classifiers can be improved further given time and resources. Based on these results, PostNL is now better able to estimate the effort required to create the visual analytics system which is needed to effectively run the envisioned value added services.

1 Introduction

As a direct result of an increasingly digitizing society, postal delivery is a rapidly diminishing business. Since 2000, mail volumes have steadily dropped. Initially a small decrease in volume was observed, but starting from 2009, volumes have decreased by 5% each year, with a peak in 2013 when volumes dropped by 12% (PostNL Annual report, 2016). This has led to a repositioning of incumbent postal companies such as PostNL in The Netherlands. From the onset of the rise of e-commerce, PostNL has been building its package delivery services. This has resulted in a successful parcel delivery company. But as the mail business is still heavily regulated, mail delivery has been kept separate from the parcel delivery activities. This has led to societal discussions on the role of mail delivery, the job security of mail delivery persons and, internally, on the changing nature of mail delivery. Not every mail delivery person is keen to change, and this poses yet another challenge for PostNL when they want to test new products and services in the market.

As mail delivery by PostNL is regulated, the company is required to deliver mail at least five days per week. As labour cost is a large portion of the total operating cost, new products and services are aimed at more effectively utilising one of its most important assets: the postal delivery person passing through every street multiple days a week. Some of these services are very much in line with delivery, such as combining delivery with installation services. Another example is reading the water meter on request of water utility providers, which can be executed very efficiently, as the mail delivery persons are already in the neighbourhood.

Other services are more in line with some of the data brokerage services PostNL is already providing, such as collecting information on addresses or verification of household characteristics. These data brokerage services can be enhanced with observations from the postal workforce during their delivery route, such as reporting street litter, graffiti or landscape service notification. Other data services can be collecting information on neighbourhood or house characteristics, such as registration of abandoned houses, brands of cars parked in the streets, or houses with swimming pools or thatched roofs. Collecting this kind of neighbourhood or house/household related information would become very costly if every single street has to be checked. Particularly the collection of information on house characteristics can be optimized by first assessing if a particular point of interest is likely to occur in a certain area, by using satellite images. Also, when mail delivery persons record the street object or scene using their smart phone, recognizing a particular occurrence of e.g. street litter or grassy areas in need of a good trimming is a prime candidate task for automation. These tasks require going through large amounts of data, which makes manual annotation infeasible. Computer vision techniques are able to recognize certain objects or scenes in satellite images or in the pictures from the aforementioned smart phone, thereby reducing the manual effort and making the overall process feasible in a business setting.

These computer vision tasks have become the subject of many recent studies and at the same time have found huge uptake in industry, with applications varying from social networks and self-driving cars to detecting malignant cells in medical imagery. Facebook has become very popular partly due to the ease of automated recognition of friends' faces, Tesla uses the visual information from on-board cameras to detect for example road signs and lane markings to enable the car to drive by itself [Gavves, 2016], and IBM is creating cognitive assistants that can assist medical doctors in diagnosis using image analysis combined with a question and answer system to guide the doctor through the full analysis [Doyle-Lindrud, 2015].

PostNL also has experience using advanced image recognition throughout the logistics chain, starting from recognising written or printed addresses to allow for fully automated logistics processing, through to recognizing package sizes to determine logistical handling. These computer vision tasks are somewhat different from the aforementioned tasks of recognizing street objects or scenes. The computer vision tasks that PostNL currently has experience with focus on a very specific application. Conditions such as placement of the package, lighting, smooth background, etc. in such a setting can be optimized for easier recognition of the image content. Street scenes or aerial images are much noisier and there is almost no control over the creation of the image, resulting in a lack of quality and consistency.

Given that PostNL is exploring many options with regard to the aforementioned new data brokerage services, the computer vision methods should be able to classify previously unknown categories. A fully automated system seems unlikely at this stage, given the imperfections in current computer vision techniques. A combination of man and machine is a proven path of success for many analytical systems as it allows for the best of both worlds: human interpretation of information vs. bulk data analysis by a computer [Ware et al., 2001]. The classification of street objects and scenes is likely to be very subtle, such as long grass vs. short grass, or a thatched roof vs. a normal tiled roof. This variability in both the input (images) and the output (object classes) creates the need for an interactive computer vision system that can classify these fine-grained objects or scenes. Another benefit is the increased adoption of such an interactive system, as the user has direct influence over the output, as opposed to an automated solution which might not always work as required.

The interactive part of the system would allow the user to start exploring in search of a new category. It would then capture the user input and suggest similar images, which the user can then investigate for appropriateness, so as to initially identify some of the images with the sought-after category. The system then learns from the initial user input. Following this initial labelling a model can be trained and should then be able to propose likely candidates that show a similar pattern. The user can then indicate which of the candidates are true representatives of the category, and with this information the model can then improve itself (relevance feedback). This learning cycle repeats until the system cannot identify any new likely candidates, the user is satisfied with the performance, or the user is out of time.

From these challenges the main research question arises: Can a computer vision task learn to classify previously unknown street objects and scenes? This paper tries to answer this question by exploring the following four sub-questions:

1. What are effective means to extract basic features from the image?

2. How can users provide input and feedback to this learning task?

3. What are the appropriate quality criteria for assessing the output of this learning task?

4. How much user interaction is required before the fine-grained classification task is learned?

Ultimately, when PostNL wants to implement such an interactive learning system, many factors need to be taken into account. During implementation, attention needs to be paid to creating compelling user stories, technology has to be built and configured, and change management has to make sure the system is used in the right way and stakeholders are aligned. This research is aimed at exploring the potential of such a system, investigating some of the analytical and technical prerequisites, and showing that such an interactive visual learning system can be built with minimally available means.

2 Literature review

Computer vision tasks in essence have the aim of understanding the scene in a visual representation of reality, such as photographs or videos. One part of this understanding is in crossing the semantic gap between the information that can be automatically extracted from the visual data and the user interpretation of that same data [Worring et al., 2000]. Early work was aimed at recognising edges and 2D lines based on the effects of image formation phenomena. Until recently, these analysis techniques were prevailing for many advanced types of image analysis, such as detecting features, segmentation, motion, stitching, and rendering [Szeliski, 2010; Prince, 2012].

The advances of machine learning techniques, exponentially increasing computer power and the huge amount of available images form the basis for the shift of computer vision tasks towards the use of convolutional neural nets (CNNs). CNNs have been outperforming traditional techniques in the ImageNet Large Scale Visual Recognition Competition (ILSVRC) since the introduction of AlexNet [Deng et al., 2009; Krizhevsky et al., 2012]. In 2015, Microsoft Research Asia won the competition with almost superhuman accuracy: only 3.57% error [He et al., 2016].

CNNs are a special configuration of very deep neural networks. Deep Neural Networks (DNNs) are a type of machine learning: the ability of a computer program to learn without being explicitly programmed [Samuel, 1967]. They have a long history: Rosenblatt first described perceptrons [Rosenblatt, 1958], which were capable of learning weights based on a one-layered function. Minsky & Papert showed that such perceptrons cannot learn non-linear decision boundaries [Minsky and Papert, 1969], which motivated the development of multi-layered perceptrons (also known as neural networks). These have been proven to be universal approximators.

DNNs are inspired by the function of the brain, hence the term neural networks. The basic unit is a neuron, which has a set of inputs, each with a specific weight. The neuron is a non-linear function, which returns a value between 0 and 1. This value is then passed on to neurons in the next layer of the network. This process is called forward propagation, which results in an ultimate output of the network. This output is then compared with the known labels and the resulting error is back-propagated through the network, thereby adjusting the weights. This process repeats until the error or loss is sufficiently small. Stochastic Gradient Descent (SGD) techniques have made the process of loss optimization quite efficient [Ruder, 2016].
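For reference, a single SGD step on the weights w with learning rate \eta, using a randomly drawn training example (x_i, y_i) and loss function L, takes the standard form

w_{t+1} = w_t - \eta \, \nabla_w L(w_t; x_i, y_i)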

Neural networks have been able to grow into very deep layered architectures with the increase in compute power, the availability of big training sets, and the tackling of many algorithmic challenges such as vanishing gradients, helped by unsupervised pre-training and improved regularization tricks such as drop-out. With this growth in size and depth, neural networks are able to express very sophisticated non-linear relationships with literally billions of parameters, allowing them to be fine-tuned for very specific classification challenges such as text, voice, images, and even video streams.

The interesting fact about deep learning is that the model takes raw data, which is then transformed, layer by layer, into a useful representation of the data so the same model is able to classify the input data to the output classes. This way of learning features as opposed to (manual) feature engineering is also called representation learning, or feature learning [Bengio et al., 2013]. Feature learning has many applications, for example audio classification, where audio features are extracted from the raw audio file to identify a speaker, or natural language processing where text features are distilled from raw text, and images where vision features are extracted from the raw image. Where manual feature engineering has found some good features such as SIFT & HoG for image feature extraction, and (completely separate) features for audio files such as spectrogram & MFCC, deep learning provides a common algorithm for learning these features in an automated way.

Currently, the leading DNN architecture for image classification is a Convolutional Neural Network (CNN). This type of neural network is built using repeating layers of convolutions and pooling, with the final layer serving as a classifier. In convolution layers or convolutional operations the original image is processed such that the contained information is extracted. This operation is an integral (or summation) that expresses the amount of overlap of one function as it is shifted over the other function. In essence this can be used as a technique to tease out important features of an image, such as edges or corners, and even more specific features such as parts of a face [Goodfellow et al., 2016].

The convolutions can be seen as a sliding window that slides over the input image. Technically, this is commonly implemented as a matrix multiplication, with (patches of) the image pixels forming one matrix and the kernel/filter weights the other.

S(i, j) = (I \ast K)(i, j) = \sum_{m} \sum_{n} I(m, n) \, K(i - m, j - n)

This matrix multiplication is ideally performed using graphics processing units (GPUs), which are specifically built for this purpose.
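To make the operation concrete, the following is a minimal NumPy sketch of the discrete 2D convolution defined above. It is illustrative only: the toy image, the edge-detection kernel and the 'valid' output size are assumptions for the example, not part of the system described in this thesis.

    import numpy as np

    def conv2d(image, kernel):
        # S(i, j) = sum_m sum_n I(m, n) K(i - m, j - n), computed as a sliding
        # window; the kernel is flipped so this is a true convolution.
        k = np.flipud(np.fliplr(kernel))
        kh, kw = k.shape
        out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
        return out

    # toy example: a vertical-edge kernel applied to a synthetic 6x6 image
    image = np.zeros((6, 6)); image[:, 3:] = 1.0
    kernel = np.array([[1.0, 0.0, -1.0]] * 3)
    print(conv2d(image, kernel))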

CNNs exploit the symmetry of images, with the shallower layers capturing generic low-level features, such as edges, angles and light transitions, and the deeper layers tuning in more to the ultimate classes, i.e. wheels for detecting cars and eyes or noses for detecting faces. For a CNN to generalize well for a particular task, or even across tasks, a number of regularization tricks are applied, a popular one being dropout, which randomly drops connections from the network to prevent overfitting [Hinton et al., 2012].

But, even with the huge compute power available, many hours of intensive computing are required for these deep neural nets to be trained. These models need a long initialization period before they start learning effectively. Luckily, research into feature transfer shows that the learned features can be transferred to other similar learning tasks [Goodfellow et al., 2012]. This method of feature transfer has powered many applications and forms an active research field. One of the key computer vision research enablers has been the creation of ImageNet, a large-scale ontology of images accurately tagged using Amazon Mechanical Turk [Deng et al., 2009]. ILSVRC, the accompanying challenge to create classification models using ImageNet and publish the resulting pre-trained models, has offered other researchers in computer vision a great platform to extend. One of the areas that is researched is the way that learned features are transferred to other learning tasks. Particularly the features learned from images are interesting to use as a base for further computer vision learning tasks [Zheng et al., 2016], as a CNN normally requires large datasets and compute power to start learning. Therefore, it can be beneficial to use a neural net pre-trained on a large generic dataset and fine-tune it for a specific purpose. Weights of pre-trained DNNs/CNNs can be adapted to new classes using a technique called transfer learning [Bengio et al., 2013]. This allows the model to adapt to new domains.

Even without fine-tuning, transferring the DNN feature weights pre-trained on ImageNet as the base for learning basic features from the images provided has proven to be very successful. Research into which layer contains the most useful information, and is therefore best to extract, shows that transferability can be negatively impacted: higher-layer neurons are too specialized to the original task at the expense of a new task, while lower-level features are too generic to provide insight [Yosinski et al., 2014].

Using the transferred features, a supervised classifier can be trained interactively using logistic regression or support vector machines (SVM) models. Stochastic gradient descent techniques can be applied to achieve fast training [Bottou et al., 2016]. Special emphasis should be placed on the problem of unbalanced classes. The few positive observations need to be balanced with the large number of (implicit) negative observations, as the machine learning algorithm is hardly penalized for assigning false negatives and will therefore optimize for the bulk of the observations [Kotsiantis, 2007].

3 High level overview

To be able to determine if a computer vision task can learn to classify previously unknown street objects and scenes, an interactive dashboard is designed and constructed according to the following requirements: the user needs to be able to explore images, be able to search for user defined street objects and scenes that are not yet known to the system, and be able to manually indicate a certain image contains the street object or scene. This indication provides the positive samples for the classifier with which unseen images can be classified.

Figure 1: Interactive system

The front-end of the system allows the user to interactively explore the pre-indexed sets of images. Some screen shots from the system are shown in figure 1, with letters A to F.

Figure 2: Interactive exploration

The interaction diagram in figure 2 can be used to follow the flow through the system. The system has been designed using the principles mentioned in Stephen Few's Information Dashboard Design. The essential principles are contained in his definition of a dashboard: "A dashboard is a visual display of the most important information needed to achieve one or more objectives; consolidated and arranged on a single screen so the information can be monitored at a glance." [Few, 2006].

A. These images are either displayed in a grid if these are pictures of street scenes, or as arranged tiles in a map-like view if the set is a collection of satellite images.

Three related views have been created to guide the user through the interface (symbols B to D). The bar at the top of the page shows the user the active view. The user can browse back and forth between the views without losing the context. The image on the top-right of the grid/tile interface (A) contains the context of the originating view.

B. Areas can be explored to search for objects or scenes of interest browsing through them using the grid interface;

C. Similar images can be explored in a related view, based on the image selected by the user;

D. When the classification model has classified images using the (positive) selections from the user, user validation is gathered through the confirmation view, which shows the results from the classification model;

E. Different areas and objects or scenes of interest can be selected using responsive search boxes;

F. A responsive modal screen shows the selected image, offering options to further explore nearby images or similar images. It also offers a deep link to an on-line map provider and the option to confirm this image contains the interesting object or scene.


Figure 3: Preprocessing

The interaction diagram also shows the training of the classifier. This process can either be called from the front-end or from the back-end. In the first case, the classifier is trained when the user has provided some (initial) positive cases. After training the model can produce predictions, which the user can confirm by examining the proposed images in the confirmation view, which is sorted based on the model score. The score can be interpreted as a likelihood of the item of interest appearing in the image. Alternatively, the classifier can be called from the back-end, upon indexing new images. These unseen images are then classified using the previously learned classifier and the result can then again be examined by the user to confirm the validity of the proposed classes.

The system has a back-end in which the preprocessing takes place: source images are loaded and the features are learned using a pre-trained ImageNet model. An overview of the preprocessing is shown in figure 3. Images are sourced from pictures taken by PostNL delivery persons, or by using APIs from a satellite image provider. After collection, a background task loads the images and produces the ImageNet Pool5 features, which are stored in a database for later retrieval, along with attributes from the images, such as image geo-coordinates.

Figure 4: Data model

The data model is described in figure 4, using Roman numerals I to VI. The data elements cover all the information that is stored in the system:

I. Area: a particular collection of images within a certain area, which can be either satellite map tiles or pictures of street objects or scenes. An area has one or more images.

II. Image: has attributes such as geo-coordinates, annotations and picture resolution. An image can have a (Euclidean) distance from another image, computed from their feature vectors.

III. Features: each image has a feature vector, extracted by fitting a pre-trained ImageNet model on the image, using only the final pooling layer.

IV. ImageNet model: pre-trained parameters of an ImageNet model, used for extracting features from a given image.

V. Object or scene of interest: user-defined interests, which can be attributed to an (entire) image either by the user or by the classification model based on user attributions (as described in the user interaction above). One image can have multiple objects of interest.

VI. Classification model: parameters of a linear classifier, trained using positive image examples, showing objects of interest
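For illustration only, the data elements above could be expressed with the Django ORM used for the dashboard (see section 5.2). The class and field names below are hypothetical simplifications, not the system's actual schema:

    from django.db import models

    class Area(models.Model):
        name = models.CharField(max_length=100)            # e.g. "Deventer"

    class Image(models.Model):
        area = models.ForeignKey(Area, on_delete=models.CASCADE)
        path = models.CharField(max_length=255)
        latitude = models.FloatField()
        longitude = models.FloatField()
        features = models.BinaryField()                     # serialized 2048-d pooling vector

    class ObjectOfInterest(models.Model):
        name = models.CharField(max_length=100)             # e.g. "rieten dak"
        images = models.ManyToManyField(Image)              # user or classifier attributions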

4 Methodology

This section describes the source data, the feature extraction and the models used for classification and interactive exploration.

4.1 Data

The system can load images that form the raw data for the learning task. The images are interpreted as a matrix of pixels. The image resolution is the width and height in pixels. Each pixel has an RGB (Red/Green/Blue) colour value, sometimes called a channel in machine learning software, which typically varies between 0 and 255 for each of the three colours.

For this research, four geographical areas have been selected to train the classifiers. Three of these areas (Deventer, Raalte & Steenwijkerland) have been part of a proof of concept at PostNL, where postal delivery persons had been given the task to register houses with thatched roofs. This collected data forms the reference data for testing how well the classifier performs. The fourth area (Blaricum) has been selected as this is an area known to have a large number of houses with a thatched roof.

1. Deventer 4×4 km - 19,044 images
2. Raalte 4×4 km - 19,044 images
3. Steenwijkerland 4×4 km - 19,044 images
4. Blaricum one postal code - 576 images


In total, 57,708 images of 640x640 pixels have been collected, loaded and indexed into the interactive system.

Besides the raw data, user input is gathered using the interactive system. For every image, the user can select one or more objects of interest. These selections form the positive cases for the linear classification model.

A ground truth image set has been created by manually searching for houses with thatched roofs (rieten dak) and for the flat roof (plat dak) objects across all areas. This ground truth will form the basis for testing the effectiveness of the different models.

4.2 Feature extraction

For this research, features are extracted from the images by fitting a pre-trained ImageNet CNN (Inception-V3) [Szegedy et al., 2016]. Once sourced, all image data is fed in its raw format into the feature extraction sub-system. This sub-system is built using the Keras Python package (see section 5.2), which includes the weights for some of the pre-trained CNNs. The preprocessing steps are as follows:

1. Load the pre-trained ImageNet model Inception-V3, without the fully-connected (FC) layer.

2. For each image:
   1. Load the image;
   2. Rescale to 299x299; and
   3. Fit the ImageNet model.

3. This results in a feature vector of 2048 components for each image, 0.17% of the original 1,228,800 data points (640 × 640 × 3).
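A minimal sketch of this step, assuming the Keras Inception-V3 application with ImageNet weights; pooling='avg' is one way to obtain the 2048-component vector from the final pooling stage, and the file name is hypothetical:

    import numpy as np
    from keras.applications.inception_v3 import InceptionV3, preprocess_input
    from keras.preprocessing import image as keras_image

    # Inception-V3 without the fully-connected top; global average pooling
    # turns the final convolutional maps into a 2048-dimensional vector.
    model = InceptionV3(weights='imagenet', include_top=False, pooling='avg')

    def extract_features(path):
        img = keras_image.load_img(path, target_size=(299, 299))  # rescale to 299x299
        x = preprocess_input(np.expand_dims(keras_image.img_to_array(img), axis=0))
        return model.predict(x)[0]                                 # shape: (2048,)

    features = extract_features('tile.png')   # hypothetical 640x640 satellite tile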

4.3 Dimension reduction

Finding a representation that is more informative for further processing is one of the reasons for reducing the number of features in such a way that the least amount of information is lost. At a glance it might seem counter-intuitive to reduce features to gain more information, but with a large number of features the number of samples might be too small for accurate parameter estimation. This problem is also known as the curse of dimensionality. Dimensionality reduction battles this curse by selecting or combining features such that the least amount of information is lost. Reduced computational & storage cost is an additional benefit of dimension reduction. One of the simplest and most widely used algorithms for dimension reduction is principal component analysis (PCA), a method that rotates the dataset in such a way that the rotated features are statistically uncorrelated [Müller and Guido, 2017].
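A minimal scikit-learn sketch of this reduction step (the number of components and the randomly generated stand-in data are assumptions for illustration, not the exact implementation):

    import numpy as np
    from sklearn.decomposition import PCA

    features = np.random.rand(500, 2048)    # stand-in for the extracted feature vectors
    pca = PCA(n_components=80)              # reduce 2048 components to 80
    reduced = pca.fit_transform(features)   # shape: (500, 80)
    print(pca.explained_variance_ratio_.sum())   # fraction of variance retained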

4.4 Classification

To be able to classify certain objects or scenes in an image, a machine learning technique called classification is used. A classification algorithm is specialized in identifying to which of a set of categories a new observation belongs. It is a form of supervised learning, where a ground truth (training set of correct examples) is available, as well as a describing set of features. We are only interested in a binary classifier as our object or scene classifier only needs to predict if the image shows a particular object or scene.

Some common, well-known classification algorithms [Müller and Guido, 2017] are briefly described here:

— Logistic regression: regression coefficients are usually estimated using maximum likelihood estimation (e.g. Newton's method);

— Support vector machines (SVM): maximize the margin between the decision hyperplane and the examples in the training set;

— Nearest neighbour: a majority vote of its neighbours picking the most common class;

— Decision trees: iteratively create branches based on the most distinctive feature of the particular subset of that branch.

As our system needs to classify images interactively, we are looking for algorithms that can be quickly trained. The first two algorithms are numerically very efficient and robust when using stochastic gradient descent techniques and will therefore be used in the experiments.
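A minimal scikit-learn sketch of such an interactively trainable classifier. This is illustrative only: the stand-in data is synthetic, and class_weight='balanced' is one possible way to handle the class imbalance mentioned in section 2, not necessarily the system's exact configuration:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    X = np.random.rand(500, 80)             # stand-in for the reduced feature vectors
    y = np.zeros(500, dtype=int); y[:25] = 1   # few positives, many negatives

    # Linear SVM (hinge loss) or logistic regression (log loss), trained with SGD.
    clf = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-6,
                        class_weight='balanced', max_iter=10000)
    clf.fit(X, y)
    scores = clf.decision_function(X)       # rank every image by classifier score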

4.5 Interactive exploration

To allow for interactive exploration of images, some preparation has to take place. Typically, creating an interactive learning system follows the following pattern [Hoi et al., 2006]:

1. Collect all images for one learning task;

2. Extract features from the images using unsupervised feature learning;

3. Allow the user to browse through the image collection for images that show the object or scene of interest;

4. User selects images that follow the detected pattern;

5. Train a classifier, based on the first user selections (relevance feedback);

6. Apply the classifier to the whole collection and present the likely candidates to the user;

7. User validates the candidates and again selects images that show the object or scene of interest;


8. Repeat steps 3 to 7 until convergence or when the session ends.

Typically, convergence is reached when the user cannot (easily) find more images that show the object of interest, is satisfied with the result, or is out of time. While browsing through the images, the user can select an image and request similar ones. The image similarity is calculated by comparing the (Euclidean) distance between the feature vector of the selected image and those of the other images. A separate, related view displays the resulting images, sorted by increasing distance / decreasing similarity within this system. The user can always navigate back to the exploration view.
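A minimal sketch of this similarity computation (NumPy-based, with a randomly generated stand-in feature matrix; not the exact implementation of the system):

    import numpy as np

    def most_similar(query_vec, features, k=25):
        # Euclidean distance between the selected image and every indexed image;
        # smaller distance means higher similarity.
        dists = np.linalg.norm(features - query_vec, axis=1)
        return np.argsort(dists)[:k]        # indices sorted by increasing distance

    features = np.random.rand(1000, 2048)          # stand-in for the indexed vectors
    print(most_similar(features[42], features, k=5))   # the image itself ranks first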

5 Experimental setup

To evaluate if a computer vision task can learn to classify previously unknown street objects and scenes, the four research questions are investigated. This section describes the experiments needed to arrive at the conclusions.

1. For understanding effective means to extract basic features from the image, feature transfer techniques are tested using the Keras Python package, which comes with pre-trained ImageNet models (see subsection 5.2 on software). The resulting features are tested for explanatory power using the similarity search in the interactive system and the image classifier performance.

2. To enable users to provide input and feedback to this learning task, an interactive system has been built using Django, as described in section 3. The effectiveness of the similarity search is tested using the ground truth data set. Three illustrative use cases are investigated to show the means of interaction when training a classifier for a particular object or scene class:

— Explore an area through the grid in search for objects or scenes;

— Select a representative image and request similar images, or ones (geographically) nearby;

— Using the trained classifier, find relevant images for a particular object or scene.

3. Multiple quality criteria for assessing the output of this learning task are investigated, including precision, recall & AUC. Precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples. AUC is the so-called Area Under the (Receiver Operating) Curve; the Receiver Operating Curve shows the true and false positive rates for various cut-off threshold levels. To be able to determine the best model parameters for the classification model, different hyper-parameters are tested based on the ground truth data set, including:

— Dimension reduction: PCA using 60/80/100 components

— Algorithms: Logistic regression (LR) using log loss vs. Support Vector Machines (SVM) using hinge loss (both from the Scikit-learn SGDClassifier - see section 5.2)

— Regularization term: L2 norm (Ridge regression) is used, as this method does not zero out any feature

— Regularization multiplication term (alpha): parameter settings between 0.0001 and 0.000001 are tested

— Iterations: a minimum of 10^6/n iterations, with n the size of the training set, is tested. The training set is around 500 images, leading to at least 2,000 iterations. To find the optimal parameter settings we run 10,000 iterations.

4. To find out how much user interaction is required before the fine-grained classification task is learned, a simulation is run inspired by the paper from Zahálka [Zahálka et al., 2015], which is implemented using an artificial actor (a code sketch of this loop follows after the list):

— Start with the ground truth image set from the Blaricum area;

— Then the following automated sequence is applied:

1. Start with a random selection of 5 images from the ground truth and add these to the positive set;

2. Apply the classification algorithm using supervised learning based on the positive set and from the outcome select the top 25 based on the highest likelihood;

3. Count the number of correctly selected images and add these to the positive set;

4. Repeat steps 2 and 3 until convergence (no more positives in the top 25) or after 10 rounds (a proxy for user satisfaction).
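A minimal sketch of this artificial-actor loop. It is illustrative only: function and variable names are hypothetical, the classifier settings are simplified, and the stand-in data is synthetic:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def simulate_actor(features, ground_truth, rounds=10, top_k=25, seed=0):
        # features: (n_images, d) array; ground_truth: indices of images showing the object
        rng = np.random.RandomState(seed)
        positives = set(rng.choice(sorted(ground_truth), size=5, replace=False))
        for _ in range(rounds):
            labels = np.zeros(len(features))
            labels[list(positives)] = 1
            clf = SGDClassifier(loss='hinge', class_weight='balanced', max_iter=10000)
            clf.fit(features, labels)
            ranking = np.argsort(-clf.decision_function(features))
            candidates = [i for i in ranking if i not in positives][:top_k]
            hits = [i for i in candidates if i in ground_truth]
            if not hits:              # convergence: no new positives in the top 25
                break
            positives.update(hits)    # the 'actor' confirms the correct images
        return positives

    # synthetic stand-in run: 1,000 images, 20 of which contain the object
    features = np.random.rand(1000, 80)
    print(len(simulate_actor(features, set(range(20)))))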

5.1 Hardware

For all data pre-processing, analytical modelling and the interactive application, hardware was provided by the Distributed ASCI Supercomputer 4 (DAS-4): a six-cluster wide-area distributed system designed by the Advanced School for Computing and Imaging (ASCI). DAS-4 is funded by NWO/NCF (the Netherlands Organization for Scientific Research). As one of its distinguishing features, DAS-4 employs a number of High Performance Computing (HPC) accelerators, notably various Graphics Processing Unit (GPU) types. The GPUs can be employed for training and fitting the deep convolutional networks.

5.2 Software

For analytical programming & analysis, as well as building the interactive system, the following software has been used:


— Python (www.python.org), a well-supported scripting language with an interactive programming environment and many packages for business analytics and programming [van Rossum, 2016];

— Keras (keras.io), a Python library for easily defining and refining Deep Neural Networks, on top of Tensorflow, Microsoft CNTK or Theano. Comes with a collection of pre-trained ImageNet models [Chollet, 2015];

— Tensorflow (www.tensorflow.org), a popular machine learning library using graph computations with back-propagation built-in. Graph computations allow the models to be executed using graphics cards (GPUs), which are very efficient in matrix multiplications [Abadi and others, 2016];

— Scikit-learn (www.scikit-learn.org), an easy to use machine learning library, built on NumPy, SciPy and matplotlib, the leading numerical analysis libraries in Python [Pedregosa et al., 2011]. It includes programs for dataset transformations, supervised learning, unsupervised learning & model selection and evaluation;

— Django (www.djangoproject.com), a high-level Python Web framework, allowing to build clean web interfaces quickly [Holovaty and Willison, 2017].

It is worth noting that all this software is open-source and free to use within its respective licensing restrictions.

6 Results and implications

This section describes the results of the experiments and some of the implications for each of the four research questions.

6.1 Basic feature extraction

The literature research has shown that a CNN is an effective way to extract features from an image. It has also shown that features trained on one network with generic images and object classes can be transferred to similar image recognition tasks. Features are extracted from the images by fitting a pre-trained ImageNet CNN on the images. Keras provides pre-trained ImageNet CNNs, so-called applications, including VGG16, VGG19, ResNet50, and Inception-V3. These architectures are top performers in the ImageNet Large Scale Visual Recognition Competition (ILSVRC). For showing effective means to extract basic features the Inception-V3 pre-trained model is used, as it is the most recent CNN model with the highest accuracy in the ILSVRC. The full ImageNet models come with a final classifier layer to classify one of the 1,000 classes. As our model needs the underlying features from the last pooling layer, we discard the final classification layer. In Keras this can be achieved by setting the parameter include_top=False.

On the available hardware the following timings were achieved:

— Initial loading of the ImageNet weights into memory: 24 seconds

— Processing 57,938 images: 1 hour and 16 minutes, which amounts to 8 seconds per 100 images

To put it into perspective: The Netherlands has a geographical area of 41,543 km2. Our experimental system has sourced 3 × 16 km2, so extrapolating to the whole of The Netherlands this would amount to a total of about 50 million images. At 8 seconds per 100 images this adds up to about 1,100 hours of processing for the feature extraction of the image tiles for the whole of The Netherlands. As this effort can be further optimized and run in parallel, a throughput time of one day would be feasible.
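A back-of-the-envelope check of this extrapolation, using only the numbers reported above (for illustration):

    area_nl_km2 = 41_543                   # area of The Netherlands
    images_per_km2 = 19_044 / 16           # one 4x4 km area yields 19,044 tiles
    total_images = area_nl_km2 * images_per_km2        # roughly 49 million tiles
    hours = total_images * (8 / 100) / 3600            # 8 seconds per 100 images
    print(round(total_images / 1e6), 'million images,', round(hours), 'hours')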

6.2 User input and feedback

Three illustrative use cases are investigated to show the means of interaction when training a classifier for a particular object or scene, in this case showing some of the results for plat dak. The use cases are selected along the Exploration-Search axis [Zahálka et al., 2015].

1. Explore an area through the grid in search for objects or scenes;
2. Select a representative image and request similar, or nearby images;
3. Find relevant items for a particular object or scene.

6.2.1 Explore an area

When starting a search for a new object or scene to classify, the user will start from the exploration interface. This interface shows a grid, which can be navigated using the Previous and Next buttons. On the right of the screen, some context images show which objects or scenes have already been identified for the images shown.

A red border shows if one of the images contains the selected object or scene of interest.

Upon clicking an image, a modal pop-up screen shows the image enlarged and the possible actions, in this case to show similar images, show on a map, or to confirm this image contains the object or scene of interest.


6.2.2 Request similar images

Any image from the exploration interface can be selected and then similar images can be viewed. Based on the picture of a plat dak in section 6.2.1 the exploration interface now shows similar images to the one selected. The images are sorted based on their amount of similarity, and can (again) be navigated through the Previous and Next buttons.

The exploration interface has changed slightly, as the grid now shows separate images instead of tiles on a map. Many show a similar pattern to the one previously selected.

The next page (again) shows many images with a similar pattern to the one selected, with many of them already confirmed by the user.

Sometimes, only a partial view of the object or scene of interest is shown. Navigating to (geographically) nearby images will show the full context, in this case the whole building showing a plat dak.


The explorer interface shows the nearby images. This could also speed up the labelling, when objects or scenes of interest are close to each other, in this case an area that has many buildings with a plat dak.

6.2.3 Find relevant items

Once a fair number of images have been tagged by the user for a particular object or scene, the classifier can be trained. After training, it can then classify all the other images, allowing the user to find relevant items.

The explorer interface shows the classified images, sorted by the classifier probability. Currently, all probabilities show 100% as the classifier is overconfident.

6.2.4 Wrongly classified images

The means of interaction works quite intuitively when the patterns of the sought after objects or scenes are relatively clear, or there is some prior knowledge on where particular objects or scenes are known to be found, such as the houses with thatched roofs in the Blaricum area. Other patterns are not as clear and the system fails to find similar images.


This image is found to be similar to an image clearly showing a plat dak. We can see the pattern of a black surface with light demarcations, which is a similar pattern to a plat dak, but for a human it clearly shows a road.

6.3 Model quality criteria

Ultimately, the classifier needs to be evaluated against the appropriate quality criteria. Precision, recall & AUC are assessed for training two classifiers: one for rieten dak objects, and the other for plat dak. To be able to determine the best model parameters for the classifiers, different hyper-parameters are tested based on the ground truth data set, as described in the experimental setup section 5. For dimension reduction we test PCA using 60/80/100 components, and for classification we test Logistic regression (LR) using log loss vs. Support Vector Machines (SVM) using hinge loss, with regularization alpha multiplication parameters between 0.0001 and 0.000001.

A tabular overview of the experiments to test the classification model is provided in tables 1 and 2. Looking at the AUC values, we can see there is not much variance in the results, for both classifiers. The AUC score is around 0.85, which is a good score. It means that in about 85% of cases a randomly chosen positive image is ranked above a randomly chosen negative image. The precision score is around 0.03, which means that for every 100 images the classifier returns, only about three are valid. This is not so good, as we want our classifier to return a much higher percentage for it to be useful. The recall score is around 0.84, which means that the classifier can actually find 84% of the images showing the rieten dak and the plat dak objects.
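For reference, these three criteria can be computed with scikit-learn roughly as follows (a minimal sketch; the labels, scores and thresholded predictions are hypothetical toy values, not the experimental data):

    from sklearn.metrics import roc_auc_score, precision_score, recall_score

    y_true  = [0, 0, 1, 1, 0, 1]                        # ground-truth labels
    y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]           # classifier decision scores
    y_pred  = [1 if s >= 0.5 else 0 for s in y_score]   # thresholded predictions

    print(roc_auc_score(y_true, y_score))    # area under the ROC curve
    print(precision_score(y_true, y_pred))   # fraction of returned images that are valid
    print(recall_score(y_true, y_pred))      # fraction of valid images that are found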

Figures 5 and 6 show the ROC and Precision-Recall curves for both classifiers.

Figure 5: ROC & Precision-Recall curve for the rieten dak classifier

Figure 6: ROC & Precision-Recall curve for the plat dak classifier

We can see the same pattern emerge from the curves. The ROC curves for both classifiers show a similar pattern: they predict far better than random (the red diagonal line). The further away the ROC curve lies from this line, the better it predicts. The Precision-Recall curve shows a high recall can only be achieved at a low precision.

What this means in practice is that with the current performance the classifiers will be able to find around 84% of the images containing a rieten dak or plat dak. But the user will have to look at about 33 images to find one. With an average of 15 images on a screen, this means the user will have to look at roughly 2 screens of images to find a valid one.

The classifier could be tuned further to produce higher precision at the early stage of the learning process. This would come at the expense of the ability of the classifier to find all the images, but in the early stages of exploration that would not matter too much.

6.3.1 Comparing results of PostNL data collection

PostNL postal delivery persons had been given the task to register houses with thatched roofs in three areas. This exhaustively collected data was used to test the classifier relevance.

Figure 7 shows the classifier (user confirmed) results in comparison to the PostNL collection results, in a grid representing the geographical areas of Deventer and Raalte that were loaded into the interactive system. Data collected for Steenwijkerland showed no results in the area that was loaded. As we can see from the comparison, the search for images showing a rieten dak resulted in many relevant images for Deventer and few for Raalte (red dots). Vice versa, the data collected by PostNL shows many houses with a rieten dak found in Raalte and few in the Deventer area (blue dots). From interacting with the system it was noticed that the Raalte images were often of a lower resolution, which might explain the low number of relevant images from the system compared to the results from the PostNL proof of concept. Interestingly, there is little overlap in the results. This means that the process at the PostNL side has room for improvement, as we would expect at least some overlap with the results from the classifier, given the fact that it was meant to be an exhaustive search. And the classifier can certainly be improved on the low-resolution images from the Raalte area.

Figure 7: Comparing PostNL data collection with the results of the search for rieten dak objects

6.3.2 Performance

Along with the model evaluation, we’ve also measured the timing of training a model and scoring the images. A tabular overview of the experiments to test the classification model is provided in tables 3 and 4.

Fitting the best performing classifier on the 57,708 images takes about 2 seconds. Scoring an image takes about 0.2 seconds.

Putting these timings into the perspective of the 50 million satellite images of The Netherlands, this approach would not be feasible, as the system would not be interactive due to the large time lag. Very recent related research shows that a smarter approach can handle 100 million images interactively, even on a high-end desktop computer [Zahálka et al., 2016].

6.4 Required user interaction

This section shows the required user interaction to train the classifiers, using the procedure described in the experimental setup section 5. Ten simulations are run using the optimal model parameters from section 6.3, and we measure the same three quality criteria: AUC, Precision & Recall. The results are shown in figure 8. Each simulation has on average 10 iterations in which the artificial actor tries to find images of the rieten dak class. In each iteration it finds on average 10 images, after which the classifier is retrained. As the actor finds different images in each simulation run, the results show some variation.

Figure 8: Artificial actor AUC, Precision & Recall for the rieten dak classifier

Overall the results are encouraging, as all three quality criteria go up. As the actor starts to find more images of the rieten dak class, the classifier shows all-round better performance: it is able to find more images due to the higher recall percentage, and the user will see more appropriate results due to the higher precision percentage.

Cautiously extrapolating the results indicates that after 10 more iterations we would see a precision rate that would return perhaps 50-60% appropriate results and find 90-95% of the images.

7 Conclusions

After analysing the results we can conclude a computer vision task can learn to classify previously unknown street objects and scenes.

We have shown that basic features can be extracted from the images by unsupervised feature transfer using a pre-trained ImageNet model. These features allow, on the one hand, exploring images in search of interesting objects or scenes using image similarity and, on the other hand, provide the explanatory variables for the classification task.

A system was created to capture user input and feedback for the learning task, allowing the user to train classifiers for new objects or scenes. We have demonstrated three ways a user can interact with the system to find new objects or scenes. Firstly, the user can browse through (geographically nearby) images, secondly the user can find structurally similar images based on the image features, and thirdly the user can find relevant items based on the image classification.

Three quality criteria for assessing the output of the learning task have been investigated: precision, recall and AUC. Ultimately, a classifier should achieve both high precision and high recall. The tested classifiers currently have a low average precision, but comparison with an exhaustive search shows remarkable recall. Precision and recall can be further tuned to achieve a high early precision, allowing the user to find more relevant images in the early stage of the learning process.

To understand the required user interaction for learning the fine-grained classification task, it was shown that after 10 iterations the classifier already achieved reasonable results, improving on every quality metric after each iteration. This is an encouraging outcome, showing that when the user labels more images the classifier returns more as well, both in volume and in relevance.

Scaling the system to cover the whole of The Netherlands would require optimizing the indexing to allow the system to stay fully interactive.

7.1 Directions for further research

During this research, some avenues that had been identified in the literature review could not be explored further due to time constraints. It is expected these approaches can help further optimize the system, so it would be worthwhile exploring them.

The approaches include:

— Test different ImageNet weight models (AlexNet, Inception, etc.), or different ImageNet layers (final pool vs. previous pooling layers). Research has shown that each layer provides a different granularity of features [Yosinski et al., 2014]. The learning process at PostNL might benefit from an earlier layer that provides more generic features.

— Retraining or fine-tuning of ImageNet parameters. If a classifier for a particular object or scene needs to become very accurate, and not much further gain can come from user input, the ImageNet model that provided the features for the classifier can be fine-tuned on the new class [Bengio et al., 2013]. This can result in model weights that are better tuned to the particular class and can therefore allow for better separation of this class from the other classes.

— Test if high precision gives a better user experience in the early stages of the learning task. Initially, a user might not be patient enough to sift through too many irrelevant results. By providing only the top 25, or at least showing the most relevant images first, the user might become more confident of finding relevant images, and as a consequence become more patient in the search for the needles in the haystack.

8 Tabular results

This chapter contains the detailed results of the search for the best hyper-parameter settings, as well as the detailed results of the timings of fitting the model and scoring the images.

Rank  PCA  Loss   Alpha  AUC         Precision    Recall
14    60   hinge  1e-05  0.85±0.098  0.030±0.018  0.82±0.143
12    60   hinge  1e-06  0.85±0.100  0.031±0.021  0.82±0.147
13    60   hinge  1e-07  0.85±0.099  0.030±0.020  0.82±0.138
18    60   log    1e-05  0.85±0.100  0.031±0.021  0.81±0.151
17    60   log    1e-06  0.85±0.100  0.030±0.019  0.81±0.140
16    60   log    1e-07  0.85±0.101  0.030±0.019  0.82±0.138
7     80   hinge  1e-05  0.85±0.099  0.030±0.017  0.82±0.133
1     80   hinge  1e-06  0.86±0.098  0.031±0.019  0.83±0.137
4     80   hinge  1e-07  0.86±0.100  0.030±0.017  0.84±0.132
3     80   log    1e-05  0.86±0.097  0.030±0.017  0.82±0.141
2     80   log    1e-06  0.86±0.100  0.031±0.018  0.84±0.132
5     80   log    1e-07  0.86±0.099  0.030±0.017  0.82±0.139
10    100  hinge  1e-05  0.85±0.099  0.032±0.020  0.81±0.147
11    100  hinge  1e-06  0.85±0.099  0.031±0.018  0.81±0.137
15    100  hinge  1e-07  0.85±0.100  0.030±0.018  0.81±0.143
8     100  log    1e-05  0.85±0.099  0.032±0.019  0.81±0.142
9     100  log    1e-06  0.85±0.098  0.031±0.019  0.80±0.148
6     100  log    1e-07  0.85±0.097  0.031±0.019  0.81±0.143

Table 1: Quality criteria for the rieten dak classifier for various hyper-parameters


Rank  PCA  Loss   Alpha  AUC         Precision    Recall
10    60   hinge  1e-05  0.88±0.105  0.026±0.012  0.83±0.226
12    60   hinge  1e-06  0.88±0.106  0.026±0.012  0.84±0.224
7     60   hinge  1e-07  0.88±0.105  0.026±0.012  0.84±0.222
2     60   log    1e-05  0.88±0.107  0.027±0.014  0.84±0.232
3     60   log    1e-06  0.88±0.100  0.025±0.012  0.83±0.240
8     60   log    1e-07  0.88±0.105  0.026±0.013  0.84±0.231
4     80   hinge  1e-05  0.88±0.105  0.027±0.014  0.79±0.253
9     80   hinge  1e-06  0.88±0.104  0.026±0.013  0.79±0.231
5     80   hinge  1e-07  0.88±0.103  0.026±0.013  0.80±0.243
6     80   log    1e-05  0.88±0.105  0.027±0.013  0.79±0.246
1     80   log    1e-06  0.88±0.096  0.027±0.013  0.80±0.237
11    80   log    1e-07  0.88±0.104  0.026±0.012  0.81±0.229
17    100  hinge  1e-05  0.87±0.108  0.029±0.014  0.79±0.233
15    100  hinge  1e-06  0.87±0.103  0.027±0.014  0.80±0.236
13    100  hinge  1e-07  0.88±0.096  0.027±0.012  0.81±0.213
18    100  log    1e-05  0.87±0.111  0.028±0.014  0.79±0.240
16    100  log    1e-06  0.87±0.103  0.028±0.014  0.80±0.234
14    100  log    1e-07  0.88±0.099  0.025±0.012  0.80±0.235

Table 2: Quality criteria for the plat dak classifier for various hyper-parameters


Rank  PCA  Loss   Alpha  Fit time      Score time
14    60   hinge  1e-05  32.54±4.329   1.92±0.648
12    60   hinge  1e-06  45.10±2.306   1.58±0.282
13    60   hinge  1e-07  50.03±17.596  1.92±0.201
18    60   log    1e-05  75.01±48.737  12.38±7.469
17    60   log    1e-06  44.14±9.378   1.87±0.521
16    60   log    1e-07  44.94±1.414   1.81±0.154
7     80   hinge  1e-05  55.59±10.083  2.14±0.233
1     80   hinge  1e-06  54.90±1.829   2.01±0.220
4     80   hinge  1e-07  54.39±1.829   2.04±0.401
3     80   log    1e-05  65.87±14.790  4.39±1.904
2     80   log    1e-06  54.79±1.893   2.03±0.260
5     80   log    1e-07  54.77±1.514   1.97±0.219
10    100  hinge  1e-05  45.62±18.158  3.17±0.886
11    100  hinge  1e-06  65.97±2.353   2.23±0.285
15    100  hinge  1e-07  59.16±7.320   1.78±0.450
8     100  log    1e-05  80.04±34.333  19.43±12.114
9     100  log    1e-06  66.83±7.362   2.40±0.264
6     100  log    1e-07  64.83±1.962   2.31±0.390

Table 3: Timings of training the rieten dak classifier for various hyper-parameters


Rank  PCA  Loss   Alpha  Fit time     Score time
10    60   hinge  1e-05  45.83±1.523  1.66±0.176
12    60   hinge  1e-06  46.66±2.163  1.62±0.183
7     60   hinge  1e-07  45.80±1.389  1.70±0.236
2     60   log    1e-05  47.03±3.215  1.83±0.239
3     60   log    1e-06  45.86±2.036  1.55±0.358
8     60   log    1e-07  44.56±1.544  1.75±0.228
4     80   hinge  1e-05  55.36±1.621  2.00±0.168
9     80   hinge  1e-06  54.48±2.547  2.07±0.170
5     80   hinge  1e-07  54.20±2.299  1.86±0.189
6     80   log    1e-05  55.19±2.651  2.11±0.190
1     80   log    1e-06  54.76±1.307  2.00±0.269
11    80   log    1e-07  55.38±3.294  1.96±0.301
17    100  hinge  1e-05  66.60±1.951  2.34±0.241
15    100  hinge  1e-06  66.19±1.759  2.28±0.158
13    100  hinge  1e-07  60.17±8.586  1.86±0.535
18    100  log    1e-05  67.30±1.763  2.41±0.253
16    100  log    1e-06  66.78±1.397  2.18±0.192
14    100  log    1e-07  65.74±2.303  2.37±0.213

Table 4: Timings of training the plat dak classifier for various hyper-parameters


References

Martín Abadi and others. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467 [cs], March 2016. URL http://arxiv.org/abs/1603.04467.

Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, August 2013. ISSN 0162-8828. doi: 10.1109/TPAMI.2013.50.

Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838 [cs, math, stat], June 2016. URL http://arxiv.org/abs/1606.04838.

François Chollet. Keras, 2015. URL https://github.com/fchollet/keras.

J. Deng, W. Dong, R. Socher, L. J. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, June 2009. doi: 10.1109/CVPR.2009.5206848.

Susan Doyle-Lindrud. Watson Will See You Now: A Supercomputer to Help Clinicians Make Informed Treatment Decisions. Clinical Journal of Oncology Nursing, 19(1):31–32, February 2015. ISSN 1092-1095. doi: 10.1188/15.CJON.31-32. URL http://search.ebscohost.com/login.aspx?direct=true&db=rzh&AN=103749260&site=ehost-live.

Stephen Few. Information dashboard design. 2006. URL https://www.thali.ch/files/Shop/Documents/018161 Chapter 1 Clarifying The Vision.pdf.

Efstratios Gavves. UvA Deep Learning Course, February 2016. URL https://uvadlc.github.io/lectures/lecture1.pdf.


Ian Goodfellow, Aaron Courville, and Yoshua Bengio. Large-Scale Feature Learning With Spike-and-Slab Sparse Coding. Proceedings of the 29th International Conference on Machine Learning (ICML-12), June 2012. URL http://arxiv.org/abs/1206.6407.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. volume 2016-January, pages 770–778, 2016.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580 [cs], July 2012. URL http://arxiv.org/abs/1207.0580. arXiv: 1207.0580.

Steven CH Hoi, Rong Jin, Jianke Zhu, and Michael R. Lyu. Batch mode active learning and its application to medical image classification. In Proceedings of the 23rd international conference on Machine learning, pages 417–424. ACM, 2006. URL http://dl.acm.org/citation.cfm?id=1143897.

Adrian Holovaty and Simon Willison. Django, 2017. URL https://djangoproject.com.

S. B. Kotsiantis. Supervised Machine Learning: A Review of Classification Techniques. In Proceedings of the 2007 Conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies, pages 3–24, Amsterdam, The Netherlands, 2007. IOS Press. ISBN 978-1-58603-780-2. URL http://dl.acm.org/citation.cfm?id=1566770.1566773.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

Marvin Minsky and Seymour Papert. Perceptrons: an introduction to computational geometry. The MIT Press, Cambridge, Mass, expanded ed edition, 1969. ISBN 978-0-262-63111-2.

Andreas C. Müller and Sarah Guido. Introduction to Machine Learning with Python: A Guide for Data Scientists, first edition. O'Reilly Media, Sebastopol, CA, 2017. ISBN 978-1-4493-6941-5. URL http://search.ebscohost.com/login.aspx?direct=true&db=nlebk&AN=1361381&site=ehost-live.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011. ISSN 1533-7928. URL http://www.jmlr.org/papers/v12/pedregosa11a.html.

S.J.D. Prince. Computer Vision: Models Learning and Inference. Cambridge University Press, 2012.

F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, November 1958. ISSN 0033-295X.

Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv:1609.04747 [cs], September 2016. URL http://arxiv.org/abs/1609.04747.


A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. II - Recent Progress. IBM Journal of Research and Development, 11(6):601–617, November 1967. ISSN 0018-8646. doi: 10.1147/rd.116.0601.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. pages 2818–2826, 2016. URL http://www.cv-foundation.org/openaccess/content cvpr 2016/html/Szegedy Rethinking the Inception CVPR 2016 paper.html.

Richard Szeliski. Computer Vision: Algorithms and Applications. Springer-Verlag New York, Inc., New York, NY, USA, 1st edition, 2010. ISBN 1-84882-934-5 978-1-84882-934-3.

Guido van Rossum. Python, December 2016. URL http://www.python.org.

Malcolm Ware, Eibe Frank, Geoffrey Holmes, Mark Hall, and Ian H Witten. Interactive machine learning: letting users build classifiers. International Journal of Human-Computer Studies, 55(3):281–292, September 2001. ISSN 1071-5819. doi: 10.1006/ijhc.2001.0499. URL http://www.sciencedirect.com/science/article/pii/S1071581901904999.

Marcel Worring, Arnold W. M. Smeulders, Amarnath Gupta, Simone Santini, and Ramesh Jain. Content-Based Image Retrieval at the End of the Early Years. IEEE Transactions on Pattern Analysis & Machine Intelligence, 22(12):1349–1380, 2000. ISSN 0162-8828. doi: 10.1109/34.895972.

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3320–3328. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5347-how-transferable-are-features-in-deep-neural-networks.pdf.


Jan Zahálka, Stevan Rudinac, and Marcel Worring. Analytic Quality: Evaluation of Performance and Insight in Multimedia Collection Analysis. In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, pages 231–240, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3459-4. doi: 10.1145/2733373.2806279. URL http://doi.acm.org/10.1145/2733373.2806279.

Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, and Marcel Worring. Interactive Multimodal Learning on 100 Million Images. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, ICMR '16, pages 333–337, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4359-6. doi: 10.1145/2911996.2912062. URL http://doi.acm.org/10.1145/2911996.2912062.

Liang Zheng, Yali Zhao, Shengjin Wang, Jingdong Wang, and Qi Tian. Good Practice in CNN Feature Transfer. arXiv:1604.00133 [cs], April 2016. URL http://arxiv.org/abs/1604.00133. arXiv: 1604.00133.
