MSc Artificial Intelligence
Master Thesis

Large-Scale Spatial Active Learning in the Wild

by
Bobbie van Gorp
11161108

July 12, 2020
36 EC, January 2020 - July 2020

Supervisor: Dr P.S.M. Mettes
Assessor: Dr E. Bostan


Abstract

Aerial images contain a wealth of information for various real-world applications. However, they are often large, and many images are required to cover large areas of land. It is expensive and time-consuming for human experts to analyze a large area, making automated analysis with computer vision methods an attractive option. For many such computer vision tasks, a significant amount of labeled data would still be needed to get good results, defeating the purpose of automated analysis. This research focuses on aerial image object detection with no existing labeled data and very limited data annotation capabilities. It shows that active learning can be used to improve detection results significantly, without the need for very large labeled datasets. Different active learning query strategy variations based on uncertainty sampling are tested, with key differences in the way the uncertainty of individual boxes is aggregated to represent a full image. Furthermore, geographical proximity is explored as a feature for query strategies to complement predictive uncertainty. It is shown that strategies that take into account a measure of distance tackle different types of detection errors than strategies based on uncertainty alone. Finally, a typical aerial imagery detection task is performed on very large-scale data to show how active learning can be used effectively on a real-world task.


Contents

1 Introduction
  1.1 Scarcity of labeled data
  1.2 Efficient learning in object detection
  1.3 Details of the research
  1.4 Research objectives
2 Related Work
  2.1 Object detection
    2.1.1 Methods and architectures
    2.1.2 Scale variation problem
    2.1.3 Detection on aerial imagery
  2.2 Active learning
    2.2.1 Scenarios and query strategies
    2.2.2 Active learning in the deep learning era
3 Methodology
  3.1 Background
  3.2 Proposed aggregation methods
  3.3 Proposed distance-based methods
4 Experimental setup
  4.1 Data
    4.1.1 Aerial images
    4.1.2 Confirmed cooling tower locations
    4.1.3 Resolving the issues
    4.1.4 Datasets
  4.2 Model and implementation
    4.2.1 Model
    4.2.2 Active learning
    4.2.3 Supporting code and hardware
  4.3 Hyperparameter optimization
  4.4 Metrics
5 Experiments
  5.1 Query strategy comparison
    5.1.1 Setup
    5.1.2 Results and Analysis
    5.1.3 Conclusion
  5.2 Large-scale active learning
    5.2.1 Setup
    5.2.2 Results and Analysis
    5.2.3 Conclusion
6 Conclusions
Appendices


1 Introduction

Over the last decade, the methods used to solve the task of object detection in the field of computer vision have changed significantly with the paradigm shift towards deep learning. Near human-level performance can be reached on widely used benchmark datasets, such as MS-COCO, ImageNet Detection and Open Images Dataset [11]. However, these datasets all have similar characteristics and therefore do not represent all real-world problems that object detection can be used for. Most modern object detection datasets are large-scale, with hundreds of thousands of fully annotated images with many different object classes. Furthermore, these datasets consist only of natural images, in which there is a well-defined scene context, objects are upright, and object scale and distance vary considerably.

Available data for real-world detection tasks on aerial imagery differs significantly from the common detection datasets. Although it is often similarly large-scale, with millions of images available, for many objects of interest in aerial imagery tasks there are few or no existing annotated images available. In those cases, images have to be hand-labeled for the individual application. This is a time-consuming process which may require the knowledge of domain experts. However, many such object detection cases are useful for solving real-world problems. Examples include solar photovoltaic array detection for power management [28], tree detection and classification [55], whale counting [14], and land-based wildlife detection [52] for nature conservation programs. There also exist more common aerial imagery tasks for which labeled datasets are increasingly becoming publicly available, such as building localization for urban planning [4] and vehicle detection for localizing accidents and analyzing traffic flow for intelligent transportation [43].

A second difference between common object detection datasets and aerial imagery detection tasks is the nature of the images comprising the dataset. As the name implies, data for aerial imagery detection tasks consist of aerial images that differ considerably from natural images. Aerial images are much larger, they have a continuous scene, objects can be rotated in-plane and object scales are comparable and small [42].

Lastly, in aerial image object detection, the goal is often to find instances of one or a few types of objects (e.g. vehicles). All other objects are considered background, even though the true definition of background is less obvious than for natural images. In natural image object detection, the goal is usually a general object detector that finds and labels all objects in the foreground correctly. This means that detectors for natural images learn features of many types of objects to be able to distinguish between them, while aerial image detectors should learn the features of one or a few object types to be able to distinguish them from any other objects, which are considered background. The difference lies in the fact that the background class in aerial image object detection actually covers a large number of different objects and patterns that could all have very different features.

All of these issues make object detection on aerial imagery more challenging than detection tasks on natural images. This research will focus specifically on the problem of labeled data scarcity for aerial detection tasks. In the following sections, common solutions to data scarcity will be introduced, both for general machine learning tasks and object detection specifically. Finally, the details of this research are specified, as well as the associated research objectives.

1.1 Scarcity of labeled data

Clearly, this is not the first machine learning task for which a lack of labeled data poses a threat to the effectiveness of any model created to perform the task. Key solutions that have been proposed in the general machine learning literature to address the abundance of unlabeled data but the lack of labels are active learning and semi-supervised learning [38]. Active learning is based on the idea that not all training examples are equally useful to learn from. Instead of going through the time-intensive process of completely annotating a large dataset, the active learning model employs a query strategy that estimates from which unlabeled examples the model would learn the most, if it were to have the label. An oracle is then asked to add labels for these instances, after which the model is retrained. In this manner, the model aims to achieve the highest performance while using as little labeled data as possible, thereby reducing the cost required to label instances.
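To make this loop concrete, the following is a minimal sketch of one pool-based active learning round in Python-style pseudocode; train, query_strategy, top_n and oracle_label are hypothetical placeholders, not functions from any particular library:

    model = train(labeled_set)
    for _ in range(num_rounds):
        # Estimate, for every unlabeled instance, how much the model
        # would learn from its label (the query strategy).
        scores = [(x, query_strategy(model, x)) for x in unlabeled_pool]
        batch = top_n(scores, n)              # the n most informative instances
        labeled_set += oracle_label(batch)    # the oracle annotates them
        unlabeled_pool = [x for x in unlabeled_pool if x not in batch]
        model = train(labeled_set)            # retrain on the enlarged set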

A simple example of a semi-supervised learning approach is a model that, after initial training on a small dataset, is used to classify a number of unlabeled data points. The most confident data points and predictions are then added to the training set, after which the model is retrained and the process repeats.
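For contrast, a sketch of this self-training loop, under the same caveat that all names and the 0.95 confidence threshold are illustrative assumptions:

    model = train(labeled_set)
    for _ in range(num_rounds):
        confident = []
        for x in unlabeled_pool:
            label, confidence = model.predict_with_confidence(x)
            if confidence > 0.95:        # keep only the most confident predictions
                confident.append((x, label))
        labeled_set += confident         # pseudo-labels join the training set
        unlabeled_pool = [x for x in unlabeled_pool
                          if all(x is not xc for xc, _ in confident)]
        model = train(labeled_set)

The key difference with active learning is the selection criterion: self-training keeps the predictions the model is most certain about, while active learning queries the instances it is least certain about.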

There are also more specific solutions often used in computer vision tasks to increase model performance without the need for large-scale annotation of new data. Data augmentation is a common approach that relies on creating more labeled data by copying and tweaking, e.g. flipping, rotating or cropping, existing labeled images [48]. Another typical approach is the reuse of layers from a previously trained model on similar images, which is called transfer learning [30].

1.2 Efficient learning in object detection

The increased availability of large labeled datasets of natural images has reduced the need for research on efficient learning with few labeled instances in this field. General efficient machine learning methods such as active learning therefore have not gained a strong foothold in object detection research [3]. In the task of object detection on aerial imagery, on the other hand, efficient learning remains crucial. Of the two solutions that are applied to general computer vision tasks to increase model performance without adding new labeled data, data augmentation is a common ingredient for object detection tasks on aerial images (e.g. [5, 21, 34]). Transfer learning has also been applied to this task in the past, but with mixed success [8, 28]. Despite these solutions, the increase in object detection performance on this task lags behind that of object detection on natural images [16], raising the question whether further improvements can be made with general efficient machine learning solutions that have not been used at a large scale for object detection. This research explores whether active learning can be used effectively to get a significant performance increase for the task of object detection on aerial imagery with few labeled objects. Furthermore, new active learning query strategies are examined, among which a strategy that takes into account a characteristic specific to aerial imagery: geographical proximity. In aerial imagery, labeled objects are not fully independent; Tobler's first law of geography states that "everything is related to everything else, but near things are more related than distant things" [49]. By taking into account a measure of proximity, the effectiveness of the query strategy might be improved. Lastly, this research explores how to effectively apply active learning on very large-scale data.

Figure 1: High resolution aerial images containing wet cooling towers.

1.3 Details of the research

This research was carried out in cooperation with the Netherlands National Institute for Public Health and the Environment (RIVM). Experiments were conducted on a task of interest to the RIVM in which the goal is to identify wet cooling towers in high-resolution aerial imagery of The Netherlands. Wet cooling towers are the parts of industrial cooling water systems where waste heat is rejected to the atmosphere through evaporation of water. A poorly maintained wet cooling tower can become a reservoir of infection for Legionella pneumophila, from which the pathogen can spread via aerosols to infect people in the vicinity of the cooling tower [29]. Knowing the locations of wet cooling towers in The Netherlands helps public health officials from the RIVM to identify possible reservoirs of infection in the event of a geographically clustered occurrence of Legionella pneumophila infections. There are various types of wet cooling towers that differ in size and appearance, but nearly all are visible in high resolution aerial images. Two example images with cooling towers can be found in figure 1. Around 1000 wet cooling tower locations known to the RIVM were used to create a dataset, and several thousands of unknown wet cooling towers were estimated to exist. The main dataset was a collection of high resolution aerial images of The Netherlands. An additional dataset with zoning information was available for preprocessing.

1.4 Research objectives

In accordance with the context of the project as covered in this introduction, the main goals of this research are defined as: (1) to explore whether active learning can bring a significant performance increase to aerial image object detection; (2) to compare different query strategies and a newly proposed strategy based on geographical proximity; and (3) to apply active learning on a real-world task with very large-scale unlabeled data.


2 Related Work

2.1 Object detection

This research is focused on the task of object detection, in which images are processed to detect and localize objects of predefined types. Object detection also forms the basis of a number of other tasks in computer vision, such as instance segmentation, image captioning and object tracking. Zou et al. [58] provide a comprehensive history of the developments in object detection over the last 20 years. The most relevant history and related work is discussed in this section.

2.1.1 Methods and architectures

Traditionally, object detectors used handcrafted features and a sliding window approach, going through all possible locations and scales in an image to see if any window contains an object. As an image can easily be divided into a very large number of windows of different scales, this method is very computationally inefficient. Detectors from that time used multiple approaches to decrease the amount of computation necessary. The well-known Viola-Jones detector [53] uses, among other methods, a cascade model to speed up detection. A cascade model uses multiple levels of classifiers with increasing complexity, each of which can reject the input area of the image or pass it on to the next level. The goal is to remove the least interesting areas (i.e. background) early on in the cascade with simple, efficient models. In that way, complex, less efficient classifiers further down the cascade will only be applied to areas in the image that are likely to actually contain the object of interest.

This core idea of only applying complex classifiers to image regions likely to contain objects carried over to the deep learning era, with the introduction of R-CNN in 2014 [13]. In this first major application of convolutional neural networks (CNNs) to object detection, a region proposal method such as selective search [51] is used first to create category-independent region proposals. Then affine image warping is used to compute a fixed-size matrix from each region proposal. This is fed to a CNN that extracts a fixed-length feature vector for each proposed region, and finally a linear SVM classifies the region. Later, two improved versions of this two-stage detector were published. Based on ideas from SPPnet [18], Fast R-CNN [12] improved speed dramatically by first extracting features with a CNN once for the entire image and then extracting fixed-length representations from the shared feature map for each region. Instead of using an SVM for classification, it performs classification and bounding box regression using a single neural network with fully connected layers and two outputs. Improving on this model again, Faster R-CNN [33] uses a CNN for region proposals as well: the Region Proposal Network (RPN). RPN reduces the computation time necessary to make region proposals, because it also uses the shared feature map that is used for classification and bounding box regression. Apart from speeding up detection once again, it is also the first end-to-end deep learning detector. The CNN that constructs the shared feature map used by RPN and the classification and bounding box regression networks is frequently referred to as the backbone network, and numerous different backbone networks have been proposed for specific applications.

These deep learning object detectors are all based on the same two-stage architecture, in which image regions are proposed first and these regions are subsequently classified. In 2015, several researchers established a new class of object detectors consisting of a single stage and thereby discarding region proposals. The three major single-stage detectors are YOLO [32], SSD [27] and RetinaNet [26]. The accuracy of these single-stage detectors is generally lower than that of two-stage detectors, but detection speed is much higher.

In this research, Faster R-CNN will form the basis of the object detection algorithm, as it provides good performance at a relatively high speed and reliable implementations are available.

2.1.2 Scale variation problem

A central problem in object detection that will be briefly discussed here is the scale variation problem. Objects in images can cover any number of pixels, depending on the actual size of the objects and their distance to the sensor. Images generally contain objects of multiple sizes in the same image, where some objects are in the foreground and others are in the background. This means that for an object detector to perform well on a general test set of images, it should be able to detect objects of many different scales. However, this is challenging, since the features on the feature map that is output by a vanilla CNN backbone network have a fixed receptive field, and are therefore optimized to detect objects on a specific scale. Multiple classes of solutions have been proposed to counter this issue, which are clearly summarized in [25]. A simple approach is to independently extract features from the same image at multiple scale levels, and for each of these feature sets apply a detection model to identify objects on that scale level. A faster approach is to use the feature maps generated by each layer in a CNN for prediction at different scales, as the feature maps of these layers each have different receptive fields. However, there is a large semantic gap between features extracted by these different layers, with earlier layers only able to detect low-level features.

Lin et al. introduce feature pyramid networks (FPN) [25], in which feature maps of earlier, semantically weaker layers are combined with those of later, semantically stronger layers to create multiple feature maps at different scales that are all semantically strong. A detector is then applied to the feature maps of different scales to detect objects of multiple sizes. Another successful approach called TridentNet was introduced in 2019 [24], where a single network produces multiple scale-specific feature maps by using different weight-sharing branches in which only the dilation rate [57] of the convolutions differs, so that the feature maps have different receptive field sizes.

Although there is no large size difference between foreground and background objects in aerial images, cooling towers occur in a range of different sizes. Therefore, for this research, a feature pyramid network was used to provide Faster R-CNN with feature maps optimized to detect cooling towers of different sizes.


2.1.3 Detection on aerial imagery

The specific task of object detection on aerial images differs from object detection on natural images. For this reason, there has been quite some research into the task. The meta-architectures of proposed methods are not different, with two-stage detectors such as Faster R-CNN and one-stage detectors like SSD commonly used as a basis. However, slight adaptations to the architecture, backbone or hyperparameters are generally made to get performance increases by making use of the specific characteristics of aerial images. Below are three examples of proposed adaptations for the task of vehicle detection. This task resembles the problem that will be tackled in this research, because there are few object classes and the objects are of small size.

• Sommer et al. propose to extend a Faster R-CNN and VGG-16 base with a deconvolutional module. In this way they combine features of shallow and deep layers while keeping the resolution of the feature map sufficient to detect small objects [44]. This is applied and shown to outperform plain Faster R-CNN on the task.

• Sakla et al. [36] also use Faster R-CNN and VGG-16 as a base. They remove the last two convolutional layers to keep a higher resolution feature map and adjust the anchor boxes used in RPN to be of similar size to the vehicles that are to be detected.

• Tang et al. [46] adapt SSD to be able to handle arbitrarily-oriented objects as are often present in aerial images, by enabling the use of non-axis-aligned bounding boxes. This makes the detection more accurate than plain SSD.

Naturally, there are many more methods that have been proposed for higher performance on aerial images. Apart from optimized hyperparameters, adapted methods such as those mentioned above were not applied in this research. Non-adapted methods sufficed to detect cooling towers, and the focus of this research is on large-scale (spatial) active learning. However, it should be known that adapted methods exist, as other aerial image detection tasks may benefit significantly from these.

2.2 Active learning

Active learning is an idea that is several decades old and is present in both machine learning and statistics literature. The key observation behind the active learning approach is simple: not every training example is equally effective for training a machine learning algorithm. For example, if a model is created to recognize handwritten digits and after several rounds of training it can successfully recognize all clearly written 4's, then continuing training on hundreds of additional images of clearly written 4's is not effective. This will not lead to large performance increases, as an adequate decision boundary for this class has already been learned. Therefore, the core idea in active learning is to allow the learner to choose the data on which it is trained, so that the training becomes much more efficient and the total number of training examples required to reach the same performance is drastically reduced. This approach works best in settings where (1) annotating instances is expensive in any sense of the word, and (2) obtaining unlabeled instances is much cheaper than obtaining labeled instances.

A thorough review of the literature and methods in active learning was written in 2009 by Settles [38]. Relevant methods will be discussed here and the last paragraph will be dedicated to active learning research in the deep learning era.

2.2.1 Scenarios and query strategies

Research in active learning is focused around three different scenarios. In membership query synthesis [1], the model itself generates an instance, such as an image, and the oracle labels it. In stream-based selective sampling [6], an unlabeled instance is sampled from the distribution and the learner subsequently decides whether to request its annotation or not. Lastly, the most common scenario is pool-based active learning [23], where a large collection of unlabeled data is available at once, in addition to a small labeled dataset. The learner can make use of all labeled and unlabeled data to determine which instances from the unlabeled set should be queried. Pool-based active learning is also the scenario that the experiments of this research operate in.

Although active learning is applied in these three different scenarios, the main problem in all these settings is the same: how to decide which instances should be annotated? Therefore, the bulk of active learning research focuses on finding optimal query strategies. A common and straightforward query strategy is uncertainty sampling [23], in which the learner queries the instances about whose label it is most uncertain. A universal version of this strategy uses information entropy as the measure of uncertainty. Query-by-committee [41] is another query strategy, in which multiple models are trained on the labeled set. The labels of the instances in the unlabeled set are predicted by each of the models and the instances on which the committee disagrees the most will be queried, with many possible measures of committee disagreement. A third query strategy framework is centered around the expected model change if the label were known. As many models employ gradient-based optimization, an implementation of this general framework is the expected gradient length [40].

These methods are susceptible to a preference for outliers that, when queried and trained on, do not improve performance on the general dataset. Other frameworks try to reduce the impact of outliers, for example by considering a measure of similarity to the rest of the dataset when deciding on the data to be queried, as is done in density-weighted methods [39]. The same susceptibility was a main motivation for the introduction of variance reduction methods [7], which attempt to find the optimal query to reduce model variance in order to minimize the learner's future error, as well as the estimated error reduction framework [35], which directly minimizes generalization error by querying for minimal expected future error. Lastly, an interesting method was proposed by Konyushkova et al. [22], who take a data-driven approach as query strategy: they train a regression model to predict the expected error reduction for a candidate sample in a particular learning state.

The query strategies described here are designed for general active learning problems and so they operate without the need for problem-specific information. Part of this research will focus on combining general model metrics with domain-specific information, concretely spatial information in this setting where instances are spatially correlated, to improve active learning performance. In principle, any of the mentioned query strategies could be used as a basis for this combination, but the commonly used uncertainty sampling based on entropy will be used here because of its simplicity and proven worth. Additionally, the active learning is applied in a large-scale setting, where a huge quantity of data is available, of which only a very small part is labeled. The speed of calculating entropy is therefore also a contributing factor to the choice for this strategy.

2.2.2 Active learning in the deep learning era

Although deep learning models require large amounts of annotated data for which labeling might be expensive, active learning never became standard practice for working with these models. The dependence of deep learning methods on large amounts of data conflicts with the reliance of active learning methods on being able to update models from small amounts of data [10]. In the classical active learning setting, a single data point is queried at each iteration and the model is retrained afterwards. This is usually not feasible for deep learning models, as a single point is unlikely to have a statistically significant impact on the test performance. Also, a full retraining after querying a single data point is often intractable in a deep learning setting [37]. In research that does use active learning for training deep learning models, batches of data are queried instead of single data points to overcome these issues. Batch-based querying will be used in this research as well.


3 Methodology

The first research objective is: (1) to explore whether active learning can bring a significant performance increase to aerial image object detection. Experiments will be conducted in which active learning is done with multiple query strategies. If one or more of these succeed and it is quantitatively shown that active learning can bring a performance increase on this task, the first objective will have been completed.

The third objective is: (3) to apply active learning on a real-world task with very large-scale unlabeled data. One of the query strategies will be used to perform and analyze active learning as it is intended, with manual annotation on request, on a large-scale dataset. The results of this large-scale active learning will be analyzed and used to gain more insights into the workings of that query strategy.

This leaves research objective two: (2) to compare different query strategies and a newly proposed strategy based on geographical proximity. The remainder of this section will explain existing and proposed query strategies that are tested for the second objective of this research.

3.1 Background

In pool-based active learning, each instance in the pool set is assigned a value v(x) that estimates the possible contribution to the model improvement [3]. In this research, uncertainty sampling is used as the starting point of all query methods. This means that the images about which the model is most uncertain get the highest v(x) and are therefore queried first. Uncertainty is measured with entropy, as is commonly done in active learning applications [38]. In the experimental setup, it will be explained that the task at hand is formulated as a binary classification problem: the class of interest is the cooling tower class and everything else is considered background. This means that, in a standard active learning case, the instance x^* that will be queried is chosen with

    x^* = \arg\max_x \left[ -\sum_i P(y_i \mid x; \theta) \log P(y_i \mid x; \theta) \right]    (1)

where, in a binary case such as this, i ranges from 0 to 1 and therefore x^* is equal to the instance where the class predictions are closest to 0.5.

However, an issue comes up when this query strategy is applied to object detection. In the above formula, x is an image, while labels y_i are not assigned to images, but to detected boxes instead. The number of detected boxes differs per image, and so the individual entropy values per box have to be aggregated to result in a v(x) per image. The aggregation methods average, sum and maximum are tested in [3]. Now the query formula becomes

    x^* = \arg\max_x \left[ \operatorname{aggr}_j \left( -\sum_i P(y_i \mid \mathrm{box}^x_j) \log P(y_i \mid \mathrm{box}^x_j) \right) \right]    (2)

where i still ranges from 0 to 1, box^x_j is box j of image x with j ranging over all detected boxes in the image, and aggr is the aggregation method used, such as max_j. Equation 2 describes how to query a single image, but in practice a number of images are queried each active learning round. This is done by sorting the images based on their aggregated entropy and then selecting the top n images to be queried.

It must be noted that images with no detections are always assigned a v(x) of 0, meaning that these images will never be queried. This may result in a lack of active learning improvement: An image with a single box that has a cooling tower score of 0.45 will never be queried, despite having relatively high entropy. The reason is that this score does not exceed 0.5 and the box therefore does not count as a detection. To alleviate this problem, the score threshold for detections in inference is lowered to 0.35 during active learning, whereas for the final results it remains 0.5. This means that a region proposal with a score of 0.35 or higher for the cooling tower class is added to the boxes that are used for querying, resulting in almost all images having at least one detection. The image containing a predicted box with a cooling tower score of 0.45 now has a good chance of being queried, because of its high entropy.
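A minimal sketch of this per-box scoring and its aggregation into v(x); the representation of detections and all names are illustrative assumptions, not the thesis implementation:

    import math

    DETECTION_THRESHOLD = 0.35  # lowered from 0.5 during active learning (see above)

    def box_entropy(p_tower):
        # Binary entropy of one box, given its (merged) cooling tower probability.
        if p_tower <= 0.0 or p_tower >= 1.0:
            return 0.0
        p_bg = 1.0 - p_tower
        return -(p_tower * math.log(p_tower) + p_bg * math.log(p_bg))

    def image_value(box_scores, aggr=max):
        # v(x): aggregate the entropies of all boxes passing the threshold.
        entropies = [box_entropy(p) for p in box_scores if p >= DETECTION_THRESHOLD]
        return aggr(entropies) if entropies else 0.0  # no detections -> v(x) = 0

With this scoring, a box with a cooling tower score of 0.45 contributes an entropy close to the maximum of log 2 ≈ 0.69, so its image now ranks high in the query order.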

3.2 Proposed aggregation methods

Different variations of this query strategy are tested in the experiments. The aggregation methods maximum, sum and average were mentioned before and these have a relatively straightforward intuition (a code sketch of the variants follows after this list):

• Maximum: Although full images are queried, the model learns from individual boxes. For the model to learn most efficiently, the images that contain the least certain boxes in the complete active learning pool are the images that should be added to the training set. These boxes are found by taking the box with maximum entropy for each image and comparing all images with each other based on this measure.

• Sum: The model does learn from individual boxes, but an image with 5 somewhat uncertain boxes could be more useful for learning than an image with a single, very uncertain box. Therefore the sum of entropy of all boxes may be a better measure to estimate possible contribution to model improvement.

• Average: It could also be that the best estimate of possible contribution to model improvement should take into account both the number of boxes and their uncertainty scores in a more balanced manner than the sum or maximum. In the sum aggregation method, an image with 20 low-entropy boxes is estimated to be more useful than an image with 2 high-entropy boxes. Just as the maximum aggregation method, the average method would consider the image with 2 high-entropy boxes to be more useful. However, an image with 5 high-entropy boxes is preferred by the average method over an image with 1 very-high-entropy and 5 low-entropy boxes, unlike the maximum aggregation method.

Apart from these aggregation methods, three new methods are explored: the minimum aggregation method and two combination methods. The combination methods require further changes to the algorithm than just the aggregation method:


• Minimum: This is a new method to test the idea that the most useful images are those with the highest minimum entropy. Here the most certain boxes of the images are compared to each other, and the images are chosen where these most certain boxes are least certain. The aggr function is the min function for this method.

• Combination 1: In this method, the maximum and minimum strategies are combined to see whether they complement each other and improve performance further. Concretely, this means collecting both the minimum and the maximum entropy per image and creating two sorted lists of all images: one sorted by maximum entropy from high to low and the other sorted by minimum entropy from low to high. Then, using some partition factor 0 ≤ p ≤ 1, the first p · n images are selected from the list sorted on maximum entropy and the first (1 − p) · n images are selected from the list sorted on minimum entropy. Naturally, there is a mechanism in place to prevent duplicates in the final list of images to be annotated.

• Combination 2: The second combination method is not a combination of different aggregation methods, but a combination of different parts of the list sorted by a single aggregation method. This means that, if the maximum is taken as the aggregation method, the list of images is first sorted on their maximum entropy from high to low, as was the case before. Then, instead of taking the top n images from this list, this method takes the top t, middle m and bottom b images from that list, where t + m + b = n. The idea behind this is that the model might benefit from an added number of clear and moderately clear training examples, in addition to adding some of the most uncertain images.
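To make the differences concrete, the following sketch implements the aggregation variants and the two combination selection rules in illustrative Python. It assumes each image exposes a non-empty list of per-box entropies (guaranteed here by the lowered detection threshold of section 3.1); none of these names come from the thesis code:

    def aggregate(entropies, method):
        # Aggregate per-box entropies into a single image value v(x).
        if method == "maximum":
            return max(entropies)
        if method == "sum":
            return sum(entropies)
        if method == "average":
            return sum(entropies) / len(entropies)
        if method == "minimum":
            return min(entropies)
        raise ValueError(method)

    def combination_1(images, n, p):
        # First p*n images from the list sorted on maximum entropy (high to low),
        # topped up from the list sorted on minimum entropy (low to high),
        # skipping duplicates, until n images are selected.
        by_max = sorted(images, key=lambda im: max(im.entropies), reverse=True)
        by_min = sorted(images, key=lambda im: min(im.entropies))
        chosen = list(by_max[:int(p * n)])
        for im in by_min:
            if len(chosen) == n:
                break
            if im not in chosen:
                chosen.append(im)
        return chosen

    def combination_2(images, t, m, b, method="maximum"):
        # Top t, middle m and bottom b images of one sorted list (t + m + b = n).
        ranked = sorted(images, key=lambda im: aggregate(im.entropies, method),
                        reverse=True)
        mid = len(ranked) // 2
        middle = ranked[mid - m // 2 : mid - m // 2 + m]
        return ranked[:t] + middle + ranked[len(ranked) - b:]

The sketch assumes the pool is much larger than n, so the top, middle and bottom slices of combination 2 do not overlap.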

3.3 Proposed distance-based methods

Apart from these variations of query strategies based on uncertainty alone, geographical proximity is an interesting feature to explore. While it cannot be used in most natural image object detection tasks, aerial imagery has a much more direct relation to geographical location. When proximity is taken into account in an active learning query method for this setting, it could ensure that the images in the training set are spread out more evenly over locations in the country. The reason that this is expected to improve the active learning performance increase is that near things tend to be more related than distant things, according to a common rule in geography. Therefore, if the training examples are spread throughout the area of interest, a model is expected to perform better on that area as a whole.

This rule in geography needs more explanation, especially when applied to this case: why would cooling towers close to each other be more related than distant ones? Three factors that could be relevant are identified and discussed below, although there could be many more.

• Cooling towers: Given regional variation in characteristics such as population density, natural resources and infrastructure, different industries have their own, unique spread throughout a country. Since different industries could be using distinct cooling tower types, the occurrence of cooling tower types will vary per region. Thus, the larger the distance between two cooling towers, the more unlikely it is that they are of a similar type.

• Buildings, roads and vegetation: Since the largest parts of aerial images are covered by rooftops, roads and vegetation, the appearance of these elements is crucial to a model that needs to separate cooling towers from everything else. Even in a country as small as the Netherlands, there will be significant differences in rooftop and road style, and vegetation type. However, these features usually change gradually when moving away from an area. Therefore, the smaller the distance between two areas, the more likely it is that buildings, roads and vegetation are similar. The background class actually covers a very broad range of objects with specific features. A model that needs to perform well on the entire country should, during training, have seen as many types of background as possible to be able to learn many different non-cooling-tower features so it can generalize well. It is therefore beneficial for a training set to contain images from many different parts of the area of interest.

• Images: Part of what makes any computer vision task difficult is that images of the same object taken from the same angle can still have widely different pixel-level values depending on the lighting and camera equipment used. This also holds for aerial imagery that, in this case, was shot with airplane-based cameras. As it may take weeks or months for an airplane to take all pictures necessary to cover the entire country, there will be large differences in lighting, due to the season and time of day the pictures were taken and the weather conditions at that time, among other things. Secondly, it is likely that multiple planes were used, each with slightly different equipment, to cover different areas of the country in order to speed up the collection process. Both factors contribute to slight image differences, and the larger the distance between two areas, the less likely it is that the images are very similar: two images close to each other are more likely to have been taken from the same plane, around the same time of the day and in the same season. Figure 2 illustrates some of these local similarities and regional differences.

Research has indeed shown that object detectors trained on images from one geographical area do not generalize well to images from other geographical areas [54]. This supports the idea that it is beneficial to have a training set of images spread across the area of interest. Two methods were created to test the hypothesis that in this setting, active learning could increase performance further if geographical proximity is taken into account.

• Distance only: This method is based on geographical proximity alone, meaning that for each image in the active learning pool the distance to the closest known image is calculated. The set of known images refers to all images in the training set, as well as the images that were queried previously. Then the image is queried that has the largest distance to its closest known image. This is a very simple method that is not expected to outperform more complex methods. However, comparing this method to a random sampling baseline could provide an indication of the usefulness of using distance in a query method. A minimal sketch of this selection rule is given after this list.

Figure 2: Images showing local similarities and regional differences in the aerial images due to differences in vegetation, time of year/day, equipment used and architectural styles. Panels: (a) Friesland, (b) within 15 km of 2a, (c) Gelderland, (d) within 15 km of 2c, (e) Zuid-Holland, (f) within 15 km of 2e. Distance between 2a and 2c: 128 km; between 2a and 2e: 169 km; between 2c and 2e: 142 km.

• Distance-entropy combination: The final query strategy that is tested in this research is a combination of both predictive uncertainty and geographical proximity. It is designed to query images such that the images in the training set are spread out over the entire area of interest and the newly queried images also have a high predictive uncertainty. In order to accomplish this, the distance to the closest known image is again calculated for each image in the active learning pool. These images are subsequently sorted based on this distance, from high to low. The s > 1 images with the highest minimal distance are taken out and the image with the highest aggregated entropy within this subset is queried. This process repeats until n images are selected. Algorithm 1 shows the simplified implementation structure of this query strategy, where parameter s is named entropy subset size.
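A minimal sketch of the distance-only rule, assuming each image carries planar (x, y) coordinates (e.g. in the Dutch RD grid); the attribute and helper names are illustrative, not taken from the thesis code:

    import math

    def min_distance(image, known_images):
        # Distance from an image to its closest known image,
        # using planar (x, y) tile coordinates in meters.
        return min(math.hypot(image.x - k.x, image.y - k.y) for k in known_images)

    def query_distance_only(pool, n, known_images):
        # Repeatedly pick the pool image that is farthest from all known images
        # (farthest-point sampling); queried images immediately become known.
        known = list(known_images)
        queried = []
        for _ in range(n):
            chosen = max(pool, key=lambda im: min_distance(im, known))
            queried.append(chosen)
            pool.remove(chosen)
            known.append(chosen)
        return queried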


Algorithm 1: Algorithm that returns n images to be annotated, based on aggregated entropy and the distance to known images.

    def query_entropy_distance(inference_results, n, known_images, entropy_subset_size):
        # inference_results: list of images with inference results, each holding
        #   zero or more boxes with associated classification scores.
        # n: number of images to be queried.
        # known_images: list of the known images.
        # entropy_subset_size: size of the subset from which the image with the
        #   maximum aggregated entropy is chosen (parameter s).
        queried_images = []
        while len(queried_images) < n:
            # Calculate the minimal distance to any known image and sort on it.
            for image in inference_results:
                image.min_distance = image.calculate_min_distance(known_images)
            sorted_on_distance = sort_on_min_dist_high_to_low(inference_results)
            furthest_subset = sorted_on_distance[:entropy_subset_size]

            # Calculate the aggregated entropy per image in the subset.
            for image in furthest_subset:
                entropy_all_boxes = [calculate_entropy(box) for box in image.boxes]
                image.v_x = aggregate(entropy_all_boxes)

            # Choose the image with the highest v(x).
            sorted_on_entropy = sort_on_v_x_high_to_low(furthest_subset)
            chosen_image = sorted_on_entropy[0]
            queried_images.append(chosen_image)
            inference_results.remove(chosen_image)
            known_images.append(chosen_image)
        return queried_images


4 Experimental setup

4.1 Data

In this project, three data sources are used to create usable datasets and get the desired predictions:

1. A collection of aerial images
2. Confirmed cooling tower locations
3. Land use and zoning information

4.1.1 Aerial images

The main data source is a collection of aerial images of the Netherlands available through a Web Map Service (WMS) [31]. The service operates at a 10 cm ground sampling distance (GSD), meaning that in general, 1 pixel equals 10 x 10 cm on the ground. In practice, however, the GSD of the individual images varies between 4 and 25 cm, but the WMS ensures that the images contain the same number of pixels. The aerial imagery was divided into a grid of 100 by 100 meter (i.e. 1000 by 1000 pixels) tiles in order to be workable for an object detection model.

The size of this dataset complicates an object detection task. The number of tiles covering the complete country is over 3.7 million, together requiring around 600 GB of disk space to be stored. As storing this was not a straightforward option for the computational grid that was used, images were requested from the WMS live during full inference. This limited inference speed, and a single round of full inference alone on all these images would take around 45 hours on the available computational grid.
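For illustration, fetching one such 100 m tile from a WMS endpoint could look as follows with the owslib package; the service URL, layer name and the EPSG:28992 (Dutch RD) corner coordinates are assumptions, not taken from the thesis:

    from owslib.wms import WebMapService

    # Hypothetical endpoint and layer name; the real service URL is not given here.
    wms = WebMapService("https://example.org/luchtfoto/wms", version="1.1.1")

    # One 100 m x 100 m tile equals 1000 x 1000 pixels at 10 cm GSD.
    # Coordinates are in the Dutch RD system (EPSG:28992); this corner is arbitrary.
    x, y = 121000, 487000
    response = wms.getmap(
        layers=["aerial_10cm"],   # assumed layer name
        styles=[""],
        srs="EPSG:28992",
        bbox=(x, y, x + 100, y + 100),
        size=(1000, 1000),
        format="image/jpeg",
    )
    with open("tile_%d_%d.jpg" % (x, y), "wb") as f:
        f.write(response.read())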

4.1.2 Confirmed cooling tower locations

A second main data source is a number of confirmed cooling tower locations from the organization Atlas Leefomgeving, totalling 1029 points. Not only is this a very small number, especially relative to the total number of images, but these points also need to be converted to bounding boxes before they are of any use. Annotating the images is time-consuming manual labor and requires people who can distinguish cooling towers from similar structures on rooftops.

4.1.3 Resolving the issues

Firstly, to address the image dataset size issue, the known cooling tower locations were analyzed with an additional dataset containing land use and zoning information [45]. This analysis showed that 98.5% of the known cooling towers were found in 11 out of 37 zoning classes, making up only 36% of all the tiles. Furthermore, most of the exceptions actually were in one of these zoning classes, but the points were somewhat misplaced or the zoning category in that area was not up to date.


                          Active learning           Hyperparameter search
                          Positives   Negatives     Positives   Negatives
  Training set               102          0            303         303
  Active learning pool       737         737            -           -
  Test set                   429         429           429         429

Table 1: Compositions of the annotated-tiles-dataset that are used during the experiments. All tiles here have been manually annotated by experts. Positives is the number of images in this set containing at least one cooling tower. Negatives are images that do not contain any cooling towers.

It must be noted that it is not known how representative these points are for the set of all cooling towers in The Netherlands. However, because the numbers were so convincing and this significant size reduction would make full inference more realistically achievable, it was decided to discard the tiles with other zoning classes.

The second challenge was creating test and training datasets to be able to train object detection models. The confirmed cooling tower points were combined with the aerial imagery and the tiles containing these points were extracted. A number of experts then set out to annotate the images by creating bounding boxes and dividing these boxes into 4 (unofficial) categories that the cooling towers seem to naturally fall into. Examples of cooling towers from these categories are shown in figure 3. Apart from the resulting images that contain cooling towers, a number of negative examples were extracted as well. These are randomly selected tiles from the data that are manually checked to ensure that they contain no cooling towers.

The small size of this set of annotated images remains problematic, but that is the basis for applying active learning to the task at hand. However, data augmentation is applied during training to slightly alleviate this problem before active learning comes into play. The training images have a 0.5 probability of being flipped horizontally, a 0.5 probability of being flipped vertically, and they are rotated a random number of degrees around the center. These augmentation methods have been shown to work well in general [48] and are also used in detection tasks on aerial imagery [34].
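A sketch of such a bounding-box-aware augmentation pipeline, here using the albumentations library as one possible choice (the thesis does not name its augmentation implementation):

    import albumentations as A

    # image: HxWx3 numpy array; boxes: list of (x_min, y_min, x_max, y_max);
    # labels: list of class ids; all three are assumed to be loaded elsewhere.
    transform = A.Compose(
        [
            A.HorizontalFlip(p=0.5),
            A.VerticalFlip(p=0.5),
            A.Rotate(limit=180, p=1.0),  # random rotation around the center
        ],
        bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
    )

    augmented = transform(image=image, bboxes=boxes, labels=labels)
    image, boxes, labels = augmented["image"], augmented["bboxes"], augmented["labels"]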

4.1.4 Datasets

Since two different combinations of the described data will be used for conducting various types of experiments, a name and description will be provided here to improve clarity throughout the sections that will follow.

Firstly there is the annotated-tiles-dataset. This dataset consists only of the tiles that have been manually annotated by experts. The tiles in this dataset are subdivided into the customary sets: There is a fixed test set and the remaining annotated tiles are divided over a training set and an active learning pool, depending on the specific experiment that is carried out. The exact compositions of these specific versions of the annotated-tiles-dataset are shown in table 1.

Secondly, there is the Netherlands-dataset. Here, the training set is made up of all annotated tiles, i.e. the full annotated-tiles-dataset. Furthermore, there is an active learning pool of unlabeled images consisting of all images covering the Netherlands after zone class filtering, totalling around 1.3 million images. The queried images will be manually labeled by an expert upon request by the active learning method, which is how active learning is designed to be used in practice. There is no separate test set for this dataset and the performance evaluation will be qualitative only.

Figure 3: An example of each of the four categories of wet cooling towers used during image annotation: (a) small multi-fan, (b) medium single or multi-fan, (c) small round single-fan, (d) very large natural draft.

4.2 Model and implementation

4.2.1 Model

The focus of this research is on (geospatial) active learning techniques and not on creating a new detection model or comparing different detection models to find the optimal detector for the problem. Therefore, a detection model was chosen based on: existing research tackling a similar type of problem, the speed/accuracy trade-off with limited resources in mind, and the availability of reliable existing implementations. The model that was chosen based on these aspects is a torchvision implementation of Faster R-CNN [50], adapted to support uncertainty-based active learning methods. It uses feature pyramids [25] and a ResNet-50 backbone pretrained on ImageNet. This backbone has been shown to lie close to the Pareto optimality frontier of speed versus accuracy [2] in image classification, and the same seems to hold for the combination of a ResNet model with Faster R-CNN for object detection [20]. Faster R-CNN is often used in other research focusing on aerial image object detection [16, 36, 43], and so are multi-scale feature maps such as FPN [15, 44, 47] and ResNet backbones [34, 47, 56]. Because of the categories used for image annotation, the Faster R-CNN classification step becomes a 5-class problem, with 4 cooling tower classes and a single background class that encompasses everything else. A diagram showing the structure of a Faster R-CNN model is displayed in figure 4.
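Constructing such a model with torchvision follows the standard fine-tuning recipe: load the detector with a pretrained backbone and replace the box predictor head for the 5-class problem. A minimal sketch (the exact construction in the thesis code may differ):

    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    # Faster R-CNN with a ResNet-50 FPN backbone; only the backbone weights
    # are pretrained (on ImageNet), matching the setup described above.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        pretrained=False, pretrained_backbone=True
    )

    # Replace the box predictor head for the 5-class problem:
    # 4 cooling tower categories plus 1 background class.
    num_classes = 5
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)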

4.2.2 Active learning

As shown in figure 4, RPN first generates region proposals which are then processed and classified by the remaining parts of the detector. In order to be able to do uncertainty-based active learning with the chosen model, slight structural adaptations were implemented. In the Faster R-CNN version by Torchvision, the output of a single predicted box is not a distribution over classes, but only the predicted class and the corresponding prediction value. The reason for this is that a single region proposal is used to create C (number of classes) predicted boxes by applying class-specific box-regression outputs, slightly changing the box for each class. All these C predicted boxes are separately added to the results and will be in the output if they meet the thresholds set in the Faster R-CNN parameters. In the implementation for this research, the full predictive distribution is stored and added to the boxes that make it to the output, so that the entropy can be calculated. The structure of the box postprocessing procedure and the adaptations made to it are shown in figure 5.


Figure 4: Diagram showing the structure of a Faster R-CNN model. Text in italics are names of hyperparameters of the model. Based on [33].

Uncertainty is represented by binary classification entropy. The transformation from the 5-class classification that the Faster R-CNN model actually performs to binary classification is done by merging all 4 cooling tower classes into a single class. Subsequently, the entropy is calculated for the distribution over the background class and the cooling tower class. The reason for using binary classification is explained in section 4.4 on metrics.
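A sketch of this merge, assuming each box carries a probability vector over (background, tower type 1, ..., tower type 4); the tensor layout is an assumption for illustration:

    import torch

    def binary_box_entropy(class_probs):
        # class_probs: tensor of shape (num_boxes, 5) with columns
        # (background, tower type 1, ..., tower type 4).
        p_tower = class_probs[:, 1:].sum(dim=1)           # merge the 4 tower classes
        p = torch.stack([1.0 - p_tower, p_tower], dim=1)  # binary distribution
        p = p.clamp_min(1e-12)                            # guard against log(0)
        return -(p * p.log()).sum(dim=1)                  # entropy per box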

Lastly, a short note on the dataset that will be used for the active learning experiments: the annotated-tiles-dataset in the active learning configuration, as shown before in table 1. Here, the training set consists of 102 positive images and 0 negatives. The reason that there are no negatives in this training set is that the starting data for this and similar applications is in principle only a set of known objects of interest. The experiments on this dataset represent situations where the starting dataset is only a small number of known objects and where obtaining negatives is as difficult as obtaining positive images, because expensive expert knowledge is required.


Figure 5: Diagram showing the postprocessing steps of Faster R-CNN. Text in italics are names of hyperparameters of the model. The parts in red have been added to make active learning possible.

4.2.3 Supporting code and hardware

The customized model and active learning are supported by a large body of code that handles the full flow of training, inference and experiments. This includes parameter setting, data loading and augmentation, training, intermediate testing, applying query strategies, result comparisons and visualizations, all designed to run on the available computational grid. The code is available on request.

The hardware available at RIVM was a computing cluster managed by IBM Platform LSF to run batch jobs, consisting of queues for different types of jobs. For this research, a GPU queue was available with two Tesla M10 GPU accelerators, totalling 8 GPUs with 8GB memory each.

4.3 Hyperparameter optimization

If good results are to be expected from any active learning method, it must be made certain that the base object detector works relatively well for the task. Therefore, before active learning methods can be reliably compared, a set of fine-tuned hyperparameter values must be chosen for the detection architecture. A total of 10 different hyperparameters were chosen to be part of the optimization process, most of them related to the training progress. There are a number of additional model-specific hyperparameters, but these either have become de facto standards or have a minimal impact on performance, and their values have therefore been adopted from the literature. The hyperparameter values that were adopted from the literature are shown in table 2 in appendix A.

Ideally, hyperparameter optimization consists of comparing many different values for each hyperparameter in a grid search. This tests all possible combinations of the values of different hyperparameters, and it is likely that a near-optimal configuration is found in this way. However, testing a single configuration for this detection problem with the data from the annotated-tiles-dataset took over 8 hours, and only 8 of these models could be run in parallel. This made a complete grid search infeasible with the available resources.

A different hyperparameter optimization method was therefore chosen, based on the observation that some hyperparameters are closely related, while most have less direct relations to each other. In the method that was used, the more independent hyperparameters were tested and fixed in a specific order, so as to approach desired performance levels in an incremental fashion. The closely related hyperparameters were tested and optimized in a mini grid search, while the other, less related hyperparameters were kept constant.

The test performance increase during this incremental hyperparameter optimization and fixation is shown in figure 6. The final set of best performing hyperparameter values is fixed and kept constant for all upcoming experiments. Additional information on these hyperparameters as well as their tested and chosen values can be found in tables 3 and 4 of appendix A.

Figure 6: Figure showing the improvement in test performance after successively testing and fixing values of hyperparameters grouped in different hyperparameter sets, as indicated on the x-axis. Hyperparameter tuning was done on the annotated-tiles-dataset in the hyperparameter search configuration. All models used in the upcoming experiments share these optimized hyperparameter values. Additional information can be found in appendix A.


4.4 Metrics

While different object detection evaluation metrics existed in the past, the contemporary standard metric for object detection performance is average precision (AP) as introduced in VOC2007 [9]. It is defined as the average detection precision under different recalls, and when there are multiple object categories, the mean AP (mAP) is used as the final comparison metric [58]. Object localization accuracy is measured by the intersection over union (IoU) between the predicted box and the ground truth. Using 0.5 as a threshold for the IoU has become a standard, meaning that an object will be identified as successfully detected only if the IoU with the ground truth box is larger than 0.5.
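For reference, the IoU between two axis-aligned boxes in (x1, y1, x2, y2) form can be computed as follows (a generic sketch, not code from the thesis):

    def iou(box_a, box_b):
        # Intersection over union of two boxes given as (x1, y1, x2, y2).
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0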

In this research, AP will indeed be the main performance metric in the evaluation pipeline, which consists of both standard and custom elements. It will be discussed below. For the analysis of the experiments, supplementary performance measures will also be used for more in-depth comparison between specific setups, notably the true positive (TP) and false positive (FP) rates, and object detection error types.

The first step after performing inference on the test set is the application of non-maximum suppression (NMS) to the inference result, as is standard in object detection tasks. This method reduces the number of boxes by filtering out boxes that overlap the locally highest scoring box. Afterwards, a second method is applied to filter boxes and improve performance: only boxes for which the score of the highest scoring cooling tower class exceeds the score for the background class are kept. This is necessary because Faster R-CNN keeps all boxes in which the highest scoring object class (so excluding the background class) exceeds a threshold of 0.35.
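A minimal sketch of these two filtering steps is given below. It assumes a NumPy array of per-box class scores and reuses the `iou` helper sketched above; the greedy NMS shown is the textbook variant, not necessarily the exact implementation used in this work.

```python
import numpy as np

def nms(boxes, scores, iou_threshold):
    """Greedy NMS: repeatedly keep the highest scoring remaining box and
    drop all boxes that overlap it by more than the threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = np.array([iou(boxes[i], boxes[best]) for i in rest])
        order = rest[overlaps <= iou_threshold]
    return keep

def filter_background(class_scores, tower_class_ids, background_id):
    """Keep only boxes whose best cooling tower score exceeds the
    background score, as described above."""
    best_tower = class_scores[:, tower_class_ids].max(axis=1)
    return np.where(best_tower > class_scores[:, background_id])[0]
```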

The cooling towers in the dataset are divided into 4 classes, making the Faster R-CNN classification step a 5-class problem, with 4 cooling tower classes and a single background class. However, these classes are seriously imbalanced, which dramatically impacts test set mAP. Since the predicted classes do not actually matter, as only the classification of cooling tower versus other is of interest, the problem is converted to a binary classification problem before calculating the metric, which then simply becomes AP.
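One plausible way to implement this conversion, consistent with the filtering criterion above, is to score each box by its highest cooling tower class probability; whether the thesis uses this maximum or another reduction is an assumption of this sketch.

```python
def to_binary_scores(class_scores, tower_class_ids):
    """Collapse the 4 cooling tower class scores of each box into a single
    'cooling tower' confidence by taking the maximum over those classes."""
    return class_scores[:, tower_class_ids].max(axis=1)
```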

Next, the AP is calculated for the single remaining object class. The IoU threshold used here deviates from the standard. During training, the standard IoU threshold of 0.5 is used by Faster R-CNN to select positive and negative bounding boxes, as this indeed results in optimal training progress. However, when the final test performance is measured, an IoU threshold of 0.1 is used to get a more accurate indication of performance with the final goal of the application in mind: discovering cooling tower locations in the Netherlands. On a very large scale like this, the predicted bounding boxes do not have to coincide perfectly with the boundaries of the objects, as long as an expert can quickly identify whether a cooling tower is present or not. Furthermore, a quick study of the labeled images in the dataset makes it clear that cooling towers can be difficult to annotate and that the annotators dealt with ambiguous situations in different manners. An example of annotation ambiguity is a type of larger structure that looks like a number of smaller cooling towers combined. Is a structure like this actually a single larger cooling tower, or is it a combination of many smaller towers? Figure 7 shows two examples of ambiguous cooling towers, which are relatively common in the data. Using a small IoU threshold helps to mitigate the effect of such annotation differences on the measure of final performance. For similar reasons, this same IoU threshold of 0.1 is also used for the NMS step mentioned before.
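For completeness, the AP computation itself can be sketched as follows: detections, already matched greedily to unused ground truth boxes at the relaxed IoU threshold of 0.1, are sorted by confidence, cumulative precision and recall are computed, and the area under the interpolated precision-recall curve is returned. This is the standard all-point VOC-style computation, not code from the actual evaluation pipeline.

```python
import numpy as np

def average_precision(matches, scores, num_gt):
    """All-point interpolated AP from greedy matching results.

    `matches[i]` is True if detection i was matched to an unused ground
    truth box (IoU > 0.1 here), `scores[i]` is its confidence, and
    `num_gt` is the (positive) total number of ground truth objects.
    """
    order = np.argsort(scores)[::-1]
    matched = np.asarray(matches, dtype=bool)[order]
    tp = np.cumsum(matched)
    fp = np.cumsum(~matched)
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # Make precision monotonically decreasing, then integrate over recall.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))
```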

The performances of query strategies are compared by examining performance progress graphs, in which test and train AP after each round of active learning are plotted. This allows for a comparison of the final performance, the average AP increase and the course of the performance change for the different query strategies.
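Graphs of this kind can be produced with a few lines of matplotlib; the array layout assumed here (one row per run, one column per round) is an illustration rather than the thesis' actual plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_progress(ap_per_run, label):
    """Plot mean test AP per active learning round with a ±1 std band.

    `ap_per_run` is an (n_runs, n_rounds + 1) array: column 0 holds the AP
    of the initial model, subsequent columns the AP after each query round.
    """
    ap = np.asarray(ap_per_run)
    rounds = np.arange(ap.shape[1])
    mean, std = ap.mean(axis=0), ap.std(axis=0)
    plt.plot(rounds, mean, marker="o", label=label)
    plt.fill_between(rounds, mean - std, mean + std, alpha=0.2)
    plt.xlabel("Active learning round")
    plt.ylabel("Test AP")
    plt.legend()
```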

Figure 7: Examples of images that are difficult to annotate: how many cooling towers are in each image? (a) Two, three or seven white cooling towers? (b) Six or twelve cooling towers? Experts had different ways of handling situations like this. A small IoU threshold of 0.1 was used during the final metric calculation to reduce the impact of such annotation differences.


5 Experiments

With the three objectives of this research in mind, two separate sets of experiments have been performed; they are discussed in the sections below:

1. Query strategy comparison on the annotated-tiles-dataset.
2. Large-scale active learning on the Netherlands-dataset.

5.1 Query strategy comparison

This section explores the results of the experiments related to the first two research objectives: (1) to explore whether active learning can bring a significant performance increase to aerial image object detection; (2) to compare different query strategies, including a newly proposed strategy.

5.1.1 Setup

For a comparison of query strategies, and for research in general, it is critical to keep as many factors as possible constant between methods. Concretely, this means that different strategies use the same object detector, initialization and hyperparameters, the same number of initial labeled training points, the same budget for obtaining new labeled instances and the same training procedure. The only component that is varied is the query strategy. Furthermore, experiments need to be repeated multiple times with different random seeds, so as to collect representative test results.

The model, initialization and hyperparameters have been discussed in the previous section, and these are used for all query strategy experiments described here. In addition, all query strategies have a budget of 100 labeled instances from the active learning pool, divided over 4 active learning rounds. This is a number that can be annotated by an expert in a reasonable amount of time, and test runs have shown that it works well, although any choice for budget and number of rounds will be somewhat arbitrary. For the query strategy comparison, the annotated-tiles-dataset is used, specifically in the active learning configuration, so that all query strategies have the same initial training set and active learning pool. All models in the experiments are trained for the same number of epochs, namely 50. Test runs have shown that this is a reasonable number of epochs to reach convergence. Lastly, all configurations are run 8 times with different random seeds. The overall protocol is sketched below.
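In this sketch, `fit`, `evaluate` and `query_strategy` are hypothetical caller-supplied callables standing in for Faster R-CNN training (50 epochs), test-set AP evaluation and the strategy under comparison; the pool is a plain list from which labeled images move to the training set.

```python
def active_learning_loop(fit, evaluate, query_strategy, train_set, pool,
                         rounds=4, batch=25):
    """Run the experiment protocol: an initial model, then `rounds` query
    rounds of `batch` images each, retraining after every round."""
    model = fit(train_set)
    history = [evaluate(model)]  # performance at active learning round 0
    for _ in range(rounds):
        queried = query_strategy(model, pool, batch)
        # The expert labels the queried images, which then move from the
        # pool to the training set (pop in reverse index order so earlier
        # indices stay valid).
        for i in sorted(queried, reverse=True):
            train_set.append(pool.pop(i))
        model = fit(train_set)
        history.append(evaluate(model))
    return history
```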

However, before query strategies are compared, an upper and a lower bound are established to put the active learning performance increase into perspective. The lower bound is determined by training a model as specified on the 102 positive instances in the training set alone. For the upper bound setup, this training set and the complete active learning pool, with 737 positives and the same number of negatives, are combined, and a model is trained on this combined dataset. The models created for both bounds are evaluated on the same test set to obtain their test AP values. Next to the upper and lower bound, a simple random sampling baseline is tested in addition to the query strategies mentioned in section 3, to estimate the baseline performance increase from adding additional, randomly chosen training data.

5.1.2 Results and Analysis

After completing multiple runs of the upper and lower bound setups, it was clear that the lower bound experiments produced around 0.29 test AP, while the test AP of the upper bound lies close to 0.72. This large gap means that there is good potential for active learning query strategies to achieve a significant performance increase within this set of experiments. The performance of each query strategy is shown in a graph with the same setup: the point at active learning round 0 shows the initial average performance with the unaltered training set, and the subsequent 4 points show the average AP performance after 25 new images are queried each round. The plots also show a range of one standard deviation around the mean, calculated from the individual AP values produced by the 8 different runs with that setup. For some query strategies, this graph is shown next to a 'commonly queried image'. This is an image that is typical for the query strategy in question, because it was queried during (almost) all separate runs with that strategy.

In theory, an increase in the size of the training set with randomly chosen images should also result in a performance increase, which is verified with the random sampling baseline method, discussed first. The entropy-based strategies that follow differ only in how per-box entropies are aggregated into a single score per image; a minimal sketch of this scoring is given below.
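The sketch assumes softmax class probabilities per predicted box; the convention that images without predicted boxes receive the lowest possible score is an assumption, not necessarily the choice made in section 3.

```python
import numpy as np

def box_entropy(class_probs):
    """Shannon entropy of a single box's predicted class distribution."""
    p = np.clip(class_probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def image_scores(per_image_box_entropies, method):
    """Aggregate per-box entropies into one score per pool image; the
    images with the highest scores are queried."""
    aggregate = {"maximum": np.max, "sum": np.sum,
                 "average": np.mean, "minimum": np.min}[method]
    return [aggregate(e) if len(e) > 0 else float("-inf")
            for e in per_image_box_entropies]
```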

• Random sampling

Figure 8 confirms that adding random images from the pool set to the training set increases test performance significantly. The final test performance after 4 query rounds is approximately 0.42 AP, and because the initial AP was 0.34, the average AP increase per round is 0.02. No commonly queried image is shown here, because there is no image that is typical for this strategy, as all images are chosen at random.

Figure 8: Test-set AP performance development for active learning with a random sampling query strategy. Performance does indeed increase, but the difference is not very large and there is a significant amount of variability.


• Maximum

The query strategy based on entropy with the maximum aggregation method proves to be very effective. Figure 9a shows that the final performance after 4 query rounds is approximately 0.51 AP; with a starting AP of 0.31, the average increase is 0.05 per query round, significantly more than for the random sampling baseline. The typical queried image shown in figure 9b is clear: the large group of circular objects in the center-top causes one or more object proposals with high predictive uncertainty. The aggregated score is therefore high, and this image is very likely to be queried by this method.

Figure 9: Performance development (a) and commonly queried image (b) for the maximum query strategy. The strategy proves to be very effective, with a much larger performance increase than random sampling. A typical image queried by this method contains objects that look much like cooling towers, such as the large group of circular objects in the center-top shown here.

• Sum

With a final performance of 0.6 AP and an average AP increase of around 0.07 per round, sum proves to be the best performing query strategy in these experiments. Many images such as figure 10b are queried. These contain a large number of cooling towers and some objects that look similar, which turns out to be very useful for learning.

• Average

The average aggregation method performs poorly on this task, as displayed in figure 11a. With a final performance of 0.4 AP and an average AP increase per round of under 0.02, it is slightly outperformed by the random sampling baseline, although its variability is lower. The images that are queried by this method hardly ever contain cooling towers. The reason for this is not exactly clear: apparently, predicted boxes that contain cooling towers either have relatively low predictive uncertainty or usually share an image with other boxes of low predictive uncertainty.


Figure 10: Performance development (a) and commonly queried image (b) for the sum query strategy. The sum strategy is the best performing strategy in these experiments, with a high final AP and very low variability. Queried images mostly contain a large number of cooling towers and similar objects.

Figure 11: Performance development (a) and commonly queried image (b) for the average query strategy. Performance is similar to random sampling, and therefore not good. The reason is not exactly clear, although it is notable that hardly any of the images queried by the strategy actually contain cooling towers.


• Minimum

The minimum query strategy performs similarly to the average query strategy and the random sampling baseline, with a final AP of 0.41, as can be seen in figure 12. The images queried by this method are relatively uninteresting, with few objects that look like cooling towers; even the most certain predicted boxes in these images are very uncertain. Apparently, most images with (objects similar to) cooling towers also contain predicted boxes with relatively high certainty.

Figure 12: Performance development (a) and commonly queried image (b) for the minimum query strategy. Performance is similar to the random sampling baseline, although with lower variability. The images queried by this method are relatively uninteresting, with few objects that resemble a cooling tower.

• Combination 1

Combination 1 combines the minimum and maximum strategies, but performs below the maximum strategy, as can be seen in figure 13. The commonly queried image is clearly from the set chosen by the minimum query strategy, as it is quite uninteresting. Although this strategy performs well, it is partly based on the maximum query strategy, which performs even better. Simply choosing the maximum query strategy would therefore be the better option.

• Combination 2

Figure 14a shows that the combination 2 method works well. This method uses the maximum aggregation method, but instead of taking only the images with the highest maximum entropy, it also queries a number of images with medium and low maximum entropy. Figure 14b shows an image with low maximum entropy that would never be queried by the maximum query strategy. Similar to the first combination method, this strategy performs well. However, it too is fully based on the maximum query strategy, which performs even better.


Figure 13: Performance development (a) and commonly queried image (b) for the combination 1 query strategy. The strategy performs well, but does not outperform maximum itself, making it of little practical use. The influence of the minimum strategy seems only to impact performance negatively, instead of complementing the images chosen by maximum.

Figure 14: Performance development (a) and commonly queried image (b) for the combination 2 query strategy. The strategy performs well, but does not outperform maximum itself, making it of little practical use. Adding images that contain boxes with low predictive uncertainty only impacts performance negatively.


• Distance only

The distance only query method does not perform well, as can be seen in figure 15a. This method was only intended to show that distance is a characteristic containing at least some useful information for a query method. It was therefore not expected to outperform any other methods, but that its performance is this much lower than even the random sampling baseline is unexpected. Examining the queried images makes clear why this is the case: the image in the data that is furthest removed from all training images is almost always a 'boring' image. The largest part of such an image is covered by a monotonous pattern, such as water, grassland, forest or agricultural land, as in figure 15b. This supports the idea that distance must be combined with a measure of predictive uncertainty to be of any use (a sketch of such distance-based scoring is given below figure 15). Note that since all 8 runs start with the same training set, these runs all query the exact same images.

Figure 15: Performance development (a) and commonly queried image (b) for the distance only query strategy. Performance is very poor, quite a bit worse than that of random sampling. The images that are furthest away from the training set images are apparently even less useful for training than the average randomly sampled image, as figure 15b clearly shows.

• Distance-entropy combination

The distance-entropy strategy combines distance with an aggregated entropy score. Since the sum aggregation method proved to be most effective, it is the aggregation method used in this combination. Figure 16a shows that the method works relatively well, with a final performance of 0.48 AP and an AP increase of 0.045 per round. However, this performance is worse than that of the sum query strategy in terms of AP. Specifically, this method produces fewer false positives, but also fewer true positives than the sum query strategy. A possible
