
Multi-level Alignment and Approximate Labeling in Domain Adaptation Networks

Submitted in partial fulfillment of the requirements for the degree of

Master of Science

Daniël Bartolomé Rojas

10534628

MSc Information Studies

Data Science

Faculty of Science

University of Amsterdam

2017-07-18

Internal Supervisor: Thomas Mensink (UvA, FNWI, IvI)

External Supervisor: Andrea Pagani (KNMI)


Multi-level Alignment and Approximate Labeling in Domain Adaptation Networks

Daniël Bartolomé Rojas

University of Amsterdam

Amsterdam, The Netherlands

d.bartolome.r@gmail.com

ABSTRACT

While advances in image classification with convolutional neural networks are significant, labels for supervised learning are prohibitively expensive to acquire. This highlights the importance of networks that generalize to scenarios where labels are not available. Domain adaptation, the study of creating networks that generalize across domains, has recently shown fruitful progress. With the introduction of generative adversarial networks, novel and improved techniques for domain adaptation have been introduced. In this paper we propose methods for training pixel-level alignment models and test combining pixel-level with feature-level domain adaptation networks. We find that the proposed methods improve upon existing techniques, but that combining pixel-level with feature-level models does not improve the generalization capacity of domain adaptation models.

INTRODUCTION

Neural networks have brought forward advances in many supervised machine learning tasks. The ever-growing capability to collect and process large amounts of data accommodates this. In most cases, however, the data is not – or only noisily – labeled. Unlabeled datasets fail to provide the network with feedback, hampering the capacity of supervised neural network architectures to learn.

This also applies to datasets whose distributions are similar, e.g. an image dataset of dogs and one of cats, but where one or more of these datasets lack (quality) labels. A network that has modeled the dataset in one domain will overfit the data to a certain extent. Consequently, it increasingly loses predictive capability in another domain as the shift between domains grows larger. Returning to the cats and dogs example, a model trained to detect dogs might work on cats as well, but not as well as it would on dogs. In this case, the difference between cats and dogs would be the domain shift.

An example of a machine learning task which particularly suffers from domain shift is detecting weather conditions with computer vision. Domains that would benefit from this task, such as highway and city center CCTV system imagery, are often not locations where weather measurement equipment (capable of collecting quality labels) is present. Hence, image classification algorithms trained on data obtained from locations with this equipment are generally rendered useless in other locations.

Figure 1. Visualizations of different alignment techniques. In feature alignment, we see that features are mapped such that the domain shift is reduced in a lower dimensional manifold. Using the pixel-level alignment technique, an image is transformed such that it is visually similar to the other domain.

The domain where labels are present is called the source domain. The domains to which generalization is desired (because the labels are absent or noisy) are called the target domains. Neural networks that are trained on the source domain – but generalize to target domains – eliminate the need for labels in these target domains, effectively reducing the dependence of neural networks on labeled data and uncovering even more of their potential.

Contemporary techniques for addressing domain shift can be subdivided into two approaches. The first approach entails finding/using a function that maps both domains into a lower dimensional manifold, where the shift between the source and target domain is reduced. In other words, the function finds features in the data which are comparable across domains. For this reason, in this paper we will further refer to this technique as the feature-level alignment technique.

The second approach – which is only useful for image classification algorithms – also finds a mapping function but, in contrast to the feature-level alignment technique, intends to map the images such that the domains resemble each other on pixel-level. That is, it effectively returns a new transformed image, rather than working with features on a lower dimensional manifold. This technique will be further referred to as the pixel-level alignment technique. In Figure 1, a visual example is given of alignment on both feature-level and pixel-level.

Applications in domain adaptation research have treated these techniques separately. All models that have been developed use only one of either the pixel-level or the feature-level alignment technique. However, as can be seen in the bottom figures of Figure 1, the transformed images generally do not fully resemble the images they aim to mimic. For this reason, it is hypothesized that feature-level alignment could still be beneficial after pixel-level alignment has been performed.

Hence, the aim of this paper is to evaluate the effectiveness of aligning data on feature-level after pixel-level alignment has been performed. We will call this approach multi-level alignment, for the data distributions across domains are aligned on both pixel and feature level.

Related Work

The field of study addressing the problem of domain shift is domain adaptation (DA). DA has a long-standing tradition, particularly in mathematical statistics. Though we focus on GAN applications in this paper, a brief overview of existing statistical techniques is outlined in the next section. Next, the GAN framework is discussed, after which we turn to GAN applications in DA.

Statistical Domain Adaptation

More traditional approaches to domain adaptation mostly include finding mapping functions that match the feature distributions of the source and target domains. Because the distributions across domains can be significantly different, Fernando et al. [5] use Subspace Alignment to learn more robust representations of the domains by reducing domain shift in a lower dimensional space. A similar approach is taken by Huang et al. [8], who align distributions by matching distribution means in a reproducing kernel Hilbert space using Kernel Mean Matching.

Generative Adversarial Networks

Generative adversarial networks (GAN) [7] are a type of unsupervised learning algorithm that allows generative models (denoted by φ in this paper) to learn complex probability distributions and, consequently, to generate samples from the learned distribution. In this framework, a generative model competes against a discriminative model (denoted by D) in a two-player minimax optimization game.

Using noise as input, φ generates data that D compares against the probability distribution that φ aims to mimic. The parameters of D are estimated by minimizing its predictive error, i.e. it learns to consistently distinguish generated from real data. The parameters of φ, on the other hand, are estimated by maximizing the error of D, i.e. it learns to generate data that is indistinguishable by D.
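To make the adversarial update concrete, below is a minimal sketch of one training step in PyTorch. It is an illustration under assumptions of our own – D is taken to end in a sigmoid, and all names (gan_step, noise_dim) are ours – not an implementation from the literature.

```python
# Minimal sketch of one adversarial training step (PyTorch). Assumes D ends
# in a sigmoid; all names here are illustrative, not from the paper.
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(phi, D, opt_phi, opt_D, real, noise_dim=100):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # D-step: minimize D's predictive error (real -> 1, generated -> 0).
    opt_D.zero_grad()
    fake = phi(torch.randn(batch, noise_dim))
    d_loss = bce(D(real), ones) + bce(D(fake.detach()), zeros)
    d_loss.backward()
    opt_D.step()

    # phi-step: maximize D's error, i.e. push D(fake) towards "real".
    opt_phi.zero_grad()
    g_loss = bce(D(fake), ones)
    g_loss.backward()
    opt_phi.step()
    return d_loss.item(), g_loss.item()
```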

This minimax optimization game is called adversarial training (hence the name). Training φ and D in an adversarial fashion drives these networks to eventually become very adept both at detecting counterfeit examples and, more importantly, at generating such examples.

GANs in Domain Adaptation

Although the statistical DA methods can address the problem of domain shift reasonably well, recent developments with GAN applications in DA are resulting in state-of-the-art generalization performance, particularly in the domain of image classification [3, 13, 16].

As noted earlier, GAN applications in DA have developed into two effective frameworks: pixel-level alignment and feature-level alignment. GAN generators can not only learn transformations that match the domains in feature space – as more traditional statistical techniques have aimed to achieve – but can also address domain shift reduction by matching images on pixel-level. That is, they generate new (labeled) images that resemble the images from the unlabeled domain.

One caveat of using GANs in DA applications is that class membership should be maintained while aligning the domains. This is important for both feature-level and pixel-level alignment models. To extend the typical unsupervised GAN algorithm to a supervised one, the minimax optimization game is extended by adding a source-trained classifier model (further referred to as f_s) alongside D [4]. The parameters of φ are, in this case, estimated not only by maximizing the loss of D, but also by minimizing the loss of f_s.

Feature-level alignment — Ganin et al. [6] have used adversarial training to create the Domain Adaptation Neural Network (DANN), an image classification algorithm that is only sensitive to features that exist in both domains. Consequently, the convolutional neural network (CNN) that extracts the image features maps images from both domains into the same lower dimensional manifold. Different versions of this algorithm have since been proposed, which replace its discriminator feed-forward neural network with other metrics, such as Maximum Mean Discrepancy [10, 15] or matching the mean and covariance between domain feature vectors [13].

Pixel-level: source to target — Bousmalis et al. [3] have used GANs to transform source images to resemble the target domain distributions, while maintaining the correct class label. This allows for training a classifier on a generated target domain dataset, creating a more effective classifier on the true target set at test time, as the domain shift is reduced. The benefit of this approach is that the labels of the source images are available; this way, the generator knows to maintain correct classes.

Pixel-level: target to source — Recently, however, Ash et al. [2] proposed the Approximate Label Matching (ALM) model, which transforms target images such that they resemble source distributions. Because the generator needs labels for the transformed images – and the target data labels are absent – approximate labels are introduced. The generator tries to generate correct classes by matching them to approximate labels, which are obtained as target set predictions from a source-trained classifier. These labels are acquired before training the ALM and are used throughout the entire training process. Depending on the magnitude of the domain shift, these labels will be increasingly noisy.

Figure 2. A graphical representation of the DANN model. The feature extractor sends a feature vector to both a discriminator and a classifier. During back-propagation, the gradients from the discriminator are reversed.

METHODS

To evaluate whether multi-level alignment is effective, we will use the output of a pixel-level alignment model as input for a feature-level alignment model. To enable this experiment, we will use the ALM as pixel-level model and the DANN as feature-level alignment model. These models will be described in detail in the next sections. Next, the proposed method for multi-level alignment will be covered.

Before delving into the details of the models, we will define some domain adaptation notation. Formally, the labeled source domain with $n_s$ samples is represented by $X_s = \{x_i^s, y_i^s\}_{i=0}^{n_s}$, where $x$ is the data and $y$ are the labels. The unlabeled target domain(s) with $n_t$ samples are represented by $X_t = \{x_i^t\}_{i=0}^{n_t}$. Furthermore, true class and domain labels are denoted by $y_c$ and $y_d$, respectively, and the class and domain predictions by $\hat{y}_c$ and $\hat{y}_d$.

Feature-level Alignment Model

A domain adaptation model using the feature alignment technique is the DANN [6]. The model contains a feature vector extractor (φ_f, a typical convolutional neural network without fully connected layers), on top of which a discriminator (D) and a source-trained classifier (f_s) are built (see Figure 2). Using the feature vector produced by φ_f, f_s predicts the class of the instance, whereas D predicts the domain of the instance.

The DANN achieves a domain-invariant feature extractor through the use of a gradient reversal layer (GRL). This layer multiplies the gradients originating from D during back-propagation by a negative constant −λ. Reversing the gradients allows for adversarial training without maximizing the loss of D. Hence, with the class and domain predictions

\hat{y}_c = f_s(\varphi_f(x^s; \theta_\varphi); \theta_f), \qquad \hat{y}_d = D(\varphi_f(x; \theta_\varphi); \theta_D)

respectively, the DANN optimizes

E(\theta_\varphi, \theta_f, \theta_D) = \mathcal{L}_1(\hat{y}_c, y_c) - \lambda \mathcal{L}_2(\hat{y}_d, y_d) \qquad (1)

where $\mathcal{L}_1$ is the categorical cross-entropy and $\mathcal{L}_2$ the binary cross-entropy, by finding

\hat{\theta}_\varphi, \hat{\theta}_f = \arg\min_{\theta_\varphi, \theta_f} E(\theta_\varphi, \theta_f, \hat{\theta}_D), \qquad \hat{\theta}_D = \arg\max_{\theta_D} E(\hat{\theta}_\varphi, \hat{\theta}_f, \theta_D)

resulting in a minimax optimization game between φ_f and D.

Figure 3. A graphical representation of the ALM model. The classifier is pre-trained on source data. The generator and the discriminator are trained adversarially.
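Because the GRL does all the adversarial work in the DANN, it is worth seeing how small it is. The following is a minimal PyTorch sketch of such a layer (an illustration of the idea, not the authors' code):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales gradients by -lambda on the
    backward pass, following the GRL idea of Ganin et al. [6]."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradients flowing back from D.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage sketch: features = phi_f(x); domain_pred = D(grad_reverse(features, lam))
```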

Pixel-level Alignment Model

The ALM [2] is a GAN model that uses the pixel-level technique for DA (see Figure 3). It learns a mapping function φ_p that transforms target domain images to resemble source domain images. That is, with generator φ_p, target images are transformed to resemble the source distribution: φ_p(x_t) ∼ x_s. Using adversarial training, the generator φ_p tries to fool the discriminator D into thinking that φ_p(x_t) are in fact samples from x_s. Simultaneously, it minimizes the loss for classifier f_s. Formally, with class and domain predictions

\hat{y}_c = f_s(\varphi_p(x_t)), \qquad \hat{y}_d = D(\varphi_p(x_t))

φ_p is trained by minimizing

(1 - \gamma)\,\mathcal{L}(y^\star, \hat{y}_c) + \gamma\,\mathcal{L}_2(1, \hat{y}_d) \qquad (2)

where $\mathcal{L}$ is the mean squared error between class predictions and ground-truth one-hot vectors, $\mathcal{L}_2$ the binary cross-entropy for domain predictions, and γ a parameter that balances the f_s and D losses. Minimax optimization arises because D is trained to minimize the probability it assigns to a transformed target sample being from the source domain, while φ_p learns to confuse D. As described earlier, transforming the target data carries an important implication: φ_p must be trained with approximate labels $y^\star = f_s(x_t)$. This way, the generator learns to produce correct classes by aligning its outputs to the approximate labels.
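As an illustration, the generator objective of Eq. 2 could be computed along the following lines (a PyTorch sketch; φ_p, f_s and D stand for the paper's components, while the function itself and its signature are our assumptions):

```python
import torch
import torch.nn.functional as F

def alm_generator_loss(phi_p, f_s, D, x_t, y_approx, gamma=0.2):
    """Eq. 2: (1 - gamma) * MSE(y*, y_c_hat) + gamma * BCE(1, y_d_hat).
    y_approx holds the one-hot approximate labels y* (fixed in the default
    ALM, refreshed in the iterative/selective variants)."""
    transformed = phi_p(x_t)                # target -> source-like images
    y_c_hat = f_s(transformed)              # class predictions
    y_d_hat = D(transformed)                # P(sample comes from source)
    cls = F.mse_loss(y_c_hat, y_approx)     # match the approximate labels
    dom = F.binary_cross_entropy(y_d_hat, torch.ones_like(y_d_hat))
    return (1.0 - gamma) * cls + gamma * dom
```

The value γ = 0.2 matches the regularizer listed in Appendix I.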

Multi-level Alignment

In this section we describe multi-level alignment for DA in more detail. Deep learning applications in domain adaptation research have treated the existing techniques separately: all models that have been developed use only one of either the pixel-level or the feature-level alignment technique. As can be seen in Figure 1, however, the transformed target images do not fully adapt to the source distributions. Feature-level alignment could therefore address the shortcomings of the transformed images. With this method, pixel-level and feature-level alignment models are ensembled by using the output of a pixel-level model as the input for a feature-level alignment model. See Figure 4 for a graphical representation of the ensemble used in this paper.

Figure 4. Graphical representation of a pixel-level and feature-level ensemble. In this paper we use the ALM generator (φ_p) to transform the target domain on pixel-level. The DANN block represents the architecture from Figure 2, only with different input datasets.

This is specifically effective for the DANN model, which tends to fail when the domain shift is large. In this case, D quickly learns to perfectly predict the domain of all instances. Once no loss is back-propagated from D, there are no gradients flowing that can be reversed by the gradient reversal layer. Consequently, the ability of φ_f to learn domain invariant features is hampered.

Using a pixel-level model, we can reduce the domain shift by transforming the target to resemble the source. This way, we create a source/target input that is more challenging for D, forcing more gradients to flow from D. Consequently, we increase the capacity of the DANN model.
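Conceptually the ensemble is a two-stage pipeline, sketched below; dann.fit is a hypothetical stand-in for ordinary DANN training and all other names are illustrative.

```python
# Sketch of the multi-level pipeline: pixel-level alignment first, then
# feature-level alignment on the reduced domain shift. Names illustrative.
import torch

@torch.no_grad()
def transform_target(phi_p, target_batches):
    """Stage 1: map every target batch toward the source distribution."""
    return [phi_p(x_t) for x_t in target_batches]

def multi_level(phi_p, dann, source_batches, target_batches):
    transformed = transform_target(phi_p, target_batches)
    # Stage 2: the DANN sees source images vs. transformed target images.
    dann.fit(source_batches, transformed)  # hypothetical training routine
    return dann
```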

Approximate Labeling in ALM

The default method for approximate labeling proposed by Ash et al. [2] involves acquiring approximate labels after pre-training f_s on the source dataset. These approximate labels are then used throughout the entire process of ALM training. At some point, however, the generator/classifier pair produces more accurate approximate labels than those acquired at the beginning. We therefore propose a semi-supervised and an unsupervised method for approximate labeling, in which the labels are updated once they have improved.

Iterative

Usually, target datasets are entirely unlabeled in DA problems, so an unsupervised method for improving approximate labels is desirable. This unsupervised method will be further referred to as the iterative method. It entails an approximate label update after every u training iterations, where u is a tunable hyper-parameter that can be adjusted to suit the speed at which the model trains. Since the approximate labels are updated at fixed intervals, this method can be used without the need for target set labels.
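A minimal sketch of this update loop is given below; train_step stands for one ALM optimization step on Eq. 2, and all names are illustrative.

```python
# Sketch of iterative approximate labeling: refresh the labels every u
# iterations from the current generator/classifier pair. No target labels
# are required at any point.
def train_alm_iterative(phi_p, f_s, x_t, y_approx, train_step,
                        u=100, iterations=10000):
    for j in range(1, iterations + 1):
        train_step(phi_p, f_s, x_t, y_approx)    # one ALM update (Eq. 2)
        if j % u == 0:
            y_approx = f_s(phi_p(x_t)).detach()  # unsupervised refresh
    return y_approx
```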

Selective

Next, we describe an improved method for approximate labeling for when a small validation set of target image labels is available. This semi-supervised method will be referred to as the selective method.

The selective method entails updating the approximate labels once the generator/classifier pair produces more accurate approximate labels than those acquired at the beginning. To know when these labels are more accurate, they are compared against the validation set.

Formally, let $y_j$ be the target set predictions at the $j$-th of $m$ training iterations,

y_j = f_s(\varphi_p(x_t))

and let $f(\cdot)$ denote classification accuracy; the selected approximate labels are then

y^\star = \arg\max_{y_j} \{\, f(y_j, y_t) : j = 1, \dots, m \,\}

where $y_t$ are the true labels in a target validation set of size $n$.

Both methods ensure the approximate labels are iteratively improved as ALM training progresses. This increases the capacity of the generator to produce more realistic transformations by reducing the classifier loss.
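A sketch of the selective variant follows; the accuracy and train_step callables are assumed helpers and the loop structure is ours, but the acceptance rule mirrors the arg max above.

```python
# Sketch of selective approximate labeling: keep new labels only when the
# current generator/classifier beats the best validation accuracy so far.
def train_alm_selective(phi_p, f_s, x_t, x_val, y_val, train_step,
                        accuracy, iterations=10000):
    y_approx = f_s(x_t).detach()                  # default initialization
    best = accuracy(f_s(phi_p(x_val)), y_val)
    for j in range(iterations):
        train_step(phi_p, f_s, x_t, y_approx)     # one ALM update (Eq. 2)
        acc = accuracy(f_s(phi_p(x_val)), y_val)  # check on validation set
        if acc > best:                            # labels have improved
            best = acc
            y_approx = f_s(phi_p(x_t)).detach()
    return y_approx
```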

EXPERIMENTS

In this section we describe the experiments devised to test the proposed methods, along with a description of the data used to test them. Before we cover the main experiments devised to answer the research questions, some pre-experiments are executed. These include baseline performances of several supervised and unsupervised algorithms, along with an exploration of the effect of different image input sizes. Next, the proposed approximate labeling methods are tested using the ALM model. In the main experiment, the effectiveness of multi-level alignment for the improvement of generalization capacity is evaluated.

To measure generalization performance across domains, we measure the classification accuracy of a source-trained classifier on the target test-set, as well as the classification accuracy of a target-trained classifier on the target test-set, respectively referred to as the source-only and target-only methods. We then compare the target test-set accuracy obtained with the experimental methods to the source-only and target-only methods to show whether we improve the generalization performance. Ultimately, the goal in DA is to equal – or even exceed – the target-only performance.

Network architectures and other hyper-parameters are listed in Appendix I, and additional notes and suggestions on training the networks can be found in Appendix II. The code that has been used to perform these experiments is publicly available on GitHub.1

Experiment 0

The pre-experiment serves as a preparation for the main experiments. First, we show that the ALM as pixel-level model does not account for complete alignment of the domains. We do so by showing t-SNE [11] embedding visualizations of source and target images before and after pixel-level alignment. Next, we evaluate the source-only and target-only performance of several existing supervised algorithms. An Inception-v3 [14] and a VGG19 [12] serve as CNN baselines, whereas a k-NN algorithm serves as a non-CNN baseline. The CNNs are pre-trained on ImageNet to maximize classification performance while minimizing the runtime. These statistics are then used to put the performance of the DA models into perspective.

Figure 5. First 5 instances from the MNIST dataset (top row) and the MNIST-M dataset (bottom row).

With this baseline experiment, the effect of different image input sizes is evaluated as well. By design, the Inception-v3 and VGG19 take input images of size 299 × 299 and 224 × 224, respectively. To compare these models fairly to the k-NN and the smaller CNNs used in the DANN and ALM, we test input image sizes of both 299 × 299 and 28 × 28. This experiment mostly serves to uncover the effect of smaller images in terms of classification accuracy, but also in terms of computational complexity.
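For reference, the kind of t-SNE visualization used in the pre-experiment can be produced with scikit-learn along these lines (feature extraction is assumed to have happened beforehand; all names are ours):

```python
# Sketch: joint t-SNE [11] embedding of source and (transformed) target
# feature vectors, to be scattered and colored by domain.
import numpy as np
from sklearn.manifold import TSNE

def tsne_embed(source_feats: np.ndarray, target_feats: np.ndarray):
    feats = np.concatenate([source_feats, target_feats], axis=0)
    emb = TSNE(n_components=2, perplexity=30).fit_transform(feats)
    n_s = len(source_feats)
    # First n_s rows belong to the source domain, the rest to the target.
    return emb[:n_s], emb[n_s:]
```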

Experiment 1

The first experiment entails the comparison of different approximate labeling methods in ALM models. The default method for approximate labeling will be compared against the semi-supervised and unsupervised methods of approximate labeling.

The unsupervised iterative method for approximate labeling requires the labels to be updated every u iterations. Therefore, model performance is evaluated with labels being updated every 50, 100 and 200 iterations.

The semi-supervised selective approximate labeling method requires a validation set of labeled target instances to assess whether the approximate labels improve. Because obtaining labels for target instances can be hard or expensive, we test the model performance for validation set sizes of 50, 250 and 1000, and evaluate the effect of the different sizes by comparing classification accuracy on the target test-set. We can conclude which approach to approximate labeling works best by comparing the classification accuracies and the losses produced by f_s.

Figure 6. Example images from KNMI dataset. Top row images are captured in Cabauw and bottom row images are captured in De Bilt.

Experiment 2

In the main experiment we will test whether multi-level alignment improves the generalization capacity compared to segregated feature-level and pixel-level alignment. In this experiment, the target images transformed by the ALM generator, φ_p(x_t), will be used as the target image input for the DANN model. The best approximate labeling method according to the first experiment is used with the ALM model.

The results of the main experiment are evaluated by comparing the classification accuracy of the multi-level approach to both segregated approaches. We can conclude that multi-level alignment is effective if it achieves higher classification accuracy on the target test-set than the pixel-level or feature-level alignment methods independently.

Data

The proposed methods will be evaluated on two domain adaptation tasks for image classification: a digit classification task and a fog detection task. In this section the datasets are described, as well as the pre-processing steps performed before the experiments.

Digit classification

In the digit classification task, MNIST (LeCun et al. [9]) is used as the source dataset and MNIST-M (Ganin et al. [6]) as the target dataset (see Figure 5). MNIST is a database constructed from handwritten digits. The training and test sets consist of 60000 and 10000 instances, respectively. Being 28 × 28 greyscale pixels, the images have 784 dimensions.

MNIST-M is a variation on MNIST purposely created for domain adaptation research. It is obtained by using the digit as a mask, and applying a random patch from BSDS500 images (Arbelaez et al. [1]) over the original MNIST image, inverting the colors inside the digit mask. This results in a 28 × 28 × 3 RGB variant of MNIST with random color patches.

To address the dimension mismatch caused by MNIST being greyscale and MNIST-M being RGB, MNIST images are concatenated depthwise three times to match the 28 × 28 × 3 MNIST-M dimensions. Furthermore, they are normalized between -1 and 1.
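This pre-processing amounts to the following sketch (NumPy; the function name is ours):

```python
# Sketch of the MNIST pre-processing: replicate the greyscale channel three
# times and scale pixel values from [0, 255] to [-1, 1].
import numpy as np

def preprocess_mnist(images: np.ndarray) -> np.ndarray:
    """images: uint8 array (n, 28, 28) -> float32 array (n, 28, 28, 3)."""
    rgb = np.repeat(images[..., np.newaxis], 3, axis=-1)  # depthwise concat
    return rgb.astype(np.float32) / 127.5 - 1.0
```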

Fog detection

As mentioned earlier, weather condition detection using image classification is a task that is inherently prone to suffer from domain shift. To address this problem, the Royal Dutch Meteorological Institute (KNMI) has built an image dataset for fog detection. It consists of two domains: (1) images captured at the De Bilt weather station and (2) images captured at the Cabauw weather station (see Figure 6). Statistics on dataset size, class balance and train/test split are outlined in Table 1. The images are labeled with the Meteorological Optical Range (MOR). MOR is a metric for visibility, used to measure the density of fog. For the purpose of image classification, the MOR values have been binned into 4 classes (MOR range between parentheses): heavy fog (0 to 249), moderate fog (250 to 999), light fog (1000 to 2999) and none (from 3000). Originally, the KNMI images are 640 × 480 × 3 pixels, but they can be scaled arbitrarily depending on the results of the pre-experiment. As with the MNIST and MNIST-M datasets, the KNMI images are normalized between -1 and 1.

Figure 7. t-SNE feature embeddings of source and target before (left) and after (right) transforming target images with the ALM generator (φ_p). Red items are target or transformed target instances and blue items are source instances. Numbers denote class membership.
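The MOR binning described above corresponds to this small sketch (the function name and the assumption that MOR arrives as a plain number are ours):

```python
# Sketch: bin a Meteorological Optical Range (MOR) value into the 4 fog
# classes used for the KNMI dataset; thresholds taken from the text.
def mor_to_class(mor: float) -> str:
    if mor < 250:
        return "heavy fog"     # 0 to 249
    if mor < 1000:
        return "moderate fog"  # 250 to 999
    if mor < 3000:
        return "light fog"     # 1000 to 2999
    return "none"              # 3000 and up
```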

RESULTS AND ANALYSIS

In this section we turn to the evaluation and analysis of the experimental results. The results of the pre-experiment are discussed first, followed by the results of the first experiment, after which we turn to the results of the main experiment.

Experiment 0

The first part of the pre-experiment aims to show that pixel-level transformations usually do not fully account for the domain shift in the dataset. As can be seen from Figure 7, before applying the pixel-level transformation, there is considerable domain shift. This is particularly true for the digit classification task, but to a certain extent also observable in the fog detection task. After applying the transformation, some domain shift is accounted for. However, it is clear that the blobs of embeddings do not fully cover each other. This indicates that there is still domain shift unaccounted for, motivating the use of feature-alignment models on the transformed data.

Next, the baseline models are evaluated on the effect of image size on generalization capacity, in terms of classification accuracy, and on computational complexity, in terms of time to converge the model. The results are listed in Table 2. It can be concluded that the k-NN has good target-only performance but is not useful for generalization. The convolutional neural networks perform better with regard to generalization capacity, in particular the deep Inception-v3 and VGG19 networks. One benefit of the shallow CNN is that it is considerably faster to train. Interestingly, reducing the input size of the images benefits generalization capacity, as the source-only performance increases. This is fortunate, as the network also converges about 10 times faster with smaller images. This has led us to use downscaled fog detection images in our final experiments.

           Heavy       Moderate    Light       None         Total
Cabauw     334 (111)   440 (162)   722 (247)   775 (256)    2271 (776)
De Bilt    220 (67)    311 (95)    767 (241)   1219 (410)   2517 (813)

Table 1. KNMI image dataset metadata. Along with the total dataset size, the number of images per class is outlined as well. The numbers without parentheses are the number of train-set instances, whereas the numbers within parentheses are the number of test-set instances.

Model            Shallow CNN     k-NN            Inception-v3   VGG19
Image width      28      299     28      299     299            224
Source-only      17.9    11.9    13.3    8.9     48.7           36.8
Target-only      65.4    62.2    69.8    78.5    82.5           79.2
Time to train    6       97      0       0       6235           9384

Table 2. Classification accuracy in % and time to train the models to convergence in seconds. The 4-class fog detection task was used for this experiment. For the shallow CNN, the architecture of the DANN generator and classifier was used (see Appendix I for details).


Figure 8. Target validation set accuracy (top row) and ALM classifier ( fs) loss (bottom row) for both datasets. Left plots are MNIST to MNIST-M results and the right plots are Cabauw to De Bilt results. The plots have been smoothed with a Savitzky-Golay filter (Savitzky & Golay, 1964) for clarity. Original results are shaded.


Experiment 1

In the first experiment we compare the different approximate labeling methods. The proposed methods require hyper-parameter tuning, however, including the validation set size in the semi-supervised method and the update rate in the unsupervised method. The optimal results are then used to compare the proposed methods to the default approximate labeling method.

As can be seen from Table 3, the highest accuracy for MNIST to MNIST-M was achieved using a validation set size of 1000. On the KNMI dataset, the optimal validation set size was 50. A representative validation set should give the algorithm a reasonable indication of when it is truly improving the approximate labels. Though the probability that representative instances of each class exist in the validation set depends on the number of classes, a larger validation set should generally be better, as it is more likely to be representative of the entire dataset. It is interesting to see, though, that good results can also be achieved with a smaller validation set.

Looking at the results for the update rate in the iterative approximate labeling method, we see that – with both datasets – an approximate label update every 100 iterations yields the best results (see Table 4).

n        MNIST ⇒ MNIST-M    Cabauw ⇒ De Bilt
50       75.9               42.9
250      77.7               32.3
1000     78.7               37.9

Table 3. Target test-set classification accuracy (%) for different validation set sizes (n).

Update rate    MNIST ⇒ MNIST-M    Cabauw ⇒ De Bilt
50             75.2               32.9
100            76.1               37.9
200            74                 28

Table 4. Target test-set classification accuracy (%) with different update rates for iterative approximate labeling.

Method         MNIST ⇒ MNIST-M    Cabauw ⇒ De Bilt
Source-only    50.7               18.4
Default        72.4               29.8
Iterative      76.1               37.9
Selective      78.7               42.9
Target-only    93.8               66.3

Table 5. Target test-set classification accuracy (%) obtained with the semi-supervised selective, unsupervised iterative and default approximate labeling methods.


Model          MNIST ⇒ MNIST-M    Cabauw ⇒ De Bilt
Source-only    50.7               18.4
Feature-level  52.6               25.1
Pixel-level    78.7               42.9
Multi-level    77.5               33.1
Target-only    93.8               66.3

Table 6. Target test-set classification accuracy (%) for segregated pixel-level (ALM), feature-level (DANN) and multi-level (ALM+DANN) domain adaptation models. Results for the ALM models have been obtained with the selective approximate labeling method.


Taking the results of the hyper-parameter tuning into account, we see that the proposed methods improve over the ALM performance with the default approximate labeling method (see Table 5). The results show that the selective method outperforms every other approximate labeling method. Interestingly, the iterative method is still useful, as – being an unsupervised method – it performs only slightly worse than the semi-supervised method.

A more detailed look into the training procedure shows that the improvement in target set accuracy corresponds to a decrease in ALM classifier loss during training (see Figure 8). Both the selective and iterative methods significantly lower the classifier loss compared to the default ALM approximate labeling method.

Experiment 2

In the second experiment the effect of combining pixel-level models with feature-level alignment models is evaluated. The generated images from the ALM model are used as the target set in the DANN model. As can be seen in Figure 9, the ALM generator has mapped the target images to resemble the distribution of the source domain. This becomes especially apparent from the t-SNE feature embeddings in Figure 7.

The results of the experiment are found in Table 6. They indicate that the ALM model is the best performing domain adaptation model. However, even when using its generated output in the DANN model for feature-level alignment, no gains were made. In fact, performance suffers from multi-level domain adaptation on these datasets.

CONCLUSION

The aim of this paper was, firstly, to evaluate the effect of multi-level alignment in domain adaptation with generative adversarial networks and, secondly, to propose different approximate labeling techniques for target-to-source pixel-level domain adaptation networks.

We found that the selective and iterative approximate labeling methods yield increased generalization capacity compared to the default method of approximate labeling. They do so by significantly lowering the classifier loss, in turn giving the generator more accurate feedback.

Figure 9. Examples of transformed target domain images. For both datasets, the bottom-row images are transformed top-row images, such that they resemble the middle-row images.

Moreover, it has been found that – provided the target validation set (against which the approximate labels are compared) is representative – only a small validation set is necessary for selective approximate labeling to be effective. This is desirable when labels are hard or expensive to acquire. On the other hand, the iterative approach is a worthwhile solution when labels cannot be acquired at all.

Although the proposed approximate labeling techniques were able to improve over the default ALM performance, established deep CNN architectures – in particular the Inception-v3 – have an even higher generalization capacity. These networks were not originally built for generalization across domains, so apart from their intrinsic generalization capacity, few gains could be made in this regard. However, it is possible to equip domain adaptation networks with them, and it would be interesting to see how pixel-level or feature-level models could benefit. Hence, a suggestion for further research is to evaluate the effect of adding a fine-tuned deep CNN as the classifier model in domain adaptation networks.

Furthermore, the results show that multi-level alignment with the ALM as pixel-level and the DANN as feature-level model does not lead to improved generalization capacity. There are, however, many more pixel-level and feature-level DA models, and it would be interesting to see whether this conclusion holds for other DA models and datasets as well.

All in all, in this paper we have proposed novel techniques for approximate labeling in pixel-level DA models that transform target domain images. Both the semi-supervised and the unsupervised technique improve the generalization capacity of ALM models compared to the default approximate labeling method. Moreover, the combination of pixel-level and feature-level domain adaptation networks does not lead to improved generalization capacity.

ACKNOWLEDGEMENTS

I am happy to thank everyone who made this thesis possible. First of all, I am grateful to my internal supervisor, Assistant Professor Thomas Mensink, PhD, for his indispensable insights and guidance. Secondly, I would like to express my gratitude to my external supervisors at the KNMI, Martin Roth, PhD and Andrea Pagani, PhD, for letting me integrate my thesis in their project, but also for offering their continuous support and expertise. Finally, I would like to thank my peers and family for motivating me to push myself further than I could have ever done alone.

REFERENCES

1. Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.

2. Jordan T. Ash and Robert E. Schapire. Multi-source domain adaptation using approximate label matching. arXiv preprint arXiv:1602.04889, 2016.

3. Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. arXiv preprint arXiv:1612.05424, 2016.

4. Emily L. Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486–1494, 2015.

5. Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE International Conference on Computer Vision, pages 2960–2967, 2013.

6. Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.

7. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

8. Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, and Alex J. Smola. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, pages 601–608, 2007.

9. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

10. Mingsheng Long, Jianmin Wang, and Michael I. Jordan. Deep transfer learning with joint adaptation networks. arXiv preprint arXiv:1605.06636, 2016.

11. Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

12. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

13. Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. arXiv preprint arXiv:1511.05547, 2015.

14. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

15. Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4068–4076, 2015.

16. Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. arXiv preprint arXiv:1702.05464, 2017.


APPENDIX I

In this section the architectures of the DANN and ALM models are outlined, along with the hyper-parameters used in the experiments.

ALM

1. Discriminator/classifier loss regularizer: γ = 0.2
2. Learning rates:
   (a) Classifier: 1e-3
   (b) Discriminator: 5e-5
   (c) Generator: 5e-5
3. Batch size: 50

DANN

1. GRL λ schedule from [6]: $\lambda = \frac{2}{1 + \exp(-10p)} - 1$, where $p$ is the fraction of training completed.
2. Learning rate: 1e-4
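For reference, the schedule can be written as a one-line function (a sketch; the name is ours):

```python
import math

def grl_lambda(p: float) -> float:
    """GRL lambda schedule from [6]; p is the fraction of training completed."""
    return 2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0  # ramps from 0 to ~1
```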


APPENDIX II

In this section we describe some suggestions for training GAN models that we found useful while tuning them. To maximally improve the quality of the generated images, the generator and the discriminator need to be carefully balanced in terms of network capacity. Imbalance leads to one model overpowering the other, resulting in a scenario where nothing is learned. Generally, imbalance can lead to three observable behaviours.

First, mode collapse causes the generator to create only images that can fool the discriminator, but that do not necessarily resemble the actual images it is trying to mimic. This problem is addressed in some papers by using different types of loss functions and is a very active field of research.

Another type of imbalance is the discriminator loss not converging correctly. In that case, the loss either rises unboundedly or drops to 0. Since the discriminator loss is maximized, maximum entropy in a 2-class classification problem corresponds to a binary cross-entropy loss of −log(0.5) ≈ 0.693. Hence, a discriminator should ideally converge around this value. The width and depth of both the generator and the discriminator should be tuned to achieve correct convergence.

Figure 10. Examples of discriminator losses exhibiting (dis)balance. In the left plot, the loss rises unboundedly. In the right plot, the loss converges around an acceptable value.

In terms of architectural design, we found that high dropout, batch normalization and leaky ReLU were key to improving the performance of the GAN model. Another important factor was using a very small learning rate for both the generator and the discriminator; in our experiments we used a learning rate of 5e-5.
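To illustrate, a discriminator block following these suggestions might look as below in PyTorch; the layer sizes are assumptions, not the architecture used for this thesis.

```python
# Illustrative discriminator reflecting the suggestions above: batch
# normalization, leaky ReLU, high dropout and a very small learning rate.
import torch.nn as nn
import torch.optim as optim

discriminator = nn.Sequential(
    nn.Linear(784, 256),      # input size is an assumption
    nn.BatchNorm1d(256),
    nn.LeakyReLU(0.2),
    nn.Dropout(0.5),          # high dropout helps balance the minimax game
    nn.Linear(256, 1),
    nn.Sigmoid(),             # outputs P(sample is real)
)
opt = optim.Adam(discriminator.parameters(), lr=5e-5)  # small learning rate
```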
