Bridging the Shifting Distribution Gap: Domain Adaptation for Semantic Segmentation and Visual Data Streams

Master Thesis

by

Sindi Shkodrani

11128348

May 22, 2018

36 ECTS
September 2017 - May 2018

Supervisor: Dr. E. Gavves
Daily supervisor: Dr. M. Hofmann
Assessor: Prof. Dr. C.G.M. Snoek

The focus of this thesis work is visual domain adaptation which is robust to domain shifts and differences in data distribution statistics between potential source and target domains across different dataset types and tasks. Associative domain adaptation is reformulated to work well in realistic scenarios and applications where source and target domains cannot be guaranteed to have similar class distribution statistics.

In addition, modern deep learning applications require methods that scale well with the amount of supervised and unsupervised data and are able to transfer the knowledge learned from previous datasets. This work revisits static domain adaptation in the context of domain shift that arises in open, dynamic data sources, such as image data streams in the wild. Traditional domain adaptation is usually applied to two static domains, while in practice the data is usually not all available at once. This thesis develops a framework that redefines static domain adaptation in a dynamic context, which can be treated sequentially, similarly to a streaming application.

Results are reported on several domain adaptation benchmark datasets for classification. Another application with increasing interest in robust domain adaptation techniques is semantic segmentation, and in this work domain adaptation for semantic segmentation with associative learning is developed. Finally, a framework for adaptation over distribution shifts that change in time is introduced, and extensive experiments are reported that indicate how this framework can be used to adapt to unsupervised data bundles arriving later in training in a streaming-like fashion.

The biggest thanks go to my supervisors. Because of them I had a very thorough thesis experience and learned a lot in the process. To Michael, for being involved and always finding time to discuss and put effort in this work despite his busy schedule as a manager. To Stratis, for his dedication and intuition on directing work in the right manner, by giving the proper advice on how to tackle things.

My gratitude goes out to them and Cees Snoek for agreeing to be in the committee and to read the final thesis on short notice.

Many thanks to my colleagues at TomTom who were always helpful and approachable for questions during my time as a master thesis intern.

Finally, I’d like to thank my family and friends for their support and understanding of my lack of presence recently. Special thanks to my researcher friends who are always motivating me with their dedication, to Peter for the constant patience and support and to Brian for relocating to bring a bit of home closer these months.

1 Introduction
  1.1 Motivation
  1.2 Overview and Contributions

2 Preliminaries and Related Research
  2.1 Domain Adaptation Overview
    2.1.1 Introduction
      Notation
      The domain shift problem
      Categories and lines of research
    2.1.2 Shallow Approaches
      Instance reweighting methods
      Parameter adaptation methods
      Feature augmentation methods
      Feature space alignment methods
      Feature transformation methods
    2.1.3 Deep Approaches
      Discrepancy based adaptation
      Adversarial domain adaptation
      Data reconstruction-based methods
    2.1.4 Domain Adaptation for Semantic Segmentation
      Adversarial discriminative methods
      Adversarial generative methods
      Other methods
  2.2 Learning Models for Streaming Data
    2.2.1 Introduction
      The concept drift problem
      Considerations
    2.2.2 Algorithms and methods
  2.3 Discussion

3 Method
  3.1 Domain Adaptation with Associative Learning
    3.1.1 Associative domain adaptation
      From maximum mean discrepancy to learning by associations
      Definition
    3.1.2 Relaxing the class distribution assumption
      Intuition
      Estimating visit loss weights
  3.2 Associative Domain Adaptation for Semantic Segmentation
    3.2.1 Fully convolutional networks for semantic segmentation
    3.2.2 Embeddings in semantic segmentation networks
      Visualizing behavior
      The probabilistic interpretation of distances and importance of numerical stability
    3.2.3 Adapting segmentation models
      The designated embedding layer
      Handling memory constraints
      Adaptation
  3.3 Associative Adaptation for Streaming Data
    3.3.1 From domain adaptation to sequential adaptation for streaming
    3.3.2 Building a framework for adapting classifiers in time
    3.3.3 Adapting for streaming data

4 Experiments and Results
  4.1 Robust Associative Domain Adaptation for Image Classification
    4.1.1 Datasets and adaptation benchmarks
    4.1.2 Balancing distribution differences with weighted visit loss
    4.1.3 Understanding embedding associations: The effect of the metric and normalization method
  4.2 Domain Adaptation for Semantic Segmentation
    4.2.1 Datasets
    4.2.2 Towards semantic segmentation with patchwise classification
      The patchwise classification dataset
      Adaptation results
    4.2.3 Adapting Semantic Segmentation Models
      Adaptation results
  4.3 Adapting Models for Streaming Data
    4.3.1 Dataset and setup
    4.3.2 Adaptation results

5 Summary and Conclusions
  5.1 Summary
  5.2 Conclusions
  5.3 Future Work

1.1 Motivation

There are things known and there are things unknown, and in between are the doors of perception.

Aldous Huxley

Recent improvements in deep architectures, in combination with the increasing availability of large annotated image datasets, have made deep learning methods more accurate and more useful. However, labels are costly to obtain, especially for dense prediction tasks where every pixel in an image needs to be labelled. As an example, accurate pixel-level annotation of the Cityscapes dataset [10] for semantic segmentation, which consists of images from urban areas, took more than 1.5 hours per image. Developing methods that exploit the availability of unsupervised data is crucial to the scalability and applicability of deep learning advancements.

In addition, acquisition of a finite amount of labeled images results in an inevitable dataset bias, which inhibits learned models from generalizing well to data collected under different conditions. These biases can stem from small shifts such as lighting or pose as well as from large appearance or shape differences between objects. Robust approaches for transferring knowledge learned in a supervised manner to previously unseen data are therefore a necessity. Domain adaptation is the subset of transfer learning that is concerned with transferring knowledge learned on a supervised source set to a target set where no annotations are available [11]. These sets share the same label space, but due to the dataset bias the data distribution differs between source and target sets, causing models learned on the source to fail on the target. This is known as the domain shift problem. Due to this shift, tailored methods have to be developed that exploit supervised and unsupervised data to train models that work well for both. Shallow methods dominated the field until the last few years, when the first deep approaches succeeded in adapting models for simple classification tasks. With vast amounts of unlabeled data and various domain shifts, it is necessary to develop algorithms that generalize well across different adaptation tasks. Capturing representations that are consistent across domains and applicable across tasks with a robust domain adaptation method is at the forefront of this research.

A particularly interesting task for domain adaptation that has only recently been brought to the attention of researchers is semantic segmentation. Highly accurate semantic segmentation methods are becoming increasingly important with the advancements towards autonomous driving, where semantic understanding of images is necessary not only for mapping of roads but also for per-pixel analysis of images collected from in-car cameras. While algorithmic advancements are being made, dependency on annotated datasets remains a bottleneck.

Deep domain adaptation methods for dense prediction tasks were almost non-existent until Hoffman et al. [24] pioneered adaptation for semantic segmentation. Due to the higher complexity compared to classification, as well as the deeper architectures and larger datasets used, adapting for dense prediction is considered a harder task. Recent advancements in adversarial methods and generative models have found application in this direction [62].

A particularly important finding for the semantic segmentation task is that adaptation can be achieved from models trained on synthetic data rendered from computer game engines to real-world images. Synthetic data comes at low cost and with highly accurate labels, so methods that exploit the synthetic domain properly can facilitate improvements in real domains immensely. Figure 1.1 illustrates the idea of adapting from synthetically generated data to real images. Synthetic to real adaptation for semantic segmentation is part of the focus of this work.

Figure 1.1: Adapting from synthetic to real domain images. A shift in object appearance can be observed across domains.

Besides cross-task applicability, taking a look at how domain adaptation is addressed reveals meaningful insights. Interestingly, to date domain adaptation has been addressed mostly in the context of static datasets. A domain is defined over the marginal and conditional distributions of a set of data with respect to its labels [56]. It is usually represented by a finite dataset, which is likely to contain a sample selection bias. However, in modern applications the data from a defined domain is usually not available all at once. If we consider image data collected for training models for self-driving cars, for instance, the data collected from different cities comes in at different times in "bundles" or domain-specific sets. This data may have a similar distribution to the previously collected samples, but usually suffers from sample selection bias. Nor can we consider the incoming data as coming from an entirely different domain, as the difference in distribution depends on the sample selection bias. This dynamic setting requires flexible approaches that can be easily adapted to new data [17]. As we move from "closed" and static dataset sources to "open" and dynamic dataset sources, the non-stationary dataset statistics become a function of time.

Similarly, this happens in a more structured way in streaming data applications. Streaming data comes in real time or accumulated in bundles, and labeling it takes longer than the data takes to arrive. Think for instance of data incoming from social media or video streams in the wild, where the frames observed may change drastically over time. Fig. 1.2 shows images from the GTA5 synthetic dataset [43], which was collected by rendering video game play. Looking at this dataset sequentially as it was collected, we can observe visual differences across sequences which may lead classifiers to perform worse if the appearance of this data is considerably different from the dataset used for training. In streaming applications the distribution shift of data over time is called concept drift. In addition, the data is not assumed to be independent and identically distributed. Instead, we might have bundles of data representing very small parts of the distribution, which is again the sample selection bias. Annotations for the abundant data are not cheap and they are usually obtained more slowly than the data arrives.

Figure 1.2: Observing changing data distribution over different timesteps in dynamic data. As the GTA5 dataset was collected during video game play, changing appearance of the scenes over time can be observed in the dataset.

Due to the distribution dynamics and further considerations, such as the memory limitations for storing the vast amount of incoming data, dealing with streaming data calls for frameworks and methods that differ from traditional tasks. However, including current tasks that deal with non-stationary distribution environments in a single framework, in line with modern application requirements, lays the groundwork for dealing with visual data at large in the wild. This work attempts to develop a framework for streaming that can be applied across tasks to improve classifiers in an unsupervised manner, extending to better adaptive learning over time.

1.2 Overview and Contributions

In line with the motivation discussed above, this work proposes an adaptation framework that generalizes domain adaptation to a dynamic context and is applied across tasks. A robust domain adaptation method is explored and developed to work well for scenarios of evolving distribution statistics over time and for tasks where complex class distribution statistics are observed across sets.

The work is structured as follows. In Section 2 some preliminaries of the introduced topics as well as an overview of related research are presented. We go into particular depth on deep visual domain adaptation for classification and dense prediction tasks and discuss how this leads to our decisions on the method.

In Section 3 we discuss the approach. Using as a base method associative learning [19], an adaptation technique that uses association of embeddings in latent space between supervised and unsupervised samples, we reformulate this to work well and generalize in scenarios where class distribution statistics between source and target are dissimilar. This enables the applicability of this method to tasks such as semantic segmentation and streaming data classification, as further presented in this section.

Next, Section 4 reports extensive experimental results across several adaptation benchmarks for classification and semantic segmentation. Further experiments are reported to demonstrate classifier performance improvements within the proposed streaming framework. A summary, detailed analysis and discussion of the findings, together with potential future directions of this research, are given in Section 5.

The contributions of this work are as follows:

• A broad review of research in domain adaptation including existing adaptation methods for semantic segmentation

• An overview of machine learning for streaming data and discussion on differences with static data methods

• A formulation of the associative learning method for domain adaptation that relaxes previous assumptions on source and target class distributions

• Considerations for applying associative domain adaptation to the semantic segmentation task

• A novel framework for adaptation of dynamic distribution shifts that allows exploitation of unsupervised data to improve classifiers over time

• Empirical evidence of the success of the reviewed approach in multiple domain adaptation benchmarks

• An in-depth analysis and discussion of extensibility and applicability of the above to dynamic or streaming data in the wild

2.1 Domain Adaptation Overview

2.1.1 Introduction

Notation

Unsupervised domain adaptation, often referred to simply as domain adaptation, is considered a subset of transductive transfer learning in which the tasks for the source and target datasets are the same, but the distributions differ due to dataset bias and distribution mismatch [11, 62].

More formally, we want to adapt a model learned from a labeled source dataset, consisting of source features $X_s$ and a marginal probability distribution $P(X_s)$ over the data, written $D_s = \{X_s, P(X_s)\}$, to an unlabeled target dataset $D_t = \{X_t, P(X_t)\}$. The source and target data distributions are different, so $P(X_s) \neq P(X_t)$ due to the dataset bias. The label space of the source and target datasets is the same, so $Y_s = Y_t$. While source data is paired with its annotations as $(X_s, Y_s)$, only the data $X_t$ is available for the target.

In some cases a small set of labels is available in the target dataset, in which case semi-supervised domain adaptation approaches are used. However, in most cases by domain adaptation the unsupervised version is assumed.

Further, there are two approaches to unsupervised domain adaptation. First, conservative domain adaptation assumes that the same classifier can be used for both source and target datasets, i.e. that the source classifier space contains a classifier that performs well on both. In the non-conservative domain adaptation case, this assumption is not made and the classifier differs from source to target [46].

The domain shift problem

A domain is defined over marginal and conditional distribution of a set of data with respect to its labels [56]. A domain is usually represented by a finite dataset, which is likely to contain a sample selection bias. This distribution shift or domain shift between datasets causes classifiers learned in one domain to fail when applied to another domain of the same categories. This concept is illustrated in Figure 2.1.

There is a wide range of what can be considered a domain shift, including differences due to the acquisition of data, e.g. lighting conditions and point of view in images. More complex domain shifts can be caused by intra-class variability and category biases between datasets. Sometimes research methods tackle small and large domain shifts differently [24].

In addition, homogeneous domain adaptation refers to cases where the source and target feature spaces are the same ($X_s = X_t$), while in heterogeneous domain adaptation features in source and target can have different representation spaces as well as different modalities ($X_s \neq X_t$) [11].

Figure 2.1: A domain shift due to sample selection bias causes classifiers trained on source to fail when applied to target domain data. Domain adaptation aims to correct classifiers for the shift. Image from [45].

Categories and lines of research

Initial domain adaptation attempts were shallow models, either homogeneous or heterogeneous, with the classifier assumption being conservative or non-conservative as explained above.

Later, a line of research combined features extracted from deep models with shallow classifiers to obtain higher domain adaptation performance than the previous shallow approaches, before the behavior of domains in deep models was well understood.

Last, deep domain adaptation methods started to emerge. These are categorized into discrepancy based, adversarial and reconstruction based methods [11, 62, 42].

2.1.2 Shallow Approaches

Early domain adaptation approaches relied on data statistics transformation and mappings or feature augmentation to apply domain adaptation.

Instance reweighting methods weight each instance by estimating, for example, the ratio between the likelihoods of it being a source or a target sample. These can be estimated independently with a classifier that labels samples as source or target. Many approaches use Maximum Mean Discrepancy between domain distributions [11]. Instance reweighting is illustrated in Figure 2.2.

Figure 2.2: Instance reweighting illustrated. Image from [33]
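As a rough illustration of instance reweighting, the sketch below estimates importance weights for source samples from a domain classifier. The plain logistic regression trained by gradient descent, the function names and the toy Gaussian data are all assumptions made for this example and are not taken from any of the cited methods.

```python
import numpy as np

def fit_domain_classifier(xs, xt, lr=0.1, steps=2000):
    """Logistic regression separating source (label 0) from target (label 1)."""
    x = np.vstack([xs, xt])
    y = np.concatenate([np.zeros(len(xs)), np.ones(len(xt))])
    x = np.hstack([x, np.ones((len(x), 1))])  # append a bias column
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-x @ w))      # predicted P(target | x)
        w -= lr * x.T @ (p - y) / len(y)      # gradient step on the log loss
    return w

def importance_weights(xs, xt, w):
    """Weight source samples by P(target | x) / P(source | x), rescaled by sample sizes."""
    x = np.hstack([xs, np.ones((len(xs), 1))])
    p_target = 1.0 / (1.0 + np.exp(-x @ w))
    return p_target / (1.0 - p_target) * len(xs) / len(xt)

# Toy data (hypothetical): the target distribution is shifted to the right.
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(500, 1))
target = rng.normal(1.0, 1.0, size=(500, 1))
weights = importance_weights(source, target, fit_domain_classifier(source, target))
```

In this toy setting, source samples that lie closer to the target distribution receive larger weights, so a classifier trained on the weighted source data emphasizes the region where target data is expected.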

Parameter adaptation methods are non-conservative domain adaptation methods which do not assume that the same classifier can be used on both source and target. Typically these methods adapt the parameters of a model trained on source, e.g. an SVM. Adaptive SVM [64] uses perturbation functions to progressively adjust the classifier learned on source to the target data.

Feature augmentation methods use an augmented feature space for source and target. In the "frustratingly easy" approach of [12], each feature is mapped to an augmented feature space by duplicating the feature vectors and padding with zero vectors, as $(x_s, x_s, 0)^T$ for source and $(x_t, 0, x_t)^T$ for target. An SVM is then used to exploit features belonging to each domain and to both.
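The augmentation described above can be sketched in a few lines. The function names are hypothetical, and the layout (shared copy, source-specific copy, target-specific copy) follows the mapping given in the text.

```python
import numpy as np

def augment_source(x):
    """Map source features x to (x, x, 0): shared copy, source copy, empty target slot."""
    return np.hstack([x, x, np.zeros_like(x)])

def augment_target(x):
    """Map target features x to (x, 0, x): shared copy, empty source slot, target copy."""
    return np.hstack([x, np.zeros_like(x), x])

xs = np.array([[1.0, 2.0]])
xt = np.array([[3.0, 4.0]])
aug_s = augment_source(xs)  # [[1, 2, 1, 2, 0, 0]]
aug_t = augment_target(xt)  # [[3, 4, 0, 0, 3, 4]]
```

With this layout, the inner product between a source and a target sample only involves the shared copy, while two samples from the same domain also interact through their domain-specific copy. A linear classifier on the augmented features can therefore learn shared and domain-specific weights at the same time.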

In [30], a common subspace is introduced into which features from both domains can be projected with respective matrices $W_1$ and $W_2$, so that a single classifier can be used (see Figure 2.3).

Figure 2.3: Source and target samples transformed into a heterogeneous feature space. Image from [30]

Feature space alignment methods minimize the domain shift by aligning source and target features. For instance, Correlation Alignment (CORAL) [52] aligns source and target by second-order statistics. More specifically, after computing the covariances $C_S$ and $C_T$ on source and target, the source data can be whitened as $D_S = D_S \, C_S^{-1/2}$ and re-colored with the target covariance as $D_S^* = D_S \, C_T^{1/2}$. A classifier trained on the transformed source data can then be applied to the target. Whitening both domains would not work well as the data may lie on different subspaces [52]. Figure 2.4 demonstrates the process.

Figure 2.4: (a) Source and target normalized features have different covariance. (b) Source is whitened. (c) Source is re-colored with target covariance. (d) Whitening both sets wouldn’t work if they lie on different subspaces. Image from [52]
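The whitening and re-coloring steps of CORAL can be sketched with NumPy as follows. The small ridge term `eps`, the helper names and the toy data are assumptions of this sketch, added for numerical stability and illustration only.

```python
import numpy as np

def matrix_power_sym(c, power):
    """Power of a symmetric positive definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(c)
    return (vecs * vals ** power) @ vecs.T

def coral(source, target, eps=1e-6):
    """Whiten source features, then re-color them with the target covariance."""
    d = source.shape[1]
    cs = np.cov(source, rowvar=False) + eps * np.eye(d)
    ct = np.cov(target, rowvar=False) + eps * np.eye(d)
    whitened = source @ matrix_power_sym(cs, -0.5)   # D_S C_S^{-1/2}
    return whitened @ matrix_power_sym(ct, 0.5)      # ... C_T^{1/2}

# Toy features (hypothetical) with different second-order statistics.
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 3)) * np.array([1.0, 2.0, 3.0])
tgt = rng.normal(size=(1000, 3)) @ np.array([[2.0, 0.5, 0.0],
                                             [0.0, 1.0, 0.3],
                                             [0.0, 0.0, 0.5]])
aligned = coral(src, tgt)
```

After the transformation, the covariance of `aligned` matches the covariance of `tgt` up to the ridge term, which is the property the method relies on.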

Feature transformation methods are varied; they attempt to minimize the discrepancy between source and target distributions in latent space by learning different transformations. For instance, Stacked Marginalized Denoising Autoencoders [7] learn representations by reconstruction, recovering denoised features by marginalizing the noise using correlations between source and target features. Multiple principles from feature transformation methods have been used in deep domain adaptation methods.

2.1.3 Deep Approaches

Discrepancy based adaptation

Discrepancy based deep adaptation approaches are inspired by working concepts from early shallow methods, and usually attempt to minimize a discrepancy measure between source and target in latent space. Siamese architectures are often used, where the weights are shared between source and target in the conservative adaptation case or the target weights are tuned further in the non-conservative case. Typically a loss function composed of the task loss and a discrepancy loss is used, thus $L = L_{task} + L_{da}$.

Deep Adaptation Networks [34], also known as DAN, use one such siamese architecture where a multi-kernel Maximum Mean Discrepancy (MK-MMD) measure between the activations in the last three layers is minimized (see Figure 2.5). Previous approaches which used single-layer single-kernel MMD (for instance [61]) were outperformed.

Figure 2.5: DAN siamese architecture using Multi-Kernel MMD for domain adaptation. Image from [34]

The loss function including the task and discrepancy loss for the multi-kernel MMD becomes:

$$L = L_{classification} + \lambda \sum_{\ell=l_1}^{l_2} d_k^2(D_s^\ell, D_t^\ell)$$

where the layer indices are set between $l_1 = 6$ and $l_2 = 8$, $d_k^2(D_s^\ell, D_t^\ell)$ is the MK-MMD between source and target, and $D_*^\ell = \{h_i^{*\ell}\}$ is the hidden representation of source or target in layer $\ell$.

Similarly, in DeepCoral [53], which extends the shallow CORAL [52] approach mentioned above, the regularization loss for adaptation is:

$$L_{da} = \frac{1}{4d^2} \lVert C_S - C_T \rVert_F^2$$

where $d$ is the feature dimension and the squared Frobenius norm of the difference between source and target covariances is minimized.
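The covariance alignment loss above can be written directly in NumPy. This is a minimal sketch on fixed activations; the function name and the toy inputs are hypothetical, and in a real network the loss would be computed on batch activations inside an automatic differentiation framework.

```python
import numpy as np

def coral_loss(hs, ht):
    """Squared Frobenius distance between feature covariances, scaled by 1 / (4 d^2)."""
    d = hs.shape[1]
    cs = np.cov(hs, rowvar=False)
    ct = np.cov(ht, rowvar=False)
    return np.sum((cs - ct) ** 2) / (4.0 * d * d)

# Toy activations (hypothetical): the target is the source scaled by 2,
# so its covariance is 4 times the source covariance.
hs = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
loss = coral_loss(hs, 2.0 * hs)
```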

In Residual Transfer Networks [35] a non-conservative approach is used, assuming that source and target classifiers differ by a residual function, $f_S(x) = f_T(x) + \Delta f(x)$. Residual blocks are used here not for feature mappings, but for the source-to-target mapping of the classifier.

In the recent Parameter Reference Loss [26] approach, an extra loss function is added besides the classifier and adapter loss. This loss component minimizes the distance between source and target parameters:

$$L = L_{task} + L_{MMD} + L_{PR}, \quad \text{where} \quad L_{PR} = \sum_{i=1}^{N_P} \lVert p_i^S - p_i^T \rVert_1$$
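A minimal sketch of the parameter reference term, assuming the source and target models expose their parameters as two lists of equally shaped arrays. The function name and the toy parameters are hypothetical.

```python
import numpy as np

def parameter_reference_loss(source_params, target_params):
    """Sum of L1 distances between corresponding source and target parameter arrays."""
    return sum(np.abs(ps - pt).sum() for ps, pt in zip(source_params, target_params))

# Toy parameters (hypothetical): one weight vector and one 1x1 weight matrix.
source_params = [np.array([1.0, 2.0]), np.array([[0.5]])]
target_params = [np.array([0.0, 4.0]), np.array([[1.0]])]
l_pr = parameter_reference_loss(source_params, target_params)  # 1 + 2 + 0.5
```

The term keeps the target parameters close to the source parameters while the other loss components adapt them.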

All these discrepancy-based methods have the limitation of having to choose a distance metric to optimize and often a kernel to map features in latent space. Associative Domain Adaptation [19], the approach on which this thesis builds, uses an association-based loss between domains instead of the discrepancy loss, in addition to the task loss component. A detailed explanation of this approach follows in Section 3.

Adversarial domain adaptation

Adversarial domain adaptation is categorized into adversarial discriminative and generative approaches. The first category typically utilizes a discriminator between source and target features in order to force the original classifier to output invariant features. The second category uses a generative model that learns to generate samples imitating the distribution of the other domain. A taxonomy of adversarial domain adaptation approaches is presented in Figure 2.6.

Figure 2.6: Categories of adversarial domain adaptation. Image from [60]

Discriminative methods. Adversarial discriminative domain adaptation relies on a discriminator between source and target that tries to distinguish features coming from either. Discriminator feedback aims to make the features indistinguishable, and therefore invariant. Discriminative domain adaptation often relies on the assumption that there is a set of invariant features across domains on which a single classifier can be used.

One of the first discriminative approaches was Domain Adversarial Neural Networks (DANN) [16]. The feed-forward part of the network has a feature encoder and a label predictor component. The output of the feature extractor has a second head into the discriminator, where a gradient reversal layer ensures domain features are indistinguishable.
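The behavior of a gradient reversal layer can be illustrated with a hand-written forward and backward pass. This is only a sketch: the class name and the scaling factor `lam` are assumptions of this example, and a real implementation would be registered as a custom operation in an automatic differentiation framework.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; negates and scales gradients in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Features pass unchanged from the encoder to the discriminator.
        return features

    def backward(self, grad_output):
        # The discriminator's gradient is flipped before it reaches the encoder,
        # so the encoder is pushed towards features the discriminator cannot separate.
        return -self.lam * grad_output

layer = GradientReversal(lam=0.5)
features = np.array([1.0, -2.0, 3.0])
out = layer.forward(features)
grad_to_encoder = layer.backward(np.array([0.2, 0.2, -0.4]))
```

The discriminator itself is still trained to separate the domains, while the reversed gradient makes the encoder work against it.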

Adversarial Discriminative Domain Adaptation (ADDA) [60] is a recent approach which consists of two stages of training. In the first stage, a task-oriented encoder and classification network is trained on source. In the second stage, a copy of the pre-trained encoder is further fine-tuned on target with a discriminator that predicts the domain from which the features come. The new fine-tuned encoder is then used with the original classifier as the class label predictor on target. Here the weights are shared only in the classifier part of the network, but not in the encoder, thus making it a non-conservative approach.

In [59], a multi-task domain adaptation approach simultaneously optimizes for domain invariance and a soft-label distribution matching loss for transferring task correlation. The joint loss function over classifier, discriminator and representation parameters

$$L(x_S, y_S, x_T, y_T, \theta_D; \theta_{repr}, \theta_C) = L_C(x_S, y_S, x_T, y_T; \theta_{repr}, \theta_C) + \lambda L_{conf}(x_S, x_T, \theta_D; \theta_{repr}) + \nu L_{soft}(x_T, y_T; \theta_{repr}, \theta_C)$$

is minimized, where $L_{conf}$ is the domain confusion loss of the discriminator and $L_{soft}$ is defined over per-category soft labels as

$$L_{soft}(x_T, y_T; \theta_{repr}, \theta_C) = -\sum_i l_i^{(y_T)} \log p_i$$

where $p$ is the softmax activation of the target image.

Generative methods. With the recent advances in generative adversarial networks (GANs) [18], many domain adaptation approaches use some version of GANs to learn an explicit mapping between source and target. A generator loss is added to the adversarial one to learn to map domain distributions.

In [4] a GAN is used to learn a mapping between source and target, generating source images that look like the target domain. The loss is composed of a domain GAN loss component, a task-specific (classification) component and a content similarity loss component between images. The objective becomes:

$$\min_{\theta_G, \theta_T} \max_{\theta_D} \; \alpha L_d(D, G) + \beta L_t(T, G) + \gamma L_c(G)$$

where $L_t$ is the classification task loss, i.e. cross entropy, and $L_d$ is the domain loss, in which the discriminator and generator are optimized in a minimax fashion and the generator is conditioned on noise as well as the source image:

$$L_d(D, G) = \mathbb{E}_{x^t}[\log D(x^t; \theta_D)] + \mathbb{E}_{x^s, z}[\log(1 - D(G(x^s, z; \theta_G); \theta_D))]$$

For the similarity loss component $L_c$, a masked pairwise mean squared error between pixels is used instead of the commonly used L1 or L2 losses.
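As a small numerical illustration of the domain loss, the sketch below evaluates it from discriminator output probabilities on real target images and on generated images. The function name, the stabilizing constant `eps` and the example probabilities are assumptions of this sketch; the networks themselves are not modeled.

```python
import numpy as np

def domain_loss(d_real_target, d_generated, eps=1e-12):
    """E[log D(x_t)] + E[log(1 - D(G(x_s, z)))] from discriminator probabilities."""
    real_term = np.mean(np.log(d_real_target + eps))
    fake_term = np.mean(np.log(1.0 - d_generated + eps))
    return real_term + fake_term

# A confused discriminator outputs 0.5 everywhere (hypothetical values).
confused = domain_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
# A confident discriminator separates real target images from generated ones.
confident = domain_loss(np.array([0.99, 0.98]), np.array([0.01, 0.02]))
```

The discriminator maximizes this quantity, while the generator minimizes it by producing images for which the discriminator outputs values close to those for real target images.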

Coupled Generative Adversarial Networks (CoGANs) [31] use a pair of GANs to synthesize realistic images in each domain and discriminate whether the images are real or synthesized. A weight sharing constraint is applied that allows the network to learn a joint distribution from images without corresponding pairs.

Cycle-consistent Adversarial Domain Adaptation (CyCADA) [23] is a recent approach which uses a combination of cycle-consistent GAN losses that learn mappings from source to target and vice versa. Cycle consistency is introduced following the work of [69] where inverse mapping is enforced besides the one-directional one.

Data reconstruction-based methods

These methods use data reconstruction to achieve feature invariance between domains often by using autoencoders, where the encoder part learns the representation and the decoder part the reconstruction.

Domain Separation Networks [5] use stacked domain-specific and joint autoencoders, where the encoded features are used to learn common and specific representations. The decoders are attached to a reconstruction loss where they attempt to reconstruct the input samples jointly with classification training.

2.1.4 Domain Adaptation for Semantic Segmentation

Semantic segmentation is the task of predicting an object class label for every pixel of an image. Due to the higher complexity compared to image classification, domain adaptation approaches for semantic segmentation have only recently emerged. It was shown with Fully Convolutional Networks [32] that encoders from classification networks such as ResNet [22] or VGG [47] can be extended in a fully convolutional fashion with upsampling layers to produce dense output and achieve state-of-the-art results. These architectures, or versions thereof with additional context or pooling modules in the upsampling, have been widely used as base architectures for domain adaptation methods.

Adversarial discriminative methods

FCNs in the Wild [24] was the first work to perform such adaptation using an adversarial approach. For the segmenter, a siamese architecture of FCNs with a dilated convolutions context module in the upsampling stage is used [66]. A discriminator network takes encoder features as input and learns to discriminate between source and target, thus giving feedback to the segmenter, which has to learn to output invariant features for both domains. In addition, the authors argue that domain adversarial training only captures global domain shifts. In order to account for category-specific shifts, a constrained multiple instance loss component is added. The output prediction map of the target is constrained by lower and upper bounds on category presence statistics in an image, based on percentages of labels in images of the source domain. This procedure is illustrated in Figure 2.7.

Similarly, in [58] adversarial training is done between source and target to achieve feature invariance for domain adaptation. However, multi-level adversarial training is done by discriminating separately between output probability maps as well as feature-level activations in the network. Two separate discriminators are used at two levels of the network, which allows for better performance than single-level feature or pixel-level adaptation.

Similarly in [57] adversarial training for both feature and pixel level is performed. In addition, novel regularizations for feature level adaptation are explored while using insights from semi-supervised learning.

In Reality Oriented Adaptation (ROAD) [8] domain classifiers are trained to achieve feature invariance on different areas of the images. More specifically, the image is split into a grid and for every part of the grid a separate domain classifier is trained. This is done in an attempt to capture more information on the spatial structure of the images collected from urban scenes.

In addition to the grid-based domain adversarial training, a distillation loss is defined that attempts to keep the pre-trained weights (e.g. from ImageNet) from forgetting what was learned on real objects while training on the synthetic source. This loss is defined as:


$$L_{dist} = \frac{1}{N} \sum_{i,j} \lVert x_{i,j} - z_{i,j} \rVert_2$$

where $x_{i,j}$ and $z_{i,j}$ are activations at position $(i, j)$ of the feature map and $N$ is the total number of activations. The Euclidean distance between activations is minimized in an attempt to guide the model to maintain behavior learned on the real images.

Figure 2.7: FCNs in the Wild. Image from [24]
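The distillation loss can be illustrated with a minimal NumPy sketch. It assumes that `x` holds the feature map of the network being adapted and `z` the feature map of a frozen pre-trained copy, both of shape (H, W, C), and that N counts spatial positions; these shapes and the function name are illustrative assumptions, not taken from [8].

```python
import numpy as np

def distillation_loss(x, z):
    """Mean Euclidean distance between two feature maps of shape (H, W, C).

    x: activations of the network being adapted (assumed layout).
    z: activations of a frozen pre-trained reference network (assumed layout).
    """
    diff = x - z
    # L2 norm over the channel axis gives one distance per position (i, j).
    per_position = np.sqrt(np.sum(diff ** 2, axis=-1))
    # Averaging over all positions corresponds to the 1/N factor.
    return per_position.mean()
```

The loss is zero when both networks produce identical activations and grows as the adapted features move away from the pre-trained ones.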

Adversarial generative methods

As in the respective classification approaches, these methods use a generator to learn an explicit mapping between source and target images. In [44] this mapping is learned by the generator G, which learns to produce "fake" source and target images. A discriminator D tries to distinguish generated source and target images from the real ones. In addition to the segmenter loss on source and the discriminator loss on generator outputs, an auxiliary segmentation loss on source images newly generated from target is added, as well as an L1 reconstruction loss for the accuracy of cross-domain generated images (see Figure 2.8). Training is done in 3 stages. First, the discriminator D is trained to distinguish real and fake outputs of the G network, together with the auxiliary segmentation loss:

$$L_D = L^s_{adv,D} + L^t_{adv,D} + L^s_{aux}$$

Second, the generator G is trained to beat the discriminator D by making the features invariant together with reconstruction loss:

$$L_G = L^s_{adv,G} + L^t_{adv,G} + L^s_{rec} + L^t_{rec}$$

Third, the segmenter network (F + C) is trained with the segmentation and auxiliary losses.


Figure 2.8: Unsupervised Domain Adaptation for Semantic Segmentation with GANs. Image from [44]

The CyCADA [23] approach mentioned above uses a GAN to learn a mapping between source and target, in addition to enforcing a cycle-consistency loss. Adaptation is done at both pixel and feature level, and competitive results are achieved in semantic segmentation as well as in image classification.

Other methods

A few other recent methods have attempted to solve domain adaptation for semantic segmentation with approaches other than the standard adversarial one.

A Fully-convolutional tri-branch [67] splits the original segmentation network into 3 branches. The first two are trained on source and used to make target predictions. If these two branches agree on target predictions with a high confidence score, these predictions are used as pseudo-labels to train the third branch. The first two branches are encouraged to be diverse by a weight-constrained loss.
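The pseudo-labelling step of the tri-branch idea can be sketched as follows. This is a simplified illustration rather than the implementation of [67]: it assumes per-pixel class probabilities of shape (H, W, K) from the two labelling branches, requires both branches to exceed a hypothetical confidence `threshold`, and marks all other pixels with a hypothetical `ignore_index`.

```python
import numpy as np

def select_pseudo_labels(probs_a, probs_b, threshold=0.9, ignore_index=255):
    """Return an (H, W) label map for training the third branch.

    probs_a, probs_b: (H, W, K) class probabilities from the two branches.
    threshold, ignore_index: illustrative values, not taken from [67].
    """
    labels_a = probs_a.argmax(axis=-1)
    labels_b = probs_b.argmax(axis=-1)
    # A pixel is kept only if both branches predict the same class ...
    agree = labels_a == labels_b
    # ... and the less confident of the two is still above the threshold.
    confident = np.minimum(probs_a.max(axis=-1), probs_b.max(axis=-1)) >= threshold
    return np.where(agree & confident, labels_a, ignore_index)
```

Pixels labelled with `ignore_index` would simply be excluded from the loss of the third branch.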

Curriculum Domain Adaptation is explored in [68], where domain adaptation is achieved in a curriculum-style learning fashion by first solving the easier tasks before the more complex ones. The authors use label distributions over images as an easy task and local distributions over landmark superpixels as a difficult one.


2.2 Learning Models for Streaming Data

2.2.1 Introduction

Streaming data is generated from continuous data sources such as social network feeds, transaction data, live video feeds etc. The process generating the data is often non-stationary and the data cannot be assumed to be independent and identically distributed and drawn from a single distribution. The temporal properties of streams require a different set of techniques for dealing with this kind of data.

Consider a stream $S$ of data that appear incrementally in sequences of single online samples or in portions or blocks. The sequential data $X_\tau$ enters the stream at time $\tau = 1, 2, \ldots, K$ and the labels $Y_\tau$ may or may not be provided. If the data is processed sample by sample, this is called online processing. In the case of portions or data blocks, these are usually accumulated in equal sizes and training is done once a block is available, which is known as block processing [50].
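Block processing can be illustrated with a small generator that accumulates samples from any iterable stream. The helper name `blocks` and the choice to emit a shorter final block when the stream ends are assumptions made for this sketch.

```python
from itertools import islice

def blocks(stream, block_size):
    """Yield consecutive lists of up to `block_size` samples from `stream`.

    Each block is produced once and never revisited, which mirrors a
    setting where data is used for training and then discarded.
    """
    iterator = iter(stream)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return  # stream exhausted
        yield block

# Online processing is the special case block_size == 1.
for block in blocks(range(7), 3):
    pass  # a model update on `block` would go here
```

With `block_size=3`, a stream of seven samples is split into two full blocks and one final block containing the remaining sample.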

The concept drift problem

Concepts in data are stable if they are generated from the same distribution. Often in streaming this is not the case. The distribution shift over time in streaming data is known as concept drift. Thus, if the data samples are generated from a source distribution $p_\tau(X, Y)$, then for two distinct points in time $\tau$ and $\tau + \Delta$ an $X$ exists such that $p_\tau(X, Y) \neq p_{\tau+\Delta}(X, Y)$ [50].

There are two types of drift. Real drift is defined as the variation of the posterior probability of classes $p_\tau(Y \mid X)$ over time, independent of variations in the evidence $p_\tau(X)$. Virtual drift is the change in the marginal distribution of the evidence $p_\tau(X)$ without affecting the posterior probability of classes $p_\tau(Y \mid X)$ [25].

This drift can be sudden, when the current data distribution is suddenly replaced by a different one, or gradual, when a slow rate of change is observed in the stream distribution [50]. Noise and outliers should not affect the classifiers, which aim to capture the underlying distributions. Further, drifts can be permanent if the variation continues through time, or transient when the drift disappears after a while [14]. Often data from a similar distribution reappears, resulting in recurrent drifts.

Considerations

Besides the distribution shift over time, there are a few other important considerations to be taken into account when developing methods for streaming data classification.

First, memory constraints do not allow for all stream data to be stored simultaneously. Thus most of the data should be used and then discarded to free the memory for incoming streams. This creates the need for a one-pass learning approach where each sample pair or data block is only seen once in the training process before the data is discarded.

Second, labels for the sequence $S_\tau$ are not usually available with the data itself. When they arrive with a delay in the next sequence $S_{\tau+1}$, the model can be easily evaluated in a "train-then-test" scenario. This delay is known as verification latency, and the scenario where labels are only provided in the beginning of a stream is known as "initially labeled nonstationary streaming". In these cases classifier knowledge needs to be propagated across several timesteps [14].

Another important consideration is that sometimes smaller classes occur so rarely that being able to detect when they occur and adapt accordingly is an important part of the setup.


In addition, it’s important for the classifier to be able to produce on demand predictions regardless of whether a similar distribution of incoming stream data has been previously seen in training data.

Last, different approaches and constraints call for different evaluation measures for the classifier. Usually streaming classification algorithms are evaluated by the required processing time, memory usage, predictive performance and ability to adapt [50].

2.2.2 Algorithms and methods

Approaches that deal with streaming data are either passive approaches that use a single classifier or an ensemble or active ones where an extra decision is made on whether to update the classifier. Most often classification algorithms such as Decision Trees, Rule-Based and Nearest Neighbor are used, whereas adjustments in neural network architectures to account for streaming have been proposed [1]. Figure 2.9 shows an overview of streaming data classification algorithms. It can be seen that not many deep approaches are used for this task, although a few attempts have been made [65].

Figure 2.9: Streaming data classification algorithms. Image from [25]

Active approaches Active approaches to streaming data classification aim to detect the concept drift in streaming data. A change detector looks at the extracted features and, depending on whether there is a drift, decides whether the classifier should update. This change is usually measured by the variation in classification error as well as by inspection of the raw data features themselves [14]. Although these methods are usually able to adapt to concept drift, detecting it might be hard when the drift happens gradually.

An example of an active approach is [2], where a complex sampling and filtering mechanism for active training and a random forest based classifier are used.
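The role of a change detector based on classification error can be illustrated with a deliberately simple sketch. It is not the mechanism of [2] or [14]: the class name, the window size and the threshold are hypothetical, and drift is flagged whenever the error rate of the most recent window exceeds the error rate of a reference window by more than the threshold.

```python
from collections import deque

class ErrorRateChangeDetector:
    """Flags drift when the recent error rate rises above a reference rate."""

    def __init__(self, window=50, threshold=0.2):
        self.window = window            # number of predictions per window (assumed)
        self.threshold = threshold      # tolerated increase in error rate (assumed)
        self.reference = None           # error rate of the first full window
        self.recent = deque(maxlen=window)

    def update(self, error):
        """error: 1 if the latest prediction was wrong, else 0. Returns True on drift."""
        self.recent.append(error)
        if len(self.recent) < self.window:
            return False                # not enough evidence yet
        rate = sum(self.recent) / self.window
        if self.reference is None:
            self.reference = rate       # first full window becomes the reference
            return False
        if rate - self.reference > self.threshold:
            # Drift flagged: restart so the next full window is the new reference.
            self.reference = None
            self.recent.clear()
            return True
        return False
```

In an active approach, a `True` return value would trigger an update of the classifier, whereas a passive approach would update on every labeled block regardless.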

Passive approaches These do not seek to detect a drift, but simply continue the training as new labeled data arrives. The models are either based on a single classifier or on ensembles. Single classifier models are less demanding computationally.

An example is [55] where a micro-cluster Nearest Neighbor is used, which makes use of statistical summaries for data streams.

However, ensembles usually do best due to the availability of multiple classifiers to make decisions. In addition, ensembles are flexible to change as classifier members can be added or removed based on incoming data, although this process requires more effort. Still, ensemble-based methods are most often preferred [51, 63].

Figure 2.10: Active learning framework for stream classification. A change detection mechanism informs the model on whether to adapt or not given the newly incoming data. Image from [14]

Not many works look into exploiting unsupervised data for improving data stream classifiers. [54] use semi-supervised learning to adjust k-nearest neighbor weights over time. Due to the complexity of dealing with labeled data themselves, not much has been done to explore potential ways to boost classifier results with unsupervised data.


2.3 Discussion

An overview of the topics introduced and related research was presented in this section. We discuss the motivation and choices made when developing the method that follows.

In the domain adaptation setting it can be noticed that not many methods are consistently used across tasks achieving competitive results. Usually domain adaptive solutions are tailored to the task. We believe that robust domain adaptation methods that can be applied across tasks without bells and whistles are important to the advancements in the field.

As it can be observed, adaptation approaches for semantic segmentation are mostly based on adversarial DA and use either a discriminator or a GAN that learns a mapping between the images. Many insights acquired from classification approaches are yet to be exploited, with discrepancy based methods being almost not represented. This can be due to the lack of enough understanding of how multi-dimensional embeddings behave in latent space and the complexity of semantic segmentation.

This thesis attempts to shed more light into understanding the behavior differences from classification to patchwise classification and further to semantic segmentation, for laying the grounds to better understanding of domain adaptation for semantic segmentation.

Regarding the streaming setup, we discussed how domain adaptation can be considered as a special case of streaming with two time steps and a single shift between them where $p_\tau(X) \neq p_{\tau+1}(X)$. Thus we can generalize dynamic domain adaptation and stream adaptation over time in a joint framework. To show how this would work, we use a streaming setup with an "initially labeled environment" that, as described above, does not receive further labels, only raw images, in a semi-supervised setting.

From the evaluation methods of streaming data classification discussed above, in this work we evaluate our streaming classification models for predictive performance and ability to adapt. We do consider memory usage by simulating a streaming scenario where data is used and then discarded, but explicit memory optimization for streaming is not the focus of this work. The processing time component is not considered for evaluation within the scope of this work.


3.1 Domain Adaptation with Associative Learning

3.1.1 Associative domain adaptation

From maximum mean discrepancy to learning by associations

Early domain adaptation approaches [41, 34] have used maximum mean discrepancy (MMD) as a distance measure between source and target feature distributions. MMD has been shown to be a good estimate of distribution distances through their mean embeddings [3]. Consider source and target datasets $x^s$ and $x^t$ with $n$ and $m$ samples respectively, and a mapping $\phi : X \to H_k$, where $H_k$ is a reproducing kernel Hilbert space (RKHS) [48]. Maximum mean discrepancy is defined as:

$$MMD(x^s, x^t) = \left\lVert \frac{1}{n} \sum_{i=1}^{n} \phi(x^s_i) - \frac{1}{m} \sum_{i=1}^{m} \phi(x^t_i) \right\rVert_{H_k}$$

Minimizing the MMD between source and target explicitly yields improved adaptation results for a model trained with source labels only. MMD is computed with the kernel trick in quadratic runtime, although there are linear-time estimators [34]. Besides the computational complexity, a kernel and its parameters have to be chosen.
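The quadratic-time computation via the kernel trick can be sketched as follows. The sketch assumes an RBF kernel with a hypothetical bandwidth parameter `gamma` and returns the biased estimate of the squared MMD; both choices are illustrative and show exactly the kind of kernel and parameter selection mentioned above.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel values between the rows of a and the rows of b."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating point rounding.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def mmd_squared(xs, xt, gamma=1.0):
    """Biased estimate of the squared MMD between samples xs (n, d) and xt (m, d).

    All pairwise kernel values are evaluated, hence the quadratic runtime.
    """
    return (rbf_kernel(xs, xs, gamma).mean()
            + rbf_kernel(xt, xt, gamma).mean()
            - 2.0 * rbf_kernel(xs, xt, gamma).mean())
```

The estimate is zero for identical sample sets and approaches 2 for an RBF kernel when the two sets are far apart relative to the kernel bandwidth.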

Learning by association [20, 19] is a technique that uses association of embeddings in latent space assuming that same-class data will have similar embeddings. Given a labeled and an unlabeled set of data, the unlabeled set should be associated to the closest labeled points, given that the overall distribution structure is maintained.

Similarly in associative domain adaptation, the aim is to minimize source and target discrepancy, but indirectly through source and target embedding associations, unlike explicit MMD-minimization based approaches, with the advantage of not having to choose a kernel and relevant parameters.

Definition

Considering the domain shift between source and target distributions, the goal is to associate the domains on embedding level. Let us assume that $x^s$ and $x^t$ are source and target data whose distributions differ, so due to the domain shift $P(x^s) \neq P(x^t)$. To perform adaptation alongside the supervised task, an additional loss component is added that operates on supervised source data and unsupervised target data. In associative domain adaptation this is the associative loss component, which acts as a regularizer on the source classifier in order to minimize the distribution shift on feature level.

$$L = L_{task} + L_{assoc} \qquad (3.1)$$

The associative loss consists of two components: the walker and the visit loss.

$$L_{assoc} = \alpha L_{walker} + \beta L_{visit} \qquad (3.2)$$


Consider a network trained on a task that produces embeddings $\phi^s \in \mathbb{R}^{N \times D}$ and $\phi^t \in \mathbb{R}^{M \times D}$ for $x^s$ and $x^t$ respectively, where $N$ and $M$ are the numbers of data points in source and target and $D$ is an arbitrary embedding dimensionality. In order to associate the correct source and target embeddings, their pairwise distances or affinities need to be computed.

Affinities and transition probabilities An affinity matrix $A \in \mathbb{R}^{N \times M}$ can be computed where every element $A_{ij}$ is the affinity of the source and target embeddings with the respective indices:

$$A_{ij} = \mathrm{aff}(\phi^s_i, \phi^t_j) \qquad (3.3)$$

In [19], the unnormalized dot product between vector embeddings is used as a similarity measure, thus $A_{ij} = \langle \phi^s_i, \phi^t_j \rangle$. In Section 3.2.2 we investigate different affinity measures and their impact on the performance of the method.

If we think of embeddings as nodes in a graph where all source embedding nodes are connected to all target embedding nodes, and of source-to-target affinities as edges or transition weights, affinities can be interpreted as transition probabilities between embeddings in source and target. Since every embedding point in the source is "connected" to a finite set of points in the target, it is convenient to interpret the affinities between a single sample in one domain and all samples in the other as a probability distribution. Thus $p(\phi^t \mid \phi^s_i)$ is a probability distribution over samples in $\phi^t$ given the $i$-th source embedding, and $\sum_{j'} p(\phi^t_{j'} \mid \phi^s_i) = 1$.

From here on, the notation $p^{s \to t}_{ij}$ is used to denote $p(\phi^t_j \mid \phi^s_i)$, with $s \to t$ indicating a transition from source to target. To get from the affinity matrix $A$ to transition probability distributions the softmax function can be used. Thus the affinity-turned-probability from an embedding vector $\phi^s_i$ in the source to an embedding vector $\phi^t_j$ in the target becomes:

$$p^{s \to t}_{ij} = \frac{\exp(A_{ij})}{\sum_{j'} \exp(A_{ij'})} \qquad (3.4)$$
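A minimal NumPy sketch of this step, assuming dot-product affinities as in [19] and embeddings stored as rows of `phi_s` (N, D) and `phi_t` (M, D). Subtracting the row maximum before exponentiating is a standard numerical safeguard added here and does not change the result.

```python
import numpy as np

def transition_probabilities(phi_s, phi_t):
    """Source-to-target transition probabilities of shape (N, M).

    Row i is a softmax over the affinities between source embedding i
    and all target embeddings, so every row sums to one.
    """
    affinity = phi_s @ phi_t.T                                   # A_ij = <phi_s_i, phi_t_j>
    affinity = affinity - affinity.max(axis=1, keepdims=True)    # numerical stability
    exp_affinity = np.exp(affinity)
    return exp_affinity / exp_affinity.sum(axis=1, keepdims=True)
```

Target-to-source probabilities follow from the same function by swapping the arguments, which is equivalent to a row-wise softmax over the transposed affinity matrix.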

It can be argued that different normalization functions can be used to get a probability distribution from the values. We discuss and investigate this empirically in Section 3.2.2.

In addition, for convenience the probability of two transitions happening jointly can be modeled as the product of the separate transition events. For a transition from an embedding $\phi^s_i$ in the source to $\phi^t_j$ in the target and back to another source embedding $\phi^s_k$, the probability of this event can be written as:

$$p^{s \to t \to s}_{ijk} = p^{s \to t}_{ij} \, p^{t \to s}_{jk} \qquad (3.5)$$

and the overall two-step probability as:

$$p^{s \to t \to s}_{ik} = \sum_j p^{s \to t}_{ij} \, p^{t \to s}_{jk} \qquad (3.6)$$

The associative loss component is based on two principles around these transition probabilities, constraining the embedding behavior as follows.

The walker loss To associate source and target embeddings the labeled source embedding can be leveraged to associate to target embedding points. The labels of the target domain embeddings are not known, but the distribution in the source and target is expected to follow a similar structure where same-class samples can be associated in embedding space.


Figure 3.1: (a) Associative walker loss. Arrows represent probabilities of transitions, which are to be maximized if there is a same-class target-to-source transition. (b) Associative visit loss. Source to target probabilities are distributed uniformly.

Thus, while minimizing same-class embedding distance from both domains, all source and target embeddings of the same class should lie closer together. This can be considered equivalent to minimizing both source-to-target and target-to-source embedding distance for every particular class. With the probabilistic interpretation of the distances, this would mean maximizing the joint probability of a source-to-target and target-to-source distance-based transition for same class embeddings. As class specific target embeddings are not known, the two-step joint probability ps→t→s_ik and labels in the source can be utilized.

This joint probability should be maximized if the φs_i and φs_k samples in the source belong to the same class. In other words, a walker transition from source to target and back should end up in the same class in source with high probability. This can be enforced through a cross-entropy loss:

$$L_{walker} = H(E, P^{s \to t \to s}) \qquad (3.7)$$

where $P^{s \to t \to s} \in \mathbb{R}^{N \times N}$ is a matrix containing the elements $p^{s \to t \to s}_{ik} = \sum_j p^{s \to t}_{ij} \, p^{t \to s}_{jk}$, representing two-step transition probabilities from an embedding sample in the source to one in the target and back to the source, and $E$ is the normalized equality matrix indicating labels of the same class:

$$E_{ik} = \begin{cases} 1 / |\phi^s_i| & \text{if } C_{\phi^s_i} = C_{\phi^s_k} \\ 0 & \text{otherwise} \end{cases} \qquad (3.8)$$

Based on this intuition, correct associations of target embeddings are encouraged if they lie close to multiple source embeddings which belong to the same class. For any $i$ and $k$, $p^{s \to t \to s}_{ik}$ should be maximized if $C_{\phi^s_i} = C_{\phi^s_k}$, where $C_{\phi_n}$ is the class of a sample embedding. This is illustrated in Figure 3.1a.
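The walker loss can be sketched in NumPy as below. Several details are assumptions of this sketch: dot-product affinities are used, the normalizer of the equality matrix is read as the number of source samples sharing the class of sample i (so that every row sums to one), the cross entropy is averaged over source samples, and a small `eps` guards the logarithm.

```python
import numpy as np

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def walker_loss(phi_s, phi_t, labels_s, eps=1e-8):
    """Cross entropy between same-class targets and two-step probabilities.

    phi_s: (N, D) source embeddings, phi_t: (M, D) target embeddings,
    labels_s: (N,) integer class labels of the source embeddings.
    """
    affinity = phi_s @ phi_t.T                    # (N, M)
    p_st = softmax(affinity, axis=1)              # source -> target
    p_ts = softmax(affinity.T, axis=1)            # target -> source
    p_sts = p_st @ p_ts                           # two-step probabilities, (N, N)
    same_class = (labels_s[:, None] == labels_s[None, :]).astype(float)
    targets = same_class / same_class.sum(axis=1, keepdims=True)   # rows sum to one
    return float(-(targets * np.log(p_sts + eps)).sum(axis=1).mean())
```

When round trips end in the starting class, the loss reduces to the entropy of the uniform distribution over same-class source samples; round trips that end in another class are penalised heavily.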

The visit loss The walker loss risks collapsing into minimizing distances only to the target samples that lie closest to the source samples, as they are easier to associate. This can partially be mitigated by the visit loss component, which encourages the transition probabilities from source to target to be distributed equally among target samples. This is enforced with a cross-entropy loss:

$$L_{visit} = H(V, P^{s \to t}) \qquad (3.9)$$

where $P^{s \to t}_j = \sum_i p^{s \to t}_{ij}$. Considering all probabilities $p^{s \to t}_{ij}$ of transition from $\phi^s_i$ to $\phi^t_j$, $P^{s \to t}$ is enforced to be uniformly distributed, with $V \in \mathbb{R}^M$ having elements:

$$V_j = \frac{1}{|\phi^t|} \qquad (3.10)$$

This is illustrated in Figure 3.1b.
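A minimal sketch of the visit loss as a cross entropy between the uniform target V and the aggregated source-to-target probabilities. One deviation is an assumption of this sketch: the probabilities are averaged over source samples rather than summed, so that the result is a proper distribution over target samples; this shifts the loss only by a constant. The `eps` term is likewise an added safeguard.

```python
import numpy as np

def visit_loss(p_st, eps=1e-8):
    """Cross entropy between a uniform target and the visit distribution.

    p_st: (N, M) source-to-target transition probabilities, rows sum to one.
    """
    # Average probability with which each target sample is visited.
    visit = p_st.mean(axis=0)
    # Uniform target V_j = 1 / M over the M target samples.
    uniform = np.full(visit.shape, 1.0 / visit.shape[0])
    return float(-(uniform * np.log(visit + eps)).sum())
```

The loss is minimal when every target sample is visited equally often and grows when transitions concentrate on a few target samples.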

Assumptions The definition of the visit loss in eq. (3.10) assumes that the source and target class distributions are similar on the batch level where adaptation happens. The authors of [19] acknowledge the problem and recommend using a lower coefficient for the visit loss if this is not the case. In addition, they sample a uniform distribution over class labels per batch in the source, and attempt to alleviate the distribution difference in the target by sampling a batch 10-100 times larger than the number of classes. This often ensures that all classes will be present in a target batch, but does not necessarily approximate the uniform distribution in the source. This is even harder to ensure in diverse realistic datasets or tasks such as semantic segmentation, where target class distributions in the batch might vary a lot depending on the size of objects.

We argue that this assumption should be tackled on a theoretical level in the method. First, adjusting the coefficient β for the visit loss would require implicit access to labels in the target domain in order to tune β. We show in experiments (Section 4) that an increased difference in KL-divergence between source and target during training deteriorates the method results.

In addition, we point out that a lower visit loss coefficient does not allow the network to exploit the full adaptation capability of the method. Below, we reformulate the approach to relax this distribution assumption and make the method robust to distribution differences between source and target, while preserving the full adaptation capacity between embeddings.

3.1.2 Relaxing the class distribution assumption

We discussed the assumption made by the original formulation of the visit loss in eq. (3.10). While setting a low coefficient for the visit loss component is a potential workaround, we show that this does not allow for full exploitation of the adaptation capabilities of the method and tends to fail if the class distribution in the target is far from uniform in KL-divergence. Relaxing the distribution assumption is especially important for the scalability and usability of the method as well as its applicability to more complex tasks such as segmentation or streaming data classification. In the following sections we reformulate the method considering potentially different class distributions and show the impact of our new approach in these tasks.

Intuition

In Figure 3.2 we illustrate why the uniform distribution of source to target probabilities with the visit loss would fail. In Figure 3.2a, if a large class in the source corresponds to a small class in the target, uniform distribution of source to target probabilities will enforce wrong associations from the large class to a wrong class in the target. If we were to consider class distributions, however, a larger portion of the source to target probabilities would be distributed among fewer samples in the target (see Figure 3.2b).

According to this intuition, we can rewrite the visit loss from eq. (3.9) with a weighted target distribution $w_j V_j$, with $V_j = 1/|\phi^t|$ and:

$$w_j = \frac{p_{src}(C_{\phi^t_j})}{p_{tgt}(C_{\phi^t_j})} \qquad (3.12)$$

where $C_{\phi^t_j}$ denotes the class label of the $j$-th target embedding.

Figure 3.2: (a) Uniformly distributed visit loss may enforce wrong associations among embeddings. (b) Class balanced visit loss adjusts potentially wrong associations between embeddings when source and target class sizes vary.

The target distribution of the cross entropy can also be written as:

$$w_j V_j = \frac{p_{src}(C_{\phi^t_j})}{p_{tgt}(C_{\phi^t_j}) \, |\phi^t|} \qquad (3.13)$$

To compute $w_j$ directly, we would need to know the respective class probability of this target sample in the source as well as in the target. Below we discuss our technique for estimating these weights in an unsupervised manner.

Estimating visit loss weights

Since the lack of target labels means we cannot know the class to which $\phi^t_j$ belongs, we need to estimate this class probability in the source and in the target. Note that by class probabilities we mean those in the sampled batch where association takes place, not in the whole dataset. Here we can leverage the labels in the source. By sampling a source batch with uniform distribution among classes, we get a constant $p_{src}(C_{\phi^t_j})$ across classes in the source. This leaves us with the task of estimating class probabilities for every embedding in the target. We approximate this by retrieving clusters around the embeddings that attempt to approximate classes.

It is logical to expect that same-class embeddings in a latent space cluster together for a modern classifier to be able to discriminate between different class samples. If we could retrieve the clusters corresponding to the classes, we could compute the class distribution in a batch from the respective cluster sizes and the overall batch size. This is illustrated in Figure 3.3. The approximation holds true under the assumption that the clusters are well aligned to the means of the respective, optimal classifiers.
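The weight estimate from cluster sizes can be sketched as follows. The sketch assumes that a cluster assignment for every target embedding in the batch is already available from some clustering method, that the source batch is sampled uniformly over `num_classes` classes, and that clusters stand in for classes; the function name is hypothetical.

```python
from collections import Counter

def visit_weights(cluster_ids, num_classes):
    """Estimate w_j = p_src / p_tgt for every target sample in a batch.

    cluster_ids: one cluster assignment per target embedding.
    p_src is constant (uniform source sampling); p_tgt is the relative
    size of the cluster that the sample was assigned to.
    """
    batch_size = len(cluster_ids)
    cluster_sizes = Counter(cluster_ids)
    p_src = 1.0 / num_classes
    return [p_src / (cluster_sizes[c] / batch_size) for c in cluster_ids]

# Batch of four target samples: three in one cluster, one in another.
weights = visit_weights([0, 0, 0, 1], num_classes=2)
targets = [w / 4 for w in weights]   # weighted visit targets w_j * V_j
```

In this toy batch the single sample of the small cluster receives a larger share of the visit probability than each sample of the large cluster, and the weighted targets still sum to one because the number of clusters equals the number of classes.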

There is a wide range of clustering methods that can be used for embedding clustering. We use an off-the-shelf hierarchical agglomerative clustering algorithm which experimentally appears to allow for good alignment between the obtained clusters and works well when clusters have very different sizes. We show in Section 4 that an off-the-shelf fast clustering approach yields close approximations of the accurate oracle visit loss weights and improvements of the associative adaptation for cases when batch sampling is not equalized among source and target datasets.

Figure 3.3: Clustering to estimate the class probabilities for target embedding samples. Clusters may not be fully accurate but they do approximate the class sizes for class probability estimation of samples in a batch.


3.2 Associative Domain Adaptation for Semantic Segmentation

While several domain adaptation approaches work well for classification, semantic segmentation is a more challenging task to tackle due to the complexity and necessity for dense predictions. Current approaches mostly make use of adversarial training or GANs to learn a source and target mapping. Approaches that aim to minimize discrepancies in latent space of embeddings generated by a network trained on a semantic segmentation task have not been explored.

Having relaxed the distribution assumption, the associative domain adaptation approach can be applied to tasks where source and target distributions are not uniform or uniformity cannot be approximated. We show that our associative adaptation approach, which is robust to distribution differences, can achieve competitive results on dense prediction tasks.

3.2.1 Fully convolutional networks for semantic segmentation

Recent methods for semantic segmentation have shifted entirely towards fully convolutional networks. Large classification networks including fully connected layers can be easily turned into segmentation networks by turning fully connected layers into convolutions with kernels of size of the input region [32]. This facilitates not only transferability of classification architectures into semantic segmentation tasks, but also reusability of models trained on large classification datasets to segmentation tasks.
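The conversion of a fully connected layer into a convolution can be checked numerically. In this sketch the layer sizes, the random weights and the naive loop-based convolution are illustrative assumptions; the point is only that reshaping the weight matrix into kernels of the size of the input region reproduces the fully connected output and extends it to larger inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, height, width, outputs = 3, 4, 4, 5

# Weights of a fully connected layer acting on a flattened (C, H, W) input.
fc_weights = rng.normal(size=(outputs, channels * height * width))

def fully_connected(x):
    return fc_weights @ x.reshape(-1)

# The same weights viewed as `outputs` kernels of size (C, H, W).
conv_kernels = fc_weights.reshape(outputs, channels, height, width)

def convolve_valid(x):
    """Slide the kernels over x of shape (C, H', W') without padding."""
    out_h = x.shape[1] - height + 1
    out_w = x.shape[2] - width + 1
    out = np.empty((outputs, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i:i + height, j:j + width]
            # Contract the kernel axes (C, H, W) with the patch.
            out[:, i, j] = np.tensordot(conv_kernels, patch, axes=3)
    return out
```

On an input of exactly the original size the convolution yields a 1 × 1 map equal to the fully connected output; on a larger input it yields one such output per spatial location, which is what makes dense prediction possible.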

Good results can already be achieved with a very simple upsampling decoder consisting of a 1 × 1 convolution adjusting the channels to the number of output classes as well as upsampling by bilinear interpolation. Further attempts have developed more elaborate upsampling modules using atrous convolutions, refining modules to capture global context and skip connections from the downsampling to the upsampling layers [66, 27].

With small modifications, a classification network such as VGG16 [47] and ResNet101 [22] can be plugged in as a backbone and one of the context-based upsampling modules can be combined to achieve state of the art semantic segmentation results.

3.2.2 Embeddings in semantic segmentation networks

Visualizing behavior

To understand the applicability of the associative domain adaptation method to segmentation, it is important to understand the behavior of embeddings produced by a segmentation network. We expect that during training, embeddings of the same class start clustering together in latent space.

Recent approaches in semantic instance segmentation [40, 15] use embeddings produced by a semantic segmentation network and further regularize these to constrain same-instance embeddings to lie closer and embeddings belonging to different instances to lie further apart. This is done in order to constrain different instance pixels of the same class to cluster to their respective instances.

In addition, it has been shown that even semantic segmentation slightly benefits from further regularizing embeddings to lie closer if they belong to the same class and further if they belong to different classes [21]. This can be interpreted as extra supervision with an additional metric on the task.

In principle, however, we expect that although segmentation is a more challenging task for deep learning, a semantic segmentation network trained with cross entropy loss on pixel-level labels is supposed to learn good separation of different class embeddings and similarity of same class ones.


Figure 3.4: Embeddings produced by a segmentation network after (a) 1000, (b) 20000 and (c) 50000 iterations respectively. Visualized with t-SNE [36].

How to visualize this embedding space can also be complex. Embedding visualization requires high-dimensional embeddings to be projected down to 2D or 3D. This can be done simply with random projections or PCA, or through t-SNE [36], which is specialized for visualizing high-dimensional embeddings. Even with this specialized method, very large dimensionalities and numbers of data points, as well as the more complex embedding space for segmentation, can make it difficult to end up with a sample of the data where every class is in a clear and separate cluster. We illustrate this further.

A state of the art segmentation network is trained on the Cityscapes dataset for semantic segmentation. In Figure 3.4 we can observe the behavior of embeddings at different stages of the training, visualized with t-SNE at a fixed perplexity value. For simplicity, we show only 8 out of 21 classes present in the dataset. Although embedding separability clearly increases as the training progresses, t-SNE fails to keep these clusters separate from each other later in the training. Looking at the different classes present, we observe that the clusters that fail to separate later in the training belong to pairs of object classes that are usually co-located in the image, such as road and sidewalk or sky and vegetation. Co-location in these t-SNE plots might also be explained by a higher misclassification rate between these classes.

Analyzing embedding distance metrics

The authors of [19] use the unnormalized dot product between embeddings as an affinity measure. Thus the affinity matrix $A$ has elements:

$$A_{ij} = \langle \phi^s_i, \phi^t_j \rangle$$

We discuss potential affinity measures based on similarity scores or distances and how to use them as affinities. Given two vectors $a = [a_1, a_2, \ldots, a_d]$ and $b = [b_1, b_2, \ldots, b_d]$, we describe the affinities as follows.

Dot product The dot product between the two vectors can be computed as:

$$a \cdot b = a^T b = \sum_{i=1}^{d} a_i b_i$$

From the definition, it can be observed that the dot product values are unbounded and the resulting value may increase infinitely with the increase of magnitude of the vector. [19] argue that this affinity measure worked best for convergence. However, L2 regularization is necessary in order not to allow the weights to explode towards values that maximize affinity by increasing vector magnitude. We argue that dot product works well as a similarity measure since due to the large variety of values it can take, it feeds more signal to the network during the association of embeddings. However, the usage of this affinity measure can get complex and cause very large weights if not regularized properly especially in deeper architectures.

Cosine similarity Cosine similarity is the normalized version of the dot product, where the latter is divided by the vector norms. Considering an angle $\theta$ between the vectors $a$ and $b$, the cosine is defined as:

$$\cos(\theta) = \frac{a \cdot b}{\|a\| \|b\|} = \frac{\sum_{i=1}^{d} a_i b_i}{\sqrt{\sum_{i=1}^{d} a_i^2} \sqrt{\sum_{i=1}^{d} b_i^2}}$$

While conceptually this is a normalized dot product, its values range between -1 and 1, yielding the cosine of the angle, so cosine similarity behaves very differently from the dot product. It can be argued that in theory cosine similarity is a much more consistent measure of similarity between vectors. It equals the dot product when the vectors $a$ and $b$ have unit norm, but this cannot be guaranteed across neural network architectures.

While in theory it is a good similarity measure, in our experiments we observed that it contributes little to the embedding association. We interpret this in the section below.

Euclidean distance into affinity The Euclidean distance between vectors $a$ and $b$ is defined as:

$$d(a, b) = \sqrt{\sum_{i=1}^{d} (a_i - b_i)^2}$$

This distance is usually turned into an affinity by using a radial basis function kernel or by taking the inverse $\frac{1}{1 + d(a, b)}$. Because different transformations of the Euclidean distance led to different adaptation behavior, we investigated this further and observed that the value range of the affinity measure affects the convergence of adaptation. If the value range does not allow for enough variety, the signal propagated to the network is not strong enough to transform the embeddings. We observe that taking the negative of the Euclidean distance as an affinity yields the best results, since it preserves the magnitudes of the distance measure itself.
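
The three affinity measures discussed above can be compared side by side in a minimal NumPy sketch. The function names, the random test embeddings and the `eps` guard against zero norms are assumptions made for illustration; this is not the implementation used in the experiments.

```python
import numpy as np

def dot_affinity(src, tgt):
    """Unnormalized dot product between every source/target pair."""
    return src @ tgt.T

def cosine_affinity(src, tgt, eps=1e-12):
    """Dot product of L2-normalized embeddings, bounded in [-1, 1]."""
    src_n = src / np.maximum(np.linalg.norm(src, axis=1, keepdims=True), eps)
    tgt_n = tgt / np.maximum(np.linalg.norm(tgt, axis=1, keepdims=True), eps)
    return src_n @ tgt_n.T

def neg_euclidean_affinity(src, tgt):
    """Negative Euclidean distance, so closer pairs get a larger affinity."""
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once.
    sq = (np.sum(src ** 2, axis=1)[:, None]
          + np.sum(tgt ** 2, axis=1)[None, :]
          - 2.0 * (src @ tgt.T))
    return -np.sqrt(np.maximum(sq, 0.0))  # clamp tiny negative rounding errors

# Hypothetical embeddings: 5 source and 7 target vectors of dimension 8.
rng = np.random.default_rng(0)
src = rng.normal(size=(5, 8))
tgt = rng.normal(size=(7, 8))

A_dot = dot_affinity(src, tgt)
A_cos = cosine_affinity(src, tgt)
A_neg = neg_euclidean_affinity(src, tgt)
```

Scaling the source embeddings by a constant scales `A_dot` by the same constant but leaves `A_cos` unchanged, which illustrates why the dot product is unbounded while cosine similarity has a fixed value range.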

The probabilistic interpretation of distances and importance of numerical stability We discussed how we transform affinities between embeddings into transition probabilities by using the softmax function in eq. (3.4). The softmax function produces very small values when classes with a very small presence occur in a large batch of embeddings, which is very often the case in semantic segmentation. This may cause the gradients to explode. It is therefore important to stabilize the values with a proper affinity measure. It can be argued that other normalization methods could be used to turn the affinity values into probability distributions. We show comparison results in Section 4, which indicate that softmax works best.

Initially, when training models on deep semantic segmentation architectures, exploding gradients would often occur, especially with the dot product similarity measure. To mitigate this, it proved important to stabilize the values fed to the softmax by the affinity measure. Using the negative Euclidean distance for the affinity computation worked best. Since values are softmaxed, the sign becomes unimportant but the magnitude still affects the behavior in a similar manner.
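
As an illustration of how affinities are turned into transition probabilities without overflow, the sketch below applies a row-wise softmax after subtracting the row maximum. The max-subtraction trick is a standard numerical technique and is an assumption here, not a step stated in this work; the function name and example values are hypothetical.

```python
import numpy as np

def stable_softmax(affinity, axis=1):
    """Softmax along `axis`, shifted by the maximum to avoid overflow."""
    # Subtracting the maximum does not change the result mathematically,
    # but keeps every exponent at or below zero.
    shifted = affinity - np.max(affinity, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

# Hypothetical affinities: one row with very large values (as an unbounded
# dot product may produce), one row of small negative distances.
A = np.array([[1000.0, 999.0, 0.0],
              [-5.0, -6.0, -7.0]])
P = stable_softmax(A)
```

Each row of `P` is a probability distribution over target embeddings. A naive `np.exp(A)` would overflow on the first row, which is one way large affinity magnitudes can destabilize training.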

In addition, the data normalization method affected the training results for semantic segmentation and helped avoid exploding gradients. Dataset mean subtraction and range normalization, where the mean is subtracted separately for source and target, worked best with the ResNet50-based DeepLab-V2 architecture. ResNet50 uses batch normalization per layer, so using mean-centered data as the network input stabilizes the value ranges during training.
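
A minimal sketch of per-domain normalization is given below. It assumes 8-bit images, interprets range normalization as division by 255, and uses a per-channel mean computed over each domain separately; these details and all names are assumptions, since the text does not specify them.

```python
import numpy as np

def normalize_domain(images):
    """Scale (N, H, W, C) images to [0, 1], then subtract this domain's mean.

    The mean is computed per channel over all images of the given domain,
    so source and target are each centered on their own statistics.
    """
    scaled = images.astype(np.float64) / 255.0            # range normalization
    mean = scaled.mean(axis=(0, 1, 2), keepdims=True)     # per-channel mean
    return scaled - mean

# Hypothetical data: the target domain is systematically brighter.
rng = np.random.default_rng(0)
source_images = rng.integers(0, 196, size=(4, 8, 8, 3))
target_images = source_images + 60

source_norm = normalize_domain(source_images)
target_norm = normalize_domain(target_images)
```

After normalization both domains are centered at zero per channel, even though their raw intensity statistics differ.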

3.2.3 Adapting segmentation models

The designated embedding layer

Not only the training stage but also the location in the network from which we extract pixel embeddings is important for obtaining good representations. If extracted too early in the network, they risk missing properties learned by later layers. However, we still want feature-level representations with relatively low spatial dimensionality.

We want to allow upsampling modules to capture global context earlier in the network than the features that will be used. In state-of-the-art architectures, the number of layers in the upsampling module is minimized so that each of them serves a clear purpose.

In addition, we observe that the dimensionality of pixel embeddings for semantic segmentation is crucial for convergence. If it is too large, the propagated gradients are noisy and adaptation is not very effective. However, the dimensionality has to be large enough to allow similar embeddings to group together while still preserving discriminable structures in latent space. Therefore it is important to add a dedicated embedding layer to the existing semantic segmentation architecture, where the embedding dimensionality can be adjusted according to the task. This layer is added in the decoder part of the base DeepLab-V2 [6] network, right before the final bilinear upsampling. This extension is illustrated in Figure 3.5.

Handling memory constraints

Because pairwise distances between pixel embedding vectors are required, the memory requirements of the associative method are large in the case of semantic segmentation. The matrix $A$ of pairwise distances has dimensionality $N \times M$, where $N$ and $M$ are the numbers of embedding samples for the source and target domains respectively.

In architectures such as DeepLab-V2 [6], where dilated convolutions are used to capture global context, the penultimate layer is a tensor downsampled 8 times in each spatial dimension with respect to the input. Bilinear upsampling is then used to obtain an output with the original input spatial size. We add the embedding layer, as explained above, between these two, preserving the 8 times downsampled size. Using downsampled data allows us to fit more pixel-level embeddings in memory. A similar approach can be taken with many modern segmentation architectures.
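
The effect of the 8 times downsampling on the size of the $N \times M$ matrix can be estimated with a short calculation. The crop size of 512 × 1024, one image per domain in the batch, and 4-byte floating point values are assumptions chosen only to make the arithmetic concrete.

```python
def affinity_matrix_bytes(height, width, n_src_images, n_tgt_images,
                          stride=8, bytes_per_value=4):
    """Bytes needed to store the N x M affinity matrix for one batch."""
    pixels_per_image = (height // stride) * (width // stride)
    n = n_src_images * pixels_per_image   # number of source embeddings
    m = n_tgt_images * pixels_per_image   # number of target embeddings
    return n * m * bytes_per_value

# Assumed setting: one 512 x 1024 crop per domain, float32 affinities.
full_res = affinity_matrix_bytes(512, 1024, 1, 1, stride=1)
downsampled = affinity_matrix_bytes(512, 1024, 1, 1, stride=8)

print(full_res / 2**30, "GiB at input resolution")
print(downsampled / 2**20, "MiB at 1/8 resolution")
```

Under these assumptions the matrix shrinks from 1024 GiB to 256 MiB. The factor is $8^4 = 4096$, because both $N$ and $M$ are reduced by $8^2$.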

In cases where memory is even more limited, a subsample of pixel embeddings can be used for association instead of the entire batch. We show in Section 4 that subsampling the pixel embeddings to a few times fewer than the batch contains does not hurt performance.

Figure 3.5: Extending a Deeplab V2 architecture with the embedding layer of adjustable dimensionality D.

Other sampling methods such as uniform, grid-based or sampling by max pooling have been investigated in this work. For future work, it could be interesting to investigate density-based embedding sampling as in [37].
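
Two of the sampling strategies mentioned above, uniform and grid-based, can be sketched as follows. The function names and the example sizes are hypothetical, and sampling by max pooling is omitted because its exact form is not described here.

```python
import numpy as np

def sample_uniform(emb, k, rng):
    """Pick k pixel embeddings uniformly at random from a (U, V, D) map."""
    flat = emb.reshape(-1, emb.shape[-1])
    idx = rng.choice(flat.shape[0], size=k, replace=False)
    return flat[idx], idx

def sample_grid(emb, step):
    """Keep every `step`-th pixel embedding in both spatial dimensions."""
    return emb[::step, ::step].reshape(-1, emb.shape[-1])

# Hypothetical embedding map with U = 16, V = 32, D = 4.
rng = np.random.default_rng(0)
emb = rng.normal(size=(16, 32, 4))

uniform_sample, uniform_idx = sample_uniform(emb, k=64, rng=rng)
grid_sample = sample_grid(emb, step=4)
```

The returned indices of `sample_uniform` can be used to select the matching downsampled labels on the source side, so that embeddings and labels stay aligned.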

Adaptation

Consider a source dataset $D^S = \{x^S_i, y^S_i\}$, $i = 1, 2, \dots, N$, where every data sample $x^S_i$ with spatial dimensionality $H \times W$ is annotated with the respective pixel-level labels $y^S_i$ of the same spatial dimensionality. The target images $D^T = \{x^T_j\}$, $j = 1, \dots, M$, are available without annotations. Using a trained network we extract embeddings $\phi(x^s)$ and $\phi(x^t)$ respectively, for which we use the simplified notation $\phi^s$, $\phi^t$.

Using the DeepLab-V2 semantic segmentation architecture as described above, we extract activations from the embedding layer in the decoder part of the network. These embeddings are considered at pixel level, on an activation map that is 8 times downsampled in each spatial dimension. This layer produces activations with dimensions $U \times V \times D$, where $U = H/8$, $V = W/8$ and $D$ is an arbitrary embedding dimensionality that is selected according to the task, usually between 64 and 128. In addition, we downsample the label annotations and use ${y'}^S$, where $y'_i \in \mathbb{R}^{U \times V}$, to match the downsampled source embeddings.
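
The shapes involved can be made concrete with a small sketch. The label downsampling method is not specified in the text; strided selection (keeping one label per 8 × 8 cell, which avoids interpolating class indices) is an assumption, as are the function names and example sizes.

```python
import numpy as np

def downsample_labels(labels, stride=8):
    """Keep one label per stride x stride cell of an (H, W) label map."""
    return labels[::stride, ::stride]

def flatten_embeddings(emb):
    """Reshape a (U, V, D) activation map into U*V pixel embeddings."""
    return emb.reshape(-1, emb.shape[-1])

# Hypothetical sizes: H = 64, W = 128, D = 64, hence U = 8 and V = 16.
rng = np.random.default_rng(0)
labels = rng.integers(0, 19, size=(64, 128))
emb = rng.normal(size=(8, 16, 64))

labels_small = downsample_labels(labels)        # shape (U, V)
pixel_emb = flatten_embeddings(emb)             # shape (U * V, D)
pixel_labels = labels_small.reshape(-1)         # shape (U * V,)
```

Row `u * V + v` of `pixel_emb` and entry `u * V + v` of `pixel_labels` refer to the same spatial location, which is what the association step relies on.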

For the experiments in this work we make use of all U ×V pixel embeddings for both source and target, extracted from a siamese architecture that shares weights for both domains and is trained on the source branch. We further train with the walker and visit loss to associate pixel-level embeddings, as explained in Section 3.1.
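
For orientation, the sketch below computes a walker and a visit loss in the form commonly published for associative learning: a round trip source to target to source should end in the same class, and all target embeddings should be visited. It uses the dot product affinity and a small `eps` inside the logarithm. Both choices, and all names, are assumptions; the formulation in Section 3.1 may differ from this sketch.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def associative_losses(src, tgt, src_labels, eps=1e-8):
    """Walker and visit loss for (N, D) source and (M, D) target embeddings."""
    A = src @ tgt.T                     # (N, M) affinities (assumed: dot product)
    p_st = softmax(A, axis=1)           # source -> target transition probabilities
    p_ts = softmax(A.T, axis=1)         # target -> source transition probabilities
    p_sts = p_st @ p_ts                 # (N, N) round-trip probabilities

    # Walker: round trips should end uniformly on same-class source samples.
    same = (src_labels[:, None] == src_labels[None, :]).astype(np.float64)
    target = same / same.sum(axis=1, keepdims=True)
    walker = -np.mean(np.sum(target * np.log(p_sts + eps), axis=1))

    # Visit: on average, every target embedding should be visited equally.
    visit_prob = p_st.mean(axis=0)      # (M,)
    visit = -np.mean(np.log(visit_prob + eps))
    return walker, visit

# Hypothetical data: 6 source embeddings in 2 classes, 10 target embeddings.
rng = np.random.default_rng(0)
src = rng.normal(size=(6, 16))
tgt = rng.normal(size=(10, 16))
src_labels = np.array([0, 0, 0, 1, 1, 1])

walker_loss, visit_loss = associative_losses(src, tgt, src_labels)
```

In the segmentation setting, `src` and `tgt` would be the flattened $U \times V$ pixel embeddings of the two branches and `src_labels` the downsampled source labels.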