Adversarial neural networks for multimodal semantic segmentation


MSc Artificial Intelligence

Master Thesis


by

Elias Kassapis

12409782

August 13, 2020

48 Credits November 2019 - August 2020

Supervisors:

Dr. Deepak K. Gupta

Georgi Dikov, MSc

Dr. ir. Cedric Nugteren

Assessors:

Dr. Deepak K. Gupta

Dr. Efstratios Gavves


Abstract

Ambiguities in images or unsystematic labelling by annotators can lead to multiple valid solutions in semantic segmentation. To learn a distribution over predictions, recent work has explored the use of probabilistic networks. However, these do not necessarily capture the empirical distribution accurately. In this thesis, we devise an approach capable of learning a calibrated multimodal predictive distribution, where the empirical frequency of the sampled predictions closely reflects that of the corresponding labels in the training set. To this end, we develop a novel two-stage cascaded strategy for calibrated adversarial refinement. In the first stage, the data is modelled with an explicit categorical likelihood. In the second, an adversarial network is trained to sample from it an arbitrary number of coherent predictions. This model can be used independently or integrated on top of any black-box semantic segmentation framework to enable the synthesis of diverse predictions. We demonstrate the versatility and robustness of the approach by attaining state-of-the-art results on the multigrader LIDC dataset and on a modified Cityscapes dataset with injected ambiguities. We also experiment on an illustrative toy regression dataset to show that our approach is not confined to semantic segmentation, and that the core design can be adapted to other tasks requiring learning a calibrated predictive distribution.


Acknowledgements

I would like to express my sincere gratitude to my supervisors, Deepak K. Gupta, Cedric Nugteren and Georgi Dikov, who have been incredibly helpful and supportive throughout this project, making it a fun and rewarding experience. Deepak’s advice, encouragement and experience in research have been a very useful resource throughout this project. Cedric’s point of view was always interesting and insightful, facilitating fruitful discussions in our meetings. I’ve been especially lucky for having Georgi as my daily supervisor, whose ideas, insights and perpetual constructive criticism were instrumental to the end result of this work. He has been unbelievably patient, and has worked tirelessly and tenaciously offering much more than what was expected from him.

I would like to extend my gratitude to Michael Hofmann for giving me the opportunity to conduct my thesis with the wonderful team at TomTom’s autonomous driving unit in Amsterdam, and to my colleagues and fellow interns for always being approachable for questions, and making this process an enjoyable learning experience. I would also like to thank Dr. Efstratios Gavves for agreeing to be the second reader of my thesis and part of the exam committee for my thesis defense.

Finally, I want to thank my parents for their unconditional support and encouragement and my friends for making the last two years a great and very enjoyable time. This accomplishment would not have been possible without them.


1 Introduction

Many real-world computer vision problems utilise semantic segmentation predictions from Convolutional Neural Networks (CNNs) for downstream decision-making. This offers a pixel-level understanding of images, which is fundamental to a myriad of applications within the areas of medical diagnostics, autonomous driving and robotic interaction, to name a few. For instance, high definition map making for self-driving vehicles hinges on precise localisation of lane dividers or road obstacles, e.g. side rails or barriers, to ensure safety in autonomous driving. Likewise, in the clinical domain, fine-grained localisation of tissues is crucial in many clinical diagnoses and treatments, including radiotherapy [1] or tumor surveillance [2]. Automatic segmentation of images with deep learning approaches also offers a more robust and scalable solution than manual annotation. As a result, such models enjoy a high demand and widespread use in both industrial and academic settings, improving automation in workflows.

Be that as it may, training models that generate highly accurate semantic segmentation maps can be very challenging due to the abundance of ambiguities in images, allowing for multiple valid solutions. These can emanate from an array of sources, such as indeterminate class definitions [3] (i.e. ambiguous label space), sensor noise, occlusions, and inconsistent or incorrect annotation by humans during the data acquisition process. Nevertheless, the majority of the research encompassing semantic segmentation focuses on optimising models that assign a single prediction to each input image [4–11]. These are often incapable of capturing the entire empirical distribution of outputs as they optimise for a one-size-fits-all solution for each input. Consequently, conflicts between labels corresponding to the same input image can compromise the reliability of the final predictions. To resolve such ambiguities, ideally, one would sample multiple consistent hypotheses, and leverage uncertainty information to identify potential errors in each one.

Several approaches have been proposed to capture multimodality in image-to-image translation tasks [12–16], with only a few of them applied to multimodal semantic segmentation [17–20]. These methods are capable of learning a diverse set of labels for each input; however, they are typically limited to a fixed number of samples, return uncalibrated predictions, or do not account for uncertainty. First, restricting to a fixed number of predictions can hinder a model from learning all of the modes of the predictive distribution [17]. Second, an uncalibrated predictive distribution is one that does not accurately reflect the occurrence frequencies of individual modes in the training set [21, 22]. Finally, uncertainty information is highly desirable, as it can be used to improve model training [23, 24], or for distinguishing between predictions that can be trusted and those that should be reassessed [25]. In this thesis, we attempt to devise a method addressing all three challenges.

1.1 Problem Statement

Even though in the real world there may exist a unique ground-truth solution, ambiguities in images for semantic segmentation allow for several valid interpretations of the input $x$, e.g. different hypotheses over the precise location of object boundaries. That is, multiple solutions $y_1, \dots, y_n$ may be consistent with the evidence in an input image $x$ (e.g. Fig. 1). Modelling and quantifying the ensuing distributions and uncertainties can be instrumental for quality control in semi-automation and for diminishing risk in downstream decision making. In this work we aim to learn a multimodal conditional distribution over admissible solutions, and model the associated uncertainty for each input image $x$. Further, we require the learnt distribution to be calibrated; that is, to reflect the probability with which each segmentation variant occurs in the empirical distribution.

Figure 1: Example of multimodality in semantic segmentation from the LIDC dataset [26] for lung lesion segmentation. The input image x is a CT scan of lung tissue illustrating a possible malignancy. Four expert annotators disagree on the presence or shape of a tumour allowing for 4 distinct solutions.


1.2 Contributions

In this thesis, we explore schemes that address these challenges and propose a novel two-stage cascaded strategy for adversarial refinement that facilitates the learning of a calibrated multimodal predictive distribution, and allows extraction of reliable data uncertainty estimates [25]. We analyse the benefits of our approach with experiments on illustrative toy regression and 2-class segmentation problems, and on two real-world semantic segmentation datasets: the multigrader LIDC dataset for lung lesion segmentation and a modified, stochastic Cityscapes dataset, introduced in [17]. Our key contributions are as follows:

• We propose a novel cascaded architecture that constructively combines explicit likelihood modelling with adversarial refinement to sample an arbitrary number of self-consistent, high-quality predictions given an input image.

• We introduce a novel loss term that complements our two-stage strategy, leveraging the per-pixel categorical distribution learnt in the first stage of our approach to induce calibrated multimodality in the predictive distribution at the refinement stage, and to stabilise the training of our model.

• We demonstrate that the proposed model can be trained independently or used to augment any pre-trained black-box semantic segmentation model, endowing it with a multimodal predictive distribution.

1.3 Thesis Outline

The structure of the rest of this document is as follows. In Section 2, we lay the conceptual ground on which our method stands, introducing highly pertinent notions to the problem we are addressing. Subsequently, we discuss closely related work on adversarial and ambiguous semantic segmentation, and more generally in multimodal image-to-image translation in Section 3. Thereafter, in Section 4 we motivate and describe the salient features of our methodology in detail. Our experiments and experimental results are then presented and analysed in Section 5. Finally, Section 6 concludes this work with a critical review of our contributions, and an outlook on possible directions for future work.



2 Background

In this section we lay out the basic definitions and necessary background on which this thesis will be positioned.

2.1 Semantic Segmentation

Semantic segmentation is a visual scene understanding problem formulated as a $K$-class dense classification task. The goal is to partition an image into multiple segments by assigning a class label $y_i$ to every pixel in an input image, where $i \in \{1, \dots, K\}$. Fig. 2 illustrates a schematic overview of the labelling process. While early approaches relied on machine learning models utilising hand-crafted features, such as Histogram of Oriented Gradients (HOG) [27] or Scale Invariant Feature Transform (SIFT) [28], in recent years the field has been dominated by deep learning approaches based on CNNs, which prescribe joint feature and model learning. With convolutions as their basic building block, these learn to extract a hierarchy of non-linear features from raw input, capturing translation invariant representations of the data [29]. Since the introduction of Fully Convolutional Networks (FCNs) for semantic segmentation by Long et al. [30], CNN-based approaches have revolutionised the field, achieving unprecedented improvements in prediction accuracy.

Figure 2: Overview of the process of semantic segmentation. Each colour represents a different class (e. g. road, sidewalk, car).

The archetypal architecture for contemporary CNNs for semantic segmentation is the encoder-decoder design, utilised by the likes of UNet [4], SegNet [31], DeepLabV3+ [32] or more recently the HRNet [33]. These approaches use skip-connections [34] to combine fine-grained localisation information extracted from early layers with high-level semantic information modelled at the later layers of the network, improving segmentation quality. The main problem of the field remains optimally integrating localisation information with global context to bridge the remaining performance gap [35]. Accordingly, recent research has yielded a panoply of strategies designed to tackle this issue, such as aggregation of multi-level features and capturing multi-scale contextual information using pyramidal architectures [36], dilated convolutions [37] or attention modules [38]. Following this trend, the current state-of-the-art is held by the HRNet equipped with multi-scale, hierarchical self-attention [39, 40]. A recent survey on the evolution of semantic segmentation models, covering approaches predating the deep learning era up to current state-of-the-art models, can be found in [41].

2.2 Discriminative Modelling

The majority of the work on such deep learning approaches for semantic segmentation focuses on deterministic discriminative models, which aim to approximate the conditional distribution $p_{\mathcal{D}}(y \mid x)$ for a given semantic segmentation dataset $\mathcal{D}$, with a likelihood $q_\theta(y \mid x)$ modelled by a neural network with parameters $\theta$. These assign conditional class probabilities to an input datapoint $x$ according to its distance from learned class boundaries. To learn these mappings, the ground truth labels $y$ are typically expressed in a one-hot format, and the class probabilities are realised as softmax vectors, defining $q_\theta(y \mid x)$ as a pixel-wise factorised categorical distribution. This allows the model to be tractably optimised with maximum likelihood estimation (MLE) using a pixel-wise categorical cross entropy (CE) loss function, given by:

$$\mathcal{L}_{\mathrm{ce}}(\mathcal{D}, \theta) = -\mathbb{E}_{p_{\mathcal{D}}(x, y)}\left[\log q_\theta(y \mid x)\right]. \tag{1}$$
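As an illustration of Eq. (1), the following is a minimal sketch of a pixel-wise categorical cross entropy computed on a toy input. The nested-list layout (height, width, classes), the function names and the toy values are assumptions made for this example and are not taken from the thesis; class indices are 0-based here, whereas the text indexes classes from 1 to K.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pixelwise_cross_entropy(logit_map, label_map):
    """Mean categorical cross entropy over all pixels.

    logit_map: H x W x K nested lists of class logits (hypothetical layout).
    label_map: H x W nested lists of integer class indices in [0, K).
    """
    total, count = 0.0, 0
    for logit_row, label_row in zip(logit_map, label_map):
        for logits, label in zip(logit_row, label_row):
            probs = softmax(logits)
            # With a one-hot target, only the true-class term survives.
            total += -math.log(probs[label])
            count += 1
    return total / count

# Toy 1 x 2 image with K = 2 classes.
logit_map = [[[2.0, 0.0], [0.0, 0.0]]]
label_map = [[0, 1]]
loss = pixelwise_cross_entropy(logit_map, label_map)
```

Each pixel contributes its own term to the sum, which is exactly the factorisation over pixels that the loss assumes.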

It is important to note that this loss formulation assumes independence between different pixels, as all class label variables are optimised independently from each other. As a result, even though CNNs may harbour the architectural capacity to learn long-distance correlations between pixels, e.g. by having a sufficiently large receptive field and being universal function approximators [42], in practice they tend to miss important global structural qualities when optimised with Eq. (1). Since $\mathcal{L}_{\mathrm{ce}}$ does not account for higher-order consistencies in the output [43], these models are not incentivised to learn distributed regularities beyond the bare minimum necessary to discriminate between classes. Further, since MLE-based optimisation converges on averaging all observed one-hot labels for a given input image $x$, when optimised with Eq. (1), the trained model learns to map $x$ to an approximation of $\mathbb{E}_{p_{\mathcal{D}}}[y \mid x]$ [44] in the form of a per-pixel categorical distribution, capturing the different class probabilities over each pixel of the corresponding label. The predicted label is then commonly obtained by applying the argmax function along the class dimension of the predicted softmax distribution, collapsing it to a 2D segmentation map. While this works well for unambiguous semantic segmentation, label noise can lead to blurry, unclear predictions, as correlations between adjacent pixels are degraded when independently averaging each pixel across conflicting labels.
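The averaging behaviour described above can be checked numerically on a single pixel. The sketch below assumes a hypothetical binary pixel annotated four times with conflicting labels, and uses a simple grid search (chosen here for transparency, not because it is how networks are trained) to find the prediction that minimises the average cross entropy.

```python
import math

def expected_ce(q, labels):
    """Average cross entropy of a prediction q for class 1 against a list
    of binary labels observed for the same pixel."""
    return -sum(math.log(q if y == 1 else 1.0 - q) for y in labels) / len(labels)

# Hypothetical pixel labelled as class 1 by 3 of 4 annotators.
labels = [1, 1, 1, 0]
empirical_mean = sum(labels) / len(labels)

# Grid search over candidate predictions in (0, 1).
candidates = [i / 100 for i in range(1, 100)]
best_q = min(candidates, key=lambda q: expected_ce(q, labels))

# argmax over the class dimension collapses the distribution to one label,
# discarding the information that one annotator disagreed.
probs = [1.0 - best_q, best_q]
hard_label = max(range(len(probs)), key=probs.__getitem__)
```

The minimiser coincides with the mean of the observed labels, and the subsequent argmax returns only the majority class.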

Another undesirable outcome of this approach is that, due to its deterministic nature, it cannot produce multiple alternative predictions for the same input image. On the other hand, direct sampling from $q_\theta(y \mid x)$ yields spatially incoherent segmentation maps, since $q_\theta$ is a per-pixel distribution, and maximising the factorised likelihood encourages high entropy for pixels within regions of inter-label inconsistency, e.g. pixels lining fuzzy object boundaries, as exemplified later in the model overview diagram in Fig. 4. To synthesise self-consistent segmentation maps, it is essential to design a mechanism that does not sample each pixel independently, but instead leverages knowledge of meaningful image-global relationships between pixels.
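To make the incoherence of independent per-pixel sampling concrete, consider a hypothetical strip of four pixels with only two valid labellings, all background or all foreground, each occurring half of the time. The sketch below, with illustrative names and values that do not come from the thesis, computes how much probability mass a factorised distribution with the correct per-pixel marginals places on the two valid labellings.

```python
import itertools

# Hypothetical 1D "image" of 4 pixels with two equally likely ground-truth
# labellings: all background (0) or all foreground (1).
modes = [(0, 0, 0, 0), (1, 1, 1, 1)]
n_pixels = len(modes[0])

# Per-pixel marginal probability of class 1, as a factorised model learns it.
marginals = [sum(m[i] for m in modes) / len(modes) for i in range(n_pixels)]

def factorised_prob(sample, marginals):
    """Probability of a full labelling when pixels are sampled independently."""
    p = 1.0
    for y, q in zip(sample, marginals):
        p *= q if y == 1 else 1.0 - q
    return p

# Probability mass that independent sampling puts on the valid labellings.
coherent_mass = sum(factorised_prob(m, marginals) for m in modes)

# Mass assigned to labellings that mix both classes (never seen in the data).
all_samples = itertools.product([0, 1], repeat=n_pixels)
incoherent_mass = sum(
    factorised_prob(s, marginals) for s in all_samples if s not in modes
)
```

Even with perfectly estimated marginals, most samples mix the two labellings, and the share of valid samples shrinks exponentially with the number of ambiguous pixels.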

2.3 Generative Modelling

Instead of learning the boundaries between different classes, generative modelling aims to model the data distribution; that is, to learn the high-dimensional joint distribution $p_{\mathcal{D}}(x, y)$. Fig. 3 provides an intuitive illustration of the difference between discriminative and generative approaches. Generative models are explicitly encouraged to learn long-distance regularities between pixels in order to generate predictions resembling samples from the marginal distribution. By jointly modelling the inputs $x$ and the outputs $y$, they are capable of capturing complex inter-dependencies between them, allowing the synthesis of high-dimensional, self-consistent predictions.

In practice, generative models can show a poorer generalisation performance for classification than discriminative models, due to discrepancies between the learnt and the true distribution of the data [46]. However, they offer other important advantages relevant to our use case, making them an attractive alternative. Unlike discriminative models, generative models hold the ability to sample diverse coherent predictions, providing a natural way of dealing with uncertainty—when the model is not certain, it can produce multiple alternative hypotheses. Further, these are less prone to overfitting and not limited to strictly labelled data, and therefore preferred for small data regimes or unsupervised learning [46, 47]. Another benefit is that they can be used to increase the amount of training data [46] (i.e. for training other models), as they enable the generation of synthetic examples by sampling from the learnt joint distribution.


Generative models can have explicit or implicit density functions. In the remainder of this subsection, we provide background on two important generative frameworks, one from each category, that are used in our experiments presented in Section 5. We then proceed with a brief description of conditional generative models, which promise to overcome several shortcomings of their discriminative counterparts for semantic segmentation.

Variational Autoencoders. Variational Autoencoders (VAEs) [48] explicitly model the likelihood of the data through an intermediary latent variable $z$. To do so, they combine a probabilistic encoder (i.e. encoding a distribution rather than a point estimate) with a deterministic decoder network to learn an approximate posterior distribution $q(z \mid x)$ using variational inference. This involves maximising a lower bound on the log-likelihood of the data by optimising the Evidence Lower Bound (ELBO) objective [48], defined as:

$$\mathcal{L}_{\mathrm{ELBO}}(x) = \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] - \mathrm{KL}\left(q(z \mid x) \,\|\, p(z)\right), \tag{2}$$

where the first term $\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)]$ is the expected reconstruction error, and the second term $\mathrm{KL}(q(z \mid x) \,\|\, p(z))$ is the reverse KL-divergence between the variational posterior $q(z \mid x)$ and a prior distribution $p(z)$. Accordingly, the former is referred to as the reconstruction term, and the KL-divergence is called the complexity term, because it acts as a regulariser, encouraging the approximate posterior to conform to the prior. This objective strikes a balance between reconstruction quality and regularisation, to learn an approximate posterior capturing the data manifold while being amenable to straightforward sampling during inference.

Note that this framework prescribes a number of important design choices that can affect the model performance dramatically. Most critical are the selection of the prior distribution on the latent space $p(z)$, the family of the approximate posterior distribution $q(z \mid x)$, and the decoder distribution $p(x \mid z)$. The tightness of the lower bound induced by Eq. (2), and therefore the quality of the loss signal, depends on the expressiveness of the variational posterior [50], which in turn hinges on the first two abovementioned criteria. However, in practice, both $p(z)$ and $q(z \mid x)$ are commonly taken as factorised Gaussian distributions, with $p(z) = \mathcal{N}(0, I)$ to allow the complexity term to be integrated analytically. On the other hand, the reconstruction term is estimated using samples $z \sim q(z \mid x)$; therefore, the reparametrisation trick from [48] is employed to backpropagate gradients through the sampling operation, enabling end-to-end training of these networks using gradient-based optimisation methods.
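The two ingredients mentioned above, the analytic complexity term for factorised Gaussians and the reparametrised sampling of the latent variable, can be sketched in a few lines. The log-variance parametrisation of the encoder output, the function names and the numerical values are assumptions made for this example; no gradients are computed here, the sketch only shows the quantities involved.

```python
import math
import random

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )

def reparametrised_sample(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives entirely in eps, so z is a deterministic function
    of mu and log_var, which is what allows gradients to pass through.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]

rng = random.Random(0)

# Hypothetical encoder outputs for one input x (2-dimensional latent space).
mu = [0.5, -1.0]
log_var = [0.0, math.log(0.25)]

z = reparametrised_sample(mu, log_var, rng)
kl = kl_diag_gaussian_to_standard_normal(mu, log_var)

# The complexity term vanishes when the posterior equals the prior.
kl_at_prior = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0])
```

In a full VAE, the sample `z` would be passed to the decoder to estimate the reconstruction term of Eq. (2), while `kl` provides the complexity term without any sampling.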

By these means, VAEs learn to encode each input x as a distribution over low-dimensional latent codes z, each encapsulating a different plausible solution. Since the decoder is optimised to reconstruct x from sampled latents z, the networks are required to retain in z as much information from the input as possible. Compressing the high-dimensional input into these latent variables forces the model to learn complex dependencies between the dimensions, yielding abstract representations that can be useful for downstream tasks. For example, these can learn to abstractly describe natural images in terms of relationships between the rendered objects, which could be leveraged to classify the images [51]. For more information on the latest advances in autoencoder-based representation learning, we refer the reader to [52].

Generative Adversarial Networks. Introduced in the seminal work by Goodfellow et al. [53], Generative Adversarial Networks (GANs) learn to directly sample from the data distribution without optimising an explicit likelihood. This framework involves training a differentiable generator network and an adversary in a two-player minimax game. Formally, the adversary takes the form of a binary classification network $D$, trained to optimally discriminate between ground truth samples and predictions. The generative network $G$ is then concurrently trained to maximise the probability that synthesised predictions are perceived as real by $D$. The non-saturating versions of the objectives used to train the two networks, as proposed by [53], are given by:

$$\mathcal{L}_{\mathrm{adv}}(D) = -\mathbb{E}_{p(\epsilon)}\left[\log D(G(\epsilon))\right], \tag{3}$$

$$\mathcal{L}_{D}(D) = -\mathbb{E}_{p_{\mathcal{D}}(x),\, p(\epsilon)}\left[\log\left(1 - D(G(\epsilon))\right) + \log D(x)\right], \tag{4}$$

where $\epsilon$ is a noise variable sampled from a fixed noise distribution $p(\epsilon)$, typically taken as a standard Gaussian. In essence, the GAN setup leverages the maximum likelihood learning paradigm to tailor an adaptive loss function, parametrised by $D$, on the training data. In turn, $\mathcal{L}_{\mathrm{adv}}$ is used to train $G$ to map a sample from the noise distribution $p(\epsilon)$ to a sample from the marginal distribution $p(x)$. Thus, $G$ learns to trace out a manifold on the noise distribution whose points resemble training points. Similarly to the function of the decoder in VAEs, the generator $G$ receives an input latent variable, and maps it to a high-dimensional observation $x$. Unlike explicit pixel-wise likelihood maximisation, the adversarial setup optimises on an image-global score rather than an array of independently scored pixels, as the discriminator $D$ maps input segmentation maps to a single scalar representing the probability that the input is real. As a result, $G$ implicitly models the joint pixel configuration of data samples, generating samples whose local and global regularities match those displayed in real observations in order to satisfy $D$.
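The two objectives in Eq. (3) and Eq. (4) can be evaluated directly from discriminator outputs. The sketch below is a minimal illustration under the assumption that the expectations are replaced by batch averages; the discriminator outputs are made-up numbers and the function names are chosen for this example only.

```python
import math

def generator_loss(d_fake):
    """Non-saturating generator loss, Eq. (3): -mean(log D(G(eps)))."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

def discriminator_loss(d_real, d_fake):
    """Discriminator loss, Eq. (4):
    -(mean(log(1 - D(G(eps)))) + mean(log D(x)))."""
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    return -(fake_term + real_term)

# Hypothetical discriminator outputs (probability that the input is real).
d_real = [0.9, 0.8]   # D(x) on ground truth samples
d_fake = [0.2, 0.1]   # D(G(eps)) on generated samples

loss_g = generator_loss(d_fake)
loss_d = discriminator_loss(d_real, d_fake)

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
loss_d_confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```

Note that each loss depends on a single scalar per sample, which is the image-global score discussed above, rather than on a sum over independently scored pixels.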

Despite their impressive results in image generation [54, 55], training GANs is notoriously difficult. The adversarial training objective embodies a saddle point optimisation problem, therefore convergence is not guaranteed with gradient descent based optimisation, often leading to oscillatory solutions [53, 56–58]. Further, training dynamics are sensitive to random weight initialisation, hyperparameter settings, and architectural choices. Even if the training procedure is stabilised, the resulting generator may learn to synthesise samples that resemble only a few regions of the data distribution, in a phenomenon known as mode-collapse [53]. As a result, training GANs had been previously characterised as more art than science [59]. However, significant advances have been made in recent years on improving the training stability and robustness of GANs, such as complementing the adversarial loss term with other more stable losses [60], using learning rate equalisation [61], or introducing zero-centered gradient penalties to the loss function [57].

Conditional Generative Models. Generative models can be easily adapted to sample from the conditional distribution $p(y \mid x)$ rather than from the joint distribution $p(x, y)$ by conditioning the generative process on an input $x$. The resulting conditional generative models then jointly optimise for accurate modelling of the data distribution and predictive performance on discriminative tasks [62]. Consequently, they are able to capture higher-level semantic consistencies to generate coherent samples, while conditioning on $x$ makes it possible to direct the data generation [63]. Therefore, conditional generative models enjoy complementary benefits from both generative and discriminative modelling. They are able to learn one-to-many mappings from input to output, and mappings across different domains [64], making them especially attractive for our use case of multimodal semantic segmentation. Further, they have shown improved robustness against outliers [65] and adversarial attacks [66], compared to discriminative models.

Naturally, image-conditional generative models [12, 13, 60, 64], such as conditional VAEs [67] (cVAEs) and conditional GANs [64, 68] (cGANs) have been investigated extensively for image-to-image translation [63, 64, 68], and more recently, have been used in forays in the more specialised domain of ambiguous semantic segmentation [17]. We further elaborate on work on conditional generative approaches closely related to our method in Section 3.

2.4 Predictive Uncertainty in Computer Vision

There are several sources of predictive uncertainty in deep learning; however, in computer vision, it is typically decomposed into two major types: epistemic uncertainty and aleatoric uncertainty.

Epistemic uncertainty. This concerns the uncertainty over which parameter configuration for a given model architecture best fits the data. This is a consequence of the finite nature of the available data, allowing multiple model configurations to be consistent with the given set of observations. It follows that epistemic uncertainty is heavily dependent on the representative power of a training set: increasing the amount and diversity of datapoints used to train a sufficiently powerful model reduces this type of uncertainty. This is because fewer model configurations can account for all datapoints in a larger set of observations [69]. Consequently, modelling epistemic uncertainty is more useful when we have a small training dataset.

Epistemic uncertainty is typically modelled using Bayesian neural networks [70, 71], with many approaches using variational inference to approximate the variational posterior $q(\theta \mid x, y)$ over model parameters $\theta$ [72]. Despite being principled, these methods often complicate inference, significantly increase the number of parameters, and are not scalable to large datasets [73]. Practical and scalable alternatives are deep ensembles [74] or dropout variational inference [75], also known as Monte Carlo dropout (MC Dropout). For the latter, inference is performed by running multiple stochastic forward passes on a fixed input while keeping dropout active at test time, where the dropout at each forward pass acts as a sampling mechanism, drawing model samples from the variational posterior.

In practice, epistemic uncertainty is projected on the output space by computing the predictive variance between multiple models sampled from $q(\theta \mid x, y)$. The collected epistemic uncertainty is then low in familiar regions of the input space, that is, for datapoints that have been encountered during training, and higher in regions of the input space that are not well-represented in our training distribution. Datapoints that reside in unseen regions of the data space, or are drawn from an unfamiliar distribution, are referred to as out-of-distribution (OOD) datapoints [76]. Detecting OOD inputs is especially important in safety-critical applications, as it can be leveraged to identify potential errors in model predictions attributed to the lack of data [77].
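The procedure of collecting a predictive variance over sampled models can be sketched with MC Dropout on a deliberately tiny model. Everything in the block below is an assumption made for illustration: the model is a single linear unit with hypothetical weights, and dropout is applied to its inputs with the usual inverted rescaling, which is far simpler than the segmentation networks discussed in this thesis.

```python
import random
import statistics

def dropout_forward(x, weights, drop_prob, rng):
    """One stochastic forward pass of a linear model with inverted dropout
    applied to the input features (kept active at test time)."""
    keep_prob = 1.0 - drop_prob
    out = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep_prob:
            out += wi * xi / keep_prob  # rescale so the expectation is unchanged
    return out

def mc_dropout_predict(x, weights, drop_prob, n_passes, seed=0):
    """Predictive mean and variance over repeated stochastic passes."""
    rng = random.Random(seed)
    preds = [dropout_forward(x, weights, drop_prob, rng) for _ in range(n_passes)]
    return statistics.mean(preds), statistics.pvariance(preds)

weights = [0.5, -1.0, 2.0]   # hypothetical trained parameters
x = [1.0, 2.0, 3.0]          # fixed input

mean_mc, var_mc = mc_dropout_predict(x, weights, drop_prob=0.5, n_passes=2000)
mean_det, var_det = mc_dropout_predict(x, weights, drop_prob=0.0, n_passes=10)
```

With dropout disabled every pass returns the same value and the variance is zero; with dropout enabled, the spread of the predictions serves as the epistemic uncertainty estimate for this input.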

Aleatoric uncertainty. This refers to uncertainty over data due to noise and ambiguities inherent in the observations, allowing for the validity of multiple labels. It is further subdivided into homoscedastic and heteroscedastic uncertainty. Homoscedastic uncertainty, also known as task-dependent aleatoric uncertainty, assumes constant observation noise across the entire input space, whereas heteroscedastic uncertainty assumes an uneven distribution of noise levels [25]. Modelling aleatoric uncertainty is formalised by placing a probability distribution over the model outputs, where we learn to associate each input $x_i$ with a range of outputs $y_i^{j}, \dots, y_i^{N}$ corresponding to the amount of noise of the data [78].

Aleatoric uncertainty is typically modelled directly using deep neural networks to approximate the conditional distribution $p(y \mid x)$. For regression tasks, $p(y \mid x)$ is commonly modelled with Gaussian or Laplace distributions by using the L2 or L1 loss function, respectively [73]. In the homoscedastic uncertainty scenario, a single variance or scaling term is learned for the entire input space, whereas for heteroscedastic uncertainty, a separate term is predicted for each input datapoint [25]. Note that this limits the noise to a unimodal distribution, even though the conditional distribution $p(y \mid x)$ is often multimodal. For classification tasks, the gold standard is to use a softmax output layer to obtain a categorical distribution capturing class probabilities, enabling direct modelling of heteroscedastic aleatoric uncertainty as the entropy across the class dimension [24]. Even though this approach has the advantage of being able to model multimodal, non-symmetric, or heavy-tailed heteroscedastic noise, it often leads to uncalibrated results [21, 74, 79], as discussed in Section 2.5 below.
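Reading aleatoric uncertainty off a softmax output amounts to computing the entropy of the per-pixel categorical distribution. A minimal sketch follows; the three example pixels and their probabilities are invented for illustration.

```python
import math

def categorical_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution.
    Zero-probability classes contribute nothing to the sum."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical softmax outputs for three pixels over K = 3 classes.
confident_pixel = [0.98, 0.01, 0.01]
ambiguous_pixel = [0.5, 0.5, 0.0]      # e.g. on a fuzzy object boundary
uniform_pixel = [1 / 3, 1 / 3, 1 / 3]

h_confident = categorical_entropy(confident_pixel)
h_ambiguous = categorical_entropy(ambiguous_pixel)
h_uniform = categorical_entropy(uniform_pixel)
```

The entropy is near zero for the confident pixel and reaches its maximum of log K for the uniform one, so a per-pixel entropy map highlights the ambiguous regions of an image.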

Notice that, in contrast to epistemic uncertainty, aleatoric uncertainty cannot be used to identify OOD inputs, and cannot be explained away with more data. Instead it becomes refined; that is, the conditional distribution $p(y \mid x)$ becomes more accurate. Unlike epistemic uncertainty, heteroscedastic modelling can be used to identify ambiguous regions of the data space, and can be leveraged to improve model robustness against noise via loss attenuation, which guides the contribution of each datapoint to the loss by weighting loss residuals according to its aleatoric noise value. Aleatoric uncertainty is more useful for large datasets [25].
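One common way to realise loss attenuation for regression is a Gaussian negative log-likelihood with a predicted per-datapoint log-variance. The sketch below assumes that form, which may differ in detail from the formulation in [25]; the data values and the name `attenuated_loss` are hypothetical.

```python
import math

def attenuated_loss(y_true, y_pred, log_var):
    """Gaussian negative log-likelihood (up to a constant) with a predicted
    per-datapoint log-variance: squared residuals are down-weighted where
    the predicted noise is high, at the price of a log-variance penalty."""
    total = 0.0
    for y, f, s in zip(y_true, y_pred, log_var):
        total += 0.5 * math.exp(-s) * (y - f) ** 2 + 0.5 * s
    return total / len(y_true)

y_true = [1.0, 5.0]
y_pred = [1.1, 2.0]   # the second point has a large residual

# Equal unit variance everywhere: plain (halved) squared error.
loss_equal = attenuated_loss(y_true, y_pred, log_var=[0.0, 0.0])

# Predicting a variance of 9 for the noisy point attenuates its residual.
loss_attenuated = attenuated_loss(y_true, y_pred, log_var=[0.0, math.log(9.0)])
```

The log-variance penalty prevents the trivial solution of predicting infinite noise everywhere, so a high variance is only worthwhile where the residual is genuinely large.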

2.5 Calibration of Modern Neural Networks

Even though modern deep neural networks for multi-class classification achieve very high accuracy in their predictions, the learned softmax probabilities tend to be poorly calibrated, returning highly confident predictions even when they are wrong [21, 22, 80–82]. In other words, the predicted probability associated with each class label often does not reflect the corresponding ground truth correctness likelihood. A perfectly calibrated prediction, on the other hand, is one where the predicted probabilities match the frequencies with which the class conditionals occur in the empirical distribution, e.g. if a given object in an input image is classified as "car" 1 out of 10 times it appears in the images of a training set, then the "car" class conditional should be assigned a probability of 0.1.
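The "car" example can be turned into a small numerical check. The sketch below compares predicted probabilities with empirical frequencies inside equal-width probability bins, in the spirit of commonly used calibration error measures; it is an illustrative assumption and not necessarily the calibration metric used later in this thesis.

```python
def calibration_gap(pred_probs, outcomes, n_bins=10):
    """Weighted average gap between the mean predicted probability and the
    empirical frequency of the event, over equal-width probability bins."""
    bins = {}
    for p, y in zip(pred_probs, outcomes):
        b = min(int(p * n_bins), n_bins - 1)
        bins.setdefault(b, []).append((p, y))
    gap = 0.0
    for members in bins.values():
        mean_prob = sum(p for p, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        gap += len(members) / len(pred_probs) * abs(mean_prob - frequency)
    return gap

# Ten hypothetical occurrences of an object that is labelled "car" once.
is_car = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

gap_calibrated = calibration_gap([0.1] * 10, is_car)     # predicts p(car) = 0.1
gap_overconfident = calibration_gap([0.9] * 10, is_car)  # predicts p(car) = 0.9
```

A model that assigns 0.1 to "car" has a gap of essentially zero, whereas an overconfident model assigning 0.9 is off by 0.8.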

This phenomenon has been linked to a host of recent training and architectural trends of modern neural networks, such as the use of a softmax output layer, lack of sufficient regularisation methods, batch normalisation [21], and negative log-likelihood (NLL) optimisation (also called cross entropy optimisation in the context of deep learning). NLL overfitting has been shown to be a particularly important contributor to this issue, encouraging overconfidence in model predictions even when the evidence in the data is ambiguous [21]. This compromises the reliability of these models, as their predictions do not signal their probability of being correct, which can be crucial for risk assessment in safety-critical applications. For instance, if a semi-autonomous car uses predictions from a neural network to detect pedestrians, the car should be able to notify the driver if it is uncertain whether an obstruction is a pedestrian or not.

To counter this, several post-hoc calibration methods have been proposed, such as decomposing the problem into k one-vs-rest binary isotonic [83] or beta [79] calibration tasks, and binning calibration with either equal-frequency or equal-width bins [84]. An alternative approach is to modify the learning algorithm directly. The simplest such modification is temperature scaling, where a temperature parameter is used to "soften" the softmax (i.e. adjust the output entropy) by scaling the input logits [85]. Other related methods include relaxed softmax [86] for auto-calibration, where the model itself predicts a sample-dependent temperature parameter, or Dirichlet calibration of the logit space [82]. More recently, label smoothing has also been shown to implicitly improve calibration [87]. However, none of these provides a foolproof solution, so calibrating the predictions of modern neural networks remains an open problem.
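The effect of temperature scaling on the softmax can be illustrated with a minimal NumPy sketch. The logits and temperature values below are illustrative assumptions and are not taken from any experiment in this thesis; in practice the temperature is usually tuned on held-out data.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax over the last axis after dividing the logits by `temperature`."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([4.0, 1.0, 0.5])          # illustrative logits (assumption)
p_sharp = softmax(logits, temperature=1.0)  # unscaled softmax
p_soft = softmax(logits, temperature=3.0)   # T > 1 flattens the distribution
```

Dividing by a temperature above 1 lowers the confidence of the top class without changing which class is ranked first, so accuracy is unaffected while the output entropy increases.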

3 Related Work

In this section we discuss closely related work within the domains of adversarial and ambiguous semantic segmentation in detail, and subsequently explore relevant approaches used in the more general domain of image-to-image translation.

3.1 Adversarial Semantic Segmentation

Luc et al. [43] were the first to propose using cGANs to enforce higher-level semantic consistencies in semantic segmentation predictions. They complemented the standard CE loss with an adversarial loss term to train models achieving improved accuracy over previous state-of-the-art results on the Stanford Background [88] and PASCAL VOC 2012 [89] datasets. This approach was later formalised in a unified framework for multipurpose image-to-image translation by Isola et al. in [64] yielding sharp, confident predictions with high fidelity to the high-level structure in the conditioning input image. Ghafoorian et al. [90] modified the standard adversarial loss objective employed in these approaches, to discriminate based on embeddings corresponding to real and predicted input, extracted from intermediate layers of the discriminator. Therefore, rather than training the discriminator on real or fake segmentation maps independently, the discriminator directly contrasts the two in the embedding space. As a result, their approach produces crisp predictions, preserving fine-details observed in ground truth segmentation maps.

Hung et al. [91] also take an adversarial approach to semantic segmentation; however, in contrast to the methods mentioned above, they modify the discriminator to score each pixel individually rather than predicting a single scalar for the entire image. Therefore, given an input segmentation map, the discriminator learns to predict a confidence map at the same spatial resolution as the input. The authors then couple the adversarial and CE loss terms by leveraging the confidence map as a supervisory signal, weighting the per-pixel CE loss residuals. Samson et al. [92] propose a similar scheme; however, the adversary is trained to identify incorrectly labelled pixels by being optimised to predict a "betting map", given a limited budget, that maximises the score when weighting the per-pixel CE residuals between the ground truth and predicted labels. Both approaches show improved structural qualities compared to competing models trained with CE on the Cityscapes dataset [93].

While the aforementioned cGAN approaches yield confident, self-consistent predictions with high structural fidelity, they are all limited to deterministic outputs. As demonstrated in [14, 64], simply incorporating noise vectors as additional inputs to the generator does not resolve this problem, as it learns to ignore the added noise due to the lack of regularisation between the noise vectors and the target domain, resulting in mode collapse [12, 13]. Consequently, these models learn to generate only a single admissible segmentation map for an input image, ignoring other valid modes. As far as we are aware, we are the first to use adversarial neural networks to learn a multimodal distribution for semantic segmentation.

3.2 Ambiguous Semantic Segmentation

Straightforward strategies towards learning multiple segmentation maps include ensembling [74, 94] or using multiple prediction heads [95]. Even though these approaches can capture a diverse set of predictions, they are limited to only a fixed number of samples. Alternatively, a distribution over the outputs can be induced by activating dropout during test time [75, 96]. This method does offer useful uncertainty estimates over the pixel-space, however, Isola et al. [64] have demonstrated that it introduces only minor stochasticity in the output and returns spatially inconsistent samples. To address this, Kohl et al. [17] introduce the Probabilistic UNet (PUNet), taking an orthogonal approach in combining a UNet [4] with a cVAE [48] to learn a distribution over semantic labels. Plausible segmentation variants are then encoded into a low-dimensional latent space, and new predictions are synthesised by sampling a random latent code and injecting it into the later stages of the UNet to propagate the sampled mode to the output. The conditional distribution over segmentations is learned using variational inference through a prior net used to map an input image x to a latent distribution over segmentation variants, regularised by a target distribution encoded by a posterior net conditioned on x and a valid ground truth label y. Notice that the ’prior’ and ’posterior’ name tags here refer to the conditioning on ground truth labels rather than adhering to the variational inference nomenclature. The distribution encoded by the posterior encoder in fact functionally replaces the

prior distribution as used in the standard VAE framework, reminiscent of empirical Bayes where the prior is learned from data [97].

Subsequent approaches have mostly focused on improving upon the PUNet. [18] and [19] have identified image-global modelling with only a single latent variable at the highest resolution level of the UNet as a limitation. By introducing a hierarchical latent space, they show that modelling the data on several scales of the image resolution yields improved diversity and higher sensitivity to finer structure. Nonetheless, these methods do not explicitly calibrate the predictive distribution in the pixel-space, and consequently do not provide reliable aleatoric uncertainty estimates [69, 73, 98]. Hu et al. [20] address this shortcoming by using the inter-grader variability as additional supervision on the PUNet. A major limitation of this approach is the requirement of a priori knowledge of all the modalities of the data distribution. For many real-world datasets, however, this information is not readily available.

In concurrent work to ours, Monteiro et al. [99] explore a different direction, and propose to improve inference efficiency using a single network to parametrise a low-rank multivariate Gaussian distribution modelling inter-pixel dependencies in the logit space. This approach relaxes the independence assumption taken by pixel-wise CE, allowing the synthesis of coherent segmentation maps without necessitating the use of a decoder or optimisation with variational inference. Multiple segmentation variants are obtained after a single forward pass by sampling logit maps from the learnt distribution in the logit space. This approach offers useful aleatoric uncertainty estimates; however, it lacks a calibration mechanism, and the use of a low-rank covariance matrix assumes only local dependence between pixels, which can lead to excessive sample diversity by breaking important long-distance regularities.

3.3 Multimodal Image-to-Image Translation

For image-to-image translation tasks, cVAEs have been used for a multitude of one-to-many translation problems such as generating images from visual attributes [100], forecasting from static images [101], and object segmentation with partial observations [67]. However, these often show blurry predictions, because they are optimised with pixel-wise CE reconstruction loss terms (see Section 2.2). To remedy blurriness, Esser et al. [102] complement the reconstruction term in the VAE training objective from Eq. (2) with a perceptual loss, using extracted high-level features from a pre-trained VGG network to encourage perceptual similarity between real samples and predictions [103]. Alternatively, several recent efforts explore hybrid models that employ adversarially trained cVAEs [14, 15], to explicitly encode multimodality while generating sharp and realistic outputs. A common hurdle in such cVAE-GANs is that inducing multimodality by incorporating a sampled latent vector as an additional input, as done in the purely generative setting, often fails for the same reason given above for cGANs, leading to mode collapse. This issue is commonly resolved by using supplementary cycle-consistency losses [14, 15], originally proposed in [68], to encourage a one-to-one relationship between the output and the latent vector. For example, Bicycle-GAN [14] constrains the connection between the output and the latent code to be invertible in both directions; that is, the latent code and output are conditioned to be recoverable from each other. These methods can indeed model continuous and multimodal distributions and generate sharp predictions; however, they typically require aligned training pairs.

Approaches designed to break the paired supervision constraints often disentangle the latent representation into a domain-invariant content space, and a domain-specific "style" space, modelling multimodality [12, 13]. Lee et al. [13] use two different encoders to embed input images into low-dimensional content codes and a latent style distribution. Diverse, multimodal predictions are then synthesised by sampling style codes, and recombining them with the content code in the latent space before propagating the resulting embedding through the generator. Such models do not require paired datasets; however, as is the case for all the aforementioned cVAE-GAN approaches, they customarily rely on a fixed prior distribution that limits the expressiveness of the learnt predictive distribution. Similar to our method, augmented CycleGAN [104] learns multimodal image-to-image translation mappings without explicitly encoding multimodality. Instead, a standard cGAN approach is adapted to leverage auxiliary noise variables and a modified cycle-consistency loss in order to learn stochastic mappings capturing multimodal conditionals. Our approach also utilises a cGAN generator injected with noise variables; however, to accommodate multimodality, we use a novel loss term establishing a distribution-consistency constraint rather than the sample-consistency constraint entailed by cycle-consistency loss. More importantly, in contrast to all methods mentioned in this section, our approach is explicitly formulated to calibrate the predictive distribution, which can be crucial for downstream safety-critical applications leveraging semantic segmentation predictions.

4 Method

In this section we begin by motivating our method, and proceed to delineate the key features of our approach. Thereafter, we conclude this section by sharing some important practical considerations.

4.1 Motivation

As mentioned in Section 3.1, adversarial approaches for semantic segmentation commonly complement the adversarial loss term with the $\mathcal{L}_{\mathrm{ce}}$ loss from Eq. (1) to improve training stability and prediction quality [43, 90, 92]. Even though the mixed supervision of adversarial and CE losses leads to improved empirical results, we argue that the two objective functions are in fact not well aligned in the presence of noisy data.

First, while categorical CE optimises for a single solution for each input x, thus encouraging high entropy in qθ(y | x) within noisy regions of the data space and calibrating the predictive distribution, the adversarial term optimises for low-entropy, label-like output. Second, the discriminator parametrising the adversarial loss can be satisfied by multiple different predictions, as long as the examined sample represents a plausible segmentation variant. Thus, in contrast to $\mathcal{L}_{\mathrm{ce}}$, it allows for multiple solutions, which is a desirable feature for our problem setting.

Combining these losses in an additive manner, and enforcing them on the same set of parameters, can therefore be suboptimal. This issue can be mitigated to some extent by a scheduled downscaling of $\mathcal{L}_{\mathrm{ce}}$; however, the persisting conflict between the two losses adversely affects the optimisation process. Below we describe how we alleviate this conflict while exploiting the complementary advantages of the two losses, and elaborate on the details of our method.

4.2 Cascaded Adversarial Refinement

In this work, we propose to avert potential conflict between the two aforementioned losses by decoupling them in a two-stage cascaded architecture consisting of a calibration network Fθ, optimised with the CE loss $\mathcal{L}_{\mathrm{ce}}$, the output of which is fed to a refinement network Gφ, optimised with an adversarial loss term $\mathcal{L}_{\mathrm{adv}}$. The adversarial loss term is in turn parametrised by an auxiliary discriminator Dψ, trained with a mixture binary CE loss $\mathcal{L}_{D}$ (defined below).

More formally, for a semantic segmentation dataset of N image and label pairs and K semantic classes, $\mathcal{D} = \{x_i, y_i\}_{i=1}^{N}$, we initially explicitly model the conditional distribution $p_{\mathcal{D}}(y \mid x)$ with a likelihood qθ(y | x), parametrised by the calibration network Fθ. To fit the likelihood to the data, we follow the standard approach used in discriminative modelling, expressing $y \in \{0, 1\}^{H \times W \times K}$ as a one-hot encoded label, and setting qθ as a pixel-wise factorised categorical distribution, given by:

$$ q_\theta(y \mid x) = \prod_{i}^{H} \prod_{j}^{W} \prod_{k}^{K} F_\theta(x)_{i,j,k}^{\,y_{i,j,k}}. \tag{5} $$

The parameters θ are then optimised by minimising the pixel-wise categorical CE from Eq. (1) between $p_{\mathcal{D}}$ and qθ, restated below:

$$ \mathcal{L}_{\mathrm{ce}}(\mathcal{D}, \theta) = -\mathbb{E}_{p_{\mathcal{D}}(x, y)}\left[\log q_\theta(y \mid x)\right]. \tag{6} $$
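As a numerical illustration of Eqs. (5) and (6), the following NumPy sketch evaluates the negative log-likelihood of one-hot labels under a pixel-wise factorised categorical distribution. The function name, the array layout (N, H, W, K) and the tiny example are assumptions made for illustration, not the implementation used in this thesis.

```python
import numpy as np

def pixelwise_categorical_nll(probs, onehot, tiny=1e-12):
    """Monte Carlo estimate of Eq. (6) for a batch of image-label pairs.

    probs:  (N, H, W, K) class probabilities, standing in for F_theta(x).
    onehot: (N, H, W, K) one-hot encoded labels y.
    """
    # log q(y | x) is the sum of log-probabilities of the labelled classes (log of Eq. 5).
    log_q = (onehot * np.log(np.clip(probs, tiny, 1.0))).sum(axis=(1, 2, 3))
    return -log_q.mean()

# One 2x2 image with K = 2 classes and maximally uncertain predictions.
probs = np.full((1, 2, 2, 2), 0.5)
labels = np.eye(2)[np.array([[[0, 1], [1, 0]]])]  # one-hot labels, shape (1, 2, 2, 2)
loss = pixelwise_categorical_nll(probs, labels)
```

With uniform predictions every pixel contributes log 2, so the loss for this 2×2 example equals 4 log 2 regardless of the labels.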

Subsequently, the output of the calibration network Fθ(x) is concatenated to the input x and fed into the refinement network Gφ. For notational convenience, we do not show the concatenation of x to Fθ(x) explicitly. To account for the multimodality in the labels, we additionally condition the refinement network on an extraneous noise variable ε ∼ N(0, 1), as done in the purely generative setting. In this way, Gφ learns to associate a sampled noise variable with the joint likelihood of output pixels, enabling straightforward sampling of different modes. We accommodate the pre-processing by the calibration network by modifying the non-saturating versions of the objectives used to train Gφ and Dψ, given in Eqs. (3) and (4), to:

$$ \mathcal{L}_{\mathrm{adv}}(\mathcal{D}, \phi, \theta) = -\mathbb{E}_{p_{\mathcal{D}}(x, y),\, p(\epsilon)}\left[\log D_\psi\big(G_\phi(F_\theta(x), \epsilon)\big)\right], \tag{7} $$

$$ \mathcal{L}_{D}(\mathcal{D}, \phi, \psi) = -\mathbb{E}_{p_{\mathcal{D}}(x, y),\, p(\epsilon)}\left[\log\Big(1 - D_\psi\big(G_\phi(F_\theta(x), \epsilon)\big)\Big) + \log D_\psi(y)\right]. \tag{8} $$
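The two objectives in Eqs. (7) and (8) can be sketched numerically as follows. The sketch assumes the discriminator outputs are already available as arrays of probabilities in (0, 1); the function names and example values are illustrative assumptions and no networks are trained here.

```python
import numpy as np

def generator_adv_loss(d_fake, tiny=1e-12):
    """Non-saturating generator loss, Eq. (7): -E[log D(G(F(x), eps))]."""
    return -np.mean(np.log(np.clip(d_fake, tiny, 1.0)))

def discriminator_loss(d_fake, d_real, tiny=1e-12):
    """Binary CE discriminator loss, Eq. (8)."""
    fake_term = np.log(np.clip(1.0 - d_fake, tiny, 1.0))  # generated samples should score low
    real_term = np.log(np.clip(d_real, tiny, 1.0))        # ground truth labels should score high
    return -np.mean(fake_term + real_term)

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
d_fake = np.array([0.5, 0.5])
d_real = np.array([0.5, 0.5])
loss_g = generator_adv_loss(d_fake)
loss_d = discriminator_loss(d_fake, d_real)
```

For an undecided discriminator the generator loss is log 2 and the discriminator loss is 2 log 2; the generator loss shrinks as the discriminator scores its samples closer to 1.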

This cascaded design has several useful properties:

• Decoupling the CE optimisation from the adversarial optimisation allows us to explicitly model the likelihood of the data while enjoying confident, label-like output.

• Fθ supplies an augmented representation of x, enclosing probabilistic information about y, for Gφ to distill into a final refined prediction.

• Since Fθ captures the per-pixel class probabilities over the labels corresponding to a given input, we can quantify the pixel-wise aleatoric uncertainty by computing the entropy of the output of Fθ, H(Fθ(x)) [25].

• The well-calibrated prediction by Fθ can be used as a target for the predictive distribution of Gφ, improving mode coverage while stabilising training and increasing convergence speed. We support this claim with empirical evidence presented in Section 5.
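The pixel-wise entropy H(Fθ(x)) mentioned in the list above can be computed directly from the class probabilities. The following is a minimal sketch under the assumption that the calibration output is stored as an (H, W, K) array; the function name and the two-pixel example are hypothetical.

```python
import numpy as np

def entropy_map(probs, tiny=1e-12):
    """Pixel-wise Shannon entropy (in nats): (H, W, K) probabilities -> (H, W) map."""
    p = np.clip(probs, tiny, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

# A 1x2 "image": the first pixel is unambiguous, the second is maximally ambiguous.
calib_out = np.array([[[1.0, 0.0], [0.5, 0.5]]])
uncertainty = entropy_map(calib_out)
```

The unambiguous pixel receives an entropy of approximately zero, whereas the ambiguous pixel attains the two-class maximum of log 2.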

4.3 Diversity Calibration

To calibrate the predictive distribution, we impose diversity regularisation on Gφ by complementing the adversarial loss of the refinement network $\mathcal{L}_{\mathrm{adv}}$ with a novel loss term encouraging the sample average $\overline{G}_\phi(F_\theta(x)) := \mathbb{E}_{p(\epsilon)}[G_\phi(F_\theta(x), \epsilon)]$ to match the class probabilities predicted by Fθ(x). Here $\overline{G}_\phi(F_\theta(x))$ serves as an approximation of the implicit predictive distribution of the refinement network. To this end, we define the predictive distribution of Gφ as an auxiliary fully-factorised categorical likelihood qφ, given by:

$$ q_\phi(y \mid F_\theta(x)) = \prod_{i}^{H} \prod_{j}^{W} \prod_{k}^{K} \overline{G}_\phi(F_\theta(x))_{i,j,k}^{\,y_{i,j,k}}. \tag{9} $$

We then optimise φ by minimising the reverse Kullback-Leibler (KL) divergence, KL(qφ || qθ), to encourage coverage over all modes present in qθ.³ Since both qφ and qθ are categorical distributions, the divergence can be computed exactly. We coin the resulting loss term the calibration loss, defined as:

$$ \mathcal{L}_{\mathrm{cal}}(\mathcal{D}, \theta, \phi) = \mathbb{E}_{p_{\mathcal{D}}}\left[ \mathbb{E}_{q_\phi}\left[\log q_\phi(y \mid F_\theta(x))\right] - \sum_{i,j,k} \overline{G}_\phi(F_\theta(x))_{i,j,k} \log F_\theta(x)_{i,j,k} \right]. \tag{10} $$

Since $\mathcal{L}_{\mathrm{cal}}$ optimises through the sample average $\overline{G}_\phi(F_\theta(x))$ rather than a single sampled prediction, the model is not restricted to learning a single solution for each input x, and can therefore be combined with $\mathcal{L}_{\mathrm{adv}}$ without impeding multimodality. The total loss for the refinement network then becomes:

$$ \mathcal{L}_{G}(\mathcal{D}, \theta, \phi) = \mathcal{L}_{\mathrm{adv}}(\mathcal{D}, \theta, \phi) + \lambda \mathcal{L}_{\mathrm{cal}}(\mathcal{D}, \theta, \phi), \tag{11} $$

where λ ≥ 0 is an adjustable hyperparameter determining the relative importance of each loss term. Figure 4 shows the interplay of Fθ, Gφ and Dψ and the corresponding loss terms.
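To make the calibration loss concrete, the sketch below evaluates the reverse KL divergence between a Monte Carlo sample average and a calibration target for a single input. It is a minimal NumPy illustration under assumed array shapes; the function name and the one-pixel example are hypothetical and this is not the training code used in this thesis.

```python
import numpy as np

def calibration_loss(samples, target, tiny=1e-12):
    """Reverse KL between the sample average and the calibration target (cf. Eq. 10).

    samples: (M, H, W, K) predictions standing in for G(F(x), eps_m), m = 1..M.
    target:  (H, W, K) class probabilities standing in for F(x).
    """
    mean_pred = np.clip(samples.mean(axis=0), tiny, 1.0)  # Monte Carlo sample average
    target = np.clip(target, tiny, 1.0)
    return (mean_pred * (np.log(mean_pred) - np.log(target))).sum()

onehot = np.eye(2)
target = np.array([[[0.5, 0.5]]])                    # one ambiguous pixel, K = 2
collapsed = onehot[[0, 0, 0, 0]].reshape(4, 1, 1, 2)  # every sample picks class 0
balanced = onehot[[0, 0, 1, 1]].reshape(4, 1, 1, 2)   # samples split evenly over both classes
loss_collapsed = calibration_loss(collapsed, target)
loss_balanced = calibration_loss(balanced, target)
```

A mode-collapsed set of samples is penalised with a loss of about log 2, while samples whose average matches the target incur no penalty, even though each individual sample remains a crisp one-hot prediction.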

Optimising $\mathcal{L}_{\mathrm{cal}}$ is equivalent to matching the two likelihoods qθ(y | x) and qφ(y | Fθ(x)), which establishes a consistency constraint between the input and output of the refinement network, resembling cycle-consistency regularisation [68]. Consequently, the refinement network is forced to process the probabilistic information encoded in Fθ(x) in order to learn a calibrated predictive distribution.

Intuitively, the refinement network can be interpreted as a stochastic sampler, modelling the inter-pixel dependencies to draw consistent samples from the explicit likelihood parametrised by the calibration network. Thus both the pixel-wise class probability as well as object coherency are preserved. Note that in theory, if the discriminator network is strong enough, it can improve upon the probabilities presented by the calibration network, as it can capture the relative frequencies of each plausible mode for a given input x, e.g. by being more tolerant towards predictions adhering to more frequent modes, and propagating this information to the refinement network through $\mathcal{L}_{\mathrm{adv}}$. By the same token, the refinement network may also learn to identify and filter out obvious errors in the calibration network’s output in order to satisfy the discriminator. In any case, this approach leads to improved mode coverage and training stability, and increased convergence speed, as demonstrated in Section 5.1.1.

³ The choice of divergence is heuristically motivated and can be changed to fit different use-case requirements. We leave theoretical and experimental analysis of other divergences to future work.

[Figure 4 diagram. Components: input image x, calibration network Fθ, aleatoric uncertainty H(Fθ(x)), example labels $y_i$, refinement network Gφ with Gaussian noise input, final predictions $y^{1}_{\mathrm{ref}}, \ldots, y^{M}_{\mathrm{ref}}$, average final prediction $\overline{y}_{\mathrm{ref}}$, discriminator Dψ; losses $\mathcal{L}_{\mathrm{ce}}$, $\mathcal{L}_{\mathrm{cal}}$, $\mathcal{L}_{D}$, $\mathcal{L}_{G}$.]

Figure 4: Model overview with an illustrative example vertically segmenting red from blue pixels. A fuzzy boundary in the input image x allows for multiple valid labels $y_i$. Initially, the calibration network is used to map the input to a calibrated pixel-wise distribution over the labels. This is then fed into the refinement network, which samples an arbitrary number of diverse, crisp label proposals $y^{1}_{\mathrm{ref}}, \ldots, y^{M}_{\mathrm{ref}}$. To ensure calibration, the average of the final predictions is matched with the calibration target from the first stage through the $\mathcal{L}_{\mathrm{cal}}$ loss. Additionally, the aleatoric uncertainty can be readily extracted from the calibration target, e.g. by computing the entropy H(Fθ(x)).

4.4 Practical considerations

In this section we reflect on some design choices we found useful for training our model. An important consequence of the loss decomposition is that the weights of Fθ can optionally be kept fixed while the adversarial pair Gφ and Dψ is being trained. This allows Fθ to be pre-trained in isolation, consequently lowering the overall peak computational burden and improving training stability. We experiment with both setups in Section 5, and find that pre-training can be especially important for training stability on higher-dimensional datasets.

We also note that computing $\mathcal{L}_{\mathrm{cal}}$ requires a Monte Carlo estimate of the sample average $\overline{G}_\phi(F_\theta(x))$, where the quality of the loss signal improves with the sample count. Modern deep learning frameworks allow the samples to be subsumed in the batch dimension, so that they can be computed efficiently on GPUs. Nevertheless, increasing the number of samples still increases the computational burden significantly, establishing a trade-off between the quality of the loss signal and training speed. Moreover, the benefit of a larger sample size depends heavily on the dataset used, so the number of samples requires dataset-specific tuning. We discuss this further in Section 5.2.1.
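Folding the Monte Carlo samples into the batch dimension can be sketched as follows. The helper `sample_average` and the stand-in `toy_generator` are hypothetical names introduced purely for illustration; the stand-in merely flips the class order for negative noise and bears no relation to the refinement network used in this thesis.

```python
import numpy as np

def sample_average(generator, calib_out, num_samples, rng, noise_dim=1):
    """Estimate the sample average with all M samples processed in a single batch.

    calib_out: (B, H, W, K) calibration output; generator maps
    ((B*M, H, W, K), (B*M, noise_dim)) to a (B*M, H, W, K) array.
    """
    batch = calib_out.shape[0]
    tiled = np.repeat(calib_out, num_samples, axis=0)  # each input repeated M times in a row
    noise = rng.standard_normal((batch * num_samples, noise_dim))
    samples = generator(tiled, noise)                  # one forward pass for all samples
    samples = samples.reshape(batch, num_samples, *samples.shape[1:])
    return samples.mean(axis=1)                        # average over the M samples

def toy_generator(probs, noise):
    """Hypothetical stand-in generator: reverses the class order when the noise is negative."""
    out = probs.copy()
    flip = noise[:, 0] < 0
    out[flip] = out[flip][..., ::-1]
    return out

rng = np.random.default_rng(0)
calib_out = np.tile(np.array([0.8, 0.2]), (2, 4, 4, 1))  # shape (B=2, H=4, W=4, K=2)
avg = sample_average(toy_generator, calib_out, num_samples=16, rng=rng)
```

The cost of the forward pass grows linearly with the number of samples M, which is the trade-off between loss-signal quality and training speed described above.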

Another important consideration is that in practice, the refinement network and discriminator do not necessarily need to be conditioned on the input image, however, we empirically found that this improves the quality of the results, presumably by making extra information available to the networks.

Additionally, the calibration network Fθ can be further conditioned on predicted semantic labels from other models, scaffolding the generation process. Thus, our method can be used to augment any existing black-box model B for semantic segmentation, furnishing it with a multimodal predictive distribution. We demonstrate relevant results in Section 5.2.2.

In order for the refinement network to leverage the sampled noise vectors and learn stochastic mappings providing diversity in generated samples, we found it beneficial to inject the noise in hidden layers of the network, in a manner similar to [104]. To do so, we project the noise using two fully connected layers into scale and residual matrices with the same number of channels as the feature maps at the points of injection, and use these matrices to adjust channel-wise mean and variance of the activations, resembling the mechanism used for adaptive instance normalisation [105].

Conditional normalisation via noise injection in hidden layers of a neural network has also been shown to be more effective than concatenation in [104, 106]. In contrast to these approaches, we do not normalise, but rather linearly transform the hidden features across the channel dimension.
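The noise injection described above can be sketched as a channel-wise affine transformation of the hidden features. The sketch assumes one fully connected projection for the scale and one for the residual, each producing a single value per channel; these shapes, the function name and the parameter names are assumptions for illustration and may differ from the actual implementation.

```python
import numpy as np

def inject_noise(features, noise, w_scale, b_scale, w_shift, b_shift):
    """Linearly transform hidden features per channel using projected noise.

    features: (C, H, W) activations; noise: (Z,) sampled vector;
    w_scale, w_shift: (C, Z) projection weights; b_scale, b_shift: (C,) biases.
    """
    scale = w_scale @ noise + b_scale  # one multiplicative factor per channel
    shift = w_shift @ noise + b_shift  # one additive residual per channel
    return features * scale[:, None, None] + shift[:, None, None]

rng = np.random.default_rng(0)
channels, noise_dim = 3, 4
features = rng.standard_normal((channels, 8, 8))
noise = rng.standard_normal(noise_dim)
w_scale = rng.standard_normal((channels, noise_dim))
w_shift = rng.standard_normal((channels, noise_dim))
out = inject_noise(features, noise, w_scale, np.ones(channels), w_shift, np.zeros(channels))
```

Unlike adaptive instance normalisation, the features are not normalised before the transformation; with zero projection weights, unit scale bias and zero shift bias the operation reduces to the identity.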

5 Experiments and Results

In this section, we describe our experiments and the respective findings. In each subsection we begin by sharing the motivation of the experiment, present details of the dataset used, and continue by reporting and analysing our results. First we show results on two toy datasets, and subsequently we show how our method scales to higher-dimensional datasets for multimodal semantic segmentation. To avoid repetition, we allot different types of experiments to different datasets, concisely showcasing separate aspects of our method.

5.1 Toy Experiments

We begin by giving intuitive insights into the mechanics of the proposed method with experiments on illustrative toy datasets for regression and classification, for which we have full control over the multimodality of the data.

5.1.1 1D bimodal regression

To illustrate the basic benefits of our approach on training dynamics, and test the applicability of our model to regression settings, we experiment on a bimodal 1D regression dataset. We generate the dataset by mapping an input x ∈ [0, 1] to y ∈ ℝ as follows:

$$ y = \begin{cases} 0.5 - b + \epsilon, & x \in [0, 0.4) \\ (-1)^{b}(-1.25x + 1) + \epsilon, & x \in [0.4, 0.8) \\ \epsilon, & x \in [0.8, 1] \end{cases} \tag{12} $$

where b ∼ Bernoulli(π) and ε ∼ N(0, σ). We synthesise 9 different scenarios by varying the mode selection probability π ∈ {0.5, 0.6, 0.9} and the mode noise σ ∈ {0.01, 0.02, 0.03}.
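The data-generating process of Eq. (12) can be sketched in a few lines of NumPy. The sketch assumes that x is drawn uniformly from [0, 1] and that σ denotes the standard deviation of the noise; neither detail is stated explicitly above, and the function name is hypothetical.

```python
import numpy as np

def sample_bimodal_1d(n, pi, sigma, rng):
    """Draw n points (x, y) from the bimodal mapping in Eq. (12); also returns b."""
    x = rng.uniform(0.0, 1.0, size=n)     # assumption: inputs are uniform on [0, 1]
    b = rng.binomial(1, pi, size=n)       # mode selector, b ~ Bernoulli(pi)
    eps = rng.normal(0.0, sigma, size=n)  # assumption: sigma is a standard deviation
    y = np.where(
        x < 0.4,
        0.5 - b,                                       # two constant branches at +0.5 and -0.5
        np.where(x < 0.8, (-1.0) ** b * (-1.25 * x + 1.0), 0.0),  # branches merge at x = 0.8
    ) + eps
    return x, y, b

rng = np.random.default_rng(0)
x, y, b = sample_bimodal_1d(1000, pi=0.9, sigma=0.0, rng=rng)  # noise-free draw for inspection
```

The two modes are mirror images of each other around y = 0, and they meet continuously at x = 0.4 and x = 0.8, so the degree of multimodality is fully controlled by π and σ.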

For every data configuration, we use a 4-layer MLP for each of Fθ, Gφ and Dψ, and train our model end-to-end, optimising the refinement network with or without the calibration loss by setting the coefficient λ in Eq. (11) to 1 or 0, respectively. All models are trained with a learning rate of 1e−4, and each experiment is repeated five times. Note that unlike the categorical likelihood used in semantic segmentation tasks, we employ a Gaussian likelihood with a fixed scale of 1. This changes the formulation of both Eqs. (6) and (10) to mean squared error (MSE) losses: between the ground truth labels y and individual final predictions $y_{\mathrm{ref}}$ for $\mathcal{L}_{\mathrm{ce}}$, and between the output of the calibration network Fθ(x) and the average of multiple final predictions $\overline{G}_\phi(F_\theta(x))$ for $\mathcal{L}_{\mathrm{cal}}$, given by:

$$ \mathcal{L}^{\mathrm{MSE}}_{\mathrm{ce}}(\mathcal{D}, \theta) = \frac{1}{N} \sum_{i} \left( y_{\mathrm{ref},i} - y_i \right)^{2}, \tag{13} $$

$$ \mathcal{L}^{\mathrm{MSE}}_{\mathrm{cal}}(\mathcal{D}, \theta, \phi) = \frac{1}{N} \sum_{i} \left( \overline{G}_\phi(F_\theta(x))_i - F_\theta(x)_i \right)^{2}. \tag{14} $$

The results, depicted in Fig. 5, show that when using the calibration loss, the optimisation process enjoys improved stability, converges faster and results in better calibrated predictions, in comparison to the non-regularised baseline. The effect is more pronounced in data configurations with higher bias.

This is expected because in low bias scenarios, the discriminator is highly penalised by $\mathcal{L}_{D}$ for incorrectly rejecting samples from either mode, as both modes occur with high frequency in the data. On the other hand, in high bias data configurations, the discriminator can easily learn to reject samples from the rare mode, as these do not occur often and therefore do not incur a significant penalty in $\mathcal{L}_{D}$. Therefore, optimising only with the adversarial loss would only weakly encourage the generator to synthesise samples from both modes.

Fig. 6 shows the data log-likelihoods for the 9 data configurations for varying mode bias π ∈ {0.5, 0.6, 0.9} and mode noise σ ∈ {0.01, 0.02, 0.03}, trained without and with the calibration loss $\mathcal{L}_{\mathrm{cal}}$. The individual likelihood curves for each of the 5 runs for every experiment are plotted in Fig. 6b and Fig. 6d, respectively. The results show that high bias is harder to learn, reflected by a delayed convergence in all models; however, the $\mathcal{L}_{\mathrm{cal}}$-regularised model displays greater robustness to weight initialisation. In contrast, the non-regularised GAN exhibits mode oscillation during training, expressed as fluctuations between high (when one mode is covered) and low (between modes) likelihood scores.

[Figure 5 plots. (a) Average data log-likelihood against training iteration (0 to 500): median and interquartile range for the baseline and for the model with $\mathcal{L}_{\mathrm{cal}}$. (b), (c) y against x, showing Fθ(x), $\overline{G}_\phi(F_\theta(x))$ and Gφ(Fθ(x), ε).]

Figure 5: (a) Median and interquartile range (iqr) over the data log-likelihood, averaged over all 9 × 5 × 2 experiments. (b) High bias and noise configuration (π = 0.9, σ = 0.03) with calibration loss. The ground truth target is shown as black dots and the predicted samples as light blue dots. The average of the predictions, in dark blue, matches the calibration target, in red. The discriminator output is shown in the background in shades of red (real) and blue (fake). (c) The same experiment configuration but without the proposed calibration loss, resulting in mode collapse.

[Figure 6 plots. Four 3 × 3 grids of log-likelihood curves, panels (a) to (d), against training iteration (0 to 500); the rows vary the mode bias π ∈ {0.5, 0.6, 0.9} and the columns vary the mode spread σ ∈ {0.01, 0.02, 0.03}. Panels (a) and (c) show the median and interquartile range.]

Figure 6: Log-likelihood curves for 5 runs on each of the 9 data configurations. (a) No calibration loss (λ = 0), averaged. (b) No calibration loss, individual runs. (c) With calibration loss (λ = 1), averaged. (d) With calibration loss, individual runs.

[Figure 7 image row. Columns: x, $y^{1}_{\mathrm{gt}}, \ldots, y^{5}_{\mathrm{gt}}$, ground truth entropy.]

Figure 7: Toy Semantic Segmentation Dataset: An example input x (left) of our toy dataset, 5 representative sample labels $y^{1}_{\mathrm{gt}}, \ldots, y^{5}_{\mathrm{gt}}$ (center), and the corresponding entropy of the ground truth posterior distribution (right). In the input image x, light blue represents the sky, dark blue represents the horizon, and brown represents the land. In the labels, black represents the sky and yellow represents the land.

5.1.2 Binary semantic segmentation

To give an intuitive understanding of how our model captures multimodality in semantic segmentation, we generate and experiment on a 2-class toy semantic segmentation dataset. The dataset consists of 32×32 RGB images and their corresponding annotations. The input images comprise three horizontal zones, which we liken to sky, horizon and land; however, only the sky and land classes are considered in the labels. The ambiguous horizon zone has a constant height of 5 pixels, and separates the sky and land zones at different heights of the input image. Each input image then corresponds to 5 different, equiprobable 2-class labels, where the boundary between the sky and the land is a horizontal line at a different height within the horizon region. This setup gives 25 different input images with 5 labels each. Samples from our toy dataset are shown in Figure 7.
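One way to construct such an image and its five labels is sketched below. The RGB colour values, the function name and the exact rows at which the five label boundaries are placed are assumptions for illustration; the description above only fixes the image size, the horizon height and the number of labels.

```python
import numpy as np

# Illustrative colours (assumptions): light blue sky, dark blue horizon, brown land.
SKY, HORIZON, LAND = (135, 206, 235), (0, 0, 139), (139, 69, 19)

def make_example(horizon_top, size=32, horizon_height=5):
    """Return one RGB input image and its equiprobable binary labels (0 = sky, 1 = land)."""
    image = np.zeros((size, size, 3), dtype=np.uint8)
    image[:horizon_top] = SKY
    image[horizon_top:horizon_top + horizon_height] = HORIZON
    image[horizon_top + horizon_height:] = LAND

    labels = []
    for offset in range(horizon_height):
        label = np.zeros((size, size), dtype=np.int64)
        label[horizon_top + offset:] = 1  # land starts at a different horizon row per label
        labels.append(label)
    return image, np.stack(labels)

image, labels = make_example(horizon_top=12)
land_probability = labels.mean(axis=0)  # empirical per-pixel probability of the land class
```

Averaging the five labels gives a land probability that rises in steps of 0.2 across the horizon rows, which is the calibrated target a model would have to reproduce for this input.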

To learn the conditional distribution p(y | x) we use the same network architecture for each of Fθ, Gφ and Dψ, comprising 5 convolutional blocks, each consisting of a convolutional layer followed by a leaky ReLU layer and a dropout layer with 0.1 dropout probability. Fθ and Gφ are softmax-activated, whereas Dψ is sigmoid-activated. To introduce stochasticity to the refinement network Gφ, we sample noise from a standard 2D Gaussian, and inject it at each hidden unit, after the dropout layer. We utilise the Adam optimiser [107] with a learning rate of 2e−4 for Fθ and Gφ, and 1e−5 for Dψ, updating all networks at all iterations. Further, we use a batch size of 32, consisting of randomly drawn image-label pairs, and compute $\mathcal{L}_{\mathrm{cal}}$ with 5 samples from Gφ(Fθ(x), ε). Note that the number of samples does not need to match the number of ground truth labels for each input image.

[Figure 8 image grid. Columns: x, $y^{1}_{\mathrm{ref}}, \ldots, y^{5}_{\mathrm{ref}}$, $\overline{y}_{\mathrm{ref}}$, Fθ(x), entropy of Fθ(x), ground truth entropy.]

Figure 8: Qualitative results on the toy segmentation dataset. Each row shows an input image x (left), followed by 5 corresponding sampled predictions from the refinement network, $y^{1}_{\mathrm{ref}}, \ldots, y^{5}_{\mathrm{ref}}$, the average over 16 sampled predictions $\overline{y}_{\mathrm{ref}}$ and the output of the calibration network Fθ(x). The last two columns show the aleatoric uncertainty map extracted from the calibration network, H(Fθ(x)), and the entropy of the ground truth distribution.

Fig. 8 illustrates representative sampled predictions from the refinement network, followed by the two components being matched in the calibration loss, the average prediction $\overline{y}_{\mathrm{ref}} = \overline{G}_\phi(F_\theta(x))$ and the calibration target Fθ(x). The sampled labels $y^{1}_{\mathrm{ref}}, \ldots, y^{5}_{\mathrm{ref}}$ are perceptually indistinguishable from the ground truth labels, demonstrating clear horizontal class boundaries that remain within the ambiguous region specified by the horizon. The last two columns of Fig. 8 show that the calibration network accurately captures the entropy (aleatoric uncertainty) of the ground truth conditional distribution p(y | x). Finally, by observing that $\overline{y}_{\mathrm{ref}}$ and Fθ(x) appear almost identical, we can conclude that the diversity regularisation from $\mathcal{L}_{\mathrm{cal}}$ successfully conditions the refinement network to uniformly sample from Fθ(x), capturing all 5 equiprobable segmentation variants for each input x.

Learning OOD maps. Our method is designed to model the intrinsic multimodality in the data, allowing us to extract aleatoric uncertainty maps that can highlight ambiguous regions of the input space. However, another important cause of predictive errors is OOD input. This refers to datapoints drawn from an unfamiliar data distribution or unseen regions of the input space, which can be detected with epistemic uncertainty maps [69]. As explained in Section 2.4, epistemic uncertainty is typically modelled using approximate Bayesian inference [24, 71, 75], to learn a probability distribution over model parameter configurations that are consistent with the training data. OOD maps are then extracted as the predictive variance between sampled models, since different weight configurations generalise in different ways on unfamiliar inputs.

Instead of relying on sample-based extraction of OOD maps, we explore an alternative approach that could serve as an extension to our method defined in Section 4. We re-interpret the role of the discriminator in the adversarial paradigm as an OOD detector, scoring an input segmentation map as real when it is in-distribution, and fake when it is deemed out-of-distribution. Accordingly, we can obtain pixel-wise OOD maps by modifying the discriminator to score each pixel separately rather than the entire image, similarly to [91]. We call this formulation of the adversary an auditor, on the grounds that its objective is to review the authenticity of each pixel. Importantly, in the original GAN framework the discriminator only surveys datapoints from the output space [53]; we, however, are also interested in detecting OOD inputs. Output space detection can be used to identify faulty predictions, whereas input space detection can be utilised to improve learning by identifying the most informative regions of the input space [24] and prioritising training accordingly [92, 108].

To obtain pixel-wise OOD maps for both the input and output space, we condition the auditor on both the sampled segmentation map and the input image x; that is, the auditor maps a pair (x, y^i_ref) or (x, y^i_gt) to a single per-pixel OOD score map highlighting pixels that are out-of-distribution in either the input space or the output space, as illustrated in Fig. 9. In order to preserve the higher-level consistencies enforced on the predictive distribution by the adversarial component of the refinement network's total loss L_G, the auditor also returns a global score for each input; this encourages the auditor to take into account evidence distributed across the entire image. To this end, we turn our discriminator into an auditor by introducing an additional OOD map prediction head on top of the first convolutional block, consisting of a convolutional layer followed by an instance normalisation layer [109] and a sigmoid activation. This ensures that the synthesis of the predicted map leverages the finely localised features found in the early layers of the network. To incorporate training of the OOD map head, we modify L_D, from Eq. (8), to define the auditor loss L_aud as:

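\[ \mathcal{L}_{\mathrm{aud}}(\mathcal{D}, \phi, \psi) = -\mathbb{E}_{p_{\mathcal{D}}(x,y),\, p(\epsilon)} \left[ \sum_{i,j} \log\bigl(1 - D_\psi(x, G_\phi(F_\theta(x), \epsilon))\bigr)_{i,j} + \log D_\psi(x, y)_{i,j} \right]. \tag{15} \]

Therefore, the total auditor loss is now given by:

\[ \mathcal{L}_{D}^{\mathrm{TOTAL}}(\mathcal{D}, \phi, \psi) = \mathcal{L}_{D}(\mathcal{D}, \phi, \psi) + \lambda \mathcal{L}_{\mathrm{aud}}(\mathcal{D}, \phi, \psi). \tag{16} \]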
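The way the two terms combine can be sketched for a single (real, fake) pair, replacing the expectation with one sample. The sketch assumes that L_D from Eq. (8) has the standard GAN discriminator form applied to the global score; Eq. (8) is not reproduced here, so this is an assumption, as are the function names, the clamping constant `tiny` and the value chosen for λ (`lam`).

```python
import math

def log_safe(value, tiny=1e-8):
    """Logarithm clamped away from zero for numerical stability."""
    return math.log(max(value, tiny))

def auditor_map_loss(real_map, fake_map):
    """Pixel-wise term of Eq. (15) for one pair of H x W score maps.

    real_map: auditor scores for (x, y); fake_map: scores for (x, G(F(x), eps)).
    """
    total = 0.0
    for real_row, fake_row in zip(real_map, fake_map):
        for d_real, d_fake in zip(real_row, fake_row):
            total += log_safe(1.0 - d_fake) + log_safe(d_real)
    return -total

def global_loss(d_real, d_fake):
    """Assumed standard GAN discriminator loss on the global scores."""
    return -(log_safe(1.0 - d_fake) + log_safe(d_real))

def total_auditor_loss(d_real, d_fake, real_map, fake_map, lam=1.0):
    """Eq. (16): global discriminator loss plus lambda times the map loss."""
    return global_loss(d_real, d_fake) + lam * auditor_map_loss(real_map, fake_map)

# An undecided auditor that outputs 0.5 everywhere on a 1 x 2 map.
undecided = total_auditor_loss(0.5, 0.5, [[0.5, 0.5]], [[0.5, 0.5]], lam=0.5)
# A perfect auditor: real pixels scored 1, generated pixels scored 0.
perfect = total_auditor_loss(1.0, 0.0, [[1.0, 1.0]], [[0.0, 0.0]], lam=0.5)
print(undecided, perfect)
```

Because the map term sums over all pixels while the global term contributes a single value, the weight λ controls how strongly the per-pixel head influences training relative to the image-level score.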

The OOD maps generated by our auditor, obtained as Dψ(x, y), are shown in Fig. 9 for a random in-distribution input and 4 out-of-distribution inputs. As a baseline for OOD detection, we use MC dropout, implemented by activating the dropout layers of the refinement and calibration networks with 0.5 dropout probability at test time, and computing the map as the predictive variance between 100 stochastic forward passes [75]. As a control, we also show the pixel-wise variance between multiple samples from our model, which captures the aleatoric uncertainty map of the refinement network's predictive distribution.

[Figure 9 panels, left to right: x, Fθ(x), y_ref, Entropy Fθ(x), Var[Gφ(Fθ(x), ε)], MC Dropout, Auditor Map]

Figure 9: Qualitative results for OOD inputs on the Toy segmentation dataset. Each row shows an input image x (left), followed by the corresponding calibration network prediction Fθ(x), a single randomly sampled prediction from the refinement network y_ref, the aleatoric uncertainty map, the variance map, and the MC dropout (using 100 forward passes) and auditor OOD maps.

Fig. 9 shows the qualitative results for each method. As we can see, the MC dropout and variance maps are very similar, indicating the presence of residual aleatoric uncertainty in the MC dropout maps. This demonstrates that in noisy datasets, deriving OOD maps as the predictive variance between sampled models unavoidably also captures the multimodality of the data, as different models can adhere to different modes. Consequently, extra processing is required to decompose epistemic and aleatoric uncertainty, for example as done in [24]. In contrast, the auditor maps appear to flag OOD pixels while ignoring aleatoric uncertainty, as evident in the top row of Fig. 9, without the need to draw multiple samples.
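The sample-based maps compared above are both pixel-wise variances over repeated stochastic predictions. A minimal sketch of that computation is given below, simplified to a single-channel map of foreground probabilities; the function name and toy values are hypothetical, and the actual experiments use 100 forward passes of the full networks.

```python
def variance_map(passes):
    """Pixel-wise variance across T stochastic predictions.

    passes: list of T maps, each an H x W nested list of probabilities.
    """
    num_passes = len(passes)
    height, width = len(passes[0]), len(passes[0][0])
    result = []
    for i in range(height):
        row = []
        for j in range(width):
            values = [p[i][j] for p in passes]
            mean = sum(values) / num_passes
            row.append(sum((v - mean) ** 2 for v in values) / num_passes)
        result.append(row)
    return result

# Four toy passes over a 1 x 2 image: the first pixel is stable,
# the second flips between two modes from pass to pass.
passes = [[[0.9, 0.0]], [[0.9, 1.0]], [[0.9, 0.0]], [[0.9, 1.0]]]
print(variance_map(passes))
```

The second pixel receives a high variance regardless of whether the disagreement stems from different sampled weights or from different modes of an ambiguous label, which is why variance-based maps cannot separate the two sources of uncertainty without further processing.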

Looking at the middle row of Fig. 9, we would expect the entire image to be detected as OOD due to the unfamiliar colours present in x (red and green). However, the red circle is not flagged by the auditor, and is understated in the MC dropout OOD map. It seems that MC dropout is more sensitive to the numerical values of the segmentation map, e.g. different RGB values, whereas the auditor is more sensitive to edges that are not horizontal. In this case, MC dropout appears to be more useful than our method. Nevertheless, notice that even when the predicted segmentation map y_ref appears realistic, e.g. in the 4th row of Fig. 9, the auditor is able to detect unfamiliarities in the input x. Further, the MC dropout maps in all images appear to be a composite of the variance maps and the auditor maps. This supports the view that the auditor selectively captures epistemic uncertainty. In any case, a more thorough study is required to investigate the OOD detection abilities of the auditor, and we do not consider it for the rest of our experiments. Difficulties in OOD detection are discussed in more detail in Section 6.2.2.

An important caveat of our implementation of the auditor is that we cannot know which of x or y^i_ref is OOD. A possible solution is to implement the auditor as a Siamese network [110], with different input blocks that process the input image separately from the labels or predictions, before converging in a shared body. Separate OOD maps can then be extracted for the input space and the output space via their respective input blocks. It is also worth noting that the field of view (FoV) at the level of the map head has a significant qualitative effect on the generated maps, determining the size of the highlighted pixel patches, and may require tuning for optimal results on different datasets. We further discuss limitations of this approach in Section 6.2.
