
MSc Artificial Intelligence

Master Thesis

Weakly-Supervised Semantic Segmentation

using End-to-End Adversarial Erasing

by

Erik Stammes

10559736

July 31, 2020

48 ECTS November 2019 - June 2020

Supervisor:

T.F.H. Runia MSc

Daily supervisors:

Dr. M. Ghafoorian

Dr. M. Hofmann

Assessor:

Prof. Dr. C.G.M. Snoek


Abstract

Semantic segmentation is a task that traditionally requires a large dataset of pixel-level ground truth labels, which is time-consuming and expensive to obtain. Recent advancements in the weakly-supervised setting show that decent performance can be obtained by using only image-level labels. Classification is often used as a proxy task to train a deep neural network from which attention maps are extracted. The attention maps recover the location of the objects that are present in the image. This effectively alleviates the problem of requiring pixel-level ground truth labels. However, classification needs only the minimum evidence to make predictions, hence it focuses on the most discriminative object regions. In light of weakly-supervised semantic segmentation, we propose a novel formulation of adversarial erasing of the attention maps. In contrast to previous methods, we optimize two networks with opposing loss functions, which eliminates the requirement of certain suboptimal strategies; for instance, having multiple training steps that complicate the training process or a weight sharing policy between networks operating on different distributions that might be suboptimal for performance. The proposed solution is simple, requires less supervision than previous methods and is easy to integrate into existing weakly-supervised segmentation approaches. Our experiments on the Pascal VOC dataset demonstrate that our adversarial approach improves segmentation performance by 2.1 mIoU compared to our baseline and by 1.0 mIoU compared to previous adversarial erasing approaches.


List of Abbreviations

CAM      Class Activation Mapping
CNN      Convolutional Neural Network
CRF      Conditional Random Field
EADER    End-to-end Adversarial Erasing
EM       Expectation-Maximization
FCN      Fully Convolutional Network
GAN      Generative Adversarial Network
mAP      mean Average Precision
MIL      Multiple Instance Learning
mIoU     mean Intersection-over-Union
MTL      Multi-Task Learning
PSA      Pixel-level Semantic Affinity
ROC AUC  Area Under Receiver Operating Characteristic Curve
WSSS     Weakly Supervised Semantic Segmentation

List of Symbols

$\mathcal{D}$        Dataset
$x_i$                Input image with index $i$ in the dataset
$\tilde{x}_{i,c}$    Image with index $i$ in the dataset where the activations of class $c$ have been erased
$y_i$                Input label with index $i$ in the dataset
$\mathcal{L}$        Loss function
$\phi, \theta$       Parameters of the neural networks
$S$                  Segmentation masks


Acknowledgements

First and foremost, I would like to express my gratitude towards my supervisors. They have made this thesis work a fun and insightful experience, where I learned a lot in the process. To Mohsen, for his thorough feedback and recommendations throughout the project. To Tom, for being able to look at this project from another point of view and for the extensive feedback. To Michael, for being involved and providing feedback throughout the project despite a busy schedule. Thanks for being in my defense committee and assessing my thesis, along with Prof. Dr. Cees Snoek.

Furthermore, I would like to thank TomTom for financially supporting me during this thesis.

Finally, a thank you to my colleagues and the other interns at TomTom, who provided many fruitful discussions and helped tremendously throughout this process.


Contents

1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Thesis Contents
2 Preliminaries
  2.1 Semantic Segmentation Models
  2.2 Weak Supervision
    2.2.1 Notation
  2.3 Attention Maps
  2.4 Conditional Random Fields
  2.5 Saliency Estimation
  2.6 Additional Data
  2.7 Multi-Stage Training
3 Related Work
  3.1 Weakly-Supervised Semantic Segmentation
    3.1.1 Shallow Methods
    3.1.2 Multiple Instance Learning
    3.1.3 Multi-Task Learning
    3.1.4 Region Growing
    3.1.5 Cross Image Features
    3.1.6 Random Erasing
    3.1.7 Adversarial Erasing
  3.2 Adversarial Training
    3.2.1 Introduction
    3.2.2 Algorithms and methods
  3.3 Summary and Motivation of Method
4 Method
  4.1 End-to-End Adversarial Erasing
  4.2 Integrability of End-to-End Adversarial Erasing
5 Experiments
  5.1 Experimental Setup
    5.1.1 Dataset
    5.1.2 Network architecture details
    5.1.3 Training specifications
  5.2 Ablation study
  5.3 Classification performance
  5.4 Comparison to Pixel-level Semantic Affinity
    5.4.2 Integrability of End-to-End Adversarial Erasing
  5.5 Comparison to Adversarial Erasing Methods
  5.6 Comparison to the State-of-the-Art
6 Summary and Conclusions
  6.1 Summary
  6.2 Limitations and Future Work


1 Introduction

1.1 Motivation

Semantic segmentation is one of the core computer vision tasks, with applications ranging from autonomous vehicles [1] to medical diagnosis [2]. There has been tremendous progress in the quality of semantic segmentation in the era of deep learning [3–5], in part due to the availability of large-scale datasets with pixel-level ground truth labels [6–9]. However, labeling these datasets with pixel-level annotations is a laborious process. As an example, in the Cityscapes dataset the average time it took to annotate an image was 1.5 hours [6]. Weakly-supervised methods achieve reasonable performance with much coarser labels such as bounding boxes [10,11], scribbles [12–14], points [15] or even image-level labels [16–18]. Even though these methods achieve lower accuracy compared to their fully supervised counterparts, they have the benefit of easier-to-acquire labels. It is common in methods that use only image-level labels to train a classification network and extract attention maps as initial object location seeds. However, learning semantic segmentation with only image-level labels is an ill-posed problem, since the labels indicate only the existence of a class instead of its location and shape. More specifically, the attended visual evidence generally corresponds to the most discriminative object regions and therefore fails to capture the complete object [19–21]. We call this the discriminative localization problem, which is illustrated in Figure 1. This problem is especially prevalent in non-rigid object classes such as birds, cats, horses and sheep, where the texture of the fur or skin is much less discriminative than other body parts such as heads or feet.

Figure 1: Attention maps obtained from a classification network. Classification networks need only the minimum evidence to make predictions, hence they focus on the most discriminative object parts. We call this the discriminative localization problem. This is especially prevalent in non-rigid object classes.

Previous methods [17,20–22] propose to alleviate this problem by introducing adversarial erasing, which sets a threshold on the attention map to generate a mask that removes the most discriminative object regions from the image. The resulting image is then fed into a second classification network to find less discriminative regions that belong to the same object. Some of the existing methods perform the erasing in multiple steps, either implemented as a multi-stage training approach [17] or trained in an integrated fashion [21] with multiple erasing networks trained jointly. This results in either a complicated multi-stage training strategy or a larger memory footprint that might hinder leveraging state-of-the-art network architectures. Other methods [20] avoid this shortcoming by training a single erasing step while sharing the parameters between the models operating on the input and the erased input. The parameter sharing, however, might result in suboptimal performance given the different distributions of data they operate on. Therefore, in this work we seek to answer the following research questions:

1. How can adversarial erasing be formulated such that it can be integrated into existing methods?

2. Does this formulation resolve the discriminative localization problem?

3. How does this formulation perform in terms of segmentation quality?

1.2 Contributions

To solve the discriminative localization problem, we follow the adversarial erasing methodology, but propose to train the two networks in a truly adversarial manner. We purposely keep our framework simple, making it easily integrable into existing weakly-supervised semantic segmentation methods. The main contributions of this thesis are as follows:

• We propose a novel end-to-end adversarial erasing method which helps capturing less discriminative object regions.

• We extensively analyze the behavior of this method in terms of segmentation and classification performance.

• We show how this approach can be integrated into existing weakly supervised semantic segmentation methods.

• We demonstrate its effectiveness on the Pascal VOC 2012 semantic segmentation benchmark, outperforming the baseline and existing adversarial erasing approaches.

A paper version of this thesis is currently under review at ACCV 2020.

1.3 Thesis Contents

The remainder of this thesis is organized as follows. Chapter 2 introduces the semantic segmentation task, the different kinds of supervision and common techniques in existing weakly-supervised semantic segmentation methods. Chapter 3 categorizes existing works into six categories and highlights several methods per category. We pay extra attention to adversarial erasing methods, as they provide the ground that this thesis builds upon. Chapter 4 introduces end-to-end adversarial erasing. We show how it can be integrated into existing methodologies and show one example integration into a baseline method. Chapter 5 describes a number of experiments run on a semantic segmentation dataset. We analyze the important parameters in terms of segmentation and classification performance. Finally, we make a comparison to previous adversarial erasing methods and the current state-of-the-art. Chapter 6 summarizes the work presented in this thesis, concludes this research and proposes future work.


2 Preliminaries

Semantic segmentation is the task where each pixel in an image needs to be associated with a class label, such as person, cat, aeroplane, chair or bicycle. Applications for semantic segmentation range from autonomous vehicles [1] to medical image diagnosis [2] and from industrial inspection [23] to land cover mapping based on remote sensing images [24]. Semantic classes are commonly divided into two categories: thing classes and stuff classes.

Things are objects with well-defined shapes, such as cat or train, while stuff classes are amorphous background regions such as sky or road. Semantic segmentation datasets either have both types of classes, such as Cityscapes for autonomous driving [6], or thing classes and a single background class for the remaining pixels, such as Pascal VOC [7].

In the fully-supervised setting pixel-wise labels are available, while in the setting with weaker supervision the labels are coarser. Figure 2 shows the different types of supervision in the semantic segmentation task, ranging from full supervision with pixel-level labels to the weakest form of supervision, which is just image-level labels.

Figure 2: Different supervision signal strengths. In the fully supervised setting, the output labels are available as annotations, while in the weakly supervised setting coarser labels are available. The weak supervision signals are ordered from hardest to easiest. Image from [25].

2.1 Semantic Segmentation Models

Semantic segmentation with pixel-level labels, i.e. full supervision, has been studied extensively. Shallow approaches (i.e. without neural networks) include the normalized cut criterion [26], which optimizes minimum similarity between classes and maximum similarity within a class, and the graph cuts algorithm [27], which takes various smoothing constraints into account, such as regional boundary information and pixel gray information. In the deep learning era of semantic segmentation, Fully Convolutional Networks (FCNs) [28] were proposed, which replace the fully-connected layers in a classification network with convolutional layers to get dense class predictions for each input pixel. The U-net architecture [29] builds upon FCNs by increasing the number of feature channels in the upsampling part. Wu et al. [30] made CNNs shallower and wider, which makes them more spatially efficient. This CNN architecture not only improves classification results, but in a fully convolutional setup also improves segmentation results.

Often, fully-supervised semantic segmentation models are combined with methods that exploit the contextual nature of the segmentation task, such as conditional random fields [31] or an adversarial training scheme [32,33]. More recent semantic segmentation methods can roughly be divided into two groups: 1) encoder-decoder networks, where the encoder captures higher semantic information while the decoder recovers spatial information [3,28,29], and 2) networks with spatial pyramid pooling, which exploit multi-scale information by performing pooling at multiple scales [4,5,34].

2.2 Weak Supervision

In this thesis work, we restrict ourselves to image-level labels, as this requires the least amount of effort to gather a dataset. It is also the most difficult form of weak supervision, as image-level labels only indicate the existence of a certain class in an image, not its shape and location. We further restrict ourselves to thing classes, i.e. images where only objects with well-defined shapes are labeled. Regions that would usually fall in the stuff classes are denoted as background.

Commonly, weakly-supervised semantic segmentation methods utilize convolutional neural networks (CNNs) trained for classification to produce semantic segmentation masks. In addition, many methods share common techniques such as attention maps, conditional random fields (CRFs), saliency estimation, additional data and multi-stage training.

2.2.1 Notation

We formally describe a classification model as $\hat{y} = f(x)$, where $f$ is the model, $x$ is an input image and $\hat{y}$ is a classification prediction. During training, the input to the model is a set $\mathcal{D} = \{x_i, y_i\}_{i=1}^N$, where $N$ is the number of images, $x_i$ the input image and $y_i$ a multi-hot vector of length $C$, with $C$ being the number of classes, in which $y_{i,c} = 1$ if class $c$ is present in $x_i$ and $y_{i,c} = 0$ otherwise. Note that, being in a multi-label setup, multiple classes can be present in an input image and hence $\sum_c y_{i,c} \geq 1$.
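As a concrete illustration of this label format, consider the following minimal sketch; the shortened class list and helper name are illustrative only and not part of the thesis:

```python
# Minimal illustration of the multi-hot label format described above.
# Pascal VOC has C = 20 object classes; the list is shortened here for brevity.
CLASSES = ["aeroplane", "bicycle", "bird", "cat", "dog"]

def multi_hot(present, classes=CLASSES):
    """Return y_i with y_{i,c} = 1 for every class c present in the image."""
    return [1.0 if c in present else 0.0 for c in classes]

y_i = multi_hot({"cat", "dog"})   # -> [0.0, 0.0, 0.0, 1.0, 1.0]
assert sum(y_i) >= 1              # multi-label: at least one class per image
```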

2.3 Attention Maps

Since the early days of the breakthrough of deep neural networks, considerable attention has been devoted to shedding light on these “black boxes” to better understand their decision making process. For instance, in visual tasks such as image classification it is often useful to highlight the image regions responsible for the network’s decision. Earlier work achieved this by visualizing partial derivatives of predicted class scores w.r.t. the input image [35] or by making modifications to raw gradients [36].


Oquab et al. [37] made three modifications to a CNN to find the locations of objects in images. Namely, they replace the fully connected layers with convolutional layers, utilize a cost function that can explicitly model multiple objects and add a global max-pooling layer at the output to search for the highest scoring object position. The resulting network is able to localize a point within the boundary of an object with only image-level labels. Later, Zhou et al. [19] introduced Class Activation Mappings (CAMs) which uncover the implicit attention of classification networks using a global average pooling layer. By utilizing a global average pooling layer instead of a global max pooling layer [37] it is possible to localize the extent of the object. The class-specific attention maps are obtained by multiplying the classification weights with the feature map of the final convolutional layer and show the most discriminative area of each class.

$$A^{\text{CAM}}_c(x_i) = \sum_k w_{c,k} \, f^{\text{final}}_k(x_i). \qquad (1)$$

Here, $A^{\text{CAM}}_c$ is the class activation map for class $c$, $w_c$ are the classification weights and $f^{\text{final}}$ is the feature map of the final convolutional layer of classification model $f$. The sum runs over the individual units $k$, and the attention map is normalized such that the maximum activation equals 1. Figure 3 illustrates the generation of CAMs. The localization ability of these attention maps is remarkable and paved the way towards object localization, detection and semantic segmentation using only image-level labels [37].

Figure 3: Class Activation Mappings (CAMs) are generated by multiplying the classification weights with the feature map of the final convolutional layer. Image from [19].
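The computation in Equation 1 is compact in code as well; the sketch below assumes that the final convolutional feature map and the classifier weights have already been extracted from a PyTorch model (the tensor shapes and function name are assumptions, not part of [19]):

```python
import torch

def class_activation_map(feature_map: torch.Tensor, fc_weight: torch.Tensor, c: int) -> torch.Tensor:
    """Compute A^CAM_c (Eq. 1) for a single image.

    feature_map: (K, U, V) output of the final convolutional layer, f^final
    fc_weight:   (C, K) weights of the classification layer, w
    c:           index of the target class
    """
    # Weighted sum over the K units: sum_k w_{c,k} * f^final_k
    cam = torch.einsum("k,kuv->uv", fc_weight[c], feature_map)
    # Normalize so that the maximum activation equals 1, as described above.
    return cam / (cam.max() + 1e-8)
```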

A more general approach to attention map generation is Grad-CAM [38], which utilizes gradients backpropagated to the final convolutional layer to produce the attention maps. This makes Grad-CAM suitable for a broader range of neural network architectures compared to CAM. Examples include CNNs with fully-connected layers (e.g. VGG [39]), CNNs with structured outputs and CNNs used in tasks with multi-modal inputs. To produce Grad-CAMs, first the gradients are global-average-pooled over the spatial dimensions, with sizes $(U, V)$, of the feature map to obtain neuron importance weights $\alpha^c_k$:

$$\alpha^c_k(x_i) = \frac{1}{UV} \sum_u \sum_v \frac{\partial y^c(x_i)}{\partial f^{\text{final}}_k(x_i)_{u,v}} \qquad (2)$$

Here, $y^c$ denotes the score for class $c$ before the softmax. The neuron importance weights replace the fully-connected layer weights as used in CAM, which makes the final computation similar to CAMs:

$$A^{\text{Grad-CAM}}_c(x_i) = \sum_k \alpha^c_k(x_i) \, f^{\text{final}}_k(x_i) \qquad (3)$$

Finally, negative values are discarded using the ReLU non-linearity to prevent the spread of the attention map to non-desired areas.
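A minimal Grad-CAM sketch following Equations 2 and 3 is shown below; it assumes a model split into a convolutional part and a classification head, and the attribute names `features` and `head` are placeholders rather than a real API:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x: torch.Tensor, c: int) -> torch.Tensor:
    """Compute A^Grad-CAM_c for a single image x of shape (1, 3, H, W).
    Assumes the model parameters require gradients."""
    fmap = model.features(x)                 # (1, K, U, V) final conv feature map
    fmap.retain_grad()                       # keep gradients on this non-leaf tensor
    score = model.head(fmap)[0, c]           # y^c, the class score before the softmax
    score.backward()                         # gradients of y^c w.r.t. the feature map
    alpha = fmap.grad.mean(dim=(2, 3))       # Eq. 2: global-average-pooled gradients, (1, K)
    cam = (alpha[..., None, None] * fmap).sum(dim=1)  # Eq. 3: weighted combination
    cam = F.relu(cam)                        # discard negative values
    return cam / (cam.max() + 1e-8)          # normalize to a maximum of 1
```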

2.4 Conditional Random Fields

The output of WSSS methods is often coarse because of the lack of feedback from the image-level labels, which can be partly alleviated by post-processing with conditional random fields [18,40,41]. CRFs optimize the probability distribution to be fitted to the edges of regions by using information from the input image as features. Since the distribution now depends on the input data, a direct model for the posterior distribution is used. Kumar and Hebert [42] were the first to apply CRFs in computer vision. The idea of using information from the input image in the optimization process is however not new to segmentation, as Boykov and Jolly [43] used this approach for interactive segmentation. Some methods optionally use a CRF as a post-processing step [41,44–47], while more recently some methods incorporate a CRF in their loss function [14,18,48].

2.5 Saliency Estimation

Since attention maps form the basis of many WSSS methods, their coarse nature means they can cover not only object-related regions but also the background. Many WSSS methods therefore opt to use an extra supervision signal in the form of saliency estimation methods [17,20,22,49–55]. These methods are close neighbours of semantic segmentation methods, with the important distinction that they are class agnostic. Instead of predicting a mask per class, saliency estimation methods predict a binary mask for the entire image, separating the foreground from the background. Saliency estimation methods are often trained on fully-supervised labels [56,57], which implicitly gives the WSSS methods that adopt them stronger supervision. Two commonly used methods are [57], which is based on multi-level image segmentation and maps regional feature vectors to a saliency score, and [56], which utilizes short connections in an existing architecture to obtain rich multi-scale feature maps for saliency estimation.


2.6 Additional Data

Besides the additional supervision of saliency estimation methods, some methods resort to external data sources to provide stronger supervision. Pinheiro and Collobert [58] and Hong et al. [59] train only on the ImageNet [60] and COCO [9] datasets respectively, while evaluating on the Pascal VOC [7] dataset. Both these training sets contain more classes and images than Pascal VOC. In their multi-stage training method, Wei et al. [61] start with 40k simple images from Flickr.com, before moving on to Pascal VOC. Both Hong et al. [62] and Lee et al. [63] utilize videos crawled from the web to provide extra supervision. The temporal dynamics in videos offer more information to distinguish objects from their background compared to images. Finally, Fan et al. [50] optionally use a subset of ImageNet with only one object class present in each image to train their model.

2.7 Multi-Stage Training

To the best of our knowledge, Wei et al. [61] were the first to propose multi-stage training of weakly-supervised semantic segmentation models. In their work, they train three distinct CNNs: starting with saliency masks and simple images containing only one class, they gradually train with more difficult images. Both Wei et al. [17] and Wang et al. [64] apply multiple iterations of their proposed methods to iteratively improve their output segmentation masks.

Most other existing WSSS methods since then also apply multi-stage training, where usually a final segmentation model is trained on the proxy labels produced by the proposed method [17,20,22,49–51]. The final model is then an off-the-shelf fully-supervised semantic segmentation model such as DeepLab [4,34], or WideResNet-38 [30]. This not only improves the segmentation performance, but another important benefit is that for inference only this final model is required to generate predictions. Both Huang et al. [47] and Fan et al. [54] take this approach one step further and retrain their model one more time to improve the segmentation performance even further.


3 Related Work

3.1 Weakly-Supervised Semantic Segmentation

Weakly-supervised semantic segmentation (WSSS) methods can be divided into two eras: before and after the popularization of deep learning. We briefly go into some shallow methods from before the deep learning era, and then cover modern WSSS methods in more depth. The modern WSSS methods can be divided into six categories:

• Multiple Instance Learning
• Multi-Task Learning
• Region Growing
• Cross Image Features
• Random Erasing
• Adversarial Erasing

Note that these categories are not mutually exclusive, since some methods utilize two or more of the aforementioned techniques. Therefore, we group each method based on its core contribution here.

3.1.1 Shallow Methods

Deep learning methods have gained popularity because they outperform shallow machine learning approaches. However, to give a full overview of weakly-supervised semantic segmentation, we go over some notable shallow methods here.

Vasconcelos et al. [65] were the first to explore semantic segmentation using image-level labels but their proposed method outputs binary masks only. Verbeek and Triggs [66] combine spatial models with aspect models and show that the combination outperforms aspect models trained on pixel-level labels. Vezhnevets and Buhmann [67] proposed to make use of both a Multiple-Instance Learning (MIL) framework and Multi Task Learning (MTL). They regularize the MIL framework by also optimizing for the task of geometric context. The authors continue their work with a multi-image model, a graphical model for recovering the pixel labels of the training images [68]. Graph-based models generally recover the labels for segments based on the similarity between images, which has been studied extensively [69–73].

Some methods utilize self-training to train a fully-supervised model but obtain the labels from an Expectation-Maximization (EM) procedure [11,74,75]. In these methods the features are obtained from a CNN, but the model itself is learnt using CRFs [75], concave-convex procedure [74] or max-margin clustering [11].


3.1.2 Multiple Instance Learning

Early methods in the deep learning era of semantic segmentation commonly utilize multiple instance learning [44,58]. Pathak et al. [44] train an FCN to jointly optimize the representation while disambiguating the pixel-image label assignment. Pinheiro and Collobert [58] constrain a CNN during training to put more weight on pixels which are important for classifying the image, which forces the CNN to discriminate the right pixels at test time. Papandreou et al. [46] however note that the MIL paradigm forces the CNN to focus on the most discriminative object parts. They propose to solve this issue by using an expectation-maximization learning strategy.

3.1.3 Multi-Task Learning

Semantic segmentation is often jointly trained with object detection when only image-level labels are available [76–78]. Diba et al. [76] found that weakly-supervised object detection could be improved by using a cascade of networks, where one of the networks is tasked with semantic segmentation. Wei et al. [77] utilized segmentation in their object detection method to find tight bounding boxes that cover the entire object. Finally, Li et al. [78] use a generative adversarial formulation of semantic segmentation to aid the object detection task.

Later works output both bounding boxes and segmentation masks for object detection and semantic segmentation respectively [79,80]. Ge et al. [79] proposed a multi-stage framework that can solve three tasks under weak supervision: object recognition, object detection and semantic segmentation. Shen et al. [80] use the failure patterns of both detection and segmentation to complement each other’s learning.

3.1.4 Region Growing

Kolesnikov et al. [41] proposed a unified framework of three loss functions: seeding, expansion and constrain-to-boundary. Based on weak localization cues, the expansion loss forces the mask to expand to objects based on information about which classes can occur in an image. The constrain-to-boundary loss limits the masks such that they coincide with the object boundaries. Wei et al. [52] created a neural network architecture which combines attention maps from multiple convolutional layers with different dilation rates, which promotes the emergence of non-discriminative object regions. Jiang et al. [51] found that during training the attention maps tend to cover different regions of the image. They propose to accumulate the attention maps during training, which then serve as pixel-level supervision to find less discriminative object regions. Zhang et al. [18] take another approach by first pruning attention maps to only highly confident regions, which serve as reliable supervision for training a segmentation branch. The segmentation branch is trained with a dense energy loss function and over time grows the segmented regions into high quality segmentation masks.


3.1.5 Cross Image Features

Wei et al. [53] find that feature representations of the same classes in different images share similar characteristics, while representations of different classes differ from each other. Hence, they propose a MIL-inspired cross-image contextual refinement module which selects class-interdependent object proposals with high predictive scores. Fan et al. [50] use a saliency estimation method to generate similarity features for the entire training set, from which a similarity graph is built. Next, this graph is split into multiple subgraphs corresponding to class labels using a graph partitioning algorithm. This algorithm considers the relationships between all salient instances in the dataset as well as the information within them. Ahn and Kwak [16] first use attention maps to generate affinity labels, which are class-agnostic labels that capture relations between pixels. These labels are used to train AffinityNet, which learns the semantic affinities of pixels. The affinities are finally used in a random walk to generate segmentation labels. Fan et al. [54] take a similar approach, but use reference images of the same class to generate affinity maps. The affinity maps are then used to retrieve supplementary information for each query pixel in the initial attention map.

3.1.6 Random Erasing

Lee et al. [49] introduced FickleNet, a modification of dropout [81] which randomly selects hidden units to generate a variety of attention maps, each highlighting different object areas. FickleNet implicitly learns the coherence of each location in the feature maps, highlighting not only the most discriminative object areas. Randomly erasing parts of an image to improve attention maps has also been used in weakly-supervised object localization [82,83], where the task is to generate bounding boxes instead of segmentation masks. Singh and Lee [82] randomly hide parts of an image, while Choe and Shim [83] randomly decide to erase either a drop mask or importance map based on a self-attention map.

3.1.7 Adversarial Erasing

Kim et al. [84] and Wei et al. [17] concurrently developed adversarial erasing methods for object localization and semantic segmentation, respectively. The idea is to find less discriminative object regions by erasing the most discriminative ones from either the image or the feature map. While [84] perform one erasing step, [17] continue erasing until a satisfactory coverage of the object is obtained. Figure 4 shows an overview of the method used by Wei et al. [17]. Common to both methods, after the erasing steps the attention maps are fused to generate segmentation masks. Later, this approach to weakly-supervised learning was improved in several ways. For object localization, Zhang et al. [21] integrated the erasing step into the network, making it end-to-end trainable, but still requiring a fusion step. This approach is shown in Figure 5. For semantic segmentation, Hou et al. [22] took the same approach, but with two thresholds on the attention map, focusing not only on the most discriminative object regions, but also on background cues from saliency masks. Li et al. [20] introduce a network architecture with an attention mining loss, which forces the attention map to cover the entire object region such that no recognizable pixels remain. This approach is end-to-end trainable and does not require a fusion step, as the attention map becomes a learnable part of the training process. The network architecture is shown in Figure 6.

Figure 4: Adversarial erasing as defined by Wei et al. [17]. An image is repeatedly forwarded through a classification network, where each time the most discriminative object region is removed. After several steps, the object regions are fused into a segmentation mask. Image from [17].

Figure 5: In adversarial complementary learning [21] the most discriminative object regions are erased from the feature map, enabling end-to-end training but still requiring a fusion step. GAP denotes Global Average Pooling. Image from [21].

Common to all adversarial erasing methods for WSSS, saliency masks are used to restrict the spread of the attention map to background areas [17,20,22]. Saliency masks however often fail to capture multiple objects in a single image; therefore Chaudhry et al. [85] proposed adversarial erasing of saliency masks to find all occurrences of objects. The masks are fused and then used in combination with attention maps to generate segmentation masks.

3.2 Adversarial Training

3.2.1 Introduction

The idea of adversarial training has gained significant attention in recent years after the introduction of Generative Adversarial Nets (GANs) [86], which combine two neural networks, a generator and a discriminator. The task of the discriminator is to predict whether a given input image is real or fake, while the task of the generator is to fool the discriminator by generating real-looking images. In other words, the two networks play an adversarial game by having opposing loss functions. This approach to generating images has proven to be powerful and has been extended to generate convincing-looking images [87,88]. The idea of adversarial training has since been extended to different tasks, such as image-to-image translation [89], reconstructing 3D objects from images [90], up-scaling low resolution images [91] and semantic segmentation [32,33]. The term adversarial training has also loosely been used in the WSSS field, where it denotes erasing part of an image and training an auxiliary model on this new image [17].

Figure 6: Overview of the Guided Attention Inference Network (GAIN) [20]. Here, the attention map becomes part of the training process by utilizing an attention mining loss. The parameters between the two networks are shared, even though the input image distributions differ. Image from [20].

3.2.2 Algorithms and methods

For generative modeling, the adversarial framework consists of two players: a generator $G$ and a discriminator $D$. $G$ takes a random input $z$, while $D$ takes as input either an example from the true data distribution $x$ or a generated sample $G(z)$. Both players then play a minimax game with value function $V(D, G)$:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (4)$$

The goal of the generator is to fool the discriminator, whose goal is to be able to distinguish between true and generated data. The quality of output images has further been increased by novel generator architectures allowing explicit control of the image synthesis [87] and by synthesizing images at multiple resolutions which simultaneously backpropagate their gradients from a single discriminator [88].
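The alternating updates implied by Equation 4 can be sketched as follows; this is background only, not the method of this thesis, and the generator, discriminator (assumed to output probabilities in (0, 1)) and optimizers are assumed to exist:

```python
import torch

def gan_step(G, D, opt_G, opt_D, x_real, z):
    """One alternating update of D and G for the minimax game in Eq. 4."""
    # Discriminator: maximize log D(x) + log(1 - D(G(z)))
    opt_D.zero_grad()
    loss_D = -(torch.log(D(x_real)).mean()
               + torch.log(1 - D(G(z).detach())).mean())
    loss_D.backward()
    opt_D.step()

    # Generator: minimize log(1 - D(G(z))), i.e. try to fool the discriminator
    opt_G.zero_grad()
    loss_G = torch.log(1 - D(G(z))).mean()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```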

Isola et al. [89] proposed an extension to the adversarial modeling framework by making it conditional:

$$\min_G \max_D V(D, G) = \mathbb{E}_{z,x,y}[\log D(x, y) + \log(1 - D(x, G(z, y)))] \qquad (5)$$

Now, both players of the adversarial game observe input data $x$, which depending on the task can be for example an edge map (for reconstructing objects) or a grayscale image (for colorization). When the adversarial training framework is applied to semantic segmentation, the generator produces the predicted segmentation masks while the discriminator alternately observes the predicted segmentation masks and the ground truth labels [32]. This approach can however suffer from insufficient gradient feedback, as the discriminator provides only a single binary prediction. To mitigate this issue, both the predictions and the labels can be used as inputs to the discriminator at the same time [33]. This provides more useful feedback and steers the training of the generator in the direction of more realistic labels.

3.3 Summary and Motivation of Method

In this chapter we presented an overview of the topics and related work. Next, we discuss some observations from the related work that motivated the choices we have made when developing our method.

Many WSSS methods run into the discriminative localization problem, where the attention maps extracted from a classification network correspond only to the most discriminative object regions. They then aim to solve this using cross image features, region growing or random erasing. Adversarial erasing methods erase the most discriminative object regions to find complementary, less discriminative object regions. This can be done in iterative fashion [17,21], or by utilizing complicated network architectures [22], all of which require the fusion of several attention maps. Furthermore, saliency masks are often utilized to prevent the attention maps from spreading to background regions, which adds another supervision signal [17,22].

Li et al. [20] were the first to propose a learnable attention map by utilizing an adversarial loss term from a second network. However, their solution has multiple design choices that can deteriorate the performance and make it hard to integrate into existing solutions. First, the weights are shared between the two networks, while the input data distributions between the two networks are different. Second, they require saliency masks to limit the spread to background regions. This implicitly adds additional supervision requirements, as the saliency method that provides the masks [92] uses a subset of the same semantic segmentation dataset with stronger labels, i.e. saliency masks.

This thesis attempts to utilize adversarial erasing in a similar fashion, making the attention map part of the learning process. However, we purposely keep our method simple, making it easily integrable into existing WSSS methods. Instead of sharing weights between the two networks, we train the networks separately and with opposing loss functions, making our setup closer to the generative adversarial nets of [86]. This also makes our approach easier to integrate, as the network architectures can be chosen independently of each other. We drop the saliency mask requirement and instead opt for a regularization loss term. It limits the spread of the attention map to background regions without requiring the stronger supervision of saliency masks.


4 Method

In this chapter we propose a new formulation of adversarial erasing: end-to-end adversarial erasing (EADER). The chapter is structured as follows: first, we present our novel adversarial training formulation for weakly-supervised semantic segmentation. Then we illustrate the effectiveness of our method by integrating EADER into an existing weakly-supervised semantic segmentation framework.

4.1 End-to-End Adversarial Erasing

Consider a dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^N$, where $N$ is the number of images, $x_i$ the input image and $y_i$ a multi-hot vector of length $C$, with $C$ being the number of classes, in which $y_{i,c} = 1$ if class $c$ is present in $x_i$ and $y_{i,c} = 0$ otherwise. Note that, being in a multi-label setup, multiple classes can be present in an input image and hence $\sum_c y_{i,c} \geq 1$.

Localizer network. First, an image classifier network is used to localize the target object by generating attention maps, hence we call this network the localizer. The localizer can be instantiated by any convolutional neural network from which attention maps can be extracted. For simplicity we assume the usage of CAMs as attention maps, but it is straightforward to extend this approach to more advanced attention maps such as Grad-CAM. The localizer $G$ with trainable parameters $\phi$ is trained as a multi-label classifier using a binary cross entropy loss on each label class:

$$\mathcal{L}_{\text{loc}}(G_\phi(x_i), y_i) = -\frac{1}{C} \sum_c y_{i,c} \ln(G_\phi(x_i)_c) + (1 - y_{i,c}) \ln(1 - G_\phi(x_i)_c) \qquad (6)$$

Attention maps. Given a trained localizer network $G_\phi$, the attention map $A_c$ for class $c$ can be obtained using the feature map of its final convolutional layer $g^{\text{final}}_\phi$ and the classification weights $w_c$ as follows:

$$A_c(x_i) = \text{ReLU}\left(w_c^T g^{\text{final}}_\phi(x_i)\right). \qquad (7)$$

$A_c$ is then normalized so that the maximum activation equals 1.

Soft masks. Next, the attention maps are converted into masks using a soft thresholding operation. Only the attention maps for ground truth classes are kept, which are then resized to the input image dimensions and thresholded to generate class-specific masks $M_c$:

$$M_c(x_i) = \sigma\left(\omega \left(A_c(x_i) - \psi\right)\right), \qquad (8)$$

where $\sigma$ is the sigmoid non-linearity, $\psi$ is the threshold value and $\omega$ is a scaling parameter that ensures that values above the threshold are (close to) 1 and values below are (close to) 0. In contrast to a regular thresholding operation, this soft threshold is differentiable, which allows the gradients from any further computations to backpropagate to the localizer.


Erasing. The soft masks are used to create a new image where the regions highlighted by the attention maps are erased. The erased images, which are the input to the adversarial, are computed as follows:

$$\tilde{x}_{i,c} = x_i \odot (1 - M_c(x_i)) \qquad (9)$$

Note here that only the attention map of one particular class is erased, which is why multiple images are created in cases where there is more than one target.
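A sketch of the soft thresholding and erasing operations in Equations 8 and 9 is shown below; the default hyperparameter values follow the experimental setup of this thesis, while the tensor shapes and function names are illustrative assumptions. Because the mask is produced by a sigmoid rather than a hard threshold, gradients can flow back through it to the localizer:

```python
import torch

def soft_mask(attention: torch.Tensor, psi: float = 0.5, omega: float = 100.0) -> torch.Tensor:
    """Eq. 8: differentiable threshold. `attention` is A_c(x_i), resized to the
    input image resolution and normalized so its maximum is 1."""
    return torch.sigmoid(omega * (attention - psi))

def erase(image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Eq. 9: remove the most discriminative regions of one class.
    image: (3, H, W), mask: (H, W) -> erased image x~_{i,c} of shape (3, H, W)."""
    return image * (1.0 - mask)  # broadcasting over the channel dimension
```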

Adversarial network. The resulting images are then forwarded through the second network, which we call the adversarial. Its goal is to classify the images correctly, even when the target classes are erased. The adversarial network $F$ with trainable parameters $\theta$ is trained as a multi-label classifier using the same binary cross entropy loss function as the localizer:

$$\mathcal{L}_{\text{adv}}(F_\theta(\tilde{x}_{i,c}), y_i) = -\frac{1}{C} \sum_c y_{i,c} \ln(F_\theta(\tilde{x}_{i,c})_c) + (1 - y_{i,c}) \ln(1 - F_\theta(\tilde{x}_{i,c})_c) \qquad (10)$$

Hence, the goal of this network is to classify the same targets as before, despite the erased evidence. The network architecture of the adversarial is agnostic to the localizer network architecture, as we have erased the most discriminative object regions from the image instead of one of the feature maps of the localizer.

Attention mining loss. To encourage the model to erase the object evidence thoroughly, we engage the localizer in an adversarial game with the adversarial. We follow [20] by utilizing an attention mining loss, which is the mean of the logits of the classes that have been erased:

$$\mathcal{L}_{\text{am}}(\tilde{x}_i, y_i) = \frac{1}{C} \sum_{c \in y_i} F_\theta(\tilde{x}_{i,c})_c \qquad (11)$$

The attention mining loss captures the ability of the adversarial to still classify the erased object.

Regularization loss. Finally, to regularize the localizer, we impose an additional loss term:

$$\mathcal{L}_{\text{reg}}(x_i, y_i) = \frac{1}{W \times H \times C} \sum_{c \in y_i} \sum_{j,k} A_c(x_i)_{j,k}, \qquad (12)$$

where $W$ and $H$ represent the width and height of the activations. Incorporating this regularization loss in the optimization process encourages the localizer to find a minimal attention map that covers the target class and hence prevents the localizer from resorting to the trivial solution where it erases the entire image to globally minimize the attention mining loss.


Figure 7: An overview of our end-to-end adversarial erasing framework. The images $x$ are forwarded through the localizer $G_\phi$ to extract per-class attention maps $A_c$. Using a soft-thresholding operation they are converted to masks $M_c$, which are used to create images in which the most discriminative object parts have been erased ($\tilde{x}$). These are forwarded through the adversarial $F_\theta$, which is optimized using a classification loss $\mathcal{L}_{\text{adv}}$. The localizer is optimized using a classification loss $\mathcal{L}_{\text{loc}}$ and an adversarial loss term $\mathcal{L}_{\text{am}}$. This forces the localizer to spread its attention to less discriminative object parts, while the $\mathcal{L}_{\text{reg}}$ loss encourages the model to bound the activation to the minimum necessary area.

Total loss. The total loss function for the localizer then becomes:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{loc}} + \alpha \mathcal{L}_{\text{am}} + \beta \mathcal{L}_{\text{reg}}, \qquad (13)$$

where $\alpha$ and $\beta$ are hyper-parameters that tune the importance of the adversarial and regularization losses respectively.

The localizer and adversarial image classifiers are optimized using binary cross entropy losses, but in an alternating fashion using distinct optimizers. With this setup, the proposed method can be integrated into existing methods, regardless of their network architecture. Furthermore, with this setup the weights of the networks need not be shared; sharing them can deteriorate performance because the input data distributions of the localizer and the adversarial are different.

The combination of loss functions forces the localizer to not only classify the image correctly, but also to spread its attention to less discriminative object regions without spreading to background regions. While the localizer is trained to minimize its adversarial loss term, the adversarial model tries to maximize it by minimizing its loss in Equation 10. Without the regularization loss term, a trivial solution for the localizer would be to hide the entire image from the adversarial. Hence, the regularization loss limits the attention of the localizer and thus forces it to only erase the regions that belong to the target class. An overview of the end-to-end adversarial framework is shown in Figure 7.
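The alternating optimization can be sketched as follows. This is a simplified single-image view in which all present classes are erased from one image, whereas the thesis erases each class separately (one $\tilde{x}_{i,c}$ per class); the network definitions, optimizers and the attention-map extraction function are assumptions:

```python
import torch
import torch.nn.functional as F

def localizer_step(G, F_adv, opt_G, attention_fn, x, y, alpha=0.05, beta=1e-5):
    """One localizer update with L_total = L_loc + alpha*L_am + beta*L_reg (Eq. 13).
    x: (1, 3, H, W) image, y: (1, C) multi-hot labels."""
    opt_G.zero_grad()
    loss_loc = F.binary_cross_entropy_with_logits(G(x), y)           # Eq. 6
    # attention_fn must build the attention map from G inside the autograd
    # graph, so that the adversarial and regularization losses reach phi.
    attention = attention_fn(G, x)                                    # (H, W), normalized to [0, 1]
    mask = torch.sigmoid(100.0 * (attention - 0.5))                   # Eq. 8 with omega=100, psi=0.5
    x_erased = x * (1.0 - mask)                                       # Eq. 9
    loss_am = (F_adv(x_erased) * y).sum() / y.numel()                 # Eq. 11: mean logit of erased classes
    loss_reg = attention.mean()                                       # Eq. 12: mean activation
    (loss_loc + alpha * loss_am + beta * loss_reg).backward()
    opt_G.step()

def adversarial_step(F_adv, opt_F, x_erased, y):
    """One adversarial update with the classification loss of Eq. 10."""
    opt_F.zero_grad()
    F.binary_cross_entropy_with_logits(F_adv(x_erased.detach()), y).backward()
    opt_F.step()
```

The two steps are called in alternation, matching the opposing objectives: the localizer minimizes the adversarial's ability to classify the erased classes, while the adversarial keeps trying to classify them.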

Segmentation masks. After training the model with the described loss terms, we convert the attention maps to segmentation masks. We first upsample all the attention maps to the image resolution and stack them into a volume with $C + 1$ channels. Since we do not train the classification models for the background class, we set the first channel to a threshold value of $\rho$. To obtain the segmentation masks we take the argmax over the class dimension:

$$S_i = \underset{C}{\arg\max}\left(\{\rho, A_1, \ldots, A_C\}\right) \qquad (14)$$

where $\rho$ and the attention maps have dimension $W \times H$, the same as the input image dimensions. The output segmentation mask $S_i$ corresponding to image $x_i$ is therefore of the same resolution.
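Equation 14 amounts to stacking a constant background channel with the per-class attention maps and taking a channel-wise argmax; a minimal sketch, assuming the attention maps are already upsampled to the image resolution:

```python
import torch

def to_segmentation_mask(attention_maps: torch.Tensor, rho: float = 0.3) -> torch.Tensor:
    """Eq. 14: convert per-class attention maps into a segmentation mask.

    attention_maps: (C, H, W) attention maps, upsampled to image size and
                    normalized to a maximum of 1 per class.
    Returns S_i of shape (H, W), where index 0 is background and index c is class c.
    """
    background = torch.full_like(attention_maps[:1], rho)     # constant channel with value rho
    stacked = torch.cat([background, attention_maps], dim=0)  # (C + 1, H, W)
    return stacked.argmax(dim=0)
```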

Algorithm 1: End-to-end adversarial erasing (EADER)
Data: Training set $\mathcal{D} = \{x_i, y_i\}_{i=1}^N$, parameters $\alpha, \beta$
Result: Segmentation masks $S$

while training has not converged do
    Forward image through localizer: $\hat{y}_i^{\text{loc}} \leftarrow G_\phi(x_i)$
    Generate attention map: $a_{i,c} \leftarrow A_c(x_i)$ (Eq. 7)
    Generate mask: $m_{i,c} \leftarrow M_c(x_i)$ (Eq. 8)
    Erase mask from image: $\tilde{x}_{i,c} \leftarrow \text{erase}(x_i, m_{i,c})$ (Eq. 9)
    Forward erased image through adversarial: $\hat{y}_i^{\text{adv}} \leftarrow F_\theta(\tilde{x}_{i,c})$
    if train localizer then
        Compute localizer loss: $\mathcal{L}_{\text{loc}} \leftarrow \mathcal{L}_{\text{loc}}(\hat{y}_i^{\text{loc}}, y_i)$ (Eq. 6)
        Compute attention mining loss: $\mathcal{L}_{\text{am}} \leftarrow \mathcal{L}_{\text{am}}(\hat{y}_i^{\text{adv}}, y_i)$ (Eq. 11)
        Compute regularization loss: $\mathcal{L}_{\text{reg}} \leftarrow \mathcal{L}_{\text{reg}}(a_{i,c})$ (Eq. 12)
        Compute total loss: $\mathcal{L}_{\text{total}} \leftarrow \mathcal{L}_{\text{loc}} + \alpha \mathcal{L}_{\text{am}} + \beta \mathcal{L}_{\text{reg}}$ (Eq. 13)
        Update $\phi$ w.r.t. $\mathcal{L}_{\text{total}}$
    else
        Compute adversarial loss: $\mathcal{L}_{\text{adv}} \leftarrow \mathcal{L}_{\text{adv}}(\hat{y}_i^{\text{adv}}, y_i)$ (Eq. 10)
        Update $\theta$ w.r.t. $\mathcal{L}_{\text{adv}}$
    end
end
for $i \leftarrow 1$ to $N$ do
    Generate segmentation mask: $S_i \leftarrow \arg\max(a_i)$ (Eq. 14)
end

Algorithm 1 illustrates the training procedure of end-to-end adversarial erasing.

4.2 Integrability of End-to-End Adversarial Erasing

The proposed method is simple and integrable, as the adversarial model and loss term are agnostic to the neural network architecture and the attention map generation method. Furthermore, the model does not require saliency masks or any other form of extra supervision. We showcase its simplicity by integrating the proposed end-to-end adversarial erasing framework into an existing WSSS method. We integrate it into Pixel-level Semantic Affinity (PSA) [16], which suffers from the discriminative localization problem in the first stage of the method. PSA consists of three stages, where the first stage trains a classification network to generate CAMs. In this stage no specific techniques are used to improve the segmentation masks, but extensive training and test-time data augmentations are utilized to increase performance in this regard. The segmentation masks generated from CAMs are used to generate two sets of labels for the second stage. The first set contains labels with high confidence for the background regions, while the second set contains labels with high confidence for the object regions. In the second stage semantic affinities are learned using AffinityNet, which propagates local responses in the maps to nearby regions that belong to the same semantic entity. The two distinct sets allow reliable training of AffinityNet, as regions of low confidence are discarded. The AffinityNet output predictions are used as proxy labels in the last stage, where a fully-supervised semantic segmentation network is trained.

The CAM-generation stage of PSA is suitable for end-to-end adversarial erasing as it suffers from the discriminative localization problem and because the classification network is suitable as a localizer, i.e. the attention maps are generated without the need of any post-processing or other gradient-breaking computations. As before, we apply a soft threshold on the attention maps to create masks, which are then used to erase the most discriminative object regions from the input images. The resulting images are forwarded through the adversarial network and attention mining loss is applied as adversarial loss on the localizer network.


5 Experiments

5.1 Experimental Setup

5.1.1 Dataset.

We evaluate the performance of the proposed method on the Pascal VOC 2012 segmentation dataset [7], the most widely used benchmark for weakly-supervised semantic segmentation. The dataset consists of 20 object classes and one background class and contains 1464, 1449 and 1456 images in the train, validation and test sets respectively. Following previous works in the WSSS literature [16,17,20,22,61,85], we augment the dataset with annotations from Hariharan et al. [93]. As a result, a total of 10582 training images are available with image-level annotations. We report the mean intersection-over-union (mIoU) for the validation and test sets. The test set results are obtained using the official Pascal VOC evaluation server.
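For reference, per-class IoU on a pair of label maps can be computed as below; a minimal sketch, noting that the official benchmark accumulates intersections and unions over the whole dataset before averaging (array shapes and the function name are assumptions):

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int = 21) -> float:
    """Mean intersection-over-union over all classes (background included).

    pred, target: integer label maps of identical shape with values in [0, num_classes).
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                       # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```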

In contrast to previous adversarial erasing methods, we do not employ any post-processing and keep tricks to a minimum to keep our method simple. More specifically, we leave tricks such as test-time augmentations, post-processing and saliency masks to the method we integrate with. This makes our approach easier to integrate into existing methods, but also makes it harder to achieve better performance. For example, methods that use saliency masks implicitly get an extra supervision signal, which we do not benefit from.

5.1.2 Network architecture details.

We test the adversarial training approach with a ResNet-101 [94] localizer network, while the adversarial model is a ResNet-18 network. We utilize ImageNet pre-trained weights for both networks. Attention maps are obtained using Grad-CAM [38], unless otherwise specified. When integrating with PSA, to ensure fair comparisons, we do not change any of the existing networks, which means the localizer is a WideResNet [30] with 38 convolutional layers, while the adversarial model is kept as a ResNet-18. We also do not change the attention map generation method, which is CAM [19]. In the final stage we train a fully supervised semantic segmentation network on proxy labels. We utilize DeepLabV3+ [4], a modern segmentation model, with ResNet-101 and Xception-65 backbones and the same training strategy as [4].

5.1.3 Training specifications.

We train the localizer with a batch size of 16, while the batch size is dynamic for the adversarial model as it depends on the number of objects in each image. For example, when each image in a batch of 16 has two object classes, both objects are erased from each image separately and the batch size for the adversarial network will be 32. We randomly resize and crop the input images to 448×448 for both the localizer and the adversarial model. Both networks are optimized for 10 epochs with stochastic gradient descent with a learning rate of 0.01. We alternately train the localizer and the adversarial every 200 training steps. Throughout the experiments, unless specified otherwise, we have used an α value of 0.05 and β is set to $10^{-5}$ (both Equation 13). Further hyperparameter values are ω = 100 and ψ = 0.5 (both Equation 8) and ρ = 0.3 (Equation 14).

α       mIoU          Precision     Recall
0       41.37±0.26    58.26±0.50    58.18±0.66
0.01    42.51±0.41    57.79±0.72    60.88±0.52
0.05    43.89±0.40    54.78±1.03    68.13±1.51
0.1     42.88±0.99    52.68±2.30    69.31±1.84

Table 3: Performance of EADER on the Pascal VOC 2012 validation set. The α parameter controls the strength of the adversarial loss term, where α = 0 corresponds to no adversarial loss. We report both the mean and standard deviation over 6 runs.

Figure 8: Attention maps obtained using Grad-CAMs from the end-to-end adversarial erasing method using different values for the adversarial loss term α. As the α value increases, the attention spreads to less discriminative object regions.

We follow the training settings of [16] when training PSA and use an initial learning rate of 0.01 for the adversarial network. Note that this implies that we train the localizer network for fewer iterations than before, as only half of the iterations are used to update the localizer.

An overview of all hyperparameters can be found in the appendix.
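Collected in one place, the hyperparameters stated in this section would look roughly as follows; this is a convenience summary only, not a configuration file from the thesis:

```python
# Hyperparameters as stated in Section 5.1.3 (convenience summary).
HPARAMS = {
    "batch_size_localizer": 16,    # adversarial batch size is dynamic (one erased image per object class)
    "crop_size": 448,
    "epochs": 10,
    "optimizer": "SGD",
    "learning_rate": 0.01,
    "alternate_every_steps": 200,  # switch between localizer and adversarial updates
    "alpha": 0.05,                 # adversarial (attention mining) loss weight, Eq. 13
    "beta": 1e-5,                  # regularization loss weight, Eq. 13
    "omega": 100,                  # soft-threshold sharpness, Eq. 8
    "psi": 0.5,                    # soft-threshold value, Eq. 8
    "rho": 0.3,                    # background threshold, Eq. 14
}
```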

5.2 Ablation study

Our first experiment is an ablation study to verify our hypothesis that the adversarial network forces the localizer network to spread its attention to less discriminative object regions. Recall from Equation 13 that α controls the strength of the adversarial loss term. In Table 3 we vary the α parameter and report the mIoU, precision and recall of the segmentation masks. We make the following observations: first, we find that a higher α indeed increases the recall, i.e. it forces the localizer to spread its attention to less discriminative object regions.


        Localizer                    Adversarial
α       mAP          ROC AUC        mAP          ROC AUC
0       95.20±0.31   98.89±0.09     81.63±1.07   95.13±0.33
0.01    95.50±0.56   98.82±0.18     81.19±0.82   94.82±0.49
0.05    95.22±0.14   98.87±0.02     79.46±0.72   94.58±0.30
0.1     92.59±0.44   98.34±0.07     73.01±0.66   92.63±0.28

Table 4: Classification performance of both the localizer and adversarial model on the Pascal VOC 2012 validation set. The α parameter controls the strength of the adversarial loss term, where α = 0 corresponds to no adversarial loss. Both the mean average precision (mAP) and area under the receiver operating characteristic curve (ROC AUC) are reported. The values are averaged over 4 runs with the standard deviation shown after ±.

Second, we observe that this increase in recall also increases performance in terms of mIoU. The highest mIoU is obtained at α = 0.05, which strikes the right balance between precision and recall. A higher α value further increases the recall but the degradation in precision is stronger, resulting in a lower mIoU score. Example attention maps generated for different α values are shown in Figure 8. Consistent with the previous observation, we see that a higher α value forces the attention map to spread to less discriminative object regions. However, when the value is too high some pixels belonging to other classes and background regions receive high responses and therefore cause a drop in the precision.

5.3 Classification performance

We now test the hypothesis that by minimizing the adversarial loss term (Equation 11) the classification performance of the adversarial model is degraded. Since the adversarial loss term is instantiated as the attention mining loss, which is the mean logit of the erased classes, we expect the classification performance of the adversarial model to degrade when the strength of this term (controlled by α) is increased. The remaining classes should however still be classified, as the adversarial model is trained with exactly the same labels as the localizer. Therefore, only a small degradation in classification performance is to be expected.

Table4shows the performance of both the localizer and adversarial models in terms of mean average precision (mAP) and the area under the receiver operating characteristic curve (ROC AUC). For the localizer model, we find that the classification performance is unaffected forαvalues below 0.1, with a higher value the performance is somewhat degraded. This reinforces the finding that a too strongαvalues decreases the segmentation performance, as the task of the localizer shifts from classification to increasing the attention map such that the adversarial cannot perform classification anymore. Forα= 0.05, the best performing value in terms of segmentation performance, the classification performance of the localizer is unaffected. In other words, even though the adversarial loss term forces the attention map to spread to less discriminative object regions, the localizer can still classify the objects in the image with the same accuracy. For the adversarial model, we find that a higherαvalue indeed decreases the classification performance. Figure9shows the input images (xˆ) for

(30)

α= 0 α= 0.01 α= 0.05 α= 0.1 Figure 9: Input images (ˆx) for the adversarial model for different α values. The α parameter controls the strength of the adversarial loss term, where α= 0 corresponds to no adversarial loss. The regions covered by the attention map have been erased by converting them to masks and subtracting them from the image. The adversarial model is trained to classify the images as {cat, chair} (top) and {motorbike, car} (bottom).

CAM AffinityNet DeepLabV3+

Model mIoU Precision Recall mIoU mIoU

PSA 46.8 60.3† 66.7† 58.7 60.7†

PSA w/ EADER 48.6 61.3 68.7 60.1 62.8

Table 5: Comparison to our baseline, Pixel-level Semantic Affinity (PSA), on the Pascal VOC 2012 validation set. To enable a fair comparison we reproduce the PSA numbers and train the proxy labels from AffinityNet on DeepLabV3+. The numbers with a† denote our reproduced results.

Figure 9 shows the input images (x̂) for the adversarial model for different α values. As α increases, a larger region of the target class is erased, making it harder to classify the objects correctly. However, the classification performance is still relatively high, as the remaining objects can still be classified and the erased object can sometimes be inferred from the context and co-occurrences with other objects.
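The erasing operation itself can be kept differentiable, which is what allows the adversarial term to influence the localizer end-to-end. A minimal sketch, assuming an attention map normalized to [0, 1] and upsampled to the image resolution, is given below; the threshold and sharpness values are illustrative.

```python
import torch

def erase_with_attention(image, attention, threshold=0.5, sigma=10.0):
    """Erase the regions highlighted by the attention map from the input image.

    image:     (B, 3, H, W) input batch.
    attention: (B, 1, H, W) attention map, normalized to [0, 1] and upsampled
               to the image resolution.
    A soft (sigmoid) mask keeps the operation differentiable, so the gradient of
    the adversarial loss can flow back into the localizer that produced the map.
    """
    soft_mask = torch.sigmoid(sigma * (attention - threshold))
    erased = image * (1.0 - soft_mask)
    return erased
```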

5.4 Comparison to Pixel-level Semantic Affinity

We now compare the original PSA results to the results where we have integrated end-to-end adversarial erasing. Table 5 shows the improvements in terms of mIoU. Additionally, we report precision and recall after the CAM generation stage. End-to-end adversarial erasing improves performance in this stage for all metrics. In other words, the combination of the adversarial and regularization loss terms forces the attention map to spread to less discriminative object regions without spreading to background areas. Besides, the mIoU scores in this stage are higher than those reported in Table 3, which is caused by the extensive


                 Validation            Test
Class            PSA†       EADER      PSA       EADER
background       86.7       88.2       89.1      89.1
aeroplane        53.2       54.9       70.6      53.7
bike             29.1       31.3       31.6      33.6
bird             76.7       84.1       77.2      81.6
boat             44.2       58.2       42.2      52.1
bottle           67.7       70.9       68.9      69.5
bus              85.2       83.0       79.1      82.7
car              72.4       76.2       66.5      74.3
cat              71.7       82.1       74.9      86.0
chair            26.7       24.4       29.6      26.7
cow              76.5       80.6       68.7      79.1
dining table     40.9       35.8       56.1      43.9
dog              72.2       80.7       82.1      83.1
horse            68.2       76.4       64.8      78.7
motorbike        70.2       73.7       78.6      75.6
person           66.4       70.8       73.5      71.0
potted plant     37.8       15.4       50.8      23.6
sheep            80.9       77.2       70.7      80.0
sofa             38.5       34.6       47.7      42.9
train            62.8       66.4       63.9      63.8
tv               45.4       52.6       51.1      49.9
mean             60.7       62.8       63.7      63.8

Table 6: Per-class comparison with Pixel-level Semantic Affinity (PSA) on the Pascal VOC 2012 validation and test sets with only image-level supervision. Note that PSA uses the more powerful WideResNet-38 as the final segmentation model on the test set. The column with a † corresponds to our reproduced results.

test-time augmentations used by PSA (cf. Table 7). In the next stage, training AffinityNet with the improved outputs of the first stage again results in better mIoU scores. Finally, we report results when training a fully supervised semantic segmentation model on the proxy labels generated by AffinityNet. As we utilize DeepLabV3+ instead of ResNet-38 as the fully supervised semantic segmentation model, we also report the results of DeepLabV3+ trained on proxy labels without end-to-end adversarial erasing. Again, this results in improvements in terms of mIoU, demonstrating the integrability of end-to-end adversarial erasing into existing WSSS methods.

In Table 6 we make a per-class comparison on the validation and test sets. Note that PSA uses WideResNet-38 as the final segmentation model. This model performs better than DeepLabV3+ trained on proxy labels, as it gives a larger performance increase when trained on these labels (cf. Table 9). Recall that the discriminative localization problem is especially prevalent in non-rigid object classes. We find that end-to-end adversarial erasing significantly improves


Scale              Flipped    PSA     PSA w/ EADER
0.5                ✗          39.8    40.1
1                  ✗          42.4    44.2
1.5                ✗          36.5    38.2
2                  ✗          29.6    31.3
0.5, 1, 1.5, 2     ✓          46.8    48.6

Table 7: Performance in terms of mIoU on the Pascal VOC 2012 validation set at different input image scales. The mentioned scales are the scales used during inference. When multiple scales are mentioned, the average over the scales is taken. "Flipped" denotes whether the flipped version of the image is also used during inference. The bottom row shows the default scales and flipping that PSA [16] uses.

the results in many non-rigid object classes such as bird, cat, cow and horse. Typically, in these object classes the most discriminative object region is the head or the feet, which causes the attention map to cover only a small portion of the object. With end-to-end adversarial erasing, the localizer is forced to capture the entire object region, as the fur or skin of these object classes is less discriminative, but still recognizable. For outdoor object classes the results are often similar to PSA, while for indoor object classes the performance is often degraded. Overall, end-to-end adversarial erasing improves the performance, even though it makes use of a less powerful fully supervised segmentation model.

In Figure 10 we show some qualitative results demonstrating the increase in precision, recall and mIoU. In the first four rows we find that end-to-end adversarial erasing better segments objects by capturing less discriminative object regions, especially for non-rigid object classes. The increased precision, as for instance observed in the last two samples, can be attributed to the regularization term that forces the attention to spread only to areas where the localizer is confident that they belong to an object region.

5.4.1 Scale invariance

Ahn and Kwak propose to use randomly scaled input images during training to impose scale invariance on the networks of the first and third stage [16]. We test this hypothesis by feeding the validation set into the network of the first stage at different scales and compare it to the network with end-to-end adversarial erasing. The results are shown in Table 7. We observe that the mIoU changes with different input scales, both for PSA and for PSA with end-to-end adversarial erasing. However, when taking the different scales into account, the standard deviation decreases by 3% with EADER. Hence, neither method is invariant to the input image scale, but with EADER the scale invariance improves.

In the bottom row, the default test-time augmentations that PSA uses are shown. When using all four scales and the flipped versions of the input images, the average is taken over a total of eight images. This improves the performance significantly, both for PSA and for PSA with EADER.
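A sketch of this test-time augmentation, assuming a model that returns per-class attention maps, is shown below; the exact interpolation settings are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multiscale_flip_cams(model, image, scales=(0.5, 1.0, 1.5, 2.0)):
    """Average attention maps over rescaled and horizontally flipped inputs.

    model(image) is assumed to return per-class attention maps of shape (B, C, h, w).
    With four scales and their flipped versions, eight forward passes are averaged.
    """
    _, _, H, W = image.shape
    accumulated = 0.0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[3]) if flip else scaled
            cam = model(inp)
            if flip:
                cam = torch.flip(cam, dims=[3])
            cam = F.interpolate(cam, size=(H, W), mode="bilinear", align_corners=False)
            accumulated = accumulated + cam
    return accumulated / (2 * len(scales))
```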


Figure 10: Qualitative results on the Pascal VOC 2012 validation set (columns: input, ground truth, PSA, PSA w/ EADER). The white edges in the ground truth mask denote pixels that are ignored during evaluation. The end-to-end adversarial erasing strategy increases the recall without sacrificing precision (top rows) and is more precise in some cases (bottom two rows).


Method           Supervision    Validation    Test
AE-PSL [17]      I + S          55.0          55.7
GAIN [20]        I + S          55.3          56.8
DCSP [85]        I + S          60.8          61.9
SeeNet [22]      I + S          63.1          62.8
ACoL [21]        I              56.1†         -
EADER (Ours)     I              62.8          63.8

Table 8: Quantitative comparison with the state-of-the-art in adversarial erasing methods for WSSS on the Pascal VOC 2012 dataset. For supervision, I denotes image-level labels and S denotes saliency masks. The result with a † was obtained from [22].

5.4.2 Integrability of End-to-End Adversarial Erasing

We have purposely kept our method simple to make it easy to integrate into other methodologies that can benefit from increased coverage of the attention maps. Our method requires only image-level labels and does not use any bells and whistles that are common in WSSS. Furthermore, it is agnostic to the neural network architecture, as it erases the attention map from the image instead of the feature map. In our standalone implementation we utilize Grad-CAM as the attention map generation method, while we use CAM when integrating into PSA. These properties show that end-to-end adversarial erasing can be integrated with a wide range of methods, even for other downstream tasks such as object detection.
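As an illustration of this architecture-agnostic behaviour, the following is a generic Grad-CAM sketch that relies only on forward and backward hooks on an arbitrary convolutional layer; it is a standard formulation of Grad-CAM rather than a copy of our implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, class_idx, target_layer):
    """Grad-CAM attention map for one class, hooked on an arbitrary conv layer.

    Returns a (B, 1, h, w) map normalized to [0, 1]; h and w are the spatial
    dimensions of the chosen layer's activations.
    """
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    logits = model(image)                      # (B, C) classification logits
    model.zero_grad()
    logits[:, class_idx].sum().backward()

    h1.remove()
    h2.remove()

    # Global-average-pool the gradients to obtain per-channel weights.
    weights = gradients[0].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations[0]).sum(dim=1, keepdim=True))
    cam = cam / cam.amax(dim=(2, 3), keepdim=True).clamp(min=1e-8)
    return cam
```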

5.5 Comparison to Adversarial Erasing Methods

In Table 8 we compare our results to previous WSSS methods that follow an adversarial erasing strategy. We outperform all existing adversarial erasing methods, even though most of them use stronger supervision signals in the form of saliency masks. When comparing to ACoL [21], the only other adversarial erasing methodology without saliency masks, we significantly outperform their result on the validation set. We cannot make a comparison to Kim et al. [84] as their method has only been evaluated on the weakly-supervised object localization task.

Furthermore, all of the mentioned methods use techniques that are complementary to adversarial erasing. This makes a fair comparison of just the adversarial erasing component impossible. We are the first to report the results of applying adversarial erasing without any other techniques or extra supervision. These results are shown in the last row of Table 5, i.e. we obtain 48.6 mIoU, 61.3% precision and 68.7% recall.

5.6 Comparison to the State-of-the-Art

In Table 9 we compare to previous WSSS methods, where we report the feature extractor that is used to generate object locations, the fully supervised model that is trained on proxy labels (if applicable) and the form of supervision. In this table we do not show methods that


Method                      Feature Extractor   Fully Supervised Model (Backbone)   Supervision   Validation   Test
FCN [28]                    -                   (VGG16)                             F             -            62.2
WideResNet-38 [30]          -                   (WideResNet-38)                     F             80.8         82.5
PSPNet [5]                  -                   (ResNet-101)                        F             -            82.6
DeepLabV3+ [4]              -                   (Xception-65)                       F             84.6         87.8
SN_B [53]                   VGG-16              DeepLab (VGG-16)                    I + S         41.9         43.2
AE-PSL [17]                 VGG-16              DeepLab (VGG-16)                    I + S         55.0         55.7
Oh et al. [55]              VGG-16              DeepLab (VGG-16)                    I + S         55.7         56.7
GAIN [20]                   VGG-16              DeepLab (VGG-16)                    I + S         55.3         56.8
MCOF [64]                   VGG-16              DeepLab (VGG-16)                    I + S         56.2         57.6
DCSP [85]                   VGG-16              -                                   I + S         58.6         59.2
DSRG [47]                   VGG-16              DeepLab (VGG-16)                    I + S         59.0         60.4
SeeNet [22]                 VGG-16              DeepLab (VGG-16)                    I + S         61.1         60.7
MDC [52]                    VGG-16              DeepLab (VGG-16)                    I + S         60.4         60.8
MCOF [64]                   VGG-16              DeepLab (ResNet-101)                I + S         60.3         61.2
DCSP [85]                   ResNet-101          -                                   I + S         60.8         61.9
FickleNet [49]              VGG-16              DeepLab (VGG-16)                    I + S         61.2         61.9
Fan et al. [50]             ResNet-50           DeepLab (VGG-16)                    I + S         61.3         62.1
SeeNet [22]                 VGG-16              DeepLab (ResNet-101)                I + S         63.1         62.8
OAA+ [51]                   VGG-16              DeepLab (VGG-16)                    I + S         63.1         62.8
DSRG [47]                   VGG-16              DeepLab (ResNet-101)                I + S         61.4         63.2
Fan et al. [50]             ResNet-50           DeepLab (ResNet-101)                I + S         63.6         64.5
CIAN [54]                   VGG-16              DeepLab (ResNet-101)                I + S         64.3         65.3
FickleNet [49]              VGG-16              DeepLab (ResNet-101)                I + S         64.9         65.3
OAA+ [51]                   VGG-16              DeepLab (ResNet-101)                I + S         65.6         66.4
MIL-FCN [44]                VGG-16              -                                   I             -            25.7
EM-Adapt [46]               VGG-16              -                                   I             38.2         39.6
SEC [41]                    VGG-16              DeepLab (VGG-16)                    I             50.7         51.7
MEFF [79]                   VGG-16              FCN (VGG-16)                        I             -            55.6
RRM [18]                    WideResNet-38       DeepLab (VGG-16)                    I             60.7         61.0
Araslanov and Roth [95]     WideResNet-38       -                                   I             62.7         64.3
SSDD [40]                   WideResNet-38       WideResNet-38                       I             64.9         65.5
RRM [18]                    WideResNet-38       DeepLab (ResNet-101)                I             66.3         66.5
PSA [16] (baseline)         WideResNet-38       DeepLab (VGG-16)                    I             58.4         60.5
PSA [16] (baseline)         WideResNet-38       WideResNet-38                       I             61.7         63.7
PSA [16]† (baseline)        WideResNet-38       DeepLab (Xception-65)               I             60.7         -
EADER (Ours)                WideResNet-38       DeepLab (ResNet-101)                I             62.5         63.0
EADER (Ours)                WideResNet-38       DeepLab (Xception-65)               I             62.8         63.8

Table 9: Comparison of WSSS methods on the Pascal VOC 2012 dataset. For the supervision, I denotes image-level labels, S denotes saliency masks and F denotes pixel-level labels, which is the upper bound for fully supervised semantic segmentation. For the feature extractor, the mentioned architecture is the one that is used to generate the initial object locations (e.g. by generating attention maps). The result with a † is our reproduced result with a DeepLab version that is more comparable to ours.

use stronger supervision than image-level labels, such as points, scribbles or bounding boxes. We also only show methods that train solely on the Pascal VOC 2012 dataset, i.e. no additional data from the web or from videos. The methods denoted with supervision signal F set the upper bound of the segmentation performance. Note that our method outperforms a fully


supervised FCN [28] network and that we achieve ≈ 73% of the upper bound set by a fully supervised DeepLabV3+ [34] model. Methods with S supervision use saliency masks, which are often obtained by methods that are trained with full supervision. Hence, we make a distinct comparison between methods that utilize only image-level labels (I) and methods that also utilize saliency masks (I + S).

We show that we outperform many existing WSSS methods that use either only image-level labels or also saliency masks. We do not reach state-of-the-art results, but as we have shown that our method is simple and integrable into existing methods, we hypothesize that better performance can be obtained when integrating it with a better performing baseline. Candidate methods are those that suffer from the discriminative localization problem, where only the most discriminative object region is selected as an initial segmentation mask for further stages of the model. For these models, the enhanced quality of the segmentation masks directly benefits the next stage, without requiring any architectural changes to the later stages. In the attention map generation stage, only the adversarial model and its corresponding loss functions need to be added, as sketched below.
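As an illustration, a single joint training step could look like the sketch below. It reuses the hypothetical helper functions from the earlier sketches (erase_with_attention, localizer_loss, adversarial_loss) and assumes the localizer returns both classification logits and an attention map; it is a schematic outline under those assumptions, not a verbatim transcription of our training code.

```python
def training_step(localizer, adversarial, opt_loc, opt_adv, image, labels, alpha=0.05):
    """One joint update of the localizer and the adversarial model.

    Both networks see the same image-level labels; only the localizer receives the
    alpha-weighted adversarial term, which reaches it through the differentiable
    erasing of the image.
    """
    # Localizer update: classification loss + adversarial (attention mining) term.
    loc_logits, attention = localizer(image)          # assumed to return both
    erased = erase_with_attention(image, attention)
    adv_logits = adversarial(erased)
    loss_loc = localizer_loss(loc_logits, adv_logits, labels, labels, alpha=alpha)
    opt_loc.zero_grad()
    loss_loc.backward()
    opt_loc.step()

    # Adversarial update: plain classification on the (detached) erased image.
    adv_logits = adversarial(erased.detach())
    loss_adv = adversarial_loss(adv_logits, labels)
    opt_adv.zero_grad()
    loss_adv.backward()
    opt_adv.step()
    return loss_loc.item(), loss_adv.item()
```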
