Hierarchical Inductive Bias for Semantic Image Segmentation


MSc Artificial Intelligence

Master Thesis


by

Julian M. Schoep

10628126

September 8, 2020

36 European Credits March 2019 - August 2020

Supervisors:

Dr. Erman Acar

Dr. Nanne J.E. van Noord

Assessor:

Prof. Dr. Frank A.H. van Harmelen


Abstract

Semantic segmentation is a challenging computer vision task where all pixels in an input image need to be correctly labelled with a semantic class. We integrate hierarchical structure in the predictions of a state-of-the-art semantic segmentation algorithm, biologically motivated by the hierarchical representation inherent to human visual perception. To that end, we propose a system that performs hierarchical classification, using an a priori obtained soft hierarchical partitioning of a hyperbolic manifold as prototypical embedding space. This system performs hierarchical hyperbolic multinomial logistic regression (MLR), which makes this thesis, to our knowledge, the first to perform full hierarchical classification using hyperbolic MLR. The current implementation is intractable for semantic segmentation, which is why we propose an implementation of hyperbolic MLR that is able to make predictions over all pixels in a high-resolution input image in parallel while retaining an acceptable memory footprint. This additionally enables a dense visualization of two-dimensional hyperbolic manifold prototypes, which we use to show that our hyperbolic prototype learning algorithm produces a maximally separated partitioning of the embedding space that guarantees hierarchical entailment, using only a single optimization objective with a minimal number of hyperparameters. By evaluating our approach on a challenging large-scale semantic segmentation benchmark, Coco-stuff, we empirically show that our system produces predictions that are comparable in performance but significantly more hierarchically consistent when compared to a non-hierarchical baseline. We further find that combining hyperbolic space with hierarchical classification is mutually beneficial, yielding hierarchical consistency that outperforms both methods individually on the Coco-stuff benchmark.
A useful side-effect of the hierarchical segmentation predictions is that they naturally enable generalization to unseen classes, which we demonstrate with a zero-label semantic segmentation experiment. This generalization ability is also apparent in our interpretability experiment, where we demonstrate that our system is able to accurately embed pixels on a two-dimensional hyperbolic manifold with a predefined, human-readable hierarchical structure. This visualization shows that our method is able to make meaningful, accurate and human-interpretable semantic segmentation predictions on visual concepts it has not seen before.


Acknowledgements

I remember a time, not long past, where I stated that this thesis would be at most 15 pages. Let’s just say I could have used some prior knowledge integration myself in the writing of this thesis. This work has been a significant part of my life for the past months and it was not without its difficulties. A late-stage topic change, an unfortunate incident involving milk, a backpack and a laptop, and a global pandemic are but a few of the challenges I’ve had to face in the completion of this work. But it was most certainly worth it. I present to you a work that I am most proud of, and something that I could have never achieved if it were not for a large group of people supporting me every step of the way.

First of all, I thank my supervisors, Erman Acar and Nanne van Noord. You listened attentively to every one of my crazy ideas, bravely tried to put boundaries on the scope of this thesis, provided incredibly constructive and detailed feedback and have supervised this work to completion until the very end. Furthermore, you put me in touch with the right people at the right time, and pushed me to be both a better writer and a better researcher. I have sustained significant growth as an academic researcher because of you, that I know I will profit from for the rest of my life. A significant challenge for me personally, was wrapping my head around the concept of hyperbolic space. I thank Christo Morison for helping me immensely by demystifying hyperbolic geometry. I also thank Andrii Skliar and Teng Long, who have taken time out of their busy schedules to explain how exactly you combine neural networks with hyperbolic geometry. You three put all the pieces of the hyperbolic space puzzle in place and made it ’click’.

I also thank David Kuric and Jerome Mutgeert for helping out a random person asking complicated linear algebra questions in the MSc AI WhatsApp group. You helped me out significantly. I also thank Emile van Krieken and Michael Cochez for fruitful discussions on conditional probability spaces, and for putting the intellectual thumb screws to my ideas in cramped meeting rooms at the VU.

Furthermore, my colleagues at Promaton deserve significant praise. Especially Alex Nederhof and Frank Claessen, for proofreading and for the sincere enthusiasm and interest that motivated me to see this through. Oh, also thanks for that monster you literally delivered to my doorstep; its compute power was invaluable in the creation of this work.

Finally, I especially want to thank my amazing friends and family. You supported me through everything and made me feel like I wasn’t alone during thesis quarantine. I would have been nowhere without you.

To every one of you, I extend my heartfelt gratitude. Thank you. Julian Schoep


Introduction

The field of Artificial Intelligence (AI) has seen a recent resurgence in approaches that integrate prior knowledge into intelligent learning systems (Akata et al. 2015; Zhao et al. 2017a; Xian et al. 2019; von Rueden et al. 2019; Liang et al. 2018b,a; Redmon and Farhadi 2017; Barz and Denzler 2019). Choosing the right type of prior knowledge to integrate is important, as the wrong type can have adverse effects on the performance of the system. A hierarchy is a popular choice of knowledge representation integrated by these approaches as it describes many real-world, complex, symbolic¹ systems, including language, music, governments, data storage, and vision. Hierarchies are described as a tree-structured graph where vertices refer to symbols and edges to directed subsumption relationships between them. The choice of subsumption relationship can express varying types of ordering on the symbolic elements contained in the graph, which is what makes hierarchies so effective at describing a wide range of complex systems. Furthermore, we humans make heavy use of this data representation in structuring information internally (Tenenbaum et al. 2011). This makes hierarchical representations naturally familiar to humans, which, we argue, can lead to more human-interpretable machine learning predictions. We propose to integrate hierarchical prior knowledge into the task of semantic image segmentation, a task which demands dense and precise visual concept categorization. Hierarchies naturally describe the categorization of objects, which makes them a suitable data representation for this task. Furthermore, we make the case that humans perform dense visual concept categorization in a hierarchical way, and that enhancing semantic segmentation architectures with this ability causes them to produce interpretable and informative predictions that facilitate the incorporation of these models into human-operated interfaces. We equip a computer vision system with various forms of hierarchical inductive bias in localizing and categorizing a wide range of semantic concepts in real-world image data, and perform a thorough evaluation of our approach on four common prediction quality metrics.

Even though symbolic representation is fundamental to human intelligence, the field of AI has been dominated by statistical, or sub-symbolic, approaches in recent years due to the astounding success of deep learning (Krizhevsky et al. 2012a; Silver et al. 2016; LeCun et al. 2015). Modern statistical methods by and large adopt a black box approach to creating intelligent systems by automatically learning large, complex functions that best reproduce intelligent behaviour. These systems learn by automatically optimizing a complex non-linear function composed of billions of parameters that can convert input into desirable output, given enough data. This approach is very effective and has enabled systems that can autonomously drive cars, fly planes, summarize novels, write movie scripts, translate texts, identify pathologies and perform many other highly intelligent tasks. Although highly effective, fully statistical approaches are not without their limitations. The statistical black box approach was popular in part because learning with symbolic knowledge, also called hand-engineered features, was difficult and subject to the capacity and expertise of the researcher. It was preferred to assume minimal to no prior knowledge about a problem domain and to just let the model learn everything. However, this creates feature representations that are optimized only for the task they are trained on and that do not necessarily generalize well to other, downstream tasks. Furthermore, the sheer size and complexity of the resulting functions make them next to impossible for humans to interpret (hence the term black box).

¹ Symbols are the abstract mental constructs that we humans use to interact with, interpret and reason over the world around us, such as cats or a Master's thesis. They can also refer to non-existent concepts such as unicorns and summer holidays.

There has been a push in recent years to alleviate these problems by integrating symbolic representations into statistical approaches, thereby combining strengths of both (Battaglia et al. 2018). This practice has recently gained traction in many areas of machine learning (von Rueden et al. 2019), including computer vision (Zhao et al. 2017a; Akata et al. 2015; Xian et al. 2019; Liang et al. 2018b; Setti 2018).

1.1 Inductive Bias for Better Solutions

Learning entails generalizing from past experiences to make good predictions on new data (Mitchell 1980). New data can theoretically take on any value, which makes it impossible to make correct predictions without making assumptions. Making assumptions about new data based on related but different past experience is called generalization, which is crucial to learning. Inductive bias expresses the set of assumptions we might have about the data or the symbols represented within it, which might help the learning system to produce better solutions. Inductive bias is essential for making the inductive leap, i.e. composing a potentially uncertain hypothesis based on related experiences. For example, you have perhaps seen multiple ostriches, either in real life or in documentaries, and have never seen one of them fly. You might therefore induce that all ostriches are unable to fly, which constitutes the inductive leap. It can never be grounded in theory that ostriches cannot fly (perhaps they can fly on the moon, or maybe they only fly when no one is looking), which means you used assumptions based on some examples of ostriches to construct a hypothesis about the flying ability of all ostriches. Symbolic knowledge integration techniques equip the learning system with inductive bias to find solutions that are more robust or in other ways better suited for the task at hand, prioritizing one solution over another independent of the data (Battaglia et al. 2018).

Relational inductive bias is a variant of inductive bias that expresses assumptions about the relationships between symbols represented in the data, such as cat is more similar to dog than it is to pizza, or humans ride motorcycles while elephants do not. Relational inductive bias can also express relationships between structures in the data itself. An example of relational inductive bias in computer vision systems is the use of convolutional layers that slide small windows over the image to recognize small patterns at various locations. This encodes the assumption that translation equivariance, explained in Section 2.1, is helpful, as we want the system to recognize objects regardless of their global position in the image. The inductive bias is relational in the sense that it expresses constraints on which pixels are considered in a local window and which pixels are not, thereby holding assumptions about the relations between all pixels in the image. This kind of bias has been shown to be effective at a multitude of computer vision tasks, including large-scale image recognition (Deng et al. 2014), few- and zero-shot learning settings (Akata et al. 2015; Xian et al. 2019), and more robust semantic image retrieval (Barz and Denzler 2019).
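To make the translation equivariance assumption concrete, the following minimal sketch slides a small window over a toy image and checks that moving the object moves the response by the same offset. The helper names (`cross_correlate`, `make_image`, `argmax2d`) and the 8x8 toy image are illustrative assumptions and not part of the system described in this thesis.

```python
def cross_correlate(image, kernel):
    """Slide `kernel` over `image` without padding (what a conv layer computes)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def make_image(top, left, size=8):
    """Blank size x size image with a bright 2x2 object whose corner is (top, left)."""
    img = [[0.0] * size for _ in range(size)]
    for r in range(top, top + 2):
        for c in range(left, left + 2):
            img[r][c] = 1.0
    return img


def argmax2d(grid):
    """Position (row, col) of the largest value in a 2D grid."""
    positions = ((i, j) for i in range(len(grid)) for j in range(len(grid[0])))
    return max(positions, key=lambda ij: grid[ij[0]][ij[1]])


kernel = [[1.0, 1.0], [1.0, 1.0]]      # detector for the 2x2 pattern
original = make_image(top=1, left=1)
shifted = make_image(top=3, left=4)    # same object, moved by (+2, +3)

response_original = cross_correlate(original, kernel)
response_shifted = cross_correlate(shifted, kernel)

peak_original = argmax2d(response_original)
peak_shifted = argmax2d(response_shifted)
print(peak_original, peak_shifted)     # (1, 1) (3, 4)
```

The same kernel weights are reused at every location, so the peak response follows the object by exactly the offset it was moved. This weight sharing is the relational constraint referred to above.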

Figure 1.1: Reproduced from Mathieu et al. (2019): A tree embedded on a Poincaré disc, a model of hyperbolic space. The gray curves are equal-length geodesics, discussed in Section 2.2.

Structural bias is another form of inductive bias that expresses assumptions about the global structure of the data, instead of the individual relationships between the symbols contained within it. The use of hyperbolic space in machine learning tasks is an example of a structural bias used in tasks where a latent hierarchical structure in the data is assumed. Hyperbolic space is characterized by its constant negative curvature, a term further expanded on in Section 2.2, which causes the space to grow exponentially with distance to the origin, similar to how trees grow exponentially with depth (Nickel and Kiela 2017). Fig. 1.1 demonstrates this by illustrating a tree representing a hierarchy embedded in hyperbolic space: the grey curves, which are the hyperbolic counterpart of Euclidean straight lines, all have the same length, yet they appear to shrink as they approach the boundary. Very recently, various machine learning works (Nickel and Kiela 2017; Ganea et al. 2018a; Khrulkov et al. 2019; Mathieu et al. 2019; Liu et al. 2019; Chami et al. 2019) have begun to leverage this intrinsic property to induce a structural hierarchical bias on the embedding space, which has been shown to be very effective in generalizing over data with inherent hierarchical structure such as taxonomies, phylogenetic trees and entailment datasets.
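The visual effect described for Fig. 1.1 can be reproduced numerically. The sketch below assumes the standard distance function of the Poincaré ball with curvature -1, as used by Nickel and Kiela (2017). The function names and the choice of points along the x-axis are illustrative assumptions.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    denom = (1.0 - sum(a * a for a in u)) * (1.0 - sum(b * b for b in v))
    return math.acosh(1.0 + 2.0 * sq_dist / denom)


def point_at_hyperbolic_radius(r):
    """Point on the positive x-axis at hyperbolic distance r from the origin."""
    return (math.tanh(r / 2.0), 0.0)


# Walk outwards from the origin in steps of equal hyperbolic length.
points = [point_at_hyperbolic_radius(r) for r in range(6)]

hyperbolic_steps = [poincare_distance(points[i], points[i + 1]) for i in range(5)]
euclidean_steps = [points[i + 1][0] - points[i][0] for i in range(5)]

for hyp, euc in zip(hyperbolic_steps, euclidean_steps):
    print(f"hyperbolic step = {hyp:.3f}   Euclidean step = {euc:.3f}")
```

Every step has hyperbolic length one, while its Euclidean length on the disc shrinks towards the boundary. This is why equal-length geodesics in Fig. 1.1 look shorter near the edge, and why there is room near the boundary for the exponentially many leaves of a tree.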

This thesis will combine and test both types of inductive bias, relational and structural, to induce a strong hierarchical inductive bias on a system performing dense visual classification. We will equip the system with a relational inductive bias by using hierarchical relationship information among classification targets to force the system to make cascading, hierarchical predictions, as well as provide a predefined degree of similarity between semantic concepts. Based on recent successes in embedding hierarchical data, we will test whether the structural hierarchical bias induced by hyperbolic space is useful in this setting as well.

1.2 Semantic Segmentation

Inspired by how humans perform dense visual categorization, this thesis leverages hierarchical inductive bias specifically for the dense visual classification task of semantic image segmentation. Semantic segmentation algorithms perform automatic localization and identification of semantic objects in an image. This means that the image needs to be partitioned into meaningful, semantic regions, or segments. Segments are continuous image regions where a single label is applied to each of the pixels in the region. This region ideally consists of all the pixels in the image that contain the semantic concept the label describes. As such, modern works approach this as a pixel-wise prediction task, an approach bolstered by recent advances in computer vision research with the introduction of convolutional neural networks (CNN) and dense feature extraction, which are statistical approaches further detailed in Section 2.1. We will use this section to provide notation and a formal definition of the task of semantic segmentation.

The task of semantic segmentation is to assign a label y ∈ C to each of the pixels in an image, where C = {1, ..., K} denotes the set of all possible class labels in the dataset. As a supervised learning problem, an input image x ∈ X of height H, width W and color-channel count⁴ ch is provided with a target segmentation y ∈ Y: a pixel-map of the same dimensions H, W that contains the relevant class label at each pixel position. An example of an image and the associated semantic segments is graphically illustrated in Fig. 1.2. The task of the semantic segmentation algorithm is then to learn the correct mapping X → Y.

Figure 1.2: Adapted from Zhang et al. (2020): The semantic image segmentation of a cat and a dog. The image is shown on the left, and the semantic segments are illustrated with the class label superimposed on the right.

This mapping is learned by optimizing a complex function which we will break up into two separate parts, the embedding function f and the decision function g. Given an input image x, these functions are concatenated and optimized as a single model: g(f(x)). The embedding function f : X → R^{H,W,d} ⁵ generates a grid of d-dimensional embeddings φ = f(x) ∈ R^{H,W,d} for all pixel positions in the image. Embeddings are relatively low-dimensional points that represent very high-dimensional data, such as images, with their position in the embedding space, i.e. the low-dimensional space the embeddings are positioned in. Given a single pixel-embedding φ, the decision function g : R^d → C will predict a label ŷ = g(φ) for that pixel based on the position of the pixel-embedding in the embedding space. The focus of this thesis lies on integrating hierarchical prior knowledge only in the decision function, and is, by design, agnostic to the choice of embedding function. Embedding functions in this task are usually deep convolutional neural networks (DCNN), and we will discuss several variants used for semantic segmentation in Section 3.2.1.

⁴ Digital images generally hold three color-channels, Red, Green and Blue, constituting the RGB color space.

⁵ With the notation R^{x,y,z} we denote all real-valued matrices with dimensions x, y and z.
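The decomposition g(f(x)) introduced in this section can be sketched in a few lines. This is a toy illustration under stated assumptions: the per-pixel `tanh` projection stands in for a deep CNN, the linear argmax stands in for the decision function, and all names and sizes (`W_embed`, `W_cls`, `H`, `W`, `CH`, `D`, `K`) are hypothetical.

```python
import math
import random

random.seed(0)
H, W, CH, D, K = 4, 5, 3, 8, 6   # toy height, width, channels, embedding dim, classes

# Hypothetical stand-in parameters; in practice f is a deep CNN with learned weights.
W_embed = [[random.gauss(0, 1) for _ in range(D)] for _ in range(CH)]
W_cls = [[random.gauss(0, 1) for _ in range(K)] for _ in range(D)]


def embed_pixel(pixel):
    """Map one pixel with CH channel values to a D-dimensional embedding."""
    return [math.tanh(sum(pixel[c] * W_embed[c][k] for c in range(CH)))
            for k in range(D)]


def f(x):
    """Embedding function: H x W x CH image -> H x W x D grid of embeddings."""
    return [[embed_pixel(pixel) for pixel in row] for row in x]


def g(phi):
    """Decision function: one pixel-embedding in R^d -> a label in C = {1, ..., K}."""
    logits = [sum(phi[k] * W_cls[k][c] for k in range(D)) for c in range(K)]
    return 1 + max(range(K), key=lambda c: logits[c])


x = [[[random.random() for _ in range(CH)] for _ in range(W)] for _ in range(H)]
phi = f(x)                                     # grid of pixel-embeddings
y_hat = [[g(p) for p in row] for row in phi]   # predicted pixel-map of size H x W
```

Because g only sees pixel-embeddings, it can be swapped for a different decision function without touching f, which is the separation this thesis relies on.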

1.3 Hierarchy in Visual Perception

Figure 1.3: An unfamiliar object humans can still categorize as being an animal due to an internal hierarchical categorization of visual concepts.

The main contribution of this thesis lies in the integration of hierarchical structure into semantic segmentation predictions by combining various methods that induce hierarchical inductive bias in the decisions of the model. Our main motivation lies in the proposition that identifying and localizing objects in images is inherently a hierarchical task. We argue this point with Fig. 1.3, where we see an unfamiliar animal. This animal does not exist in our world, but humans are still able to deduce that it is some kind of animal / living thing. The object in Fig. 1.3 possesses visual features that we recognize from other visual concepts, which allows us to make this general classification.

Cognitive psychology literature suggests that there exists a representation of visual concepts with inherent hierarchical structure in the human inferior temporal cortex, a part of the brain that is crucial to visual processing and visual object recognition (Kriegeskorte et al. 2008). This hierarchical representation, which Kriegeskorte et al. (2008) find also exists in monkey brains, generally describes an ordering from more general (e.g. animal) to more granular (e.g. Siamese cat) and is used internally to represent information at various levels of abstraction. This representation helps us abstract away irrelevant information from our environments as it allows us to choose a pragmatic level of precision with which we wish to observe our surroundings; you want to watch out for all vehicles when crossing a busy intersection, but you want to find your bicycle with that specific dent in the mudguard when you forgot where you parked it exactly. It also allows us to extract useful information when met with unfamiliar visual concepts, as we can observe familiar features shared with more general concepts to roughly categorize what the unfamiliar object might be.

Non-hierarchical semantic segmentation algorithms, on the other hand, would have to choose one semantic label out of a mutually exclusive set of supervised targets and would not provide any more information. This also means that the prediction would always be wrong when confronted with a concept not in the target set, such as the unfamiliar animal in Fig. 1.3. Provided that hierarchical representation is innate in human visual perception, we argue that it is beneficial to integrate this representation into semantic segmentation predictions. This makes the predictions more informative, familiar and interpretable, which facilitates the integration of these models into intelligent interfaces used by humans.

Furthermore, integrating hierarchical knowledge representations into predictions can improve the performance, generalization ability and consistency of computer vision systems, as demonstrated by an extensive body of research on this topic (Liang et al. 2018b,a; Li et al. 2019; Teng Long 2020; Liu et al. 2020; Khrulkov et al. 2019; Skliar 2019; Barz and Denzler 2019; Deng et al. 2014; Redmon and Farhadi 2017). Inspired by a broad analysis of these works, we design a new system that combines three complementary hierarchical bias approaches, using a hierarchical partitioning of a hyperbolic manifold as prototypical embedding space while training a CNN-based hierarchical classification system that performs semantic image segmentation. We will perform extensive ablation studies and experiments that test how our approach affects the quality of semantic image segmentation predictions and demonstrate that our approach can indeed be used to create human-interpretable pixel-level predictions.


1.4 Research Questions and Contributions

Given the demonstrated benefits of explicitly inducing hierarchical bias in visual classifiers, combined with the fact that hierarchical structure is inherent to human internal visual categorization and dense visual perception, we formulate the following general research question:

RQ How does explicitly induced hierarchical inductive bias affect the quality of semantic image segmentation predictions?

We will answer this question by testing three methods that explicitly induce hierarchical inductive bias in visual classifiers on four concrete facets of prediction quality commonly used in recent literature. This leads to the following more specific sub-questions:

1 How does explicitly induced hierarchical inductive bias affect the standard classification performance of semantic image segmentation predictions?

Standard classification performance is the most typical measure of prediction quality that quantifies how good the model is at predicting the correct semantic class. We use three common semantic segmentation metrics to answer this question: accuracy, class accuracy and mean intersection over union. These metrics are explained in further detail in Section 5.1.3.
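As a rough illustration of the three metrics named above, the sketch below computes them from a confusion matrix using their common textbook definitions. The exact definitions used in this thesis are given in Section 5.1.3. The function name, the zero-based labels and the choice to average only over classes that occur in the target are assumptions made for this example.

```python
def segmentation_metrics(target, pred, num_classes):
    """Pixel accuracy, mean per-class accuracy and mean IoU for labels 0..K-1."""
    # cm[t][p] counts pixels with target class t that were predicted as class p.
    cm = [[0] * num_classes for _ in range(num_classes)]
    for target_row, pred_row in zip(target, pred):
        for t, p in zip(target_row, pred_row):
            cm[t][p] += 1

    classes = range(num_classes)
    tp = [cm[k][k] for k in classes]
    n_target = [sum(cm[k]) for k in classes]
    n_pred = [sum(cm[r][k] for r in classes) for k in classes]
    present = [k for k in classes if n_target[k] > 0]   # skip absent classes

    pixel_acc = sum(tp) / sum(n_target)
    class_acc = sum(tp[k] / n_target[k] for k in present) / len(present)
    miou = sum(tp[k] / (n_target[k] + n_pred[k] - tp[k]) for k in present) / len(present)
    return pixel_acc, class_acc, miou


target = [[0, 0, 1],
          [1, 2, 2]]
pred = [[0, 1, 1],
        [1, 2, 0]]

pixel_acc, class_acc, miou = segmentation_metrics(target, pred, num_classes=3)
print(f"accuracy={pixel_acc:.3f} class accuracy={class_acc:.3f} mIoU={miou:.3f}")
# accuracy=0.667 class accuracy=0.667 mIoU=0.500
```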

Aside from measuring how good a model is at producing correct predictions, it is also insightful to know how bad predictions are when the model is incorrect. This leads to the following question:

2 How does explicitly induced hierarchical inductive bias affect the hierarchical consistency of semantic image segmentation predictions?

Hierarchical consistency is measured through relaxed variants of the standard classification metrics, where predicting an incorrect semantic class very similar to the target class is still counted as correct. The similarity is defined by how close concepts are in a hierarchy: concepts that share an ancestor in the hierarchy are more similar than concepts that do not. This effectively measures how far off the model was from the correct target. Making better mistakes is a desirable property of visual classifiers as, for example, predicting a bicycle instead of a motorcycle is preferred over predicting a pizza instead of a motorcycle.
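One simple way to read such a relaxed metric is sketched below: a prediction counts as correct when it shares an ancestor with the target a given number of levels up the hierarchy. This is an illustrative assumption and not necessarily the exact metric used in our experiments. The toy hierarchy `PARENT` and the function names are hypothetical.

```python
# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "bicycle": "vehicle", "motorcycle": "vehicle",
    "pizza": "food", "cake": "food",
    "vehicle": "thing", "food": "thing",
    "thing": None,
}


def ancestor(label, levels_up):
    """Walk `levels_up` steps towards the root, stopping at the root."""
    for _ in range(levels_up):
        parent = PARENT[label]
        if parent is None:
            break
        label = parent
    return label


def relaxed_accuracy(targets, preds, levels_up):
    """Fraction of predictions that match the target `levels_up` levels up."""
    hits = [ancestor(t, levels_up) == ancestor(p, levels_up)
            for t, p in zip(targets, preds)]
    return sum(hits) / len(hits)


targets = ["motorcycle", "motorcycle", "pizza", "bicycle"]
preds = ["bicycle", "pizza", "pizza", "bicycle"]

print(relaxed_accuracy(targets, preds, levels_up=0))   # 0.5  (standard accuracy)
print(relaxed_accuracy(targets, preds, levels_up=1))   # 0.75 (bicycle ~ motorcycle)
```

With `levels_up=0` this reduces to standard accuracy. One level up, confusing a motorcycle with a bicycle is forgiven, while confusing it with a pizza is not.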

We already described how the internal hierarchical categorization of visual objects helps humans roughly categorize unfamiliar objects. We test whether hierarchical inductive bias achieves a similar effect by quantifying generalization ability:

3 How does explicitly induced hierarchical inductive bias affect the generalization ability of semantic image segmentation predictions?

Generalization ability is described as being able to make meaningful predictions in new and unfamiliar environments, which is a useful trait for both humans and intelligent systems and is essential for models that need to operate in the real world. We quantify this ability by evaluating whether predictions on unfamiliar concepts are close to the correct semantic class. In recent literature, this is generally evaluated through zero-shot classification tasks where prediction performance is evaluated on concepts the model has not seen labelled examples from. We evaluate our approach on a semantic segmentation variant of this task, called zero-label semantic segmentation, and measure performance on unfamiliar targets using both standard and hierarchical performance metrics.

Finally, we argued that hierarchical visual representation is natural to humans and that this representation can lead to more interpretable predictions. This leads to the final sub-question:

4 How does explicitly induced hierarchical inductive bias affect the interpretability of semantic image segmentation predictions?

We briefly highlighted a lack of interpretability as a significant drawback to current statistical, black-box approaches. This is a significant barrier in the adoption of these models as it hinders effective interaction between humans and intelligent systems. Interpretability generally measures how informative predictions are to humans, which is relatively subjective and difficult to measure with quantitative experiments. Instead, we will provide empirical qualitative evidence that our proposed hierarchically biased approach can be used to create human-readable and informative predictions that are additionally able to generalize to unfamiliar concepts in an interpretable manner.

We test these measures on a combination of three approaches that explicitly induce hierarchical inductive bias that we select based on a broad analysis of recent literature on this topic. This thesis provides a contribution in this field by combining three approaches that have not been implemented in a united system yet. We argue that these approaches are highly complementary and that combining them produces higher quality predictions than when these methods are applied individually, which we demonstrate with extensive ablation studies.

Specifically, we enhance a state-of-the-art semantic image segmentation architecture with a versatile decision function design that uses a hierarchical partitioning of a hyperbolic manifold as prototypical embedding space in performing hierarchical semantic image segmentation. A second contribution lies in enabling the application of hierarchical hyperbolic multinomial logistic regression, a hierarchical and hyperbolic variant of a popular decision function commonly used in classification approaches, on the dense semantic segmentation task. We propose a tractable implementation that is able to predict thousands of pixels with hundreds of semantic concepts in parallel while retaining an acceptable memory footprint. We subsequently use this implementation to produce dense visualizations of the hyperbolic prototypical embedding space using a hierarchical color-scheme. This visualization illustrates a third contribution, in that it shows that our prototype learning algorithm achieves maximal prototype separation and guarantees hierarchical entailment using only a single optimization stage, while this required separate optimization stages in earlier works. We rigorously test our system on three experiments that allow us to confidently answer our main research question by answering each of the four sub-questions.

Our first experiment performs ablation studies to evaluate standard classification performance and hierarchical consistency of predictions on a large-scale and challenging semantic segmentation benchmark, Coco-stuff (Caesar et al. 2018). We test the effect on performance brought about by hyperbolic space, hierarchical classification, and the use of a prototypical embedding space, and further show how these components influence each other.

Our second experiment will evaluate generalization performance on a zero-label semantic segmentation task. Following a zero-label benchmark set by Xian et al. (2019), we obscure a sub-set of Coco-stuff labels from the optimization objective during training and subsequently evaluate both the hierarchical and the standard segmentation performance on these classes. We will perform ablation studies here as well, testing each of our system-components.
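One plausible way to obscure labels from the optimization objective is to leave pixels of held-out classes out of the training loss. The sketch below shows this for a pixel-wise cross-entropy. It is an assumption made for illustration, not a description of our training code, and the function name and zero-based labels are hypothetical.

```python
import math


def masked_cross_entropy(logits, target, unseen_classes):
    """Mean pixel-wise cross-entropy over pixels whose target is a seen class.

    logits: H x W x K nested lists, target: H x W nested lists of labels 0..K-1.
    """
    losses = []
    for logit_row, target_row in zip(logits, target):
        for pixel_logits, label in zip(logit_row, target_row):
            if label in unseen_classes:
                continue                      # obscured: contributes no gradient
            m = max(pixel_logits)             # subtract max for numerical stability
            log_norm = m + math.log(sum(math.exp(v - m) for v in pixel_logits))
            losses.append(log_norm - pixel_logits[label])
    return sum(losses) / len(losses)


target = [[0, 1],
          [2, 2]]                             # class 2 is held out
uniform = [[[0.0, 0.0, 0.0] for _ in range(2)] for _ in range(2)]

# Same logits, except at the pixels that belong to the held-out class.
perturbed = [row[:] for row in uniform]
perturbed[1] = [[5.0, 0.0, -5.0], [5.0, 0.0, -5.0]]

loss_uniform = masked_cross_entropy(uniform, target, unseen_classes={2})
loss_perturbed = masked_cross_entropy(perturbed, target, unseen_classes={2})
```

Both losses are equal, since whatever the model predicts on held-out pixels has no influence on the objective. At evaluation time the held-out classes are scored like any other class.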

These experiments indicate that the three components of our approach are mutually beneficial when combined and outperform variants that use only a subset or none of these components. These experiments further demonstrate that the hierarchical predictions generalize well and are naturally able to perform zero-label semantic segmentation without any extra modifications. We will additionally highlight strengths and weaknesses of the use of hyperbolic space, and are able to confirm useful properties of hyperbolic space often reported by the hyperbolic machine learning community.

Our final experiment provides qualitative evidence of the interpretability benefit of our design, which demonstrates that our approach can accurately embed pixels on a predefined two-dimensional hyperbolic manifold with a human-readable hierarchical structure. We show that combining this with a hierarchical color-scheme enables highly interpretable semantic image segmentation predictions. We perform this experiment on both Coco-stuff and the smaller-scale Pascal VOC benchmark (Everingham et al. 2011), and evaluate classification performance and hierarchical consistency on this smaller benchmark as well. We show that our approach enables meaningful, human-readable and semantically consistent predictions on unfamiliar visual concepts by visualizing dense interpretable embeddings of pictures containing concepts that do not occur in either dataset, such as tigers and even dinosaurs.


1.5 Outline

This thesis aims to provide a clear understanding of the various methods for integrating hierarchical knowledge into modern computer vision systems. To that end, in Chapter 2 we introduce the necessary preliminary material to understand our and related work on this subject. We will first introduce convolutional neural networks in Section 2.1, which provides insight into how modern approaches extract meaningful and discerning features from images. We then go on to describe hyperbolic geometry in Section 2.2, which provides an intuitive idea of hyperbolic space, as well as the mathematical tools and concepts needed to perform machine learning operations on hyperbolic manifolds. Finally, Section 2.3 will explain how these hyperbolic machine learning operations work and how they enable a new class of neural networks called hyperbolic neural networks.

Then, Chapter 3 will be used to discuss and analyse various recent approaches to semantic segmentation and the integration of prior knowledge into computer vision tasks. We will first provide a broad analysis of prior knowledge integration techniques applied in computer vision that serve as inspiration for the design of our system. We subsequently discuss current state-of-the-art embedding functions for semantic segmentation in Section 3.2.1, and then discuss works that similarly integrate symbolic knowledge into semantic segmentation algorithms in Section 3.2.2.

Chapter 4 will be used to explain and motivate our proposed design. In Section 4.1 we will discuss how semantic structure is enforced in embedding spaces through margin hyperplanes resulting from the popular Multinomial Logistic Regression (MLR) decision function. This section also introduces a derivation of hyperbolic MLR, as well as our tractable implementation for semantic image segmentation. Finally, Section 4.2 will introduce our proposed algorithm called hyperarchical prototypes. It explains how we perform full hierarchical classification, and how we can obtain and visualize hyperarchical prototypes by combining hierarchical classification with MLR.

We will report our experiments in Chapter 5. Section 5.1 will provide an exhaustive description of our experiments, including descriptions of the benchmark datasets we test our algorithms on, the hierarchies we use as prior knowledge, the metrics we use to evaluate semantic segmentation performance, and implementation and training details. Section 5.2 will discuss the results of our experiments, and will aim to answer our research questions. Specifically, Section 5.2.1 will discuss the standard classification performance and hierarchical consistency of our approach, and will perform extensive ablation studies that test the various components of our approach against a non-hierarchical baseline. Section 5.2.2 will quantify how well our proposed model generalizes to unseen classes and Section 5.2.3 will demonstrate the power of our approach in creating interpretable semantic segmentation predictions.

Finally, Chapter 6 will wrap up our work, where we answer our research question and summarize the findings of our research. We additionally offer a short discussion about the challenges we faced, practical considerations and potential future directions for this work.

Chapter 2

Preliminary Material

We will use this chapter to introduce the preliminary material for this thesis. This work is related to a wide range of topics in machine learning, and this chapter will provide the necessary background information needed to understand various concepts introduced in this thesis, as well as the works related to this thesis.

We will first explain what convolutional neural networks are, how they work and why they are so effective at computer vision tasks. We will also provide a brief description of techniques that enabled deep convolutional neural networks, a type of neural network that forms the bedrock of our approach.

We briefly mentioned the concept of hyperbolic space in the introductory material. We will provide a brief introduction to hyperbolic geometry in this chapter, which will provide an intuitive idea of what hyperbolic space is and how it differs from Euclidean space, the space implicitly used in most machine learning algorithms. Finally, the use of hyperbolic space in machine learning works has only recently been enabled with the introduction of hyperbolic machine learning. We will introduce this concept and explain how we can perform fundamental machine learning operations in hyperbolic space by using hyperbolic neural networks.

2.1 Convolutional Neural Networks

Research into embedding functions for computer vision has seen significant progress in recent years. The introduction of convolutional neural networks and machine-learning techniques that facilitated the training of deep neural networks, i.e. deep learning, has been fundamental to this progress, which enabled AI systems that surpass even human-level recognition in certain tasks. This section provides relevant background on convolutional neural networks, the history of image classification algorithms and the breakthroughs that enabled the creation of deep convolutional neural networks.

2.1.1 Encoding Context with Convolutions

In order for an embedding function to capture the semantic meaning of an object contained in an image pixel, it is crucial to take the context of the pixel into account; a pixel containing a patch of fur can have a wide range of semantic meanings, from a fur coat to a llama, depending on the context. Obtaining semantic features from image context has been a long-standing problem in the field of computer vision, which has been predominantly approached with convolutional neural networks (CNNs) in recent years. These context-sensitive features are, ideally, defining and discriminating characteristics of the input data that help the model discern between different semantic concepts. CNNs are neural networks that mainly consist of convolutional layers (Fukushima 1980; LeCun et al. 1989). A single convolutional layer consists of several filters: small¹ patches that contain various patterns. Each of these filters is slid over the input image, and at every position the dot product between the filter and the underlying image patch is computed. The individual filters usually have the same depth as the input image, but variations do exist where different filters attend to each input image dimension. The sliding dot-product operation is called cross-correlation, and produces a feature map for each of the filters in the layer. These feature maps hold the activations of the filters, i.e. the results of the sliding dot product, thereby recording the similarity of the input image with the patterns contained in these filters at all positions. Cross-correlation introduces translation equivariance to the system, as the filters are re-used at all positions. Translation equivariance means that the translation of the input results in an equivalent translation of the output response. This property introduces an important inductive bias to the system that makes CNNs so effective for computer vision tasks. When learning the filters from data, cross-correlation additionally enables efficient re-use of parameters across the image and enables the neural network to learn to recognize patterns in the input irrespective of their global position in the image.
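The sliding dot product and the resulting translation equivariance can be illustrated with a minimal NumPy sketch. The function name `cross_correlate2d`, the single-channel input and the "valid" border handling are assumptions made for illustration; they are not taken from the implementation used in this thesis.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid-mode sliding dot product of a 2-D image with a 2-D filter."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the filter and the image patch below it.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))

# An otherwise empty image with a small pattern near the top-left corner.
image = np.zeros((8, 8))
image[1:4, 1:4] = rng.normal(size=(3, 3))

# The same pattern moved 2 pixels down and 3 pixels to the right.
shifted = np.roll(image, shift=(2, 3), axis=(0, 1))

fmap = cross_correlate2d(image, kernel)
fmap_shifted = cross_correlate2d(shifted, kernel)

# Equivariance: the feature map moves by the same offset as the input.
print(np.allclose(fmap_shifted[2:, 3:], fmap[:4, :3]))
```

Because the same filter weights are applied at every position, shifting the input only shifts the feature map; no new parameters are needed to detect the pattern at its new location.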

The patterns in these filters used to be hand-engineered and fixed in the early days of computer vision, but modern approaches generally take the statistical approach by learning the filters from data using recent advances in the deep learning field. Learning these filters from data avoids the cumbersome process of creating them a priori. Furthermore, choosing and constructing the right patterns is difficult and requires expertise from the designer of the system. Learning these patterns fully from the data, a prime example of the power of statistical AI, enables the stacking of a large number of convolutional layers, as new layers and patterns can be added at little to no cost and without requiring expertise on effective patterns for visual recognition.

Stacking a large number of these layers on top of each other is what enabled the impressive performance of CNNs we see today. Stacking convolutional layers allows the model to recognize increasingly complex patterns (Zeiler and Fergus 2014). Except for the first layer, each layer has as input the feature-map of the previous layer. Layers closer to the input generally learn low-level features like edges and colours, and higher-order layers can subsequently learn to combine patterns of the preceding layer to recognize more complex patterns like squares and circles. Stacking enough layers ultimately enables the network to recognize complete, abstract semantic concepts like faces, animals and vehicles.

2.1.2 Surpassing Human-level Image Recognition

The computer vision field has seen tremendous breakthroughs that have come about mainly through the research effort into image classification systems. In image classification, the intelligent system needs to produce a single semantic label for an input image. CNNs were successfully applied for the first time in 1998 by LeCun et al. (1998) to recognize zip-codes from hand-written digits, where they used machine learning to learn the filters from labeled data automatically. Some time later, in 2012, Krizhevsky et al. (2012b) introduced AlexNet, which used regularization techniques such as dropout and max-pooling, combined with a deeper network architecture, to significantly increase performance on the 1,000-class ImageNet 2012 classification benchmark (Russakovsky et al. 2015). They achieved a 16.4% error rate, beating the second best by a margin of 9.7 percentage points.

Recent works predominantly use the VGG (Simonyan and Zisserman 2014) or ResNet (He et al. 2016) architectures to extract deep features. Deep features are the result of encoding high-dimensional data, such as images, with deep neural networks. These encodings are often able to represent high-dimensional, noisy data in relatively low-dimensional space, thereby extracting useful information (i.e. features) and abstracting away irrelevant noise. Deep CNNs, like the VGG and ResNet models, are architectures with tens to hundreds of stacked convolutional layers. Stacking that many layers has only recently been made possible by a tremendous increase in computing power, which enabled highly parallelized gradient computation.

More important, however, were recent developments in deep learning, the "art" of training deep neural networks (DNNs). Stacking a large number of convolutional layers comes with various challenges that make the training of deep neural networks difficult. A significant problem of training DNNs is that the number of parameters in these models is so large that the model can effectively "memorize" the input data, instead of learning the intrinsic causal relationships that generate the data. This phenomenon is called overfitting. Mini-batch stochastic gradient descent, weight-decay, residual connections, and the aforementioned max-pooling and dropout operations are techniques that partially relieve this problem.

Mini-batch stochastic gradient descent (SGD) describes an iterative parameter optimization technique that is widely used in training neural networks. In a supervised learning setting, the objective function² generally measures a similarity between the output of a neural network and a desired output, i.e. the 'objective'. Objective functions optimized with this technique need to be differentiable, as this allows the computation of the partial derivatives, i.e. gradients, of this function with respect to the parameters in the neural network. The negative of these gradients points in the direction in which the parameters need to be updated in order to minimize the difference between the current and desired output of the neural network. By computing the gradients over mini-batches, i.e. relatively small sub-samples of the dataset, the model is encouraged to find general solutions that fit a wide variety of data. The degree to which such solutions are found is called the generalization ability of the neural network. Generalization ability of a model can be seen as an inverse to overfitting, and describes the extent to which the model is able to learn features that generalize well to unseen data. Good generalizability is a desirable trait for neural networks, as this means the neural network will make good predictions on data it has not encountered yet.
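As a rough illustration of the update loop described above, the following sketch runs mini-batch SGD on a toy linear regression problem. The synthetic data, the mean-squared-error objective, the learning rate and the batch size are all assumptions chosen for the example and are unrelated to the training setup of this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption for illustration only).
X = rng.normal(size=(256, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=256)

w = np.zeros(3)              # parameters to optimize
lr, batch_size = 0.1, 32     # hypothetical hyperparameters

for epoch in range(50):
    order = rng.permutation(len(X))          # reshuffle the data every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        residual = X[idx] @ w - y[idx]
        # Gradient of the mean squared error over this mini-batch only.
        grad = 2.0 * X[idx].T @ residual / len(idx)
        # Step against the gradient to decrease the objective.
        w -= lr * grad

print(np.round(w, 2))
```

Each update sees only a small random subset of the data, so the gradient is a noisy estimate of the full-dataset gradient; the parameters nonetheless approach `true_w` in this toy setting.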

Weight-decay adds a regularization term to the objective function that penalizes parameters that have a large norm value:

$$\mathcal{L}_{wd} = \sum_{v \in V} \mu_{wd} \cdot \|v\|, \tag{2.1}$$

where $\mu_{wd}$ is a weighting term that determines the level of regularization, and $\|v\|$ denotes the l2-norm of parameter v. As the objective function is minimized by the optimizer, adding this regularization term causes the parameters to shrink slowly over time, which leads to the neural network "forgetting" parameters that are not updated often. This motivates the model to use fewer, more generalizing connections in the network, thereby counteracting overfitting.
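The penalty of Eq. 2.1 and its shrinking effect can be sketched in a few lines. The helper names `weight_decay_penalty` and `decay_step` and all numeric values are hypothetical; the sketch follows the un-squared l2-norm of Eq. 2.1, whereas a common variant in practice penalizes the squared norm instead.

```python
import numpy as np

def weight_decay_penalty(params, mu_wd):
    """Sum of mu_wd * ||v|| over all parameter tensors v, as in Eq. 2.1."""
    return sum(mu_wd * np.linalg.norm(v) for v in params)

def decay_step(v, mu_wd, lr):
    """One gradient step on the penalty alone: the gradient of ||v|| is v / ||v||."""
    return v - lr * mu_wd * v / np.linalg.norm(v)

params = [np.array([3.0, 4.0]), np.array([0.0, -2.0])]
mu_wd = 0.01

penalty = weight_decay_penalty(params, mu_wd)   # 0.01 * (5 + 2)

# A parameter that receives no gradient from the task loss still shrinks.
v_new = decay_step(params[0], mu_wd, lr=0.1)
print(penalty, np.linalg.norm(params[0]), np.linalg.norm(v_new))
```

Repeating `decay_step` without any opposing task gradient drives the parameter norm towards zero, which is the "forgetting" effect described above.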

Residual connections (He et al. 2016) connect layers in the network that are not directly stacked on top of each other. More specifically, these connections "skip" several layers before reconnecting with the neural network, which is why they are also referred to as skip connections. This means that output from layer L will be used both as input for layer L + 1, as well as, for example, layer L + 5, thereby skipping layers L + 1 to L + 4 with this connection. This enables the model to "choose" whether it will use the layers that are skipped over. The network can learn to deactivate³ the skipped layers if they impede performance and use the residual connection to pass information from previous layers on to higher layers. The use of these skip connections was an important step in the design of the ResNet architecture, as more layers can easily be added without significantly increasing the difficulty of training these architectures.

³ Set the parameters in the layer to zero.

Figure 2.1: Adapted from Alom et al. (2019): Error rate on the large-scale ImageNet-2012 dataset from various CNN architectures over the years.

Max-pooling reduces the resolution of the feature maps between layers by taking the maximum activation in a small rectangular grid at each position. This effectively reduces the number of available parameters and the memory demand of the model and introduces a small amount of translation invariance, meaning that the output is the same regardless of small translations of the input. Dropout is the process that randomly switches off, or "drops", parameters in a convolutional layer, selecting new parameters to drop each forward pass⁴. This also reduces the effective number of parameters, thereby forcing the network to store more information with fewer parameters, which can reduce the tendency of the network to overfit. Furthermore, the network is forced to use new parameters each iteration, which further reduces the chance of memorizing the input data.
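Both operations are simple enough to sketch directly. The following NumPy example assumes non-overlapping 2×2 pooling windows and the "inverted" dropout convention, in which the surviving values are rescaled by 1/(1 − p); both choices, as well as the function names, are illustrative assumptions rather than details taken from this thesis.

```python
import numpy as np

def max_pool2d(fmap, size=2):
    """Non-overlapping max-pooling over size x size windows of a 2-D feature map."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size          # crop so the windows fit exactly
    blocks = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

def dropout(x, p, rng):
    """Zero each entry with probability p and rescale the rest (inverted dropout)."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

fmap = np.array([[1., 2., 0., 0.],
                 [3., 4., 0., 5.],
                 [0., 0., 7., 0.],
                 [0., 6., 0., 0.]])
pooled = max_pool2d(fmap)          # one maximum per 2x2 window

rng = np.random.default_rng(0)
dropped = dropout(np.ones((100, 100)), p=0.5, rng=rng)

print(pooled)
print(np.unique(dropped))
```

Moving the value 7 to any other position inside its 2×2 window leaves `pooled` unchanged, which is the small amount of translation invariance mentioned above.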

Two other developments were fundamental to the success of deep convolutional neural networks, namely batch normalization and the ReLU activation function. Batch normalization (Ioffe and Szegedy 2015) normalizes the feature maps between the layers. This forces the activation distributions to remain centered around 0 and have a limited variance, similar to how input data is often normalized before data analysis tasks. This reduces the exposure of the model to trivial distribution shifts in the activations, which greatly reduces the difficulty of training DNNs and also increases the speed at which they train.

Finally, activation functions are non-linear functions that act on the activations of every DNN layer to introduce non-linearity to the total function described by the DNN. This enables the neural network to learn non-linear relationships, which is essential for modelling complex input such as image-data. The ReLU activation function (Nair and Hinton 2010) only passes positive activations to the next layer, which reduces the chance of overfitting as it promotes network sparsity. It also mitigates a neural saturation problem that the earlier used sigmoid and tanh functions suffer from.
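The saturation problem can be made concrete by comparing derivatives. The sketch below evaluates ReLU and the sigmoid at a few hand-picked inputs (the input values are arbitrary assumptions) and shows that the sigmoid gradient almost vanishes for large activations, while the ReLU gradient stays at 1 for every positive input.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.5, 10.0])

relu_out = relu(x)                           # negative activations are set to zero
relu_grad = (x > 0).astype(float)            # derivative: 1 for positive inputs, else 0
sig_grad = sigmoid(x) * (1.0 - sigmoid(x))   # derivative of the sigmoid

print(relu_out)
print(relu_grad)
print(sig_grad)
```

The zeros in `relu_out` are the source of the sparsity mentioned above, and the tiny values at both ends of `sig_grad` are the saturation that slows down gradient-based training.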

The development of these techniques, combined with increased computing capabilities, led to a surge in deep computer vision research that other machine-learning fields, including semantic segmentation, have benefited from. Modern convolutional neural networks are able to recognize 1,000 categories from the ImageNet-2012 dataset (Russakovsky et al. 2015) with an error rate surpassing human competence, as shown in Fig. 2.1.

2.2 Hyperbolic Geometry

Most neural-net based machine learning methods, including those discussed in the previous section, implicitly operate in Euclidean space. Recent research into hyperbolic machine learning has however demonstrated that neural networks that operate in hyperbolic space possess a structural hierarchical bias that is especially rewarding in embedding data with inherent hierarchical structure (Nickel and Kiela 2017; Ganea et al. 2018a; Khrulkov et al. 2019; Mathieu et al. 2019; Liu et al. 2019; Chami et al. 2019). As discussed in the introduction, the structural hierarchical bias induced by hyperbolic space results from its constant negative curvature. Neural network operations fundamentally depend on geometric notions like distance, angles and orientation, which, in Euclidean space, are retrieved based on Euclidean theorems that do not hold in this curved, hyperbolic space. To enable machine learning on these spaces, we need new geometric theorems to retrieve these fundamental geometric notions. This section provides the mathematical preliminaries needed to understand non-Euclidean, and by extension, hyperbolic geometry at a high level. We briefly discuss the concept of curved spaces, what non-Euclidean geometry is, and how we can work with it using manifolds. Finally, we introduce the Poincaré ball model, a widely used model for hyperbolic geometry which maps Euclidean to hyperbolic space. The main focus of this section is to provide an intuitive idea of what hyperbolic space is and how we retrieve some elementary geometric notions in this space. We recommend Robbin and Salamon (2011) for a more thorough introduction to the concepts explained here.

2.2.1 Curved Space and Non-Euclidean Geometry

Geometry is the branch of mathematics that concerns itself with notions of shape, size, relative position and distance. Euclid is credited with providing the bedrock for what we now call Euclidean geometry in his textbook the Elements around the 3rd century BC. The Elements introduced five postulates that were regarded as self-evident statements in the geometry of creating geometric figures with ruler and compass on a flat surface, such as "one can draw a line between any two points in space". These postulates could be accepted without proof and are fundamental to Euclidean geometry as all Euclidean geometric theorems⁵ are ultimately derived from these postulates. These postulates are however only valid on flat surfaces, which means that theorems on Euclidean geometry, derived from these postulates, are likewise only valid in flat spaces.

Figure 2.2: Reproduced from abyss.uoregon.edu: effect of the curvature of space on the sum of angles of a triangle. Left to right: positive curvature of a spherical manifold, negative curvature of a hyperbolic manifold, and zero curvature on a Euclidean manifold.

For this thesis, however, we need to retrieve geometric notions in hyperbolic space, which is a space characterized by a constant negative curvature and is decidedly not flat. A curved space, due to its "non-flatness", is generally described as being non-Euclidean. A defining characteristic of curved space is its departure from the well-known Pythagorean theorem, i.e. on a curved space, $a^2 + b^2 \neq c^2$, where a, b, c are the sides of a right triangle. A sphere is a good example of a curved, two-dimensional space, which is a surface of constant positive curvature. Intuitively, a space with positive curvature "curves away" from you in all directions, while a space with constant negative curvature, i.e. hyperbolic space, "curves towards" you in all directions. Fig. 2.2 provides visual examples of flat, spherical and hyperbolic surfaces, and illustrates a Euclidean geometric theorem that no longer holds on curved spaces, namely "the angles of a triangle sum to 180°": the sum of angles is less than 180° on the negatively curved surface, while it is more than 180° on the positively curved surface.
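The angle excess on a positively curved surface is easy to verify numerically. The sketch below, with the hypothetical helper `vertex_angle`, measures the three interior angles of the triangle on the unit sphere whose corners lie on the three coordinate axes; the choice of this particular triangle is an assumption made because its angles are easy to check by hand.

```python
import numpy as np

def vertex_angle(a, b, c):
    """Interior angle at vertex a of the spherical triangle abc on the unit sphere."""
    # Directions in which the great-circle arcs a->b and a->c leave a,
    # obtained by projecting b and c onto the plane tangent to the sphere at a.
    tb = b - np.dot(a, b) * a
    tc = c - np.dot(a, c) * a
    cos_angle = np.dot(tb, tc) / (np.linalg.norm(tb) * np.linalg.norm(tc))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# Corners on the x, y and z axes: one eighth of the sphere.
A, B, C = np.eye(3)

angle_sum = vertex_angle(A, B, C) + vertex_angle(B, C, A) + vertex_angle(C, A, B)
print(angle_sum)
```

Each corner of this triangle is a right angle, so the sum is 270° rather than the 180° of a flat triangle.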

A familiar example of non-Euclidean space is Earth, which was initially thought to be flat and therefore subject to Euclidean geometry. It is now widely accepted that it is spherical and has a constant positive curvature. It violates, among others, the fifth Euclidean postulate, also known as the parallel postulate. It states that, for any given line and a point not on the line, there is a single line that goes through that point and is parallel to the original line. This postulate does not hold on spheres as all lines on a sphere will eventually intersect, rendering them not parallel. This means that Euclidean geometry is no longer straightforwardly applicable in this space, which is why non-Euclidean geometry is needed to calculate useful geometric notions. One such notion is distance, which is for instance used to determine optimal flight or sail trajectories, and is also essential to classification methods as we will see in Chapter 4. The distance, i.e. the length of the shortest path, between two points on a flat piece of paper is the length of a line connecting these points. This is however no longer the case for the distance between two points on Earth, as illustrated in Fig. 2.3, which shows a straight line A that connects two points, and the shortest path B between these two points over the surface of Earth. While line A provides the shortest distance on the flat Mercator projection on the left, it is actually the longer path on the sphere-shaped Earth. The shortest path between two points in space is called a geodesic and can be thought of as a non-Euclidean equivalent of the Euclidean straight line.

Figure 2.3: Adapted from gisgeography.com: The straight line A compared with the shortest path along a geodesic B, shown on the Mercator projection of Earth on a Euclidean chart left and the actual spherical Earth right.

2.2.2 Riemannian manifolds

To find the shortest path between two points, we first need to be able to determine the length of a path. On a flat, i.e. Euclidean, piece of paper, where geodesics are straight lines, one could simply find a ruler that is long enough and measure the entire length of the path at once. This is known as an extrinsic approach; an approach that assumes to have a global perspective of the space. This global perspective is however not always available in curved or very large spaces such as the universe. The intrinsic approach provides an alternative by obtaining global measurements in geometric spaces where a global perspective is unavailable or impractical. This approach considers only local areas of the space to retrieve global geometric notions and therefore does not need a global perspective. Riemannian manifolds are used to obtain global measurements using intrinsic geometry.

A manifold is, simply put, an object positioned in space. Intuitively, it helps to think of manifolds as two-dimensional surfaces in $\mathbb{R}^3$. Earth, again, provides a good example, which is a smooth two-dimensional manifold in $\mathbb{R}^3$. A manifold is said to be smooth when it is differentiable at every point, or, in more informal terms, does not have any holes, tears or other discontinuities. Smooth manifolds locally resemble Euclidean flat space; Earth is clearly spherical when observed from outer space, but resembles Euclidean space so well when standing on it that some still believe that Earth is actually a Euclidean plane. More formally, a smooth, n-dimensional manifold M is defined as a set of points p ∈ M that locally resembles $\mathbb{R}^n$.

Figure 2.4: Adapted from Wikipedia: The tangent space $T_pM$ at point p of a smooth, spherical manifold M.

A Riemannian manifold is a smooth manifold M equipped with a Riemannian metric g, and is denoted with the tuple (M, g). A smooth manifold has at every point p a different tangent space $T_pM$ that consists of all the vectors that pass through p and are tangential to the manifold at that point. Fig. 2.4 shows an example of a tangent space of a sphere in $\mathbb{R}^3$. A Riemannian metric $g = (g_p)_{p \in M}$ is a set of positive-definite functions $g_p : T_pM \times T_pM \to \mathbb{R}_{>0}$ that define the inner product of two vectors in a tangent space $T_pM$, for every point p ∈ M. This inner product can be used to retrieve local geometric notions like lengths and angles of vectors on the manifold, and defines the local curvature of the manifold at every position. Riemannian manifolds therefore obtain their global structure intrinsically from only local perspectives.

We can use the local measurements obtained with the Riemannian metric to obtain global quantities. Using a velocity vector analogy, a smooth curve $\gamma : [0, t] \to M$ between two points x and y on the manifold M is defined as moving from an initial position $\gamma(0) = x$ along a path, defined by a set of local velocity vectors, to a final position $\gamma(t) = y$, where t represents the time it takes to traverse this path. The local length of a velocity vector v, i.e. the speed, in $T_pM$ is the square root of its inner product with itself, $\|v\|_p = \sqrt{g_p(v, v)}$, and the global length L of the curve γ is then given by integrating over the set of velocity vectors and recording the distance traveled:

$$L(\gamma) = \int_0^t \sqrt{g_{\gamma(t)}(\gamma'(t), \gamma'(t))}\, dt, \tag{2.2}$$

where $\gamma'(t)$ is a vector in the tangent space at point $\gamma(t)$. The global distance between x and y is then obtained by taking the path, i.e. the set of local tangent vectors $\gamma'(t)$ at each time step, that has minimal global length:

$$d(x, y) = \inf_{\gamma} L(\gamma). \tag{2.3}$$

This path is the geodesic mentioned earlier, which is a generalization of the concept of a straight line in Euclidean geometry. Moving along a geodesic with constant speed from point p is given by the exponential map at p, $\exp_p : T_pM \to M$, and the inverse is given by the logarithm map $\log_p : M \to T_pM$. When well-defined, these functions allow us to move between the tangent space and the manifold at all points on the manifold M. Finally, the parallel transport $P_{x \to y} : T_xM \to T_yM$ allows us to move tangent vectors along geodesics on the manifold.

2.2.3 Hyperbolic geometry on the Poincaré ball

Hyperbolic geometry is a non-Euclidean geometry where only the fifth (i.e. the parallel) postulate does not hold. It has constant negative curvature, which, intuitively, means that the space curves towards you at every point. It is difficult to imagine such a space, but there exist multiple equivalent models of hyperbolic space that can be accurately embedded in Euclidean space, such as the Beltrami-Klein model, the hyperboloid model, the Poincaré half-plane model and the Poincaré ball model (Beltrami 1868). We already presented such an embedding in the introduction in Fig. 1.1. We will use and explain the Poincaré ball model as it has been the focus of recent work in the hyperbolic machine learning area, which led to well-defined gradient-based optimization methods and operations for machine learning tasks (Ganea et al. 2018a; Nickel and Kiela 2017; Mathieu et al. 2019; Khrulkov et al. 2019).

The Poincaré ball model $(\mathbb{D}^n_c, g^{\mathbb{D}_c})$ is defined as an open n-dimensional ball $\mathbb{D}^n_c = \{p \in \mathbb{R}^n : c\|p\|^2 < 1\}$ with radius $r = 1/\sqrt{c}$, equipped with a Riemannian metric:

$$g^{\mathbb{D}_c}_p = (\lambda^c_p)^2 g^E, \quad \text{where } \lambda^c_p = \frac{2}{1 - c\|p\|^2}, \tag{2.4}$$

where $g^E$ is the Euclidean metric defined as the standard dot product, $\lambda^c_p$ is the conformal factor, c is a parameter governing the curvature and circumference of the ball and the $\|\cdot\|$ operator denotes the l2-norm. From this point on in the thesis we always refer to the l2-norm when using this operator. Two metrics are said to be conformal to each other when they preserve angles between vectors. Between two arbitrary metrics $\tilde{g}$ and $g$, a conformal factor $\lambda : M \to \mathbb{R}$ defines a smooth function such that $\forall p \in M$, $\tilde{g}_p = \lambda_p^2 g_p$ (Ganea et al. 2018a). As shown in Eq. 2.4, the Poincaré ball model is conformal to Euclidean space and therefore preserves angles between vectors. This equation also shows that the conformal factor, and by extension the hyperbolic inner product, gets larger as the Euclidean norm of p becomes larger. This models the relative growth of hyperbolic space with respect to Euclidean space when approaching the boundary, which explains why the geodesics in Fig. 1.1 appear to shrink from the Euclidean perspective as they reach the boundary.

The global distance between two points on the Poincaré ball has a closed-form expression and is given by:

$$d_{\mathbb{D}_c}(x, y) = \frac{1}{\sqrt{c}} \cosh^{-1}\left(1 + 2c\, \frac{\|x - y\|^2}{(1 - c\|x\|^2)(1 - c\|y\|^2)}\right). \tag{2.5}$$

The Poincaré ball model is characterized by geodesics that are represented as arcs of circles that are orthogonal to the boundary. From this point on, when we use the phrase "hyperbolic space", we specifically refer to the Poincaré ball model of hyperbolic space, unless stated otherwise.
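The link between the curve length of Eq. 2.2 and the closed-form distance can be checked numerically. The sketch below approximates the length of two polylines under the Poincaré ball metric of Eq. 2.4 and compares them with the arcosh form of the distance, written here with a $1/\sqrt{c}$ prefactor so that it matches the metric for $c \neq 1$. The helper names, the end point and the detour point are assumptions chosen for illustration.

```python
import numpy as np

def poincare_distance(x, y, c):
    """Closed-form distance on the Poincare ball (arcosh form, scaled by 1/sqrt(c))."""
    num = 2.0 * c * np.dot(x - y, x - y)
    den = (1.0 - c * np.dot(x, x)) * (1.0 - c * np.dot(y, y))
    return np.arccosh(1.0 + num / den) / np.sqrt(c)

def segment(a, b, n=10001):
    """n evenly spaced points on the Euclidean segment from a to b."""
    ts = np.linspace(0.0, 1.0, n)[:, None]
    return (1.0 - ts) * a + ts * b

def curve_length(points, c):
    """Midpoint-rule approximation of Eq. 2.2 for a polyline in the ball."""
    steps = np.diff(points, axis=0)
    mids = 0.5 * (points[:-1] + points[1:])
    # Conformal factor of Eq. 2.4, evaluated at the midpoint of every step.
    lam = 2.0 / (1.0 - c * np.sum(mids ** 2, axis=1))
    return float(np.sum(lam * np.linalg.norm(steps, axis=1)))

c = 1.0
origin = np.zeros(2)
y = np.array([0.9, 0.0])
via = np.array([0.45, 0.3])      # hypothetical detour point

# Geodesics through the origin are straight Euclidean segments.
radial_length = curve_length(segment(origin, y), c)
detour_length = curve_length(np.vstack([segment(origin, via), segment(via, y)]), c)
closed_form = poincare_distance(origin, y, c)

print(radial_length, closed_form, detour_length)
```

The radial path reproduces the closed-form distance up to discretization error, while the detour is strictly longer, in line with the definition of the distance as the infimum over all curves in Eq. 2.3.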

2.3 Hyperbolic Neural Networks

Hyperbolic neural networks, proposed by Ganea et al. (2018a), are neural networks that operate on hyperbolic spaces. This means that the linear algebra operations that are fundamental to machine learning, like vector addition, subtraction and linear transformations, are performed on hyperbolic manifolds. This section introduces the mathematical definitions used in this thesis for hyperbolic machine learning and parameter optimization.

2.3.1 Gyrovector Spaces

The framework of gyrovector spaces was proposed by Ungar (2005) to study hyperbolic geometry from the perspective of vector spaces. It defines binary operators ⊕ and ⊗ in gyrovector space, analogous to the vector space operators for addition and multiplication (+, ·). This framework is necessary for adding velocity vectors in hyperbolic space as its boundary represents the upper limit of speed⁶. Adding vectors should therefore not result in a vector that falls outside the boundary. Ganea et al. (2018a) use the gyrovector space to define various machine learning operations on a Poincaré ball, which produces the hyperbolic neural networks.

⁶ Hyperbolic space is used in special relativity and physics, where the hyperbolic disc represents an intersection of the light cone. The light cone models how a flash of light emanates through space as a circle with increasing radius. The boundary of the hyperbolic disc therefore represents the upper limit to speed, i.e. the speed of light.

Figure 2.5: $\exp^c_p(x)$ and $\exp^c_0(x')$ illustrated on a Poincaré disc. The lighter-colored lines illustrate geodesics.

We can add two vectors $x, y \in \mathbb{D}^n_c$ together via Möbius addition:

$$x \oplus_c y = \frac{(1 + 2c\langle x, y\rangle + c\|y\|^2)\,x + (1 - c\|x\|^2)\,y}{1 + 2c\langle x, y\rangle + c^2\|x\|^2\|y\|^2}, \tag{2.6}$$

where $\langle\cdot, \cdot\rangle$ is the Euclidean dot product. As c → 0 this becomes the Euclidean vector addition of x and y in $\mathbb{R}^n$. Using this framework, Ganea et al. (2018a) rewrite the distance function from Eq. 2.5 as

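$$d_{\mathbb{D}_c}(x, y) = \frac{2}{\sqrt{c}} \tanh^{-1}\left(\sqrt{c}\,\|{-x} \oplus_c y\|\right) \tag{2.7}$$

and further derive closed-form expressions for the exponential map at p and at 0:

$$\begin{aligned}
\exp^c_p(x) &= p \oplus_c \left(\tanh\left(\frac{\sqrt{c}\,\lambda^c_p \|x\|}{2}\right) \frac{x}{\sqrt{c}\,\|x\|}\right), && x \in T_p\mathbb{D}^n_c \setminus \{0\}, \\
\exp^c_0(x) &= \tanh\left(\sqrt{c}\,\|x\|\right) \frac{x}{\sqrt{c}\,\|x\|}, && x \in T_0\mathbb{D}^n_c \setminus \{0\}.
\end{aligned} \tag{2.8}$$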
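Möbius addition (Eq. 2.6) and the gyrovector form of the distance (Eq. 2.7) translate almost directly into NumPy. The sketch below uses hypothetical function names and example points, and compares Eq. 2.7 against the arcosh form of the distance, written with a $1/\sqrt{c}$ prefactor so that both expressions agree for $c \neq 1$.

```python
import numpy as np

def mobius_add(x, y, c):
    """Mobius addition of two points in the Poincare ball (Eq. 2.6)."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1.0 + 2.0 * c * xy + c * y2) * x + (1.0 - c * x2) * y
    den = 1.0 + 2.0 * c * xy + c ** 2 * x2 * y2
    return num / den

def dist_mobius(x, y, c):
    """Distance expressed through Mobius addition (Eq. 2.7)."""
    sqrt_c = np.sqrt(c)
    return 2.0 / sqrt_c * np.arctanh(sqrt_c * np.linalg.norm(mobius_add(-x, y, c)))

def dist_arcosh(x, y, c):
    """Arcosh form of the distance, scaled by 1/sqrt(c)."""
    num = 2.0 * c * np.dot(x - y, x - y)
    den = (1.0 - c * np.dot(x, x)) * (1.0 - c * np.dot(y, y))
    return np.arccosh(1.0 + num / den) / np.sqrt(c)

x = np.array([0.3, 0.1])
y = np.array([-0.2, 0.5])

for c in (0.7, 1.0):
    z = mobius_add(x, y, c)
    # The sum stays inside the ball of radius 1 / sqrt(c).
    print(c, np.linalg.norm(z) < 1.0 / np.sqrt(c), dist_mobius(x, y, c), dist_arcosh(x, y, c))

# For vanishing curvature, Mobius addition approaches ordinary vector addition.
print(mobius_add(x, y, 1e-12), x + y)
```

Note that Möbius addition is neither commutative nor associative in general, which is why the order of the arguments in `mobius_add(-x, y, c)` matters when it is used inside other operations.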

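Examples of both exponential maps are illustrated in Fig. 2.5. The inverse operations, the logarithm map at p and at 0, are also derived:

$$\begin{aligned}
\log^c_p(y) &= \frac{2}{\sqrt{c}\,\lambda^c_p} \tanh^{-1}\left(\sqrt{c}\,\|{-p} \oplus_c y\|\right) \frac{-p \oplus_c y}{\|{-p} \oplus_c y\|}, && y \in \mathbb{D}^n_c \setminus \{p\}, \\
\log^c_0(y) &= \tanh^{-1}\left(\sqrt{c}\,\|y\|\right) \frac{y}{\sqrt{c}\,\|y\|}, && y \in \mathbb{D}^n_c \setminus \{0\}.
\end{aligned} \tag{2.9}$$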

Finally, the parallel transport of a vector v in $T_0\mathbb{D}^n_c$ to $T_p\mathbb{D}^n_c$ is derived as

$$P^c_{0 \to p}(v) = \log^c_p\left(p \oplus_c \exp^c_0(v)\right) = \frac{\lambda^c_0}{\lambda^c_p}\, v. \tag{2.10}$$
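The exponential map, the logarithm map and the parallel transport from the origin can be checked against each other numerically. The following sketch implements the general maps at a point p (the maps at 0 follow by setting p to the zero vector) with hypothetical function names and example values, and verifies that the logarithm map inverts the exponential map and that the transport reduces to a rescaling by $\lambda^c_0 / \lambda^c_p$.

```python
import numpy as np

def mobius_add(x, y, c):
    """Mobius addition of two points in the Poincare ball."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1.0 + 2.0 * c * xy + c * y2) * x + (1.0 - c * x2) * y
    return num / (1.0 + 2.0 * c * xy + c ** 2 * x2 * y2)

def conformal_factor(p, c):
    return 2.0 / (1.0 - c * np.dot(p, p))

def exp_map(p, v, c):
    """Exponential map at p for a non-zero tangent vector v."""
    sqrt_c, v_norm = np.sqrt(c), np.linalg.norm(v)
    step = np.tanh(sqrt_c * conformal_factor(p, c) * v_norm / 2.0) * v / (sqrt_c * v_norm)
    return mobius_add(p, step, c)

def log_map(p, y, c):
    """Logarithm map at p for a point y != p."""
    sqrt_c = np.sqrt(c)
    diff = mobius_add(-p, y, c)
    diff_norm = np.linalg.norm(diff)
    scale = 2.0 / (sqrt_c * conformal_factor(p, c)) * np.arctanh(sqrt_c * diff_norm)
    return scale * diff / diff_norm

c = 1.0
origin = np.zeros(2)
p = np.array([0.3, -0.2])
v = np.array([0.4, 0.1])

# Round trip: mapping v onto the ball and back recovers v.
y = exp_map(p, v, c)
v_back = log_map(p, y, c)

# Parallel transport of v from the tangent space at 0 to the tangent space at p.
transported = log_map(p, mobius_add(p, exp_map(origin, v, c), c), c)
rescaled = conformal_factor(origin, c) / conformal_factor(p, c) * v

print(v_back, transported, rescaled)
```

Both maps divide by a norm, so the zero tangent vector and the case y = p would need to be handled separately in a practical implementation.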

2.3.2 Hyperbolic Parameter Optimization

Hyperbolic parameters can be optimized with Riemannian gradient-descent-based optimizers. As the Poincaré ball model is conformal to Euclidean space, the Riemannian gradient $\nabla_R$ of a parameter $\theta \in \mathbb{D}^n_c$ is given by scaling the Euclidean gradient $\nabla_E$ by the inverse Poincaré ball metric $g^{-1}_\theta$ (Nickel and Kiela 2017):

$$\nabla_R = \frac{(1 - c\|\theta\|^2)^2}{4} \nabla_E. \tag{2.11}$$

The parameter update can then be performed via Riemannian stochastic gradient descent (RSGD) (Bonnabel 2013). Using $\mu_t$ as learning rate, this is done with the exponential map by Ganea et al. (2018a):

$$\theta_{t+1} \leftarrow \mathrm{proj}\left(\exp^c_{\theta_t}(-\mu_t \nabla_R)\right), \tag{2.12}$$

or via simple retraction as by Nickel and Kiela (2017), where the exponential map is not explicitly defined:

$$\theta_{t+1} \leftarrow \mathrm{proj}(\theta_t - \mu_t \nabla_R). \tag{2.13}$$

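The projection operator is used to ensure that the updated parameter remains within the Poincaré ball:

$$\mathrm{proj}(\theta) = \begin{cases} \theta / \|\theta\| - \epsilon & \text{if } \|\theta\| \geq \frac{1}{\sqrt{c}} \\ \theta & \text{otherwise,} \end{cases} \tag{2.14}$$

where $\epsilon$ is a small constant to ensure numerical stability.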
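The retraction-based update of Eq. 2.13 can be sketched on a toy problem. In the example below, the Euclidean objective (squared distance to a target point), the learning rate and the iteration count are assumptions chosen for illustration. The projection is also an assumed variant: it rescales a point that left the ball to norm $1/\sqrt{c} - \epsilon$, which is one common way to implement the idea behind Eq. 2.14 and not necessarily the exact operator used in this thesis.

```python
import numpy as np

EPS = 1e-5   # hypothetical value for the small stability constant

def project(theta, c, eps=EPS):
    """Pull theta back inside the ball if it reached or crossed the boundary."""
    max_norm = 1.0 / np.sqrt(c) - eps
    norm = np.linalg.norm(theta)
    return theta / norm * max_norm if norm >= 1.0 / np.sqrt(c) else theta

def riemannian_grad(theta, euclidean_grad, c):
    """Rescale the Euclidean gradient by the inverse metric (Eq. 2.11)."""
    return (1.0 - c * np.dot(theta, theta)) ** 2 / 4.0 * euclidean_grad

def rsgd_step(theta, euclidean_grad, lr, c):
    """One retraction-based RSGD update (Eq. 2.13)."""
    return project(theta - lr * riemannian_grad(theta, euclidean_grad, c), c)

c = 1.0
target = np.array([0.5, -0.3])   # toy optimum inside the ball
theta = np.zeros(2)

for _ in range(500):
    grad = 2.0 * (theta - target)        # Euclidean gradient of ||theta - target||^2
    theta = rsgd_step(theta, grad, lr=0.5, c=c)

print(theta)
print(np.linalg.norm(project(np.array([3.0, 4.0]), c)))
```

The metric rescaling makes the effective step size shrink as the parameter approaches the boundary, so points near the boundary move slowly in Euclidean terms even for large Euclidean gradients.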

2.4 Closing remarks

This concludes the preliminary material on convolutional neural networks, hyperbolic geometry, and the tools that enable machine learning on hyperbolic manifolds. Equipped with this knowledge, we are able to understand the various approaches to computer vision and hyperbolic machine learning, and understand why hyperbolic space might be beneficial under certain circumstances. It should also give an intuitive idea of what hyperbolic space entails. We will discuss various hyperbolic and computer vision approaches in the next chapter, where we go over the broad spectrum of works that are related to this thesis.


Chapter 3

Related Work

By now we have discussed how CNNs are able to extract meaningful features from context in images using convolutional layers, and how non-Euclidean geometry can be used to implement neural networks that operate on curved manifolds. This chapter will show how state-of-the-art semantic image segmentation approaches extend CNNs used in image classification to perform dense, pixel-level predictions. We will also use this chapter to analyse a wide spectrum of methods from recent literature that design hierarchically biased computer vision systems. This will give an overview of the different methods that are currently being applied, and provide evidence of the positive effect that hierarchical bias can have on performance over a wide range of computer vision tasks.

We will first provide a broad overview of methods that equip computer vision systems with useful inductive biases through prior knowledge integration in Section 3.1. We will subsequently narrow our scope to the task at hand, semantic image segmentation, in Section 3.2. We will begin that section by discussing the current state of the art in semantic image segmentation architectures, and then discuss works that integrate prior knowledge into these architectures.

3.1 Computer Vision with Prior Knowledge

Standard supervised learning approaches use the teacher-student paradigm, where the teacher provides examples and the student is tasked to label the examples with the correct label supplied by the teacher. Learning with privileged information (Vapnik and Izmailov 2015) is a popular method of prior knowledge integration in computer vision that extends the teacher-student paradigm by providing the student with privileged information in addition to the target label during training. We will discuss various works that use this approach in Section 3.1.1 and show how both hierarchical and non-hierarchical prior knowledge can be integrated in this way. Another approach is to integrate prior knowledge about a problem domain directly in the design of the computer vision algorithm. We will briefly discuss some of these approaches in Section 3.1.2.

3.1.1 Using Prior Knowledge as Privileged Information

Many approaches, including this thesis, leverage prior knowledge by explicitly structuring the embedding space in a certain way prior to training. This predefined structure guides the student to embed visual concepts in the embedding space in a way that benefits subsequent tasks or improves generalization ability. The structure is generally conveyed through some kind of similarity measure and is materialized through class prototypes. Class prototypes are geometric structures in the embedding space whose positions characterize the semantic meaning of their corresponding semantic classes. For example, the prototype for the cat class should be closer to the dog prototype than to the car prototype, as cats and dogs are more semantically similar. The student is then forced to adhere to this semantic structure by embedding visual features as close as possible to the relevant class prototype through the objective function. The student therefore gets privileged information on the relations between the target classes and learns to embed visual concepts in a way that is coherent with these relationships. Semantic similarity is subsequently quantified with similarity measures, which are functions that measure similarity between class prototypes and embeddings through some distance metric. We will discuss various similarity measures and prototypes used in recent literature in this section.

Hierarchical prior knowledge

The predefined structure of an embedding space depends on the type of privileged information that is used. Barz and Denzler (2019) and Teng Long (2020) are examples of works that, like our approach, use hierarchical relationships to construct prototypes that act as privileged information.

Barz and Denzler (2019) improve semantic consistency of image embeddings in an image retrieval task by placing n class prototypes on the positive orthant¹ of an n-dimensional unit hypersphere with a deterministic algorithm such that semantic concepts that are close in a hierarchy tree are placed closer together. They then train a CNN to minimize the cosine distance between embeddings $\theta$ and class prototypes $\Psi_k$ of class $k$ with objective function $L$:

\[
L(\theta, k) = 1 - S_{\cos}(\theta, \Psi_k),
\tag{3.1}
\]

where

\[
S_{\cos}(\theta, \Psi) = \frac{\theta^\top \Psi}{\|\theta\|\,\|\Psi\|}
\tag{3.2}
\]

denotes the cosine similarity measure. They optimize this objective function in parallel with the standard MLR objective (see Section 4.1.1) in a separate output branch.
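As a small illustration of Eqs. 3.1 and 3.2, the sketch below evaluates the cosine-based objective for one embedding against a handful of prototypes. The prototype coordinates and class names are made-up toy values chosen so that cat and dog lie close together, and they are not the output of the deterministic placement algorithm of Barz and Denzler (2019); the function names are likewise hypothetical.

```python
import math

def cosine_similarity(theta, psi):
    """Eq. 3.2: cosine of the angle between embedding and prototype."""
    d = sum(a * b for a, b in zip(theta, psi))
    n_theta = math.sqrt(sum(a * a for a in theta))
    n_psi = math.sqrt(sum(b * b for b in psi))
    return d / (n_theta * n_psi)

def prototype_loss(theta, prototypes, k):
    """Eq. 3.1: one minus the cosine similarity to the prototype of class k."""
    return 1.0 - cosine_similarity(theta, prototypes[k])

# Toy prototypes in the positive orthant (illustrative values only).
prototypes = {
    "cat": [1.0, 0.2, 0.0],
    "dog": [0.9, 0.4, 0.0],
    "car": [0.0, 0.1, 1.0],
}

# An embedding that lies between the cat and dog prototypes.
embedding = [0.95, 0.3, 0.05]
losses = {k: prototype_loss(embedding, prototypes, k) for k in prototypes}
```

With these toy values the loss with respect to the car prototype is much larger than the loss with respect to cat or dog, which is the behaviour that a semantically structured prototype layout is meant to encourage.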

Figure 3.1: Reproduced from Teng Long (2020): action prototypes embedded on a Poincaré disc by Teng Long (2020) using the dual-loss function described in Eq. 3.3.

Closest to our approach, Teng Long (2020) combine the prototype paradigm with hyperbolic machine learning for hierarchical action retrieval in video data. They propose the hyperbolic action network, which embeds an action hierarchy on a Poincaré ball prior to training the network to map videos into this hierarchically structured hyperbolic space. The action prototypes are obtained by optimizing a dual-objective function

\[
L(\mathcal{P}, \mathcal{N}, \mathbf{P}) = L_H(\mathcal{P}, \mathcal{N}) + \lambda \cdot L_S(\mathbf{P})
\tag{3.3}
\]

with $\mathcal{P}$ and $\mathcal{N}$ respectively as the sets of positive and negative hypernym pairs, $\mathbf{P}$ as the set of concepts in the hierarchy and $\lambda$ acting as a balancing term. $L_H$ is a variant of the hierarchical Poincaré embeddings function proposed in Nickel and Kiela (2017), which has been shown to excel at embedding hierarchies:

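\[
L_H(\mathcal{P}, \mathcal{N}) = \sum_{(u,v) \in \mathcal{P}} \log \left( \frac{\exp\!\left(-d_{\mathbb{D}_c}(u, v)\right)}{\sum_{(u,v') \in \mathcal{N}} \exp\!\left(-d_{\mathbb{D}_c}(u, v')\right)} \right)
\tag{3.4}
\]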

with $d_{\mathbb{D}_c}$ being the hyperbolic distance from Eq. 2.7. They further use the prototype function introduced by Mettes et al. (2019) to separate the target prototypes to ensure they do not overlap:

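\[
L_S(\mathbf{P}) = \mathbf{1}^\top \left( \hat{\mathbf{P}} \hat{\mathbf{P}}^\top - I \right) \mathbf{1}
\tag{3.5}
\]

where $\hat{\mathbf{P}}$ denotes the set of $\ell_2$-normalized leaf-node concepts that ultimately serve as the final action prototypes.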
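To show how the two terms of Eq. 3.3 interact, the following plain-Python sketch evaluates Eqs. 3.4 and 3.5 on a tiny hand-made hierarchy. It is written under several assumptions: the hyperbolic distance is taken in the form $d_{\mathbb{D}_c}(x, y) = \frac{2}{\sqrt{c}} \tanh^{-1}(\sqrt{c}\,\|{-x} \oplus_c y\|)$, Eq. 3.4 is implemented exactly as written above (no leading minus sign, denominator over negatives only), every concept $u$ is assumed to have at least one negative pair, and all names and coordinates are illustrative rather than taken from Teng Long (2020).

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def mobius_add(x, y, c=1.0):
    """Mobius addition on the Poincare ball with parameter c."""
    xy, xx, yy = dot(x, y), dot(x, x), dot(y, y)
    denom = 1.0 + 2.0 * c * xy + c * c * xx * yy
    a = 1.0 + 2.0 * c * xy + c * yy
    b = 1.0 - c * xx
    return [(a * xi + b * yi) / denom for xi, yi in zip(x, y)]

def poincare_dist(x, y, c=1.0):
    """Hyperbolic distance between x and y (assumed form of Eq. 2.7)."""
    w = mobius_add([-xi for xi in x], y, c)
    return 2.0 / math.sqrt(c) * math.atanh(math.sqrt(c) * norm(w))

def hierarchy_loss(emb, positives, negatives, c=1.0):
    """Eq. 3.4 as written: log-ratio of positive to negative pair terms."""
    total = 0.0
    for u, v in positives:
        num = math.exp(-poincare_dist(emb[u], emb[v], c))
        den = sum(math.exp(-poincare_dist(emb[u], emb[n], c))
                  for a, n in negatives if a == u)
        total += math.log(num / den)
    return total

def separation_loss(leaf_prototypes):
    """Eq. 3.5: sum of pairwise cosine similarities, excluding the diagonal."""
    unit = [[xi / norm(p) for xi in p] for p in leaf_prototypes]
    return sum(dot(p, q) - (1.0 if i == j else 0.0)
               for i, p in enumerate(unit) for j, q in enumerate(unit))

def dual_objective(emb, positives, negatives, leaves, lam=1.0, c=1.0):
    """Eq. 3.3: hierarchy term plus weighted separation term."""
    leaf_vectors = [emb[k] for k in leaves]
    return hierarchy_loss(emb, positives, negatives, c) + lam * separation_loss(leaf_vectors)

# Toy hierarchy: "animal" at the origin, "cat" nearby, "car" further away.
emb = {"animal": [0.0, 0.0], "cat": [0.1, 0.0], "car": [0.0, 0.5]}
positives = [("animal", "cat")]
negatives = [("animal", "car")]
value = dual_objective(emb, positives, negatives, leaves=["cat", "car"])
```

In this toy setting the two leaf prototypes are orthogonal, so the separation term is zero and the objective reduces to the difference between the negative and positive pair distances.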

A schematic representation of the resulting prototypes is depicted in Fig. 3.1. They note that these prototypes do not guarantee entailment, which means that the regions of sub-trees are not guaranteed to be fully covered by their parent-tree. They enforce entailment by post-processing the prototypes by optimizing an entailment

¹ The two axes of a two-dimensional Cartesian coordinate system divide the plane in four sectors of 90°, called quadrants. An
