
MASTER THESIS

Interpretable Capsule Networks:
Incorporating Semantic Knowledge with Guided Routing

by

Simone van Bruggen
11869704

August 12, 2019
36 EC
February - August 2019

Supervisor: Dr Maaike de Boer (TNO)
Supervisor: Dr Zeynep Akata
Assessor: Dr Efstratios Gavves

Abstract

Deep neural networks are used in many applications and for a wide variety of tasks, but their internal decision processes are often hard to interpret. Such opaque models are susceptible to adversarial attacks, which may mislead the model by perturbing input images with imperceptible noise.

In this research, we study the interpretability of capsule networks. Combining a robust model with expert knowledge can be fruitful for creating an interpretable model that is resilient against potential attacks. We propose a novel extension to the existing learning algorithm that incorporates external semantic knowledge into the network. By extending the existing loss function with an additional attribute loss, we can guide the routing process of the network.

We compare our approach with a convolutional neural network baseline as well as a linear attribute predictor baseline. The guided capsule network manages to jointly learn to predict attributes and classes. Although the guided capsule network slightly underperforms the baseline accuracies, it does provide additional interpretability and robustness. We evaluate the vulnerabilities of the models when presented with novel variations outside the training distribution, and when attacked with two types of white-box adversarial attacks. These experiments show that the capsule network largely surpasses the baseline performance in terms of robustness, and maintains this performance when extended with guided routing. Additionally, we provide ways to verbalise the rationale behind a class prediction to a user by applying the predicted attributes in an explanation. This verbalisation can be applied in a set-up countering adversarial attacks with a human-in-the-loop.


Acknowledgements

Throughout the process of writing this thesis I received help and support from a number of people. First, I would like to thank my daily supervisor, Maaike, for her constant encouragement and for always making time for our discussions. I’d also like to thank Zeynep, for providing me with new insights and helpful guidance during this research.

Additionally, I’d like to thank my colleagues at TNO, for their help and welcome distractions during our lunchtime foosball and table tennis. In particular, I want to thank my fellow VWData team members for their helpful discussions and input. Stephan, thank you for giving me the opportunity to participate in the project and for discussing my ideas.

I also want to thank my parents for continuously supporting me throughout my studies, and teaching me to never give up. Mom, for sharing your passion for mathematics with me. Dad, for being unstoppable when it comes to teaching and helping me, from high school physics to meticulously proofreading this thesis.

Finally, there are my friends who never failed to put a smile on my face, and in particular Aafke and Mayon, for providing me with good-luck-hugs and treats. Lastly, a special thank you for a special person. Jaap, thank you for providing me with feedback and discussions on this thesis but most of all for your never-ending support and encouragement.

Contents

1 Introduction
  1.1 Capsule networks
  1.2 Research questions and contributions
  1.3 Thesis outline

2 Background
  2.1 Explainability
  2.2 Related work
  2.3 Neural networks
    2.3.1 Activation functions
    2.3.2 Parameter optimisation
    2.3.3 Batch normalisation
    2.3.4 Regularisation
    2.3.5 Convolutional neural networks
    2.3.6 Pooling operation
  2.4 Attributes
    2.4.1 Word embeddings
    2.4.2 Attribute prediction
    2.4.3 Explainability through attributes

3 Capsule Networks
  3.1 Motivation
  3.2 Related work
  3.3 Architecture
    3.3.1 Overview
    3.3.2 Dynamic routing-by-agreement
    3.3.3 EM routing
  3.4 Performance of Capsule Networks
    3.4.1 Explainability of capsule networks
    3.4.2 Incorporating external knowledge

4 Application: countering adversarial attacks
  4.1 Project motivation
  4.2 Adversarial attacks
  4.3 Robustness of capsule networks

5 Experimental set-up
  5.1 Data sets
    5.1.1 Annotations for GTSRB and GTSRB*
    5.1.2 Data preprocessing
  5.2 Set-up
    5.2.1 Capsule Network
    5.2.2 Baseline
    5.2.3 Training the model
  5.3 Capsule network interpretability
    5.3.1 Input reconstruction
    5.3.2 Perturbing instantiation parameters
  5.4 Guided routing for semantic knowledge
    5.4.1 Guided entity representation in capsules (GuidedCapsNet)
    5.4.2 Guided routing with word embeddings (GuidedCapsNetEmb)
    5.4.3 Guided routing for EM routing (GuidedCapsNetEM)
  5.5 Robustness of capsule networks
    5.5.1 Generalisation effect
    5.5.2 Generating adversarial images
  5.6 Generate verbalised explanations
  5.7 Metrics

6 Results
  6.1 Classification results
  6.2 Capsule network interpretability
    6.2.1 Input reconstruction
    6.2.2 Perturbing instantiation parameters
  6.3 Guided routing for semantic knowledge
    6.3.1 Interaction between class accuracy and attribute accuracy
    6.3.2 Attribute analysis
    6.3.3 Guided routing with word embeddings
    6.3.4 Guided routing for EM routing
  6.4 Robustness of capsule networks
    6.4.1 Generalisation effect
    6.4.2 Adversarial robustness
  6.5 Generating verbalised explanations

7 Discussion
  7.1 Comparing models
  7.2 Trade-offs
    7.2.1 Interpretability versus accuracy
    7.2.2 Robustness versus accuracy
  7.3 Scalability
  7.4 Interpretability
  7.5 Guided routing
    7.5.1 Incorporating word embeddings
    7.5.2 Guided EM routing

A EM algorithm derivations
  A.1 EM algorithm

B Data set details
  B.1 Annotations GTSRB* and GTSRB
  B.2 GTSRB* grouping

C Model training parameters
  C.1 Convolutional Neural Network
  C.2 Structured Joint Embeddings
  C.3 Capsule Network

D Additional results
  D.1 Rotating test images
  D.2 Amount of training data


Chapter 1

Introduction

An inherent question for scientists is the timeless question of ‘Why?’. This question has propelled research for centuries and has been a starting point for great discoveries and knowledge gain. Today we find that we are still concerned with this very question. Since more and more decisions are made by computers and intelligent systems, there is a growing need to understand their output.

In the last decade, machine learning and deep learning techniques have made an enormous leap due to new algorithms and increasingly powerful processing units. Due to the recent advances, neural networks have led to many breakthroughs in Artificial Intelligence (AI), in fields such as computer vision (Krizhevsky, Sutskever, & Hinton, 2012) and natural language processing (Devlin, Chang, Lee, & Toutanova, 2018), and exceed human performance in many tasks (Silver et al., 2017; Esteva et al., 2017). These networks are used to tackle a wide variety of problems or provide assistance for human decision making, and are omnipresent in modern human society.

A drawback of the increasing complexity of AI models and their incessant need for more data is the loss of transparency and feature interpretability. These models can be seen as black boxes, where only the input and output are observed and the inner representation is too complicated to be grasped by humans. At the same time, there is a growing need from society for such models to behave in an ethical, unbiased, and trustworthy manner. Especially for decisions made by systems in critical application domains such as medicine and defence, insight and reasoning about the model’s capabilities, limitations and decisions are of key importance to gain trust from the user. The call for models that can be monitored and have the capacity to defend their actions and behaviour is becoming increasingly urgent in the present-day debate on AI and safety. The European Union established the right to explanation in the General Data Protection Regulation (GDPR), and DARPA has dedicated one of its research projects to exploring the explainability of AI models (Gunning, 2017). Less opaque neural networks or other complex non-linear predictors could increase the usage of these models in science and society. Understanding the behaviour of machine learning models is also essential for designing and debugging models.

The field of Explainable AI (XAI) has been an active area of research since the 1970s. At that point rule-based approaches such as decision trees were most popular, which are by design more transparent and show comprehensible behaviour. Still, the more fundamental questions on what such an explanation should entail were already present. As the intelligent systems became increasingly complex, first with the rise of Support Vector Machines (Cortes & Vapnik, 1995) and now with the highly successful deep learning models, there has been a growing need to understand what is happening inside the model and provide explanations for its decisions, for example in intrusion detection systems (Marino, Wickramasinghe, & Manic, 2018). Explainable models are also particularly useful in human-agent interaction, such as explanation interfaces (Pu & Chen, 2006) or in a human-in-the-loop set-up (Adadi & Berrada, 2018).

Deep neural networks, such as Convolutional Neural Networks (CNNs) (LeCun et al., 1989), have issues with detecting objects that are not in their training distribution. These out of distribution examples can have catastrophic effects when objects are encountered that are not in their expected canonical pose (Alcorn et al., 2018). A dangerous example is that of a truck, which is correctly classified if the truck is driving, but when flipped over in an accident is recognised by the model as a snow plough. These examples are out of distribution of the network’s training data, but exist in the real world and have to be dealt with in a safe and correct way.

Besides natural out of distribution examples, another potential risk is that an AI system is attacked by adversaries. Such adversarial attacks exploit fundamental properties of neural networks (Szegedy et al., 2013) and are a serious concern for critical systems such as intrusion detection systems. By visually perturbing the input image with noise, the classifier is deceived and returns an incorrect decision. These attacks can be used to, for example, attack self-driving cars by perturbing their visual field using stickers on traffic signs (Eykholt et al., 2017) or on other cars (Lab, 2019). The translational invariance of CNNs, a convenient property which helps recognise objects with small translations, can also be used for adversarial attacks by exploiting its ‘excessive’ invariance (Jacobsen, Behrmann, Zemel, & Bethge, 2018).

Adversarial attacks form a serious threat for neural networks that are applied in real-world fields that can attract political or criminal interest, such as financial or defence applications. A further understanding of these attacks and the fragility of neural networks can help to build more robust, safer models. Explainable models could serve as a helpful tool for increasing this understanding and can also function as a safeguard against potential attacks by providing a user with insight into why a decision is made. Additionally, in a human-in-the-loop system, a human can intervene in the decision process of an AI system, potentially aided by an explainable model. Such a system can serve as a protection against adversaries.

Ideally, we wish to obtain a model that is robust against such attacks and is sufficiently transparent to provide more insight about its decision making process to a user. In this work, we want to investigate one such alternative to the existing CNNs.

1.1 Capsule networks

CNNs are generally extremely opaque and only allow for reasoning about the eventual model outcome rather than the internal workings of the model. Preferably these models would be more transparent by design. Additionally, there has been debate in the AI community about the future of deep learning. Despite the unquestionably impressive results of neural networks, questions have been raised about whether the fundamentals behind the neural networks used today, such as backpropagation and pooling operations, are flawless (Marcus, 2018; Hinton, 2017).

One promising alternative to conventional neural networks is formed by capsule networks, proposed by Hinton et al. (Hinton, Krizhevsky, & Wang, 2011; Sabour, Frosst, & Hinton, 2017), that abandon the typical pooling operation used by CNNs. Capsule networks are designed to not only detect features, but learn spatial relationships between these features as well. The network’s building blocks, rather than neurons, are capsules, which are groups of neurons that learn both the likelihood of features as well as their spatial properties. These properties are essential to learn equivariance: the ability to detect an object and its transformations. This capacity is fundamentally different from invariance, which is the ability to detect an object despite its transformations. This is similar to how humans perceive objects: even if we have never seen an object upside down or rotated, we still recognise it without any difficulties. CNNs can only reach translational invariance by applying the maximum pooling operation which only retains the most activated features of each layer, meaning that the model actually does not observe any difference between shifted objects. Learning to handle other transformations requires seeing a large amount of these variations during training. Capsule networks are robust against such transformations by design, and can handle novel viewpoints without learning them during training. The capsule structure of the network also allows for learning part-whole relationships that occur in the data.

The results from Sabour et al. (2017) and Hinton, Sabour, and Frosst (2018) look promising on early benchmark sets and achieve results comparable to prior convolutional architectures. CNNs made incredible progress after further refining and expanding the networks into successful architectures such as AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan & Zisserman, 2014), and the Inception networks (Szegedy et al., 2013). It is not unlikely that capsule networks will improve in a similar fashion with sufficient research efforts. Apart from their convenient equivariance properties, capsule networks have also proven to be significantly less vulnerable to adversarial attacks (Hinton et al., 2018).

Despite many applications of capsule networks to a variety of tasks, we find that there has been limited attention to what is actually learned by the network, particularly in the lower layers of the network. There seems to be little empirical proof of the theoretical claims on the model’s representations of the relationships between objects. This study focuses on the interpretability of capsules and what is represented in them, and provides a method for guiding these representations by jointly incorporating semantic knowledge and learning class predictions.

1.2 Research questions and contributions

This research focuses on the explainability of capsule networks by enriching their current learning approach with semantic knowledge. The main research question is: how can we increase the interpretability of capsule networks, using verbal explanation or reasoning? To answer this, we will address the following sub-questions:

RQ1: How can the intermediate parts and capsules of capsule networks be interpreted, and how can we make these parts more interpretable?

RQ2: How can symbolic approaches, specifically attributes, be incorporated in the architecture to guide the routing process?

We further investigate how the resulting explanations can be applied in an adversarial attack countering system. The core component of the thesis is a novel method for jointly learning classification and attributes in capsule networks. Our contributions are twofold:

1. We propose an alteration to the learning procedure of capsule networks, which guides the learning of lower-level capsules and allows us to attach a semantic meaning to these capsules. This provides increased ante-hoc, or integrated, explainability of the model without increasing the complexity of the model.

2. To effectively test this approach, we provide attribute annotations for the existing GTSRB data set (Stallkamp, Schlipsing, Salmen, & Igel, 2012) as well as a variation on this data set with class groupings for better attribute generalisation.

1.3 Thesis outline

The thesis is divided into the following sections. First, in Chapter 2 the most important definitions regarding explainability will be discussed and a general overview of explainable AI research is provided. The preliminaries on neural networks and in particular convolutional neural networks are provided and are meant to give the reader a basic understanding of the topic. In Chapter 3, a detailed explanation of capsule networks is provided and the different routing paradigms are explained. Chapter 4 serves as additional background on the problem setting and the project of which this research forms a substantial part. Chapter 5 will discuss the experimental set-up and data used for the experiments, and will introduce our proposed addition to the routing process to incorporate semantic knowledge into the network. Chapter 6 gives an overview of the results, which are discussed in depth in Chapter 7. We finally summarise and conclude the work in Chapter 8, in which we also underline the challenges and shortcomings of the proposed methods and give further recommendations for future work.

This research was conducted at TNO Research as part of VWData (Value Creation through Responsible Access and Use of Big Data), a programme of the National Science Agenda (NWA).

(11)

Chapter 2

Background

This chapter provides background on explainability and further examines the questions What is an explanation? and What properties do we require from an explanation? (Section 2.1). We discuss related work on explainable models in Section 2.2. We describe the technical background on neural networks (Section 2.3), which form the basis for capsule networks, and attributes (Section 2.4), which are used as the semantic information we want to incorporate in our models.

2.1 Explainability

When there is any form of communication between a system and a human user, we want this communication to be useful and to fulfil our requirements and needs. In the words of Pearl (2018):

“We also want to conduct some communication with a machine that is meaningful, and meaningful means matching our intuition.”

To create such a meaningful form of communication, we want our system to provide insightful and intuitive explanations to a user. A starting point for designing such an explainable model¹ is untangling the related definitions.

Causality When an explanation is requested, one often wonders about what caused a certain event to happen. Pearl (2000) defines the ‘ladder of causation’, which is composed of three different levels of causation:

1. Association: reasoning about observations.
   Example: What does this symptom tell me about the health of the patient?

2. Intervention: reasoning about intervening actions.
   Example: What would happen if the patient takes this medicine?

3. Counterfactuals: reasoning about retrospection and understanding.
   Example: Was it the medicine that caused the healing of the patient?

The authors argue that intelligent systems, regardless of their predictive quality, are only able to reason at the level of association. On this level, we can still differentiate necessary and sufficient causes. If some cause x_1 is a necessary cause for y, then the occurrence of y implies that x_1 must have occurred as well. The occurrence of a sufficient cause x_2 for y will always imply the occurrence of y.

¹We will use the terms system, machine or model interchangeably to refer to any instance of an artificially intelligent system.

Interpretability Interpretability is closely related to explainability, yet subtly different. Whereas explainability encompasses all explanatory aspects of a model, interpretability focuses on transferring these aspects to a human. Ribeiro, Singh, and Guestrin (2016) define an interpretable model as one that provides qualitative understanding between the input variables and output, while using a representation that is understandable to humans regardless of the actual features used by the model. Doshi-Velez and Kim (2017) define interpretability as the ability to explain or to present model behaviour in understandable terms to a human. Kim et al. (2017) differentiate between the vector space of the model E_m and the vector space that is humanly interpretable E_h, and define interpreting a model as applying some function g to the model space:

g : E_m \to E_h    (2.1)

Completeness A complete explanation gives an accurate description of all internal processes in the system. When applied to deep neural networks, it is questionable whether a perfectly complete explanation is achievable at all. Even with perfectly transparent architectures and full knowledge of all parameters of the network, it is not necessarily possible to fully grasp the internal workings and learned connections of the model itself. Similarly, Gilpin et al. (2018) state that there is a difference between interpretability and completeness, where completeness is defined as accurately describing the operations of a system. Disclosing the full architecture and parameters of a neural network would be a perfectly complete explanation, yet this does not imply an interpretable explanation.

Transparency Transparency can generally be viewed as an ethical requirement. We could define transparency in several ways. A measure could be our level of understanding of the internal workings of the model. Another way of measuring transparency could be the complexity of the model Ω(g), where a lower complexity score indicates a more transparent model (Ribeiro et al., 2016).

Trust Trust of a user is essential for an explanation system to be effective. We distinguish two types of trust: trusting a model, and trusting a prediction. Trusting a model is more than just the confidence that a model will perform well (Lipton, 2016). Otherwise any sufficiently accurate model would be trustworthy and free of concerns regarding interpretability. This is clearly not the case: there is a human tendency to be stricter about a single machine error than about a single human error, especially in the case of opaque models which lack explainability. There also seems to be a strong preference for error alignment between human and machine: as soon as the machine fails in areas that are not error-prone for humans or does not match human moral or social standards, human trust in the machine will diminish rapidly (Andras et al., 2018). An extensive comparison between different levels of trust is given by Lipton (2016).

(13)

Model agnostic Model agnostic approaches for explainability do not depend on the specific model that is used to make a prediction, and define methods that can be applied to any model regardless of its inner workings.

Ante-hoc and post-hoc We distinguish ante-hoc, or integrated, and post-hoc explainability. With ante-hoc explainability, the explanation method is integrated in the model or learned during training, and therefore requires a model that has intrinsic explainability. Post-hoc explainability explains the decision made by an existing trained model, which is regarded as a black box. This has the advantage of maintaining the performance and properties of the original model, while incorporated interpretability can lead to performance loss. We will revisit this trade-off between performance and interpretability in more depth in Chapter 7. Post-hoc methods do require an external model or proxy, which may not always be a preferred or available option.

Locality The locality of an explanation again deals with the distinction between system behaviour and specific decisions. A local explanation reasons about a specific decision rather than trying to explain the behaviour of an entire system (Ribeiro et al., 2016). Since it is often not feasible to explain an entire model, we can state a minimal requirement that it should at least be locally faithful. Having an explanation model that is globally faithful would by design imply that it is also locally faithful.

Showing which features are generally more important in decision making is also a form of global explanation.

Contrastive reasoning When a why-question is posed, it is often contrastive in nature. The question “Why P?” generally means “Why P and not Q?”, where P is the fact that is to be explained, and Q is the foil case which did not happen. We can apply this type of explanation for explaining model decisions, for example explaining a certain prediction y by not only stating what caused y to be the case, but also by providing which factors or features are missing for a different prediction y′.

To summarise, we define the following list of necessary characteristics that we require for explanations:

• the explanation should provide a causal explanation for a model decision,

• it should lie in an interpretable feature space, i.e. lie in the human interpretable space E_h,

• it should be local, i.e. explanation on decision-level,

• it provides more qualitative information than the model output itself, while being simpler than the full model,

• it allows for contrastive reasoning.

These properties will be the goals of our explanation model for the remainder of this thesis. Clearly, there are other requirements that could be focused on, such as Grice’s maxims (Grice, Cole, & Morgan, 1975). These conversational rules state that communication is effective when it is informative, truthful, relevant, and clear and brief. We will consider these properties as implicit requirements in this research.

2.2 Related work

Rule-based models Early explainable approaches relied on transparent rule-based models for their tasks. These models are often fairly simple, and are not always fit for more complicated tasks. The interpretability of these models may also decrease when their internal representation becomes too complex.

Interpretable models Some machine learning models are more interpretable by design, or allow for extracting explainable models from the original model. Models that have incorporated interpretability allow for ante-hoc explainability. Examples of such models are shallow decision trees, which are highly interpretable as long as the feature space has a reasonable size, and Bayesian networks, of which the internal causal relationships and uncertainties can be interpreted.

Other models allow for extracting a more interpretable model from the original model. An example of such an algorithm is TREPAN, which extracts comprehensible decision trees while maintaining high levels of fidelity to the original model (Craven, 1996). Martens, Baesens, Van Gestel, and Vanthienen (2007) propose a rule extraction method for Support Vector Machines (SVMs). More recent work is done on building fair interpretable decision trees that reduce bias (Aghaei, Azizi, & Vayanos, 2019).

Despite the appeal of models that can be made interpretable by one of the aforementioned operations, there are a few major disadvantages. Firstly, these methods are often not applicable to all models and are only helpful for a limited range of models. Secondly, this approach is only feasible if the feature space has a reasonable size. A model with too many features or too large a tree is no longer interpretable, even if the features themselves are. Lastly, it is not obvious that the space in which the most important features lie is equal to or even overlaps with the space of human interpretable features.

Dimensionality reduction If a model is not interpretable by design, or if additional visualisation is required, dimensionality reducing methods can be used. One such method is clustering, where the data is split up into a number of clusters. Principal Component Analysis (PCA) is an eigenvector-based analysis method which extracts the linearly uncorrelated variables (principal components) from a larger set of variables (Pearson, 1901). t-SNE (Maaten & Hinton, 2008) allows for visualising high-dimensional data in 2D by constructing distributions of similar points and minimising the Kullback-Leibler divergence between the different distributions. The output is a 2D projection of the manifold in the higher-dimensional space.

Dimensionality reduction techniques allow for better understanding the data and can be particularly useful for exploratory analysis. This type of analysis does not transcend the level of data interpretation, and other methods are required to understand the decision process of a model.

(15)

Feature importance Feature-level models such as LIME (Ribeiro et al., 2016) are post-hoc models that output the feature importance by measuring the effect of perturbation of the input on the model output. By applying a simpler, interpretable model to these perturbations, LIME is able to build a local explanation of the input. This model agnostic approach can be applied to explain the decisions of any classifier. There has been research into feature-based explainable models that provide insight into which features are most relevant in the outcome of the model. This can prove useful for an end user, for instance by showing which factors contributed to a decision for loan provision or college applications. A similar approach is used for BETA, which produces global explanations (Lakkaraju & Caruana, 2017). A game theoretic framework for interpreting feature importance is SHAP (Lundberg & Lee, 2017).

A disadvantage of these approaches is that showing the features that influenced the outcome may not be feasible or comprehensible for more complex models where a large number of features contribute to the final decision. A similar issue as for interpretable models is the fact that there is no guarantee that the deciding features lie in the space of human interpretable features. Despite the flexibility of a model agnostic approach, it cannot provide more than a local explanation and does not explain why the original model made a certain decision.
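The core idea behind such perturbation-based importance methods can be illustrated in a few lines of NumPy: perturb one feature at a time and record how much the model output changes. This is a simplified occlusion-style sketch, not the actual LIME or SHAP algorithm, and all names and values below are illustrative.

```python
import numpy as np

def perturbation_importance(model, x, baseline=0.0):
    """Score each feature by how much replacing it with a baseline changes the model output."""
    reference = model(x)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        perturbed = x.copy()
        perturbed[i] = baseline               # remove / neutralise feature i
        scores[i] = abs(reference - model(perturbed))
    return scores

# Toy linear model: the third feature should come out as most important.
weights = np.array([0.2, -0.1, 1.5, 0.0])
model = lambda x: float(x @ weights)
x = np.array([1.0, 2.0, 1.0, 3.0])
print(perturbation_importance(model, x))      # [0.2 0.2 1.5 0. ]
```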

Heatmaps Many explanation methods for deep learning models attempt to visually explain the decisions made by the model. One of the earlier methods to visualise patterns based on the model’s decision was Sensitivity Analysis (Gevrey, Dimopoulos, & Lek, 2003), which uses the difference in the gradient of the decision function under varying inputs to find which pixels best account for the output. Saliency maps (Simonyan, Vedaldi, & Zisserman, 2013) create a local explanation by visualising the importance of individual pixels for causing the output. The disadvantage of these pixel-level methods is that it is contestable whether individual pixels are the best property to consider when explaining a decision. Pixels that are close to each other are strongly correlated and do not account for high-level concepts. They are also fragile against adversarial attacks (Ghorbani, Abid, & Zou, 2017), and may fail to explain the causal relationships between input and output (Adebayo et al., 2018).

A different approach for visual explanations is visualising decision supporting evidence by using the attention mechanism, which originated from machine translation research (Bahdanau, Cho, & Bengio, 2014), to highlight areas of the input to which the model pays most attention. The resulting visualisation shows which areas of the input the model focused on during learning rather than the pixel-level approach of the aforementioned methods. This deictic approach (which points the user to the right area) is found to resonate with research in cognitive psychology (Contini-Morava & Goldberg, 2011). Models that use deictic clues are found to be pedagogically more effective (Voerman & FitzGerald, 2000). Jain and Wallace (2019) show that attention, applied in NLP tasks, does not suffice as an explanation since the attention weights are often uncorrelated with the actual causes of prediction, and different sets of attention weights can yield the same resulting prediction.

Decomposition Another type of explanation is showing which input areas contribute to a certain model prediction by splitting the model outcome into parts. These post-hoc methods often rely on applying backpropagation through the model.

Guided backpropagation (Springenberg, Dosovitskiy, Brox, & Riedmiller, 2014) sets non-active gradients to zero and visualises the remaining weights that show the different parts of the input that the model responds to for each layer. A similar method is DeconvNet (Zeiler & Fergus, 2014). A related relevance-based decomposition approach is presented by Montavon, Lapuschkin, Binder, Samek, and Müller (2017). Montavon, Samek, and Müller (2018) propose a method which uses deep Taylor decomposition to interpret the network by backpropagating the explanations from the output to the input.

These methods are useful to generate post-hoc explanations. In our research, we want to focus on adding integrated ante-hoc interpretability to our network, as we wish to integrate our explainability in the training process itself.

Textual explanations There exist a wide variety of use cases for textual explanations of decisions made by an AI model. For experts using an AI system the verbalisation of concepts may be beneficial, as it can ‘grey the black box’ of the underlying model. When end users of an AI system are not experts in the field at hand, giving an overview of relevant features can be intimidating and futile, and may lead to users abandoning the AI system due to frustration or lack of trust. A simple text explanation can provide interpretability without increasing the complexity of the task, and may lead to increased understanding of the underlying model and a constructive way of communication between the AI system and a user.

In our research, we will focus on generating textual explanations for visual classification. Text explanations could also be applied to other tasks, such as explaining text classification outcomes (Liu, Yin, & Wang, 2018) or verbalising agent decisions in games (Ehsan, Tambwekar, Chan, Harrison, & Riedl, 2019).

Early work on textual explanation was primarily template-based (Core et al., 2006), which provided an effective but restricted explanation. A more advanced task would be to automatically generate explanations, which can be viewed as a sequence-to-sequence task (Sutskever, Vinyals, & Le, 2014) where the input and output sequences have different lengths. Text explanation generation models often apply similar techniques to captioning models that automatically describe the content of an image. These systems consist of a combination of successful computer vision and NLP models, often integrating pretrained language models to generalise beyond the training data. Vinyals and Toshev (2015) propose an approach of combining a CNN for feature extraction with an LSTM for text generation. Rather than just generating explanations, Lei, Barzilay, and Jaakkola (2016) propose an approach to generate justifications, or rationales, for predictions. Park et al. (2016) distinguish introspective explanations, which reflect the decision process of a network, and justification explanations, which discuss evidence that supports a decision. Hendricks, Hu, Darrell, and Akata (2018) introduce a phrase-critic model which ranks generated text explanations based on how well they are grounded in the input image. They also show that their model is able to generate accurate contrastive, or counterfactual, explanations. Table 2.1 provides examples of these different explanation types.

Type of explanation and example:

• Introspective: This is a stop sign, because filter 4 in layer 2 shows an activation which is higher than the threshold.

• Justification (class discriminative): This is a stop sign, because it is a red sign with the word STOP in white letters.

• Justification (accurate): This is a stop sign, because it is a red sign with white letters.

• Contrastive: This is a stop sign (and not a speed limit sign), because it has the letters STOP.

• Similarity: This looks similar to a speed limit sign, because it shares the property red.

TABLE 2.1: Examples of different types of explanations for the same input image.

An advantage of these methods is that they can verbalise the model’s decision without relying on its internal structure. A drawback of these methods is that the explanations are relatively hard to produce correctly, due to their need for additional language models and ground truth data. Especially collecting ground truth descriptions and explanations, which require human annotations, can be expensive and time-consuming, and as a result they may not be available for most data sets. Another issue is that textual explanations often require simplifying the problem, depending on the context. While for some problems, such as simple object recognition, a short simple explanation might suffice, explanations for more critical tasks such as medical diagnosis often need to be more elaborate.

In this work, we will focus on generating textual justifications that reflect the specific decision-making process of the model. These explanations will be used to inform a human user about the decision processes made by a capsule network, as we will discuss in detail in Chapter 4. Given the early stage of this set-up and the absence of human annotations, we will focus on efficient template-based explanations.

Multimodal explanations The complementary explanatory properties of visual explanations and textual explanations can be combined in multimodal explanation models. This combination of textual and deictic clues intuitively matches the way human communication works, and can increase the explanatory power of the resulting explanation. These approaches are applied to multimodal tasks such as Visual Question Answering (VQA) and activity recognition. Hendricks et al. (2016) propose a model which jointly predicts class labels and explanations, while incorporating class relevance in the loss function to improve the class discriminative ability of explanations. This leads to explanations that are both class discriminative, e.g. black and white stripes are discriminative for zebras, and image relevant, e.g. the animal in the image does have black and white stripes. The former is important for informative explanations, whereas the latter is crucial for trust. If an explanation does not resonate with the input image, it will decrease the user’s trust in the system. Li, Tao, Joty, Cai, and Luo (2018b) propose a similar multi-task learning architecture which learns to generate textual explanations with visual clues on the VQA task. However, as before, these set-ups require annotated data sets with textual explanations, which are not available in our set-up. Handling multiple modalities further increases the complexity of generating these explanations, and will not be the focus of this project.

Symbolic approaches A different line of research aims to combine symbolic AI techniques with connectionist approaches. An example of applying a user-defined symbolic framework of attributes in a neural set-up is proposed by Zhang, Lerer, Sukhbaatar, Fergus, and Szlam (2018), in which a model learns a policy to search through this attribute space. Garnelo and Shanahan (2019) provide an extensive review of recent efforts into combining symbolic approaches with neural networks.

2.3 Neural networks

Traditionally the majority of machine learning algorithms depended heavily on feature engineering and feature extraction. These are labour-intensive, require expertise of the domain, and are dependent on the representation of the data (Ian Goodfellow, Yoshua Bengio, 2017). In recent years neural networks have started to replace these feature-based methods. Neural networks perform non-linear representation learning, a process which automatically extracts representations from the data. Not only does this process reduce human effort, but it can also find emerging patterns in the data that would be missed in manual feature engineering.

A neural network aims to approximate a function that maps input to output. This can be applied to a large number of domains and tasks. We will focus on image classification, which is the problem of annotating an input image with one (or multiple) class label(s) by learning to predict the function f which maps input images X to output class labels Y:

f : \mathcal{X} \to \mathcal{Y}    (2.2)

Neural networks are generally composed of multiple layers, consisting of neurons. The value of a neuron is calculated by performing a linear and a non-linear transformation:

x_k^{(l+1)} = f\left( \sum_j w_{k,j}^{(l+1)} x_j^{(l)} + b_k^{(l+1)} \right)    (2.3)

where f(·) is a non-linear activation function. Multiple non-linear transformations allow for more abstract, and thereby more useful, representations.
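To make Equation 2.3 concrete, the sketch below computes the activations of one fully connected layer in NumPy. The layer sizes, the random weights, and the use of ReLU as f(·) are illustrative assumptions rather than details taken from this thesis.

```python
import numpy as np

def dense_layer(x, W, b, activation):
    """One fully connected layer: x_{l+1} = f(W x_l + b), cf. Equation 2.3."""
    return activation(W @ x + b)

relu = lambda z: np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=4)             # activations of layer l (4 neurons, illustrative)
W = rng.normal(size=(3, 4))        # weights w_{k,j}^{(l+1)} for a 3-neuron layer l+1
b = np.zeros(3)                    # biases b_k^{(l+1)}

print(dense_layer(x, W, b, relu))  # activations of layer l+1
```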

2.3.1 Activation functions

An activation function is the non-linear function f(·) that is applied to each linear operation as in Equation 2.3.

ReLU The rectified linear unit (ReLU) is a piecewise linear function which outputs zero for half of the domain:

f(x) = \max(0, x)    (2.4)

This leads to conveniently consistent gradients that are 1 where the unit is active.

Sigmoid The sigmoid function, or logistic function, maps the input to the range (0, 1) by calculating:

\sigma(x) = \frac{1}{1 + e^{-x}}    (2.5)
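A minimal NumPy sketch of the two activation functions defined in Equations 2.4 and 2.5:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Logistic function: maps inputs to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))      # [0. 0. 3.]
print(sigmoid(z))   # approx. [0.119 0.5 0.953]
```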

2.3.2 Parameter optimisation

The network is trained by optimising its parameters θ, which are learned from data. This optimisation is done by minimising the loss, or error, function. First, a forward pass through the network is executed by feeding input examples to the network and computing the loss between the predicted outcome and the input’s true labels.

After calculating the loss of the forward pass, we can use this loss to update the weights of the network.

This loss is calculated by maximising the log likelihood between the predicted output ŷ and the ground truth label y. This is done by minimising the negative log-likelihood, which is equivalent to minimising the cross entropy loss over all classes C:

\mathcal{L} = - \sum_{c=1}^{C} y_c \log(p_c)    (2.6)

where y_c is a binary value indicating whether c is the correct class or not, and p_c is the predicted probability for class c for the current observation.
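As an illustration, the cross entropy loss of Equation 2.6 can be computed directly from the predicted class probabilities; the toy values below are purely illustrative.

```python
import numpy as np

def cross_entropy(y_true, p_pred, eps=1e-12):
    """L = -sum_c y_c log(p_c) for a single observation (Equation 2.6).

    y_true: one-hot vector over the C classes, p_pred: predicted probabilities.
    """
    return -np.sum(y_true * np.log(p_pred + eps))

y_true = np.array([0.0, 1.0, 0.0])       # the correct class is class 1
p_pred = np.array([0.1, 0.7, 0.2])       # predicted class probabilities
print(cross_entropy(y_true, p_pred))     # -log(0.7) ≈ 0.357
```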

Stochastic Gradient Descent The derivative of the loss function with respect to the parameters, \nabla_\theta \mathcal{L}, is computed for each layer, which can then be used to update the parameters of the network by applying the update rule:

\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}    (2.7)

where η_t denotes the learning rate at time step t. Setting this learning rate correctly is essential for achieving a sufficient convergence rate. A small learning rate leads to slow convergence, whereas a large learning rate may lead to overly strong oscillations of the learning curve, increasing the loss.

To calculate the gradient of the loss function \nabla_\theta \mathcal{L}, the backpropagation algorithm (Rumelhart, Hinton, & Williams, 1985) is used for flowing information from the loss function through the network. The gradient of the loss function with respect to the parameters at layer L is calculated by repeatedly applying the chain rule.

This update can be performed based on all data points, but this has the disadvantages of a risk of overfitting and inefficiency. A commonly used optimisation method is (minibatch) Stochastic Gradient Descent (SGD), which in each iteration stochastically picks a subset of the data and performs the update in Equation 2.7. When picking the size of the minibatch, there is a trade-off between generalisation, variance and efficiency. Small minibatches provide regularisation, but lead to more variance during training and require a smaller learning rate, which in combination with the increased number of steps increases the training time (Ian Goodfellow, Yoshua Bengio, 2017).
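A minimal sketch of minibatch SGD as described above, assuming a hypothetical loss_grad function that returns \nabla_\theta \mathcal{L} for a sampled minibatch; all names and default values are illustrative.

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """theta_{t+1} = theta_t - eta_t * grad (Equation 2.7)."""
    return theta - lr * grad

def train_epoch(theta, X, y, loss_grad, lr=0.01, batch_size=32, rng=None):
    """One pass over the data with minibatch SGD."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]        # stochastically sampled subset
        grad = loss_grad(theta, X[batch], y[batch])  # gradient from backpropagation
        theta = sgd_step(theta, grad, lr)
    return theta
```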

Adam A widely used extension of SGD is the Adam optimiser, which keeps decaying averages of past gradients and past squared gradients to adapt the update for each parameter:

m_{t+1} = \beta_1 \cdot m_t + (1 - \beta_1) \cdot \nabla_\theta \mathcal{L}    (2.8)

v_{t+1} = \beta_2 \cdot v_t + (1 - \beta_2) \cdot (\nabla_\theta \mathcal{L})^2    (2.9)

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}    (2.10)

\theta^{(t+1)} = \theta^{(t)} - \frac{\eta_t \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}    (2.11)

where η_t indicates the learning rate, and β_1 and β_2 are values close to 1. m and v are the decaying averages of past gradients and past squared gradients, and are initialised as zeros. The \hat{m}_t and \hat{v}_t updates are performed as bias-correcting steps to avoid a bias towards zero.
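The Adam update of Equations 2.8-2.11 can be written compactly as follows; this is a sketch, and the default hyperparameter values shown are the commonly used ones rather than values taken from this thesis.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Equations 2.8-2.11). t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad           # decaying average of gradients (2.8)
    v = beta2 * v + (1 - beta2) * grad ** 2      # decaying average of squared gradients (2.9)
    m_hat = m / (1 - beta1 ** t)                 # bias correction (2.10)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # parameter update (2.11)
    return theta, m, v

theta = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
grad = np.array([0.1, -0.2, 0.3])
theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)
```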

2.3.3 Batch normalisation

Ioffe and Szegedy (2015) propose batch normalisation, a technique which addresses the problem of internal covariate shift, where the input distributions of each internal layer change during training and force the intermediate layers to adapt to this change at each step. Batch normalisation transforms the activations in each layer by subtracting the batch mean and dividing by the batch standard deviation, so that each layer keeps receiving inputs with the same distribution. As a result, the model does not have to constantly re-adjust to new distributions. This increases the efficiency and stability of the learning process.
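As a sketch, batch normalisation of a minibatch of activations can be implemented as below. The learnable scale and shift parameters γ and β come from the standard formulation and are included here as an assumption for completeness.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalise a minibatch of activations per feature, then scale and shift.

    x: array of shape (batch_size, num_features).
    gamma, beta: learnable scale and shift (standard formulation).
    """
    mean = x.mean(axis=0)                  # batch mean per feature
    var = x.var(axis=0)                    # batch variance per feature
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(8, 4))
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```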

2.3.4 Regularisation

A problem that is often encountered when training a neural network is the issue of overfitting. This occurs when a model learns the patterns in the training data so well that it is not able to generalise to images it has not seen before. A good practice is to split the available data into training, validation and test data. The training data is used for training the network, the validation data is used to decide on hyperparameters, and the test data is solely used after training the network to test its performance. Often the training error keeps decreasing over time, whereas the validation loss at some point starts to increase again. Early stopping is a strategy which avoids this issue by terminating the training process once the validation error is no longer decreasing. Exponential learning rate decay is another method which helps against overfitting, where the learning rate is slowly annealed over time. The learning rate is then updated as follows:

\alpha = \alpha_0 e^{-kt}    (2.12)
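A sketch of both regularisation strategies: early stopping on the validation loss and the exponential learning rate decay of Equation 2.12. The patience mechanism and all default values are illustrative assumptions.

```python
import math

def decayed_lr(alpha0, k, t):
    """Exponential learning rate decay: alpha = alpha0 * exp(-k * t) (Equation 2.12)."""
    return alpha0 * math.exp(-k * t)

def train_with_early_stopping(train_step, validate, max_epochs=100, patience=5):
    """Stop training once the validation loss has not improved for `patience` epochs."""
    best_loss, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_step(lr=decayed_lr(alpha0=0.1, k=0.05, t=epoch))
        val_loss = validate()
        if val_loss < best_loss:
            best_loss, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                       # validation error no longer decreasing
    return best_loss
```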


FIGURE 2.1: Convolutional operation.

Weight decay regularises the model by driving the weights closer to 0. An L2 regularisation term \lambda \|W\|_2^2 is added to the regular loss function, where λ is a coefficient that determines the level of regularisation.
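For illustration, weight decay only requires adding the regularisation term to the loss and its gradient; the value of λ below is an arbitrary example.

```python
import numpy as np

def loss_with_weight_decay(data_loss, W, lam=1e-4):
    """Total loss = data loss + lambda * ||W||_2^2."""
    return data_loss + lam * np.sum(W ** 2)

def grad_with_weight_decay(data_grad, W, lam=1e-4):
    """Gradient of the regularised loss; the extra 2*lambda*W term drives weights towards 0."""
    return data_grad + 2 * lam * W
```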

2.3.5 Convolutional neural networks

Convolutional neural networks (CNNs) (LeCun et al., 1989) are a highly successful type of neural network that focuses on the grid structure of input data and uses convolutions rather than the general multiplication operation in its layers. Convolutional networks are built upon the convolutional operation, which builds a feature map by applying a feature detector to an input image. The convolutional operation is visualised in Figure 2.1. By sliding the filter, or kernel, over the input and calculating the dot product, a feature map is created which preserves the spatial relationship between the input pixels. These filters act as feature detectors. The stride of the convolution defines the movement across pixels (e.g. with a stride of 1 the filter is moved one pixel at each step). The number of filters applied is defined by the depth of the convolutional layer. The parameters of the filters are learned during training of the convolutional network. The returned feature map is a more compact representation of the input, and is used as the input for the next layer.
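A minimal NumPy sketch of the convolutional operation described above (single channel, stride 1 by default, no padding); the 3x3 edge-detector filter is an illustrative choice.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and take dot products to build a feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)   # dot product with the filter
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])  # vertical edge detector
print(conv2d(image, kernel).shape)   # (3, 3) feature map
```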

2.3.6 Pooling operation

After applying an activation function on the detected activations, a pooling function is applied which replaces the output in a given area with a summary statistic of the area (Ian Goodfellow, Yoshua Bengio, 2017). The max pooling operation returns the maximum activation value of each area. This leads to a representation that is invariant with respect to small input translations, as well as a reduced number of features in the next layer. Figure 2.2 gives an example of the max pooling function. Clearly, spatial information about each receptive field is lost in the operation as only the strongest activation is stored. This invariance property is useful since it allows the network to process inputs that are transformed to some extent, since small changes in the input will not change the output. However, precise and relative spatial information is lost in this process. This is not always problematic, for example in applications such as object detection where the specific location of features is not essential. However, for many applications, we do not want to discard spatial information regarding the input, for example when differentiating between variations of the same object.

FIGURE 2.2: The max pooling operation preserves the maximum activations of each receptive field.
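A sketch of non-overlapping max pooling with a 2x2 window, keeping only the strongest activation of each receptive field:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the maximum activation of each size x size area."""
    h, w = feature_map.shape
    cropped = feature_map[:h - h % size, :w - w % size]   # crop to a multiple of `size`
    blocks = cropped.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fm = np.array([[1., 3., 2., 0.],
               [4., 6., 1., 1.],
               [0., 2., 9., 8.],
               [1., 1., 7., 5.]])
print(max_pool(fm))   # [[6. 2.] [2. 9.]]
```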

2.4 Attributes

Attributes are high-level visual properties or characteristics of objects which can be shared between classes. If these properties can be expressed in natural language, they are referred to as semantic attributes (Farhadi, Endres, Hoiem, & Forsyth, 2009). Attributes allow for a low-dimensional semantic representation of high-dimensional low-level features.

These attributes can be modelled by adding an intermediate attribute space A to the model. Note that it is essential that this attribute space is in the space of humanly interpretable features E_h (Equation 2.1), or can be mapped to E_h, if we want to use these attributes for explaining the model’s behaviour.

Attributes can be encoded as a binary vector φ^{0,1}, using a binary value for encoding the presence or absence of a property. This is an intuitive encoding for categorical attributes. For ordinal or continuous attributes, a vector of continuous values φ^A representing the confidence levels for each class can encode more information.

Attributes can be used for attribute-based classification, where we use the semantic properties to fit a class description or find the closest class attribute encoding. This approach requires an attribute-to-class mapping.
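As an illustration, a traffic sign class could be encoded with a binary attribute vector or with per-attribute confidences; the attribute names below are hypothetical examples, not the annotations used in this thesis.

```python
import numpy as np

# Hypothetical attribute vocabulary for traffic signs (illustrative only).
attributes = ["red-border", "blue-background", "round", "triangular", "contains-number"]

# Binary encoding: presence (1) or absence (0) of each attribute.
speed_limit_sign = np.array([1, 0, 1, 0, 1])

# Continuous encoding: per-attribute confidence values.
speed_limit_confidences = np.array([0.95, 0.02, 0.90, 0.05, 0.88])

present = [a for a, v in zip(attributes, speed_limit_sign) if v == 1]
print(present)   # ['red-border', 'round', 'contains-number']
```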

Even though we could in theory represent 2^k classes with k binary attributes, many attributes are correlated, meaning that these attributes generally co-occur. Some attributes imply other attributes (for example, the attribute ‘has-mouth’ usually implies ‘has-nose’).

Semantic attributes can help in improving image classification performance (Su & Jurie, 2012). Attribute learning is also particularly useful in zero-shot learning cases of the classification task, where some classes are not observed during training (Rohrbach, Stark, & Schiele, 2011; Lampert, Nickisch, & Harmeling, 2013; Larochelle, Erhan, & Bengio, 2008).

2.4.1 Word embeddings

Word embeddings are a widely used method for representing natural language words or tokens, relying on the distributional hypothesis, which states that words that occur in the same contexts tend to have similar meanings (Harris, 1954). The embeddings are distributed high-dimensional vectors of which the meaning is represented by their relative distance to other embeddings. This distance represents the similarity between two words, e.g. the representation of “shark” will be more similar to that of “dolphin” than to that of “cowboy”.

These unsupervised embeddings can serve as a convenient alternative to manually annotated attributes, which may not always be available. Class labels can be turned into a semantically meaningful label embedding φ^W using (pre-trained) word embeddings.

To generate these word embeddings, a model is trained to create a mapping from words or tokens to these vector representations. Well-known examples of such models are Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013), a neural network trained to predict target tokens based on neighbouring tokens, either using skip-gram or bag-of-words, and GloVe (Pennington, Socher, & Manning, 2014), which learns token vectors whose dot products match the token co-occurrence statistics. These embeddings can be obtained from a pre-trained model, or learned from scratch for a specific task. Word embeddings can be combined, for example when generating a label embedding for a class name consisting of multiple words, by taking the average or concatenation of multiple embeddings.
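For instance, a label embedding φ^W for a multi-word class name can be obtained by averaging word vectors; the tiny hand-made vectors below merely stand in for real pre-trained Word2Vec or GloVe embeddings.

```python
import numpy as np

# Stand-ins for pre-trained word embeddings (real vectors would be 100-300 dimensional).
word_vectors = {
    "stop":  np.array([0.9, 0.1, 0.0]),
    "sign":  np.array([0.2, 0.8, 0.1]),
    "speed": np.array([0.1, 0.3, 0.9]),
    "limit": np.array([0.0, 0.4, 0.8]),
}

def label_embedding(class_name):
    """Average the word embeddings of the tokens in a class name."""
    vectors = [word_vectors[token] for token in class_name.lower().split()]
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

emb_stop = label_embedding("stop sign")
emb_speed = label_embedding("speed limit")
print(cosine(emb_stop, emb_speed))   # similarity between the two label embeddings
```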

2.4.2 Attribute prediction

To predict the attributes for an input image, several types of models can be applied. One approach would be to train a different classifier for each attribute. This one-vs-rest approach was used to solve the multiclass prediction problem, where k independent Support Vector Machines (SVMs) were trained for each of the k classes (Tang, 2013). Instead of having multiple SVMs, we can also train one SVM to minimise a multi-class objective function. This score has to be higher for the correct class than for the incorrect classes by some minimum margin ∆.
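A sketch of the per-attribute classifier idea using scikit-learn's one-vs-rest wrapper; the random arrays stand in for image features and attribute annotations, so this is an illustration rather than the set-up used in this thesis.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))             # image features (illustrative)
A = rng.integers(0, 2, size=(200, 5))      # binary attribute annotations (5 attributes)

# One linear SVM per attribute, trained independently.
attribute_clf = OneVsRestClassifier(LinearSVC())
attribute_clf.fit(X, A)

predicted_attributes = attribute_clf.predict(X[:3])
print(predicted_attributes.shape)          # (3, 5): one binary prediction per attribute
```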

Ferrari and Zisserman (2008) propose a probabilistic generative model that learns to predict attributes in a weakly supervised setting. This approach focuses solely on attributes and does not make any predictions on image classes.

One way of learning both attributes and class predictions is to train a separate probabilistic classifier for each attribute, and make a final class prediction by combining the scores of the learned attribute classifiers. This method is called Direct Attribute Prediction (DAP) (Lampert, Nickisch, & Harmeling, 2009). A shortcoming of DAP is the fact that attributes and classes are learned separately, which means that optimal performance on either of these prediction tasks does not guarantee an optimal performance on the other task. Additionally, we want to be able to scale the attribute prediction to other attribute types, such as unsupervised attributes or hierarchy information.

In Chapter 5 we propose a novel method that extends capsule networks with attribute prediction while overcoming these issues. We now discuss an appropriate baseline model which also deals with these shortcomings and learns to jointly predict classes and attributes.

Structured Joint Embeddings

Structured Joint Embeddings (SJE) (Akata, Reed, Walter, Lee, & Schiele, 2015) is a multiclass structured SVM framework inspired by structured SVMs (Tsochantaridis, Joachims, Hofmann, & Altun, 2005). The goal is to learn a compatibility between input embeddings θ(x) and output embeddings φ(y).

SJE objective The objective that the SJE model aims to optimise is a multiclass objective where the compatibility between input and output is maximised:

f(x; W) = \arg\max_{y \in \mathcal{Y}} F(x, y; W)    (2.13)

F(x, y; W) is the compatibility between x and y, which is defined as the following bilinear form:

F(x, y; W) = \theta(x)^\top W \phi(y)    (2.14)

The ranking loss L(x_n, y_n, y) compares the compatibility of the correct label y_n with that of the other labels:

L(x_n, y_n, y) = L(y_n, y) + F(x_n, y; W) - F(x_n, y_n; W)    (2.15)

where L(i, j) is the 0/1 loss:

L(i, j) = \begin{cases} 1 & \text{if } i \neq j \\ 0 & \text{otherwise} \end{cases}    (2.16)

This ranking loss is then applied to form the following hinge loss objective:

\frac{1}{N} \sum_{n=1}^{N} \max_{y \in \mathcal{Y}} \left( 0, L(x_n, y_n, y) \right)    (2.17)

The weights W are learned discriminatively using SGD by sampling (x_n, y_n). The hinge loss is sometimes squared to increase performance. W is then updated by performing the following update rule:

W^{(t)} = W^{(t-1)} + \eta_t \, \theta(x_n) \left[ \phi(y_n) - \phi(y) \right]^\top    (2.18)

To make a class prediction, an output embedding φ(y_n) for some input image x is generated, which is then mapped to the closest class label by finding the class for which the true output embedding φ(y) is closest, for example in Euclidean distance:

\hat{y} = \arg\min_i \left\| \phi(y_n) - \phi(y_i) \right\|    (2.19)
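A compact sketch of the SJE training and prediction steps described in Equations 2.13-2.19; the feature and embedding dimensions are illustrative, and a real implementation would sample annotated training pairs rather than random data.

```python
import numpy as np

def compatibility(W, theta_x, phi_y):
    """F(x, y; W) = theta(x)^T W phi(y), cf. Equation 2.14."""
    return theta_x @ W @ phi_y

def sje_update(W, theta_x, true_idx, phi_all, lr=0.01):
    """One SGD step on the ranking objective (Equations 2.15-2.18)."""
    # Loss-augmented scores: 0/1 loss + F(x, y; W), with the 0/1 loss being 0 for the true class.
    scores = np.array([compatibility(W, theta_x, phi) for phi in phi_all]) + 1.0
    scores[true_idx] -= 1.0
    y_star = int(np.argmax(scores))               # most violating class
    if y_star != true_idx:                        # hinge loss is positive
        W = W + lr * np.outer(theta_x, phi_all[true_idx] - phi_all[y_star])  # Equation 2.18
    return W

def predict(W, theta_x, phi_all):
    """Pick the class whose output embedding is most compatible with the input (Equation 2.13)."""
    return int(np.argmax([compatibility(W, theta_x, phi) for phi in phi_all]))

# Illustrative usage with random features and output embeddings.
rng = np.random.default_rng(0)
phi_all = rng.normal(size=(10, 6))                # output embeddings for 10 classes
W = np.zeros((64, 6))
theta_x, y_true = rng.normal(size=64), 3
W = sje_update(W, theta_x, y_true, phi_all)
print(predict(W, theta_x, phi_all))
```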

2.4.3 Explainability through attributes

Attributes can help in interpreting a model’s decision by showing which factors led to the final class prediction, and can serve as arguments for a class decision. Since we aim to apply attributes to explain model decisions, we require the attributes to follow the earlier explainability conditions, with the additional properties that they should be class discriminative (i.e. help in differentiating between classes), and not class specific. If the attributes are defined in a human interpretable space, we meet our earlier explanatory requirements. Attributes provide a local explanation that is interpretable by humans, and can be applied in informative causal explanations.

Farhadi et al. (2009) use attributes to describe objects, and learn to point out attributes that are unusual or unexpected for a certain image class. Attributes can also be applied in a more advanced explainability framework, such as the previously discussed instance-specific and class-specific natural language explanation generation model by Hendricks et al. (2016). Li, Fu, Yu, Mei, and Luo (2018a) propose a combination of explanation through attributes and a reasoning module to answer questions in a VQA task.


Chapter 3

Capsule Networks

In this chapter, we look into the workings of capsule networks, a novel neural network architecture. We first discuss the motivation behind the network (Section 3.1) as well as related deep learning models (Section 3.2). The capsule network’s two main architectures are described in detail in Section 3.3. We also explore some of the applications of capsule networks and their current successes and challenges (Section 3.4).

3.1 Motivation

The initial concept of capsules was proposed by Hinton et al. (2011), addressing the representational shortcomings of CNNs. These shortcomings are threefold:

1. The max pooling operation, which is applied in CNNs to provide their translation-invariant qualities, means that precise and relative spatial information is lost.

2. Whereas the max pooling operation makes CNNs viewpoint invariant, the preferred property is viewpoint equivariance.

3. Due to their viewpoint invariance, CNNs are not well equipped to deal with novel viewpoints.

The problem with handling new viewpoints can be mitigated by applying data augmentation, where the training data is expanded by adding transformed images that are, for example, rotated, mirrored or zoomed in. However, this additional step is something that we ideally want to be learned automatically by the network. When comparing this with human learning, there is a stark contrast: humans do not need to see an object from every possible angle to correctly classify it. After seeing only a few examples, we can easily generalise and recognise objects that are upside down or mirrored.

Hinton et al. (2011) proposed a fundamental change to neural networks: instead of using scalar-based neurons, the network consists of capsules, which are groups of neurons that represent the object’s pose and orientation in instantiation parameters. The motivation behind capsules was inspired by the concept of image rendering in computer graphics. Capsules are built by applying inverse graphics: rather than building an image based on a set of instantiation parameters, the image is used as input and the goal is to find the corresponding instantiation parameters. This inverse rendering process is performed to deconstruct the image into its underlying structure of parts. The biological inspiration for capsule networks comes from minicolumns in the brain, which are vertical columns consisting of many neurons (Buxhoeveden & Casanova, 2002).


FIGURE 3.1: Example of coupling weights after applying the routing procedures. Non-matching lower-level capsules receive a low assignment probability, whereas active capsules get a high assignment probability to their corresponding higher-level capsule.

This notion of capsules also resonates with the human ability to apply common-sense reasoning to tasks that require understanding of the physics of the world, also referred to as intuitive physics (Lake, Ullman, Tenenbaum, & Gershman, 2017). One such example is the task of deciding whether a letter is mirrored. If this letter is upside down, one has to make a mental rotation of the letter to find the answer (Cooper, 1975). In other words, a simulation is used to anticipate the consequences of an action (e.g. rotating the letter). Humans require a causal understanding of the world to continuously perform such simulations.

The follow-up paper by Sabour et al. (2017) introduced the routing-by-agreement procedure, which allowed for end-to-end training of a capsule network. Even though the network shares many characteristics with CNNs, its underlying structure is fundamentally different. Rather than focusing on the grid-like structure of an input, capsule networks exploit the idea that every input consecutively consists of edges, motifs, sub-parts, and parts. By building a tree-like structure of the input, the model allows for identifying part-whole relationships in the input. Such a tree structure can be constructed by calculating the agreement between parts and their corresponding higher-level concepts. Figure 3.1 visualises a simple example, where parts that are in the correct orientation have a high probability of being assigned to their matching higher-level components. The capsule representations combined with this tree structure make capsule networks more capable of handling input variants, and provide potential interpretability by building a more interpretable internal object representation.

An important characteristic of capsule networks is their equivariance with respect to translation and viewpoint: varying the viewpoint of the observed input has no effect on the probability of the entity being present. Even though CNNs are translation invariant, they achieve this by using the max pooling operation, which inevitably discards some information of the receptive field. It also requires a vast amount of training data with varying viewpoints to handle different poses and orientations. Instead of only keeping the maximum activated feature, capsules keep a weighted sum of the features calculated in the previous layer. As a result, rather than detecting an object and matching this to a variant in one of the higher-level layers of a CNN, a capsule network detects an object and its variant (e.g. detecting a face with a rotation of 20°).
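The difference between the two aggregation schemes can be made explicit with a toy example. The numbers below are arbitrary and only illustrate that max pooling keeps a single activation whereas a capsule-style weighted sum retains information from all lower-level features.

    import numpy as np

    features = np.array([0.1, 0.8, 0.3, 0.4])    # activations inside one receptive field
    coupling = np.array([0.05, 0.6, 0.15, 0.2])  # routing weights, summing to 1

    max_pooled = features.max()                  # 0.8 -- all other activations are discarded
    weighted_sum = coupling @ features           # every feature contributes, weighted by agreement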

Binding problem The binding problem originates from neuroscience and describes the problem of segmenting and combining multiple objects that are present in a single input. For example, if the input consists of a red square and a blue circle, how does the neural model (which could be human or artificial) learn to segment the different properties and map them to the correct objects, thus not perceiving the circle as being red or vice versa? If we represent these properties in a one-dimensional manner (so having binary properties encoding a circle shape, or a red colour), there is no way of differentiating which properties belong to the same object. Capsule networks make the strong assumption that at each position in the observed input, only one instance of an entity type can be present. In this case, position may refer to a small area of the input. This assumption solves the issues of the binding problem on a fundamental level, since we can always couple the currently active properties with the instance that is currently being observed. This assumption has a foundation in psychology, in a theory called crowding. Crowding is the influence of nearby objects on the object that is being observed, making it harder to recognise objects if they are very close together (Levi, 2008).

3.2 Related work

In this section, we provide a brief overview of neural methods that share some of the fundamental underlying concepts of capsule networks.

Autoencoders There are some clear similarities between capsule networks and autoencoders, since they were initially proposed as Transforming autoencoders (Hinton et al., 2011). Autoencoders learn to encode data and to decode this compressed representation to reconstruct the original input. The capsule network largely relies on a similar reconstruction approach for its regularisation, which we will discuss in detail in the next section. However, the goal of capsule networks is classification rather than feature extraction or dimensionality reduction. The encoded features are therefore used both for classification decisions and as input for the decoding part of the network.

Attention Some similarities between capsule networks and a neural network with attention (Bahdanau et al., 2014) can be found as well. Their architectures and implementations are quite dissimilar, but the way both networks learn to focus on the right aspects of the input shows resemblances. Intuitively, the attention mechanism learns which parts of an input the model should pay attention to, by building a context vector containing the weights for different elements.

The Transformer (Vaswani et al., 2017) is an encoder-decoder architecture that is built on multi-headed self-attention blocks that learn to build representations of each position with respect to all other positions in a layer. The multi-headed attention is able to learn representations from different representational subspaces.


FIGURE 3.2: Capsule network encoder architecture.

Layer                Details                      Output size
Input                Image (n colour channels)    28×28×n
Convolutional layer  9×9 kernel, stride 1, ReLU   20×20×256
Primary capsule      5×5 kernel, stride 2, ReLU   6×6×8×32
Class capsule layer  Capsule dimension 16         16×10

Input                Pose vector v                16×10
FC layer 1           ReLU                         512
FC layer 2           ReLU                         1024
FC layer 3           Sigmoid                      784 (28×28)

TABLE 3.1: Summary of the full capsule network architecture on a 28×28 sized input image (encoder at the top, decoder at the bottom).

Transformers use a top-down approach by finding the attention distribution of an element in layer L to each element in layer L−1, to form a weighted combination of these lower-level representations. On the other hand, capsule networks build such a distribution in a bottom-up manner. The coupling coefficients are calculated from a lower-level capsule to form a distribution over all higher-level capsules. Additionally, the multiple attention heads in the Transformer model are similar to the use of different transformation matrices between each pair of a lower-level capsule and a higher-level capsule. A comprehensive comparison between the two types of networks is presented by Abnar (2019).
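The top-down versus bottom-up normalisation can be summarised as a difference in the softmax axis. The sketch below is a conceptual illustration with random scores, not code from either model.

    import numpy as np

    def softmax(z, axis):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    # b[i, j]: raw score between lower-level element i and higher-level element j.
    b = np.random.randn(6, 10)

    # Attention (top-down): each higher-level element forms a distribution over lower-level elements.
    attention_weights = softmax(b, axis=0)        # columns sum to 1
    # Routing (bottom-up): each lower-level capsule forms a distribution over higher-level capsules.
    coupling_coefficients = softmax(b, axis=1)    # rows sum to 1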

3.3 Architecture

3.3.1 Overview

In this section we discuss the different components of capsule networks, as well as the routing algorithm. A summary of the full architecture can be found in Table 3.1. We will describe the original architecture as proposed by Sabour et al. (2017). Architectural choices such as filter sizes, the number of capsules and capsule dimensions are hyperparameters that could be substituted with different values.

Encoder

Figure 3.2 shows the architecture of the encoder in the capsule network. Below we discuss the individual components of the encoder.

Convolutional layer The convolutional layer applies 256 convolutional kernels of size 9×9 and stride 1 with a ReLU activation to the input image, converting the pixel intensities to feature detector activations. These activations serve as the input for the primary capsule layer.

Capsules Capsules are instances of capsule types. A capsule represents an entity, which can be learned without defining its location in the input or its semantic meaning. The norm of the capsule represents the likelihood of the entity being present, and each of its dimensions represents an instantiation parameter of the entity. This set of instantiation parameters defines the variation of the input with respect to an implicitly defined canonical version of that entity. These instantiation parameters could also be regarded as representations of the intrinsic coordinates of the corresponding entity on the appearance manifold.

It should be noted that the properties that these parameters represent do not necessarily lie in the space of human interpretable features E_h as defined in Equation 2.1, and as such a capsule network is not an interpretable neural network per se.

Primary capsule layer The primary capsules represent the lowest level in the tree representation. The primary capsule layer groups the neurons from the previous convolutional layers into capsules. Its input is all of the 256×9×9 outputs of the convolutional layer, and it outputs 32×m×(6×6) capsule outputs, with m the primary capsule dimension, which is generally a value in the range [4, 32]. Every capsule type consists of 6×6 output capsules, and each capsule output is formed by a vector which is called the pose vector¹.
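A sketch of this grouping step, using the shapes from Table 3.1 (32 capsule types of dimension 8 on a 6×6 grid). The random array stands in for the output of the capsule layer's convolution, and the reshape shown is one possible implementation, not necessarily the one used in the original code.

    import numpy as np

    conv_out = np.random.randn(32 * 8, 6, 6)   # stand-in for the (channels, height, width) output

    # Group the channels into 32 capsule types of dimension 8 at every spatial position,
    # then flatten the 6x6 grid: 32 * 6 * 6 = 1152 primary capsules, each an 8-D pose vector.
    primary_capsules = conv_out.reshape(32, 8, 6, 6).transpose(0, 2, 3, 1).reshape(-1, 8)
    assert primary_capsules.shape == (1152, 8)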

Class capsule layer The final layer² of the network outputs one n-dimensional vector v_j for each class. v_j is the result of squashing s_j, the weighted sum of the coupling coefficients from the previous capsule layer. Because of the squashing operation, the norm of v_j can be interpreted as a probability and determines the predicted class that is returned by the network. The vector values in v_j represent the instantiation parameters of the input. These instantiation parameters represent instance-specific properties of that particular input, e.g. rotation or brightness.
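The squashing operation referred to here is the non-linearity of Sabour et al. (2017), which keeps a vector's orientation but maps its norm into [0, 1). The sketch below shows how the class decision follows from the norms of the squashed vectors; the random pre-activations are placeholders.

    import numpy as np

    def squash(s, eps=1e-8):
        # v_j = (||s_j||^2 / (1 + ||s_j||^2)) * s_j / ||s_j||
        norm = np.linalg.norm(s, axis=-1, keepdims=True)
        return (norm ** 2 / (1.0 + norm ** 2)) * s / (norm + eps)

    s = np.random.randn(10, 16)                  # one 16-D pre-activation s_j per class
    v = squash(s)                                # pose vectors v_j with norms in [0, 1)
    class_probabilities = np.linalg.norm(v, axis=-1)
    predicted_class = int(np.argmax(class_probabilities))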

Decoder

The architecture of the decoder is visualised in Figure 3.3. The decoder of the network is responsible for reconstructing the original input. A masking operation is used to mask out all but the correct or predicted class (during training and testing respectively) over the matrix containing all activity vectors A ∈ R^(N×n), where N is the number of output classes and n is the capsule dimension of the final layer. For an image x of class c, the reconstructed image x̂ will correspond to the model’s internal representation of c, or canonical representation of c, since it is based on the class label and the instantiation parameters of x.
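A minimal sketch of the masking and reconstruction step for the decoder of Table 3.1. The weight matrices are random placeholders and biases are omitted for brevity.

    import numpy as np

    N, n = 10, 16                                 # number of classes, capsule dimension
    A = np.random.randn(N, n)                     # activity vectors of the class capsule layer
    target = 3                                    # true class (training) or predicted class (testing)

    mask = np.zeros((N, 1)); mask[target] = 1.0
    decoder_input = (A * mask).reshape(-1)        # all rows zeroed except the target class

    relu = lambda z: np.maximum(z, 0.0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    W1 = np.random.randn(N * n, 512)
    W2 = np.random.randn(512, 1024)
    W3 = np.random.randn(1024, 784)
    reconstruction = sigmoid(relu(relu(decoder_input @ W1) @ W2) @ W3).reshape(28, 28)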

3.3.2 Dynamic routing-by-agreement

To learn the coefficient parameters between the primary capsules and the class capsules, a procedure called dynamic routing-by-agreement is performed. The dynamic routing algorithm is used to learn the coefficient parameters and functions as a forward pass through

¹ We follow this convention, which is mainly used by Hinton et al. (2018), for clarity, but in other papers this is also referred to as the activation vector or simply capsule.

² This layer is referred to as the DigitCaps layer by Sabour et al. (2017), and Class Capsules by Hinton et al. (2018).
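For reference, a minimal sketch of the routing-by-agreement procedure of Sabour et al. (2017): u_hat is assumed to hold the prediction vectors û_{j|i} = W_{ij} u_i of every lower-level capsule for every higher-level capsule, and the shapes and number of iterations are illustrative.

    import numpy as np

    def squash(s, eps=1e-8):
        norm = np.linalg.norm(s, axis=-1, keepdims=True)
        return (norm ** 2 / (1.0 + norm ** 2)) * s / (norm + eps)

    def softmax(z, axis):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def routing(u_hat, num_iterations=3):
        num_lower, num_higher, _ = u_hat.shape
        b = np.zeros((num_lower, num_higher))          # routing logits, initially uniform
        for _ in range(num_iterations):
            c = softmax(b, axis=1)                     # coupling coefficients per lower-level capsule
            s = np.einsum('ij,ijd->jd', c, u_hat)      # weighted sum over lower-level predictions
            v = squash(s)                              # higher-level pose vectors v_j
            b = b + np.einsum('ijd,jd->ij', u_hat, v)  # raise logits where predictions agree with v_j
        return v, c

    # Example: 1152 primary capsules routed to 10 class capsules of dimension 16.
    v, coupling_coefficients = routing(np.random.randn(1152, 10, 16))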
