
MSc Artificial Intelligence

Track: Machine Learning

Master Thesis

Ingraining Expert Label Knowledge

in Deep Neural Networks

by

Bastiaan Sjouke Veeling

10767770

December 19, 2016

42 credits, March 2016 – December 2016
Daily Supervisor: Dr. Vlado Menkovski
Supervisor: Prof. Dr. Max Welling
Assessors: Dr. Efstratios Gavves, Dr. Tibor Bosse

Machine Learning Group, University of Amsterdam


Acknowledgments

I would like to thank my supervisors Vlado Menkovski and Max Welling for their excellent supervision, insight and guidance. Furthermore, I want to thank my colleagues at Philips Research for providing me with the opportunity to intern in a great research environment, giving me the freedom to steer the direction of my research and providing the necessary resources. I want to thank Efstratios Gavves and Tibor Bosse for taking the time to participate on the defence committee. On a personal note, I thank my parents and my grandfather, George, for all their advice and support during my academic career and the process of writing this thesis. Finally, I would like to give thanks to my friends — especially Otto Fabius, Joost van Amersfoort, Luisa Zintgraf and Elise van der Pol — for their invaluable discussions and thorough feedback.


Abstract

In this thesis we explore the challenges of applying the deep learning paradigm in the medical domain. Datasets in this field often have few datapoints and an unbalanced representation of the input distribution. Furthermore, as medical decisions need to be validated and trusted by human experts, there is a need for interpretable inference procedures. In this work we focus on classification tasks, for which we hypothesize that the previously described problems can be alleviated in part by providing the model with expert knowledge on the relationships between classes. We show how this information can potentially reduce the complexity of the problem space by providing a factorization of the likelihood. Furthermore, we present a tentative argument that this factorization aligns the inference procedure with the mental model of human experts, which can provide a step towards improving the interpretability of deep non-linear classification models.

Based on these observed concepts, we propose a novel deep learning method that leverages the label hierarchy information, motivated by a theoretical foundation of Bayesian networks and ancestral sampling. We evaluate the method on both a medical and a canonical dataset, and find that it does not outperform a typical multinomial classification model, despite the theoretical arguments for its strength. We provide a thorough empirical analysis of the method's shortcomings and find evidence that the model suffers from an inherent discrepancy between the training and testing procedures of our proposal. Motivated by this insight, we derive a new method, based on an alternative perspective on the learning process. We find that this method achieves a performance improvement over the baseline, especially so for unbalanced datasets. Finally, we present an analysis of this second proposal and conclude with potential directions for further work in this space.


Contents

List of Figures
1 Introduction
    1.1 Motivation
    1.2 Contributions
2 Related Work
    2.1 Preliminaries
    2.2 Early Work On Semantic Hierarchies
    2.3 Object Hierarchies and Deep Neural Networks
    2.4 Probabilistic Inference and Deep Neural Networks
    2.5 Relevant Tangential Areas
        2.5.1 Multi-loss methods
        2.5.2 Hierarchy for Tractability
    2.6 Analysis of Existing Approaches
3 Hierarchical Classification
    3.1 Review of Multinomial Classification with DNNs
    3.2 Factorized Likelihood Model
        3.2.1 Training Procedure
        3.2.2 Inference Procedure
    3.3 Softmax Propagation Model
    3.4 Interpretability of the Model
4 Experiments
    4.1 Experiment Setup
        4.1.1 Datasets
        4.1.2 Data Preprocessing and Augmentation
        4.1.3 Implementation
        4.1.4 Baselines
        4.1.5 Interpreting Training Plots
    4.2 The Factorized Likelihood Model With Sampling
        4.2.1 Effect Of Number Of Samples
        4.2.2 Overconfident label prediction models
        4.2.3 Incompatible Signal/Noise Ratio
        4.2.4 Incorrect Implementation Hypothesis
        4.2.5 Training and Testing Input Discrepancy
    4.3 Softmax Propagation Model
        4.3.1 Weight Analysis
        4.3.2 Unbalanced Dataset
        4.3.3 Effect of Weight Coupling
    4.4 Discussion
5 Conclusion
    5.1 Further Work
    5.2 Concluding words
6 Appendix
    6.1 CIFAR-100 Unbalanced
    6.2 Hyper Parameters
    6.3 End-to-End Message Passing


List of Figures

1.1 Example of a label hierarchy
3.1 Bayesian network over the labels as defined by equation (3.6)
3.2 Base Model (legend in figure 3.3)
3.3 Legend for figures 3.2, 3.4 and 3.5
3.4 Training architecture (legend in figure 3.3)
3.5 Test architecture (legend in figure 3.3)
3.6 Softmax Propagation architecture (legend in figure 3.3)
4.1 CIFAR-100: Comparing the factorized likelihood model with baseline models
4.2 Comparison with baseline models on ImageCLEF
4.3 Effect of number of samples
4.4 Visualization of Ŵ0 for sampling (left) and softmax propagation (right)
4.5 Entropy of normal and strongly regularized top label prediction model
4.6 Effect of gated activation
4.7 Softmax Propagation: CIFAR-100
4.8 Softmax Propagation: ImageCLEF
4.9 Unbalanced CIFAR-100
4.10 Effect of Weight Sharing: CIFAR-100
4.11 Full-baseline method with Wide Residual Network as base model


Glossary

NLL: Negative Log Likelihood.
DNN: Deep Neural Network.
CNN: Convolutional Neural Network.
SGD: Stochastic Gradient Descent.
Full-baseline: The base model combined with a dense layer with a softmax activation per level of the hierarchy.
Leaf-baseline: Conventional multinomial classification using NLL.
Explainability: The ability to explain the reason behind model output.
x: Random variable representing the input of a datapoint.
y: Random variable representing the labels of a datapoint.
H: The number of levels in the label hierarchy.
y_h: A label at level h in the class hierarchy.
Y_h: The concatenation of parent labels y_H . . . y_{h+1}.
⊙: Element-wise multiplication.
E_D: Expectation over the dataset.
p_θ(a): Probability-mass function parameterized by θ.
f_ϑ(x): Function over x parameterized by ϑ.
P(A = i): Probability of the event A = i.
P̃(A = i): Unnormalized probability of the event A = i.
pa(y_h): Parent of y_h, e.g. y_{h+1}.
J(θ): Loss function over parameters θ.
1_{a=b}: Indicator function. Evaluates to 1 if a = b and 0 if a ≠ b.

The following terms are used interchangeably to refer to levels in the label hierarchy and are deemed synonymous for the scope of this thesis:

• Root to leaf.
• Top to bottom.
• Coarse to fine.
• y_H to y_1.


Chapter 1

Introduction

Of the parts of animals some are simple: to wit, all such as divide into parts uniform with themselves, as flesh into flesh; others are composite, such as divide into parts not uniform with themselves. And of such as these, some are called not parts merely, but limbs or members. Such are those parts that, while entire in themselves, have within themselves other diverse parts.

— Aristotle, Historia Animalium (350 B.C.E)

1.1 Motivation

The deep learning paradigm has garnered large interest over the last few years [46]. What makes this paradigm an especially powerful idea is the concept that deep learning models are an encapsulation of intelligence. They incorporate perception, knowledge and inference ability, all encoded in short mathematical formulas and a few million parameters. With the power of the internet, we have the ability to losslessly duplicate this intelligence and deploy it worldwide, at practically zero cost. Advancements in computing technology allow us to apply this intelligence rapidly, cheaply and in massively parallel fashion. And as the amount of data and compute power increases, the abilities of deep learning models will continuously improve, already surpassing the human level on a broad range of tasks [25, 48].

The potential impact is tremendous, especially for a domain such as medicine, which we focus on in this thesis. To illustrate, imagine what could be achieved if radiologists all over the world had access to a single pair of eyes that has seen medical imagery of every medical complication in existence. And what if a doctor – having limited time available for keeping up to date with medical research – had access to intelligence that can diagnose based on symptom information from millions of patients and the complete up-to-date body of medical research. Utilized correctly, deep learning techniques have the potential to save and improve the quality of millions of lives. It is thus not surprising that the machine learning field is exploring this domain with vigor, both in academic and corporate circles¹. However, there are certain roadblocks that must be overcome to realize the full potential of deep learning for this field.

In the medical domain, datasets created for classification tasks often lack a sufficiently large number of labeled datapoints and coverage of all classes; attributes which are required for current deep learning methods to fully prosper. Datasets that do exist for these tasks have challenging attributes, such as unbalanced class distributions, skewed cost functions and confounding variables. These are all problems that are traditionally solved with yet more data. In addition to the data scarcity issue, the tasks found in many practical applications are high stakes, with large financial costs and unnecessary human suffering associated with misclassification. Therefore when employing these models in an advisory role, used by human experts as a signal in their decision making process, a high level of transparency and interpretability is required. And although some efforts have been made towards visualizing the decision process of deep neural networks [30, 59, 62], research shows that we currently do not fully comprehend these models [23].

We find that classification tasks are frequently approached with straightforward multinomial classification models (one-of-K prediction) [22, 43]. This has proven to be effective for sufficiently large, well-balanced benchmark datasets such as CIFAR-10 [28], MNIST [32] and ImageNet [43], on which such deep learning models thrive. However, posing a prediction task as one-of-K classification obscures the inherent relationships between the discretized classes. For example, in CIFAR-100, we know that the classes ‘man’, ‘woman’, ‘boy’ and ‘girl’ share many visual and semantic attributes, but we penalize predictions finding evidence for ‘boy’ when the target is ‘man’, and do not provide the correspondence between these classes.

We imagine that a classification model might benefit from knowing these relationships a priori, and we hypothesize that ingraining this knowledge can reduce the number of examples required. In theory, deep neural networks are capable of deriving this knowledge from large datasets, as they learn to represent classes as a nested hierarchy of distributed concepts [22], in which this expert knowledge might be naturally evident from the data. In fact, the lack of expert interference is often credited as a key factor in the effectiveness of these models. However, we find that the relatively small labeled datasets found in the medical domain do not suffice for this notion to hold true. Furthermore, we hypothesize that we can improve the explainability of the models if we can leverage the expert knowledge in such a way that we align the decision making process of the model with the mental models of experts. In fact, in order to place trust in these models, we need not have a fully transparent decision process. We do not demand that a radiologist substantiate a diagnosis with a full analysis of how their early vision system observed the low-level signals that led up to their conclusion.

¹ 90+ Artificial Intelligence Startups In Healthcare: https://www.cbinsights.com/blog/artificial-intelligence-startups-healthcare


Rather, we trust such an expert in their ability to correctly interpret images (based on empirical evidence provided by the examination during their training) and solely demand that their diagnosis is supported by high level reasoning in line with the shared understanding of medical complications. With this idea in mind, we focus on proposing a model that can provide a high level reasoning at the last stages in its predictive process, constructed from probabilistic variables trained to represent high level concepts interpretable by human experts.

In this thesis, we explore methods that allow us to introduce and utilize expert knowledge of class structure in a deep learning model, with the aim of retaining the flexibility of Deep Neural Networks (DNNs) while overcoming the data scarcity that is the reality of many real-life problems.

1.2 Contributions

In the previous section, we identified two major challenges with applying deep neural networks in the medical domain: small and unbalanced datasets and a need for explainability. To alleviate these issues we focus on exploring methods that utilize expert domain knowledge. We hypothesize that by introducing this knowledge both as prior information and as constraints on the solution space, the performance will increase and the resulting model will better align with the understanding of experts.

In this thesis, we limit our scope to classification problems with an inherent taxonomy of mutually exclusive is-a relationships (known as hyponymy), where labels have exactly one ancestor. Figure 1.1 shows such a taxonomy. These relationships between labels form the expert knowledge we seek to exploit. This information is abundant in many fields. For example, in the medical domain there exist well studied taxonomies on symptoms [40], anatomy [33], diseases [55], medicinal compounds [42] and other aspects. In addition, this information is relatively cheap to accumulate compared to the large cost associated with labeling individual datapoints. Extracting utility from this information thus has the potential to have a large impact on learning from less data and improving accuracy.

We contribute a broad literature study on the use of taxonomy information in machine learning methods in chapter 2. In particular we focus on work that employs deep neural networks. We find that there is room for improvement, in the form of a theory-based approach to using this information, and we highlight the lack of adoption of taxonomy information in state of the art DNN methods.

Our main contribution consists of a novel method that leverages a given taxonomy in a theoretically sound manner. In chapter 3 we present this method and show how a Bayesian network naturally arises from the label hierarchy and how we can train individual DNN models for each level of the hierarchy using true parent label information. We use ancestral sampling to substitute these true labels at test time. Furthermore, we propose an alternative method which deviates from our theoretical framework, based on the perspective that the predicted distribution over the parent labels can be interpreted as a representation of the input and used as a feature vector.

Figure 1.1: Example of a label hierarchy: a subset of WordNET.

In chapter 4 we find that despite its theoretical foundation, the method does not convincingly outperform a strong baseline (the full-baseline introduced in section 4.1.4). We present an analysis of the method's shortcomings and find that the model potentially suffers from a discrepancy between the dynamics of the training and testing procedures. Our second proposal does not suffer from this issue, and we find that this model does outperform the strong baseline. We present further analysis of the performance and highlight the model's strengths. In chapter 5 we draw our conclusions and provide some potential directions for further work.


Chapter 2

Related Work

If you do not know the names of things, the knowledge of them is lost too.

— Carolus Linnaeus, Philosophia Botanica (1751)

Before we present our proposed methods, we explore the two pillars of work on which we build: the use of taxonomy information in machine learning models, and the marrying of deep neural networks with probabilistic inference algorithms. We will highlight the major themes studied on linear models and elaborate on related approaches employing deep neural networks. We will give an overview of the work done on combining probabilistic inference with deep neural networks and point out relevant tangential areas. First, let us briefly review the current deep learning zeitgeist which provides the foundation of our work.

2.1 Preliminaries

We derive the basis of our methods from the current best practices as observed in the deep learning community. Our base model architecture is inspired by AlexNet [29], which was one of the first automatic representation learning architectures to outperform methods with engineered feature extractors. We use dropout [50] and L2 regularization (commonly referred to as weight decay) to regularize the model and prevent overfitting. We refrain from using the commonly applied batch normalization [27] technique due to its additional computational cost. To initialize the weights of the layers of the neural network, we follow Glorot and Bengio [21]. To further reduce overfitting and to make the model somewhat translation invariant, a global average pooling layer [34] is used. Leaky ReLU activations [36] (α = 0.1) are used as the non-linear activation function between the layers, which prevents the issue of vanishing gradients.
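As a concrete illustration of these practices, the sketch below assembles a small convolutional model with Glorot initialization, L2 weight decay, dropout, Leaky ReLU (α = 0.1) activations and global average pooling. It uses the modern tf.keras API rather than the Keras 1.x/Theano setup of this thesis, and all layer sizes and the weight-decay factor are illustrative assumptions, not the thesis configuration.

```python
from tensorflow.keras import layers, regularizers, Model

def conv_block(x, filters, weight_decay=1e-4, drop=0.25):
    """One convolutional block following the practices listed above (sizes are illustrative)."""
    x = layers.Conv2D(filters, 3, padding="same",
                      kernel_initializer="glorot_uniform",        # Glorot and Bengio [21]
                      kernel_regularizer=regularizers.l2(weight_decay))(x)
    x = layers.LeakyReLU(0.1)(x)          # leaky non-linearity, alpha = 0.1
    x = layers.MaxPooling2D()(x)
    return layers.Dropout(drop)(x)

inputs = layers.Input(shape=(32, 32, 1))
x = conv_block(inputs, 32)
x = conv_block(x, 64)
x = conv_block(x, 128)
x = layers.GlobalAveragePooling2D()(x)    # reduces overfitting, adds some translation invariance
outputs = layers.Dense(100, activation="softmax")(x)
model = Model(inputs, outputs)
model.summary()
```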


2.2 Early Work On Semantic Hierarchies

Before the modern advent of deep neural networks, the use of semantic hierarchies was extensively studied in the context of linear models. Tousch et al. [53] provide a survey of this research in the image recognition field. Notable is the work by Eric Maillot and Thonnat [12], Fan et al. [14, 15], Fan and Geman [16], Gao and Fan [19], who study the use of a learned conceptual hierarchy for image recognition based on WordNet [17] and predict in a top-down manner. Marszalek and Schmid [37] similarly predict objects top down, but rather than a learned hierarchy they use a given semantic hierarchy to build a tree of discriminative SVM classifiers. Bengio et al. [3] propose an approach to learn a hierarchy that is then used to build a tree structure of classifiers that outperforms traditional One-vs-Rest classification.

We find that these works on linear models each leverage the hierarchy information to achieve a subset of three goals: performance improvement, enabling multi-class prediction with discriminative classifiers, and enriching the output of the network. The successful utilization of hierarchy information in linear models further motivates us to explore how this information can be used with deep learning models.

2.3 Object Hierarchies and Deep Neural Networks

The successful application of deep neural networks to the ImageNet challenge by Krizhevsky et al. [29] sparked a widespread effort into exploring variations and adaptations of DNNs, and research on utilizing hierarchy information was no exception. We highlight work relevant and similar to our proposal.

Deng et al. [11] propose an alternative to the softmax activation function: Hierarchy and Exclusion (HEX) graphs. These graphs are a formalism to capture the expert knowledge of relationships between labels, and are then used to limit the possible configurations in the predicted label distribution. The proposed method achieves a 1.8 percentage point improvement on ILSVRC2012 [43] compared with a standard softmax. As a downside, the method is 6 times as expensive as a normal softmax.

Yan et al. [58] propose the HD-CNN, hierarchical deep convolutional neural networks. Although their method is not based on a theoretical foundation, they achieved a 2.65 percentage point improvement over a baseline. The model learns a two-level hierarchy: the first level is learned to consist of categories of visually similar classes, which are then disentangled in the inference of the second level. A related approach was explored by Chennupati et al. [7], who use confusion matrices and spectral clustering from a normal classification model to detect clusters of similar classes, and train individual models to disentangle these clusters.


A different approach is taken by Salakhutdinov et al. [44] and Srivastava and Salakhutdinov [49], who propose using a (learned) class hierarchy as a way to have shared biases on the weights of the final layer, which proves to be an effective way to transfer knowledge between classes.

Fergus et al. [18] use a given hierarchy to build an affinity matrix of the labels. They replace the one-hot encoding of the true labels used to compute the cross-entropy loss with a soft affinity vector. This enables knowledge sharing between classes. This method is similar to the use of ‘dark knowledge’, a term coined by Hinton et al. [26] and further explored by Balan et al. [2] which entails the use of a learned soft representation of the target label to prevent overfitting of the model and to encourage knowledge sharing between classes.

2.4 Probabilistic Inference and Deep Neural Networks

Learning to predict multiple variables governed by a complex joint probability distribution is troublesome for typical neural network architectures, as these do not capture the statistical dependencies between the random variables of interest [6]. A potential way to circumvent this problem is to combine the flexibility of neural networks – which shine in learning distributed high-level representations of high-dimensional input – with the rigidity of probabilistic graphical models and inference algorithms – which shine in their ability to perform inference over a complex factorized probability distribution over the random variables involved.

Chen et al. [6] research this avenue and combine a Markov random field with deep learning to predict statistically related random variables, which results in significant performance gains. Tompson et al. [52] explore this in the application of human pose detection and outperform formerly state-of-the-art techniques. Parallel work to this thesis by Gkioxari et al. [20] explores the problem of pose detection with a chained prediction method, which is similar to our proposal.

Another field of research that relies on the strength of graphical models is CNN image segmentation, where individual pixels are classified with a convolutional neural network. Conditional random fields are employed to reduce the per-pixel noise from the CNN. Pixel classifiers were first introduced by Long et al. [35] and combined with a jointly-trained CRF by Schwing and Urtasun [45]. Zheng et al. [61] formulate the mean-field approximate inference algorithm for the conditional random field as a Recurrent Neural Network. This allows for fully end-to-end training of the segmentation model.


2.5 Relevant Tangential Areas

In addition to the work that pursues goals similar to ours, we draw inspiration from the tangential areas of multi-loss methods and hierarchy for tractability.

2.5.1 Multi-loss methods

Another avenue of utilizing expert knowledge of class relationships is defining multiple loss functions over the network output. This idea is evaluated by Xu et al. [57] and found to have a regularizing effect on the neural network. We use this concept to define a baseline model that receives the same amount of information as our proposed method.

2.5.2 Hierarchy for Tractability

For classification problems with a large number of classes, computing the softmax output can be prohibitively expensive. This is a well-studied issue in the field of natural language processing. A common approach to solving this problem is the hierarchical softmax, which reduces the computational complexity from O(N) to O(log N). This was proposed by Goodman [24] and first applied to the language domain by Morin and Bengio [39].
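To make the factorization concrete, the sketch below shows a two-level class-based softmax in NumPy: the probability of a word is split into the probability of its class times the probability of the word within that class, so only one class block needs to be normalized per prediction (a deeper binary tree gives the O(log N) cost mentioned above). This is an illustrative sketch under stated assumptions, not the exact formulation of Goodman [24] or Morin and Bengio [39]; all names (class_W, word_W, word_to_class) are made up.

```python
import numpy as np

def softmax(k):
    # Numerically stable softmax over a 1-D score vector.
    k = k - k.max()
    e = np.exp(k)
    return e / e.sum()

def hierarchical_word_prob(h, word, word_to_class, class_W, word_W):
    """p(word | context) = p(class | context) * p(word | class, context).
    h: (D,) context vector; class_W: (C, D) class score matrix;
    word_W[c]: (V_c, D) score matrix for the words inside class c."""
    c, i = word_to_class[word]               # class index and within-class index
    p_class = softmax(class_W @ h)[c]        # normalize only over the C classes
    p_word = softmax(word_W[c] @ h)[i]       # normalize only over the words in class c
    return p_class * p_word
```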

2.6 Analysis of Existing Approaches

Despite the attention that leveraging hierarchy information in deep neural networks has enjoyed, it has not been adopted into the common wisdom of neural network design, judging by the competitive submissions to machine learning challenges. This could mean two things: current methods for utilizing hierarchy information do not improve the performance of state-of-the-art architectures, or there is a lack of awareness of and focus on these methods. This further motivates our study into this area.

We specifically find room for improvement in the presented work on the intersection of graphical models and deep learning that takes a theoretically motivated approach. In the next chapter we propose a novel method motivated by a theoretical framework. The method can be appended to existing state-of-the-art classification models, which can ease adoption compared to the methods highlighted above.


Chapter 3

Hierarchical Classification

It is a fundamental of taxonomy that nature rarely deals with discrete categories. Only the human mind invents categories and tries to force facts into separated pigeon-holes.

— Alfred Kinsey

In the introduction we highlighted three challenges of real world classification problems: small datasets, unbalanced labels and a need for explainability. With these challenges in mind, we seek to improve over the typical use case of multinomial classification with DNNs. By utilizing expert knowledge on label hierarchy, we can divide a large multinomial classification problem into smaller problems that form a tree, leading to hierarchical classification.

3.1 Review of Multinomial Classification with DNNs

First, let us define the notation by reviewing how DNNs, and specifically feed-forward networks, are typically employed for multinomial classification. The Deep Learning book by Goodfellow et al. [22] can be consulted for a more in-depth discussion on this topic¹.

¹ Note that we deviate from the notation in Goodfellow et al. [22] regarding z and h.

Given a dataset D with N examples, D = {⟨x^1 . . . x^N⟩, ⟨y^1 . . . y^N⟩}, of inputs x and corresponding labels y ∈ {0, . . . , K}, we seek to learn a probability distribution p_θ(y|x) defined by a DNN with parameters θ. We learn the parameters by training with stochastic gradient descent (SGD) using the maximum likelihood approach, which means using the negative log-likelihood as the loss function:

J(\theta) = -\mathbb{E}_D \log p_\theta(y \mid x) .

To train the model to perform multinomial classification, it should output a categorical distribution. This is achieved with the use of the softmax function as the final activation of the neural network. The last linear layer of the DNN computes unnormalized log-probabilities k from the previous layer's output z = f_ϑ(x), where f_ϑ(x) represents the DNN up to the last layer, parameterized by ϑ:

k = W^T z + b ,   (3.1)
k_i = \log \tilde{P}(y = i \mid x) .

The normalized probabilities are computed from k using the softmax function, which is defined as:

P(y = i \mid x) = \mathrm{softmax}(k)_i = \frac{\exp(k_i)}{\sum_j \exp(k_j)} .   (3.2)

The log-likelihood log pθ(y|x) then simply becomes:

\log P(y = i \mid x) = k_i - \log \sum_j \exp(k_j) ,   (3.3)

which we can substitute into our loss function:

J(\theta) = -\mathbb{E}_D \log p_\theta(y \mid x) = -\mathbb{E}_D \log \mathrm{softmax}(k)_y = -\sum_{n=1}^{N} \sum_i \mathbb{1}_{y^n=i} \left( k^n_i - \log \sum_j \exp(k^n_j) \right) .   (3.4)

Here 1 is the indicator function and n iterates over the dataset (or mini-batch). From equation (3.4) we can infer what the effect of minimizing the negative log-likelihood in combination with a softmax output is: strong evidence for the correct label (p → 1.0, i.e. log(p) → 0) will be reinforced through the k^n_i term, and evidence for all incorrect labels will be penalized through the normalization factor log ∑_j exp(k^n_j).

As an illustrative exercise we can partially derive the gradients for θ (which determine the values of k):

\frac{\delta}{\delta\theta} J(\theta) = -\sum_n \sum_i \mathbb{1}_{y^n=i} \left( \frac{\delta}{\delta\theta} k^n_i - \frac{\delta}{\delta\theta} \log \sum_j \exp(k^n_j) \right)
= -\sum_n \sum_i \mathbb{1}_{y^n=i} \left( \frac{\delta}{\delta\theta} k^n_i - \frac{\sum_j \frac{\delta}{\delta\theta} \exp(k^n_j)}{\sum_j \exp(k^n_j)} \right)
= -\sum_n \sum_i \mathbb{1}_{y^n=i} \left( \frac{\delta}{\delta\theta} k^n_i - \frac{\sum_j \exp(k^n_j)\, \frac{\delta}{\delta\theta} k^n_j}{\sum_j \exp(k^n_j)} \right) .   (3.5)


Equation (3.5) indicates that minimizing the loss with stochastic gradient would cause the parameters of the last layer to receive gradients that increase the probability on the true labels, and decrease the probability on the false labels.
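To make equations (3.1)–(3.4) concrete, the following NumPy sketch computes the unnormalized log-probabilities, the (numerically stabilized) log-softmax of equation (3.3), and the summed negative log-likelihood of equation (3.4) for a small random mini-batch. All variable names and sizes are illustrative; this is not the thesis implementation.

```python
import numpy as np

def log_softmax(k):
    # Stable form of equation (3.3): k_i - log sum_j exp(k_j), shifted by the row maximum.
    k = k - k.max(axis=1, keepdims=True)
    return k - np.log(np.exp(k).sum(axis=1, keepdims=True))

def nll_loss(z, W, b, targets):
    """Negative log-likelihood of equation (3.4) for a mini-batch.
    z: (N, D) second-to-last layer output, W: (D, K), b: (K,), targets: (N,) integer labels."""
    k = z @ W + b                                  # unnormalized log-probabilities, eq. (3.1)
    return -log_softmax(k)[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(0)
z, W, b = rng.normal(size=(4, 8)), rng.normal(size=(8, 5)), np.zeros(5)
print(nll_loss(z, W, b, targets=np.array([0, 3, 1, 4])))
```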

3.2 Factorized Likelihood Model

Now that we have specified our notation, we shift our focus back to the problem of insufficiently large datasets. For the multinomial classification framework described above, the demand for large datasets arises in part from the large number of parameters required to model the intricate problem spaces found in e.g. computer vision problems in the medical field. A large number of parameters in turn requires a large dataset that sufficiently covers the input distribution, to prevent overfitting of the model. We hypothesize that we can use expert knowledge on the label hierarchy to reduce the complexity of the problem space, to add diversity to the training signal (which can potentially improve the fit of the model, according to the findings presented by Xu et al. [57]), and to provide information on the labels that is not necessarily evident from the data.

First, let us formally define what type of expert knowledge on labels we seek to utilize. As stated in section 1.2, we focus solely on label relation information in the form of a tree, with each node having exactly one ancestor. The labels are encoded in the form of y = ⟨y_H, y_{H−1}, . . . , y_1⟩ (y_H are the labels at the root of the tree, y_1 the leaves). Each y_h ∈ {0, . . . , K_h}, where y_h is restricted by y_{h+1}. This implies that the leaf label represents all knowledge of one datapoint, as the ancestors can be determined from the tree structure².
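As a small illustration of this structure, a leaf label determines the full label vector by walking up the parent relation. The tree below is a made-up example, not one of the hierarchies used in the thesis.

```python
parent = {                       # child -> parent; each label has exactly one ancestor
    "boy": "people", "girl": "people", "man": "people", "woman": "people",
    "people": "root",
}

def label_path(leaf):
    """Return [y_H, ..., y_1]: the labels from the root level down to the given leaf."""
    path = [leaf]
    while parent.get(path[-1], "root") != "root":
        path.append(parent[path[-1]])
    return list(reversed(path))

print(label_path("boy"))         # ['people', 'boy']
```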

We should note that there are other interesting forms of expert label information to consider, such as shared attributes (“cats and dogs are both mammals and pets”) and multiple represented concepts (“this picture represents the concepts of outside, flowers and sunny”). However, these forms of knowledge increase the cost of the labeling efforts or are prone to noise and ambiguity. In contrast, the label hierarchy can be determined by simply categorizing labels in a tree, which can then be applied to existing datasets without requiring further labeling. Although the tree structure is still at risk of pigeon-holing class relationships, it is already a substantial improvement over the prevalent one-of-K encoding. This motivates our focused scope for this thesis.

² Alternatively, the labels can be encoded with y_h enumerating over the children of y_{h+1}, which is a more efficient representation from a data structure perspective. Empirically we found that this encoding obfuscates the relationship between the labels when we require the neural network to infer this as part of the training process.


Figure 3.1: Bayesian network over the labels as defined by equation (3.6).

We utilize the label structure to define a factorized generative model over the H levels in the label hierarchy:

p(y \mid x) = p(y_H \mid x)\, p(y_{H-1} \mid y_H, x) \cdots p(y_1 \mid y_2, x) = \prod_h p(y_h \mid \mathrm{pa}(y_h), x) \qquad (\mathrm{pa}(y_h) \text{ are the parent labels of } y_h) .   (3.6)

This factorized distribution defines a Bayesian network [5] over the variables, which provides the theoretical foundation of our proposed method. The Bayesian network is visualized in figure 3.1.

Rather than training a single DNN for the labels as with multinomial classification, we propose to use separate models for each level h in the label hierarchy:

p(y \mid x) = \prod_h \hat{p}_{\theta_h}(y_h \mid \mathrm{pa}(y_h), x) .   (3.7)

The models now receive as input both an image representation z and the parent label values. To facilitate the multiple inputs, we replace the final layer of a conventional DNN (as defined in equation (3.1)) with the following equation:

\hat{p}_{\theta_h}(y_h \mid \mathrm{pa}(y_h), x) = \mathrm{softmax}(U_h^T z_h + V_h^T\, \mathrm{pa}(y_h) + b_h) ,   (3.8)

where we define two weight matrices U_h and V_h and one bias term b_h per level h of the hierarchy. For ease of implementation and clarity we can replace the two input and weight pairs with one concatenated input vector Y_h and respective weight matrix W_h^T, which allows us to express (3.8) as a fully-connected layer with softmax activation:

\hat{p}_{\theta_h}(y_h \mid \mathrm{pa}(y_h), x) = \mathrm{softmax}\left( W_h^T \left[\, z_h \,;\, Y_h \,\right] + b_h \right) .   (3.9)

We define³ Y_h = ⟨y_{h+1}, y_{h+2}, . . . , y_H⟩.

³ Including all ancestors as input for the model is unnecessary from a probabilistic perspective, considering the conditional independence of the parent labels as imposed by the topology of the Bayesian network. However, we hypothesize that this introduces more signal for the neural network, the importance of which becomes apparent as we introduce a derived model in section 3.3. We aimed to minimize the discrepancies between these variations.


As before z is the output of the second-to-last layer of the DNN: zh = fϑh(x). Since the first layers of a DNN tend to learn task-independent representations [29], we propose to couple the weights of these layers between the models. This should increase the diversity of the gradient signal these layers receive during training, improving regularization. This is observed in various multi-task learning studies [13], and we present an analysis on this in section 4.3.3. Figure 3.2 shows the layers for which we couple the weights. zh is thus

the output of a deep neural network with partly coupled weights:

z_h = f_{\vartheta_{\mathrm{shared}}, \vartheta_h}(x) .   (3.10)

At this point the reader might wonder why the models are defined over all labels in one level of the hierarchy, rather than defining models per parent label, i.e. p(y_h | pa(y_h) = i, x). Although p(y_h | pa(y_h), x) is constrained to children of a given parent label y_{h+1} = i, which indeed enables using individual models per parent label, we empirically find that this constraint is trivial for the hierarchy-level models to learn. Furthermore, the per-level formulation simplifies the implementation and enables generalization of the model to the different label structures mentioned before, such as non-mutually-exclusive label trees, for further work. We will refer to this model as the factorized likelihood model (FLM).
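A minimal NumPy sketch of the per-level head of equation (3.9) is given below; the variable names (z_h, Y_h, W_h, b_h) mirror the notation above, but the function itself is an illustration under those assumptions, not the thesis code.

```python
import numpy as np

def softmax(k):
    k = k - k.max(axis=-1, keepdims=True)
    e = np.exp(k)
    return e / e.sum(axis=-1, keepdims=True)

def level_head(z_h, Y_h, W_h, b_h):
    """Equation (3.9): softmax(W_h^T [z_h ; Y_h] + b_h).
    z_h: (N, D) image features, Y_h: (N, A) concatenated one-hot ancestor labels,
    W_h: (D + A, K_h), b_h: (K_h,). Returns an (N, K_h) categorical distribution per example."""
    return softmax(np.concatenate([z_h, Y_h], axis=1) @ W_h + b_h)
```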

Figure 3.2: Base Model (legend in figure 3.3). The base model (f_θ(x)) used in most of our experiments, visualized here for a hypothetical dataset with a label hierarchy of 20, 40 and 90 labels at the levels respectively. Tensor sizes are not to relative scale. The model is pre-trained using the pre-training layers, which also make up the structured-baseline model. Each convolutional and dense layer is followed by a Leaky ReLU (α = 0.1) activation, except those before a softmax. The convolutional layers all have filter size 3 × 3. Dropout p-values are set to .25, .25, .5 and .5/.6 for CIFAR-100/ImageCLEF respectively.

Figure 3.3: Legend for figures 3.2, 3.4 and 3.5. (Symbols: 1D vector, 3D tensor, stacked vectors, dense layer, 3 × 3 convolutional layer, softmax (Σ), data flow, and tensor dimensions; kernel size is always 3.)

3.2.1 Training Procedure

At training time we benefit from the available true parent labels, which allows us to train the models in parallel and free of noise on parent labels. We derive the negative log joint likelihood as the loss of the combined models:

J(\theta) = -\log p_\theta(y \mid x) = -\sum_h \log p_{\theta_h}(y_h \mid \mathrm{pa}(y_h), x) .

Following these definitions, the models can be trained in parallel using standard stochastic gradient descent, with just minor adjustments to existing neural network training architectures and at a negligible computational cost. This should ease adoption of our proposals with existing work if proven effective as an extension to state-of-the-art models, for which we provide tentative evidence in section 4.4.

We visualize the model in figure 3.4, which illustrates the computational graph at training time for a hypothetical hierarchical classification problem with 3 levels in the label hierarchy (20, 40 and 90 labels respectively). An input image x is processed by the base models (of which the initial layers are shared), which provide z_h. The label prediction models take a concatenation of the image representation z_h and the true parent labels (one-hot encoded), and compute the probabilities of the labels.
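The sketch below re-creates this training graph in the modern tf.keras functional API as an assumption-laden illustration (the thesis used Keras 1.x with Theano and a custom hierarchical prediction layer, which this is not): each level's head sees the image features plus the true one-hot ancestor labels (teacher forcing), and the per-level NLL (categorical cross-entropy) terms are summed into the joint loss.

```python
from tensorflow.keras import layers, Model

def build_flm_training_model(base_models, label_counts=(20, 40, 90)):
    """base_models: one feature extractor per hierarchy level, ordered coarsest to finest
    (in the thesis their early layers share weights; here they are simply passed in)."""
    image = layers.Input(shape=(32, 32, 1), name="image")
    n_levels = len(label_counts)
    # One extra input per level of TRUE parent labels (all levels except the finest use them).
    parent_inputs = [layers.Input(shape=(k,), name=f"true_level_{i}")
                     for i, k in enumerate(label_counts[:-1])]
    outputs, ancestors = [], []
    for i, k in enumerate(label_counts):
        z = base_models[i](image)                                 # image representation z_h
        head_in = layers.Concatenate()([z] + ancestors) if ancestors else z
        outputs.append(layers.Dense(k, activation="softmax", name=f"level_{i}")(head_in))
        if i < n_levels - 1:
            ancestors.append(parent_inputs[i])                    # feed the true parent labels
    model = Model([image] + parent_inputs, outputs)
    # The joint loss is the sum of the per-level NLL terms.
    model.compile(optimizer="sgd", loss=["categorical_crossentropy"] * n_levels)
    return model
```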

3.2.2 Inference Procedure

The label prediction models now require parent labels as input. As these are not available at test time, an inference algorithm is required. Simply using the mean of the predicted parent label distributions might result in acceptable classification performance, but this procedure eliminates the uncertainty information of the models. For small label sets an exhaustive computation of the joint probabilities can be performed. This procedure has a complexity of O(|y_1| × H), the number of labels in the lowest level of the hierarchy times the number of levels in the hierarchy, and might prove prohibitively expensive for vast label structures.

Figure 3.4: Training architecture (legend in figure 3.3). The training architecture is visualized for a hypothetical dataset with a label hierarchy of 20, 40 and 90 labels at the levels respectively. The input image is processed by the base models (for which the first layers are coupled). The true labels are one-hot encoded and concatenated with the z-vector. Each label prediction model (confined by the plain colored rectangles) computes the conditional probability of the labels at a level in the hierarchy given the parent labels and image representation.

Instead, we propose to use ancestral sampling [5]: N predictions are sampled from the top-level model. From each top-level sample, just 1 sample is drawn from each subsequent level, conditioned on the parent samples. This results in N predictions at each level of the label hierarchy. We draw the outcome from the frequency distributions of the sampled labels, which also provide an uncertainty metric. The outcome of the sampling procedure equals that of the exhaustive procedure in the limit, as the number of samples approaches infinity. As an additional benefit, the computational complexity can be reduced at the cost of increased variance by reducing N. An illustration of the evaluation scheme is presented in figure 3.5.
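The ancestral sampling procedure can be sketched in a few lines of NumPy, assuming the trained per-level predictors are available as callables that map image features and one-hot ancestor labels to a softmax distribution. The function and variable names below are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def one_hot(idx, k):
    v = np.zeros(k)
    v[idx] = 1.0
    return v

def ancestral_sample(z, level_predictors, label_counts, n_samples=100, rng=None):
    """Returns, per level, the frequency distribution of the sampled labels.
    level_predictors[h](z, ancestors) -> probability vector of length label_counts[h];
    level 0 is the top (coarsest) level and receives an empty ancestor vector."""
    rng = rng or np.random.default_rng()
    counts = [np.zeros(k) for k in label_counts]
    top_probs = level_predictors[0](z, np.zeros(0))
    for _ in range(n_samples):                      # N samples from the top-level model
        ancestors = []
        for h, k in enumerate(label_counts):
            probs = top_probs if h == 0 else level_predictors[h](z, np.concatenate(ancestors))
            y = rng.choice(k, p=probs)              # one sample per subsequent level
            counts[h][y] += 1
            ancestors.append(one_hot(y, k))
    return [c / n_samples for c in counts]          # frequency distributions (and uncertainty)
```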

3.3 Softmax Propagation Model

Although training the label prediction models independently with true parent labels allows for efficient modelling at a selected level of the hierarchy, and ancestral sampling provides a solid theoretical foundation for inference, we believe the difference between the training and inference approaches potentially has a negative effect on the model's performance. On the datasets that we tested (section 4.2.5) we found that the proposed factorized likelihood model performs poorly in comparison with a baseline that predicts all provided labels without considering the statistical dependencies between the labels. We conclude that the model might suffer from the discrepancy between the training and testing procedures.

Figure 3.5: Test architecture (legend in figure 3.3). Differences with the training architecture are highlighted in blue. The sampling procedure draws N samples from the first label prediction model (the yellow box), represented as one-hot encodings. These samples are then processed in parallel by the subsequent models (green and red boxes), which each draw just one sample from the softmax output, resulting in N samples at each level of the hierarchy. Note that the N samples are processed independently by the model and only aggregated in the last step to compute the frequency distribution.

Figure 3.6: Softmax Propagation architecture (legend in figure 3.3).


Furthermore, the sampling procedure is a computationally expensive operation which slows down the prediction process. To circumvent these problems, we present an alternative model for which training and inference are equivalent.

We derive this method from an alternative perspective and interpretation of the label hierarchy. Rather than training individual models and using a probabilistic inference scheme at prediction time, we follow the prevalent intuition that end-to-end trained models eventually surpass expert prediction schemes. To facilitate this, we envision that we can utilize the label hierarchy to add intermediate, non-latent variables in the deep neural network. We can train these variables to predict distributions over our expert-defined labels, which can be achieved by simply adding the respective NLL terms on intermediate softmax outputs to the loss function. The outputs of these softmax activations are concatenated to the input vectors of subsequent layers. The order of these layers, and the connections between them, are governed by the expert knowledge, but the neural network can learn their utility and adjust accordingly. In fact, we can potentially encode any type of label hierarchy in this manner, even hierarchies that form cyclic or undirected graphs.

Whereas with the factorized likelihood model we proposed to utilize the label hierarchy information to define a probabilistic graphical model, with the softmax propagation model we propose an alternative method that integrates predicted parent label distributions as non-latent variables in the deep neural network. This introduces two hypothetical benefits for the DNN: the predicted distribution can be interpreted as a representation which carries more information than a simple sample, and the model can learn to adapt to the accuracy of the predicted distributions during training time. We provide an illustration of the proposed method in figure 3.6. The model can be trained end-to-end, and does not require a separate inference scheme. We will refer to this method as softmax propagation.

A strong benefit of the softmax propagation model lies in the fact that it does not require an inference scheme at test time. This drastically speeds up the prediction time. We further hypothesize that the softmax propagation model suffers less from sub-optimal hierarchy information, as the label prediction (sub)models can learn the influence that the predicted parent label distributions have on the final outcome, something that the factorized likelihood model is not capable of doing.
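A sketch of this end-to-end model in the modern tf.keras functional API is shown below. It is an assumption-based re-creation of figure 3.6 rather than the thesis code: each level's softmax output is concatenated to the input of the next level's head, an NLL term is placed on every intermediate softmax, and the same graph is used for training and inference.

```python
from tensorflow.keras import layers, Model

def build_softmax_propagation_model(base_models, label_counts=(20, 40, 90)):
    """base_models: one feature extractor per hierarchy level (early layers may share weights)."""
    image = layers.Input(shape=(32, 32, 1), name="image")
    outputs = []
    for i, k in enumerate(label_counts):
        z = base_models[i](image)
        # Propagate the predicted softmax distributions of all coarser levels as features.
        head_in = layers.Concatenate()([z] + outputs) if outputs else z
        outputs.append(layers.Dense(k, activation="softmax", name=f"level_{i}")(head_in))
    model = Model(image, outputs)
    # One NLL (categorical cross-entropy) term per intermediate and final softmax.
    model.compile(optimizer="sgd", loss=["categorical_crossentropy"] * len(label_counts))
    return model
```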

3.4 Interpretability of the Model

We now turn our attention to the previously mentioned need for interpretability. The stacked layers of neurons with intermediate non-linear transformations allow deep neural networks to learn to represent the world as a nested hierarchy of concepts [22]. Such a representation has proven to be very effective for various machine learning tasks, but does not lend itself to straightforward interpretability by human domain experts. One can argue that this should not prove an issue, as humans similarly lack the ability to fully motivate high level observations – after all, the low level vision system of the brain is at least as complex as the early layers of a CNN, and we trust a radiologist to not hallucinate various complications. However, as deep neural networks lack the track record of humans, we must hold these techniques up to a higher standard than human experts. This is especially the case if we want to employ these models in the medical domain, where lives are at stake, and even if a DNN has been empirically proven to provide better performance on a specific task.

The hallmark of interpretable machine learning models is the decision tree. The binary splitting rules of which a decision tree consists are highly transparent, and a misclassification can be pinpointed to a specific decision in the tree. However, a decision tree is prone to overfit, especially on high-dimensional input such as images, which makes it infeasible for image analysis. Furthermore, real world decision problems often involve the consideration of multiple random variables with intricate dependencies, which are hard to express with binary rules. If we can observe high-level random variables (e.g. medical symptoms that can be determined with certainty), and if the relationships between the random variables in a prediction problem are known (perhaps provided by empirical research), then probabilistic inference algorithms can provide a transparent prediction procedure that can be formally proven to be correct. Unfortunately, this information is not available for all problems, and these algorithms do not scale well beyond modelling the interaction of a small number of high-level random variables.

Both the factorized likelihood and the softmax propagation model effectively combine the DNN with probabilistic inference. They force the DNN to predict intermediate, understandable random variables, and use these variables in the prediction of dependent variables. The linear relationship that the model learns between these variables can be understood from the model weights (as will be shown in figure 4.4). These relationships can be verified to align with the understanding of experts. In addition, the sub-models that predict intermediate variables can be empirically verified in isolation from the other random variables, which aids in developing trust in our proposals. To illustrate this, one can consider a factorized likelihood model that classifies what type of X-Ray is shown in an image. By providing a label hierarchy with expert knowledge on anatomy (e.g. a toe is connected to a foot), the model can learn to predict the intermediate random variables from the hierarchy (e.g. the occurrence of a toe), and from that learn to predict the relevant random variables (e.g. the image is an X-Ray of a foot and ankle). This rich output of multiple variables and their relationships improves upon a black-box model that simply predicts probabilities for all potential anatomy X-Rays without providing further information. Whether this information suffices to provide the transparency required for medical problems remains a research question to be explored in further work.


Chapter 4

Experiments

My greatest concern was what to call it. I thought of calling it ‘information’, but the word was overly used, so I decided to call it ‘uncertainty’. When I discussed it with John von Neumann, he had a better idea. Von Neumann told me: ‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name. In the second place, and more importantly, no one knows what entropy really is, so in a debate you will always have the advantage.’

— Claude Shannon

4.1 Experiment Setup

4.1.1 Datasets

In this work we study two datasets. The motivation for our research comes from challenges observed in medical imaging tasks. To evaluate the effectiveness of our models for such problems, we focus on the dataset from the ImageCLEF 2009 Medical Image Annotation Task¹. It consists of X-Ray images of various body parts, labeled according to the IRMA code [33], a standardized coding scheme for medical images. We focus on predicting the anatomy labels, which are structured in a hierarchy of three levels. To verify our observations and to allow for comparison with prior art², we use CIFAR-100 [28], a canonical dataset consisting of 32 × 32 pixel photos of 100 ordinary object classes, grouped into 20 coarse classes. The dataset is balanced, with an equal number of datapoints per fine and coarse class. Although this attribute is desirable for training deep neural networks, it does not reflect the state of real world datasets. Furthermore, the main advantage of using hierarchical representations of the output can be expected in unbalanced datasets.

¹ http://imageclef.org/2009/medanno


To explore this we propose an unbalanced variation of CIFAR-100, which we derive as follows. From each coarse class, one fine class was picked for which only 10% of the available training data is used. Appendix 6.1 has more details on the exact makeup of the dataset. We refer to this variation as unbalanced CIFAR-100. We use the balanced CIFAR-100 as our default dataset, and specify if we use another.
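A sketch of how such an unbalanced variant could be constructed is given below, assuming the training arrays x_train, y_fine and y_coarse (integer labels) are already loaded. The choice of which fine class to shrink is made at random here, whereas appendix 6.1 of the thesis fixes the exact classes; all names and the seed are illustrative.

```python
import numpy as np

def make_unbalanced_cifar100(x_train, y_fine, y_coarse, keep_fraction=0.10, seed=0):
    """For each coarse class, pick one of its fine classes and keep only a fraction of it."""
    rng = np.random.default_rng(seed)
    keep = np.ones(len(y_fine), dtype=bool)
    for coarse in np.unique(y_coarse):
        fine_classes = np.unique(y_fine[y_coarse == coarse])
        reduced = rng.choice(fine_classes)                 # the fine class to shrink
        idx = np.flatnonzero(y_fine == reduced)
        n_keep = int(round(keep_fraction * len(idx)))
        drop = rng.choice(idx, size=len(idx) - n_keep, replace=False)
        keep[drop] = False
    return x_train[keep], y_fine[keep], y_coarse[keep]
```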

4.1.2 Data Preprocessing and Augmentation

As the X-Ray images in ImageCLEF have varying input size and aspect ratio, the images have to be preprocessed into a format suitable for conventional CNN models. We follow Menkovski et al. [38] in taking random square croppings from the images after resizing the images to ensure they have similar dimensions. To speed up the training procedure we opt to take 64 × 64 croppings and use a resizing factor of 1/10. We refrain from using other augmentation steps in order to limit the variance of the training procedure. For CIFAR-100 no augmentation steps are utilized. A ZCA whitening step is commonly employed for this dataset – as proposed for CIFAR-10 by Krizhevsky and Hinton [28]. However, applying this procedure to the variably sized ImageCLEF images would potentially increase the variance between experiment runs, and we thus opt to forgo ZCA whitening altogether.
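The preprocessing step described above can be sketched as follows: downscale each X-Ray by a factor of 1/10 and take a random 64 × 64 square crop. Pillow (PIL) is used for resizing purely as an assumption; the thesis does not state which library was used, and all parameter names are illustrative.

```python
import numpy as np
from PIL import Image

def preprocess_xray(path, crop=64, factor=0.1, rng=None):
    """Load a grayscale X-Ray, resize it by `factor`, and return a random square crop."""
    rng = rng or np.random.default_rng()
    img = Image.open(path).convert("L")
    w, h = img.size
    img = img.resize((max(crop, int(w * factor)), max(crop, int(h * factor))))
    arr = np.asarray(img, dtype=np.float32) / 255.0
    top = rng.integers(0, arr.shape[0] - crop + 1)
    left = rng.integers(0, arr.shape[1] - crop + 1)
    return arr[top:top + crop, left:left + crop]
```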

4.1.3 Implementation

Pre-trained weights are used for the base model in all experiments. To ensure consistency between experiments we use the same weights for all experiments sharing a dataset, unless specified otherwise. The hyperparameters used for our experiments are presented in section 6.2.

For efficient evaluation and inference, the model is implemented with the widespread deep learning framework Keras [8], using Theano [51] as the computational backend. Keras was adapted to cope with multiple labels per image, and various extensions were implemented. A custom hierarchical prediction layer was developed to facilitate the experiments and the model variations. This layer can be appended to existing Keras architectures in a straightforward manner. The code is available at github.com/basveeling/thesis³. All experiments presented are run from a single code base, limiting the probability of bugs influencing the result.

² Due to business-related limitations we were unable to run experiments on the ImageNet dataset [10].
³ As this code was developed by the author during an internship at Philips Research, it will be published once approved. Please inquire for further information and early access.


4.1.4 Baselines

To evaluate our proposed methods, we define two baselines. Both baseline models consist of the base model f_base(x) (as defined in figure 3.2) and a single-layer inference model (a fully-connected layer with softmax activations). The leaf-baseline predicts only the lowest labels in the hierarchy, which reflects a typical CNN model for image classification such as AlexNet [29]. The full-baseline (also referred to as the strong baseline) predicts labels from all levels in the hierarchy and computes a sum of the NLL losses per level of labels (see figure 3.2). This serves as a fair baseline for our proposed method as it utilizes the extra label information provided by the hierarchy. The full-baseline can by itself provide an improvement over typical DNN models.

4.1.5 Interpreting Training Plots

In the rest of this chapter we will present plots visualizing various metrics observed during a training procedure. Unless stated otherwise, the information is visualized as follows: a solid line represents scores for a hold-out validation set and a dotted line the scores on the training set. The hue of the line color represents the experiment and the lightness the level of the hierarchy (lighter/more saturated implies higher level). Both attributes are stated in the legend (with “h = . . . ” indicating the level in the hierarchy). To improve interpretability the plots are smoothed with a running average over 5 epochs. The epoch number is on the x-axis and the performance metric on the y-axis.

4.2 The Factorized Likelihood Model With Sampling

We first analyse the classification performance of the proposed model with the sampling inference scheme. In figures 4.1 and 4.2 we present the performance of the two baseline models and our proposed method. We find that on CIFAR-100 the factorized likelihood model performs worse than our baselines, and on ImageCLEF it performs only on par with the leaf-baseline. Of note is the difference between the training and validation accuracy of the fine labels. This discrepancy arises from the fact that the training scores are computed with the true parent labels rather than samples. In an experiment that computes the validation score using the true parent labels we find a similarly increased performance, as illustrated by the blue ‘True Labels’ line in figure 4.1.

Figure 4.1: CIFAR-100: Comparing the factorized likelihood model with the baseline models. Higher is better; see section 4.1.5 for interpretation hints. The weights from epoch 303 of the full-baseline are used to initialize the base model of the sampling method (hence the offset of the green line). The base model is frozen for the first 300 epochs, and fine-tuned end-to-end for the last 300 epochs. The sampling procedure is used to report the validation score. For the training performance we report the performance given true parent labels.

Figure 4.2: Comparison with baseline models on ImageCLEF. The results for the top and lowest level of the label hierarchy are shown.

Figure 4.3: Effect of the number of samples. For this experiment we vary the number of samples. We find that from N ≥ 100 onwards no difference in variance and performance is observed. No smoothing is used in this plot.

We conclude from these results that the factorized likelihood model with sampling inference does not outperform the full-baseline in its current form on the examined datasets. We formulate the following hypotheses as potential causes of this unexpected result:

1. The number of samples N used in the sampling inference procedure is insufficient.

2. The factorized likelihood model is overconfident in its predictions, causing biased, low-variance samples.

3. The noise of the true label gradient signal conceals the predictive signal of the image features.

4. The performance results from an erroneous implementation on our part.

5. The model suffers from the discrepancy between the input distribution of the parent labels at training and test time.

We do not strive to provide a formal hypothesis-testing analysis, but rather utilize these hypotheses as a framework to guide our exploration and conclusions. We analyze each of these hypotheses in the following sections, respectively.

4.2.1 Effect Of Number Of Samples

The inference scheme is sensitive to the number of samples N drawn from the initial predicted top-level label distribution. In figure 4.3 we present the effect of this hyper-parameter. We observe the expected behavior, with decreased average performance for a small number of samples (N = 1 and N = 10). We further observe that increasing the number of samples beyond N = 100 does not further improve the accuracy of the predictions. Hence, we conclude that hypothesis 1 must be rejected.
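For reference, the inference scheme under discussion can be sketched as below. Here fine_model is a hypothetical callable that returns the conditional softmax of equation 3.8 given the image features and a one-hot parent label, and averaging the conditional distributions over the N samples is one straightforward way to aggregate them; the exact aggregation used in our implementation may differ.

import numpy as np

def sample_inference(p_top, fine_model, image_features, n_samples=100):
    # Draw N ancestral samples from the top-level softmax and average the
    # fine-level distributions predicted under each sampled parent label.
    n_coarse = p_top.shape[0]
    fine_distributions = []
    for _ in range(n_samples):
        parent = np.random.choice(n_coarse, p=p_top)
        parent_onehot = np.eye(n_coarse)[parent]
        fine_distributions.append(fine_model(image_features, parent_onehot))
    return np.mean(fine_distributions, axis=0)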


4.2.2 Overconfident Label Prediction Models

The inference scheme draws N samples from the top-level label prediction model's softmax output. If this model is strongly overfitted, the predicted distribution over the labels can become overconfident: a large amount of probability mass is placed on a potentially incorrect label. This causes biased samples, from which the model cannot recover in the prediction of subsequent levels, due to the strong weight it has learned to place on the parent labels during training.

To evaluate if the model suffers from this, we first ensure that the model is not overfitting strongly. We have adjusted the dropout rate and L2 regularization to ensure no overfitting is occurring. As illustrated by the dashed and solid green lines in figure 4.1, the training and testing accuracy on the coarse labels are similar and not diverging; this signals a good fit of the model.

Secondly, we analyze the weights of the last label prediction model on CIFAR-100. As stated in equation (3.8), the concatenation of the image features and parent labels is multiplied with a weight matrix $W_h$, which – after addition of the bias term $b_h$ – results in the activations for the fine classes. This weight matrix is indicative of the relative signal of the parent labels and the image features. To compare the weights on the parent labels and the image features, we compute the weight matrix adjusted for input strength by element-wise multiplying with the maximum absolute activation of the input:

\[
\hat{W}_h = W_h \odot \operatorname{magnitude}\!\left( \left[\, f_\theta(x) \;\; Y^h \,\right] \right) . \tag{4.1}
\]

We define magnitude as the median of the maximum-1% absolute activations in a batch of 1000 random hold-out datapoints, computed per input:

\[
\operatorname{magnitude}(W)_i = \operatorname{median}\!\left( \max_{1\%} \left( \left| W_i \right| \right) \right) . \tag{4.2}
\]
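A small numpy sketch of equations (4.1) and (4.2) is given below; the array shapes (1000 hold-out points, 256 image features plus 20 parent labels, 100 fine classes) and the random placeholder values are illustrative rather than the exact dimensions and data of our models.

import numpy as np

rng = np.random.RandomState(0)
activations = rng.randn(1000, 276)   # hold-out batch: [image features, parent labels]
W_h = rng.randn(100, 276)            # weight matrix of the fine-level classifier

def magnitude(batch):
    # Eq. (4.2): per input column, the median of the top-1% absolute activations.
    abs_act = np.abs(batch)
    k = max(1, int(0.01 * abs_act.shape[0]))
    top_1pct = np.sort(abs_act, axis=0)[-k:, :]
    return np.median(top_1pct, axis=0)

# Eq. (4.1): scale every column of W_h by the magnitude of the corresponding input.
W_hat = W_h * magnitude(activations)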

Figure 4.4a visualizes the adjusted weight matrix $\hat{W}_0$. We observe that the sampling approach learns strong positive weights on the correct parent labels while having low weights on the image features. This implies that the second-level label prediction model relies strongly on receiving correct parent labels as input, and the noise introduced by the label weights potentially overshadows the signal of the image features.

Lastly, we analyse the output of the top-level label prediction model. If the model is indeed overconfident in its predictions, we should observe high probability values in the softmax output for erroneous predictions. An effective way to measure this is to compute the entropy over the conditional likelihood distribution.

Figure 4.4: Visualization of $\hat{W}_0$ for sampling (a, left) and softmax propagation (b, right). Input 256-276 are the weights on the parent labels.

The Shannon Entropy [47] is defined as:
\[
H_p(y^h) = - \sum_{i=1}^{n} p(y^h = i) \log_b p(y^h = i) . \tag{4.3}
\]

We set $b = |y^h|$ (the number of labels at level $h$) to facilitate intuitive comparison between the hierarchy levels. $H(y^h) = 0.0$ implies perfect certainty, and $H(y^h) = 1.0$ complete uncertainty, i.e. a uniform distribution.

We compute the average entropy of the softmax output from the top-level prediction model over all datapoints:

\[
\text{Average Entropy} = \frac{1}{N} \sum_{n} H_{\hat{p}(y^H_n \mid x_n)}\!\left( y^H_n \right) . \tag{4.4}
\]
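The entropy statistics of equations (4.3) and (4.4) can be computed as in the following sketch; the shapes and the uniform test distribution in the usage line are illustrative placeholders.

import numpy as np

def entropy(p, base):
    # Eq. (4.3): Shannon entropy with the log base set to the number of labels,
    # so that 1.0 corresponds to a uniform (fully uncertain) distribution.
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p) / np.log(base), axis=-1)

def average_entropy(softmax_outputs):
    # Eq. (4.4): mean entropy of the top-level softmax over all datapoints.
    n_labels = softmax_outputs.shape[1]
    return np.mean(entropy(softmax_outputs, base=n_labels))

# A uniform prediction over 20 coarse labels yields an average entropy of 1.0.
print(average_entropy(np.full((5, 20), 1.0 / 20)))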

The results are presented in figure 4.5. We find that the entropy of the incorrect predictions is sufficiently high for diverse sampling to occur. We further analyzed the average maximum probability values from the same softmax output, which converge to 50% for the incorrect predictions of the normal model.

Figure 4.5: Entropy of the normal and the strongly regularized top-level label prediction model. Lower entropy implies a more confident softmax. The darker lines represent the entropy of incorrect predictions, the brighter lines that of correct predictions. A normal configuration (red) and a strongly regularized version (yellow) are presented.

This is further evidence for concluding that overconfident predictions do not inhibit the inference procedure. We thus reject hypothesis 2.

4.2.3 Incompatible Signal/Noise Ratio

As shown in equation 3.8, the prediction models weigh the combined input of the parent labels and the image features. During training, the parent labels have a strong predictive value, resulting in high weights learned on these inputs. However, these weights are subject to noise introduced by the SGD algorithm. It is thus possible that the signal from the image features (which have low weight due to their low predictive value) is canceled out by the noise of the parent label weights.

Rather than performing an exhaustive study on this, we propose a variation of the model that circumvents this issue. This is achieved by substituting the input to the softmax in equation 3.8 with a gated activation function, drawing inspiration from van den Oord et al. [54]:

Figure 4.6: Effect of gated activation.

\[
\hat{p}(y^h \mid y^H, \cdots, y^{h+1}, x) = \operatorname{softmax}\!\left( W_h \left[\, f_\theta(x) \;\; Y^h \,\right] + b_h \right) . \tag{3.8}
\]

And with gated activation:

\[
= \operatorname{softmax}\!\left( \left( W_h f_\theta(x) + b_h \right) \odot \sigma\!\left( W^v_h Y^h + b^v_h \right) \right) . \tag{4.5}
\]

To illustrate how this circumvents the issue, one can imagine that the sigmoid activation — which is governed by the parent label values — serves as a switch that turns the signal from the image features on or off. This effectively separates the signals of the parent labels and image features. If introducing this gate does not improve the results, we can safely assume that the signal vs noise problem is not the culprit that causes sub-baseline performance. Figure 4.6 presents the results of this change. We find almost equivalent performance of the gated activation model versus the normal model. We thus reject hypothesis 3.
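A numpy sketch of the gated activation of equation (4.5) is shown below; the weight and bias names are illustrative, and in the actual model they are trainable parameters of the fine-level classifier.

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_prediction(features, parent_labels, W_h, b_h, W_v, b_v):
    # Eq. (4.5): the image-feature pathway is gated element-wise by a sigmoid
    # that depends only on the parent labels, separating the two signals.
    feature_logits = features @ W_h.T + b_h
    gate = sigmoid(parent_labels @ W_v.T + b_v)
    return softmax(feature_logits * gate)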

4.2.4 Incorrect Implementation Hypothesis

Due to the intrinsic complexity of deep learning models, we must consider a potential failure on our part to correctly implement the experiment as designed. This is especially important since we are providing a negative result for a theoretically appealing method, and extra care must be taken in verifying the code and experiment design so as not to shut the door on a potentially effective method. Although we do not provide empirical results or formal proof in favor of rejecting this hypothesis, we present


the following arguments to inspire confidence in our results and to tentatively reject hypothesis 4. We have extensively analyzed our implementation by carefully studying the weights learned by our implemented modules, and concluded that they align with the theory of the optimal solution. We have written isolated test problems to validate the implementation of the extensions we have created for Keras. Lastly, we have followed best practices from software development, such as ensuring that all experiments share one code base and minimizing duplication of code. This reduces the chance that bugs remain unnoticed and prevents erroneous variance due to differences in implementation.

4.2.5 Training and Testing Input Discrepancy

Our method potentially suffers from the fact that the true parent labels at training time provide a strong signal, resulting in large weights attributed to the label input. At test time, the samples of these labels can be incorrect, which the network has not been able to adjust for. For certain disciplines such as language modeling and chained prediction, the discrepancy between training and testing has an inhibiting effect, leading to the compounding of errors. Various solutions have been proposed, notably scheduled sampling [4] and professor forcing [31], which narrow the gap between the distribution of certain inputs at training and test times.

To evaluate if this plays a role in the poor performance of our proposed method, we evaluate a derived method that has no discrepancy between the training and testing procedures, which allows the label prediction models to adjust to the accuracy and certainty of the ancestor models. In section 3.3 we propose the use of the softmax outputs from the parent label prediction models as a substitute for the parent labels during training and test time. We refer to this method as softmax propagation.
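A sketch of the softmax propagation procedure is given below; level_models is a hypothetical list of per-level prediction functions, ordered from the top of the hierarchy to the leaves, each mapping the image features and the parent softmax distribution to the softmax distribution of its own level. The empty parent input for the top level is a simplification of our actual implementation.

import numpy as np

def softmax_propagation(image_features, level_models):
    # Feed the full softmax output of each level to the next level, identically
    # at training and test time, instead of sampling hard parent labels.
    parent_distribution = np.zeros(0)
    predictions = []
    for predict_level in level_models:
        parent_distribution = predict_level(image_features, parent_distribution)
        predictions.append(parent_distribution)
    return predictions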

In figures 4.7 and 4.8 we present the results of softmax propagation. We find that this derived method yields a slight improvement over both the sampling method and the baselines on CIFAR-100. Although this does not present conclusive evidence to accept hypothesis 5, the empirical evidence motivates further analysis of the softmax propagation model, which is presented in the next section.

4.3 Softmax Propagation Model

The softmax propagation model, as presented in section 3.3, demonstrates a slight performance improvement over the factorized likelihood model. In this section we analyze this derived model.

Figure 4.7: Softmax Propagation: CIFAR-100. The figure shows a slight performance increase with the softmax propagation method for the fine labels of CIFAR-100 (lower green line). The coarse labels are not affected as they do not benefit from ancestor label information.

Figure 4.8: Softmax Propagation: ImageCLEF. The performance increase is not observed on ImageCLEF.
