M.Sc. Stochastics and Financial Mathematics
&
M.Sc. Econometrics (free track)

Master Thesis

Learning to distinguish
agnostic unsupervised classification

Author: Lars Haringa (student number 10820310)
Supervisors: dr. A.J. van Es, dr. K.J. van Garderen
Second examiners: prof. dr. J.H. van Zanten, dr. J.C.M. van Ophem
Examination date: December 20, 2018

Korteweg-de Vries Institute for Mathematics
Amsterdam School of Economics
Faculty of Economics and Business


Statement of originality

This document is written by Lars Haringa, who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The University of Amsterdam is responsible solely for the supervision of completion of the work, not for the contents.

Thank you Paulien, for sticking by,

and thank you to my family for supporting me all the way. Thank you Bert for helping me to do my thing,

and thank you Kees Jan for helping me to do Econometrics, from start to finish.

Title: Learning to distinguish—agnostic unsupervised classification
Author: Lars Haringa (mail@haringa.org), student number 10820310
Supervisors: dr. A.J. van Es, dr. K.J. van Garderen
Second examiners: prof. dr. J.H. van Zanten, dr. J.C.M. van Ophem
Examination date: December 20, 2018

Korteweg-de Vries Institute for Mathematics, University of Amsterdam
Science Park 105–107, 1098 XG Amsterdam
http://kdvi.uva.nl

Amsterdam School of Economics, University of Amsterdam
Roetersstraat 11, 1018 WB Amsterdam
http://ase.uva.nl/research/institute.html


Abstract

A method for unsupervised classification is proposed which requires no prior knowledge about the classification task. The distinguisher is defined by a novel objective function and a neural network. The objective is to learn categorical distributions mapped from observations, distinguishing between arbitrary pairs in an information-theoretic sense. A neural network provides the map. Treated as an arbitrary function approximator, the network has its flexibility limited by regularisation in an agnostic and practical manner. The accuracy of a first implementation exceeds standard techniques on a benchmark image recognition dataset (it achieves 74% on MNIST handwritten digits), but falls short of recent clustering algorithms. Unlike existing methods, the distinguisher does not use prior information, such as observation-space similarities or the number of clusters, and hence can be applied to more tasks. It adapts to the observation space, and clusters emerge sequentially. An investigation beyond this thesis is required to firmly establish both stability in novel applications and the principles by which observations are distinguished. However, initial work suggests that the first clusters result from the angle between initial transformations, while a gradient study illuminates behavioural aspects, paving the way for further analysis. As a clustering algorithm, the distinguisher is useful for semi-supervised learning with very scarce labels. Beyond that, it shows that familiar ideas about recognition and distinction can emerge from an uninformed initial state.

To demonstrate practical applicability, pop music lyrics are distinguished into four clusters: personal-emotional, foreign-epic, verbal-vocal, and narrative-descriptive. Each cluster has a different and significant effect on the probability of a Billboard Hot 100 listing, with verbal-vocal lyrics the most favourable. Effects of smaller magnitude are found in musical features such as key, mode, and tempo, providing aspiring artists with insight into Western pop music. Furthermore, logistic regression is expanded by deep learning with an adaptive variable transformation and a non-parametric likelihood ratio test. Finally, model analysis suggests that pop music is getting more predictable.


Contents

1 Introduction
  1.1 Motivation
  1.2 Content summary
  1.3 A brief history of neural networks and the current state of affairs
2 Formal framework
  2.1 Preliminary definitions and conventions
  2.2 Learning classes by maximum likelihood
  2.3 Learning to distinguish
  2.4 Probability metrics: the choice of δ
  2.5 Function spaces of neural networks: the choice of H
  2.6 Learning neural networks
  2.7 An implementation of learning to distinguish
  2.8 Summary
3 Distinguishing handwritten digits
  3.1 Introducing the problem
  3.2 Introducing the model and its assessment
  3.3 Results and analysis for zeros and ones
    3.3.1 Average objective
    3.3.2 Emergent categories and impression concentrations
    3.3.3 Prediction accuracy
    3.3.4 Average impression per true class
    3.3.5 Latent category transitions
    3.3.6 True class transitions and lasting doubts
    3.3.7 Test assessment
  3.4 Distinguishing five digits
    3.4.1 Entropy-weighted gradient
    3.4.2 Results
  3.5 Distinguishing all ten digits
    3.5.1 A convolutional distinguisher
    3.5.2 Results
    3.5.3 Comparison to clustering methods
  3.6 Discussion and summary
4 Derivations and theoretical results
  4.1 The largest impression category at initialisation
  4.2 Algorithmic behaviour of the objective gradient
  4.3 Open questions
  4.4 Summary
5 Agnostic unsupervised classification of lyrics to predict song popularity
  5.1 Introducing the setting and the model
    5.1.1 Context, relevance, and related research
    5.1.2 Data
    5.1.3 Dataset statistics
    5.1.4 Model and estimation: logistic regression and deep learning
  5.2 Model estimates and performance
    5.2.1 Estimates and significance
    5.2.2 Predictive performance
  5.3 Cluster behaviour
    5.3.1 Predictive performance on holdout sample
    5.3.2 Stability
    5.3.3 Comparison to topic modelling: latent Dirichlet allocation
  5.4 Further model investigations
    5.4.1 Is pop getting predictable?—out-of-time test of ranking power
    5.4.2 Model specification: neural network likelihood ratio test
    5.4.3 Lyrical clusters have comparable effect across genres
    5.4.4 Greatest hits
  5.5 Future work and conclusion
    5.5.1 Future recommendations for understanding pop
    5.5.2 Discussion and conclusion
6 Conclusions
  6.1 Discussion
  6.2 List of open questions and directions for future research
  6.3 Popular summary
7 References

Appendices
A Derivations
  A.1 Uniform convergence of bounded scaling sigmoid (lemma 2.5.2)
  A.2 Angle gives probability of sharing largest category (prop. 4.1.1)
  A.3 Objective gradient (lemma 4.2.1)
  A.4 A bound for the gradient summand term (lemma 4.2.3)
B Music popularity
  B.1 Distinguishing lyrics
  B.2 Joint significance of indicators of key
C Lyric word clouds
  C.1 Emergent category: personal-emotional
  C.2 Emergent category: foreign-epic
  C.3 Emergent category: verbal-vocal
  C.4 Emergent category: narrative-descriptive


Chapter 1

Introduction

1.1 Motivation

Recognition and distinction are fundamental cognitive tasks which we learn to develop by ourselves—not by example.

Babies learn to identify their parents spontaneously. The conception that some—but not all—observations are similar cannot be an empirical observation itself, but must be a priori: an innate inclination to categorise experience. Information, including communication, is strongly related to the ability to recognise and distinguish concepts, like a word, a visual impression, or an abstract structure. We need to categorise experience before we can make sense of it.

The principles by which we distinguish the most elementary concepts are not in our control, but shaped by evolution, tuned to success as a social and reproductive animal. For instance, communication requires agreeing on a shared language—but how then do we communicate the agreement? We need a shared language to begin with, such as agreement on when sounds or signs are the same. In general, we cannot acquire the concept of symbols from information, as that concept is a prerequisite for information. Instead, we are sensitive from birth to some bandwidths of regularities, and the method by which we distinguish these regularities into categories is innate and shared.

The field of artificial intelligence pursues the simulation of human-like intelligence. It is not yet clear how to relate these simulations to e.g. cognitive, neurological, or behavioural sciences, although all these directions are studied. Philosophically, AI offers a unique view onto questions about knowledge through its empirical, data-driven, computational approach, delivering positive examples of the capacity of algorithms within formally understood, tight constraints. This is interesting because it yields sufficient conditions for systems to exhibit certain forms of intelligence. In addition, by controlling the mathematical framework, we gain deeper insight into concepts such as information, as the simulations, figuratively, bring the mathematics to life.

I wrote this thesis to find sufficient and simple conditions, mathematical and algorithmic, for recognition and distinction to emerge spontaneously from an uninformed initial state. I expect that statistical and information theory provide a formalism suitable to specify and investigate these conditions, and that AI delivers tools to simulate and test such ideas empirically.

At the same time, writing for the hypothesis-driven fields of statistics and econometrics, I invite a sense of agnosticism which had historically been unavailable due to insufficient computer power. Artificial intelligence can sometimes let data speak for itself—it may not say what we expect it to.


1.2 Content summary

The topic of this thesis is unsupervised learning. Learning is a synonym of model estimation, meaning to find estimates ˆϑ of some parameter vector ϑ corresponding to a function fϑ. In supervised learning, fϑ typically describes a relationship between observed variables ˜x and ˜y, assuming there exists ϑ0 such that ˜y ≈ fϑ0(˜x). Common examples are regression and classification. The practitioner supervises the learning by using the response value ˜y for corresponding explanatory variables ˜x during estimation, to arrive at a satisfactory ˆϑ. In this sense the model uses prior knowledge. Unsupervised learning, in contrast, addresses the situation where a response variable ˜y is not available or intended, with data consisting of observation vectors x ∈ D ⊆ R^K. Common examples include estimating the distribution of observations, so that fϑ(x) might be a density, or clustering, so that fϑ(x) might be a cluster assignment. This thesis deals with unsupervised learning of unknown classes, or categories, where class labels are unknown. This is related to clustering, but with subtle differences. Specific uses of this technique include semi-supervised learning, where labels are known for a (small) subset of the sample, and active learning, where the model queries the practitioner during learning about the labels of specific examples. Both are valuable techniques, since class labels may be expensive to obtain relative to unlabelled data.

Unsupervised learning may still incorporate prior knowledge, for instance through a predefined similarity, such as a matrix of pairwise differences, or through a metric defined on the observation space. Such a definition is a form of prior knowledge, since it is specific to the learning task and typically originates from the practitioner's understanding. It may be argued that providing such knowledge reduces the problem of unsupervised classification from 'hard' to 'easy', since it fixes what is meant by "different". The goal of this thesis is unsupervised classification without any prior knowledge: to have the model distinguish groups in the observations from a completely uninformed initial state. To differentiate this task from supervised classification, the terms category and categorisation are used to refer to the learned, subjective groups, which may not correspond to 'true' classes in the prior knowledge of the practitioner.

In general, a modelling framework is given by a functional form (a model) and a learning method (an estimator). This thesis proposes a modelling framework for performing categorisation by a novel objective function which guides the learning of a neural network. A neural network is an arbitrary function approximator with flexibility limited by its size. It may be viewed as an uninformed model that does not contain hypotheses of the practitioner: it is agnostic. Since it differs from clustering, the proposed combination of objective function and neural network will be referred to as a distinguisher instead. The neural network acts as a map from an observation x ∈ D into different categories. The objective is to separate the categories of any pair of observations: to compare the interpretations of any two observations and attempt to view them as different. Learning happens through the interpretations of observations, and the interpretations themselves are learned such that they result in different categories. In this way, no prior knowledge at all is used. It is expected (or hypothesised) that, by lack of any prior knowledge, categories emerge from the 'largest' differences in the observation space D, where this notion of 'large' remains unspecified.

This thesis uncovers no guarantees that a distinguisher learns sensible categories, e.g. ones that map to true classes. However, the flexibility of a neural network depends on its size and on parameter restrictions, so there is a practical way to control or regularise it. The hope is that regularisation allows only the 'greatest' differences (in an informal sense) to give rise to different categories. Different categories hence depend on the neural network, so they result from properties of its functional form. This topic is not further investigated in this thesis, and it remains an important question which properties give rise to different categories. Instead, this thesis provides working examples on benchmark datasets and an analysis of those results. In general, a distinguisher is able to reliably recognise different handwritten digits with accuracy exceeding standard techniques, while using less prior information. Performance is not yet comparable to recent clustering algorithms; however, there are clear directions for improvement. Furthermore, since familiar classes emerge among the categories, this shows that categorisation can occur spontaneously from a simple learning rule and a large but finite parallel processor, which is a direct analogy to a biological brain.


Chapter 2 provides a detailed exposition of the theory involved, and the distinguisher is introduced using a comparison to maximum likelihood. The main point is the proposed objective (2.3.1), which is implemented in this thesis by (2.7.1), written out below in (1.2.1). It depends on an arbitrary pair of vector-valued observations X1, X2 ∼ p0 from the unknown true distribution p0, a neural network hϑ with parameters ϑ mapping to L-dimensional probability vectors, and a probability metric dJS called the Jensen-Shannon distance. The objective is given by

\[
\operatorname*{arg\,max}_{\vartheta \in \Theta}\; \mathbb{E}_{X_1 \sim p_0,\, X_2 \sim p_0}\!\left[ d_{\mathrm{JS}}\big( h_\vartheta(X_1),\, h_\vartheta(X_2) \big) \right]
\tag{1.2.1}
\]

where hϑ maps an observation to a probability vector, which represents a categorical distribution, and dJS is the Jensen-Shannon distance, a probability metric derived from Kullback-Leibler divergence. For brevity, given an observation x, the function value hϑ(x) as a categorical distribution will be called an impression. It is a central concept, and the name reflects its subjective, doubtful, and dynamic qualities. An impression may be considered the distinguisher's recognition of an observation x, assigning it to any one of L categories with some probability determined by hϑ(x). When learning ϑ, impressions change when ϑ changes. The optimum of dJS is given by two impressions which put all probability mass on one, but different, coordinate. In this sense, the objective drives impressions to so-called one-hot vectors. Impressions which are close to one-hot vectors are called strong impressions. They may be thought of as observations which are confidently recognised by the distinguisher as belonging to some category. Maximum likelihood classification also drives categorical distributions to one-hot vectors, but there the target coordinate is known (it is prior knowledge), while the proposed objective does not specify which coordinate to increase. Altogether, the proposed objective is intended to simulate an innate inclination to categorise with no prior knowledge. The purpose of the objective is to drive most impressions towards strong impressions, using only a limited number of all available categories (the coordinates). Limitations are enforced in an agnostic manner by regularising the neural network. Learning is performed by stochastic gradient descent on a subsample approximation of the gradient of the objective function. Implementations are programmed in TensorFlow and the wrapper package Keras, which are dedicated Python packages for symbolic graph computation and deep neural networks, respectively.
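The ingredients above can be sketched numerically. The following minimal NumPy illustration (a sketch of my own, not the thesis's TensorFlow/Keras code; all function names are assumptions) computes the Jensen-Shannon distance between two impressions and a batch estimate of the pairwise expectation in (1.2.1):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for categorical distributions."""
    mask = p > 0  # terms with p_i = 0 contribute zero
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def d_js(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (natural logs)."""
    m = 0.5 * (p + q)  # mixture is positive wherever p or q is
    return float(np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m)))

def pairwise_objective(impressions):
    """Subsample estimate of E[d_JS(h(X1), h(X2))]: average over all batch pairs."""
    n = len(impressions)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += d_js(impressions[i], impressions[j])
            count += 1
    return total / count

# Two one-hot impressions attain the maximum of d_JS, namely sqrt(log 2).
p, q = np.array([1.0, 0.0]), np.array([0.0, 1.0])
```

The maximum sqrt(log 2) ≈ 0.833 (with natural logarithms) is reached exactly by two impressions that put all mass on different coordinates, matching the description of the optimum above.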

To the writer's best efforts in investigating published literature, the proposed objective function is a new concept. It is new in two ways. First, the concept of learning to increase a distance between arbitrary pairs, or distinguishing a pair, is new. The implementation in this thesis is a particular set of choices corresponding to the writer's knowledge and preferences (e.g. using neural networks under an information-theoretic metric). Second, it is new to have a non-compound objective function (like the expectation of dJS) simultaneously learn a transformation (like hϑ) and cluster the transformations (by largest categorical coordinate). Such technical simplicity is interesting from a theoretical and perhaps philosophical point of view, and is also useful in practice since it does not introduce any additional hyperparameters. The philosophical point of view relates to the discussion of which prior knowledge is required for any learner (artificial or biological) to be able to make sense of observations the way humans do.

In chapter 3, results are presented on a benchmark dataset of handwritten digits, called MNIST. Three experiments are performed: on zeros and ones (two classes), on the first five digits (five classes), and on all digits (ten classes). The primary assessment quantity is the accuracy of categories interpreted most favourably as true classes. On zeros and ones, accuracy is nearly perfect, and an analysis of the learning process reveals several interesting properties. Most notably, strong categories emerge sequentially, they are nested at least in the binary example, and there are indications that categories emerge from random initial fluctuations. On the first five digits, accuracy is over 90%, and a novel gradient weighting technique, called the entropy-weighted gradient, is introduced which accelerates learning to distinguish. On ten digits, performance of a fully agnostic distinguisher degrades to 60%, which is comparable to standard clustering methods implemented in software packages. However, a common adjustment which provides the distinguisher with 2D spatial awareness achieves 74% accuracy, which is better than standard techniques, but not as good as recent algorithms. This convolutional distinguisher uses convolutional layers, which are ubiquitous in computer vision and in clustering for image recognition. While this introduces some prior knowledge to an otherwise agnostic framework, the prevalence of convolutional layers in the literature suggests that spatial awareness is a requirement for image recognition. Such a requirement might also hold for biological brains, which have a dedicated visual cortex.

The results serve as a basic check of the concept. Since hyperparameters were tuned to obtain a working algorithm for each experiment, the results do not show that learning to distinguish works in arbitrary settings without prior knowledge; such a result is beyond this thesis. However, when categories emerge, they most often correspond to MNIST's true classes. Also, the specific algorithms are stable. Moreover, they achieve impressive performance which in itself offers potential for clustering and one-shot applications. Most importantly, the results show that it is feasible to obtain familiar classes from unsupervised learning without requiring any hypothesis of the practitioner.

As a first step in understanding what constitutes the emergent categories, section 4.1 shows that the angle between top-layer latent representations determines the probability that two impressions at initialisation share the same largest category. This is relevant because there are indications that the first emergent categories are related to random initial fluctuations.

Section 4.2 collects some derivations which deal with the gradient, which is central to the algorithmic learning rule. Since neural networks are function compositions, the gradient is derived using the chain rule. The section focusses on the objective gradient, which is the part of the applied chain rule that differs from existing algorithms. It is shown that the objective gradient is bounded, which ensures that the algorithm cannot fail due to very large numbers that exceed numerical capacity. The proof uses Pinsker's inequality, which states that the square root of half the Kullback-Leibler divergence bounds total variation distance. Furthermore, it is shown that the gradient is undefined when equal impressions are compared, equivalent to zero Jensen-Shannon distance. This requires explicitly extending the definition of the objective gradient to equal zero for equal impressions, which is the sensible choice. Another issue is identified when almost-equal impressions are compared, as a ratio of two nearly-zero numbers may induce numerical instability. Hence it is noted that an implementation should set to zero not only the gradient of equal impressions, but also the gradient of almost-equal impressions whose Jensen-Shannon distance lies below some small threshold. Finally, by studying the parity of the objective gradient, behavioural aspects of the algorithm are illuminated. Most notably, some conditions are specified under which impressions increase towards strong impressions, and more evidence is found that categories emerge from random initial fluctuations.
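The numerical safeguard described above can be sketched as follows. This is my own minimal NumPy illustration, not the thesis code; the function names and the finite-difference approximation are assumptions, used only to show the thresholding idea:

```python
import numpy as np

def d_js(p, q):
    """Jensen-Shannon distance between categorical distributions p and q."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return float(np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m)))

def guarded_grad(p, q, eps=1e-6, h=1e-7):
    """Finite-difference gradient of d_JS with respect to p, with a guard.

    Near p == q the square root makes the derivative a ratio of nearly-zero
    numbers; following the text, the gradient is explicitly set to zero
    whenever d_JS(p, q) falls below the small threshold eps."""
    if d_js(p, q) < eps:
        return np.zeros_like(p)
    g = np.zeros_like(p)
    for k in range(len(p)):
        step = np.zeros_like(p)
        step[k] = h
        g[k] = (d_js(p + step, q) - d_js(p - step, q)) / (2 * h)
    return g
```

For clearly different impressions the guard is inactive and a finite, nonzero gradient is returned; for equal (or nearly equal) impressions the zero vector is returned, mirroring the extended definition of the objective gradient.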

As an application of distinguished clusters in a classical statistical model, chapter 5 predicts Billboard Hot 100 popularity from lyrics and musical features in a logistic regression. It is found that lyrical expressiveness is an important predictor for a Billboard listing, meaning rich use of conventional and conversational language along with vocal expressions. Furthermore, personal topics have a higher probability of popularity than descriptive ones, whereas foreign language is best avoided altogether. Genre is also an important predictor, having the largest magnitude among explanatory variables. Among musical features, tempo has an optimum for Billboard popularity around 120 bpm, while songs written in major or in F or A♯ were also found to have a significant slight increase in probability. Songs in D have a slightly disadvantageous effect. The analysis, of lyrics in particular, supports aspiring singers and songwriters looking to improve their ability.

Finally, the agnostic character of neural networks is leveraged in two novel applications (to the writer's knowledge) of approximate non-parametric statistics. The first is to obtain an arbitrary, but optimal, non-monotone transformation of a single variable within the maximum likelihood framework of logistic regression during estimation. The second is a bootstrapped likelihood ratio test for logistic model specification, where a neural network plays the role of an expressive challenger model of which the logistic functional form is a special case (uniformly asymptotically on a bounded domain, as also shown in this thesis). Hence, the practitioner may relinquish the task of specifying an alternative hypothesis. This algorithmic likelihood ratio test is very general and checks for arbitrary variable transformations and interactions. For the Billboard model, the null hypothesis of correct specification could not be rejected, which seems quite a strong result given the flexibility of a neural network.
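The bootstrapped likelihood ratio idea can be illustrated schematically. The toy sketch below is entirely my own: a hypothetical quadratic term stands in for the thesis's neural network challenger, the logistic fits use plain gradient ascent, and all names and settings are assumptions. It computes the statistic 2(ℓ₁ − ℓ₀) and a parametric-bootstrap p-value under the fitted null:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, steps=500, lr=0.5):
    """Fit a logistic regression by gradient ascent; return (coef, log-likelihood)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)  # ascent on the log-likelihood
    p = np.clip(1.0 / (1.0 + np.exp(-X @ w)), 1e-12, 1 - 1e-12)
    return w, float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Simulated data for which the null (linear-index) logistic model is correct.
n = 200
x = rng.normal(size=n)
X0 = np.column_stack([np.ones(n), x])        # null model
X1 = np.column_stack([np.ones(n), x, x**2])  # flexible challenger (stand-in)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * x))))

w0, ll0 = fit_logistic(X0, y)
_, ll1 = fit_logistic(X1, y)
lr_obs = 2.0 * (ll1 - ll0)  # observed likelihood ratio statistic

# Parametric bootstrap of the statistic under the fitted null model.
boot = []
for _ in range(50):
    yb = rng.binomial(1, 1.0 / (1.0 + np.exp(-X0 @ w0)))
    _, b0 = fit_logistic(X0, yb)
    _, b1 = fit_logistic(X1, yb)
    boot.append(2.0 * (b1 - b0))
p_value = float(np.mean(np.array(boot) >= lr_obs))
```

Since the challenger nests the null, the observed statistic is non-negative up to optimisation error, and a large p-value means the null specification is not rejected; the thesis replaces the quadratic stand-in with a neural network, which makes the test sensitive to arbitrary transformations and interactions.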


1.3 A brief history of neural networks and the current state of affairs

Neural networks are the driving forces behind the current resurgence of the field of artificial intelligence. Since the technique is not as well known in mathematics and econometrics, its history and status deserve special attention.

[Figure 1.1: Hardware perceptron, a mechanical neural network (image from Rosenblatt, 1961)]

The neural network arose from the intersection of engineering, mathematics, and neuroscience, propagated by three corresponding motivations: model performance, theoretical understanding, and simulating biological learning. It was inspired by biological models of computation, from the idea that input-output tasks are performed by a distributed system of simple operations. Complex behaviour was expected to emerge from simple learning rules, as the brain was assumed to work. The perceptron model and algorithm (Rosenblatt, 1961) was arguably the first neural network and deep learning architecture. The concept was intended and executed as a mechanical device, rather than as software (figure 1.1). Mathematical feedback from Minsky and Papert (1969) discredited the single-layer ("two-layer") perceptron by the observation that it could not learn the elementary logical XOR operation. Although it was already noted that multilayer architectures do not suffer from this fundamental limitation, the observation is often credited with inducing the first "AI winter", when research and results on neural networks stagnated. Aside from neural networks, the constructive, hypothesis-based approach known as symbolic AI diminished as a whole during the first AI winter. Neural networks were invigorated by backpropagation, often accredited to Rumelhart, Hinton, and Williams (1985) and published in Nature a year later (Rumelhart, Hinton, and Williams, 1986). Backpropagation is repeated use of the chain rule on an objective function with respect to model parameters. It turned out that the complex, but more expressive, multilayer perceptrons could be estimated using gradient algorithms to satisfactory performance, unlocking for the first time their great flexibility. Squared prediction error was used in the original backpropagation proposition, but it was soon recognised that an arbitrary optimisation objective could be used (Baum, 1986), in particular maximum likelihood (Baum and Wilczek, 1987). Links to statistics were soon further formalised by Halbert White (1989), among others. It was observed (Gish, 1990) that networks trained under maximum likelihood could in principle be used for classical statistical practices, such as likelihood ratio tests, information criteria, or minimum description length selection.

Mathematically, Cybenko (1989) and Hornik, Stinchcombe, and White (1989) independently established universal approximation properties for the multilayer perceptron, the simplest extension of Rosenblatt's input-output perceptron with one additional layer. Although it is still not rigorously established how to find optimal parameter estimates (since backpropagation yields no guarantees), it did become clear that there is no principal limitation on what a neural network can learn.

Principal contributors to research since that era include Geoffrey Hinton, Yann LeCun, Yoshua Bengio, Juergen Schmidhuber, and Max Welling, who currently works at the University of Amsterdam. However, due to insufficient computer power, the field experienced a "second AI winter". From the statistical perspective, neural networks are elusive objects, as today there is still no analogue for typical statistical handles such as consistency, asymptotics,

(12)

12 CHAPTER 1. INTRODUCTION

uniqueness of solution, and explainability. They are sometimes referred to as “black boxes”, referring to a lack of formal understanding and transparency. This meant that without sufficient proof-of-concept, interest in neural networks was quite sparse, and restricted to dedicated researchers pushing for results. Still, important progress was being made, e.g. in successfully estimating a restricted Boltzmann machine (Hinton, 20022002), an unsupervised neural network which is a universal approximator for the sample distribution.

Technological advances would eventually deliver proofs-of-concept for diverse complex problems. The second AI winter had definitively blown over when processing power was strong enough to allow a deep neural network to win the 2012 ImageNet competition (Krizhevsky, Sutskever, and Hinton, 2012), a visual classification task which is a benchmark in the field. Neural networks started to prevail in computer vision and speech recognition, which they still do. Reinforcement learning allows neural networks to learn dynamically from interactions with an environment, leading e.g. to autonomous robots. In 2016, a well-publicised match between world-level Go player Lee Sedol and the neural-network-based AI AlphaGo by Google (Silver et al., 2016) resulted in victory for the algorithm and an honorary 9 dan from South Korea’s Go Association, who write on the certificate:

“AlphaGo exhibited creative and brilliant Baduk [Go] skills and contributed greatly to the progress of Baduk.” As was initially speculated, complex behaviour can indeed emerge from simple learning rules.

Statistical and information theory has been of great support in constructing neural network architectures and algorithms. Illustrative examples are generative adversarial networks (GANs), which are unsupervised networks that learn to generate from the sample distribution by trying to mislead a second neural network trained to tell the difference (e.g. Arjovsky, Chintala, and Bottou, 2017 and Chen et al., 2016). The term “statistical learning” describes a different but related field, which studies model estimation and machine learning algorithms on e.g. probability of correctness and general feasibility (PAC learning), capacity (VC dimension), and convergence speed. Because of their functional complexity, neural networks seem to be hard to analyse on these topics. Statistical learning provides neural network learning with e.g. a hierarchy of runtime and sample complexity classes, which is a basis to compare different kinds of models; see Ben-David and Shalev-Shwartz (2014, p. 268) or Hastie, Tibshirani, and Friedman (2009, p. 389) for instance. More theoretical results are obtained from statistical learning; e.g., the VC dimension bound is nearly tight for neural networks (Bartlett et al., 2017). However, there is no thorough understanding yet of what constitutes a satisfactory solution to a neural network learning problem. Related to this, unidentifiability and lacking guarantees of finding global or good optima inhibit classical statistical approaches such as asymptotics.

Currently, neural networks are taught in master’s degree programmes either as a standard topic (e.g. in artificial intelligence) or as an elective (e.g. in statistical sciences). Common textbooks which cover neural networks include MacKay (2005), Bishop (2006), and Goodfellow, Bengio, and Courville (2016), or Ben-David and Shalev-Shwartz (2014) for a statistical learning perspective. Much useful knowledge about practical implementation is not covered in these textbooks, but may be obtained from specialised university courses. Open sourcing is prevalent in machine learning and some universities host their course material publicly online. Examples are the University of Amsterdam course on deep learning (Gavves et al., 2018), the University of California, Berkeley’s topics course on deep learning (Bruna, 2016), which is mathematically oriented, and the Stanford University courses on neural networks, for instance on computer vision (Karpathy, 2018). Furthermore, most recent machine learning papers are freely accessible (e.g. via arXiv.org) and computational implementations and results are often made available publicly (e.g. via github.com). Finally, due to the rise in popularity in research, amongst students, and in industry, many tutorials are to be found online with various degrees of challenge and formalism.


Chapter 2

Formal framework

This chapter introduces a framework for unsupervised classification by a novel objective function together with a neural network. Choices are motivated step by step to illustrate possible points of departure for further investigation or change. After agreeing on conventions, the reader is introduced to the proposal through a particular perspective on maximum likelihood. This naturally leads to a general formulation of the objective function of learning to distinguish, with specific decisions motivated in later sections. Finally, a description of an algorithmic implementation is the stepping stone to the next chapter, which puts the distinguisher to the test.

2.1 Preliminary definitions and conventions

This section introduces terminology and notation which will be used throughout this thesis.

Elementary conventions

· The natural numbers ℕ start at 1.
· Vectors are column vectors and a subscript is a coordinate index.
· A superscript is often an additional index, rather than raising to a power.
· The logarithm base is not specified and does not matter; however, for numerical implementation and reports, the base of log is 2 rather than e, to conform with information-theoretic interpretations of entropy and Jensen-Shannon distance.
· Zero logarithm: in all situations where applicable, both 0 · log 0 and 0 · log(0/0) can be treated as 0.
· The natural logarithm is explicitly written as ln.

Probability theory and treatment of observations and samples

· In this thesis, x is an observation in the observation space D ⊆ ℝ^K with K ∈ ℕ.
· The distributions over D are contained in the distribution space P. Distributions are assumed to admit densities in this thesis, and p ∈ P will be some density (or mass) over D, meaning p(x) characterises the distribution of some random variable X : Ω → D, with Ω a probability outcome space. An exception is in the statement of Pinsker’s theorem A.4.2, where a more general definition of P is used to define total variation distance.
· The term “distribution” means either “density” or “probability mass function”, depending on whether the corresponding random variable is continuous or discrete.
· Definition of random variables: X ∼ p means that X is a random variable with distribution (density) p. Often the corresponding non-capital letter (e.g. x) is an element of its image (e.g. x ∈ img(X)). Often, img(X) = D.
· An observation x is assumed to be drawn from a true distribution p0; hence X ∼ p0 with x ∈ img(X).
· The sample {x_n}_{n=1}^N consists of N ∈ ℕ observations x_n ∈ D, drawn independently from p0.
· A conditional distribution is written p|x, which means the distribution p conditional on x.
· Should an observation x consist of explanatory and response variables, the explanatory variables are called x̃ and the response ỹ, such that x = (x̃, ỹ).
· If x = (x̃, ỹ), then p0 is the joint distribution of x̃ and ỹ. Then, the conditional true distribution given x̃ from which ỹ is drawn is p0|x̃, and this corresponds to Ỹ ∼ p0|x̃ and p0(ỹ | x̃) = p0|x̃(ỹ).

Machine learning terminology

· Learning is estimation; the two terms are used interchangeably.
· Input means explanatory variables; target means response.
· Supervised learning has the goal of predicting a response/target variable ỹ given an explanatory/input variable x̃; prior knowledge about ỹ is used during learning. Examples: classification and estimating p0|x̃.
· Unsupervised learning uses no response variable. It comprises, among other techniques: learning the full distribution of the sample, or determining clusters in the sample.

Furthermore, in this thesis, a number of new concepts is introduced to facilitate explaining and analysing a new model. In order to communicate these mathematical concepts efficiently, they are given names which hint at the general motivation from human subjective experience, e.g. “distinguisher”, “impression”, and “emergent category”. Although these names are somewhat interpretative rather than purely formal, the writer believes they are not suggestive, and that the interpretations are fully compatible with their mathematical definitions.

2.2 Learning classes by maximum likelihood

In order to introduce the idea of learning to distinguish, it is helpful to first recall maximum likelihood, a well-known method for estimating distributions. Learning to distinguish has illustrative similarities and contrasts compared to maximum likelihood. This section introduces the basics of maximum likelihood as well as relevant perspectives. Let {x_n}_{n=1}^N ⊂ D be a sample of N ∈ ℕ observations and assume it is independently drawn from some unknown distribution p0. Maximum likelihood is said to be performed when one proceeds to solve the optimisation problem

$$\underset{p \in \mathcal{H}}{\arg\max}\left\{\prod_{n=1}^{N} p(x_n)\right\} \quad\text{or equivalently}^{1}\quad \underset{p \in \mathcal{H}}{\arg\max}\left\{\frac{1}{N}\sum_{n=1}^{N}\log p(x_n)\right\}, \tag{2.2.1}$$

where H is some set of distributions over D, possibly indexed by a parameter vector. In statistics, H is called the model; in statistical learning, H is called the hypothesis class. The first formulation in (2.2.1) highlights the goal of maximising the joint density of the sample, maximising the infinitesimal probability of observing that sample, as well as the factorisation due to independence. The second formulation is often practically convenient and also relates to information theory (later in this section). Note that p0 is not used since it is not known; any knowledge or assumptions on p0 should be contained in H. In applications where the practitioner already has a hypothesis (or assumption) on p0, H is often chosen such that a unique solution to (2.2.1) exists which has desirable properties.

Under regularity conditions, a maximum likelihood solution has attractive properties such as consistency and asymptotic efficiency. It is beyond the scope of this thesis to investigate such properties for the method of learning to distinguish.
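As a concrete illustration of solving (2.2.1), the following sketch (not from the thesis; the Gaussian family, the grid, and the tolerances are assumptions for the example) maximises the average log-likelihood over a small hypothesis class of Gaussian densities.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=2.0, scale=1.5, size=10_000)

def avg_log_likelihood(mu, sigma, x):
    """The rightmost objective in (2.2.1) for a Gaussian density."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma ** 2)
                   - (x - mu) ** 2 / (2 * sigma ** 2))

# H: a small grid of Gaussian hypotheses; pick the likelihood maximiser.
mus = np.linspace(0.0, 4.0, 81)
sigmas = np.linspace(0.5, 3.0, 51)
_, mu_hat, sigma_hat = max((avg_log_likelihood(m, s, sample), m, s)
                           for m in mus for s in sigmas)

# The grid maximiser lands near the closed-form Gaussian MLE
# (sample mean and sample standard deviation).
assert abs(mu_hat - sample.mean()) < 0.1
assert abs(sigma_hat - sample.std()) < 0.1
```

Here the hypothesis class is a finite grid, so the argmax is found by enumeration; the neural-network hypothesis classes used later are searched by gradient methods instead.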


One other property of maximum likelihood does deserve special attention in the context of this thesis, namely the link with information theory through the Kullback-Leibler (KL) divergence from p to the truth p0

$$\mathrm{KL}(p_0, p) \equiv \mathbb{E}_{X \sim p_0}\left[-\log \frac{p(X)}{p_0(X)}\right] = \mathbb{E}_{X \sim p_0}\left[-\log p(X)\right] - \mathbb{E}_{X \sim p_0}\left[-\log p_0(X)\right] \tag{2.2.2}$$

or equivalently

$$\mathrm{KL}(p_0, p) \equiv \mathbb{E}_{X \sim p_0}\left[\log \frac{p_0(X)}{p(X)}\right].$$

KL divergence is also called relative entropy; the p0-expectation of −log p(X) is the cross-entropy from p to p0, and the p0-expectation of −log p0(X) is Shannon’s entropy, sometimes denoted H[p0]. The KL divergence is a measure of dissimilarity between two distributions, although not a distance: notice the asymmetry KL(p0, p) ≠ KL(p, p0).

Minimising the sample equivalent of KL(p0, p) over p ∈ H is equivalent to maximum likelihood. To see this, note that only the first expression on the right-hand side of (2.2.2), the cross-entropy, depends on p, and that that expression is the sample equivalent of the rightmost form of (2.2.1) with sign reversed. Hence, by the law of large numbers, the sample equivalent converges almost surely to the KL divergence, save a constant term, so that maximum likelihood (2.2.1) is asymptotically equivalent to minimising KL divergence (2.2.2) over p ∈ H. Although KL divergence is not a statistical metric, it illustrates a formal notion of dissimilarity between distributions.

Consider the setting of supervised classification, where an observation x consists of a real vector of characteristics x̃ ∈ D̃ and a one-out-of-M class label ỹ ∈ {1, 2, ..., M}, so that x = (x̃, ỹ) is contained in the observation space D, which in this case is D̃ × {1, 2, ..., M}. In statistics and econometrics, x̃ is called the independent or explanatory variable, and ỹ the dependent or response variable; in supervised machine learning, x̃ is the input and ỹ the target. A general framework to learn a classification model is to assume that an observation’s characteristics x̃ determine a categorical distribution over M classes, so that a class label ỹ is a realisation of Ỹ ∼ p0|x̃ following a categorical distribution conditional on x̃. (Note that p0(x) = p0(x̃, ỹ) and p0|x̃(ỹ) = p0(ỹ | x̃).) The categorical distribution simply assigns some probability to each class, given its parameters, and assumes no further structure on these probabilities. Since the probabilities must sum to 1, there are M − 1 free distributional parameters. Assuming that x̃ determines these parameters, let △̃ be the set of all M-dimensional probability vectors, and let f : D̃ → △̃ represent the probabilities of some categorical distribution conditional on x̃. In other words, coordinate ỹ of f(x̃) is the probability mass p(ỹ | x̃). Noting that f(x̃) is equivalent to p|x̃, this thesis will freely move between a categorical distribution and its representation as a probability vector. Since the goal of supervised classification is inference on the conditional distribution p|x̃ rather than on the full distribution p, categorical maximum likelihood is typically performed using the conditional densities. Under the conditional distribution determined by the probabilities f(x̃), the conditional likelihood (mass) p(ỹ | x̃) of the observation x = (x̃, ỹ) is f_ỹ(x̃), or simply the probability the distribution assigns to the true category. Maximum likelihood may now be performed by solving

$$\underset{f \in \tilde{\mathcal{H}}_{\tilde\triangle}}{\arg\max}\left\{\frac{1}{N}\sum_{n=1}^{N}\log f_{\tilde y_n}(\tilde x_n)\right\} \quad\text{where } \tilde{\mathcal{H}}_{\tilde\triangle} := \left\{f : \tilde D \to \tilde\triangle\right\}, \tag{2.2.3}$$

from which it is clear that the objective focusses on predicting the correct classes and ignores model probability mass on incorrect classes. In this general setting, existence and uniqueness of a solution is not guaranteed at all. Note that true class membership is unambiguous, but predictions may be uncertain².

Categorical maximum likelihood has another interesting link to information theory. Given one observation’s characteristics x̃n, take f(x̃n) ∈ △̃ and p0|x̃n ∈ △̃ to represent, respectively, an arbitrary probability vector determined by f and the true probability vector determined by the unknown p0. For many classification problems it is natural to

² This approach to classification is not obvious. Many practical classification problems, e.g. classifying handwritten digits, are non-probabilistic, in the sense that each observation has one unambiguous class. The solution to the classification problem could be considered satisfactory only when a model is found that assigns the correct class to each observation without any uncertainty. It is however ubiquitous in machine learning to predict class with a categorical distribution and, whenever class assignment must be done, to choose the highest predicted probability. It is not clear whether this is different from searching for all-or-nothing predictors, or simply a generalisation.

assume that an observation’s characteristics x̃n fully determine its class ỹn. Should this assumption be accepted, then p0|x̃n should only map to one-hot vectors, namely, to probability vectors with a 1 for the true category coordinate and 0s elsewhere. For instance,

$$x_n \text{ belongs to class 3 out of 4} \iff \tilde y_n = 3 \iff p_{0|\tilde x_n} = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \end{pmatrix}.$$

Hence, while p0 as a joint distribution over (x̃n, ỹn) = xn is unknown, the true conditional distributions p0|x̃n are fully known for each n ∈ {1, 2, ..., N}, as determined by the one-hot probability vectors p0|x̃n. This is not true in general, since often the conditional distribution of the target is not known, but rather just a single realisation of that distribution; the difference here is the assumption that there is no randomness in the target conditional on the observation. Let the mass p0|x̃n(ỹ) = p0(ỹ | x̃n) be the coordinate of p0|x̃n corresponding to class ỹ and note that it is 1 for the true class and 0 otherwise. Under the assumption of complete determination, the full conditional distribution p0|x̃n may be used to rewrite categorical maximum likelihood (2.2.3) as

$$\begin{aligned}
\underset{f \in \tilde{\mathcal{H}}_{\tilde\triangle}}{\arg\max}\left\{\frac{1}{N}\sum_{n=1}^{N}\log f_{\tilde y_n}(\tilde x_n)\right\}
&= \underset{f \in \tilde{\mathcal{H}}_{\tilde\triangle}}{\arg\min}\left\{\frac{1}{N}\sum_{n=1}^{N}\sum_{\tilde y=1}^{M} -p_{0|\tilde x_n}(\tilde y)\,\log f_{\tilde y}(\tilde x_n)\right\} && \text{since } p_{0|\tilde x_n} \text{ is one-hot for } \tilde y_n \\
&= \underset{p_{|\tilde x_n} \in \tilde{\mathcal{H}}_{\tilde\triangle}}{\arg\min}\left\{\frac{1}{N}\sum_{n=1}^{N}\,\mathbb{E}_{\tilde Y \sim p_{0|\tilde x_n}}\left[-\log p_{|\tilde x_n}(\tilde Y)\right]\right\} && \text{noting } f(\tilde x_n) \text{ is equivalent to } p_{|\tilde x_n} \\
&= \underset{p_{|\tilde x_n} \in \tilde{\mathcal{H}}_{\tilde\triangle}}{\arg\min}\left\{\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{\tilde Y \sim p_{0|\tilde x_n}}\left[-\log p_{|\tilde x_n}(\tilde Y)\right] - \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{\tilde Y \sim p_{0|\tilde x_n}}\left[-\log p_{0|\tilde x_n}(\tilde Y)\right]\right\} \\
&= \underset{p_{|\tilde x_n} \in \tilde{\mathcal{H}}_{\tilde\triangle}}{\arg\min}\left\{\frac{1}{N}\sum_{n=1}^{N}\mathrm{KL}\left(p_{0|\tilde x_n},\, p_{|\tilde x_n}\right)\right\}.
\end{aligned}$$

Hence, categorical maximum likelihood with unambiguous classes aims to minimise the average KL divergence from the model’s conditional categorical distributions p|x̃n to the true conditional one-hot distributions p0|x̃n. This may be interpreted as bringing the model’s predictions as close to the truth as possible, as measured by Kullback-Leibler divergence. Note that this formulation of categorical maximum likelihood in terms of KL divergence is exact rather than asymptotic, in contrast to the earlier observation that maximum likelihood in general only asymptotically minimises KL divergence from the model’s joint distribution over the observations to the truth.
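The reduction above can be checked numerically. In this illustrative snippet (not from the thesis; the probabilities are invented), the per-observation KL divergence to a one-hot truth equals the negative conditional log-likelihood of the true class.

```python
import numpy as np

def kl_onehot(onehot, f):
    """KL divergence from impression f to a one-hot truth; only the true
    coordinate contributes, since 0 * log(0 / q) is treated as 0."""
    mask = onehot > 0
    return float(np.sum(onehot[mask] * np.log(onehot[mask] / f[mask])))

f = np.array([0.1, 0.2, 0.6, 0.1])   # a model's conditional probabilities
y = 2                                 # the unambiguous true class (0-indexed)
onehot = np.eye(4)[y]

# Per-observation KL to the one-hot truth equals -log of the probability
# the model assigns to the true class, as in the derivation above.
assert np.isclose(kl_onehot(onehot, f), -np.log(f[y]))
```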

Supervised classification uses existence and knowledge of the truth. The goal of this thesis is to perform classification without such knowledge.

2.3 Learning to distinguish

A key observation of the previous section is that categorical maximum likelihood may be viewed geometrically, using probability vectors with a distance-like concept. Categorical distributions are easy to work with because of their vector representation, especially so for a finite number of classes. The distance-like concept in the previous section is KL divergence (2.2.2), which is not particularly satisfying for arbitrary pairs of distributions p, q ∈ P because of the asymmetry KL(p, q) ≠ KL(q, p). This section argues for a symmetrical distance-like concept on probability vectors in order to increase the distance between two observations’ perceived classes, which will be called the model’s latent impressions. The term “latent” here refers to being model-specific and not externally imposed: the model is ‘free to form its own subjective representations’. Learning consists of comparing two latent impressions, circumventing the need to compare an impression to some truth such as p0.

In order to emphasise the difference between truth and impression, if it is assumed that there exists a true class that an observation belongs to, this will be called its (true) class, to conform with the general setting of supervised classification. The number of classes is then M . If however the latent impression is referenced, a coordinate will be called a (latent) category, and the number of categories will be L ∈N.

The rest of this thesis will be chiefly concerned with categories in impressions. Categorical distributions will be defined by their probability vectors in $\triangle := \left\{z \in [0, 1]^L : \sum_{\ell=1}^{L} z_\ell = 1\right\}$. A probability vector z ∈ △ which is a model output will be referred to as an impression³. A category is a coordinate in △; hence an impression is any distribution over categories. An impression which assigns all probability to one category is a one-hot vector.

Consider an arbitrary ‘notion of dissimilarity’ δ : △ × △ → ℝ satisfying, for z1, z2 ∈ △,

· non-negativity: δ(z1, z2) ≥ 0,
· symmetry: δ(z1, z2) = δ(z2, z1),
· identification: δ(z1, z2) = 0 ⇔ z1 = z2,

and also exhibiting

· perfect distinction⁴: δ(z̄1, z̄2) is a global maximum for two unidentical one-hot vectors z̄1, z̄2,

representing some way of comparing categorical distributions. The first three properties characterise a semi-metric. They seem necessary for any sort of comparison between arbitrary pairs. The property of perfect distinction states that the strongest dissimilarity is between two certain impressions about two different categories. Note that it implicitly requires δ to be bounded. This property formalises what is meant by “distinguishing”. To see that a property like this is required, consider that for any δ satisfying the first three properties and an arbitrary bijection B : 4 → 4, the composition δ ◦ B also satisfies the first three properties. This shows that the semimetric properties do not impose a basis for comparison of impressions.

It is not clear whether it is useful or necessary to require the triangle inequality, which would imply that δ is a metric. Even when using a metric, perfect distinction is still required, since the same bijection argument applies.

Recall that an observation x ∈ D is a realisation of a random variable X with codomain D and distribution p0. Let H△ ⊆ {D → △}, hence a subset of all functions which map an observation to a probability vector over L latent categories. The set of functions H△ contains different ways of forming impressions from observations. Learning to distinguish proposes to solve

$$\underset{f \in \mathcal{H}_\triangle}{\arg\max}\; \underset{\substack{X_1 \sim p_0 \\ X_2 \sim p_0}}{\mathbb{E}}\Big[\delta\big(f(X_1),\, f(X_2)\big)\Big], \tag{2.3.1}$$

hence to find a way of representing observations which maximises the latent dissimilarity of arbitrary pairs. Intuitively, a solution to (2.3.1) is a function that assigns different one-hot vectors to observations, so iterating through H△ while increasing the objective function should result in a sequence of estimates which are progressively better categorisers. Note that in this sense the objective resembles categorical maximum likelihood, where the target class is not specified but emerges spontaneously. The hope is that a proper choice of H△ will induce an optimum corresponding to ‘obvious’ differences in D without introducing bias. Note that f is learned only through its image △, not using any information from D. This is a particularly unique approach, as most other unsupervised methods at least use a metric or other similarity assessment on D. That approach is considered too explicit for this thesis, as such information is not available to living creatures as they learn to recognise, so it should not be required. It is considered too explicit because it already defines what is meant by ‘similarity’ or ‘identity’. The argument of this thesis is that a learner must first autonomously learn to categorise before he or she could start to make sense of any such prior knowledge. Further considering prior knowledge, the true distribution p0 of an observation is used in the expectation, but in practice this will be approximated with a sample average, which does not disclose any knowledge about p0 other than what is obtained from exposure to a representative sample; in another word, observation⁵.

³ This is an ad hoc definition not found in the literature.
⁴ This too is an ad hoc definition not found in the literature.

Without restrictions on H△, a solution to (2.3.1) is determined only by the probability of matching distinct pairs of observations and by the dissimilarity measure δ. If for instance δ is the maximum absolute coordinate-wise difference

$$\delta : \triangle \times \triangle \to [0, 1], \qquad (z_1, z_2) \mapsto \max_{\ell \in \{1, 2, \dots, L\}} \left\{|z_{2,\ell} - z_{1,\ell}|\right\},$$

then an unrestricted optimum will map observations to as many different one-hot vectors as possible, favouring a mapping with the highest probability of matching dissimilar impressions. For a solution f* in such an unrestricted H△, similarities in the observation space D play no role, and the latent representations f*(x) for x ∈ D will be one-hot vectors that correspond to subsets of D of as equal probability measure as possible. This example illustrates that also when δ is a more elaborate structure on △, an optimum in unrestricted H△ will still not take any structure on D into account, because the function class is unboundedly flexible. The linkage of latent dissimilarity with the probability of pairwise matching is left for future research. Under the hypothesis that observations are distinguishable into different classes, it seems natural to limit the capability of a learning system, to force it to detect patterns that are representative of the classes rather than remembering each single instance, since pattern detection enables generalisability. In section 2.5 it will be clarified that function classes of neural networks have multiple intuitive options for limiting flexibility, where flexibility is informally understood to mean the ability to distinguish many different categories. This makes neural network estimation a suitable approximation to solving (2.3.1).
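The sample analogue of objective (2.3.1) under this maximum-absolute-difference δ can be sketched as follows (an illustration, not the thesis’s implementation; the toy impressions are invented for the example):

```python
import numpy as np
from itertools import combinations

def delta_max(z1, z2):
    """Maximum absolute coordinate-wise difference between two impressions."""
    return float(np.max(np.abs(z1 - z2)))

def distinguish_objective(impressions):
    """Sample analogue of (2.3.1): average dissimilarity over all pairs."""
    pairs = list(combinations(impressions, 2))
    return sum(delta_max(a, b) for a, b in pairs) / len(pairs)

def near_uniform(i, bump=0.01):
    """An impression close to uniform, slightly favouring category i mod 3."""
    v = np.full(3, 1 / 3)
    v[i % 3] += bump
    return v / v.sum()

# A map sending observations to well-separated one-hot impressions scores
# higher than one sending everything close to the uniform impression.
onehots = [np.eye(3)[i % 3] for i in range(6)]
blurry = [near_uniform(i) for i in range(6)]
assert distinguish_objective(onehots) > distinguish_objective(blurry)
```

The objective rewards spreading impressions apart; which observations end up sharing a category is determined entirely by the function class and the pairing probabilities, exactly as argued above.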

2.4 Probability metrics: the choice of δ

A strong notion of dissimilarity is defined by a metric, which satisfies the first three properties required in section 2.3 as well as the triangle inequality. This facilitates a definition of convergence on a distribution space. It is not clear whether the triangle inequality helps in the objective of learning to distinguish (2.3.1), but there is another reason why using a metric for δ is convenient. If (△, δ) is a metric space, then it admits a definition of Lipschitz continuity, which is an intuitive and explainable way to limit the flexibility of H△. Hence, for the purpose of this thesis, δ will be restricted to metrics.

A common type of metric is given by the L^p norm z ↦ ‖z‖_{L^p} where p ∈ ℕ (the symbol p replacing the distribution p only for this paragraph). For finite spaces with L dimensions this norm is given by

$$\|z\|_{\mathcal{L}^p} \equiv \sqrt[p]{\sum_{\ell=1}^{L} |z_\ell|^p}.$$

A norm z ↦ ‖z‖ yields a metric by way of ‖z1 − z2‖; call the metric induced by the L^p norm d_Lp. Note that d_Lp has perfect distinction on △. The metrics d_Lp are ubiquitous in mathematics and intuitive; for instance, the L^2 norm yields the Euclidean metric.
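A quick numerical check (illustrative, not from the thesis) that d_L2 attains its maximum over △ at unequal one-hot vectors, consistent with perfect distinction:

```python
import numpy as np

def d_lp(z1, z2, p=2):
    """Metric induced by the L^p norm: ||z1 - z2||_p."""
    return float(np.sum(np.abs(z1 - z2) ** p) ** (1.0 / p))

# Distance between two unequal one-hot vectors: sqrt(2) for p = 2.
one_hot_max = d_lp(np.eye(3)[0], np.eye(3)[1])

# Random points on the simplex never exceed that distance, consistent
# with perfect distinction of d_L2 on the simplex.
rng = np.random.default_rng(3)
for _ in range(1000):
    z1, z2 = rng.dirichlet(np.ones(3), size=2)
    assert d_lp(z1, z2) <= one_hot_max + 1e-12
```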

⁵ In typical sets of physical data, the number of sample points N will be too low to create an empirical distribution which approximates p0 closely on all of D. This is especially the case for high-dimensional D, and also for the case that p0 is 0 or almost 0 on large parts of D.

In general the d_Lp metrics are not particularly suitable for a space of probability distributions like △. An interesting candidate instead is the Jensen-Shannon distance d_JS, a statistical and information-theoretic metric. It is based on the Kullback-Leibler divergence

$$\mathrm{KL}(p, q) \equiv \mathbb{E}_{X \sim p}\left[\log \frac{p(X)}{q(X)}\right] \quad\text{for } p, q \in \mathcal{P} \tag{2.4.1}$$

and on the Jensen-Shannon divergence of two distributions, given by

$$\mathrm{JSD}(p, q) \equiv \frac{1}{2}\,\mathrm{KL}\left(p, \frac{p + q}{2}\right) + \frac{1}{2}\,\mathrm{KL}\left(q, \frac{p + q}{2}\right), \tag{2.4.2}$$

a symmetrised Kullback-Leibler divergence. The Jensen-Shannon metric d_JS is given by

$$d_{\mathrm{JS}}(p, q) \equiv \sqrt{\mathrm{JSD}(p, q)}. \tag{2.4.3}$$

Proof that d_JS is a metric is given by Endres and Schindelin (2003). General properties of the Jensen-Shannon divergence are given by Lin (1991). Fuglede and Topsøe (2004) show that the metric space (P, d_JS) can be isometrically embedded into a real Hilbert space, suggesting it is possible to define a norm on P which induces d_JS.

Other common statistical metrics include

· Hellinger distance,
· total variation distance,
· Wasserstein or earth-mover’s distance, which requires defining another metric on the corresponding space,
· the Ky Fan metric, and
· the Fisher information metric on a statistical manifold.

More information on this topic may be found e.g. in Rachev et al. (2013) and Gibbs and Su (2002). However, to facilitate other analyses and experiments, it was decided to restrict this thesis to the Jensen-Shannon distance. The choice for d_JS was motivated by its link to Kullback-Leibler divergence and its initial satisfactory performance for learning to distinguish; see chapter 3. The Jensen-Shannon metric d_JS has perfect distinction and hence meets the requirements of section 2.3. The rest of this section illustrates some of its properties.

For two impressions z1, z2 ∈ △ the integrals in the Kullback-Leibler divergence reduce to finite sums, so

$$d_{\mathrm{JS}}(z_1, z_2) = \sqrt{\frac{1}{2}\sum_{\ell=1}^{L} -z_{1,\ell}\log\left(\frac{z_{1,\ell} + z_{2,\ell}}{2 z_{1,\ell}}\right) + \frac{1}{2}\sum_{\ell=1}^{L} -z_{2,\ell}\log\left(\frac{z_{1,\ell} + z_{2,\ell}}{2 z_{2,\ell}}\right)}$$

or equivalently

$$d_{\mathrm{JS}}(z_1, z_2) = \sqrt{\frac{1}{2}\sum_{\ell=1}^{L} z_{1,\ell}\log\left(\frac{2 z_{1,\ell}}{z_{1,\ell} + z_{2,\ell}}\right) + \frac{1}{2}\sum_{\ell=1}^{L} z_{2,\ell}\log\left(\frac{2 z_{2,\ell}}{z_{1,\ell} + z_{2,\ell}}\right)}.$$

The second formulation shows that d_JS is always well-defined for z1, z2 ∈ △, contrary to KL divergence. The KL divergence KL(p, q) is not defined on a set of q-measure 0 but positive p-measure; see (2.4.1). If for instance

$$z_1 = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \quad\text{and}\quad z_2 = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \quad\text{then}\quad \mathrm{KL}(z_1, z_2) = \sum_{\ell=1}^{2} z_{1,\ell}\log\frac{z_{1,\ell}}{z_{2,\ell}} = 0 \cdot \log\frac{0}{1} + 1 \cdot \log\frac{1}{0}$$

and while the first term may be set to 0, as is convention in this application, the second term is not well-defined. In contrast, the Jensen-Shannon divergence is well-defined for all cases where either distribution is 0, and in this case

$$d_{\mathrm{JS}}(z_1, z_2) = \sqrt{\frac{1}{2}\left(0 \cdot \log\frac{0}{1} + 1 \cdot \log\frac{2}{1}\right) + \frac{1}{2}\left(1 \cdot \log\frac{2}{1} + 0 \cdot \log\frac{0}{1}\right)} = \sqrt{\log 2},$$

which evaluates to 1 for logarithm base 2.
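The computations above translate directly into code. The following sketch (an illustration, not the thesis’s implementation) computes d_JS with base-2 logarithms, treating 0 · log 0 as 0, and verifies unit distance for unequal one-hot vectors, identification, symmetry, and boundedness.

```python
import numpy as np

def d_js(z1, z2, base=2.0):
    """Jensen-Shannon distance between two probability vectors,
    with 0 * log 0 treated as 0 and logarithms in the given base."""
    m = (z1 + z2) / 2

    def kl(p, q):
        mask = p > 0                      # drop 0 * log 0 terms
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])
                            / np.log(base)))

    return float(np.sqrt(0.5 * kl(z1, m) + 0.5 * kl(z2, m)))

e1, e2 = np.eye(2)
assert np.isclose(d_js(e1, e2), 1.0)      # unequal one-hots: unit distance
assert d_js(e1, e1) == 0.0                # identification
z = np.array([0.3, 0.7])
w = np.array([0.6, 0.4])
assert np.isclose(d_js(z, w), d_js(w, z)) # symmetry
assert d_js(z, w) < 1.0                   # bounded by the one-hot maximum
```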

Generalising the above example to unequal one-hot vectors of arbitrary dimensionality L also produces unit Jensen-Shannon distance. Remark (III) in section 4.2 provides an intuitive argument why this is a global maximum of d_JS: updating any unequal z1 and z2 with the gradient of d_JS(z1, z2) always drives z1 and z2 towards unequal one-hot vectors. Since any pair of unequal one-hot vectors produces unit Jensen-Shannon distance, this is a global maximum. Hence, d_JS has perfect distinction and is bounded.

The following is a summary of the properties of d_JS relevant to this thesis.

· The property of perfect distinction in section 2.3 is satisfied.
· For all distributions, d_JS is well-defined and bounded.
· d_JS is differentiable with respect to each coordinate of each categorical distribution, which is required for gradient-based algorithms.
· d_JS is hence also continuous in each coordinate of each distribution.
· d_JS is a metric and hence allows for Lipschitz analysis and a definition of convergence on △.
· Convergence in d_JS implies convergence in total variation (see appendix A.4, and in particular the application of Pinsker’s theorem in the right-most inequality).
· d_JS is closely related to Kullback-Leibler divergence.
· d_JS has an information-theoretic interpretation as the mutual information with a class-selecting random variable.
· The Hilbert space embedding may facilitate the definition of a norm inducing d_JS.

Figure 2.1 and figure 2.2 visualise the difference between supervised categorical likelihood and learning to distinguish. The former requires a class label ỹ to fit an impression to under Kullback-Leibler divergence, and the latter compares two impressions under Jensen-Shannon distance instead.

[Figure 2.1: Categorical maximum likelihood. Figure 2.2: Learning to distinguish with d_JS.]

2.5 Function spaces of neural networks: the choice of H△

The objective of learning to distinguish (2.3.1) is meant to be used for complex data, which is understood to elude explicit mathematical description. Examples include images, language, and speech. It therefore seems natural for the framework to choose rigid mathematical concepts to control general behaviour, but a heuristic tool to deal with the complexity of unknown relationships in the data. Such a tool should be able to adapt to a variety of data structures with little or no premonition. In particular, it should allow for different notions of similarity in the observation space, and for different functional forms mapping observations to impressions. In the rest of this chapter it will become clear that neural networks meet these requirements. In chapter 3, an application to image data indicates that an implementation with a neural network is indeed able to perform sensible innate categorisation, where latent impressions agree with true classes in a labelled dataset.

A model, or function space, in which to search for a solution to (2.3.1) should be both flexible and limited, so as to allow for agnostic estimation while avoiding overfitting: presenting a solution that does not generalise beyond the sample. Neural networks provide a practically implementable solution which meets both requirements in a controllable, albeit heuristic, manner. The functional form of a neural network is determined before estimation, but the model set is parameterised by a high-dimensional parameter vector. As will be shown, this approach allows both for theoretically unbounded flexibility and for practical estimation methods through parameter search. This section introduces the multilayer perceptron, a basic neural network variant which enjoys these properties.

A multilayer perceptron MLP^J_ϑ is characterised by a real-valued parameter vector ϑ and a so-called hyperparameter J ∈ N which relates to the model's flexibility. For a fixed value of J, MLP^J_ϑ is a set of functions D → ]0, 1[ indexed by ϑ, which is the form in which the model is estimated in most applications. This requires specifying J before estimation, hence it is called a "hyperparameter". With J and ϑ variable, MLP^J_ϑ describes a family of models which has been shown to approximate any measurable function D → ]0, 1[ arbitrarily closely in supremum norm (Cybenko, 1989) or in Ky Fan metric (Hornik, Stinchcombe, and White, 1989), a property called universal approximation. These proofs show that the class of multilayer perceptrons (a basic class of neural networks) is dense in the space of continuous, respectively Borel-measurable, functions with respect to the aforementioned metrics, using the Hahn-Banach or Stone-Weierstrass theorem respectively. Universal approximation results also exist for a neural network that models the full (empirical) distribution on discrete D, an unsupervised neural network called the restricted Boltzmann machine (Hinton, 2002). For these networks, a proof for binary observation vectors is given by Sutskever and Hinton (2008), and an extension to arbitrary discrete probability vectors, with approximation bounds, by Montúfar (2014). In general it is not clear how to construct such networks, but more explicit analysis is available, for instance with Fourier analysis of the variability of the functions to be approximated in relation to network flexibility (Barron, 1993) and follow-up research (Lee et al., 2017). However, in general it is not clear how to choose J or ϑ in order to achieve a desired approximation, e.g. under an optimisation objective. This task remains for the practitioner and is typically dealt with by intuition and trial and error. The many successes of applied neural networks indicate that this is a rather fruitful approach, however, so the method should not be dismissed solely for not being grounded in a complete protocol. Often, neural networks are employed for tasks which are classically perceived to be hard, so that adequate performance on a holdout sample provides a retrospective justification of the hyperparameter choice and the parameter estimates.

Figure 2.3: The logistic sigmoid σ : R → ]0, 1[

Call s the logistic sigmoid for y ∈ R, shown in figure 2.3 and defined by

s(y) := 1 / (1 + exp(−y)),

and for y ∈ R^K agree that s is applied coordinate-wise,

s(y) = ( 1/(1 + exp(−y_1)), 1/(1 + exp(−y_2)), …, 1/(1 + exp(−y_K)) )^⊤.

The functional form of a multilayer perceptron D ⊆ R^K → ]0, 1[ is given by

(2.5.1)    MLP^J_ϑ(x) := s( Σ_{j=1}^{J} ϑ^{W2}_j · s( Σ_{k=1}^{K} ϑ^{W1}_{k,j} x_k + ϑ^{b1}_j ) + ϑ^{b2} ) = s( (ϑ^{W2})^⊤ s( (ϑ^{W1})^⊤ x + ϑ^{b1} ) + ϑ^{b2} ),
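The functional form above can be evaluated directly. The sketch below is a minimal illustration written for this text, not the thesis implementation; the variable names, the dimensions K and J, and the randomly drawn parameter values are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(y):
    """Logistic sigmoid s, applied coordinate-wise."""
    return 1.0 / (1.0 + np.exp(-y))

def mlp(x, W1, b1, W2, b2):
    """Multilayer perceptron in the matrix form of (2.5.1):
    s(W2^T s(W1^T x + b1) + b2), mapping R^K into ]0, 1[."""
    hidden = sigmoid(W1.T @ x + b1)       # J hidden units
    return sigmoid(W2.T @ hidden + b2)    # output in ]0, 1[

K, J = 4, 3                               # input dimension and hyperparameter J
W1 = rng.standard_normal((K, J))          # plays the role of theta^{W1}
b1 = rng.standard_normal(J)               # theta^{b1}
W2 = rng.standard_normal((J, 1))          # theta^{W2}
b2 = rng.standard_normal(1)               # theta^{b2}

x = rng.standard_normal(K)                # one observation in R^K
out = mlp(x, W1, b1, W2, b2)
```

The output necessarily lies in the open interval ]0, 1[, since the outer sigmoid never attains its bounds.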
