Interpreting Decision-Making in Interactive Visual Dialogue


MSc Artificial Intelligence

Master Thesis


by

Ujjwal Sharma

Student-ID: 11392010

August 17, 2018

36 European Credits, January 2018 - August 2018

Supervisor: Dieuwke Hupkes

Co-Supervisor: Dr. Elia Bruni

Examiner: Dr. Raquel Fernández

Assessor: Dr. Miguel Angel Rios Gaona


Abstract

Dialogue systems that involve long-term planning can strongly benefit from a high-level notion of dialogue strategy and can avoid making poor decisions early in the game and opt for broadly successful strategies instead. A strategy-signal can additionally be used as a conditioning input on the dialogue generation mechanism allowing better training and generalization over a vastly smaller generation space.

In this work, we first analyze human gameplay strategy, using regular expression-based strategy labels, to understand the selection and time-evolution of human strategy in an object-discovery setup. Subsequently, we examine variational inference-based controlled text generation models as natural language generation models that generate language with precisely controlled semantic features, and examine their applicability to generating questions with designated semantics. Finally, we propose a strategy-conditional framework that generates interpretable conditioning signals for a multi-modal question generation mechanism in a GuessWhat?! game. We explore variants with conditioning on gold and generated strategy-conditioning labels. These signals improve the performance of the question generation module with stronger conditioning while additionally enforcing stricter compliance with human game-play strategy. The discrete nature of these labels provides an interpretable readout of the model’s current questioning strategy. As a part of this conditioning setup, we also introduce a strategy-prediction module, trained in a multi-task learning setup, that can generate these conditioning signals autonomously and ahead of time, allowing the encoder to generate conditioning signals for the decoder through an intermediate transformation to a label. We also present a detailed examination of optimization policies, strategy-predictor architectures and gradient propagation depth on the training of such models and show that while strategy conditioning improves model performance, it degrades generation ability due to weak strategy prediction accuracy.


Acknowledgements

I owe my deepest gratitude to my thesis supervisors Dieuwke Hupkes and Elia Bruni for their constant support at every step of this work. In the last six months, they have taught me the importance of experimenting with new ideas and slowly building them up to perfection all while allowing me to experiment and learn from my own mistakes.

I would also like to thank Prof. Raquel Fernández for giving me an opportunity to work in the Dialogue Modelling Group and for taking the time to review my work and provide invaluable feedback. It has been instrumental in helping me improve certain aspects of this work. I would also like to thank Tim Baumgärtner and Aashish Venkatesh whose early experiments and code in this domain helped me set up a testbed for my own experiments. I am extremely grateful for the hours of painful debugging avoided because of their work.

Last but certainly not least, I am grateful to my parents for their unconditional and unwavering support. Without their unspoken sacrifices, none of this would be possible.

Amsterdam, The Netherlands Ujjwal Sharma

Friday 17th August, 2018


List of Figures

2.1 Cyclic Diagram for the RNN Module

2.2 A recurrent network with no outputs that only incorporates inputs into the hidden state h. (Left) Recurrent diagram of the module. (Right) The same network as an unfolded computation graph where each node contains one time instance. From Goodfellow et al. [2016]

2.3 Model Schematic for Long Short-Term Memory

2.4 Sequence-to-Sequence Models. From Goodfellow et al. [2016]

2.5 Vanilla Auto-Encoders. From Goodfellow et al. [2016]

2.6 The graphical model for the generative model $p_\theta(x|z)p_\theta(z)$. The dotted lines indicate the variational approximation $q_\phi(z|x)$ of the intractable true posterior $p_\theta(z|x)$

2.7 Baseline Oracle Model. From de Vries et al. [2016]

3.1 Log Frequency vs Log Rank (Zipf) distribution for word tokens in GuessWhat?!

3.2 Distribution of Classes for Regex based classification

3.3 Evolution of Strategy over Gameplay (timesteps)

3.4 Successful Game Lengths vs Frequency of Occurrence

4.1 Modified Auto-Encoder Network with a Posterior Recognition Model. From Bowman et al. [2015]

4.2 VAE for text with latent-code attribute discriminators. From Hu et al. [2017]

4.3 Training Curves for VAE Training in Controlled Text Generation

4.4 Training Curves for Cooperative (VAE + Discriminator) Training in Controlled Text Generation

5.1 Game State Encoder: Image features I are obtained from a ResNet [He et al., 2016] network. An LSTM module is used to encode (Q/A) interactions. The encoder then performs a late-fusion over the features and squashes them using a tanh non-linearity

5.2 Baseline Question Generator: A simple RNN-LM decoder with LSTM modules. The decoder is initialized with the Game-State Encoder output $e_t$ before decoding.

5.3 Image-Conditioned Game State Encoder + Baseline Question Decoder

5.4 Strategy-Conditioned Question Decoder

5.5 Strategy-Conditioned Question Generation Model (with Baseline encoder)

5.6 Multi-Loss Training for Conditioned Question Generation with Strategy-Class Prediction

5.7 Inference Schematic for Conditioned Question Generation with Strategy-Class Prediction

E.1 The graphical model for the generative model $p_\theta(x|z)p_\theta(z)$. The dotted lines indicate the variational approximation $q_\phi(z|x)$ of the intractable true posterior $p_\theta(z|x)$

List of Tables

2.1 Baseline Oracle Accuracy Values

3.1 Statistical Information for GuessWhat?! data. From de Vries et al. [2016]

3.2 Game Parameter Examples for Clustered Encoder States with K = 3

3.3 Game Parameter Examples for Clustered Encoder States with K = 5

4.1 ELBO for variations in Word-Dropout Rate vs. Number of Epochs with logistic annealing on the test split.

4.2 Cross-Entropy Loss and KL Divergence values for variations in Word-Dropout Rate vs. Number of Epochs for VAE model.

4.3 Cherry-picked Success and Failure Cases for the VAE language model

4.4 Training Scores for Controlled Text Generation VAE module trained on the IMDB text corpus

4.5 Training Scores for Cooperative (VAE + Discriminator) trained on the SST-FULL sentiment treebank

4.6 Cherry-picked Uncontrolled, Positive and Negative sentiment examples for the VAE language model

5.1 Cross-Entropy Loss scores for conditioned encoder-decoder setup.

5.2 Evaluation metric scores for conditioned encoder-decoder setup.

5.3 Training Results for Ensemble Models

5.4 Entropy Score and Conditioning Accuracy for Ensemble Models (on Validation split)

5.5 Precision, Recall and F1 for Baseline and Ensemble Models for the Spatial strategy class (on Validation split)

5.6 Precision, Recall and F1 for Baseline and Ensemble Models for the Interaction strategy class (on Validation split)

A.1 Distribution of Classes for Regex based classification

A.2 Evolution of Strategy over Gameplay (time)

A.3 Game Lengths vs Frequency of Occurrence

B.1 Training Results for Ensemble Models

B.2 Detailed Precision Scores for Ensemble Models

B.3 Detailed Recall Scores for Ensemble Models

B.4 Detailed F1 Scores for Ensemble Models

C.1 Natural Language Generation for RNN-VAE model by Bowman et al. [2015] for varied word-dropout rates, logistic KL annealing and sampled from a model trained for 6 epochs.

Chapter 1 Introduction

Recent advancements in deep-learning research have brought a host of strong improvements to computer-vision tasks like image classification [Simonyan and Zisserman, 2014, He et al., 2016] and object detection [Ren et al., 2015, Redmon et al., 2015]. Similar advancements have been visible in natural language tasks like machine translation [Vaswani et al., 2017] and language modeling [Yang et al., 2017]. With a strong push towards increasingly larger datasets, performance is poised to improve dramatically. These tasks, however, have a narrow scope and do not require a high-level understanding of images or text.

Visual reasoning in humans is based on a complex understanding of the visual features of the scene, the spatial positions of objects and their interactions with each other. Similarly, textual reasoning requires a context grounded in some modality to ascertain the meaning of a sentence. Tasks from the computer vision community like visual question answering [Kim et al., 2016, Zhu et al., 2015, Shih et al., 2015, Mostafazadeh et al., 2017, Lu et al., 2017] and visually-grounded dialogue [de Vries et al., 2016, Strub et al., 2017] are an attempt to move from simple detection tasks towards tasks that involve a detailed understanding of the scene.

The visually-grounded dialogue task models a scene-understanding setup with dialogue as the principal method of discovery. This task encompasses multiple sub-problems from the computer vision community like object detection, scene classification and disambiguation between multiple occurrences, in addition to tasks from the natural language community like language generation, co-reference resolution and summarization. Visual grounding localizes the context of a dialogue with respect to an image so that its meaning can be determined, and strongly resembles human interactions, which are grounded in a common visual context.

This task is hard because a dialogue manager must be able to extract the relevant concepts from a dialogue and perform a search over the visual features. In addition, there is no well-defined evaluation metric for dialogue and most evaluation is performed subjectively. To mitigate this issue, the task-oriented dialogue framework was introduced, which uses task success as an automatic metric [Shah et al., 2016, Liu et al., 2018]. In this work, we principally work with a task-oriented dialogue dataset called GuessWhat?! [de Vries et al., 2016].

The GuessWhat?! dataset is modeled as an object-discovery task in a visually-grounded dialogue setup. One of the agents is designated the oracle and assigned a visual object class. The other agent, referred to as the questioner, asks a series of yes-no questions to ascertain the correct object class. In this respect, the question-generation module must act as a dialogue manager as well as a natural language generator. In this thesis, we principally investigate the nature of questioning in GuessWhat?! and how a structure can be enforced over the questioning policy. To understand the internal decision-making in GuessWhat?!, we rely on introspection methods.

The ability to introspect neural networks to expose their internal context is powerful for systems that engage in long-term planning. There have been efforts in the past to build introspection-based models for visual question answering [Czuba et al., 2002] that can reevaluate a situation at a previous time-step. The principal motivation behind this work is to introduce a similar ability for introspection into the question-generation mechanism in GuessWhat?!, specifically with respect to its internal dialogue management decisions. Our aim is to allow introspection of the strategy before it is used to generate questions. The mechanism to introspect internal representations is inspired by the work of Hupkes et al. [2018]. While they propose decoding intermediate representations for an arithmetic language, we use a distinct setup inspired by multi-task learning [Caruana, 1993] to train an intermediate representation on discrete interpretable labels.

This work begins with a statistical examination of the structure of GuessWhat?! questions and their intended discovery strategy to better understand how questioning evolves over time. We then introduce a strategy labelling mechanism to label the GuessWhat?! game-play with the intended strategy at each point in time. To provide strategy transparency, we project the intended strategy to interpretable labels and use this to further constrain the questioning mechanism. We also examine a class of models trained using the Auto-Encoding Variational Bayes framework for diverse question generation and examine its suitability with respect to GuessWhat?!.

1.1 Objectives

The principal objectives of this work are:

1. Examine the nature of questioning in GuessWhat?! and the evolution of game-play with time.

2. Create a simple and extensible method to generate strategy labels for questions. Additionally, we aim to investigate unsupervised approaches to label sentences based on their intended strategy.

3. Create a prediction-framework to introspect the game-play strategy from the encoded game-representation.

4. Examine approaches to generate questions with externally controlled semantics.

1.2 Outline


1. Chapter 2 contains a detailed introduction to the basic building blocks and techniques used in this work. It also introduces the relevant literature necessary to review this work.

2. Chapter 3 contains a detailed analysis of GuessWhat?! questions and the evolution of questioning-strategy and game-success with time. We also detail a hand-crafted method to create strategy labels for questions in GuessWhat?! using aggregated term-frequency statistics.

3. Chapter 4 contains a detailed analysis of natural language generation models trained using variational inference. We also examine their suitability as a question-generation mechanism for GuessWhat?!.

4. Chapter 5 details experiments with strategy-conditioning for Sequence-to-Sequence question-generation models. We examine the performance of multiple variants and analyze their performance with respect to GuessWhat?! baselines.

In addition to this, we also include some important proofs and results in the Appendix.

1. Appendix A contains raw data for plots in Chapter 3.

2. Appendix B contains detailed training and evaluation scores for models in Chapter 5.

3. Appendix C contains generation examples from all tested models in Chapter 4.

4. Appendix D contains a formal proof of the Back-Propagation-Through-Time algorithm.

5. Appendix E contains a formal proof for the Stochastic Gradient Variational Bayes estimator.

Chapter 2 Background

In this chapter, we provide a technical overview of fundamental concepts that form the basis for this work. In the coming sections, we introduce the Recurrent Neural Network [Rumelhart et al., 1986] and detail fundamental training issues with vanilla RNN modules that led to the rise of gated recurrent models. We also introduce the Long Short-Term Memory [Hochreiter and Schmidhuber, 1997], a gated building unit for RNN layers that serves as the principal recurrent module in this work. Additionally, we lay out a formal mathematical basis for training the models in Chapter 4 using the Auto-Encoding Variational Bayes [Kingma and Welling, 2013] framework. Finally, we introduce the GuessWhat?! dataset [de Vries et al., 2016] and detail the training of baseline modules that are instrumental in setting up a GuessWhat?! inference pipeline.

2.1 Recurrent Neural Networks and LSTM Modules

In order to lay the groundwork for Recurrent Neural Networks, it is essential to understand how simpler feed-forward networks work and where they fall short.

A feed-forward neural network is a neural network where information (or signals) flows in one direction only. The perceptron introduced by Rosenblatt [1958] is one of the simplest implementations of the feed-forward neural network and can learn through the weighting of its inputs. Formally, a feed-forward neural network is a computational graph with nodes as computational elements and edges as connections that transmit numerical information. The nodes are capable of performing primitive operations (activation) on the input information received from an incoming edge. The entire network behaves as a chain of multiple function compositions on the original input.

In order to understand the issues with modeling sequences with temporal properties using feed-forward networks, consider an application that needs to predict an output sequence $y = \{y_1, y_2, \dots, y_n\}$ for the input sequence $x = \{x_1, x_2, \dots, x_m\}$. The first problem with using a feed-forward network arises from the fact that the input and output sequences may have different lengths for different input and output pairs. This is problematic for a feed-forward network, which can only map a fixed-size input vector to a fixed-size output vector. This can be mitigated to some extent by setting the input/output size to the maximum sequence length and using padding tokens. The bigger issue, however, is that $x_1$ need not play any role in generating $y_1$. As an example, machine translation often has to take into account differences between the word orders of different languages. To work with sequences where previous inputs can influence future outputs, it is essential to maintain a temporal context that stores previous parts of the sequence that may be required in the future.
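The padding workaround mentioned above can be sketched as follows. This is only a minimal illustration: the `PAD` token and the `pad_to_length` helper are hypothetical names chosen for this example and are not taken from the thesis code.

```python
PAD = "<pad>"  # hypothetical padding token, assumed for illustration


def pad_to_length(tokens, max_len, pad=PAD):
    """Right-pad (or clip) a token list so that it has exactly max_len items."""
    clipped = list(tokens[:max_len])
    return clipped + [pad] * (max_len - len(clipped))


# Two questions of different lengths are mapped to one fixed size.
batch = [["is", "it", "a", "person", "?"], ["is", "it", "red", "?"]]
max_len = max(len(seq) for seq in batch)
padded = [pad_to_length(seq, max_len) for seq in batch]
```

Padding fixes the size mismatch but not the alignment problem: the network still has no mechanism to relate an output position to earlier input positions.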

A Recurrent Neural Network (RNN) [Rumelhart et al., 1986] is an artificial neural network with cyclic connections that allow it to exhibit temporal learning for sequences. The core strength of the RNN module lies in its sharing of parameters across computational time-steps. This allows better generalization across time. A traditional feed-forward network will require a separate set of parameters for each input feature which leads to poor generalization and an inability to exhibit temporal invariance. 1

An RNN, on the other hand, uses an internal hidden state to process sequences. The evolving hidden state vector encodes temporal properties of the sequence and is shared between time steps. In a conventional RNN, all the elements of the sequence essentially undergo the same operation parameterized by a shared set of parameters. The difference between outputs comes from the evolving hidden state.

Figure 2.1: Cyclic Diagram for the RNN Module (nodes $x_t$, $h_{t-1}$, $h_t$ and $o_t$; weights $W_{hx}$, $W_{hh}$ and $W_{oh}$)

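The RNN transition equations can be written as:

$$
\begin{aligned}
h_t &= W_{hh} f(h_{t-1}) + W_{hx} x_t \\
o_t &= W_{oh} f(h_t)
\end{aligned}
\tag{2.1}
$$

where $x_t$ and $o_t$ are the RNN input and output at time $t$, $h_t$ is the state of the hidden layers at time $t$ and $f$ is the activation function.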
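A minimal sketch of this recurrence in plain Python is given below. It assumes the convention in which the hidden state at step t consumes the current input x_t and the output o_t is read from the current hidden state h_t, which matches Figure 2.1 and the gradient expressions in Section 2.1.1. The helper names (`matvec`, `rnn_forward`) and the toy weights are illustrative assumptions, not the thesis implementation.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_forward(xs, h0, W_hh, W_hx, W_oh, f=math.tanh):
    """Run a vanilla RNN over xs, reusing the same weights at every step.

    Assumed convention:
        h_t = W_hh f(h_{t-1}) + W_hx x_t
        o_t = W_oh f(h_t)
    """
    h = list(h0)
    hiddens, outputs = [], []
    for x in xs:
        recurrent = matvec(W_hh, [f(v) for v in h])
        from_input = matvec(W_hx, x)
        h = [r + i for r, i in zip(recurrent, from_input)]
        hiddens.append(h)
        outputs.append(matvec(W_oh, [f(v) for v in h]))
    return hiddens, outputs


# Toy dimensions: 2 hidden units, 1 input feature, 1 output.
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_hx = [[1.0], [-1.0]]
W_oh = [[1.0, 1.0]]
xs = [[1.0], [0.0], [0.0]]
hiddens, outputs = rnn_forward(xs, [0.0, 0.0], W_hh, W_hx, W_oh)
```

Note that the loop body never changes between steps: the only quantity that differs from one time-step to the next is the hidden state, which is what carries the temporal context.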

1 Temporal invariance is the ability of a network to detect a feature even if it occurs at different time-steps in the input. For example, a network designed to detect the city (in this case Amsterdam) should work for “I am in Amsterdam”, where Amsterdam occurs at the 4th position, and for “Guess what? , I am in Amsterdam”, where it is the 8th token.


The state at each time-step can be expressed in terms of the state at the preceding time-step. The classical form of this dynamical system can be written as:

$$
s^{(t)} = f(s^{(t-1)}; \theta) \tag{2.2}
$$

The recurrence relation in the above form can be removed with an unfolding operation, which converts a recursive operation into a computational graph with a repetitive structure [Goodfellow et al., 2016]. The above system, unfolded for 3 time-steps, can be written as:

$$
\begin{aligned}
s^{(3)} &= f(s^{(2)}; \theta) \\
        &= f(f(s^{(1)}; \theta); \theta)
\end{aligned}
\tag{2.3}
$$

Figure 2.2: A recurrent network with no outputs that only incorporates inputs into the hidden state h. (Left) Recurrent diagram of the module. (Right) The same network as an unfolded computation graph where each node contains one time instance. From Goodfellow et al. [2016]

By unrolling the computation graph for a finite number of steps, the system can be written as a directed acyclic computational graph (DAG). It is imperative to note here that the parameters of an RNN Whx, Woh and Whh are shared across all the time-steps. 2

The most common algorithm to train recurrent neural networks is Back-Propagation Through Time (BPTT) [Rumelhart et al., 1986]. In the following section, we formally introduce the BPTT process with the goal of highlighting gradient problems with vanilla RNN units.

2.1.1 BPTT and Training Problems with Vanilla RNN Modules

The Back-Propagation Through Time (BPTT) algorithm is used to train recurrent neural networks by unrolling the computation graph over multiple time-steps. It is a gradient-based technique that uses an unfolded version of the network to propagate gradients back into the network. We provide a brief formulation of the BPTT objective to understand the inherent problems with training vanilla RNN modules. In this section, we only discuss select results. For a complete proof, please see Appendix D.

2 The term recurrence implies that the state of the network at time-step t is obtained from a transformation of the state of the network at time-step t-1.

The BPTT algorithm unfolds the recurrent model over t time-steps into t distinct copies of the network and instantiates each of these copies with a common set of shared parameters. This effectively reduces the recurrent network to a computational graph with repetitive feed-forward structures with shared parameters. Each computation step contains an input/output pair and a copy of the network parameters. Errors are computed and summed up for all the individual computation steps as shown in Equation 2.4.

$$
\mathcal{L}(x, y) = -\sum_{t} y_t \log \hat{y}_t \tag{2.4}
$$

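Gradients at time-step (t → t+1) can be written as:

$$
\frac{\partial \mathcal{L}(t+1)}{\partial W_{hh}} = \frac{\partial \mathcal{L}(t+1)}{\partial \hat{y}_{t+1}} \frac{\partial \hat{y}_{t+1}}{\partial h_{t+1}} \frac{\partial h_{t+1}}{\partial W_{hh}} \tag{2.5}
$$

Since all network parameters are shared and owing to the recurrence relation, the gradient at time-step t+1 will have contributions from all previous time-steps. The gradient of the loss for the output at time t+1 can be written as:

$$
\frac{\partial \mathcal{L}(t+1)}{\partial W_{hh}} = \sum_{\tau=1}^{t+1} \frac{\partial \mathcal{L}(t+1)}{\partial \hat{y}_{t+1}} \frac{\partial \hat{y}_{t+1}}{\partial h_{t+1}} \frac{\partial h_{t+1}}{\partial h_{\tau}} \frac{\partial h_{\tau}}{\partial W_{hh}} \tag{2.6}
$$

Summing over all time-steps of the computational graph, the gradient with respect to the hidden state parameters $W_{hh}$ can be written as:

$$
\frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t} \sum_{\tau=1}^{t+1} \frac{\partial \mathcal{L}(t+1)}{\partial \hat{y}_{t+1}} \frac{\partial \hat{y}_{t+1}}{\partial h_{t+1}} \frac{\partial h_{t+1}}{\partial h_{\tau}} \frac{\partial h_{\tau}}{\partial W_{hh}} \tag{2.7}
$$

In Equation 2.7, we observe that the gradient uses a recursive product of gradients over the hidden states, namely the term $\partial h_{t+1} / \partial h_{\tau}$. This term can be recursively expanded as:

$$
\frac{\partial h_{t+1}}{\partial h_{\tau}} = \frac{\partial h_{t+1}}{\partial h_{t}} \frac{\partial h_{t}}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots \frac{\partial h_{\tau+1}}{\partial h_{\tau}} \tag{2.8}
$$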

This is a major weakness of vanilla RNNs: when the gradients in this recursive product are smaller than 1, the product vanishes to insignificant values, whereas gradients larger than 1 make it rapidly explode to large numbers. These are known as the Vanishing and Exploding gradient problems respectively.
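The effect of the recursive product can be illustrated with a deliberately simplified scalar sketch, in which every factor of the product is replaced by the same constant. This is an assumption made for illustration only: in a real RNN the factors are Jacobian matrices and the behaviour is governed by their norms rather than by a single number. The function name `jacobian_product` is hypothetical.

```python
def jacobian_product(factor, steps):
    """Product of `steps` identical scalar factors, standing in for the
    chain of partial derivatives dh_{k+1}/dh_k in the recursive product."""
    product = 1.0
    for _ in range(steps):
        product *= factor
    return product


# Factors slightly below, at and slightly above 1, over growing time lags.
for factor in (0.9, 1.0, 1.1):
    print(factor, [jacobian_product(factor, n) for n in (1, 10, 50)])
```

Even factors this close to 1 shrink the gradient contribution by more than two orders of magnitude, or grow it by about two orders of magnitude, over a lag of 50 steps.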


2.1.2 Long Short-Term Memory (LSTM)

Figure 2.3: Model Schematic for Long Short-Term Memory (inputs: cell state $c_{t-1}$, hidden state $h_{t-1}$, input $x_t$; outputs: cell state $c_t$, hidden state $h_t$, output $y_t$)

The primary reason behind vanishing gradients in vanilla RNN units is the use of the tanh non-linearity, whose derivative lies in (0, 1]. The LSTM module proposed by Hochreiter and Schmidhuber [1997] uses a gated mechanism in which error signals arriving at a memory cell are scaled by the output gate and its non-linearity and can then flow back through the cell state indefinitely [Gers, 2001]. These features allow LSTM modules to handle arbitrary input/output time lags [Gers, 2001]. While vanilla RNN modules use simplistic (tanh) activations to squash input and hidden state vectors, the LSTM module uses a more complex activation function controlled by gates that selectively allow information to pass through the module. LSTM modules use three gates to control the cell state:

1. The Forget gate $f^{(t)}$ is a trainable gate that determines the components to be retained from the long-term memory. A sigmoid activation is used for this gate. A gate output of ‘0’ completely erases the memory whereas ‘1’ does not remove any components from memory and allows them to be retained. This gate allows the LSTM to essentially reset its state to signal the end of the previous context.

2. The Input gate $i^{(t)}$ ascertains what parts of the input need to be retained and what can be dropped. It is a conventional ‘gate’ since it denotes the extent (0-1) to which information needs to be added to the current cell state. This gate chooses to either add new information or to reinforce recurring information.

An additional component of this gate is referred to as input modulation and is tasked with generating candidate additions to the long-term memory. This is not a traditional ‘gate’ since it represents information rather than the extent of information that is to be retained, and hence uses a tanh activation.

3. The Output gate $o^{(t)}$ determines what parts of the memory are immediately useful and the extent to which the long-term memory must be used in ascertaining the output. The output gate is a traditional gate and uses the sigmoid activation.
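The interplay of the three gates and the input modulation can be sketched with a scalar LSTM cell. The sketch assumes the common formulation in which the forget gate multiplies the previous cell state, so that a value near 1 retains the memory and a value near 0 erases it. The parameter layout (`params` mapping each gate to a tuple of input weight, recurrent weight and bias) and the function names are hypothetical and chosen only for this example.

```python
import math


def sigmoid(a):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-a))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps each of "f", "i", "g", "o" to a tuple (w_x, w_h, b).
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))      # forget gate: fraction of c_prev that is kept
    i = sigmoid(pre("i"))      # input gate: fraction of the candidate written
    g = math.tanh(pre("g"))    # input modulation: candidate memory content
    o = sigmoid(pre("o"))      # output gate: fraction of memory exposed
    c = f * c_prev + i * g     # new cell state (long-term memory)
    h = o * math.tanh(c)       # new hidden state
    return h, c


# Biases chosen to force the gates almost fully open or closed.
keep = {"f": (0, 0, 20), "i": (0, 0, -20), "g": (1, 0, 0), "o": (0, 0, 20)}
erase = {"f": (0, 0, -20), "i": (0, 0, 20), "g": (1, 0, 0), "o": (0, 0, 20)}
_, c_kept = lstm_step(1.0, 0.0, 0.7, keep)      # memory carried over
_, c_reset = lstm_step(1.0, 0.0, 0.7, erase)    # memory replaced by candidate
```

With the `keep` setting the previous cell state passes through almost unchanged, whereas with the `erase` setting it is overwritten by the new candidate, which is the reset behaviour described for the forget gate.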


2.2 Sequence-to-Sequence Models

Figure 2.4: Sequence-to-Sequence Models. From Goodfellow et al. [2016]

Deep Neural Networks can only be applied to problems where the inputs and targets can be encoded with vectors of fixed dimensionality [Sutskever et al., 2014]. Using a single recurrent module requires a priori information about the alignment between input and output sequences [Sutskever et al., 2014]. This poses problems for tasks like machine translation where the alignment or length of the input and output sequences is not fixed or cannot be inferred a priori. Additionally, for tasks like machine translation, it is essential that the translation network has a high-level idea about the nature and structure of the complete input sentence before it can start to decode the translation.

A Sequence to Sequence model [Sutskever et al., 2014, Cho et al., 2014] aims to solve this problem by mapping the input sequence to an intermediate fixed-length vector representation. These models are typically structured in an Encoder-Decoder structure. The Encoder maps the sequence data into a fixed-representation. This intermediate-vector representation can then be used by a Decoder jointly trained with the Encoder to generate an output sequence. The implementation by Sutskever et al. [2014] utilizes the LSTM as the principal recurrent module.

The encoder-decoder model is trained to predict $P(o_1, o_2, \dots, o_{T'} \mid i_1, i_2, \dots, i_T)$ where $i$ represents the input tokens and $o$ represents the output tokens. The encoder LSTM generates a fixed-dimensional representation $v$ of the input sequence $\{i_1, i_2, \dots, i_T\}$. The decoder LSTM then predicts the output sequence with a standard LSTM-Language Model (LSTM-LM) formulation [Rumelhart et al., 1986, Sundermeyer et al., 2012, Mikolov et al., 2010] where the initial hidden state of the decoder is set to the fixed-dimensional representation $v$ from the encoder.

$$
P(o_1, \dots, o_{T'} \mid i_1, \dots, i_T) = \prod_{t=1}^{T'} p(o_t \mid v, o_1, \dots, o_{t-1}) \tag{2.9}
$$

The distribution $p(o_t \mid v, o_1, \dots, o_{t-1})$ is represented by a softmax operation [Sutskever et al., 2014] over the output logits (with dimensionality $|V|$ where $V$ is the vocabulary).
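The factorization in Equation 2.9 can be made concrete with a short sketch that scores an output sequence from per-step decoder logits. The logits here are made-up numbers standing in for the output of a trained decoder, and the function names are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(value - m) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_log_prob(step_logits, target_ids):
    """Sum of log p(o_t | v, o_1, ..., o_{t-1}) over the output sequence.

    step_logits[t] holds the decoder logits at step t (one per vocabulary
    entry), assumed to be already conditioned on v and the previous tokens.
    """
    return sum(math.log(softmax(logits)[token])
               for logits, token in zip(step_logits, target_ids))


# Toy vocabulary of 3 tokens and a 2-step output sequence.
step_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
target_ids = [0, 2]
log_p = sequence_log_prob(step_logits, target_ids)
```

Working in log space turns the product over time-steps into a sum, which is also the quantity that a cross-entropy training loss accumulates (with the sign flipped).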

2.3 Variational Inference

Variational inference is a family of techniques for approximating intractable integrals in machine learning and Bayesian modeling setups. The Auto-Encoding Variational Bayes [Kingma and Welling, 2013] framework allows us to perform efficient approximate posterior inference using a recognition model for continuous latent space models with i.i.d. datapoints using a simple gradient descent optimizer.

Figure 2.5: Vanilla Auto-Encoders. From Goodfellow et al. [2016]

The structure of the Variational Auto-Encoder strongly resembles a conventional auto-encoder structure. An auto-encoder uses two component structures known as the encoder and decoder respectively. The encoder maps an input x to an intermediate representation that can be written as h = f(x). The decoder component reconstructs an output r from this intermediate representation, r = g(f(x)).

A traditional auto-encoder learns a compressed representation of the input by first compressing (encoding) and then reconstructing (decoding) from that compressed representation. A variational auto-encoder assumes an auto-encoder structure but enforces strong priors on the distribution of latent variables. In this way, it learns a distribution over the output rather than a compressed code. Since it enforces a strong structure on the latent space, output samples can be generated by feeding the decoder with a variable sampled from the latent prior distribution. These networks are trained with a modified objective, the Stochastic Gradient Variational Bayes (SGVB) estimator, which adds a regularization loss on the latent variable distribution to the regular reconstruction error term. In this section, we provide a brief introduction to the SGVB estimator that serves as the principal method for training the models detailed in Chapter 4.

2.3.1 SGVB Objective

In this section, we give a minimal introduction to the SGVB as a practical estimator of the lower bound and its derivatives w.r.t. the parameters [Kingma and Welling, 2013]. The complete mathematical proof for the estimator is available in Appendix E.

Figure 2.6: The graphical model for the generative model $p_\theta(x|z)p_\theta(z)$. The dotted lines indicate the variational approximation $q_\phi(z|x)$ of the intractable true posterior $p_\theta(z|x)$

The principal objective of VAE training is to bring the distribution parameterized by the neural network close to the assumed prior on the latent space. In order to formalize the divergence of one probability distribution from another, the Kullback-Leibler (KL) divergence is used:

$$
\mathrm{KL}(p \| q) = -\sum_x p(x) \log q(x) + \sum_x p(x) \log p(x) = \sum_x p(x) \log \frac{p(x)}{q(x)} \tag{2.10}
$$

such that the KL divergence is always non-negative ($\mathrm{KL} \geq 0$) and not symmetric ($\mathrm{KL}(p \| q) \neq \mathrm{KL}(q \| p)$).
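Both properties of the KL divergence can be checked numerically on a small discrete example. The two distributions below are arbitrary values chosen for illustration, and `kl_divergence` is a hypothetical helper name.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists.

    Terms with p(x) = 0 contribute nothing; q(x) must be positive
    wherever p(x) is positive.
    """
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
kl_pq = kl_divergence(p, q)   # divergence of q from p
kl_qp = kl_divergence(q, p)   # divergence of p from q, a different value
```

Both values are positive and they differ from each other, while the divergence of a distribution from itself is zero.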

Assuming the graphical model in Figure 2.6, the posterior can be written as:

$$
P(z|x) = \frac{P(x|z)P(z)}{\underbrace{P(x)}_{\int P(x|z)P(z)\,dz}} \tag{2.11}
$$

The evidence term in Equation 2.11 is an intractable integral in the higher dimensions that are generally used for latent spaces. To work around this, the variational Bayes framework approximates the intractable true posterior $p(z|x)$ with a variational posterior $q_\phi(z)$. To bring the two distributions as close together as possible, the KL divergence between $q_\phi(z)$ and $p(z|x)$ is minimized.

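Formalizing the objective to be minimized:

$$
\begin{aligned}
\mathrm{KL}(q_\phi(z) \| p(z|x)) &= -\sum_z q(z) \log \frac{p(z|x)}{q(z)} \\
&= -\sum_z q(z) \log \frac{p(x, z)}{q(z)} + \log p(x)
\end{aligned}
\tag{2.12}
$$

This can be simplified to the form:

$$
\mathrm{KL}(q_\phi(z) \| p(z|x)) = -\sum_z q(z) \log \frac{p(x, z)}{q(z)} + \log p(x) \tag{2.13}
$$

The objective can be rearranged as:

$$
\underbrace{\log p(x)}_{\text{Fixed Constant}} = \underbrace{\mathrm{KL}(q(z) \| p(z|x))}_{\text{Objective to be minimized}} + \underbrace{\sum_z q(z) \log \frac{p(x, z)}{q(z)}}_{\mathcal{L}} \tag{2.14}
$$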

Instead of minimizing the KL divergence between the variational posterior $q_\phi(z)$ and the true posterior $p(z|x)$, the $\mathcal{L}$ term can be maximized. Since the KL divergence is non-negative, $\mathcal{L}$ is a lower bound on $\log p(x)$ and is termed the Evidence Lower BOund (ELBO). Expanding the ELBO term,

$$
\mathcal{L} = \underbrace{\mathbb{E}_{q(z)}[\log P(x|z)]}_{\text{Reconstruction Loss}} - \underbrace{\mathrm{KL}(q(z) \| p(z))}_{\text{Regularization Loss}} \tag{2.15}
$$

Minimizing the KL divergence between the variational posterior approximation and the true posterior is equivalent to maximizing the ELBO, as seen in Equation 2.12. The ELBO term in Equation 2.13 can be further decomposed into the two components shown in Equation 2.15.

The Reconstruction loss component represents the difference between the generated and the true data and aims to bring the generations closer to the ground truth. For example, under a Gaussian likelihood with fixed variance, this term reduces to a squared Euclidean distance between the generated and true data (up to scaling and an additive constant). The Regularization term adds a loss on the latent code to bring it closer to the assumed standard prior distribution over the latent space p(z).

The loss in Equation 2.15 forms the basis of all models in Chapter 4, albeit with minor modifications or additions.

2.4 The GuessWhat?! Game

The GuessWhat?! game is a cooperative two-player game introduced by de Vries et al. [2016]. The game involves two players, one of whom is designated as the Oracle and is assigned a visual object. The other player, known as the Questioner, then asks a series of binary YES/NO questions to ascertain the correct object class. Once the Questioner has gathered sufficient evidence from the Oracle's answers and is reasonably confident about the correct class, he makes a guess and picks an object class. If the guess is correct, the game is marked as a success; a wrong guess is marked as a failure, and games that are not completed for any reason are marked as incomplete.

Furthermore, the questioner is not provided with a list of possible object classes before the guess. The baseline implementation [de Vries et al., 2016] consists of three modules: the question-generator, the oracle and the guesser.³

In this work, we principally focus on improvements to the question-generator. Chapters 3 and 5 are dedicated to a detailed analysis of the question generation strategy and some proposed improvements.

In this section, we introduce an Oracle baseline that is used without modifications in the rest of this work. We do not introduce or discuss the Guesser as it is not used in our experiments.

2.4.1 Oracle

The Oracle module provides valid responses (YES/NO) for natural language questions generated by the question-generation module.

Figure 2.7: Baseline Oracle Model (from de Vries et al. [2016])

³ The term questioner refers to the GuessWhat?! gameplay agent tasked with generating questions and finally guessing the object class. Separately, the terms question-generator and guesser refer to the baseline modules that generate questions and perform the final guess, respectively. In the game, the same entity is tasked with both question generation and guessing, whereas in the baseline these are performed by separate modules.


The baseline Oracle model from GuessWhat?! uses the Image I, the cropped image features (the image overlaid with a binary mask for the object), spatial information S, its global object class and the current question q [de Vries et al., 2016]. The features are described in detail below:

Input Features

1. Image Features I: The image features are computed by rescaling the input image I to a 224 × 224 image. This is then passed through the VGG-19 Network [Simonyan and Zisserman, 2014] and the output from the FC8 layer is used as Image features. The GuessWhat?! authors did not experiment with ResNet features.

2. Crop Features C: Features for the target object are also fed to the network. The selected object is first cropped using the smallest rectangle that completely contains its segmentation mask. The crop is then rescaled to 224 × 224, the same dimensions as the rescaled image.

3. Spatial Information S: The spatial information of the crop is also passed to the network as an 8-dimensional vector to help the network locate the position of the crop with respect to the complete image. This is essential for assimilating spatial information in games where questions are of the form “Is it on the left?” or “Is it in the centre?”.

4. Question Embedding q: The question is first tokenized and then embedded using a uni-directional Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] module.
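The crop and spatial features can be sketched as follows. The bounding-box computation follows the description above; the layout of the 8-dimensional vector (box corners, center, width and height, normalized to [-1, 1]) is an assumption made for illustration, since its exact contents are not listed here. The function names are hypothetical.

```python
def bounding_box(mask):
    """Smallest axis-aligned rectangle (x_min, y_min, x_max, y_max) covering all 1-pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask contains no object pixels")
    # The max edges are exclusive, so the box spans whole pixels.
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1

def spatial_vector(box, width, height):
    """Assumed 8-dim layout: corners, center, width, height, all relative to the image."""
    x_min, y_min, x_max, y_max = box
    x0, x1 = 2.0 * x_min / width - 1.0, 2.0 * x_max / width - 1.0
    y0, y1 = 2.0 * y_min / height - 1.0, 2.0 * y_max / height - 1.0
    return [x0, y0, x1, y1, (x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0]

# Tiny binary segmentation mask (4 rows x 6 columns), illustrative only.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
box = bounding_box(mask)
spatial = spatial_vector(box, width=6, height=4)
```

In a full pipeline the box would be used to crop the image before rescaling it to 224 × 224 and passing it through the feature extractor.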

Model and Training

All features are concatenated into a single vector. An MLP is then used to map the concatenated feature vector to a 2-dimensional vector, which is then converted to (YES/NO) probabilities using a softmax operation. A final result is then computed by an argmax operation over the answer probabilities. A schematic diagram of this model created by the GuessWhat?! authors can be found in Figure 2.7. The network is trained to minimize the cross-entropy loss. For the training, the VGG network parameters were fixed and only the LSTM and MLP parameters were updated. For optimization, the Adam [Kingma and Ba, 2014] optimizer was used.

The model was trained for 15 epochs, and an early-stopping criterion [Girosi et al., 1995] was applied on the validation split.
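The prediction path described above (concatenate, MLP, softmax, argmax, cross-entropy) can be sketched in plain Python. This is only an illustration under stated assumptions: the hidden layer size, the ReLU non-linearity, the feature dimensions, the random weights and the YES/NO index order are all hypothetical and do not describe the trained baseline.

```python
import math
import random

def linear(x, weights, bias):
    """Affine map: one output per (weight row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_params(in_dim, hidden_dim, rng):
    """Random weights for a one-hidden-layer MLP (hidden size is an assumption)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"w1": matrix(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
            "w2": matrix(2, hidden_dim), "b2": [0.0, 0.0]}

def oracle_forward(features, params):
    x = [v for part in features for v in part]            # concatenate all input features
    hidden = [max(0.0, h) for h in linear(x, params["w1"], params["b1"])]   # ReLU
    probs = softmax(linear(hidden, params["w2"], params["b2"]))
    answer = ("YES", "NO")[probs.index(max(probs))]       # argmax over answer probabilities
    return answer, probs

def cross_entropy(probs, target_index):
    """Training loss for one example with a known correct answer."""
    return -math.log(probs[target_index])

rng = random.Random(0)
question_vec = [rng.gauss(0, 1) for _ in range(16)]   # stand-in for the LSTM question encoding
spatial_vec = [rng.uniform(-1, 1) for _ in range(8)]  # stand-in for the 8-dim spatial vector
params = init_params(in_dim=24, hidden_dim=12, rng=rng)
answer, probs = oracle_forward([question_vec, spatial_vec], params)
loss = cross_entropy(probs, target_index=0)
```

During training only the parameters of the MLP and the question encoder would receive gradient updates, in line with the frozen VGG features described above.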

Results

The GuessWhat?! authors experimented with multiple input combinations. Empirically, the best results were obtained from the model that used the LSTM-encoded question q, the object category C and the spatial features S as input. We separately trained the best-reported model to validate these results; the resulting error values are detailed in Table 2.1.

| Model | Training Error | Test Error |
|---|---|---|
| Trained Oracle Model Scores | 17.03% | 21.48% |
| GuessWhat?! Reported Baseline Scores | 17.8% | 21.5% |

Table 2.1: Baseline Oracle Accuracy Values

The baseline Oracle model is used without any modifications for free-running question generation experiments.

In the next chapter, we present a detailed examination of the structure of questioning in GuessWhat?! and introduce a method to cluster questions on the basis of their intended strategy with respect to object discovery.


Chapter 3 Question Generation in GuessWhat?!

The GuessWhat?! gameplay involves a visual discovery task where a player narrows down on an object class by gathering evidence through a series of yes-no questions. In this respect, questions are the principal method of gathering evidence if visual features are not sufficiently informative. In this chapter, we examine the nature of questions in the GuessWhat?! dataset and examine if they can be separated in terms of their intended discovery strategy.

Formally, the question-generation task involves generating a new question q_{N+1} given N previous questions q_{<N}, answers corresponding to those questions D_{<N} and image features I. The quality of the question-generation task can be measured by the ability of the Guesser to predict the correct answer given the generated questions. The question generation task is a hard problem for several reasons including but not limited to:

1. It requires high-level visual understanding: The question generation module relies on the visual features from the VGG/ResNet networks for high-level visual understanding. It should ideally learn to associate high-level concepts from visual features with accompanying textual evidence encoded by the Sequence-to-Sequence encoder module.

2. It requires tracking the uncertainty over the object classes: The question-generation module must handle long-term dependencies and must constantly retain and update the belief-state over individual object classes. In an ideal setup, it should try to reduce the number of valid classes with every turn, the term valid referring to classes which have not been eliminated by previously gathered evidence.

For example, if the question “Is it red?” is answered in the negative, the model should decrease its belief over any object class that is red and refrain from generating questions about classes with red features in the subsequent turns.

3. It suffers from compounding errors caused by the Oracle and Guesser: The question-generation pipeline relies on the Oracle for responses and on the Guesser for estimating the informativeness and relevancy of its questions. Given an imperfect Oracle and Guesser, these errors are effectively carried over into the question-generation pipeline.


In addition to theoretical difficulties, training it in a single-objective setup introduces additional complexities:

1. It does not have a well-defined evaluation metric: The generated questions are only trained to maximize the conditional log-likelihood of the ground-truth question conditioned on the Q/A history and the image features. There is, however, no well-defined loss on the novelty and/or relevance of questions in this setup and evaluation relies on a subjective assessment of these factors.

2. The question-generation is not trained to be explorative: The question-generation module receives a single ground-truth question for every game. This implies that the question-generation mechanism only learns to respond to a state of the game with a single target generation and does not explore any other strategies/responses.

In this chapter, we present a statistical analysis of the GuessWhat?! ground-truth question data. Additionally, we also introduce a simple method for classifying questions into mutually exclusive categories based on a broad definition of the intended strategy.

3.1 Hand-Crafted Strategy Clustering

In order to investigate the structure of questioning in GuessWhat?!, it is necessary to introduce labels on the strategy type for each question. In this section, we detail a simple strategy labelling method based on aggregated term-frequency statistics.

3.1.1 Theory

The GuessWhat?! authors briefly introduce the idea of questions with specific strategies in GuessWhat?! [de Vries et al., 2016]. These strategies aim to indicate the broader objective of a question with respect to determining an object class. For example, questions like “Is it in the back?” do not provide strong evidence towards the correct class but help in cutting down the possibility space of object classes. Other questions like “Is it the boy at the back” can provide targeted evidence for a single object class. In this section, we formalize the methodology behind the creation of these strategy classes.

The general aim of questions in the GuessWhat?! gameplay is to gradually reduce the possibility space for the Guesser module by generating questions that either eliminate multiple object classes or provide strong evidence in favor of a single class. de Vries et al. [2016] classify strategy into five sub-classes based on a broad delineation of the underlying strategy:

1. Questions about spatial features: Questions of this type enquire about the spatial position of an object or multiple objects with respect to the image and/or other object classes. These may be further subdivided into:

(a) Absolute Spatial Position: Questions about the location of certain features relative to the complete image. Examples are “Is it at the bottom of the image?”, “Is it on the left?”


(b) Relative Spatial Position: Questions about the location of certain features relative to other features. Examples are “Is it to the left of the man?”, “Is it under the tree?”.

2. Questions about visual features: Questions of this type enquire about elementary physical attributes of objects. These can be further subdivided as:

(a) Size: “Is it big?”

(b) Shape: “Is it round?”

(c) Color: “Is it green?”

3. Questions about taxonomical features: Questions of this type exploit a hierarchy over the object classes to enquire about the presence or absence of multiple sub-classes simultaneously. For example, “Is it a person?” will allow the model to ascertain the absence or presence of multiple sub-classes like ‘Man’, ‘Woman’, ‘Boy’, ‘Girl’ simultaneously.

4. Questions about the nature of interaction: These questions are enquiries about the kinds of interaction that are possible with the target class. For example, “Can you drive it?” could provide evidence for multiple classes like trucks, cars etc. which can be driven.

5. Direct object enquiry questions: These questions are precise queries about the presence or absence of a single object class. For example, “Is it the cat?”, “Is it the person?” etc.

6. Free class: In our experiments, we also use a free class which includes questions which cannot be cleanly separated into a single strategy class. These questions usually contain a mixture of strategies and a clear dominant strategy cannot be trivially chosen. For example, “Can I see it?” etc.

3.1.2 Implementation

The simplest methods to classify a sentence use a combination of Bag-of-Words classification and Regular Expression templates to assign questions to distinct strategy classes. We explain these methods in detail below:

• Bag-of-Words classification: The simpler of the two methods uses a Bag-of-Words (BoW) approach [Harris, 1954] to classify sentences into strategy classes. The entire sentence is converted to a list of words and then compared with word lists for each of the strategy classes. If any term present in the strategy word list also occurs at least once in the sentence, the sentence is marked positive for that strategy.

To classify questions, we first compute the top 500 most common terms in the GuessWhat?! dataset to analyze the distributions of terms with respect to the classes. Although it is possible to use an inverse document frequency term to separate out terms for novelty, this negatively impacts the coverage. The selected terms are then inspected manually to analyze if their usage in a question indicates the presence of one or more strategies detailed in Section 3.1.1.


Most of the questions classified using the BoW strategy contain words from multiple strategy word-lists. Soft classification allows a question to be classified into multiple strategy classes, whereas Hard strategy classification only allows questions with a single strategy class label. For our testing, we chose sentences with only one strategy and disregarded any questions that contain a mixture of multiple strategy terms. We also remove terms that occur in almost every question (that is, terms with very low inverse document frequencies) like ‘is’, ‘it’, ‘?’ from the strategy word-lists.

• Regular Expression Templates: Questions exhibit a more controlled structure and form as compared to prose and general corpora text. We exploit this property to create templates for each of the strategy types and filter out ill-formed questions. For example, questions that are posed as a direct enquiry about object classes are generally of the form Is it the * or Is it a * (* denotes an object class) and can be converted to a regular expression isit[the,a,an]* (removed white-space). Although this helps in eliminating noisy data, it needs to be tuned by trial and error to avoid dropping valid questions that deviate slightly from the expected structure.

Additionally, for the Direct Object Enquiry class, we purely rely on Regex templates to classify sentences into this pool. This is because the human player does not have a list of valid object classes before the guess [de Vries et al., 2016]. These questions are detected using templates designed by analyzing queries of this nature.
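The two labelling steps can be sketched together as follows. The word lists and the regular expression below are hypothetical and heavily abbreviated; the manually curated lists and templates used in this work are not reproduced here. Sending mixed-strategy questions to the free class, and applying the template only when no word list matches, are likewise assumptions made for this illustration.

```python
import re

# Hypothetical, abbreviated word lists (the real lists were curated manually).
STRATEGY_WORDS = {
    "spatial": {"left", "right", "top", "bottom", "front", "back", "behind", "near"},
    "visual": {"red", "green", "blue", "big", "small", "round"},
    "taxonomical": {"person", "animal", "vehicle", "food", "furniture"},
    "interaction": {"drive", "eat", "wear", "sit", "hold"},
}

# Illustrative template for direct object enquiries such as "is it the cat?".
DIRECT_ENQUIRY = re.compile(r"^is it (the|a|an) (\w+) ?\?$")

def tokenize(question):
    """Lower-case the question and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", question.lower())

def classify(question):
    """Hard classification: exactly one matching word list yields that strategy label."""
    tokens = set(tokenize(question))
    hits = {name for name, words in STRATEGY_WORDS.items() if tokens & words}
    if len(hits) == 1:
        return hits.pop()
    if not hits and DIRECT_ENQUIRY.match(question.lower().strip()):
        return "direct_object"
    return "free"   # mixed strategies or no recognizable strategy

labels = {q: classify(q) for q in [
    "Is it on the left?",
    "Is it green?",
    "Is it the cat?",
    "Is it the red one on the left?",
    "Can I see it?",
]}
```

With these toy lists, a question such as “Is it the person?” would be labelled taxonomical rather than direct object enquiry, which illustrates why the word lists and templates need manual tuning.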

3.2 Statistical Analysis

In order to understand the nature of questioning, it is important to analyze the statistical evolution of game strategy and success with time. In this section, we examine the distribution of question tokens, the time-evolution of strategy and the correlation between game-success and the number of questions.

3.2.1 Dataset Statistics

The GuessWhat?! [de Vries et al., 2016] dataset contains games with successful as well as unsuccessful outcomes. A successful game is defined as one wherein the human player could successfully guess the correct object class given the chosen line of questioning.

The vocabulary of GuessWhat?! tokens follows an approximately Zipfian distribution, as shown in Figure 3.1. In this work, we restrict our vocabulary to terms with a minimum occurrence count of 3 to keep our experiments in line with other results from de Vries et al. [2016].


Figure 3.1: Log Frequency vs Log Rank (Zipf) distribution for word tokens in GuessWhat?!

The GuessWhat?! dataset includes finished games (games that were properly concluded with a correct or incorrect guess) as well as successful games (the subset of finished games that were concluded with a correct guess). Broad statistics for this dataset are available in Table 3.1.

| Counts | Finished Games | Successful Games |
|---|---|---|
| Dialogues | 144,434 | 131,394 |
| Questions | 732,081 | 648,493 |
| Average Questions / Dialogue | 5.06 | 4.93 |
| Vocabulary | 10,985 | 10,469 |
| Vocabulary (min. occ. count = 3) | 5,179 | 4,919 |

Table 3.1: Statistical Information for GuessWhat?! data. From de Vries et al. [2016]

It is worth noting that the success of a game is largely independent of the number of questions. While the entire pool of finished games has approximately 5.06 questions per game on average (Table 3.1), the successful games have a slightly smaller average of 4.93. This indicates that the number of questions is not strongly correlated with game success in GuessWhat?!.

3.2.2 Strategy Clustering Statistics

Based on the strategy clustering techniques discussed in Section 3.1.2, we use a combination of the Bag-of-Words and Regular-Expression templates to classify sentences into strategy classes. Statistics for this are available in Figure 3.2.


Figure 3.2: Distribution of Classes for Regex based classification

3.2.3 Question Strategy-Time Evolution Characteristics

To understand the evolution of gameplay with time, we also record strategy labels for human data in Figure 3.3.

Figure 3.3: Evolution of Strategy over Gameplay (timesteps)

The plots for human strategy evolution in Figure 3.3 illustrate the evolution of strategy dynamics over the course of a game. Humans start with questions about dominant object classes, a strategy which is closer to an O(n) linear object class enumeration strategy. Since humans do not have a list of object classes, this strategy cannot be sustained. The initial Direct Object Enquiries are primarily based on the visual evidence.


Later sections of the game witness a shift towards enquiries about spatial or visual features, especially when the initial Direct Object Enquiry questions do not yield a positive response from the Oracle.

3.2.4 Gameplay Length Characteristics

GuessWhat?! players require a sufficient number of questions to accurately ascertain and gather evidence for their guess. To understand the dynamics between game success and the number of questions, we plot the number of successful games against the number of turns (questions asked).

Figure 3.4: Successful Game Lengths vs Frequency of Occurrence

An optimal strategy for a perfect player in the GuessWhat?! gameplay is to perform a binary search over the space of possible target classes. This allows the player to ascertain the correct object class in at most ⌈log₂ N⌉ turns (where N is the number of target classes), that is, with O(log N) complexity. An alternative yet poorly scaling strategy would be to enumerate all the possible classes, which scales linearly with the number of object classes, with complexity O(N).
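The difference between the two strategies can be made concrete with a short calculation. The sketch below assumes an idealized player whose every question splits the remaining candidates as evenly as possible; the function names and the class counts are illustrative only.

```python
def binary_search_turns(num_classes):
    """Worst-case yes/no questions when each answer halves the candidate set."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floating point.
    return (num_classes - 1).bit_length()

def enumeration_turns(num_classes):
    """Worst-case 'is it X?' questions when classes are tried one at a time."""
    return max(num_classes - 1, 0)   # after N - 1 negative answers only one class remains

for n in (8, 20, 80):
    print(n, binary_search_turns(n), enumeration_turns(n))
```

The gap widens quickly as the number of candidate classes grows, which is why a sustained enumeration strategy is not viable for a human player.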

From Figure 3.4, it is evident that a certain minimum number of questions is critical to achieving success in this game, with the graph peaking around 3-4 questions. It is, however, more surprising to observe that a larger number of questions diminishes the probability of success in the game, with the number of successful games progressively decreasing as the number of questions increases. This is probably because harder games that require a larger number of questions have a lower success rate.


3.3 Encoder Hidden-State Clustering

The methodology introduced in Section 3.1.1 primarily focuses on bringing anchored semantics into the question generation mechanism. While this is motivated by interpretability, it may not necessarily reflect the actual underlying criterion for decision-making in the encoder. In this section, we use unsupervised clustering methods over the encoder hidden states to ascertain if transitions in the latent space can be attributed to observable dialogue scenarios. The state of a GuessWhat?! game is parameterized by the question-answer history H and the image features for the image I. We encode the state of a game at multiple timesteps to ascertain if these high-dimensional vectors can be clustered into pools with a common context. This underlying context would serve as a means of understanding how the encoder processes dialogue internally.

3.3.1 Experiments

We use the K-Means clustering algorithm [MacQueen et al., 1967; Lloyd, 1982] to generate cluster centers. Since the choice of K can induce clusters with different interpretations, we experiment with K = 3, 5, 7, 9 and 11. The K-Means implementation [Pedregosa et al., 2011] was executed with 10 different centroid seeds to reduce the risk of settling in a poor local minimum. Each experiment ran for 300 iterations. To ensure reproducibility, the random seed was fixed to 0.
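The clustering procedure can be sketched with a from-scratch version of Lloyd's algorithm. This is not the library implementation used in our experiments; it only mirrors the reported settings (10 restarts, 300 iterations, seed 0), and the two-dimensional synthetic points stand in for the much higher-dimensional encoder states.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, k, rng, max_iter=300):
    """One K-Means run from a random initialization drawn from the data."""
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:   # converged: no point changed cluster
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # keep the old center if a cluster is empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    inertia = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))
    return centers, assignment, inertia

def kmeans(points, k, n_init=10, max_iter=300, seed=0):
    """Run several initializations and keep the one with the lowest inertia."""
    rng = random.Random(seed)
    runs = [lloyd(points, k, rng, max_iter) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])

# Synthetic stand-in for encoder hidden states: three noisy blobs in 2-D.
data_rng = random.Random(1)
blobs = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
states = [[cx + data_rng.gauss(0, 0.3), cy + data_rng.gauss(0, 0.3)]
          for cx, cy in blobs for _ in range(20)]

centers, labels, inertia = kmeans(states, k=3)
```

Keeping the best of several restarts matters because a single run can converge to a poor local minimum depending on where the initial centers fall.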

3.3.2 Results and Discussion

For the sake of brevity, we only present results for K = 3 and K = 5. All trends hold consistently for higher values of K.

| Cluster ID | History | Gold Question | Actual Generation |
|---|---|---|---|
| 0 | <start> is it a bird ? <no> is it a boat ? <yes> | is it the one all the way on the left ? | is it the one on the way to the left ? |
| 1 | <start> is it a person ? <no> is it food ? <no> is it a pot or kettle ? <no> is it the oven ? <no> | is it near the oven ? | is it the the stove ? |
| 2 | <start> | is a food item ? | is it donut ? |


| Cluster ID | History | Gold Question | Actual Generation |
|---|---|---|---|
| 0 | <start> is it an appliance ? <yes> one of the 3 stoves ? <no> a refrigerator in the front row ? <yes> on the left side of picture ? <no> | the one on the right ? | the one on the right with lots of black ? |
| 1 | <start> is it the person ? <no> is it a pillow ? <no> is it on the bed ? <no> is it the mirror ? <no> is it the dream catcher ? <no> | is it the of the beds ? | is it one of the birds ? |
| 2 | <start> | is it a vehicle ? | is it a car ? |
| 3 | <start> is it a person ? <yes> | is it wearing blue ? | is he wear black ? |
| 4 | <start> is it on the right side ? <yes> is it a person ? <no> is it yellow ? <yes> | is it the whole one of the person? | is it the full portion of that pipe? |

Table 3.3: Game Parameter Examples for Clustered Encoder States with K = 5

While unsupervised methods provide a strong basis for understanding the workings of the encoder, their results are not always interpretable. In the above experiments, we observed that:

1. Games with no dialogue history are consistently clustered separately. This can be explained by the fact that this is the only step at which the dialogue history is the same across games. This cluster occurs consistently for all of the K values tested.

2. There is no discernible difference between clusters in terms of observable transitions. Most of the clusters have no definitive characteristic in terms of the type of questions in history, time-steps or nature of the question to be generated.

The Game-State Encoder is tasked with generating an encoded representation of the game state that can be systematically decoded by a jointly trained decoder to yield a question. The principal hypothesis under investigation in this section was that context-vectors can be separated in space based on observable patterns in the question-answer history or timing information. The experiment yields negative results for this hypothesis.

The results tie in strongly with the theoretical basis of sequence autoencoders learning distributed representations that do not necessarily encode a high-level intuition of the sequence. In this case, while the encoder does have a probabilistic notion of the tokens to be generated, it does not have a distinctive intuition about the state of the game.


Games with no history are consistently clustered in a distinct pool. The generated question, in this case, is typically of the Direct Object Enquiry type, which is in line with the observed trend from Figure 3.3. Other clusters contain a heterogeneous mix of game states and are not easily differentiable in terms of observable characteristics. In experiments with a larger number of cluster centers, it was exceedingly hard to find an observable distinction in the game parameters. We additionally test the hypothesis of cluster centers encoding semantic context by replacing all the encoder states by their cluster center in a controlled generation setup.¹ The results yield implausible generations that are often irrelevant to the context.

3.4 Conclusion

In this chapter, we investigated the structure of questioning in the GuessWhat?! game. By using labels extracted from term-frequency statistics and simple regular-expression based templates, we separated questions into distinct strategy pools. Since this data is not a part of the standard GuessWhat?! setup, the nature of these labels is coarse but future refinement of these labels can be performed with semi-supervised methods. An analysis of the time evolution of strategy yields an insight into the human questioning procedure and we try to correlate these statistics with the general structure of GuessWhat?! gameplay.

One of the strongest motivating factors for our work arises from an evaluation of the game success vs. the number of questions. The plots indicate that an increasing number of questions are not correlated with an increasing game success and by proxy a stronger belief over the correct object class. The nature and order of questions are vital to game success and while smaller games suffer from a lack of strong evidence, larger games are harder and the longer questioning does not help.

Finally, we investigate unsupervised methods for clustering encoder states and reconciling those clusters with observable game parameters. We present negative results and indicate that the clustering does not yield pools with discernible semantic differences.

In the next chapter, continuing with our underlying theme of interpretability, we investigate language models trained with variational inference that generate a semantically meaningful and disentangled latent space where relevant attributes of a GuessWhat?! question can be precisely controlled.

¹ A controlled generation setup is a special type of inference pipeline where, in order to test a specific hypothesis, one of the model inputs is perturbed in a uniform fashion to test the effect of these perturbations on the model output. All other parameters are kept as close as possible to the ground-truth values. In this case, the encoder state is deterministically replaced by a pre-defined value to test if encoder states contain semantic context.


Chapter 4 Controlled Language Generation with Variational Auto-Encoders

The question generation process in GuessWhat?! [de Vries et al., 2016] requires the model to maintain a belief distribution over the object classes and then use this distribution to generate an informative question. This requires the question-generation mechanism to handle Dialogue Management (DM) as well as Natural Language Generation (NLG) tasks.

In this chapter, we attempt to analyze the natural language generation task decoupled from dialogue management and examine approaches based on the Auto-Encoding Variational Bayes [Kingma and Welling, 2013] framework for natural language generation. We also examine the suitability of additional approaches that attempt to use disentangled representations with controlled attributes for generating natural language with the desired semantic properties.

4.1 Motivation

Sequence-to-Sequence models are currently the state-of-the-art in Machine Translation [Vaswani et al., 2017] and Image Captioning tasks [Chen et al., 2016]. A Sequence-to-Sequence model creates a time-evolving distributed representation which allows it to encode long dependencies and contextual information efficiently. This, however, implies that a Sequence-to-Sequence model only encodes the iterative evolution of its representation. Greedy-decoding of an encoded representation obtained from an interpolation between the encoded representations for two sentences does not yield a grammatically correct sentence [Bowman et al., 2015].

In this chapter, we move to a recognition-model-based posterior where the deterministic transformation function φ_enc is replaced by a posterior recognition model p(z|x), where x is the input and z is the latent code. Instead of providing the latent code to the decoder directly, the model now parameterizes the posterior distribution from which the code can be sampled. Since the VAE latent space exists as smooth ellipsoidal regions instead of single points [Bowman et al., 2015], an intermediate representation obtained from a linear interpolation between two latent codes has a valid and plausible language decoding.

We are principally motivated by a desire to control the attributes of a question using disentangled and discrete attributes. This would allow us to delegate the strategy control to an external module while using the VAE-Language Model as an independent question generation apparatus.

4.2 Text Generation from Continuous Latent Representations

Work presented in Generating Sentences from a Continuous Space by Bowman et al. [2015] introduces the first extension of the variational auto-encoding framework to text generation. In this section, we present our results from a reproduction of their setup to establish a baseline result for text generation using this framework.

4.2.1 Model Architecture

Figure 4.1: Modified Auto-Encoder Network with a Posterior Recognition Model. From Bowman et al. [2015]

The model uses a single-layer LSTM module for both the encoder and the decoder, similar to sequence autoencoders [Sutskever et al., 2014]. The encoded state h_t is mapped by two separate fully-connected networks to the parameters µ and σ of the latent distribution. These networks serve as probabilistic encoders that, instead of generating the value of the random variable, parameterize the distribution over it. The sampled latent vector z is mapped back to the dimensions of the decoder LSTM hidden state. The model uses an isotropic Gaussian distribution as the prior on the distribution of the code generated by the encoder. The decoder then iteratively generates the sequence until a STOP signal is generated.

The negative of the Evidence Lower-Bound (ELBO) given in Equation 4.1 is minimized. Since the prior is assumed to be a standard multivariate Gaussian, we replace the KL divergence term by its analytic closed-form solution [Kingma and Welling, 2013], as given in Equation 4.2. We use the Adam [Kingma and Ba, 2014] optimizer in our experiments.


$$\mathrm{KL}(q_\phi(z) \,\|\, p_\theta(z)) = -\frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log\left((\sigma_j)^2\right) - (\mu_j)^2 - (\sigma_j)^2 \right) \tag{4.2}$$

where:

J = dimensionality of the latent code z

µ_j = j-th element of the variational mean µ

σ_j = j-th element of the variational standard deviation σ

φ = parameters of the neural-network posterior recognition model
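The closed-form term can be evaluated directly from the encoder outputs. The following is a minimal sketch, written with the leading minus sign that keeps the divergence non-negative; the function name and the example values are illustrative assumptions.

```python
import math

def gaussian_kl(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over the J latent dimensions."""
    return -0.5 * sum(
        1.0 + math.log(s ** 2) - m ** 2 - s ** 2 for m, s in zip(mu, sigma)
    )

kl_at_prior = gaussian_kl([0.0, 0.0], [1.0, 1.0])   # posterior equal to the prior
kl_shifted = gaussian_kl([1.0], [1.0])              # mean shifted by one unit
kl_narrow = gaussian_kl([0.0], [0.5])               # variance smaller than the prior's
```

The term is zero only when the posterior matches the standard Gaussian prior, and grows as the mean moves away from zero or the standard deviation moves away from one.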

4.2.2 Experiments and Setup

To test the suitability and performance of these models for language modelling tasks, we train the models on data from the Penn TreeBank [Marcus et al., 1994]. The models are trained on multiple hyper-parameter setups for a varying number of epochs.

The ELBO formulation in Equation 4.1 represents the stochastic objective function to be optimized by the training process. L contains a reconstruction loss term, which computes the difference between the network input and the reconstructed output, and a regularization term, which reduces the KL divergence between the distribution parameterized by the neural network qφ(z) and a centered isotropic Gaussian N(0, I).

With the vanilla setup described in the section above, the KL divergence rapidly drops to zero in all hyper-parameter setups, as also reported by the original authors. This implies that the distribution over the latent state q(z) degenerates to the Gaussian prior p(z): the latent code then carries no information about the input, and the decoder effectively behaves as a plain language model that ignores z.

As suggested by the original authors, we anneal the weight of the KL divergence term along a curve. At the beginning of training, the weight is set to zero, allowing the model to effectively learn a sequence autoencoder. As training progresses, we increase the weight so that the KL divergence occupies a larger portion of the total loss and thereby acts as a regularizer on the latent code.
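The effect of the annealing weight on the objective can be sketched as follows. The example assumes a linear ramp and a hypothetical `warmup_steps` parameter. The function names and the loss values are illustrative only and are not taken from our experiments.

```python
def linear_kl_weight(step, warmup_steps):
    """Weight that rises linearly from 0 to 1 over warmup_steps, then stays at 1."""
    return min(1.0, step / warmup_steps)


def annealed_loss(cross_entropy, kl, step, warmup_steps=1000):
    """Negative ELBO with the KL term scaled by the current annealing weight."""
    return cross_entropy + linear_kl_weight(step, warmup_steps) * kl


# Hypothetical per-batch values, only to show how the weight changes the total.
for step in (0, 500, 1000):
    print(step, annealed_loss(cross_entropy=100.0, kl=8.0, step=step))
```

At step 0 the objective is the pure reconstruction loss, and the full KL penalty applies only once the ramp is complete.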

A KL divergence close to zero means that the latent code is non-informative, which allows the decoder to ignore z. To alleviate this, we weaken the decoder using word-dropout, where the word input at each time-step is randomly replaced by an UNK token. This forces the decoder to rely on the latent code z to a greater extent and forces z to be informative. In our experiments, we primarily analyze:

1. Cross-Entropy and KL Divergence Loss: The cross-entropy component of the ELBO objective is equivalent to the reconstruction loss between the target and the generated sequence and quantifies the generation performance of the decoder. The KL divergence term in the objective quantifies the informativeness of the latent code. A KL divergence of zero indicates that the approximate posterior has collapsed to the standard Gaussian prior and that the latent code is no longer informative.


2. KL Annealing: While Bowman et al. [2015] do not specify the exact methodology behind the KL annealing, we use a linear and a logistic curve to determine the annealing weights. We report the effects of the annealing strategies on the KL divergence in our results. The annealing weight for the logistic curve is given by:

\[
f(x) = \frac{L}{1 + \exp\left(-k(x - x_0)\right)} \tag{4.3}
\]

where:

L = maximum value of the curve (in our case, L = 1)

k = steepness of the curve

x0 = x-value at which the sigmoid curve reaches 0.5

3. Word-Dropout Rate: We use multiple values for word-dropout rate and also experiment with history-less decoding to analyze its effects on the stability of the model.
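The logistic annealing weight of Equation 4.3 and the word-dropout mechanism can be sketched as below. The function names, the values chosen for k and x0, and the example tokens are illustrative assumptions. The sketch operates on token strings for readability, whereas a real decoder would operate on token indices.

```python
import math
import random


def logistic_kl_weight(x, k, x0, L=1.0):
    """Equation 4.3: L / (1 + exp(-k * (x - x0)))."""
    return L / (1.0 + math.exp(-k * (x - x0)))


def word_dropout(tokens, rate, rng, unk="UNK"):
    """Replace each decoder input token by UNK with probability `rate`."""
    return [unk if rng.random() < rate else tok for tok in tokens]


rng = random.Random(0)
# Hypothetical schedule: steepness 0.0025, midpoint at training step 2500.
print(logistic_kl_weight(x=2500, k=0.0025, x0=2500))
# Hypothetical decoder inputs at three dropout rates.
print(word_dropout(["the", "cat", "sat"], rate=0.3, rng=rng))
print(word_dropout(["the", "cat", "sat"], rate=1.0, rng=rng))
print(word_dropout(["the", "cat", "sat"], rate=0.0, rng=rng))
```

A rate of 1.0 replaces every input token, so the decoder receives no word history and must rely entirely on the latent code z.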

4.2.3 Results

We primarily use the logistic annealing curve in our testing, since the choice of annealing strategy did not yield major differences in KL divergence stability. We report the total ELBO objective as optimized by the network in Table 4.1. Separate Cross-Entropy (Reconstruction) and KL divergence (Regularization) components of the ELBO are reported in Table 4.2.

E \ W-D    10%      30%      60%      100%
2          107.57   108.70   111.82   135.84
4          108.23   109.08   112.16   138.89
6          94.48    108.38   110.82   138.65
8          108.45   108.36   110.26   139.14

Table 4.1: ELBO for variations in Word-Dropout Rate vs. Number of Epochs with logistic annealing on the test split. In the table, the Word-Dropout rate is abbreviated as W-D and the number of epochs as E.
