
Poem2Image: Semantic Visualization of Classical Chinese Poetry


Layout: typeset by the author using LaTeX.


Poem2Image: Semantic Visualization of Classical Chinese Poetry
Semantics-preserving poem-to-image generation

Silvan Murre
11822872

Bachelor thesis
Credits: 18 EC
Bachelor Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: D. Li
Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 907
1098 XG Amsterdam


Abstract

Chinese classical poems are of great historical interest. These works of literature enable us to understand and appreciate the rich culture of China. However, it is difficult to understand these poems from a modern perspective, due to their extremely concise and compact style and the heavy use of figures of speech. This work aims to make classical Chinese poems more broadly interpretable by implementing a Generative Adversarial Network that paints a picture of a poem for us. The network applies an attention mechanism and a poem reconstruction technique to bridge the semantic gap between poems and images. The proposed implementation is able to generate images with slight semantic relations to their descriptions and real images. This result nevertheless opens up the possibility of further research in this area.


Contents

Abstract
1 Introduction
  1.1 Research questions
  1.2 Hypotheses
  1.3 Thesis overview
2 Background
  2.1 Generative Adversarial Networks
  2.2 Evolution of text-to-image generation
  2.3 BERT
3 Method
  3.1 Bidirectional LSTM
  3.2 BERT
  3.3 DAMSM
    3.3.1 Text Encoder and Image Encoder
    3.3.2 Loss functions
  3.4 LAM
    3.4.1 Loss functions
  3.5 Datasets
  3.6 Experiment setup
    3.6.1 Implementation details
    3.6.2 Evaluation metrics
4 Results
5 Discussion
  5.1 Future work
6 Conclusion
Bibliography
A Dataset details


1 Introduction

Classical Chinese poetry is precious cultural heritage. Rich Chinese history can be explored through poems dating back to the early 1st millennium BC (Frankel, 1986). These works of literature greatly influence modern Chinese society and global society, as they enable us to understand and appreciate Chinese culture. However, classical Chinese poems can be difficult to understand because of the heavy use of figures of speech as well as their extremely concise and compact style. The poets often used imagery to describe a certain scene or to evoke emotions. As the word imagery suggests, painting a picture might assist someone in understanding the semantics of a classical Chinese poem. Feng Zikai was an influential Chinese painter who committed to this idea by creating paintings that illustrated or mirrored classical Chinese poems (Barmé, 2002). This project aims to extend his work by developing a model that can generate images from classical Chinese poems. Recently, Generative Adversarial Networks (GANs) have proven to be successful at generating realistic images from given text descriptions (Qiao et al., 2019; Reed et al., 2016; Tsue, Sen, and Li, 2020; Xu et al., 2018; Zhang et al., 2017). However, the images generated by these models do not always semantically align with the given text. This is because there exists a semantic gap between the low-level pixel data of an image and the high-level text description, which is subjective to the use of language and expression of the annotator. Qiao et al. (2019) aim to bridge this gap by introducing a model called MirrorGAN, which contains an extra module that can regenerate text descriptions from the images generated by the GAN. The input of the GAN consists of word-level and sentence-level embeddings that are extracted from the text descriptions by a bidirectional Long Short-Term Memory (LSTM) network. However, this approach learns both embeddings from scratch, which can be unfavourable if the training data is sparse.

A word is embedded correctly when words that have a similar meaning share a similar representation. It is also argued that effective word embeddings should contain rich visual semantics (Mao et al., 2016). Due to the abstract nature of Chinese poems and the sparsity of datasets containing poem-to-image pairs, it is a challenging task to accurately embed Chinese words. There are several ambiguities one has to consider when segmenting words in order to obtain the word embeddings. First, Chinese sentences are written as continuous sequences of characters and do not contain explicit word delimiters such as a space. Second, the combination of two or more characters can form a meaningful word, yet separating them also produces words with different meanings. Finally, two identical sentences produce different meanings if they are segmented differently. Figure 1.1 depicts two examples involving the aforementioned ambiguities of the Chinese language.

A recent breakthrough in Natural Language Processing (NLP) is the use of Bidirectional Encoder Representations from Transformers (BERT) to extract word-level embeddings (Devlin et al., 2018). BERT stands out from previous language models as it is able to precisely learn the contextual representation of a word. Furthermore, BERT can also learn embeddings at character level instead of having to learn complete words. This prevents erroneous word segmentations from occurring due to the aforementioned ambiguities in Chinese sentence structure.


Figure 1.1: (a) Illustration of the different meanings that arise when combining and separating characters: 语言学起来很难。("Language is difficult to learn.") segments as 语言 (language) / 学 (learn) / 起来 / 很 (very) / 难 (difficult), whereas 语言学是一门学科。("Linguistics is a discipline.") segments as 语言学 (linguistics) / 是 (is) / 一门 (a) / 学科 (discipline). (b) Illustration of different segmentations of the identical sentence 乒乓球拍卖了。: segmented as 乒乓 / 球拍 / 卖了 it reads "The table tennis paddle was sold.", while segmented as 乒乓球 / 拍卖了 it reads "The table tennis ball was auctioned off."

Due to its capabilities, BERT appears to be an ideal fit for the datasets that will be used to perform the task of generating meaningful images from classical Chinese poetry. Tsue, Sen, and Li (2020) already explored the benefits of using a BERT model to derive word embeddings for text-to-image generation. To meet the requirement of semantic consistency, the model proposed by Tsue, Sen, and Li (2020) will be altered such that it can be effectively trained on classical Chinese poems. Additionally, multiple configurations of this model will be explored and evaluated to generate the most meaningful and high-quality images. Different quantitative and qualitative evaluation methods are used to assess the quality and variation of the generated images, and to determine the point of convergence when training the models. Lastly, the poem-to-image generation task will be performed on seven different datasets and, consequently, the quality of these datasets will also be assessed.
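To illustrate the character-level embedding approach, the snippet below is a minimal sketch of extracting character and sentence embeddings with a publicly available Chinese BERT checkpoint. It assumes the HuggingFace transformers package and the bert-base-chinese model; the thesis itself uses a poem-specific BERT model (see Section 3.6.1).

```python
# Minimal sketch (assumption: the HuggingFace `transformers` package and the
# public `bert-base-chinese` checkpoint are available).
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

poem_line = "乒乓球拍卖了"
# Chinese text is split into individual characters, avoiding explicit word
# segmentation and the ambiguities illustrated in Figure 1.1.
tokens = tokenizer.tokenize(poem_line)
print(tokens)  # ['乒', '乓', '球', '拍', '卖', '了']

# [CLS] and [SEP] are added automatically; the last-layer hidden states serve
# as character-level embeddings, the [CLS] state as a sentence-level embedding
# (cf. Section 3.3.1).
inputs = tokenizer(poem_line, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
word_embeddings = outputs.last_hidden_state[0, 1:-1]   # character tokens
sentence_embedding = outputs.last_hidden_state[0, 0]   # [CLS] token
```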

1.1 Research questions

Against this background, the central question that motivates this thesis is: "Can we generate high-quality and meaningful images conditioned on classical Chinese poems?" This thesis also aims to investigate the impact of using a pretrained BERT model to learn word embeddings, instead of learning them from scratch. The corresponding sub-question is: "What is the impact of using a pretrained language model


on Chinese poem-to-image generation?" Another important factor that greatly determines the interpretability of the generated images is the quality of the dataset that the model is trained on. This thesis will evaluate the performance of poem-to-image generation on six different datasets containing classical Chinese poems, all written in Simplified Chinese or Traditional Chinese. This naturally leads to the final sub-question: "What is the impact of using different classical Chinese poem-to-image pair datasets on image generation?"

1.2 Hypotheses

It was expected that basic concepts such as the background can be generated precisely, because these pieces of information can be captured more easily from images and full sentences. The generation of specific semantic concepts such as 'children' (孩子们 in Simplified Chinese) was expected to be rather poor due to multiple-character words and figures of speech. The results were expected to improve with the introduction of a pretrained BERT model. As the BERT model is pretrained on different Chinese data, it might be able to capture semantic information that the original model cannot capture because too few samples are present. Finally, it was expected that the size of a dataset would greatly affect the performance of the model.

1.3 Thesis overview

Chapter 2 discusses the emergence of GANs and their increasing importance and development for text-to-image generation tasks. Chapter 3 provides an in-depth explanation of the models that were used for the task of poem-to-image generation. The datasets, implementation details and evaluation metrics are also discussed. The various results of the experiments and the intuitions behind them are illustrated in Chapter 4. Finally, Chapter 5 discusses the results and future research, and in Chapter 6 a conclusion is given.


2 Background

2.1 Generative Adversarial Networks

GANs were introduced by Goodfellow et al. (2014) to address the difficulty of fitting the parameters of a deep generative model to maximize the likelihood of training data. A deep generative model can generate new data that is hardly distinguishable from the true data, if the learned distribution precisely captures the true data distribution. To capture the distribution of the training data, a probability density function (PDF) has to be explicitly defined. A PDF can be used to approximate the likelihood of a specific image appearing in a specified distribution of images. However, in practice the approximations are often rough and do not truly capture the data. Particularly with our task of generating realistic 256×256 images, every pixel of an image represents a variable that needs to be learned. Consequently, it would be difficult to precisely model the distribution of every image in a large dataset.

A GAN avoids the problem of having to explicitly define a density function by introducing the adversarial training of a generator and a discriminator. The generator can be thought of as a Convolutional Neural Network (CNN) that is reversed. Instead of applying convolution kernels across input images to extract feature maps, transposed convolutions are applied to feature maps to generate images. The generator tries to fool the discriminator by generating images that resemble the true data distribution. It takes random noise in the form of a fixed-length vector as input and learns to map this vector to points in the true data distribution. The discriminator takes as input images created by the generator and images from the true training set. It identifies the input as either real or fake by using a CNN that performs binary classification.

The training process of a GAN can be viewed as a minimax two-player game. The generator competes against the discriminator by trying to minimize the possibility of the discriminator identifying the input as fake. The discriminator in turn tries to minimize the possibility of the generator fooling it, by frequently assigning the fake label to fake images. The model is said to converge if a Nash equilibrium is reached (Salimans et al., 2016). A Nash equilibrium is reached when no player is better off by changing its strategy, regardless of what the opponent may do. However, the loss functions of both the generator and the discriminator are constantly being minimized by gradient descent, making it hard to find such an equilibrium point where both models perform optimally. This means that if the loss of the generator decreases, the discriminator will try to counter the generator to also minimize its own loss, making it difficult for the models to converge.
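The following is a minimal, self-contained sketch of this adversarial training scheme in PyTorch: a generator built from transposed convolutions, a CNN discriminator, and one alternating update step. The toy architectures and hyperparameters are illustrative and not those used in this thesis.

```python
# Minimal sketch of the minimax game described above (toy 64x64 architectures).
import torch
import torch.nn as nn

Z = 100  # dimension of the noise vector

generator = nn.Sequential(              # maps noise to a 64x64 RGB image
    nn.ConvTranspose2d(Z, 128, 4, 1, 0), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
    nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),
    nn.ConvTranspose2d(16, 3, 4, 2, 1), nn.Tanh(),
)
discriminator = nn.Sequential(          # CNN binary classifier: real vs. fake
    nn.Conv2d(3, 16, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(16, 32, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 8, 1, 0), nn.Flatten(), nn.Sigmoid(),
)

bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_images):
    batch = real_images.size(0)
    fake_images = generator(torch.randn(batch, Z, 1, 1))

    # Discriminator: label real images 1 and generated images 0.
    opt_d.zero_grad()
    loss_d = bce(discriminator(real_images), torch.ones(batch, 1)) + \
             bce(discriminator(fake_images.detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # Generator: try to make the discriminator assign the "real" label.
    opt_g.zero_grad()
    loss_g = bce(discriminator(fake_images), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```

In practice, train_step would be called inside a loop over batches of real images, alternating the two updates as described above.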

2.2 Evolution of text-to-image generation

Recent work in text-to-image generation has concentrated on the use of generative adversarial networks (GANs) for realistic image generation. Various models have


been proposed to generate visually realistic and textually relevant images. By default, the generator of a GAN takes random noise as input. Consequently, there is no control over the generation of specific features in images. In order to guide a GAN in what type of data to generate, Mirza and Osindero (2014) introduced conditional GANs. In a conditional GAN a class label can be concatenated with the input of both the generator and the discriminator. This is essential for the task of text-to-image generation, as the generation of images is conditioned on the input text.

Reed et al. (2016) proposed the first model that generates images from natural language instead of attribute representations. They used a conditional GAN to map words and characters directly to image pixels. A visually discriminative representation for text is automatically learned by using deep convolutional and recurrent text encoders that learn a correspondence function with images. However, Zhang et al. (2017) argue that the 64×64 images generated by this model are not of high resolution and often lack important details, such as the beaks and eyes of birds. Consequently, they proposed a model named StackGAN that generates realistic 256×256 images in two separate stages. In the first stage, low-resolution images are generated, and in the second stage high-resolution images are generated by correcting defects of the low-resolution image and further detailing objects.

The aforementioned models only focused on encoding the text description into a global sentence vector for image generation. Xu et al. (2018) argue that this method of representing text lacks important information at the word level. As a result, images generated by these models do not always semantically align with the given text. To address this issue, an Attentional Generative Adversarial Network (AttnGAN) is proposed. Similar to StackGAN, a low-resolution image is generated in the first stage using a global sentence vector. In the second stage, a word-level attention model is also used to guide the generator to focus on relevant words when drawing different image regions. AttnGAN also includes a third stage using the same attention model to generate 256×256 images from the 128×128 images generated in the previous stage.

Qiao et al. (2019) emphasize that generating high-resolution images using only word-level attention does not guarantee global semantic consistency due to the diversity between the text and image modalities. Consequently, a text-to-image-to-text framework named MirrorGAN consisting of three modules is proposed. The first module utilizes a bidirectional LSTM to embed the given text description into local word-level features and global sentence-level features. The second module adopts the structure of AttnGAN to generate word-level visual features. Additionally, a sentence-level attention model is implemented to generate sentence-level visual features. This allows for the generation of semantically consistent images when, for example, multiple sentences in a text share the same underlying semantic information. The third module regenerates text from the generated images, such that the semantic consistency between the original text and the image-to-text result can be measured in terms of loss. If the loss is minimal, the generated image acts like a mirror that precisely reflects the underlying text semantics.

The MirrorGAN architecture uses a bidirectional LSTM for text representation and can be further improved. As mentioned in the introduction, this approach learns the word embeddings and sentence embeddings from scratch. For the task of generating images from Chinese poems, a dataset that consists of poems written in Traditional Chinese can contain many different characters. Some characters rarely occur in the dataset and, as a result, the right representation may not be learned. Tsue, Sen, and Li (2020) suggest the use of a pretrained language model such as BERT to derive


the word embeddings. Their results show that the use of this model improves the semantic consistency between the generated images and the text descriptions.

2.3 BERT

BERT addresses the suboptimal use of context when learning text representations by training deep bidirectional representations. Previous language models such as OpenAI GPT were trained unidirectionally, which means that only the left or right context of a word was attended to (Radford et al., 2018). Peters et al. (2018) proposed ELMo to improve context awareness by conditioning on both the left and right context. Two Long Short-Term Memory networks (LSTMs) are independently trained in opposite word directions and then concatenated. They have to be trained independently, because otherwise words might indirectly see themselves via the context of other words in different layers. Additionally, the input is processed sequentially, i.e. word by word. As a result, each layer has to store all previous representations to learn the full context. This makes it difficult to learn more distant contextual relations between words or characters.

BERT surpasses these constraints by introducing a masked language model (MLM) to train deep bidirectional representations. Instead of predicting the words sequentially, the MLM randomly masks a percentage of the tokens from the input and tries to predict the masked words using only their context. Besides that, BERT is also trained to understand the relationships between sentences by trying to predict next sentences. Furthermore, BERT also sets itself apart from other language models through its effective use of transfer learning. Transfer learning allows us to pretrain the model on generic data and fine-tune it on specific downstream tasks, such as poem representation.


3 Method

The text-to-image generation architecture can be logically divided into two modules. First, the Deep Attentional Multimodal Similarity Module (DAMSM) produces word-level embeddings and sentence-level embeddings from the given text descriptions (Xu et al., 2018). It is multimodal in the sense that it uses information from both the text description and the image to compute both embeddings during training. The second module is a Local collaborative Attentive Module for cascaded image generation (LAM) (Qiao et al., 2019). In this module three image generation networks are stacked sequentially, and an attention model is used to guide focus to both previous image features and current contextual representations when generating subregions. Both modules are extended with an extra component that regenerates text from images to confirm that they are semantically aligned.

In this work two different language models were used to encode the text into word embeddings. The first language model is a bidirectional LSTM, which is described in Section 3.1. The second language model is the state-of-the-art BERT model, which is described in Section 3.2.

3.1 Bidirectional LSTM

LSTMs (Hochreiter and Schmidhuber, 1997) are capable of learning long-range dependencies by having the nodes decide to block or pass information from the previous output. A bidirectional LSTM is able to capture both left and right context by concatenating the output of two independent LSTMs at each time step. The first LSTM accepts the leftmost word at the first node and inserts the rest of the input from left to right in a sequential manner. In the second LSTM the units are programmed backwards. This LSTM accepts a reversed copy of the input sequence and runs it from right to left. The concatenated output represents the embeddings of the previous and next words. The structure of the bidirectional LSTM can be seen in Figure 3.1.
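As an illustration, the snippet below is a minimal sketch of such a bidirectional LSTM text encoder in PyTorch, producing per-word features and a global sentence vector by concatenating the last forward and backward hidden states (cf. Section 3.3.1). The vocabulary size, embedding size and hidden dimension are placeholders, not the values used in this thesis.

```python
# Minimal sketch of a bidirectional LSTM text encoder (hypothetical sizes).
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=384):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True runs a forward and a backward LSTM and concatenates
        # them, so each time step yields a 2 * hidden_dim feature vector.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, token_ids):
        out, (h_n, _) = self.lstm(self.embed(token_ids))
        words = out                                      # (B, T, 2*hidden_dim): word features e
        # Concatenate the last forward and last backward hidden states to form
        # the global sentence vector ē.
        sentence = torch.cat([h_n[0], h_n[1]], dim=-1)   # (B, 2*hidden_dim)
        return words, sentence

encoder = BiLSTMEncoder(vocab_size=5000)
words, sentence = encoder(torch.randint(0, 5000, (2, 18)))  # T = 18 as for CUB
```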

Although an LSTM outperforms an RNN in capturing long-term dependencies, it can still have trouble capturing the context of more distant words. For example, if the output of the rightmost word is computed, the information of the leftmost word has to travel sequentially through all other cells before it reaches the cell of the rightmost word. With all the small and large computations that are made in each cell, the information can easily be lost. Consequently, a bidirectional or basic LSTM is still occasionally confronted with the vanishing or exploding gradient problem (Verwimp, 2019). Furthermore, an LSTM can be computationally expensive, as every cell is passed sequentially and each layer in that cell performs computations. Moreover, an LSTM is not truly bidirectional; it learns the left-to-right and right-to-left contexts separately. As a result, the true context is not fully captured.


Figure 3.1: General structure of a bidirectional LSTM network.

3.2 BERT

BERT effectively avoids sequential processing by reading an entire sequence of words at once using deep bidirectional Transformers (Devlin et al., 2018). This also allows for parallel computation to train the model faster. Two sentences are processed in three different steps before BERT takes the sequences as input. First, the sequence is tokenized using WordPiece embeddings (Wu et al., 2016). The WordPiece tokenization is a data-driven process, where less frequent words are divided into pieces. Double hashtags (##) are prepended to each continuation piece to make them distinguishable from complete words. This method reduces the vocabulary size and makes sure that there are no out-of-vocabulary tokens. A character-based model is more suitable for the task of embedding Chinese text, because of data sparsity and the general difficulty of embedding words, as shown in Figure 1.1. For a character-based model it is sufficient to perform character-based tokenization. After tokenization, a classification token ([CLS]) is added at the start for classification tasks. A separation token ([SEP]) is added at the end of each sentence to indicate a sentence boundary. Second, each word is represented by a positional embedding to learn information about the order of the input and to account for similar words in different sentences having different meanings. This is done because Transformers do not capture the sequential nature of the input. Third, segment embeddings are used to indicate whether a token belongs to sentence A or sentence B. With segment embeddings the embedding of a complete sentence can be learned. The three embeddings are summed to form the final input representation for BERT. The input representation is shown in Figure 3.2.
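A minimal sketch of how these three embeddings combine into the input representation is given below; the vocabulary size, token ids and maximum length are illustrative placeholders rather than the exact values of the model used in this thesis.

```python
# Minimal sketch of the BERT input representation (token + position + segment).
import torch
import torch.nn as nn

hidden_size, vocab_size, max_len = 768, 21128, 512

token_embed = nn.Embedding(vocab_size, hidden_size)     # WordPiece / character ids
position_embed = nn.Embedding(max_len, hidden_size)     # order of the tokens
segment_embed = nn.Embedding(2, hidden_size)            # sentence A vs. sentence B

# Example sequence "[CLS] tok tok [SEP] tok [SEP]" (middle ids are dummies).
token_ids = torch.tensor([[101, 2769, 4263, 102, 3209, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1]])
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

# The three embeddings are summed element-wise to form the input representation
# that the first Transformer encoder block receives (Figure 3.2).
input_repr = (token_embed(token_ids)
              + position_embed(position_ids)
              + segment_embed(segment_ids))
```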

A BERT model is trained on two unsupervised tasks simultaneously. The first task uses a masked language model to help understand bidirectional context. 15% of the tokens are randomly chosen and masked in different ways: 80% of the chosen tokens are replaced with the [MASK] token, 10% are replaced with a random token and the remaining 10% are left unchanged. The reasoning behind this is that the [MASK] token does not appear during fine-tuning. The introduction of random or unchanged tokens ensures that the model is not only optimized for predicting masked tokens; it also encourages the model to always take the contextual representation of every input token into account (Devlin et al., 2018). With the task of token prediction, BERT learns a representation of each word based on its context.
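The 80/10/10 masking scheme can be sketched as follows; the token ids and the [MASK] id are placeholders, and the label value -100 follows the common convention for positions that are not predicted.

```python
# Minimal sketch of the 80/10/10 masking scheme described above.
import random

MASK_ID = 103  # id of the [MASK] token in the vocabulary (placeholder)

def mask_tokens(token_ids, vocab_size, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100: not predicted
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                    # the model must recover this token
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: random token
            # remaining 10%: leave the token unchanged
    return inputs, labels
```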


Figure 3.2: The input representation of a BERT model (Devlin et al., 2018).

The second task consists of predicting the next sentence, which assists BERT in understanding the context between sentences. During training, BERT receives an input pair of sentences where 50% of the time the second sentence is the actual subsequent sentence and the other 50% of the time a random sentence is given. The goal is to correctly classify the second sentence as real or fake.

The Transformer is a self-attention mechanism that consists of an encoder component and optionally a decoder component (Vaswani et al., 2017). The components can be subdivided into encoder blocks and decoder blocks. The first encoder block takes the input embedding, as detailed in Figure 3.2, as input. The next encoder block takes the output of the previous encoder as input, and so on. In each encoder the input goes through a multi-head attention layer and a feedforward layer. The multi-head attention layer takes multiple sets of query, key and value matrices with corresponding weight matrices as input. The query matrix represents the current words for which the embedding is to be learned. The key matrix represents the words that the query matrix is matched against. The value matrix is used to preserve words that are important for learning the context. One row vector of each matrix corresponds to a word in the input sequence. In the first encoder block these matrices are randomly initialized. Each input sequence is represented by a set of matrices; this guides the model to jointly focus on different positions when embedding a word, hence the name 'multi-head'. The attention of one head is calculated as follows:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q \cdot K^{\top}}{\sqrt{d_k}}\right) \cdot V. \tag{3.1}
\]

The dot product between the query matrix Q and the transposed key matrix K^⊤ is computed to express how similar each word is to every other word in the input sequence. The outcome is divided by the square root of the dimension of the key vectors to avoid large outputs. Softmax maps the values to a probability distribution. Softmax is sensitive to large values, so avoiding them helps keep the gradients stable (Vaswani et al., 2017). Taking the dot product between the softmax output and the weighted value matrix V forms the output representation. The multi-head attention over all matrices is calculated as follows:

\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\quad \text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}). \tag{3.2}
\]


The outputs of the multiple heads are therefore concatenated and multiplied with a weight matrix W^O to focus on the best context regions. The weights W_i^Q ∈ R^{d_model×d_k}, W_i^K ∈ R^{d_model×d_k}, W_i^V ∈ R^{d_model×d_k} and W^O ∈ R^{h d_v×d_model} that each matrix is multiplied with are learned throughout the training phase (Vaswani et al., 2017). The context matrix that is obtained via Equation 3.2 is the output of the multi-head attention layer. Each word vector in the output matrix contains information about the positions of every other word vector. The feed-forward neural network, as opposed to the attention layer, encodes the word vector at each position separately and identically into higher-level representations (Lu et al., 2019).

The Transformer is trained to learn word embeddings by minimizing the combined loss functions of the masked language model and the next sentence prediction task. For both tasks a classification layer is added on top of the encoder output. The masked language model masks a random token, and the Transformer attempts to predict the original context matrix of the masked token. The context matrix of the masked word is then multiplied with the original context matrix. Finally, softmax is used to calculate the probabilities of each word vector in the matrix. In the next sentence prediction task, the classification layer takes the sentence-level representation token [CLS] as input. The output of the classification layer is a 2×1 vector. The softmax function is computed to obtain the probability of the next sentence being the actual subsequent sentence and the probability of the next sentence being a random sentence. Altogether, the Transformer encoder produces word embeddings for a given input sequence. The complete architecture is referred to as the BERT language model. In the fine-tuning phase, classification layers can be added to optimize the model for downstream tasks, in this case text-to-image generation.

3.3 DAMSM

In order to ensure text-image similarity, the DAMSM trains a text encoder and an image encoder with the same loss functions. As mentioned before, either a bidirectional LSTM or a pretrained BERT model is used as the text encoder to extract word embeddings and sentence embeddings. The image encoder utilizes an Encoder-Decoder architecture to compute global features and local features of images. It also regenerates the captions from images to measure the visual-semantic similarity between text-to-image pairs (Qiao et al., 2019; Tsue, Sen, and Li, 2020).

During pretraining, the image encoder regenerates captions and extracts image features from the original images. The losses are minimized to compute precise embeddings and features. When the image generation networks from the LAM are trained, the pretrained image encoder regenerates captions and extracts image features from the generated images. The calculated losses are then used to train the image generation network instead.

3.3.1 Text Encoder and Image Encoder

The bidirectional LSTM concatenates each forward hidden state and backward hidden state h_t to form a feature matrix of words (Xu et al., 2018). The feature matrix of words e ∈ R^{D×T} is taken from the last layer of the LSTM, where D is the word embedding dimension and T is the number of words. Next to the word embeddings, the output of the text encoder also consists of sentence embeddings. The last forward hidden state and backward hidden state are concatenated to form the global sentence vector ē ∈ R^D, which is also taken from the last layer of the LSTM. The last


forward hidden state and backward hidden state are essentially the word vectors of the left- and rightmost words. The concatenation functions as a sentence vector because both words capture the full left-to-right context and right-to-left context of an input sequence. For BERT, the sequence of hidden states at the last layer is extracted to form the feature matrix of words e. The [CLS] token represents the sentence-level embedding and can be found at the beginning of a sequence. This token represents the global sentence vector and is extracted from the initial hidden state of the last layer.

As previously stated, the image encoder takes the shape of an Encoder-Decoder architecture. The Inception-v3 model (Szegedy et al., 2016) pretrained on ImageNet (Russakovsky et al., 2015) is the CNN used as the encoder part of the architecture. As the Inception-v3 model is built to take 299×299 images as input, the input images are resized to these dimensions beforehand. Additionally, the images are cropped randomly and flipped horizontally 50% of the time to make the model robust to noisy images. The local feature matrix is extracted from the sixth layer of the image encoder. A 1×1 convolutional layer is applied to map the feature matrix to the same dimension as the text features. The local feature matrix becomes v ∈ R^{D×289}, where 289 is the number of subregions in the image. Each column of the local feature matrix denotes the feature vector of a subregion. The global feature vector v̄ ∈ R^D is retrieved from the last average pooling layer of the image encoder.

In the DAMSM, the decoder part of the architecture is tasked with predicting the captions for the original image, which can be compared to the original caption to measure the semantic consistency between the text and the images. The decoder is a bidirectional LSTM, however, the outputs in each cell are processed differently. At a given cell state c_t it predicts the output h_{t+1} for the next cell state c_{t+1}, instead of h_t (Soh, 2016; Vinyals et al., 2015). The reason for this is that the input x_t, which is the original word vector that is compared against the predicted word vector, remains unseen at the prediction. The global feature vector that the CNN outputs is used as input for the initial hidden state h_0 and cell state c_0. The input x is the feature matrix of words from the text encoder, which is computed either by the bidirectional LSTM or by the BERT language model. Applying the softmax function to h_{t+1} converts the predicted vector into a probability distribution. Each predicted vector contains a probability for every word in the vocabulary. The LSTM is made bidirectional by simply adding a backward layer that is concatenated with the forward layer, and reversing the feature matrix of words for the backward layer.

3.3.2 Loss functions

Five different losses are computed and combined to jointly pretrain the text encoder and image encoder in the DAMSM. These losses are also included in the training process of the LAM module. The first step is computing the dot product between the feature matrix of words e and the local feature matrix as follows:

\[
s = e^{\top} \cdot v, \tag{3.3}
\]
where s ∈ R^{T×289} is the similarity between every word-subregion pair in the form of a matrix. The similarity matrix is then normalized as follows:
\[
\bar{s}_{ij} = \frac{\exp(s_{ij})}{\sum_{k=1}^{T} \exp(s_{kj})}, \tag{3.4}
\]


where s_ij is the dot-product similarity between the i-th word and the j-th subregion. The context of each word is captured by computing the weighted sum over all subregion feature vectors:
\[
r_i = \sum_{j=1}^{289} \alpha_j v_j, \quad \text{where } \alpha_j = \frac{\exp(\gamma_1 \bar{s}_{ij})}{\sum_{k=1}^{289} \exp(\gamma_1 \bar{s}_{ik})}, \tag{3.5}
\]

where r_i is the region-context vector for the i-th word. γ_1 can be altered to decide how much emphasis should be placed on each subregion of the image when computing the region-context vector for a word. Next, another measure of similarity is introduced to measure the relevance between the i-th word and the entire image:
\[
\mathrm{similarity}(r_i, e_i) = \frac{r_i^{\top} \cdot e_i}{\lVert r_i \rVert\,\lVert e_i \rVert}. \tag{3.6}
\]

This is the cosine similarity; it measures the cosine of the angle between two vectors and determines how similar they are based on the directions they point to, regardless of their magnitude. Subsequently, the entire image Q as a region-context vector is matched against the complete text description D to produce the attention-driven image-text matching score. The similarity between each word and the context of the whole image is computed and then summed up and mapped to a logarithmic scale to prevent extreme outliers from being overly dominant. The equation is as follows:

\[
R(Q, D) = \log\left( \sum_{i=1}^{T} \exp\big(\gamma_2\, R(r_i, e_i)\big) \right)^{\!1/\gamma_2}. \tag{3.7}
\]

γ_2 can be altered accordingly to determine how much emphasis should be placed on the most similar region-context vector r_i and word vector e_i.

During pretraining, batches of text-image pairs are simultaneously put through the model. This allows for semi-supervised classification tasks to be performed on the sequences. The only task that can be supervised, however, is predicting whether an image is matched with a text description. This is done by calculating the posterior probability of a text description D_i being matched with image Q_i, for a batch of B image-text pairs {(Q_i, D_i)}_{i=1}^{B}:
\[
P(D_i \mid Q_i) = \frac{\exp(\gamma_3 R(Q_i, D_i))}{\sum_{j=1}^{B} \exp(\gamma_3 R(Q_i, D_j))}. \tag{3.8}
\]

The posterior probability of an image Q_i being matched with text description D_i can be computed symmetrically:
\[
P(Q_i \mid D_i) = \frac{\exp(\gamma_3 R(Q_i, D_i))}{\sum_{j=1}^{B} \exp(\gamma_3 R(Q_j, D_i))}. \tag{3.9}
\]

What is done in Equation 3.8 and Equation 3.9 is that the matching score of a given pair is calculated and then divided by the sum of the matching scores over all B candidates, i.e. the matching pair and the B−1 mismatching candidates. The loss function at the word level can then be defined as:

\[
\mathcal{L}_1^{w} = -\sum_{i=1}^{B} \log P(D_i \mid Q_i), \tag{3.10}
\]
\[
\mathcal{L}_2^{w} = -\sum_{i=1}^{B} \log P(Q_i \mid D_i), \tag{3.11}
\]
where L^w_1 is the negative log posterior probability that the sequences of word vectors are matched with the images, and vice versa for L^w_2. The loss function at the sentence level is retrieved by redefining Equation 3.7 as:

\[
R(Q, D) = \frac{\bar{v}^{\top} \cdot \bar{e}}{\lVert \bar{v} \rVert\,\lVert \bar{e} \rVert}, \tag{3.12}
\]
where v̄ is the global image feature vector and ē is the global sentence vector. Substituting this into Eq. 3.8, Eq. 3.9, Eq. 3.10 and Eq. 3.11 results in the loss functions L^s_1 and L^s_2. L^s_1 is the loss for matching the sentences with the global image feature vectors, and vice versa for L^s_2.

The final loss is the text-semantic reconstruction loss. It is computed as follows:

\[
\mathcal{L}_{CE} = -\frac{1}{M} \sum_{i=1}^{B} \sum_{p=1}^{|V|} y_p^{(i)} \log\big(\hat{y}_p^{(i)}\big), \tag{3.13}
\]

where B is the batch size, |V| is the length of the vocabulary, y_p^{(i)} is the binary label at position p of the i-th batch entry, and ŷ_p^{(i)} is the predicted output between 0 and 1 at position p of the i-th batch entry. Finally, all losses are summed into the DAMSM loss:

\[
\mathcal{L}_{DAMSM} = \lambda_w(\mathcal{L}_1^{w} + \mathcal{L}_2^{w}) + \lambda_s(\mathcal{L}_1^{s} + \mathcal{L}_2^{s}) + \lambda_{CE}\,\mathcal{L}_{CE}, \tag{3.14}
\]
where λ_w, λ_s and λ_CE are the loss weights that control the importance of the word-local-feature losses, the sentence-global-feature losses and the text-semantic reconstruction loss, respectively.
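The snippet below is a minimal sketch of the word-level matching in Equations 3.3-3.10, computed on random tensors. It mirrors the structure of the loss rather than the exact implementation of this thesis; the γ values follow Section 3.6.1.

```python
# Minimal sketch of the word-level DAMSM matching score and loss (Eqs. 3.3-3.10).
import torch
import torch.nn.functional as F

gamma1, gamma2, gamma3 = 4.0, 5.0, 20.0
B, D, T, R_ = 4, 768, 18, 289             # batch, embed dim, words, subregions

e = torch.randn(B, D, T)                   # word features from the text encoder
v = torch.randn(B, D, R_)                  # local image features

def matching_score(e_i, v_i):
    s = e_i.t() @ v_i                                 # (T, 289), Eq. 3.3
    s = F.softmax(s, dim=0)                           # normalize over words, Eq. 3.4
    alpha = F.softmax(gamma1 * s, dim=1)              # attention over subregions
    r = alpha @ v_i.t()                               # (T, D) region-context vectors, Eq. 3.5
    sim = F.cosine_similarity(r, e_i.t(), dim=1)      # Eq. 3.6
    return torch.log(torch.exp(gamma2 * sim).sum()) / gamma2   # Eq. 3.7

# scores[i][j] is the matching score R(Q_j, D_i) between description i and image j.
scores = torch.stack([torch.stack([matching_score(e[i], v[j]) for j in range(B)])
                      for i in range(B)])
# Posterior over descriptions per image (Eq. 3.8) and the word-level loss (Eq. 3.10).
loss_w1 = F.cross_entropy(gamma3 * scores.t(), torch.arange(B))
```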

3.4 LAM

LAM consists of three sequentially stacked generators, which gradually generate images of higher quality. The pretrained text encoder from the DAMSM computes the word vectors and sentence vectors used for image generation. The first generator takes a noise vector and an augmented sentence vector as input. The noise vector z ∈ R^Z ~ N(0, 1) is sampled from a normal distribution, where Z is the dimension of the noise vector. Due to a limited amount of training data, sentences with few permutations may appear. Consequently, the generator can get stuck generating identical images from similar sentences. To resolve this problem, the Conditioning Augmentation technique is applied to randomly smooth out the sentence vector and produce more varying text-image pairs (Zhang et al., 2017). The sentence vector ē is fed into a fully connected layer to compute the means μ_ē and standard deviations σ_ē. The augmented sentence vector ē_ca is then computed as follows:
\[
\bar{e}_{ca} = \mu_{\bar{e}} + \sigma_{\bar{e}} \odot \epsilon, \tag{3.15}
\]
where ē_ca ∈ R^D and ε is sampled from N(0, I_D), with I_D the identity matrix of dimension D. The noise vector z and the augmented sentence vector ē_ca


are concatenated and fed into four upsampling blocks to gradually generate a visual feature f_0 ∈ R^{M_0×N_0×N_0}, where M is the dimension of the visual embedding and N is the dimension of the visual feature. In the second and third stages, image generation is conditioned on the visual feature from the previous stage and a word-level attention model. The image generation process can be described as follows:

\[
\begin{aligned}
f_0 &= F_0(z, \bar{e}_{ca}), \\
f_i &= F_i\big(f_{i-1},\, F_i^{attn}(e, f_{i-1})\big), \quad i \in \{1, 2\}, \\
\tilde{I}_i &= G_i(f_i), \quad i \in \{0, 1, 2\},
\end{aligned} \tag{3.16}
\]

where f_i ∈ R^{M_i×N_i×N_i} and Ĩ_i ∈ R^{N_i×N_i}. {F_0, F_1, F_2} are visual feature transformers, which concatenate their input and feed it into one or more upsampling blocks to generate the visual features f_i. For F_1 and F_2 the concatenation is first fed into residual blocks, before being upsampled, to learn multi-modal representations across the visual features and the text features (Zhang et al., 2017). Finally, the generator utilizes a convolutional layer with a 3×3 kernel to transform the visual features f_i into images Ĩ_i.

The attention model F_i^{attn} takes the word vectors e and the visual feature f_{i−1} from the previous stage as input. First, a perceptron layer converts the word embedding into the same semantic space as the previous visual feature. The dot product of the semantic word embedding and the previous visual feature is calculated to obtain the attention score. A softmax over these scores determines how much weight the model assigns to each word when generating a sub-region. Finally, the weighted sum of the semantic word embeddings and the attention weights is computed to obtain the attentive word-context feature Att_i^w:
\[
Att_i^{w} = \sum_{l=1}^{T} \beta_l\, (U_{i-1} e_l), \quad \text{where } \beta_l = \frac{\exp\big(f_{i-1}^{\top}(U_{i-1} e_l)\big)}{\sum_{k=1}^{T} \exp\big(f_{i-1}^{\top}(U_{i-1} e_k)\big)}, \tag{3.17}
\]
where U_{i−1} ∈ R^{M_{i−1}×D}, Att_i^w ∈ R^{M_{i−1}×N_{i−1}×N_{i−1}}, and β_l are the attention weights for the l-th word.

The matching-aware discriminator proposed by Reed et al. (2016) is adopted in each stage to optimize both real/fake discrimination and the matching of image and description. The discriminator takes positive sample pairs and negative sample pairs as input, which should be scored as real and fake, respectively. The positive pairs consist of real images with corresponding descriptions. The negative pairs consist of two groups: fake images with corresponding descriptions, and real images with mismatched descriptions.

Each discriminator D_i gradually down-samples the image into a 4×4×8C feature, where C is the number of channels in the tensor (Zhang et al., 2017). In the second group of negative pairs the images are conditioned on the descriptions. The mismatched description is transformed into the same dimension as the feature so that they can be concatenated. The other pairs are not conditioned on each other, because only the image has to be classified as real or fake. For the unconditional classification, the feature is fed into a 1×1 convolutional layer to map all channels to a single value. Subsequently, the sigmoid function is used to output the probability of the image being real or fake. The conditional classification follows the same process, but the feature is first concatenated with the description as described before.


3.4.1 Loss functions

At each stage of the training process the discriminator tries to minimize the probability of predicting the wrong label. The final loss function of the discriminator is defined as:

\[
\begin{aligned}
\mathcal{L}_{D_i} = {}& \underbrace{-\tfrac{1}{2}\mathbb{E}_{I_i \sim p_{data_i}}\big[\log D_i(I_i)\big] - \tfrac{1}{2}\mathbb{E}_{\tilde{I}_i \sim p_{G_i}}\big[\log(1 - D_i(\tilde{I}_i))\big]}_{\text{unconditional loss}} \\
&\underbrace{-\tfrac{1}{2}\mathbb{E}_{I_i \sim p_{data_i}}\big[\log D_i(I_i, \bar{e})\big] - \tfrac{1}{2}\mathbb{E}_{\tilde{I}_i \sim p_{G_i}}\big[\log(1 - D_i(\tilde{I}_i, \bar{e}))\big]}_{\text{conditional loss}},
\end{aligned} \tag{3.18}
\]

where I_i is drawn from the true image distribution p_{data_i} at the i-th stage. D_i(I_i) and D_i(Ĩ_i) denote the probability that a true image is real and that a generated image is real, respectively. For D_i(I_i, ē) and D_i(Ĩ_i, ē) the same process is followed, but the images are conditioned on their corresponding descriptions. Each loss at the i-th stage is backpropagated through discriminator D_i, such that each discriminator focuses on a single image scale.

At each stage of the training process the generator tries to maximize the probability of the discriminator predicting the wrong label. Contrary to the discriminators, the losses from every i-th stage are summed up. This is because each image is conditioned on the visual features from the previous stage(s) and, consequently, so is every loss. The loss function of the generator is defined as:

\[
\mathcal{L}_G = \sum_{i=1}^{3} \Big( \underbrace{-\tfrac{1}{2}\mathbb{E}_{\tilde{I}_i \sim p_{G_i}}\big[\log D_i(\tilde{I}_i)\big]}_{\text{unconditional loss}} \; \underbrace{-\,\tfrac{1}{2}\mathbb{E}_{\tilde{I}_i \sim p_{G_i}}\big[\log D_i(\tilde{I}_i, \bar{e})\big]}_{\text{conditional loss}} \Big). \tag{3.19}
\]

The final loss function of the generator is composed of L_G, the DAMSM loss of the last stage and the loss from the Conditioning Augmentation process:

\[
\begin{aligned}
\mathcal{L}_G &= \mathcal{L}_G + \mathcal{L}_{DAMSM} + D_{KL}\big(\mathcal{N}(\mu_{\bar{e}}, \sigma_{\bar{e}}) \,\|\, \mathcal{N}(0, I_D)\big), \\
\text{where } D_{KL}(P \,\|\, Q) &= -\sum_{x \in \chi} P(x) \log\!\left(\frac{Q(x)}{P(x)}\right), 
\end{aligned} \tag{3.20}
\]

where x is a generated sample. Including the DAMSM losses from previous stages does not improve the performance of the generative network (Xu et al., 2018). The Kullback-Leibler (KL) divergence loss D_KL is added to further enhance the smoothness of the sentence vectors (Zhang et al., 2017). It measures the difference between the standard Gaussian distribution and the conditioning Gaussian distribution. The final loss of each network is back-propagated through its convolutional layers.
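As an illustration of Equations 3.15 and 3.20, the snippet below is a minimal sketch of the Conditioning Augmentation step together with its KL regularizer. Mapping the sentence vector to a conditioning vector of the same dimension D is a simplifying assumption.

```python
# Minimal sketch of Conditioning Augmentation (Eq. 3.15) with its KL term (Eq. 3.20).
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    def __init__(self, embed_dim=768):
        super().__init__()
        # A single fully connected layer predicts both means and log-variances.
        self.fc = nn.Linear(embed_dim, embed_dim * 2)

    def forward(self, e_bar):
        mu, logvar = self.fc(e_bar).chunk(2, dim=-1)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)                  # epsilon ~ N(0, I_D)
        e_ca = mu + std * eps                        # Eq. 3.15 (reparameterization)
        # KL divergence to N(0, I_D), the regularizer added to the generator loss.
        kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
        return e_ca, kl

ca = ConditioningAugmentation()
e_ca, kl_loss = ca(torch.randn(4, 768))
```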

3.5 Datasets

First, the Caltech-UCSD Birds (CUB) dataset (Wah et al., 2011) is used to reproduce the results of the model proposed by Tsue, Sen, and Li (2020), which is the model to be altered. The dataset contains 11,788 image-to-text pairs, each belonging to one of 200 different classes, with each class representing a single bird species. Image-to-text means that a text description is specifically written for the image.


To perform the task of generating images from classical Chinese poems, four different datasets are used. The first two datasets are constructed by Sun (2020). The Famous Poem (FP) dataset consists of famous Chinese poems scraped from a website [1]. The Poem Line (PL) dataset consists of regular Chinese poems downloaded from a GitHub repository [2]. The poems are split line by line and each line is used as a query to

find ten corresponding images. A style classification model is then applied to extract the image that most resembles a Chinese painting. As a result, text-to-image pairs are retrieved where an image is found for a specific text description. The other two datasets are constructed by Nieuwburg (2020). In the Title-Image (TI) dataset the text descriptions are titles of images. The corresponding images are paintings made by Feng Zikai. The Poem-Image (PI) dataset consists of poems for which Feng Zikai has created paintings.

The titles occasionally contained English characters; consequently, titles containing more than one English character were dropped. Additionally, the two datasets constructed by Sun (2020) are combined to generate more training samples. The same is done for the datasets constructed by Nieuwburg (2020). Furthermore, the combination introduces new semantic information to both datasets, which is beneficial for broadly-themed datasets. The details of each dataset are presented in Table A.1.

3.6 Experiment setup

3.6.1 Implementation details

The images are first resized to 299×299, as the Inception-v3 model only accepts images of this input size. In this work three different configurations of the model are trained and evaluated. The first configuration, LSTM-GAN, utilizes the bidirectional LSTM as a text encoder. The second configuration, BERT-GAN, utilizes a pretrained BERT model as a text encoder. For the CUB dataset the pretrained BERT model named bert-base-uncased [3] is used, and the RegExpTokenizer [4] is utilized to split the sentences into words. Vaudrin (2020) trained a BERT model specifically for embedding classical Chinese poems; this model is therefore used with the datasets containing classical Chinese poems. These datasets require the sentences to be split into characters, such that the character-based text encoders can be used effectively. The third configuration, BERT-GANtkn, is similar to the second configuration, except that the

tokenizer that comes with the BERT model is used to tokenize text descriptions. The size of the vocabulary as a result of using the different tokenizers is shown in Table A.1.

For each model, the DAMSM is pretrained until its respective losses stabilize, indicating that the computed embeddings and visual features are optimal. During the training of the LAM, the weights of the DAMSM are frozen. The generator and discriminator are trained alternately at each stage. When the generator is trained, the weights of the discriminator are frozen. This is because the generator loss is computed and backpropagated via the discriminator; consequently, we do not want to update the discriminator's weights while training the generator.

[1] www.gushiwen.com
[2] www.github.com/werneror/poetry
[3] https://huggingface.co/transformers/pretrained_models.html
[4] https://www.nltk.org/_modules/nltk/tokenize/regexp.html


The word embedding dimension D was set to 768. The number of words T was set to 18 for CUB. For the Chinese datasets the word count was set to the maximum number of characters found across all sentences. The dimension Z of the noise vector was set to 100. The visual embedding dimension M_i was set to 32. The dimension N_i of the visual feature was set to 64, 128 and 256 for the first, second and third stage, respectively. The balancing parameters γ_1, γ_2 and γ_3 were set to 4, 5 and 20, respectively. For CUB, the loss weights λ_w, λ_s and λ_CE were all set to 20. For the Chinese datasets the loss weights were increased to 50 to place more emphasis on the text-image similarity and the visual-semantic similarity.

3.6.2 Evaluation metrics

Two quantitative evaluation measures are used to evaluate the generated 256×256 images. The first measure is the Inception Score (IS) (Salimans et al., 2016), which measures the quality and variety of the images. The label distribution of an image is compared to the marginal label distribution of all images to determine how much they differ. The label distribution is the range of probabilities that an image belongs to different class labels. It is ideally narrow, which means that an image is distinct. The marginal label distribution is computed by summing the label distributions of all images. This distribution is ideally uniform, which means that the images collectively belong to many different classes. The Inception-v3 classification model is used to predict the probabilities for each class label. The similarity between the distributions is measured by the KL divergence:

\[
IS(G) = \exp\Big( \mathbb{E}_{\tilde{I} \sim p_G}\, D_{KL}\big(p(y \mid \tilde{I}) \,\|\, p(y)\big) \Big), \tag{3.21}
\]

where p(y | Ĩ) is the label distribution and p(y) is the marginal distribution. The more dissimilar the distributions are, the larger the KL divergence. A large KL divergence subsequently translates into a large IS.
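A minimal sketch of computing the Inception Score from a matrix of predicted class probabilities (Equation 3.21) is given below; the probabilities here are random placeholders rather than real Inception-v3 predictions.

```python
# Minimal sketch of the Inception Score (Eq. 3.21) from predicted class probabilities.
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: (N, num_classes) label distributions p(y | I) for N generated images
    marginal = probs.mean(axis=0, keepdims=True)            # p(y)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                         # exp(E[KL(p(y|I) || p(y))])

probs = np.random.dirichlet(np.ones(1000), size=500)        # placeholder predictions
print(inception_score(probs))
```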

The second measure is R-precision (RP), which is used to evaluate the visual-semantic similarity. For each generated image, its ground-truth description and 99 random mismatched descriptions are gathered. Then the similarity between the descriptions and the image is measured as a value, similar to Equation 3.3 and Equation 3.4. If the ground-truth entry falls inside the top-R entries with the highest scores, the R-precision score increases. Finally, the metric returns the percentage of entries that fall inside the specified top-R entries. In this work, only the top-1 entries are considered. As mentioned before, the DAMSM is pretrained until its losses stabilize. However, it is not possible to determine this stabilization point beforehand. Therefore the DAMSMs of the different models were pretrained for a specified number of epochs. The DAMSM was pretrained for 100 epochs on the larger datasets PL and FP+PL. For the other datasets the number of epochs was set to 600. Afterwards, the word loss, sentence loss and text-semantic reconstruction loss in both the training stage and the validation stage were visualized to determine the epoch at which the model performs optimally. The same was done for training the LAM on the different datasets, except that the number of epochs was increased up to 2000 for the smaller datasets. Here, the loss of the generators and the summed loss of the discriminators were visualized to determine the convergence point of the generator and discriminator.
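The R-precision evaluation with R = 1 described above can be sketched as follows; the similarity function here is a placeholder cosine similarity over random features, standing in for the attention-driven matching score.

```python
# Minimal sketch of R-precision with R = 1 (placeholder features and similarity).
import torch
import torch.nn.functional as F

def r_precision(image_feats, text_feats, R=1):
    # image_feats: (N, D) global image features; text_feats[i, 0] is the ground
    # truth description of image i, text_feats[i, 1:] are 99 mismatched ones.
    hits = 0
    for img, candidates in zip(image_feats, text_feats):
        scores = F.cosine_similarity(img.unsqueeze(0), candidates, dim=1)
        top = scores.topk(R).indices
        hits += int(0 in top)        # index 0 is the ground-truth description
    return 100.0 * hits / len(image_feats)

score = r_precision(torch.randn(50, 768), torch.randn(50, 100, 768))
```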

Additionally, the weighted attention scores β_l are upsampled and rescaled to the same size as the image. They are then mapped onto the image to show which regions the model attends to during image generation, given a specific word.


4 Results

The goal of this work is to generate high-quality and meaningful images conditioned on classical Chinese poems. The Inception Score and R-precision score provide a quantitative assessment of this task. The scores are calculated for each model and each Chinese poem dataset. The results are shown in Table 4.1. The results are compared against the scores obtained on the CUB dataset, as the text-to-image generation task has already been successfully executed on this dataset. The scores for each model trained on the CUB dataset are shown in Table 4.2; additionally, the results of two baseline models are included. The highest inception score of 7.90 is obtained by the LSTM-GAN model trained on the FP+PL dataset.

Model, λ = 50     LSTM-GAN          BERT-GAN          BERT-GANtkn
Dataset           IS      RP(%)     IS      RP(%)     IS      RP(%)
FP                6.54    60.05     6.42    41.12     7.83    33.64
PL                6.93    75.63     5.14     4.23     7.80    45.93
FP+PL             7.90    71.94     5.30     2.77     7.89    37.96
TI                3.55    45.85     3.57    33.78     3.42    27.44
PI                1.69     4.00     2.10    10.67     3.62     8.00
TI+PI             3.98    40.67     3.71    24.58     3.45    26.70

Table 4.1: Inception scores and R-precision scores for each model and dataset.

It also has the second-best R-precision score of 71.94%, making it the best-performing configuration. The inception score is 85% higher than the highest score reported by our best model trained on the CUB dataset. The R-precision score is 1.45% higher. Even though the dataset is approximately 7.5 times bigger, the scores are still impressive when one keeps in mind that FP+PL mainly consists of Chinese poems for which a corresponding image had to be found. The FP, PL and FP+PL datasets trained with the LSTM-GAN generally perform well. Noteworthy are the high inception scores and low R-precision scores that result from BERT-GANtkn trained on FP, PL and FP+PL.

To determine the impact of using a pretrained language model on Chinese poem-to-image generation, the overall inception scores of each model are first compared with each other. The inception scores of BERT-GANtkn are generally 29.61% better than the scores of BERT-GAN and 11.18% better than those of LSTM-GAN. The R-precision scores of LSTM-GAN are generally 154.5% better than those of BERT-GAN and 65.94% better than those of BERT-GANtkn. The models that are described in this work are almost identical to CycleGAN w/ BERT (Tsue, Sen, and Li, 2020), except for the λ that is used to weight the DAMSM losses. MirrorGAN (Qiao et al., 2019) closely resembles the three models as well by also introducing the text-semantic reconstruction loss. The inception score of both baseline models is higher. However, the R-precision score of LSTM-GAN is higher than the others.

Dataset: CUB
Model                      IS      RP(%)
LSTM-GAN, λ = 20           4.29    70.91
BERT-GAN, λ = 20           4.06    50.71
BERT-GANtkn, λ = 20        3.87    58.58
MirrorGAN, λ = 20          4.54    57.67
CycleGAN w/ BERT, λ = 5    5.92    -

Table 4.2: Comparison of the three models against baseline models. The models are trained and evaluated on the CUB dataset.

For CycleGAN w/ BERT the R-precision score was not computed.

The ability to generate high-quality and meaningful images conditioned on classical Chinese poetry can be further explored by a qualitative assessment of the results. Attention images show the regions that condition on a specific word for their generation. Figure 4.1 and Figure 4.2 show an attention map for a caption from the CUB dataset and one from the FP+PL dataset, respectively. The upper-left picture of each figure denotes the image generated in the second stage and the bottom-left picture denotes the image generated in the third stage. The second column shows the complete attention map, and the columns with words above them show the generative attention for a specific word. The whiter the area, the more it is focused upon during generation. Figure 4.1 shows that the generator pays the most attention to colours.

Figure 4.1: Attention maps for a caption from the CUB dataset.

Figure 4.2: Attention maps for a caption from the FP+PL dataset.

This is the case for most attention maps, because colours provide important information for the visual features and they appear frequently. Most, if not all, attention images show similarly inexplicable attention patterns as in Figure 4.2: the focus is strong either for the words at the beginning of the caption or for the words at the end of the caption.


Figure 4.3: The bottom image is generated from the caption below it and the top image is the real image. (a) 去雁一声归楚信,征帆十幅过淮舟。 "Go Yan returned to Chu Xin with a cry, and sailed across the Huaizhou with ten sails." (b) 旌旗夹道蔽山影,笳鼓入林闻谷声。 "The flag clipped the road to hide the mountain shadows, and the drums entered the forest to hear the sound of the valley."

Occasionally, the model is able to generate images that closely resemble the original image, or even the caption. In Figure 4.3a the generated image at the bottom seems to share some semantic information with the caption below it, even though the real picture seems to have nothing in common with the caption. It appears that someone is indeed in a boat, which is implied in the caption. Figure 4.3b shows another pair of images that do closely resemble each other. It should be noted that the images above represent only a small group of high-quality and meaningful images.


5 Discussion

The Inception Score and R-precision score are relatively high for the LSTM-GAN model trained on the FP, PL and FP+PL datasets. Per the definition of the IS, this would indicate that the images belong to many different classes, i.e. no image is the same. Figure 4.2 is one of many examples of images that are colourful and distinct, and thus belong to many different classes. The generated images also contain various differing patterns; consequently, the patterns all distinctly look like a specific object. Per the definition of the IS, this would indicate that the images are also of high quality. Problems of the same nature can arise with R-precision. Each image belongs to a different class. If the images as well as the text descriptions are all very distinct, it is not difficult to match an image with its correct description. Essentially, the Inception Score and R-precision only say something about the originality and distinctive features of images. In order to answer the main question of whether high-quality and meaningful images can be generated from classical Chinese poems, it is best to manually inspect the images for semantic features.

Inspecting Figure 4.2 shows that the hypothesis that basic concepts such as the background are generated precisely while semantic concepts are generated rather poorly is partly correct. Figure 4.3a and Figure 4.3b are among the few examples where semantic concepts seem to be incorporated into the image. Because the GAN model depends on many hyperparameters, it is difficult to pinpoint the exact cause of the poor semantic representations, though some speculations can be made.

The most obvious one is that the text encoder cannot grasp the semantic meaning of the sentences, because its input consists of single characters instead of words; the meaning of individual Chinese characters often differs from the meaning of the multi-character words they form. Furthermore, the DAMSM losses in Figure B.2 are one of many indications that the model overfits the training set: the dimension of the word embedding could be too high, allowing too much flexibility in the model. Lastly, it is possible that not enough emphasis is placed on text-image similarity and visual-semantic similarity during training.

Contrary to expectations, using the BERT models as a text encoder leads to worse results than using a Bidirectional LSTM; the hypothesis that the results would improve with the introduction of a pretrained BERT model is therefore rejected. This may be caused by the BERT model being pretrained on corpora of a different type than the poem datasets used here.

Models trained on smaller datasets such as PI perform significantly worse than those trained on larger datasets such as FP, PL and FP+PL, confirming the hypothesis that the size of a dataset greatly affects performance. Unfortunately, the potential of data consisting of paintings made from poems, as in the PI dataset, is left untapped because the amount of available training data is too small.


5.1 Future work

Chinese word segmentation is a rapidly evolving NLP task in which new models keep improving on previous ones. Cui et al. (2019) propose a BERT model that masks whole words instead of individual characters during pretraining, so that word representations can be learnt as a whole rather than character by character. Using this model in future research could yield better results.
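As an illustration of how such a model could be slotted in, the sketch below loads a whole-word-masking Chinese BERT with the HuggingFace transformers library. The checkpoint name hfl/chinese-bert-wwm is an assumption about where the weights of Cui et al. (2019) are published, and a recent transformers version is assumed; any compatible checkpoint would work the same way.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumption: the whole-word-masking weights are available under this
# identifier; swap in any compatible checkpoint if needed.
MODEL_NAME = "hfl/chinese-bert-wwm"

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
encoder = BertModel.from_pretrained(MODEL_NAME)
encoder.eval()

# Example poem line taken from Figure 4.3a.
poem_line = "去雁一声归楚信，征帆十幅过淮舟。"
inputs = tokenizer(poem_line, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

word_features = outputs.last_hidden_state   # (1, seq_len, 768) token-level embeddings
sentence_feature = outputs.pooler_output    # (1, 768) sentence-level embedding
```

Since the whole-word-masking variants keep the same character-level tokenizer interface as bert-base-chinese, only the pretrained weights change, so the checkpoint can be swapped in without modifying the rest of the pipeline.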

Additionally, it is good practice to carefully tune each hyperparameter while keeping the others unchanged. We suggest tuning the word-embedding dimension D, the loss weights λ_w, λ_s and λ_CE, the batch size, and γ_3. Doing so will likely further improve the performance of the GAN model; a sketch of this one-at-a-time strategy is given below.
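The following sketch varies one hyperparameter at a time around a fixed baseline. The baseline values and search ranges are made up for illustration, and train_and_evaluate is a hypothetical stand-in (here a random stub so the sketch runs on its own) for training the GAN and scoring it on a validation metric such as R-precision.

```python
import random


def train_and_evaluate(config):
    """Hypothetical placeholder: train the GAN with `config` and return a
    validation score. Replaced by a random stub for illustration."""
    return random.random()


# Illustrative baseline, not the settings used in this thesis.
baseline = {"embed_dim": 256, "lambda_w": 4.0, "lambda_s": 4.0,
            "lambda_ce": 1.0, "batch_size": 16, "gamma3": 10.0}

# Vary one hyperparameter at a time while keeping the others at the baseline.
search_space = {
    "embed_dim": [128, 256, 512],
    "lambda_w": [1.0, 4.0, 8.0],
    "batch_size": [8, 16, 32],
}

results = {}
for name, values in search_space.items():
    for value in values:
        config = dict(baseline, **{name: value})
        results[(name, value)] = train_and_evaluate(config)

best_name, best_value = max(results, key=results.get)
print(f"best single change: {best_name} = {best_value}")
```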

Next to the word-context feature Att^w_i, Qiao et al. (2019) also condition the generator on the sentence-context feature Att^s_i. Expanding our model with this feature will likely improve the performance as well.
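A simplified sketch of how the generator could additionally be conditioned on the sentence embedding is given below: the sentence embedding is projected to the visual feature dimension, replicated spatially, gated by the visual features, and concatenated with them before the next generator stage. The gating and the dimensions are illustrative assumptions, not the exact formulation of Qiao et al. (2019).

```python
import torch
import torch.nn as nn


class SentenceContext(nn.Module):
    """Broadcast a projected sentence embedding over the spatial grid, gate it
    with the visual features, and concatenate the result with those features."""

    def __init__(self, sent_dim=256, visual_dim=64):
        super().__init__()
        self.project = nn.Linear(sent_dim, visual_dim)

    def forward(self, sent_emb, visual_feat):
        # sent_emb:    (batch, sent_dim)         sentence-level embedding
        # visual_feat: (batch, visual_dim, H, W) features from the previous stage
        b, c, h, w = visual_feat.shape
        s = self.project(sent_emb)                 # (batch, visual_dim)
        s = s.view(b, c, 1, 1).expand(b, c, h, w)  # replicate spatially
        # Keep, per location, the sentence information the visual features
        # respond to most strongly (softmax over channels).
        att_s = s * torch.softmax(visual_feat * s, dim=1)
        return torch.cat([visual_feat, att_s], dim=1)  # (batch, 2*visual_dim, H, W)


# Usage sketch: the concatenated tensor would feed the next generator stage.
module = SentenceContext(sent_dim=256, visual_dim=64)
out = module(torch.randn(2, 256), torch.randn(2, 64, 32, 32))  # (2, 128, 32, 32)
```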


6 Conclusion

All in all, we address the challenging task of generating high-quality and meaningful images conditioned on classical Chinese poems. Three model configurations trained on six different poem-image datasets were compared. The models are based on GANs that are sequentially stacked to progressively generate high-quality images. Additionally, a module consisting of a text encoder and an encoder-decoder was pretrained to ensure text-image similarity and semantic consistency. The encoder was either a Bidirectional LSTM or a BERT model; for the third configuration the BERT model was additionally used as a tokenizer. Through both qualitative and quantitative metrics, it was found that the best-performing model was pretrained on a larger dataset and used the Bidirectional LSTM as an encoder. Ultimately, the model was able to generate images with slight semantic relations to their descriptions and real images. This work therefore demonstrates the possibility of visualizing classical Chinese poems and making them accessible to wider audiences.


Bibliography

Barmé, Geremie (2002). An artistic exile: a life of Feng Zikai (1898-1975). Vol. 6. Univ of California Press.

Cui, Yiming et al. (2019). “Pre-training with whole word masking for Chinese BERT”. In: arXiv preprint arXiv:1906.08101.

Devlin, Jacob et al. (2018). “BERT: Pre-training of deep bidirectional transformers for language understanding”. In: arXiv preprint arXiv:1810.04805.

Frankel, Hans H (1986). The Columbia Book of Chinese Poetry: From Early Times to the Thirteenth Century.

Goodfellow, Ian J. et al. (2014). “Generative Adversarial Nets”. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. NIPS’14. Montreal, Canada: MIT Press, 2672–2680.

Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long short-term memory”. In: Neural computation 9.8, pp. 1735–1780.

Lu, Yiping et al. (2019). “Understanding and improving transformer from a multi-particle dynamic system point of view”. In: arXiv preprint arXiv:1906.02762.

Mao, Junhua et al. (2016). “Training and evaluating multimodal word embeddings with large-scale web annotated images”. In: Advances in neural information processing systems, pp. 442–450.

Mirza, Mehdi and Simon Osindero (2014). “Conditional generative adversarial nets”. In: arXiv preprint arXiv:1411.1784.

Nieuwburg, Elisha (2020). Building a Dataset for the Visualization of Classical Chinese Poems. Bachelor’s Thesis.

Peters, Matthew E et al. (2018). “Deep contextualized word representations”. In: arXiv preprint arXiv:1802.05365.

Qiao, Tingting et al. (2019). “MirrorGAN: Learning text-to-image generation by redescription”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1505–1514.

Radford, Alec et al. (2018). “Improving language understanding with unsupervised learning”. In: Technical report, OpenAI.

Reed, Scott et al. (2016). “Generative adversarial text to image synthesis”. In: arXiv preprint arXiv:1605.05396.

Russakovsky, Olga et al. (2015). “Imagenet large scale visual recognition challenge”. In: International journal of computer vision 115.3, pp. 211–252.

Salimans, Tim et al. (2016). “Improved techniques for training gans”. In: Advances in neural information processing systems, pp. 2234–2242.

Soh, Moses (2016). “Learning CNN-LSTM architectures for image caption genera-tion”. In: Dept. Comput. Sci., Stanford Univ., Stanford, CA, USA, Tech. Rep.

Sun, Fengyuan (2020). De-noise large-scale poem-image pairs for poem-to-image genera-tion. Bachelor’s Thesis.

Szegedy, Christian et al. (2016). “Rethinking the inception architecture for computer vision”. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826.


Tsue, Trevor, Samir Sen, and Jason Li (2020). “Cycle Text-To-Image GAN with BERT”. In: arXiv preprint arXiv:2003.12137.

Vaswani, Ashish et al. (2017). “Attention is all you need”. In: Advances in neural information processing systems, pp. 5998–6008.

Vaudrin, River (2020). PoemBERT: A representation of classical Chinese poetry, for poem based image generation. Bachelor’s Thesis.

Verwimp, L (2019). “Addressing Limitations of Language Models”. PhD thesis. KU Leuven, Leuven.

Vinyals, Oriol et al. (2015). “Show and tell: A neural image caption generator”. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164.

Wah, C et al. (2011). The CUB-200-2011 Dataset. Tech. rep. CalTech.

Wu, Yonghui et al. (2016). “Google’s neural machine translation system: Bridging the gap between human and machine translation”. In: arXiv preprint arXiv:1609.08144.

Xu, Tao et al. (2018). “AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks”. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1316–1324.

Zhang, Han et al. (2017). “StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks”. In: Proceedings of the IEEE international conference on computer vision, pp. 5907–5915.


A Dataset details

Dataset    Train    Test     Chr     BERT   RegExp
CUB         8855    2933       –     4588     5450
FP          4614    1538    3379    21128        –
PL         62289   20763    6642    21128        –
FP+PL      66903   22301    6711    21128        –
TI          2461     821    2257    21128        –
PI           225      76     944    21128        –
TI+PI       2687     896    2300    21128        –

TABLE A.1: Information about the size of the train and test sets, and the vocabulary size of each tokenizer (Chr = character-level, BERT = BERT tokenizer, RegExp = regular-expression word tokenizer). A dash marks a tokenizer that was not used for that dataset.


B Determining points of convergence for DAMSM

The epoch at which the DAMSM model performs optimally is determined by visualizing its losses. Figure B.1 shows the DAMSM losses of the LSTM-GAN model trained on CUB. Here, both the word losses and the sentence losses clearly stabilize after 100 epochs. It is important not to let the training losses decrease much further while the validation losses stay the same, as overfitting may occur. In this case the model saved at epoch 100 can be considered optimal.

FIGURE B.1: Train and validation DAMSM losses of the LSTM-GAN model trained on CUB

Figure B.2 shows the DAMSM losses of the LSTM-GAN model trained on FP. Here, the optimal epoch is less well defined. The validation word loss briefly decreases in the early stages; however, as the training losses start to overfit, the validation word loss increases again. It is not always best to take the model at the minimum validation word and sentence losses, because the model might then miss out on important features learned from the training set. The ideal point lies slightly after the minimum validation losses, as the training loss normally decreases faster than the validation loss; a sketch of this selection heuristic is given below.
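The sketch below makes the heuristic explicit, assuming the validation word and sentence losses have been logged once per epoch; the offset of five epochs and the example loss curves are illustrative.

```python
def select_checkpoint(val_word_loss, val_sent_loss, offset=5):
    """Pick a DAMSM checkpoint epoch from logged validation losses.

    Take the epoch where the summed validation word and sentence losses are
    minimal, then move a few epochs further so the encoder has absorbed
    slightly more of the training set (the offset of 5 is illustrative).
    """
    total = [w + s for w, s in zip(val_word_loss, val_sent_loss)]
    best_epoch = min(range(len(total)), key=total.__getitem__)
    return min(best_epoch + offset, len(total) - 1)


# Example with made-up loss curves (one value per epoch):
word = [4.0, 3.1, 2.6, 2.4, 2.3, 2.35, 2.5, 2.7, 2.9, 3.1]
sent = [3.5, 2.8, 2.3, 2.1, 2.0, 2.05, 2.2, 2.4, 2.6, 2.8]
print(select_checkpoint(word, sent))   # -> 9 (capped at the last epoch)
```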


FIGURE B.2: Train and validation DAMSM losses of the LSTM-GAN model trained on FP
