Poem2Image Generation for Classical Chinese Poems

Nina M. van Liebergen (11906650)

Bachelor thesis, Credits: 18 EC
Bachelor Kunstmatige Intelligentie
University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: Dan Li
Informatics Institute, Faculty of Science, University of Amsterdam
Science Park 907, 1098 XG Amsterdam

June 26th, 2020

Layout: typeset by the author using LaTeX.


Abstract

Recent developments in the field of Text2Image generation have led to opportunities for the automatic generation of images based on poems. This thesis explores the possibilities of generating images based on Chinese classical poems, chosen for their artistic and cultural value. To achieve this, this study trains a neural network generative model with limited high-quality and large-scale noisy data. In doing so, it reproduces the AttnGAN model [26] and makes adjustments to improve the model. The model is trained and fine-tuned with limited high-quality data consisting of paintings by the painter Feng Zikai and corresponding poems [13], and with large-scale noisy data consisting of web images and Chinese poems [19]. This study contributes to the Text2Image research area by examining the interaction between generative adversarial networks and low-quality datasets. The results show the importance of the different components of the AttnGAN network when training with noisy datasets. Moreover, they show the possibilities of large-scale data for training the network and the difficulties with limited high-quality datasets. It is suggested that the AttnGAN model requires modification for large-scale noisy data and limited high-quality data. A comprehensive experimental investigation with quantitative analysis explains the effect of the different components on the model.


Contents

1 Introduction . . . 2

1.1 Research Question . . . 2

1.2 Contribution . . . 3

2 Related Work . . . 4

2.1 Classical Chinese Poems Generation . . . 4

2.2 Generative Adversarial Models . . . 4

2.3 GAN and Text-to-Image Generation . . . 6

3 Preliminary: AttnGAN for Text-To-Image Generation . . . 7

3.1 The AttnGAN Model . . . 7

3.2 The Architecture of the AttnGAN . . . 7

3.3 The CUB-200-2011 Dataset . . . 12

3.4 Evaluation Metrics . . . 13

3.5 Reproduction Results . . . 15

4 Method: Enhancing AttnGAN for Poem-Image Generation . . . 19

4.1 Poem2Image Dataset Collection . . . 19

4.2 Training AttnGAN Components on Poem2Image Datasets . . . 23

4.3 Enhanced AttnGAN with BERT-based Text-Encoder (AttnGAN+) . . . 24

5 Experiments and Results . . . 26

5.1 Component Analysis . . . 26

5.2 Synthesized Images . . . 32

6 Conclusion . . . 35

7 Future Work . . . 36

7.1 Hyperparameters . . . 36

7.2 Correlation between Poem2Image Data and AttnGAN . . . 36

7.3 Replacement of Components in AttnGAN . . . 36

1 Introduction

While classical Chinese poems play an important role in Chinese history [8], their artistic value is hard to experience without understanding ancient Chinese. More than 2000 years of Chinese history are captured within these poems [29]. Together with their specific syntactical, phonological and semantic requirements, the classical poems carry rich cultural and academic value [7]. Because they are restricted to the language they are written in, only a few people are able to fully understand them.

Several Chinese artists have created paintings based on poetry. For example, the Chinese artists Zhengming, Bohu, Zhou and Ying, called the Four Masters, were praised for combining painting with poetry [11]. When these artists finished a painting, they would create a corresponding poem that captured the emotions and state of mind expressed in the painting. Conversely, the famous Feng Zikai made a significant number of paintings corresponding to poetry [12]. Zikai painted over 3500 different paintings, each visually translating a classical Chinese poem. These paintings reveal cultural heritage, even to people unfamiliar with ancient Chinese. Despite the work of Zikai and the Four Masters, there is still a significant number of poems without paintings.

To be able to experience the artistic value of poems without paintings, generative models could fill the gap. Existing paintings based on Chinese classical poetry could serve as input to train neural networks. Over the last few years, the development of Text-to-Image generation models has made significant progress ([17], [15], [28], [27]). The use of generative adversarial networks (GAN) [5] for this task has shown some promising results [18]. For example, the AttnGAN model [26] trained on the CUB dataset [23] has high potential for generating images that include specific information at word level.

1.1 Research Question

The goal of the project this thesis is part of is to visualize classical Chinese poetry as images in the painting style of Feng Zikai. This goal poses several challenges. First, generating paintings in the style of Feng Zikai based on classical Chinese poems requires specific data (Poem2Image data). Second, the representation of Chinese classical poems and images in one single vector space requires examination. Finally, a valid neural network generative model has to be trained and optimized for the task-specific data. The main research question of the project is: How can classical Chinese poems be visualized in Feng Zikai's painting style?

This thesis focuses on training a neural network generative model in line with the main goal of the project. The main question of this thesis therefore is:

How can a neural network generative model be trained with limited high-quality training data and large-scale noisy training data?

This study uses a qualitative case study approach to investigate the AttnGAN model [26] in order to automatically generate images corresponding to the Poem2Image data. First, the reproducibility of the AttnGAN model is examined. Next, the model is refined for the specific problem of generating images for Chinese classical poems in the style of Feng Zikai. This thesis shows that the AttnGAN model requires refinement and shows how the different components of the AttnGAN model process limited high-quality training data and large-scale noisy training data.

1.2 Contribution

This study contributes to the research area of Text2Image synthesis, specifically to the area of generative adversarial networks. Multiple studies have researched generating images from text descriptions using large-scale, high-quality training data ([17], [15], [28], [27]), for example using the CUB [23] and the COCO [6] datasets. However, very little is known about the effect of limited high-quality training data and large-scale noisy training data on generative adversarial models. This study therefore contributes to research on Text2Image synthesis by demonstrating the possibilities and challenges of generative adversarial networks with limited high-quality training data.

2 Related Work

In this chapter, the objective is embedded in recent related research. Theoretical background information on the generative adversarial network (GAN) is given and relevant GANs are highlighted.

2.1 Classical Chinese Poems Generation

To our knowledge, no research has generated images based on Chinese classical poems before. There are, however, several models that work the other way around: they generate Chinese classical poems based on images. These studies are relevant for Poem2Image generation because they give insight into the relationship between the images and poems while training a generative model.

Xu et al. (2018) [25] propose a memory-based neural network that exploits images to generate poems. Given an image, linguistic keywords are extracted and used as the skeleton of the poem to generate. This way, visual information based on the image can be integrated in the poem. In short, the proposed network is an encoder-decoder model with a topic memory network. Another Images2Poem network is proposed by Liu et al. in the same year [7]. This model includes a selection mechanism and an adaptive self-attention mechanism. While generating the poems, the information flow from both the images and the previously generated characters is combined.

2.2 Generative Adversarial Models

In contrast to discriminative models, which discriminate given data between different kinds of classes, generative models can generate new data instances [5]. Thus, generative models are able to produce new content. This way, they can fill the task of text-to-image synthesis. Given a set of data samples X and labels Y, a discriminative model captures the conditional probability P(Y|X), whereas a generative model captures the joint probability P(X, Y).

One of the generative models is the generative adversarial network (GAN), proposed by Goodfellow et al. in 2014 [5]. This unsupervised learning algorithm is able to generate samples from the distribution of the provided dataset. Through an adversarial game between two neural networks, the generator and the discriminator, the network learns in turns.

The Generator

The generator is trained to output samples following the real data distribution. Before it is able to generate such plausible data, the generator requires learning.


The generator learns through losses from the discriminator. The following steps are iterated: first, the generator takes as input a sample from a random noise distribution. This noise is transformed by the generator through a transform function. The discriminator classifies the output samples of the transform function as real or as fake. Based on this classification, the loss for the generator and the discriminator is calculated. The generator loss indicates the extent to which the generator failed to 'fool' the discriminator. Through back-propagation, the gradients for updating both the generator and the discriminator are obtained. Finally, the weights of the generator are updated.

The Discriminator

The discriminator is a simple classifier that learns to distinguish samples of the real distribution from samples generated by the generator. It outputs a value between 0 (fake) and 1 (real). During the training of the discriminator, the generator does not learn. The training data of the discriminator consist of both the real samples and the fake samples. The discriminator is trained by iterating the following three steps. First, it classifies the given input as real or fake. Second, the discriminator loss penalizes the discriminator for classifying the real data as fake and vice versa. Finally, the weights of the discriminator are updated through back-propagation.

Training

It is important to note that the generator and the discriminator do not learn simultaneously, but alternately. When the generator updates its weights, the gradients are calculated through back-propagation from the output of the discriminator, through the discriminator, to the generator. However, while training the generator only the weights of the generator are updated. When training the discriminator, only the weights of the discriminator are updated.

To train the generative adversarial network in its entirety, the training proceeds in alternating periods: training the discriminator for a number of epochs alternates with training the generator for a number of epochs. While one of the two networks learns, the other is kept fixed.

When the discriminator is no longer capable of distinguishing the real data from the generated data, the generator has become equivalent to the true data-generating function. The optimal discriminator can then be found at the Nash equilibrium, where the discriminator has an accuracy of 50%.

While the generator is learning the true distribution, the performance of the discriminator decreases. The losses generated by the discriminator to update the networks will also become less accurate. As a consequence, the convergence of generative adversarial networks is not stable.

The Loss Function

To measure the success of the generator and discriminator, the distance between the real data distribution and the generated samples has to be reflected. How to do this drives an area of active research. For this research, the minimax loss function is used and explained. This is the loss function proposed in the original GAN paper [5] and used in the AttnGAN model [26]. The formula derives from the cross-entropy between the generated and real distributions.

The training procedure of the GAN consists of optimizing the minimax loss:

$$\min_G \max_D \; \mathbb{E}_x[\log(D(x))] + \mathbb{E}_z[\log(1 - D(G(z)))] \quad (1)$$

The discriminator tries to maximize this function while the generator tries to minimize it. Here, z is a sample of the random noise distribution used as input for the generator, D(x) is the discriminator's estimate of the probability that real data is real, and D(G(z)) is its estimate of the probability that generated data is real.
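To make the alternating training scheme concrete, the sketch below shows one minimal GAN training step in PyTorch. The toy generator and discriminator, the noise dimension and the learning rates are illustrative choices and not those of the AttnGAN; the binary cross-entropy terms correspond to the minimax objective of Eq. 1, with the common non-saturating variant used in the generator step.

import torch
import torch.nn as nn

# Illustrative toy networks; the real AttnGAN uses multi-stage generators.
G = nn.Sequential(nn.Linear(100, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()  # cross-entropy form of the minimax objective

def train_step(real_batch):
    batch_size = real_batch.size(0)
    ones, zeros = torch.ones(batch_size, 1), torch.zeros(batch_size, 1)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(batch_size, 100)
    fake = G(z).detach()            # generator weights are kept fixed here
    d_loss = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: in practice the non-saturating form -log D(G(z)) is minimized
    z = torch.randn(batch_size, 100)
    g_loss = bce(D(G(z)), ones)     # gradients flow through D, but only G is updated
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()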

2.3 GAN and Text-to-Image Generation

One of the most common and challenging problems at the intersection of Natural Language Processing and Computer Vision is image captioning ([17], [15], [28], [27]). Text-to-Image and Image-to-Text conversions are highly multimodal problems. The Text-to-Image task is more difficult than language-to-language conversion: in language there are fewer possible translations than when generating images [4].

Building on the progress achieved with the GAN as image generator, the most recently proposed text-to-image generation methods are based on generative adversarial networks. Text-to-image synthesis refers to generating visually realistic images that match given text descriptions. The emergence of deep generative models has driven an evolution in text-to-image synthesis. The generative adversarial network is one of the deep generative networks that shows great performance in producing sharper samples ([15], [16], [28], [30]). The conditional GAN introduced by Reed et al. in 2016 [17] first showed the potential of GANs to generate plausible images from text descriptions; incorporating additional conditions into the GAN results in improvements. The StackGAN, developed by Zhang et al. [28], uses different GANs to generate images of different sizes. The alignDRAW model, built by Mansimov et al. [9], generates patches of images corresponding to relevant words in the caption.

3 Preliminary: AttnGAN for Text-To-Image Generation

Based on the theory in Chapter 2, the AttnGAN model is chosen to fulfill the task of Poem2Image generation. Due to its promising results on complex scenarios and its attention mechanism, the AttnGAN model suits the structure of Chinese Classical poems and the complex poetic scenarios in the paintings.

In this chapter, the AttnGAN model [26] is described and reproduced in order to lay the foundation for the development of the Poem2Image generation. First, the AttnGAN model and its architecture are described. Second, the AttnGAN model is reproduced on the CUB dataset [23] and the results are evaluated.

3.1 The AttnGAN Model

All of the text-to-image GANs mentioned in Section 2.3 are conditioned on the global sentence vector. This approach of encoding the whole text description into one global sentence vector loses important fine-grained information. Word-level information can improve the generation of high-quality images. When the text descriptions, and thus the scenes to visualize, are more complex, fine-grained information at word level is even more desirable. To address this issue, the AttnGAN model incorporates an attention-driven mechanism.

The Attention Mechanism

Sequence transduction models adopt attention mechanisms for modeling multi-level dependencies in different kinds of tasks such as machine translation, image question answering and image captioning. Before the development of the AttnGAN model, the attention mechanism had not been explored within text-to-image generation with GANs. This adaptation enables the GAN to generate high-quality images via sentence- and word-level conditioning.

3.2 The Architecture of the AttnGAN

Xu et al. [26] propose two new components: the attentional generative network and the deep attentional multimodal similarity model. For a better understanding of this research, the architecture of the AttnGAN model is explained below.

Text Encoder

First, the text encoder extracts the semantic vectors from the text descriptions. Each word of the text description corresponds to two hidden states. The text encoder is a bi-directional Long Short-Term Memory (LSTM), in which each hidden state corresponds to one direction. These two states together represent the semantic meaning of the word. The whole text description is represented by the feature matrix $e \in \mathbb{R}^{D \times T}$, whose $i$-th column $e_i$ is the feature vector for the $i$-th word. Here $T$ is the number of words in the description and $D$ is the dimension of the word vector. The last hidden states of the text encoder are concatenated to form the global sentence vector, denoted by $\bar{e}$.
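As a minimal sketch of such a text encoder, the snippet below builds a bi-directional LSTM that returns the word feature matrix e and the global sentence vector e-bar; the vocabulary size, embedding size and hidden size are arbitrary illustrative values, not the settings used in the AttnGAN.

import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of a bi-directional LSTM text encoder producing e and e_bar."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # D = 2 * hidden_dim because the two directions are concatenated
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, tokens):                      # tokens: (batch, T)
        emb = self.embed(tokens)                    # (batch, T, emb_dim)
        words, (h_n, _) = self.lstm(emb)            # words: (batch, T, 2*hidden_dim)
        e = words.transpose(1, 2)                   # word feature matrix e: (batch, D, T)
        e_bar = torch.cat([h_n[0], h_n[1]], dim=1)  # last hidden states of both directions
        return e, e_bar

encoder = TextEncoder(vocab_size=5000)
e, e_bar = encoder(torch.randint(0, 5000, (4, 15)))   # 4 captions of 15 tokens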

Conditioning Augmentation Fca

This global sentence vector $\bar{e}$ is converted to the conditioning vector by the Conditioning Augmentation $F^{ca}$. $F^{ca}$ randomly samples latent variables from a Gaussian distribution: the global sentence vector $\bar{e}$ is split into a $\mu$ and a $\sigma$, which are combined with Gaussian noise to obtain higher variation in the generated images. This way, the model will always generate different images for the same caption due to $F^{ca}$.

$$c = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \quad (2)$$
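The following sketch shows how Eq. 2 can be implemented with the usual reparameterization trick; the dimensions of the sentence and conditioning vectors are illustrative assumptions.

import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Sketch of F_ca: derive mu and sigma from the sentence vector and re-sample (Eq. 2)."""
    def __init__(self, sent_dim=256, cond_dim=100):
        super().__init__()
        self.fc = nn.Linear(sent_dim, cond_dim * 2)  # predicts mu and log-variance

    def forward(self, e_bar):
        mu, logvar = self.fc(e_bar).chunk(2, dim=1)
        sigma = torch.exp(0.5 * logvar)
        eps = torch.randn_like(sigma)        # epsilon ~ N(0, I)
        return mu + sigma * eps              # c = mu + sigma * eps

c = ConditioningAugmentation()(torch.randn(4, 256))   # one conditioning vector per caption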

F0 Network

The conditioning vector serves as input for the $F_0$ network. This network is responsible for upsampling. In every upsampling block, the height and width of the feature map are doubled using nearest-neighbour interpolation. In this first stage, no word-level features are used.

$$h_0 = F_0(z, F^{ca}(\bar{e})) \quad (3)$$

This $h_0$ vector represents the image features of each sub-region of the image: each column of $h_0$ is a feature vector of one sub-region of the image.

Fattn Network

The vector $h_0 \in \mathbb{R}^{D \times N}$ is used as input for the first generator, for the first $F^{attn}$ network and for the second $F$ network. The second $F$ network, $F_1$, accepts as input the $h_0$ matrix together with the word features $e \in \mathbb{R}^{D \times T}$. First, the $F^{attn}$ network converts these word features to the common semantic space of the image features and computes a word-context vector for each sub-region $j$:

$$c_j = \sum_{i=0}^{T-1} \beta_{j,i} e'_i, \quad \text{where } \beta_{j,i} = \frac{\exp(s'_{j,i})}{\sum_{k=0}^{T-1} \exp(s'_{j,k})} \quad (4)$$

Then, the $F_1$ network combines these word-context vectors with the image features ($h$) to compute the features for the next stage; the $F$ network is also responsible for upsampling.
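A minimal implementation of Eq. 4 is sketched below: for every image sub-region, a softmax over the word similarities produces the weights beta, and the word-context vectors are the weighted sums of the projected word features. The tensor shapes are illustrative.

import torch

def word_context(h, e_prime):
    """Sketch of F_attn (Eq. 4): one word-context vector per image sub-region.

    h:        (batch, D, N)  image features, one column per sub-region
    e_prime:  (batch, D, T)  word features mapped to the image feature space
    """
    s = torch.bmm(e_prime.transpose(1, 2), h)      # (batch, T, N) similarities s'_{j,i}
    beta = torch.softmax(s, dim=1)                 # normalize over the T words
    c = torch.bmm(e_prime, beta)                   # (batch, D, N) word-context vectors c_j
    return c

c = word_context(torch.randn(4, 128, 64), torch.randn(4, 128, 15))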


Generators

Every word-context vector is passed to a corresponding generator, which outputs an image of a certain size. The first generator $G_0$ outputs an image of size $64 \times 64 \times 3$; the size of 64 is due to the upsampling by $F_0$ and the 3 represents the RGB channels. The sizes of the images generated by the next $m$ generators are doubled at each stage.

The Deep Attentional Multimodal Similarity Model (DAMSM)

For the last generated image, the image-text similarity at word level is measured to compute a fine-grained loss for the image generation. This is done by two neural networks, which map the sub-regions of the image and the words of the sentence into a common semantic space.

The Image Encoder

The DAMSM trains an image encoder and a text encoder. The image encoder is a convolutional neural network (CNN) that maps the generated images into semantic vectors. The later layers of the CNN learn global features of the image, while the intermediate layers learn local features of different sub-regions. The image encoder that is used is pretrained on ImageNet and is built upon the Inception-v3 model [10]. The generated image is first rescaled to $299 \times 299$ pixels and the local feature matrix $f \in \mathbb{R}^{768 \times 289}$ is extracted. Each column of $f$ corresponds to a feature vector of a sub-region of the image; thus, there are 289 sub-regions and 768 is the dimension of each local feature vector. The image features are converted to the same semantic space as the description features:

$$v = W f, \quad \bar{v} = \bar{W} \bar{f} \quad (5)$$

The DAMSM answers the intuitive question: does the generated image actually follow the description? To answer this question, the word features are combined with the image features through the dot product, which results in the similarity matrix. Experiments showed that normalizing this matrix improves the final results:

$$\bar{s}_{i,j} = \frac{\exp(s_{i,j})}{\sum_{k=0}^{T-1} \exp(s_{k,j})} \quad (6)$$

An attention model is built to compute a region-context vector for each word. This examines whether the specific word has any significance for visualizing. This is done for every sub-region of the image, and afterwards all these regional visual vectors are summed, resulting in the region-context vector $c_i$:

$$c_i = \sum_{j=0}^{288} \alpha_j v_j, \quad \text{where } \alpha_j = \frac{\exp(\gamma_1 \bar{s}_{i,j})}{\sum_{k=0}^{288} \exp(\gamma_1 \bar{s}_{i,k})} \quad (7)$$

Here $\gamma_1$ functions as the attention scaling factor: when this factor increases, more attention is paid to the features of the relevant sub-regions.

Finally, the relevance between the $i$-th word and the image is calculated using the cosine similarity between the region-context vector $c_i$ and the word feature vector $e_i$:

$$R(c_i, e_i) = \frac{c_i^T e_i}{\|c_i\| \, \|e_i\|} \quad (8)$$

Combining these word-level calculations, the attention-driven image-text matching score is defined to measure the match between the whole text description and the entire image:

$$R(Q, D) = \log\Big( \sum_{i=1}^{T-1} \exp(\gamma_2 R(c_i, e_i)) \Big)^{\frac{1}{\gamma_2}} \quad (9)$$

Here $\gamma_2$ determines how much to magnify the importance of the most relevant word in the description.
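The sketch below strings Eqs. 6-9 together: it normalizes the word-region similarities, builds a region-context vector for every word, and collapses the word-level relevances into the matching score R(Q, D). The feature dimensions and the gamma values are only examples.

import torch

def matching_score(v, e, gamma1=4.0, gamma2=5.0):
    """Sketch of the attention-driven image-text matching score (Eqs. 6-9).

    v: (batch, D, 289) local image features, one column per sub-region
    e: (batch, D, T)   word features from the text encoder
    """
    s = torch.bmm(e.transpose(1, 2), v)                  # (batch, T, 289) word/region similarities
    s_bar = torch.softmax(s, dim=1)                      # Eq. 6: normalize over words
    alpha = torch.softmax(gamma1 * s_bar, dim=2)         # attention over the 289 sub-regions
    c = torch.bmm(v, alpha.transpose(1, 2))              # (batch, D, T) region-context vectors c_i
    r = torch.cosine_similarity(c, e, dim=1)             # Eq. 8: relevance R(c_i, e_i) per word
    return (1.0 / gamma2) * torch.logsumexp(gamma2 * r, dim=1)   # Eq. 9: R(Q, D)

score = matching_score(torch.randn(4, 256, 289), torch.randn(4, 256, 15))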

The objective function

To train the whole AttnGAN network, the final objective function is defined as follows:

$$L = L_G + \lambda L_{DAMSM}, \quad \text{where } L_G = \sum_{i=0}^{m-1} L_{G_i} \quad (10)$$

The loss of the generators

The first part of this loss function corresponds to the summed losses of all the generators. The adversarial loss for $G_i$ consists of two parts:

$$L_{G_i} = -\frac{1}{2}\mathbb{E}_{\hat{x}_i \sim p_{G_i}}[\log D_i(\hat{x}_i)] - \frac{1}{2}\mathbb{E}_{\hat{x}_i \sim p_{G_i}}[\log D_i(\hat{x}_i, \bar{e})] \quad (11)$$

The first part determines whether the image is real or fake (the unconditional loss). It takes as input the output of the discriminator for classifying the generated image as real or fake. Hence, it is not taking the description into account. The second part of the formula determines the extent to which the sentence matches with the image (the conditional loss). It takes as input the global sentence vector together with the generated image.

As described in Section 2.2, the training of the generator is combined with the adversarial training of the discriminator, which minimizes the cross-entropy loss:

$$L_{D_i} = -\frac{1}{2}\mathbb{E}_{x_i \sim p_{data_i}}[\log D_i(x_i)] - \frac{1}{2}\mathbb{E}_{\hat{x}_i \sim p_{G_i}}[\log(1 - D_i(\hat{x}_i))] \quad (12)$$

$$\phantom{L_{D_i} =} -\frac{1}{2}\mathbb{E}_{x_i \sim p_{data_i}}[\log D_i(x_i, \bar{e})] - \frac{1}{2}\mathbb{E}_{\hat{x}_i \sim p_{G_i}}[\log(1 - D_i(\hat{x}_i, \bar{e}))] \quad (13)$$

Each discriminator is trained to classify the generated image as real or fake. The unconditional loss (the first part) calculates if the generated image is real or fake. It trains to classify the true data close to 1 and the generated images close to 0. The second part, the conditional loss, examines whether the image and caption are of the same pair. The discriminator is trained to improve the generator and does not directly improve the generation of the images.
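A possible implementation of these unconditional and conditional terms is sketched below, written with binary cross-entropy on discriminator logits; the argument names are hypothetical stand-ins for the discriminator outputs and are not the identifiers used in the released AttnGAN code.

import torch
import torch.nn.functional as F

def generator_loss(d_uncond_fake, d_cond_fake):
    """Sketch of Eq. 11: unconditional + conditional adversarial loss for one generator.
    Both inputs are discriminator logits for generated images (the conditional branch also saw e_bar)."""
    real = torch.ones_like(d_uncond_fake)
    return 0.5 * (F.binary_cross_entropy_with_logits(d_uncond_fake, real)
                  + F.binary_cross_entropy_with_logits(d_cond_fake, real))

def discriminator_loss(d_uncond_real, d_uncond_fake, d_cond_real, d_cond_fake):
    """Sketch of Eqs. 12-13: push real images towards 1 and generated images towards 0,
    for both the unconditional and the conditional branch."""
    ones, zeros = torch.ones_like(d_uncond_real), torch.zeros_like(d_uncond_fake)
    uncond = (F.binary_cross_entropy_with_logits(d_uncond_real, ones)
              + F.binary_cross_entropy_with_logits(d_uncond_fake, zeros))
    cond = (F.binary_cross_entropy_with_logits(d_cond_real, ones)
            + F.binary_cross_entropy_with_logits(d_cond_fake, zeros))
    return 0.5 * (uncond + cond)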

The DAMSM loss

The second part of the final objective function trains the attention model in a semi-supervised manner, by testing the matching between the entire image and the whole sentence. For an image $Q_i$, only one sentence-image pair is real; the other image-sentence pairs are mismatching. The posterior probability is calculated for a batch of image-sentence pairs. Given the image, the posterior probability indicates how likely it is that a given description matches:

$$P(D_i|Q_i) = \frac{\exp(\gamma_3 R(Q_i, D_i))}{\sum_{j=1}^{M} \exp(\gamma_3 R(Q_i, D_j))} \quad (14)$$

Here $\gamma_3$ is a smoothing factor. This formula leads to a loss function defined as the negative log posterior probability that the images are matched with their corresponding text descriptions. During the training of the encoders, this loss is minimized. In the same way, the posterior probability that the descriptions are matched with their corresponding images is considered:

$$P(Q_i|D_i) = \frac{\exp(\gamma_3 R(Q_i, D_i))}{\sum_{j=1}^{M} \exp(\gamma_3 R(Q_j, D_i))} \quad (15)$$

The loss functions are defined as the negative log posterior probability that the images are matched with their corresponding descriptions and vice versa:

$$L_1^w = -\sum_{i=1}^{M} \log P(D_i|Q_i) \quad (16) \qquad L_2^w = -\sum_{i=1}^{M} \log P(Q_i|D_i) \quad (17)$$

These two losses use the word-level information. However, when the matching score of Eq. 9 is redefined using the global sentence vector $\bar{e}$ and the global image vector $\bar{v}$ and substituted into Eqs. 14-17, the corresponding sentence-level matching losses can be defined:

$$L_1^s = -\sum_{i=1}^{M} \log P(D_i|Q_i) \quad (18) \qquad L_2^s = -\sum_{i=1}^{M} \log P(Q_i|D_i) \quad (19)$$

These four losses together form the DAMSM loss:

$$L_{DAMSM} = L_1^w + L_2^w + L_1^s + L_2^s \quad (20)$$
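Because only the diagonal pairs of a batch match, Eqs. 14-19 reduce to a cross-entropy over the batch-wise score matrix. The sketch below illustrates this for one score matrix; in the full L_DAMSM it would be applied once to the word-level scores and once to the sentence-level scores.

import torch
import torch.nn.functional as F

def damsm_loss(scores, gamma3=10.0):
    """Sketch of Eqs. 14-19: scores[i, j] = R(Q_i, D_j) for a batch of M image-sentence pairs.
    Only the diagonal entries are matching pairs, so the negative log posterior reduces to a
    cross-entropy with the 'correct class' being the pair index."""
    M = scores.size(0)
    targets = torch.arange(M)
    loss_1 = F.cross_entropy(gamma3 * scores, targets)        # images matched to descriptions
    loss_2 = F.cross_entropy(gamma3 * scores.t(), targets)    # descriptions matched to images
    return loss_1 + loss_2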

3.3 The CUB-200-2011 Dataset

For the reproduction of the AttnGAN model, the implementation is trained on the CUB-200-2011 dataset [23]. This set is chosen because it is adopted in previous text-to-image models ([15], [29], [17]) and in the original AttnGAN paper [26]. The CUB-200-2011 dataset is the extended version of the CUB-200 dataset [23] and consists of 11.788 images of 200 bird species. Each image comes with 10 descriptions of the bird that is shown (Figure 1). These descriptions are built from a vocabulary of 28 attribute groupings and 312 binary attributes. For example, the attribute group belly color contains 15 different color choices, and thus 15 different binary attributes. The descriptions are therefore structured and patterns can be extracted.

Figure 1: Captions for one image of the CUB dataset

Training parameters

While training the model, the parameters proposed in the AttnGAN paper are adopted. Xu et al. [26] found that they obtained the best results with the following parameters for the CUB dataset:


Parameter: Proposed value
λ: 5
γ1: 4
γ2: 5
γ3: 10
Learning rate encoders: 0.002
Epochs training encoders: 20
Epochs model training: 600

Table 1: Parameters for training AttnGAN on CUB dataset

The values proposed by Xu et al. [26] for the CUB dataset are adopted for training (Table 1). Here, λ is the hyper-parameter that balances the two terms of the final objective function, the generator loss and the DAMSM loss; it indicates the weight of the DAMSM loss function. γ1 is the attention scaling factor that determines how much attention is paid to the features of the relevant sub-regions. γ2 determines the importance of the most relevant word-to-region context pair; when this factor goes to infinity, only the most relevant word-to-region pair matters. γ3 is a smoothing factor within the DAMSM loss, determined by experiments [26].

Github repository

A PyTorch implementation of the AttnGAN model was made by Tao Xu, one of the authors of the AttnGAN model, in 2018. The final code, based on Python 2.7 [21] and PyTorch [14], is available on GitHub¹. Several adjustments and implementations were made; for the final code, see Appendix 1.

3.4 Evaluation Metrics

Inception Score

The inception score is a measure developed by Salimans et al. [18] in 2016 that gives an objective indication of the quality and the diversity of the generated images [2]. It is mostly used for evaluating GANs. The inception score involves a pre-trained neural network that classifies the generated images, after which two different scores are combined into the inception score.

The first property, the image quality of the generated images, is captured by the conditional class probability. This is computed by the pretrained Inception-v3 model [20] by predicting the class probabilities for each generated image. When one class has a significantly higher probability than the other classes, it is more probable that the image is of high quality. The conditional probability distribution should therefore have a low entropy for a better score, because high entropy indicates less information.

The second property is the diversity of the generated images. One of the risks of generating new images based on a given dataset is overfitting: the model may learn the given dataset perfectly but fail to combine this information into new, unseen images. When the model generates images that do not differ from each other, the GAN does not perform as desired. To measure this property, the marginal probability is used, which captures the variety between the generated images through the probability distribution over all generated images. A high entropy of this marginal distribution is desired, because it indicates a high diversity.

These two measurements are combined through the Kullback-Leibler divergence. When calculating the inception score, the generated images are split into groups and the average of the inception scores of all these groups is the final inception score. It has been shown that the inception score correlates with human evaluation of image quality [18].
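The computation can be summarised as IS = exp(E_x[KL(p(y|x) || p(y))]), averaged over several splits of the generated images. A minimal sketch, assuming the class probabilities have already been obtained from a pretrained Inception-v3, is given below; the evaluation in this thesis uses the StackGAN implementation mentioned next.

import torch
import torch.nn.functional as F

def inception_score(probs, splits=10):
    """Sketch of the Inception Score.
    probs: (N, num_classes) class probabilities p(y|x) predicted by a pretrained
    Inception-v3 for N generated images. Returns the mean score over `splits` groups."""
    scores = []
    for chunk in probs.chunk(splits, dim=0):
        p_y = chunk.mean(dim=0, keepdim=True)                    # marginal p(y) within the group
        kl = (chunk * (chunk.log() - p_y.log())).sum(dim=1)      # KL(p(y|x) || p(y)) per image
        scores.append(kl.mean().exp())                           # IS = exp(E_x[KL])
    return torch.stack(scores).mean().item()

is_value = inception_score(F.softmax(torch.randn(3000, 200), dim=1))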

The code used for the inception score is taken from the StackGAN model [28] and was adopted and adjusted².

R-precision

While the inception score does not capture whether the generated image reflects the text description, the R-precision measurement [30] does. This common evaluation metric for ranking retrieval results is used to evaluate the generated image against its original description. Just like the inception score, the R-precision is calculated by analysing the generated images from the validation set; for the CUB dataset, this is a batch of 3.000 images. For every image the R-precision is calculated by comparing it to 99 mismatching descriptions and the real description (i.e. R = 1). This is done by extracting the features of both the image and the text descriptions using the encoders of the pretrained DAMSM. The cosine similarity between the image and each text is computed and the most relevant description is selected. This way, the R-precision score for each image is either 0 or 1. Over all generated images, the percentage of images for which the correct description is ranked as most relevant is reported as the R-precision score.
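A minimal sketch of this procedure is given below; it assumes the global image and text features have already been extracted with the pretrained DAMSM encoders, and the candidate sampling is simplified compared to the evaluation code that was actually used (mentioned next).

import torch

def r_precision(img_feats, txt_feats, num_mismatched=99):
    """Sketch of R-precision with R = 1: for every generated image, rank its true
    description against `num_mismatched` randomly drawn mismatching descriptions
    by cosine similarity of the (pretrained DAMSM) global features."""
    n = img_feats.size(0)
    hits = 0
    for i in range(n):
        # indices of mismatching captions, excluding the true one
        wrong = [j for j in torch.randperm(n).tolist() if j != i][:num_mismatched]
        candidates = torch.cat([txt_feats[i:i + 1], txt_feats[wrong]], dim=0)
        sims = torch.cosine_similarity(img_feats[i:i + 1], candidates, dim=1)
        hits += int(sims.argmax().item() == 0)                   # true caption ranked first?
    return hits / n

score = r_precision(torch.randn(100, 256), torch.randn(100, 256))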

The code used for the R-precision is taken from the DM-GAN model [30] and was adopted and adjusted³.

²https://github.com/hanzhanggit/StackGAN-inception-model
³https://github.com/MinfengZhu/DM-GAN


3.5 Reproduction Results

Pretraining DAMSM

Figure 2 visualizes the sentence loss and word loss for both the validation set and the training set. Starting at a loss of around 7, the losses decrease. After approximately 200 epochs, the loss on the validation set starts increasing, which indicates overfitting. Thus, the text and image encoders perform best after training for 200 epochs. This is consistent with the AttnGAN paper.

Figure 3 shows the attention mechanism of the AttnGAN model. Images of the training set are visualized with the attention learned by the DAMSM.

Figure 2: The learning of the DAMSM. (a) The learning of the loss; (b) the learning rate.


Pretraining AttnGAN model

Figure 4b shows that the loss of the discriminator decreases, while Figure 4a shows that the loss of the generator increases. Neither loss converges to a stable value. Figure 5 maps the attention mechanism from the DAMSM over the images generated by the generator.

Figure 4: (a) Loss of the Generator; (b) Loss of the Discriminator; (c) Loss of the Generator and Discriminator.

Figure 5: Attention model on generated images

Inception score and R-precision

For the outcomes of the evaluation metrics, see Table 2. Figure 6 shows the curve of the inception score for an increasing number of epochs of GAN training. As can be seen from Figure 6b, the inception score is at an optimum at epoch 400. After 100 epochs, the inception score fluctuates between 4.15 and 4.45.

Method Highest Inception score Highest R-precision

AttnGAN [26] 4.36 67.82%

Self trained AttnGAN 4.44 60.0%

Table 2: Achieved results with AttnGAN model on CUB dataset

Figure 6: (a) Inception scores; (b) Inception scores from epoch 300 onwards.

Images

For an insight into the images generated with the self-trained AttnGAN model, see Figure 7.

(a) this bird is brown, white, and black in color with a sharp black beak and black eye ring.

(b) a bird with a thin pointed bill, swept back brown crown, and red and white throat.

(c) the bird has a dark brown back and wings, with white sides and a white collar.

Figure 7: Images generated by the self-trained AttnGAN on the CUB dataset, using three descriptions of the bird species Horned Grebe.

4 Method: Enhancing AttnGAN for Poem-Image Generation

In this chapter, the method of the research is described. After the reproduction of the AttnGAN model on the CUB data (Chapter 3), which ensures that the model to be used is correct, the AttnGAN is refined for Poem2Image generation. First, the Poem2Image datasets that are used are introduced. Subsequently, the details of training the AttnGAN are set, and finally a method for improving the AttnGAN by replacing the text-encoder is proposed.

4.1 Poem2Image Dataset Collection

To generate paintings in the style of Feng Zikai based on Chinese classical poetry, the AttnGAN requires training on task-specific datasets. Therefore, two groups of datasets are built: the first two datasets are constructed manually [13], the other two are constructed automatically [19]. These four datasets are made to attempt the generation of paintings in Feng Zikai's painting style. Table 3 shows the differences between the four datasets at word level.

                   Title Image | Poem Image | Poem Line  | Poem Famous | CUB-dataset
# Images           3.658       | 300        | 89.920     | 6.152       | 11.778
# Tokens           19.305      | 18.191     | 1.206.742  | 92.416      | 1.803.457
# Single tokens    2.279       | 2.319      | 6.645      | 3.382       | 5.449
% Duplicate words  88,20%      | 87,30%     | 99,45%     | 96,34%      | 99,70%

Table 3: Specifications of the Poem2Image datasets

The manually constructed datasets of Feng Zikai's paintings

There exist 13 books of Feng Zikai's comics of classical Chinese poems [24]. The books contain paintings by Feng Zikai together with captions of the paintings or the corresponding poems. The paintings by Feng Zikai are both chromatic and monochrome. In the manually developed datasets [13], these books are divided into 300 poem-image pairs and 3650 caption-image pairs. This information is captured in two datasets: the Poem-Image dataset with the paintings and the corresponding poems, and the Title-Image dataset with the paintings and the corresponding captions.

The Poem-Image dataset contains 300 paintings of Feng Zikai with corresponding Chinese poems.


Figure 8: The vocabulary of the poetry consists of 18191 Chinese tokens in total, of which 2319 are distinct. Of these, 728 Chinese tokens are mentioned once, 390 are mentioned twice and 12 occur more than 100 times.

The other dataset built upon the paintings of Feng Zikai is the Title-Image dataset. This dataset contains 3296 images and corresponding captions. The captions were created by Feng Zikai himself.


Figure 9: The vocabulary of the captions consists of 19305 Chinese tokens in total, of which 2279 are distinct. 715 Chinese tokens are mentioned once, 315 occur twice and 17 occur more than 100 times.

The automatically constructed datasets from web images

Compared to the CUB-dataset (11.778 images), the datasets mentioned above are relatively small. To construct datasets with more images, an automated construction pipeline was engineered for generating datasets [19]. This resulted in two datasets with a large-scale image database: the Poem-Line data and the Famous-Poem data.

The first large-scale dataset is the Poem-Line image dataset. It consists of 89.920 images with a corresponding poem line for each image.


Figure 10: The vocabulary consists of 1.206.742 Chinese tokens in total, of which 6.645 are distinct. 929 Chinese tokens occur once, 474 occur twice, 1566 are mentioned more than 100 times and 250 are mentioned more than 250 times. The Chinese tokens that appear most often are: (do not = 9611), (people = 8786), (one = 8223), (wind = 8080) and (mountain = 7634).

The last large-scale dataset is the Famous-Poem dataset, containing 6.152 images and corresponding poem lines.

Figure 11: The vocabulary consists of 92416 Chinese tokens in total, of which 3382 are distinct. 809 Chinese tokens appear once, 430 occur twice and 181 occur more than 100 times.


4.2 Training AttnGAN Components on Poem2Image Datasets

Due to the differences between the CUB dataset and the Poem2Image datasets, the parameters for training the AttnGAN require revision. To determine how to adjust these parameters, multiple sets of parameters are adopted and analyzed. First, the same parameters are adopted as for training the AttnGAN on the CUB dataset (see Table 1). Second, to refine the model on the Poem2Image datasets, experiments are done with the training of the text-encoder, the training of the GAN and the value of λ.

Number of Epochs for the Text-Encoder

First, the best number of epochs for training the DAMSM is determined through experiments, by analyzing the training of the text-encoder. It is expected that the $L_{DAMSM}$ will decrease for both the training set and the validation set, and that after a certain number of epochs overfitting sets in, from which point the performance of the encoder decreases. Overfitting occurs when the encoder learns the training data too closely, which can result in failing to predict data of the validation set [1].

Number of Epochs for the GAN

The aim of the second part of the experiments is to find the best number of epochs to train the GAN models. Because GAN models do not converge (see Section 2.2), the learning curves of the generator and discriminator do not reveal when the best generator is trained. Therefore, multiple stages of the generator are analyzed by creating samples with these generators. Subsequently, the inception score is calculated over these images and the best generator is selected.

Poem2Image Models

Once the four datasets are adopted and analyzed, the most promising dataset is used for the last part of the experiments. To determine the performance of the datasets, the inception score and R-precision are calculated based on the images generated by the model, with the parameters determined in the previous experiments. For this research, the inception score reflects the performance of the AttnGAN objectively, while the R-precision mainly reflects the performance of the DAMSM (see section X). Consequently, the inception score is the most meaningful value, which is taken into account when the best dataset is selected.


Value of λ

To determine the best value of λ, three values of λ are tested and evaluated. These experiments aim to evaluate the importance of λ for training the AttnGAN model on Poem2Image datasets. The parameter λ balances the generator loss and the DAMSM loss during the training of the GAN part of the AttnGAN model; more plainly stated, it determines the impact of the DAMSM loss function. First, λ is set to 0, so the DAMSM does not influence the GAN training; this reflects the importance of the DAMSM loss. Second, λ is increased to 50. As mentioned in the AttnGAN paper of Xu et al. [26], the $L_{DAMSM}$ is especially important for generating complex scenarios, and the value of 50 is chosen based on promising results on the COCO dataset [6]. Because of the noise and the complexity of the Poem2Image datasets, λ is once more increased, to 70. Due to time limitations, this experiment is done using only the most promising dataset based on the previous experiments.

4.3 Enhanced AttnGAN with BERT-based Text-Encoder (AttnGAN+)

As described in Section 3.2, the AttnGAN network embeds a recurrent neural network as a text-encoder for extracting the semantic vectors from the text descriptions: a bi-directional Long Short-Term Memory (LSTM). In this section, it is proposed to replace the LSTM with the BERT [3] model to improve the generation of images from the Poetry-Image datasets.

Architecture of the BERT Model

The BERT model (Bidirectional Encoder Representations from Transformers) [3] is developed to train deep bidirectional representations from unlabeled text. The model conditions on both the right and left context in all layers. The key idea of the model is the bidirectional training of a Transformer, an attention mechanism that learns contextual relations between words and sub-words in a text. In contrast to the bidirectional LSTM, the Transformer analyzes the entire sentence at once, which allows the model to learn the context of a word based on all the surrounding tokens.

AttnGAN+: Implementing the BaselineBERT and PoemBERT

To improve the AttnGAN model for the Poem2Image data, two BERT models are tested: the BaselineBERT and the PoemBERT [22]. The BaselineBERT is the BERT model trained on Wikipedia data. The PoemBERT is the BERT model trained on a corpus of 280.000 Chinese classical poems [22]. This could improve the understanding of the poems and thus the generation of paintings based on Chinese classical poems. Experiments are done to examine the behaviour of the BERT models within the AttnGAN model.
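As an illustration of how such a BERT encoder could be plugged in, the sketch below uses the Hugging Face transformers library to produce per-token features and a sentence vector in the same shapes as the outputs of the original LSTM text encoder. The bert-base-chinese checkpoint only stands in for the BaselineBERT, the PoemBERT checkpoint name is not given here, and in practice the 768-dimensional BERT features would still need to be projected to the dimension expected by the DAMSM.

import torch
from transformers import BertTokenizer, BertModel

# 'bert-base-chinese' is a publicly available checkpoint used here as a stand-in;
# the actual BaselineBERT/PoemBERT weights of [22] are not referenced by name.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def encode(poem_lines):
    """Return word features e (batch, hidden, T) and sentence vectors e_bar (batch, hidden)."""
    batch = tokenizer(poem_lines, padding=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch)
    e = out.last_hidden_state.transpose(1, 2)   # per-token features
    e_bar = out.last_hidden_state[:, 0]         # [CLS] token as a global sentence vector
    return e, e_bar

e, e_bar = encode(["春眠不觉晓", "处处闻啼鸟"])   # example poem lines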

5 Experiments and Results

The methods described in the previous chapter are carried out in this chapter and the results are presented. First, the different components are analyzed based on the corresponding results. Second, images generated with the different Poem2Image data are shown.

5.1 Component Analysis

The components of the AttnGAN model are evaluated based on their results. For the learning of the AttnGAN model, different parameters can be set. For each component, the learning behaviour for different values of the parameter is analyzed; the most promising value is selected and adopted for the following experiments. The parameters are fine-tuned one at a time while keeping the others stable.

Impact of the Text-Encoder

Figure 12 shows that there is a significant difference in the learning of the text-encoder between the Poem2Image datasets and the CUB-dataset. The total DAMSM loss on the Poem2Image training sets decreases, while the total DAMSM loss on the Poem2Image validation sets stays level or even increases. A possible explanation for this is overfitting of the data: the model corresponds too closely to the training data, which can result in failing to predict future data (here: the validation set) reliably [1]. Relative to the other Poem2Image datasets, the Poem-Image dataset is the smallest (containing 300 images). This can explain the fact that the losses on the validation set of the Poem-Image data are the highest (infinity): the text encoder 'memorizes' the training data (225 images) instead of generalizing the data distribution.


Figure 12: The learning of the DAMSM on the Poetry-Image datasets. (a) Loss DAMSM Title Image; (b) Loss DAMSM Poem Image; (c) Loss DAMSM Poem Line; (d) Loss DAMSM Poem Famous.

The optimal number of epochs to train the text-encoder is selected based on Figure 12. In line with the learning behaviour on the CUB-dataset, the most promising text-encoder is found at the number of epochs where the learning rate and the validation loss begin to stabilize. Table 4 shows the selected optimal number of epochs for the text-encoders.

Optimal number of epochs DAMSM

                      Title Image | Poem Image | Poem Line | Poem Famous
Optimal epoch DAMSM   50          | 40         | 40        | 40

Table 4: Optimal number of epochs for training the DAMSM, chosen based on the learning curve of the text-encoder

Impact of the GAN

No fundamental differences were found in the shape of the learning curves of the GAN between the Poem2Image datasets and the CUB dataset. However, just as the training of the text-encoders differs between the CUB and Poetry-Image datasets, the learning curves of the GAN differ as well. For example, the starting value of the loss on the Poetry-Image datasets is more than ten times as large as the starting value of the loss on the CUB dataset (see Figure 13). This indicates that the Poem2Image data distributions are more complex than the CUB data distribution. However, the learning curves of the discriminators approximately follow the same gradient as the discriminator on the CUB dataset and end between 0 and 1. In all figures, the curves oscillate, which is typical behavior of an adversarial generative model; it indicates that the training of GANs is not stable, as expected.


Figure 13: Learning curves of the Generator and Discriminator while training on the Poem2Image datasets. (a) Title Image Generator and Discriminator; (b) Title Image Generator; (c) Title Image Discriminator; (d) Poem Famous Generator and Discriminator; (e) Poem Famous Generator; (f) Poem Famous Discriminator; (g) Poem Image Generator and Discriminator; (h) Poem Image Generator; (i) Poem Image Discriminator; (j) Poem Line Generator and Discriminator; (k) Poem Line Generator; (l) Poem Line Discriminator.


A positive correlation was found between the inception score and the number of epochs for training the GAN (see Figure 14). What can clearly be seen is the variability of the learning between the Poem-Line data and the Poem-Image data. It can be stated that the optimal number of epochs for training the GAN differs between the Poem2Image datasets.

Figure 14: The inception score for different numbers of GAN training epochs on the Poem-Line and the Poem-Image data. (a) Poem-Line data; (b) Poem-Image data.

Impact of the Poem2Image Datasets

The performance of the two manually constructed datasets is relatively low compared to the automatically constructed datasets (see Table 5). This may be due to the number of images and tokens, which is significantly lower for the manually constructed datasets (300 and 3.658 images) than for the automatically constructed datasets (89.920 and 6.152 images).

Based on the inception score and the R-precision, the Poem-Line dataset fits the AttnGAN model best. When comparing the performance on the CUB-dataset with that on the Poem-Line dataset, the inception score of the Poem-Line image dataset is relatively low and the R-precision is relatively high. In other words, the diversity and the quality of the images generated with the Poem-Line dataset are low, while the images score high on matching their descriptions according to the text encoder of the pretrained DAMSM.


Evaluation Poetry2Image data

                1: Title Image (Feng Zikai) | 2: Poem Image (Feng Zikai) | 3: Poem Line (web-images) | 4: Poem Famous (web-images)
R-p validation  8.81%  | 16.6% | 85,83% | 55.63%
R-p training    9.33%  | 17.6% | 87,56% | 53.95%
IS validation   1.25   | 1.27  | 1.62   | 1.61
IS training     1.24   | 1.26  | 1.66   | 1.60

Table 5: Evaluation metrics on the outcomes of the four different datasets, with epoch 50 on the GAN.

Impact of λ

There is a positive relation between λ and the performance of the AttnGAN model on the Poem2Image datasets (see Table 6), but the performance of the model does not increase linearly with increasing λ. According to these results, the AttnGAN model performs best on the Poem-Line dataset with λ = 50. Hence, λ = 50 is chosen for the rest of the Poem2Image experiments. Although the best value of λ differs between datasets [26], this choice is made due to time limitations.

Evaluating λ

                        1: Poem Line | 2: Poem Line | 3: Poem Line
λ                       0       | 50      | 70
R-precision valid       14.83%  | 85,83%  | 1.68%
R-precision train       16.81%  | 87,56%  | 1.60%
Inception score valid   1.55    | 1.62    | 1.46
Inception score train   1.59    | 1.66    | 1.45

Table 6: Evaluation metrics on the outcomes of the Poem-Line dataset with different λ, for selecting the best value of λ

Effectiveness of the AttnGAN+

Strong evidence was found for the performance of the BaselineBERT on the Title-Image data, see Table 7. However, the scores on the Poem-Line data are the highest when using the original RNN encoder. The highest inception scores for both the Poem-Image dataset and the Title-Image dataset are achieved with λ = 5 and the PoemBERT encoder. Due to the lack of experiments comparing λ = 5 with the original RNN encoder, these results need to be interpreted with caution.


Hence, no significant evidence is found in this thesis for the performance of the PoemBERT on the Poem2Image data. Further studies, which take these variables into account, will need to be undertaken.

Evaluation Text-Encoders

                       Poem-Image | Title-Image | Poem-Line
λ = 50: original RNN   1.27  | 1.25 | 1.62
λ = 50: Baseline       1.20  | 1.61 | 1.36
λ = 50: PoemBERT       1.15  | 1.40 | 1.32
λ = 5: Baseline        1.11  | 1.24 | 1.39
λ = 5: PoemBERT        1.30  | 1.83 | 1.42

Table 7: Inception scores on the Poem2Image data with different Text-encoders (BaselineBERT, PoemBERT, RNN) and different λ

5.2 Synthesized Images

For an insight into the images generated with the optimized AttnGAN for each dataset, the generated images can be viewed in Figure 15. These qualitative findings may help to understand the patterns the model is learning. The Poem-Image data trains a model that mainly repeats one pattern, which can be an explanation for the relatively low R-precision scores. Similarly, the Title-Image model seems to learn one type of figure: the generated images are filled with different variations of this shape. The Poem-Line model and the Famous-Poem model, on the other hand, output more figurative images. Limited high-quality data tends to suit the AttnGAN model less than large-scale noisy data.

Figures 16 and 17 show the correlation between the (poem) text, the original image and the generated image. The generated image is shown on the left and the original image from the validation set on the right. The input text below each figure is not a Chinese classical poem, but a sample from the text data used for training the model.


Figure 15: Generated images with the optimized AttnGAN and Poem2Image data. (a) Poem-Image dataset, epoch encoder = 40, epoch GAN = 900, λ = 50; (b) Title-Image dataset, epoch encoder = 50, epoch GAN = 50, λ = 50; (c) Poem-Line dataset, epoch encoder = 40, epoch GAN = 30, λ = 50; (d) Poem-Famous dataset, epoch encoder = 40, epoch GAN = 600, λ = 50.


Figure 16: left: generated image, right: original image, below: input text. 三五年时三五月，可怜杯酒不曾消。 ("In three or five years, the poor glass of wine never disappeared.")

Figure 17: left: generated image, right: original image, below: input text. 衣钵三宗在，江河万古流。 ("There are three mantles, and the rivers flow forever.")

6 Conclusion

The main goal of this project was to train a neural network generative model to generate high-quality images of paintings in the style of Feng Zikai [24], corresponding to Chinese classical poems. To achieve this, the AttnGAN [26] was reproduced and four task-specific Poem2Image datasets were adopted and compared. Through experiments, different components of the AttnGAN model were analyzed and hyperparameters of the AttnGAN model were improved. Finally, the text-encoder was replaced with two other types of neural networks [22] to improve the representation of the poems.

This study has shown the importance of the different components of the AttnGAN network when training with noisy datasets. The results of this study indicate that the AttnGAN model requires modification for large-scale noisy data and limited high-quality data. Limited high-quality data tends to suit the AttnGAN model less than large-scale noisy data. The findings reported here shed new light on the challenges of Text2Image synthesis and more specifically on Poem2Image synthesis. This study shows the possibilities of a new field of generating images based on poems. A limitation of this study is that not all the hyperparameters of the AttnGAN model were examined; this would be a fruitful area for further work.

7 Future Work

Despite the results of the AttnGAN model on the Poem2Image data, questions remain. Further research should be undertaken to investigate other components of the AttnGAN network in combination with the Poem2Image data or other limited high-quality training data and large-scale noisy training data.

7.1 Hyperparameters

This research investigated the influence of the GAN, the DAMSM, λ and the text-encoder (LSTM). Besides, it demonstrated the impact of the PoemBERT and the BaselineBERT as replacements for the text-encoder. However, several other components of the AttnGAN can be examined. To develop a full picture of the AttnGAN model on the Poem2Image data, additional studies are needed that explore the impact of the learning rates of the discriminator and the generator, the influence of the batch size and the significance of different values of γ1, γ2 and γ3. For example, research has shown (source) that different learning rates for the discriminator and generator could stabilize the learning of the GAN.

7.2 Correlation between Poem2Image Data and AttnGAN

Although this research showed the impact of the different Poem2Image data on the AttnGAN model, the findings do not point out the exact aspects of the data that cause specific problems. More plainly stated, the correlation between aspects of the data and the training of the AttnGAN can be further examined. For example, the exact correlation between the frequency of words in the data and the performance of the model requires additional research.

7.3 Replacement of Components in AttnGAN

Lastly, several components of the AttnGAN can be replaced for improvement. Beyond the experiments on the text-encoder, the image-encoder can also be replaced. The image-encoder that is used is pretrained on ImageNet [10]. However, experiments can be done with pretraining the image-encoder on cartoons, paintings or even on Chinese classical paintings to improve the results of the AttnGAN model.


Bibliography

(1) Abu-Mostafa, Y. S., Learning from data : a short course; AMLBook: United States.

(2) Barratt, S., and Sharma, R. (2018). A Note on the Inception Score. arXiv:1801.01973 [cs, stat], arXiv: 1801.01973.

(3) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs], arXiv: 1810.04805.

(4) Farmer, T. A., Brown, M., and Tanenhaus, M. K. (2013). Prediction, explanation, and the role of generative models in language processing. Behavioral and Brain Sciences 36, 211–212.

(5) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative Adversarial Networks.

(6) Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., and Dollár, P. (2015). Microsoft COCO: Common Objects in Context. arXiv:1405.0312 [cs], arXiv: 1405.0312.

(7) Liu, L., Wan, X., and Guo, Z. (2018). Images2Poem: Generating Chinese Poetry from Image Streams. 1967–1975.

(8) Liu, Y., Liu, D., and Lv, J. (2019). Deep Poetry: A Chinese Classical Poetry Generation System. arXiv:1911.08212 [cs], arXiv: 1911.08212.

(9) Mansimov, E., Parisotto, E., Ba, J. L., and Salakhutdinov, R. (2016). Generating Images from Captions with Attention. arXiv:1511.02793 [cs], arXiv: 1511.02793.

(10) Mittal, S., Kaushik, P., Hashmi, S., and Kumar, K. (2018). Robust Real Time Breaking of Image CAPTCHAs Using Inception v3 Model. ISSN: 2572-6129, 1–5.


(11) Moore, A. G., Four Masters of Yuan and Literati Art: Tradition in China from Mongol Rule to Modern Times. CreateSpace Independent Publishing Platform: 2017.

(12) Nash, D. Feng Zikai, eng, 1996.

(13) Nieuwburg, E. (2020). Building a Dataset for the Visualization of Classical Chinese Poems.

(14) Paszke, A. et al. In Advances in Neural Information Processing Systems 32, Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., and Garnett, R., Eds.; Curran Associates, Inc.: 2019, pp 8024–8035.

(15) Qiao, T., Zhang, J., Xu, D., and Tao, D. (2019). MirrorGAN: Learning Text-to-image Generation by Redescription. arXiv:1903.05854 [cs], arXiv: 1903.05854.

(16) Reed, S., Akata, Z., Mohan, S., Tenka, S., Schiele, B., and Lee, H. (2016). Learning What and Where to Draw. arXiv:1610.02454 [cs], arXiv: 1610.02454.

(17) Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. (2016). Generative Adversarial Text to Image Synthesis. arXiv:1605.05396 [cs], arXiv: 1605.05396.

(18) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). Improved Techniques for Training GANs. arXiv:1606.03498 [cs], arXiv: 1606.03498.

(19) Sun, F. (2020). Noise large-scale poem-image pairs for poem-to-image gen-eration.

(20) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. arXiv:1512.00567 [cs], arXiv: 1512.00567.

(21) Van Rossum, G., and Drake Jr, F. L., Python reference manual; Centrum voor Wiskunde en Informatica Amsterdam: 1995.

(22) Vaudrin, R. (2020). PoemBERT: A representation of classical Chinese poetry, for poem based image generation.

(23) Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset, tech. rep. CNS-TR-2011-001, California Institute of Technology, 2011.

(24) Wu Haoran, F. Z., Feng Zikai Comics Collection; Dolphin Press: 2014.

(25) Xu, L., Jiang, L., Qin, C., Wang, Z., and Du, D. (2018). How Images Inspire Poems: Generating Classical Chinese Poetry from Images with Memory Networks. CoRR abs/1803.02994.


(26) Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., and He, X. (2018). AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. 1316–1324.

(27) Yang, Z., He, X., Gao, J., Deng, L., and Smola, A. (2016). Stacked Attention Networks for Image Question Answering. arXiv:1511.02274 [cs], arXiv: 1511.02274.

(28) Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., and Metaxas, D. (2017). StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. arXiv:1612.03242 [cs, stat], arXiv: 1612.03242.

(29) Zhang, J., Feng, Y., Wang, D., Wang, Y., Abel, A., Zhang, S., and Zhang, A. (2017). Flexible and Creative Chinese Poetry Generation Using Neural Memory. arXiv:1705.03773 [cs], arXiv: 1705.03773.

(30) Zhu, M., Pan, P., Chen, W., and Yang, Y., DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis, arXiv: 1904.01310, 2019.
