PoemBERT: A representation of classical Chinese poetry for poem based image generation

Academic year: 2021



PoemBERT:

A representation of classical Chinese poetry

for poem based image generation


Layout: typeset by the author using LaTeX. Cover illustration: Feng Zikai.


PoemBERT:

A representation of classical Chinese poetry

for poem based image generation

River Vaudrin
11877154

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie (Artificial Intelligence)

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Ms D. (Dan) Li MA
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 26th, 2020


acknowledgements

First of all, I would like to thank my supervisor Ms. Dan Li for providing valuable guidance over the past twelve weeks. During every one of our weekly meetings, I felt that she was very invested and involved in our project. During these same meetings, she always helped me structure the goals for each week, which was much appreciated. Ms. Dan Li also read my final report twice, after which she gave helpful feedback that definitely improved this report.

I would also like to thank my fellow team members, Nina van Liebergen, Silvan Murre, Elisha Nieuwburg and Fengyuan Sun, for our continuous collaboration during the entire project. The entire project felt like a group effort, as we frequently asked each other questions to clarify the work that one had done. These questions would be answered quickly and thoroughly.


abstract

Classical Chinese poetry is one of the longest continuous traditions in world literature, and these poems are often accompanied by paintings created by Chinese artists. Representing poetry in a vector space is recognized as a difficult task in the field of Natural Language Processing, due to the unique and distinct features of poetry. In this report, we try to construct a language representation model that correctly models classical Chinese poetry in a vector space, to benefit the poem based image generation task. We introduce PoemBERT, a BERT model that is trained with a classical Chinese poetry corpus. To determine whether PoemBERT benefits poem based image generation, it was incorporated in the AttnGAN framework. AttnGAN is a text-to-image generation technique that can produce fine-grained details at different sub-regions of the image by paying attention to the relevant words in the natural language description. PoemBERT proved to achieve superior results on this task compared to ChineseBERT, which was trained with a Wikipedia corpus, according to the Inception Score, a metric that has proven to correlate well with human scoring of realism and that is calculated based on the sharpness and variety of the generated images. Although PoemBERT produced more realistic images compared to ChineseBERT, the generated images did not show any correlation with the captions they were generated from.


contents

1 Introduction
2 Related Work
 2.1 Bidirectional Encoder Representations from Transformers
  2.1.1 ChineseBERT
 2.2 Attentional Generative Adversarial Networks
3 Methodology
4 Experiment Setup
 4.1 Sub-questions
 4.2 Evaluation Method
  4.2.1 Intrinsic Evaluation
  4.2.2 Extrinsic Evaluation
  4.2.3 Baseline
 4.3 Datasets
  4.3.1 PoemBERT
  4.3.2 AttnBERT-GAN
 4.4 Configurations
  4.4.1 PoemBERT
  4.4.2 AttnBERT-GAN
5 Results
 5.1 Intrinsic Evaluation (PoemBERT)
  5.1.1 Training-policy (RQ1-1)
  5.1.2 Corpus (RQ1-2)
 5.2 Extrinsic Evaluation (AttnBERT-GAN)
  5.2.1 Text-Image Dataset (RQ2-1)
  5.2.2 Lambda (RQ2-2)
6 Conclusion
7 Future Work

list of figures

Figure 1  Works from Chinese artist Feng Zikai
Figure 2  Images generated by AICAN
Figure 3  Example Next Sentence Prediction task
Figure 4  Example results of the AttnGAN
Figure 5  The architecture of the AttnGAN
Figure 6  The architecture of the GAN
Figure 7  PoemBERT flowchart
Figure 8  Two examples from Poem-Image dataset
Figure 9  Two examples from Title-Image dataset
Figure 10 Three examples from Regular Poem-Image dataset
Figure 11 Example from Poem-Image validation-set
Figure 12 Results from the Poem-Image dataset
Figure 13 Examples from Title-Image validation-set
Figure 14 Results from the Title-Image dataset


Figure 16 Example from Regular Title-Image validation-set
Figure 17 Results from the Regular Title-Image dataset

list of tables

Table 1 PoemBERT: training corpora information
Table 2 AttnBERT-GAN: training datasets information
Table 3 Different configurations of PoemBERT
Table 4 Intrinsic evaluation of PoemBERT

1 introduction

Classical Chinese poetry is considered an important cultural heritage in China. With its thousands of years of history, it is one of the longest and largest continuous traditions in world literature [1]. Classical Chinese poetry is characterized by its strict set of syntactic, phonological, and semantic rules. It also differs from generic text because of its wide use of literary quotations and its concise wording. During its more than 2000-year history, millions of poems have been written; frequent themes are heroic characters, beautiful scenery, relationships, and a deep yearning for the past [2]. Throughout this rich cultural history, Chinese artists have created paintings that reflect the feeling, event, or landscape described in the poems (Figure 1). This has led to an abundance of artistic works inspired by classical Chinese poetry, which is now considered an ancient art form [3]. Classical Chinese poetry has evolved into modern (or post-classical) Chinese poetry; the key difference is that the latter is free from strict rules. Artificial Intelligence (AI) could provide a way to extend a tradition that already ranks among the world's longest.

Figure 1: Works from Chinese artist Feng Zikai

In recent years, the field of AI has become increasingly interested in the intersection between art and itself [4, 5, 6, 7]. The main reason for this is that creative acts, such as painting, poetry, and music, are regarded as uniquely human. Therefore, the main objective of this line of AI research is to create artificially made art that is indistinguishable from human-made art.

An advancement that has contributed a great deal to computer-generated art is Text-to-Image (T2I) generation. T2I generation has become a popular topic in the fields of Natural Language Processing (NLP) and Computer Vision (CV) [8], as it is the perfect junction between the two. T2I generation aims to generate realistic images from text descriptions, a challenging task, considering that images and language in the real world are noisy with great variability. Generative Adversarial Networks (GANs) have become the state-of-the-art method for generating new synthetic data that can pass as real data [9]. The fundamental idea of these models is that they try to map local image features to their corresponding words; this results in a model that can generate images based on an unseen text description. The results yielded by GANs have been outstanding, as it is difficult to determine whether the generated images were created by a human or a computer (Figure 2). The next step for T2I generation is to generate realistic images based on more complex images and image descriptions, as GANs can already successfully create images based on relatively simple datasets, such as the CUB and COCO datasets.

Figure 2: Images generated by Rutgers University's AICAN [10]. AICAN is built upon the GAN architecture.

A great leap forward would be for GANs to successfully generate realistic images from classical Chinese poem descriptions. However, this will be a tremendous obstacle to overcome, as GANs will have more difficulty learning the local features of an image from poem descriptions than from regular text descriptions. The main challenge is finding a language representation model that has an accurate understanding of poetry and that can be utilized for T2I generation. Representing Chinese poetry, and poetry in general, in a vector space is recognized as a difficult task in the field of NLP because of its linguistic features, strict rules, literary quotations, and concise wording. A successful method of representing classical Chinese poetry uses additional information, such as the semantic meaning of each line, the semantic relevance among multiple lines, the theme of the poem, and structural, rhythmical, and tonal patterns [11]. This additional information is difficult to incorporate into the state-of-the-art GAN models, making this method unsuitable for T2I generation. A language representation model that is known to be widely applicable and extremely successful is Google AI's BERT, a language representation model that took the entire field of NLP by storm after its introduction in 2018.

BERT1, which stands for Bidirectional Encoder Representations from Transformers, has proven to obtain state-of-the-art results on a broad set of NLP tasks, such as Natural Language Understanding (NLU), Question Answering (QA), and Natural Language Inference (NLI) [12]. The main idea of BERT is that it learns by performing two unsupervised tasks on a large corpus. Based on these tasks, a vector for each unique token (word or wordpiece) in the corpus is computed and adjusted after each iteration. Each vector represents a token's context, and the collection of these vectors forms the output of the model. The key technical innovation of BERT is that it is bidirectionally trained; therefore the model has a deeper understanding of language context compared to unidirectional language models. Another important advantage of BERT is that the model has shown to be multilingual, meaning that it yields a similar performance across many different languages, including Chinese. Due to these characteristics, BERT could prove to be a suitable language model to correctly represent classical Chinese poetry.

The research question we will try to answer is: Can BERT learn a correct representation of classical Chinese poetry that benefits the poem based image generation task?

The remainder of the report is divided into six sections. Each of these sections will provide insight into how the research was conducted and why it was conducted the way it was. Section 2, related work, will provide the theoretical background needed to better understand the research method. More specifically, the language representation model, BERT, and the T2I generation model, AttnGAN, will be explained in detail. Section 3, methodology, will outline the approach of the research and clarify the motive behind certain methodological choices. Section 4, experiment setup, will expand upon the methodology by explaining and motivating the evaluation method and the different experiments that have been conducted in this research. Section 5, results, will present and discuss the results of the described experiments. The final two sections will conclude the entire research and recommend ideas for future research.

2 related work

2.1 Bidirectional Encoder Representations from Transformers

As previously mentioned, BERT can be applied to a wide range of NLP tasks, as it has demonstrated state-of-the-art results on a variety of them. The reason for its broad success lies in its deep bidirectional nature. Where its predecessors, known as unidirectional language models, make use of a left-to-right or right-to-left architecture to model the context of a token2, BERT jointly conditions on both left and right context, ensuring that BERT has a better understanding of the entire context.

BERT's framework consists of two steps: pre-training and fine-tuning. During the pre-training phase, the model is trained on unlabeled text data, preferably a large amount. The output is a number of vectors equal to the total number of tokens, and the size of each vector is determined by the parameter hidden size (H). H is determined by the size of the model; in total there are six different BERT sizes. To determine the model size needed, a trade-off has to be made between the complexity of the NLP task and computational/time resources. Complex NLP tasks might require a greater BERT size; however, pre-training BERT is computationally expensive, and selecting a smaller BERT size will drastically decrease the training time. This report will primarily report results on BERT-Tiny and BERT-Base3.

A BERT model is usually trained on a large generic corpus, for example Wikipedia data. However, BERT can also be fine-tuned on smaller task-specific datasets. This means that you can take a pre-trained BERT, trained on a large generic corpus, and fine-tune it on a more task-specific corpus to ensure that the model performs well on that specific task. In other words, you can take a pre-trained BERT, trained on Wikipedia data, and fine-tune it on a movie review dataset to ensure that BERT performs well on the Sentiment Analysis task. The main advantages of this approach are that it will achieve better results than pre-training BERT from-scratch on only a movie review corpus and that it will also save a lot of training time. The original BERT-Base was trained on 4 cloud TPUs for 4 days and the original BERT-Large was trained on 16 TPUs for 4 days, which illustrates how expensive it is to train a BERT model.
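The size trade-off above can be made concrete with a small sketch (hypothetical helper names; the L/H/A values for BERT-Tiny and BERT-Base are the ones listed in footnote 3), showing how the hidden size H fixes the shape of the model's output:

```python
# BERT model sizes used in this report (L = layers, H = hidden size,
# A = attention heads), values as listed in footnote 3.
BERT_SIZES = {
    "tiny": {"L": 2, "H": 128, "A": 12},
    "base": {"L": 12, "H": 768, "A": 12},
}

def output_shape(num_tokens, size):
    """BERT outputs one H-dimensional vector per input token."""
    return (num_tokens, BERT_SIZES[size]["H"])

print(output_shape(128, "tiny"))  # -> (128, 128)
print(output_shape(128, "base"))  # -> (128, 768)
```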

BERT learns by performing two unsupervised tasks: the Masked LM (MLM) task and the Next Sentence Prediction (NSP) task. The MLM task is essentially a game of Fill-In-The-Blank. In practice, the model achieves this by randomly masking a percentage of the tokens in the corpus, after which it tries to predict these masked tokens based on their context. The MLM task greatly benefits from BERT's bidirectionality. The advantage becomes evident when we examine the sentence "I accessed the [MASK] account". In this particular example, a unidirectional model would have more difficulty predicting "bank" than a BERT model, because the unidirectional model only has knowledge about either the tokens before or after the masked token, while BERT has knowledge about both.
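The masking step can be illustrated with a toy sketch (a simplified, hypothetical implementation: BERT masks around 15% of tokens and applies additional replacement rules not shown here):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random fraction of tokens with [MASK] and return the
    masked sequence plus the positions the model must predict."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for i in positions:
        masked[i] = "[MASK]"
    return masked, positions

tokens = ["I", "accessed", "the", "bank", "account"]
masked, positions = mask_tokens(tokens)
print(masked, positions)
```

During training, the model only receives the masked sequence and is scored on how well it recovers the tokens at the returned positions.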

2 A token is typically a word or wordpiece; however, it could also be a single character.
3 BERT-Tiny (L=2, H=128, A=12, total parameters=4.4M);
BERT-Base (L=12, H=768, A=12, total parameters=110M)


Figure 3: Example of the Next Sentence Prediction task

For the NSP task, the model tries to predict whether sentence B does or does not come after sentence A (Figure 3). To accomplish this, two sentences are paired together with a label IsNext or NotNext; 50% of the time sentence B actually comes after A, and 50% of the time sentence B is a randomly selected sentence from the corpus. The model then tries to predict whether sentence B comes after sentence A based on the context of both sentences. The NSP task is especially beneficial for QA and NLI, because these tasks are based on understanding the relationship between sentences. BERT tries to minimize the combined loss function of these two strategies to achieve the desired result. The result is a vector of size H for each unique token in the corpus, with each vector representing the context of its corresponding token. The combination of these vectors forms the final output of the model, which can then be used as the language representation model for the desired NLP task.
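The sentence-pairing step can be sketched as follows (a simplified, hypothetical implementation; the real pipeline also adds [CLS]/[SEP] tokens and guards against the random sentence accidentally being the true successor):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Pair each sentence with its true successor (IsNext) or a random
    sentence from the corpus (NotNext), roughly 50% of the time each.
    Simplification: the random pick could coincide with the true successor."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            pairs.append((sentences[i], rng.choice(sentences), "NotNext"))
    return pairs

corpus = ["A man went to the store.", "He bought a gallon of milk.",
          "Penguins are flightless birds.", "They live south of the equator."]
for a, b, label in make_nsp_pairs(corpus):
    print(label, "|", a, "->", b)
```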

2.1.1 ChineseBERT

After the positive reaction to BERT within the NLP community, Google AI Language has released BERT in a total of 106 different languages in recent years. All of these BERT models, including ChineseBERT, have proven to attain a considerable performance boost compared to their predecessors, exemplified by the performance on the XNLI evaluation corpus [13]. ChineseBERT was, like the English BERT, trained on extracted Wikipedia data, which consisted of 13.6M lines of Simplified and Traditional Chinese text. The architectures of the English BERT and ChineseBERT are practically the same; the only differences concern the tokenization and the encoding. Where the English BERT uses WordPiece tokenization, ChineseBERT uses character-based tokenization. This is because Chinese characters store the meaning of an idea, whereas English characters only store pronunciation. In essence, each Chinese character is already a wordpiece. Furthermore, the English BERT is ASCII encoded, whereas ChineseBERT is UTF-8 encoded, in order to support Chinese characters.
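The character-based tokenization can be shown with a minimal sketch (the real ChineseBERT tokenizer also handles punctuation, unknown tokens, and special symbols such as [CLS] and [SEP]):

```python
def tokenize_chinese(text):
    """ChineseBERT-style tokenization: every character becomes its own
    token, since each Chinese character already carries meaning."""
    return [ch for ch in text if not ch.isspace()]

# A line from the corpus example in Section 4.3.1
print(tokenize_chinese("朔风动地来"))  # -> ['朔', '风', '动', '地', '来']
```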


2.2 Attentional Generative Adversarial Networks

The Attentional Generative Adversarial Network (AttnGAN)4 is a state-of-the-art T2I generation technique that can synthesize fine-grained details at different sub-regions of the image by paying attention to the relevant words in the natural language descriptions (Figure 4) [14].

Figure 4: Example results of the AttnGAN. The first row gives the low-to-high resolution images generated by the AttnGAN; the second and third rows are attention maps that show the local image features mapped to their corresponding word.

AttnGAN differs from other GAN-based models for T2I generation because it enables the generative network to draw different sub-regions of the image conditioned on the words that are most relevant to those sub-regions. Predecessors would typically encode the entire sentence of the text description into a single vector as the condition for image generation, thus lacking the fine-grained word-level information.

Figure 5: The architecture of the AttnGAN.

The AttnGAN framework consists of two components, the Deep Attentional Multimodal Similarity Model (DAMSM) and the Generative Adversarial Network (GAN) (Figure 5). The DAMSM is called multimodal because its objective combines two modes of understanding, namely textual and visual. The DAMSM maps the sub-regions of the image and the words of the sentence to a common semantic space. It does this by training two neural networks, the text encoder and the image encoder. The text encoder is a Recurrent Neural Network (RNN), known as a bidirectional Long Short-Term Memory (LSTM). The text encoder computes the semantic vectors for each word based on the text description of an image. The image encoder is a Convolutional Neural Network (CNN) that maps the images to the semantic vectors extracted by the text encoder. The CNN learns the local features of the different sub-regions of the images and the global features of the image, and tries to connect them to their corresponding word. After every one hundred epochs the DAMSM outputs an attention map; this map shows which local image features are detected and which word they are associated with (Figure 4). The DAMSM is pre-trained by minimizing the L_DAMSM loss, so the DAMSM has converged when this loss function is at its lowest point. When the model has converged, the text and image encoders are used as the input for the GAN part of the framework.

Figure 6: The architecture of the GAN.

The GAN also consists of two networks, a generator and a discriminator (Figure 6). The generator is trained to generate images that can deceive the discriminator, while the discriminator tries to distinguish the generated data from the actual data. After each iteration, the discriminator should improve at detecting generated data, and the generator should become better at generating images. The generator and the discriminator thus improve each other, and after a sufficient number of training epochs, the generator should produce realistic images.

3 methodology

To represent classical Chinese poetry in a vector space, we introduce a language representation model called PoemBERT (Figure 7). PoemBERT will be a BERT model trained on a classical Chinese poetry corpus. This will be realized by employing two different training-policies.

The first training-policy is training PoemBERT from-scratch, which means that the model's only input text will be a corpus of classical Chinese poetry. It is recommended to train your own BERT from-scratch if you cannot find a pre-trained model that is trained on a domain similar to your NLP task [13]. The advantage of this training-policy is that PoemBERT will be custom-made for the NLP task, namely poem based image generation. A disadvantage is that training from-scratch requires a considerable amount of computational and time resources. Due to time constraints, it will not be possible to train PoemBERT from-scratch with the recommended number of training-steps (one million), and this could drastically impact the performance of the model. As a result of this complication, the second training-policy might prove to be the superior policy.

The second policy is taking the pre-trained ChineseBERT Uncased5 and extending it with the same classical Chinese poetry corpus that will be used for training PoemBERT from-scratch. This policy has two main advantages, the first being that the pre-trained ChineseBERT is a stable model, which means that extending it with a relatively small amount of poetry data will not entirely destabilize the model. The other key advantage that training from a pre-trained model has over training from-scratch is that it requires less training time. The publishers of BERT recommend using 10K training-steps when training from a pre-trained model, which is within the scope of our time resources. Nonetheless, it could still be advantageous to train PoemBERT from-scratch, as a relatively small corpus might require fewer training-steps to culminate in a language representation model that correctly represents classical Chinese poetry. Additionally, PoemBERT will also be trained on two different corpora. The main difference between the two corpora is their size. It is recommended to input a large corpus into the BERT model; however, due to the distinct characteristics of poetry, PoemBERT might prefer a smaller corpus. This is because a smaller corpus might prove to be less confusing to BERT, as the number of unique contexts per Chinese character is lower than in the large corpus. As a result, it might be easier for BERT to determine the context when the number of unique contexts is lower.

After all versions of PoemBERT have been trained, they will be intrinsically and extrinsically evaluated. The intrinsic evaluation consists of three metrics, and the extrinsic evaluation consists of a downstream task and one metric to evaluate the downstream task. The downstream task will be T2I generation; more specifically, PoemBERT will be incorporated into the AttnGAN framework. The extrinsic evaluation will determine whether PoemBERT benefits poem based image generation. Only the configuration of PoemBERT that has the best performance on the intrinsic evaluation will be extrinsically evaluated.

5 https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip


Figure 7: PoemBERT will be trained using two training-policies and two different corpora, resulting in four different configurations of PoemBERT

This is because the intrinsic evaluation will reveal which version of PoemBERT has learned the best representation of classical Chinese poetry. However, it does not show how PoemBERT will perform on a T2I generation task. The next section will delve deeper into the details of the intrinsic and extrinsic evaluation methods.

4 experiment setup

Extensive experimentation is carried out to evaluate the proposed PoemBERT. All of the experiments are designed to answer the research question. To make our research question more comprehensible, we propose to divide it into four sub-questions. The results of the intrinsic and extrinsic evaluation will provide the answers to these sub-questions.

This section will cover the four sub-questions; the background needed to understand why these sub-questions are meaningful; the intrinsic and extrinsic evaluation methods; the baseline model; and the different training configurations of PoemBERT and the downstream task, which are designed to reveal the answers to our sub-questions.

4.1 Sub-questions

In total there are four sub-questions: two concerning PoemBERT and two concerning the T2I generation. These questions lead to the different combinations of training configurations for these two models.

The sub-questions concerning PoemBERT are: (RQ1-1) What training-policy is the most adequate for representing classical Chinese poetry? and (RQ1-2) What corpus is the most adequate for representing classical Chinese poetry? The first question is important because it will reveal whether PoemBERT can learn a correct representation of poetry from-scratch, or whether PoemBERT benefits from a pre-trained model to learn a correct representation of poetry. The answer to the second question will reveal whether PoemBERT prefers a relatively small or large amount of poetry data to learn a correct representation of classical Chinese poetry.

The sub-questions concerning T2I generation are: (RQ2-1) What text-image pair dataset yields the most realistic images from T2I generation? and (RQ2-2) What is the impact of the Lambda parameter on the T2I generation? The first question is important because the datasets differ in size and image features. These differences might reveal what characteristics a text-image pair dataset needs to benefit poem based image generation. The second question is important because the Lambda parameter of the AttnGAN indirectly determines the level of influence PoemBERT has on the T2I generation process. Thus it will reveal whether the T2I algorithm generates more realistic images when strongly or weakly influenced by PoemBERT.
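For context, in the AttnGAN paper the Lambda parameter weighs the DAMSM (word-level) loss in the generator objective, which is why it controls the text encoder's influence. A minimal sketch (hypothetical function name):

```python
def generator_objective(gan_loss, damsm_loss, lam):
    """AttnGAN generator objective: L = L_GAN + lambda * L_DAMSM.
    A larger lambda gives the word-level (text encoder) loss, and thus
    PoemBERT, more influence on the generated images."""
    return gan_loss + lam * damsm_loss

print(generator_objective(1.0, 0.5, 5.0))  # -> 3.5
print(generator_objective(1.0, 0.5, 0.0))  # -> 1.0 (text encoder ignored)
```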

4.2 Evaluation Method

4.2.1 Intrinsic Evaluation

BERT has two built-in intrinsic evaluation metrics, the Masked LM accuracy (Eq. 1) and the Next Sentence accuracy (Eq. 2). These metrics are calculated after every step in the training process. We have proposed to add another intrinsic evaluation metric, namely the Top-k accuracy (Eq. 3); this metric is calculated after the training process. Top-k accuracy is added because MLM accuracy might not be an accurate metric to evaluate the model. For example, BERT might not exactly predict all of the masked tokens, while the correct prediction is among its Top-k predictions. The Top-k accuracy allows PoemBERT a margin for error, and thus the metric will reveal whether PoemBERT is progressing in the right direction.

masked lm accuracy   The MLM accuracy is calculated by dividing the number of correctly predicted masked tokens by the total number of masked tokens after each training step.

Masked LM accuracy = (# correctly predicted masked tokens) / (total # of masked tokens)   (1)

next sentence accuracy   The NS accuracy is calculated by dividing the number of correctly predicted labels, IsNext or NotNext, by the total number of next sentence predictions.

Next Sentence accuracy = (# correctly predicted labels) / (total # of predictions)   (2)

top-k accuracy   The Top-k accuracy is calculated by dividing the number of correct Top-k predictions of the masked tokens by the total number of predicted masked tokens.

Top-k accuracy = (# correct Top-k predictions) / (total # of predicted masked tokens)   (3)
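Equations 1 and 3 can be sketched directly in code (hypothetical helper names; Eq. 2 has the same form as Eq. 1, applied to IsNext/NotNext labels):

```python
def masked_lm_accuracy(predictions, targets):
    """Eq. 1: correctly predicted masked tokens / total masked tokens.
    Eq. 2 has the same form, applied to IsNext/NotNext labels."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

def top_k_accuracy(ranked_predictions, targets, k):
    """Eq. 3: a masked token counts as correct if the true token appears
    among the model's k highest-scoring candidates."""
    return sum(t in preds[:k]
               for preds, t in zip(ranked_predictions, targets)) / len(targets)

targets = ["山", "水", "风"]                         # true masked tokens
best = ["山", "月", "雨"]                            # top-1 predictions
ranked = [["山", "川"], ["江", "水"], ["雨", "雪"]]   # ranked candidate lists
print(masked_lm_accuracy(best, targets))     # 1 of 3 exactly correct
print(top_k_accuracy(ranked, targets, k=2))  # 2 of 3 within the top 2
```

Note how the second poem token ("水") is missed by the exact-match metric but credited by Top-k, which is precisely the margin for error described above.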

4.2.2 Extrinsic Evaluation

The BERT architecture allows itself to be fine-tuned on a wide range of NLP tasks, and thus PoemBERT will also be evaluated by examining its performance on a downstream task, namely the generation of images based on poem descriptions with the AttnGAN.

the downstream task   The downstream task will be the deciding factor in determining whether PoemBERT has learned a representation of classical Chinese poetry that benefits poem based image generation. The T2I generation model that will evaluate PoemBERT is provided by the AttnGAN framework. AttnGAN is able to generate fine-grained details at different sub-regions of the image based on word-level information. This could seriously benefit T2I generation from poem descriptions, as most Chinese poetry paintings depict more than one object. AttnGAN allows these objects to be detected and mapped to their corresponding word. This ensures that the generated images will be more diverse and detailed compared to a T2I model that only uses sentence-level information, because details and words are mapped to each other as opposed to the entire image and a sentence.

BERT is easily applicable to a wide range of NLP tasks, and thus it perfectly lends itself to be incorporated into the AttnGAN framework; this modified version will be called AttnBERT-GAN. In essence, the AttnBERT-GAN will use the entire framework of the original AttnGAN; however, the key difference can be found in the DAMSM. The original text encoder, the RNN model trained from-scratch, will be replaced with the pre-trained PoemBERT. To ensure that PoemBERT can slightly adapt to the input text descriptions corresponding to the input images, only the last three feature layers of PoemBERT will be trainable; all of the others will be frozen. The code for this implementation is available at https://github.com/zhengfei0908/SBA-GAN.

The AttnBERT-GAN will not only be evaluated on whether it generates images that resemble the semantic meaning of classical Chinese poems. To determine the performance of PoemBERT, the AttnBERT-GAN will also be evaluated with the Inception Score.

inception score (is)   The IS is a widely used evaluation metric that automatically evaluates the quality of image generative models [15]. The metric has proven to correlate well with human scoring of the realism of generated images on a variety of datasets, including the CIFAR-10 dataset. The authors who proposed the IS determined that a generative model should have two desirable qualities:

1. The generated images should contain clear objects, meaning that the images should be sharp rather than blurry.

2. The generative model should output a diverse set of generated images.

If these qualities are satisfied by the generative model, a high IS is expected. The IS has a range between 1.0 and the number of classes supported by the classification model. This report will use the pre-trained Inception V3 model6, which has a maximum score of 5.0 [16].
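The definition behind the IS can be sketched from scratch (a simplified, hypothetical implementation: in practice the class probabilities p(y|x) come from the pre-trained Inception V3 network and the score is averaged over several splits):

```python
import math

def inception_score(probs):
    """IS = exp( mean_x KL( p(y|x) || p(y) ) ), where p(y|x) are the
    classifier's class probabilities for one generated image and p(y)
    is the marginal over all images. Sharp per-image predictions and a
    diverse marginal both raise the score."""
    n, k = len(probs), len(probs[0])
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    mean_kl = sum(
        sum(p[j] * math.log(p[j] / marginal[j]) for j in range(k) if p[j] > 0)
        for p in probs
    ) / n
    return math.exp(mean_kl)

# Ideal generator over 3 classes: sharp and diverse
ideal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(inception_score(ideal))  # ~3.0, the maximum for 3 classes

# Worst case: every image maximally uncertain ("blurry")
print(inception_score([[1/3, 1/3, 1/3]] * 3))  # ~1.0
```

This makes the two desirable qualities concrete: confident per-image predictions maximize each KL term, and a uniform marginal (diversity) maximizes the gap between p(y|x) and p(y).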

4.2.3 Baseline

To discover whether PoemBERT actually benefited from the input poetry data, the ChineseBERT Uncased, or ChineseBERT, will be utilized as a baseline to compare against PoemBERT. ChineseBERT and PoemBERT will both be assessed according to the same intrinsic and extrinsic evaluation techniques. This evaluation method will reveal whether PoemBERT has learned a more correct representation of classical Chinese poetry, which also benefits poem based image generation, compared to ChineseBERT, which was only trained on Wikipedia data.

4.3 Datasets

4.3.1 PoemBERT

PoemBERT will be trained with two different input corpora, a small and a large one (Table 1). Both of these corpora contain classical Chinese poems from the Ming, Qing, Yuan, Tang, and Shi Xue Han dynasty7. The small corpus contains 15% of the large corpus and is randomly selected. As previously mentioned, this is done to determine whether PoemBERT benefits from a corpus that has a lower number of unique contexts, due to the metaphorical nature of classical Chinese poetry.

^6 https://github.com/sbarratt/inception-score-pytorch
^7 https://github.com/Disiok/poetry-seq2seq/tree/develop/data


Table 1: PoemBERT: training corpora information

Dataset        Poems    # Sentences  Max sentence length
Small corpus   42.037   204.922      228
Large corpus   280.247  1.349.772    234

Example from corpus: classical Chinese poem by Lu Guimeng

朔风动地来,吹起沙上声。 (The northern wind came up, shaking the ground, creating noises of blowing sand.)
闺中有边思,玉箸此时横。 (A young woman was missing her lover inside her boudoir, while holding chopsticks in her hands without an appetite.)
莫怕儿女恨,主人烹不鸣。 (Don't be afraid of the daughters' hate, as the father had split them apart.)

4.3.2 AttnBERT-GAN

The three text-image pair datasets that will be used to train the AttnBERT-GAN show a clear correlation between their size and the consistency of their images. This correlation is valuable because it will show whether AttnBERT-GAN generates more realistic images from a large amount of data or from consistent data. The three datasets used in the AttnBERT-GAN training process are the Poem-Image, Title-Image, and Regular Poem-Image datasets (Table 2). These datasets all have different characteristics, and these differences might reveal which characteristics are beneficial for poem based image generation. The Poem-Image dataset consists of 301 images that all share the exact same structure and color palette (Figure 8). The Title-Image dataset is slightly larger, and its images are more diverse (Figure 9). The diversity can mainly be seen in the level of detail and the color palette. The structure and image sizes of the Title-Image dataset are comparable to the Poem-Image dataset. The Regular Poem-Image dataset is the largest; it contains 83.052 text-image pairs that were collected via web scraping (Figure 10). These images show a lot of variety: they differ in image size, color palette, structure, and artistic style, making the Regular Poem-Image dataset the most inconsistent dataset.

All of the images in these datasets are paired with an image description. This description is a summary of the corresponding poem, reduced to the single sentence that best describes the content of the image. The image description is a better text to generate images from than a full poem, because it reduces the total number of Chinese characters and strips most of the ambiguity.

Table 2: AttnBERT-GAN: training datasets information

Dataset             # poem-image pairs  Image consistency
Poem-Image          301                 Perfectly consistent
Title-Image         3.292               Slightly inconsistent
Regular Poem-Image  83.052              Highly inconsistent

Figure 8: Two examples from the Poem-Image dataset. Image descriptions (from left to right): 古冢密于草,新坟侵官道。 (The ancient tomb is dense with the grass.); 长条乱拂春波动,不许佳人照影看。 (Long strips fluttering in the spring.)

4.4 Configurations

4.4.1 PoemBERT

In total, four different versions of PoemBERT will be trained (Table 3). The four PoemBERT models differ in training-policy and training corpus size. Together, these versions will provide the answers to the two sub-questions concerning PoemBERT. The intrinsic evaluation will be performed on the same validation-set for each configuration. Based on the results of the intrinsic evaluation, one PoemBERT will be selected for incorporation in the AttnBERT-GAN framework, alongside the ChineseBERT.

^8 All of the different configurations of PoemBERT were trained with: learning rate = 2e-5; batch size = 64; max sequence length = 256; max predictions per sequence = 40.


Figure 9: Two examples from the Title-Image dataset. Image descriptions (from left to right): 杨柳岸晓风残月。 (Willow Shore Xiaofeng Can Yue.); 楼上黄昏,马上黄昏。 (Upstairs at dusk, immediately at dusk.)

Table 3: Different configurations of PoemBERT.

Model        Dataset       Training-policy   BERT size  Training-steps
ChineseBERT  Wikipedia     from-scratch      BASE       1M
PoemBERT-A   Small corpus  from-scratch      TINY       500K^8
PoemBERT-B   Small corpus  from pre-trained  BASE       20K
PoemBERT-C   Large corpus  from-scratch      TINY       300K
PoemBERT-D   Large corpus  from pre-trained  BASE       20K

4.4.2 AttnBERT-GAN

In total, six different versions of AttnBERT-GAN will be trained with PoemBERT, alongside six trained with the ChineseBERT. The configurations differ in the dataset and the size of Lambda. Each dataset will be split into a training-set and a validation-set^9, and each configuration will be evaluated on the validation-set corresponding to the dataset it was trained on. Furthermore, for each of the three datasets, an AttnBERT-GAN will be trained with a Lambda of both 5.0 and 50.0. Thus, in total there will be twelve unique configurations that will be compared to one another.
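The full grid of twelve configurations described above can be enumerated as a Cartesian product (a sketch of the experimental grid, not project code):

```python
from itertools import product

# Two text encoders x three datasets x two DAMSM weights (lambda) = 12 runs.
encoders = ["ChineseBERT", "PoemBERT"]
datasets = ["Poem-Image", "Title-Image", "Regular Poem-Image"]
lambdas = [5.0, 50.0]

configs = [
    {"encoder": e, "dataset": d, "lambda": l}
    for e, d, l in product(encoders, datasets, lambdas)
]
print(len(configs))  # 12 unique AttnBERT-GAN configurations
```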


Figure 10: Three examples from the Regular Poem-Image dataset. Image descriptions (from left to right): 应有仙家住,避秦来至今。 (There should be a fairy family to live, avoiding Qin to date.); 十年噩梦何曾寤,一往幽忧欲付谁。 (A ten-year nightmare, He Zengyu, who has always been worried about paying for it.); 遂令虎林人,得免马邑屠。 (The Hulin people were ordered to avoid Mayi slaughter.)


5 results

5.1 Intrinsic Evaluation (PoemBERT)

5.1.1 Training-policy (RQ1-1)

The goal of this sub-question was to determine which training-policy produces the best representation of classical Chinese poetry in a vector space. To determine this, four unique versions of PoemBERT were trained (Table 3) and evaluated according to the intrinsic evaluation metrics. Only the ChineseBERT and the PoemBERT with the best performance on the intrinsic evaluation were evaluated extrinsically.

Looking at the results in Table 4, we can see that PoemBERT-B and D, which were both trained from the pre-trained ChineseBERT, achieved higher accuracies on all metrics than PoemBERT-A and C, which were both trained from-scratch. This indicates that training PoemBERT from a pre-trained model is the best training-policy for representing classical Chinese poetry. However, the PoemBERT models that were trained from-scratch were not trained for the recommended number of training-steps^10, due to limited time resources. This could explain why PoemBERT-A and C performed considerably worse than PoemBERT-B and D. Another reason PoemBERT-B and D scored higher is that the ChineseBERT already provided a solid foundation before the poetry corpus was added.

Thus, training PoemBERT from a pre-trained model is the most suitable training-policy for representing classical Chinese poetry. However, to be completely certain of this, a BERT model should be trained from-scratch with a poetry corpus for the appropriate number of training-steps.

Table 4: Intrinsic evaluation of PoemBERT, compared to the ChineseBERT

Model        MLM   NSP   Top-5
ChineseBERT  0.83  0.95  −^11
PoemBERT-A   0.18  0.72  0.25
PoemBERT-B   0.29  0.83  0.33
PoemBERT-C   0.23  0.80  0.27
PoemBERT-D   0.40  0.92  0.48

5.1.2 Corpus (RQ1-2)

The goal of this sub-question was to determine whether PoemBERT would benefit from a relatively small or a relatively large amount of poetry data when learning a correct representation of classical Chinese poetry. To discover the effects of the corpus size, two PoemBERTs were trained with a relatively small corpus, and two PoemBERTs were trained with a relatively large corpus.

^10 The original BERT paper recommends training BERT for one million training-steps when training from-scratch.


Each corpus was combined with both training-policies, resulting in four unique configurations.

PoemBERT-A and B were both trained with the small corpus and, for both training-policies, achieved worse accuracies on all three metrics than PoemBERT-C and D, which were trained with the large corpus (Table 4). This shows that PoemBERT learned a better representation from the large corpus, even though the number of unique contexts per character is higher there. This could be explained by the number of masked tokens, which is greater for the large corpus. A higher number of masked tokens leads to more predictions, and this larger number of predictions leaves more room to learn the context of each character, even when the number of unique contexts per character is higher.

In conclusion, PoemBERT-D showed the best performance on the intrinsic evaluation; hence it will be extrinsically evaluated to determine whether it benefits poem based image generation. PoemBERT-D still scores considerably worse on the MLM accuracy than the ChineseBERT, which is most likely due to the characteristics of poetry. However, this does not automatically mean that the ChineseBERT has a better representation of classical Chinese poetry than PoemBERT-D, as the ChineseBERT has never seen poetry data. Thus, PoemBERT-D might prove to be the more competent language representation model for generating images from poem descriptions.

5.2 Extrinsic Evaluation (AttnBERT-GAN)

5.2.1 Text-Image Dataset (RQ2-1)

The goal of this sub-question was to determine which characteristics a text-image pair dataset needs to generate realistic Chinese poetry paintings. This was done by training the AttnBERT-GAN with three different datasets that show a clear correlation between dataset size and image consistency. The IS was calculated for each configuration of AttnBERT-GAN to verify which dataset produced the most realistic images. The IS is based on how clear the generated images are and on the degree of image variety in the validation-set.

poem-image The Poem-Image dataset is the smallest and most consistent dataset AttnBERT-GAN was trained on. For both the ChineseBERT and PoemBERT, the DAMSM converged at around 100 epochs, at a loss of approximately 10.0. The GAN converged, for both λ=5 and λ=50, at around 250 epochs. Figure 12 shows an example of a generated image for each configuration with the Poem-Image dataset. These samples were generated from the same caption in the validation-set, and thus they should resemble the same image (Figure 11). However, the four images show little resemblance to Chinese poetry paintings; the only characteristic they have learned is the color palette. Besides that, the images show seemingly random spots and shapes that do not correspond to the caption they were generated from. These characteristics are consistent throughout all four of the validation-sets.

Figure 11: Example poem-image pair from the Poem-Image validation-set, with the corresponding caption: 春水船如天上坐。 (Spring water boat sits in heaven.)

Figure 12: AttnBERT-GAN generated images from the same poem in the validation-set of the Poem-Image dataset. The top row shows images generated with the ChineseBERT, the bottom row images generated with the PoemBERT. The left side shows images generated with λ=5, the right side with λ=50.

title-image The Title-Image dataset was the medium-sized dataset, with a slight variation in image characteristics. For both the ChineseBERT and PoemBERT, the DAMSM converged at around 200 epochs, at a loss of approximately 13.5. The GAN converged, for both λ=5 and λ=50, at around 250 epochs. Figures 14 and 15 show two sets of generated images; each set contains four images, one for each configuration of AttnBERT-GAN trained on the Title-Image dataset. These samples were generated from the same caption in the validation-set (Figure 13). The generated images are more detailed than the images generated with the Poem-Image dataset. Besides the color palette, clear shapes, clear edges, and even Chinese characters can be seen. These characteristics are consistent throughout all four of the validation-sets. The results of this dataset are images that moderately resemble Chinese poetry paintings. However, the generated images still do not seem to correspond to the caption they were generated with.

Figure 13: Example poem-image pairs from the Title-Image validation-set, with the corresponding captions. Left: 蚂蚁搬家。 (Ants move.) Right: 落红不是无情物,化作春泥更护花。 (Fallen flowers are not ruthless things, turned into more quads for Chun ni.)

Figure 14: AttnBERT-GAN generated images from the same poem-image pair in the validation-set of the Title-Image dataset. The top row shows images generated with the ChineseBERT, the bottom row images generated with the PoemBERT. The left side shows images generated with λ=5, the right side with λ=50.


Figure 15: AttnBERT-GAN generated images from the same poem-image pair in the validation-set of the Title-Image dataset. The top row shows images generated with the ChineseBERT, the bottom row images generated with the PoemBERT. The left side shows images generated with λ=5, the right side with λ=50.

regular poem-image Due to time constraints, the DAMSM and GAN could not optimally converge with the Regular Poem-Image dataset, as the size of the dataset resulted in a lengthy training process^12. The DAMSM was therefore only trained for 100 epochs, where its loss was approximately 12.0, and the GAN, for both λ=5 and λ=50, was trained for 80 epochs. Figure 17 shows an example of a generated image for each configuration of AttnBERT-GAN trained on the Regular Poem-Image dataset. All of these samples were generated from the same caption from the validation-set (Figure 16). The generated images show a great variety in color palette, which indicates that each configuration has learned a drastically different relation between image details and Chinese characters. Besides the color palette, the images show some characteristics of Chinese poetry paintings, as some generated images show a structure and Chinese characters similar to those found in the training-set. The image generated with the ChineseBERT and λ=5 even bears a great resemblance to the image of the caption-image pair it was generated from; however, this does not immediately indicate that a correct relationship between image details and Chinese characters was learned. The images generated from the validation-set of this configuration show no signs of a correct relationship between images and characters, and the samples generated by the other three configurations also do not show any clear indication that such a relationship was learned.

^12 Training the DAMSM and GAN with the Regular Poem-Image dataset would have taken up to four weeks (DAMSM 300 epochs; GAN 300 epochs).

Figure 16: Example text-image pair from the Regular Poem-Image validation-set, with the corresponding caption: 依依翠袖生寒处,寂寂空山听雨年。 (Yi Yicui sleeves in the cold place, the silence of the empty mountain listening to the rain.)

Figure 17: AttnBERT-GAN generated images from the same poem-image pair in the validation-set of the Regular Poem-Image dataset. The top row shows images generated with the ChineseBERT, the bottom row images generated with the PoemBERT. The left side shows images generated with λ=5, the right side with λ=50.

Although none of the datasets produced results that showed any signs of a correctly learned relationship between image details and Chinese characters, the Title-Image dataset generated the most realistic images according to the IS (Table 5). This can be explained by the clearness and variety of its generated images, compared to the images generated with the other datasets. The Title-Image dataset yielded the best results because of its characteristics: its medium size and its only slightly inconsistent image features resulted in clear images that show some characteristics of Chinese poetry paintings. The other two datasets produced less realistic images according to the IS, again because of their characteristics. The Poem-Image dataset did not have enough text-image pairs to accurately map image features to characters; hence the generated images only depict vague shapes. The Regular Poem-Image dataset was too large and its image features were too inconsistent, which resulted in a diverse set of generated images whose contents are very irregular and do not resemble Chinese poetry paintings. The variety of its generated images is the reason the Regular Poem-Image dataset performed better on the IS than the Poem-Image dataset.

Table 5: Inception score for each configuration of AttnBERT-GAN.

                    ChineseBERT    PoemBERT
Dataset             λ=5    λ=50    λ=5    λ=50
Poem-Image          1.11   1.20    1.30   1.15
Title-Image         1.24   1.61    1.83   1.40
Regular Poem-Image  1.39   1.36    1.42   1.32

5.2.2 Lambda (RQ2-2)

The goal of this sub-question was to determine whether AttnBERT-GAN produces more realistic images when lightly or heavily influenced by the DAMSM. The hyper-parameter Lambda determines the influence of the DAMSM on the generation process.

Lambda (λ) is a scalar on the DAMSM loss (L_DAMSM), the loss function used to pre-train the DAMSM. When training the GAN, λ has to be declared; in essence, it functions as a weighting factor for the DAMSM within the GAN objective. A large λ ensures that the DAMSM has a greater impact on the generation process: the GAN adopts more information from the attention maps of the DAMSM when generating images. A smaller λ ensures that the GAN is less influenced by the DAMSM, so that it generates images based less on the DAMSM and more on the images in the training-set.
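The role of λ described above amounts to a weighted sum of the two loss terms; a minimal illustration (not the project's training code):

```python
def generator_loss(gan_loss, damsm_loss, lam):
    """Total generator objective: L = L_GAN + lambda * L_DAMSM.

    A large lambda makes the text-image matching (DAMSM) term dominate;
    a small lambda lets the adversarial term, and thus the training
    images themselves, dominate.
    """
    return gan_loss + lam * damsm_loss

# With identical loss values, lambda=50 weights the DAMSM term ten times
# more heavily than lambda=5.
print(generator_loss(1.0, 2.0, 5.0))   # 11.0
print(generator_loss(1.0, 2.0, 50.0))  # 101.0
```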

The size of λ has been shown to have a great effect on the results. The original AttnGAN achieved state-of-the-art results on both the COCO and CUB datasets. The final λ for the COCO dataset turned out to be much larger than for the CUB dataset, indicating that L_DAMSM is especially important for generating complex scenarios like those in the COCO dataset.

The text-image pair datasets of poem descriptions and paintings are even more complex than the COCO dataset. Training the AttnBERT-GAN with a high λ therefore seems like the logical procedure; however, it is also worthwhile to train AttnBERT-GAN with a smaller λ. To illustrate: when the DAMSM has learned an incorrect attention map, meaning that the fine-grained details do not correspond to their words, a high λ will produce less realistic images than a lower λ on the exact same model, because with a lower λ the deficient DAMSM has less influence on the generation process.

In conclusion, λ determines how much the generated images are based on the learned relationship between the details of an image and the classical Chinese poetry representation model, PoemBERT. To identify the usefulness of the DAMSM, the GAN was trained with two different Lambdas, λ=5 and λ=50, and evaluated with the IS.

poem-image For the Poem-Image dataset, λ=50 produced images that show a gritted pattern, for both the ChineseBERT and PoemBERT (Figure 12). The key difference is that the image generated with PoemBERT (λ=50) has colors that do not resemble the training-set, whereas the image generated with ChineseBERT (λ=50) moderately resembles the colors in the training-set. These characteristics can be observed in all of the images generated by both the ChineseBERT (λ=50) and PoemBERT (λ=50).

With λ=5, the differences between images generated with the ChineseBERT and PoemBERT are less obvious. Both BERT models show much more resemblance to the training-set than the images generated with λ=50, due to their colors and their element-to-background ratio. However, there is still a difference between the two: the image generated with PoemBERT (λ=5) contains a visible shape or element, whereas the image generated with ChineseBERT (λ=5) does not show any structured element. This difference is clearly visible throughout the entire collection of generated images for both the ChineseBERT (λ=5) and PoemBERT (λ=5).

Overall, the configuration with PoemBERT and λ=5 produced the images that most resemble the training-set for the Poem-Image dataset. This is also supported by the IS, as it scored the highest, 1.30, of all configurations with the Poem-Image dataset (Table 5).

title-image For the Title-Image dataset, λ=50 produced the images that least resemble the training-set, for both ChineseBERT and PoemBERT (Figures 14, 15). The generated images differ from the training-set due to their dark background color and the lack of clearly visible objects. Between the images generated with ChineseBERT (λ=50) and PoemBERT (λ=50) there are no obvious differences, which also holds throughout the entire collection of images generated by these two configurations.

In contrast to the uninspired results with λ=50, λ=5 produces very impressive results. The images generated with both ChineseBERT (λ=5) and PoemBERT (λ=5) are less blurred, which ensures a clear distinction between background and elements. Both collections of generated images contain many images that moderately resemble the training-set, which can be seen in the background color and the element-to-background ratio. However, the images generated with PoemBERT (λ=5) are clearly superior to those generated with ChineseBERT (λ=5), for several reasons: they often contain characters that resemble Chinese characters, grouped together in an appropriate place; they show a greater variety in structure and coloring; they contain objects with a greater level of detail; and they are less blurred. It is very impressive that these characteristics can be seen throughout the entire validation-set of generated images in a variety of ways.

Thus, overall, the configuration with PoemBERT and λ=5 produced the images that most resemble the training-set for the Title-Image dataset. This is also supported by the IS, as it scored the highest, 1.83, of all twelve configurations (Table 5).

regular poem-image For the Regular Poem-Image dataset, λ=50 produced images that do not show any characteristics of the training-set (Figure 17). The images generated with the ChineseBERT (λ=50) vary a lot in color palette; a constant throughout the entire validation-set, however, is a set of similar irregular shapes. Paradoxically, the images generated with the PoemBERT (λ=50) show much less variety in color palette, as all of the images are yellow with some dark shapes, yet each shape is unique.

The images generated with λ=5 showed much more resemblance to Chinese poetry paintings. The content of the images is more detailed, less blurred, more varied, and often includes Chinese characters. These characteristics result in more realistic images than those generated with λ=50. Overall, the images generated with ChineseBERT (λ=5) varied more in color palette than the images generated with PoemBERT (λ=5), but the latter exhibited more characteristics that resemble Chinese poetry paintings, such as Chinese characters, structure, and detailed shapes. This difference explains why the IS for both configurations is almost equivalent, as both show a lot of variety, but in different aspects.

Again, the configuration with PoemBERT and λ=5 produced the images that most resemble the training-set for the Regular Poem-Image dataset. This is also supported by the IS, as it scored slightly better than the ChineseBERT (λ=5), with a score of 1.42 (Table 5).

One configuration yielded the highest IS on all of the datasets, namely the AttnBERT-GAN trained with PoemBERT and λ=5. This indicates that the AttnBERT-GAN generates the most realistic images when the DAMSM has less impact on the generation process. The DAMSM thus did not learn a correct representation of the relationship between image details and Chinese characters, and because of this, the GAN produced more realistic images when it relied more on the training-set. Hence, random Chinese characters could be seen in the samples generated from the Title-Image and Regular Poem-Image datasets. In general, the results show that PoemBERT produced more realistic images when λ=5, according to both the number of characteristics that resemble those of Chinese poetry paintings and the IS. In conclusion, the complexity of the datasets proved to be too high for the DAMSM to learn a correct relationship between image details and Chinese characters. The images generated with λ=50 were therefore heavily based on this incorrect relationship, and consequently proved to be unrealistic. Due to the complexity of the datasets, it is recommended to train the GAN with a lower λ to obtain realistic images.


6 conclusion

In this paper, we introduced PoemBERT, a language representation model that maps classical Chinese poetry to a vector space and is intended to benefit poem based image generation. To determine the ideal configuration for PoemBERT, multiple experiments were conducted with different training-policies and corpora. Training PoemBERT from the pre-trained ChineseBERT in combination with a large corpus yielded the best results on the intrinsic evaluation. The usefulness of the from-scratch training-policy could not be fully determined, because the from-scratch models were not trained for an adequate number of training-steps. Nonetheless, the from pre-trained training-policy scored almost twice as high on the MLM and Top-k accuracies as from-scratch. Besides the training-policy, the corpus size also proved influential, as a large classical Chinese poetry corpus yielded higher accuracies on all intrinsic metrics. Conclusively, the PoemBERT that was trained from the pre-trained ChineseBERT with the large classical Chinese poetry corpus achieved the best performance on the intrinsic evaluation, and it was therefore evaluated extrinsically, along with the baseline ChineseBERT.

The extrinsic evaluation determined whether PoemBERT benefits poem based image generation. It consisted of a downstream task, the AttnBERT-GAN, and a metric to evaluate that task, the IS. The experiments conducted with the AttnBERT-GAN revealed the effects of the particular datasets and Lambdas. Due to its size and image consistency, the Title-Image dataset produced the most realistic images, based on its similarity to Chinese poetry paintings and on the IS. The experiments conducted with λ=5 produced significantly more realistic images than those conducted with λ=50, indicating that the DAMSM has not learned a correct relationship between image details and Chinese characters.

Overall, the extrinsic evaluation indicates that the proposed PoemBERT benefits poem based image generation more than the ChineseBERT. Thus, PoemBERT has learned a better representation of classical Chinese poetry for poem based image generation than the baseline ChineseBERT. However, none of the generated images show that any relationship between image details and Chinese characters was learned, because the text-image pair datasets were too complex for the DAMSM.


7 future work

Due to limited time and computational resources, there are a number of experiments that could significantly improve the results. In future research, the following experiments could be conducted.

The first experiment is training PoemBERT from-scratch for the recommended number of training-steps. The authors of BERT strongly recommend training BERT on a corpus that suits the NLP task. It may therefore be advantageous to train BERT from-scratch on the poetry corpus for it to benefit poem based image generation.

Secondly, if training PoemBERT from-scratch produces unusable results because of the limited amount of classical Chinese poetry data available, an alternative is to train PoemBERT from the pre-trained ChineseBERT with a modified classical Chinese poetry corpus. Possible modifications are (1) translating the classical Chinese poetry corpus to simplified Chinese, and (2) converting the Chinese characters to their highest hierarchical semantic meaning, which removes much of the ambiguity present in the poems^13. Both of these modifications would make the corpus more compatible with the ChineseBERT, which could result in a better representation model of classical Chinese poetry.

Thirdly, the COCO and CUB datasets used to train the original AttnGAN both had ten text captions per image, which gives the DAMSM more data to determine the relationship between image details and text descriptions. The text-image pair datasets used to train the AttnBERT-GAN all had a single image description, which might explain why the DAMSM in all configurations did not seem to learn a correct relationship between image and text. Extending the text-image pair datasets with more poem descriptions might significantly improve the performance.

^13 This can be achieved with the following dataset: https://github.com/ChaosPKU/Poetry/blob/master/FirstSentence/dataset/shixuehanying.txt


references

[1] Lixin Liu, Xiaojun Wan, and Zongming Guo. Images2poem: Generating chinese poetry from image streams. In Susanne Boll, Kyoung Mu Lee, Jiebo Luo, Wenwu Zhu, Hyeran Byun, Chang Wen Chen, Rainer Lienhart, and Tao Mei, editors, 2018 ACM Multimedia Conference on Multimedia Conference, MM 2018, Seoul, Republic of Korea, October 22-26, 2018, pages 1967–1975. ACM, 2018.

[2] Linli Xu, Liang Jiang, Chuan Qin, Zhe Wang, and Dongfang Du. How images inspire poems: Generating classical chinese poetry from images with memory networks, 2018.

[3] Ma Sen. Contemporary chinese literature: An anthology of post-mao fiction and poetry. The China Quarterly, 109:120–121, 1987.

[4] X. Ni, L. Liu, and R. Haralick. Music generation with relation join. volume 10525, pages 41–64. Springer Verlag, 2017.

[5] A. Astigarraga, J.M. Martínez-Otzeta, I. Rodriguez, B. Sierra, and E. Lazkano. Emotional poetry generation. volume 10458, pages 332–342. Springer Verlag, 2017.

[6] S. Luo, S. Liu, J. Han, and T. Guo. Multimodal fusion for traditional chinese painting generation. volume 11166, pages 24–34. Springer Verlag, 2018.

[7] Rebecca Chamberlain, Caitlin Mullin, Bram Scheerlinck, and Johan Wagemans. Putting the art in artificial: Aesthetic responses to computer-generated art. Psychology of Aesthetics, Creativity, and the Arts, 12(2):177–192, 2018.

[8] Chenrui Zhang and Yuxin Peng. Stacking vae and gan for context-aware text-to-image generation. In 2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM), pages 1–5. IEEE, 2018.

[9] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.

[10] Ahmed Elgammal, Bingchen Liu, Mohamed Elhoseiny, and Marian Mazzone. Can: Creative adversarial networks, generating "art" by learning about styles and deviating from style norms, 2017.

[11] Xiaoyuan Yi, Ruoyu Li, and Maosong Sun. Generating chinese classical poems with rnn encoder-decoder, 2016.

[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.


[13] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. Pre-training with whole word masking for chinese BERT. CoRR, abs/1906.08101, 2019.

[14] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. CoRR, abs/1711.10485, 2017.

[15] Shane T. Barratt and Rishi Sharma. A note on the inception score. ArXiv, abs/1801.01973, 2018.

[16] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
