
De-noise large-scale poem-image pairs for poem-to-image generation


Layout: typeset by Fengyuan Sun using LaTeX.


Fengyuan Sun 11697318

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie (Artificial Intelligence)
University of Amsterdam
Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: Ms. Dan Li
Informatics Institute, Faculty of Science
University of Amsterdam
Science Park 904, 1098 XH Amsterdam

June 26th, 2020


Abstract

Classical Chinese poems are difficult to understand due to their abstraction and ambiguous use of language. Visualization of poems provides a novel way of understanding them. However, the task of generating images from poetry remains a challenge. To address the problem, this thesis proposes an automated construction method for poem-image datasets, which retrieves images from search engines and removes noise using an image style classification model. Based on this method, a large-scale poem-image dataset in the traditional Chinese painting style is introduced. Through various experiments, the effectiveness of the image style classification model is proven and the quality of the dataset is evaluated. Furthermore, the AttnGAN model for text-to-image generation, trained on the dataset, is able to produce interesting novel images, which demonstrates the applicability of the dataset.


Contents

1 Introduction
  1.1 Research Questions
2 Related Work
  2.1 Image Classification
  2.2 Image Style Classification
  2.3 Traditional Chinese Painting Classification
  2.4 Text-to-image Generation
3 Research Method
  3.1 Dataset Construction Method
    3.1.1 Collecting Poem Lines
    3.1.2 Retrieving Images
    3.1.3 Filtering Noisy Images
  3.2 Image Style Classification Model
    3.2.1 Deep Convolutional Neural Network
    3.2.2 Style Vector
    3.2.3 Classifier
4 Experimental Setup
  4.1 Datasets
    4.1.1 Training Datasets
    4.1.2 Test Datasets
  4.2 Evaluation Metrics
5 Experimental Results
  5.1 Image Search Guidelines
    5.1.1 Search Engine Performance
    5.1.2 Search Query Performance
  5.2 Effectiveness of Image Style Classification Model
    5.2.1 Selecting Model Training Data
    5.2.2 Style Vector Performance
  5.3 Final Classifier
    5.3.1 Practical Task Performance
  5.4 Quality of the Constructed Dataset
    5.4.1 Image Style Noise
    5.4.2 Poem Image Style and Semantic Consistency
    5.4.3 Application of the Dataset
6 Conclusion and Discussion
References


1 Introduction

Classical Chinese poetry is an important heritage of Chinese culture, holding high cultural, artistic and academic value. However, understanding poetry is a challenging task due to its abstraction and the lack of historical background [1]. A useful way to aid the understanding of classical poetry is visualization. Throughout the history of humankind, visual scenes have formed an important source of inspiration for poets [2]. Thus, by visualizing poems, the semantics and emotions of a poem can be expressed more clearly.

Methods to visualize natural language already exist in the field of text-to-image generation. Recently, Generative Adversarial Networks (GANs) [3] have been applied to the text-to-image generation task and have produced visually realistic images from natural language descriptions [4, 5], which advances applications such as art generation.

However, the problem of generating images from classical poetry has not been explored. To aid the understanding of classical Chinese poetry and to assist research in poem-to-image generation, this thesis proposes a large-scale poem-image dataset for the generation of images based on classical Chinese poems. In this work, the poem-image data consists of poem line-image pairs. Poems tend to convey a story, describing multiple scenes and topics; because of that, a single image cannot match all the semantics expressed in a poem. To overcome this problem, an image is matched to a single poem line, and poem line-image pairs are constructed. One example of the data is shown in Figure 1.

Additionally, because classical Chinese poetry and art are culturally connected, the proposed dataset incorporates the art style in image selection to create stylistically and semantically relevant poem-image pairs.

During the development of the dataset, several methods to acquire relevant images for poems are investigated and the best one is identified. Using the optimal method, the images are obtained and filtered by an image style classification framework to remove irrelevant images. Finally, the poem-image pairs are formed by the proposed method. The line-image pairs are evaluated based on experimental studies and proven to be stylistically consistent. Furthermore, the dataset is utilized in an image generation task and its applicability is demonstrated. The final dataset is proposed for further research in image generation that is based on classical Chinese poems.


1.1 Research Questions

The goal of the thesis is to construct an appropriate poem-image dataset that can assist in poem-to-image generation. To approach this task, the following research question is formed:

How can we automatically construct a large-scale poem-image dataset with the traditional Chinese painting style that can assist in poem-inspired image generation?

This is further divided into four sub-questions.

To construct poem line-image data, a construction method has to be designed. Moreover, several factors can influence the quality of the data during data collection. To investigate this, the first research question is defined:

RQ 1: How do we construct poem line-image pairs for the dataset?

In the proposed method, the images retrieved by the search engine can contain irrelevant and noisy images such as realistic photographs. To reduce noise in the data, an image style classification model is developed and the optimal model configurations are studied. The second research question is outlined:

RQ 2: How do we filter noisy images by training an image style classification model?

To validate the proposed image filtering method, the third research question measures the effectiveness of the image style classifier.

RQ 3: How effective is the proposed image style classification model?

Lastly, the quality of the constructed dataset is evaluated to assess the proposed construction method. The fourth research question is defined:

RQ 4: To what extent can the proposed dataset assist in the research of poem-to-image generation?


2 Related Work

2.1 Image Classification

In general, image classification is the task of categorizing images based on their visual content, which can be interpreted as the objects contained in the image or its aesthetic style. Past research on image classification has emphasized object recognition. Before the rise of deep learning, traditional machine learning approaches were used for image classification; however, the performance of such methods is limited due to inefficient feature extraction processes.

With the advent of deep learning, different classification methods that overcome these limitations have been invented. Deep neural networks have been developed to learn complex correlations from raw input data by utilizing the backpropagation of errors. Among the different types of networks, the Convolutional Neural Network (CNN) shows exceptional performance on object categorization tasks. In 2012, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [6] was won by a deep CNN called AlexNet [7]. This success led to the further development of CNN models with improved performance.

2.2 Image Style Classification

Although much research has been carried out on object recognition, the problem of image style classification has not been thoroughly explored. The key to solving this problem is to find an image representation that successfully portrays the image style.

Karayev et al. [8] investigated several image features that relate to image style and proposed an image style classifier that utilizes CNN features. They showed that CNN features derived from the fully connected layers outperform conventional image features (visual saliency, GIST and Lab colour histograms). Still, image style classification models face difficulties when classifying unseen data. To tackle this problem, Tseng et al. [9] proposed a pointwise ranking model based on random forests that combines hand-crafted visual features.

Despite the effectiveness of the CNN features proposed by Karayev et al., their correlation with image style remains unclear. Recently, Gatys et al. [10] developed a style descriptor that can effectively describe the image style. They proposed an image style transfer algorithm based on image representations derived from CNNs: the artistic style of one image is transferred to another image, while the semantic content of that image is retained. The algorithm uses gradient descent to minimize the losses between the style and content representations of a white noise image and the target style and content representations. The content representation is constructed by extracting the feature maps of a convolutional layer, while the style representation is realised by calculating the correlations among the feature maps of a given convolutional layer.

Building on this style representation, Matsuo and Yanai [11] transformed the style matrix to a style vector, then used it as the image feature for image retrieval. They showed that style vectors outperform conventional CNN features and achieved a further performance boost by applying principal component analysis (PCA). On the task of image style classification, Chu and Wu [12] investigated the performances of style vectors from different layers and various types of correlations between feature maps. They reported performance gains when using combinations of correlations and correlations among the style vectors extracted from multiple layers.

2.3 Traditional Chinese Painting Classification

In Chinese painting classification, researchers have studied various image features for texture-based image classification. Jiang et al. [13] proposed an image classification framework that detects traditional Chinese painting (TCP) images among general images by utilizing low-level features such as colour histograms, colour vectors, texture autocorrelation and edge-size histograms. Using Scale Invariant Feature Transform and Canny edge detectors, Gao et al. [14] detected key regions in images and extracted pixel-difference-based features for classification.

Recently, inspired by the successes of CNNs, Meng et al. [15] applied different CNN models to TCP classification. The best performing model they found is the modified VGG-16 framework. Similarly, Liong et al. [16] introduced a TCP image classification framework that utilizes a CNN model for feature extraction. They compared high-level CNN features of several models, which are extracted from the second fully connected layer. A support vector machine is used to classify the image features, which leads to a performance boost.

These approaches employ either low- or high-level features that are related to image content, and thus do not explicitly take image style into consideration. In this thesis, an image style classification algorithm is proposed to classify TCP images.

2.4 Text-to-image Generation

Recently, progress in the area of generating images from natural language descriptions has been made using methods based on Generative Adversarial Networks [3]. A GAN consists of two main parts: a generator that learns to generate new data from the training data distribution, and a discriminator that learns to distinguish the generated data from the real data. During the training stage, the two parts try to surpass each other by improving themselves. In these methods [4, 5], the image generation is conditioned on a general text vector transformed from the text description. Although this approach is able to generate visually realistic images from text, it still faces difficulties in ensuring semantic consistency between the text and the generated image.

To tackle the problem, Xu et al. [17] proposed an attention-based approach, in which the generator is guided to focus on the most relevant words in the text description when generating different parts of the image. In their AttnGAN model, two new components are introduced: the attentional generative network and the Deep Attentional Multimodal Similarity Model (DAMSM). The attentional generative network improves the quality of image generation by calculating relevancies between words from the text description and image areas, and by focusing on the most relevant areas for the generation of image features. The DAMSM component calculates the similarity between the generated image and the text description using word and sentence information. The similarity functions as the loss measurement during the training stage.


3 Research Method

The construction method and the image style classification framework that are used to build the large-scale poem-image dataset in this work are introduced in Sections 3.1 and 3.2, respectively.

3.1 Dataset Construction Method

To construct a dataset that can assist poem-to-image generation, a large number of poem-image pairs is needed. However, no existing poem-image datasets were found; hence, new poem-image data needed to be constructed.

Because manually creating data is time-consuming, an automated construction method was designed that retrieves relevant images for each poem line by querying a search engine and scraping the image results. The relevance of an image is defined by two factors: whether it conveys the traditional Chinese painting style, and whether it is semantically consistent with the matched poem line. In this thesis, the style relevance of images is the emphasized aspect of dataset quality.

The dataset construction method can be divided into three separate phases:

• collecting Chinese poem lines;

• obtaining relevant images for poem lines;

• filtering the images to create poem line-image pairs.

The images returned by the search engine cannot guarantee style relevancy, and can contain irrelevant and noisy images such as realistic photographs. Therefore, image results are filtered with an image style classification model to remove noise.

The dataset is split into two categories: line-image pairs derived from the famous poems and line-image pairs derived from the regular poems. The famous poems are included because it is expected that famous poem lines can retrieve even more relevant images. The details of the dataset construction method are described in the following sections.

3.1.1 Collecting Poem Lines

To collect poem lines, two distinct sources were used. The first source is a Chinese poem indexing site. This source is used because it contains poems that are more well-known than those of the second source; therefore, it is expected that these poems retrieve higher-quality image search results in the next stage. 6,700 famous poems were scraped from the ‘famous poems’ category.


In text-to-image generation, it is usually beneficial to use a high volume of training data. Hence, the second source contains a large quantity of regular poems [18], each grouped by the period of its origin. 20,000 regular poems were retrieved, selected evenly across the different dynasty periods to avoid bias.

The regular poems were split into individual lines, yielding 85,000 poem lines. As for the famous poems, these contain famous quotes; only the poem lines that contain a famous quote were extracted, resulting in a set of 6,700 poem lines. The two types of poem lines were stored separately and form the two parts of the whole dataset.
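As an illustration of this splitting step, the sketch below separates a poem into lines on common sentence-ending punctuation. The delimiter set and the function name are assumptions for illustration; the thesis does not specify the exact splitting rule.

```python
import re

def split_poem_lines(poem: str) -> list[str]:
    """Split a classical Chinese poem into lines on sentence-ending
    punctuation; the delimiter set is an assumption, not the thesis's rule."""
    parts = re.split(r"[。！？；]", poem)
    return [p.strip() for p in parts if p.strip()]

# Example: Li Bai's "Quiet Night Thought" yields two candidate lines.
poem = "床前明月光，疑是地上霜。举头望明月，低头思故乡。"
print(split_poem_lines(poem))
# ['床前明月光，疑是地上霜', '举头望明月，低头思故乡']
```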

3.1.2 Retrieving Images

At this time, there is no Chinese poem-image or painting database in which images can be retrieved by sentence matching. An alternative is to use search engines, as they can traverse high volumes of data and return relevant results for a search query. Hence, to obtain stylistically and semantically relevant images for the poem lines, search engines are used to download images.

To find the optimal search engine and image search guidelines, an analysis of different search engines and combinations of search query keywords was carried out; the results are described in Section 5.1. Based on these findings, the best performing search engine is identified as Baidu, and the most effective method to retrieve relevant images for a poem line is to query the poem line together with the keyword ‘国画’ (GUOHUA, the traditional Chinese painting style). The keyword is added to improve the relevancy of the image results and is referred to as the GUOHUA keyword in the following sections.

Finally, all famous and regular poem lines were queried on Baidu, and the top-10 ranked image results for each poem line query were downloaded using a web scraping script.
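A hedged sketch of this retrieval step is given below. The thesis only states that a web scraping script downloaded the top-10 Baidu image results per query; the endpoint, its parameters, and the JSON layout used here are assumptions for illustration, not the actual script.

```python
import os
import requests

SEARCH_URL = "https://image.baidu.com/search/acjson"  # assumed endpoint

def download_top10(poem_line: str, out_dir: str = "images") -> list[str]:
    """Query Baidu with 'poem line + GUOHUA' and save the top-10 images."""
    os.makedirs(out_dir, exist_ok=True)
    params = {"tn": "resultjson_com", "word": poem_line + " 国画",
              "pn": 0, "rn": 10}  # assumed parameter names
    items = requests.get(SEARCH_URL, params=params, timeout=10).json().get("data", [])
    paths = []
    for rank, item in enumerate(items):
        url = item.get("thumbURL")  # assumed field name in the result JSON
        if not url:
            continue
        path = os.path.join(out_dir, f"{poem_line[:10]}_{rank}.jpg")
        with open(path, "wb") as f:
            f.write(requests.get(url, timeout=10).content)
        paths.append(path)
    return paths
```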

3.1.3 Filtering Noisy Images

In order to collect an accurate dataset, the downloaded images were filtered based on style: an image style classification model was used to remove noise from the image data. For each poem line, the downloaded images in the traditional Chinese painting style were kept, while noisy images were removed. A new poem line-image pair is then formed from the line and the highest-ranked image result that survives the filtering. By following this method, a one-to-one correspondence is ensured within the pairs.
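The pair-formation rule described above amounts to a short loop: walk the ranked results from highest to lowest and keep the first image the style classifier accepts. In this minimal sketch, `classify_style` stands in for the SVM-based model of Section 3.2 and is an assumed interface.

```python
from typing import Callable, Optional

def form_pair(poem_line: str,
              ranked_images: list[str],
              classify_style: Callable[[str], bool]) -> Optional[tuple[str, str]]:
    """Return (line, image) for the highest-ranked image classified as a
    traditional Chinese painting, or None if all results are noise."""
    for image_path in ranked_images:      # already ordered by search rank
        if classify_style(image_path):    # True = TCP style, False = noise
            return (poem_line, image_path)
    return None
```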

3.2 Image Style Classification Model

The aim of this thesis is to construct a large-scale poem-image dataset that contains poem line-image pairs in the traditional Chinese painting style. To construct line-image pairs, images are retrieved from a search engine. However, these image results can vary greatly in image style. Thus, to ensure that line-image pairs are formed in the correct art style, an image style classification model is utilized to filter the image results.

The image style classification framework consists of three consecutive parts: the extraction of style features, the transformation of the features into style vectors, and classification. The framework is illustrated in Figure 2. Firstly, given an input image, a deep convolutional neural network is used to extract deep style features. From these features, a style representation is created to capture the image style. This style representation, also known as the Gram matrix, is further transformed into a style vector. Finally, the style vector is classified using a support vector machine (SVM) classifier.

In the following sections, a detailed description of each component is given.

Figure 2: An illustration of the image style classification framework. (In this illustration, the Gram matrix is extracted from the conv5_1 layer.)

3.2.1 Deep Convolutional Neural Network

In this work, the very deep convolutional network VGG-19 [19] is used for the extraction of image style features. It is a high-performance convolutional neural network (CNN) pre-trained on the ImageNet dataset [20]. The network consists of 16 convolutional and 3 fully connected layers; its architecture is visualized in Figure 3. Within the convolutional layers, a small kernel of size 3×3 with a stride of 1 pixel is used, together with a spatial padding of 1 pixel to preserve the spatial resolution. Max-pooling over a 2×2 kernel with stride 2 is performed in 5 layers that follow some of the convolutional layers.

The convolutional layers can be divided into 5 blocks, each separated by a max-pooling layer. The layers are named ‘conv1_1’, ‘conv1_2’, ‘conv2_1’, etc., where the first digit denotes the block and the second digit the layer within that block. In this thesis, these layers are used to extract image style features, which are then transformed into style vectors for image style classification.
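For readers reproducing this with torchvision's VGG-19, the sketch below recovers the block-style layer names used in this thesis from torchvision's flat `features` indexing. The weight identifier is an assumption based on torchvision's current API.

```python
import torch.nn as nn
from torchvision.models import vgg19

features = vgg19(weights="IMAGENET1K_V1").features

# Walk the feature extractor and name each conv layer 'conv<block>_<layer>';
# max-pooling layers mark the block boundaries.
names, block, layer = {}, 1, 0
for idx, module in enumerate(features):
    if isinstance(module, nn.Conv2d):
        layer += 1
        names[f"conv{block}_{layer}"] = idx
    elif isinstance(module, nn.MaxPool2d):
        block, layer = block + 1, 0

print(names["conv5_1"])  # 28 in this layout
```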


Figure 3: Illustration of the VGG-19 framework [21].

(Conv stands for convolutional and FC stands for fully connected.)

3.2.2 Style Vector

In the work on style transfer by Gatys et al. [10], the style matrix is proposed as the style representation of an image, used to transfer an artistic style onto other images. Since then, the style matrix has been used in several studies on image style classification [22, 12] and style image retrieval [11]. The style matrix measures the correlations between the feature maps of a convolutional layer in the network; these correlations effectively describe local image textures and thereby capture the style information of an image. Hence, this thesis utilizes the style vector, which is derived from the style matrix, as the image representation for image style classification.

The style matrix is a Gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$, whose entries are

$$G^l_{ij} = F^l_i \cdot F^l_j, \tag{1}$$

where $G^l_{ij}$ is the dot product between the vectorized feature maps $F^l_i$ and $F^l_j$ of layer $l$.

The style matrix is then transformed into a style vector $V^l$ of length $N_l^2$, defined as

$$V^l = \big[G^l_{1,1},\, G^l_{1,2},\, \dots,\, G^l_{1,N_l},\, G^l_{2,1},\, \dots,\, G^l_{N_l,N_l}\big], \tag{2}$$

where $N_l$ is the number of feature maps in layer $l$. The vectorization is carried out in order to enable classification using style features.
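A minimal PyTorch sketch of Equations (1) and (2) is given below: it runs an image through VGG-19 up to a chosen convolutional layer, computes the Gram matrix of the feature maps, and flattens it into a style vector. The layer index 28 (conv5_1 in torchvision's layout) and the random input are illustrative assumptions.

```python
import torch
from torchvision.models import vgg19

features = vgg19(weights="IMAGENET1K_V1").features.eval()

def style_vector(image: torch.Tensor, layer_idx: int = 28) -> torch.Tensor:
    """image: (1, 3, H, W), ImageNet-normalized. Returns V^l of length N_l^2."""
    x = image
    with torch.no_grad():
        for idx, module in enumerate(features):
            x = module(x)
            if idx == layer_idx:
                break
    n_l = x.shape[1]          # number of feature maps N_l in layer l
    f = x.view(n_l, -1)       # vectorized feature maps F^l
    gram = f @ f.t()          # G^l_ij = F^l_i . F^l_j      (Eq. 1)
    return gram.flatten()     # style vector V^l             (Eq. 2)

v = style_vector(torch.randn(1, 3, 224, 224))
print(v.shape)  # torch.Size([262144]), i.e. 512^2 for conv5_1
```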

3.2.3 Classifier

Support vector machines are robust against high-dimensional data; therefore, an SVM is used to classify the style vectors. Given a set of datapoints, the SVM divides the points into separate classes by finding the optimal decision boundary (a hyperplane) between the classes. First, it finds the datapoints (n-dimensional vectors) that are closest to the boundary, also known as the support vectors. Its goal is to maximize the margin between the decision boundary and the support vectors. For n-dimensional datapoints, the separating hyperplane has n-1 dimensions. An illustration of the algorithm is given in Figure 4.

Before classification, principal component analysis (PCA) was performed on all extracted style vectors to reduce their dimensionality to 200. This diminished the computational costs by a large factor while minimizing the information loss: approximately 80% of the variance is explained when using 200 components (Figure 5).
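Put together, the classification stage reduces each style vector to 200 principal components and feeds it to an SVM. The sklearn sketch below reflects this; the placeholder data, the kernel choice and the other hyperparameters are assumptions, as the thesis does not list them.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Placeholder style vectors (real conv5_1 vectors are 262144-dimensional);
# y: 1 = traditional Chinese painting, 0 = noisy image.
rng = np.random.default_rng(0)
X = rng.standard_normal((546, 4096)).astype(np.float32)
y = rng.integers(0, 2, size=546)

clf = make_pipeline(PCA(n_components=200), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.predict(X[:5]))
```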

Figure 4: Schematic representation of the SVM algorithm [23].


4 Experimental Setup

4.1 Datasets

4.1.1 Training Datasets

To find the best-performing set of training data for the image style classification model, different model configurations were trained using different types of training data and evaluated.

Three different sources of positive training samples were used:

– ‘Feng Zikai’ painting images, gathered from the influential Chinese artist Feng Zikai and extracted from the book Zikai Manhua [24]. These are coloured cartoons that portray poems.

– TCP images, collected from a GitHub repository [25].

– Stylistically relevant images, downloaded from Baidu using ‘poem line + GUOHUA’ queries. These depict a traditional Chinese art style similar to that of the TCP images, but contain more variation and are less restricted than TCPs. This set is referred to as ‘Baidu Query’.

Regarding the negative training samples, two sources were used:

– General and realistic photo images, selected from the MS COCO dataset [26].

– Noisy images, collected from Baidu using poem line queries without GUOHUA, referred to as ‘Noisy Query’ images.

The performance results of all model configurations are reported in Section 5.2.1.

4.1.2 Test Datasets

In this thesis, the following datasets were used for testing:

– The ‘Balanced’ dataset, which consists of 73 positive TCP and 77 negative noisy image samples, manually selected from the ‘Baidu Query’ and ‘Noisy Query’ sets but unrelated to the training data.

– A TCP subset, containing 1000 TCP images from the GitHub repository [25]. All images are positive samples.

– A subset of the MS COCO dataset, containing 1000 general and realistic photo images. These are all labeled as negative samples.


4.2 Evaluation Metrics

The accuracy and F1 score are the primary evaluation metrics. In binary classification, the accuracy is the ratio of correct predictions to the total number of samples:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{3}$$

where TP, TN, FP and FN stand for true positives, true negatives, false positives and false negatives. In this thesis, the accuracy measures the ratio of stylistically relevant images.

The F1 score is defined as the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \tag{4}$$

where

$$\text{Precision} = \frac{TP}{TP + FP}, \tag{5}$$

and

$$\text{Recall} = \frac{TP}{TP + FN}. \tag{6}$$
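The sketch below computes Equations (3) to (6) directly from binary labels, where the positive class is the traditional Chinese painting style; the helper name and the example labels are illustrative.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = TCP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)                       # Eq. (3)
    precision = tp / (tp + fp) if tp + fp else 0.0          # Eq. (5)
    recall = tp / (tp + fn) if tp + fn else 0.0             # Eq. (6)
    f1 = (2 * precision * recall / (precision + recall)     # Eq. (4)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
# {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```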


5 Experimental Results

5.1 Image Search Guidelines

In this section, an experiment is designed to answer the first research question. Two factors that influence image search results were investigated: the search engine and the search query. The goal of the experiment is to find the optimal search query that returns the most relevant image results for a poem line.

5.1.1 Search Engine Performance

Firstly, the performances of two search engines, Google and Baidu, are compared based on the style relevancy of the retrieved images. An image was considered relevant if it expressed the traditional Chinese painting style. The relevancy test is based on search queries: ten search queries were formed, each consisting of a poem line and the GUOHUA keyword, and the top-10 image results of each query were evaluated on both search engines.

By comparison, the image results retrieved from Baidu contained the fewest noisy (irrelevant) images. Google yielded fewer relevant images, and its results are also less semantically consistent with the input query. A comparison is shown in Figures 6 and 7. Thus, in the following studies, Baidu is used as the search engine for the retrieval of relevant images.

Figure 6: Google image results for the query “一学芙蓉叶,初开映水幽。国画”.

Figure 7: Baidu image results for the query “一学芙蓉叶，初开映水幽。国画”.

The difference in performance between Baidu and Google has an intuitive explanation: Baidu is a Chinese search engine and focuses primarily on its Chinese user base, which means that it is optimized for Chinese queries, while Google is not. Another factor that can boost Baidu's performance is that Baidu can reach a larger number of Chinese websites than Google.

5.1.2 Search Query Performance

To find the optimal query for the retrieval of relevant images, various combinations of keywords were designed. The query combinations were formed from the full poem text, the author, the title, a single line of the poem, and the GUOHUA keyword, which refines the search results to primarily Chinese paintings. The performance of seven query combinations was compared on five poem lines. The top-10 retrieved images for each query were manually annotated: stylistically relevant images were labeled positive, while irrelevant images were labeled negative. Each query was evaluated based on its top-10 results. The results are presented in Table 1. The relevant query rate is defined as the number of poem lines that returned at least one relevant image, divided by the total number of lines tested. The mean accuracy is the average accuracy of the query results over the five poem lines, where accuracy is defined in Equation (3).

Query keywords        Mean accuracy   Relevant query rate
full poem                  8.0%              2/5
full poem + GUOHUA        42.0%              4/5
poem title                16.0%              2/5
poem title + author        8.0%              1/5
poem title + GUOHUA       48.0%              4/5
poem line                 16.0%              3/5
poem line + GUOHUA        78.0%              5/5

Table 1: Experimental results for queries using different keyword combinations.

Table 1 indicates that the best performing query is the ‘poem line + GUOHUA’ combination, which achieves an accuracy of 78.0%. It is also the only combination that returns relevant results for all five search queries. Two other combinations, ‘full poem + GUOHUA’ and ‘poem title + GUOHUA’, achieve a fairly high relevant query rate (4/5) but score lower on accuracy (42.0% and 48.0%, respectively). The query combinations without GUOHUA perform the worst. When comparing only single-keyword queries, the best one is ‘poem line’, followed by ‘poem title’ and ‘full poem’.

It is interesting that adding extra keywords does not necessarily result in better queries. One example in Table 1 is ‘poem title’: adding the ‘author’ keyword leads to a reduction of 8.0 points in accuracy and halves the relevant query rate. This performance reduction may be caused by the relative obscurity of the authors of many historical Chinese poems.

The mean accuracies in Table 1 indicate that the queries containing GUOHUA clearly outperform those without. The keyword shifts the search engine's focus from general images (including photographs and occasional paintings) to TCPs. The performance improvement of queries containing GUOHUA is therefore expected.

The results in Table 1 also indicate that a single poem line performs better than the full poem. A poem generally conveys a story in which multiple visual scenes and different topics are described. In contrast, a single poem line describes only one visual setting and can be represented by a single image. Accordingly, search engines might struggle with full-poem queries, as they contain excessive information. This can explain why the single poem line query performed better than the full poem query in both settings (with and without GUOHUA).

Comparing all query keyword combinations, the ‘poem line + GUOHUA’ combination, with its high accuracy of 78.0%, proves to be an efficient way to retrieve relevant TCP images.

5.2 Effectiveness of Image Style Classification Model

The experiments described in this section are carried out to clarify the second research question.

5.2.1 Selecting Model Training Data

To find the most effective training data for the image style classification task, different model configurations, trained on various partitions of the training data, were evaluated. For all training data, style vectors were extracted from the conv5_1 layer. Performances of the model configurations were measured on the Balanced test dataset. Table 2 shows the performance variations among these partitions.

Training sets that only contain Feng Zikai images (sets 1 and 6) show the worst performance (62.67% and 57.33%). TCP and Baidu images both perform relatively well. When MS COCO is used for the negative training samples, TCP images perform better than Baidu images (86.60% vs. 82.0%; see training sets 3 and 4). On the contrary, when Noisy Query images are used as the negative samples, Baidu images outperform TCP images by 14.10 points.

To investigate the influence of the type of negative training data, pairwise comparisons were made between sets with MS COCO and sets with Noisy Query images. Each pair of sets has the same positive training samples, so that the comparison is unbiased. When comparing set 1 against 6 and set 3 against 7, the sets with MS COCO images outperform the others by 5.34 and 4.0 points, respectively. However, in favour of the Noisy Query data, sets 9 and 10 outperform sets 4 and 5 by larger margins of 14.7 and 13.37 points, respectively. The Noisy Query samples thus achieve greater margin increases than the MS COCO samples. Moreover, the sets containing Noisy Query images achieve the highest overall accuracy. Hence, on average, the Noisy Query dataset outperforms MS COCO.

As shown in Table 2, the training data partitions that perform best are training sets 8, 9 and 10, scoring accuracies of 94.60%, 96.70% and 96.70%, respectively, on the Balanced test dataset.

Training set#   Feng Zikai   TCP   Baidu images   MS COCO   Noisy Query   Accuracy
1               200          -     -              346       -             62.67
2               100          100   -              346       -             84.60
3               -            200   -              346       -             86.60
4               -            -     200            346       -             82.0
5               -            100   100            346       -             83.33
6               200          -     -              -         346           57.33
7               -            200   -              -         346           82.60
8               50           50    100            -         346           94.60
9               -            -     200            -         346           96.70
10              -            100   100            -         346           96.70

Table 2: Performance of model configurations trained on different partitions of the training data. (For each partition, the number of training samples from each dataset is listed; each partition contains 200 positive and 346 negative samples.)

To further examine the performance of these training splits, the rate of false positives and false negatives was measured on the Balanced test set. The results are presented in Table 3.

Training set#   FP rate (%)   FN rate (%)
8               5.19          5.48
9               2.60          4.11
10              1.30          5.48

Table 3: Performance variations on the Balanced dataset. (FP = false positive; FN = false negative.)

In Table 3, training set 10 has a lower FP rate but a higher FN rate than set 9. In this image style classification task, the focus lies on filtering out noisy images rather than perfectly detecting TCPs. Thus, a low false positive rate (FPR) is preferred over a low false negative rate (FNR), to make the model more effective against noise. Therefore, set 10 is chosen as the final training set.

In the following studies, the experiments were conducted with an extended version of training set 10. More positive training samples were added in order to match the number of negative samples. The final training data includes 173 TCP, 173 Baidu Query and 346 Noisy Query images.


5.2.2 Style Vector Performance

Each convolutional layer of the VGG-19 network extracts different features from the image. The first few convolutional layers extract low-level features, which represent minor details of the image such as lines or dots. As the input image propagates through the network, the extracted features become more abstract: they lose fine detail but represent larger parts of the image. An example of low- to high-level features is visualised in Figure 8.

Figure 8: Visualization of low- to high-level features from a CNN [27].

To find the optimal style vector, the performance variations between style vectors from different layers were investigated for image style classification.

Style vectors from six convolutional layers (conv1_1, conv2_1, conv3_1, conv4_1, conv5_1, conv5_4) were extracted. The conv1_1, conv2_1 and conv3_1 style vectors are 4096-, 16384- and 65536-dimensional, respectively; the other three style vectors are 262144-dimensional. The CNN features from the first and second fully connected layers (fc1, fc2) were also included, as these layers are widely used in image classification tasks. The features derived from the fully connected layers are not style vectors, but simply their 4096-dimensional outputs. After all vectors were extracted, their dimensionality was reduced to 200 using principal component analysis (PCA), enabling an unbiased comparison. The Balanced test set was utilized to measure the classification accuracy and F1 score. Additionally, five-fold cross-validation was applied on the training dataset to report the average training accuracy.

Table 4 shows the test results for all the defined vectors. The best performance is achieved by the style vector derived from layer conv5_1, with a test accuracy of 98.0% and an F1 score of 0.9809. Its accuracy exceeds that of the second-best performing style vector (conv4_1, 94.0%) by a margin of 4.0 points. Both style vectors perform slightly better on test accuracy and F1 score than the fc1 and fc2 features, proving these two style vectors to be effective; the vector from conv5_1 in particular is the most effective for image style classification.

Layer     Train acc.   Test acc.   F1 score
conv1_1   79.75        92.0        0.9241
conv2_1   83.66        91.33       0.9161
conv3_1   86.55        91.33       0.9161
conv4_1   86.26        94.0        0.9404
conv5_1   86.55        98.0        0.9809
conv5_4   84.39        92.0        0.9241
fc1       85.11        90.0        0.9182
fc2       86.12        92.0        0.9231

Table 4: Experimental results of style vectors based on the Balanced dataset.

The remaining vectors perform approximately the same on the test set, with test accuracies ranging from 90.0% to 92.0%.

When comparing performances on the training set, conv1_1 has the lowest accuracy (79.75%), while conv3_1 and conv5_1 have the highest (both 86.55%).

Table 4 suggests a correlation between a style vector's effectiveness and the depth of its convolutional layer: deeper layers tend to produce style vectors with better performance. A possible reason is that the first layers extract low-level features of the image that do not capture the image style effectively, while later layers, such as conv5_1, capture higher-level features that describe the image style better. However, because of the relatively lower performance of the last layer, conv5_4, this correlation is not consistent; instead, there seems to be a performance peak around the conv5_1 layer.

To investigate the performance variations of the layers close to conv5_1, the performances of all convolutional layers between conv4_1 and conv5_4 were measured. The results are presented in Table 5.

Layer     Accuracy   F1 score
conv4_1   94.0       0.9404
conv4_2   94.67      0.9459
conv4_3   98.0       0.9804
conv4_4   94.67      0.9494
conv5_1   98.0       0.9809
conv5_2   96.0       0.9610
conv5_3   95.33      0.9548
conv5_4   91.33      0.9161

Table 5: Performance variations of style vectors based on the Balanced dataset.

The conv4_3 and conv5_1 layers perform best, both acquiring an accuracy of 98.0% and an F1 score of approximately 0.98. There is no single clear peak in the layers' performance, as two layers achieve the optimal performance. Studying the cause of these performance variations is beyond the scope of this thesis; future research could look into the differences in filter activations among layers.

To select the optimal style vector, the accuracies of the conv4_3 and conv5_1 layers were evaluated on two additional test sets: the 1000-image TCP subset and the 1000-image MS COCO subset. The results are shown in Table 6.

Conv4_3 outperforms conv5_1 on both test sets: by 0.70 points on the MS COCO subset and by 2.0 points on the TCP subset. Therefore, the style vector from the conv4_3 layer is selected for the final classification model.

Layer     TCP    MS COCO
conv4_3   96.0   94.80
conv5_1   94.0   94.10

Table 6: Accuracy of two style vectors.

5.3 Final Classifier

To answer the third research question, the performance of the final classification model, with style vectors from the conv4_3 layer, was evaluated on three test sets: the Balanced dataset, the 1000-image TCP set and the 1000-image MS COCO set. The TCP and MS COCO test sets were used to measure the model's performance on classifying traditional Chinese paintings and general noisy images, respectively.

The SVM classifier in our framework is compared against two general classifiers, Logistic Regression and Naïve Bayes. All classifiers utilize style vectors derived from the conv4_3 layer. The measured classification accuracies are shown in Table 7.

Dataset        Naïve Bayes   Logistic Regression   SVM
Balanced set   64.67         92.67                 98.67
TCP set        87.20         95.50                 96.0
MS COCO set    27.80         92.50                 94.80

Table 7: Classification accuracy on the test sets.

According to these results, the SVM classifier outperforms the other classifiers on all three datasets. On the Balanced set, it outperforms Logistic Regression and Naïve Bayes by 6.0 and 34.0 points, respectively. On the TCP set, the SVM shows an increase of 0.50 and 8.80 points compared to Logistic Regression and Naïve Bayes, respectively. On the MS COCO set, it outperforms them by 2.30 points (vs. Logistic Regression) and 67.0 points (vs. Naïve Bayes). The SVM therefore proves to be an effective classifier for the proposed framework. Furthermore, the results demonstrate the high performance of the image style classification model and validate the proposed method.

5.3.1 Practical Task Performance

To test the final classification model’s performance on a practical image filtering task, the following steps were carried out.

1. 200 poem lines and their corresponding top-10 downloaded images were selected randomly.

2. To create poem line-image pairs, each set of top-10 images was filtered using the style classification model, in order from the highest- to the lowest-ranked image, until a positively classified image was found.

3. This image was selected and the line-image pair was formed. The quality of the 200 images from the resulting line-image pairs was then evaluated.

To compare the results, a baseline was created that simply selects the highest-ranked (top-1) image of each top-10 set for each poem line. The comparison between the line-image pairs from the baseline and those from the final classification model is given in Table 8.

Method                                Accuracy   # False positives
Top-1 baseline                        73.0       54
Proposed style classification model   82.5       35

Table 8: Results for the practical filtering task of 200 line-image pairs. (A false positive is a non-TCP image that was classified as a TCP image.)

In Table 8, the final classification model outperforms the baseline in accuracy by 9.5 points. Furthermore, of the 200 images it classified as positive, only 35 were false positives, whereas the top-1 baseline returned 54 false positives. These results indicate that the proposed image style classification model brings a substantial increase in performance.

The final model acquired optimal classification scores on the Balanced, TCP and MS COCO datasets, and it shows its potential for filtering images in a practical task. Hence, this classification framework, consisting of the extracted style vectors and the SVM classifier, proves to be effective for image style classification.

5.4 Quality of the Constructed Dataset

To answer the last research question, the quality of the constructed dataset was evaluated based on the level of noise, the style and semantic consistency of the line-image pairs, and the effectiveness in poem-to-image generation. The level of noise represents the proportion of stylistically irrelevant images contained in the dataset. The style consistency metric evaluates whether an image portrays the TCP style, and the semantic consistency of a line-image pair is measured by checking whether the image represents the meaning of the line. Lastly, the effectiveness of the dataset is tested by evaluating images generated by a text-to-image model trained on this dataset.

5.4.1 Image Style Noise

The level of noise is a subjective metric and requires human evaluation. Because human evaluation is costly, the level of noise was measured on a subset of the dataset: 200 images were randomly sampled from the famous poem line pairs and another 200 from the regular poem line pairs. The numbers of relevant and irrelevant images were counted manually and are shown in Table 9.

                     #TCP images   #noisy images   Ratio of noise
Famous poem lines    168           32              16%
Regular poem lines   172           28              14%

Table 9: The level of noise for each type of poem line in the dataset.

In this small-scale test, the regular poem line pairs contained the least noise (14%), and the famous poem line pairs contained only slightly more (16%). Some examples of noisy images are shown in Figures 9 and 10.

Figure 9: Example of noisy images that remained in the famous line-image data.

Figure 10: Example of noisy images that remained in the regular line-image data.

5.4.2 Poem Image Style and Semantic Consistency

Human evaluation was used to assess the style and semantics of the images in the constructed dataset. Because human evaluation is time-consuming, the same test samples as in Section 5.4.1 were used.

When the noisy images in both test sets are excluded, the relevant images from both sets share the same high-level style: the traditional Chinese painting style. Within the TCP style there exists large variation, which can also be seen throughout the two test sets (Figures 11 and 12). The various types include flower, bird, landscape, person and ink brush paintings. The distribution of the various types of traditional Chinese paintings remains approximately the same in both test sets.

Figure 11: Example of positively classified images from the famous line-image pairs.

As for the evaluation of the semantic consistency between poem lines and their corresponding images, line-image pairs were manually analyzed based on the meaning of the sentence and the portrayed subject of the painting. A total of 24 line-image pairs were checked: 12 regular and 12 famous pairs.

For 19 pairs, the image did not portray a visual representation related to the semantics of the poem line; the depicted scene was mostly arbitrary, showing an unrelated traditional Chinese painting. Only for 5 out of the 24 pairs did the corresponding image contain a somewhat relevant illustration. These poem lines all contain a noun that represents a simple object, i.e. crane, fish, sunset, flower or boat. The image accurately represents only this object and does not represent the other words in the poem line (such as adverbs, verbs or more complex nouns). These findings suggest that the lines and images are generally not semantically consistent with each other.


Figure 12: Example of positively classified images from the regular line-image pairs.

5.4.3 Application of the Dataset

In the field of text-to-image generation, the automatic generation of images from natural language descriptions is studied. Within this field, the proposed dataset can assist in a previously unexplored task: poem-inspired image generation. Recent work on poem-to-image generation has been carried out by a member of my project group, who adapted the AttnGAN model to this task.

To evaluate the quality of the dataset, the AttnGAN model was trained in three different configurations for poem-to-image generation:

1. the 6,700 famous poem line-image pairs;

2. the 80,000 regular poem line-image pairs;

3. the complete dataset containing 86,700 pairs.

For each configuration, 8 images were generated, each based on a corresponding poem line. The images were analyzed based on their style and semantic relevance.

5.4.3.1 Semantic Consistency

The images generated by the famous poem configuration were not semantically relevant to the input sentences: 7 out of 8 generated images portray a scene unrelated to the input sentence, and only one image contains a representation of a keyword from the sentence. Furthermore, the objects in the images are abstract and difficult to identify. Four examples are displayed in Figure 13.

Figure 13: Novel images generated by the first configuration of AttnGAN.

The regular poem configuration generated slightly better images than the famous poem configuration. It generated 3 images that contain word features from the sentence (Figure 14). The corresponding sentence for the first image contains the word ‘flower’, and the sentences for the second and third images contain ‘person’. However, the images are not relevant to the semantics of the other words in the sentence. The other 5 images portray landscapes and mountains (Figure 15), but are not semantically relevant to the input sentence.

Figure 14: Novel images generated by the second configuration of AttnGAN. (These images showed semantic features of its corresponding poem line.)

Figure 15: Novel images generated by the second configuration of AttnGAN. (These images were not semantically relevant to corresponding poem lines.)

The results of the last model configuration, trained on the whole dataset, are not semantically relevant at all. Some examples are shown in Figure 16. Compared to the other results, these images are the most abstract: they seem to depict mountains, but apart from that there are no distinct objects. The images do not contain features from the sentence.


Figure 16: Novel images generated by the third configuration of AttnGAN.

5.4.3.2 Style Consistency

Regarding image style, the images generated from the famous pairs correctly convey the TCP style. However, all images show little variation in structure and lack colour, and thus portray a somber atmosphere.

Concerning the images from the second configuration (Figures 14 and 15), the overall style is similar to the TCP style, but the images occasionally show abstract blobs, which are unusual in the traditional style. Additionally, more types of scenery are depicted in an appropriate style than in the other two configurations.

The third configuration generates the least notable results (Figure 16). The style is uniform across the images, as each one depicts mountains. The drawing style does resemble the TCP style, but it is more abstract. Furthermore, the images do not include any colour and are generally grey.

When comparing all image results, the second configuration shows the most promising results. It performs slightly better at representing sentence semantics in images than the other configurations, and it generates images with more variation in topic and colour while conveying an appropriate style that resembles TCPs. This suggests that the regular poem line-image pairs are the most useful for image generation, which might be explained by the larger volume of the regular pairs compared to the famous pairs (80,000 vs. 6,700). However, this does not explain why they are more useful than the complete dataset, which contains a total of 86,700 pairs.

Even though the constructed dataset shows interesting results when used for poem-inspired image generation, these results are not yet significant in the field of text-to-image generation: the sentence semantics are not consistently represented in the image results, and thus further research is needed.


6 Conclusion and Discussion

This thesis proposed an automated construction method for poem-image datasets that retrieves images from search engines and removes noise using an image style classification model. By following this method, a large-scale poem-image dataset in the traditional Chinese painting style was introduced to assist poem-to-image generation. To design the construction method, four research questions were defined.

To address the first research question, a poem-image pair construction method was developed. The method contains several steps. Firstly, relevant images for poem lines were downloaded from a search engine using a specified search query. To find the optimal parameters that influence the quality of the retrieved images, the performances of different search engines and search guidelines were investigated. The results show that Baidu performs best, together with the ‘poem line + GUOHUA’ query. As the last step, the images were filtered by the proposed image style classification model to reduce noise, and the final poem line-image pairs were created.

With regard to the second research question, how to effectively train and apply an image style classifier, various model configurations and image style vector features were analyzed, and their performances were compared on several datasets. The optimal training data and style vector were selected and implemented in the final classification framework.

By extensively testing the performance of the image style classification model, the effectiveness of the proposed noise filtering method, which corresponds to the third research question, was proven. High model performance was reported on several datasets, and poem line-image pairs constructed from filtered image data show improved quality compared to pairs from unfiltered data. Thus, the use of an image style classification model to filter irrelevant images is validated.

In order to examine the last research question, to what extent the proposed dataset assists poem-to-image generation, the usefulness of the dataset was studied in two ways. One is analyzing the quality of poem line-image pairs based on their overall stylistic and semantic relevance; the other is evaluating the quality of the images generated by AttnGAN with the poem-image dataset as training data. The results indicate that the images in the dataset are stylistically relevant, but lack semantic relevance. Furthermore, although the generated images are interesting and are correctly drawn in the TCP style, they cannot fully convey the semantics of the poem lines.

While the images in the dataset correctly match the targeted TCP style, they cannot fully represent the meaning of the poem lines. This indicates that creating semantically corresponding line-image pairs is a challenging task, and the proposed framework can be improved in future research.

One limitation that causes the low semantic quality of the line-image pairs is related to the queries used by the search engines: the quality of the retrieved images depends highly on the keywords in the search query. As the collected lines from classical poetry contain abstract and ambiguous semantics, the corresponding queries may cause the search engines to perform poorly.


A possible improvement is to limit the vocabulary difficulty of the poems during dataset construction: by including only poems that are less ambiguous and more concise, the search engine can return more semantically accurate image results.

Furthermore, to assist poem-to-image generation, a simplified poem-image dataset can be constructed by removing the style limitation for images. This could improve the semantic relevance of line-image pairs.


References

[1] Brindley D. Breaking the Poetry Barrier: Towards Understanding and Enjoying Poetry; 1980.

[2] Elder J. Imagining the earth: Poetry and the vision of nature. University of Georgia Press; 1996.

[3] Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative Adversarial Nets. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ, editors. Advances in Neural Information Processing Systems 27. Curran Associates, Inc.; 2014. p. 2672–2680. Available from: http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.

[4] Zhang H, Xu T, Li H, Zhang S, Huang X, Wang X, et al. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. CoRR. 2016;abs/1612.03242. Available from: http://arxiv.org/abs/1612.03242.

[5] Reed SE, Akata Z, Yan X, Logeswaran L, Schiele B, Lee H. Generative Adversarial Text to Image Synthesis. CoRR. 2016;abs/1605.05396. Available from: http://arxiv.org/abs/1605.05396.

[6] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV). 2015;115(3):211–252.

[7] Krizhevsky A, Sutskever I, Hinton G. ImageNet Classification with Deep Convolutional Neural Networks. Neural Information Processing Systems. 2012;25.

[8] Karayev S, Hertzmann A, Winnemoeller H, Agarwala A, Darrell T. Recognizing Image Style. CoRR. 2013;abs/1311.3715. Available from: http://arxiv.org/abs/1311.3715.

[9] Tseng T, Chang W, Chen C, Wang YF. Style retrieval from natural images. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2016. p. 1561–1565.

[10] Gatys LA, Ecker AS, Bethge M. A Neural Algorithm of Artistic Style. CoRR. 2015;abs/1508.06576. Available from: http://arxiv.org/abs/1508.06576.

[11] Matsuo S, Yanai K. CNN-Based Style Vector for Style Image Retrieval. In: Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. ICMR '16. New York, NY, USA: Association for Computing Machinery; 2016. p. 309–312. Available from: https://doi.org/10.1145/2911996.2912057.

[12] Chu WT, Wu YL. Deep Correlation Features for Image Style Classification; 2016. p. 402–406.


[13] Jiang S, Gao W, Wang W. Classifying traditional Chinese painting images; 2004. p. 1816–1820, vol. 3.

[14] Gao F, Nie J, Huang L, Duan LY, Li XM. Traditional Chinese Painting Classification Based on Painting Techniques. Jisuanji Xuebao/Chinese Journal of Computers. 2017 12;40:2871–2882.

[15] Meng Q, Zhang H, Zhou M, Zhao S, Zhou P. The Classification of Traditional Chinese Painting Based on CNN. In: Sun X, Pan Z, Bertino E, editors. Cloud Computing and Security. Cham: Springer International Publishing; 2018. p. 232– 241.

[16] Liong ST, Huang YC, Li S, Huang Z, Ma J, Gan YS. Automatic traditional Chinese painting classification: A benchmarking analysis. Computational Intelligence. Available from: https://onlinelibrary.wiley.com/doi/abs/10.1111/coin.12328.

[17] Xu T, Zhang P, Huang Q, Zhang H, Gan Z, Huang X, et al. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018. p. 1316–1324.

[18] Werner. Poetry. GitHub; 2018. https://github.com/Werneror/Poetry.

[19] Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR. 2015;abs/1409.1556.

[20] Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR09; 2009. .

[21] Zheng Y, Yang C, Merkulov A. Breast cancer screening using convolutional neural network and follow-up digital mammography; 2018. p. 4.

[22] Chu W, Wu Y. Image Style Classification Based on Learnt Deep Correlation Fea-tures. IEEE Transactions on Multimedia. 2018;20(9):2491–2502.

[23] Mir A, Nasiri JA. LightTwinSVM: A Simple and Fast Implementation of Standard Twin Support Vector Machine Classifier. Journal of Open Source Software. 2019 03;4:1252.

[24] Feng Z. Zikai manhua xuan = Selected Cartoons of Feng Tsi-kai. Xianggang : Xianggang: Zhongwai chubanshe ; Shidai tushu; 1975.

[25] Wang G, Chen Y, Chen Y. Chinese Painting Generation Using Generative Adversarial Networks; 2018. Available from: http://cs231n.stanford.edu/reports/2017/pdfs/311.pdf. Dataset available from: https://github.com/ychen93/Chinese-Painting-Dataset.


[26] Lin T, Maire M, Belongie SJ, Bourdev LD, Girshick RB, Hays J, et al. Microsoft COCO: Common Objects in Context. CoRR. 2014;abs/1405.0312. Available from: http://arxiv.org/abs/1405.0312.
