Artistic eye color through generative models

ABSTRACT

Artistic portrait painting is a depiction of humans translated into an abstract creative space. This process often augments the perception of a person compared to real life photos. In this work we focus on the abstraction of eyes in artistic painting, their distinguishing features and characteristics. We describe the research and development of applying certain machine learning models and a dataset in this domain. We perform multiple experiments that build upon each other to find, extract, analyse and finally create painted eyes. Based on our experimental design we elaborate on the following contributions: (1) a dataset of painted eyes, (2) a model that can classify eyes, (3) an analysis performed on painted eyes, (4) an evaluation system. Based on the learned data distributions we propose (5) a generative model for generating eyes in the artistic painting domain. We evaluate the quality of our model with interactive human trials where our generated samples fool the human observer most of the time.

KEYWORDS

Machine learning, Classification, Conditional Generative Network, Painting, Eye color

1 INTRODUCTION

Paintings are a form of art that is plentiful in the world. Each painting tells a story, and in the visuals of this story information about the artist and that period in history can be found. For example, works of Rembrandt van Rijn show how he experimented with light and dark, a technique called chiaroscuro that originated during the Renaissance. Analyzing this style information is often done by people knowledgeable about art. However, they can only work on a few pieces at a time, since analyzing is a time-consuming process with multiple facets. Teaching computers to analyze paintings is a step forward. An attempt at recognizing art styles was made by Lecoutre et al. [7]. They conclude from their analysis that better results can be achieved with a bigger dataset. Such a dataset exists in OmniArt, a result of the research performed by Strezoski and Worring [11].

Research can use this dataset to analyze individual artistic paintings and extract the information that is in them. However, we use this large dataset to perform large scale research. We take a large, representative chunk of artistic portrait paintings and analyze them. It is then possible to make verifiable generalizations and answer questions that previously went unanswered. An example of such a question from the art community is why most portrait paintings have dark eye colors, even though the real eye color of the subject is different. Examples of this are portraits of Jeanne Hébuterne: in some of the paintings where she is the subject the eyes are dark, whereas in others they are blue. A reason for this could be to increase contrast. However, the question can only be answered after fully analyzing the eyes of a large set of portrait paintings. In this thesis we detect, localize, extract and classify eyes in the portrait paintings subset of the OmniArt dataset. The resulting dataset is then a possible source for generating new eyes.

Figure 1: A chronological pipeline of tasks in this research. In (A) we detect and localize the faces and eyes in portrait

paintings. We then, in (B), extract eyes through traditional computer vision methods. In (C) we use this small dataset

to train a CNN. In (D) we use the CNN to classify eyes and create the OmniArt eyes dataset. We use this dataset in (E)

to train a cDCGAN. We then generate painted eyes in (F) and evaluate them with humans in (G).

A good generator can show the potential of machine learning in art, for example in the restoration process of paintings. We use a combination of techniques for this to achieve better results. This can all be formalized in the following research question: “How can we use a combination of traditional computer vision and machine learning techniques to find, analyse, and create eyes from and in artistic paintings?” We answer this research question with the following contributions, of which we visualize the pipeline in figure 1:

• Dataset of artistically painted eyes

• Custom eye detection and eye color classifier model

• Conditional painted eye generative model

• Human evaluation of generated painted eyes

2 RELATED WORK

The majority of this research builds upon two fields within machine learning. These are face and facial landmark detection, and generative networks. Face and facial landmark detection techniques are necessary to extract faces and eyes from paintings, whereas generative networks are necessary to generate new painted eyes. Here we detail the current research and state-of-the-art tools in both of these fields.

2.1 Face and facial landmark detection

The detection of facial landmarks is a field within computer vision that is important to many applications, e.g. face verification and face recognition [14]. In the last few years many different tools have been released that can detect faces and landmarks, which is key within this research. A state-of-the-art tool for this is face_recognition by Geitgey [3]. The tool is able to detect and group similar faces. However, it is not perfect: the author states that its network is primarily trained on adult faces and may fail to detect the faces of children. This can introduce difficulty when trying to detect faces in paintings, as these paintings might have faces painted in artistic styles that the model cannot recognize.

This potential issue has been previously documented by Wechsler and Toor [12]. They propose a new dataset, MAFD-150, that can be used to train networks to accurately recognise faces in artistic paintings with different styles. Furthermore, they state that face detection tools are not near human performance, because current tools fail to detect faces in many paintings in the dataset while humans have no issues. An approach to this issue is described by Mzoughi et al. [9]. Their approach is to limit the training to a specific artistic style, in this case Tenebrism, which is known for its high contrasts. The results are positive and the resulting model is able to accurately detect faces. However, it should be noted that it is limited to one art style, which could mean that it is less applicable to datasets that consist of a multitude of artistic styles, such as the OmniArt dataset. Instead of limiting the detection to one specific art style, we aim for styles in which humans still look like humans. This means that for some artistic styles detection is less optimal; however, in total more detections should happen because the detection is not limited to one artistic style.

2.2 Generative networks

Generative networks are networks that are able to generate data that, preferably, fits within the distribution of data the network was trained on. Through a well trained generative network it becomes possible to create, for example, faces of non-existing humans. Much of the current research in generative networks builds upon the concept of the Generative Adversarial Network (GAN), which was introduced by Goodfellow et al. [4]. In a GAN two components, a generator and a discriminator, compete in a minimax game; the generator tries to fool the discriminator with generated images. The optimal result of a GAN is a generator that makes the discriminator classify a generated image with a confidence of 0.5, meaning that it is unable to distinguish a generated image from an image that is from the training dataset.
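For reference, this competition is usually written as the minimax objective of Goodfellow et al. [4]; the thesis does not restate it, so the formulation below is included only as the commonly cited form:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]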

Building upon the GAN is the Deep Convolutional Generative Adversarial Network (DCGAN). These networks use a set of constraints that make the generative part more stable during training, which addresses one of the potential shortcomings of Generative Adversarial Networks (GANs) [10]. A well trained DCGAN is able to generate good representations that are usable in generative modelling and supervised learning. However, there is still a possibility that the model collapses, resulting in a set of similar generated images, as stated by the authors.

Another extension of generic GANs is the Conditional Generative Adversarial Network (cGAN). The generator in a cGAN has two inputs: a sampling of random noise and a configurable parameter y, in most cases a label. The discriminator then receives the generated image along with the same parameter y. Gauthier [2] was able to produce results indicating that the configurable parameter deterministically influences the generated image. This shows that it is possible, with the correct model, to define certain properties of the output of a GAN. It is even possible to produce more detailed images by adding more conditionals, as demonstrated by Zhang et al. [13]. They have shown that it is possible to create relatively high resolution images (256 x 256) based on a descriptive text. This is achieved with a multistage process where first a sketch with the defining properties is created. This sketch is then scaled up and any defects in the image are fixed while more detail is added at the same time. This results in high quality images where some input decides the actual content.

Figure 2: The evaluation results of the two tools for face and landmark detection. (A) is the face_recognition library and (B) is the face.evoLVe library. In (A) more information is available regarding eye landmarks, meaning we can extract the eye more accurately.

3 EXPERIMENTS

This section chronologically details the experiments. First, we describe the source dataset: OmniArt. Then, we explain the process of finding faces in the paintings, such as what tools are used and why. Then we detail how we extract eyes with traditional computer vision methods to build a small dataset, which is then used for creating a Convolutional Neural Network (CNN) classifier for large scale extraction.

3.1 Dataset

This research uses the OmniArt dataset, created by Strezoski and Worring [11], as its source for artistic portrait paintings. They have sourced digital images of artistic paintings from many artwork collections and online art platforms. The earliest entries are from around 150 years BCE, whereas the latest entries are recent, as the collection is continuously updated. The art in the dataset is very diverse, with over 36,000 artists creating art in more than 500 different types. Each piece of art has metadata giving information about it, e.g. artist, period of time, materials used and the current location of the artwork. This research only needs a subset of the OmniArt dataset, namely artistic portrait paintings. This filtered subset consists of nearly 250,000 paintings that we analyze. Effectively, we only use the digital image; the additional available metadata is not yet used.

3.2 Facial Features Detection

The OmniArt dataset by itself has no metadata regarding the location of facial landmarks, so the first task at hand is finding them. Multiple approaches can be taken, for example traditional computer vision methods such as histograms of oriented gradients, made popular by Dalal and Triggs [1]. Another approach is a CNN that is trained on faces to output the coordinates of the facial landmarks in the image. Multiple open source libraries exist for this. We evaluate two of these: face_recognition and face.evoLVe. face_recognition is a Python wrapper around the facial aspects of the dlib toolkit, whereas face.evoLVe is a CNN based on the Multi-Task Cascaded Convolutional Neural Network (MTCNN) architecture. Both libraries are able to detect faces in images and return the coordinates of several landmarks. They, however, differ in how detailed the landmarks are. face_recognition is able to give six coordinates per eye, whereas face.evoLVe is only able to give the center point of an eye. This difference in information is visible in figure 2. As extraction of the eyes is one of the main tasks, it is necessary to have as much information available as possible for better results. With the six coordinates it is possible to capture a square patch that fully contains the eye, if the localization is accurate. When only a center point is available, it becomes impossible to accurately capture the whole eye as there is no indication of where the eye starts or stops. Thus, we use the face_recognition library for this task.
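As a minimal sketch of how such landmarks can be obtained with the face_recognition library (the exact invocation used in this work is not given, so the file name, padding, and cropping logic below are illustrative assumptions):

import face_recognition
from PIL import Image

# Load a painting; the "cnn" model can be offloaded to a GPU as discussed below.
image = face_recognition.load_image_file("painting.jpg")
locations = face_recognition.face_locations(image, model="cnn")

for landmarks in face_recognition.face_landmarks(image, locations):
    for eye_key in ("left_eye", "right_eye"):
        points = landmarks[eye_key]          # six (x, y) coordinates per eye
        xs, ys = zip(*points)
        # A square patch around the eye; the extra padding is an assumption.
        side = max(max(xs) - min(xs), max(ys) - min(ys)) + 10
        cx, cy = (min(xs) + max(xs)) // 2, (min(ys) + max(ys)) // 2
        patch = Image.fromarray(image).crop(
            (cx - side // 2, cy - side // 2, cx + side // 2, cy + side // 2))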

An issue that can arise is that the predicted coordinates are not always correct. It is possible that the detection has issues predicting where the landmarks are, for example due to some tilt or low contrast, see figure 3. This is likely due to the dataset dlib is trained with, which consists of frontal faces. The library correctly detects a face but is unable to correctly locate the landmarks. This issue signifies the need for a tool that can accept or reject a proposal of an eye. This is covered in sections 3.3 and 3.4.

The OmniArt dataset has nearly 250,000 paintings that are relevant for this research. We perform the face detection and landmark localization over all these paintings. This is a computationally expensive task, especially because the images of the paintings are of high resolution. The performance can be increased by downsampling images. This should be done carefully so as not to downsample too much: if the images become too small, it is possible that some faces are no longer detected. Furthermore, an even larger improvement can be achieved by offloading the detection task to a GPU, which is supported by face_recognition. This increases throughput from two to eight images per second. Detected faces and eyes are stored in separate objects so they can be referenced later during runtime. Each of the objects also has the OmniArt id of the painting in it, so future analysis can be done with other metadata of the painting. The eye objects also have a property indicating whether it is the left or right eye for the viewer of the painting. The face_recognition library does not explicitly mention left or right; however, it does always encounter and store the left eye first. Whether an eye is the left or right eye can be useful later on, for example to condition a generative network to generate a left or right eye.

Figure 3: An example of a correct and an incorrect prediction of facial landmarks. The second face is slightly rotated sideways and as a result dlib is unable to correctly predict where the landmarks are.

3.3 Deep Eye Extraction

We perform eye extraction twice. In the first iteration we extract and classify eyes with traditional computer vision methods. This provides us with a small but accurate dataset of eyes and eye color labels. This small dataset is necessary to train a CNN classifier in the next step. A CNN is useful because it can detect and classify eyes more accurately: it is able to find more patterns that identify and classify eyes than the hard-coded rules used in the traditional computer vision approach.

3.3.1 Dataset requirements. For the purpose of training our own model for facial feature detection in artistic paintings, we use existing detection approaches to provide us with a small, low noise dataset on which manual intervention to correct the resulting elements is possible. We propose the following structure for the resulting dataset:

• Eye image

• OmniArt painting id

• Eye color

• Left or right eye

We emphasize the importance of the eye color and orientation for this research. The information is stored in the file name with the following format: id_color_L|R_face.jpg, where face denotes the index of the face in a given painting. This index is necessary because, without it, a painting with more than two eyes could overwrite earlier results if the eyes have the same color. The images of eyes are square patches and come directly from the original OmniArt image. The images must not be resized; they must be stored at the size they are found at. This makes it possible to filter out lower quality images and ensures that future research on these images has the original samples.
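As a small illustration of this naming scheme (the helper function and the example id below are hypothetical; only the id_color_L|R_face.jpg pattern itself comes from this section):

def eye_filename(painting_id, color, is_left, face_index):
    # e.g. eye_filename("12345", "blue", True, 0) -> "12345_blue_L_0.jpg"
    side = "L" if is_left else "R"
    return f"{painting_id}_{color}_{side}_{face_index}.jpg"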

3.3.2 Judging proposals. With the coordinates from the face and landmark detector it is possible to extract a patch from the painting that the detector believes to be an eye. These proposals are not always accurate, see figure 3. Sometimes the proposal only captures a small part of an eye, whereas other times it misses and proposes something that is nothing like an eye. The next step is therefore about accepting or rejecting proposals. It is necessary to find one or more features that the majority of the correct proposals share. We use one major feature, the presence of an iris. If an iris is present we can safely assume we are dealing with an eye. The geometric shape of a human iris is a near perfect circle, which means a circle detection algorithm should be able to detect it.

Figure 4: The process of finding the iris. We increase contrast, convert to single channel, apply a Gaussian blur, and threshold the image. The circle Hough Transform algorithm then tries to detect circles.

Such an algorithm exists in the circle Hough Transform. This algorithm takes a single channel image and is able to detect not only circles but also arcs that are part of a circle that might not be completely visible. This is important because we cannot assume that every iris is completely visible; for example, sometimes the eyelid occludes a part of the iris. The algorithm can produce false positives: it detects circles where there are none due to noise or other artifacts. We reduce the noise by applying a Gaussian blur with a kernel size of 3 by 3 to the single channel image. Furthermore, the image is thresholded to make the iris stand out more from the surrounding features of an eye. These two steps come after increasing the contrast of the image, which we do to make edges sharper. The results of this process, visualised in figure 4, are sufficient. The detection of the iris is not always fully accurate, but in most cases it is able to at least capture the majority of the iris in its circle, if the image contains an iris. If it finds no circles we reject the proposal, as it is unlikely to be an eye. If it detects one or more circles we accept the proposal and take the first circle as the iris. The return value of the algorithm is a center coordinate and a radius, which we use in the next step.
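A sketch of this check with OpenCV; the 3 by 3 blur kernel follows the text, but the contrast factor, threshold value, and HoughCircles parameters are illustrative assumptions, as the exact values are not stated:

import cv2
import numpy as np

def find_iris(eye_patch_bgr):
    # Increase contrast to sharpen edges (factor is an assumption).
    boosted = cv2.convertScaleAbs(eye_patch_bgr, alpha=1.5, beta=0)
    # Convert to a single channel and reduce noise with a 3x3 Gaussian blur.
    gray = cv2.cvtColor(boosted, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (3, 3), 0)
    # Threshold so the dark iris stands out from its surroundings.
    _, mask = cv2.threshold(blurred, 80, 255, cv2.THRESH_BINARY)
    # The circle Hough Transform also finds partially visible circles (arcs).
    circles = cv2.HoughCircles(mask, cv2.HOUGH_GRADIENT, dp=1, minDist=20,
                               param1=50, param2=15, minRadius=5,
                               maxRadius=mask.shape[0] // 2)
    if circles is None:
        return None                       # reject: unlikely to be an eye
    x, y, r = np.round(circles[0, 0]).astype(int)
    return (x, y), r                      # the first circle is taken as the iris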

3.3.3 Eye color classification. Classifying the eye by its color builds upon the previous steps. We have the coordinates that outline the eye as a shape from the landmark detection, and we have a center coordinate and radius representing the iris from the proposal judgment. Both are converted to polygons and we look for their intersection. The intersection gives us the area to analyze; an example is shown in figure 5. If there is no intersection we reject the proposal: the circle was detected outside the eye shape and cannot be used.

The pixels in the intersection determine the eye color. Each of these pixels has a weight: the further away a pixel is from the centroid of the intersection polygon, the more weight it has. This weight grows linearly, in increments of 1, towards the edge, where the weight is 10. The weighting is necessary to counter the near-black values of the pupil. We then create three clusters of these weighted pixels. This gives three dominant colors, of which we assume the most dominant is the color of the iris. This color, however, is not yet usable to attach a label to the eye. To find the color family it belongs to, it is necessary to calculate the distance to samples of each color family.

Figure 5: Visualization of the polygons of the eye shape and the iris. We assume the intersection of these polygons, the orange area, to be the iris and analyze its color.

We sample several colors for each color family from the X11¹ color list. For each sample we calculate the distance to the eye color in the HSV color space. We use the HSV color space because it allows for better color comparison than RGB [6]. We assume that the family of the sample to which the eye color has the shortest distance is the correct color family.
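A condensed sketch of this color step; the shapely and scikit-learn usage is an assumption about tooling, and only the edge weighting from 1 to 10, the three clusters, and the dominant-cluster choice come from the text:

import numpy as np
from shapely.geometry import Point, Polygon
from sklearn.cluster import KMeans

def dominant_iris_color(image_rgb, eye_points, iris_center, iris_radius):
    region = Polygon(eye_points).intersection(Point(iris_center).buffer(iris_radius))
    if region.is_empty:
        return None                                   # reject the proposal
    cx, cy = region.centroid.coords[0]
    minx, miny, maxx, maxy = (int(v) for v in region.bounds)
    max_dist = max(np.hypot(maxx - minx, maxy - miny) / 2, 1.0)

    pixels, weights = [], []
    for y in range(miny, maxy + 1):
        for x in range(minx, maxx + 1):
            if region.contains(Point(x, y)):
                # Weight grows linearly from 1 at the centroid to 10 at the edge,
                # countering the near-black pupil pixels.
                d = np.hypot(x - cx, y - cy)
                pixels.append(image_rgb[y, x])
                weights.append(1 + 9 * min(d / max_dist, 1.0))

    km = KMeans(n_clusters=3, n_init=10).fit(np.array(pixels), sample_weight=weights)
    counts = np.bincount(km.labels_, minlength=3)
    return km.cluster_centers_[np.argmax(counts)]     # most dominant cluster = iris color

The returned color would then be compared, in HSV space, against the sampled X11 family colors to pick the closest family, as described above.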

3.3.4 Manual correction. With the previous steps we create a dataset of eyes labeled by their color and orientation, left or right. The size of this dataset is just over 15,000 eyes, for which nearly 250,000 paintings are analyzed. However, the method is not completely accurate. The process can detect a circle other than the iris, which it will then analyze for color, resulting in a faulty label for the eye. This needs manual correction. The inspection of the dataset also brings forth several extra classes. These are:

• Amber

• Hazel

• Iris-less

• Grayscale

The amber and hazel classes could be classified under the brown, green or blue classes. However, samples show that they differ sufficiently and often show characteristics that are not present in blue or green eyes. Hazel eyes, for example, have stray specks of brown and red in the iris. This occurs often enough to warrant a new class. Samples often show an eye without an iris, which warrants the iris-less class. These are often in paintings of demons, as seen in figure 6. These eyes likely pass the iris detection because of arcs in their eyelids. We additionally add the grayscale class to be able to classify eyes that have no color because they are painted in grayscale. Instead of rejecting these eyes for not having color, they are captured in a separate class.

To account for these new findings we perform a manual correction. This results in a new dataset of around 3,000 eyes with correct labels according to human perception of color. Another class is also necessary for the next step: a class that consists of images that are not eyes, the “negative” class. We source these manually from mispredictions in the above dataset.

¹ https://en.wikipedia.org/wiki/Web_colors#X11_color_names

Figure 6: An example of iris-less eyes in OmniArt paintings.

Many of these are in paintings of demons.

Class       Training set   Validation set
Amber       304            71
Blue        347            82
Brown       286            77
Gray        251            69
Grayscale   287            69
Green       296            76
Hazel       213            61
Iris-less   149            35
Negative    215            51
Red         264            65

Table 1: The classes we use for training the classifier and the number of eyes for each class in the training set and the validation set.

3.4 OmniArt Eyes Dataset

The previous experiment builds a small but accurate dataset. With this we train a model to discriminate and categorize eyes by eye

color and "fakeness".

3.4.1 Dataset. The eye dataset is imbalanced in its distribution of classes: there are more blue eyes than hazel and amber eyes. This is normal, as the distribution of eye color in the general population is also not equal, and can differ per region or country. If a model is trained on this imbalanced dataset it can overfit on certain classes. To counter this we create a subset in which every class is represented roughly equally. The distribution of this subset can be seen in table 1. There is no extra test set because the hyperparameters of the models we evaluate are more or less static. The iris-less class has fewer samples because fewer are available from the previous step: it was not a class the non-CNN classifier looked for, they were by-catch. To artificially grow this subset we perform data augmentation, e.g. randomly flipping the image. We evaluate the classifier models with this dataset.
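A minimal sketch of what such an augmentation step can look like with torchvision (the resize to 224 pixels and the absence of normalization are assumptions; only the random flip is mentioned in the text):

from torchvision import datasets, transforms

train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),            # assumed input size for the pre-trained CNNs
    transforms.RandomHorizontalFlip(p=0.5),   # the random flip mentioned above
    transforms.ToTensor(),
])

# Assumes one sub-directory per class (amber, blue, ..., red).
train_set = datasets.ImageFolder("omniart_eyes/train", transform=train_transforms)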

3.4.2 Models. The dataset consists of ten classes. A classifier needs to be able to classify each of these accurately. However, an accuracy of 100% is not necessary, as classification is not the most important task. We use transfer learning due to time constraints and evaluate several pre-trained models for the CNN. These models have been trained on the 1000-class ImageNet dataset. The parameters of the models are not frozen because the dataset they were trained on is much more diverse in shapes than the eyes dataset. By not freezing the parameters the model can learn to put more emphasis on the color of the eyes instead of on the different shapes. Furthermore, we alter the classifier layers. Two fully connected layers are added: the first goes to 256 nodes and the second to 10 nodes, the number of classes. Both layers are preceded by a dropout of 60%. This dropout can help prevent the model from overfitting [5]. The models are:

• ResNet18

• VGG11 with batch normalization

• AlexNet

Model      Validation accuracy
ResNet18   84%
VGG11-bn   75%
AlexNet    60%

Table 2: The evaluated models for the eye classifier and their respective accuracy.

The results of the evaluation are in table 2. The ResNet18 model has the highest accuracy, 84% on the validation set, and is used for the classification task. As stated earlier, it is not necessary to achieve the highest accuracy possible; it just needs to be high enough to create a dataset that is mainly correct. For this 84% is acceptable and a great improvement over randomly guessing one of the ten classes.
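A sketch of the described head replacement on a pre-trained ResNet18 in PyTorch; the 256-node layer, the 10 output classes, the 60% dropout, and the unfrozen backbone follow the text, while the ReLU between the two layers and the optimizer settings are assumptions:

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)     # ImageNet weights, no layers frozen

# Replace the classifier head: dropout(0.6) -> 256 nodes -> dropout(0.6) -> 10 classes.
model.fc = nn.Sequential(
    nn.Dropout(p=0.6),
    nn.Linear(model.fc.in_features, 256),
    nn.ReLU(),
    nn.Dropout(p=0.6),
    nn.Linear(256, 10),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # assumed settings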

3.4.3 Eye color classification. In the non-CNN classification task multiple steps are necessary to determine the color of the eye. With the CNN classifier, however, the only steps are inputting an image and relating the output, an index value between 0 and 9, to a class. The eye image is stored in the same manner as described in section 3.3. This builds a dataset of all the eyes detected and classified: the OmniArt eyes dataset. The total number of eye detections is 214,618 in 247,251 paintings. Of these, 118,576 are classified as eyes; the remainder is classified as negative, i.e. proposals from the face and eye detector that are rejected.

4 PAINTING EYES

With the OmniArt eyes dataset we create a model that can generate painted eyes. Generating painted eyes can assist art conservators in the estimations they have to make in their work. Many old artistic paintings suffer damage or loss of quality and need to be restored at some point. The restoration is a painstaking process: art conservators have to estimate what the damaged regions of a painting originally looked like. This becomes difficult when no memory or photographic copy of the original is available. For example, in Eastern Europe icons in Christian churches often suffer from damaged eyes. It was believed that removing the eyes from depictions of saints in churches would bring good health to whoever did it. In most of these cases no indication of the original state of these damaged icons is available. This means that art conservators have to estimate what the eyes would have looked like.

For this estimation purpose, we evaluate our model’s ability to

generate out-of-context eye patches based purely on a conditional statement. In the future we would like to extend this to actual

in-painting of eyes in artistic paintings.

This generator should be able to generate eyes with a certain eye color. Multiple approaches to generative networks exist, for example training an autoencoder and extracting its decoder for image generation. However, GANs have shown a lot of promise: Radford et al. [10] demonstrate that a DCGAN is able to generate accurate images, while Gauthier [2] shows that conditioning the network is also possible. Thus we perform the task of generating eyes with a Conditional Deep Convolutional Generative Adversarial Network (cDCGAN). With this approach a generator network and a discriminator network compete against each other, each helping the other improve. The goal is to have a generator that can create painted eyes that are hard for humans to distinguish from real painted eyes.

4.1 Architecture

Two separate networks exist with a relatively similar architecture based on the findings of Radford et al. [10], including the hyperparameters. The generator uses deconvolutions to create an image from a vector, whereas the discriminator uses convolutions to create a classification vector from an image. An overview of the architectures can be found in figure 7. The generator network has two inputs. The first input is a noise vector, sampled from a Gaussian distribution. The second input is a one-hot encoded vector denoting the eye color it should generate. These two inputs are concatenated and sent through deconvolutions, which in turn generate a 3-channel image of 128 by 128 pixels. Successful training will make this image take on the shape of an eye and condition the color of the iris on the class it has as input. The discriminator network has a single input: the image. Through convolutions it generates two outputs: a prediction of whether the image is from the training dataset or a generated image, and a prediction of the color of the iris. These are, respectively, the first value and the remaining eight values of the fully connected layer.

4.2 Training

We trained our model for 200 epochs. In each epoch, batches of images and their labels are taken from the dataset created in the previous section. We only used eyes with a resolution greater than 25 by 25 pixels to increase the quality of the generator. Smaller eyes become blurry when upsampled, convey little meaning, and can only hinder training. During training, the discriminator is first trained on these real images. Its output gives two values, and a loss is calculated for both. For the real-or-fake prediction a binary cross-entropy loss is used, as the prediction is one of two values. For the color prediction a cross-entropy loss is used. These losses are not yet propagated back into the discriminator. First, samples are sourced from the generator with random noise and random class labels. These samples are then sent into the discriminator, which again outputs two predictions. However, now only the loss for the real-or-fake prediction is calculated. The color prediction is ignored because it is impossible to know for sure that the generator generates correct iris colors, and the discriminator could otherwise learn wrong labels for them. The calculated losses for the discriminator are added up and then backpropagated through its network.

Figure 7: The architectures of the generator and discriminator network. The generator has noise and a one-hot vector denoting the eye color as input and transforms it into an image through deconvolutions. The discriminator takes an image as input and uses convolutions to classify the realness and the eye color.

After training the discriminator on real and generated samples, the generator is trained. As the generator learns from the discriminator, it is necessary to forward the generator's output to the discriminator yet again. The losses of the discriminator are calculated again; however, this time the real-or-fake loss is calculated as if all samples were real. If the discriminator is able to classify them as fake, this produces a significant loss for the generator. If it was unable to classify them as fake, the generator is doing well and will not learn much in this iteration, which prevents it from outperforming the discriminator. After training the two networks, the generator should be able to generate images of eyes while the discriminator should become unable to see the difference between a fake and a real eye.
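A condensed sketch of one training iteration as described above, continuing the hypothetical Generator/Discriminator sketch from section 4.1; the Adam settings follow the common DCGAN defaults of Radford et al. [10], and the inclusion of the color loss in the generator update is an assumption:

import torch
import torch.nn as nn
import torch.nn.functional as F

bce, ce = nn.BCELoss(), nn.CrossEntropyLoss()
generator, discriminator = Generator(), Discriminator()
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def train_step(real_images, real_colors):
    batch = real_images.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator on real images: realness loss plus color loss.
    real_pred, real_logits = discriminator(real_images)
    d_loss = bce(real_pred, ones) + ce(real_logits, real_colors)

    # Discriminator on generated images: only the realness loss, so it cannot
    # learn wrong iris colors from imperfect generator output.
    noise = torch.randn(batch, NOISE_DIM)
    fake_colors = torch.randint(0, N_CLASSES, (batch,))
    one_hot = F.one_hot(fake_colors, N_CLASSES).float()
    fake_images = generator(noise, one_hot)
    fake_pred, _ = discriminator(fake_images.detach())
    d_loss = d_loss + bce(fake_pred, zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: forward the fakes again and label them as real; the color loss
    # here is assumed, as the text only specifies the flipped realness targets.
    fake_pred, fake_logits = discriminator(fake_images)
    g_loss = bce(fake_pred, ones) + ce(fake_logits, fake_colors)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()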

5 RESULTS

In this section we provide results with quantitative and qualitative measures for the CNN classification model, the OmniArt eyes dataset, and the conditional generator model.

5.1 Classifying Eyes by Color

We can evaluate the classifier by looking at its output. If the input is an image of an eye with a blue iris and the output is the label blue, we can consider the classification accurate. This is already done during the training phase by checking the accuracy of the classifier on the validation set, which was 84%, much higher than randomly guessing out of ten classes. This of course assumes that the validation set is labeled correctly. However, this validation set is created by a human, and color is a spectrum. This means that mislabels can exist due to a different perception of the color. Furthermore, colors can exist at a point of the spectrum where they look both blue and green. The classifier is able to deal with this by outputting the confidences of the classes. For example, it could output 50% blue, 40% green. However, this research uses only one label, and in this example blue would be the label for the color.

Figure 8: A normalized confusion matrix of the classifier on the entries in the validation set. This matrix shows that after training the classifier has a good accuracy for every class on the validation set.

The single value of 84% accuracy only gives an indication of the overall accuracy. It does not tell anything about the performance for each class and which classes overlap. A confusion matrix on the validation set can give this information; it is visible in figure 8. The confusion matrix shows that the classes blue, grayscale, iris-less, and red are classified most accurately. It also shows how color is a spectrum, as certain colors that are close to each other are sometimes predicted as each other. For example, an amber eye is most often wrongly classified as brown. Training on a larger dataset is likely a solution to this, as the subtle differences can then be learned more accurately. Besides this, a different architecture for the CNN might also improve accuracy.

A potential shortcoming of the classifier is its inability to account for the surrounding context. The classifier is only trained on the patch of the eye, meaning any additional information from the surroundings is lost. For example, a yellow light shining on a blue eye could result in the eye being painted with a green hue. This classifier will then classify the eye as green. This does not mean that the classifier is incorrect, because the pixel values are green; however, the additional information could make it more accurate. Accounting for this can be done by creating a larger patch. However, this requires a new dataset, and it is likely that this dataset would need to be labeled by humans, as they can already deal with this difference in perception.

5.2 Dataset evaluation

We created the OmniArt eyes dataset, which consists of 118,576 entries of eyes with their color label. The distribution of eyes for each class can be seen in figure 9. This shows that most eyes in paintings are in fact brown according to the classifier. Blue is second, but with less than half the number of brown eyes. This is in accordance with the hypothesis from the artistic community: brown is used most often. However, the question as to why cannot immediately be answered; this requires more research.

Figure 9: The distribution of colors of eyes in the OmniArt dataset. The negative class is not shown as it is not relevant; it only captures non-eye images.

Figure 10: For every class ten random samples are shown. The classes from top to bottom: amber, blue, brown, gray, grayscale, green, hazel, iris-less, negative, and red.

Evaluating a dataset is hard because it is impossible to check every single item to see whether it is an eye and whether the label is correct. Randomly sampling entries can provide an indication of the total set. Such a sampling is visible in figure 10. These samples show that every class has images that belong to at least that class; still, it does not rule out faulty or overlapping classifications. The iris-less class is shown to be not that accurate. It contains samples of images without irises; however, these are only iris-less in the sense that the iris is not visible in the image. This means that if only half of the eye is shown and it contains no iris, it could be classified as iris-less, even though the other half might contain one.

We create the average eye for all classes. This gives an indication of the overall quality of each class. The average eyes are visible in figure 11. For every class except the negative class, the shape of the eye is clearly present. The color of the iris is also easily distinguishable, even the absence of an iris in the iris-less class. It is also interesting to note that the average color of the skin differs per class. For example, the skin around brown eyes seems to be of a warmer tone than around gray and blue eyes. This itself raises new questions, such as “do people with brown eyes in paintings on average have warmer skin tones?”. For the red class an observation is that the skin tone has a red hue, which is not human-like. Finally, these average eyes indicate that the dataset is of sufficient quality for other tasks.

Figure 11: The average eye for each class in the OmniArt eyes dataset. On average each class is well represented as the shape and color of the eye are clearly visible.

5.3 Generator

The task of the generator is to generate painted eyes from nothing. For this it has to be trained, in this case in an adversarial way. The results of the training, the generated eyes, are to be evaluated: they should be hard for humans to distinguish from real painted eyes. Instead of the usual evaluation methods such as the inception score, a qualitative evaluation is done with humans. Furthermore, the conditioning ability should be evaluated as well: does the generator really generate red eyes when that is the input? The results of both evaluations also implicitly evaluate the other two experiments, the classifier and the dataset. If the dataset is of good quality and labeled correctly by the classifier, the generator should be able to create painted eyes of certain eye colors that look like real painted ones.

5.3.1 Human evaluation. For the human evaluation of the results of the cDCGAN we created a simple web application. On entering the application the user is presented with a small description of the task at hand, and a sample of generated painted eyes. When the task is started the user is twice presented with a set of 16 eyes at the resolution they are generated at: 128 by 128. The user then has to select the eyes they think are fake. This method is used because the generator should be able to fool humans, just like it should be able to fool the discriminator during training. After the user is finished, the noise vectors and the user's decisions on fake eyes are stored so they can be analyzed. These noise vectors are used to reproduce the eyes the users have seen. None of the users in this evaluation has any experience in the domain of machine learning or, more specifically, generative networks. This means they do not know what to look for in terms of graphical glitches. Furthermore, no selection is done based upon the art knowledge of the users.

Statistic Value

Participants 26

Fake eyes shown 416

Real eyes shown 416

True positive 168

False positive n/a

True negative n/a

False negative 248

Success rate 59.6%

Table 3: The statistics of the human evaluation of the generator model. The ability to select fake eyes is tested. The false negative statistic shows that humans are often fooled.

Results. In this evaluation a total of 416 generated painted eyes and 416 real painted eyes were shown to 26 users. Of the 416 fake painted eyes only 168 were marked as fake. This means that 248 fake eyes went unnoticed. Assuming that the users participated honestly, this means that in 59.6% of the cases the generator is able to fool humans. An overview of the statistics can be seen in table 3. The eyes the users have seen are generated again from their noise vectors and are visible in figure 13. Differences can be found between these two collections of eyes. A substantial number of eyes in the fake-eyes image have glitches or noise, making them easy to recognize as fake. However, light noise and artifacts also exist in eyes in the real images. A probable reason that these have not been marked as fake is their subtlety, which makes them look like a texture; users could have interpreted it as the texture of the canvas of the painting. From this it is possible to conclude that the generator is able to paint eyes that look like eyes from real paintings. Still, the generator sometimes produces output of low quality. Furthermore, it should be noted that in this evaluation only patches of the eye are shown to users. It is likely that when more information is available, such as larger areas around the eye, users are better at recognizing fake eyes.

5.3.2 Conditioning ability. The generator is able to generate painted eyes that in many cases go unrecognized as fake. However, the generator also takes an input that is supposed to alter the eye color. If the generator is trained well, it should be able to generate eyes in which everything besides the eye color is based on the noise vector given as input. To see if this is the case, a simple test can be performed in which one noise vector is sampled and then an eye is generated with that noise vector for each class. If everything but the eye color is the same in each generated image, then it is possible to state that the generator has learned to separate content from eye color. The result of this test can be seen in figure 12. From this result we can conclude that the generator has learned what the difference between the two inputs means. However, besides the eye color it has also learned that the skin color is slightly different for each class. This is similar to the average eyes detailed earlier in section 5.2 and a logical effect of the dataset. Furthermore, it is visible that the hazel class is the only class in this image that has eyelashes. This probably means that most hazel eyes in the dataset have eyelashes, as the generator seems to attach this feature to this class.
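A sketch of this test, reusing the hypothetical names from the earlier sketches; one fixed noise vector is combined with every class condition:

import torch

generator.eval()
with torch.no_grad():
    noise = torch.randn(1, NOISE_DIM)             # a single fixed noise vector
    for class_index in range(N_CLASSES):
        one_hot = torch.zeros(1, N_CLASSES)
        one_hot[0, class_index] = 1.0
        eye = generator(noise, one_hot)           # only the iris color should change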

Figure 12: For each class an eye is generated with the same noise vector. The only difference in each eye is the eye color,

meaning that the generator has learned to separate the eye color from other content.

6 DISCUSSION

In this section we summarize the findings and limitations of this thesis. We then propose future work that can improve and build upon the results. Finally, we draw a conclusion on the work performed in this thesis.

6.1 Findings

This thesis is about combining techniques, both traditional and machine learning, to find, analyze, and create eyes in artistic paintings. We detail experiments and their results. We use existing tools to locate human faces and facial landmarks in artistic paintings. Then we extract and classify eyes to create a dataset of painted eyes. We do this twice: first with traditional computer vision methods to create a small dataset that can be used to train a CNN classifier for more accurate detection. With an accuracy of 84% this classifier is a much better choice than randomly guessing. The generator builds upon these two experiments. Even though the field of generative adversarial networks is relatively new and developing rapidly, the results are promising. The generator is able to create painted eyes that humans often cannot detect as fake. Furthermore, the generator is able to change the color of the iris with one value. This shows that machine learning, and especially generative networks, have a place in art. Generator networks could give inspiration to artists, help in creating sketches of ideas, or even aid in the restoration process of damaged paintings.

6.2 Limitations

Although the results are satisfying, some limitations exist. First, the classifier is very optimistic when it detects an iris-less eye: the eye can be without a visible iris, but this is often because the eye is not completely in the image patch, so the classifier fails to reject the proposal. Second, the accuracy of the classifier is not perfect. 84% accuracy is very promising; however, nearly one in five predictions is probably wrong. This has an immediate effect on the dataset, as it is created by this classifier, and faulty predictions have indeed been observed in the dataset. Finally, the generator sometimes generates images with artifacts or glitches. Future work could potentially create a generator that yields better results.

6.3 Future work

Most future work exists in the domain of generation. Several approaches are possible that are likely to improve the quality of the generator. First, improvements could possibly be made by unrolling the cDCGAN during training. By unrolling, the generator might avoid getting stuck in a local optimum and create more diverse images [8]. An attempt can also be made to broaden its generation capacity, for example by training it on multiple labels that are available in the OmniArt metadata, and by accounting for the contextual information that is available around eyes. This creates the possibility to in-paint eyes, which can be helpful for estimating what eyes would look like in the domain of art restoration on any painting. Furthermore, human inspection of the dataset could increase the accuracy of the labels. This is, however, a labor intensive task.

6.4 Conclusion

In this research we present three contributions: a painted eye classifier, a new dataset of painted eyes, and a painted eye generator with a human evaluation system.

The first contribution is the painted eye classifier. It is a CNN that is able to classify a painted eye based on the color of the iris. It supports ten classes for classification, including a class that serves as an aggregate of non-eye images. It achieves an accuracy of 84% on the validation set during training. Its accuracy is further confirmed through the other results.

The second contribution is the OmniArt eyes dataset: a dataset of over 118,000 painted eyes. It is built by using the classifier on a portrait painting subset of OmniArt. Each class supported by the classifier is clearly present in the dataset. Every eye has, besides its color, extra metadata: the OmniArt id, the orientation, and the index of its face in the original painting.

The final contribution is the conditional generator model. This

model is able to conditionally generate painted eyes. It has learned the distribution of the eyes in the OmniArt eyes dataset and can generate the structure of an eye from a noise vector. A conditional

input is then able to influence the final color of the iris of the eye. We assess the quality of the generator with a human evaluation

system. In this evaluation every human sees a total of 32 eyes of which half are generated with our model and the other half is taken

from the OmniArt eyes dataset. The results show that we are able to fool humans, as nearly 60% of the fake eyes are not detected. The

quality of the generator can possibly be increased with the suggestions we propose.

We provide free, open access to the classifier, the dataset, and the generator. They can be installed through pip as a git package, e.g. pip install git+https://github.com/rogierknoester/omniart_eye_generator.git. Future research and other projects can also use these to further increase the presence of machine learning in art.

Figure 13: The eyes the generator created for the human evaluation. The top row shows the generated painted eyes that were marked by humans as fake; they correctly recognized them. The bottom row shows the generated painted eyes that humans thought were real; with these painted eyes the generator was able to fool humans.

REFERENCES

[1] Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. In International Conference on Computer Vision & Pattern Recognition (CVPR'05), Vol. 1. IEEE Computer Society, 886–893.

[2] Jon Gauthier. 2014. Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester 2014, 5 (2014), 2.

[3] Adam Geitgey. 2016. Machine Learning is Fun! Part 4: Modern Face Recognition with Deep Learning. (July 2016). https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78 [Online; posted 26-July-2016].

[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680.

[5] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012).

[6] Simardeep Kaur and Dr Vijay Kumar Banga. 2013. Content based image retrieval: Survey and comparison between RGB and HSV model. International Journal of Engineering Trends and Technology 4, 4 (2013), 575–579.

[7] Adrian Lecoutre, Benjamin Negrevergne, and Florian Yger. 2017. Recognizing Art Style Automatically in Painting with Deep Learning. In Proceedings of the Ninth Asian Conference on Machine Learning (Proceedings of Machine Learning Research), Min-Ling Zhang and Yung-Kyun Noh (Eds.), Vol. 77. PMLR, 327–342. http://proceedings.mlr.press/v77/lecoutre17a.html

[8] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. 2016. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163 (2016).

[9] Olfa Mzoughi, André Bigand, and Christophe Renaud. 2018. Face Detection in Painting Using Deep Convolutional Neural Networks: 19th International Conference, ACIVS 2018, Poitiers, France, September 24–27, 2018, Proceedings. 333–341. https://doi.org/10.1007/978-3-030-01449-0_28

[10] Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).

[11] Gjorgji Strezoski and Marcel Worring. 2017. OmniArt: Multi-task Deep Learning for Artistic Data Analysis. arXiv preprint arXiv:1708.00684 (2017).

[12] Harry Wechsler and Andeep S. Toor. 2018. Modern art challenges face detection. Pattern Recognition Letters (2018). https://doi.org/10.1016/j.patrec.2018.02.014

[13] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. 2017. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 5907–5915.

[14] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. 2014. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision. Springer, 94–108.
