
Explain Yourself

Explaining the Results of Query-By-Example Retrieval through Labeled

Classification Scores

Koen Cuijpers

10803866

Supervisor

Thomas Mensink

Second examiner

Stevan Rudinac

Thesis Master Information Studies

Human Centered Multimedia

University of Amsterdam

Faculty of Science

Final Version: 30th June 2015


Explain Yourself:

Explaining the Results of Query-By-Example Retrieval through Labeled Classification

Scores

Koen Cuijpers

Information Studies, University of Amsterdam, Amsterdam, the Netherlands

koen.cuijpers@student.uva.nl

ABSTRACT

The purpose of our research is to find out how well current state-of-the-art computer vision techniques succeed in retrieving artwork that is perceived to be similar by humans, and particularly to investigate whether such a system could explain its own results to the user. In this paper we describe the implementation and evaluation of a retrieval system for different types of artwork and cultural heritage. The retrieval system is built on the digital collection of the Rijksmuseum, containing 112,039 annotated images of a variety of objects such as paintings, drawings, sculptures, clothing and other artifacts. A quantitative and a qualitative user study were conducted to investigate the above.

Categories and Subject Descriptors

Computer Vision, Machine Learning

General Terms

Algorithms, Experimentation, Human Factors

Keywords

Query-by-example, Image Retrieval, Image Classification, Rijksmuseum, User Study

1. INTRODUCTION

In recent years, an increasing number of organizations and museums have digitized their cultural heritage collections. A growing number of art and cultural heritage items are now becoming available in high-quality digital form.

However, as these collections can contain large numbers of items, we face a new challenge: effectively selecting the items out of these sets that match the interest of the user. Specifically describing that interest can be difficult, as it requires that the user knows what he or she is looking for and knows the right artistic concepts to describe it. As pointed out in [1], laypersons tend to use far more general concepts (man, battle, lion), whereas experts would be able to be more specific about the same painting (Hercules, Hydra, Eurystheus' twelve labours).

Query-by-example retrieval systems can provide a way around this layperson problem by accepting an example image as a

1 https://www.rijksmuseum.nl/en/api

description of what the user is looking for. The system can then display similar artwork to the user, which makes the cultural heritage collection accessible to a larger public. For such a system to succeed, it is important that the system is capable of retrieving similar artwork and of providing an explanation of what this similarity is. It is not yet evident to what extent current state-of-the-art retrieval systems succeed in doing so. In this research the ability of a retrieval system to return artwork that is humanly perceived as similar is tested. Additionally we research means by which the system can automatically explain its results to the user, by providing a humanly interpretable insight into the criteria for similarity used by the system.

For a realistic approach to processing the digital collection of real-world museums, it is important that the methods described in research apply across different types of art. This is an important reason why the retrieval system described in this paper is implemented using the digital collection of the Rijksmuseum1. This dataset contains 112,039 digital images of artwork and cultural heritage items of different types (e.g. painting, sculpture, photograph) that are exhibited in the Rijksmuseum (see figure 1). The images are digitally annotated using XML-structured metadata, which makes the dataset very suitable for the evaluation of image classification and retrieval methods. Using the Rijksmuseum dataset we can test the performance of our system with annotated images of real cultural heritage and art, while focusing on different types of objects instead of exclusively on one type of object (e.g. paintings).

2. RELATED WORK

Image classification has recently become popular due to image classification challenges such as the Pascal Visual Object Classification Challenge [2] and the ImageNet Large Scale Visual Recognition Challenge [3]. A typical approach to representing an image in such challenges is the Bag-of-Words (BoW) model [4], in which a codebook is formed based on locally extracted features. Images are then represented by a "bag" of codewords (hence


Bag-of-Words) from the codebook. More recent approaches are Fisher Vector (FV) features [5] and Convolutional Neural Network (ConvNet) features [6], which represent images by a multidimensional vector and take into account higher-order statistics.
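As an illustration of the BoW idea only (not the exact pipelines used in the cited challenges), the following Python sketch builds a codebook from pre-extracted local descriptors and turns an image into a normalized histogram of codewords; the use of scikit-learn's KMeans and the codebook size of 256 are assumptions made for the example.

# Minimal sketch of the Bag-of-Words representation described above.
# Assumes local descriptors (e.g. SIFT-like vectors) have already been
# extracted per image; KMeans and n_words=256 are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors: np.ndarray, n_words: int = 256) -> KMeans:
    """Cluster locally extracted features into a codebook of visual words."""
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(all_descriptors)

def bow_histogram(image_descriptors: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Represent one image as a normalized histogram ('bag') of codewords."""
    words = codebook.predict(image_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)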

A representation of an image using vectors as described above allows for easy computational comparison. Using e.g. cosine similarity, Manhattan distance or Euclidean distance, the difference between the vectors can be calculated, which provides a measure of how similar the images are in terms of the chosen representation method. In our research we have chosen to represent images using Fisher Vectors and to calculate similarity based on the Euclidean distance between the vectors. Similar approaches have already proven successful for art and cultural heritage in other studies [7]–[10].
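The vector comparison step itself is straightforward; the sketch below scores two image representations (stand-ins for Fisher Vectors) with the distance measures mentioned above, using plain numpy.

# Sketch of the vector comparison step between two image representations.
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.abs(a - b).sum())

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))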

Even though the methods described above can be used to find similarities between images, it is very questionable whether they can be used directly to explain this similarity to humans. Options would be to show the calculated distance values or to show the ranking based on these distance values. However, these methods use numeric values that are very usable for computers but rather meaningless to humans. As always, we have to bridge the semantic gap by letting the computer talk in human language, which means resorting to higher-order similarity descriptors that humans can relate to.

One way of describing similarity between different artworks is to find similar objects depicted in these artworks [11]. Crowley & Zisserman succeed in finding the same objects appearing in different paintings by learning object-category classifiers from natural images (e.g. random everyday photographs). Another measure to describe similarities between different artworks is to match the visual complexity of a painting. The concept of visual complexity lacks a uniform definition; however, in [8] it is defined to be determined by:

1. Distribution of compositions;
2. Colors;
3. Content.

Guo et al. [8] define 3 levels of visual complexity based on the combination of these factors: Low Complexity (LC), Medium Complexity (MC) and High Complexity (HC) (see figure 2).

We expect that including both object detection and visual complexity in our classification and retrieval system would have a positive effect on the system's ability to find and explain similarity between items in the Rijksmuseum dataset. However, due to the limited time available for this research we were not able to include these similarity descriptors in the retrieval system.

The retrieval system implemented in this research searches for similarity based on the work done in [12]. In this research Mensink and van Gemert classify artwork from the Rijksmuseum dataset while focusing on four properties of the artwork:

• Artist – e.g., Rembrandt, Vermeer;
• Type – e.g., Painting, Sculpture, Drawing;
• Materials – e.g., White Marble, Canvas, Oil Paint;
• Creation Year – e.g., 1737, 1641-1676.

The work done by Mensink & van Gemert provides the base for our retrieval system, as it allows us to predict labels for the properties listed above. By showing these predicted labels to the user we hope to explain the similarity that was found by our retrieval system.

Figure 1: Example images of the Rijksmuseum dataset and their annotations [12]

Title      The Milkmaid        Seated Cupid               Reclining Lion       Dish with a landscape
Creator    Johannes Vermeer    Étienne-Maurice Falconet   Rembrandt van Rijn   Frederik van Frytom
Material   Canvas, Oil paint   White Marble, Copper       Paper, Paint         Faience
Type       Painting            Sculpture                  Drawing              Sauces
Year       1660                1757                       1658 – 1662          1670 – 1685


3. METHODS

3.1 Research Question

The goal of the research is to find out whether current state-of-the-art classification and retrieval methods are able to expose and explain similarities between different artworks. In the context of this research, by similarity we mean that two artworks have some similar properties that form a semantic connection between these works.

In order to reach the goal defined above, we try to answer the following research question.

To what extent can similarity of digitized artistic content as a result of query-by-example retrieval be exposed and explained by classification scores?

This research question is best visualized with the following example. An image from the Rijksmuseum dataset is used as an example image (query image) to query the database. Through retrieval we receive several result images (or matches) that were selected based on their similar classification scores. We then ask questions to find out to what extent humans agree with the similarity that was found through classification and retrieval. Simultaneously, we attempt to find out whether explaining the matches by showing labeled classification scores to the user can help the user understand why the system considered the result images to match the query image.

3.2 Sub-questions

In order to find an answer to the research question, a set of sub-questions is defined below. Through these sub-questions we aim to research the different angles of the research problem as thoroughly as possible within the scope of this research.

1. To what extent can computer vision be utilized to expose similarity between artwork?

In order to expose similarity between artwork, we have to choose on which level of abstraction we execute the search for similarity. Therefore we first research whether there is a significant difference in performance using low level image comparison (FV features) versus high level image comparison (classification scores).

2. Do humans agree with the similarity between the query image and its matches that was found by the system?

Using the chosen image representation we intend to find different artworks with mutual properties (similarity). Next, we test if this similarity is agreed upon and considered valuable in the human opinion.

3. Can labeled classification scores be used to explain the found similarity to the user?

Through this sub-question we intend to find out whether displaying labeled classification scores can help the user understand why the system considered the result images to match the query image.

4. To what extent do users deem labeled classification scores useful?

With this final sub-question we research how users perceive the usefulness of showing labeled classification scores.

3.3 Research Methods

We now chronologically describe the methods applied to answer the research question.

3.3.1 Classification & Retrieval

To implement the classification and retrieval system, the classifiers trained by Mensink & van Gemert [12] were used. They extracted Fisher Vector [5] features from the training set (approx. 78,000 images) to train one-vs-rest linear Support Vector Machine (SVM) classifiers for creator, materials, type of artwork and estimated year of creation. The performance of these classifiers was measured using the validation set (approx. 11,000 images) and is presented in table 1. Knowing the performance of the classifiers is important to put the conclusions that can be drawn from the user studies (see 3.3.2 & 3.3.3) into perspective.
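To make the classifier setup concrete, the following is a hypothetical sketch of one-vs-rest linear SVMs trained on Fisher Vector features; the array names, the regularization value and the use of scikit-learn are assumptions for illustration, not the actual training configuration of [12].

# Hypothetical sketch of one-vs-rest linear SVM classifiers on FV features.
# Array names and hyperparameters are placeholders, not the setup of [12].
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def train_property_classifier(train_fv: np.ndarray, train_labels: np.ndarray):
    """Train one-vs-rest linear SVMs for one annotation property (e.g. creator)."""
    clf = OneVsRestClassifier(LinearSVC(C=1.0))
    return clf.fit(train_fv, train_labels)

def label_scores(clf, fv: np.ndarray) -> np.ndarray:
    """Decision values per label; higher means a more confident prediction."""
    return clf.decision_function(fv.reshape(1, -1))[0]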

For the retrieval we exclusively used the test set of approximately 22,000 images.

Table 1: Performance of classifiers

Property     Performance   Measure
Creator      51.0          Mean Class Accuracy
Materials    67.1          Mean Average Precision
Art types    48.6          Mean Average Precision
Year         89.2          Average difference with ground truth year

Figure 3: Three obvious queries (query image = first image on each row)


The system randomly selects a query image from the test set and calculates the 7 Nearest Neighbors (7-NN) based on the Euclidean distance between the normalized FV features of the query image and all other test set images. It then finds and displays the 7 images with the smallest Euclidean distance to the query image. For these images the system retrieves the classification scores for the Creator, Materials and Art Types classifiers. It then includes a ranked top three of the classification scores with the corresponding labels (e.g. Rembrandt van Rijn, Paper, Drawing) and the estimated year of creation.
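A minimal sketch of this retrieval step is given below, under the assumption that the FV features are available as one matrix: L2-normalize the features, take the 7 nearest neighbors of a query by Euclidean distance, and attach a ranked top three of labeled classification scores per result. Variable names such as test_fv and the label lists are placeholders.

# Sketch of 7-NN retrieval on normalized FV features plus a top-3 of labels.
import numpy as np

def l2_normalize(X: np.ndarray) -> np.ndarray:
    return X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)

def seven_nearest_neighbors(test_fv: np.ndarray, query_idx: int, k: int = 7) -> np.ndarray:
    X = l2_normalize(test_fv)
    dists = np.linalg.norm(X - X[query_idx], axis=1)
    dists[query_idx] = np.inf               # exclude the query image itself
    return np.argsort(dists)[:k]            # indices of the k closest images

def top3_labels(scores: np.ndarray, labels: list) -> list:
    """Ranked top three of (label, classification score) pairs for one image."""
    order = np.argsort(scores)[::-1][:3]
    return [(labels[i], float(scores[i])) for i in order]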

If we had chosen to use the entire dataset (112,039 images) for the retrieval system, the system's results would have been slightly better. Figure 4 shows the 5 closest matches for 2 query images using the test set (T row) and the full dataset (F row). Using the test set generally returned slightly more images with a different appearance from the query image than using the full dataset. The reason for the better performance when using the full dataset is that there is a larger pool of images from which to select better matching artwork. However, the larger part of the dataset was used for training the classifiers. Had we used the entire dataset, the results would have been biased, as the performance of the predicted annotation would be significantly higher for the part of the dataset that was used for training.

3.3.2 Quantitative User Study

Once the classification and retrieval framework was set up it was ported to an online demo2 to serve as a platform for the quantitative user study. The purpose of this quantitative experiment was to find out whether users agreed with the similarity between the query image and its matches (sub-question 2), and to find out whether displaying labeled classification scores could help the user in understanding why the result images were considered to be matches (sub-questions 3 & 4).

Selecting queries

Initially the approach was to let the system randomly pick an image from the test set as a query image and to find its 7 nearest neighbors (NN). However, the system often returned a set of rather obvious matches (see figure 3). While a system returning such matches is expected to perform rather well in exposing similarity between artwork (sub-question 2), with

2 http://www.koencuijpers.com/thesis

such obvious matches it would be very hard to research sub-question 3 in particular. If the matches are obviously all very similar to the query image, there is no need for any explanation in the form of labeled classification scores. Therefore, the effect of showing classification scores could not be measured by showing such queries.

As an alternative to showing these obvious queries to our participants, we selected 15 sets of query images and their matches. These queries were selected because of the unobvious matches that showed up. The 15 sets were then displayed in the same order to all participants. For each set the participant was asked to rate (0-10 scale) how similar they found the set of matches compared to the query image (0 being very bad matches, 10 being very good matches). These ratings were analyzed to form an answer to sub-question 2.

Two groups

The participants were randomly divided into two groups. For group 1 (from now on referred to as "group Labels") the labeled classification scores for every image were shown alongside the images; for group 2 (from now on referred to as "group Base") these scores were not displayed (see appendix A). Comparing the participants' ratings between the two groups is our quantitative measure to answer sub-question 3. After having graded all 15 sets of matches, participants from group Labels were asked whether they (1) had used the labeled classification scores in comparing the results and, if so, (2) how useful they found the labeled classification scores in understanding why the result images were matches with the query image (5-point Likert scale). For group Base the participants were asked whether they had missed information about why the system considered the result images to be matches with the query image. The answers to these questions were used to form an answer to sub-question 4.

Apart from the presence or absence of labeled classification scores, the online demo was kept as identical as possible for both groups Labels and Base. One minor change in positioning had to be made for group Base, as removing the classification scores from the interface decentralized the position of


the query image. In appendix A the complete interfaces for both groups are shown.

Participants

For the quantitative research a total of 52 participants were gathered and randomly placed in either group Labels or group Base. At the end of the experiment, group Labels counted 22 participants and group Base counted 30 participants. 33 of the participants were male and 19 were female. The age of the participants ranged from 20 to 56 years old; however, the larger part of the participants was between 22 and 27 years old. Also, the majority of the participants was either enrolled in higher education (Higher Vocational Education or a University degree) or had already graduated from higher education. Due to the imbalance in education and age, the sample for this experiment is not entirely representative of the Dutch population.

3.3.3 Qualitative User Study

The goal of the qualitative user study was to get the opinion of experts in the field of art and cultural heritage on the system and the added value of the labeled classification scores. For this part we use the same sets of images as used in the quantitative user study. In this case, however, we conduct the study offline, in a face-to-face semi-structured interview. Instead of asking to grade the similarity on a 10-point scale, we asked the experts to describe whether they considered the paired images to be similar and why or why not. For each of the 15 sets of images, the participants were first shown the matches without labeled classification scores and asked whether they found the matches to be similar to the query image. After that the labeled classification scores were displayed and the participants were asked whether the labels helped them understand why the images are considered to be matches by the system.

When all sets had been shown, the participant was asked whether and why (not) he/she found the categories of the labeled classification scores (creator, materials, art type & estimated year of creation) to be useful. Finally the participants were asked if they would have found other categories of labeled classification scores (e.g. based on objects depicted, color etc.) more useful.

Participants

For the qualitative user study we were specifically looking for experts in the field of arts and cultural heritage. We consider experts to be persons who have a profession and/or (have) follow(ed) an acknowledged higher-degree study in these fields. While there was only limited time available and we sought people with a very specific profile, we managed to interview 3 experts in this research.

4. RESULTS

4.1 Data characteristics

Quantitative data

The research data gathered from the quantitative research was checked for outliers using the outlier labeling rule [13], [14]. Next, the data was checked for normality using a Shapiro-Wilk test [15], [16] and by visual inspection of the histograms, Q-Q plots and box plots (figure 5) of the data. The Shapiro-Wilk test showed no significant deviation from a normal distribution for both group Labels (p=0.132) and group Base (p=0.302). The histograms, Q-Q plots and box plots showed an approximately normal distribution for both group Labels and group Base, although the boxplot for group Labels was quite asymmetric.
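For reference, the checks above can be sketched as follows; the outlier labeling rule is implemented with quartiles plus or minus g times the interquartile range, where g = 2.2 is the commonly used value from [14] and assumed here, and the Shapiro-Wilk test is taken from scipy.

# Sketch of the outlier labeling rule and a Shapiro-Wilk normality check.
import numpy as np
from scipy import stats

def outlier_labels(scores: np.ndarray, g: float = 2.2) -> np.ndarray:
    """True where a value falls outside [Q1 - g*IQR, Q3 + g*IQR] (g=2.2 assumed)."""
    q1, q3 = np.percentile(scores, [25, 75])
    iqr = q3 - q1
    return (scores < q1 - g * iqr) | (scores > q3 + g * iqr)

def normality_check(scores: np.ndarray) -> float:
    """Shapiro-Wilk p-value; p > 0.05 means no significant deviation from normality."""
    _, p_value = stats.shapiro(scores)
    return p_value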

Figure 5: Boxplots for both groups


In terms of skewness and kurtosis the data for group Labels deviated from a normal distribution. The data of group Labels showed a skewness of -0.973 (p=0.491) and a kurtosis of 2.001 (p=0.953), and the data of group Base showed a skewness of -0.747 (p=0.427) and a kurtosis of 0.367 (p=0.833). Although the skewness and kurtosis make the data for group Labels deviate from a normal distribution, the independent samples t-test is still a valid test for our data, as the skewness and kurtosis are within an acceptable range given the robustness of that test [17]–[19].

Qualitative data

For the qualitative data the semi-structured interviews were recorded. The recordings were summarized by extracting relevant statements; these summaries can be found in appendix C. Statements were considered relevant if they held information that could contribute to answering sub-questions 2 or 4. A statement is relevant for sub-question 2 if it was any expression about the reason why the interviewee thought an artwork was a good or bad match. For sub-question 4 a statement is relevant if it concerned anything related to the use of the labeled classification scores.

4.2 Sub-question 1: Low vs. high

Comparing images using computer vision can be done at two different levels of abstraction, low level comparison and high level comparison. With low level comparison we mean comparing images based on the extracted features (in our case Fisher Vectors). Finding similar images based on their classification scores is what we consider to be high level comparison.

We compared the results of low and high level comparison for a total of 25 query images, of which two examples are

given in figure 6. The result images for both levels of abstraction were in all cases very similar. In some cases the query returned the same images, only shown in a different order (figure 6, right query).

We chose to use the low level FV features for our retrieval system. As both levels of abstraction performed very similarly, the choice of which one to use should not significantly affect our results. Where the low level FV features are used for retrieval, the high level features are used to let the system explain its results to the user. The success of this approach is analyzed in sections 4.4 & 4.5.
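One simple way to quantify how similarly the two levels of abstraction behave is the overlap between their top-k result lists; the sketch below is an illustration under the assumption that both neighbor lists come from a k-NN routine like the one sketched in section 3.3.1, not the comparison procedure actually used.

# Sketch: fraction of shared images between low-level and high-level result lists.
import numpy as np

def topk_overlap(neighbors_low: np.ndarray, neighbors_high: np.ndarray) -> float:
    """Fraction of images appearing in both ranked result lists."""
    shared = set(neighbors_low.tolist()) & set(neighbors_high.tolist())
    return len(shared) / len(neighbors_low)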

4.3 Sub-question 2: Exposing similarity

To answer the question whether humans agree with the similarity as pointed out by the system, we look at both the quantitative and the qualitative data. In both experiments the same 15 queries were used (see appendix B).

4.3.1 Quantitative validation

Because we want a general impression of the system's ability to find and display similar artwork, we do not make the distinction between group Labels and group Base in finding an answer to this sub-question. In figure 7 the distribution of the average scores for the entire sample of 52 participants is displayed. By the average score we mean the average of the scores given to each of the 15 queries by a participant. Figure 7 shows the distribution of these average scores for all participants.

Figure 7 shows a mean average score of 5.89 for our system's ability to find and display similar artwork. As described in section 3.3.2 we chose to select 15 queries that showed non-obvious results in order to increase the measurability of a possible effect in sub-question 3. Therefore, the similarity scores given by the participants are not entirely representative of the total performance of the system, which would arguably be higher than the scores presented here. However, the selected queries are still all produced by the system. A mean average score of 5.89 for these queries tells us that automatic image recognition has ample room for improvement.

Figure 7: Distribution of average scores for all participants


Figure 8: Distributions of the scores for the individual queries, horizontal line represents mean value (5.89)

In figure 8 the distributions of scores for the individual queries are given. This chart shows 3 queries (Q6, Q12 & Q14) whose mean scores are lower than the mean of all average scores given by the participants. These three queries had an important influence on the evaluation of the system's performance.

4.3.2 Qualitative validation

In table 2 the ratings given to the individual queries by our 3 experts (E1-E3) are given. Double plusses or double minuses were given when the expert was really decisive in rating the query as either good or bad. A query was rated as neutral when the expert could not make up his/her mind or named an equal number of negative and positive aspects for that query. In the Total row the sum of the plusses and minuses is given. In total our experts rated 5 queries as good queries, 7 as bad queries and 3 queries as neutral. The best 3 queries (green cells) by the experts' ratings were the same queries as the top 3 queries in the quantitative data (Q3, Q4 and Q11). The queries that were rated by the experts as returning the worst results were Q2, Q6, Q12 and Q14 (red cells), which is again very similar to the quantitative data (in which Q6, Q12 and Q14 were rated worst).

Important reasons for the experts to rate queries as being good were:

• Same type of objects/persons depicted;
• Same theme/setting (e.g. outside, religion, architecture, portrait etc.);
• Same styles/techniques (e.g. sketchy, detailed);
• Same materials used;
• Same time period.

The objects depicted and the theme of the item were most important to the experts. We conclude this from the fact that an absence of similarity in objects or theme always resulted in rating the query as being bad. An absence of similarity in styles, materials or time period was considered a negative aspect of the query, but was compensable by other positive aspects and never a "deal breaker". In fact, a difference in materials made one expert value certain results even more, because they were still in the same setting and the difference in materials added a degree of variety to the query (see appendix C, interview 3, Q2 & Q4).

Table 2: Expert ratings for the individual queries (-- = very bad, +/- = neutral, ++ = very good)

        Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9   Q10  Q11  Q12  Q13  Q14  Q15
E1      +    --   ++   ++   +/-  --   -    -    +/-  -    ++   +    --   +/-  --
E2      +    +/-  ++   +    ++   +/-  +/-  +/-  +/-  +    ++   --   +    -    +
E3      +    +/-  +    ++   +    --   +/-  +    +/-  -    ++   --   +/-  -    +
Total   +3   -2   +5   +5   +3   -4   -1   0    0    -1   +6   -3   -1   -2   0

Figure 9: Relative frequency distribution for both groups. Horizontal lines represent mean values


4.4 Sub-question 3: Explaining similarity

In order to find out whether showing labeled classification scores can help the user understand the matches that are returned by the system, we measured the effect of showing these scores versus not showing them.

In figure 9 the relative frequency distributions of the average scores given by participants in group Labels and group Base are displayed. The distributions show that the average scores given in both groups are rather similar. Also the means of the average scores given by the participants of each group (represented by the horizontal lines) are very close. To find out whether showing labeled classification scores had any measurable effect, the means were compared using an independent samples t-test. The independent samples t-test did not show a statistically significant difference in means for the average scores given by the users (p=0.626). Figure 10 shows the average scores given to the individual queries for both groups. Apart from queries 1 and 12 (appendix B), the average scores are more or less the same for both groups. Independent samples t-tests comparing the mean scores for the individual queries did not show significant differences between group Labels and Base either, not even for Q1 (p=0.057) and Q12 (p=0.112).

The t-tests tell us that showing the labeled classification scores as used in our experiment had no statistically significant effect on our participants' opinion on the similarity between the query images and their matches.
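The group comparison described above can be sketched as follows; the per-participant average score arrays are placeholders rather than the actual study data.

# Sketch of the independent samples t-test between group Labels and group Base.
import numpy as np
from scipy import stats

def compare_groups(avg_scores_labels: np.ndarray, avg_scores_base: np.ndarray) -> float:
    """Two-sided p-value of the independent samples t-test on the group means."""
    _, p_value = stats.ttest_ind(avg_scores_labels, avg_scores_base, equal_var=True)
    return p_value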

4.5 Sub-question 4: Usefulness

4.5.1 Quantitative validation

After our participants had seen and rated all queries, they were told to which group they were assigned and what that

meant. They were then asked a question explicitly asking for their opinion on the use of the labeled classification scores. Participants of both groups could not go back to the matches at this point; therefore these final questions had no influence on the scores given to the queries.

For group Labels we first asked the participants whether they had used the labeled classification scores in comparing the matches. Of the 22 participants in group Labels, exactly 50% stated that they had used them in comparing the matches. These 11 participants were then asked to what extent the labeled classification scores helped them in understanding why the matches were considered to be matches by the system. The distribution of the scores is displayed in figure 11. These scores show that a slight majority of the participants that had used the labeled classification scores considered them to be helpful.

The participants of group Base had seen no labeled classification scores during the experiment (see appendix A). After the final query the users were informed that they did not receive the labeled classification scores, as opposed to the other group. The participants were then asked whether they had missed this information in understanding the matches during the experiment. Over 60 percent (19 out of 30) responded that they indeed deem the labeled classification scores to be useful for explaining the retrieval results. This is not a significant difference according to the Pearson Chi-Square test (p=0.144), which could be due to the limited sample size of the experiment.
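One plausible reading of this test, which reproduces the reported p=0.144, is a Pearson chi-square goodness-of-fit of the 19-versus-11 answer counts against an even split; the thesis does not state the exact contingency used, so the sketch below is an assumption.

# Assumed reading of the Pearson chi-square check on group Base's yes/no answers.
from scipy import stats

observed = [19, 11]                          # did / did not miss the labels
chi2, p_value = stats.chisquare(observed)    # default expectation: even split
print(f"chi2={chi2:.3f}, p={p_value:.3f}")   # gives p of about 0.144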

4.5.2 Qualitative validation

Our expert interviews allowed us to get a more detailed insight into what effect showing the labeled classification scores had on understanding the matches returned by the system. The predicted values for the artist of the artwork contributed the least to our experts' understanding of the matches, as in most cases the experts did not know the predicted artist. In some cases the expert was certain that the predicted artist was not the right artist, or the same artist was predicted in a very dissimilar earlier query as well, which did not increase trust in the labeled classification scores.

Figure 10: Average scores for individual queries for both groups

Figure 11: Deemed usefulness of labeled classification scores


The predicted labels for materials, art types and year of creation were considered a more valuable reference point by the experts, as these were mentioned more frequently when viewing the labeled classification scores. However, the accuracy of these labels proved to be very important. Seeing the same predicted labels across the images of a query often resulted in an increased positive opinion about this query (e.g. interview 2, query 3 & interview 3, query 7). However, when the system showed labels that were clearly incorrect (e.g. object type "spoon" for query image 6, or "porcelain" for wooden objects) the query results were not taken very seriously. Showing the labeled classification scores definitely had an effect on the experts' opinion about the query quality, but whether it was the desired effect depended on how accurate the expert believed these scores to be.

The labeled classification scores for materials, art types and year of creation were considered valuable base information by all experts. However, including information about what was semantically depicted in or resembled by the artwork was considered to be a valuable addition by all three experts. Particularly learning different settings or themes (e.g. landscape, sea, portrait, historical event etc.) was considered to add a degree of semantics to the results that was missed in the lower rated queries. Additionally, our experts found that information about objects depicted and (although less unanimously) color and style would have been helpful in explaining the queries.

5. CONCLUSION & LIMITATIONS

In this research a query-by-example retrieval system was implemented in order to test whether we could expose and explain similarity between different types of artwork and cultural heritage objects using current state-of-the-art computer vision methods. A user study was conducted in which a selection of 15 non-obvious queries was shown to both expert and non-expert participants. From this user study we conclude that there is ample room for improvement in automatically retrieving artwork that is humanly perceived as similar. The user study did show the potential of using labeled classification scores to explain the results of query-by-example retrieval to the user. However, the computer vision methods used are currently not yet accurate enough to consistently provide trustworthy labeled classification scores, which can cause confusion for the user instead of elaborating on the reason for similarity. As the accuracy of the labeled classification scores improves over time, so will the added value of showing labeled classification scores to explain retrieval results to the user.

There are some limitations that have to be taken into account when reviewing these conclusions. First, the sample sizes of both our quantitative and qualitative research should be increased in order to provide more statistical support for our conclusions. Additionally, the selection process of the 15 queries may well have given a slight negative bias to the

performance evaluation of our system. An automatic query selection process that would include obvious as well as non-obvious queries (e.g. based on the difference between the ground truth of the query image and the results) might have resulted in a more representative performance evaluation of the system.

6. FUTURE WORK

The two performance areas of our system that should be improved in future work are (i) the similarity ranking of the artwork and (ii) the similarity explanation by the system. Performance in both these fields is closely related, and there are different aspects of our system that could be improved to increase this performance. One solution could lie in better training of the classifiers by using, for example, different training parameters, different classification methods or different datasets. Another solution might lie in using other image representation features such as ConvNet features. An additional study could point out whether such features would provide better performance in retrieving similar artwork on low level retrieval, and in predicting the high level classification scores for the different annotation categories. Also, the performance of the system could be improved by adding additional annotation categories, such as objects depicted and setting/theme. If we were to succeed in learning such annotation categories, our system could offer higher level explanation to the user in addition to the basic annotation elements offered in the current system. Finally, finding different ways to visualize classification scores could make it easier for users to compare the classification scores between results.

7. ACKNOWLEDGEMENTS

First off, I want to thank Thomas Mensink for his very active participation during my research. Receiving very constant and rapid feedback during the implementation of the system as well as during the research itself was of great value to me. Also, I would like to thank Thomas Mensink and Jan van Gemert for providing me with pre-structured data and trained classifiers, which provided an important base for the retrieval system. Finally, I want to thank Stevan Rudinac for acting as a second examiner of my thesis.

REFERENCES

[1] D. Isemann and K. Ahmad, "Query terms for art images: A comparison of specialist and layperson terminology," in Proc. 25th BCS Conf. Human-Computer Interaction (BCS-HCI '11), 2011, pp. 145–150.
[2] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal Visual Object Classes Challenge: A Retrospective," Int. J. Comput. Vis., vol. 111, no. 1, pp. 98–136, Jun. 2014.
[3] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," Int. J. Comput. Vis., Apr. 2015.
[4] L. Fei-Fei and P. Perona, "A Bayesian Hierarchical Model for Learning Natural Scene Categories," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005, vol. 2, pp. 524–531.
[5] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek, "Image Classification with the Fisher Vector: Theory and Practice," Int. J. Comput. Vis., 2013.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[7] C. Johnson, E. Hendriks, I. Berezhnoy, E. Brevdo, S. Hughes, I. Daubechies, J. Li, E. Postma, and J. Wang, "Image processing for artist identification," IEEE Signal Process. Mag., vol. 25, no. 4, pp. 37–48, Jul. 2008.
[8] X. Guo, T. Kurita, C. M. Asano, and A. Asano, "Visual complexity assessment of painting images," in 2013 IEEE International Conference on Image Processing, 2013, pp. 388–392.
[9] G. Carneiro, "Graph-based methods for the automatic annotation and retrieval of art prints," in Proceedings of the 1st ACM International Conference on Multimedia Retrieval (ICMR '11), 2011, pp. 1–8.
[10] E. Crowley and A. Zisserman, "Of Gods and Goats: Weakly Supervised Learning of Figurative Art," in Proceedings of the British Machine Vision Conference, 2013, pp. 39.1–39.11.
[11] E. Crowley and A. Zisserman, "In Search of Art," in Workshop on Computer Vision for Art Analysis, ECCV, 2014, pp. 1–16.
[12] T. Mensink and J. van Gemert, "The Rijksmuseum Challenge: Museum-Centered Visual Recognition," in Proceedings of the International Conference on Multimedia Retrieval, 2014, pp. 2–5.
[13] D. C. Hoaglin, B. Iglewicz, and J. W. Tukey, "Performance of Some Resistant Rules for Outlier Labeling," J. Am. Stat. Assoc., Mar. 2012.
[14] D. C. Hoaglin and B. Iglewicz, "Fine-Tuning Some Resistant Rules for Outlier Labeling," J. Am. Stat. Assoc., Mar. 2012.
[15] S. S. Shapiro and M. B. Wilk, "An analysis of variance test for normality (complete samples)," Biometrika, vol. 52, no. 3–4, pp. 591–611, Dec. 1965.
[16] N. Razali and Y. Wah, "Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests," J. Stat. Model. Anal., 2011.
[17] C. A. Boneau, "The effects of violations of assumptions underlying the t test," Psychol. Bull., vol. 57, no. 1, pp. 49–64, Jan. 1960.
[18] H. Posten, Robustness of Statistical Methods and Nonparametric Statistics. Dordrecht: Springer Netherlands, 1984.
[19] E. Schmider, M. Ziegler, E. Danay, L. Beyer, and M. Bühner, "Is It Really Robust?," Methodol. Eur. J. Res. Methods Behav. Soc. Sci., vol. 6, no. 4, pp. 147–151, Jan. 2010.

APPENDIX A: ONLINE DEMO INTERFACES

[Screenshots of the online demo interface for group Labels (with labeled classification scores) and group Base (without labeled classification scores).]

APPENDIX B: SELECTED QUERIES

Query 1-5

First image of each query is the query image

Query 1 average score: 5.67
Query 2 average score: 6.04
Query 3 average score: 8.04
Query 4 average score: 7.72
Query 5 average score: 6.46


Query 6-10

First image of each query is the query image

Query 6 average score: 4.79
Query 7 average score: 6.37
Query 8 average score: 6.50
Query 9 average score: 5.48
Query 10 average score: 6.33


Query 11-15

First image of each query is the query image

Query 11 average score: 6.85
Query 12 average score: 3.73
Query 13 average score: 5.53
Query 14 average score: 3.37
Query 15 average score: 5.94


APPENDIX C: INTERVIEWS

Questions Asked

For each query:

1a. Do you consider these matches to be similar or dissimilar to the query image?
1b. Why?

[show labeled classification scores]

2. Did the labels help you in understanding why the images are considered to be matches?

After all queries:

3. We have used predictions for the artist, materials used, type of artwork and estimated creation year to elaborate on the matches. Did you consider these labels to be useful?

4. Do you think that adding annotations for color and objects depicted would help in better understanding the matches?
5. Are there other types of annotation that you would have liked to see to understand the matches?

Interview Summaries

In the following sections the interviews are summarized by extracting relevant statements from each interview. The chronological order of the interviews is maintained; the statements were transcribed literally from the recordings (in Dutch) and are presented here in English translation.

Legend (highlighting used in the original document):

[Highlight 1] = relevant statement for sub-question 2 (exposing similarity)

[Highlight 2] = relevant statement for sub-question 4 (usefulness of labeled classification scores)

Interview 1

Gender: Female

Age: 26

Study: Fine Arts (graduated)

Employment: Artist/painter

Query 1

1. Yes, I do see similarities in it; it is a sort of, yes, sort of graceful composition. The people are also drawn a little more gracefully. And very realistic.

2. Only this one (1.7) is a bit more sketch-like, but it is from the same period.

Query 2

1. Object type, Hague porcelain... Spoon? It says spoon here. Haha, these are all spoons.
2. No, this is really very bad, because it is not Hague porcelain either.

Query 3

1. Yes, I think this is right. This is from the period in which women's bodies were not allowed to be seen naked, so they used men as models for women, which is why these women all look masculine.

2. All from around the same period as well. Yes.

Query 4

1. All watercolours. Of landscapes. With cows or people. Yes, I do find this, ehm... I think it all comes from the same period too. Yes, I think this is a good match.

3. Yes, that is a completely different technique. Yes, I don't know... You do have to draw a line somewhere as to when something is a match and when it is not. It might work: if you only match on theme, then you can have paintings and drawings together.

Query 6

1. Another spoon, haha. This (image 2) is a bottle, I do think that is a bottle. Yes.

Query 9

1. The only thing is that you have certain portraits of people (images 2, 5, 6), and the others are more iconic paintings (images 3, 4, 7).

2. So if I did not look at it closely I would say: yes, this is a good match, simply because it is all painted in more or less the same style. But because I also pay attention to the theme, it is a less good match.

Query 10

1. Well, I do not know very much about this kind of thing, but I have the feeling that this is not very good. And the labels do not show a clear connection here either.

Query 11

1. So you are saying these are matches because it is all architecture? - Yes, and the same colours, the same material.

Query 12

1. I think... This feels right.

2. Yes, although what I do find odd is that you take a dress and then get all kinds of little pots and things.

Query 14

1. The query object it gives is a sculpture and the rest are all paintings.

2. Yes, those (labels) confirm the fact that there are many paintings in it, but not so much why I find the paintings similar to each other; that is precisely because something is happening in them.

Closing

1. Yes, let me see, this is all basic information that is certainly important. But what I would also do myself is something with theme.
2. If I were looking for matches, I would rather have them be from the same period and with the same theme than have the same colours recur.

Interview 2

Gender: Male

Age: 28

Study: History of Art

Query 2

1. Yes, they are all scale models. It falls in the same ballpark, as it were. They are all models, but only this one is also a boat. The theme is right, but I would want it somewhat more refined.

Query 3

1. Here you see the divine in all of them: here God, an angel, Mary. Lots of clouds.
2. These estimates of 1613 and 1684 for the same work; people did not live that long back then.
3. Paper, plate, that is all correct.

Query 4

1. For me the type of depiction, for example a landscape, is important, and from the colour I then infer what period it is probably from. Because when I search for comparable works I look for works from the same period, and from the colour I can somewhat deduce which period it could be. I find match number 5 the most different because it is in a different style.

2. In terms of paper and plate it does well.

3. Drawing versus painting is a huge difference, though.

Query 6

1. It does fall within the same theme, all objects from around the house. But this one is a bit more random than the previous one.

Query 7

Query 8

1. Up to and including 4, something of a battle or explosion recurs everywhere, but after that it becomes completely random.

Query 9

1. In terms of painting style I find it reasonably comparable, but in terms of depiction some are really a portrait and others more the depiction of an event.

2. I would search either on the same style or on what is depicted.

3. The years do seem to be in the right period. Although I think this one is actually a later work than that one, while here it is indicated the other way around.

Query 10

1. I find all of this quite comparable; it is all a bit Chinese.
2. Kroeseman, this is clearly not a Kroeseman.

Query 11

1. These are all somewhat right; here the architecture is dominant, it is more about the building here.

2. Yes, I cannot do very much with all those names and such. If my ex-girlfriend had been sitting here, she would have looked very closely at the names; she knows much more about that.

Query 13

These all have something technical, something scientific about them. Only the second one is simply a weapon, which is completely different.

Closing

For me the year is very important. If I were searching for an artwork, style and year would be the most important search criteria for me. I would search on, for example, oil paint-portrait-period.

It is important that it is within the same theme, what is depicted, such as a portrait in oil paint-portrait-period. I find colour tricky, but I would include it.

Interview 3

Gender: Male

Age: 22

Study: General Cultural Sciences

Employment: Rijksmuseum/audio tours

Query 1

1. Number 7 is less detailed, more of a sketch. Here it is really true to life, with anatomical shapes and so on. The subject does not really correspond; that is what I find weaker about this match.

2. Object type, book illustration, oooh yes, of course, the ones with that text underneath are of course always from a book.

Query 2

1. Yes, why does this actually not return only boats? Because there are many more of them in the collection.
2. No, this is indeed not Hague porcelain; that must simply be wood.

3. Apart from that I had expected more boats, but it is all somewhat related. That perhaps creates a broader picture than only looking for copies.

Query 3

1. Yes, very good, all the same setting and, at a glance, the same style.
2. It only seems rather strange to me that there is a 70-year difference, since this is the same work.
5. I find it hard to believe that this was made by the same artists.

Query 4

1. Aah, you can clearly see that horizon coming back here. I do find the material used differs again, but in the end I think it is well linked in that you have a kind of same setting. I actually think it is good, if you have 7 matches anyway, that a few of them use a different material. Otherwise you have 7 of exactly the same.

2. I would estimate that it was made somewhat later. What does it say, watercolour, meh... Dollhouse goods... Well, paper?

Query 5

1. In every image you see a group of people and they are all open compositions. Only match 5 falls outside this.
2. Kuniyoshi? I find it hard to believe that he made it.

Query 6

1. I do not entirely understand why the metal things are not put forward as the first match; to the eye they already look much more like the query object. I would consider material and period particularly important for this kind of object; there are many more of these pewter objects that would fit better.

2. Porcelain, of course not; hmm, is this really that old?

Query 7

Query 8

1. The last 2 do differ in theme. But I think the rest are good, provided they are right in terms of period.

Yes, the datings are also roughly the same, so I do see the match. Yes, I cannot do very much with the artists for etchings and the like.

Query 10

1. The dating of this certainly does not seem right to me; this must be much older.

Query 11

1. Here you see the same perspective everywhere, a kind of depth effect. And you see the same arches recurring everywhere. It is nice that it recognizes that.

Query 12

1. Actually nothing about this is right: material, object type, period, colours, zero. There are in fact more garments that it could have shown.

Query 14

1. This is a completely different object, a sculpture, and the results are paintings.

Closing
