
Master’s Thesis

Classifying image galleries into a taxonomy using meta-data and Wikipedia


Acknowledgements


Contents

1. Introduction
1.1. Image retrieval and classification
1.2. Image gallery classification
1.2.1. Image gallery
1.2.2. Classifying into a taxonomy
1.2.3. Ontology-based classification
1.2.4. Algorithm overview
1.3. Evaluation
1.4. Research questions and scope
1.5. About the project
1.6. Thesis outline
2. Related work
2.1. Hierarchical image classification
2.2. Image classification
2.3. Ontology-based classification
2.4. Wikipedia
3. Taxonomy
3.1. Taxonomy categories
3.2. Taxonomy classification
4. Methodology
4.1. Extract gallery text
4.2. Clean up text
4.3. Extract entities
4.4. Score entities
4.4.1. Semantic relatedness
4.4.2. Tf-idf
4.4.3. Combined entity score
4.5. Find and score entity categories
4.6. Select gallery categories
5. Evaluation
5.1. Test set
5.1.1. Creating the test set
5.1.2. Test set properties
5.1.3. Development set
5.2. Performance measures
5.2.1. Precision, recall, and F-score
5.2.2. Macro-average and micro-average
5.3. Hierarchical performance measures
5.3.1. Top-level performance
5.3.2. hPerformance
5.3.3. Hierarchical statistics
5.4. Baseline
5.5. Thresholds
5.6. Evaluation of the classifier using different wikifiers
5.7. Summarized evaluation setup
6. Results
6.1. Overall performance
6.2. Per subtree performance
6.3. Hierarchical statistics
6.4. Performance of the classifier using different wikifiers
7. Discussion
8. Conclusions and future work
8.1. Conclusions
8.2. Future work
Bibliography
A. Example of an image gallery
B. Taxonomy
C. Snippet of the English Wikipedia category graph


List of Figures

3.1. Taxonomy
4.1. Gallery classification pipeline
4.2. Gallery from IMDB about Simon Cowell
4.3. Schematic overview of the entity scoring algorithm
4.4. Semantic graph of the sample gallery
A.1. Example of an image gallery
B.1. Taxonomy


List of Tables

4.1. Meta-data of the sample gallery (image captions are shortened)
4.2. An excerpt of the sample gallery’s terms with their annotations
4.3. Sample gallery entity scores
4.4. Categories of the top three entities from the sample gallery
4.5. Resulting category list of the sample gallery
5.1. Retrieval of test set galleries
5.2. Amount of multi-labels in test set
5.3. Amount of test galleries per first-level category and Contents
5.4. Result types
5.5. Contingency table
6.1. Overall performance
6.2. Per subtree performance
6.3. False positive hierarchical errors
6.4. False negative hierarchical errors


1. Introduction

The World Wide Web (WWW) is flooded with images. With developments in digital photography and the increasing popularity of online photo sharing websites like Flickr (http://www.flickr.com) and social network sites like Facebook, the number of images on the web is only growing. Organizing and managing this overwhelming amount of images has become an active area of research. Without organisation, it is impossible to quickly find a particular image in a collection of billions of images.

1.1. Image retrieval and classification

Image retrieval is the field of study concerned with the problem of searching and retrieving digital images from an image collection. In an image retrieval system, a user is usually able to enter a search query, and the system will search for relevant images and present them to the user. However, the system does not always ‘know’ exactly what to search for, because a search query might be ambiguous. For instance, a search term like Jaguar can refer to the animal, but also to the car manufacturer. It would be helpful if the user were able to specify a category for the query, like animals or cars. That way, the system presents more relevant results. To facilitate this, the images should be categorized.

Image classification is the task of (automatically) classifying images into semantic categories. Humans can instantly recognize the content of an image, but for a computer algorithm this is hard; in essence, a digital image is just binary code. Fortunately, clues about the image’s content can be extracted. These can be low-level and visual, like colour, texture, and shapes. However, state-of-the-art technology for recognizing objects in images is still immature, and its algorithms are complicated and computationally expensive.

Other clues about an image’s content can be extracted from the text surrounding the image. On a website, meta-data can be extracted such as the title of the webpage, image labels, image captions, and so forth.

1.2. Image gallery classification

In this thesis, we aim to perform a special case of image classification, namely image gallery classification. The classifier should be able to assign one or more categories from a predetermined taxonomy to an image gallery. In this process, we make extensive use of the meta-data of the gallery and an ontology. In the following sections, the previously mentioned concepts are explained in further detail. Then, an introduction to the developed algorithm is presented.

1.2.1. Image gallery

An image gallery is a (part of a) website that displays a collection of images. An example of an image gallery is attached in appendix A. Usually, the images of a gallery have something in common. For example, a gallery can contain pictures about a certain person, group, event or place, but a gallery can also contain a collection from a certain photographer or artist. Apart from images, a gallery often includes some textual information about the images as well. A typical gallery might have a title, description, image captions, or keywords stored in the meta tags of the HTML. Sometimes, the gallery URL is descriptive as well. In our attempt at automatic gallery classification, this textual meta-data is used as the input.

1.2.2. Classifying into a taxonomy

With the gallery’s textual meta-data as input, the goal is to classify the gallery into one or more predetermined categories. This is known as multi-label classification. The predetermined categories might be ‘just’ a set of categories, but it is also possible that the categories are arranged somehow. If the categories are arranged hierarchically, it is a hierarchical classification problem. In our approach, we use such a hierarchical arrangement of target categories, in this thesis referred to as a taxonomy. A taxonomy is useful because, when image galleries are classified into it, a user of a gallery retrieval system can choose a narrow or broad category for a search query.

1.2.3. Ontology-based classification

The traditional approach to solving a classification problem is statistical supervised classification. Here, supervised means that the classes are predetermined, unlike in unsupervised image classification, in which similar images are only clustered. In the statistical supervised approach, a classifier is trained on a set of images whose categories are known; with this knowledge (computed statistics), the classifier assigns one or more categories to images whose categories are unknown. The statistics are calculated from visual or textual properties (features) of the images.

Usually, the meta-data of an image gallery is short and about recent topics. Statistical classification always depends on some kind of training-set. To be able to accurately classify recent galleries, the training-set continuously needs to be updated. However, creating a representative amount of pre-classified image galleries is a labour-intensive task. Adding new classes and/or updating the set is therefore time-consuming and expensive.


Our approach is loosely based on an interesting ontology-based classification technique from Janik and Kochut (2008). This technique has similar performance compared to statistical classifiers. The idea is that first, a text is scanned for the presence of concepts and instances from the ontology. Then, the network of relationships from the ontology is used to find broader concepts of the concepts or instances in the text. Finally, broader concepts that are present in the pre-defined set of concepts become categories of the text. In this way, the ontology is the classifier.

The ontology we use in this research is an extraction of the category network of the free and open encyclopedia Wikipedia. This encyclopedia stays up-to-date with the help of the community, and the English version has large coverage. In addition, its articles are pre-classified into specific categories. These categories, in turn, are often subcategories of other categories. The whole structure of these linked categories forms the category graph. This graph is not a real ontology, because the types of relations between categories are not defined. A snippet of the Wikipedia category graph is given in appendix C.

Interestingly, by using a multi-lingual ontology like Wikipedia, ontology-based classification can (relatively easily) be applied to languages other than English as well.

1.2.4. Algorithm overview

In this thesis, a new algorithm is presented for the ontology-based classification of image galleries into a taxonomy. The algorithm focusses on the use of concepts and instances found in gallery text. Concepts and instances are commonly referred to as named entities. Throughout this thesis, the shorter term entities will be used.

The idea behind the presented classification algorithm is that the categories of the most important entities found in a gallery text, and/or the categories of frequently mentioned entities, will be categories of the whole gallery. In chapter 4, an extensive overview of the whole process is given, but a summary follows here.

The first step of the algorithm involves extracting text from incoming galleries. Secondly, this text is submitted to a cleanup process. Next, entities are extracted from the cleaned-up text. For this task, we use a wikifier, introduced by Mihalcea and Csomai (2007). This is a system that automatically links terms in a text to Wikipedia articles. It can be used as a named entity recognizer, because most Wikipedia articles can be treated as entities.

After the entity extraction process, the entities are scored by importance. For each entity, its corresponding Wikipedia article is used to find broader Wikipedia categories that are also present in the taxonomy. These categories receive a score that indicates the likelihood of being a category of the entity.

Then, all collected categories get a score based on the number of entities that are suggested to fall in each category, the importance of these entities, and the plausibility scores of being a category of these entities.


1.3. Evaluation

Evaluating hierarchical classification is not as straightforward as evaluating flat classification. When an image gallery is classified into a too broad or too narrow class from the taxonomy, it is not totally wrong. Traditional measures like recall and precision are pessimistic in these cases. Therefore, we also use special hierarchical evaluation methods.

1.4. Research questions and scope

In summary, the aim of this research is to develop a hierarchical multi-label classifier for image galleries, using the meta-data of the galleries and knowledge from Wikipedia. The research question is as follows:

How well does a multi-label and hierarchical classifier of image galleries perform with the use of gallery meta-data and Wikipedia?

Sub questions requiring answers are:

• How can galleries be classified into a custom taxonomy based on textual meta-data?
• Are Wikipedia and its category graph usable for ontology-based classification?
• What is the overall performance of such a system?
• What is the performance of classification within subtrees of the taxonomy?
• How much is the performance of the classifier affected by the choice of wikifier?

This research is limited to the English language because of time constraints: only the English Wikipedia is used, and only image galleries with English meta-data are considered.

1.5. About the project

This thesis is the result of research at the internet company Kalooga. Kalooga is specialized in the semi-automated, and partly fully automated, enrichment of web articles with relevant images. These images are collected from the internet. Instead of collecting regular images, the web crawler of Kalooga only collects image galleries. This decision was made because, in contrast to a regular image on the web, an image enclosed in a gallery is often not merely illustrative or decorative, but more substantial. This, on average, results in high image quality and, while being in the eye of the beholder, tends to provide high content quality as well. Apart from that, image galleries are commonly found on news websites, and consequently their images have a (newsworthy) topic in common. Furthermore, the given clustering by topic of images in a gallery simplifies finding related images, because not only a single related image is found, but a whole collection. At the moment of writing, the gallery database contains over 30 million galleries.

In general, the clients of Kalooga are publishers of websites that fit in a category like, for instance, Football, Entertainment, or Travel. When searching for galleries relevant to an article, a query is constructed from keywords from the article. Additionally, the publisher’s category is specified. With this query, the system searches for relevant galleries in the database, and galleries from the specified category get a higher relevance score. For this reason, Kalooga aims to accurately classify their image galleries.

1.6. Thesis outline


2. Related work

In this thesis, we build on studies in several fields of research. In the following sections, a separation is made between research related to hierarchical image classification, image classification, ontology-based classification, and Wikipedia.

2.1. Hierarchical image classification

The idea of creating a large-scale taxonomy of image galleries is not totally new. A project named ImageNet (Deng et al., 2009) aims to assign images to concepts of the semantic hierarchy of WordNet. ImageNet is largely created manually. The goal is to provide a visual data set to be used in training and benchmarking classification algorithms.

Binder et al. (2010) studied the problem of classifying images into a given pre-determined taxonomy as well. However, they used content-based classification.

2.2. Image classification

In our research, we try to build a taxonomy of image galleries using automatic image classification. Commonly, the term image classification is conflated with automatic image annotation, which is the task of automatically assigning words to an image that describe its content. The two terms are often interchangeable, because image annotation can be seen as a classification problem with a large number of classes.

In recent image classification research, the focus is mainly on content-based image classification. It dates back to the 1990s and is currently a popular field of research. Low-level visual features of images like colour, texture, and shapes are used, and techniques from the fields of computer vision, content-based image retrieval (CBIR), and machine learning are applied. A good review of CBIR can be found in Datta et al. (2008). Although content-based algorithms are very sophisticated, they are not mature enough for practical use (Zhu et al., 2011).


2.3. Ontology-based classification

An ontology can be beneficial in statistical classification. In text-based classification, a new word or a synonym for a concept that is not in the training-set, but is present in the meta-data of newly found images, causes problems for the classification: the classifier simply does not have statistics about such terms. To bridge this ‘semantic gap’, an ontology can be used (Wang et al., 2009). The possibility of using an ontology in web image retrieval for assigning concepts to images is discussed in Wang et al. (2008) and Wang et al. (2006).

2.4. Wikipedia

The online and open encyclopedia Wikipedia is a valuable source for computer scientists. A comprehensive overview of research related to Wikipedia is given by Medelyan et al. (2009). In our research, Wikipedia is used extensively: for example, for wikifying input texts, calculating the importance of entities, and finding entity categories using Wikipedia categories.

The idea of an automatic wikifier was introduced by Mihalcea and Csomai (2007), and further investigated and implemented by Milne and Witten (2008) and Hoffart et al. (2011). An open-source toolkit is available for incorporating this wikifier (Milne and Witten, 2009). Additionally, the toolkit is able to calculate the relatedness between two terms as described in Witten and Milne (2008).

The relatedness measure based on Wikipedia can be used for several purposes. For example, Medelyan et al. (2008) and Tsatsaronis et al. (2010) used it to rank concepts in a text by importance (topic indexing). The idea is that concepts from a text that are highly related to a high number of other concepts are important topics of the text.

Almost every article on Wikipedia is assigned to some categories. These categories are linked to parent categories and have subcategories. This builds up the Wikipedia category graph. A snippet of the English Wikipedia category graph is visualized in appendix C. Using the online catgraph application, more areas of the graph can be visualized.

The English category graph of Wikipedia is relatively large: it contains over 700,000 categories at the moment of writing. It is a dense graph, but the denseness is not distributed evenly. Some areas are very dense; for example, it takes 7 steps to travel from Terriers to Mammals, but only 2 steps to travel from Elephants to Mammals. This, however, does not mean that an elephant is more closely related to Mammals than terriers are. Consequently, accurate derivation of broader categories from a collection of detailed categories is challenging. Using the interlanguage links of Wikipedia, the English category graph can be linked to Wikipedias in other languages as well.

Zesch and Gurevych (2007) acknowledge that this Wikipedia category graph can be used for various NLP tasks. However, the graph has some shortcomings. The categories are not always linked with strict semantic relations like is-a or part-of; instead, the relations are often merely associative.

3. Taxonomy

The taxonomy used in this research is displayed in figure 3.1. For readability, the taxonomy is flipped in this figure, so the leftmost category is the root category. A top-down version is available in appendix B.

The taxonomy is a hierarchical structure of categories. Most categories in the taxonomy are relevant to publishers that are clients of Kalooga. For example, a publisher needs images about football, and therefore it is useful for this publisher to have a category filled with image galleries about football.

Some categories are not very relevant to publishers, but are still included in the taxonomy. These categories are there to cover a large part of the image galleries (e.g. Plants). Because the taxonomy is largely created from the point of view of publishers, some first-level categories contain a high number of subcategories (e.g. Sports), whereas other first-level categories have no subcategories at all (e.g. Plants). The relations between taxonomy categories are mostly is-a relations, but there are also part-whole relations like “Cities are part of Countries”.

Each category from the taxonomy is linked to a Wikipedia category from the Wikipedia category graph. For example, Football is linked to the Wikipedia category Association football. This mapping is made to be able to perform the entity classification process described in section 4.5. In this process, parent categories of Wikipedia articles are traversed to find categories that are linked to a category from the taxonomy. Most category names in the taxonomy are equal to category names of the Wikipedia graph. However, some category names are changed for ease of use. For instance, the category MotoGP from the taxonomy is called Grand Prix motorcycle racing in the Wikipedia category graph.

3.1. Taxonomy categories

The root category of the taxonomy is Contents. Image galleries that fit none of the descendants should be classified in this category. The direct children or first-level classes are Entertainment, People, Sports, Plants, Arts, Animals, Vehicles, and Places. A further explanation and motivation of some of the less obvious categories and their subcategories is given below.


• …are supposed to be classified under Contents. This decision was made because a gallery about, for example, a group of friends is usually not interesting to publishers.

• The Olympics category has a place in the taxonomy because, in times of the Olympics, articles about all aspects of the Olympics are popular. In the taxonomy, Olympics is a child of Sports and a parent of categories about sports present at the Olympics. For instance, Olympics is a parent of Hockey, but not of American football. An alternative way to incorporate the Olympics category in the taxonomy would be to put, for example, a category Olympic hockey beside (or under) Hockey. However, in practice, the number of galleries about Olympic hockey is small, and galleries about Olympic hockey or just hockey are often related to each other. When searching for galleries about Olympic hockey, it usually does not matter if galleries about ‘normal’ hockey are retrieved as well. The reverse is also true: when searching for galleries about hockey, it is acceptable (maybe even desirable) to retrieve galleries about Olympic hockey as well.

• The Arts category should include galleries of which the images are about an artistic object (e.g. a statue) or are art themselves (e.g. an artistic photograph). The Arts category is not linked to the Music category; galleries about ‘artistic’ music are allowed to be classified in this category, but galleries about popular music (albums or artists) are better suited to the Music categories.

• Vehicles can contain image galleries about all kinds of vehicles, for example cars, boats, planes, and motorcycles.

• The categories of Places should include travel-related galleries, for example famous tourist attractions and sceneries. Categories from Places should only include galleries of which the main emphasis is a place. This can be troublesome in the classification process, because each photograph is taken at a certain place.

• The subcategory Games belonging to the Entertainment category should not include galleries about sports. Furthermore, the subcategory Sex should include adult material. This material is not considered in this research, because content-based algorithms will likely perform better for such galleries.

3.2. Taxonomy classification


This decision was made because of the application of assigning gallery images to an article. If an article is about a football player, and the system searches for images in the Football players category, it is possible that unwanted images about stadiums would be assigned to the article as well. Therefore, only if each image of a gallery belongs to multiple categories should the system (in theory) assign all of these categories. Take for example a gallery about David Beckham: if each image is about David Beckham, then it belongs in Football players and Models. Summarized, the classification rules are as follows:

• The gallery should be classified as specifically as possible in the taxonomy.
• If no category can be found for a gallery, assign Contents.


4. Methodology

In this chapter, an algorithm for classifying galleries into the taxonomy is presented. The idea of the algorithm is based on the intuition that the categories of the most important entities found in a gallery text, and/or the categories of frequently mentioned entities, will be categories of the gallery. The complete classification pipeline is summarized in figure 4.1. In the following sections, the steps of the algorithm are motivated and explained in further detail.


Throughout this chapter, a sample gallery is used to illustrate the effect of each step. A screenshot of this gallery is presented in figure 4.2. It is a gallery from the internet movie database website IMDB (http://www.imdb.com/) about Simon Cowell, a television personality especially known for being a talent judge on British and American television shows like America’s Got Talent. The gallery is manually assigned to People and Television.

Figure 4.2: Gallery from IMDB about Simon Cowell

4.1. Extract gallery text

The first step in the algorithm involves extracting textual content from an image gallery. The HTML of a typical image gallery on the web includes several kinds of meta-data, for example a title, description, image captions, and keywords. The gallery URL can contain relevant words as well. All textual content that might be relevant is merged into one text. Intuitively, the title and gallery URL contain terms that are highly relevant to the gallery. Hence, the title and gallery URL words are included twice in the resulting text, because a higher frequency of these terms increases their importance in this algorithm. The textual meta-data of our sample gallery is displayed in table 4.1.

Meta-data type Text

Url http://www.imdb.com/name/nm1101562/mediaindex?refine=tt0759364

Filtered path name mediaindex

Title All Photos of Simon Cowell Add photos and your resume >> Simon Cowell > All Photos > America’s Got Talent

Description View the latest pictures, photos and images of Simon Cowell - Simon was born in Brighton, England. His parents’ names are Julie Cowell and Eric ...

Image captions 29 August 2007 Titles: America’s Got Talent Names: David Hasselhoff, Simon Cowell Still of David Hasselhoff and Simon Cowell in America’s Got Talent 29 August 2007 ...

Meta keywords Biography, Photos, Awards, News, Message Boards

Table 4.1.: Meta-data of the sample gallery (image captions are shortened)

4.2. Clean up text

In gallery text, certain words like gallery, picture, and image occur frequently. These words are usually irrelevant to the content of the gallery. Including these kinds of words in the input text would bias the classifier towards the Arts category. Therefore, a blacklist of such words is constructed, and words from this list are removed from the input text.

Further cleanup includes correcting irregular spacing and removing numbers and special characters like ^, ~, and ©. Numbers and special characters can be recognized as entities on their own and would introduce noise: a gallery is usually not about the number 2 or the copyright symbol. The resulting text is used as the input for the next step.
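To make this step concrete, here is a minimal sketch under stated assumptions: the blacklist shown is a small illustrative subset (the thesis names gallery, picture, and image, but does not publish the full list), and the exact character whitelist is a guess.

```python
import re

# Illustrative subset of the blacklist; the actual list is not published in the thesis.
BLACKLIST = {"gallery", "galleries", "picture", "pictures",
             "image", "images", "photo", "photos"}

def clean_text(text: str) -> str:
    # Remove numbers and special characters; keeping only letters,
    # apostrophes, and spaces is an assumption about the whitelist.
    text = re.sub(r"[^A-Za-z' ]+", " ", text)
    # Drop blacklisted words and correct irregular spacing.
    words = [w for w in text.split() if w.lower() not in BLACKLIST]
    return " ".join(words)

print(clean_text("All Photos of Simon Cowell ~ 2007"))  # -> "All of Simon Cowell"
```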


The cleaned-up text of the sample gallery reads:

All of Simon Cowell Add and your resume Simon Cowell All America’s Got Talent All of Simon Cowell Add and your resume Simon Cowell All America’s Got Talent August Titles: America’s Got Talent Names: David Hasselhoff Simon Cowell Still of David Hasselhoff and Simon Cowell in America’s Got Talent August Titles: America’s Got Talent Names: Regis Philbin Simon Cowell Regis Philbin and Simon Cowell in America’s Got Talent August Titles: America’s Got Talent tt Names: David Hasselhoff Jerry Springer Brandy Norwood Piers Morgan Sharon Osbourne Simon Cowell David Hasselhoff Jerry Springer Brandy Norwood Piers Morgan Sharon Osbourne and Simon Cowell in America’s Got Talent August Titles: America’s Got Talent Names: Jerry Springer Simon Cowell Jerry Springer and Simon Cowell in America’s Got Talent August Titles: America’s Got Talent tt Names: Jerry Springer Simon Cowell Still of Jerry Springer and Simon Cowell in America’s Got Talent September Titles: America’s Got Talent Names: Simon Cowell Bianca Ryan Simon Cowell and Bianca Ryan in America’s Got Talent the latest and of Simon Cowell - Simon was born in Brighton England. His parents’ names are Julie Cowell and Eric ... Biography Awards News Message Boards name mediaindex name mediaindex

4.3. Extract entities

For the process of recognizing entities in the input text, we use a wikifier. A wikifier recognizes entities in a document and links them to corresponding Wikipedia articles; this process is also known as annotating. The motivation for using a wikifier instead of other entity recognition algorithms is the possibility of exploiting the linked Wikipedia articles. In our method, we use the article for calculating relatedness with other articles and for calculating the commonness of the article in Wikipedia, and we use the categories of the article as the starting point for searching relevant categories in the Wikipedia category graph.

Kalooga has its own wikifier, used for annotating articles; this is the wikifier used in this research. An important task for the wikifier is disambiguation: for a given term, there may be multiple candidate articles. The Kalooga wikifier uses the Wikipedia link structure to create a graph of entities found in the text, in which the nodes are Wikipedia articles and the edges are links on Wikipedia between these articles. If a term has multiple candidates, the one with the highest number of outgoing and/or incoming links is probably the most relevant one for the term. In addition to this relevance, the popularity (taken from Wikipedia statistics) of the candidates is taken into account: the more popular a candidate, the more likely it is the most suitable Wikipedia article for the term. Furthermore, the wikifier uses a stop word filter, a part-of-speech tagger, term positions, and string similarity measures (e.g. Levenshtein distance) to find optimal entities.


Terms Wikipedia article

Simon Cowell Simon Cowell

Simon Cowell Simon Cowell

America’s Got Talent America’s Got Talent

Simon Cowell Simon Cowell

Titles Title

America’s Got Talent America’s Got Talent

Names Name

David Hasselhoff David Hasselhoff

Regis Philbin Regis Philbin

Jerry Springer Jerry Springer

Brandy Norwood Brandy Norwood

Piers Morgan Piers Morgan

Sharon Osbourne Sharon Osbourne

Simon Cowell Simon Cowell

Bianca Ryan Bianca Ryan

Simon Cowell Simon Cowell

Bianca Ryan Bianca Ryan

America’s Got Talent America’s Got Talent

Simon Cowell Simon Cowell

Simon Simon Cowell

Brighton Brighton

England England

parents Parent

names Name

Julie Julie Chen

Cowell Simon Cowell

Biography Biography

Awards Award

News News

Message Boards Internet forum

name Name

Table 4.2.: An excerpt of the sample gallery’s terms with their annotations

4.4. Score entities


Figure 4.3: Schematic overview of the entity scoring algorithm

4.4.1. Semantic relatedness

To express semantic relatedness as a number, we use the Wikipedia link-based measure (WLM) presented by Witten and Milne (2008). It is a computationally inexpensive measure, which can be calculated from (just) the Wikipedia link structure, and it has a relatively high correlation with human judgement. The formula is given in 4.1.

SR_{a,b} = \frac{\log(\max(|A|,|B|)) - \log(|A \cap B|)}{\log(|W|) - \log(\min(|A|,|B|))}    (4.1)

In this formula, an entity is treated as a Wikipedia article. The intuition behind the formula is that the more incoming links two articles have in common, the higher their semantic relatedness will be; incoming links pointing to only one of the articles, however, decrease the relatedness. In SR_{a,b}, a and b are articles, A and B are the sets of all articles that link to a and b respectively, and |W| is the size of Wikipedia. In our system, we use the Wikipedia mining toolkit created by Milne and Witten (2009) for calculating SR.
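As an illustration, a minimal sketch of formula 4.1 follows. Note that the formula as printed yields a distance-like quantity (identical in-link sets give 0), so the sketch inverts and clips it to read as a similarity in [0, 1], matching the later conversion distance = 1 − SR in section 4.4.1; this inversion is an assumption about the intended reading, not something the thesis states.

```python
import math

def wlm(in_links_a: set, in_links_b: set, num_articles: int) -> float:
    """Semantic relatedness of two Wikipedia articles from their sets of
    incoming links (formula 4.1). num_articles is |W|, the size of Wikipedia."""
    common = len(in_links_a & in_links_b)
    if common == 0:
        return 0.0  # no shared in-links: treat as unrelated (assumption)
    ngd = (math.log(max(len(in_links_a), len(in_links_b))) - math.log(common)) / \
          (math.log(num_articles) - math.log(min(len(in_links_a), len(in_links_b))))
    # The raw value is a distance; invert and clip so that higher = more related.
    return max(0.0, 1.0 - ngd)
```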


To reduce the noise introduced by the wikifier, we don’t create an edge between all pairs of entities, but apply a threshold of minimum relatedness.

The semantic graph of our sample gallery is presented in figure 4.4. The numbers on the edges are the semantic relatedness scores between the connected entities.

Figure 4.4: Semantic graph of the sample gallery

For each entity, we calculate their centrality in the semantic graph. A high centrality score indicates that an entity is connected to a high number of highly related entities. We use a variant of the closeness centrality from Wolfe (1997).

In closeness centrality, the concept of distance is used. In our semantic graph, the concept of relatedness is used. Distance is the opposite of relatedness and therefore, all semantic relatedness scores are first converted to distances: distance = 1 − SR.


CC_{e_i} = g \cdot \frac{g - 1}{\sum_j d(e_i, e_j)}    (4.2)

In this formula, CC_{e_i} is the centrality score of entity e_i, and g is the node count of the entity’s subgraph. As described before, an edge weight in the semantic graph is the semantic relatedness SR between two entities, and the distance between two entities is therefore 1 − SR. The distance d(e_i, e_j) is the sum of all distances on the shortest path between e_i and a reachable other entity e_j. If an entity is not connected to another entity, d(e_i, e_j) = 0.0.

In the semantic graph of our sample gallery in figure 4.4, the numbers after the entity names are the centrality scores of the entities. There are no isolated areas, so only one subgraph is present. As can be seen (from what is readable), entities connected to the cluttered, central section of the graph have a higher CC than the outer entities. In this example graph, the threshold of minimum relatedness was set to 0.20.
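A sketch of the graph construction and the centrality of formula 4.2, assuming the pairwise relatedness scores have already been computed (here passed in as a dict mapping entity pairs to SR values) and using networkx for the shortest-path computation:

```python
import networkx as nx

def centrality_scores(entities, relatedness, threshold=0.20):
    """Build the semantic graph (edges only above the relatedness threshold,
    weighted by distance = 1 - SR) and compute CC = g * (g - 1) / sum of
    shortest-path distances per connected subgraph (formula 4.2)."""
    graph = nx.Graph()
    graph.add_nodes_from(entities)
    for (a, b), sr in relatedness.items():
        if sr >= threshold:
            graph.add_edge(a, b, distance=1.0 - sr)
    scores = {}
    for component in nx.connected_components(graph):
        g = len(component)  # node count of the entity's subgraph
        for e in component:
            dists = nx.single_source_dijkstra_path_length(graph, e, weight="distance")
            total = sum(dists.values())  # unreachable entities contribute 0
            scores[e] = g * (g - 1) / total if total > 0 else 0.0
    return scores
```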

4.4.2. Tf-idf

The centrality of an entity in the semantic graph does not take into account the frequency of an entity in the gallery text, which also indicates importance. Additionally, common terms can distort relatedness in the semantic graph. For example, places are often related to the other entities, but usually a place is not the main concern of a gallery. Therefore, we add a tf -idf weighting component to the entity scoring algorithm.

Tf-idf considers the importance of a term in a document and the importance of the term in the total document collection. It can be computed by taking the product of the (normalized) term frequency (tf) in a document and the inverse document frequency (idf), which is the logarithm of the total number of documents divided by the number of documents containing the term. The idf weight reduces the tf weight of common terms and increases the weight of rare terms. We apply a variant of tf-idf, which is presented in formulas 4.3, 4.4, and 4.5.

tf_{e,g} = \frac{n_{e,g}}{\sum_k n_{k,g}}    (4.3)

idf_e = \log \frac{|W|}{1 + |\{w : e \in w\}|}    (4.4)

tf\text{-}idf_{e,g} = tf_{e,g} \cdot idf_e    (4.5)

Here, n_{e,g} is the number of occurrences of entity e in gallery g, |W| is the total number of documents, and |{w : e ∈ w}| is the number of documents containing e. (Formula 4.5 was lost in extraction and is reconstructed from the prose definition of tf-idf as the product of tf and idf.)


4.4.3. Combined entity score

The combined entity score is a variant of the averaged PageRank weighting (APW) presented by Tsatsaronis et al. (2010). APW combines the importance of an entity in a semantic graph with the importance of an entity in Wikipedia. When taking the top 20 important keywords found in a text, the APW method has the highest performance compared to related measures. Just as in our approach, APW creates a semantic graph of entities connected to each other with the semantic relatedness measure of formula 4.1. The only difference between APW and our formula is the graph-based ranking method: we apply the centrality measure from section 4.4.1 instead of PageRank. The calculation of the entity score is presented in 4.6.

ES_{e,g} = \begin{cases} \frac{1}{2}\left(\frac{CC_e}{CC_{max}} + \frac{tf\text{-}idf_{e,g}}{tf\text{-}idf_{max}}\right) & \text{if } CC_{max} \neq 0 \\ \frac{tf\text{-}idf_{e,g}}{tf\text{-}idf_{max}} & \text{if } CC_{max} = 0 \end{cases}    (4.6)

The entity score ES of entity e in gallery g is the mean of the normalized centrality score CC and the normalized tf-idf weight. If the maximum centrality score is 0, only tf-idf is used to prevent a division by zero.
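A direct transcription of formula 4.6, assuming the centrality and tf-idf values are supplied as dicts keyed by entity:

```python
def entity_scores(cc, tfidf):
    """Combined entity score (formula 4.6): the mean of normalized centrality
    and normalized tf-idf, falling back to tf-idf alone when CC_max is zero."""
    cc_max = max(cc.values(), default=0.0)
    tfidf_max = max(tfidf.values(), default=0.0) or 1.0  # guard against division by zero
    es = {}
    for e, tf_idf in tfidf.items():
        if cc_max == 0:
            es[e] = tf_idf / tfidf_max
        else:
            es[e] = 0.5 * (cc.get(e, 0.0) / cc_max + tf_idf / tfidf_max)
    return es
```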


Entity                 Count   tf-idf   CC      ES
Simon Cowell           18      2.09     12.36   0.92
America’s Got Talent   14      1.73     13.70   0.88
Jerry Springer         6       0.78     12.26   0.60
Bianca Ryan            2       0.31     14.74   0.57
David Hasselhoff       4       0.51     12.47   0.54
Name                   9       1.13     7.68    0.53
Title                  5       0.62     10.71   0.51
Piers Morgan           2       0.27     12.66   0.49
Sharon Osbourne        2       0.26     12.55   0.49
Regis Philbin          2       0.25     12.34   0.48
Brandy Norwood         2       0.24     12.14   0.47
News                   1       0.11     8.74    0.32
Parent                 1       0.12     7.20    0.27
Award                  1       0.12     7.19    0.27
Internet forum         1       0.11     6.28    0.24
Biography              1       0.11     0.00    0.03
Brighton               1       0.10     0.00    0.02
England                1       0.05     0.00    0.01

Table 4.3.: Sample gallery entity scores

4.5. Find and score entity categories

In this step, we try to find taxonomy categories for each entity extracted from the gallery. Additionally, each category receives a score, which indicates the confidence of this category being a category of the entity. The intuition is that a category that is highly likely to be a category of a highly scored entity is also likely to be a category of the whole gallery. The taxonomy categories for an entity are retrieved using the Wikipedia category graph. At the moment of writing, this graph contains over 700,000 categories. The search for entity categories starts with the Wikipedia article of the entity. Each Wikipedia article belongs to categories of the Wikipedia category graph. By searching through the ancestors of the article’s categories in the Wikipedia category graph, we try to find broader categories that are also present in the taxonomy.

Unfortunately, ancestor categories of the article’s categories are not necessarily categories of the entity. The deeper we search in the graph, the less likely it is that the ancestor is a broader category of the entity. For example, if we start a search at terrier, we can end up at the category Sports: a terrier is a Hunting dog, which is a subcategory of Hunting, and Hunting is a sport. However, the relation between a terrier and Sports is not obvious. The reason for this is that links between categories in the Wikipedia category graph are not strictly ‘is-a’, but can also be ‘part-of’ or just associative.


As mentioned in section 2.4, it takes 7 steps to travel from terrier to Mammals, while travelling from elephant to Mammals only takes 2 steps. As a trade-off between finding enough correct categories and not producing too many false positives, a maximum search depth of 5 steps is applied.

Before the graph search, some checks are performed. First, if the entity has the same name as a category from the taxonomy, then this category belongs to the entity. For example, if the entity people is found, it belongs to the category People. In the Wikipedia graph, we would never find that category, because the parent category of the entity people is Humans.

Second, if an entity has a category of the same name, we add the parents of that category as starting points for the graph search. For instance, the article Paris belongs to a category Paris; therefore, we add the parents of the category Paris to the article’s categories.

After these checks, the search in the graph is performed. A recursive depth-first search was implemented to collect all paths and their depths. Every category that is also present in the taxonomy gets a score; a high score indicates a high chance of the category being a category of the entity. The score is based on two parts: first, the minimum distance needed to travel from the article to a category that is present in the taxonomy; second, the number of paths leading to this category, normalized by the mean distance of these paths. The closer the category and the more (short) paths leading to it, the higher the confidence that the entity belongs to that category. The formal notation of the entity category score (ECS) is presented in formula 4.7.

ECS_{c,e} = \min\left(0.5, \frac{1}{d_{min}}\right) + \min\left(0.5, \frac{n_{path}}{d_{avg}^2}\right)    (4.7)

Let c denote a taxonomy category found for entity e; then the entity category score ECS_{c,e} is the combination of the minimum distance d_{min} to travel from e to c and the number of paths n_{path} from e to c, normalized by the average distance of the paths d_{avg}. Both parts can contribute at most 0.5 points to the score, leading to a score that is always between 0 and 1.

For the minimum distance part, the smaller of 0.5 and 1/d_{min} is chosen. When the minimum distance is 1 or 2, this part of the score gets the maximum of 0.5. If the distance is 1, then the second part of the formula will add the other 0.5, giving a score of 1.0.

In the second part, the square of the mean distance is taken to penalize each additional step in the mean distance much more heavily (e.g. the difference between a distance of 2 and 3 is not as large as the difference between 4 and 5). It would also be possible to take a higher power than 2. Just as in the first part of the formula, scores above 0.5 are capped at 0.5; otherwise, a high number of paths would influence the score too much.
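Formula 4.7 is compact enough to check directly against table 4.4; the sketch below takes the three path statistics as inputs:

```python
def entity_category_score(d_min: int, n_path: int, d_avg: float) -> float:
    """Entity category score (formula 4.7): each part contributes at most 0.5."""
    return min(0.5, 1.0 / d_min) + min(0.5, n_path / d_avg ** 2)

# Reproduces table 4.4, Simon Cowell -> Entertainment:
# minimum distance 4, 16 paths, mean distance 4.88.
print(round(entity_category_score(4, 16, 4.88), 2))  # 0.75
```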

After the scoring formula is applied to each category of the entity, we rank the categories by score in descending order. Then, some further selection is performed.


Second, we want to classify the entity as specifically as possible in the taxonomy. Therefore, if a certain category from the taxonomy is found, but ancestor categories are found as well, then these ancestors are deleted.

The result is a list of entities, each with a list of categories and a confidence score. For our sample gallery, the top 3 entities with their categories are listed in table 4.4.

Entity Category Min dist. # Paths Mean dist. ECS

Simon Cowell Entertainment 4 16 4.88 0.75

People 4 12 4.92 0.75

Music 4 9 4.67 0.66

Games 3 5 4.40 0.59

America’s Got Talent Entertainment 4 18 4.78 0.75

Television 4 9 4.67 0.66

Games 3 3 3.67 0.56

Jerry Springer People 4 10 4.90 0.67

Singers 4 5 4.20 0.53

Actors 3 3 4.00 0.52

Table 4.4.: Categories of the top three entities from the sample gallery

As can be seen in this table, some non-obvious categories are found for the entities. For example, Jerry Springer is classified as a Singer and Actor, but he is known as a television personality. On his Wikipedia article, he belongs, among others, to the categories American male singers, American country singers, and American actor-politicians. These categories are largely responsible for these connections.

Surprisingly, the category Television is not found for Simon Cowell and Jerry Springer. It appeared that this category was 5 steps away and not reachable through a large number of paths; therefore, the score was too low for it to be included. This example points out that choosing the right threshold values is critical for this algorithm.

4.6. Select gallery categories

The final step in the whole process is selecting categories for the gallery. The input is a list of entities with an importance score and, for each entity, a list of categories with a score indicating the reliability of this category being a category of the entity.

The first step is to collect all categories of all entities, and create a list of categories, with for each category a list of entities belonging to this category. For each category, we calculate a score. This gallery category score (GCS) is formulated in 4.8.

GCS_{c,g} = \sum_k ES_{e_k,g} \cdot ECS_{c,e_k}    (4.8)

Here, GCS_{c,g} is the score for category c of gallery g. For each entity e_k belonging to the category, the entity score ES is multiplied by the entity category score ECS, and these scores are summed over all entities belonging to the category.

After calculating the scores and sorting the categories by score, some additional selection steps are required. First, all categories with a score lower than a predetermined threshold are removed. Second, categories with a score lower than a preferred percentage of the highest score are removed. Third, ancestor categories of a found category are removed, because the goal is to classify as specifically as possible in the taxonomy. Finally, if there are more than x categories remaining, then all categories ranked after position x are removed. This parameter is set to 3 or 4, because it is rare that a gallery belongs to more than 4 categories. If no category survives the selection procedure, then the gallery belongs to Contents.
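The final scoring and selection can be sketched as follows. The threshold values below are illustrative placeholders (the actual values are not given in this chapter), and taxonomy_ancestors is a hypothetical helper mapping each category to its ancestor set:

```python
def select_gallery_categories(entity_scores, entity_categories, taxonomy_ancestors,
                              min_score=0.5, min_fraction=0.4, max_categories=3):
    """Formula 4.8 followed by the selection steps. entity_scores maps
    entity -> ES; entity_categories maps entity -> {category: ECS}."""
    gcs = {}
    for entity, categories in entity_categories.items():
        for category, ecs in categories.items():
            gcs[category] = gcs.get(category, 0.0) + entity_scores[entity] * ecs
    ranked = sorted(gcs.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[0][1] if ranked else 0.0
    # Steps 1 and 2: absolute threshold and percentage of the highest score.
    kept = [c for c, s in ranked if s >= min_score and s >= min_fraction * top]
    # Step 3: remove ancestors of other kept categories (classify as specifically as possible).
    kept = [c for c in kept
            if not any(c in taxonomy_ancestors.get(other, set())
                       for other in kept if other != c)]
    # Step 4: keep at most max_categories; fall back to Contents when nothing survives.
    return kept[:max_categories] or ["Contents"]
```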

For our sample gallery, the final candidates with their scores are listed in table 4.5.

Category            GCS
People              3.12
Entertainment       1.88
Singers             1.64
Games               1.33
Television          1.11
Musical groups      0.83
Music               0.61
Actors              0.61
Television actors   0.30
Arts                0.26
Sex                 0.20
Countries           0.01
Places              0.01

Table 4.5.: Resulting category list of the sample gallery


5. Evaluation

In this chapter, the test set and the performance measures used for evaluation are presented. We start with an overview of the test set, followed by an elaboration of the evaluation measures and the special hierarchical evaluation measures. Then, the baseline classifier is presented, followed by a listing of the thresholds used. Finally, we look at the evaluation setup for comparing the classifier’s performance using different wikifiers.

5.1. Test set

For image gallery classification, there is no established benchmark dataset available. Therefore, a test set had to be created. The galleries of this test set are selected from the Kalooga gallery database, which contains over 30 million galleries. A large number of these galleries include adult imagery or product photos. Such galleries would be more accurately classified using content-based techniques like skin detection for adult galleries and ‘even background’ detection for product galleries. Consequently, these kinds of galleries are not considered for inclusion in the test set.

The proposed algorithm uses a meta-data approach and, accordingly, galleries with no or little text will be unclassifiable. Therefore, only galleries containing text are included. The minimum amount of text is set to 20 characters, which should be enough to extract an entity, unless the character sequence contains no words. Since this research focusses on English galleries only, all test set galleries contain English text. Summarized, the galleries of the test set are non-adult, non-product, and contain at least some English text.

5.1.1. Creating the test set

At first, the idea was to randomly select galleries from the database and assign categories to them. Unfortunately, the resulting set turned out to have little category diversity. The Kalooga database contains a lot of Arts-related galleries and galleries belonging to none of the 8 first-level categories of the taxonomy. Thousands of randomly selected galleries would have had to be checked to obtain higher category diversity, which was not feasible because of time constraints. Therefore, besides the set of randomly selected galleries, an additional set had to be created using a different method. Instead of starting with a gallery and classifying it, the other method starts by picking a category and then selecting galleries for it.


Each person was assigned a first-level category with its sub-categories. To one person, the category Contents was assigned as well, which includes galleries fitting none of the first-level categories or their descendants.

A list of instructions was constructed as a guide. First, each person had to use the search portal from Kalooga to find (English) galleries belonging to the assigned first-level category or its subcategories. An exception is the Contents category, for which no subcategories have to be considered, because these are covered by the first-level categories. When galleries of other categories were found, they should be classified as well. Second, in the search portal, keywords belonging to a category, without the category name itself, should be used to find galleries. Third, the galleries should originate from various sources on the web. Fourth, the classification rules of section 3.2 apply. Finally, the format of the test set has to be like the sample below.

Speed skating; http://www.socalspeedskating.org/albums/2006Olympic...
Golf, Sportspeople; http://www.zimbio.com/pictures/PC7wxeukhym/phot...
Golf; http://www.pga.com/pgachampionship/2009/index.cfm
Golf, Games, Sportspeople; http://screenshots.teamxbox.com/gallery/1...

With the use of the Kalooga search portal, the retrieved galleries have at least one entity. Furthermore, the amount of relevant meta-data tends to be relatively high, because the more relevant meta-data a gallery contains, the higher its rating and position on the result page.

The manual classification was based on the whole website of the gallery, not just the gallery text alone. This is in contrast to the proposed automatic classifier, which is based on meta-data only.

5.1.2. Test set properties

The resulting test set contains a total of 734 galleries. The number of galleries for each retrieval method is given in table 5.1.

Method # Persons # Galleries

Random selection 1 223

Search on portal 6 511

Total 734

Table 5.1.: Retrieval of test set galleries

In table 5.2, we can see the percentage of galleries having n labels. Most galleries (90%) are single-label.

# Categories   # Galleries   Percentage
1              661           0.90
2              64            0.08
3              7             < 0.01
4              2             < 0.01
Total          734           1.00

Table 5.2.: Amount of multi-labels in test set

The size of the test set per subtree is given in table 5.3. A subtree in the taxonomy is a first-level category (e.g. Sports, People) with all of its subcategories. If a first-level category has no subcategories, like Vehicles, then the subtree is just one category. The category Contents is also treated as a first-level category. It is considered a subtree without subcategories, because the children of Contents are already covered by other subtrees. The collection of first-level categories including Contents is referred to as the top-level categories.

Subtree            Tot. # galleries   Avg # cat.   # Galleries   Avg # top cat.
People             184                1.23         169           1.89
Sports             168                1.08         157           1.32
Vehicles           82                 1.11         82            1.16
Places             23                 1.52         23            1.52
Entertainment      216                1.11         200           1.52
Arts               87                 1.44         82            1.45
Animals            52                 1.06         52            1.06
Plants             6                  1.33         6             1.33
Contents (other)   133                1.0          133           1.0
Total              951                             904

Table 5.3.: Amount of test galleries per first-level category and Contents

The first column of table 5.3 shows the name of the subtree’s root category. The Tot. # galleries column shows for each subtree the total of the per-category gallery counts. In other words, for each category of the subtree, we count the assigned galleries and then the sum of these counts is taken. Therefore, if a gallery belongs to multiple categories of the same subtree, then this gallery is counted multiple times for this subtree. For example, if a gallery belongs to Football players and Football clubs, then the Football players counter increases by one and the Football clubs counter increases by one. Consequently, the total Sports subtree counter increases by two.

In the Avg # cat. column, the average number of categories assigned to galleries of the subtree is presented. For example, galleries belonging to a category of the People subtree have, on average, 1.23 categories assigned.

The # Galleries column shows the number of galleries belonging to the top-level category of a subtree. The top-level categories of a gallery are found by mapping the gallery’s categories to their top-level categories. The numbers in this column differ from the Tot. # galleries column because, when a gallery belongs to multiple categories of a subtree, only one top-level category is chosen (e.g. Football players and Football clubs becomes Sports). Therefore, these numbers are lower than those in the Tot. # galleries column, in which the per-category counts are summed.

The last column shows the average number of top-level categories for a gallery from the subtree. These numbers can be higher than in the Avg # cat. column, because a subcategory can easily have multiple top-level categories. For example, if a gallery is only assigned to Football players, then it belongs to two top-level categories, namely Sports and People.

The overall average number of categories per gallery is 1.11. The overall average number of top-level categories is 1.24.

5.1.3. Development set

During the development of the algorithm, a separate set of about 100 galleries of various categories was created. This set was used to establish optimal thresholds for various cut-offs in the classification algorithm. In classic statistical classification, such a set is usually referred to as a validation set.

5.2. Performance measures

5.2.1. Precision, recall, and F-score

For the measurement of classification success, we use the traditional information retrieval measures known as precision, recall, and F-score. These measures are common in the evaluation of text classifiers (Sebastiani, 2002). Precision of category C_i is the percentage of documents correctly assigned to C_i out of all documents assigned to C_i, while recall is the percentage of documents correctly assigned to C_i out of all documents that should have been assigned to C_i. The F-score is a combination of the two. The measures are based on the result types listed in table 5.4.

Type             Assignment is:
True positive    correct
False positive   unexpected
False negative   missing
True negative    correctly absent

Table 5.4.: Result types

In the evaluation, each category C_i has its own table in which the amounts of true positives, false positives, false negatives, and true negatives are stored. This table, known as a contingency table or confusion matrix, is shown in table 5.5. Here, TP_i denotes the number of true positives for category C_i; FP_i, FN_i, and TN_i are respectively the numbers of false positives, false negatives, and true negatives for category C_i.

Category C_i       Expected TRUE   Expected FALSE
Predicted TRUE     TP_i            FP_i
Predicted FALSE    FN_i            TN_i

Table 5.5.: Contingency table

From the contingency table, precision Pr_i and recall Re_i for category C_i can be calculated as follows:

Pr_i = \frac{TP_i}{TP_i + FP_i}    (5.1)

Re_i = \frac{TP_i}{TP_i + FN_i}    (5.2)

The F-score is the harmonic mean of both measures and is useful for comparing algorithms based on a single figure. It gives precision and recall equal weight. Formally:

F = \frac{2PR}{P + R}    (5.3)


5.2.2. Macro-average and micro-average

When calculating precision and recall for a subtree or the whole taxonomy tree, the results for each category belonging to the tree should be combined somehow. In our evaluation, this is done using two different approaches, called micro-averaging and macro-averaging. In micro-averaging, all true positives, false positives, and false negatives of each category of the tree {C_1, ..., C_m} are merged into one global contingency table. From this global contingency table, precision \hat{Pr}^\mu and recall \hat{Re}^\mu are calculated. As formally denoted by Sun et al. (2003):

\hat{Pr}^\mu = \frac{\sum_{i=1}^m |TP_i|}{\sum_{i=1}^m (|TP_i| + |FP_i|)}    (5.4)

\hat{Re}^\mu = \frac{\sum_{i=1}^m |TP_i|}{\sum_{i=1}^m (|TP_i| + |FN_i|)}    (5.5)

In the macro-average approach, precision and recall are calculated for each category of the tree {C_1, ..., C_m}, and then the average precision \hat{Pr}^M and recall \hat{Re}^M over these categories is taken. Formally:

\hat{Pr}^M = \frac{\sum_{i=1}^m Pr_i}{m}    (5.6)

\hat{Re}^M = \frac{\sum_{i=1}^m Re_i}{m}    (5.7)

The difference between the two methods is that macro-averaging, compared to micro-averaging, gives a relatively high weight to categories with only a few documents in the test set. In other words, the micro-average gives equal importance to each document, while the macro-average gives equal importance to each category (Sun et al., 2003).
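Both averaging schemes are easy to mis-implement, so a small sketch may help. It assumes the per-category confusion counts are available as (TP, FP, FN) tuples; defining an undefined ratio (zero denominator) as 0.0 is an assumption.

```python
def micro_macro(confusions):
    """confusions: category -> (tp, fp, fn). Returns ((micro precision,
    micro recall), (macro precision, macro recall)) per formulas 5.4-5.7."""
    tp = sum(c[0] for c in confusions.values())
    fp = sum(c[1] for c in confusions.values())
    fn = sum(c[2] for c in confusions.values())
    micro = (tp / (tp + fp), tp / (tp + fn))
    # Per-category precision and recall; zero denominators count as 0.0 here.
    per_cat = [(t / (t + f) if t + f else 0.0, t / (t + n) if t + n else 0.0)
               for t, f, n in confusions.values()]
    m = len(per_cat)
    macro = (sum(pr for pr, _ in per_cat) / m, sum(rec for _, rec in per_cat) / m)
    return micro, macro
```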

5.3. Hierarchical performance measures

In the previously described precision/recall/F-score measures, the taxonomy is treated as a flat set of categories; the hierarchical structure of the taxonomy is not taken into account. For hierarchical classification, this may be a bit pessimistic. For example, when the target category is Football players and the found category is Football, this is counted as a complete error, while one could argue that it is not totally wrong. Therefore, we also applied hierarchical evaluation measures.


5.3.1. Top-level performance

The first measure is top-level classification evaluation. It is a variant of measuring performance at certain levels of a taxonomy as done, for example, by Pulijala and Gauch (2004). In this performance measure, the target and predicted categories of a gallery are first mapped to their top-level parents (a first-level category, or Contents if the category is already Contents). For instance, if a gallery belongs to both Football players and Hockey players, the result of the mapping is Sports, Sports, People, and People. Then, duplicate categories are removed, resulting in Sports and People. The same procedure is applied to the predicted categories. In this way, the precise subcategories are discarded in the final classification. After the mapping to top-level categories, precision, recall, and F-score are calculated for each top-level category. The top-level classification results are compared with the flat classification results to see whether there is a large difference in performance between classification using all subcategories and classification using only the top levels. Both micro- and macro-averaged precision and recall are calculated.
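A minimal sketch of this mapping over an assumed fragment of the taxonomy DAG (the parent table and category names are illustrative):

# Hypothetical parent relation; first-level categories point to the root.
PARENTS = {
    "Football players": ["Football", "People"],
    "Hockey players": ["Hockey", "People"],
    "Football": ["Sports"],
    "Hockey": ["Sports"],
    "Sports": ["Contents"],
    "People": ["Contents"],
}

def top_level_parents(category):
    # Map a category to the set of its first-level ancestors
    # (or to itself if it is first-level or the root Contents).
    if category == "Contents" or PARENTS.get(category) == ["Contents"]:
        return {category}
    tops = set()
    for parent in PARENTS.get(category, []):
        tops |= top_level_parents(parent)
    return tops

labels = {"Football players", "Hockey players"}
print(sorted(set().union(*(top_level_parents(c) for c in labels))))
# ['People', 'Sports']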

5.3.2. hPerformance

The second measure is known as hPerformance and was proposed by Kiritchenko et al. (2005). The measure is a hierarchical variant of the precision and recall measures described before. However, the hierarchical precision $hPr$ and hierarchical recall $hRe$ are not calculated for a category, but for a document. For a document $i$, all ancestor categories of the expected categories are included in the set of expected categories $\hat{E}_i$. All ancestors of the predicted set of categories are included in the set of predicted categories $\hat{P}_i$. The root category is excluded from both sets, because all categories share it as an ancestor. The true positives, false positives, and false negatives of all documents are added into one contingency table. From this table, the micro-averaged $hPr$, $hRe$, and F-score ($hF$) are calculated as follows:

$$ hPr = \frac{\sum_i |\hat{P}_i \cap \hat{E}_i|}{\sum_i |\hat{P}_i|} \qquad (5.8) $$

$$ hRe = \frac{\sum_i |\hat{P}_i \cap \hat{E}_i|}{\sum_i |\hat{E}_i|} \qquad (5.9) $$

$$ hF = \frac{2 \cdot hPr \cdot hRe}{hPr + hRe} \qquad (5.10) $$
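As an illustration, the measure can be computed as follows, reusing the illustrative PARENTS fragment from the previous sketch:

def ancestors(category):
    # All ancestors in the taxonomy DAG, excluding the root Contents.
    result = set()
    for parent in PARENTS.get(category, []):
        if parent != "Contents":
            result.add(parent)
            result |= ancestors(parent)
    return result

def expand(labels):
    # A label set extended with all ancestors of its members.
    return set(labels).union(*(ancestors(c) for c in labels)) if labels else set()

def h_performance(documents):
    # documents: list of (predicted label set, expected label set) pairs.
    inter = pred = exp = 0
    for predicted, expected in documents:
        p_hat, e_hat = expand(predicted), expand(expected)
        inter += len(p_hat & e_hat)
        pred += len(p_hat)
        exp += len(e_hat)
    h_pr, h_re = inter / pred, inter / exp
    h_f = 2 * h_pr * h_re / (h_pr + h_re) if h_pr + h_re else 0.0
    return h_pr, h_re, h_f

# Predicting the parent Football instead of Football players earns partial credit:
print(h_performance([({"Football"}, {"Football players"})]))  # (1.0, 0.5, 0.666...)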


An alternative to hPerformance is to weigh errors by a semantic similarity or a graph distance between the predicted and the target category. The first drawback of these semantic- or distance-based measures is that the choice of distance/similarity metric and of the mean distance/similarity makes the results somewhat subjective. Second, these measures can be hard to grasp, whereas hPerformance can be summarized as 'ancestor performance'. Third, in the context of multi-label classification and classification in a directed acyclic graph (DAG), it is hard to choose a path to base the error calculation on, because the error can be calculated using multiple target categories and multiple paths. Lastly, the hPerformance measure can simply be applied in a multi-label setting.

The hPerformance method has three important features. First, it gives some credit to partially correct classification, for example when Football players is intended but Football is chosen. Second, distant errors are penalized more than close errors: if Football players is expected and the parent category Football is predicted, this is penalized less than when the grandparent Sports is predicted. Third, the measure penalizes errors at a high level of the taxonomy more heavily than errors at a deep level. For example, if the category Football is expected and Tennis is predicted, this error is more severe than when the intended category was Football players and Football clubs was picked.

A drawback of the hPerformance measure is that it is hard to calculate per category or subtree, because the contingency table is built per document, not per category.

5.3.3. Hierarchical statistics

Besides the measures described above, a further analysis was performed to determine what kinds of hierarchical errors the classifier made and how severe these errors are. There are two main types of errors: false positives and false negatives. When looking at a false positive, five types of hierarchical errors can be distinguished (a sketch that classifies errors into these types follows the two lists below):

• The false positive is too broad (e.g. expected: Football players, predicted: Football).

• The false positive is too specific (e.g. expected: Football, predicted: Football players).

• The false positive is a sibling of the target category (e.g. expected: Football, predicted: Tennis).

• The false positive is in the same subtree as the target category, but not too broad, too specific, or a sibling (e.g. expected: Football, predicted: Formula One).

• The false positive is outside the target category's subtree (e.g. expected: Football, predicted: Plants).

When looking at a false negative, five similar types of errors can be distinguished:

• The closest predicted category is too broad compared to the false negative (e.g. false negative: Football players, closest predicted: Football).

• The closest predicted category is too specific compared to the false negative (e.g. false negative: Football, closest predicted: Football players).

• The closest predicted category is a sibling of the false negative (e.g. false negative: Football, closest predicted: Tennis).

• The closest predicted category is in the same subtree as the false negative, but not too broad, too specific, or a sibling (e.g. false negative: Football, closest predicted: Formula One).

• The closest predicted category is outside the subtree of the false negative (e.g. false negative: Football, closest predicted: Plants).
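A sketch that classifies an error into these five types, building on the ancestors and top_level_parents helpers from the earlier sketches:

def error_type(error, target):
    # error: the false positive (or false negative); target: the closest
    # correct (or predicted) category it is compared against.
    if target in ancestors(error):
        return "too specific"
    if error in ancestors(target):
        return "too broad"
    if set(PARENTS.get(error, [])) & set(PARENTS.get(target, [])):
        return "sibling"
    if top_level_parents(error) & top_level_parents(target):
        return "same subtree"
    return "outside subtree"

print(error_type("Football", "Football players"))  # too broad
print(error_type("Hockey", "Football"))            # sibling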

Besides establishing the number of errors of each hierarchical error type, the severity of these errors was also calculated, in two ways. In the first, the ancestor overlap $hF$ measure from section 5.3.2 is used; a low score indicates a more severe average error. In the second, a node-counting distance measure as described in Binder et al. (2010) is applied, in which the distance is the number of nodes on the path between the error category and the target category. For example, the node count between Football players and Football is 2. Note that the start category is included in the count.

In the multi-label setting of this study, it is not clear which of the correct categories is the target category. For example, the target can be the closest or the farthest category from the error category. Besides calculating the distance from the chosen category to one target category, it is also possible to use multiple target categories and calculate the distance as the sum (or mean) of the distances to all these targets. The problem with taking the mean or summed distance is that a gallery can belong to two very distinct categories, like Tennis and Arts. If the false positive is Football, then averaging the distance to Tennis with the distance to Arts has an unfair lowering effect on the score. The more optimistic choice was made to pick the closest of the candidates, assuming that this must be the one intended. For a false positive, the target is the closest correct category; for a false negative, the target is the closest found (predicted) category. Closeness is calculated by taking the candidate with the highest ancestor overlap $hF$ score (see section 5.3.2) relative to the false positive/negative. If candidates have an equal $hF$ score, the candidate with the lowest node count $\delta_T$ is the target.
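The node-count distance and the closest-target selection can be sketched as follows, again over the illustrative PARENTS fragment and reusing h_performance from above:

from collections import deque

def neighbours(category):
    # Treat the taxonomy DAG as an undirected graph: parents plus children.
    ups = PARENTS.get(category, [])
    downs = [c for c, ps in PARENTS.items() if category in ps]
    return ups + downs

def node_count(a, b):
    # Number of nodes on the shortest path from a to b, start included.
    queue, seen = deque([(a, 1)]), {a}
    while queue:
        node, count = queue.popleft()
        if node == b:
            return count
        for n in neighbours(node):
            if n not in seen:
                seen.add(n)
                queue.append((n, count + 1))
    return float("inf")

def closest_target(error, candidates):
    # Highest ancestor overlap hF wins; ties are broken by node count.
    def key(c):
        h_f = h_performance([({error}, {c})])[2]
        return (-h_f, node_count(error, c))
    return min(candidates, key=key)

print(node_count("Football players", "Football"))        # 2
print(closest_target("Football", ["Hockey", "People"]))  # Hockey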

5.4. Baseline


Unfortunately, because the people from Kalooga had already helped in creating the test set, and because time was limited, this baseline was not considered either.

For the reasons above, a simple baseline was created to evaluate against. The options were a most-frequent baseline or a random (by chance) baseline. The most-frequent baseline was not used, because the galleries in the test set are not distributed over the categories in the same way as in the Kalooga database. Thus, a random baseline classifier was created.

The random baseline classifier picks one category for a gallery with 90 percent probability and two categories with 10 percent probability; each picked category is chosen uniformly at random. Then, the overall precision/recall/F-score, the top-level performance, and the overall hPerformance are calculated, as well as, per subtree, the precision/recall/F-score and the top-level performance.

To minimize the chance of the random classifier performing unusually poorly or unusually well, the average over 100 runs is taken.
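A minimal sketch of such a baseline in Python (the gallery identifiers and the flat category list are placeholders for the example):

import random

def random_baseline(galleries, categories, rng):
    # Pick one category with probability 0.9, two with probability 0.1.
    predictions = {}
    for gallery in galleries:
        k = 1 if rng.random() < 0.9 else 2
        predictions[gallery] = set(rng.sample(categories, k))
    return predictions

rng = random.Random(42)
categories = ["Football", "Tennis", "Arts", "Plants", "Vehicles"]
runs = [random_baseline(["g1", "g2"], categories, rng) for _ in range(100)]
# The evaluation scores would then be averaged over the 100 runs.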

5.5. Thresholds

The proposed classifier uses several thresholds for narrowing down the number of entities, the relations between entities, and the number of entity categories and gallery categories. Changing these thresholds affects the results, especially the balance between recall and precision. During tests on the development set, we found the following thresholds to give a reasonable balance between precision and recall:

• Thresholds for entity scoring (section 4.4):
  – Minimum semantic relatedness SR: 0.20
  – Minimum entity score ES: 0.45

• Threshold for the entity category selection (section 4.5):
  – Minimum entity category score ECS: 0.45

• Thresholds for the gallery category selection (section 4.6):
  – Minimum gallery category score $GCS_{c,g}$: 0.50
  – The gallery category score $GCS_{c,g}$ must be at least 50% of the highest GCS
  – Maximum number of gallery categories: 3
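Collected in one place, these cut-offs could be expressed as follows; a hypothetical configuration sketch in Python, with the gallery category selection of section 4.6 worked out as an example (names and structure are illustrative, not the actual implementation):

THRESHOLDS = {
    "min_semantic_relatedness": 0.20,   # SR
    "min_entity_score": 0.45,           # ES
    "min_entity_category_score": 0.45,  # ECS
    "min_gallery_category_score": 0.50,  # absolute GCS cut-off
    "min_gallery_category_ratio": 0.50,  # relative to the highest GCS
    "max_gallery_categories": 3,
}

def select_gallery_categories(scores, t=THRESHOLDS):
    # scores: {category: GCS}. Apply the absolute, relative, and count cut-offs.
    if not scores:
        return []
    best = max(scores.values())
    kept = sorted((c for c, s in scores.items()
                   if s >= t["min_gallery_category_score"]
                   and s >= t["min_gallery_category_ratio"] * best),
                  key=scores.get, reverse=True)
    return kept[:t["max_gallery_categories"]]

print(select_gallery_categories({"Football": 0.9, "Sports": 0.6, "People": 0.3}))
# ['Football', 'Sports']: People fails the absolute and the relative cut-off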

5.6. Evaluation of the classifier using different wikifiers

The proposed classifier depends on the output of the entity extraction process, which is performed by a wikifier/annotator (see section 4.3). To see how much the performance of the classifier is affected by this wikifier, the classifier is run using three different wikifiers.


The first wikifier is manual annotation: a subset of the test set was wikified by hand. For each first-level category, about 5 galleries were selected, summing up to a total of 50 galleries. The average number of (non-unique) entities per gallery was 50.82. Most galleries contained about 20 entities, but three galleries had many entities, resulting in the higher average.

The second wikifier is the Wikipedia Miner annotator of Milne and Witten (2008), which reaches a precision and recall of almost 75%. This wikifier is publicly accessible via a web service (Milne and Witten, 2009), which was used to evaluate the classifier. The third wikifier is the Kalooga wikifier described in section 4.3. For each wikifier, the classifier was tested on the same subset of the test set that was manually wikified; running the classifier with one of the automatic wikifiers on the whole test set would make the comparison unfair.

5.7. Summarized evaluation setup

In summary, we determine the performance of the classifier by testing it on the manually created test set of 734 galleries, using the thresholds chosen during tests on the development set. Three evaluation measures are used. First, the flat performance, which is the averaged precision and recall over all categories; both micro- and macro-averaged results are calculated. Second, the top-level performance, which calculates precision and recall only for the first-level categories and Contents (subcategories are mapped to their first-level parents); again, both micro- and macro-averaged results are calculated. Third, the overall hPerformance, which calculates precision and recall of a gallery by comparing the ancestor categories of the found categories with the ancestor categories of the correct categories. For every precision and recall pair, the F-score is calculated as well. The results of the overall measures are compared with the results of a random baseline classifier.

Furthermore, for each subtree, the micro- and macro-averaged flat performance and the top-level performance are calculated. Finally, for each subtree with children, detailed statistics about the kinds of errors (false positives and false negatives) are presented.


6. Results

In this chapter, the results of the evaluation are presented. We start with an overview of the overall performance. Then we look at the performance per subtree, followed by statistics on errors related to the taxonomy hierarchy (e.g. classified too broadly or too specifically). Next, an overview of the performance of the classifier using different wikifiers, including manual wikification, is presented. In the next chapter, these results are interpreted, analysed, and discussed.

6.1. Overall performance

In table 6.1, the overall performance of the system is presented. Per measure, the precision, recall, and F-score are listed for the random baseline and the proposed classifier.

Measure              Method      Pr      Re      F
Flat micro-avg       random      0.020   0.019   0.020
                     classifier  0.345   0.460   0.394
Flat macro-avg       random      0.020   0.019   0.020
                     classifier  0.482   0.474   0.478
Top-lvl micro-avg    random      0.196   0.197   0.196
                     classifier  0.535   0.663   0.592
Top-lvl macro-avg    random      0.138   0.138   0.138
                     classifier  0.536   0.701   0.608
hPerformance         random      0.112   0.143   0.125
                     classifier  0.543   0.593   0.567

Table 6.1.: Overall performance

The classifier performs considerably better than the random baseline. For the flat measure, the macro-averaged performance is higher than the micro-averaged performance. In appendix D, detailed performance results per category of each subtree are given.

The difference between the flat performance and the top-level performance indicates that some galleries are assigned to the correct subtree, but the exact subcategory is wrong. Further statistics about these types of errors are given in section 6.3.

6.2. Per subtree performance

In table 6.2, the micro-averaged, macro-averaged, and top-level precision/recall/F-scores are given for each subtree and for Contents. Some subtrees consist of only one category, namely Vehicles, Arts, Plants, and Contents; for these subtrees, the top-level results are identical to the results over all categories. Subtrees that consist of just one category are marked with an asterisk.

Subtree         Pr_µ    Re_µ    F_µ     Pr_M    Re_M    F_M     Pr_T    Re_T    F_T
People          0.248   0.299   0.271   0.360   0.303   0.329   0.535   0.680   0.599
Sports          0.418   0.411   0.414   0.495   0.449   0.471   0.842   0.713   0.772
Vehicles*       0.815   0.646   0.721   0.815   0.646   0.721   0.815   0.646   0.721
Places          0.096   0.565   0.165   0.237   0.660   0.348   0.119   0.696   0.203
Entertainment   0.259   0.343   0.295   0.366   0.335   0.349   0.552   0.775   0.644
Arts*           0.481   0.736   0.582   0.481   0.736   0.582   0.481   0.736   0.582
Animals         0.574   0.596   0.585   0.716   0.645   0.678   0.717   0.731   0.724
Plants*         0.286   1.000   0.444   0.286   1.000   0.444   0.286   1.000   0.444
Contents*       0.478   0.331   0.391   0.478   0.331   0.391   0.478   0.331   0.391

(* = no subcategories, µ = micro-avg, M = macro-avg, T = top-level)
Table 6.2.: Per subtree performance

In the subtrees of Sports, People, Entertainment, and Animals, a relatively high number of mistakes is made within the subtree, as can be seen when comparing the top-level F-score with the other F-scores. The more subcategories a subtree has, the larger the difference between the top-level performance and the flat overall performance. More detail on these hierarchical errors follows in the next section.

When looking at the flat measures, the best performing subtree is Vehicles; however, this subtree has no child categories. The worst performing subtree is Places, whose precision is low compared to its recall. Interestingly, all galleries for Plants are found (recall 1.000), but the precision is only 0.286. The Arts category also suffers from a low precision compared to its recall.
