
Gender gap on Wikipedia: visible in all categories?

Paul Schrijver 6373975 / 10116052

Bachelor thesis
Credits: 12 EC

Bachelor programme Information Science
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Dr. M. J. Marx
ILPS, IvI
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

2016-05-25


Contents

1 Introduction
1.1 Definitions
1.2 Overview of thesis
2 Related Work
2.1 Gender gap on Wikipedia
2.2 Categorization of Wikipedia
3 Methodology
3.1 Description of the data
3.1.1 User account data
3.1.2 Categories
3.1.3 Revision history
3.1.4 Resulting revision history dataset
3.2 Methods
4 Evaluation
4.1 Categories with an overrepresentation
4.2 Analysis of overrepresented categories
5 Conclusion
5.1 Discussion
5.2 Acknowledgements
A Category tree
B Category statistics
C Fisher's exact test results


Abstract

Since 2001 Wikipedia has been ever growing in both popularity and content. This development also leads to more and more critique of the content of the online encyclopaedia. One of the main concerns is gender bias: there is a well-documented male bias observable on Wikipedia. There is also a 'gender gap' on Wikipedia, meaning that around 90% of the content editors are male. Other research has shown that males and females display different behaviour on Wikipedia when it comes to editing activity. This raises the question whether the gender gap on Wikipedia is as black-and-white as it seems: the editing activity of the two genders could be divided over different areas of the encyclopaedia.

To explore different areas of the website, each Wikipedia article was assigned one or more categories derived from the classes of the DBpedia ontology, the class hierarchy of DBpedia, a 'semantic web version' of Wikipedia. For each category a set of all articles belonging to that category was made. The Wikimedia Foundation provided a dataset which made it possible to extract the gender of an editor. Finally, a dataset was created based on the full revision history of the English Wikipedia, providing the history and metadata of all revisions made on the categorized set of articles. This dataset covers 75% of all articles on Wikipedia and 10% of its total revision history.

To measure edit activity per category a range of statistics was generated. These statistics were used to execute Fisher's exact test for each category, resulting in indicators which helped determine whether there was an overrepresentation of a gender, and which gender that was. A significant result for a category means a certain gender was more active in that category than expected, the expectation being an equal degree of editing activity per gender in each category.

Further analysis of the findings leads to the conclusion that more than three-quarters of the categories on Wikipedia have a significant overrepresentation when it comes to edit activity. Contrary to what was expected, this overrepresentation is by far not only male: 35% of these categories have a female overrepresentation. With a mean size of 15,830 articles these categories are on average much smaller than their male counterparts, which have a mean of 83,960 articles. Nevertheless, these findings suggest the gender gap and gender bias on Wikipedia might be more nuanced than is often presented.


1 Introduction

In the last fifteen years the online encyclopaedia Wikipedia has made quite a name for itself. Thanks to millions of visitors each day the website is firmly planted in the top ten of most visited pages on the web, according to Alexa [Alexa, 2016]. The online knowledge base is nowadays available in 292 languages with a combined total of almost 40 million articles [Wikipedia, 2016a]. Combine this with the fact that all content on Wikipedia is written by volunteers and can be edited by anyone, and it becomes clear why this is one of the largest online open-source, community-driven projects of all time. This is also why Wikipedia formed the basis of knowledge bases such as YAGO, DBpedia and Google's Knowledge Graph, and of the Jeopardy-winning IBM Watson supercomputer.

This makes Wikipedia not only an interesting asset to its visitors and to numerous knowledge projects; the encyclopaedia itself also holds a wealth of information for research purposes in a variety of research fields. To information scientists, the way Wikipedia is used by its editors, also called 'Wikipedians', can be particularly interesting. Since about 2006, for example, a lot of research has been conducted on specific behavioural traits of editors, and gender seems to play a large role in one form or another in a fair share of this research.

The growing popularity of Wikipedia also fuels concerns and discussions about the quality of its content. Especially the possibility of a certain gender bias on Wikipedia has been a hot topic for years now [Reagle and Rhue, 2011, Wagner et al., 2015, Klein, 2015, Graells-Garrido et al., 2015, Collier and Bear, 2012, Kim, 2013, Matias et al., 2015, Antin et al., 2011]. The concept of gender bias on Wikipedia can be split into two main variations. There could be a gender bias in the editing simply because of a stronger representation of a certain gender among the total population of Wikipedians. On the other hand, the content itself could be biased, resulting in more articles about persons of a certain gender. One of the most common findings is that a majority of biographies on Wikipedia are about men [Reagle and Rhue, 2011, Wagner et al., 2015, Klein, 2015]. Both forms of gender bias seem to be strongly connected. The Wikimedia Foundation, the foundation behind Wikipedia, stimulates research on gender bias; the issue became important for the foundation after it concluded from its own research that almost all editors are male [Wikipedia, 2016d, Wikimedia, 2016c].

Past research has shown that about 90 percent of all editors on Wikipedia are male, although those numbers have been decreasing very slightly over the years [Wikimedia, 2011]. Recent research has shown that all language versions of Wikipedia cope with an underrepresentation of women. There is, however, a direct connection between the number of female editors of a language version of Wikipedia and the number of women active in science in the corresponding country [Massa and Zelenkauskaite, 2014].

Women are not just far less represented in the Wikipedia community, they also seem to behave differently on the online encyclopaedia. Specific research on editing behaviour made clear that women are active in a less broad spectrum of subjects, but also tend to make more extensive edits overall compared to male editors [Antin et al., 2011]. An increase in female editors can be seen over the years, but the degree of participation of these newcomers is much lower, resulting in less editing [Matias et al., 2015]. In the generation that grew up using Wikipedia there are also noticeable differences in editing behaviour: young females turn out to lack the motivation to contribute to Wikipedia in the form of editing, despite using the website on a regular basis [Kim, 2013].

It is clear that there is a gender gap on Wikipedia on multiple levels. A recurring theme in all research on this topic is 'male domination' of Wikipedia because of the sheer number of male editors. It is also shown that male and female editors display unique behaviour and character traits when it comes to editing. This raises the question whether the gender gap on Wikipedia is as wide as it is often presented. It could be more nuanced if the behavioural differences also extend to the areas of Wikipedia where editors are actually active: different parts of Wikipedia could have a higher representation of a certain gender. If this is the case there could even be a specific gender bias per category on Wikipedia. Currently no research directly addresses this subject.

To make a first effort to fill that gap, this bachelor thesis attempts to investigate just that. The following research question is raised:

Main research question Does there exist a correlation between the gender of Wikipedia editors and the categories where they are active?

To pave the way to answering the main research question, several subquestions have been drafted:

RQ1 Are there any categories on Wikipedia with a significant overrepresentation of a certain gender when looking at edit activity?

RQ2 Do categories with an overrepresentation show specific traits when comparing them by gender?

1.1 Definitions

The following defines frequently used terms in this paper.

Wikipedia Refers to the online encyclopaedia found at www.wikipedia.org. Further defined in the methodology section.

Edit, revision Wikipedia consists of a collection of pages that can be edited. Changing a page and saving it is seen as a revision.

Editor, Wikipedian Somebody who makes revisions on Wikipedia.

(Edit) Activity The act of making revisions.


1.2 Overview of thesis

Related work Provides a more detailed view of related research, complementary to the introduction, especially on the subject of categorization.

Methodology Describes how all necessary data was found, wrangled and used.

Evaluation An overview of the findings that resulted from the methodology, answering the subquestions.


2 Related Work

2.1 Gender gap on Wikipedia

Wikipedia has been active since early 2001. In the mid 2000s the popularity of Wikipedia really started to grow, and with it research on possible gender bias on the online encyclopaedia began. Gender bias would be a form of systematic bias on the website, and is one of the main critiques of Wikipedia. These concerns are supported by data from the Wikimedia Foundation published in 2011, showing that over 90 percent of all Wikipedia users indeed are male [Wikimedia, 2011]. This gender bias has for example been researched by Antin et al. [Antin et al., 2011], who specifically investigated behavioural differences between users of different genders. They showed that female editors tend to make more extensive edits on articles, and that female editors are active in a much narrower spectrum of subjects compared to male editors. Whether there is also a difference in subjects per gender is not mentioned. This study uses a sample of 13,598 Wikipedians, of which 82 percent report being male. This is slightly less than the figures of Wikimedia suggest, so the representativeness of the sample is somewhat questionable.

More recent research investigating the gender gap on Wikipedia shows that female editors are still by far the minority. It also shows that female editors are more highly motivated than their male counterparts, although the exact effect of this higher motivation is not further explained [Hill and Shaw, 2013]. We already know that women make more elaborate edits, which could be a direct result of this, but it could also result in activity in certain categories on Wikipedia.

The reason why men are so dominant on Wikipedia has been investigated on multiple occasions. Internet skill level is found to be a predictor of contributing to Wikipedia, and women have on average a lower internet skill level, which could explain the gender gap [Hargittai and Shaw, 2015]. A more psychological approach was taken by a large research project that showed multiple differences between men and women in the usage and editing of Wikipedia. Women turn out to be far less likely to edit pages on Wikipedia due to a lack of confidence: they tend to think they lack expertise or information. Furthermore they do not like to participate in discussions, the result of a fear of getting 'yelled at' [Collier and Bear, 2012]. This behaviour could lead to areas of Wikipedia that might already be predominantly edited by males becoming even more biased.

2.2 Categorization of Wikipedia

Since its beginning Wikipedia has been equipped with a categorization system. Editors can both create categories and assign them to articles. A category always has a parent category, up to the fourteen main topic classifications. Categories assigned to an article are often very specific: for example, the article 'Albert Einstein' has been assigned, among others, to the category 'Sigma Xi', a category containing just 48 articles. This can make comparisons between categories complicated. Furthermore, past research demonstrated flaws in the categorization system used:


There is no strict enforcement of which higher-level categories a child category can belong to; thus, the category structure is neither a tree nor a directed acyclic graph, permitting such paradoxes as a category being its own “grandparent”. [Kittur et al., 2009]

The lack of a tree structure in the categorization system makes using parent categories impossible, and a reversed approach using the main topic categories as a starting point struggles with the same flaws. Because of this, alternative methods for categorization have been used in previous research. First of all, topic modelling is a broadly used method that can be implemented in many ways. One application is based on all categories linked to a page: by automatically processing all categories an overarching subject can be assigned, which can be mapped to a single category from a collection of high-level categories [Kittur et al., 2009]. Other variants of topic modelling analyse only article titles, but it is also possible to analyse the full text of a page [Medelyan et al., 2008]. Notably, research using this technique is relatively old, dating back to before 2010.

More recent activity on categorizing Wikipedia articles can be found in the DBpedia project. DBpedia is also a community project, but it focuses on extracting knowledge from Wikipedia and making it freely available on the internet in a structured manner, applying Semantic Web and Linked Data technologies [Lehmann et al., 2015]. An important part of adding structure to Wikipedia content is classifying it. For this DBpedia has set up an ontology, a collection of classes structured in a tree-like manner. Content on Wikipedia, mainly articles, can be assigned to multiple classes. The mapping of articles to classes is done by an automated process that maps Wikipedia infobox templates to the ontology [Bizer et al., 2009]. Every infobox has a unique identifier such as 'Infobox person'; an article with that infobox will be mapped to the class 'Person' [Mendes et al., 2012]. Because not every article on Wikipedia has an infobox, not every article has been mapped to the ontology. More intelligent mappings, and combinations of current techniques with topic modelling, are currently being investigated, however [Holze, 2016].

The DBpedia ontology is regularly used in scientific research. For example, it has been used to help set up a system to classify entities in text [Dojchinovski and Kliegr, 2013], and to determine which pages are about persons in research on the gender gap on Wikipedia's person pages [Marx and Alberts, 2016]. The usage of the ontology for categorization purposes is further supported by research on the semantic linking of 'Learning Object Repositories', where the classes within the ontology are used as a categorization system [Lama et al., 2012].


3 Methodology

3.1 Description of the data

First of all it was necessary to determine the scope of this research regarding the version of Wikipedia to use. Wikipedia is available in 292 languages and every language has its own datasets. It was decided to analyse only one version of Wikipedia: analysing more languages would add a lot of complexity without necessarily improving the validity of this research. With a 13.1% share of all articles on Wikipedia, the English version is by far the largest, almost double the size of the number two language, Swedish [Wikipedia, 2016a]. The English version also has the most registered users and, having been around the longest, the richest revision history [Wikipedia, 2016c]. Therefore the English Wikipedia provides a good sample of Wikipedia overall and was chosen accordingly. When the term Wikipedia is used in this paper it refers to the English version of Wikipedia.

For this research several datasets were necessary. These at least needed to comprise the revision history of Wikipedia, the gender of users, and metadata about pages such as their categories.

3.1.1 User account data

Registered users on Wikipedia all have personal profiles. By default this profile only contains the chosen username; it is possible, but not obligatory, to enter additional personal information, which can include the gender of the user. To be able to answer the research question it is critical to know the gender of as many editors as possible.

Although this information is publicly available via the Wikimedia API, querying the vast number of users through the webservice would not have been viable [Wikimedia, 2016a]: it would have come down to over ten million API requests. Therefore the assistance of the Wikimedia Foundation was requested. This resulted in two datasets: one containing the user IDs and gender of all users with a registered gender, the other containing all user IDs, the time of registration and the number of revisions made by each user. Merging these datasets resulted in a dataset with the strictly required data and possibly useful extra metadata. A small remark concerning the user dataset has to be made: some user data was filtered out by Wikimedia, mostly usernames that are abusive or reveal personally identifying information about someone [Glenn, 2016].
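As an illustration, merging the two user datasets could look like the following minimal sketch, assuming pandas; the file names and column layouts are hypothetical, since the actual datasets were provided privately.

    import pandas as pd

    # Hypothetical files standing in for the two datasets provided by
    # the Wikimedia Foundation [Glenn, 2016].
    gender = pd.read_csv("user_gender.tsv", sep="\t",
                         names=["user_id", "gender"])
    meta = pd.read_csv("user_meta.tsv", sep="\t",
                       names=["user_id", "registration", "edit_count"])

    # Keep only users with a registered gender, then restrict the result
    # to users who actually made one or more revisions.
    users = gender.merge(meta, on="user_id", how="inner")
    editors = users[users["edit_count"] > 0]
    print(editors["gender"].value_counts(normalize=True))  # roughly 85% male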

Currently Wikipedia has 27,915,403 registered users [Wikipedia, 2016c]. Of these users, 628,865 individuals have registered their gender, a share of 2.25%. Of this number, only 257,157 users have actually made one or more revisions on Wikipedia: 219,617 (85%) of them are male and 37,539 (15%) are female.

3.1.2 Categories

Being able to categorize articles on Wikipedia is essential to perform an analysis of activity within categories. As described in the related work section there are different approaches to divide Wikipedia pages into categories. Creating topic modelling algorithms was deemed too complex and time-consuming. Therefore it was decided to use the ontology of DBpedia.


Figure 1: Sample of the tree structure in the DBpedia ontology.

The classes within the ontology were adopted as categories. The DBpedia ontology has a tree structure, which also adds an extra layer of information by making it possible to measure activity in just high-level categories, or in very specific categories if needed.

All classes have the same ancestor class, which is 'Thing'. A subclass of 'Thing' is, for example, 'Activity', which has its own subclasses like 'Game' and 'Sport'. To further clarify this system a sample is shown in figure 1. A full version of the category tree is added in the appendix.

It was decided that the most practical way to categorize articles was to create a dataset containing identifiers for pages and the corresponding categories. This dataset was created by parsing a raw dataset from DBpedia consisting of instance types [DBpedia, 2016], which contains page titles and their corresponding classes. Since page titles do not make the best identifiers, due to their size and encoding issues, they had to be resolved to page IDs. This was done by making use of a database copy of Wikipedia's pages table, which is made publicly available in the Wikimedia data dumps [Wikimedia, 2016d]. Since the instance types dataset only provided a somewhat limited categorization when it comes to depth and number of pages, a complementary dataset was needed. This was found in a collection of 648 datasets created by the University of Mannheim: per class a CSV file was available containing, as far as possible, the page IDs mapped to that class [Mannheim, 2014]. These CSV datasets were created by extracting the DBpedia RDF data and transforming it to CSV [DBpedia, 2014]. Every file has its own structure, requiring a dynamic parser. By merging the CSV datasets into the previously established dataset, about 900,000 articles and about 350 deeper categories were added.
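A sketch of such a parser is shown below. The N-Triples pattern follows the DBpedia dump format; title_to_id stands for a hypothetical dictionary built from the pages table that resolves page titles to page IDs.

    import re

    # One line of the DBpedia instance types dump (N-Triples) looks like:
    # <http://dbpedia.org/resource/X> <...rdf-syntax-ns#type> <http://dbpedia.org/ontology/Y> .
    TRIPLE = re.compile(r"<http://dbpedia\.org/resource/([^>]+)> "
                        r"<[^>]+#type> "
                        r"<http://dbpedia\.org/ontology/([^>]+)> \.")

    def parse_instance_types(path, title_to_id):
        """Build a mapping from Wikipedia page ID to a set of DBpedia classes."""
        page_categories = {}
        with open(path, encoding="utf-8") as triples:
            for line in triples:
                match = TRIPLE.match(line)
                if not match:
                    continue
                title, ontology_class = match.groups()
                page_id = title_to_id.get(title)   # resolve title to page ID
                if page_id is not None:
                    page_categories.setdefault(page_id, set()).add(ontology_class)
        return page_categories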

The resulting dataset contains:

• 501 categories


3.1.3 Revision history

The Wikimedia Foundation publishes several 'data dumps' on their website. One of these datasets contains the full revision history, with a lot of metadata, of every page on Wikipedia, structured per page in XML [Wikimedia, 2016d]. This data dump formed the basis of the revision history dataset needed for this research. All revisions from September 2001 up to 4 February 2016 were included. This dataset had a compressed size of 45 GB and contained a total of 822,420,040 edits [Wikimedia, 2016b]. To make the dataset more workable an extraction had to be made that was as concise as possible.

First of all the necessary information per revision was identified:

• Page ID
• User ID
• Username
• Revision ID
• Time stamp of the revision
• Whether the revision was minor or not

Extracting all this information for every revision would still leave a lot of unnecessary data. To filter out irrelevant revisions some conditions were introduced, which limit the extraction of data as explained below.

Limitation of pages The revision history contains the full history of all pages on Wikipedia. Pages take many forms and can be things like media, discussions and articles. By using the category dataset as a guide, the number of pages used for the final dataset could be limited: if no category could be found for a certain page ID, that page was skipped and is not part of the final dataset.

Limitation based on users Every page in the XML structure of the revision history contains all its revisions, and for every revision the user ID of the editor is known. To further limit the extraction, for every revision it was checked, by consulting the user dataset, whether the gender of that editor was known. Revisions made by users with an unknown gender were ignored.

Minor edits Furthermore it was investigated whether minor revisions should be taken into account or not. Wikipedia defines a minor edit as an edit that only makes a superficial difference. An editor can mark an edit as minor by checking a certain checkbox before saving the edit; whether an edit is considered minor or not is fully up to the judgement of the editor. By marking an edit as minor an editor states that the revision doesn't alter the underlying content of an article and therefore doesn't need review by other editors. A lot of automatic edits done by bots are automatically marked as minor [Wikipedia, 2016b]. These bot edits were not included in the dataset, however, since bots are not registered users with a known gender.

To determine whether minor edits should be ignored or not, the revision size of both minor and 'major' edits was tested. The revision size is the number of bytes actually altered in an edit. It was expected that minor revisions would also have a very small revision size and would therefore possibly not be interesting, since they can barely be counted as 'activity'.

Samples of both types of revision were made, each containing 10,000 revisions. To find out the size of a revision the Wikimedia API was used. For both samples the API could provide the size of 9,918 edits. The results are shown in table 1.

Type of edit   Count   Mean    Standard deviation
Minor          9918    18034   25812
Non-minor      9918    19078   29307

Table 1: Sample of revision types, size in bytes.

As seen here the difference in size is not that large, and a t-test confirmed that the two types of revision do not differ significantly. This, combined with the fact that marking a revision as minor is fully up to the user and is an easily forgotten step, resulted in the choice to include both kinds of revision in the final analysis.
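Such a comparison can be expressed with a standard two-sample t-test. A minimal sketch assuming SciPy; the synthetic samples below merely stand in for the 9,918 revision sizes per edit type retrieved through the API, and Welch's variant is an assumption, since the thesis does not name one.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Placeholder samples standing in for the revision sizes (in bytes)
    # retrieved through the Wikimedia API.
    minor_sizes = rng.lognormal(mean=9.0, sigma=1.2, size=9918)
    non_minor_sizes = rng.lognormal(mean=9.05, sigma=1.2, size=9918)

    t, p = stats.ttest_ind(minor_sizes, non_minor_sizes, equal_var=False)
    print(f"t = {t:.2f}, p = {p:.3f}")
    # A non-significant p-value justifies treating minor and non-minor
    # revisions alike in the final analysis.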

3.1.4 Resulting revision history dataset

After defining the above conditions the revision history could be parsed. This resulted in a 2.8 GB (uncompressed) dataset with the following contents:

• 75,414,969 revisions
• 3,796,460 articles
• 176,015 users

The number of revisions comes down to a coverage of about 10% of all edits ever made on articles. The articles included in the dataset cover 73% of all articles on Wikipedia; not all known articles are included because some articles were never edited by users with a known gender, or were not categorized. There is a discrepancy of 81,142 users compared with the number mentioned earlier for the user dataset. The underlying reason is that not every edit is made on an article: it could also have been made on another kind of page, or the category was simply not known.

When looking specifically at the revisions and the gender of the editors, it is clear that male editors are far more represented. A total of 69,435,179 edits were made by males and only 5,979,790 revisions were the work of females. This means 92% of the revisions were made by males and 8% by females. These figures are similar to the ones mentioned earlier in the introduction and literature review.

Figure 2 shows the distribution of the number of revisions per gender. Noticeable is that more men have made a large number of revisions.


Figure 2: Distribution of the number of revisions per gender on a log-log scale.

3.2 Methods

After establishing these datasets it was possible to start answering the research question and subquestions. The first step was to generate statistics per category: for every known category a number of statistics was calculated, such as the number of articles and edits. This required joining information from several datasets to create a dataset per category. These datasets are not disjoint; the dataset of a parent category also contains all data of the underlying categories. This results in datasets that become more specific per level of nesting in the category tree.
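This roll-up can be illustrated with a small sketch: every revision counted for a class is also counted for all its ancestor classes, so a parent category aggregates its children. The parent map below is a hypothetical fragment of the ontology tree.

    from collections import Counter

    # Hypothetical fragment of the ontology tree: class -> parent class.
    parent = {"Game": "Activity", "Sport": "Activity",
              "Activity": "Thing", "Thing": None}

    def count_edits(revisions, parent):
        """revisions: iterable of (category, gender) pairs, one per edit.
        Counts every edit for its category and all ancestor categories."""
        edits = Counter()
        for category, gender in revisions:
            node = category
            while node is not None:        # walk up to the root 'Thing'
                edits[node, gender] += 1
                node = parent.get(node)
        return edits

    print(count_edits([("Game", "male"), ("Sport", "female")], parent))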

Table 2 shows a sample of the data that was generated; a full version of this table is included in the appendix.

Category       Articles   Edits     ♂-edits   ♀-edits   Editors   ♂-editors   ♀-editors
Amphibian      5642       26705     93%       7%        2170      91%         9%
Band           30585      1029719   93.5%     6.5%      22703     89.5%       10.5%
Conifer        657        7596      93%       7%        1368      92.5%       7.5%
Drug           9978       308589    91%       9%        13614     92%         8%
Entomologist   1501       20712     91%       9%        2401      91%         9%

Table 2: Sample of generated statistics per category.

It was necessary to calculate per category whether a certain gender was more represented or not. Testing this based on just the editor-gender distribution per category would not suffice: from the previously mentioned statistics it became clear that every category contains more male than female editors (the category 'Artist' has the lowest relative number of male editors, with a share of 85.19%). Furthermore, a stronger representation of a gender in a category does not have to mean a higher representation of that gender when it comes to activity. Activity in a category is represented by the revisions that have been made in that category.

To statistically test the categories, Fisher's exact test was chosen. To use this test, a contingency table had to be made for each category based on two groups, each split by gender. The first group was the 'observed' group, established by counting the revisions actually made by male and female editors. The second group was the 'expected' group, for which the expected number of revisions was calculated for each gender by taking the gender distribution of the editors in the category and dividing the total number of revisions accordingly. The expected group represents the null hypothesis that both genders are equally active in every category. Table 3 shows an example of a contingency table for the category 'Fashion'. In this case 86.9% of the editors are male; translating this to 86.9% of the revisions results in 3065 expected male revisions, and the remaining revisions are the expected number of female revisions.

              Observed   Expected   Row total
Male          3016       3065       6081
Female        512        463        975
Column total  3528       3528       7056

Table 3: Contingency table for the category 'Fashion'.

The outcome of Fisher's exact test is a p-value which describes the significance of the deviation between the two groups. The chosen significance level is p < 0.01, with Bonferroni correction applied. This correction prevents type I errors by making it harder for a difference to be accepted as significant: the significance level is divided by the number of revisions that have been made in the category.

The outcome for the example shown in table 3 is p = 0.098. This makes the difference between the groups not significant, resulting in acceptance of the null hypothesis. In other words, in the category 'Fashion' male and female editors are equally active.
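A minimal sketch of this test for the 'Fashion' example, assuming SciPy; the numbers are taken from table 3, and truncating the expected male count is an assumption that reproduces the 3065 shown there.

    from scipy.stats import fisher_exact

    revisions = 3528                        # revisions made in 'Fashion'
    expected_male = int(0.869 * revisions)  # 3065: 86.9% of the revisions
    expected_female = revisions - expected_male  # 463
    table = [[3016, expected_male],         # male: observed vs. expected
             [512, expected_female]]        # female: observed vs. expected

    _, p = fisher_exact(table)
    alpha = 0.01 / revisions                # Bonferroni correction as described above
    print(f"p = {p:.3f}, significant: {p < alpha}")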

However, when a difference is accepted as significant, there is an overrepresentation of a gender in that category. To determine whether this overrepresentation is male or female, a calculation is made based on the observed and expected numbers of female edits. Dividing the observed number of female edits by the expected number yields a measure of representation for women. If this figure is smaller than 1, the expected number of female edits is instead divided by the observed number, and the resulting figure is made negative. A negative measure of representation means the category has more edit activity by males than expected, and a positive one means the opposite. The absolute value also indicates the degree of representation of women in that category: it shows how many times the expected amount of female activity is actually present.

To prevent very small categories from distorting the results, a threshold was introduced for the number of revisions that needed to have been made in a category. The threshold was set at one thousand revisions. This resulted in 31 categories being excluded, leaving 470 categories including the main category 'Thing'. For all these categories Fisher's exact test was applied as described above and a dataset with further statistics was created.


4 Evaluation

4.1 Categories with an overrepresentation

RQ1 Are there any categories on Wikipedia with a significant overrepresentation of a certain gender when looking at edit activity?

Fisher's exact test on the 470 categories shows that 358 categories differ significantly. That means that for 76% of the categories the distribution of editing activity between the genders does not match the gender distribution of the editors in that category. Within this number, 124 categories have a female overrepresentation and 234 categories a male overrepresentation; relatively speaking this comes down to 35% of the categories being 'female' and 65% being 'male'.

From these results, a top five of the most 'male' and 'female' categories is shown in tables 4 and 5. A full table containing all results is attached in the appendix.

Category            Articles   Revisions   ♀-representation
YearInSpaceflight   60         7805        -28.31
Asteroid            7219       34049       -15.71
BaseballSeason      286        3789        -7.89
MotorsportSeason    2305       66435       -7.08
FormulaOneTeam      146        14791       -6.87

Table 4: Top five categories where male editors are most overrepresented.

Category            Articles   Revisions   ♀-representation
FigureSkater        2688       35861       6.00
Skater              342        2804        4.87
Garden              4040       44293       4.75
GaelicGamesPlayer   3110       18871       3.94
Mollusca            21528      115103      3.75

Table 5: Top five categories where female editors are most overrepresented.

4.2 Analysis of overrepresented categories

RQ2 Do categories with an overrepresentation show specific traits when comparing them by gender?

A recurring theme in the top five categories where male editors are most overrepresented is sports. This trend is seen throughout all significant 'male' categories; besides sports, other recurring subjects are transport and politics. Compared to categories with a male overrepresentation, the categories with a female overrepresentation show less obvious recurring themes, although many of them are more or less culture related.


Figure 3: Distribution of overrepresentation per category. Distribution for male categories is on the left, female on the right.

In figure 3 the distribution of overrepresentation of all significant categories is visualized per gender. It shows that categories with a male overrepresentation are distributed slightly more evenly, while categories with a female overrepresentation tend to have a stronger overrepresentation than male categories.

The size of a category also plays a role when looking further into the properties of categories with an overrepresentation. The size of a category can be defined in multiple ways. In figure 4 size is defined by the number of articles; comparing these graphs makes clear that categories where male editors are overrepresented contain more articles. The graphs in figure 5 measure category size by the number of revisions and show a similar outcome. Based on these graphs, categories with a male overrepresentation are more extensive. Comparing the mean and median of the numbers of articles and revisions for both genders confirms this observation; these figures are shown in table 6.

          Articles            Revisions
          Mean      Median    Mean      Median
Male      83962     3988      1888395   84836
Female    15830     3766      214138    76902

Table 6: Mean and median of articles and revisions for male and female categories.

With use of the category tree a comparison can also be made between high-level categories and more nested categories. The category tree contains 30 high-level categories that fall directly under the category 'Thing'. Five of those categories do not have a significant overrepresentation. The 25 categories that do have an overrepresentation are distributed as evenly per gender as possible: 13 of the main categories have a male overrepresentation and 12 a female overrepresentation. These 25 categories are shown in table 7.


Figure 4: Distribution of the number of articles per category with an overrepresentation. The distribution for male categories is on the left, female on the right. This shows the individual number of articles per category.

Figure 5: Distribution of the number of revisions per category with an overrepresentation. The distribution for male categories is on the left, female on the right. This shows the individual number of revisions per category.


Category                 Articles   Revisions   ♀-representation   Main gender
Agent                    1603676    33840461    -1.53              Male
AnatomicalStructure      4430       101337      2.40               Female
Area                     51110      1079342     1.39               Female
Award                    38331      2205560     -1.35              Male
Biomolecule              17074      142201      -1.30              Male
ChemicalSubstance        19670      456236      1.09               Female
Colour                   211        14690       1.27               Female
Device                   13514      438271      -1.17              Male
Disease                  5816       357521      1.13               Female
EthnicGroup              4655       290413      1.12               Female
Event                    102398     3056249     -1.43              Male
Food                     11513      388447      1.29               Female
Holiday                  922        63799       1.24               Female
List                     713896     16206171    -1.41              Male
MeanOfTransportation     56936      1253144     -1.77              Male
Name                     32114      295704      1.28               Female
PersonFunction           907        28736       1.20               Female
Place                    852371     12803961    -1.11              Male
Species                  227059     1797351     1.12               Female
SportCompetitionResult   798        23379       -1.76              Male
SportsSeason             60340      1903050     -2.53              Male
TimePeriod               2096       89426       1.37               Female
TopicalConcept           38717      1872661     -1.21              Male
UnitOfWork               2547       52265       -2.10              Male
Work                     2260598    55340530    -1.63              Male

Table 7: The 25 'main categories' and their properties.

5 Conclusion

Main research question Does there exist a correlation between the gender of Wikipedia editors and the categories where they are active?

First of all it is very clear that Wikipedia is indeed very much male dominated when it comes to editing activity, exactly as expected based on past research. In the revision history dataset that was used, 85% of the editors are male, and every identified category has a majority of male editors. Only three out of 501 categories have a female majority when it comes to editing activity.

When looking closer at the categories, some interesting findings appear. For only 24% of the categories can it be said that there is no connection between the gender of a Wikipedia editor and their activity in that category. The other 76% of the categories do show a significant connection between the gender of the editor and the category in which that editor is active. Within this number there is a distribution of 35% of the categories being 'female' and 65% being 'male'. The categories that are truly male dominated are mostly sports, transport and politics related.

One of the most notable categories with a female overrepresentation based on edit activity is 'Mollusca'. Despite having almost four times as much edit activity by women as was expected, still 67% of all its edits were made by men. There are only three categories with a majority of female editing activity: 'FigureSkater' (65% female edits), 'Skater' (51%) and 'Garden' (54%).

The main category 'Thing' contains all data used in the underlying categories. This category also differs significantly, and the activity in it is predominantly by male Wikipedians, which again confirms the male bias on Wikipedia. However, the findings of this research show that the assumed gender bias on Wikipedia might not be as black-and-white as stated in a lot of research. In the majority of categories, editing activity is unequally distributed between male and female editors under the assumption that both genders would be equally active; moreover, the overrepresentation in these categories is far from exclusively male. In short this leads to the conclusion that yes, in most cases (76% of the categories) there is a correlation between the gender of Wikipedia editors and the categories where they are active.

5.1 Discussion

The sample of editors that was used might not be representative of all Wikipedia editors. Although the sample covers all users who registered their gender, this group might have some unique traits. This suspicion is strengthened by the fact that the sample covers only about 1% of all Wikipedia users but still covers over 10% of all revisions ever made on Wikipedia, which suggests that these users might be unusually active. Furthermore, as also mentioned in previous research, there is no way to check whether users have truthfully registered their gender, and a single user account could be used by multiple people of different genders [Antin et al., 2011].

Despite these concerns, a series of rich datasets has been created, of which this research merely touches a small part of the information that could be gathered. More intensive statistics could be applied to compare categories. There is also metadata available containing edit and registration dates, which could be used to provide a unique insight into shifts of activity per category over time, for example. Furthermore, this research has a quantitative character and treats all revisions as equal. There could be a difference in size and quality of revisions between men and women; this research provides a good basis for further qualitative research to investigate that.

5.2 Acknowledgements

Hereby I would like to thank my thesis supervisor Dr. M. J. Marx. His passionate participation and input have helped me shape this research greatly. Furthermore I would like to thank Mr. Ariel T. Glenn, DevOps engineer at the Wikimedia Foundation. Without his help I would have been forced to use a much smaller sample of Wikipedians. Providing me with a custom-made dataset with the gender of all Wikipedia users who have it registered has had an incredible influence on the validity and extent of this research.


References

[Alexa, 2016] Alexa (2016). wikipedia.org site overview. http://www.alexa.com/siteinfo/wikipedia.org. (Accessed on 12-04-2016).

[Antin et al., 2011] Antin, J., Yee, R., Cheshire, C., and Nov, O. (2011). Gender differences in wikipedia editing. In Proceedings of the 7th International Symposium on Wikis and Open Collaboration, pages 11–14. ACM.

[Bizer et al., 2009] Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., and Hellmann, S. (2009). Dbpedia: a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165.

[Collier and Bear, 2012] Collier, B. and Bear, J. (2012). Conflict, confidence, or criticism: An empirical examination of the gender gap in wikipedia. In CSCW12: Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, pages 383–392.

[DBpedia, 2014] DBpedia (2014). Dbpedia as tables - dbpedia. http://wiki.dbpedia.org/services-resources/downloads/dbpedia-tables. (Accessed on 04-05-2016).

[DBpedia, 2016] DBpedia (2016). Downloads 2015-10 - dbpedia. http://wiki.dbpedia.org/Downloads2015-10. (Accessed on 12-04-2016).

[Dojchinovski and Kliegr, 2013] Dojchinovski, M. and Kliegr, T. (2013). Entityclassifier.eu: Real-time classification of entities in text with wikipedia. In Machine Learning and Knowledge Discovery in Databases, pages 654–658. Springer.

[Glenn, 2016] Glenn, A. T. (2016). private communication.

[Graells-Garrido et al., 2015] Graells-Garrido, E., Lalmas, M., and Menczer, F. (2015). First women, second sex: Gender bias in wikipedia. In Proceedings of the 26th ACM Conference on Hypertext & Social Media, pages 165–174. ACM.

[Hargittai and Shaw, 2015] Hargittai, E. and Shaw, A. (2015). Mind the skills gap: The role of internet know-how and gender in differentiated contributions to wikipedia. Information, Communication & Society, 18(4):424–442.

[Hill and Shaw, 2013] Hill, B. M. and Shaw, A. (2013). The wikipedia gender gap revisited: Characterizing survey response bias with propensity score estimation. PloS one, 8(6):e65782.

[Holze, 2016] Holze, J. (2016). Dbpedia @ google summer of code 2016. http://blog.dbpedia.org/?p=190.

[Kim, 2013] Kim, J. (2013). Wikipedians from mars: Female students' perceptions toward wikipedia. Proceedings of the American Society for Information Science and Technology, 50(1):1–4.


[Kittur et al., 2009] Kittur, A., Chi, E. H., and Suh, B. (2009). What’s in wikipedia?: mapping topics and conflict using socially annotated category structure. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 1509–1512. ACM.

[Klein, 2015] Klein, M. (2015). Wikipedia in the world of global gender inequality indices: what the biography gender gap is measuring. In Proceedings of the 11th International Symposium on Open Collaboration, page 16. ACM.

[Lama et al., 2012] Lama, M., Vidal, J. C., Otero-García, E., Bugarín, A., and Barro, S. (2012). Semantic linking of learning object repositories to dbpedia. Educational Technology & Society, 15(4):47–61.

[Lehmann et al., 2015] Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., et al. (2015). Dbpedia: a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web, 6(2):167–195.

[Mannheim, 2014] Mannheim (2014). Ontology classes. http://web.informatik.uni-mannheim.de/DBpediaAsTables/DBpediaClasses.htm. (Accessed on 12-04-2016).

[Marx and Alberts, 2016] Marx, M. and Alberts, H. (2016). Gender bias on wikipedia's person pages?

[Massa and Zelenkauskaite, 2014] Massa, P. and Zelenkauskaite, A. (2014). Gender gap in wikipedia editing.

[Matias et al., 2015] Matias, J. N., Diehl, S., and Zuckerman, E. (2015). Passing on: Reader-sourcing gender diversity in wikipedia. In Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems, pages 1073–1078. ACM.

[Medelyan et al., 2008] Medelyan, O., Witten, I. H., and Milne, D. (2008). Topic indexing with wikipedia. In Proceedings of the AAAI WikiAI workshop, volume 1, pages 19–24.

[Mendes et al., 2012] Mendes, P. N., Jakob, M., and Bizer, C. (2012). Dbpedia: A multilingual cross-domain knowledge base. In LREC, pages 1813–1817. Citeseer.

[Reagle and Rhue, 2011] Reagle, J. and Rhue, L. (2011). Gender bias in wikipedia and britannica. International Journal of Communication, 5:21.

[Wagner et al., 2015] Wagner, C., Garcia, D., Jadidi, M., and Strohmaier, M. (2015). It's a man's wikipedia? Assessing gender inequality in an online encyclopedia. arXiv preprint arXiv:1501.06307.

[Wikimedia, 2011] Wikimedia (2011). Wikipedia editors study. Results from the editor survey.

[Wikimedia, 2016a] Wikimedia (2016a). Api:users - mediawiki. https://www.mediawiki.org/wiki/API:Users. (Accessed on 29-03-2016).


[Wikimedia, 2016b] Wikimedia (2016b). enwiki dump progress on 20160204. https://dumps.wikimedia.org/enwiki/20160204/. (Accessed on 15-03-2016).

[Wikimedia, 2016c] Wikimedia (2016c). Gender gap - meta. https://meta.wikimedia.org/wiki/Gender_gap. (Accessed on 20-04-2016).

[Wikimedia, 2016d] Wikimedia (2016d). Wikimedia downloads - data dumps. https://dumps.wikimedia.org/. (Accessed on 14-03-2016).

[Wikipedia, 2016a] Wikipedia (2016a). List of wikipedias - wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/List_of_Wikipedias. (Accessed on 20-05-2016).

[Wikipedia, 2016b] Wikipedia (2016b). Minor edit - wikipedia. https://en.wikipedia.org/wiki/Help:Minor_edit. (Accessed on 28-03-2016).

[Wikipedia, 2016c] Wikipedia (2016c). Wikipedia:statistics - wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Wikipedia:Statistics. (Accessed on 25-04-2016).

[Wikipedia, 2016d] Wikipedia (2016d). Wikipedia:wikiproject countering systemic bias/gender gap task force/research - wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Countering_systemic_bias/Gender_gap_task_force/research.

A Category tree

Figure: full version of the tree structure of the DBpedia ontology classes (see figure 1 for a sample).

B Category statistics

Category Articles Edits Editors Male edits Female edits Males Females SpeedwayLeague 15 94 23 94 0 23 0 MixedMartialArtsLeague 1 39 19 39 0 19 0 AustralianFootballLeague 3 24 14 24 0 14 0 Ginkgo 4 21 11 21 0 11 0 AcademicJournal 5441 60407 2584 55941 4466 2334 250 Activity 1554 91996 7603 85920 6076 7096 507 Actor 65801 2529372 36357 2228064 301308 31679 4678 AdministrativeRegion 93924 1652540 27316 1521018 131522 25163 2153 AdultActor 1308 54409 3952 48495 5914 3603 349 Agent 1603676 33840461 124769 30995163 2845298 108737 16032 Aircraft 9748 324208 7026 314928 9280 6712 314 Airline 3342 160835 5630 156836 3999 5337 293 Airport 13041 331103 7595 318778 12325 7187 408 Album 113774 1805129 18854 1728333 76796 17271 1583 AmateurBoxer 375 2123 457 2037 86 427 30 Ambassador 7961 161634 8135 150417 11217 7470 665 AmericanFootballCoach 312 14098 1517 13283 815 1422 95 AmericanFootballLeague 84 6642 983 6472 170 921 62 AmericanFootballPlayer 13689 352904 7100 337273 15631 6565 535 AmericanFootballTeam 77 22352 2507 21403 949 2350 157 Amphibian 5642 26705 2170 24837 1868 1980 190 AmusementParkAttraction 1254 48182 2684 46215 1967 2469 215 AnatomicalStructure 4430 101337 6524 80528 20809 5966 558 Animal 175171 1345359 17267 1190111 155248 15562 1705 AnimangaCharacter 229 32278 2153 30416 1862 1924 229 Anime 1216 56900 3581 53155 3745 3213 368 Arachnid 3687 24356 2898 22707 1649 2660 238 Archaea 226 1666 348 1564 102 330 18 Archipelago 2848 90894 6857 82079 8815 6397 460 Architect 2175 37801 3430 32566 5235 3109 321 ArchitecturalStructure 228384 3902229 36127 3604625 297604 33223 2904 Archive 2261 111471 8383 102881 8590 7692 691 Area 51110 1079342 22058 955329 124013 20229 1829 Artery 367 3223 457 2678 545 409 48 Article 2027644 50837992 158203 46831098 4006894 138141 20062 ArtificialSatellite 1912 52437 3785 49729 2708 3587 198 Artist 222644 6508470 58353 5821426 687044 50415 7938 ArtistDiscography 3142 294664 5507 284803 9861 5036 471 Artwork 3974 71960 4897 63578 8382 4459 438 Asteroid 7219 34049 516 33902 147 481 35 Astronaut 629 23297 3081 21110 2187 2864 217 Athlete 282829 5010445 28977 4740233 270212 26709 2268 AustralianFootballTeam 369 11115 1088 10700 415 1007 81 AustralianRulesFootballPlayer 6726 81120 2262 78801 2319 2088 174 Automobile 5195 239527 7633 235305 4222 7234 399 AutomobileEngine 333 9234 1024 9063 171 971 53 AutoRacingLeague 7 169 41 168 1 40 1 Award 38331 2205560 35935 2040130 165430 32295 3640 Bacteria 524 6394 1292 5738 656 1175 117 BadmintonPlayer 568 6237 843 6027 210 785 58 Band 30585 1029719 22703 962867 66852 20310 2393 Bank 4285 86932 6698 81878 5054 6236 462 Baronet 5382 68843 3076 54017 14826 2811 265 BaseballLeague 207 7794 1091 7588 206 1028 63 BaseballPlayer 20160 372290 6553 357504 14786 6062 491 BaseballSeason 286 3789 142 3745 44 129 13 BaseballTeam 440 23086 2331 21797 1289 2181 150 BasketballLeague 378 18194 1565 17906 288 1489 76 BasketballPlayer 7993 292791 6189 284317 8474 5759 430 BasketballTeam 1153 57436 2925 56082 1354 2758 167 BeachVolleyballPlayer 104 999 290 948 51 265 25 BeautyQueen 1964 24115 2929 21073 3042 2641 288 Beverage 3489 110105 9240 100872 9233 8548 692 BiologicalDatabase 323 2188 438 2080 108 399 39 Biologist 10076 176069 8660 158428 17641 7828 832 Biomolecule 17074 142201 6428 132290 9911 5842 586 Bird 12382 176216 5927 159996 16220 5404 523 Bodybuilder 209 3550 915 3321 229 833 82 BodyOfWater 28956 270621 9713 240160 30461 9027 686 Bone 415 6485 921 5095 1390 826 95 Book 54237 1191548 25184 1071036 120512 22399 2785 BowlingLeague 1 
79 13 77 2 11 2
Boxer 3536 74401 4546 71487 2914 4232 314
Brain 547 9873 1459 6952 2921 1314 145
Brewery 360 5616 970 5245 371 901 69
Bridge 3456 74065 4669 67649 6416 4388 281
BritishRoyalty 13351 481515 10797 414035 67480 9795 1002
Broadcaster 34265 963441 19026 914867 48574 17448 1578
BroadcastNetwork 1197 47863 4110 46169 1694 3822 288
Building 71219 1061880 19803 933188 128692 18080 1723
BusCompany 1286 41507 2446 40366 1141 2292 154
BusinessPerson 885 7119 1530 6336 783 1390 140
CanadianFootballLeague 7 973 188 957 16 176 12
CanadianFootballTeam 25 2805 366 2724 81 334 32
Canal 287 14468 1299 14031 437 1212 87
Canoeist 4151 20529 969 20091 438 883 86
Cardinal 919 20830 1939 20137 693 1787 152
Cartoon 3341 146504 7266 136957 9547 6581 685
Case 2526 52086 3352 49827 2259 3050 302
Casino 614 11673 1970 10540 1133 1823 147
Castle 1225 13822 1360 11448 2374 1234 126
Cave 392 6230 880 5443 787 801 79
CelestialBody 25837 315341 8547 305909 9432 8013 534
Chancellor 85 3889 792 3515 374 730 62
Cheese 304 2018 657 1755 263 577 80
Chef 448 14458 2374 13124 1334 2136 238
ChemicalCompound 10302 168152 8259 155131 13021 7660 599
ChemicalSubstance 19670 456236 16096 417019 39217 14838 1258
ChessPlayer 20884 280625 6663 246465 34160 6107 556
ChristianBishop 5648 96436 4289 89465 6971 3945 344
City 20489 953984 24302 891453 62531 22502 1800
ClassicalMusicArtist 305 1943 480 1744 199 439 41
ClassicalMusicComposition 584 4624 692 4288 336 628 64
Cleric 12441 277249 9247 257803 19446 8480 767
ClericalAdministrativeRegion 13252 220423 7046 206013 14410 6551 495
ClubMoss 84 859 174 805 54 160 14
Coach 6425 162955 3962 157889 5066 3674 288
College 91 5679 915 5242 437 826 89
CollegeCoach 6085 148588 3560 144359 4229 3290 270
Colour 211 14690 2563 13159 1531 2352 211
Comedian 1139 95962 8121 87613 8349 7436 685
ComedyGroup 56 2628 844 2365 263 768 76
ComicsCharacter 4435 189791 6390 182866 6925 5845 545
ComicsCreator 2525 55322 3804 52267 3055 3465 339
ComicStrip 365 2569 667 2380 189 618 49
Community 47824 787898 22389 708252 79646 20672 1717
Company 128070 3281731 49914 3066776 214955 45399 4515
Competition 36772 878914 15001 847950 30964 14021 980
Congressman 3275 63885 3744 60004 3881 3430 314
Conifer 657 7596 1368 7071 525 1265 103
Contest 110 2298 833 2089 209 770 63
Continent 314 32648 4540 30981 1667 4249 291
Convention 2859 83582 8225 77552 6030 7632 593
Country 3574 548288 17491 518362 29926 16310 1181
Crater 2121 19426 845 19068 358 792 53
Cricketer 11145 146896 5061 142092 4804 4746 315
CricketGround 251 1189 140 1039 150 131 9
CricketLeague 9 289 83 285 4 79 4
CricketTeam 519 14252 1275 13656 596 1204 71
Criminal 2220 132999 7915 120940 12059 7222 693
Crustacean 1966 12450 2080 10444 2006 1900 180
CultivatedVariety 697 6682 1559 5800 882 1412 147
Curler 914 14303 686 14078 225 621 65
CurlingLeague 3 223 23 222 1 22 1
Currency 352 24365 3464 23247 1118 3297 167
Cycad 116 598 208 510 88 185 23
CyclingRace 574 7721 838 7220 501 780 58
CyclingTeam 251 8200 453 7529 671 420 33
Cyclist 6228 75307 3437 64993 10314 3169 268
Dam 3978 48661 3759 43445 5216 3519 240
DartsPlayer 464 9984 743 9783 201 683 60
Database 2584 113659 8499 104961 8698 7796 703
Desert 1177 24744 3343 21996 2748 3087 256
Device 13514 438271 18179 418303 19968 17207 972
Diocese 3041 42462 2009 41015 1447 1855 154
Disease 5816 357521 13547 315930 41591 12156 1391
District 68324 1107214 23298 1018082 89132 21530 1768
Document 4506 158430 11415 149181 9249 10646 769
Drama 2589 129203 7801 114411 14792 6975 826
Drug 9978 308589 13614 280121 28468 12540 1074
Economist 4954 120799 7605 111244 9555 7019 586
EducationalInstitution 94131 2287380 40805 2053625 233755 36685 4120
Egyptologist 2547 166694 10283 153125 13569 9416 867
Election 6713 211883 5855 200143 11740 5490 365
Embryology 193 2590 671 1976 614 601 70
Engine 333 9234 1024 9063 171 971 53
Engineer 725 15992 2192 14950 1042 2027 165
Entomologist 1501 20712 2401 18894 1818 2184 217
Enzyme 4736 22899 1249 22192 707 1134 115
EthnicGroup 4655 290413 10962 264425 25988 10090 872
Eukaryote 221788 1718336 20773 1521558 196778 18660 2113
EurovisionSongContestEntry 1103 12844 856 12381 463 785 71
Event 102398 3056249 34998 2894892 161357 32343 2655
Fashion 340 3528 1150 3016 512 999 151
FashionDesigner 615 14336 2385 11951 2385 2104 281
Fencer 2470 11322 1281 9685 1637 1174 107
Fern 817 7044 701 6604 440 628 73
FictionalCharacter 11514 627158 14103 531984 95174 12551 1552
FieldHockeyLeague 24 823 130 814 9 122 8
FigureSkater 2688 35861 1977 12340 23521 1761 216
Film 112675 2916559 35713 2719023 197536 32054 3659
FilmFestival 677 9776 1363 8991 785 1215 148
Fish 15097 110482 5165 97125 13357 4749 416
FloweringPlant 1047 10959 1677 10039 920 1535 142
Food 11513 388447 16725 349185 39262 15338 1387
FootballLeagueSeason 9585 303826 3816 300334 3492 3653 163
FootballMatch 2359 44930 1988 44247 683 1892 96
FormulaOneRacer 825 58632 2440 57872 760 2284 156
FormulaOneTeam 146 14791 902 14674 117 853 49
Fungus 5830 34454 2390 31479 2975 2179 211
GaelicGamesPlayer 3110 18871 1329 12103 6768 1208 121
Galaxy 611 8636 823 8420 216 777 46
Game 1290 42586 4680 40639 1947 4353 327
Garden 342 2804 417 1303 1501 370 47
Gene 12 114 33 110 4 29 4
Genre 12095 660512 24478 610108 50404 22283 2195
GivenName 2442 64729 6151 55155 9574 5594 557
Glacier 633 6142 512 5924 218 485 27
Gnetophytes 28 169 71 141 28 61 10
GolfCourse 333 2357 492 1965 392 448 44
GolfLeague 14 1553 153 1523 30 139 14
GolfPlayer 2495 31081 2377 29243 1838 2187 190
GolfTournament 1579 15535 884 14597 938 832 52
GovernmentAgency 3892 80562 7131 76028 4534 6663 468
GovernmentalAdministrativeRegion 68324 1107214 23298 1018082 89132 21530 1768
Governor 2507 62681 4091 58500 4181 3773 318
GrandPrix 1242 40863 1144 39690 1173 1076 68
Grape 425 4202 696 3930 272 630 66
GreenAlga 314 1364 289 1270 94 261 28
GridironFootballPlayer 20251 415001 7613 397017 17984 7034 579
Group 85629 2676317 42477 2484192 192125 38216 4261
Guitarist 145 1692 529 1583 109 482 47
Gymnast 2002 19334 2317 17320 2014 2094 223
HandballLeague 25 527 77 520 7 71 6
HandballPlayer 1106 7869 482 7772 97 448 34
HandballTeam 334 2967 303 2926 41 281 22
Historian 16276 302103 10819 272774 29329 9855 964
HistoricBuilding 6137 91291 4136 83649 7642 3813 323
HistoricPlace 13292 140648 5456 119585 21063 5002 454
HockeyTeam 1961 76016 2898 73197 2819 2692 206
Holiday 922 63799 7343 57838 5961 6789 554
HollywoodCartoon 1569 22744 1943 21229 1515 1771 172
Horse 2834 20036 2169 18221 1815 1951 218
HorseRace 2213 26375 1152 24052 2323 1056 96
HorseRider 367 1336 270 931 405 241 29
HorseTrainer 202 1107 305 1000 107 270 35
Hospital 2561 33446 3224 29754 3692 2959 265
Host 2232 160993 10084 148004 12989 9240 844
Hotel 1092 21130 2079 19703 1427 1918 161
HumanGene 12 114 33 110 4 29 4
Humorist 729 29184 4074 26136 3048 3711 363
IceHockeyLeague 272 14534 1034 13905 629 982 52
IceHockeyPlayer 12732 177884 4765 170179 7705 4385 380
Ideology 2609 190339 10316 176753 13586 9538 778
InformationAppliance 1016 88390 7556 85548 2842 7198 358
Infrastructure 123379 2200622 24289 2085728 114894 22684 1605
InlineHockeyLeague 21 203 82 200 3 79 3
Insect 90741 376455 5478 339688 36767 4984 494
Instrument 5996 150277 10882 139556 10721 10133 749
Instrumentalist 145 1692 529 1583 109 482 47
Island 4068 85518 6050 76696 8822 5639 411
Jockey 765 7590 1363 6877 713 1219 144
Journalist 26110 631711 17941 571011 60700 16224 1717
Judge 13070 281173 10400 259941 21232 9457 943
LacrosseLeague 32 1844 182 1801 43 171 11
LacrossePlayer 348 6160 508 5557 603 457 51
Lake 9845 87660 5662 75017 12643 5243 419
Language 19348 518303 19053 482197 36106 17733 1320
LaunchPad 83 1587 202 1456 131 190 12
LawFirm 445 7704 1089 7030 674 962 127
LegalCase 2526 52086 3352 49827 2259 3050 302
Legislature 1438 48217 4369 44814 3403 4108 261
Letter 678 30156 3924 27921 2235 3660 264
Library 795 14809 1955 12402 2407 1760 195
Lieutenant 1678 42796 3652 36421 6375 3354 298
Ligament 194 1126 235 779 347 208 27
Lighthouse 1480 30000 1471 27455 2545 1349 122
Linguist 4876 87580 6285 78728 8852 5746 539
Lipid 387 7654 1370 6336 1318 1237 133
List 713896 16206171 87629 14793767 1412404 76864 10765
Locomotive 5046 82947 3166 80988 1959 2989 177
Lymph 81 426 81 268 158 69 12
Magazine 4116 66797 6334 60347 6450 5758 576
Mammal 13700 383984 12181 350454 33530 11002 1179
Manga 2854 135225 5105 122372 12853 4482 623
MartialArtist 2630 102839 4478 100068 2771 4172 306
Mayor 13957 224709 9051 206636 18073 8351 700
MeanOfTransportation 56936 1253144 18180 1214823 38321 17197 983
Media 90 2454 756 2307 147 706 50
Medician 292 9730 1742 8554 1176 1554 188
Medicine 221 20503 3074 18751 1752 2829 245
Meeting 536 10984 2353 9815 1169 2172 181
MemberOfParliament 8920 138737 4756 107277 31460 4400 356
MilitaryConflict 13226 717703 14824 679131 38572 13903 921
MilitaryPerson 24931 590945 11267 549925 41020 10409 858
MilitaryStructure 3426 63076 4391 56068 7008 4120 271
MilitaryUnit 15649 387687 9833 374103 13584 9254 579
Mill 1075 19299 1916 17264 2035 1752 164
Mine 1986 32512 2879 30407 2105 2673 206
Mineral 1175 20533 2615 19110 1423 2390 225
MixedMartialArtsEvent 639 24996 1125 24714 282 1057 68
Model 8892 449042 18033 392666 56376 16062 1971
Mollusca 21528 115103 2682 77108 37995 2446 236
Monarch 2667 96730 5859 91722 5008 5413 446
Monument 7875 212441 10667 189571 22870 9847 820
Mosque 1192 31103 2956 28974 2129 2742 214
Moss 318 1347 344 1213 134 299 45
Motorcycle 908 19353 1545 18872 481 1445 100
MotorcycleRacingLeague 21 329 95 322 7 92 3
MotorcycleRider 1634 27753 1261 27295 458 1173 88
MotorsportRacer 5323 198500 4648 195026 3474 4332 316
MotorsportSeason 2305 66435 1256 65822 613 1174 82
Mountain 12797 116393 5937 106252 10141 5521 416
MountainPass 956 9935 1248 8709 1226 1166 82
MountainRange 2515 36451 3318 33913 2538 3073 245
Murderer 91 10505 1813 9260 1245 1635 178
Muscle 282 5799 930 4868 931 848 82
Museum 4082 82499 5643 66588 15911 5120 523
Musical 1244 35239 3594 32066 3173 3220 374
MusicalArtist 62873 2274862 34889 2044843 230019 30472 4417
MusicalWork 170261 3416126 27087 3264803 151323 24592 2495
MusicFestival 378 3200 702 2900 300 636 66
MusicGenre 3822 305399 17144 284524 20875 15810 1334
MythologicalFigure 749 4840 837 4438 402 761 76
Name 32114 295704 13621 265668 30036 12541 1080
NascarDriver 892 62973 2124 62165 808 1961 163
NationalCollegiateAthleticAssociationAthlete 297 3524 571 3334 190 515 56
NationalFootballLeagueEvent 9 2078 537 2010 68 501 36
NationalFootballLeagueSeason 3082 69562 1518 68423 1139 1435 83
NaturalEvent 382 1450 140 1434 16 131 9
NaturalPlace 53186 602420 14960 543884 58536 13869 1091
NCAATeamSeason 7689 139903 1900 132071 7832 1780 120
Nerve 336 3708 536 3044 664 479 57
NetballPlayer 246 2297 327 1965 332 293 34
Newspaper 5538 72830 6247 66745 6085 5787 460
Noble 4677 84843 3177 57704 27139 2861 316
Non-ProfitOrganisation 3963 93734 7282 79798 13936 6610 672
Novel 658 6276 910 5807 469 820 90
OfficeHolder 45998 1235243 22202 1148751 86492 20447 1755
OlympicEvent 3316 46096 1685 44752 1344 1571 114
OlympicResult 764 22083 1030 21055 1028 949 81
Olympics 3374 61213 3114 59424 1789 2921 193
Organisation 478477 12643460 92264 11742033 901427 82367 9897
OrganisationMember 569 22478 954 22173 305 902 52
Orphan 417 56826 4827 52497 4329 4373 454
Painter 2418 9546 1350 7952 1594 1223 127
Parish 10211 177961 6321 164998 12963 5865 456
Park 22901 506797 14294 445089 61708 13121 1173
Parliament 508 16561 1922 15568 993 1791 131
PeriodicalLiterature 15095 200034 10004 183033 17001 9130 874
Person 1167850 23087741 94030 20987831 2099910 81579 12451
PersonFunction 907 28736 4625 25767 2969 4226 399
Philosopher 8950 376781 13776 347067 29714 12633 1143
Photographer 5298 91269 6914 79986 11283 6198 716
Place 852371 12803961 77047 11762787 1041174 70124 6923
Planet 13068 178695 6572 173129 5566 6129 443
Plant 39817 315182 10434 278997 36185 9494 940
Play 1666 30601 3478 26603 3998 3142 336
PlayboyPlaymate 291 12336 1580 11262 1074 1427 153
PlayWright 6052 155793 8643 137098 18695 7750 893
Poem 333 2311 642 2047 264 577 65
Poet 301 2436 471 1627 809 415 56
PokerPlayer 617 9549 1419 9164 385 1307 112
PoliticalParty 6581 191896 8541 181418 10478 8021 520
Politician 144829 3099025 33740 2838317 260708 30775 2965
PoloLeague 18 547 68 544 3 65 3
Polysaccharide 159 5797 1553 5249 548 1418 135
Pope 410 31709 2958 30196 1513 2745 213
PopulatedPlace 467376 7066834 63023 6504841 561993 57703 5320
PowerStation 1817 25813 2384 24337 1476 2231 153
Presenter 6771 277641 12848 252586 25055 11664 1184
President 14594 663345 17897 622856 40489 16558 1339
PrimeMinister 1485 43308 3792 41128 2180 3511 281
Prison 852 16556 2017 13753 2803 1858 159
Producer 13962 580994 19258 526287 54707 17331 1927
ProtectedArea 7960 129327 5795 108355 20972 5340 455
Protein 10358 70805 3308 67200 3605 2999 309
Psychologist 2664 55669 5099 47773 7896 4561 538
PublicTransitSystem 1472 69224 4656 66856 2368 4423 233
Publisher 2203 65600 6222 60777 4823 5736 486
Race 9601 198702 7080 188072 10630 6578 502
Racecourse 213 3442 641 3184 258 584 57
RaceHorse 2834 20036 2169 18221 1815 1951 218
RaceTrack 213 3442 641 3184 258 584 57
RacingDriver 3689 170747 4312 167731 3016 4013 299
RadioHost 311 11467 2116 10777 690 1952 164
RadioProgram 929 27508 3206 25598 1910 2951 255
RadioStation 18107 263710 6380 256513 7197 5922 458
RailwayLine 2732 78012 3528 75525 2487 3328 200
RailwayStation 6363 31937 911 30766 1171 835 76
RailwayTunnel 128 2966 693 2823 143 661 32
Rebellion 3036 146294 8882 136978 9316 8235 647
RecordLabel 4651 86725 7401 81225 5500 6785 616
Referee 1285 20894 2443 20170 724 2304 139
Region 93924 1652540 27316 1521018 131522 25163 2153
Religious 899 26762 2899 24575 2187 2691 208
ReligiousBuilding 4581 86479 5434 78223 8256 5021 413
Reptile 4947 57220 3757 53612 3608 3449 308
ResearchProject 21 179 59 167 12 54 5
Restaurant 4034 115244 8342 106178 9066 7694 648
River 18128 152444 7049 137193 15251 6571 478
Road 26471 476418 10748 436563 39855 10037 711
RoadJunction 132 2238 548 2034 204 511 37
RoadTunnel 186 4054 755 3847 207 715 40
Rocket 265 11127 1250 10993 134 1212 38
RollerCoaster 688 25885 1645 24787 1098 1527 118
RouteOfTransportation 32740 623525 12419 575252 48273 11623 796
Rower 2662 15275 1908 13986 1289 1740 168
Royalty 13351 481515 10797 414035 67480 9795 1002
RugbyClub 1924 41423 2148 40339 1084 2012 136
RugbyLeague 427 10348 1045 10152 196 988 57
RugbyPlayer 11956 216173 3998 211105 5068 3702 296
Saint 3269 97776 6422 90700 7076 5888 534
Sales 11 1627 525 1544 83 489 36
Satellite 1912 52437 3785 49729 2708 3587 198
School 58832 1485017 31105 1339407 145610 27914 3191
Scientist 60914 1259386 24935 1143873 115513 22533 2402
ScreenWriter 651 5299 1160 4653 646 1031 129
Sculptor 5094 77592 5487 65834 11758 4907 580
Senator 13108 332399 9850 309476 22923 9065 785
Settlement 357902 4296876 47216 3959891 336985 43460 3756
Single 43429 1073677 16560 1024413 49264 15091 1469
SiteOfSpecialScientificInterest 895 3964 535 3717 247 495 40
Skater 4040 44293 2566 21662 22631 2297 269
SkiArea 546 7098 1223 6136 962 1127 96
Skier 1802 15931 1440 15387 544 1316 124
Skyscraper 3 63 28 57 6 25 3
SnookerChamp 23 5540 473 5438 102 433 40
SnookerPlayer 303 15597 981 15290 307 905 76
SnookerWorldRanking 34 1296 24 1286 10 23 1
SoapCharacter 2488 198469 3813 136756 61713 3378 435
SoccerClub 21109 842935 11089 825383 17552 10529 560
SoccerClubSeason 6507 335510 2618 331646 3864 2500 118
SoccerLeague 1407 67638 3335 66890 748 3189 146
SoccerManager 13994 255815 6842 249211 6604 6444 398
SoccerPlayer 91059 1595674 12099 1558908 36766 11332 767
SoccerTournament 4828 164068 3915 161940 2128 3761 154
SocietalEvent 55749 1797061 24923 1700233 96828 23250 1673
SoftballLeague 18 114 62 99 15 54 8
Software 30429 1162033 28641 1109456 52577 27009 1632
SolarEclipse 382 1450 140 1434 16 131 9
Song 6554 148128 7976 139757 8371 7357 619
SongWriter 11925 557689 18049 506241 51448 16063 1986
Sound 1072 25919 2658 24386 1533 2463 195
Spacecraft 157 5957 1179 5765 192 1110 69
SpaceMission 1 172 79 155 17 69 10
SpaceShuttle 18 2294 616 2177 117 582 34
SpaceStation 31 6001 714 5867 134 686 28
Species 227059 1797351 21628 1592806 204545 19427 2201
SpeedwayRider 609 3777 428 3650 127 392 36
SpeedwayTeam 76 582 97 478 104 87 10
Sport 253 47783 5033 43737 4046 4709 324
SportCompetitionResult 798 23379 1047 22341 1038 965 82
SportFacility 9848 178883 8170 167022 11861 7732 438
SportsEvent 28925 663020 11647 631503 31517 10958 689
SportsLeague 3944 172334 6173 169076 3258 5878 295
SportsManager 13994 255815 6842 249211 6604 6444 398
SportsSeason 60340 1903050 14466 1849223 53827 13431 1035
SportsTeam 31147 1214608 14514 1181576 33032 13755 759
SportsTeamMember 569 22478 954 22173 305 902 52
SportsTeamSeason 24067 783051 5408 767819 15232 5169 239
Square 1435 33376 4182 31119 2257 3891 291
SquashPlayer 359 3865 438 3725 140 402 36
Stadium 8505 164797 7715 154698 10099 7318 397
Star 3198 46457 2089 45607 850 1950 139
Station 75030 1230979 18771 1179529 51450 17538 1233
Stream 18473 168186 7314 152432 15754 6821 493
SumoWrestler 456 13056 399 12252 804 362 37
SupremeCourtOfTheUnitedStatesCase 2526 52086 3352 49827 2259 3050 302
Surfer 314 6217 1565 5540 677 1419 146
Surname 30078 244909 12389 222860 22049 11485 904
Swimmer 3529 46514 2474 44455 2059 2259 215
Synagogue 1049 15225 1772 12990 2235 1607 165
TableTennisPlayer 445 3842 582 3675 167 537 45
TelevisionEpisode 7923 279238 8091 248004 31234 7390 701
TelevisionHost 42 1640 566 1560 80 528 38
TelevisionSeason 2821 225527 6313 212722 12805 5684 629
TelevisionShow 28723 1247826 24088 1148769 99057 21366 2722
TelevisionStation 7263 310862 8949 299891 10971 8334 615
TennisLeague 11 520 155 498 22 146 9
TennisPlayer 3865 123026 4650 116715 6311 4303 347
TennisTournament 4181 48629 1077 46992 1637 1014 63
Theatre 4646 76213 5446 66554 9659 4972 474
TimePeriod 2096 89426 4687 81084 8342 4369 318
TopicalConcept 38717 1872661 39722 1728677 143984 36039 3683
Tournament 10648 229545 4653 224831 4714 4454 199
Tower 1480 30000 1471 27455 2545 1349 122
Town 41047 521721 14231 476952 44769 13232 999
TradeUnion 1322 13293 2076 12013 1280 1918 158
Train 2825 68657 4747 66911 1746 4494 253
Tunnel 58 614 175 556 58 163 12
Type 19997 800119 26242 734848 65271 24003 2239
UnitOfWork 2547 52265 3370 49994 2271 3067 303
University 43215 1031980 26763 920022 111958 24424 2339
Valley 116 1289 312 1139 150 275 37
Vein 234 1611 290 1261 350 257 33
Venue 26369 478840 15433 435779 43061 14319 1114
VideoGame 18941 823069 18617 786797 36272 17439 1178
VideogamesLeague 7 1205 227 1178 27 209 18
Village 107216 759311 15867 685066 74245 14745 1122
Vodka 201 4007 1247 3686 321 1146 101
VoiceActor 536 4004 566 3763 241 504 62
Volcano 781 23415 2609 22127 1288 2422 187
VolleyballCoach 28 269 92 247 22 84 8
VolleyballLeague 81 1843 232 1808 35 219 13
VolleyballPlayer 1178 8800 977 8481 319 893 84
Watermill 198 5331 731 4948 383 668 63
WaterRide 73 1700 294 1634 66 271 23
WaterwayTunnel 19 474 78 462 12 75 3
Weapon 5214 160486 6771 156046 4440 6437 334
Website 3307 120206 10787 110680 9526 9983 804
WineRegion 357 3387 505 2904 483 456 49
Winery 833 12111 1066 10883 1228 965 101
WomensTennisAssociationTournament 60 1313 177 1302 11 168 9
Work 2260598 55340530 161480 51013433 4327097 140850 20630
WorldHeritageSite 741 29449 3899 26077 3372 3596 303
Wrestler 3570 314511 5925 249205 65306 5462 463
WrestlingEvent 946 65446 1969 53438 12008 1848 121
Writer 140633 3713879 43448 3322749 391130 38279 5169
WrittenWork 2074401 51560929 158776 47481305 4079624 138568 20208
Year 2036 81621 4621 73295 8326 4304 317
YearInSpaceflight 60 7805 224 7789 16 211 13
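The rows above print without column labels, but the layout can be recovered from the numbers themselves: in every row the second value equals the sum of the fourth and fifth, and the third value equals the sum of the sixth and seventh. This is consistent with the columns articles, revisions, editors, male revisions, female revisions, male editors, female editors, where the gender assignment of the two partitions is an inference based on the well-documented male majority among Wikipedia editors. As a minimal sketch of how one such row can feed a per-category Fisher's exact test, the Python fragment below compares a category's male/female revision counts against the rest of the corpus; the corpus-wide totals are hypothetical placeholders and the exact contingency-table construction is an assumption for illustration, not necessarily the one used in this thesis.

    from scipy.stats import fisher_exact

    # One row from the table above, with the column layout inferred
    # from the row arithmetic described in the text.
    figure_skater = {"articles": 2688, "revisions": 35861, "editors": 1977,
                     "rev_m": 12340, "rev_f": 23521, "ed_m": 1761, "ed_f": 216}

    # Hypothetical corpus-wide revision totals (placeholders, not thesis data).
    TOTAL_REV_M, TOTAL_REV_F = 100_000_000, 10_000_000

    # 2x2 contingency table: male/female revisions inside the category
    # versus male/female revisions in the rest of the corpus.
    table = [
        [figure_skater["rev_m"], figure_skater["rev_f"]],
        [TOTAL_REV_M - figure_skater["rev_m"], TOTAL_REV_F - figure_skater["rev_f"]],
    ]

    odds_ratio, p_value = fisher_exact(table)
    print(f"odds ratio = {odds_ratio:.4f}, p = {p_value:.3g}")
    # An odds ratio below 1 with a significant p-value would indicate a
    # female overrepresentation in the category relative to the corpus.

Under this construction, a category such as FigureSkater, whose female revision count exceeds its male count, yields an odds ratio far below 1, matching the pattern of female-overrepresented categories discussed in the thesis.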
