Using an external knowledge base to link domain-specific datasets


Gideon G.A. Mooijen 10686290

Bachelor thesis Credits: 18 EC

Bachelor's Programme in Artificial Intelligence (Kunstmatige Intelligentie)
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Prof. dr. P. T. Groth
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

June 28th, 2019


Abstract

Conventional record-linkage methods are based on word similarity. A metric of choice calculates which records are most similar and presents them as matches. The emergence of comprehensive online knowledge bases gives rise to the presumption that, for domain-specific record linking, incorporating a notion of the entities involved might offer a new approach to linking records. As the context of the labels is present, the knowledge base of choice can be used to form the mapping between labels and entities. This knowledge-based record linking (KBRL) outperforms the conventional method in terms of precision. It appears to be size-invariant, whereas the performance of conventional methods decreases if the labels show high variance or if there is a large number of candidates per entry. Additionally, KBRL seems to be more resilient to noisy and incomplete data.


Contents

1 Introduction
    1.1 Approach
2 Related work
    2.1 Word-similarity based RL
        2.1.1 Levenshtein
        2.1.2 Jaro-Winkler
        2.1.3 Qgram
        2.1.4 LCS
    2.2 Knowledgebased RL
3 Method
    3.1 Knowledgebased RL
    3.2 Word-similarity RL
4 Experiments
    4.1 Data acquisition
        4.1.1 Dataset A
        4.1.2 Dataset B
    4.2 Wikidict construction
    4.3 Record linkage
    4.4 Golden standard
5 Results
    5.1 Performance scores
6 Evaluation
    6.1 Mismatches
7 Conclusion


Chapter 1

Introduction

Data analysis is a powerful tool that is widely used in a large number of disciplines. Amongst others, it can be used to detect anomalies in medical systems to increase cancer survival rates, or to provide demographic statistics. To perform this type of research, comprehensive and valid data is required. This data is not always available. In some cases, the desired data is scattered throughout multiple files and locations. These multiple datasets might each contain different and valuable information on the same person, institution, or country. The combination of these datasets offers new insights into the matter at hand. The most obvious approach is to merge the datasets. This technique requires that both datasets are free of noise and compatible. It is possible to transform datasets into noise-free and compatible datasets through preprocessing. A third criterion is more complicated to establish: both datasets need to maintain the same labels for the same entities. If this is not the case, the system is not aware of the fact that different entries of dataset A and dataset B concern the same entity. If this matching is not established, combining the data fails.

An example: one dataset states that 'The Netherlands' had 17 million inhabitants in 2017. A different dataset states that 'Netherlands' scored a 7.23 on the subject 'how high would the average resident score their level of happiness'. In order to detect the correlation between these two factors, a technique is required to express the fact that 'The Netherlands' and 'Netherlands' are different labels for the same entity.

Real-life data rarely meets this 'conformity of labels' criterion. The lack of conformity can be resolved by a technique called record linking (RL). This is the task of matching elements across databases. Many RL algorithms are metrics-based: they calculate the word similarity between potential matches. This type of matching lacks an actual notion of the entity involved; it merely establishes the alikeness of the labels.

The incentive for this project was the presumption that an approach that does include the 'notion' of the entities might offer advantages over metrics-based techniques. The hypothesis is that for domain-specific record linkage, a well-chosen knowledge base (KB) can map different labels to the correct entities. Such a mapping is more reliable, because the KB ensures that the different labels indeed refer to the same entity, rather than merely being labels with high word similarity. This results in the following research question:


advantages over metrics based record linking for domain-specific datasets?

1.1 Approach

To answer this question, a case study is performed. The domain of choice is transfers in European soccer. This domain meets all requirements for answering the research question:

1. There is much data available.

2. The different datasets maintain different labels for the same entities.

3. There is a KB available to enable the record linking.

4. The combination of the data yields interesting competition-specific statistics that can be used for data analysis.

To answer the research question, two datasets are extracted online. Dataset A contains information on transfers and is structured per year and competition. Dataset B is an extensive, unstructured database that also contains the fees of transfers. The approach is to find, through record linkage, the correct match in Dataset B for every transfer in Dataset A and to extract the fee from that match. This is executed multiple times with different techniques. One of these techniques is knowledge-based record linking (KBRL); the others are standard word-similarity based methods (WSRL). Finally, a golden standard is constructed to evaluate the performance of KBRL with respect to WSRL.


Chapter 2

Related work

Previous research was examined to develop the approach for comparing the performance of WSRL with KBRL.

2.1 Word-similarity based RL

Essential for this project is the evaluation of the WSRL. The four techniques that are used in this project range from simple to complex and are entirely different but share one fundamental property: they calculate word-similarity.

2.1.1 Levenshtein

The Levenshtein distance [3] (LD) is one of the most straightforward word-similarity measurements. It calculates the number of operations required to transform one string into another. Possible operations are the insertion, deletion, or substitution of a single character. The LD increases by 1 for every required operation.

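\[
\mathrm{LD}_{s_1,s_2}(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0 \\
\min
\begin{cases}
\mathrm{LD}_{s_1,s_2}(i - 1, j) + 1 \\
\mathrm{LD}_{s_1,s_2}(i, j - 1) + 1 \\
\mathrm{LD}_{s_1,s_2}(i - 1, j - 1) + 1_{(s_{1,i} \neq s_{2,j})}
\end{cases} & \text{otherwise}
\end{cases}
\tag{2.1}
\]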

This procedure produces the LD for strings s1 and s2. Identical strings have the minimal possible LD: 0. A high LD indicates low word similarity.
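The recurrence in (2.1) can be evaluated row by row with dynamic programming. The sketch below is a minimal illustration, not the implementation used in this project; the function name `levenshtein` is chosen here for illustration only.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Levenshtein distance following recurrence (2.1), computed row by row."""
    # previous[j] holds LD(i - 1, j); the first row is LD(0, j) = j.
    previous = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        current = [i]  # LD(i, 0) = i
        for j, b in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (a != b),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


# The two labels from the introduction differ by the four characters "The ".
distance = levenshtein("Netherlands", "The Netherlands")
```

Only two rows of the table are kept in memory, which is sufficient because each cell depends on the current and the previous row alone.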

2.1.2 Jaro-Winkler

A different method is Jaro-Winkler (JW) [6]. It is similar to Levenshtein but has additional mechanisms that compensate for significant differences in length. These mechanisms solve the problem that arises when abbreviations are used.

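In order to calculate the Jaro-Winkler similarity (sim_w), the Jaro similarity (sim_j) is calculated first.

\[
\mathit{sim}_j =
\begin{cases}
0 & \text{if } m = 0 \\
\frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) & \text{otherwise}
\end{cases}
\tag{2.2}
\]

• s1 and s2 are the strings that are under evaluation.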


• m is the number of matching characters.

• t is half the number of transpositions, that is, half the number of matching characters that appear in a different order in the two strings.

Subsequently, the Jaro-Winkler-similarity (simw) is calculated.

\[
\mathit{sim}_w = \mathit{sim}_j + l \, p \, (1 - \mathit{sim}_j)
\tag{2.3}
\]

• l is the length of the prefix that both strings have in common, capped at 4, so 0 ≤ l ≤ 4.

• p defines the weight of l. It ranges from 0.0 to 0.25 and quantifies how strongly l favours the alikeness of the strings. If p is large, strings with identical prefixes will have high similarity. If p is small, identical prefixes have a lower impact on the total calculation.

For this project p = 0.1 is maintained.
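Equations (2.2) and (2.3) can be sketched as follows. The text above does not state when two characters count as 'matching'; the sketch assumes the usual Jaro convention that equal characters match if their positions differ by at most max(|s1|, |s2|) / 2 − 1 (rounded down). That window, and the function names, are assumptions of this illustration rather than details taken from the project.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity (2.2). The matching window is an assumed convention."""
    if not s1 or not s2:
        return 1.0 if s1 == s2 else 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        # Look for an unused equal character of s2 inside the window.
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Compare the matched characters in order; t is half the mismatches.
    matched1 = [c for c, u in zip(s1, used1) if u]
    matched2 = [c for c, u in zip(s2, used2) if u]
    t = sum(a != b for a, b in zip(matched1, matched2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler similarity (2.3) with prefix length l capped at 4."""
    sim_j = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return sim_j + l * p * (1 - sim_j)
```

For the classic pair 'MARTHA' and 'MARHTA', all six characters match and two of them are out of order (t = 1), so the Jaro similarity is (1 + 1 + 5/6) / 3; the shared prefix 'MAR' (l = 3) then raises the score according to (2.3).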

2.1.3 Qgram

Qgram string matching is introduced by Ukkonen [5].

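\[
D_q(x, y) = \sum_{v \in \Sigma^q} \left| G(x)[v] - G(y)[v] \right|
\tag{2.4}
\]

Here G(x)[v] denotes the number of occurrences of the q-gram v in the string x, and Σ^q is the set of all strings of length q over the alphabet Σ.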

For this project, q = 2 is used. This means that for every string, all 2-grams (pairs of adjacent characters) are extracted. An example is given below:

          br  ro  ot  th  he  er  mo
brother    1   1   1   1   1   1   0
mother     0   0   1   1   1   1   1

The L1 norm (or Manhattan distance) of the difference between these q-gram profiles is the distance between the strings. As the number of overlapping 2-grams increases, the distance decreases and the similarity increases.
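The q-gram distance of (2.4) can be sketched with a counter per string. This is an illustrative sketch; the function names are chosen here and are not taken from the project code.

```python
from collections import Counter


def qgram_profile(s: str, q: int = 2) -> Counter:
    """Count every substring of length q in s (the profile G(s))."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x: str, y: str, q: int = 2) -> int:
    """L1 distance between the q-gram profiles of x and y, as in (2.4)."""
    gx, gy = qgram_profile(x, q), qgram_profile(y, q)
    # Only q-grams that occur in at least one string contribute to the sum.
    return sum(abs(gx[v] - gy[v]) for v in gx.keys() | gy.keys())


# 'brother' and 'mother' differ in the 2-grams br, ro and mo.
example = qgram_distance("brother", "mother")
```

For the example table above, the profiles differ in exactly three columns (br, ro and mo), which gives a distance of 3.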

2.1.4 LCS

The longest common subsequence (LCS) [1] can be transformed into a metric as well.

f(s1, s2) denotes the length of the LCS. M(s1, s2) denotes the length of the longer of the two strings.

This is transformed into a metric through the following equation:

\[
d(s_1, s_2) = 1 - \frac{f(s_1, s_2)}{M(s_1, s_2)}
\tag{2.5}
\]

When the strings are equal, d(s1, s2) = 0. As the LCS becomes shorter relative to the longer string, this value increases. If no characters coincide, the value is 1.
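The LCS-based distance of (2.5) can be sketched with the standard dynamic-programming table for the longest common subsequence. The function names are illustrative, and the handling of two empty strings (distance 0) is an assumption, since (2.5) is undefined in that case.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence, f(s1, s2)."""
    previous = [0] * (len(s2) + 1)
    for a in s1:
        current = [0]
        for j, b in enumerate(s2, start=1):
            if a == b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_distance(s1: str, s2: str) -> float:
    """d(s1, s2) = 1 - f(s1, s2) / M(s1, s2), as in (2.5)."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 0.0  # assumption: two empty strings are treated as identical
    return 1 - lcs_length(s1, s2) / longest
```

For 'brother' and 'mother' the LCS is 'other' (length 5) and the longer string has 7 characters, so the distance is 1 − 5/7.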

2.2 Knowledgebased RL

Many of the word-similarity metrics were developed in an era in which the internet was not widely adopted or did not even exist yet. The rise of online encyclopedias may be a valuable asset for record linking. WSRL is an 'unaware' technique, whereas an extensive KB can provide certainty. The KB might have information about multiple different labels for the same entity. Shen et al. [4] showed that their LINDEN framework outperformed all other state-of-the-art techniques for entity linking. Their goal was to extend texts with additional information on the entities in that text. The challenging part was disambiguation: ensuring that 'Michael Jordan' referred to the basketball player if that fits the context, and to the computer scientist if the name was used in that context.

Doan et al. [2] described a solution to the problem that arises when multiple labels refer to the same entity. The proposed solution entails a concordance table in which every row represents an entity and the columns of that row hold the different labels of that entity.


Chapter 3

Method

3.1 Knowledgebased RL

The manual algorithm is named knowledge-based record linkage. The name is derived from its mechanics: all labels that are present are combined with a knowledge base and converted into a dictionary that maps the different labels to the correct entity. The first step is to extract all labels from the databases.

Algorithm 1: Extracting all elements from the correct columns as labels
Data: D, important columns
Result: labels
labels = [ ]
for entry ∈ D do
    for element ∈ entry do
        if element ∈ important columns then
            labels.append(element)
        end
    end
end
return labels

Subsequently, domain-specific terms are added to every label to disambiguate it: they provide the context for the KB. The algorithm restricts the search engine to the URL of the selected KB. The name that the KB maintains for the retrieved page is extracted as the correct entity for that label.

Algorithm 2: Converting the labels to the correct entity
Data: labels, terms, KB
Result: labels mapped to entities
mappings = [ ]
for label ∈ labels do
    query = label + terms
    entity = get_entity(query, KB)
    mappings.append(label, entity)
end


The entity retrieval is executed only once per label to keep the record linking computationally efficient. The resulting entities are stored in a dictionary that maps labels to entities. This mechanism is an adaptation of the method that is described by Doan et al. [2].

Algorithm 3: Obtain dictionary to facilitate record linkage between two datasets
Data: D1, D2, features, terms
Result: dictionary
dict = {}
labels_D1 = extract_labels(D1, features)
labels_D2 = extract_labels(D2, features)
all_labels = merge(labels_D1, labels_D2)
unique_labels = set(all_labels)
mappings = convert_labels(unique_labels)
for label, entity ∈ mappings do
    if entity in dict then
        dict(entity).append(label)
    else
        dict(entity) = [label]
    end
end
return dict
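Algorithms 1 to 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the datasets are lists of dictionaries, and the KB lookup is passed in as a callable `get_entity`, which is hypothetical and stands in for the search-engine query described above. The stub lookup table and the context term 'football club' are made up for the example; the labels and entity names are taken from the Wikidict excerpt in section 4.2.

```python
from collections import defaultdict


def extract_labels(dataset, label_columns):
    """Algorithm 1: collect the values of the label columns of every entry."""
    return [entry[column] for entry in dataset for column in label_columns]


def build_dictionary(d1, d2, label_columns, terms, get_entity):
    """Algorithms 2 and 3: map every entity to all labels that refer to it.

    get_entity is a hypothetical callable that takes a query string and
    returns the entity name according to the KB.
    """
    unique_labels = set(extract_labels(d1, label_columns)
                        + extract_labels(d2, label_columns))
    entity_to_labels = defaultdict(list)
    for label in sorted(unique_labels):  # one KB lookup per unique label
        entity = get_entity(f"{label} {terms}")
        entity_to_labels[entity].append(label)
    return dict(entity_to_labels)


# Stub KB for illustration only; a real lookup would query the KB.
stub_kb = {
    "VVV-Venlo football club": "VVV-Venlo",
    "VVV Venlo football club": "VVV-Venlo",
    "Valencia football club": "Valencia_CF",
}
d1 = [{"from": "VVV-Venlo", "to": "Valencia"}]
d2 = [{"from": "VVV Venlo", "to": "Valencia"}]
wikidict = build_dictionary(d1, d2, ["from", "to"], "football club",
                            stub_kb.__getitem__)
```

Passing the lookup in as a function keeps the dictionary construction testable without network access and makes the 'one lookup per unique label' property explicit.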

Finally, all entries in D1 are compared to all candidates in D2 in order to present the matches. If the elements are identifiers (elements that are identical in both datasets), the algorithm checks whether they are equal. If this is the case, the first criterion is met. Subsequently, it compares the labels. The algorithm checks whether these labels refer to the same entity according to the previously constructed dictionary. If none of these comparisons fails, the candidate is presented as a correct match to the entry.

Algorithm 4: Matching entries to candidates
Data: D1, D2, dict
Result: matched (entry, candidate) tuples
matches = [ ]
for entry ∈ D1 do
    for candidate ∈ D2 do
        match = true
        for i in range(0, len(entry)) do
            if entry[i] != candidate[i] and dict(entry[i]) != dict(candidate[i]) then
                match = false
                break
            end
        end
        if match then
            matches.append((entry, candidate))
        end
    end
end
return matches
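The matching step of Algorithm 4 can be sketched as follows. The sketch assumes that entries are dictionaries and that the entity dictionary is first inverted into a label-to-entity lookup. The entity names in the example are hypothetical placeholders; the entry and candidate are the Dominic Iorfa example from section 4.1, and the second candidate is made up to show a failing comparison.

```python
def invert(entity_to_labels):
    """Turn {entity: [labels]} into {label: entity} for direct lookups."""
    return {label: entity
            for entity, labels in entity_to_labels.items()
            for label in labels}


def same_entity(a, b, label_to_entity):
    """Two elements agree if they are equal or are labels of one entity."""
    if a == b:
        return True
    entity = label_to_entity.get(a)
    return entity is not None and entity == label_to_entity.get(b)


def match_entries(d1, d2, columns, label_to_entity):
    """Algorithm 4: a candidate matches only if every column agrees."""
    return [(entry, candidate)
            for entry in d1
            for candidate in d2
            if all(same_entity(entry[c], candidate[c], label_to_entity)
                   for c in columns)]


# Hypothetical entity names, for illustration only.
lookup = invert({
    "Wolverhampton_Wanderers": ["Wolverhampton Wanderers", "Wolves"],
    "Sheffield_Wednesday": ["Sheffield W", "Sheff Wed"],
})
entry = {"year": "2018-2019", "name": "Dominic Iorfa",
         "from": "Wolverhampton Wanderers", "to": "Sheffield W"}
good = {"year": "2018-2019", "name": "Dominic Iorfa",
        "from": "Wolves", "to": "Sheff Wed", "fee": "2000000"}
other_year = dict(good, year="2016-2017")  # made-up non-matching candidate
matches = match_entries([entry], [good, other_year],
                        ["year", "name", "from", "to"], lookup)
```

Labels that are unknown to the dictionary never match a different label, which mirrors the behaviour described above: a match is presented only if no comparison fails.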

(11)

3.2. WORD-SIMILARITY RL CHAPTER 3. METHOD

3.2 Word-similarity RL

KBRL does not calculate a degree of certainty. A match is presented if and only if all elements are equal or are labels of the same entity. WSRL works differently. Before performing the record linking, some parameters need to be specified.

1. Identifiers. The algorithm needs to be aware of which elements need to be identical in both datasets. Every time an entry is compared to the candidates, it is only compared to those candidates whose identifiers are identical to the identifiers of the entry. This technique is called blocking and reduces computational complexity.

2. Elements that need matching: labels. The algorithm calculates the word similarity for every (label, label) pair: one label from the entry, one from the candidate. The similarities of all pairs are added up, and the result is the word similarity for all labels combined. Another option is to calculate the sum of squared errors. This option favours candidates that do not have (label, label) pairs with extremely low similarity.

3. Elements that do not need matching: these are ignored for the record linking.

WSRL will always present a candidate, provided that there is at least one candidate with coinciding identifiers. If there are multiple candidates with coinciding identifiers, it will present the candidate with the highest word similarity between the labels.
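The three parameters above can be sketched as one function: block on the identifiers, sum the label similarities, and present the best candidate. The similarity function below uses `difflib.SequenceMatcher` from the Python standard library purely as a stand-in; it is not one of the four metrics evaluated in this project, and any of them could be passed in instead. The entry is the example from section 4.1; the second candidate with the clubs 'Club X' and 'Club Y' is made up.

```python
from difflib import SequenceMatcher


def stand_in_similarity(a: str, b: str) -> float:
    """Stand-in word similarity in [0, 1]; not one of the thesis metrics."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_wsrl(entry, candidates, identifiers, labels,
              similarity=stand_in_similarity):
    """Return the best candidate for entry, or None if blocking leaves none."""
    # Blocking: keep only candidates whose identifiers equal the entry's.
    block = [c for c in candidates
             if all(c[key] == entry[key] for key in identifiers)]
    if not block:
        return None
    # Sum the similarity of every (label, label) pair and take the maximum.
    return max(block, key=lambda c: sum(similarity(entry[key], c[key])
                                        for key in labels))


entry = {"year": "2018-2019", "name": "Dominic Iorfa",
         "from": "Wolverhampton Wanderers", "to": "Sheffield W"}
candidates = [
    {"year": "2018-2019", "name": "Dominic Iorfa",
     "from": "Wolves", "to": "Sheff Wed"},
    {"year": "2018-2019", "name": "Dominic Iorfa",
     "from": "Club X", "to": "Club Y"},  # made-up candidate
]
best = link_wsrl(entry, candidates, ["year", "name"], ["from", "to"])
```

Because the function always returns the highest-scoring candidate of a non-empty block, it can present a wrong candidate when the correct one is absent, which is the behaviour that distinguishes WSRL from KBRL.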


Chapter 4

Experiments

4.1 Data acquisition

As mentioned before, two datasets are used. Dataset A (DA) and dataset B (DB) both contain information on transfers, but DB has more information than DA: it also contains the fee of the transfers. An example of entries in both datasets, after acquisition and preprocessing, is given below:

          year       name           from club                to club      sum
Source A  2018-2019  Dominic Iorfa  Wolverhampton Wanderers  Sheffield W
Source B  2018-2019  Dominic Iorfa  Wolves                   Sheff Wed    €2 Mill.

The function of DA is to define the scope of the data. It allows specificity: which transfers are required for the data analysis. It yields all transfers for players who left their club between the considered years.

4.1.1 Dataset A

To obtain all required transfers, the website FootballSquads (http://www.footballsquads.co.uk/) is used. DA contains all transfers for players in the top competitions of Spain, England, the Netherlands, Germany, Italy, France, Russia and Portugal. All transfers between 2011 and 2019 are used: 3244 transfers.

Figure 4.1: Architecture of the acquisition of DA


Link generator

All URLs for the necessary files follow the same pattern. The root URL (footnote 2) per nation and per year contains the names of all participating clubs, as well as the link to their squad page for that season. The names and links are stored.

Webscraping

The previous process yields all links to the pages that require extraction. Again, all squad pages follow the same format. This property allows for the construction of a generic web scraper. This web scraper downloads the HTML tables on the page. The Python toolkit pandas is used to convert all tables on that page to data frames. The first data frame contains the necessary data. Below is an example of that data frame.

Number  Name             Nat  Pos  Height  Date of Birth  Old club
1       Petr Cech        CZE  G    1.96    20-05-82       Chelsea
2       Héctor Bellerín  ESP  D    1.78    19-03-95       Barcelona
...
87      Bukayo Saka      ENG  M            05-09-01
...

Players no longer at this club

Number  Name            Nat  Pos  Height  Date of Birth  New club
13      David Ospina    COL  G    1.83    31-08-88       Napoli (On Loan)
60      Gedion Zelalem  USA  M    1.85    26-01-97       Sporting Kansas City

Source: http://footballsquads.co.uk/eng/2018-2019/engprem/arsenal.htm

Preprocessing

Not all data is required. To decrease the computational complexity, all unneeded information is disregarded. For every transfer, the following information is stored:

year       name            from     to
2018/2019  David Ospina    Arsenal  Napoli
2018/2019  Gedion Zelalem  Arsenal  Sporting Kansas City

Storage

The tables are converted and stored in CSV files. All these transfer files can be found in:

data/year/nation/

Such directories contain the following files:

data/2018-2019/england/Arsenal.csv
data/2018-2019/england/Arsenal_transfers.csv
data/2018-2019/england/...
data/2018-2019/england/...
data/2018-2019/england/Wolverhampton Wanderers.csv
data/2018-2019/england/Wolverhampton Wanderers_transfers.csv

4.1.2 Dataset B

The German website Transfermarkt (transfermarkt.de) provides extensive information on transfer fees.

Footnote 2: http://www.footballsquads.co.uk/eng/2018-2019/engprem.htm


Architecture

Figure 4.2: Architecture of Dataset B acquisition

Link generator

The year of the transfer and the name of the player are extracted from Dataset A. Subsequently, the name of the player is converted into a Google query. The Google queries are limited to Transfermarkt.de.

Source A: 2017-2018 Martin Terrier Lille Lyon

Name: Martin Terrier

Google query: Martin Terrier player transfermarkt transfers

Raw result: https://www.transfermarkt.de/martin-terrier/profil/spieler/442891

Processed result: https://www.transfermarkt.de/martin-terrier/transfers/spieler/442891
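The two string steps of the link generator can be sketched as follows. The search itself is not shown; only the construction of the query and the conversion of the raw profile URL into the transfers URL are illustrated, based on the Martin Terrier example. The function names are chosen here for illustration.

```python
def build_query(player_name: str) -> str:
    """Build the search query for a player, following the example above."""
    return f"{player_name} player transfermarkt transfers"


def to_transfers_url(profile_url: str) -> str:
    """Point a Transfermarkt profile URL at the player's transfers page."""
    # Only the path segment 'profil' changes; slug and player id stay intact.
    return profile_url.replace("/profil/", "/transfers/", 1)


query = build_query("Martin Terrier")
processed = to_transfers_url(
    "https://www.transfermarkt.de/martin-terrier/profil/spieler/442891")
```

Replacing the path segment including its slashes avoids touching a player slug that happens to contain the word 'profil'.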

Webscraping

The desired information is located in the body of the website.

Figure 4.3: All information present on Transfermarkt on the player Martin Terrier

All these transfers are extracted. Inspection of the page shows that every row is a <tr> element with an identical class. All these elements have the same structure, the contents of which are extracted using the Python toolkit BeautifulSoup. This toolkit allows for easy processing of HTML pages. It has built-in functions to search for certain elements (the required <tr> elements, for instance). This facilitates the extraction of all required data.

Preprocessing

The transfers are stored if and only if the value of the 'Season' cell matches the year that is present in Dataset A. In the case of Martin Terrier, all transfers are stored with the exception of the last one, since the years do not match. In order to compare 2017-2018 with 17/18 or 15/16, a small converter was developed which uses string manipulation.

A similar converter was constructed to process the 'Fee', found in the last column of the data. Often the fees are plain strings: players can be loaned to other clubs, return to their club, or have no fee at all. On other occasions there was a price involved, which is expressed on the website as a combination of digits and text. A more desirable representation consists solely of integers, which was also achieved by a simple manual string manipulation algorithm.

DB differentiates between the actual club and the youth academies of that club, while DA does not. Example:

DA: 27 2011-2012 Shkodran Mustafi Everton Sampdoria

DB: 39 2011-2012 Shkodran Mustafi Everton U21 Sampdoria 75000

The KB tends to have separate documents for these youth academies. As a result, the matching would fail, because DA is underspecified in this respect. For that reason, every occurrence of U18 through U23 is stripped from all labels.
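The three preprocessing steps described in this section can be sketched as small string converters. These are illustrative sketches, not the converters used in the project. The season converter assumes that all years lie in the 2000s. The fee parser assumes that prices look like '€2 Mill.', as in the example in section 4.1; the 'Th.' suffix for thousands is an assumption, and any other string (such as 'Free Transfer') is returned unchanged.

```python
import re


def expand_season(short: str) -> str:
    """'17/18' -> '2017-2018' (assumes both years are in the 2000s)."""
    start, end = short.split("/")
    return f"20{int(start):02d}-20{int(end):02d}"


# Assumed fee suffixes; only 'Mill.' appears in the examples of this text.
_MULTIPLIERS = {"Mill.": 1_000_000, "Th.": 1_000}


def parse_fee(text: str):
    """Convert a price such as '€2 Mill.' to an integer; keep other strings."""
    found = re.fullmatch(r"€\s*(\d+(?:[.,]\d+)?)\s*(Mill\.|Th\.)", text.strip())
    if found is None:
        return text  # e.g. 'Free Transfer' or a loan description
    number = float(found.group(1).replace(",", "."))
    return round(number * _MULTIPLIERS[found.group(2)])


def strip_youth_suffix(label: str) -> str:
    """Remove the youth-team markers U18 through U23 from a club label."""
    return re.sub(r"\s*\bU(?:1[89]|2[0-3])\b", "", label).strip()
```

With these helpers, 'Everton U21' from the example above becomes 'Everton', so that both datasets use a label that the KB can map to the same entity.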

4.2 Wikidict construction

The final step in order to perform KBRL is accumulating all (label, entity) pairs to a dictionary. A small subset of the dictionary is given below:

1111 VVV-Venlo,['VVV-Venlo', 'VVV Venlo']
1112 VV_Bennekom,['VV Bennekom']
1113 Valdres_FK,['Valdres', 'Valdres FK']
1114 Valencia_CF,['Valencia']
1115 Valencia_CF_Mestalla,['Valencia B']
1116 Valenciennes_FC,['Valenciennes', 'Valenciennes FC']
1117 Vancouver_Whitecaps_FC,['Vancouver', 'Vancouver Whitecaps']

4.3 Record linkage

Both DA and DB are processed into a new file: dataset C (DC). This file has the same number of entries as DA, with an extra column for each of the record linking techniques. The values in these columns are the fees that are extracted from the candidate in DB that was matched to that entry. Pseudo-code of this technique is given earlier as Algorithm 4.


4.4 Golden standard

There was no golden standard available for this project. In a way, the goal of the project was to find the type of record linkage algorithm that results in the golden standard, or as close to the golden standard as possible. In order to perform an evaluation, a subset of 100 samples from DC was constructed. For this subset, the correct fee was inserted manually. The same KB and sources were used for this manual data acquisition as the ones that DA and DB utilized.


Chapter 5

Results

The following tables show the performance of the five evaluated techniques: Wikidict (WD), Levenshtein (LS), Jaro-Winkler (JW), Qgram (QG), Longest Common Subsequence (LCS). The golden standard (GS) is found in the last column.

The colours in the cells indicate whether they are false positives (red), true negatives (cyan) or false negatives (yellow). True positives are uncoloured.

• 'XXX' means that the algorithm did not manage to match the entry to any candidate

• '?' and '-' indicate that a match is found, but the source did not have information on the fee

• End of loan (EOL) means that the player has finished his period of loan to the club in dispute


# | year | name | from | to | WD | LS | JW | QG | LCS | GS
40 | 2011-2012 | Craig Bellamy | Manchester City | Liverpool | Free | Free | Free | Free | Free | Free
75 | 2011-2012 | Roman Bednář | West Bromwich Albion | Blackpool | Free | Free | Free | Free | Free | Free
95 | 2011-2012 | Kris Stadsgaard | Málaga | FC København | Free | Free | Free | Free | Free | Free
113 | 2011-2012 | Mano | Villarreal | Levadiakos | XXX | Free | Free | Free | Free | Free
116 | 2011-2012 | Braulio | Zaragoza | Cartagena | XXX | XXX | XXX | XXX | XXX | -
150 | 2011-2012 | Martin Amedick | 1.FC Kaiserslautern | Eintracht Frankfurt | XXX | 350000 | 350000 | 350000 | 350000 | 350000
182 | 2011-2012 | Emiliano Viviano | Internazionale | Palermo | 8500000 | 8500000 | 8500000 | 8500000 | 8500000 | 8500000
251 | 2011-2012 | Ludovic Obraniak | Lille | Bordeaux | 1000000 | 1000000 | 1000000 | 1000000 | 1000000 | 1000000
255 | 2011-2012 | Adama Touré | Lorient | Sporting Gijón B | XXX | Free | Free | Free | Free | Free
285 | 2011-2012 | Artur Valikayev | Amkar Perm | Spartak Moscow | XXX | XXX | XXX | XXX | XXX | EOL
295 | 2011-2012 | Vladimir Gabulov | CSKA Moscow | Anzhi Makhachkala | EOL | EOL | EOL | EOL | EOL | EOL
310 | 2011-2012 | Yuriy Kirillov | Krylia Sovetov | Dynamo Moscow | XXX | - | - | - | - | EOL
373 | 2011-2012 | Sergey Pesyakov | Tom Tomsk | Spartak Moscow | EOL | EOL | EOL | EOL | EOL | EOL
481 | 2012-2013 | Robbie Brady | Manchester United | Hull C | 2500000 | 2500000 | 2500000 | 2500000 | 2500000 | 2500000
500 | 2012-2013 | James McFadden | Sunderland | Motherwell | Free | Free | Free | Free | Free | Free
520 | 2012-2013 | Conor Sammon | Wigan Athletic | Derby Co | 1500000 | 1500000 | 1500000 | 1500000 | 1500000 | 1500000
548 | 2012-2013 | Nicki Bille Nielsen | Rayo Vallecano | Villarreal B | XXX | EOL | EOL | EOL | EOL | EOL
555 | 2012-2013 | Pablo Hernández | Valencia | Swansea C | 7000000 | 7000000 | 7000000 | 7000000 | 7000000 | 7000000
559 | 2012-2013 | Víctor Mongil | Valladolid | Atlético Madrid B | Free | Free | Free | Free | Free | Free
592 | 2012-2013 | Francisco Rodríguez | VfB Stuttgart | América | XXX | XXX | XXX | XXX | XXX | 2000000
636 | 2012-2013 | Djamel Mesbah | Milan | Parma | ? | ? | ? | ? | ? | ?
671 | 2012-2013 | Jens Toornstra | ADO Den Haag | FC Utrecht | 950000 | 950000 | 950000 | 950000 | 950000 | 950000
673 | 2012-2013 | Theo Janssen | Ajax | Vitesse | 550000 | 550000 | 550000 | 550000 | 550000 | 550000
687 | 2012-2013 | Luis Pedro | Heracles Almelo | Botev Plovdiv | Free | Free | Free | Free | Free | Free
713 | 2012-2013 | Tomáš Mičola | Brest | Slavia Prague | XXX | XXX | XXX | XXX | XXX | 500000
746 | 2012-2013 | Antoine Conte | Paris Saint-Germain | Paris St-Germain B | - | - | - | - | - | -
874 | 2012-2013 | Milan Lalkovič | Vitória Guimarães | Chelsea | EOL | EOL | EOL | EOL | EOL | EOL
880 | 2013-2014 | Anthony Jeffrey | Arsenal | Wycombe W | Free | Free | Free | Free | Free | Free
890 | 2013-2014 | Juan Mata | Chelsea | Manchester U | 44730000 | 44730000 | 44730000 | 44730000 | 44730000 | 44730000
899 | 2013-2014 | Nikica Jelavić | Everton | Hull C | 7800000 | 7800000 | 7800000 | 7800000 | 7800000 | 7800000
909 | 2013-2014 | Danny Graham | Hull City | Sunderland | EOL | EOL | EOL | EOL | EOL | EOL
940 | 2013-2014 | Pelly Ruddock | West Ham United | Luton T | ? | ? | ? | ? | ? | ?
950 | 2013-2014 | David Rodríguez | Celta Vigo | Brighton & HA | Free | Free | Free | Free | Free | Free
978 | 2013-2014 | Panagiotis Vlach. | Augsburg | Olympiakos | EOL | EOL | EOL | EOL | EOL | EOL
999 | 2013-2014 | Cristian Molinaro | VfB Stuttgart | Parma | Free | Free | Free | Free | Free | Free
1016 | 2013-2014 | Michael Agazzi | Cagliari | Chievo Verona | ? | ? | ? | ? | ? | ?
1027 | 2013-2014 | Gustavo Munúa | Fiorentina | Nacional [URU] | XXX | Free | Free | Free | Free | Free
1092 | 2013-2014 | Mathias Schamp | Heracles Almelo | Oudenaarde | XXX | XXX | XXX | XXX | XXX | XXX
1132 | 2013-2014 | Florian Thauvin | Lille | Marseille | 12000000 | 12000000 | 12000000 | 12000000 | 12000000 | 12000000
1176 | 2013-2014 | Willian | Anzhi Makhachkala | Chelsea | 35500000 | 35500000 | 35500000 | 35500000 | 35500000 | 35500000
1278 | 2013-2014 | Christian Irobiso | Paços de Ferreira | Senica | ? | ? | ? | ? | ? | ?
1285 | 2013-2014 | Elderson Uwa Echiejile | Sporting Braga | Monaco | 1500000 | 1500000 | 1500000 | 1500000 | 1500000 | 1500000
1298 | 2014-2015 | Fernando Torres | Chelsea | Milan | 3000000 | 3000000 | 3000000 | 3000000 | 3000000 | 3000000
1364 | 2014-2015 | Michel | Getafe | Valencia | XXX | Free | Free | Free | Free | EOL
1410 | 2014-2015 | Raphael Holzhauser | VfB Stuttgart | Austria Vienna | 250000 | 250000 | 250000 | 250000 | 250000 | 250000
1434 | 2014-2015 | Alessandro Matri | Genoa | Milan | EOL | EOL | EOL | EOL | EOL | EOL
1480 | 2014-2015 | Ryan Koolwijk | FC Dordrecht | AS Trenčín | Free | Free | Free | Free | Free | Free
1496 | 2014-2015 | Jukka Raitala | SC Heerenveen | FC Vestsjælland | Free | Free | Free | Free | Free | Free
1499 | 2014-2015 | Dragan Paljić | Heracles Almelo | Perth Glory | ? | ? | ? | ? | ? | ?
1505 | 2014-2015 | Nikos Ioannidis | PEC Zwolle | Olympiakos | EOL | EOL | EOL | EOL | EOL | EOL
1541 | 2014-2015 | Clément Chantôme | Paris Saint-Germain | Bordeaux | 700000 | 700000 | 700000 | 700000 | 700000 | 700000
1546 | 2014-2015 | Jakub Wawrzyniak | Amkar Perm | Lechia Gdańsk | Free | Free | Free | Free | Free | Free
1585 | 2014-2015 | Fredy | Belenenses | CRD Libolo | 200000 | 200000 | 200000 | 200000 | 200000 | 200000
1599 | 2014-2015 | Mohamed Ibrahim | Marítimo | Zamalek | ? | ? | ? | ? | ? | ?
1605 | 2014-2015 | Mahmoud Ezzat | Nacional | Arab Contractors | EOL | EOL | EOL | EOL | EOL | EOL
1685 | 2015-2016 | Mauro Zárate | West Ham United | Fiorentina | 2100000 | 2100000 | 2100000 | 2100000 | 2100000 | 2100000
1711 | 2015-2016 | Nabil El Zhar | Levante | Las Palmas | Free | Free | Free | Free | Free | Free
1728 | 2015-2016 | Alberto Guitián | Sporting Gijón | Zaragoza | Free | Free | Free | Free | Free | Free
1767 | 2015-2016 | Vedad Ibišević | VfB Stuttgart | Hertha Berlin | Loan | Loan | Loan | Loan | Loan | Loan
1902 | 2015-2016 | Enzo Reale | Lorient | Clermont | Free | Free | Free | Free | Free | Free
1908 | 2015-2016 | Claudio Beauvue | Lyon | Celta Vigo | 5000000 | 5000000 | 5000000 | 5000000 | 5000000 | 5000000
1945 | 2015-2016 | William Vainqueur | Dynamo Moscow | Roma | 579000 | 579000 | 579000 | 579000 | 579000 | 579000
1966 | 2015-2016 | Aras Özbiliz | Spartak Moscow | Beşiktaş | 1300000 | 1300000 | 1300000 | 1300000 | 1300000 | 1300000
1984 | 2015-2016 | Agustín Vuletich | Arouca | Arsenal [ARG] | XXX | Free | Free | Free | Free | Free
2097 | 2016-2017 | Ikechi Anya | Watford | Derby Co | 4700000 | 4700000 | 4700000 | 4700000 | 4700000 | 4700000
2153 | 2016-2017 | Julian Green | Bayern Munich | Stuttgart | 300000 | 300000 | 300000 | 300000 | 300000 | 300000
2178 | 2016-2017 | Yunus Mallı | FSV Mainz 05 | Wolfsburg | XXX | - | - | - | - | 12500000
2206 | 2016-2017 | Nicolò Fazzi | Crotone | Perugia | 700000 | 700000 | 700000 | 700000 | 700000 | 700000
2215 | 2016-2017 | Mauro Zárate | Fiorentina | Watford | 2750000 | 2750000 | 2750000 | 2750000 | 2750000 | 2750000
2231 | 2016-2017 | Mirko Valdifiori | Napoli | Torino | 3400000 | 3400000 | 3400000 | 3400000 | 3400000 | 3400000
2237 | 2016-2017 | Simone Aresti | Pescara | Ternana | ? | ? | ? | ? | ? | ?
2252 | 2016-2017 | Josef Martínez | Torino | Atlanta U | 4500000 | 4500000 | 4500000 | 4500000 | 4500000 | 4500000
2259 | 2016-2017 | Hector Hevel | ADO Den Haag | AEK Larnaca | Free | Free | Free | Free | Free | Free
2279 | 2016-2017 | Albert Rusnák | FC Groningen | Real Salt Lake | 435000 | 435000 | 435000 | 435000 | 435000 | 435000
2355 | 2016-2017 | Chuma Anene | Amkar Perm | Kairat Almaty | ? | ? | ? | ? | ? | ?
2387 | 2016-2017 | Ibrahim Tsallagov | Krylia Sovetov | Zenit St Petersburg | 400000 | 400000 | 400000 | 400000 | 400000 | 400000
2413 | 2016-2017 | Aleksandr Zhirov | Tom Tomsk | FC Krasnodar | Free | Free | Free | Free | Free | Free
2428 | 2016-2017 | Diego Carlos | FC Ufa | São Bento | XXX | 1300000 | 1300000 | 1300000 | 1300000 | Free
2438 | 2016-2017 | Jorginho | Arouca | St-Etienne | XXX | XXX | XXX | XXX | XXX | 1000000
2565 | 2017-2018 | Jack Devlin | Stoke City | South Shields | Free | Free | Free | Free | Free | Free
2580 | 2017-2018 | Diafra Sakho | West Ham United | Rennes | 10000000 | 10000000 | 10000000 | 10000000 | 10000000 | 10000000
2632 | 2017-2018 | Park Joo-Ho | Borussia Dortmund | Ulsan Hyundai | XXX | - | - | - | - | -
2659 | 2017-2018 | Felix Platte | FC Schalke 04 | Darmstadt | 800000 | 800000 | 800000 | 800000 | 800000 | 800000
2698 | 2017-2018 | Riccardo Cappa | Roma | Sassuolo | Free | Free | Free | Free | Free | Free
2710 | 2017-2018 | Davide Zappacosta | Torino | Chelsea | 25000000 | 25000000 | 25000000 | 25000000 | 25000000 | 25000000
2798 | 2017-2018 | Alan Bagaev | Anzhi Makhachkala | Mordovia Saransk | XXX | XXX | XXX | XXX | XXX | Free
2830 | 2017-2018 | Magomed Ozdoev | Rubin Kazan | Zenit St Petersburg | 3300000 | 3300000 | 3300000 | 3300000 | 3300000 | 3300000
2834 | 2017-2018 | Denys Dedechko | SKA-Khabarovsk | FC Mariupol | Free | Free | Free | Free | Free | Free
2845 | 2017-2018 | Javi García | Zenit St. Petersburg | Betis | 1500000 | 1500000 | 1500000 | 1500000 | 1500000 | 1500000
2894 | 2017-2018 | André Moreira | Sporting Braga | Atlético Madrid | EOL | EOL | EOL | EOL | EOL | EOL
2901 | 2017-2018 | César | Vitória Setúbal | Benfica | XXX | XXX | XXX | XXX | XXX | EOL
2969 | 2018-2019 | Benjamin Henrichs | Bayer Leverkusen | Monaco | 20000000 | 20000000 | 20000000 | 20000000 | 20000000 | 20000000
2972 | 2018-2019 | Juan Bernat | Bayern Munich | Paris St-Germain | 5000000 | 5000000 | 5000000 | 5000000 | 5000000 | 5000000
2998 | 2018-2019 | Ali Adnan | Atalanta | Udinese | EOL | EOL | EOL | EOL | EOL | EOL
3038 | 2018-2019 | Simone Pontisso | Udinese | Vicenza Virtus | 800000 | 800000 | 800000 | 800000 | 800000 | 800000
3125 | 2018-2019 | Idrissa Doumbia | Akhmat Grozny | Sporting CP | 3800000 | 3800000 | 3800000 | 3800000 | 3800000 | 3800000
3170 | 2018-2019 | Denis Tumasyan | FC Ufa | Alashkert | Free | Free | Free | Free | Free | Free
3204 | 2018-2019 | Ousmane Dramé | Moreirense | Belenenses | XXX | - | - | - | - | -
3224 | 2018-2019 | Fernando Andrade | Santa Clara | FC Porto | XXX | XXX | XXX | XXX | XXX | XXX
3236 | 2018-2019 | Hélder Tavares | Tondela | Altay | ? | ? | ? | ? | ? | ?


5.1 Performance scores

Figure 5.1: Performance of the different RL techniques

This figure shows a significant difference between the KBRL and WSRL. All four types of WSRL score equally for all measurements.
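How such scores can be derived from the result table is sketched below. The classification rules are assumptions of this sketch: a presented value that equals the golden standard counts as a true positive, a presented value that differs as a false positive, 'XXX' where the golden standard has a value as a false negative, and 'XXX' in both as a true negative. The three rows used are the mismatch cases discussed in section 6.1; they are not representative of the whole sample and only illustrate the arithmetic.

```python
NO_MATCH = "XXX"


def confusion_counts(predicted, gold):
    """Count tp/fp/fn/tn for one technique against the golden standard."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for p, g in zip(predicted, gold):
        if p == NO_MATCH:
            counts["tn" if g == NO_MATCH else "fn"] += 1
        else:
            counts["tp" if p == g else "fp"] += 1
    return counts


def precision(counts):
    """tp / (tp + fp), or None when the technique presented no match."""
    presented = counts["tp"] + counts["fp"]
    return counts["tp"] / presented if presented else None


def recall(counts):
    """tp / (tp + fn), or None when there was nothing to find."""
    relevant = counts["tp"] + counts["fn"]
    return counts["tp"] / relevant if relevant else None


# Rows 255, 1364 and 2438 of the test set (see section 6.1).
gs = ["Free", "EOL", "1000000"]
wsrl = confusion_counts(["Free", "Free", "XXX"], gs)
wd = confusion_counts(["XXX", "XXX", "XXX"], gs)
```

On these three rows, WSRL has one true positive, one false positive and one false negative, while Wikidict presents no match at all, so its precision is undefined here and its recall is 0.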


Chapter 6

Evaluation

6.1 Mismatches

The test set, DC, shows irregularities for multiple samples. Some of these are collected in the following table. As the four different WSRL techniques yield the same results, they are combined into one column: 'WSRL'.

Testset

#     year       name         from     to                WD   WSRL  GS
255   2011-2012  Adama Touré  Lorient  Sporting Gijón B  XXX  Free  Free
1364  2014-2015  Míchel       Getafe   Valencia          XXX  Free  EOL
2438  2016-2017  Jorginho     Arouca   St-Etienne        XXX  XXX   1000000

For the transfers with mismatches, all candidates in DB are given below:

Dataset B

#     year       name         from           to          fee
369   2011-2012  Adama Touré  FC Lorient     Sporting B  Free
370   2011-2012  Adama Touré  Paris SG       FC Lorient  Free
1870  2014-2015  Míchel       Novorizontino  Guarani-SC  Free Transfer

In the case of Adama Touré, WSRL acquires the correct fee, whereas the KBRL algorithm does not. Below are the Wikidict entries for the labels involved:

Wikidict

Entity               Label 1           Label 2
FC Lorient           FC Lorient        Lorient
Sporting CP          Sporting B        Sporting CP
Sporting de Gijón B  Sporting Gijón B

The KBRL algorithm does not match the entry of Adama Touré to the right candidate, because 'Sporting Gijón B' and 'Sporting B' are not labels of the same entity. This exposes a flaw in the functionality of Wikidict. In this case, the algorithm did not map 'Sporting Gijón B' to 'Sporting Gijón', as the label concerns the second-division club of 'Sporting Gijón'. As with the youth academies, DA does not distinguish between a primary club and its secondary associations.

In the case of Míchel, there is a candidate in Dataset B that agrees on the identifier (year, name). However, this is probably not the same Míchel; presumably it is a common name in soccer. This results in a false positive for WSRL. Wikidict detects the difference because the club labels do not refer to the same entities, so KBRL returns no match.

In the case of Jorginho, no candidate agrees on both the year and the name. This results in false negatives for both algorithms.
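The three cases can be traced with a minimal sketch of the knowledge-based matching step. It assumes that a candidate is accepted only when it agrees on (year, name) and when both club labels resolve to the same Wikidict entity as the labels of the test entry. The dictionary holds only the Wikidict entries listed above, so labels outside that excerpt are treated as unknown, which is a simplification. The names `same_entity` and `kbrl_fee` are hypothetical and do not come from the thesis code.

```python
NO_MATCH = "XXX"  # marker used in the tables when no candidate is matched

# Label -> entity mapping, copied from the Wikidict excerpt above.
WIKIDICT = {
    "FC Lorient": "FC Lorient",
    "Lorient": "FC Lorient",
    "Sporting B": "Sporting CP",
    "Sporting CP": "Sporting CP",
    "Sporting Gijón B": "Sporting de Gijón B",
}

# Candidates copied from the Dataset B excerpt above.
DATASET_B = [
    {"year": "2011-2012", "name": "Adama Touré",
     "from": "FC Lorient", "to": "Sporting B", "fee": "Free"},
    {"year": "2011-2012", "name": "Adama Touré",
     "from": "Paris SG", "to": "FC Lorient", "fee": "Free"},
    {"year": "2014-2015", "name": "Míchel",
     "from": "Novorizontino", "to": "Guarani-SC", "fee": "Free Transfer"},
]


def same_entity(label_a, label_b, wikidict):
    """True only if both labels are known and resolve to the same entity."""
    entity_a = wikidict.get(label_a)
    return entity_a is not None and entity_a == wikidict.get(label_b)


def kbrl_fee(entry, dataset_b, wikidict=WIKIDICT):
    """Return the fee of the first candidate whose clubs map to the same entities."""
    for cand in dataset_b:
        if (cand["year"], cand["name"]) != (entry["year"], entry["name"]):
            continue  # the identifier (year, name) must agree
        if (same_entity(entry["from"], cand["from"], wikidict)
                and same_entity(entry["to"], cand["to"], wikidict)):
            return cand["fee"]
    return NO_MATCH


toure = {"year": "2011-2012", "name": "Adama Touré",
         "from": "Lorient", "to": "Sporting Gijón B"}
michel = {"year": "2014-2015", "name": "Míchel",
          "from": "Getafe", "to": "Valencia"}
jorginho = {"year": "2016-2017", "name": "Jorginho",
            "from": "Arouca", "to": "St-Etienne"}

for entry in (toure, michel, jorginho):
    print(entry["name"], kbrl_fee(entry, DATASET_B))
```

Under these assumptions all three entries yield 'XXX', in line with the WD column of the test set. For Adama Touré the 'from' labels agree ('Lorient' and 'FC Lorient' share an entity), but the 'to' labels resolve to different entities, which is where the match is lost.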


Chapter 7

Conclusion

It is important to note that all inferences about the performance of the different algorithms apply only to this specific project. They do not amount to a general comparison of the quality of the techniques, as performance depends heavily on the domain, the knowledge base, and the data.

KBRL shows one advantage over WSRL in this project: it scores perfectly in terms of precision. This means that every match presented by the algorithm is valid. In this experiment KBRL never returned a wrong match, in contrast to WSRL. This agrees with the hypothesis: using a knowledge base to incorporate the notion of the entities does provide advantages for record linking.

However, on every measure except precision, WSRL outperforms KBRL. The size of this advantage depends on the number of candidates per entry and the variance in the labels. As these increase, the probability that WSRL presents an incorrect match increases. This is to be expected: a larger and more diverse set of options raises the chance that a wrong candidate carries a similar label. KBRL, in contrast, is size-invariant. If the mapping from labels to entities succeeds, the matching succeeds regardless of the number of candidates and the variance in the labels.
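The dependence of WSRL on the candidate set can be illustrated with a minimal sketch. It assumes that WSRL selects, among the candidates agreeing on (year, name), the one with the smallest summed Levenshtein distance over the two club labels, and that no rejection threshold is applied. The actual implementation may combine or threshold the scores differently. The helper name `wsrl_fee` is hypothetical; the records are taken from the test set and Dataset B excerpts in Section 6.1.

```python
NO_MATCH = "XXX"  # marker used in the tables when no candidate is found

# Candidates copied from the Dataset B excerpt in Section 6.1.
DATASET_B = [
    {"year": "2011-2012", "name": "Adama Touré",
     "from": "FC Lorient", "to": "Sporting B", "fee": "Free"},
    {"year": "2011-2012", "name": "Adama Touré",
     "from": "Paris SG", "to": "FC Lorient", "fee": "Free"},
    {"year": "2014-2015", "name": "Míchel",
     "from": "Novorizontino", "to": "Guarani-SC", "fee": "Free Transfer"},
]


def levenshtein(a, b):
    """Edit distance: insertions, deletions and substitutions all cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def label_distance(entry, cand):
    """Summed distance over both club labels (assumed way of combining them)."""
    return (levenshtein(entry["from"], cand["from"])
            + levenshtein(entry["to"], cand["to"]))


def wsrl_fee(entry, dataset_b):
    """Return the fee of the most similar candidate with the same (year, name)."""
    candidates = [c for c in dataset_b
                  if (c["year"], c["name"]) == (entry["year"], entry["name"])]
    if not candidates:
        return NO_MATCH
    best = min(candidates, key=lambda c: label_distance(entry, c))
    return best["fee"]


toure = {"year": "2011-2012", "name": "Adama Touré",
         "from": "Lorient", "to": "Sporting Gijón B"}
michel = {"year": "2014-2015", "name": "Míchel",
          "from": "Getafe", "to": "Valencia"}

print(wsrl_fee(toure, DATASET_B))   # picks the closest of two candidates
print(wsrl_fee(michel, DATASET_B))  # the only candidate wins, although the clubs differ
```

Because the closest candidate is always returned, every additional candidate is another opportunity for a wrong record to end up closest, and a single unrelated candidate, as for Míchel, is accepted outright.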

Additionally, KBRL has proven to be more resilient to noisy and incomplete data than WSRL. The word-similarity algorithms presented false matches when the data was incomplete. KBRL might therefore be a suitable approach for a project that uses noisy and incomplete data, because in this project it did not produce false positives when the data was incorrect or absent.

In summary, knowledge-based record linking shows an advantage in precision over conventional word-similarity-based record linking for domain-specific tasks. The technique is promising for domain-specific projects that use noisy or incomplete data, or datasets that contain a large number of candidates per entry.


