
Record linkage using graph structures

Combining structural and attribute identity for graph alignment within REGAL.

Niek IJzerman 11318740

Bachelor thesis, Credits: 18 EC

Bachelor Bèta-Gamma (major Artificial Intelligence)
University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisors: Prof. Dr. U. Aickelin, Dr. P. Lin, Dr. S. van Splunter

School of Computing and Information Systems, University of Melbourne
Level 8, Doug McDonell Building, VIC 3010, Australia


Abstract

Record linkage aims to match and, if desired, merge records across different datasets that refer to the same entity. Multiple methods exist to match records across datasets. In addition, the REGAL algorithm has been proposed to identify matching records by combining their structural and attribute identity in graph alignment. Within this thesis, it was investigated whether the combined use of structural and attribute information within REGAL could be beneficial for record linkage. A variety of experiments was conducted, including a comparison of string metrics for attribute matching, graph structure adjustments and the inclusion of edge weights within graphs. The addition of attribute information in the alignment process proved to have a positive influence. The use of weights within graphs also improved alignment performance. In conclusion, it has been shown that combining structural and attribute information within the REGAL framework has potential for successful record linkage. There is, however, room left to refine the alignment process. In particular, the alignment accuracy for node pairs that differ in connectedness across graphs could be improved.


Acknowledgements

First and foremost, I would like to thank Dr. Pauline Lin for the incredible support and guidance she gave me during this project. It truly has been a great experience to work with you on this project. I would also like to thank Prof. Dr. Uwe Aickelin for allowing me to be part of his research institute. I feel very privileged to have been part of their research team. Lastly, I would like to thank Dr. Sander van Splunter for supporting my proposal to do my thesis overseas.


Contents

1 Introduction
2 Related work and theory
   2.1 Attribute-based record linkage
       2.1.1 Deterministic record linkage (DRL)
       2.1.2 Probabilistic record linkage (PRL)
             2.1.2.1 Similarity metrics for probabilistic linkage
             2.1.2.2 Scenario analysis for similarity metrics
   2.2 Structure-based record linkage
3 Theoretical foundation of the REGAL framework
   3.1 Node Identity Extraction
   3.2 Efficient Similarity-based Representation
   3.3 Fast Node Representation Alignment
4 Data formatting
   4.1 Data
   4.2 Pre-processing
5 Experiments and results
   5.1 Setup
   5.2 Experiment 1: Performance improvement by attribute treatment
       5.2.1 Accuracy improvement by adding attribute information
       5.2.2 Accuracy improvement by venue deduplication and probabilistic title comparison
       5.2.3 Accuracy inhibition by structural information
   5.3 Experiment 2: Solve structural information limitations
       5.3.1 Improvement of rule for graph construction
       5.3.2 Outlier removal to equal degree distributions
       5.3.3 Addition of degree weights to improve discriminative power
       5.3.4 Algorithm design limitations for structural information alignment
   5.4 Experiment 3: Including disconnected components
       5.4.1 Inclusion of small clusters
6 Discussion
7 Conclusion


1 Introduction

The global collection of data has grown tremendously over the past few years [1]. Many companies and organizations collect and process huge numbers of datasets in order to obtain interesting insights that allow for quality decision making. To be able to perform adequate and meaningful analysis on collected data, multiple data sources are often linked and conglomerated. In a broad variety of fields, such as healthcare, business and government services, and crime, linking documents from various sources with the aim to improve data quality and gain new insights is becoming popular [2]. For example, data linkage has been used successfully to decrease perinatal mortality in The Netherlands [3] and to advance prostate cancer research [4]. These applications incorporate information including demographics, medical history and laboratory research.

When data about an entity (e.g., an individual or company) comes from multiple sources, it is often desirable to match and merge records from those sources that correspond to the same entity. Other terms that relate to record linkage are entity resolution, disambiguation and deduplication, meaning that records that correspond to the same entity are being linked and, if desired, merged [5]. There are considerable challenges involved in record linkage, mainly due to poor data quality. First of all, spelling in different records may differ. An example would be different spelling of addresses (e.g., 44 Stroud Str. and Stroud Street 44). Although the compositions of these addresses differ, it is noticeable that they relate to the same address. Besides diverse notations for entity features that refer to the same entity, missing values might cause complications as well (e.g., a missing date of birth for an individual).

Notwithstanding the long history of work on record linkage, there is still a surprising diversity of approaches. The most classical approach to check whether two records refer to the same entity would be manual comparison. Although this method is likely to be quite exact, it can be very time consuming. While for small datasets this might not be an issue, most institutions work with large directories that would be too immense to analyze manually. For this reason, multiple other methods have been developed to link dossiers. Many of these methods rely on computational power. A machine-based technique would be the computation of a pre-defined similarity metric. For instance, Jaro or cosine distance could be used to measure similarity between features from different records [6, 7, 8]. If a pair of records has a similarity score higher than a threshold θ, the records are considered to refer to the same entity.

More recent research focuses on addressing the problem of record linkage by the use of graphs or networks [9]. Networks are informative structures that are able to capture the relationships in our interconnected worlds (e.g., friendships or email exchanges) [10]. Accordingly, numerous tasks can be performed on graphs, including record linkage. Performing record linkage on graphs inherently corresponds to a multiple-graph problem, in which network alignment (i.e., finding corresponding nodes in different networks) has to be executed. Multiple algorithms are able to find similar entities based upon their structure-based embeddings in networks. In addition to these existing algorithms, a new graph alignment algorithm named REGAL (REpresentation learning-based Graph ALignment) has been proposed for identifying corresponding nodes in different networks. A key strength of REGAL is its ability to use attribute-based identity as an extension to structural identity for record linkage.

In this thesis it will be observed how and if the combined use of structural- and attribute-based information within REGAL can be beneficial for record linkage. To evaluate this question, it can be broken down into the following subquestions that will be answered with experiments:

1. Can alignment accuracy be increased by appending attribute-based information on top of structural-based information within REGAL? (section 5.2)

2. Is REGAL’s accuracy affected by the degree connectedness within the pair of graphs that are considered for alignment? (section 5.3)

3. To what extent is REGAL sensitive to graphs containing disconnected components? (section 5.4)

This thesis starts with section 2, where background information on record linkage, including related work and relevant theory, is highlighted. Section 3 will elaborate on REGAL's functionality. In section 4, the dataset, the pre-processing steps and the evaluation metric used to determine REGAL's performance will be introduced. Section 5 shows the experiments that were performed and discusses the results. Lastly, sections 6 and 7 will respectively discuss and conclude the research, including possible future work.


2 Related work and theory

In 1946, Dunn was one of the first to consider record linkage [11]. Dunn's notion of record linkage stems from the principle of each person in the world creating a Book of Life. According to Dunn, events of importance worth recording in the Book of Life are frequently put on record in different places as a person moves about the world throughout his lifetime. Therefore, assembling the Book into a single compact volume is difficult. Yet, examining all of a person's important records simultaneously is sometimes necessary.

Although this exact notion of linkage may appear dated, research on record linkage has only increased. Christensen's study in 1958 linked marriage, birth and divorce records to give a continuing family picture over time, and in 1969 Fellegi and Sunter considered linking files in terms of Neyman-Pearson hypothesis testing [12, 13].

2.1 Attribute-based record linkage

Most current approaches, in which record linkage is seen as a classification problem, are derived from the Fellegi-Sunter model [14]. In practice this means that given a vector of similarity scores between the attributes of two records, they can be classified as a match or non-match.

To be able to determine similarity scores, data from entities is needed. Examples would be medical records of individuals containing disease rates, or business reports that incorporate sales figures. The simplest linkage process involves two data sources. Let's say we have two files. Both files hold patient records containing medical information. Specifically, we are interested in exact matching, i.e., linking information on the same entity from different sources. In contrast to exact matching, it could also be desirable to link entities that are not the same but similar to each other. Usually this is referred to as statistical matching. Although this thesis will not focus on statistical matching, it can be of interest to marketing ventures, for example.

Suppose that records in both files hold two attributes (e.g., name and D.O.B.). It is possible to compare the overlapping attributes of two records and associate them with an attribute similarity score. In the example of a record holding three attributes, correspondingly three similarity scores would be assigned. Conventionally, acquiring attribute similarity scores from different records can be done in a deterministic or a probabilistic way. Both approaches will be elaborated upon.

2.1.1 Deterministic record linkage (DRL)

In DRL, deterministic refers to exact matching. This denotes that two records are considered to be the same if all, or a predefined subset of, record attributes are of the exact same value. DRL usually requires precise and robust unique identifiers of high quality. High quality implies the absence of missing values or typos. Instances could be social security, tax or student numbers. If by any means one or multiple attributes mismatch, whether it arises from a true mismatch or a missing value, a match is prevented. Hence, the highest possible data quality in directories is desired.

2.1.2 Probabilistic record linkage (PRL)

In PRL, probabilities are determined or calculated for attribute pairs from two records. In general, probabilities between zero and one are assigned to pairs. Take the names John F. Doe and J. F. Doe, for example. Although these names seem to refer to the same person, in DRL these attributes would not make a match as their strings differ from each other. In PRL, however, we are able to look beyond exact matching. With the use of similarity metrics, we are able to determine to what extent strings are analogous. In theory, high similarity scores indicate high similarities between two strings and vice versa. Multiple metrics exist that can be put into practice for PRL; a few will be highlighted in the next section and used within this research.

2.1.2.1 Similarity metrics for probabilistic linkage

String similarity metrics are being deployed in a range of application fields, including text mining, conversational agents and record linkage [15]. Illustrations of string metric algorithms that can be used are Levenshtein, Cosine and Jaro similarity. All these metrics are based upon different rules and hence may be used in different scenarios.

The Levenshtein distance is the minimum number of insertions, deletions and replacements of single characters required to change one string into another. This results in a non-negative integer. Mathematically, the Levenshtein distance between two strings a, b is given by lev_{a,b}(|a|, |b|), where

$$\operatorname{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0, \\ \min \begin{cases} \operatorname{lev}_{a,b}(i-1,j) + 1 \\ \operatorname{lev}_{a,b}(i,j-1) + 1 \\ \operatorname{lev}_{a,b}(i-1,j-1) + \mathbb{1}_{(a_i \neq b_j)} \end{cases} & \text{otherwise.} \end{cases} \quad (1)$$

For example, the Levenshtein distance between 'foo' and 'bar' is 3, and the Levenshtein distance between 'beauties' and 'beautiful' is 3 as well. Although these distance scores are the same, linguistically and aesthetically, for humans, 'beauties' and 'beautiful' have more in common than 'foo' and 'bar'. For this reason, drawing conclusions from raw Levenshtein results can be difficult. To overcome this complication, the Levenshtein distance can be normalized by dividing the score by the longer of the two string lengths. In the example, both 'foo' and 'bar' are of length 3, hence the normalized Levenshtein similarity would be 1 − 3/3 = 0 (nothing in common at all).
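To make the normalization concrete, here is a minimal sketch in Python, assuming the Jellyfish library (also used later in this thesis) with its `levenshtein_distance` function:

```python
import jellyfish

def normalized_levenshtein_similarity(a, b):
    """Levenshtein distance divided by the longer string length,
    turned into a similarity in [0, 1] as in the examples above."""
    if not a and not b:
        return 1.0  # two empty strings are identical
    distance = jellyfish.levenshtein_distance(a, b)
    return 1 - distance / max(len(a), len(b))

print(normalized_levenshtein_similarity("foo", "bar"))             # 0.0
print(normalized_levenshtein_similarity("beauties", "beautiful"))  # 1 - 3/9 ≈ 0.67
```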

As an addition to the classical Levenshtein distance, the Damerau-Levenshtein distance was developed [16]. This metric differs from the Levenshtein distance by including transpositions as an allowed operation, in addition to the insert, delete and substitute operations. It is therefore able to perform better on strings that contain misspellings caused by the transposition of two characters.

Cosine similarity is given by

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \quad (2)$$

As opposed to Levenshtein, the cosine metric requires sentences or words to be converted into vectors. Ways to do this include term frequency (TF) or the Word2Vec algorithm. We will illustrate cosine similarity with term frequency. Take the sentences:

1. This sentence is similar but not the same
2. This sentence is similar and the same

Lemmatization of these sentences would result in the following term frequency table:

Sentence  this  sentence  is  similar  but  not  the  same  and
1         1     1         1   1        1    1    1    1     0
2         1     1         1   1        0    0    1    1     1

Table 1: Term frequencies of lemmatized sentences.

Following up is normalizing the frequencies according to the sentence they belong to. Although this step should not be considered strictly necessary, working with unnormalized figures can lead to impracticable results. Normalization can be done with the respective magnitudes. Summing the squares of the frequencies and taking the square root, the magnitude of sentence 1 is √8 and that of sentence 2 is √7. In linear algebra, the resulting vector is referred to as the unit vector. Dividing the above term frequencies by their norms results in the following table:

Sentence  this  sentence  is    similar  but   not   the   same  and
1         1/√8  1/√8      1/√8  1/√8     1/√8  1/√8  1/√8  1/√8  0
2         1/√7  1/√7      1/√7  1/√7     0     0     1/√7  1/√7  1/√7

Table 2: Normalized term frequencies of lemmatized sentences.

Following up, the cosine similarity is computed by taking the sum of the pairwise products, $\sum_{i=1}^{n} A_i B_i$, where i denotes a term and A_i denotes its (normalized) frequency in sentence A. So, the cosine similarity would be:

$$6 \cdot \left( \frac{1}{\sqrt{8}} \cdot \frac{1}{\sqrt{7}} \right) \approx 0.8$$

Although this example focuses on sentences, cosine distance can also be employed on words. In that case, terms would be n-grams, i.e., character sequences of length n from the word. For example, the word 'car' could be split up into the terms 'ca' and 'ar'.
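The term-frequency variant of equation (2) can be sketched in a few lines of Python; `ngrams` is a small helper of our own illustrating the word-level use just described:

```python
import math
from collections import Counter

def cosine_similarity(terms_a, terms_b):
    """Term-frequency cosine similarity (equation 2) between two token lists."""
    tf_a, tf_b = Counter(terms_a), Counter(terms_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def ngrams(word, n=2):
    """Character n-grams, e.g. 'car' -> ['ca', 'ar'] for n = 2."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

s1 = "this sentence is similar but not the same".split()
s2 = "this sentence is similar and the same".split()
print(round(cosine_similarity(s1, s2), 2))               # 0.8, the worked example
print(cosine_similarity(ngrams("car"), ngrams("cart")))  # n-gram use on words
```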

Lastly, we will focus on Jaro similarity. The Jaro similarity between sequences s1 and s2 is defined by

$$\text{sim}_j = \begin{cases} 0 & \text{if } m = 0, \\ \frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right) & \text{otherwise.} \end{cases} \quad (3)$$

where

• sim_j is the Jaro similarity,
• m is the number of matching characters (characters only match if they are the same and not further apart than ⌊max(|s1|, |s2|)/2⌋ − 1),
• t is the number of transpositions (the number of matching characters that appear in a different order in s1 and s2, divided by 2),
• |s1| and |s2| are the lengths of s1 and s2 respectively.

To clarify, the Jaro similarity between 'dwayne' and 'duane' will be determined. Given these strings we identify the following parameters: m = 4, t = 0, |s1| = 6, |s2| = 5. Since m ≠ 0, we find:

$$\text{sim}_j = \frac{1}{3} \left( \frac{4}{6} + \frac{4}{5} + \frac{4 - 0}{4} \right) \approx 0.82$$

As expected, the similarity score implies that s1 and s2 have much in common.

To extend Jaro similarity, Winkler developed the Jaro-Winkler similarity [17]. Jaro-Winkler similarity is given by

$$\text{sim}_w = \text{sim}_j + lp(1 - \text{sim}_j) \quad (4)$$

where

• sim_w is the Jaro-Winkler similarity for s1 and s2,
• sim_j is the Jaro similarity for s1 and s2,
• l is the length of the common prefix, up to four characters,
• p is a predefined scaling factor for the amount of adjustment when common prefixes are present. This factor should not exceed 0.25, otherwise the similarity could become larger than one.


When using Jaro-Winkler similarity, strings that have common prefixes are given favorable similarity scores. Accordingly, it can be a more favorable option than Jaro similarity in specific cases. For the strings 'dwayne' and 'duane' with p = 0.1, it is noticeable that the Jaro-Winkler formula gives a higher similarity score than Jaro similarity:

$$\text{sim}_w = 0.82 + 1 \cdot 0.1 \cdot (1 - 0.82) \approx 0.84$$

If a string contains more than one word (i.e., it contains whitespace or another predefined separator), it is possible to sort the words alphabetically before Jaro-Winkler is applied. Unless errors occur in the first few characters of a word, the sorting will bring words from s1 and s2 into the same order, thereby potentially improving matching quality.
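Equations (3) and (4) translate directly into code. The sketch below is a plain-Python implementation (libraries such as Jellyfish provide optimized versions) and reproduces the 'dwayne'/'duane' numbers worked out above:

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity, equation (3)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    flags1, flags2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):  # count matching characters within the window
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not flags2[j] and s2[j] == c:
                flags1[i] = flags2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0  # transpositions: matched characters appearing out of order
    for i in range(len(s1)):
        if flags1[i]:
            while not flags2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler_similarity(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler similarity, equation (4), with common prefix length l <= 4."""
    sim_j = jaro_similarity(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return sim_j + l * p * (1 - sim_j)

print(round(jaro_similarity("dwayne", "duane"), 2))          # 0.82
print(round(jaro_winkler_similarity("dwayne", "duane"), 2))  # 0.84
```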

2.1.2.2 Scenario analysis for similarity metrics

As mentioned, different applications require different string similarity metrics. Although there are no specific guidelines to follow, some notions about the use of metrics in different situations can be made.

First of all, we can make a distinction for strings that involve multiple words. From our examples, we proposed cosine similarity for this situation. Cosine similarity is able to handle lemmatized sentences that include multiple words and effectively compute a similarity. A potential threat to cosine similarity, though, are misspellings. If strings tend to be similar but a misspelling prevents them from being exactly similar, the cosine metric will treat them as contrasting strings. Ultimately, strings that correspond to each other but contain misspellings can be regarded as divergent by their similarity scores.

Despite the fact that cosine similarity can be executed on words as well, it could encounter problems when considering word composition, as it would only examine the presence or absence of n-grams. Levenshtein, Damerau-Levenshtein, Jaro and Jaro-Winkler, however, do have this ability. For this reason, these options carry a higher distinctiveness for word similarity and might therefore be preferred over cosine similarity. Several of the proposed metrics can be used if misspellings occur. In particular, the designs of Damerau-Levenshtein, Jaro and Jaro-Winkler are able to capture these typographical errors.

Although these basic principles give some guidance, every unique situation requires a different metric, based upon the performance quality that a metric delivers. Experimenting with different approaches can assist in making a well-founded choice for practical deployment. This will also be encompassed within this research.

2.2 Structure-based record linkage

Data is representable in various ways, including graphs. Graphs are informative systems that are able to capture real-world relations such as road systems or friendships. A graph holds nodes and edges. In the example of friendships represented within a graph, a node would translate to a person and an edge would indicate a friendship between persons. Whenever multiple people and their friendships are represented, these altogether form a graph structure. This way of data representation does not only improve the representation of relationships, but also allows for graph analysis methods. Examples of analysis techniques are community detection, pattern recognition and link prediction.

Figure 1: Concept graph. [29]

Besides these analysis examples based on single graphs, multi-graph analysis to answer questions or solve problems is practicable as well. Problems that involve multiple graphs are prevalent in many domains. For instance, in MRI-based brain graphs of patients it is desirable to perform analysis across a collection of graphs [18]. Graph alignment (or network alignment) is a particularly interesting example. Instances of network alignment appear in various settings, including chemistry, bioinformatics and security [19, 20, 21, 22, 23, 24]. The task of network alignment can be described as finding corresponding nodes in different networks. What makes network alignment an interesting concept is its easy translation to record linkage. Basically, a node would represent a record and an edge would be regarded as a predefined similarity or rule (e.g., a common attribute) between two records.


In network alignment we are able to compare nodes by their structural identity. This notion stems from the well-established assumption that aligned nodes have similar structural connectivity or degrees [23, 24]. In short, structural information about a node within a graph can be learned by looking at its degree (i.e., the number of connected nodes). Higher-order information can be gained by taking its neighbors' degrees up to K hops into consideration as well. Following up, degree information about a node can be stored in an array. To link nodes from different graphs, these arrays can be compared pairwise to determine their structural similarity. We will elaborate on the exact alignment procedure that will be used within this thesis later on.

As mentioned, a predefined variable or rule is needed to connect nodes within a network. When creating a network from a database, attributes can be used to create rules that allow for edges between nodes. For instance, genres in a movie database or medicine dosage rates in a hospital registry can be utilized. Combining multiple attributes to create linking rules would be possible as well. Important factors that should be taken into consideration are a rule's discriminative power and the availability of the resource or resources it is based upon. If, for example, a social network is being constructed, edges between persons could be allowed if their sex is equal. Although this would be a legitimate rule, it almost certainly does not provide sufficient structural information for nodes within the network. The cause for this is the rule's low discriminative power. Because all women and all men will be connected to each other, no unique structural profiles for nodes are being generated. For this reason, performing alignment between multiple graphs using a rule with low discriminative power can be difficult. To eliminate this issue as far as possible, it is desirable to create rules that hold sufficient discriminative power where possible.

The availability of features to base rules upon should be considered as well. When constructing graphs, missing feature values in either or both of the used datasets can prevent a linking rule that is based upon those features from performing adequately. Ultimately, this form of low data quality shifts to the structured graph, which results in a low-quality graph structure. Careful tailoring and data pre-processing steps can help to improve a rule's quality and hence the usability of its associated graph. This will also be carried out later on in this thesis.

3 Theoretical foundation of the REGAL framework

In this section we will walk through the framework that will be used for the network alignment experiments, REGAL. This will be done by consecutively explaining the steps that REGAL takes to align nodes in different networks. The steps can be summarized as follows:

I Node Identity Extraction

This first step is responsible for the extraction of structural and attribute information for all nodes.


II Efficient Similarity-based Representation

The second step obtains the node embeddings. This is done by factorizing a similarity matrix of the node identities from the previous step. The Nyström method for low-rank matrix approximation is used to avoid expensive computation of pairwise node similarities and explicit factorization. Conceptually, the Nyström method performs an implicit similarity matrix factorization by (1) comparing the similarity of each node only to a sample of p ≪ n landmark nodes and (2) using these similarities to construct representations from a decomposition of the resulting low-rank matrix [26].

III Fast Node Representation Alignment

The final step aligns nodes between graphs by matching the created embeddings with the use of an efficient data structure. Eventually, the top-α most similar nodes based on their embeddings are returned.

3.1 Node Identity Extraction

To extract information from all nodes, we will use REGAL's representation learning module, xNetMF. xNetMF is able to define node identity in a way that generalizes to a multi-network problem. This may seem a straightforward choice, but it is of the utmost importance, as in multi-network problems nodes are not directly connected to each other and thus cannot be sampled in each other's context. As mentioned, we will focus on structural- and attribute-based identity as quantifiers for similarity. To extract structural identity, xNetMF looks at the degrees of a node's neighbors up to K hops. Let V be the set of all nodes in the two graphs G1 and G2. For a node m ∈ V, we denote R^k_m as the set of nodes that are exactly k ≥ 0 steps away from m in Gi. For the node subset R^k_m we want to capture its degree information. Storing these degrees can basically be done by placing them in a D-dimensional vector d^k_m, where D is the maximum degree in graph Gi and the i-th index (i.e., d^k_m[i]) is the number of nodes in R^k_m with degree i. In reality, however, graphs may have skewed degree distributions. Hence, high-degree nodes can expand the length of these vectors, which makes this procedure inconvenient. To compensate for these imbalances, nodes are merged into logarithmically scaled bins such that the i-th entry of d^k_m contains the number of nodes m′ in R^k_m with ⌊log2(degree(m′))⌋ = i. The main benefits of this alteration are the shortening of the vectors d^k_m and more robustness to changes in degrees that arise from noise or errors. In addition to structural information, node attributes or features have been shown to be useful for cross-network tasks as well [24]. If nodes hold F features, an F-dimensional vector f_m can be created which holds a node's attributes. These attributes can be either numerical or alphabetical. If no attributes are available, xNetMF can rely on structural information only.
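As an illustration, the following sketch derives such log-binned, hop-aggregated degree vectors with NetworkX. The function names and the binning cap are our own; xNetMF's actual implementation differs in detail. It assumes the graph has at least one edge:

```python
import math
import networkx as nx
import numpy as np

def khop_rings(G, m, K):
    """R^k_m for k = 1..K: the nodes exactly k hops away from node m."""
    dist = nx.single_source_shortest_path_length(G, m, cutoff=K)
    rings = {k: [] for k in range(1, K + 1)}
    for node, k in dist.items():
        if k >= 1:
            rings[k].append(node)
    return rings

def binned_degree_vector(G, nodes, num_bins):
    """Log-binned degree histogram: entry i counts nodes with floor(log2(degree)) == i."""
    vec = np.zeros(num_bins)
    for n in nodes:
        d = G.degree(n)
        if d > 0:
            vec[min(int(math.log2(d)), num_bins - 1)] += 1
    return vec

def structural_identity(G, m, K=2, delta=0.5):
    """Aggregated neighbor-degree vector d_m = sum_k delta^(k-1) d^k_m."""
    num_bins = int(math.log2(max(max(d for _, d in G.degree()), 1))) + 1
    rings = khop_rings(G, m, K)
    return sum(delta ** (k - 1) * binned_degree_vector(G, rings[k], num_bins)
               for k in range(1, K + 1))
```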

To combine both sources of node identity, a similarity function is used that can find correlated nodes within or across networks. This function is as follows:

$$\text{sim}(m, n) = \exp\left( -\gamma_s \cdot \|d_m - d_n\|_2^2 - \gamma_a \cdot \text{dist}(f_m, f_n) \right) \quad (5)$$

where

• γs and γa are scalar parameters that control the effect of structural- and attribute-based identity respectively,
• dist(f_m, f_n) is the attribute-based distance between nodes m and n (this term is excluded from the function if no attribute information is available),
• d_m = Σ_{k=1}^{K} δ^{k−1} d^k_m is the neighbor degree vector for node m aggregated over K different hops, where δ is a discount factor for greater hop distances.

This function enables comparison of structural identity by combining the neighborhood degree distributions at multiple hop distances, in which the influence of distant neighborhoods can be attenuated with weight parameters. Also, attribute information is accounted for by including the distance between attribute vectors. Important to mention is that only hard alignments (i.e., deterministic comparisons) are considered within xNetMF by default. Hence, the distance would be calculated by the following formula:

$$\text{dist}(f_m, f_n) = \sum_{i=1}^{F} \mathbb{1}\left[ f_m(i) \neq f_n(i) \right] \quad (6)$$
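A minimal sketch of this hard, deterministic attribute distance; the probabilistic variants used later simply replace the indicator with a string-similarity score:

```python
def attribute_distance(f_m, f_n):
    """Equation (6): count the attribute positions where two records disagree."""
    return sum(1 for a, b in zip(f_m, f_n) if a != b)

print(attribute_distance(["1999", "SIGMOD"], ["1999", "VLDB"]))  # 1
```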

For the sake of the experiments that will be conducted, we will implement probabilistic linkage strategies as well. For these probabilistic comparisons, the string metrics that are described in section 2.1.2.1 will be used.

3.2 Efficient Similarity-based Representation

A common representation learning method to determine a full similarity matrix that holds all pairwise node similarities is taking random walks on graphs [9, 28]. Within REGAL, however, taking random walks is avoided for two main reasons. First of all, they introduce variance in representation learning that leads to non-comparable embeddings between networks. Secondly, random walks add to the computational expense, which can cause long run times. To overcome these problems, an implicit matrix factorization-based approach is used. Intuitively, the goal is to determine n × p matrices Y (the node embedding matrix) and Z (not needed for alignment purposes) such that S ≈ YZ^T. Theoretically, the full similarity matrix S is approximated with a low-rank matrix S̃ which is never explicitly calculated. This matrix S̃ is built by randomly selecting p ≪ n landmark nodes from graphs G1 and G2 and computing their similarities to all n nodes in V using function (5). As a product of this step, an n × p similarity matrix C is constructed. From matrix C we can extract a p × p landmark-to-landmark similarity matrix W. Together, matrices C and W provide a sufficient approximation of the full similarity matrix without actually computing and factorizing S̃. The low-rank matrix S̃ is given by:

$$\tilde{S} = C W^{\dagger} C^{T} \quad (7)$$

where

• C is the n × p matrix constructed by computing similarities between the chosen p landmark nodes and all n nodes in V,
• W† is the pseudoinverse of the landmark-to-landmark similarity matrix W,
• C^T is the transpose of matrix C.

Figure 3: xNetMF versus classical matrix factorization approach for computing the node embeddings Y. [26]

Ultimately, the interest is not in similarity matrix S or its approximation S̃, but in the node embedding matrix that can be retrieved from S̃. As mentioned, similarity matrix S ≈ YZ^T. From this it follows that

$$\tilde{Y} = C U \Sigma^{1/2} \quad (8)$$

where

• C is the n × p matrix constructed by computing similarities between the chosen p landmark nodes and all n nodes in V,
• UΣ^{1/2} arises from W† = UΣV^T, the full-rank singular value decomposition of the pseudoinverse of the landmark-to-landmark similarity matrix W (we can rewrite S ≈ S̃ = C(UΣV^T)C^T = (CUΣ^{1/2})(Σ^{1/2}V^T C^T) = Ỹ Z̃^T).

Following these steps, Ỹ can be obtained using only the node comparisons in matrix C. The computationally expensive SVD is performed only on the small matrix W. In conclusion, node representations are obtained by implicitly factorizing S̃. The p-dimensional node embeddings of graphs G1 and G2 are Ỹ1 and Ỹ2, which are subsets of the rows of Ỹ. Working with this structure effectively reduces runtime and memory use [26].
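The implicit factorization can be sketched compactly with NumPy. The structure-only similarity in the comment follows equation (5); the variable names are ours and this is a sketch, not REGAL's exact code:

```python
import numpy as np

def xnetmf_embeddings(identities, landmark_idx, gamma_s=1.0):
    """Sketch of the Nystrom-style embedding of equations (7) and (8).

    identities   : n x d node identity vectors (aggregated degree vectors d_m),
                   rows of both graphs stacked together.
    landmark_idx : indices of the p << n randomly chosen landmark nodes.
    Returns the n x p embedding matrix Y~ = C U Sigma^(1/2).
    """
    # C[i, j] = sim(node i, landmark j) = exp(-gamma_s * ||d_i - d_j||^2), eq. (5)
    diff = identities[:, None, :] - identities[None, landmark_idx, :]
    C = np.exp(-gamma_s * np.sum(diff ** 2, axis=2))   # n x p
    W = C[landmark_idx, :]                             # p x p landmark block
    U, s, _ = np.linalg.svd(np.linalg.pinv(W))         # SVD only on the small W
    return C @ U @ np.diag(np.sqrt(s))                 # Y~, equation (8)
```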

3.3 Fast Node Representation Alignment

As a final step, nodes are aligned using their representations, under the assumption that nodes m ∈ V1 and n ∈ V2 may match if their xNetMF representations are similar. The most straightforward way to do this would be to compute all similarities between the node embeddings (i.e., the rows of Ỹ1 and Ỹ2) and choose the top-α for each node thereafter. Although this is straightforward, it is computationally expensive; instead, the embeddings of one graph are stored in a k-d tree data structure to quickly find the most similar nodes. Here, similarity is calculated by converting Euclidean distance into a similarity measure with

$$\text{sim}_{\text{emb}}(\tilde{Y}_1[m], \tilde{Y}_2[n]) = e^{-\|\tilde{Y}_1[m] - \tilde{Y}_2[n]\|_2^2} \quad (9)$$

Finally, according to preference, the nodes corresponding to the top-α alignments across graphs G1 and G2 are returned.
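A sketch of this alignment step using SciPy's k-d tree; the similarity conversion follows equation (9):

```python
import numpy as np
from scipy.spatial import cKDTree

def align_top_alpha(Y1, Y2, alpha=5):
    """For every embedding in Y1 (graph G1), find the top-alpha most similar
    embeddings in Y2 (graph G2) and their similarities per equation (9)."""
    tree = cKDTree(Y2)                    # index G2's embeddings once
    dists, idx = tree.query(Y1, k=alpha)  # alpha nearest neighbours per node
    sims = np.exp(-dists ** 2)            # sim_emb = exp(-||.||_2^2)
    return idx, sims
```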

Figure 4: Complete overview of the REGAL approach to node alignment. In this example, k = 2 hops and discount factor δ = 0.5. No logarithmic clustering is applied. [26]

4 Data formatting

In this section, the data that will be used for the experiments is described first. Secondly, the pre-processing steps that are necessary to make the data usable for record linkage are presented.

4.1 Data

The experiments will be performed with the use of two datasets: the ACM and DBLP datasets. Both datasets are comma-separated values (CSV) files that are generated and made publicly available by the Database Group Leipzig [25]. The ACM and DBLP datasets hold 2294 and 2616 academic publications respectively, with data science as overlapping subject. Each record (i.e., a publication) holds five attributes:

• ID: a unique identifier for every paper
• Title: title of the published paper
• Authors: authors of the published paper
• Venue: event where the paper was first presented
• Year: year of event


An example record from the ACM dataset looks as follows:

ID       304586
Title    The WASA2 object-oriented workflow management system
Authors  Gottfried Vossen, Mathias Weske
Venue    International Conference on Management of Data
Year     1999

Table 3: Record extracted from ACM database.

In the DBLP dataset, this same publication is available as well and looks as follows:

ID       conf/sigmod/VossenW99
Title    The WASA2 Object-Oriented Workflow Management System
Authors  Mathias Weske, Gottfried Vossen
Venue    SIGMOD Conference
Year     1999

Table 4: Record extracted from DBLP database.

As visible, these matching example records are not exactly similar, as the ID and venue attributes differ. These differences are not only present in this example but occur in almost all matching pairs across the datasets. To resolve these alterations where possible, we put multiple methods into practice, such as our proposed probabilistic linkage. This will be done in experiment 1. Also, there is no missing data in either dataset.

Since a golden standard is available, it is known that in total 2225 record pairs exist across both datasets. This golden standard will be used to determine whether alignments made within REGAL are correct or incorrect. As a result, accuracy for REGAL runs can be measured.

4.2 Pre-processing

The ACM and DBLP datasets are unworkable in their raw form. Therefore, pre-processing steps will be taken to make the data usable for alignment purposes. All these steps will be performed using Python 3.6.4 and NetworkX. NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. Within this thesis it will be used to build and edit graphs. From these graphs, edge lists will be extracted and merged as REGAL uses this representation of graphs as input. The complete process has the following steps:

1. Add ACM and DBLP records to the ACM and DBLP graph respectively
2. Add edges in the ACM and DBLP graphs
3. Remove small clusters from the ACM and DBLP graphs
4. Extract and merge edge lists from the ACM and DBLP graphs


Of these steps, the second demands careful thought. As described in section 2.2, adding edges between nodes in a graph requires a rule. In this research, we will use a rule that connects nodes if they have an attribute in common. This leaves us with a few options: the use of title, authors, venue or year as a common attribute. Title as common attribute would result in all or almost all nodes being isolated, because there are very few records in each dataset that carry the same title. As for venue and year, there are only very few unique values present in the datasets. Hence, the discriminative power would not be very high. Using the authors of papers, though, produces proper graphs, as (1) every paper has its own unique authors and (2) within both datasets, most authors have written multiple papers. Consequently, an edge between nodes will be created when they have an author in common. The author comparison is done by deterministic comparison by default. In experiment 2, however, we will also use probabilistic comparison for allowing connections between nodes. In figures 5 and 6, the resulting graphs for the ACM and DBLP datasets respectively, using the proposed rule, are given.

Figure 5: Unprocessed ACM graph.
Figure 6: Unprocessed DBLP graph.

As visible, in both graphs there is a large component of interconnected nodes surrounded by small clusters of nodes. Although we will investigate to what extent these small clusters influence the alignment performance in experiment 3, they will be removed for the first experiments, so only the large components of interconnected nodes are present in both graphs. The resulting ACM and DBLP graphs hold 1629 and 2109 nodes respectively. The corresponding plots and node degree distributions are given below:


Figure 7: Processed ACM graph.
Figure 8: Processed DBLP graph.

Figure 9: ACM degree distribution.
Figure 10: DBLP degree distribution.

By observation, it is noticeable that the distributions are quite similar in shape. However, the DBLP graph has a few outliers that surpass the maximum node degree existing in the ACM dataset, which is around 110. As part of experiment 2, we will investigate to what extent this skew in scale between both graphs affects the node alignment performance.

As a final step of the pre-processing, we extract and merge the edge lists from both graphs. Following up, this merged edge list can be fed into the REGAL framework as a text file, so we are able to align nodes between both graphs.
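The four steps above can be sketched with pandas and NetworkX as follows. Column names and file names are assumptions based on the dataset description in section 4.1, not the exact code used:

```python
import itertools
import networkx as nx
import pandas as pd

def build_graph(csv_path):
    """Step 1 and 2: one node per record, an edge whenever two records
    share at least one author (deterministic comparison)."""
    df = pd.read_csv(csv_path)
    G = nx.Graph()
    by_author = {}
    for _, row in df.iterrows():
        G.add_node(row["id"], title=row["title"], year=row["year"])
        for author in str(row["authors"]).split(","):
            by_author.setdefault(author.strip(), []).append(row["id"])
    for papers in by_author.values():  # connect records with a common author
        G.add_edges_from(itertools.combinations(papers, 2))
    return G

def largest_component(G):
    """Step 3: drop the small detached clusters."""
    return G.subgraph(max(nx.connected_components(G), key=len)).copy()

acm = largest_component(build_graph("ACM.csv"))
dblp = largest_component(build_graph("DBLP.csv"))
# Step 4: merge the edge lists into one input file for REGAL
nx.write_edgelist(nx.union(acm, dblp, rename=("acm-", "dblp-")),
                  "combined_edges.txt", data=False)
```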

5 Experiments and results

The aim of this thesis is to determine how and if the combined use of structural- and attribute-based information can be beneficial for record linkage. Hence, we proposed REGAL as a framework to align nodes by their feature representations. In this section, we first give a brief overview of the setup that will be used. Secondly, three experiments and their results are shown that answer the three stated subquestions of this thesis.


5.1 Setup

The hardware that will be used for running the experiments is a MacBook Pro with a 2.7 GHz Intel Core i5 processor and 8 GB RAM. The software that is used to run the REGAL framework is Python, with the Jellyfish library added on top for probabilistic linkage purposes. Pre-processing of data and analysis of REGAL output is done with Python and the NetworkX library.

The hyperparameters that will be used are δ = 0.01, K = 2, γs = γa = 1 and clustering scale φ = ⌊10 log2 n⌋. These default settings will be used due to their stable results at reasonable computational costs [26]. As REGAL randomly samples nodes to approximate the full similarity matrix S, we will do x = 5 runs to determine the accuracy score for each top-α setting. We will determine accuracies for the top-1, top-5, top-10, top-20 and top-50 most similar node outputs. Accuracy will be measured with the following formula:

$$\text{accuracy} = \frac{\#\text{ true positives}}{\#\text{ possible correct alignments}} \quad (10)$$

where

• # true positives is the number of correct alignments made, based upon the golden standard pairs,
• # possible correct alignments is the number of golden standard pairs existing across both graphs.
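In code, equation (10) amounts to a simple ratio; a sketch under the assumption that the alignments and the golden standard are available as dictionaries:

```python
def alignment_accuracy(top_alpha, gold):
    """Equation (10): share of golden standard pairs recovered.

    top_alpha : dict mapping each ACM node to its list of top-alpha DBLP candidates
    gold      : dict mapping each ACM node to its true DBLP counterpart
    """
    true_positives = sum(1 for m, n in gold.items() if n in top_alpha.get(m, []))
    return true_positives / len(gold)
```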

Underneath is a full overview of the parameters and abbreviations used (unless stated otherwise):

Parameter                                     Value(s)
δ (discount factor)                           0.01
K (maximum hop distance)                      2
γs (effect scalar structure-based identity)   1
γa (effect scalar attribute-based identity)   1
φ (clustering scale)                          ⌊10 log2 n⌋
x (# runs)                                    5
top-α (# of most similar nodes returned)      1, 5, 10, 20, 50

Abbreviations:
B (baseline)
S (structural information)
V (venue attribute information)
Y (year attribute information)
T (title attribute information)


5.2 Experiment 1: Performance improvement by attribute treatment

The first question to answer is: 'Can alignment accuracy be increased by appending attribute-based information on top of structural-based information within REGAL?'. Hypothetically, it is expected that feeding attribute information to our alignment algorithm can achieve a higher alignment accuracy than using structural information solely. To test this hypothesis, we will first run the algorithm without attribute information. Next, we insert attribute information as well. This will be done by applying all possible attribute combinations that may influence node alignment performance. Attributes that will be applied are title, venue and year. Consequently, seven combinations are possible. Attribute values are compared deterministically, which is the default REGAL setting. The results are visualized in figure 11.

Figure 11: Alignment accuracies for structural information solely and in combination with attribute-based information. Comparison of attributes is done deterministically.

5.2.1 Accuracy improvement by adding attribute information

As we see, three different trends emerge from adding attribute information. Of these, the combinations structure-venue and structure-venue-title only lower the accuracy of node alignment. Adding only the title as an attribute is slightly better but very similar to the performance with structure-based identity only. There are two combinations that perform better when top-α > 10: structure-venue-year-title and structure-venue-year. The combinations performing best for every top-α are structure-year and structure-year-title. This is what we can extract from these results:


• All combinations involving 'year' as an attribute perform best.
• If 'title' is added to a combination, it slightly increases performance in comparison to combinations where it is left out.
• The 'venue' attribute seems to lower the alignment performance in the combinations made.

These findings are grounded firmly by gold standard statistics that have been determined:

• 100% of golden standard node pairs have the same year value.
• 43% of golden standard node pairs have the exact same title value.
• 0% of golden standard node pairs have the exact same venue value.

5.2.2 Accuracy improvement by venue deduplication and probabilistic title comparison

Taken these results into account, there is no room left for improvement in ad-dition to the year attribute as it is of the highest possible quality. It, however, is possible to put methods into practice that have the ability to improve the quality and hence usability of the venue and title attributes.

First, we will look at the venue attribute. By inspection it is noticeable that in and across both datasets, different descriptions are used that refer to the same venue. These different descriptions are mainly due to the fact that abbreviations are used. For example, the International Conference on Manage-ment of Data is also referred to as SIGMOD Conference and VLDB is used as an abbreviation of Very Large Data Bases. Resolving these duplicates, results in 5 venue values instead of 10. Consequently, all node pairs that are golden standard have the same venue value. The effect of this modification on the performance was determined for all combinations that involved the venue at-tribute. Results are visualized in figure 12 and are in comparison to the old performances that include the venue duplicates.

Improving the usability of titles will be done with the metrics that have been discussed in section 2.1.2.1. Instead of deterministic comparison, where a score λ ∈ {0, 1} is assigned to calculate attribute similarities, probabilistic comparison gives us the ability to assign similarity scores i, where 0 ≤ i ≤ 1, to align nodes across the networks. We performed probabilistic comparison with all proposed string metrics on the combination structure-title. The results are shown in figure 13.


Figure 12: Alignment accuracies with and without deduplication of the venue attribute across datasets.

By inspection of figure 12, it can be concluded that removing duplicate venues from the dataset improves accuracy enormously for all possible combinations. Of all combinations, structure-venue-year and structure-venue-year-title perform best for every top-α class.

Improvement when using probabilistic linking metrics instead of deterministic linking for title comparison is clearly visible as well (figure 13). Although the differences are small, cosine similarity performs best for every top-α class. What is noticeable as well is that all top-1 accuracies for the different metrics are around 50% and that there is little to no increase when larger top-α accuracies are calculated. This pattern will be discussed more comprehensively later on. As various adjustments in the data and alignment algorithm have been made and their results analyzed, we will combine the most advantageous treatments for each attribute to examine the accuracy that can be achieved. Consequently, these attribute approaches will be combined:

• Venue: Deduplicate venue values across both datasets and use deterministic comparison
• Year: No further treatment demanded; use deterministic comparison
• Title: Use probabilistic comparison with the cosine metric instead of deterministic comparison

5.2.3 Accuracy inhibition by structural information

Figure 13: Deterministic and probabilistic structure-title alignment accuracies for various metrics.

In figure 14, it is disclosed that the best performance across attribute treatments is achieved by the structure-venue-year-title combination. When we look at our top-1 scores, we observe that, except for the structure-venue-year combination, all combinations are similar to their top-50 scores. This same pattern is visible in figure 13. From these observations, we can deduce that accuracy improvement for top-α > 10 presumably follows from randomness rather than from similarity scores. The growth of the baseline over all top-α groups strengthens this hypothesis, as it is very similar to the growth of the latter accuracy improvements. As a deduction, we test the hypothesis that structural information is a limiting factor for aligning nodes across both networks. To explore this possible inhibition by structural information, experiment 2 is performed.

5.3 Experiment 2: Solve structural information limitations

Because structural information seems to limit performance, we will focus on reviewing structural information characteristics that might influence this performance, by answering the question: 'Is alignment accuracy affected by the degree connectedness within the pair of graphs that are considered for alignment?'. We will approach this problem by dividing it into 3 main steps: (1) analyze the quality of the rule that is used for constructing both graphs, (2) test the effect of altering the degree distributions by removing outlier nodes in both graphs, and (3) add degree weights to improve discriminative power within graphs.


Figure 14: Alignment accuracies for all possible attribute combinations with best individual treatments for attributes.

5.3.1 Improvement of rule for graph construction

As described in section 4.2, we use the authors of papers to allow for connections between nodes. If we analyze our gold standard, we observe that 97% of the gold standard pairs have an author in common (by deterministic comparison). If we look at gold standard pairs that do not have at least one author in common by deterministic comparison, there are a few noticeable reasons why:

• Presence/absence of initials (e.g., Richard Snodgrass and Richard T. Snodgrass)
• Typos (e.g., Randal J. Peters and Randel J. Peters)
• Encoding differences between datasets (e.g., Ralf Hartmut Güting and Ralf Hartmut G?ting)

These examples only point out differences across both datasets, but such differences are potentially present within a dataset as well. Therefore, node connections within a graph may be inhibited. To examine to what extent this may affect node alignment performance based upon structural node properties, we will apply probabilistic comparison instead of deterministic comparison for author comparison. So, if similarity(author node 1, author node 2) > β, where β is a predefined threshold with 0 ≤ β ≤ 1, we will allow these nodes to be connected within the graph. The following metrics and thresholds will be used; a sketch of the resulting edge rule follows the list:

• Metrics: Jaro, Jaro-Winkler, Levenshtein, Damerau-Levenshtein, Cosine
• Threshold: 0.8 ≤ β ≤ 0.95 with intervals of 0.05
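A sketch of this probabilistic edge rule with the Jellyfish library (the function is named `jaro_winkler_similarity` in recent Jellyfish releases; older releases call it `jaro_winkler`):

```python
import jellyfish

def share_author(authors1, authors2, beta=0.9):
    """Connect two records if any pair of author names has a Jaro-Winkler
    similarity above the threshold beta (section 5.3.1)."""
    return any(jellyfish.jaro_winkler_similarity(a1, a2) > beta
               for a1 in authors1 for a2 in authors2)

print(share_author(["Randal J. Peters"], ["Randel J. Peters"]))  # True
```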


Figure 15: Alignment accuracies for deterministic and probabilistic comparison of authors for structuring the graphs.

As observable, there is little improvement from altering deterministic author comparison to probabilistic author comparison, and all metrics seem to perform similarly. Although this adjustment improves accuracy slightly, the computational costs of probabilistic instead of deterministic comparisons rise considerably. Hence, this adjustment to the alignment process might not be favorable.

5.3.2 Outlier removal to equal degree distributions

As presented in section 4.2, the shape of both graph degree distributions is quite similar. A noticeable difference, however, is that the DBLP graph holds some outlier nodes that have a degree > 150, as opposed to the maximum degree existing in the ACM graph, which is around 110. To see the effect of this skewness, we will (1) reconstruct both graphs without including nodes that have degree > 125 and (2) see if accuracy improvement is achieved when aligning nodes based on structural information. The results are shown below.


Figure 16: Alignment accuracies for structural information with and without outlier nodes.

When we analyze the acquired results, they show that removal of outlier nodes hardly improves accuracy. There is a slight increase noticeable at higher top-α values, but these accuracies are not significantly higher than those retrieved with outliers included.

5.3.3 Addition of degree weights to improve discriminative power

As a final attempt to substantially increase accuracy scores from structural information solely, we will look at the possible accuracy increase if we add weights to the node connections. To be able to link weights to edges within both graphs, a metric is needed to determine these weights. Hence, we propose two metrics to calculate weights (see the sketches after the definitions below). The equations look as follows:

$$\text{weight} = \sigma \cdot \frac{\#\text{ authors in common}}{\max(\#\text{ authors node 1}, \#\text{ authors node 2})} \quad (11)$$

where

• σ is a predefined scalar,

and

$$\text{weight} = \lambda\beta_1 + \theta\beta_2 \quad (12)$$

where

• λ is a predefined scalar for the year weight,
• θ is a predefined scalar for the venue weight,
• β1 ∈ {0, 1} (1 if years are the same, 0 if not),
• β2 ∈ {0, 1} (1 if venues are the same, 0 if not).
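Sketches of both weight metrics; the record fields and scalar defaults are illustrative:

```python
def author_weight(authors1, authors2, sigma=1.0):
    """Equation (11): scaled fraction of authors in common."""
    common = len(set(authors1) & set(authors2))
    return sigma * common / max(len(authors1), len(authors2))

def year_venue_weight(rec1, rec2, lam=1.0, theta=2.0):
    """Equation (12): lam * beta1 + theta * beta2 with indicator betas."""
    beta1 = int(rec1["year"] == rec2["year"])
    beta2 = int(rec1["venue"] == rec2["venue"])
    return lam * beta1 + theta * beta2
```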

Figure 17: Alignment accuracies for structural information with and without weights based on author similarity.

Figure 18: Alignment accuracies for structural information with and without weights based on year and venue similarities.

As pictured in figure 17, including weights based upon equation (11) only decreases accuracy scores. For every scalar, alignment accuracy is worse than when weights are excluded from the graphs to align. As visible in figure 18, adding weights based on equation (12) increases accuracy. Despite the fact that it is difficult to discern a pattern, it can be seen that the best performing scalar combinations are those where the scalars are not equal.

It should be mentioned that there are numerous ways to define weights and we have only used two possible metrics. This limitation is mostly due to the fact that the scope of this thesis does not allow for further analysis. It is, however, clearly noticeable that the addition of weights can improve accuracy. Hence, follow-up research could investigate the application of weights in graph alignment more comprehensively.

5.3.4 Algorithm design limitations for structural information alignment

As seen throughout experiment 2, there are multiple techniques available to increase accuracy based upon the structural information created within REGAL. None of the applied techniques, however, is able to increase accuracy significantly. A possible explanation for this could be an insufficient design choice within the REGAL algorithm. If we look at the distribution of differences in node degrees for record pairs in the gold standard across the networks, and at the distribution of the correctly aligned nodes, we see the following:

Figure 19: Difference in degree for golden pairs across the ACM and DBLP graphs (structural information solely based on common authors, no weights added). A negative score indicates a higher node degree in the DBLP graph, a positive score indicates a higher node degree in the ACM graph. Outliers with difference > 50 are left out.

As clearly visible, most correctly aligned nodes across networks for every top-α have little to no difference in degree. If node degree differences are high, almost no correct alignments are made. It is believed that a design choice within the REGAL framework might cause this flaw. Ultimately, this design limitation might cause the structural performance limitations that have been present throughout the experiments. Although the scope of this thesis does not allow for further analysis, further investigation could be very valuable to eliminate this possible defect.

5.4 Experiment 3: Including disconnected components

In the pre-processing steps to prepare our data and enable usability, small disconnected clusters were deleted to maintain similar graph structures for the ACM and DBLP graphs. With the removal of these smaller disconnected clusters, however, valuable data that could be correctly aligned was deleted as well. Hence, in this experiment we will answer the question: 'To what extent is REGAL sensitive to graphs containing disconnected components?' First of all, we will compare the accuracy rates for graph alignment with and without the inclusion of detached node clusters. This is visualized in figure 20.

Figure 20: Alignment accuracies for graphs with and without detached clusters. Graphs are constructed by deterministic comparison of authors as a rule. Only structural information is used.

5.4.1 Inclusion of small clusters

As visible, accuracies are quite similar for each top-α class. An important difference, however, is that more golden pairs are included across both graphs when they hold the smaller clusters. Hence, we can deduce that more nodes are aligned correctly. This is visualized in figure 21. After investigating whether these extra correct alignments arise from the big central components within both graphs or from the smaller clusters (figures 5 and 6), it was confirmed that the smaller clusters are responsible for 100% of the increase in correctly aligned nodes. Hence, excluding smaller clusters from graph alignment is not necessarily needed. We have included the statistics for node degree differences of these correctly aligned nodes from the small clusters as well. These are visible in figure 22.


Figure 21: Numbers of correctly aligned nodes by structural information with and without inclusion of disconnected components. Graphs are constructed by deterministic comparison of authors as a rule.

Figure 22: Difference in degree for golden pairs across the ACM and DBLP graphs (structural information solely based on common authors, no weights added). A negative score indicates a higher node degree in the DBLP graph, a positive score indicates a higher node degree in the ACM graph.

From figure 22 we can extract that gold standard pairs existing in the disconnected components across the graphs have very little degree difference. Most nodes from these subparts that are aligned correctly have no degree difference at all. This is very much in line with the degree differences for correctly aligned nodes in the biggest interconnected parts across the graphs (figure 19). As already suggested in section 5.3.4, this again may indicate a design limitation for structural linkage within the REGAL framework that is being used.


6 Discussion

This thesis explored the performance of REGAL for record linkage using structural and attribute identity of records. In experiment 1, we conducted tests to determine whether alignment accuracy could be increased by adding attribute-based information to REGAL. Our stated hypothesis, that inserting attribute information besides structural information in graph alignment could result in higher accuracy scores, has been proved right. When attribute-based information was added, the increase in accuracy was noticeable for each available attribute separately as well as combined. For each attribute, however, the best comparison approach had to be identified for optimal use. For example, adding the year attribute and comparing it deterministically proved successful, mainly thanks to its high quality. Comparison of titles, though, was most beneficial for alignment when done probabilistically, mostly due to lower data quality (i.e., fewer occurrences of exactly the same title strings in golden pairs).
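To make the distinction concrete, the sketch below contrasts the two comparison styles on toy data. The thesis used dedicated string metrics for probabilistic comparison; here Python's standard-library SequenceMatcher merely stands in as one example of a graded similarity, so the exact scores are illustrative.

from difflib import SequenceMatcher

def year_match(year_a, year_b):
    # Deterministic comparison: the attribute values either agree exactly or not.
    return year_a == year_b

def title_similarity(title_a, title_b):
    # Probabilistic comparison: a graded similarity in [0, 1] that still
    # scores highly when a title contains a small typo.
    return SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()

print(year_match(2002, 2002))                                # True
print(title_similarity("Record Linkage", "Record Linkaje"))  # ~0.93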

Looking at the literature, we find resemblances to these findings. For example, Zhu et al. approached their comparison research by creating datasets that represent a typical Medicare claims registry, including admission date, date of birth and gender [27]. They created 96 scenarios by altering three data parameters: rate of missing values and errors, discriminative power, and file size. In most scenarios, probabilistic comparison strategies demonstrated performance superior to deterministic comparison; in the few scenarios in which results were comparable, only a very low or low rate of missing data and errors was present. Baldwin et al. compared probabilistic and deterministic comparison for linking mother and infant within electronic health records to support pregnancy outcome research [28]. Besides numerical identifiers, alphabetical identifiers such as surname were used as well. Results showed that accuracy increased from 74.5% to 84.1% in one population and from 52.8% to 58.3% in another when a probabilistic matching algorithm was used.

Within this research as well, for linking scenarios with few to no errors and a strong discriminating power, deterministic comparison could perform equally well as probabilistic comparison. Probabilistic comparison, however, is able to obtain better accuracy than deterministic comparison if data quality is not optimal. Ultimately, with these findings we have strengthened previous research.

From the results in experiment 1, it was visible that structural information limited the accuracy rates that were reached. In experiment 2, we produced multiple accommodations within the alignment process that could potentially be beneficial for increasing accuracy. Changing the connectedness within graphs by adjusting the rule for edge allowance between nodes showed little improvement. Deleting outliers that skewed the degree distributions of both graphs gave a minor accuracy boost as well. The most promising adjustment, however, was adding weights to edges. We proposed two equations to calculate weights. Equation (11) only decreased accuracy; equation (12), however, increased accuracy for every top-α class. It should be mentioned, though, that only a very small part of the space of possible weightings was covered, and it is likely that better weight equations exist. Further research could focus on finding the most optimal weight additions with the data that is available.
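By way of illustration of what such a weighting can look like in practice, the sketch below attaches one plausible weight, the number of authors two papers share, to each edge of a NetworkX graph. This scheme is an assumption chosen for the example and does not reproduce equations (11) or (12); the authors dictionary is likewise a placeholder.

import networkx as nx

def add_common_author_weights(G, authors):
    # authors maps each paper node to the set of its author names.
    # Each edge is weighted by the number of authors the two papers share.
    for u, v in G.edges():
        G[u][v]["weight"] = len(authors[u] & authors[v])
    return G

# Toy usage: two papers sharing a single author.
G = nx.Graph([("p1", "p2")])
authors = {"p1": {"author_a", "author_b"}, "p2": {"author_a"}}
print(add_common_author_weights(G, authors)["p1"]["p2"]["weight"])  # 1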

An interesting pattern addressed in experiment 2 as well is the fact that most correctly aligned nodes have little to no difference in degree. Although the algorithm is able to process and align nodes from graphs of different sizes, it seems that differences in connectedness across graphs have a substantial impact on performance.

In the third conducted experiment we included components that were not connected to the biggest part of the graph as a result of the rule used. Results showed that a fair number of golden pairs present in the added smaller parts were aligned correctly. Consequently, it can be stated that including disconnected components in the graphs has no negative effect on the alignment process. Noteworthy is that most correctly aligned nodes within the added components had no degree differences. As this pattern was visible in experiment 2 as well, it sustains the belief in an algorithm design flaw. In follow-up research, analyzing this complication by looking at algorithm design choices in depth could offer resolutions to this problem.

Within this research, a few limitations are present as well. First of all, only one pair of equivalent datasets was used to perform our proposed experiments on. Consequently, our results may not be directly comparable to results that could have been achieved with other pairs of datasets, each with their own data characteristics. Hence, in follow-up research, a broader variety of datasets could be fed into the REGAL framework; if parallel results were visible, this would strengthen the findings within this thesis considerably. Secondly, during the experiments no adjustments were made to hyperparameters. Although the used settings are suggested by REGAL's developers [26], different parameter choices could result in better performance by the algorithm. Lastly, as only the REGAL algorithm was used to align nodes, no comparison was possible with other node alignment algorithms. This should not be seen as a limitation, but rather as a future research possibility that presents itself. As we have developed a strong foundation of theoretical understanding of node alignment with the REGAL framework, this knowledge could be used in analyzing and using other node alignment algorithms as well.


7 Conclusion

Within this thesis, we investigated whether the combined use of structural and attribute-based information within REGAL could be beneficial for record linkage.

First of all, adding attribute information to the structural alignment process has proven to be of positive influence: in-depth research on graph alignment excluding attribute-based information showed substantially lower accuracy scores. In comparing attributes as part of the graph alignment process, probabilistic comparison showed the ability to perform better on low-quality data. For linking scenarios with high-quality data, however, deterministic linkage performed equally well. A good addition to graph alignment performance was the use of weights within graphs; future research could focus on finding optimal weights with the available data resources. Also, it can be stated that disconnected components across graphs, separated because of the rule used for graph construction, have no negative effect on the graph alignment process. Possibly, though, design alterations to the REGAL algorithm could provide higher alignment accuracy.

In conclusion, we can say that the combined use of structural and attribute-based information within REGAL has proven to be of use for record linkage. There is, however, much room left for improvement of the algorithm and optimization of the alignment performance.


References

[1] Villars, R. L., Olofson, C. W., Eastwood, M. (2011). Big data: What it is and why you should care. White Paper, IDC, 14, 1-14.

[2] Vatsalan, D., Sehili, Z., Christen, P., Rahm, E. (2017). Privacy-preserving record linkage for big data: Current approaches and research challenges. In Handbook of Big Data Technologies (pp. 851-895). Springer, Cham.

[3] Ravelli, A. C., Tromp, M., van Huis, M., Steegers, E. A., Tamminga, P., Eskes, M., Bonsel, G. J. (2009). Decreasing perinatal mortality in The Netherlands, 2000-2006: a record linkage study. Journal of Epidemiology & Community Health, 63 (9), 761-765.

[4] Choi, I. Y., Park, S., Park, B., Chung, B. H., Kim, C. S., Lee, H. M., ... Lee, J. Y. (2013). Development of prostate cancer research database with the clinical data warehouse technology for direct linkage with electronic medical record system. Prostate international, 1 (2), 59-64.

[5] Steorts, R. C., Hall, R., Fienberg, S. E. (2016). A Bayesian approach to graphical record linkage and deduplication. Journal of the American Statistical Association, 111 (516), 1660-1672.

[6] Bayardo, R. J., Ma, Y., Srikant, R. (2007, May). Scaling up all pairs similarity search. In Proceedings of the 16th international conference on World Wide Web (pp. 131-140). ACM.

[7] Chaudhuri, S., Ganti, V., Kaushik, R. (2006, April). A primitive operator for similarity joins in data cleaning. In 22nd International Conference on Data Engineering (ICDE’06) (pp. 5-5). IEEE.

[8] Xiao, C., Wang, W., Lin, X., Yu, J. X., Wang, G. (2011). Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36 (3), 15.

[9] Zhu, L., Ghasemi-Gol, M., Szekely, P., Galstyan, A., Knoblock, C. A. (2016, October). Unsupervised entity resolution on multi-type graphs. In International semantic web conference (pp. 649-667). Springer, Cham.

[10] Koutra, D., Faloutsos, C. (2017). Individual and collective graph mining: principles, algorithms, and applications. Synthesis Lectures on Data Mining and Knowledge Discovery, 9 (2), 1-206.

[11] Dunn, H. L. (1946). Record linkage. American Journal of Public Health and the Nations Health, 36 (12), 1412-1416.

[12] Christensen, H. T. (1958). I. The Method of Record Linkage Applied to Family Data. Marriage and Family Living, 20 (1), 38-43.

[13] Fellegi, I. P., Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64 (328), 1183-1210.

[14] Singla, P., Domingos, P. (2006, December). Entity resolution with markov logic. In Sixth International Conference on Data Mining (ICDM’06) (pp. 572-582). IEEE.


[15] Corley, C., Mihalcea, R. (2005, June). Measuring the semantic similarity of texts. In Proceedings of the ACL workshop on empirical modeling of semantic equivalence and entailment (pp. 13-18). Association for Computational Linguistics.

[16] Damerau, F. J. (1964). A technique for computer detection and correction of spelling errors. Communications of the ACM, 7 (3), 171-176.

[17] Winkler, W. E. (1999). The state of record linkage and current research problems. In Statistical Research Division, US Census Bureau.

[18] De Vico Fallani, F., Richiardi, J., Chavez, M., Achard, S. (2014). Graph analysis of functional brain networks: practical issues in translational neuroscience. Philosophical Transactions of the Royal Society B: Biological Sciences, 369 (1653), 20130521.

[19] Klau, G. W. (2009). A new graph-based method for pairwise global network alignment. BMC bioinformatics, 10 (1), S59.

[20] Singh, R., Xu, J., Berger, B. (2008). Global alignment of multiple protein interaction networks with application to functional orthology detection. Proceedings of the National Academy of Sciences, 105 (35), 12763-12768.

[21] Vijayan, V., Milenković, T. (2017). Multiple network alignment via multiMAGNA++. IEEE/ACM transactions on computational biology and bioinformatics, 15 (5), 1669-1682.

[22] Bayati, M., Gleich, D. F., Saberi, A., Wang, Y. (2013). Message-passing algorithms for sparse network alignment. ACM Transactions on Knowledge Discovery from Data (TKDD), 7 (1), 3.

[23] Koutra, D., Tong, H., Lubensky, D. (2013, December). Big-align: Fast bipartite graph alignment. In 2013 IEEE 13th International Conference on Data Mining (pp. 389-398). IEEE.

[24] Zhang, S., Tong, H. (2016, August). Final: Fast attributed network alignment. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1345-1354). ACM.

[25] Rahm, Erhard, Köpcke, Hanna, and Thor, Andreas. DBLP ACM Dataset. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2017-07-21. https://doi.org/10.3886/E100843V2

[26] Heimann, M., Shen, H., Safavi, T., Koutra, D. (2018, October). Regal: Representation learning-based graph alignment. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (pp. 117-126). ACM.

[27] Zhu, Y., Matsuyama, Y., Ohashi, Y., Setoguchi, S. (2015). When to conduct probabilistic linkage vs. deterministic linkage? A simulation study. Journal of biomedical informatics, 56, 80-86.

[28] Baldwin, E., Johnson, K., Berthoud, H., Dublin, S. (2015). Linking mothers and infants within electronic health records: a comparison of deterministic and probabilistic algorithms. Pharmacoepidemiology and drug safety, 24 (1), 45-51.


[29] Meredith, C. (2017, February 24). Explaining GraphQL Connections [Web blog post]. Retrieved from https://blog.apollographql.com/explaining-graphql-connections-c48b7c3d6976

[30] Berg, J., Lässig, M. (2006). Cross-species analysis of biological networks by Bayesian alignment. Proceedings of the National Academy of Sciences, 103 (29), 10967-10972.
