
Record linkage on consecutive share and post actions

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER

OF SCIENCE

Erik Dolstra

11927429

Master Information Studies

Information Systems

Faculty of Science

University of Amsterdam

Friday 12th July, 2019

1st examiner: prof. dr. P.T. (Paul) Groth
2nd examiner: dr. R.P. (Rogier) Vlijm


ABSTRACT

To harness the rich amount of public data online, more companies combine data from online social networks with their own private data. A subset of these organizations publishes online articles, which readers can share to online social networks with the use of share buttons. In this research, we show that the timestamp of an intent to share and a consecutive post on a social network can be compared in a record linkage algorithm. We apply a novel method of comparison on the time differences between the consecutive share and post actions. We call this time difference intent time. We implement record linkage with intent time on a test case from a Dutch newspaper and Twitter. We then validate our algorithm on synthetic datasets, where we measure the recall, precision and F1-score. When a cost-based record linkage algorithm with intent time is compared to a baseline probabilistic record linkage algorithm, the F1-score and recall are higher, but the precision is lower. Our research is applicable for anyone who wants to perform record linkage on timestamps of online article shares and prefers F1-score over precision.

KEYWORDS

Record Linkage, OSN, Social Media, Entity Resolution

1 INTRODUCTION

Both in academics and businesses, there is an increasing interest in the research of matching entities across multiple datasets [27]. Matching entities across multiple datasets is most commonly referred to as record linkage [17]. Instead of using one unique identifier to refer to one entity, record linkage uses a combination of attributes and rules to determine the probability that two records refer to the same entity. Challenges in record linkage come from inconsistent data (for example, spelling mistakes), unsynchronized records between datasets (for example, an address not being updated when a person moves) and missing values (for example, not stating a first name or telephone number in a dataset) [15].

Online there is an increasing number of articles being shared on Online Social Networks (OSNs) [12]. A popular method for sharing online articles on OSNs is using a share button [25]. A share button is placed on an article’s webpage. When a reader of the article clicks the share button, then this generates a post with a reference to the current article that can be placed on an OSN with the social media profile of the current reader. The action of clicking a share button can also be registered via a cookie in a database of the article publisher. A cookie is a set of records containing all the actions that were performed by a reader while they browsed a website [20]. When a reader clicks a share button, they are referred to the website of the OSN. The website of the OSN is external for the article publisher and cookies do not work on websites of external parties.

OSNs are a popular method of sharing opinions [28]. Finding opinions on an online article can be of interest to the article publisher. It has been shown that online article publishers often already have information on their readers and are interested in gathering more [2]. This research focuses on exploring a technique of combining company-owned reader data with public OSN data. It has been shown in previous research [19] that there are not enough techniques available to combine company-owned data with public OSN data. In the related work, we show that there is a gap in the literature in enriching private data with public OSN data. A record linkage technique on the consecutive share and post actions from readers can help fill this gap.

1.1 Definitions

There are several definitions for record linkage and the related terms [23], but in this paper, we use the following definitions:

• Record linkage is the set of techniques and methods used for determining the probability that a pair of records refers to the same entity.

• A record is an entry in a dataset that is created by the action of an entity.

• An entity is a person or agent that creates one or multiple records in one or several datasets.

In addition, we introduce three new terms, which are as follows:

• A share intent is a record in a company's private dataset of an entity (in this research the reader of an online article) clicking on a share button of the article.

• A post action is a record in a public OSN dataset of a completed share intent, which generates a post on the OSN.

• The intent time is the difference in time between a share intent and a post action.

1.2 Research question and contributions

In traditional record linkage, there is a comparison on strings or numbers to find matching records [22]. In OSN datasets identifiers are often private, which makes the number of missing values high and the traditional record linkage techniques less accurate [1]. However, it has been shown in previous research [12] that public identifiers are often present in OSN datasets. In this research we explore the influence of one of these public identifiers, called the timestamp or temporal identifier. To determine the influence of this identifier on a record linkage algorithm, we pose the following research question:

To what extent can temporal identifiers influence the accuracy of a record linkage algorithm that matches the sharers of online articles with their social media profiles on OSNs?

In answering this research question, the following contributions are made:

• A method of applying record linkage with intent time is introduced, which can be applied by online article publishers that want to enrich their own datasets with public OSN data.

• A numerical comparison for temporal identifiers is introduced. Instead of directly comparing two values, this comparison method compares the difference of two timestamps to the mode of differences.


We apply record linkage with intent time in two cases. The first case uses a dataset with share intents from Algemeen Dagblad (a Dutch newspaper) and posts from Twitter. Algemeen Dagblad has 5.6 million unique monthly online readers [3]. In this test case, we only look at readers that share through the website of Algemeen Dagblad, because the mobile application of Algemeen Dagblad was not ready at the start of this research. Posts are gathered only from Twitter because the other share buttons on Algemeen Dagblad are for OSNs that do not publish data. The second test case uses three synthetic datasets that we generate. In these synthetic datasets, the distribution of the frequency of the intent times is changed to determine to what extent our record linkage algorithm works.

The rest of this paper is divided into five sections. In section 2, Related work, the current techniques for record linkage on public and private identifiers are described. In section 3, Methodology, it is described how record linkage with intent time works, how we create the synthetic datasets and how we validate our results. In section 4, Experiment evaluation, we describe two implementations of record linkage with intent time and their results. In section 5, Discussion, the results and the validation process of this research are discussed. In section 6, Conclusion, the research is concluded with final remarks and future work.

2 RELATED WORK

Record linkage is used in several fields. In epidemiology, record linkage is often used to compare records of patients among different datasets [23]. In marketing, there are examples of profiling clients [6] and in the social sciences, there are examples of deduplicating social media profiles [12]. Depending on the purpose and the datasets available, different record linkage methods can be applied.

Regardless of the purpose, implementing record linkage always involves the same three steps. First, data is gathered from two datasets. The records in these datasets are cleaned and attributes that can be used for linkage are identified. Second, the algorithm compares identifier values and then classifies record pairs as (non-)matches. Third, the results of the algorithm are validated to distinguish between true (non-)matches and false (non-)matches. The rest of this section gives an overview of the relevant theory available on these three steps. The subsections follow these steps respectively.

2.1 Data Gathering and Cleaning

Record linkage always requires records with multiple attributes that can be used to discern between entities. These attributes and their corresponding values are called identifiers [4]. In [1], it is stated that identifiers can be divided into private identifiers (such as the last name or email address) and public identifiers (such as public texts or timestamps). It is known that in OSNs the consistency of private identifiers is low relative to public identifiers [12]. For example, on average 50% of Yelp and Facebook users provide a real first name in their profile, while 100% of the posts placed on these OSNs contain a timestamp. When using public identifiers, both the granularity (the level of detail) and the margin of error may vary among OSNs [1]. When people share online content, it is common that this is done on multiple occasions [24], which creates multiple records that refer to one entity.

Using public OSN APIs to gather data for research is prone to bias [26] [21]. Public OSN APIs often return a subset of the total number of posts that match the search query. The posts returned by public OSN APIs are chosen by an algorithm. It is known that Twitter APIs prefer popular posts to unpopular posts when returning posts [16]. However, no definition of a popular post is given. According to [26], when using the Twitter Search API the proportion of posts returned can also differ. When a query was made to the Public Twitter Search API that should have returned 1084 posts, only 900 were returned (85%). In another query that should have returned 122,869 posts, only 67% of expected posts were returned.

2.2 Comparison and classification

Record linkage always starts with records divided over at least two datasets, called dataset A and dataset B. Records in these datasets are called (a ∈ A) or (b ∈ B) [4]. When records refer to the same entity they are called a true match. When records refer to different entities they are called a true non-match. The goal of a record linkage algorithm is to identify as many true matches as possible while minimizing the number of false matches [23]. Most record linkage algorithms go through a blocking, comparison and classification phase. This subsection is split into these phases respectively. Additionally, this subsection is concluded with the gap in the literature that exists for these algorithm phases.

2.2.1 Blocking phase. Blocking is necessary to limit the amount of computing power needed [8]. Blocking is performed when the number of possible record combinations between A and B becomes so large that it impacts the performance of the algorithm [5]. For example, if A contains 10,000 records and B contains 5,000 records, then the algorithm must consider 50,000,000 record pairs. To reduce the number of record pairs to consider, records can be divided into blocks. Blocks are created by taking one of the identifiers from the records and adding all records with the same identifier value to the same block. The blocking identifier must be an exact match on all records [8]. Blocking strategies can influence the recall and precision of an algorithm; therefore the blocking strategy must always be outlined in a methodology [5].
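As a sketch of the blocking idea above (not the paper's implementation; the dict-based record layout and the field name `url` are assumptions for illustration):

```python
from collections import defaultdict
from itertools import product

def block_on(records_a, records_b, key):
    """Group records by an exact-match blocking key and yield only the
    record pairs that share a block, instead of all of A x B."""
    blocks_a = defaultdict(list)
    blocks_b = defaultdict(list)
    for a in records_a:
        blocks_a[a[key]].append(a)
    for b in records_b:
        blocks_b[b[key]].append(b)
    # Only blocking-key values that occur in both datasets produce pairs.
    for value in blocks_a.keys() & blocks_b.keys():
        yield from product(blocks_a[value], blocks_b[value])

# Two records only pair up when their blocking key matches exactly:
A = [{"id": 1, "url": "example.com/article123"},
     {"id": 2, "url": "example.com/article999"}]
B = [{"id": "x", "url": "example.com/article123"}]
pairs = list(block_on(A, B, "url"))  # 1 pair instead of 2
```

Without blocking, the example would compare 2 × 1 = 2 pairs; with blocking on the URL, only the single pair sharing a block remains.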

2.2.2 Comparison phase. Comparison is done by going over each identifier (in) for each considered record pair after blocking (A × B) and applying some form of comparison. Private identifiers, such as names or addresses, are often compared through a string or numerical comparison [5]. The most widely used string comparison is the Jaro-Winkler string comparison [22]. The Jaro-Winkler comparison allows an identifier between a and b to still be a match when there are spelling errors in the records. With a Jaro-Winkler string comparison, a threshold for the maximum margin of error can be set. When it is known that records do not contain any errors, then an exact comparison on the identifier can be performed [4]. The numerical comparison works similarly. Here the difference between two numerical values (bi{in} − ai{in}) is compared to a threshold. When the difference is larger than the threshold, the comparison method returns a non-match status for the current identifier. This method of comparison is shown in formula 1. Numerical comparisons are often used for private identifiers, such as date of birth or salary [5].

0 < (bi{in} − ai{in}) / threshold < 1 → match    (1)

Public identifiers can also be numerical, but they often require a different method of comparison. For example, in [12], series of timestamps are compared and the timestamp series with the least distance between them are considered to match. The result of the comparison step in the record linkage algorithm is always a match or a non-match status for each identifier for each considered record pair.

2.2.3 Classification phase. The classification phase determines whether the identifier (non-)matches for each considered record pair lead to a record match or a record non-match [4]. Classification can be done in a number of ways. In probabilistic record linkage, the probability that two records match is calculated, where the threshold for assigning match status is based on the minimum desired probability of being a true match [23]. Probabilistic record linkage applies a weight to each identifier, which is based on the probability that an identifier match leads to a true match of records. To assign the weight of the identifier, the M-probability and U-probability are calculated. The M-probability is the probability that a record pair with the same identifier value leads to a true record match. The U-probability is the probability that a matching identifier leads to a match by chance. For example, the U-probability that two different people share the same month of birth by chance is 1/12. When the previous comparison step leads to a match on an identifier, then the weight (W) of that identifier match is calculated as shown in formula 2 [8], where the result (Wa) is always a positive number.

Wa = log2(M/U)    (2)

Wd = log2((1 − M)/(1 − U))    (3)

When the identifiers do not match, formula 3 is used [8]. Formula 3 always results in a negative number. The agreement (Wa) and disagreement (Wd) weights are summed to calculate the match score. The match score is calculated for each considered record pair. The match score is compared to a threshold, which can be estimated by the researcher or calculated [23]. To calculate the threshold, a starting weight (Ws), according to formula 4, and a probability weight (Wp), according to formula 5, are defined. In formula 4, E is the number of expected matches and (A × B) is the number of considered record pairs after blocking. In formula 5, P is the desired probability that a record pair classified as a match by the algorithm is a true match [8].

Ws = log2(E/((A × B) − E))    (4)

Wp = log2(P/(1 − P))    (5)

The record pair is a match when the combined weight score of the identifiers (Wa and Wd) is larger than the absolute difference between Ws and Wp. Otherwise, the record pair is classified as a non-match by the algorithm.

There are cases where probabilistic record linkage cannot be used for classification. Cases have been described where the number of expected matches is not known [12] or where true matches could not be validated [18]. A recent example is given in [18]. Here, two datasets that contain anonymized flight paths are compared. The goal of this research is to merge the two flight-path datasets and remove duplicates. The flight paths consist of a series of three-dimensional points, where the three dimensions are represented as latitude, longitude and a timestamp. Traditional probabilistic record linkage cannot be used because the number of combinations for these dimensions is incalculable, which makes the U-probability incalculable. For these cases, cost-based record linkage is used. In cost-based record linkage the researcher looks at the distance, or cost, between identifier values [8]. Cost-based record linkage can partially match on an identifier, which results in a proportion of the full identifier weight.

With cost-based record linkage, the threshold for classifying match status is estimated. When selecting a threshold the researcher must consider that there is a trade-off between the recall and the precision, where a higher threshold gives a lower recall but higher precision [23].

2.2.4 Applications of record linkage. In this literature research, we found various examples where two datasets containing private identifiers (such as patient records in epidemiology) are matched using probabilistic record linkage [23] [14]. Additionally, we found examples of cost-based record linkage on public identifiers in public datasets [12] [18]. However, we could not find cases of record linkage that combine a private company-owned dataset with a public dataset. According to [19], a gap in the literature exists when a relation must be made between existing company-owned datasets and public information gathered from OSNs.

2.3 Evaluation methods

To validate the accuracy of an algorithm, three metrics are often used [17] [8]. The recall is the proportion of true matches the algorithm found to the total number of true matches. The precision is the proportion of true matches found by the algorithm to the total number of matches found by the algorithm. The F1-score considers both the recall and the precision. Formula 6 is used to calculate the F1-score [13].

F1 = 2 × (recall × precision) / (recall + precision)    (6)

Calculating the recall and precision requires a gold standard [14]. A gold standard is a sample of two datasets in which all true (non-)matches are known. There is rarely a case where a gold standard is readily available. When there is no gold standard available, matches can be classified as true matches by clerical review. However, a clerical review is prone to bias and time-consuming [14]. Alternatively, it is proposed that a researcher can generate a dataset [10]. In this generated dataset, also called a synthetic dataset, record identifier values are assigned randomly according to the frequency distribution of each identifier [4]. This frequency distribution can be extracted from the real dataset. In a synthetic dataset, match status is added to a record pair when one thinks it would represent a real-world match. When a synthetic dataset is created, it is important that the researcher reports all the steps in the methodology. This way, potential bias can be detected [10].


3 METHODOLOGY

This section describes the methods of data gathering and the methods used to create and validate a record linkage algorithm with intent time. This section is divided into five subsections. First, we describe the data that is needed to perform record linkage with intent time. Second, we describe the steps the algorithm goes through with the data described in the first subsection. Third, we give a general outline of the methods used to validate the accuracy of the algorithm. Fourth, we describe how to gather the data needed for the datasets with Algemeen Dagblad readers and Twitter profiles. Finally, we describe how we use the data from Algemeen Dagblad to create synthetic datasets.

3.1 Data description

On a news website a reader's actions are tracked through a cookie. Every time a reader clicks the share button of an online article, a share intent is generated. Dataset A is a collection of share intents (a ∈ A). Every a contains multiple identifiers ai = {i1, . . . , in}.

If a reader completes their share intent, it generates a post action in dataset B (b ∈ B). Dataset B is gathered from an OSN and contains a collection of post actions. An unknown number of the post actions are generated by the share intents from dataset A. Each b contains multiple identifiers bi = {i1, . . . , in}. At least three identifiers in every ai and every bi are needed:

(1) An identifier for the entity that performed the share intent or post action.

(2) An identifier for the online article location that is often represented as a URL.

(3) A temporal identifier, which indicates the moment of the share intent or post action.

B does not necessarily contain all entities from A and A does not necessarily contain all entities from B. Additionally, both A and B can contain entities that do not occur in the other dataset.

The temporal identifier is often represented as a timestamp. The temporal identifier in ai marks the starting time of the share intent (the moment the share button was clicked). The temporal identifier in bi denotes the moment the post was created. From now on we will refer to these three identifiers as: i1 (entity identifier), i2 (article identifier) and i3 (moment of action).

An overview of these identifiers, with examples, is given in table 1.

symbol  example in A                                   example in B
i1      b0d42a43-f884-4c40-92f9-3a678a3bf4e4           @ErikDolstra
i2      http://example.com/article123/?source=twitter  https://example.com/article123/
i3      2019-04-19T09:24:11.751Z                       2019-04-19 09:27:04

Table 1: The necessary identifiers and examples of their values for cost-based record linkage with intent time.

3.2 Algorithm steps

We assume that both datasets A and B are filled with records that contain the identifiers from table 1. Of all possible record pairs, an unknown number are true matches and an unknown number are true non-matches.

Our record linkage algorithm is cost-based and uses intent time. The cost-based record linkage algorithm with intent time consists of seven steps. Below a quick overview of these steps is given. The rest of this subsection details each step.

(1) Clean the gathered data
(2) Create potential record pair blocks
(3) Calculate the intent time for each considered record pair
(4) Divide the intent times into ranges
(5) Plot the distribution of considered record pairs per intent time range
(6) Apply a numerical comparison with an approximated distribution
(7) Set the threshold for assigning match status

(1) Data cleaning is required before applying the algorithm. For i2 in table 1, we remove everything before the domain name and everything after the directory name of the article URL. Data cleaning on URLs is needed because URLs often contain more than just the article location, such as the network protocol (HTTP or HTTPS). For i3, we convert the timestamps to a Unix timestamp format. A Unix timestamp is an integer with the number of seconds that have passed since the start of the year 1970 [9]. Thus, the granularity of i3 is one second.
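The cleaning step might look as follows in Python (a sketch under the stated rules; the exact normalization used in the paper's implementation is not specified, so the helper names and edge-case handling here are assumptions):

```python
from datetime import datetime
from urllib.parse import urlsplit

def clean_url(url):
    """Keep only the domain and article path: strip the protocol,
    the query string and any trailing slash, so the two URL forms
    from table 1 compare as equal."""
    parts = urlsplit(url)
    return parts.netloc + parts.path.rstrip("/")

def to_unix(timestamp):
    """Convert an ISO-8601 timestamp (i3 in dataset A) to whole
    Unix seconds, the granularity used for i3."""
    dt = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return int(dt.timestamp())

# Both example URLs from table 1 normalize to the same value:
assert clean_url("http://example.com/article123/?source=twitter") == \
       clean_url("https://example.com/article123/")
```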

(2) To create blocks of potential record pairs, we base our blocks on an exact match of i2 after applying data cleaning. A × B now consists of record pairs where the URL is always the same between a and b.

(3) For every considered record pair we calculate the intent time using formula 7, where t stands for the intent time. Formula 7 calculates the difference between the Unix timestamp of b and the Unix timestamp of a.

t = bi{i3} − ai{i3}    (7)

(4) Every intent time is converted to a ranged intent time. This is needed because there is a margin of error in the timestamps in OSNs due to desynchronized timers and different processing delays [12]. For our implementation, we used ranges of five seconds. We converted our intent times to ranged intent times with formula 8, where t stands for the intent time and tr stands for the ranged intent time.

tr = 5 + t − t%5    (8)
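Formula 8 is a straightforward rounding step; a direct transcription:

```python
def ranged_intent_time(t, width=5):
    """Formula 8: round an intent time up into its five-second range,
    so small clock desynchronization between systems is absorbed."""
    return width + t - t % width

# Intent times of 1-4 seconds all fall into the 5-second range:
assert [ranged_intent_time(t) for t in (1, 3, 4)] == [5, 5, 5]
assert ranged_intent_time(7) == 10
```

Note that, taken literally, the formula maps an exact multiple of five (e.g. t = 5) into the next range (10), since t % 5 is then 0.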

(5) The number of considered record pairs per ranged intent time is plotted in a frequency distribution. This results in a graph similar to the graph displayed in figure 2 in section 4. In this figure, we see a sudden rise in potential matches when the intent time is between 0 and 5 seconds.

(6) We compare the ranged intent time of each considered record pair to an approximation of the distribution that is the result of step 5. This approximated distribution is made with three parameters and two thresholds (minimum and maximum). The three parameters are shown in figure 1 and are called the origin, offset and scale. These three parameters are gathered from an existing record linkage Python library [7]. The origin is set at the middle of the mode of the approximated distribution (set at 25 seconds for figure 1). The offset is half of the width of the top of the peak of the approximated distribution (set at 5 seconds for figure 1). The scale is the rate at which the weight of the identifier is halved (set at 17 seconds for figure 1). Additionally, a cut-off threshold on the maximum and minimum intent time is set. With thresholds it is possible to only consider the right or left half of the approximated distribution shown in figure 1. When the ranged intent time falls outside the thresholds, the score is always 0. When the ranged intent time falls within the thresholds, the numerical comparison returns a score between 0 and 1, where 1 is returned when the intent time falls within the range between the origin and the offset.

Figure 1: Example of the offset, scale and origin for an intent time numerical comparison.

(7) When all identifiers are compared and a match score is calculated for the considered record pair, we compare the match score to a threshold. The threshold is estimated based on the desired recall and precision. If the match score is higher than the threshold, then match status is assigned to the record pair.

3.3 Validation process

We measure accuracy on three metrics: recall, precision and F1-score. To calculate these values we need to know which record pairs are true matches, for which we need a gold standard [14]. To get a gold standard we considered several options. Two options and the reasons we did not use them are as follows.

The first option is using a clerical review to determine which record pairs are true matches. This does not work for our datasets, because the readers in the Algemeen Dagblad dataset are anonymous. Without the use of the temporal identifier we found no method of linking these readers to their social media profiles. Our algorithm already uses the temporal identifier, thus validating matches in a clerical review with the temporal identifier is not possible. The second option is finding a readily available gold standard. We could find no readily available gold standard that has similar identifiers to our dataset. All readily available gold standards we found are meant for probabilistic record linkage on private identifiers (for example, the gold standards of FEBRL [7]).

Instead of these options, we chose to generate a gold standard in a synthetic dataset. In the synthetic dataset, we decide which record pairs are a true match and which are a true non-match. The steps of data generation are described in subsection 3.5. We based our synthetic dataset on two datasets from a real-world implementation. In this implementation, we match readers from Algemeen Dagblad with social media profiles on Twitter. In subsection 3.4 we explain how we gathered this data.

3.4 Data gathering

We ran a script that gathered every share intent on the Algemeen Dagblad website from 02-05-2019 00:00:00 until 16-05-2019 23:59:59, a period of two weeks. This results in dataset A, which contains 11,683 share intents by 7,735 unique entity identifiers.

At the end of each day in the two-week period, the URL of every share button that was clicked is cleaned according to step 1 in section 3.2 and used in the Twitter Public Search API [16] by another script. This results in dataset B, which contains 12,347 Twitter posts by 4,603 unique entity identifiers.

For each record, the three necessary identifiers from table 1 are gathered. In addition, two optional identifiers are also gathered for each record. These two optional identifiers are shown in table 2. These five identifiers are the only usable identifiers that are present in both the Algemeen Dagblad dataset and the Twitter dataset. Here, we discuss several aspects of the identifiers.

i1 in dataset A can change when an entity clears their browser's cookies, switches browsers or switches their device. i1 in dataset B cannot be changed by the user, but a reader can have multiple Twitter profiles.

All the necessary identifiers (i1, i2 and i3) are not nullable.

The location identifier (i4) in dataset A is gathered from a reader's IP address and is automatically converted to a city. In dataset B, i4 is gathered from the Twitter user profile. Twitter users are free to set their location to any value, so i4 in dataset B is not reliable. When we compared the user locations of the 4,682 unique Twitter profiles against a list of 5,900 Dutch cities¹ with a Jaro-Winkler comparison [22] (margin of error set at 0.85), we found that 2,715 (58%) Twitter user locations contained at least one of these cities. It should also be noted that when a user moves around, the value of i4 changes in dataset A, but not in dataset B. i4 is nullable in both datasets.

The device type identifier (i5) is represented as one of four possible values: Android, iOS iPad, iOS iPhone and PC. The value "PC" includes all records that are made by devices that are not otherwise listed. Additionally, post actions (b) may also have the value "other". Post actions with the value "other" are mainly created by bots and third-party platforms. i5 is present in every record.

Table 2 shows an example of the identifiers i4 and i5 for datasets A and B.

symbol  example in A  example in B
i4      Amsterdam     AMSTRDAM, NEDERLAND
i5      Android       Android

Table 2: Example of the location identifier (i4) and the device identifier (i5).

If there is a true match between a and b, then the identifiers i2 and i5 are always an exact match in the comparison step. The comparison on i4 can result in a non-match even when the records are a true match. We estimate that in only 20% of the records that match, i4 is also a match with a Jaro-Winkler comparison method when the margin of error is set at 0.85.

¹ http://www.metatopos.eu/almanak.html

Device Type  count_i5  runTot  prevRunTot
Android      4,114     4,114   0
iPhone       1,057     5,171   4,114
iPad         2,101     7,272   5,171
PC           4,411     11,683  7,272
Other        0         11,683  11,683

Table 3: The number of record pairs, the running total and the running total of the previous device type for the share intents in A.

3.5 Data generation

Data generation is done according to a protocol, for which we wrote a program in Spark [11]. The protocol creates datasets with similar properties every time, except for the variable that we influence: the distribution of true matches per ranged intent time. This subsection describes the protocol for data generation and provides pieces of pseudo code to help understand the steps of the protocol. In this subsection, we refer to the original datasets as A and B and to the synthetic datasets as A′ and B′. This subsection is divided into the four steps that the data generation protocol goes through. These steps are as follows.

First, share intents in dataset A′ are generated. Second, post actions in dataset B′ are divided into two subsets: non-matches and matches. Third, matches in B′ are filled with identifiers. Fourth, non-matches in B′ are filled with randomly selected identifiers.

3.5.1 Generating records in A′. In A, for each identifier, we can create a frequency table for the occurrence of each identifier value. We do this by applying a GROUPBY function to each identifier. With two additional functions, a running total (runTot) is created, as well as a running total up to the previous record (prevRunTot). The running total is the number of records counted for the previous identifier values plus the number of records for the current identifier value. Another name for the running total is the cumulative count. Listing 1 shows an example of how these functions are implemented. The results of the code from listing 1 are shown in table 3. Table 3 and listing 1 are written for i5, but this process is repeated for all five identifiers.

Listing 1: Create a frequency table.

A = LOAD("share_intents.csv");
frequency_i5 = A.GROUPBY($"i5")
    .AGG(COUNT($"i5") as "count_i5")
    .WITHCOLUMN("runTot", RUNNING_TOTAL($"count_i5"))
    .WITHCOLUMN("prevRunTot", LAG($"runTot", 1));

We add 11,683 empty records to A′, which is the same number of records as there are in A. To each record in A′, 5 columns (one for each identifier) are added, each containing a generated random number between 0 and 11,683. On each random number we perform an inner join with the corresponding identifier frequency table. The result is that A′ is filled with randomly selected values from records in A, where the frequency of any value being selected is equal to the frequency of that value occurring in A. Last, a column with a row number is added to A′ and the random numbers are dropped from the table. The pseudo code for filling A′ with randomly selected values from A is provided in listing 2.
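The running totals from listing 1 can be reproduced in a few lines of plain Python. This is a sketch equivalent to the Spark pseudo code, using the device-type counts from table 3:

```python
from itertools import accumulate

# Device-type counts for the share intents in A (values from table 3).
counts = [("Android", 4114), ("iPhone", 1057), ("iPad", 2101),
          ("PC", 4411), ("Other", 0)]

# runTot is the cumulative count; prevRunTot lags it by one row (0 for the first).
run_tot = list(accumulate(c for _, c in counts))
prev_run_tot = [0] + run_tot[:-1]

frequency_i5 = [
    {"device": d, "count_i5": c, "runTot": rt, "prevRunTot": prt}
    for (d, c), rt, prt in zip(counts, run_tot, prev_run_tot)
]
# frequency_i5 now mirrors table 3, ending with runTot 11,683.
```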

In the original datasets A and B, there is a relation between the URL (i2) and the timestamp (i3). Most articles are published some time after the starting time of data gathering (02-05-2019 00:00:00). Most share intents and post actions occur within a few days after the article is published. Thus, when sampling random values to fill i2 and i3 in A′, the values of i2 and i3 are taken from the same record.

Listing 2: Inner join to select a random value from the frequency table.

A' = A.DROP(*)
    .WITHCOLUMN($"rndNr_i5", RND(11683))
    .JOIN(frequency_i5.SELECT($"device_type"),
        WHERE A'.$"rndNr_i5" <= frequency_i5.$"runTot"
           && A'.$"rndNr_i5" >  frequency_i5.$"prevRunTot")
    .WITHCOLUMN("rowNr", ROW_NUMBER())
    .DROP($"rndNr_i5");
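The range join above can be sketched in plain Python: because runTot is sorted, finding the row whose (prevRunTot, runTot] interval contains a random number is a binary search. This is a sketch of the idea, not the Spark implementation itself; the counts are the device-type counts from table 3:

```python
import bisect
import random

counts = [("Android", 4114), ("iPhone", 1057), ("iPad", 2101), ("PC", 4411)]
devices = [d for d, _ in counts]
run_tot = []
total = 0
for _, c in counts:
    total += c
    run_tot.append(total)  # cumulative count, as produced by listing 1

def sample_device(rng: random.Random) -> str:
    # Draw rndNr in (0, total]; the matching row is the first with
    # runTot >= rndNr, which is exactly the WHERE clause
    # prevRunTot < rndNr <= runTot from listing 2.
    rnd_nr = rng.randint(1, total)
    return devices[bisect.bisect_left(run_tot, rnd_nr)]

rng = random.Random(42)
sample = [sample_device(rng) for _ in range(10_000)]
# Sampling frequencies approximate the frequencies in A, e.g. roughly
# 4,114/11,683 of the draws are "Android".
```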

3.5.2 Selecting (non-)matches for B′. Selecting matches and non-matches is done randomly. We add 12,347 empty records to B′, which is the same number of records as in B. We split B′ into two parts: a part with true matches and a part with true non-matches. The number of records in B′ with true match status is 3,505, which is equal to 30% of the size of A. The number of records in B′ with true non-match status is (12,347 − 3,505 =) 8,842 records. This is shown in listing 3.

Listing 3: Split B′ into matches and non-matches.

B' = B.DROP(*);
B'match = B'.LIMIT(3505)
    .WITHCOLUMN("match", LIT(true))
    .WITHCOLUMN("rndRowNr", RND(11683));
B'nonmatch = B'.LIMIT(8842)
    .WITHCOLUMN("match", LIT(false));

Listing 4: Copy the values of records that are selected as a match.

B'match = B'match.JOIN(A',
    WHERE B'match.$"rndRowNr" === A'.$"rowNr"
).DROP($"rowNr");


3.5.3 Generating true matches in B′. For each true match in B′, we create a random number between 0 and 11,683 (the number of records in A′). We then copy all values from A′ to B′ where the row number in A′ is equal to the generated random number in B′. This is shown in listing 4. On the copied data in B′, alterations are made to i3 and i4.

To i3 in the proportion of B′ that is selected as a true match (B′{a′ = b′}) we add a certain value. We state that the i3 value of any record in B′{a′ = b′}, written b′{i3}, is equal to the matching a′{i3} value plus some intent time (t). Thus, we alter b′{i3} according to formula 9.

The goal of the data generation protocol is to create datasets where the only variable that changes is the distribution of true matches per ranged intent time. This distribution determines the value for t in formula 9. Thus, the frequency distribution of the values of t in formula 9 is determined by us every time we generate a new dataset. The frequency distributions for selecting a value for t are given in subsection 4.3.1.

a′ = b′ ⟹ b′{i3} = a′{i3} + t    (9)
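Formula 9 amounts to shifting the copied timestamp by a sampled intent time. A minimal sketch; the triangular distribution used here is only an illustrative stand-in for the actual distributions, which are given in subsection 4.3.1:

```python
import random
from datetime import datetime, timedelta

def make_match_timestamp(a_i3: datetime, rng: random.Random) -> datetime:
    # Sample an intent time t in seconds. A triangular distribution peaking
    # near the origin is a stand-in for the distributions of subsection 4.3.1.
    t = rng.triangular(0, 1200, 5)
    # Formula 9: b'{i3} = a'{i3} + t.
    return a_i3 + timedelta(seconds=t)

rng = random.Random(0)
a_i3 = datetime(2019, 5, 2, 0, 0, 0)  # start of data gathering
b_i3 = make_match_timestamp(a_i3, rng)
assert b_i3 > a_i3
```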

i4 in the records in B′{a′ = b′} can be changed. We know that i4 is not always a match when the records are a true match, because users on Twitter can change their location to any value they prefer. In B′ we give i4 only a 20% chance to match when the records match. If i4 is randomly selected (in 80% of the records) not to be a match, then the copied value from A′ is changed to a different city.

3.5.4 Generating true non-matches in B′. Generating true non-matches in B′ is done with the same method as creating records in A′, which is explained in subsection 3.5.1. When both parts (the matches and non-matches) of B′ are filled, a Spark UNION function is applied which merges the matches and non-matches into B′.
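The 20% location-match rule for i4 from subsection 3.5.3 can be sketched as a simple weighted coin flip. The city pool here is a hypothetical stand-in for the Dutch place-name list used for i4:

```python
import random

# Hypothetical city pool standing in for the place-name list used for i4.
CITIES = ["Amsterdam", "Rotterdam", "Utrecht", "Eindhoven", "Groningen"]

def alter_i4(copied_city: str, rng: random.Random) -> str:
    # In 20% of true matches i4 is kept identical; in the other 80% the
    # copied value is replaced by a different city.
    if rng.random() < 0.2:
        return copied_city
    return rng.choice([c for c in CITIES if c != copied_city])

rng = random.Random(7)
altered = [alter_i4("Amsterdam", rng) for _ in range(10_000)]
match_rate = altered.count("Amsterdam") / 10_000  # close to 0.20
```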

4 EXPERIMENT EVALUATION

In this section, the setup and results are reported for the experiments performed with record linkage with intent time. This section starts by describing the setup for the case where we matched share intents from Algemeen Dagblad with post actions from Twitter. Second, the results for the Algemeen Dagblad case are reported. Third, the setup for experimentation on the synthetic datasets is described. Fourth, the results of applying record linkage to the synthetic datasets are described.

4.1 Experiment settings Algemeen Dagblad

The approach described in subsection 3.2 is used on the data that is gathered according to the method described in subsection 3.4. Records are blocked on i2. The number of considered record pairs after blocking is 137,180. The intent time is calculated for each considered record pair with formula 7, and the ranged intent time with formula 8. This gives the distribution shown in figure 2. Figure 2 shows that there is a peak of record pairs when the ranged intent time is between 0 and 10 seconds and that the number of record pairs decreases as the intent time increases further. The rest of this subsection describes the parameters for comparison and classification that we used.
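Intent time is the difference between a post action's timestamp and the share intent's timestamp, and the ranged intent time buckets that difference (figure 2 uses 10-second buckets). A sketch under those assumptions, since formulas 7 and 8 are defined earlier in the paper:

```python
from datetime import datetime

def intent_time(share_ts: datetime, post_ts: datetime) -> float:
    # Formula 7 (assumed form): seconds between the share intent and the post.
    return (post_ts - share_ts).total_seconds()

def ranged_intent_time(t: float, bucket: int = 10) -> int:
    # Formula 8 (assumed form): floor the intent time into fixed-width ranges.
    return int(t // bucket) * bucket

share = datetime(2019, 5, 2, 12, 0, 0)
post = datetime(2019, 5, 2, 12, 0, 7)
assert ranged_intent_time(intent_time(share, post)) == 0  # the 0-10 s bucket
```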

Figure 2: The number of record pairs after blocking per ranged intent time. The number of record pairs gets closer to 0 the larger the distance is from a ranged intent time of 5 seconds.

4.1.1 Comparison parameters. Based on the distribution of figure 2, the parameters in the first row of table 7 in appendix A are applied. In addition to the necessary identifiers, we supplement the algorithm with identifiers i4 and i5. The Jaro-Winkler string comparison is performed on i4 with a threshold margin of error of 0.85. i5 is compared using an exact match.

4.1.2 Classification parameters. The comparison of the intent time returns a normalised score. We tripled the weight of i3 in our classification. The weight is tripled because a comparison on i3 has a higher discerning factor than the other identifiers, but returns a lower score when the intent time is further away from the origin than the offset is.

The comparisons on i4 and i5 get a weight of 0 when an identifier is not a match and a weight of 1 when an identifier is a match. Missing values in i4 automatically get a score of 0.

Thus, all record pairs get a summed match score that is between 0 and 5. The threshold for assigning match status to a record pair is set at a summed match score of 3.
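The classification step can be sketched as follows. The shape of the normalised i3 score is an assumption (full score within the offset of the origin, decaying linearly to 0 at the maximum threshold, using origin = 5, offset = 5 and the thresholds from table 7); the tripled weight and the cut-off of 3 are as described above:

```python
def i3_score(intent_time: float, origin: float = 5, offset: float = 5,
             max_threshold: float = 300) -> float:
    # Assumed kernel: full score within `offset` seconds of the origin,
    # decaying linearly to 0 at the maximum threshold.
    d = abs(intent_time - origin)
    if d <= offset:
        return 1.0
    if intent_time < 0 or d >= max_threshold:
        return 0.0
    return 1.0 - (d - offset) / (max_threshold - offset)

def summed_match_score(intent_time: float, i4_match: bool, i5_match: bool) -> float:
    # i3 gets triple weight; i4 and i5 contribute 1 when they match and 0
    # otherwise (a missing i4 counts as a non-match). Scores range from 0 to 5.
    return 3 * i3_score(intent_time) + int(i4_match) + int(i5_match)

def is_match(score: float, threshold: float = 3) -> bool:
    return score >= threshold

# A pair posted 7 seconds after the share intent, with a matching device:
assert is_match(summed_match_score(7, i4_match=False, i5_match=True))
```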

4.2 Results Algemeen Dagblad

The result of applying the parameters from subsection 4.1 to our cost-based algorithm is shown in the left half of table 4. Table 4 shows that 1,409 record pairs have a summed match score of 3 or higher, and therefore get match status assigned. This is equal to 1% of all considered record pairs, or 12% of the records in A. Table 4 also shows that most record pairs have a match score of zero or one (133,874). A match score of zero means that none of the identifiers is a match after blocking. A match score of five means that all identifiers match. For the other summed match scores, the record pair counts may contain any combination of identifier matches that returns the summed match score shown.

4.3 Experiment parameters Synthetic datasets

We test our cost-based algorithm with intent time against a baseline probabilistic record linkage algorithm on three synthetic datasets. Additionally, we test our cost-based algorithm with different identifier combinations on one of the synthetic datasets. This subsection describes how these experiments have been set up. First, the parameters are described for the creation of the three synthetic datasets.


Match Score   Original dataset   Synthetic dataset
0             72,065 (52.5%)     155,544 (73.7%)
1             61,809 (45.1%)      51,340 (24.3%)
2              1,897 (1.4%)        1,803 (0.9%)
3                578 (0.4%)        1,330 (0.6%)
4                658 (0.5%)          902 (0.4%)
5                173 (0.1%)          131 (0.1%)
Total        137,180 (100%)      211,050 (100%)

Table 4: The number of record pairs per summed match score (floored to the nearest integer) for the Algemeen Dagblad case (left) and for the synthetic dataset Scale × 1 (right).

Figure 3: The distributions for the average number of true matches per ranged intent time in the synthetic datasets. All the distributions are cut off at ranged intent times of 0 and 1200 seconds.

Second, the parameters are described for the cost-based record linkage algorithm with intent time. Third, the parameters are described for the baseline algorithm.

4.3.1 Parameters synthetic datasets. We create three synthetic datasets according to the methodology described in subsection 3.5. In the three synthetic datasets, every identifier frequency stays the same, except for the distribution of the true matches per ranged intent time. The distributions of true matches per ranged intent time for the three synthetic datasets are shown in figure 3.

The main change across these distributions is that the kurtosis increases. In distribution Scale × 1, the distribution is modeled after our estimation of true matches per ranged intent time for the Algemeen Dagblad case. In distribution Scale × 4, the distribution is spread more evenly over the ranged intent time (four times as much) compared to Scale × 1. This causes the distribution to have a bigger tail. In distribution Y = 30, the distribution is a flat line: the intent time is evenly distributed over a period of 1200 seconds. With these distributions we test how our algorithm is affected as the frequency of true matches per ranged intent time becomes more evenly distributed.

4.3.2 Parameters cost-based algorithm. For our cost-based algorithm we again apply the steps from subsection 3.2. Blocking is applied to i2, which returns 211,050 considered record pairs. The parameters for the i3 comparison in the cost-based record linkage algorithm are placed in appendix A. All parameters are chosen to optimize the precision (i.e. we ran the algorithm multiple times with different parameters, and with the parameters shown in appendix A we achieved the highest precision). Comparison on i4 is done with a Jaro-Winkler comparison with a threshold margin of error of 0.85, and i5 has to be an exact match. In the classification phase, the threshold for assigning match status is set at a minimum summed match score of 3.

In table 4, we demonstrate that our method of data generation produces a similar percentage of matches when the same cost-based algorithm is run on the synthetic dataset Scale × 1 as on the dataset for the Algemeen Dagblad case. In the left column, the percentage of record pairs with a match score of 3 or above is 1.0% of all considered record pairs; in the right column this is 1.1%.

4.3.3 Parameters baseline algorithm. For the baseline algorithm we apply a probabilistic record linkage algorithm. Again, blocking on i2 is applied at the start, which results in 211,050 considered record pairs (A × B). The probabilistic record linkage algorithm cannot use the intent time comparison method described in subsection 3.2. Instead, for i3 the baseline algorithm uses the numerical comparison method described in formula 1 in subsection 2.2.2. For the threshold in formula 1, we use the parameters shown in appendix A (table 7) in the column "max. threshold". For i4 and i5, the same parameters as in the cost-based algorithm are applied. The weights for the identifiers and the cut-off threshold are calculated according to the formulas described in subsection 2.2.3. For formula 4, E is set at 3,505 expected matches and (A × B) is set at 211,050. In formula 5, P is set at the desired probability of 0.95.

4.4 Results Synthetic datasets

To test the influence of intent time on a record linkage algorithm, two comparisons are made. First, we compare our cost-based record linkage algorithm with all identifiers to a baseline probabilistic record linkage algorithm. Second, we compare our cost-based record linkage algorithm with the use of different identifier combinations. The metrics are the same for both comparisons: recall, precision and F1-score. In previous research [17], a difference of 0.1 in F1-score between a new and a baseline algorithm is seen as a significant change. We use the same measure to indicate whether the cost-based record linkage algorithm produces significant results when compared to the baseline algorithm.

4.4.1 Cost-based vs baseline. For all three datasets in table 5, the precision of the cost-based algorithm is lower than the precision of the baseline algorithm. The precision of the baseline algorithm stays above the desired probability of 0.95. The precision of the cost-based algorithm lowers as the kurtosis of the distribution of true matches per ranged intent time increases. The precision of the cost-based algorithm drops because, at every time-step of the ranged intent time, there are several non-matches that the cost-based algorithm classifies as a match. Thus, the larger the range of the parameters for intent time is set, the more false matches are made by the algorithm.

Recall shows the opposite behaviour of precision. Every time the precision lowers in table 5, the recall increases. It is known in record linkage [4] that there is always a trade-off between precision and recall. Thus, this increase in recall is expected.

Even though the precision of the cost-based algorithm is lower than the precision of the baseline algorithm, the F1-score is significantly higher. The F1-score applies equal weight to recall and precision. The difference in recall between the cost-based and baseline algorithm is larger than the difference in precision. Therefore, the F1-score is higher with the cost-based algorithm.

            Cost-based algorithm            Baseline algorithm
Dataset     Recall  Precision  F1-score    Recall  Precision  F1-score
Scale × 1   0.600   0.926      0.729       0.123   0.995      0.219
Scale × 4   0.708   0.830      0.764       0.144   0.973      0.251
Y = 30      0.756   0.357      0.484       0.191   0.967      0.318

Table 5: Validation results for the synthetic datasets using a cost-based algorithm and using a baseline probabilistic algorithm. Best values are highlighted in bold.

Identifiers used   Recall  Precision  F1-score
All identifiers    0.600   0.926      0.729
Only i3            0.272   0.893      0.417
Only i4 and i5     0.190   0.516      0.277

Table 6: Validation results on dataset Scale × 1 when using different identifier combinations on the cost-based algorithm. In all the results, blocking is first applied with i2.

4.4.2 Identifier combinations. In table 6, the cost-based algorithm is applied to the synthetic dataset Scale × 1 with different identifier combinations. As expected, the highest scores are obtained when all identifiers are used in the algorithm. Table 6 also shows that running the algorithm on the dataset Scale × 1 with only i3 results in a higher precision and recall than running the algorithm with only the two other identifiers, i4 and i5. This shows that i3 is the best identifier in this dataset for increasing both precision and recall. We made the comparison of results for different identifier combinations on only one dataset. Therefore, it is possible that in other datasets a location or device type identifier can outperform the temporal identifier when using the cost-based algorithm with intent time.

5 DISCUSSION

In this section, we discuss the advantages and disadvantages of applying cost-based record linkage with intent time. Additionally, we discuss a validity issue in our research.

The first advantage of record linkage with intent time is that it results in a high recall and F1-score when compared to probabilistic record linkage. The results from table 5 show that cost-based record linkage with intent time works significantly better than traditional probabilistic record linkage when one favours recall or F1-score over precision. We have shown that when the distribution of true matches per ranged intent time has a higher kurtosis than seen in a real-life case, our method outperforms a traditional record linkage algorithm on F1-score. The second advantage is that the necessary identifiers (i1, i2 and i3) for cost-based record linkage with intent time are all public. Previous research [12] has shown that public identifiers are present more often than private identifiers. Thus, our algorithm can be applied to many other datasets as well.

The first disadvantage of record linkage with intent time is that as the kurtosis of the distribution of true matches per ranged intent time increases, the precision of the cost-based algorithm lowers faster than the precision of the baseline algorithm. The second disadvantage is that the temporal identifier must have a low margin of error and granularity. In this research, we did not experiment with different margins of error or a different granularity for temporal identifiers. Thus, we cannot state to what extent these factors influence the accuracy of a record linkage algorithm. We hypothesize that the precision will lower in a cost-based algorithm when the margin of error increases or when the time-steps of the granularity increase. Future research should investigate this.

The validity issue in our research is that we did not compare a record linkage algorithm with only the necessary identifiers to a baseline algorithm. This causes the comparison with the baseline algorithm to be influenced by identifiers other than the temporal identifier. We did not run the algorithm with only the necessary identifiers, because this is not possible in the current dataset. The agreement weight of i3, which is calculated according to formula 2, is always lower than the cut-off threshold, which is calculated with formulas 4 and 5. Thus, a probabilistic algorithm does not assign match status to any of the records. However, the results in table 5 and table 6 show large enough differences in precision, recall and F1-score to state that influencing the parameters for i3, or using i3 in general, influences the accuracy of a record linkage algorithm.

6 CONCLUSION

In this paper, we have answered the research question: To what extent can a temporal identifier influence a record linkage algorithm that matches the sharers of online articles with their social media profiles on OSNs? Table 5 and table 6 summarize the influence of temporal identifiers on different algorithms. The extent to which the temporal identifier influences the accuracy of the algorithm depends on the kurtosis of the distribution of true matches per ranged intent time. When this distribution has few outliers, using the temporal identifier in a cost-based algorithm results in significantly higher accuracy when compared to a traditional algorithm or a cost-based algorithm that does not use the temporal identifier. Thus, the methods described in this research can be used in record linkage implementations with public identifiers where a high F1-score is desired. Future research must determine to what extent the temporal identifier is influenced by its granularity and margin of error.


REFERENCES

[1] Michael Backes, Pascal Berrang, Oana Goga, Krishna P. Gummadi, and Praveen Manoharan. 2016. On Profile Linkability despite Anonymity in Social Media Systems. Proceedings of the 2016 WPES (2016), 25–35. https://doi.org/10.1145/2994620.2994629
[2] Pablo J. Boczkowski and Eugenia Mitchelstein. 2012. How Users Take Advantage of Different Forms of Interactivity on Online News Sites: Clicking, E-Mailing, and Commenting. Human Communication Research 38, 1 (2012), 1–22. https://doi.org/10.1111/j.1468-2958.2011.01418.x
[3] De Persgroep Nederland B.V. 2019. Algemeen Dagblad. (2019). https://www.persgroep.nl/merk/ad
[4] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer Science & Business Media, Berlin, Germany.
[5] Peter Christen and Karl Goiser. 2007. Quality and Complexity Measures for Data Linkage and Deduplication. Studies in Computational Intelligence (SCI) 43 (2007), 127–151. https://doi.org/10.1007/978-3-540-44918-8_6
[6] Colin Conrad, Naureen Ali, Vlado Keselj, and Qigang Gao. 2016. ELM: An Extended Logic Matching Method on Record Linkage Analysis of Disparate Databases for Profiling Data Mining. IEEE 18th Conference on Business Informatics (2016), 1–6. https://doi.org/10.1109/CBI.2016.9
[7] Jonathan de Bruin. 2019. Python Record Linkage Toolkit (0.13.2). (2019). https://pypi.org/project/recordlinkage/
[8] Stacie B. Dusetzina, Seth Tyree, Anne-marie Meyer, Adrian Meyer, Laura Green, and William R. Carpenter. 2014. Linking Data for Health Services Research: A Framework and Instructional Guide. The University of North Carolina, Chapel Hill, NC. https://www.ncbi.nlm.nih.gov/books/NBK253312/
[9] Curtis E. Dyreson and Richard T. Snodgrass. 1993. Timestamp semantics and representation. Information Systems 18, 3 (1993), 143–166. https://doi.org/10.1016/0306-4379(93)90034-X
[10] Anna Ferrante and James Boyd. 2012. A transparent and transportable methodology for evaluating Data Linkage software. Journal of Biomedical Informatics 45, 1 (2012), 165–172. https://doi.org/10.1016/j.jbi.2011.10.006
[11] Apache Software Foundation. 2019. Apache Spark (2.4.3). (2019). https://spark.apache.org/
[12] Oana Goga. 2014. Matching User Accounts Across Online Social Networks: Methods and Applications. Computer science. LIP6 - Laboratoire d'Informatique de Paris 6 (2014), 1–156. https://hal.archives-ouvertes.fr/tel-01103357
[13] David Hand and Peter Christen. 2018. A note on using the F-measure for evaluating record linkage algorithms. Statistics and Computing 28, 3 (2018), 539–547. https://doi.org/10.1007/s11222-017-9746-6
[14] Katie L. Harron, James C. Doidge, Hannah E. Knight, Ruth E. Gilbert, Harvey Goldstein, David A. Cromwell, and Jan H. van der Meulen. 2017. A guide to evaluating linkage quality for the analysis of linked data. International Journal of Epidemiology 46, 5 (2017), 1699–1710. https://doi.org/10.1093/ije/dyx177
[15] Yichen Hu, Qing Wang, Dinush Vatsalan, and Peter Christen. 2017. Regression classifier for Improved Temporal Record Linkage. PAKDD 2017 (2017), 561–573. https://doi.org/10.1007/978-3-319-57454-7_44
[16] Twitter Inc. 2019. Standard Search API (1.1). (2019). https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html
[17] Furong Li, Mong Li Lee, Wynne Hsu, and Wang-Chiew Tan. 2015. Linking Temporal Records for Profiling Entities. Proceedings International Conference on Management of Data (2015), 593–605. https://doi.org/10.1145/2723372.2737789
[18] Dario Martinez, Samuel Cristobal, and Seddik Belkoura. 2018. Smart Data Fusion: Probabilistic Record Linkage adapted to Merge two Trajectories from Different Sources. Eighth SESAR Innovation Days (2018), 1–19. https://www.sesarju.eu/sites/default/files/documents/sid/2018/papers/SIDs_2018_paper_58.pdf
[19] Imen Moalla, Ahlem Nabli, Lotfi Bouzguenda, and Mohamed Hammami. 2017. Data warehouse design approaches from social media: Review and comparison. Social Network Analysis and Mining 7, 1 (2017), 1–14. https://doi.org/10.1007/s13278-017-0423-8
[20] Panagiotis Papadopoulos, Nicolas Kourtellis, and Evangelos P. Markatos. 2018. Cookie Synchronization: Everything You Always Wanted to Know But Were Afraid to Ask. Proceedings of the 2018 World Wide Web Conference (2018), 1432–1442. https://doi.org/10.1145/3308558.3313542
[21] Jurgen Pfeffer, Katja Mayer, and Fred Morstatter. 2018. Tampering with Twitter's Sample API. EPJ Data Science 7, 50 (2018), 1–26. https://doi.org/10.1140/epjds/s13688-018-0178-0
[22] Edward H. Porter and William E. Winkler. 1997. Approximate String Comparison and Its Effect on an Advanced Record Linkage System. (1997), 190–199. https://www.census.gov/srd/papers/pdf/rr97-2.pdf
[23] Adrian Sayers, Yoav Ben-Shlomo, Ashley Blom, and Fiona Steele. 2016. Probabilistic record linkage. International Journal of Epidemiology 45, 3 (2016), 954–964. https://doi.org/10.1093/ije/dyv322
[24] Chei S. Sian and Long Ma. 2012. News sharing in social media: The effect of gratifications and prior experience. Computers in Human Behavior 28, 2 (2012), 331–339. https://doi.org/10.1016/j.chb.2011.10.002
[25] Dick Stenmark, Fahd O. Zaffar, and Jan Ljungberg. 2017. Like, Share and Follow: A Conceptualisation of Social Buttons on the Web. Scandinavian Conference on Information Systems (2017), 54–66. https://doi.org/10.1007/978-3-319-64695-4_5
[26] Rebekah K. Tromble, Andreas Storz, and Daniela Stockmann. 2018. We Don't Know What We Don't Know: When and How the Use of Twitter's Public APIs Biases Scientific Inference. Conference International Communication Association, Prague, Czech Republic (2018). https://doi.org/10.2139/ssrn.3079927
[27] Soroush Vosoughi, Helen Zhou, and Deb Roy. 2015. Digital Stylometry: Linking Profiles Across Social Networks. Proceedings International Conference on Social Informatics (2015), 164–177. https://doi.org/10.1007/978-3-319-27433-1_12
[28] Yuan Zheng. 2018. Opinion Mining from News Articles. Advances in Intelligent Systems and Computing 752 (2018), 447–453. https://doi.org/10.1007/978-981-10-8944-2_51


7 APPENDIX A: PARAMETERS COST-BASED ALGORITHM

Dataset     Origin  Offset  Scale  Minimum threshold  Maximum threshold
Original         5       5     60          0                 300
Scale × 1        5       5     60          0                 300
Scale × 4       10      10    240          0                1200
Y = 30         600     300    300          0                1200

Table 7: The parameters for i3 for the cost-based record linkage algorithms for each dataset.
