
MASTER THESIS

FUNCTIONAL SCHEMA MATCHING WITH DOUBLE DEEP Q-NETWORKS

by

PHILIPP OLLENDORFF
11734078

August 28, 2020
36 ECTS, March - August 2020

Daily Supervisor: SAMI JULLIEN
Supervisor: DR. JOÃO MOURA
Examiner: DR. JAAP KAMPS


Abstract

Machine Learning has become widely available and is actively developed and implemented across a plethora of domains. This widespread adoption has introduced the problem of standardizing and matching various data sets for use in Machine Learning tasks. Our work approaches such schema matching using Reinforcement Learning, framing the mapping problem in its context and applying state-of-the-art Double Deep Q-Networks. We focus on functional schema matching, which is only concerned with whether a schema match is useful, or functional, in terms of a Machine Learning task. We show the legitimacy of this approach on two case studies from the field of card payment fraud detection, demonstrating that small data sets can be matched perfectly and others functionally. This promising result will save payments experts hours of repetitive and manual mapping work. Our research gives rise to the new domain of functional schema matching using Artificial Intelligence, which we hope will motivate other researchers to pursue it even further.

Acknowledgements

I want to express my sincere gratitude to several individuals without whom I would not have been able to present this thesis. Thank you, João Moura, for guiding me through this journey, providing countless ideas, invariably highlighting problems and demonstrating considerable patience when finding elegant solutions to them. I would also like to thank Sami Jullien for letting me bounce innumerable ideas off of him and providing much-needed clarity when my thoughts were disorganized. I am grateful for thoughtful discussions with Stijn Verdenius, Ioannis Gatopoulos, Filip Knyszewski, Anna Hansen, Yann Müller, Julian Ollendorff and Zoë Duives, who shaped the outcome of this research. Thanks to Fraudio for the opportunity and support to work on a real business problem and to all my colleagues for helping me wherever needed. Thanks to my family, my friends from all over the world, my fellow students and the University of Amsterdam for enabling me to present this thesis.


Contents

1 Introduction
1.1 Problem Statement
1.2 Aim and Scope
1.3 Contribution
1.4 Outline

2 Background
2.1 Schema Matching
2.1.1 An Interdisciplinary Field
2.1.2 Related Work
2.2 Reinforcement Learning
2.2.1 Reinforcement Learning Origins
2.2.2 Deep Q-Networks
2.2.3 Deep Q-Networks Extensions
2.3 Machine Learning Classification
2.3.1 Algorithms
2.3.2 Binary Classification Metrics

3 Methodology
3.1 Case-Study: Card Payment Fraud Detection
3.1.1 The Payment Process in a Digital Economy
3.1.2 Fraud Detection
3.1.3 Dataset Schema
3.1.4 Classification Model
3.2 Framing Schema Matching in a Reinforcement Learning Context
3.2.1 Representing a State
3.2.2 Encoding Actions
3.3 Shaping the Reward Function
3.3.1 On the Usefulness of Vector Distance Metrics in Higher Dimensions
3.3.2 Quantifying Syntactic Similarity with Vector Distance Metrics
3.3.3 Adding Sparse Rewards via Representative Model Predictions
3.3.4 Algorithm

4 Evaluation
4.1 Experiment Design
4.2 Case Study I
4.3 Case Study II
4.3.1 Minimal Viable Problem
4.3.2 Increased Complexity Problem

5 Discussion
5.1 Results
5.1.1 Case Study I
5.1.2 Case Study II - MVP
5.1.3 Case Study II - ICP
5.2 Impacts on Training Time
5.3 High Variance in Semantic Rewards
5.4 Syntactic Reward Diversity
5.5 Dataset Diversity
5.6 State and Action Spaces
5.7 General Applicability in Classification Tasks

6 Conclusion
6.1 Future Work

Bibliography

Appendix
A Hyperparameters
B Implementation
C Infrastructure
D Disclosure of Funding


Chapter 1

Introduction

A fundamental problem in Machine Learning on structured data arises when a model is trained on one dataset and applied to another. The second, unknown dataset may even be from the same problem domain and may contain the same kind of information, but their structure will likely differ. For the model to work, the unknown data must be processed into the structure of the training data. This processing is called schema matching and serves as the main problem our research aims to solve. We focus on what we coin functional schema matching, which describes this process in a Machine Learning context. Here the concern is only whether a schema match is useful, or functional, in terms of a Machine Learning task. In the scientific literature, most schema matching approaches focus on a universal and theoretically correct solution to this problem. However, in many practical scenarios, a sub-optimal, yet functional, mapping is already sufficient. We notice a lack of research in this domain and aim to fill in the gap. We exemplify this process practically on proprietary business data from the payments domain, which is used for card payment fraud detection. However, the approach is sufficiently general that it can be applied to any data-driven Machine Learning classification problem.

1.1 Problem Statement

We choose the payment fraud domain as the subject of our case study because of its growing impact on business profitability in e-commerce and its inevitable rise due to globalization. In an article by the German statistical research institute Statista, Clement (2020) states that the global e-commerce market is predicted to grow to 4.9 trillion USD by 2021. Furthermore, Lindner (2017) predicts that by 2022 online sales will make up 17% of all global consumer sales. Vast numbers like these attract criminal offenders, leading to 24.26 billion USD lost worldwide in 2018 due to payment card fraud, according to the payment processing company SHIFT (2020). The European Central Bank, ECB (2018), found that online fraud now makes up 73% of financial fraud in Europe. The data suggests that fraud poses a tremendous cost to many businesses and highlights a potential revenue stream through fraud prevention.

Historically, there has been some success in combatting fraud by using rule-based engines. These rules are manually engineered using expert knowledge and categorize any card payment transaction as fraud or non-fraud. However, the rise of big data collection, Machine Learning, and cloud computing has dramatically changed this field. Most modern fraud detection mechanisms are based on Machine Learning methods and detect larger fractions of fraudulent transactions than ever before. Research by Sorournejad et al. (2016) surveyed 11 standard fraud detection techniques, of which seven were related to Machine Learning. Dal Pozzolo et al. (2014) viewed fraud detection from an industry perspective and focused solely on three Machine Learning techniques, highlighting their dominance from a practical point of view. Fraud detection increasingly determines a company's value proposition; therefore, it also becomes valuable, proprietary technology that stays isolated within banks, credit card networks, and payment service providers. Even if they did share their records to create a more accurate model or establish a standard benchmark dataset, the integration costs are high. All of them collect credit card transaction records, but there is no unifying standard, making this data needlessly diverse. This reinforces already secluded data silos within each company and prevents any aggregation, thereby failing to leverage the total amount of data the payment industry already has.

1.2 Aim and Scope

We believe hidden potential lies within this aggregation of datasets and propose a Reinforcement Learning approach to dynamically and functionally transform a new dataset schema (the source) into a known dataset schema (the target). We limit the scope of this thesis to two case studies to determine whether such an approach can solve problems from the domain of card payment fraud detection.

Schemas describe a dataset’s syntactic characteristics like data types, number, and order of features. Two datasets can look different to the human eye, even if they represent semantically equivalent information. Think of a basic 2 column dataset: column A represents names and column B ages. Swapping those two columns does not change the information they convey. However, it may make a pre-trained model fail because it expects integers in column B. The same data can be presented using many different schemas, without ever changing underlying semantics. This ambiguity poses a challenge to modern Machine Learning models, which heavily rely on consistent data schemas. Resolving this ambiguity is the process of schema matching, a well-established problem intersecting the domains of Database Systems, Knowledge Representation, Machine Learning, and Information Retrieval. It is essentially a problem of discrete combinatorial optimization, or, more specifically, a constrained optimization: among the enormous number of possible schemas, only a small subset is suitable for model prediction, and a possibly smaller subset maximizes model performance. Due to the nature of Machine Learning models and their uneven feature importance, a handful of well-mapped features may already suffice to achieve close to optimal model performance. Thus, multiple good schemas — although possibly imperfect — may lead to similarly useful model predictions. Focusing on mapping the most critical features simplifies the problem area of schema matching, without considerable sacrifices.

The family of optimization algorithms is large and has been extensively covered by Amaran et al. (2016). We focus on what they coined 'simulation optimization', as we merely optimize the function of a simulated environment. In practice, the estimates found during simulation are useful for real problems. Some of the techniques will not work in our case, e.g., mathematical programming and model-based techniques, because we do not have access to an objective function, and approximations are noisy. Among others, they list Evolutionary Algorithms, Simulated Annealing, Bayesian Optimization, and Reinforcement Learning as suitable candidates instead. Our research focuses solely on the latter and builds on a series of recent advancements, which are starting to reveal its unprecedented potential. These advancements were initiated by DeepMind in the seminal paper by Mnih et al. (2013) on the Deep Reinforcement Learning paradigm DQN. We favor Reinforcement Learning over the other methods for the following reasons: first, there is little cost and no urgency associated with our real-world use case, so the common disadvantage of sample inefficiency in Reinforcement Learning does not apply. Functional schema matches need to exist eventually, but it is acceptable for the training time to lie within a matter of hours or even days. Second, Reinforcement Learning is particularly interesting because it deals well with dense reward signals. Other methods, such as Evolutionary Algorithms, are more suitable for sparse rewards, as indicated by Salimans et al. (2017). Depending on the design choices, the problem can be modeled using either dense rewards, sparse rewards, or both. However, dense rewards usually converge faster and more reliably and are hence favored over their sparse counterparts. Sanders et al. (2019) state that Bayesian Optimization works well for expensive-to-evaluate functions when a region of possible and reasonable solutions is sufficient. Neither of those criteria applies in our scenario: function evaluations are noisy, but not expensive, and we are looking for one excellent solution, rather than a collection of sub-optimal ones.

1.3 Contribution

The original contributions of this work are three-fold and can be summarized as follows:

1. We frame the functional schema matching process as a Reinforcement Learning problem.

2. We introduce a novel hierarchical reward function that is microscopically based on syntactic dataset characteristics and macroscopically on supervised Machine Learning classification metrics.


3. We present a case-study and apply the previous contributions to successfully solve a real business problem in the fraud detection domain.

1.4 Outline

In Chapter 2, we first examine the problem domains more closely. The examination includes a more in-depth look into dataset schema matching and a summary of Reinforcement Learning from its origins up to the state of the art in Deep Reinforcement Learning. Finally, we wrap up the background section with a recapitulation of basic binary classification metrics. This establishes the theoretical framework our contributions build upon. In Chapter 3, we present the necessary characteristics of the payments domain and fraud detection, as well as our case-study outline. Then, we detail our approach, framing the problem precisely in the Reinforcement Learning context and examining the modeling choices we made. These choices include our representation of the main Reinforcement Learning components such as states, actions, and reward functions, and additions such as replay memories, activation functions, optimizers, and other hyperparameters. We illustrate the experimental results of this case study in Chapter 4, showing that our proposed model is capable of matching data schemas functionally. We discuss these findings in extensive detail in Chapter 5, pointing to possible implications of this research and highlighting its strengths and limitations. Finally, we conclude with a brief outlook on future areas of research regarding this topic.

Chapter 2

Background

In this chapter, we lay the necessary groundwork for understanding our research method. We split the chapter into three sections: first, we examine the problem area of schema matching and highlight the requisite literature. Then, we introduce Reinforcement Learning from its origins and outline the most notable inventions in the field, from early temporal difference methods to the state of the art. Last, we introduce Machine Learning classification tasks and give a brief overview of their most popular metrics. These three topics constitute the building blocks for our proposed contribution.

2.1 Schema Matching

First, we locate the field of schema matching at the intersection of its many neighboring domains, highlighting its interdisciplinary nature. Then, we reference a selection of known schema matching methods and their characteristics. This reference will help embed our proposed solution into the existing literature.

2.1.1 An Interdisciplinary Field

Schema matching systems are concerned with identifying a standard schema among a collection of relational databases and a mapping between each database and the standard schema. The field has received steady attention in the database and AI communities but is often filed under adjacent domains such as schema (or ontology) translation, integration, mapping, and alignment. A primary outcome of this research area is that schema matching tools always require a well-balanced combination of linguistic analysis, statistical analysis, and domain knowledge. Most datasets contain both natural language and numerical data, which calls for the first two. Domain knowledge is needed because the process of schema matching is often subjective. It is difficult to clearly define how well a schema describes the semantics of its underlying data, and this difficulty multiplies when attempting to map many such ambiguous schemas. Thus, expert validation is valuable and often essential to find proper mappings. Such validation can present itself in many forms, such as external dictionaries, known partial mappings, or analytical verification using an expected output of the new match. Machine Learning classification falls into the category of analytical verification.

A substantial amount of research targets a generalized approach of schema matching: mapping any given schema to another from an unknown, but similar domain. This greatly complicates the problem. Such universal approaches need to deal with enormous complexity because they avoid any assumptions on the underlying domain and dataset schemas. In theory, this sounds plausible, but no algorithms have been proposed that satisfy such ambitions sufficiently.

2.1.2 Related Work

Bellahsène (2011) and Madhavan et al. (2001) evaluate the most common schema matching tools, and Rahm and Bernstein (2001) conduct a large survey of relevant schema matching algorithms. They find that most can be categorized into two main approaches: matching on a schema level or an instance level. Schema-level matchers consider high-level information such as name, description, data type, and schema structure. Such information is often given in XML format and can be represented as a graph or tree structure. Instance-level matchers instead take the real data observations into account. This includes finding keywords or word frequencies for string-based data, as well as numerical averages and specific character patterns like dates and phone numbers. The latter indicates potential for our research because all required information already exists within the data. The former may be beneficial but depends on high-level knowledge of the schema, which need not be known beforehand. Schema-level matchers often assume that schemas are already fairly similar while their instances are different, whereas instance-level matchers assume the opposite. In practice, a combination is usually preferred. In our case, we assume the instances are rather similar, while schemas may vary wildly, because the data is from a regulated domain that relies on a few well-known and coherent features.

The SEMINT system by Li and Clifton (2000) is an instance-level matcher that finds corresponding features from two schemas via their feature signatures. These signatures consist of 20 constraint- and content-based numerical criteria derived from real data observations. Each attribute of one schema is normalized to be within [0, 1], representing a point in 20-dimensional space, and then clustered using the Euclidean distance between points. A neural network trained on the cluster centers can derive the most relevant cluster for each attribute of the second schema. SEMINT does not make use of the schema structure, as it cannot easily be mapped to a numerical value.

Madhavan et al. (2001) developed the schema matcher Cupid, which comprises syntactic techniques at both the element and structure level. The former includes, e.g., common prefixes and suffixes, while the latter focuses on broader concepts like tree matching. Cupid also applies a precompiled thesaurus as an external resource for similarity measures. Its main drawback is that it relies heavily on string-based techniques. Similarly, Melnik et al. (2003) implemented Rondo, which frames schema matching as an optimization problem and represents schemas as directed labeled graphs. The algorithm again focuses on syntactic techniques at the element and structure level. It starts with a string-based comparison of the nodes' labels to obtain an initial mapping, e.g., using common prefixes and suffixes, which is then refined using structure-level techniques. Some other, less relevant prototypes from the research community include Clio by Popa et al. (2002), Tupelo by Fletcher and Wyss (2006), HePToX by Bonifati et al. (2005) and Spicy by Mecca et al. (2009). They focus on aspects such as graphical interfaces and dedicated schema languages, which are out of scope for us. All of these mapping tools are designed to accommodate users from various domains and deal with highly heterogeneous schemas. The generalization capability of their schema mappers may exceed that of our approach, but our problem domain is explicitly narrow. By simplifying the domain, we eliminate many of the difficulties of schema mapping and avoid any graphical components and dedicated low-level schema mapping languages. On the contrary, we gain additional information by further exploiting the problem domain, i.e., using a Machine Learning classification task as an additional mapping verification. However, most of these tools divide the mapping into a pre-match, match, and post-match phase: cleaning the dataset during the pre-match phase, running an algorithm during the match phase, and applying or verifying the mapping during the post-match phase. This division is intuitive and motivates a similar approach in our contribution.

2.2 Reinforcement Learning

Next to the two main branches of Machine Learning, namely Supervised Learning and Unsupervised Learning, a third branch has been getting more attention over the last decade. Both Supervised and Unsupervised Learning are data-driven approaches; they are only distinguished by the existence of a target variable that supervises and confirms a classification or regression algorithm. Reinforcement Learning, instead, is a Machine Learning paradigm that does not solely rely on data, but on experience gained by an agent's interaction with a simulated environment. Sutton and Barto (2018) state that the agent's goal is to learn good strategies for sequential decision problems within this environment by taking actions to explore it. Each action results in a reward, either positive or negative. Over time, agents learn to optimize their actions to maximize cumulative future rewards, thereby finding optimal strategies. By designing environments, actions, and rewards in specific ways, numerous problems can be solved.


The field has an extensive history and originated as a subdomain of psychology concerned with the learning behavior of individual animals. In parallel, a subdomain of Economics called Game Theory studied multi-agent behavior in humans. Only when these two domains were combined in the 1980s was the new domain of Reinforcement Learning born. For a long time, it focused on dynamic programming approaches, until Mnih et al. (2013) introduced a Deep Reinforcement Learning algorithm called DQN that revolutionized the field. The idea of using neural networks for universal function approximation had been introduced much earlier by Bertsekas and Tsitsiklis (1996), but during a time when less computing power was available to make practical use of the theory. A plethora of improvements has since been proposed, as outlined extensively by Sutton and Barto (2018) and Graesser and Keng (2019) and more briefly by Ivanov and D'yakonov (2019).

2.2.1 Reinforcement Learning Origins

Reinforcement Learning environments are explicitly or implicitly designed as a Markov Decision Process (MDP), defined by the tuple $(S, A, P, R)$. $S$ denotes the state space, $A$ the action space, $P$ the set of probabilities that action $a \in A$ in state $s \in S$ leads to state $s' \in S$, and $R$ the set of rewards received after transitioning from state $s \in S$ to $s' \in S$ using action $a \in A$. If all of these variables are known beforehand, we can apply model-based algorithms, such as dynamic programming. More often, however, some of them are unknown, requiring the use of model-free algorithms. A Reinforcement Learning agent then interacts with the MDP for a pre-defined number of steps of some finite length $T$, choosing actions $a_t \in A$ to explore the state space $S$. The agent starts from an initial state $s_0 \in S$ and may end in a terminal state $s_+ \in S$. The sequence $\mathcal{T} = (s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots, s_{T-1}, a_{T-1}, r_T, s_T)$ of states visited, actions taken and rewards accumulated within these $T$ finite steps is known as an episode. A distribution $\pi(a|s)$ is defined as the policy, denoting the agent's probability of choosing action $a$ in state $s$. Given any such policy $\pi$ and a discounting coefficient $\gamma \in [0, 1)$, the agent's expected discounted total reward is defined as

$$J(\pi) = \mathbb{E}_\pi \sum_{t=0}^{\infty} \gamma^t r_{t+1}. \quad (2.1)$$
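To make Equation 2.1 concrete, the discounted return of a finite episode can be computed directly from its reward sequence. A minimal Python sketch with illustrative values (not taken from the thesis):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_{t+1} over a finite episode (Equation 2.1)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three steps with rewards 1, 0 and 5, discounted with gamma = 0.9.
print(discounted_return([1.0, 0.0, 5.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 5 = 5.05
```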

All Reinforcement Learning algorithms aim to find an optimal policy $\pi^*$ that maximizes this expression. Given this aim, we can introduce concepts that estimate the expected discounted future reward from any state (value function $V$) or any state-action pair (quality function $Q$):

$$V^\pi(s) = \mathbb{E}_{\pi \,|\, s_0 = s} \sum_{t=0}^{\infty} \gamma^t r_{t+1} \quad (2.2)$$

$$Q^\pi(s, a) = \mathbb{E}_{\pi \,|\, s_0 = s, a_0 = a} \sum_{t=0}^{\infty} \gamma^t r_{t+1} \quad (2.3)$$

Equation 2.3 is of particular importance. If we find the optimal Q-function $Q^*(s, a)$, we directly obtain the optimal policy $\pi^*$ from it:

$$\pi^* = \arg\max_a Q^*(s, a). \quad (2.4)$$

These estimators form the backbone of modern Reinforcement Learning algorithms and are deeply connected:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(a|s)} Q^\pi(s, a) \quad (2.5)$$

$$Q^\pi(s, a) = \mathbb{E}_{s' \sim p(s'|s,a)} \left[ r(s') + \gamma V^\pi(s') \right] \quad (2.6)$$

Equation 2.5 and Equation 2.6 together form a system of recursive equations, known as the Bellman equation:

$$Q^\pi(s, a) = \mathbb{E}_{s' \sim p(s'|s,a)} \left[ r(s') + \gamma \, \mathbb{E}_{a' \sim \pi(a'|s')} Q^\pi(s', a') \right] \quad (2.7)$$


This recursive relationship gives rise to the first well-known value-based Reinforcement Learning algorithm, called Q-Learning, introduced by Watkins (1989). It uses Equation 2.7 and inserts the optimal policy $\pi^*$ to obtain the Bellman optimality equation

$$Q^*(s, a) = \mathbb{E}_{s' \sim p(s'|s,a)} \left[ r(s') + \gamma \max_{a'} Q^*(s', a') \right], \quad (2.8)$$

which can be updated iteratively and converges towards the optimal $Q^*$ quickly. In practice, several update iteration algorithms have been proposed, of which the most popular is the temporal difference (TD) algorithm. The TD algorithm samples transition tuples $(s_t, a_t, r_{t+1}, s_{t+1})$ from interaction experience and updates only those state-action pairs. Given an update smoothing parameter $\alpha_t$, its update is formally given as:

$$Q^*_{t+1}(s, a) = \begin{cases} Q^*_t(s, a) + \alpha_t \left[ r_{t+1} + \gamma \max_{a'} Q^*_t(s_{t+1}, a') - Q^*_t(s, a) \right] & \text{if } s = s_t, \, a = a_t \\ Q^*_t(s, a) & \text{otherwise} \end{cases} \quad (2.9)$$
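As a concrete illustration of the update in Equation 2.9, a minimal tabular Q-Learning step can be sketched as follows; the state and action space sizes, the learning rate, and the sampled transition are hypothetical:

```python
import numpy as np

n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))          # tabular Q-function of size |S| x |A|
alpha, gamma = 0.1, 0.9                      # smoothing parameter alpha_t and discount gamma

def td_update(s, a, r, s_next):
    """One temporal difference update of Q(s, a), following Equation 2.9."""
    td_target = r + gamma * Q[s_next].max()  # one-step approximation of the return
    td_error = td_target - Q[s, a]           # the TD-error discussed below
    Q[s, a] += alpha * td_error              # only the visited state-action pair changes
    return td_error

td_update(s=0, a=1, r=1.0, s_next=2)
```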

The expression in the brackets is the notable TD-error, which represents the difference between the estimated value of the current state-action pair $Q^*_t(s, a)$ and its one-step approximation $r_{t+1} + \gamma \max_{a'} Q^*_t(s_{t+1}, a')$. This difference converges towards zero as we approach the optimal Q-function. Hence, the TD-error is a good proxy for how close we are to this optimum. The main requirement for such convergence is enough exploration of the state space via interaction experience. Quantifying enough exploration is a challenging problem within the Reinforcement Learning domain to this day, more commonly referred to as the exploration versus exploitation trade-off. It is well documented and appeared first in the statistical literature on sequential multi-armed bandit problems by Berry and Fristedt (1985). Gambling slot machines inspired the academic study of such bandit problems. Each pull of the slot lever incurs a fixed cost and returns a probabilistic gain. The gain is unknown but can be approximated by playing the slot multiple times and averaging its rewards. Naturally, gains are to be maximized. Suppose an agent is presented with multiple such slot machines. In that case, it will have to balance exploring new machines to estimate their probabilistic returns with exploiting the best machine to generate the most gain. This balance is usually skewed towards an emphasis on exploring options in the beginning and exploiting the best option towards the end, but quantifying a universally optimal relationship is intractable. Instead, various approximations exist, such as the $\epsilon$-greedy strategy outlined by Sutton and Barto (1998), which balances the two actions probabilistically. With a chance of $\epsilon$, a random slot machine is explored, while with a chance of $1 - \epsilon$ the one with the currently largest estimated gain is exploited. In practice, after some empirical tuning of $\epsilon$, such approximations usually suffice to explore the state space accurately.
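The $\epsilon$-greedy strategy described above takes only a few lines; the linear decay schedule shown is a common choice rather than one prescribed here:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_at(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly decay epsilon, emphasizing exploration early and exploitation later."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```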

The main issue with temporal difference algorithms, including Q-Learning, is that Q-functions need to compute and store a value for each state-action pair separately, with a total memory size of $|S| \cdot |A|$. This explicit storage was a severe constraint during a time when computer memory was commonly measured in MB, rather than GB. Thus, the adoption of temporal difference techniques became more widespread only when the price of memory fell and neural networks offered generalization capabilities across many state-action pairs simultaneously.

2.2.2 Deep Q-Networks

Deep Q-Networks (DQN) were introduced by Mnih et al. (2013) in their seminal paper on solving Atari games directly from image input using Reinforcement Learning with neural networks. As the name suggests, DQN is mainly based on Q-Learning, outlined in the previous section. At its core, DQN learns to estimate the action-value function $Q$ using a neural network with weights $\theta$:

$$Q(s, a; \theta) \approx Q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]. \quad (2.10)$$

The action-value function is then trained by minimizing the loss function $L$ using stochastic gradient descent:

$$y_i = r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}), \quad (2.11)$$

$$L_i(\theta_i) = \mathbb{E}\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right]. \quad (2.12)$$

Mnih et al. (2013) also introduced an experience replay memory to increase data efficiency. The memory keeps track of a limited number of previous experiences, each defined by the starting state, the action taken, the resulting state and the associated reward. During every model training step, a batch of experiences is sampled randomly and used to adjust the neural network weights. This sampling revisits past experiences and decorrelates subsequent actions, leading to faster and more stable training.
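A uniform experience replay memory of the kind described here can be sketched as a bounded buffer of transition tuples; the capacity and batch size are illustrative defaults, not the values used in the thesis:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (state, action, reward, next_state, done) tuples and samples them uniformly."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)  # random batch decorrelates experiences
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```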

Q-Learning is an off-policy algorithm, so by extension, DQN is as well. The off-policy property is a result of estimating the total discounted future reward using future greedy actions, even though the chosen action might differ. This results in a divergence between the estimated best policy and the used policy. Such off-policy algorithms are a requirement for replay memories because we do not know the chosen action and hence take the maximizing action. We do not want to store it either because it would counteract the useful decorrelation of experiences.

Mnih et al. (2015) later refined the algorithm outlined in Equation 2.11 further and added a separate network for generating the targets $y_i$. Simply put, after every $C$ updates of the Q-function the network is copied onto a target network $Q'$. This ensures a more stable target that lags at most $C$ steps behind the online network.

Mnih et al. (2015) also proposed the use of the Huber loss function with error term $a$:

$$\text{Huber}(a) = \begin{cases} \frac{1}{2} a^2 & \text{for } |a| \le 1, \\ |a| - \frac{1}{2} & \text{otherwise.} \end{cases} \quad (2.13)$$

This loss function, also known as the smooth L1 loss, combines the mean-squared error (MSE) and the mean-absolute error. Its main benefits include less sensitivity to outliers than the MSE and the prevention of exploding gradients.
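Putting Equations 2.11 to 2.13 together, the target computation with a lagging target network and the Huber loss can be sketched in NumPy; the array shapes, terminal-state masking and copy interval C are assumptions of this sketch:

```python
import numpy as np

def huber(a, delta=1.0):
    """Smooth L1 loss of Equation 2.13, applied elementwise to the error term a."""
    quadratic = 0.5 * a ** 2
    linear = delta * (np.abs(a) - 0.5 * delta)
    return np.where(np.abs(a) <= delta, quadratic, linear)

def dqn_targets(rewards, q_target_next, dones, gamma=0.99):
    """y_i = r + gamma * max_a' Q'(s', a'), with bootstrapping disabled on terminal states."""
    return rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)

# q_target_next would come from the target network Q', which is copied from the
# online network every C training steps to keep the regression targets stable.
```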

2.2.3 Deep Q-Networks Extensions

Since its origin in 2013, DQN has been refined substantially. The most notable additions, as mentioned by Hessel et al. (2017), include a prioritized experience replay, Double DQN, Dueling DQN, and Distributional DQN, which all resulted in faster and more stable convergence behavior during training on Atari games. We treat these as the accepted state of the art in value-based Reinforcement Learning and briefly analyze each invention in this section. It is important to note that other types of algorithms exist, so-called policy-based variants. These do not find an explicit value of each state, but rather optimize the policy directly. Such strategies may present their own benefits and difficulties, but these remain to be tested in future research. Instead, we focus our efforts on value-based methods, because they are easier to interpret, straightforward to implement and a natural fit with our problem scenario: it suffices to know the optimal state, rather than the exact route from an initial state to it.

Prioritized Experience Replay Memory

Schaul et al. (2016) added Prioritized Experience Replay to DQN as a way to deal with the unequal importance of an agent's experiences. They realized that agents spend a considerable amount of time gathering redundant experience in most environments. By sampling uniformly from the memory of stored experiences, this redundant information is reinforced at the same frequency as rare, valuable experiences. A valuable experience results in a large TD-error, which is assumed to reflect great unexpectedness. This should translate to a substantial amount of new information and is therefore considered useful or valuable. To combat the imbalance of stored redundant and valuable experiences, they propose a weighting scheme that prioritizes new experiences first and experiences with a large TD-error second. This scheme ensures that the weight update function uses all memories at least once, but important memories more often.

Double DQN

Van Hasselt et al. (2015) presented Double Deep Q-Networks to tackle the well-known issue of overestimation in traditional Q-Learning. Overestimation occurs due to the maximization step over estimated action values in Equation 2.10, which biases the algorithm towards higher action values. This bias exists because Q-values are noisy during training, but updates only consider the positive noise. If this positive noise were uniform across states, we would not see an overestimation bias, because we consider only state value differences, not absolutes, in Equation 2.12. In practice, the noise is seldom uniform, which encourages the exploration of states that are incorrectly believed to be good. Van Hasselt et al. (2015) first prove that overestimation generally leads to worse performance, as illustrated in Figure 2.1.

Figure 2.1: The orange bars illustrate the bias in single Q-Learning updates, while the blue bars use the Double DQN update. Image taken from Van Hasselt et al. (2015).

Then they propose an approach that decouples action selection and evaluation by training two separate function approximators on different samples, leading to more realistic estimates. They rewrite the Q-Network target $y_i$ mentioned in Equation 2.11 and then add the weights of a second Q-network to the evaluation part, as shown in Equation 2.15. In practice, this second network is often just the lagging target network outlined in the previous section.

$$y_i^{Q} = r + \gamma Q(s', \arg\max_a Q(s', a; \theta_i); \theta_i), \quad (2.14)$$

$$y_i^{\text{DoubleQ}} = r + \gamma Q(s', \arg\max_a Q(s', a; \theta_i); \theta_i'). \quad (2.15)$$
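The only difference between Equations 2.14 and 2.15 is which network evaluates the selected action; a NumPy sketch with hypothetical batched Q-value arrays:

```python
import numpy as np

def double_dqn_targets(rewards, q_online_next, q_target_next, dones, gamma=0.99):
    """Equation 2.15: the online network selects a', the target network evaluates it."""
    best_actions = q_online_next.argmax(axis=1)     # arg max_a Q(s', a; theta_i)
    batch = np.arange(len(best_actions))
    evaluated = q_target_next[batch, best_actions]  # Q(s', a'; theta'_i)
    return rewards + gamma * (1.0 - dones) * evaluated

# The single-network target of Equation 2.14 would instead both select and
# evaluate with q_online_next, i.e. rewards + gamma * q_online_next.max(axis=1).
```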

Dueling DQN

Wang et al. (2016) introduced Dueling DQN as a way to improve learning when generalizing across actions. It uses two separate estimators: one for the state value function as we have seen before and another for the action advantage function. The action advantage is generally defined as

$$A(s, a) = Q^\pi(s, a) - V^\pi(s) \quad (2.16)$$

and describes the advantage of the action $a$ over the action the current policy $\pi$ would have chosen. This is because $V^\pi(s)$ represents the expected return from state $s$ onwards following policy $\pi$, whereas $Q^\pi(s, a)$ represents the expected return after taking action $a$ in state $s$ and then following the same policy $\pi$. Figure 2.2 depicts such a dueling neural network architecture schematically. Its valuable insight is that state values are universally useful on Atari games, but actions have unequal importance across states. Thus, more information on action values is required in important states, while it can be sacrificed in other states. They highlight this in their proposed Q-function

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|A|} \sum_{a'} A(s, a'; \theta, \alpha) \right), \quad (2.17)$$

with $\theta$ the parameters of the shared convolutional layers and $\alpha, \beta$ the parameters of the two dueling streams of fully connected layers.


Figure 2.2: A popular single-stream Q-network (top) and the dueling Q-network (bottom). Illustration taken from Wang et al. (2016).
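The aggregation step of Equation 2.17 is independent of the network architecture and can be sketched on its own; the arrays below stand in for the outputs of the two dueling streams:

```python
import numpy as np

def dueling_q_values(value, advantages):
    """Combine the value stream V(s) and advantage stream A(s, a) as in Equation 2.17.

    value:      shape (batch, 1), output of the state-value stream.
    advantages: shape (batch, n_actions), output of the advantage stream.
    """
    return value + (advantages - advantages.mean(axis=1, keepdims=True))

# Example with one state and three actions: the mean advantage is 0, so Q = V + A.
v = np.array([[2.0]])
a = np.array([[1.0, 0.0, -1.0]])
print(dueling_q_values(v, a))  # [[3. 2. 1.]]
```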

Distributional DQN

Bellemare et al. (2017) introduced Distributional DQN as a way to represent state values as full distributions instead of only the distribution's expectation. Specifically, they transform the update in Equation 2.10 into an update of an entire distribution $Z$, such that

$$Z(s, a) \overset{D}{=} r(s, a) + \gamma Z(s', a'), \quad (2.18)$$

with $Q$ the expected value of $Z$. The authors state that such a distributional approach is mainly favorable when learning non-stationary policies, which have multimodal state value distributions.

Scaled Exponential Linear Units

Klambauer et al. (2017) proposed Scaled Exponential Linear Units (SELU) as an activation function that renders neural networks self-normalizing:

$$\text{selu}(x) = \lambda \begin{cases} x & \text{if } x > 0, \\ \alpha e^x - \alpha & \text{if } x \le 0. \end{cases} \quad (2.19)$$

Figure 2.3: The SELU activation function

The SELU activation function illustrated in Figure 2.3 has four important properties: its domain covers both negative and positive values on a continuous spectrum, ensuring a well-defined derivative. It has saturation regions where the derivatives approach zero and a slope greater than one to increase the weights if they have become too small. For the latter, one needs to make sure that $\lambda > 1$.
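Equation 2.19 translates directly into code; the constants below are the fixed-point values reported by Klambauer et al. (2017), not parameters chosen in this thesis:

```python
import numpy as np

LAMBDA = 1.0507009873554805  # scale factor, chosen such that lambda > 1
ALPHA = 1.6732632423543772   # controls the negative saturation region

def selu(x):
    """Scaled Exponential Linear Unit, Equation 2.19."""
    x = np.asarray(x, dtype=float)
    return LAMBDA * np.where(x > 0, x, ALPHA * np.exp(x) - ALPHA)

print(selu([-2.0, 0.0, 2.0]))
```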

2.3 Machine Learning Classification

Our research aims to functionally map schemas, which necessarily involves a classification task that determines what is functionally useful. Therefore, this section covers the basic knowledge necessary to understand classification tasks and the most common metrics we rely on for our case study.

2.3.1 Algorithms

Machine Learning classification is a well-established field concerned with separating data into a finite number of classes, see, e.g., Bishop (2006). The book cites plenty of classification algorithms, such as Logistic Regression, Support Vector Machines, k-Nearest Neighbors, Decision Trees, and Naive Bayes. Note that the first two options were initially designed for the sub-domain of binary classification, a specific instance in which the number of classes must be two. However, derivations for multiple classes exist nowadays. The other algorithms mentioned are capable of multiclass classification out of the box. Decision Trees perform particularly well and are easy to interpret, two properties that are especially valued in business. We will examine them in the following paragraphs, focusing our explanations on only the necessary background for the case study later.

Decision Trees are a supervised learning method based on a tree structure with feature conditions at each internal node, branches for each outcome of this condition, and leaf nodes for each class label. This tree splits a dataset into various subsets at each internal node until it is distributed across the leaf nodes. The expectation is that the class label associated with each leaf node should coincide with the target label of all the data points within each subset at that leaf node. If this is not the case and not all features have yet been used for splitting, further conditions may be required, adding more branches and leaf nodes. This way, Decision Trees iteratively build deeper trees with more conditions until the number of correct class labels is optimal. Decision Trees by themselves are already powerful, but many additions have been proposed to improve their performance. One of these is a process called boosting, originally designed specifically for classification tasks, although Friedman (2001) has shown it can be used for regression as well. In boosting, multiple base classifiers are weighted and combined to form a committee. These base classifiers, i.e., Decision Trees, are trained sequentially, each influencing the weighting coefficients of the next. When all training is complete, a majority voting scheme combines the various weights into one model, the committee. One such committee is Gradient-Boosted Trees (GBT), which is usually able to outperform any of its constituent base classifiers. This dominance depends on achieving at least better-than-random performance with the individual Decision Trees and combining the strengths of each to outperform most other techniques. Intuitively, an approach that cherry-picks all the correct classifications should be highly prone to overfitting, and indeed, GBTs are no exception. They need to be carefully regularized by, e.g., limiting the number of sequential training iterations or training them on slightly varying datasets using stochastic gradient boosting. Competitive results on the online Data Mining platform Kaggle show that such regularization can be done and that it helps GBTs generalize well. Vorhies (2016) states that the top submissions on structured data, the most common type of data in the industry, are dominated by only a few variations of the gradient boosting method.
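A gradient-boosted tree classifier with the regularization knobs mentioned above (a limited number of boosting iterations and stochastic subsampling) can be sketched with scikit-learn; the synthetic data is illustrative, and the hyperparameter values loosely echo Table 3.2 even though the thesis implementation is not necessarily scikit-learn based:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for a transaction dataset.
X, y = make_classification(n_samples=5_000, n_features=20, weights=[0.97], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

gbt = GradientBoostingClassifier(
    n_estimators=60,     # limit the number of sequential boosting iterations
    max_depth=5,
    learning_rate=0.03,  # step size
    subsample=0.5,       # stochastic gradient boosting as regularization
    random_state=0,
)
gbt.fit(X_train, y_train)
print(gbt.score(X_test, y_test))
```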

2.3.2 Binary Classification Metrics

We briefly outline various metrics used in this area and conclude with the two metrics used in our experiments.

Confusion matrix

Based on the true label and the predicted label of any classification task, we can categorize a prediction into four categories, intuitively illustrated in Figure 2.4.

                Predicted 1    Predicted 0
True class 1        TP             FN
True class 0        FP             TN

Figure 2.4: Illustration of the confusion matrix

The confusion matrix provides a complete description of binary classification performance using true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). However, weighing and comparing four values at once can be challenging. Several aggregated metrics have been proposed that put the confusion matrix into perspective.

Precision

The Precision metric

$$P = \frac{TP}{TP + FP}, \quad (2.20)$$

combines three out of the four main metrics into one. It represents the ratio of correct positive predictions to all positive predictions. Careful models that avoid labeling too many data points as positive usually achieve high values. This cautiousness can potentially lead to missing a lot of true positives that will go unnoticed as false negatives.

Recall

The Recall metric

$$R = \frac{TP}{TP + FN}, \quad (2.21)$$

also combines three out of the four main metrics into one, exchanging FP with FN. Recall, also known as the true positive rate (TPR), represents the ratio of correct predictions to all truly positive data points. Very bold models that label as many transactions as possible as positive usually achieve high values. This boldness can potentially lead to overestimates and numerous false positives.

Receiver Operating Characteristic Curve


Figure 2.5: Schematic illustration of a ROC curve

The Receiver Operating Characteristic (ROC) curve shown in Figure 2.5 is a plot of the true positive rate (TPR) against the false positive rate (FPR),

$$\text{FPR} = \frac{FP}{FP + TN}, \quad (2.22)$$

also known as the Inverse Recall. Analyzing the shape of this curve is useful in itself, but often a single scalar value is handy to quantify performance. Hence, we use the area under the ROC curve (AUROC):

$$\text{AUROC} = \int_0^1 \frac{TP}{TP + FN} \, d\!\left( \frac{FP}{FP + TN} \right). \quad (2.23)$$

The ROC's primary function is to measure the model's capability to distinguish between the two classes. A score of 1.0 represents perfect separation between both classes, while the minimum of 0.0 represents labeling every transaction with the wrong class. Note that such a model can easily be inverted to obtain a perfect classification. For this reason, the worst-performing model achieves a score of 0.5, a random classification. The ROC is a popular metric, but Porwal and Mukund (2019) have argued that it functions sub-optimally on heavily imbalanced data. Due to its prevalence in the literature, we still report the AUROC value in our experiments, but additionally look at another metric.


Precision and Recall Curve


Figure 2.6: Schematic illustration of a PR curve

The Precision-Recall (PR) curve shown in Figure 2.6 plots the Precision metric against the Recall. Similarly to the AUROC, we look at the area under this curve in our evaluations:

$$\text{AUPRC} = \int_0^1 \frac{TP}{TP + FP} \, d\!\left( \frac{TP}{TP + FN} \right). \quad (2.24)$$

In practice, the relationship between Precision and Recall is antagonistic: increasing one of them incurs the cost of decreasing the other. This metric emphasizes the much-dreaded trade-off between them.
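Both areas under the curve can be computed directly from a model's predicted scores; a sketch using scikit-learn, where the AUPRC is approximated by the average precision score and the labels and scores are made up for illustration:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical ground-truth fraud labels and classifier scores.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.08, 0.90, 0.30, 0.65, 0.15, 0.02, 0.40])

print("AUROC:", roc_auc_score(y_true, y_score))
print("AUPRC:", average_precision_score(y_true, y_score))
```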


Chapter 3

Methodology

The previous chapter introduced the domain of dataset schema matching and reviewed modern Reinforcement Learning techniques and Machine Learning classification tasks. In the following chapter, we present our case study in the domain of card payment fraud detection, outlining the necessary characteristics of both domains. With this information, we offer the main contributions of this research: framing the dataset schema matching process as a Reinforcement Learning problem and theoretically deriving a hierarchical reward function. The former is achieved by specifying a design for both states and actions that can be plugged into any Reinforcement Learning algorithm. The latter consists of a microscopic reward signal that works on the step level and a macroscopic signal that works on the episode level. The more localized component focuses on syntactic characteristics of the datasets, while the global component measures semantics.

3.1 Case-Study: Card Payment Fraud Detection

First, we present the case study. This section briefly covers the domain of card payment fraud detection, the classification model, and the dataset. We review how digital payments generally work and which parties are involved in the process. With this knowledge, the optimization problem of fraud detection becomes intuitively evident. Then we can outline specifics of the real business data available to us and the model we apply to this problem. These dataset specifics naturally motivate our design choices for the subsequent sections regarding state representation, action encoding and reward shaping.

3.1.1 The Payment Process in a Digital Economy

An example of a payment fraud process by Jędraszak (2017) is depicted in Figure 3.1. We start with the direction of the money flow: whenever a consumer (here the client) buys a product at a store, online or in a physical outlet, he presents his card details to a merchant. The merchant forwards them via a payment gateway to his bank, called the acquirer. The acquirer forwards them further to the credit card scheme, e.g., Visa or MasterCard, which connects the transaction to the client's bank, called the issuer. Every credit card has been issued in conjunction with one such partner bank. Only when the transaction appears on the real cardholder's bank statement can he verify its integrity. If he chooses to flag the transaction as fraud, the partner bank performs a chargeback and restores the cardholder's money. The chargeback triggers the signal flow in the opposite direction to all the other members of the payments chain, passing the cost of the chargeback down to the merchant. In this entire process, the merchant is the party that suffers most from a chargeback. Credit card schemes and banks do not refund the regular transaction fees and usually add fees on top of the original amount due to the extra work associated with a chargeback.

Figure 3.1: The payment cycle

This is where incentives are formed: the merchant does not want to pass transactions through to the issuing bank if it expects them to get flagged and trigger a costly chargeback. The sold product is most likely already gone, and the merchant has no way to get that money back. Merchants are also often too small to develop and use a large fraud detection model themselves. So they ask specialists, e.g., payment service providers (PSPs), to do this for them against a fee lower than the anticipated lost revenue due to fraud. The PSP blocks transactions before they are passed down to the issuing bank, while guaranteeing specific threshold values for wrong decisions, i.e., failing to block a fraudulent transaction or blocking a harmless one. In this setup, the merchant can deny the fraudulent payment as soon as his client makes it, preventing any loss of goods. To avoid losing the merchant as a client, the PSP is incentivized to keep within its promised threshold values. This is the business domain of fraud detection as a service.

3.1.2 Fraud Detection

As we have just seen, fraud detection is essentially an optimization problem within a classification problem. The classification part is straightforward: separate any financial transaction into one of two classes, fraudulent (1) or non-fraudulent (0). This is an instance of a binary classification problem, which we outlined in Section 2.3.

The optimization part requires more business context. Classifying a benign transaction as fraud, i.e., a ’false positive’, means immediate revenue lost and possibly permanent customer churn. Failing to find a fraudulent transaction and letting it pass through the system, i.e., a ’false negative’, will also result in costs, albeit usually slightly higher. Decreasing only one of them is easy, by deploying excessively defensive models that block a high rate of transactions or not deploying any fraud model at all. However, in a perfect world, we want both false positives and false negatives to be non-existent. In reality, there is a trade-off: if one metric decreases, the other will increase. This dependence is also known as the precision-recall trade-off, which motivates the optimization problem: sizable numbers of false negatives do substantial financial damage directly to a business, but considerable quantities of false positives will lead to customer churn and are potentially brand-damaging, thereby indirectly damaging the business. Any fraud detection mechanism strives to balance these two and minimize the long-term total revenue lost for the merchant.

3.1.3 Dataset Schema

Our case study builds upon a proprietary dataset. We focus on only one specific target schema, which allows us to incorporate a large portion of domain knowledge. We can do this in the form of additional constraints because we know which features to expect. This target dataset is tabular and consists of 26 features and over 400 million observations. It originated at a large payment service provider in Amsterdam. Dataset features fall into the three common data types: numerical, boolean, and categorical. As with any credit card transaction dataset, this one is heavily imbalanced, with 99.3% benign transactions and 0.7% fraudulent ones. These transactions are from cardholders primarily in Europe and the US, but also include some from Asia and South America. The amounts are frequently a few dozen Euro and reach as high as a few hundred, with only a few transactions exceeding 1,000 Euro.

Table 3.1 gives a complete overview of the main features we are investigating. Similar values exist in both source datasets, but we only extracted the sample instances and feature names from the target dataset; to avoid redundancy, we exhibit only this schema. Features are presented in alphabetical order here, but no specific order is given or required during training. We transform boolean features to indicate numerical values of either 0 or 1, rather than false and true. These features are still considered of boolean datatype. Similarly, the column 'fraudlabel' can only adopt the values 0 or 1, but is algorithmically required to be of type Double. Also, we universally convert all String-based features to their lowercase counterparts and remove whitespace between words, thereby eliminating the redundancy of differently capitalized words, e.g., Visa debit vs. Visa Debit vs. VisaDebit. If, for whatever reason, a feature contains null values only, we see no informational value and remove it from the dataset. Such features may appear when a client's source dataset attempts to conform to the target data by including empty features to imitate the target schema.

After such basic data cleaning, we adjust the fraud ratios to resemble those of the target dataset. This adjustment may mean either upsampling or downsampling, depending on the source data characteristics. In the data presented in the case-study later, we upsampled fraud. This is necessary for a fair comparison of classification metrics, because heavily imbalanced data may impact performance and mapping success.
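The cleaning and resampling steps described above can be sketched with pandas; the column name follows Table 3.1, while the helper name, the default fraud ratio, and the duplication-based upsampling are assumptions of this sketch rather than the exact pipeline used in the thesis:

```python
import pandas as pd

def clean_source(df: pd.DataFrame, target_fraud_ratio: float = 0.007) -> pd.DataFrame:
    # Lowercase string features and strip whitespace, e.g. "Visa Debit" -> "visadebit".
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.lower().str.replace(r"\s+", "", regex=True)

    # Booleans become 0/1, and features containing only null values are dropped.
    for col in df.select_dtypes(include="bool").columns:
        df[col] = df[col].astype(int)
    df = df.dropna(axis=1, how="all")

    # Upsample fraud rows until the fraud ratio roughly matches the target dataset.
    fraud, benign = df[df["fraudlabel"] == 1], df[df["fraudlabel"] == 0]
    needed = int(target_fraud_ratio * len(benign) / (1 - target_fraud_ratio))
    if len(fraud) > 0 and needed > len(fraud):
        fraud = fraud.sample(needed, replace=True, random_state=0)
    return pd.concat([benign, fraud]).sample(frac=1.0, random_state=0)
```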

3.1.4 Classification Model

Our main case study builds upon a Gradient-Boosted Tree specifically trained on the target dataset. The parameters were found using manual grid search and are given in Table 3.2. The trained model results in the scaled feature importance list shown in Table 3.3. We scale each score using min-max normalization (see Equation 3.6), resulting in a score of 1.0 for the most important feature and a score of 0.0 for the least important feature. We note that tuning the model to each particular dataset would likely increase its classification scores, but such tuning is not strictly necessary. The similarity measure cares only about the relative difference in scores across datasets, not the absolute performance. A different, larger, pre-trained model will be used for absolute predictions and practical usage later.
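The min-max normalization behind Table 3.3 is a one-liner; a sketch assuming the raw importances are available as a plain dictionary with made-up values:

```python
def min_max_scale(importances: dict) -> dict:
    """Rescale raw feature importance scores to [0, 1], as in Table 3.3."""
    lo, hi = min(importances.values()), max(importances.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {name: (score - lo) / span for name, score in importances.items()}

print(min_max_scale({"cardbin": 0.41, "merchant": 0.20, "cardbrand": 0.02}))
```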

3.2 Framing Schema Matching in a Reinforcement Learning Context

In this section, we outline how functional schema matching can be designed as a Reinforcement Learning problem. Algorithms in this domain generally model the problem as a Markov Decision Process, as outlined previously in Section 2.2.1. Therefore, we are looking for a tuple $(S, A, P, R)$ that describes functional schema matching. First, we consider sequences $\mathcal{T}$ of arbitrary length $t$ of actions $a \in A$, observations $o \in O$ and rewards $r \in R$, such that

$$\mathcal{T}_t = (o_1, a_1, r_2, o_2, \ldots, a_{t-1}, r_t, o_t) \quad (3.1)$$

and learn strategies that depend upon these sequences. More specifically, we define our problem environment as fully observable, i.e.,

$$\forall t, o \in O, s \in S : o_t = s_t, \quad (3.2)$$

and hence do not need to distinguish between states $s$ and observations $o$. All sequences terminate after a pre-defined number of time steps $T$. We further specify the environment as fully deterministic, which means that any action $a$ in state $s$ leads to some state $s'$ with probability $p = 1.0$. This general definition gives rise to a large but finite Markov Decision Process. As a result, we can apply standard Reinforcement Learning methods to it. The precise environment definition and its translation into states, actions, and rewards are of particular interest and are covered extensively in the following sections.
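The fully observable, deterministic, finite-horizon environment described here can be sketched as a minimal gym-style class; the class name, constructor arguments, and the placeholder methods are illustrative stand-ins for the state representation, action encoding, and reward shaping introduced in the following sections:

```python
class SchemaMatchingEnv:
    """Deterministic, fully observable episode of at most T mapping steps (sketch)."""

    def __init__(self, source_descriptors, target_descriptors, max_steps=50):
        self.source = source_descriptors  # per-feature descriptors of the source schema
        self.target = target_descriptors  # per-feature descriptors of the target schema
        self.max_steps = max_steps        # finite episode length T
        self.reset()

    def reset(self):
        self.t = 0
        self.mapping = {}                 # current source-to-target feature assignment
        return self._state()

    def step(self, action):
        self._apply(action)               # actions edit the mapping deterministically (p = 1.0)
        self.t += 1
        reward = self._reward()           # hierarchical reward, see Section 3.3
        done = self.t >= self.max_steps
        return self._state(), reward, done, {}

    # Placeholders for the designs of Sections 3.2.1, 3.2.2 and 3.3.
    def _state(self): ...
    def _apply(self, action): ...
    def _reward(self): ...
```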


Table 3.1: Overview of a complete dataset schema

Field Datatype Sample Description

acceptorcountry Integer 428 The numeric Country ISO-3 code

acceptorpostalcode String 3768ab The local postal code

acceptorstreetaddress String schoutenkampweg81 The cardholder’s private address

authdate String 2019-08-08 The transaction’s authentication date

authresult String success The failure code or ’success’

cardbrand String Visa The card brand.

cardbin Integer 530446 The first six digits of the card number.

cardexpirymonth Integer 11 The expiration month of the card

cardexpiryyear String 2021 The expiration year of the card

cardprepaid Boolean 1.0 Indicates if the card is prepaid.

cardtype String debit Specific type of card, credit or debit

channel String ecom The payment channel, E-Commerce, mobile, etc.

currency Integer 978 The numeric Currency ISO-3 code

cvvused Integer 1.0 Are Card Verification Value (CVV) checks done?

fraudlabel Double 1.0 Indicates if the transaction is fraudulent

euramount Double 112.37 Indicates the amount of the transaction in Euro.

initialrecurring Boolean 1.0 Flag for the first transaction of a recurring transaction.

issuingbank String ING Bank The bank which issued the card.

mcccode String 7311 Four digit merchant type

merchant Integer 5677788 The identifier of the merchant.

posentrymode String 812 ISO 8583 (field 22)

recurring Boolean 1.0 Indicates a recurring transaction

threedsused Boolean 1.0 Indicates if 3D secure (3DS) is enabled.

timestamp Integer 1592837408 The UNIX timestamp of the transaction

transactionid String 1943883231 A unique identifier of the transaction

transactiontype String capture The type of transaction

3.2.1 Representing a State

We have established that dataset schemas are structural abstractions based on real data. The real data input is in the order of 10^6 to 10^8 observations, far too large to use as a state from a computational perspective. Instead, we use a compressed representation that maintains the key characteristics of the data while abstracting away unnecessary details. It is not intuitively clear what constitutes a key characteristic, and there is no standard way to describe a schema in such a condensed manner. We therefore present our own version, which serves as the state representation. One intuitive way to translate a structured dataset into a vectorized numerical representation is via word embeddings, e.g., GloVe by Pennington et al. (2014) or word2vec by Mikolov et al. (2013), as they capture the similarity between words appropriately. However, this approach is not feasible due to a mismatch between the words included in this dataset and those covered by the embeddings.

Embeddings mostly cover natural language as found in large public web corpora, e.g., Wikipedia for GloVe. They exclude abbreviations, custom and niche product or company names, and alphanumeric literals, such as postal codes and addresses. Moreover, they rely on in-depth knowledge of every single word encountered to describe similarity accurately. This reliance fails for many of the words found in financial transaction datasets.


Table 3.2: Overview of the Gradient Boosted Tree hyperparameters

Hyperparameter              Value
Maximum bins                31510
Maximum depth               5
Maximum iterations          60
Step size                   0.03
Minimum instances per node  40
Subsampling rate            0.5
Loss type                   Logistic

Table 3.3: Overview of scaled feature importance scores, sorted from highest to lowest

Feature Scaled importance

cardbin               1.0
merchant              0.377336
transactiontype       0.294960
acceptorpostalcode    0.180503
euramount             0.090085
mcccode               0.036104
acceptorcountry       0.005981
channel               0.004973
currency              0.004639
cardprepaid           0.000985
cardtype              0.000169
cardbrand             0.0

If any word from either the source or target datasets does not exist in the word embedding corpus, the approach will fail. To find a standardized structure that works for any possible feature, we instead define the notion of a feature descriptor. Each feature descriptor is a high-level representation of a feature in the original full-size dataset. It consists of M = 16 custom properties that are carefully chosen to be boolean or numerical and scaled within the range [0, 1]. Each property covers at least one potential feature category: numerical data, categorical data, or binary data.

In the first section of Table 3.4, we propose the set of intra-feature properties Ω, each of which addresses at least one of these categories. These data-based properties are computed once per feature in the raw dataset and do not change after that. This way, any feature descriptor consists of immutable properties describing only its real counterpart, ensuring a meaningful schema at any point in time. Together with the last similarity score Σ, this part functions as our state. It is the raw input to our model's neural network and the decision basis for actions taken during each step. In the second section, we add the set of inter-feature properties Φ, each of which measures the direct similarity between two selected features. They are computed at each step, reflecting actions taken by the agent, and therefore change over time. These algorithm-based properties are used together with Ω to compute the reward, which we describe in more detail in Section 3.3. The third section merely stores the last computed reward associated with this feature. We include this for tracking purposes during training, but also as an indication of which features need to change the most.
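To fix ideas, the sketch below writes such a descriptor down as a plain data structure, grouped as in Table 3.4. The class and field names are our own illustration and are not taken from the actual implementation.

from dataclasses import dataclass


@dataclass
class FeatureDescriptor:
    # Intra-feature properties Ω: computed once from the raw data, immutable afterwards.
    is_none: float = 0.0
    is_country: float = 0.0
    is_date: float = 0.0
    is_ip: float = 0.0
    is_alpha: float = 0.0
    is_alphanumeric: float = 0.0
    is_currency: float = 0.0
    is_boolean: float = 0.0
    distinct_value_ratio: float = 0.0
    maximum: float = 0.0
    average: float = 0.0
    median: float = 0.0
    standard_deviation: float = 0.0
    # Inter-feature properties Φ: recomputed each step against the currently mapped target feature.
    distinct_matches: float = 0.0
    relative_magnitude: float = 0.0
    type_match: float = 0.0
    # State value tracker Σ: the last similarity score this feature achieved.
    last_similarity_score: float = 0.0

    def network_view(self) -> list:
        """The 14 values per feature that form the neural network input: Ω plus Σ."""
        return [self.is_none, self.is_country, self.is_date, self.is_ip, self.is_alpha,
                self.is_alphanumeric, self.is_currency, self.is_boolean, self.distinct_value_ratio,
                self.maximum, self.average, self.median, self.standard_deviation,
                self.last_similarity_score]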


Table 3.4: Overview of feature properties

Property                      Numerical  Categorical  Binary
Intra-feature properties Ω
is none (ω1)                  X          X            X
is country (ω2)               -          X            -
is date (ω3)                  X          X            -
is ip (ω4)                    -          X            -
is alpha (ω5)                 -          X            -
is alphanumeric (ω6)          -          X            -
is currency (ω7)              X          X            -
is boolean (ω8)               -          -            X
distinct value ratio (ω9)     X          X            X
maximum (ω10)                 X          X            X
average (ω11)                 X          X            X
median (ω12)                  X          X            X
standard deviation (ω13)      X          X            X
Inter-feature properties Φ
distinct matches              X          X            X
relative magnitude            X          X            X
type match                    X          X            X
State value tracker Σ
last similarity score         -          -            -

Assuming there are N = 14 features given in the dataset, this representation results in a state vector of size

N × |Ω ∪ Σ| = 14 · 14 = 196    (3.3)

for the neural network part and a reward vector of size

N × |Ω ∪ Φ| = 14 · 16 = 224.    (3.4)
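As a quick sanity check of these dimensions, the snippet below reproduces the arithmetic of Equations 3.3 and 3.4 (the constant names are ours, chosen for readability).

N = 14        # feature descriptors in the source dataset
OMEGA = 13    # intra-feature properties
PHI = 3       # inter-feature properties
SIGMA = 1     # last similarity score

state_size = N * (OMEGA + SIGMA)    # neural network input: 14 * 14 = 196
reward_size = N * (OMEGA + PHI)     # reward computation input: 14 * 16 = 224
assert (state_size, reward_size) == (196, 224)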

As an example, Table 3.5 shows a state obtained from a source dataset with N = 14. In the following, we give a summary of each property:

Intra-feature properties Ω

1. is none: Used to describe features that contain only null values. This can be necessary when the dimensionality of the source and target datasets differs: if the source contains more features than the target, some features cannot be mapped to anything, which led us to introduce dummy features in the target that consist only of null values. Represents a boolean indicating if the feature is such a dummy.


Table 3.5: An example of a state representation. Each line represents a feature descriptor in the source dataset. The feature names are given for illustration, but are not part of the neural network input. The order is deliberately non-alphabetical, as it encodes mapping information.

Feature              ω1 ω2 ω3 ω4 ω5 ω6 ω7 ω8 ω9    ω10 ω11   ω12   ω13
acceptorpostalcode   0  0  0  0  0  1  0  0  0.254 1   0.755 0.778 0.132
acceptorcountry      0  0  0  0  0  0  0  0  0.019 1   0.707 0.714 0.106
cardbrand            0  0  0  0  1  1  0  0  0.002 0.5 0.5   0.5   0
currency             0  0  0  0  0  0  1  0  0.016 1   0.989 0.992 0.042
cardprepaid          0  0  0  0  0  0  0  1  0.002 0.5 0.5   0.5   0
cardtype             0  0  0  0  1  1  0  0  0.003 1   0.996 1     0.064
mcccode              0  0  0  0  0  0  0  0  0.037 1   0.558 0.539 0.073
channel              0  0  0  0  1  1  0  0  0.004 1   0.188 0     0.391
transactiontype      0  0  0  0  0  1  0  0  0.003 1   0.999 1     0.031
timestamp            0  0  1  0  0  0  0  0  0.954 1   0.672 0.675 0.191
transactionid        0  0  0  0  0  0  0  0  1     1   0.388 0.351 0.327
cardbin              0  0  0  0  0  0  0  0  0.004 1   0.414 0.386 0.134
merchant             0  0  1  0  0  0  0  0  0.371 1   0.038 0     0.124
euramount            0  0  0  0  0  0  0  0  0.312 1   0.282 0.278 0.016

2. is country: Applies a variety of common country code conversions to check if the data represents countries. This includes the numeric ISO codes as well as the two- and three-letter codes. Represents a boolean indicating if the conversions were successful.

3. is date: Applies date format conversions to check if the data represents calendar dates. This includes the ISO 8601 standard, commonly encountered with the Zulu (UTC) time designator. Represents a boolean indicating if the conversions were successful.

4. is ip: Represents a boolean indicating if the majority (> 50%) of the data represents IP addresses.

5. is alpha: Represents a boolean indicating if the majority (> 50%) of the data represents alphabetic characters.

6. is alphanumeric: Represents a boolean indicating if the majority (> 50%) of the data represents alphanumeric characters.

7. is currency: Applies a variety of common currency code conversions to check if the data represents currencies. This includes the numeric ISO codes as well as the three-letter codes. Represents a boolean indicating if the conversions were successful.

8. is boolean: Represents a boolean indicating if the majority (> 50%) of the data represents boolean values.

9. distinct value ratio: Computes

r = √(distinct value count / total value count)    (3.5)

as a measure of what fraction of the data is distinct. We use the square root to spread the distribution of ratios towards 1.0. Without it, the distribution is naturally dense close to 0, because most categorical features contain extremely few distinct values compared to the large dataset size. This low cardinality usually leads to distinct value ratios smaller than 0.01, and distinguishing features based on such small values is challenging, hence the scaling. (A code sketch approximating these property computations is given after the property list.)


10. maximum: Represents the maximum numerical value found in the feature. If the feature is not numerical, we use the string length of its longest element.

11. average: Represents the average numerical value found in the feature. If the feature is not numerical, we use the average string length of its elements.

12. median: Represents the median numerical value found in the feature. If the feature is not numerical, we use the median string length of its elements.

13. standard deviation: Represents the standard deviation of all numerical values found in the feature. If the feature is not numerical, we use the standard deviation of the string lengths of its elements. We note that the last four properties (maximum, average, median, and standard deviation) are scaled using min-max normalization:

x ← (x − min(x)) / (max(x) − min(x)).    (3.6)

Inter-feature properties Φ

14. distinct matches: Describes the percentage of distinct values in the common subset between the target and the source, among all values.

15. relative magnitude: Describes the length of the mean value if the feature is numeric, or the mean string length otherwise.

16. type match: Represents a boolean indicating if the data types of both features in the source and target match.

State value tracker Σ

17. last similarity score: Stores the last computed reward that this feature achieved. It does not keep track of the feature pair that led to this reward and does not distinguish between syntactic and semantic reward proportions.
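To make these properties more tangible, the sketch below approximates a subset of them with pandas. It is a simplified illustration under our own assumptions: the exact heuristics, thresholds, and code conversions used in practice are more involved (the country, currency, and IP checks are omitted entirely here), and all function names are illustrative.

import numpy as np
import pandas as pd


def intra_feature_properties(col: pd.Series) -> dict:
    """Rough approximation of a subset of the intra-feature properties Ω."""
    values = col.dropna().astype(str)

    def majority(mask) -> float:
        return float(len(mask) > 0 and mask.mean() > 0.5)

    numeric = pd.to_numeric(col, errors="coerce")
    is_numeric = numeric.notna().mean() > 0.5
    # Numeric features use their values directly; all others fall back to string lengths.
    basis = numeric.dropna() if is_numeric else values.str.len()
    return {
        "is_none": float(col.isna().all()),  # dummy feature consisting only of nulls
        "is_date": majority(pd.to_datetime(col, errors="coerce").notna()),
        "is_alpha": majority(values.str.isalpha()),
        "is_alphanumeric": majority(values.str.isalnum()),
        "is_boolean": float(set(values.unique()) <= {"0", "1", "0.0", "1.0", "True", "False"}),
        # Equation 3.5: the square root spreads the otherwise near-zero ratios
        # of low-cardinality categorical features towards 1.0.
        "distinct_value_ratio": float(np.sqrt(col.nunique() / max(len(col), 1))),
        # The four statistics below are subsequently min-max scaled into [0, 1] (Equation 3.6).
        "maximum": float(basis.max()),
        "average": float(basis.mean()),
        "median": float(basis.median()),
        "standard_deviation": float(basis.std(ddof=0)),
    }


def inter_feature_properties(source: pd.Series, target: pd.Series) -> dict:
    """Rough approximation of the inter-feature properties Φ for a candidate pair."""
    src = set(source.dropna().astype(str).unique())
    tgt = set(target.dropna().astype(str).unique())
    numeric = pd.to_numeric(source, errors="coerce")
    if numeric.notna().mean() > 0.5:
        magnitude = float(len(str(int(abs(numeric.mean())))))  # digit count of the mean value
    else:
        magnitude = float(source.dropna().astype(str).str.len().mean())
    return {
        "distinct_matches": len(src & tgt) / max(len(src | tgt), 1),  # shared distinct values
        "relative_magnitude": magnitude,  # our literal reading of the property description
        "type_match": float(source.dtype == target.dtype),
    }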

To bring it all together, we refer to the neural network input as the state. The state is an ordered list of feature descriptors. At all times, we keep a copy of the target state at hand, which naturally does not change during training. With this setup, each descriptor in the source state is directly associated with one descriptor in the target — the one in the same position. This association describes a mapping between features in the source and the target. By mutating this order of descriptors, we change the state and with it the mapping function. The agent’s actions are fully responsible for these mutations.
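A toy illustration of this positional pairing is given below; plain feature names stand in for the full descriptors, and all names and values are invented for the example.

# The mapping is encoded purely by position: source descriptor k maps to target descriptor k.
source_order = ["timestamp", "cardbin", "euramount", "merchant"]     # stand-ins for source descriptors
target_order = ["cardbin", "timestamp", "amount", "merchantnumber"]  # fixed target descriptors

print(dict(zip(source_order, target_order)))
# {'timestamp': 'cardbin', 'cardbin': 'timestamp', ...} -- a poor mapping so far

# One action swaps two source descriptors and therefore changes exactly two mapped pairs.
i, j = 0, 1
source_order[i], source_order[j] = source_order[j], source_order[i]

print(dict(zip(source_order, target_order)))
# {'cardbin': 'cardbin', 'timestamp': 'timestamp', ...} -- both pairs are now aligned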

3.2.2

Encoding Actions

We propose a single type of action: swapping the position of two feature descriptors within the source state, resulting in a changed mapping for exactly two features. This constitutes the minimal possible change to a state that still ensures a meaningful mapping at all times. Intuitively, the agent should be able to swap any two feature descriptors, so we can use combinatorics to calculate the action space. Let N be the number of feature descriptors; then a total of

N · (N − 1) / 2    (3.7)

unique pairs exist; for N = 14, this amounts to 91 possible actions. This excludes pairs that differ only in their ordering, since the swapping context clearly defines those as equal, e.g., swapping A with B and swapping B with A.


We define a function that maps a scalar integer value

a ∈ [0, N · (N − 1) / 2)    (3.9)

to two scalar values i, j ∈ [0, N − 1] denoting feature indices in the ordered source dataset. The algorithm for this mapping is shown in Algorithm 1 and illustrated in Figure 3.2. The choice of such a simple action reduces complexity and training time while ensuring a homogeneous action space. It is conceivable to add other types of actions, such as deleting a feature that does not have a mapping target, combining two features into one, or splitting one feature into two. However, we opt for a problem definition that avoids these, by detecting impossible feature mappings in the reward function and mapping them to placeholder null features, and by maximally splitting all features in an intuitive way as part of the data preparation process. We hope this simplified approach supports the training effort.

Algorithm 1 Encoding scalar action a to index pair (i, j)
  Initialize first index i = 1
  Initialize second index j = 2
  Initialize environment with size L
  Receive input action a
  for k = a to 0 do
    Set j = j + 1
    Set i = i + ⌊j / (L + 1)⌋
    if j = L + 1 then
      Set j = j − (L − i)
    end if
  end for
  Set i = i − 1
  Set j = j − 1

Figure 3.2: Illustration of the action encoding. Each number represents a given scalar action and each dotted line represents the two associated features to be swapped.
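As a reference point, a straightforward way to realize the same scalar-to-pair encoding in Python is to enumerate all index combinations in a fixed order; this is an equivalent alternative sketch, not the thesis implementation of Algorithm 1.

from itertools import combinations


def decode_action(a: int, n: int) -> tuple:
    """Map a scalar action a in [0, n*(n-1)/2) to a pair of feature indices (i, j) with i < j."""
    pairs = list(combinations(range(n), 2))  # fixed, deterministic ordering
    return pairs[a]


def apply_action(state: list, a: int) -> list:
    """Swap the two feature descriptors selected by action a (returns a new list)."""
    i, j = decode_action(a, len(state))
    new_state = list(state)
    new_state[i], new_state[j] = new_state[j], new_state[i]
    return new_state


# For N = 14 features there are 14 * 13 / 2 = 91 distinct swap actions.
assert len(list(combinations(range(14), 2))) == 91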
