
MSc Artificial Intelligence

Master Thesis

Interpretable Representation Learning for Relational Data

by

Bella Nicholson

12094870

August 24, 2020

36 EC January 2020 − August 2020

Supervisors:

Sadaf Gulshad, MSc

Sjoerd van Bekhoven, MSc

Assessor:

Dr. Hinda Haned


Contents

1 Introduction
  1.1 Contributions
  1.2 Outline
2 Preliminaries
  2.1 Relational Data
  2.2 Graph Neural Networks
3 Related Works
  3.1 Standard Relational Data Learning Approaches
  3.2 Graphical Neural Network-based Solutions
  3.3 Node Representation Learning
4 Relational Data Representation Learning
  4.1 Model Selection
  4.2 Graph Construction
  4.3 From Graph to Embedding
5 Experiments
  5.1 Data Set
  5.2 Subgraph Visualizations
  5.3 Qualitative Evaluations
  5.4 Quantitative Evaluations
6 Conclusion
A Appendix
  A.1 RDBToGraph Example
  A.2 Embedding Visualization Procedure


List of Figures

1 Active research areas in artificial intelligence
2 The mass adoption of database technology
3 Explainability-Criticality Matrix
4 Relational database example
5 Different graph representations examples
6 2D convolution vs. graph convolution
7 Graph convolutional network
8 Depiction of classical relational data learning paradigm
9 RDBToEmbedding input and output
10 Node representation learning
11 Node2vec random walk
12 Proposed framework for relational data representation learning
13 RDBToEmbedding architecture
14 Non-target entity edge reconstruction
15 A node neighborhood preview
16 Second step in node neighborhood
17 Extension of neighborhood preview
18 Simple latent space cluster formations
19 Numerical latent space clusterings
20 Complex latent space cluster formations
21 Implicitly learned latent space clusterings
22 RDBToGraph heuristic example
23 Employee feature distributions

List of Tables

1 RDBToGraph heuristic
2 Construction of our intermediary graph representation
3 Quantitative results


Abstract

The field of natural language processing (NLP) relies heavily upon word embeddings to translate semantic concepts and meanings into their analogous but more workable numerical representations. By doing so, we bypass the need to continuously rebuild real-world understandings of word meaning every time we execute a model, which frees researchers to focus on the problems they originally intended to solve. In this paper, we seek to bring the same benefits of comprehensive representation learning to the realm of relational data — a data type that is used nearly ubiquitously across every industry. Unlike its natural language counterpart, little is established with regards to directly learning from relational data. Consequently, obtaining a general-purpose representation of desired relational database (RDB) objects is even more critical, as this would allow us to leverage many recent machine learning advances in a way that was not previously possible. In our proposed relational-database-to-embedding algorithm, we first craft a graph representation of our target objects, and then apply a self-designed graph neural network architecture to obtain learned interpretable, low-dimensional embeddings.


1 Introduction

Over the past forty years, relational data usage has reached an unparalleled level of ubiquity. Unlike many other classes of data, relational data extends into almost every industry — from online retail to the health care sector [Arora and Gupta, 2012]. Oftentimes, relational databases are seen as an integral part of day-to-day operations rather than just another data type. Relational data's foundational role in business and its overwhelming pervasiveness suggest that its potential applications are boundless. All the while, this data class has not enjoyed the fruits of the "neural revolution" in the ways that its natural language, computer vision, or click data counterparts have [Cvitkovic, 2020; Dean, 2020; Krizhevsky et al., 2012].

This blind spot in the literature arises from the difficulty of coalescing relational data into an easily acceptable neural input [Cvitkovic, 2020]. Figure 1 depicts this incongruence between degree of data homogeneity and papers published per topic. In contrast to its more homogeneous counterparts, relational data is eclectic in nature. As a series of interlinked tables, relational databases (RDBs) contain the same assorted mix of numerical, categorical, and ordinal information as tabular data does, as well as an additional dimension of information. Here, in this additional dimension, specific concepts serve as mapping functions between data points and pose further problems in crafting a suitable representation [Kanter and Veeramachaneni, 2015; Lam et al., 2017]. As there exists no clear procedure to consolidate such a degree of data heterogeneity, the research is still in its infancy.

Figure 1: Published arXiv papers grouped by artificial intelligence subcategory. Ever since the introduction of AlexNet [Krizhevsky et al., 2012], the field of artificial intelligence (AI) has undergone an undeniable boom. With computer vision and natural language commanding most of the AI community's recent attention, deep learning applications on homogeneous data types have been well-explored. The same cannot be said for more heterogeneous data types. Adapted from: Shoham et al., 2018.

Oftentimes, researchers resort to flattening RDBs into a single tabular representation to render their given data more "workable". However, this approach poses three critical problems: (a) it destroys all relational information, (b) it introduces false correlations into the data, and (c) it limits model selection to tree-based models [Cvitkovic, 2020]. Even if we choose to overlook the first two problems of this approach, tree-based models [Arik and Pfister, 2019; Friedl and Brodley, 1997; Kontschieder et al., 2015; Wang et al., 2017] are not a suitable choice in every situation. Consider a relational database where certain data entries are free to interact with and influence the properties of all other data entries — e.g., any system where human interaction is prevalent. In such a case, tree-based models would fail to consider the information propagation occurring amongst data points, even if that interaction dominates our given system. In contrast, graph neural networks (GNNs) are much better suited for such a task [Xu et al., 2018; Zhou et al., 2018].


Figure 2: The mass adoption of database technology. The generalizable and versatile nature of relational data has not only enabled its widespread adoption, but has also allowed it to stand the test of time. Even to this day, relational data and its many variants still serve as the dominant storage model for business transactions and enterprise management. Adapted from: Dave Labuda 2018.

Alternatively, we may be interested in predicting an object's behavior or characteristics at the future time slice t + 1 given all known information from preceding time slices t, t − 1, and so on. In this instance, tree-based models have no mechanisms to account for the passage of time, while the likes of recurrent neural networks (RNNs) [Jordan, 1997] and long short-term memory (LSTM) networks [Hochreiter and Schmidhuber, 1997] do [Fraccaro et al., 2016]. In short, the model must fit the problem, and when the field of potential problems ranges so greatly, we require an equally vast set of possible solutions.

Clearly, some form of representation learning is needed to address these problems. Although end-to-end learning solutions are often deemed more desirable [Arik and Pfister, 2019], they do not translate well to fields where each data point is heavily imbued with real-world meanings and concepts. Nowhere is this better exemplified than in the ways natural language processing (NLP) and computer vision research have diverged in their respective developments [Krizhevsky et al., 2012; Mikolov et al., 2013; Wallach, 2006]. In the latter, a pixel, the most basic unit of a computer image, consists of nothing more than either a numerical value or a vector. Even though blocks of pixels compose semantically meaningful objects, each pixel is relatively simple for a machine to understand. Thus, in the field of computer vision, end-to-end learning is easily achievable [Voulodimos et al., 2018]. In stark contrast, natural language encodes real-world concepts and meanings as a series of sequential characters. As such, machines face the significant hurdle of first learning semantic meanings before they can begin to solve specific tasks. As a consequence, it is of no surprise that the advent of word embeddings revolutionized the way natural language is approached [Mikolov et al., 2013; Peters et al., 2018; Wallach, 2006]. The introduction of pre-learned word embeddings freed researchers to delve into more complicated tasks [Bahdanau et al., 2014; Gambhir and Gupta, 2017] rather than wasting their efforts on the constant relearning of word meanings.

Generally, raw relational data more closely resembles natural language than computer vision in terms of the level of real-world meaning ascribed to each data point. Thus, we consider the development of relational entity embeddings, i.e., general-purpose relational data representations, to be a potentially worthwhile pursuit on its own. As seen in the field of NLP, pre-trained embeddings make neural architectures easily interchangeable during implementation. Meaning, relational data embeddings would drastically expand the number of possible models at our disposal, and allow us to truly select a suitable model for any given problem.

Of course, unlike natural language, relational data learning is innately application-focused work. From its inception, relational data has almost exclusively served as the industry-preferred data storage model (Figure 2). With corporate enterprises and government institutions as the main holders of relational data [Arora and Gupta, 2012; Joseph and Johnson, 2013], we must also concern ourselves with the implications of machine learning models "misbehaving" due to hidden biases. Decisions made in these arenas bear indisputable financial and social ramifications for society as a whole.


Figure 3: Explainability-Criticality Matrix. Machine learning models applied in the irrevocable quadrant must be fully understood and explained before their practical implementation. As virtually all these industries or governmental bodies use relational data, the need for an explainable representation learning approach is clear. In contrast, if we do not consider the interpretability of whatever approach we derive, then we restrict our domain of application to those shown in the personalized and experiential quadrants, where many of these applications neither require nor use relational data. Adapted from: [Joshi and Mittal, 2019].

A discriminatory hiring algorithm can leave a lasting stain on a mega-corporation's reputation as it further promotes workplace inequality [Hamilton]. A biased crime prediction software intended to aid in law enforcement can further persecute racial minorities and cement socio-economic divides across racial lines [Larson et al., 2016; Lum and Isaac, 2016]. When machine learning intermingles with corporate and governance policies, explainability and interpretability measures become non-negotiable if we wish to remedy the problems we set out to solve rather than unwittingly worsen them. Thus, we can safely conclude that the generation of meaningful relational data embeddings is a matter of representation learning just as much as it is of explainability.

Naturally, the selection of problem instances where model explainability is critical can ensure the interpretability of our developed approach. This in turn renders our model relevant regardless of which quadrant in the Explainability-Criticality Matrix a specific application belongs to (Figure 3). Additionally, we require our chosen problem instance to be concrete and easy to conceptualize. Learning representations for tangible real-world objects aids the reader's intuitive understanding of our broader representation learning problem and of our model's underlying mechanisms. Ideally, we wish this problem instance to be something the reader has real-world experience with, such that non-experts can easily discern any domain-specific knowledge that they may encounter. Hence, we have developed our approach in the context of human resources (HR), a field which fulfills all of the aforementioned criteria. As the majority of relational data learning applications only concern themselves with one or two object classes from the entire relational database, we will also restrict ourselves to representation learning for a specific class of objects.


This more focused interpretation of representation learning allows us to better leverage a priori knowledge, as we no longer have to worry about the inadvertent effects that any graph-based manipulations may have on less relevant entities. In an HR database, employees are indisputably the most important entity group, since corporations are, at their core, nothing more than groups of people working together towards some financial objective. As a consequence, we select the generation of employee embeddings as our chosen problem instance.

To summarize, this thesis concerns itself with the interpretable generation of dense, low-dimensional vector representations for selected relational database objects. Particularly, we do so in the context of employee embeddings, as this problem instance allows the reader to focus more so on our chosen methodology and less so on the domain-specific knowledge at hand.

1.1 Contributions

Fortunately for us, recent developments make the problem of relational data learning ripe for innovation. Graph-based relational data representations have already been suggested as a viable alternative to the problematic tabular representations frequently used. The work of [Cvitkovic, 2020] demonstrates that graph-based approaches generally outperform various tree-based models in classical supervised learning tasks. Meanwhile, the problem of extracting meaningful embedding representations of graph nodes is largely considered to be solved by the scientific community [Donnat et al., 2018; Hamilton et al., 2017a,c]. Given these advances, we now wish to test the merits of node representation learning on the graph representations of relational databases.

That is, we are interested in the effects of merging these two bodies of work together and their subsequent effects on downstream tasks. For starters, general-purpose relational data entity embeddings have the potential to increase the ease of implementation for downstream tasks, and provide gains in performance similar to those that word embeddings brought to natural language processing [Devlin et al., 2018; Mikolov et al., 2013; Peters et al., 2018]. Furthermore, this work allows researchers to apply new classes of well-established machine learning models to relational data learning rather than consistently relying on either tree-based models or graph neural networks to accomplish their goals.

To summarize, our contributions are as follows:

• We further explore the tentative promises of graph-based representations in relational data learning through the introduction of our RDBToEmbedding model.

• Moreover, we quantify the influence of object-to-object relations in the context of a given set of relational data learning tasks.

• Our introduced framework aids in downstream learning tasks, while leveraging the interpretability that is inherent to graph-based inputs.

• Finally, we demonstrate a combinatory objective function’s capacity to aggregate information when dealing with heterogeneous data types.

1.2 Outline

This thesis consists of six further sections. In Section 2, we review the fundamental concepts that underlie relational data and the class of models we incorporate into our approach. The related works, described in Section 3, survey the fields of relational data learning and graph representation learning. Continuing to Section 4, we introduce our framework for obtaining embeddings from relational databases in detail. Furthermore, we present the various mechanisms that make RDBToEmbedding possible. Section 5 discusses experiments on dissecting the generated latent space and the improvements achieved in basic classification tasks. Finally, Section 6 concludes this thesis with a reflection on the results and suggests future works.


2 Preliminaries

The following chapter introduces the reader to relational data and graph neural networks (GNNs). Section 2.1 provides a formal definition of relational databases and reviews their basic concepts. Meanwhile, Section 2.2 presents a gentle introduction to graph neural networks (GNNs).

2.1 Relational Data

A relational database (RDB) refers to a set of tables T = {T_1, T_2, . . . , T_T} where relational mappings exist amongst different data entries. Each table T_i ∈ T corresponds to a specific type of object or concept, such that all object instances from that overall class are grouped together [Bachman, 2009; Codd, 2002]. In the literature, relational data and relational databases are often used interchangeably [Codd, 1989; Cvitkovic, 2020]. The former refers to a relational database as a data type, while the latter usually refers to a specific data set. Figure 4 shows an example of a simplified relational database as well as its formalized structure. Typically, each table row represents a specific instance of an object, whilst its columns represent a given object's properties [Humbird et al., 2018; Lam et al., 2018, 2017]. We will assume this conventional structure as universal throughout the rest of this paper. As suggested by our use of language, the logic that underlies object-oriented programming has shaped relational data structure and organization.
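For illustration only, such an RDB can be sketched in memory as a collection of tables keyed by name, with one table per object class; the table and column names below are hypothetical and not taken from any data set used in this thesis.

    import pandas as pd

    # A toy relational database: each table groups the instances of one object class,
    # each row is an object instance, and each column is one of its properties.
    rdb = {
        "employees": pd.DataFrame({
            "employee_id": [1, 2, 3],
            "name": ["A. Jansen", "B. de Vries", "C. Bakker"],
            "department_id": [10, 10, 20],   # foreign key: a relational mapping into departments
        }),
        "departments": pd.DataFrame({
            "department_id": [10, 20],
            "name": ["Recruitment", "Payroll"],
        }),
    }

    # The foreign-key column lets us follow the relational mapping between the two tables.
    print(rdb["employees"].merge(rdb["departments"], on="department_id", suffixes=("", "_department")))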

The relationships between RDB objects are fluid in nature; that is, relational databases pose no restrictions on what constitutes a relational mapping beyond those introduced in its definition [Arora and Gupta, 2012; Bachman, 2009; Codd, 1989, 2002]. Let R represent an arbitrary relationship and let o_i and o_j represent two distinct objects from our database that are linked together by R. First of all, R(o_i, o_j) must map objects to the same conceptual space. In other words, objects o_i and o_j must be linked together via the same concept; otherwise, they are not relationally related to one another. Note that: (a) We do not require o_i and o_j to reside in different tables. In the instance that a table references itself, we refer to these relational mappings as recursive [Cvitkovic, 2020]. (b) Relationships can be directed such that R(o_i, o_j) ≠ R(o_j, o_i), or undirected such that R(o_i, o_j) = R(o_j, o_i) [Codd, 1989, 2002]. To further elucidate relationship directionality, consider the following example. If we were to model the concept of family relations in some application, it is clear that the concept of parent is asymmetrical while the concept of sibling is inherently symmetrical. However, R cannot be simultaneously directed and undirected, or symmetrical and asymmetrical. In other words, the only restriction imposed on R is one of consistency, both in terms of its semantics and its properties.

Figure 4: An illustration of an arbitrary relational database (RDB). The left shows a toy RDB example, while the right shows its structure, or schema. Source: [Cvitkovic, 2020]

As seen, object relationships are quite flexible in their definition, which lends to their versatility.


(a) Molecule. (b) Mass-Spring System. (c) Sentence and Parse Tree. (d) Image and Fully-Connected Scene Graph.

Figure 5: Examples of different graph representations. While some graph representations are derived from naturally occurring structures, others are the byproduct of conceptual modeling. Source: [Battaglia et al., 2018].

In essence, relational data contains all the information found in tabular data in addition to relational mappings between various data points [Arik and Pfister, 2019; Codd, 2002]. That is, relational data is numerical, categorical, ordinal, and hierarchical all at once. While this heterogeneity makes machine learning tasks difficult [Cvitkovic, 2020; Humbird et al., 2018; Lam et al., 2018, 2017], it also establishes the broad appeal of relational data. Since its introduction in the 1960s, it has served as the "go-to" method of data storage across all industries and has become a pivotal part of any business's day-to-day functions [Arora and Gupta, 2012]. As a consequence, relational data is application-based in nature — which sets it apart from other popular data types. However, this also introduces additional complications to relational data learning. Namely, authentic relational data is rarely available to the public, since governments and corporate entities tend to guard such data closely. Nonetheless, the motivations behind the behaviors of both groups vary greatly. Businesses withhold relational data because it either (a) offers critical insights about their business that their competitors could use against them, or (b) threatens their clients' trust. In contrast, the release of governmental relational data has the potential to inflict long-lasting damage upon its citizens, both in terms of privacy and of security. For these reasons, publicly available relational data typically comes in the form of outdated corporate data.

To summarize, relational databases model real-world concepts and meanings through an object-oriented interpretation of tabular data. The very properties that promote their popularity also limit their potential applications in machine learning. Paradoxically, despite being extremely ubiquitous, authentic relational data is still difficult to come by without directly partnering with a specific government organization or business.

2.2 Graph Neural Networks

While the learning of graph information is broad and inclusive in the machine learning approaches it considers, graph neural networks (GNNs) still predominate all modern graph learning literature [Scarselli et al., 2008; Zhou et al., 2018]. For the sake of brevity, we will use this term throughout our paper to refer to any graph-based machine learning task. We do so as this body of literature lacks any specific term to concisely refer to the learning of both graph structure and the information that it denotes. In their essence, all GNN variants are connectionist models that retain node neighborhood information regardless of network depth [Zhou et al., 2018]. First introduced by [Scarselli et al., 2008], graph neural networks are the adaptation of convolutional neural network (CNN) [Fukushima, 1988; Krizhevsky et al., 2012] principles to suit an analogous problem, where local connectivity is also heavily emphasized. That is, both computer image pixels and graph nodes are not only defined by their own assigned values but also by those of their immediate neighbors. Over the past decade, the design and implementation of GNNs has undergone rapid development.


(a) 2D Convolution (b) Graph Convolution

Figure 6: 2D Convolution vs. Graph Convolution. (a) The 2D convolution takes the weighted average of the pixel values of the red node along with its neighbors. Pixel neighbors are ordered and fixed in size. (b) The graph convolution operation takes the average value of the node features of the red node along with its neighbors. Node neighborhoods are unordered and variable in size. Source: [Wu et al., 2020].

This rapid development can be understood through the versatility that graph representations offer. Graphs act as mathematical abstractions for modeling the dependencies or interactions within any given system. In this definition, no restrictions are placed upon the properties a supposed system must hold or upon what constitutes a relationship between two system elements. This open-endedness makes graph representations attractive to any scientific field involved in modeling system behaviors and interactions, which in turn propels the need for further GNN model development [Battaglia et al., 2018; Zhou et al., 2018]. Figure 5 demonstrates some graph representation applications across various fields, including several that are non-native to artificial intelligence. In some cases graph structures naturally manifest in physical form, as Figure 5a exemplifies; in other cases, graph structures form through the conceptual modeling of real-world phenomena, as seen in Figures 5b and 5c. Meanwhile, graph structures can also exist somewhere between naturally occurring physical structures and conceptual abstractions in their origins. Figure 5d depicts one such example. With the intuition behind graph representations established, we will now formalize their definition. Let G = (V, E) represent an arbitrary graph, where V refers to a given set of elements while E refers to the dependencies that exist between select element members ∈ V.

The desire to better model graph-like data has been answered with great improvements in GNN representation capabilities, accuracy, and efficiency. Generally, GNN variants are defined by the type of graph input that they accept, which in turn defines their aggregation or propagation functions. As the name suggests, a propagation function mathematically models how a given node signal propagates across G. In their current state, graph neural networks have become so established that researchers are beginning to experiment with their application in non-traditional contexts [Cvitkovic, 2020]; however, GNNs still suffer from a few unsolved problems. While recent breakthroughs have been developed to aid GNN explainability [Ying et al., 2019] and to introduce mechanisms to better model dynamic graph data [Pareja et al., 2020], GNN models are inherently more shallow than most other neural network classes. Any attempt to stack more than three GNN layers results in over-smoothing, where all vertices are pushed to converge to the same value [Zhou et al., 2018]. While [Li et al., 2018, 2015] introduce methodologies to deepen GNN architectures, their depth pales in comparison to the hundreds of layers [He et al., 2016] used in other deep learning applications.

We will now pivot away from a general description of GNNs and towards a detailed description of a GNN variant that we heavily leverage in our relational data representation learning approach. [Kipf and Welling, 2016a] introduce the concept of graph convolutions, which are akin to their pixel convolution counterparts (Figure 6). The spectral graph convolutions originally proposed decompose the connection or adjacency matrix of a given graph in the Fourier space to identify distinct sub-graph structures. This is achieved by approximating the largest eigenvalue λ_max such that λ_max ≈ 2.


(a) Graph Convolutional Network. (b) Hidden layer activations.

Figure 7: The hidden layer activations of a graph convolutional network. (a) A schematic depiction of a multi-layer Graph Convolutional Network (GCN). The graph structure (denoted as black lines) is shared across layers. Y_i denotes node labels. (b) t-SNE visualizations [Maaten and Hinton, 2008] of the hidden layer activations of a two-layer GCN. Adapted from: [Kipf and Welling, 2016b].

That is, the graph filter g_θ is to be convolved with the node signal x such that:

    g_θ ⋆ x ≈ θ ( I_N + D^{-1/2} A D^{-1/2} ) x        (1)

where A represents the graph adjacency matrix and D is the diagonal degree matrix. θ is a matrix of the kernel (filter) parameters meant to be shared over the entirety of G. Figure 7 visualizes how the hidden layer activations of this function identify distinct subgraph structures. In its essence, [Kipf and Welling, 2016a] limit the layer-wise convolution operation to the first order, which alleviates local neighborhood overfitting. In doing so, the authors reap the additional benefit of model linearity within the Fourier domain, which ultimately reduces model complexity. Within the context of our work, we use graph convolutions to enforce node similarities between immediate node neighborhoods, and do not deviate from the signal propagation function that [Kipf and Welling, 2016a] originally proposed.
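For concreteness, Equation (1) can be evaluated directly in numpy as sketched below. The toy adjacency matrix, the signal x, and the value of θ are assumptions made purely for illustration, not values used in this thesis.

    import numpy as np

    # Toy undirected graph on three nodes with a one-dimensional signal x per node.
    A = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
    x = np.array([1.0, 2.0, 3.0])
    theta = 0.5                    # shared filter parameter

    # Equation (1): g_theta * x ≈ theta (I_N + D^{-1/2} A D^{-1/2}) x,
    # where D is the diagonal degree matrix of A.
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    propagated = theta * (np.eye(len(x)) + D_inv_sqrt @ A @ D_inv_sqrt) @ x
    print(propagated)

Each entry of the output mixes a node's own signal with a degree-normalized average of its neighbors' signals, which is exactly the local smoothing behavior the convolution is meant to enforce.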


3 Related Works

The following section reviews recent work done in relational data learning. Specifically, Section 3.1 introduces the conventional paradigm for RDB learning, and surveys various approaches that function within this paradigm [Kanter and Veeramachaneni, 2015; Lam et al., 2017]. Afterwards, Section 3.2 details a tentatively introduced but promising alternative for relational data learning [Cvitkovic, 2020]. Finally, Section 3.3 assesses the suitability of various node representation learning methodologies [Hamilton et al., 2017c].

3.1 Standard Relational Data Learning Approaches

The standard paradigm for relational data learning calls for the conversion of relational data into tabular data. In the process of “flattening” a relational database into a single table, as shown in Figure 8, all relational information is destroyed and false correlations are introduced. However, many practitioners accept these drawbacks as an unfortunate but necessary part of the learning process, since the literature for tabular learning is much more well-established [Cvitkovic, 2020; Kanter and Veeramachaneni, 2015; Lam et al., 2017]. As a consequence, this section will first review popular and state of the art approaches for tabularized data. Afterwards, we will review some relational data specific solutions that are designed to make the tabular conversion process easier.


Figure 8: Classical relational data learning paradigm. Classical relational data learning approaches are tabular-based; meaning, a relational database must first be flattened into a single table.

Classical tree-based models. Once relational data has been flattened, practitioners will often default to tree-based models, as they are capable of learning interpretable global features and can easily be boosted using tree ensembles [Arik and Pfister, 2019]. First introduced in [Breiman et al., 1984], classification and regression trees (CART) have long served as the go-to approach for tree-based modeling. Their simplicity and efficacy bolster their popularity even to this day [Loh, 2011; Safavian and Landgrebe, 1991]; however, they are not without flaws. For starters, decision tree models are unstable in nature, and a single new data point can trigger the recreation and recalculation of all decision nodes. Furthermore, their tendencies towards overfitting and high variance render them unsuitable for learning on large data sets [Loh, 2011]. As a consequence, decision tree performance pales in comparison to the performance offered by their neural counterparts.

Neural Decision Trees. Deep learning's capacity for proper representation learning has compelled many researchers to invent neural alternatives to classical tree-based models [Arik and Pfister, 2019; Humbird et al., 2018; Kontschieder et al., 2015; Wang et al., 2017; Yang et al., 2018]. Of course, some attempts have been more successful than others. The mapping of decision trees to neural networks of [Humbird et al., 2018] yields redundancy and inefficient learning. Meanwhile, [Kontschieder et al., 2015; Wang et al., 2017] use differentiable decision functions to create soft neural decision trees, but in doing so they lose the automatic feature selection needed for tabular data. Adaptive neural trees (ANTs) [Tanno et al., 2018] successfully emulate decision trees whilst enjoying the advantages of deep learning. The model learns when to share or separate data representations to optimize the performance of some end-task.


Table 1: RDBToGraph heuristic as proposed by [Cvitkovic, 2020]. T denotes an arbitrary table present within the database. Superscripts refer to the table name, and subscripts refer to the row-by-column coordinates of a particular table entry. Adapted from: [Cvitkovic, 2020].

This results in learned hierarchies that mostly display clear paths to certain classes or categories of data. However, an interpretable end result is not the same as an interpretable model, and it is unclear what data features compel ANTs to come to certain decisions [Tanno et al., 2018]. That is, even at their best, neural decision trees lack the explainability measures needed to protect against hidden bias.

Attention-based Models for Tabular Data. In this context, Google has introduced its TabNet [Arik and Pfister, 2019], which bypasses the tree-based approach altogether through its use of sequential attention. To put it simply, TabNet tackles many of the problems faced when working with tabular data head on. It provides local interpretability at each decision step, as it uses sequential attention to learn which features to reason with. Its gradient descent-based optimization allows for flexibility in representation learning, which removes the need for any feature preprocessing. In addition to outperforming all of its predecessors on various classification and regression problems, TabNet is the first instance of a self-supervised learning approach designed specifically for tabular data. In summary, TabNet offers drastic improvements in tabular learning; however, we cannot disregard that this architecture was never designed for relational data. Hence, any application of TabNet to relational data would still suffer from many of the shortcomings associated with this learning paradigm.

Solutions from Feature Engineering. One of the critical drawbacks of the tabularization of relational data is the extensive feature engineering required to do so, which bottlenecks the entire learning process [Cvitkovic, 2020]. Hence, the field of feature engineering offers several algorithmic processes to optimize this preprocessing step. At its core, each approach considers the aggregation of a relational database into a single table as an optimization problem, where we must search over all possible combinations [Kanter and Veeramachaneni, 2015; Lam et al., 2018, 2017]. However, approaches vary in the complexity of the feature aggregation methods considered. Deep Feature Synthesis (DFS) [Kanter and Veeramachaneni, 2015] automatically aggregates features from related tables using MAX and SUM functions recursively. While DFS's exploration of all possible combinations makes it highly interpretable, it also causes this solution to scale poorly. The One Button Machine (OneBM) [Lam et al., 2017] builds upon the framework of DFS by considering a wider breadth of feature types and manipulating complex relational graphs to reduce the search space for feature aggregation. Both of the aforementioned approaches still rely heavily on heuristic rules and fail to take into account any "hidden" patterns that may be present. Thus, in the final iteration of DFS, Lam et al. [Lam et al., 2018] replace the MAX and SUM aggregation functions with recurrent neural networks (RNNs). With each improvement, DFS becomes more efficient but less interpretable. That is, feature engineering operates on a clear trade-off between efficiency and interpretability.
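To make this flavor of feature aggregation concrete, the sketch below performs a single, one-level DFS-style aggregation step with pandas. The table names, column names, and the restriction to MAX and SUM are illustrative assumptions rather than the cited implementations.

    import pandas as pd

    # Toy parent and child tables: one customer row relates to many order rows.
    customers = pd.DataFrame({"customer_id": [1, 2], "age": [45, 22]})
    orders = pd.DataFrame({
        "order_id": [10, 11, 12],
        "customer_id": [1, 1, 2],
        "amount": [8.99, 24.95, 10.50],
    })

    # One DFS-style step: aggregate the child table per foreign key with MAX and SUM,
    # then join the aggregates back onto the parent table as new flat features.
    aggregates = orders.groupby("customer_id")["amount"].agg(["max", "sum"]).reset_index()
    aggregates.columns = ["customer_id", "MAX(orders.amount)", "SUM(orders.amount)"]
    flat = customers.merge(aggregates, on="customer_id", how="left")
    print(flat)

Applying such steps recursively over deeper table relationships is what drives both the expressiveness and the poor scaling of this family of approaches.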

3.2 Graphical Neural Network-based Solutions

To summarize, even the best solutions from an ill-fitting paradigm cannot offer satisfactory outcomes. At its best, the tabularization of relational data discards all relational information and forces us to undergo extensive feature engineering. At its worst, we also inadvertently introduce false — and possibly severe — hidden biases into our data, and end up implementing ill-suited models. Naturally, these limitations create the need for alternative approaches towards relational data.


In this section, we will inspect the merits and disadvantages of one such approach, where graph edges act as an analogue for relational information. The RDBToGraph heuristic proposed by [Cvitkovic, 2020] serves as the potential foundation for graph-based relational data learning. In its essence, this heuristic maps different aspects of our raw relational data to their most similar graph components, such that we are free to implement any GNN model we see fit. Table 1 depicts the exact relational-data-to-graph conversions made. Once this graph has been generated, [Cvitkovic, 2020] provides further instructions for subgraph selection, which namely involve selecting all of a given node's ancestors and descendants. Figure 9 shows one such subgraph as well as the original RDBToGraph input.
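As a rough illustration of this subgraph selection step, the sketch below gathers a target node's ancestors and descendants in a directed graph using networkx. The tiny example graph and its node names are assumptions for demonstration and are not the heuristic's actual output.

    import networkx as nx

    # A small directed graph standing in for an RDBToGraph output;
    # edges point from referencing rows to the rows they reference.
    g = nx.DiGraph()
    g.add_edges_from([("order_1", "customer_7"),
                      ("order_1", "product_3"),
                      ("review_9", "order_1")])

    # Subgraph selection: keep the target node plus all of its ancestors and descendants.
    target = "order_1"
    keep = {target} | nx.ancestors(g, target) | nx.descendants(g, target)
    subgraph = g.subgraph(keep)
    print(sorted(subgraph.nodes))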

In essence, RDBToGraph converts relational data learning problems into graph learning problems, where we can rely on already established works to make a given relational data learning task more manageable. The sheer number of models and techniques developed for graph learning greatly exceeds what is present in the tabular data learning literature [Arik and Pfister, 2019; Zhou et al., 2018] and underpins this new paradigm's main appeal. Additionally, RDBToGraph poses three other critical advantages: (a) its deterministic process makes it indisputably explainable, (b) the quest for fully explainable GNN behavior is well underway [Huang et al., 2020; Ying et al., 2019], and (c) graph-based approaches appear to outperform their tabular-based counterparts in standard supervised learning tasks [Cvitkovic, 2020]. Meaning, RDBToGraph affords us the opportunity to obtain better performance while offering the interpretability needed to deploy machine learning to real-world tasks whose implications are worth reckoning with. Of course, the problem of relational data learning is not yet completely solved. Like any newly introduced concept, RDBToGraph manages to address the core issues of the problem it intended to solve, but it fails to consider relational learning under anything less than an ideal set of circumstances.

(a) Relational database input. (b) Outputted graph representation.

Figure 9: RDBToGraph input and output. The RDBToGraph heuristic uses the relations defined in Table 1 to map a relational database into its graphical representation. Due to the immensity of the constructed graph, [Cvitkovic, 2020] also introduce subgraph selection mechanisms. Adapted from: [Cvitkovic, 2020].

In any practical application, concerns about computational expenses and model flexibility arise. More often than not, relational data learning applications only concern themselves with some but not all of the objects from a given database. The graph construction procedure detailed in Table 1 assumes we care equally about all objects involved, and thus can be quite computationally inefficient. In contrast, tabularized representations of relational data explicitly filter out any information that does not directly pertain to the specific problem at hand. Furthermore, the results of RDBToGraph are dubious in instances where node connectivity becomes too sparse (Section A.1). In such cases, meaningful signals cannot be propagated across the nodes analogous to our objects of interest. Thus, these results are no better than those outputted by tabular models. Finally, RDBToGraph does little to consider the ever-changing nature of relational data. That is, the generated graph must be reconstructed upon every relational data object deletion, introduction, or property modification. In the context of a regular business setting, such changes occur periodically on a daily, weekly, or monthly basis. Thus, it is not unreasonable to imagine that these changes could accumulate over a matter of months and cause the current data to lose any resemblance it once held to the original data set.


As such, the practical implementation of relational data learning inescapably calls for periodic model retraining — a consideration RDBToGraph has not addressed.

Nonetheless, the importance of RDBToGraph is derived from its broader implications rather than its implementation details. Simply put, this work establishes the merits of a graph-based approach towards relational data learning — at least in the context of straightforward supervised learning tasks. While the implications of RDBToGraph in more complex supervised or even unsupervised tasks have yet to be explored, no evidence suggests the results obtained under those circumstances would greatly deviate from those [Cvitkovic, 2020] observed. After all, the ability to learn from relational information has already been shown to be notably advantageous. In short, the ideas introduced by RDBToGraph are still well worth exploring, despite its many shortcomings. RDBToGraph denotes the start of a promising new paradigm. However, it is still just a start.

3.3 Node Representation Learning

Now that we have solidified the potential of graph-based representations, we will investigate the field of node representation learning (NRL). This complementary body of research seeks to convert high-dimensional, discrete node representations into dense, low-dimensional vector representations [Idahl et al., 2019], a process depicted in Figure 10. NRL research can be broken into three bodies of distinct work: (a) the adaptation of dimensionality reduction approaches to learn local graph structure, (b) the introduction of GNNs to also consider node attributes, and (c) the balancing of information between local and global graph structures. This section provides a broad overview of the field in order to prepare the reader for the subsequent model selection (Section 4.1).

(a) Graph Input. (b) Outputted Representation.

Figure 10: Learned node representations encode community structure to increase ease of implementation in downstream tasks. [Perozzi et al., 2014] visualize their obtained embeddings in the R^2 space. Adapted from: [Perozzi et al., 2014].

Early NRL: approaches to only learn local graph structure. Originally, node embeddings solely contained information regarding a node's graph position and the structure of its local graph neighborhood. These shallow-embedding techniques fell into two categories: matrix factorization techniques and neural random walks. As the earliest instance of NRL, matrix factorization borrows heavily from the classical dimensionality reduction techniques that inspired it [Belkin and Niyogi, 2002; Kruskal, 1964]. Random walk approaches, the more recent counterpart of matrix factorization, are founded on the intuition that nodes which tend to co-occur on short random walks over a graph should share similar embeddings [Hamilton et al., 2017c]. These approaches are neural in nature and follow an encoder-decoder architecture, as exhibited by the DeepWalk [Perozzi et al., 2014] and node2vec [Grover and Leskovec, 2016] approaches. Regardless, both general approaches suffer from the critical flaws of: (a) an inability to capture node attribute information, and (b) a transductive nature. Meaning, node representations are only generated for the nodes present during the training phase [Hamilton et al., 2017b]. Thus, such approaches are unsuitable for large, evolving graphs where node attributes provide critical information.


Figure 11: Node2vec random walks. Node2vec introduces two different methods for neighborhood exploration: breadth-first search (BFS) and depth-first search (DFS). BFS-like random walks limit neighborhood exploration to a node's direct neighborhood, while DFS-like walks explore further away from the node to capture community structures. Adapted from: [Grover and Leskovec, 2016; Hamilton et al., 2017c].

Approaches to learn node attributes. In contrast, neighborhood aggregation algorithms, such as GraphSAGE [Hamilton et al., 2017a], column networks [Pham et al., 2017], and graph convolutional networks (GCNs) [Kipf and Welling, 2016b], solve the main limitations associated with shallow embeddings. These encoders generate embeddings based on a node's local neighborhood rather than the entire graph, and are often described as convolutional in nature. When initialized, node embeddings are set equal to their respective node attributes. After every following iteration, nodes are assigned new embeddings equal to the sum of their aggregated neighborhood vector and their previous embedding assignment. Larger and larger neighborhoods are aggregated with each new iteration. Since the dimensionality of the resulting embeddings remains fixed, "weighted" neighborhood attributes are further and further compressed into the same-sized low-dimensional node embedding. Although neighborhood aggregators do answer the many problems associated with their predecessors, they are not without their own flaws, as aggregation-based algorithms are designed for undirected graphs and assume global graph structure to be of minimal importance [Hamilton et al., 2017c]. Meaning, neighborhood aggregators may be insufficient in handling complex real-world problems, where either edge directionality or global graph structure conveys critical information.
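The iterative scheme described above can be sketched in a few lines of Python. The code below is a deliberately simplified illustration of one aggregation step (a mean aggregator plus a residual sum), not the exact update rule of GraphSAGE or any other cited model.

    import numpy as np

    def aggregate_step(embeddings, neighbors):
        """One neighborhood aggregation iteration.

        embeddings: dict mapping node id -> current embedding vector.
        neighbors:  dict mapping node id -> list of neighboring node ids.
        """
        updated = {}
        for node, emb in embeddings.items():
            if neighbors.get(node):
                # Mean of the neighbors' current embeddings.
                agg = np.mean([embeddings[n] for n in neighbors[node]], axis=0)
            else:
                agg = np.zeros_like(emb)
            # New embedding = previous embedding + aggregated neighborhood vector.
            updated[node] = emb + agg
        return updated

    # Embeddings are initialized to the node attributes themselves.
    embeddings = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0]), "c": np.array([1.0, 1.0])}
    neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    for _ in range(2):   # each iteration folds in a larger neighborhood
        embeddings = aggregate_step(embeddings, neighbors)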

Approaches to learn structural information. All of the aforementioned approaches oversimplify graph structure by treating it as nothing more than the distances or recursions between nodes. Consequently, distant nodes with similar local structures are incorrectly considered as dissimilar. Ribeiro et al. [Ribeiro et al., 2017] propose struc2vec to assess structural similarity independently of node positions and edge attributes, which makes it possible to capture the structural context of complex hierarchies. Struc2vec generates a series of weighted auxiliary graphs from the original graph, where each auxiliary graph captures structural similarities between nodes within k hops of one another. Afterwards, struc2vec performs biased random walks to feed into the node2vec [Grover and Leskovec, 2016] algorithm. Figure 11 visualizes two of the biased walk variants used. In stark contrast, GraphWave [Donnat et al., 2018] takes a non-neural, Laplacian transformation-based approach towards learning structural information. In essence, this approach uses graph wavelets and heat kernels to compute vectors that correspond to the structural roles of a given set of nodes. By doing so, the resulting vectors implicitly relate to topological quantities; however, to effectively capture structural information, the choice of the scale hyperparameter s must be tuned. In summary, such approaches treat the learning of global structure as a distinct and separate task, and still require previous methods to learn local structure and node feature information. That is, such measures should only be taken when distant nodes have the capacity to exhibit similarity; otherwise, we will only incur unnecessary and meaningless computational expenses.
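To give a flavor of the GraphWave-style computation, the sketch below builds the heat kernel of a toy graph Laplacian and samples the empirical characteristic function of each node's wavelet coefficients. The graph, the scale s, and the sample points t are illustrative assumptions, not settings used elsewhere in this thesis.

    import numpy as np
    import networkx as nx

    g = nx.path_graph(5)
    L = nx.laplacian_matrix(g).toarray().astype(float)

    # Heat kernel of the graph Laplacian: column a holds node a's spectral wavelet.
    s = 1.0
    eigvals, eigvecs = np.linalg.eigh(L)
    heat_kernel = eigvecs @ np.diag(np.exp(-s * eigvals)) @ eigvecs.T

    # GraphWave-style signature: sample the empirical characteristic function
    # of each node's wavelet coefficients at a few points t.
    t = np.linspace(0.0, 2.0, 5)
    signatures = np.stack([
        np.concatenate([
            np.mean(np.cos(np.outer(t, heat_kernel[:, a])), axis=1),
            np.mean(np.sin(np.outer(t, heat_kernel[:, a])), axis=1),
        ])
        for a in range(L.shape[0])
    ])
    print(signatures.shape)   # (num_nodes, 2 * len(t)): one structural signature per node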


4 Relational Data Representation Learning

As seen in the review of related works, methods do exist to conduct relational data learning; however, such approaches are fundamentally flawed as they operate within a limited paradigm. Simply put, approaches that hinge upon tabular representations of relational data are insufficient. In this context, we present a graph-based approach towards generalizable relational data representation learning. The benefits of this approach are: (a) it preserves relational information, (b) it constructs an intermediary representation of our original data without introducing false biases, and (c) it broadens the scope of potential models applicable to relational data well beyond the likes of GNNs and tree-based models. Section 4.1 uses the literature to motivate our chosen methodology and specifies our expected input. Next, Section 4.2 details the procedure taken for graph representation generation. Meanwhile, Section 4.3 presents RDBToEmbedding, the model responsible for embedding generation.

4.1 Model Selection

Oftentimes, relational data learning applications are only concerned with the behaviors or characteristics of one or two object classes from the entire database. Thus, we define our input x as a simplified relational database where we have already identified the set of objects O = {o_1, o_2, . . . , o_N} whose representation we wish to learn. To obtain this simplified relational database, we simply exclude all tables in our RDB that are not within n degrees of separation of O. These target entities should have features that are either important in real-world meaning or in how the assigned feature values relate to one another. We recommend values of n = 1 or n = 2, where this choice depends on the perceived influence of neighboring table objects. Of course, any degree n > 0 can be used; however, the inclusion of any more tables than strictly necessary will only increase computational expenses and complicate any a priori knowledge-based manipulations we may later make.
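A minimal sketch of this table-filtering step is given below, assuming the schema is available as a mapping from each table to the tables it is linked with. The table names and the link map are hypothetical.

    from collections import deque

    # Hypothetical schema links: each table maps to the tables it is linked with (in either direction).
    schema_links = {
        "employees": ["departments", "salaries"],
        "departments": ["employees"],
        "salaries": ["employees", "audit_log"],
        "audit_log": ["salaries"],
    }

    def tables_within_n(schema, target_table, n):
        """Return all tables within n degrees of separation of the target table."""
        keep, frontier = {target_table}, deque([(target_table, 0)])
        while frontier:
            table, depth = frontier.popleft()
            if depth == n:
                continue
            for neighbor in schema.get(table, []):
                if neighbor not in keep:
                    keep.add(neighbor)
                    frontier.append((neighbor, depth + 1))
        return keep

    # The simplified RDB keeps only the target table and tables within n = 1 hop of it.
    print(tables_within_n(schema_links, "employees", n=1))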

Now that we have clearly defined our input x, we must identify the model we intend to implement. As always, we must let the nature of our data inform model selection. We can attribute the complexity of relational data learning tasks to the heterogeneity of the data. If we were to design a novel architecture distinct from any class of well-established models, we would make the risky gamble that this model can somehow properly assimilate highly heterogeneous data well enough to learn meaningful information. To further complicate matters, this supposed model needs to deliver these results in an interpretable manner. Given the general lack of research into such models — both inside and outside of relational data learning — we expect the development of such a model to be a convoluted process with little promise of success. Meaning, the most reasonable approach that we can take requires the creation of some intermediary data representation, which we can then feed into a model to obtain our desired embeddings. As the literature is quite sparse, there are only two possible precursory representations we can consider: graphs or tables. Given the tendency of tabular-based approaches to skew data with false biases and to destroy relational information, a graph-based approach stands as the preferable alternative.

This design choice also provides us with an established representation learning framework for the model development process. Simply put, we can adapt techniques from one of the three node representation learning waves reviewed in Section 3.3 to serve as the basis for our model development. First-wave NRL research disregards node features in the node embedding generation process, which would severely restrict the amount of possible information we can capture about a given system. Alternatively, third-wave NRL research provides post-facto solutions that serve more as a complement to earlier works than a stand-alone set of methods. That is, the final wave of NRL still requires the implementation of earlier works. Furthermore, the need to learn structural information is task-specific. If we were to extend a structural learning approach to applications where global graph structure is trivial, then we would simply muddy the results we worked so hard to obtain. Thus, we seek to fit our approach alongside those of second-wave NRL, and leave the learning of structural information as a potential future work. Second-wave NRL research is founded on the principle of neighborhood aggregators, where we assume similarity to be a function of node distance. The innate interpretability of this general process means that so long as we can carefully control our graph input, we can expect our results to be interpretable — which only deepens this approach's appeal. Thus, in summary, we use a node-aggregating model to achieve relational data representation learning, as illustrated in Figure 12.



Figure 12: We can summarize the framework for relational data representation learning into two steps: (a) the translation of a relational database into a graph representation, and (b) the consequent node representation learning task, where we implement our RDBToEmbedding network.

This approach hinges upon the ability of our constructed graph representation G to retain critical object information. In other words, the ability to craft a meaningful graph representation of our data will serve as a potential bottleneck; however, the work of [Cvitkovic, 2020] suggests that a graph-based representation of O has the capacity to do just that.

4.2 Graph Construction

The name node representation learning implies that our target objects O should be expressed as nodes in graph form. Generally, each RDB table uses some concept to group objects into rows of information. As a result, we will treat tables as node types and rows as nodes. Any reference columns, whether foreign-key or self-referencing, will then act as edges. With the base structure of our constructed graph G affirmed, we will sort the remaining table columns into two groups: (a) ones where meaning is derived from real-world concepts, and (b) ones where meaning is derived from relative value assignments. That is, knowledge of a single entry in such a column has no meaning unless we can compare it to the values assigned elsewhere. All table columns in category (a) will be treated as node features, while the unique values from the columns described in (b) will act as connector nodes and link similar objects together. For the sake of brevity, we will from hereon refer to category (a) columns as feature columns and category (b) columns as comparative columns. In the case that there is not enough prior knowledge to separate the remaining table columns into two groups, we advise the reader to treat all columns as feature columns. Table 2 summarizes our construction of G.

Relational Database                                   Graph
---------------------------------------------         -------------
Table Name, Comparative Columns                       Node Type
Rows, Unique Comparative Values                       Nodes
Non-foreign-key Columns                               Node Features
Column References, Shared Comparative Values          Edges

Table 2: Our construction of graph G. Relational database components are placed side-by-side with their corresponding graph components. Comparative columns refer to any column that contains information where meaning is derived from the following relationships between column entries i and j: i = j, i ≠ j, i ≥ j, and i ≤ j. In cases where no a priori knowledge is available, the concept of comparative columns is rendered null.

We will now review a scenario in which the graph construction detailed above yields sub-optimal results, and then propose the appropriate modifications to bolster our graph representation. Suppose that there exists a small number of unique values in a comparative column. If we were to follow the proposed graph construction of Table 2 closely, then our resulting graph would contain a handful of densely connected nodes that indirectly link significant portions of our database together. This renders any propagated information too noisy to be meaningful, and makes any relative comparisons between column entry values senseless. Therefore, we suggest the introduction of θ, a predefined threshold value.

(20)

Figure 13: The architecture of RDBToEmbedding is based on the principle of graph convolutions, where we assume node connectedness to be an indication of similarity. RDBToEmbedding accepts a graph repre-sentation of relational data and outputs a dense, low-dimensional node embedding. These outputted node embeddings correspond to a set of specific entities from the original relational database. In the outputted dimension space, similar objects are grouped together.

cthat fulfills the condition unique(c) ≤ θ should be treated as a feature column. By encoding its infor-mation as a node attribute, we still maintain the ability to learn any relevant inforinfor-mation that column c conveys.
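A minimal sketch of the unique(c) ≤ θ heuristic described before Figure 13, assuming the tables are available as pandas DataFrames and that θ is chosen by the user; the default threshold is an illustrative assumption.

import pandas as pd

def sort_columns(table: pd.DataFrame, candidate_columns, theta: int = 50):
    """Split candidate comparative columns using the unique(c) <= theta rule."""
    feature_columns, comparative_columns = [], []
    for column in candidate_columns:
        if table[column].nunique() <= theta:
            # Too few unique values: the resulting connector nodes would be so
            # densely connected that they only add noise, so the column is
            # kept as a node feature instead.
            feature_columns.append(column)
        else:
            comparative_columns.append(column)
    return feature_columns, comparative_columns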

Our deterministic generation of G renders the approach taken thus far explainable. Even though it is not possible to visualize G in its entirety, due to the number of nodes it contains, subgraph visualizations do greatly substantiate our understanding of G and of how downstream models will receive it as input. To obtain such visualizations, we first randomly sample from our set of target entities {o_k} ∈ O. For each o_k sampled, we visualize all the nodes that lie within m = 2, 3 degrees of separation. Any larger m values will lead to indiscernible subgraph structures, which offer little interpretability. Once we have secured our understanding of G, the interpretability of our approach is limited only by the aggregation model we choose to implement. Of course, in cases where explainability is of no concern, the reader is free to disregard the generation of subgraph visualizations and simply commence with the embedding extraction process.
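Such subgraph previews can be generated along the lines of the sketch below, assuming G was built with networkx and that target entities carry a node_type attribute; the sampling and plotting choices are illustrative.

import random
import networkx as nx
import matplotlib.pyplot as plt

def preview_neighborhood(G: nx.Graph, target_type: str = "user", m: int = 2):
    # Randomly sample one target entity and keep every node within m degrees
    # of separation from it.
    targets = [n for n, data in G.nodes(data=True) if data.get("node_type") == target_type]
    center = random.choice(targets)
    subgraph = nx.ego_graph(G, center, radius=m)
    nx.draw(subgraph, with_labels=True, node_size=300)
    plt.show()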

4.3 From Graph to Embedding

Now, we must reformulate the vertices (nodes) V and edges E present in G = (V, E) such that they are discernible to a GNN model. While the reformulation of E is a straightforward task, the same cannot be said about V. In particular, we must encode the node features that define v_i ∈ V. Using node attributes to initialize node embeddings solves this problem, whilst furthering the explainability of our approach. Specifically, we define our initialized embeddings matrix B as a |V| × L matrix, where each embedding b_i corresponds to a specific object from our simplified database. L denotes the number of unique feature states across all node types. We define a feature state f as either a numerical feature or a unique categorical feature value. This initialization of b_i is formalized as

∀ b_i ∈ B : b_i = [f_1, f_2, . . . , f_L],   B ∈ R^(|V|×L)    (2)

where each possible feature state corresponds to a single dimension in our initialized embeddings. All categorical feature states are encoded as binary values, while each numerical feature state is assigned its respective numerical value. The numerical feature dimensions are then normalized to have a mean of zero and a standard deviation of one. Connector nodes, i.e. nodes ∈ V that correspond to objects ∉ O, are expected to have sparser embedding initializations than those of their {o_k} ∈ O counterparts.
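A minimal sketch of this initialization for a single node type, assuming its features sit in a pandas DataFrame and the split into numerical and categorical columns is known; the full |V| × L matrix would align these feature-state dimensions across all node types.

import numpy as np
import pandas as pd

def initialise_embeddings(features: pd.DataFrame, numerical_cols, categorical_cols):
    # Categorical feature states become binary (one-hot) dimensions.
    categorical = pd.get_dummies(features[categorical_cols].astype(str))
    # Numerical feature states keep their value, normalised to zero mean and unit variance.
    numerical = features[numerical_cols].astype(float)
    numerical = (numerical - numerical.mean()) / (numerical.std() + 1e-8)
    B = pd.concat([numerical, categorical], axis=1).fillna(0.0)
    return B.to_numpy(dtype=np.float32)  # one row per node of this type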

Our RDBToEmbedding model, shown in Figure 13, then expands our newly initialized embeddings into higher dimensional states. In these higher dimensional spaces, we perform graph convolutions [Kipf and Welling, 2016a], which take after image convolutions in their theory and function. Afterwards, we project our learned embeddings back into a lower dimension. This final set of projections forces our outputted embeddings to succinctly represent O. By restricting our model architecture to linearities, non-linearities, residual connections, and graph convolutions, each outputted node embedding is conceptually equal to a weighted average between the latent representation of its node features and those of its neighbors. The residual connections used are critical in meeting this objective, since they protect against the signal decay of node feature information. Without these residual connections, we risk overwriting node feature information with edge information during the training of our model. As a relational database consists of so much more than just its relational mappings, these connections ensure that the obtained embeddings reflect the entirety of O rather than just one aspect. Simply put, residual connections guarantee the comprehensiveness of our learned representation. To summarize, our chosen design of RDBToEmbedding (a) uses graph convolutions to follow the fundamental concept of node aggregation, (b) adapts the principles of second-wave NRL to suit relational data representation learning, and (c) possesses an inherent degree of explainability.
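A minimal sketch of such an encoder in PyTorch, assuming a dense, symmetrically normalized adjacency matrix and illustrative layer sizes; it is meant to show the expand-convolve-project pattern with residual connections, not the exact RDBToEmbedding implementation.

import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, a_hat):
        # Each node aggregates (a weighted average of) its neighbours' latent features.
        return torch.relu(self.linear(a_hat @ x))

class RDBToEmbeddingSketch(nn.Module):
    def __init__(self, in_dim, hidden_dim=256, out_dim=64):
        super().__init__()
        self.expand = nn.Linear(in_dim, hidden_dim)
        self.conv1 = GraphConv(hidden_dim, hidden_dim)
        self.conv2 = GraphConv(hidden_dim, hidden_dim)
        self.project = nn.Linear(hidden_dim, out_dim)

    def forward(self, b, a_hat):
        h = torch.relu(self.expand(b))   # expand the initialized embeddings B
        h = h + self.conv1(h, a_hat)     # residual connection preserves node feature signal
        h = h + self.conv2(h, a_hat)
        return self.project(h)           # low-dimensional output embeddings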

The ability of RDBToEmbedding to learn meaningful relational data representations depends on: (a) our ability to condense all information pertaining to {o_k} ∈ O into G, and (b) the crafting of a suitable objective function L. Since we have already addressed graph construction in Section 4.2, we will now motivate and discuss the design of L. At its core, our construction of G does nothing more than represent the relationship mappings of {o_k} ∈ O as edges and their properties as node features. Thus, if we want to obtain a model that understands the real-world concepts behind {o_k} ∈ O, we simply need to train it on its ability to use B′ to reconstruct edges and node features. Hence, we can generalize L as a composite of losses, where

L = L_node feature recon + L_edge recon    (3)

Node feature reconstruction. As we are only interested in the embeddings generated for {o_k} ∈ O, we will restrict feature reconstruction to its corresponding set of nodes. Generally, we map B′ into R^(|V|×L) such that each feature dimension of the outputted embedding coincides with those of our initialized inputs. From here on, numerical features f_n and categorical node features f_c are reconstructed independently of one another. We extract all numerical feature predictions ŷ_n and compare them to their ground truths y_n through the use of Mean Square Error (MSE) loss. Similarly, we use B′ to infer category predictions. The obtained prediction ŷ_c is then compared to its ground truth y_c using Cross-entropy Error H(ŷ_c, y_c). As such, we can summarize the node feature reconstruction portion of our objective function as

L_node feature recon = (1 / |O|) Σ_{o_k ∈ O} [ Σ_{f_n} MSE(ŷ_n, y_n) + Σ_{f_c} H(ŷ_c, y_c) ]    (4)
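A minimal sketch of Equation 4 in PyTorch, assuming the decoded predictions have already been split into numerical outputs and per-column categorical logits; names and shapes are illustrative.

import torch
import torch.nn.functional as F

def node_feature_recon_loss(num_pred, num_true, cat_logits, cat_true):
    # Numerical feature states: mean squared error against their ground truths.
    numerical_loss = F.mse_loss(num_pred, num_true)
    # Categorical feature states: cross-entropy per categorical column.
    categorical_loss = sum(F.cross_entropy(logits, target)
                           for logits, target in zip(cat_logits, cat_true))
    return numerical_loss + categorical_loss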

Edge reconstruction. Meanwhile, for the task of edge reconstruction, we draw inspiration from the loss function proposed in node2vec [Grover and Leskovec, 2016]. Let the vertices v_j, v_k ∈ V correspond to objects o_j, o_k ∈ O such that j ≠ k. Target entity edges refer to any edge pair where (v_j, v_k) ∈ E. Non-target entity edges refer to any edge pair in E that does not meet this definition. We reconstruct these edges separately due to a disparity in node connectivity. Given our idea of using selected object features as nodes to reinforce similarity, our connector nodes tend to be the most densely connected nodes in G, possibly even by orders of magnitude. In these cases, the use of a single loss function would push our model to disregard all edges that map our connector nodes to the nodes of {o_k} ∈ O. In other words, our model would fail to learn that objects in O share certain features and, therefore, must be similar. In consequence, B′ would overlook critical portions of the database and be an incomplete representation. If we wish to maximize the amount of information we can represent from our original input x, we must express L_edge recon as shown in Equation 5.

L_edge recon = L_target-entity edge + L_non-target-entity edge    (5)

Our target entity edge reconstruction holds true to its node2vec origin; however, we forgo iterating over the entire graph in favor of the positive and negative sampling seen in word2vec [Mikolov et al., 2013].


Figure 14: Non-target entity edge reconstruction. Our constructed graph representation G is sampled selectively for node pairs, where each node corresponds to an entity we wish to learn the representation of. We define a positive sample to be the case in which both vertices share a connection to the same type t node v_t, and a negative sample to be the instance where both nodes lack a common v_t connection. A simple multilayer perceptron is trained to identify whether two target entity nodes v_j and v_k are indirectly linked through v_t. This process forces RDBToEmbedding to learn object relationships that it may otherwise neglect.

In doing so, we avoid computational inefficiencies while maintaining the utility of the originally proposed node2vec loss function. We use a random walk within a given node neighborhood to obtain our positive sample, and random sampling of V to acquire our set of negative samples. Through the loss function illustrated in Equation 6, we push each node embedding to be similar to its neighbors and to be distinct from the node embeddings found in other regions of G. We achieve this objective by rewarding the model whenever the target node's embedding v_i resembles its positive sample v_j, and penalizing the model whenever v_i is similar to its negative sample subset {v_k} ∈ V of length N. We suppose that the embeddings b_i, b_j, and {b_k}_{n=1}^N correspond to the vertices v_i, v_j, and {v_k}_{n=1}^N respectively. The sigmoid function σ maps the inner product between b_i and its sample to the domain of (0, 1). Without a positive bias of m, the dot product between random (i.e., highly dissimilar) vector pairs will return a loss of zero, and our model will not receive any learning signal. This signal loss is an undesirable but inevitable side effect of the random sampling we used to increase computational efficiency, rather than some fundamental flaw in the combination of node2vec and word sampling concepts. As such, we restrict m > 0.

L_target-entity edge = log σ(b_i^T · b_j + m) + Σ_{n=1}^{N} E_{k ∼ P_neg(i)} [ log(1 − σ(b_i^T · b_k + m)) ],   m > 0    (6)
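A minimal sketch of Equation 6 in PyTorch, assuming the negative-sample term is read as log(1 − σ(·)) and that the objective is negated so that it can be minimized; variable names and the default margin are illustrative.

import torch
import torch.nn.functional as F

def target_entity_edge_loss(b_i, b_pos, b_neg, margin=1.0):
    # b_i, b_pos: (d,) embeddings; b_neg: (N, d) randomly sampled negatives.
    pos_score = F.logsigmoid(b_i @ b_pos + margin)            # reward similarity to the positive sample
    # log(1 - sigmoid(x)) == logsigmoid(-x), averaged over the N negatives.
    neg_score = F.logsigmoid(-(b_neg @ b_i + margin)).mean()  # penalize similarity to negatives
    return -(pos_score + neg_score)                           # negate so the objective is minimized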

Non-target entity edge reconstruction is achieved similarly (Figure 14). Let v_t ∈ V be an arbitrary vertex of node type t whose analogous object ∉ O. For every non-target entity node type in our graph, we randomly sample objects o_j, o_k ∈ O and train a simple multilayer perceptron (MLP) [Rosenblatt, 1957] to predict whether or not they share an edge with the same vertex v_t. Since this is a binary classification task, we use Binary Cross Entropy (BCE) loss. The benefits of our non-target entity edge reconstruction are two-fold. First, we push information propagation through our connector nodes such that these similarities are encoded into our obtained node embeddings. Secondly, our model now focuses solely on the similarities bestowed upon the objects o_j, o_k rather than on what real-world meanings their shared values may carry. While this distinction between shared similarities and their underlying meanings is subtle, their implementations vastly differ. Consider an e-commerce platform interested in forecasting what its customers will buy in the summer months. Customers from heat-wave-prone areas will have different purchasing patterns than those from cooler climates. From the perspective of this problem, we do not necessarily care about the concepts and attributes behind the locations “Scandinavia” and “the Mediterranean”. Rather, we assume explicitly linking customers by area will not only convey their climate-based spending habits but also allow the model to implicitly learn that “Spain” and “Portugal” are somehow similar. Alternatively, if we were to treat location as just another customer feature, then our model may eventually learn this information. However, it would take our model considerably longer to do so, and it may not be appropriately emphasized in B′. In short, a suitable emphasis on similarity enables the model to learn the information it needs, and to do so in a timely manner.
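A minimal sketch of this non-target-entity edge classifier, assuming pairs of target-entity embeddings are labelled by whether they share a connector node of the given type; the MLP width is an illustrative choice.

import torch
import torch.nn as nn

class SharedConnectorClassifier(nn.Module):
    def __init__(self, emb_dim, hidden_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, b_j, b_k):
        # Concatenate the two target-entity embeddings and score the pair.
        return self.mlp(torch.cat([b_j, b_k], dim=-1)).squeeze(-1)

def non_target_edge_loss(classifier, b_j, b_k, shares_connector):
    # shares_connector: 1 if both nodes link to the same connector node v_t, else 0.
    logits = classifier(b_j, b_k)
    return nn.functional.binary_cross_entropy_with_logits(logits, shares_connector.float())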

Model Adaptability. Nevertheless, computational efficiency remains inconsequential so long as the slightest change in x forces us to reiterate over our entire approach. Unlike its computer vision or natural language data type counterparts, relational data is routinely updated. Consider our previously discussed e-commerce example. We can reasonably estimate that this hypothetical online shop updates its database on a minute-by-minute basis to reflect the new orders that its customers continuously place. Other industries may have slower rates of information flow, but these continual modifications of x still, nonetheless, happen. For instance, a human resources department is likely to update its database monthly to reflect the monthly changes in its workforce. In fact, it is very difficult to find a relational data application where the updating of x is so infrequent that it is inconsequential. In other words, there are very few applications where it is acceptable for B′ to be regenerated as though all of its objects were previously unseen. Thus, our representation learning process must be adaptable. Without adaptability mechanisms to account for these data updates, our process becomes an interesting piece of theory rather than a piece of work that is applicable in practice.

In short, we do not have to continuously reconstruct G and retrain RDBToEmbedding on all of x for every small change made to our data. The common relational database changes are: (a) object deletion, (b) object introduction, and (c) object property modification. To account for object deletions, we simply remove the deleted object's corresponding embedding b_i from B. Similarly, we account for the introduction of new objects by introducing {b_k} ∈ B during embedding initialization. Afterwards, we can selectively train RDBToEmbedding to map these new embeddings alongside their closest preexisting embeddings. This generated latent space does not significantly differ from the spaces created by a full repetition of our process. As seen, no aforementioned case requires G to be reconstructed. Thus, after the initial generation of B′, G serves more as a tool for interpretability than as a necessary preprocessing step. Similarly, our reinitialization of B can be adapted for cases of table deletion and introduction. However, in cases of more extreme changes, we strongly advise the reader to redo graph construction and subgraph visualizations, since these steps ensure the embeddings obtained are still understandable and are safe for more sensitive applications. In summary, the adaptable nature of our approach addresses the dynamic nature of our data, which in turn makes our procedure implementation-friendly.
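As a rough sketch of the update handling above, assuming the learned embeddings are kept in a dictionary keyed by object id and that newly added rows have already been initialized as in Equation 2; the function names are illustrative.

def apply_database_update(embeddings: dict, deleted_ids, new_embeddings: dict):
    # (a) Object deletion: simply drop the corresponding embedding b_i from B.
    for object_id in deleted_ids:
        embeddings.pop(object_id, None)
    # (b) Object introduction: add the freshly initialized embeddings; they are
    #     then selectively trained so that they settle next to their closest
    #     pre-existing neighbours, without rebuilding G or retraining from scratch.
    embeddings.update(new_embeddings)
    return embeddings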
