TRACK: MACHINE LEARNING

MASTER THESIS

Role and Community Classification in Large Networks:
A Graph Neural Network Approach

by

JENS DUDINK (11421479)

September 30, 2019
36 EC, April 2019 - September 2019

Supervisor: Prof. Dr. Ing. Zeno Geradts
Daily Supervisor: MSc Cees Schouten
Assessor: MSc Nour Hussein

We present a novel approach to convolutional neural networks on large graphs, making use of Monte Carlo sampling and skip-connections to overcome issues related to estimation on large graphs. We provide a framework based on the concepts of community and regular equivalence to motivate the neural architecture. We focus on the distinction between community classification and role identification, and show how the Jumping Knowledge Network design can be extended to improve performance on community classification. We briefly highlight the relevance of equivalence concepts to role identification. Our experiments on citation networks, the Reddit social network dataset, and a proprietary social network dataset show that our architecture can perform equally well as related methods on larger graphs, while maintaining its advantage in speed of inference.

1 INTRODUCTION

Graphs and networks have become commonplace both inside the academic world and in the real world, often as representations of relationships between entities. As a consequence, many complex relationships are modeled as graphs. Some examples of graphs are social networks, knowledge graphs, and citation networks. Over the past decades we have seen a tremendous increase in computing power, and hence also in the use of computers to model the real world. And so, with the increasing relevance of graphs as a data structure alongside the stellar growth of neural methods, a thorough exploration of the application of neural methods to graphs is more than warranted.

Since the publication of a landmark paper by Kipf & Welling [20], the field of graph neural networks has seen explosive growth and a large number of original contributions (see, e.g., [35], [15], [8], [13], [28], [24]). The inner workings of many of these neural networks can be roughly divided into two parts: (i) the generation of informative node representations based on node-specific information and location in the network, and (ii) the design of a neural network architecture that allows for effective use of these node representations. Correspondingly, there exist neural network models that only perform the former or the latter function.

The key mechanism behind the success of graph neural networks is that the local environment of a node - its neighboring nodes - provides information about the node itself. This insight is reflected in the convolution operator on graphs [20]. A graph convolution calculates the hidden representation of a node by taking the weighted average of its neighboring nodes. The graph convolutional network sets a hidden layer to the aggregation of neighboring node information, followed by a non-linearity. The consequence is that with every convolution performed by the network, a larger neighborhood is aggregated for information. This neighborhood aggregation mechanism is central to the interpretation of graph convolutional networks.

Despite its many successes over the past few years, the field of graph neural networks still has a number of open problems. First, there is no coherent theoretical framework that provides guarantees relating the efficacy of a graph neural network architecture to graph structure. The field of network analysis, on the other hand, has plenty of theory that links graph structure and algorithms. Second, as a result of working memory constraints during the training phase, there has been relatively little progress on the application of graph neural networks to very large graphs. As such, most progress has been made on smaller graphs, leading to the exclusion of some interesting problems. Social networks in particular, which have seen a meteoric rise in popularity over the last decade, have been largely excluded from the literature as a result of the focus on smaller graphs. And while social network analysis has historically been the field investigating these networks, it could greatly benefit from a synthesis with machine learning.

In this paper we present a novel method to tackle the two central problems of graph size and theoretical grounding. In this process, we develop a loose framework of node classification for neural networks based on ideas in network analysis. On the methodological side, we show how a combination of FastGCN and Jumping Knowledge Networks outperforms other networks on large datasets. And on the theoretical side, we emphasize the distinction between community classification and role classification, two concepts from the field of network analysis, and show how these tasks relate to neural network architecture. In particular, we elaborate on the definitions of community and regular equivalence. We prove how a Jumping Knowledge Network can express the delta operator as presented in Mixhop [1], showing how their results can be extended to an architecture that can accommodate large graphs. Regular equivalence, or role equivalence, in particular has not yet seen a treatment in the field of neural networks, and deserves more attention considering the practical applications of the concept.

A second contribution we make in this paper is the application of our novel approach to a proprietary social network dataset that is markedly different from the datasets commonly used for testing performance in the graph neural network literature. Whereas many contemporary social network datasets have a bias towards the exchange of information or the maintenance of social ties (see, e.g., Reddit or Facebook), the actors in our dataset have a considerable professional interest in the social interactions that take place within the network. The dataset consists of chat data that contains information exchanged by dealers and producers of illicit goods to set up deals and coordinate production. Because of this particular structure in the data, we are interested both in community classification and in role classification for this dataset. The additional focus on role classification for a large dataset, as opposed to community detection, is not found in other publications in the literature. A consequence of these differences in social network structure is that the resulting network topology differs from that of other social networks.

The contents of this paper are structured as follows. First, we discuss related work relevant to our research in section 2, emphasizing neural methods and theory from the field of network analysis. Second, we provide a theoretical foundation for our contribution in section 3, where we prove how an adaptation of the Jumping Knowledge Network can represent the delta operator. Third, we describe the different datasets and their unique properties in section 5.1, expand on our experimental setup in section 5.2, and show how these datasets are used to evaluate the proposed methodology in section 6. Fourth, we provide an extensive overview of the performance of the method on the different datasets. Finally, we conclude by providing a brief summary of our contribution and results, discuss some pros and cons of the implemented methodology, and provide different paths for future research to explore.

2 RELATED WORK

In this section we provide a chronological overview of the work most intimately related to the present work. We briefly touch upon classical, non-neural network methods for node classification, and subsequently continue with the graph neural network [20] and some of the neural network architectures that followed over the past few years. A central theme in the literature on graph neural networks is the notion of node representations. Similar to, for example, GloVe word embeddings [27], node representations are vector representations of nodes within a network. Most of the graph neural network literature is concerned with finding a novel way to efficiently or effectively generate these node representations. Finally, we elaborate on the sampling method we employ in our own architecture.

2.1 Classical Methods

We restrict the discussion of classical node classification algorithms to those related to graph kernels, and the Weisfeiler-Lehman graph isomorphism test in particular, starting with the latter.

Inspired by Weisfeiler and Lehman [36] and their test for graph isomorphism, Shervashidze et al. [31] construct a graph kernel that allows fast and effective node classification. Before the advent of neural network-based methods, the Weisfeiler-Lehman kernel achieved state of the art performance on a number of benchmark tests. Optimal assignment graph kernels [12] are another method for computing a similarity measure for graphs. As opposed to the Weisfeiler-Lehman kernel, but similar to neural methods, this kernel is based on the aggregation of neighborhood information. Another kernel method is the marginalized graph kernel [18]. This kernel is based on random walks on a graph, thereby finding the stationary distribution of the graph, and can be solved efficiently using linear programming. This method can be compared to GraphSAGE (discussed below), which uses random walks on graphs to generate samples for hidden node representations. Finally, we have the graphlet kernel [32]. The graphlet kernel works by counting subgraphs, and is aimed at faster inference for large graphs. An early example of sampling for inference on graphs, the graphlet kernel paper provides schemes to improve performance by means of node sampling.
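
To make the Weisfeiler-Lehman relabeling idea underlying this kernel concrete, the following is a minimal sketch (in Python; the adjacency-list representation and the function name are illustrative, not part of any library) of one-dimensional color refinement. The actual kernel additionally compares the resulting label histograms between graphs.

```python
from collections import Counter

def wl_color_refinement(adj_list, num_iterations=3):
    """One-dimensional Weisfeiler-Lehman (color refinement): repeatedly relabel
    each node by the pair of its own label and the sorted multiset of its
    neighbors' labels, then compress each distinct pair into a new integer."""
    labels = {v: 0 for v in adj_list}              # start with a uniform label
    for _ in range(num_iterations):
        signatures = {
            v: (labels[v], tuple(sorted(labels[w] for w in adj_list[v])))
            for v in adj_list
        }
        new_label = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        labels = {v: new_label[signatures[v]] for v in adj_list}
    return labels, Counter(labels.values())

# Toy usage: a path graph 0-1-2-3; the two end nodes receive the same label.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(wl_color_refinement(adj))
```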

If we look at the themes in the above-mentioned graph kernel methods, we see methodology based on graph isomorphism, neighborhood aggregation, sampling, and methods specialized for larger graphs. These are recurring themes and are found in the graph neural network literature as well. It is because of this relationship that we reemphasize their importance.

2.2 Graph Neural Networks

Graph neural networks are neural networks that rely on some message-passing or aggregation scheme to accumulate information on the neighborhood of a node in a graph. In this section we give an overview of the field of graph neural networks, along with those academic publications deemed most relevant to the purpose of this thesis. We elaborate on the Graph Convolutional Network, as well as attention models on graphs, sampling methods, and two models related to network analysis.

2.2.1 Graph Convolutional Network. The graph convolutional network (GCN) was introduced in 2016. The neural network was aimed at providing a solution to the problem of node classification in a graph that does not contain labels for all its nodes, a special case of semi-supervised learning. It solves the problem by creating representations of a node by aggregating its surrounding nodes, an approach inspired by convolutional neural networks [21]. In similar fashion to images, where each pixel has a neighboring pixel, most nodes in a graph have a neighboring node. A single convolution operation in a GCN thus takes the average over neighboring nodes, as is shown below.

H^{l+1} = \sigma(\hat{A} H^l W^l)   (1)

In this equation the matrix of hidden layer node representations is denoted H, with the superscript indicating the index of the hidden layer. \hat{A} is the normalized adjacency matrix of the graph (referred to throughout this paper as the graph Laplacian) and remains fixed for every hidden layer computation, while W^l are the neural network weights for layer l. As can be seen, the hidden representations for layer l + 1 depend on the representations for layer l. An intuitive description of the graph convolution operation can be provided as well. The hidden layer representation of a node is a weighted average of the representations of its neighboring nodes (with weights given by \hat{A}), transformed by the learned matrix W^l. In this manner, the representation of a node is determined by itself and its surrounding nodes, with every additional hidden layer increasing the neighborhood that is taken into account. In the rest of the paper we will frequently switch perspective between this intuitive idea and its mathematical representation.
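
As an illustration, the following is a minimal sketch of a single graph convolution as in equation 1, written in plain numpy; the function and variable names are illustrative rather than part of any library.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetrically normalized adjacency matrix with self-loops, the fixed
    propagation matrix A_hat used in equation (1)."""
    A_tilde = A + np.eye(A.shape[0])              # add self-loops
    d = A_tilde.sum(axis=1)                       # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(A_hat, H, W):
    """One graph convolution: aggregate neighbor representations, apply a
    learned linear transform, and pass the result through a ReLU."""
    return np.maximum(0, A_hat @ H @ W)

# Toy example: 4 nodes on a path, 3 input features, 2 hidden units.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H0 = np.random.randn(4, 3)          # initial node features X
W0 = np.random.randn(3, 2)          # layer weights
H1 = gcn_layer(normalized_adjacency(A), H0, W0)
print(H1.shape)                     # (4, 2)
```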

The GCN does not scale to larger graphs, because the calculation of the hidden layer representation requires the vector representation of all neighboring nodes during the forward pass. Consequently, the larger the graph is, the more working memory is required to calculate hidden representations. This process does not scale well with the size of the graph.

2.2.2 Graph Attention Network. An interesting addition to the GCN design is the graph attention network [35]. This network architecture employs self-attention to generate higher quality node embeddings than a regular GCN, and outperforms the GCN on the usual baseline datasets. The basic distinction between the GCN and the graph attention network is displayed mathematically in the equation below.

H^{l+1} = \sigma\left( \sum_{j \in N_i} \alpha_{ij} H^l W^l \right)   (2)

This equation is similar to the one above, but instead of taking matrix \hat{A}, we use attention coefficients, denoted by \alpha and defined below.

\alpha_{ij} = \frac{\exp(\sigma(a^T [W h_i \,\|\, W h_j]))}{\sum_{k \in N_i} \exp(\sigma(a^T [W h_i \,\|\, W h_k]))}   (3)

In the above equation it becomes clear how \alpha differs from \hat{A}. Whereas a GCN shares the weights given to neighboring nodes, attention allows every individual node to have a different distribution over its neighboring nodes. The efficacy of this method is reflected in its superior results [35]. The graph attention network suffers from working memory problems to a similar degree as the GCN.

2.2.3 GraphSAGE. GraphSAGE [15] is the first paper to use supervised learning alongside sampling methods to overcome working memory issues related to graph size. Similar to unsupervised methods such as DeepWalk [28] or Node2Vec [13], the authors use a random walk on a graph to accumulate information about the neighborhood of a node. Different from these methods, however, GraphSAGE introduces different aggregation methods. Using mean aggregation, LSTM aggregation and max pooling, the authors provide different options for users to aggregate information about node neighborhoods. The central idea behind the approach posited by the authors is that as the number of random walks, or samples, on the graph increases, the Markov chain generated by the random walk converges to a stationary distribution [2] (this is a well-known result for graphs), and thus to the correct convolutional operator given the dataset.

A key benefit of GraphSAGE is inference on large graphs. Whereas the methods described up to this point run into problems with working memory, GraphSAGE uses samples of random walks to construct node representations, thereby alleviating the working memory problem.
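
As a rough illustration of the mean-aggregation variant, consider the following hedged sketch (plain numpy; this is not the authors' implementation, and the helper names are invented for this example).

```python
import random
import numpy as np

def sample_neighbors(adj_list, node, num_samples):
    """Draw a fixed-size neighbor sample, reusing neighbors when a node has
    fewer neighbors than requested; isolated nodes fall back to themselves."""
    neighbors = adj_list[node]
    if not neighbors:
        return [node]
    if len(neighbors) >= num_samples:
        return random.sample(neighbors, num_samples)
    return [random.choice(neighbors) for _ in range(num_samples)]

def sage_mean_layer(adj_list, H, W_self, W_neigh, num_samples=25):
    """GraphSAGE-style layer: combine each node's own representation with the
    mean of a sampled set of neighbor representations, then apply a ReLU."""
    out = []
    for v in range(len(adj_list)):
        sampled = sample_neighbors(adj_list, v, num_samples)
        h_neigh = H[sampled].mean(axis=0)
        out.append(np.maximum(0, H[v] @ W_self + h_neigh @ W_neigh))
    return np.stack(out)
```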

2.2.4 FastGCN. Similarly to GraphSAGE, FastGCN [6] aims to tackle the issue of inference on large graphs. Instead of approaching the issue by sampling neighboring nodes, and thereby aggregation over the neighborhood, the authors take the approach of sampling random vertices in the graph. At first glance, such an approach might result in losing relevant information. Taking a more nuanced view, however, reveals the benefits of this approach.

Stochastic gradient descent relies on the calculation of a gradient with respect to a loss term, based on the average loss of n independent and identically distributed samples. If one samples from neighborhood nodes, however, the samples are no longer independent and identically distributed [6], as nodes are often correlated with their neighbors. To alleviate this issue, as mentioned, the authors choose to sample graph nodes, instead of neighboring nodes. The resulting Monte Carlo framework [6] is described below.

\tilde{h}^{l+1}(v) = \int \hat{A}(v, u)\, h^l(u)\, W^l \, dP(u)   (4)

Here the graph convolution is translated into the Monte Carlo framework. If we interpret our network consisting of nodes and edges as a sample from a (possibly infinite) graph, we can then integrate over its nodes. As such, the nodes in the graph are denoted by u in the equation above. Similarly to the GCN equation, we have the graph Laplacian \hat{A}, the weights of the hidden layers W^l, and the hidden node representations h^l.

The FastGCN architecture derives its name from its performance. Because this architecture samples graph nodes, instead of neighbor nodes, it does not require the same amount of memory usage as GraphSAGE, nor does it require the same amount of computation as random walks during training time. As such, it is an order of magnitude faster than GraphSAGE without sacrificing performance. We use this model as a point of departure for our own architecture.

2.2.5 Simplified Graph Convolution. The simplified graph convolution (SGC) [37] breaks with the trend of increasing complexity by introducing a generalized linear model instead of a neural network. The SGC model works by taking the graph Laplacian matrix, raising it to the K-th power, and multiplying the result by the feature matrix and a learned weight matrix. Finally, this product is fed into a softmax function. The equation for the SGC is shown below.

\hat{Y} = \mathrm{softmax}(\hat{A}^K X W)   (5)

The above equation exemplifies the essence of graph convolutions; the entries of the K-th power of the graph Laplacian correspond to all walks of length K between nodes. As such, taking the K-th power of matrix \hat{A} accumulates information from all nodes in the K-hop neighborhood of a node. Besides the softmax function, no non-linearities enter the equation, and consequently, universal function approximation does not apply to the SGC.
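
For concreteness, a minimal sketch of the SGC forward pass is shown below (plain numpy; the \hat{A} normalization helper from the GCN sketch above could be reused, and the function names are illustrative).

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def sgc_forward(A_hat, X, W, K=2):
    """Simplified graph convolution: propagate features K hops by repeated
    multiplication with A_hat, then apply a single linear map and a softmax."""
    H = X
    for _ in range(K):
        H = A_hat @ H          # after the loop, H equals the K-hop propagation of X
    return softmax(H @ W)
```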

2.2.6 Mixhop. The Mixhop model [1] gets to the core of the goal we are trying to accomplish in this thesis. In this paper, the authors show that the delta operator, described below, can capture the concept of homophily, or community, better than a regular graph convolutional network can. The authors of the Mixhop paper go on to empirically verify the benefits of the delta operator by showing that their network outperforms other network architectures on synthetically generated networks with differing degrees of homophily. In addition, they show that a regular GCN cannot express the delta operator described below.

f\left( \sigma(\hat{A} X) - \sigma(\hat{A}^2 X) \right)   (6)

As can be seen above, the delta operator calculates the difference between the hidden node representations (after the application of a non-linearity). Mixhop does so by means of concatenation of the hidden representations of different layers (i.e. skip-connections). The authors of the paper hypothesize that the increased capacity to capture homophily between nodes is due to the ability to express the delta operator. The train of thought behind the statement is that to accurately capture homophily, it is necessary to calculate the difference between hidden node representations aggregated over neighborhoods of different sizes. In the section below, we will elaborate on the significance of the concept of homophily in graphs, or networks.

We have discussed graph neural network methods related to sampling, attention, skip-connections and generalized linear models. All these approaches add an extra angle to the framework of the graph convolutional network. In the following sections we focus on sampling and skip-connections in particular, as these allow us to tackle the problems of graph size and role identification.

2.3 Network Analysis

Network analysis, the formal study of the mathematical and statistical properties of graphs, has historically been the front-runner in terms of node classification. Over several decades, the field has developed different algorithms and concepts for analyzing graphs. In this section we focus on two specific concepts most relevant to the present investigation, and hereafter discuss some historically relevant information for node classification. First, we look into the notion of community, and its corresponding metric of homophily. Afterwards, we focus on the concept of equivalence, and in particular regular equivalence, as this concept is most relevant to the issue of role classification in social networks. Both homophily and regular equivalence are used throughout literature for community classification and role classification, respectively. It should be noted that the terms graph and network are used interchangeably, and that these terms should not be confused with neural networks.

2.3.1 Community. The idea of community within a (social) network can best be described informally. A community can be defined as a subset of nodes in a graph that is more densely connected internally than to its surroundings [17]. That is, compared to the rest of the network, the nodes within a community have more connections with each other (note that we assume that all nodes belong to some community). An example of a community is a group in a social network, but the term community applies to any densely connected group of nodes in a network. Formally, there are many differing definitions of community. We describe a number of them.

Fig. 1. Community within a graph.

The green nodes fall within the community, as they each have more edges between each other than with the other nodes. The grey nodes do not have a sufficient amount of connections to belong to the community.

From the definition of community quickly follows the notion of homophily. Homophily, as opposed to community, is a local concept and applies to single nodes as opposed to collections of nodes. As such, homophily is easier to apply to graph neural networks, whose convolutional aggregation methods depend on local information. In addition, homophily is a concept that is not exclusive to social networks, and assumes some notion of similarity between objects (nodes) beyond belonging to a community. One way to interpret the concept is as similarity between entities [26].

The definition of community presented below [17] states that for community V, we have that ∀i ∈ V, the total sum of connections of node i with the community is greater than the sum of its connections to nodes not belonging to the community. This is a particularly strong definition of community, and depends on every node within a community respecting a particular condition.

\sum_{j \in V} A_{i,j} > \sum_{j \in (G - V)} A_{i,j}   (7)

Below you find a weaker version of community [17]. The difference between the definition of community presented above and this one is that the definition below is a condition that is satisfied by the community as a whole, and does not require individual nodes to satisfy a specific condition. Specifically, the definition requires the aggregate of all connections within community V to be greater than the aggregate of connections of nodes in the community to nodes outside the community.

\sum_{i, j \in V} A_{i,j} > \sum_{i \in V,\, j \in (G - V)} A_{i,j}   (8)

Both of these definitions of community depend on whether neighboring nodes belong to some other community to determine whether a node belongs to community V. To express such information in a neural network, we need to go beyond simple models such as the standard GCN. We expand on this issue in section 3.1.
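
As an illustration, both conditions can be checked directly on an adjacency matrix; the following is a small sketch (plain numpy, with illustrative function names) under the assumption of an undirected, unweighted graph.

```python
import numpy as np

def is_strong_community(A, community):
    """Strong definition (equation 7): every node in the community has more
    connections into the community than to nodes outside of it."""
    community = set(community)
    inside = list(community)
    outside = [j for j in range(A.shape[0]) if j not in community]
    for i in community:
        if A[i, inside].sum() <= A[i, outside].sum():
            return False
    return True

def is_weak_community(A, community):
    """Weak definition (equation 8): the aggregate of connections within the
    community exceeds the aggregate of connections leaving it (following the
    formula literally, internal edges are counted in both directions)."""
    community = set(community)
    inside = list(community)
    outside = [j for j in range(A.shape[0]) if j not in community]
    internal = A[np.ix_(inside, inside)].sum()
    external = A[np.ix_(inside, outside)].sum()
    return internal > external
```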

2.3.2 Equivalence. Equivalence as a concept in network analysis, by its very definition, is local in nature. Equivalence in its broadest sense is the idea that two nodes share the value of some similarity measure, yielding their equivalence. The theory of equivalence has historically received much attention, as it allows for partitioning of a graph based on the properties of its nodes. By extension, graph partitioning facilitates the identification of specific roles in a network. Theoretically, there are different definitions of equivalence, such as structural equivalence, automorphic equivalence, and regular equivalence [9]. We briefly treat the former two, and then zoom in further on regular equivalence, because of its relevance to our examination.

Structural equivalence is equivalence of the neighbors of a node. If two different nodes have exactly the same neighbors, the nodes are considered to be structurally equivalent. Automorphic equivalence refers to the structure of a graph: two nodes are automorphically equivalent if they share exactly the same properties with respect to the graph, apart from their labels [5]. The notions of structural equivalence and automorphic equivalence are very much interrelated. Notice how structural and automorphic equivalence can be defined in terms of the qualities of either the nodes themselves, or the direct qualities of their neighbors.

Finally, we have regular equivalence [3] [10] [25]. Regular equivalence of two nodes is less strict than the above definitions, and has been defined differently in different strands of literature. A common theme across the literature is the following intuition: two nodes are regularly equivalent if they have the same kind of neighboring nodes. At first sight this definition appears unambiguous, but on further inspection it leaves us with more questions. What does it mean for a neighbor to be of a certain kind? In the context of a social network, a kind may be gender, age, or any other quality. We run into complications when we try to formalize this definition.

Although there are several different definitions of regular equivalence, as well as different algorithms to compute regular equivalence for smaller graphs, we highlight two distinct approaches. Herein we emphasize the link of equivalence with social relations and graph structure. Concretely, we discuss a definition of regular equivalence based on graph partitioning and coloration (definition 1), and a definition based on role relations (definition 2).

Fig. 2. Regular equivalence of nodes.

All nodes are regularly equivalent, as similarly colored nodes are connected to nodes of the same color and have the same number of neighbors, and their neighbors are connected to the same color of nodes as well.

Definition 1.

∀(u, v ∈ V ) : c(u) = c(v) (9)

⇐⇒ (10)

∀(n ∈ N) : |{w ∈ V | {u, w} ∈ E ∧ c(w) = n}| = |{w ∈ V | {v, w} ∈ E ∧ c(w) = n}|   (11)

In the equations above, u, v and w are nodes belonging to the set of vertices V. An edge is denoted {a, b}, while the set of edges is called E. c is a coloring function, and assigns to each node in the graph a color (a color may be interpreted as a category, or formally, a natural number). The condition above clarifies the relationship between the coloring function (the assignment of roles, so to speak) and edges between nodes. Colloquially, we say that two nodes have the same role if, and only if, they have an equal number of neighboring nodes of each role. Notice that this definition takes into account the number of neighboring nodes.
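
A direct check of Definition 1 on a small graph might look as follows (a hedged sketch in Python; the adjacency-list and coloring representations are illustrative).

```python
from collections import Counter

def is_regular_coloring(adj_list, coloring):
    """Check Definition 1: two nodes have the same color if and only if they
    have the same number of neighbors of each color."""
    profiles = {v: Counter(coloring[w] for w in adj_list[v]) for v in adj_list}
    nodes = list(adj_list)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            same_color = coloring[u] == coloring[v]
            same_profile = profiles[u] == profiles[v]
            if same_color != same_profile:
                return False
    return True

# Usage on a 4-cycle with alternating colors (cf. Figure 2): returns True.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(is_regular_coloring(adj, {0: "red", 1: "blue", 2: "red", 3: "blue"}))
```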

Definition 2. Let ⟨A, R, P⟩ be a network, in which A denotes the set of nodes, R ⊆ A × A denotes the set of edges, and P denotes the set of attributes, all elements of which are subsets of A (or, put differently, P partitions A). Then, let x ≡ y denote a regular equivalence relation between nodes x and y in a graph. Finally, we say that x and y are regularly equivalent if:

(1) nodes x and y have the same attributes
(2) for all edges xRx′, there exists an edge yRy′ such that x′ ≡ y′
(3) for all edges yRy′, there exists an edge xRx′ such that x′ ≡ y′

Marx et al. [25] take a different approach to regular equivalence by making use of the notion of bisimulation [29], a notion from modal logic. Bisimulation can be considered to be the equivalence of transition systems, or labeled graphs. In their paper, the authors show how bisimulation on graphs is equivalent to regular equivalence in network analysis. The definition above shows how regular equivalence can be defined in terms of relations between nodes in a graph, as opposed to the node coloring we have seen before.

The conditions above show a definition of regular equivalence that coincides with that of bisimulation. As opposed to the definition based on coloring discussed earlier, this definition of regular equivalence is also a statement about the roles of neighboring nodes.

Finally, the practical relevance of equivalence of nodes ought to be emphasized. Oftentimes, owners of graph data find it beneficial to partition the graph based on node roles. Partitioning of a graph is not limited to social networks, but can be found in a multitude of fields. These assigned roles do not necessarily correspond to communities, and two nodes with similar roles may be separated in the graph (and thus belong to different communities). As such, role (regular) equivalence is a concept that comes naturally when considering practical applications.

2.4 Monte Carlo Sampling

The sampling framework we use in this paper is based on Monte Carlo sampling. We are interested in particular in the estimation and approximation of the convolution operator as described in the work of Kipf & Welling [20]. That is, we wish to evaluate the integral in equation 4. The primary motivation behind the use of the Monte Carlo method, is the fact that the method does not suffer from the curse of dimensionality [4] [23]. In other words, the speed of convergence of the Monte Carlo method does not depend on the dimensionality of the data. The evaluation of the Monte Carlo integral, however, brings with it a number of obstacles.

\tilde{h}^{l+1}(v) = \int \hat{A}(v, u)\, h^l(u)\, W^l \, dP(u)   (12)

If we sample from the set of all vertices, instead of from the direct neighbors of a node, how do we calculate the convolution operation, which is inherently local? The answer to this question lies in the asymptotic behaviour of the Monte Carlo sampling procedure, and we elaborate on it momentarily. Practically, the answer is simpler: we only aggregate the information provided by sampled nodes, and only those sampled nodes that are neighbors of the node for which we calculate the hidden representation. One would expect sparsity in larger networks to result in a deterioration in performance, as the probability of a sampled node being a neighbor of the node for which we calculate the hidden representation decreases. And conversely, one expects performance to increase with the number of samples. The latter behaviour has definitely been observed in previous research, while the former has not yet received much attention. Theoretically, convergence of the Monte Carlo sampling procedure has been proven numerous times, and is reaffirmed in a proof by the authors of the FastGCN paper [6], shown below.

\lim_{t_0, t_1, \ldots, t_M \to \infty} L_{t_0, t_1, \ldots, t_M} = L, \quad \text{with probability one}   (13)

In the equation above, L denotes the loss function of the neural network. The equation shows that the loss function that is the result of the Monte Carlo sampling method converges to the correct loss function. Practically, we see affirmed that the procedure provides us with the correct loss to train our network. In our case, we use batch training, and the empirical loss function is the average over the sampled nodes.

L = \mathbb{E}_{x \sim D}[g(W; x)]   (14)

Practically, we evaluate the integral presented above by taking the average over all sampled nodes. Below we see the equation used in the neural network to calculate hidden node representations. Notice that the notation is different from previous descriptions of neural networks (see section 2.2), as we still interpret the graph to be a random variable, so we can define an integral on it. Practically, the interpretation is equivalent. \hat{A}(v, u) is the restriction of \hat{A} to the indices of nodes v and u. Similarly, we interpret h^l(u) to be a hidden layer representation restricted to u.

\tilde{h}^{l+1}(v) = \frac{1}{t_l} \sum_{j=1}^{t_l} \hat{A}(v, u_j)\, h^l(u_j)\, W^l   (15)

The Monte Carlo formulation of the graph convolution can be adapted to decrease the sampling variance of the procedure, and thereby decrease the training time [6]. To do so, we apply importance sampling. Up to this point we have silently assumed that nodes are sampled uniformly. Importance sampling requires that we sample from a different distribution that is correlated with the distribution of the nodes in the graph. The resulting graph convolution equation is shown below.

H^{l+1} = \sigma\left( \frac{1}{t_l} \sum_{j=1}^{t_l} \frac{\hat{A}(:, u_j^l)\, H^l(u_j^l)\, W^l}{q(u_j^l)} \right)   (16)

In equation 16 the term q(u_j^l) is aimed at reducing the variance of the sample of nodes. To reduce this variance, the function q(u) ought to be correlated with the distribution of neighboring nodes of the sampled node, as this is the function we are trying to estimate. See [6] for a more extensive description of the procedure. Practically, we set q(u) as in equation 17, shown below.

q(u) = \frac{\| \hat{A}(:, u) \|^2}{\sum_{u' \in V} \| \hat{A}(:, u') \|^2}   (17)
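
Putting equations 16 and 17 together, a hedged sketch of the importance-sampled convolution might look as follows (plain numpy; the function names and the dense-matrix representation are illustrative, and this is not the FastGCN reference implementation).

```python
import numpy as np

def importance_distribution(A_hat):
    """Sampling distribution q(u) from equation (17): probability proportional
    to the squared L2 norm of the corresponding column of A_hat."""
    col_norms_sq = np.linalg.norm(A_hat, axis=0) ** 2
    return col_norms_sq / col_norms_sq.sum()

def fastgcn_layer(A_hat, H, W, num_samples, rng=None):
    """Monte Carlo estimate of a graph convolution (equation 16): sample nodes
    according to q and average their contributions, rescaled by 1/q(u) so that
    the estimator is unbiased for the full sum over all nodes."""
    rng = rng or np.random.default_rng()
    q = importance_distribution(A_hat)
    n = A_hat.shape[0]
    sampled = rng.choice(n, size=num_samples, p=q)
    Z = np.zeros((n, W.shape[1]))
    for u in sampled:
        Z += np.outer(A_hat[:, u], H[u] @ W) / (num_samples * q[u])
    return np.maximum(0, Z)   # ReLU non-linearity
```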

A final practical quality of Monte Carlo sampling is the following. Since sampling only a limited number of nodes results in a large amount of sparsity in the accumulation function, overfitting on the training set of a large graph is less of an issue compared to methods that do not make use of sampling.

Above we have described from an abstract point of view the sampling procedure employed in our method. We re-emphasize the training speed of the method, as well as its application to large datasets. Note that as of yet, inference on large graphs can only be done using sampling methods as a consequence of the working memory problems described in section 2.2.1.

3 THEORY

In this section we describe the theory behind the methods used in this paper. We show how the Jumping Knowledge Network design solves a number of prominent issues for graph neural networks. In addition, we show how the Jumping Knowledge Network (JKN) can express the delta operator as described in section 2.

To reiterate, the method described below solves a number of the issues laid out in the introduction. First, by taking a sampling-based approach, instead of doing exact inference based on all neighborhood nodes, we avoid the working memory problems that graph convolutional networks and their many variations run into. Second, by appending a JKN layer, we allow ourselves to increase the number of hidden layers in the neural network, as well as attain the benefit of increased representational power through the delta operator.

3.1 Jumping Knowledge Networks

The Jumping Knowledge Network module provides a number of benefits compared to a regular GCN. Below we briefly repeat the main benefits the JKN provides, including theoretical results of previous publications [38]. We then extend these results to the delta operator to show that Jumping Knowledge Networks have at least the same representational power in capturing community (homophily) as the mixhop network [1].

The essential quality of the JKN is the concatenation of hidden node representations of different layers in the neural network. Whereas in other graph neural networks information of different neighborhoods is diffused over the graph as a result of multiple hidden layers, JKNs retain access to that information through skip-connections. Skip-connections have been used to great effect in CNNs [33] [16], and an analogous argument can be applied to graphs. Skip-connections allow the neural network to access smaller scale features later in the network, allowing for better inference. For graphs, this means the network has access to smaller scale neighborhoods at later layers in the network, allowing the network to calculate differences between layers. We show how this is the case for JKN, based on similar results for Mixhop [1].

JKN = [ h_v^1 \,\|\, \ldots \,\|\, h_v^k ]   (18)

Above we see the definition of the concatenation version of the Jumping Knowledge Network. An additional layer is appended to the network after the convolutional layers, in which the hidden node representations are concatenated and multiplied by a learned weight matrix. To show, however, that JKNs can represent the delta operator as it occurs in Mixhop [1], we have to slightly alter the definition. In the sections above, we have used the notation h^l to denote the l-th layer hidden node representation, and \hat{h}^l to denote the neighborhood aggregation before applying a non-linearity (the difference between h and \hat{h} being the non-linearity). We use the same notation below to describe our altered version of the JKN.

JKN = [ h_v^1 \,\|\, \ldots \,\|\, h_v^k \,\|\, \hat{h}_v^1 \,\|\, \ldots \,\|\, \hat{h}_v^k ]   (19)

As explicated in section 2.2, the Mixhop neural network is able to express the delta operator, and as a consequence is better at capturing homophily in graphs. Below we show that the Jumping Knowledge Network is equally capable of capturing these relationships, given some adaptations, by proving that the JKN can compute the delta operator on graphs. Our proof follows the one laid out by the authors of Mixhop, but depends crucially on an adaptation of the JKN layer.

Theorem 1. The Jumping Knowledge Network module, using the configuration described in equation 19, can represent the two-hop Delta Operator (see equation 6 for a definition).

PROOF. Consider a two-layer graph convolutional network, with a single layer l defined as follows. We follow the notation used earlier in this paper.

H^l = \sigma(\hat{A} X W^{l-1})   (20)

Let us also define the hidden representations before they are fed through a non-linearity.

\hat{H}^l = \hat{A} X W^{l-1}   (21)

To construct the JKN module as shown in equation 19 we let the network branch into two separate parts. The first branch contains the regular convolutional network, while the second branch consists of two convolutional layers followed by a non-linearity, as shown below.

H^3 = \hat{A}(\hat{A} X W^1) W^2   (22)
    = \hat{A}^2 X W^{1,2}   (23)

Note that in the equations above the weights W^1 and W^2 compose. These two branches allow us to construct a JKN module that, in accordance with equation 19, contains the hidden node representations as well as the representations before the application of a non-linearity. If we concatenate these representations, we get the following JKN layer, where h^l denotes H^l restricted to a particular node.

JKN = [ \sigma(h_v^1) \,\|\, \sigma(h_v^2) \,\|\, \hat{h}^1 \,\|\, \hat{h}^2 ]   (24)

If we set W^1 and W^2 to the identity matrix I of the size of the hidden representation vector, the first and second convolutional layers in equation 22 reduce to \hat{A} X and \hat{A}^2 X, respectively. Finally, we set the linear transformation of the JKN layer to be as follows.

W_{jkn} = \begin{bmatrix} I & -I & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}   (25)

The product of this linear transformation with the JKN layer in equation 24 then yields the Delta Operator described in equation 6.

In the proof above we show that, at the expense of doubling the number of parameters in the network, a Jumping Knowledge Network module can describe neighborhood mixing as found in Mixhop [1], but does not have to suffer from the working memory problems preventing models such as Mixhop or the GCN from doing inference on large graphs. Additionally, the proof above can be extended to include neighborhood mixing of more than two layers.
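
To make the construction in the proof tangible, the following is a small numerical sketch (plain numpy, with illustrative names); note that the JKN blocks are concatenated row-wise here, so the weight matrix of equation 25 appears in transposed, block-column form.

```python
import numpy as np

def jkn_delta_demo(A_hat, X):
    """Illustration of Theorem 1: with the layer weights fixed to the identity,
    concatenating one-hop and two-hop representations (equation 24) and applying
    a suitable JKN weight matrix recovers the two-hop delta operator (equation 6)."""
    relu = lambda Z: np.maximum(0, Z)
    d = X.shape[1]
    h1, h2 = A_hat @ X, A_hat @ A_hat @ X                        # pre-activation branches
    jkn = np.concatenate([relu(h1), relu(h2), h1, h2], axis=1)   # equation (24)
    # JKN weights: identity on the first block, minus identity on the second,
    # zeros elsewhere (a concrete instance of equation (25)).
    W_jkn = np.zeros((4 * d, d))
    W_jkn[:d, :] = np.eye(d)
    W_jkn[d:2 * d, :] = -np.eye(d)
    delta = relu(h1) - relu(h2)                                  # equation (6), with f = identity
    assert np.allclose(jkn @ W_jkn, delta)
    return jkn @ W_jkn
```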

4 IMPLEMENTATION

The implementation of the neural networks tested in this paper is done using the PyTorch Geometric library [11]. We use this library to construct the different networks, as well as to pre-process the open source datasets through the API it provides. The proprietary dataset is pre-processed and converted to a NetworkX [14] format, after which the PyTorch Geometric framework is used to generate batches.
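
As a rough illustration of this conversion step, consider the following hedged sketch; the graph, attribute names and feature values are placeholders, and we assume torch_geometric.utils.from_networkx behaves as in recent PyTorch Geometric releases.

```python
import networkx as nx
from torch_geometric.utils import from_networkx

# Build a small NetworkX graph with per-node features and labels.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (2, 3), (3, 0)])
for v in G.nodes:
    G.nodes[v]["x"] = [float(v), 1.0]   # placeholder 2-dimensional feature
    G.nodes[v]["y"] = v % 2             # placeholder class label

# Convert to a PyTorch Geometric Data object (edge_index, x, y), which can
# then be fed to the usual PyTorch Geometric loaders for batched training.
data = from_networkx(G)
print(data)
```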

5 EXPERIMENTS

We test our model using a number of different experiments. We perform semi-supervised document classification on citation networks, we classify nodes into their respective communities in a social network, and we test the methodology on a proprietary dataset to assess the predictive power of the new network architecture. Below, we describe the properties of each of these five datasets, and briefly relate some of these properties to theory. In addition, we describe the experimental setup and the models we use as baselines against which to test the performance of our model.

5.1 Datasets

To evaluate the efficacy of the neural network architecture we follow the experimental setup of Kipf & Welling [20], Hamilton et al. [15], and Wu et al. [37]. A standard for benchmark testing has yet to be established in the graph neural network literature, but some datasets are used more often than others to verify the performance of graph neural networks.

Table 1. Overview of different datasets

Dataset Nodes Edges Classes

Citeseer 3,312 4,732 6

Cora 2,708 5,429 7

Pubmed 19,717 44,338 3

Reddit 232,965 11,606,919 50

Proprietary data 801,027 3,966,026 10

5.1.1 Citation networks. We test our methods on three different citation networks: Citeseer, Cora and Pubmed [30]. The task for these datasets is to predict to which community a particular paper belongs; a case of classification. The citation networks consist of citation links between documents (edges in a graph) and bag-of-words feature vectors for each document. Every document in the dataset has a label that assigns the document to a class. As for all datasets we employ, edges between different nodes in the graph are considered to be undirected. The Citeseer dataset consists of approximately 384,000 nodes, and 1,700,000 edges between these nodes. The average node degree, or the average number of neighbors of each node, is 9, so the network is very sparse. The Cora dataset consists of 2,708 publications subdivided into seven classes. The entire network consists of 5,429 links between documents based on citations. As a result of a high degree of centralization, the Cora network is quite sparse, and has an average node degree of 4. Finally, the Pubmed dataset contains 19,717 nodes representing scientific publications in the field of medicine. Between these nodes, there are 44,338 edges. Each node has a TF-IDF vector over a dictionary of 500 words. The average node degree in the Pubmed dataset is 2.

5.1.2 Reddit social network. A key test of the efficacy of our new method is its performance on the Reddit social network dataset [22]. For this task we try to predict to which community within the Reddit forum a particular post belongs. A post is labeled as belonging to a particular community if it is posted in a specific “subreddit”, or subforum. Different posts are connected by an edge in the social network graph if the same forum user posts in different subforums. The dataset consists of 50 communities, or classes, and 232,965 posts in total. Node features consist of 300-dimensional GloVe embeddings [27] of the post title, the average embedding of a post’s comments, the post’s score, and the total number of comments on a post. It should be noted that the Reddit dataset has a high degree of pre-processing and feature engineering, and does not accurately represent the structure of the Reddit forum.

5.1.3 Proprietary dataset. A significant part of the motivation behind this research comes from the availability of a proprietary social network dataset. The data in this dataset consists of both chat data and personal data from a messaging application. As opposed to most common messaging applications, the messages sent in this network are used to communicate information concerning business transactions and professional relationships. Of particular interest to us is the fact that this specific social network consists of chat data used by dealers and producers of illicit goods to exchange information and set up deals. As such, we are interested in the roles that the people using the chat program fulfil. Take, for example, a dealer. We expect a dealer to be connected differently than a buyer or a producer, both in terms of the number of connections and the other roles to which the dealer is connected. Although role identification is not exclusive to these types of social networks, this dataset is a prime example of this particular classification task. The average node degree in the dataset is 5. That is, the average person in the graph has only 5 neighbors, yielding a very sparse graph. The number of nodes in the network is 801,027, and the number of edges is 3,966,026.

5.2 Experimental Setup

We perform a number of different experiments in our research. First, we compare classification performance, reporting accuracy on balanced datasets and F1-score on unbalanced datasets. Second, we compare training time per epoch for each of the networks to show the speed improvements of sampling methods. We find that for larger graphs the combination of FastGCN and the Jumping Knowledge Network outperforms other methods, while for smaller graphs traditional graph neural networks perform better.

5.3 Baselines

To compare the performance of our architecture to existing architectures, we choose a number of baseline models. These models were chosen because of their resemblance to the model we present, and their documented performance on the datasets we use. In particular, we evaluate the performance of the Simplified Graph Convolution (SGC) [37], the Graph Convolutional Network (GCN) [20], and GraphSAGE [15]. All of these node classification algorithms belong to the field of graph neural networks. The SGC has the benefit of scaling well to larger networks and performs with high speed, but lacks non-linearities. As such, it is not a universal function approximator [7]. The lack of the universal function approximation property can be observed in the performance of the network, as the SGC performs worse than most networks across the board, but performs reasonably well on all tasks. The GCN is the quintessential graph neural network for node classification, and attains high scores on most datasets, but lacks the capacity to scale to larger networks. GraphSAGE, similar to our architecture, relies on neighbor node sampling to generate node representations and performs well on large datasets, but can be relatively slow due to a bottleneck in the speed of random walks over a graph.

6 RESULTS

The results of our experiments are summarized in table 2. The reported numbers are F1-scores for unbalanced classes and accuracy for balanced classes (note that accuracy and F1-score coincide for balanced classes). Results are all based on our own implementation of the respective neural architectures. We choose to implement each of the methods to ensure consistency and reliability in the interpretation of our results.

Table 2. Summary of results in terms of F1-score and accuracy.

Method Citeseer Cora Pubmed Reddit Proprietary data

GCN 0.720 0.827 0.798 NA NA

GraphSAGE 0.710 0.820 0.794 0.923 0.484

SGC 0.690 0.805 0.770 NA NA

FastGCN + JKN 0.703 0.809 0.791 0.942 0.477

Note that SGC and GCN cannot do inference on large graphs as a result of working memory load.

The table above shows the results of our experiment, with the F1-score of the best performing models for each dataset marked in bold. We see that the GCN performs well across the board, while SGC follows suit with slightly lower performance. GraphSAGE has roughly equivalent performance to GCN, but suffers slightly because of regularization as a result of sampling. Our own model outperforms GraphSAGE on the Reddit dataset, but fails to do so on our proprietary dataset.

We attribute the lower performance of our network to overfitting. On the smaller datasets, we see that the training accuracy quickly approaches 100%. The phenomenon of overfitting does not occur in large datasets as a consequence of a larger training set and a more diverse set of samples.

6.1 Training Time per Epoch

In this section we show the training time per epoch of our method and the baseline methods, to illustrate how network architecture and sampling method affect the training time of the different methods.

Table 3. Average time per epoch in seconds for different neural network architectures.

Method Citeseer Cora Pubmed Reddit Proprietary Data

GCN 0.125 0.110 0.552 NA NA

GraphSAGE 0.151 0.123 0.154 243.374 46.400

SGC 0.055 0.026 0.048 NA NA

FastGCN + JKN 0.056 0.064 0.0131 16.229 4.132

The training times were taken over 25 epochs of training, not including the first epoch because of initialization time. Note that SGC and GCN cannot do inference on large graphs as a result of working memory load.

Table 3 shows the average training time per model, per dataset, taken over 25 epochs. SGC was excluded from the large datasets, as the SGC can only process these datasets if the adjacency matrix powers are calculated before training time. We find such measures to be detrimental to an accurate depiction of real training times.

As can be seen, the FastGCN component of our architecture allows for fast training times, outperforming on large datasets. On the smaller datasets, SGC competes for the fastest training time. These results reflect the underlying computational properties of each method. While SGC is based on matrix multiplication, and its computation time scales with matrix size, FastGCN + JKN is based on sampling and its computation time is dependent on the number of layers and the number of samples. It should be noted that the additional layers required for the adapted JKN module (see section 3.1) induce increased training time as a result of a doubling of parameters.

6.2 Hyperparameters

All models were trained for 200 epochs using the Adam optimizer [19], with early stopping if validation accuracy or micro-F1 score does not increase for 25 epochs. The learning rate employed is 0.01 for all models. We use dropout for regularization, set at the empirically determined value of 0.5. For all networks other than our own, the number of hidden layers is limited to two, as there is a significant decline in performance if the number of hidden layers is set higher (this is a known observation, see [20]). The batch size used for training was the same for all architectures and datasets (with the exception of SGC, which does not use batch training), and amounted to 1000 data points per batch. The number of hidden units per hidden layer was held constant as well, with each hidden layer containing 16 hidden units for every model. For GraphSAGE the number of hops was fixed at 2, while the number of samples was set at 25. For hyperparameter tuning we varied the number of samples of our own model, setting it to 400 for the final model.
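
A hedged sketch of this training configuration is shown below; the `model`, `train_loader` and `evaluate` objects are assumed to be defined elsewhere, and this is not the exact training script used for the experiments.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
best_val, patience, epochs_without_improvement = 0.0, 25, 0

for epoch in range(200):
    model.train()
    for batch in train_loader:                      # batches of 1000 data points
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(batch), batch.y)
        loss.backward()
        optimizer.step()

    val_score = evaluate(model)                     # validation accuracy or micro-F1
    if val_score > best_val:
        best_val, epochs_without_improvement = val_score, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:  # early stopping after 25 epochs
            break
```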

7 CONCLUSION

In this paper we have presented a novel architecture for graph neural networks, and have shown how this novel architecture relates to the concepts of community and regular equivalence, or homophily and role equivalence. We have proved that a modified Jumping Knowledge Network can express the delta operator, while retaining its original properties. Empirically, we have tested this architecture against a number of baselines on a diverse set of benchmark datasets. In doing so, we have aimed to contribute to solutions of a number of open problems in the field of graph neural networks: (i) theoretical grounding, and (ii) the problem of inference on large graphs.

Based on our experiments, we have found our model to match the performance of state of the art models on large graphs, and in some cases to outperform them. In terms of micro-F1 score, our model performs well on the proprietary dataset, and outperforms the baselines on the Reddit dataset. By contrast, our model does not perform as well on the smaller citation networks, attaining lower scores than most models. We attribute the lower scores to overfitting of the Jumping Knowledge Network module on smaller graphs, and the superior performance on larger graphs to better access to neighborhood information. We also see that even with the Jumping Knowledge Network module included, training times are often faster than those of GraphSAGE.

Practically, we have shown our method to be capable of identifying communities and roles in a proprietary dataset of buyers and sellers of different illicit products. Crucially, these roles are not intrinsic to the network, as is the case in the Reddit dataset, but assigned based on necessity and analysis. In the future, as the demand for analysis on large graphs increases, tools such as our own may prove to be valuable. For the Police, our research may be helpful in quickly and accurately providing predictions for datasets that have either some predefined notion of community, or an explicit definition of role according to which training labels can be defined.

8 DISCUSSION

In this section we briefly highlight the main benefits and drawbacks of the proposed methodology. In addition we provide a number of avenues to explore for future research.

8.1 Proposed Model

We have shown how the proposed model tackles several long-standing problems in the literature, among which are the lack of theory surrounding graph neural networks, and neural networks in general, and the lack of neural architectures sufficiently equipped to perform on large graph datasets. Despite the above-mentioned benefits, our method has some shortcomings. First, in spite of the regularization provided by Monte Carlo sampling, the Jumping Knowledge Network module results in overfitting on the training set for smaller datasets. Second, an increase in the number of hidden layers does not necessarily yield an increase in performance, counter to what one would expect. Third, and related to the previous drawbacks, our network seems to have relatively worse performance on sparse networks. We hypothesize that the lower performance is a result of smaller graphs having lower node degrees than the larger ones.

The benefits of our model can be briefly summarized as follows. First, the method works well on large graphs. This was to be expected, as the method is based on FastGCN. Second, the method has the capacity to express the delta operator, and therefore the ability to better express homophily in graphs. Third, despite a deterioration in inference time compared to FastGCN as a result of the modified Jumping Knowledge Network module, the method still outperforms other methods in terms of inference time.

8.2 Future Work

The Jumping Knowledge Network architecture relies on skip-connections, as is common for Convolutional Neural Networks. Skip-connections, however, have not yet received the same extensive treatment for graph neural networks as they have in other fields of machine learning (computer vision, natural language processing). As skip-connections allow access to information at different levels in the graph or network, a more thorough treatment of skip-connections for GCNs may prove to be fruitful. As shown in this paper as well as in other work, skip-connections allow graph neural networks to express a particular notion of community not found in other networks. As such, graph neural networks may also reap the benefits already enjoyed in other fields of deep learning.

Deep neural networks have been shown to be more effective than shallow ones on most tasks (e.g. Imagenet [16], or transformer models [34]). Recently, there have been signals that deep networks are applicable to graph neural networks as well. Most graph neural network architectures, however, are limited to a small number of hidden layers as a consequence of decreased performance with a higher number of layers, and do not benefit from these developments in deep networks. A promising avenue to explore is the application of deep neural networks to graphs. An early sign of results in this direction is found in the literature (see [24]), but deep networks have not yet found widespread adoption in the graph neural network community.

As of yet, there has been no exact characterization of regular equivalence that can be learned by neural networks. Further research in the area of regular equivalence may yield increased expressive power for graph neural networks, as well as theoretical guarantees on the kind of inference that is possible for different neural network architectures. Although our work is not the first to emphasize the importance of the concept of regular equivalence, we believe the contrast with community is highly relevant to furthering the field.

As Transformer models have taken over sequence-to-sequence modeling and Convolutional Neural Networks have a firm grip on computer vision, we ask what generalized neural architecture might prove to be most effective for graph data. As of yet, there is no single neural architecture that dominates other architectures on graph data. We hope to see a similar convergence to a few effective neural architectures for graph data as we see in the fields of computer vision and sequence-to-sequence modeling.

ACKNOWLEDGMENTS

We would like to thank Zeno Geradts and Cees Schouten for their supervision and contributions, as well as Jorn Ranzijn and Leon Velthuijzen for their helpful discussions and feedback. Finally, we thank the National High Tech Crime Unit of the Police for providing a conducive and pleasant environment during the time of research.


REFERENCES

[1] Abu-El-Haija, S., Perozzi, B., Kapoor, A., Alipourfard, N., Lerman, K., Harutyunyan, H., Steeg, G. V., and Galstyan, A. (2019). Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 21–29.

[2] Aldous, D. and Fill, J. (1995). Reversible markov chains and random walks on graphs.

[3] Audenaert, P., Colle, D., and Pickavet, M. (2019). Regular equivalence for social networks. Applied Sciences, 9(1):117.

[4] Bellman, R. (1966). Dynamic programming. Science, 153(3731):34–37.

[5] Borgatti, S. P. and Everett, M. G. (1992). Notions of position in social network analysis. Sociological methodology, pages 1–35.

[6] Chen, J., Ma, T., and Xiao, C. (2018). Fastgcn: Fast learning with graph convolutional networks via importance sampling. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

[7] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314.

[8] Defferrard, M., Bresson, X., and Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852.

[9] Everett, M. G. and Borgatti, S. P. (1994). Regular equivalence: General theory. Journal of mathematical sociology, 19(1):29–52.

[10] Fan, T.-F. and Liau, C.-J. (2014). Logical characterizations of regular equivalence in weighted social networks. Artificial Intelligence, 214:66–88.

[11] Fey, M. and Lenssen, J. E. (2019). Fast graph representation learning with pytorch geometric. CoRR, abs/1903.02428.

[12] Fröhlich, H., Wegner, J. K., Sieker, F., and Zell, A. (2005). Optimal assignment kernels for attributed molecular graphs. In Proceedings of the 22nd international conference on Machine learning, pages 225–232. ACM.

[13] Grover, A. and Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM.

[14] Hagberg, A., Swart, P., and S Chult, D. (2008). Exploring network structure, dynamics, and function using networkx. Technical report, Los Alamos National Lab.(LANL), Los Alamos, NM (United States).

[15] Hamilton, W. L., Ying, R., and Leskovec, J. (2017). Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1025–1035. Curran Associates Inc.

[16] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer.

[17] Hu, Y., Chen, H., Zhang, P., Li, M., Di, Z., and Fan, Y. (2008). Comparative definition of community and corresponding identifying algorithm. Physical Review E, 78(2):026121.

[18] Kashima, H., Tsuda, K., and Inokuchi, A. (2003). Marginalized kernels between labeled graphs. In Proceedings of the 20th international conference on machine learning (ICML-03), pages 321–328.

[19] Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization.

[20] Kipf, T. N. and Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

[21] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.

[22] Kumar, S., Hamilton, W. L., Leskovec, J., and Jurafsky, D. (2018). Community interaction and conflict on the web. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 933–943. International World Wide Web Conferences Steering Committee.

[23] Kuo, F. Y. and Sloan, I. H. (2005). Lifting the curse of dimensionality. Notices of the AMS, 52(11):1320–1328.

[24] Li, G., Müller, M., Thabet, A. K., and Ghanem, B. (2019). Can gcns go as deep as cnns?

[25] Marx, M. and Masuch, M. (2003). Regular equivalence and dynamic logic. Social Networks, 25(1):51–65.

[26] McPherson, M., Smith-Lovin, L., and Cook, J. M. (2001). Birds of a feather: Homophily in social networks. Annual review of sociology, 27(1):415–444.

[27] Pennington, J., Socher, R., and Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.

[28] Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM.

[29] Sangiorgi, D. (2009). On the origins of bisimulation and coinduction. ACM Transactions on Programming Languages and Systems.

[30] Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. (2008). Collective classification in network data. AI magazine, 29(3):93–93.

[31] Shervashidze, N., Schweitzer, P., Leeuwen, E. J. v., Mehlhorn, K., and Borgwardt, K. M. (2011). Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561.

[32] Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., and Borgwardt, K. (2009). Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pages 488–495.

[33] Tai, Y., Yang, J., and Liu, X. (2017). Image super-resolution via deep recursive residual network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3147–3155.

[34] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.

[35] Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

[36] Weisfeiler, B. and Lehman, A. A. (1968). A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12–16.

[37] Wu, F., Souza Jr., A. H., Zhang, T., Fifty, C., Yu, T., and Weinberger, K. Q. (2019). Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 6861–6871.

[38] Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.-i., and Jegelka, S. (2018). Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pages 5449–5458.
