
Master Thesis

Image2Graph Transformers

Attention-based Deep Autoregressive Models for

Conditional Graph Generation

by

Davide Belli

11887532

September 26, 2019

36 EC Jan 2019 - Aug 2019

Supervisor:

Thomas Kipf, MSc

Assessor:

Dr. Herke van Hoof


Acknowledgements

It has been extremely challenging and satisfying to work on this thesis. First of all, I would like to thank my thesis supervisor Thomas Kipf, for his great support. Motivating conversations, useful suggestions and timely feedback inspired me to shape and improve my research over the last months.

I also want to thank my colleagues and friends in the MSc AI, with whom I shared challenges, homework, stress, trips, beers and a lot of time in the master's room during the last two years. In particular, for the motivating conversations and significant insights: Gabriele Cesa, Gabriele Bani, Andrii Skliar, Gautier Dagan.

A space here is also deserved by my lifelong friends, for their support even when I have been away: Nicola, Fabio, Damiano, Pier, P, Mira, Gabri.

Thanks to Sasha for her love and support, which have made and continue to make me happier both at university and beyond it.

Last, but most important of all, a thank you to my family. Marco and Monica, for your unconditional support in my life choices. For helping me in the difficult moments and for being with me in the moments of celebration. For always being there for me. Because I want you to be able to be proud of your son.

This thesis is dedicated to you.

Davide Belli


Abstract

Graphs are non-Euclidean structures frequently occurring in natural and artificial data from social sciences, chemistry and physics. Recent research in the deep learning community tries to propose effective methods to learn generative models for graphs. In this work, we explore the task of graph generation conditioned on a specific input. We introduce and motivate the Generative Graph Transformer (GGT), a new deep autoregressive model employing attention mechanisms for the recurrent generation of graphs. To show the relevance of conditional graph generation in real-world scenarios, we benchmark the model and other baselines on the application of road network extraction from semantic segmentation. For this, we introduce the new Toulouse Road Network dataset, based on real, publicly-available data. We also present the StreetMover distance, a new effective, efficient and scalable metric to compare planar graphs. Detailed statistical evaluations and qualitative studies confirm the high quality of the reconstructed graphs, showing that our model could be applied in a real-world scenario.


Contents

1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Thesis outline

2 Background and Related Work
2.1 Recurrent Neural Networks and Attention
2.1.1 RNNs
2.1.2 Attention
2.2 Graph Neural Networks
2.2.1 Graph Structured Data
2.2.2 Graph Generative Models
2.2.3 Conditional Graph Generation

3 Methods
3.1 Unlabeled Graph Generation
3.1.1 Datasets and Metrics
3.1.2 Models
3.2 Toulouse Road Network Dataset
3.3 Models
3.3.1 Encoders
3.3.2 Decoders
3.4 Training and Evaluation
3.4.1 Training settings and Loss
3.4.2 Evaluation

4 Experiments
4.1 Unlabeled Graph Generation
4.1.1 Experimental setup and hyperparameter configuration
4.1.2 Model comparison
4.2 Conditional Graph Generation
4.2.1 Experimental setup and hyperparameter configuration
4.2.2 The effect of a Recurrent formulation
4.2.3 The effect of Attention
4.2.4 The effect of λ
4.2.5 Overall Analysis

5 Conclusion and Future Work
5.1 Contributions
5.2 Limitations
5.3 Future Works

A Deriving Sinkhorn distance

B Additional plots
B.1 Unlabeled Graph Generation
B.2 Conditional Graph Generation


List of Figures

1.1 Outline of GGT
2.1 Unfolding RNNs
2.2 Multi-Head Self-attention
2.3 An example of graph
2.4 Schema of GAN
2.5 Schema of VAE
2.6 Semantic segmentation of road maps
3.1 Datasets for Unlabeled Graph Generation
3.2 Toulouse road map
3.3 Pre-processing satellite graphs
3.4 Splitting the datasets in train, validation, test
3.5 Samples from road map dataset
3.6 Marginal distributions of |E|, |V|
3.7 Joint distributions of |E|, |V|
3.8 BFS-ordering over the graph
3.9 Distribution of M for the road map dataset
3.10 Heuristic on the BFS ordering
3.11 Basic CNN Encoder
3.12 CNN Encoder with Context Attention
3.13 MLP Decoder
3.14 RNN Decoder
3.15 Extended GraphRNN Decoder
3.16 Generative Graph Transformer Decoder
3.17 Sequential representation of a graph
3.18 Examples of StreetMover distances
4.1 Qualitative evaluation on Unlabeled Graph Generation
4.2 MLP vs GRU
4.3 MLP vs RNN, a qualitative evaluation
4.4 RNNs with Self-attention
4.5 SM distance as a function of |E|, |V|
4.6 Generative Graph Transformer and Image attention
4.7 Inspecting Self-attention weights
4.8 Self-attention over the sequence
4.9 The effect of λ on the loss components
4.10 The effect of λ on the reconstructed graphs
4.11 Loss Curves
4.12 Qualitative comparison of reconstructions
4.13 Samples at different time-steps during training
4.14 Best samples, average samples, failure cases
4.15 Reconstructing subregions of the map
A.1 Additional examples of StreetMover distances
B.1 Samples from Grid dataset
B.2 Samples from Community dataset
B.3 Samples from Ego dataset
B.4 Samples from Protein dataset
B.5 Samples from Community Big dataset
B.6 Samples from Protein Big dataset


List of Tables

3.1 Unlabeled datasets statistics
3.2 Road map dataset statistics
4.1 Quantitative evaluation on Unlabeled Graph Generation
4.2 MLP vs RNN
4.3 RNNs and self-attention
4.4 Generative Graph Transformer and image context attention
4.5 The effect of λ
4.6 Overall evaluation
C.1 Hyper-parameter choice
C.2 CNN Encoder architecture
C.3 Context Attention in the CNN Encoder
C.4 MLP decoder
C.5 Simple RNN decoder
C.6 Generative Graph Transformer decoder
C.7 Generative Graph Transformer (attention + linear layer) block


Chapter 1

Introduction

Graphs are data structures particularly useful for representing the relational structure in valuable real-world domains like social networks, knowledge graphs, language, and interactions in physical and chemical environments. In deep learning, generative models for graphs aim to learn the probability distribution underlying an observed set of graphs. Once a model learns this probability distribution, it can generate new graphs with characteristics similar to the ones observed at training time. A particular case is conditional graph generation, where the generative process for each graph is conditioned on some input features. Conditional graph generation can be used for critical real-world applications like image captioning, drug discovery, and modeling chemical reactions. In this work, we introduce a new approach to image-conditioned graph generation using deep-autoregressive methods. We also explore the effects of attention mechanisms (Bahdanau et al., 2014; Luong et al., 2015; Xu et al., 2015) and transformer networks (Vaswani et al., 2017) to improve deep-autoregressive models for graph generation. Furthermore, we prove the efficacy of our newly-proposed model on the important application of automated extraction of road networks from semantic segmentation of satellite data. For this, we introduce a new dataset based on real-world data, which will be publicly released and will serve as a benchmark for future conditional generative models for graphs.

1.1 Motivation

Existing approaches to graph generation have to face well-known issues caused by the high dimensionality, the variable size of graphs, and the existence of internal cycles. Besides, the representation of a graph is ambiguous because all permutations of its adjacency matrix are equivalent. Different models have been proposed to generate graphs with specific solutions to tackle such issues, but most of them have drawbacks, like being limited to generating graphs up to a fixed size, or requiring expensive graph comparison algorithms for training and evaluation. You et al. (2018b) introduce an interesting recurrent formulation of graph generation. The recurrent formulation allows the generative process to scale naturally to large graphs.

However, we believe that there is still large space for improvement in this direction, for two main reasons. First of all, the architecture is based on simple GRUs (Chung et al., 2014) and could be improved with better-performing components. To design a new, effective generative process, we take inspiration from state-of-the-art generative models for NLP, where the data is also represented in the form of sequences. Attention mechanisms and transformer networks show up in most recent research on language modeling, and we believe that an architecture designed for the recurrent generation of graphs could significantly benefit from these new deep learning components. Secondly, GraphRNN is only applied to unlabeled graph generation, where the task is to learn the patterns in the connectivity of an observed graph distribution, i.e., the process of generating similar adjacency matrices. In most applications of generative models for graphs, however, relevant information is contained in the node labels, characterizing the actual content of a graph. This is the case, for example, in drug discovery, where the node features describe the atom type, or in the generation of scene graphs from images, where the nodes describe the objects appearing in the picture. For this reason, we decide to test the newly introduced Generative Graph Transformer (GGT) in a real-world application where the graph is represented by both node features and an adjacency matrix. In particular, we consider the image-conditioned graph generation setting for automated extraction of road networks from satellite images. In this task, the generative process is conditioned on some input, the satellite image, and the underlying graph represents the road network, characterized by road segments (edges) and nodes (intersections between road segments). The node features describing these intersections consist of the (x, y) coordinates over the image. Our model could also be applied to different conditional graph generation tasks, for example the reconstruction of scene graphs from pictures, or drug discovery, by replacing the conditioning on an image with conditioning on a feature vector describing the qualities of the molecule.

1.2 Contributions

The contribution of this work is threefold. First, we introduce the Toulouse Road Network dataset, based on real, publicly available data from OpenStreetMap. The dataset is designed to benchmark new models applied to the image-to-graph task, in the framework of conditional graph generation (generative modeling) or graph extraction from images (learning a deterministic function). Each datapoint in the dataset represents a squared patch in the whole map. A datapoint consists of an image describing the semantic segmentation of the road network, and the underlying graph represented as X and A, plus other information like BFS and DFS orderings over the graph. The introduction of the dataset includes a detailed discussion of the pre-processing, filtering, and augmentation techniques used for its generation. We believe that the key benefits of the Toulouse Road Network dataset are the abundance of training data, the fact that it is based on real maps, and the clear interpretability of the planar graphs describing the road networks.

Figure 1.1: Outline of GGT. An image I is passed through the encoder, which produces a conditioning vector c_t using a context attention mechanism on the previously generated nodes. The self-attentive decoder uses the conditioning vector and the previously generated nodes to predict the next node in the graph. This sequential process incrementally generates the graph G_T.

The second important contribution is the introduction, motivation, and discussion of the new Generative Graph Transformer, a deep autoregressive generative model for graphs, outlined in Fig. 1.1. In our work, we discuss and study the model for the task of generating road networks from images. The Generative Graph Transformer is based on an encoder-decoder structure, where the encoder is responsible for the conditioning of the generative process. The decoder is a deep autoregressive model based on multi-head attention and linear layers, generating a probabilistic adjacency matrix Ã and a feature matrix X. The main benefits of the Generative Graph Transformer are i) its adaptability to condition the generative process on different data types, ii) the scalability to generate large graphs thanks to the self-attention mechanism, and iii) the fact that no expensive graph comparison algorithms are needed at training or evaluation time, thanks to the use of a fixed canonical ordering. The Generative Graph Transformer is compared with a set of baselines, including extensions of existing works in the literature. A variety of metrics is used to conduct a quantitative study on the efficacy of the models, and an extensive qualitative study is presented to support the numerical analysis. Overall, GGT shows significant improvements in comparison with the other baselines.

The final significant contribution consists in the introduction of the StreetMover distance, a new metric used to evaluate the models in the road network generation task. The StreetMover distance is based on an approximation of the predicted and target graphs with point clouds obtained by sampling a fixed number of equidistant points over the edges of the graphs. Then, the StreetMover distance is computed as the optimal cost of moving the proposal point cloud to the target point cloud. Sinkhorn iterations (Cuturi, 2013) are used for an efficient approximation of the Wasserstein distance. The StreetMover distance can be interpreted as describing the cost of moving road segments in the predicted graph to match the shape of the ground-truth network. The main benefits of the newly introduced distance are its interpretability, scalability, and invariance with respect to the graph representation, translations, and rotations.
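To make this definition concrete, the following sketch (an illustration written for this text, not the thesis implementation; the helper names sample_edge_points and sinkhorn_distance, the unit-mass point clouds, and the regularization value eps are assumptions) approximates two planar graphs with point clouds sampled along their edges and compares them with Sinkhorn iterations.

import numpy as np

def sample_edge_points(nodes, edges, n_points=100):
    """Approximate a planar graph with a point cloud by sampling points
    along its edges, allocating samples proportionally to edge length."""
    segments = [(np.asarray(nodes[a]), np.asarray(nodes[b])) for a, b in edges]
    lengths = np.array([np.linalg.norm(q - p) for p, q in segments])
    counts = np.maximum(1, np.round(n_points * lengths / lengths.sum())).astype(int)
    points = [p + t * (q - p) for (p, q), c in zip(segments, counts)
              for t in np.linspace(0.0, 1.0, c)]
    return np.stack(points)

def sinkhorn_distance(x, y, eps=0.05, n_iters=200):
    """Entropy-regularized approximation of the Wasserstein distance
    between two point clouds with uniform weights."""
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # pairwise L2 costs
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):                 # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    transport = u[:, None] * K * v[None, :]  # regularized optimal transport plan
    return float((transport * cost).sum())

# Toy example: a two-edge path versus a slightly shifted copy of it.
nodes_a = {0: (0.1, 0.1), 1: (0.5, 0.5), 2: (0.9, 0.5)}
nodes_b = {0: (0.1, 0.15), 1: (0.5, 0.55), 2: (0.9, 0.55)}
edges = [(0, 1), (1, 2)]
print(sinkhorn_distance(sample_edge_points(nodes_a, edges),
                        sample_edge_points(nodes_b, edges)))

The small shift between the two toy graphs yields a small transport cost, which is exactly the interpretability argument made above.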

1.3 Thesis outline

This thesis is organized into four further chapters. In Chapter 2, we introduce background concepts and related work. Chapter 3 introduces the main contributions of our work: the proposed dataset, methods, and evaluation metrics. Continuing in Chapter 4, we present a set of experiments aimed at evaluating different approaches and models for image-to-graph generation. We conclude with an overall analysis of the results, along with an extensive qualitative study of the models, showing their applicability to real-world scenarios on larger scales. Finally, in Chapter 5, we present final considerations on our work and delineate possible future directions in the research on conditional graph generation.


Chapter 2

Background and Related Work

In this chapter, we briefly introduce the main background concepts and the important related works on which our work is based. In particular, our attention will be focused on tasks, datasets, and models in the existing literature which motivate the approaches, experiments, and solutions discussed in our project. In Section 2.1, we introduce the framework of Recurrent Neural Networks and their application to language modeling tasks. Furthermore, we investigate recent extensions of RNNs and CNNs with attention-based mechanisms. Then, in Section 2.2, we discuss how graph representations are effective when working with highly-structured data, and which tasks involving graphs have been approached in previous work. Finally, we explore in detail the task of Graph Generation, considering different approaches based on Deep Generative Models. The background and related work chapter is concluded by introducing the concept of Conditional Graph Generative Models, which is the main focus of the following chapters.

2.1 Recurrent Neural Networks and Attention

2.1.1 RNNs

Traditional feedforward neural networks are not the best fit to capture time dependencies in sequential data. At the same time, feedforward neural networks have a fixed size, and thus they cannot be used for data with variable length. Time dependency and variable length are characteristics present in a variety of data types, like language, audio, video, or financial time-series.


Figure 2.1: RNNs have self-connections that can be graphically unrolled to show the time dependencies over the sequence.

Recurrent neural networks (RNNs (Pearlmutter, 1989; Cleeremans et al., 1989)) are characterized by a recurrent formulation, where the architecture folds on itself in a loop to deal with the sequential nature of the data (see Fig. 2.1). This type of neural network can be used for different kinds of deterministic and generative tasks based on sequential data. Examples include language modeling (Sundermeyer et al., 2012), sentiment analysis (Zhang et al., 2018a), machine translation (Sutskever et al., 2014), summarization (Nallapati et al., 2016), question answering (Donahue et al., 2014), stock prediction (Gao, 2016), video generation (Gregor et al., 2015), and conditional image generation (van den Oord et al., 2016). The definition of the naive RNN is as follows:

a_t = b + W h_{t−1} + U x_t
h_t = tanh(a_t)
o_t = c + V h_t
ŷ_t = softmax(o_t)          (2.1)

The first two lines describe the RNN cell, while the last two define the output linear layer. In this formulation, W, U, V, b, c are learnable parameters (matrices and biases), h_t is the hidden vector at the current time-step t, and o_t is the output of the network at t. a_t is the activation of the RNN cell before the hyperbolic tangent nonlinearity. The hidden vector is responsible for carrying the information of earlier parts of a sequence to future time-steps. Based on this general formulation, a multitude of variants of RNNs have been proposed to solve different tasks (Gers and Schmidhuber, 2000; Lee et al., 2017).
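As a minimal illustration of Eq. 2.1 (a NumPy sketch written for this text, not code from the thesis; dimensions and the function name rnn_step are arbitrary choices), a single step of the vanilla RNN can be written as:

import numpy as np

def rnn_step(x_t, h_prev, W, U, V, b, c):
    """One step of the vanilla RNN of Eq. 2.1: update the hidden state
    and produce the (pre-softmax) output for the current time-step."""
    a_t = b + W @ h_prev + U @ x_t   # cell pre-activation
    h_t = np.tanh(a_t)               # new hidden state
    o_t = c + V @ h_t                # output of the linear layer
    return h_t, o_t

# Toy dimensions: input size 4, hidden size 8, output size 3.
rng = np.random.default_rng(0)
W, U, V = rng.normal(size=(8, 8)), rng.normal(size=(8, 4)), rng.normal(size=(3, 8))
b, c = np.zeros(8), np.zeros(3)
h, o = rnn_step(rng.normal(size=4), np.zeros(8), W, U, V, b, c)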

One of the first extensions of RNNs is the LSTM, which deals with the problem of decaying long-term memory in naive RNNs. In long short-term memory networks (LSTMs), a cell state retains information that is not useful for the prediction at the current time-step but may be needed for future ones (long-term dependencies). Internally, LSTMs present three different gates (input, output, forget) to modulate what is saved in the cell state and what is output for the current prediction. LSTM cells (not considering the common output linear layer) are defined as follows:

C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ∗ tanh(C_t)          (2.2)

In Equation 2.2, the updates for the hidden state and cell state are modulated through three gating mechanisms, functions of the current input x_t and of the previous hidden and cell states h_{t−1}, C_{t−1}. As shown in the equations, C̃_t contains candidate values from the current input to update the previous cell state. The forget and input gates f_t and i_t define which values to forget in the previous cell state C_{t−1}, and which values to keep from the newly proposed cell C̃_t. The gating with f_t and i_t is then applied to the previous cell state and candidate cell state. The gated cells are combined by summation to obtain the updated cell state C_t. Finally, the output gate defines which information from the cell state is output at the current time-step in the hidden state h_t.

Gated Recurrent Units, introduced by Chung et al. (2014), define an effective way to simplify LSTM cells by reducing the number of gates. In particular, the input gate and forget gate are merged in order to reduce the number of trainable parameters and speed up the training:

C_t = f_t ∗ C_{t−1} + (1 − f_t) ∗ C̃_t          (2.3)
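The gating logic of Eq. 2.2 can be made concrete with a short NumPy sketch (again an illustration written for this text, not thesis code; the parameter dictionary and toy sizes are assumptions). The commented line shows where the GRU-style merge of Eq. 2.3 would replace the separate input gate.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following Eq. 2.2, with weights acting on [h_prev, x_t]."""
    hx = np.concatenate([h_prev, x_t])
    c_tilde = np.tanh(params["W_C"] @ hx + params["b_C"])    # candidate cell values
    f_t = sigmoid(params["W_f"] @ hx + params["b_f"])        # forget gate
    i_t = sigmoid(params["W_i"] @ hx + params["b_i"])        # input gate
    c_t = f_t * c_prev + i_t * c_tilde                       # updated cell state (Eq. 2.2)
    # GRU-style merge of the gates (Eq. 2.3) would instead use:
    # c_t = f_t * c_prev + (1.0 - f_t) * c_tilde
    o_t = sigmoid(params["W_o"] @ hx + params["b_o"])        # output gate
    h_t = o_t * np.tanh(c_t)                                 # new hidden state
    return h_t, c_t

# Toy example with input size 4 and hidden size 8.
rng = np.random.default_rng(0)
params = {f"W_{g}": rng.normal(size=(8, 12)) for g in "Cfio"}
params.update({f"b_{g}": np.zeros(8) for g in "Cfio"})
h, c = lstm_step(rng.normal(size=4), np.zeros(8), np.zeros(8), params)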

To base further discussions on generative models in our work, we briefly introduce the language modeling task and how recurrent neural networks can be used to approach it. In language modeling, the goal is to learn the sequential generative process underlying an observed distribution of text sequences (in the form of sentences, books or corpora). First of all, a sequence of words (or a sentence) is defined as w = w_1^N = (w_1, w_2, . . . , w_{N−1}, w_N), where every word belongs to a dictionary of possible words, w_i ∈ D ∀ 1 ≤ i ≤ N. Now, the joint probability for a sentence w can be factorized into a product of conditionals, which is expressed as:

P(w_1^n) = P(w_1) P(w_2|w_1) P(w_3|w_1^2) · · · P(w_n|w_1^{n−1}) = ∏_{k=1}^{n} P(w_k|w_1^{k−1})          (2.4)


A recurrent neural network applied to language modeling is trained to capture, at every time-step, the probability distribution for the next word w_{i+1} in the sentence, which can be defined as P(w_{i+1}|w_1^i). Practically, the inputs of the RNN at every time-step are the word generated at the previous time-step, x_t = ŷ_{t−1}, and the previous hidden state h_{t−1}. The output is a probability distribution over the whole vocabulary D. At inference time, generation starts with a 'Start-Of-Sequence' token w_0 = <SOS>, and the recurrent sampling continues until the 'End-Of-Sequence' token is sampled, w_i = <EOS>, resulting in the termination of the current sequence.
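The recurrent sampling procedure just described can be sketched as follows (a toy, untrained PyTorch model invented for illustration; the token ids, vocabulary size and architecture are assumptions, not the models used in this thesis):

import torch
import torch.nn as nn

SOS, EOS, VOCAB = 0, 1, 100  # assumed special-token ids and vocabulary size

class TinyRNNLM(nn.Module):
    """Minimal RNN language model: embed the previous word, update the
    hidden state, and output a distribution over the vocabulary."""
    def __init__(self, vocab=VOCAB, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.gru = nn.GRUCell(emb, hidden)
        self.out = nn.Linear(hidden, vocab)

    def step(self, w_prev, h_prev):
        h_t = self.gru(self.embed(w_prev), h_prev)
        return torch.softmax(self.out(h_t), dim=-1), h_t

@torch.no_grad()
def sample(model, max_len=20):
    """Autoregressive sampling: start from <SOS>, feed each sampled word
    back as the next input, stop when <EOS> is drawn."""
    w, h = torch.tensor([SOS]), torch.zeros(1, 64)
    words = []
    for _ in range(max_len):
        probs, h = model.step(w, h)
        w = torch.multinomial(probs, num_samples=1).squeeze(1)
        if w.item() == EOS:
            break
        words.append(w.item())
    return words

print(sample(TinyRNNLM()))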

2.1.2 Attention

Attention mechanisms were first introduced for Neural Machine Translation by Bahdanau et al. (2014) and Luong et al. (2015). The authors propose an attention mechanism to overcome the bottleneck in encoder-decoder architectures for machine translation, where the task is to translate an input sequence x = [x_1, x_2, . . . , x_n] into the output sequence y = [y_1, y_2, . . . , y_m]. In the encoder-decoder approach to machine translation, the encoder is a recurrent neural network which parses the source sequence to represent in its hidden state the information relevant for translation. The hidden vector after the last encoding time-step, h_n, is then used to initialize the first hidden state s_0 of the decoder recurrent neural network, which is used to sample the translated sequence conditioned on the encoder representation. The main problem in NMT with the naive encoder-decoder is the bottleneck in the hidden vector h_n = s_0, and the fact that information in the first words of the sequence has to flow and persist through all the encoding time-steps to finally be used in the decoder. Solving this issue, attention mechanisms provide a way to directly model the connections between each encoder hidden state h_i and the hidden states s_t in the decoder. This is done by concatenating the output of the decoder network with a context vector c_t, defined as a weighted average of the encoder hidden vectors: c_t = ∑_{i=1}^{n} α_{t,i} h_i. The weights α_{t,i} are the alignment scores describing the importance of h_i for decoding the word at time t. For every t, α_{t,:} describes a probability distribution over the input sentence x.

α_{t,i} = align(y_t, x_i) = exp(score(s_{t−1}, h_i)) / ∑_{i'=1}^{n} exp(score(s_{t−1}, h_{i'}))          (2.5)

The probability distribution is obtained by applying a softmax function to the score obtained by each pair y_t, x_i. In the definition from Bahdanau et al. (2014), the score between an encoder and a decoder hidden vector is computed through the following alignment score function, where W_a and v_a are learnable parameters:

score(s_t, h_i) = v_a^T tanh(W_a [s_t; h_i])          (2.6)

However, there are multiple ways to define and parameterize such an alignment function, for example based on the cosine similarity between h_i and s_t (Graves et al., 2014), or with the scaled dot-product (Vaswani et al., 2017):

score(s_t, h_i) = s_t^T h_i / √n          (2.7)

Attention is often defined under the (key, query, value) triplet representation. The values V are the vectors on which we want to attend, namely those over which the weighted average with the alignment scores is computed. The keys K are the representation of the values V on which the similarity function is computed to output the alignment scores. The query Q defines the similarity function and, intuitively, what the model is looking for in the keys K. In the attention mechanisms previously discussed (Bahdanau et al., 2014), key and value vectors coincide, K = V = h_{0:n}, and the queries are Q = s_{0:m}. For the scaled dot-product example, the formulation becomes:

C = ∑_{i=1}^{n} α_{:,i} h_i = softmax_rows(S H^T / √n) H = softmax_rows(Q K^T / √|K|) V          (2.8)

where the softmax is done over each row of the resulting matrix.
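Eq. 2.8 translates directly into a few lines of code; the sketch below (a generic NumPy illustration written for this text, not tied to any specific model in this thesis) computes soft attention over a set of value vectors given queries and keys.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Soft attention as in Eq. 2.8: C = softmax_rows(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # alignment scores
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)         # row-wise softmax -> weights
    return alpha @ V, alpha                            # context vectors and weights

# Toy example: 3 decoder queries attending over 5 encoder states of size 16.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 16))          # encoder hidden states (keys = values here)
S = rng.normal(size=(3, 16))          # decoder hidden states used as queries
context, weights = scaled_dot_product_attention(S, H, H)
print(context.shape, weights.sum(axis=-1))  # (3, 16), each row of weights sums to 1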

Attention mechanisms can be classified into different categories. In soft attention, the context vector results from a weighted average over all the attended vectors h_i, like in the example discussed above. Hard attention only selects one vector to attend to, for example by sampling from the alignment score distribution. While the soft attention approach may become very expensive for long sequences, hard attention is non-differentiable and requires techniques such as variance reduction (Sutton et al., 1999) or reinforcement learning to train.

A special case of attention is self-attention. In this setting, the mechanism attends over different positions of a single sentence. This has been proven to be useful in generative tasks like image captioning (Mao et al., 2014) or for machine reading (Cheng et al., 2016).

In recent years, different versions of attention mechanisms have been employed to improve solutions for natural language processing, image and graph reasoning tasks (Schlemper et al., 2018; Veličković et al., 2017; Kawai et al., 2019). Of particular relevance, Vaswani et al. (2017) first introduce an innovative approach to language modeling without the use of the popular RNN-based encoder-decoder architecture. In their work (Vaswani et al., 2017), the authors completely get rid of the recurrent neural network for language modeling and substitute it with a network purely based on attention mechanisms. In the transformer architecture, the encoder and decoder components are based on a similar common structure. First of all, the input sequence x = [x_1, x_2, . . . , x_n] and output sequence y = [y_1, y_2, . . . , y_m] are passed through an embedding layer to obtain a dense representation. Afterward, a positional encoding is used to transform the embeddings depending on their position in the sequence. This encoding introduces information about the ordering in the original sequence, which would otherwise be lost due to the permutation invariance of the following attention mechanisms.

Figure 2.2: Schema of the Multi-Head Self-attention layer.

The main component of the Transformer is the attention mechanism. The proposed Multi-Head Self-attention layer (see Fig. 2.2) is based on the scaled dot-product attention defined in Eq. 2.8. First, the encoded vectors are passed in parallel through three different linear layers to obtain keys K, queries Q and values V. The triplet (Q, K, V) is then fed to the multi-head attention layer. In comparison to the original self-attention dot product, the matrices Q, K, V are split into multiple equally-sized heads, and the attention mechanism is run independently over each of those heads. Splitting the embeddings into different heads allows the network to simultaneously attend to different subsections of each embedding, retrieving different information (topics) from different parts of the input and output sequences. To enforce the probability factorization defined in Eq. 2.4, the inputs (and at training time also the outputs) are masked so that the self-attention mechanism does not allow dependencies on future words. The attention mechanisms are paired with feedforward layers and repeated multiple times in the architecture. Layer normalization and skip connections are used in each of these repeated blocks.

Following the multi-head attention is a feed-forward layer. In the decoder network, an additional multi-head attention layer is included between the first multi-head attention layer and the feedforward layer. This second attention layer combines the processed input sequence from the encoder (in the form of keys and values) with the partially generated output sequence (in the form of queries). Residual connections and layer normalization are used in combination with each multi-head attention and feedforward block. Finally, the network components described so far are repeated 6 times in both encoder and decoder to obtain the final Transformer architecture.
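For a concrete picture of the masking described above, the following PyTorch sketch (an assumed illustration using the library's built-in multi-head attention, not the implementation used later in this thesis; sizes are arbitrary) builds a causal mask so that position t can only attend to positions ≤ t, mirroring the factorization of Eq. 2.4.

import torch
import torch.nn as nn

seq_len, d_model, n_heads = 6, 32, 4
x = torch.randn(1, seq_len, d_model)          # a batch with one embedded sequence

# Upper-triangular boolean mask: True marks pairs (query t, key t') with t' > t,
# i.e. "future" positions that self-attention is not allowed to use.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)
out, attn_weights = mha(x, x, x, attn_mask=causal_mask)

print(out.shape)          # (1, 6, 32)
print(attn_weights[0])    # lower-triangular attention pattern over the sequence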

Further work in language modeling (Dehghani et al., 2018; Dai et al., 2019), and in particular the recent GPT-2, based on generative pre-training via language modeling (Radford et al., 2019), has shown the extreme efficacy of transformer-based networks for language modeling. In GPT-2, a huge model pre-trained for language modeling is fine-tuned for different downstream tasks, achieving state-of-the-art results in most of them. This behavior shows how transformer networks can learn the underlying patterns in languages, resulting in a model that can generalize to different tasks, similarly to how low-level filters learned in modern CNNs (for example Inception (Szegedy et al., 2014) or ResNet (He et al., 2015)) can be reused for transfer learning.

Attention mechanisms have also been applied to improve solutions for reasoning on images (Karpathy and Fei-Fei, 2017). Xu et al. (2015) propose an attention-based image caption generator. In their approach, image attention is implemented as soft context attention on the visual features of the image, given the previous outputs in the partially generated caption. The attention on the image allows the model to focus on different areas of the image that still have to be described. Using image features instead of the source image greatly helps in keeping the dimensionality of the attended matrix low. The context vector is then fed to the recurrent layer along with the previous hidden vector and the partially generated output. The process is repeated iteratively until the end token is sampled, marking the end of the caption.

2.2 Graph Neural Networks

2.2.1 Graph Structured Data

Definition A graph G = (V, E) is a structure defined by a set V of objects called vertices (or nodes, points), and a set E of connections between pairs of such objects, namely edges (or links). More precisely, given a set of vertices V in the graph, the edges are defined as E ⊆ {(x, y) | (x, y) ∈ V^2 ∧ x ≠ y}. Sometimes, self-connections between a node and itself are also represented, and thus the condition x ≠ y is omitted. Moreover, both nodes and edges in the graph can be described with a feature vector, containing additional information about that element.

Figure 2.3: Example of a graph with nodes V = {1, 2, 3, 4} and edges E = {e1, e2, e3, e4}, where e1 = (1, 2) = (2, 1), e2 = (1, 3) = (3, 1), e3 = (1, 4) = (4, 1), e4 = (3, 4) = (4, 3).

In order to disambiguate among different possible ways to describe graphs, a fixed convention on the notation is defined in this section to ensure consistency throughout this report. A general graph G = (X, A, E) is defined by a node feature matrix X ∈ R^{N×D}, a binary square adjacency matrix A ∈ {0, 1}^{N×N} and an edge feature matrix E ∈ R^{N×N×L}. When considering a particular node at index i, the array X_i represents its feature vector, row A_{i,:} and column A_{:,i} represent the existence of outgoing and incoming edges respectively, and for every other node j, E_{i,j,:} and E_{j,i,:} represent the feature vectors for the edge between i and j and vice-versa. Usually, the feature vectors describing each node in the graph contain either a dense numerical representation of the node or a one-hot encoding of its class. Depending on the domain of the graph, the edges can be directed or undirected. In the undirected case, for example when modeling bonds between atoms in a molecule, A is symmetric and (x, y) ∈ E ⇔ (y, x) ∈ E. Otherwise, for example when modeling the 'following' relationship between users in a social network, A is not symmetric, and the previous proposition does not hold. Note that by permuting rows and columns in the same way, A ∈ {0, 1}^{N×N} can be represented in N! equivalent ways. As we will discuss in the next sections, this is a particularly relevant problem when working with graph data, for example when trying to match a real and a reconstructed graph through some similarity metric. For this project, we do not consider the edge feature matrix E, due to the absence of edge labels in the datasets and tasks we consider. Edge labels can be used, for example, to describe the type of chemical bonds between atoms in a molecule, or the relations between entities in an ontology.
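Under this convention, the small graph of Figure 2.3 could be encoded as follows (a NumPy sketch added for illustration; the one-hot node features are an arbitrary choice, since the figure defines no labels):

import numpy as np

# Node feature matrix X (N x D): here an arbitrary one-hot encoding of the node ids.
X = np.eye(4)

# Binary adjacency matrix A (N x N) for the undirected edges of Figure 2.3:
# e1=(1,2), e2=(1,3), e3=(1,4), e4=(3,4), with nodes re-indexed from 0.
A = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (0, 2), (0, 3), (2, 3)]:
    A[i, j] = A[j, i] = 1        # undirected graph -> symmetric A

# A permutation of the node ordering gives an equivalent representation of the same graph.
perm = np.array([2, 0, 3, 1])
A_perm = A[perm][:, perm]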

Applications When it comes to visualizing, analyzing and working with any data, there are different ways to represent it so as to preserve and exploit the important information. It is straightforward to see that images are best visualized as 2-dimensional matrices (3, if we consider the color channels), while text makes more sense when expressed as a sequence. For highly-structured data where the structure may vary greatly among different graphs or sub-graphs, a graph representation is the best way to preserve the structural information. For example, graphs allow for effective representations of chemical molecules, social networks, physical systems and even text (in the form of ontologies or parsing trees). In fields related to Artificial Intelligence, graphs have initially been used to represent large knowledge-based systems and ontologies on which to perform rule-based reasoning, or as an instrument to model physical and chemical processes from a mathematical point of view (Erdős and Rényi, 1959; Albert and Barabási, 2001). Following recent developments in the Deep Learning field, researchers started to generalize Neural Networks (previously applied to work with images and text) to graph-structured data. At the same time, a relevant slice of the data and signals produced by modern technology has an underlying graph structure, for example point cloud representations of 3D scenes and environments, navigation data from mobile road maps, and recommendation systems for social networks, streaming services and browsers (Qi et al., 2017; Bronstein et al., 2016; Mirowski et al., 2018; Zhang et al., 2017).

Taking inspiration from convolution operations already applied to 2D images, a series of models (Graph Convolutional Neural Networks) have been proposed to perform the analogous message passing algorithm over graph data (Kipf and Welling, 2016a; Hamilton et al., 2017; Gao and Ji, 2019), with the goal of reasoning over graphs. Other works proposed graph counterparts of Recurrent Neural Networks (Pineau and de Lara, 2019; Li et al., 2015). Extensions of other common neural network layers include graph pooling (Ying et al., 2018; Lee et al., 2019) and graph attention (Veličković et al., 2017; Kawai et al., 2019). Of particular interest, Gilmer et al. (2017) summarize the different graph neural network architectures and layers under the Neural Message Passing framework, providing an easy way to define and compare them. Models based on graph neural networks have been shown to outperform other neural network approaches for a multitude of discriminative and generative tasks based on graph data (Wu et al., 2019; Xu et al., 2018; Zhou et al., 2018). Examples of tasks and applications to which GNNs have been applied are graph classification (Battaglia et al., 2016; Watters et al., 2017; Zhang et al., 2018b), node classification, link and node prediction (Zhang and Chen, 2018), image classification (Quek et al., 2011) and reasoning (Li et al., 2017; Narasimhan et al., 2018), reasoning on text (Bastings et al., 2017) and combinatorial optimization (Li et al., 2018c). The previous cases are examples of discriminative models based on graphs, where the task is to learn to make some prediction given unseen input. A different approach is based on generative models, where the task is to learn to model the probability of observed data and to be able to sample new, unseen data from the approximate distribution. In our work, we focus on graph generation, and in the next section we will introduce and discuss some interesting examples from the literature based on RNNs and GNNs. Generative models can also be used for different tasks like representation learning and compression (Huang and Carley, 2019; Bianchi et al., 2019), as they provide a way to better understand and reason on the observed distribution of data.

2.2.2 Graph Generative Models

Modeling the generation of graphs has critical importance in studying networks in chemistry, physics, social sciences, and engineering. The complexity in learning to generate graphs lies in the intrinsic nature of this type of data. The modeling of graph distributions is particularly difficult due to the high dimensionality, variable size, and existence of internal cycles in these structures. Moreover, graphs can be equally represented by each of the N! permutations of the adjacency matrix A. This ambiguity makes it difficult to train and evaluate graph generative models, because a way to compare pairs of (real and generated) graphs is needed. Possible solutions to this problem include fixing a canonical ordering for the graphs in the dataset, or an expensive comparison considering all possible permutations of A. In this section, we are going to discuss in more depth some approaches to graph generation and how they try to solve the modeling difficulties we just introduced. For clarity, we include in two gray boxes a brief introduction to GAN and VAE models, which are the basis for some of the discussed graph generative models. This introduction is aimed at general ML practitioners and can be skipped by readers that are already experts in these topics. We also stress the comparison between models that generate graphs in one shot and deep autoregressive models that generate graphs iteratively.

GANs Generative Adversarial Networks (GANs) are a framework of models first introduced by Goodfellow et al. (2014).

Figure 2.4: Schema of GAN.

In GANs, two neural networks contest each other in a zero-sum game. The first neural network, the generator G_θ, is trained to generate new datapoints matching the characteristics of an observed distribution x ∼ p_data(x). The second network, the discriminator D_φ, is trained to distinguish real samples drawn from the observed distribution from the fake ones created by the generator. To enforce the generator to sample diverse datapoints every time, the generation procedure is conditioned by sampling the current input from a latent vector z ∼ p(z). Usually this latent variable follows a normal or uniform distribution. At training time, the discriminator learns to output the probability that the input images are real, with the goals of predicting D_φ(x) = 1 and D_φ(G_θ(z)) = 0. The signal from the discriminator is in turn used by the generator to improve against the discriminator network, trying to achieve D_φ(G_θ(z)) = 1. The resulting optimization process is described by the mini-max game min_θ max_φ V(G_θ, D_φ). The discriminator and generator losses for a minibatch i can be derived as alternating minimization and maximization steps on the parameters θ and φ:

V(G_θ, D_φ) = E_{x∼p_real(x)}[log D_φ(x)] + E_{z∼p_z(z)}[log(1 − D_φ(G_θ(z)))]          (2.9)

L(z^(i); θ) = − log D_φ(G_θ(z^(i)))
L(x^(i), x̂^(i); φ) = − log D_φ(x^(i)) − log(1 − D_φ(x̂^(i)))          (2.10)

Both networks are parameterized through learnable parameters (φ and θ) which are trained by optimizing the mini-max loss function with gradient descent. Ideally, this mini-max game converges to the point in which the generator has learned to generate datapoints from the real distribution p(x), so that the discriminator is no longer able to distinguish between real and fake inputs. In GANs, the generative process is learned implicitly, meaning that GANs do not model an approximate posterior distribution q(z|x), but instead define a latent variable z ∼ p(z) which is sampled and transformed by the learned deterministic function G_θ to generate the corresponding datapoint x̂ ∼ G_θ(z). This is different from VAEs, which model an approximation of the posterior explicitly, as we will see in the next paragraph.

GANs have been successfully applied to image generation, with incredibly realistic results (Ledig et al., 2016), but also to videos (Vondrick et al., 2016), text (Fedus et al., 2018) and graphs (Wang et al., 2017; Fan and Huang, 2019). De Cao and Kipf proposed MolGAN (De Cao and Kipf, 2018), a GAN-based graph generative model for de novo drug discovery. In their approach, both the generator and discriminator networks are based on graph convolutional layers, and an additional reward network is introduced to enforce the modeling of non-differentiable qualities in the generator. The model is trained in the Wasserstein GAN setting (Arjovsky et al., 2017). Of particular interest in this approach is that the three networks are independent of the graph ordering, and this circumvents the need for expensive procedures for graph matching (between original and generated samples) and for ordering heuristics. The main downside of this approach to graph generation is its scalability. Since the generation is done in one shot, the model can only scale to generating a small number of nodes in the graph (in MolGAN, the experiments are done on the QM9 dataset, containing graphs with |V| = 9).
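A minimal training step implementing the two losses of Eq. 2.10 might look as follows (a toy PyTorch sketch on 2-D data, written for illustration; it is neither MolGAN nor any model evaluated in this thesis, and the architectures are arbitrary):

import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

x_real = torch.randn(64, data_dim) * 0.5 + 2.0      # stand-in for samples of p_data(x)

# Discriminator step: minimize -log D(x) - log(1 - D(G(z)))  (Eq. 2.10, second line).
z = torch.randn(64, latent_dim)
x_fake = G(z).detach()                               # do not backprop into G here
loss_d = -(torch.log(D(x_real)) + torch.log(1 - D(x_fake))).mean()
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: minimize -log D(G(z))  (Eq. 2.10, first line).
z = torch.randn(64, latent_dim)
loss_g = -torch.log(D(G(z))).mean()
opt_g.zero_grad()
loss_g.backward()
opt_g.step()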

VAEs The variational auto-encoder is an unsupervised generative model introduced in parallel by Kingma and Welling (2013) and Rezende et al. (2014).

Figure 2.5: Schema of VAE.

VAEs are based on an auto-encoder architecture. In the standard auto-encoder approach, the purpose is to learn a (compact) representation of the data, removing some of the possible noise from the input data and preserving only the informative content. The auto-encoding is done employing two neural networks: an encoder E_φ and a decoder D_θ. The encoder maps the input data x ∼ p_data(x) to a latent representation z = E_φ(x). The decoder learns the opposite mapping, from the latent vector z to a reconstructed datapoint x̂ = D_θ(z). The two networks are trained by optimizing the parameters φ and θ to minimize the reconstruction loss between the original data x and its reconstruction x̂.

In the variational auto-encoder, the main idea is to represent a generative model through the joint distribution p(x, z), where the latent random variable Z conditions the generation of the observed data X. Practically, the joint distribution can be decomposed as p(x, z) = p(x|z)p(z) and then, with Bayes' theorem:

p(z|x) = p(x, z) / p(x) = p(x|z) p(z) / p(x)          (2.11)

With this decomposition, the latent variable Z can be modeled by learning a parameterized approximation q_φ(z|x) of the intractable, true posterior p(z|x). A generative model would then result in sampling a latent vector from the approximated posterior and then generating a new datapoint x̂ ∼ p_θ(x|z) from the learned likelihood distribution. In the VAE, the true posterior is replaced with a family of parameterized approximate posteriors q_φ(z|x) ≈ p(z|x), learned iteratively by variational expectation maximization to maximize the (log-)evidence lower bound (ELBO).

log p(x^(i)) ≥ E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z)] − D_KL(q_φ(z|x^(i)) ‖ p(z))          (2.12)

At training time, a VAE encoder E_φ learns the posterior as a mapping from the input data x ∼ p_data(x) to the parameters of the latent variables z. Commonly, the posterior is modeled as a multivariate Gaussian distribution with parameters µ and σ, such that q_φ(z|x) = N(µ, diag(σ^2)) can be represented and sampled efficiently. From this parameterization of the posterior, using the reparameterization trick, a latent vector z ∼ q_φ(z|x) is sampled. The decoder D_θ then uses the sampled vector for the current datapoint to learn the function p_θ(x|z), mapping back to a reconstruction of the real input, x̂. Both the encoder E_φ and the decoder D_θ are neural networks with respective parameters φ, θ, learned by optimizing the VAE loss:

L(x^(i); θ, φ) = − log p_θ(x^(i)|z^(i)) + D_KL(q_φ(z^(i)|x^(i)) ‖ p(z))          (2.13)

The VAE loss shown in Eq. 2.13 combines the reconstruction loss between the real and reconstructed datapoints x and x̂ (in the first term), and the variational loss (in the second term). The variational loss, derived from the variational maximization of the ELBO, is defined as the KL-divergence between the current parameterization of the posterior q_φ(z|x^(i)) and the prior p(z). In the original VAE definition, the prior is a normal distribution p(z) = N(0, I). At inference time, a latent vector z is sampled from the prior and fed to the decoder, to generate a new datapoint x̂ ∼ p(x|z). In comparison to GANs, the variational auto-encoder allows for an explicit representation of the approximated posterior distribution, which can be modeled by setting different (complex) priors. In addition, the independence between the latent variables z can be exploited to learn a disentangled latent distribution where each variable models a different quality of the data.
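The loss of Eq. 2.13 is commonly implemented with the reparameterization trick mentioned above. The sketch below (a generic PyTorch illustration written for this text, with a Gaussian likelihood stand-in via squared error; it is not a model from this thesis, and the network sizes are arbitrary) shows the two terms for a single minibatch:

import torch
import torch.nn as nn

x_dim, z_dim = 16, 4
encoder = nn.Sequential(nn.Linear(x_dim, 32), nn.ReLU(), nn.Linear(32, 2 * z_dim))
decoder = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, x_dim))

x = torch.randn(8, x_dim)                      # a minibatch standing in for p_data(x)

mu, log_var = encoder(x).chunk(2, dim=-1)      # parameters of q_phi(z|x) = N(mu, diag(sigma^2))
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * log_var) * eps        # reparameterization trick: z ~ q_phi(z|x)

x_hat = decoder(z)
recon = ((x - x_hat) ** 2).sum(dim=-1)         # -log p_theta(x|z) up to constants (Gaussian assumption)
kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(dim=-1)  # closed-form D_KL(q || N(0, I))
loss = (recon + kl).mean()                     # Eq. 2.13 averaged over the minibatch
loss.backward()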

Multiple approaches to VAE-based graph generation have been proposed in recent years (Kipf and Welling, 2016b; Grover et al., 2017). For example, Simonovsky and Komodakis introduce in their work the GraphVAE (Simonovsky and Komodakis, 2018). In their approach, graph convolutional layers are used to define the encoder network, and both encoding and decoding steps are conditioned on a particular graph label y. This conditioning results in disentangled latent vector representations and allows for conditioning the sampling at inference time. The output of the model is a probabilistic fully-connected graph, from which it is possible to sample adjacency matrices and node and edge feature matrices. A relevant problem in this approach is that a graph matching function is needed to compute the reconstruction loss at training time. As we have discussed in the previous section, the adjacency matrix can be represented in any of its permutations, and the problem of finding the best alignment between Â and A is polynomial in the number of nodes, in particular O(N^4). For this reason, the method proposed in GraphVAE is not scalable to large graphs, and the authors demonstrate consistent results only for graphs with up to 38 nodes.

Deep Autoregressive Models One of the main limitations highlighted by the authors of the MolGAN and GraphVAE methods discussed in the previous paragraphs is that these models try to learn the generative process for the probability of the graph p(X, A, E) in a one-shot fashion. Although the results are promising for small graphs, these approaches are not able to scale to larger graphs, like citation networks, road maps or protein graphs. In these types of graphs, the matrices A, E become very large, since their size is quadratic in the number of nodes. The problems with bigger graphs reside in the difficulty of training and the mode collapse effect (for MolGAN), and the exploding complexity of the graph matching algorithm (in GraphVAE). Addressing this issue, recently developed approaches formulate the generative process in a recurrent way (Li et al., 2018a; You et al., 2018a). Liu et al. (2018) propose to use GGNNs (Li et al., 2015) for the encoder and the decoder in a VAE, where the decoder takes as input the latent representation of the graph and gradually refines each node and edge to obtain the final reconstructed graph. To learn this recurrent formulation of the graph, the authors linearize the graph arbitrarily over one of the possible permutations and train with a loss formulation that only captures the error at the current step in the iterative process, marginalizing out the path taken to reach the current graph state. Although this approach simplifies the problem of modeling the graph probability by decomposing it into a sequence of conditional probabilities, some of the problems highlighted before still arise. First of all, the maximum number of nodes for a generated graph has to be specified beforehand, which may not be possible for dynamic datasets. Moreover, to learn a model which is invariant to the ordering, the training should use all the possible permutations of the adjacency matrix for every graph in the dataset.

A different approach to modeling graph generation with deep autoregressive models is proposed by You et al. (2018b) with GraphRNN. In their work, the authors aim to generate unlabeled graphs (without node and edge features) matching the characteristics of an observed distribution of real graphs. The proposed generative process describes the generation of a graph incrementally, by sequentially adding new nodes to the graph generated so far and at the same time modeling the connectivity between the newly added node and the already existing ones. Formally, the authors fix a permutation π over the nodes for all the graphs G in the datasets and learn the generative process underlying that particular permutation. In particular, each graph is linearized as a sequence S of rows of the adjacency matrix A, masked out after the current node:

S^π = f_S(G, π) = (S_1^π, . . . , S_n^π)          (2.14)
S_i^π = [A^π_{1,i}, . . . , A^π_{i−1,i}]^T,  ∀i ∈ {2, . . . , n}          (2.15)

The generation of the graph is then modeled as a product of conditional probabilities under this decomposition. This factorization is based on the assumptions that the generation of a node i only depends on the previously generated nodes < i, and that the generation of an edge from node i to node j (where j < i) only depends on the existence of edges between node i and all the nodes < j in the graph. Looking back at Eq. 2.4, we note the strict similarity with the factorization used for language modeling in NLP. Formally, the joint probability in GraphRNN results in:

p(S^π) = ∏_{i=1}^{n+1} p(S_i^π | S_{<i}^π) = ∏_{i=1}^{n+1} ∏_{j=1}^{i−1} p(S_{i,j}^π | S_{i,<j}^π, S_{<i}^π)          (2.16)

In the implementation, the generative process is modeled with two GRUs (Chung et al., 2014), one to learn the ordered node sequence and update a latent representation of the partial graph, and one which uses this latent representation to model the connectivity of each new node with respect to the previous nodes in the graph. The authors show the effectiveness of their model on a variety of citation networks, geometrical graphs, and protein networks, comparing GraphRNN to a non-recurrent approach (GraphVAE (Simonovsky and Komodakis, 2018)) and other statistical models (Erdős and Rényi, 1959; Albert and Barabási, 2001). To evaluate the similarity between the real distribution of graphs (test set) and the generated distribution, the Maximum Mean Discrepancy (Gretton et al., 2012) is computed over a set of graph statistics: clustering coefficient, degree, and orbit.
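The linearization of Eq. 2.14-2.15 is easy to reproduce. The sketch below (a NumPy illustration written for this text, using a BFS ordering as a stand-in for the permutation π and omitting GraphRNN's fixed-bandwidth truncation) builds the sequence S^π from an adjacency matrix:

import numpy as np
from collections import deque

def bfs_order(A, start=0):
    """A node permutation pi obtained by breadth-first search from a start node."""
    seen, order, queue = {start}, [start], deque([start])
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(A[u]):
            if v not in seen:
                seen.add(v)
                order.append(v)
                queue.append(v)
    return np.array(order)

def graph_to_sequence(A, pi):
    """Eq. 2.14-2.15: S_i^pi collects the edges between node i and all earlier
    nodes under the ordering pi (no fixed-bandwidth truncation here)."""
    A_pi = A[pi][:, pi]                       # adjacency matrix under permutation pi
    return [A_pi[:i, i].copy() for i in range(1, len(A_pi))]

# Example on the 4-node graph of Figure 2.3 (nodes re-indexed from 0).
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [1, 0, 1, 0]])
pi = bfs_order(A)
S = graph_to_sequence(A, pi)
print(pi, [s.tolist() for s in S])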


2.2.3 Conditional Graph Generation

Conditional generative models are an extension of the generative model framework, where the generative process is conditioned on some additional input vector y. Mirza and Osindero (2014) define the conditioning in generative adversarial networks by including an additional feature vector y as input to the generator network, in addition to the sampled latent vector z. In their work, the feature vector y is used to choose the class of the datapoint to be generated. Other works have applied conditional generation to graph modeling. Li et al. (2018b) propose a conditional VAE to generate new molecules for drug discovery. In their model, the conditioning is used to define some criteria to optimize in the generation of the molecule, such as being synthetically available or having a high affinity for a specific target.

A graph generative model could also be conditioned on something more complex than a feature vector or one-hot class representation. In recent literature (Schuster et al., 2015; Yang et al., 2018), graph generation is conditioned on an input image, with the task of outputting a scene graph representation of the picture. A scene graph is composed of nodes representing the elements appearing in the picture (e.g. people, objects) and edges describing the logical and positional relations among these elements. In Bradshaw et al. (2018), the conditional graph generation aims to generate the output of a chemical reaction, and the conditioning is made by defining the reactant and reagent molecules that are the input of the reaction.

Image-to-Graph Generation In our work, we also approach the problem of image-conditioned graph generation, but from a different point of view. In particular, we propose a method to extract underlying geometric graphs from images, resulting in an accurate representation of nodes (in terms of (x, y) coordinates in the image) and edges (describing the connectivity in the graph). A particularly relevant practical application of this task is the automated extraction of accurate road map graphs from satellite images. A solution able to provide precise reconstructions of road maps would allow for complex reasoning on satellite images in relevant applications. Examples are the generation of accurate annotations for unmapped territories, up-to-date maintenance of large-scale road maps, and improving the logistics of first-aid operations in case of natural disasters.

In his work, Muruganandham (2016) proposes a CNN-based architecture to generate pixel-wise segmentations from raw satellite images, showing promising results, as shown in Fig. 2.6. In 2018, the competition "Broad Area Satellite Imagery Semantic Segmentation" (BASISS) took place, aimed at the extraction of road masks from massive SpaceNet images (Etten et al., 2018). All the top solutions in the competition were based on generating an accurate pixel-wise segmentation of the road maps with U-nets. From this segmentation, a sequence of post-processing steps and the usage of external rule-based libraries were finally applied to extract a graph representation of the semantic segmentation. However, no solutions involving deep learning approaches have been proposed to automatically extract graphs directly from semantic segmentations.

Figure 2.6: Reprinted from (Muruganandham, 2016) with the author's permission. In the middle, the pre-processed input image. On the left, the ground-truth road semantic segmentation, and on the right, the segmentation proposed by their model.

Since many works in the existing literature have focused on extracting semantic segmentations of roads from satellite images, in our work we concentrate on the problem of extracting road map graphs directly from semantic segmentations of satellite images. To the best of our knowledge, there has been no other work in the deep learning research community approaching the extraction of graph structures from images, with the exception of the case of abstract scene graphs mentioned in this section. A complete pipeline for end-to-end road network extraction from raw satellite images can be defined by combining the method proposed in this work with existing semantic segmentation networks. However, this is not in the scope of this project, as we want to focus on different methods of conditional graph generation for the image-to-graph task. In particular, we will approach the generative process based on the recurrent formulation proposed in GraphRNN (You et al., 2018b) and defined in Eq. 2.16. We will extend existing methods for conditional generation and introduce new deep autoregressive models to solve this task, with a particular focus on the effect and improvements of attention-based solutions in comparison with traditional RNNs. The solutions proposed in this work are easily adaptable to any case of conditional graph generation in a supervised learning setting, as we will discuss in Sec. 3.3.


Chapter 3

Methods

In this chapter, we introduce our approaches to deep autoregressive generation of graphs. The discussion starts with a description of datasets and tasks, followed by a motivation of our methods and a discussion of the metrics used for evaluation. In Section 3.1, we start by considering the unlabeled graph generation task as in You et al. (2018b). We introduce the datasets and the evaluation metrics, discuss the previous approaches based on RNNs, and then motivate our contribution to extend the GraphRNN model using attention mechanisms as in Vaswani et al. (2017). In Section 3.2 we introduce the image-to-graph task in the conditional graph generation setting. We design a dataset based on the real-world task of automated road map extraction from satellite images, discussing details of data gathering, pre-processing, and graph linearization techniques. We follow in Section 3.3 by introducing an encoder-decoder architecture for conditional graph generation, discussing different choices of networks for encoders and decoders, including new baselines, extensions of existing models from the literature and, finally, our attention-based methods. We conclude the chapter in Section 3.4 by explaining the training and evaluation procedures. In the first part, we describe the training settings for our experiments along with the choices for the loss function. Then, we propose a set of standard evaluation metrics and introduce a new, effective metric to capture the distance between 2-dimensional graphs underlying images.

3.1

Unlabeled Graph Generation

One of the main goals of this project is to explore the effectiveness of attention-based approaches to improve deep autoregressive graph generation. As a preliminary contribution of our work, we first consider the case of unlabeled graph generation. In unlabeled graph generation, the task is to generate graphs that are only described by their adjacency matrix A and do not contain any label information on edges or nodes.

3.1.1

Datasets and Metrics

To have a fair comparison with non-attentive deep autoregressive models, we base our work on the datasets and solutions discussed in You et al. (2018b). In particular, we consider the four main datasets selected in the original paper, and then introduce two new datasets to discuss the scalability of models to larger graphs.

Figure 3.1: Samples from the 6 datasets for unlabeled graph generation. From left to right columns: Grid, Community, Ego, Protein, Community Big, Protein Big. Notice the similarity in structure between the original and the two enlarged versions of the Community and Protein datasets.

In Figure 3.1 we plot samples from the six datasets. The Grid dataset (column 1) is an artificial dataset of 2D regular structures. Community is formed by 2-community graphs generated with the Erdős–Rényi model (Erdős and Rényi, 1959) with parameters n = |V|/2 and p = 0.3. Inter-community edges are added with probability p_i = 0.05|V|.

Ego is a document citation graph extracted from the Citeseer network. Protein contains protein graphs where every node is an amino acid and two nodes are connected if they are less than 6 Angstroms apart. We then use the same procedures described to generate the Community and Protein datasets to extract two new datasets with a larger number of nodes


or edges. Community Big contains graphs with a high number of edges: 1243 ≤ |E| ≤ 3391. Protein Big contains graphs with a high number of nodes: 401 ≤ |V| ≤ 903. We introduce these new datasets to better test the limitations of the GraphRNN model. In particular, since GraphRNN is based on recurrent neural networks, it could potentially struggle to capture longer-term dependencies over node and edge sequences. In Table 3.1 we report additional statistics about the distribution of |V| and |E| in the graphs of each dataset.

dataset          |G|    min/mean/max |V|    min/mean/max |E|
Grid             100    100/210/361         180/392/684
Community        510    60/110/160          232/962/1959
Ego              757    50/145/399          57/335/1071
Protein          918    100/258/500         186/646/1575
Community Big    510    160/210/260         1243/2238/3391
Protein Big      218    401/532/903         870/1362/2774

Table 3.1: Statistics on the graph distributions in the datasets for unlabeled graph generation. Community Big has the largest number of edges per graph |E|, Protein Big has the largest number of nodes per graph |V|.
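As an illustration of the Community generation procedure described above, the snippet below is a minimal sketch using networkx. Since the exact inter-community wiring is not fully specified here, we add roughly 0.05|V| random edges between the two halves; this choice, the function name sample_community_graph, and the example size of 120 nodes are assumptions for illustration only.

# Minimal sketch of 2-community Erdős–Rényi graph generation (illustrative only).
import random
import networkx as nx

def sample_community_graph(num_nodes: int, p: float = 0.3) -> nx.Graph:
    half = num_nodes // 2
    g1 = nx.erdos_renyi_graph(half, p)                       # community 1: nodes 0..half-1
    g2 = nx.erdos_renyi_graph(num_nodes - half, p)           # community 2, relabeled to half..num_nodes-1
    g2 = nx.relabel_nodes(g2, {i: i + half for i in g2.nodes})
    g = nx.union(g1, g2)
    # Assumption: add about 0.05 * |V| random inter-community edges.
    for _ in range(max(1, round(0.05 * num_nodes))):
        u = random.randrange(0, half)
        v = random.randrange(half, num_nodes)
        g.add_edge(u, v)
    return g

graph = sample_community_graph(num_nodes=120)
print(graph.number_of_nodes(), graph.number_of_edges())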

To evaluate the models, we conform to the evaluation metrics introduced in the original paper. The proposed metrics are based on Maximum Mean Discrepancy (Gretton et al., 2012) over statistics computed on the set of real graphs and on the set of generated graphs. The considered graph statistics are the clustering coefficient, the node degree, and the count of orbits with 4 nodes. The test set is an approximation of the distribution of real, observed data p_real(x), and the set of generated graphs is an approximation of the distribution of graphs modeled by the generative process p_generated(x). Maximum Mean Discrepancy can be derived as in Gretton et al. (2012) for two probability distributions p and q as follows:

\begin{equation}
\mathrm{MMD}^2(p \,\|\, q) = \mathbb{E}_{x \sim p,\, y \sim p}[k(x, y)] + \mathbb{E}_{x \sim q,\, y \sim q}[k(x, y)] - 2\,\mathbb{E}_{x \sim p,\, y \sim q}[k(x, y)] \tag{3.1}
\end{equation}

where the kernel k is chosen to capture high-order moments and is based on the Wasserstein distance:

\begin{equation}
k = k_W(p, q) = \exp\!\left(\frac{W(p, q)}{2\sigma^2}\right), \qquad W(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(x, y) \sim \gamma}\big[\|x - y\|\big] \tag{3.2}
\end{equation}
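The sketch below shows how an empirical estimate of Eq. 3.1 could be computed between two finite sets of 1-D graph statistics (e.g. degree sequences) with a Wasserstein-based kernel. The bandwidth sigma, the function names, and the toy data are assumptions; in addition, the kernel here uses a negative exponent, exp(-W/(2σ²)), so that similarity decays with distance as in common Gaussian-EMD implementations, which is an assumption about the intended sign in Eq. 3.2.

# Sketch of an empirical MMD^2 estimate between two sets of graph statistics.
import numpy as np
from scipy.stats import wasserstein_distance

def kernel(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    # Wasserstein-based kernel; negative exponent assumed (see note above).
    return np.exp(-wasserstein_distance(x, y) / (2.0 * sigma ** 2))

def mmd_squared(samples_p, samples_q, sigma: float = 1.0) -> float:
    k_pp = np.mean([kernel(x, y, sigma) for x in samples_p for y in samples_p])
    k_qq = np.mean([kernel(x, y, sigma) for x in samples_q for y in samples_q])
    k_pq = np.mean([kernel(x, y, sigma) for x in samples_p for y in samples_q])
    return k_pp + k_qq - 2.0 * k_pq

# Toy example: degree sequences of "real" vs. "generated" graphs.
real = [np.array([1, 2, 2, 3]), np.array([2, 2, 4, 4])]
fake = [np.array([1, 1, 2, 2]), np.array([3, 3, 3, 4])]
print(mmd_squared(real, fake))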


3.1.2

Models

To compare the effect of attention-based mechanisms against traditional recurrent neural networks, we base our work on modifications of the original GraphRNN architecture, to avoid any possible effect caused by additional changes in the models. The original GraphRNN employs two Gated Recurrent Units to model the node-wise and edge-wise updates. The outputs of the edge-wise updates are the means of independent Bernoulli distributions, which are sampled to obtain the binary vector describing the adjacency column for the current node.
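A simplified PyTorch sketch of this two-level recurrence is given below: a node-wise GRU updates a graph-level state from the previous adjacency vector, and an edge-wise GRU emits Bernoulli means for the entries of the next adjacency vector, which are then sampled. This is not the reference GraphRNN implementation; the class name, the bandwidth M and the hidden sizes are illustrative assumptions.

# Simplified sketch of the node-wise / edge-wise GRU recurrence (illustrative only).
import torch
import torch.nn as nn

class TwoLevelGraphRNN(nn.Module):
    def __init__(self, M: int = 16, node_hidden: int = 128, edge_hidden: int = 64):
        super().__init__()
        self.M = M
        self.node_rnn = nn.GRUCell(M, node_hidden)            # node-wise update
        self.edge_rnn = nn.GRUCell(1, edge_hidden)            # edge-wise update
        self.init_edge = nn.Linear(node_hidden, edge_hidden)  # node state -> initial edge state
        self.edge_out = nn.Linear(edge_hidden, 1)             # Bernoulli mean per edge

    def step(self, prev_adj_col, node_state):
        # prev_adj_col: (batch, M) binary adjacency vector of the previous node
        node_state = self.node_rnn(prev_adj_col, node_state)
        edge_state = self.init_edge(node_state)
        edges, prev_edge = [], torch.zeros(prev_adj_col.size(0), 1)
        for _ in range(self.M):
            edge_state = self.edge_rnn(prev_edge, edge_state)
            theta = torch.sigmoid(self.edge_out(edge_state))  # mean of the Bernoulli
            prev_edge = torch.bernoulli(theta)                # sample binary edge
            edges.append(prev_edge)
        return torch.cat(edges, dim=-1), node_state

model = TwoLevelGraphRNN()
adj_col, state = model.step(torch.zeros(2, 16), torch.zeros(2, 128))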

The first extension to the original architecture is GraphRNN-att. In this model, we add an attention layer on top of the node-wise GRU, the edge-wise GRU, or both. The attention layers are based on scaled dot-product self-attention. The context vector c_t, output of the scaled dot-product attention, is residually connected by addition or concatenation with the input of the attention layer h_t to obtain the output of the attention layer \hat{h}_t:

\begin{equation}
\hat{h}_t = [h_t, c_t] \quad \text{or} \quad \hat{h}_t = h_t + c_t, \qquad
c_t = \sum_{i=1}^{t-1} \alpha_{t,i} h_i = \mathrm{softmax}\!\left(\frac{h_t H_{1:t-1}^\top}{\sqrt{n}}\right) H_{1:t-1} \tag{3.3}
\end{equation}
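A minimal sketch of Eq. 3.3 with the additive residual variant is shown below; the batch size, hidden dimension and function name are illustrative assumptions.

# Scaled dot-product attention of h_t over the history H_{1:t-1}, with additive residual.
import math
import torch
import torch.nn.functional as F

def attend_over_history(h_t: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
    # h_t: (batch, n); history H_{1:t-1}: (batch, t-1, n)
    n = h_t.size(-1)
    scores = torch.bmm(history, h_t.unsqueeze(-1)).squeeze(-1) / math.sqrt(n)  # (batch, t-1)
    alpha = F.softmax(scores, dim=-1)                                          # attention weights
    c_t = torch.bmm(alpha.unsqueeze(1), history).squeeze(1)                    # context vector (batch, n)
    return h_t + c_t   # additive residual; concatenation [h_t, c_t] is the other option

h_t = torch.randn(4, 128)
history = torch.randn(4, 10, 128)
out = attend_over_history(h_t, history)   # (4, 128)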

In the second extension of GraphRNN, we remove one or both of the GRUs, replacing them with transformer-like decoders. Considering the notation for the original Transformer previously described in Sec. 2.2, the node-wise and edge-wise components in GraphRNN-mhdpa are structured as:

\begin{equation}
x_t \;\xrightarrow{\;\text{positional encoder}\;}\; x_t^{(e)} \;\xrightarrow{\;N \times \text{decoder layers}\;}\; x_t^{(d)} \;\xrightarrow{\;\text{feed-forward}\;}\; h_t \tag{3.4}
\end{equation}

Each of the N decoder layers is composed of the following sequence: layer normalization, multi-head self-attention, residual connection, layer normalization, feed-forward, residual connection. The hyper-parameter selection for each of the models is briefly discussed in the experiment section 4.1.
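The sketch below shows one such decoder layer in PyTorch with the pre-layer-normalization ordering described above, plus a causal mask restricting attention to previous steps, as is standard in autoregressive decoders. The dimensions and number of heads are illustrative assumptions, not the hyper-parameters used in our experiments.

# One pre-LN decoder layer: LN -> multi-head self-attention -> residual, LN -> feed-forward -> residual.
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4, d_ff: int = 512):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        t = x.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)  # mask future steps
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out                    # residual connection
        x = x + self.ff(self.ln2(x))        # residual connection
        return x

x = torch.randn(2, 10, 128)
y = DecoderLayer()(x)                       # (2, 10, 128)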

3.2

Toulouse Road Network Dataset

The main contributions of this thesis consist in introducing a new dataset, models, and metrics for conditional graph generation, in particular considering the image-to-graph setting. First of all, we wanted to find a task that would allow for an effective way to evaluate the proposed models,


and that would also be an essential challenge for real-world applications. For these reasons, we choose to apply the proposed methods to automated road map extraction from satellite images, in the image-to-graph setting. As we discussed in Section 2.2, a large number of works already focused on the semantic segmentation of road maps from satellite images. Since studying and implementing computer vision algorithms and pre/post-processing steps for semantic segmentation is not in the scope of this work, we restrict the setting to extracting road networks from already segmented images. A crucial step in our work is the introduction of the Toulouse Road Network dataset1 for conditional graph generation in the

supervised setting. One of the biggest open-source web applications for road maps and satellite data is OpenStreetMap2. In particular, GeoFabrik3 publicly released a few datasets of cities containing geographical information such as railways, road networks, buildings, and waterways. To introduce a labeled dataset for image-to-graph generation, we extract it from the raw road network data for the city of Toulouse, France, available on GeoFabrik and time-stamped June 2017. We start by discussing a sequence of pre-processing steps taken to convert the raw data, provided in shapefile format, to explicit graph structures. Further, we extract the road map dataset as image and graph pairs, and briefly present statistics of the dataset. Finally, we present our approach to linearize the graph structures through node-wise BFS-ordering.

Pre-processing and Statistics In the source data, each road in the map is described as a sequence of line segments, and intersection points between different roads are not modeled. In the original data, each road is represented as an independent connected component in the graph. As a first step to pre-process the data, we divide the map into a grid of squares with a side length of 0.001 degrees, or 0°0'3.6'', resulting

in 123,096 cells (patches on the map). After pre-processing and filtering, each cell will result in a different datapoint in the dataset. At this point, we start detecting intersections between different roads, splitting each pair of incident segments at their crossing point and merging the four resulting segments into a unique road, thus joining the two connected components. For each cell, we also add points where road segments intersect one of the edges of the squared patch. To avoid ambiguity in the image representation for graphs that look almost the same, we merge consecutive segments if the angle between them is close to π, and we merge points if they are closer than 1/20 of the image size.

1 The full dataset along with the code for dataset generation, analysis and PyTorch Dataset API will be available at https://github.com/davide-belli/toulouse-road-network-dataset.
2 www.openstreetmap.org


Figure 3.2: Road networks, railways and waterways in the city of Toulouse. Data from Geofabrik3.

The result of this merging step is best seen visually in the example in Fig. 3.3. Finally, coordinates are normalized to the [−1, +1] range.
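The two merging rules can be sketched as follows; the angle tolerance and the helper names are assumptions for illustration (the 1/20 threshold follows the text, with coordinates assumed to be normalized to [−1, 1], so the image side is 2).

# Illustrative sketch of angle-based segment merging and distance-based point merging.
import numpy as np

ANGLE_TOL = 0.15          # radians around pi (assumed tolerance)
MERGE_DIST = 2.0 / 20.0   # 1/20 of the image side in normalized coordinates

def angle_between(p, q, r):
    # angle at q formed by segments (p, q) and (q, r)
    v1, v2 = p - q, r - q
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def should_merge_segments(p, q, r):
    # merge consecutive segments when they are nearly collinear (angle close to pi)
    return abs(angle_between(p, q, r) - np.pi) < ANGLE_TOL

def should_merge_points(a, b):
    return np.linalg.norm(a - b) < MERGE_DIST

p, q, r = np.array([0.0, 0.0]), np.array([0.5, 0.01]), np.array([1.0, 0.0])
print(should_merge_segments(p, q, r), should_merge_points(q, np.array([0.52, 0.03])))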

Taking a closer look at the graphs generated at this point, we notice that a relevant percentage of the cells contain no or only a few road segments. We decide to filter the dataset generated so far by keeping only datapoints with at least 4 nodes in the graph. The reason motivating this filtering is that we consider the automated extraction of simple graphs with fewer than 4 nodes from image representations to be trivial. Besides increasing the size of the dataset significantly and slowing down the training process, such cases would be uninformative for testing the capabilities of different generative models. We also remove the right tail of the graph distribution in terms of |V| and |E|, in other words, outlier graphs with too many edges per image. These datapoints contain overly cluttered and noisy graphs which, according to early experiments on the dataset, significantly slow down the training of the proposed models. The filtering of the right tail of the graph distribution is done at the 95th percentile of the distribution, resulting in graphs with up to 19 edges and 16 nodes. The filtered and pre-processed graphs extracted from each patch in the map grid are then plotted and saved as 64 × 64 pixel images, and unique identifiers are assigned to each datapoint.


Figure 3.3: Pre-processing step in the satellite dataset generation. Segment merging is based on angle similarity, and point merging is based on Euclidean distance. This pre-processing step removes details in the graph that are difficult to capture and that result in minimal changes in the image representation. Plotting nodes and coloring roads is only for visualization purposes.

Augmentation of the dataset is done at this point by translation (shift), rotation and flipping. In particular, we translate the image (and the graph) vertically and/or horizontally by 1/4, 2/4 or 3/4 of the square side (augmentation factor ×16). For rotations we use angles of π/2, π and 3π/2, optionally flipping the source image (augmentation factor ×8). The total augmentation factor is ×128 the original number of datapoints.
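A sketch of these transformations applied to the normalized node coordinates is given below, producing the 8 rotation/flip combinations and the 4 × 4 grid of quarter-side shifts (including the identity), for 128 variants in total. The function names are assumptions; translation here simply shifts coordinates, whereas in the actual pipeline the shifted patch is re-cropped from neighbouring cells, which is omitted in this sketch.

# Sketch of the x128 augmentation on (num_nodes, 2) coordinates in [-1, 1].
import numpy as np

def rotate_flip(coords: np.ndarray, k: int, flip: bool) -> np.ndarray:
    # rotate by k * 90 degrees, optionally flip horizontally
    theta = k * np.pi / 2.0
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    out = coords @ rot.T
    if flip:
        out[:, 0] = -out[:, 0]
    return out

def translate(coords: np.ndarray, dx_quarters: int, dy_quarters: int) -> np.ndarray:
    # shift by multiples of 1/4 of the square side (the side is 2 in [-1, 1] coordinates)
    return coords + np.array([dx_quarters, dy_quarters]) * 0.5

coords = np.array([[0.1, -0.3], [0.4, 0.2]])
augmented = [translate(rotate_flip(coords, k, f), dx, dy)
             for k in range(4) for f in (False, True)
             for dx in range(4) for dy in range(4)]
print(len(augmented))   # 128 variants per datapoint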

Dataset splits Finally, we split the dataset into training, validation and test sets. To obtain consistent splits in terms of the distribution of datapoints, we use a row-column splitting criterion, as shown in Fig. 3.4. The idea behind this splitting is that we want to capture similar distributions of graphs in each set. In Fig. 3.2 we see how some areas of the map have very cluttered road networks, while others have sparse networks or no roads at all. The proposed way of splitting the dataset enforces enough diversity inside each split, and similarity between the distributions in different splits. Since we use translation in the augmentation, we also remove the regions of the grid at the edge between two different dataset splits, to avoid repetition of (parts of) the graphs in different splits. This way of splitting the dataset also minimizes the number of datapoints discarded in the conflict regions, which is why we avoid a random selection over the map to split the dataset.
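A purely illustrative sketch of this idea is shown below: grid cells are assigned to a split according to which band of rows they fall in, and cells bordering a different split are discarded so that shifted augmentations cannot leak across splits. The band layout, period and function names are assumptions; the actual assignment used for the dataset is the one shown in Fig. 3.4, not this one.

# Hypothetical row-band split with removal of boundary cells (illustrative only).
def assign_split(row: int, period: int = 14) -> str:
    band = row % period
    if band <= 7:
        return "train"
    return "valid" if band <= 10 else "test"

def keep_cell(row: int, period: int = 14) -> bool:
    # drop cells whose vertical neighbours belong to a different split
    here = assign_split(row, period)
    return assign_split(row - 1, period) == here and assign_split(row + 1, period) == here

cells = [(r, c) for r in range(42) for c in range(42)]
splits = {(r, c): assign_split(r) for r, c in cells if keep_cell(r)}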

To ensure that the distribution of graphs in the training set is aligned with the ones in the validation and test sets, we conduct a study on the marginal and joint distributions of |V| and |E| for the different splits. In Fig.
