Using Deep Learning on Graphs to Detect Illegal Vacation Rentals


MSc Artificial Intelligence

Master Thesis

Using Deep Learning on Graphs to Detect Illegal Vacation Rentals

by

Petra Ormel

10607005

May 24, 2020

36 EC September 2019 - May 2020

Supervisor & Examiner:

Dr. S. (Stevan) Rudinac

Other Supervisor:

U. (Ujjwal) Sharma MSc

Assessor:

Prof. Dr. M. (Marcel) Worring

Daily supervisor municipality:

A. (Ammar) Khawaja MASc


Abstract

In this research we investigate whether graph representation learning can be exploited to capture valuable relational information in combined data sources that contain multiple modalities and do not follow a clear graph structure. The use of graph representation learning has been extensively researched for modeling higher order relations in data. Nevertheless, most present feature learning approaches are only applicable to homogeneous graphs. Graph based approaches that are applicable to heterogeneous graphs are often only evaluated on tasks that utilize data with a clear graph structure. However, for some tasks data can be collected from different sources that do not follow a clear graph structure but still contain relational information which might be valuable. One such task is the detection of illegal vacation rentals, which serves as the use case in this research. We deploy a graph based model built upon a Heterogeneous Graph Attention Network, which is able to extract semantic relations between different graph structures. We conduct experiments on real world data provided by the municipality of Amsterdam, data from a short-term housing rental platform, and publicly available data sources. We found that a graph neural network is able to extract additional information from numerical and text data, originating from different sources, by representing connectional structures inside a graph in the form of metapaths. Furthermore, we analyzed several methods to define the importance of different metapaths, which provide insight into what type of graph structures are more informative than others. We discovered that, similarly to feature importance, metapath importance can be used to reduce the number of metapaths and decrease training time while preserving equal performance. Moreover, we found that metapath design can influence the added value. Finally, we show that the graph model can be used to increase performance on the classification task of predicting illegal vacation rentals.


Acknowledgements

First of all, I want to thank my supervisor Stevan for all his guidance throughout my thesis. Stevan, thank you for all your ideas and especially your positivity; it always gave me new energy and encouraged me a lot. Furthermore, I want to thank my team at the municipality: Ammar, Geert, Saban and Tim, you were great company and always open to listen and think along if I just needed someone to spar with. Special thanks to Ammar: you always helped me immediately if I had a question or needed something, which was very nice. Moreover, I would like to thank Ujjwal. I really appreciate all your time and help during my thesis. Furthermore, I really want to thank Iva for her nice feedback. It always felt like you really cared about us and all our theses, and you even provided feedback on your ‘not working’ days. Moreover, I’m grateful to Maarten for his feedback points.

Also, I’d like to thank Bert and my friends for supporting me through the last period. Moreover, I want to thank Bert for providing me his home as my working space during the last Covid-19 months. Finally, I would like to thank my parents for always supporting me through my whole study career and for giving it a try to proofread my thesis.


1 Introduction

Recently, the use of graph representation learning has been extensively researched for modeling higher order relations in data [1, 2, 3, 4, 5]. Graph neural networks can be used as feature learners which capture connectional information between graph nodes and map it to a new fixed representation. They have proven useful for a wide range of tasks, such as classification in citation networks or knowledge graphs [6], capturing relations between entities in social networks [1], content recommendation [7] or identifying protein functions [8]. However, many of the proposed models are only applicable to homogeneous networks, while real-world data is often heterogeneous. Most models that are applicable to heterogeneous data are evaluated on single networks or data that follow a natural graph structure such as citation and social networks, molecule bindings, IMDB or Wikipedia [9, 6, 8, 10, 4]. However, for some tasks, data can be collected from different data sources that do not follow a clear graph-like structure but still contain relational information that might be valuable for the specific task. Therefore, in this research we investigate whether graph representation learning can also be applied to capture valuable relational information in real-world combined datasets where the graph structure is less clear-cut and could be constructed in various interpretable ways.

A task in which data can be collected from various sources that can contain valuable relational information, and where the data does not follow a straightforward graph structure, is the task of detecting illegal vacation rentals. With the increase in popularity of housing rental platforms such as Airbnb, internet housing rental services offer an easy framework for housing investors. In many cities around the world, large debates prevail about how these platforms put pressure on the housing market and cause nuisance tourism. Amsterdam is among the most severely afflicted of these cities. In order to tackle this problem, the municipality of Amsterdam has set up regulations under which vacation rental is illegal. To uphold these regulations, the municipality tries to track down illegal vacation rentals. Originally, the municipality was mainly dependent on civilian reports. Nowadays, enforcers search for suspicious listings on housing rental platforms as well. However, since housing rental platforms often only show the neighborhood within some range of the house, finding suspicious listings and tracking down the corresponding address is a difficult task. Therefore, a model that could still capture relations between houses and vacation rental platforms might be of great value in the task of detecting illegal vacation rentals.

There are various data sources available that can contribute to the task of predicting illegal vacation rentals, such as property characteristics, data related to inhabitants, publicly available data and data from a short-term vacation rental platform. Initial experiments conducted by the municipality of Amsterdam in cooperation with the University of Amsterdam have investigated the contribution of Airbnb features in an early fusion approach for the task of prioritising illegal vacation rentals. Their research showed that the early fusion approach yielded only a very small contribution. However, the Airbnb data contains different modalities such as numerical and continuous features and text and image attributes. Since graph based approaches have shown promising results on modeling multi-modal data [1, 11, 2], a graph based approach might be a better method to incorporate the Airbnb data.

Connections between houses, which are based on property characteristics or on advertisements on platforms that only indicate the neighborhood of the house, can be constructed into a graph in several different ways. This is different from the data used in most papers about deep learning on graphs, which follow a more natural graph structure. For instance, in citation networks it is straightforward to link papers to papers if they cite each other, because they have the same author, or because they were published at the same venue. In molecule bindings it is straightforward to create connections between proteins where one knows they exist. As the construction of graph structures can be done in several ways, this research will also investigate which types of constructions are more informative than others. This will not only provide insight into what type of graph structures are more important than others, it will also contribute to the overall interpretability of graph based approaches. For the municipality, the interpretability of algorithms is an important aspect as well, especially in fraud detection tasks, as the municipality has to be able to justify why it would check upon certain houses.

In the context of a graph based approach for the task of predicting illegal vacation rentals, houses can be represented as graph nodes. House nodes can be connected to each other by direct edges or by edges to other types of nodes which represent properties such as the number of inhabitants or the neighborhood of the house. Since the resulting graph nodes are of different types, the graph will become heterogeneous. Graph neural networks can be applied to map the house nodes to a machine interpretable embedding which encodes structural information of the heterogeneous graph. These embeddings can be fine-tuned in a (semi-)supervised fashion in which fraud labels are used to steer the embeddings in a direction where they will be useful for the prediction of illegal vacation rentals.


A possible way to address the task of predicting illegal sublets is with a fraud detection approach. For instance, suspicious houses can be selected with the use of an anomaly detection method applied directly on concatenated heterogeneous features associated with houses. However, another study conducted by the municipality of Amsterdam in cooperation with the UvA, about the explainability of prioritising housing fraud in general, showed that some characteristics of houses, such as location and the number of inhabitants, affect the probability of housing fraud. Therefore, it is not expected that the fraudulent houses will behave like anomalies. Accordingly, the behaviour of fraud in this task is different from, for instance, credit card fraud, where at some moment a change of behaviour can be noticed through an unusual transaction. Therefore, this research will approach the task of predicting illegal vacation rentals as a classification task rather than an anomaly detection task.

In conclusion, the main focus of this research is to investigate how graph based models can be used to examine complex relations between data from different real world data sources and different modalities, which do not follow a straightforward graph-like structure as in most research cases. Moreover, it will investigate which graph structures are more informative than others, which will contribute to the interpretability of graph based models. As a use case this research will approach the task of detecting illegal vacation rentals in Amsterdam.

1.1 Research Questions

The main research question in this thesis can be summarized as follows:

• Can the use of deep learning on graphs extract complex relations between houses in order to approximate the probability of illegal vacation rentals?

As was explained in the introduction, the model needs to be explainable to some extent. Moreover, data originates from different data sources and the goal is to learn structural information within the data to optimize performance. Therefore, we propose the following sub research questions:

• What type of graph based model is best suited for this task?

• Which graph structures are more informative than others?

In order to give some overall view of the benefits of using a graph based approach, this research will answer the following sub-question:

• What are the advantages and disadvantages of the use of a graph based approach over the use of feature classifiers such as Logistic Regression and Random Forest?

1.2 Contributions

This research will be innovative in a few ways. First of all, most papers that are written about Graph Neural Networks test their model on a single dataset which has a straightforward graph structure, such as citation/social networks, molecule bindings or Wikipedia [9, 6, 8, 10]. This research will investigate whether representation learning can also increase performance on different combined datasets where a graph representation of the data is less straightforward, and with the use of a heterogeneous graph structure. Moreover, previous research raised serious concerns about a gap between the academic research in the field of representation learning and how the methods can be applied in real world settings [12]. This research will provide another example of how a real world problem can be approached with a graph based model. Finally, this research will investigate what type of graph connections are more informative than others and it will compare the results of experiments with and without the use of two different types of attention. Therefore, this research will also contribute to the interpretability of representation learning, which can help the research field to move forward.

1.3 Outline

In the next section, Section 2, we start with providing some theoretic background information about how graphs can be used in deep learning and how data from multiple modalities can be used in graphs. In the related work in Section 3, we provide an overview of the origin and recent popular graph based approaches with their advantages


for our task? Section 4 will first explain more about the context of this research. Secondly, it elaborates on the data that was collected. Thirdly, it will provide a data analysis on which the details of our research approach were based. Fourthly, the graph structures, or metapaths, that were constructed in this research will be discussed. Finally, it provides a detailed explanation of the model architecture. Section 5 will provide the experimental setup that was used, the experiments and an evaluation of their results. With the use of those results the main research question together with the last two sub-research questions will be evaluated and answered. Section 6 will summarize all findings in a conclusion. Finally, Section 7 provides a discussion and possible future work.


Figure 1: Graphs and metapaths. (a) Homogeneous graph. (b) Heterogeneous graph. (c) Metapaths.

2 Theoretic Background

This section will provide some basic background concepts in the field of representation learning. Section 2.1 will discuss how graphs can be constructed in representation learning. Section 2.2 will discuss how data from multiple modalities can be used in graph based approaches.

2.1 Deep learning on graphs

In general, the main reason to use graphs in the field of deep learning is to capture higher order relations between entities [13]. Graph based approaches try to maintain structural and relational connections in data while they map them to a compressed representation or embedding. The new embedding will contain structural information inside the vector representation. Therefore, graph representation learning can also be seen as a form of feature learning. The resulting representation can be machine interpreted and used for several machine learning tasks such as classification, ranking or recommendation [6, 14, 15]. Representation learning is not limited to graph data; it can also be performed on other types of data, such as words, documents or images. In the case of graph-node embeddings, a node embedding contains information about its position in the graph and the relational information of neighboring nodes. Different methods by which those node embeddings can be learned will be explained in Section 3. First, it will be discussed how graphs for deep learning can be created.

In graphs, data samples are represented as graph nodes and relations between nodes are represented with edges between them. In general, graph data is encoded with a tuple $G = (V, A)$, where $V \in \mathbb{R}^{N \times F}$ are the graph vertices or nodes, in which $N$ nodes contain $F$ node features. $A$ is the adjacency matrix, which encodes the edges. The adjacency matrix entries can be defined as follows:

$$a_{ij} = \begin{cases} w_{ij} & \text{if there is a connection between } i \text{ and } j \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

The weight $w_{ij}$ is a scalar that represents some measure of strength or similarity of the connection between $i$ and $j$. In binary graphs this weight equals zero or one.
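As an illustration of Equation 1, the following minimal sketch builds a weighted adjacency matrix from an edge list. The helper name `build_adjacency`, the toy features and the edge weights are assumptions made for this example only and are not part of the data used in this research.

```python
import numpy as np

def build_adjacency(num_nodes, weighted_edges, directed=False):
    """Return the N x N matrix A with a_ij = w_ij, or 0 where no edge exists."""
    A = np.zeros((num_nodes, num_nodes))
    for i, j, w in weighted_edges:
        A[i, j] = w
        if not directed:
            A[j, i] = w  # an undirected edge encodes a symmetric relation
    return A

# Hypothetical toy graph: N = 4 nodes with F = 2 features each.
V = np.array([[1.0, 0.0], [0.5, 0.2], [0.0, 1.0], [0.3, 0.3]])
edges = [(0, 1, 0.8), (0, 3, 0.5), (1, 2, 1.0)]
A = build_adjacency(len(V), edges)
G = (V, A)  # the tuple G = (V, A)
```

For a binary graph, the same helper can be called with all weights set to one.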

In homogeneous graphs all nodes are of the same type. An example of a homogeneous graph is a citation network where papers are represented as nodes and a citation from one paper to another is represented with an edge between the two nodes. This edge is called a homogeneous edge. Edges can be directed or undirected, representing asymmetric or symmetric relations between nodes respectively. Graphs with directed edges are also called Information Networks [16]. In the homogeneous graph in Figure 1a, $P_1$ is connected to $P_2$ and $P_4$. Therefore, $P_2$ and $P_4$ are called the one-hop neighbors of $P_1$. $P_1$ has a relation with $P_3$ because they share the same one-hop neighbor. Therefore, they are called two-hop neighbors of each other.

A heterogeneous graph contains nodes of different types. In the example of the citation network, a heterogeneous setting could have nodes that represent papers, authors and conferences. Apart from two papers being connected because of a citation link, in a heterogeneous setting of the citation network two papers can also be connected because they are written by the same authors or because they are published at the same conference. In addition, two authors can be connected because they have written the same paper, and thus have a co-author relationship. These relationships in heterogeneous graphs can be represented with metapaths. If two nodes are connected by following a specific metapath, they are called metapath based neighbors. A co-author relationship is represented by the metapath APA, and two papers that are published at the same conference are connected by the metapath PCP. Two authors that both have a paper published at the same conference can be represented by the metapath APCPA, see Figure 1c. $P_1$-$P_2$, $P_3$-$P_4$, $P_3$-$P_5$ and $P_4$-$P_5$ are metapath based neighbors for the metapath PCP.

An advantage of heterogeneous graphs is that they can express more information. Different entities can connect the same nodes, but each relation will have its own semantic meaning. Modeling these kinds of semantic relations is often used for semantic proximity search in heterogeneous networks [4, 5]. A disadvantage of heterogeneous graphs is that they are often more difficult to work with. Moreover, when adding new metapaths to the graph it is important to ascertain how the model will interpret the connections. This will prevent the model from learning higher order relations in an inappropriate way. Nevertheless, the use of heterogeneous graphs has been shown to increase performance in several representation learning tasks [4, 5].
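To make the notion of metapath based neighbors concrete, the sketch below derives all paper pairs connected by the metapath PCP. The paper-to-conference assignment and the conference names C1 and C2 are hypothetical; they were chosen so that the result matches the PCP neighbor pairs listed for Figure 1c.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical assignment of papers to conferences (assumption for illustration).
published_at = {"P1": "C1", "P2": "C1", "P3": "C2", "P4": "C2", "P5": "C2"}

def pcp_neighbors(published_at):
    """Return all unordered paper pairs connected by the metapath P-C-P."""
    papers_per_conference = defaultdict(list)
    for paper, conference in published_at.items():
        papers_per_conference[conference].append(paper)
    pairs = set()
    for papers in papers_per_conference.values():
        # Every two papers at the same conference are PCP neighbors.
        pairs.update(combinations(sorted(papers), 2))
    return pairs

neighbors = pcp_neighbors(published_at)
```

Other metapaths, such as APA, could be handled in the same way by grouping on the intermediate node type.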

2.2 Modelling multi-modal data in graphs

Real world data often consists of multiple modalities in which each modality has a different format. Some examples of data with different modalities are: text, images, audio, time, and geographical data. Data from different modalities can add extra information for a specific task. Multiple works have demonstrated that the use of multiple modalities increases performance compared to the use of a single modality [17, 3, 18]. Combining multi-modal data can be done in several ways. One example is early fusion, in which the features of different modalities are concatenated and used as input for a classifier. Early fusion methods fuse multi-modal data in feature space: they integrate unimodal features before learning concepts. Another example is late fusion, which fuses multi-modal data in semantic space. For each individual modality a separate model reduces the unimodal features. The outputs of the separate models are then integrated to learn new concepts. While early fusion can lose correlation in the mixed feature space, late fusion strategies can capture the effects of each modality on the task separately [19]. Multi-modal data can also be combined with the use of graphs. By adding new modalities into a graph structure, inter-relations originate between nodes of different modalities and intra-relations between nodes of the same modality. By learning the node representation, the goal is to capture both of these relations. Therefore, in these approaches a distinction between two types of graph edges is made:

1. Attribute edges: These represent the inter-relations and are heterogeneous edges that connect nodes to their attributes. They are represented with a solid line.

2. Similarity edges: These represent the intra-relations and are homogeneous edges that connect two nodes within the same modality if they are similar. They are represented with a dashed line.

Figure 2: Graph with attribute and similarity edges represented with solid and dashed lines respectively.

Figure 2 shows an example of a multi-modal graph with attribute and similarity edges. Similarity between text document and image feature vectors can be calculated with measures such as cosine similarity and the Euclidean distance. Cosine similarity between two text vectors can be defined as:

$$S_t(t_i, t_j) = \frac{t_i \cdot t_j}{\|t_i\| \, \|t_j\|} \tag{2}$$

Similarity between image feature vectors can be defined with a non-linear transformation of the Euclidean distance with the use of a radial basis function:

$$S_i(i_i, i_j) = \exp\left(-\frac{\|i_i - i_j\|^2}{2\sigma^2}\right) \tag{3}$$

The degree of similarity can be represented in the edge weights. However, using similarity linkages between all nodes often yields dense adjacency matrices. To prevent this, it is also possible to add a weighted or unweighted linkage between two nodes only if one node belongs to the K-nearest neighbors of the other.
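The two similarity measures and the K-nearest-neighbor sparsification can be sketched as follows. The function names, the toy vectors and the choice of directed, unweighted edges are assumptions for illustration; the vectors are assumed to be non-zero so that the cosine similarity is defined.

```python
import math

def cosine_similarity(t_i, t_j):
    """Cosine similarity between two (non-zero) text vectors."""
    dot = sum(a * b for a, b in zip(t_i, t_j))
    norm_i = math.sqrt(sum(a * a for a in t_i))
    norm_j = math.sqrt(sum(b * b for b in t_j))
    return dot / (norm_i * norm_j)

def rbf_similarity(i_i, i_j, sigma=1.0):
    """Radial basis function of the Euclidean distance between image vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(i_i, i_j))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def knn_edges(vectors, similarity, k):
    """Link every node to its k most similar other nodes (directed edges)."""
    edges = set()
    for i, v_i in enumerate(vectors):
        scored = [(similarity(v_i, v_j), j)
                  for j, v_j in enumerate(vectors) if j != i]
        for _, j in sorted(scored, reverse=True)[:k]:
            edges.add((i, j))
    return edges

# Hypothetical text vectors: nodes 0 and 1 are similar, node 2 is different.
text_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
similarity_edges = knn_edges(text_vectors, cosine_similarity, k=1)
```

With `k=1`, every node keeps exactly one outgoing similarity edge, so the number of edges grows linearly with the number of nodes instead of quadratically.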


3 Related Work

Early work on representation learning is inspired by dimensionality reduction techniques that factorize matrices of a network to generate latent dimension features for nodes and edges in graphs [20, 21]. However, decomposing large scale matrices often comes at a high computational cost. Lately, with the advent of deep learning, more neural network based approaches have come up for the task of representation learning.

As was already mentioned in the introduction, various studies have exploited the use of graphs for a wide variety of different tasks. In the recent study by Sharma et al. [2], information graphs are used to predict restaurant popularity. In their work the graph represents a restaurant network where multi-modal data such as the type of cuisine, meals, price, location, users and images of restaurants are used to bring together restaurants with shared or similar attributes. Their method makes use of bi-modal random walks, and they use attention to combine attribute specific representations. Arya et al. use hypergraphs to capture higher order relations between entities in social networks in order to predict missing information about entities [13]. They apply convolutional networks on graphs to capture relations between entities in social networks. Rudinac et al. use Graph Convolutional Networks to classify violent online political extremism in multi-media [3].

Many of these approaches can be divided into two groups: models that make use of a skip-gram architecture with random walks and models that make use of convolutional layers to learn node embeddings. This section will discuss related work built upon these architectures and provide an explanation of their main building blocks.

3.1 Random walk approaches

DeepWalk DeepWalk [22] was the first method which applied unsupervised feature learning techniques from natural language processing to network analysis. Inspired by word2vec [23], DeepWalk leverages the Skip-gram architecture to learn node representations by modeling short random walks. A Skip-gram model can be seen as the reverse of the Continuous Bag of Words (CBOW) model. Where CBOW tries to predict the center word given its surrounding words, the Skip-gram model tries to predict the context words (surrounding words) given the center word. Since the context contains multiple words, it is harder to predict. Therefore, the words are split up into (target, context) pairs. In DeepWalk the Skip-gram architecture is used to maximize the likelihood of a node’s neighborhood by using fixed length random walks in the graph.
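The splitting of a sequence into (target, context) pairs can be sketched as follows, here applied to a walk over nodes rather than to a sentence. The function name, the window size and the example walk are assumptions for illustration and do not reproduce the DeepWalk implementation.

```python
def skipgram_pairs(walk, window_size):
    """Return (target, context) pairs for every position in a walk."""
    pairs = []
    for idx, target in enumerate(walk):
        start = max(0, idx - window_size)
        end = min(len(walk), idx + window_size + 1)
        for ctx_idx in range(start, end):
            if ctx_idx != idx:  # a node is not its own context
                pairs.append((target, walk[ctx_idx]))
    return pairs

# Hypothetical walk of four nodes with a context window of one node per side.
walk = ["P1", "P2", "P3", "P4"]
pairs = skipgram_pairs(walk, window_size=1)
```

Each pair would then serve as one training example in which the model predicts the context node from the target node.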

Nowadays, numerous studies in the field of graph representation learning make use of these random walks [24, 8, 4, 2, 25]. Successive nodes in a random walk can be compared with successive words in a sentence in word2vec. More precisely, a random walk in a graph is a walk from one node to another node over some length $l$, where nodes on the path are generated by the following distribution:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \frac{\pi_{vx}}{Z} & \text{if } (v, x) \in E \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

Here $c_i$ denotes the $i$-th node in the walk, starting with $c_0$, $\pi_{vx}$ denotes the unnormalized transition probability between nodes $v$ and $x$, and $Z$ is the normalizing constant. An advantage of random walks is that they are computationally efficient: storing the immediate neighbors of every node in a graph has space complexity $O(|E|)$.
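A random walk following Equation 4 can be sketched as below. The adjacency dictionary stores, for every node, its immediate neighbors together with the unnormalized transition weights; the graph itself and the function name are hypothetical.

```python
import random

def random_walk(adjacency, start, length, rng):
    """Sample a walk of at most `length` steps starting at `start`.

    `adjacency` maps each node v to a dict {x: pi_vx} of its neighbors.
    """
    walk = [start]
    for _ in range(length):
        neighbors = adjacency[walk[-1]]
        if not neighbors:
            break  # dead end: no outgoing edge to follow
        candidates = list(neighbors)
        weights = [neighbors[x] for x in candidates]
        # random.choices normalizes the weights, which corresponds to dividing by Z.
        walk.append(rng.choices(candidates, weights=weights, k=1)[0])
    return walk

# Hypothetical unweighted graph (all pi_vx equal to one).
adjacency = {
    "P1": {"P2": 1.0, "P4": 1.0},
    "P2": {"P1": 1.0, "P3": 1.0},
    "P3": {"P2": 1.0},
    "P4": {"P1": 1.0},
}
walk = random_walk(adjacency, "P1", length=5, rng=random.Random(0))
```

Only pairs $(v, x) \in E$ appear in the dictionary, so transitions to non-neighbors have probability zero by construction.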

LINE LINE [24] is a different feature learning algorithm which extends DeepWalk. LINE can be applied to social, citation and language networks to learn network embeddings which preserve local and global network structures. LINE is suitable for learning embeddings of very large networks with millions of nodes. Local structures are represented by the observed links in the networks, which capture the first-order proximity between vertices. The second-order proximity captures the shared neighborhood structure of vertices. Tang et al. state that in social networks people who share the same friend are likely to share the same interests, and that words that are used in the same sort of context are likely to have a similar meaning. In a graph setting this can be interpreted as follows: nodes with shared neighbors are likely to be similar. Therefore, vertices that have a strong second-order proximity should be represented close to each other in vector space. In LINE the models to preserve first and second order proximity are trained separately, and the node embeddings generated by the two separately trained models are concatenated for each vertex. Moreover, they propose an edge-sampling method where edges are sampled with probabilities proportional to their weights, after which they are treated as binary edges to update the model. This way the edge weights no longer affect the gradients, which prevents the exploding gradient issue in large real world information networks where edge weights have high variance. In PTE [26], Tang et al. extended LINE into a semi-supervised model for embedding text data.

Node2vec Node2vec [8] extends DeepWalk with breadth-first search schemes. Node2vec makes use of biased random walks to generate node embeddings for the task of multi-label classification. The biased random walk procedure is introduced to efficiently explore diverse neighborhoods, which makes it possible to learn richer representations. As in LINE, Node2vec maximizes the likelihood of preserving the network neighborhood of nodes conditioned on their feature representation. Since computing the partition function for all nodes is computationally expensive, LINE and Node2vec adopt the approach of negative sampling [27]. They make use of second order random walks to generate (sample) neighborhoods of nodes. The main contribution of Node2vec is that the model makes it possible to define a flexible notion of a node’s network neighborhood through tunable parameters that control the search space of the random walks. The authors state that nodes in networks do not have the linear nature of text, and therefore a richer notion of a neighborhood is necessary in which nodes are not restricted to their immediate neighbors. In Node2vec, the size of a neighborhood is set to k nodes and different sampling strategies are used to create multiple sets for a single node. Breadth-first Sampling (BFS) restricts the neighborhood to immediate neighbors, while in Depth-first Sampling (DFS) the neighborhood consists of nodes sequentially sampled at increasing distances from the source node. Node2vec smoothly interpolates between BFS and DFS. Due to this flexibility in sampling strategy, Node2vec can learn embeddings that represent different roles of nodes in the network or community. Node2vec has been evaluated on networks from diverse domains, such as social networks, information networks and systems biology. It has shown good performance with only 10 percent labeled data and can scale to large networks with millions of nodes in a few hours.
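The interpolation between BFS-like and DFS-like behaviour can be sketched with a second order biased walk. The return parameter `p` and the in-out parameter `q` follow the notation of the Node2vec paper [8] and are not named in the text above; the unweighted toy graph and the function names are assumptions for illustration.

```python
import random

def biased_weights(graph, prev, current, p, q):
    """Unnormalized second order transition weights out of `current`."""
    weights = {}
    for x in graph[current]:
        if x == prev:
            weights[x] = 1.0 / p  # step back to the previous node
        elif x in graph[prev]:
            weights[x] = 1.0      # stay close to the previous node (BFS-like)
        else:
            weights[x] = 1.0 / q  # move further away (DFS-like)
    return weights

def biased_walk(graph, start, length, p, q, rng):
    """Sample a walk with `length` nodes using the biased transition weights."""
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        if not graph[current]:
            break
        if len(walk) == 1:
            # The first step has no previous node, so it is sampled uniformly.
            walk.append(rng.choice(sorted(graph[current])))
            continue
        weights = biased_weights(graph, walk[-2], current, p, q)
        nodes = sorted(weights)
        walk.append(rng.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0])
    return walk

# Hypothetical unweighted, undirected graph given as neighbor sets.
graph = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}
example_weights = biased_weights(graph, prev="a", current="b", p=2.0, q=0.5)
walk = biased_walk(graph, "a", length=6, p=2.0, q=0.5, rng=random.Random(0))
```

In this sketch a small `q` favours nodes further away from the previous node, while a large `q` keeps the walk local.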

Metapath2vec DeepWalk, LINE and Node2vec focus on homogeneous networks. However, a large number of real world networks are heterogeneous, involving a diversity of node types and relationships. Therefore, Dong et al. [4] further extended the above mentioned models and designed Metapath2vec and its extension Metapath2vec++, which are applicable to networks with multiple node types and are able to extract heterogeneous structures and semantic correlations. In their work they introduce the heterogeneous skip-gram model, metapath based random walks and heterogeneous negative sampling. With the learned node embeddings various tasks can be performed, such as multi-class classification. In Metapath2vec a logistic regression classifier is applied to classify node embeddings into different categories. Their results have shown that Metapath2vec outperforms models that treat different node types and relations identically.
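A metapath based random walk can be sketched as a walk that is only allowed to step to neighbors of the node type prescribed by the metapath. The toy author-paper-conference graph and the function name are hypothetical, and the next node is chosen uniformly among the neighbors of the required type.

```python
import random

def metapath_walk(neighbors, node_type, start, metapath, num_cycles, rng):
    """Walk that follows a symmetric metapath scheme such as "APA".

    The last type of the scheme equals the first, so it can be repeated.
    """
    assert node_type[start] == metapath[0]
    cycle = metapath[1:]  # node types to visit after the start node
    walk = [start]
    for step in range(num_cycles * len(cycle)):
        wanted = cycle[step % len(cycle)]
        candidates = sorted(n for n in neighbors[walk[-1]]
                            if node_type[n] == wanted)
        if not candidates:
            break  # no neighbor of the required type
        walk.append(rng.choice(candidates))
    return walk

# Hypothetical heterogeneous graph with authors (A), papers (P) and a conference (C).
node_type = {"A1": "A", "A2": "A", "P1": "P", "P2": "P", "C1": "C"}
neighbors = {
    "A1": {"P1"},
    "A2": {"P1", "P2"},
    "P1": {"A1", "A2", "C1"},
    "P2": {"A2", "C1"},
    "C1": {"P1", "P2"},
}
walk = metapath_walk(neighbors, node_type, "A1", "APA", num_cycles=2,
                     rng=random.Random(0))
```

The resulting walks can be fed to a skip-gram model in the same way as ordinary random walks.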

3.2 Convolutions on graphs

Convolutional Neural Networks (CNNs) are very efficient for tasks where the underlying data representation has a grid structure, such as images [28]. By sliding filters over images, CNNs are great feature extractors. Convolutions can also be applied on graphs in Graph Convolutional Networks (GCN). They are called convolutional because they share filter parameters over locations in the graph. Originally GCNs were proposed by Bruna et al. [29] and later extended by Defferrard et al. [30] with fast localized convolutions. Later, Kipf et al. [6] proposed a simplified version of Graph Convolutional Networks which combines a graph structure with node-level features and yields good performance in the task of semi-supervised classification in networks.

Related work that uses graph based methods for the task of semi-supervised classification of nodes smooths label information over the graph by adding a graph based regularization term to the loss function [31, 32]. Those regularization terms assume that connected nodes in the graph are likely to share the same label. However, edges do not necessarily represent node similarity, while they can still provide additional information to the graph. Therefore, in GCN [6] this graph based regularization in the loss function is avoided. Instead, the neural network is conditioned on the adjacency matrix and gradient information of the supervised loss can be used to learn representations from labeled and unlabeled nodes.

GCN uses a form of message passing where the feature representations of neighboring nodes are aggregated in the representation of the node itself. Therefore, nodes that lie close to each other in the graph structure are pushed towards each other in their representations. This can be explained by the following propagation rule:

$$H^{(l+1)} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \tag{5}$$

where $H^{(l)} \in \mathbb{R}^{N \times D}$ is the matrix of activations in the $l$-th layer ($N$ is the number of nodes and $D$ is the number of input features) and $H^{(0)} = X$. $\hat{A}$ is the adjacency matrix with added self connections: $\hat{A} = A + I_N$. $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$ and $W^{(l)}$ is a layer specific trainable weight matrix. $\sigma(\cdot)$ is a non-linear activation function.

By multiplying $\hat{A}$ with $H^{(l)}$, node features are summed with their neighboring node features. Multiplying $\hat{A}$ on both sides with $\hat{D}^{-\frac{1}{2}}$ normalises this sum such that the change in scale of the node features is restricted. By aggregating node features with neighboring node features, nodes that lie close to each other in the graph obtain more similar representations. In the next step, the aggregated representation is transformed by a linear projection followed by a non-linearity. Finally, chaining multiple layers produces a node-level output representation that captures both the graph structure and the node features.
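As an illustration of the propagation rule in equation 5, the following minimal sketch applies one GCN layer to a toy graph using plain Python lists. The toy graph, the feature values, the identity weight matrix and the choice of ReLU as the non-linearity are assumptions made for this example only; they are not taken from the experiments in this research.

```python
import math

def matmul(X, Y):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def gcn_layer(A, H, W):
    """One propagation step of equation (5), with ReLU as the non-linearity."""
    n = len(A)
    # Add self connections: A_hat = A + I_N
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Node degrees of A_hat (the diagonal of D_hat)
    deg = [sum(row) for row in A_hat]
    # Symmetric normalisation: D_hat^-1/2 A_hat D_hat^-1/2
    A_norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Aggregate neighbor features, project linearly, apply ReLU
    Z = matmul(matmul(A_norm, H), W)
    return [[max(0.0, z) for z in row] for row in Z]

# Hypothetical toy graph: a path 0 - 1 - 2 with two features per node
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W0 = [[1.0, 0.0], [0.0, 1.0]]  # identity, so only the aggregation is visible
H1 = gcn_layer(A, H0, W0)
```

With the identity as weight matrix, each row of `H1` is the degree-normalised sum of a node's own features and those of its neighbors, which makes the smoothing effect of the aggregation directly visible.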


GCN aggregates in a structure-dependent manner as it is conditioned on the adjacency matrix. Therefore, the graph structure needs to be known up front, which limits its generalizability. Hamilton et al. [33] proposed a new model, GraphSAGE, that extends GCN by making it suitable for inductive learning. It can be used to quickly generate embeddings for unseen nodes and is suitable for evolving graphs. Moreover, GraphSAGE generalizes better across graphs with the same form of features. It extends the convolutions by using trainable aggregation functions that learn to aggregate feature information from a node's neighborhood. Hamilton et al. experimented with three different aggregators of neighboring features: the mean aggregator, which only differs from the GCN equation by a minor normalization constant, an LSTM aggregator and a pooling aggregator. In Graph Attention Network (GAT), Veličković et al. [34] propose a different type of aggregation, which uses an attention mechanism to weigh neighbor features and outperforms the aggregators in GraphSAGE.

GAT differs from GCN in the way the information of neighboring nodes is aggregated. This difference becomes more explicit in their formulas. In vector form, the Graph Convolutional layer-wise propagation rule is the following:

$$h_i^{(l+1)} = \sigma\left(\sum_{j \in \mathcal{N}(i)} \frac{1}{c_{ij}} W^{(l)} h_j^{(l)}\right) \tag{6}$$

Where $\mathcal{N}(i)$ are the one-hop neighbors of node $i$ and $c_{ij}$ is the normalization constant that originates from the degree matrix and is set to $c_{ij} = \sqrt{|\mathcal{N}(i)|}\sqrt{|\mathcal{N}(j)|}$.

GAT replaces this normalization constant with an attention mechanism, which results in the following formula:

$$h_i^{(l+1)} = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} W^{(l)} h_j^{(l)}\right) \tag{7}$$

The attention is defined as follows:

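$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})} \tag{8}$$

$$e_{ij} = \mathrm{LeakyReLU}\left(a^T [W h_i \,\|\, W h_j]\right) \tag{9}$$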

Where $a$ is a learnable weight vector and $\|$ denotes concatenation. Furthermore, they applied multi-head attention to stabilize the learning process: $K$ attention heads independently apply the transformation in equation 7 and their results are concatenated, resulting in the following equation:

$$h_i^{(l+1)} = \Big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{k} W^{k} h_j\right) \tag{10}$$

Where $\alpha_{ij}^{k}$ are the normalized attention coefficients of the $k$-th attention head ($a^k$). In this case the output is of shape $KF'$ instead of $F'$. However, in the final layer the outputs of the attention heads are averaged instead of concatenated to achieve the desired output dimension $F'$, which results in the following equation for the last layer:

$$h_i^{(l_f)} = \sigma\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{k} W^{k} h_j\right) \tag{11}$$
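The attention mechanism of equations 7 to 11 can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the projected features `Wh`, the attention vector `a`, the LeakyReLU slope of 0.2 and the convention that the neighborhood of a node contains the node itself are chosen for the example, and the outer non-linearity $\sigma$ is left out for brevity.

```python
import math

def leaky_relu(x, slope=0.2):
    # The slope of 0.2 is an assumption; the text does not state it
    return x if x > 0 else slope * x

def attention_coefficients(i, neighbors, Wh, a):
    """Equations (8) and (9): softmax over LeakyReLU(a^T [Wh_i || Wh_j])."""
    scores = []
    for j in neighbors:
        concat = Wh[i] + Wh[j]  # list concatenation plays the role of ||
        scores.append(leaky_relu(sum(w * c for w, c in zip(a, concat))))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gat_head(i, neighbors, Wh, a):
    """Attention-weighted sum of equation (7), before the non-linearity."""
    alpha = attention_coefficients(i, neighbors, Wh, a)
    return [sum(w * Wh[j][d] for w, j in zip(alpha, neighbors)) for d in range(len(Wh[i]))]

def multi_head(head_outputs, final_layer=False):
    """Concatenate head outputs (eq. 10) or average them in the last layer (eq. 11)."""
    if final_layer:
        k = len(head_outputs)
        return [sum(vals) / k for vals in zip(*head_outputs)]
    return [v for head in head_outputs for v in head]

# Hypothetical projected features W h_j for three nodes
Wh = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [0, 1, 2]  # assumes the neighborhood of node 0 includes node 0 itself
alpha = attention_coefficients(0, neighbors, Wh, a=[0.0, 0.0, 1.0, 1.0])
```

With an all-zero attention vector every neighbor receives the same weight, so `gat_head` reduces to a plain mean over the neighborhood; a non-zero `a` lets the layer weigh some neighbors more heavily than others, which is exactly what replaces the fixed constant $c_{ij}$ of GCN.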

Although GAT has been shown to outperform GCN and the mean, LSTM and pooling aggregators introduced by Hamilton et al., it focuses on homogeneous graphs, while the data in the task we approach has a heterogeneous structure.

In Heterogeneous Graph Attention Network (HAN), Wang et al. [35] expand the GAT model to work with heterogeneous graphs. To learn the importance between a node and its metapath based neighbors, they apply node-level attention which is the same attention mechanism that is used in GAT and has the following formula:

$$z_i^{\Phi} = \sigma\left(\sum_{j \in \mathcal{N}^{\Phi}(i)} \alpha_{ij}^{\Phi} h'_j\right) \tag{12}$$


Where $z_i^{\Phi}$ is the learned embedding of node $i$ for metapath $\Phi$. Since nodes in a heterogeneous network can be of different types, they have different feature spaces. Therefore, the trainable weight matrix that transforms the input features into a new feature dimension ($Wh$) is replaced with a node type-specific transformation matrix $M_{\phi_i}$ that projects the features of different node types into the same feature space as follows:

$$h'_i = M_{\phi_i} h_i \tag{13}$$

Besides the node-level attention, HAN introduces an extra attention mechanism which learns the importance of different metapaths, referred to as the semantic-level attention. This attention mechanism is inspired by the attention mechanism in neural machine translation [36]. The metapath specific node embeddings are fused with this semantic-level attention mechanism such that metapaths that are more important for the specific task get a higher weight. AMPE [37] uses the same attention mechanism to fuse node embeddings learned from different metapaths, but with an AutoEncoder architecture instead of a GAT based architecture. Such an attention mechanism might be interesting for this research as it might provide better insight into what kind of graph structures are more or less informative in the task of predicting illegal vacation rentals.
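The text above describes the semantic-level attention only in words. The sketch below follows the formulation as we understand it from the HAN paper [35]: each metapath receives a score obtained by averaging $q^T \tanh(W z_i^{\Phi} + b)$ over all nodes, the scores are normalised with a softmax, and the metapath specific embeddings are fused as a weighted sum. The metapath names, the parameter values and the helper name `semantic_attention` are hypothetical and serve only as an illustration.

```python
import math

def semantic_attention(Z_by_metapath, W, b, q):
    """Fuse metapath specific node embeddings with softmax-normalised metapath weights."""
    scores = {}
    for name, Z in Z_by_metapath.items():
        total = 0.0
        for z in Z:
            # Hidden representation tanh(W z + b), then projection on q
            hidden = [math.tanh(sum(w * zk for w, zk in zip(row, z)) + bk)
                      for row, bk in zip(W, b)]
            total += sum(qk * hk for qk, hk in zip(q, hidden))
        scores[name] = total / len(Z)  # average over all nodes
    top = max(scores.values())
    exps = {name: math.exp(s - top) for name, s in scores.items()}
    norm = sum(exps.values())
    beta = {name: e / norm for name, e in exps.items()}
    # Weighted sum of the metapath specific embeddings per node
    names = list(Z_by_metapath)
    n_nodes = len(Z_by_metapath[names[0]])
    dim = len(Z_by_metapath[names[0]][0])
    fused = [[sum(beta[name] * Z_by_metapath[name][i][d] for name in names)
              for d in range(dim)] for i in range(n_nodes)]
    return beta, fused

# Hypothetical embeddings of two nodes for two hypothetical metapaths
Z = {
    "house-neighborhood-house": [[1.0, 0.0], [0.0, 1.0]],
    "house-partner-house": [[1.0, 0.0], [0.0, 1.0]],
}
beta, fused = semantic_attention(Z, W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0], q=[1.0, 1.0])
```

The learned weights `beta` are the quantity of interest for interpretability: a metapath with a larger weight contributes more to the fused node embedding.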

3.3 Model selection

In this research we want to investigate whether graphs can be utilized to obtain extra relational and structural information from different data sources to improve the classification of illegal vacation rentals. We have seen that random walk based methods are suited for embedding the graph structure in node representations and can be applied to heterogeneous graph structures. However, they are limited in that they do not incorporate node features and that they are based on a multi-step pipeline, which is harder to optimize. Convolution based methods overcome these limitations while remaining favorable in terms of efficiency [6]. In our research task there is a large number of house features which could contribute to the prediction task. Modeling all these features in the graph structure would result in a very large graph. Moreover, this research aims to investigate what graph structures can add on top of commonly used feature based classification approaches. Therefore, a graph convolutional based approach that can store features inside the graph nodes is preferred. As we seek a model that can be applied to heterogeneous data and interpretability is important, we decided to use a Heterogeneous Graph Attention Network (HAN) [35] based approach in this research. The HAN based model can provide insight into how metapaths can best be created and what kind of graph structures are most informative for the prediction task. In Section 4.4.1 we will dive deeper into the details of our HAN based model architecture.


4 Methodology

4.1 Research context

We design our research to be applicable in general and for a variety of use cases. However, we evaluate our experiments in the context of predicting illegal vacation rentals. This subsection elaborates on why the detection of illegal vacation rentals has become an important social issue. Moreover, it explains in which cases vacation rental is illegal.

The use case of this study is in the interest of the department of Housing of the municipality of Amsterdam. The department is responsible for enforcement against various forms of housing fraud, of which illegal holiday rental is the largest. The municipality aims to achieve a more proactive approach in tracking down illegal hotels. Therefore, an algorithm that can discover new possible illegal vacation rentals would be of great value. Within the municipality of Amsterdam, algorithms will never be used to directly hand out fines or to perform other forms of enforcement. They will be used to estimate the probability of fraud. Next, enforcers will inspect the house and decide whether fraud was committed or not.

This research is applied to a use case in Amsterdam; however, it can readily be extended to other cities. Amsterdam has approximately 450,000 households and almost double that number of citizens. One of the goals of the municipality of Amsterdam is to improve the livability of the city, which includes affordable housing for everyone and limiting the nuisance caused by tourists. However, at this moment Amsterdam is struggling with an extreme housing shortage. Waiting lists for social housing run up to over ten years. At the same time, 1 in 15 houses in Amsterdam was listed on Airbnb in 2019 [38]. This is not surprising given the fact that using houses as vacation rentals is extremely lucrative. At this moment, the average price per night in Amsterdam amounts to €156 and in the city center it even reaches €253 per night [39]. Houses that are used as illegal hotels reduce the number of houses available to live in. Moreover, the increased number of tourists in the middle of residential areas causes nuisance [40]. Therefore, the municipality is committed to reducing the negative impact of vacation rentals and has set up the following regulations under which vacation rental is illegal:

1. A house is rented out by someone who is not the house owner and does not have permission from the house owner (for example rents a house from a housing corporation)

2. The house is rented out for more than 30 nights in a year.

3. The house is rented out to more than 4 tourists at the same time.

Regulations become stricter every year. In 2018 it was still allowed to rent out homes for 60 nights a year. Starting from July 2020, three whole districts in the city centre are designated as areas in which vacation rental will be completely forbidden. In those districts, the extreme number of tourists puts pressure on the livability of the area. People who do not adhere to the regulations face a large fine. Moreover, in some cases, owners of illegal hotels are given a cease and desist order, which should make the houses available to citizens again.

In 2019, the total number of vacation rentals in Amsterdam stabilized compared to earlier years. The total offer on Airbnb has slightly decreased while the offer on other platforms like booking.com and Homeaway has increased [38]. Still, Airbnb is by far the most prominent platform for short-term housing rental in Amsterdam. A large number of houses are rented out illegally through Airbnb and some even estimate that illegal sublets account for 50% of all Amsterdam houses listed on Airbnb [41]. Fraudsters use different platforms or multiple listings for the same house to complicate the recognition of fraud. Therefore, data driven approaches which can incorporate data from different platforms would be of great value for the detection of vacation rental fraud.

4.2 Data

The first part of this project consisted of collecting and cleaning data that could contribute to the task of predicting illegal vacation rentals. All useful and available data was preprocessed and shaped such that it could be used in our graph based approach.

In order to build a useful model for the municipality, it is important to determine what the possible inputs are and what a useful output is. The municipality is interested in which addresses have a high risk of fraud. If the model designates addresses with a higher probability of fraud, the municipality can send enforcers to inspect the address and determine whether fraud is really taking place. In order to achieve this desired output, all available data needs to be relatable to addresses. This is where the first difficulty arises. Generally speaking, it is not the addresses that commit fraud, but rather the people who live in or own the house at the specific address. However, personal data is a sensitive issue regarding privacy, especially since the General Data Protection Regulation (GDPR) [42] came into effect. The municipality has to be very careful in using personal data, hence the personal data available for this research is limited. Therefore, instead of detecting people that commit fraud, this research focuses on detecting addresses that have a higher risk of being used as illegal vacation rentals.

In order to acquire more information on what type of data could provide a meaningful contribution, an investigation was carried out within the department in which several enforcers were consulted. This is further explained in the next section.

4.2.1 Collecting Data

Available data sources

The municipality has provided access to two databases for this research. The first one is the BAG database (Basis-Registration Addresses and Buildings), which contains information about all buildings in Amsterdam and from which all addresses with a residential function can be selected. The total number of households in Amsterdam is equal to 452,114. Furthermore, the BAG contains information about the surface area of buildings, the number of rooms, the number of building floors and geometry information from which latitude and longitude coordinates can be extracted.

The second database is the BRP (Basis-registration Persons). It contains information about people that are registered in Amsterdam. Not all data from this database is allowed to be used in this research because of privacy concerns. However, some features could be extracted such as the number of people registered at an address, their age, the percentage of male/female and what kind of family relationships are found at the address.

Furthermore, the municipality of Amsterdam provides publicly available data on Amsterdam City data¹ and Amsterdam Maps². Data from these platforms that might contribute to the detection of illegal hotels are the locations of popular touristic places, such as public transport stops, museums and shops.

Another data source is the Airbnb website. The website Inside Airbnb³ has scraped Airbnb data of various cities, including Amsterdam, almost monthly since April 2015. The data is freely available and has been used in other studies, for example in a study of the relationship between the way hosts describe themselves and their perceived trustworthiness [43]. A large amount of data can be scraped from Airbnb listings. Some examples are the number of reviews per listing, how many days per year the house is available, information about the host (such as whether the host is a superhost), and text descriptions of the lodging, the neighborhood or the host. In most cases (approximately 95%) the approximate longitude and latitude of the listing address can be scraped as well. All in all, there is a wide variety of data available which could contribute to the prediction of illegal vacation rentals.

An overview of the different data sources used is shown in Figure 3.

Figure 3: Different data sources that are used in this research.

¹ https://data.amsterdam.nl/
² https://maps.amsterdam.nl/


The labeled data

Labeled data samples are stored in a database in which the municipality keeps track of all information concerning possible housing fraud cases. If suspicious properties come up through civilian reports, because they are tracked down via vacation rental platforms, or because they are part of a neighborhood research plan, the corresponding address is entered in this database and the address will be inspected by enforcers. If after inspection it was not possible to decide whether fraud was committed or not, the case remains uncertain and is assigned the label uncertain. However, since the house most likely came to light because of suspicion, the risk of fraud at the address is probably still higher than at an arbitrary address in the city. Houses that are checked by enforcers but where no fraud could be identified or proved are assigned the label no fraud. In this case there is still a possibility that fraud took place; however, it could not be proven at that point. If fraud was encountered at the time of a house inspection and could be proven, the sample is assigned the label fraud, which is actually the only label that is 'certain'. Regulations concerning illegal vacation rentals change over the years. However, they tend to become stricter every year. Therefore, it is assumed that fraud cases in the past would still be fraud cases nowadays.

Integrating domain expert knowledge

In order to integrate domain expert knowledge in the search for relevant data, several enforcers at the Housing department of the municipality were approached. In their experience, there are a few indicators related to vacation rental fraud. First of all, they have noticed that if no people are registered at an address, the risk of fraudulent activity at the address increases. Because of the housing shortage in Amsterdam, empty properties are prohibited. Moreover, anyone who lives in a property is obliged to be registered at that property as well, and everyone can be registered at one address at most. Therefore, an empty property in Amsterdam is by definition already illegal. If nobody is registered at the property, it might be used as an illegal hotel. Therefore, a feature representing an empty property flag might be a good indicator in the prediction of illegal vacation rentals.

The second indicator related to illegal vacation rentals pointed out by the enforcers is a situation where two people in a partnership are registered at two different addresses. In this case, they often use one house to live in and the other house for touristic rental.

A third indicator which enforcers often come across is that one house owner owns multiple floors of a (canal) building, using one floor to live in and the other floors to rent out to tourists. If one desires to rent out a part of one's house, a Bed and Breakfast permit involving several regulations is required. The most important regulations are that a maximum of 40 percent of the property may be used by the tourist(s) and that the owner lives in the house. Moreover, the part of the house rented out to the tourist cannot be a residence in itself. If it is a residence in itself, it could just as well be used as a house for citizens to live in.

With the use of this domain knowledge, two new features are generated that will be integrated in the graph structure: (1) whether the house is an empty property, and (2) the number of registered partners of residents who are not registered at the same address. A third desirable feature is whether the house owner owns multiple floors of a building. However, data about house owners is stored in the Kadaster database, which is not accessible within this research.

4.2.2 Pre-processing data for a graph based approach

Amsterdam has approximately 450,000 houses. First of all, all houses were combined with data from the BAG, BRP and Amsterdam City data. The data was cleaned and flattened on address level, with categorical data one-hot encoded. A list of all these features and their definitions is displayed in Table 16 in Appendix A. From Amsterdam City data, three datasets were used with coordinates of museums, shops and public transport stops in Amsterdam. Possible feature representations of these locations can be defined in terms of proximity and density. Proximity is the distance to the closest location or the mean distance to the K closest locations. Density describes the total number of locations within a given range of the house. The coordinates of the locations are expressed in latitude and longitude. When calculating the real distance between two latitude and longitude coordinates, one normally uses the haversine distance, which takes the spherical shape of the earth into account. However, when comparing two coordinates within a small range such as a city, the spherical shape of the earth makes a negligible difference in the distance between the two points. Therefore, the spherical component is neglected when calculating the distance between two locations in Amsterdam. This results in the Euclidean distance formula multiplied by a constant that converts the distance to kilometers:

$$d(P, Q) = c\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{14}$$

Where $x$ are the latitude coordinates, $y$ the longitude coordinates and $c = \frac{1}{0.015}$.
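The proximity and density features described above can be sketched as follows, using the approximate distance of equation 14. The function names, the example coordinates and the radius are hypothetical and chosen only for illustration; the exact values of K and of the range used in this research are not specified here.

```python
import math

C_KM_PER_DEGREE = 1 / 0.015  # the constant c of equation (14)

def distance_km(p, q):
    """Approximate distance in km between two (latitude, longitude) points."""
    (x1, y1), (x2, y2) = p, q
    return C_KM_PER_DEGREE * math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

def proximity(house, locations, k=1):
    """Mean distance to the k closest locations (k=1 gives the closest one)."""
    nearest = sorted(distance_km(house, loc) for loc in locations)[:k]
    return sum(nearest) / len(nearest)

def density(house, locations, radius_km):
    """Number of locations within radius_km of the house."""
    return sum(1 for loc in locations if distance_km(house, loc) <= radius_km)

# Hypothetical coordinates of one house and three public transport stops
house = (52.370, 4.890)
stops = [(52.370, 4.905), (52.400, 4.890), (52.370, 4.8915)]
closest = proximity(house, stops)           # distance to the closest stop
mean_two = proximity(house, stops, k=2)     # mean distance to the 2 closest stops
within_range = density(house, stops, 1.5)   # stops within 1.5 km
```

In this toy example the three stops lie at roughly 1.0, 2.0 and 0.1 km from the house, so the proximity is about 0.1 km and two stops fall within the 1.5 km range.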


As explained before, the labeled data samples are stored in a separate database. Therefore, when adding the labeled data to the rest of the (unlabeled) data, it was necessary to shape them in the same format with the same features. Labeled samples are snapshots of situations that took place in the past. There are labeled samples on housing fraud starting from the year 2013. In the meantime the situation might have changed: people may have moved out, the family relationships at the address might have changed, etc. Therefore, for the labeled samples, person related features were composed from historical data of the BRP.

Some labeled samples could not be matched with the current BAG dataset. This may be due to a change in address notations over time or to a different, unknown reason. In order to ensure there is no unintentional difference between the labeled and unlabeled data samples, the labeled samples which could not be matched with the current BAG dataset were disregarded. This resulted in 5861 useful labels related to illegal hotels.

After creating the same features for the labeled and unlabeled data, they were concatenated. This implies that the resulting dataset might have multiple samples of the same address but with a different label. If one would keep the labeled addresses but remove the duplicate (unlabeled) addresses, all addresses at which fraud was identified in the past would be excluded from the data on which fraud scores can be calculated, which is undesirable. Therefore, the duplicate addresses with different labels and/or features are kept in the dataset.

The Airbnb listings in Amsterdam are scraped on a monthly basis. The listings cannot be extracted on address level since the Airbnb website does not show the exact location of properties but an approximate location instead. These approximate coordinates were used to combine all Airbnb listings from one month in municipality defined neighborhoods and districts. The neighborhoods (buurten) and districts (wijken) can be obtained from the website of the CBS (Central Office of Statistics)⁴. In this way, Airbnb listings are grouped by their neighborhood or district and by the date they were scraped from the Airbnb platform. This makes it possible to match them with the labeled and unlabeled data samples by date and location.

The Airbnb data consists of categorical, numerical, text and image data. However, due to privacy concerns we do not use the images in this research. The multiple listings in each area-date group were aggregated by taking the average value of the categorical and continuous features (leaving out the text features). Moreover, two extra features were engineered: 1. The total number of listings in the group, and 2. The total number of entire homes in the group of listings. It is also possible to rent out a single room or a part of the house instead of the entire home. An overview of the Airbnb features that were used in this research is displayed in Table 17 in Appendix A.
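The aggregation of listings per area-date group can be sketched as follows. The field names (`area`, `scrape_date`, `room_type`, `price`) and the room type value `"Entire home/apt"` are assumptions for this illustration and need not match the column names in the actual scraped data.

```python
from collections import defaultdict

def aggregate_listings(listings, feature_names):
    """Average numeric features per (area, scrape date) group and add two counts."""
    groups = defaultdict(list)
    for listing in listings:
        groups[(listing["area"], listing["scrape_date"])].append(listing)
    result = {}
    for key, members in groups.items():
        # Mean of every numeric (or one-hot encoded) feature in the group
        row = {name: sum(m[name] for m in members) / len(members) for name in feature_names}
        # Engineered feature 1: total number of listings in the group
        row["n_listings"] = len(members)
        # Engineered feature 2: number of listings that offer the entire home
        row["n_entire_homes"] = sum(1 for m in members if m["room_type"] == "Entire home/apt")
        result[key] = row
    return result

# Hypothetical listings scraped in one month
listings = [
    {"area": "A", "scrape_date": "2019-06", "price": 100.0, "room_type": "Entire home/apt"},
    {"area": "A", "scrape_date": "2019-06", "price": 200.0, "room_type": "Private room"},
    {"area": "B", "scrape_date": "2019-06", "price": 80.0, "room_type": "Entire home/apt"},
]
groups = aggregate_listings(listings, ["price"])
```

Each resulting row describes one neighborhood or district in one month, which is the level at which the Airbnb data can be matched to addresses.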

Table 1: Airbnb text features and their description.

| Field name | Description |
|---|---|
| Name | The title of the listing. |
| Space | Description of the space the tourists are staying. |
| Description | Description about the house. |
| Neighborhood overview | Description about the neighborhood. |
| Transit | Description about public transport in the neighborhood. |
| Host about | Introduction about the host. |

The text features were treated separately. There are six fields of useful text features in the Airbnb data. Their names and descriptions are shown in Table 1. In order to convert the text fields to a vector representation, TF-IDF (Term Frequency Inverse Document Frequency) [44] was used. As the name suggests, TF-IDF assigns each word in a 'document' a value that increases with how often the word occurs in that document and decreases with the fraction of 'all documents' in which the word appears. Since the text fields describe different entities, each individual field might provide different additional information. Therefore, separate TF-IDF vector representations are used for each of the text fields instead of combining the multiple text fields into one vector representation. One entry in one text field is considered as one 'document' and 'all documents' are all entries within that text field. This results in six different TF-IDF vectors per Airbnb listing.
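A minimal sketch of the per-field TF-IDF representation is given below. It uses the basic weighting (term frequency multiplied by the logarithm of the inverse document frequency) with whitespace tokenization; the exact TF-IDF variant, tokenization and any smoothing or normalisation used in this research are not specified here, so these choices, as well as the function names and example texts, are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return the vocabulary and one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # Term frequency times inverse document frequency; empty documents map to zeros
        vectors.append([counts[tok] / len(doc) * idf[tok] if doc else 0.0 for tok in vocab])
    return vocab, vectors

def tfidf_per_field(listings, fields):
    """Fit a separate TF-IDF representation for every text field."""
    return {f: tfidf_vectors([listing.get(f, "") for listing in listings]) for f in fields}

# Hypothetical entries of a single text field
docs = ["canal view apartment", "quiet apartment", "canal house"]
vocab, vectors = tfidf_vectors(docs)
```

Because every field has its own vocabulary and document frequencies, a word such as "canal" can carry a different weight in the listing title than in the neighborhood description.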

4.2.3 Data analysis

In order to get some better insight in the data and the distribution of the different classes, some data analysis was performed. This section will go further into the details of this data analysis.

As was stated before, Amsterdam contains approximately 450,000 houses and the municipality has 5861 useful labeled cases related to illegal hotels in Amsterdam. In order to acquire some knowledge about the data distribution, t-SNE [45] was applied to all features that were flattened on address level (this excludes the Airbnb data). The features involved are shown in Table 16 of Appendix A. For reproducibility and comparability of the results, the same seed was used for all plots. First of all, the fraud cases were plotted against the rest of the data. However, plotting 450,000 unlabeled samples against the fraud samples results in an illegible plot. Therefore, a random fraction of 1.4% of the unlabeled data was taken, which results in approximately 6000 samples, roughly the same size as the labeled data.

Figure 4: t-SNE plot of fraud samples and the rest of the data.

In the plot in Figure 4, the fraud samples are plotted against the rest of the data. There appears to be some distinction between the fraud samples and the rest of the data. However, the t-SNE plot of the labeled data versus the unlabeled data in Figure 5a shows that this distinction is to a great extent due to a difference between the labeled and unlabeled data. In Figure 5b all individual labels as discussed in Section 4.2.1 are visualized.

(a) Labeled vs. unlabeled samples (b) Samples with all four labels

Figure 5: t-SNE on raw features.

The difference in distribution between the unlabeled and labeled data might have several causes. First of all, a large part of the labeled data came to light because of some form of suspicion: the addresses were reported by civilians or they were investigated based on suspicious listings on vacation rental platforms. In this case, predicting whether a sample is labeled would provide information on the suspicion of houses, which is valuable information for the task in this research. Another cause might be that the labeled data are snapshots of situations in the past; there might, for instance, be a difference in how accurately data was stored a few years ago compared to nowadays.

Initially, a semi-supervised approach was considered in which a supervised part distinguishes fraud labels from no fraud labels and in which the unlabeled data is used to gain extra information in an unsupervised fashion. However, because of the different distributions of the labeled and unlabeled data, and because it is plausible that this difference is explained by the suspicion attached to the labeled samples, it was decided to consider the labeled and unlabeled samples as different classes. As a result, two different classification tasks were set up. The first task distinguishes the labeled data from the unlabeled data. The second task predicts four classes: unlabeled data, labeled as fraud, labeled as no fraud, and labeled uncertain. First of all, a baseline Random Forest [46] classifier was applied to the first prediction task. 12102 samples were used, of which 52% were unlabeled and 48% labeled. An optimized Random Forest classifier achieved 99% accuracy on the test set. This confirms the difference between the unlabeled and labeled samples that appeared in the t-SNE plots. Moreover, the near-perfect score of this classifier indicates that a graph based approach will not be able to add significant additional information to this task. Therefore, this research will focus on the four-class prediction task.

For this four-class prediction task it was decided to continue with the smaller subset of the unlabeled data for a few reasons: it reduces the class imbalance; it makes it easier to visualize the data; as graphs can scale up very quickly when the number of nodes and edges increases, decreasing the data size will speed up the learning process and prevent memory issues; finally, the two-class prediction task has shown that the size of the unlabeled data subset is sufficient to learn a clear boundary between labeled and unlabeled cases. Therefore, it is expected that an increase of unlabeled data would not be advantageous for the four-class classification task.

Summarizing the data analysis, the main conclusion is that the unlabeled data can be distinguished well from the labeled data samples. Therefore, it is decided to approach the task of predicting illegal vacation rentals as a four-class classification problem with the following classes: unlabeled data, labeled as fraud, labeled uncertain and labeled as no fraud. The next section elaborates on how a graph structure can be built to approach this four-class prediction task.

4.3 Generating the graph

This section explains how the graph in this research is built and which metapaths are used. In Section 4.2.1 all collected features from the different data sources were presented. In total there are 66 different features per house, disregarding the Airbnb features. Expanding the categorical features results in a total of 114 features per house. Treating each individual feature as a separate metapath in the graph would increase the graph size drastically, which slows down the learning process. Therefore, it is decided to perform a feature importance analysis and to use the advice of the domain experts to build graph structures that presumably contribute to the task of predicting illegal vacation rentals. The first subsection is devoted to the applied feature importance analysis. The second subsection discusses the constructed metapaths.

4.3.1 Selecting graph structures based on feature importance

In order to gain insight into which features are important to construct in the graph structure, feature importance was computed with a Random Forest classifier. Feature importance can be determined with Gini importance, which is based on the principle of impurity reduction. It defines for every feature how strong its impurity decrease is [47]. More specifically, it is the total decrease in node impurity, weighted by the probability of reaching that node and averaged over all trees in the forest. However, impurity reduction is biased if the features differ in their number of categories or scale of measurement [48], which is the case in this research. Therefore, permutation importance, originally introduced in [46], is used instead.
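To make the notion of impurity reduction concrete, the following small sketch computes the weighted Gini impurity decrease of a single split in a single tree. The Gini importance of a feature is then the sum of these decreases over all nodes that split on that feature, averaged over the trees. The function names and the toy labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the probability of reaching the node."""
    n = len(parent)
    children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return n / n_total * (gini(parent) - children)

# A split that separates the two classes perfectly
perfect = impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1], n_total=4)
# A split that leaves both children as mixed as the parent
useless = impurity_decrease([0, 0, 1, 1], [0, 1], [0, 1], n_total=4)
```

The perfect split removes all impurity of the parent node, whereas the uninformative split yields a decrease of zero.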


Figure 6: Heatmap of feature correlations. The yellower or lighter the boxes, the stronger the positive correlation. The bluer or darker the boxes, the stronger the negative correlation.

The intuition behind permutation importance is that if a feature is unimportant, permuting its values will have little effect on the performance of the model. Permutation importance randomly shuffles the values of a feature across the samples and measures the resulting drop in accuracy. In permutation importance it is possible to evaluate multiple features together, such that one-hot encoded categorical features can be compared to other variables. However, this research is interested in the most important variables to build a graph structure. Therefore, it is valuable to know the predictive power of each individual category in the categorical data.
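The procedure can be sketched as follows for an arbitrary fitted model. In this illustration the model is a hypothetical function that maps a feature row to a class label, the importance is the mean accuracy drop over a number of shuffles, and the number of repeats and the seed are arbitrary choices; a Random Forest would take the place of the toy model in practice.

```python
import random

def accuracy(model, X, y):
    """Fraction of samples for which the model predicts the correct label."""
    return sum(1 for row, label in zip(X, y) if model(row) == label) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Replace only this column; all other features stay untouched
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical fitted model that only looks at the first feature
model = lambda row: int(row[0] > 0.5)
X = [[0, 5], [1, 3], [0, 7], [1, 1], [0, 2], [1, 9]]
y = [0, 1, 0, 1, 0, 1]
importances = permutation_importance(model, X, y)
```

Shuffling the second column never changes the predictions of this toy model, so its importance is exactly zero, while shuffling the first column breaks the link between feature and label.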

Permutation importance is sensitive to correlated features: if two features are correlated and one of them is permuted, the model can still achieve a high performance based on the other. This results in a low importance for both features, while they might actually be important. To prevent this, correlated features were clustered and one feature of each cluster was kept for the permutation importance analysis. This was done using hierarchical clustering on the Spearman rank-order correlations [49]. A dendrogram of the created clusters is shown in Figure 14 in Appendix B. The correlations are shown in the heatmap in Figure 6.
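As an illustration of this selection step, the sketch below computes Spearman correlations and groups features whose absolute correlation exceeds a cut-off, which is equivalent to cutting a single-linkage dendrogram at a fixed distance. The linkage criterion and the cut-off are not specified in the text, so both are assumptions here, as are the toy feature names and values.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks.

    Assumes neither column is constant (otherwise the variance is zero).
    """
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def cluster_features(columns, threshold=0.8):
    """Single-linkage clusters: features are linked when |rho| >= threshold."""
    names = list(columns)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(spearman(columns[a], columns[b])) >= threshold:
                parent[find(b)] = find(a)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


# Hypothetical feature columns, one value per house.
columns = {
    "museum_density": [1, 2, 3, 4, 5, 6],
    "shop_density": [2, 3, 5, 8, 13, 21],
    "museum_distance": [60, 50, 40, 30, 20, 10],
    "n_inhabitants": [3, 1, 4, 1, 5, 2],
}
clusters = cluster_features(columns)
representatives = [cluster[0] for cluster in clusters]
```

Here the two density features and the distance feature end up in one cluster, because a strong negative correlation is treated the same as a strong positive one, and only the first feature of each cluster is kept for the permutation importance analysis.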

The heatmap mostly shows correlations where they are expected. For example, the density features of the museums, shops and public transport stops are highly correlated. This can be explained by the higher density of those locations in city centers and the lower density elsewhere. Moreover, they have a negative correlation with their proximity features, which is to be expected. Another example is that features expressing something about family relationships are correlated. From each cluster of correlated features, one feature was picked to apply permutation importance on. The results are shown in Figure 7. Recall that Table 16 in Appendix A lists all features with their descriptions.

Figure 7: Permutation importance of most important features.

Based on this plot and on the domain expert knowledge, a list was created of the most important factors, which are integrated in the graph. This list is shown in Table 2.

Table 2: Most important features based on domain experts and permutation importance.

Domain expert knowledge               Feature importance
Empty property                        Distance to museums
Partner living at different address   Partner living at different address
Longitude/latitude (location)         Maximum age
                                      Distance to closest public transport stop
                                      The house was financed for social housing (x1 110)
                                      Number of inhabitants

4.3.2 Constructing metapaths

This section gives an overview of all generated metapaths. As explained in the previous section, the choice of metapaths is inspired by the feature importance analysis. Some design choices were made to prevent the resulting adjacency matrix from becoming too dense, which would require too much memory. Table 3 shows an overview of all metapaths with the abbreviations by which they are referred to in the rest of this research. The essence of each metapath is explained below the table.

Table 3: Metapaths used in this research with their notations.

Notation          Metapath
HH(3;30;50;80)    House - House
HNH               House - Neighborhood - House
HDH               House - District - House
HM1H              House - Museum1 - House
HM2H              House - Museum2 - House
HPH               House - Partner - House
HLH               House - empty property (Leegstand) - House
HPTH              House - Public Transport stop - House
HIH               House - total Inhabitants - House
HSH               House - Social housing - House
HMAH              House - Max Age - House
HAH               House - Airbnb feat. - Airbnb feat. - House
TXT NA            House - airbnb text NAme - airbnb text NAme - House
TXT S             House - airbnb text Space - airbnb text Space - House
TXT D             House - airbnb text Description - airbnb text Description - House
TXT NE            House - airbnb text NEighborhood - airbnb text NEighborhood - House
TXT T             House - airbnb text Transit - airbnb text Transit - House
TXT H             House - airbnb text Host - airbnb text Host - House

Metapath explanations

In the first three metapaths, HH, HNH and HDH, houses are connected based on location. In HH(3;30;50;80), houses are connected to their 3, 30, 50 or 80 nearest houses. HNH connects all houses that lie in the same municipality-defined neighborhood (buurt). HDH connects houses that are located in the same municipality-defined district (wijk). Districts are larger areas than neighborhoods.
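The location-based metapaths can be illustrated with a small sketch. It is not the implementation used in this research: the house records, the planar coordinates (real longitude/latitude values would first need a projection or a great-circle distance), and the function names are assumptions, and the text does not state whether the nearest-neighbor edges are made symmetric.

```python
from math import dist

# Hypothetical houses with planar coordinates and a neighborhood code.
houses = {
    "h1": {"xy": (0.0, 0.0), "buurt": "A"},
    "h2": {"xy": (1.0, 0.0), "buurt": "A"},
    "h3": {"xy": (5.0, 0.0), "buurt": "B"},
    "h4": {"xy": (6.0, 0.0), "buurt": "B"},
}


def knn_edges(houses, k):
    """HH(k): directed edges from each house to its k nearest other houses."""
    edges = set()
    for a, house_a in houses.items():
        others = sorted(
            (b for b in houses if b != a),
            key=lambda b: dist(house_a["xy"], houses[b]["xy"]),
        )
        edges.update((a, b) for b in others[:k])
    return edges


def shared_area_edges(houses, key):
    """HNH / HDH: connect every pair of houses that lie in the same area."""
    return {
        (a, b)
        for a in houses
        for b in houses
        if a != b and houses[a][key] == houses[b][key]
    }


hh1 = knn_edges(houses, k=1)
hnh = shared_area_edges(houses, key="buurt")
```

The number of HH edges grows linearly with k, whereas HNH and HDH connect every pair of houses within an area, which is one reason why the density of the adjacency matrix has to be kept in mind when designing metapaths.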

Two different metapaths are used to represent museum connections, visualised in Figures 8a and 8b. In the first one, HM1H, there are 165 museum nodes, each representing a museum in Amsterdam. Houses are connected to a museum if they are located within 300 meters of it. Metapath HM2H represents the proximity of a museum. Proximity ranges are divided into ten categories, starting at zero meters and increasing by 250 meters per category.
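The difference between the two museum metapaths can be made concrete with a short sketch. The coordinates and function names are hypothetical. Two further assumptions are made: HM2H is taken to use the distance to the closest museum, and distances beyond the tenth range are taken to fall into the last category, as the text does not specify either.

```python
from math import dist

# Hypothetical planar coordinates in meters.
museums = {"m1": (0.0, 0.0), "m2": (1000.0, 0.0)}
houses = {"h1": (100.0, 0.0), "h2": (200.0, 0.0), "h3": (1400.0, 0.0)}


def hm1h_links(houses, museums, radius=300.0):
    """HM1H: link a house to every museum within `radius` meters."""
    return {
        (h, m)
        for h, h_xy in houses.items()
        for m, m_xy in museums.items()
        if dist(h_xy, m_xy) <= radius
    }


def hm2h_category(house_xy, museums, step=250.0, n_categories=10):
    """HM2H: index of the proximity category of the closest museum."""
    nearest = min(dist(house_xy, m_xy) for m_xy in museums.values())
    # Assumption: anything beyond the last range is put in the last category.
    return min(int(nearest // step), n_categories - 1)


links = hm1h_links(houses, museums)
categories = {h: hm2h_category(xy, museums) for h, xy in houses.items()}
```

In HM1H, house h3 has no museum link at all because no museum lies within 300 meters, whereas in HM2H every house is assigned to exactly one proximity node.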

In the metapath HPH, two houses are metapath-based neighbors if their inhabitants have the same number of partners registered at a different address. The partner nodes represent the number of partners who are registered at a different address. A probably more informative metapath would connect houses to their inhabitants and inhabitants to their partners. However, this was unfortunately not feasible to implement. Metapaths HLH and HSH each contain a single node, which represents a property; a house is connected to this node if it has that property. HPTH has the same design as HM2H but takes steps of 50 meters, as there are more public transport stops than museums. HIH is straightforward and connects houses if they have the same number of inhabitants. In HMAH there are ten max age nodes, each covering a range of ten years; a house is connected to a node if the maximum age of its inhabitants falls in the age category of that node.
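All of these metapaths share one pattern: houses become metapath-based neighbors when they are linked to the same intermediate node. The sketch below illustrates this for HMAH and HLH. The input values and function names are hypothetical, and it is assumed that the age ranges start at zero and that ages above the last range map to the last node.

```python
from collections import defaultdict
from itertools import combinations


def max_age_node(max_age, width=10, n_nodes=10):
    """HMAH: index of the ten-year age node (0 covers ages 0-9, 1 covers 10-19, ...)."""
    # Assumption: ages at or above width * n_nodes map to the last node.
    return min(max_age // width, n_nodes - 1)


def metapath_neighbors(house_to_node):
    """House pairs that reach each other through a shared intermediate node."""
    members = defaultdict(list)
    for house, node in house_to_node.items():
        if node is not None:  # None: the house has no link on this metapath
            members[node].append(house)
    pairs = set()
    for group in members.values():
        pairs.update(combinations(sorted(group), 2))
    return pairs


# Hypothetical maximum inhabitant age per house.
max_ages = {"h1": 34, "h2": 38, "h3": 71, "h4": 30}
hmah = metapath_neighbors({h: max_age_node(a) for h, a in max_ages.items()})

# Hypothetical empty-property flags; HLH has a single node.
empty = {"h1": True, "h2": False, "h3": True, "h4": False}
hlh = metapath_neighbors({h: "empty" if flag else None for h, flag in empty.items()})
```

Because every group of houses attached to the same node becomes fully connected, a metapath with few intermediate nodes yields a dense adjacency matrix, which is the memory concern mentioned at the start of this section.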

As stated in Section 4.2.2, the Airbnb feature vectors are grouped by month and neighborhood. In metapath HAH, houses are connected to an Airbnb node if they match in date and neighborhood. For the labeled data samples, the date is defined as the date on which the case was entered in the system. For the unlabeled data samples, the current date can be used to obtain the most recent Airbnb information; in this research March 2020 was used. Airbnb nodes are connected through similarity edges: each Airbnb node has an unweighted similarity edge to its K = 100 nearest neighbors, defined by the cosine similarity between the feature vectors. Figure 8c shows a visualisation of the metapath HAH.

The last six metapaths are based on the text attributes of the Airbnb data, as explained in Section 4.2.2. Each Airbnb text node has an unweighted similarity edge to its K = 50 nearest neighbors, defined by the cosine similarity between the TF-IDF vectors.
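The similarity edges used in HAH and in the text metapaths can be sketched as follows. This is an illustrative example only: the vectors are made up, K is set to 1 for readability instead of 100 or 50, and the function names are assumptions. For the text metapaths the vectors would be TF-IDF vectors.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_edges(vectors, k):
    """Unweighted edges from each node to its k most cosine-similar nodes."""
    edges = set()
    for a, vec_a in vectors.items():
        ranked = sorted(
            (b for b in vectors if b != a),
            key=lambda b: cosine(vec_a, vectors[b]),
            reverse=True,
        )
        edges.update((a, b) for b in ranked[:k])
    return edges


# Hypothetical feature vectors of four Airbnb nodes.
vectors = {
    "n1": [1.0, 0.0, 0.0],
    "n2": [0.9, 0.1, 0.0],
    "n3": [0.0, 1.0, 0.0],
    "n4": [0.0, 0.9, 0.2],
}
edges = similarity_edges(vectors, k=1)
```

The edges are unweighted, so after the K nearest neighbors are selected the similarity values themselves are discarded and only the connectivity enters the graph.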

(a) House - Museum1 - House (b) House - Museum2 - House (c) House - Airbnb feat. - Airbnb feat. - House

Figure 8: Visualisations of metapaths HM1H, HM2H and HAH.

4.4 Graph model

As explained in the related work section, a Heterogeneous Graph Attention Network (HAN) [35] was chosen to approach the task of predicting illegal vacation rentals. This section discusses the details of the model and explains some small modifications of the model architecture in our approach. Because of these modifications, we refer to the model used in this research as mHAN.

4.4.1 Model architecture

As discussed in Section 3.2, the architecture of HAN is built upon a GCN as in the work of Kipf et al. [6] and extended to heterogeneous graphs by adding two attention mechanisms. The first attention mechanism (node-level attention) defines attention weights for edges between nodes of the same type, that is, between a node and its metapath-based neighbors. The second attention mechanism (semantic-level attention) defines attention weights over the metapath-specific embeddings, which express the importance of the different metapaths. This section provides the details of the model. Figure 9 shows the overall framework of mHAN.

Figure 9: Model framework of mHAN. For each metapath, all house nodes are multiplied by a weight matrix to map them to a lower dimension. Node-level attention is applied to learn the weight between each node and its metapath-based neighbors. According to these weights, the nodes are fused into the metapath-specific embedding ZΦi. Next, semantic-level attention is applied to learn the weights of the metapath-specific embeddings. According to these weights, the metapath-specific embeddings are fused into the final embedding Z. A linear layer is used to predict the output.
