
On the use of multiple graph structures for scientific document embedding

Adriaan de Vries

10795227
Master Thesis MSc Artificial Intelligence
University of Amsterdam
Supervisors: dr. Maarten Marx, dr. Phil Tillman
Assessor: dr. Mike Lees
November 15th, 2020


Abstract

Scientific document embeddings can be useful in various applications, such as citation recommendation, subject classification and search. Despite this, little work has been done on general purpose scientific article embedding. In particular, no previous work has attempted to utilize both inter-author and inter-document information, which are available in scientific work through citations and collaboration. We evaluate embeddings based on all combinations of several information sources on a citation prediction and a classification task, concluding that indeed using both inter-document and inter-author structure improves performance.


Contents

1 Introduction
2 Background
  2.1 Embeddings
  2.2 Text embedding
  2.3 Graph embeddings
  2.4 Linked data embedding
  2.5 Related work
3 Research questions
4 Methodology
  4.1 Data
    4.1.1 Graph definitions
    4.1.2 Subsampling the data
  4.2 Models
    4.2.1 Word2Vec
    4.2.2 FastText
    4.2.3 Node2Vec
  4.3 Experiments and Evaluation
    4.3.1 Embeddings used
    4.3.2 Classification
    4.3.3 Citation Prediction
5 Results
  5.1 Classification
  5.2 Citation Prediction
6 Conclusion


1 Introduction

Embeddings, as understood in machine learning, are functions mapping high-dimensional data to lower dimensional representations. These mappings are a fundamental part of deep learning, as transforming inherently high-dimensional or categorical data into a low-dimensional representation is a common part of any simple feed-forward network. Indeed, multiplication of a vector by a non-square matrix can by this definition be seen as an embedding.

Complementing this, there is increasing interest in the embedding of common types of data as an endeavor in itself, to be used in later downstream applications. For example, natural language is such a common type of data that it makes sense to manufacture embeddings that make it possible for something as inherently high-dimensional as language to be mapped to low-dimensional vectors which somehow capture most of the information of the original text. The most straightforward example of useful embeddings in text processing is the embedding of separate words. There is a plethora of methods available to create these so-called word embeddings [e.g. Mikolov et al., 2013b, Bojanowski et al., 2017]. The idea behind this type of endeavor is that these embeddings would be usable regardless of specific downstream application, as they would encapsulate generally relevant information. Keeping with the example of word embeddings, a simple lossless vector representation of English words would be one-hot vectors in which each index refers to a specific word. Two problems with this approach are that it captures no information about the words, such as encoding similar words similarly, and it requires a vector representation with several hundreds of thousands of elements. The former problem could be solved, for example, by using an explicit distributional vector, which lists the counts for the occurrences of words in the context of a target word, but the latter problem requires a more sophisticated approach. A good word embedding is ideally capable of capturing relevant information in a small, real-valued, dense vector.

In this vein, this thesis will explore the embedding of scientific articles, and its use in downstream applications, such as classification and document similarity. Scientific publications are an interesting subject in that they combine textual information with different kinds of graph-based social structure. This comes in the form of inter-document linking, i.e. citations, and in the form of inter-author linking because of document coauthorship. Previous work has incorporated these sorts of structures for embeddings in various contexts. For example, to embed web-pages for use in an information retrieval system, one might want to incorporate the hyperlinks to other webpages as useful information, which is analogous to citations as being inter-document structure [e.g. Wang et al., 2016b]. Similarly, for the embedding of social media posts, previous work makes use of the social network among the users, thus incorporating inter-author relations [e.g. Liu et al., 2016]. However, to the best of my knowledge no previous work has looked at embeddings that utilize, apart from textual information, both inter-document and inter-author information.

Finding out whether these different types of information are simultaneously useful for the embedding of scientific articles is an inherently interesting and worthwhile task, but apart from this scientific interest it has practical use as well. If we can harness all of these information sources, or indeed if we can conclude that not all of them are useful, we can use this knowledge to improve embeddings of scientific documents. Because we specifically consider task-agnostic embeddings, the use cases for these embeddings are hard to pin down. Indeed, any machine learning system that uses scientific articles might benefit from a good embedding. Regardless, there may be value in discussing some specific use cases more generally.

Firstly, we consider that a good embedding captures information about a paper in itself, and is therefore useful for systems that treat documents in a vacuum, instead of as a part of a larger structure. For example, we might want to be able to algorithmically classify a paper as belonging to a specific field. A good embedding would capture all information necessary to do this, allowing the algorithm to only consider the low-dimensional embeddings as opposed to the complex document as a whole.

Perhaps more interestingly, an embedding also immediately gives rise to a method to compare different documents. If the embeddings of two different articles are close together, these two articles are in some sense similar. This could be immediately useful as a citation recommender. A researcher could pick some previous work important to theirs and look at the documents embedded nearby to find other likely relevant work. Even better, one could embed an almost-finished paper and look for papers which, according to the embedding, are similar, uncovering related work that one might have otherwise missed.

This inherent similarity measure can also be used less directly for improving query-document search. Diversification of retrieved results can be a desirable property of a search engine which requires a method of determining which documents are similar. More generally, any search system might of course benefit from a fixed-size vector representation of the document. Unfortunately, because the embedding method may only be applicable to scientific articles and therefore not to the query, even a very good embedding cannot on its own be used to match documents to queries.

Additionally, if inter-document and inter-author structures are indeed useful for the embedding of scientific articles, this might indicate that they are also useful for different kinds of data that have both of these structures. Posts on social media are often responses to, and therefore linked to, other posts, and their authors are part of a social graph. Wikipedia articles are co-written by several editors, and contain links to other articles. Even code on GitHub has a collaboration graph, and through library imports also contains a form of inter-document linking. Indeed, several sorts of documents have an equivalent to the collaboration and citation graphs, and could perhaps be embedded better if both of those graphs were used.

In short, this thesis aims to find out whether the different information sources inherently present in scientific articles can be combined for a more effective embedding. We specifically consider an embedding to be effective if it captures information about the article in itself, as well as information about how an article relates to other articles. In the next section, we will be taking a more detailed look at the relevant literature, as well as introduce some of the models we will be working with. In Section 3 the research questions are specified, which serve as a backbone to the rest of the thesis, which sets out to answer the research questions as directly as possible. Section 4 then explains the methodology, which includes a description of the data, the models, and the experiments. In Section 5 the results of said experiments will be discussed, and finally Section 6 concludes the thesis, noting both the broad conclusions that can be drawn from this work, as well as the inherent shortcomings.


2 Background

Because this thesis is about embedding scientific documents viewed as a combination of textual and graph-based data, we can largely divide the background into three related but distinct categories: embeddings and their use, text embedding and graph embedding. We will briefly consider these separately, and then discuss related work that deals specifically with general purpose scientific document embedding.

2.1 Embeddings

Embedding has been a central part of machine learning for a very long time. Indeed, the work of Rumelhart et al. [1986] that claims the invention of backpropagation for neural networks does so within the context of embeddings. This embedding may be better described as a task-specific representation in a hidden layer of a feed-forward network, than as a task-agnostic embedding with intrinsic value, but the two are certainly related. At the beginning of this century embeddings for their own sake were already widely accepted as a fruitful endeavor, and Bengio et al. [2003], within the context of word embeddings, describe a neural algorithm for learning word representations from large amounts of unlabeled text, by maximizing the likelihood of a target word given its preceding context. Contrast this method, which is now ubiquitous, to older embedding methods that relied on training data with explicit annotation of relations between entities [Hinton et al., 1986]. They also describe the main advantages of embedding to be reducing the size of the vector representations, and capturing inherent similarity between entities.

By now, embeddings are used for a wide array of complex data. Word embeddings paved the way for embeddings of larger texts [Le and Mikolov, 2014, Dai et al., 2015], users of online services and their preferences get embedded to be served user-specific recommendations [Chen et al., 2016, Grbovic and Cheng, 2018], and music gets embedded to aid automated playlist making [Renshaw and Platt, 2009, Wang et al., 2016a]. This is of course no exhaustive list, but serves to illustrate the omnipresence of embeddings.

2.2 Text embedding

Famously and seminally, Word2Vec [Mikolov et al., 2013b] describes a method to derive low-dimensional vector representations for words, using the intuition that similar words appear in similar contexts. This method was improved over time, and became the default way to represent words in machine learning applications [Mikolov et al., 2013a, Cho et al., 2014, Kim, 2014].

Since then, many methods more or less based on Word2Vec have been suggested. For example, GloVe [Pennington et al., 2014] extends Word2Vec by training on global word co-occurrence instead of a local sliding context window. Alternatively, fastText [Bojanowski et al., 2017] uses subword information to improve word embeddings, specifically on rare or compound words. Doc2Vec [Le and Mikolov, 2014] is an extension which, using ideas from Word2Vec, is able to embed entire documents into low dimensional space. Lau and Baldwin [2016] however note that without careful hyperparameter optimization, taking the average of Word2Vec vectors in a document leads to a better representation than actually using Doc2Vec.

Recently, attention mechanisms [Bahdanau et al., 2015], followed by their successor transformers [Vaswani et al., 2017], have significantly improved upon older language modeling techniques. In particular BERT [Devlin et al., 2019] and GPT [Radford et al., 2019, Brown et al., 2020], both using the transformer architecture, have achieved great results largely due to their sheer size. Unlike Word2Vec and related methods however, training or even finetuning these large models can take considerable time.

2.3 Graph embeddings

One important method for embedding vertices in an unlabeled graph, as opposed to a semantic graph, is Node2Vec [Grover and Leskovec, 2016]. Itself being an extension of DeepWalk [Perozzi et al., 2014], it is capable of producing embeddings for vertices in a graph, without requiring the vertices to have features of their own. Both DeepWalk and Node2Vec use random walks to acquire a neighborhood for a vertex, and use the prediction of this neighborhood as an unsupervised learning task. Note here the similarity to Word2Vec. For a thorough review of graph vertex embedding techniques, see the overview by Goyal and Ferrara [2018].

2.4 Linked data embedding

Embedding linked data is not new. Combining some form of data with an inherent graph structure to arrive at a better embedding is in fact commonplace. For the embedding of social media posts, one can use the social graph to embed the author, which in turn can improve the embedding of the post. Recently, for example, Del Tredici et al. [2019] and Mishra et al. [2019] perform sentiment analysis and topic classification on tweets using a representation made by concatenating the vector derived from a language model with an author-vector derived using a plethora of graph embedding techniques.

For the embedding of scientific articles, the citation graph is well established to be useful information. Kipf and Welling [2016a] introduce the Graph Convolutional Network (GCN) and use it to combine the citation graph and a word occurrence vector based on the text to embed scientific articles for subject classification. A generative model using the GCN and based on the Variational Auto-Encoder [Kingma and Welling, 2013] is provided by the same authors [Kipf and Welling, 2016b]. Wang et al. [2016b] create an embedding specifically for classification by combining Doc2Vec, Node2Vec and a supervised classification component. Han et al. [2018] propose a method in which the text surrounding inter-document links is given an explicit role, with the intuition that this allows the model to infer the purpose of such a link, which they argue is useful for an embedding.

2.5 Related work

Despite the importance of scientific document embedding, few models are used that are built specifically for task-agnostic embedding. Other work tends to focus on task-dependent embeddings, used either for classification [Wang et al., 2016b] or for citation intent prediction [Cohan et al., 2019, Han et al., 2018]. Other models used are, like those previously mentioned, capable of embedding text, graphs and linked data regardless of context.

However, the recent but likely seminal work by Cohan et al. [2020] fills this void by introducing a new model, SPECTER, which is the first to combine the information in the citation graph with the powerful language model SciBERT [Beltagy et al., 2019], which is a variation of BERT specifically tuned to scientific language. In their comparison to other work, they are constrained to general purpose language models, a Simplified Graph Convolutional Network [Wu et al., 2019], and a model specifically tuned to citation prediction [Bhagavatula et al., 2018].

Regardless, whether it be for social media or scientific publishing, to the best of my knowledge no attempt has been made to embed documents using two separate graph structures: one based on links between documents themselves and one based on the social structure of the authors. Cohan et al. [2020] do show that the performance of SPECTER does not improve when incorporating author information, but they take the author to be a categorical variable, as opposed to using the structure of the collaboration network.

This thesis, then, will explore whether using multiple graph structures can aid the performance of embeddings, and will do so within the context of scientific articles. Specifically, we want to know whether the text of a document, the links between documents, and the social structure of its authors carry useful and sufficiently differing information so that using all three of these information sources could lead to a better embedding than using only one or two.


3 Research questions

Before we can understand the research questions, we must first consider the separate information sources we can use to embed a scientific article. First and foremost is the text of the document. For scientific writing specifically, this text tends to begin with a summary of the whole: the abstract. This can allow us to not have to process or gather the entirety of the text.

More interesting perhaps are the graph-based information sources. We will consider three in particular: the citation graph, the author graph and the coauthor graph. The citation graph is an undirected graph in which two documents are linked if either cites the other. The author graph is similar, except two documents are linked if they have at least one author in common. Finally, the coauthor graph, sometimes called the collaboration graph, links authors if they are coauthors of at least one document. Note that the citation graph depends on inter-document structure, and the other two graphs contain inter-author structure. For a formal definition of these graphs, see Section 4.1.1.

Given these information sources then, the following is the research question, immediately broken up into smaller subquestions.

RQ Do the different graph structures naturally present in the scientific network carry relevant information for the embedding of scientific articles, and can a combination of these inter-author and inter-document structures improve these embeddings?

RQ.1 Does the citation graph, in itself, capture relevant information for the embedding of articles?

RQ.2 Do the author and coauthorship graphs, on their own, capture relevant information for the embedding of articles?

RQ.2.1 Which of the author-based graphs can best be used?

RQ.2.2 Is it useful to use both of the author-based graphs?

RQ.3 Given that we use the textual information from the articles themselves, do the graphs add useful information? Do the networks contain conditional information given the textual information?

RQ.3.1 Does the quality of a text-based embedding improve if we append the citation-based embedding?

RQ.3.2 Does the quality of a text-based embedding improve if we append author-based embeddings?

RQ.3.3 Does the quality of the embeddings improve yet more if we use both inter-author and inter-document structure?

Because to the best of my knowledge no work has attempted to combine a content graph such as the citation graph with a social graph such as the author and coauthor graphs, this is precisely the setting that we will want to explore. The central research question RQ is purposefully broad, so the several subquestions are designed to be concrete and directly answerable, together forming an answer to RQ.

Firstly, questions RQ.1 and RQ.2 ask simply whether the graphs capture any information. The answer to these questions is seemingly obvious: yes, they do. Regardless, there is value in stating the questions explicitly, not just because it functions as a sanity check but also because a look at each graph in isolation might tell us something about its behavior, even when ultimately combined with other sources of information.

Question RQ.2.1 then asks which of the author graph and the coauthor graph is the most useful one. Because these two graphs both carry information about the social network of scientific authors they might very well capture the same form of information, and one of them might be better at it. Question RQ.2.2 concerns whether using both of the graphs is beneficial. Indeed, if they capture the same information except for one of them doing it better, there might be no benefit to using both of them over simply using the better one.

Finally, in question RQ.3 we further explore this notion of mutual and conditional information. Although we expect the several graphs to carry information, this in and of itself does not make them useful for embedding. For the embedding of scientific articles we would certainly want to utilize the content of the articles, in the form of their title and abstract. The graphs, then, are only useful if they contain some information either not present in or not extracted from the text.

In this vein, RQ.3.1 and RQ.3.2 ask if inter-document and inter-author structure respectively improve text-based embeddings. In other words, we want to know if the various graphs contain conditional information conditioned on the content of the documents.

Finally, the most interesting subquestion, and the one that I believe has not been answered in the literature, is RQ.3.3. It asks whether an embedding using multiple graph structures as well as the text is expected to perform better than an embedding using the text and only a single graph structure.

As a closing comment, note that the research questions tend to be qualitative, as opposed to quantitative. We are mostly interested in the question of whether or not all the separate information sources are useful for a good embedding. Of course, it is inherently interesting and useful to find out how to make the best possible embedding given the data, but it is difficult to say anything decisive about such quantitative questions. In addition, by only aiming to answer the qualitative part, it is often possible to apply simpler algorithms that only require modest computational power.


4 Methodology

The methodology for this thesis aims to answer the research questions as effectively and simply as possible. Regardless, we will need to include two parts that do not immediately follow from the research questions. First off, we will need to be able to subsample data to a workable size, which is not a straightforward task when working with graphs. Second, we will need to be able to quantify the quality of embeddings to compare different methods, which is central to the research as a whole.

4.1 Data

The data we have access to for this thesis consists of data corresponding to scientific articles available to Elsevier. Specifically, we have access to millions of scientific documents, consisting of the title and abstract, as well as unique identifiers for each of the authors, and a list of citations.

These articles come from every conceivable scientific field, and all of them are equipped with one or more All Science Journal Classification (ASJC) tags, which are labels denoting the fields or subfields the specific article belongs to. These labels are generally inaccurate, as they are a function of the publication in which the paper appears. Because many journals and conferences feature work from different fields, most papers have many ASJC tags, most of which will not be applicable to the paper itself.

Furthermore, for a relatively small subset of the data we have labels similarly denoting the field, which unlike the ASJC codes are accurate. These are the product of manual annotation, generally by the authors themselves. Although we have labeled data, these labels require some preprocessing before we can use them for classification. Specifically, the labels could be any of a large array of scientific fields, from something as granular as “computable model theory” to something as broad and meaningless as “Science”. Any data labeled “Science” might as well be unlabeled, but this is a rare problem. Very granular labels, on the other hand, intuitively carry a lot of information. However, it quickly becomes impractical to do classification on as many classes as there are subfields, so we need a method to somehow synergize the more and less granular labels. To do so, we will be using Omniscience, a taxonomy of scientific fields, to add each general label that the granular label implies.
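The label-expansion step can be illustrated with a small sketch. The Omniscience taxonomy itself is not reproduced here, so the PARENT map below is a hypothetical stand-in with the same shape, and expand_labels is an illustrative helper rather than the exact preprocessing code used.

```python
# Hypothetical child -> parent map standing in for a field taxonomy such as
# Omniscience; the real taxonomy is much larger.
PARENT = {
    "Computable model theory": "Model theory",
    "Model theory": "Mathematical logic",
    "Mathematical logic": "Mathematics",
    "Mathematics": "Science",
}

def expand_labels(labels):
    """Add every more general label implied by each granular label."""
    expanded = set(labels)
    for label in labels:
        current = label
        while current in PARENT:      # walk up the taxonomy toward the root
            current = PARENT[current]
            expanded.add(current)
    return expanded

# A paper labeled only "Computable model theory" also receives "Model theory",
# "Mathematical logic", "Mathematics" and "Science".
print(expand_labels({"Computable model theory"}))
```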

Now then, because of the large amount of data we will only be looking at a specific subset of the data. That is, we only consider articles which have at least one ASJC tag that falls under mathematics. Subsampling like this has several benefits. First, by taking a specific field we retain the structure of the data such that most of the documents cited by documents in our subsample will themselves also be in our subsample. Second, this removes the easiest part of embedding, namely embedding two entirely unrelated articles differently.


Figure 1: Example data and pre-processing. The top left shows the raw data, the bottom left includes derived labels and the right shows the citation, author and coauthor graphs as derived from the data on the left. Note that this is an abstract representation, not a real subset of the data.

Intuitively it should be easy for an embedding to differentiate between a mathematical and a psychological paper, so we can ignore this part and instead focus on slightly more subtle properties of papers, which should also be captured in embeddings. Finally, most of the accurately labeled articles are mathematical, which leaves us with enough labeled data to accurately evaluate embeddings. Note also that the inaccuracy of the ASJC tags used for this subsampling need not lead to problems. Although focusing only on mathematics has its advantages, no method should or will rely on every single article being about mathematics, and therefore noisy data is not a problem.

After this first subsampling step, and after removing any article for which either the title, abstract, citations or authors are missing, we are left with about three million documents. Further subsampling will still be required, but this requires taking into account the graph structures. More on this in Section 4.1.2, but first a thorough definition of the graphs we will be considering.

For a graphical intuition as to what the data looks like, see Figure 1.

4.1.1 Graph definitions

We will be using the following graph definition, in which a graph G = ⟨V, E⟩ is fully defined as an ordered pair of a set of vertices (or nodes) V and a set of edges E. This set of edges is a subset of the Cartesian product V × V. Note that in this definition, the same two vertices cannot be connected by multiple edges. Furthermore, the edges have no characteristics except for the vertices they connect, implicitly making the graph unweighted.

Finally, the graphs are all undirected. That is, if an edge connects some vertices v1 and v2 such that ⟨v1, v2⟩ ∈ E, then also ⟨v2, v1⟩ ∈ E.

What follows is a description of three specific graphs that are used throughout this thesis.

Citation graph The citation graph is a graph G = ⟨V, E⟩ in which the set of vertices V is defined to be the set of scientific papers. Then, ⟨v1, v2⟩ ∈ E if and only if the paper v1 cites v2, or if v2 cites v1.

Author graph The author graph, similarly to the citation graph, has the set of papers as its vertices. Unlike the citation graph however, vertices are connected if and only if the papers that the vertices represent share at least one author.

When subsampling the citation graph as explained in Section 4.1.2, the author graph transforms similarly. Its set of vertices remains equal to that of the citation graph, and two vertices are still connected if and only if they share an author.

Coauthor graph The coauthor graph is fundamentally different from the citation and author graphs. Specifically, it does not have as its vertices the papers, but rather the authors themselves. Two authors are connected by an edge if and only if the two authors are credited as coauthors in at least one paper.

When the citation graph is subsampled, the coauthor graph is transformed in a less straightforward way than the author graph. The new set of vertices includes those authors that authored at least one paper that is included in the subsampled citation graph. However, two authors are still connected if they were coauthors for any paper present in the non-subsampled data.
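As a minimal sketch, the three graphs can be derived from the raw records as follows, assuming each record carries an identifier, a list of author ids and a list of cited ids (mirroring the abstract example of Figure 1). The field names and the use of networkx are assumptions for illustration; all graphs are simple, unweighted and undirected, as defined above.

```python
import itertools
import networkx as nx

# Toy records in the shape of Figure 1.
papers = [
    {"eid": 1, "authors": ["A", "B", "C"], "cites": [2, 3]},
    {"eid": 2, "authors": ["B", "E"], "cites": []},
    {"eid": 3, "authors": ["A", "D"], "cites": [2, 4]},
    {"eid": 4, "authors": ["C", "D"], "cites": [2]},
]

citation = nx.Graph()   # papers linked if either cites the other
author = nx.Graph()     # papers linked if they share at least one author
coauthor = nx.Graph()   # authors linked if they coauthored at least one paper

for p in papers:
    citation.add_node(p["eid"])
    author.add_node(p["eid"])
    coauthor.add_nodes_from(p["authors"])
    citation.add_edges_from((p["eid"], c) for c in p["cites"])
    coauthor.add_edges_from(itertools.combinations(p["authors"], 2))

# Author graph: connect every pair of papers with at least one common author.
for p, q in itertools.combinations(papers, 2):
    if set(p["authors"]) & set(q["authors"]):
        author.add_edge(p["eid"], q["eid"])
```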

Other possible graphs There are several other reasonable graph definitions we could consider, all omitted in this thesis except for this short discussion.

First and foremost, the citation, author and coauthor graphs need not be simple graphs. The citation graph is inherently directed, as citation is non-symmetric, and the author-based graphs could both be multigraphs, with edges weighted by the number of authors different papers have in common, and the number of papers that two authors have coauthored. Regardless, we choose for them all to be undirected and unweighted. Because edges are taken as an indication of similarity, it makes sense for the graphs to be undirected, as similarity, unlike citation, is inherently symmetric. We choose to take the author graphs as unweighted because the vast majority of edges would have unit weight regardless. Including the possibility for different weights would necessitate experimentation on the ideal weighting scheme, likely without significant improvement. Also, even if there were to be improvement with such a method, if this improvement would only be quantitative this would not change the answer to the qualitative research questions. Furthermore, using simple graphs follows the convention in the literature.

Two alternatives to the citation graphs would be based on co-citation or bibliographic coupling, which are the degree to which two documents are cited by the same document, and two documents cite the same document, respectively. Although both of these graphs would have intrinsic relevance, every edge between two papers in these hypothetical graphs is already present as a two-length path in the citation graph. Therefore they are unlikely to lead to qualitative improvement.

4.1.2 Subsampling the data

Given the large amount of scientific articles available, even after filtering on mathematical documents, we need a principled subsampling method to obtain a workable subset. Although ultimately a good embedding method would ideally be usable on extremely large amounts of data, comparing and experimenting with different methods will require a small subset. Here, we will discuss three possible methods for subsampling graph-based data, including the one we end up using.

Randomly sampling the vertices The first idea for subsampling data is a direct one: we could take a random sample of the vertices, and keep those edges that connect two vertices in the sample. This method has the advantage of being simple, both in terms of how easy it is to understand and how easy it is to implement. It is also computationally efficient. Regardless, this is a bad method. If we randomly sample vertices, most of the graph structure will be lost, and the subsampled graph will become largely unconnected.

As an illustration, consider a citation graph with 3.5 million papers, each of which cites on average thirty different papers. This gives us a graph with 3.5 million vertices and, under mild conditions such as no paper citing itself, 105 million edges. The average degree of the vertices in this graph will therefore be 60. If we now want to subsample this graph to contain a manageable 10 000 vertices, that means keeping every vertex with a probability of p = 1/350. Because an edge only persists in the sample if both of its vertices persist, and because the vertices are chosen uniformly and independently, this gives each edge a probability of p² = 1/122 500 of being in the sample.
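Spelling out the arithmetic, the expected number of edges that survive this uniform vertex sampling is

E[edges in sample] = |E| · p² = 105 000 000 / 122 500 ≈ 857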


Therefore, we end up with an expected 10 000 vertices, and about 850 edges. Clearly this subsampled graph would be sparse beyond usefulness.

Maximizing connectivity To combat the problem of ending up with an extremely sparse graph, we could instead maximize connectivity. One way to do so is to iteratively remove the vertex with the lowest degree, until the desired size is reached. This procedure is equivalent to finding the k-core of the graph, except that the value of k is not determined beforehand, and instead the procedure terminates when the graph is small enough.

Although this method is a little more complex than the random sample method, it can be implemented efficiently enough to be feasible. In addition, it certainly does not leave the smaller graph unconnected. On the contrary: by consistently keeping those vertices with the highest degree, the average degree increases significantly. In the end the subsample will predominantly feature the most influential works, for they are cited the most, and large-scale reviews, for they cite the most. Although the resulting graph is connected, it is not representative of the underlying data.

Random walk based sampling Leskovec and Faloutsos [2006] describe a principled method based on random walks to properly subsample a graph, keeping most of its properties intact. Among other things, they measure the degree distribution, size distribution of connected components and the clustering coefficient distribution, to verify which graph sampling techniques work best with the goal of retaining the properties and structure of the original graph. They end up with several similar and similarly good algorithms, of which we will use the one based on random walks for no particular reason. Specifically, they mention another algorithm based on forest fires which is likely to perform equivalently. Both of these sampling algorithms closely resemble sampling nodes in a breadth-first fashion from a random source node, but they do improve upon this baseline.

The sampling method based on random walks works by starting off at a random source vertex v, and taking a random walk from there, adding all vertices on the path to the sample. However, at every step, we return to v with some probability c, which is set to 0.15 as in the original paper. In the event we get stuck, for example because v is poorly connected to the rest of the graph, we continue from another random source node. In practice this is in our case never necessary.
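A minimal sketch of this sampler is given below, assuming a networkx graph. The function name, the patience threshold used to detect being stuck, and the way the induced subgraph is taken afterwards are illustrative choices, not the exact implementation used in this thesis.

```python
import random
import networkx as nx

def random_walk_sample(graph: nx.Graph, target_size: int, c: float = 0.15) -> set:
    """Collect vertices visited by a random walk with restart probability c."""
    nodes = list(graph.nodes)
    source = random.choice(nodes)
    current = source
    sample = {source}
    steps_without_growth = 0
    while len(sample) < target_size:
        if random.random() < c:
            current = source                      # return to the source vertex
        else:
            neighbours = list(graph.neighbors(current))
            current = random.choice(neighbours) if neighbours else source
        before = len(sample)
        sample.add(current)
        steps_without_growth = 0 if len(sample) > before else steps_without_growth + 1
        if steps_without_growth > 100 * target_size:
            source = random.choice(nodes)         # stuck: continue from a new source
            current = source
            steps_without_growth = 0
    return sample

# The subsampled graph is the induced subgraph on the sampled vertices, e.g.
# subgraph = citation_graph.subgraph(random_walk_sample(citation_graph, 200_000))
```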

4.2 Models

Most of the research questions are about the information content of various kinds of structured data. In fact, the questions are fundamentally not about the best models to harness the information. Of course, to be able to qualitatively say something about the information content of data it is imperative to use a model that is able to extract a majority of said information, but it need not necessarily be the best or largest model. With this in mind, we will be using models that are well known and widely used, while still being effective. This will mean we favor simplicity over the current state-of-the-art.

With that out of the way, let us consider the sort of models required. First off, we require a model capable of creating vector embeddings for vertices in a graph, without the need for any other form of information. The vertices in said graph will have no attributes except for their position in the graph, and likewise for the edges. The model that seems like the best fit for this task is Node2Vec, being a widely used general purpose model for just this situation.

Further, we need a model that is able to embed text. This is of course an extensively studied task, and much is published on it regularly. Regardless, keeping in mind that we favor widespread use and simplicity, some direct variant of Word2Vec seems ideal. Although Doc2Vec is generally better at text embedding than taking the average of Word2Vec vectors, this is only the case after careful hyperparameter optimization, which seems contrary to the simplicity requirement for this project in particular [Lau and Baldwin, 2016]. Instead, we will opt for a more direct improvement upon Word2Vec: fastText [Bojanowski et al., 2017]. This method is specifically tailored such that taking the average of word vectors from this method works well as an embedding of the whole text. Moreover, it is widely used and fast and simple to train.

Finally, we require a method for combining textual and graph based data. The GCN [Kipf and Welling, 2016a], with its ability to work on graphs in which vertices have features of their own, seems like the clear choice. Indeed, it is both widely used and relatively simple. However, it is computationally much more expensive than either fastText or Node2Vec. In addition, informal experiments indicate that for our purpose it gives no performance benefit over vector concatenation while being a clear bottleneck. Therefore, for the combination of different embeddings, we will be concatenating the embeddings obtained from the separate information sources.

First, we will take a look at Word2Vec, which is necessary for under-standing the other models. Then follow the two models we will actually be using, fastText and Node2Vec, which can both be understood as adaptations of Word2Vec.

4.2.1 Word2Vec

Word2Vec, specifically the skipgram model, is a model for creating dense continuous embeddings for words from a large corpus of unlabeled text [Mikolov et al., 2013a]. It does so by associating every word in a corpus to its context, with the intuition that similar words appear in similar contexts. Formally, every word v from the vocabulary V has an associated d-dimensional vector u_v. This vector is used to predict which words appear in the context of v by using it as input to the following small network:

f(u_v) = σ(W u_v)

Here W is a |V| × d weight matrix and σ is the softmax activation function. The output f(u_v) can be seen as a probability distribution over all the words in vocabulary V. Using this, we can iterate over the words in the corpus, feed the vector associated with each word into the network, and compare the output distribution with the actual local context distribution. Using a cross-entropy loss we can then use gradient descent to optimize toward the true distribution, thus creating vectors for words that carry the information necessary to predict context. Under the assumption that a word is defined as the contexts in which it appears, this leads to a good fixed-size embedding, in which similar words are close.

The softmax activation function is a bottleneck in the above computation, so subsequent implementations instead used hierarchical softmax, which is an efficient approximation of the true softmax function. Even this has since been superseded by negative sampling, which works by doing binary classification using context for positive samples and random words as negative samples. This only requires local information for the activation function, and does not require the computation of the complete matrix multiplication, making it much faster.
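In practice the skipgram model with negative sampling is available off the shelf; a minimal sketch using gensim is shown below. The assumption here is gensim version 4 or later (where the dimensionality argument is called vector_size), and the tiny toy corpus stands in for the tokenized titles and abstracts.

```python
from gensim.models import Word2Vec

# Illustrative stand-in for tokenized titles and abstracts.
abstracts = [
    ["we", "prove", "a", "new", "bound", "on", "the", "spectral", "radius"],
    ["a", "stochastic", "model", "for", "citation", "networks", "is", "proposed"],
]

model = Word2Vec(
    sentences=abstracts,
    vector_size=100,   # dimensionality d of the word vectors
    sg=1,              # skipgram rather than CBOW
    negative=5,        # negative sampling instead of (hierarchical) softmax
    window=5,          # size of the local context window
    min_count=1,       # keep every word in this tiny toy corpus
)

vector = model.wv["citation"]  # the learned d-dimensional embedding u_v
```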

4.2.2 FastText

FastText is an extension of Word2Vec, which uses subword information to improve the word embeddings [Bojanowski et al., 2017]. That is, not only is every word associated with a vector, but so is every substring of every word. Typically, every substring of a word with length at least 3 and at most 6 has its own embedding. To embed a word, we take the sum of its own embedding and the embeddings of each of its substrings. To retain tractability we do not have a unique embedding for every possible substring, but rather hash the various substrings into a fixed-size array of embeddings.

This subword information is particularly helpful for the embedding of rare and misspelled words, allowing them to receive an embedding similar to that of orthographically similar words, even when such a word has not been seen by the model before.
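A minimal sketch of the text embedding used later in this thesis follows: train fastText with character n-grams of length 3 to 6 and embed a document as the average of the vectors of its words. It assumes gensim version 4 or later, and the toy corpus and helper function are illustrative, not the exact training setup used.

```python
import numpy as np
from gensim.models import FastText

# Illustrative stand-in for tokenized titles and abstracts.
docs = [
    ["on", "the", "spectral", "radius", "of", "random", "graphs"],
    ["a", "variational", "approach", "to", "nonlinear", "elliptic", "equations"],
]

ft = FastText(sentences=docs, vector_size=100, min_n=3, max_n=6, min_count=1)

def embed_document(tokens):
    """Average the fastText vectors of the words in a title plus abstract."""
    return np.mean([ft.wv[token] for token in tokens], axis=0)

# Subword hashing lets fastText produce a vector even for the misspelled
# word "ellipticc", which never occurred in the training corpus.
doc_vector = embed_document(["spectral", "methods", "for", "ellipticc", "equations"])
```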

4.2.3 Node2Vec

Node2Vec is a generalization of Word2Vec that works on graphs instead of text [Grover and Leskovec, 2016]. From every vertex in the graph it creates short random walks, the vertices of which are used as context for the source vertex. These random walks are controlled by parameters which dictate the trade-off between local search and more exploratory random walks, and generally slightly favor exploration.

By using random walks to find a context, graphs can fit into the Word2Vec framework. Training the vertex embeddings to allow for the prediction of their contexts ideally enables the embeddings to capture both homophily and structural equivalence.
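The core idea can be sketched in a few lines: generate random walks from every vertex and feed them to Word2Vec as if they were sentences. For brevity the walks below are unbiased, which corresponds to the special case p = q = 1 (as in DeepWalk); the full Node2Vec method additionally biases each step with the return and in-out parameters. The function names, walk lengths and the use of networkx and gensim are illustrative assumptions.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def generate_walks(graph: nx.Graph, num_walks: int = 10, walk_length: int = 40):
    """Uniform random walks from every vertex, returned as token sequences."""
    walks = []
    for _ in range(num_walks):
        for start in graph.nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = list(graph.neighbors(walk[-1]))
                if not neighbours:
                    break
                walk.append(random.choice(neighbours))
            walks.append([str(node) for node in walk])   # gensim expects string tokens
    return walks

def vertex_embeddings(graph: nx.Graph, dimensions: int = 100):
    """Train Word2Vec on the walks and return a vector per vertex."""
    model = Word2Vec(sentences=generate_walks(graph), vector_size=dimensions,
                     sg=1, window=5, min_count=1)
    return {node: model.wv[str(node)] for node in graph.nodes}
```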

4.3 Experiments and Evaluation

Evaluating models is a complicated part of task-agnostic and largely unsupervised modeling, but it is nonetheless necessary. Because we have access to a small amount of labeled data linking articles to their specific scientific fields, classification accuracy can be used as an evaluation metric. To do so, we would need a classifier on top of the embeddings that is able to handle relatively small amounts of training data. A Support Vector Machine (SVM) fits this description, in theory even able to produce reasonable results with fewer data points than data dimensions, and can be used out-of-the-box. More complicated and better models may very well exist, but as the SVM would only be used comparatively while using the same model for different embeddings, it need not be optimal. Clearly a good classification score would correlate with the amount of relevant information about the article stored in the embedding even if not all of the information is extracted.

As specified before, we would also like the embeddings to capture document similarity. Unlike classification, we have no golden standard for a document similarity test. Instead, we can consider citations to be a good indicator of relatedness, and evaluate embeddings with a citation prediction task.

Schäfer [2018] in his work on the evaluation of scientific documents also specifically recommends classification and citation prediction as evaluation methods.

We will first take a look at the various embedding models we will be testing, and then consider more in-depth the tasks used for evaluation, while making explicit the design choices and the reasoning behind them.

4.3.1 Embeddings used

The different embeddings we will be testing are made up of the separate embeddings based on the four different information sources: the textual information text, the citation graph cit, the author graph auth and the coauthor graph coauth.

All of these embeddings are relatively straightforward. For text, we first train a fastText model on the titles and abstracts of all documents in the corpus, and embed each article by the average of the fastText vectors of the words in its title and abstract.

For the embeddings of the graph-based information, we do not use the entire corpus, but rather the subsampled version. Using the method based on random walks from Section 4.1.2, we reduce the dataset to a size of 200 000 documents, using the citation graph for the random walks necessary for subsampling. For auth and coauth we then simply train Node2Vec on the subgraphs left after subsampling. In the case of auth this gives us paper embeddings directly, and for coauth we obtain author embeddings which we can average to get the embedding for a paper based on its authors. For cit we require one more step, as the citation graph will also be used for evaluation. We therefore set aside one tenth of the edges in the citation graph at random to be used as a test set, and use the rest for the embedding.

For each of the models we will use the hyperparameters indicated as default in the original work introducing the model or, if no such default is specified, the default of the official implementation given in the articles.

For embeddings with more than one information source, we simply concatenate the separate embeddings. This has the advantage of guaranteeing that no information gets lost, as well as being efficient. However, it might not be optimal. More sophisticated methods of combining information sources may lead to better results, but concatenation is sufficient for the purposes of this thesis.

Another option would be to normalize the embeddings before concatenation, to give an equal weight to each of the constituents. We will not be doing this for two reasons. Firstly, because every model we will be using is directly based on Word2Vec, there are likely no large discrepancies in the forms of the vectors that need to be resolved. Second, the norm of a Word2Vec vector does carry information. Words that occur more frequently and have simple unambiguous meanings tend to get larger vectors, so a small norm represents a form of uncertainty about the word which can be useful information [Schakel and Wilson, 2015]. It is unclear however to what degree, if at all, this property also applies to the variants of Word2Vec that we use, as for the text embeddings we will use the average of many words, which are themselves the sum of the embeddings of the subwords, and for the graph embeddings every vertex is equally frequent.

Because every information source can either be used or not, we have 2^4 = 16 embeddings for evaluation. One of these, as it will have no information source, will be the empty embedding.
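The combination step is plain concatenation over every subset of sources. The sketch below uses toy stand-ins for the four per-source embedding dictionaries; the names and dimensionalities are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
paper_ids = [1, 2, 3, 4]
# Toy stand-ins for the per-source embeddings computed above (paper id -> vector).
sources = {
    name: {pid: rng.normal(size=100) for pid in paper_ids}
    for name in ("text", "cit", "auth", "coauth")
}

def combined_embedding(paper_id, use):
    """Concatenate the embeddings of the selected information sources."""
    return np.concatenate([sources[name][paper_id] for name in use])

# All non-empty subsets of the four sources: 2^4 - 1 = 15 concatenations
# (the sixteenth, empty combination carries no information).
combos = [c for r in range(1, 5) for c in itertools.combinations(sources, r)]
vec = combined_embedding(1, ("text", "coauth"))   # e.g. text + coauthor, 200 dims
```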

4.3.2 Classification

Given embeddings for each paper in the subsampled dataset, some of these vectors will correspond to labeled data which we will be using to evaluate the embeddings through classification. The ones that correspond to unlabeled papers will for this part of the evaluation be ignored.

After the data has been preprocessed as described in Section 4.1, we can distinguish between three labels: “physics”, “mathematics” or neither of these two, also denoted “neither”. With these labels, we can use the various embeddings as the input to a classification task. We will specifically use the embeddings as input to a linear SVM with each sample weighted according to the inverse frequency of its label, which is then trained to convergence and used to predict the fields of documents in the test set. Splitting training and test data is done through 4-fold cross-validation, and no distinction is made between validation and test set as no further optimization is done based on the prediction results.
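This evaluation can be sketched with scikit-learn as follows. The embedding matrix X and the labels y are random placeholders here, and the particular solver settings are illustrative; the essential choices are the linear SVM, the inverse-frequency sample weighting and the 4-fold split.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))                              # stand-in embedding matrix
y = rng.choice(["mathematics", "physics", "neither"], size=300)

# class_weight="balanced" weights samples by the inverse frequency of their label.
clf = LinearSVC(class_weight="balanced", max_iter=10_000)
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
predictions = cross_val_predict(clf, X, y, cv=cv)

print("accuracy:", accuracy_score(y, predictions))
```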

Because the three classes are of comparable size, accuracy should give us a single easy-to-compare number for the quality of different methods. Comparing methods qualitatively will require a more interesting approach in which we look not just at the amount of correctly classified papers, but also the specific sets of correctly classified papers. For example, if one method classifies a subset of labeled samples A correctly and another method correctly classifies set B, then the overlap A ∩ B can give us an intuition with regards to the mutual information between the embeddings produced by the two methods. If A ∩ B is empty a combination of the two methods is likely to outperform both separate methods. On the other hand, if A ⊂ B then a combination of the two methods is unlikely to perform significantly better than the second method on its own.

4.3.3 Citation Prediction

Just as the classification task can be understood as testing the capability of embeddings to encode document-specific information, a citation prediction task can be seen as a test of the ability of embeddings to capture inter-document similarity. A good embedding should embed similar documents close together, but we do not have access to a gold standard similarity score between documents. Therefore we instead use citations as a proxy for relatedness, using the intuition that citations correlate heavily with similarity.

If we use citations to evaluate relatedness, we have plenty of data to do so. Despite only using one tenth of the citation graph edges for testing, as the rest is used for training the citation graph embedding, we have 241 090 citations left. Using this subset of the citation graph, we evaluate the citation prediction performance with a triplet task. That is, given some document v1, another document v2 which is connected to v1, and finally v3 which is not connected to v1, we would count as a success for an embedding f the following:

d(f(v1), f(v2)) < d(f(v1), f(v3))


Note that this criterion only compares relative distances, and furthermore it is invariant under scalar multiplication given an appropriate distance metric.

For every edge in the test part of the citation graph, we randomly choose one of its vertices to be v1 in the above notation, and the other becomes v2. v3 is then a document chosen at random from the entire graph, such that v3 is not connected to v1 in the whole citation graph, including the training part. Doing this for the entire test graph, we have 241 090 trials over which we can calculate accuracy.

For the distance metric d we will use the cosine distance. First and foremost because cosine distance is a widely used distance metric, but also because of some of its properties. For example, unlike Euclidean distance, cosine distance is invariant to scalar multiplication of either of its operands. This allows vectors with small norm, which as previously discussed may denote uncertain information, to be matched with similar vectors which represent more certain information and are therefore larger. When using Euclidean distance, uncertain vectors could be matched with other uncertain vectors, even if their information content is entirely opposite.

On the other hand, cosine distance is not invariant to scalar multiplication of a segment of its operands. That is, if one of the partial embeddings within a concatenation is smaller, cosine distance focuses more on the other segments, which are larger and more certain. This distance measure can therefore be seen as using its best guess when evaluating uncertain information, but when presented with both certain and uncertain information favoring the certain, striking a good balance.
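A minimal sketch of the triplet evaluation under cosine distance is given below. The embedding dictionary and the triplet list are illustrative placeholders; in the thesis the triplets come from the held-out tenth of the citation graph.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_accuracy(triplets, embedding) -> float:
    """Fraction of (v1, v2, v3) trials where the cited pair (v1, v2) is closer."""
    successes = sum(
        cosine_distance(embedding[v1], embedding[v2])
        < cosine_distance(embedding[v1], embedding[v3])
        for v1, v2, v3 in triplets
    )
    return successes / len(triplets)

# Toy example with random vectors; the thesis uses 241 090 such trials.
rng = np.random.default_rng(0)
embedding = {i: rng.normal(size=16) for i in range(6)}
print(triplet_accuracy([(0, 1, 2), (3, 4, 5)], embedding))
```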

Apart from the distance metric, we could also have chosen alternative ways to test citation prediction. Some other work has done a ranking task in addition to triplets, in which citations are taken as positive samples [e.g. Schäfer, 2018]. However, such a ranking task tests the exact same thing as the triplet task, namely whether an embedding is capable of differentiating between a related and an unrelated document, and therefore does not add anything.

Another possibility is a harder version of the triplet task, in which negative samples are vertices with a distance of 2 in the citation graph. This task can certainly have a purpose, but we will not be using it. Firstly, because we use citation as a proxy for similarity, the negative test samples might actually be false negatives for our purposes. That is, documents that are close in the citation graph despite not being neighbors might still be related. Secondly, this variation of the task intuitively favors embeddings based on the citation graph, which we expect to outperform other embeddings regardless.


5 Results

5.1 Classification

Graphs                          Acc. (with text)   Acc. (without text)
None                            0.694              0.400
Citation                        0.722              0.694
Author                          0.708              0.612
Coauthor                        0.723              0.682
Citation + Author               0.726              0.701
Citation + Coauthor             0.734              0.717
Author + Coauthor               0.721              0.677
Citation + Author + Coauthor    0.733              0.713

Table 1: The classification accuracy of every combination of the four information sources (citation graph, author graph, coauthor graph and text), when classifying a total of 14285 labeled papers between mathematics, physics and neither of the two. The leftmost column indicates the graphs used, each getting two accuracy scores depending on whether the text of the documents was also used for the embeddings. The upper right cell indicates no information used, and is therefore a baseline that always predicts the most common class. The best score is underlined, as well as those scores that are not significantly different. Significance testing is done with a two-tailed binomial test with p = 0.05 as cutoff.

To start with, Table 1 gives the classification accuracy of all embeddings, when classifying documents between mathematics, physics and neither. This immediately gives us an indication of the performance of the embeddings, and therefore partially answers several of our research questions. First and foremost, every single information source is able to capture something useful on its own, as they are all capable of correctly classifying more documents than a random classifier would be. This was expected to be true, but nonetheless gives an answer to RQ.1 and RQ.2. More interestingly, these results indicate a concrete answer to RQ.2.1, which asked whether either of the two author-based graphs is superior to the other. This seems to have a surprisingly clear answer, as the coauthor graph consistently outperforms the author graph. Indeed, not only is the embedding based purely on the coauthor graph better than the one based on the author graph, we also see that in combination with the textual and citation information, the coauthor graph still outperforms the author graph. Moreover, the embeddings that already use the coauthor graph are not improved when the author graph is added; in fact the classification accuracy tends to decrease, though not by a large amount. This indicates that perhaps the coauthor graph is strictly more useful than the author graph, thus giving an answer to RQ.2.2.

Figure 2, then, gives the confusion matrix for this classification task. Here we can see that it is seemingly much harder to correctly classify the “neither” documents than the ones about mathematics or physics.


Figure 2: Confusion matrix for the classification of math, physics and neither. This specific matrix is for the embedding that uses all information sources, but it is representative of all others.

This is not entirely unexpected. As the pre-filtering of the data was done on mathematical ASJC tags, the neither label is likely to contain many difficult samples, even for humans. That is, this category is more likely to contain econometrics and theoretical chemistry than psychology and architecture. This imbalance does leave the possibility that some embeddings only outperform others based on their performance of classifying the neither category, and one might say that this is less interesting due to the inherent ambiguity and perhaps arbitrariness of the label.

Were we to only consider binary classification between mathematics and physics, the results would be as shown in Table 2. Most importantly, this retains the result that the author graph as an information source seems to have no redeeming qualities when compared to the coauthor graph.

Finally, we want to know the degree to which the different embeddings capture different information. To this end, Figure 3 shows the overlap in accuracy between any two embedding methods. Specifically, if we define C(f) as the set of correctly classified documents by some embedding f, then we can define Relative Improvement (RI) as follows:

RI(f1|f2) = |C(f1) \ C(f2)|

The Relative Improvement of f1 over f2 is then the number of correctly classified documents using f1 that were not correctly classified using f2.


Figure 3: Relative Improvement of all embedding methods over all others. Each cell in this figure gives the amount of documents correctly classified by the embedding method given at the bottom, that were not correctly classified by the embedding on the left. For example, the 794 in the upper right cell means that there were a total of 794 papers that were correctly classified by the embedding that used all information sources, but not by the embedding that used only the text.


Graphs                          Acc. (with text)   Acc. (without text)
None                            0.824              0.501
Citation                        0.850              0.829
Author                          0.845              0.775
Coauthor                        0.857              0.831
Citation + Author               0.856              0.840
Citation + Coauthor             0.859              0.853
Author + Coauthor               0.855              0.829
Citation + Author + Coauthor    0.859              0.850

Table 2: The classification accuracy of every combination of the four information sources (citation graph, author graph, coauthor graph and text), when classifying 11417 papers only between mathematics and physics. The leftmost column indicates the graphs used, each getting two accuracy scores depending on whether the text of the documents was also used for the embeddings. The upper right cell indicates no information used, and is therefore a baseline that always predicts the most common class. The best score is underlined, as well as those scores that are not significantly different. Significance testing is done with a two-tailed binomial test with p = 0.05 as cutoff.

If both RI(f1|f2) and RI(f2|f1) are low, that means that f1 and f2 have similar results. If only one of them is low, this can indicate that one of the embeddings is strictly better than the other. If they are both high, both embeddings presumably excel in different situations.

Turning our attention back to Figure 3, we immediately see that the RI of every embedding over auth tends to be high. We also see that our previous observation that the coauthor graph seems to be strictly more useful than the author graph is confirmed. That is, if the only difference between two embeddings f1 and f2 that both include the coauthor graph is that f1 additionally includes the author graph, then RI(f1|f2) and RI(f2|f1) are both very low.

Figure 4 features a subset of Figure 3 with only four of the embeddings. These four embeddings are central to this thesis, as we would expect any embedding to utilize the textual data, and we are consequently mostly concerned with the additional improvement given by adding the other information sources. If we furthermore consider the author graph to be inferior to the coauthor graph, we are left with the four embeddings shown in the figure. The first thing we notice is that the text-based embedding indeed improves from adding either information source, which indicates a positive answer to RQ.3.1 and RQ.3.2.

On the other hand, the RI of text over text+cit and text+coauth is decently sized as well. This could mean that the former has some quality that the latter two lack, but it could also be a random artifact of the difference between the embeddings. If the embeddings are sufficiently different, their classifications will correlate less with each other by their very nature, leading to a relatively high RI in both directions.


Figure 4: Relative Improvement of a subset of the embedding methods. As in Figure 3, each cell in this figure gives the number of documents correctly classified by the embedding method given at the bottom that were not correctly classified by the embedding on the left.


To see this, consider two embeddings f1 and f2 that are, in some sense, complete opposites. Specifically, consider the situation in which a binary classification using these two embeddings never produces the same prediction. In this situation, the set of correctly classified samples using f1, C(f1), is exactly the complement of C(f2). Therefore, RI(f1|f2) = |C(f1)| − |C(f1) ∩ C(f2)| will be equal to |C(f1)|, as C(f1) ∩ C(f2) = ∅, and similarly RI(f2|f1) = |C(f2)|. Now, even if f1 is a very good embedding, there will still be some misclassifications when using it, for example because the correct label cannot be deduced from the available data. Thus, most but not all samples will be in C(f1), and all other samples will be in C(f2). Because there is no overlap between the two, we also have RI(f1|f2) ≫ RI(f2|f1) ≫ 0. That is, we will have a relatively large RI for f2 over f1 despite f2 not encoding anything actually useful, simply because f1 and f2 are very different.
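A small numeric illustration of this argument, with entirely hypothetical numbers, can be written down directly in terms of sets:

```python
# Hypothetical illustration of the "complete opposites" argument.
docs = set(range(1000))      # 1000 test documents (made-up number)
C_f1 = set(range(900))       # f1 classifies 900 of them correctly
C_f2 = docs - C_f1           # f2 is f1's exact opposite: correct iff f1 is wrong

RI_f1_f2 = len(C_f1 - C_f2)  # 900
RI_f2_f1 = len(C_f2 - C_f1)  # 100: far from zero, although f2 encodes nothing useful
print(RI_f1_f2, RI_f2_f1)
```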

Of course, text and text+coauth are not complete opposites, but the very high RI of text+coauth over text, combined with the relatively high RI the other way around, can be explained by this phenomenon, with the two embeddings being sufficiently different and text+coauth being better. The other possible explanation is that text has some positive property that text+coauth and text+cit lack, but this seems less likely. In terms of information content, this cannot possibly be true, as concatenation of vectors does not lead to any information loss.

More central to this thesis is the question of whether adding both graphs is better than adding either. It does not seem like adding the citation graph to the other sources helps much. There is certainly a small increase in classification accuracy, but the relative improvement is low both ways, which indicates that it has little effect on the embedding for the purposes of classification. We might therefore conclude that using only the text and the coauthor graph is sufficient for classifying scientific documents, but the citation graph, and perhaps even the author graph, might be more important for other purposes. With that in mind, we will now consider the results on the citation prediction task.

5.2 Citation Prediction

Because we use citation as a proxy for document similarity, we should expect the citation graph to be the most important information source for this task. Indeed, as we can see in Table 3, this turns out to be true. Despite the fact that the citations used for this assessment were not available to the citation-graph-based embedding during training, it has learned to capture this specific kind of correlation. However, if we agree that citation can serve as an indicator of similarity or that citation prediction or recommendation is inherently useful, then this is a useful thing to capture, and therefore these results point to a quality of the citation graph, even if they are as expected.


Graphs                          Acc. (with text)   Acc. (without text)
None                            0.744              0.500
Citation                        0.978              0.986
Author                          0.821              0.789
Coauthor                        0.900              0.888
Citation + Author               0.981              0.987
Citation + Coauthor             0.982              0.985
Author + Coauthor               0.899              0.888
Citation + Author + Coauthor    0.982              0.984

Table 3: The citation prediction accuracy of every combination of the four information sources, citation graph, author graph, coauthor graph and text. Each embedding was tested on its ability to distinguish, given a target paper, between a paper that is cited by or cites the target, and another random paper. If the cited or citing paper has the smaller cosine distance to the target, the trial is counted as a success. Accuracy is given over a total of 241 090 trials. The leftmost column indicates the graphs used, each getting two accuracy scores depending on whether the text of the documents was also used for the embeddings. The upper right cell indicates no information used, and is therefore given by the random baseline. The best score is underlined, as well as those scores that are not significantly different. Significance testing is done with a two-tailed binomial test with p = 0.05 as cutoff.
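A minimal sketch of a single such trial is given below, assuming the embeddings are available as numpy vectors; the helper names are illustrative and not the code used for the thesis. The reported accuracy is then simply the fraction of successful trials.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two embedding vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def citation_trial(target_vec, cited_vec, random_vec):
    """Success if the cited/citing paper is closer to the target than a random paper."""
    return cosine_distance(target_vec, cited_vec) < cosine_distance(target_vec, random_vec)

# Hypothetical 4-dimensional embeddings for one trial
rng = np.random.default_rng(0)
target, cited, random_paper = rng.normal(size=(3, 4))
print(citation_trial(target, cited, random_paper))
```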

If we ignore for a moment the results of those embeddings that use the citation graph, all of which perform similarly well, we can see clear differences between the other embeddings. Taken on their own, text performs worst, followed by the author graph and then the coauthor graph. Furthermore, concatenating different information sources seems to generally improve performance, with the caveat that once again adding the author graph to an embedding that already uses the coauthor graph seems to not have any positive effect. It should also be noted that an embedding based only on the citation graph easily outperforms a combination of all other sources.

Now then, because Table 3 only tells us about the degree to which articles that cite each other are embedded close together, we might want to investigate further what the different embeddings look like by manually looking at some articles that have similar embeddings. To this end, Table 4 shows some papers that have a similar embedding. Specifically, for a target paper about multi-agent systems, it shows what different embeddings consider to be the most similar papers in the dataset, in terms of minimal cosine distance. Note that qualitatively the embeddings that use more than a single information source are less interesting because they simply consist of the concatenation of the embeddings of single sources. Because we use a distance measure as opposed to a separate model, there can be no emergent behavior from the combination of these sources. Therefore, we only consider the information sources separately. Furthermore, the author graph is omitted, as it has been previously established that it does not differ from the coauthor graph in a useful way.
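The retrieval behind Table 4 can be sketched as follows; this is an assumed minimal implementation with hypothetical data, showing both the concatenation of per-source embeddings and the nearest-neighbour lookup by cosine distance.

```python
import numpy as np

def concatenate_sources(*embedding_matrices):
    """Combine per-source embedding matrices (n_docs x dim_i) by concatenation."""
    return np.concatenate(embedding_matrices, axis=1)

def most_similar(embeddings, target_idx, k=5):
    """Indices of the k papers with the smallest cosine distance to the target paper."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[target_idx]
    sims[target_idx] = -np.inf          # exclude the target paper itself
    return np.argsort(-sims)[:k]

# Hypothetical data: 100 papers, a 3-dim text embedding and a 2-dim graph embedding
rng = np.random.default_rng(1)
combined = concatenate_sources(rng.normal(size=(100, 3)), rng.normal(size=(100, 2)))
print(most_similar(combined, target_idx=0))
```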


Target paper: Coordinated control of multi-agent systems with bounded control inputs and preserved network connectivity

Text:
- Coordinated tracking for nonlinear multi-agent systems under directed networks
- Teleoperation of multiple cooperative slave manipulators using graph-based non-singular terminal sliding-mode control
- Decentralized control of multi-agent systems for swarming with a given geometric pattern
- Exponential convergence rate of distributed optimisation for multi-agent systems with constraints set over a directed graph
- Hierarchical multi-objective decision systems for general resource allocation problems

Citation:
- Flocking with Obstacle Avoidance: Cooperation with Limited Communication in Mobile Networks
- Self-organization of Swarms with communication delays under disturbances
- Virtual leaders, artificial potentials and coordinated control of groups
- Opinion consensus of modified Hegselmann-Krause models
- Position convergence of informed agents in flocking problem with general linear dynamic agents

Coauthor:
- Bridgeness: A local index on Edge significance in maintaining global connectivity
- Completeness of the space of separable measures in the Kantorovich-Rubinshteĭn metric
- The inverse problem for the Kuramoto model of two nonlinear coupled oscillators driven by applications to solar activity
- Scaling properties of strong avalanches in sand-pile
- Crossover phenomenon and universality: From random walk to deterministic sand-piles through random sand-piles

Table 4: Papers with the most similar embeddings for the text-, citation-graph- and coauthor-graph-based embeddings respectively. For each of these three embeddings, the titles are given of the five papers that are most similar, according to that embedding, to the paper titled “Coordinated control of multi-agent systems with bounded control inputs and preserved network connectivity”.


In this table, we can observe some interesting differences between the different embeddings. The text-based embedding seems to highly prefer papers that focus on graph structures as used in multi-agent systems, and that also use specific terms to that effect. The embedding based on the citation graph, on the other hand, does not require exact terms such as “multi-agent systems”. The related papers are still about these systems, but the text can instead use terms such as “flocking” or “swarms”. Of course, a good text-based model will be able to overcome such things, as it would capture the meaning of the text, but it would still prefer text that is most like the target. The citation graph is completely unaffected by synonymy or homonymy.

As for the coauthor graph, it really appears to capture something different. More so than the others, some of the related papers seem to be purely mathematical in nature. Apparently the embedding is under the impression that the authors of the target paper are part of a mathematical community, and indeed the mathematics is relevant to multi-agent systems. This may very well indicate that the target paper is written from a more mathematical, theoretical point of view, as opposed to a pragmatic, algorithm-based angle of approach, which is the kind of distinction that the coauthor graph is uniquely equipped to make. In addition, the coauthor graph captures a broader relatedness, by recognizing that researchers of multi-agent system simulations tend to also be interested in other kinds of simulations, such as sand-piles, despite their indirect relevance to the target paper.

As we only consider a specific sample, we should be careful about drawing definitive conclusions from this experiment, but the fact that each embedding seems to have its own niche is nonetheless evidence that all of them encode something different.


6 Conclusion

We have seen that, for the embedding of scientific documents, the addition of multiple graph structures as information to an embedding model can improve the quality of the embedding, when compared to embeddings that only use the text of the document, or embeddings that only use the text combined with either of the graphs. Specifically, we have seen that utilizing one graph that represents the inter-document structure and one graph that represents inter-author structure can improve performance, as both of these structures carry information that is neither present in the other nor in the text.

The inter-author structure, best captured by the coauthor graph, improved performance on the classification task. Classification inherently only requires document-specific information, and therefore the coauthor graph helps us capture at least some document-specific information. It seems likely that this graph is then useful in general for tasks requiring document-specific information, because the field a paper belongs to seems to be important information to capture regardless of the specific downstream application, but we cannot conclude this definitively from our experiments.

Similarly, the inter-document structure, in this case represented by the citation graph, greatly improved performance on the citation prediction task. Because of the close relation between the information source and the task, this result provides less evidence for a more general, task-agnostic embedding improvement, but the citation graph is already more widely established as being useful for the embedding of scientific articles.

Regardless, as subject classification and citation recommendation are certainly tasks that a general-purpose embedding should facilitate, we can conclude that using both graphs is useful, as the information contained in either of them is not encapsulated by either the other graph or the textual information.

Referring explicitly to the research questions in Section 3, we can now answer them as follows. For RQ.1 we note that indeed the citation graph is useful for embedding. For RQ.2 we note that the coauthor graph is useful, and that the author graph seems to be a strictly inferior alternative. For RQ.3 we note that indeed the embedding that uses every information source outperforms embeddings that do not. Finally, for RQ we conclude that the ideal scientific document embedding will likely require all three information sources discussed in this thesis.

On the other hand, there is much work still to be done. The most important question that this thesis has not answered, and indeed has not set out to answer, is the matter of how best to combine these different information sources. Both the models themselves and the combination method could be improved, by using larger, more powerful models and by using more sophisticated synergy than concatenation.


Another possibility for future work concerns generalization. On the one hand, a broader evaluation could determine whether the various structures improve performance on a wider array of tasks, for which the newly published SciDocs dataset [Cohan et al., 2020], which includes user activity and thereby allows for various user activity prediction tasks, could be a good starting point. On the other hand, a similar investigation into using multiple graph structures for other kinds of data could reveal whether this approach is worth considering even for data for which it is arguably not as well-suited, such as Wikipedia articles or GitHub projects, which have content, author structure and inter-document structure, but whose graphs are likely to be far less clustered.
