Embedding experts on COVID-19 related topics

Layout: typeset by the author using LaTeX.

Embedding experts on COVID-19 related topics

The effects of Doc2Vec and TF-IDF on specified datasets

Kenan G. Amatradjak

11724404

Bachelor thesis

Credits: 18 EC

Bachelor Kunstmatige Intelligentie (Artificial Intelligence)

University of Amsterdam

Faculty of Science

Science Park 904

1098 XH Amsterdam

Supervisor

Dr. G. Colavizza

IvI

Faculty of Science

University of Amsterdam

Science Park 907

1098 XG Amsterdam

July 3, 2020

Abstract

With the ongoing spread of the COVID-19 virus, efforts are made to combat this pandemic. These efforts include research to discover a potential vaccine. However, less effort is made in creating tools to facilitate this research, specifically for exploring beneficial papers related to COVID-19. In this thesis, I evaluate the performance of doc2vec and tf-idf in creating similarity vectors. These are then reduced in dimensionality for visualization purposes with t-SNE and PCA. This is done in an attempt to aid in the creation of a valuable tool to find experts relevant to specified topics, such as COVID-19. The results showed doc2vec to create a single large cluster, while tf-idf created sub-clusters from the same data. Furthermore, t-SNE created a better visualization than PCA.

Contents

1 Introduction
2 Related Work
 2.1 Embeddings
  2.1.1 Word2vec
  2.1.2 DeepWalk
 2.2 Clustering
3 Datasets & Algorithms
 3.1 The datasets
 3.2 Algorithms
  3.2.1 Doc2Vec
  3.2.2 TF-IDF
  3.2.3 PCA
  3.2.4 t-SNE
  3.2.5 K-Means
4 Methods
 4.1 Data preprocessing
 4.2 Feature extraction and embedding
 4.3 Clustering possibilities
 4.4 Representation
5 Experiments
 5.1 K-Means elbow method (evaluation)
 5.2 The setup
6 Results
7 Discussion & Conclusion
 7.1 Future work

1 Introduction

Wet markets in China are theorized to have caused a multitude of pathogens to infect humans with zoonotic diseases (e.g. 8; 9; 37). Likewise, COVID-19 is speculated to have originated from Wuhan's Seafood Market. In late December 2019, 41 cases of pneumonia were found in Wuhan, caused by the previously unknown coronavirus (Lu et al.). Initially, the virus was believed to have mostly stemmed from infected animals in the wet market, considering that 27 of the formerly mentioned cases had been exposed to the Wuhan market (12). However, later evidence confirmed COVID-19 to have spread human-to-human (e.g. 27; 6; 32; 7). The virus spread quickly, with 282 confirmed cases in January, rapidly increasing to 87,137 confirmed cases at the end of February (25; 29). Despite this, governments deemed the disease a mere flu (33), owing to COVID-19's relatively low fatality rate of 0.37 to 2.9 percent (28) compared to its predecessor SARS, which had a fatality rate of 9.6 percent (1). This judgement likely contributed to an exponentially increasing number of people becoming infected. Consequently, governments across the globe started imposing measures to slow the rate of infection, including quarantining citizens and travel bans (30). As a result, the United States experienced over 20.5 million job losses in April. Furthermore, over 250,000 people have died due to the pandemic as of May 7th, 2020 (26). To prevent COVID-19 from causing more deaths and job losses, a global effort is being made to combat it, such as finding a vaccine or programming tools to improve research effectiveness.

Research on COVID-19 is ever expanding as the pandemic is ongoing. However, most research surrounding this topic revolves around understanding the global disease, in the hopes of unearthing a potential cure. Consequently, research is less focused on tools to facilitate this research. However, research effectiveness is reliant on the quality of the papers assisting said research. Therefore, there is great value to be found in creating tools that aid in discovering relevant papers, and more specifically, in discovering the experts that have contributed to these papers. To further reinforce finding relevant experts, papers should be divided into sub-topics, as there are different research fields within the topic of COVID-19. An advancement that greatly contributed to clustering text is word2vec (22; 23). This state-of-the-art algorithm detects similarities between words, allowing words to be represented in a vector space, with more similar words closer together. An extension of this algorithm is DeepWalk; by adding random walks to the process, it achieves higher accuracy scores. A more recent extension to word2vec is the algorithm by Nikzad-Khasmakhi et al. (24), ExEm. Instead of employing random walks, Nikzad-Khasmakhi et al. (24) opted to utilize dominating nodes, yielding even better results than DeepWalk. However, their research does not describe performance on very specified datasets. Furthermore, these algorithms do not provide similarity scores on the basis of documents; rather, the similarity scores are embedded to words. A solution has been proposed by Le and Mikolov (17), in which an adaptation of word2vec is described. This algorithm is named doc2vec, in which the similarity scores are embedded to documents. However, there is a substantial gap in the research of doc2vec, as it is very analogous to its predecessor. Another advancement in creating similarity vectors for words is tf-idf, shown to excel in clustering tasks. This algorithm is less recent than the aforementioned, yet it provides competitive results (18).

A great leap forward would be to examine the performance of doc2vec on specified datasets, such as only COVID-19 related topics, and to compare this to tf-idf, hereby also evaluating the results of the latter algorithm. Both algorithms can be applied to either segments of text or individual pieces of data, such as the references of a paper. Furthermore, these similarity vectors are often in a high-dimensional space and have to be represented in a low dimension to be applicable as a tool. Therefore, dimensionality reducing algorithms should be used. t-SNE is an algorithm that accomplishes this and excels at creating representations appropriate for visualization. A more traditional approach to dimensionality reduction is the linear PCA. However, it does not offer the specialized visualization of t-SNE.

The research question of this thesis is as follows: How well do doc2vec and tf-idf perform in creating similarity vectors for different data types, and how do t-SNE and PCA compare in creating a visualization for leading experts on COVID-19 related sub-topics based on those similarity vectors?

The remainder of this thesis is divided into six chapters. First, I give an overview of related works surrounding embedding techniques and clustering. Second, I discuss the dataset and algorithms utilized during this thesis. Third, the methodology for generating the results is explained. Fourth, the setup for the experiments and the evaluation metric are described. Fifth, the results from these experiments are shown, and finally, these results are discussed in the chapter thereafter.

2 Related Work

The quality and informativeness of data representation have a direct impact on the performance of an algorithm. Therefore, there is a wide range of research devoted to creating novel techniques for learning representations (3). Commonly, literature describes the task of learning representations of words, i.e. word embedding. However, embedding is not limited to only words (5). Moreover, embedding is a beneficial tool when clustering. In this section, I first cover important works discussing embedding methods and then survey literature related to clustering.

2.1 Embeddings

Embeddings are the mapping of discrete (categorical) variables to a vector containing continuous numbers. In other words, embedding is meant to represent data or features in such a way that a machine can understand them. By doing so, embeddings allow for dimensionality reduction of categorical variables and provide a meaningful representation of categories in the transformed space. For these reasons, using embeddings is shown to be beneficial when clustering in the vector space, while also providing a means to better visualize concepts and the relations among them. Many machine learning algorithms require their input to be represented as a fixed-length feature vector (17). Consequently, as machine learning has gained more traction in recent years, the application of embedding has grown significantly. This interchanging support between the two approaches encouraged further research for novel embedding techniques to be devised (14).

2.1.1 Word2vec

One common technique for embedding words is word2vec (22; 23), a two-layer network that takes a text corpus as input and returns a set of feature vectors which represent the words in that corpus. word2vec's purpose is to detect similarities between words mathematically. The algorithm accomplishes this task by creating vectors which contain a distributed numerical representation of the word's features. Given enough data, word2vec's vectors contain highly accurate data about a word's semantics (18). These vectors can be utilized to cluster documents and classify them by topic. This is due to word2vec training by comparing words against neighbouring words from the input corpus. Word2vec has two algorithms to create representing vectors by word comparison: the Continuous Bag-of-Words model (CBOW) and the (continuous) Skip-Gram model.

The first algorithm is based on the common, yet simplistic, bag-of-words (BOW) method. Its simplicity contributes to its lack of respect for the semantic value within the resulting vectors. CBOW aims to combat this problem by using a continuous distributed representation of the context (22). Instead of vectorizing each word, CBOW utilizes a sliding window to determine the word's context, i.e. the n future and n history words, and then trains to correctly classify the middle word. The second algorithm, the Skip-Gram model, does the opposite. Whereas CBOW trains to predict a word, Skip-Gram trains to classify the context based on a word. More precisely, it learns word representations by maximizing the log-likelihood of a multi-set of co-occurring word pairs (23):

$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \leq j \leq c,\ j \neq 0} \log p(w_{t+j} \mid w_t)$$

given a sequence of training words w_1, w_2, w_3, ..., w_T, where c is the size of the training context. In other words, for every current word, Skip-Gram predicts the n future and n history words. Therefore, Skip-Gram is considered to be more accurate for infrequent words, at the cost of a substantial increase in computational complexity.

Figure 1: The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word (22).
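As an illustration of the two architectures, the minimal sketch below (not part of the thesis; the toy corpus and parameter values are invented, and a recent gensim version is assumed) selects CBOW or Skip-Gram through gensim's sg flag:

```python
from gensim.models import Word2Vec

# Toy corpus; in practice this would be the tokenized text of a document collection.
corpus = [["coronavirus", "outbreak", "wuhan"],
          ["vaccine", "research", "effort"],
          ["coronavirus", "vaccine", "research"]]

# sg=0 trains CBOW (predict the middle word from its context),
# sg=1 trains Skip-Gram (predict the context from the current word).
cbow = Word2Vec(sentences=corpus, vector_size=50, window=2, sg=0, min_count=1)
skipgram = Word2Vec(sentences=corpus, vector_size=50, window=2, sg=1, min_count=1)

print(skipgram.wv.most_similar("coronavirus", topn=2))
```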

2.1.2 DeepWalk

Other studies widely utilize Skip-Gram, such as the algorithm DeepWalk (31), a network embedding technique. Network embedding refers to the approach of learning latent low-dimensional feature representations for the nodes or links in a network (2). For example, in the case of word2vec, a word in a word embedding space would be the equivalent of a node in a network embedded space. As indicated previously, DeepWalk is such a network embedding algorithm. It consists of two stages. In the first stage, the algorithm iterates through the network with random walks. An algorithm employing random walks permits every node to randomly traverse through the network for a set number of times and steps. Every co-occurring path amplifies the similarity between the path's nodes. This allows the algorithm to construe local structures of neighbouring relationships. In the second stage, DeepWalk uses Skip-Gram to learn embeddings that are improved by the inferred structures. For some specific tasks, the representations learned with DeepWalk offer significant performance improvements. Consequently, many subsequent studies focused on extending DeepWalk or utilizing it as a common denominator for results (5).

Two studies that compare their results to those of DeepWalk are the studies by Nikzad-Khasmakhi et al. and Ganguly et al.. Moreover, these studies are more relevant to this thesis as they specifically describe embedding methods for authors. The study by Ganguly et al. proposes an algorithm called Author2vec. The authors of the article argue that DeepWalk suffers from a sparsity problem because it only focuses on link information. Author2vec combats this by utilizing a Content-Info and a Link-Info model when learning its embeddings. The Content-Info model aims to capture the representation of the author only by textual content derived from the abstracts of the author's papers. The second model, the Link-Info model, attempts to enrich the author representations obtained in the previous model by combining them with information concerning the author's linkage to other authors. The resulting embeddings are empirically shown to perform better than DeepWalk on link prediction and clustering tasks. However, Author2vec only outperforms DeepWalk by a slight margin of 2.35% and 0.83% in the aforementioned tasks respectively (11).

The more promising ExEm algorithm by Nikzad-Khasmakhi et al. did provide significantly better results than DeepWalk. Where DeepWalk had an average score of 0.7551, ExEm surpasses this with an average score of up to 0.9915, depending on the method employed (24). ExEm consists of three methods. The first two methods are fastText and word2vec utilizing Skip-Gram, with the third method being the combination of the two. The latter performs best, followed by fastText and then word2vec. This hierarchy is due to fastText's difference in the way it vectorizes each word. Whereas word2vec treats each word in a corpus as an atomic entity, fastText treats each word as composed of character n-grams (4). This comes with many benefits.¹

ExEm also employs the use of dominating nodes, with the algorithm provided in the paper by Esfahanian (10). For a node d to be dominating, every path from an entry node to a node s must pass through node d. One of the mentioned methods is then applied to the dominating set to acquire vectors to experiment and cluster with.

¹ fastText generates better word embeddings for rare words and offers the ability to construct out-of-vocabulary words.


Figure 2: An example of a graph with 6 nodes

Figure 3: Dominating set of graph displayed in red

2.2 Clustering

Clustering is defined by Webster (21) as "a statistical classification technique for discovering whether the individuals of a population fall into different groups by making quantitative comparisons of multiple characteristics". Additionally, a more operational definition goes as follows: for n objects, find K groups based on a similarity measure where the similarities within the same group are high and those of dissimilar groups are low.

The study by Lilleberg et al. compared the categorization performance of word2vec and tf-idf. These results are directly correlated with the potential accuracy for clustering, as they are what will be used as a similarity measure. In the paper, the authors describe utilizing word2vec to amass semantic value and weighing the results against tf-idf, thereby improving the algorithm's accuracy score.

However, having a similarity measure with high accuracy is not the only concern when clustering. The paper by Jain provides a six-element-long list of problems and research directions that are worth focusing on in the field of clustering. The first element discusses the need for benchmark data containing ground truth for better evaluation. The second element sheds light on some applications not needing the best clustering method available; rather, more focus should be put into finding an appropriate feature extraction method. The third element indicates the importance of choosing computationally efficient clustering methods. The fourth element addresses the instability and inconsistency issues that occur when clustering. The fifth element proposes a new way of evaluating clustering principles by determining axiom satisfaction. The final element advocates for the use of semi-supervised clustering methods.

While not all aforementioned methods and principles will be tested in the scope of this thesis, they do carry useful information. The paper by Mikolov et al. (23) provided a fundamental insight into word2vec that helps better comprehend the systems behind doc2vec, a descendant of word2vec. Furthermore, doc2vec is one of the main methods explored in this thesis. Therefore, understanding the algorithm it is built upon is necessary. Additionally, it also provides an understanding of the Skip-Gram method, which is used in many embedding algorithms. One such algorithm is DeepWalk, a staple example in embedding networks. However, I refrained from its utilization as I wanted to explore doc2vec's potential, since it often outperforms word2vec (16). ExEm offers an evaluation of fastText and the results of combining it with word2vec, while also employing dominating nodes. The information gathered from these studies brings great opportunities for future work.

3 Datasets & Algorithms

Evaluating related work gives background information for potential algorithms to consider. Furthermore, the reasoning as to why a particular algorithm is used becomes clearer. In this chapter, I explain the algorithms and the datasets chosen for use during the thesis.

3.1 The datasets

To answer the research question, the dataset should only contain scientific literature related to COVID-19. Using all the literature provided by, for example, Google Scholar would shift the focus of the thesis towards creating an algorithm that would also be capable of retrieving literature related to a specific topic. Fortunately, there are online datasets available that only contain topics related to COVID-19. As multiple datasets attempt to accomplish the same goal, a choice had to be made between them. It was opted to utilize the most expansive and well-known dataset used for coding: CORD-19. The metadata set of 22/06/2020 contained 158947 individual articles, each with further details. However, only a few of these details were relevant: namely, the title, abstract, DOI and Microsoft Academic ID.

Microsoft Academic's API was of great importance during the thesis. While CORD-19 offered a more concentrated dataset, it lacked necessary details concerning each article. As citation count was the main denominator for determining the level of expertise of an author, the dataset needed to contain this. Microsoft Academic's data construction accommodated this and many other features. The features used during the thesis are found in table 1.

Worth noting is Microsoft Academic's automatic field of study retrieval (34). The machine generates topics by applying a linguistic machine learning algorithm. Hereby, it takes all the documents that are in the database of Microsoft Academic and infers a tagging based on those. Re-tagging documents is done on a weekly basis. Therefore, the field of study feature is always up-to-date and its accuracy is decently high, at 81.20%. Consequently, the feature was also employed in the algorithms.

Attribute   Description
AA.AuId     Author ID
AA.DAuN     Original author name
CC          Citation count
DN          Original paper title
DOI         Digital Object Identifier
F.DFN       Original field of study name
F.FId       Field of study ID
Id          Paper ID
RId         List of referenced paper IDs

Table 1: All relevant features extracted from the Microsoft Academic dataframe

3.2 Algorithms

There are two main algorithms used for vectorization and embedding during the thesis: doc2vec and tf-idf. Additionally, PCA was utilized for dimensionality reduction. Likewise, t-SNE also reduces a vector's dimensionality. However, its main purpose was to reduce high-dimensional data to be represented on a plot. Finally, K-means was applied to cluster the data. This section gives an overview of these algorithms.

3.2.1 Doc2Vec

Doc2vec is closely similar to word2vec. It attempts to create numeric embeddings for documents. Le and Mikolov used the word2vec model and added a separate vector (Paragraph ID) for document embedding (Fig. 4). The framework presented in figure 4 is named the Distributed Memory version of Paragraph Vector (PV-DM). It shows great similarities to figure 1, as it is a small extension of the CBOW model. Their key difference is the addition of the paragraph vector. This vector works as a memory for saving the missing information from the current context. The three words are used to predict the fourth word, while the paragraph ID represents the document. Additionally, doc2vec provides a framework similar to word2vec's Skip-Gram model: the Paragraph Vector Distributed Bag of Words (PV-DBOW). However, this model works slower than PV-DM, and PV-DM achieves valuable results by itself.


Figure 4: A framework for learning paragraph vector. Distributed Memory version of Paragraph Vector (PV-DM)
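For concreteness, the minimal sketch below (illustrative only, not code from the thesis; a recent gensim version and toy documents are assumed) shows how PV-DM document vectors can be trained and inferred:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document gets a tag, playing the role of the paragraph ID described above.
docs = [TaggedDocument(words=["covid", "pneumonia", "wuhan"], tags=["paper_0"]),
        TaggedDocument(words=["vaccine", "trial", "immunity"], tags=["paper_1"])]

# dm=1 selects PV-DM; dm=0 would select PV-DBOW instead.
model = Doc2Vec(documents=docs, vector_size=50, window=1, epochs=10, dm=1, min_count=1)

vector = model.infer_vector(["covid", "vaccine"])   # embedding for an unseen document
print(vector[:5])
```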

3.2.2 TF-IDF

Similar to doc2vec, tf-idf provides a model for creating numeric embeddings of documents. It is a measure which evaluates the relevancy of a word to a document in a database of documents. To achieve this, the model multiplies two metrics: the term frequency and the inverse document frequency. Term frequency refers to the rate of a word's appearance in a document. The inverse document frequency calculates the rarity of the word across the set of documents; the closer this value is to 0, the more common the word is. The tf-idf score is calculated as follows, for a word t in document d from the document set D:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$

where

$$\mathrm{tf}(t, d) = \log\bigl(1 + \mathrm{freq}(t, d)\bigr), \qquad \mathrm{idf}(t, D) = \log\left(\frac{N}{\mathrm{count}(d \in D : t \in d)}\right)$$
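The weighting can be implemented directly from these formulas; the short sketch below (illustrative toy corpus, not code from the thesis) mirrors them:

```python
import math
from collections import Counter

def tfidf(term, doc, docs):
    """tf(t, d) = log(1 + freq(t, d)); idf(t, D) = log(N / count(d in D : t in d))."""
    tf = math.log(1 + Counter(doc)[term])
    containing = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / containing) if containing else 0.0
    return tf * idf

docs = [["covid", "vaccine", "covid"], ["lockdown", "policy"], ["covid", "policy"]]
print(tfidf("covid", docs[0], docs))   # high term frequency, moderately common word
```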

3.2.3 PCA

The resulting vectors from doc2vec and tf-idf are reduced in dimensionality with Principal Component Analysis (PCA). PCA is a linear feature extraction technique to reduce the dimensionality of correlated data. It does this while retaining a certain amount of the variation that was present in the data. The data is transformed into a new set of variables known as the principal components (PCs). These are constructed by calculating the eigenvectors of the data's covariance matrix. Therefore, PCs are orthogonal. Furthermore, they are ordered in such a way that a PC's variation retention decreases the lower in the order it resides. Therefore, the first PC has the highest variation retention with respect to the original data. Consequently, the resulting vectors retain high variation relative to the original data, yet the dimensionality can be significantly reduced, depending on the data and the percentage of retained variation.
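In practice, this variance-based truncation is a one-liner; the sketch below (random stand-in data; scikit-learn assumed) keeps the smallest number of principal components explaining 99% of the variance, matching the setting used later in this thesis:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 300)                  # stand-in for doc2vec / tf-idf vectors
pca = PCA(n_components=0.99)                  # a float keeps enough PCs for 99% of the variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape, round(pca.explained_variance_ratio_.sum(), 3))
```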

3.2.4 t-SNE

In contrast to PCA, t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear technique. It performs especially well for visualizing high-dimensional data. The algorithm calculates the probability that points are similar in the high-dimensional space. Furthermore, it calculates the same probabilities for the corresponding low-dimensional space. The similarity between two points A and B is calculated as a conditional probability: the chance that point A is assigned point B as a neighbour, determined as if there were a Gaussian probability density centered around A. The algorithm then attempts to minimize the difference between the resulting similarities in both dimension spaces. It evaluates itself by minimizing the sum of Kullback-Leibler divergences (15) of the data using gradient descent. Hereby, t-SNE can map high-dimensional data to a lower dimension by extracting features in the data. Moreover, it identifies clusters based on the similarity scores of points. However, it is not possible to revert back to the original data, as the input features become unidentifiable. Hence, t-SNE is mainly used for data representation.
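A minimal sketch of such a mapping (random stand-in data; scikit-learn assumed, with default parameters rather than the tuned values reported later):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(200, 50)                   # stand-in for high-dimensional article vectors
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X_2d.shape)                             # (200, 2): one plot coordinate per article
```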

3.2.5 K-Means

K-Means is an iterative algorithm which attempts to divide data into K distinct subgroups. Each data point is assigned to a single group in such a way that the sum of the squared distances between the data points and the clusters' centroids is as small as possible; each data point is assigned to its closest centroid. In the basic algorithm, the initial centroids are randomly selected, meaning that the quality of the clustering is highly reliant on the starting centroids. However, an improved initialization method has been developed, called k-means++. This algorithm permits better initialization of the centroids by choosing each subsequent centroid from the remaining data points with a probability proportional to its squared distance from the nearest already chosen centroid.
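A short sketch of clustering with the k-means++ initialization (illustrative data; scikit-learn assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 10)                                    # stand-in feature vectors
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)

print(km.labels_[:10])                                         # cluster label per data point
print(km.cluster_centers_.shape)                               # (5, 10): one centroid per cluster
```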

4 Methods

There are four requirements to successfully approach the research question at hand. First, to uncover leading experts in the field of COVID-19, it is essential to gather and process COVID-19 related data. Second, an appropriate algorithm should be chosen to find similarities for clustering. Third, a decision has to be made on which clustering options should be chosen based on these similarities. Finally, a method to represent the high-dimensional data is decided upon.

4.1 Data preprocessing

The first requirement involves creating dataframes that contain all processed and necessary data for the algorithms to work with. These dataframes are derived from scientific literature related to COVID-19. As the scope of the thesis is limited and datasets comprising only COVID-19 related literature existed already, it was opted to utilize such a dataset. There are multiple COVID-19 research datasets available. However, from these options, it was decided to make use of the CORD-19 dataset. This dataset is ever-expanding with a large number of articles. However, its features are limited. The main reason for CORD-19 being chosen over other COVID-19 research datasets is that it includes a Microsoft Academic ID.

By calling upon the Microsoft Academic API, all further necessary information is retrieved for creating an author dataframe. Every article is queried with its DOI and, if available, with its Microsoft Academic ID. Because of Microsoft Academic's restrictions, repeated requests were made to the API with only 15 articles per query. By only querying 15 articles, the iteration was shown to be most effective while adhering to Microsoft's restrictions. Now that all relevant information is extracted, text preprocessing is done on the available abstracts. This involved stop-word removal and stemming. The impacts of the two preprocessing steps are small (35). In addition, the study by Lilleberg et al. showed unsatisfactory results for tf-idf when stop words were removed. However, by applying stop-word removal and stemming, the dimensionality of the feature vectors is reduced and the efficiency of the system improved (36). Every article retrieved from Microsoft Academic has its columns filtered, so only the columns given in table 1 are present. Finally, the articles are then added to a new dataframe, df_article.
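The sketch below illustrates the described preprocessing of an abstract (stop-word removal and stemming); the NLTK tooling is an assumption, not necessarily what was used in the thesis:

```python
# Requires the NLTK data packages: nltk.download("punkt"); nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(abstract: str) -> list[str]:
    tokens = word_tokenize(abstract.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

print(preprocess("In late December 2019, 41 cases of pneumonia were found in Wuhan."))
```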

df_article (19636 rows)

Attribute      Description
Title          Original article title
Id             Article ID
CC             Article's citation count
Abstract       Original article abstract
ProcAbstract   Processed article abstract
Ref            Referenced paper IDs
fosTitle       Article's fields of study
fosIds         Article's field of study IDs
aNames         List of original author names of the article
aIds           List of author IDs of the article

Table 2: The columns of df_article

In this state, the system is ready to be employed for article classification. However, the goal is to classify authors. Therefore, a new dataframe, df_author, is created. Every row in df_author is a separate author, extracted by iterating through each article's authors. The article's information is then added to the columns for each author. If an author has written multiple articles, then the information from the additional articles is appended to the author's original row. Likewise, the article's citation count is added to the author's total citation count. Then, with both dataframes completed, feature extraction is applied to some of their columns.¹

df_author (89668 rows)

Attribute        Description
Name             Original author name
Id               Author ID
tCC              Author's total citation count
pTitles          List of titles from the author's articles
pIds             List of IDs from the author's articles
pAbstracts       List of original abstracts from the author's articles
pProcAbstracts   List of processed abstracts from the author's articles
pRefs            List of referenced paper IDs from the author's articles
pCCs             List of citation counts from the author's articles
pfosTitles       List of fields of study from the author's articles
pfosIds          List of field of study IDs from the author's articles

Table 3: The columns of df_author

¹ For computational efficiency, I strongly advise using dictionaries instead of dataframes for the processes described in this paragraph. Had this discovery been made earlier, less time would have been spent waiting.
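A minimal sketch of the dictionary-based aggregation the footnote recommends (the article records and field names are illustrative, loosely mirroring table 2):

```python
from collections import defaultdict

# Tiny stand-in for rows of df_article.
articles = [
    {"Id": 1, "CC": 12, "ProcAbstract": "covid pneumonia wuhan", "aIds": [10, 11]},
    {"Id": 2, "CC": 3,  "ProcAbstract": "vaccin trial immun",    "aIds": [11]},
]

authors = defaultdict(lambda: {"tCC": 0, "pIds": [], "pProcAbstracts": []})
for article in articles:
    for author_id in article["aIds"]:
        entry = authors[author_id]
        entry["tCC"] += article["CC"]                  # accumulate total citation count
        entry["pIds"].append(article["Id"])
        entry["pProcAbstracts"].append(article["ProcAbstract"])

print(dict(authors)[11])                               # author 11 contributed to both articles
```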


4.2 Feature extraction and embedding

Feature extraction was applied to three of the columns in the dataframes. Specifically, the columns containing the following information: processed abstracts, fields of study and references. The latter two are unusual, as the used embedding methods are commonly applied to large texts. Therefore, the parameters were changed when experimenting. However, the columns consisting of processed abstracts did offer the usual text data. This information was utilized in the embedding methods discussed below.

There are two embedding methods employed during the thesis, namely doc2vec and tf-idf. Doc2vec's involvement in the thesis is due to the success of its predecessor, word2vec. The state-of-the-art word embedding algorithm is still highly regarded and its systems are employed in staple works, such as DeepWalk. However, word2vec itself was not used because it embeds words, and topic similarity was of less importance than the similarity between authors. Therefore, it was opted to use doc2vec, as it offered the accurate similarity vectors of word2vec while providing an embedding for objects other than words. Furthermore, there are not many studies addressing doc2vec's potential or utilization. For that reason, I chose to contribute to supplying this field. As for DeepWalk's random walks, I chose not to implement them for the sake of reducing the code's complexity and the scope of the thesis. The employed methods in the thesis already gave viable results, thus I deemed it an opportunity for future work. The second method, tf-idf, was employed because of its high accuracy score of 0.894595 (18). Furthermore, as stated previously, fields of study and references are unusual data types, both being a list of extracted features. Therefore, using tf-idf allows the focus to be on the features rather than their context. The resulting vectors from these methods are then reduced in dimensionality by utilizing PCA, maintaining 99% of the original variance.
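As an illustration of applying tf-idf to one of the unusual data types, the sketch below (invented reference lists; scikit-learn assumed) treats each article's reference IDs as a "document" and then reduces the vectors with PCA at 99% retained variance:

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative referenced-paper IDs for four articles.
refs = [[101, 102, 103], [102, 104], [101, 105, 106], [103, 104, 106]]
ref_docs = [" ".join(map(str, r)) for r in refs]       # one space-joined "document" of IDs per article

X = TfidfVectorizer(use_idf=True).fit_transform(ref_docs).toarray()
X_reduced = PCA(n_components=0.99).fit_transform(X)    # keep 99% of the original variance

print(X.shape, X_reduced.shape)
```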

4.3 Clustering possibilities

It is already possible to represent these resulting vectors. However, for further specification, the next step is to convert the data to clusters. The employment of K-means is based on the elements of focus by Jain. As Jain suggests, the focus in research often relies too heavily on the clustering methods, rather than the feature extraction. Therefore, the initial choice of K-means allowed more research to be allocated to the previous section, since clustering methods are readily interchangeable. Moreover, K-means is computationally efficient, as the dataset used was relatively small (19636 articles). DB-Scan was considered as another clustering method. However, its complexity was out of proportion for the goal of this thesis and its computational efficiency worse than that of K-means. A drawback of K-Means, however, is that it finds clusters even in a completely uniform dataset and tends to divide the dataset into even segments. The impact of adhering to the remaining suggestions made by Jain is to be discovered in future work.

4.4 Representation

Once all articles and authors have a cluster labelled to them, their position in a high-dimensional space has to be represented. The choice is between one of two methods: PCA and t-SNE. PCA offers an arguably more accurate plot than that of t-SNE. However, when confronted with a high-dimensional vector, PCA's limitations show, often resulting in an incoherent representative plot. Therefore, when employing PCA, the feature vectors had to be low-dimensional, which resulted in fewer features being represented. t-SNE, on the other hand, provides clear plots with minimal cluttering regardless of dimensionality. Initially, the use of df_article was neglected. Therefore, tf-idf and doc2vec were applied to the relevant columns of df_author. However, this approach resulted in the plots from t-SNE being deemed undesirable. The reason for this was t-SNE's randomization. Consequently, authors that had written the same article would not have the same position on the plot. Instead, they would form a small oval around the centroid of the article. This was uninformative. Therefore, the algorithms were applied to the relevant columns of df_article instead of the columns of df_author. The resulting plot was representative of the articles, and their coordinates on the plot are saved. Every author is automatically assigned the same coordinates as their first article. If an author has written multiple articles, then the author's coordinates become the midpoint of the articles' coordinates. Hereby the uninformative ovals are removed, producing an arguably more representative author plot. Moreover, the size of a point on the plot is determined by the total number of citations the article or author has.
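The author-coordinate rule can be stated in a few lines; the sketch below uses hypothetical article coordinates and authorship lists:

```python
import numpy as np

# Hypothetical 2-D plot coordinates per article (as produced by t-SNE or PCA).
article_coords = {"a1": np.array([0.2, 1.5]), "a2": np.array([2.0, -0.5])}
author_articles = {"author_x": ["a1", "a2"], "author_y": ["a2"]}

# An author's position is the midpoint (mean) of their articles' coordinates.
author_coords = {author: np.mean([article_coords[p] for p in papers], axis=0)
                 for author, papers in author_articles.items()}

print(author_coords["author_x"])   # midpoint of a1 and a2
```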

5 Experiments

This chapter discusses the setup for testing the methods described in the previous chapter and reflects on the iterations that the code went through. The setup for experimenting consists of including and excluding methods and adjusting parameters. To simplify experimenting, the complete pipeline, with its varying methods and parameters, was wrapped into a single function, so that results can be constructed more quickly.

5.1 K-Means elbow method (evaluation)

As said, K-Means attempts to minimize the total within-cluster sum of squared distances. This defines the compactness of a cluster, and the goal is to have it be as small as possible. Hence, an optimal number of clusters should be found to accomplish the smallest sum of squared distances. The most common method for this is the elbow method using the distortion score. It functions as follows: compute K-Means with different values for K. For each K, the total within-cluster sum of squared distances is calculated and plotted. A favourable plot would exhibit a kink, otherwise called an elbow. The respective K value at the kink is the optimal number of clusters. However, the distortion score does not serve as a means of evaluating the clusters made. Additionally, it often occurs that the resulting plot does not contain a kink, causing ambiguity about the optimal value of K. Therefore, a newer method is employed: the silhouette coefficient.

The silhouette coefficient calculates the average within-cluster distance a and the average nearest-cluster distance b for each data point i. The formula is as follows:

$$s(i) = \frac{b(i) - a(i)}{\max\bigl(a(i), b(i)\bigr)}$$

Then the global silhouette score is calculated as follows:

$$S = \frac{1}{N}\sum_{i} s_i$$

This takes the average over all silhouette coefficients. This average is computed for every value of K to find the optimum. If the plot is favourable, a peak should show, indicating the optimal K. Additionally, the average silhouette coefficient also serves as a measure of the quality of a clustering, as it determines how well each data point lies within its cluster. Therefore, it carries a second purpose as an evaluation metric for unsupervised clustering.

Figure 5: An example plot of the elbow method using the distortion score on the abstracts, showing a well-defined kink.

Figure 6: An example plot of the elbow method using the average silhouette coefficients on the abstracts, showing a peak indicating the optimal K.
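The K selection described above can be sketched as follows (random stand-in data; scikit-learn assumed): compute the average silhouette coefficient for a range of K and take the peak.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 20)                     # stand-in for the PCA-reduced feature vectors

scores = {}
for k in range(2, 15):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)     # average silhouette coefficient for this K

best_k = max(scores, key=scores.get)            # the peak of the silhouette curve
print(best_k, round(scores[best_k], 3))
```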

5.2 The setup

Doc2vec and tf-idf are tested differently, as doc2vec provides more parameters. Doc2vec's window size is either 1 or 5; this is done to determine whether context is of influence. Furthermore, the number of epochs takes the following values: 1, 5, 10. As for tf-idf's parameters, inverse document frequency was either employed or not. The dimensions of the feature extraction vectors are in the range: 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000. This range is applicable for both methods. Afterwards, PCA is applied to the resulting vectors. The vectors are then clustered using K-Means.

The value of k is determined using the elbow method. This gives both an optimal k value and the evaluation score for the methods. Initially, the score is determined using the silhouette coefficient. The traditional approach for the elbow method, the distortion score, is not used because it does not provide an evaluation metric. The value of k is then based on the silhouette coefficient and the computational time. Tf-idf and doc2vec now have their best scoring vector with a respective optimal value of k. These vectors will be reduced in dimensionality again for representation purposes.

The high-dimensional vectors are required to be reduced for them to be represented. There are two methods to choose from: t-SNE and PCA. The latter does not offer any experimenting opportunities, as PCA is a linear process and provides no parameters to change. Therefore, only t-SNE's parameters are experimented with, namely perplexity, learning rate and early exaggeration. Perplexity was tested with the following values: 5, 10, 20, 30, 40, 50, 100, 200, 500, 1000. For each perplexity value, early exaggeration was iterated through the values: 1, 5, 10, 20, 30, 100. The best resulting parameters are taken and the learning rate is changed over the values: 10, 50, 100, 200, 1000. Furthermore, the number of iterations ranges from 1000 to 3000 with steps of 1000.
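The doc2vec side of this sweep can be organized as a simple grid loop; the sketch below (toy documents; gensim assumed) mirrors the listed parameter values, with the downstream PCA, K-Means and silhouette evaluation left as a comment:

```python
from itertools import product
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["covid", "vaccine"], tags=[0]),
        TaggedDocument(words=["lockdown", "policy"], tags=[1])]

dims = [5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000]
windows = [1, 5]
epoch_values = [1, 5, 10]

for dim, window, epochs in product(dims, windows, epoch_values):
    model = Doc2Vec(documents=docs, vector_size=dim, window=window,
                    epochs=epochs, min_count=1)
    # ... apply PCA and K-Means to the document vectors and record the
    #     silhouette coefficient for this parameter combination
```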

6 Results

The results of the experiments are shown in this chapter. The first experiments were done on doc2vec. As mentioned previously, the initial experiments are evaluated utilizing the vector's silhouette coefficient. The results from feature extraction on the abstracts of articles with doc2vec are exhibited below.

Figure 7: Doc2vec's silhouette coefficients for article abstracts, with vector dimensions in the range of 5 to 1000.

Graph 7 shows the silhouette coefficient over variable parameters. Beyond a dimension size of 200, the scores are relatively stable for all parameters. However, before 20 dimensions the scores decrease rapidly and are relatively close to one another. Additionally, it becomes apparent that 1 epoch yields significantly better results. Furthermore, a window size of 1 consistently outperforms its counterpart.

Graph 8 displays the silhouette coefficient for field of studies and references. In contrast to graph 7, there is no clear decline with dimension size for any method. However, for field of studies, learning for only 1 epoch consistently outperforms using multiple epochs. For references, learning for 5 epochs results in a better coefficient score. Likewise to graph 7, a window size of 1 performs better for both methods.

Figure 8: Doc2vec's silhouette coefficients for articles' fields of study and references, with vector dimensions in a range of 5 to 5000.

Tf-idf's silhouette coefficients were also measured for all methods: abstracts, field of studies and references. The result is shown in graph 9. Likewise to doc2vec, there is a substantial decrease before 50 dimensions. However, the total silhouette coefficient of tf-idf is higher than doc2vec's. Moreover, not employing inverse document frequency provides no significant difference in performance.

Figure 9: Tf-idf's silhouette coefficients for all methods, with vector dimensions in a range of 5 to 1000.

The best parameters for doc2vec are determined and shown in table 4. Furthermore, the best parameters for tf-idf's methods were consistent. All further testing for tf-idf was done with the employment of inverse document frequency and a dimension size of 5.

Doc2Vec best parameters

Method             Dimension size   Window   Epochs
Abstracts          5                1        1
Field of Studies   5000             1        1
References         1000             1        5

Table 4: doc2vec's best parameters for each method

To represent the best vectors, t-SNE and PCA were applied for the dimensionality reduction used for plotting. t-SNE's parameters were tested, resulting in the parameters being set to: perplexity = 100, early exaggeration = 10, learning rate = 200, and 1000 iterations. Some of the resulting visualizations are shown below.


Figure 10: References are plotted by employing doc2vec for embedding and t-SNE for plotting. The first graph shows the articles' references plotted. The second graph shows the authors plotted based on the coordinates of their article(s).


Figure 11: References are plotted by employing doc2vec for embedding and PCA for plotting. The first graph shows the articles' references plotted. The second graph shows the authors plotted based on the coordinates of their article(s).


Figure 12: Abstracts are plotted by employing tf-idf for embedding and t-SNE for plotting. The first graph shows the articles' abstracts plotted. The second graph shows the authors plotted based on the coordinates of their article(s).


Figure 13: Abstracts are plotted by employing tf-idf for embedding and PCA for plotting. The first graph shows the articles' abstracts plotted. The second graph shows the authors plotted based on the coordinates of their article(s).

7 Discussion & Conclusion

Based on the results shown previously, some conclusions can be drawn. The silhouette coefficient does not evaluate how informative clusters might be in lower dimensions. In graph 8, field of studies trained with 1 epoch and a window size of 1 seems to constantly increase with every increment in dimension size. However, further testing has shown that the resulting plots from a dimension size of 1000 onward did not show any significant difference. Therefore, a better-defined cluster in high-dimensional space might not have a better representation of the information in low-dimensional space.

The quality of the high-dimensional space vectors relied heavily on the utilized algorithm. Doc2vec generated very scattered plots for each method (abstracts, field of studies or references). Where t-SNE was used for plotting (Fig. 10), the resulting visualizations often had a snake-like aesthetic to them, where every data point seemingly is attached to a main line. The clusters defined by K-Means do not seem to be actual clusters; rather, they simply divide this line into K segments. The PCA plot (Fig. 11) exhibits this phenomenon more clearly. Here, there is a clear line dividing each segment. Just as with t-SNE's plot, there seems to be a main line with only a few outliers. I hypothesize doc2vec to be generalizing too much, only showing a significant difference between articles or authors if one discusses a topic that is irrelevant to COVID-19. As every article and author in the dataset is relevant to COVID-19, doc2vec is unable to create multiple well-defined clusters, as it likely sees the entirety of the dataset as an individual cluster.

However, tf-idf did show some well-defined clusters with t-SNE (Fig. 12). This is likely due to tf-idf's distinctions, as it vectorizes based on word count. Therefore, as it finds authors or articles which use the same words, it is probable they ought to be clustered, neglecting the words' context. The methods tested (abstracts, field of studies and references) were of relatively small data size. Therefore, by ignoring context and focusing on singular aspects (e.g. a word) of the data, the algorithm can concentrate on the more nuanced differences. As was discussed by Lilleberg et al. using word2vec, better results can be conceived by combining the similarity findings of doc2vec (resemblant of word2vec) and the specificity of tf-idf.

Furthermore, tf-idf generates complete clusters which are sometimes represented by a singular point when plotted with PCA. The first graph in figure 13 shows every data point for the green cluster in the same position. This point is barely surrounded by other data points. However, the author plot changes the empty space to be filled with data points, indicating the potential significance of the articles contained within that single point. Authors in the second graph are plotted by taking the midpoint of the coordinate vectors of their corresponding articles. In the plots generated using PCA, the influence of an area of data points becomes clear, whereas author plots generated with t-SNE become cluttered and their clustering labels seemingly scrambled. This is likely due to the non-linearity of t-SNE. Hence, using a linear method of generating a new coordinate (i.e. taking the midpoint) would create less comprehensible results when applied to a non-linear plot.

Initially, the article plots were redundant, as the vectorization was done by taking all of an author's combined information from each of their articles. For instance, doc2vec was applied to a concatenation of an author's abstracts for each article the author had contributed to. This resulted in more coherent doc2vec plots using t-SNE. However, as was discussed earlier, uninformative ovals of data points were created, each oval being authors that had written the same article (Fig. 14). This was an undesired occurrence, as the authors within these ovals had the exact same information and similarity vectors. Hereafter, it was attempted to give all authors identical coordinates if and only if they had written the same article(s). If an author had written multiple articles, the similarity vectors of those articles were averaged and embedded to the author. Then, t-SNE was applied again, creating a plot in which some authors had identical coordinates and other authors were positioned according to their newly embedded similarity vector. This method resulted in even better author plots for t-SNE. However, the author plot no longer showed any resemblance to its original article plot, causing evaluation of the plot to become much harder. Additionally, the aforementioned ovals appeared again in lesser quantities.

Figure 14: Five ovals, where each point is an author and each oval is an article.

Thus, reflecting on the research question, doc2vec performed relatively poorly in creating similarity vectors according to the silhouette coefficients. Tf-idf scored significantly higher in total and also showed multiple clusters in the resulting t-SNE plots. However, an argument can be made that doc2vec would perhaps outperform tf-idf if the complete dataset were less specified. Furthermore, combining the two algorithms would most likely yield the best results. As for the representation, PCA created less coherent article plots. Even in the cases where t-SNE would find very well-defined individual clusters, PCA failed at representing these well. Instead, it often created a rounded triangle, with each corner representing a cluster. However, the author plots with PCA were more comprehensible. Because of its linearity, the midpoint coordinate method for authors was better translated to the plot in comparison to t-SNE's plots.

7.1 Future work

It has been concluded that t-SNE created better article plots. However, the current method used for creating author plots is seemingly not compatible with t-SNE. An earlier iteration for generating author plots did produce favourable results, namely the one in which some authors had identical coordinates, while other authors got new coordinates based on the average of their articles' similarity vectors. However, evaluating those plots was near impossible within the scope of this thesis. Therefore, a point for improvement in the future is using the earlier iteration for creating author plots and employing an appropriate evaluation metric, hereby potentially uncovering the reasons for the creation of the uninformative ovals.

Furthermore, the method for producing similarity vectors could be further elaborated on. The random walks utilized in DeepWalk could alleviate the specificity problem that doc2vec demonstrated. Furthermore, the use of dominating nodes in embeddings proposed by Nikzad-Khasmakhi et al. has the potential to greatly benefit this algorithm. These methods used together, while combining doc2vec and tf-idf, could yield promising results.

In figure 11, there are clear divisions in what is seemingly a slightly scattered line. This is the manner in which K-Means labelled the data. However, as mentioned before, K-Means tends to cluster data that does not contain any clusters. Consequently, the algorithm always yields believable results, yet these can often be incorrect. Therefore, future research should consider using another clustering algorithm, perhaps revisiting the utilization of DB-Scan. Furthermore, as this thesis did not adhere to all of the guidelines proposed by Jain surrounding clustering, future research might opt to do so.

References

[1] (2015). Summary of probable sars cases with onset of illness from 1 november 2002 to 31 july 2003.

[2] Arsov, N. and Mirceva, G. (2019). Network embedding: An overview. arXiv preprint arXiv:1911.11726.

[3] Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828.

[4] Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

[5] Brochier, R., Guille, A., and Velcin, J. (2019). Global vectors for node represen-tations. In The World Wide Web Conference, pages 2587–2593.

[6] Chan, J. F.-W., Yuan, S., Kok, K.-H., To, K. K.-W., Chu, H., Yang, J., Xing, F., Liu, J., Yip, C. C.-Y., Poon, R. W.-S., et al. (2020). A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster. The Lancet, 395(10223):514– 523.

[7] Chen, N., Zhou, M., Dong, X., Qu, J., Gong, F., Han, Y., Qiu, Y., Wang, J., Liu, Y., Wei, Y., et al. (2020). Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in wuhan, china: a descriptive study. The Lancet, 395(10223):507–513.

[8] Chen, Y., Liang, W., Yang, S., Wu, N., Gao, H., Sheng, J., Yao, H., Wo, J., Fang, Q., Cui, D., et al. (2013). Human infections with the emerging avian influenza a h7n9 virus from wet market poultry: clinical analysis and characterisation of viral genome. The Lancet, 381(9881):1916–1925.

[9] Cheng, V. C., Chan, J. F., Wen, X., Wu, W., Que, T., Chen, H., Chan, K., and Yuen, K. (2011). Infection of immunocompromised patients by avian h9n2 influenza a virus. Journal of Infection, 62(5):394–399.

[10] Esfahanian, A.-H. (2013). Connectivity algorithms. In Topics in structural graph theory, pages 268–281. Cambridge University Press.

[11] Ganguly, S., Gupta, M., Varma, V., Pudi, V., et al. (2016). Author2vec: Learning author representations by combining content and link information. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 49–50. International World Wide Web Conferences Steering Committee.

[12] Huang, C., Wang, Y., Li, X., Ren, L., Zhao, J., Hu, Y., Zhang, L., Fan, G., Xu, J., Gu, X., et al. (2020). Clinical features of patients infected with 2019 novel coronavirus in wuhan, china. The Lancet, 395(10223):497–506.

[13] Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651–666.

[14] Koehrsen, W. (2018). Neural network embeddings explained.

[15] Kullback, S. (1997). Information theory and statistics. Courier Corporation.

[16] Lau, J. H. and Baldwin, T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368.

[17] Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196.

[18] Lilleberg, J., Zhu, Y., and Zhang, Y. (2015). Support vector machines and word2vec for text classification with semantic features. In 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), pages 136–140. IEEE.

[19] Ling, W., Luís, T., Marujo, L., Astudillo, R. F., Amir, S., Dyer, C., Black, A. W., and Trancoso, I. (2015). Finding function in form: Compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096.

[Lu et al.] Lu, H., Stratton, C. W., and Tang, Y.-W. Outbreak of pneumonia of unknown etiology in wuhan china: the mystery and the miracle. Journal of Medical Virology.

[21] Merriam Webster Online Dictionary (2020). Cluster analysis. Merriam-Webster.

[22] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[23] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

[24] Nikzad-Khasmakhi, N., Balafar, M., Feizi-Derakhshi, M. R., and Motamed, C. (2020). Exem: Expert embedding using dominating set theory with deep learning approaches. arXiv preprint arXiv:2001.08503.

[25] Organization, W. H. et al. (2020a). Coronavirus disease 2019 (covid-19): situation report, 1.

[26] Organization, W. H. et al. (2020b). Coronavirus disease 2019 (covid-19): situation report, 108.

[27] Organization, W. H. et al. (2020c). Coronavirus disease 2019 (covid-19): situation report, 3.

[28] Organization, W. H. et al. (2020d). Coronavirus disease 2019 (covid-19): situation report, 31.

[29] Organization, W. H. et al. (2020e). Coronavirus disease 2019 (covid-19): situation report, 41.

[30] Parmet, W. E. and Sinha, M. S. (2020). Covid-19—the law and limits of quarantine. New England Journal of Medicine, 382(15):e28.

[31] Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710.

[32] Phan, L. T., Nguyen, T. V., Luong, Q. C., Nguyen, T. V., Nguyen, H. T., Le, H. Q., Nguyen, T. T., Cao, T. M., and Pham, Q. D. (2020). Importation and human-to-human transmission of a novel coronavirus in vietnam. New England Journal of Medicine, 382(9):872–874.

[33] Piguillem, F., Shi, L., et al. (2020). The optimal covid-19 quarantine and testing policies. Technical report, Einaudi Institute for Economics and Finance (EIEF).

[34] Shen, Z., Ma, H., and Wang, K. (2018). A web-scale system for scientific knowledge exploration. arXiv preprint arXiv:1805.12216.

[35] Song, F., Liu, S., and Yang, J. (2005). A comparative study on text representation schemes in text categorization. Pattern analysis and applications, 8(1-2):199–209.

[36] Uysal, A. K. and Gunal, S. (2014). The impact of preprocessing on text classification. Information Processing & Management, 50(1):104–112.

[37] Woo, P. C., Lau, S. K., and Yuen, K.-y. (2006). Infectious diseases emerging from chinese wet-markets: zoonotic origins of severe respiratory viral infections. Current opinion in infectious diseases, 19(5):401–407.
