
Estimating the prestige of CS scientists by their conference repertoire

Pepijn Reurs (11053003)

Bachelor thesis
Credits: 18 EC

Bachelor's Programme in Artificial Intelligence
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Dr. M.J. Marx
ILPS, Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 28th, 2019


Acknowledgement

First, I would like to thank my supervisor, Maarten Marx, who has greatly supported me by providing useful feedback and by helping me find a concrete goal. I would also like to thank Sander van Splunter for providing guidance and structure during the earlier part of the project when Maarten was not available.


Abstract

This thesis aims to discover whether it is possible to predict the prestige of a researcher based on the works that he or she published at computer science conferences. This will be accomplished by applying regression to a dataset of feature vectors in order to predict the often-used h-index of each researcher.

There are two main variations in how researchers are represented through these vectors. In the first variation, each vector consists of the number of publications a researcher has published at each conference, weighted by the prestige of those conferences. The vectors in the second variation are similar, but each publication is also weighted by the number of researchers that have worked on it, therefore taking contribution into account.

In order to speed up certain operations, only published work from the A*-ranked conferences (according to CORE), and only work published in the year 2000 and onward, will be looked at.

Modelling the publications of an author using his or her total contribution to each publication, weighted by the prestige of each conference, worked best. Unfortunately, the results were disappointing: we only found a rather low correlation between our way of measuring prestige and the h-index of an author.


Contents

1 Introduction
  1.1 Research Questions
  1.2 Thesis Structure
2 Related Work
  2.1 Networks
  2.2 Predicting prestige
3 Background
  3.1 What will be used to quantify prestige?
  3.2 How will the repertoire be described?
  3.3 Regression
  3.4 What relationship can be expected between the repertoire and the prestige?
4 Method
  4.1 Data
    4.1.1 DBLP Database
    4.1.2 Retrieving the h-index
  4.2 Constructing the network
  4.3 Constructing the feature vector
  4.4 Evaluation
5 Results
6 Discussion
7 Conclusion


1 Introduction

Last year, on the 8th of November 2018, the results of research by Fraiberger et al. were published in an issue of Science. In this research, a network of museums and galleries was constructed which showed the movement of art between these institutions. This was done by constructing what is called a 'coexhibition network' (Fraiberger, Sinatra, Resch, Riedl & Barabási, 2018).

In this coexhibition network, museums and art galleries were linked to each other based on how artists exhibited their art among these museums and galleries. From this network, various observations were made.

The first half of the report discussed the construction of the network, and showed that the importance of the nodes in the network indeed corresponded with the rank and popularity of that institution.

The second half of the report focused on the emergence of the reputation of artists in the network. One of the conclusions drawn from this report was that there exist several relations between the popularity of the artists, the longevity of their careers, and the popularity of the institutions at which the artists exhibited. Similar relationships seem to be present in other networks, such as those of start-up companies and their employees (Bonaventura et al., 2019) and actors in show business (Williams, Lacasa & Latora, 2019). This thesis therefore aims to discover whether such a relation also holds between the prestige of computer scientists and their published work at conferences.

Vrettas and Sanderson (2015) have shown that, for computer science, conferences are an important venue for publications because they are more highly valued than publications in journals or other kinds of venues. Therefore, if a relation indeed exists between the prestige of a researcher and the conferences at which he or she published, that would have some interesting implications.

1.1 Research Questions

The aim of this thesis is to discover whether there might be a relationship between the works submitted to and published at conferences, and the prestige of the researcher. The main research question can therefore be stated as: "Is it possible to predict the prestige of a researcher based on his or her published work at conferences?"

This question, however, raises some additional questions:

• What would be the definition of prestige in this context? An array of different metrics exists for determining the prestige of a researcher. Section 3.1 discusses a couple of those metrics and deems one metric (the h-index) to be the most suitable candidate.

• What source will be used in order to determine the works published at conferences? The data that will be used is a subset of the DBLP bibliography for computer science, combined with information from the CORE conference rating dataset. This combined subset contains only the conferences graded with an A* rating according to CORE, and only published work from the year 2000 and onward will be looked at. Additional information about these datasets and their subset will be discussed in section 4.1.1.


• What method will be used in order to predict the prestige? Regression will be applied in order to predict the prestige, along with various regularization methods. What kind of methods will be applied is discussed in section 3.3. In addition, the prestige of conferences will be added to the publications; this prestige will be retrieved by constructing a network similar to the network constructed by Fraiberger et al. More information about this network can be found in section 3.2 and section 4.2.

After the previous questions are answered, but before regression is applied, it is a good idea to have an overview of the distribution of the data. Therefore, the following sub-question will be asked:

• What is the distribution of scientists and the amount of their published content at conferences and/or prestige? For example, how many scientists have two publications at conferences? And how many scientists have an h-index of 100 or more?

Various parameters will be tinkered with when applying regression in order to optimise the results. There are, however, two aspects which could influence the accuracy of the models and which warrant two sub-research questions of their own:

• What is the effect of the network time window (τ) on the accuracy? When constructing the network, two conferences will only be linked when a researcher has published at both conferences within a time window of τ years. The question asked here is whether changing this time window results in better or worse accuracy.

• Does a feature vector that takes contribution into account perform better? In other words: when taking into account the number of researchers that have worked on a single project, does the accuracy increase?

1.2 Thesis Structure

First, the related literature and works that resemble the goal of this thesis will be discussed. Then, in the theoretical background section, various terms and techniques will be described in detail. This is followed by the method section, which outlines the methods used to obtain the relevant data, the construction of the network, and how both the data from the network and the data about the researchers were combined into a feature vector. In the results section, the findings of the methods will be described. The analysis of these findings will then follow in the discussion section. Finally, the conclusion section will provide a concise summary of the findings of this thesis.

2 Related Work

This section is divided into two subsections. The first describes related work on networks such as the network described in Fraiberger et al. (2018). The second discusses other attempts to predict the prestige of researchers.


2.1 Networks

As discussed earlier, the article of Fraiberger et al. describes the construction of a coexhibition network in which the nodes represent art institutions. The first half of the article explains the general structure and construction of this network: two institutions have a directed weighted link between them if an artist exhibited their work at both institutions within a window of τ exhibitions. The second part of the article focuses more on the artists within the network and how reputation emerges from their path through said network. The article makes various observations, such as that artists who start out exhibiting in a more prestigious art gallery are not only more likely to have a longer-lasting career than their fellow artists, but also tend to develop a higher reputation than their peers.

A similar network has been constructed in order to analyse the success of start-ups. In Bonaventura et al. (2019), a network is constructed in which nodes represent companies and links represent the flow of employees, and the associated transfer of know-how, across companies. This network was constructed using data on firms and related people from the www.crunchbase.com website. The results obtained from this endeavour indicated that the closeness centrality of a node in this network corresponded to the future success of that company.

2.2 Predicting prestige

Various attempts have been made to predict the prestige, scientific success or scientific impact of researchers. Many of these attempts used the metric called the h-index as a quantifier for success. More information on the h-index can be found in section 3.1.

In Acuna, Allesina and Kording (2012), attempts were made to predict the future h-index of scientists on the basis of 18 different features found in their CVs. These features include aspects such as the number of articles published in top journals, the number of years in a postdoctoral position, the current h-index of the researcher and career length. This research used data from Scopus (www.scopus.com), an online database of academic papers and citation data, along with regression, to predict the future h-index of neuroscientists.

Ayaz, Masood and Islam (2018) built further upon the research of Acuna et al. The main difference is that various regression methods were used on the Arnetminer dataset (Tang et al., 2008) in order to predict future h-index values. Arnetminer is a collection of publications from the field of computer science, collected from various other datasets. The features used in this attempt are similar in nature to those of Acuna et al.; they included, but were not limited to, the current h-index, the average citations per paper, the number of co-authors and the number of journals.

Both Ayaz et al. (2018) and Acuna et al. (2012) worked with a large number of different features, which could prove troublesome when trying to predict the h-index. In Acuna et al., this was solved by implementing elastic net regularized regression (Zou & Hastie, 2005), which served to limit the number of features used, as well as to determine the most useful features. In order to estimate the accuracy of the regression, Acuna et al. used the R² measure on different sets, each consisting of 5 different features (with one exception). Ayaz et al. used two different accuracy measures for their regression functions: the R² measure and the RMSE (root mean squared error).

The regression methods used by Acuna et al. managed to reach an R² value of 0.67 when predicting the h-index for the next five years. Ayaz et al. eventually managed to reach an average R² value of 0.82 and an average RMSE of 0.77 when predicting the h-index for the next five years for researchers with more than one year of experience. Predictions for those with only one year of experience were unfortunately not very accurate, scoring only an R² value of 0.33 and an RMSE of 0.94.

3 Background

The background section discusses various theories, concepts and models that help answer the main research question. The section is split into four parts. The first part discusses a number of methods that are used to measure the prestige of a researcher, and then explains why one of these methods (the h-index) was chosen for quantifying prestige. The second part explains how the repertoire (that is to say, the work published) is interpreted from both the dataset and the network. In the third part, a number of regression methods are discussed, as well as their application to the problem. In the final part, predictions are made as to what relationship can be expected between the repertoire and the prestige.

3.1 What will be used to quantify prestige?

There exist various metrics that can be used to indicate the prestige of a researcher. In the past, the most popular metrics were the total number of papers published and the total number of citations. Both of these metrics have disadvantages: the total number of papers does not take the quality of those papers into account, and the total number of citations excessively favours single papers with an enormous number of citations. Due to these disadvantages, Jorge E. Hirsch presented a new metric, called the h-index, that would provide the best of both metrics. The h-index takes not only the quantity of papers published into account, but also the overall quality of those papers (Hirsch, 2005).

The h-index for a single researcher is computed in the following manner: for each publication, determine the number of times that publication has been cited. Then sort the publications by citation count, from highest to lowest. The value of the h-index corresponds to

\[ \text{h-index}(f) = \max_i \min(f(i), i) \]

where i is the i-th publication in the sorted list, and f is the function giving the number of times publication i has been cited. Thus the h-index is the largest rank i at which the publication at that rank still has at least i citations.
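To make this concrete, here is a minimal Python sketch of the computation; the function name and the example citation counts are illustrative, not taken from the thesis data.

```python
def h_index(citations):
    """Compute the h-index from a list of per-publication citation counts.

    Sort the counts from highest to lowest; the h-index is then the largest
    1-based rank i whose publication has at least i citations, i.e.
    max_i min(f(i), i).
    """
    sorted_counts = sorted(citations, reverse=True)
    h = 0
    for i, count in enumerate(sorted_counts, start=1):
        h = max(h, min(count, i))
    return h

# Five publications cited 10, 8, 5, 4 and 3 times give an h-index of 4.
print(h_index([10, 8, 5, 4, 3]))  # 4
```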

The main advantages of the h-index are that it is not only easy to compute, but also robust, as it is insensitive to sets of papers that are rarely cited (Alonso, Cabrerizo, Herrera-Viedma & Herrera, 2009).


Unfortunately, the h-index also has some disadvantages (Alonso et al., 2009). For example, the h-index is sensitive to career length, making it difficult to compare researchers of different career lengths, and to self-citations, making it possible for researchers to artificially boost their h-index. The h-index is also insensitive to the citation count of a researcher's most cited papers. Finally, the h-index does not allow researchers of different fields to be compared and does not take the context of the citations into account; many other indicators and metrics, however, suffer from these last two issues as well (Alonso et al., 2009). Different approaches and implementations have been developed to mitigate some of the disadvantages of the h-index, such as the g-index, the normalised h-index and the m-index. However, studies have shown that many of the h-index variants correlate heavily with the h-index itself (Alonso et al., 2009).

Due to this correlation, and due to the fact that the h-index is more widely available, the h-index was chosen as the metric for prestige.

3.2 How will the repertoire be described?

The repertoire (published work) of an author is described by a vector of weights. Each dimension of this vector represents an A*-rated CS conference. The weight for a conference X for an author A is the total number of publications of A at conference X multiplied by the prestige of X. As prestige we use the eigenvector centrality value (also known as eigencentrality) of the conference in the network.

The eigencentrality is a measure of how important a node in a network is (Newman, 2016). It is computed from the leading eigenvector of the network's adjacency matrix. In order for a node to have a high eigencentrality, it needs to be connected to other nodes that also have a high eigencentrality value.
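As an illustration, the eigencentrality values can be obtained with the networkx module that is also used for the network construction in section 4.2; the toy conference graph below is purely illustrative.

```python
import networkx as nx

# Toy directed, weighted conference network; nodes, edges and weights are
# illustrative, not the thesis data.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("SIGIR", "WWW", 3),
    ("WWW", "SIGIR", 2),
    ("WWW", "KDD", 5),
    ("KDD", "SIGIR", 1),
])

# Eigenvector centrality of each conference, taking edge weights into account.
prestige = nx.eigenvector_centrality(G, weight="weight", max_iter=1000)
print(prestige)
```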

Using the amount of work published at each conference and the eigencentrality of each conference, a feature vector can be constructed. More information about the construction of the feature vector can be found in section 4.3.

3.3 Regression

The method applied in order to predict the h-index of each researcher is linear regression, along with various regularization methods applied to linear regression. Regularization is applied because the vector of weights has a high dimensionality: as will be discussed in section 4.1.1, there are 64 different conferences under consideration. Having 64 dimensions risks overfitting the model, which is why regularization is applied to prevent this from happening.

The regularization methods that were applied are: Lasso, Elastic-Net, Lars and Lasso-Lars.

Lasso (Tibshirani, 1996) tries to reduce the number of coefficients the linear model uses by looking for the solution which contains the fewest non-zero coefficients. It therefore tries to maintain a solution that takes only the most essential conferences into account.


Elastic-Net (Zou & Hastie, 2005) uses both L1 and L2 regularization of the coefficients. This allows the linear model, like Lasso, to learn sparse coefficients, while still maintaining regularization properties.

The Lars algorithm (Efron, Hastie, Johnstone, Tibshirani et al., 2004) tries to find the feature that is most correlated with the target value. If at a certain step multiple features turn out to be equally correlated, it proceeds in an equiangular direction between these features. This will hopefully result in Lars finding the conferences most essential for predicting the h-index.

Lasso-Lars (Efron et al., 2004) combines the Lars algorithm with the Lasso algorithm, thus trying to find the conferences most correlated with the h-index while keeping the solution sparse.

Each of these methods is implemented using the scikit-learn Python module (Pedregosa et al., 2011).
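A minimal sketch of how these five models can be instantiated with scikit-learn is given below; the regularization strengths (alpha) shown are illustrative defaults, not the tuned values used for the reported results.

```python
from sklearn.linear_model import (ElasticNet, Lars, Lasso, LassoLars,
                                  LinearRegression)

# The five (regularized) linear models compared in this thesis.
models = {
    "Linear Regression": LinearRegression(),
    "LASSO": Lasso(alpha=1.0),
    "Elastic-net": ElasticNet(alpha=1.0, l1_ratio=0.5),  # mixes L1 and L2
    "Lars": Lars(),
    "Lasso-Lars": LassoLars(alpha=1.0),
}
```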

3.4 What relationship can be expected between the repertoire and the prestige?

The expected result is that the repertoire of a researcher correlates with the h-index in some capacity. As was discussed in section 2.2, both Acuna et al. (2012) and Ayaz et al. (2018) show that it is possible to predict the h-index from information that includes aspects such as the number of publications, the number of articles published in the top 10 journals of the computer science field and the number of distinct journals in which a researcher published his or her work. These aspects are contained in the feature vector that is being used, although conferences are used instead of journals. According to DBLP, the majority of publications occur at conferences (about 51.74% according to https://dblp.org/statistics/distributionofpublicationtype.html as of July 28th 2019), which could indicate that conferences hold the same kind of information. The strength of this claim is increased by the fact that conferences are a more important venue for computer scientists than journals (Vrettas & Sanderson, 2015).

4 Method

4.1 Data

This section describes the data that was used to construct the network and from which the researchers were retrieved. It also describes how the h-index for each researcher was obtained.

4.1.1 DBLP Database

DBLP is a bibliography that provides free access to high-quality bibliographic meta-data for computer scientists around the world (Ley, 2002). As of January 2019, it contains roughly 4.4 million publications, published by over 2.2 million researchers. Of those publications, about 51.74% consist of conference and workshop papers, which are the entries relevant to answering the main research question. The complete data can be downloaded in the form of an XML file (available at http://dblp.uni-trier.de/xml/).
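As an illustration of working with this dump, the sketch below streams the XML file record by record using lxml; it assumes dblp.xml and its DTD (dblp.dtd, needed to resolve the character entities used in the dump) sit in the working directory, and the filtering shown merely mirrors the subset described below rather than reproducing the exact extraction code.

```python
from lxml import etree

# Stream the large dblp.xml dump one record at a time; 'inproceedings'
# records correspond to conference and workshop papers.
context = etree.iterparse("dblp.xml", tag="inproceedings", load_dtd=True)

for _, record in context:
    key = record.get("key")                    # e.g. 'conf/sigir/...'
    year = record.findtext("year")
    authors = [a.text for a in record.findall("author")]
    if key and key.startswith("conf/") and year and int(year) >= 2000:
        pass  # this record would be kept in the subset
    record.clear()                             # free memory as we go
```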


The XML file is large, taking up about 2.49 GB as of 17 June 2019. So, in order to speed up operations on the dataset, only a subset of the DBLP data is used. This subset consists of the publications that appeared at the highest-ranked conferences. In order to determine the highest-ranked conferences, the conference database from CORE is used. CORE is an association of university departments of computer science in Australia and New Zealand ("CORE Rankings Portal - Computing Research & Education", 2019). CORE maintains a database of conferences which are ranked by an executive committee, with periodic rounds for the submission of requests for addition or reranking of conferences. The highest rank that can be achieved by a conference in the CORE database is A* (A-star). Therefore, the subset of the DBLP database that is used is the subset whose publications were submitted to a conference ranked A-star according to the CORE database. The latest update of the database, CORE2018, dated 16 December 2017, is used to determine the rank.

Unfortunately, combining the information from both datasets did not go smoothly. Some inaccuracies were introduced during the construction of the database, which caused some conferences and workshops of a lower grade to be matched with an A-star ranked conference. Due to this, the DBLP code, rather than the conference name of each entry, was used to determine at which conference a paper was published. This roughly corresponds to about 64 different conferences. For more information about the dataset, see the thesis by Schouw (2019).

The resulting database consists of 583327 rows and 10 columns, the most notable of which are the following:

• Title: The title of the publication.

• Author: The name of one of the researchers working on this publication.

• Year: The year in which the publication was released.

• PublishedIn: The name of the conference or workshop to which this publication was submitted.

• PublishedInDBLPCode: The code by which this conference or workshop is referred to in DBLP.

• PublicationType: The type of the publication, which in this case is always 'conf' for conference.

• NrOfAuthors: The number of researchers that have worked on this publication.

• InvNrOfAuthors: The result of dividing one by the number of researchers that have worked on this publication.

Of the 583327 publications in this dataset, only 475405 belonged to researchers that submitted more than one publication in total. Since researchers with a single entry do not contribute when constructing a network between conferences, they were removed from the dataset. This brought the total number of different researchers in the dataset to 78841.


4.1.2 Retrieving the h-index

There exist only a couple of databases which compute the h-index of their researchers. The most notable ones are Scopus (www.scopus.com), which requires a subscription, and Google Scholar (scholar.google.com), which has automatically calculated the h-index values of its users since 2011, for free. Due to this, retrieving the information from Google Scholar was deemed the best way to obtain the h-index values for the researchers. This was done by scraping the h-index value from each researcher's corresponding profile page.

The automatic scraping of Google Scholar profiles was done using the Python module known as scholarly, which allows its users to retrieve author and publication information from Google Scholar in a friendly, Pythonic way ("Scholarly", 2019). The process of scraping the h-index from Google Scholar was, unfortunately, slow. So, in order to obtain some early h-index values, the website www.guide2research.com was also scraped. Guide2research provides a top 1000 list of the researchers in computer science with the highest h-index. This h-index is retrieved from Google as well, so there should be no discrepancy there. Roughly 2154 entries were scraped from guide2research (since the top 1000 list contained more than 1000 entries). Also due to the slowness of scraping, the decision was made to look only at scientists that had 10 or more publications at A*-rated conferences.
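A sketch of such a profile lookup with scholarly is shown below. The scholarly API has changed between releases, so this assumes the interface of recent versions (older versions used an author.fill() method on the result instead); the name is illustrative.

```python
from scholarly import scholarly

def fetch_h_index(name):
    """Return the h-index of the first Google Scholar profile matching
    'name', or None when no profile is found."""
    author = next(scholarly.search_author(name), None)
    if author is None:
        return None
    # Fill only the citation indices to keep the request small.
    author = scholarly.fill(author, sections=["indices"])
    return author.get("hindex")

print(fetch_h_index("A. Researcher"))  # illustrative name
```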

One of the problems that arose during scraping was that some researchers yielded no results from Google. This might be due to name differences between DBLP and Google Scholar profiles. Roughly 20% of the authors were not found and therefore have no matching h-index.

Another problem, which arose when matching the h-index values from guide2research to the feature vectors, was that the names of some researchers on guide2research differed from the names used in DBLP. This problem was solved by using a name comparison module known as whoswho to help match the correct names ("Whoswho", 2019).

A bigger problem was that the scraper sometimes mismatched given names and profiles, which caused some researchers to obtain the wrong h-index. In order to solve this problem, the whoswho module was applied here as well.
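The whoswho module compares two name strings and judges whether they plausibly refer to the same person, handling initials and name order; a minimal usage sketch with illustrative names:

```python
from whoswho import who

# True: an abbreviated DBLP-style name can match a full profile name.
print(who.match("R. E. Liebowitz", "Robert Evan Liebowitz"))
```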

4.2 Constructing the network

The network is constructed in a similar fashion to the network described in Fraiberger et al. The nodes of the network represent computer science conferences, and the directed links between them represent researchers publishing their works at both of the connected conferences. Two conferences are linked when a researcher first published his or her work at the first conference and then published his or her next work at the other conference within τ years. If both publications occurred within the same year, a two-way directed link is created between the conferences: the DBLP database only contains the year in which a conference occurred, so it is not possible to identify the correct chronology of conferences within the same year. The Python module networkx is used to construct this network. The weight of each link is incremented by 1 for each researcher that published work at both conferences within a window of τ years. In order to study the effect of the time window on the conference weights used in the feature vector, four different values of τ are used: 2 years, 3 years, 5 years, and no limit on the window size (τ = infinite). The initial value studied is 3 years.
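A sketch of this construction is given below. The pubs_by_author structure and function name are hypothetical, and the sketch links every pair of an author's publications whose years fall within the window, which is one way of reading the linking rule described above.

```python
import networkx as nx
from itertools import combinations

def build_network(pubs_by_author, tau=3):
    """Build the directed conference network (a sketch).

    pubs_by_author maps an author to a list of (year, conference) pairs.
    Two different conferences are linked when the author published at both
    within tau years (tau=None means no limit); same-year pairs are linked
    in both directions, since DBLP records only the year.
    """
    G = nx.DiGraph()
    for pubs in pubs_by_author.values():
        for (y1, c1), (y2, c2) in combinations(sorted(pubs), 2):
            if c1 == c2 or (tau is not None and y2 - y1 > tau):
                continue
            links = [(c1, c2), (c2, c1)] if y1 == y2 else [(c1, c2)]
            for u, v in links:
                # Each qualifying pair of publications adds 1 to the weight.
                weight = G[u][v]["weight"] + 1 if G.has_edge(u, v) else 1
                G.add_edge(u, v, weight=weight)
    return G
```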

4.3 Constructing the feature vector

There are three different ways in which the feature vector was computed for each individual researcher.

In the first variation, the feature vector was constructed in the following manner: for each conference at which a researcher's work was published, count the number of publications of that researcher at that conference and multiply it by the eigencentrality value of the conference.

The second variation resembles the first, except that instead of counting the number of publications, the contribution of each publication, which corresponds to the value in the InvNrOfAuthors column, is used: for each publication, the value 1 divided by the number of authors that worked on that publication is used instead. This was done to see if the contribution of a researcher has an impact on the regression.

The last variation matches the first, except that no multiplication with the eigencentrality occurs. This was done to check whether the importance of a conference had any impact on the regression.

Regardless of which variation was used, each researcher obtained a vector consisting of 64 dimensions. Along with this vector, a different type of feature vector was created as well. This vector, which will be referred to as the dot product vector, contained the dot product of the conference eigencentrality values and either the number of works published at each conference by the researcher, or the researcher's summed InvNrOfAuthors values per conference. In the case of the third variation, the vector containing the total number of works published at conferences by each researcher was provided instead.
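A sketch of the three variations, under an assumed input structure (a per-conference list of the author counts of this researcher's publications; all names here are hypothetical):

```python
import numpy as np

def feature_vector(pubs_per_conf, conferences, centrality,
                   use_contribution=False, use_centrality=True):
    """Build one researcher's 64-dimensional feature vector (a sketch).

    Variation 1: defaults. Variation 2: use_contribution=True.
    Variation 3: use_centrality=False.
    """
    vec = np.zeros(len(conferences))
    for i, conf in enumerate(conferences):
        for n_authors in pubs_per_conf.get(conf, []):
            # Count each publication once, or weigh it by the
            # InvNrOfAuthors value (1 divided by the number of authors).
            vec[i] += 1.0 / n_authors if use_contribution else 1.0
        if use_centrality:
            vec[i] *= centrality[conf]  # weigh by conference eigencentrality
    return vec
```

The dot product vector then collapses this information into a single number per researcher: the dot product of the eigencentrality values with the per-conference counts (or summed InvNrOfAuthors values).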

4.4 Evaluation

In order to evaluate the results of the different regression methods, both the R² score and the RMSE (root mean squared error) are used.

The dataset containing the feature vectors is randomly split into a 70% training set, with the other 30% serving as a test set. This is done using the scikit-learn train_test_split function. Each of the models is trained and tested on five different random configurations of the training and test data, determined using the random_state option of the train_test_split function. The reported results are those of the configuration that gave the overall best results.
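A sketch of this evaluation loop with scikit-learn; the stand-in data and the restriction to plain linear regression are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Stand-in data: X would hold the 64-dimensional feature vectors and
# y the scraped h-index values.
X, y = np.random.rand(500, 64), np.random.rand(500) * 40

best = None
for seed in range(5):  # five random train/test configurations
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed)
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    r2 = r2_score(y_te, pred)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    if best is None or r2 > best[0]:
        best = (r2, rmse, seed)

print("best R2 %.4f, RMSE %.2f (random_state=%d)" % best)
```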

5 Results

In figure 1, the distribution of scientists over the amount of their published content at conferences and over prestige can be seen. Most researchers manage to publish 2 or 3 works at conferences (24584 and 11558 researchers respectively), after which the number of researchers quickly declines. After 29 publications, the number of researchers drops below 100, and by about 70 publications it drops below 10. The greatest number of publications is 317. As stated in section 4.1.2, due to the duration of scraping the h-index from Google Scholar, the decision was made to only scrape researchers that published 10 publications or more.

Figure 1: Distribution of the number of researchers per number of published works (a, left) and per h-index (b, right).

The distribution of the h-index ascends steeply at the lower end of the h-index values and peaks between 20 and 24, after which it declines slowly. This can be explained by the selection that was applied to the dataset (only authors with 10 or more publications at A* conferences). The highest h-index value is 180.

In the following section, the results of five different methods will be looked at:

• The results of applying regression on the feature vector

• The results of applying regression on the dot product vector

• The results of applying regression on the feature vector with different values for τ

• The results of applying regression on the feature vector that takes contribution into account, with different values for τ

• The results of applying regression on an unweighted vector, i.e. a vector containing only the number of publications an author made at each conference

The initial results from applying regression on the feature vector can be seen in table 1. These initial results show that linear regression performs best compared to the other methods. However, the R² score is low and the RMSE quite high, indicating that the model is not accurate enough.

                    R²        RMSE
Linear Regression   0.2444    19.07
LASSO               0.1835    19.82
Elastic-net         0.1674    20.02
Lars                0.0225    21.69
Lasso-Lars          -0.0004   21.94

Table 1: Regression on the feature vector

The dot product vector plotted against the h-index can be seen in figure 2a. The results of applying regression on this dataset can be seen in figure 2b and table 2. In figure 2b, purple corresponds to Lasso-Lars and blue to linear regression, whose line overlaps those of the other methods. Again, the regression methods on the dot product vector scored low on the R² score and high on the RMSE, with almost no difference in results except for Lasso-Lars.

                    R²        RMSE
Linear Regression   0.1534    20.18
LASSO               0.1534    20.18
Elastic-net         0.1534    20.18
Lars                0.1534    20.18
Lasso-Lars          0.068     21.17

Table 2: Regression on the dot product vector

Figure 2: Results using the dot product vector. (a) The h-index for each corresponding dot product. (b) Regression applied to the dot product vector.

Table 3 shows the results of applying regression on the dataset with different values of τ. The best results are again obtained by linear regression, but instead of the initial value of τ = 3, the better results are obtained by using τ = infinite. However, the R² score and RMSE value, while better than before, are still not great.

The results of using the contribution instead of just the number of publications for each author can be seen in table 4. The highest R² score and lowest RMSE are again obtained by using τ = infinite, and are actually better than the results obtained without taking contribution into account, although they are still not particularly good.


                    τ = 3             τ = 2             τ = 5             τ = infinite
                    R²       RMSE     R²       RMSE     R²       RMSE     R²       RMSE
Linear Regression   0.2444   19.07    0.2487   19.99    0.2487   19.99    0.2494   18.78
LASSO               0.1835   19.82    0.2074   20.53    0.2048   20.56    0.1867   19.55
Elastic-net         0.1674   20.02    0.1801   20.88    0.1777   20.91    0.1709   19.74
Lars                0.0225   21.69    0.0148   22.89    0.0148   22.89    0.0170   21.49
Lasso-Lars          -0.0004  21.94    -0.0024  23.09    -0.0024  23.09    -0.0003  21.68

Table 3: Regression on the feature vector, with different time windows (τ)

                    τ = 3             τ = 2             τ = 5             τ = infinite
                    R²       RMSE     R²       RMSE     R²       RMSE     R²       RMSE
Linear Regression   0.2709   18.73    0.2638   19.78    0.2638   19.78    0.2721   18.49
LASSO               0.1603   20.10    0.1619   21.11    0.1593   21.14    0.1637   19.82
Elastic-net         0.0831   21.00    0.0770   22.15    0.0777   22.14    0.0879   20.70
Lars                0.0186   21.73    0.0122   22.91    0.0122   22.91    0.0158   21.51
Lasso-Lars          -0.0004  21.94    -0.0024  23.09    -0.0024  23.09    -0.0003  21.68

Table 4: Regression on the feature vector that takes contribution into account, with different time windows (τ)

As a final test, regression was applied on an unweighted vector, thus a vector containing only the number of publications an author made at each conference. The results can be seen in table 5. These results score lower than when taking eigencentrality and/or contribution into account, but better than using the dot product as a feature.

                    R²        RMSE
Linear Regression   0.2715    19.19
LASSO               0.2693    19.22
Elastic-net         0.2698    19.21
Lars                0.0220    22.23
Lasso-Lars          -0.00004  22.48

Table 5: Regression with the unweighted vector

6 Discussion

Overall, there does not seem to be a connection between the repertoire of a researcher and the h-index. Even the best results, which are achieved by taking contribution into account and using τ = infinite, amount to an R² score of roughly 0.27 and an RMSE of around 18, which is, unfortunately, quite bad. The various variations of the regression methods did nothing to improve upon this score, and as figure 2a shows, the data seems to be scattered in such a way that it would be difficult for a linear model to predict the correct h-index.

Since previous research has shown that the h-index can be predicted (Acuna et al., 2012; Ayaz et al., 2018), these results are unexpected. There exist several explanations for why the obtained results do not align with the expectations.

One of the main aspects which could cause the linear models to be inaccurate is the fact that the h-index is determined by many factors other than the number of publications at conferences. Not only are publications in journals taken into account as well when determining the h-index, but the number of citations each of those works receives is another important factor. Also, the prestige of a conference is not factored into the h-index and therefore might not have an impact on its value. This indicates that the h-index might not have been the best metric for determining prestige.

Another aspect that could influence the results is that restricting the data to A*-rated conferences might have been too strict. These conferences might not play a large role in determining the h-index of the researchers, and looking only at publications from the A*-rated conferences also limited the number of scientists. As stated in section 4.1.1, the complete DBLP database contains publications from 2.2 million researchers; limiting this to researchers that have published 10 or more publications at A*-rated conferences reduces this number to 22942. Including other conferences in the network, which would also increase the number of authors, could therefore make the model more accurate.

Finally, another aspect that could have influenced the results is the fact that scraping the h-index from Google Scholar did not go as smoothly as anticipated. As discussed in section 4.1.2, about 20% of the authors did not have a Google Scholar profile, so no h-index could be retrieved for these entries. Of the authors that were matched with a Google Scholar profile, many were mismatched. This problem was partially solved by incorporating the whoswho module, but the data still contains some entries with the wrong h-index. This could have added too much noise to the dataset and thereby affected the results negatively.

For further study it is recommended, in addition to taking the previously discussed issues into account, to include the factor of time in some capacity, for instance when constructing the feature vector. Another recommendation is to include journals as well: including journal publications and incorporating journals in the construction of the network might make the results more in line with the h-index.

7 Conclusion

In this thesis, the question was asked whether it is possible to predict the prestige of a researcher based on his or her published work at conferences. In order to answer this question, for each researcher, the number of publications at each conference was weighted by the prestige of that conference. The prestige of a conference was represented by its eigencentrality, determined through a network of conferences linked to each other with weighted links representing researchers publishing their work at those conferences within a time window of τ. Various linear regression methods were applied in order to predict the h-index. An alternative method, which took into account not only the number of publications but also the number of authors that worked on those publications (the contribution), was used as well.

The results showed that the method that used the network without a time window (τ = infinite), combined with taking the contribution into account, performed best. Unfortunately, it obtained a low R² score of 0.27 and a high RMSE of 18.49, which indicates that, given the current method, it does not seem possible to predict the h-index based on publications at conferences alone.

References

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.

Ley, M. (2002). The DBLP computer science bibliography: Evolution, research issues, perspectives. In String Processing and Information Retrieval, 9th International Symposium, SPIRE 2002, Lisbon, Portugal, September 11-13, 2002, Proceedings (pp. 1–10). doi:10.1007/3-540-45735-6_1

Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. et al. (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499.

Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences, 102(46), 16569–16572.

Zou, H. & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.

Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L. & Su, Z. (2008). ArnetMiner: Extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 990–998). ACM.

Alonso, S., Cabrerizo, F. J., Herrera-Viedma, E. & Herrera, F. (2009). h-index: A review focused in its variants, computation and standardization for different scientific fields. Journal of Informetrics, 3(4), 273–289.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Acuna, D. E., Allesina, S. & Kording, K. P. (2012). Future impact: Predicting scientific success. Nature, 489 (7415), 201.

Vrettas, G. & Sanderson, M. (2015). Conferences versus journals in computer science. Journal of the Association for Information Science and Technology, 66(12), 2674–2684.

Newman, M. E. (2016). Mathematics of networks. The new Palgrave dictionary of economics, 1–8.

Ayaz, S., Masood, N. & Islam, M. A. (2018). Predicting scientific impact based on h-index. Scientometrics, 114(3), 993–1010. doi:10.1007/s11192-017-2618-1

Fraiberger, S. P., Sinatra, R., Resch, M., Riedl, C. & Barabási, A.-L. (2018). Quantifying reputation and success in art. Science, 362(6416), 825–829.


Bonaventura, M., Ciotti, V., Panzarasa, P., Liverani, S., Lacasa, L. & Latora, V. (2019). Predicting success in the worldwide start-up network. arXiv preprint arXiv:1904.08171.

CORE Rankings Portal - Computing Research & Education. (2019). Retrieved from http://www.core.edu.au/conference-portal

Scholarly. (2019). https://github.com/OrganicIrradiation/scholarly, (visited on 28-6-2019).

Schouw, M. (2019). What is the correlation between external conference rankings and conference rankings in a network? (Bachelor thesis, University of Amsterdam).

Whoswho. (2019). https://github.com/rliebz/whoswho, (visited on 28-6-2019).

Williams, O. E., Lacasa, L. & Latora, V. (2019). Quantifying and predicting success in show business. Nature Communications, 10, 2256.
