
Faculty of Economics and Business
University of Amsterdam

Researcher’s performance and social networks

XGBoost contradicts OLS in importance of social networks variables

Bachelor's Thesis

Econometrics and Operational Research

Author: Puja Chandrikasingh
Student no.: 11059842

Date: June 26, 2018


This document is written by Puja Chandrikasingh who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Abstract

This thesis considers to what extent collaboration network variables have predictive power on the future performance of a researcher according to XGBoost, a machine learning method. Ductor et al. (2014) examined the performance of those variables with econometric methods and concluded that the best model, a linear model estimated by OLS, contains all network variables. However, XGBoost shows that the predictive power of network variables is overestimated by OLS: adding the network variables reduces the out-of-sample RMSE less with XGBoost than with OLS. Moreover, XGBoost shows that network variables do not have predictive power over and above past performance if only the observations of researchers that started publishing at least five years ago are considered. Since XGBoost outperforms OLS in predicting the future performance of a researcher, the predictive power of the network variables is misspecified by OLS.

Contents

1 Introduction
2 Theoretical background
2.1 A researcher's performance
2.2 OLS
2.3 XGBoost
2.4 Comparison econometrics and machine learning
2.5 Conclusion
3 Methods
3.1 First research question | Performance
3.2 Second research question | Network variables
3.3 Robustness
4 Data
5 Results and analysis
5.1 Performance
5.2 Network variables
5.3 Robustness
6 Conclusion
7 References
A Appendix
A.1 Parameters in XGBoost
A.2 Tuning procedure
A.4 All variables available for XGBoost
A.5 Tuned XGBoost models
A.6 Improvement of performance by handling missing values

1 Introduction

When it comes to trying to make predictions, machine learning (ML) and econometrics are two competing fields. According to de Prado (2018), "econometrics may be good enough to succeed in financial academia (for now), but succeeding in business requires ML" (p. 15). His statement seems to be correct when the list of winning solutions for the competitions on Kaggle, a data science competition site, is considered. The goal in these competitions is to predict a certain variable as well as possible. OLS, an econometric method that stands for Ordinary Least Squares, is used frequently for predictive purposes (Yang, Delcher, Shenkman, & Ranka, 2017, p. 385). However, a new machine learning algorithm, named XGBoost, dominated the winning solutions in the Kaggle competitions in 2015 (Stern, Erel, Tan, & Weisbach, 2017, p. 19).

That machine learning methods outperform econometric methods in predicting a certain variable is also noticed by academic researchers. One example is a study by Stern et al. (2017). They compare the performance of several machine learning algorithms with OLS in predicting a few measures of the performance of a director. Their ultimate goal is selecting directors that would be most valuable to a particular firm. They conclude that XGBoost is the preferred method when it comes to predicting.

In this thesis, the focus is not on directors but on researchers, as it is in the paper of Ductor, Fafchamps, Goyal and Van der Leij (2014). In all selection procedures, within firms and universities, the future performance is important. Ductor et al. predict the future performance of a researcher, measured by the quality of the articles he or she will write, using, among other things, their past performance. They show that variables that measure a researcher's coauthorship network improve the prediction of the future performance of a researcher. Of the different models they tested, the linear model estimated by OLS performed best. However, they only used models from the econometric field, while machine learning methods might give a better specification, which leads to better predictions and perhaps to a different conclusion about the predictive power of collaboration network variables.

The machine learning method considered in this thesis is a gradient tree boosting algorithm that is used in a wide range of fields and performs well in more or less all cases (Chen & Guestrin, 2016, p. 785). XGBoost, which stands for Extreme Gradient Boosting, is a tree boosting system introduced by Chen. It is an algorithm that belongs to the class of gradient tree boosting algorithms, as the name clearly implies. Boosting is one of the highly effective and frequently used prediction methods of machine learning (Chen & Guestrin, 2016). It contains statistical learning procedures, which are able to combine many 'weak' or poor predictions into a 'strong' or accurate prediction (Berk, 2017, p. 257).

The main question of this thesis is to what extent collaboration network variables have predictive power on the future performance of a researcher, according to XGBoost, a machine learning method. First, the predictive performance of the linear model estimated by OLS, from now on the OLS model, is compared to the predictive performance of the model estimated by XGBoost, from now on the XGBoost model. Then, the predictive power of all network variables together and the predictive power of the individual network variables are considered. In this thesis, only the predictive power, i.e. Granger causality, is considered and not real causality.

The results indicate that XGBoost indeed significantly outperforms OLS as a prediction method. Network variables do not have predictive power for researchers who have been publishing for five years or longer. However, if all researchers are considered, the network variables have a small predictive power. Hence, the predictive power of network variables is overestimated by OLS.

The remaining part of this thesis is organised as follows. In the next section, econometrics and machine learning are compared. Then, the econometric model for this specific problem is considered. Thereafter, the intuition behind XGBoost is explained. Moreover, the reasons XGBoost might outperform OLS are given. Section 3 contains the methods used and Section 4 gives a description of the dataset. Section 5 shows the results and the analysis. At last, the conclusion, together with future research possibilities, is given in Section 6.

2 Theoretical background

The main question of this thesis is to what extent collaboration network variables have predictive power on the future performance of a researcher according to XGBoost, a machine learning method. Ductor et al. (2014) show that the best OLS model contains all the network variables, hence those variables are important. However, if OLS does not capture the full relationship, then this result might be biased.

In this section, the theoretical background about predicting a researcher’s performance is first explained. Next, the OLS models that Ductor et al. use are considered. Thereafter, the intuition behind XGBoost is explained. Then, attention is paid to the way econometrics and machine learning relate to each other. Finally, the conclusion of this section is given.

2.1 A researcher's performance

In this subsection, the theory behind predicting the future performance of a researcher is considered. A number of explanatory variables are discussed. The precise definitions can be found in Section 4 or in the work of Ductor et al. (2014). Furthermore, the tests Ductor et al. use and their conclusion are given.

An explanatory variable that is commonly used when predicting future performance is the past performance. According to Ductor et al., this is because if the past performance of a person is good (bad), then that person has (does not have) the ability and the ambition to perform well. Hence, the future performance will be good (bad).

Ductor et al. argue that, besides the past performance and some control variables, information about the collaboration network of a researcher also has predictive power. A collaboration network represents all authors and the connections between them through coauthorships. The predictive power of such a network is not necessarily based on a causal relationship. Instead, it is based on predictive causality, i.e. Granger causality, which means that information about the collaboration network can improve the predictions. Hence, the conditional probability distribution $F$ of the performance of a researcher is influenced by (the lags of) the collaboration network variables (Granger, 1969; from Hiemstra & Jones, 1994). Formally, Hiemstra and Jones state that a time series $\{X_t\}$ does not Granger cause $\{Y_t\}$ if:

$$F(Y_t \mid Y_{t-L_y}, Y_{t-L_y+1}, \dots, Y_{t-1}, X_{t-L_x}, X_{t-L_x+1}, \dots, X_{t-1}) = F(Y_t \mid Y_{t-L_y}, Y_{t-L_y+1}, \dots, Y_{t-1})$$

According to Ductor et al., there are two ways in which a researcher's collaboration network influences his or her performance. First of all, Ductor et al. state that a researcher with a good network has better access to new ideas than a researcher with a worse network. With good and innovative ideas, a researcher can do more research and produce more high-quality articles, which improves his or her performance.

Second of all, Ductor et al. state that if a researcher is involved in a coauthorship with highly productive researchers, then that researcher must have talent and ambition. After all, the highly productive researchers will not work with someone who is obstructing them. Hence, that researcher must be of high quality, which indicates that the future output of that researcher will be high.

As a measurement for the accuracy of the predictions $\hat{y}_{i,t+1}$, Ductor et al. use the root mean squared error, RMSE. If the RMSE decreases when a variable is added, then this variable is contributing to better predictions.

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i,t} \left(y_{i,t+1} - \hat{y}_{i,t+1}\right)^2}$$

Ductor et al. compare the models with the Diebold-Mariano test. The same test is used in this thesis to compare the XGBoost models to the OLS models. The Diebold-Mariano test (Diebold & Mariano, 1995; from Ductor et al., 2014, pp. 939-940) is used to determine whether the predictions of different methods are significantly different. Ductor et al. use a squared loss differential $d_{i,t}$ to compare model A with model B. The null hypothesis is that there is no difference, i.e. $E[d_{i,t}] = 0$.

$$d_{i,t} = \varepsilon_{A_{i,t}}^2 - \varepsilon_{B_{i,t}}^2$$

$$H_0: E[d_{i,t}] = 0 \qquad H_1: E[d_{i,t}] \neq 0$$

$$\frac{\bar{d}}{\sqrt{\hat{V}(\bar{d})/n}} \sim N(0,1), \quad \text{where } \bar{d} = \frac{1}{n}\sum_{i,t} d_{i,t}$$

and $\hat{V}(\bar{d})$ is the Newey-West type estimator of the asymptotic long-run variance of $\sqrt{n}\,\bar{d}$.
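To make the test operational, the following is a minimal sketch of how the statistic above could be computed. The function name, the Bartlett weights for the Newey-West variance, the truncation lag and the toy prediction errors are illustrative assumptions and not taken from Ductor et al. or from this thesis; it also assumes the loss differentials are passed in time order.

```python
import numpy as np
from scipy import stats

def diebold_mariano(errors_a, errors_b, max_lag=0):
    """Two-sided Diebold-Mariano test with a squared loss differential.

    errors_a, errors_b : out-of-sample prediction errors of models A and B.
    max_lag            : truncation lag for the Newey-West variance estimate
                         (0 reduces to the plain sample variance).
    """
    d = np.asarray(errors_a) ** 2 - np.asarray(errors_b) ** 2   # d_{i,t}
    n = d.size
    d_bar = d.mean()

    # Newey-West estimate of the long-run variance of sqrt(n) * d_bar
    d_centred = d - d_bar
    long_run_var = np.mean(d_centred ** 2)
    for lag in range(1, max_lag + 1):
        weight = 1.0 - lag / (max_lag + 1.0)                    # Bartlett weight
        gamma = np.mean(d_centred[lag:] * d_centred[:-lag])
        long_run_var += 2.0 * weight * gamma

    dm_stat = d_bar / np.sqrt(long_run_var / n)
    p_value = 2.0 * (1.0 - stats.norm.cdf(abs(dm_stat)))        # H1: E[d] != 0
    return dm_stat, p_value

# Example usage with made-up prediction errors for two models
rng = np.random.default_rng(0)
e_ols, e_xgb = rng.normal(0, 1.0, 500), rng.normal(0, 0.9, 500)
print(diebold_mariano(e_ols, e_xgb, max_lag=5))
```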

Ductor et al. conclude that variables based on the collaboration network indeed have predictive power over and above past performance. They find that the performance of coauthors is the best predictor among those variables. However, all the network variables that they constructed seem to be relevant according to the Bayesian Information Criterion (BIC). A description of the different network variables can be found in Section 4.

Moreover, Ductor et al. find that, unlike past performance, network variables are useful predictors for researchers that are at the start of their career. Furthermore, network variables have most predictive power in the case of researchers who just started publishing and publish above average. For the researchers that belong to that group, but not to the most productive group, network variables contain even more predictive power than past performance.

The results are valid, provided that OLS is the right model for predicting the future performance of a researcher. However, Ductor et al. worry that the model might be misspecified. Hence, they also estimated nonlinear econometric regression models, panel data models and vector autoregressive (VAR) models. OLS has the best out-of-sample performance, measured with the RMSE, thus Ductor et al. use OLS as their main specification. However, the in-sample log likelihood is higher for the negative binomial model than for OLS, indicating that the model might be misspecified when OLS is used. Hence, the importance of the network variables might be misspecified.

Concluding, besides past performance, network variables improve predictions of the future research output. The most relevant network variable is the performance of coauthors. Furthermore, the effect of network variables differs with career time and performance. However, if OLS is not the correct model, then these results might be invalid. In this thesis, the importance of network variables is estimated with the help of XGBoost, a machine learning algorithm, because this algorithm can capture complex relationships. In Section 2.3 XGBoost is considered. In the following subsection, the OLS models that are used by Ductor et al. are given.

2.2 OLS

In this subsection, the econometric models for predicting the future performance of a researcher are introduced. In this thesis, XGBoost is compared to those models. The models are linear models estimated by OLS and come from the work of Ductor et al. (2014). Their goal is to find the best model in the sense of predicting. Since XGBoost has the same goal, both methods can be compared on their predicting performance. In this section, only the final linear models with their characteristics are presented. For a more detailed explanation of the models, the work of Ductor et al. (2014) can be consulted.

The first final model Ductor et al. describe is a multivariate restricted linear model estimated by OLS, which they denote as model 3 (hereafter model D3). For author $i$ at time $t$, it contains the control variables ($x_{it}$), the recent past performance of the previous five years ($y_{it}$) and all the network variables ($z_{it}$). It is a restricted model because, at time $t$, the lagged productivity variables since the start of author $i$'s career are assumed to have the same effect. Moreover, only one lag of the network variables is taken into account.

Model D3: $$y_{i,t+1} = x_{it}\beta + y_{it}\gamma_1 + z_{it}\gamma_2 + \varepsilon_{it}$$

The second final model Ductor et al. describe is a multivariate unrestricted linear model estimated by OLS, which they denote as model 3' (hereafter model D3'). This model resembles model D3, but it does not assume that the productivity lags have the same effect. Thus, all the lags are added separately. Moreover, all the lags of the network variables are taken into account. Ductor et al. use the BIC to select the right number of lags. For the past performance, the right number is twelve; for the network variables it differs, hence it is denoted as $T$.

Model D3': $$y_{i,t+1} = x_{it}\beta + \sum_{s=0}^{12} y_{i,t-s}\gamma_s + \sum_{s=0}^{T} z_{i,t-s}\theta_s + \varepsilon_{it}$$

Model D3 has an out-of-sample RMSE of 0.654, while model D3' has an out-of-sample RMSE of 0.758. However, the two models use different out-of-sample observations (see Section 4). Hence, the RMSEs cannot be compared. For both models, Ductor et al. conclude that network variables indeed have predictive power over and above past performance.
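For illustration only, the sketch below shows how a restricted linear model of the form of D3 could be estimated by OLS and evaluated on an out-of-sample RMSE. The data frame, column names and coefficients are made up; this is not the code or the variable set of Ductor et al.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy stand-in for the real dataset: column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "recent_perf": rng.exponential(1.0, 2000),
    "coauthor_perf": rng.exponential(1.0, 2000),
})
df["future_perf"] = 0.6 * df["recent_perf"] + 0.2 * df["coauthor_perf"] + rng.normal(0, 0.5, 2000)

train, test = df.iloc[:1500], df.iloc[1500:]
x_cols = ["recent_perf", "coauthor_perf"]

# Estimate y_{i,t+1} = x_it * beta + eps by OLS and evaluate out of sample.
ols = sm.OLS(train["future_perf"], sm.add_constant(train[x_cols])).fit()
pred = ols.predict(sm.add_constant(test[x_cols]))
rmse = np.sqrt(np.mean((test["future_perf"] - pred) ** 2))
print(rmse)
```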

Concluding, the final models of Ductor et al. contain all the network variables, but with a different number of lags. Moreover, for both models the conclusion is that network variables can indeed improve the predictions of the future performance of a researcher. However, the models might be misspecified, as mentioned in the previous subsection. Hence, in this thesis XGBoost, a machine learning method, is considered, because it can capture a more complex relationship than OLS. In the following subsection, XGBoost is explained.

2.3 XGBoost

In this thesis, XGBoost, a machine learning algorithm, is compared to the econometric models from the previous subsection. XGBoost is a gradient tree boosting algorithm that is used in a wide range of fields and performs well in more or less all cases (Chen & Guestrin, 2016, p. 785). In this subsection, the intuition behind gradient tree boosting algorithms is given. Thereafter, a more mathematical approach is followed to explain XGBoost. Finally, the innovations that come with XGBoost are mentioned. A more detailed explanation of XGBoost and its innovations is given in the work of Chen and Guestrin (2016).

XGBoost is a gradient tree boosting algorithm. Tree boosting means that different trees are estimated and the predictions are combined (boosted) into one final prediction. The intuition behind gradient tree boosting is given by walking through an example. It is based on the work of Chen and Guestrin (2016). The concept of trees and boosting becomes clear within this example.

In the following example, the dependent variable y has to be explained by independent variable x, see figure 1a. Furthermore, trees are used to capture the relationship between y and x. The first problem is to find a splitting point. XGBoost only uses binary trees, in which nodes have at most two branches. Hence, only one splitting point is needed.

The algorithm finds the best place to split the data by calculating for each point the total loss. Often the squared error function is the chosen loss function. The goal is to minimise the total loss. The algorithm finds a place where the loss is minimal and it splits there (figure 1b). The tree at that point is shown in figure 1c. Each leaf $j$ (end node) has its own prediction $\hat{y}_j$, which is often a constant, for the subset of the data that it contains.

Now, the algorithm tries to explain the unexplained part, in other words the residuals. The residual of observation $i$, which is in leaf $j$, is the difference between $y_i$ and the prediction of leaf $j$, $\hat{y}_j$. This corresponds to the (negative) gradient of the loss function if the squared error function is chosen. The residuals of the data points in this example are shown in figure 1d.

For explaining the residuals, the algorithm again looks at x, since this is the only independent variable in this example. However, if there are more independent variables, then all of those variables are considered in the same manner. The variable that reduces the loss the most is chosen. The algorithm searches in each current leaf (here the left node and the right node) for the best point to split, see figures 1e and 1f. It then chooses to split the node that leads to the smallest loss. In this example, that means splitting the left node. This continues until one of the stopping criteria is reached, for example when the maximum number of leaves is created. However, this example stops now. The combination of the first and the second split gives the final tree, see figure 1g.


Figure 1: Example explaining the intuition behind XGBoost

At this point, one tree is constructed. For the next tree, the residuals are calculated and the algorithm tries to fit a new tree. This is different from making the last tree deeper, since the algorithm looks at all the data again. Hence, it might find other variables that explain the difference, while it did not explain the difference in the specific subsets of the data in the leaves of the foregoing tree(s).

Combining the trees in this way is called gradient boosting. This is a highly effective and frequently used method in machine learning (Chen & Guestrin, 2016). It combines many 'weak' or poor predictions into a 'strong' or accurate prediction (Berk, 2017, p. 257).
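As a toy illustration of the two ideas just described, splitting on the point that minimises the squared error and then boosting the residuals with new trees, the sketch below fits depth-1 trees (stumps) in a simple gradient boosting loop. It is a didactic sketch, not the XGBoost algorithm itself: the function names, learning rate and toy data are assumptions.

```python
import numpy as np

def fit_stump(x, y):
    """Find the split on x that minimises the squared error, predicting each side with its mean."""
    best = None
    for value in np.unique(x)[:-1]:
        left, right = y[x <= value], y[x > value]
        loss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or loss < best[3]:
            best = (value, left.mean(), right.mean(), loss)
    return best[:3]   # (split value, left prediction, right prediction)

def boost(x, y, n_trees=20, learning_rate=0.3):
    """Gradient boosting with stumps and squared error loss."""
    prediction = np.zeros_like(y)
    stumps = []
    for _ in range(n_trees):
        residuals = y - prediction            # negative gradient of the squared loss
        split, w_left, w_right = fit_stump(x, residuals)
        update = np.where(x <= split, w_left, w_right)
        prediction += learning_rate * update  # each new tree slightly improves the prediction
        stumps.append((split, w_left, w_right))
    return stumps, prediction

# Toy example: a noisy step function
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
y = np.where(x < 4, 1.0, 3.0) + rng.normal(0, 0.2, 300)
stumps, fitted = boost(x, y)
print(np.sqrt(np.mean((y - fitted) ** 2)))   # in-sample RMSE after boosting
```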

In the foregoing, the intuition behind gradient tree boosting is explained. Now, XGBoost is considered more mathematically. As mentioned in the intuition, the goal is to minimise the total loss $L$. Chen and Guestrin (2016) describe the loss function $l$ for each observation $i \in \{1, \dots, n\}$ as a measure of the difference between the real value $y_i$ and the prediction $\hat{y}_i$. The gradient of $l$ with respect to $\hat{y}_i$ is thus a function of the residuals used in the intuition. Besides the loss $l$ for all the $n$ observations, their model, XGBoost, also takes a regularisation term $\Omega$, with correcting parameters $\gamma$ and $\lambda$, into account (in order to prevent overfitting, see Section 2.4). If $\gamma$ is high, then the algorithm selects trees that are less deep, which results in capturing a less complex relationship. In the formulas below, $M$ is the number of trees and $f_m$ is the specific structure of tree $m$, which contains the prediction $w_j$ in leaf $j$ and the number of leaves $T$.

$$L = \sum_{i=1}^{n} l(\hat{y}_i, y_i) + \sum_{m=0}^{M} \Omega(f_m), \qquad \Omega(f_m) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2 = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

If there are $M$ trees, the prediction of observation $i$ is denoted as $\hat{y}_i^{(M)}$ and the total loss that has to be minimised as $L^{(M)}$. With each new tree, the previous prediction of observation $i$, $\hat{y}_i^{(M-1)}$, is adjusted with the prediction from the new tree, $f_M(x_i)$. Hence, for tree $M$ the objective that has to be minimised by changing the structure of the new tree, $f_M$, becomes:

$$L^{(M)} = \sum_{i=1}^{n} l\!\left(\hat{y}_i^{(M-1)} + f_M(x_i),\, y_i\right) + \Omega(f_M)$$

Since the objective function contains a function as a parameter, Chen and Guestrin (2016) solve the minimisation problem approximately by using a second-order Taylor expansion of the loss function $l$, which contains the gradient $g$ and the Hessian $h$ of $l$ with respect to $\hat{y}_i$. This is why it is called gradient tree boosting. From the resulting function, they take the first-order condition, which gives the optimal predictions $w_j^*$ for all the leaves; filling this in in $L$ gives the minimal loss after combining $m$ trees, $\tilde{L}^{*(m)}$.

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \qquad \tilde{L}^{*(m)} = -\frac{1}{2}\sum_{j=1}^{T}\frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

For splitting a node, the gain, which is the difference between $\tilde{L}^{*(m)}$ before and after the split, has to be positive. The gain function $G$, in which $I_l$ and $I_r$ contain respectively the observations in the newly created left and right nodes and $I = I_l \cup I_r$ contains all the observations of the node that is split, is defined below.

$$G = \frac{1}{2}\left[\frac{\left(\sum_{i \in I_l} g_i\right)^2}{\sum_{i \in I_l} h_i + \lambda} + \frac{\left(\sum_{i \in I_r} g_i\right)^2}{\sum_{i \in I_r} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma$$
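A small numerical sketch of the two formulas above may help. It assumes the squared error loss with a factor one half, $l = \tfrac{1}{2}(\hat{y}_i - y_i)^2$, so that $g_i = \hat{y}_i - y_i$ and $h_i = 1$; the function names and the toy numbers are illustrative and not part of the thesis.

```python
import numpy as np

def leaf_weight(g, h, lam):
    """Optimal leaf prediction w* = -sum(g) / (sum(h) + lambda)."""
    return -np.sum(g) / (np.sum(h) + lam)

def split_gain(g, h, left_mask, lam, gamma):
    """Gain G of splitting a node into left_mask and its complement."""
    def score(gs, hs):
        return np.sum(gs) ** 2 / (np.sum(hs) + lam)
    return 0.5 * (score(g[left_mask], h[left_mask])
                  + score(g[~left_mask], h[~left_mask])
                  - score(g, h)) - gamma

# For l = (yhat - y)^2 / 2: g = yhat - y and h = 1.
y = np.array([1.0, 1.2, 3.0, 3.1])
yhat = np.zeros_like(y)            # prediction of the previous trees
g, h = yhat - y, np.ones_like(y)
left = np.array([True, True, False, False])
print(leaf_weight(g, h, lam=1.0), split_gain(g, h, left, lam=1.0, gamma=0.0))
```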

There are four main innovations that come along with XGBoost, which are now briefly mentioned. For a more detailed explanation, the work of Chen and Guestrin (2016) can be consulted. The main difference with other (gradient) tree boosting algorithms is the regularisation term Ω. This prevents overfitting. The second innovation is that XGBoost can deal with sparsity: it sets a default direction, hence it sends all the missing values in the same direction. Hence, if values are missing for a reason, i.e. all the observations that have specific missing values have something in common that explains y, then this information will not be lost. The third and fourth innovations cause XGBoost to work faster. XGBoost does not go through all the values of a variable when it determines a split. Moreover, XGBoost makes smart use of the capacity of the computer.
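The sparsity handling mentioned above can be exercised directly with the xgboost library by passing NaN values; the toy data and parameter choices below are assumptions, shown only to illustrate that training proceeds even when a feature is partly missing and the default direction is learned from the data.

```python
import numpy as np
import xgboost as xgb

# Toy data in which the second variable is missing for about half of the rows.
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 1000)
X[rng.random(1000) < 0.5, 1] = np.nan   # XGBoost learns a default direction for these rows

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
booster = xgb.train({"objective": "reg:squarederror", "max_depth": 3}, dtrain, num_boost_round=50)
print(booster.predict(dtrain)[:5])
```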

Concluding, XGBoost is a gradient tree boosting algorithm. It tries to make a good final prediction by iteratively explaining the residuals at each splitting point and in each tree. With each tree, the prediction becomes slightly better. The main innovation of XGBoost is the regularisation term Ω, which prevents overfitting. In the next subsection, econometrics is compared to machine learning.

2.4 Comparison econometrics and machine learning

In this subsection, some of the similarities and differences between econometrics and machine learning are considered. More attention is paid to the models used in this thesis, namely XGBoost and OLS. The next paragraph considers the basics, such as notation. Thereafter, the similarities and differences are mentioned. Extra attention is paid to the differences that improve the performance of XGBoost compared to OLS.

In both fields, machine learning and econometrics, there is a dataset with observations of variables. The goal is to predict one variable $y$, the dependent variable, with observations of the other variables $x_1, x_2, \dots, x_k$, the explanatory or independent variables, which can be summarised in a matrix $X$ (Heij, De Boer, Franses, Kloek, & Van Dijk, 2004, p. 79).

The first similarity is that both econometric and machine learning methods need data (Heij et al., 2004, p.76; Varian, 2014, p. 6). However, machine learning algorithms require more data in order to predict better (Varian, 2014, p. 10).

A second similarity is the preferred kind of model. Varian states that both machine learning experts and economists prefer simpler models over complex ones, because the risk of overfitting is smaller for the simpler models. Hence, the out-of-sample forecasts are better for the simpler models. The difference is that in machine learning more methods are available to penalise complexity; this is referred to as regularisation (Varian, 2014, p. 7). One innovation of XGBoost is the extra regularisation term (Chen & Guestrin, 2016).

A difference between machine learning and econometric methods is the way variables are selected. In econometric methods the theory is considered, while in machine learning methods the model is merely determined by the data. Hence, XGBoost decides for itself which variables it includes (Chen & Guestrin, 2016).

Another difference is the area in which the methods outperform each other. Liu and Xie (2018) state that econometric methods outperform machine learning methods in estimating long-term relationships. On the other hand, machine learning algorithms are, according to them, more capable of predicting short-term relationships, since those methods are better at capturing heterogeneity in the data.

Moreover, econometrics and machine learning differ in the amount of complexity that can be taken into account. Varian (2014) states that machine learning is capable of covering more complex relationships. According to him, tree models are good at capturing a nonlinear relationship. However, he mentions that when a relationship is in reality simple, for example linear, then models that solely rely on trees do not perform well. Since XGBoost is a model that solely relies on trees, both observations apply to XGBoost.

Hence, XGBoost is likely to outperform OLS when predicting the future performance of a researcher, because it can capture a more complex relationship. However, Ductor et al. (2014) tested different non-linear econometric models and concluded that OLS outperformed all of them in making out-of-sample predictions. If the relationship between the independent variables and the future performance of a researcher is indeed linear, then XGBoost will probably perform worse (see Section 2.1).

The last mentioned difference is the way missing values are handled. In econometrics, all the variables need a value for each observation. The models cannot handle missing information. Trees, on the other hand, can (Varian, 2014, p. 10). XGBoost handles missing values in a new way (Chen & Guestrin, 2016).

Since there are missing values in the dataset that Ductor et al. use, the advanced way of handling those missing values might improve the performance of XGBoost. If the missing values are missing for a reason, then XGBoost does not lose that information. Hence, XGBoost probably performs better.

Concluding, there are some similarities, but many differences between machine learning and econometrics. Conceptually, the main difference is that econometric models are built from the (economic) theory, while machine learning models are purely data driven. In this thesis, the importance that XGBoost, a machine learning algorithm, and OLS, an econometric method, assign to network variables when predicting the future performance of a researcher is considered. It is likely that XGBoost will outperform OLS, since the relationship is probably not linear and XGBoost can handle missing values.

2.5 Conclusion

The main question of this thesis is to what extent collaboration network variables have predictive power on the future performance of a researcher according to XGBoost, a machine learning method. Ductor et al. use OLS, an econometric model, to predict a researcher's future performance. They show that OLS is the preferred model among several econometric models. Moreover, according to them, all the network variables have predictive power. However, if OLS does not capture the full relationship, then this result might be biased.

XGBoost, a widely used and well-performing gradient tree boosting algorithm, is likely to outperform OLS, because it can capture a more complex relationship and it can handle observations with missing values. XGBoost might therefore come to a different conclusion about the importance of network variables. This concludes the theoretical background.

3 Methods

In this thesis, the difference in the importance of network variables assigned by OLS and XGBoost, when predicting the future performance of a researcher, is considered. In this section, two research questions and the methods to investigate them are discussed. Finally, the method to test the robustness is described.

3.1 First research question | Performance

First of all, it is interesting to know which method performs better. The hypothesis is that XGBoost outperforms OLS. If XGBoost indeed performs better, then the model can capture more of the true relationship between a researcher’s future performance and the independent variables. Hence, it might attach less or more importance to the network variables. The first research question is:

1. Is there a significant difference between the out-of-sample predictions of XGBoost and the out-of-sample predictions of OLS? If so, which method is better in predicting the future performance of a researcher?

The performance of XGBoost and OLS has to be compared in a fair manner. On one hand, it would be fair if the variables and observations are the same. On the other hand, both prediction methods need to be allowed to choose the form of the model that leads to the best predictions. Hence, both sides are looked into.

The first step consists of constructing models estimated by XGBoost that use the same variables and observations as Ductor et al. (2014). Ductor et al. use two different linear models, both estimated by OLS. The first model, from now on denoted as D3 (model 3 in Ductor et al.), contains the cumulative and recent past performance, while the second model, from now on denoted as D3' (model 3' in Ductor et al.), instead includes thirteen performance lags. However, there are researchers that started publishing less than five years ago; this means that for those observations there are missing values for the cumulative and recent past performance. Hence, those observations are left out in the analysis of model D3. Furthermore, model D3 uses one lag of each network variable, while D3' uses eight. For both models D3 and D3' a new model is estimated by XGBoost, which are respectively denoted as X1 and X2.

To decide whether the out-of-sample predictions differ significantly from each other, the test described by Diebold and Mariano, which Ductor et al. also use, is conducted (see Section 2.1). Hereby, the compared models always have the same out-of-sample dataset.

However, the models X1 and X2 might not be optimal. XGBoost can construct an optimal model by itself when all the variables and the complete training dataset are given to it. Hence, this is done in order to compare the best model estimated by XGBoost, XBEST, with the linear models estimated by OLS.

The three models that are built with XGBoost to compare the performance of XGBoost with OLS have to be tuned. XGBoost has many parameters that have to be chosen optimally, such as the maximum depth of a tree. In Appendix A.1 a table with the default values and a full description is given for each parameter. Moreover, the tuning procedure that is used can be found in Appendix A.2. It is inspired by the stepwise procedure of Xia, Liu, Li and Liu (2017) and uses 3-fold cross-validation.
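One step of such a tuning procedure, choosing the number of boosting rounds with 3-fold cross-validation and early stopping, could look like the sketch below. The toy data, the parameter values and the seed are assumptions; the actual stepwise procedure over all parameters is described in Appendix A.2 and Xia et al. (2017).

```python
import numpy as np
import xgboost as xgb

# Toy training matrix; in the thesis the rows would be author-year observations.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.3, 2000)
dtrain = xgb.DMatrix(X, label=y)

# Pick the number of boosting rounds with 3-fold CV and early stopping;
# other parameters (max_depth, eta, ...) would be varied in the same stepwise way.
params = {"objective": "reg:squarederror", "max_depth": 6, "eta": 0.3}
cv = xgb.cv(params, dtrain, num_boost_round=500, nfold=3,
            metrics="rmse", early_stopping_rounds=10, seed=42)
print(len(cv), cv["test-rmse-mean"].iloc[-1])   # chosen rounds and the CV RMSE
```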

Concluding, to determine which method, XGBoost or OLS, predicts better, three models are constructed with XGBoost. They are tuned on the training dataset with 3-fold cross-validation. Moreover, with the Diebold-Mariano test their performance is compared to the performance of the linear models estimated by OLS.

3.2 Second research question | Network variables

The second research question is whether the importance that OLS and XGBoost assign to the social network variables is comparable. Ductor et al. (2014) find that social network variables have predictive power over and above knowledge of past individual performance. It is worth knowing whether this is still the case when XGBoost is used. Hence, the second research question is:

2. According to XGBoost, do the social network variables still have predictive power over and above the past individual performance?

If the analysis of the previous question shows that XBEST indeed outperforms the linear OLS models, then XGBoost can capture more of the underlying relationship between the future performance of a researcher and the independent variables, and thus the importance that XGBoost assigns to network variables is more accurate and therefore leading.

This question is mostly investigated in the same way as Ductor et al. (2014) examined it. First, it is investigated whether social network variables improve the out-of-sample predictions or not. For this, XGBoost is trained and tested with all variables other than the network variables. Then, the model is estimated again, but with the network variables. If XGBoost performs better in the second case, then network variables indeed have predictive power over and above past research output. The performance is measured by the out-of-sample RMSE.
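The comparison just described amounts to training the same XGBoost specification on two feature sets and comparing the out-of-sample RMSEs. The helper below is a hypothetical sketch; the function name, column lists and data frames are not from the thesis.

```python
import numpy as np
import xgboost as xgb

def oos_rmse(train_X, train_y, test_X, test_y, rounds=49):
    """Train an XGBoost model on train_X and return its RMSE on the test set."""
    dtrain = xgb.DMatrix(train_X, label=train_y, missing=np.nan)
    dtest = xgb.DMatrix(test_X, label=test_y, missing=np.nan)
    booster = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=rounds)
    pred = booster.predict(dtest)
    return np.sqrt(np.mean((test_y - pred) ** 2))

# Hypothetical column lists: base_cols = controls + past performance,
# network_cols = the lags of the network variables.
# rmse_base = oos_rmse(train[base_cols].values, train["y"].values,
#                      test[base_cols].values, test["y"].values)
# rmse_full = oos_rmse(train[base_cols + network_cols].values, train["y"].values,
#                      test[base_cols + network_cols].values, test["y"].values)
# A lower rmse_full would indicate predictive power over and above past performance.
```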

Besides the importance of all network variables together, the importance of the individual network variables is also considered. To analyse this, the XGBoost model is trained and tested on the control variables, past performance and one network variable at a time. It is expected that the variables that are most important according to OLS are also important according to XGBoost.

When the analysis shows that there are network variables that are not improving the out-of-sample predictions, then those variables are left out and the resulting model is again estimated by XGBoost. This might change the importance XGBoost assigns to the network variables that are still taken into account.

3.3 Robustness

The main question of this thesis is to what extent collaboration network variables have predictive power on the future performance of a researcher according to XGBoost, a machine learning method. To test the robustness of the assigned importance, the relevant parts of the robustness analysis conducted by Ductor et al. (2014) are repeated, with the difference that XGBoost is used instead of OLS. Hence, the robustness of the results of the two research questions is tested by changing the definition of a researcher's future performance.

Ductor et al. (2014) test the robustness of their results by changing the three-year window to a five-year window for the future performance of a researcher and by correcting the future performance for the article length and the number of coauthors.

First, the analysis is repeated with the dependent variable being the average future productivity over a five-year window instead of a three-year window. OLS is better at predicting long-term relationships (Liu & Xie, 2018; see Section 2.4), hence it might perform better. For this analysis XBEST is used.

Thereafter, the analysis is repeated with the dependent variable corrected for the length of the published articles and the number of coauthors. This is done to test whether the results are sensitive to the definition of a researcher's performance. Again, the model XBEST is used.

In this section, the research questions, the methods to investigate them and the robustness check are described. In the next section, the data and the definition of the variables are considered.

4 Data

This section describes the data that is used to investigate the research questions of the previous section. The data is the same data that Ductor et al. (2014) utilised and comes from the EconLit database, which is a bibliography of journals in economics. It contains information about all published articles between 1970 and 1999. In this section, first the definitions of the variables are given. This part is based on the work of Ductor et al. Thereafter, the descriptive statistics are considered.

In this thesis, the dependent variable is a measurement of the future performance of researcher $i$ at time $t$, denoted as $q^f_{i,t}$. The future performance is based on a three-year window. Hence, if $q_{i,t}$ is the performance of researcher $i$ in year $t$, then the future performance is defined as:

$$q^f_{i,t} = q_{i,t+1} + q_{i,t+2} + q_{i,t+3}$$

The same log transformation is used as in Ductor et al. (2014). Hence, the dependent variable becomes:

$$y_{i,t+1} = \ln\!\left(1 + q^f_{i,t}\right)$$
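As a small illustration of how this dependent variable can be built from a yearly panel, the sketch below constructs the three-year window and the log transformation with pandas. The panel, its column names and the toy values are assumptions, not the EconLit data.

```python
import numpy as np
import pandas as pd

# Hypothetical panel with one row per (author, year) and the yearly output q_{i,t}.
panel = pd.DataFrame({
    "author": [1, 1, 1, 1, 1, 1],
    "year":   [1990, 1991, 1992, 1993, 1994, 1995],
    "q":      [0.0, 1.5, 0.5, 0.0, 2.0, 1.0],
}).sort_values(["author", "year"])

# q^f_{i,t} = q_{i,t+1} + q_{i,t+2} + q_{i,t+3} and y_{i,t+1} = ln(1 + q^f_{i,t})
q = panel.groupby("author")["q"]
panel["q_future"] = q.shift(-1) + q.shift(-2) + q.shift(-3)
panel["y"] = np.log1p(panel["q_future"])
print(panel)   # the last three years per author get a missing dependent variable
```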

The performance of researcher $i$ in year $t$, $q_{i,t}$, itself is the sum of the qualities of the journals in which the researcher published in year $t$, instead of the number of articles the researcher published in that year. The quality of a journal comes mainly from the work of Kodrzycki and Yu (2006); see Ductor et al. (2014, p. 940) for details.

The control variables in the restricted models D3 and X1 are the cumulative past performance $q^c_{i,t}$ of researcher $i$ in year $t$, the number of years $r_{i,t}$ since the last publication of researcher $i$, career time dummies $c_{it}$ and year dummies $t$. The cumulative past performance is the summation over the performance of a researcher from the first publication until five years before $t$, hence $t-5$. The summation over the performance from $t-5$ until $t$ is defined as the recent past performance $q^r_{i,t}$. For both $q^r_{i,t}$ and $q^c_{i,t}$, the log transformation is used.

In the unrestricted models D3' and X2 the control variables are the same, except that the past performance is not included. Instead, thirteen lags of the productivity variable are used. For model XBEST all the variables are included, so that XGBoost can decide for itself which variables are important.

An important difference between the restricted and the unrestricted models is the observations that are included in the training and testing sets. Observations of authors with a career time smaller than or equal to five are left out when the restricted models are trained and tested. The values for the cumulative and recent past output are also missing for those observations. All the observations are included for training and testing the unrestricted models. The missing lags of performance and network variables are replaced by 0, since authors that are at the beginning of their career simply do not have a past performance or coauthorships. The difference in the used observations makes the performance of the different kinds of models incomparable.

Besides the control variables and the recent past performance, there are also network variables. Those are constructed from the coauthorship networks $G_{t,s}$ at time $t$, in which $s$ is the number of years a link between authors $i$ and $j$ has to last. If there is a link between those authors, then $g_{ij,t} = 1$. There is a path between two authors if they work indirectly with the same persons. In the restricted models, the same time window is used as for the recent past performance, hence $s = 5$. In the unrestricted models, several values of $s$ are used.

The first network variables are the first- and second-order degree and a dummy variable for belonging to the giant component of $G_{t,s}$. The first-order degree, $n^1_{i,t}$, is the number of coauthors researcher $i$ is directly linked to, hence $g_{ij,t} = 1$. The second-order degree is the number of coauthors researcher $i$ is at distance 2 from, hence there is only one other researcher between them. The dummy variable takes the value 1 if researcher $i$ belongs to the giant component of $G_{t,s}$, which is the largest subset of nodes such that there exists a path between each pair of nodes and no path to a node outside the giant component.

Two other network variables are the closeness centrality ($C^c_{i,t}$) and the betweenness centrality ($C^b_{i,t}$), which are constructed from the giant component. For both variables, the log transformation is used. The first is the average distance of a node to the other nodes within the giant component. The second measures the frequency with which the shortest paths pass through node $i$.
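For intuition, these graph statistics can be computed on a toy coauthorship network with networkx. The author names and edges are made up, and networkx's built-in definitions (e.g. closeness as the reciprocal of the average distance, with its own normalisation) may differ in scaling from the exact definitions used by Ductor et al., so this is only a sketch.

```python
import networkx as nx

# Toy coauthorship network G_{t,s}: an edge means authors coauthored within the last s years.
G = nx.Graph([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("e", "f")])

giant_nodes = max(nx.connected_components(G), key=len)
giant = G.subgraph(giant_nodes)

degree = dict(G.degree())                          # first-order degree n^1_{i,t}
in_giant = {v: int(v in giant_nodes) for v in G}   # giant-component dummy
closeness = nx.closeness_centrality(giant)         # based on distances within the giant component
betweenness = nx.betweenness_centrality(giant)     # share of shortest paths through a node
print(degree, in_giant, closeness, betweenness)
```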

The last three network variables are based on the recent past performance $q^r_{i,t}$. The first is the productivity of the coauthors of researcher $i$ in year $t$, denoted as $q^1_{i,t}$. This is simply the summation of the recent past performance of all the coauthors of $i$. The second variable is the productivity of the coauthors of coauthors, denoted as $q^2_{i,t}$. Again, this is just the summation of the recent past performance. On these two variables, the log transformation is applied. The last variable is a dummy that is 1 if one of $i$'s coauthors has a recent past performance in the top 1% of the distribution of $q^r_{i,t}$.

The descriptive statistics are shown in Table 1. In total there are 129,003 researchers and 1,697,415 observations between 1970 and 1999. In this thesis, the data is split into a learning set and a testing set, in the same way as Ductor et al. (2014) did. The learning set consists of 64,502 randomly chosen researchers and contains 848,433 observations. The testing set contains the other researchers, with 848,982 observations. However, there are 181,010 observations with a missing value for the dependent variable in the training set and 180,977 in the testing set. These observations are left out.

Table 1: Descriptive statistics of the full dataset

                                        Mean    Standard deviation
Output variables
  Future performance                    0.41    0.99
  Cumulative past performance           1.62    1.44
  Recent past performance               0.62    1.20
Network variables
  Degree                                0.58    1.21
  Degree of order 2                     0.90    3.12
  Giant component                       0.10    0.30
  Closeness centrality                  0.01    0.02
  Betweenness centrality                0.50    2.29
  Coauthors' productivity               0.59    1.40
  Coauthors of coauthors' productivity  0.58    1.58
  Working with top 1%                   0.01    0.11

This section described all the variables and the training and testing datasets that are used in this thesis. In the following section, the results, based on the procedure described in the previous section, are shown and analysed.

5 Results and analysis

From the previous sections, it can be concluded that the models that are estimated by XGBoost are likely to outperform those estimated by OLS. According to the OLS models, all collaboration network variables improve the predictions. This section starts with an analysis to show that the XGBoost models indeed outperform the OLS models. Thereafter, the individual and joint predictive power of the collaboration network variables is considered. Finally, the robustness of the obtained results is tested.

5.1 Performance

This subsection summarises and analyses the results of the first research question: which method predicts better, OLS or XGBoost? First, the variables that the methods can use are kept the same. XGBoost models X1 and X2 are constructed and then compared to, respectively, the restricted OLS model D3 and the unrestricted OLS model D3'. Thereafter, the optimal XGBoost model (model XBEST) is constructed and compared to the other models. First, the results of the comparison of X1 and X2 with D3 and D3' are discussed. Thereafter, the results for XBEST are given and analysed.

Models X1 and D3 both use the cumulative and recent past performance as independent variables. There are observations with missing values for those variables, since there are researchers who have been publishing for less than five years. Hence, the observations of those researchers are left out.

Table 2: Untuned XGBoost models
Authors that started publishing at least five years ago

Note                                                     Model   Number of boosting rounds*   Out-of-sample RMSE
OLS model with cumulative and recent past performance    D3      -                            0.654
Variables exactly as in Ductor et al. (D3)               X1      19                           0.6333
Add 'continuous' variables to above model                X1      19                           0.6319
Delete dummies from above model                          X1      19                           0.6318

* Chosen by the early stopping rounds function and 3-fold cross-validation of XGBoost.

Table 2 gives the out-of-sample RMSE of the different models. The untuned XGBoost model X1, in which only the number of boosting rounds is chosen optimally, outperforms OLS model D3. X1 uses the same subsets of the data as model D3, hence training and testing subset 1. The RMSE of X1 is lower than that of model D3, respectively 0.6333 and 0.654.

However, XGBoost might be troubled by the dummy variables for career time and year. If the 'continuous' variables are used together with the dummies, the RMSE decreases to 0.6319. In Appendix A.3 it is shown that the 'continuous' variables are more important than the dummies. With both the dummies and the 'continuous' variables, there might be overfitting, hence the dummies are removed from the model. The predictions of the resulting model differ significantly from the predictions of D3, the linear model estimated by OLS (p-value of 0). Furthermore, the RMSE decreases slightly further with the resulting model, indicating that the dummies made the model perform worse.

Models X2 and D3' use thirteen lags of performance instead of the cumulative and recent past performance. There are no missing values for the lags. Hence, all the observations are used in the training and testing datasets of X2 and D3'.

Table 3 shows the out-of-sample RMSE of the models. The untuned XGBoost model X2 outperforms the OLS model D3', since the RMSEs are respectively 0.7410 and 0.758. Both models use training and testing subset 2. If the dummies are replaced by the 'continuous' variables, the RMSE drops to 0.7404. The Diebold-Mariano test shows that the difference between this XGBoost model and OLS model D3' is significant (p-value of 0). The decrease in the RMSE caused by replacing the dummies with the 'continuous' variables is small; however, the 'continuous' variables are again much more important than the dummies. This might be caused by the ordering in the values.

Table 3: Untuned XGBoost models
All authors

Note                                                                Model   Number of boosting rounds*   Out-of-sample RMSE
OLS model with thirteen lags of performance                         D3'     -                            0.758
Variables exactly as in Ductor et al. (D3')                         X2      32                           0.7410
Add 'continuous' variables to above model and delete the dummies    X2      49                           0.7404

* Chosen by the early stopping rounds function and 3-fold cross-validation of XGBoost.

The second question that needs to be answered is which model performs better if all the variables are available for XGBoost. Hence, XGBoost can use all the performance lags, the cumulative and recent past performance and all the lags of the network variables. This XGBoost model is denoted as XBEST. Since it can be seen as an extension of model X2, the number of boosting rounds is kept on 49. From the analysis in Appendix A.4 it can be concluded that XBEST uses only the performance lags and not the cumulative and recent past performance, since there is no significant difference when adding the cumulative and recent past performance. Moreover, it uses 14 lags of the network variables.

Hence, XBEST is an extension of model X2 in which the only difference is that fourteen lags of network variables are used instead of eight (according to the BIC). In order to test whether XBEST performs better than X1, XBEST is trained and tested on observations of authors that started publishing at least five years ago. The Diebold-Mariano test shows that XBEST differs significantly from X1 (p-value of 0.011). The RMSE becomes 0.6303, which is a decrease of 0.24% compared to X1 and 3.62% compared to D3.

In the analysis so far, only the untuned XGBoost models are considered. However, there are several parameters that can be tuned in order to optimise the performance of the XGBoost models. In Appendix A.5, the results of the stepwise tuning procedure are shown for different XGBoost models. However, XGBoost seems to overfit the model when the tuning procedure is followed. Moreover, for model X1 the difference in performance is not significant. Hence, the untuned models are considered better.

Finally, an analysis is conducted to determine whether handling missing values can improve the performance of an XGBoost model. The analysis is given in Appendix A.6 and it shows that handling missing values does not improve the performance of XGBoost when predicting the future performance of a researcher.

Concluding, no matter how OLS is compared to XGBoost, the latter performs better in all cases. XGBoost does seem to have problems with overfitting when the tuning procedure is followed. Applying the BIC to select the optimal lag length of the network variables results in a lag length of fourteen for XBEST, the best XGBoost model. This model gives a reduction in RMSE of 3.62% compared to D3 (only authors that started publishing at least five years ago). The next subsection covers the analysis of the importance of network variables according to XBEST.

5.2 Network variables

In the previous subsection, the XGBoost models are compared to the OLS models. This subsection considers the difference in the importance of the social network variables. Here, no attention is paid to real causality; instead, it is examined whether the network variables are a Granger cause, which means that they have predictive power. Since the XGBoost models outperform the OLS models, the importance XGBoost assigns to network variables is more accurate. In this subsection, only XBEST is analysed, because this is the best model.

The remaining part of this subsection is organised as follows. First, it is determined whether the social network variables improve the out-of-sample predictions. Thereafter, the importance assigned to them by XGBoost and OLS is compared. Finally, the effects of the individual network variables are considered and compared to the results of Ductor et al. (2014).

In order to test the overall predictive power of all the network variables, a new model is estimated by XGBoost. This model contains the same variables as XBEST but without the network variables. This model is trained and tested on a set with all the observations, in order to compare it with D3', and on a set with only authors that started publishing at least five years ago, in order to compare it with D3. The results are shown in Table 4. The number of boosting rounds is kept the same as in XBEST, since the early stopping rounds function and 3-fold cross-validation seem to overfit the model (see the previous subsection).

The Diebold-Mariano test shows that adding all network variables does not change the predictions significantly (p-value of 0.323) if only observations of authors that started publishing at least five years ago are taken into account. Hence, network variables do not have predictive power over and above past performance in this case. This is in contradiction with the results of Ductor et al. (2014, p. 943).

However, when all observations are included, the difference caused by adding network variables is significant (Diebold-Mariano test gives a p-value of 0). This is in line with the results of Ductor et al. (2014, p. 944). However, Ductor et al. show that the RMSE is reduced by 1.94%, which is more than the reduction XGBoost shows (1.11%). Hence, the predictive power of all the network variables is overestimated by the linear models estimated by OLS.

Another interesting point is that the XGBoost models without network variables perform much better than their OLS equivalents. This indicates that the control variables and past performance already explain more than in OLS, hence their relationship with a researcher's future performance is not linear, but more complex.

Table 4: The decrease in RMSE by introducing network variables

                                           Authors that started publishing     All authors
                                           at least five years ago
Model                                      RMSE*    Percentage decrease        RMSE*    Percentage decrease
XBEST without network variables            0.6308   -                          0.7492   -
XBEST with 14 lags of network variables    0.6303   0.08%                      0.7409   1.11%
OLS without network variables**            0.665    -                          0.773    -
OLS with network variables**               0.654    1.65%                      0.758    1.94%

* Out-of-sample
** See Ductor et al. (2014), Tables 4 and 7

To investigate the importance XGBoost assigns to network variables when all observations are included, the feature importance that XGBoost outputs is considered. The importance is measured by the F-score, which is simply the number of times a feature appears in a tree (from the official XGBoost guide). In Table 5 this is summarised for all variables in the first two columns. The first column gives the total F-score over all the lags and the second column gives the average F-score per lag. Since this is for model XBEST, there are fourteen lags for each network variable and thirteen performance lags.
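In the xgboost library this F-score corresponds to the 'weight' importance type of a trained booster. The toy model and feature names below are made up; the sketch only shows where such a count comes from.

```python
import numpy as np
import xgboost as xgb

# Toy model with hypothetical feature names: the 'weight' importance counts how
# often each feature is used to split a node across all trees (the F-score).
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = 2 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.1, 500)

dtrain = xgb.DMatrix(X, label=y, feature_names=["past_perf", "coauthor_perf", "degree"])
booster = xgb.train({"objective": "reg:squarederror", "max_depth": 3}, dtrain, num_boost_round=30)

fscores = booster.get_score(importance_type="weight")
print(sorted(fscores.items(), key=lambda kv: -kv[1]))
```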

OLS does not have such an importance measure; however, Ductor et al. give the RMSE differential of each network variable w.r.t. the base model with the control variables and thirteen lags of performance. This is also given in Table 5, in the fifth to seventh columns. The third and fourth columns give the RMSE and the RMSE differentials that are calculated in the same manner, but with the models estimated by XGBoost instead of OLS. Here fourteen lags of the network variables are used, as in XBEST.

From Table 5, it can be concluded that the past performance lags are used far more often than the individual network variables. However, the performance of a researcher's coauthors and of the coauthors of coauthors seems to be important too; the F-score and both differentials are relatively high.

Table 5: Individual effect of network variables
All authors

                                        XGBoost                                            OLS*
                                        F-score               RMSE      RMSE               Lag       RMSE     RMSE
Variable                                Total      Average              differential**     length             differential**
Base model***                           890****    68.5****    0.7492   -                  -         0.773    -
Coauthors' performance                  525        37.5        0.7419   0.98%              12        0.761    1.55%
Coauthors of coauthors' performance     384        27.4        0.7435   0.76%              11        0.764    1.16%
Betweenness                             240        18.5        0.7476   0.22%              9         0.767    0.78%
Closeness                               245        17.5        0.7450   0.56%              10        0.767    0.78%
Degree of order 2                       197        14.1        0.7466   0.35%              5         0.768    0.65%
Degree                                  179        12.8        0.7485   0.10%              6         0.768    0.65%
Working with a top 1%                   26         2.6         0.7449   0.57%              13        0.767    0.78%
Giant component                         3          1           0.7455   0.50%              8         0.768    0.65%

The number of boosting rounds is kept on 49, because the 3-fold cross-validation tends to overfit.
* See Table 6 of Ductor et al. (2014) for the lag length and the RMSE.
** W.r.t. the base model. The differentials for XGBoost are calculated with the unrounded RMSEs.
*** A model with the control variables and 13 lags of the performance variable.
**** This is the F-score of only the performance lags and not of the whole base model.

Noteworthy is the difference between the F-score and the two differentials of the degree of order 2 and working with a top 1% coauthor. The differentials suggest that working with a top 1% coauthor has more predictive power than the degree of order 2, while the F-score suggests that the latter has more predictive power, since it is used seven times more often.

Furthermore, the F-score suggests that the degree is quite important, while especially the differential for XGBoost is small. The opposite holds for belonging to a giant component, which is used only three times to split a node in XBEST. Lastly, betweenness seems important according to the F-score and the RMSE differential of OLS. However, the RMSE differential of XGBoost is the second lowest.

Hence, it is clear that the coauthors' performance and the performance of coauthors of coauthors are important. For the other network variables, the conclusion is not unambiguous. Moreover, from the fourth column of Table 5 it can be seen that not all network variables can individually improve the predictions, since the RMSE differential does not even come close to 1%. Furthermore, there is a chance that some network variables take over the role of other network variables that are not included and thus seem important while they are not.

Therefore, it is tested whether there are indeed network variables that should not be included in the model. If only one network variable, with fourteen lags, is added to the base model, the lowest BIC is obtained when the coauthors' performance is added. If one network variable can then be added to the resulting model again, the coauthors of coauthors' performance gives the lowest BIC. However, according to the Diebold-Mariano test, there is no significant difference between the predictions of a model with fourteen lags of coauthors' performance (and the other variables of the base model) and a model with both coauthors' and coauthors of coauthors' performance (and the other variables of the base model) (p-value of 0.355).

Table 6: Individual effect of network variables (all authors)

Model                                                    RMSE     RMSE differential   Diebold-Mariano p-value
Base model*                                              0.7492   -                   -
Base + coauthors' performance                            0.7419   0.98%**             0**
Base + coauthors' performance
  + coauthors of coauthors' performance                  0.7417   0.03%***            0.355***
Base + all network variables                             0.7409   0.13%***            0.001***

The number of boosting rounds is kept at 49 and the RMSE is the out-of-sample RMSE.
* The base model includes 13 lags of performance and the control variables.
** W.r.t. the base model.
*** W.r.t. the model: Base + coauthors' performance.

However, the difference between a model with fourteen lags of coauthors' performance (and the other variables of the base model) and a model with all the network variables (and the other variables of the base model) is significant (p-value of 0.001). Hence, the network variable with the most predictive power is the coauthors' performance, but including all the other network variables significantly improves the predictions. Hence, all the network variables seem important. Table 6 gives the RMSE, the RMSE differential and the p-values of the Diebold-Mariano tests for the different models.
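
The Diebold-Mariano tests above compare the squared prediction errors of two models. A minimal sketch of such a test, assuming (approximately) uncorrelated loss differentials; the function and variable names are illustrative and not the exact implementation used in this thesis:

```python
import numpy as np
from scipy import stats

def diebold_mariano(y_true, pred_a, pred_b):
    """Two-sided Diebold-Mariano test under squared-error loss.

    Assumes the loss differentials are (approximately) uncorrelated; for
    multi-step forecasts a HAC variance estimate should be used instead.
    Returns the DM statistic and its asymptotic N(0, 1) p-value.
    """
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    d = (y_true - pred_a) ** 2 - (y_true - pred_b) ** 2  # loss differential
    dm_stat = d.mean() / np.sqrt(d.var(ddof=1) / d.size)
    p_value = 2 * (1 - stats.norm.cdf(abs(dm_stat)))
    return dm_stat, p_value
```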


In summary, the network variables do not have predictive power over and above past performance when only the observations of researchers who started publishing at least five years ago are taken into account. However, when all the observations are used, the network variables do have predictive power. This most likely means that network variables only have predictive power for researchers who are at the beginning of their career, i.e. have started publishing less than five years ago. This contradicts the results of Ductor et al. (2014). If all observations are considered, the most important network variable is the coauthors' performance, which is in line with what OLS suggests. Moreover, according to both OLS and XGBoost, all network variables improve the predictive performance when all observations are considered. However, when network variables are added, OLS reduces the RMSE more than XGBoost. XGBoost makes better predictions, so the importance that XGBoost assigns to the network variables is more accurate. Hence, the effect of network variables is overestimated by OLS.

5.3 Robustness

The previous section shows that the unrestricted multivariate XGBoost model with a lag length of fourteen for the network variables gives the best predictions. All network variables improve the predictive performance when all observations are included in the analysis. However, when only observations of researchers with a career time longer than five years are considered, the network variables do not have significant predictive power.

In this subsection, the robustness of these results is considered. For this, the best model XBEST is trained and tested again, but with a changed definition of the dependent variable, the future performance of a researcher. The number of boosting rounds is again chosen by the early stopping function and 3-fold cross-validation, both built into XGBoost.
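
A minimal sketch of how the number of boosting rounds can be selected with the cross-validation and early stopping that are built into the xgboost package; the data and the parameter values below are illustrative, not the tuned settings of XBEST.

```python
import numpy as np
import xgboost as xgb

# Illustrative data; in the thesis the training panel of researchers is used.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10) + rng.normal(size=1000)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "reg:squarederror", "eta": 0.1, "max_depth": 6}

# 3-fold cross-validation with early stopping: stop adding trees once the
# cross-validated test RMSE has not improved for 10 consecutive rounds.
cv_results = xgb.cv(params, dtrain, num_boost_round=500, nfold=3,
                    metrics="rmse", early_stopping_rounds=10, seed=1)

best_rounds = len(cv_results)      # rounds kept up to the best iteration
bst = xgb.train(params, dtrain, num_boost_round=best_rounds)
print(best_rounds, cv_results["test-rmse-mean"].iloc[-1])
```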

First, the future performance is measured with a five-year window instead of a three-year window, so a more distant future has to be predicted. The RMSE of the unrestricted multivariate OLS model D30 becomes 0.9039, which is higher than in the three-year window case. The reason for this might be that the more distant the future is, the harder it is to predict (Ductor et al., 2014, p. 946). The RMSE of the untuned version of XBEST with 57 boosting rounds is 0.8649.



Hence, XGBoost still outperforms OLS in predicting the future performance of a researcher. The difference in performance is significant according to the Diebold-Mariano test (p-value of 0).

When all observations are included, the network variables improve the predictions w.r.t. a model with just the control variables and the past performance lags (Diebold-Mariano test p-value of 0). The out-of-sample RMSE becomes 0.8870, so the RMSE is reduced by 2.49% when network variables are added to the model. However, when only the observations of researchers with a career time longer than five years are considered, the RMSE is reduced by just 0.01% and the predictions of the future performance do not differ significantly (p-value of 0.879). Hence, network variables do not have predictive power over and above the past performance when the future performance is predicted for researchers who started their career longer than five years ago.

The second robustness check consists of correcting the three-year window future performance for the number of pages and the number of coauthors. The performance q_{it} of researcher i in year t depends on the set of articles S_{it}, which contains all articles j of author i in year t. Formally, it is defined as:

q_{it} = \sum_{j \in S_{it}} \frac{\text{pages}_j \times \text{journal quality}_j}{\text{number of coauthors}_j}

Here pages_j stands for the number of pages of article j divided by the average number of pages of the articles published in the same journal. This definition is the same as the one used by Ductor et al. (2014).
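
As an illustration, this corrected performance measure could be computed from an article-level data set as sketched below; all column names and values are hypothetical, not the actual data of the thesis.

```python
import pandas as pd

def corrected_performance(articles: pd.DataFrame) -> pd.Series:
    """Sum of relative pages * journal quality / number of coauthors
    per author and year, i.e. the corrected performance q_it."""
    a = articles.copy()
    # Page count of article j relative to the average article length
    # in the same journal.
    a["rel_pages"] = a["pages"] / a["journal_avg_pages"]
    a["score"] = a["rel_pages"] * a["journal_quality"] / a["n_coauthors"]
    return a.groupby(["author", "year"])["score"].sum()

# Tiny illustrative article-level data set (values are made up).
articles = pd.DataFrame({
    "author": ["A", "A", "B"],
    "year": [2000, 2000, 2001],
    "pages": [20, 10, 30],
    "journal_avg_pages": [25, 20, 30],
    "journal_quality": [1.5, 0.8, 2.0],
    "n_coauthors": [2, 1, 3],
})
print(corrected_performance(articles))
```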

The RMSE of the unrestricted multivariate OLS model D30 becomes 0.6575, while the RMSE of the untuned version of XBEST with 26 boosting rounds is 0.6440. Hence, XGBoost still outperforms OLS in predicting the future performance of a researcher, and the difference in performance is significant according to the Diebold-Mariano test (p-value of 0).

With this definition of performance, the out-of-sample RMSE is reduced by 1.23% when network variables are added to the model (p-value of 0) and all observations are considered. However, when only the observations of researchers with a career time longer than five years are considered, the network variables again do not have predictive power (p-value of 0.001, RMSE reduction of 0.26%).


Concluding, XGBoost outperforms OLS even with a dependent variable that captures a more distant future or a dependent variable that is corrected for the number of pages and the number of coauthors. In both cases, XGBoost shows that the network variables contribute to a better prediction of the future performance of a researcher if all observations are included. However, for researchers that started their career longer than five years ago, the network variables do not have predictive power.



6 Conclusion

In this thesis, the main question that has been examined is to what extent collaboration network variables have predictive power on the future performance of a researcher according to XGBoost, a machine learning method. Ductor et al. (2014) already showed that network variables have predictive power over and above the past performance of a researcher when OLS is used instead of XGBoost.

Hence, it is first considered whether the performance of the models estimated by XGBoost and OLS differs significantly. The models estimated by XGBoost perform significantly better, as expected, since XGBoost can capture a more complex relationship than OLS. The difference is caused solely by the complexity XGBoost can capture and not by using indirect information from missing values.

Since the predictions of XGBoost are better, the importance it assigns to network variables is probably more accurate than the importance OLS assigns to them. If all observations are used to train and test the models, XGBoost reduces the out-of-sample RMSE by 1.11% when network variables are added, while OLS reduces the RMSE by 1.94%. The RMSE of the model with network variables estimated by XGBoost is 2.26% lower than the RMSE of the same model estimated by OLS. This is caused by a better representation of the complexity, which also results in a better prediction by the model with only control variables and past performance: the RMSE of that model estimated by XGBoost is 3.08% lower than that of the same model estimated by OLS.

Important is that the predictive power of network variables is structurally overestimated by OLS, as can be seen from the lower RMSE differentials for XGBoost. Moreover, according to OLS the network variables have predictive power from the start of a researcher's career until fourteen years later. However, when the model with control variables and past performance estimated by XGBoost is tested against the same model with network variables, the predictions do not differ significantly if the training and testing set only contain researchers that started publishing at least five years ago. Hence, network variables do not have predictive power on the future performance of researchers who started their publishing career longer than five years ago.


However, if all researchers are included in the analysis, XGBoost shows that network variables do make the predictions significantly better. Of all the network variables, XGBoost and OLS both assign the most value to the coauthors' productivity and the coauthors of coauthors' productivity. Moreover, with all observations, they both show that all the network variables contribute to better predictions. However, the RMSE is reduced much more for OLS than for XGBoost. Hence, the predictive power of the network variables is again overestimated by OLS.

To test the robustness, the performance of a researcher is measured with a five-year window instead of a three-year window. Both XGBoost and OLS perform worse, since the more distant the future is, the harder it is to predict. However, XGBoost still outperforms OLS, so the result that XGBoost performs better is robust. If the definition of the performance is corrected for the article length and the number of coauthors, XGBoost still outperforms OLS.

There are many questions which still have to be answered. The results of this thesis indicate that not all network variables have predictive power; however, a more in-depth analysis has to be conducted in order to determine which network variables are important. Moreover, the robustness could be tested more fully. Furthermore, it is interesting to know whether network variables have more predictive power when only researchers who are at the beginning of their career are considered. For those researchers it might even be the case that the network variables have more predictive power than the past performance.



7 References

Berk, R. (2017). Statistical Learning from a Regression Perspective (2nd ed.). Switzerland: Springer.

Brownlee, J. (2016, August 31). Retrieved from https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).

Chu, C. Y., Henderson, D. J., & Wang, L. (2017). The Robust Relationship Between us Food Aid and Civil Conflict. Journal of Applied Econometrics, 32(5), 1027-1032.

de Prado, M. L. (2018). Advances in financial machine learning. John Wiley & Sons.

Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78-87.

Ductor, L., Fafchamps, M., Goyal, S., & van der Leij, M. J. (2014). Social networks and research output. Review of Economics and Statistics, 96(5), 936-948.

Heij, C., De Boer, P., Franses, P.H., Kloek, T., & Van Dijk, H.K. (2004). Econometric methods with applications in business and economics. America: OUP Oxford.

Hiemstra, C., & Jones, J. D. (1994). Testing for linear and nonlinear Granger causality in the stock price volume relation. The Journal of Finance, 49(5), 1639-1664.

Liu, Y., & Xie, T. (2018). Machine learning versus econometrics: prediction of box office. Applied Economics Letters, 1-7.

Nunn, N., & Qian, N. (2014). US food aid and civil conflict. American Economic Review, 104(6), 1630-66.

Stern, L. H., Erel, I., Tan, C., & Weisbach, M. S. (2017). Selecting Directors Using Machine Learn-ing.

Varian, H. R. (2014). Big data: New tricks for econometrics. Journal of Economic Perspectives, 28(2), 3-28.

Yang, C., Delcher, C., Shenkman, E., & Ranka, S. (2017, April). Machine learning approaches for predicting high utilizers in health care. In International Conference on Bioinformatics and Biomedical Engineering (pp. 382-395). Springer, Cham.
