
Computational Science
Master's Thesis

Opinion Dynamics on Corporate Boards

Author: J.P. Uylings
Supervisors: Dr. Rick Quax, Javier Garcia Bernardo

September 20, 2018


Abstract

We present a model of opinion dynamics backed by the results of a robust modelling procedure. The model is validated in a variety of scenarios and applied in a study of influences in an interlock network of company boards. Some recent studies of social networks aim to find the characteristics (or 'features') and nodes that have the highest influence on collective opinion formation. However, most of those studies focus on these features alone and do not implement interdependent opinion behavior, and few take the combined effects of a larger number of features into account. In our work we deal with a special type of social network, an intertwined board-of-directors network, which connects all companies through co-membership. In board meetings of large companies, important decisions are made regarding long-term strategies for the corporation and economic development. In this study, we aim to achieve two research goals: to identify the most influential opinion spreaders and to find a restricted number of features which can predict the impact of each opinion spreader. The present approach introduces two innovative aspects: first, the opinion dynamics is described by a kinetic Ising model which includes dynamic node behavior through the analogy with spins; second, state-of-the-art machine learning methods are applied to enable the identification of four dominant and robust predictive features. More complete, dynamics-based information is crucial for building a reliable predictive model; our approach is the first to combine dynamic behavior with more complete feature information. The resulting general model, exemplified here by a company network, is due to its robustness considered applicable to a wide variety of networks with minimal loss of performance.


Contents

1 Introduction
2 Methods
2.1 Directors Interlock Network
2.2 Ising Model
2.3 Nudge Approach
2.3.1 Greedy Approach
2.4 Topological Features
2.5 XGBoost
2.5.1 Model Explanation
2.5.2 Model Optimization
2.5.3 Model Metric Calculation
2.6 Sensitivity Analysis
2.7 Hyperparameter Tuning
3 Results
3.1 Sensitivity Analysis
3.1.1 Important Features
3.1.2 Feature Value Contributions and Akaike Procedure
3.2 Directors Network Model
3.2.1 Ising Model Results
3.2.2 Pairwise Complementary Features
3.2.3 Model Analysis
3.2.4 Feature Value Contribution: Interlock Network
3.2.5 Interlock Network: Validation
3.3 Model Visualization
4 Conclusion
Appendices
A Background Theory


1 Introduction

Opinions are the pillars upon which almost all social interactions are based. Every day, we make decisions based on opinions in a large variety of ways. We form opinions on things like politics, food and appearance preferences. In addition, from the viewpoint of the social sciences, we also hold a set of complex and nuanced opinions about how others will act in certain social interactions [2]. Some opinions are formed through learning experiences under parental guidance [2], but most are formed by people's own experiences [49].

Above all, opinions are influenced by other people's opinions. By representing this process as a network, one directly recognizes a complex system with higher-order interdependencies. The main goal of complex systems research is to understand how the dynamics of individual units combine to produce the behavior of the system as a whole [50]. A complex system is a system composed of many interacting components. Such a system may well be represented as a network where the nodes represent the components and the links their interactions. In this paper we use a complex system to simulate information and opinion spreading. Information spreading is the mechanism behind many phenomena such as viral marketing, where companies exploit social networks to promote their products [9]. This is achieved by influencing the nodes with the highest potential for spreading their opinion, called 'influential spreaders', thereby maximizing profit at minimal cost. Opinion influences are everywhere: networks of friends, family, co-workers and mass media, to name a few. They constitute major opinion-driving platforms during elections, producing amplification or attenuation of risk perceptions and shaping public opinion about social issues [45].

The identification of influential spreaders attracts increasing attention from both the computer science and physics communities, with algorithms ranging from simply counting the immediate neighbors to complicated machine learning and message passing approaches [41]. However, almost all studies are topological-only and do not take dynamical opinion formation (opinion changes through interdependencies) into account. Due to this omission, it remains unclear beforehand which nodes are influential spreaders.


In addition to finding influential spreaders, recent studies are interested in the capability of topological features to predict influential spreaders [38]. Across the literature, diverse topological features have been selected as dominant prediction features. It has been shown that different features (degree, coreness, etc.) may identify different influential nodes even for the same dynamical processes. Such approaches, restricted in the number of features used, are expected to come at the expense of the desired accuracy [16]. In addition, most of these models are based on 'topology-only' results, thereby ignoring the node dynamics, and disregard how features can complement each other in identifying influential spreaders.

It is instructive to compare various research fields with regard to which features are considered dominant identifiers. Note that the models underlying these observations rarely take the dynamics of the nodes into account and are frequently incomplete. In addition, the assumption is sometimes unjustifiably made that field-specific models, like epidemic spreading, are directly applicable to opinion spreading phenomena. This multitude of approaches creates a wide variety of features which are assumed to be important identifiers, making it hard to filter out the features which are truly the best identifiers in opinion dynamics.

Pioneering work in the 2000s [3, 48, 14] proposed that in the case of a broad degree distribution, hubs (nodes with a high number of connections) play a key role in identifying influential spreaders. However, recent research showed that degree centrality, closeness centrality, shortest path and clustering are more reliable indicators [5, 23]. Other fields such as sociology, rumor dynamics and biology use different features to determine influential nodes as well.

In sociology [31], it is often assumed that spreading capability is directly linked both to the betweenness centrality of the node [19, 20] and to the classification of the node as a hub [10].

Studies of rumor dynamics and disease spreading show that coreness may be an important predictor of influential spreaders [8, 40, 55, 11].

Biological network science suggests that the important nodes are best described by betweenness centrality and eigenvalue centrality [37]. Recent literature mostly focuses on the small number of individual features discussed above rather than on the extended variety of other possible features. In addition, it does not combine these features to see which ones complement each other.

By contrast, our approach does take this into consideration, and hence we believe our work will be a useful addition to the existing literature. In this paper we use a social directors network built from data on all Dutch corporations with a yearly revenue of over 10 million euros. Board directors discuss topics such as long-term strategies for the corporation and/or are involved in decisions regarding economic growth or recession (such as whether to increase or decrease the investment in a product or advertisement). Since boards make the decisions for the companies, knowledge of how opinions


spread and how they can be influenced is potentially very important. A first and relatively common idea would be that decision-making groups like director boards are isolated units and as such independent, as are their decisions. In actual practice, however, such boards turn out to form a well-connected network, with individual directors sharing multiple boards. Thereby, opinions of directors may be susceptible to influences from outside their board. The directors exchange information and opinions with other directors in meetings. Decision making in one board can influence the decisions in other boards [6]. By taking the directors as nodes and 'sharing the same board' as the edges, they form a so-called interlock network [6].

Identifying the most influential 'spreaders' in the interlock network is an important step towards a more efficient spread of information [42]. Additionally, this type of information can be used to determine how well topological features predict opinion spreading capabilities by the use of supervised learning. In our work we focus on the latter by making use of a kinetic Ising model and a state-of-the-art machine learning model, XGBoost. The tendency to conform to other people's opinions is implemented in the widely used Ising model in a natural way. The Ising model enables us to obtain a so-called 'impact measurement', quantifying every node from the weakest to the most influential spreader in a dynamical manner. These data, in combination with 14 different topological features, are used by XGBoost. XGBoost enables the extraction of a combination of features that can robustly predict the spreading capabilities of every node. This is done in 16 different scenarios on Erdős–Rényi and Barabási–Albert networks: the first possesses the small-world and the second the scale-free property of an interlock network. The extracted features are: clustering, degree centrality, average neighbor degree and average shortest path. These features will be used to analyze the board of directors interlock network. In the upcoming section we explain our methods and the topological features we have used. In section 3 we show and discuss the results and their limitations, followed by a conclusion in section 4.


2 Methods

To give a general idea of how the methods relate to one another, we first briefly discuss the approaches before going into depth in their corresponding subsections. The main goals are to find the most important features for predicting the impact each director has on the network, and to find the directors that are able to maximize that impact, called 'influential spreaders'. We use the kinetic Ising model, a model which is able to dynamically calculate how opinions are influenced by other people in a social network through the analogy with spins. Selected nodes may be influenced by external opinions, represented by additional nodes called 'nudges' connected to them. The initial computation calculates how opinions are influenced without node nudges. These results are compared to the computations in which each of the nodes has its opinion influenced by such a nudge. The quantification of this difference is called the 'impact' of a node nudge. We use a so-called greedy nudge approach, i.e. preferring nodes with a high impact, to identify the most influential spreaders. This approach permanently nudges the node with the highest impact after calculating the impact of all nodes on the network. This process is repeated for a fixed amount of resources, in accordance with the number of node nudges we are able to spend.

Next, we use a tree-based machine learning model, XGBoost, to make a predictive model, using the calculated impact of all nodes as the target variable. To be able to make an accurate prediction, we use 14 different features that are based on the topological properties of the graph and link these to the corresponding impact. These features can be extracted directly from the network without having to calculate intermediary modeling results in advance. This way, we can present a general model able to predict which nodes are likely to be influential spreaders without the need to run a simulation first.

In order to verify the robustness of the model, we made more than half a million models on Erdős–Rényi and Barabási–Albert networks under a variety of circumstances. This way, we can determine, in a general way and for a wide variety of complex systems, which nodes one could influence to maximize the impact. From this sensitivity analysis we have identified which directors can be classified


as weak and which as influential information spreaders. In addition, we have found a general identification of the features people need to possess in order to maximize this spread of information. The individual identities of the most influential spreaders are now clear but will not be given explicitly in this thesis.

2.1 Directors Interlock Network

The data for the directors interlock network consist of all the Dutch companies with a revenue of more than 10 million euros. The directors of these companies are connected through board memberships and can be represented as a bipartite network. By projecting this bipartite network [59], we obtain the interlock network.
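As an illustration, a minimal sketch of this projection step with networkx; the director and board names are hypothetical placeholders, not the actual data:

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical toy example: directors D1..D4 serve on boards B1..B2.
B = nx.Graph()
directors = ["D1", "D2", "D3", "D4"]
boards = ["B1", "B2"]
B.add_nodes_from(directors, bipartite=0)
B.add_nodes_from(boards, bipartite=1)
B.add_edges_from([("D1", "B1"), ("D2", "B1"),
                  ("D2", "B2"), ("D3", "B2"), ("D4", "B2")])

# Project onto the director side: two directors are linked
# if they share at least one board membership.
interlock = bipartite.projected_graph(B, directors)
# Resulting edges: D1-D2 (via B1); D2-D3, D2-D4, D3-D4 (via B2).
```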

In order to influence opinions efficiently, we need to quantify the opinion spreading through the entire network. Small disconnected components will not contribute to the process because these nodes are out of reach of the opinion dynamics. Therefore we take the components that are most connected to the interlock network.

Below, we visualize the directors graph and the clusters, laid out with the Fruchterman–Reingold algorithm [22], in order to get an impression of the network.

Figure 2.1: Directors interlock network. The nodes represent the directors, the edges connections between directors, and the colors different clusters.

Director interlock networks can be classified as social networks, consisting of linked groups of boards. This results in a network with a structure of weakly coupled clusters. As shown in figure 2.1, we verify that the network structure indeed consists of many weakly coupled clusters, with each color representing a different cluster. Two clusters that share a color but are not directly linked are nevertheless distinct clusters.


Research involving director interlock networks has concluded that they have small-world properties [15, 39, 32]. A small world is a network in which most nodes are not neighbors of one another, but in which every node can be reached from any other node in a relatively small number of hops or steps. This corresponds to the network we are using in this work, figure 2.2.

        Degree  CC     Coreness  Hubs   Authorities  LC     Pagerank
Mean    13.13   0.16   0.49      0.001  0.001        0.007  0.0012
Std     8.45    0.03   0.29      0.005  0.005        0.032  0.0005

        AN-Degree  BC     EC     Clustering  COM-C      ASP   DC
Mean    15.53      0.007  0.008  0.924       445·10^8   6.58  0.016
Std     6.52       0.032  0.034  0.182       212·10^9   1.41  0.01

Figure 2.2: Properties of the directors interlock network, with (CC) Closeness Centrality, (LC) Load Centrality, (AN-Degree) Average Neighbor Degree, (BC) Betweenness Centrality, (EC) Eigenvalue Centrality, (COM-C) Communicability Centrality, (DC) Degree Centrality and (ASP) Average Shortest Path.

These measurements are explained in section 2.4: Topological Features.

2.2 Ising Model

In order to model the opinion dynamics in the interlock network we use the Ising model. The Ising model is an algorithm which can capture collective opinion behavior through node dynamics; it is therefore very suitable and widely used to simulate opinion flow through a network [43, 33, 56]. Each node holds a binary opinion, the spin $X_i \in \{-1, 1\}$, which can be affected by its neighbors [35]. The opinion evolves over time as a function of its own value and of those of its network neighbors, all weighted according to their influence [30].

The decision making is based on the hypothesis that people tend to follow the opinions they are exposed to. One of the basic principles of this complex system is that people can only be exposed to an opinion by the people they are directly linked to. In other words, direct influence by other people only takes place by virtue of an interaction with those people. This decision-making behavior is also known as 'herd behavior' [18, 47]. The neighbors that people are linked to will simultaneously be linked to other people, making it hard to predict how information spreads. The assumption that people are influenced by their direct social network makes the Ising model well suited for this purpose [25, 26, 58]. The literature assumes that shared directors have a strong influence on each other's opinion [6], which supports the classification as 'herd behavior'. Many other authors have used an Ising-based model approach to model opinion formation on social networks [53, 34, 24, 7, 57, 28, 29].


The Ising model can be viewed as a network of binary units that interact with each other. Each unit probabilistically chooses its next state depending on the current states of its neighbors, making the system a discrete-time Markov network [50]. A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event [17]. The Markov chain does not get trapped in cycles and contains positive probabilities; therefore it is irreducible, aperiodic and ergodic. The Ising model has been used to describe a variety of systems, including cellular regulatory networks, brain networks, immunity networks, financial networks and social networks [50].

Because we need to know what impact the nodes have on the system after we influence them, we first need to find the so-called 'Gibbs probability' for each state. This state probability gives the probability of a node x being in state X at time t + 1. This probability depends on the energy function $H_\sigma$, the Hamiltonian of the system:

$$H_\sigma = -J \sum_{\langle i,j \rangle} \sigma_i \sigma_j \qquad (2.1)$$

where $\sigma_i$ and $\sigma_j$ are neighboring spins, $J = 1$ is the coupling constant, and $\sigma_i \sigma_j$ is summed over all pairs of neighboring nodes in a finite volume. The positive value of J indicates that ferromagnetic coupling (the tendency to align the spins) is in order in the system. A value $J = -1$ would instead indicate counter-alignment: in that case the opinion of one person would evoke the opposite opinion in others. The Hamiltonian represents a numerical value of the alignment of spins; a high Hamiltonian indicates that the spins are not aligned.

The Gibbs probability is defined as:

$$P(X = x) = \frac{1}{Z(T)} e^{-H_\sigma / T} \qquad (2.2)$$

with P the probability of the system X being in state x and T the temperature; the inverse temperature $\beta = 1/T$ is also frequently used in this context. We define a partition function Z, a constant that normalizes the probabilities:

$$Z(T) = \sum_i e^{-H_{\sigma_i} / T} \qquad (2.3)$$

with T the 'temperature' (a high temperature stands for more chaotic behavior and thus less alignment tendency) and $H_{\sigma_i}$ the total 'energy' of the system in state i (the system tends to reduce this energy by alignment). The temperature T of the system is thereby introduced as a parameter that measures the degree of randomness in the system [51]: when the temperature increases, so does the randomness. This parameter is estimated by the use of Shannon entropy.


The entropy h of a finite Gibbs probability distribution $(p_1, \dots, p_n)$, with n the number of nodes in the network and i the current node, is defined as:

$$h(p_1, \dots, p_n) = -\sum_i p_i \log_2(p_i) \qquad (2.4)$$

Before running the Ising model, we set the network temperature via a chosen entropy value. To derive a temperature from a chosen entropy, we iterate over a set of possible temperatures and calculate the corresponding probability distributions. Next, we calculate the entropy values of this set of distributions. We take the temperature whose entropy matches the entropy originally chosen, and use this temperature in the Ising model. We use two quite different entropies in this paper, 0.1 and 0.25, to test their respective effects in the sensitivity analysis below, thereby making the model more robust.
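A minimal sketch of this calibration loop; `gibbs_distribution` is a hypothetical stand-in for whatever routine returns the equilibrated Gibbs distribution of equation 2.2 at a given temperature:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (eq. 2.4) of a probability distribution, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # avoid log2(0)
    return -np.sum(p * np.log2(p))

def calibrate_temperature(gibbs_distribution, target_entropy, temperatures):
    """Return the temperature whose Gibbs-distribution entropy lies
    closest to the chosen target entropy (e.g. 0.1 or 0.25)."""
    best_T, best_gap = None, np.inf
    for T in temperatures:
        gap = abs(entropy(gibbs_distribution(T)) - target_entropy)
        if gap < best_gap:
            best_T, best_gap = T, gap
    return best_T
```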

In our work, we apply nudges on nodes in order to see how much the probabilities in the network change. In order to compare these probabilities, we create a zero (nudge-free) probability distribution Q and a nudged probability distribution P. First a node nudge is applied; subsequently, the Gibbs probability distributions are calculated after we equilibrate the network by taking $10^5$ steps per node. We repeat this process 10 times and average the Gibbs probability distributions for robustness before calculating the impact. This procedure ensures that the influence of the node nudge has propagated properly through the whole network.

The impact can now be calculated by comparing the 'nudged' Gibbs probability distribution P with the 'nudge-free' Gibbs probability distribution Q. This is achieved by taking the Hellinger distance between the distributions $P = (p_1, \dots, p_k)$ and $Q = (q_1, \dots, q_k)$:

$$\text{Impact} = H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2} \qquad (2.5)$$

where k is the number of nodes in the network and i is the node index.

2.3 Nudge Approach

In order to be able to find 'influential spreaders', we influence the Ising model by applying 'nudges' on nodes, increasing the probability that a node has a specific opinion. A node nudge can be described as an additional node that is added to the network and forms a link to the node being nudged. Such a node holds a certain weighted opinion and will not be influenced by the Ising model in any way. These weighted opinions have been made static since they constitute an external influence and are not part of the network.

Let K be the network size, N the number of nudged nodes, and C the number of ways these nudges can be distributed over the network:

$$C = \binom{K}{N} \qquad (2.6)$$

This number of possibilities increases steeply with each node added to the network, and thus with K. Since we cannot calculate all possible combinations C of nudged nodes on a network as big as the directors interlock network, we have to apply a more viable approach to find the most influential spreaders. In order to find a better approach, we use the Ising model on a small 10-node network (K = 10) as a test case, calculating the impacts of all the possible combinations C by which nodes can be nudged with a given number of 'nodes to nudge'.
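To make the combinatorial growth concrete: the counts per number of nudges on the 10-node test network follow directly from equation 2.6, matching the C column of figure 2.3 below:

```python
from math import comb

K = 10                        # test network size
for N in range(1, K):         # number of nudged nodes
    print(N, comb(K, N))      # 10, 45, 120, 210, 252, 210, 120, 45, 10
```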

These network simulations are done on Barabási–Albert and Erdős–Rényi networks at four different temperatures: 4, 6, 8 and 10. These different temperatures are chosen to verify that a possibly viable approach holds under different circumstances. In the Ising model simulations of the sensitivity analysis we use approximately the temperatures 4 and 8; the temperatures 6 and 10 are added for extra robustness of the results. A commonly used approach is the 'greedy approach', i.e. calculating the impact of every node in the network and nudging the node with the highest impact; this approach is discussed below.

2.3.1 Greedy Approach

In the greedy approach, the impact of every node is calculated and the node with the highest impact is nudged. This process continues for the available number of resources. We will investigate whether this approach results in a smaller impact than other possible nudge combinations. To this end, we compare the greedy approach result for each number of node nudges N with all possible combinations C in order to verify its feasibility.

First, we calculate the impact of the greedy approach for each number of node nudges N. The impact is calculated by taking the Hellinger distance between Q, the 'zero' distribution, and P, the 'nudged' distribution. After the impact has been calculated for all nodes, the node with the highest impact is nudged. This procedure (which aims to maximize the impact) is of crucial importance, as it creates a clear distinction between impact values in the process. These impact measurements will be used by the XGBoost model to generate impact prediction values. Therefore the impact needs to show clear distinctions, because over-representation of a given numeric value interval is known as data imbalance and leads to imbalanced learning [44].


Secondly, we calculate C for each number of node nudges. Finally, we want to know if there are dependencies between nodes that beat the greedy approach. To verify this, each number of node nudges N in the greedy approach is matched against the list of all combinations C, sorted by impact. If the greedy approach were the perfect option, each combination of nudged nodes chosen by the greedy approach would receive ranking number 1. In order to verify whether the greedy approach is viable in general, we run this simulation on 10-node Erdős–Rényi (ER) and Barabási–Albert (BA) graphs at temperatures T = 4, 6, 8 and 10.

For the greedy approach to be viable, the ranking numbers should be as low as possible. Note that the rank is relative to the number of possible combinations (C). The results are shown in the table below:

Ranking order, BA network          Ranking order, ER network
N   C    T4  T6  T8  T10           N   C    T4  T6  T8  T10
1   10    2   8   5   1            1   10    2   1   3   5
2   45    5  16   7   3            2   45    3   7   1  12
3   120   1  28   6   1            3   120  27  13   7  22
4   210   5  21  27   1            4   210  22   7   4   4
5   252  30  15   8   2            5   252  30  11   5   3
6   210  28   7   3   6            6   210  15   9  11   1
7   120   3   1   3   4            7   120   5   1   3   5
8   45    3   1   1   1            8   45    9   1   1   4
9   10    1   1   1   1            9   10    1   1   1   1

Figure 2.3: For every temperature T, on BA and ER networks, we rank how well the greedy approach performs relative to all possible options nudging the same number of nodes (out of C combinations, sorted by impact). A rank of 1 means that the greedy approach nudged the optimal combination of nodes in the network to maximize the influence of an opinion.

The impact per node nudge in such a 10-node network is very small, and tiny variations may be present in the calculation of the impact. This can result in a small inaccuracy of the best impact score. In order to minimize the calculated impact error, we run the Ising model ten thousand steps per node to ensure that the node nudge has influenced the whole network properly. The impact is then calculated by taking the average over ten simulations. These two optimization steps ensure that the impact measures are more than accurate enough to indicate whether an approach is viable.

When we look at the results in figure 2.3, we can see that the greedy approach is a viable option for larger graphs. It seemingly performs a little better at higher temperatures, where interdependencies are not solely dictating the change in opinions. This indicates that the greedy approach is better at predicting influential spreaders in the absence of strong interdependencies. However, the performance at lower temperatures is very acceptable, as the scores for these temperatures are also within the top, if not the best, among the impact scores calculated with the same number of node nudges. Hence this approach is viable and will therefore be used for the sensitivity analysis and the final simulation on the directors interlock network.
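A sketch of the greedy nudge loop under these assumptions; `impact_of` is a hypothetical callback that runs the equilibrated Ising model with the given set of permanent nudges and returns the Hellinger impact of equation 2.5:

```python
def greedy_nudges(nodes, budget, impact_of):
    """Greedily select `budget` nodes to nudge: each round, permanently
    nudge the node whose addition yields the highest impact."""
    nudged = []
    for _ in range(budget):
        candidates = [n for n in nodes if n not in nudged]
        best = max(candidates, key=lambda n: impact_of(nudged + [n]))
        nudged.append(best)
    return nudged
```

Each round costs one impact evaluation per remaining node, so the whole budget costs on the order of `budget * len(nodes)` simulations instead of the binomial number of equation 2.6.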

Before we continue with the sensitivity analysis, we discuss the 14 topological features we will be using in it. These features can all be extracted from any network structure, with no additional external calculation needed, to determine an estimate of the impact of a node. These features are: Pagerank, Eigenvalue Centrality, Degree, Betweenness Centrality, Hubs and Authorities, Coreness, Closeness Centrality, Clustering, Load Centrality, Average Shortest Path, Average Neighbor Degree, State Sum, Communication Centrality and Degree Centrality. We will discuss these features briefly one by one before explaining the model in which they play a role.

2.4 Topological Features

Pagerank

Pagerank is a ranking algorithm used by Google to rank web pages [4]. Pagerank creates a web page ranking by evaluating the rank of all the incoming links to that page. If a web page has many incoming links (hyperlinks pointing to the page), the page is ranked higher. Outgoing links have no effect and yield no indication of importance. This algorithm can also be applied to graphs. Pagerank is a good way to capture the relative importance of the nodes of a graph, and is also applicable with weighted edges. For example, a link from a big company's website increases your website's ranking more than a link from a small website. The weight of a link depends on the Pagerank of the linking node:

$$PR(A) = \frac{1 - d}{N} + d \left( \frac{PR(T_1)}{C(T_1)} + \frac{PR(T_2)}{C(T_2)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right) \qquad (2.7)$$

where $T_1, \dots, T_n$ are the pages linking to page A, $C(T_i)$ is the number of outgoing links of page $T_i$, d is a damping factor, and N the number of pages.

This can also be expressed as:

$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)} \qquad (2.8)$$

where v is one of the pages in the set $B_u$, which contains all pages linking to page u, and L(v) is the number of outgoing links from page v.

Eigenvalue Centrality

The eigenvector centrality is a measure of the influence of a node in a network. It is based on the philosophy that connections to high-scoring nodes contribute more than connections to low-scoring nodes. For node v, the centrality score is proportional to the sum of the scores of all nodes connected to it:

$$x_v = \frac{1}{\lambda} \sum_{j \in M(v)} x_j = \frac{1}{\lambda} \sum_{j=1}^{N} a_{v,j} x_j \qquad (2.9)$$

where M(v) is the set of nodes connected to node v, N is the total number of nodes and λ is a constant. In vector notation this reads:

$$x = \frac{1}{\lambda} A x \quad \Longleftrightarrow \quad A x = \lambda x \qquad (2.10)$$

Degree

An important property of nodes in a graph is their degree. The degree of a node is the number of edges adjacent to it. In a digraph we usually speak of in- and out-degree: the in-degree is the number of edges pointing towards the node, and the out-degree the number of edges pointing outwards to other nodes. In an undirected graph, the sum over all degrees equals twice the number of edges:

$$\sum_i d_i = 2|E| \qquad (2.11)$$

The degree distribution is the distribution of the probabilities of the degrees over the whole network, and can vary largely depending on the graph. Erdős–Rényi networks follow a binomial degree distribution, while scale-free networks such as Barabási–Albert networks follow a power-law distribution.

Betweenness Centrality

Betweenness centrality is a node measurement in a network which quantifies the number of times the node lies on a shortest path between other nodes. It was introduced by Linton Freeman as a measure of communication between humans in a social network:

$$C_b(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}} \qquad (2.12)$$

where $\sigma_{st}$ is the total number of shortest paths between nodes s and t, and $\sigma_{st}(v)$ is the number of those paths that pass through v.

Hubs and Authorities

These terms were originally meant to differentiate between different kinds of web pages. Authorities are pages that are sources of information, news pages for example. Hubs can be described as link pages that contain many links to authorities. They are mutually dependent: authorities have a high score if many hubs link to them, and a hub has a high score if it links to many good authorities. This relation can be defined as follows:

$$\text{auth}(p) = \sum_{i=1}^{n} \text{hub}(i), \qquad \text{hub}(p) = \sum_{i=1}^{n} \text{auth}(i) \qquad (2.13)$$

where the first sum runs over the pages i linking to p and the second over the pages i that p links to. Initially auth(p) = hub(p) = 1 for all nodes. The scores are determined after a high number of iterations, normalizing after each iteration. Eventually, the system converges to a steady state and the hub and authority scores are obtained.

Average Neighbor Degree

The average neighbor degree of a node i is

$$k_{nn,i} = \frac{1}{|N(i)|} \sum_{j \in N(i)} k_j \qquad (2.14)$$

where N(i) is the set of neighbors of node i and $k_j$ is the degree of node j belonging to N(i).

Clustering

The clustering coefficient is based upon the 'friends of my friends are my friends' principle. It is the ratio of the number of existing links among a node's neighbors to the maximum possible number of such links:

$$C_i = \frac{2 e_i}{k_i (k_i - 1)} \qquad (2.15)$$

where $e_i$ is the number of connections between the neighbors of node i and $k_i$ is the number of neighbors of node i.

Coreness

K-core is a clustering algorithm designed to study social networks. The coreness of a node is based on the maximal subgraph of G in which all nodes have a degree of at least k, called a k-core; this subgraph is created by iteratively deleting all vertices with degree lower than k. If a node belongs to the k-core for k = x but not to the (x + 1)-core, the node has a coreness of x.

Closeness Centrality

Closeness centrality is a centrality measure in a network, calculated by normalizing the sum of all shortest paths from a node to all other nodes. It can be formalized as

$$C(x) = \frac{N - 1}{\sum_y d(y, x)} \qquad (2.16)$$

with d(y, x) the distance between nodes x and y, and N the number of nodes. Higher values of C(x) indicate higher centrality of the node.

Load Centrality

Load centrality is a centrality measure based on a hypothetical flow process through the network, often used in social network analysis. Each node sends a package containing one unit to every other node, without capacity constraints. The routing prioritizes minimum distance between nodes, and each node divides its units equally between its neighbors. The total normalized flow of units through a node is its load.

Average Shortest Path

The average shortest path of a node x is the sum of all shortest paths from node x to all other nodes, divided by the number of nodes:

$$P(x) = \frac{\sum_y d(y, x)}{N} \qquad (2.17)$$

with d(y, x) the shortest path length from node y to node x, and P(x) the average shortest path for node x. This is closely related to closeness centrality, which normalizes N − 1 by the sum of all path lengths instead.

State Sum

The state sum is the number of nudges that a node has received in the Ising network. It is calculated by taking the sum over the 'node nudge chain', which holds an integer value recording where (and how many times) the nodes with the highest spreading capabilities have been nudged:

$$\text{State Sum} = \sum_i \sigma_i \qquad (2.18)$$

with $\sigma_i$ the integer value that corresponds to how many times node i has been nudged.


Degree Centrality

The degree centrality for undirected graphs is the normalized degree:

$$DC(x) = \frac{\deg(x)}{n - 1} \qquad (2.19)$$

where the normalization divides by the maximum possible degree in a simple graph, n − 1, with n the number of nodes in G.

Communication Centrality

Communication centrality can be described as the sum of closed walks of every length from node x back to node x; it is also called subgraph centrality. It can be found using the spectral decomposition of the network's adjacency matrix:

$$CC(x) = \sum_{j=1}^{N} (v_j^x)^2 e^{\lambda_j} \qquad (2.20)$$

where $v_j^x$ is the x-th component of the j-th eigenvector of the adjacency matrix and $\lambda_j$ the corresponding eigenvalue.

All these features may play an important role in making a prediction model for the impact of node nudges, or more specifically, in predicting what impact influencing every director has on the entire network. In the prediction model, the feature values of each node are linked to the corresponding impact value. For example, if node x is nudged, the feature values that correspond to node x are used. If multiple nodes are nudged, the impact is associated with the sum of their feature values divided by the number of node nudges:

$$\text{FeatureValue}_k = \frac{\sum_{i=1}^{N} \sigma_k(i)}{N} \qquad (2.21)$$

where $\text{FeatureValue}_k$ is the average value of the pertinent feature k over a certain set of nudged nodes, $\sigma_k(i)$ is the value of feature k at the i-th nudged node, and N the number of node nudges. This is done for every feature used in the model and for each combination of nudged nodes (node sequence) used in the greedy approach. Finally we obtain an average value for all features corresponding with a certain impact value for that particular node sequence. By taking these node-dependent feature values as data and the impact as the target variable, we can extract information on the features that correspond with this target by running XGBoost. We will first explain this model approach before going into further details.
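A sketch of assembling this training data, assuming `feature_values[f][node]` comes from the extraction step above and `greedy_results` is a hypothetical list of (nudged node set, impact) pairs from the greedy runs:

```python
import pandas as pd

def build_training_set(feature_values, greedy_results):
    """Link averaged feature values of each nudged node set (eq. 2.21)
    to the measured impact, yielding X, y for XGBoost."""
    rows = []
    for nudged_nodes, impact in greedy_results:
        row = {f: sum(vals[n] for n in nudged_nodes) / len(nudged_nodes)
               for f, vals in feature_values.items()}
        row["impact"] = impact
        rows.append(row)
    df = pd.DataFrame(rows)
    return df.drop(columns="impact"), df["impact"]
```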


2.5 XGBoost

Extreme Gradient Boosting, also known as XGBoost, is a renowned machine learning method for making data predictions. The term gradient boosting was first used in the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman [21]. XGBoost is a highly effective and widely used machine learning method on Kaggle, a predictive modeling and analytics competition platform [13, 12]. It is a state-of-the-art machine learning algorithm applicable to many types of problems. It belongs to the four most widely used tree-based machine learning algorithms: the gradient boosting machine (GBM), LightGBM, extreme gradient boosting (XGBoost), and adaptive boosting (AdaBoost). Recent literature has compared all of these algorithms and has shown that XGBoost repeatedly outperforms the other models in several areas [36].

2.5.1 Model Explanation

XGBoost is an ensemble method: it uses a collection of smaller trees, based on the gradient boosting principle, and combines the prediction scores of the trees (100 in our case) to form the final model. Formally we can write this ensemble model in the form:

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F} \qquad (2.22)$$

where $\mathcal{F} = \{ f(x) = w_{q(x)} \}$ with $q : \mathbb{R}^m \to T$ and $w \in \mathbb{R}^T$ is the space of regression trees [12]. The tree structure is represented by q (for example just a root node, or a complicated high-level tree structure); the m-dimensional feature vector $x_i$ is mapped by $f_k$ to one of the T leaves of the tree q, with leaf weights w. The objective of the model consists of two terms. The first, the loss function l, measures the quality of the prediction, here expressed as the mean squared error. The second term is the regularization function $\Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2$, which measures the complexity of the model and serves to avoid overfitting. Hence, to learn the tree ensemble, we optimize the following objective function:

$$\text{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^{t} \Omega(f_i) \qquad (2.23)$$

XGBoost uses an additive training approach, adding one tree at a time based on the previous results:

$$
\begin{aligned}
\hat{y}_i^{(0)} &= 0 \\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i) \\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i) \\
&\;\;\vdots \\
\hat{y}_i^{(t)} &= \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)
\end{aligned} \qquad (2.24)
$$

Let $\hat{y}_i^{(t)}$ be the prediction of the i-th instance at the t-th iteration. We want to add the tree that optimizes our objective. We can derive a score to measure the quality of a tree structure q by rewriting the objective function:

$$\text{Obj}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T \qquad (2.25)$$

Here, $I_j$ is the set containing the indices of the instances in the j-th leaf node, and $g_i$ and $h_i$ are the first and second order terms of the Taylor approximation of the loss, respectively. By substituting the sums of the first and second order Taylor terms, we get the compressed expression:

$$G_j = \sum_{i \in I_j} g_i \qquad (2.26a)$$
$$H_j = \sum_{i \in I_j} h_i \qquad (2.26b)$$
$$\text{Obj}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T \qquad (2.26c)$$

Finally we calculate the optimal weight $w_j^*$ and plug this into the objective function. We can now obtain a score for the quality of the tree structure q by calculating the objective $\text{Obj}^*$:

$$w_j^* = -\frac{G_j}{H_j + \lambda} \qquad (2.27)$$
$$\text{Obj}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \qquad (2.28)$$

The smaller the score, the better the tree structure, hence we have a way to assess the quality of a generated tree. Ideally one would enumerate all possible trees, but this is unmanageable in practice, so the optimization is done one level at a time. A greedy tree-growing algorithm performs a brute-force search over all possible split candidates (the features we use) and finds the best split to add by calculating the corresponding gain from the scores of the children and of the unsplit node:

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \qquad (2.29)$$

Here $\frac{G_L^2}{H_L + \lambda}$ is the score of the left child, $\frac{G_R^2}{H_R + \lambda}$ the score of the right child, and $\frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$ the score if we do not split the leaf node; $G_L$ and $H_L$ are computed over the instances in the left leaf (and likewise $G_R$ and $H_R$ for the right leaf). If the gain is smaller than γ (the complexity cost of adding an additional leaf), the branch is not added. This process is also known as pruning [46]. The algorithm adds the split that gives the maximum calculated reduction and continues until the tree reaches the maximum depth.

2.5.2 Model Optimization

Schapire, Freund, Bartlett, and Lee [54] proposed a theoretical explanation of why XGBoost is not prone to overfitting. Generally speaking, XGBoost performs robustly on small-scale data sets [27]. However, that does not mean it is always a good idea to maximize the number of features put into the model. That would be the best solution only on very large, near-infinite data sets, where the number of samples in the training set gives good coverage of all variations [52]. When using a high number of dimensions, this must be compensated by corresponding data coverage.

In addition, noise on weak variables that ends up correlating by chance with the target variable can limit the effectiveness of boosting algorithms. This happens more easily on deeper splits in the decision tree, where the data being assessed have already been grouped into a small subset. The more variables you add, the more likely it is that you get weakly correlated variables that just happen to look good to the split selection algorithm for some specific combination, but then create trees that learn noise instead of the intended signal [52]. Features with a low contribution can negatively impact the predictive capabilities of the model. Hence it is advisable to disregard the features that do not contribute to the final model. In order to determine which features are to be used in the model, we employ the Akaike information criterion (AIC). This metric deals with the trade-off between the goodness of the fit and the simplicity of the model, making it a good metric for model feature selection [1]:


$$\text{AIC} = n \cdot \ln\left( \frac{SSE}{n} \right) + 2p \qquad (2.31)$$

where SSE is the sum of squared residuals between the test set and the prediction, n the sample size and p the number of independent variables. By computing this metric for every possible model based on a combination of features, we can determine the features that do not contribute sufficiently. In addition, $R^2$ is used to determine the quality of the model.
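A sketch of the AIC computation of equation 2.31:

```python
import numpy as np

def aic(y_true, y_pred, n_features):
    """Akaike information criterion (eq. 2.31) from test-set residuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    sse = np.sum((y_true - y_pred) ** 2)   # sum of squared residuals
    return n * np.log(sse / n) + 2 * n_features
```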

$R^2$ measures the proportion of the variability in the target data explained by the regression model:

$$R^2 = 1 - \frac{SS_{\text{error}}}{SS_{\text{total}}} \qquad (2.32)$$

An $R^2$ of zero indicates that the model does not explain the data at all, as it accounts for none of the variability of the response data around its mean. An $R^2$ of one means that the model fits the data well and explains the variability of the response data around its mean perfectly. This gives a good indication of how well the model predicts the data obtained from the XGBoost model. In addition, we have also used the mean squared error, mean absolute error and explained variance score. However, except for the explained variance score, these metrics do not give an absolute measure of goodness of fit, which makes $R^2$ a suitable option to evaluate model performance. The other metrics are reported as well; they give a better general idea of the error characteristics of the model.

The final optimization we perform is the tuning of the XGBoost model's hyper-parameters to enhance the predictive power. This is done using a grid-search, which builds a model for every brute-force combination of the input parameters while taking cross-validation into consideration. The objective function optimized by the grid-search is the $R^2$. We perform a grid-search over the parameters 'max depth', 'learning rate', 'min child weight' and 'gamma'. Gamma is the constraint parameter used in calculating the gain: it specifies the minimum loss reduction required to make a split. The 'min child weight' is used to control overfitting; it sets the minimum sum of instance weights (hessian) in a child node. The 'learning rate' determines how fast the learner converges to the optimal solution: if the step size is too big, you may overshoot the optimal solution; if it is too small, training takes longer to converge. Finally, 'max depth' determines the maximum number of levels of a tree. In the results section we will discuss the ranges of hyper-parameters we apply.


2.5.3 Model Metric Calculation

All trees have score values in their leaves, the calculation of which is based on the series of decisions guarded by particular features. After the trees are constructed, we can predict a target value by combining the corresponding scores of all trees:

$$\text{Prediction}(x) = \frac{1}{J} \sum_{j=1}^{J} f_j(x) \qquad (2.33)$$

where J corresponds to the number of trees in the ensemble. Metrics to compute the importance of a feature are useful to explain the model, such as the feature contribution and the feature frequency. The feature frequency is the percentage representing the relative number of times a particular feature occurs in the trees of the model. This can be calculated by summing the number of splits in a tree where the feature occurs. Each node in the tree can only be split on one feature, hence the summation over all splits in all trees will give the total frequency of each feature.

The feature contribution is calculated by taking the prediction value in each leaf reached via a split on the feature and computing the difference in value with respect to its parent. If there are several splits on a certain feature on the path from the top of the tree down to the leaf, the differences are added together. We can derive the prediction based on the contributions as follows:

$$\text{Prediction}(x) = \frac{1}{J} \sum_{j=1}^{J} \text{Bias}_j + \sum_{k=1}^{K} \left( \frac{1}{J} \sum_{j=1}^{J} \text{AvgContribution}_{j,k} \right) \qquad (2.34)$$

with K the number of features, AvgContribution the average contribution of each feature, and the bias the mean given by the topmost region that covers the training set.
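Both quantities can be read directly from a trained booster in the xgboost Python package; a sketch, assuming a fitted `xgboost.XGBRegressor` is passed in:

```python
import xgboost as xgb

def frequency_and_contributions(model, X):
    """Feature frequency (number of splits per feature across all trees)
    and per-sample feature contributions from a fitted XGBRegressor."""
    booster = model.get_booster()
    frequency = booster.get_score(importance_type="weight")
    # pred_contribs=True returns one contribution per feature per sample,
    # plus a final bias column (cf. eq. 2.34).
    contributions = booster.predict(xgb.DMatrix(X), pred_contribs=True)
    return frequency, contributions
```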

In order to make a feature-contribution distribution we iterate over every tree in the model for every feature value of all the nodes in the network. These contributions make up the distribution in the feature-contribution violin plot of the model. The feature values of each node in the network and their corresponding contributions are plotted as feature value contribution plots. These plots give a good indication of which feature values have a negative or positive influence on the prediction of the impact.

The contribution values can range from negative to positive, because the resulting prediction can go down after the split on a certain node value, thus resulting in a negative value. However, a negative value does not mean that the feature is not a good predictor for the model. In general, a higher absolute value of the feature contribution implies the corresponding feature is a more important predictor for the final prediction model.


The feature contribution and feature frequency may seem similar but they can have big differences in output values. The frequency metric does not always yield an accurate importance of the features. This is for example the case when there are one or two features with strong signals and a few features with weak signals. The XGBoost model will exploit the strong features in the first few trees and use the rest of the features to improve on the residuals. Therefore when plotting the frequency of the features in the trees, the strong features will not look as important as they actually are, as the frequencies are not weighted. In this case, the feature contribution will give a more accurate depiction of the important features in the model.

In the next section, we first show the impact results of the Ising network, followed by hyper-parameter tuning of the XGBoost model. These parameters are used in the XGBoost model to conduct a sensitivity analysis on Erdős–Rényi and Barabási–Albert networks with different entropies and densities. We extract the features with the highest importance for the XGBoost prediction model from more than half a million XGBoost models. These features are used to make a robust general model of the directors interlock network. Finally, we compare feature-value contribution plots to verify and determine recurring patterns.

2.6 Sensitivity Analysis

The sensitivity analysis enables us to determine generally robust features. These important features are extracted from half a million models made on different types of networks with different temperatures, densities and nudge approaches. We differentiate between being able to nudge a node once (Single Nudge Approach) and being able to nudge a node multiple times (Multiple Nudge Approach). This indicates the way we are able to externally influence the directors. To enable nudging the same node multiple times, we weight the nudge applied to the node accordingly: if a node is nudged n times, the node nudge gets n times as much 'weight'.

Below, we illustrate this process with a simple example. We use the single nudge approach and the multiple nudge approach on a three-node network with two available resources for nudging. These resources can be divided in a variety of ways:

Single Nudge Approach     Multiple Nudge Approach
[1,0,0], [0,1,0]          [1,0,0], [2,0,0], [1,0,1]
[0,0,1], [1,0,1]          [0,1,0], [0,2,0], [1,1,0]
[1,1,0], [0,1,1]          [0,0,1], [0,0,2], [0,1,1]

We compare 16 different scenarios using 200-node Barabási–Albert and Erdős–Rényi networks. These scenarios can be visualized in a tree structure:


Erdős–Rényi
├── Single
│   ├── N-Density ── {N-Temperature, O-Temperature}
│   └── O-Density ── {N-Temperature, O-Temperature}
└── Multi
    ├── N-Density ── {N-Temperature, O-Temperature}
    └── O-Density ── {N-Temperature, O-Temperature}
Barabási–Albert
├── Single
│   ├── N-Density ── {N-Temperature, O-Temperature}
│   └── O-Density ── {N-Temperature, O-Temperature}
└── Multi
    ├── N-Density ── {N-Temperature, O-Temperature}
    └── O-Density ── {N-Temperature, O-Temperature}

Figure 2.4: Structure of the sensitivity analysis

'Multi' and 'Single' indicate how many times a single node can be nudged in the particular approach. N stands for 'Normal' and O for 'Other'; the 'normal temperature' is 7.8 and the 'other temperature' 3.9. These temperatures are calculated by matching entropies with temperatures in the directors interlock network, as described in section 2.2. The two densities, 6.6 and 10, correspond respectively to the density of the directors interlock network ('normal density') and a higher chosen density ('other density'). The density is calculated as:

$$\text{Density} = \frac{\text{Number of Edges}}{\text{Number of Nodes}} \qquad (2.35)$$

By counting the leaf nodes in figure 2.4, we can observe that we have 16 different scenarios for the sensitivity analysis. We apply the Ising model with the greedy approach on every scenario, with five node nudges to spend. From the formulas in the topological features section 2.4, one can deduce that the node properties, based on these 14 features, may vary greatly between nodes.

The data for every feature value is summed over the nudged nodes and divided by the number of times nodes were nudged. This averaging is needed to remove from the feature data the knowledge of how many times the network has been nudged:

$$\text{Averaged Nudged FeatureValue} = \frac{\sum_{i=1}^{j} \text{Nudged FeatureValue}_i}{j} \qquad (2.36)$$

Without averaging, these feature values would directly indicate j (the number of node nudges) that took place. This knowledge needs to be removed from the data before building the model, because in practice we do not have it beforehand. Moreover, without this step the feature value-contribution plots would give an inaccurate depiction of reality, as the sums would not correspond with the actual feature values in the network. After this process, the XGBoost model is based on topological features alone.

The 14 features below and their combinations are used to determine the best XGBoost model. The XGBoost model has fourteen input features: (1) Closeness Centrality (CCS), (2) Pagerank (PS), (3) Hubs (HS), (4) Average Neighbor Degree (ANDS), (5) Load Centrality (LCS), (6) Degree (DS), (7) Eigenvalue Centrality (EC), (8) Betweenness Centrality (BCS), (9) Clustering (CL), (10) Communication Centrality (CC), (11) Degree Centrality (DC), (12) Average Shortest Path (SP), (13) Authorities (AU), (14) Coreness (CN).

In order to determine the best model to use, we calculate all possible combinations of features for every network. As we used 14 features, we built more than 32 thousand models for each of the 16 different scenarios, resulting in more than half a million models. These are all sorted by Akaike value based on $R^2$. All variables in the top 100 of the 32000+ models of each scenario are summed and averaged by the number of networks used.

Finally, in order to make an accurate estimate of which features really contribute the most to the models, we multiply the feature frequency with the feature $R^2$. This $R^2$ value is extracted from an XGBoost model on the corresponding network using only that particular feature. The importance can be described as:

$$\text{Importance} = \text{Feature Frequency} \cdot \text{Feature } R^2 \qquad (2.37)$$

The frequency takes into account the number of times the feature is used in the top 100 models of the 16 scenarios, while the $R^2$ indicates how good the feature is as a predictor. If a feature contains different information than the other features in the XGBoost model and can boost the accuracy of the XGBoost model more than other features, it will be used regardless of its individual prediction value; this is reflected in the frequency. The importance is the product of the frequency and the $R^2$, making it possible for good complementary features to be important even though their individual prediction quality is not among the top predictors.


2.7 Hyperparameter Tuning

To optimize the XGBoost model for our purpose, we use a grid-search over the most important parameters that influence the prediction, optimizing the $R^2$. This procedure can increase the precision of the tree-based model. We have used a grid-search over the max depth, learning rate, minimum child weight and gamma, with wide ranges of parameters in order to optimize the model as well as possible:

Max Depth: [2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 18]
Learning Rate: [0.005, 0.01, 0.05, 0.1]
Min Child Weight: [0.1, 0.25, 0.5, 0.75, 1]
Gamma: [0, 0.05, 0.1]

The results from all possible permutations on the Barabási–Albert network, the random network and the directors network show that the best model has a max depth of 8, a learning rate of 0.05, a gamma of 0 and a min child weight of 0.1. We will therefore use these settings in the upcoming model computations.
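A sketch of this grid-search with scikit-learn and the xgboost Python package, using the grids listed above; the training data X, y are assumed to come from the feature/impact pipeline of section 2.4:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

PARAM_GRID = {
    "max_depth": [2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 18],
    "learning_rate": [0.005, 0.01, 0.05, 0.1],
    "min_child_weight": [0.1, 0.25, 0.5, 0.75, 1],
    "gamma": [0, 0.05, 0.1],
}

def tune(X, y):
    """Brute-force grid-search over the hyper-parameter grids above,
    with cross-validation and R^2 as the objective."""
    search = GridSearchCV(XGBRegressor(), PARAM_GRID, scoring="r2", cv=5)
    search.fit(X, y)
    return search.best_params_
```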


3 Results

3.1 Sensitivity Analysis

In this section we discuss the most important features extracted from the XGBoost model. First we extract the generally best models over all 16 different scenarios discussed in the method section 2.6. In order to do this, we average the metrics over all 32768 XGBoost models for every scenario. Then we sort the models by Akaike value (formula 2.31) and look at the top 10 performing models. These top 10 models, sorted in order of performance, are displayed in a list. For reference we first show the metrics of two models based on only one feature, hubs and degree:

Model   AIC          MSE         RMSE      MAE       EVS       R2
Hubs    -328.624204  146.972122  6.212230  4.616112  0.284119  0.281423
Degree  -477.836466  99.207477   5.105100  3.370793  0.521169  0.519535

Figure 3.1: The metrics of models based on hubs and degree only, for later reference. These models are averaged over all 16 scenarios discussed in the method section 2.6. Notable is that the model based on degree (R2 = 0.52) performs much better than the model based on just hubs (R2 = 0.28).

When we observe the metrics in figure 3.1, we see that the model based on the feature degree (R² = 0.52) performs considerably better than the model based on hubs (R² = 0.28).

We next briefly compare these results to the top 10 performing models below:


1. Pagerank, CCentrality, LCentrality, ANDegree, Shortestpath, Clustering, DCentrality, ComCentrality
2. Degree, Pagerank, Authorities, LCentrality, ANDegree, Coreness, Shortestpath, Clustering
3. Degree, Pagerank, Authorities, LCentrality, ANDegree, Shortestpath, Clustering, DCentrality
4. Hubs, LCentrality, ANDegree, Coreness, Shortestpath, Clustering, DCentrality, ComCentrality
5. CCentrality, ANDegree, BCentrality, Shortestpath, Clustering, DCentrality, ComCentrality
6. Degree, Pagerank, CCentrality, LCentrality, ANDegree, BCentrality, Shortestpath, Clustering
7. Degree, Authorities, CCentrality, ANDegree, BCentrality, Shortestpath, Clustering
8. Degree, Hubs, LCentrality, ANDegree, ECentrality, Shortestpath, Clustering, ComCentrality
9. Hubs, Authorities, LCentrality, ANDegree, ECentrality, Shortestpath, Clustering, DCentrality, ComCentrality
10. ANDegree, BCentrality, Coreness, Shortestpath, Clustering, DCentrality

Figure 3.2: Top 10 models, sorted by Akaike value, over all 16 scenarios discussed in section 2.6. Note that the features within these models are not yet sorted by importance; that analysis follows in section 3.1.1. The larger table is included in appendix figure C.1.

The corresponding metrics of the top 10 models of figure 3.2 are shown in figure 3.3 below:

Model  AIC          MSE        RMSE      MAE       EVS       R²
1      -951.413890  53.915616  3.723463  2.135102  0.835679  0.834961
2      -950.326411  54.744150  3.752239  2.174270  0.838031  0.837626
3      -948.340918  52.472674  3.675352  2.103277  0.838784  0.838348
4      -948.302404  55.730810  3.789494  2.211747  0.833343  0.832701
5      -947.917512  55.421562  3.775037  2.205432  0.838038  0.837683
6      -947.086834  52.726407  3.675327  2.104331  0.835237  0.834646
7      -946.469640  53.321644  3.712352  2.149842  0.835246  0.834813
8      -945.870662  53.325725  3.705917  2.155369  0.837792  0.837309
9      -945.775517  51.779394  3.655380  2.119747  0.839960  0.839519
10     -945.208249  54.482645  3.748460  2.186556  0.835311  0.834803

Figure 3.3: Metrics corresponding to the overall top 10 models, based on Akaike values, averaged over all 16 scenarios discussed in section 2.6.

Comparing the metrics of figure 3.1 and figure 3.3, we can conclude that all the top performing models are considerably better (R² of 0.83) than the models with the features hubs (R² of 0.28) and degree (R² of 0.52) alone. Noteworthy in figure 3.2 is that the top performing models all consist of multiple features, ranging from 6 to 9, and contain 6 frequently recurring features; the other features seem to support these more or less randomly across the top 10 models. The recurring features are: average neighbor degree, degree, clustering, shortest path, communication centrality and degree centrality. When we observe their corresponding metrics in figure 3.3, we can see that the R² (as well as the other error measures) does not show much variance, indicating that the R² does not depend strongly on the choice of supporting features. In order to make the intended model as robust as possible, we discard these supporting features that have little impact on the final prediction. As their supporting role is already minimal to begin with, it is likely that their performance is not stable enough to rely upon. Since measuring the stability of supporting features is not within the scope of our work, we focus on the minimal number of important features that predict the impact robustly.

3.1.1 Important Features

To extract the importance of each feature, we apply the importance formula 2.37 to the top 100 models of each of the 16 scenarios described in the tree structures of section 2.6 and average the results. This is then visualized in two histograms, each representing the average feature importance over the 8 possible scenarios for one type of network:


Figure 3.4: The averaged feature importance of all features used in the top 100 models from each of the 16 scenarios, visualized by network type. All features with values above the red line are considered important.


Observing the importance plots in figure 3.4, we can conclude that the importance of most features follows the same pattern across the two network types. The exception is coreness, which has a higher importance in models on Erdős–Rényi networks, possibly as a result of the different degree distributions of the two network types. The top five most dominant features for predicting the impact across all models are, in order of importance: (1) Clustering (CL), (2) Average Shortest Path (SP), (3) Average Neighbor Degree (ANDS), (4) Degree Centrality (DC), (5) Degree (DS).

These features correspond well with the most frequently recurring features in the top 10 overall best models in figure 3.2, indicating that these are indeed the most important predictive features. From figure 3.4 we conclude that the most predictive feature is clustering, followed by average shortest path. For convenience we have drawn a red line that acts as a classifier for selecting the most important features: the red line lies at half the rounded importance value of the top feature (see the sketch below). Since these results are derived from over half a million models on 16 different scenarios, we consider these important features to be robust predictors.
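
A minimal sketch of this selection rule, assuming a hypothetical dict `importance` mapping feature names to their averaged importance values:

```python
# Sketch: the red-line rule -- half the rounded importance of the top feature.
# importance is a hypothetical dict of feature name -> averaged importance.
threshold = round(max(importance.values())) / 2
selected = [f for f, v in importance.items() if v > threshold]
```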

3.1.2 Feature value contributions and Akaike procedure

We are interested in which feature values have the highest positive or negative impact on predicting influential spreaders. To visualize this, we make a scatter plot of the average influence of the feature values of each node in all 16 different scenarios:


Figure 3.5: Average feature value contribution plots from over half a million models on ER and BA networks. A positive contribution, as explained in section 2.5.3, is directly related to a positive impact prediction. Therefore the peaks (e.g. 11.2 for the ANDegree) indicate the optimal values of the most influential spreaders. Larger figures are given in Appendix C.

We can observe in figure 3.5 that degree and degree centrality show similar patterns. Based on these figures it is likely that the two features are highly correlated, as expected, since degree centrality is in fact the normalized degree (see the sketch below). Therefore, one could a priori assume that the XGBoost tree-based model performs equally well on both features; however, differences may still occur, as the model possibly performs better on features with a larger value range [44]. For that reason, we initially keep both degree features. An Akaike test (formula 2.30) is then needed to determine whether one of these features should be left out, and which one.
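
A quick sanity check of this correlation (a sketch; the graph `G` here is an arbitrary Barabási–Albert example, not the thesis data):

```python
# Sanity-check sketch: degree centrality is degree / (n - 1), so the two
# features are perfectly correlated.
import networkx as nx
import numpy as np

G = nx.barabasi_albert_graph(1000, 2)
degree = np.array([d for _, d in G.degree()])
dcentrality = np.array([nx.degree_centrality(G)[v] for v in G.nodes()])
print(np.corrcoef(degree, dcentrality)[0, 1])  # 1.0 up to floating-point error
```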


Degree  ANDegree  Shortestpath  Clustering  DCentrality  AIC          R²
0       1         1             1           1            -947.313466  0.831899
1       1         1             1           0            -932.288987  0.829691
1       1         1             1           1            -932.252447  0.830088
0       0         1             1           1            -916.646920  0.825791
1       0         1             1           0            -898.291167  0.813817
1       1         0             1           1            -892.739984  0.808559
1       1         0             1           0            -886.293053  0.814404
0       1         0             1           1            -876.999569  0.811495
0       0         0             1           1            -874.287869  0.808824
1       0         0             1           1            -870.723110  0.805156
1       0         0             1           0            -867.601079  0.798966
0       1         1             1           0            -866.107505  0.800338
0       0         1             1           0            -805.785487  0.769640
0       1         1             0           1            -766.245575  0.737478
1       1         1             0           0            -747.727281  0.726846
1       1         1             0           1            -726.317198  0.715619
1       0         1             0           0            -724.269244  0.717722

Figure 3.6: Average Akaike and R² values over the 16 different scenarios. The combination of average neighbor degree, average shortest path, clustering and degree centrality has the lowest AIC value, thus representing the best option.

Figure 3.6 is based upon the average Akaike score and R², over the 16 scenarios of figure 2.4, of all permutations of the 5 important features. For convenience the table only shows the top 50% of all permutations, sorted by Akaike value in ascending order (best first). As shown in figure 3.6, the combination of average neighbor degree, average shortest path, clustering and degree centrality has the lowest AIC value. This means that discarding the feature degree increases the robustness of the model.

We can observe in figure 3.6 that the top model based on 4 features has an R² of 0.832, which is only a 0.003 difference with the top model in figure 3.3 (R² of 0.835), based on over half a million models. This indicates that the performance of the model based on these four features alone is very robust. We will now look at the feature contribution plots of these four features in figure 3.5. The feature contribution directly relates to how the impact prediction changes with each individual feature. Each node of a tree has an output score, and the contribution of a feature on the decision path is how much the score changes from parent to child, as discussed in section 2.5.3. The feature contribution can vary widely per network, mainly due to network size. Still, the differences in contribution values between the features give a good indication of the most dominant ones; they are, however, not suitable for absolute measurements. Additionally, it is important to note that the feature contribution depends strongly on which feature is used for the split at the root of the tree in the XGBoost algorithm. That feature will automatically have the highest contribution to the prediction. If it is left out, another feature will take its place as root node and will contribute much more than if it were only used in splits at lower levels of the trees. Hence, the contribution differences between features are relative and not to be interpreted on an absolute scale.

More informative are the patterns that emerge around the zero contribution line. These patterns are the same for each individual feature, regardless of the tree level at which the feature is used in splits. They show which feature values are necessary to make positive and negative contributions to the impact prediction, as explained in section 2.5.3. Unlike the contribution values of the tree-based model, the feature values in these plots are independent of network size. We can see that the feature value patterns are very similar between the feature value contribution plots of the sensitivity analysis and those of the interlock network (albeit for a different range of contribution values). This shows that the contribution is scalable with respect to the feature values and exhibits uniform behaviour. We are interested in the feature values that yield the highest absolute contribution. Based on these values for every feature, we are able to determine the weakest and most influential spreaders on a wide variety of networks, without the need for complex models.
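
One illustrative way to obtain such per-node, per-feature contributions is XGBoost's built-in SHAP-style contribution output; this is a sketch and not necessarily identical to the decision-path procedure of section 2.5.3 (`model` and `X` are hypothetical):

```python
# Sketch: per-node, per-feature contributions from a trained XGBoost model,
# plotted against the feature values to reveal the zero-line patterns.
import matplotlib.pyplot as plt
import xgboost as xgb

booster = model.get_booster()
# Shape (n_nodes, n_features + 1); the last column is the bias term.
contribs = booster.predict(xgb.DMatrix(X), pred_contribs=True)

j = list(X.columns).index("Clustering")
plt.scatter(X["Clustering"], contribs[:, j], s=5)
plt.axhline(0, color="red")  # the zero contribution line
plt.xlabel("Clustering")
plt.ylabel("Contribution to predicted impact")
plt.show()
```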

As is shown in figure 3.5, clustering has a peak around 0.03 with a positive contribution; ±0.01 around this peak the contribution becomes negative. Average neighbor degree has a positive contribution around 11, which persists for ±1 around the peak in a small bell curve. The average shortest path has a positive contribution from approximately 2.8 onward and seems to keep increasing as the value increases. Finally, degree centrality has a negative contribution below 0.03; after the peak it remains positive and steady.

Based on these results, we conclude that the most influential spreaders have a clustering of 0.03 ± 0.01, an average neighbor degree of 11 ± 1, an average shortest path of 2.8 or higher, a degree centrality of approximately 0.03 and a degree of approximately 4.

3.2 Directors Network Model

3.2.1 Ising Model Results

The nodes with the highest impact are the super-spreader nodes in the directors interlock network. The model uses the multiple-nudge approach, meaning that it can nudge a node multiple times, which provides more options than the single-nudge approach.

We extracted the impact of the nudged nodes for each of the directors in the interlock network. This includes the weakest and most influential spreaders.

Figure 3.7: The average impact for each number of nudges on all nodes in the directors interlock network in blue, with their respective 95% confidence intervals. The impacts of the most influential spreaders are given in green. These impacts are calculated using the greedy nudge approach.

The bar values are the mean values of the nudge impact and the error bars represent the 95% confidence interval. This gives a good indication of how the impact is distributed with each added node nudge in the network. We can observe in figure 3.7 that the impact increases as the number of nudges increases, following an approximately linear trend. This trend indicates that the impacts of the nudged nodes are independent of each other. The trend is visible for the average impact as well as for the most influential spreaders.
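
A minimal sketch of how such bar values and error bars could be computed, assuming a hypothetical dict `impacts_by_k` mapping the number of nudges to a list of measured impact samples:

```python
# Sketch: bar height (mean impact) and 95% confidence interval per number
# of nudges.
import numpy as np
from scipy import stats


def mean_and_ci95(samples):
    samples = np.asarray(samples, dtype=float)
    half_width = stats.sem(samples) * stats.t.ppf(0.975, len(samples) - 1)
    return samples.mean(), half_width  # bar value, error-bar half width


summary = {k: mean_and_ci95(v) for k, v in impacts_by_k.items()}
```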

When we compare the values in figure 3.7, we observe that the most influential spreaders have a significantly higher impact value than the average values. In both scenarios, the impact of nudging does not show any sign of converging. A possible explanation is that the weight of the node nudges is not high enough, resulting in a relatively low impact compared to the network size. In that case it takes more node nudges before the impact reaches a stage of saturation and converges.

Our hypothesis is that the linear trend will level off and converge, at a certain point, to a constant value as the number of node nudges increases. The nodes in the network become saturated by other nudge influences, making it hard for them to be influenced by new node nudges. In order to estimate this saturation, we ran a simulation measuring the impact of nudging an additional node within a 6-hop radius of permanently nudged nodes. A hop is defined as the shortest path length between a permanently nudged node and another node. Because there are several nudged nodes, we give priority to the smallest number of hops to a permanently nudged node and label each node as such. All impacts of the nudged nodes are added to their respective labels and averaged. The results can be seen in figure 3.8 below:

Figure 3.8: The average impact of nudged nodes in the directors interlock network within a radius of 1–6 hops from permanently nudged nodes. Error bars are left out for better focus on the course of the graph. Each node is labeled by the smallest number of hops to a permanently nudged node.
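
A minimal sketch of the hop-labeling procedure described above, assuming hypothetical inputs `G` (the interlock graph), `nudged_nodes` (the permanently nudged set) and `impact` (a dict of measured nudge impacts per node):

```python
# Sketch: label every node by its minimum hop distance to any permanently
# nudged node, then average the measured nudge impacts per hop label.
from collections import defaultdict

import networkx as nx


def average_impact_by_hops(G, nudged_nodes, impact, max_hops=6):
    # Multi-source shortest paths give the smallest hop count per node
    # (unweighted edges count as 1 hop each).
    hops = nx.multi_source_dijkstra_path_length(G, nudged_nodes)
    buckets = defaultdict(list)
    for node, h in hops.items():
        if 1 <= h <= max_hops:
            buckets[h].append(impact[node])
    return {h: sum(v) / len(v) for h, v in sorted(buckets.items())}
```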

Figure 3.8 indicates that nodes 2–3 hops away from nudged nodes are influenced by the permanently nudged nodes, making it harder for them to have a high impact on the network. From 4 hops onward the impact becomes approximately steady; these nodes are thus more suitable candidates to influence in order to maximize the impact in a network. These results are also confirmed by the distances between the most influential spreaders calculated in the directors interlock network.
