Cover Page The handle http://hdl.handle.net/1887/49012 holds various files of this Leiden University dissertation. Author: Gao, F. Title: Bayes and networks Issue Date: 2017-05-23


Part III

MODELING THE DYNAMICS OF THE MOVIE-ACTOR NETWORK


6 MODELING AND INFERENCE OF NETWORKS OF THE INTERNET MOVIE DATABASE WITH PREFERENTIAL ATTACHMENT MODEL

6.1 introduction

A network is a structure containing vertices and edges. A complex network, unlike a simple network, is a graph with non-trivial topological features. Empirical studies on real networks, such as the Internet, the World-Wide Web, social and sexual networks, and networks describing protein interactions, show fascinating similarities: for example, many of these networks are both small-world ([78]) and scale-free ([6]). See Section 2.1.3 for more about the typical properties.

A collaboration network is a network in which the vertices are people and an edge between two vertices stands for a collaboration relation, that is, whether the two have worked together in some way. A prime example is the scientific collaboration network (see [57–60]), where scientists are the vertices and scientific acquaintances, typically whether they have co-authored a paper, are the edges. In this chapter we consider another form of collaboration network, the movie-actor network: the actors and actresses are vertices and the acquaintance relations, whether they have acted in the same movie, are edges. It has been studied by several groups, the earliest being Albert-László Barabási and his group in [5, 12], followed by [1, 66, 67]. [4, 52] study the movie-actor network from a data point of view, proposing computing algorithms and performing visualizations. The most important reference is [11], which investigates the scale-free property of the movie-actor network and proposes a possible model to explain the scale-freeness.

It has been discovered that surprisingly many networks are scale-free. In practical terms, this means that their degree distributions can be modeled by power laws: the frequency 𝑝_𝑘 of vertices of degree 𝑘 is of the form

𝑝_𝑘 ≈ 𝐶𝑘^{−𝜏} (6.1)

for 𝑘 sufficiently large, with 𝐶 and 𝜏 both positive constants. Power laws have a long history, but their presence in the degree distributions of networks was rediscovered by Barabási, Albert, and Jeong in [12, 13], which popularized the term scale-freeness. Power laws have since been observed ubiquitously ([19]), in social ([2]), sexual ([50]), biological ([51]), informational ([26, 39]), brain ([25]) and epidemic networks ([63]). [35] considered the world-wide airports as a network and discovered a power law therein. Barabási [9] gave a nice overview of scale-free networks in Science ten years after the conception of the term scale-free. For the power-law relation (6.1), taking the logarithm on both sides gives log 𝑝_𝑘 = log 𝐶 − 𝜏 log 𝑘, i.e. log 𝑝_𝑘 decreases affinely in log 𝑘 with slope −𝜏. Consequently, whether a degree sequence follows a power law is often judged from a loglog-plot of degree frequency versus degree. The power-law exponent 𝜏 is obtained by performing a linear regression of log 𝑝_𝑘 against log 𝑘.
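The regression of log 𝑝_𝑘 against log 𝑘 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `fit_power_law_exponent`, the optional cut-off `k_min`, and the synthetic frequencies are hypothetical and are not taken from the scripts used in this chapter.

```python
import math

def fit_power_law_exponent(freq, k_min=1):
    """Ordinary least squares of log p_k on log k.

    freq maps degree k -> frequency p_k; only k >= k_min with p_k > 0 are used.
    Returns (tau, log_C), where -tau is the fitted slope and log_C the intercept.
    """
    pts = [(math.log(k), math.log(p)) for k, p in freq.items()
           if k >= k_min and p > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, intercept

# Synthetic check (hypothetical data): an exact power law p_k = 0.5 * k^(-2.3).
freq = {k: 0.5 * k ** -2.3 for k in range(1, 200)}
tau_hat, log_c_hat = fit_power_law_exponent(freq)
```

On exact power-law input the fit recovers 𝜏 and 𝐶 up to rounding error; on real degree data the result depends strongly on which range of 𝑘 enters the regression.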

A possible model to explain the scale-free property is the preferential attachment model (pam). In the pam, a new vertex connects to an existing vertex of degree 𝑙 with probability proportional to 𝑙 plus a parameter 𝛿,

$$p(l) = \frac{l + \delta}{\sum_j (d_j + \delta)},$$

where 𝑑_𝑗 is the degree of vertex 𝑗 and the parameter 𝛿 ≥ 0 gives the model more flexibility. In [11] Barabási proposed the first version of the model, where 𝛿 = 0 (see Section 2.2.1 for more history). The model has been studied extensively since, and [38] and the references therein give good reviews of its development. The pam was an attempt to define a mathematical model characterizing the so-called Matthew effect, where the rich get richer (see Section 2.1.3 on scale-freeness for more on the connections).
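As a small worked example of the attachment rule, the following sketch normalizes the weights 𝑑_𝑗 + 𝛿 into probabilities and draws one vertex accordingly. The function names and the toy degree list are assumptions made for illustration only.

```python
import random

def attachment_probabilities(degrees, delta=0.0):
    """Probability of attaching to each vertex: (d_j + delta) / sum_i (d_i + delta)."""
    total = sum(d + delta for d in degrees)
    return [(d + delta) / total for d in degrees]

def choose_vertex(degrees, delta, rng):
    """Draw the index of one existing vertex with the preferential attachment weights."""
    weights = [d + delta for d in degrees]
    return rng.choices(range(len(degrees)), weights=weights)[0]

# Toy example (hypothetical degrees): three vertices with degrees 1, 2, 3 and delta = 1.
probs = attachment_probabilities([1, 2, 3], delta=1.0)   # 2/9, 3/9, 4/9
picked = choose_vertex([1, 2, 3], 1.0, random.Random(0))
```

With 𝛿 = 0 the probabilities are exactly proportional to the degrees; a larger 𝛿 flattens the preference towards a uniform choice.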

The well-known Internet Movie Database (IMDb) provides a concrete platform for investigating the movie-actor network. The IMDb collects all information relating to movies (and, as a result, also actors) and provides the data as plain text files that are accessible to the general scientific community. It is worth noting that the results in this chapter are reproducible by everyone because of the public availability of the dataset. The files of the IMDb, downloaded from the website of the IMDb or its ftp mirrors, are plain text files with each line representing one actor, one movie, or the appearance of one actor in one movie. We wrote Python scripts to parse the lines and store the extracted information in a well-structured database.

To model the IMDb, the fundamental line of thinking is to consider everything as a network of vertices, in which things happen in a somewhat random manner. An essential feature of the IMDb dataset, as we will see later, is that its corresponding actor network has a power-law-like degree distribution. Hence we seek a random graph model that describes the dynamics of collaboration graphs, is simple (with few parameters) and is sufficiently flexible. The focal point of our study is to come up with such a model to explain the degree distribution of the movie-actor network, which we consider its most important topological structure.

Barabási was the first to point out in [12] that the degree distribution of the movie-actor network follows a power law with exponent 2.3 ± 0.1. We propose a dynamic network model modified from the pam, which to a reasonable extent explains the evolution of the movie-actor network and gives a possible explanation of why the degree distribution of the current movie-actor network looks the way it does.

The chapter is organized as follows. In Section 6.2 we present the novel conceptual model. We fit the model to the IMDb dataset in Section 6.3. In Section 6.4, simulation of the model and comparisons between the simulation and the real dataset validate the goodness-of-fit of our model to a satisfactory extent. This conclusion is further consolidated in Section 6.5, where we state, with proofs in Appendix B, that the model indeed leads to a power-law degree distribution. The theoretical investigations, though seemingly a digression, help us to estimate the parameters in the model.

6.2 conceptual model description

In this section we describe a model in a skeleton-like fashion, without giving the exact specification, and this is hereafter referred to as the conceptual model.

We let 𝑡 denote the number of movies in the movie network; it is an index measuring the growth of the network, not real calendar time. We consider a comprehensive movie network (𝒢_𝑡)_{𝑡≥0}, where 𝒢_𝑡 = (ℳ_𝑡, 𝒜_𝑡, ℰ_𝑡^ℳ, ℰ_𝑡^𝒜), with two layers and two edge sets:

1. The layer of movies ℳ_𝑡, with vertices (𝑚_𝛼)_{𝛼∈{1,2,…,𝑡}} as movies.

2. The layer of actors 𝒜_𝑡, with vertices (𝑎_𝛽)_{𝛽∈{1,2,…,𝜙_𝑡}} as actors, where 𝜙_𝑡 = |𝒜_𝑡| is the size of the actor network at time 𝑡.

3. The edge set ℰ_𝑡^ℳ, describing the relations between movies and actors.

4. The edge set ℰ_𝑡^𝒜, describing the relations among actors.

The edges in the layers are as follows:

• Movie 𝑚 and actor 𝑎 are linked if and only if actor 𝑎 has played a role in movie 𝑚, which we denote as 𝑎 ↔ 𝑚; this link defines an edge 𝑒_{𝑎↔𝑚} ∈ ℰ_𝑡^ℳ between 𝑎 and 𝑚.

• Two actors 𝑎₁ ∈ 𝒜_𝑡 and 𝑎₂ ∈ 𝒜_𝑡 (𝑎₁ ≠ 𝑎₂) are neighbors if and only if there exists a movie 𝑚 ∈ ℳ_𝑡 such that 𝑎₁ and 𝑎₂ are both linked to 𝑚. We write 𝑎₁ ↔ 𝑎₂ to denote that actors 𝑎₁ and 𝑎₂ are neighbors; this link defines an edge 𝑒_{𝑎₁↔𝑎₂} ∈ ℰ_𝑡^𝒜.


We define the appearance time 𝑇_𝑀(𝑚) of movie 𝑚 as the smallest 𝑡 such that 𝑚 ∈ ℳ_𝑡, and the movie size 𝑆(𝑚) as the number of actors playing in it,

$$S(m) = \sum_{a \in \mathcal{A}_{T_M(m)+1}} \mathbb{1}_{\{a \leftrightarrow m\}}.$$

We define the appearance time 𝑇_𝐴(𝑣) of actor 𝑣 as the smallest 𝑡 such that 𝑣 ∈ 𝒜_𝑡.

The actor degree 𝐷_𝑡^𝐴(𝑣) of actor 𝑣 ∈ 𝐴 at time 𝑡 is defined as

$$D_t^A(v) = \sum_{a \in \mathcal{A}_t,\, a \neq v} \mathbb{1}_{\{a \leftrightarrow v\}},$$

whereas the movie degree 𝐷_𝑡^𝑀(𝑣) of actor 𝑣 ∈ 𝐴 at time 𝑡 is defined as

$$D_t^M(v) = \sum_{m \in \mathcal{M}_t} \mathbb{1}_{\{v \leftrightarrow m\}}.$$

By convention, both degrees are defined as 0 when 𝑡 < 𝑇_𝐴(𝑣) (the actor has not yet appeared in the network).
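To make the two kinds of degree concrete, the following sketch computes both from a list of casts. The representation of a movie as a set of actor identifiers and the function name `degrees_from_casts` are assumptions for illustration; they do not describe the database layout used for the IMDb data.

```python
from collections import defaultdict

def degrees_from_casts(casts):
    """Compute movie degrees and actor degrees from a list of movie casts.

    casts: iterable of collections of actor ids, one per movie.
    Returns (movie_degree, actor_degree) as plain dicts keyed by actor id.
    """
    movie_degree = defaultdict(int)   # number of movies an actor played in
    neighbors = defaultdict(set)      # co-actors over all movies so far
    for cast in casts:
        cast = set(cast)
        for actor in cast:
            movie_degree[actor] += 1
            neighbors[actor].update(cast - {actor})
    actor_degree = {actor: len(neighbors[actor]) for actor in movie_degree}
    return dict(movie_degree), actor_degree

# Toy example (hypothetical data): two movies sharing actor "a".
movie_deg, actor_deg = degrees_from_casts([{"a", "b", "c"}, {"a", "d"}])
```

In the toy example actor "a" has movie degree 2 and actor degree 3, while "d" has movie degree 1 and actor degree 1: a single large cast raises the actor degree of every member by the cast size minus one, but raises each movie degree by exactly 1.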

The evolution starts from 𝑡 = 1 with one movie and several actors in 𝒢₁, and 𝒢_𝑡 is updated in the following way when a new movie 𝑚_{𝑡+1} is introduced:

1. The number of actors 𝜁_{𝑡+1} of the movie 𝑚_{𝑡+1} (𝜁_{𝑡+1} = 𝑆(𝑚_{𝑡+1})) follows a certain distribution 𝜁, independent of the past state 𝒢_𝑡.

2. The number of new actors 𝜉_{𝑡+1} of the movie 𝑚_{𝑡+1} (i.e., the number of new vertices introduced to the actor layer by movie 𝑚_{𝑡+1}) is independent of the past 𝒢_𝑡 and follows a distribution depending only on 𝜁_{𝑡+1}. Here we abuse the words new and old: an old actor is an actor who has been in the network before this movie, and a new actor is one who has not.

3. (pam) 𝜓_{𝑡+1} = 𝜁_{𝑡+1} − 𝜉_{𝑡+1} is the number of old actors, who must be chosen from the actor layer 𝒜_𝑡. The old actors are chosen from 𝒜_𝑡 independently, with a preference that depends on a certain quantity of the actors,

$$\mathbb{P}(a \leftrightarrow m_{t+1} \mid \mathcal{G}_t) \propto f(a), \tag{6.2}$$

for any 𝑎 ∈ 𝒜_𝑡, where 𝑓 is a function mapping certain attributes of actors to a preference.

4. Add 𝜉_{𝑡+1} new actors to the actor layer, and add the corresponding edges between 𝑚_{𝑡+1} and its actors as well as the edges among these actors within the actor layer.

After movie 𝑚_{𝑡+1}, 𝒢_𝑡 evolves into 𝒢_{𝑡+1} with |𝒜_{𝑡+1}| = 𝜙_{𝑡+1} = 𝜙_𝑡 + 𝜉_{𝑡+1} actors and |ℳ_{𝑡+1}| = 𝑡 + 1 movies.

It is worth pointing out that the actor layer of our double-layered network is a self-contained conventional collaboration network and actor degree is standard vertex degree in the collaboration network.
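One evolution step of the conceptual model can be sketched as follows. Several details are assumptions made only so that the sketch runs: actors are numbered 0, 1, 2, …, old actors are drawn one at a time without replacement so that a cast contains no duplicates, a movie that asks for more old actors than exist is filled up with new ones, and the samplers in the usage example are placeholders rather than the distributions fitted later in the chapter.

```python
import random

def add_movie(casts, movie_degree, sample_size, sample_new, preference, rng):
    """One step G_t -> G_{t+1}; mutates casts and movie_degree, returns the new cast."""
    zeta = sample_size(rng)                    # step 1: movie size
    xi = sample_new(zeta, rng)                 # step 2: number of new actors
    pool = list(movie_degree)                  # existing actors A_t
    psi = min(zeta - xi, len(pool))            # step 3: number of old actors
    weights = [preference(a, movie_degree) for a in pool]
    chosen = []
    for _ in range(psi):                       # weighted draws without replacement
        i = rng.choices(range(len(pool)), weights=weights)[0]
        chosen.append(pool.pop(i))
        weights.pop(i)
    xi = zeta - psi                            # assumption: fill up with new actors
    first_new = len(movie_degree)              # actor ids are 0, 1, 2, ...
    cast = set(chosen) | set(range(first_new, first_new + xi))
    for actor in cast:                         # step 4: update the network
        movie_degree[actor] = movie_degree.get(actor, 0) + 1
    casts.append(cast)
    return cast

# Usage with placeholder samplers (hypothetical choices, for illustration only).
rng = random.Random(0)
casts, movie_degree = [], {}
for _ in range(50):
    add_movie(
        casts, movie_degree,
        sample_size=lambda r: r.randint(2, 6),
        sample_new=lambda n, r: r.randint(0, n),
        preference=lambda a, deg: deg[a] + 0.1,
        rng=rng,
    )
```

Only movie degrees are tracked here; the actor layer can be recovered from `casts`, since two actors are neighbors exactly when they share a cast.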

6.3 empirical fitting of the model to the imdb dataset

We study the movie-actor network given by information extracted from the IMDb dataset, and in the process we specify the distributions of 𝜁 and 𝜉 and the preference function 𝑓. As the section goes on, we turn the conceptual model of Section 6.2 into a concrete one.

6.3.1 Movie sizes

First we study the movie size (the number of actors in one movie), that is, the distribution of 𝜁 in the conceptual model.

Figure 6.1. Histogram of movie sizes in 1947 with 2458 movies

We see from Figure 6.1, which contains all movie sizes in the year 1947, that there is a peak at around 5, after which the frequencies decrease with movie size. The distribution appears to have a heavier tail than a normal distribution. In Figure 6.2, we make a more revealing loglog plot of 𝑝_𝑘 versus 𝑘, with 𝑘 being the movie size and 𝑝_𝑘 the frequency of movie size 𝑘. The movie sizes seem to follow a power-law distribution, because the tail looks like a straight line on the loglog plots as the number of movies goes up.

Figure 6.2. loglog-Histogram of movie sizes from 1947 to 2007

We perform a regression on the log 𝑘-log 𝑝_𝑘 data, shown in Figure 6.3, where the sizes of all movies up to the year 2007 are taken into account. We see that a straight line fits rather well in the middle of the data range but fits poorly on the tail. We conclude that movie sizes follow a power law roughly, but not exactly.

For our purpose, we use the empirical distribution of all movie sizes as our 𝜁. It is interesting that the movie sizes roughly follow a power law, but explaining this is beyond our scope. Given that there are more than a million movies in the network, the empirical distribution should be close to the true distribution.

Figure 6.3. loglog-histogram of all movie sizes until the end of 2007


6.3.2 Number of new actors

We also need to find a good model for the generating mechanism of new actors, who are responsible for the expansion of the network. The question is: given that a new movie has 𝑛 actors, what is the distribution of the number of new actors?

We give a scatter plot of the ratios 𝑟 of the number of new actors to the movie size for all movies in the year 1971 in Figure 6.4. The ratio is far from constant, even when the movie size is large. In fact the ratio varies a lot, and the points spread over the curves 𝑥𝑦 = 1, 2, 3, 4, ⋯ if we view the horizontal axis as the 𝑥-axis and the vertical one as the 𝑦-axis. The latter observation is natural, since both the movie size and the number of new actors are integers. Therefore, given the movie size 𝜁 = 𝑛, 𝜉 does not follow a binomial distribution with a fixed success probability; rather, the ratio 𝜉/𝜁 is distributed all over [0, 1].

In light of the above observations, we propose a randomized binomial model. The idea is to draw the success probability, i.e. the expected ratio 𝜉/𝜁, at random in [0, 1] and then to perform a binomial trial. Given the size 𝑛 of the movie, the number of new actors follows a binomial distribution Bin(𝑛, 𝑈), where 𝑈 is a beta-distributed random variable, i.e. 𝑈 ∼ beta(𝑝, 𝑞), independent of everything else. We are interested in a good model producing numbers of new actors that look and feel like the real data. Assuming our model is true, we obtain the maximum likelihood estimates (mle) ̂𝑝 and ̂𝑞 for the proposed model by writing down the likelihood function and applying the Newton-Raphson method. We obtain ̂𝑝 ≈ 0.2615 and ̂𝑞 ≈ 2.097.
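Integrating 𝑈 out of Bin(𝑛, 𝑈) gives the beta-binomial distribution, so the likelihood can be written in closed form with log-Gamma functions. The sketch below shows a sampler and the log-likelihood only; the function names are hypothetical, and the maximization itself (Newton-Raphson in the text) is not reproduced here.

```python
import math
import random

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_logpmf(x, n, p, q):
    """log P(xi = x | zeta = n) when xi ~ Bin(n, U) and U ~ beta(p, q)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_choose + log_beta(x + p, n - x + q) - log_beta(p, q)

def log_likelihood(data, p, q):
    """data: iterable of (movie size n, number of new actors x) pairs."""
    return sum(beta_binomial_logpmf(x, n, p, q) for n, x in data)

def sample_new_actors(n, p, q, rng):
    """Draw U ~ beta(p, q), then count successes in n Bernoulli(U) trials."""
    u = rng.betavariate(p, q)
    return sum(rng.random() < u for _ in range(n))

# Usage with the estimates quoted in the text; the movie size 10 is arbitrary.
P_HAT, Q_HAT = 0.2615, 2.097
rng = random.Random(1)
data = [(10, sample_new_actors(10, P_HAT, Q_HAT, rng)) for _ in range(1000)]
ll = log_likelihood(data, P_HAT, Q_HAT)
```

Any numerical optimizer applied to `log_likelihood` as a function of (𝑝, 𝑞) would play the role of the Newton-Raphson step; which optimizer to use is a design choice left open here.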

We justify the model by comparing the number of new actors simulated from our proposed model with the real dataset. We simulated 27647 movies with the obtained mle parameters (Figure 6.5b) and compare the outcome with the real data from all the movies in the year 1994 (Figure 6.5a). We see that the patterns are similar and hence conclude that our model is a reasonable one.

6.3.3 Preferential attachment function

In this section we establish the preferential attachment function 𝑓 by studying the evolution of actor degrees and movie degrees. If we think of a conventional collaboration network and try to fit a preferential attachment model there, a natural candidate is a preference with affine dependence on the degrees in the network. Yet we have two kinds of degrees in our model, movie degrees and actor degrees, where the movie degree of an actor is the number of movies the actor took part in and the actor degree is the number of other actors with whom the actor collaborated. Convention suggests that we should have an affine preference on the actor degree: if an actor has actor degree 𝑙, then the probability of this actor being chosen for a new movie is proportional to 𝑙 + 𝛿, with 𝛿 acting as an extra flexibility parameter. However, we argue that for the IMDb dataset it is better to put the linear preference on the movie degree. We study the evolution of actor and movie degrees separately and explain the reasoning in the coming sections.

Figure 6.4. Plot of Ratio of new actors of movies in 1971

(a) Real Data in 1994 (b) Simulation of 27647 movies

Actor degree evolution

We have an actor network given by the IMDb where the actors are vertices and the collaboration relationships are edges. The network evolves with time as new movies are produced and new actors are introduced into the network, resulting in the dynamics of the network.

We study how the actor degrees evolve. We first present loglog-histograms of actor degrees, starting from the year 1947, in Figure 6.6. The four plots in Figure 6.6 are cumulative in nature: the actor network in 1947 is a subset of that in 1967, and so forth. Nonetheless, the distribution of actor degrees is remarkably stable over the 60-year period, as we see by comparing the four plots from different years. This suggests that the actor network is scale-free: the distribution of actor degrees appears to follow a power law. Noting that a power law is an asymptotic property, we perform a regression on the tail of the loglog-histogram by fitting a straight line to actor degrees greater than 50. We obtain Figure 6.7 and an estimate of the power-law exponent 𝜏 ≈ 2.0924 for the actor network.

In a network with 𝑁 nodes with degrees (𝑑_𝑖)_{𝑖=1}^𝑁, define the empirical degree distribution function ̂𝐹_𝑁(𝑥) as

$$\hat{F}_N(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{\{d_i \le x\}}.$$

Note that if 𝑝_𝑘 ∝ 𝑘^{−𝜏}, then log ∑_{𝑗≥𝑘} 𝑝_𝑗 ∼ −(𝜏 − 1) log 𝑘, so a log(1 − ̂𝐹_𝑁(𝑘))-versus-log 𝑘 plot gives a straight line with slope −(𝜏 − 1). To further investigate whether the actor degree distribution follows a power law, we make such a plot in Figure 6.8. We see that for 𝑘 not too large (on the left side of the plot) the curve looks like a straight line, but it bends downward afterwards. This might suggest that a power law is not exactly accurate for the actor degrees and that some modified power law fits better. This is further discussed in Section 6.6.

(a) 1947 with 108,012 actors (b) 1967 with 266,553 actors

Figure 6.7. Fitting a straight line on the loglog-histogram starting from 𝑘 = 50 on actor degrees

Figure 6.8. log(1 − ̂𝐹_𝑁(𝑘))-vs.-log 𝑘 plot of actor degrees in year 1947
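The points of a log(1 − ̂𝐹_𝑁(𝑘))-versus-log 𝑘 plot such as Figure 6.8 can be computed directly from a degree sequence. The sketch below is a plain-Python illustration with a hypothetical function name; it assumes all degrees are at least 1 and skips the largest observed degree, where 1 − ̂𝐹_𝑁 vanishes.

```python
import math
from collections import Counter

def log_ccdf_points(degrees):
    """Return (log k, log(1 - F_N(k))) for each observed k with 1 - F_N(k) > 0."""
    n = len(degrees)
    counts = Counter(degrees)
    points = []
    remaining = n
    for k in sorted(counts):
        remaining -= counts[k]          # number of degrees strictly greater than k
        if remaining > 0:
            points.append((math.log(k), math.log(remaining / n)))
    return points

# Toy example (hypothetical degrees): 1 - F_N(1) = 1/2 and 1 - F_N(2) = 1/4.
points = log_ccdf_points([1, 1, 2, 3])
```

Fitting a straight line to the left part of these points gives a slope estimate of −(𝜏 − 1), which is less noisy in the tail than the histogram-based regression because the complementary distribution function sums over all larger degrees.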

Movie degree evolution

We do the same for movie degrees as for actor degrees. Movie degrees and actor degrees are clearly closely related: an actor who participates in many movies is likely, though not certain, to be linked with many other actors, and vice versa. Movie degrees over a 60-year time span are presented in Figure 6.9. The conclusions on movie degrees agree with those on actor degrees: both follow power laws. The same regression fitting, given in Figure 6.10, provides an estimate of the power-law exponent 𝜏 ≈ 2.1654 for the movie degree.

A careful comparison of Figure 6.6 and Figure 6.9 reveals differences between movie degrees and actor degrees. The curves in the plots of Figure 6.9 are cleaner, with less noise, and look more like straight lines than the curves in Figure 6.6. This means that the movie degrees are more stable and more robust to random noise. This is to be expected: actors may participate in a movie of huge size and thus get an instant jump in their actor degrees, but their movie degrees are more resilient to such abnormalities because they only go up by 1. As the movie degree seems to be a more reliable indicator of how important an actor is, it makes sense to build the preferential attachment function on the movie degrees: the higher an actor’s movie degree, the more likely the actor is to appear in subsequent movies. In contrast, the actor degrees are more subject to sudden random disturbances. For example, if a movie of large size enters the network, all its actors become neighbors in the actor layer, hence many actors with high actor degrees suddenly join the network.

(a) Year 1947 (b) Year 1967

(c) Year 1987 (d) Year 2007

Figure 6.9. Movie degree evolution in 60 years from 1947 to 2007

Figure 6.10. Fitting a straight line on the loglog-histogram starting from 𝑘 = 40 on movie degrees

Figure 6.11. log(1 − ̂𝐹_𝑁(𝑘)) vs. log 𝑘 plot of movie degrees in year 1947

6.3.4 Preferential attachment function on movie degrees

We saw in the previous section that movie degrees are more stable and more robust against random noise, so we build the preferential attachment function upon the movie degree. Denoting the weight function by 𝑓(⋅) ∶ ℕ → ℝ⁺, we make (6.2) more concrete:

$$\mathbb{P}(a \leftrightarrow m_{t+1} \mid \mathcal{G}_t) \propto f(D_t^M(a)) \quad \text{for any } a \in \mathcal{A}_t, \tag{6.3}$$

where 𝐷_𝑡^𝑀(𝑎) is the movie degree of actor 𝑎 at time 𝑡. Because of the power law exhibited by the distribution of movie degrees, we choose 𝑓(𝑘) = 𝑘 + 𝛿, as this model leads to a power law. In several recent papers [8, 23, 43, 62], different types of preferential attachment functions were studied, in three categories: (i) linear preferential attachment, 𝑓(𝑘) ∼ 𝑘; (ii) sub-linear preferential attachment, 𝑓(𝑘) ∼ 𝑘^𝛾 with 0 ≤ 𝛾 < 1; (iii) super-linear preferential attachment, 𝑓(𝑘) ∼ 𝑘^𝛾 with 𝛾 > 1. In the super-linear case, Krapivsky and Redner [43] and Oliveira and Spencer [62] have shown that a winner-takes-all phenomenon arises, where a single dominant vertex (often referred to as a hub) links to almost every other vertex. In [23], it is proved that sub-linear preferential attachment results in an asymptotic degree distribution with a stretched exponential tail ([46]). We have pointed out that the distribution of movie degrees has thinner tails than a power law, but ignoring the tail, it gives a pretty good power law, which hints at an affine preferential attachment model on the movie degrees. To conclude, given what we know about both the actor degree and movie degree distributions, affine preferential attachment on the movie degrees is reasonable for our purpose. Further discussion is in Section 6.6.

6.3.5 Model fitting

For 𝜁, we use the empirical distribution of movie sizes obtained from the real dataset; there are sufficiently many movies in the network for the empirical distribution to be close to the true one. Conditioned on 𝜁 = 𝑛, we model the number of new actors by the randomized binomial experiment: 𝜉 ∼ Bin(𝑛, 𝑈), with 𝑈 ∼ beta(𝑝, 𝑞) independent of everything else. The preferential attachment function is defined in (6.3) with 𝑓(𝑘) = 𝑘 + 𝛿. We obtain the parameters (𝑝, 𝑞) by maximum likelihood estimation, and 𝛿 is estimated on the basis of the theoretical study in Section 6.5. Hereafter the conceptual model filled in with the above details is referred to as the IMDb-pam model.

6.4 simulations

In this section, we study the fit of the model by simulating the IMDb-pam model and comparing the evolution of the simulated network with that of the network from the real IMDb dataset.

Note (again) that in our model 𝑡 is not calendar time. We introduce a pseudo-year approach: for any particular year, we generate the same number of movies as in the real dataset. For example, there were 5624 movies in the year 1915, so we simulate 5624 movies, mark them as belonging to the year 1915, and compare the respective networks at the end of the year 1915. The simulations are computationally costly, as the number of actors blows up fast with the number of movies. Hence we only run the simulations until the year 1951, with about 11,000 movies in total, and compare the respective networks at 20-year intervals, in 1910, 1930 and 1950. The author uses the same programming toolsets for the simulation and its analysis as for the real dataset.
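The pseudo-year bookkeeping amounts to mapping the movie index 𝑡 to a year label and recording the value of 𝑡 at the end of each comparison year. The sketch below illustrates this; the function names and the yearly counts are made up for the example and are not IMDb figures.

```python
def pseudo_year_labels(movies_per_year):
    """Year label for each simulated movie t = 1, 2, ... (list index t - 1)."""
    labels = []
    for year in sorted(movies_per_year):
        labels.extend([year] * movies_per_year[year])
    return labels

def snapshot_times(movies_per_year, years):
    """Value of t at the end of each requested year, for comparing networks."""
    snapshots, total = {}, 0
    for year in sorted(movies_per_year):
        total += movies_per_year[year]
        if year in years:
            snapshots[year] = total
    return snapshots

# Hypothetical yearly counts, for illustration only.
counts = {1908: 3, 1909: 2, 1910: 4}
labels = pseudo_year_labels(counts)
snaps = snapshot_times(counts, {1909, 1910})
```

The simulated network after `snaps[year]` movies is then compared with the real network restricted to movies up to the end of that year.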

Movie degrees and actor degrees are two competing candidates upon which we could build the preferential attachment, as stated in Section 6.3. We are mostly interested in comparing these two quantities between the simulations and the real dataset. The comparisons are also of interest because both quantities have a (more or less) power-law distribution in the real dataset. We want to know whether our proposed IMDb-pam model recovers the power laws.

Figure 6.12. Histogram of actors’ movie degrees until year 1950: (a) Simulation, (b) Real Data

From Figure 6.12, we see that, although of a similar shape, the histograms differ both on the horizontal axis (the degree) and on the vertical axis (the frequency). However, a better way of revealing the topological structure of degree sequences is a loglog-plot of frequency versus degree. It discloses in particular whether the degrees follow a power-law distribution. We therefore move on to study the loglog histograms.

Figure 6.13. Movie degree comparisons between simulation and real dataset: (a) Simulation until 1910, (b) Real data until 1910, (c) Simulation until 1930, (d) Real data until 1930, (e) Simulation until 1950, (f) Real data until 1950

Figure 6.14. Actor degree comparisons between simulation and real dataset: (a) Simulation until 1910, (b) Real data until 1910, (c) Simulation until 1930, (d) Real data until 1930, (e) Simulation until 1950, (f) Real data until 1950

Figure 6.13 and Figure 6.14 compare the evolution of movie degrees and actor degrees in the form of loglog histogram plots over a time span of 50 years. We read from these plots that, for both actor degree and movie degree, the fit of the simulation to the real dataset is quite good, though imperfect. The simulation gives a power-law distribution with a similar exponent, but there is more variation for large degrees in the simulation than in the real data. The tail in the simulation is heavier, indicating that there are more actors with high degrees. The maximal degrees, both movie degree and actor degree, are one order of magnitude higher in the simulation than in the real dataset. Nonetheless, the simulation outcome is not entirely the same for the actor degree and the movie degree: the fit of the movie degree seems better than that of the actor degree, and the difference between the tail slopes in the simulation and in the real dataset is bigger for actor degrees.

The above observations, on the one hand, suggest that our model is basically a reasonable one. Indeed, the simulation captures most actors’ degrees quite accurately, if not perfectly. On the other hand, the simulation exposes a systematic error of our IMDb-pam model, because the tails of the simulated degree sequences have distinctly different characteristics. The IMDb-pam model puts more preference on actors with higher degrees than the real dataset does. It is perhaps reasonable to put some restriction on choosing actors with high degrees; after all, an actor cannot play in too many movies because of physical constraints. Possible remedies are mentioned in Section 6.6.

6.5 theoretical study

We present a theorem without proof. The proof is rather technical and is not the point here; for the completeness of the dissertation, however, it is given in Appendix B. It was one chapter of the author’s master’s thesis [29].

The theorem ensures that our movie-actor model gives the desired asymptotic power-law movie degree distribution. We introduce the notation 𝜇_𝜓 = 𝔼𝜓, 𝜇_𝜁 = 𝔼𝜁 and 𝜇_𝜉 = 𝔼𝜉 = 𝜇_𝜁 − 𝜇_𝜓. To simplify notation we define 𝜃 = 𝜇_𝜁 + 𝜇_𝜉𝛿 and 𝜃^∗ = 𝜃/𝜇_𝜓.

Theorem 6.1. Assume that the following hold:

1. There exist a constant 𝑎₀ > 0 and 𝑐_𝑁 such that

$$\mathbb{P}(\zeta > N) \le c_N N^{-(3+a_0)}. \tag{6.4}$$

In particular this implies that the distribution of the movie size 𝜁 has a finite second moment.

2. 𝛿 > −1.

3. 𝜃 > 1.

Then there exists a constant 𝛾 such that

$$\lim_{t \to \infty} \mathbb{P}\Big(\max_{k \ge 1} |p_k(t) - p_k| \ge t^{-\gamma}\Big) = 0,$$

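where (𝑝_𝑘)_{𝑘≥1} is defined as the solution to the recursive equation, for 𝑘 ≥ 1 and with 𝑝₀ = 0,

$$p_k = \frac{k - 1 + \delta}{\theta^*} p_{k-1} - \frac{k + \delta}{\theta^*} p_k + \mathbb{1}_{\{k=1\}}, \tag{6.5}$$

which is solved by

$$p_k = \theta^* \, \frac{\Gamma(1 + \delta + \theta^*)}{\Gamma(1 + \delta)} \, \frac{\Gamma(k + \delta)}{\Gamma(k + \delta + 1 + \theta^*)}.$$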

In particular, (𝑝_𝑘)_{𝑘≥1} follows a power law: 𝑝_𝑘 ≈ 𝑐(𝜃^∗, 𝛿) 𝑘^{−(1+𝜃^∗)} when 𝑘 is sufficiently large. This follows by applying Stirling’s formula to the Gamma functions, in the form

$$\frac{\Gamma(t + a)}{\Gamma(t)} = t^{a} (1 + O(1/t)). \tag{6.6}$$

The above theoretical result yields the asymptotic distribution of the actors’ movie degrees and gives us an approximate power law with exponent 𝜏 = 1 + (𝜇_𝜁 + 𝜇_𝜉𝛿)/𝜇_𝜓. On the one hand, this result agrees with the observation of movie degrees in Section 6.3.3. On the other hand, it gives us a way to estimate the parameter 𝛿 in the preferential attachment model. Recall from the description of power laws in Section 6.3.3 that the exponent 𝜏 can be obtained through a regression on log 𝑘-log 𝑝_𝑘 data, i.e. 𝜏 is the absolute value of the slope in the loglog-histogram plot of movie degrees. With the estimates of 𝜇_𝜁 and 𝜇_𝜓, solving

$$\tau = 1 + (\mu_\zeta + \mu_\xi \delta)/\mu_\psi \tag{6.7}$$

with 𝛿 as the only unknown gives an estimate of 𝛿. In Section 6.3.3 we obtained ̂𝜏 ≈ 2.1654, and we have the estimates ̂𝜇_𝜓 ≈ 8.7400, ̂𝜇_𝜁 ≈ 10.0723 and ̂𝜇_𝜉 = ̂𝜇_𝜁 − ̂𝜇_𝜓 ≈ 1.3323 from empirical data. Hence we have an estimate ̂𝛿 ≈ 0.0851, which we used in the simulations in Section 6.4.
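Both the closed-form solution of (6.5) and the estimate of 𝛿 from (6.7) are easy to check numerically. In the sketch below the function names are hypothetical; the Gamma ratios are evaluated through `math.lgamma` to avoid overflow, and (6.7) is rearranged to 𝛿 = ((𝜏 − 1)𝜇_𝜓 − 𝜇_𝜁)/𝜇_𝜉.

```python
import math

def limiting_pk(k, delta, theta_star):
    """Closed-form solution of recursion (6.5), computed via log-Gamma."""
    log_p = (math.log(theta_star)
             + math.lgamma(1 + delta + theta_star) - math.lgamma(1 + delta)
             + math.lgamma(k + delta) - math.lgamma(k + delta + 1 + theta_star))
    return math.exp(log_p)

def recursion_residual(k, delta, theta_star):
    """Left-hand side minus right-hand side of (6.5), with p_0 = 0."""
    p_k = limiting_pk(k, delta, theta_star)
    p_prev = limiting_pk(k - 1, delta, theta_star) if k > 1 else 0.0
    rhs = ((k - 1 + delta) / theta_star * p_prev
           - (k + delta) / theta_star * p_k
           + (1.0 if k == 1 else 0.0))
    return p_k - rhs

def solve_delta(tau, mu_zeta, mu_psi):
    """Solve (6.7) for delta, using mu_xi = mu_zeta - mu_psi."""
    mu_xi = mu_zeta - mu_psi
    return ((tau - 1) * mu_psi - mu_zeta) / mu_xi

# Estimates quoted in the text.
TAU_HAT, MU_ZETA, MU_PSI = 2.1654, 10.0723, 8.7400
delta_hat = solve_delta(TAU_HAT, MU_ZETA, MU_PSI)
theta_star = (MU_ZETA + (MU_ZETA - MU_PSI) * delta_hat) / MU_PSI   # equals tau - 1

# Empirical tail slope of log p_k against log k between k = 1000 and k = 2000.
tail_slope = (math.log(limiting_pk(2000, delta_hat, theta_star))
              - math.log(limiting_pk(1000, delta_hat, theta_star))) / math.log(2)
```

With the quoted estimates, `delta_hat` comes out at about 0.0850, which agrees with the reported 0.0851 up to rounding of the inputs, and `tail_slope` is close to −(1 + 𝜃^∗) = −𝜏, as (6.6) predicts.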


6.6 conclusion and future work

We have proposed a model and fitted it to the specific case of the IMDb dataset. The simulations have shown that our model is a reasonable fit; hence it might provide insight into the generative mechanism by which the present actor networks came into being. However, the model is not perfect and leaves room for substantial improvement.

We could consider a saturated power law, that is, a power law with exponential cut-off. A standard power law has the form 𝑝_𝑘 = 𝐶𝑘^{−𝜏} for appropriate positive constants 𝐶 and 𝜏; instead we consider the form 𝑝_𝑘 = 𝐶𝑘^{−𝜏} exp(−𝑘/𝐴) for appropriate constants 𝐶, 𝜏 and 𝐴. For the latter form, 𝑝_𝑘 ≈ 𝐶𝑘^{−𝜏} when 𝑘 ≪ 𝐴, but the exponential term dominates when 𝑘 is sufficiently large, so that 𝑝_𝑘 decays essentially like exp(−𝑘/𝐴). If we make a log(1 − ̂𝐹_𝑁(𝑘))-versus-log 𝑘 plot for data drawn from a power law with exponential cut-off, we see a more-or-less straight line until 𝑘 is moderately large and then a concave curve with rapidly decreasing slope, which is exactly what we see in Figure 6.8 and Figure 6.11. Investigating how to fit such a model to the IMDb dataset might be interesting, and so is looking into the mechanisms responsible for the cut-off.
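The effect of the cut-off on the loglog slope can be illustrated numerically. The sketch below looks at log 𝑝_𝑘 itself rather than at log(1 − ̂𝐹_𝑁(𝑘)); the values 𝜏 = 2.2 and 𝐴 = 1000 are arbitrary choices for the example, not fitted parameters.

```python
import math

def cutoff_power_law(k, tau, A, C=1.0):
    """Unnormalized power law with exponential cut-off: C * k^(-tau) * exp(-k / A)."""
    return C * k ** -tau * math.exp(-k / A)

def local_loglog_slope(k, tau, A):
    """Slope of log p_k against log k between k and 2k; equals -tau - k / (A log 2)."""
    rise = math.log(cutoff_power_law(2 * k, tau, A)) - math.log(cutoff_power_law(k, tau, A))
    return rise / math.log(2)

# Hypothetical parameters, for illustration only.
TAU, A = 2.2, 1000.0
slope_small_k = local_loglog_slope(1, TAU, A)      # close to -tau
slope_large_k = local_loglog_slope(2000, TAU, A)   # much steeper than -tau
```

For 𝑘 ≪ 𝐴 the slope is essentially −𝜏, whereas beyond 𝐴 the extra term −𝑘/(𝐴 log 2) takes over, which is the downward bend described above.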

A different preferential attachment model can be pursued as well. Instead of assuming linear preference with respect to degrees, we could consider a truncated linear preferential attachment function in (6.3),

$$f(k) = \begin{cases} k + \delta & \text{for } k \le A, \\ f_1(k) & \text{for } k > A, \end{cases} \tag{6.8}$$

where 𝑓₁(𝑘) ∶ ℕ → ℝ⁺ is a function that grows more slowly than a linear function, or even a constant function. The truncated preferential attachment, on the one hand, accommodates the assumption that preference should be given to people with higher degrees in an affine fashion most of the time; on the other hand, it restricts the preference through the truncation, where the restriction reflects physical constraints and the like.
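A truncated preference of the form (6.8) is straightforward to write down. In the sketch below the default continuation beyond 𝐴 is the constant 𝐴 + 𝛿, and the square-root alternative is just one example of a slowly growing 𝑓₁; both choices, like the function name, are assumptions for illustration rather than proposals from the text.

```python
import math

def truncated_preference(k, delta, A, f1=None):
    """Affine preference k + delta up to the threshold A, then f1(k) beyond it.

    If f1 is None, the preference stays constant at A + delta for k > A
    (an assumed default, one of the options allowed by (6.8)).
    """
    if k <= A:
        return k + delta
    if f1 is None:
        return A + delta
    return f1(k)

# Hypothetical example: threshold A = 10, delta = 0.1.
below = truncated_preference(3, 0.1, 10)
capped = truncated_preference(50, 0.1, 10)
slow = truncated_preference(50, 0.1, 10, f1=lambda k: 10.1 + math.sqrt(k - 10))
```

Choosing 𝑓₁ so that it matches 𝐴 + 𝛿 at the threshold, as both examples do, keeps the preference continuous in 𝑘; whether that is desirable for the IMDb data is an open modeling question.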