
The handle http://hdl.handle.net/1887/49012 holds various files of this Leiden University dissertation.

Author: Gao, F.

Title: Bayes and networks

Issue Date: 2017-05-23

4 MAXIMUM LIKELIHOOD ESTIMATION IN AFFINE PREFERENTIAL ATTACHMENT NETWORK MODELS

4.1 introduction and notation

In the past decade random graphs have become well established for modelling complex networks. The preferential attachment (pa) model, introduced in [11], is popular in studies of social networks, the Internet, collaboration networks, and so on. The pa model is a dynamic model, in that it describes the evolution of the network through the sequential addition of new nodes, and it can explain the so-called scale-free phenomenon. This is the observation that in various real-world networks the proportion $p_k$ of nodes of degree $k$ follows a power law

$$p_k \propto k^{-\tau},$$

for some power-law exponent $\tau$. For example, Table 3.1 in [60] gives comprehensive lists of basic statistics for a number of well-known networks, where the power-law exponent is estimated to be $2.4$ for protein interactions and $2.5$ for the Internet. Another source [26] estimates the power-law exponent of the Internet to be between $2.15$ and $2.20$.

The pa model is built on the simple "the-rich-get-richer" paradigm and offers a possible scenario in which the Matthew effect takes place ([64]):

    For whosoever hath, to him shall be given, and he shall have more abundance: but whosoever hath not, from him shall be taken away even that he hath.
    —Matthew 13:12, King James Version

If the network is modelled as a graph, with the vertices representing individuals and the degree of a vertex (the number of edges) representing wealth, then this means that a new vertex is more likely to connect to already well-connected vertices: vertices with higher degrees (rich) inspire more incoming connections (get richer).

In the simplest type of pa model this is implemented as follows. We are given a non-decreasing preferential attachment function $f : \mathbb{N}_+ \to \mathbb{R}_+$. The network is initialised at $t = 1$ as a graph consisting of two vertices with one edge between them. Then the recursive attachment scheme begins ($t = 2, 3, \dots$). At time $t$ a new vertex is added to the graph and is connected to exactly one of the $t$ existing vertices, say $i$, with probability proportional to $f(d_i)$, where $d_i$ is the degree of vertex $i$ in the graph at time $t - 1$.
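To fix ideas, here is a minimal simulation sketch of this basic model (in Python; the code and its names are illustrative, not part of the dissertation):

```python
import random

def simulate_pa_tree(n, f, seed=None):
    """Simulate the basic preferential attachment tree up to time n.

    The graph starts at t = 1 with two vertices joined by one edge; at
    each time t >= 2 a new vertex attaches to an existing vertex i with
    probability proportional to f(d_i), where d_i is the degree of i in
    the graph at time t - 1. Returns the list of final degrees.
    """
    rng = random.Random(seed)
    deg = [1, 1]  # the two initial vertices, one edge between them
    for _ in range(2, n + 1):
        weights = [f(d) for d in deg]
        i = rng.choices(range(len(deg)), weights=weights, k=1)[0]
        deg[i] += 1     # the chosen vertex gains a connection
        deg.append(1)   # the new vertex enters with degree 1
    return deg

# Affine attachment function f(k) = k + delta:
degrees = simulate_pa_tree(10_000, lambda k: k + 0.5, seed=1)
```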

The proportionality, which entails normalizing by the sum of all transformed degrees $f(d)$, implies that an affine function $f$ can be parameterized without loss of generality by a single parameter $\delta$, in the form $f(k) = k + \delta$, where minimally $\delta > -1$. This special case has been well studied. In particular, it has been established (see e.g. [38, 55]) that the empirical degree distribution $(p_k(t))_{k=1}^{\infty}$, where $p_k(t)$ is the proportion of vertices of degree $k$ in the tree at time $t$, converges to a limiting degree distribution $(p_k)_{k=1}^{\infty}$ as $t \to \infty$, which follows a power law

$$p_k \propto k^{-(3+\delta)}.$$

The Barabási–Albert model is the special case $\delta = 0$, for which $p_k = 4/(k(k+1)(k+2))$.

If the limiting degree distribution $(p_k)_{k=1}^{\infty}$ follows a power law, say $p_k = c_k k^{-(3+\delta)}$ with $c_k$ slowly varying in $k$, then $\log p_k = \log c_k - (3+\delta)\log k$, where $\log c_k$ behaves like a constant when $k$ is sufficiently large. This suggests that we might estimate $\delta$ by performing a linear regression of the logarithmic empirical degree $\log p_k(t)$ on $\log k$, for $k$ sufficiently large. Then $\hat\delta := \hat\tau - 3$, for $\hat\tau$ the negative of the slope of the fitted line, should estimate $\delta$. This has been the method of choice in several applications of pa models; see for example the famous paper [11]. However, this estimator does not work well. One problem is that the equality $\log p_k(n) = \log c_k - \tau \log k$ comes from an asymptotic approximation based on Stirling's formula, and it is unclear when the asymptotics start to kick in. It is also unclear how to determine the quality of this ad-hoc estimate, by a standard error or confidence interval, or to perform inference.
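For concreteness, this ad-hoc regression estimator could be sketched as follows (illustrative Python; the cutoff `k_min` is the user-chosen degree at which the power law is assumed to have kicked in):

```python
import numpy as np

def regression_estimate_delta(degrees, k_min=10):
    """Estimate delta by regressing log p_k(n) on log k for k >= k_min
    and setting delta_hat = tau_hat - 3, with tau_hat the negative slope.
    This inherits the weaknesses discussed above: the result is sensitive
    to k_min and comes without a standard error.
    """
    degrees = np.asarray(degrees)
    ks, counts = np.unique(degrees, return_counts=True)
    p = counts / degrees.size            # empirical degree distribution
    mask = ks >= k_min
    slope, _ = np.polyfit(np.log(ks[mask]), np.log(p[mask]), 1)
    tau_hat = -slope
    return tau_hat - 3.0
```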

In this chapter we remedy this by considering the maximum likelihood estimator of $\delta$. We show that it can be easily computed as the solution of an equation, prove its asymptotic normality and derive its standard error. Furthermore, in a simulation study we show that it is quite accurate.

We do this in the following more general context of the pa model with random initial degrees, as considered in [21]. In this model the graph also grows by sequentially adding vertices. However, instead of connecting a new vertex to a single existing vertex, the vertex at time $t$ comes with $m_t \ge 1$ edges to connect it to the existing network.

The connections are still made according to the basic preferential attachment rule. The special case of this model with each $m_t$ equal to a fixed number $m \in \mathbb{N}$ was well studied (see [38] for a discussion and more references). For $m = 1$ this model reduces to the basic model considered in the preceding paragraphs. However, for real-life applications it is somewhat unnatural and inflexible to assume that every vertex comes with the same number of edges, let alone a single edge.

In this chapter we allow $m_1, m_2, \dots$ to be an iid sequence of random variables, only restricted to have two finite moments. From the point of view of social network modeling the resulting model can be interpreted as follows. The initial degrees $m_t$ stress the difference between nodes at their times of inclusion (birth), while the preferential attachment enforces discrimination against the less well-off individuals (or, equivalently, positive discrimination in favour of the well-off individuals). Thus the pa network with random initial degrees combines the effects of rich-by-birth and rich-get-richer.

The estimation of a preferential attachment function $f$ is of both theoretical and practical interest. Despite the omnipresence of pa networks in modeling real-world networks, the literature on statistical estimation for this model is sparse. A sound understanding of the statistical properties of pa models will help in evaluating the validity of the pa network model in real-world applications.

Generally this model also gives networks with degree distributions of asymptotic power-law type, but the exponent depends on both the attachment function and the initial degree distribution. Suppose that the preferential attachment function is $f(k) = k + \delta$ as before, and denote the mean of the initial degree distribution by $\mu = \mathbb{E} m_t$. Then [21] has shown that the limiting degree distribution $(p_k)$ for the model follows asymptotically (as $k \to \infty$) a power law with exponent $\min(3 + \delta/\mu, \tau_m)$ if the random initial degree distribution follows a power law with exponent $\tau_m$, and is equal to $3 + \delta/\mu$ if the latter distribution decays faster than a power law (which is equivalent to setting $\tau_m = \infty$).

Chapter 3 proposes an empirical estimator for each $f(k)$ given a general sub-linear preferential attachment function $f$. Although this estimator was proved to be consistent as $t \to \infty$, we may expect better estimators if we restrict attention to the domain of affine functions, whose estimation is equivalent to estimating the single parameter $\delta$.

A restriction to affine functions is natural, as exactly the affine preferential attachment models correspond to power-law behaviour. Therefore affine $f$ matter most to practitioners seeking a natural explanation of power laws. The paper [69] solves some statistical problems in the affine model, but focuses on the estimation of the individual degree distribution, and does not directly consider estimating the key affine parameter $\delta$. This chapter provides a ready-to-use solution for estimating the affine parameter, and sound theoretical insight into the statistical modeling of pa networks.

The estimation problem of the affine parameter $\delta$ is not only interesting in itself, but also important because of its direct connection to estimating the power-law exponent. Although power laws are ubiquitous, as pointed out above, estimating their exponents can be difficult ([19]). One must deal with two types of asymptotics here: the limiting degree distribution when the number of vertices goes to infinity, and the limiting power-law behavior as the degree goes to infinity in the limiting degree distribution. It is hard to determine when these asymptotics kick in and hence are usable for an estimation procedure.

For instance, Clauset, Shalizi, and Newman [19] proposed an estimator for the power-law exponent, but the estimator needs an estimate of the minimal degree where the power law starts to hold, and such an estimate typically requires a careful empirical analysis. Furthermore, power-law relations refer mostly to nodes of high degrees, of which there are only few, resulting in high variance in estimation (and thus low credibility). The maximum likelihood method (mle) automatically weighs the information contained in the different degrees, and does so in an optimal way, as we show below. Of course, a drawback is that it assumes that the pa mechanism is a good fit to the empirical network.

The main challenge in analyzing the maximum likelihood estimator of $\delta$ is that the model is a non-stationary Markov chain, which continuously visits new states, given by the growing network. This requires a careful analysis, but in the end we find that the quantities that are relevant for the estimation of $\delta$ stabilize as the network grows, leading to the result that the mle for $\delta$ based on observing the history of the network until time $n$ tends to a normal distribution of minimal variance when centered at the true $\delta$ and scaled by $\sqrt{n}$.

In practice it may well be that only a snapshot of the network at time $n$ is observed, and not its history through times $t = 1, \dots, n$. A discovery that is surprising at first is that given fixed initial degrees $m_t = m$ this makes no difference for the maximum likelihood procedure. It turns out that in this case, which we refer to as the preferential attachment network with fixed initial degrees, the snapshot at time $n$ is statistically sufficient for the full history. As a consequence the mle based on the snapshot is asymptotically normal after centering at $\delta$ and scaling by $\sqrt{n}$ in this case.

On the other hand, random initial degrees $m_t$ significantly complicate estimation when observing only a snapshot of the network.

Performing maximum likelihood would require either marginalizing the likelihood over all possible histories or implementing an iterative approximation, e.g. of EM type. This seems computationally daunting.

We also do not present a full theoretical analysis. In fact, we conjecture that if the initial-degree distribution were also unknown, no estimator for $\delta$ based only on the degrees in the network at time $n$ could attain a $\sqrt{n}$-rate of estimation, thus suggesting that the statistics change significantly. On the positive side, we propose a quasi maximum likelihood estimator of $\delta$ that requires knowing the initial degree distribution but relies only on the final snapshot. We show this estimator to attain a $\sqrt{n}$-rate and to be asymptotically normal, with a somewhat larger variance than the mle based on the complete evolution of the network.

The chapter is organized as follows. In Section 4.2 we introduce basic notation and derive the likelihood and the maximum likelihood estimator. In Section 4.3 we prove the consistency of this estimator. In Section 4.4 we derive the main result of the chapter, which is the asymptotic normality of the maximum likelihood estimator, by an application of the martingale central limit theorem. Section 4.6 gives a special case of the general results for fixed initial degrees. Section 4.7 defines the quasi-maximum-likelihood estimator $\tilde\delta_n$, which does not depend on the history of the network, and establishes its asymptotics. Last but not least, we present simulations in Section 4.8 to illustrate the results.

4.2 construction of the mle

We start by introducing the affine preferential attachment model with random initial degree distribution. Here we adapt the notation from [21, 38]. Let $(m_t)_{t\ge1}$ be an independent and identically distributed (iid) sequence of positive integer-valued random variables. The model produces a network sequence $\{\mathrm{PA}_t(\delta)\}_{t=1}^{\infty}$, where for every $t$ the network $\mathrm{PA}_t(\delta)$ has a set $V_t = \{v_0, v_1, \dots, v_t\}$ of $t+1$ vertices and $\sum_{i=1}^{t} m_i$ edges. The first network $\mathrm{PA}_1(\delta)$ consists of two vertices $v_0$ and $v_1$ with $m_1$ edges between them. For $t \ge 2$, given $\mathrm{PA}_{t-1}(\delta)$, a new vertex $v_t$ is added with $m_t$ edges connecting $v_t$ to $V_{t-1}$, determined by the intermediate updating preferential attachment rule. This updating rule means that the edges are added sequentially using the preferential attachment rule. Define $\mathrm{PA}_{t,0}(\delta) = \mathrm{PA}_{t-1}(\delta)$ to be the network after $v_{t-1}$ has been fully integrated, and let $\mathrm{PA}_{t,1}(\delta), \mathrm{PA}_{t,2}(\delta), \dots, \mathrm{PA}_{t,m_t}(\delta)$ be intermediate networks, which add $v_t$ and its $m_t$ edges sequentially to $\mathrm{PA}_{t,0}(\delta)$, as follows. For $1 \le i \le m_t$, the network $\mathrm{PA}_{t,i}(\delta)$ is constructed from $\mathrm{PA}_{t,i-1}(\delta)$ by adding an (additional) edge between $v_t$ and a randomly selected vertex among $\{v_0, v_1, \dots, v_{t-1}\}$. The probability that this is vertex $v_j$ is proportional to $k + \delta$, if $v_j \in V_{t-1}$ has degree $k$ in $\mathrm{PA}_{t,i-1}(\delta)$. Here $\delta > -1$ is an unknown parameter. The random choice is made through a multinomial trial on all the vertices in $V_{t-1}$. In other words, the conditional probability that the $i$-th edge of $v_t$ connects it to $v_j$ is

$$\mathbb{P}\big(v_{t,i} \to v_j \mid \mathrm{PA}_{t,i-1}(\delta)\big) = \frac{\mathrm{Deg}_{t,i-1}(v_j) + \delta}{\sum_{v \in V_{t-1}} \big(\mathrm{Deg}_{t,i-1}(v) + \delta\big)}, \tag{4.1}$$


where $\mathrm{Deg}_{t,i-1}(v)$ is the degree of $v$ in $\mathrm{PA}_{t,i-1}(\delta)$. After all $m_t$ edges have been added to $v_t$, the network is given by $\mathrm{PA}_{t,m_t}(\delta) = \mathrm{PA}_t(\delta) = \mathrm{PA}_{t+1,0}(\delta)$.
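A sketch of one time step under this intermediate-updating rule (again illustrative Python, in the spirit of the earlier snippet) might look as follows; the list `deg` holds the degrees of $v_0, \dots, v_{t-1}$, and only those vertices can be chosen, exactly as in (4.1):

```python
import random

def pa_step(deg, m_t, delta, rng):
    """Add vertex v_t with m_t edges. Each edge attaches to an existing
    vertex with probability proportional to (current degree + delta),
    and degrees are updated after every single edge, as in (4.1)."""
    for _ in range(m_t):
        weights = [d + delta for d in deg]  # sums to t*delta + 2M_{t-1} + (i-1)
        j = rng.choices(range(len(deg)), weights=weights, k=1)[0]
        deg[j] += 1
    deg.append(m_t)  # v_t is now fully integrated with degree m_t

def simulate_pa(n, delta, sample_m, seed=None):
    """Simulate PA_n(delta) with iid initial degrees drawn by sample_m."""
    rng = random.Random(seed)
    m1 = sample_m(rng)
    deg = [m1, m1]  # PA_1: v_0 and v_1 joined by m_1 edges
    ms = [m1] + [sample_m(rng) for _ in range(2, n + 1)]
    for m_t in ms[1:]:
        pa_step(deg, m_t, delta, rng)
    return deg, ms

# Example: initial degrees uniform on {1, 2, 3}
deg, ms = simulate_pa(5_000, delta=0.5, sample_m=lambda rng: rng.randint(1, 3), seed=2)
```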

We define $N_k(t)$ to be the number of vertices of degree $k$ in the network $\mathrm{PA}_t(\delta)$ (counting also $v_t$), and $N_k(t, i-1)$ to be the number of vertices of degree $k$ in the network $\mathrm{PA}_{t,i-1}(\delta)$, not counting $v_t$ (so belonging to $V_{t-1}$), for $1 \le i \le m_t$. By convention $N_k(t, 0) = N_k(t-1)$. We denote by $D_{t,i}$ the degree of the vertex that was chosen when constructing $\mathrm{PA}_{t,i}(\delta)$ from $\mathrm{PA}_{t,i-1}(\delta)$; that is, the vertex to which the $i$-th edge of $v_t$ connects has degree $D_{t,i}$. For the evolution of the number of vertices of degree $k$, there are several scenarios. For given natural numbers $k$ and $i \le m_t$:

• If $D_{t,i} \notin \{k, k-1\}$, then the number of vertices of degree $k$ remains unchanged, i.e., $N_k(t, i) = N_k(t, i-1)$.

• If $D_{t,i} = k-1$, then the vertex that was picked by the incoming vertex gets one extra connection and there is one more vertex of degree $k$, i.e., $N_k(t, i) = N_k(t, i-1) + 1$.

• If $D_{t,i} = k$, then the vertex that was picked by the incoming vertex becomes a vertex of degree $k+1$ and there is one fewer vertex of degree $k$, i.e., $N_k(t, i) = N_k(t, i-1) - 1$.

After the last update $i = m_t$, the vertex $v_t$ is fully integrated in the network. Since its degree is $m_t$, we have $N_k(t) = N_k(t, m_t) + \mathbb{1}\{k = m_t\}$. This concludes the time step $t$, and the total number of vertices of the network becomes $t+1$.

These observations are summarized in the following equation for the degree evolution:

$$N_k(t) = N_k(t-1) + \sum_{i=1}^{m_t} \mathbb{1}\{D_{t,i} = k-1\} - \sum_{i=1}^{m_t} \mathbb{1}\{D_{t,i} = k\} + \mathbb{1}\{k = m_t\}. \tag{4.2}$$

The total number of edges in $\mathrm{PA}_t(\delta)$ is $M_t := \sum_{j=1}^{t} m_j$. Clearly the maximal degree in the network $\mathrm{PA}_t(\delta)$ is bounded by $M_t$, i.e., $N_k(t) = 0$ for any $k > M_t$. Furthermore, $\sum_{k=1}^{M_t} N_k(t) = t+1$ is the total number of vertices at time $t$.

The conditional probability of the incoming vertex choosing an existing vertex of degree $k$ to connect to is
$$\mathbb{P}\big(D_{t,i} = k \mid \mathrm{PA}_{t,i-1}(\delta), (m_t)_{t\ge1}\big) = \frac{(k+\delta)\, N_k(t, i-1)}{\sum_{j=1}^{\infty} (j+\delta)\, N_j(t, i-1)}. \tag{4.3}$$


The denominator in this expression counts the total preference of all the vertices, and can be written as

$$S_{t,i-1}(\delta) = \sum_{j=1}^{\infty} (j+\delta)\, N_j(t, i-1) = \sum_{v \in V_{t-1}} \big(\mathrm{Deg}_{t,i-1}(v) + \delta\big) = t\delta + 2M_{t-1} + (i-1).$$

Abbreviate the degree sequence at time $t$ by $D_t := (D_{t,1}, \dots, D_{t,m_t})$.

The conditional likelihood of observing $(D_t)_{t=2}^{n} = (d_t)_{t=2}^{n}$ given the edge counts $m_1, m_2, \dots$ is
$$\mathbb{P}\big((D_t)_{t=2}^{n} = (d_t)_{t=2}^{n} \mid (m_t)_{t\ge1}\big) = \prod_{t=2}^{n} \prod_{i=1}^{m_t} \frac{(d_{t,i} + \delta)\, N_{d_{t,i}}(t, i-1)}{S_{t,i-1}(\delta)}. \tag{4.4}$$
We are interested in estimating $\delta$ and assume that the distribution of the edge counts $m_t$ does not contain information on this parameter. As the edge counts are observed, we condition on them throughout, and treat the preceding as the full likelihood of the observation $(D_t)_{t\ge2}$. In view of (4.1) the likelihood of observing the full evolution of the network up to $\mathrm{PA}_n(\delta)$ is a function of $(D_t)_{t=2}^{n}$, and hence the latter vector is statistically sufficient for this full evolution. In the following we shall see that actually observing only the snapshot of the network $\mathrm{PA}_n(\delta)$ at time $n$ is already statistically sufficient for $\delta$.

Define $N_{>k}(t)$ to be the number of vertices in $\mathrm{PA}_t(\delta)$ of degree (strictly) bigger than $k$, i.e., $N_{>k}(t) = \sum_{j=k+1}^{M_t} N_j(t)$. As first observed in Lemma 3.1, we have the following lemma. (In Lemma 3.1 the left hand side of (4.5) is called $N_{\to k}(t)$.)

Lemma 4.1. The number of vertices with degree strictly bigger than $k$ is equal to the number of times a vertex of degree $k$ was chosen by the incoming vertices up to and including time $n$, plus the number of vertices with initial degree strictly bigger than $k$. In other words, if $R_{>k}(n) = 2 \cdot \mathbb{1}\{m_1 > k\} + \sum_{t=2}^{n} \mathbb{1}\{m_t > k\}$ (the factor $2$ accounting for $v_0$ and $v_1$, which both start with degree $m_1$), then
$$\sum_{t=2}^{n} \sum_{i=1}^{m_t} \mathbb{1}\{D_{t,i} = k\} = N_{>k}(n) - R_{>k}(n). \tag{4.5}$$

We introduce the shorthand $D(n) = (D_t)_{t=2}^{n}$, and from (4.4) obtain the log-likelihood function

$$\begin{aligned}
\ell_n(\delta \mid D(n)) &= \sum_{t=2}^{n} \sum_{i=1}^{m_t} \big[\log N_{D_{t,i}}(t, i-1) + \log(D_{t,i} + \delta) - \log S_{t,i-1}(\delta)\big] \\
&= \sum_{t=2}^{n} \sum_{i=1}^{m_t} \log N_{D_{t,i}}(t, i-1) + \sum_{k=1}^{\infty} \log(k+\delta)\, \big(N_{>k}(n) - R_{>k}(n)\big) - \sum_{t=2}^{n} \sum_{i=1}^{m_t} \log S_{t,i-1}(\delta),
\end{aligned}$$

where the second equality comes from applying Lemma 4.1 and the fact that

$$\sum_{t=2}^{n} \sum_{i=1}^{m_t} \log(D_{t,i} + \delta) = \sum_{k=1}^{\infty} \Big[\log(k+\delta) \sum_{t=2}^{n} \sum_{i=1}^{m_t} \mathbb{1}\{D_{t,i} = k\}\Big].$$

It follows that, given the edge counts $(m_t)$, the likelihood factorises into a part not involving $\delta$ and a part involving $\delta$ and the variables $N_{>k}(n)$. Thus by the factorization theorem (see [48], Corollary 2.6.1) the vector $(N_{>k}(n))_{k\ge1}$ is statistically sufficient for $\delta$, given the initial degrees $(m_t)$. This vector is completely determined by the network at time $n$. In particular, observing the network only at time $n$ is sufficient for $\delta$ relative to observing its evolution up to and including time $n$.
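The sufficient statistics are straightforward to extract from a snapshot. The following sketch (illustrative Python) computes $(N_{>k}(n))_k$ and $(R_{>k}(n))_k$ from the final degrees and the observed initial degrees:

```python
import numpy as np

def sufficient_statistics(final_degrees, initial_degrees):
    """Return arrays (N_gt, R_gt) with N_gt[k-1] = N_{>k}(n) and
    R_gt[k-1] = R_{>k}(n), for k = 1, ..., max degree.

    final_degrees: degrees of v_0, ..., v_n in the snapshot PA_n(delta).
    initial_degrees: m_1, ..., m_n; note m_1 is counted twice in
    R_{>k}(n), once each for v_0 and v_1.
    """
    deg = np.asarray(final_degrees)
    m = np.asarray(initial_degrees)
    ks = np.arange(1, deg.max() + 1)
    N_gt = (deg[None, :] > ks[:, None]).sum(axis=1)
    R_gt = (m[0] > ks).astype(int) + (m[None, :] > ks[:, None]).sum(axis=1)
    return N_gt, R_gt
```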

For inference on $\delta$ we can drop the first term of the log-likelihood, which does not depend on $\delta$, and normalize the remaining part by $n+1$ (note that there are $n+1$ vertices in the network at time $n$).

We take the parameter space for $\delta$ to be $[-a, b]$, for given numbers $-1 < -a < b < \infty$. The maximum likelihood estimator (mle) of $\delta$ is then given by $\hat\delta_n = \arg\max_{\delta \in [-a, b]} \iota_n(\delta)$, for

$$\iota_n(\delta) = \sum_{k=1}^{\infty} \log(k+\delta)\, \frac{N_{>k}(n) - R_{>k}(n)}{n+1} - \frac{1}{n+1} \sum_{t=2}^{n} \sum_{i=1}^{m_t} \log S_{t,i-1}(\delta). \tag{4.6}$$
Provided that the maximum is taken in the interior of the parameter set, the mle is a solution of the likelihood equation $\iota_n'(\delta) = 0$. This derivative is given by

$$\iota_n'(\delta) = \sum_{k=1}^{\infty} \frac{1}{k+\delta}\, \frac{N_{>k}(n) - R_{>k}(n)}{n+1} - \frac{1}{n+1} \sum_{t=2}^{n} \sum_{i=1}^{m_t} \frac{t}{S_{t,i-1}(\delta)}. \tag{4.7}$$
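Numerically, computing the mle is thus a one-dimensional root-finding problem. A possible sketch (illustrative Python; it assumes the initial degrees $m_1, \dots, m_n$ and the statistics of the previous snippet are available, and that the score changes sign on $[-a, b]$):

```python
import numpy as np
from scipy.optimize import brentq

def mle_delta(N_gt, R_gt, m, a=0.9, b=50.0):
    """Solve the likelihood equation iota_n'(delta) = 0 of (4.7) on [-a, b]."""
    n = len(m)
    ks = np.arange(1, len(N_gt) + 1)
    diff = (np.asarray(N_gt) - np.asarray(R_gt)) / (n + 1)

    # For every pair (t, i): S_{t,i-1}(delta) = t*delta + 2*M_{t-1} + (i - 1).
    M = np.cumsum(m)                      # M[t-1] = M_t = m_1 + ... + m_t
    ts = np.concatenate([np.full(m[t - 1], t) for t in range(2, n + 1)])
    cs = np.concatenate([2 * M[t - 2] + np.arange(m[t - 1]) for t in range(2, n + 1)])

    def score(delta):                     # iota_n'(delta), cf. (4.7)
        return (diff / (ks + delta)).sum() - (ts / (ts * delta + cs)).sum() / (n + 1)

    return brentq(score, -a, b)           # requires a sign change on [-a, b]
```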


4.3 consistency

The empirical degree distribution in $\mathrm{PA}_n(\delta)$ is defined as
$$p_k(n) = \frac{N_k(n)}{n+1}.$$

Deijfen et al. [21] shows that this distribution tends to a limit as $n \to \infty$. Let $(r_k)_{k\ge1}$ be the probability distribution of the initial degree (and the number of edges added in every step), i.e.,
$$r_k = \mathbb{P}(m_1 = k), \qquad k \ge 1. \tag{4.8}$$
Assume that this distribution has finite mean $\mu \ge 1$ and finite second moment $\mu^{(2)}$, and write the shorthand $\theta = 2 + \delta/\mu$. Then the limiting degree distribution $(p_k)_{k\ge1}$ satisfies the recurrence relation

$$p_k = \frac{k-1+\delta}{\theta}\, p_{k-1} - \frac{k+\delta}{\theta}\, p_k + r_k, \qquad k \ge 1. \tag{4.9}$$
Starting from the initial value $p_0 = 0$, we can solve the recurrence relation by

$$p_k = \frac{\theta}{k+\delta+\theta} \sum_{i=0}^{k-1} r_{k-i} \prod_{j=1}^{i} \frac{k-j+\delta}{k-j+\delta+\theta}, \qquad k \ge 1, \tag{4.10}$$
where the empty product is defined to be $1$, should it arise. Because $\sum_{k\ge1} p_k = \sum_{k\ge1} r_k = 1$, in view of the recurrence relation (4.9), the probabilities $(p_k)_{k\ge1}$ define a proper probability distribution.
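The recursion (4.9) gives a direct way to compute the limiting degree distribution numerically, without evaluating the products in (4.10): rearranged, $p_k = (\theta r_k + (k-1+\delta)\,p_{k-1})/(k+\delta+\theta)$. The following sketch (illustrative Python) implements this:

```python
def limiting_degree_distribution(r, delta, mu, K):
    """Compute p_1, ..., p_K from the recurrence (4.9), with p_0 = 0,
    theta = 2 + delta/mu, and r[k-1] = r_k (zero beyond len(r))."""
    theta = 2.0 + delta / mu
    p = [0.0] * (K + 1)
    for k in range(1, K + 1):
        r_k = r[k - 1] if k <= len(r) else 0.0
        p[k] = (theta * r_k + (k - 1 + delta) * p[k - 1]) / (k + delta + theta)
    return p[1:]
```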

We list two results on the limiting degree distribution of the preferential attachment model with random initial degrees. See [21] for proofs.

Proposition 4.2. If the initial degrees $(m_t)_{t=1}^{\infty}$ have finite moment of order $1+\varepsilon$ for some $\varepsilon > 0$, then there exists a constant $\gamma \in (0, 1/2)$ such that
$$\lim_{n\to\infty} \mathbb{P}\Big(\max_{k\ge1} |p_k(n) - p_k| \ge n^{-\gamma}\Big) = 0,$$
where $(p_k)_{k=1}^{\infty}$ is defined as in (4.10).

In the case that the initial degree is degenerate, i.e., $r_m = 1$ for some integer $m \ge 1$, the rate of convergence in this result can be improved, and the limiting degree distribution takes a simpler form, as follows.

Proposition 4.3. If $r_m = 1$ for some integer $m \ge 1$, then there exists a constant $C > 0$ such that
$$\lim_{n\to\infty} \mathbb{P}\Big(\max_{k\ge1} |p_k(n) - p_k| \ge C\sqrt{\log n / n}\Big) = 0,$$
where $(p_k)_{k=1}^{\infty}$ is defined as follows:
$$p_k = \begin{cases} 0, & \text{if } k < m, \\[4pt] \dfrac{\theta\,\Gamma(k+\delta)\,\Gamma(m+\delta+\theta)}{\Gamma(m+\delta)\,\Gamma(k+1+\delta+\theta)}, & \text{if } k \ge m. \end{cases} \tag{4.11}$$
Furthermore, if $m = 1$, so that $r_1 = 1$, then the empirical degree $p_k(n)$ also converges almost surely to $p_k$, as $n \to \infty$, for every $k$.
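For numerical work with (4.11) it is convenient to evaluate the Gamma ratios on the log scale. A sketch (illustrative Python), which for $m = 1$, $\delta = 0$ reproduces the formula $p_k = 4/(k(k+1)(k+2))$ quoted in Section 4.1:

```python
import numpy as np
from scipy.special import gammaln

def p_k_fixed_m(ks, m, delta):
    """Evaluate (4.11) for the degenerate case r_m = 1, where mu = m and
    theta = 2 + delta/m, via log-gamma functions for numerical stability."""
    theta = 2.0 + delta / m
    ks = np.asarray(ks, dtype=float)
    log_p = (np.log(theta)
             + gammaln(ks + delta) + gammaln(m + delta + theta)
             - gammaln(m + delta) - gammaln(ks + 1 + delta + theta))
    return np.where(ks < m, 0.0, np.exp(log_p))
```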

Next we give a lemma that will be essential to our analysis later. For a summable sequence $(a_k)_{k\ge1}$, write $a_{>k} = \sum_{j>k} a_j$.

Lemma 4.4. The following recurrence relation holds, with $\theta = 2 + \delta/\mu$:
$$p_{>k} = \frac{k+\delta}{\theta}\, p_k + r_{>k}. \tag{4.12}$$

Proof. We simply sum (4.9) over all indices $j > k$: the two middle terms telescope, leaving $(k+\delta)p_k/\theta$, so that $p_{>k} = (k+\delta)p_k/\theta + r_{>k}$.

From now on we shall put a superscript $(0)$ to stress that we consider limiting distributions under the true value $\delta_0$ of the parameter. In view of Proposition 4.2, $N_{>k}(n)/(n+1)$ is asymptotic to $p_{>k}^{(0)}$, while by the Law of Large Numbers $R_{>k}(n)/(n+1)$ tends to $r_{>k}$. Furthermore, for fixed $i$ the sequence $S_{t,i-1}(\delta)/t = \delta + 2\bar m_{t-1} + (i-1)/t$ (with $\bar m_{t-1} = M_{t-1}/t$) is asymptotic to $\delta + 2\mu$, again by the Law of Large Numbers. Therefore, we expect the criterion $\iota_n'(\delta)$ given in (4.7) to be asymptotic to

$$\iota'(\delta) = \sum_{k=1}^{\infty} \frac{p_{>k}^{(0)} - r_{>k}}{k+\delta} - \frac{1}{2+\delta/\mu}. \tag{4.13}$$

Consequently, we expect that the mle $\hat\delta_n$ will be asymptotic to the solution of the equation $\iota'(\delta) = 0$. Because of (4.12), applied at $\delta_0$, we have $p_{>k}^{(0)} - r_{>k} = (k+\delta_0)\,p_k^{(0)}/\theta_0$ with $\theta_0 = 2 + \delta_0/\mu$, so that
$$\iota'(\delta_0) = \sum_{k=1}^{\infty} \frac{p_{>k}^{(0)} - r_{>k}}{k+\delta_0} - \frac{1}{2+\delta_0/\mu} = \sum_{k=1}^{\infty} \frac{p_k^{(0)}}{\theta_0} - \frac{1}{\theta_0} = 0.$$
Thus the true parameter is indeed a solution to this equation. The following lemmas show that this solution is unique.


Define
$$q_k = \frac{p_{>k} - r_{>k}}{\mu} = \frac{(k+\delta)\,p_k}{2\mu+\delta}. \tag{4.14}$$

Lemma 4.5. For any nonnegative sequence $(v_k)_{k=1}^{\infty}$ that is strictly decreasing in $k$ and any $\delta_1 > \delta_2$, we have, for $q_k(\delta)$ given in (4.14) (where $p_k = p_k(\delta)$ as well),
$$\sum_{k=1}^{\infty} q_k(\delta_1)\, v_k > \sum_{k=1}^{\infty} q_k(\delta_2)\, v_k.$$

Proof. We first show that $\sum_{k\ge1} q_k = 1$. By manipulating the recurrence relation (4.9), we find
$$\begin{aligned}
\sum_{k=1}^{\infty} q_k &= \frac{1}{2\mu+\delta} \sum_{k=1}^{\infty} (k+\delta)\,p_k = \frac{1}{2\mu+\delta} \Big( \sum_{k=1}^{\infty} k\,p_k + \delta \Big) \\
&= \frac{1}{2\mu+\delta} \Big( \sum_{k=1}^{\infty} k \Big( \frac{k-1+\delta}{2+\delta/\mu}\, p_{k-1} - \frac{k+\delta}{2+\delta/\mu}\, p_k + r_k \Big) + \delta \Big) \\
&= \frac{1}{2\mu+\delta} \Big( \sum_{k=1}^{\infty} k\,r_k + \sum_{k=1}^{\infty} \frac{k+\delta}{2+\delta/\mu}\, p_k + \delta \Big) \\
&= \frac{1}{2\mu+\delta} \Big( \mu + \mu \sum_{k=1}^{\infty} q_k + \delta \Big).
\end{aligned}$$
The only solution to this equation for $\sum_k q_k$ is $\sum_k q_k = 1$.

By the recurrence formulas (4.12) and (4.9), we have $q_k = (k+\delta)p_k/(2\mu+\delta)$ and $q_{k+1} = (k+1+\delta)(q_k + r_{k+1}/\mu)/(k+1+\delta+2+\delta/\mu)$. Therefore the derivative $u_k(\delta) = \frac{d}{d\delta} q_k(\delta)$ satisfies
$$u_{k+1}(\delta) = \frac{2 - (k+1)/\mu}{(k+1+\delta+2+\delta/\mu)^2}\, \big(q_k(\delta) + r_{k+1}/\mu\big) + \frac{k+1+\delta}{k+1+\delta+2+\delta/\mu}\, u_k(\delta).$$

The initial value of this sequence is positive, since
$$u_1(\delta) = \frac{d}{d\delta} q_1(\delta) = \frac{2\mu - 1}{\mu^2 (1+\delta+2+\delta/\mu)^2}\, r_1 > 0.$$

From the recursion it follows that $u_{k+1}(\delta)$ remains positive at least as long as $k+1 \le 2\mu$. For $k+1 > 2\mu$ the first term of the recursion is negative, while the second term has the sign of $u_k(\delta)$. From the fact that $\sum_k q_k(\delta) = 1$ for every $\delta$, it follows that $\sum_k u_k(\delta) = 0$, and hence $u_k(\delta)$ cannot remain positive indefinitely. If $K(\delta)+1$ is the first $k$ for which $u_k(\delta) < 0$, then it must be that $K(\delta)+1 > 2\mu$, which implies that $u_k(\delta) < 0$ for every $k > K(\delta)+1$ as well. Since $v_k$ is decreasing it follows that
$$\sum_{k} v_k\, u_k(\delta) > \sum_{k \le K(\delta)} v_{K(\delta)}\, u_k(\delta) + \sum_{k > K(\delta)} v_{K(\delta)}\, u_k(\delta) = v_{K(\delta)} \cdot 0 = 0.$$
Integrating this over the interval $[\delta_2, \delta_1]$ gives the assertion.

Lemma 4.6. The function $\delta \mapsto \iota'(\delta)$ possesses a unique zero at $\delta = \delta_0$. It is positive for $\delta < \delta_0$ and negative for $\delta > \delta_0$.

Proof. Following the definition (4.13) of $\iota'$ it was seen that $\delta_0$ is a zero. Fix some $\delta \ne \delta_0$. Since $1 = \sum_k p_k(\delta)$ and $q_k(\delta) = (k+\delta)p_k(\delta)/(2\mu+\delta)$, we can rewrite $\iota'(\delta)$ as

$$\begin{aligned}
\iota'(\delta) &= \sum_{k=1}^{\infty} \frac{\mu\, q_k^{(0)}}{k+\delta} - \frac{1}{2+\delta/\mu} \\
&= \sum_{k=1}^{\infty} \frac{\mu\, q_k^{(0)}}{k+\delta} - \sum_{k=1}^{\infty} \frac{(k+\delta)\,p_k(\delta)}{(k+\delta)(2+\delta/\mu)} \\
&= \sum_{k=1}^{\infty} \frac{\mu\, q_k^{(0)}}{k+\delta} - \sum_{k=1}^{\infty} \frac{\mu\, q_k(\delta)}{k+\delta}.
\end{aligned}$$

Applying Lemma 4.5 with $v_k = 1/(k+\delta)$, we see that $\iota'(\delta) > 0$ when $\delta < \delta_0$, and $\iota'(\delta) < 0$ when $\delta > \delta_0$.

The proof of consistency of the mle will be based on uniform convergence of $\iota_n'$ to $\iota'$, together with the uniqueness of the zero of $\iota'$. For the convergence, and also for the proof of asymptotic normality, we need the following lemma.

Lemma 4.7 (Cesàro convergence for random variables). Let $(X_t)_{t\in\mathbb{N}}$ be a sequence of random variables, $(a_t)_{t\in\mathbb{N}}$ a sequence of numbers, $X$ and $a$ a random variable and a number, and let $\bar X_t$ and $\bar a_t$ be the averages of the first $t$ variables or numbers, respectively.

(i) If $X_t \xrightarrow{\text{a.s.}} X$, then $\bar X_t \xrightarrow{\text{a.s.}} X$.

(ii) If $X_t \xrightarrow{L_1} X$, or equivalently $X_t \xrightarrow{P} X$ and $(X_t)_{t\in\mathbb{N}}$ is uniformly integrable, then $\bar X_t \xrightarrow{L_1} X$.

(iii) If $X_t \xrightarrow{L_1} X$, $\bar a_t \to a$ and $\overline{|a|}_t = O(1)$, then $\overline{(aX)}_t \xrightarrow{L_1} aX$, where $\overline{(aX)}_t = t^{-1} \sum_{i=1}^{t} a_i X_i$.


Proof. Statement (i) is the usual Cesàro convergence, applied to almost every one of the deterministic sequences $X_t(\omega)$ obtained for elements $\omega$ of the underlying probability space.

Statement (ii) is the special case of (iii) with $a_t = 1$, for every $t$.

To prove statement (iii) we decompose
$$\begin{aligned}
\big|\overline{(aX)}_t - aX\big| &= \Big| \frac{1}{t} \sum_{i=1}^{t} a_i X_i - \frac{1}{t} \sum_{i=1}^{t} a_i X + \frac{1}{t} \sum_{i=1}^{t} a_i X - aX \Big| \\
&\le \frac{1}{t} \sum_{i=1}^{K} |a_i (X_i - X)| + \frac{1}{t} \sum_{i=K+1}^{t} |a_i (X_i - X)| + |X|\, |\bar a_t - a|.
\end{aligned}$$

Take the expectation across to bound the expected value of the left side by
$$\frac{K}{t} \max_{1 \le i \le K} |a_i|\, \max_{t} \mathbb{E}|X_t - X| + \overline{|a|}_t \max_{K < i \le t} \mathbb{E}|X_i - X| + |\bar a_t - a|\, \mathbb{E}|X|.$$

Because $\mathbb{E}|X_i - X| \to 0$ as $i \to \infty$, for any $\varepsilon > 0$ there exists $K$ such that $\sup_{i > K} \mathbb{E}|X_i - X| < \varepsilon$. Then the second term is bounded above by a constant times $\varepsilon$, by the assumption on $\overline{|a|}_t$. For fixed $K$ the first and third terms tend to zero as $t \to \infty$. Thus the limsup as $t \to \infty$ of the whole expression is bounded by a multiple of $\varepsilon$, for every $\varepsilon > 0$.

Lemma 4.8. The derivative $\iota_n'$ of the log-likelihood function converges uniformly to the limiting criterion $\iota'$, i.e., as $n \to \infty$, for every $\epsilon > 0$,
$$\sup_{\delta > -1+\epsilon} |\iota_n'(\delta) - \iota'(\delta)| \xrightarrow{P} 0.$$

Proof. For $r_{>k}(n) = R_{>k}(n)/(n+1)$, the difference $\iota_n'(\delta) - \iota'(\delta)$ can be decomposed as
$$\sum_{k=1}^{\infty} \frac{p_{>k}(n) - p_{>k}^{(0)}}{k+\delta} - \sum_{k=1}^{\infty} \frac{r_{>k}(n) - r_{>k}}{k+\delta} - \frac{1}{n+1} \sum_{t=2}^{n} \sum_{i=1}^{m_t} \frac{t}{S_{t,i-1}(\delta)} + \frac{\mu}{2\mu+\delta}. \tag{4.15}$$
We deal with the first two terms and the difference of the last two terms separately.

As $k\,N_{>k}(n) \le 2M_n$, where $M_n = \sum_{t=1}^{n} m_t$ is the total number of edges in $\mathrm{PA}_n(\delta)$, we have $p_{>k}(n) \le 2\bar m_n/k$, for every $k$. Hence, for $\delta \ge -\eta := -1+\epsilon$,
$$\sum_{k=1}^{\infty} \frac{|p_{>k}(n) - p_{>k}^{(0)}|}{k+\delta} \le \sum_{k \le K} \frac{|p_{>k}(n) - p_{>k}^{(0)}|}{k-\eta} + \sum_{k > K} \frac{2\bar m_n}{k(k-\eta)} + \sum_{k > K} \frac{p_{>k}^{(0)}}{k-\eta}.$$
Since $\bar m_n \to \mu$ almost surely, by the Law of Large Numbers, the second term on the right side can be made arbitrarily small by choice of $K$. The same is true for the third term, as $p_k^{(0)}$ follows a power law with exponent bigger than $2$. For any fixed $K$ the first term converges in probability to $0$ as $n \to \infty$, by Proposition 4.2. Thus the full expression tends to zero.

The variable $r_{>k}(n) - r_{>k}$ can be written in the form $2(\mathbb{1}\{m_1 > k\} - r_{>k})/(n+1) + \sum_{t=2}^{n} (\mathbb{1}\{m_t > k\} - r_{>k})/(n+1)$. This is a weighted sum of independent centered Bernoulli variables with success probability $r_{>k}$. Its first absolute moment can be bounded by its standard deviation, which is bounded by a multiple of the square root of $r_{>k}(1-r_{>k})/(n+1)$. It follows that the supremum over $\delta > -\eta = -1+\epsilon$ of the absolute value of the second term has expected value bounded above by a multiple of
$$\frac{1}{\sqrt{n+1}} \sum_{k=1}^{\infty} \frac{\sqrt{r_{>k}(1-r_{>k})}}{k-\eta}.$$
Since $r_{>k} \le k^{-1}\mu$, by Markov's inequality, the series converges easily, and the expression tends to zero as $n \to \infty$.

With slight abuse of notation write $\bar m_n = \sum_{t=1}^{n} m_t/(n+1)$. The third term can be decomposed as
$$\begin{aligned}
&- \frac{1}{n+1} \sum_{t=2}^{n} \sum_{i=1}^{m_t} \Big[ \frac{1}{S_{t,i-1}(\delta)/t} - \frac{1}{\delta + 2\bar m_{t-1}} \Big] \\
&\qquad - \frac{1}{n+1} \sum_{t=2}^{n} \Big[ \frac{m_t}{\delta + 2\bar m_{t-1}} - \frac{m_t}{2\mu+\delta} \Big] - \Big[ \frac{1}{n+1} \sum_{t=2}^{n} \frac{m_t}{2\mu+\delta} - \frac{\mu}{2\mu+\delta} \Big] \\
&= \frac{1}{n+1} \sum_{t=2}^{n} \sum_{i=1}^{m_t} \frac{(i-1)/t}{(\delta + 2\bar m_{t-1} + (i-1)/t)(\delta + 2\bar m_{t-1})} \\
&\qquad - \frac{1}{n+1} \sum_{t=2}^{n} \frac{m_t (2\mu - 2\bar m_{t-1})}{(\delta + 2\bar m_{t-1})(2\mu+\delta)} - \frac{\bar m_n - \mu}{2\mu+\delta}.
\end{aligned}$$

The supremum over $\delta > -\eta$ of the absolute value of this expression is bounded above by
$$\frac{1}{n+1} \sum_{t=2}^{n} \frac{m_t^2/t}{(2\bar m_{t-1} - \eta)^2} + \frac{1}{n+1} \sum_{t=2}^{n} \frac{m_t \cdot 2|\mu - \bar m_{t-1}|}{(2\bar m_{t-1} - \eta)(2\mu - \eta)} + \frac{|\bar m_n - \mu|}{2\mu - \eta}.$$


The third term tends to zero almost surely by the Law of Large Numbers. In the first term we have that the variables $X_t := t^{-1}/(2\bar m_{t-1} - \eta)^2$ converge almost surely to $0$ as $t \to \infty$, while the averages of the variables $a_t := m_t^2$ tend to $\mu^{(2)}$ almost surely, again by the Law of Large Numbers. Applying Lemma 4.7 to the sequences of numbers $X_t(\omega)$ and $a_t(\omega)$ obtained by selecting $\omega$ from the underlying probability space so that both convergences are valid, we see that the first term tends to zero, for such $\omega$, and hence almost surely. The second term tends to zero by the same argument, now with the choice $X_t := 2|\mu - \bar m_{t-1}|/((2\bar m_{t-1} - \eta)(2\mu - \eta))$.

Combining the preceding Lemmas 4.6 and 4.8 gives the following theorem.

Theorem 4.9. The mle $\hat\delta_n$ is consistent: $\hat\delta_n \to \delta_0$, in probability under $\delta_0$, for every $\delta_0 \in (-a, b)$ with any $a < 1$ and $b < \infty$.

Proof. Because $\iota'$ is continuous on $[-a, b]$ and vanishes only at $\delta_0$, we have that $\inf_{\delta \in [-a,b] : |\delta - \delta_0| > \epsilon} |\iota'(\delta)| > 0$, for every $\epsilon$. More precisely, by Lemma 4.6 it is bounded away from zero in the positive direction for $\delta < \delta_0 - \epsilon$ and in the negative direction for $\delta > \delta_0 + \epsilon$. Since $\iota_n'$ tends uniformly to $\iota'$, by Lemma 4.8, the same is true for $\iota_n'$, with probability tending to one. This shows that the maximum of $\iota_n$ must be contained in $[\delta_0 - \epsilon, \delta_0 + \epsilon]$, with probability tending to one.

4.4 asymptotic normality

We shall apply the following martingale central limit theorem (see Corollary 3.1 in [36] or Theorem XIII.1.1 in [65]) to study the asymptotic normality of the mle. The triangular array version with $k_n \to \infty$ given here is equivalent to the theorem in the latter reference (stated for $k_n = n$), as remarked preceding its statement on page 171.

Proposition 4.10. Suppose that for every $n \in \mathbb{N}$ and $k_n \to \infty$ the random variables $X_{n,1}, \dots, X_{n,k_n}$ are a martingale difference sequence relative to an arbitrary filtration $\mathcal{F}_{n,1} \subset \mathcal{F}_{n,2} \subset \dots \subset \mathcal{F}_{n,k_n}$. If for some positive constant $v$ and every $\varepsilon > 0$
$$\sum_{i=1}^{k_n} \mathbb{E}[X_{n,i}^2 \mid \mathcal{F}_{n,i-1}] \xrightarrow{P} v, \qquad \sum_{i=1}^{k_n} \mathbb{E}\big[X_{n,i}^2\, \mathbb{1}\{|X_{n,i}| > \varepsilon\} \mid \mathcal{F}_{n,i-1}\big] \xrightarrow{P} 0,$$
then $\sum_{i=1}^{k_n} X_{n,i} \rightsquigarrow N(0, v)$.


Lemma 4.11. Suppose that the initial degree distribution has finite second moment. Given almost every sequence $(m_t)_{t=1}^{\infty}$ we have, under $\delta_0$,
$$\sqrt{n}\,\big(\iota_n'(\delta_0) - \iota'(\delta_0)\big) \rightsquigarrow N(0, \nu_0), \tag{4.16}$$
where $\iota'(\delta_0) = 0$ and
$$\nu_0 = \sum_{k=1}^{\infty} \frac{\mu\, q_k^{(0)}}{(k+\delta_0)^2} - \frac{\mu}{(2\mu+\delta_0)^2}.$$

Proof. Throughout the proof we condition on $(m_t)_{t=1}^{\infty}$, without letting this show up in the notation.

We can write
$$\iota_n'(\delta_0) = \frac{1}{n+1} \sum_{t=2}^{n} \sum_{i=1}^{m_t} Y_{t,i},$$

for
$$Y_{t,i} = \frac{1}{D_{t,i} + \delta_0} - \frac{t}{S_{t,i-1}(\delta_0)} = \frac{1}{D_{t,i} + \delta_0} - \frac{1}{\delta_0 + 2\bar m_{t-1} + (i-1)/t}.$$
As is to be expected from the fact that they are score functions, the variables $Y_{2,1}, Y_{2,2}, \dots, Y_{2,m_2}, Y_{3,1}, \dots, Y_{3,m_3}, Y_{4,1}, \dots$ are martingale differences relative to the filtration $\mathcal{F}_{2,1} \subset \mathcal{F}_{2,2} \subset \dots \subset \mathcal{F}_{2,m_2} \subset \mathcal{F}_{3,1} \subset \dots \subset \mathcal{F}_{3,m_3} \subset \mathcal{F}_{4,1} \subset \dots$ obtained by letting $\mathcal{F}_{t,i}$ correspond to observing the evolution of the pa graph up to $\mathrm{PA}_{t,i}(\delta)$. Indeed, in view of (4.3),

$$\mathbb{E}[Y_{t,i} \mid \mathcal{F}_{t,i-1}] = \sum_{k=1}^{\infty} \frac{1}{k+\delta_0}\, \frac{N_k(t, i-1)(k+\delta_0)}{S_{t,i-1}(\delta_0)} - \frac{t}{S_{t,i-1}(\delta_0)} = 0,$$
since $\sum_k N_k(t, i-1) = t$ is the number of vertices in the graph at time $t$, for every $i$ (not counting $v_t$). (Set $\mathcal{F}_{t,0} = \mathcal{F}_{t-1,m_{t-1}}$.)

We now apply Proposition 4.10 to the triangular array of martingale differences $X_{2,1}, \dots, X_{2,m_2}, \dots, X_{n,m_n}$, for $n = 1, 2, \dots$, and $X_{t,i} = Y_{t,i}/\sqrt{n+1}$. The $n$-th row possesses $\sum_{t=2}^{n} m_t \to \infty$ variables. Since the variables $Y_{t,i}$ are uniformly bounded by $2/(1+\delta_0)$, the Lindeberg condition, in the display of Proposition 4.10, is trivially satisfied. We need to show that
$$\frac{1}{n+1} \sum_{t=2}^{n} \sum_{i=1}^{m_t} \mathbb{E}[Y_{t,i}^2 \mid \mathcal{F}_{t,i-1}] \xrightarrow{P} \nu_0.$$


In view of (4.3),
$$\begin{aligned}
\mathbb{E}[Y_{t,i}^2 \mid \mathcal{F}_{t,i-1}] &= \mathbb{E}\Big[ \frac{1}{(D_{t,i} + \delta_0)^2} \,\Big|\, \mathcal{F}_{t,i-1} \Big] - \Big( \frac{t}{S_{t,i-1}(\delta_0)} \Big)^2 \\
&= \sum_{k=1}^{\infty} \frac{1}{(k+\delta_0)^2}\, \frac{N_k(t, i-1)(k+\delta_0)}{S_{t,i-1}(\delta_0)} - \Big( \frac{t}{S_{t,i-1}(\delta_0)} \Big)^2.
\end{aligned}$$
Since $i$ edges are added when constructing $\mathrm{PA}_{t,i}(\delta)$ from $\mathrm{PA}_{t,0}(\delta)$, the number of nodes of degree $k$ cannot change by more than $i \le m_t$. Therefore, for every $t$,

$$\max_{1 \le i \le m_t} \Big| \frac{N_k(t, i-1)}{t} - \frac{N_k(t, 0)}{t} \Big| \le \frac{m_t}{t}.$$

Since $m_t$ has finite second moment, we have $\sum_t \mathbb{P}(m_t > t\epsilon) < \infty$, for every $\epsilon > 0$, and hence $m_t/t \to 0$, almost surely, as $t \to \infty$. We combine this with the preceding display and Proposition 4.2 to see that $N_k(t, i-1)/t \to p_k^{(0)}$ in probability, as $t \to \infty$, for every fixed $k$, uniformly in $1 \le i \le m_t$. As a function of $k$, the numbers $N_k(t, i-1)/t$ are a probability distribution on $\mathbb{N}$, and hence $\sum_k |N_k(t, i-1)/t - p_k^{(0)}| \to 0$, by Scheffé's theorem, uniformly in $1 \le i \le m_t$. In particular, the $N_k(t, i-1)/t$ are uniformly integrable (summable), whence by the dominated convergence theorem also, uniformly in $1 \le i \le m_t$, as $t \to \infty$,
$$\sum_{k} \Big| \frac{N_k(t, i-1)/t}{k+\delta_0} - \frac{p_k^{(0)}}{k+\delta_0} \Big| \xrightarrow{P} 0.$$
By the definition of $S_{t,i-1}(\delta_0)$, we also have

$$\max_{1 \le i \le m_t} \Big| \frac{S_{t,i-1}(\delta_0)}{t} - (\delta_0 + 2\bar m_{t-1}) \Big| \le \frac{2 m_t}{t}.$$

Therefore, by the Law of Large Numbers we obtain that $S_{t,i-1}(\delta_0)/t \to \delta_0 + 2\mu$, almost surely, uniformly in $1 \le i \le m_t$.

Combining the preceding we see that, for almost every sequence $(m_t)$, as $t \to \infty$,
$$\frac{1}{m_t} \sum_{i=1}^{m_t} \sum_{k} \frac{N_k(t, i-1)}{(k+\delta_0)\, S_{t,i-1}(\delta_0)} \xrightarrow{P} \sum_{k} \frac{p_k^{(0)}}{(k+\delta_0)(\delta_0 + 2\mu)}.$$
Next, by Lemma 4.7, applied with $X_t$ equal to the left side of the preceding display (which is bounded and hence uniformly integrable) and $a_t = m_t$, we see that, for almost every sequence $(m_t)$,
$$\frac{1}{n+1} \sum_{t=2}^{n} \sum_{i=1}^{m_t} \sum_{k=1}^{\infty} \frac{N_k(t, i-1)}{(k+\delta_0)\, S_{t,i-1}(\delta_0)} \xrightarrow{P} \mu \sum_{k} \frac{p_k^{(0)}}{(k+\delta_0)(\delta_0 + 2\mu)}.$$
By a similar, but simpler, argument we see that

$$\frac{1}{n+1} \sum_{t=2}^{n} \sum_{i=1}^{m_t} \Big( \frac{t}{S_{t,i-1}(\delta_0)} \Big)^2 \xrightarrow{P} \frac{\mu}{(\delta_0 + 2\mu)^2}.$$

Since $p_k^{(0)}/(2\mu+\delta_0) = q_k^{(0)}/(k+\delta_0)$ by (4.14), the difference of the right sides of the last two displays is $\nu_0$. If $K$ is a random variable with distribution $(q_k^{(0)})_{k=1}^{\infty}$, then $\nu_0$ equals $\mu$ times the variance of $1/(K+\delta_0)$, and is therefore positive.

The following is the main result of the chapter.

Theorem 4.12. If $\delta_0$ is interior to the parameter set, then the mle $\hat\delta_n$ satisfies, for $\nu_0$ given in Lemma 4.11,
$$\sqrt{n}\,(\hat\delta_n - \delta_0) \rightsquigarrow N(0, \nu_0^{-1}). \tag{4.17}$$

Proof. By Theorem 4.9 $\hat\delta_n$ tends to $\delta_0$, hence is with probability tending to one interior to the parameter set, and must solve the likelihood equation $\iota_n'(\hat\delta_n) = 0$. By Taylor expansion there exists $\tilde\delta_n$ between $\delta_0$ and $\hat\delta_n$ such that
$$0 = \iota_n'(\hat\delta_n) = \iota_n'(\delta_0) + \iota_n''(\tilde\delta_n)(\hat\delta_n - \delta_0).$$
Using that $\iota'(\delta_0) = 0$, we can reformulate the preceding display as
$$\sqrt{n}\,(\hat\delta_n - \delta_0)\, \iota_n''(\tilde\delta_n) = -\sqrt{n}\,\big(\iota_n'(\delta_0) - \iota'(\delta_0)\big).$$
The expression on the right is studied in Lemma 4.11, and seen to converge in distribution to $N(0, \nu_0)$.

The second derivative takes the form
$$\iota_n''(\delta) = - \sum_{k=1}^{\infty} \frac{1}{(k+\delta)^2}\, \frac{N_{>k}(n) - R_{>k}(n)}{n+1} + \frac{1}{n+1} \sum_{t=2}^{n} \sum_{i=1}^{m_t} \frac{t^2}{S_{t,i-1}^2(\delta)}.$$
By a similar argument as in the proof of Lemma 4.8 we see that this converges in probability to the second derivative $\iota''(\delta)$, uniformly in $\delta$ in a neighbourhood of $\delta_0$. Since $\tilde\delta_n \to \delta_0$ in probability and $\iota''$ is continuous at $\delta_0$, it follows that $\iota_n''(\tilde\delta_n) \to \iota''(\delta_0) = -\nu_0$ in probability, and the theorem follows by Slutsky's lemma.
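As a practical companion to Theorem 4.12, the asymptotic variance can be turned into a plug-in standard error. The sketch below (illustrative Python; it reuses the limiting distribution computed by the earlier `limiting_degree_distribution` snippet, truncated at a large $K$, and plugs in $\hat\delta_n$ for $\delta_0$):

```python
import numpy as np

def nu0(p, delta, mu):
    """nu_0 of Lemma 4.11 from a truncated limiting distribution p_1..p_K:
    nu_0 = sum_k mu*q_k/(k+delta)^2 - mu/(2*mu+delta)^2, with
    q_k = (k+delta)*p_k/(2*mu+delta) as in (4.14)."""
    ks = np.arange(1, len(p) + 1)
    q = (ks + delta) * np.asarray(p) / (2 * mu + delta)
    return (mu * q / (ks + delta) ** 2).sum() - mu / (2 * mu + delta) ** 2

def delta_confidence_interval(delta_hat, n, p, mu, z=1.96):
    """Approximate 95% interval: delta_hat +/- z / sqrt(n * nu_0)."""
    se = 1.0 / np.sqrt(n * nu0(p, delta_hat, mu))
    return delta_hat - z * se, delta_hat + z * se
```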
