
Cover Page

The handle http://hdl.handle.net/1887/49012 holds various files of this Leiden University dissertation.

Author: Gao, F.

Title: Bayes and networks

Issue Date: 2017-05-23


3 MODELS

3.1 introduction

We formulate the model in a more concise and less abstract fashion than in Section 2.2.4. We start at n = 2 with two connected nodes.

We then define the recursive scheme by which the network evolves. Suppose at time n we have a network of n nodes V_n = {v_1, v_2, …, v_n} with degrees (d_i)_{i=1}^n. The new node v_{n+1} comes in at time n + 1 and puts f(d_i) as the preference on node i, where f maps the degree k ∈ ℕ+ to its associated preference f(k) ∈ ℝ+. The new node chooses one existing node v_i in V_n to connect to with probability proportional to the associated preference f(d_i); that is, it puts probability f(d_i)/∑_{j=1}^n f(d_j) on node v_i. The function f is typically assumed a priori to be non-decreasing; in other words, nodes of higher degrees are more favorable for the incoming node to connect to, so higher degrees attract more incoming connections. This explains the name preferential attachment model, which is closely related to the Matthew effect, as elaborated in Section 2.1.3.

After the incoming node makes its choice, the network evolves to the stage of n + 1 nodes, and the recursive scheme starts again with the next incoming node v_{n+2}. We may repeat the procedure as many times as we want.
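The recursive scheme above can be sketched in a few lines of code. This is our own minimal illustration (not code from the dissertation); the function name `simulate_pa_tree` and the linear-search sampling are our choices:

```python
import random

def simulate_pa_tree(n, f, seed=None):
    """Grow a pa tree: start at n = 2 with two connected nodes; each new
    node attaches to an existing node i with probability f(d_i)/sum_j f(d_j).
    Returns the final degree sequence (d_1, ..., d_n)."""
    rng = random.Random(seed)
    deg = [1, 1]                       # two connected nodes at time n = 2
    for _ in range(n - 2):
        prefs = [f(d) for d in deg]
        u = rng.random() * sum(prefs)  # inverse-cdf sampling over preferences
        acc = 0.0
        for i, w in enumerate(prefs):
            acc += w
            if u <= acc:
                break
        deg[i] += 1                    # the chosen node gains an edge
        deg.append(1)                  # the new node arrives with degree 1
    return deg

degrees = simulate_pa_tree(1000, lambda k: k, seed=1)
```

Since the resulting network is a tree with n − 1 edges, the degrees always sum to 2(n − 1).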

We define the empirical degree distribution P_k(n) to be the proportion of nodes of degree k at time n:

P_k(n) = (1/n) ∑_{i ∈ [n]} 𝟙{d_i(n) = k},

where [n] = {1, …, n} and d_i(n) is the degree of node v_i at time n. In case f is affine with f(k) = k + δ as in [55], it is well known that for any fixed k, as n → ∞, P_k(n) → p_k (at least) in probability, with p_k depending only on δ. Moreover, it is also well known that the limiting degree distribution follows a power law, so this is a scenario in which the scale-free phenomenon occurs. That is, as n → ∞,

P_k(n) → p_k with p_k = c(k) k^{−τ},

where τ = 3 + δ is the power-law exponent and c(k) is slowly varying in k. Furthermore, if f is linear, f(k) = k, we can work out the exact asymptotic degree distribution to be p_k = 4/(k(k+1)(k+2)) ([16, 55]), i.e., the power-law exponent is τ = 3 + 0 = 3.
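For f(k) = k this closed form is easy to sanity-check numerically: the sum ∑_k 4/(k(k+1)(k+2)) telescopes to 1, and p_k·k³ → 4 confirms that the tail exponent is τ = 3. A quick check of our own, not part of the original analysis:

```python
def p_linear(k):
    # limiting degree probability for f(k) = k: p_k = 4 / (k (k+1) (k+2))
    return 4.0 / (k * (k + 1) * (k + 2))

# the partial sum telescopes to 1 - 2/((K+1)(K+2)), so (p_k) is a distribution
partial = sum(p_linear(k) for k in range(1, 10001))

# p_k * k^3 -> 4 as k grows, i.e. p_k ~ 4 k^{-3}: a power law with tau = 3
tail_check = p_linear(10**6) * (10**6) ** 3
```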


Móri [55] and Resnick and Samorodnitsky [69] show that central limit theorems hold for the empirical degree distribution P_k(n). They apply martingale central limit theorems to carefully devised martingales to prove asymptotic normality of P_k(n) for any fixed k, with explicit limiting variance σ_k² depending only on δ and k.

If we relax the assumption that f is affine to f being only sublinear (which includes the affine case), we know much less, except for the limiting degree distribution that we discuss below, partly because the analysis becomes much more difficult. Suppose we have v_1, v_2, …, v_n with degrees (d_i)_{i=1}^n and preferences (f(d_i))_{i=1}^n. We need the total preference, defined as the sum of all f(d_i)'s, to normalize the multinomial distribution on all the nodes. In the affine case f(k) = k + δ with parameter δ, the total preference is deterministic: since the sum of all degrees is 2n, we have ∑_{i=1}^n f(d_i) = ∑_{i=1}^n (d_i + δ) = nδ + 2n. However, in the general case this simple relation ceases to hold, and the total preference becomes a rather messy random quantity depending on the entire history of the network evolution. For the affine case it is possible to study the limiting degree distribution with simple recursions on the degree evolution, but this tool fails in the more general case. Stronger and less intuitive ideas are necessary to study even the simplest quantity: the limiting degree distribution.

Rudas, Tóth, and Valkó [70] prove that the empirical degree proportion P_k(n) converges to p_k as in (3.13) almost surely, for any fixed k. Their proofs rely on a branching-process framework in which the degree distribution can be studied with well-established strong law-of-large-numbers-type results from the classical branching-process literature dating back to the 1970s and 1980s; see [40, 56].

In the non-linear regime, pa networks do not yield a power law in general. In the sublinear case we find a much lighter tail in the limiting degree distribution: the degree distribution decays much faster as the degree increases, so far fewer nodes of high degree can be found. On the one hand, the estimation problem for a general sublinear pa function is interesting in itself; on the other hand, it helps in applications where power laws are believed to hold. In some applications it might be a good idea to drop the assumption that the pa function is affine and instead estimate the pa function f over a more general sublinear domain. With such an estimate one can judge whether assuming f affine is reasonable, depending on whether the estimator of f looks like an affine function at all. In other words, estimating f over a more general domain helps to validate the modelling of power laws with pa networks.

The main contribution of our work is to propose the empirical estimator and prove its consistency. The intuition behind the empirical estimator comes from reasoning in the limiting regime, where everything becomes stable and easy to analyze. The Markovian nature of pa networks makes the problem hard, as it ceases to be a conventional statistical problem where everything is independent and identically distributed (iid). Perhaps the most surprising aspect of our study is that our empirical estimators do not depend on the history: we only need the final snapshot of the network to construct the estimators. A closer look reveals that, despite the Markovian nature, the network inevitably approaches its limiting shape, where the connecting behavior of the incoming nodes looks more and more like iid random variables. Hence the history eventually matters less and less.

This chapter is organized as follows. In Section 3.2 we give the intuition behind the estimator and present the main results on consistency. Section 3.3 introduces the terminology of branching processes and gives a random tree model that is equivalent to the evolution of pa networks. After that, we give the proof of the main consistency result in Section 3.4. In the last section we present a simulation study of the performance of the proposed empirical estimators in different settings and explain our observations. Some further studies are carried out to uncover more properties of the empirical estimator, the most interesting being that the estimator seems to be asymptotically normal with the parametric √n rate.

3.2 construction of the empirical estimator and main result

Suppose we have a pa graph that has n nodes and is already close to the limiting distribution (p_k)_{k=1}^∞. Suppose a new node comes in and needs to pick an existing node to attach to according to the pa rule associated with the pa function f. For N_k the number of nodes of degree k (≈ n p_k in the limiting regime), the probability of choosing an existing node of degree k is

f(k) N_k / ∑_{j=1}^∞ f(j) N_j ≈ f(k) p_k / ∑_{j=1}^∞ f(j) p_j.

We are interested in the quantity f(k) for each k ≥ 1. Note, however, that the denominator ∑_{j=1}^∞ f(j) p_j on the right-hand side of the above display does not depend on k. If we put an extra factor n/N_k ≈ 1/p_k on the above display to cancel the term p_k, then we get a rescaled version of f(k). Keeping in mind that f is only identifiable up to scale, we define r_k, the rescaled version of f(k), for each k; the above heuristic says

r_k := f(k) / ∑_{j=1}^∞ f(j) p_j ≈ (probability of choosing a node of degree k) / (N_k/n).

We want to devise an estimator mimicking the above equation that works in the non-limiting regime. The probability of the new node choosing an existing node of degree k can be estimated by counting the number of times that an incoming node chooses an existing node of degree k during the evolution of the pa network. We denote this number for the pa network at time n by N_{→k}(n), and the number of nodes of degree k in the pa network at time n by N_k(n).

The empirical estimator (ee) r̂_k(n) is defined by

r̂_k(n) = N_{→k}(n) / N_k(n).    (3.1)

Suppose N_{>k}(n) is the number of nodes of degree strictly bigger than k at time n. For the pa networks considered here, we have the following crucial observation.

Lemma 3.1. N_{→k}(n) = N_{>k}(n).

Proof. Observe that N_{→k}(n) counts the number of times that an incoming node chooses an existing node of degree k to connect to, up to time n. If a node was chosen, as a node of degree k, to be connected to at some point before time n, then its degree at time n is at least k + 1. On the other hand, a node's degree can only jump from 1 to 2, 2 to 3, …, k to k + 1, and so on. Therefore, if a node has degree strictly bigger than k, it must have been chosen, as a node of degree k, to be connected to at some time. This gives the equality in the statement of the lemma.

In the light of this observation, we note that (3.1) is equivalent to

r̂_k(n) = N_{>k}(n) / N_k(n).    (3.2)
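Both (3.1) and (3.2) are easy to compute in a simulation, and tracking the attachment history lets one check Lemma 3.1 empirically. The sketch below is our own; it uses the linear pa function f(k) = k, for which drawing a uniform entry from a list of edge endpoints is equivalent to degree-proportional sampling (for this f, r_1 = p_{>1}/p_1 = 1/2 by the limiting distribution p_k = 4/(k(k+1)(k+2))):

```python
import random

def simulate_linear_pa(n, seed=None):
    """Grow a pa tree with f(k) = k, recording the final degrees and, for
    each k, how often an incoming node attached to a node of degree k."""
    rng = random.Random(seed)
    deg = [1, 1]
    endpoints = [0, 1]        # node i listed deg[i] times, so a uniform draw
    chosen_at = {}            # from this list is proportional to f(d) = d
    for _ in range(n - 2):
        i = endpoints[rng.randrange(len(endpoints))]
        chosen_at[deg[i]] = chosen_at.get(deg[i], 0) + 1   # N_{->k} counts
        deg.append(1)
        endpoints.extend([i, len(deg) - 1])
        deg[i] += 1
    return deg, chosen_at

deg, hist = simulate_linear_pa(20000, seed=7)
n1 = sum(1 for d in deg if d == 1)
r1_snapshot = sum(1 for d in deg if d > 1) / n1   # (3.2): final snapshot only
r1_history = hist[1] / n1                          # (3.1): full history
```

In line with Lemma 3.1, `r1_snapshot` and `r1_history` agree exactly, and both approach the target value 1/2 as n grows.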

We give the main results of this chapter in the following theorem and defer the proof until Section 3.4.

Theorem 3.2. If the true pa function is such that there exists a positive constant serving as the Malthusian parameter for the associated continuous-time random tree model, then the empirical estimator r̂_k(n) constructed above is consistent almost surely, i.e.,

r̂_k(n) →^{a.s.} r_k, as n → ∞.    (3.3)


Remark. Suppose you have a real-world dynamic network and are thinking of modeling it as a pa network. The above estimators then only need the network observed at the end of the evolution and do not require knowledge of the evolution history. This seems counterintuitive given the Markov setting of the model. A simple explanation lies in the limiting behavior of the network model: the degree sequence stabilizes as the network size grows, and since the limiting degree distribution is deterministic, the influence of the history fades. The asymptotic independence of the history is also practically important, because in real-world applications it is often difficult, expensive, or even impossible to recover the complete evolution history of a dynamic network.

Remark. The above equation has a philosophical interpretation. Suppose one lives as a node in the world of a pa network, where nodes are constantly coming in. The nodes in this world do not understand how some superior force makes the world as it is, and thus do not know the exact preferential attachment rule. In this world they measure wealth by counting degrees: the more neighbors one has, the wealthier one is. Suppose a node has degree k and asks how likely it is to receive an extra edge and thereby get richer. This question is equivalent to asking for an estimate of f(k). Our estimator says it is reasonable to count the number of nodes with higher degrees (the people above you), N_{>k}, and the number of nodes with the same degree (the people sharing your rank), N_k; the quotient is then an estimator of f(k). If you live in the world of these nodes and ask what your chance of moving up is, then naturally the best you can do is compute this ratio. The higher the relative number of people above you compared to the people sharing your rank, the better your chance of moving up.

3.3 borrowing strength from branching processes

In this section we introduce the terminology needed to state the pa model in the language of branching processes, similar to [70]. As we will see later, the evolution of one particular kind of continuous-time branching process is in a certain sense equivalent to the evolution of the preferential attachment model; this enables us to study the degree distribution of preferential attachment networks with classical and well-established results on branching processes.


3.3.1 Rooted ordered tree

The pa network is a rooted ordered tree, which can be described as an evolving genealogical tree, in which the nodes are individuals and the edges are parent-child relations. The usual notation for the nodes is ∅ for the root of the tree and l-tuples (i_1, …, i_l) of positive natural numbers i_j ∈ ℕ+ for the other nodes (l ∈ ℕ+ = {1, 2, …}). The children of the root are labeled (1), (2), …, and in general x = (i_1, …, i_l) denotes the i_l-th child of the i_{l−1}-th child of ⋯ of the i_1-th child of the root. Thus the set of all possible individuals is

ℐ = {∅} ∪ (⋃_{l=1}^∞ ℐ_l),    ℐ_l = {(i_1, …, i_l) : i_j ∈ ℕ+}.

For x = (x_1, …, x_k) and y = (y_1, …, y_l) the notation xy is shorthand for (x_1, …, x_k, y_1, …, y_l), and xl is the concatenation (x_1, …, x_k, l).

Since the edges of the tree can be inferred from the labels of the nodes (i_1, …, i_l), a rooted ordered tree can be identified with a subset G ⊂ ℐ. (Not every subset corresponds to a rooted ordered tree, as the labels need to satisfy the compatibility conditions that for every (x_1, …, x_k) ∈ G we have (x_1, …, x_{k−1}) ∈ G (the parent must be in the tree) as well as (x_1, …, x_k − 1) ∈ G if x_k ≥ 2 (an older sibling must be in the tree).) The set of all finite rooted ordered trees is denoted by 𝒢. In this terminology and notation the degree of a node x ∈ G is the number of its children in G plus 1 (for its parent), given by

deg(x, G) = |{l ∈ ℕ+ : xl ∈ G}| + 1.    (3.4)

3.3.2 Branching process

The evolution in time of the genealogical tree is described through stochastic processes (ξ_x(t))_{t≥0}, one for each individual x ∈ ℐ. The birth process ξ_x is a point process on [0, ∞) giving the ages of the parent x at the births of its children. The birth time σ_x of individual x in calendar time is then defined recursively, by setting σ_∅ = 0 (the root is born at time zero) and

σ_y = σ_x + inf{u ≥ 0 : ξ_x(u) ≥ l},    if y = xl.

Thus the l-th child of x is born at the birth time of x plus the time of the l-th event in ξ_x.

It is assumed that the birth processes ξ_x for different x ∈ ℐ are independent and identically distributed. Thus the splits in the tree evolve independently and identically from every node onwards, and this independence makes the process a branching process. As a formal definition we may define all processes ξ_x on the product probability space

(Ω, ℬ, P) = ∏_{x∈ℐ} (Ω_x, ℬ_x, P_x),

where every (Ω_x, ℬ_x, P_x) is an independent copy of a single probability space (Ω_0, ℬ_0, P_0) and every ξ_x is defined as ξ_x(ω) = ξ(ω_x) if ω = (ω_x)_{x∈ℐ} ∈ Ω, for ξ a given point process defined on (Ω_0, ℬ_0, P_0).

We identify the point process ξ with the process ξ(t) giving the number of points in [0, t], for t ≥ 0, and write μ(t) = 𝔼[ξ(t)] for its intensity measure, which is called the reproduction function in this context.

Besides the reproduction process ξ_x we also attach a random characteristic φ_x to every individual x ∈ ℐ. This is also a stochastic process (φ_x(t))_{t≥0}, which we take nonnegative, measurable and separable; for simplicity define φ_x(t) = 0 for t < 0. We then define

Z_t^φ = ∑_{x∈ℐ : σ_x ≤ t} φ_x(t − σ_x).

If φ_x(t) is viewed as a characteristic of individual x at age t, then the variable φ_x(t − σ_x) is the characteristic of individual x at calendar time t, and Z_t^φ is the sum of all such characteristics over the individuals that are alive at time t (i.e., σ_x ≤ t).

The characteristics φ_x are assumed independent and identically distributed across individuals x, like the reproduction processes. Formally this may be achieved by defining φ_x(ω) = φ(ω_x) if ω = (ω_x)_{x∈ℐ} ∈ Ω, for a given stochastic process φ on (Ω_0, ℬ_0, P_0). This allows the two processes ξ_x and φ_x attached to a given individual to be dependent. In fact, we shall be interested in the choices, for a given natural number k,

φ(t) ≡ 1,    φ(t) = 𝟙{ξ(t) = k},    φ(t) = 𝟙{ξ(t) > k}.    (3.5)

For these characteristics the variable Z_t^φ is equal to, respectively, the total number of individuals born up to time t, the total number of those individuals with exactly k children at time t, and the total number with more than k children at time t.

We consider supercritical, Malthusian processes satisfying the following three conditions.

1. μ does not concentrate on any lattice {0, h, 2h, …} for h > 0.


2. There exists a number λ* > 0 such that

∫_0^∞ e^{−λ*t} μ(dt) = 1.    (3.6)

3. The first moment of e^{−λ*t} μ(dt) is finite, i.e.,

∫_0^∞ u e^{−λ*u} μ(du) < ∞.    (3.7)

The second condition is the Malthusian assumption, and λ* is called the Malthusian parameter; the third is the supercritical condition.

The following is a weaker version of Theorem 6.3 of [56] (see Theorem A in [70]). It is worth noting that (3.8) implies the Malthusian condition; moreover, (3.8) is a sufficient condition for the almost sure convergence and might not be necessary for weaker notions of convergence, e.g., convergence in probability.

Proposition 3.3. Consider a supercritical, Malthusian branching process with Malthusian parameter λ*, counted by two bounded random characteristics φ and ψ. Suppose that there exists a λ < λ* such that

∫_0^∞ e^{−λt} μ(dt) < ∞.    (3.8)

Then almost surely, as t → ∞,

Z_t^φ / Z_t^ψ → ∫_0^∞ e^{−λ*t} 𝔼[φ(t)] dt / ∫_0^∞ e^{−λ*t} 𝔼[ψ(t)] dt.    (3.9)

3.3.3 The continuous random tree model

To connect back to the pa model, given a pa function f we now define the process ξ as a pure birth process with birth rate f(ξ(t) + 1), i.e., the continuous-time Markov process with state space the nonnegative integers and the only possible transitions given by

P(ξ(t + dt) = k + 1 | ξ(t) = k) = f(k + 1) dt + o(dt).    (3.10)

The genealogical tree is then also a Markov process on the state space 𝒢. The initial state of the process is the root {∅} of the tree, and the jumps of this process correspond to an individual x ∈ G giving birth to a child, which is then incorporated in the tree as an additional node.


In the preceding notation this means that the process can jump from a state G to a state of the form G ∪ {xk}, where necessarily x ∈ G and k = deg(x, G); equivalently, x already has k − 1 children in the tree. This jump is made with rate f(deg(x, G)), since according to (3.10) with ξ = ξ_x the individual x gives birth to a new child with rate f(k) if x already has k − 1 children. The description in terms of rates means, more concretely, that given the current state G, the Markov process can jump to the finitely many possible states G ∪ {xk}, with x ∈ G and k = deg(x, G), and it chooses between these states with probabilities

f(deg(x, G)) / ∑_{y∈G} f(deg(y, G)),    x ∈ G.

Furthermore, the waiting time in state G until the next jump is an exponential variable with intensity equal to the total preference

F(G) = ∑_{x∈G} f(deg(x, G)).

The continuous-time scale of the process is not essential to us, but it is convenient for the calculations. We shall use that as t → ∞ the continuous-time tree visits the same states (trees) as the pa model, and taking limits as t → ∞ is equivalent to taking limits in the pa model as the number of nodes increases to infinity.
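The jump-chain claim can be illustrated directly: if every node x holds an independent Exp(f(deg(x, G))) clock, then the node whose clock rings first is distributed proportionally to f(deg(x, G)). A small sketch of our own, with an arbitrary degree sequence standing in for a tree G:

```python
import random

def next_birth(deg, f, rng):
    """One jump of the continuous-time tree: each node holds an independent
    Exp(f(deg_x)) clock; the first clock to ring gives birth."""
    clocks = [rng.expovariate(f(d)) for d in deg]
    i = clocks.index(min(clocks))
    return i, clocks[i]

rng = random.Random(0)
deg = [3, 1, 1, 2, 1]          # some degree sequence; F(G) = 8 for f(k) = k
wins = [0] * len(deg)
for _ in range(100000):
    i, _ = next_birth(deg, lambda k: k, rng)
    wins[i] += 1
freqs = [w / 100000 for w in wins]   # should approach [3/8, 1/8, 1/8, 2/8, 1/8]
```

The empirical winning frequencies match the pa choice probabilities f(d_i)/F(G), which is exactly the equivalence used in the text.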

In order to apply Proposition 3.3 in this setting we need to verify its conditions on the reproduction function μ(t) = 𝔼[ξ(t)] and determine the Malthusian parameter. The events of the pure birth process (3.10) occur at times T_1 < T_1 + T_2 < T_1 + T_2 + T_3 < ⋯, for T_1, T_2, … independent random variables, with T_k exponentially distributed with rate f(k). The total number of events ξ(t) = ∫ 𝟙_(0,t](u) ξ(du) at time t is equal to ∑_{l=1}^∞ 𝟙_(0,t](T_1 + ⋯ + T_l), whence

𝔼[∫ e^{−λu} ξ(du)] = 𝔼[∑_{l=1}^∞ e^{−λ(T_1+⋯+T_l)}] = ∑_{l=1}^∞ ∏_{k=1}^l f(k)/(λ + f(k)).

As a function of λ this expression is decreasing; it is infinite at λ = 0 and decreases to 0 as λ → ∞ under mild conditions on f. The Malthusian parameter λ* is defined as the argument where the function takes the value one and will typically exist. In a neighbourhood of λ* the function will also be finite, and so will its derivative in absolute value, which is the integral in (3.7). Thus the conditions of Proposition 3.3 will typically be satisfied.
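Since the left-hand side is decreasing in λ, the Malthusian parameter can be found numerically by bisection. For the linear pa function f(k) = k the products telescope to 2/((l + 1)(l + 2)), whose sum equals one exactly at λ = 2, which serves as a test case. A sketch of our own, with ad hoc truncation levels:

```python
def malthusian_sum(lam, f, terms=20000):
    """Left-hand side of the Malthusian equation:
    sum over l of prod_{k <= l} f(k) / (lam + f(k))."""
    total, prod = 0.0, 1.0
    for k in range(1, terms + 1):
        prod *= f(k) / (lam + f(k))
        total += prod
        if prod < 1e-15:           # remaining terms are negligible
            break
    return total

def malthusian_parameter(f, lo=1e-6, hi=100.0, tol=1e-6):
    # the sum is decreasing in lam, so bisect for the root of sum = 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if malthusian_sum(mid, f) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam_star = malthusian_parameter(lambda k: k)   # close to 2 for f(k) = k
```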


We may also calculate the expressions on the right-hand side of (3.9) for the characteristics (3.5), as follows. From ℙ(ξ(t) > k − 1) = ℙ(T_1 + ⋯ + T_k < t) = ∫_0^t h_k(u) du, for h_k the density of T_1 + ⋯ + T_k, we have by Fubini's theorem (or partial integration),

λ* ∫_0^∞ e^{−λ*t} ℙ(ξ(t) > k − 1) dt = ∫_0^∞ ∫_u^∞ λ* e^{−λ*t} dt h_k(u) du = 𝔼[e^{−λ*(T_1+⋯+T_k)}] = ∏_{j=1}^k f(j)/(λ* + f(j)).    (3.11)

Furthermore, writing ℙ(ξ(t) = k − 1) = ℙ(ξ(t) > k − 2) − ℙ(ξ(t) > k − 1) and applying the preceding display, we obtain

λ* ∫_0^∞ e^{−λ*t} ℙ(ξ(t) = k − 1) dt = λ*/(λ* + f(k)) ∏_{j=1}^{k−1} f(j)/(λ* + f(j)).    (3.12)

As we will see, (3.11) and (3.12) correspond to the computations of the limiting proportions of nodes with degree bigger than k and degree equal to k, respectively.

3.4 consistency of the empirical estimators

For completeness, we present a result (without proof) from [70] giving the limiting degree distribution for a class of pa functions f.

Proposition 3.4. Consider a pa function f such that there exists a positive constant λ* serving as the Malthusian parameter for the associated continuous-time random tree model. Then as n → ∞, the empirical degree distribution P_k(n) converges almost surely, for every k, to the limit p_k specified by

P_k(n) →^{a.s.} p_k = λ*/(λ* + f(k)) ∏_{j=1}^{k−1} f(j)/(λ* + f(j)),    ∀k ∈ ℕ+.    (3.13)

Note p_1 = λ*/(λ* + f(1)) and p_{k+1} = p_k f(k)/(λ* + f(k + 1)).
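The recursion noted after (3.13) makes (p_k) easy to compute. As a check of our own: for f(k) = k with λ* = 2 it reproduces the exact formula p_k = 4/(k(k+1)(k+2)) quoted in Section 3.1:

```python
def limiting_degrees(f, lam_star, kmax):
    """(p_1, ..., p_kmax) via p_1 = lam*/(lam* + f(1)) and
    p_{k+1} = p_k f(k) / (lam* + f(k+1)), as in (3.13)."""
    p = [lam_star / (lam_star + f(1))]
    for k in range(1, kmax):
        p.append(p[-1] * f(k) / (lam_star + f(k + 1)))
    return p

p = limiting_degrees(lambda k: k, 2.0, 2000)
```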

Suppose the birth process is defined as in (3.10); then the intervals between successive jumps of ξ(t) are independent exponentially distributed random variables with parameters (f(k))_{k=1}^∞ respectively. We recall the Malthusian equation and that λ* is the solution to the equation (in λ)

∫_0^∞ e^{−λt} dμ(t) = ∑_{k=1}^∞ ∏_{i=1}^k f(i)/(λ + f(i)) = 1.

It will be useful later that (by the formula for p_k in (3.13))

∑_{k=1}^∞ f(k) p_k = ∑_{k=1}^∞ λ* f(k)/(λ* + f(k)) ∏_{i=1}^{k−1} f(i)/(λ* + f(i)) = ∑_{k=1}^∞ λ* ∏_{i=1}^k f(i)/(λ* + f(i)) = λ*,    (3.14)

where the last equality follows from the Malthusian equation.
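Identity (3.14) doubles as a numerical sanity check: computing ∑_k f(k) p_k from the recursion in (3.13) should return λ*. A quick check of our own for f(k) = k, λ* = 2:

```python
def total_limiting_preference(f, lam_star, kmax=5000):
    # sum of f(k) p_k with p_k generated by the recursion in (3.13)
    p = lam_star / (lam_star + f(1))
    total = f(1) * p
    for k in range(1, kmax):
        p *= f(k) / (lam_star + f(k + 1))
        total += f(k + 1) * p
    return total

s = total_limiting_preference(lambda k: k, 2.0)   # close to lam_star = 2
```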

Proof of Theorem 3.2. We need to apply Proposition 3.3 properly: we find appropriate characteristics φ and ψ so that the left-hand side of (3.9) is the ee and the right-hand side is r_k.

We set φ(t) = 𝟙{ξ(t) + 1 > k} and ψ(t) = 𝟙{ξ(t) + 1 = k}; then

Z_t^φ = ∑_{x∈ℐ} 𝟙{t ≥ σ_x} 𝟙{ξ_x(t − σ_x) + 1 > k},    Z_t^ψ = ∑_{x∈ℐ} 𝟙{t ≥ σ_x} 𝟙{ξ_x(t − σ_x) + 1 = k}.

Z_t^ψ counts all the nodes that have been born up to time t and have degree k at time t, and Z_t^φ counts all the nodes that have been born and have degree strictly bigger than k at time t.

We apply Proposition 3.3 with the above defined φ and ψ. Noting that the reproduction process ξ(t) associated with a node relates to its degree by ξ(t) + 1 = Deg(t), this gives that

lim_{t→∞} |{x ∈ Υ(t) : ξ_x(t − σ_x) + 1 > k}| / |{x ∈ Υ(t) : ξ_x(t − σ_x) + 1 = k}| = λ* ∫_0^∞ e^{−λ*t} ℙ(ξ(t) + 1 > k) dt / λ* ∫_0^∞ e^{−λ*t} ℙ(ξ(t) + 1 = k) dt    (3.15)

holds almost surely. We identify the numerator and denominator with the calculations in (3.11) and (3.12) and arrive at

λ* ∫_0^∞ e^{−λ*t} ℙ(ξ(t) + 1 > k) dt = ∏_{j=1}^k f(j)/(f(j) + λ*),    (3.16)

λ* ∫_0^∞ e^{−λ*t} ℙ(ξ(t) + 1 = k) dt = λ*/(λ* + f(k)) ∏_{j=1}^{k−1} f(j)/(f(j) + λ*).    (3.17)


We see immediately that the right-hand side of (3.17) is the same as the limiting proportion p_k of degree k from (3.13), and that the right-hand side of (3.16) equals p_{>k} := ∑_{j=k+1}^∞ p_j by Fubini's theorem. Therefore we arrive at

lim_{t→∞} |{x ∈ Υ(t) : deg(x, Υ(t)) > k}| / |{x ∈ Υ(t) : deg(x, Υ(t)) = k}| = p_{>k}/p_k

almost surely. It remains to show that the right-hand side of the preceding display equals r_k = f(k)/∑_j f(j)p_j, which is done in Lemma 3.5 below.

Lemma 3.5. Suppose (p_k)_{k=1}^∞ is the limiting degree distribution specified in (3.13) for the pa function f. Then

f(k) / ∑_{j=1}^∞ p_j f(j) = p_{>k}/p_k    (3.18)

holds for every k ∈ ℕ+.

Proof. Define the auxiliary quantity q_k, the limiting preference towards degree k, by

q_k = f(k)p_k / ∑_{i=1}^∞ f(i)p_i = f(k)p_k/λ*,

using (3.14). Note q_k = r_k p_k, so (3.18) is the same as q_k = p_{>k}.

For k = 1 we note p_1 = λ*/(f(1) + λ*), hence p_{>1} = 1 − p_1 = f(1)/(λ* + f(1)) = f(1)p_1/λ* = q_1. Assuming p_{>k} = q_k holds, consider the case k + 1. By (3.13) we have p_{k+1} = f(k)p_k/(λ* + f(k + 1)), and thus

p_{>k+1} = p_{>k} − p_{k+1} = q_k − p_{k+1}
= f(k)p_k (1/λ* − 1/(λ* + f(k + 1)))
= (f(k + 1)/λ*) · f(k)p_k/(λ* + f(k + 1))
= f(k + 1)p_{k+1}/λ* = q_{k+1}.

By induction, p_{>k} = q_k holds for all k ∈ ℕ+.
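The identity (3.18) holds for any pa function with a Malthusian parameter, not only affine ones. Below is a numerical check of our own for the sublinear choice f(k) = √k: we solve the Malthusian equation by bisection and compare both sides of (3.18) at k = 3 (truncation levels are ad hoc):

```python
import math

f = lambda k: math.sqrt(k)

def lhs_malthus(lam, terms=5000):
    # sum over l of prod_{k <= l} f(k)/(lam + f(k)); decreasing in lam
    total, prod = 0.0, 1.0
    for k in range(1, terms + 1):
        prod *= f(k) / (lam + f(k))
        total += prod
        if prod < 1e-17:
            break
    return total

lo, hi = 1e-6, 50.0                      # bisect for the Malthusian parameter
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if lhs_malthus(mid) > 1.0 else (lo, mid)
lam = 0.5 * (lo + hi)

kmax = 5000                              # p_k via the recursion in (3.13)
p = [lam / (lam + f(1))]
for k in range(1, kmax):
    p.append(p[-1] * f(k) / (lam + f(k + 1)))

left = f(3) / sum(f(j + 1) * p[j] for j in range(kmax))   # f(k)/sum_j f(j) p_j
right = sum(p[3:]) / p[2]                                 # p_{>3} / p_3
```

Both sides agree to numerical precision, as the lemma asserts.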


3.5 simulation studies

In this section we present a numerical illustration of the behavior of the empirical estimator.

We run the experiment for the following pa functions (normalized such that f(1) = 1):

f^(1)(k) = (k + 1/2)/(3/2),    f^(2)(k) = k^{2/3},    f^(3)(k) = √(k + 2.4)/√3.4.

We simulate 1,000 pa networks of 10,000, 100,000 and 1,000,000 nodes for each of the three functions, so 9,000 networks in total. In each experiment, since the model is only identifiable up to scale, we apply the empirical estimation to a range of degrees and then normalize the obtained estimates such that the preference on degree 1 equals 1, to enable easy comparisons. We summarize the simulation study in Figure 3.1. The three rows of the 9 panels refer to the pa functions f^(1), f^(2), f^(3) (top to bottom), while the three columns refer to networks with 10,000, 100,000 and 1,000,000 nodes. In each of the nine panels the degree k is on the horizontal axis, plotted from 1 to the maximal degree that occurred over all of the 1,000 simulations. The vertical axis gives the value of the pa function and a boxplot of the 1,000 estimates computed from the simulated networks. The ground truth for each k is marked by the red line in each panel.

These plots suggest the following observations:

• The estimator is consistent, as our theorem shows. The quality improves when we have more nodes, hence more observations.

• For a fixed number of nodes, the quality of the estimator deteriorates quickly as k increases, as shown by the substantial variability around the ground truth.

• Even though the ee has a large variance for large k, the sample mean of r̂_k at each degree k is still remarkably close to the truth, suggesting that the ee is nearly unbiased.

• For a fixed number of nodes, it appears that when the ee makes larger errors, it tends to overestimate.

• The ee is not automatically monotone. However, we can slightly modify the estimator so that it remains consistent but always produces monotone results.



Figure 3.1. Boxplots of ee's in different settings. The three rows of the 9 panels refer to the pa functions f^(1), f^(2), f^(3) and the three columns refer to networks of three sizes (10,000, 100,000 and 1,000,000 nodes). In each panel the horizontal axis is the degree; the vertical axis gives the value of the preference function and a boxplot of the 1,000 estimates. The ground truth is marked in red.


To sum up, the estimator works as proven in the main Theorem 3.2, but the exact performance depends on the true pa function and the degree of interest: if the true pa function increases slowly with the degree, then it is easier to estimate the preference at low degrees and harder at high degrees, and vice versa.

3.5.1 Sample variance study

We again run 1,000 simulations of trees with the pa functions f^(1), f^(2), f^(3), but now only simulate networks of size 1,000,000. We apply the ee to each simulated network and calculate the sample variance of the 1,000 estimates for each degree k = 1, …, 70. The sample variances are plotted against the degrees in Figure 3.2. Different colors stand for different pa functions: red corresponds to f^(1), green to f^(2) and blue to f^(3). Both axes are on a log scale. Denoting the sample variance of the ee for degree k by s_k, inspection of these plots reveals the following observations:

• It appears that log s_k grows polynomially in log k. For f^(1), the affine pa function, log s_k looks affine in log k.

• The sample variance s_k characterizes, to a certain extent, the difficulty of estimating r_k.

– Looking at small k, we see that in the beginning s_k^(3) < s_k^(2) < s_k^(1).

– Then at about k = 17, the blue line (s^(3)) first crosses the green line (s^(2)), i.e., s_k^(2) < s_k^(3) < s_k^(1).

– The blue line continues on and crosses the red line (s^(1)) at around k = 18. This means s_k^(2) < s_k^(1) < s_k^(3).

– The green line crosses the red line at approximately k = 35, so from that point on s_k^(1) < s_k^(2) < s_k^(3).

• On the one hand, for small k, the slower f grows with k, the easier it is to estimate r_k, reflected in the observation that slower-growing f yields lower sample variance at small k. On the other hand, for large k, the faster f grows in k, the easier it is to estimate r_k.

• The shapes of the curves for different f's seem to indicate that the faster f grows with k, the slower log s_k grows with log k. The seemingly affine relations might be a consequence of the limiting power-law distribution.


The above observations seem intuitive, because faster growth of f means that more nodes of high degree appear in realizations, and more observations of high-degree nodes yield better estimates of the preferences at high degrees. However, as the total number of nodes is fixed, more nodes of high degree means fewer nodes of low degree, which results in larger variance when estimating r_k for small k.

Figure 3.2. Sample variance study of the ee for f^(1), f^(2) and f^(3) (log-log scale, n = 10^6).

3.5.2 Asymptotic normality of the ee with parametric rate?

We may wonder, for fixed k, what the asymptotic distribution of r̂_k(n) is as n → ∞. Perhaps it is asymptotically normal? If not, what else? To answer this question, we study simulation results here.

We fix the number of nodes at one million in all simulated networks for each pa function. Then we look at the ee's in each simulation. For each f, we draw the qq-plot (quantile-quantile plot) of the estimator for k = 2, 3, 4 against the normal distribution. The results are summarized in Figure 3.3. The pa function is the same within each row, and the degree k on which we conduct the ee study is the same within each column. Since the number of nodes is one million, we expect that the limiting distribution should have kicked in, assuming there is indeed a limiting distribution. We see that the qq-plots indicate the ee's look very much like normal distributions.


Figure 3.3. QQ-plots of r̂_2(n), r̂_3(n) and r̂_4(n) with n = 10^6 for f^(1), f^(2) and f^(3).

We suspect that, for fixed k, the following asymptotic normality result holds:

√n (r̂_k(n) − r_k) ⇝ N(0, σ_k²),    (3.19)

where σ_k² depends only on f and k. To see whether this might be correct, we conduct the following study. We fix the pa function to be f^(2), run simulations for the three network sizes 10,000, 100,000 and 1,000,000, and study the estimator of the preference on degree 2 only. If (3.19) is true, then the distribution of r̂_2(n) (where n is the network size) should remain roughly stable after centering and rescaling by √n. We summarize the results in Figure 3.4, where the label "r2 corrected" on the x-axis means that we plot √n (r̂_2(n) − r_2) (centered and rescaled at the parametric rate) instead of r̂_2. The sample variance of each simulation is given on top of each subplot.


Figure 3.4. Histograms of √n (r̂_2(n) − r_2) for f^(2) with network sizes 10^4, 10^5 and 10^6 (sample variances 33.235, 35.886 and 35.056, respectively).

If we perform density estimation on the data √n (r̂_2(n) − r_2) for the three different n's, we obtain Figure 3.5. As the sample variances, histograms and density plots all look rather stable after the √n-rescaling, we conjecture that (3.19) is true.


Figure 3.5. Estimated densities of √n (r̂_2(n) − r_2) for network sizes n = 10^4, 10^5 and 10^6.

