A framework for evaluating statistical dependencies and rank correlations in power law graphs





Yana Volkovich

University of Twente P.O. Box 217, 7500 AE Enschede, The Netherlands

y.volkovich@ewi.utwente.nl

Nelly Litvak

University of Twente P.O. Box 217, 7500 AE Enschede, The Netherlands

n.litvak@ewi.utwente.nl

Bert Zwart

Georgia Tech, 765 Ferst Drive, NW, Atlanta, Georgia 30332-0205

bertzwart@gatech.edu

ABSTRACT

We analyze dependencies in power law graph data (Web sample, Wikipedia sample and a preferential attachment graph) using statistical inference for multivariate regular variation. To the best of our knowledge, this is the first attempt to apply the well developed theory of regular variation to graph data. The new insights this yields are striking: the three above-mentioned data sets are shown to have a totally different dependence structure between different graph parameters, such as in-degree and PageRank. Based on the proposed methodology, we suggest a new measure for rank correlations. Unlike most known methods, this measure is especially sensitive to rank permutations for top-ranked nodes. Using this method, we demonstrate that the PageRank ranking is not sensitive to moderate changes in the damping factor.

Categories and Subject Descriptors

E.1 [Data structures]: Graphs and networks; G.3 [Probability and Statistics]: Multivariate statistics

General Terms

Algorithms, Experimentation, Measurement

Keywords

Regular variation, PageRank, Web, Wikipedia, Preferential attachment

1. INTRODUCTION

∗The work is supported by NWO Meervoud grant no. 632.002.401.

What do we know about the Web structure? There is a vast literature on the subject, but we are still far from a complete understanding. One point on which most researchers agree is the presence of power laws. In simple words, a power law with exponent α means that the probability of obtaining a value greater than x is proportional to $x^{-\alpha}$, where α > 0 is the power law exponent. The standard example of a power law is the Pareto distribution

$$P(X > x) = c\,x^{-\alpha}, \quad x > x_0. \qquad (1)$$

For excellent surveys on history, properties, modeling, and mining of power laws, and their role in complex networks we refer to e.g. [5, 14, 17, 18, 19].

A natural mathematical formalism for analyzing power laws is provided by the theory of regular variation. This theory has been developed in the context of the analysis of extremes [6], financial time series [16], and traffic in communication networks [21]. By definition, the distribution F has a regularly varying tail with index α if

$$P(X > x) = x^{-\alpha} L(x), \quad x > 0, \qquad (2)$$

where L(x) is a slowly varying function, that is, for x > 0, L(tx)/L(t) → 1 as t → ∞. For instance, L(x) may be equal to a constant or to log(x). Clearly, a power law can be modeled as an instance of regular variation.
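As a quick check of the definition in (2), the worked step below verifies that L(x) = log(x) is slowly varying, and contrasts it with a pure power function, which is not. It is only an illustration of the definition and uses no symbols beyond those in (1) and (2).

```latex
\frac{L(tx)}{L(t)} = \frac{\log t + \log x}{\log t}
  = 1 + \frac{\log x}{\log t} \;\longrightarrow\; 1 \quad (t \to \infty),
\qquad \text{whereas} \qquad
\frac{(tx)^{-\alpha}}{t^{-\alpha}} = x^{-\alpha} \neq 1 \quad \text{for } x \neq 1 .
```

The Pareto tail (1) corresponds to the constant choice L(x) = c, for which the ratio L(tx)/L(t) equals 1 for every t.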

In the present work, we employ statistical inference designed for regular variation, as described in Resnick [22], to analyze the dependencies in power law graphs. To the best of our knowledge, most of the proposed methods have never been applied to massive graph data. We consider in-degrees, out-degrees and PageRank scores in three large data sets: an EU-2005 Web sample, a Wikipedia sample and a Growing Network graph based on the preferential attachment model by Albert and Barabási [2]. The data sets are described in detail in Section 2.

It has become common knowledge that in-degree and PageRank in the Web graph obey power laws [3, 7, 20, 23]. The power law exponents can vary depending on the data set and the estimator but are believed to satisfy α ≈ 1.1. Similar behavior of in-degree and PageRank has been observed in Wikipedia [4, 23]. There is however no common agreement on the distribution of out-degrees in the Web. Whereas Broder et al. [3] observe a power law with exponent about 1.6, Donato et al. [7] claim that the out-degrees do not follow a power law. Remarkably, the conclusion on whether or not the data follows a power law often seems to be made purely by determining whether or not the log-log plot resembles the signature straight line. This however can be misleading, especially when a size-frequency plot is used [14]. Although one may agree with Li et al. [14] that a cumulative (size-rank) plot is enough to reveal a power law to an experienced eye, for more reliable conclusions on realistic noisy data we need more than just a glance at the log-log plots. Chakrabarti and Faloutsos [5] mention two goodness-of-fit methods for the Pareto distribution and suggest that such methods should be applied more often. In Section 3 below we aim at resolving these issues by using several state-of-the-art techniques from the statistical analysis of heavy tails, cf. the recent book of Resnick [22].

The question of measuring correlations in the Web data has led to many controversial results. Most notably, there is no agreement in the literature on the dependence between in-degree and PageRank of a Web page [7, 10]. In this respect, Chakrabarti and Faloutsos [5] confirm that measuring correlation in power law data is tricky because the important large values do not appear very often, and thus, the coefficient of correlation might give a wrong impression about the dependencies in the tails.

One of the main points we would like to make in this paper is that this merely confirms the common knowledge (in the extreme value theory community) that the correlation coefficient is an uninformative dependence measure in heavy-tailed data [5, 6, 22]. The correlation is a ‘crude summary’ of dependencies that is most informative for jointly normal random variables. It is a common and simple technique, but it is not subtle enough to distinguish between the dependencies in large and in small values. This is a particular problem if we want to measure the dependence between two heavy-tailed parameters X and Y. In that case, we are mainly interested in the dependence between the tails, i.e., between extremely large values of X and Y. Since such extremely large values are not encountered very often, the correlation coefficient cannot capture the tail dependencies. Thus, in this work we employ techniques from [22], a range of statistical procedures designed to deal with multivariate data whose marginal distributions exhibit power laws. In particular, this paper points out that this body of statistical theory contains a well-developed notion of dependence that is designed for power tails. This notion, called extremal dependence, seems much more suitable than standard correlation measures and, as the estimation results in this paper show, sheds new light on dependence properties in Web graphs.

Our experimental results reveal a dramatically different correlation structure in the three data sets. For instance, the results for in-degree and PageRank in Wikipedia strongly suggest independence between these two parameters. A similar analysis for the Web graph reveals a non-trivial dependence structure. Finally, a preferential attachment graph shows a very strong dependence between in-degree and PageRank.

The analysis of extremal dependence leads us to propose a new rank correlation measure which seems particularly suitable for bivariate power law data. The measure has the appealing property that small values in the data set are of limited influence. Thus, the measure is less sensitive to the choice of the number of considered upper order statistics than other statistics used in the analysis of heavy tails. Moreover, unlike most known methods for evaluating rank correlation, our proposed measure is especially sensitive to rank permutations for top-ranked nodes. We discuss our ideas on this matter in Section 5.

Analysis of dependencies in real-life graphs and synthetic data contributes towards a better understanding and modeling of complex graph structures. Clearly, for adequate modeling, it is not sufficient to maintain power laws. For instance, it was already argued in [8] that the robustness of the Internet power law router graph is in strong disagreement with a preferential attachment model. Likewise, our analysis clearly reveals a striking disagreement of the preferential attachment graph with the dependence structure of the Web and Wikipedia. Better models have to be sought and existing models have to be thoroughly analyzed before we can conclude that they adequately reflect important features of complex networks.

2. DATA SETS

We chose three data sets that represent different network structures. As the Web sample, we used the EU-2005 data set with 862.664 nodes and 19.235.140 links, which was collected by LAW [1]. We also performed experiments on the Wikipedia (English) data, whose structure is known to be different from the Web graph [4]. This data set contains 4.881.983 nodes and 42.062.836 links. Finally, we simulated a Growing Network by using the preferential attachment rule for 90% of new links [2]. The graph consists of 10.000 nodes with constant out-degree d = 8. In Figure 1 we show the cumulative log-log plots for in-degrees, out-degrees and PageRank scores in all data sets.

[Figure 1, panels (a)-(c): x-axis: in/out-degree, PageRank; y-axis: fraction of pages; series: in-degree, out-degree, PageRank (c=0.5), PageRank (c=0.85).]

Figure 1: Cumulative log-log plots for in/(out)-degree, PageRank (c=0.5) and PageRank (c=0.85): (a) EU-2005, (b) Wikipedia, (c) Growing Network.

The PageRank scores in the network of n nodes are computed according to the classical definition [12]:

$$PR(i) = c \sum_{j \to i} \frac{1}{d_j} PR(j) + \frac{c}{n} \sum_{j \in D} PR(j) + \frac{1 - c}{n}, \quad i = 1, \ldots, n,$$

where PR(i) is the PageRank of page i, $d_j$ is the number of outgoing links of page j, the first sum is taken over all pages j that link to page i, D is the set of dangling nodes, and c is the damping factor, which equals 0.5 or 0.85 in our experiments. Throughout the paper we use the scaled PageRank scores R(i) = n PR(i), where i = 1, . . . , n.
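The PageRank recursion above can be solved by simple fixed-point iteration. The following is a minimal sketch under stated assumptions, not the code used for the experiments: the function name `pagerank`, the adjacency-list input, the stopping rule and the four-node toy graph are all hypothetical choices made for illustration.

```python
def pagerank(out_links, c=0.85, tol=1e-12, max_iter=1000):
    """Iterate PR(i) = c*sum_{j->i} PR(j)/d_j + (c/n)*sum_{j in D} PR(j) + (1-c)/n.

    out_links[j] lists the targets of the outgoing links of node j;
    a node with an empty list is treated as dangling (it belongs to D).
    """
    n = len(out_links)
    pr = [1.0 / n] * n                      # uniform starting vector
    for _ in range(max_iter):
        dangling = sum(pr[j] for j in range(n) if not out_links[j])
        base = (c * dangling + (1.0 - c)) / n   # terms shared by every node
        new = [base] * n
        for j, targets in enumerate(out_links):
            if targets:
                share = c * pr[j] / len(targets)    # c * PR(j) / d_j
                for i in targets:
                    new[i] += share
        diff = sum(abs(a - b) for a, b in zip(new, pr))
        pr = new
        if diff < tol:                      # assumed L1 stopping rule
            break
    return pr


# Hypothetical toy graph: node 3 has no outgoing links (dangling).
graph = [[1, 2], [2], [0], []]
pr = pagerank(graph, c=0.85)
scaled = [len(graph) * p for p in pr]       # scaled scores R(i) = n * PR(i)
```

Each iteration preserves the total mass of the score vector, so the scores PR(i) sum to 1 and the scaled scores R(i) sum to n.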

The log-log plots for in-degree and PageRank in Figure 1 resemble the signature straight line indicating power laws. However, several techniques should be combined in order to establish the presence of heavy tails and to evaluate the power law exponent. Using QQ plots, Hill and altHill plots as well as Pickands plots, we will confirm that the in-degree and PageRank (c=0.85) follow power laws with similar exponents for all three data sets. We will also conclude that the out-degree can be modeled reasonably well as a power law with exponent around 2.5-3.

Although all plots in Figure 1 look alike, it does not imply that the three networks have identical structure. One of the goals of the present work is to rigorously examine the dependencies between the network parameters.

3. EVALUATING THE POWER LAWS

Consider non-negative observations $X_1, \ldots, X_n$ and write $X_{(i)}$ for the ith largest value of $X_1, \ldots, X_n$, where 1 ≤ i ≤ n:

$$X_{(1)} \ge X_{(2)} \ge \ldots \ge X_{(n)}. \qquad (3)$$

In the next sections we will provide a review of some estimation techniques designed under the assumption that $X_1, \ldots, X_n$ are independent random variables having an identical regularly varying distribution with tail index α, as defined in (2). The idea is to apply several different procedures and make sure that they lead to the same conclusion.

3.1 Hill plot

The Hill estimator $H_{k,n}$ is a widely used estimator of 1/α that is based on k upper order statistics:

$$H_{k,n} = \frac{1}{k} \sum_{i=1}^{k} \log \frac{X_{(i)}}{X_{(k+1)}}.$$

It was proved (see e.g. [22]) that $H_{k,n}$ converges in probability to 1/α as n, k → ∞, k/n → 0. An obvious problem with the Hill estimator is choosing the value k so that $X_{(k)}$ corresponds to a ‘beginning’ of the power law tail. This can be mitigated by constructing a so-called Hill plot.

To make a Hill plot for α we graph $\{(k, H_{k,n}^{-1}), 1 \le k \le n\}$, and if the plot looks stable around a certain horizontal line, we can pick the corresponding value of α. This sometimes works beautifully, especially for data close to pure Pareto tails. However, if L(x) in (2) deviates considerably from a constant there may be enormous errors. The Hill plot, as well as the Hill estimator, is also not location invariant. Theoretically, a shift does not affect the power law exponent; however, it drastically distorts the Hill plot. Clearly, when the Hill plot does not look stable, the Hill estimator cannot be used for the evaluation of α.
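The Hill estimator and the points of a Hill plot can be computed in a few lines. The sketch below assumes strictly positive observations (zero degrees would need to be shifted or dropped first); the function names and the synthetic Pareto sample are hypothetical and serve only to illustrate the definition.

```python
import math
import random


def hill_estimator(data, k):
    """Return H_{k,n}, the mean of log(X_(i) / X_(k+1)) over the k largest values."""
    xs = sorted(data, reverse=True)         # X_(1) >= X_(2) >= ... >= X_(n)
    if not 1 <= k < len(xs):
        raise ValueError("need 1 <= k < n so that X_(k+1) exists")
    threshold = xs[k]                       # X_(k+1) in 1-based notation
    return sum(math.log(x / threshold) for x in xs[:k]) / k


def hill_plot_points(data, ks):
    """Points (k, 1 / H_{k,n}) of the Hill plot for alpha."""
    return [(k, 1.0 / hill_estimator(data, k)) for k in ks]


# Synthetic check on an exact Pareto sample with alpha = 1.1 (illustrative only).
rng = random.Random(0)
sample = [rng.paretovariate(1.1) for _ in range(20000)]
points = hill_plot_points(sample, range(100, 2001, 100))
```

On exact Pareto data the resulting points should hover around the true exponent for a wide range of k; on real data the stable region, if any, has to be found by inspecting the plot.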

To construct confidence intervals for the Hill estimator, Newman [19] suggests using a bootstrap method for estimating the variance of $H_{k,n}^{-1}$. A simpler way is to use the asymptotic normality of the Hill estimator: $\sqrt{k}\,(H_{k,n} - 1/\alpha)$ converges in distribution to a normal random variable with mean 0 and variance $1/\alpha^2$ as n, k → ∞, k/n → 0 (see [22, p.304]). Thus, one can obtain confidence intervals based on the quantiles of the standard normal distribution.
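A confidence interval of this kind can be sketched as follows. The sketch assumes the normal limit is used in the centered form, that is, $\sqrt{k}\,(\alpha H_{k,n} - 1)$ is approximately standard normal, which gives the interval $(1 \pm z/\sqrt{k})/H_{k,n}$ for α. The function name and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist


def hill_confidence_interval(h, k, level=0.95):
    """Approximate confidence interval for alpha given H_{k,n} = h.

    Uses the approximation sqrt(k) * (alpha * h - 1) ~ N(0, 1), so that
    alpha lies in [(1 - z/sqrt(k)) / h, (1 + z/sqrt(k)) / h].
    """
    if h <= 0 or k < 1:
        raise ValueError("need h > 0 and k >= 1")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # standard normal quantile
    half = z / math.sqrt(k)
    return (1.0 - half) / h, (1.0 + half) / h


# Hypothetical input: a Hill estimate 1/1.1 obtained from k = 1000 order statistics.
low, high = hill_confidence_interval(1.0 / 1.1, 1000)
```

The width of the interval shrinks like $1/\sqrt{k}$, which is one reason why the choice of k matters: a larger k narrows the interval but risks including observations that are not yet in the power law regime.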

One can also display the Hill plot in the alternative form $\{(\theta, H^{-1}_{\lceil n^{\theta} \rceil, n}), 0 \le \theta \le 1\}$, where ⌈x⌉ is the smallest integer greater than or equal to x ≥ 0. This plot is called the alternative Hill plot, altHill. Compared to the Hill plot, the altHill shows the largest order statistics more prominently. According to [22], if the distribution is not exactly Pareto, then the altHill spends more time in the small neighborhood of α than the Hill plot.

Below we display Hill and altHill plots for EU-2005 (Figure 2), Growing Network (Figure 3) and Wikipedia (Figure 4). The sawtooth pattern for in-degrees and out-degrees reflects the fact that we deal with integer values that are the same for quite large groups of nodes.

[Figure 2, panels (a)-(d): Hill plots (x-axis: number of order statistics; y-axis: Hill estimate of alpha) and altHill plots (x-axis: theta; y-axis: Hill estimate of alpha).]

Figure 2: EU-2005 data set: Hill plot (left) and altHill plot (right) for (a) in-degree, (b) out-degree, (c) PageRank (c=0.5), and (d) PageRank (c=0.85).

In the Web data, the Hill plots confirm the power law tail of in-degree and PageRank (c=0.85). The exponent α seems to be the same in both cases. However, the estimate 1.1 appears to be, on average, on the high side. The oscillations between 0.9 and 1.2 matter, since α = 0.9 implies an infinite mean. The altHill plot is stable for θ between 0.4 and 0.9. The beginning of the plot is most probably distorted by the well-known exponential cut-off of real-life data [5], and for θ > 0.9 the number of order statistics used is too large.

[Figure 3, panels (a)-(c): Hill plots (x-axis: number of order statistics; y-axis: Hill estimate of alpha) and altHill plots (x-axis: theta; y-axis: Hill estimate of alpha).]

Figure 3: Growing Network data set: Hill plot (left) and altHill plot (right) for (a) in-degree, (b) PageRank (c=0.5), and (c) PageRank (c=0.85).

[...] nice. The plot for in-degree is more stable, as it spends significant time around the line α = 1.1. The plot for PageRank (c=0.85) also behaves well and seems to suggest a slightly smaller tail index, around 1.05. From the plots we see that the estimator for α is very sensitive to the choice of k. Thus, constructing a Hill plot is a helpful step when applying the Hill estimator.

The Hill and altHill plots suggest that the in-degree and PageRank in the Web and in the Growing Network are heavy-tailed but not exactly Pareto. Indeed, the plots look relatively stable, but it is difficult to single out a value of α.

For the out-degree in the Web data, the altHill plot oscillates considerably. However, the Hill plot does not behave nearly as badly as it would, for instance, for the exponential distribution (see the example in [22, p.96]). Based on the Hill plot, one may therefore conclude that the out-degree follows a power law.

Finally, Wikipedia turns out to be an example of perfect Hill plots, whereas the altHill plot shows large oscillations. We conclude that in-degree and PageRank (c=0.85) in Wikipedia follow closely a Pareto distribution with index 1.2. The index of the PageRank (c=0.5) distribution is around 1.4. The out-degree is also Pareto, with index about 1.6.

3.2 Pickands plot

A Pickands estimator, as presented in [22], is another way to evaluate α and reveal the presence of power laws. We first introduce the extreme-value distributions, defined as

$$G_{\gamma}(x) = \exp\left\{-(1 + \gamma x)^{-1/\gamma}\right\}, \quad \gamma \in \mathbb{R}, \quad 1 + \gamma x > 0.$$

The power law case corresponds to γ > 0 and then γ = 1/α.

Suppose $\{X_i, i \ge 1\}$ are i.i.d. with common distribution F. The Pickands estimator is derived under the condition that the distribution F is in the domain of attraction of the extreme-value distribution $G_{\gamma}$, that is, there exist a(n) > 0, b(n) ∈ R such that $n P[X_1 > a(n)x + b(n)] \to -\log G_{\gamma}(x)$ as n → ∞, for γ > 0, x ∈ (−1/γ, ∞).

[Figure 4, panels (a)-(d): Hill plots (x-axis: number of order statistics; y-axis: Hill estimate of alpha) and altHill plots (x-axis: theta; y-axis: Hill estimate of alpha).]

Figure 4: Wikipedia data set: Hill plot (left) and altHill plot (right) for (a) in-degree, (b) out-degree, (c) PageRank (c=0.5), and (d) PageRank (c=0.85).

The Pickands estimator of γ uses differences of quantiles, where the latter are estimated by means of three upper order statistics, $X_{(k)}$, $X_{(2k)}$, $X_{(4k)}$, from a sample of size n. The estimator is defined as

$$\hat{\gamma}^{(Pickands)}_{k,n} = \frac{1}{\log 2} \log \frac{X_{(k)} - X_{(2k)}}{X_{(2k)} - X_{(4k)}}.$$

Determining an appropriate value of k is again an important issue. Unlike the Hill estimator, the Pickands estimator is both location and scale invariant.

Similarly to the Hill plot, a Pickands plot consists of the points $\{(k, \hat{\gamma}^{(Pickands)}_{k,n}), 1 \le k < n/4\}$. A difficulty in constructing Pickands plots for integer-valued observations, such as in-degrees and out-degrees in the networks, is that the values of order statistics might be identical, resulting in division by zero. To fix this problem we introduce a randomization of the data by adding uniformly (0, 1) distributed random variables to each of the observations.
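The Pickands estimator together with the randomization step can be sketched as follows. The function names, the choice of independent uniform draws from Python's `random` module, and the synthetic integer-valued sample are assumptions made for illustration only.

```python
import math
import random


def pickands_estimator(data, k):
    """Pickands estimate of gamma from X_(k), X_(2k), X_(4k) (descending order)."""
    xs = sorted(data, reverse=True)         # X_(1) >= X_(2) >= ... >= X_(n)
    if k < 1 or 4 * k > len(xs):
        raise ValueError("need 1 <= k <= n/4")
    xk, x2k, x4k = xs[k - 1], xs[2 * k - 1], xs[4 * k - 1]
    # Raises an error on ties (zero differences); randomize integer data first.
    return math.log((xk - x2k) / (x2k - x4k)) / math.log(2.0)


def randomize(data, rng):
    """Add a Uniform(0,1) draw to each observation to break integer ties."""
    return [x + rng.random() for x in data]


# Synthetic integer-valued heavy-tailed sample (illustrative, alpha = 1.1).
rng = random.Random(1)
sample = [float(math.floor(rng.paretovariate(1.1))) for _ in range(20000)]
jittered = randomize(sample, rng)
gamma_hat = pickands_estimator(jittered, 500)   # estimates 1/alpha
```

Because the estimator uses only differences of order statistics, adding a constant to all observations leaves it unchanged, which is the location invariance mentioned above.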

The Pickands plots for our data sets are presented in Figure 5 below. Note that we plot the values of $\hat{\gamma}^{(Pickands)}_{k,n}$, which estimates 1/α. The results for in-degree and PageRank in all three data sets are in good agreement with the Hill plots. New information comes from the plot for out-degree in the Web data. Here a large part of the Pickands plot shows γ < 0, which signals light tails. This is in agreement with Donato et al. [7] and other papers that claim that the out-degree data does not follow a power law. On the other hand, the Pickands plot goes below zero only for quite large values of k, so we still cannot exclude the power law tail.

[Figure 5, panels (a)-(c): x-axis: number of order statistics; y-axis: Pickands estimate of gamma; series: in-degree, out-degree, PageRank (c=0.5), PageRank (c=0.85).]

Figure 5: Pickands plots for in/(out)-degrees and PageRank: (a) EU-2005, (b) Wikipedia, (c) Growing Network.

3.3 QQ plot

Suppose we have a hypothesis that the true distribution function producing the data is F(x). A goodness-of-fit test provides a rigorous way to verify such a hypothesis, whereas the QQ plot is a more informal but convenient alternative. To construct a QQ plot we graph the theoretical quantiles of F versus the sample quantiles:

$$\left\{\left(F^{\leftarrow}\left(\frac{i}{n+1}\right), X_{(n-i+1)}\right), 1 \le i \le n\right\},$$

where $F^{\leftarrow}(y) = \inf\{x : F(x) \ge y\}$ is the inverse of the distribution function F. If our hypothesis is true, then the result should fall roughly on the straight line {(x, x), x > 0}. One potential problem is how to decide what we consider ‘close enough’ to linear.

[Figure 6, panels (a)-(c): x-axis: quantiles of exponential; y-axis: log-sorted data.]

Figure 6: Growing Network data set: QQ lines for (a) in-degree, 1000 (α = 1.06) upper-statistics; (b) PageRank (c=0.5), 1000 (α = 1.19) upper-statistics; (c) PageRank (c=0.85), 1500 (α = 1.05) upper-statistics.

To apply QQ plots to power laws, suppose that our null hypothesis is that for some $x_0 > 0$, the distribution of the random variable X satisfies

$$P(X > x) = \left(\frac{x}{x_0}\right)^{-\alpha},$$

so it follows that $P(\log X > y) = e^{-\alpha (y - \log x_0)}$. Hence, using quantiles of the exponential distribution we plot

$$\left\{\left(-\log\left(1 - \frac{i}{n+1}\right), \log X_{(n-i+1)}\right), 1 \le i \le n\right\}.$$

The slope of the least-squares line fitted to the QQ plot is an estimate of 1/α. Thus, if $\{(x_i, y_i), 1 \le i \le n\}$ are n points on the plane, we can calculate the slope in the standard way

$$SL\{(x_i, y_i), 1 \le i \le n\} = S_{xy}/S_{xx},$$

where $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$, $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$, and $\bar{x}$ denotes the mean value of x. Now we can define the QQ estimator for 1/α based on k upper order statistics as

$$SL\left\{\left(-\log\left(1 - \frac{i}{n+1}\right), \log X_{(n-i+1)}\right), n - k + 1 \le i \le n\right\}.$$

Clearly, there remains the problem of choosing k.
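The QQ estimator defined above amounts to an ordinary least-squares slope through the upper k points of the exponential QQ plot. The sketch below assumes strictly positive data; the function name and the two synthetic inputs (one placed exactly on a line of slope 0.5, one drawn from a Pareto distribution) are hypothetical illustrations.

```python
import math
import random


def qq_estimator(data, k):
    """Least-squares slope S_xy / S_xx over the k upper QQ points; estimates 1/alpha."""
    xs = sorted(data)                       # ascending, so xs[i - 1] = X_(n-i+1)
    n = len(xs)
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    pts = [(-math.log(1.0 - i / (n + 1.0)), math.log(xs[i - 1]))
           for i in range(n - k + 1, n + 1)]
    mean_x = sum(x for x, _ in pts) / k
    mean_y = sum(y for _, y in pts) / k
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    s_xx = sum((x - mean_x) ** 2 for x, _ in pts)
    return s_xy / s_xx


# Data placed exactly on the theoretical quantiles of a tail with 1/alpha = 0.5.
n = 50
exact = [math.exp(0.5 * -math.log(1.0 - i / (n + 1.0))) for i in range(1, n + 1)]
slope_exact = qq_estimator(exact, 20)

# Synthetic Pareto sample with alpha = 1.1 (illustrative only).
rng = random.Random(2)
sample = [rng.paretovariate(1.1) for _ in range(20000)]
alpha_hat = 1.0 / qq_estimator(sample, 2000)
```

In the first example the log-data are a linear function of the exponential quantiles, so the fitted slope recovers 0.5 up to rounding; in the second the slope is only an estimate and depends on the chosen k.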

The QQ plots for our data are presented in Figures 6 and 7 for two choices of k. Again, the data on in-degree and PageRank resulted in QQ plots close to a straight line, and the estimates for α are close to what we expected. Thus, in these cases all techniques point to the same result.

With a certain amount of tolerance, we can accept that the QQ plot for out-degrees in the Web data in Figure 7(b)(left) is close enough to a straight line. Moreover, the estimated α = 2.95 is in good agreement with the Hill plot. We also note that α > 2 implies a finite variance, while power law models are especially important when the variance is infinite, reflecting high variability [14, 21]. Hence, in the case of a finite variance, it is not really crucial whether the data obeys a power law. To exclude the possibility of an exponential tail of the out-degree, we also constructed a QQ plot with exponential quantiles by plotting $-\log(1 - i/(n + 1))$ against $X_{(n-i+1)}$. The result, which we do not present here, is nowhere near a straight line. To summarize, the out-degree has a finite variance and a tail heavier than exponential, so it can be modeled reasonably well as a power law with exponent around 2.5-3, according to our estimates.

[Figure 7, panels (a)-(d): x-axis: quantiles of exponential; y-axis: log-sorted data.]

Figure 7: QQ lines for EU-2005 (left) and Wikipedia (right) data sets: (a) in-degree, 150.000 (α = 1.11) and 500.000 (α = 1.18) upper-statistics; (b) out-degree, 100.000 (α = 2.95) and 300.000 (α = 1.59) upper-statistics; (c) PageRank (c=0.5), 50.000 (α = 1.16) and 250.000 (α = 1.40) upper-statistics; (d) PageRank (c=0.85), 300.000 (α = 1.08) and 500.000 (α = 1.20) upper-statistics.

4. EXTREMAL DEPENDENCIES

The goal of this section is to measure the dependencies between in-degree and PageRank (c=0.5 and 0.85), in-degree and out-degree, and out-degree and PageRank (c=0.85) in our data sets. In Section 4.1 we explain the methodology and perform preliminary computations. The results on the dependence structure in our three data sets are presented in Section 4.2.

4.1 Angular Measure

Suppose we are interested in analyzing the dependencies between two regularly varying characteristics of a node, X and Y. Let $X_j$ and $Y_j$ be observations of X and Y for the corresponding node j. Following [22], we start by using the rank transformation of (X, Y), leading to $\{(r^x_j, r^y_j), 1 \le j \le n\}$, where $r^x_j$ is the descending rank of $X_j$ in $(X_1, \ldots, X_n)$ and $r^y_j$ is the descending rank of $Y_j$ in $(Y_1, \ldots, Y_n)$. Next we choose k ∈ {1, . . . , n} and apply the polar coordinate transform as follows:

$$\mathrm{POLAR}\left(\frac{k}{r^x_j}, \frac{k}{r^y_j}\right) = (R_{j,k}, \Theta_{j,k}), \qquad (4)$$

where $\mathrm{POLAR}(x, y) = \left(\sqrt{x^2 + y^2}, \arctan(y/x)\right)$.

Now we need to consider the points $\{\Theta_{i,k} : R_{i,k} > 1\}$ and make a plot of the cumulative distribution function of Θ. In other words, we are interested in the angular measure, i.e. in the empirical distribution of Θ for the k largest values of R. Thus, unlike the correlation coefficient, the angular measure provides a subtle characterization of the dependencies in the tails of X and Y, or extremal dependencies. If this measure is concentrated around π/4, then we observe a tendency toward complete dependence, where a large value of X appears simultaneously with a large value of Y. In the opposite case, when such large values almost never appear together, we have either a large value of X or a large value of Y; hence, Θ should be around 0 or π/2. Intermediate cases can be read as a tendency toward dependence or toward independence.

It was proved in [22] that the empirical measure converges to a proper distribution on [0, π/2] as n, k → ∞, k/n → 0. That is, ideally, we need to consider only a relatively small part of a large data set.
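The rank transformation (4) and the selection of angles with radius above 1 can be sketched as follows. The function names are hypothetical, ties in the ranks are broken by position (an assumption, since the text does not specify a rule), and the two toy inputs are chosen to show the two extreme cases described above.

```python
import math


def descending_ranks(values):
    """Rank 1 for the largest value; ties are broken by position (assumed rule)."""
    order = sorted(range(len(values)), key=lambda j: values[j], reverse=True)
    ranks = [0] * len(values)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks


def angular_sample(x, y, k):
    """Angles Theta_{j,k} of the points whose radius R_{j,k} exceeds 1."""
    rx, ry = descending_ranks(x), descending_ranks(y)
    thetas = []
    for a, b in zip(rx, ry):
        u, v = k / a, k / b                 # rank-transformed coordinates
        if math.hypot(u, v) > 1.0:          # keep only the extreme points
            thetas.append(math.atan2(v, u)) # equals arctan(v/u) since u, v > 0
    return thetas


x = list(range(1, 101))
same_order = angular_sample(x, [2 * v for v in x], 10)      # identical rankings
opposite_order = angular_sample(x, x[::-1], 10)             # reversed rankings
```

With identical rankings every selected angle equals π/4 (complete dependence), while with reversed rankings the selected angles cluster near 0 and π/2, which is the pattern associated with independent tails.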

In practice the problem remains: how to choose a suitable value of k? In the case of bivariate data, this can be determined by making a Starica plot. We consider radii $R_{1,k}, \ldots, R_{n,k}$ from (4) and rank them in descending order $R_{(1)} \ge \ldots \ge R_{(n)}$ as before. To get a Starica plot we graph

$$\left\{\left(\frac{R_{(j)}}{R_{(k)}}, \frac{R_{(j)}}{R_{(k)}} \cdot \frac{j}{k}\right), 1 \le j \le n\right\},$$

or

$$\left\{\left(R_{(j)}, \frac{R_{(j)}\, j}{\sum_{i=1}^{n} 1_{\{R_{i,k} \ge 1\}}}\right), 1 \le j \le n\right\}.$$

The idea is that for a suitable k the ratio in the ordinate should be roughly constant and equal to 1 for values of the abscissa in the neighborhood of 1. The plot looks different for different values of the parameter k, and one can either find a suitable k by trial and error or use numerical algorithms to compute an optimal k. A Starica plot for a good k will have a region in the right neighborhood of x = 1 where the plot is hugging the y = 1 line. If the curve rises steeply at x = 1, then the chosen k is too large. On the other hand, if the graph stabilizes around y = 1 for some x < 1, then k is too small, and we miss some valuable tail data. We refer to Resnick [22] for more details and references.
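The first form of the Starica plot is straightforward to compute from a list of radii. In the sketch below the function name is hypothetical, and the toy radii R_(j) = n/j are an idealized input chosen so that the tail count decays exactly like 1/x; for such input the ordinate equals 1 everywhere, which is the behavior a good k should reproduce near x = 1.

```python
def starica_points(radii, k):
    """Points (R_(j)/R_(k), (R_(j)/R_(k)) * j/k) for the descending radii."""
    rs = sorted(radii, reverse=True)        # R_(1) >= R_(2) >= ... >= R_(n)
    if not 1 <= k <= len(rs):
        raise ValueError("need 1 <= k <= n")
    rk = rs[k - 1]                          # R_(k)
    return [(r / rk, (r / rk) * j / k) for j, r in enumerate(rs, start=1)]


# Idealized radii with an exact 1/x tail (illustrative only).
n = 1000
ideal_radii = [n / j for j in range(1, n + 1)]
points = starica_points(ideal_radii, 50)
```

On real data the ordinate is compared with the line y = 1 only in a neighborhood of x = 1; systematic departures there indicate that k should be decreased or increased, as described above.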

After some experiments, we chose appropriate values of k for the four pairs (in-degree, PageRank (c=0.85)), (in-degree, PageRank (c=0.5)), (in-degree, out-degree), and (out-degree, PageRank (c=0.85)) in our data sets. The corresponding Starica plots are presented in Figure 8(a,b) and Figure 9(a-d). The good news is that the plots for in-degree/PageRank behave nicely in all three data sets, which makes our angular measure more reliable. The Growing Network exhibits an ideal Starica plot (Figure 8). Surprisingly bad behavior is seen in the plot for in-degree/out-degree in Wikipedia (Figure 9(d)(right)), where the Starica curve wanders well off the y = 1 line.

4.2 Dependence measurements on the data

After defining a suitable k, we compute the pairwise angular measure. In Figure 10 we depict θ ∈ [0, π/2] against the fraction of observations where the angle Θ is greater than or equal to θ.

[Figure 8, panels (a)-(c): x-axis: scaling constant; y-axis: scaling ratio; (a) k=6.000, (b) k=6.000, (c) k=5.000.]

Figure 8: Growing Network data set: Starica plot for (a) in-degree and PageRank (c=0.85); (b) in-degree and PageRank (c=0.5); (c) PageRank (c=0.5) and PageRank (c=0.85).

The results are striking. Let us look first at Figure 10(a,b), which characterizes the dependence between in-degree and PageRank. For the Wikipedia data set we observe that about half of the observations are concentrated around 0 whereas the other half is close to π/2. This suggests an independence of the tails of in-degree and PageRank (c=0.85 and c=0.5). That is, in the Wikipedia data set an extremely high in-degree almost never implies an extremely high ranking. The picture is completely the opposite for the Growing Network, where the angular measure is entirely concentrated around π/4, indicating a complete dependence. Thus, in highly centralized preferential attachment graphs, the most connected nodes are also the most highly ranked.

Finally, the Web graph exhibits a subtle dependence structure that results in an angular measure which is almost uniform on [0, π/2]. This suggests that the PageRank popularity measure cannot be replaced by in-degree without significant disturbance in the ranking (of course, in-degree cannot be used as a popularity measure for many other reasons, for instance, because it is easy to spam by creating link farms; we refer to [13] for further discussion of PageRank and other popularity measures).

The picture is different in Figure 10(c), where we depict the angular measure for in-degree and out-degree in the Web and in Wikipedia. In the Web, the in- and out-degree tend to be independent, which justifies the distinction between hubs and authorities [11]. In Wikipedia the in- and out-degrees are dependent, but this dependence is not absolute.

Finally, the dependence between out-degree and PageRank in the Web and Wikipedia in Figure 10(d) resembles the patterns observed for in-degree and PageRank.

5. RANK CORRELATION

In this section, we introduce a new method for measuring correlations between ranking orders in power law graphs. The proposed correlation measure is based on the extremal dependencies technique, presented in Section 4.

5.1 The Θ rank correlation measure

We start by noting that the angular measure described in Section 4.1 is in fact based on a rank transformation. This is clearly seen from formula (4), where only the rank of the parameters X and Y plays a role. This observation naturally leads to a new measure for rank correlations.

Figure 9: EU-2005 (left) and Wikipedia (right) data sets: Starica plot for (a) in-degree and PageRank (c=0.85); (b) in-degree and PageRank (c=0.5); (c) out-degree and PageRank (c=0.85); (d) in-degree and out-degree; (e) PageRank (c=0.5) and PageRank (c=0.85). [Plots omitted: scaling ratio against scaling constant. Values of k, EU-2005 / Wikipedia: (a) 100.000 / 600.000; (b) 150.000 / 600.000; (c) 200.000 / 600.000; (d) 30.000 / 200.000; (e) 300.000 / 2.000.000.]

In summary, our idea is as follows. As before, we define $r^1_i$ and $r^2_i$ as the ranking order of page i in scheme 1 and 2, respectively, where i = 1, . . . , n. Now we suggest representing the difference between the two rank positions of i by the angle

$\Theta_i = \arctan(r^1_i / r^2_i)$.

For example, in Figure 11, $\Theta_i$ is depicted for a node that has rank 3 in scheme 1 and rank 6 in scheme 2. Note that this is exactly the angle in (0, π/2) computed in (4) in order to construct the angular measure. A value of Θ close to π/4 means a relatively small change in ranking. On the other hand, Θ around π/2 means that node i is much better off with scheme 2, and a value close to 0 says that the node is ranked much higher by scheme 1. Thus, we actually measure the rank difference for node i in radians! Having computed $\Theta_i$ for every i (or for a certain set of highly ranked nodes i) we construct a corresponding empirical cumulative distribution function for Θ. As in the previous section, the resulting angular measure can be used to characterize the rank correlations.

Figure 10: Cumulative functions for Angular Measures: (a) in-degree and PageRank (c=0.85); (b) in-degree and PageRank (c=0.5); (c) in-degree and out-degree; (d) out-degree and PageRank (c=0.85). [Plots omitted: fraction of pages against Θ; curves for GRNet, EU-2005 and Wikipedia in (a) and (b), and for EU-2005 and Wikipedia in (c) and (d).]
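The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `theta_angles` and `ecdf` and the ten-node toy rankings are hypothetical, and `ecdf` uses the conventional definition (fraction of angles less than or equal to a threshold), which may differ from the plotting convention used for the figures.

```python
import math

def theta_angles(rank1, rank2):
    """Return Theta_i = arctan(r1_i / r2_i) for paired rank positions (1-based)."""
    if len(rank1) != len(rank2):
        raise ValueError("rank lists must have equal length")
    # For positive ranks, atan2(r1, r2) equals arctan(r1 / r2) and lies in (0, pi/2).
    return [math.atan2(r1, r2) for r1, r2 in zip(rank1, rank2)]

def ecdf(values, x):
    """Fraction of values that are less than or equal to x."""
    return sum(v <= x for v in values) / len(values)

# Hypothetical toy data: ten nodes, ranks 2 and 9 exchanged by scheme 2.
rank1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rank2 = [1, 9, 3, 4, 5, 6, 7, 8, 2, 10]
thetas = theta_angles(rank1, rank2)

# Evaluate the empirical distribution on a small grid of angles in (0, pi/2).
grid = [j * math.pi / 16 for j in range(1, 8)]
curve = [ecdf(thetas, x) for x in grid]
```

Nodes whose rank is unchanged contribute an angle of π/4, so a steep jump of the curve at π/4 indicates strongly correlated rankings.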

In order to illustrate the proposed methodology, consider the scatter plot of ranking order 1 against ranking order 2 (see Figure 11). When two ranks are the same (like the node ranked 1 in the example) then the corresponding point lies on the diagonal. On the other hand, if there is a considerable disturbance in ranking (for instance, in the example, ranks 2 and 9 are swapped) then we immediately see a considerable deviation from the diagonal.

Compared to common rank correlation measures such as Kendall’s τ and Spearman’s ρ, our proposed measure has the important advantage that it reveals the slightest rank disturbance among highly ranked nodes while neglecting even moderate perturbations among lowly ranked nodes. Indeed, if we swap ranks 1 and 10, we get Θ = arctan(1/10) ≈ 0.1, which is close to the x-axis and is a visible deviation from π/4. On the other hand, swapping ranks 1000 and 1010 yields Θ = arctan(1000/1010) ≈ π/4. In other words, the Θ rank correlation measure actually evaluates the rank disturbance visible to users. Certainly, the arctan(·) function makes our measure symmetric with respect to schemes 1 and 2.
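The contrast with Kendall’s τ can be checked numerically. The sketch below is illustrative only: it assumes a list of 1010 nodes with no ties, uses a naive quadratic-time implementation of τ, and the helper names `kendall_tau`, `max_theta_deviation` and `swapped` are hypothetical.

```python
import math
from itertools import combinations

def kendall_tau(rank1, rank2):
    """Naive Kendall's tau for rankings without ties (O(n^2) pairs)."""
    n = len(rank1)
    balance = 0  # number of concordant pairs minus number of discordant pairs
    for i, j in combinations(range(n), 2):
        s = (rank1[i] - rank1[j]) * (rank2[i] - rank2[j])
        balance += (s > 0) - (s < 0)
    return balance / (n * (n - 1) / 2)

def max_theta_deviation(rank1, rank2):
    """Largest |Theta_i - pi/4| over all nodes, with Theta_i = arctan(r1_i / r2_i)."""
    return max(abs(math.atan2(a, b) - math.pi / 4) for a, b in zip(rank1, rank2))

def swapped(n, a, b):
    """Identity ranking 1..n in which the nodes ranked a and b exchange places."""
    ranks = list(range(1, n + 1))
    ranks[a - 1], ranks[b - 1] = b, a
    return ranks

n = 1010
base = list(range(1, n + 1))
top_swap = swapped(n, 1, 10)       # disturbance among the top-ranked nodes
low_swap = swapped(n, 1000, 1010)  # disturbance among lowly ranked nodes

tau_top, tau_low = kendall_tau(base, top_swap), kendall_tau(base, low_swap)
dev_top = max_theta_deviation(base, top_swap)
dev_low = max_theta_deviation(base, low_swap)
```

Both swaps leave τ very close to 1, whereas the largest deviation of Θ from π/4 is large for the top swap and nearly zero for the bottom swap.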

Figure 11: Rank Correlation. [Plot omitted: scatter plot of Rank 1 against Rank 2, both axes running from 0 to 10.]

Naturally, in this framework, it is also possible to compute such an angular measure only for the top ranked pages. This can be done along the same lines as in Section 4.1, as follows. Based on the polar transformation (4) we can separate the top ranked pages by considering only the points $\{\Theta_{i,k} : R_{i,k} > 1\}$. Here the question of choosing k does not arise anymore. Indeed, the technique involving the Starica plot was needed to get an idea of where the power law behavior ‘starts’ in order to measure statistical dependency for heavy-tailed data as in [22]. On the other hand, if we are interested in rank correlations, we may simply pick the k that gives us the top proportion of pages we are interested in. Note that by increasing k we do not change the observed values of Θ, we merely increase their number. As a result, in the angular measure, each observation will simply have less weight. On the contrary, decreasing k means ‘zooming in’ on the rank perturbations at the top.
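The selection of top ranked pages can be sketched as follows. Formula (4) is not reproduced in this section, so the sketch relies on an assumed reading of it: each node is mapped to the point (k/r1, k/r2), the radius R is the Euclidean norm of that point, and the angle is arctan(r1/r2). The function name `top_k_angles` and the toy rankings are hypothetical.

```python
import math

def top_k_angles(rank1, rank2, k):
    """Angles Theta_i of the nodes whose assumed polar radius R_{i,k} exceeds 1."""
    selected = []
    for r1, r2 in zip(rank1, rank2):
        # Assumed form of the radius: Euclidean norm of (k / r1, k / r2).
        radius = math.hypot(k / r1, k / r2)
        if radius > 1:
            # The angle does not depend on k, only on the ratio of the ranks.
            selected.append(math.atan2(r1, r2))
    return selected

# Hypothetical toy data: ten nodes, ranks 2 and 9 exchanged by scheme 2.
rank1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rank2 = [1, 9, 3, 4, 5, 6, 7, 8, 2, 10]

zoomed_in = top_k_angles(rank1, rank2, k=2)
zoomed_out = top_k_angles(rank1, rank2, k=5)
```

Under this reading, a larger k keeps every angle already selected and adds further ones, which matches the remark that increasing k changes only the number of observations and not their values.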

One more advantage of the proposed correlation measure is its fast and easy implementation, since for each node i only the fraction $r^1_i / r^2_i$ has to be computed.

Below we present an example of the proposed rank correlation measure in Growing Networks, the Web and Wikipedia. We rank the three data sets by using the definition of PageRank (2), where the damping factor is equal to c = 0.5 and c = 0.85. In Figure 12 we plot cumulative functions for angular measures for k = 100 and for the values of k that have been chosen according to the Starica plots (see Figure 8(c) and Figure 9(e)). For the Growing Network data set we observe a strong correlation between the ranking schemes. We can also conclude that in Wikipedia the change in the damping factor affects only about 20% of the considered pages, in the top-hundred group as well as in the larger group. For the Web data, the correlation between rankings is not significant for approximately half of the pages. However, for the top pages, the difference in the damping factor mixes up the order of ranking. The results for the top 100 pages are in line with, but more informative than, the corresponding values of Kendall’s τ: $\tau_{GN} = 0.9967$, $\tau_{WI} = 0.6879$, $\tau_{EU} = 0.4092$.

5.2 Discussion

The main idea of the Θ rank correlation measure is that we characterize the rank correlations by a cumulative distribution of the $\Theta_i$’s, where i = 1, . . . , n. This way, one can actually see how many pages change their ranks significantly. Such a measure is substantially more informative than a single number that represents the correlation in the whole graph. For instance, Melucci [15] noticed that Kendall’s τ tends to grow close to one for large data sets. The author provides an example where Kendall’s τ for ranking orders of only a few hundred Web pages becomes almost 1, in spite of the large number of rank perturbations. We remark, however, that if for some reason having one number is necessary, one can always compute, e.g., the expected deviation of Θ from π/4.
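A single-number summary of this kind might look as follows. The sketch assumes that "expected deviation" is read as the mean absolute deviation of the angles from π/4; the text does not fix the exact form, and the function name `mean_theta_deviation` is hypothetical.

```python
import math

def mean_theta_deviation(rank1, rank2):
    """Mean of |Theta_i - pi/4|, where Theta_i = arctan(r1_i / r2_i)."""
    if not rank1 or len(rank1) != len(rank2):
        raise ValueError("rank lists must be non-empty and of equal length")
    deviations = [
        abs(math.atan2(r1, r2) - math.pi / 4)  # assumed absolute-value form
        for r1, r2 in zip(rank1, rank2)
    ]
    return sum(deviations) / len(deviations)

# Hypothetical toy data: identical rankings, and rankings with 1 and 5 exchanged.
identical = mean_theta_deviation([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
disturbed = mean_theta_deviation([1, 2, 3, 4, 5], [5, 2, 3, 4, 1])
```

With this definition the summary is 0 for identical rankings and always stays below π/4, since every angle lies strictly between 0 and π/2.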

As mentioned before, the proposed correlation measure is quite harsh with respect to lowly ranked nodes. Indeed, the node ranked 1000 must fall all the way to 2000 to have the same effect as number 1 becoming number 2. We would like to emphasize that such a discrepancy is especially suitable for ranking orders emerging from heavy-tailed data, such as PageRank or in-degree. This is because in such data there is a huge difference between the highest values of the realizations, cf. [9].
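The arithmetic behind this comparison can be written out as a short worked example. It assumes that the rank in scheme 1 stays fixed while the rank in scheme 2 doubles; the numerical values are rounded to four decimals.

```latex
\Theta_{1 \to 2} = \arctan\!\left(\frac{1}{2}\right) \approx 0.4636,
\qquad
\Theta_{1000 \to 2000} = \arctan\!\left(\frac{1000}{2000}\right)
  = \arctan\!\left(\frac{1}{2}\right) \approx 0.4636,
\qquad
\frac{\pi}{4} - \arctan\!\left(\frac{1}{2}\right) \approx 0.3218 .
```

Both moves therefore produce the same angle, because Θ depends only on the ratio of the two rank positions.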

Acknowledgments

The authors would like to thank Debora Donato for her great help with Web and Wikipedia data sets. This article is the result of joint research in the 3TU Centre of Competence NIRICT (Netherlands Institute for Research on ICT) within the Federation of Three Universities of Technology in The Netherlands.

Figure 12: Cumulative functions for Angular Measures for PageRank (c=0.5) and PageRank (c=0.85): (a) Growing Network; (b) EU-2005; (c) Wikipedia. [Plots omitted: fraction of pages against Θ, for k=100 and for (a) k=5000, (b) k=300.000, (c) k=2.000.000.]

6. REFERENCES

[1] http://law.dsi.unimi.it/. Accessed in January 2007.

[2] R. Albert and A. L. Barabási. Emergence of scaling in random networks. Science, 286:509–512, 1999.

[3] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web. Comput. Networks, 33:309–320, 2000.

[4] A. Capocci, V. D. P. Servedio, F. Colaiori, L. S. Buriol, D. Donato, S. Leonardi, and G. Caldarelli. Preferential attachment in the growth of social networks: The internet encyclopedia Wikipedia. Phys. Rev. E, 74:036116, 2006.

[5] D. Chakrabarti and C. Faloutsos. Graph mining: laws, generators, and algorithms. ACM Comput. Surv., 38(1):2, 2006.

[6] L. de Haan and J. de Ronde. Sea and wind: multivariate extremes at work. Extremes, 1(1):7–45, 1998.

[7] D. Donato, L. Laura, S. Leonardi, and S. Millozi. Large scale properties of the webgraph. Eur. Phys. J., 38:239–243, 2004.

[8] J. C. Doyle, D. L. Alderson, L. Li, S. Low, M. Roughan, S. Shalunov, R. Tanaka, and W. Willinger. The “robust yet fragile” nature of the Internet. PNAS, 102(41):14497–14502, 2005.

[9] P. Embrechts, C. Klüppelberg, and T. Mikosch. Modelling Extremal Events. Springer, 1997.

[10] S. Fortunato, M. Boguna, A. Flammini, and F. Menczer. How to make the top ten: Approximating PageRank from in-degree. In Proceedings of WAW 2006, 2006.

[11] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. JACM, 46(5):604–632, 1999.

[12] A. N. Langville and C. D. Meyer. Deeper inside PageRank. Internet Math., 1:335–380, 2003.

[13] A. N. Langville and C. D. Meyer. Google’s PageRank and beyond: the science of search engine rankings. Princeton University Press, Princeton, NJ, 2006.

[14] L. Li, D. L. Alderson, J. C. Doyle, and W. Willinger. Towards a theory of scale-free graphs: definition, properties, and implications. Internet Math., 2(4):431–523, 2005.

[15] M. Melucci. On rank correlation in information retrieval evaluation. 2007.

[16] T. Mikosch. Modelling dependence and tails in financial time series. In Symposium in Honour of Ole E. Barndorff-Nielsen, volume 16, pages 61–73. Univ. Aarhus, Aarhus, 2000.

[17] M. Mitzenmacher. A brief history of generative models for power law and lognormal distributions. Internet Math., 1(2):226–251, 2004.

[18] M. E. J. Newman. The structure and function of complex networks. SIAM Rev., 45(2):167–256, 2003.

[19] M. E. J. Newman. Power laws, Pareto distributions and Zipf’s law. Contemp. Phys., 46:323–351, 2005.

[20] G. Pandurangan, P. Raghavan, and E. Upfal. Using PageRank to characterize Web structure. In 8th Annual International Computing and Combinatorics Conference (COCOON), 2002.

[21] K. Park and W. Willinger. Self-similar network traffic and performance evaluation. Wiley, New York, 2000.

[22] S. I. Resnick. Heavy-tail Phenomena. Springer Series in Operations Research and Financial Engineering. Springer, New York, 2007.

[23] Y. Volkovich, N. Litvak, and D. Donato. Determining factors behind the pagerank log-log plot. In
