
7 Results: Statistics, Graphs and classifier experiment

7.1 Exploratory statistics


range (the middle 50%) is quite different from those within the top 10, and that moving up these 10 positions the use of hyperlinks is again of a different kind.

On inspection of the sites we can get a feeling for why this is so: a tendency to combine whatever linked material is available on a page in a multitude of combinations. On bhakti-holland.com the calendar/agenda has thrown the crawler into a near-loop by returning something like a Cartesian product, treating combinations of date and time components as individual links. On reliwiki.nl, a website hosting information about religious places and communities, the crawler has visited all the revisions of all the pages and the corresponding ‘wiki functionalities’, such as viewing differences between revisions. It is a miracle the crawler finished at all. The size of bethshoshanna.nl is in fact due to the site’s incorrect usage of so-called relative URLs. The correct syntax for linking from one subpage to another within a site is ‘/subpage/’. If urllib is on ‘www.site.nl/subpage1/’ and encounters the hyperlink ‘/subpage2/’, it will return ‘www.site.nl/subpage2/’ as the link for the crawler to follow. However, bethshoshanna forgets the first slash: ‘subpage2/’. This syntax is used for linking to a subpage of the current page. Whilst a browser seems to correct this, urllib does not, and returns ‘www.site.nl/subpage1/subpage2/’. The result is that all the links which are available from any point in the website are combined with one another in all possible combinations. Now, the erroneous syntax could be fixed manually, were it not that its erroneousness depends on context: sometimes an absent starting slash is correct. I don’t know by which method a browser can detect the error. Because this statistic does not impact the outlinks, which are of actual importance to the graph, I will leave this as an example of the anomalies which can be encountered while crawling for links.
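The resolution behaviour described above can be reproduced in a few lines. This is a minimal sketch, assuming the crawler resolves links with `urllib.parse.urljoin` (the text only names urllib); the host `www.site.nl` and the subpage names are the placeholder values from the example, with an `http://` scheme added because `urljoin` needs one to recognise the host. `urljoin` follows the standard rules for resolving relative references, so the second result is what the library is specified to return for a link without a leading slash.

```python
from urllib.parse import urljoin

# Placeholder page the crawler is currently on (scheme added for urljoin).
base = "http://www.site.nl/subpage1/"

# Root-relative link (leading slash): resolved against the site root.
root_relative = urljoin(base, "/subpage2/")

# Path-relative link (no leading slash): resolved against the current directory.
path_relative = urljoin(base, "subpage2/")

print(root_relative)   # http://www.site.nl/subpage2/
print(path_relative)   # http://www.site.nl/subpage1/subpage2/

# If every page offers the same path-relative links, each followed link
# nests one level deeper, so the set of "pages" keeps growing.
url = base
for link in ["subpage2/", "subpage1/", "subpage2/"]:
    url = urljoin(url, link)
print(url)             # http://www.site.nl/subpage1/subpage2/subpage1/subpage2/
```

The final loop illustrates why the crawl of such a site inflates: every combination of the available links yields a new, syntactically valid URL.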

Figure 3. Unique outer links distribution

Rank  Site                            Unique outer links
1     www.kruislinks.nl               37.205
2     www.openbaring.org              36.367
3     amsterdamghanasda.org           26.838
4     www.orthodoxchurch.nl/nl        21.615
5     www.scientology.nl              17.727
6     bhagavata.org/index.ned.html    15.272
7     bderoos.wordpress.com           10.924
8     www.scientology-amsterdam.org   9344
9     www.christianarchy.nl           9297
10    spiritueleteksten.nl            8740

Statistic  Value
Count      605
Mean       608
Std.       2851
Min        1
25%        15
50%        66
75%        253
Max        37.205

The unique outlinks show a distribution somewhat similar to that of inner links, although the rise that emerges nearing the last quartile is relatively less rapid, with the 10th site containing 23% as many outlinks as the first, compared to just under 16% for inner links. In absolute numbers the difference is marked, however, totaling about 28.500 between no. 10 and no. 1, compared to 150.000 for the previous graph.

Apparently, the systems described above that cause the number of internal (quasi-)pages to increase exponentially are not as much of an issue when linking to outside sites. Of course, it has to be reiterated that generic ‘share this’ links to social media have been filtered out. Had they been kept, and had sites used them liberally, they might have precipitated a similar phenomenon. In any case, calculating Pearson’s correlation between inner and outer links returns a coefficient of 0.12, denoting only a very weak positive relation.

Openbaring.org, Orthodoxchurch.nl and Christianarchy.nl can be described as Christian ‘socially critical’ websites with a millenarian tendency. All three are article-focused and link very heavily to a great variety of sites, though news outlets, other blogs and Christian sites seem to make up a large share of them. Number 1, kruislinks.nl, also hosts articles, but of course, as the name suggests, it is mainly a link database of Christian communities and organizations, and its top position in this and the following graph suggests it is fulfilling its purpose well. Numbers 5 and 8 are Scientology sites, which seem to contain links to versions of the same article on a large number of localized Scientology websites as well as other Scientology-related websites. In fact, of the 429 and 444 websites to which the two sites respectively link, the percentage of host URLs containing the word ‘scientology’ is 83% for both counts. Bhagavata.org is a Hindu site which, amongst others, apparently links to every individual verse of the Vedas, which, considering the voluminousness of those sacred scriptures, nets a respectable number 6. Number 7, bderoos.wordpress.com, seems to owe its position to combining two modes of linkage: it links to a high number of individual sites, while also containing a very high number of ‘wordpress.com’ links, seemingly serving WordPress functionality and inter- or intra-blog navigation.

Amsterdamghanasda.org is a special case, to which we will return when discussing unique base links.
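The share of hosts containing a given word, as reported for the two Scientology sites, can be computed roughly as follows. This is a sketch only: the function name `host_share` and the sample URLs (on the reserved `.example` domain) are hypothetical and not taken from the crawl data, and the actual counting procedure used for the 83% figure is not described in the text.

```python
from urllib.parse import urlsplit

def host_share(urls, keyword):
    """Return the fraction of distinct hosts whose name contains `keyword`."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname  # lower-cased host, or None if absent
        if host:
            hosts.add(host)
    if not hosts:
        return 0.0
    matching = [h for h in hosts if keyword in h]
    return len(matching) / len(hosts)

# Hypothetical outlinks; two URLs share a host and are counted once.
sample_outlinks = [
    "http://www.scientology-a.example/article",
    "http://www.scientology-a.example/other-article",
    "http://www.scientology-b.example/article",
    "http://www.scientology-c.example/article",
    "http://news.example/story",
]

share = host_share(sample_outlinks, "scientology")
print(f"{share:.0%}")  # 3 of the 4 distinct hosts match
```

Reducing each URL to its host first is what separates ‘unique outer links’ from ‘unique base links’: many distinct URLs may collapse onto a single host.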

Figure 4. Total outer links distribution

Statistic  Value
Count      605
Mean       8138
Std.       58.212
Min        1
25%        70
50%        468
75%        2224
Max        1.228.702

Rank  Site                            Total outer links
1     www.christianarchy.nl           1.228.702
2     www.gouwepeer.nl                497.070
3     reliwiki.nl                     388.192
4     amsterdamghanasda.org           217.360
5     www.orthodoxchurch.nl/nl        161.052
6     www.scientology-amsterdam.org   128.231
7     www.nieuwetijdskind.com         124.710
8     www.scientology.nl              113.254
9     bderoos.wordpress.com           89.971
10    www.gerritveldman.nl            84.396

‘Total outer links’ is a difficult statistic to interpret. The correlation coefficients between it and inner links and unique outer links are 0.26 and 0.33 respectively, indicating that the weightiest websites are not destined to be the most frequent linkers, and that having many unique links does not necessarily mean these links are often re-used. As seen in the top 10, some sites link to others with inordinate frequency, and the question of why is not so easy to answer, straddling as it does the line between purposive usage and technological structure. Essentially the question is: does a high link count occur due to mechanical repetition, copy-pasting, manual linkage or a combination of all of them? Perhaps a sub-experimental intermezzo can shed some light on this situation. Namely, checking the number of inlinks received by all the websites which our corpus sites link to adumbrates the outlines of where these rather high numbers of total links come from. First, I’ll take a look at the overall situation, and then I will zoom in on the number 1 site, christianarchy.nl.

Figure 5. Received links per base outer link entry

Statistic  Value
Count      72808
Mean       67
Std.       1245
Min        1
25%        1
50%        2
75%        4
Max        181204

Figure 5 plots all base outer links and their link counts from the 605 sites. All duplicates (e.g. two sites linking to YouTube) have been counted individually. As is readily apparent, the distribution is characterized by an extremely long and initially quite ‘fat’ tail. 50% of the 72808 sites are linked to 2 or fewer times and 75% 4 or fewer times, whilst the max is a whopping 181.204 links to a single site: erfgoedstem.nl on reliwiki.nl. Considering reliwiki is the champion in inner page count with its 173.469 pages, it would make sense to conclude that every single page on the site links to erfgoedstem.nl at least once. And indeed, inspecting the homepage, the link is part of a ‘recent news’ banner that follows the browser around while traversing the website. This is an excellent example of the mechanical repetition of links, and it makes it clear that a simple linear correlation between link count and link importance is off the table. In fact, one would be inclined to use something similar to the TF-IDF measure: the more often a link appears across a given website, the less one can assume that it is meaningful relative to a specific webpage.
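One possible reading of this TF-IDF-like idea is sketched below. It uses only the inverse-document-frequency half of the measure, treating each page of a site as a ‘document’ and each outer host as a ‘term’. This is an illustration under stated assumptions, not a method applied in this study: the function name `link_idf`, the page paths and the `.example` hosts are all hypothetical.

```python
import math
from collections import defaultdict

def link_idf(pages):
    """Map each link target to log(N / n_t), where N is the number of
    pages on the site and n_t the number of pages linking to the target."""
    n_pages = len(pages)
    pages_with_target = defaultdict(int)
    for targets in pages.values():
        for target in set(targets):  # count each page at most once per target
            pages_with_target[target] += 1
    return {t: math.log(n_pages / n) for t, n in pages_with_target.items()}

# Hypothetical site: page path -> outer hosts linked from that page.
site_pages = {
    "/a": ["banner.example", "source-1.example"],
    "/b": ["banner.example"],
    "/c": ["banner.example", "source-2.example", "source-2.example"],
    "/d": ["banner.example"],
}

weights = link_idf(site_pages)
for target, weight in sorted(weights.items()):
    print(target, round(weight, 3))
```

A target that appears on every page, like the banner link here, receives a weight of zero, whereas a link found on a single page receives the highest weight, regardless of how many times it is repeated on that page.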


Figure 6. Christianarchy.nl: received links per base outer link entry

Statistic  Value
Count      3005
Mean       409
Std.       3544
Min        1
25%        2
50%        2
75%        3
90%        9
99%        6916
Max        159923

Figure 6 displays the same data type for christianarchy.nl only. The site links to 3005 unique base links, but the manner in which it gets to over a million total outlinks is quite interesting. 50 percent of these sites receive at most 2 links, 75 percent at most 3. In fact, over 90 percent of the websites receive fewer than 10 links from across the entire site. Seeing as christianarchy.nl is a blog / article-focussed site, this 90 percent seems to consist of ‘traditional’ links found in bodies of text, meant to communicate with content that the article discusses, or to cite sources for internet-argumentative strength. But it is the 10 percent, or rather, the 1 percent, that does the heavy lifting; just 30 sites constitute 99 percent of the 1.23 million outlinks. The top three of these are christianarchie.blogspot.com, which links back to christianarchy.nl, christi-anarchy.blogspot.com (55268), which links to christianarchy.org, and www.blogger.com (21018), which seems to be the host of the site. The rest of the ‘whale’ sites have a regularity in their link count that also suggests some structural inclusion. Positions 4 to 10 count between 20792 and 20718, 11 to 21 between 12863 and 13812. Skipping two, the next 104 sites all count between 6939 and 6906 (the latter number taking 47 cases for its part).

The exact mechanism for getting to these numbers would require closer inspection of the site.

For now, a good take-away is that raw enumerations of hyperlinks are useful in discerning a specific class of hyperlink on a website, one that is central to either the technological functionality or the user-designed layout, but that the bulk of sites linked to ‘traditionally’ are likely to receive only a handful of links, so this statistic is of little help in winnowing out the sites one is interested in. It is hereby also clear that the ‘weight’ of the links between sites as a parameter in the link graphs will not be of much use in discerning overall structures. A single instance of inordinate interlinking could give a receiving node the appearance of high centrality, while this could in fact be merely due to the behaviour of a prominent link donor. Therefore, it would be best to leave weights (literally) out of the picture, and merely note the presence of a link to a site, not how often it appears.
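Dropping the weights amounts to collapsing the link counts into a set of directed edges. The sketch below contrasts the two views; the function names, the site names on the `.example` domain and the counts are hypothetical, and the graph library actually used for the link graphs is not assumed here.

```python
def unweighted_edges(link_counts):
    """Collapse {(source, target): count} into a set of directed edges."""
    return {edge for edge, count in link_counts.items() if count > 0}

def in_degree(edges):
    """Number of distinct sources linking to each target."""
    degree = {}
    for _, target in edges:
        degree[target] = degree.get(target, 0) + 1
    return degree

def in_strength(link_counts):
    """Total number of links received by each target (the weighted view)."""
    strength = {}
    for (_, target), count in link_counts.items():
        strength[target] = strength.get(target, 0) + count
    return strength

# Hypothetical counts: one prolific donor versus three modest ones.
link_counts = {
    ("site-a.example", "hub.example"): 150000,
    ("site-a.example", "blog.example"): 2,
    ("site-b.example", "blog.example"): 3,
    ("site-c.example", "blog.example"): 1,
}

edges = unweighted_edges(link_counts)
degrees = in_degree(edges)
strengths = in_strength(link_counts)
print(strengths)  # hub.example dominates through a single donor
print(degrees)    # blog.example is linked to by more distinct sites
```

In the weighted view the node fed by a single prolific donor looks far more central; in the unweighted view the node linked to by three different sites comes out ahead.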


Figure 7. Unique base outer link distribution

Statistic  Value
Count      605
Mean       120
Std.       743
Min        1
25%        8
50%        24
75%        68
Max        16.925

Rank  Site                       Unique base outer links
1     amsterdamghanasda.org      16.925
2     www.kruislinks.nl          4635
3     www.christianarchy.nl      3005
4     www.openbaring.org         2183
5     wiccanrede.org             1783
6     www.nieuwetijdskind.com    1607
7     bderoos.wordpress.com      1171
8     logos.nl                   1108
9     www.crescas.nl             1056
10    reliwiki.nl                920

The final statistic, unique base outer links, shows that half of the websites link to at most two dozen websites. Only a rather small minority is well connected (or connecting) and approaches 100.

Calculating the top 10 returns some of the previous high flyers to the spotlight, though positions 4, 5, 6, 8 and 9 are debutants. That half of the sites appear for the first time may be due to the imperfect correlation with previous statistics. The correlation coefficient between unique base links and unique inner links is only 0.054, suggesting that larger sites do not tend to link to more unique websites. The coefficient with unique outer links is 0.62, however, so websites that use more distinct links do tend to link to more individual sites. But there’s a major anomalous factor that may muddy these statistics: the number one in unique base link counts, amsterdamghanasda. Belonging to a Ghanaian Pentecostalist church in Amsterdam, on the surface this website appears to be a totally ordinary site (if slightly rudimentary, with an ‘About Us’ page sporting a ‘lorem ipsum’ and a questionably representative photo of the Florence Cathedral in full saturated glory). But somehow, it links to almost 4 times as many unique sites as the runner-up. What is going on here?

Inspecting its link log, the first level doesn’t show anything unusual. But when we get to the second level, the log explodes with an enormous list of nondescript websites that seem slightly out of place. For instance, there are 2174 URLs containing the word ‘viagra’. Now, I am not an expert in Ghanaian Pentecostalism, but it seems somewhat unlikely that this specific community would suffer from endemic erectile dysfunction. Something less than savoury is going on on this site, with or without the knowledge of its owners, with enough volume in shady sites that the dataset might just be excited into a swollen Pearson’s correlation. Filtering out just this site, the correlation with unique inner links rises to 0.16, and with unique outer links to 0.76. It thus seems that this specific site is interfering with a more linear correlation, and indeed, if we filter out all the top 10 sites, the correlations change to 0.11 and 0.49 respectively. It thus seems that, at least in the case of the relation between unique links and base links, the outliers contribute to the strength of the correlation up until amsterdamghanasda, whose likely hijacking resulted in a use of hyperlinks outside of the bounds of normality.
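The effect of a single anomalous site on Pearson’s coefficient can be illustrated with a small sketch. The numbers below are hypothetical (unique outer links, unique base links) pairs, chosen only to show the direction of the effect; they are not the crawl data, and the `pearson` helper is written out from the textbook definition rather than taken from the tooling used in this study.

```python
import math

def pearson(xs, ys):
    """Pearson's r: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sites: (unique outer links, unique base links).
regular_sites = [(100, 10), (200, 22), (300, 29), (400, 41), (500, 50)]
# One anomalous site whose links point to an unusually large number of hosts.
anomalous_site = (250, 240)

all_sites = regular_sites + [anomalous_site]

r_all = pearson([s[0] for s in all_sites], [s[1] for s in all_sites])
r_without = pearson([s[0] for s in regular_sites], [s[1] for s in regular_sites])

print(f"with anomalous site:    r = {r_all:.2f}")
print(f"without anomalous site: r = {r_without:.2f}")
```

With the anomalous pair included, the otherwise near-linear relation is almost entirely masked; removing that single pair restores it. This mirrors, in exaggerated form, the rise in the coefficient reported after filtering out amsterdamghanasda.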