
Characteristics of social networks in the Chinese Web




by

Louis Lei Yu

B.Sc., Queen’s University, 2003
M.Sc., University of Victoria, 2005

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

in the Department of Computer Science

© Louis Lei Yu, 2010
University of Victoria

All rights reserved. This dissertation may not be reproduced in whole or in part, by photocopying or other means, without the permission of the author.

Supervisory Committee

Dr. Valerie King, Supervisor (Department of Computer Science)

Dr. Bruce Kapron, Departmental Member (Department of Computer Science)

Dr. Ulrike Stege, Departmental Member (Department of Computer Science)

Dr. Zheng Wu, Outside Member (Department of Sociology)


ABSTRACT

We look at the underlying friendships and relationships between Chinese Internet users. We identify the presence and characteristics of the different types of online friendships and online relationships by analyzing various online social networks. First, we look at the concept of guanxi as it is applied to the interaction between web sites. Guanxi is a type of dyadic social interaction based on feelings and trust which has been well studied by scholars in China. We define guanxi in the web: particular linking patterns that appear in the web as well as supporting textual evidence in the web pages which we believe are indicative of the presence and varying strengths of the underlying guanxi between Chinese web site owners. Through our empirical study of the Chinese web, the general web, and the Japanese, Iranian, and French web, we show that guanxi between web sites is a more prevalent feature in the Chinese web. Next, we study the formation of online friendships in Douban, an online social networking platform frequently used by the youth in China. We look at several factors that can affect the evolution of friendships such as having memberships in the same discussion groups and sharing common interests or common friends. We compare these factors in influencing the formation of online friendships. Our work provides the first study on the underlying relationships between web sites in the Chinese web and the first large scale empirical analysis on the evolution of friendships in a Chinese online social network.

Contents

Supervisory Committee
Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction
  1.1 Chinese Internet Statistics and Background
  1.2 The History and Development of the Chinese Internet
  1.3 Overview

2 Preliminary Experiment
  2.1 Adopting English Language Search Techniques to Chinese Web Search
    2.1.1 Resource Finding and Information Selection
    2.1.2 Information Analysis
    2.1.3 Usage Mining
    2.1.4 Suitability of English Search Techniques in Chinese Web Mining

3 Related Work
  3.1 Link Analysis
    3.1.1 PageRank
    3.1.2 HITS
  3.2 Measurements of Web Graphs
  3.3 Evolution of Web Graphs
  3.4 Random Graph Models
    3.4.1 Preferential Attachment Model
    3.4.2 Copying/the Hostgraph Model
  3.5 Analysis of Offline Social Networks
    3.5.1 Triadic Closure
    3.5.2 The Strength of Links
    3.5.3 Homophily
  3.6 Analysis of Online Social Networks
    3.6.1 Degree Distributions in Online Social Networks
    3.6.2 Community in Online Social Networks
    3.6.3 The Strength of Ties in Online Social Networks
    3.6.4 Homophily in Online Social Networks
    3.6.5 Analysis of Chinese Online Social Networks
  3.7 Diffusion Model
    3.7.1 The Threshold Model
    3.7.2 The Independent Cascade Model
  3.8 Web Graph Sampling

4 Guanxi in the Chinese Web
  4.1 Guanxi
    4.1.1 The Establishment of Guanxi
    4.1.2 The Exchange of Resource
    4.1.4 Dyadic Relationships in General Social Networks
  4.2 Guanxi Applied to the Web
    4.2.1 The Establishment of Web-Based Guanxi
    4.2.2 The Exchange of Resource
    4.2.3 The Strength of Web-Based Guanxi
    4.2.4 Textual Indication of Guanxi
    4.2.5 Mutual Links Between General Web Sites
  4.3 An Example Illustration
  4.4 Empirical Study of the Chinese Web and the General Web
    4.4.1 Degree Distributions
    4.4.2 Mutual Links, Type 1 and Type 2 Triangles
    4.4.3 Mutual Links as a Function of In-degree and Out-degree
    4.4.4 Strength of Guanxi in the Chinese Site Graph
    4.4.5 PageRank Correlation
    4.4.6 Guanxi Graph and Reference Graph
    4.4.7 A Guanxi Model of the Web
    4.4.8 Comparison Against Web Sites in Different Countries
    4.4.9 Identifying Guanxi Web Sites

5 The Evolution of Friendships in Online Social Networks
  5.1 Chinese Online Social Networking Platforms
  5.2 Empirical Study of Douban
    5.2.1 Static Analysis of User Profiles
    5.2.2 Static Analysis of Followers
    5.2.3 Time Analysis of User Profiles
    5.2.4 Time Analysis of Online Friendships
    5.2.5 Time Analysis of Followers
  5.3 Observations and Implications

  6.1 Guanxi in the Chinese Web
  6.2 Evolution of Friendships in Douban
  6.3 Future Work

List of Tables

Table 2.1 Top Five Searches of “Tsunami”
Table 2.2 Exploratory Web Sites
Table 2.3 Size of the Web Graphs
Table 2.4 External Link Classification
Table 3.1 Link Analysis Algorithms: Symbols and Notations
Table 4.1 Cluster Coefficient
Table 4.2 Simulation Settings
Table 5.1 Power Law Exponents I

List of Figures

Figure 1.1 Comparison of Internet Penetration Rates
Figure 2.1 Percentage of External Links
Figure 3.1 Homophily in a Social Network
Figure 3.2 Distribution of User Connections in Club Nexus
Figure 3.3 Distribution of Online Friends on Facebook
Figure 3.4 Probability of Users Joining a LiveJournal Community (y-axis) as a Function of Friends Already Joined (x-axis)
Figure 3.5 Probability of Editing a Wikipedia Article (y-axis) as a Function of Friends Already Edited (x-axis)
Figure 3.6 The Number of Common Wikipedia Articles Edited by Two Editors Before and After Their First Communication
Figure 4.1 An Example of a Link Exchange Advertisement
Figure 4.2 An Example of a Link Exchange Platform
Figure 4.3 An Example of a Mutual Link Between Web Sites
Figure 4.4 Degree Distributions of the Chinese Site Graph
Figure 4.5 Mutual Links and Triangles in the Sample Chinese and the Sample General Web
Figure 4.6 Number of Mutual Links/Out-degree Ratio Versus In-degree
Figure 4.7 Number of Mutual Links/In-degree Ratio Versus In-degree
Figure 4.8 Distribution of the Number of Type 1, Type 2 Triangles Per Mutual Link
Figure 4.9 PageRank Correlation in the General Preferential Attachment Model (Scaled by a Factor of 10^6)
Figure 4.10 PageRank Correlation in the Sample General and Chinese Web
Figure 4.11 PageRank Difference Distribution
Figure 4.12 PageRank vs Number of Mutual Links
Figure 4.13 Degree Distribution of the Chinese Reference Graph
Figure 4.14 Degree Distribution of the Chinese Guanxi Graph
Figure 4.15 In-degree Distribution of the Chinese Strong Guanxi Graph
Figure 4.16 Simulation Results: In-degree Distributions
Figure 4.17 Mutual Links and Triangles in Different Countries
Figure 5.1 A User Profile on Douban
Figure 5.2 A Discussion Group on Douban
Figure 5.3 Distribution of Profiles with Number of Friends
Figure 5.4 Distribution of the Number of Books and Music Interested
Figure 5.5 Distribution of Profiles with Number of Discussion Groups
Figure 5.6 Profiles for a CD and a Book on Douban
Figure 5.7 Distribution of Followers
Figure 5.8 Percentage of Links Added and Deleted
Figure 5.9 Distribution of the Number of Friends Added and Deleted
Figure 5.10 Percentage of Users Adding and Deleting Friends
Figure 5.11 Percentage of Added Friends with Common Friends or Interests
Figure 5.12 Percentage of Deleted Friends with Common Friends or Interests
Figure 5.13 Common Friends and Discussion Groups Distributions

ACKNOWLEDGEMENTS

I would like to thank:

Liu Li Hua and Yu Wa Xian, for supporting me always.


DEDICATION

Chapter 1

Introduction

It has been documented that friendships and relationships between individuals in Chinese society exhibit different characteristics from those in Western society (see Chapter 3 for a survey on this topic).

In this work, we look at the underlying friendships and relationships between Chinese Internet users. We identify different types of online friendships and online relationships in the Chinese web by analyzing the local linking structure of various Chinese web graphs.

First, we look at the presence of guanxi in the Chinese web. Guanxi is a type of dyadic social interaction based on feelings, trust, and the development of friendship. We define the concept of guanxi as it is applied to the interaction between web sites. We show by analyzing the local linking structure between Chinese web sites and the content of Chinese web documents that the interaction between Chinese web sites can be seen to exhibit two types of guanxi: strong guanxi and cheap guanxi. We compare the local linking structure of Chinese web sites to the local linking structure of web sites in the general web and in Japan, Iran, and France. Finally, we explore methods to identify different types of guanxi in the Chinese web, and give a method for simulating guanxi in a web graph model.

Next, we study how online friendships are formed and how interests are adopted in a Chinese online social network, Douban. Online social networks have become a major platform for youth in China to gather information and to make friends with like-minded individuals [60]. We compare our findings to research on the formation of online friendships and the adoption of trends and influences in Western online social networks [62] [63] [80].

We study the evolution of friendships on Douban. We compare several factors which can affect the evolution of friendships, such as having memberships in the same discussion groups and sharing common interests or common friends. We also examine how Douban users’ interests are influenced by their online friends’ interests.

For the rest of this chapter, we provide some background information on the history and development of the Internet industry in China. We give some statistical information on the Chinese Internet user demographic. Finally, we give an overview of our work.

1.1 Chinese Internet Statistics and Background

The development of the Internet industry in China over the past decade has been impressive. According to a survey from the China Internet Network Information Center (CNNIC), by July 2008 the number of Internet users in China had reached 253 million, surpassing the U.S. to become the world’s largest Internet market [31].

However, the Internet penetration rate in China is still low. Figure 1.1 compares the Internet penetration rates in seven different countries in 2007 [31]. We see that Iceland has the highest Internet penetration rate of 86.30%, while India has only 3.70%. China’s Internet penetration rate is 16% and is just below the world average of 19.1%.

The 2007 survey by CNNIC on the Internet development in China [32] reports that the Internet penetration rate in the rural areas of China is on average 5.1%. In contrast, the Internet penetration rate in the urban cities of China is on average 21.6%. In metropolitan cities such as Beijing and Shanghai, the Internet penetration rate has reached over 45%, with Beijing being 46.4% and Shanghai being 45.8% [32].

China’s cyberspace is dominated by urban students between the ages of 18 and 30. More than half (50.9%) of the Internet users in China are under 25, and 69% are below 30 [31]. In the US, however, young people between the ages of 18 and 30 account for only 20.5% of all Internet users [102].

Figure 1.1: Comparison of Internet Penetration Rates

1.2 The History and Development of the Chinese Internet

Tai [112] points out the four major stages of Internet development in China, “with each period reflect[ing] a substantial change not only in technological progress and application, but also in the government’s approach to and apparent perception of the Internet.”

to the use of emails among a handful of computer research labs in China.

2. The second phase was from 1992 to 1995. Alarmed by the Internet development in the U.S., the Chinese government proposed several large scale network projects and built a national information network infrastructure.

3. The third phase was from 1995 to 1997. The Chinese government stepped up its effort in building the information network infrastructure, hoping that the IT industry would yield significant benefits to the nation’s economy. Meanwhile, afraid of losing control of the nation’s information flow, the government started to implement a variety of technological and policy control mechanisms to beef up the censorship and surveillance of the information on the Internet.

4. The fourth phase started in 1998 and continues to the present, during which time the Internet has become a powerful medium in Chinese society.

The government plays an important role in fostering the advance of the Internet industry in China. According to Benkler:

The government [of China] holds a monopoly over all Internet connections going into and out of the country. It either provides or licenses the four national backbones that carry traffic throughout China and connect it to the global network. ISPs that hang off these backbones are licensed, and must provide information about the location and workings of their facilities, as well as comply with a code of conduct. Individual users must register and provide information about their machines, and the many Internet cafes are required to install filtering software that will filter out subversive sites. There have been crackdowns on Internet cafes to enforce these requirements. This set of regulations has replicated one aspect of the mass-medium model for the Internet. It has created a potential point of concentration or centralization of information flow that would make it easier to control Internet use [11].

1.3 Overview

We hypothesize that, despite the government’s attempts to monitor the Internet and censor certain content, Chinese Internet users use the Internet to seek out information not broadcast in Chinese mainstream media such as television, radio, and newspapers, and to make friends with like-minded individuals. It is interesting to investigate whether online friendships and online relationships in the Chinese web exhibit different characteristics from the friendships and relationships in Chinese society. It is also interesting to compare the characteristics of online friendships in the Chinese web to those in the general web.

Our work begins with an analysis and preliminary experiment on the suitability of English language web mining techniques in Chinese web search (Chapter 2). We show that methods and techniques originally developed for English web search cannot be used directly in the development of Chinese search engines. We notice several unique characteristics in the local linking structure between Chinese web sites which suggest that the hyper-links between Chinese web sites reflect the underlying friendships and relationships between Chinese web site owners. This analysis serves as the motivation for us to further investigate the presence and characteristics of underlying friendships and relationships in the Chinese web. In Chapter 3, we survey related work that provides the foundation for our main studies. In Chapter 4, we present our study on finding guanxi, a type of social construct in China, in the Chinese web. In Chapter 5, we present our study on the evolution of friendships in Chinese online social networks. In Chapter 6, we compare the characteristics of online friendships and online relationships in the Chinese web to the characteristics of online friendships in the Chinese online social network, Douban. Finally, in Chapter 7, we give our conclusions.


Chapter 2

Preliminary Experiment

2.1 Adopting English Language Search Techniques to Chinese Web Search

Most web mining techniques in use today were originally developed for English web search [67]. There are over three hundred search engines in China, and most of them use the resource finding, information extraction, and document ranking techniques used by English search engines [81] [85]. An example is the Google Chinese search engine (http://www.google.com/intl/zh-CN and http://www.google.com.hk, for simplified and traditional Chinese character search respectively), which essentially uses the same PageRank algorithm as the English Google search engine [81]. It is reasonable to question the efficiency and performance of these techniques when they are applied directly to documents in another language.

It has been documented that the irrelevance of many Chinese search engines’ search results is quite prominent. Low precision and low recall are particularly acute in Chinese web search [85].

Most search engines in China have databases that are relatively small. Some specialize in searching for focused information such as consumer electronics and entertainment; others are simply a part of bigger search engines [85]. The five biggest and most well known search engines in China are Google China (http://www.google.cn), Yahoo Yisou (http://www.yisou.com), Zhongsou (http://www.zhongsou.com), Baidu (http://www.baidu.com) and Tianwang (http://www.tianwang.com). These search engines are very popular among Chinese users [81].

We conduct a small experiment. We submit the query “Taifeng” (tsunami in Chinese) to the five Chinese search engines listed above and observe the results displayed. Table 2.1 shows the top five results displayed by each search engine. Among the sixteen distinct results, only one appears in all five lists. One appears in three lists and three appear in two lists. We also observe that none of the top five results from Tianwang appear in any other search engine’s list. Most of the sixteen results are news items; some are blog posts or forum pages.

In contrast, after we submit the query “tsunami” to four of the major English language search engines: Yahoo, Google, AltaVista, and MSN, the lists of top five results displayed show more overlaps (Table 2.1). There are nine distinct documents as compared to the sixteen found in our Chinese search. One result appears in all four search engines’ top five lists. Two results appear in three of the four lists and two results appear in two lists. A much higher percentage of overlaps is observed.

This difference in the percentage of overlaps suggests that the English search engines use very similar ranking algorithms, that their crawlers cover similar web spaces, or both. This characteristic deserves further investigation.
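The overlap comparison above can be computed mechanically. The following is a minimal sketch in Python, assuming each engine's top results are available as a list of URL strings. The function name `overlap_profile`, the engine names, and the URLs are hypothetical placeholders for illustration; they are not the results in Table 2.1.

```python
from collections import Counter


def overlap_profile(top_lists):
    """Count, for each distinct result, how many engines' lists contain it.

    Returns a pair: (number of distinct results,
    histogram mapping "appears in k lists" -> number of such results).
    """
    appearances = Counter()
    for results in top_lists.values():
        # Use a set so a URL repeated within one list is counted once.
        for url in set(results):
            appearances[url] += 1
    histogram = Counter(appearances.values())
    return len(appearances), dict(histogram)


# Hypothetical top-three lists for three engines (placeholder data).
example = {
    "engine_a": ["site1.example/a", "site2.example/b", "site3.example/c"],
    "engine_b": ["site1.example/a", "site2.example/b", "site4.example/d"],
    "engine_c": ["site1.example/a", "site5.example/e", "site6.example/f"],
}

distinct, histogram = overlap_profile(example)
print(distinct)   # number of distinct results across all lists
print(histogram)  # how many results appear in 1, 2, or 3 lists
```

A smaller number of distinct results, with more of them appearing in several lists, corresponds to the higher overlap observed for the English engines.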

2.1.1 Resource Finding and Information Selection

We examine some existing web mining techniques for their suitability for use in Chinese web search. Web mining describes the process of discovering documents in the web and extracting information from web documents. Kosala et al. [67] decompose the web mining process into three sub-tasks:

2. Information extraction is the task of extracting relevant information from the retrieved documents.

3. Information analysis is the task of ranking the web documents according to their relevance to the user submitted queries.

Chinese Engines                           English Engines

Google                                    Google
news.sina.com.cn/z/sumatraearthquake      www.tsunami.org/
www.nju.edu.cn/njuc/dikexi/earthscience   www.pmel.noaa.gov/tsunami
blog.roodo.com/tsunamihelp                www.ess.washington.edu/tsunami/
blog.yam.com/tsunamihelp                  tsunamihelp.blogspot.com
news.xinhuanet.com/world/2004-12/26       en.wikipedia.org/wiki/Tsunami

Yisou                                     Yahoo
news.sina.com.cn/z/sumatraearthquake      www.geophys.washington.edu/tsunami
news.tom.com/hot/ynqz                     en.wikipedia.org/wiki/2004 earthquake
news.21cn.com/zhuanti/world/dzhx          en.wikipedia.org/wiki/Tsunami
news.xinhuanet.com/world/2004-12/26       tsunamihelp.blogspot.com
www.nju.edu.cn/njuc/dikexi/earthscience   www.geophys.washington.edu/tsunami

Zhongsou                                  AltaVista
news.sina.com.cn/z/sumatraeart            www.flickr.com/photos/tags/tsunami
www.xinhuanet.com/world/ydyhx             www.geophys.washington.edu/tsunami
news.sohu.com/s2004/dizhenhaix            en.wikipedia.org/wiki/earthquake
www.phoenixtv.com/phoenixtv               en.wikipedia.org/wiki/Tsunami
61.139.8.15/newstanfo/zhuanti             tsunamihelp.blogspot.com

Baidu                                     MSN
news.sina.com.cn/z/sumatraearthquake      www.tsunami.org
post.baidu.com/taifeng                    www.tsunami.org/faq
www.nju.edu.cn/njuc/dikexi/earthscience   wcatwc.arh.noaa.gov
www.xinhuanet.com/world/ydyhx             www.geophys.washington.edu
news.sohu.com/s2004/dizhenhaixiao         www.geophys.washington.edu/tsunami

Tianwang
news.sina.com.cn/z/run/rollnews/12
dl.dadui.com/softdown/9768
www.discloser.net/html/175487
www.51do.com/web/2362
www1.netcull.com/topic/ydyhx

Table 2.1: Top Five Searches of “Tsunami”

In resource finding [67], many techniques are used to retrieve as many relevant documents as possible from the web or from data collections. These techniques include document classification and categorization, user feedback interfaces, and data visualization [67]. These techniques are language independent and can be applied to documents written in any language.

Extracting information from unstructured data

From the retrieved documents, relevant information is extracted. For unstructured data such as natural text in a web document, linguistic approaches are necessary to perform syntactic and semantic analysis.

Of the three main areas of web information retrieval and analysis, content mining is the most difficult in Chinese web search as compared to structural mining and usage mining [97]. The reasons for this are explained below:

First, depending on the geographical region of the web site and the preferences of the web page makers, there are many different character sets and encoding schemes to choose from when writing a Chinese web document. Big Five (BIG5) or Dawu, the traditional Chinese character set, is usually used in Taiwan or Hong Kong while GB or Guojia Biaozhun (National Standard) is used to generate simplified Chinese characters in mainland China [43] [59].

Second, written Chinese has no white space between words, unlike English. Depending on how one reads a sentence or combines the separate characters, it is possible to have multiple valid interpretations of a phrase or a sentence. Therefore, Chinese language segmentation is very difficult and has remained a very important open research problem [59] [82] [74].

The effectiveness of the term extraction process affects the clustering and categorization of documents. It also affects the search engines’ capability to index documents [97]. The white space problem at the query level also dictates how accurate the matching and ranking will be [43].

Third, a Chinese word often has multiple meanings and can be used as a verb, a noun, or an adjective. Simply counting the frequency of the keywords in a web document may not be a good method to determine the relevance of the document to the query.
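The segmentation ambiguity described in the second point above can be illustrated with a small sketch. The code below enumerates every way to split an unspaced string into words drawn from a vocabulary. The function name `segmentations`, the toy vocabulary, and the sample phrase are assumptions chosen for illustration; this is brute-force enumeration, not a segmentation algorithm used by any particular search engine.

```python
def segmentations(text, vocabulary):
    """Return every way to split text into a sequence of vocabulary words."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in vocabulary:
            # Keep this prefix as a word and segment the remainder.
            for rest in segmentations(text[end:], vocabulary):
                results.append([word] + rest)
    return results


# Toy vocabulary (an assumption for illustration only):
# 研究 "research", 研究生 "graduate student", 生命 "life",
# 命 "life/fate", 起源 "origin".
vocabulary = {"研究", "研究生", "生命", "命", "起源"}

for split in segmentations("研究生命起源", vocabulary):
    print(" / ".join(split))
```

With this vocabulary the same six characters admit two segmentations, one beginning with 研究 and one beginning with 研究生, so an indexer that picks the wrong split will extract different terms from the same text.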

Extracting information from semi-structured data

Information can also be extracted from semi-structured data by analyzing the meta-information embedded in documents such as HTML tags, headings, and delimiters. Research has been done on automated adaptive algorithms using machine learning techniques [105] [85]. This approach may be less dependent on the language of the documents; however, the performance of the technique is still affected by how the meta-information is configured by the web page makers. Hybrid approaches have also been proposed to extract both unstructured text and semi-structured text [85].

2.1.2 Information Analysis

The main objective of information analysis is to examine and to rank the selected documents according to their relevance to the query.

Most ranking factors are statistical in nature and are used to estimate the documents’ relevance ranking [67]. The frequency of keywords in the document is often used as one of the ranking factors. The technique of using keyword density, the frequency of keywords over the total number of words in the document, should work well in ranking Chinese documents as long as the query is simple. Complex and long queries tend to aggravate the Chinese white space problem.
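Keyword density as defined above is straightforward to compute once a document has been split into words. The sketch below assumes the document is already tokenized; for Chinese text that step depends on the segmentation problem discussed earlier. The function name `keyword_density` and the sample tokens are illustrative assumptions.

```python
def keyword_density(tokens, keywords):
    """Fraction of the document's tokens that are query keywords."""
    if not tokens:
        return 0.0  # an empty document contains no keywords
    keyword_set = set(keywords)
    hits = sum(1 for token in tokens if token in keyword_set)
    return hits / len(tokens)


# Hypothetical, already-tokenized document.
document = ["tsunami", "warning", "issued", "after", "tsunami", "alert"]
print(keyword_density(document, ["tsunami"]))  # 2 keyword hits out of 6 tokens
```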

algorithms. Each web page’s PageRank score is determined by its inbound and outbound links (see a more detailed description of the PageRank algorithm in Chapter 3).

To explore the structural characteristics of the Chinese web, we performed a small experiment. We chose six web sites, three randomly selected from the Chinese web and three randomly selected from the general web. We constructed a web graph for each of the selected web sites. Table 2.2 summarizes the type of web sites we chose. Table 2.3 illustrates the number of web pages in the first four levels of each web site.

Legend  Type
C1      a Chinese .com site
C2      a Chinese .org.cn site
C3      a Chinese .com.cn site
E1      a North American .com site
E2      a North American .cm site
E3      a North American .org site

Table 2.2: Exploratory Web Sites

We observe that Chinese web pages reference each other in a more concentrated fashion than other web pages, usually within a close-knit community. For example, a research group at a university would link to other research groups within the same university.

We define the internal links of a web site as the links to web pages with the same primary and secondary domains. The external links of a web site are defined as the links to web pages with different primary or secondary domains. Figure 2.1 shows the number of external links over the total number of links for each web site in our sample set. It is evident that English web sites make more external references than the Chinese web sites.
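The internal/external distinction can be sketched in a few lines of Python. This sketch assumes that the "primary and secondary domains" of a URL can be approximated by the last two labels of its host name; that is a simplification that treats, for example, every .com.cn host as the same site, so a real crawler would need a more careful rule. The function names `site_key` and `external_link_ratio` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def site_key(url):
    """Approximate a URL's primary and secondary domains by its last two host labels."""
    host = urlparse(url).hostname or ""
    return tuple(host.lower().split(".")[-2:])


def external_link_ratio(page_url, link_urls):
    """Number of external links over the total number of links on a page."""
    if not link_urls:
        return 0.0
    home = site_key(page_url)
    external = sum(1 for link in link_urls if site_key(link) != home)
    return external / len(link_urls)


# Hypothetical page and out-going links.
links = [
    "http://news.example.com/story",  # internal: same last two labels
    "http://www.example.org/",        # external: different primary domain
    "http://other.com/page",          # external: different secondary domain
]
print(external_link_ratio("http://www.example.com/", links))
```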

We also observe that Chinese entertainment web sites tend to reference other Chinese entertainment web sites. Similarly, Chinese news web sites have the majority of external links to other Chinese news web sites. In contrast, English web sites are more likely to link to web sites of different purposes. We observe from Table 2.4 that both E2 and E3 have a higher percentage of external links to .org and .gov web sites.

Root Node  # of Layer 2 Nodes  # of Layer 3 Nodes  Total # of Nodes
C1         3                   760                 764
C2         8                   131                 140
C3         7                   754                 762
E1         9                   209                 372
E2         15                  356                 372
E3         9                   165                 175

Table 2.3: Size of the Web Graphs

External Link  .com  .org  .gov  .edu
C1             100%  0%    0%    0%
C2             96%   3%    1%    0%
C3             91%   4%    5%    0%
E1             97%   3%    0%    0%
E2             74%   15%   5%    6%
E3             70%   14%   15%   1%

Table 2.4: External Link Classification

We observe that the way Chinese web sites link to each other may exhibit different characteristics from that of English web sites. When applying the PageRank algorithm to Chinese documents, one needs to take these characteristics into consideration and eliminate any bias effect. In Chapter 4 we further examine the characteristics of external links between Chinese web sites.

2.1.3 Usage Mining

The frequency with which web sites are visited by users (visiting frequency) can be used as a ranking factor. Craswell et al. [34] use a database to maintain the frequency of visits for each web site. It is reasoned that the web sites with higher visiting frequency are more useful and should have higher rankings than the web sites less frequently visited.

Figure 2.1: Percentage of External Links

This method may not be appropriate in Chinese web search. The majority of the web sites in China are devoted to specific purposes, in particular, entertainment (e.g., movie ratings, mp3 downloads, chats and blogs) and consumer information (e.g., electronic product ratings and pricing comparisons). These web sites tend to have a higher frequency of visits than other web sites, such as web sites by professional organizations or the government. Ranking algorithms with frequency of visits as a primary factor tend to rank news and entertainment web sites above academic or government web sites. For instance, using a president’s name as the query, the president’s biography from an academic web site would provide more accurate information than reviews for a movie about the president on an entertainment web site. However, the entertainment web site would have a higher frequency of visits. This is also true for news web sites. The popularity of news and entertainment web sites is evident from the lists of top five search results in Table 2.1.

2.1.4 Suitability of English Search Techniques in Chinese Web Mining

From our brief analysis, we hypothesize that due to the complexity of the Chinese language and the Chinese web culture, methods originally developed for English search engines cannot be used directly in the development of Chinese search engines. However, they can be used as a framework with some additional processing.


Chapter 3

Related Work

In the previous chapter, we gave a preliminary analysis of the suitability of English language search techniques in Chinese web search. We noticed some unique characteristics in the local linking structure of the Chinese web, namely:

• Chinese web sites tend to have fewer external links compared to English web sites.

• Chinese web pages tend to reference each other in a more concentrated fashion, usually within a close-knit community.

• Chinese web sites tend to link to other web sites of similar purposes.

The above observations suggest that the hyper-links between Chinese web sites reflect the underlying friendships and relationships between Chinese web site owners. We hypothesize that Chinese web sites tend to link to other web sites with which they have prior relationships, such as the web site of a friend or a collaborating partner.

Before we begin our investigation (in Chapter 4), we survey related work. Our work is based on previous work in link analysis, web graph measurement, web graph evolution, community detection, online social network analysis, offline social network analysis, and random graph modeling. We briefly survey each area. Some of the work we survey in Section 3.5 and Section 3.6 is also documented by Kleinberg and Easley in their excellent new book on social network analysis [65].

3.1 Link Analysis

Link analysis is a major research area in web mining. Algorithms such as HITS [64] [19] and PageRank [21] are used by search engines to determine the ranking of web pages. There are many advantages in using link analysis algorithms.

First, one can use link analysis algorithms to rank web pages without getting any feedback from the users or storing the content of web pages in memory.

Another advantage of using link analysis algorithms is that they make it difficult for web site owners to cheat by manipulating keywords in the web documents.

In this section, we present some major link analysis algorithms. For some of the algorithms, a normalization step is necessary after each iteration. Throughout this section (Section 3.1), we use the symbols in Table 3.1 to present all the algorithms.

Symbol   Meaning
A        The adjacency matrix for the web graph
N        The total number of nodes in the web graph
Ix       The set of nodes that link to node x
|Ix|     The number of nodes that link to node x
Ox       The set of nodes that node x points to
|Ox|     The number of nodes that node x points to
d        The damping factor (usually 0.85 for PageRank)

Table 3.1: Link Analysis Algorithms: Symbols and Notations

3.1.1 PageRank

PageRank is a query-independent measurement of the importance of web pages based on the notion of peer-endorsement: a hyperlink from page A to page B is an endorsement of B’s content by A’s author.

Originally, the PageRank score, PR, is defined by Brin and Page [21] as follows:

$$PR(A) = (1 - d) + d \left( \frac{PR(t_1)}{C(t_1)} + \frac{PR(t_2)}{C(t_2)} + \dots + \frac{PR(t_n)}{C(t_n)} \right)$$


where t1, t2, ..., tn are the nodes linking to node A, C(ti) is the number of out-going links (out-degree) of node ti, and d is the damping factor, usually set to 0.85.

Using the symbols in Table 3.1, the PageRank score for a node x (PR_x) is equivalent to:

$$PR_x = (1 - d) + d \sum_{y \in I_x} \frac{PR_y}{|O_y|}$$

The damping factor d is used to guarantee convergence and is usually set to 0.85. One can formulate the PageRank algorithm as a random surfer problem [17]. Consider a web surfer who surfs through the web graph from node to node by following the out-going links. From node i, which has C(i) out-going links, the probability of the surfer moving to a node j that is pointed to by node i is 1/C(i). The probability of the surfer then moving to another page that is pointed to by node j is 1/C(j), and so on.

After PageRank is computed for the entire web graph, every page x should have a PageRank value that denotes the probability of the web surfer landing on page x by following out-going links [25].

If there are many cycles in the web graph or the web graph is disconnected, the web surfer will be trapped in a graph area. In order to avoid this entrapment, we can instruct him or her not to follow the out-going links forever, but he or she should jump to a random node with a probability of 1 − d (and with probability d, he or she should keep following the links on the page).
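The update rule of this section can be sketched as a short iteration. This is a minimal illustration rather than the implementation used by any search engine: the function name `pagerank`, the fixed iteration count, and the three-page toy graph are assumptions, the code uses the unnormalized form with the (1 − d) term exactly as written above, and it assumes that every node has at least one out-going link.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR_x = (1 - d) + d * sum(PR_y / |O_y| for y in I_x).

    out_links maps each node x to the list of nodes it points to (O_x).
    Assumption: every node has at least one out-going link and every
    link target is itself a key of out_links.
    """
    nodes = list(out_links)

    # Build I_x (the nodes that link to x) from O_x.
    in_links = {x: [] for x in nodes}
    for y, targets in out_links.items():
        for x in targets:
            in_links[x].append(y)

    pr = {x: 1.0 for x in nodes}  # initial scores (an assumption)
    for _ in range(iterations):
        pr = {
            x: (1 - d) + d * sum(pr[y] / len(out_links[y]) for y in in_links[x])
            for x in nodes
        }
    return pr


# Hypothetical three-page graph: a -> b, a -> c, b -> c, c -> a.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(toy_graph)
```

In this toy graph, page c receives links from both a and b and therefore ends up with a higher score than b, which is linked to only by a.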

There has been work on approximation methods to speed up the PageRank computation [27] [89]. Kamvar et al. [61] try to predict the PageRank score by using Aitken Extrapolation and Quadratic Extrapolation. They also present a method that exploits the block structure of the Web.

Lu et al. [84] present the PowerRank algorithm that exploits certain attributes of the web. Other variations of the PageRank algorithm include PopRank [94] and topic-sensitive PageRank [50], which add query-related scores to the PageRank computation. There is also research on the behavior of PageRank under assumptions about certain graph structures such as communities in the web [76]. Bianchini et al. [15] study the relationship between the structure of the web and the distribution of PageRank scores. Liu et al. [83] look at the distribution of PageRank scores in the Chinese web.

3.1.2 HITS

HITS (hypertext induced topic search) is a query-dependent algorithm based on the concept of topic endorsement [64]. The notion behind HITS is the discrimination between hubs and authorities. Hubs are pages with good links, whereas authorities are pages with good content. Any particular node can be a hub or an authority [19][3].

The HITS algorithm computes two scores for each page u: an authority score A(u) estimating how authoritative page u is on the topic induced by the query, and a hub score H(u) estimating if page u is a good reference to many authoritative pages.

The process of computing the authority score and the hub score for each node in a graph is as follows:

1. We start with A(u) = 1 and H(u) = 1 for each node u in the graph.

2. We then perform a sequence of updates. Each update is described as follows:

(a) First, for each node u, update A(u) to be the sum of the hub scores of all the nodes that point to u.

(b) Next, for each node u, update H(u) to be the sum of the authority scores of all the nodes that u points to.

3. The process continues until the values converge.

By using the symbols in Table 3.1, given the initial condition, the authority score (HA) and the hub score (HH) for a node x (HAx and HHx) can be computed as:


$$HA_x = \sum_{y \in I_x} HH_y$$

$$HH_x = \sum_{y \in O_x} HA_y$$
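The update steps above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the original HITS implementation: the function name `hits`, the fixed number of iterations (instead of a convergence test), the choice of Euclidean normalization after each update, and the toy graph are all assumptions made for the example.

```python
import math


def hits(out_links, iterations=50):
    """Return (authority, hub) scores for a graph given as {node: [targets]}."""
    nodes = list(out_links)

    # Build I_x (nodes pointing to x) from O_x (nodes x points to).
    in_links = {x: [] for x in nodes}
    for x, targets in out_links.items():
        for y in targets:
            in_links[y].append(x)

    auth = {x: 1.0 for x in nodes}
    hub = {x: 1.0 for x in nodes}
    for _ in range(iterations):
        # (a) authority score: sum of hub scores of the nodes pointing to x.
        auth = {x: sum(hub[y] for y in in_links[x]) for x in nodes}
        # (b) hub score: sum of authority scores of the nodes x points to.
        hub = {x: sum(auth[y] for y in out_links[x]) for x in nodes}
        # Normalization step (assumed Euclidean) keeps the scores bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {x: v / a_norm for x, v in auth.items()}
        hub = {x: v / h_norm for x, v in hub.items()}
    return auth, hub


# Hypothetical graph: h1 links to a1 and a2, h2 links to a1 only.
toy_graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
authority, hub_score = hits(toy_graph)
```

In the toy graph, a1 is pointed to by both hubs and so receives the highest authority score, while h1 points to more authorities than h2 and receives the higher hub score.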

Formally, HITS is computed over a graph subset. Each graph subset is prepared once for every user query. The set consists of all the pages that contain the user keywords, plus the pages that point to them, and the pages that are pointed to by them. Ideally, the HITS algorithm is performed on each graph subset while ignoring the rest of the web graph.

However, preparing such a graph subset for each user query is a very difficult task [91]. Kleinberg et al. [64] suggest some approximation methods to remedy the problem, such as sampling only a fixed-size random subset of the pages linking to the pages which contain the user keywords.

In practice, the HITS algorithm has some weak points. Borodin et al. [19] show that a hub is penalized when it points to “poor” authorities. As a result, the “poor” authorities become even “poorer” during the computation.

There are some variations on the HITS algorithm. Two of them are G-HITS [113] and subspace HITS [93].

3.2 Measurements of Web Graphs

Many experiments have been conducted regarding the structure of web graphs and site graphs [22] [5] [72]. We define the web graph as a directed graph corresponding to the linking structure between web pages, with nodes representing web pages and directed links between nodes representing the hyperlinks between pages.

We define the site graph as a directed graph corresponding to the linking structure between web sites, with nodes representing web sites. There is a single directed edge from node A to node B in the site graph if there is at least one link from a web page in web site A to a web page in web site B.
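The relationship between the two graphs can be illustrated with a small sketch that collapses page-level links into site-level edges. The function name `site_graph`, the example URLs, the use of the host name as the identity of a web site, and the decision to drop links within a single site are all assumptions made for this illustration.

```python
from urllib.parse import urlparse


def site_graph(page_links):
    """Collapse (source_url, target_url) page links into site-level edges.

    Assumptions: a web site is identified by the host name of its URLs,
    and links between pages of the same site are ignored.
    """
    edges = set()  # a set keeps a single edge per ordered pair of sites
    for source_url, target_url in page_links:
        site_a = urlparse(source_url).netloc
        site_b = urlparse(target_url).netloc
        if site_a != site_b:
            edges.add((site_a, site_b))
    return edges


# Hypothetical page-level links.
page_links = [
    ("http://a.example/1.html", "http://b.example/x.html"),
    ("http://a.example/2.html", "http://b.example/y.html"),
    ("http://a.example/1.html", "http://a.example/2.html"),
    ("http://b.example/x.html", "http://c.example/"),
]
edges = site_graph(page_links)
```

The two page links from a.example to b.example produce a single directed edge in the site graph, as in the definition above.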

One important property of the general web graph is that it follows the power law in the in-degree and out-degree distribution of nodes [73] [9] [5] [72]. Researchers have taken snapshots of the web at different times in the web’s history, and a recurring finding is that the fraction of web pages with degree i is proportional to 1/i^α for some constant α. Bharat et al. [12] show that the in-degree and out-degree distributions of the general site graph also follow the power law, with exponents of 1.62 and 1.67 respectively.

Yan and Li show that the Chinese web graph in 2002 has an in-degree exponent of 1.86 (see [51] for a report in Chinese). Liu et al. [83] analyze the dataset from a crawl of the Chinese web by Peking University’s Sky Net search engine in May 2003 and create a Chinese web graph containing 140 million pages and 4.3 billion links. They report that their Chinese web graph has an in-degree exponent of 2.05 and an out-degree exponent of 2.62. They also construct a Chinese site graph of 479,000 web sites and 18 million links. The in-degree and out-degree distributions of their site graph have exponents of 1.4 and 1.5 respectively.
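To make the notion of a power-law exponent concrete, the following sketch computes the fraction of nodes with each degree and estimates α as the negated slope of a least-squares line on the log-log plot. This is only a rough estimator chosen for brevity and is not the procedure used in the studies cited above; the function names `degree_fractions` and `fit_exponent` and the synthetic data are assumptions.

```python
import math
from collections import Counter


def degree_fractions(degrees):
    """Return {i: fraction of nodes with degree i}, ignoring degree 0.

    Degree 0 is dropped because log(0) is undefined on a log-log plot.
    """
    counts = Counter(deg for deg in degrees if deg >= 1)
    total = sum(counts.values())
    return {i: c / total for i, c in counts.items()}


def fit_exponent(fractions):
    """Estimate alpha in fraction(i) ~ 1 / i^alpha by log-log least squares."""
    xs = [math.log(i) for i in fractions]
    ys = [math.log(f) for f in fractions.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


# Synthetic check: values exactly proportional to 1 / i^2.
synthetic = {i: i ** -2.0 for i in range(1, 50)}
alpha = fit_exponent(synthetic)
```

On real degree data the tail of the distribution is noisy, so the estimate obtained this way should be treated as indicative only.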

3.3 Evolution of Web Graphs

While many studies have discovered patterns in static web graphs, it is hard to convert these findings into statements on the evolution of web graphs over time. In order to do so, one needs to take periodic snapshots of the web graph for a consecutive period of time.

Leskovec et al. [79] study the evolution of a wide range of graphs in the web and observe that most of these graphs become denser over time and that the average distance between nodes often shrinks over time.

Cho and Garcia-Molina [29] crawl a set of 720,000 web pages every day for four months and count a page as having changed if its MD5 checksum changed. This study finds that 40% of all the web pages in its set change within a week, and 23% of the web pages change daily.

Fetterly et al. [41] find that the average degree of change in web pages varies widely across top-level domains, and that pages with more content tend to change more often and more severely than pages with less content.

Ntoulas et al. [95] collected weekly snapshots of 150 web sites over the course of one year and measured the evolution of both web pages and the link structure. Their findings indicate a rapid turnover rate of web pages, i.e., a high rate of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them.

Liu et al. [83] monitored 150 Chinese web sites for six weeks and calculated the weekly turnover rate of web pages and of the link structure in their Chinese web graph. The study finds that, on average, the turnover rate for the links in the Chinese web graph is a little greater than that in the general web graph.

3.4 Random Graph Models

Many stochastic models have been created to generate random graphs with certain attributes which resemble the web (see [18] and [4] for surveys), namely, degree distributions that follow the power law and a large number of small bipartite cliques.

These stochastic graph models are useful for the following reasons [65]:

1. First, the process can explain how the web evolves. This is helpful for studying the evolution of the web.

2. Second, these models can prompt further research such as analyzing or modeling sociological and economic issues surrounding the Internet (as we will show in Chapter 4).

Two main concepts have been proposed: preferential attachment [9] and copying [71].


3.4.1 Preferential Attachment Model

The first evolving graph model explicitly designed to model the web is given by Barabasi and Albert [9]. The idea behind their model is that new nodes are more likely to link to existing nodes with high in-degree.

This model is now known as an example of the preferential attachment model. Barabasi and Albert [9] conclude that their model generates random graphs whose in-degree distribution follows the power law with an exponent of 3.

A simplified version of the preferential attachment model [9] is as follows:

• At each time step, nodes are created in order, and are named 1, 2, 3, ..., N.

• For each node j created, it produces a link to an existing node in the graph according to the following probabilistic rule:

– Node j chooses a node l with probability proportional to l’s in-degree, and creates a link to l.

• This describes the creation of a single link from node j to node l. One can repeat this process to create multiple, independently generated links for each newly created node.
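The simplified model can be sketched as a short simulation. One detail has to be assumed: taken literally, a node with in-degree 0 would never be chosen, so the sketch weights each existing node by its in-degree plus one. The function name `preferential_attachment`, its parameters, and this smoothing are assumptions for illustration and are not taken from [9].

```python
import random


def preferential_attachment(n, links_per_node=1, seed=None):
    """Grow a graph with nodes named 1..n; return (edges, in_degree)."""
    rng = random.Random(seed)
    in_degree = {1: 0}  # node 1 starts alone, with no links
    edges = []
    for j in range(2, n + 1):
        existing = list(range(1, j))
        # Assumption: weight = in-degree + 1, so that nodes with
        # in-degree 0 can still be chosen.
        weights = [in_degree[l] + 1 for l in existing]
        # Links are generated independently, so repeated targets can occur.
        targets = rng.choices(existing, weights=weights, k=links_per_node)
        in_degree[j] = 0
        for l in targets:
            edges.append((j, l))  # directed link from node j to node l
            in_degree[l] += 1
    return edges, in_degree


edges, in_degree = preferential_attachment(200, links_per_node=2, seed=1)
```

Recomputing the weights at every step keeps the sketch short but makes it quadratic in the number of nodes, which is acceptable only for small graphs.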

Many models have been created based on the concept of preferential attachment. Dorogovtsev et al. [36] modified the Barabasi-Albert model as follows:

• At each time step a new node is created and m new directed links are added to both the new node and existing nodes with uniform probability. The destinations of the m new directed links are determined by each node’s attractiveness score. Each node’s attractiveness score is calculated as follows:

– Each node is assigned an initial attractiveness score A when it is first created.

– As time increases the attractiveness score of a node S is equal to the initial attractiveness score A plus the in-degree of S.


The initial attractiveness score A governs the probability for newly created nodes to gather incoming links. After the initial creation, the attractiveness score of a node increases as the in-degree of the node increases.

In the original preferential attachment model, Barabasi and Albert [9] state that at each time step, outgoing directed links are only assigned to newly added nodes, whereas in the model by Dorogovtsev et al. [36], outgoing directed links can be assigned to both newly added nodes and old nodes. We can also assign multiple links at each time step instead of a single link.

Because the model by Dorogovtsev et al. [36] allows for the formation of mutual links, we use this model in Chapter 4 and refer to it as the generalized preferential attachment model.

3.4.2 Copying/the Hostgraph Model

The first model created based on the concept of copying was by Kleinberg et al. [64]. Later, the model was analyzed and modified by Kumar et al. [71]. The copying mechanism is motivated by the intuition that authors of web pages will randomly find a web page and then copy some portion of its links to their own web page. Kleinberg et al. [64] show that their model produces random graphs with in-degree distributions that follow the power law. We present the model by Kumar et al.

The random graph is generated by adding one new node (i.e., a web page) with k out-going links at each time step. The destination of the node’s ith out-going link is determined as follows:

• First we pick a node uniformly at random among all existing nodes. We call this node the “prototype”. We then pick k random out-going links from the prototype.

• With probability p, the destination of the ith out-going link is chosen uniformly at random.


• With the remaining probability 1 − p, the destination of the ith out-going link is chosen to be the destination of the ith out-going link picked from the prototype. This corresponds to an author creating a new web page on a topic by copying links from an already existing web page.
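The copying process described above can be sketched as follows. The description does not specify the initial graph, so the sketch assumes a seed of k + 1 nodes in which every node links to the other k; this guarantees that every prototype has exactly k out-going links. The function name `copying_model`, its parameters, and the seed graph are assumptions for illustration and are not taken from [71].

```python
import random


def copying_model(n, k=3, p=0.5, seed=None):
    """Grow a graph to n >= k + 1 nodes; return {node: [k destinations]}."""
    rng = random.Random(seed)
    # Assumed seed graph: nodes 0..k, each linking to the other k nodes.
    out_links = {u: [v for v in range(k + 1) if v != u] for u in range(k + 1)}
    for new in range(k + 1, n):
        # Pick the prototype uniformly at random among existing nodes 0..new-1.
        prototype = rng.randrange(new)
        # Pick k random out-going links from the prototype.
        prototype_links = rng.sample(out_links[prototype], k)
        links = []
        for i in range(k):
            if rng.random() < p:
                # With probability p: destination chosen uniformly at random.
                links.append(rng.randrange(new))
            else:
                # With probability 1 - p: copy the prototype's ith destination.
                links.append(prototype_links[i])
        out_links[new] = links
    return out_links


graph = copying_model(50, k=3, p=0.5, seed=1)
```

Because destinations are drawn independently, a new node may link to the same destination more than once; whether such duplicates should be merged is left open in this sketch.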

A notable variation of the copying model is the hostgraph model by Bharat et al. [12] designed to model the linking structure between web sites. The hostgraph model is described as follows:

• At each time step, with probability α, we select at random an existing node in the graph. We add k edges to the existing node. The destination of the ith edge is determined as follows.

– We pick a prototype uniformly at random among all existing nodes. We then pick k random out-going links from the prototype.

– With probability p, the destination of the ith out-going link is chosen uniformly at random.

– With the remaining probability 1 − p, the destination of the ith out-going link is chosen to be the destination of the ith out-going link picked from the prototype.

• With probability 1 − α, we follow the copying model.

The main difference between the copying model by Kumar et al. [71] and the hostgraph model by Bharat et al. [12] is that, in the hostgraph model, edges can be added at each time step to both new nodes and existing nodes. The hostgraph model thus allows for the formation of mutual links. We will use the model by Bharat et al. [12] in Chapter 4.

Both preferential attachment and copying assume that a node A links to a node B because node B is of interest to A, and this is independent of B’s interest in A. We call this behavior “referencing”.


3.5 Analysis of Offline Social Networks

For many years the structure of various offline social networks has been studied by sociologists and computer scientists (see [58] [90] [23] for surveys). Scott [101] identifies the various cliques, dyads, components and circles in which social networks can be formed and the significance of positions in those networks.

Researchers have also analyzed the structure of various Chinese offline social networks and guanxi networks [13] [100] [40] [14] [24].

Software such as Pajek and Ucinet 6 for Windows [35] can be used to calculate basic characteristics of social and physical networks.

3.5.1 Triadic Closure

One of the most basic concepts in social network analysis is that:

If two people in a social network have a friend in common, then there is an increased likelihood that they will become friends themselves at some point in the future [77].

We refer to this as the principle of triadic closure. In a social network, if node A and node B have a friend C in common, then the formation of an edge between node A and node B produces a situation in which all three nodes A, B, and C have edges connecting each other, a structure we refer to as a triangle in the network.

Triadic closure is very natural in a social network. One reason why A and B are more likely to become friends is simply based on the opportunity for A and B to meet: if C spends time with both A and B, then there is an increased chance that A and B will end up knowing each other and potentially becoming friends.

Another reason why A and B are more likely to become friends is that the mutual friendships which A and B share with C give A and B a reason to trust each other.


3.5.2 The Strength of Links

Another basic idea in social network analysis is that links in a social network can represent different types of friendships and relationships and can have a wide range of possible strengths.

For conceptual simplicity, we categorize all links in the social network as one of two types: strong ties (the links with stronger strength, corresponding to friends), and weak ties (the links with weaker strength, corresponding to acquaintances). We can take a social network and classify each edge as either a strong or weak tie [88].

The principle of strong triadic closure is motivated by the following intuition: if a node A has edges to both node B and node C, then the edge between B and C is especially likely to form if A’s edges to B and C are both strong ties.

Granovetter [46] gives a more formal definition:

We say that a node A violates the strong triadic closure property if it has strong ties to two other nodes B and C, and there is no edge at all (either a strong or weak tie) between B and C. We say that a node A satisfies the strong triadic closure property if it does not violate it.
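The definition can be checked mechanically once every edge is labeled as a strong or a weak tie. The sketch below is an illustration only: the function name `violates_strong_triadic_closure` and the representation of the network as a dictionary from unordered node pairs to the labels "strong" and "weak" are assumptions.

```python
from itertools import combinations


def violates_strong_triadic_closure(node, ties):
    """Return True if `node` violates the strong triadic closure property.

    ties maps frozenset({u, v}) to "strong" or "weak" for every edge
    in the network (no self-loops).
    """
    # Collect the other endpoint of every strong tie of `node`.
    strong_neighbours = [
        next(iter(edge - {node}))
        for edge, strength in ties.items()
        if node in edge and strength == "strong"
    ]
    # A violation is a pair of strong neighbours with no edge at all.
    return any(
        frozenset((b, c)) not in ties
        for b, c in combinations(strong_neighbours, 2)
    )


# Hypothetical network: A has strong ties to B and C.
ties = {
    frozenset(("A", "B")): "strong",
    frozenset(("A", "C")): "strong",
}
before = violates_strong_triadic_closure("A", ties)

# Adding even a weak tie between B and C removes the violation.
ties[frozenset(("B", "C"))] = "weak"
after = violates_strong_triadic_closure("A", ties)
```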

3.5.3 Homophily

Homophily is the concept of people bonding with similar others. Typically, an individual’s friends are not just random samples from the underlying population; they tend to be similar to the individual in terms of age, ethnic background, gender, interests, beliefs and so on. A social network’s surrounding context can be the force behind the formation of its friendship links [88].

Consider the contrast between a friendship that forms because two people are introduced by a common friend and a friendship that forms because two people attend the same school or work for the same company. In the first case, a new link is added between two nodes due to another node in the social network. In the second case, the new link is added due to the context outside the social network itself.


The tendency of people to form friendships with similar others is called selection. On the other hand, people may modify their behavior to bring it closer to the behavior of their friends. This process is called social influence. Finally, in a social network, two people may become friends if they share a common friend; this process is called triadic closure [65].

One can represent the surrounding context in the social network itself. A social-affiliation network consists of nodes representing individuals, links representing friendships, and nodes representing foci: “social, psychological, legal, or physical entit[ies] around which joint activities are organized (e.g., workplace, social groups)” [88]. Over time, friendships and memberships can be established or diminished due to the relationships between individuals and foci in the network.

Macpherson [88] identifies three basic patterns in which friendships and memberships can evolve in a social-affiliation network.

1. Triadic Closure: nodes A, B and C are persons in the network, and both A and B are friends with C. Over time, C can be the force behind the formation of a friendship between A and B, even if A and B are both unaware of their mutual friendships with C [46] (see Figure 3.1(a)).

2. Focal Closure: A and B are persons in the network, and F is a focus that both A and B participate in (e.g., workplace, social group). Over time, A and B can form a friendship due to the common focus (see Figure 3.1(b)).

3. Membership Closure: A and B are friends, and F is a focus that A participates in. Over time, B can come to participate in the same focus due to A’s involvement (see Figure 3.1(c)).

These three underlying mechanisms reflect triadic closure, social selection and social influence.

Homophily has been discovered in a vast array of social networks involving foci such as gender, age, religion, education, occupation, social class, and location [88].


Figure 3.1: Homophily in a Social Network

Researchers also find that homophily exists in a large number of societies, but its level and characteristics may differ from country to country [16] [104].

Researchers have also found homophily in various Chinese offline social networks. Blau et al. [16] find roughly the same level of educational and occupational homophily in a Chinese urban city as in the United States.

Xu et al. [111] look at Chinese children’s aggressive behavior as related to their positions in the social network. As a method of controlling the aggression, teachers in China tend to put aggressive children in a peer group with non-aggressive children. Over time, Xu et al. [111] find that friendships can be formed between aggressive children and non-aggressive children. For the aggressive children who are group members, the number of intra-group friendships moderates the children’s aggressive behavior.


Fang et al. [39] look at the tendency for Chinese adolescents to smoke relative to their social position and find that adolescents in China are more likely to experiment with cigarettes on their own than in a social group. After the adolescents take up smoking, they tend to socialize with other smokers.

We observe that the network analyzed by Fang et al. [39] exhibits weak membership closure and strong focal closure, while the network analyzed by Xu et al. [111] exhibits strong membership closure and weak focal closure.

3.6 Analysis of Online Social Networks

In this section, we survey related work on the structural properties of online social networks.

3.6.1 Degree Distributions in Online Social Networks

The degree distributions of online social networks have been well documented. Mislove et al. [90] present a large-scale measurement study and analysis of the structure of online social networks such as Orkut, YouTube, and Flickr. Their results show that online social networks follow the power law (see Section 3.2) in the in-degree and out-degree distributions. Kumar et al. [70] look at the linking structure of Flickr and Yahoo!360 and report similar findings.

Adamic et al. [2] present an analysis of Club Nexus, an online community at Stanford University which represents a small pure social network on the web. This paper shows that the distribution of the number of connections users make does not follow the power law. Adamic et al. reason:

In a pure social network such as an acquaintance network, there is a recurring cost in term of time and effort to maintain a friendship, and given the limited resources people have, they can only maintain a certain number of them [2].


Figure 3.2 (taken from [2]) shows the distribution of the number of user connections on Club Nexus. The log-log plot does not follow a straight line, and Adamic et al. [2] observe that this distribution does not follow the power law.

Figure 3.2: Distribution of User Connections in Club Nexus

Figure 3.3 illustrates the distribution of the number of online friends for users on Facebook [45]. We observe that this distribution also does not follow the power law.

The general hypothesis is that a pure social network, one which consists of friendships and relationships that were already established before the users joined the network, usually does not have degree distributions that follow the power law.


Figure 3.3: Distribution of Online Friends on Facebook

3.6.2 Community in Online Social Networks

One of the earlier uses of link structure is in the analysis of web communities, where properties such as cliques are identified and analyzed [44] [64] [72] [99].

Kumar et al. [73] define topic enumeration, which seeks to enumerate all topics that are well represented on the web by finding dense bipartite cores. The intuition is that a community emerges when many (hub) pages link to many of the same (authority) pages.

Another more general definition of a web community is a set of web pages that link (in either direction) to more web pages in the community than to web pages outside of the community [42][56]. Communities are identified by finding components separated by minimum cuts in the graph.


3.6.3 The Strength of Ties in Online Social Networks

As an increasing amount of our social interaction moves online, the way in which we maintain and access our social networks also changes.

Kleinberg [65] suspects that the users of online social networking platforms tend to maintain large explicit lists of online friends in their user profiles. In contrast, friendship circles were more implicit before the existence of online social networks.

When we see people maintaining hundreds of friendship links on a social networking platform, we can ask how many of these friendships correspond to strong ties that involve frequent contact, and how many correspond to weak ties that the users maintain.

The Strength of Ties on Facebook

Researchers have begun to address the questions of tie strength using data from various online social networks. At Facebook, Marlow [86] analyzes the friendship links in each user’s online profile, investigating the extent to which each link is used for social interaction.

Marlow [86] separates the links in the data set into three categories based on usage over a one-month observation period:

• A link represents reciprocal (mutual) communication if the user both sends messages to a friend and receives messages from him or her during the observation period.

• A link represents one-way communication if the user sends one or more messages to a friend.

• A link represents a maintained relationship if the user visits a friend’s profile more than once or clicks on the friend’s post displayed in Facebook’s news feed.

These three categories are not mutually exclusive: the links classified as reciprocal communication always belong to the set of links classified as one-way communication.
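The three categories can be expressed as a small classification rule over per-link activity counts. The sketch below only restates the definitions above; the function name `classify_link`, its four count parameters, and the category labels are assumptions, and it does not reflect how the data in [86] was actually processed.

```python
def classify_link(sent, received, profile_visits, feed_clicks):
    """Return the set of categories that a friendship link falls into.

    All arguments are counts for one user and one friend over the
    observation period (hypothetical inputs for illustration).
    """
    categories = set()
    if sent >= 1:
        categories.add("one-way")
        if received >= 1:
            # Reciprocal links are always also one-way links.
            categories.add("reciprocal")
    if profile_visits > 1 or feed_clicks >= 1:
        categories.add("maintained")
    return categories


example = classify_link(sent=2, received=1, profile_visits=0, feed_clicks=0)
```

Returning a set rather than a single label reflects the fact that the categories are not mutually exclusive.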

Marlow [86] shows that even for users with very large numbers of friends on their profile pages, the number of friends with whom they actually communicate is between 10 and 20. The number of friendships they maintain is under 50.

Marlow [86] draws a further conclusion about the power of online media such as Facebook to enable this kind of passive engagement, in which one keeps up with friends by reading news about them even in the absence of communication. The paper argues that this passive network occupies an interesting middle ground between a user’s strong ties and weak ties.

The stark contrast between reciprocal and passive networks shows the effect of technologies such as News Feed. If these people were required to talk on the phone to each other, we might see something like the reciprocal network, where everyone is connected to a small number of individuals. Moving to an environment where everyone is passively engaged with each other, some event, such as a new baby or engagement can propagate very quickly through this highly connected network. [86]

The Strength of Ties on Twitter

Similar studies have been done on the strength of ties on Twitter, where users engage in micro-blogging by posting short public messages known as “tweets”. Twitter users can specify a set of other users whose tweets they will follow. They can also send direct messages to other users. Huberman et al. [52] state that the former kind of interaction corresponds to more passive, weak ties while the latter kind of interaction corresponds to strong ties.

Huberman et al. [52] analyze these two kinds of ties on Twitter. Specifically, for each user they look at the number of users he or she follows (his or her “followees”) and the number of users he or she messaged privately over the course of the observation period. They find that the number of strong ties varies directly as a function of the number of followees each user has. This result differs from the result of Marlow [86] in that, on Facebook, even for users who have very large numbers of weak ties, the number of strong ties remains relatively small.

3.6.4 Homophily in Online Social Networks

Triadic Closure in Online Social Networks

How much more likely is an edge to form between two people if they have multiple friends in common? We see from the study of triadic closure in offline social networks (Section 3.5.1) that the more friends two people have in common, the more incentive the two people have to trust each other and the more likely it is that a friendship will form. In this section, we look at the characteristics of triadic closure in online social networks.

Kossinets and Watts [68] address the above question empirically by conducting the following experiment:

1. They take two snapshots of the network at different times.

2. For each k, they identify all pairs of nodes who have exactly k friends in common in the first snapshot.

3. They then define T (k) to be the fraction of these pairs that have formed an edge by the time of the second snapshot. T (k) is an estimate for the probability that a link will form between two people with k friends in common.

4. They plot T(k) as a function of k to illustrate the effect of common friends on the formation of links.
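The four steps above can be sketched for two small snapshots as follows. The sketch makes one assumption that is not stated in the steps: pairs that are already linked in the first snapshot are skipped, since they cannot form a new edge. The function name `triadic_closure_rate` and the representation of each snapshot as a dictionary from a node to its set of friends are also assumptions.

```python
from collections import Counter
from itertools import combinations


def triadic_closure_rate(first, second):
    """Return {k: T(k)} given two snapshots as {node: set of friends}."""
    pairs = Counter()   # number of unlinked pairs with k common friends
    closed = Counter()  # how many of those pairs are linked in `second`
    for a, b in combinations(sorted(first), 2):
        if b in first[a]:
            continue  # assumption: already-linked pairs are not counted
        k = len(first[a] & first[b])
        pairs[k] += 1
        if b in second.get(a, set()):
            closed[k] += 1
    return {k: closed[k] / pairs[k] for k in pairs}


# Hypothetical snapshots: a and b share the friend c and later connect.
first = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}, "d": set()}
second = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": set()}
t = triadic_closure_rate(first, second)
```

In this example the single pair with one common friend forms a link, while none of the three pairs without a common friend does.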

Kossinets and Watts [68] compute T(k) as a function of k using a data set of e-mails among 22,000 undergraduate and graduate students in a large U.S. university over a one-year period. They construct a network that evolves over time, joining two people by a link if they have exchanged e-mail in each direction.


Kossinets and Watts [68] find strong evidence of triadic closure: the probability of two people establishing a friendship increases linearly as the number of common friends between the two individuals increases.

Leskovec et al. [77] analyze the properties of triadic closure in LinkedIn, Flickr, Del.icio.us and Yahoo! Answers and report similar results.

Focal and Membership Closure in Online Social Networks

To investigate the characteristics of focal closure in an online social network, Kossinets and Watts [68] combine their university e-mail data with information about the class schedules for each student. Each class is treated as a focus. Two students share a focus if they have taken a class together. They find that a single shared class has roughly the same effect on the formation of friendships as a single shared friend.

Backstrom et al. [7] look at the characteristics of focal closure and membership closure on LiveJournal. Users of LiveJournal can specify their friends on their user profiles and join different LiveJournal communities. Backstrom et al. [7] treat a LiveJournal community as a focus. Figure 3.4 shows the probability of users joining a LiveJournal community as a function of the number of their friends who have already joined.

Crandall et al. [33] look at the communication between editors of Wikipedia articles. Here, the social-affiliation network consists of a node for each editor. There is an edge joining two editors if they have communicated with each other. Each Wikipedia article is treated as a focus. There is an edge joining an editor with a focus if he or she has edited the article. Figure 3.5 shows the probability of a person editing a Wikipedia article as a function of the number of prior editors with whom he or she has communicated.

We see that the probability of an individual joining a LiveJournal community increases as the number of the individual’s friends who have already joined increases. Similarly, the probability of an individual editing a Wikipedia article increases as the number of the editor’s friends who have edited increases.
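A curve of this kind can be tabulated as in the following sketch. It assumes two snapshots of a community's membership and a fixed friendship list, and reports, for each k, the fraction of non-members with k friends in the community who join by the second snapshot. All names and data are hypothetical and do not come from the cited studies.

```python
from collections import defaultdict


def join_rate_by_friends_in_community(friends, members_before, members_after):
    """Return {k: fraction of outsiders with k member friends who joined}."""
    outsiders = defaultdict(int)  # k -> users outside the community
    joined = defaultdict(int)     # k -> how many of them joined later
    for user, user_friends in friends.items():
        if user in members_before:
            continue  # already a member in the first snapshot
        k = len(set(user_friends) & members_before)
        outsiders[k] += 1
        if user in members_after:
            joined[k] += 1
    return {k: joined[k] / outsiders[k] for k in sorted(outsiders)}


# Toy data (hypothetical): bob has two friends in the community and joins.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob"},
    "dan": {"bob", "eve"},
    "eve": {"dan"},
}
members_before = {"ann", "cat"}
members_after = {"ann", "cat", "bob"}
join_rates = join_rate_by_friends_in_community(
    friends, members_before, members_after
)
print(join_rates)
```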


Figure 3.4: Probability of Users Joining a LiveJournal Community (y-axis) as a Function of Friends Already Joined (x-axis)

two editors before and after their first communication (Figure 3.6). The argument is that an article edited by two editors represents a common interest (a focus) between the two individuals.

We see that the number of common interests shared between the two editors increases before the editors have even communicated. Crandall et al. [33] see this as an indication of social selection.

After the two editors have communicated, the number of common interests shared between them continues to increase. This is interpreted as an indication of social influence.


Figure 3.5: Probability of Editing a Wikipedia Article (y-axis) as a Function of Friends Already Edited (x-axis)

have a set of friends before joining a community. The probability of a user joining the community increases as the number of their friends who have already joined increases. In other words, the social-affiliation network analyzed by Backstrom et al. [7] is low on social selection and high on social influence.

We compare our results with those of Backstrom et al. [7] and Crandall et al. [33] in Chapter 5 (see Section 5.3).

3.6.5 Analysis of Chinese Online Social Networks

There are few studies on the structure and evolution of Chinese online social networks. Jin [60] looks at various aspects of Chinese online Bulletin Board Systems (BBS), a type of online social network. These include the history and development of Chinese BBS, and Chinese BBS regulation and censorship. Jin [60] also provides observations on the structure and interface of Chinese BBS and the behavioral patterns of Chinese BBS users (we elaborate on these observations in Chapter 5).

Figure 3.6: The Number of Common Wikipedia Articles Edited by Two Editors Before and After Their First Communication

Xin [110] conducts a survey on the influence of BBS on university students in China and on their behavior on Chinese BBS.

We relate our findings to the findings of Jin [60] and Xin [110] in Chapter 5 (see Section 5.3).


3.7 Diffusion Model

In social network analysis, diffusion is the process by which a behavior spreads from person to person, like an epidemic. In a social network, people tend to influence their friends to adopt new ideas [65], and particular patterns of behavior can spread from node to node in the network.

There are two kinds of reasons why adopting the behaviors of one’s friends can be beneficial [65]:

1. The first is based on the fact that the choices made by others can provide indirect information about what they know. This is called informational effects [62].

2. The second is that there can be direct payoffs from adopting the behavior of others, for example the payoffs from using compatible technologies instead of incompatible ones. This is called direct benefit effects [47].

There are two main types of diffusion models: the independent cascade model [103] and the threshold model [47].

3.7.1 The Threshold Model

The threshold model by Granovetter [47] is a model based on direct benefit effects. The intuition behind the threshold model is that everyone in the social network has neighbors (friends, acquaintances, colleagues, etc.), and it may become more beneficial for someone to adopt a new behavior as more and more of their neighbors adopt it. In such a case, simple self-interest will dictate the person’s interest in adopting the new behavior. For example, they may find it easier to collaborate with co-workers who are using compatible technologies. Or, they may find it easier to engage in social interaction with people whose beliefs and opinions are similar to theirs [47].

Granovetter formulates the threshold model as follows [47]: for every node v in the social network, suppose that some of v’s neighbors adopt behavior A, and some of v’s neighbors adopt behavior B. What should v do in order to maximize its own payoff? The behavior of v depends on the relative number of v’s neighbors adopting each behavior, and on the relation between the payoff values.

Suppose v receives a payoff of a for each neighbor with whom it shares behavior A, and a payoff of b for each neighbor with whom it shares behavior B. We can derive the decision rule for v as follows: suppose that a fraction p of v’s neighbors adopt behavior A, and a fraction (1 − p) of v’s neighbors adopt behavior B. If v has d neighbors, then pd of v’s neighbors adopt behavior A and (1 − p)d adopt behavior B.

If v chooses to adopt behavior A, v gets a payoff of pda, and if v chooses B, v gets a payoff of (1 − p)db. Thus, behavior A is the better choice for v if

$$pda \geq (1 - p)db$$

Dividing both sides by d and collecting the terms in p gives p(a + b) ≥ b, so we have:

$$p \geq \frac{b}{a + b}$$

This inequality gives the simple rule of the threshold model: if at least a fraction q = b/(a + b) of v’s neighbors adopt behavior A, then v should adopt behavior A. Thus q is known as the “threshold” for node v. We will refer to the threshold model later in Chapter 5.
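The decision rule can be made concrete with a small simulation. The sketch below rests on assumptions that are not stated in the text: all nodes are updated at the same time in each round, a node that adopts A never switches back to B, and a node without neighbors keeps B. The function names and the toy network are hypothetical.

```python
def threshold(a, b):
    """Threshold q = b / (a + b) for per-neighbor payoffs a (A) and b (B)."""
    return b / (a + b)


def prefers_a(num_a, d, a, b):
    """Check pda >= (1 - p)db, where pd = num_a of the d neighbors use A."""
    return num_a * a >= (d - num_a) * b


def run_cascade(graph, initial_adopters, a, b):
    """Apply the threshold rule round by round until no node switches to A.

    graph maps each node to a list of its neighbors (undirected network).
    """
    adopters = set(initial_adopters)
    while True:
        new = {
            v
            for v, nbrs in graph.items()
            if v not in adopters
            and nbrs  # assumption: isolated nodes keep behavior B
            and prefers_a(sum(n in adopters for n in nbrs), len(nbrs), a, b)
        }
        if not new:
            return adopters
        adopters |= new


# Toy network (hypothetical): a dense core 1-4 with a tail 4-5-6.
graph = {
    1: [2, 3],
    2: [1, 3, 4],
    3: [1, 2, 4],
    4: [2, 3, 5],
    5: [4, 6],
    6: [5],
}

full = run_cascade(graph, {1, 2}, a=3, b=2)     # q = 0.4
partial = run_cascade(graph, {1, 2}, a=2, b=3)  # q = 0.6
print(threshold(3, 2), sorted(full))
print(threshold(2, 3), sorted(partial))
```

With q = 0.4 behavior A spreads to every node. With q = 0.6 it stops at node 4, because node 5 has only half of its neighbors using A.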

3.7.2 The Independent Cascade Model

In the independent cascade model [62], each node in the social network is in one of two states:

• Active, meaning the node has adopted the idea that is currently spreading.

• Inactive, meaning the node has not adopted the idea that is currently spreading.

The independent cascade model starts with a social network W and an initial set of active nodes A0. The cascading process unfolds in discrete steps according to the
