8 1 4 park, hosung) - kaistan.kaist.ac.kr/~sbmoon/paper/thesis/2010dec-hosung.pdf · 2018-08-30 ·...


석사 학위논문

Master’s Thesis

트위터소셜네트워크에서의정보전파분석

Analysis on Information Spreading as Recorded in Twittersphere

박 호 성 (朴 鎬 成 Park, Hosung)

전산학과

Department of Computer Science

KAIST

2011


Analysis on Information Spreading as Recorded in Twittersphere

Advisor : Professor Moon, Sue Bok

by

Park, Hosung

Department of Computer Science

KAIST

A thesis submitted to the faculty of KAIST in partial fulfillment of the requirements for the degree of Master of Science in Engineering in the Department of Computer Science. The study was conducted in accordance with Code of Research Ethics¹.

2010. 12. 16.

Approved by

Professor Moon, Sue Bok

[Advisor]

¹ Declaration of Ethical Conduct in Research: I, as a graduate student of KAIST, hereby declare that I have not committed any acts that may damage the credibility of my research. These include, but are not limited to: falsification, thesis written by someone else, distortion of research findings or plagiarism. I affirm that my thesis contains honest conclusions based on my own careful research under the guidance of my thesis advisor.


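Analysis on Information Spreading as Recorded in Twittersphere (트위터소셜네트워크에서의정보전파분석)

Park, Hosung

This thesis has been reviewed and approved by the thesis examination committee as a master's thesis of the Korea Advanced Institute of Science and Technology (KAIST).

December 16, 2010

Committee chair: Moon, Sue Bok (seal)

Committee member: 황 규 영 (seal)

Committee member: 오 혜 연 (seal)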


MCS

20093227

박호성. Park, Hosung. Analysis on Information Spreading as Recorded in Twittersphere. 트위터소셜네트워크에서의정보전파분석. Department of Computer Science. 2011. 22p. Advisor Prof. Moon, Sue Bok. Text in English.

ABSTRACT

Twitter offers an unprecedented opportunity to study information spreading in human society, because all user actions and the underlying social network can be recorded. In this thesis, we present an empirical study of information spreading phenomena on Twitter. We collected 32 million photo URLs posted on Twitter via Twitpic, together with 52 million related tweets, from April 2010 to June 2010. A Twitpic link guarantees that the source of a piece of information is a single Twitter user, which eliminates the possibility that multiple source users bring the same information into the Twittersphere. We analyze information creation and spreading by tracking Twitpic URL links. The microscopic characteristics of information diffusion at the level of individual pieces of information have not yet been well explored. We analyze the temporal and topological characteristics of microscopic information spreading using the reconstructed diffusion trees of Korean users. We find that diffusion on Twitter is very fast and produces wide and shallow diffusion trees. We present an information diffusion model that extends the Independent Cascade Model by taking the response time of users into account. We show that spreading probabilities are influenced by the type of information and by directness to the original source. To the best of our knowledge, this work is the first study of the characteristics of microscopic information diffusion on Twitter.


Contents

Abstract

Contents

List of Tables

List of Figures

Chapter 1. Introduction

Chapter 2. Background and Related Work

Chapter 3. Basic Analysis

3.1 Data Collections

3.2 Daily Patterns of Twitpic Links

3.3 Creation and Spreading of Photos

Chapter 4. Microscopic Patterns of Information Diffusion in Twitter

4.1 Temporal Analysis

4.2 Topological Analysis

Chapter 5. Information Diffusion Model of Twitter

5.1 Process of the Model

5.2 Characteristics of Diffusion Probabilities and Response Time

Chapter 6. Discussion

Chapter 7. Conclusions

Summary (in Korean)

References


List of Tables

4.1 Statistics of diffusion trees ≥ 10 nodes

4.2 Statistics of diffusion trees ≥ 100 nodes

5.1 Diffusion probabilities for each type

6.1 Correlation coefficients for group L (followers ≥ 150) and group S (10 < followers < 150)


List of Figures

3.1 Daily behavior of creation and spreading of information

3.2 CCDF of the number of photo creations for users

3.3 PDF of the interval between photo uploads

3.4 CCDF of the number of tweets for each link

3.5 CDF of spreading duration in days

3.6 The median duration in days over the number of spreading tweets

4.1 Examples of diffusion trees with the same number of nodes

4.2 CCDF of proportion of spread within 24 hours

4.3 CDF of max and median timestamps of tweets for each diffusion tree

4.4 Mean depth of tweets with timestamps in days

4.5 CDF of proportion of source contribution

4.6 Cascade size over proportion of source contribution

5.1 CDF of diffusion probabilities of direct diffusion from the source node

5.2 CDF of diffusion probabilities of indirect diffusion from non-source nodes

5.3 CDF of response time in hours

6.1 CDF of clustering coefficient of Korean users


Chapter 1. Introduction

Various online social network services (OSNs) allow users to communicate with each other and help users acquire information. An increasing number of people use OSNs not only for social interaction but also for propagating and obtaining information, as OSNs have recently come to play an important role as news media [14]. A piece of information or an idea is created by a source and then spreads over a social network as some users forward it to other users. This cascading phenomenon is common in OSNs.

There is a large body of literature on information spreading. Measurement and analysis studies have examined how information spreads through chain letters, the blogosphere, social games, Flickr, Facebook, Digg, Twitter and so on. Emerging OSNs allow researchers to track massive and complete user behavior data. One of these, Twitter, offers an unprecedented opportunity to study information spreading in human society, because all user actions and the underlying social network can be recorded.

In this thesis, we present an empirical study of how information spreads on Twitter, using Twitpic links. We trace Twitpic URL links to capture information propagation. Twitpic provides a unique URL for each uploaded photo, which guarantees that the source of the diffusion is a single Twitter user. This differs from the diffusion of links to news articles or blog posts, which may have multiple sources. Multiple sources bring redundant information into the Twittersphere, making it hard to identify microscopic patterns of information diffusion at the level of individual pieces of information. Because of this difficulty, recent research concentrates on macroscopic trends of diffusion for aggregate topics. In contrast, we concentrate on microscopic trends of information diffusion on Twitter using Twitpic links.

Our key questions are "How fast and how broadly does information diffuse?" and "What are the microscopic information spreading patterns on Twitter?" To answer these questions, we collected 32 million photo URLs posted on Twitter via Twitpic and 52 million related tweets from April 2010 to June 2010. We also collected the social graph of Korean users to examine information diffusion paths by inferring information diffusion trees.

We present an empirical analysis of information creation and spreading. Users tend to create photos periodically, but they spread information without regularity. Only 1% of photos become popular, gathering 10 tweets or more. We analyze the temporal and topological characteristics of microscopic information spreading using the inferred diffusion trees of Korean users. In a diffusion tree, each node corresponds to a forwarding tweet and each edge corresponds to a diffusion path. 99.88% of diffusion trees have a median timestamp within 7 days, implying that Twitter keeps its topics fresh. For the topological analysis, we measured cascade size, maximum depth, median depth, tree width, single-child edge fraction and the volume contribution of source nodes. The statistics show that information diffusion trees are wide and shallow.

We consider the Independent Cascade Model (ICM) a suitable model for Twitter and present an information diffusion model that extends ICM by taking the response time of users into account. We show that different types of information have different spreading probabilities. We divide information into three types: ‘General’, ‘Promotion’ and ‘Request’. Promotion information, which offers a reward for spreading, has the largest spreading probability. Direct transfer from the original source has a larger spreading probability than indirect transfer from other nodes. We also measure the response time of users. The median response time is 26.37 minutes.

This thesis is organized as follows. Chapter 2 describes background and related work. Chapter 3 presents a basic analysis of the creation and spreading of information. Chapter 4 studies temporal and topological diffusion patterns on Twitter. Chapter 5 covers our information diffusion model of Twitter. Chapter 6 discusses the effective spreading structure of a social network. Chapter 7 concludes.


Chapter 2. Background and Related Work

Twitter (http://twitter.com) is a popular online social networking site. Twitter users can post and read short text messages of at most 140 characters, called tweets. Users ‘follow’ other users to subscribe to their tweets. The follow relationship is directed, and only 22.1% of social edges are reciprocal [14]. Twitter provides the ‘Retweet’ feature to let users forward information to others easily. If a user retweets a tweet that he or she received, his or her followers can also read that tweet. Once retweeted, a tweet gets retweeted almost instantly on the next hops. General users, celebrities, mass media and enterprises use Twitter for social interaction and as an information channel. Kwak et al. [14] showed that Twitter has the characteristics of both a social networking service and news media.

There is a large body of literature on social network analysis [14, 15, 3, 1, 2, 16, 17, 12, 5]. Measurement and analysis studies have examined how information spreads on social networks, covering various types of information and social networks. Gruhl et al. [12] studied information diffusion in blogspace and presented a “chatter” and “spike” characterization of topic propagation. Leskovec et al. [17] considered information propagation as recommendation cascades. They studied recommendation and purchase propagation using a dataset from a large online retailer, and described cascade patterns in terms of frequent cascade subgraphs and the size distribution of cascades. Leskovec et al. [16] analyzed temporal and structural patterns of information propagation in blogspace. McGlohon et al. [19] clustered blogs into ‘humor’ and ‘conservative’ blogs by structural cascade type and showed that the temporal activity of blogs is bursty. Bakshy et al. [1] viewed gesture adoption in Second Life as information propagation. Cha et al. [2, 3] analyzed the online media YouTube and Flickr. Emerging online social networking services allow researchers to track massive and complete user behavior data. Kwak et al. [14] studied the entire Twittersphere. Lerman et al. [15] studied the spread of information on Digg and Twitter.

However, the microscopic information spreading behavior of Twitter at the level of individual pieces of information has not yet been well explored. Our work concentrates on the microscopic information diffusion patterns on Twitter. Previous research analyzes macroscopic characteristics such as the diffusion of trending topics. We track Twitpic URL links to capture the temporal and topological characteristics of information spreading for each link.

There have been many efforts to define information spreading models in social networks [10, 8, 9, 4, 22, 13]. The Independent Cascade Model (ICM) and the Linear Threshold Model (LTM) are the two basic diffusion models considered in previous studies. These models are explained in Chapter 5.

Kempe et al. [13] found influential nodes that maximize the spread of influence under the given models. Watts et al. [22] studied information diffusion based on simulations with the threshold model. Goyal et al. [10] studied a method to set the probability parameters in the model. Gomez et al. [8] inferred underlying diffusion networks when only the timestamps of diffusion are provided. Gotz et al. [9] proposed a zero-crossing model for each individual blog to produce the temporal and topological characteristics of the blogosphere.

We consider ICM a suitable model for Twitter and present an extension of ICM that takes the response time of users into account. We show that spreading probabilities are influenced by the type of information and by directness to the original source.


Chapter 3. Basic Analysis

In this chapter, we present a basic analysis of the creation and spreading of information.

3.1 Data Collections

Twitpic (http://twitpic.com) is a third-party service for Twitter that allows users to share their photos easily on Twitter. As of August 2010, Twitpic had about 110 million visitors a month [21]. Users upload their photos to Twitpic from mobile devices or desktops. Each uploaded photo receives a unique URL from Twitpic, and users include these URLs in tweets to share their photos. We use Twitpic URLs to track the spread of information on Twitter.

We used the Twitter Streaming API [20] to crawl Twitpic URLs on Twitter. We collected traces for 3 months, from April 2010 to June 2010, for the basic analysis in this chapter. In total, 52,696,184 tweets were collected, containing 32,053,742 unique photos. We also collected snapshots of the social graph of Korean Twitter users for the analysis of diffusion patterns. There were about 640 thousand Korean users as of June 30th.

3.2 Daily Patterns of Twitpic Links

Figure 3.1 shows the daily number of created photos and tweets. The x-axis represents a timeline in days. The red boxes show the number of tweets that upload a photo, creating information in the Twittersphere. The green line shows the total number of tweets, covering both creation and spreading of photos. The blue line is the difference between the green line and the red boxes, that is, the number of spreading tweets.

Information creation has a period of 7 days, with peaks on weekends. Users may create more photos on weekends because they have more spare time and attend more special events then. In contrast, spreading behavior has no obvious period. This implies that spreading can occur at any time, because it only requires access to Twitter.

3.3 Creation and Spreading of Photos

How many users created photos during the 3 months? The number of users who created photos is 3,627,782. The mean number of photo uploads per user is 8.8 and the median is 3.

Figure 3.2 shows the complementary cumulative distribution function (CCDF) of the number of photos created per user on a log scale. More than 90% of users uploaded fewer than 25 photos.
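As an illustration of how an empirical CCDF of per-user upload counts can be computed, the following is a minimal sketch. It assumes the convention P[X ≥ x]; the thesis does not state which convention its figures use. The function name `empirical_ccdf` and the sample counts are hypothetical and are not taken from the thesis data.

```python
from collections import Counter

def empirical_ccdf(values):
    """Return sorted (x, P[X >= x]) pairs for a list of counts."""
    n = len(values)
    counts = Counter(values)
    ccdf = []
    remaining = n  # number of samples whose value is >= the current x
    for x in sorted(counts):
        ccdf.append((x, remaining / n))
        remaining -= counts[x]
    return ccdf

# Hypothetical per-user upload counts (illustrative only, not thesis data).
uploads_per_user = [1, 1, 2, 3, 3, 3, 8, 25, 40, 120]
ccdf = empirical_ccdf(uploads_per_user)
```

Plotting the resulting pairs with both axes on a log scale would give a curve of the same kind as Figure 3.2.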

What is the interval between photo uploads of a user? Figure 3.3 shows the probability density function (PDF) of the interval between photo uploads on a log scale. The unit of the x-axis is seconds. The distribution fluctuates after 24 hours. The vertical black lines are guide lines marking whole days (24 hours, 48 hours and so on). The peaks fall on the guide lines, implying regular behavior: users may upload photos at a regular time of day. 90% of intervals are less than 8 days and the median interval is 12.87 hours.


Figure 3.1: Daily behavior of creation and spreading of information

Figure 3.2: CCDF of the number of photo creations for users

We regard posting a tweet that contains a Twitpic URL link as spreading behavior. Figure 3.4 shows the CCDF of the number of tweets for each link on a log scale. 81.6% of photos do not spread at all, implying that most photos are not meaningful to others. Only 1% of photos are referred to 10 times or more. In this thesis, we focus on the diffusion of information that is spread 10 times or more, to capture the diffusion of meaningful information.

Figure 3.3: PDF of the interval between photo uploads

Figure 3.5 shows the CDF of spreading duration in days. The spreading duration is the time difference between the first tweet and the last tweet of a link. We only plot links that have two tweets or more. 90% of information spreading ends within just one day.

Figure 3.6 shows the median duration in days over the number of spreading tweets on a log scale. The median duration is proportional to the number of tweets up to 100 tweets. Massive spreading with more than 100 tweets can occur over widely varying durations.


Figure 3.4: CCDF of the number of tweets for each link

Figure 3.5: CDF of spreading duration in days


Figure 3.6: The median duration in days over the number of spreading tweets


Chapter 4. Microscopic Patterns of Information Diffusion in Twitter

We reconstruct information diffusion trees in temporal order to investigate microscopic information diffusion patterns. Reconstructing the trees requires some inference, for the following reasons.

First, even though Twitter provides the ‘Retweet’ feature to forward tweets to others, some users spread information in the form of a normal tweet, which does not identify the user from whom they received the information. Second, since the introduction of the official ‘Retweet’ button in the Twitter interface, the forwarder parsed from a retweet message can be incorrect: if a message is retweeted with the retweet button, the original source of the message appears as the forwarder, regardless of who actually forwarded it.

The inference rules to find the actual forwarder of a tweet by user A are as follows.

• Find the set U of users who are followees of A and who tweeted the link before A’s tweet appeared.

• If A’s tweet is a retweet and the forwarder F parsed from the text of the retweet is in U, set the actual forwarder of A to F.

• Otherwise, set the actual forwarder of A to the first user in U to tweet.
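The three rules above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the tweet dictionaries, the `RT @user` text convention used for parsing, the function name `infer_forwarder`, the example URL and the choice to return `None` when U is empty are all hypothetical.

```python
import re

# Assumed retweet text convention "RT @user"; the thesis does not specify its parser.
RT_PATTERN = re.compile(r"\bRT @(\w+)")

def infer_forwarder(tweet, link_tweets, followees):
    """Infer who forwarded a link to the author of `tweet`.

    tweet, link_tweets: dicts with 'user', 'time' and 'text' for one Twitpic link.
    followees: set of users that the author of `tweet` follows.
    Returns the inferred forwarder, or None if U is empty (an assumption).
    """
    # Rule 1: U = followees of A who tweeted the link before A's tweet.
    candidates = [t for t in link_tweets
                  if t["user"] in followees and t["time"] < tweet["time"]]
    if not candidates:
        return None
    candidate_users = {t["user"] for t in candidates}
    # Rule 2: trust the parsed forwarder F only if F is in U.
    match = RT_PATTERN.search(tweet["text"])
    if match and match.group(1) in candidate_users:
        return match.group(1)
    # Rule 3: otherwise pick the first user in U to tweet.
    return min(candidates, key=lambda t: t["time"])["user"]

# Hypothetical example data (illustrative only).
tweets = [
    {"user": "alice", "time": 0, "text": "photo http://twitpic.com/example"},
    {"user": "bob", "time": 5, "text": "RT @alice photo http://twitpic.com/example"},
    {"user": "dave", "time": 7, "text": "look http://twitpic.com/example"},
    {"user": "carol", "time": 9, "text": "RT @alice photo http://twitpic.com/example"},
]
follows = {"alice": set(), "bob": {"alice"}, "dave": {"bob"}, "carol": {"dave", "bob"}}
forwarders = {t["user"]: infer_forwarder(t, tweets, follows[t["user"]]) for t in tweets}
```

In this example carol's retweet names alice, but carol does not follow alice, so the third rule assigns bob, the earliest of carol's followees to tweet the link.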

We ignore replies because replies are not intended for spreading, and the followers of a replier cannot see a reply unless they follow both the replier and the replied-to user. We also eliminate loops and multiple edges so that the result is a tree. Examples of reconstructed diffusion trees are shown in Figure 4.1. The two trees have the same number of nodes but different shapes.

There may be several diffusion trees for one piece of information. We focus on the diffusion tree that stems from the original source of the photo and call it the primary tree. For primary trees with 10 edges or more, the median proportion of tweet volume contributed by the primary tree is 0.842, which means that 84.2% of tweets are in the primary tree. For primary trees with 100 edges or more, the median proportion is 0.696, which suggests that widely spread information is more likely to spread along external paths outside the Twitter social network. We examine only the primary trees to study information diffusion phenomena on social networks.

4.1 Temporal Analysis

In this section, we investigate how fast information diffuses on Twitter and its temporal diffusion patterns. We consider only the 6,728 diffusion trees of Korean users that have 10 edges or more.

Figure 4.2 shows the CCDF of the proportion of tweets that spread the information within 24 hours after the source information appeared. More than 90% of diffusion trees have at least 80% of their tweets within 24 hours. This result shows that most information diffusion on Twitter takes place within a day, which is very fast compared to other OSNs.

Figure 4.3 shows the CDF of the maximum and median timestamps of tweets for each diffusion tree. Red points stand for maximum timestamps and blue points for median timestamps. 90.84% of trees complete diffusion within 7 days, and 99.88% of trees have median timestamps within 7 days. These characteristics allow Twitter to be a source of real-time information and news.

Figure 4.1: Examples of diffusion trees with the same number of nodes

Figure 4.2: CCDF of proportion of spread within 24 hours

Figure 4.4 shows the mean depth of tweets in diffusion trees, with tweets grouped by timestamp in days. Tweets on the first day occur at shallower depths than tweets on other days, but there is little difference in mean depth among the tweets of the other days. This implies that slow diffusion does not always involve many steps from the source, and that fast diffusion can also involve many steps from the source.

Figure 4.3: CDF of max and median timestamps of tweets for each diffusion tree

4.2 Topological Analysis

We measure properties of diffusion trees having (T1) 10 nodes or more and (T2) 100 nodes or more. The measured properties are cascade size, maximum depth, median depth, tree width, single-child edge fraction and the volume contribution of source nodes. Here, the width of a tree is defined as the maximum number of nodes at the same depth, and the single-child edge fraction is defined as the fraction of nodes with exactly one child, as used in [18]. The statistics are shown in Tables 4.1 and 4.2. The median cascade size is 17 for (T1) and 174 for (T2). The median of the median depth is 1.5 for (T1) and 1.0 for (T2). The median width is 10.0 for (T1) and 125.0 for (T2). These metrics show that diffusion trees on Twitter are wide and shallow. The single-child edge fraction has a median of 0.125 for (T1) and 0.0439 for (T2). It is very low, meaning that there are few single chains that lengthen a tree without broadening it. The median proportion of the source’s children among all nodes is 0.4375 for (T1) and 0.8104 for (T2). This means that the source plays an important role in information diffusion.

However, the source volume contribution of 0.8104 for (T2) can be misleading, as Figures 4.5 and 4.6 show. Figure 4.5 shows the CDF of the proportion of source contribution. Black dots represent trees with 10 nodes or more (T1) and red dots represent trees with 100 nodes or more (T2). For widely spread information (T2), even though the median is high at 0.8104, the proportion of source contribution splits into two groups: large source contribution and small source contribution. This implies that widely spread information needs either an influential source node or hubs along the diffusion path. Figure 4.6 shows cascade size over the proportion of source contribution. It confirms that widely spread information results from either large or small source contribution.

Figure 4.4: Mean depth of tweets with timestamps in days

Figure 4.5: CDF of proportion of source contribution

Figure 4.6: Cascade size over proportion of source contribution


≥ 10 nodes                   Min.       1st Qu.    Median     Mean       3rd Qu.    Max.
cascade size                 10.00      12.00      17.00      36.88      31.00      991.00
max. depth                   1.00       2.00       3.00       3.24       4.00       16.00
median depth                 1.000      1.000      1.500      1.654      2.000      7.000
width                        2.00       7.00       10.00      23.45      17.00      954.00
single-child edge fraction   0.00000    0.05882    0.12500    0.13640    0.20000    0.66670
source contribution          0.001092   0.214300   0.437500   0.476700   0.750000   0.997100

Table 4.1: Statistics of diffusion trees ≥ 10 nodes

≥ 100 nodes                  Min.       1st Qu.    Median     Mean       3rd Qu.    Max.
cascade size                 100.0      131.0      174.0      226.3      265.0      991.0
max. depth                   1.000      2.000      3.000      4.216      6.000      16.000
median depth                 1.000      1.000      1.000      1.776      2.000      7.000
width                        19.0       67.0       125.0      164.7      213.0      954.0
single-child edge fraction   0.00000    0.01629    0.04390    0.08019    0.14790    0.24760
source contribution          0.001092   0.117200   0.810400   0.583000   0.960700   0.997100

Table 4.2: Statistics of diffusion trees ≥ 100 nodes
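The tree properties reported in Tables 4.1 and 4.2 can be computed from a parent map of a reconstructed tree, as in the following minimal sketch. Several details are assumptions rather than statements from the thesis: the source is placed at depth 0, depth statistics are taken over non-source nodes only, fractions are taken over all nodes including the source, and the tree has at least one edge. The function name `tree_metrics` and the example tree are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import median

def tree_metrics(parent, source):
    """Compute shape statistics of one diffusion tree.

    parent: dict mapping every non-source node to its parent (assumed input format).
    source: the root node, assumed to be at depth 0.
    """
    children = defaultdict(list)
    for node, par in parent.items():
        children[par].append(node)

    # Breadth-first traversal assigns a depth to every node reachable from the source.
    depth = {source: 0}
    frontier = [source]
    while frontier:
        next_frontier = []
        for node in frontier:
            for child in children.get(node, []):
                depth[child] = depth[node] + 1
                next_frontier.append(child)
        frontier = next_frontier

    tweet_depths = [d for node, d in depth.items() if node != source]
    nodes_per_depth = Counter(depth.values())
    single_child = sum(1 for node in depth if len(children.get(node, [])) == 1)
    size = len(depth)
    return {
        "cascade_size": size,
        "max_depth": max(tweet_depths),
        "median_depth": median(tweet_depths),
        "width": max(nodes_per_depth.values()),          # most nodes at one depth
        "single_child_fraction": single_child / size,    # nodes with exactly one child
        "source_contribution": len(children.get(source, [])) / size,
    }

# Hypothetical tree: s -> a, b, c; a -> d; d -> e.
metrics = tree_metrics({"a": "s", "b": "s", "c": "s", "d": "a", "e": "d"}, "s")
```

For this six-node example the width is 3 and the source contribution is 0.5, while the chain a, d, e accounts for the two single-child nodes.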


Chapter 5. Information Diffusion Model of Twitter

In this chapter, we present an information diffusion model for Twitter and the characteristics of information diffusion used by the model.

The Independent Cascade Model and the Linear Threshold Model are the two basic diffusion models considered in previous studies.

The Independent Cascade Model (ICM) [7] starts with a set of active nodes $A_0$. When a node $s$ becomes active, it has exactly one chance to activate each inactive neighbor $t$, succeeding with probability $p(s, t)$. Regardless of whether the activation succeeds, $s$ cannot attempt to activate $t$ again. Newly activated nodes have a chance to activate their neighbors in the next step. The process runs in discrete steps until no more activations are possible.
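The ICM process described above can be sketched as a short simulation. This is an illustrative sketch only: the adjacency-dictionary graph format, the function name `independent_cascade` and the example graph are assumptions and do not come from the thesis.

```python
import random

def independent_cascade(neighbors, seeds, prob, rng=None):
    """Simulate one run of the Independent Cascade Model.

    neighbors: dict mapping a node to the nodes it can try to activate.
    seeds: the initially active nodes (A_0).
    prob: function (s, t) -> activation probability p(s, t).
    Returns the set of nodes that are active when the process stops.
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(seeds)  # nodes activated in the previous step
    while frontier:
        newly_active = []
        for s in frontier:
            # s gets exactly one attempt on each currently inactive neighbor.
            for t in neighbors.get(s, []):
                if t not in active and rng.random() < prob(s, t):
                    active.add(t)
                    newly_active.append(t)
        frontier = newly_active
    return active

# Hypothetical graph; with p = 1 everything reachable is activated, with p = 0 nothing is.
graph = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
always = independent_cascade(graph, ["s"], lambda s, t: 1.0)
never = independent_cascade(graph, ["s"], lambda s, t: 0.0)
```

Because each node enters the frontier only once, every pair $(s, t)$ is attempted at most once, which matches the single-chance rule.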

The Linear Threshold Model (LTM) [11] has a threshold $t_n \in [0, 1]$ for each node $n$. A node $n$ is influenced by its neighbors $m$ with weight $w(n, m)$, where $\sum_{m} w(n, m) \le 1$. This process also starts with a set of active nodes $A_0$ and continues in discrete steps. At each step, a node $n$ is activated if $t_n \le \sum_{\text{activated } m} w(n, m)$.
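For comparison, the LTM activation rule can be sketched in the same style. This is a minimal sketch under assumed data structures: `in_weights[n][m]` holds $w(n, m)$, `thresholds[n]` holds $t_n$, and the function name `linear_threshold` and the example values are hypothetical.

```python
def linear_threshold(in_weights, thresholds, seeds):
    """Simulate the Linear Threshold Model with synchronous steps.

    in_weights: dict n -> {m: w(n, m)}, with the weights of each n summing to at most 1.
    thresholds: dict n -> t_n in [0, 1].
    seeds: the initially active nodes (A_0).
    """
    active = set(seeds)
    while True:
        newly_active = []
        for n, t_n in thresholds.items():
            if n in active:
                continue
            # Sum the weights of neighbors that were active at the start of this step.
            influence = sum(w for m, w in in_weights.get(n, {}).items() if m in active)
            if t_n <= influence:
                newly_active.append(n)
        if not newly_active:
            return active
        active.update(newly_active)

# Hypothetical example: a needs s; b needs both s and a; c never reaches its threshold.
weights = {"a": {"s": 0.6}, "b": {"s": 0.3, "a": 0.3}, "c": {"b": 0.4}}
limits = {"a": 0.5, "b": 0.5, "c": 0.5}
result = linear_threshold(weights, limits, ["s"])
```

In the example, b is activated only after the combined weight of s and a reaches its threshold, which shows how LTM depends on a node seeing repeated exposure from several neighbors.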

LTM is not suitable for our diffusion modeling because Twitter users cannot see the redundant retweets of the same tweet that are generated with the retweet button. Instead, they see only the first of these retweets in their timelines. 61.4% of users receive only one tweet, and more than 90% of users receive no more than 5 tweets, for the same link. Thus, we extend ICM to build the information diffusion model for Twitter.

5.1 Process of the Model

Unlike ICM, our model assigns a response time to each edge. The response time is the time that elapses between a user receiving information and forwarding it. In the real world, a depth-1 edge can be created later than a depth-3 edge when the user on the depth-1 edge discovers the information later than the user on the depth-3 edge. ICM cannot reflect this phenomenon because it always creates deeper edges later than shallower ones. Thus we take response time into account when creating edges. In our model there is only one initially active node $A_0$, because the source of a Twitpic link is unique. Activation in this model means forwarding the information to others. We assume that an activated node cannot return to the inactive state. Information diffuses from an active node $s$ to an inactive neighbor $t$ with probability $p(s, t)$. If the activation succeeds, a new edge $(s, t)$ is created in the diffusion tree. When a new edge is created, we draw a response time from the response time distribution shown in Section 5.2 and attach the resulting timestamp to the edge. The process runs in time ticks. Activated nodes whose timestamps are earlier than the current time tick get a chance to activate their neighbors. The process runs until the time tick reaches the selected time limit. If a node has multiple incoming edges, only the first edge gets a chance to activate it.
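One possible reading of this process is sketched below in Python. It is an illustration under stated assumptions, not the implementation used in the thesis: the function and parameter names are hypothetical, the timestamp of an edge is taken to be the forwarding time of s plus the sampled response time, "first edge" is taken to mean the first neighbor that attempts the activation, and edges whose timestamps fall beyond the time limit are dropped.

```python
import random


def simulate_diffusion(followers, prob, sample_response_time, source,
                       time_limit, rng=None):
    """Return diffusion-tree edges as (s, t, timestamp) tuples.

    followers:            dict node -> iterable of followers of that node
    prob:                 function (s, t) -> diffusion probability p(s, t)
    sample_response_time: function rng -> response time in ticks (>= 0)
    """
    rng = rng or random.Random()
    forward_time = {source: 0}   # node -> timestamp at which it forwards
    attempted = {source}         # nodes whose single chance is already used
    has_spread = set()           # active nodes that already tried their followers
    edges = []
    for tick in range(1, time_limit + 1):
        # active nodes with a timestamp earlier than the current tick
        ready = sorted(
            (s for s, ts in forward_time.items()
             if ts < tick and s not in has_spread),
            key=lambda s: forward_time[s],
        )
        for s in ready:
            has_spread.add(s)
            for t in followers.get(s, ()):
                if t in attempted:
                    continue     # assumption: only the first incoming edge tries
                attempted.add(t)
                if rng.random() < prob(s, t):
                    ts = forward_time[s] + sample_response_time(rng)
                    forward_time[t] = ts
                    edges.append((s, t, ts))
    # assumption: edges stamped after the time limit are not part of the tree
    return [edge for edge in edges if edge[2] <= time_limit]


# Toy follower graph (hypothetical) with a constant response time of 2 ticks.
toy = {"src": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
tree = simulate_diffusion(toy, lambda s, t: 1.0, lambda rng: 2, "src", 10)
```

Unlike plain ICM, the order in which edges appear here is driven by timestamps rather than by depth, so a node with a long response time can forward later than nodes deeper in the tree.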

In the next section, we show the characteristics of the diffusion probabilities and the response time in Twitter that the model uses.


5.2 Characteristics of Diffusion Probabilities and Response Time

Diffusion Probabilities in Twitter

We measure diffusion probabilities with respect to the type of information and the directness to the source node. We manually divide information in Twitpic into three types: (1) general information, (2) promotion information and (3) request information. General information is ordinary content such as street scenes, portraits and humorous images. Promotion information rewards users for spreading it. For example, enterprises launch promotional campaigns for new products in Twitter and give the product to users who retweet the campaign. Request information offers no reward for spreading, but the tweet asks users to spread it. A search for a missing child is one example of this type. We also consider whether a node receives the information directly from the source node or not. For each case we calculate the mean diffusion probability $p = \frac{\text{sum of all activated followers of all nodes}}{\text{sum of all followers of all nodes}}$. Table 5.1 shows these diffusion probabilities.

Information type         Direct from source    Indirect from source
General information      0.00107               0.00083
Promotion information    0.01382               0.00103
Request information      0.00321               0.00112

Table 5.1: Diffusion probabilities for each type
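The mean diffusion probability defined before Table 5.1 is a pooled ratio over all nodes, which a short Python sketch makes concrete. The function name and the input layout (one pair of counts per node) are assumptions, and the numbers in the example are made up rather than taken from the dataset.

```python
def mean_diffusion_probability(nodes):
    """nodes: iterable of (activated_followers, followers) count pairs."""
    total_activated = 0
    total_followers = 0
    for activated, followers in nodes:
        total_activated += activated
        total_followers += followers
    if total_followers == 0:
        return 0.0               # no followers at all: nothing could be activated
    return total_activated / total_followers


# Made-up counts: 3 activated followers out of 3000 followers in total.
sample = [(2, 1000), (0, 500), (1, 1500)]
p = mean_diffusion_probability(sample)
```

Note that this pooled ratio weights nodes by their number of followers, so it generally differs from the average of the per-node ratios.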

Promotion information has the strongest spreading power for direct diffusion from the information source. For every type, indirect spreading has less spreading power than direct spreading from the source. This suggests that users tend to avoid spreading information that has already been forwarded by someone else.

Figure 5.1 shows the CDF of diffusion probabilities for direct diffusion from the source node. Each line represents a different type of information: the red line is general information, the blue line is promotion information and the green line is request information. Figure 5.2 shows the CDF of diffusion probabilities for indirect diffusion from non-source nodes. Compared to Figure 5.1, forwarded information has less spreading power than information coming from the source. We confirm that different types of information have different distributions of diffusion probabilities and that directness from the source node influences the diffusion probabilities.

Response Time of Twitter Users

We measure the response time of users to find its distribution. The response time is defined as the period between the time when a piece of information is delivered to user A and the time when user A responds to it with a tweet.

Figure 5.3 shows the CDF of the response time, with the x-axis in hours. Most responses occur within a few hours, which indicates that information diffuses quickly in Twitter. The median response time is 26.37 minutes.
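The response time statistics can be computed along the following lines. This is a hedged sketch: the helper names and the pair-of-timestamps input are assumptions, and the timestamps in the example are invented for illustration only.

```python
from datetime import datetime
from statistics import median


def response_times_minutes(events):
    """events: iterable of (received_at, responded_at) datetime pairs."""
    return [(responded - received).total_seconds() / 60.0
            for received, responded in events]


def empirical_cdf(values, x):
    """Fraction of values that are less than or equal to x."""
    values = list(values)
    return sum(1 for v in values if v <= x) / len(values)


# Invented timestamps: responses after 10 minutes, 30 minutes and 2 hours.
events = [
    (datetime(2010, 5, 1, 12, 0), datetime(2010, 5, 1, 12, 10)),
    (datetime(2010, 5, 1, 12, 0), datetime(2010, 5, 1, 12, 30)),
    (datetime(2010, 5, 1, 12, 0), datetime(2010, 5, 1, 14, 0)),
]
times = response_times_minutes(events)
median_minutes = median(times)
within_one_hour = empirical_cdf(times, 60.0)
```

Evaluating `empirical_cdf` over a grid of x values gives the points of a CDF curve such as the one in Figure 5.3.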


Figure 5.1: CDF of diffusion probabilities of direct diffusion from the source node.

Figure 5.2: CDF of diffusion probabilities of indirect diffusion from non-source nodes.

Figure 5.3: CDF of response time in hours.


Chapter 6. Discussion

In this chapter we discuss the structure of a social network that spreads information effectively. Social networks are characterized by clustering, and the clustering coefficient [23] is a quantitative measure of this phenomenon. The clustering coefficient $C(s)$ of a node $s$ is defined as follows. Let $s$ be a node that has $n$ neighbors, or followers in Twitter. Then at most $n(n-1)/2$ edges can exist between them. $C(s)$ is the fraction $\frac{\text{number of actually existing edges}}{\text{number of allowable edges}}$. Intuitively, $C(s)$ is the probability that two friends of $s$ are also friends of each other. The clustering coefficient $C$ of the whole network is the average of $C(s)$ over all $s$.

Figure 6.1: CDF of the clustering coefficient of Korean users

Figure 6.1 shows the CDF of the clustering coefficient for 457,168 Korean users who have fewer than 2000 followers and more than 1 follower. More than 99% of Korean users have fewer than 2000 followers. We ignore foreign followers in the calculation. Because edges in Twitter are directed, we plot two definitions of a connected edge: (1) for the blue line, only reciprocal edges count as connected edges; (2) for the red line, both reciprocal and one-way edges count. The mean and median values of the clustering coefficient are 0.278 and 0.179 for (1) and 0.332 and 0.238 for (2). On average, about 30% of the neighbors of a user are also neighbors of each other in Twitter.
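The clustering coefficient of a single user under both edge conventions can be sketched as follows. The `follows` dictionary and the `reciprocal_only` flag are hypothetical names introduced for this illustration, and the toy graph is not taken from the dataset.

```python
def clustering_coefficient(follows, s, reciprocal_only=False):
    """Clustering coefficient C(s) over the followers of s.

    follows: dict mapping user -> set of users that this user follows
    reciprocal_only: True counts only reciprocal edges, as in convention (1);
                     False counts reciprocal and one-way edges, as in (2)
    """
    def connected(a, b):
        a_to_b = b in follows.get(a, set())
        b_to_a = a in follows.get(b, set())
        return (a_to_b and b_to_a) if reciprocal_only else (a_to_b or b_to_a)

    neighbors = [u for u in follows if u != s and s in follows[u]]
    n = len(neighbors)
    if n < 2:
        return 0.0                       # no pair of neighbors to connect
    existing = sum(1
                   for i in range(n)
                   for j in range(i + 1, n)
                   if connected(neighbors[i], neighbors[j]))
    return existing / (n * (n - 1) / 2)  # existing edges / allowable edges


# Toy graph: a, b and c follow s; a and b follow each other; c follows a.
toy_follows = {"a": {"s", "b"}, "b": {"s", "a"}, "c": {"s", "a"}, "s": set()}
c_either = clustering_coefficient(toy_follows, "s")
c_reciprocal = clustering_coefficient(toy_follows, "s", reciprocal_only=True)
```

In the toy graph, two of the three possible neighbor pairs are connected in at least one direction but only one pair is reciprocal, so the coefficient under convention (2) is larger than under convention (1), in line with the means reported above.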


Dunbar’s number [6] is a limit on the number of people with whom one can maintain stable social relationships. A commonly used value is 150.

We divide users into two groups according to Dunbar’s number. The first group, L, consists of users with 150 or more followers. The second group, S, consists of users with fewer than 150 and more than 10 followers. We ignore users who have fewer than 10 followers.

For each of the groups L and S, Table 6.1 shows the correlation coefficient between (1) the number of followers and the median number of followers of one-hop neighbors, (2) the number of followers and the clustering coefficient, (3) the clustering coefficient and the median number of followers of one-hop neighbors, and (4) the clustering coefficient and the mean clustering coefficient of one-hop neighbors.

Correlation                                                                          Group L    Group S
(1) # of followers, the median # of followers of one-hop neighbors                    0.1094     0.0182
(2) # of followers, clustering coefficient                                            0.0263    -0.1585
(3) clustering coefficient, the median # of followers of one-hop neighbors            0.8536     0.3236
(4) clustering coefficient, the mean clustering coefficient of one-hop neighbors      0.6034     0.5091

Table 6.1: Correlation coefficients for group L (followers ≥ 150) and group S (10 < followers < 150)
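A sketch of how such coefficients could be computed is given below. The text does not state which correlation coefficient is used, so the Pearson coefficient here is an assumption, as are the function names; the group boundaries follow the caption of Table 6.1, and the example data are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def split_by_dunbar(follower_counts, dunbar=150, minimum=10):
    """Return (group_L, group_S) as lists of users.

    follower_counts: dict mapping user -> number of followers
    """
    group_l = [u for u, k in follower_counts.items() if k >= dunbar]
    group_s = [u for u, k in follower_counts.items() if minimum < k < dunbar]
    return group_l, group_s


# Made-up follower counts for five hypothetical users.
counts = {"u1": 200, "u2": 150, "u3": 50, "u4": 10, "u5": 5}
group_l, group_s = split_by_dunbar(counts)
perfect = pearson([1, 2, 3], [2, 4, 6])
inverse = pearson([1, 2, 3], [3, 2, 1])
```

For row (3) of the table, `xs` would hold the clustering coefficient of each user in a group and `ys` the median number of followers of that user's one-hop neighbors.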

Correlations (1) and (2) are weak or absent for both groups, which implies that the number of followers is not an important factor in the structure of the follower network. Correlations (3) and (4), however, are much stronger. Correlation (3) shows that the more densely a user's neighbors are clustered, the more followers those neighbors have, which gives more chances to spread information. This correlation is very strong for group L, with a coefficient of 0.8536. Correlation (4) shows homophily among users: densely clustered users also have densely clustered one-hop neighbors.

In viral marketing, it is important to choose the initiating nodes so as to maximize marketing outcomes. Table 6.1 gives hints for this problem. An ordinary user may have more densely clustered friends than an enterprise user. When the two have a similar number of followers, it is better for an enterprise to ask an existing user who has densely clustered friends to initiate the campaign than to initiate the campaign itself. This can be one characteristic of an effective spreading structure in a social network.


Chapter 7. Conclusions

In this thesis, we study information spreading in Twitter using Twitpic links. Tracking Twitpic links guarantees that the source of each piece of information is a unique Twitter user, which makes it easy to trace information diffusion in Twitter.

First, we present a basic analysis of the creation and spreading of information with 52 million tweets. Twitter users create information periodically and spread information without regularity.

Second, we study microscopic characteristics of information diffusion at the level of individual pieces of information, using Korean users. Temporal and topological analysis shows that diffusion in Twitter is fast and produces wide and shallow diffusion trees. Large spreads of information are due either to the power of the source, for example celebrities, or to hubs in the diffusion trees.

Third, we present an information diffusion model for Twitter, which is an extension of the Independent Cascade Model. This model considers the response time of users and runs until the selected time limit is reached. We show that spreading probabilities are influenced by the type of information and by the directness to the original source.

We leave a comparison between the results for Twitpic data and for general tweets to future work. Accurate prediction of spreading and the characteristics of nodes that are effective in spreading information also remain as future work. Our work sheds light on microscopic characteristics of information diffusion in OSNs. It can help viral marketers on OSNs plan campaigns and choose targets.


Summary

Analysis on Information Spreading as Recorded in Twittersphere

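By exposing all user activity together with the structure of its social network, Twitter offers an unprecedented opportunity to study information spreading in human society. In this thesis, we analyze information spreading based on data provided by Twitter. We collected 52 million tweets related to 32 million photo URL links uploaded to Twitter through Twitpic from April 2010 to June 2010. Because a Twitpic link guarantees that the origin of the information is a unique Twitter user, it rules out the case that makes tracing ambiguous, in which several users bring the same information into Twitter and each starts its spreading. We trace information spreading with these Twitpic links and analyze the creation and forwarding of information. We also analyze, temporally and topologically, the microscopic spreading of individual pieces of information, which has not yet been studied sufficiently in Twitter, using diffusion trees inferred from the social graph of Korean users. Through this analysis we show that information spreading in Twitter is very fast and produces wide and shallow diffusion trees. In addition, we propose an information diffusion model for Twitter that extends the Independent Cascade Model by considering the response time of users. We show that the diffusion probability in Twitter is influenced by the type of information, classified by intent, and by whether the information was received directly or indirectly from its origin. This study contributes to revealing the characteristics of information spreading from a microscopic perspective in online social networks, and it can be applied in viral marketing to find effective marketing targets or to predict information spreading.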


References

[1] E. Bakshy, B. Karrer, and L. Adamic. Social influence and the diffusion of user-created content. In

Proceedings of the tenth ACM conference on Electronic commerce, pages 325–334. ACM, 2009.

[2] M. Cha, H. Kwak, P. Rodriguez, Y. Ahn, and S. Moon. I tube, you tube, everybody tubes: analyzing

the world’s largest user generated content video system. In Proceedings of the 7th ACM SIGCOMM

conference on Internet measurement, pages 1–14. ACM, 2007.

[3] M. Cha, A. Mislove, and K. Gummadi. A measurement-driven analysis of information propagation

in the flickr social network. In Proceedings of the 18th international conference on World wide web,

pages 721–730. ACM, 2009.

[4] J. Cointet and C. Roth. How realistic should knowledge diffusion models be. Journal of Artificial

Societies and Social Simulation, 10(3):5, 2007.

[5] J. Cointet and C. Roth. Socio-semantic dynamics in a blog network. In Computational Science and

Engineering, 2009. CSE’09. International Conference on, volume 4, pages 114–121. IEEE, 2009.

[6] R. Dunbar. Grooming, gossip, and the evolution of language. Harvard Univ Pr, 1998.

[7] J. Goldenberg, B. Libai, and E. Muller. Talk of the network: A complex systems look at the

underlying process of word-of-mouth. Marketing Letters, 12(3):211–223, 2001.

[8] M. Gomez Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. In

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data

mining, pages 1019–1028. ACM, 2010.

[9] M. Gotz, J. Leskovec, M. McGlohon, and C. Faloutsos. Modeling blog dynamics. In AAAI Conference on Weblogs and Social Media, 2009.

[10] A. Goyal, F. Bonchi, and L. Lakshmanan. Learning influence probabilities in social networks. In Proceedings of the third ACM international conference on Web search and data mining, pages 241–250. ACM, 2010.

[11] M. Granovetter and R. Soong. Threshold models of diffusion and collective behavior. The Journal

of Mathematical Sociology, 9(3):165–179, 1983.

[12] D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins. Information diffusion through blogspace. In

Proceedings of the 13th international conference on World Wide Web, pages 491–501. ACM, 2004.

[13] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network.

In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and

data mining, pages 137–146. ACM, 2003.

[14] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In

Proceedings of the 19th international conference on World wide web, pages 591–600. ACM, 2010.


[15] K. Lerman and R. Ghosh. Information contagion: An empirical study of the spread of news on Digg

and Twitter social networks. 2010.

[16] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Cascading behavior in large

blog graphs: Patterns and a model. In Society of Applied and Industrial Mathematics: Data Mining

(SDM07), 2007.

[17] J. Leskovec, A. Singh, and J. Kleinberg. Patterns of influence in a recommendation network.

Advances in Knowledge Discovery and Data Mining, pages 380–389, 2006.

[18] D. Liben-Nowell and J. Kleinberg. Tracing information flow on a global scale using Internet chain-letter data. Proceedings of the National Academy of Sciences, 105(12):4633, 2008.

[19] M. McGlohon, J. Leskovec, C. Faloutsos, M. Hurst, and N. Glance. Finding patterns in blog shapes

and blog evolution. In International Conference on Weblogs and Social Media. The AAAI Press,

2007.

[20] Twitter Streaming API. http://dev.twitter.com/pages/streaming api.

[21] Visitor of Twitpic. http://www.quantcast.com/twitpic.comp.

[22] D. Watts and P. Dodds. Influentials, networks, and public opinion formation. Journal of Consumer

Research, 34(4):441–458, 2007.

[23] D. Watts and S. Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393(6684):440–442,

1998.
