2009 Korean Data Analysis Society Autumn Conference (2009년 한국자료분석학회 가을철 학술대회)

Developing the Korean Internet Network Miner (KINM): E-research Tool for Social Network Analysis of the Blogosphere in South Korea*

Anatoliy Gruzd1), Chung Joo Chung2), Jaeeun (Angela) Yoo3), Han Woo Park (박한우)4)



Anatoliy Gruzd1)
Assistant Professor, School of Information Management, Dalhousie University, Canada
E-mail: [email protected]

Chung Joo Chung2)
Ph.D. Candidate, Department of Communication, State University of New York at Buffalo, USA
E-mail: [email protected]

Jaeeun (Angela) Yoo3)
B.S. Student, Division of Engineering Science, University of Toronto, Canada
E-mail: [email protected]

Han Woo Park (박한우)4)
(Corresponding Author) Associate Professor, Department of Media and Information, YeungNam University, Korea
E-mail: hanpark@ynu.ac.kr


Contents

Introduction

Related Studies: Tools for Blog Network Analysis

Section 1. Development of the Korean Internet Network Miner

Section 2. Evaluation of the Name Network Discovery Algorithm

Conclusion


Introduction

The growing adoption of e-research tools has led to changes in social and communication research (Jankowski, 2009).

Typical technologies in this domain include LexiURL (Thelwall, 2009), the Virtual Observatory for the Study of Online Networks (Ackland, 2009), and the Internet Community Text Analyzer (ICTA; Gruzd, 2009a).

In contrast to the e-research developments in North America and Europe, Soon and Park (2009) note that digital tools to support e-research are rare in Asia, even in South Korea (Internet World Stats, 2008).

Thus, we attempt to develop an e-research tool for the automatic discovery of online communication networks on the Korean Web.

The first section deals with prior studies related to large-scale blog network analysis using automatic tools. The second section illustrates the process of developing our analytic tool, the Korean Internet Network Miner (KINM).

Related Studies: Tools for Blog Network Analysis

The structure of networks can be measured mathematically and visualized graphically. The shape of a network emerging from online users' writing and linking choices reflects interest trends.

Research on blog networks has confirmed that large-scale online communities are structurally reflected in higher-density network neighborhoods through linking (Kelly and Etling, 2008).
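As a rough illustration of what "measured mathematically" can mean here, the sketch below computes the density of a directed link network: the number of observed ties divided by the number of possible ties, n(n - 1). The blog names and links are invented for illustration and do not come from the studies cited; `density` is a hypothetical helper name.

```python
def density(nodes, edges):
    """Density of a directed network without self-loops: ties / (n * (n - 1))."""
    n = len(nodes)
    if n < 2:
        return 0.0
    # Count each directed tie once and ignore self-links.
    ties = {(src, dst) for src, dst in edges if src != dst}
    return len(ties) / (n * (n - 1))


# Hypothetical blogs and hyperlinks (illustration only).
blogs = ["A", "B", "C", "D"]
links = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D")]
print(density(blogs, links))  # 4 observed ties out of 12 possible
```

A densely linked neighborhood would score close to 1 on this measure, while a loose collection of blogs would score close to 0.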

Traditional data mining research

• Traditional data mining research focuses largely on algorithms for inferring association rules and other statistical correlation measures in a given data set (Kumar et al., 1999; Jung, 2009).

For example:

• Kelly and Etling (2008) used the research firm Morningside Analytics for blog selection and data mapping, along with the Fruchterman-Reingold "physics model" algorithm, to understand the blog networks of the Iranian blogosphere.

• Gryc and his colleagues (2008) developed categories for the blog networks they studied by analyzing keywords, post classification, and linking patterns of blogs.

Current research

• The current research uses a web-based system for automated text analysis to discover and understand social networks from blog data.

• It focuses not only on chain networks (social networks based on the number of messages exchanged between individuals) but also on name networks (social networks built by mining personal names and nicknames).

Section 1. Development of the Korean Internet Network Miner

As an initial framework for the KINM, we used some of the social network discovery and visualization tools and techniques previously developed by Gruzd (2009).

These tools and techniques were developed as part of ICTA, a general-purpose web system for content and network analysis of computer-mediated communication in English (available at http://textanalytics.net).

Barriers>

1. Since ICTA only works with texts in English, we had to modify all text processing functions to support Korean texts.

2. ICTA requires the data to be stored in a machine-readable format such as an RSS feed. However, after a manual examination of a number of Korean blogs, we noticed that the majority of them do not provide RSS feeds for their comments data.
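One way to picture the first barrier: a tokenizer that only recognizes ASCII word characters discards Hangul entirely. The sketch below is an assumption about what a Korean-aware replacement might look like, not the KINM code; the `TOKEN` pattern and the `tokenize` name are hypothetical.

```python
import re

# Hangul syllables, standalone Hangul letters (as used in netspeak),
# Latin letters, digits and underscores all count as token characters.
TOKEN = re.compile(r"[가-힣ㄱ-ㅎㅏ-ㅣA-Za-z0-9_]+")


def tokenize(text):
    """Split a comment into word-like tokens, keeping Korean and English."""
    return TOKEN.findall(text)


print(tokenize("2MB 정부, 사람들이 ㅋㅋ"))
```

Because Korean attaches particles to nouns (for example 사람들이 combines 사람 with a plural marker and a subject particle), whitespace-level tokens like these are only a starting point; matching a nickname inside such a token would need further processing.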

We were especially interested in analyzing comments data because comments contain most of the social interactions on a blog, which makes them a good source for mining social connections among blog readers.

To address the lack of comment feeds, we created a script using the Kapow Mashup Server (http://www.kapowtech.com) to retrieve comments from a selected blog automatically and output them as an RSS feed.
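The output step can be sketched independently of the scraping tool. The example below does not use Kapow; it only assumes that comments have already been retrieved as dictionaries and shows how they might be serialized as a minimal RSS 2.0 feed with Python's standard library. The field names (`author`, `text`, `date`), the sample comment and the example URL are hypothetical.

```python
import xml.etree.ElementTree as ET


def comments_to_rss(blog_title, blog_url, comments):
    """Serialize already-scraped comments as a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = blog_title
    ET.SubElement(channel, "link").text = blog_url
    ET.SubElement(channel, "description").text = "Comments feed"
    for c in comments:
        item = ET.SubElement(channel, "item")
        # Assumed convention: the commenter's nickname is stored in <author>.
        ET.SubElement(item, "author").text = c["author"]
        ET.SubElement(item, "description").text = c["text"]
        ET.SubElement(item, "pubDate").text = c["date"]
    return ET.tostring(rss, encoding="unicode")


# Invented sample comment (illustration only).
sample = [
    {"author": "너구리", "text": "좋은 글 감사합니다.",
     "date": "Mon, 06 Apr 2009 10:00:00 +0900"},
]
feed = comments_to_rss("Example blog", "http://example.com/blog", sample)
print(feed)
```

Once comments are in this shape, a feed-based analysis tool can treat them like any other RSS source.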

After the blog data was retrieved, it was processed to build two types of networks.

• First, a chain network was extracted. In the chain network, one commentator is connected to another if the first commentator directly replied to the second commentator by clicking on the "reply-to" button.

• However, after manually examining a number of comments on several blogs, we found that some comments are not "reply-to" comments but still address or reference a previous poster.

This observation is in line with a previous empirical study on online learning communities by Gruzd (2009a), which discovered that the chain network misses on average 40% of possible connections.
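The chain network described above can be sketched as follows. This is a minimal illustration under assumed data fields (`id`, `author`, `reply_to`), not the KINM implementation; the sample comments are invented.

```python
from collections import Counter


def build_chain_network(comments):
    """Return weighted directed ties (replier, original poster) from reply-to links."""
    author_of = {c["id"]: c["author"] for c in comments}
    ties = Counter()
    for c in comments:
        parent = c.get("reply_to")
        # Only explicit reply-to links create a tie; self-replies are skipped.
        if parent in author_of and author_of[parent] != c["author"]:
            ties[(c["author"], author_of[parent])] += 1
    return ties


# Invented comments: the third one addresses bangzza by name
# but was not posted with the reply-to button.
comments = [
    {"id": 1, "author": "bangzza", "text": "새 글입니다.", "reply_to": None},
    {"id": 2, "author": "너구리", "text": "동의합니다.", "reply_to": 1},
    {"id": 3, "author": "sky_2009", "text": "bangzza님 감사합니다.", "reply_to": None},
]
print(build_chain_network(comments))
```

In this sample only the tie from 너구리 to bangzza is recorded; the third comment's mention of bangzza is invisible to the chain network, which is the kind of missed connection noted above.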

Name Network>

• Instead of relying only on information about who replied to whom, the name network method starts by automatically finding all mentions of personal names or nicknames in comments and uses them as nodes in a social network.

• Next, to discover ties between nodes, the method connects the sender of a comment to all names found in his/her comment. (A more detailed description of this method can be found in Gruzd (2009b).)

• Although the name network approach provides additional information about connections among blog commentators, it has its own challenges, chiefly distinguishing a genuine personal name/nickname from a word that just appears to be one.
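A deliberately naive sketch of the name network idea follows. It assumes that the candidate names are simply the nicknames of everyone who posted a comment and that a mention is any substring match; both are simplifications for illustration, and the method in Gruzd (2009b) is more involved. The sample comments are invented.

```python
def build_name_network(comments):
    """Connect each comment's sender to every known nickname found in its text."""
    # Assumption: the name dictionary is the set of commenter nicknames.
    known_names = {c["author"] for c in comments}
    ties = set()
    for c in comments:
        for name in known_names:
            if name != c["author"] and name in c["text"]:
                ties.add((c["author"], name))
    return ties


# Invented comments; one commenter uses the nickname 사람 ("people").
comments = [
    {"author": "bangzza", "text": "새 글입니다."},
    {"author": "사람", "text": "잘 읽었습니다."},
    {"author": "sky_2009", "text": "bangzza님 감사합니다."},
    {"author": "너구리", "text": "오늘은 사람이 많네요."},
]
print(sorted(build_name_network(comments)))
```

The tie from sky_2009 to bangzza is a genuine mention that a reply-to chain would miss. The tie from 너구리 to 사람 is a false positive: the word is used as an ordinary noun, yet it matches an existing nickname, which is exactly the disambiguation challenge named in the last bullet.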

Figure 1: Sample comment

For example, the algorithm marked the word 사람 (people) as a reference to another person on the blog. This happened because there was at least one comment in the dataset posted by a person with the nickname "사람". However, in the sample comment, this word does not refer to another online participant; it is used as a noun that means "people".

Another good example of the challenges associated with the name/nickname disambiguation problem in comments is the word "2mb":

1. It can be used as a nickname for one of the blog commentators.
2. It could refer to the capacity of a computer memory (2 megabytes).
3. It could be the alias of the current Korean president, Lee Myung-Bak.

To address these challenges and develop recommendations for the next generation of the name network discovery algorithm, we conducted a semi-automated analysis of all names/nicknames discovered from a sample dataset using our initial algorithm.

Section 2. Evaluation of the Name Network Discovery Algorithm

To evaluate our automated approach for analyzing communication networks from blog comments, specifically the accuracy of the name network discovery algorithm, we selected a single blog authored by 방짜 (bangzza) (http://blog.ohmynews.com/bangzza).

OhMyNews was ranked as one of the top three web sites in terms of blog users in 2009 (Rankey.com, 2009). It is frequently ranked as the most popular blog site in Korea, registering over 20 million page views per day during the presidential election.

Users of OhMyNews, known as "news guerrillas", contribute news articles on the Web site. OhMyNews allows individuals in far-flung locations to come together, share, and build strong ties and a sense of community (united in ideology even if separated by geographic distance) that fosters a true grassroots movement (Streitmatter, 2001).

For our tests, we retrieved and analyzed a sample set of 943 comments (posted between April 2008 and April 2009) from the selected blog.

In the study, we relied on an interactive tag cloud feature available in KINM to explore and evaluate all names and nicknames that were found automatically (see Figure 2).

<Figure 2> An interactive tag cloud showing the 50 most frequently used name/nickname candidates found in the sample dataset
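The ranking behind such a tag cloud can be approximated with a simple frequency count. The sketch below assumes the candidate mentions have already been extracted into a flat list; `top_candidates` and the sample mentions are hypothetical, and the counts are invented rather than taken from the dataset.

```python
from collections import Counter


def top_candidates(mentions, k=50):
    """Return the k most frequent name/nickname candidates with their counts."""
    return Counter(mentions).most_common(k)


# Invented list of extracted candidate mentions (illustration only).
mentions = ["2MB", "사람", "2MB", "bangzza", "2MB", "사람"]
print(top_candidates(mentions, k=2))
```

In a tag cloud, these counts would typically drive the font size of each candidate.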

The evaluation procedure involved clicking on each word found by the name network algorithm and exploring the context where each instance of the word was used (see Figure 3). The purpose of this semi-automated analysis was to discover which name/nickname candidates were identified incorrectly and why.

<Figure 3> A list of messages containing "2MB"
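Looking up every context of a candidate word is essentially a keyword-in-context search. The sketch below is one possible way to produce such a list, assuming comments are plain strings; the function name `contexts`, the `width` parameter and the sample comments are hypothetical.

```python
def contexts(comments, word, width=15):
    """Return a snippet of surrounding text for every occurrence of `word`."""
    snippets = []
    for text in comments:
        start = text.find(word)
        while start != -1:
            left = max(0, start - width)
            right = start + len(word) + width
            snippets.append(text[left:right])
            # Continue searching after the current occurrence.
            start = text.find(word, start + len(word))
    return snippets


# Invented comments (illustration only).
sample_comments = ["abc 2MB def 2MB", "no match here"]
print(contexts(sample_comments, "2MB", width=2))
```

A reviewer can then read each snippet and decide whether the candidate refers to a commentator or is used in some other sense.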

The following set includes clues suggesting that a word is likely to be a nickname:

● a word candidate is followed by a context word such as "님" = an honorific or "씨" = Mr./Ms.; other possibilities, although rare, include "군" = Mr. for younger males or "양" = Miss/Ms. for younger females at the end of a word candidate, and "미스터" = Mr. or "미스" = Miss at the beginning;

● a word candidate contains a combination of characters (Korean, English and/or Chinese), symbols (e.g., underscores, hyphens) and numbers;

● a word candidate appears to be a real name, which is almost always three characters: a single-character last name followed by a two-character first name, which may be found in a dictionary of first names and/or common characters used therein;

● a word candidate is a less common, non-topic word (e.g., "너구리" = raccoon);

● a word candidate is followed by punctuation indicative of someone being addressed (e.g., "/" or ":");

● a word candidate contains patterns indicative of non-native words: phonetic Koreanization of English (e.g., "미디어몽골" = mediamogul = Media Mogul) or phonetic romanization of Korean (e.g., "jihwaja" = 지화자).
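Some of the clues listed above lend themselves to simple pattern checks. The sketch below covers only four of them (honorific suffix, title prefix, mixed character types, and addressing punctuation) and is an illustration under stated assumptions, not the rule set built into KINM: honorifics are checked in the text that follows the candidate, and all constant and function names are hypothetical.

```python
import re

HONORIFIC_SUFFIXES = ("님", "씨", "군", "양")
TITLE_PREFIXES = ("미스터", "미스")
HANGUL = re.compile(r"[가-힣]")
LATIN = re.compile(r"[A-Za-z]")
DIGIT_OR_SYMBOL = re.compile(r"[0-9_\-]")


def nickname_clues(candidate, following=""):
    """Return the names of the positive clues that fire for a candidate word.

    `following` is the text that comes right after the candidate in the comment.
    """
    clues = []
    if following.startswith(HONORIFIC_SUFFIXES):
        clues.append("honorific_suffix")
    if candidate.startswith(TITLE_PREFIXES):
        clues.append("title_prefix")
    # Count how many character classes appear in the candidate.
    kinds = sum(bool(p.search(candidate)) for p in (HANGUL, LATIN, DIGIT_OR_SYMBOL))
    if kinds >= 2:
        clues.append("mixed_characters")
    if following[:1] in ("/", ":"):
        clues.append("address_punctuation")
    return clues


print(nickname_clues("너구리", "님 안녕하세요"))
print(nickname_clues("sky_2009"))
print(nickname_clues("사람", "이 많다"))
```

Note that "2MB" would also trigger the mixed-character clue even though it often refers to something other than a commentator, so clues of this kind would presumably need to be weighed against contrary evidence rather than applied in isolation.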

The second set includes clues suggesting that a word is NOT likely to be used as a nickname:

● a word candidate is a phrase, for example when the nickname input (the "FROM" field) is used more like a subject line (possible indicators include white spaces and length);

● a word candidate consists of a single character (e.g., "a" or "ㄱ");

● a word candidate consists of netspeak, including emoticons (e.g., "=_="), slang and abbreviations (e.g., using "2MB" to refer to the current Korean president), and onomatopoeia (e.g., "ㅉㅉ" = tsk tsk, "ㅋㅋ" = heehee, "하하" = haha, "음" = hmm);

● a word candidate appears more than one time in the comment;

● a word candidate consists of random characters (e.g., "ㅁㄴㅇㄹ" or "asdf");

● a word candidate is a short, conversational word or phrase (e.g., "나" = me, "아이고" = oh no, "그래서" = so/therefore);

● a word candidate is a common word or idea in the given context/topic (e.g., "대한민국" = Republic of Korea, "쥐체사상" = a newly created word used to refer to political fanatics).
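The negative clues can be sketched in the same spirit. The example below is an assumption-laden illustration, not the KINM rules: the word lists contain only the examples given above, "random characters" is approximated as a run of standalone Hangul letters (so "asdf" would not be caught), and the topic-word clue is left out because it needs a topic vocabulary. All names are hypothetical.

```python
import re

NETSPEAK = {"ㅋㅋ", "ㅉㅉ", "하하", "음", "=_="}
CONVERSATIONAL = {"나", "아이고", "그래서"}
JAMO_ONLY = re.compile(r"^[ㄱ-ㅎㅏ-ㅣ]+$")


def non_nickname_clues(candidate, comment_text):
    """Return the names of the negative clues that fire for a candidate word."""
    word = candidate.strip()
    if not word:
        return []
    clues = []
    if " " in word:
        clues.append("phrase")
    if len(word) == 1:
        clues.append("single_character")
    if word in NETSPEAK:
        clues.append("netspeak")
    if comment_text.count(word) > 1:
        clues.append("repeated_in_comment")
    # Runs of bare Hangul letters that are not known netspeak look like keyboard mashing.
    if len(word) > 1 and word not in NETSPEAK and JAMO_ONLY.match(word):
        clues.append("random_characters")
    if word in CONVERSATIONAL:
        clues.append("conversational_word")
    return clues


print(non_nickname_clues("ㅋㅋ", "ㅋㅋ 재밌네요"))
print(non_nickname_clues("사람", "사람이 사람을 만난다"))
print(non_nickname_clues("너구리", "너구리님 안녕하세요"))
```

A candidate such as "너구리" that triggers none of these checks would remain a plausible nickname, while one that triggers several would be a likely candidate for exclusion.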


Conclusion

This research briefly reviewed some of the studies related to large-scale blog analysis using automatic tools.

We also reviewed the process of developing our own analysis tool, KINM. The main goal of KINM is to automate the process of finding communication networks in the Korean blogosphere that accurately represent social interactions among blog readers.

To find these networks, the system relies on a set of text mining techniques to look for personal names and nicknames in users' comments.

To address some of the challenges associated with the automated discovery of names and nicknames in Korean texts, this paper also presented an exploratory study of a sample dataset.

The study suggests a set of additional rules to improve the accuracy of the current name/nickname discovery algorithm used in KINM.

These additional rules will be incorporated into KINM and evaluated in a subsequent study.


Thank you.