2009 Fall Conference of the Korean Data Analysis Society (한국자료분석학회)
Developing the Korean Internet Network Miner (KINM): E-research Tool for Social Network Analysis of Blogosphere in South Korea*
Anatoliy Gruzd1), Chung Joo Chung2), Jaeeun (Angela) Yoo3), Han Woo Park (박한우)4)
Anatoliy Gruzd1)
Assistant Professor, School of Information Management, Dalhousie University, Canada
E-mail : [email protected]
Chung Joo Chung2)
Ph.D. Candidate, Department of Communication, State University of New York at Buffalo, USA
E-mail : [email protected]
Jaeeun (Angela) Yoo3)
B.S. Student, Division of Engineering Science, University of Toronto, Canada
E-mail : [email protected]
Han Woo Park (박한우)4)
(Corresponding Author) Associate Professor, Department of Media and Information, YeungNam University, Korea.
E-mail : hanpark@ynu.ac.kr
Contents
Introduction
Related Studies: Tools for Blog Network Analysis
Section 1. Development of the Korean Internet Network Miner
Section 2. Evaluation of the Name Network Discovery Algorithm
Conclusion
Introduction
The growing adoption of e-research tools has led to changes in social and communication research (Jankowski, 2009).
Typical technologies in this domain include LexiURL (Thelwall, 2009), Virtual Observatory for the Study of Online Networks (Ackland, 2009), and Internet Community Text Analyzer (ICTA; Gruzd, 2009a).
In contrast to the e-research developments in North America and Europe, Soon and Park (2009) note that digital tools to support e-research are rare in Asia, even in South Korea (Internet World Stats, 2008).
Thus, we attempt to develop an e-research tool for automatic discovery of online communication networks on the Korean Web.
The first section deals with prior studies related to large-scale blog network analysis using automatic tools.
The second section illustrates the process of developing our analytic tool, called the Korean Internet Network Miner (KINM).
Related Studies: Tools for Blog Network Analysis
The structure of networks can be measured mathematically and visualized graphically. The shape of a network emerging from online users' writing and linking choices reflects interest trends.
Research on blog networks has confirmed that large-scale online communities are structurally reflected in higher-density network neighborhoods through linking (Kelly and Etling, 2008).
Traditional data mining research
• Traditional data mining research focuses largely on algorithms for inferring association rules and other statistical correlation measures in a given data set (Kumar et al., 1999; Jung, 2009).
For example:
• Kelly and Etling (2008) used the research firm Morningside Analytics for blog selection and data mapping, along with the Fruchterman-Reingold "physics model" algorithm, to understand the blog networks of the Iranian blogosphere.
• Gryc and his colleagues (2008) developed categories for the blog networks they studied by analyzing keywords, post classification, and linking patterns of blogs.
Current research
• The current research uses a web-based system for automated text
analysis to discover and understand social networks from blog data.
• It focuses not only on chain networks—social networks based on
the number of messages exchanged between individuals—but also on
name networks—social networks built from mining personal
names and nicknames.
Section 1. Development of the Korean Internet Network Miner
As an initial framework for the KINM, we used some of the social network discovery and visualization tools and techniques previously developed by Gruzd (2009).
These tools and techniques were developed as part of ICTA, a general-purpose web system for content and network analysis of computer-mediated communication in English (available at http://textanalytics.net).
Barriers>
1. Since ICTA only works with texts in English, we had to modify all text processing functions to support Korean texts.
2. ICTA requires the data to be stored in a machine-readable format such as an RSS feed. However, after a manual examination of a number of Korean blogs, we noticed that the majority of them do not provide RSS feeds for their comments data.
We were especially interested in analyzing comments data because comments contain most of the social interactions on a blog, which makes them a good source for mining social connections among blog readers.
To address the lack of RSS feeds for comments, we created a script using the Kapow Mashup Server (http://www.kapowtech.com) to retrieve comments from a selected blog automatically and output them as an RSS feed.
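The comment-to-RSS step can be illustrated with a minimal Python sketch. This is not the Kapow Mashup Server script used in the study. It assumes the comments have already been scraped into dictionaries with hypothetical keys `author`, `text`, and `date`, and it shows one possible way to serialize them as an RSS 2.0 feed with the standard library. The blog title, URL, and sample comments are invented for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import format_datetime

# Dublin Core namespace, commonly used to carry an author name in RSS items.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def comments_to_rss(blog_title, blog_url, comments):
    """Serialize scraped comments as a minimal RSS 2.0 feed (returned as a string)."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = blog_title
    ET.SubElement(channel, "link").text = blog_url
    ET.SubElement(channel, "description").text = "Comments on " + blog_title
    for comment in comments:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "{%s}creator" % DC_NS).text = comment["author"]
        ET.SubElement(item, "description").text = comment["text"]
        # RSS expects RFC 822 style dates.
        ET.SubElement(item, "pubDate").text = format_datetime(comment["date"])
    return ET.tostring(rss, encoding="unicode")


# Invented sample data (hypothetical nicknames and texts).
sample_comments = [
    {"author": "너구리", "text": "좋은 글 감사합니다", "date": datetime(2009, 4, 1, 9, 30)},
    {"author": "jihwaja", "text": "너구리님 말씀에 동의합니다", "date": datetime(2009, 4, 1, 10, 5)},
]
feed = comments_to_rss("Example blog", "http://example.com/blog", sample_comments)
print(feed)
```

Once comments are available in a feed like this, a tool that expects RSS input can read them without knowing how the individual blog pages are laid out.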
After the blog data were retrieved, they were processed to build two types of networks.
• First, a chain network was extracted. In the chain network, one commentator is connected to another if the first commentator directly replied to the second commentator by clicking on the "reply-to" button.
• However, after manually examining a number of comments on several blogs, we found that some comments are not "reply-to" comments but still address or reference a previous poster.
This observation is in line with a previous empirical study of online learning communities by Gruzd (2009a), which found that the chain network misses on average 40% of possible connections.
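The chain network idea can be sketched in a few lines of Python. This is an illustrative sketch, not KINM's implementation. It assumes each comment is a dictionary with hypothetical keys `id`, `author`, and `reply_to` (the id of the comment being replied to, or `None`), and it weights each tie by the number of direct replies.

```python
from collections import Counter


def build_chain_network(comments):
    """Return a Counter mapping (replier, original_poster) to the number of direct replies."""
    by_id = {c["id"]: c for c in comments}
    edges = Counter()
    for c in comments:
        parent = by_id.get(c.get("reply_to"))
        # Skip top-level comments and self-replies.
        if parent is not None and parent["author"] != c["author"]:
            edges[(c["author"], parent["author"])] += 1
    return edges


# Invented sample thread (hypothetical nicknames).
thread = [
    {"id": 1, "author": "A", "reply_to": None},
    {"id": 2, "author": "B", "reply_to": 1},
    {"id": 3, "author": "C", "reply_to": 1},
    {"id": 4, "author": "B", "reply_to": None},  # addresses someone only in its text
]
chain_edges = build_chain_network(thread)
print(chain_edges)
```

Comment 4 produces no tie here even if its text addresses another commentator, which is the kind of connection the chain network misses.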
Name Network>
• Instead of just relying on information about who replied to whom, the name network method starts by automatically finding all mentions of personal names or nicknames in comments and uses them as nodes in a social network.
• Next, to discover ties between nodes, the method connects the sender of a comment to all names found in his/her comment. (A more detailed description of this method can be found in Gruzd (2009b).)
• Although the name network approach provides additional information about connections among blog commentators, it has its own challenges, chiefly distinguishing between an actual personal name/nickname and a word that just appears to be one.
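A deliberately naive Python sketch of the name network idea follows. It is not the algorithm described in Gruzd (2009b). It assumes that the only known names are the nicknames of people who posted in the dataset, and that a mention is any occurrence of such a nickname as a substring of a comment. The comment dictionaries (keys `author` and `text`) and their contents are invented for illustration.

```python
def build_name_network(comments):
    """Connect each comment's sender to every known nickname found in the comment text."""
    nicknames = {c["author"] for c in comments}
    edges = set()
    for c in comments:
        for name in nicknames:
            # Naive substring match: cannot tell a nickname from an ordinary word.
            if name != c["author"] and name in c["text"]:
                edges.add((c["author"], name))
    return edges


# Invented sample comments (hypothetical nicknames and texts).
sample = [
    {"author": "너구리", "text": "좋은 글 감사합니다"},
    {"author": "방랑자", "text": "너구리님 말씀에 동의합니다"},
    {"author": "사람", "text": "저도 그렇게 생각해요"},
    {"author": "jihwaja", "text": "많은 사람이 모였네요"},
]
name_edges = build_name_network(sample)
print(name_edges)
```

The tie from 방랑자 to 너구리 is a genuine mention. The tie from jihwaja to 사람 is spurious, because the word is used there as an ordinary noun, which shows why disambiguation rules are needed on top of simple matching.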
<Figure 1> Sample comment
For example, the algorithm marked the word 사람 (people) as a reference to another person on the blog. This happened because there was at least one comment in the dataset posted by a person with the "사람" nickname. However, in the sample comment, this word does not refer to another online participant; it is used as a noun that means "people".
Name Network>
Another good example of the challenges associated with the name/nickname disambiguation problem in comments is the word "2mb".
1. This word can be used as a nickname for one of the blog commentators.
2. It could refer to the capacity of a computer memory (2 megabytes).
3. It could be the alias of the current Korean president, Lee Myung-Bak.
To address these challenges and develop recommendations for the next
generation of the name network discovery algorithm, we conducted a
semi-automated analysis of all names/nicknames discovered from a
sample dataset using our initial algorithm.
Section 2. Evaluation of the Name Network Discovery Algorithm
To evaluate our automated approach for analyzing communication networks from blog comments, specifically the accuracy of the name network discovery algorithm, we selected a single blog authored by 방짜 (bangzza) (http://blog.ohmynews.com/bangzza).
OhMyNews was ranked as one of the top three web sites in terms of blog users in 2009 (Rankey.com, 2009). It is frequently ranked as the most popular blog site in Korea, registering over 20 million page views per day during the presidential election.
Users of OhMyNews, known as "news guerrillas", contribute news articles on the Web site. OhMyNews allows individuals in far-flung locations to come together, share, and build strong ties and a sense of community (united in ideology even if separated by geographic distance) that fosters a true grassroots movement (Streitmatter, 2001).
For our tests, we retrieved and analyzed a sample set of 943 comments
(posted between April 2008 and April 2009) from the selected blog.
In the study, we relied on an interactive tag cloud feature available in KINM to explore and evaluate all names and nicknames that were found automatically (see Figure 2).
<Figure 2> An interactive tag cloud showing the 50 most frequently used name/nickname candidates found in the sample dataset
The evaluation procedure involved clicking on each word found by the name network algorithm and exploring the context where each instance of the word was used (see Figure 3). The purpose of this semi-automated analysis was to discover which name/nickname candidates were identified incorrectly and why.
<Figure 3> A list of messages containing "2MB"
The following set includes clues suggesting that a word is likely to be a nickname:
● a word candidate is followed by a context word such as "님" = an honorific or "씨" = Mr./Ms.; other possibilities, although rare, include "군" = Mr. for younger males or "양" = Miss/Ms. for younger females at the end of a word candidate, and "미스터" = Mr. or "미스" = Miss at the beginning;
● a word candidate contains a combination of characters (Korean, English and/or Chinese), symbols (e.g., underscores, hyphens) and numbers;
● a word candidate appears to be a real name, which is almost always three characters: a single-character last name followed by a two-character first name, which may be found in a dictionary of first names and/or common characters used therein;
● a word candidate is a less common, non-topic word (e.g., "너구리" = raccoon);
● a word candidate is followed by punctuation indicative of someone being addressed (e.g., "/" or ":");
● a word candidate contains patterns indicative of non-native words: phonetic Koreanization of English (e.g., "미디어몽골" = mediamogul = Media Mogul) or phonetic romanization of Korean (e.g., "jihwaja" = 지화자).
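Some of the positive clues above lend themselves to simple pattern checks. The following Python sketch covers three of them (a trailing honorific, mixed character types, and address punctuation). It is only an illustration under stated assumptions, not the rule set implemented in KINM: the function names are hypothetical, only the two most common honorifics are checked, and Chinese characters are not handled.

```python
import re

# Only the two common honorifics are checked in this sketch.
HONORIFIC_SUFFIXES = ("님", "씨")


def followed_by_honorific(candidate, text):
    """True if the candidate is immediately followed by an honorific somewhere in the text."""
    pattern = re.escape(candidate) + "(?:" + "|".join(HONORIFIC_SUFFIXES) + ")"
    return re.search(pattern, text) is not None


def has_mixed_characters(candidate):
    """True if the candidate combines at least two of: Hangul, Latin letters, digits/symbols."""
    has_hangul = re.search(r"[가-힣]", candidate) is not None
    has_latin = re.search(r"[A-Za-z]", candidate) is not None
    has_digit_or_symbol = re.search(r"[0-9_\-]", candidate) is not None
    return sum([has_hangul, has_latin, has_digit_or_symbol]) >= 2


def followed_by_address_punctuation(candidate, text):
    """True if the candidate is followed by "/" or ":" (optionally after spaces)."""
    return re.search(re.escape(candidate) + r"\s*[/:]", text) is not None


print(followed_by_honorific("너구리", "너구리님 말씀에 동의합니다"))
print(has_mixed_characters("jihwaja_77"))
print(followed_by_address_punctuation("너구리", "너구리: 감사합니다"))
```

Each check returns a single yes/no clue. How the clues are weighted and combined into a final decision is left open here.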
The second set includes clues suggesting that a word is NOT likely to be used as a nickname:
● a word candidate is a phrase, for example when the nickname input (the "FROM" field) is used more like a subject line (possible indicators include white spaces and length);
● a word candidate consists of a single character (e.g., "a" or "ㄱ");
● a word candidate consists of netspeak, including emoticons (e.g., "=_="), slang and abbreviations (e.g., using "2MB" to refer to the current Korean president), and onomatopoeia (e.g., "ㅉㅉ" = tsk tsk, "ㅋㅋ" = heehee, "하하" = haha, "음" = hmm);
● a word candidate appears more than one time in the comment;
● a word candidate consists of random characters (e.g., "ㅁㄴㅇㄹ" or "asdf");
● a word candidate is a short, conversational word or phrase (e.g., "나" = me, "아이고" = oh no, "그래서" = so/therefore);
● a word candidate is a common word or idea in the given context/topic (e.g., "대한민국" = Republic of Korea, "쥐체사상" = a newly created word used to refer to political fanatics).
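Several of the negative clues can likewise be expressed as simple filters. The sketch below is a hedged illustration, not KINM's implementation. The word lists are tiny samples taken from the examples above, the length threshold `max_len` is an arbitrary assumption, and topic-specific common words and random Latin strings such as "asdf" are not covered.

```python
import re

# Small illustrative lists; a real system would need much larger ones.
NETSPEAK = {"ㅉㅉ", "ㅋㅋ", "하하", "음", "=_="}
CONVERSATIONAL = {"아이고", "그래서"}


def negative_clues(candidate, comment_text, max_len=10):
    """Return the list of clues suggesting the candidate is NOT a nickname."""
    clues = []
    if " " in candidate.strip() or len(candidate) > max_len:
        clues.append("phrase")
    if len(candidate.strip()) == 1:
        clues.append("single character")
    if candidate in NETSPEAK:
        clues.append("netspeak")
    # Strings made only of bare Hangul jamo (netspeak or random typing).
    if re.fullmatch("[ㄱ-ㅣ]+", candidate):
        clues.append("bare jamo")
    if comment_text.count(candidate) > 1:
        clues.append("repeated in comment")
    if candidate in CONVERSATIONAL:
        clues.append("conversational word")
    return clues


print(negative_clues("ㅋㅋ", "ㅋㅋ 재밌네요"))
print(negative_clues("너구리", "너구리님 감사합니다"))
print(negative_clues("2MB", "2MB는 2MB다"))
```

An empty list means none of these filters fired. It does not by itself confirm that the candidate is a nickname.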
Conclusion
This research briefly reviewed some of the studies related to large-scale blog analysis using automatic tools.
We reviewed the process of developing our own analysis tool, KINM. The main goal of KINM is to automate the process of finding communication networks in the Korean blogosphere that accurately represent social interactions among blog readers.
To find these networks, the system relies on a set of text mining techniques to look for personal names and nicknames in users' comments.
To address some of the challenges associated with the
automated discovery of names and nicknames in Korean
texts, this paper also presented an exploratory study of a
sample dataset.
The study suggests a set of additional rules to improve the
accuracy of the current name/nickname discovery algorithm
used in KINM.
These additional rules will be incorporated into KINM and
evaluated in a subsequent study.
Thank you.