The Velocity of Censorship: High-Fidelity Detection of Microblog Post Deletions
Tao Zhu (1); David Phipps (2); Adam Pridgen (3); Jedidiah R. Crandall (4); Dan S. Wallach (3)
(1) Independent Researcher; (2) Bowdoin College; (3) Rice University; (4) University of New Mexico
22nd USENIX Security Symposium (USENIX Security '13)
左昌國, 2013/09/10 Seminar @ ADLab, CSIE, NCU
Outline
• Introduction
• Methodology
• Hypotheses
• Topic Extraction
• Discussion
• Conclusion
Introduction
• Microblogs in China: Weibo
  • Sina Weibo (http://weibo.com)
  • 503 million registered users (Dec. 2012)
  • 100 million messages sent daily
  • Promoting visibility of social issues
• China employs both backbone-level filtering of IP packets and higher-level filtering implemented in software
• Many works focus on how and what to filter
• This paper focuses on how quickly microblog posts are removed
Introduction
• Contributions:
  • The implementation of a method that detects a censorship event within 1-2 minutes of its occurrence
  • An understanding of how Weibo can react so quickly in deleting posts with sensitive content
    • 4 hypotheses
  • A topical analysis that copes with the neologisms, named entities, and informal language used in Chinese posts
Methodology
• Identifying the sensitive user group
• Crawling posts of the sensitive user group
• Detecting deletions
Methodology – Identifying the Sensitive User Group
• Search for outdated sensitive keywords listed by China Digital Times (http://chinadigitaltimes.net/2013/06/two-years-of-sensitive-words-grass-mud-horse-list/)
• Using keywords like “党产共”; 2011-4 ~ 2012-10
• Starting with 25 sensitive users (manually selected)
[Diagram: expansion from the 25 sensitive users; labels: "> 5 reposts for each user", "> 5 deletions"]
Methodology – Identifying the Sensitive User Group
• The sensitive group reaches 3567 users after 15 days
• More than 4500 post deletions daily
  • 1500 “permission denied” posts
• 12% of the total posts from the group were eventually deleted
• This methodology cannot produce a representative sample of Weibo as a whole
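
The group expansion can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes the two diagram labels mean "reposted more than 5 times by group members" and "more than 5 deleted posts", and the function name, input shapes, and sample data are all hypothetical.

```python
from collections import Counter

REPOST_THRESHOLD = 5    # assumed reading of the "> 5 reposts" label
DELETION_THRESHOLD = 5  # assumed reading of the "> 5 deletion" label


def expand_sensitive_group(seed_users, reposts, deletions):
    """Run one expansion round and return the enlarged group.

    reposts:   iterable of (reposter, original_author) pairs
    deletions: dict mapping author -> number of deleted posts observed
    """
    group = set(seed_users)
    # Count how often each outside author is reposted by current members.
    counts = Counter(
        author
        for reposter, author in reposts
        if reposter in group and author not in group
    )
    for author, n in counts.items():
        if n > REPOST_THRESHOLD and deletions.get(author, 0) > DELETION_THRESHOLD:
            group.add(author)
    return group


# Made-up data: "b" passes both thresholds, "c" and "d" each fail one.
sample_reposts = [("a", "b")] * 6 + [("a", "c")] * 6 + [("a", "d")] * 2
sample_deletions = {"b": 6, "c": 1, "d": 10}
expanded = expand_sensitive_group({"a"}, sample_reposts, sample_deletions)
```

Repeating such a round daily would grow the group over time, which is consistent with the slide's 15-day figure but is not a claim about how the authors scheduled it.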
Methodology – Crawling
• User timeline:
  • The Weibo user timeline API returns the most recent 50 posts of the specified user
• Querying 3567 sensitive users once per minute
  • 100 accounts for API calls
  • 300 concurrent Tor circuits
• Four-node cluster running Hadoop and HBase
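
To get a feel for the load these numbers imply, the sketch below spreads the 3567 users evenly over the 100 API accounts. The even round-robin split is an assumption made for illustration; the slides do not say how queries were actually assigned, and the account names are hypothetical.

```python
from itertools import cycle


def assign_round_robin(user_ids, resources):
    """Distribute user IDs over resources in round-robin order."""
    plan = {r: [] for r in resources}
    for uid, r in zip(user_ids, cycle(resources)):
        plan[r].append(uid)
    return plan


users = list(range(3567))                         # one timeline query each, per minute
accounts = [f"account-{i}" for i in range(100)]   # hypothetical account labels

per_account = assign_round_robin(users, accounts)
loads = [len(v) for v in per_account.values()]
print(min(loads), max(loads))  # 35 36
```

Under this assumption each account issues 35 or 36 timeline calls per minute.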
Methodology – Detecting Deletions
• If a post is in the database but is not returned from Weibo, issue a secondary query for that post to determine what error message is returned
• Permission-denied or system deletion
  • “Permission denied” error
  • Caused by a censorship event
  • The post still exists but cannot be accessed by users
• General deletion
  • “Post does not exist” error
  • May be caused by user self-deletion or a censorship event
  • The post does not exist
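
The detection step can be sketched as a comparison between stored posts and the latest timeline response. This is an illustrative sketch only: `fetch_error` stands in for the secondary query and is hypothetical, the error strings are placeholders rather than actual Weibo API messages, and the cutoff that skips posts older than a full 50-post window is an added assumption to avoid false positives.

```python
from enum import Enum

PERMISSION_DENIED = "permission denied"  # placeholder error text
NOT_EXIST = "post does not exist"        # placeholder error text


class Deletion(Enum):
    SYSTEM = "system deletion"
    GENERAL = "general deletion"


def detect_deletions(stored, returned, fetch_error, window=50):
    """Classify posts that are stored locally but missing from the timeline.

    stored, returned: dict mapping post_id -> creation time (any comparable value)
    fetch_error:      hypothetical callable, post_id -> error string
    """
    # If the window is full, older posts may simply have scrolled out of it.
    cutoff = min(returned.values()) if len(returned) >= window else None
    events = {}
    for post_id, created_at in stored.items():
        if post_id in returned:
            continue
        if cutoff is not None and created_at < cutoff:
            continue
        error = fetch_error(post_id)
        if error == PERMISSION_DENIED:
            events[post_id] = Deletion.SYSTEM
        elif error == NOT_EXIST:
            events[post_id] = Deletion.GENERAL
    return events


# Made-up example with a window of 2 posts.
stored_posts = {1: 10, 2: 20, 3: 30, 4: 40}
timeline = {2: 20, 4: 40}
errors = {3: PERMISSION_DENIED}
found = detect_deletions(stored_posts, timeline, errors.get, window=2)
```

In the example, post 3 is classified as a system deletion, while post 1 is ignored because it is older than the oldest post in the full window.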
Methodology – Detecting Deletions
• This paper focuses on system deletions
  • Apparently not by users
• From July 2012 to September 2012, 2.38 million posts were collected, with a 12.8% total deletion rate (4.5% for system deletions and 8.3% for general deletions)
• The lifetime of a post is the difference between the time the system detected the deletion and the post's creation time
• The measurement fidelity is on the order of minutes
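
The lifetime definition translates directly into a timestamp subtraction. The sketch below also shows how one might compute the share of deletions that fall within a given time limit; the helper names and the three sample timestamps are made up for illustration and are not data from the paper.

```python
from datetime import datetime, timedelta


def lifetime(created_at, detected_deleted_at):
    """Lifetime = time the deletion was detected minus creation time."""
    return detected_deleted_at - created_at


def fraction_within(lifetimes, limit):
    """Share of lifetimes that are no longer than `limit`."""
    if not lifetimes:
        return 0.0
    return sum(1 for t in lifetimes if t <= limit) / len(lifetimes)


# Made-up observations: (creation time, time the deletion was detected).
created = datetime(2012, 8, 1, 12, 0)
observations = [
    (created, datetime(2012, 8, 1, 12, 5)),
    (created, datetime(2012, 8, 1, 12, 40)),
    (created, datetime(2012, 8, 2, 15, 0)),
]
lifetimes = [lifetime(c, d) for c, d in observations]
share_first_hour = fraction_within(lifetimes, timedelta(hours=1))
```

Because the crawler polls about once per minute, each computed lifetime overestimates the true one by up to roughly the polling interval, which is why the fidelity is described as being on the order of minutes.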
Distribution of Deleted Posts
Hypotheses
• How can the Weibo system find sensitive posts and remove them so quickly?
• How are those sensitive posts located by the moderators after a month in the huge database?
• Weibo has different strategies to target sensitive content
Hypotheses
• Hypothesis 1:
  • Weibo has filtering mechanisms as a proactive, automated defense
  • Explicit filtering
  • Implicit filtering
    • “shishikanfalunhowle”
  • Camouflaged posts
Hypotheses
• Hypothesis 2:
  • Weibo targets specific users, such as those who frequently post sensitive content
Hypotheses
• Hypothesis 3:
  • When a sensitive post is found, a moderator will use automated searching tools to find all of its related reposts (parent, child, etc.), and delete them all at once
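
The kind of search Hypothesis 3 describes can be pictured as a traversal of the repost tree. The sketch below is purely illustrative and says nothing about Weibo's internal tooling: it assumes each repost records a single parent, and the `parent_of` mapping and post IDs are hypothetical.

```python
from collections import defaultdict, deque


def related_reposts(post_id, parent_of):
    """Return every post in the same repost tree as `post_id`.

    parent_of: dict mapping a repost's ID to the ID of the post it reposts.
    """
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    # Walk up to the root (original post), guarding against cycles.
    root, visited = post_id, {post_id}
    while root in parent_of and parent_of[root] not in visited:
        root = parent_of[root]
        visited.add(root)

    # Breadth-first walk down from the root collects the whole tree.
    related, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        if node in related:
            continue
        related.add(node)
        queue.extend(children[node])
    return related


# Made-up trees: a -> {b, c}, b -> d, and an unrelated pair x -> y.
sample_parents = {"b": "a", "c": "a", "d": "b", "y": "x"}
chain = related_reposts("d", sample_parents)
```

Starting from post "d", the traversal reaches its parent "b", the original "a", and the sibling branch "c", while the unrelated tree is left alone.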
Hypotheses
• Hypothesis 4:
  • Deletion speed is related to the topic. That is, particular topics are targeted for deletion based on how sensitive they are.
  • Main 5 topics:
    • Qidong
    • Qian Yunhui
    • Beijing Rainstorm
    • Diaoyu Island
    • Group Sex
Topic Extraction
• Automatic methods are needed to classify the posts
• TF*IDF (https://zh.wikipedia.org/wiki/TF-IDF)
  • Assign weights to the terms (n-grams) of a document
• Pointillism approach [27]
  • Reconstruction from grams to words and phrases using external information
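
As a rough illustration of TF*IDF weighting over n-grams, the sketch below uses character bigrams and the textbook formula tf × log(N / df). The choice of bigrams, the absence of smoothing, and the toy documents are assumptions for the example; the slides do not specify the exact variant used in the paper.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tf_idf(documents, n=2):
    """Return one {ngram: weight} dict per document."""
    grams_per_doc = [Counter(char_ngrams(d, n)) for d in documents]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for grams in grams_per_doc:
        df.update(grams.keys())
    total_docs = len(documents)
    weights = []
    for grams in grams_per_doc:
        total = sum(grams.values())
        weights.append({
            g: (count / total) * math.log(total_docs / df[g])
            for g, count in grams.items()
        })
    return weights


# Toy documents: "ab" occurs in all three, so its weight is zero everywhere.
docs = ["abab", "abcd", "abef"]
w = tf_idf(docs)
```

N-grams that occur in every document receive a weight of zero, so the terms that distinguish one post from the rest stand out.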
Topic Extraction
• 李W阳 (Li Wangyang, from 李旺阳)
• 六圌四 (June Fourth, from 六四)
• 胡()涛 (Hu Jintao, from 胡锦涛)
• 启-东, 启\东 and 启/东 (Qidong, from 启东)
Topic Extraction
• Which topics among these have been discussed for the longest period of time?
• Independent Component Analysis (ICA)
  • Beijing, government, China, country, policeman, and people
  • These 6 terms appear in almost every individual topic
Discussion – Filtering Mechanisms
• Proactive mechanisms
  • Hypothesis 1
• Backwards reposts search
  • Hypothesis 3: chain reposts deletion
• Backwards keyword search
  • Similar to hypothesis 3: related keywords deletion
  • 兲朝
  • 37 人 (http://news.now.com/home/international/player?newsId=40857)
• Monitoring specific users
  • Hypothesis 2
Discussion – Filtering Mechanisms
• Account closures
  • 300 user accounts closed
• Search filtering
• Public timeline filtering
• User credit points
  • Users can report sensitive or rumor-based posts to earn points
Discussion – Time-of-day Behavior
Conclusion
• Deletions happen most heavily in the first hour
• 90% of the deletions happen within the first 24 hours
• The 4 hypotheses