ICDE2013 Study Group Session 19: Social Media II



DESCRIPTION

Slides used at the ICDE2013 study group.

TRANSCRIPT

Page 1: ICDE2013 Study Group Session 19: Social Media II

Session 19: Social Media II

Presenter: Yamamoto (Denso IT Laboratory)

[ICDE2013 Study Group]

The figures in this material are quoted from the papers. Saturday, June 29, 2013

Page 2: ICDE2013 Study Group Session 19: Social Media II

Session 19: Social Media II
Presenter: Mitsuo Yamamoto (Denso IT Laboratory)

Papers presented:

(1) A Unified Model for Stable and Temporal Topic Detection from Social Media Data
Detects topics in social media while judging whether a piece of content belongs to a temporal topic or a stable (permanent) one.

(2) Crowdsourced Enumeration Queries (Best Paper)
Estimates the number of answers (the population size) for a crowdsourced enumeration task, applying a species-richness estimation technique from biostatistics (Chao92).

(3) On Incentive-based Tagging
Improves the quality of tag information by giving incentives to workers.
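The Chao92 estimator mentioned for paper (2) can be sketched as follows. This is the basic coverage-based form (Chao, 1992) without the skew-correction term the full estimator adds, and the function name and example data are mine, not from the paper.

```python
from collections import Counter

def chao92_estimate(answers):
    """Coverage-based estimate of the total number of distinct answers
    (simplified Chao92, without the coefficient-of-variation correction).

    answers: list of answers collected from crowd workers, duplicates included.
    """
    n = len(answers)                    # total answers received
    freq = Counter(answers)             # multiplicity of each distinct answer
    c = len(freq)                       # distinct answers observed so far
    f1 = sum(1 for v in freq.values() if v == 1)  # answers seen exactly once
    if n == 0:
        return 0.0
    if f1 == n:                         # estimated coverage would be zero
        return float("inf")
    coverage = 1.0 - f1 / n             # estimated sample coverage
    return c / coverage                 # estimated population size

# 10 answers, 6 distinct, 2 singletons -> coverage 0.8, estimate about 7.5
print(chao92_estimate(list("aabbccdeff")))
```

Intuitively, many singleton answers mean the crowd is still discovering new items, so the estimated population grows; few singletons mean coverage is nearly complete.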


Page 3: ICDE2013 Study Group Session 19: Social Media II

[Goal] Topic extraction that takes both stable and temporal topics into account.

Definitions of stable and temporal topics:
Stable topic: someone is always talking about the theme.
Temporal topic: the number of mentions of the theme spikes up or down sharply over time; such swings are usually driven by real-life events.
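The contrast can be pictured with a toy check on a per-day mention-count series. The 3x-mean spike rule below is purely my illustration of the definitions, not a method from any of the papers (the paper's actual mechanism is the mixture model).

```python
def looks_temporal(counts, spike_factor=3.0):
    """Toy heuristic: call a topic 'temporal' if some day's mention count
    exceeds spike_factor times the series mean. Illustration only."""
    mean = sum(counts) / len(counts)
    return any(c > spike_factor * mean for c in counts)

pet_adoption = [5, 6, 5, 7, 6, 5, 6]       # steady chatter: stable
independence_day = [1, 1, 2, 40, 3, 1, 1]  # sharp event-driven spike: temporal
print(looks_temporal(pet_adoption))        # → False
print(looks_temporal(independence_day))    # → True
```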


A Unified Model for Stable and Temporal Topic Detection from Social Media Data


A Unified Model for Stable and Temporal Topic Detection from Social Media Data
Hongzhi Yin†, Bin Cui†, Hua Lu‡, Yuxin Huang†, Junjie Yao†

†Department of Computer Science and Technology, Key Laboratory of High Confidence Software Technologies, Peking University

‡Department of Computer Science, Aalborg University

†{bestzhi, bin.cui, huangyuxin, junjie.yao}@pku.edu.cn, ‡[email protected]

Abstract — Web 2.0 users generate and spread huge amounts of messages in online social media. Such user-generated contents are a mixture of temporal topics (e.g., breaking events) and stable topics (e.g., user interests). Due to their different natures, it is important and useful to distinguish temporal topics from stable topics in social media. However, such a discrimination is very challenging because the user-generated texts in social media are very short in length and thus lack useful linguistic features for precise analysis using traditional approaches.

In this paper, we propose a novel solution to detect both stable and temporal topics simultaneously from social media data. Specifically, a unified user-temporal mixture model is proposed to distinguish temporal topics from stable topics. To improve this model's performance, we design a regularization framework that exploits prior spatial information in a social network, as well as a burst-weighted smoothing scheme that exploits temporal prior information in the time dimension. We conduct extensive experiments to evaluate our proposal on two real data sets obtained from Del.icio.us and Twitter. The experimental results verify that our mixture model is able to distinguish temporal topics from stable topics in a single detection process. Our mixture model enhanced with the spatial regularization and the burst-weighted smoothing scheme significantly outperforms competitor approaches, in terms of topic detection accuracy and discrimination in stable and temporal topics.

I. INTRODUCTION

User-generated contents (UGC) in Web 2.0 are valuable resources capturing people's interests, thoughts and actions. Such contents cover a wide variety of topics that present online and offline lives. For example, the microblog services gather many short but quickly-updated texts that contain both temporal and stable topics. Such topics form a huge and rich repository of various kinds of interesting information.

Stable topics are often on users' regular interests and their daily routine discussions, which usually evolve at a rather slow speed. The extraction of such stable topics enables us to personalize the results and to improve the result relevance in many applications such as computational advertising, content targeting, personal recommendation and web search.

In contrast, temporal topics are on popular real-life events or hot spots. In many circumstances, temporal topics, e.g., breaking events in the real world, bring about popular discussion and wide diffusion on the Internet, where social networks further boost the discussion and diffusion. Take Twitter, the most popular microblog service, as an example. Many social events can be discovered in Twitter's posts (tweets), such as emergencies (e.g., earthquakes), politics (e.g., elections), public events (e.g., Olympics), and business (e.g., the release of a new smartphone).

An example of stable and temporal topics from Twitter is illustrated in Figure 1. We can tell the difference between them from the temporal distributions and the description keywords. A temporal topic has its text related to a certain event like "Independence Day celebration" in a certain period of time, and its popularity goes through a sharp increase at the occurring time of the event. A stable topic has its description on a user's regular interest like "Pet Adoption" and its temporal distribution exhibits no sharp, spike-like fluctuation.

Fig. 1. Stable and Temporal Topics in Twitter

It is important and useful to distinguish the temporal topics from the stable topics since they convey different kinds of information. However, temporal topics are discussed with less urgent themes in the background, and therefore temporal topics are deeply mixed with stable topics in social media. As a result, it is a challenging problem to detect and differentiate temporal and stable topics from large amounts of user-generated social media data.

Research on traditional topic detection and tracking employs on-line incremental clustering [1] or retrospective off-line clustering [25] for documents and extracts representative features for clusters as a summary of the events. These methods are suitable for conventional web pages where most documents are long, rich in keywords, and related to certain popular events.


978-1-4673-4910-9/13/$31.00 © 2013 IEEE, ICDE Conference 2013, p. 661


Page 4: ICDE2013 Study Group Session 19: Social Media II

[Approach]


a document is relatively rich in keywords and all keywords in a document are coherent. However, user-generated contents in today's social media usually consist of very short textual descriptions, which result in very sparse keyword-document distributions. Such a substantial distinction from traditional materials for information retrieval also calls for novel techniques for topic detection from social media data [12].

IV. A MIXTURE MODEL FOR DETECTING STABLE AND TEMPORAL TOPICS

In this section, we propose a user-temporal mixture topic model that integrates user and temporal features, followed by an EM-based algorithm for inferring model parameters.

A. User-Temporal Model

TABLE I: NOTATIONS

  u, t, w       user, time stamp, keyword
  U, T, W       sets of users, time stamps, and keywords
  M[u, t, w]    frequency of w used by u within time stamp t
  λ_U, λ_T      parameters controlling the branch selection
  θ_i           stable topic indexed by i
  θ_j           temporal topic indexed by j
  Θ_U, Θ_T      stable and temporal topic sets

In the user-temporal mixture model, we pre-categorize topics into two types: stable topics and temporal topics. Stable topics summarize themes reflected in regular postings according to the stable interest of a user or a community, while temporal topics capture the popular events or controversial news igniting hot discussion in a certain period. In this model, we aim to detect both temporal and stable topics in one generating process. Table I lists the relevant notations we use.

The mixture model is represented in Equation 1. For a stable topic θ_i we pay particular attention to the user u who generates it. For a temporal topic θ_j we pay more attention to when, indicated by time t, it is generated. Like PLSA [10], [11], our user-temporal model consists of three layers and two branches mixing user and temporal features, each branch deciding a different topic type. Parameters λ_U and λ_T in Equation 1 are the probability coefficients controlling the branch choice, which also denote the proportions of stable and temporal topics in the data set.

p(w | u, t) = λ_U Σ_{θ_i ∈ Θ_U} p(θ_i | u) p(w | θ_i) + λ_T Σ_{θ_j ∈ Θ_T} p(θ_j | t) p(w | θ_j)    (1)

For the user branch, a stable topic is chosen according to the interest of a particular user. For the time branch, a temporal topic is generated according to the time stamp of a post, which means the post belongs to the topics that are popular for a short period of time around that time stamp. Temporal topics have their distribution on the time dimension, which indicates their popularity probabilities. The time period during which a temporal topic has its highest probability is its popularity period. In our setting, the user interest is assumed to be stable through time, and we ignore the possible slight evolution of user interest.
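Evaluating the two-branch mixture in Equation 1 is mechanical once the four probability tables exist. The toy tables below are invented numbers for one user, one day, and two topics per branch; only the mixture formula itself follows the paper.

```python
# Hypothetical toy parameters; only the mixture formula (Eq. 1) is from the paper.
lam_U, lam_T = 0.6, 0.4                                        # branch coefficients
p_topic_user = {("s1", "alice"): 0.7, ("s2", "alice"): 0.3}    # p(theta_i | u)
p_word_stable = {("w", "s1"): 0.2, ("w", "s2"): 0.1}           # p(w | theta_i)
p_topic_time = {("t1", "june29"): 0.9, ("t2", "june29"): 0.1}  # p(theta_j | t)
p_word_temporal = {("w", "t1"): 0.5, ("w", "t2"): 0.05}        # p(w | theta_j)

def p_word(w, u, t, stable=("s1", "s2"), temporal=("t1", "t2")):
    """Mixture probability p(w | u, t), Equation 1: a lambda-weighted sum of
    the user (stable) branch and the time (temporal) branch."""
    user_branch = sum(p_topic_user[(i, u)] * p_word_stable[(w, i)] for i in stable)
    time_branch = sum(p_topic_time[(j, t)] * p_word_temporal[(w, j)] for j in temporal)
    return lam_U * user_branch + lam_T * time_branch

print(p_word("w", "alice", "june29"))  # roughly 0.284
```

Note how a word strongly tied to the time stamp (the temporal branch) can dominate the mixture even for a user whose stable interests barely cover it.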

Our mixture model contains four probability matrices, namely p(θ_i | u), p(w | θ_i), p(θ_j | t) and p(w | θ_j). These parameters are estimated from the observation data, which is a user-time-keyword matrix M where each cell M[u, t, w] indicates how many times user u posts word w during time period t. This matrix is used in estimating both types of topics.
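The user-time-keyword matrix M is naturally sparse, so it can be held as a counter over (user, time, word) triples. The post tuples and their format below are my assumption for illustration; only the meaning of M[u, t, w] comes from the paper.

```python
from collections import Counter

# Hypothetical post stream: (user, time_stamp, text).
posts = [
    ("alice", "2013-06-29", "fireworks fireworks parade"),
    ("bob",   "2013-06-29", "fireworks"),
    ("alice", "2013-06-30", "puppy adoption"),
]

# Sparse user-time-keyword matrix: M[u, t, w] = times u posted w during t.
M = Counter()
for user, t, text in posts:
    for word in text.split():
        M[(user, t, word)] += 1

print(M[("alice", "2013-06-29", "fireworks")])  # → 2
```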

Whether a keyword is generated by a stable topic or a temporal topic is decided by the relevant statistics, namely the contributions of the user u and the time slice t, respectively. For instance, if many other users also use keyword w in a certain period t, w would be treated as a temporal keyword with higher probability. Otherwise, it would belong to stable topics. Thus, keywords with clear temporal features would be clustered into temporal topics whose popularity synchronizes with that of their keywords.

The topics generated in the two branches are not estimated individually. Both types of topics interact with each other during the learning procedure. This two-branch assumption can filter the stable components out of burst topics via the stable branch. It also helps refine the quality of stable topics without disturbance from breaking events as time elapses.

B. Estimation of Model Parameters

Given an observation matrix M(U, T, W), the learning procedure of our model is to estimate the maximum probability of generating the observed samples. The log-likelihood of the whole document collection C by our approach is in Equation 2, where p(w | u, t) is defined according to Equation 1.

L(C) = Σ_U Σ_T Σ_W M[u, t, w] log p(w | u, t)    (2)

The goal of parameter estimation is to maximize Equation 2. As this equation cannot be solved directly by applying Maximum Likelihood Estimation (MLE), we apply an EM approach instead. In an expectation (E) step of the EM approach, posterior probabilities are computed for the latent variables based on the current estimates of the parameters. In a maximization (M) step, parameters are updated by maximizing the so-called expected complete data log-likelihood that depends on the posterior probabilities computed in the E-step.

In our model, we have the parameter set {p(w | θ_i), p(w | θ_j), p(θ_i | u), p(θ_j | t)} and the latent variables are the hidden topics θ_i and θ_j. For simplicity, we use Ψ to denote all these parameters. The detailed parameter estimation procedure for the user-temporal model is illustrated in the following equations.

E-step:

p(θ_i | u, t, w) = λ_U p(w | θ_i) p(θ_i | u) / (λ_U B(w | u) + λ_T B(w | t))    (3)

p(θ_j | u, t, w) = λ_T p(w | θ_j) p(θ_j | t) / (λ_U B(w | u) + λ_T B(w | t))    (4)

where B(w | u) and B(w | t) are defined as follows:

B(w | u) = Σ_{θ_i ∈ Θ_U} p(w | θ_i) p(θ_i | u)

B(w | t) = Σ_{θ_j ∈ Θ_T} p(w | θ_j) p(θ_j | t)
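The E-step can be sketched directly from Equations 3 and 4. The M-step below is the standard PLSA-style reestimation (responsibility-weighted counts, then renormalization), which this excerpt cuts off before reaching, so treat that part as an assumption; all names are mine.

```python
from collections import defaultdict

def em_step(M, lam_U, lam_T, p_w_s, p_s_u, p_w_t, p_t_tm, stable, temporal):
    """One EM iteration. E-step follows Eqs. 3-4; the M-step is an assumed
    PLSA-style update. Topic ids must be distinct across the two branches."""
    # ---- E-step: posterior responsibilities per Eqs. 3-4 ----
    resp = {}
    for (u, t, w) in M:
        b_u = sum(p_w_s[(w, i)] * p_s_u[(i, u)] for i in stable)     # B(w|u)
        b_t = sum(p_w_t[(w, j)] * p_t_tm[(j, t)] for j in temporal)  # B(w|t)
        denom = lam_U * b_u + lam_T * b_t
        for i in stable:    # Eq. 3
            resp[(u, t, w, i)] = lam_U * p_w_s[(w, i)] * p_s_u[(i, u)] / denom
        for j in temporal:  # Eq. 4
            resp[(u, t, w, j)] = lam_T * p_w_t[(w, j)] * p_t_tm[(j, t)] / denom
    # ---- Assumed M-step: responsibility-weighted count numerators ----
    new_w = defaultdict(float)  # numerators for p(w|theta)
    new_u = defaultdict(float)  # numerators for p(theta_i|u)
    new_t = defaultdict(float)  # numerators for p(theta_j|t)
    for (u, t, w), cnt in M.items():
        for i in stable:
            new_w[(w, i)] += cnt * resp[(u, t, w, i)]
            new_u[(i, u)] += cnt * resp[(u, t, w, i)]
        for j in temporal:
            new_w[(w, j)] += cnt * resp[(u, t, w, j)]
            new_t[(j, t)] += cnt * resp[(u, t, w, j)]
    # renormalization of the numerators is omitted for brevity
    return resp, new_w, new_u, new_t

# Tiny hypothetical demo: one user, one day, one word, one topic per branch.
r, new_w, new_u, new_t = em_step(
    {("alice", "d1", "w"): 3}, 0.5, 0.5,
    {("w", "s1"): 0.2}, {("s1", "alice"): 1.0},
    {("w", "t1"): 0.5}, {("t1", "d1"): 1.0},
    ("s1",), ("t1",))
# Responsibilities for a (u, t, w) cell sum to 1 across all topics.
print(round(r[("alice", "d1", "w", "s1")] + r[("alice", "d1", "w", "t1")], 6))  # → 1.0
```

The shared denominator of Equations 3 and 4 is what couples the branches: raising a temporal topic's responsibility for a word necessarily lowers the stable topics' share, which is how the model separates bursts from background interests.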



- p(w | u, t) is the probability that user u mentions word w at time t.

In short, stable topics depend on the person, while temporal topics depend on the time.


- The log-likelihood over the user-time-associated document collection C.

If we use the EM algorithm, ...


A. User-Temporal Model

SYMBOL DESCRIPTIONu, t, w user, time stamp, keyword

U, T, W set of users, time stamps and keywordsM [u, t, w] frequency of w used by u within time stamp t!U , !T parameter controlling the branch selection

"i stable topic indexed by i"j temporal topic indexed by j

!U , !T stable and temporal topic set

TABLE INOTATIONS

In the user-temporal mixture model, we pre-categorizetopics into two types: stable topics and temporal topics.Stable topics summarize theme reflected from regular postingsaccording to the stable interest of a user or a community. Whiletemporal topics capture the popular events or controversialnews igniting hot discussion in a certain period. In this model,we aim to detect both temporal and stable topics in onegenerating process. Table I lists the relevant notations we use.

The mixture model is represented in Equation 1. For a stabletopic !i we pay particular attention to its user u who generatesit. For a temporal topic !j we pay more attention to when,indicated by time t, it is generated. Like PLSA [10], [11], ouruser-temporal model consists of three layers and two branchesmixing user and temporal features, each branch deciding adifferent topic type. Parameters "U and "T in Equation 1are the probability coefficients controlling the branch choice,which also denote the proportions of stable and temporal topicsin the data set.

p(w|u, t) = "U

!

!i!!U

p(!i|u)p(w|!i)+"T

!

!j!!T

p(!j |t)p(w|!j)

(1)For the user branch, a stable topic is chosen according to the

interest of a particular user. For the time branch, a temporaltopic is generated according to the time stamp of a post, whichmeans the post belongs to the topics that are popular for ashort period of time around that time stamp. Temporal topicshave their distribution on the time dimension, which indicatesits popularity probabilities. The time period during which atemporal topic has its highest probability is its popularityperiod. In our setting, the user interest is assumed to be stablethrough time, and we ignore the possible slight evolution ofuser interest.

Our mixture model contains four probability matrices,namely p(!i|u), p(w|!i), p(!j |t) and p(w|!j). These parame-ters are estimated from the observation data, which is a user-time-keyword matrix M where each cell M [u, t, w] indicateshow many times user u post word w during time period t.This matrix is used in estimating both types of topics.

Whether a keyword is generated by either a stable topic ora temporal topic is decided by relevant statistics which are thecontributions by the user u and the time slice t, respectively.For instance, if many other users also use keyword w in acertain period t, w would be treated as a temporal keywordwith higher probability. Otherwise, it would belong to stabletopics. Thus, keywords with clear temporal features would beclustered into temporal topics whose popularity synchronizesto that of their keywords.

The topics generated in the two branches are not estimatedindividually. Both types of topics interact with each otherduring the learning procedure. This two-branch assumptioncan filter out the stable components from burst topics by stablebranch. It also helps refine the quality of stable topics withoutdisturbance from breaking events as time elapses.

B. Estimation of Model ParametersGiven an observation matrix M(U, T,W ), the learning

procedure of our model is to estimate the maximum probabilityof generating the observed samples. The log-likelihood of thewhole document collection C by our approach is in Equation2, where p(w|u, t) is defined according to Equation 1.

L(C) =!

U

!

T

!

W

M [u, t, w] log p(w|u, t) (2)

The goal of parameter estimation is to maximize Equation2. As this equation cannot be solved directly by applyingMaximum Likelihood Estimation (MLE), we apply an EMapproach instead. In an expectation (E) step of the EMapproach, posterior probabilities are computed for the latentvariables based on the current estimates of the parameters. In amaximization (M) step, parameters are updated by maximiz-ing the so-called expected complete data log-likelihood thatdepends on the posterior probabilities computed in the E-step.

In our model, we have parameter set {p(w|!i), p(w|!j),p(!i|u), p(!j |t)} and the latent variables are hidden topics !i

and !j . For simplicity, we use # to denote all these parameters.The detailed parameter estimation procedure for the user-temporal model is illustrated in following equations.

E-step:

p(!i|u, t, w) ="Up(w|!i)p(!i|u)

"UB(w|u) + "T B(w|t) (3)

p(!j |u, t, w) ="T p(w|!j)p(!j |t)

"UB(w|u) + "T B(w|t) (4)

where B(w|u) and B(w|t) are defined as follows:

B(w|u) =!

!i!!U

p(w|!i)p(!i|u)

B(w|t) =!

!j!!T

p(w|!j)p(!j |t)

664


From these, the stable topics and the temporal topics are obtained.
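The E-step above can be sketched numerically. The following is a minimal NumPy sketch of Equations (3)–(4) for one observed (u, t, w) triple; all dimensions, the λ values, and the random initialization are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Minimal sketch of one E-step of the user-temporal mixture model.
# Sizes, lambda values, and random initialization are illustrative assumptions.
rng = np.random.default_rng(0)
U, T, W, Ku, Kt = 4, 3, 6, 2, 2     # users, time slices, words, stable/temporal topics
lam_u, lam_t = 0.5, 0.5             # branch-selection coefficients lambda_U, lambda_T

def normalize(a, axis):
    return a / a.sum(axis=axis, keepdims=True)

p_w_given_su = normalize(rng.random((Ku, W)), 1)   # p(w|theta_i), stable topics
p_su_given_u = normalize(rng.random((U, Ku)), 1)   # p(theta_i|u)
p_w_given_tt = normalize(rng.random((Kt, W)), 1)   # p(w|theta_j), temporal topics
p_tt_given_t = normalize(rng.random((T, Kt)), 1)   # p(theta_j|t)

# Branch likelihoods B(w|u) and B(w|t), as defined below Eqns (3)-(4)
B_wu = p_su_given_u @ p_w_given_su   # (U, W): sum_i p(theta_i|u) p(w|theta_i)
B_wt = p_tt_given_t @ p_w_given_tt   # (T, W): sum_j p(theta_j|t) p(w|theta_j)

# E-step posteriors for one observed triple (u, t, w), Eqns (3) and (4)
u, t, w = 1, 2, 3
denom = lam_u * B_wu[u, w] + lam_t * B_wt[t, w]
post_stable = lam_u * p_w_given_su[:, w] * p_su_given_u[u, :] / denom    # p(theta_i|u,t,w)
post_temporal = lam_t * p_w_given_tt[:, w] * p_tt_given_t[t, :] / denom  # p(theta_j|u,t,w)

# The posteriors over all Ku + Kt hidden topics sum to one.
print(float(post_stable.sum() + post_temporal.sum()))  # ~1.0 up to float error
```

The shared denominator is what couples the two branches: probability mass assigned to the temporal topics is taken away from the stable topics and vice versa.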


Page 5: ICDE2013勉強会 Session 19: Social Media II

} Special smoothing
} Countermeasure for burst words

A Unified Model for Stable and Temporal Topic Detection from Social Media Data

The parameter β plays a role similar to that of the regularization parameter in Equation 11, controlling the balance between the data likelihood and the smoothness of the topic distribution over time. Similar to the spatial regularization, the user-temporal mixture model with temporal regularization adopts the same generative scheme as the basic mixture model. Thus, the enhanced model has exactly the same E-step as the basic model. For the M-step, it can be derived that the expected complete data log-likelihood for the enhanced model is:

Q̄(Ψ) = Q(Ψ) − βR(C, T)    (18)

For the same reason as with the spatial regularization, we cannot obtain a closed-form solution for p(θ_j|t) in the M-step. The generalized EM algorithm is also adopted to estimate all parameters and latent variables in the enhanced model with the temporal regularization, since R(C, T) has a similar form to the spatial regularizer R(C, G). Clearly, just like the spatial regularizer R(C, G), the temporal regularizer R(C, T) is non-negative, so the Newton-Raphson method decreases R(C, T) in each updating step. By integrating Ψ^(1)_{n+1} and R(C, T) into Equation 14, we obtain the closed-form solution for Ψ^(2)_{n+1}, and then Ψ^(3)_{n+1}, ..., Ψ^(m)_{n+1}, ..., where

p(θ_j|t)^(m+1)_{n+1} = (1 − γ) p(θ_j|t)^(m)_{n+1} + γ [p(θ_j|t−1)^(m)_{n+1} + p(θ_j|t+1)^(m)_{n+1}] / 2    (19)

Notice that p(w|θ_i)_{n+1}, p(w|θ_j)_{n+1} and p(θ_i|u)_{n+1} remain the same in Ψ_{n+1}; the temporal regularization process is formalized in Algorithm 2.

Algorithm 2: Temporal Regularization

Input: parameters Ψ_{n+1}, temporal regularization parameter β, Newton step parameter γ
Output: p(θ_j|t)_{n+1}

p(θ_j|t)^(1)_{n+1} ← p(θ_j|t)_{n+1};
compute p(θ_j|t)^(2)_{n+1} as in Eqn. (19);
while Q̄(Ψ^(2)_{n+1}) ≥ Q̄(Ψ^(1)_{n+1}) do
    p(θ_j|t)^(1)_{n+1} ← p(θ_j|t)^(2)_{n+1};
    compute p(θ_j|t)^(2)_{n+1} as in Eqn. (19);
end
if Q̄(Ψ^(1)_{n+1}) ≥ Q̄(Ψ_n) then
    p(θ_j|t)_{n+1} ← p(θ_j|t)^(1)_{n+1};
end
return p(θ_j|t)_{n+1};
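The core update of Eqn. (19) can be sketched in a few lines of Python. The function name, the boundary handling (edge slices reuse their own value as the missing neighbor), and the example numbers are illustrative assumptions; the paper only defines the interior update:

```python
# Sketch of one Eqn. (19) step: each time slice of p(theta_j|t) moves toward
# the average of its two temporal neighbors, with step size gamma.
def temporal_smoothing_step(p_topic_given_t, gamma):
    """Apply one smoothing update to a single topic's distribution over time.

    p_topic_given_t: list of p(theta_j|t) values for t = 0 .. T-1.
    Boundary slices reuse their own value as the missing neighbor (assumption).
    """
    T = len(p_topic_given_t)
    out = []
    for t in range(T):
        left = p_topic_given_t[t - 1] if t > 0 else p_topic_given_t[t]
        right = p_topic_given_t[t + 1] if t < T - 1 else p_topic_given_t[t]
        out.append((1 - gamma) * p_topic_given_t[t] + gamma * (left + right) / 2.0)
    return out

curve = [0.05, 0.10, 0.60, 0.15, 0.10]   # a spiky temporal topic curve
print(temporal_smoothing_step(curve, gamma=0.5))
```

With this boundary choice the update preserves the total probability mass while flattening spikes, which is why Algorithm 2 must check Q̄ after each step and stop once smoothing starts hurting the likelihood.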

C. Burst-Weighted Smoothing

Topic models that do not consider individual burst words lead to the problem that words with a high occurrence rate, mostly denoting semantically abstract concepts, are likely to appear at the top positions in summaries [6]. This phenomenon widely exists and is a direct result of the principle of probabilistic models, because words with more appearances tend to be estimated with high generation probabilities and ranked at top positions in a topic. In stable topics, these general words can illustrate the domains of topics at a first glimpse. However, in temporal topics, words with a notable burst feature are superior in expressing temporal information compared to those with abstract meanings, since users are more interested in burst words than in abstract categories when browsing a temporal topic.

An example of the two kinds of words is shown in Figure 2. Three burst words "mj", "moonwalk" and "michaeljackson" have distribution curves with sharp spikes. We can see that although the trends of these words do not always synchronize, they all go through a drastic increase and reach peaks in July 2009. The bursts in their curves are ignited by a real-life event, i.e., Michael Jackson's death. An effective topic model should capture these words in one topic.

On the other hand, abstract words like "news" and "world" maintain high occurrences throughout the year in Figure 2, but they convey little information. Although they are relevant to the event in July, they also have relationships to many other topics. For example, the word "news" could be used to represent various different news items. However, such abstract words shadow the spikes of more meaningful words. The high occurrences of such abstract words during the burst period of the burst words may overwhelm the latter and render them unnoticed.

Fig. 2. Normalized Word Frequency Distribution on "Michael Jackson's Death" in 2009

To boost interesting temporal topics, we propose a smoothing technique that merges correlated words into one temporal topic and achieves higher visibility for such a topic.

In particular, we first conduct burst detection [15], [27] to detect individual burst words. Most of the spikes on words' temporal distribution curves are brought about by discussion of correlated temporal topics. Burst detection at word granularity cannot clearly express the whole topic because of its semantic limitations. However, burst words are the core components of temporal topics, as illustrated in Figure 2.

Since words' burst information cannot be directly inferred from raw data, we apply an existing approach [27] by modeling the word occurrence stream with sliding windows. We assume


The bursty degree is measured (Yao et al., ICDE 2010) and used as a correction.

When friends are all actively discussing the same topic, a correction is also applied.


Page 6: ICDE2013勉強会 Session 19: Social Media II

} Results

A Unified Model for Stable and Temporal Topic Detection from Social Media Data

By comparison among BUT, EUTB and EUTT, we observe that the temporal regularization does not improve the temporal topic detection the way the spatial regularization promotes the stable topic extraction, although they share similar forms and principles. This is because temporal topics have a different nature from stable topics. Specifically, temporal topics are sensitive to time and are often related to breaking events, while temporal regularization aims to reduce the difference between different time slices.

In summary, we have three conclusions. First, the accuracy of both temporal and stable topic detection benefits from the unified model that integrates the two aspects of information. Second, spatial regularization, e.g., social network regularization, can improve stable topic detection. Third, our proposed burst-weighted smoothing scheme can further improve the accuracy of temporal topic detection, which is supported by the accuracy difference between EUTB and EUTS.

2) Temporal Feature Component: Typical temporal topics should contain enough burst features to be distinguished from stable topics. Since different models have different latent topic parameters, their topic distributions are incomparable. To represent a topic's temporal distribution from the illustrative view, we define the topic distribution over the time axis as a weighted aggregation of its representative keywords:

D(θ, t) = Σ_w p(w|θ) · c(t, w) / Σ_{w′∈W} c(t, w′)

Here c(t, w) denotes the occurrences of word w at time t over all users. This metric can reflect the quality of topics from the aspect of valuing the temporal features inside topics. A successfully detected temporal topic should contain representative bursty keywords with spikes over time, while a stable topic should have a flat distribution over time.
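The curve D(θ, t) and the normalized-variance unevenness measure used next are both simple to compute. The following sketch uses a toy p(w|θ) and toy counts c(t, w); all numbers are illustrative assumptions:

```python
# Illustrative computation of the topic-over-time curve D(theta, t) and its
# normalized variance, with made-up p(w|theta) and word counts c(t, w).
p_w_given_topic = {"mj": 0.5, "moonwalk": 0.3, "news": 0.2}

# c[t][w]: occurrences of word w at time slice t over all users (toy numbers)
c = [
    {"mj": 1,  "moonwalk": 0,  "news": 10},   # t = 0
    {"mj": 40, "moonwalk": 20, "news": 12},   # t = 1 (burst period)
    {"mj": 2,  "moonwalk": 1,  "news": 11},   # t = 2
]

def topic_curve(p_w, counts):
    """D(theta, t) = sum_w p(w|theta) * c(t, w) / sum_{w'} c(t, w')."""
    curve = []
    for row in counts:
        total = sum(row.values())
        curve.append(sum(p_w[w] * row[w] / total for w in row))
    return curve

def normalized_variance(curve):
    """Variance divided by the squared mean: a scale-free unevenness measure
    (one plausible reading of the paper's N-variance; an assumption here)."""
    mean = sum(curve) / len(curve)
    var = sum((x - mean) ** 2 for x in curve) / len(curve)
    return var / (mean * mean)

D = topic_curve(p_w_given_topic, c)
print(D, normalized_variance(D))
```

A bursty topic like this one peaks at the burst slice and yields a positive normalized variance; a stable topic's flat curve would drive the measure toward zero.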

We measure a curve's unevenness by its normalized variance. Intuitively, the larger the difference between the N-variances of temporal and stable topics, the better the model performs on this measurement. The results are listed in Table II. Our proposed user-temporal mixture model and its variants outperform the other competitors on this metric. Temporal topics detected by the user-temporal mixture model have much sharper curves, and therefore the normalized variances of their curves significantly exceed those of the temporal topics detected by competitors. On the other hand, our models have lower normalized variances on stable topics than the competitors, which means the stable topics extracted by our models present smoother curves. The competitor TimeUserLDA performs best among all competitor methods since it considers both users' interests and temporal topic trends simultaneously, just like ours. But this method is beaten by our proposed EUTB because TimeUserLDA does not consider the social aspect and the burst features of words when discovering topics.

Among our proposed models, EUTS obtains the lowestnormalized variance for stable topic extraction, which isconsistent with the aforementioned conclusion that spatial reg-ularization can improve the stable topic extraction. In addition,EUTB has the largest normalized variance value for temporaltopic detection. These results support the aforementioned

conclusion that burst-weighted smoothing scheme can furtherimprove the accuracy of temporal topic detection.

TABLE II: N-VARIANCES OF DIFFERENT APPROACHES

Topic Detection Approach   N-Variance (stable)   N-Variance (temporal)   Difference
BUT                        0.36                  1.31                    0.95
EUTS                       0.21                  1.26                    1.05
EUTT                       0.38                  1.30                    0.92
EUTB                       0.26                  1.60                    1.34
TOT                        0.39                  0.50                    0.11
Individual Detection       0.38                  0.99                    0.61
TimeUserLDA                0.39                  1.32                    0.93
Twitter-LDA                0.38                  0.96                    0.58

3) User Study: In order to evaluate the quality of temporal topics detected by different models, we adopted a method similar to that used in [7] and conducted a user survey on the Twitter data set, hiring 3 volunteers as annotators. For each topic, we extracted the keywords with the highest probabilities to represent its content. Each topic was labeled by two different annotators, and if they disagreed a third annotator was introduced. Three exclusive labels were provided to indicate the quality of temporal topic detection.

• Excellent: a nicely presented temporal topic
• Good: a topic containing bursty features
• Poor: a topic without obvious bursty features

TABLE III: COMPARISON ON TEMPORAL TOPIC QUALITY

Approach               Excellent   Good    Poor
EUTB                   42.5%       32.5%   25%
TOT                    10%         40%     50%
Individual Detection   20%         37.5%   42.5%
TimeUserLDA            29.5%       38%     32.5%
Twitter-LDA            13.5%       39%     47.5%

The labeling results are summarized in Table III. Up to 75% of the temporal topics detected by EUTB were labeled as "Excellent" or "Good", and 42.5% were regarded as "Excellent". Among all competitors, TimeUserLDA performs best: 67.5% of its detected temporal topics were judged as "Excellent" or "Good", and 29.5% were regarded as "Excellent". The other competitors had barely or only slightly more than 50% of their detected topics labeled as "Excellent" or "Good". In particular, the competitors received significantly fewer "Excellent" labels. These results demonstrate that our proposed user-temporal mixture model, enhanced with spatial regularization and burst-weighted smoothing, outperforms its competitors markedly.

4) Illustration of Topics Detected: In this section, we listpart of the stable and temporal topics detected by our user-temporal mixture model on the two real data sets.


TABLE IV: TOPIC "MICHAEL JACKSON" DETECTED BY DIFFERENT APPROACHES

PLSA on slices   Individual Detection   TOT model        EUTB             TimeUserLDA
latest           michaeljackson         news             michaeljackson   news
headline         july                   world            jackson          jackson
news             breaking               breaking         mj               michael
investigative    news                   jackson          moonwalk         michaeljackson
michaeljackson   headline               michaeljackson   death            death
event            investigative          death            news             investigative

TABLE V: TEMPORAL TOPICS DETECTED ON DEL.ICIO.US DATA SET

T77 (2009.1.12-2009.1.31): obama 0.144, inauguration 0.106, bush 0.059, president 0.021, gaza 0.017, whitehouse 0.012
T78 (2009.6.15-2009.6.27): moon 0.090, space 0.068, apollo11, apollo 0.023, nasa 0.019, competition 0.015
T87 (2009.4.24-2009.5.6): flu 0.158, swineflu 0.124, pandemic 0.062, swine 0.050, health 0.020, disease 0.010
T89 (2009.5.27-2009.6.6): google 0.061, googlewave 0.059, wave 0.042, bing 0.040, apps 0.040, realtime 0.038
T60 (2009.1.24-2009.1.27): droid 0.125, go 0.113, dragonage 0.101, android 0.083, tricks 0.016, jailbreak 0.013
T71 (2009.1.1-2009.1.6): 2008 0.099, webcomics 0.046, macworld 0.033, websites 0.028, predictions 0.026, videogame 0.022

TABLE VI: TEMPORAL TOPICS DETECTED ON TWITTER DATA SET

T63 (2009.7.6-2009.7.15): july 0.012, free 0.010, summer 0.008, live 0.007, potter 0.006, harry 0.006
T86 (2009.7.1-2009.7.6): july 0.035, happy 0.020, day 0.016, firework 0.009, independ 0.006, celebrate 0.005
T66 (2009.10.7-2009.10.15): free 0.012, nobel 0.012, prize 0.011, peace 0.008, win 0.008, obama 0.008
T70 (2009.6.24-2009.6.30): michael 0.038, jackson 0.036, rip 0.007, farrah 0.007, dead 0.005, sad 0.005
T78 (2009.3.1-2009.4.15): easter 0.010, happy 0.009, spring 0.008, day 0.007, april 0.007, march 0.007
T94 (2009.9.12-2009.9.15): kanye 0.016, patrick 0.011, swayze 0.011, west 0.007, taylor 0.006, vma 0.004

Our mixture model is able to detect many meaningful temporal topics like "swineflu" or "webcomics" which competitor approaches cannot detect. Our mixture model also presents detected bursty topics in a more meaningful way than the competitors. An example of "Michael Jackson's death" is illustrated in Table IV. In other approaches, keywords with abstract semantics like "event", "headline", and "world" are estimated to be generated with high probabilities due to their high frequencies, and thus they are ranked in top positions of the detected topic. But these words are of little help in describing the event. In contrast, our model clearly promotes concrete words like "moonwalk" because they have a tight co-burst relationship with "Michael Jackson". Note that such low-frequency words are generally ignored by existing approaches.

Further, we list the stable topics and temporal topics detected by our mixture model in Tables V to VIII. From these results, we can clearly tell the differences between stable and temporal topics. Stable topics are about a certain, common aspect with more general words as their composition, like the topics "technology", "news", "international news" in Table VII. In contrast, temporal topics are closely related to specific events during a certain period of time. Referring to Table V, the six topics shown are about "U.S. president inauguration", "Apollo 11 ceremony", "swineflu", "googlewave", "android" and "webcomics", respectively. All of them correspond to real-life events. The time when these topics reached their peaks, shown in the second row of the table, is very close to the time when these events happened in real life.

TABLE VII: STABLE TOPICS DETECTED ON DEL.ICIO.US DATA SET

T10: windows 0.049, tools 0.048, freeware 0.038, firefox 0.038, google 0.029, security 0.028
T27: news 0.107, latest 0.102, current 0.099, world 0.094, events 0.084, newspaper 0.084
T33: food 0.034, recipe 0.033, cooking 0.030, dessert 0.026, shopping 0.021, home 0.016

TABLE VIII: STABLE TOPICS DETECTED ON TWITTER DATA SET

T5: free 0.020, market 0.011, money 0.010, people 0.007, check 0.007, help 0.006
T11: day 0.104, travel 0.009, hotel 0.008, check 0.006, site 0.004, golf 0.004
T53: assassin 0.039, attempt 0.034, wound 0.024, level 0.020, reach 0.016, account 0.013

As shown in Tables VI and VIII, topics detected on Twitter are more closely related to daily life, compared to the topics on Del.icio.us, where users are more inclined to discuss technology. Table VI shows that temporal topics on Twitter are more correlated with seasons, festivals, and breaking news as well as entertainment. These results are consistent with the different natures of Twitter and Del.icio.us.

Finally, Figure 4 shows the popularity curves of temporaltopics detected by our model. Most of these temporal topicsreach a peak in a short period of time and then disappearvery soon, making a sharp spike on the curve. Topics with


In the user study, topic extraction by the proposed method (EUTB) was rated highly.


Page 7: ICDE2013勉強会 Session 19: Social Media II


Crowdsourced Enumeration Queries


・Search over an existing DB: the data that exists in the database is all there is (closed-world assumption).

・Search via crowdsourcing: the data exists on the web or in people's heads; the size of the answer population is unknown.

Figure 2: CrowdDB Architecture. (The diagram shows CrowdSQL queries entering and Results leaving a stack of components: Parser, Optimizer with Statistics and Metadata, Executor, and Files/Access Methods over Disk 1 and Disk 2, plus crowd-specific components: UI Manager, HIT Manager, Quality Control/Progress Monitor, and Turker Relationship Manager.)

2. BACKGROUND

In this section we describe the CrowdDB system, which

serves as the context for this work. We then discuss relatedwork on cardinality estimation in both the Statistics andDatabase Query Processing domains.

2.1 CrowdDB Overview

CrowdDB is a hybrid human-machine database system

that uses human input to process queries. CrowdDB currently supports two crowdsourcing platforms: AMT and our own mobile platform [16]. We focus on AMT in this paper, the leading platform for so-called microtasks. Microtasks, also called Human Intelligence Tasks (HITs) in AMT, usually do not require any special training and do not take more than a few minutes to complete. AMT provides a marketplace for microtasks that allows requesters to post HITs and workers to search for and work on HITs for a small reward, typically a few cents each.

Figure 2 shows the architecture of CrowdDB. CrowdDB incorporates traditional query compilation, optimization and execution components, which are extended to cope with human-generated input. In addition, the system is extended with crowd-specific components, such as a user interface (UI) manager and a quality control/progress monitor. Users issue queries using CrowdSQL, an extension of standard SQL. CrowdDB automatically generates UIs as HTML forms based on the CROWD annotations and optional free-text annotations of columns and tables in the schema. Figure 3 shows an example HTML-based UI that would be presented to a worker for the following crowd table definition:

CREATE CROWD TABLE ice_cream_flavor {
    name VARCHAR PRIMARY KEY
}

Although CrowdDB supports alternative user interfaces (e.g., showing previously received answers), this paper focuses on a pure form of the "getting it all" question. The use of alternative UIs is the subject of future work.

During query processing, the system automatically posts one or more HITs using the AMT web service API and collects the answers as they arrive. After receiving the answers, CrowdDB performs simple quality control using quorum votes before it passes the answers to the query execution engine. Finally, the system continuously updates the query result and estimates the quality of the current result based on the new answers. The user may thus stop the query as soon as the quality is sufficient, or intervene if a problem is detected. More details about the CrowdDB components and query execution are given in [18]. We describe in this paper how the system can estimate the completeness of the query result using algorithms from the species estimation literature.

Figure 3: Ice cream flavors task UI on AMT

2.2 Cardinality Estimation

To estimate progress as answers are arriving, the system

needs an estimate of the result set's cardinality. We can tackle cardinality estimation by applying, in a new context, algorithms developed for estimating the number of species. In the species estimation problem [4, 7], an estimate of the number of distinct species is determined using observations of species in the locale of interest. These observations represent samples drawn from a probability distribution describing the likelihood of seeing each item. By drawing a parallel between observed species and answers received from the crowd, we can apply these techniques to reason about the result set size of a crowdsourced query.

Work on distinct value estimation in traditional database

systems has also looked into species estimation techniques to inform query optimization over large tables; tuples are sampled to estimate the number of distinct values present. Techniques used and developed in that literature leverage knowledge of the full table size, which is possible only because of the closed-world assumption. In the species estimation literature, the difference between these two scenarios is referred to as finite vs. infinite populations, which correspond to closed vs. open world, respectively.

In [22], Haas et al. survey different estimators, several

of which we also investigate in this paper. They do not use the algorithm we find superior because they observe it produced overly large estimates when used in the context of a finite population. Instead they propose a hybrid approach, choosing between the Shlosser estimator [31] and a version of the Jackknife estimator [5] they modified to suit a finite population. The Jackknife technique is used for tables in which distinct values are uniformly distributed.

This work was extended in [9], in which Charikar et al.

propose a di↵erent hybrid approach. They note a lack ofanalytic guarantees on errors in previous work, and derivea lower bound on error that an estimator should achieve.They then show that their algorithm is superior to Shlosserin the non-uniform case, substituting it in the hybrid ap-proach from [22]. Unfortunately, both the error bounds anddeveloped estimators explicitly incorporate knowledge of thefull table size – a closed-world luxury. Other database tech-
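The parallel between observed species and crowd answers comes down to summary statistics over the answer stream. As an illustrative sketch (the function name and example data are ours, not CrowdDB's), the species-style summary of a batch of answers can be computed like this:

```python
from collections import Counter

def frequency_statistics(answers):
    """Summarize a stream of crowd answers the way species
    estimators summarize field observations.

    Returns (n, c, f) where n is the number of answers, c the
    number of distinct answers, and f maps i -> f_i, the number
    of distinct answers observed exactly i times."""
    counts = Counter(answers)            # answer -> times observed
    f = Counter(counts.values())         # i -> number of answers seen i times
    return len(answers), len(counts), dict(f)

# Example: 9 answers, 5 distinct; "CA" and "WA" appear three times each.
n, c, f = frequency_statistics(
    ["CA", "WA", "CA", "TX", "WA", "NY", "CA", "WA", "OR"])
# n = 9, c = 5, f = {3: 2, 1: 3}  (two triples, three singletons)
```

The frequency-of-frequencies f_i is exactly what the estimators below consume; in particular f_1, the number of singletons, drives the completeness signal.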


Page 8: ICDE2013勉強会 Session 19: Social Media II

Session 19: Social Media II. Presenter: Mitsuho Yamamoto (Denso IT Laboratory)

Crowdsourced Enumeration Queries
Overview:
- Estimating the cardinality (the underlying population size) of the answer set for a crowdsourced enumeration task.
- Applies a species-richness estimation technique from biostatistics (Chao92).

What is species-richness estimation?
- Survey the individuals in a particular area to estimate which species are present and at what density.
- A famous example of this kind of inference is estimating the total number of dinosaur species on Earth (see figure).


Estimating the diversity of dinosaurs (Steve C. Wang and Peter Dodson)


Page 9: ICDE2013勉強会 Session 19: Social Media II

- Many estimation functions exist, differing in whether the species is the unit of analysis, whether divergence between individuals is considered, whether evenness is taken into account, and so on.
  - Chao84 estimator
  - Chao92 estimator (the one used in this paper)
    - Uses the notion of sample coverage


Crowdsourced Enumeration Queries


3.2.2 Chao84 Estimator
In [6], Chao develops a simple estimator for species richness that is based solely on the number of rare species found in the sample:

    N_chao84 = c + f1^2 / (2 * f2)

Chao found that it actually is a lower bound, but it performed well on her test data sets. She also found that the estimator works best when there are relatively rare species, which is often the case in real species estimation scenarios.

3.2.3 Chao92 Estimator
In [8], Chao develops another estimator based on the notion of sample coverage. The sample coverage C is the sum of the probabilities p_i of the observed classes. However, since the underlying distribution p1...pN is unknown, this estimate from the Good-Turing estimator [20] is used:

    Ĉ = 1 − f1 / n

The Chao92 estimator attempts to explicitly characterize and incorporate the skew of the underlying distribution using the coefficient of variance (CV), denoted γ, a metric that can be used to describe the variance in a probability distribution [8]; we can use the CV to compare the skew of different class distributions. The CV is defined as the standard deviation divided by the mean. Given the p_i's (p1 ... pN) that describe the probability of the ith class being selected, with mean p̄ = Σ_i p_i / N = 1/N, the CV is expressed as γ = [Σ_i (p_i − p̄)^2 / N]^(1/2) / p̄ [8]. A higher CV indicates higher variance amongst the p_i's, while a CV of 0 indicates that each item is equally likely.

The true CV cannot be calculated without knowledge of the p_i's, so Chao92 uses an estimate γ̂:

    γ̂^2 = max{ (c/Ĉ) · Σ_i i(i−1) f_i / (n(n−1)) − 1, 0 }        (2)

The estimator that uses the coefficient of variance is below; note that if γ̂^2 = 0 (i.e., indicating a uniform distribution), the estimator reduces to c/Ĉ:

    N_chao92 = c/Ĉ + (n(1 − Ĉ)/Ĉ) · γ̂^2

3.3 Experimental Results
We ran over 25,000 HITs on AMT to compare the performance of the different estimators. Several CROWD tables we experimented with include small and large well-defined sets like NBA teams, US states, UN member countries, as well as less well-defined sets like ice cream flavors, animals found in a zoo, and graduate students about to graduate. Workers were paid $0.01 to provide one item in the set using the UI in Figure 3; they were allowed to complete multiple tasks if they wanted to submit more than one answer.

In the remainder of this paper we focus on three experiments, US states, UN member countries, and ice cream flavors, to demonstrate a range of characteristics that a particular query may have. The US states is a small, constrained set while the UN countries is a larger constrained set that is not so readily memorizable. The ice cream flavors experiment captures a set that has a fair amount of membership and size ambiguity. We repeated our experiments nine times for the US states, five times for the UN countries and once for the ice cream flavors. In this paper we cleaned and verified workers' answers manually; other work has described techniques for crowd-based verification [24, 1, 10, 14].

Figure 5(a-c) shows the average cardinality estimates over time, i.e., for increasing numbers of HITs, for the US states, UN countries, and ice cream flavors using three different estimators. Error bars can be computed using variance estimators provided in [8, 6]; however, we omit them for better readability. The horizontal line indicates the true cardinality if it is known. Below each graph, a table shows the "f1-ratio" and the actual number of received unique items over time. We define f1-ratio as f1 / Σ_i f_i, the fraction of the singletons as compared to the overall received unique items. Recall that the presence of singletons is a strong indicator that there are more undetected items; when there are relatively few singletons, we have likely approached the plateau of the SAC. The f1-ratio can be used as an indication of whether or not the sample size is sufficient for stable cardinality estimation. Since estimators use the relative frequencies of f1 compared to the other f_i's, a high f1-ratio will make it more difficult for the estimators to converge. Also note that the ratio between the unique items and the predicted cardinality is the completeness estimate.

3.3.1 US States
For the US states (Figure 5(a)), all estimators perform fairly well; Chao92 remains closer to the true value than Chao84. The estimates are stable at 150 HITs, and near the true value even earlier. Note this happens well before all fifty states are acquired (on average, after 225 HITs). It may be surprising that the uniform estimator performs as well as it does, as one might suspect that certain states would be more commonly chosen than others. There are a few explanations for this performance. First, the average coefficient of variance γ̂ for the states experiments is 0.53; in [8], Chao notes that the uniform estimator is still reasonable for γ̂ ≤ 0.5. Furthermore, individual workers typically do not submit the same answer multiple times; samples drawn without replacement from a particular distribution will result in a less skewed distribution than the original. We discuss sampling without replacement further in Section 4. Individual workers also may be drawing from different skewed distributions, e.g., naming the midwestern states before those in the mid-Atlantic.

3.3.2 UN Countries
In contrast to the US states, the uniform estimator more dramatically under-predicts the true value of 192 for the UN countries experiments (Figure 5(b)). This makes sense considering the average γ̂ for the UN countries experiments is 0.73. Since the uniform estimator assumes each country is equally likely to be given, it predicts that the total set size is smaller. The Chao estimators converge faster to the true value than the uniform estimator. Unlike the States experiment, we did not obtain the full set in most of the experiment runs (see the table in Figure 5(b)).

The Chao estimators produced good predictions; however, they appeared to fluctuate in the middle of the experiment, starting low then increasing over the true value before converging back down. While it is encouraging that the estimators perform well on average, we observed that the variance was fairly high, indicating to us that some of the experiment runs did not act as expected.

c: number of species observed; f1: number of species observed exactly once; f2: number of species observed exactly twice
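With c, f1, and f2 defined as above, Chao84 is a one-line formula. The following is our own illustrative sketch in Python, not code from the paper; the fallback for f2 = 0 is a commonly used bias-corrected variant, not part of the original estimator.

```python
from collections import Counter

def chao84(answers):
    """Chao84 lower-bound estimate of the number of distinct items:
    N_chao84 = c + f1^2 / (2 * f2)."""
    counts = Counter(answers)
    c = len(counts)                          # distinct items observed
    f = Counter(counts.values())
    f1, f2 = f.get(1, 0), f.get(2, 0)
    if f2 == 0:                              # guard against division by zero
        return c + f1 * (f1 - 1) / 2.0       # bias-corrected variant
    return c + f1 ** 2 / (2.0 * f2)

# 6 distinct items: 3 singletons (C, D, E), 2 doubletons (A, B), 1 tripleton (F)
sample = list("AABBCDEFFF")
print(chao84(sample))   # 6 + 3^2 / (2*2) = 8.25
```

Because the formula depends only on c, f1, and f2, it can be updated incrementally as new answers arrive.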


C: sample coverage (the sum of the probabilities p_i of the observed species)
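Chao92 can be sketched the same way. This is again our own illustration, not the authors' implementation: it combines the Good-Turing coverage estimate Ĉ = 1 − f1/n with the CV correction of Eq. (2).

```python
from collections import Counter

def chao92(answers):
    """Chao92 estimate: N_chao92 = c/Ĉ + n(1-Ĉ)/Ĉ * γ̂², where
    Ĉ = 1 - f1/n is the Good-Turing sample-coverage estimate and
    γ̂² is the estimated squared coefficient of variance (Eq. 2)."""
    counts = Counter(answers)
    n = len(answers)                         # total answers
    c = len(counts)                          # distinct answers
    if n < 2:
        return float(c)                      # too little data for Eq. (2)
    f = Counter(counts.values())
    cov = 1.0 - f.get(1, 0) / n              # sample coverage Ĉ
    if cov == 0.0:                           # all singletons: no basis to estimate
        return float("inf")
    n_hat = c / cov                          # uniform-case estimate c/Ĉ
    # Estimated squared CV, Eq. (2)
    gamma2 = max(
        n_hat * sum(i * (i - 1) * fi for i, fi in f.items()) / (n * (n - 1)) - 1.0,
        0.0)
    return n_hat + n * (1.0 - cov) / cov * gamma2

sample = list("AABBCDEFFF")      # n = 10, c = 6, f1 = 3 -> Ĉ = 0.7
print(round(chao92(sample), 2))  # γ̂² clamps to 0 here, so estimate is c/Ĉ = 8.57
```

Note that when γ̂² = 0 the code reduces to c/Ĉ, matching the remark below Eq. (2).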



Various other methods exist as well, such as the Abundance-based Coverage Estimator (ACE).


Page 10: ICDE2013勉強会 Session 19: Social Media II


Crowdsourced Enumeration Queries
- We tried estimating the number of UN member countries using Chao92.

- Cause: attributable to worker behavior.


[Fig. 4. Estimated Cardinality: the Chao92 estimate as a function of the number of answers, with the actual and expected cardinalities marked.]

[Fig. 5. Sampling Process: (a) database sampling draws with replacement from a single item distribution; (b) crowd-based sampling, in which a worker arrival process samples workers with replacement while each worker samples the item distribution without replacement.]

workers complete different amounts of work and arrive/depart from the experiment at different points in time.

The next subsection formalizes a model of how answers arrive from the crowd in response to a set enumeration query, as well as a description of how crowd behaviors impact the sample of answers received. We then use simulation to demonstrate the principles of how these behaviors play off one another and thereby influence an estimation algorithm.

B. A Model for Human Enumerations

Species estimation algorithms assume a with-replacement sample from some unknown distribution describing item likelihoods (visualized in Figure 5(a)). The order in which elements of the sample arrive is irrelevant in this context.

After analyzing the crowdsourced enumerations, for example in the previously mentioned UN experiment, we found that this assumption does not hold for crowdsourced sets. In contrast to with-replacement samples, workers provide answers from an underlying distribution without replacement. Furthermore, workers might sample from different underlying distributions (e.g., one might provide answers alphabetically, while another worker provides answers in a more random order).

This process of sampling significantly differs from what traditional estimators assume, and it can be represented as a two-layer sampling process as shown in Figure 5(b). The bottom layer consists of many sampling processes, each corresponding to one worker, that sample from some data distribution without replacement. The top layer samples with replacement from the set of the bottom-layer processes (i.e., workers). Thus, the ordered stream of answers from the crowd represents a with-replacement sampling amongst workers who are each sampling a data distribution without replacement.

C. The Impact of Humans

The impact of the two-layer sampling process on the estimation can vary significantly based on the parameterization of the process (e.g., the number of worker processes, different underlying distributions, etc.). In this section, we study the impact of different parameterizations, define when it is actually possible to make a completeness estimation, as well as the requirements for an estimator considering human behavior.

1) Sampling Without Replacement: When a worker submits multiple items for a set enumeration query, each answer is different from his previous ones. In other words, individuals are sampling without replacement from some underlying distribution that describes the likelihood of selecting each answer. Of course, this behavior is beneficial with respect to the goal of acquiring all the items in the set, as low-probability items become more likely after the high-probability items have already been provided by that worker (we do not pay for duplicated work from a single worker). A negative side effect of workers sampling without replacement is that the estimation algorithm receives less information about the relative frequency of items, and thus the skew, of the underlying data distribution; having knowledge of the skew is a requirement for a good estimate.

2) Worker Skew: On crowdsourcing platforms like AMT, it is common that some workers complete many more HITs than others. This skew in relative worker HIT completion has been labeled the "streakers vs. samplers" effect [18]. In the two-layer sampling process, worker skew dictates which worker supplies the next answer; streakers are chosen with higher frequency. High worker skew can cause the arrival rate of unique answers to be even more rapid than that caused by sampling without replacement alone, causing the estimator to over-predict. The reasoning is intuitive: if one worker gets to provide a majority of the answers, and he provides only answers he has not yet given, then a majority of the total answer set will be made up of his unique answers.

Furthermore, in an extreme scenario in which one worker provides all answers, the two-layer process reduces to one process sampling from one underlying distribution without replacement. In this case, completeness estimation becomes impossible because no inference can be made regarding the underlying distribution. Another extreme is if an infinite number of "samplers" were to provide one answer each using the same underlying distribution; the resulting sample would correspond to the original scenario of sampling with replacement (Figure 5(a)).

The latter is the reason why it is still possible to make estimations even in the presence of human-generated set enumerations. Figure 6(a) shows the impact of having more workers on the averaged Chao92 estimates with 100 simulation runs using a uniform data distribution over 200 items. As expected,
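The two-layer process is straightforward to simulate. The sketch below is our own illustration with made-up parameters (the item names, worker counts, and single-streaker weighting are assumptions for demonstration, not the paper's simulation setup): a worker is drawn with replacement, weighted so one "streaker" can dominate, and the chosen worker emits the next item of a private without-replacement ordering.

```python
import random

def simulate_two_layer(items, num_workers, num_answers, streaker_weight, seed=0):
    """Simulate crowd enumeration: a worker-arrival process samples
    workers with replacement; each worker enumerates items without
    replacement from a private shuffled order."""
    rng = random.Random(seed)
    # Each worker holds a personal without-replacement ordering of the items.
    orders = []
    for _ in range(num_workers):
        order = items[:]
        rng.shuffle(order)
        orders.append(order)
    # Worker 0 is the "streaker": chosen streaker_weight times as often.
    weights = [streaker_weight] + [1] * (num_workers - 1)
    answers = []
    for _ in range(num_answers):
        w = rng.choices(range(num_workers), weights=weights)[0]
        if orders[w]:                        # a worker stops once exhausted
            answers.append(orders[w].pop())  # next unseen item for this worker
    return answers

items = [f"item{i}" for i in range(200)]
balanced = simulate_two_layer(items, num_workers=50, num_answers=300, streaker_weight=1)
skewed = simulate_two_layer(items, num_workers=50, num_answers=300, streaker_weight=40)
print(len(set(balanced)), len(set(skewed)))
```

Comparing the distinct-answer counts of the balanced and skewed runs illustrates the mechanism described above: heavy worker skew inflates the stream of never-before-seen answers, and hence the singleton count f1 that coverage-based estimators rely on.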


It does not work well...

200 400 600 800

050

100

150

200

250

300

# answers

chao

92 e

stim

ate

actualexpected

Fig. 4. Estimated Cardinality

[Fig. 5. Sampling Process: (a) Database Sampling, a single process sampling with replacement; (b) Crowd-Based Sampling, a worker arrival process that samples workers with replacement while each worker process samples items without replacement]

workers complete different amounts of work and arrive/depart from the experiment at different points in time.

The next subsection formalizes a model of how answers arrive from the crowd in response to a set enumeration query, as well as a description of how crowd behaviors impact the sample of answers received. We then use simulation to demonstrate the principles of how these behaviors play off one another and thereby influence an estimation algorithm.

B. A Model for Human Enumerations

Species estimation algorithms assume a with-replacement sample from some unknown distribution describing item likelihoods (visualized in Figure 5(a)). The order in which elements of the sample arrive is irrelevant in this context.

After analyzing the crowdsourced enumerations, for example in the previously mentioned UN experiment, we found that this assumption does not hold for crowdsourced sets. In contrast to with-replacement samples, workers provide answers from an underlying distribution without replacement. Furthermore, workers might sample from different underlying distributions (e.g., one might provide answers alphabetically, while another worker provides answers in a more random order).

This process of sampling significantly differs from what traditional estimators assume, and it can be represented as a two-layer sampling process as shown in Figure 5(b). The bottom layer consists of many sampling processes, each corresponding to one worker, that sample from some data distribution without replacement. The top layer process samples with replacement from the set of the bottom-layer processes (i.e., workers). Thus, the ordered stream of answers from the crowd represents a with-replacement sampling amongst workers who are each sampling a data distribution without replacement.
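The two-layer process described above translates almost directly into a small simulation; the sketch below is illustrative (function and parameter names are not from the paper):

```python
import random

def two_layer_sample(num_workers, item_weights, num_answers,
                     worker_weights=None, seed=0):
    """Simulate the two-layer sampling process: the top layer picks a
    worker with replacement (optionally skewed via worker_weights);
    the bottom layer lets that worker draw an item he has not yet
    given, weighted by the underlying data distribution."""
    rng = random.Random(seed)
    # each worker keeps his own pool of not-yet-answered items
    pools = [dict(enumerate(item_weights)) for _ in range(num_workers)]
    stream = []  # ordered (worker, item) pairs, as they would arrive
    for _ in range(num_answers):
        w = rng.choices(range(num_workers), weights=worker_weights)[0]
        pool = pools[w]
        if not pool:          # this worker has enumerated the whole set
            continue
        items = list(pool)
        item = rng.choices(items, weights=[pool[i] for i in items])[0]
        del pool[item]        # without replacement for this worker
        stream.append((w, item))
    return stream
```

With uniform `item_weights` and no `worker_weights`, the stream resembles the "k workers without replacement" setting of Figure 6(a); a skewed `worker_weights` vector reproduces the streaker effect.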

C. The Impact of Humans

The impact of the two-layer sampling process on the estimation can vary significantly based on the parameterization of the process (e.g., the number of worker processes, different underlying distributions, etc.). In this section, we study the impact of different parameterizations, and define when it is actually possible to make a completeness estimation as well as the requirements for an estimator considering human behavior.

1) Sampling Without Replacement: When a worker submits multiple items for a set enumeration query, each answer is different from his previous ones. In other words, individuals are sampling without replacement from some underlying distribution that describes the likelihood of selecting each answer. Of course, this behavior is beneficial with respect to the goal of acquiring all the items in the set, as low-probability items become more likely after the high-probability items have already been provided by that worker (we do not pay for duplicated work from a single worker). A negative side effect of workers sampling without replacement is that the estimation algorithm receives less information about the relative frequency of items, and thus the skew, of the underlying data distribution; having knowledge of the skew is a requirement for a good estimate.

2) Worker skew: On crowdsourcing platforms like AMT, it is common that some workers complete many more HITs than others. This skew in relative worker HIT completion has been labeled the "streakers vs. samplers" effect [18]. In the two-layer sampling process, worker skew dictates which worker supplies the next answer; streakers are chosen with higher frequency. High worker skew can cause the arrival rate of unique answers to be even more rapid than that caused by sampling without replacement alone, causing the estimator to over-predict. The reasoning is intuitive: if one worker gets to provide a majority of the answers, and he provides only answers he has not yet given, then a majority of the total answer set will be made up of his unique answers.

Furthermore, in an extreme scenario in which one worker provides all answers, the two-layer process reduces to one process sampling from one underlying distribution without replacement. In this case, completeness estimation becomes impossible because no inference can be made regarding the underlying distribution. Another extreme is if an infinite number of "samplers" would provide one answer each using the same underlying distribution; the resulting sample would correspond to the original scenario of sampling with replacement (Figure 5(a)).

The latter is the reason why it is still possible to make estimations even in the presence of human-generated set enumerations. Figure 6(a) shows the impact of having more workers on the averaged Chao92 estimates with 100 simulation runs using a uniform data distribution over 200 items. As expected,


・In species estimation, samples are drawn from a distribution whose item probabilities are unknown. ・In human enumeration, samples (answers) are drawn according to some underlying item distribution.

Saturday, June 29, 2013

Page 11: ICDE2013 Study Group Session 19: Social Media II

Session 19: Social Media II (Presenter: Mitsuo Yamamoto, Denso IT Laboratory)

Crowdsourced Enumeration Queries
} The existence of streakers

} Some respondents provide the complete answer set on their own (streakers)

} Characteristically, they sample without replacement; as a result, the cardinality is overestimated relative to the true value.

} Verified by adding a streaker who answers all 200 items at the start

11

[Fig. 6. Cardinality estimation simulations illustrating the impact of worker behaviors (chao92 estimate vs. # answers): (a) with vs. without replacement (with replacement; 3 workers w/o replacement; 5 workers w/o replacement); (b) forms of skew (ws=T, dd=T / ws=F, dd=T / ws=T, dd=F / ws=F, dd=F); (c) impact of streaker (same four ws/dd combinations)]

the with-replacement sample overestimates slightly because of the uniform data distribution, but quickly approaches the true value of 200. The without-replacement samples overestimate even more and remain in that state for longer.
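The Chao92 estimate tracked in these simulations is not restated on this page; as a reference, a minimal sketch using the standard coverage-based definition (Good-Turing sample coverage C = 1 - f1/n plus a coefficient-of-variation correction) might look like:

```python
from collections import Counter

def chao92(answers):
    """Chao92 cardinality estimate from a stream of (possibly
    duplicated) answers."""
    n = len(answers)
    freq = Counter(answers)                       # item -> observation count
    c = len(freq)                                 # distinct items observed
    f1 = sum(1 for v in freq.values() if v == 1)  # singletons
    if n <= 1 or f1 == n:
        return float(c)                           # coverage zero or undefined
    coverage = 1.0 - f1 / n                       # Good-Turing coverage
    # squared coefficient of variation of item frequencies, clamped at 0
    gamma_sq = max(
        (c / coverage) * sum(v * (v - 1) for v in freq.values())
        / (n * (n - 1)) - 1.0,
        0.0,
    )
    return c / coverage + n * (1.0 - coverage) / coverage * gamma_sq
```

With no singletons the sample is judged complete, so `chao92(['a','a','b','b'])` returns 2.0; adding rare, once-seen items pushes the estimate above the observed count.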

D. Different and Skewed Data Distributions

Individual workers also may be drawing their answers from different data distributions. For example, the most likely item for one worker could be the least likely item for another. These differences could arise from varying cultural or regional biases, or alternate techniques for finding the data on the web. A mixture of multiple distributions over the same data results in a combined distribution that is "flatter" than its constituent parts, thereby becoming less skewed. In contrast, when the underlying data distribution is heavily skewed and shared amongst workers, the estimator will typically underestimate because there will not be a sufficient number of items representing the long tail of the distribution.

Figure 6(b) visualizes the impact of different data distributions combined with different worker skew on the Chao92 estimate by showing four combinations: the absence/presence of worker skew (WS) and the shared/different data distributions for workers (DD). For all cases, we use a power-law data distribution in which the most likely item has probability p, the second-most likely has probability p(1 − p), etc.; we set p = 0.03. To simulate different data distributions, we randomly permute the original distribution for each worker.

The simulation shows that the worst scenario is characterized by high worker skew and a single shared data distribution (WS=T and DD=F). With a shared skewed distribution, Chao92 will start out underestimating because all workers are answering with the same high-probability items. However, with high worker skew, the streaker(s) provide(s) many unique answers quickly, causing many more unique items than encountered with sampling with replacement.

On the other hand, the best scenario is when there is no worker skew but there are different data distributions (WS=F and DD=T). By using different data distributions without overemphasizing a few workers, the overall sample looks more uniform, similar to Figure 6(a) with replacement, due to the flattening effect of DD on skewed data.

E. Worker Arrival

Finally, the estimate can be impacted by the arrival and departure of workers during the experiment. All workers do not necessarily provide answers during the lifetime of a query; instead they come and go as they please. However, the estimator can be strongly impacted when streakers arrive who then suddenly dominate the total number of answers.

Figure 6(c) demonstrates the impact a single worker can have. It uses the same simulation setup as in Figure 6(b), but also simulates an additional single streaker starting at 200 HITs who continuously provides all 200 answers before anyone else has a chance to submit another answer. As the figure shows, this causes Chao92 to over-predict in all four cases. However, if workers use different data distributions the impact is not as severe. Again, this happens because DD makes the sample appear more uniformly distributed.

F. Discussion

Particular crowd behaviors are inherent in a marketplace like AMT. The order in which each worker provides his answers and how many he gives can depend on individual biases and preferences. The four elements of crowd behavior we outlined above (without-replacement sampling, worker skew, different distributions, and worker arrival) can each cause Chao92 to perform poorly. The most volatile of these behaviors is worker skew, particularly when the data distribution itself is skewed; a single overzealous worker could cause massive fluctuations in the estimate.

Our goal is to make Chao92 more fault-tolerant to the impact of such a streaker; we discuss our technique for a streaker-tolerant cardinality estimator next. In Section V, we revisit the scenario in which the estimator under-predicts due to one or more shared, highly skewed data distributions. In some cases, workers' answer sequences originate from a common list traversal. We develop a heuristic for detecting this behavior, which can be used to inform the decision to switch to an alternate data-gathering UI. Later, in Section VI, we describe a technique analyzing the cost versus benefit tradeoff of paying for more answers, helpful even when the cardinality estimate can only serve as a lower bound due to high data skew or inherent fuzziness in items' set membership.



Page 12: ICDE2013 Study Group Session 19: Social Media II


Crowdsourced Enumeration Queries
} Developed the Streaker-tolerant Estimator, an improved version of Chao92
} Set a tolerance range for streakers
} Detect and remove outliers

12

[Fig. 7. Estimator results on representative UN country and US states experiments. Each panel plots the chao92 estimate against # answers for the original estimator, the crowd estimator, and the true value, with error-metric values: (a) UN 1: Φorig = 0.14, Φnew = 0.087; (b) UN 2: Φorig = 0.11, Φnew = 0.099; (d) UN 3: Φorig = 0.065, Φnew = 0.058; (e) UN 4: Φorig = 0.18, Φnew = 0.28; (f) States 1: Φorig = 0.046, Φnew = 0.053; (g) States 2: Φorig = 0.028, Φnew = 0.024; (h) States 3: Φorig = 0.033, Φnew = 0.068]

outlier's f1 count. The f1 count of worker i is compared to the mean \bar{x}_i and the sample standard deviation \sigma_i:

\bar{x}_i = \frac{1}{W-1} \sum_{j \neq i} f_1(j) \qquad \sigma_i = \sqrt{\frac{\sum_{j \neq i} \bigl(f_1(j) - \bar{x}_i\bigr)^2}{W-2}} \tag{4}

We create \tilde{f}_1 from the original f_1 by reducing each worker i's f_1-contribution to fall within 2\sigma_i + \bar{x}_i:

\tilde{f}_1 = \sum_i \min\bigl(f_1(i),\, 2\sigma_i + \bar{x}_i\bigr) \tag{5}

The final estimator is similar to equation 3 except that it uses the \tilde{f}_1 statistic. For example, with a coefficient of variance \gamma^2 = 0, it would simplify to:

\hat{N}_{\text{crowd}} = \frac{c \cdot n}{n - \sum_i \min\bigl(f_1(i),\, 2\sigma_i + \bar{x}_i\bigr)} \tag{6}

Although a small adjustment, \hat{N}_{\text{crowd}} is more robust against the impact of streakers than the original Chao92, as we show in our evaluation next.
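Equations 4-6 translate almost line for line into code; the sketch below assumes answers are grouped per worker and implements only the γ² = 0 case of equation 6:

```python
import math
from collections import Counter

def n_crowd(worker_answers):
    """Streaker-tolerant estimate (equation 6, gamma^2 = 0 case).
    worker_answers: dict mapping worker id -> list of answers.
    Needs at least 3 workers so the leave-one-out std is defined."""
    all_answers = [a for ans in worker_answers.values() for a in ans]
    n = len(all_answers)
    freq = Counter(all_answers)
    c = len(freq)                                   # distinct answers
    singletons = {a for a, v in freq.items() if v == 1}
    # f1(i): how many of worker i's answers are singletons overall
    f1 = {w: sum(1 for a in ans if a in singletons)
          for w, ans in worker_answers.items()}
    W = len(f1)
    f1_tilde = 0.0
    for i, f1_i in f1.items():
        others = [f1[j] for j in f1 if j != i]      # leave worker i out
        mean_i = sum(others) / (W - 1)              # eq. 4, mean
        sd_i = math.sqrt(sum((x - mean_i) ** 2 for x in others) / (W - 2))
        f1_tilde += min(f1_i, 2 * sd_i + mean_i)    # eq. 5, capped f1
    if f1_tilde >= n:
        return float(c)
    return c * n / (n - f1_tilde)                   # eq. 6
```

Capping each worker's singleton contribution at two leave-one-out standard deviations above the mean is exactly what blunts a streaker's outsized f1 count.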

D. Experimental Results

We ran over 30,000 HITs on AMT for set enumeration tasks to evaluate our technique. Several CROWD tables we experimented with include small and large well-defined sets like NBA teams, US states, and UN member countries, as well as sets that can truly leverage human perception and experience, like indoor plants with low-light needs, restaurants in San Francisco serving scallops, slim-fit tuxedos, and ice cream flavors. Workers were paid $0.01-$0.05 to provide one item in the result set using the UI shown in Figure 3; they were allowed to complete multiple tasks if they wanted to submit more than one answer. In the remainder of this paper we focus on a subset of the experiments, some with known cardinality and fixed membership, US states (nine experiment runs) and UN member countries (five runs), as well as more open-ended queries like plants, restaurants, tuxedos, and ice cream flavors (one run each).

1) Error Metric: Due to the lack of a good metric to evaluate estimators with respect to stability and convergence rate, we developed an error metric \Phi that captures bias (absolute distance from the true value), as well as the estimator's time to convergence and stability. The idea is to weight the magnitude of the estimator's bias more as the size of the sample increases. Let N denote the known true value, and \hat{N}_i denote the estimate after i samples. After n samples, \Phi is defined as:

\Phi = \frac{\sum_{i=1}^{n} |\hat{N}_i - N| \cdot i}{\sum_i i} = \frac{2 \sum_{i=1}^{n} |\hat{N}_i - N| \cdot i}{n(n+1)} \tag{7}

A lower \Phi value means a smaller averaged bias and thus a better estimate. The weighting renders a harsher penalty for incorrectness later on than in the beginning, in addition to penalizing an estimator that takes longer to reach the true value; this addresses the convergence-rate criterion. The error metric also rewards estimators for staying near the true value.
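Equation 7 is a short computation over the sequence of estimates; a minimal sketch (the inputs are illustrative):

```python
def error_metric(estimates, true_value):
    """Weighted error metric of equation 7: the estimate after i
    samples is weighted by i, so late bias is penalized more than
    early bias."""
    n = len(estimates)
    weighted_bias = sum(abs(est - true_value) * i
                        for i, est in enumerate(estimates, start=1))
    return 2.0 * weighted_bias / (n * (n + 1))
```

An estimator pinned at the true value scores 0; drifting away late in the run costs more than the same drift early on.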

2) Results: UN and States: We first illustrate how \hat{N}_{\text{crowd}} behaves for a representative set of UN member countries and US states experiments; we elide the full set for space reasons. For both experiments, the UI from Figure 3 was shown by CrowdDB to ask for a UN member country, respectively a US state, on AMT for $0.01 per task. Figures 7(a-h) show cardinality estimates as well as the \Phi metric for the selected experiments. We observed that our estimate improves over Chao92 for most UN experiments we performed, as Figures 7(a) and (b) show. In UN 1, our estimate reduces the overestimation of Chao92 that occurred during the middle of the experiment. In the UN 2 experiment, one streaker dominated the total answer set at the beginning, a substantial outlier. Once his contribution was reduced dramatically, the remaining workers' answers had significant overlap because most were enumerating the list of nations alphabetically, resulting in a low cardinality because of the heavily skewed data distribution this scenario creates. Recall from the previous section that the expected behavior of the estimator in this case is to under-predict. In contrast, the third UN experiment run had several streakers at the beginning who each had very different data distributions (i.e., enumerating the list of nations from different alphabetical start points). While the heuristic helped level the f1 contribution from these workers,



Page 13: ICDE2013 Study Group Session 19: Social Media II


On Incentive-based Tagging
} Of 30 million URLs, 39% have only a single tag (Fig. 1)
} Can we just increase the number of posts? → Quality actually gets worse (Fig. 2)

13


Page 18: ICDE2013 Study Group Session 19: Social Media II


On Incentive-based Tagging
} Maximize the quality improvement under a limited budget by optimizing how incentives are allocated

} The challenge is how to allocate the budget across workers



Page 23: ICDE2013 Study Group Session 19: Social Media II


On Incentive-based Tagging
} Five allocation strategies are proposed:
} Free Choice (FC)
} Round Robin (RR)
} Fewest Post First (FP)
  } Prioritize items that have not been tagged yet
} Most Unstable First (MU)
  } Select the most uncertain item based on its rfd (Relative Frequency Distribution) value
} Hybrid (FP-MU)
} These methods are compared against DP (the theoretically optimal solution)
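The slide gives no pseudocode for these policies; the function below is a hypothetical illustration of the Fewest Post First rule only, assuming a map from each URL to its current number of tag posts:

```python
def fewest_post_first(post_counts):
    """FP policy sketch: direct the next incentivized tagging task to
    the item with the fewest tag posts so far, so untagged URLs are
    covered before already well-tagged ones."""
    return min(post_counts, key=post_counts.get)
```

This greedy rule directly targets the long tail of single-tag URLs noted earlier, which is why FP-style policies can approach the optimal allocation.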


Page 24: ICDE2013 Study Group Session 19: Social Media II


On Incentive-based Tagging
} Free Choice: 50% of posts go to already over-tagged items and are wasted
} FP & FP-MU are close to optimal
} Budget = 1,000:
  } 0.7% more posts compared with the initial number
  } 6.7% quality improvement

16
