1
Clustering the Tagged Web
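Daniel Ramage, NLP Group, AI Lab, PhD student
Paul Heymann, InfoLab, PhD student in Computer Science
Christopher D. Manning, Associate Professor of Computer Science and Linguistics
Hector Garcia-Molina, Professor of Electrical Engineering and Computer Science
Stanford
WSDM’09, February 9-12, 2009, Barcelona, Spain.
Second ACM International Conference on Web Search and Data Mining
Previously presented Social Tag Prediction at SIGIR ’08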
2
Problem Statement
• Given a set of documents with both words and tags (defined in Section 2.4), partition the documents into groups (clusters) using a candidate clustering algorithm
• Create a gold standard to compare against by utilizing a web directory
• Compare the groups produced by the clustering algorithm to the gold standard groups in the web directory, using an evaluation metric
3
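Clustering Algorithms
• Two notable decisions
– Assignments: hard assignments (each document belongs to exactly one cluster) or soft assignments (a document may belong to several clusters)
– Flat or hierarchical clustering
• The two clustering methods compared
– K-means on the Vector Space Model (simple, fast)
– LDA-derived model (possibly more complex, slower)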
4
Gold Standard: Open Directory Project
• Gold-standard categories come from the Open Directory Project (ODP)
• For example, all documents under the Art node count as one class, likewise Business, …
5
Cluster-F1 Evaluation Metric
• Choosing 2 of 8 documents gives 28 pairs
• True Positive: 5 (R1R2, R1R3, R2R3, G1G2, A1A2)
• False Positive: 8 (R1G1, R1G2, R2G1, R2G2, …)
• True Negative: 12 (R1A1, R2A1, R3A1, R1A2, …)
• False Negative: 3 (R1R4, R2R4, R3R4)
• P = TP/(TP+FP) = 5/13
• R = TP/(TP+FN) = 5/8
• F1 = 2PR/(P+R) = 0.476
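The pair counting above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code. The function names are hypothetical, and the cluster assignment in `pred` is an assumption: it is one assignment (R1, R2, R3, G1, G2 together; R4, A1, A2 together) that reproduces the counts listed on the slide.

```python
from itertools import combinations

def pairwise_counts(gold, pred):
    """Count document pairs by whether they share a gold class and a cluster."""
    tp = fp = tn = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_pred = pred[i] == pred[j]
        if same_pred and same_gold:
            tp += 1          # same cluster, same gold class
        elif same_pred:
            fp += 1          # same cluster, different gold class
        elif same_gold:
            fn += 1          # different cluster, same gold class
        else:
            tn += 1          # different cluster, different gold class
    return tp, fp, tn, fn

def cluster_f1(gold, pred):
    """Harmonic mean of pairwise precision and recall."""
    tp, fp, tn, fn = pairwise_counts(gold, pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Documents in the order R1, R2, R3, R4, G1, G2, A1, A2.
gold = ["R", "R", "R", "R", "G", "G", "A", "A"]
# Assumed clustering consistent with the slide's counts.
pred = [1, 1, 1, 2, 1, 1, 2, 2]

print(pairwise_counts(gold, pred))       # (5, 8, 12, 3)
print(round(cluster_f1(gold, pred), 3))  # 0.476
```

Counting pairs keeps the metric independent of how clusters are labeled, so no matching between cluster ids and ODP categories is needed.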
6
Dataset
• A subset of the Stanford Tag Crawl Dataset
– Crawled from delicious for one month starting 2007/5/25: 2,549,282 URLs, of which 62,406 are in ODP
– Discarding those in the geographic (Regional) category leaves 15,230
• 2000 DEV
• 13230 TEST
• Stanford Penn Treebank tokenizer
7
K-MEANS FOR WORDS AND TAGS (1/2)
• K-Means (Simple and Scalable)
– vector space model
• each document is a vector whose dimensionality is the size of the vocabulary, normalized so that the sum of the squares of its elements equals 1
– clusters documents into K groups by iteratively re-assigning each document to its nearest cluster
– The distance is defined as the distance of the document to the cluster centroid
• In this case: each cluster's initial centroid is obtained from 10 randomly chosen documents in the collection
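The procedure on this slide can be sketched as follows. This is a minimal pure-Python sketch under stated assumptions, not the implementation used in the paper: the distance is assumed to be squared Euclidean, the initial centroid is assumed to be the mean of the randomly chosen documents, and all function and parameter names are hypothetical.

```python
import math
import random

def l2_normalize(vec):
    """Scale a vector so the sum of the squares of its elements is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(docs, k, seeds_per_cluster=10, iterations=20, rng=None):
    """Return one cluster id (0..k-1) per document."""
    rng = rng or random.Random(0)
    docs = [l2_normalize(d) for d in docs]
    # Assumption: each initial centroid is the mean of randomly chosen documents.
    seeds = min(seeds_per_cluster, len(docs))
    centroids = [mean_vector(rng.sample(docs, seeds)) for _ in range(k)]
    assignments = [None] * len(docs)
    for _ in range(iterations):
        # Re-assign every document to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(d, centroids[c]))
            for d in docs
        ]
        if new_assignments == assignments:
            break  # converged: no document changed cluster
        assignments = new_assignments
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [d for d, a in zip(docs, assignments) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignments

# Toy term-count vectors over a 3-term vocabulary (hypothetical data).
toy_docs = [[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 0]]
print(kmeans(toy_docs, k=2, seeds_per_cluster=1))
```

The toy call uses `seeds_per_cluster=1` because averaging 10 seeds over a 4-document collection would give every cluster the same starting centroid; on a collection of thousands of documents the default of 10 matches the slide.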
8
K-MEANS FOR WORDS AND TAGS (2/2)
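• Words Only
• Tags Only
• Words + Tags
– each contributes half of the weight
• Tags as Words Times n
– treated as a single vocabulary, but each occurrence of a tag counts n times
• Tags as New Words
– a single vocabulary in which word#pig and tag#pig are different terms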
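The five representations above might be built as in the sketch below. It is illustrative only: the function names are hypothetical, vectors are stored as sparse dictionaries, and for Words + Tags the "half of the weight" is assumed to mean that each side is normalized to unit length and then scaled by sqrt(1/2).

```python
import math
from collections import Counter

def unit(counts):
    """L2-normalize a sparse count vector stored as a dict."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {term: v / norm for term, v in counts.items()} if norm else {}

def words_only(words, tags):
    return unit(Counter(words))

def tags_only(words, tags):
    return unit(Counter(tags))

def words_plus_tags(words, tags):
    # Assumption: each side is unit-normalized, then scaled by sqrt(1/2),
    # so words and tags each carry half of the squared length.
    half = math.sqrt(0.5)
    vec = {("word", w): half * x for w, x in unit(Counter(words)).items()}
    vec.update({("tag", t): half * x for t, x in unit(Counter(tags)).items()})
    return vec

def tags_as_words_times_n(words, tags, n=2):
    # One shared vocabulary; each tag occurrence counts n times.
    counts = Counter(words)
    for t in tags:
        counts[t] += n
    return unit(counts)

def tags_as_new_words(words, tags):
    # One vocabulary, but word#pig and tag#pig are distinct terms.
    counts = Counter("word#" + w for w in words)
    counts.update("tag#" + t for t in tags)
    return unit(counts)

# Hypothetical document: page text plus the tags users gave it.
words = ["pig", "farm", "pig"]
tags = ["pig", "animal"]
print(sorted(tags_as_new_words(words, tags)))
```

Keeping every variant unit-normalized means the same K-means distance can be applied to all five without further changes.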
9
Short Discussion 1
• How should the weights be assigned? Should more popular tags be weighted less strongly than rare tags? (tested by clustering the 2,000 DEV documents)
• tf-idf weighting performs poorly in this task because it over-emphasizes the rarest terms
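A small calculation illustrates why tf-idf can over-emphasize rare terms. The idf formula below is the common log(N/df) form, which is an assumption here since the slide does not state which variant was used, and the two document frequencies are hypothetical.

```python
import math

def idf(num_docs, doc_freq):
    """Inverse document frequency in the common log(N / df) form."""
    return math.log(num_docs / doc_freq)

num_docs = 2000                # size of the DEV set on the Dataset slide
rare = idf(num_docs, 2)        # hypothetical tag used on only 2 documents
common = idf(num_docs, 500)    # hypothetical tag used on 500 documents
ratio = rare / common

print(rare, common, ratio)
```

With these numbers a single occurrence of the rare tag weighs roughly five times as much as one of the common tag, even though a tag shared by only two documents says little about cluster membership.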
10
Short Discussion 2
• How should we combine the words and tags of a document in the vector space model? Which of the vector representations presented above is most appropriate? (tested by clustering the 13,230 TEST documents)
11
GENERATIVE TOPIC MODELS
• Questions to answer with LDA-derived models
– Can we do better than LDA by creating a model that explicitly accounts for tags and words as separate annotations of a document?
– Do the same weighting and normalization choices from the VSM hold for generative models like LDA-derived models, or do they differ?
– Do LDA-derived models better describe the data and hence perform better on the tagged web document clustering task than clustering algorithms based on VSM?
15
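FURTHER STUDIES (2/2)
• The main experiments cluster at the top level of ODP; can deeper levels be clustered as well?
• Experiments
– Programming languages: Java, C, Perl (probably still quite similar), 1,094 documents
– Social sciences: law, history, philosophy (more divergent), 1,590 documents
• The most frequent tag on documents under the Java category is java: 488/600=73.9%
• The most frequent tag on documents under the top-level Computer category is software: 2562/11894=21.5%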
16
My Case Study (Topic Models)
Training speed
• The LDA team's approach optimizes over all instances, so training takes a long time
• Gibbs sampling is faster; tools include
– GibbsLDA++ by Xuan-Hieu Phan
– Matlab Topic Modeling Toolbox 1.3.2
Supervised
• K-means: the K clusters are given in advance at initialization
• With LDA, the K groups emerge from the model itself
Current developments
• Supervised LDA (LDA team), 2007
• Dirichlet-multinomial Regression (CRF team), 2008
• Markov Topic Models (LDA team), 2009
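For readers unfamiliar with the Gibbs sampling approach mentioned under training speed, here is a minimal sketch of a standard collapsed Gibbs sampler for LDA. It is not the code of GibbsLDA++ or the Matlab toolbox; the function name, the hyperparameter values `alpha` and `beta`, and the toy corpus are all assumptions made for illustration.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              sweeps=50, seed=0):
    """docs: list of documents, each a list of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current topic from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

# Toy corpus over a 4-word vocabulary (hypothetical data).
toy_corpus = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 0]]
doc_topic, topic_word = gibbs_lda(toy_corpus, num_topics=2, vocab_size=4)
# One way to turn topics into hard clusters: the most frequent topic per document.
clusters = [max(range(2), key=row.__getitem__) for row in doc_topic]
print(clusters)
```

Each sweep only updates integer counts, which is part of why sampling-based tools are often quick to run on collections of this size.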
18
Dirichlet-multinomial Regression
• Dirichlet-multinomial regression (DMR) topic models are able to incorporate arbitrary types of observed continuous, discrete and categorical features
20
Discussion
• Tags appear faster than links do, so computation based on them becomes more important
• Future work: extend tag-extended clustering techniques to search
• Three kinds of indexing vocabulary
– Controlled Index Language
– Tag (semantically precise and comprehensive)
– Full Text Indexing
22
A New Visual Search Interface for Web Browsing
Songhua Xu, Zhejiang University
Tao Jin, Yale University
Francis C.M. Lau, The University of Hong Kong
WSDM’09, February 9-12, 2009, Barcelona, Spain.
Second ACM International Conference on Web Search and Data Mining
23
GCEEL
• user-friendly and informative graphical front-end for organizing and presenting search results in the form of topic groups
– first retrieves relevant online materials via a third-party search engine
– analyzes the semantics of search results to detect latent topics in the result set
– maps the search result pages into topic clusters
http://www.gceel.com/