
1

Clustering the Tagged Web

Daniel Ramage, NLP Group, AI Lab, PhD student; Paul Heymann, InfoLab, CS PhD student; Christopher D. Manning, Associate Professor of Computer Science and Linguistics; Hector Garcia-Molina, Professor of Electrical Engineering and Computer Science

Stanford

WSDM’09, February 9-12, 2009, Barcelona, Spain.

Second ACM International Conference on Web Search and Data Mining

Previously presented Social Tag Prediction at SIGIR ’08

2

Problem Statement

• Given a set of documents with both words and tags (defined in Section 2.4), partition the documents into groups (clusters) using a candidate clustering algorithm

• Create a gold standard to compare against by utilizing a web directory

• Compare the groups produced by the clustering algorithm to the gold standard groups in the web directory, using an evaluation metric

3

Clustering Algorithms

• Two notable decisions

– Assignments: hard assignments (each document belongs to one and only one cluster) vs. soft assignments (a document may belong to multiple clusters)

– Flat vs. hierarchical clustering

• Two clustering methods compared

– K-means on the Vector Space Model (simple, fast)

– LDA-derived model (potentially more complex, slower)

4

Gold Standard: Open Directory Project

• from the Open Directory Project (ODP)

• e.g., all pages under the Art node count as one class, all pages under Business as another, and so on

5

Cluster-F1 Evaluation Metric

• 8 choose 2 gives 28 pairs

• True Positives: 5 (R1R2, R1R3, R2R3, G1G2, A1A2)

• False Positives: 8 (R1G1, R1G2, R2G1, R2G2, …)

• True Negatives: 12 (R1A1, R2A1, R3A1, R1A2, …)

• False Negatives: 3 (R1R4, R2R4, R3R4)

• P = TP/(TP+FP) = 5/13

• R = TP/(TP+FN) = 5/8

• F1 = 2PR/(P+R) = 0.476
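The pairwise computation on this slide fits in a few lines. The function below and the example clusterings are my reconstruction of the slide's R/G/A example (R4 mistakenly clustered with the A documents):

```python
from itertools import combinations

def pairwise_prf(pred, gold):
    """Cluster-F1: score every pair of documents as a binary decision.
    pred, gold map each document id to its cluster / gold-class label."""
    tp = fp = fn = 0
    for a, b in combinations(sorted(pred), 2):
        same_pred = pred[a] == pred[b]
        same_gold = gold[a] == gold[b]
        if same_pred and same_gold:
            tp += 1          # clustered together, and in the same gold class
        elif same_pred:
            fp += 1          # clustered together, but gold classes differ
        elif same_gold:
            fn += 1          # same gold class, but split across clusters
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# The slide's example: 8 documents, R4 ends up with the A documents.
pred = {"R1": 0, "R2": 0, "R3": 0, "G1": 0, "G2": 0,
        "R4": 1, "A1": 1, "A2": 1}
gold = {"R1": "R", "R2": "R", "R3": "R", "R4": "R",
        "G1": "G", "G2": "G", "A1": "A", "A2": "A"}
p, r, f1 = pairwise_prf(pred, gold)  # p = 5/13, r = 5/8, f1 ≈ 0.476
```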

6

Dataset

• A subset of the Stanford Tag Crawl Dataset: one month of del.icio.us posts crawled starting 2007/5/25, covering 2,549,282 URLs; 62,406 of these appear in ODP; dropping pages under the geographic (Regional) category leaves 15,230

• 2,000 documents for DEV

• 13,230 documents for TEST

• Tokenized with the Stanford Penn Treebank tokenizer

7

K-MEANS FOR WORDS AND TAGS (1/2)

• K-means (simple and scalable), on the vector space model

– dimensionality is the size of the vocabulary, and the sum of the squares of each document vector’s elements is equal to 1 (unit L2 norm)

– clusters documents into K groups by iteratively re-assigning each document to its nearest cluster

– distance is measured between the document vector and the cluster centroid

• In this case, each initial centroid is derived from 10 randomly chosen documents in the collection
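The procedure above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation; the `init` parameter (explicit starting centroids) is my addition for reproducibility:

```python
import numpy as np

def kmeans(docs, k, iters=20, init=None, seed=0):
    """Cluster unit-length document vectors (rows of docs) into k groups.
    If init is None, each starting centroid averages 10 randomly chosen
    documents, as described on the slide."""
    rng = np.random.default_rng(seed)
    if init is None:
        m = max(1, min(10, len(docs) // k))  # fall back for tiny collections
        picks = rng.choice(len(docs), size=k * m, replace=False)
        init = docs[picks].reshape(k, m, -1).mean(axis=1)
    centroids = init / np.linalg.norm(init, axis=1, keepdims=True)
    for _ in range(iters):
        # For unit vectors, nearest-by-Euclidean equals highest cosine similarity.
        assign = (docs @ centroids.T).argmax(axis=1)
        for j in range(k):
            members = docs[assign == j]
            if len(members):
                c = members.mean(axis=0)
                centroids[j] = c / np.linalg.norm(c)  # keep centroids unit-length
    return assign

# Toy usage: six unit vectors near two axes separate cleanly.
pts = np.array([[1, .1], [1, .2], [.9, 0], [.1, 1], [0, 1], [.2, .9]], float)
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
labels = kmeans(pts, k=2, init=np.array([[1., 0.], [0., 1.]]))
```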

8

K-MEANS FOR WORDS AND TAGS (2/2)

• Words Only

• Tags Only

• Words + Tags
– words and tags each contribute half the weight

• Tags as Words Times n
– one shared vocabulary, but each tag occurrence contributes n times

• Tags as New Words
– one vocabulary in which word#pig and tag#pig are distinct terms
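These combinations can be sketched as vector builders. The helper names and the exact 50/50 weighting for Words + Tags are my reading of the slide, not code from the paper:

```python
from collections import Counter
import math

def l2_normalize(vec):
    """Scale a sparse {term: weight} vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def words_plus_tags(words, tags):
    """Words + Tags: each block normalized, then each contributes half the weight."""
    w = l2_normalize(Counter(words))
    t = l2_normalize(Counter(tags))
    out = Counter()
    for term, v in w.items():
        out[term] += 0.5 * v
    for term, v in t.items():
        out[term] += 0.5 * v
    return dict(out)

def tags_as_words_times_n(words, tags, n=2):
    """Tags as Words Times n: one vocabulary, each tag occurrence counts n times."""
    c = Counter(words)
    for tag in tags:
        c[tag] += n
    return l2_normalize(c)

def tags_as_new_words(words, tags):
    """Tags as New Words: word#pig and tag#pig become distinct dimensions."""
    c = Counter("word#" + w for w in words)
    c.update("tag#" + t for t in tags)
    return l2_normalize(c)
```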

9

Discussion 1

• How should the weights be assigned? Should more popular tags be weighted less strongly than rare tags? (tested by clustering the 2,000-document DEV set)

• tf-idf weighting performs poorly in this task because it over-emphasizes the rarest terms
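The over-emphasis is easy to see numerically. A toy illustration: the document-frequency numbers below are made up, only the collection size comes from the slides:

```python
import math

def idf(df, n_docs):
    """Standard inverse document frequency."""
    return math.log(n_docs / df)

n = 13230             # size of the TEST collection
rare = idf(1, n)      # a term seen in a single document
common = idf(500, n)  # a moderately common, topically useful term
# The singleton term gets roughly three times the weight of the common one,
# so after tf-idf scaling a handful of rare terms can dominate a document's
# direction in the vector space and pull it away from topically similar pages.
```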

10

小討論 2

• How should we combine the words and tags of a document in the vector space model? Which of the vector representations presented above is most appropriate? ( 用來分群 13230 文件看看 )

11

GENERATIVE TOPIC MODELS

• Questions to answer with LDA-derived models

– Can we do better than LDA by creating a model that explicitly accounts for tags and words as separate annotations of a document?

– Do the same weighting and normalization choices from the VSM hold for generative models like LDA-derived models, or do they differ?

– Do LDA-derived models better describe the data and hence perform better on the tagged web document clustering task than clustering algorithms based on VSM?

12

MM-LDA Generative Model

• Multi-Multinomial LDA
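In the MM-LDA generative story, each document draws topic proportions theta once, and both its words and its tags are generated from theta, but through separate per-topic multinomials. A sketch of that process (hyperparameters and vocabulary sizes are illustrative, not from the paper):

```python
import numpy as np

def mm_lda_generate(n_docs, n_words, n_tags, K, W, T,
                    alpha=0.1, beta=0.01, seed=0):
    """Sample a toy corpus from the MM-LDA generative story:
    K topics, word vocabulary of size W, tag vocabulary of size T."""
    rng = np.random.default_rng(seed)
    phi_w = rng.dirichlet([beta] * W, size=K)  # topic -> word distribution
    phi_t = rng.dirichlet([beta] * T, size=K)  # topic -> tag distribution
    corpus = []
    for _ in range(n_docs):
        theta = rng.dirichlet([alpha] * K)     # this document's topic mix
        # Words and tags share theta but use separate per-topic multinomials.
        z_w = rng.choice(K, size=n_words, p=theta)
        z_t = rng.choice(K, size=n_tags, p=theta)
        words = [rng.choice(W, p=phi_w[z]) for z in z_w]
        tags = [rng.choice(T, p=phi_t[z]) for z in z_t]
        corpus.append((words, tags))
    return corpus
```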

13

14

FURTHER STUDIES (1/2)

• Comparing the effect of adding anchor text versus adding tags to the document text

• Anchor text gathered via Google API backlink queries, fetching at most 60 backlinks per page

15

FURTHER STUDIES (2/2)

• The main experiments cluster against top-level ODP categories; can lower-level categories be clustered too?

• Experiments

– Programming-language categories: Java, C, Perl (likely still very similar), 1,094 pages
– Social-science categories: law, history, philosophy (more divergent), 1,590 pages

• The most frequent tag under the Java category is java: 488/600 = 73.9%; the most frequent tag under the top-level Computers category is software: 2,562/11,894 = 21.5%

16

My Case Study (Topic Models)

Training speed
• (The LDA team's implementation) optimizes over all instances, so training takes a long time
• Gibbs sampling is faster; tools include
– GibbsLDA++ by Xuan-Hieu Phan
– Matlab Topic Modeling Toolbox 1.3.2

Supervised
• In K-means, the K clusters are fixed at initialization
• With LDA, the K topics grow out on their own

Current developments
• supervised LDA (LDA team), 2007
• Dirichlet-multinomial Regression (CRF team), 2008
• Markov Topic Models (LDA team), 2009
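For reference, the collapsed Gibbs sampler that tools like GibbsLDA++ implement for plain LDA fits on a page. A bare-bones sketch (variable names are mine; MM-LDA would add a parallel set of tag count tables):

```python
import random

def lda_gibbs(docs, K, V, iters=50, alpha=0.1, beta=0.01, seed=0):
    """docs: list of token-id lists (ids in 0..V-1).
    Returns the sampled topic assignment for every token."""
    rnd = random.Random(seed)
    # Count tables: doc-topic, topic-word, topic totals.
    ndk = [[0] * K for _ in docs]
    nkw = [[0] * V for _ in range(K)]
    nk = [0] * K
    z = []
    for d, doc in enumerate(docs):          # random initialization
        zd = []
        for w in doc:
            k = rnd.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                  # remove token's current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Collapsed conditional: p(k) ∝ (ndk+alpha)*(nkw+beta)/(nk+V*beta)
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta)
                           / (nk[j] + V * beta) for j in range(K)]
                k = rnd.choices(range(K), weights=weights)[0]
                z[d][i] = k                  # resample and restore counts
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z
```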

17

supervised LDA

18

Dirichlet-multinomial Regression

Dirichlet-multinomial regression (DMR) topic models are able to

• incorporate arbitrary types of observed continuous, discrete and categorical features

19

Markov Topic Models

20

Discussion

• Tagging is growing even faster than linking did, so tag-based computation is becoming more important

• Future work: extend tag-extended clustering techniques to search

• Three kinds of indexing vocabulary
– Controlled Index Language
– Tags (offering both semantic precision and broad coverage)
– Full Text Indexing

21

Conclusion

• Social tagging data helps web clustering
– better than using text alone

• Two automatic clustering algorithms
– K-means
– MM-LDA

22

A New Visual Search Interface for Web Browsing

Songhua Xu, Zhejiang University

Tao Jin, Yale University

Francis C.M. Lau, The University of Hong Kong

WSDM’09, February 9-12, 2009, Barcelona, Spain.

Second ACM International Conference on Web Search and Data Mining

23

GCEEL

• A user-friendly and informative graphical front-end for organizing and presenting search results in the form of topic groups

– first retrieves relevant online materials via a third-party search engine

– analyzes the semantics of the search results to detect latent topics in the result set

– maps the search result pages into topic clusters

http://www.gceel.com/

24