Tag-based Social Interest Discovery
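By yjhuang, 2008.5
Yahoo! Inc. researchers: Xin Li, Lei Guo, Yihong (Eric) Zhao
The copyright of these slides belongs to the original authors; they are used here for explanation only. The source is listed at the end.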
Outline
Introduction
Data Set
Analysis of Tags
The Architecture
Evaluation
Introduction
Social network systems: Del.icio.us, Facebook, MySpace, YouTube
Discovering social interests
  Main challenge: difficult to detect and represent
  Existing approaches: online connections
This paper’s work
  Based on user-generated tags
  Analyze the real-world traces of tags and web content
  Develop the Internet Social Interest Discovery system (ISID)
    Discover the common user interests
    Cluster users and URLs by topics
  Evaluation
Data Set
Delicious bookmarks: 4.3M bookmarks, 0.2M users, 1.4M URLs
Data Collection and Pre-Processing
  Crawl the URLs and download the pages
  Discard all non-HTML objects
  Convert the encoding to UTF-8 and remove non-English pages
  Apply a stopword list
  Apply the Porter stemming algorithm
  Result: 298,350 distinct tags, 4,072,265 keywords
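The pre-processing steps can be illustrated with a minimal sketch. The stopword list and the `strip_suffix` helper are assumptions for illustration only; in particular, `strip_suffix` is a crude stand-in and not the Porter stemming algorithm named on the slide.

```python
import re

# Tiny illustrative stopword list (assumption; the slides do not give the list used).
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def strip_suffix(word):
    """Crude stand-in for a stemmer (NOT the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_keywords(raw_bytes):
    """Decode a page as UTF-8, tokenize, drop stopwords, reduce word forms."""
    text = raw_bytes.decode("utf-8", errors="ignore").lower()
    tokens = re.findall(r"[a-z]+", text)
    return [strip_suffix(t) for t in tokens if t not in STOPWORDS]

page = "Configuring the resolvers in Linux".encode("utf-8")
keywords = extract_keywords(page)
print(keywords)
```

A real pipeline would replace `strip_suffix` with an actual Porter stemmer and add the HTML filtering and language detection steps, which are omitted here.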
Users, URLs and Tags
Figure 1: Distribution of the frequencies with which the URLs were bookmarked in our data set (log-log scale)
Users, URLs and Tags
Figure 2: Distribution of the bookmarking activity (log-log scale)
Users, URLs and Tags
Figure 3: Distribution of tag frequencies
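As a rough illustration of how such a distribution could be computed, the sketch below counts how many URLs were bookmarked a given number of times. The `(user, url, tag)` record layout and the sample records are hypothetical, not taken from the paper's data set.

```python
from collections import Counter

# Hypothetical (user, url, tag) bookmark records; not taken from the paper's data.
bookmarks = [
    ("u1", "http://a.example", "linux"),
    ("u2", "http://a.example", "dns"),
    ("u3", "http://a.example", "linux"),
    ("u1", "http://b.example", "python"),
    ("u2", "http://c.example", "python"),
]

def url_frequency_distribution(records):
    """Map 'times bookmarked' -> number of URLs bookmarked that many times."""
    users_per_url = {}
    for user, url, _tag in records:
        users_per_url.setdefault(url, set()).add(user)
    return Counter(len(users) for users in users_per_url.values())

dist = url_frequency_distribution(bookmarks)
for times, n_urls in sorted(dist.items()):
    print(times, n_urls)
```

Plotting these pairs on log-log axes would give a curve of the kind the figures describe; the same counting applies to per-user activity and to tag frequencies.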
Analysis of Tags
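Use the vector space model (VSM)
Each URL is represented by two vectors: one in the space of all tags, one in the space of document keywords
A corpus with t terms and d documents is represented by a term-document matrix $A = (a_{ij}) \in \mathbb{R}^{t \times d}$, where $a_{ij}$ is the weight of term i in document j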
Weight Measurements
Tf-based
Tf-Idf based
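The formulas from this slide did not survive in the transcript. As a sketch under the textbook definitions (an assumption, not necessarily the paper's exact weighting), the tf weight is the raw count of term i in document j, and the tf-idf weight multiplies that count by log(d / n_i), where n_i is the number of documents containing term i.

```python
import math
from collections import Counter

def tf_weights(doc_terms):
    """tf: raw term counts for one document."""
    return Counter(doc_terms)

def tf_idf_weights(docs):
    """tf-idf: count * log(d / n_i), with n_i = documents containing term i."""
    d = len(docs)
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))  # count each term once per document
    weighted = []
    for terms in docs:
        tf = Counter(terms)
        weighted.append({t: f * math.log(d / doc_freq[t]) for t, f in tf.items()})
    return weighted

docs = [
    ["linux", "dns", "dns"],
    ["linux", "python"],
]
weights = tf_idf_weights(docs)
print(tf_weights(docs[0]))
print(weights)
```

With this weighting, a term that occurs in every document ("linux" here) gets weight zero, which is why tf-idf favors words that distinguish one document from the rest.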
An Example of Tags vs. Keywords
A URL bookmarked by users, about resolv.conf in Linux
The table shows the top 10 keywords
The Vocabulary of Tags
Compare the vocabulary of tags with that of keywords in web documents
Check whether the most important words are covered
Figure 4 (5): The coverage of user-generated tags for the tf (tf-idf) keywords of 7000 random docs.
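One way to read this coverage measure is the fraction of a document's top-k keywords that also occur in the tag vocabulary. The sketch below follows that reading; the function names, the sample weights, and the choice of comparing against a single tag set are assumptions, since the slides do not spell out the procedure.

```python
def top_k_keywords(weights, k):
    """Return the k highest-weighted keywords (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _w in ranked[:k]]

def tag_coverage(weights, tag_vocabulary, k):
    """Fraction of the top-k keywords that also appear in the tag vocabulary."""
    top = top_k_keywords(weights, k)
    if not top:
        return 0.0
    covered = sum(1 for term in top if term in tag_vocabulary)
    return covered / len(top)

# Hypothetical keyword weights (tf or tf-idf) for one document.
keyword_weights = {"resolv": 9, "linux": 7, "nameserver": 5, "file": 2}
tags = {"linux", "dns", "resolv"}
coverage = tag_coverage(keyword_weights, tags, k=3)
print(coverage)
```

Averaging this value over a sample of documents for increasing k would give a curve comparable in spirit to the figures.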
The Convergence of Tag Selections
Measure the convergence of tags for all URLs
X-axis: the popularity of URLs
Y-axis: the number of distinct tags
Tags Matched by Documents
Do tags capture the main concepts of documents? Are they matched by the content of the URL?
Statistical analysis
  Number of occurrences -> weight
  Tag match ratio e(T, U)
    T = {t_i}: the set of tags attached to a given URL U
    Based on the total weight of the tags that also appear in the keyword set of U
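A minimal sketch of the tag match ratio, assuming the "ratio" is the matched weight divided by the total weight of all tags in T. The denominator is not stated in the transcript, so treat it as an assumption; the sample counts are hypothetical.

```python
def tag_match_ratio(tag_counts, keywords):
    """Weight of tags found among the URL's keywords, over the weight of all tags.

    tag_counts: tag -> number of occurrences on this URL (used as the weight).
    keywords:   set of keywords extracted from the URL's content.
    """
    total = sum(tag_counts.values())
    if total == 0:
        return 0.0
    matched = sum(w for tag, w in tag_counts.items() if tag in keywords)
    return matched / total

# Hypothetical tag occurrence counts and keyword set for one URL.
tag_counts = {"linux": 6, "dns": 3, "howto": 1}
keywords = {"linux", "dns", "resolv", "nameserver"}
ratio = tag_match_ratio(tag_counts, keywords)
print(ratio)
```

Here "howto" does not occur in the page content, so 9 of the 10 units of tag weight are matched.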
Architecture for Social Interest Discovery
1. Find topics of interest
2. Clustering
3. Indexing
Topic Discovery
Find frequent tag patterns for a given set
Association rule algorithms
  Support
  Implication rules
Identify the frequent tag patterns
  For a frequent tag pattern {a, b}: if w({a,b}) = w({a}) = w({b}), then (with w as the support count) a and b always occur together
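The idea of support-based pattern discovery can be sketched as follows. This is a brute-force enumeration for illustration, not the association rule algorithm used in the paper; `min_support`, `max_size`, and the sample posts are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_tag_patterns(posts, min_support, max_size=2):
    """Count how many posts contain each tag pattern; keep the frequent ones."""
    support = Counter()
    for tags in posts:
        unique = sorted(set(tags))
        for size in range(1, max_size + 1):
            for pattern in combinations(unique, size):
                support[pattern] += 1
    return {p: s for p, s in support.items() if s >= min_support}

# Hypothetical posts, each reduced to its set of tags.
posts = [
    {"linux", "dns"},
    {"linux", "dns"},
    {"linux", "dns", "howto"},
    {"python"},
]
patterns = frequent_tag_patterns(posts, min_support=2)

# Equal supports mean the two tags never occur without each other.
a, b = ("dns",), ("linux",)
always_together = patterns[("dns", "linux")] == patterns[a] == patterns[b]
print(patterns, always_together)
```

Enumerating every combination is only workable for small tag sets; a production system would prune candidates by support, as Apriori-style algorithms do.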
Clustering
Indexing
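The slides give no detail for these two steps. One plausible reading, sketched below as an assumption, is that each topic (a frequent tag pattern) collects the users and URLs of every post whose tags contain the whole pattern, and that the result is stored in a topic-keyed index. The `(user, url, tags)` layout and all names are hypothetical.

```python
def cluster_by_topic(posts, topics):
    """Build topic -> {'users': ..., 'urls': ...} from (user, url, tags) posts."""
    index = {}
    for topic in topics:
        wanted = set(topic)
        users, urls = set(), set()
        for user, url, tags in posts:
            if wanted <= set(tags):  # post carries every tag of the topic
                users.add(user)
                urls.add(url)
        index[topic] = {"users": users, "urls": urls}
    return index

# Hypothetical posts and topics.
posts = [
    ("u1", "http://a.example", {"linux", "dns"}),
    ("u2", "http://a.example", {"linux", "dns", "howto"}),
    ("u3", "http://b.example", {"python"}),
]
index = cluster_by_topic(posts, [("dns", "linux"), ("python",)])
print(index[("dns", "linux")])
```

Looking up a topic in `index` then returns both the users who share that interest and the URLs associated with it.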
Evaluation
URL Similarity of Intra- and Inter-Topics
  Cosine similarity of tf-idf keyword term vectors
  Cosine similarity of tag term vectors
500 interest topics
  > 30 bookmarked URLs
  Share 5-6 co-occurring tags
Inter-: 10,000 topic pairs
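Cosine similarity between two sparse term vectors can be computed as in the sketch below. The vectors are hypothetical; in the evaluation they would be the tf-idf keyword vectors or the tag vectors of two URLs.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as term -> weight."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term vectors for three URLs.
url_a = {"linux": 3, "dns": 4}
url_b = {"linux": 3, "dns": 4}
url_c = {"python": 5}

print(cosine_similarity(url_a, url_b))
print(cosine_similarity(url_a, url_c))
```

Averaging this value over URL pairs inside one topic gives an intra-topic similarity, and averaging over pairs drawn from two different topics gives an inter-topic similarity.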
User Interest Coverage
For each user
  Sort the user's tags by the number of times the user has used them
  Top-5: the top 5 hot tags of each user
  Top-10: the top 10 hot tags of each user
  All: all tags of each user
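The ranking step and a possible coverage measure can be sketched as follows. Measuring coverage as the fraction of a user's top-k tags that appear in the discovered topics is an assumption; the slides only describe how the tags are ranked. All names and sample data are hypothetical.

```python
from collections import Counter

def top_tags(tag_usage, k=None):
    """Tags sorted by usage count, most used first; k=None keeps all tags."""
    ranked = sorted(tag_usage.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _n in ranked[:k]]

def interest_coverage(tag_usage, topic_tags, k=None):
    """Fraction of the user's top-k tags that occur in the discovered topics."""
    top = top_tags(tag_usage, k)
    if not top:
        return 0.0
    return sum(1 for t in top if t in topic_tags) / len(top)

# Hypothetical tag usage of one user and the tags covered by discovered topics.
usage = Counter(["linux", "linux", "linux", "dns", "dns", "music"])
topic_tags = {"linux", "dns"}

print(interest_coverage(usage, topic_tags, k=2))
print(interest_coverage(usage, topic_tags))
```

Calling the function with k=5, k=10, and k=None corresponds to the Top-5, Top-10, and All settings.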
Human Reviews
4 human editors
10 topics
20 most frequent URLs for each topic
Scores: 1-5
Cluster Properties(Add)
This page is not from the original authors' slides; for the original version, please refer to the source.
Conclusion(Add)
Propose a tag-based social interest discovery approach
Justify the use of user-generated tags to represent user interests
Implement a system on a social network such as Delicious
This page is not from the original authors' slides; for the original version, please refer to the source.
References
Xin Li, Lei Guo, Yihong Zhao. Tag-based Social Interest Discovery. WWW 2008, Yahoo! Inc.
Note: slides downloaded from http://fusion.grids.cn/wiki/download/attachments/1313/Tag-based+Socail+Interest+Discovery-by+yjhuang.ppt?version=1
Data set web page: http://delicious.com/