Improving Web Page Classification by Label-propagation over Click Graphs
Soo-Min Kim, Patrick Pantel, Lei Duan and Scott Gaffney
Yahoo! Labs
CIKM 2009
Presenter: Lung-Hao Lee (李龍豪)
January 7, 2010 @ Room 309
Outline
Introduction
Calculating Page Similarity
Finding Similar Pages
◦ Click Data Model (CDM)
◦ Query Constraint (QC) algorithm
Experimental Results
Discussion
Conclusion
2
Large labor cost of annotating the data
The aggregated click data across many users over time provides valuable information
Leveraging click logs to augment training data by propagating class labels to similar unlabeled documents
3
“Two pages that tend to be clicked by the same user queries tend to be topically similar”
4
[Figure: pages A and B are both clicked for the queries “How to tie a tie”, “How to tie a neck tie knots” and “Tying a tie”. One page is labeled “Positive” (class “How-to”); the other has an unknown label. Should it also be “Positive”?]
A page is represented as a node in the similarity graph
Normalize all the URLs, e.g. the following 4 URLs are treated as the same:
(1) “http://www.acm.org”
(2) “www.acm.org”
(3) “www.acm.org/”
(4) “http://www.acm.org/”
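The normalization step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `normalize_url` is hypothetical, and treating `https://` the same as `http://` and lowercasing the host are assumptions that go beyond the four examples on the slide.

```python
def normalize_url(url: str) -> str:
    """Map trivially different spellings of a URL to one canonical key."""
    url = url.strip()
    # Drop the scheme, if present (handling "https://" is an assumption).
    for scheme in ("http://", "https://"):
        if url.lower().startswith(scheme):
            url = url[len(scheme):]
            break
    # Drop trailing slashes so "www.acm.org/" equals "www.acm.org".
    url = url.rstrip("/")
    # Host names are case-insensitive, so lowercase only the host part.
    host, sep, path = url.partition("/")
    return host.lower() + sep + path


variants = [
    "http://www.acm.org",
    "www.acm.org",
    "www.acm.org/",
    "http://www.acm.org/",
]
canonical = {normalize_url(u) for u in variants}
```

With this sketch, all four variants from the slide collapse to the single key `www.acm.org`, so they would become one node in the graph.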
5
Each URL is represented as a vector of queries that users issued and clicked through to the page
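A rough sketch of this representation, assuming the click log is available as `(query, clicked_url)` pairs. The function name `build_query_vectors` and the use of raw click counts as feature weights are assumptions for illustration; the slides do not spell out the exact weighting.

```python
from collections import Counter, defaultdict


def build_query_vectors(click_log):
    """Return {url: Counter({query: click_count})} from (query, url) pairs."""
    vectors = defaultdict(Counter)
    for query, url in click_log:
        # Each click adds one to the (url, query) feature.
        vectors[url][query] += 1
    return vectors


# Hypothetical toy log, for illustration only.
log = [
    ("how to tie a tie", "a.com"),
    ("tying a tie", "a.com"),
    ("how to tie a tie", "b.com"),
    ("how to tie a tie", "a.com"),
]
vectors = build_query_vectors(log)
```

Each URL ends up with a sparse vector keyed by query string, which is the form a similarity computation between pages would consume.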
6
Pantel & Lin (2002)
Compute the similarity between two pages using the cosine similarity of their respective feature vectors
sim(p1,p2) > sim(p1,p3)
sim(p1,p2) > sim(p2,p3)
Because p1 and p2 share more queries with each other than either shares with p3
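Cosine similarity over sparse query vectors can be sketched like this. The three example pages and their click counts are made up for illustration; only the cosine formula itself is standard.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v2.get(q, 0.0) for q, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm1 * norm2)


# Hypothetical query vectors: p1 and p2 share both queries, p3 shares one.
p1 = {"how to tie a tie": 3, "tying a tie": 2}
p2 = {"how to tie a tie": 2, "tying a tie": 1}
p3 = {"digital camera reviews": 4, "tying a tie": 1}

sim_12 = cosine_similarity(p1, p2)
sim_13 = cosine_similarity(p1, p3)
sim_23 = cosine_similarity(p2, p3)
```

In this toy setup `sim_12` is the largest of the three values, matching the ordering stated on the slide.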
7
What’s a “seed set”?
A set of some labeled data
Two algorithms for seed set expansion
◦ Click Data Model (CDM)
◦ Query Constraints (QC) algorithm
8
Two phases
◦ Updating score phase
◦ Filtering phase
Input
◦ S1 (positive set)
◦ S2 (negative set)
◦ G (click graph)
Output
◦ E1 (positive)
◦ E2 (negative)
Thresholds
◦ 0.1<T1<0.6
◦ 0.6<T2<1.2
9
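One way the two phases could fit together is sketched below. The slides list only the inputs, outputs and threshold ranges, so the details here are assumptions made for illustration: T1 is taken as the minimum edge similarity for a seed to influence a neighbor, scores are sums of edge similarities, and T2 is taken as the minimum margin between the positive and negative scores. The name `expand_seeds` and the default values (picked from inside the stated ranges) are hypothetical.

```python
from collections import defaultdict


def expand_seeds(graph, positive_seeds, negative_seeds, t1=0.3, t2=0.8):
    """graph: {node: {neighbor: similarity}}. Returns (E1, E2) as sets."""
    pos_score = defaultdict(float)
    neg_score = defaultdict(float)
    seeds = set(positive_seeds) | set(negative_seeds)

    # Phase 1 (updating scores): every seed pushes its label to its
    # unlabeled neighbors, weighted by edge similarity. Edges weaker
    # than t1 are ignored (assumed role of T1).
    for seed_set, score in ((positive_seeds, pos_score),
                            (negative_seeds, neg_score)):
        for seed in seed_set:
            for neighbor, sim in graph.get(seed, {}).items():
                if neighbor not in seeds and sim >= t1:
                    score[neighbor] += sim

    # Phase 2 (filtering): keep a node only when one class clearly
    # dominates the other by at least t2 (assumed role of T2).
    expanded_pos, expanded_neg = set(), set()
    for node in set(pos_score) | set(neg_score):
        margin = pos_score[node] - neg_score[node]
        if margin >= t2:
            expanded_pos.add(node)
        elif margin <= -t2:
            expanded_neg.add(node)
    return expanded_pos, expanded_neg


# Hypothetical toy graph: "a" and "b" are positive seeds, "n" is negative.
toy_graph = {
    "a": {"x": 0.9, "y": 0.2},
    "b": {"x": 0.5},
    "n": {"z": 0.95, "x": 0.1},
}
e1, e2 = expand_seeds(toy_graph, {"a", "b"}, {"n"})
```

Here `x` collects enough positive evidence to enter E1, `z` enters E2, and `y` is dropped because its only edge falls below T1.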
An additional module that checks whether the queries common to two nodes contain certain term patterns
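The check might look like the following sketch. The function name `passes_query_constraint`, the substring match, and the rule that a single matching common query is enough are all assumptions; the slides only say that common queries are tested for term patterns.

```python
def passes_query_constraint(queries_a, queries_b, pattern_terms):
    """True if any query shared by both pages contains a pattern term."""
    common = set(queries_a) & set(queries_b)
    return any(
        term in query.lower()
        for query in common
        for term in pattern_terms
    )


# Hypothetical query sets for a labeled seed page and two candidates.
seed_queries = {"digital camera reviews", "canon"}
candidate_1 = {"digital camera reviews", "cheap cameras"}
candidate_2 = {"canon", "camera store"}

keep_1 = passes_query_constraint(seed_queries, candidate_1, ["review"])
keep_2 = passes_query_constraint(seed_queries, candidate_2, ["review"])
```

Under these assumptions the first candidate is kept for the “review” class, while the second is excluded because its only shared query lacks the term.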
10
Reduce the amount of human annotation effort by leveraging the click data
Build an expansion model with labeled training data and use it to select the next round of training data
11
Click Data
◦ Collected during December 2008 from the Yahoo! Search engine
◦ Only the top 10 URLs are considered
◦ URLs with fewer than 10 clicks are excluded
Three classification tasks
◦ How-to
◦ Adult
◦ Review
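The two click-data filters can be sketched as below. The record layout `(query, url, rank)`, the 1-based rank, and counting clicks after the rank filter are assumptions for illustration; `filter_click_log` is a hypothetical name.

```python
from collections import Counter


def filter_click_log(records, max_rank=10, min_clicks=10):
    """records: iterable of (query, url, rank), one tuple per click."""
    # Keep only clicks on results ranked within the top max_rank.
    top_ranked = [(q, u) for q, u, rank in records if rank <= max_rank]
    # Drop URLs that received fewer than min_clicks clicks overall.
    clicks_per_url = Counter(u for _, u in top_ranked)
    return [(q, u) for q, u in top_ranked if clicks_per_url[u] >= min_clicks]


# Hypothetical toy log: a.com passes, b.com has too few clicks,
# c.com is ranked outside the top 10.
records = (
    [("q", "a.com", 1)] * 10
    + [("q", "b.com", 2)] * 9
    + [("q", "c.com", 11)] * 12
)
kept = filter_click_log(records)
```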
12
Training sets
◦ 10,000 manually labeled positive and negative examples
◦ For “review” classifier, queries such as “digital camera reviews” or “baby swing reviews”
◦ For “How-to” classifier, queries such as “how to clean uggs” or “best way to loose weight”
Testing sets
13
Classifier
◦ Gradient Boosting Decision Tree (GBDT)
Features
◦ Textual, Link, URL, HTML, Other features
Metrics
◦ Area Under the ROC Curve (AUC) (Fawcett, 2003)
◦ F score
◦ Accuracy
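For reference, the three metrics can be computed as in this sketch. It assumes binary labels (1 = positive, 0 = negative), takes the F score to be the balanced F1 (the slides do not state a beta), and computes AUC as the fraction of positive/negative pairs ranked correctly, counting ties as half. The data is made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_score(y_true, y_pred):
    """Balanced F1: harmonic mean of precision and recall."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores, thresholded at 0.5 for hard labels.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
```

The pairwise AUC is quadratic in the number of examples, which is fine for a small illustration but not for a full test set.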
14
The largest improvement from CDM is observed with a model that uses 5,000 labeled examples as the seed set (+1.07% in F-score, +0.81 in Accuracy and +0.25% in AUC)
15
Reduce the manual labor by 50%
QC (exclude pages that do not have “review” in query terms) is useful when the amount of labeled data is small
16
With 1,000 and 2,000 human-labeled examples, CDM performs worse than the baseline
QC (exclude pages that do not have “How-to” in query terms)
17
Baseline: Type A
CDM: Type C
18
[Figure, from the “How-to” classifier: Seed 1; Seed 2 (human labels from Expand1); Expand2]
19
A random sample of 50 positive and 50 negative examples from the “how-to” classifier
Positive class has 82.3% precision whereas negative class has 83.6% precision
20
Is the proposed method always useful for web page classification?
How can we improve the quality of automatically labeled data from unlabeled data?
21
Present a method for improving web page classification by leveraging click data to augment training data
Augment manually labeled data by modeling the similarity between pages in a click graph
22
Thank you very much
Questions & Answers
23