

Page 1:

Feng Zhang, Guang Qiu, Jiajun Bu*, Mingcheng Qu, Chun Chen
College of Computer Science, Zhejiang University
Hangzhou, China

Reporter: 洪紹祥
Adviser: 鄭淑真
Date: 2010/10/26

Page 2:

The textual advertising market is becoming a substantial source of Web revenue.

Contextual advertising plays an important role in it.

Relevance between page content and ads leads users to click and browse the ads, which brings advertisers a potential increase in revenue.

Page 3:

Keyword extraction is the key step of contextual advertising.
    It directly affects the accuracy of the advertising system.

Research has been done on English keyword extraction.

There is little existing work on Chinese keyword extraction:
    1. The unique characteristics of the Chinese language
    2. The Internet and Web advertising market have only just started in China

Page 4:

News and email query extraction
    TFIDF
    The closed captioning of TV news
    Mail subject

Information extraction
    Extract phrases
    The extraction techniques adopted are different from keyword extraction.

Keyword extraction in the case of English
    Keyphrase Extraction Algorithm (KEA)
    Three features:
        TFIDF
        Distance (number of words before the first occurrence / number of all words)
        Term frequency
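As a rough illustration of the distance and term frequency features named above, the sketch below computes both for a phrase in a tokenized document. The function names and the toy document are assumptions made for this example; they are not taken from KEA's own code.

```python
def first_occurrence_distance(tokens, phrase):
    """Return (words before the first occurrence of phrase) / (all words).

    tokens and phrase are lists of words. Returns None when the phrase
    does not occur or when either list is empty.
    """
    if not tokens or not phrase:
        return None
    n, m = len(tokens), len(phrase)
    for start in range(n - m + 1):
        if tokens[start:start + m] == phrase:
            return start / n
    return None


def term_frequency(tokens, phrase):
    """Count how many times phrase occurs in tokens."""
    if not phrase:
        return 0
    m = len(phrase)
    return sum(
        1
        for start in range(len(tokens) - m + 1)
        if tokens[start:start + m] == phrase
    )


# Toy document (hypothetical), 11 words long.
doc = (
    "contextual advertising relies on keyword extraction "
    "and keyword extraction needs features"
).split()
phrase = ["keyword", "extraction"]

distance = first_occurrence_distance(doc, phrase)  # 4 words precede it, so 4 / 11
frequency = term_frequency(doc, phrase)            # the phrase occurs twice
```

A small distance means the phrase first appears early in the document.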


Page 5:

Data Process

Page 6:

Candidate selection criteria
    1. The length of a candidate is at least two words.
    2. A candidate that occurs in different places in the same document
        is considered as the identical one
        its feature values are combined
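The two criteria above can be sketched as a filter followed by a grouping step. The slides do not say how feature values are combined, so this sketch only collects the positions of each candidate; later features could be derived from them. The `(words, position)` representation and the name `merge_candidates` are assumptions for illustration.

```python
from collections import defaultdict


def merge_candidates(occurrences, min_words=2):
    """Group candidate occurrences by their text.

    occurrences: iterable of (words, position) pairs, where words is a
    sequence of words and position is the index of that occurrence in
    the document (a hypothetical representation).
    Returns {words: sorted positions} for candidates with at least
    min_words words.
    """
    merged = defaultdict(list)
    for words, position in occurrences:
        if len(words) < min_words:
            continue  # criterion 1: drop candidates shorter than two words
        merged[tuple(words)].append(position)  # criterion 2: same text, one candidate
    return {words: sorted(positions) for words, positions in merged.items()}


# Toy occurrences (hypothetical).
occurrences = [
    (("keyword", "extraction"), 7),
    (("advertising",), 1),
    (("keyword", "extraction"), 4),
    (("decision", "tree"), 12),
]
candidates = merge_candidates(occurrences)
```

Here the single-word candidate is dropped, and the two occurrences of "keyword extraction" end up under one key with positions 4 and 7.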


Page 7:

Building the classifier (using the C4.5 decision tree algorithm)

Feature selection
    Binary value
        Linguistic features: noun, verb, …
        Named entity: name, place, …
    Numeric value
        Length
            Length of the candidate
            Length of the document
            Number of sentences in the document
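C4.5 chooses the feature to split on by gain ratio, that is, information gain divided by the split information of the feature. The slides do not show how the classifier was trained, so the following is only a minimal sketch of that criterion for a discrete feature. The feature lists `is_noun` and `is_first_sentence` and the labels `is_keyword` are invented toy data.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of a list of discrete values."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )


def gain_ratio(feature_values, labels):
    """Information gain of the split divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after splitting on the feature.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0


# Toy data (hypothetical): one binary feature value and one label per candidate.
is_keyword = [1, 1, 0, 0]
is_noun = [1, 1, 0, 0]            # separates the classes perfectly
is_first_sentence = [1, 0, 1, 0]  # carries no information about the class

ratio_noun = gain_ratio(is_noun, is_keyword)
ratio_first = gain_ratio(is_first_sentence, is_keyword)
```

On this toy data the perfectly separating feature gets a higher gain ratio than the uninformative one, so a C4.5-style learner would split on it first.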


Page 8:

Building the classifier (using the C4.5 decision tree algorithm)

Feature selection
    Location
        First: (nth phrase / all phrases), (nth sentence / all sentences)
        Last: (nth phrase / all phrases), (nth sentence / all sentences)
    TFIDF
        Traditional
        log2(TF + 1) log2(IDF + 1)
    Information entropy
        H(x) = −(T/N) * log2(T/N)
    Diameter
        Last(nth phrase) − First(nth phrase)
        Last(nth sentence) − First(nth sentence)
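The location, TFIDF, entropy, and diameter features above can be sketched as small functions. Several points here are assumptions, since the slides do not define them: positions are taken to be 1-based phrase (or sentence) indices, the two log terms of the TFIDF variant are read as a product, and T and N are treated as a count and a total with 0 < T ≤ N. All function names are hypothetical.

```python
import math


def location_features(positions, total):
    """First and last relative position, plus the diameter.

    positions: 1-based indices of the phrases (or sentences) in which
    the candidate occurs; total: number of phrases (or sentences).
    """
    first, last = min(positions), max(positions)
    return {
        "first": first / total,    # nth phrase / all phrases, first occurrence
        "last": last / total,      # nth phrase / all phrases, last occurrence
        "diameter": last - first,  # Last(nth phrase) - First(nth phrase)
    }


def log_tfidf(tf, idf):
    """log2(TF + 1) * log2(IDF + 1); the product is an assumption."""
    return math.log2(tf + 1) * math.log2(idf + 1)


def entropy_term(t, n):
    """H(x) = -(T/N) * log2(T/N), with T and N as assumed above."""
    p = t / n
    return -p * math.log2(p) if p > 0 else 0.0


# Toy values (hypothetical): the candidate occurs in phrases 2, 5 and 9 of 10.
features = location_features([2, 5, 9], total=10)
tfidf = log_tfidf(tf=3, idf=7)  # log2(4) * log2(8)
h = entropy_term(t=1, n=4)      # -(1/4) * log2(1/4)
```

The same `location_features` function serves both the phrase-level and the sentence-level variants; only the indices and the total passed in differ.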


Page 9:

Corpus construction
    Contains 2200 documents
    2000 for training and 100 for testing

Labeling
    Submit the candidates in a document to Google

Performance measures
    Top-N = CorrectNum / TotalNum
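The Top-N measure above can be sketched as follows. The slides do not define CorrectNum and TotalNum, so this sketch assumes that CorrectNum counts the top-N extracted candidates that are labeled as keywords and that TotalNum counts all top-N candidates considered, summed over the test documents. The function name and the toy data are hypothetical.

```python
def top_n_score(ranked_by_doc, gold_by_doc, n):
    """Return CorrectNum / TotalNum over the top-n candidates of each document.

    ranked_by_doc: {doc_id: candidates ordered from most to least likely}
    gold_by_doc:   {doc_id: set of candidates labeled as keywords}
    """
    correct = total = 0
    for doc_id, ranked in ranked_by_doc.items():
        top = ranked[:n]
        gold = gold_by_doc.get(doc_id, set())
        correct += sum(1 for candidate in top if candidate in gold)
        total += len(top)
    return correct / total if total else 0.0


# Toy predictions and labels (hypothetical).
ranked_by_doc = {
    "doc1": ["a", "b", "c"],
    "doc2": ["d", "e"],
}
gold_by_doc = {
    "doc1": {"a", "c"},
    "doc2": {"e"},
}
score_top1 = top_n_score(ranked_by_doc, gold_by_doc, n=1)  # 1 correct of 2
score_top2 = top_n_score(ranked_by_doc, gold_by_doc, n=2)  # 2 correct of 4
```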


Page 10:

Algorithm comparison experiment

Page 11:

Feature contribution experiment

Page 12:

Feature contribution experiment
    To analyze the influence of the other features

Page 13:

The experimental results show that our approach is promising and improves substantially on KEA and Yih's work, setting aside the difference in language.

We attribute the superior performance to the appropriate features we select and the classification algorithm we adopt.