
Feng Zhang, Guang Qiu, Jiajun Bu*, Mingcheng Qu, Chun Chen

College of Computer Science, Zhejiang University

Hangzhou, China

Reporter: 洪紹祥

Adviser: 鄭淑真

Date: 2010/10/26

1

The textual advertising market is becoming a substantial source of Web revenue.

Contextual advertising has played an important role in it.

Relevance between content and ads leads users to click and browse the ads, bringing advertisers a potential increase in revenue.

2

Keyword extraction is the key step of contextual advertising.

Keyword extraction directly affects the accuracy of the advertising system.

Research has been done on English keyword extraction.

There is little existing work on Chinese keyword extraction:

1. The unique characteristics of the Chinese language

2. The Internet and Web advertising market have just started in China

3

News and email query extraction: TFIDF

The closed captioning of TV news; mail subject

Information extraction: extract phrases

The extraction techniques adopted are different from keyword extraction.

Keyword extraction in the case of English: Keyphrase Extraction Algorithm (KEA)

Three features: TFIDF, distance (number of words before the phrase's first appearance / all words), term frequency
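The distance and term frequency features listed for KEA can be illustrated with a minimal sketch. The function name `kea_features` and the whitespace tokenization are assumptions made for this example only; they are not taken from KEA or from the paper.

```python
def kea_features(tokens, phrase):
    """Return (distance, term_frequency) for `phrase` in `tokens`.

    distance: words before the phrase's first appearance / all words.
    term_frequency: number of times the phrase occurs in the document.
    Returns None when the phrase does not occur (a choice made for this sketch).
    """
    n = len(phrase)
    # Start index of every occurrence of the phrase in the token list.
    positions = [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]
    if not positions:
        return None
    return positions[0] / len(tokens), len(positions)


# Hypothetical toy document, tokenized on whitespace.
tokens = "cheap flights to rome and cheap flights to paris".split()
print(kea_features(tokens, ["to", "rome"]))
```

A low distance means the phrase first appears early in the document.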

4

Data Process

5

Candidate selection criteria

1. The length of a candidate is at least two words.

2. A candidate that occurs in different places in the same document is considered the identical one, and its feature values are combined.
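The two criteria can be sketched as a small filter-and-merge step. The input format (pairs of candidate words and a sentence index) and the name `select_candidates` are hypothetical. The slide does not say how feature values are combined, so this sketch only collects the positions of each repeated candidate.

```python
from collections import defaultdict


def select_candidates(occurrences, min_words=2):
    """Drop short candidates and merge repeated ones.

    `occurrences` is a list of (candidate_words, sentence_index) pairs,
    an input format assumed for this sketch.
    Returns {candidate: [sentence indices where it occurs]}.
    """
    merged = defaultdict(list)
    for words, sentence_index in occurrences:
        if len(words) >= min_words:  # criterion 1: at least two words
            # criterion 2: the same candidate elsewhere is treated as identical
            merged[tuple(words)].append(sentence_index)
    return dict(merged)


occurrences = [
    (["keyword", "extraction"], 0),
    (["advertising"], 1),
    (["keyword", "extraction"], 4),
]
print(select_candidates(occurrences))
```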

6

Building the classifier (using the C4.5 decision tree algorithm)

Feature selection.

Binary value

Linguistic features. Noun, verb …

Named entity. Name, place …

Numeric value

Length. Length of the candidate; length of the document; number of sentences in the document
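C4.5 chooses which feature to split on by gain ratio: the information gain of the split divided by the entropy of the split itself. The sketch below computes it for a discrete (for example binary) feature. It illustrates the criterion only and is not the classifier used in the paper; the function names and toy data are assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of splitting on a discrete feature, divided by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical data: a binary "is noun" feature and keyword / non-keyword labels.
is_noun = [1, 1, 0, 0]
is_keyword = [1, 1, 0, 0]
print(gain_ratio(is_noun, is_keyword))
```

Numeric features such as the length values would first need a threshold to turn them into a binary split; that step is left out here.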

7

Building the classifier (using the C4.5 decision tree algorithm)

Feature selection.

Location.

First: (nth phrase / all phrases), (nth sentence / all sentences)

Last: (nth phrase / all phrases), (nth sentence / all sentences)

TFIDF. Traditional TFIDF; log2(TF + 1); log2(IDF + 1)

Information entropy. H(x) = −(T/N) * log2(T/N)

Diameter. Last (nth phrase) − First (nth phrase); Last (nth sentence) − First (nth sentence)
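The location, diameter, log-scaled TF/IDF, and entropy features can be sketched as follows. The slide does not define T and N, so reading them as counts with 0 < T <= N is an assumption, as are the function names and the 1-based phrase positions.

```python
import math


def position_features(phrase_positions, total_phrases):
    """Location and diameter features for one candidate.

    `phrase_positions` holds the 1-based phrase positions where the candidate
    occurs (an input format assumed for this sketch).
    """
    first, last = min(phrase_positions), max(phrase_positions)
    return {
        "first": first / total_phrases,  # nth phrase / all phrases, first occurrence
        "last": last / total_phrases,    # nth phrase / all phrases, last occurrence
        "diameter": last - first,        # last(nth phrase) - first(nth phrase)
    }


def log_tf_idf(tf, idf):
    """Return log2(TF + 1) and log2(IDF + 1) as two separate values."""
    return math.log2(tf + 1), math.log2(idf + 1)


def entropy_term(t, n):
    """H = -(T/N) * log2(T/N), with T and N treated as assumed counts, 0 < T <= N."""
    p = t / n
    return -p * math.log2(p)


print(position_features([2, 5, 8], total_phrases=10))
print(log_tf_idf(3, 1))
print(entropy_term(1, 4))
```

The same position function applies to sentence positions by passing sentence indices and the total number of sentences.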

8

Corpus construction. Contains 2200 documents; 2000 for training and 100 for testing

Labeling. Submit the candidates in a document to Google

Performance measures. Top-N = CorrectNum / TotalNum
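One possible reading of the Top-N measure is sketched below. It assumes CorrectNum is the number of top-N ranked candidates that match the labeled keywords and TotalNum is the number of top-N candidates considered, summed over documents. The slide does not spell out either quantity, so both readings and all names here are assumptions.

```python
def top_n_score(ranked_by_doc, gold_by_doc, n):
    """Top-N = CorrectNum / TotalNum under the assumptions stated above.

    ranked_by_doc: {doc_id: [candidates, best first]}
    gold_by_doc:   {doc_id: set of labeled keywords}
    """
    correct = total = 0
    for doc_id, ranked in ranked_by_doc.items():
        top = ranked[:n]
        gold = gold_by_doc.get(doc_id, set())
        correct += sum(1 for candidate in top if candidate in gold)
        total += len(top)
    return correct / total if total else 0.0


# Hypothetical rankings and labels for two documents.
ranked = {"d1": ["a", "b", "c"], "d2": ["x", "y"]}
gold = {"d1": {"a", "c"}, "d2": {"y"}}
print(top_n_score(ranked, gold, n=2))
```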

9

Algorithm comparison experiment.

10

Feature contribution experiment.

11

Feature contribution experiment. To analyze other features’ influences

12

The experimental results show that our approach is promising and improves considerably on KEA and Yih’s work, setting aside the difference in language.

We attribute the superior performance to the appropriate features we select and the classification algorithm we adopt.

13
