Feng Zhang, Guang Qiu, Jiajun Bu*, Mingcheng Qu, Chun Chen
College of Computer Science, Zhejiang University
Hangzhou, China
Reporter: 洪紹祥
Adviser: 鄭淑真
Date: 2010/10/26
1
The textual advertising market is becoming a substantial source of Web revenue.
Contextual advertising plays an important role in it.
Relevance between content and ads leads users to click and browse the ads, and brings advertisers a potential increase in revenue.
2
Keyword extraction is the key step of contextual advertising.
It directly affects the accuracy of the advertising system.
Research has been done on English keyword extraction.
There is little existing work on Chinese keyword extraction.
1. The unique characteristics of the Chinese language
2. The Internet and Web advertising market have just started in China
3
News and email query extraction
  TFIDF
  The closed captioning of TV news
  Mail subject
Information extraction
  Extract phrases
  The extraction techniques adopted are different from keyword extraction.
Keyword extraction in the case of English
  Keyphrase Extraction Algorithm (KEA)
  Three features:
    TFIDF
    Distance (number of words before the first occurrence / all words)
    Term frequency
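The distance feature above can be sketched in a few lines. This is a minimal illustration, not KEA's own code: the function name `first_occurrence_distance` is hypothetical, and the document is assumed to be already split into words.

```python
def first_occurrence_distance(words, phrase):
    """Words before the first occurrence of `phrase`, divided by all words.

    `words` and `phrase` are lists of tokens (assumed pre-tokenized).
    Returns None when the phrase does not occur.
    """
    n = len(phrase)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == phrase:
            return i / len(words)
    return None


words = ("keyword extraction drives contextual advertising "
         "and keyword extraction matters").split()
# Three words precede "contextual advertising" in a nine-word document.
print(first_occurrence_distance(words, ["contextual", "advertising"]))
```

A smaller value means the phrase appears earlier in the document.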
4
Data Process
5
Candidate selection criteria
1. The length of a candidate is at least two words.
2. A candidate that occurs in different places in the same document is considered the identical one, and its feature values are combined.
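The two criteria can be sketched as a filter-and-merge step. The slide does not say how feature values are combined, so this sketch only collects every position of a repeated candidate; the name `select_candidates` and the input layout are assumptions, and the text is assumed to be already segmented into words.

```python
from collections import defaultdict


def select_candidates(occurrences, min_words=2):
    """Keep candidates with at least `min_words` words and merge repeats.

    `occurrences` is a list of (words, phrase_index) pairs, where `words`
    is a tuple of already-segmented words (an assumption of this sketch).
    Returns a dict mapping each candidate to all of its positions.
    """
    merged = defaultdict(list)
    for words, phrase_index in occurrences:
        if len(words) >= min_words:           # criterion 1: length filter
            merged[words].append(phrase_index)  # criterion 2: merge repeats
    return dict(merged)


occurrences = [
    (("contextual", "advertising"), 0),
    (("advertising",), 1),                  # dropped: only one word
    (("keyword", "extraction"), 2),
    (("contextual", "advertising"), 5),     # merged with the entry at 0
]
print(select_candidates(occurrences))
```

Keeping the full position list lets later features (first, last, frequency) be derived from one merged record.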
6
Building the classifier (using the C4.5 decision tree algorithm)
Feature selection.
Binary value
  Linguistic features: noun, verb …
  Named entity: name, place …
Numeric value
  Length: length of the candidate, length of the document, number of sentences in the document
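C4.5 chooses splits by gain ratio (information gain divided by split information). The sketch below shows that criterion for a discrete feature such as a binary "is noun" flag. It is only the split criterion, not a full C4.5 implementation (no thresholds for numeric features, no pruning), and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of the feature divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after splitting on the feature.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


is_keyword = [1, 1, 0, 0]
print(gain_ratio([1, 1, 0, 0], is_keyword))  # feature separates the classes
print(gain_ratio([1, 0, 1, 0], is_keyword))  # feature carries no information
```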
7
Building the classifier (using the C4.5 decision tree algorithm)
Feature selection.
Location.
  First: (nth phrase / all phrases), (nth sentence / all sentences)
  Last: (nth phrase / all phrases), (nth sentence / all sentences)
TFIDF. Traditional, log2(TF + 1), log2(IDF + 1)
Information entropy. H(x) = −(T/N) * log2(T/N)
Diameter. Last(nth phrase) − first(nth phrase), last(nth sentence) − first(nth sentence)
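The numeric features on this slide can be sketched as small helper functions. The slide does not define T and N, so treating them as plain counts is an assumption here, as are the function names and the use of 1-based indices.

```python
import math


def location_features(positions, total):
    """First, last, and diameter of a candidate's occurrences.

    `positions` are 1-based phrase (or sentence) indices; `total` is the
    number of phrases (or sentences) in the document.
    """
    first, last = min(positions), max(positions)
    return {
        "first": first / total,    # nth phrase / all phrases
        "last": last / total,
        "diameter": last - first,  # last(nth) - first(nth)
    }


def entropy_term(t, n):
    """H(x) = -(T/N) * log2(T/N), with T and N as assumed counts (0 < T <= N)."""
    p = t / n
    return -p * math.log2(p)


def log_scaled(value):
    """log2(value + 1), the scaling applied to TF and IDF on the slide."""
    return math.log2(value + 1)


print(location_features([2, 5, 9], total=10))
print(entropy_term(1, 4))
print(log_scaled(3))
```

The same `location_features` call works for sentence indices by passing sentence positions and the sentence count.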
8
Corpus construction.
  Contains 2200 documents
  2000 for training and 100 for testing
Labeling.
  Submit the candidates in a document to Google
Performance measures
  Top-N = CorrectNum / TotalNum
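The Top-N measure can be sketched as below. The slide does not define CorrectNum and TotalNum, so this sketch assumes CorrectNum is the number of labeled keywords among the N highest-ranked candidates and TotalNum is the number of candidates in that top-N list. The function name and sample data are hypothetical.

```python
def top_n_score(ranked_candidates, gold_keywords, n):
    """Fraction of the top-n ranked candidates that are labeled keywords."""
    top = ranked_candidates[:n]
    if not top:
        return 0.0
    correct = sum(1 for candidate in top if candidate in gold_keywords)
    return correct / len(top)


ranked = ["contextual advertising", "web revenue",
          "keyword extraction", "decision tree"]
gold = {"contextual advertising", "keyword extraction"}
print(top_n_score(ranked, gold, n=3))  # two of the top three are labeled keywords
```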
9
Algorithm comparison experiment.
10
Feature contribution experiment.
11
Feature contribution experiment. To analyze other features’ influences
12
The experimental results show that our approach is promising and achieves a large improvement over KEA and Yih’s work, setting aside the difference in language.
We attribute the superior performance to the appropriate features we select and the classification algorithm we adopt.
13