
Intelligent Database Systems Lab

Advisor: Dr. Hsu

Graduate: Chien-Shing Chen

Authors: Satoshi Oyama, Takashi Kokubo, Toru Ishida

國立雲林科技大學 National Yunlin University of Science and Technology

Domain-Specific Web Search with Keyword Spices

IEEE Transactions on Knowledge and Data Engineering, Jan. 2004


Outline

Motivation
Objective
Introduction
Domain-specific web search with keyword spices
Algorithm for extracting keyword spices
Experiments
Conclusions
Opinion

Motivation

Naive queries may return many irrelevant pages.

Obtaining more relevant pages depends on considerable experience and skill.

Previous domain-specific search engines collect and index relevant pages. They are constructed manually, which raises cost and scalability problems.


Objective

Domain-specific search engines should return pages relevant to a certain domain and filter out irrelevant web pages.


1-1. Introduction

Domain-specific web search engines: looking for a recipe.

Entering only "beef" finds few recipes.

Entering "beef pepper" finds other recipes.


Queries: "beef" (牛肉); "beef, pepper" (牛肉、胡椒)

1-2. Introduction

1-3. Introduction

1-4. Introduction

Domain-specific search engines should return pages relevant to a certain domain and filter out irrelevant web pages.

Approach: download both relevant and irrelevant pages, classify them, and learn a decision tree.


2-1. Domain-specific web search with keyword spices

Domain-specific web search is treated as a text classification problem.

Sample web pages for the domain are collected according to assumed user input.


2-1. Domain-specific web search as a text classification

D: the set of all web documents

Dt: the set of documents relevant to a certain domain


2-1. Domain-specific web search as a text classification

Take the set of all keywords in the domain. The hypothesis space is composed of all Boolean expressions in which each keyword is regarded as a Boolean variable.

A Boolean expression of keywords can be regarded as a function from D to {0, 1}:

1 if the keywords are contained in the document,
0 otherwise.

Document   Words in the domain   Output
1          1 1 0 0 0             1
2          0 1 0 1 1             0
3          0 1 1 0 0             1
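The mapping from keyword occurrences to an output of 1 or 0 can be sketched in a few lines. This is only an illustration: the column names `w1` to `w5` are hypothetical labels (the slide does not name the keywords), and the hypothesis `w2 AND NOT w4` is one Boolean expression that happens to reproduce the output column of the three rows above, not the expression used in the paper.

```python
from typing import Dict, List

# Hypothetical column labels; the slide does not name the five keywords.
KEYWORDS = ["w1", "w2", "w3", "w4", "w5"]

# Keyword-presence vectors and outputs copied from the three rows above.
DOCS: Dict[str, List[int]] = {
    "1": [1, 1, 0, 0, 0],
    "2": [0, 1, 0, 1, 1],
    "3": [0, 1, 1, 0, 0],
}
EXPECTED_OUTPUT = {"1": 1, "2": 0, "3": 1}


def as_assignment(vector: List[int]) -> Dict[str, bool]:
    """Map each keyword to True if the document contains it."""
    return {k: bool(v) for k, v in zip(KEYWORDS, vector)}


def h(doc: Dict[str, bool]) -> int:
    """Illustrative hypothesis (w2 AND NOT w4), a function from documents to {0, 1}."""
    return int(doc["w2"] and not doc["w4"])


outputs = {name: h(as_assignment(vec)) for name, vec in DOCS.items()}
```

Any other Boolean expression over the same variables is a different hypothesis in the same space.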


2-1. Domain-specific web search as a text classification

Finding hypothesis h that minimizes the error rate:
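The formula on this slide did not survive in the transcript. As a rough sketch of the idea only, the error rate of a hypothesis can be estimated on labeled sample pages and the candidate with the lowest estimate kept. The sample pages, the two candidate hypotheses, and the helper name `error_rate` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, bool]            # keyword -> present on the page?
Hypothesis = Callable[[Doc], int]


def error_rate(h: Hypothesis, samples: List[Tuple[Doc, int]]) -> float:
    """Fraction of labeled samples on which h disagrees with the label."""
    mistakes = sum(1 for doc, label in samples if h(doc) != label)
    return mistakes / len(samples)


# Hypothetical labeled sample: (keyword presence, human label).
samples = [
    ({"recipe": True, "home": False}, 1),
    ({"recipe": True, "home": True}, 0),
    ({"recipe": False, "home": False}, 0),
    ({"recipe": True, "home": False}, 1),
]

# Two hypothetical candidates from the hypothesis space.
candidates: Dict[str, Hypothesis] = {
    "recipe": lambda d: int(d["recipe"]),
    "recipe AND NOT home": lambda d: int(d["recipe"] and not d["home"]),
}

best_name = min(candidates, key=lambda name: error_rate(candidates[name], samples))
```

Enumerating every Boolean expression is not practical, which is why the presentation turns to decision tree learning in section 3.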


2-2. Collecting sample web pages by user's input

Random sampling is difficult.

Assume all candidate keywords have the same probability of occurrence.

In the "recipe" domain, input "beef," "salmon," "potato," etc. as sample keywords and download the same number of web pages for each keyword.
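A minimal sketch of this collection step, reading it as "the same number of pages per keyword". The function `fake_search` is a hypothetical stand-in for a general-purpose search engine, and the `example.com` URLs and the page count of 3 are placeholders, not values from the paper.

```python
from typing import Callable, Dict, List


def collect_samples(
    keywords: List[str],
    search: Callable[[str], List[str]],
    pages_per_keyword: int,
) -> Dict[str, List[str]]:
    """Keep the same number of result pages for every sample keyword."""
    return {kw: search(kw)[:pages_per_keyword] for kw in keywords}


def fake_search(keyword: str) -> List[str]:
    """Hypothetical stand-in for a search engine: returns ten placeholder URLs."""
    return [f"https://example.com/{keyword}/{i}" for i in range(10)]


samples = collect_samples(["beef", "salmon", "potato"], fake_search, pages_per_keyword=3)
```

Taking equal counts per keyword mirrors the assumption that every candidate keyword is equally likely to be entered.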

3-1. Identifying keyword spices

Classify the sample pages by hand into two classes, T or F.

Apply a decision tree learning algorithm to discover keyword spices:

each node is an attribute;
the value on a branch indicates the value of the attribute;
each leaf is a class.

Example path: no "tablespoon", has "recipe", no "home", no "top": class T.
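To make the link between a tree and a Boolean query concrete, the sketch below walks a small hand-built tree and turns every path that ends in class T into one conjunction. Only the path quoted above comes from the slide; the remaining leaves, the tuple encoding of the tree, and the `AND`/`OR`/`NOT` query syntax are assumptions for illustration.

```python
from typing import List, Optional, Tuple, Union

# A leaf is the class label "T" or "F"; an inner node tests one keyword.
Tree = Union[str, Tuple[str, "Tree", "Tree"]]  # (keyword, if_absent, if_present)
Literal = Tuple[str, bool]                     # (keyword, must_be_present)

# Illustrative tree. Only the path ending in the first T leaf comes from the
# slide; every other leaf is an assumption so that the example is complete.
tree: Tree = (
    "tablespoon",
    ("recipe",
        "F",
        ("home",
            ("top", "T", "F"),
            "F")),
    "T",
)


def positive_paths(node: Tree, path: Optional[List[Literal]] = None) -> List[List[Literal]]:
    """Collect one conjunction of literals per path that ends in a T leaf."""
    path = path or []
    if isinstance(node, str):
        return [path] if node == "T" else []
    keyword, if_absent, if_present = node
    return (positive_paths(if_absent, path + [(keyword, False)])
            + positive_paths(if_present, path + [(keyword, True)]))


def to_query(dnf: List[List[Literal]]) -> str:
    """Render the disjunction of conjunctions as a Boolean query string."""
    def lit(literal: Literal) -> str:
        return literal[0] if literal[1] else f"NOT {literal[0]}"
    return " OR ".join("(" + " AND ".join(lit(l) for l in c) + ")" for c in dnf)


spice = positive_paths(tree)
query = to_query(spice)
```

The result is a Boolean expression in disjunctive normal form, which is the shape the simplification step in section 3-2 operates on.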

3-1. Extracting keyword spices

Web page   Words in the domain   Output (classified by humans)
d1         1 1 0 0 0             1
d2         0 1 0 1 1             0
d3         0 1 1 0 0             1

Web pages d1 to d3 are collected using the user's input keywords.


3-2. Simplifying keyword spices

Induced decision trees can be very large.

Search engines cannot accept overly complex queries.

Large trees also suffer from the overfitting problem.


3-2. Simplifying keyword spices

Simplify the induced Boolean expression:

1. For each conjunction c in h, remove keywords (Boolean literals) from c to simplify it.

2. Remove conjunctions from the disjunctive normal form h to simplify it.
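One possible greedy reading of these two steps is sketched below. It scores each candidate by the harmonic mean of precision and recall on validation pages and accepts a removal whenever that score does not drop. The acceptance rule, the tie handling, and the toy validation pages are assumptions; the paper's exact procedure may differ.

```python
from typing import List, Set, Tuple

Literal = Tuple[str, bool]        # (keyword, must_be_present)
Conjunction = List[Literal]
Page = Tuple[Set[str], bool]      # (words on the page, relevant?)


def matches(dnf: List[Conjunction], words: Set[str]) -> bool:
    """True if at least one conjunction is satisfied by the page's words."""
    return any(all((kw in words) == present for kw, present in c) for c in dnf)


def f_measure(dnf: List[Conjunction], validation: List[Page]) -> float:
    """Harmonic mean of precision and recall of dnf on the validation pages."""
    hits = [relevant for words, relevant in validation if matches(dnf, words)]
    true_pos = sum(hits)
    if true_pos == 0:
        return 0.0
    total_relevant = sum(relevant for _, relevant in validation)
    p, r = true_pos / len(hits), true_pos / total_relevant
    return 2 * p * r / (p + r)


def simplify(h: List[Conjunction], validation: List[Page]) -> List[Conjunction]:
    # Step 1: drop literals from each conjunction while F does not get worse.
    simplified = []
    for c in h:
        c = list(c)
        improved = True
        while improved and len(c) > 1:
            improved = False
            for lit in list(c):
                shorter = [l for l in c if l != lit]
                if f_measure([shorter], validation) >= f_measure([c], validation):
                    c, improved = shorter, True
                    break
        simplified.append(c)
    # Step 2: drop whole conjunctions while F of the disjunction does not get worse.
    improved = True
    while improved and len(simplified) > 1:
        improved = False
        for c in list(simplified):
            rest = [x for x in simplified if x is not c]
            if f_measure(rest, validation) >= f_measure(simplified, validation):
                simplified, improved = rest, True
                break
    return simplified


# Hypothetical validation pages and an induced expression with two conjunctions.
validation = [
    ({"recipe", "tablespoon", "beef"}, True),
    ({"recipe", "home", "beef"}, False),
    ({"tablespoon", "salmon"}, True),
    ({"home", "top", "news"}, False),
]
h = [
    [("tablespoon", False), ("recipe", True), ("home", False), ("top", False)],
    [("tablespoon", True)],
]
result = simplify(h, validation)
```

On this toy data the expression collapses to the single keyword `tablespoon`, which is short enough for any search engine to accept.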


3-2. Simplifying keyword spices

Precision P and recall R are defined over the validation set.

F is the harmonic mean of P and R.
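The slide's formulas are not preserved in the transcript, so the sketch below uses the standard definitions: precision is the share of retrieved validation pages that are relevant, recall is the share of relevant validation pages that are retrieved, and F is their harmonic mean. The validation counts are made up for illustration.

```python
from typing import List, Tuple


def precision_recall_f(results: List[Tuple[bool, bool]]) -> Tuple[float, float, float]:
    """results holds one (retrieved_by_h, relevant) pair per validation page."""
    retrieved = sum(1 for got, _ in results if got)
    relevant = sum(1 for _, rel in results if rel)
    both = sum(1 for got, rel in results if got and rel)
    p = both / retrieved if retrieved else 0.0
    r = both / relevant if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical validation outcome: 4 pages retrieved, 3 of them relevant,
# and 2 relevant pages missed.
validation = [(True, True)] * 3 + [(True, False)] + [(False, True)] * 2
p, r, f = precision_recall_f(validation)
```

Here P is 3/4 and R is 3/5, so F comes out at 2/3, below both the arithmetic mean and the larger of the two values.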


3-2. Simplifying keyword spices

Greater contribution to F

Weighted harmonic mean of P and R
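The weighted formula itself is missing from the transcript. A common parameterization of the weighted harmonic mean, given here as an assumption rather than as the presentation's own definition, uses a weight β:

```latex
\[
F_\beta = \frac{(\beta^2 + 1)\, P\, R}{\beta^2 P + R}
\]
```

With β = 1 this reduces to the plain harmonic mean 2PR / (P + R). Values of β below 1 give precision more weight, and values above 1 give recall more weight.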

4. Experiments

4-1. Experiments: extracting keyword spices


In the recipe domain, the sample pages were split randomly.


Keyword spices discovered for a recipe search engine


Trade-off between precision and recall


When , keyword spices extracted for the domain of …


4-2. Evaluation using a general-purpose search engine


to test queries in each domain


Precision values of the sample queries conjoined with "recipe"

A query conjoined only with the keyword "recipe" (for example, "beef recipe") finds fewer relevant pages than the same query with the keyword spice.
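The two kinds of query being compared can be sketched as simple string composition. The `AND`/`NOT` syntax and the helper `with_spice` are assumptions, and the spice string is only the single conjunction quoted in section 3-1, not the full keyword spice reported in the paper.

```python
def with_spice(user_query: str, spice: str) -> str:
    """Conjoin the user's keywords with a Boolean expression for the domain."""
    return f"({user_query}) AND ({spice})"


# Hypothetical spice: the one decision-tree path quoted in section 3-1.
RECIPE_SPICE = "NOT tablespoon AND recipe AND NOT home AND NOT top"

baseline = with_spice("beef", "recipe")      # plain "beef recipe" style query
spiced = with_spice("beef", RECIPE_SPICE)    # query extended with the spice
```

Both strings go to the same general-purpose search engine; only the appended expression changes.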


4-3. Comparison to the filtering model


precision values of the sample queries in the filtering model


numbers of relevant pages returned by the …


For example, for "shrimp" the filtering model must download 5 pages to obtain one relevant result, so it is quite inefficient.
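The arithmetic behind this remark can be written out. If a fraction p of the downloaded pages is relevant, about 1/p pages must be fetched per relevant result. The value 1/5 below is inferred from the "5 pages for one result" figure on the slide, and the function name is hypothetical.

```python
def downloads_per_result(precision: float) -> float:
    """Expected pages to download per relevant page at the given precision."""
    if not 0 < precision <= 1:
        raise ValueError("precision must be in (0, 1]")
    return 1 / precision


shrimp_precision = 1 / 5   # inferred: one relevant page per five downloads
cost = downloads_per_result(shrimp_precision)
```

Every one of those downloads costs bandwidth and time before the filter can discard the page, which is the inefficiency the slide points to.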


5. Future Work

Classifying training examples by hand is costly.


5. Future Work

1. Using a web directory as a source for training examples

Web directories such as Yahoo, Open Directory, …

Estimate bias


5. Future Work

2. Learning classifiers from partially labeled data

A proposed algorithm augments a small set of labeled examples with a huge set of unlabeled ones.


6. Conclusion

keyword spices human

Cost, effective


Opinion

The method depends heavily on human effort.

It assumes all candidate keywords have the same probability of occurrence.

……


Opinion

Pr(TL)? Pr(TL')?

\[
\Pr(TL \mid W_i) = \frac{\Pr(W_i \mid TL)\,\Pr(TL)}{\Pr(W_i)}, \qquad
\Pr(TL' \mid W_i) = \frac{\Pr(W_i \mid TL')\,\Pr(TL')}{\Pr(W_i)}
\]

\[
\frac{\Pr(W_i \mid TL)}{\Pr(W_i \mid TL')} = \frac{\Pr(TL \mid W_i)\,\Pr(TL')}{\Pr(TL' \mid W_i)\,\Pr(TL)}
\]


Opinion

• Posterior Probability Rule

X

\[
\frac{\Pr(W_i \mid TL)}{\Pr(W_i \mid TL')} = \frac{\Pr(TL \mid W_i)}{\Pr(TL' \mid W_i)} \cdot \frac{\Pr(TL')}{\Pr(TL)}
\]

\[
\lim_{x \to 0} f(x), \qquad \lim_{x \to 1} f(x)
\]

This assumes all candidate keywords have the same probability of occurrence.


Keyword Spices Modified


Information Retrieval


Machine Learning (clustering, classification)


Content Web Mining


A dictionary that can represent the distance between words
