takuya funahashi and hayato yamana ([email protected])...

30
Takuya FUNAHASHI and Hayato YAMANA ([email protected]) Computer Science and Engineering Div., Waseda Univ. JAPAN 1

Upload: others

Post on 08-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Takuya FUNAHASHI and Hayato YAMANA ([email protected])

Computer Science and Engineering Div., Waseda Univ. JAPAN

1

Page 2: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Have you ever experienced?

8 870 t 2 070?8,870 to 2,070?Where have 6,800 web pages gone?

Put a Query andClick “Search”Button

, p g g

How can we have reliable hit counts?How can we have reliable hit counts?

Click “Next” Button

W t t k l it2

We want to make clear it.

Page 3: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Agenda:

PART1:PART1: Importance of Search Engines’ hit counts

PART2PART2: Related Work

PART3: Experiment and Trustworthiness of hit countsExperiment and Trustworthiness of hit counts

PART4: C l iConclusions

33

Page 4: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Introduction

1.IMPORTANCE OF SEARCH ENGINES’ HIT COUNTS

Introduction

SEARCH ENGINES’ HIT COUNTS

44

Page 5: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Background – How important hit-count is?

Many researches based on Hit Count exist …

Researches using Hit Count

- Translation SupportAl-Onaizen, Y. and Knight, K.:”Translating named entities using monolingual and bilingual resources”, Proc. of the 40th Ann. Meeting on Association for

NLP researches using web data aremostly based on SE’s hit counts

- Calculate Similarity between Words or Sentences

and bilingual resources , Proc. of the 40 Ann. Meeting on Association for Computational Linguisitics, pp.400 – 408 (2002).mostly based on SE s hit counts

E l ti f th lit f b

Cilibrasi, R.L. and Vitanyi, P.M.B.:”The Google Similarity Distance”, IEEE Trans. on Knowledge and Data Engineering, Vol.19, No.3, pp.370-383 (2007).

SE’s hit counts now become- Evaluation of the quality of web pages

Gelman, I.A. and Barletta, A.L.:”A “quick and dirty” website data quality indicator”, Proc. of the 2nd ACM Workshop on Information credibility on the web, pp.43-46

one of indispensable web resources

5

(2008)

Page 6: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Example “Hont? Search” System[4]by Y.Yamamoto et al.(Fact Search System)( y )

It utilizes the difference among hit counts.

66

g

Page 7: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Hit count v.s. the number of returned web documents

• Difference between the Hit Count on the 1st SERP and the number of returned web documents

e 1s

tS

ER

P

SE

RP

ount

s on

the

s on

the

1st

hit c

o

number of returned hit c

ount

s

number of returnedweb documents(MSN)

web documents(Google)

e 1s

tS

ER

P

Experiment by using 1,000 querieson Nov 2008ou

nts

on th

77number of returned web documents(Yahoo! JAPAN)

on Nov.2008

hit c

o

Page 8: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

How often does the relationship turn over?・Using 1,000 queries to compare the relationship, i.e. which has large hit count,for every pair of 1,000 queries

CASE1 i th hit t th 1 t SERP

12 11,1% CASE1: comparing the hit counts on the 1st SERP.

CASE2: comparing the numbers of returned web documents

8

10

5 9We should make clear the

t t thi f

4

6

2 6

5,9trustworthiness of search engines’ hit counts

2

4 2,6

0Google Yahoo! JAPAN Live Search

t ti f th b f lt f i

88

turn over ratio of the number of results of every pair

Page 9: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

2.RELATED WORK

99

Page 10: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Related work

• Googleology is Bad Science– Kilgarriff, A. : “Googleology is Bad Science”, Computational Linguistic,

V l 33 I 1 Queries repeated the following day gave hit Vol.33, Issue 1, pp.147-151 (2007).

• Quantitative comparisons of search engine results

Que es epeated t e o o g day ga e tcounts varied by more than 10%.

Quantitative comparisons of search engine results– Thelwall, M. : “Quantitative comparisons of search engine results”, J. of

American for Information Science and Technology, Vol.59, Issue 11, pp 1702 1710 (2008)

He compared three search engines, i.e. difference among the three.pp.1702-1710 (2008)

• Investigation of the accuracy of search engine hit counts

difference among the three.

search engine hit counts– Uyar, A.:”Investigation of the accuracy of search engine hit counts”, J. of

Information Science, Vol.35, Issue 4, pp.469-480 (2009)He compared three search engines to find out

the search engines returning the mostthe search engines returning the most accurate hit count.

He assumed the number of returned web

10

documents is correct.

Page 11: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Related work

None of previous researches provideh t th li bl hit twhat the reliable hit count in a search engine is.g

1111

Page 12: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

3.EXPERIMENT AND TRUSTWORTHINESS OF HIT COUNTTRUSTWORTHINESS OF HIT COUNT

1212

Page 13: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

We cannot trust search engines’ hit count

Put a Query andClick “Search”Button

Where have the 6,800 pages gone?Sometimes, ,

hit count decreases to 1/10or

increases to 10 times.

Click “Next” Button

13

Page 14: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Purpose of Our Research

To provide “reliable hit count in a psearch engine” for researches using it

Cues of hit count ‘dance’ (change):

CASE1: Clicking the “Search” button many timesg y

CASE2: Clicking the “Next” button step by step to reach the last page of the search resultsto reach the last page of the search results

CASE3: Searching with the same query on different

14

days

Page 15: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Experimental Setting

Target:

(APIs)(APIs)

API Settings:API Settings:non-phrase search, normal-safe filter, Japanese language

Queries:10 000 queries provided by Yahoo! JAPAN10,000 queries, provided by Yahoo! JAPANas the top 10,000 frequent queries in December 2007.

Experimental Period:From October 2009 to December 2009.

15

From October 2009 to December 2009.

Page 16: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Case 1:Clicking the “Search” button many times

Queriesquery1 100 Hit C t

100 times in 5 min.

query1query2

query10000

・・・

100 Hit Counts

CV of “query1”query10000 CV of query1

1. Submit the same query to each SE 100 times in 5 min.2. Calculate the coefficient of variation(CV) from 100 Hit Counts.

CV =standard deviation

=variance

3. Execute 1 and 2 for all queries, then produce its histogram.

CV = average = average

16

q p g

Page 17: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Histogram of the coefficient of variation

All Search Engines’ CV are l t l th 5 0%

We may ignore

17

almost less than 5.0% the dance in case1.

Page 18: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Case 2: Clicking the “Next” button step by step

Queriesquery1

1. Submit one query to Search Engine to getHitCount(1,10), HitCount(11,20), ..., HitCount(991,1000)

1 t SERP 2 d SERP 100th SERPquery1query2

q er 10000

2. Calculate Deep hit count Vector(DV)1st SERP 2nd SERP 100th SERP

query10000 DV =

3. Execute 1 and 2 for all queries,th l k l t i t t

Click!!

then apply k-means clustering to a setof DVs

20,0001-10

20,00011-20

......

12,000991-1000 note:

Case 2 is applied to only Bing and Yahoo!because Google API does not return 1000

18

because Google API does not return 1000 results.

Page 19: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Clustering( by using k-means)

• Objective– extract transition patterns of the hit countsp

• MethodMethod– vary its clustering size k from 1 to 6, then, select the

best size by manual.best size by manual. – choose the best size based on the following points;1) start offsets when hit count dances begin are clearly1) start offsets when hit count dances begin are clearly

different among clusters.2) curves of change ratio are clearly different among ) g y g

clusters.

1919

Page 20: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Transition of Hit Counts - Case 2

(k=1)

Adj ti th fi l hit t

e R

atio

Adjusting the final hit countto the number actually returned

web documents.

Cha

nge

Search Offset N

(k=2)

nge

Rat

io

Re-calculateHit Count!!Every 100 results,

Cha

ny ,Yahoo! re-calculates

the hit count

20

Search Offset N

Page 21: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Discussion(our assumption)

At first, hit count should be calculated by using top-k

results

top-k results

results

When requesting more

Full Index

q gresult,

search engine will useanother top-k resultspto calculate hit count.

Most reliable hit count will be the hit count when “k” is the largest number, but not adjusted to such as Bing.

21

g , j g

Page 22: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Search Engines’ Pruned Index Architecture

Querieshits results cache

Neither result cache hit

not hit results cache, but hits pruned index

Neither result cache hit nor pruned index hit

Results Cache

Results Cache

Results Cache

Low level cachePrunedIndex

PrunedIndex

PrunedIndex

Low-level cache

Full Index

Gleb Skobeltsyn, Flavio P. Junqueira ,Vassilis Plachouras, Ricardo Baeza-Yates:”ResIn: A Combination of Results Caching and Index Pruning for High-performanceW b S h E i ” SIGIR2008 131 138

2222

Web Search Engines”, SIGIR2008, pp.131-138

Page 23: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Discussion(our assumption)

At first, hit count is calculated by using top-k

results

top-k results

results

When requesting more

Full Index

q gresult,

search engine will useanother top-k resultstop-k results

Reliable Hit Count in Case 2 isp

To calculate hit count.HitCount(k, k+9) where k is defined as the largest number,usually 991 for a search having more than 1,000 matched

Most reliable hit count will be the hit count when “k” is the largest number, but not adjusted to such as Bing.

pages. If a search engine adjusts the last hit count,we should use the hit count just before the adjusted count.

23

g , j gwe should use the hit count just before the adjusted count.

Page 24: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Case 3: Searching on different days

1 Submit queries to Search Engine every day during 2 months1. Submit queries to Search Engine every day during 2 months.(From October 11, 2009, to December 12, 2009)

2. Calculate Vectors of Variational ratio (VV).

VV =

3. Apply k-means clustering to VVs like Case 2.3. Apply k means clustering to VVs like Case 2.

24

Page 25: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Clustering Result - Case 3

(k=4) (k=5)

Rat

io

atio

Cha

nge

R

Cha

nge

Ra

within a week

25

changes more than 30%

Page 26: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Clustering Result - Case 3

(k=3)( )

tioC

hang

e R

atC

within a week

h th 30%

26

changes more than 30%

Page 27: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Conclusion of Case 3(k 4)(k=4)

o

Search Engines have two phases

(k=5)

hang

e R

atio

“Dancing Hard Phase”

: index update phase.During this phase, hit counts will change

th 30% ithi k fCh

Rat

io

more than 30% within a week from our observation.

(k=3)

Cha

nge

R“Stable Phase” Hit counts do not change more than 30%within a week.

Rat

ioC

hang

e R

Reliable Hit Count in Case 3 is...

A hit count when a Search Engine is on stable phase

27

A hit count when a Search Engine is on stable phase.

Page 28: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

4.CONCLUSIONS

2828

Page 29: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

Conclusion

Hit count has become one of indispensable web resourcesresources

Reliable hit count can be the last hit count when h i i bl h h ha search engine is on stable phase where there

exists small change less than 30% during one kweek.

Our next challengeBuilding some benchmark to correct/refine theBuilding some benchmark to correct/refine the hit count more reliable.Providing some quality/reliable measure for the

29

Providing some quality/reliable measure for the hit count.

Page 30: Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org) …gplsi.dlsi.ua.es/congresos/qwe10/fitxers/SLIDES... · 2010-07-19 · 8 10 59 We should make clear the tt thi f 4 6 26 trustworthiness

THANK YOU FOR YOUR ATTENTION

3030