Takuya FUNAHASHI and Hayato YAMANA ([email protected])
Computer Science and Engineering Div., Waseda Univ. JAPAN
Have you ever experienced this?
Put a query and click the “Search” button, then click the “Next” button.
8,870 to 2,070? Where have 6,800 web pages gone?
How can we have reliable hit counts?
We want to make this clear.
Agenda:
PART1: Importance of Search Engines’ Hit Counts
PART2: Related Work
PART3: Experiment and Trustworthiness of Hit Counts
PART4: Conclusions
Introduction
1. IMPORTANCE OF SEARCH ENGINES’ HIT COUNTS
Background: How important are hit counts?
Many studies based on hit counts exist.
Research using hit counts:
- Translation support
Al-Onaizan, Y. and Knight, K.: “Translating named entities using monolingual and bilingual resources”, Proc. of the 40th Ann. Meeting on Association for Computational Linguistics, pp.400-408 (2002).
- Calculating similarity between words or sentences
Cilibrasi, R.L. and Vitanyi, P.M.B.: “The Google Similarity Distance”, IEEE Trans. on Knowledge and Data Engineering, Vol.19, No.3, pp.370-383 (2007).
- Evaluation of the quality of web pages
Gelman, I.A. and Barletta, A.L.: “A “quick and dirty” website data quality indicator”, Proc. of the 2nd ACM Workshop on Information credibility on the web, pp.43-46 (2008).
NLP research using web data is mostly based on search engines’ hit counts.
Search engines’ hit counts have now become one of the indispensable web resources.
Example: “Honto? Search” System [4] by Y. Yamamoto et al. (Fact Search System)
It utilizes the differences among hit counts.
Hit count vs. the number of returned web documents
• Difference between the hit count on the 1st SERP and the number of returned web documents
[Figure: three plots, one each for Google, MSN, and Yahoo! JAPAN; axes: hit counts on the 1st SERP and number of returned web documents]
Experiment using 1,000 queries in Nov. 2008
How often does the relationship turn over?
・Using 1,000 queries, compare the relationship, i.e. which query has the larger hit count, for every pair of the 1,000 queries.
CASE1: comparing the hit counts on the 1st SERP.
CASE2: comparing the numbers of returned web documents.
[Figure: bar chart of the turn-over ratio (%) of the number of results for every pair of queries, for Google, Yahoo! JAPAN, and Live Search; bar labels 11.1%, 5.9%, and 2.6%]
We should make clear the trustworthiness of search engines’ hit counts.
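The pairwise comparison on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes that a pair "turns over" when its order by hit count on the 1st SERP (CASE1) is the reverse of its order by number of returned web documents (CASE2). The function name `turn_over_ratio`, the handling of ties, and the sample numbers are all hypothetical.

```python
from itertools import combinations

def turn_over_ratio(hit_counts, returned_docs):
    """Fraction of query pairs whose order by 1st-SERP hit count is the
    reverse of their order by number of returned web documents."""
    pairs = list(combinations(range(len(hit_counts)), 2))
    if not pairs:
        return 0.0
    turned = 0
    for i, j in pairs:
        by_hits = hit_counts[i] - hit_counts[j]
        by_docs = returned_docs[i] - returned_docs[j]
        # Opposite signs mean the relationship turns over.
        # Ties are counted as "not turned over" (an assumption).
        if by_hits * by_docs < 0:
            turned += 1
    return turned / len(pairs)

# Hypothetical data for three queries.
hits = [8870, 5000, 1200]   # hit counts on the 1st SERP
docs = [207, 640, 100]      # numbers of returned web documents
ratio = turn_over_ratio(hits, docs)
```

With 1,000 queries this loop visits 499,500 pairs, which is still small enough for a plain Python loop.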
2. RELATED WORK
Related work
• Googleology is Bad Science
- Kilgarriff, A.: “Googleology is Bad Science”, Computational Linguistics, Vol.33, Issue 1, pp.147-151 (2007).
Queries repeated the following day gave hit counts that varied by more than 10%.
• Quantitative comparisons of search engine results
- Thelwall, M.: “Quantitative comparisons of search engine results”, J. of the American Society for Information Science and Technology, Vol.59, Issue 11, pp.1702-1710 (2008).
He compared three search engines, i.e. the differences among the three.
• Investigation of the accuracy of search engine hit counts
- Uyar, A.: “Investigation of the accuracy of search engine hit counts”, J. of Information Science, Vol.35, Issue 4, pp.469-480 (2009).
He compared three search engines to find out which returns the most accurate hit counts.
He assumed the number of returned web documents is correct.
Related work
None of the previous studies establishes what the reliable hit count in a search engine is.
3. EXPERIMENT AND TRUSTWORTHINESS OF HIT COUNT
We cannot trust search engines’ hit counts
Put a query and click the “Search” button, then click the “Next” button.
Where have the 6,800 pages gone?
Sometimes, the hit count decreases to 1/10 or increases to 10 times.
Purpose of Our Research
To provide a “reliable hit count in a search engine” for research that uses it
Cues of hit count ‘dance’ (change):
CASE1: Clicking the “Search” button many times
CASE2: Clicking the “Next” button step by step to reach the last page of the search results
CASE3: Searching with the same query on different days
Experimental Setting
Target: Google, Yahoo!, Bing (APIs)
API Settings: non-phrase search, normal safe filter, Japanese language
Queries: 10,000 queries, provided by Yahoo! JAPAN as the top 10,000 most frequent queries in December 2007.
Experimental Period: From October 2009 to December 2009.
Case 1: Clicking the “Search” button many times
Queries: query1, query2, ..., query10000
[Diagram: each query is submitted 100 times in 5 min., giving 100 hit counts, from which the CV of that query is calculated]
1. Submit the same query to each SE 100 times in 5 min.
2. Calculate the coefficient of variation (CV) from the 100 hit counts.
$CV = \frac{\text{standard deviation}}{\text{average}} = \frac{\sqrt{\text{variance}}}{\text{average}}$
3. Execute 1 and 2 for all queries, then produce a histogram of the CVs.
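Step 2 can be illustrated with a short sketch. It assumes the population standard deviation (the slides do not say whether the population or sample form was used); the function name and the sample hit counts are hypothetical.

```python
from statistics import mean, pstdev

def coefficient_of_variation(hit_counts):
    """CV = standard deviation / average of the observed hit counts."""
    avg = mean(hit_counts)
    if avg == 0:
        # Avoid division by zero when every observation is 0.
        return 0.0
    # pstdev is the population standard deviation (an assumption).
    return pstdev(hit_counts) / avg

# Hypothetical hit counts from repeated submissions of one query.
samples = [1000, 1000, 1100, 900]
cv = coefficient_of_variation(samples)
```

Multiplying `cv` by 100 gives the percentage form used when comparing against the 5.0% level mentioned on the histogram slide.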
Histogram of the coefficient of variation
[Figure: histogram of CV for each search engine]
Almost all of the search engines’ CVs are less than 5.0%.
We may ignore the dance in Case 1.
Case 2: Clicking the “Next” button step by step
Queries: query1, query2, ..., query10000
1. Submit one query to the search engine to get HitCount(1,10), HitCount(11,20), ..., HitCount(991,1000), i.e. the hit counts on the 1st SERP, 2nd SERP, ..., 100th SERP.
2. Calculate the Deep hit count Vector (DV).
DV =
3. Execute 1 and 2 for all queries, then apply k-means clustering to the set of DVs.
[Diagram: clicking “Next” through result ranges 1-10, 11-20, ..., 991-1000, with example hit counts 20,000, 20,000, ..., 12,000]
Note: Case 2 is applied only to Bing and Yahoo! because the Google API does not return 1,000 results.
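The definition of DV is not preserved in this transcript. As a sketch only, the example below assumes that DV holds the hit count at each search offset divided by the hit count on the 1st SERP, which would match the “Change Ratio” axis on the later slides. Both this definition and the function name `deep_hit_count_vector` are assumptions, not the authors' formula.

```python
def deep_hit_count_vector(hit_counts_by_offset):
    """Change ratio of the hit count at each offset relative to the 1st SERP.

    hit_counts_by_offset: hit counts in offset order, i.e.
    HitCount(1,10), HitCount(11,20), ..., HitCount(991,1000).
    """
    first = hit_counts_by_offset[0]
    if first == 0:
        # No hits on the 1st SERP: return an all-zero vector (assumption).
        return [0.0 for _ in hit_counts_by_offset]
    return [count / first for count in hit_counts_by_offset]

# Example hit counts taken from the diagram: 20,000, 20,000, ..., 12,000.
dv = deep_hit_count_vector([20000, 20000, 12000])
```

Normalizing by the first value makes queries with very different absolute hit counts comparable before clustering.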
Clustering (using k-means)
• Objective
- extract transition patterns of the hit counts
• Method
- vary the number of clusters k from 1 to 6, then select the best k manually.
- choose the best k based on the following points:
1) the start offsets at which hit count dances begin are clearly different among clusters.
2) the curves of change ratio are clearly different among clusters.
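The clustering procedure can be sketched with a plain implementation of Lloyd's k-means. This is an illustration under assumptions: the deterministic initialization, the iteration limit, and the toy vectors are hypothetical, and the slides do not state which k-means implementation was used. As on the slide, the choice of the best k is left to manual inspection.

```python
def kmeans(vectors, k, iterations=50):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    # Deterministic initialization (assumption): evenly spaced input vectors.
    step = len(vectors) / k
    centroids = [list(vectors[int(i * step)]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: mean of the members; empty clusters keep their centroid.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Toy change-ratio vectors (hypothetical): two flat curves, two dropping curves.
dvs = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 0.9],
    [1.0, 0.5, 0.1],
    [1.0, 0.4, 0.1],
]
# Vary k from 1 to 6 (capped by the number of vectors), then inspect manually.
results = {k: kmeans(dvs, k) for k in range(1, min(6, len(dvs)) + 1)}
centroids2, labels2 = results[2]
```

Plotting each centroid against the search offset gives one transition curve per cluster, which is what the two selection criteria above compare.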
Transition of Hit Counts - Case 2
[Figure (k=1): change ratio vs. search offset N]
Adjusting the final hit count to the number of actually returned web documents.
[Figure (k=2): change ratio vs. search offset N]
Re-calculating the hit count: every 100 results, Yahoo! re-calculates the hit count.
Discussion (our assumption)
[Diagram: results, top-k results, Full Index]
At first, the hit count should be calculated by using the top-k results.
When more results are requested, the search engine will use another set of top-k results to calculate the hit count.
The most reliable hit count will be the hit count when “k” is the largest number, but not one that has been adjusted, as Bing does.
Search Engines’ Pruned Index Architecture
[Diagram: a query may hit the results cache; miss the results cache but hit the pruned index; or hit neither the results cache nor the pruned index. Components: Results Cache, Pruned Index, low-level cache, Full Index]
Gleb Skobeltsyn, Flavio P. Junqueira, Vassilis Plachouras, Ricardo Baeza-Yates: “ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines”, SIGIR2008, pp.131-138
Discussion (our assumption)
Reliable hit count in Case 2 is HitCount(k, k+9), where k is the largest available offset, usually 991 for a search having more than 1,000 matched pages.
If a search engine adjusts the last hit count, we should use the hit count just before the adjusted count.
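The selection rule for Case 2 can be sketched as follows. The slides do not say how an adjusted count is detected. This example assumes that the last count is “adjusted” when it equals the number of web documents actually returned. The function name, parameter names, and sample numbers are hypothetical.

```python
def reliable_hit_count(hit_counts, returned_total=None):
    """Pick the hit count at the largest offset, skipping an adjusted last count.

    hit_counts: hit counts in offset order (HitCount(1,10), HitCount(11,20), ...).
    returned_total: number of web documents actually returned, if known.
    """
    if not hit_counts:
        raise ValueError("no hit counts given")
    last = hit_counts[-1]
    # Assumption: the last count is 'adjusted' when it equals the number
    # of documents actually returned.
    adjusted = returned_total is not None and last == returned_total
    if adjusted and len(hit_counts) > 1:
        # Use the hit count just before the adjusted count.
        return hit_counts[-2]
    return last

# No adjustment: take the hit count at the largest offset.
a = reliable_hit_count([20000, 20000, 12000])
# Last count adjusted to the 873 documents actually returned (hypothetical).
b = reliable_hit_count([5000, 4800, 873], returned_total=873)
```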
Case 3: Searching on different days
1. Submit queries to the search engine every day for 2 months (from October 11, 2009, to December 12, 2009).
2. Calculate Vectors of Variational ratio (VV).
VV =
3. Apply k-means clustering to the VVs, as in Case 2.
Clustering Result - Case 3
[Figures (k=4) and (k=5): change ratio over the experimental period]
Changes of more than 30% within a week.
Clustering Result - Case 3
[Figure (k=3): change ratio over the experimental period]
Changes of more than 30% within a week.
Conclusion of Case 3
[Figures: clustering results (k=4), (k=5), and (k=3); axis: change ratio]
Search engines have two phases:
“Dancing Hard Phase”: the index update phase. During this phase, hit counts change by more than 30% within a week, according to our observation.
“Stable Phase”: hit counts do not change by more than 30% within a week.
Reliable hit count in Case 3 is...
a hit count obtained when a search engine is in the stable phase.
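A stable-phase check based on the 30% criterion might look like the sketch below. The slides do not define how the change is measured. This example assumes it is (max - min) / min over every window of 7 consecutive daily observations. The function name, parameters, and data are hypothetical.

```python
def is_stable(daily_hit_counts, window=7, threshold=0.30):
    """True if, in every window of `window` consecutive days, the hit count
    changes by at most `threshold` relative to the window minimum."""
    if len(daily_hit_counts) < window:
        raise ValueError("need at least one full window of observations")
    for start in range(len(daily_hit_counts) - window + 1):
        chunk = daily_hit_counts[start:start + window]
        low, high = min(chunk), max(chunk)
        if low == 0:
            # A jump away from zero counts as a dance (assumption).
            if high > 0:
                return False
            continue
        if (high - low) / low > threshold:
            return False
    return True

# Hypothetical daily hit counts for one query.
stable = is_stable([1000, 1020, 990, 1010, 1005, 995, 1000, 1015])
dancing = is_stable([1000, 1020, 990, 1500, 1005, 995, 1000, 1015])
```

Under this reading, a hit count would be recorded only on days covered by windows for which the check passes.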
4. CONCLUSIONS
Conclusion
Hit counts have become one of the indispensable web resources.
A reliable hit count can be the last hit count obtained when a search engine is in the stable phase, where changes stay below 30% during one week.
Our next challenges:
Building a benchmark to correct/refine hit counts and make them more reliable.
Providing a quality/reliability measure for hit counts.
THANK YOU FOR YOUR ATTENTION