TRANSCRIPT
Quantitative Comparisons of Search Engine Results
Mike Thelwall
School of Computing and Information Technology, University of Wolverhampton (Wolverhampton, UK)
Journal of the American Society for Information Science and Technology 2008
2
Abstract
• Search engines – to find information or web sites
• Webometrics – finding and measuring web-based phenomena
• Comparing the application programming interfaces – Google, Yahoo!, Live Search
• Webometric applications – hit count, number of URLs, number of domains, number of web sites, number of top-level domains
3
Search Engine and Web Crawlers
• Three key operations:
– Crawling: identifying, downloading and storing pages to a database
– Results matching: the search engine identifies the pages in its database that match a user query
4
Search Engine and Web Crawlers
– Results ranking
• The search engine arranges the matching URLs to maximize the probability that a relevant result is on the first or second page, based on:
• Search terms
• Occurrence frequency
• Number of clicks
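The ranking factors above can be sketched as a toy scoring function. The weights and fields here are purely illustrative assumptions; the engines' real ranking formulas are proprietary and far more complex.

```python
import math

def rank_score(term_freq, click_count):
    """Hypothetical score combining occurrence frequency and click popularity.
    log1p dampens the click signal so popularity alone cannot dominate."""
    return term_freq + math.log1p(click_count)

# Two made-up result pages (URLs are placeholders, not real sites).
pages = [
    {"url": "a.example/p1", "term_freq": 3, "clicks": 120},
    {"url": "b.example/p2", "term_freq": 5, "clicks": 2},
]
ranked = sorted(pages, key=lambda p: rank_score(p["term_freq"], p["clicks"]),
                reverse=True)
print([p["url"] for p in ranked])
```

With these weights, the heavily clicked page outranks the one with more term occurrences, illustrating how the signals trade off.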
5
Research Objectives
• Are there specific anomalies that make the hit count estimates (HCEs) of Google, Live Search or Yahoo! unreliable for particular values?
• How consistent are Google, Live Search and Yahoo! in the number of URLs returned for a search, and which of them typically returns the most URLs?
• How consistent are the search engines in terms of the spread of results (sites, domains and top-level domains) and which search engine gives the widest spread of results for a search?
6
Data
• 1,587 words
– Drawn from blogs
– Selected by word frequency
– http://cybermetrics.wlv.ac.uk/paperdata/
• Three search engines
– Google, Yahoo! and Live Search
– 1,000 pages
• Five webometric measures
– hit count, number of URLs, number of domains, number of web sites, number of top-level domains
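Four of the five measures can be derived from a returned URL list (the hit count estimate comes from the engine itself). A minimal sketch, assuming "domain" means the full hostname and "site" the last two hostname labels; the paper's exact site/domain definitions may differ, and real site extraction would need the Public Suffix List.

```python
from urllib.parse import urlsplit

def webometrics(urls):
    """Count distinct URLs, domains (hostnames), sites and TLDs.
    Site extraction here is a crude last-two-labels heuristic (assumption)."""
    hosts = {urlsplit(u).hostname for u in urls}
    sites = {".".join(h.split(".")[-2:]) for h in hosts}
    tlds = {h.split(".")[-1] for h in hosts}
    return {"urls": len(set(urls)), "domains": len(hosts),
            "sites": len(sites), "tlds": len(tlds)}

print(webometrics([
    "http://blog.example.com/a",
    "http://www.example.com/b",
    "http://example.org/c",
]))
```

Here three URLs span three hostnames but only two sites and two TLDs, which is exactly the kind of spread the study compares across engines.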
7
Results - 1
• Hit count estimates
Figure 2a,b,c. Hit count estimates of the three search engines compared (logarithmic scales, excluding data with zero values; r=0.80, 0.96, 0.83).
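The r values in the figures are correlations on logarithmic scales with zero values excluded. A small sketch of that computation, using invented counts for two hypothetical engines (not the study's data):

```python
import math

def pearson_log(x, y):
    """Pearson correlation of log10-transformed counts, dropping any pair
    containing a zero (mirroring the figures' log scales)."""
    pairs = [(math.log10(a), math.log10(b))
             for a, b in zip(x, y) if a > 0 and b > 0]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    cov = sum((a - mx) * (b - my) for a, b in pairs)
    sx = math.sqrt(sum((a - mx) ** 2 for a, _ in pairs))
    sy = math.sqrt(sum((b - my) ** 2 for _, b in pairs))
    return cov / (sx * sy)

engine_a = [1200, 50, 0, 900000, 30]   # made-up hit counts
engine_b = [1500, 40, 10, 750000, 25]
print(round(pearson_log(engine_a, engine_b), 3))
```

The log transform keeps the huge range of hit counts (tens to millions) from letting a few large values dominate the correlation.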
8
Results - 2
• Number of URLs returned
Figure 3a,b,c. URLs returned by the three search engines compared (r=0.71, 0.68, 0.84)
9
Results - 3
• Number of domains returned
Figure 4a,b,c. Domains returned by the three search engines compared (r=0.65, 0.69, 0.83).
10
Results - 4
• Number of sites returned
Figure 5a,b,c. Sites returned by the three search engines compared (r=0.66, 0.69, 0.81)
11
Results - 5
• Number of TLDs returned
Figure 6a,b,c. TLDs returned by the three search engines compared (r=0.74, 0.77, 0.84)
12
Results - 6
• Comparison within results
13
Conclusion
• Google seems to be the most consistent in terms of the relationship between its HCEs and number of URLs returned.
• Yahoo! is recommended if the objective is to get results from the widest variety of web sites, domains or TLDs.
14
Evaluating Search Engine Effects on Web-based Relatedness Measurement
15
Snippets
• Six manifest records
– snippets
– hit count
– number of URLs
– number of domains
– number of web sites
– number of top-level domains
16
Dataset
• WordSimilarity-353 Test Collection (TC-353)
– TC353 Full (353 pairs)
– TC353 Testing (153 pairs)
• Three famous search engines
– Yahoo!
– Google
– Live Search
• Five domains
– general web search (web09)
– .com
– .edu
– .net
– .org
17
The Model
• A web-based relatedness measure WebMetric(X, Y) measures the association of two objects X and Y:
WebMetric(X, Y) = F(d(X, Y))
– where F is a transfer function and d is a dependency score.
• The dependency score d reflects the mutual dependency of X and Y on the web.
18
The Model
• Given a search engine G and two objects X and Y
– we employ two double-checking functions, fG(Y@X) and fG(X@Y), to estimate the dependence between X and Y
•
d(X, Y) = (fG(X@Y) / fG(X)) · (fG(Y@X) / fG(Y))

WebMetric(X, Y) = e^(b·e^(a·d))
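A minimal sketch of the double-checking model. The count arguments and the Gompertz parameters a and b are hypothetical placeholders; the paper fits its own constants, and fG(X@Y) denotes the hits for X restricted to Y's results.

```python
import math

def dependency(f_x_at_y, f_x, f_y_at_x, f_y):
    """d(X, Y): product of the two conditional dependence ratios
    fG(X@Y)/fG(X) and fG(Y@X)/fG(Y)."""
    if f_x == 0 or f_y == 0:
        return 0.0
    return (f_x_at_y / f_x) * (f_y_at_x / f_y)

def web_metric(d, a=-4.0, b=-2.0):
    """Gompertz transfer F(d) = exp(b * exp(a * d)).
    a and b are assumed negative constants so that low dependency maps
    near 0 and high dependency saturates toward 1."""
    return math.exp(b * math.exp(a * d))

# Made-up search counts for illustration only.
d = dependency(f_x_at_y=120, f_x=400, f_y_at_x=90, f_y=300)
print(round(d, 3), round(web_metric(d), 3))
```

The Gompertz curve is monotonically increasing in d under these sign assumptions, so a stronger mutual dependency always yields a higher relatedness score.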
19
Figure 8. Behaviors of the Gompertz Curve and a Mapping Example
y(d) = e^(b·e^(a·d))
20
Experiments
WebMetric(X, Y) = e^(b·e^(a·d))
21