TRANSCRIPT
Quantitative Comparisons of Search Engine Results
Mike Thelwall
School of Computing and Information Technology, University of Wolverhampton (Wolverhampton, UK)
Journal of the American Society for Information Science and Technology 2008
2
Abstract
• Search engines – to find information or web sites
• Webometrics – finding and measuring web-based phenomena
• Comparing the application programming interfaces – Google, Yahoo!, Live Search
• Webometric applications – hit count, number of URLs, number of domains, number of web sites, number of top-level domains
3
Search Engine and Web Crawlers
• Three key operations:
– Crawling: identifying, downloading and storing pages to a database
– Results matching: the search engine identifies the pages in its database that match a user query
4
Search Engine and Web Crawlers
– Results ranking
• The search engine arranges the matching URLs to maximize the probability that a relevant result is on the first or second page, based on:
• Search terms
• Occurrence frequency
• Number of clicks
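The ranking factors above can be sketched as a toy scoring function. The weights and fields here are purely illustrative assumptions; the engines' real ranking formulas are proprietary and far more complex.

```python
import math

def rank_score(term_freq, click_count):
    """Hypothetical score combining occurrence frequency and click popularity.
    log1p dampens the click signal so popularity alone cannot dominate."""
    return term_freq + math.log1p(click_count)

# Two made-up result pages (URLs are placeholders, not real sites).
pages = [
    {"url": "a.example/p1", "term_freq": 3, "clicks": 120},
    {"url": "b.example/p2", "term_freq": 5, "clicks": 2},
]
ranked = sorted(pages, key=lambda p: rank_score(p["term_freq"], p["clicks"]),
                reverse=True)
print([p["url"] for p in ranked])
```

With these weights, the heavily clicked page outranks the one with more term occurrences, illustrating how the signals trade off.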
5
Research Objectives
• Are there specific anomalies that make the hit count estimates (HCEs) of Google, Live Search or Yahoo! unreliable for particular values?
• How consistent are Google, Live Search and Yahoo! in the number of URLs returned for a search, and which of them typically returns the most URLs?
• How consistent are the search engines in terms of the spread of results (sites, domains and top-level domains) and which search engine gives the widest spread of results for a search?
6
Data
• 1,587 words
– Drawn from blogs
– Selected by word frequency
– http://cybermetrics.wlv.ac.uk/paperdata/
• Three search engines
– Google, Yahoo! and Live Search
– 1,000 pages
• Five webometric measures
– hit count, number of URLs, number of domains, number of web sites, number of top-level domains
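Four of the five measures can be derived from a returned URL list (the hit count estimate comes from the engine itself). A minimal sketch, assuming "domain" means the full hostname and "site" the last two hostname labels; the paper's exact site/domain definitions may differ, and real site extraction would need the Public Suffix List.

```python
from urllib.parse import urlsplit

def webometrics(urls):
    """Count distinct URLs, domains (hostnames), sites and TLDs.
    Site extraction here is a crude last-two-labels heuristic (assumption)."""
    hosts = {urlsplit(u).hostname for u in urls}
    sites = {".".join(h.split(".")[-2:]) for h in hosts}
    tlds = {h.split(".")[-1] for h in hosts}
    return {"urls": len(set(urls)), "domains": len(hosts),
            "sites": len(sites), "tlds": len(tlds)}

print(webometrics([
    "http://blog.example.com/a",
    "http://www.example.com/b",
    "http://example.org/c",
]))
```

Here three URLs span three hostnames but only two sites and two TLDs, which is exactly the kind of spread the study compares across engines.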
7
Results - 1
• Hit count estimates
Figure 2a,b,c. Hit count estimates of the three search engines compared (logarithmic scales, excluding data with zero values; r=0.80, 0.96, 0.83).
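The r values in the figures are correlations on logarithmic scales with zero values excluded. A small sketch of that computation, using invented counts for two hypothetical engines (not the study's data):

```python
import math

def pearson_log(x, y):
    """Pearson correlation of log10-transformed counts, dropping any pair
    containing a zero (mirroring the figures' log scales)."""
    pairs = [(math.log10(a), math.log10(b))
             for a, b in zip(x, y) if a > 0 and b > 0]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    cov = sum((a - mx) * (b - my) for a, b in pairs)
    sx = math.sqrt(sum((a - mx) ** 2 for a, _ in pairs))
    sy = math.sqrt(sum((b - my) ** 2 for _, b in pairs))
    return cov / (sx * sy)

engine_a = [1200, 50, 0, 900000, 30]   # made-up hit counts
engine_b = [1500, 40, 10, 750000, 25]
print(round(pearson_log(engine_a, engine_b), 3))
```

The log transform keeps the huge range of hit counts (tens to millions) from letting a few large values dominate the correlation.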
8
Results - 2
• Number of URLs returned
Figure 3a,b,c. URLs returned by the three search engines compared (r=0.71, 0.68, 0.84)
9
Results - 3
• Number of domains returned
Figure 4a,b,c. Domains returned by the three search engines compared (r=0.65, 0.69, 0.83).
10
Results - 4
• Number of sites returned
Figure 5a,b,c. Sites returned by the three search engines compared (r=0.66, 0.69, 0.81)
11
Results - 5
• Number of TLDs returned
Figure 6a,b,c. TLDs returned by the three search engines compared (r=0.74, 0.77, 0.84)
12
Results - 6
• Comparison within results
13
Conclusion
• Google seems to be the most consistent in terms of the relationship between its HCEs and number of URLs returned.
• Yahoo! is recommended if the objective is to get results from the widest variety of web sites, domains or TLDs.
14
Evaluating Search Engine Effects on Web-based Relatedness Measurement
15
Snippets
• Six manifest records
– snippets
– hit count
– number of URLs
– number of domains
– number of web sites
– number of top-level domains
16
Dataset
• WordSimilarity-353 Test Collection (TC-353)
– TC353 Full (353 pairs)
– TC353 Testing (153 pairs)
• Three famous search engines
– Yahoo!
– Google
– Live Search
• Five domains
– general web search (web09)
– .com
– .edu
– .net
– .org
17
The Model
• A web-based relatedness measure WebMetric(X, Y) measures the association of two objects X and Y:
WebMetric(X, Y) = F(d(X, Y))
– where F is a transfer function and d is a dependency score.
• The dependency score d reflects the mutual dependency of X and Y on the web.
18
The Model
• Given a search engine G and two objects X and Y
– we employ two double-checking functions, fG(Y@X) and fG(X@Y), to estimate the dependence between X and Y
•
d(X, Y) = (fG(X@Y) / fG(X)) · (fG(Y@X) / fG(Y))

WebMetric(X, Y) = e^(b·e^(a·d))
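A minimal sketch of the double-checking model. The count arguments and the Gompertz parameters a and b are hypothetical placeholders; the paper fits its own constants, and fG(X@Y) denotes the hits for X restricted to Y's results.

```python
import math

def dependency(f_x_at_y, f_x, f_y_at_x, f_y):
    """d(X, Y): product of the two conditional dependence ratios
    fG(X@Y)/fG(X) and fG(Y@X)/fG(Y)."""
    if f_x == 0 or f_y == 0:
        return 0.0
    return (f_x_at_y / f_x) * (f_y_at_x / f_y)

def web_metric(d, a=-4.0, b=-2.0):
    """Gompertz transfer F(d) = exp(b * exp(a * d)).
    a and b are assumed negative constants so that low dependency maps
    near 0 and high dependency saturates toward 1."""
    return math.exp(b * math.exp(a * d))

# Made-up search counts for illustration only.
d = dependency(f_x_at_y=120, f_x=400, f_y_at_x=90, f_y=300)
print(round(d, 3), round(web_metric(d), 3))
```

The Gompertz curve is monotonically increasing in d under these sign assumptions, so a stronger mutual dependency always yields a higher relatedness score.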
19
Figure 8. Behaviors of the Gompertz Curve and a Mapping Example
y(d) = e^(b·e^(a·d))
20
Experiments
WebMetric(X, Y) = e^(b·e^(a·d))
21