Quantitative Comparisons of Search Engine Results. Mike Thelwall, School of Computing and Information Technology, University of Wolverhampton (Wolverhampton, UK). Journal of the American Society for Information Science and Technology, 2008.

Abstract

• Search engines
– To find information or web sites
• Webometrics
– Finding and measuring web-based phenomena
• Comparing the application programming interfaces (APIs)
– Google, Yahoo!, Live Search
• Webometric applications
– hit count, number of URLs, number of domains, number of web sites, number of top-level domains

Search Engine and Web Crawlers

• Three key operations:
– Crawling: identifying, downloading and storing pages in a database
– Results matching: a search engine identifies the pages in its database that match a user query.

– Results ranking
• A search engine will arrange the matching URLs to maximize the probability that a relevant result is on the first or second page of results.
• Search terms
• Occurrence frequency
• Number of clicks
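The three operations above (crawling and storing, results matching, results ranking) can be sketched as a toy pipeline. This is a minimal illustration, not how any of the engines discussed actually work: the `PAGES` dictionary stands in for a crawled database, all function names are hypothetical, and ranking uses only query-term occurrence frequency (click counts are left out).

```python
from collections import Counter, defaultdict

# Hypothetical "crawled" pages: URL -> stored text (stands in for the crawler's database).
PAGES = {
    "http://example.org/a": "search engines crawl the web and index pages",
    "http://example.org/b": "web crawlers download pages for search engines to index",
    "http://example.org/c": "cooking recipes for the weekend",
}


def build_index(pages):
    """Crawling/storing step: map each term to {url: occurrence count}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][url] = count
    return index


def match(index, query):
    """Results matching: the set of URLs that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    url_sets = [set(index.get(term, {})) for term in terms]
    return set.intersection(*url_sets)


def rank(index, query, urls):
    """Results ranking: order matches by total query-term frequency (toy score)."""
    terms = query.lower().split()

    def score(url):
        return sum(index.get(term, {}).get(url, 0) for term in terms)

    # Highest score first; ties are broken alphabetically by URL.
    return sorted(urls, key=lambda url: (-score(url), url))


def search(pages, query):
    """Run the three steps in order and return a ranked list of URLs."""
    index = build_index(pages)
    return rank(index, query, match(index, query))
```

For example, `search(PAGES, "crawl")` returns only the first page, because the second page contains "crawlers" but not the exact term "crawl".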

Research Objectives

• Are there specific anomalies that make the hit count estimates (HCEs) of Google, Live Search or Yahoo! unreliable for particular values?
• How consistent are Google, Live Search and Yahoo! in the number of URLs returned for a search, and which of them typically returns the most URLs?
• How consistent are the search engines in terms of the spread of results (sites, domains and top-level domains), and which search engine gives the widest spread of results for a search?

Data

• 1,587 words
– Blogs
– Word frequency
– http://cybermetrics.wlv.ac.uk/paperdata/
• Three search engines
– Google, Yahoo! and Live Search
– 1000 pages
• Five webometrics
– hit count, number of URLs, number of domains, number of web sites, number of top-level domains
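Four of the five webometrics listed above can be counted directly from the URLs a search returns; the hit count is reported by the search engine itself and cannot be derived this way. The sketch below is an assumption-laden illustration: a "domain" is taken to be the full host name, a "web site" is approximated by the last two labels of the host name (three for names such as `wlv.ac.uk`), and the `SECOND_LEVEL` set is a small illustrative list, not the rule used in the study.

```python
from urllib.parse import urlsplit

# Illustrative second-level labels under two-letter country codes (far from complete).
SECOND_LEVEL = {"co", "ac", "gov", "org", "com", "edu", "net"}


def site_of(host):
    """Heuristic site name: last two labels, or three for names like wlv.ac.uk."""
    labels = host.split(".")
    if len(labels) >= 3 and len(labels[-1]) == 2 and labels[-2] in SECOND_LEVEL:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def spread_metrics(urls):
    """Count unique URLs, domains (host names), sites and top-level domains."""
    unique_urls = set(urls)
    hosts = {urlsplit(url).hostname for url in unique_urls}
    hosts.discard(None)  # ignore strings that have no host part
    return {
        "urls": len(unique_urls),
        "domains": len(hosts),
        "sites": len({site_of(host) for host in hosts}),
        "tlds": len({host.rsplit(".", 1)[-1] for host in hosts}),
    }


# Made-up result list for one query (one duplicate URL on purpose).
EXAMPLE_URLS = [
    "http://cybermetrics.wlv.ac.uk/paperdata/",
    "http://www.wlv.ac.uk/",
    "http://www.example.com/a",
    "http://blog.example.com/b",
    "http://www.example.com/a",
    "http://example.org/",
]
```

With the made-up list above, `spread_metrics(EXAMPLE_URLS)` counts 5 URLs, 5 domains, 3 sites and 3 top-level domains.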

Results - 1

• Hit count estimates

Figure 2a,b,c. Hit count estimates of the three search engines compared (logarithmic scales, excluding data with zero values; r=0.80, 0.96, 0.83).
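A comparison of this kind can be sketched as a Pearson correlation between two engines' hit counts for the same queries. The sketch assumes that zero values are dropped pairwise and that the correlation is taken on log-transformed counts; the caption confirms the logarithmic axes and the exclusion of zeros, but not that r itself was computed on logged values. The function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def log_hit_count_correlation(hits_a, hits_b):
    """Correlate log10 hit counts of two engines, skipping pairs with a zero."""
    pairs = [
        (math.log10(a), math.log10(b))
        for a, b in zip(hits_a, hits_b)
        if a > 0 and b > 0
    ]
    if len(pairs) < 2:
        raise ValueError("need at least two non-zero pairs")
    xs, ys = zip(*pairs)
    return pearson(xs, ys)
```

Each position in `hits_a` and `hits_b` is assumed to hold the two engines' hit count estimates for the same query word.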

Results - 2

• Number of URLs returned

Figure 3a,b,c. URLs returned by the three search engines compared (r=0.71, 0.68, 0.84)

Results - 3

• Number of domains returned

Figure 4a,b,c. Domains returned by the three search engines compared (r=0.65, 0.69, 0.83).

Results - 4

• Number of sites returned

Figure 5a,b,c. Sites returned by the three search engines compared (r=0.66, 0.69, 0.81)

Results - 5

• Number of TLDs returned

Figure 6a,b,c. TLDs returned by the three search engines compared (r=0.74, 0.77, 0.84)

Results - 6

• Comparison within results

Conclusion

• Google seems to be the most consistent in terms of the relationship between its HCEs and number of URLs returned.

• Yahoo! is recommended if the objective is to get results from the widest variety of web sites, domains or TLDs.

Evaluating Search Engine Effects on Web-based Relatedness Measurement

Snippets

• Six manifest records
– snippets
– hit count
– number of URLs
– number of domains
– number of web sites
– number of top-level domains

Dataset

• WordSimilarity-353 Test Collection (TC-353)
– TC353 Full (353 pairs)
– TC353 Testing (153 pairs)
• Three well-known search engines
– Yahoo!
– Google
– Live Search
• Five domains
– general web search (web09)
– .com
– .edu
– .net
– .org

The Model

• A web-based relatedness measure WebMetric(X, Y) measures the association of two objects X and Y:

WebMetric(X, Y) = F(d(X, Y))

– where F is a transfer function and d is a dependency score.
• The dependency score d reflects the mutual dependency of X and Y on the web.

• Given a search engine G and two objects X and Y
– we employ two double-checking functions, f_G(Y@X) and f_G(X@Y), to estimate the dependence between X and Y

$$d(X, Y) = \frac{f_G(Y@X)}{f_G(X)} \times \frac{f_G(X@Y)}{f_G(Y)}$$

$$\mathrm{WebMetric}(X, Y) = e^{-b\,e^{-a\,d(X, Y)}}$$
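The model can be sketched numerically as follows. This assumes the dependency score is the product of the two double-checking ratios, d(X, Y) = f_G(Y@X)/f_G(X) × f_G(X@Y)/f_G(Y), and that the transfer function is a Gompertz curve of the form exp(-b·exp(-a·d)). The function names, the treatment of zero counts, and the parameter values used in the usage note are illustrative assumptions, not values taken from the slides.

```python
import math


def dependency_score(f_x, f_y, f_y_at_x, f_x_at_y):
    """d(X, Y): product of the two double-checking ratios.

    f_x, f_y   -- counts for X and Y on their own
    f_y_at_x   -- count of Y within the results for X, i.e. f_G(Y@X)
    f_x_at_y   -- count of X within the results for Y, i.e. f_G(X@Y)
    """
    if f_x == 0 or f_y == 0:
        return 0.0  # assumption: no evidence for either object means no dependency
    return (f_y_at_x / f_x) * (f_x_at_y / f_y)


def web_metric(d, a, b):
    """Gompertz-style transfer function applied to a dependency score d."""
    return math.exp(-b * math.exp(-a * d))
```

For instance, with made-up counts `dependency_score(100, 50, 20, 10)` gives 0.2 × 0.2 = 0.04, and `web_metric(0.04, a=5.0, b=3.0)` maps that score into the interval (0, 1); for positive `a` and `b` a larger d always yields a larger score.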

Figure 8. Behaviors of the Gompertz Curve and a Mapping Example

$$y(d) = e^{-b\,e^{-a\,d}}$$

Experiments

$$\mathrm{WebMetric}(X, Y) = e^{-b\,e^{-a\,d(X, Y)}}$$
