Introduction into Search Engines and Information Retrieval

Search Engines Google & Co. vs Internet An Introduction to Information Retrieval

Upload: a-le

Post on 29-Jan-2015

DESCRIPTION

Gives a brief introduction to search engines and information retrieval. Covers the basics of Google and Yahoo, fundamental terms in the area of information retrieval, and an introduction to the well-known Page Rank algorithm.

TRANSCRIPT

Page 1: Introduction into Search Engines and Information Retrieval

Search Engines

Google & Co. vs Internet

An Introduction to Information Retrieval

Page 2: Introduction into Search Engines and Information Retrieval

Contents

Overview
History
Introduction to Information Retrieval
Page Rank by Example
Google & Co.

Page 3: Introduction into Search Engines and Information Retrieval

Search Engines Overview

Deep impact (not only on search)
A big challenge for developers
Search engines keep getting larger
The problems are not new

Page 4: Introduction into Search Engines and Information Retrieval

History

The web happened (1992)
Mosaic/Netscape happened (1993-95)
Crawlers happened (1994): M. Mauldin
Search engines (SEs) happened 1994-1996
– InfoSeek, Lycos, AltaVista, Excite, Inktomi, …
Yahoo decided to go with a directory
Google happened 1996-98
– Tried selling technology to other engines
– SEs thought search was a commodity, portals were in
Microsoft said: whatever …

Page 5: Introduction into Search Engines and Information Retrieval

Present

Most search engines have vanished
Google is a big player
Yahoo decided to de-emphasize directories
– Buys three search engines
Microsoft realized the Internet is here to stay
– Dominates the browser market
– Realizes search is critical

Page 6: Introduction into Search Engines and Information Retrieval

Share Of Searches: July 2005

Page 7: Introduction into Search Engines and Information Retrieval

Google

First launched Sep. 1999
Over 4 billion pages by the beginning of 2004
Strengths
– size and scope
– relevance-based ranking
– cached archive
Weaknesses
– limited search features
– only indexes the first 101 KB of pages and PDFs

Page 8: Introduction into Search Engines and Information Retrieval

Yahoo!

Founded by David Filo and Jerry Yang (1995)
Originally just a subject directory
Strengths
– large, new (Feb. 2004) database
– cached copies
– support for full boolean searching
Weaknesses
– lacks some advanced search features
– indexes only the first 500 KB
– tricky wildcard

Page 9: Introduction into Search Engines and Information Retrieval

MSN Search

Used to rely on third-party databases
Feb. 2005: began using its own database
Strengths
– large, unique database
– cached copies, including the date cached
Weaknesses
– limited advanced features
– no title search, truncation, stemming

Page 10: Introduction into Search Engines and Information Retrieval

How Search Engines Work

Crawler-Based Search Engines
– listings created automatically
Human-Powered Directories
– contents filled in by hand
"Hybrid Search Engines" or Mixed Results
– best of both worlds

Page 11: Introduction into Search Engines and Information Retrieval

Ranking Of Sites

Location and frequency of keywords
– keywords near the top of the page
– spamming filter
"Off the page" ranking
– link structure
– filtering fake links
– clickthrough measurement

Page 12: Introduction into Search Engines and Information Retrieval

Search Engine Placement Tips (1)

Pick your target keywords
Position your keywords
Have relevant content
Avoid search engine stumbling blocks
– have HTML links
– frames can kill
– dynamic doorblocks

Page 13: Introduction into Search Engines and Information Retrieval

Search Engine Placement Tips (2)

Build links
Just say no to search engine spamming
Submit your key pages
Verify & maintain your listing
Beyond search engines

Page 14: Introduction into Search Engines and Information Retrieval

Features for webmasters

Crawling | Yes | No | Notes
Deep Crawl | AllTheWeb, Google, Inktomi | AltaVista, Teoma |
Frames Support | All | n/a |
Robots.txt | All | n/a |
Meta Robots Tag | All | n/a |
Paid Inclusion | All but… | Google |
Full Body Text | All | n/a | Some stop words may not be indexed
Stop Words | AltaVista, Inktomi, Google | FAST | Teoma unknown
Meta Description | All provide some support, but AltaVista, AllTheWeb and Teoma make most use of the tag
Meta Keywords | Inktomi, Teoma | AllTheWeb, AltaVista, Google | Teoma support is "unofficial"
ALT text | AltaVista, Google, Teoma | AllTheWeb, Inktomi |
Comments | Inktomi | Others |

Page 15: Introduction into Search Engines and Information Retrieval

What is Information Retrieval?

Information gets lost in the mass of documents but has to be found again.

Definition: IR is the field that deals with retrieving information/knowledge from large document databases.

Page 16: Introduction into Search Engines and Information Retrieval

Quality of an IR-System (1)

Precision: the ratio of the number of relevant documents retrieved to the total number of documents retrieved.

Precision = 1: all retrieved documents are relevant

Range: Precision ∈ [0;1]

Page 17: Introduction into Search Engines and Information Retrieval

Quality of an IR-System (2)

Recall: the ratio of the number of relevant documents retrieved to the total number of relevant documents (retrieved or not).

Recall = 1: all relevant documents were found

Range: Recall ∈ [0;1]

Page 18: Introduction into Search Engines and Information Retrieval

Quality of an IR-System (3)

Aim of a good IR system: increase both Precision and Recall!

Problem: increasing Precision tends to decrease Recall
– e.g. the search returns 1 document: Recall -> 0, Precision = 1

Increasing Recall tends to decrease Precision
– e.g. the search returns all available documents: Recall = 1, Precision -> 0
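
The two extreme cases can be checked with a few lines of Python. This is only a sketch: the collection of 100 documents, the 10 relevant ones, and the function name precision_recall are assumptions made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: documents the system returned
    relevant:  documents that are actually relevant (ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection of 100 documents, 10 of them relevant.
all_docs = range(100)
relevant = range(10)

narrow = precision_recall([0], relevant)      # return a single relevant document
broad = precision_recall(all_docs, relevant)  # return everything
print(narrow)  # (1.0, 0.1)
print(broad)   # (0.1, 1.0)
```

The narrow query is perfectly precise but misses nine of the ten relevant documents; the broad query finds them all but buries them among ninety irrelevant ones.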

Page 19: Introduction into Search Engines and Information Retrieval

Mathematical models

Boolean Model

Vector Space Model

Page 20: Introduction into Search Engines and Information Retrieval

Boolean model

Checks whether the document includes the search term (true) or not (false). True means the document is relevant.

Problems:
– high variation in the result size, depending on the search term
– no ranking of the result set -> no sorting possible
– "relevance" criterion is too strict (e.g. AND, OR)
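
A minimal sketch of boolean retrieval over an inverted index, assuming a toy collection of three made-up documents and whitespace tokenization; the names docs, index and lookup are hypothetical.

```python
# Hypothetical toy collection; document IDs and texts are made up.
docs = {
    1: "google is a search engine",
    2: "yahoo started as a directory",
    3: "google and yahoo are search engines",
}

# Inverted index: term -> set of IDs of the documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)


def lookup(term):
    """Return the set of documents for which the term is 'true'."""
    return index.get(term, set())


both = lookup("google") & lookup("yahoo")    # google AND yahoo
either = lookup("google") | lookup("yahoo")  # google OR yahoo
print(sorted(both))    # [3]
print(sorted(either))  # [1, 2, 3]
```

The result is an unordered set: AND shrinks it to one document, OR returns the whole collection, and nothing says which hit is the best one.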

Page 21: Introduction into Search Engines and Information Retrieval

Vector space model (1)

Index (document) as weighted vector: $d_j = (w_{1,j}, w_{2,j}, w_{3,j}, \dots, w_{n,j})$

Search (query) as weighted vector: $q = (w_{1,q}, w_{2,q}, w_{3,q}, \dots, w_{n,q})$

Analyze the angle between search vector and document vector by using the cosine function

The smaller the angle, the more relevant the document -> use it for ranking
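
The cosine comparison can be sketched as follows. The three-term vocabulary and all weights are invented for illustration; a real system would derive the weights from the collection.

```python
import math


def cosine(d, q):
    """Cosine of the angle between two equally long weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_d * norm_q)


# Hypothetical weights over a three-term vocabulary.
query = [1.0, 1.0, 0.0]
documents = {
    "d1": [2.0, 2.0, 0.0],  # same direction as the query
    "d2": [1.0, 0.0, 1.0],  # shares one term with the query
    "d3": [0.0, 0.0, 3.0],  # shares no term with the query
}

# Larger cosine = smaller angle = more relevant.
ranking = sorted(documents, key=lambda name: cosine(documents[name], query), reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

Note that d1 ranks first although its weights are twice as large as the query's: only the direction of the vector matters, not its length.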

Page 22: Introduction into Search Engines and Information Retrieval

Vector space model (2)

"Relevance" criterion is more tolerant
No use of boolean operators
Uses weighting
Creates a ranking -> sorting is possible

Problem: automatic weighting of index terms in queries and documents

Page 23: Introduction into Search Engines and Information Retrieval

Weighting Methods (1)

Zipf's law
Global weighting (IDF, "inverse document frequency")
– considers the distribution of words in a language
– picks out words like "or", "and" (words that occur very often) and weights them weakly

$IDF = \log(N/n)$

N = number of documents in the system

n = number of documents including the index term
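
A small sketch of the global weight, assuming a made-up collection of four documents and the natural logarithm (the slide does not name a base).

```python
import math

# Hypothetical toy collection; each document is the set of its terms.
docs = [
    {"google", "and", "yahoo"},
    {"search", "and", "retrieval"},
    {"page", "rank", "and", "google"},
    {"yahoo", "and", "directory"},
]


def idf(term, docs):
    """IDF = log(N / n): N documents in total, n of them contain the term."""
    N = len(docs)
    n = sum(1 for doc in docs if term in doc)
    if n == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log(N / n)


print(idf("and", docs))     # 0.0, the term occurs in every document
print(idf("google", docs))  # log(4/2)
print(idf("rank", docs))    # log(4/1), the rarest term gets the largest weight
```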

Page 24: Introduction into Search Engines and Information Retrieval

Weighting Methods (2)

Local weighting
– considers the term frequency within documents
– weighting corresponds to the frequency
– takes the different lengths of documents into account and normalizes the term frequency

$ntf_{i,j} = \frac{tf_{i,j}}{\max_{l=1,\dots,n} tf_{l,j}}$

$tf_{i,j}$ = absolute frequency of term $t_i$ in document $d_j$
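
The normalization divides every term frequency by the frequency of the most common term in the same document. A sketch, with a made-up document and the hypothetical helper name normalized_tf:

```python
from collections import Counter


def normalized_tf(tokens):
    """Map each term of one document to tf / (largest tf in that document)."""
    if not tokens:
        return {}
    tf = Counter(tokens)       # absolute term frequencies
    max_tf = max(tf.values())  # frequency of the most common term
    return {term: count / max_tf for term, count in tf.items()}


# Hypothetical document, already split into lowercase tokens.
doc = "search engines rank pages and search engines crawl pages and search".split()
ntf = normalized_tf(doc)
print(ntf["search"])  # 1.0, the most frequent term
print(ntf["crawl"])   # one third of the frequency of "search"
```

Because the most frequent term always gets the value 1, a long document cannot outrank a short one merely by repeating its words more often.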

Page 25: Introduction into Search Engines and Information Retrieval

Weighting Methods (3)

tf-idf weighting
– combination of global (inverse document frequency) and local (normalized term frequency) weighting

$w_{i,j} = ntf_{i,j} \cdot idf_i$
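
Putting both weights together might look like the sketch below. The three documents are invented, and the natural logarithm is an assumption, since the slides do not name a base.

```python
import math
from collections import Counter

# Hypothetical toy collection, already tokenized.
docs = {
    "d1": "google ranks pages and google crawls pages".split(),
    "d2": "yahoo started as a directory and added search".split(),
    "d3": "page rank and link analysis".split(),
}


def tf_idf(docs):
    """Return {document: {term: weight}} with weight = ntf * idf."""
    N = len(docs)
    # Number of documents that contain each term.
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        max_tf = max(tf.values())
        weights[name] = {
            term: (count / max_tf) * math.log(N / doc_freq[term])
            for term, count in tf.items()
        }
    return weights


w = tf_idf(docs)
print(w["d1"]["and"])     # 0.0, "and" occurs in every document
print(w["d1"]["google"])  # log(3): frequent in d1, absent elsewhere
```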

Page 26: Introduction into Search Engines and Information Retrieval

Web-Mining

Web-Mining ≈ Data-Mining, but with different problems

Mining of: Content, Structure or User
– Content-Mining: VSM, BM (vector space model, boolean model)
– Structure-Mining: analysis of the structure
– User-Mining: information about the users of a page

Let's have a deeper look at Web-Structure-Mining!

Page 27: Introduction into Search Engines and Information Retrieval

History

IR necessary but not sufficient for web search
Doesn't address web navigation
– Query ibm seeks www.ibm.com
– To IR, www.ibm.com may look less topical than a quarterly report
Link analysis
– Hubs and authorities (Jon Kleinberg)
– PageRank (Brin and Page)
  – Computed on the entire graph
  – Query independent
  – Faster if serving lots of queries
– Others…

Page 28: Introduction into Search Engines and Information Retrieval

Analysis of Hyperlinks

Links
– Long history in citation analysis
– Navigational tools on the web
– Also a sign of popularity
– Can be thought of as recommendations (source recommends destination)
– Also describe the destination: anchor text

Idea: the existence of a hyperlink between two pages also carries information

Hyperlinks can be used to:
– Create a weighting of web pages
– Find pages with similar topics
– Group pages by different contexts of meaning

Page 29: Introduction into Search Engines and Information Retrieval

Hubs and Authorities

Describe the quality of a website

Authorities: pages that are linked to very often

Hubs: pages that link to other pages very often

Example: Authority: Heise.de, Hub: Peter's Linklist
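
The slide only gives the counting intuition. Kleinberg's algorithm, as it is commonly described, refines it by mutual reinforcement: a good hub links to good authorities, and a good authority is linked from good hubs. The sketch below follows that common description; the link graph, the page names and the fixed iteration count are assumptions for illustration.

```python
import math


def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: a link list pointing at three sites,
# one of which is also linked from a second page.
links = {
    "linklist": ["news", "shop", "wiki"],
    "blog": ["news"],
}
hub, auth = hits(links)
print(max(hub, key=hub.get))    # linklist
print(max(auth, key=auth.get))  # news
```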

Page 30: Introduction into Search Engines and Information Retrieval

Page Rank

Invented by Lawrence Page and Sergey Brin
Algorithm itself is well described
Implementations are not (Google)
Main idea:
– considers the relationship of all links in the WWW
– The more often a document is linked to, the more important it is
– Not every link counts the same: a link from an important page is worth more

Page 31: Introduction into Search Engines and Information Retrieval

Page Rank Algorithm

PR(p0): Page Rank of a page

PR(pi): Page Rank of the pages linking to p0

outlink(pi): number of outgoing links of pi

q = damping factor of the random walk (normally q = 0.85)

Attention: recursive function!
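
The formula itself did not survive in this transcript. The form below is a reconstruction from the symbols listed on this slide, and it reproduces the equations of the worked examples on the following slides; the sum runs over all pages $p_i$ that link to $p_0$.

```latex
PR(p_0) = (1 - q) + q \sum_{p_i \to p_0} \frac{PR(p_i)}{outlink(p_i)}
```

For instance, with q = 0.5 and a single page C linking to A through its only outgoing link, this gives PR(A) = 0.5 + 0.5 PR(C).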

Page 32: Introduction into Search Engines and Information Retrieval

Page Rank Example

With q = 0.5:

PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

Solution:

PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615

Page 33: Introduction into Search Engines and Information Retrieval

Page Rank Calculation

Solving the system of equations directly is not feasible at the scale of the web
Iterative calculation of Page Rank is necessary
Each page starts with 1
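
A minimal sketch of the iterative calculation, applied to the three-page example. The link graph (A links to B and C, B links to C, C links to A) is inferred from the example's equations, and the fixed number of 100 iterations is an assumption; a real implementation would stop once the values no longer change noticeably.

```python
def page_rank(links, q=0.85, iterations=100):
    """Iterate PR(p) = (1 - q) + q * sum(PR(i) / outlinks(i)) over pages i linking to p.

    links maps every page to the list of pages it links to. Pages without
    outgoing links (dangling links) are not treated specially in this sketch.
    """
    pr = {page: 1.0 for page in links}  # each page starts with 1
    for _ in range(iterations):
        new_pr = {page: 1.0 - q for page in links}
        for source, targets in links.items():
            for target in targets:
                # A page passes its rank on in equal shares to its targets.
                new_pr[target] += q * pr[source] / len(targets)
        pr = new_pr
    return pr


# Graph inferred from the three-page example: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(links, q=0.5)
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
```

With q = 0.5 the values settle at the fractions 14/13, 10/13 and 15/13 from the example after a few dozen rounds.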

Page 34: Introduction into Search Engines and Information Retrieval

Page Rank Incoming Links

Given: PR(A) = PR(B) = PR(C) = PR(D) = 1, PR(X) = 10

PR(A) = 0.5 + 0.5 (PR(X) + PR(D)) = 5.5 + 0.5 PR(D)
PR(B) = 0.5 + 0.5 PR(A)
PR(C) = 0.5 + 0.5 PR(B)
PR(D) = 0.5 + 0.5 PR(C)

PR(A) = 19/3 = 6.33
PR(B) = 11/3 = 3.67
PR(C) = 7/3 = 2.33
PR(D) = 5/3 = 1.67

Page 35: Introduction into Search Engines and Information Retrieval

Page Rank Outgoing Links

PR(A) = 0.25 + 0.75 PR(B)
PR(B) = 0.25 + 0.375 PR(A)
PR(C) = 0.25 + 0.75 PR(D) + 0.375 PR(A)
PR(D) = 0.25 + 0.75 PR(C)

PR(A) = 14/23
PR(B) = 11/23
PR(C) = 35/23
PR(D) = 32/23

Page 36: Introduction into Search Engines and Information Retrieval

Page Rank other Examples

Dangling Links

Different hierarchies

Page 37: Introduction into Search Engines and Information Retrieval

Page Rank Implementation

Normally implemented as a weighting system
Additional content search needed for retrieving the document set
Also involved in Page Rank:
– The markup of a link
– The position of a link in the document
– The distance between the pages (e.g. other domain)
– The context of the linking page
– How up to date the page is

Page 38: Introduction into Search Engines and Information Retrieval

Google Past

1995 research project at Stanford University

Page 39: Introduction into Search Engines and Information Retrieval

Google Past

One of the earliest storage systems

Page 40: Introduction into Search Engines and Information Retrieval

Google – How it began

Peak of google.stanford.edu

Page 41: Introduction into Search Engines and Information Retrieval

Google

Servers 1999

Page 42: Introduction into Search Engines and Information Retrieval

Google

Page 43: Introduction into Search Engines and Information Retrieval

Google by Numbers

Index: 40 TB (4 billion pages with an estimated size of 10 KB each)
Up to 2000 servers in one cluster
Over 30 clusters
One petabyte of data per cluster: so much that a hard disk error rate of 1 in 10^15 bits becomes a real problem
Each day, in each larger cluster, two servers will normally break down
System running stably (without any breakdowns) since February 2000 (Yes, they don't use Windows server…)

Page 44: Introduction into Search Engines and Information Retrieval

Outlook: Semantic Web

Information should be readable by humans & machines

Unified description of data & knowledge
First approaches: metadata, e.g. Dublin Core

Current: RDF

Page 45: Introduction into Search Engines and Information Retrieval

Outlook: Personalized Search Engine

A new approach: personalized search engines

Advantage: you only get what you're personally interested in

Disadvantage: a lot of data has to be collected

Example: www.fooxx.com

Page 46: Introduction into Search Engines and Information Retrieval

Links

www.searchenginewatch.com (general information about search engines)

http://pr.efactory.de (page rank algorithm)

http://zdnet.de/itmanager/unternehmen/0,39023441,39129811-2,00.htm (article: “Google’s Technologien: Von Zauberei kaum zu unterscheiden”)

Page 47: Introduction into Search Engines and Information Retrieval

The End

Thank you for your attention