Introduction to Search Engines and Information Retrieval
DESCRIPTION
Gives a brief introduction to search engines and information retrieval. Covers basics about Google and Yahoo, fundamental terms in the area of information retrieval, and an introduction to the famous PageRank algorithm.
TRANSCRIPT
Search Engines
Google & Co. vs. the Internet
An Introduction to Information Retrieval
Contents
Overview
History
Introduction to Information Retrieval
Page Rank by Example
Google & Co.
Search Engines Overview
Deep impact (not only for search)
Developers face big challenges
Search engines are getting larger
The problems are not new
History
The web happened (1992)
Mosaic/Netscape happened (1993-95)
Crawlers happened (1994): M. Mauldin
SEs happened 1994-1996
– InfoSeek, Lycos, Altavista, Excite, Inktomi, …
Yahoo decided to go with a directory
Google happened 1996-98
– Tried selling technology to other engines
– SEs thought search was a commodity, portals were in
– Microsoft said: whatever …
Present
Most search engines have vanished
Google is a big player
Yahoo decided to de-emphasize directories
– Buys three search engines
Microsoft realized the Internet is here to stay
– Dominates the browser market
– Realizes search is critical
Share Of Searches: July 2005
Google
First launched Sep. 1999
Over 4 billion pages by the beginning of 2004
Strengths: size and scope, relevance-based ranking, cached archive
Weaknesses: limited search features, only indexes the first 101 KB of pages and PDFs
Yahoo!
David Filo, Jerry Yang => 1995
Originally just a subject directory
Strengths: large, new (Feb. 2004) database, cached copies, support of full boolean searching
Weaknesses: lack of some advanced search features, indexes only the first 500 KB, tricky wildcard handling
MSN Search
Used to use third-party databases; in Feb. 2005 began using its own database
Strengths: large, unique database, cached copies including the date cached
Weaknesses: limited advanced features; no title search, truncation, or stemming
How Search Engines Work
Crawler-based search engines: listings are created automatically
Human-powered directories: contents are compiled by hand
"Hybrid search engines" or mixed results: the best of both worlds
Ranking Of Sites
Location and frequency of keywords: keywords near the top of a page count more
Spamming filters
"Off the page" ranking:
– link structure
– filtering out fake links
– clickthrough measurement
Search Engine Placement Tips (1)
Pick your target keywords
Position your keywords
Have relevant content
Avoid search engine stumbling blocks:
– have HTML links
– frames can kill
– dynamic doorblocks
Search Engine Placement Tips (2)
Build links
Just say no to search engine spamming
Submit your key pages
Verify & maintain your listing
Beyond search engines
Features for webmasters
Crawling | Yes | No | Notes
Deep Crawl | AllTheWeb, Google, Inktomi | AltaVista, Teoma |
Frames Support | All | | n/a
Robots.txt | All | | n/a
Meta Robots Tag | All | | n/a
Paid Inclusion | All but… | Google |
Full Body Text | All | | Some stop words may not be indexed
Stop Words | AltaVista, Inktomi, Google | FAST | Teoma unknown
Meta Description | All | | All provide some support, but AltaVista, AllTheWeb and Teoma make most use of the tag
Meta Keywords | Inktomi, Teoma | AllTheWeb, AltaVista, Google | Teoma support is "unofficial"
ALT Text | AltaVista, Google, Teoma | AllTheWeb, Inktomi |
Comments | Inktomi | Others |
What is Information Retrieval?
Information gets lost in the sheer amount of documents, but has to be found again.
Definition: IR is the field that deals with the retrieval of information/knowledge from large document databases.
Quality of an IR-System (1)
Precision: the ratio of the number of relevant documents retrieved to the total number of documents retrieved.
Precision = 1: all retrieved documents are relevant
Precision ∈ [0;1]
Quality of an IR-System (2)
Recall: the ratio of the number of relevant documents retrieved to the total number of relevant documents (retrieved and not retrieved).
Recall = 1: all relevant documents were found
Recall ∈ [0;1]
Quality of an IR-System (3)
Aim of a good IR system: increase both Precision and Recall!
Problem: increasing Precision causes a decrease of Recall,
e.g. the search returns exactly 1 (relevant) document:
Recall -> 0, Precision = 1
Increasing Recall causes a decrease of Precision,
e.g. the search returns all available documents:
Recall = 1, Precision -> 0
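In code, both measures reduce to simple set arithmetic. A minimal Python sketch, with made-up document IDs for illustration:

```python
# Toy precision/recall computation; the document IDs are made up for illustration.
relevant  = {"d1", "d2", "d3", "d4"}   # all relevant documents in the collection
retrieved = {"d1", "d2", "d7"}         # documents returned by a search

hits = relevant & retrieved
precision = len(hits) / len(retrieved)  # 2/3 ~ 0.67: one retrieved doc is noise
recall    = len(hits) / len(relevant)   # 2/4 = 0.50: half the relevant docs found
print(precision, recall)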
Mathematical models
Boolean Model
Vector Space Model
Boolean model
Checks whether a document contains the search term (true) or not (false); true means the document is considered relevant.
Problems: high variation in result size, depending on the search term; no ranking of the result set -> no sorting possible; the "relevance" criterion is too strict (e.g. AND, OR). See the sketch below.
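A minimal sketch of the boolean model over a toy inverted index (documents and terms are made up for illustration); note that AND/OR matching yields a plain set that cannot be ranked:

```python
# Minimal boolean retrieval sketch over a toy inverted index (made-up documents).
docs = {
    "d1": "search engines index the web",
    "d2": "information retrieval deals with finding information",
    "d3": "web search and information retrieval",
}

# Build the inverted index: term -> set of documents containing the term.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# AND / OR are plain set operations; every match is equally "relevant",
# so the result set cannot be ranked or sorted.
print(index["web"] & index["search"])      # AND -> {'d1', 'd3'}
print(index["web"] | index["retrieval"])   # OR  -> {'d1', 'd2', 'd3'}
```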
Vector space model (1)
The index is represented as a weighted vector per document; the search query becomes a weighted vector as well.
Analyze the angle between the search vector and a document vector by using the cosine function:
the smaller the angle, the more relevant the document -> use it for ranking.
$d_j = (w_{1,j}, w_{2,j}, w_{3,j}, \ldots, w_{n,j})$
$q = (w_{1,q}, w_{2,q}, w_{3,q}, \ldots, w_{n,q})$
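A minimal sketch of this cosine ranking; the weight values below are made-up illustration data, not taken from the slides:

```python
# Cosine similarity between a weighted query vector q and document vectors d_j.
import math

def cosine(d, q):
    dot = sum(dw * qw for dw, qw in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q)

vectors = {
    "d1": [0.5, 0.8, 0.0],   # (w_{1,1}, w_{2,1}, w_{3,1}), illustration values
    "d2": [0.9, 0.1, 0.3],
}
q = [0.4, 0.7, 0.1]

# Rank documents by descending cosine: smaller angle = more relevant.
for name in sorted(vectors, key=lambda n: -cosine(vectors[n], q)):
    print(name, round(cosine(vectors[name], q), 3))   # d1 ~ 0.992, d2 ~ 0.594
```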
Vector space model (2)
The "relevance" criterion is more tolerant: no boolean operators, uses weighting, creates a ranking -> sorting is possible.
Problem: automatic weighting of the index terms in queries and documents.
Weighting Methods (1)
Law of Zipf: global weighting (IDF, "inverse document frequency") considers the distribution of words in a language; it filters out words like "or" and "and" (words that occur very frequently) by weighting them weakly.
$IDF_i = \log(N / n_i)$
N = number of documents in the system
$n_i$ = number of documents containing the index term
Weighting Methods (2)
Local weighting considers the term frequency within a document and weights terms according to that frequency; it regards the different lengths of documents and normalizes the term frequency.
$ntf_{i,j} = \dfrac{tf_{i,j}}{\max_{l=1,\ldots,n} tf_{l,j}}$
$tf_{i,j}$ = absolute frequency of term $t_i$ in document $d_j$
Weighting Methods (3)
tf-idf weighting: combination of global (inverse document frequency) and local (normalized term frequency) weighting:
$w_{i,j} = ntf_{i,j} \cdot idf_i$
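A minimal sketch that follows the three slide formulas (normalized term frequency, inverse document frequency, and their product); the documents are toy data:

```python
# tf-idf sketch following the slide formulas:
#   ntf_{i,j} = tf_{i,j} / max_l tf_{l,j},  idf_i = log(N / n_i),
#   w_{i,j}   = ntf_{i,j} * idf_i
import math

docs = [
    "information retrieval retrieval systems",
    "web search engines",
    "information about web search",
]
N = len(docs)

def tfidf(term, doc):
    tokens = doc.split()
    tf = tokens.count(term)                        # absolute term frequency tf_{i,j}
    max_tf = max(tokens.count(t) for t in set(tokens))
    ntf = tf / max_tf                              # normalized term frequency
    n = sum(1 for d in docs if term in d.split())  # documents containing the term
    idf = math.log(N / n) if n else 0.0            # inverse document frequency
    return ntf * idf

print(tfidf("retrieval", docs[0]))  # high weight: frequent here, rare elsewhere
print(tfidf("web", docs[1]))        # lower weight: occurs in two of three docs
```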
Web-Mining
Web mining ≈ data mining, but with different problems.
Mining of content, structure, or usage:
– Content mining: VSM, BM
– Structure mining: analysis of the link structure
– Usage mining: information about the users of a page
Let's have a deeper look at web structure mining!
History
IR is necessary but not sufficient for web search: it doesn't address web navigation.
The query "ibm" seeks www.ibm.com; to IR, www.ibm.com may look less topical than a quarterly report.
Link analysis:
– Hubs and authorities (Jon Kleinberg)
– PageRank (Brin and Page): computed on the entire graph, query independent, faster if serving lots of queries
– Others…
Analysis of Hyperlinks
Links have a long history in citation analysis. On the web they are navigational tools, but also a sign of popularity; they can be thought of as recommendations (the source recommends the destination), and they also describe the destination: anchor text.
Idea: the existence of a hyperlink between two pages carries information in itself.
Hyperlinks can be used to: create a weighting of web pages, find pages with similar topics, group pages by different contexts of meaning.
Hubs and Authorities
Describe the quality of a website:
Authorities: pages that are linked to very often
Hubs: pages that link to many other pages very often
Example: Authority: Heise.de; Hub: Peter's link list – see the sketch below.
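A rough sketch of Kleinberg's hub/authority iteration on a toy link graph; the page names (including "golem" as a second authority) are hypothetical illustration data:

```python
# HITS iteration: authority = sum of hub scores of pages linking in,
# hub = sum of authority scores of pages linked to; normalize each round.
import math

links = {
    "peters_linklist": ["heise", "golem"],  # hypothetical hub page
    "blog": ["heise"],
    "heise": [],
    "golem": [],
}

hub  = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(20):
    for p in auth:
        auth[p] = sum(hub[s] for s in links if p in links[s])
    for p in hub:
        hub[p] = sum(auth[t] for t in links[p])
    na = math.sqrt(sum(v * v for v in auth.values())) or 1.0
    nh = math.sqrt(sum(v * v for v in hub.values())) or 1.0
    auth = {p: v / na for p, v in auth.items()}
    hub = {p: v / nh for p, v in hub.items()}

print(max(auth, key=auth.get), max(hub, key=hub.get))  # heise, peters_linklist
```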
Page Rank
Invented by Lawrence Page and Sergey Brin. The algorithm itself is well described; the implementations are not (Google).
Main idea: the relationship of all links in the WWW. The more a document is linked to, the more important it is. Not every link counts the same – a link from an important page is worth more.
Page Rank Algorithm
$PR(p_0) = (1 - q) + q \sum_i \frac{PR(p_i)}{|outlink(p_i)|}$
PR(p0): Page Rank of a page
PR(pi): Page Rank of the pages linking to p0
outlink(pi): all outgoing links of pi
q: damping factor modelling the random surfer (normally q = 0.85)
Attention: recursive function!
Page Rank Example
With q = 0.5:
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
Page Rank Calculation
Solving the system of equations directly is not feasible for the real web, so an iterative calculation of Page Rank is necessary; each page starts with PR = 1. A minimal sketch follows below.
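A minimal iterative sketch in Python (a toy illustration, not Google's implementation). The link graph below reproduces the three-page example from the previous slide; with q = 0.5 it converges to the same values:

```python
# Iterative PageRank sketch. `links` maps each page to the pages it links to;
# the graph is the A/B/C example from the previous slide.
def pagerank(links, q=0.85, iterations=50):
    pages = list(links)
    pr = {p: 1.0 for p in pages}              # each page starts with PR = 1
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # sum PR(p_i) / |outlink(p_i)| over all pages p_i that link to p
            incoming = sum(pr[s] / len(links[s]) for s in pages if p in links[s])
            new_pr[p] = (1 - q) + q * incoming
        pr = new_pr
    return pr

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}, q=0.5))
# -> PR(A) ~ 1.0769, PR(B) ~ 0.7692, PR(C) ~ 1.1538 (14/13, 10/13, 15/13)
```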
Page Rank Incoming Links
Given PR(A) = PR(B) = PR(C) = PR(D) = 1 initially, and PR(X) = 10:
PR(A) = 0.5 + 0.5 (PR(X) + PR(D)) = 5.5 + 0.5 PR(D)
PR(B) = 0.5 + 0.5 PR(A)
PR(C) = 0.5 + 0.5 PR(B)
PR(D) = 0.5 + 0.5 PR(C)

PR(A) = 19/3 = 6.33
PR(B) = 11/3 = 3.67
PR(C) = 7/3 = 2.33
PR(D) = 5/3 = 1.67
Page Rank Outgoing Links
With q = 0.75:
PR(A) = 0.25 + 0.75 PR(B)
PR(B) = 0.25 + 0.375 PR(A)
PR(C) = 0.25 + 0.75 PR(D) + 0.375 PR(A)
PR(D) = 0.25 + 0.75 PR(C)

PR(A) = 14/23
PR(B) = 11/23
PR(C) = 35/23
PR(D) = 32/23
Page Rank: Other Examples
Dangling links
Different hierarchies
Page Rank Implementation
Normally implemented as a weighting system; an additional content search is needed to retrieve the document set.
Also involved in Page Rank:
– the markup of a link
– the position of a link in the document
– the distance between the pages (e.g. a different domain)
– the context of the linking page
– the actuality (freshness) of the page
Google Past
1995 research project at Stanford University
Google Past
One of the earliest storage systems
Google – How it began
Peak of google.stanford.edu
Servers 1999
Google by Numbers
Index: 40 TB (4 billion pages with an estimated size of 10 KB each)
Up to 2,000 servers in one cluster, over 30 clusters
One petabyte of data per cluster – so much that a hard-disk error rate of 1 in 10^15 bits becomes a real problem
Each day, in each larger cluster, typically two servers break down
System running stably (without any total breakdown) since February 2000 (yes, they don't use Windows servers…)
Look-out: Semantic Web
Information should be readable by both humans and machines.
Unified description of data & knowledge.
First approaches: metadata, e.g. Dublin Core
Current: RDF
Look-out: Personalized Search Engine
A new approach: personalized search engines.
Advantage: you only get what you're personally interested in
Disadvantage: a lot of personal data has to be collected
Example: www.fooxx.com
Links
www.searchenginewatch.com (general information about search engines)
http://pr.efactory.de (page rank algorithm)
http://zdnet.de/itmanager/unternehmen/0,39023441,39129811-2,00.htm (article: "Google's Technologien: Von Zauberei kaum zu unterscheiden" – "Google's technologies: scarcely distinguishable from magic")
The End
Thank you for your attention