Tiny Search Engine: Design and Implementation
YAN Hongfei (闫宏飞)
Network Group
Oct. 2003
2
Outline
analysis – which deals with the design requirements and overall architecture of a system;
design – which translates a system architecture into programming constructs (such as interfaces, classes, and method descriptions);
and programming – which implements these programming constructs.
3
Defining System Requirements and Capabilities
Supports the capability to crawl pages with multiple threads
– Supports persistent HTTP connections
– Supports DNS caching
– Supports IP blocking
– Supports the capability to filter unreachable sites
– Supports the capability to parse links
– Supports the capability to crawl recursively
Supports Tianwang-format output
Supports ISAM output
Supports the capability to enumerate a page according to a URL
Supports the capability to search a keyword in the depot
4
The Web as a directed graph
[Diagram: web pages, each containing <href …> links that point to other pages]
Web pages are the nodes; HTML link references are the directed edges.
5
Connectivity of a directed graph
Strong connectivity: between any two nodes there is a directed path.
"Root connectivity": there exists a node from which there is a directed path to every other node.
Theorem: a strongly connected directed graph must be root-connected.
6
Properties of the Web's directed graph
The Web graph at any instant of time contains k connected subgraphs (but we do not know the value of k, nor do we know a priori the structure of the web graph).
If we knew every connected web subgraph, we could build a k-web-spanning forest, but this is a very big "IF".
We can be finer-grained still: what we really care about are the "root-connected subgraphs", but finding those "roots" is not easy.
7
Three main components of the Web
• HyperText Markup Language– A language for specifying the contents and layout
of pages
• Uniform Resource Locators– Identify documents and other resources
• A client-server architecture with HTTP– By which browsers and other clients fetch
documents and other resources from web servers
8
HTML
HTML text is stored in a file on a web server. A browser retrieves the contents of this file from the web server.
- The browser interprets the HTML text
- The server can infer the content type from the filename extension.
<IMG SRC="http://www.cdk3.net/WebExample/Images/earth.jpg">
<P>Welcome to Earth! Visitors may also be interested in taking a look at the <A HREF="http://www.cdk3.net/WebExample/moon.html">Moon</A>.</P>
(etcetera)
9
URL
HTTP URLs are the most widely used. An HTTP URL has two main jobs to do:
- to identify which web server maintains the resource
- to identify which of the resources at that server is required
Scheme: scheme-specific-location, e.g.:
mailto:[email protected]
ftp://ftp.downloadIt.com/software/aProg.exe
http://net.pku.cn/ ….
10
HTTP URLs
• http://servername[:port][/pathNameOnServer][?arguments]
• e.g.
http://www.cdk3.net/
http://www.w3c.org/Protocols/Activity.html
http://e.pku.cn/cgi-bin/allsearch?word=distributed+system
----------------------------------------------------------------------------------------------------
Server DNS name    Pathname on server          Arguments
www.cdk3.net       (default)                   (none)
www.w3c.org        Protocols/Activity.html     (none)
e.pku.cn           cgi-bin/allsearch           word=distributed+system
----------------------------------------------------------------------------------------------------
11
HTTP
• Defines the ways in which browsers and other types of client interact with web servers (RFC 2616)
• Main features
– Request-reply interactions
– Content types. The strings that denote the type of content are called MIME types (RFC 2045, 2046)
– One resource per request (HTTP version 1.0)
– Simple access control
12
More features-services and dynamic pages
• Dynamic content– Common Gateway Interface: a program that web
servers run to generate content for their clients
• Downloaded code– JavaScript– Applet
13
Web Graph-Search Algorithms I
PROCEDURE SPIDER1(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
    While STACK is not empty,
        URLcurr := pop(STACK)
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION
What is wrong with the above algorithm?
14
Depth-first Search
[Diagram: a tree whose nodes are numbered 1–7 in the order a depth-first search visits them]
15
SPIDER1 is incorrect
If the web graph contains cycles => the algorithm will not halt
Where link structures converge on the same page => pages will be replicated in the collection
=> an inefficiently large index
=> duplicates to annoy the user
16
SPIDER1 is incomplete
The web graph has k connected subgraphs. SPIDER1 only reaches pages in the
connected web subgraph where the ROOT page lives.
17
A Correct Spidering Algorithm
PROCEDURE SPIDER2(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
    While STACK is not empty,
      | Do URLcurr := pop(STACK)
      | Until URLcurr is not in COLLECTION
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION
18
A More Efficient Correct Algorithm
PROCEDURE SPIDER3(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
  | Initialize VISITED <big hash-table>
    While STACK is not empty,
      | Do URLcurr := pop(STACK)
      | Until URLcurr is not in VISITED
      | insert-hash(URLcurr, VISITED)
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION
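As a concrete sketch, SPIDER3's loop can be written in C++ over a hypothetical in-memory web graph. Here look-up() is simulated by an adjacency map rather than a real HTTP fetch, and the stored "page" text is a placeholder — this illustrates the visited-table logic only, not TSE's actual crawler code.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stack>
#include <string>
#include <vector>

// A toy web graph: each URL maps to the list of URLs its page links to.
typedef std::map<std::string, std::vector<std::string>> WebGraph;

// SPIDER3 over the toy graph: a STACK of frontier URLs, a VISITED
// hash table consulted before each fetch, and a COLLECTION of
// URL-page pairs. Cycles in the graph cannot make it loop forever.
std::map<std::string, std::string>
Spider3(const WebGraph& g, const std::string& root)
{
    std::map<std::string, std::string> collection;   // COLLECTION
    std::set<std::string> visited;                   // VISITED
    std::stack<std::string> stack;                   // STACK
    stack.push(root);

    while (!stack.empty()) {
        std::string url = stack.top();               // URLcurr := pop(STACK)
        stack.pop();
        if (!visited.insert(url).second)             // already in VISITED?
            continue;                                // skip it
        WebGraph::const_iterator it = g.find(url);   // PAGE := look-up(URLcurr)
        if (it == g.end())
            continue;                                // unreachable page
        collection[url] = "<page for " + url + ">";  // STORE(<URL, PAGE>, ...)
        for (const std::string& link : it->second)   // for every URLi in PAGE
            stack.push(link);
    }
    return collection;
}
```

Even with the cycle a → b → a, the VISITED set guarantees each page is fetched once and the loop terminates.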
19
A More Complete Correct Algorithm
PROCEDURE SPIDER4(G, {SEEDS})
  | Initialize COLLECTION <big file of URL-page pairs>
  | Initialize VISITED <big hash-table>
  | For every ROOT in SEEDS
  |     Initialize STACK <stack data structure>
  |     Let STACK := push(ROOT, STACK)
        While STACK is not empty,
            Do URLcurr := pop(STACK)
            Until URLcurr is not in VISITED
            insert-hash(URLcurr, VISITED)
            PAGE := look-up(URLcurr)
            STORE(<URLcurr, PAGE>, COLLECTION)
            For every URLi in PAGE,
                push(URLi, STACK)
  Return COLLECTION
20
One possible architecture of a crawler
21
What do we need?
Intel x86/Linux (Red Hat Linux) platform
C++ ….
Linus Torvalds
22
Get the homepage of PKU site
[webg@BigPc ]$ telnet www.pku.cn 80           connect to port 80 of the server
Trying 162.105.129.12...                      output from the Telnet client
Connected to rock.pku.cn (162.105.129.12).    output from the Telnet client
Escape character is '^]'.                     output from the Telnet client
GET /                                         the only line we type
<html>                                        the first line output by the web server
<head> <title> 北京大学 </title> ……            (many lines of output omitted here)
</body> </html>
Connection closed by foreign host.            output from the Telnet client
23
Outline
analysis – which deals with the design requirements and overall architecture of a system;
design – which translates a system architecture into programming constructs (such as interfaces, classes, and method descriptions);
and programming – which implements these programming constructs.
24
Defining system objects
URL (RFC 1738)
– <scheme>://<net_loc>/<path>;<params>?<query>#<fragment>
– Apart from the scheme part, the other parts need not all appear in a URL at the same time.
– scheme ":" ::= protocol name.
– "//" net_loc ::= network location / host name, login information.
– "/" path ::= URL path.
– ";" params ::= object parameters.
– "?" query ::= query information.
Page ….
25
Class URL
class CUrl
{
public:
    string m_sUrl;                 // the URL string
    enum url_scheme m_eScheme;     // URL scheme (protocol name)
    string m_sHost;                // host string
    int m_nPort;                   // port number

    /* URL components (URL-quoted). */
    string m_sPath, m_sParams, m_sQuery, m_sFragment;

    /* Extracted path info (unquoted). */
    string m_sDir, m_sFile;

    /* Username and password (unquoted). */
    string m_sUser, m_sPasswd;

public:
    CUrl();
    ~CUrl();
    bool ParseUrl( string strUrl );

private:
    void ParseScheme ( const char *url );
};
26
CUrl::CUrl()
CUrl::CUrl()
{
    this->m_sUrl = "";
    this->m_eScheme = SCHEME_INVALID;
    this->m_sHost = "";
    this->m_nPort = DEFAULT_HTTP_PORT;
    this->m_sPath = "";
    this->m_sParams = "";
    this->m_sQuery = "";
    this->m_sFragment = "";
    this->m_sDir = "";
    this->m_sFile = "";
    this->m_sUser = "";
    this->m_sPasswd = "";
}
27
CUrl::ParseUrl
bool CUrl::ParseUrl( string strUrl )
{
    string::size_type idx;
    this->ParseScheme( strUrl.c_str( ) );
    if( this->m_eScheme != SCHEME_HTTP )
        return false;

    // get host name: skip "http://", then cut at the first '/'
    this->m_sHost = strUrl.substr(7);
    idx = m_sHost.find('/');
    if( idx != string::npos ){
        m_sHost = m_sHost.substr( 0, idx );
    }
    this->m_sUrl = strUrl;
    return true;
}
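A self-contained sketch of the same host-extraction step, extended with the ":port" handling that CUrl stores in m_nPort. The function name and signature here are illustrative, not part of the CUrl class.

```cpp
#include <cassert>
#include <string>

// Minimal stand-alone version of CUrl::ParseUrl's host extraction:
// strip "http://", cut at the first '/', then split off ":port"
// (defaulting to 80, CUrl's DEFAULT_HTTP_PORT).
static bool ParseHttpUrl(const std::string& url, std::string& host, int& port)
{
    const std::string prefix = "http://";
    if (url.compare(0, prefix.size(), prefix) != 0)
        return false;                      // only the HTTP scheme is handled

    host = url.substr(prefix.size());
    std::string::size_type slash = host.find('/');
    if (slash != std::string::npos)
        host = host.substr(0, slash);      // drop the path part

    port = 80;                             // default HTTP port
    std::string::size_type colon = host.find(':');
    if (colon != std::string::npos) {
        port = std::stoi(host.substr(colon + 1));
        host = host.substr(0, colon);
    }
    return !host.empty();
}
```

Note that the real ParseUrl also calls ParseScheme first; this sketch folds that check into the "http://" prefix test.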
28
Defining system objects
URL
– <scheme>://<net_loc>/<path>;<params>?<query>#<fragment>
– Apart from the scheme part, the other parts need not all appear in a URL at the same time.
– scheme ":" ::= protocol name.
– "//" net_loc ::= network location / host name, login information.
– "/" path ::= URL path.
– ";" params ::= object parameters.
– "?" query ::= query information.
Page ….
29
Class Page
public:
    string m_sUrl;
    string m_sLocation;
    string m_sHeader;
    int m_nLenHeader;
    string m_sCharset;
    string m_sContentEncoding;
    string m_sContentType;

    string m_sContent;
    int m_nLenContent;

    string m_sContentLinkInfo;
    string m_sLinkInfo4SE;
    int m_nLenLinkInfo4SE;
    string m_sLinkInfo4History;
    int m_nLenLinkInfo4History;

    string m_sContentNoTags;
    int m_nRefLink4SENum;
    int m_nRefLink4HistoryNum;
    enum page_type m_eType;

    RefLink4SE m_RefLink4SE[MAX_URL_REFERENCES];
    RefLink4History m_RefLink4History[MAX_URL_REFERENCES/2];
    map<string,string,less<string> > m_mapLink4SE;
    vector<string> m_vecLink4History;
30
Class Page …continued
public:
    CPage();
    CPage(string strUrl, string strLocation, char* header, char* body, int nLenBody);
    ~CPage();
    int GetCharset();
    int GetContentEncoding();
    int GetContentType();
    int GetContentLinkInfo();
    int GetLinkInfo4SE();
    int GetLinkInfo4History();
    void FindRefLink4SE();
    void FindRefLink4History();

private:
    int NormallizeUrl(string& strUrl);
    bool IsFilterLink(string plink);
};
31
Sockets used for streams
Requesting a connection (client):
    s = socket(AF_INET, SOCK_STREAM, 0)
    connect(s, ServerAddress)
    write(s, "message", length)

Listening for and accepting a connection (server):
    s = socket(AF_INET, SOCK_STREAM, 0)
    bind(s, ServerAddress)
    listen(s, 5)
    sNew = accept(s, ClientAddress)
    n = read(sNew, buffer, amount)

ServerAddress and ClientAddress are socket addresses
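The two call sequences above can be exercised in a single program by running both endpoints against the loopback interface: a server thread binds, listens and accepts while the main thread connects and writes. This is a demo of the POSIX calls only; it assumes a POSIX platform (as TSE does) and is not TSE code.

```cpp
#include <arpa/inet.h>
#include <cstring>
#include <netinet/in.h>
#include <string>
#include <sys/socket.h>
#include <thread>
#include <unistd.h>

// Sends `message` from a client socket to a server socket over
// loopback TCP and returns what the server read.
std::string Roundtrip(const std::string& message)
{
    // Server side: socket, bind, listen.
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                         // let the kernel pick a free port
    bind(listener, (sockaddr*)&addr, sizeof(addr));
    socklen_t len = sizeof(addr);
    getsockname(listener, (sockaddr*)&addr, &len);   // learn the chosen port
    listen(listener, 5);

    std::string received;
    std::thread server([&] {
        sockaddr_in client; socklen_t clen = sizeof(client);
        int sNew = accept(listener, (sockaddr*)&client, &clen);
        char buffer[256];
        ssize_t n = read(sNew, buffer, sizeof(buffer));
        if (n > 0) received.assign(buffer, (size_t)n);
        close(sNew);
    });

    // Client side: socket, connect, write.
    int s = socket(AF_INET, SOCK_STREAM, 0);
    connect(s, (sockaddr*)&addr, sizeof(addr));
    write(s, message.data(), message.size());
    close(s);
    server.join();
    close(listener);
    return received;
}
```

Binding to port 0 and reading the assigned port back with getsockname avoids hard-coding a port number for the demo.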
32
Issues to consider when connecting to a server

DNS caching
– URLs number in the hundreds of millions, while hosts number only in the millions.

Is the site within the permitted range?
– Some sites do not want crawlers to carry off their resources.
– Searches targeting specific information, e.g. campus-network search or news-site search.
– Charging rules such as: domestic sites reachable through CERNET incur no fee.

Is the site reachable?
– Use a non-blocking connect when connecting to the server.
– Give up once the timer expires.
33
Building the request message and sending it to the server (1/3)

Implementation:
– int HttpFetch(string strUrl, char **fileBuf, char **fileHeadBuf, char **location, int* nPSock)
– modeled on int http_fetch(const char *url_tmp, char **fileBuf) from http://fetch.sourceforge.net
– allocate memory, assemble the message body, send it
34
Fetching the header (2/3)

Implementation:
– int HttpFetch(string strUrl, char **fileBuf, char **fileHeadBuf, char **location, int* nPSock)
– e.g.
HTTP/1.1 200 OK
Date: Tue, 16 Sep 2003 14:19:15 GMT
Server: Apache/2.0.40 (Red Hat Linux)
Last-Modified: Tue, 16 Sep 2003 13:18:19 GMT
ETag: "10f7a5-2c8e-375a5cc0"
Accept-Ranges: bytes
Content-Length: 11406
Connection: close
Content-Type: text/html; charset=GB2312
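The fields a crawler cares about in such a header can be pulled out with a few string operations. These helpers are illustrative, not part of HttpFetch; a production parser would also handle case-insensitive header names and malformed input.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Extract the status code from the status line, e.g.
// "HTTP/1.1 200 OK" -> 200. Returns -1 on malformed input.
int StatusCode(const std::string& header)
{
    std::string::size_type sp = header.find(' ');
    if (sp == std::string::npos) return -1;
    return std::atoi(header.c_str() + sp + 1);
}

// Extract the value of the Content-Length header, e.g.
// "Content-Length: 11406" -> 11406. Returns -1 if absent.
int ContentLength(const std::string& header)
{
    const std::string name = "Content-Length:";
    std::string::size_type pos = header.find(name);
    if (pos == std::string::npos) return -1;
    return std::atoi(header.c_str() + pos + name.size());
}
```

atoi skips the leading space after the colon, so no explicit trimming is needed for well-formed headers.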
35
Fetching the body (3/3)

Implementation:
– int HttpFetch(string strUrl, char **fileBuf, char **fileHeadBuf, char **location, int* nPSock)
– e.g.
<html>
<head>
<meta http-equiv="Content-Language" content="zh-cn">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>Computer Networks and Distributed System</title>
</head>
….
36
Multiple collectors working in parallel
LAN latency is 1–10 ms, with 10–1000 Mbps of bandwidth
Internet latency is 100–500 ms, with 0.010–2 Mbps of bandwidth
Run several machines on one LAN, with multiple concurrent processes per machine:
– the LAN's high bandwidth and low latency let the nodes exchange data freely,
– while multi-process concurrency masks the side effects of the Internet's high latency.
37
How many nodes should crawl in parallel, and how many robots per node? (1/2)
Theoretical estimate:
– The average plain-text page is 13 KB.
– On a 100 Mbps Fast Ethernet link, assuming 100% line utilization, at most (1.0e+8 b/s) / (1500 B × 8 b/B) ≈ 8333 data frames can be in transit at once, i.e. 8333 pages being transferred simultaneously.
– If the LAN's connection to the Internet is 100 Mbps and Internet bandwidth utilization is below 50% (performance tends to drop once network load exceeds 80%; routing also interferes), then fewer than 4000 pages are in transfer at any moment on average.
– So in a crawling system of n nodes, each node should start fewer than 4000/n robots.
38
How many nodes should crawl in parallel, and how many robots per node? (2/2)
Empirical values:
– A node in a real distributed crawling system must also consider CPU and disk utilization: CPU usage should normally stay below 50% and disk usage below 80%, otherwise the machine responds slowly and the programs cannot run properly.
– In the production Tianwang system the LAN is 100 Mbps Ethernet; assuming the LAN's connection to the Internet is also 100 Mbps (this figure is not currently available to us; it is our estimate), fewer than 1000 robots are started.
– That number of robots is sufficient for a search engine on the order of hundreds of millions of pages (http://e.pku.cn/).
39
Crawling efficiency of a single node
An Ethernet data frame is physically constrained to be between 46 and 1500 bytes long.
On a wide-area network with a round-trip time (RTT) of 200 ms and a server processing time (SPT) of 100 ms, a TCP transaction takes roughly 500 ms (2 RTT + SPT).
A page is sent as a series of frames, so sending one page takes at least (13 KB / 1500 B) × 500 ms ≈ 4 s.
If each node runs 100 robot programs, a node should collect (24 × 60 × 60 s / 4 s) × 100 = 2,160,000 pages per day.
Allowing for timeouts and dead pages encountered in practice, each node actually collects fewer than 2,160,000 pages per day.
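The arithmetic on this slide can be checked mechanically. This is only a restatement of the slide's model (13 KB pages, 1500-byte frames, 500 ms per transaction, per-page time rounded down to 4 s); the function names are illustrative.

```cpp
#include <cassert>

// Time to send one page: frames per page times one TCP transaction
// per frame. For 13 KB pages this gives about 4.3 s, which the
// slide rounds to 4 s.
double SecondsPerPage(double pageBytes, double frameBytes, double transactionSec)
{
    return pageBytes / frameBytes * transactionSec;
}

// Daily throughput of one node running `robots` robot programs,
// each taking `secondsPerPage` per page.
long PagesPerDay(int robots, int secondsPerPage)
{
    return 24L * 60 * 60 / secondsPerPage * robots;
}
```

With the rounded 4 s figure, 100 robots give the slide's 2,160,000 pages per day.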
40
Multithreading in TSE
Multiple collector threads concurrently take tasks from the queue of URLs waiting to be crawled.
Limit the number of collectors hitting one site concurrently:
– a machine providing WWW service can only handle a bounded number of pending TCP connections; pending connection requests wait in a backlog queue;
– without such control, collectors working in parallel would inevitably have a side effect on the crawled site resembling a denial-of-service attack.
41
How do we avoid collecting a page twice?
Record information about unvisited and visited URLs
– stored in ISAM format:
whenever a new URL is parsed out, look it up in WebData.idx; if it is already there, discard the URL.
– .md5.visitedurl  E.g. 0007e11f6732fffee6ee156d892dd57e
– .unvisit.tmp  E.g.
http://dean.pku.edu.cn/zhaosheng/北京大学 2001年各省理科录取分数线 .files/filelist.xml
http://mba.pku.edu.cn/Chinese/xinwenzhongxin/xwzx.htm
http://mba.pku.edu.cn/paragraph.css
http://www.pku.org.cn/xyh/oversea.htm
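The visited-URL bookkeeping can be sketched as a small class. TSE keys the table by MD5 digests of URLs (the .md5.visitedurl file above); here std::hash stands in for MD5 to keep the sketch dependency-free, and the class name is illustrative.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>

// In-memory sketch of the visited-URL table: store a digest of each
// visited URL, and consult the set before crawling a new one.
class VisitedTable {
public:
    // Returns true if the URL was new (and records it),
    // false if it was seen before and should be discarded.
    bool MarkVisited(const std::string& url) {
        return m_digests.insert(std::hash<std::string>{}(url)).second;
    }
private:
    std::set<std::size_t> m_digests;
};
```

Storing fixed-size digests instead of full URL strings keeps the table compact, which matters when URLs number in the hundreds of millions.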
42
The domain-name-to-IP mapping problem
Four cases exist:
– one-to-one, one-to-many, many-to-one, many-to-many. One-to-one cannot cause duplicate collection; the other three all can.
– it may be a virtual host
– it may be DNS round-robin
– one site may have several domain names
43
ISAM
Crawled pages are stored in ISAM-style files:
– a data file (WebData.dat)
– and an index file (WebData.idx)
The index file stores each page's starting offset together with its URL.
Function prototype:
– int isamfile(char * buffer, int len);
44
Enumerate a page according to a URL
Using WebData.dat and WebData.idx, look up the specified URL and display the first part of that page on the screen.
Function prototype:
– int FindUrl(char * url, char * buffer, int buffersize);
45
Search a key word in the depot
Scan WebData.dat for pages containing the specified keyword, and output the context surrounding each match.
Function prototype:
– void FindKey(char *key);
– The function prints each URL found and the content around the key; after each match it prompts the user to continue printing, quit, or display the whole page file.
46
Tianwang format output
The raw page depot consists of records; every record holds the raw data of one page; records are stored sequentially, with no delimiter between records.
A record consists of a header (HEAD), data (DATA) and a line feed ('\n'), i.e. HEAD + blank line + DATA + '\n'.
A header consists of properties. Each property is a non-blank line; blank lines are forbidden inside the header.
A property consists of a name and a value, separated by ":". The first property of the header must be the version property, e.g. version: 1.0; the last property must be the length property, e.g. length: 1800.
For simplicity, all property names should be lowercase.
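Serializing one record in the format just described can be sketched in a few lines. The url property between version and length is an illustrative extra, and taking length as the byte count of DATA is our reading of the format, not something the slide states explicitly.

```cpp
#include <cassert>
#include <string>

// Build one Tianwang-format record: version first, length last,
// a blank line separating HEAD from DATA, and a trailing '\n'.
std::string MakeTianwangRecord(const std::string& url, const std::string& data)
{
    std::string head = "version: 1.0\n";
    head += "url: " + url + "\n";
    head += "length: " + std::to_string(data.size()) + "\n";
    return head + "\n" + data + "\n";
}
```

Because records carry their own length and are written back to back, a reader can walk the depot sequentially without any inter-record delimiter.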
47
Summary
Supports the capability to crawl pages with multiple threads
– Supports persistent HTTP connections
– Supports DNS caching
– Supports IP blocking
– Supports the capability to filter unreachable sites
– Supports the capability to parse links
– Supports the capability to crawl recursively
Supports Tianwang-format output
Supports ISAM output
Supports the capability to enumerate a page according to a URL
Supports the capability to search a keyword in the depot
48
TSE package
http://net.pku.edu.cn/~webg/src/TSE/
nohup ./Tse –c seed.pku &
To stop the crawling process
– ps –ef
– kill ??? (the process ID of Tse shown by ps)
49
Thank you for your attention!