Introduction to Computer Networks 2004, 劉震昌
Post on 19-Dec-2015
Review of Lab#2 and Homework#1
“Lab” means “Laboratory”, not “Label”.
Algorithm steps must be executed in order. You cannot skip a step on your own initiative. Why?
Please write your homework subject line correctly.
Late homework is not accepted.
Outline
Origins of the Internet
Origins of the WWW (World Wide Web)
HTML (Hypertext Markup Language) guide
Searching the Web
  Search engines (Web browser)
  Web directories
Origins of the Internet
In 1969, the US DoD’s ARPA (Advanced Research Projects Agency) built the ARPANET
  Only 4 nodes
  Decentralized system
  Data transmission
(Reference website)
Origins of the Internet (cont.)
In 1974, TCP/IP was developed; it became a standard in 1983
  TCP (Transmission Control Protocol)
  IP (Internet Protocol)
  The importance of network communication protocols
Growth of ARPANET --> Internet
  Internetworking
  No organization owns or controls it
IP Service
Where is your computer on the Internet?
Current Internet (IPv4)
  32 bits represent an IP address
  Ex. 163.22.20.129
  What is your computer’s IP address? Try: ipconfig
[Figure: hosts with the addresses 163.22.20.129, 163.22.20.118, and 163.22.22.119]
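To make the "32 bits" point concrete, here is a minimal sketch that converts the dotted-decimal example address from the slide into its 32-bit form using Python's standard `ipaddress` module. The helper name `to_32_bits` is made up for this illustration.

```python
import ipaddress

def to_32_bits(dotted: str) -> str:
    """Return the 32-bit binary string for a dotted-decimal IPv4 address."""
    value = int(ipaddress.IPv4Address(dotted))
    return format(value, "032b")

address = "163.22.20.129"          # example address from the slide
value = int(ipaddress.IPv4Address(address))
bits = to_32_bits(address)

# Each of the four decimal numbers is one 8-bit group of the 32 bits.
octets = [bits[i:i + 8] for i in range(0, 32, 8)]
print(value)
print(".".join(octets))
```

Each decimal number in the dotted form can only range from 0 to 255 because it stands for exactly 8 of the 32 bits.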
Address Resolution Protocol (ARP)
An IP address is an abstraction; physical network hardware cannot locate a computer from its IP address, so the IP address must be translated into a hardware address
Address resolution techniques
  Table look-up
  Closed-form computation
  Message exchange
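The first two techniques can be sketched in a few lines. This is an illustration only: the hardware addresses in the table are made-up documentation values, and the closed-form rule (hardware address = low 8 bits of the IP address) is an assumption that only works on networks whose hardware addresses are configured to match.

```python
import ipaddress
from typing import Dict, Optional

# Technique 1: table look-up. Hypothetical table mapping IP -> hardware address.
arp_table: Dict[str, str] = {
    "163.22.20.129": "00:00:5e:00:53:01",
    "163.22.20.118": "00:00:5e:00:53:02",
}

def resolve_by_table(ip: str) -> Optional[str]:
    """Return the hardware address stored for ip, or None if it is unknown."""
    return arp_table.get(ip)

# Technique 2: closed-form computation. Assumes the hardware address was
# configured to equal the low-order 8 bits of the IP address.
def resolve_by_computation(ip: str) -> int:
    return int(ipaddress.IPv4Address(ip)) & 0xFF

print(resolve_by_table("163.22.20.129"))
print(resolve_by_computation("163.22.20.129"))
```

The third technique, message exchange, would be used when the table has no entry: the sender asks the network which machine owns the IP address and waits for a reply.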
Computers on the Net
Every Internet host has a unique IP address, but it is hard to remember, so hosts also have host names
  e.g., arbor.watson.ibm.com is 9.2.13.20 and arbor.ee.ntu.edu.tw is 140.112.21.236
  Try: nslookup
Domain Name Server
A host name must be converted into an IP address
Domain Name Servers (DNS)
  Contain a database (look-up table) that maps host names to IP addresses
  There are many domain name servers: “.com”, “.gov”, “.edu”, “.tw”
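The look-up table idea can be sketched as a small dictionary. The two entries are the host names given earlier in these slides; the functions `lookup` and `top_level_domain` are hypothetical names for this sketch, and a real resolver queries remote servers rather than a local dictionary.

```python
from typing import Dict, Optional

# A toy name database using the two hosts mentioned in the slides.
name_table: Dict[str, str] = {
    "arbor.watson.ibm.com": "9.2.13.20",
    "arbor.ee.ntu.edu.tw": "140.112.21.236",
}

def lookup(host_name: str) -> Optional[str]:
    """Return the IP address for host_name, or None if it is not in the table."""
    return name_table.get(host_name.lower().rstrip("."))

def top_level_domain(host_name: str) -> str:
    """Return the last label, which decides which servers are responsible."""
    return "." + host_name.rstrip(".").rsplit(".", 1)[-1]

print(lookup("arbor.ee.ntu.edu.tw"), top_level_domain("arbor.ee.ntu.edu.tw"))
```

Splitting the name space by its last label is what allows many servers to share the work instead of one server holding every name.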
Internet applications
telnet: a terminal emulation program for TCP/IP networks such as the Internet
ftp (File Transfer Protocol)
[Figure: a client runs "telnet 163.22.22.119"; the host 163.22.22.119 runs the telnet server]
Outline
Origins of WWW (World Wide Web)
Web browser
HTML (HyperText Markup Language)
HTTP (HyperText Transfer Protocol)
Origins of WWW
World Wide Web (WWW)
  Proposed in 1989 by Tim Berners-Lee at CERN (the European Organization for Nuclear Research, a particle physics laboratory)
  A large-scale, online repository of information
  Develops interoperable technologies (specifications, guidelines, software, and tools)
    Currently the W3C (World Wide Web Consortium) does this work
Origins of WWW (cont.)
Data format: HTML (HyperText Markup Language)
  Allows hypertext links (URL: Uniform Resource Locator) to other documents on the Web
Protocol: HTTP (HyperText Transfer Protocol)
  Data exchange standard on the Web: a common format and transfer protocol for exchanging data
URL format: protocol://computer_name:port/document_name
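The four parts of a URL can be pulled apart with Python's standard `urllib.parse.urlsplit`. In this sketch the host name is the one used elsewhere in these slides, while the port `80` and the document name `/index.html` are assumed values chosen for illustration.

```python
from urllib.parse import urlsplit

# Port and document name are assumptions added for the example.
url = "http://www.csie.ncnu.edu.tw:80/index.html"
parts = urlsplit(url)

print("protocol:     ", parts.scheme)
print("computer_name:", parts.hostname)
print("port:         ", parts.port)
print("document_name:", parts.path)
```

When the port is left out of a URL, `parts.port` is `None` and the browser falls back to the default port for the protocol.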
Web browser
Tools for reading HTML documents
[Figure: Web browser (client) and Web server (e.g., running IIS)]
  Client: click a link --> send request
  Server: find document --> return HTML document
  Client: display
Connection terminated after receiving all items
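The request and reply in this exchange are plain text. The sketch below builds an HTTP/1.0 style request and splits a canned reply, without touching the network. The host `www.example.com`, the document name, and the canned reply are assumptions made up for the example.

```python
from typing import Tuple

def build_request(host: str, document: str) -> bytes:
    """Build the text a browser sends after the user clicks a link."""
    lines = [f"GET {document} HTTP/1.0", f"Host: {host}", "", ""]
    return "\r\n".join(lines).encode("ascii")

def split_response(raw: bytes) -> Tuple[int, bytes]:
    """Split a server reply into its status code and the returned document."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n")[0].decode("ascii")
    _version, code, _reason = status_line.split(" ", 2)
    return int(code), body

request = build_request("www.example.com", "/index.html")

# A canned reply standing in for what a Web server might return.
canned = b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<HTML></HTML>"
status, body = split_response(canned)
print(request.decode("ascii"))
print(status, body)
```

The blank line that ends the request headers is what tells the server the request is complete.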
Web browser (cont.)
Text-mode browser: lynx
  lynx http://www.csie.ncnu.edu.tw
Graphics-mode browsers
  NCSA (National Center for Supercomputing Applications) Mosaic, by Marc Andreessen
  Netscape
  IE
Document representation
Hypertext: textual information
Hypermedia: additional information, such as images and graphics
HyperXXXX: an abstract idea
  A set of documents, where a document can contain pointers to other documents
Page: a hypermedia document on the Web
Hypertext Markup Language (HTML)
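Markup language: publishes hypertext in a less detailed format, describing the structure of the document rather than its exact layout
[Figure: one HTML document shown in different browsers; display results may be different]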
HTML layout
<HTML>
  <HEAD>
    <TITLE> ….title of the text…. </TITLE>
  </HEAD>
  <BODY>
    …body of the document…
  </BODY>
</HTML>
* Good indentation makes the document easier for people to read and edit
HTML layout (cont.)
<HTML><HEAD><TITLE>….title of the text….</TITLE></HEAD><BODY>…body of the document…</BODY></HTML>
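Indentation matters to people but not to the program that reads the page. A minimal sketch with Python's standard `html.parser` shows that an indented document and a one-line document yield the same title. The class name `TitleParser` and the sample page text are made up for this illustration.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text found between <TITLE> and </TITLE>."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":          # tag names arrive in lowercase
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def title_of(document: str) -> str:
    parser = TitleParser()
    parser.feed(document)
    parser.close()
    return parser.title.strip()

indented = """<HTML>
  <HEAD>
    <TITLE> My page </TITLE>
  </HEAD>
  <BODY>
    Hello
  </BODY>
</HTML>"""
one_line = "<HTML><HEAD><TITLE>My page</TITLE></HEAD><BODY>Hello</BODY></HTML>"

print(title_of(indented), "|", title_of(one_line))
```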
HTML examples
Example1
Example2
Example3: embedding images
Example4: hypertext link (anchor)
  <a> ….anything…</a>
  Any item can have a hypertext link
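As a rough illustration of how a program finds the anchors in a page, the sketch below collects the `href` value of every `<a>` tag with Python's standard `html.parser`. The sample page, the file names `lab4.html` and `logo.gif`, and the class name `LinkParser` are assumptions invented for the example.

```python
from html.parser import HTMLParser
from typing import List

class LinkParser(HTMLParser):
    """Collect the target of every hypertext link (anchor) in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# The second anchor wraps an image, since any item can carry a link.
page = (
    '<BODY>'
    '<A HREF="http://www.csie.ncnu.edu.tw">Department page</A> '
    '<a href="lab4.html"><img src="logo.gif"></a>'
    '</BODY>'
)

parser = LinkParser()
parser.feed(page)
print(parser.links)
```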
Lab#4 in the afternoon
  http://www.csie.nctu.edu.tw/~jglee/teacher/content.htm
HTTP documents
See http://ftp.ics.uci.edu/pub/ietf/http/
  HTTP/1.0, RFC 1945, 1996
  HTTP/1.1, RFC 2068, 1997
Searching the Web
Ref: Chapter 13 in “Modern Information Retrieval”, Ricardo Baeza-Yates and Berthier Ribeiro-Neto
Searching the Web
The WWW started in 1989
The textual data alone is estimated to be on the order of one terabyte
Goal: how to efficiently manage, retrieve, and filter information from the Web?
Challenges
Distributed data
  Data spans many computers interconnected without a predefined topology
High percentage of volatile data
  40% of the Web changes every month
Large volume
Unstructured and redundant data
  30% of Web pages are (near) duplicates
Heterogeneous data
  Different languages
Measuring the Web
[Figure: the Internet, the WWW (URLs), and Web servers]
1998: 3M Web servers (3 million)
No. of servers = 1/10 of the no. of computers on the Internet
Measuring the Web (cont.)
1998
  5 KB per Web page on average
  300M Web pages (300 million)
  300M * 5 KB = 1.5 terabytes
  Growing at a rate of 20M pages per month
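The estimate can be checked with a few lines of arithmetic. This sketch takes 5 KB as 5,000 bytes and one terabyte as 10^12 bytes, both of which are assumptions about the units; the doubling time is a simple linear extrapolation, not a figure from the slides.

```python
pages = 300_000_000            # 300M Web pages in 1998
bytes_per_page = 5_000         # 5 KB per page, taken as 5,000 bytes (assumption)

total_bytes = pages * bytes_per_page
terabytes = total_bytes / 10**12

# Linear extrapolation: how long 20M new pages per month take to add
# another 300M pages, assuming the rate stays constant.
growth_per_month = 20_000_000
months_to_double = pages / growth_per_month

print(total_bytes, terabytes, months_to_double)
```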
Methods for searching the Web
Search engines
  Index the Web documents as a full-text database
  AltaVista, Google, …
Web directories
  Classify selected Web documents by subject
  Yahoo!
Search engines
Model the Web as a database
All queries must be answered without accessing the Web pages
[Figure: the user sends queries to a database]
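One common way to answer queries without visiting the pages again is an inverted index, built once from the page text and consulted at query time. The sketch below assumes this technique; the slides do not say which index structure the engines use, and the URLs and page texts are made up.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical pages that a crawler has already fetched.
pages: Dict[str, str] = {
    "http://a.example/": "origins of the internet",
    "http://b.example/": "origins of the web",
    "http://c.example/": "searching the web",
}

def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every word to the set of URLs whose text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index: Dict[str, Set[str]], word: str) -> List[str]:
    """Answer a keyword query from the index alone."""
    return sorted(index.get(word.lower(), set()))

index = build_index(pages)
print(search(index, "web"))
```

Because `search` reads only the index, the answer reflects the pages as they were when indexed, not as they are now.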
Search engines (cont.)
AltaVista (www.altavista.com)
  20 multi-processor machines
  130 GB of RAM each
  Over 500 GB of disk space each
  75% of the resources are used by the query engine
The top search engines
Foreign
  Google (www.google.com)
  www.yahoo.com
  www.altavista.com
  Inktomi (www.inktomi.com)
  Statistics on search engines
    www.searchenginewatch.com
    http://imt.net/~notess/search
Taiwan
  Yahoo!/Kimo uses Google
  Openfind (www.openfind.com.tw) (Prof. 吳昇, National Chung Cheng University)
  Yam (www.yam.com.tw)
Search engines (cont.)
Centralized crawler-indexer architecture
[Figure: users --> User Interface --> Query Engine --> Index database; Web --> Crawler --> Indexer --> Index database]
User Interface
Query interface
  Keywords
  Boolean operators
Answer interface
  Ranks the retrieved pages using
    Statistics on term occurrences within the document
    Popularity
    Hyperlink information
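A small sketch of the two interfaces working together: a Boolean AND query selects the pages, and the answers are ranked by how often the query terms occur in each page. Only the term-occurrence signal is shown; popularity and hyperlink information are left out. The page names and texts are invented for the example.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical page texts.
documents: Dict[str, str] = {
    "page1": "web search engine index web",
    "page2": "web directory yahoo",
    "page3": "search engine crawler search",
}

def term_counts(text: str) -> Counter:
    return Counter(text.lower().split())

def boolean_and(terms: List[str]) -> List[str]:
    """Return the pages that contain every query term."""
    return [
        name for name, text in documents.items()
        if all(term in term_counts(text) for term in terms)
    ]

def rank(terms: List[str]) -> List[str]:
    """Order the matching pages by total occurrences of the query terms."""
    def score(name: str) -> int:
        counts = term_counts(documents[name])
        return sum(counts[term] for term in terms)
    return sorted(boolean_and(terms), key=lambda name: (-score(name), name))

print(rank(["search", "engine"]))
print(rank(["web"]))
```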
Crawler
Also called robots, spiders, wanderers, walkers, and knowbots
In spite of these names, the crawler runs on a local system and sends requests to remote Web servers
Method: start with a set of URLs, and from there extract other URLs
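The method can be sketched as a breadth-first traversal. To keep the example runnable without a network, the "Web" here is a small made-up dictionary from each URL to the URLs it links to; a real crawler would fetch each page and extract its links instead. Breadth-first order is one possible choice, not necessarily what any particular engine uses.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List

# Hypothetical Web: each URL maps to the URLs found on that page.
web: Dict[str, List[str]] = {
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/c", "http://example.org/d"],
    "http://example.org/c": ["http://example.org/a"],
    "http://example.org/d": [],
    "http://example.org/e": ["http://example.org/a"],  # nothing links here
}

def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], List[str]],
          limit: int = 100) -> List[str]:
    """Visit pages breadth-first, starting from the seed URLs."""
    queue = deque(seeds)
    seen = set(queue)
    visited: List[str] = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:        # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited

order = crawl(["http://example.org/a"], lambda url: web.get(url, []))
print(order)
```

Page `e` is never visited because no crawled page links to it, which is one reason an index does not cover the whole Web.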
Crawler (cont.)
Because of how the Web is traversed, the index of a search engine can be thought of as analogous to the stars in the sky: the pages were fetched at different times, and some may have changed or disappeared since
Invalid links in search engines vary from 2% to 9%
The fastest current crawlers can traverse up to 10M Web pages per day
  300M/10M = 30 days
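A short calculation shows why a 30-day crawl leaves the index out of date. It combines the crawl rate with the growth and change figures quoted earlier in these slides (20M new pages per month, 40% of the Web changing every month) and assumes a month is about 30 days.

```python
web_pages = 300_000_000          # size of the Web in 1998
pages_per_day = 10_000_000       # fastest crawlers
days_for_one_pass = web_pages // pages_per_day

# Figures from earlier slides, with one month taken as 30 days (assumption).
new_pages_per_month = 20_000_000
changed_percent_per_month = 40

# What has happened to the Web by the time one full pass is finished.
new_pages_missed = new_pages_per_month
changed_pages = web_pages * changed_percent_per_month // 100

print(days_for_one_pass, new_pages_missed, changed_pages)
```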
Web directories
Classify Web pages by category
Directories are hierarchical taxonomies that classify human knowledge
Yahoo! has close to 1M pages classified
How to classify pages?
  Pages have to be submitted to the Web directories
  Classification is done manually by a few people
  Automatic classification is not yet mature
  Not every page is classified
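A hierarchical taxonomy can be modeled as a tree of categories, each holding the pages filed under it. The sketch below is one possible representation; the category names and the helpers `new_category`, `submit`, and `count_pages` are hypothetical, and the category path is supplied by hand to mirror manual classification.

```python
from typing import Any, Dict, List

def new_category() -> Dict[str, Any]:
    """A category holds its own pages and its sub-categories."""
    return {"pages": [], "children": {}}

def submit(root: Dict[str, Any], path: List[str], url: str) -> None:
    """File a submitted page under the category path chosen by an editor."""
    node = root
    for name in path:
        node = node["children"].setdefault(name, new_category())
    node["pages"].append(url)

def count_pages(node: Dict[str, Any]) -> int:
    """Count the pages in a category and in everything below it."""
    below = sum(count_pages(child) for child in node["children"].values())
    return len(node["pages"]) + below

root = new_category()
submit(root, ["Computers", "Internet", "Search engines"], "www.google.com")
submit(root, ["Computers", "Internet", "Search engines"], "www.altavista.com")
submit(root, ["Computers", "Internet"], "www.netscape.com")
submit(root, ["Education"], "www.csie.ncnu.edu.tw")

computers = root["children"]["Computers"]
print(count_pages(root), count_pages(computers))
```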
Some Web directories
Web directory    URL                  Web sites (K)   Categories
Yahoo!           www.yahoo.com        750             -
LookSmart        www.looksmart.com    300             24
Lycos Subjects   a2z.lycos.com        50              -
eBLAST           www.eblast.com       125             -
NewHoo           www.newhoo.com       100             23
Magellan         www.mckinley.com     60              -
Netscape         www.netscape.com     -               -
Snap             www.snap.com         -               -