sistemes distribuïts en xarxa (sdx) facultat d’informàtica de barcelona (fib) universitat...

Sistemes Distribuïts en Xarxa (SDX)Facultat d’Informàtica de Barcelona (FIB)Universitat Politècnica de Catalunya (UPC)2013/2014 Q2

9. Peer-to-Peer Systems

SDX (2) Peer-to-peer systems

Contents

• Introduction

• Unstructured P2P systems

• Structured P2P systems


Introduction

• Client-server model– Traditional model in distributed computing


Introduction

• Client-server model– Successful architecture until now

• New need for global scale computation– Modern Internet is characterized by the

instability of nodes and links• Sudden spikes in demand for services

– Scalability and robustness problems with client-server model• Centralized resources are potential performance

bottlenecks and points of failure


Introduction

• Peer-to-Peer (P2P) model


Introduction

• Peer-to-Peer (P2P) model– A P2P computer network refers to any

network that does not have fixed clients and servers, but a number of peer nodes that function as both clients and servers to other nodes on the network• Wikipedia.org

– Nodes are symmetric in function– No centralized control– Typically many nodes, but unreliable,

dynamic and heterogeneous• Machines can come and go any time


Introduction

• P2P advantages– Aggregate tremendous amount of

computation and storage resources– High availability & scalability

• Hundred of thousands or millions of machines• No bottlenecks, no single points of vulnerability

– Low latency• Exploit locality whenever possible

– Resilience to failures, spikes in demand, surges in traffic or intentional attacks


Introduction

• Basic interface in P2P systemsA. Insert / publish a resource / object / file /

serviceB. Find / lookup a resource / object / file /

service• Content location is the main challenge in P2P

systems

AB

C

D

E

F

E?


Introduction

• Two approaches for content location1. Unstructured P2P systems

• No (or little) control over the P2P network topology

• No (or little) control over resource placement• Insert is easy, search is hard or imprecise• Examples: Napster, Gnutella, KaZaA

2. Structured P2P systems• P2P network topology is tightly controlled • Resources stored at specified locations • Insert is hard, search is easy• Examples: DHT-based systems

– CAN, Chord, Pastry, Tapestry


Contents

• Introduction




Contents

• Introduction

• Unstructured P2P systems– Centralized model– Decentralized model– Hierarchical model



Centralized model: Napster

A. Query a centralized index system that returns the machine/s that stores the required file

B. Transfer the file from given machine/s

AB

C

D

E

F

m1m2

m3

m4

m5

m6

m1 Am2 Bm3 Cm4 Dm5 Em6 F

E?m5

E? E

Insert(“E”, m5)

Lookup(“E”)


Centralized model: Napster

• Advantages– Simple and locates files quickly and

efficiently– Easy to implement sophisticated search

engines on top of the index system• Disadvantages

– Lack of robustness: single point of failure– Scalability / performance bottleneck– Resources are distributed, but index service

is not• Easy to detect copyright infringements• For this reason, Napster could be shut down


READING REPORT

• Cohen03. Incentives Build Robustness in BitTorrent


Centralized model: BitTorrent

• Collaborative system providing efficient content distribution using file swarming– Each file split into smaller pieces– Nodes request desired pieces from

neighbors– Encourages contribution by all nodes

• Combines a client-server model for locating the pieces with a P2P download protocol– Usually does not perform all the functions of

a typical P2P system, like searching• Written by Bram Cohen (in Python) in

2001



1. Peers obtain a .torrent file, hosted in a web server

2. Peers connect to the tracker specified there3. Tracker responds with contact information

about the peers that are downloading the same file

4. Peers then use this information to connect to each other and distribute file pieces among them



• .torrent file contains:– Metadata about the file to be shared

• Name, length, piece length, SHA-1 hash of each piece

– URL of tracker• Tracker

– Keeps track of all the peers who have the file (seeds/leechers) and which chunks each peer has

– When asked for a list of peers, the tracker returns a random subset of the peers, typically 50 peers

– Can receive statistics about what peers are doing



• Seeds– Peers that have a complete copy of the file– They are usually selfish and do not want to

wait after they get the file– Initial seed: a peer that provides the initial

copy– At least the initial seed has to stay to serve

one complete copy of the file• Leechers

– Any peer who does not have a complete file– He stays a leecher and eventually becomes

a seed


BitTorrent: Piece selection

• Each file split into smaller pieces– This is the unit for uploading, typically

256Kb• Pieces are further divided in sub-pieces

– This is the unit for downloading, typically 16Kb

• The order in which pieces are selected is critical for good performance– Avoid ending with peers having all the same

set of available pieces, and none of the missing ones

– If the original seed is prematurely taken down, then the file cannot be completely downloaded



• Rarest Piece First– This is the general rule: download first the

pieces that are most rare among your peers• Most commonly available pieces are left until the

end to download– Increases diversity in the pieces

downloaded• You are going to be useful to others• Avoids the case where all peers have exactly the

same pieces– Increases likelihood that all pieces are still

available even if the original seed leaves



• The behavior of the ‘Rarest Piece First’ policy can be modified by three additional policies:– Strict Priority, Random First Piece, Endgame

Mode

• Strict Priority– Once a single sub-piece has been

requested, the other sub-pieces from that piece are requested before sub-pieces from any other piece• As only complete pieces can be sent, it is

important to minimize the number of partially-received pieces



• Random First Piece– Special case at the beginning of the

download– A peer has nothing to trade so it is

important to get a complete piece as soon as possible

– ‘Rarest Piece First’ is not a good option• Rare pieces are present on few peers, so they

would be downloaded slower– Select a random piece of the file and

download it– When first complete piece is assembled,

switch to ‘Rarest Piece First’



• End Game Mode– Special case at the end of the download– The completion of a download can be

delayed due to a single peer with a slow transfer rate

– When all missing sub-pieces have been requested, those not received yet are requested from every peer containing them

– When a sub-piece arrives, the pending requests for that sub-piece are cancelled

– Some bandwidth is wasted, but in practice, this is not too much


BitTorrent: Choking

• Want to encourage all peers to contribute– Guarantee a reasonable level of upload and

download reciprocation– Avoid ‘free riders’

• Choking is a temporary refusal to upload– A chokes B if A decides not to upload to B

• A given peer can unchoke only 4 remote peers at a given time– Not because you are connected to

somebody you can download from him


BitTorrent: Choking

• Decision about unchoking is reconsidered periodically:– Every 10 seconds, the interested remote

peers are ordered according to their upload rate to the local peer and the 3 fastest peers are unchoked• i.e. initially they should have unchoked the local

peer– Every 30 seconds, one additional interested

remote peer from the remaining connections is unchoked at random • This is called optimistic unchoke• Allows to replace peers with better upload

capacity and bootstrap new peers that do not have any piece to share



• Advantages– Very good download performance

• Slow nodes do not make slow other nodes• Proficient in utilizing partially downloaded files

– Discourages ‘freeloading’ by rewarding fastest uploaders

– ‘Rarest Piece First’ extends the lifetime of swarm, increasing the likelihood all pieces are available

– Users use well-known, trusted sources to locate content, as there is not search feature• Avoids the pollution problem, where garbage is

passed off as authentic content



• Disadvantages– Assumes all interested peers active at same

time: performance deteriorates if swarm “cools off”

– The centralized tracker is a single point of failure • New nodes cannot enter swarm if tracker goes

down• BitTorrent variants without a centralized tracker

– e.g. Vuze (former Azureus)– Now, official client also supports distributed

tracking» Mainline DHT (based on Kademlia)

– Lack of search feature• Users need to resort to out-of-band search


Contents

• Introduction




Decentralized model: Gnutella

• Based on query flooding

• Hot to find a file:– Node sends Query

descriptor to all neighbors

• All nodes connected to it

– Neighbors recursively multicast the descriptor

– Eventually a peer that has the file receives the descriptor, and sends back a QueryHit (retracing the path of Query)

– Node requests file to this peer (direct connection)



• Example– Assumption: m1’s neighbors are m2 and

m3; m3’s neighbors are m4 and m5;…

AB

C

D

E

F

m1m2

m3

m4

m5

m6

E?

E?

E?E?

E

Lookup(“E”)



• Advantages– Highly decentralized– Location service distributed over peers

• Peers have similar responsibilities– No centralized directory

• System harder to shut down

• Disadvantages– Slow information discovery– Scalability problem due to excessive query

traffic• Queries for content that are not widely replicated

must be sent to a large fraction of peers• Worst case: O(N) messages per lookup



• Need some mechanisms to control flooding

A. Add TTL (Time-to-Live) to each message– Number of hops that message may travel

before giving up (default in Gnutella is 7)– How to adjust properly TTL value?

• For popular objects, small TTLs suffice• For rare objects, large TTLs are necessary

– Number of messages grow exponentially with TTL

B. Expanding ring– Start querying with TTL=1– Keep increasing TTL until search succeeds



• Need some mechanisms to control flooding

C. Random walk: Ask one neighbor randomly, who asks one neighbor, etc.– Reduces the number of flooded messages

• Tradeoff: longer delay, fewer messages– Multiple-walker random walk

• Initiate several random walks in parallel

D. State keeping: Keep track of recently routed messages– Avoids re-broadcasting query messages– Can choose another neighbor if random

walk used


Contents

• Introduction




Hierarchical model: FastTrack

• Special nodes (group leader or super-peer)

• Each peer is either a super-peer or assigned to a super-peer

• Super-peer knows the files in all its peers

• Peer queries super-peer – Super-peer may query

other super-peer using flooding

• Download among peers


Hierarchical model: FastTrack

• Advantages– Combines advantages of centralized and

decentralized models– Directory is decentralized

• System harder to shut down– Less flooding than decentralized model

• Disadvantages– Super-peers can suffer the same problems

that centralized directory– Scalability problems by using flooding

• Used in KaZaA, Skype


Contents

• Introduction




Distributed Hash Tables (DHT)

• Goals– Make sure that an item identified is always

found• Directed search: Assign particular nodes to hold

particular content (or know where it is)– Decentralization and scalability– Self-organization and fault-tolerance

• Handle rapidly arrival and failure of nodes– Good for exact matches, not keyword search

• Examples– CAN, Chord, Kademlia, Pastry, Viceroy,

Tapestry


Distributed Hash Tables (DHT)

• A hash table associates data with keys– Key identifies data uniquely

• Key is hashed to find bucket in hash table

• In a DHT, nodes are the hash buckets– Key is hashed to find responsible peer node

key pos

00

hash functionhash function

11

22

N-1N-1

33...

xx

yy zz

lookup (key) → datainsert (key, data)lookup (key) → datainsert (key, data)

“Beatles” 2

hash table

hash bucket

h(key)%N

0

1

2

...

node

key poshash functionhash function

lookup (key) → datainsert (key, data)lookup (key) → datainsert (key, data)

“Beatles” 2h(key)%N

N-1


Contents

• Introduction


• Structured P2P systems– Chord– CAN


Chord

• Associate to each node and key a unique ID in an uni-dimensional space

• Properties – Efficient: O(Log N) messages per lookup,

where N is the total number of servers – Scalable: O(Log N) state per node– Robust: survives massive changes in

membership • Challenges

– How to locate an item?– How to maintain routing tables? – How to deal with changes in membership?


Chord

• m bit identifier space for both keys and nodes– Identifiers go from 0 to 2m-1– Map nodes & keys to identifiers with hash

function• Key identifier = SHA-1(key)• Node identifier = SHA-1(IP address)

• How to map key IDs to node IDs? – Use consistent hashing

– Map key IDs to nodes with ‘closest’ IDs– A key identified by ID is stored on the

successor node of ID (node with next higher ID)

• Each node is responsible for O(K/N) keys


6

1

2

6

0

4

26

5

1

3

7

2

identifier

node

X key

successor(1) = 1

successor(2) = 3successor(6) = 0

Chord: Item insertion

• Example:– Identifier space with m=3


Chord: Item insertion

6

1

2

0

4

26

5

1

3

7successor(6) = 7

6

1

successor(1) = 3

• Item management with node joins and departures


Chord: Item lookup

A. Basic Chord– Each node knows

only two other nodes• Successor• Predecessor (for ring

management)– Lookup by

forwarding requests around the ring through successor pointers

– Requires O(N) hops

N1

N8

N14

N32

N21

N38

N42

N48

N51

N56

m=6m=6

K54

2m-1 0

lookup(K54)


N8080 + 20

N112

N96

N16

80 + 21

80 + 22

80 + 23

80 + 24

80 + 25 80 + 26

Chord: Item lookup

B. Finger tables for accelerating lookup– Every node knows

m other nodes– Entry i in the finger

table of node n points to node n + 2i (if any) or to its successor

– Increase distance exponentially

– Lookups take O(Log N) hops


Chord: Managing finger tables

01

2

34

5

6

7

i id+2i succ0 2 11 3 12 5 1

finger table

• Example:– Node n1:(1)

joins• All entries in its

finger table are initialized to itself



• Example:– Node n2:(2) joins

01

2

34

5

6

7

i id+2i succ0 2 21 3 12 5 1

finger table

i id+2i succ0 3 11 4 12 6 1

finger table



• Example:– Nodes n3:(0) & n4:(6) join

01

2

34

5

6

7

i id+2i succ0 2 21 3 62 5 6

finger table

i id+2i succ0 3 61 4 62 6 6

finger table

i id+2i succ0 1 11 2 22 4 6

finger table

i id+2i succ0 7 01 0 02 2 2

finger table



• Example:– Insert keys f1:(7) &

f2:(1)– A key identified by

ID is stored on the successor node of ID

01

2

34

5

6

7 i id+2i succ0 2 21 3 62 5 6

finger table

i id+2i succ0 3 61 4 62 6 6

finger table

i id+2i succ0 1 11 2 22 4 6

finger table

7

keys 1

keys

i id+2i succ0 7 01 0 02 2 2

finger table



0

4

26

5

1

3

7

012

124

130

finger tablei id+2i succ

keys1

012

235

330


keys2

012

457

000


keys


keys

012

702

003

6

6

66

6

• Example 2:– Item

management when node 6 joins



0

4

26

5

1

3

7

012

124

130


keys1

012

235

330


keys2

012

457

660


keys

finger tablei id+2i succ.

keys

012

702

003

6

6

6

0

3

• Example 2:– Item

management when node 1 departures


Chord: Item lookup with finger tables

01

2

34

5

6

7i id+2i succ0 2 21 3 62 5 6

finger table

i id+2i succ0 3 61 4 62 6 6

finger table

i id+2i succ0 1 11 2 22 4 6

finger table

7

keys 1

keys

i id+2i succ0 7 01 0 02 2 2

finger table

query(7)

1. Check whether the key is stored locally

2. If not, forward the query to the largest node in the finger table not exceeding key ID


Chord: Item lookup with finger tables

• Example 2:

N1

N8

N14

N32

N21

N38

N42

N48

N51

N56

m=6m=6

K54

2m-1 0

N8+1

N8+2

N8+4

N8+8

N8+16

N8+32

N14

N14

N14

N21

N32

N42

Finger table

+32

+16 +8

+4

+2

+1lookup(K54)

N42+1

N42+2

N42+4

N42+8

N42+16

N42+32

N48

N48

N48

N51

N1

N14

Finger table

N51+1

N51+2

N51+4

N51+8

N51+16

N51+32

N56

N56

N56

N1

N8

N21

Finger table


N36

Lookup(37,38,40,…,100,164)

N60

N40

N5

N20N99

N80

Chord: Node joining

A. Basic process for joining the ring (3 steps):1. Initialize fingers and predecessor of new

node j• Locate any node n in the ring• Ask n to lookup the peers at j+20, j+21, j+22… • Use results to populate finger table of j• j.pred = j.finger[0].pred


Chord: Node joining

2. Update finger tables of existing nodes

– For each entry i in the finger table, new node j calls update function on existing nodes that must point to j• Nodes in the

ranges[pred(j)-2i+1, j-2i]

– O(log N) nodes must be updated

N1

N8

N14

N32

N21

N38

N42

N48

N51

N56

m=6m=6 2m-1 0

N8+1

N8+2

N8+4

N8+8

N8+16

N8+32

N14

N14

N14

N21

N32

N42

+16

N28

N28

-16

N12

N6


3. Transfer keys responsibility– Connect to successor and transfer keys in

the range from successor to new node

K38K30

N36

N60

N40

N5

N20N99

N80

Chord: Node joining

Copy keys 21..36from N40 to N36

K30

K38


Chord: Node joining

B. Stabilization: Less aggressive mechanism to deal with concurrent joins– The goal is to keep successor pointers up to

date • This sufficient to guarantee correctness of lookups

– Finger entries are updated in a lazy fashion

1. Join only initializes the finger to successor node

2. All nodes run a stabilization procedure that periodically verifies successor and predecessor

3. All nodes also refresh periodically finger table entries


Chord: Stabilization

np

succ

(np)

= n

s

ns

n

pre

d(n

s) =

np

n joins predecessor = nil n acquires ns as successor via some node in the ring

n runs stabilization procedure n asks ns for its predecessor

ns responds (pred(ns)= np) to n

n confirms ns as successor and notifies ns that it is its new predecessor otherwise, n updates its successor and

stabilizes again ns checks and acquires n as its predecessor

ns transfers keys in the range to n

np also runs stabilization procedure

np asks ns for its predecessor (now n)

np acquires n as its successor

np notifies n that it is its new predecessor

n acquires np as its predecessor

nil

pre

d(n

s) =

n

succ

(np)

= n

succ

(n)

= n

s

pre

d(n

) =

np


Chord: Dealing with node failures• Failure of nodes might cause incorrect lookup

– Lookup(K19) fails: N8 returns the first alive node it knows (N32)

• Solution: successor list– Each node knows r immediate successors– When successor fails, replace it with the first live

entry in the list– Stabilize will correct finger table entries pointing to

failed node

N1

N8

N32

N21

N38

N42

N48

N51

N56

m=6m=6

2m-1 0

+16 +8

+4

+2

+1

N14

N18

K19

lookup(K19) ?

N8+1

N8+2

N8+4

N8+8

N8+16

N8+32

N14

N14

N14

N18

N32

N42


Chord

• Advantages– Availability

• Guaranteed to find an item if in the network– Decentralized: improves robustness– Scalability: logarithmic growth of lookup

costs• Disadvantages

– Member joining is complicated– Does not provide anonymity


SEMINAR PREPARATION – Chordy

• Stoica03. Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications


Contents

• Introduction


• Structured P2P systems– Chord– CAN


Content Addressable Network (CAN)• Virtual Cartesian coordinate space

– Associate to each node and item a unique ID in an d-dimensional Cartesian space

• Properties – Routing table size: O(d)– O(d*N1/d) messages per lookup, where N is

the total number of servers • Challenges

– How to divide hash space evenly among nodes?

– How to locate an item?


CAN: Virtual space construction

• The entire space is divided among all nodes– Every node “owns” a zone in the overall space

• Node coordinates are assigned in the joining process

• Example: – Node n1:(1, 2) first node that joins– n1 covers the entire space

1 2 3 4 5 6 70

1

2

3

4

5

6

7

0

n1

d=2



• Incrementally split space between nodes that join– Each node covers either a square or a rectangular

area of ratios 1:2 or 2:1

• Example:– Node n2:(4, 2) joins– Space is divided between n1 and n2

1 2 3 4 5 6 70

1

2

3

4

5

6

7

0

n1 n2

d=2



• Example:– Node n3:(3, 5) joins – n1’s zone is divided between n1 and n3

1 2 3 4 5 6 70

1

2

3

4

5

6

7

0

n1 n2

n3

d=2



• Example:– Nodes n4:(5, 5) and n5:(6, 6) join– n2’s zone is divided between n2 and n4– n4’s zone is divided between n4 and n5

1 2 3 4 5 6 70

1

2

3

4

5

6

7

0

n1 n2

n3 n4n5

d=2


CAN: Item insertion

• Node responsible for key K is determined by hashing K for each dimension

• Example:– node I::insert(K,V)– a = hx(K)

– b = hy(K)

x = a

y = b

I

d=2


CAN: Item insertion

• Each item is stored by the node who owns its mapping in the space

• Example:– Items f1:(2,3) & f3:(2,1) stored by n1– Item f2:(5,1) stored by n2– Item f4:(7,5) stored by n5

1 2 3 4 5 6 70

1

2

3

4

5

6

7

0

n1 n2

n3 n4n5

f1

f2

f3

f4

d=2


CAN: Item lookup

• A node only maintains state for its immediate neighbors– 2d neighbors per node

if zones are equal• Messages are routed

to the neighbor with coordinates closest to the destination– Min. Cartesian distance

• Multiple choices– Can route around

failures

d=2


CAN: Item lookup

• Example:– n1 lookups f4

• Path through n3 & n4

1 2 3 4 5 6 70

1

2

3

4

5

6

7

0

n1 n2

n3 n4n5

f1

f2

f3

f4

d=2


CAN: Item lookup


• n3 & n4 fail

1 2 3 4 5 6 70

1

2

3

4

5

6

7

0

n1 n2

n3 n4n5

f1

f2

f3

f4

d=2


CAN: Item lookup


• Alternate route through n2

1 2 3 4 5 6 70

1

2

3

4

5

6

7

0

n1 n2

n3 n4n5

f1

f2

f3

f4

d=2


Content Addressable Network (CAN)• Advantages

– Availability• Guaranteed to find an item if in the network

– Efficient at locating information• O(d.N1/d)

– Decentralized: no single point of failure– Fault tolerant routing

• Disadvantages– Impossible to perform a fuzzy search– Low security


Summary

• P2P: new model for global scale computation– Decentralized: Nodes are symmetric in

function– High availability & scalability– Typically many nodes, but unreliable,

dynamic and heterogeneous– Two approaches for content location

• Unstructured P2P systems– Examples: Napster, Gnutella, KaZaA,

BitTorrent• Structured P2P systems

– Examples: DHT-based systems: CAN, Chord

sistemes distribuïts en xarxa (sdx) facultat d’informàtica de barcelona (fib) universitat...

Documents