ZooKeeper Introduction
TRANSCRIPT
zookeeper
A Distributed Coordination Service for Distributed Applications
By Fx_bull
1. ZooKeeper architecture and features
2. ZooKeeper node roles
3. ZooKeeper configuration
4. ZooKeeper data model (introducing znode, zxid, etc.)
5. ZooKeeper data read/write
6. Key mechanisms, including leader election, the roles of the log and snapshot, why an odd number of nodes, why a write needs agreement from more than half of the followers, etc.
What's ZooKeeper?
ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems.
The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch.
An open-source implementation of Chubby (Google's lock service).
12/11/14
ZooKeeper architecture
- ZooKeeper consists of multiple servers: one leader and multiple followers.
- High performance: it can be used in large, distributed systems.
- Highly available: replication keeps it from being a single point of failure.
- Strictly ordered access: sophisticated synchronization primitives can be implemented at the client.
The servers that make up the ZooKeeper service must all know about each other. ZooKeeper uses a configuration file so that the servers know each other, and PING messages are exchanged between followers and the leader to determine liveness.
note: this ping is a packet sent to a specified port.
ZooKeeper achieves high availability through replication, and can provide a service as long as a majority of the machines in the ensemble are up. For example, in a five-node ensemble, any two machines can fail and the service will still work because a majority of three remain. Note that a six-node ensemble can also tolerate only two machines failing, since with three failures the remaining three do not constitute a majority of the six. For this reason, it is usual to have an odd number of machines in an ensemble.
Another reason: if an even-sized ensemble splits evenly, neither side can form a majority, so no value can be approved.
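The arithmetic behind this can be sketched directly (a minimal illustration, not ZooKeeper code):

```python
def tolerated_failures(ensemble_size: int) -> int:
    """An ensemble stays available while a strict majority is up,
    so it tolerates ensemble_size - majority failures."""
    majority = ensemble_size // 2 + 1
    return ensemble_size - majority

# A five-node and a six-node ensemble both tolerate exactly two
# failed machines, which is why odd sizes are preferred.
for n in (3, 4, 5, 6, 7):
    print(n, "nodes ->", tolerated_failures(n), "failures tolerated")
```

The sixth node adds replication cost without adding fault tolerance.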
Features
1. It is especially fast in "read-dominant" workloads.
2. ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.
3. Every update made to the znode tree is given a globally unique identifier, called a zxid (which stands for "ZooKeeper transaction ID").
Zookeeper Data Model
A shared hierarchical namespace, similar to a standard file system
Each node in the tree is called a znode
ZooKeeper was designed to store coordination data, so each znode is very small: status information (version numbers for data changes, ACL changes, and timestamps), configuration, and location information.
czxid: the zxid of the change that caused this znode to be created.
mzxid: the zxid of the change that last modified this znode.
ctime: the time in milliseconds from epoch when this znode was created.
mtime: the time in milliseconds from epoch when this znode was last modified.
version: the number of changes to the data of this znode.
cversion: the number of changes to the children of this znode.
aversion: the number of changes to the ACL of this znode.
ephemeralOwner: the session id of the owner of this znode if the znode is an ephemeral node; otherwise zero.
dataLength: the length of the data field of this znode. The maximum allowable size of the data array is 1 MB.
numChildren: the number of children of this znode.
pzxid: the zxid of the change that last modified the children of this znode.
znode data structure
ZooKeeper is replicated.
Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.
All of the servers hold the same data, guaranteed by the Fast Paxos algorithm.
In theory, a client will see the same view of the system regardless of the server it connects to.
Roles in ZooKeeper
• Leader: responsible for initiating and resolving votes, and for updating the system state at the end.
note: it is possible to configure ZooKeeper so that the leader does not accept client connections, by setting zookeeper.leaderServes to "no".
• Follower: receives client requests and returns results to the client, and participates in votes initiated by the leader. The server synchronizes with the leader and replicates any transactions.
• Observer: observers improve the read performance of the cluster without affecting write performance; an observer only serves read requests and forwards write requests to the leader.
The problem is that as we add more voting members, the write performance drops. This is due to the fact that a write operation requires the agreement of (in general) at least half the nodes in an ensemble and therefore the cost of a vote can increase significantly as more voters are added.
peerType=observer
server.1:localhost:2181:3181:observer
detail: http://zookeeper.apache.org/doc/trunk/zookeeperObservers.html
Read data from the connected Server
Read requests are serviced from the local replica of each server's database
Write data: Paxos
• N senators make decisions on the Paxos island.
• Each proposal has an increasing PID.
• A proposal passes when more than half of the senators approve it.
• Each senator only accepts a proposal whose PID is larger than the current PID it has seen.
ZooKeeper
Senator -> Server
Proposal -> znode change
PID -> ZooKeeper transaction ID
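The island analogy can be sketched as a toy simulation (illustrative only; it assumes a single round and no message loss, and is not the real protocol implementation):

```python
def propose(pid: int, value: str, senators: list) -> bool:
    """Each senator accepts a proposal only if its PID is larger than
    the highest PID that senator has seen; the proposal passes when
    more than half of the senators accept it."""
    votes = 0
    for senator in senators:
        if pid > senator["max_pid"]:
            senator["max_pid"] = pid
            votes += 1
    return votes > len(senators) // 2

island = [{"max_pid": 0} for _ in range(5)]          # N = 5 senators
print(propose(1, "set /x = a", island))  # True: all five accept PID 1
print(propose(1, "set /x = b", island))  # False: PID 1 is no longer new
print(propose(2, "set /x = c", island))  # True: PID 2 beats PID 1
```

The stale-PID rejection is what gives proposals a total order, mirroring how zxids order znode changes.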
http://en.wikipedia.org/wiki/Paxos_algorithm
http://zh.wikipedia.org/zh-cn/Paxos%E7%AE%97%E6%B3%95
http://research.microsoft.com/pubs/64624/tr-2005-112.pdf
http://rdc.taobao.com/blog/cs/?p=162
paxos
Write data: Client -> ZooKeeper
1. The client sends a write request to a server.
2. The server forwards the write request to the leader.
3. The leader sends a PROPOSAL message to all the followers (sent asynchronously).
4. Followers agree or deny (an ACK is sent by a follower after it has synced a proposal).
5. Commit.
6. A response is sent to the client.
Write requests are processed by an agreement protocol: a leader proposes a request, collects votes, and finally commits.
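The commit rule in steps 3-5 can be sketched with a hypothetical helper (a simplification, not the real implementation):

```python
def leader_commits(num_followers: int, acks: int) -> bool:
    """The leader counts itself as one vote; it commits a proposal
    once the votes form a majority of the whole ensemble."""
    ensemble = num_followers + 1   # followers plus the leader
    votes = acks + 1               # follower ACKs plus the leader's own vote
    return votes > ensemble // 2

# 5-server ensemble (1 leader + 4 followers): 2 follower ACKs suffice.
print(leader_commits(4, 1))  # False: 2 of 5 votes
print(leader_commits(4, 2))  # True:  3 of 5 votes
```

This is why the leader never has to wait for the slowest follower: any majority is enough.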
note: all machines in the ensemble write updates to disk before updating their in-memory copy of the znode tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.
http://zookeeper.apache.org/doc/r3.2.2/zookeeperOver.html
SyncRequestProcessor / ZkDatabase
On restart, ZkDatabase loads the database from disk into memory at boot.
This class maintains the in-memory database of ZooKeeper server state, including the sessions, the data tree, and the committed logs. It is booted up after reading the logs and snapshots from the disk.
log and snapshot:
SyncRequestProcessor
detail : http://rdc.taobao.com/team/jm/archives/947
When is a snapshot taken?
1. when the leader changes
2. when a new server joins
3. when logCount > (snapCount/2 + randRoll)
Snapshots, together with the logs, are used for recoverability.
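Condition 3 above can be sketched as follows (an illustration of the formula; the range assumed for randRoll is how the random term is commonly described, staggering snapshots so servers do not all snapshot at once):

```python
import random

def should_snapshot(log_count: int, snap_count: int, rand_roll: int) -> bool:
    """Mirrors the trigger condition: logCount > snapCount/2 + randRoll."""
    return log_count > snap_count // 2 + rand_roll

snap_count = 100000
rand_roll = random.randint(0, snap_count // 2 - 1)  # assumed range, per lead-in
print(should_snapshot(snap_count, snap_count, rand_roll))  # always True
print(should_snapshot(100, snap_count, rand_roll))         # always False
```

Because rand_roll stays below snapCount/2, a server is guaranteed to snapshot before its log grows past snapCount entries.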
Questions
1. What happens when the leader crashes?
2. A follower may lag behind the leader, so a client may read outdated data.
3. Why does an update only need half (a majority) of the nodes?
Leader Selection
[Figure: leader election walkthrough with Server1-Server5]
Each server carries: its id, its zxid, a logic clock (initial value 0), and a status: LOOKING, FOLLOWING, OBSERVING, or LEADING.
Step 1: Server1 starts; no response from the others, so it stays LOOKING.
Step 2: Server2 starts; Server2 would be leader, but fewer than half of the servers agree, so both stay LOOKING.
Step 3: Server3 starts; Server3 becomes leader because more than half of the servers agree, so it moves to LEADING.
Step 4: Server4 starts; there is a leader already, so it becomes FOLLOWING.
Step 5: Server5 starts; there is a leader already, so it becomes FOLLOWING.
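The vote rule behind this walkthrough can be sketched as a simplification of leader election (assumed tie-breaking: prefer the larger zxid, then the larger server id; a leader needs a majority of the full ensemble):

```python
from typing import Optional

ENSEMBLE_SIZE = 5  # the walkthrough uses five servers

def elect(live: dict) -> Optional[int]:
    """live maps server id -> latest zxid for the servers that are up.
    Every live server votes for the candidate with the largest
    (zxid, id) pair; the election succeeds only once the voters
    form a majority of the full ensemble."""
    if len(live) <= ENSEMBLE_SIZE // 2:
        return None  # no quorum yet, everyone stays LOOKING
    return max(live, key=lambda sid: (live[sid], sid))

print(elect({1: 0}))              # None: one voter out of five
print(elect({1: 0, 2: 0}))        # None: two voters, still no majority
print(elect({1: 0, 2: 0, 3: 0}))  # 3: equal zxids, highest server id wins
```

Preferring the larger zxid ensures the new leader has seen the most recent committed transactions.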
Leader selection
http://zookeeper.apache.org/doc/r3.2.2/zookeeperInternals.html#sc_leaderElection
http://rdc.taobao.com/blog/cs/?p=162
note: dataVersion->zxid
This phase is finished once a majority (or quorum) of followers have synchronized their state with the leader.
Question 2: a follower may lag the leader, so a client may read outdated data. This limits the application scenarios. Two remedies:
1. Use sync: the sync operation forces the ZooKeeper server to which a client is connected to "catch up" with the leader.
2. Use a watcher.
Watcher: ZooKeeper notifies the client when data changes.
1. ZooKeeper supports the concept of watches.
2. Clients can set a watch on a znode.
3. A watch is triggered and removed when the znode changes.
4. When a watch is triggered, the client receives a packet saying that the znode has changed.
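The one-shot semantics in points 3-4 can be sketched with a toy model (not the real client API):

```python
class Znode:
    """A toy znode whose watches fire once and are then removed."""

    def __init__(self, data: bytes):
        self.data = data
        self._watches = []

    def get_data(self, watch=None) -> bytes:
        if watch is not None:
            self._watches.append(watch)  # register a one-shot watch
        return self.data

    def set_data(self, data: bytes) -> None:
        self.data = data
        watches, self._watches = self._watches, []  # remove on trigger
        for w in watches:
            w("NodeDataChanged")

events = []
node = Znode(b"v1")
node.get_data(watch=events.append)
node.set_data(b"v2")   # triggers the watch
node.set_data(b"v3")   # no watch left; the client must re-register
print(events)          # ['NodeDataChanged']
```

Because a watch fires only once, a client that wants continuous notifications must re-register after every event.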
Question 3: why only half (a majority) of the nodes?
• Performance
ZooKeeper Performance
It is especially high performance in applications where reads outnumber writes, since writes involve synchronizing the state of all servers. (Reads outnumbering writes is typically the case for a coordination service.)
Uses of ZooKeeper: 1. Master election
/currentMaster/{sessionId}-1 , /currentMaster/{sessionId}-2 , /currentMaster/{sessionId}-3
EPHEMERAL_SEQUENTIAL node
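The recipe above can be sketched: each candidate creates an EPHEMERAL_SEQUENTIAL child, and the one with the lowest sequence number is the master. This is a simulation of the naming scheme shown on the slide, not a live client call:

```python
def current_master(children: list) -> str:
    """Children are named {sessionId}-{sequence}; the candidate whose
    znode has the smallest sequence number holds mastership."""
    return min(children, key=lambda name: int(name.rsplit("-", 1)[1]))

children = ["0xA1-2", "0xB2-1", "0xC3-3"]   # hypothetical session ids
print(current_master(children))  # 0xB2-1

# If the master's session dies, its ephemeral znode vanishes and the
# next-lowest sequence number takes over automatically.
children.remove("0xB2-1")
print(current_master(children))  # 0xA1-2
```

Ephemeral nodes are what make failover automatic: no explicit release step is needed when a master crashes.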
2. HBase uses ZooKeeper to:
Select a master.
Discover which master controls which servers.
Help the client to find its master.
Configuration Management ( push )
1. Every server corresponds to a znode in ZooKeeper. (Client1 -> P1, Client2 -> P2, …)
2. Multiple servers in one cluster may share one configuration.
3. When the configuration changes, they should receive a notification.
Cluster Management
1. When one machine dies, the other machines should receive a notification.
2. When one server dies, its znode is automatically removed. (C1 -> P1, C2 -> P2, …)
3. When the master machine dies, how is a new master selected? Paxos!
Others: queues, double barriers, two-phase commit, etc.
reference:
• http://zookeeper.apache.org/doc/r3.3.2/recipes.html
• http://rdc.taobao.com/team/jm/archives/1232
The summary there is very detailed, so it is not repeated here.
Configuration
Each server in the ensemble of ZooKeeper servers has a numeric identifier that is unique within the ensemble and must fall between 1 and 255. So an ensemble contains at most 255 servers.
A ZooKeeper service usually consists of three to seven machines. Our implementation supports more machines, but three to seven machines provide more than enough performance and resilience.
zoo.cfg
tickTime=2000
dataDir=/disk1/zookeeper
dataLogDir=/disk2/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888
initLimit is the amount of time to allow for followers to connect to and sync with the leader. If a majority of followers fail to sync within this period, then the leader renounces its leadership status and another leader election takes place. If this happens often (and you can discover if this is the case because it is logged), it is a sign that the setting is too low. (10s)
syncLimit is the amount of time to allow a follower to sync with the leader. If a follower fails to sync within this period, it will restart itself. Clients that were attached to this follower will connect to another one.(4s)
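The times in parentheses follow from the config values above: both limits are expressed in ticks of tickTime milliseconds. A quick check of the arithmetic:

```python
tick_time_ms = 2000   # tickTime from zoo.cfg
init_limit = 5        # ticks allowed for followers' initial sync with the leader
sync_limit = 2        # ticks allowed for a follower to stay in sync

print(init_limit * tick_time_ms / 1000, "s")  # 10.0 s
print(sync_limit * tick_time_ms / 1000, "s")  # 4.0 s
```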
Servers listen on three ports:
2181 for client connections; 2888 for follower connections, if they are the leader; 3888 for other server connections during the leader election phase.
FAQ
How do I size a ZooKeeper ensemble (cluster)?
In general when determining the number of ZooKeeper serving nodes to deploy (the size of an ensemble) you need to think in terms of reliability, and not performance.
Reliability: a single ZooKeeper server (standalone) is essentially a coordinator with no reliability (a single serving node failure brings down the ZK service).
A 3 server ensemble (you need to jump to 3 and not 2 because ZK works based on simple majority voting) allows for a single server to fail and the service will still be available.
So if you want reliability go with at least 3. We typically recommend having 5 servers in "online" production serving environments. This allows you to take 1 server out of service (say planned maintenance) and still be able to sustain an unexpected outage of one of the remaining servers w/o interruption of the service.
Performance: write performance actually decreases as you add ZK servers, while read performance increases modestly: http://bit.ly/9JEUju
faq
• http://rdc.taobao.com/team/jm/archives/1384
1. After leader election, or when a new ZooKeeper server joins, followers must synchronize with the leader. The config file often sets this limit to 4s, but the in-memory image can be large and synchronization may not finish within 4s. What happens then?
2. ZooKeeper is described as a "coarse-grained" lock service. How should "coarse-grained" be understood?
3. The Definitive Guide emphasizes that a write returns only after a majority have persisted it??
4. Not every update is persisted??
QuorumPeerMain / ZookeeperServerMain is used to start the program.
Processor Chain
LeaderZooKeeperServer
FollowerZooKeeperServer
• Hadoop ZooKeeper: an open-source implementation of Chubby.
• Data model: a shared hierarchical namespace, similar to a standard file system.
• Architecture: one leader, multiple followers. Followers have the same data model. Uses the Paxos algorithm to implement consistency.
• Watcher: a client can monitor znode changes via a watcher.
summary