MogileFS: a simple, reliable storage solution

Post on 22-Jan-2018

TWJUG Meetup Nov. 2016

kaif@kaif (member of mogilefs-moji)

Outline

• MogileFS

• Moji

• State of the art in MogileFS reliability

Quick facts

“Open source distributed object storage” – a.k.a. cloud storage, software-defined storage…

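• High availability and horizontal scalability

• Multi-replica file storage and repair

• Simple architecture, easy to use

• Many production deployments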

Brad Fitzpatrick

• Golang

• OpenID

• LiveJournal

– Memcached

– MogileFS

– …

Simplicity

Easy-to-use

• Command line tool

• Config file

Easy-to-use

• Admin tool

[Diagram: write path between the client, the trackers, the storage nodes, and MySQL]

1. Create open (client to tracker)

create_open domain=toast&class=triple&debug_profile=0&fid=0&multi_dest=1&key=qoo3

OK path_1=http://127.0.0.20:7500/dev2/0/000/000/0000000014.fid&path_3=http://127.0.0.25:7500/dev3/0/000/000/0000000014.fid&devid_1=2&devid_3=3&fid=14&path_2=http://127.0.0.25:7500/dev4/0/000/000/0000000014.fid&dev_count=3&devid_2=4

2. Write data (WebDAV, client to storage node)

PUT /dev208/0/068/050/0068050934.fid HTTP/1.0
Content-length: 9

some data

200 OK

3. Create close (client to tracker)

create_close domain=toast&fid=14&devid=2&path=http://127.0.0.20:7500/dev2/0/000/000/0000000014.fid&size=1048576&key=qoo3&devid_2=3&path_2=http://127.0.0.25:7500/dev3/0/000/000/0000000014.fid&multi_dest=1
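The tracker reply to create_open is a single line of URL-encoded key=value pairs, so a client has to split it before it can pick an upload destination. The following is a minimal sketch of that parsing step. `TrackerReply` and its methods are hypothetical names chosen for illustration and are not part of Moji or MogileFS; the only assumption about the protocol is the reply shape shown in the slide.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not Moji or MogileFS code): parses a tracker reply line. */
final class TrackerReply {
    private TrackerReply() {}

    /** Splits "OK k1=v1&k2=v2" into a map of URL-decoded keys and values. */
    static Map<String, String> parse(String line) {
        String trimmed = line.trim();
        if (!trimmed.startsWith("OK")) {
            throw new IllegalArgumentException("tracker did not answer OK: " + line);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        String body = trimmed.substring(2).trim();
        if (body.isEmpty()) {
            return fields;
        }
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate a stray "&"
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            fields.put(decode(key), decode(value));
        }
        return fields;
    }

    /** Collects path_1..path_N in order, where N is the dev_count field. */
    static List<String> destinations(Map<String, String> fields) {
        int count = Integer.parseInt(fields.getOrDefault("dev_count", "0"));
        List<String> paths = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            String path = fields.get("path_" + i);
            if (path != null) {
                paths.add(path);
            }
        }
        return paths;
    }

    private static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }
}
```

Calling `TrackerReply.destinations(TrackerReply.parse(reply))` on the reply above would yield the three candidate URLs in path_1, path_2, path_3 order; the client then issues the PUT against one of them.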

Availability

1WNR, memcached…

Scalability

User testimonials

KKBOX

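• More than 30 million songs (files)

• More than 75 storage servers

• More than 2,300 hard disks in total

• More than 10 PB of total storage

• Housed in 8 racks

(Source: "KKBOX 的音樂檔案儲存技術" [KKBOX's music file storage technology], posted on August 2, 2016 by Chris Yuan)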

My production experience

• File count: KKBOX × 10 × N

• Node count: 10^2 × N

• Complex workloads (backup, streaming, IoT, web, logs… orz)

• Java ♥

Moji

• A file-like MogileFS client for Java developers

• Production-ready features

– Connection pooling, load balancing, fault-tolerant…

• Quality

– Spring friendly, integration tests, well documented, actively developed…

https://github.com/mogilefs-moji/moji

Configuration

• Using plain-old-Java

SpringMojiBean moji = new SpringMojiBean();
moji.setAddressesCsv("192.168.0.1:7001,192.168.0.2:7001");
moji.setDomain("testdomain");
moji.initialise();
moji.setTestOnBorrow(true);

• Using the Spring framework

moji.tracker.address=192.168.0.1:7001,192.168.0.2:7001
moji.domain=testdomain

<import resource="moji-context.xml" />

Usage

• Create/update a remote file

MojiFile rickRoll = moji.getFile("rick-astley");
moji.copyToMogile(new File("never-gonna-give-you-up.mp3"), rickRoll);

• Download a remote file

rickRoll.copyToFile(new File("foo-fighters.mp3"));

Usage

• IO stream

MojiFile fooFighters = moji.getFile("stacked-actors");

// Reading: try-with-resources closes the stream even if reading fails,
// and avoids calling close() on a null reference.
try (InputStream in = fooFighters.getInputStream()) {
    // Do something streamy
    // in.read();
}

// Writing: flush before the stream is closed.
try (OutputStream out = fooFighters.getOutputStream()) {
    // Do something streamy
    // out.write(...);
    out.flush();
}

Call to action!

• Quickstart feat. Docker

docker run -d --name mogile-node jeffutter/mogile-node
docker run -it --link mogile-node:mogile-node hrchu/mogile-moji

• Set up the environment manually

– MogileFS: https://code.google.com/p/mogilefs/wiki/QuickStartGuide

– Maven dependency

<dependency>
  <groupId>fm.last</groupId>
  <artifactId>moji</artifactId>
  <version>2.0.0</version>
</dependency>

Now, a few words about reliability

MogileFS reliability measures

• Single copy ACK

• Multiple host replication policy

• MD5 checksum

• Basic disk health check

• Multiple zone plugin

• Reaper/fsck

And the files lived happily ever after~

… ?

Possible directions for strengthening reliability

• Multiple sites

• Scrubber

• Modern durable write

Multiple Sites

• MogileFS::Network plugin

• Give each data center its own subnet

• Map each zone to its subnet

• Replication policy

Multiple Sites

• Given a network of: 10.10.0.0/16

• All of your machines are configured with a /16 netmask (255.255.0.0). When assigning IP addresses to machines, pick them from 10.10.5.0/24

• Assign IP addresses

– web1: 10.10.5.1 (netmask 255.255.0.0 or /16)

– web2: 10.10.5.2

– tracker1: 10.10.5.3

– tracker2: 10.10.5.4

– storage node 1: 10.10.5.5

– storage node 2: 10.10.5.6

– storage node 3: 10.10.8.1

• For the MogileFS zones, configure:

– near=10.10.5.0/24 far=10.10.8.0/24
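To make the zone mapping concrete, here is a minimal sketch of how an address can be matched to a zone by its CIDR block. It is illustrative only: `ZoneLookup` is a hypothetical class, not code from the MogileFS::Network plugin, and it handles IPv4 only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: maps an IPv4 address to a zone by CIDR match. */
final class ZoneLookup {
    private final Map<String, String> zoneToCidr = new LinkedHashMap<>();

    /** Registers a zone name with its CIDR block, e.g. ("near", "10.10.5.0/24"). */
    ZoneLookup add(String zone, String cidr) {
        zoneToCidr.put(zone, cidr);
        return this;
    }

    /** Returns the first registered zone whose CIDR contains the address, or null. */
    String zoneOf(String ip) {
        for (Map.Entry<String, String> e : zoneToCidr.entrySet()) {
            if (contains(e.getValue(), ip)) {
                return e.getKey();
            }
        }
        return null;
    }

    /** True when the address shares the CIDR block's network prefix. */
    static boolean contains(String cidr, String ip) {
        String[] parts = cidr.split("/");
        int prefix = Integer.parseInt(parts[1]);
        // A shift by 32 is a no-op for int in Java, so /0 is handled explicitly.
        int mask = prefix == 0 ? 0 : -1 << (32 - prefix);
        return (toInt(parts[0]) & mask) == (toInt(ip) & mask);
    }

    private static int toInt(String dotted) {
        String[] octets = dotted.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + dotted);
        }
        int value = 0;
        for (String octet : octets) {
            int b = Integer.parseInt(octet);
            if (b < 0 || b > 255) {
                throw new IllegalArgumentException("octet out of range: " + dotted);
            }
            value = (value << 8) | b;
        }
        return value;
    }
}
```

With `new ZoneLookup().add("near", "10.10.5.0/24").add("far", "10.10.8.0/24")`, storage node 1 (10.10.5.5) resolves to near and storage node 3 (10.10.8.1) resolves to far, which is the split the replication policy relies on.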

[Figure: network diagram showing web1, web2, tracker1, tracker2, and storage nodes node1 to node3 grouped into the near and far zones]

Scrubber

• Make use of routine FSCK as scrubber

• Modified Algorithm

– Remove exhaustive search

– Improve performance at large scale

https://github.com/mogilefs/MogileFS-Network/blob/master/lib/MogileFS/ReplicationPolicy/HostsPerNetwork.pm#L84

mogadm fsck status |grep " Yes " || (mogadm fsck reset; mogadm fsck clearlog; mogadm fsck start) >/var/log/mogadm.fsck 2>&1

Modern durable write

• AS-IS

[Diagram: client, trackers, storage nodes, and MySQL]

4. Write other copies asynchronously

Assume that a file should have at least three replicas in the system to fit the durability requirement

Modern durable write

• TO-BE

[Diagram: client, trackers, storage nodes, and MySQL]

2. Write at least two copies before ACK

4. Write other copies asynchronously

Assume that a file should have at least three replicas in the system to fit the durability requirement

mogilefs-moji#25

mogilefs/MogileFS-Server#39
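The TO-BE flow can be sketched as a client-side loop that keeps uploading to candidate destinations until a minimum number of copies has been stored, and only then acknowledges the write. This is a hypothetical illustration, not the implementation discussed in the issues above: `DurableWrite`, `writeCopies`, and the `upload` callback are invented names, and the real upload would be the WebDAV PUT.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of "write at least N copies before ACK". */
final class DurableWrite {
    private DurableWrite() {}

    /**
     * Tries destinations in order until minCopies uploads have succeeded.
     *
     * @param destinations candidate upload URLs, e.g. from create_open with multi_dest=1
     * @param minCopies    number of copies required before the write is acknowledged
     * @param upload       returns true when the storage node accepted the data
     * @return the destinations that now hold a copy
     */
    static List<String> writeCopies(List<String> destinations, int minCopies,
                                    Predicate<String> upload) {
        List<String> written = new ArrayList<>();
        for (String dest : destinations) {
            if (written.size() >= minCopies) {
                break; // enough copies; the rest are replicated asynchronously
            }
            if (upload.test(dest)) {
                written.add(dest);
            }
        }
        if (written.size() < minCopies) {
            throw new IllegalStateException(
                "only " + written.size() + " of " + minCopies + " copies written");
        }
        return written;
    }
}
```

The returned destinations would then be reported to the tracker in create_close, in the same way the earlier create_close example carries both path and path_2.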

Analysis

• Disk failure pattern

– MTTF?

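– Poisson distribution?

• Mark-out: the window of exposure before a failure is detected

• Rep latency: the window of exposure during asynchronous replication

• Disk size and file size also affect the result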

Analysis

• Combinatorial analysis model

– Assume that each disk fails independently

– Assume that after x hours of operation each block survives with probability P(x_i) = p

– Probability of failure q = 1 - p.

– For replication with n copies this gives a naive formula for the probability that the data survives: 1 - q^n
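As a quick numerical check of the naive replication formula, the sketch below evaluates 1 - q^n, the probability that at least one of n independent replicas survives. `NaiveDurability` is a hypothetical name, and the figures in the usage note are made-up inputs, not measurements from the talk.

```java
/** Hypothetical helper: evaluates the naive replication formula 1 - q^n. */
final class NaiveDurability {
    private NaiveDurability() {}

    /**
     * @param q probability that a single replica is lost (q = 1 - p)
     * @param n number of independent replicas
     * @return probability that at least one replica survives
     */
    static double survival(double q, int n) {
        if (q < 0.0 || q > 1.0 || n < 1) {
            throw new IllegalArgumentException("need 0 <= q <= 1 and n >= 1");
        }
        // All n replicas are lost with probability q^n, because failures are independent.
        return 1.0 - Math.pow(q, n);
    }
}
```

For example, with an assumed q = 0.01, going from two replicas to three moves the survival probability from 1 - 10^-4 to 1 - 10^-6. The formula is naive because it ignores repair, correlated failures, and the detection and replication windows listed earlier.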

Analysis

• If we also consider

– Non-Recoverable Errors (NREs)

– Drive failure events that follow a Poisson process

– site failures (e.g. due to regional disasters)

– rep latency, mark-out time

– …

• Analysis of system durability is commonly done with Markov models

Analysis

• Example of durable write

– Assume mean disk life is 500K hrs

– 2 replicas, no NRE
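For intuition, here is a sketch of the textbook two-replica Markov chain, which is not necessarily the model behind the talk's charts. It assumes each disk fails at a constant rate lambda (1 / mean disk life), a lost replica is rebuilt at rate mu, and data is lost when the last replica fails before the rebuild completes. Solving T2 = 1/(2·lambda) + T1 and T1 = 1/(lambda + mu) + mu/(lambda + mu)·T2 gives the closed forms used below. `TwoReplicaMarkov` and its method names are hypothetical.

```java
/** Textbook two-replica Markov model of MTTDL; an assumption, not the talk's exact model. */
final class TwoReplicaMarkov {
    private TwoReplicaMarkov() {}

    /** Mean time to data loss (hours) starting with two healthy replicas: (3λ + μ) / (2λ²). */
    static double mttdlFromTwo(double lambda, double mu) {
        check(lambda, mu);
        return (3.0 * lambda + mu) / (2.0 * lambda * lambda);
    }

    /** Mean time to data loss (hours) starting with a single replica: (2λ + μ) / (2λ²). */
    static double mttdlFromOne(double lambda, double mu) {
        check(lambda, mu);
        return (2.0 * lambda + mu) / (2.0 * lambda * lambda);
    }

    private static void check(double lambda, double mu) {
        if (lambda <= 0.0 || mu < 0.0) {
            throw new IllegalArgumentException("need lambda > 0 and mu >= 0");
        }
    }
}
```

With a mean disk life of 500K hours, lambda = 1.0 / 500_000. In this simplified model the gap between starting with two copies and starting with one is exactly 1 / (2·lambda), which is 250,000 hours, regardless of mu; a model that also accounts for replication latency would make the gap depend on mu.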

[Figure: difference in MTTDL in hours (y-axis from 249,960 to 250,080) plotted against mu (1, 0.041666667, 0.020833333, 0.013888889); series label: diff disk life 5]

The lower the replication rate, the greater the improvement from durable write

Analysis

• Example of probability of data loss

[Figure: probability of data loss (y-axis from 0 to 8.0E-05) over x-axis values 1 to 14; series: P of data loss 72, P of data loss 48, P of data loss 24, P of data loss 1]

Recap

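Storage and architecture: the requirements of the deployment determine the choice of storage architecture

When sensitive data, client requirements, cost, or legacy constraints come into play, MogileFS may well be a suitable storage architecture~

What I want to say about MogileFS… it is a simple, scalable storage system for unstructured data

For a Java stack, pairing it with Moji is recommended

If your business is large and well funded, and you can bring in specialists or consultants, Ceph/Swift are more advanced, and more complex, options!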

Thank you~

[About me]

https://kaif.io/u/kaif

https://github.com/hrchu

petertc.chu@gmail.com

[About Moji]

https://github.com/mogilefs-moji/moji

FIN~
