MogileFS: a simple, reliable storage solution
TWJUG Meetup Nov. 2016
kaif@kaif (member of mogilefs-moji)
Outline
• MogileFS
• Moji
• State of the art in MogileFS reliability
Quick facts
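“Open source distributed object storage” – a.k.a. cloud storage, software-defined storage…
• Highly available, horizontally scalable
• Files stored as multiple replicas, with repair
• Simple architecture, easy to use
• A long track record in production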
Brad Fitzpatrick
• Golang
• OpenID
• LiveJournal
– Memcached
– MogileFS
– …
Simplicity
Easy-to-use
• Command line tool
• Config file
Easy-to-use
• Admin tool
Components: client, trackers, stores (storage nodes), mysql

1. Create open (client to tracker)
create_open domain=toast&class=triple&debug_profile=0&fid=0&multi_dest=1&key=qoo3
OK path_1=http://127.0.0.20:7500/dev2/0/000/000/0000000014.fid&path_3=http://127.0.0.25:7500/dev3/0/000/000/0000000014.fid&devid_1=2&devid_3=3&fid=14&path_2=http://127.0.0.25:7500/dev4/0/000/000/0000000014.fid&dev_count=3&devid_2=4

2. Write data (WebDAV) (client to store)
PUT /dev208/0/068/050/0068050934.fid HTTP/1.0
Content-length: 9

some data
200 OK

3. Create close (client to tracker)
create_close domain=toast&fid=14&devid=2&path=http://127.0.0.20:7500/dev2/0/000/000/0000000014.fid&size=1048576&key=qoo3&devid_2=3&path_2=http://127.0.0.25:7500/dev3/0/000/000/0000000014.fid&multi_dest=1
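The tracker's reply to create_open shown above is a single line of URL-encoded key/value pairs. As a minimal sketch of how a client might pick the destination paths out of it, assuming the reply always has the "OK key=value&..." shape seen on the slide; `TrackerReply`, `parse` and `destinations` are hypothetical names for illustration, not part of Moji or MogileFS:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not the Moji or MogileFS API.
class TrackerReply {

    // Splits "OK k1=v1&k2=v2..." into a map. Throws if the line does not start with "OK ".
    static Map<String, String> parse(String line) {
        if (!line.startsWith("OK ")) {
            throw new IllegalArgumentException("tracker did not answer OK: " + line);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : line.substring(3).trim().split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip fragments without a key=value shape
            }
            String key = URLDecoder.decode(pair.substring(0, eq), StandardCharsets.UTF_8);
            String value = URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
            fields.put(key, value);
        }
        return fields;
    }

    // Returns path_1 .. path_N in index order, where N is taken from dev_count.
    static List<String> destinations(Map<String, String> fields) {
        int count = Integer.parseInt(fields.getOrDefault("dev_count", "0"));
        List<String> paths = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            String path = fields.get("path_" + i);
            if (path != null) {
                paths.add(path);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // The reply line from the slide, joined back into one line.
        String reply = "OK path_1=http://127.0.0.20:7500/dev2/0/000/000/0000000014.fid"
                + "&path_3=http://127.0.0.25:7500/dev3/0/000/000/0000000014.fid"
                + "&devid_1=2&devid_3=3&fid=14"
                + "&path_2=http://127.0.0.25:7500/dev4/0/000/000/0000000014.fid"
                + "&dev_count=3&devid_2=4";
        Map<String, String> fields = parse(reply);
        System.out.println("fid=" + fields.get("fid"));
        destinations(fields).forEach(System.out::println);
    }
}
```

The client would then PUT the data to one of these paths and report the path it used in create_close.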
Availability
1WNR, memcached…
Scalability
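User testimonial
KKBOX
• More than 30 million songs (files)
• More than 75 storage servers
• More than 2,300 disks in total
• More than 10 PB of total storage
• Housed in 8 racks
("KKBOX 的音樂檔案儲存技術" [KKBOX's music file storage technology], posted on August 2, 2016 by Chris Yuan)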
My production experience
• File count: KKBOX × 10 × N
• Node count: 10^2 × N
• Complex workloads (backup, streaming, IoT, web, logs… orz)
• Java ♥
Moji
• A file-like MogileFS client for Java developers
• Production-ready features
– Connection pooling, load balancing, fault-tolerant…
• Quality
– Spring friendly, integration tests, well documented, actively developed…
https://github.com/mogilefs-moji/moji
Configuration
• Using plain-old-Java
SpringMojiBean moji = new SpringMojiBean();
moji.setAddressesCsv("192.168.0.1:7001,192.168.0.2:7001");
moji.setDomain("testdomain");
moji.initialise();
moji.setTestOnBorrow(true);
• Using the Spring framework
moji.tracker.address=192.168.0.1:7001,192.168.0.2:7001
moji.domain=testdomain
<import resource="moji-context.xml" />
Usage
• Create/update a remote file
MojiFile rickRoll = moji.getFile("rick-astley");
moji.copyToMogile(new File("never-gonna-give-you-up.mp3"), rickRoll);
• Download a remote file
rickRoll.copyToFile(new File("foo-fighters.mp3"));
Usage
• IO stream
MojiFile fooFighters = moji.getFile("stacked-actors");

// Reading: try-with-resources closes the stream even if reading fails
try (InputStream in = fooFighters.getInputStream()) {
    // Do something streamy
    // in.read();
}

// Writing: the stream is flushed, then closed on exit from the block
try (OutputStream out = fooFighters.getOutputStream()) {
    // Do something streamy
    // out.write(...);
    out.flush();
}
Call to action!
• Quickstart feat. Docker
docker run -d --name mogile-node jeffutter/mogile-node
docker run -it --link mogile-node:mogile-node hrchu/mogile-moji
• Set up the environment manually
– MogileFS: https://code.google.com/p/mogilefs/wiki/QuickStartGuide
– Maven dependency:
<dependency>
  <groupId>fm.last</groupId>
  <artifactId>moji</artifactId>
  <version>2.0.0</version>
</dependency>
Now, a few words about reliability
MogileFS reliability measures
• Single copy ACK
• Multiple host replication policy
• MD5 checksum
• Basic disk health check
• Multiple zone plugin
• Reaper/fsck
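To make the MD5 checksum item above concrete, here is a minimal sketch of computing a digest for a file's bytes and comparing it with a recorded value. It uses only the JDK's `MessageDigest`; the class name `Md5Check` and its methods are hypothetical, and how MogileFS itself records and compares checksums is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not MogileFS or Moji code.
class Md5Check {

    // Returns the MD5 digest of data as a lowercase hex string.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available in this JVM", e);
        }
    }

    // True if the data still hashes to the digest recorded when it was stored.
    static boolean matches(byte[] data, String expectedHex) {
        return md5Hex(data).equalsIgnoreCase(expectedHex);
    }

    public static void main(String[] args) {
        byte[] stored = "some data".getBytes(StandardCharsets.UTF_8);
        String recorded = md5Hex(stored);            // digest taken at write time
        byte[] corrupted = "some dat4".getBytes(StandardCharsets.UTF_8);
        System.out.println("intact copy ok:    " + matches(stored, recorded));
        System.out.println("corrupted copy ok: " + matches(corrupted, recorded));
    }
}
```

A replica whose digest no longer matches would be a candidate for repair from another copy.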
And the files lived happily ever after~
… ?
Possible directions for improving reliability
• Multiple sites
• Scrubber
• Modern durable write
Multiple Sites
• MogileFS::Network plugin
• Give each data center its own network segment
• Map each zone to a network segment
• Replication policy
Multiple Sites
• Given a network of: 10.10.0.0/16
• All of your machines are configured with a /16 netmask (255.255.0.0). When assigning IP addresses to machines, pick them from 10.10.5.0/24 for the near site (and, per the zone settings, 10.10.8.0/24 for the far site)
• Assign IPs
– web1: 10.10.5.1 (netmask 255.255.0.0 or /16)
– web2: 10.10.5.2
– tracker1: 10.10.5.3
– tracker2: 10.10.5.4
– storage node 1: 10.10.5.5
– storage node 2: 10.10.5.6
– storage node 3: 10.10.8.1
• For the MogileFS zones, configure:
– near=10.10.5.0/24 far=10.10.8.0/24
Diagram: the near zone holds web1, tracker1, node1 and node2; the far zone holds web2, tracker2 and node3
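The zone settings above map an IP address to a zone by subnet. A minimal sketch of that lookup for IPv4, assuming zones are tried in the order they were added and that an address outside every zone yields null; `ZoneLookup` is a hypothetical class for illustration and does not reproduce the MogileFS::Network plugin's own logic:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of zone-by-subnet lookup; not the MogileFS::Network implementation.
class ZoneLookup {
    // zone name -> CIDR block, kept in insertion order
    private final Map<String, String> zones = new LinkedHashMap<>();

    ZoneLookup add(String zone, String cidr) {
        zones.put(zone, cidr);
        return this;
    }

    // Converts dotted-quad IPv4 text to its 32-bit value held in a long.
    static long toLong(String ipv4) {
        String[] parts = ipv4.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ipv4);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("bad octet in " + ipv4);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    // True if ipv4 falls inside the CIDR block, e.g. "10.10.5.0/24".
    static boolean inCidr(String ipv4, String cidr) {
        String[] parts = cidr.split("/");
        int prefix = Integer.parseInt(parts[1]);
        long mask = prefix == 0 ? 0 : (0xFFFFFFFFL << (32 - prefix)) & 0xFFFFFFFFL;
        return (toLong(ipv4) & mask) == (toLong(parts[0]) & mask);
    }

    // Returns the first zone whose block contains the address, or null if none does.
    String zoneOf(String ipv4) {
        for (Map.Entry<String, String> e : zones.entrySet()) {
            if (inCidr(ipv4, e.getValue())) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ZoneLookup lookup = new ZoneLookup()
                .add("near", "10.10.5.0/24")
                .add("far", "10.10.8.0/24");
        for (String ip : new String[] {"10.10.5.5", "10.10.5.6", "10.10.8.1"}) {
            System.out.println(ip + " -> " + lookup.zoneOf(ip));
        }
    }
}
```

With the storage node addresses from the slide, nodes 1 and 2 fall in the near block and node 3 in the far block, which is what lets a replication policy place copies across sites.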
Scrubber
• Use routine fsck runs as a scrubber
• Modified algorithm
– Remove exhaustive search
– Improve performance at large scale
https://github.com/mogilefs/MogileFS-Network/blob/master/lib/MogileFS/ReplicationPolicy/HostsPerNetwork.pm#L84
mogadm fsck status |grep " Yes " || (mogadm fsck reset; mogadm fsck clearlog; mogadm fsck start) >/var/log/mogadm.fsck 2>&1
Modern durable write
• AS-IS
Assume that a file should have at least three replicas in the system to meet the durability requirement
Components: client, trackers, stores, mysql
4. Write other copies asynchronously
Modern durable write
• TO-BE
Assume that a file should have at least three replicas in the system to meet the durability requirement
Components: client, trackers, stores, mysql
2. Write at least two copies before ACK
4. Write other copies asynchronously
mogilefs-moji#25
mogilefs/MogileFS-Server#39
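The TO-BE flow above ("write at least two copies before ACK") can be sketched as a small client-side loop. This is only an illustration under stated assumptions: each store is modelled as a `Predicate<byte[]>` that returns true when its write succeeds, stores are tried one after another, and `DurableWrite`, `writeSync` and `ack` are hypothetical names. It is not the code proposed in the issues referenced above.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of "N copies before ACK"; not Moji or MogileFS-Server code.
class DurableWrite {

    // Tries the candidate stores in order and stops once minCopies writes have succeeded.
    // Returns how many copies were written synchronously.
    static int writeSync(List<Predicate<byte[]>> stores, byte[] data, int minCopies) {
        int written = 0;
        for (Predicate<byte[]> store : stores) {
            if (written >= minCopies) {
                break; // enough copies; the rest is left to asynchronous replication
            }
            if (store.test(data)) {
                written++;
            }
        }
        return written;
    }

    // The client acknowledges the upload only if enough synchronous copies exist.
    static boolean ack(List<Predicate<byte[]>> stores, byte[] data, int minCopies) {
        return writeSync(stores, data, minCopies) >= minCopies;
    }

    public static void main(String[] args) {
        Predicate<byte[]> healthy = bytes -> true;   // a store that accepts the write
        Predicate<byte[]> broken = bytes -> false;   // a store that rejects the write
        byte[] data = "some data".getBytes();

        System.out.println("AS-IS (1 copy):  " + ack(List.of(healthy, healthy, healthy), data, 1));
        System.out.println("TO-BE (2 copies): " + ack(List.of(healthy, broken, healthy), data, 2));
        System.out.println("TO-BE, one store left: " + ack(List.of(broken, healthy), data, 2));
    }
}
```

Setting minCopies to 1 corresponds to the AS-IS single copy ACK; raising it to 2 trades some write latency for a shorter window in which only one copy exists.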
Analysis
• Disk failure pattern
– MTTF?
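– Poisson distribution?
• Mark-out: the window before a failure is detected
• Rep latency: the window while asynchronous replication is still pending
• Disk size and file size also affect the result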
Analysis
• Combinatorial analysis model
– Assume that each disk fails independently
– Assume that after x hours of operation each block has survival probability P(x_i) = p
– Probability of failure q = 1 - p.
– For replication with n replicas, a naive formula for the survival probability: 1 - q^n
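As a quick worked instance of the naive formula above, the following sketch evaluates 1 - q^n for a few replica counts. The failure probability q = 0.01 is an arbitrary illustrative value, not a figure from the talk, and `NaiveDurability` is a hypothetical class name.

```java
// Illustrative only: evaluates the naive independent-failure formula 1 - q^n.
class NaiveDurability {

    // Probability that at least one of n independent replicas survives,
    // given that each replica fails with probability q.
    static double survival(double q, int n) {
        return 1.0 - Math.pow(q, n);
    }

    public static void main(String[] args) {
        double q = 0.01; // assumed per-replica failure probability, for illustration
        for (int n = 1; n <= 3; n++) {
            System.out.printf("n=%d  survival=%.6f  loss=%.0e%n",
                    n, survival(q, n), Math.pow(q, n));
        }
    }
}
```

Each extra replica multiplies the loss probability by q, which is why the formula is called naive: it ignores correlated failures, detection delay and replication latency.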
Analysis
• If we also take into account
– Non-Recoverable Errors (NREs)
– drive failure events are Poisson
– site failures (e.g. due to regional disasters)
– rep latency, mark-out time
– …
• Analysis of system durability is commonly done with Markov models
Analysis
• Example of durable write
– Assume mean disk life is 500K hrs
– 2 replicas, no NRE
[Chart: diff of MTTDL in hr (y axis, roughly 249,960 to 250,080) against mu (x axis: 1, 0.041666667, 0.020833333, 0.013888889); series: diff disk life 5]
The lower the replication rate, the greater the improvement from durable write
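For readers who want to reproduce the flavour of this example, here is a sketch based on the textbook two-replica Markov chain, which is an assumption and not necessarily the model behind the chart. Each disk fails at rate lambda = 1 / (mean disk life), a missing replica is recreated at rate mu, and data is lost when the last copy fails. Solving T2 = 1/(2·lambda) + T1 and T1 = 1/(lambda+mu) + mu/(lambda+mu)·T2 gives the closed forms used below. `TwoReplicaMarkov` is a hypothetical class name.

```java
// Textbook two-replica Markov model; an assumed stand-in, not the talk's exact model.
class TwoReplicaMarkov {

    // MTTDL when the file starts with two healthy replicas
    // (durable write: two copies exist before the ACK).
    static double mttdlFromTwo(double lambda, double mu) {
        return (3 * lambda + mu) / (2 * lambda * lambda);
    }

    // MTTDL when the file starts with a single replica
    // (single copy ACK: the second copy is still to be replicated).
    static double mttdlFromOne(double lambda, double mu) {
        return (2 * lambda + mu) / (2 * lambda * lambda);
    }

    public static void main(String[] args) {
        double lambda = 1.0 / 500_000;                    // mean disk life of 500K hours
        double[] mus = {1.0, 1.0 / 24, 1.0 / 48, 1.0 / 72}; // replication rates per hour
        for (double mu : mus) {
            double two = mttdlFromTwo(lambda, mu);
            double one = mttdlFromOne(lambda, mu);
            System.out.printf("mu=%.9f  diff=%.1f hr  ratio=%.8f%n", mu, two - one, two / one);
        }
    }
}
```

In this simplified model the absolute gap is always 1/(2·lambda), that is 250,000 hours for a 500K-hour disk life, which is in the range the chart shows; the chart's values vary slightly with mu, so the model behind it evidently differs in detail. The ratio between the two MTTDLs does grow as mu shrinks, consistent with the observation that durable write helps more when replication is slow.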
Analysis
• Example of probability of data loss
[Chart: probability of data loss (y axis, 0 to 8.0E-05) over x = 1 to 14; series: P of data loss 72, P of data loss 48, P of data loss 24, P of data loss 1]
Recap
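Storage within the architecture: site requirements determine the choice of storage architecture
When sensitive data, client requirements, cost, or legacy constraints are in play, MogileFS may well be a suitable storage architecture~
What I want to say about MogileFS: a simple, scalable storage system for unstructured data
On a Java stack, pair it with Moji
If the business is big, has a rich backer, and can bring in specialists or consultants, Ceph/Swift are more advanced (and more complex) options!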
Thank you~
[About me]
https://kaif.io/u/kaif
https://github.com/hrchu
[About Moji]
https://github.com/mogilefs-moji/moji
FIN~