LA-UR-15-27431
MarFS: A Scalable Near-POSIX File System over Cloud Objects
Kyle E. Lamb, HPC Storage Team Lead
Nov 18th 2015
Why Do We Need a MarFS?
• If the Burst Buffer does its job very well (and indications are that in-system NV capacity will grow radically) and campaign storage works out well (leveraging cloud), do we still need a parallel file system, or an archive?
• Maybe just a bw/iops tier and a capacity tier.
• Too soon to say, but it seems feasible longer term.
[Diagram: today's tiers (Memory, Burst Buffer, Parallel File System (PFS), Campaign Storage, Archive) beside a possible future stack (Memory, IOPS/BW Tier, Parallel File System (PFS), Capacity Tier, Archive)]
Factoids: LANL HPSS = 53 PB and 543 M files. Trinity: 2 PB memory, 4 PB flash (11% of HPSS), and 80 PB PFS (150% of HPSS). Crossroads may have 5-10 PB memory and 40 PB solid state (100% of HPSS), with data residency measured in days or weeks.
[Diagram: HPC pre-Trinity (Memory, Parallel File System, Archive) beside HPC at/post Trinity, with per-tier rates and residency:
DRAM: 1-2 PB/sec, residence hours, overwritten continuously
Burst Buffer: 4-6 TB/sec, residence hours, overwritten in hours
Lustre Parallel File System: 1-2 TB/sec, residence days/weeks, flushed in weeks
Campaign Storage: 100-300 GB/sec, residence months to years, flushed after months to years
HPSS parallel tape: 10s of GB/sec, residence forever]
How about a Scalable Near-POSIX File System over Object Erasure?
• Best of both worlds
– Object systems provide massive scaling and efficient erasure. They are friendly to applications, not to people; people need a name space. Huge economic appeal (erasure enables use of inexpensive storage).
– The POSIX name space is powerful but has issues scaling.
• The challenges
– Mismatch of POSIX and Object metadata, security, read/write semantics, efficient object/file sizes, and no update-in-place with Objects.
– How do we scale a POSIX name space to trillions of files/directories, leverage massive Object scalability for a "Capacity Tier" with 100X the BW of HSMs but 1/10 the BW of a PFS, and get many years of data-protection longevity?
• Looked at: GPFS, Lustre, Panasas, OrangeFS, Cleversafe/Scality/EMC ViPR/Ceph/Swift, Glusterfs, General Atomics Nirvana/Storage Resource Broker/IRODS, Maginatics, Camlistore, Bridgestore, Avere, HDFS
MarFS Scaling (Metadata and Data)
[Diagram: the MarFS tree contains one namespace per project (Namespace Project A ... Namespace Project N), each with its own directory tree (DirA, DirA.A, DirA.A.A, DirA.A.A.A, DirA.A.B, DirB) and files (FA ... FI). N namespaces times M PFS MDS file systems (PFS MDS A.1 ... A.M, N.1 ... N.M) hold metadata only: each namespace MDS holds the directory metadata, while file metadata is hashed by file name over the M MDSes (sketched below). Data is striped across 1 to X object repos (Object Repo A ... Object Repo X) as Uni, Multi, or Packed object files.]
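As an illustration of the hashing step just described, here is a minimal sketch (not MarFS source; the hash function, MDS count, and naming are placeholders) of how a file name could be mapped to one of the M metadata file systems:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical: MarFS has its own scheme; FNV-1a stands in here. */
    static uint64_t hash_name(const char *name)
    {
        uint64_t h = 14695981039346656037ULL;    /* FNV-1a offset basis */
        for (; *name; name++) {
            h ^= (unsigned char)*name;
            h *= 1099511628211ULL;               /* FNV-1a prime */
        }
        return h;
    }

    int main(void)
    {
        const int M = 4;            /* MDS file systems in this namespace (assumed) */
        const char *file = "FA";
        /* Directory metadata stays on the namespace MDS; file metadata lands
           on MDS (hash(name) mod M), spreading even a single huge directory
           across M metadata file systems. */
        printf("file %s -> PFS MDS A.%llu\n", file,
               (unsigned long long)(hash_name(file) % M + 1));
        return 0;
    }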
MarFS Internals Overview
[Diagram: under the /MarFS top-level namespace aggregation, metadata lives in GPFS metadata file systems (/GPFS-MarFS-md1 ... /GPFS-MarFS-mdN, with directories such as Dir1.1 and Dir2.1 and a trashdir) served by the metadata servers; data lives in Object System 1 ... Object System X. Users do data movement here. For a Uni file, the metadata file carries Attrs: uid, gid, mode, size, dates, etc., and Xattrs: objid repo=1, id=Obj001, objoffs=0, chunksize=256M, Objtype=Uni, NumObj=1, etc.; the data is the single object Obj001.]
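A minimal sketch of what those xattrs mean in practice: the metadata file is an ordinary (mostly empty) POSIX file whose extended attributes locate the data object. The xattr key name and value encoding below are illustrative, patterned on the fields in the diagram, not the exact MarFS key names.

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/xattr.h>

    int main(void)
    {
        /* Hypothetical metadata-file path under one of the MD file systems. */
        const char *mdfile = "/GPFS-MarFS-md1/Dir1.1/somefile";
        char objid[256];

        /* Read the object-locator xattr (key name is illustrative only). */
        ssize_t n = getxattr(mdfile, "user.marfs_objid", objid, sizeof(objid) - 1);
        if (n < 0) {
            perror("getxattr");
            return 1;
        }
        objid[n] = '\0';
        /* e.g. "repo=1,id=Obj001,objoffs=0,chunksize=256M,objtype=Uni,numobj=1" */
        printf("data object for %s: %s\n", mdfile, objid);
        return 0;
    }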
Simple MarFS Deployment
[Diagram: GPFS servers (NSD) with dual-copy RAIDed fast storage for the metadata; object metadata/data servers in front of the data repos (also dual-copy RAIDed fast storage); batch FTAs and an interactive FTA that have both the enterprise file systems and MarFS mounted. Each part scales independently. Interactive and batch FTAs are kept separate for object security and performance reasons.]
Open Source, BSD License. Partners Welcome.
https://github.com/mar-file-system/marfs
https://github.com/pftool/pftool
Thank You For Your Attention
BOF: Two Tiers Scalable Storage: Building POSIX-Like Namespaces with Object Stores. Date: Nov 18th, 2015. Time: 5:30PM - 7:00PM. Room: Hilton Salon A. Session leaders: Sorin Faibish, Gary A. Grider, John Bent
Backup
Won't someone else do it?
• There is evidence others see the need, but no magic bullets yet (partial list):
– Cleversafe/Scality/EMC ViPR/Ceph/Swift etc. are moving towards multi-personality data lakes over erasure-coded objects; all are young and assume update-in-place for POSIX.
– Glusterfs is probably the closest thing to MarFS. Gluster is aimed more at the enterprise and midrange HPC and less at extreme HPC. It also makes the trade-off of space for update-in-place, which we can live without. Glusterfs is one way to unify file and object systems; MarFS is another, each coming at it from a different stance in the trade space.
– General Atomics Nirvana / Storage Resource Broker / IRODS is optimized for WAN and HSM metadata rates. There are some capabilities for putting POSIX files over objects, but these methods are largely via NFS or other methods that try to mimic full file system semantics, including update-in-place. They are not designed for massive parallelism in a single file, etc.
– Maginatics (from EMC) is in its infancy and isn't a full solution to our problem yet.
– Camlistore appears to be targeted at personal storage.
– Bridgestore is a POSIX name space over objects, but they put their metadata in a flat space, so renaming a directory is painful.
– Avere over objects is focused on NFS, so N-to-1 is a non-starter.
– HPSS or SamQFS or a classic HSM? The metadata-rate design targets are way too low.
– HDFS metadata doesn't scale well.
MarFS Requirements
• Linux system(s) with C/C++ and FUSE support
• MPI for parallel comms in Pftool (a parallel data transfer tool)
– The MPI library can use many comm methods: TCP/IP, InfiniBand OFED, etc.
• Support lazy data and metadata quotas per user per name space
• Wide parallelism for data and metadata
• Try hard not to walk trees for management (use inode scans etc.)
• Use a trash mechanism for user recovery
• If MarFS is used to combine multiple POSIX file systems into one mount point, any set of POSIX file systems can be used.
• For multi-node parallelism, the MD FS's must be globally visible somehow.
• If using an object store as the data repo, the object store needs to be globally visible.
• The MarFS MD FS's must be capable of POSIX xattrs and sparse files (a quick check of this is sketched below)
– You don't have to use GPFS; we use it for its ILM inode-scan features.
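A minimal sketch of how one might sanity-check a candidate metadata file system against that last requirement (xattr and sparse support); the probe path is a placeholder, and this is not part of MarFS itself:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/xattr.h>
    #include <unistd.h>

    int main(void)
    {
        const char *probe = "/GPFS-MarFS-md1/.marfs_probe";   /* hypothetical test file */
        int fd = open(probe, O_CREAT | O_RDWR | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* xattr support: try to set a user xattr on the file */
        if (fsetxattr(fd, "user.marfs_probe", "ok", 2, 0) < 0)
            perror("fsetxattr (no xattr support?)");

        /* sparse support: seek far, write one byte, compare size to allocated blocks */
        if (pwrite(fd, "x", 1, (off_t)1 << 30) != 1)
            perror("pwrite");
        struct stat st;
        if (fstat(fd, &st) == 0)
            printf("size=%lld bytes, allocated=%lld bytes (sparse if much smaller)\n",
                   (long long)st.st_size, (long long)st.st_blocks * 512LL);

        close(fd);
        unlink(probe);
        return 0;
    }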
What is MarFS?
• A near-POSIX global scalable name space over many POSIX and non-POSIX data repositories (scalable object systems: CDMI, S3, etc.)
• It scales the name space by sewing together multiple POSIX file systems, both as parts of the tree and as parts of a single directory, allowing scaling across the tree and within a single directory
• It is a small amount of code (C/C++/scripts)
– A small Linux FUSE file system
– A pretty small parallel batch copy/sync/compare utility
– A set of other small parallel batch utilities for management
– A moderately sized library that both FUSE and the batch utilities call
• Data movement scales just like many scalable object systems
• Metadata scales like NxM POSIX name spaces, both across the tree and within a single directory
• It is friendly to object systems by
– Spreading very large files across many objects (see the sketch below)
– Packing many small files into one large data object
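To make the "many objects per large file" idea concrete, a minimal sketch of the arithmetic (the 256M chunksize and the Obj002.N naming follow the Multi-file example later in these slides; this is not the real MarFS code):

    #include <stdint.h>
    #include <stdio.h>

    /* Map a logical file offset to (object index, offset within that object)
       given a fixed chunksize, as used for Multi files. */
    int main(void)
    {
        const uint64_t chunksize = 256ULL << 20;   /* 256M per data object */
        uint64_t file_offset = 600ULL << 20;       /* e.g. byte 600M of a large file */

        uint64_t obj_index  = file_offset / chunksize;   /* 0-based chunk number */
        uint64_t obj_offset = file_offset % chunksize;

        /* Obj002 is the base object id from the Multi-file example. */
        printf("offset %llu -> Obj002.%llu at object offset %llu\n",
               (unsigned long long)file_offset,
               (unsigned long long)(obj_index + 1),
               (unsigned long long)obj_offset);
        return 0;
    }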
What are all these storage layers? Why do we need all these storage layers?
• Why?
– BB: Economics (disk bw/iops too expensive)
– Campaign: Economics (PFS RAID too expensive, PFS solution too rich in function, PFS metadata not scalable enough, PFS designed for scratch use not years of residency, Archive BW too expensive/difficult, Archive metadata too slow)
[Diagram: HPC before Trinity (Memory, Parallel File System, Archive) vs. HPC after Trinity (Memory, Burst Buffer, Parallel File System, Campaign Storage, Archive), with the same per-tier rates and residency as above: DRAM 1-2 PB/sec; Burst Buffer 4-6 TB/sec; Lustre PFS 1-2 TB/sec; Campaign 100-300 GB/sec; HPSS parallel tape 10s of GB/sec, forever.]
What it is not!
• It doesn't allow update-in-place for object data repos (no seeking around and writing; it isn't a parallel file system)
• FUSE
– Does not check for or protect against multiple writers into the same file (when writing into object repos); use the batch copy utility or the library to do this efficiently
– FUSE is targeted at interactive use
– Writing to object-backed files works, but FUSE will not create data objects that are packed as optimally as the parallel copy utility does
– Batch utilities can reshape data written by FUSE
MarFS Internals Overview: Multi-File
[Diagram: same layout as above (/GPFS-MarFS-md1 ... /GPFS-MarFS-mdN metadata file systems with Dir1.1, Dir2.1, and trashdir under the /MarFS top-level namespace aggregation; Object System 1 ... Object System X for data). For a Multi file, the metadata file carries Attrs: uid, gid, mode, size, dates, etc., and Xattrs: objid repo=1, id=Obj002., objoffs=0, chunksize=256M, ObjType=Multi, NumObj=2, etc.; the data spans the objects Obj002.1 and Obj002.2.]
MarFS Internals Overview: Multi-File (striped Object Systems)
[Diagram: as above, but the Xattrs read objid repo=S, id=Obj002., objoffs=0, chunksize=256M, ObjType=Multi, NumObj=2, etc., and the data objects Obj002.1 and Obj002.2 are striped across the object systems (Object System 1 ... Object System X).]
MarFS Internals Overview: Packed-File
[Diagram: as above. For a Packed file, the metadata file carries Attrs: uid, gid, mode, size, dates, etc., and Xattrs: objid repo=1, id=Obj003, objoffs=4096, chunksize=256M, Objtype=Packed, NumObj=1, Obj=4 of 5, etc.; the data of several small files shares the single object Obj003, with this file's bytes starting at objoffs.]
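A minimal sketch of what those packed-file xattrs imply for a read: the file's bytes are a slice of a shared object, so a read becomes a range request on that object starting at objoffs. Field values follow the diagram; the actual object GET is omitted and the file size here is assumed.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t objoffs  = 4096;   /* from the xattrs above */
        uint64_t filesize = 1024;   /* from the metadata file's stat size (assumed) */
        uint64_t read_off = 0, read_len = 512;

        /* Clamp the request to the file, then translate to an object range. */
        if (read_off + read_len > filesize)
            read_len = filesize - read_off;

        printf("GET Obj003 bytes %llu-%llu\n",
               (unsigned long long)(objoffs + read_off),
               (unsigned long long)(objoffs + read_off + read_len - 1));
        return 0;
    }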
Configuration
• Top-level MarFS mountpoint
• A stanza for every name space / metadata file system you want to bring into MarFS
– Describes how metadata is to be handled in this part of the MarFS system
– Describes where data is to be put for files of various sizes/shapes (into which data repo)
• A stanza for every Repo: an object system, an area of an object system, or another access method for data
– Describes how data is to be stored into this data repo: chunksizes, methods, etc. (illustrated below)
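Not the actual MarFS config syntax, but a sketch of the information the two kinds of stanza carry, as described above; all field names here are made up for illustration.

    /* Illustrative only: what a Repo stanza and a Namespace stanza describe. */
    struct marfs_repo {
        const char *name;        /* e.g. "ObjRepoA" */
        const char *access;      /* e.g. "S3" or "CDMI" */
        const char *host;        /* object server endpoint */
        long long   chunksize;   /* bytes per data object, e.g. 256M */
    };

    struct marfs_namespace {
        const char *mnt_path;    /* where it appears under the MarFS mountpoint */
        const char *md_path;     /* metadata file system, e.g. "/GPFS-MarFS-md1" */
        struct marfs_repo *interactive_repo;  /* repo used for FUSE (interactive) writes */
        struct marfs_repo *batch_repo;        /* repo chosen for other file sizes/shapes */
    };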
Recoverability
• A parallel backup of the metadata file system(s) backs up the metadata for MarFS
– (METADATA ONLY)
• Objects are encoded with CREATE TIME info
– (path, uid/gid, mode, times, MarFS/other xattrs, etc.; a sketch of such a record follows below)
• Parallel backup of object metadata
– The object name has some recovery info in it, so just listing/saving the objects/buckets is useful
– Plus any other info your object server will allow you to back up
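A minimal sketch of the kind of create-time recovery record the bullet above describes being encoded with each object. The text layout is hypothetical, not the MarFS on-object format; the stat source and path are stand-ins.

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Format the create-time recovery info (path, uid/gid, mode, times)
       so it could be stored alongside the object's data. */
    static int format_recovery_info(char *buf, size_t len,
                                    const char *path, const struct stat *st)
    {
        return snprintf(buf, len,
                        "path=%s uid=%u gid=%u mode=%o mtime=%lld ctime=%lld",
                        path, st->st_uid, st->st_gid, st->st_mode,
                        (long long)st->st_mtime, (long long)st->st_ctime);
    }

    int main(void)
    {
        struct stat st;
        const char *path = "/MarFS/ProjectA/DirA/FA";   /* hypothetical MarFS path */
        if (stat(".", &st) != 0) return 1;              /* stand-in stat source */

        char rec[512];
        format_recovery_info(rec, sizeof(rec), path, &st);
        printf("recovery info: %s\n", rec);
        return 0;
    }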
Pftool
• A highly parallel copy/rsync/compare/list tool
• Walks the tree in parallel; copies/rsyncs/compares in parallel
• Parallel readdirs, stats, and copy/rsync/compare
– Dynamic load balancing (a simplified sketch of the queue flow appears below the diagram)
– Restartability for large trees or even very large files
– Repackaging: breaks up big files, coalesces small files
– To/from NFS, POSIX, parallel FS, MarFS
[Diagram: pftool worker roles (Load Balancer, Scheduler, Reporter, Stat/Readdir workers, Copy/Rsync/Compare workers) fed by work queues (Dirs Queue, Stat Queue, Cp/R/C Queue, Done Queue)]
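A much-simplified, serial sketch of the queue flow in that diagram: directories are expanded (readdir), entries are stat'ed, and regular files become copy work items. Real pftool distributes these queues over MPI ranks with dynamic load balancing; the source tree below is just an example.

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    static void handle_dir(const char *dir)
    {
        DIR *d = opendir(dir);
        if (!d) { perror(dir); return; }
        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, "..")) continue;

            char path[4096];
            snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);

            struct stat st;                        /* "stat queue" work item */
            if (lstat(path, &st) != 0) { perror(path); continue; }

            if (S_ISDIR(st.st_mode))
                handle_dir(path);                  /* back onto the "dirs queue" */
            else if (S_ISREG(st.st_mode))
                printf("copy item: %s (%lld bytes)\n",   /* "cp/r/c queue" */
                       path, (long long)st.st_size);
        }
        closedir(d);
    }

    int main(void)
    {
        handle_dir("/tmp");                        /* example source tree */
        return 0;
    }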
How does it fit into our environment (circa FY16)?
[Diagram: the Premier machine (2 PB DRAM, 4 PB burst buffer at 2 TB/s) with general and private IO nodes mounts /localscratch(s), /sitescratch(s), /home, /project; capacity machines (~50-300 TB DRAM) with general IO nodes mount /sitescratch(s), /home, /project. Private IB connects to Local Scratch (100 PB, 1 TB/sec, 1-4 weeks residency); site IB/Ether/LNet routers and switches (Damselfly) connect to Site Scratch (10's of PB, 100 GB/sec, 1-4 weeks), NFS /home and /project, Campaign MarFS (100's of PB, 100's of GB/sec, a few years of residency, erasure protected), and HPSS (100 PB, 10 GB/sec, forever; parallel tape with disk cache, parallel load-balanced movers). Batch File Transfer Agents (~100 at 2-8 GB/sec each) and interactive FTA(s) mount /localscratch(s), /sitescratch(s), /home, /project, /campaign, HPSS, and /analytics (HDFS or other); WAN FTA(s) with special security rules provide 100(s) of Gbits/sec off site. An analytics machine, potentially disk-full/big-memory HDFS, mounts /sitescratch(s), /campaign, and /analytics, using an HDFS-POSIX shim for access to POSIX resources.]
Security Model
• All POSIX security is obeyed by MarFS
• Additional special security can be added by configuration to manage which parts of the name space allow metadata and data update/read, and you can control those special permissions for interactive and batch separately, per name space:
– rm: read metadata, wm: write metadata, rd: read data, wd: write data, ud: unlink data (see the bitmask sketch below)
• Data read/write can be locked down separately from metadata read/update
• The value is not stored with the file; it is a real-time, fast way to control access
• Object security is provided by the following methods
– The password for Object Server access is stashed safely, can be time-based, and is crypto-securely sent to the Object Server on every request
– Encryption in the data path to objects can be turned on
– Encryption at rest could be implemented and is on the futures list
– Protecting the trash is essential as well
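A minimal sketch of the per-namespace flags listed above as a bitmask check. The encoding is illustrative; as the slide notes, MarFS keeps these values in configuration rather than with the file, so access can be changed quickly.

    #include <stdio.h>

    /* Illustrative encoding of the per-namespace flags. */
    enum {
        NS_RM = 1 << 0,  /* read metadata  */
        NS_WM = 1 << 1,  /* write metadata */
        NS_RD = 1 << 2,  /* read data      */
        NS_WD = 1 << 3,  /* write data     */
        NS_UD = 1 << 4,  /* unlink data    */
    };

    int main(void)
    {
        /* Example: an interactive (FUSE) namespace allowed to read everything
           and update metadata, but not to write or delete data. */
        unsigned iperms = NS_RM | NS_WM | NS_RD;

        unsigned want = NS_WD;                 /* a user tries to write data */
        printf("write-data %s\n", (iperms & want) ? "allowed" : "denied");
        return 0;
    }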
Futures
• File data versioning, using data pointers in trash
• Dual copy / MarFS erasure, to enable erasure on erasure
• Metadata update logging: investigate how this might be done and what the cost is
• Compression and encryption in MarFS (for repos that don't compress)
• Offline optimization/sorting/indexing of attrs and user xattrs, etc.
• Append or sparse support: needs careful consideration, hard to do because of the bookkeeping
• Other access methods: HPSS, Globus, other
• HDFS alternate access to the same data, via the Java HDFS lib
• Offline deep reconcile/repack, in case the trash is lost
• Semi-direct: store data into parallel file system file(s) for Globus or other parallel N-to-1 write staging