nova-fortis: a fault-tolerant non-volatile main memory file system · 2017-11-19 · 1 nova-fortis:...



Page 1: NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System · 2017-11-19 · 1 NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System Jian Andiry Xu, Lu Zhang,


NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System

Jian Andiry Xu, Lu Zhang, Amirsaman Memaripour,

Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva,

Andy Rudoff (Intel), Steven Swanson

Non-Volatile Systems Laboratory

Department of Computer Science and Engineering

University of California, San Diego

Non-volatile Memory and DAX

• Non-volatile main memory (NVMM)

– PCM, STT-RAM, ReRAM, 3D XPoint technology

– Resides on the memory bus, load/store interface

[Figure: Application, DRAM, NVMM, HDD / SSD and File system; load/store paths from the application]

Non-volatile Memory and DAX

• Non-volatile main memory (NVMM)

– PCM, STT-RAM, ReRAM, 3D XPoint technology

– Resides on the memory bus, load/store interface

• Direct Access (DAX)

– DAX file I/O bypasses the page cache

– DAX-mmap() maps NVMM pages directly into the application address space and bypasses the file system

– “Killer app”

[Figure: Application, DRAM, NVMM and HDD / SSD; mmap() with a copy, compared with DAX-mmap()]
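A minimal sketch of the load/store style of access that mmap() gives an application, using Python's standard mmap module on an ordinary temporary file. It runs on any file system; only on a DAX file system would such a mapping reach NVMM pages directly, without the page cache. The file, the helper names and the sizes are illustrative assumptions.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

def store_and_load(path):
    """Write through a shared mapping, then read the bytes back with read()."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), PAGE) as m:
            m[0:5] = b"hello"   # a plain store into the mapped region
            m.flush()           # msync(): ask the OS to persist the range
        f.seek(0)
        return f.read(5)

def demo():
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, PAGE)  # the file must be at least as long as the mapping
        os.close(fd)
        return store_and_load(path)
    finally:
        os.remove(path)

result = demo()
```

The store in the middle of store_and_load() involves no read() or write() call, which is why the file system does not observe it.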

Application expectations on NVMM File System

POSIX I/O, Atomicity, Fault Tolerance, Speed, Direct Access (DAX)

ext4, xfs, Btrfs, F2FS

✔ ❌ ❌ ✔ ❌

PMFS, ext4-DAX, xfs-DAX

❌ ❌ ✔ ✔ ✔

Strata (SOSP ’17)

✔ ✔ ✔ ❌ ❌

NOVA (FAST ’16)

✔ ✔ ✔ ✔ ❌

NOVA-Fortis

✔ ✔ ✔ ✔ ✔


Challenges

DAX

NOVA: Log-structured FS for NVMM

• Per-inode logging

– High concurrency

– Parallel recovery

• High scalability

– Per-core allocator, journal and inode table

• Atomicity

– Logging for single inode update

– Journaling for update across logs

– Copy-on-Write for file data

[Figure: per-inode logging; an inode with Head and Tail pointers into its inode log, and data pages]

Jian Xu and Steven Swanson, NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories, FAST ’16.


Snapshot


Snapshot support

• Snapshot is essential for file system backup

• Widely used in enterprise file systems

– ZFS, Btrfs, WAFL

• Snapshot is not available with DAX file systems

Snapshot for normal file I/O

[Figure: the file log for Page 0 as the current snapshot number moves from 0 to 1 to 2. Legend: file write entry, data in snapshot, reclaimed data, current data. Operations shown: four write(0, 4K) calls, two take_snapshot() calls and recover_snapshot(1).]
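A toy model of snapshots over a copy-on-write file log, assuming each write entry is tagged with the epoch in which it was appended and that a snapshot sees the newest entry at or below its epoch. The class, its methods and the exact order of operations are assumptions for illustration; the slide does not preserve the real interleaving.

```python
class FileLog:
    """Toy per-file log: every write entry carries the epoch it was written in."""

    def __init__(self):
        self.epoch = 0        # current epoch ID (assumed to start at 0)
        self.entries = []     # (epoch, page_no, data), in append order

    def write(self, page_no, data):
        # Copy-on-write: append a new entry, never overwrite old data in place.
        self.entries.append((self.epoch, page_no, data))

    def take_snapshot(self):
        snap_id = self.epoch  # snapshot N covers everything written up to epoch N
        self.epoch += 1
        return snap_id

    def read(self, page_no, snapshot=None):
        limit = self.epoch if snapshot is None else snapshot
        for epoch, page, data in reversed(self.entries):
            if page == page_no and epoch <= limit:
                return data
        return None

log = FileLog()
log.write(0, "A")
s0 = log.take_snapshot()
log.write(0, "B")
log.write(0, "C")
s1 = log.take_snapshot()
log.write(0, "D")
```

In this run "B" was overwritten inside one epoch, so no snapshot refers to it and its page could be reclaimed, while "A" and "C" must be kept for snapshots 0 and 1.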

Memory Ordering With DAX-mmap()

D = 42;
Fence();
V = True;

• Recovery invariant: if V == True, then D is valid

D     V      Valid
?     False  ✓
42    False  ✓
42    True   ✓
?     True   ✗
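The table can be checked mechanically. The sketch below enumerates the (D, V) pairs a crash could leave behind, under the assumption that each store is atomic and that the fence only rules out V = True persisting before D = 42. The function names are hypothetical.

```python
from itertools import product

UNSET = "?"

def invariant_holds(d, v):
    """Recovery invariant: if V is True, then D must hold the valid value."""
    return (not v) or d == 42

def reachable_states(fenced):
    """(D, V) pairs a crash can leave in NVMM, assuming atomic stores."""
    states = set()
    for d, v in product((UNSET, 42), (False, True)):
        # With the fence, V = True can persist only after D = 42 has persisted.
        if fenced and v and d == UNSET:
            continue
        states.add((d, v))
    return states

with_fence = reachable_states(fenced=True)
without_fence = reachable_states(fenced=False)
```

With the fence, the three reachable states all satisfy the invariant; without it, the fourth state (?, True) becomes reachable and violates it.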

Memory Ordering With DAX-mmap()

D = 42;
Fence();
V = True;

• Recovery invariant: if V == True, then D is valid

• D and V live in two pages of a mmap()’d region.

[Figure: the application DAX-mmap()s NVMM; D is in Page 1 and V is in Page 3]

DAX Snapshot: Idea

• Set pages read-only, then copy-on-write

[Figure: three layers (applications, file system, file data). Applications reach file data through DAX-mmap() with no file system intervention; pages are marked RO.]

DAX Snapshot: Incorrect implementation

• Application invariant: if V is True, then D is valid

Application thread:

D = ?;
V = False;
D = 42;
V = True;

NOVA snapshot:

snapshot_begin();
set_read_only(page_d);
copy_on_write(page_d);
set_read_only(page_v);
snapshot_end();

[Figure: timeline of application values and snapshot values for (D, V). The store D = 42 takes a page fault and is copied on write before page_v is marked read-only. Application values: (?, F), (42, F), (42, T). Snapshot values: (?, T).]

DAX Snapshot: Correct implementation

• Delay CoW page faults completion until all pages are read-only

Application thread:

D = ?;
V = False;
D = 42;
V = True;

NOVA snapshot:

snapshot_begin();
set_read_only(page_d);
set_read_only(page_v);
snapshot_end();
copy_on_write(page_d);
copy_on_write(page_v);

[Figure: timeline of application values and snapshot values for (D, V). The store D = 42 takes a page fault that completes only after snapshot_end(). Application values: (?, F), (42, F), (42, T). Snapshot values: (?, F).]
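The difference between the two orderings can be replayed in a few lines. This is a simulation sketch, not NOVA code: the schedule (the application runs once between the two set_read_only calls and once after the snapshot ends) and all names are assumptions chosen to mirror the slides.

```python
def take_snapshot(delay_cow):
    """Replay the interleaving; return (application values, snapshot values)."""
    live = {"d": "?", "v": False}          # what the application sees
    snap = {}                              # page contents frozen for the snapshot
    pending = [("d", 42), ("v", True)]     # application stores, in program order

    def set_read_only(page):
        snap[page] = live[page]            # the snapshot keeps the page as it is now

    def app_runs(snapshot_ended):
        # The application thread issues stores in order and stalls on a page
        # fault that the file system has not completed yet.
        while pending:
            page, value = pending[0]
            faults = page in snap          # read-only page: the store faults
            if faults and delay_cow and not snapshot_ended:
                return                     # fault completion waits for snapshot_end()
            live[page] = value             # store lands in the live (copied) page
            pending.pop(0)

    set_read_only("d")
    app_runs(snapshot_ended=False)         # application races with the snapshot
    set_read_only("v")
    app_runs(snapshot_ended=True)          # delayed faults may now complete
    return (live["d"], live["v"]), (snap["d"], snap["v"])

naive_app, naive_snap = take_snapshot(delay_cow=False)
safe_app, safe_snap = take_snapshot(delay_cow=True)
```

Both runs leave the application at (42, True). Only the naive run captures (?, True) in the snapshot, which breaks the invariant; the delayed run captures (?, False).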

Performance impact of snapshots

• Normal execution vs. taking snapshots every 10s

– Negligible performance loss through read()/write()

– Average performance loss 3.7% through mmap()

[Figure: performance without (W/O) and with (W) snapshots, axis 0 to 1.2, for Filebench (read/write) and WHISPER (DAX-mmap())]


Protecting Metadata and Data

NVMM Failure Modes

• Detectable errors

– Media errors detected by NVMM controller

– Raises Machine Check Exception (MCE)

• Undetectable errors

– Media errors not detected by NVMM controller

– Software scribbles

[Figure, detectable media error: on a read, the NVMM controller detects an uncorrectable error and raises an exception; software receives an MCE]

[Figure, undetectable media error: on a read, the NVMM controller sees no error; software consumes corrupted data]

[Figure, software scribble: buggy code writes to NVMM; the NVMM controller updates ECC, leaving a scribble error in the NVMM data]

NOVA-Fortis Metadata Protection

• Detection

– CRC32 checksums in all structures

– Use memcpy_mcsafe() to catch MCEs

• Correction

– Replicate all metadata: inodes, logs, superblock, etc.

– Tick-tock: persist primary before updating replica

[Figure: the inode (Head, Tail, csum, H1, T1) and its replica inode’ (Head’, Tail’, csum’, H1’, T1’); the log with entries ent1 … entN and checksums c1 … cN, and its replica log’ (ent1’ c1’ … entN’ cN’); data pages Data 1 and Data 2]
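A small sketch of checksum-based detection with replica-based correction for one metadata structure. The 20-byte layout (two 64-bit fields plus a CRC32), the function names and the use of zlib.crc32 are assumptions for illustration; the tick-tock update ordering is not modeled here.

```python
import struct
import zlib

def pack_inode(head, tail):
    """Serialize (head, tail) followed by a CRC32 over those two fields."""
    body = struct.pack("<QQ", head, tail)
    return body + struct.pack("<I", zlib.crc32(body))

def is_intact(blob):
    body = blob[:16]
    (csum,) = struct.unpack("<I", blob[16:20])
    return zlib.crc32(body) == csum

def read_inode(primary, replica):
    """Return (head, tail, repaired_primary, repaired_replica)."""
    if is_intact(primary):
        good = primary
    elif is_intact(replica):
        good = replica
    else:
        raise IOError("both copies are corrupt")
    head, tail = struct.unpack("<QQ", good[:16])
    return head, tail, good, good   # the intact copy overwrites the damaged one

primary = pack_inode(head=4096, tail=4160)
replica = primary
scribbled = b"\xff" + primary[1:]   # simulate a one-byte scribble on the primary
head, tail, fixed_primary, fixed_replica = read_inode(scribbled, replica)
```

The checksum only detects the damage; it is the replica that makes the structure recoverable.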

NOVA-Fortis Data Protection

• Metadata

– CRC32 + replication for all structures

• Data

– RAID-4 style parity

– Replicated checksums

[Figure: the replicated inode and log (inode, inode’, log, log’), plus one data block split into 8 stripes S0 … S7 with a parity stripe P; the stripe checksums are replicated]

P = S0 ⊕ S1 ⊕ … ⊕ S7

Ci = CRC32C(Si)
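A sketch of the parity scheme above for a single block: eight stripes, one XOR parity stripe, and one checksum per stripe used to locate the damaged stripe. It assumes a 4 KB block and substitutes zlib.crc32 (plain CRC32) for CRC32C, which the Python standard library does not provide. All function names are hypothetical.

```python
import zlib
from functools import reduce

STRIPES = 8

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def protect(block):
    """Split a block into stripes; return (stripes, parity, checksums)."""
    size = len(block) // STRIPES
    stripes = [block[i * size:(i + 1) * size] for i in range(STRIPES)]
    parity = reduce(xor_bytes, stripes)          # P = S0 xor S1 xor ... xor S7
    checksums = [zlib.crc32(s) for s in stripes] # Ci, one per stripe
    return stripes, parity, checksums

def repair(stripes, parity, checksums):
    """Rebuild at most one stripe whose checksum no longer matches."""
    bad = [i for i, s in enumerate(stripes) if zlib.crc32(s) != checksums[i]]
    if len(bad) > 1:
        raise IOError("parity can rebuild only one stripe per block")
    for i in bad:
        others = [s for j, s in enumerate(stripes) if j != i]
        stripes[i] = reduce(xor_bytes, others, parity)
    return stripes

block = bytes(range(256)) * 16                   # one 4 KB block
stripes, parity, checksums = protect(block)
damaged = list(stripes)
damaged[3] = bytes(len(damaged[3]))              # zero out stripe 3
recovered = b"".join(repair(damaged, parity, checksums))
```

Parity alone cannot tell which stripe is wrong; the per-stripe checksums supply that, which is why they are themselves replicated on the slide.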

File data protection with DAX-mmap

• Stores are invisible to the file systems

• The file systems cannot protect mmap’ed data

• NOVA-Fortis’ data protection contract:

NOVA-Fortis protects pages from media errors and scribbles iff they are not mmap()’d for writing.

File data protection with DAX-mmap

• NOVA-Fortis logs mmap() operations

[Figure: in user space, applications issue load/store to mmap()’d file data on the NVDIMMs; in kernel space, NOVA-Fortis serves read/write and appends an mmap log entry to the file log. File data is marked protected or unprotected.]

File data protection with DAX-mmap

• On munmap and during recovery, NOVA-Fortis restores protection

[Figure: same layout as the mmap() case; after munmap(), protection is restored]

[Figure: same layout; after mmap() followed by a system failure and recovery]
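A toy model of the contract and of restoring protection. It assumes that "restoring protection" means recomputing a page's checksum once the page is no longer mapped for writing, and that the mmap log entry is what tells recovery which pages to revisit. Parity is left out, and the class and method names are hypothetical.

```python
import zlib

class ProtectedFile:
    def __init__(self, pages):
        self.pages = list(pages)
        self.csums = [zlib.crc32(p) for p in self.pages]
        self.log = []            # stand-in for the file log
        self.mapped = set()      # pages currently mmap()'d for writing

    def mmap(self, page_no):
        self.log.append(("mmap", page_no))   # logged so recovery can find it
        self.mapped.add(page_no)

    def store(self, page_no, data):
        # A store through the mapping: the file system does not see it.
        assert page_no in self.mapped
        self.pages[page_no] = data

    def munmap(self, page_no):
        self.mapped.discard(page_no)
        self._restore(page_no)

    def recover(self):
        # After a failure, replay the log to find pages that were mapped.
        for op, page_no in self.log:
            if op == "mmap":
                self._restore(page_no)
        self.mapped.clear()

    def _restore(self, page_no):
        self.csums[page_no] = zlib.crc32(self.pages[page_no])

    def verify(self, page_no):
        if page_no in self.mapped:
            return None          # outside the protection contract
        return zlib.crc32(self.pages[page_no]) == self.csums[page_no]

f = ProtectedFile([b"a" * 8, b"b" * 8])
f.mmap(0)
f.store(0, b"z" * 8)
while_mapped = (f.verify(0), f.verify(1))
f.munmap(0)
after_unmap = f.verify(0)
```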


Performance

Latency breakdown

[Figure: latency (microsecond, axis 0 to 6) of Create, Append 4KB, Overwrite 4KB, Overwrite 512B, Read 4KB and Read 16KB, broken down into VFS, alloc inode, journaling, memcpy_mcsafe, memcpy_nocache, append entry, free old data, calculate entry csum, verify entry csum, replicate inode, replicate log, verify data csum, update data csum and update data parity. The components are grouped under Metadata Protection and Data Protection.]

Application performance

[Figure: normalized throughput (axis 0 to 1.2) for Fileserver, Varmail, MongoDB, SQLite, TPCC and Average; series: ext4-DAX, Btrfs, NOVA, w/ MP, w/ MP+DP]

Conclusion

• Fault tolerance is critical for file systems, but existing DAX file systems don’t provide it

• We identify new challenges that NVMM file system fault tolerance poses

• NOVA-Fortis provides fault tolerance with high performance

– 1.5x faster on average than DAX-aware file systems without reliability features

– 3x faster on average than other reliable file systems

Give it a try

https://github.com/NVSL/linux-nova


Thanks!


Backup slides

Hybrid DRAM/NVMM system

• Non-volatile main memory (NVMM)

– PCM, STT-RAM, ReRAM, 3D XPoint technology

• File system for NVMM

[Figure: a host with CPU, DRAM and NVMM; the NVMM FS]

Disk-based file systems are inadequate for NVMM

• Ext4, xfs, Btrfs, F2FS, NILFS2

• Built for hard disks and SSDs

– Software overhead is high

– CPU may reorder writes to NVMM

– NVMM has different atomicity guarantees

• Cannot exploit NVMM performance

• Performance optimization compromises consistency on system failure [1]

Atomicity               Ext4 wb   Ext4 order   Ext4 dataj   Btrfs   xfs
1-Sector overwrite      ✓         ✓            ✓            ✓       ✓
1-Sector append         ✗         ✓            ✓            ✓       ✓
1-Block overwrite       ✗         ✗            ✓            ✓       ✗
1-Block append          ✗         ✓            ✓            ✓       ✓
N-Block write/append    ✗         ✗            ✗            ✗       ✗
N-Block prefix/append   ✗         ✓            ✓            ✓       ✓

[1] Pillai et al., All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI ’14.

NVMM file systems are not strongly consistent

• BPFS, PMFS, Ext4-DAX, SCMFS, Aerie

• None of them provides strong metadata and data consistency

File system   Metadata atomicity   Data atomicity   Mmap atomicity [1]
BPFS          Yes                  Yes [2]          No
PMFS          Yes                  No               No
Ext4-DAX      Yes                  No               No
SCMFS         No                   No               No
Aerie         Yes                  No               No
NOVA          Yes                  Yes              Yes

[1] Each msync() commits updates atomically.

[2] In BPFS, write times are not updated atomically with respect to the write itself.


Why LFS?

• Log-structuring provides cheaper atomicity than journaling and shadow paging

• NVMM supports fast, highly concurrent random accesses

– Using multiple logs does not negatively impact performance

– Log does not need to be contiguous

• Rethink and redesign log-structuring entirely

Atomicity

• Log-structuring for single log update

– Write, msync, chmod, etc.

– Strictly commit log entry to NVMM before updating log tail

• Lightweight journaling for update across logs

– Unlink, rename, etc.

– Journal log tails instead of metadata or data

• Copy-on-write for file data

– Log only contains metadata

– Log is short

[Figure: a file log and a directory log, each with a Tail pointer; the journal records the Dir tail and the File tail]

[Figure: a file log and a directory log with their Tail pointers; file log entries point to data pages Data 0, Data 1 and Data 2]
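The first rule (persist the entry, then move the tail) can be sketched as a two-step append. This is a toy model under stated assumptions: the list stands in for the log area in NVMM, the integer stands in for the tail pointer, and the crash flag is a hypothetical hook for stopping between the two steps.

```python
class InodeLog:
    """Toy model of a log area plus the tail pointer that commits entries."""

    def __init__(self):
        self.entries = []   # stand-in for the persistent log area
        self.tail = 0       # entries[:tail] are committed

    def append(self, entry, crash_before_commit=False):
        # Step 1: write the entry past the tail and persist it.
        del self.entries[self.tail:]      # overwrite leftovers of a failed append
        self.entries.append(entry)
        if crash_before_commit:
            return
        # Step 2: advance the tail; this single update commits the entry.
        self.tail = len(self.entries)

    def recover(self):
        """After a crash, only entries below the tail are visible."""
        return self.entries[:self.tail]

log = InodeLog()
log.append("write page 0")
log.append("write page 1", crash_before_commit=True)
visible_after_crash = log.recover()
log.append("chmod")
visible_later = log.recover()
```

Because the entry is written before the tail moves, a crash between the two steps leaves an uncommitted entry that recovery simply ignores.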

Performance

• Per-inode logging allows for high concurrency

• Split data structure between DRAM and NVMM

– Persistent log is simple and efficient

– Volatile tree structure has no consistency overhead

[Figure: the file log, its Tail and data pages Data 0, Data 1 and Data 2 in NVMM; a radix tree in DRAM with slots 0, 1, 2, 3]

NOVA layout

• Put allocator in DRAM

• High scalability

– Per-CPU NVMM free list, journal and inode table

– Concurrent transactions and allocation/deallocation

[Figure: per-CPU structures for CPU 0 and CPU 1 (free list in DRAM; journal and inode table in NVMM), the superblock, the recovery inode, and an inode with Head and Tail pointers into its inode log]

Fast garbage collection

• Log is a linked list

• Log only contains metadata

• Fast GC deletes dead log pages from the linked list

• No copying

[Figure: log pages linked from Head to Tail; legend: valid log entry, invalid log entry]
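A sketch of fast GC as plain linked-list surgery: a log page whose entries are all invalid is unlinked, and nothing is copied. The LogPage class and the (entry, is_valid) representation are assumptions for illustration.

```python
class LogPage:
    def __init__(self, entries, next_page=None):
        self.entries = entries      # list of (entry, is_valid)
        self.next = next_page

def fast_gc(head):
    """Unlink pages that hold only invalid entries; return the new head."""
    dummy = LogPage([], head)
    prev, page = dummy, head
    while page is not None:
        dead = bool(page.entries) and not any(ok for _, ok in page.entries)
        if dead:
            prev.next = page.next   # unlink the page; no entry is copied
        else:
            prev = page
        page = page.next
    return dummy.next

p3 = LogPage([("c", True)])
p2 = LogPage([("b", False)], p3)
p1 = LogPage([("a", True), ("x", False)], p2)
head = fast_gc(p1)
```

Page p1 survives even though one of its entries is invalid, which is the case that thorough GC exists to handle.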

Thorough garbage collection

• Starts if valid log entries < 50% log length

• Format a new log and atomically replace the old one

• Only copy metadata

[Figure: log pages linked from Head to Tail; legend: valid log entry, invalid log entry]
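A sketch of the trigger and the compaction step. The log is modeled as a flat list of (entry, is_valid) pairs and the atomic swap is left to the caller; both are simplifying assumptions, and the function name is hypothetical.

```python
def thorough_gc(log, threshold=0.5):
    """Return the log to use from now on: the old one, or a compacted copy."""
    valid = [(entry, ok) for entry, ok in log if ok]
    if not log or len(valid) >= threshold * len(log):
        return log                   # enough live entries: keep the old log
    return list(valid)               # new log holding only live metadata entries

sparse = [("a", True), ("b", False), ("c", False), ("d", False)]
dense = [("a", True), ("b", True), ("c", False)]
compacted = thorough_gc(sparse)
untouched = thorough_gc(dense)
```

Only log entries are copied, never file data, because the log holds metadata alone.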

Recovery

• Rebuild DRAM structure

– Allocator

– Lazy rebuild: postpones inode radix tree rebuild

   • Accelerates recovery

   • Reduces DRAM consumption

• Normal shutdown recovery:

– Store allocator in recovery inode

– No log scanning

• Failure recovery:

– Log is short

– Parallel scan

– Failure recovery bandwidth: > 400 GB/s

[Figure: the NOVA layout (per-CPU free list, journal and inode table for CPU 0 and CPU 1, superblock, recovery inode) with recovery threads]

Snapshot for normal file I/O

[Figure: the file log for Page 1 with snapshot entries and file write entries tagged with epoch IDs 0, 1 and 2; Snapshot 0 and Snapshot 1, with ranges labeled [0, 1) and [1, 2). Legend: snapshot entry, file write entry, epoch ID, data in snapshot, reclaimed data, current data. Final step shown: delete snapshot 0.]

Corrupt Snapshots with DAX-mmap()

• Recovery invariant: if V == True, then D is valid

– Incorrect: Naïvely mark pages read-only one-at-a-time

[Figure: timeline for the application, the page hosting D and the page hosting V. The page hosting D goes from R/W to RO and is snapshotted with D = ?. The store D = 5 page-faults and is copied on write; V = True then lands in the page hosting V while it is still R/W. That page is snapshotted with V = True, so the snapshot holds D = ? with V = True: corrupt.]

Consistent Snapshots with DAX-mmap()

• Recovery invariant: if V == True, then D is valid

– Correct: Delay CoW page faults completion until all pages are read-only

[Figure: timeline for the application, the page hosting D and the page hosting V. The store D = 5 page-faults and waits while both pages go from R/W to RO and are snapshotted with D = ? and V = False. Copy on write then completes, and D = 5 and V = True reach the live pages. The snapshot holds D = ? with V = False: consistent.]

Snapshot-related latency

[Figure: latency (microsecond, axis 0 to 10) of snapshot creation, snapshot deletion and CoW page fault (4KB), broken down into snapshot manifest init, combine manifests, radix tree locking, sync superblock, mark pages read-only, change mapping and memcpy_nocache]

Defense Against Scribbles

• Tolerating Larger Scribbles

– Allocate replicas far from one another

– NOVA metadata can tolerate scribbles of 100s of MB

• Preventing scribbles

– Mark all NVMM as read-only

– Disable CPU write protection while accessing NVMM

– Exposes all kernel data to bugs in a very small section of NOVA code.

NVMM Failure Modes: Media Failures

• Media errors

– Detectable & correctable

– Detectable & uncorrectable

– Undetectable

• Software scribbles

– Kernel bugs or own bugs

– Transparent to hardware

[Figure, detectable and correctable: on a read of NVMM data with a media error, the NVMM controller detects and corrects the error; software consumes good data]

[Figure, detectable and uncorrectable: on a read, the NVMM controller detects an uncorrectable error and raises an exception; software receives an MCE. The media error has a Poison Radius (PR), e.g. 512 bytes.]

[Figure, undetectable: on a read, the NVMM controller sees no error; software consumes corrupted data]

NVMM Failure Modes: Scribbles

• Media errors

– Detectable & correctable

– Detectable & uncorrectable

– Undetectable

• Software “scribbles”

– Kernel bugs or NOVA bugs

– NVMM file systems are highly vulnerable

[Figure, write: buggy code scribbles on NVMM; the NVMM controller updates ECC, leaving a scribble error in the NVMM data]

[Figure, read: the NVMM controller sees no error on the scribbled data; software consumes corrupted data]

File operation latency

[Figure: latency (microsecond, axis 0 to 20) of Create, Append (4KB), Overwrite (4KB), Overwrite (512B) and Read (4KB); series: xfs-DAX, PMFS, ext4-DAX, ext4-dataj, Btrfs, NOVA, w/ MP, w/ MP+WP, w/ MP+DP, w/ MP+DP+WP, Relaxed mode]

Random R/W bandwidth on NVDIMM-N

[Figure: two panels, NVDIMM-N 4K Read (axis 0 to 30) and NVDIMM-N 4K Write (axis 0 to 14); bandwidth (GB/s) against 1, 2, 4, 8 and 16 threads; series: xfs-DAX, PMFS, ext4-DAX, ext4-dataj, Btrfs, NOVA, w/ MP, w/ MP+DP, Relaxed mode]

Scribble size and metadata bytes at risk

[Figure: metadata pages at risk (axis 1.5E-5 to 16K) against scribble size in bytes (1 to 256M); worst-case and average curves for no replication, simple replication, two-way replication and dead-zone replication]

Storage overhead

• File data: 82.4%

• Primary inode: 0.1%

• Primary log: 2.0%

• Replica inode: 0.1%

• Replica log: 2.0%

• File checksum: 1.6%

• File parity: 11.1%

• Unused: 0.8%

Latency breakdown

[Figure: latency (microsecond, axis 0 to 9) of Create, Append 4KB, Overwrite 4KB, Overwrite 512B, Read 4KB and Read 16KB, broken down into VFS, alloc inode, journaling, memcpy_mcsafe, memcpy_nocache, append entry, free old data, calculate entry csum, verify entry csum, replicate inode, replicate log, verify data csum, update data csum, update data parity and write protection. The components are grouped under Metadata Protection, Data Protection and Scribble Prevention.]

Application performance on NOVA-Fortis

[Figure, NVDIMM-N: normalized throughput (axis 0 to 1.2) for Fileserver, Varmail, Webproxy, Webserver, RocksDB, MongoDB, Exim, SQLite, TPCC and Average; series: xfs-DAX, PMFS, ext4-DAX, ext4-dataj, Btrfs, NOVA, w/ MP, w/ MP+DP, Relaxed mode. Ops/second labels: 495k, 610k, 553k, 692k, 27k, 73k, 30k, 126k, 45k.]

[Figure, PCM: throughput normalized to NVDIMM-N (axis 0 to 1.2) for the same workloads and series]