resolving journaling of journal anomaly in android i/o: multi-version b-tree with lazy split
DESCRIPTION
LG 전자 세미나 Mar 18, 2014 Paper presented in USENIX FAST ’14, Santa Clara, CA. Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy Split. Wook-Hee Kim 1 , Beomseok Nam 1 , Dongil Park 2 , Youjip Won 2 - PowerPoint PPT PresentationTRANSCRIPT
Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy
Split
Wook-Hee Kim1, Beomseok Nam1, Dongil Park2, Youjip Won2
1 Ulsan National Institute of Science and Technology, Korea2 Hanyang University, Korea
LG 전자 세미나Mar 18, 2014
Paper presented in USENIX FAST ’14, Santa Clara, CA
USENIX FAST ‘14, Santa Clara, CA
Outline
Motivation• Journaling of Journal Anomaly
How to resolve Journaling of Journal anomaly • Multi-Version B-Tree (MVBT)
Optimizations of Multi-Version B-Tree • Lazy Split • Reserved Buffer Space • Lazy Garbage Collection • Metadata Embedding • Disabling Sibling Redistribution
Evaluation Demo
2
USENIX FAST ‘14, Santa Clara, CA
Android is the most popular mobile platform
3
USENIX FAST ‘14, Santa Clara, CA
Storage in Android platform
Performance Bottleneck
4
1 0
0
1 0 1
0 1
0
10
0 1
11
1
00
0
1
0
Major cause for device break down.
Cause = Excessive IO
USENIX FAST ‘14, Santa Clara, CA
Android I/O Stack
EXT4
Insert/Update/Delete/Select
Read/Write
5
SQLiteMisaligned Interaction
Block Device Driver
Read/Write
USENIX FAST ‘14, Santa Clara, CA
SQLite insert (Truncate mode)
SQLite
insert
Open rollback journal.
Record the data to journal.
Put commit mark to journal.
Insert entry to DBTruncate journal.
6
USENIX FAST ‘14, Santa Clara, CA
EXT4 Journaling (ordered mode)
EXT4
SQLite
Write data block
Write EXT4 journal
Write journal metadata
Write journal commit
7
write()
USENIX FAST ‘14, Santa Clara, CA
Journaling of Journal Anomaly
EXT4
SQLite
insert
8
One insert of 100 Byte 9 Random Writes of 4KByte
Write data block
Write EXT4 journal
Write journal metadata
Write journal commit
USENIX FAST ‘14, Santa Clara, CA
How to Resolve Journaling of Journal?
EXT4
SQLite
insert
Database Journaling Database Insertion
9
USENIX FAST ‘14, Santa Clara, CA
How to Resolve Journaling of Journal?
fsync() fsync() fsync()
Database Journaling
DatabaseInsertion
Journaling +Database
=
Version-based B-Tree
(MVBT)
10
Reducing the number of fsync() calls
USENIX FAST ‘14, Santa Clara, CA
Version-based B-Tree (MVBT)
10 [T1, ∞)
T1
Insert 10 Update 10 with 20
T2
10 [T1, T2)
20 [T2, ∞)
Time
Do not overwrite old version dataDo not need a rollback journal
11
10 [T1, T2)
USENIX FAST ‘14, Santa Clara, CA
Node Split in Multi-Version B-Tree
12
5 [3~∞)
10 [2~∞)
12 [4~∞)
40 [1~∞)
Page 1Key = 25
= 5
USENIX FAST ‘14, Santa Clara, CA
Node Split in Multi-Version B-Tree
13
5 [3~5)
10 [2~5)
12 [4~5)
40 [1~5)
Page 1 Dead NodeKey = 25
= 5
USENIX FAST ‘14, Santa Clara, CA
Node Split in Multi-Version B-Tree
14
5 [3~5)
10 [2~5)
12 [4~5)
40 [1~5)
Page 1 Key = 25
= 5
USENIX FAST ‘14, Santa Clara, CA
Node Split in Multi-Version B-Tree
15
5 [5~∞)
10 [5~∞)
Page 3
12 [5~∞)
40 [5~∞)
Page 2 5 [3~5)
10 [2~5)
12 [4~5)
40 [1~5)
Page 1
Key = 25
= 5
USENIX FAST ‘14, Santa Clara, CA
Node Split in Multi-Version B-Tree
16
5 [3~5)
10 [2~5)
12 [4~5)
40 [1~5)
Page 1 5 [5~∞)
10 [5~∞)
Page 3
12 [5~∞)
40 [5~∞)
Page 2
25 [5~∞)
10 [5~∞) : Page 3
∞ [5~∞) : Page 2
Page 4
∞ [0~5) : Page 1
USENIX FAST ‘14, Santa Clara, CA
5 [5~∞)
10 [5~∞)
Page 3
Node Split in Multi-Version B-Tree
17
5 [3~5)
10 [2~5)
12 [4~5)
40 [1~5)
Page 1
12 [5~∞)
40 [5~∞)
Page 2
10 [5~∞) : Page 3
∞ [5~∞) : Page 2
Page 4
25 [5~∞)
dirtydirty dirty
dirtyOne more dirty page than original B-Tree
∞ [0~5) : Page 1
USENIX FAST ‘14, Santa Clara, CA
I/O Traffic in MVBT
DB buffer cache
fsync()
18
Reduce the number of dirty pages!
MVBT
USENIX FAST ‘14, Santa Clara, CA
I/O Traffic in MVBT
MVBT
LS -MVBT
DB buffer cache
fsync()
DB buffer cache
fsync()
19
USENIX FAST ‘14, Santa Clara, CA
Optimizations in Android I/O
Write Read
twrite tread>>
WriteTransaction 1
WriteTransaction 2
WriteTransaction 3
20
Single Insert Transaction
Multiple Insert
USENIX FAST ‘14, Santa Clara, CA
Optimizations
LS-MVBT
Lazy Split Multi-Version B-Tree (LS-MVBT)
Lazy Split
Disabling Sibling
Redistribution
Metadata Embedding
Lazy Garbage
Collection
Reserved Buffer Space
21
USENIX FAST ‘14, Santa Clara, CA
Legacy Split in MVBT
Optimization1: Lazy Split
22
5 [3~5)
10 [2~5)
12 [4~5)
40 [1~5)
Page 1 5 [5~∞)
10 [5~∞)
Page 3
12 [5~∞)
40 [5~∞)
Page 2
10 [5~∞) : Page 3
∞ [5~∞) : Page 2
Page 4
25 [5~∞)
dirtydirty dirty
dirty
∞ [0~5) : Page 1 4 dirty pages
USENIX FAST ‘14, Santa Clara, CA
Optimization1: Lazy Split
23
5 [3~∞)
10 [2~∞)
12 [4~∞)
40 [1~∞)
Page 1Key = 25
= 5
USENIX FAST ‘14, Santa Clara, CA
Optimization1: Lazy Split
24
5 [3~∞)
10 [2~∞)
12 [4~5)
40 [1~5)
Page 1Lazy Node:Half-dead, Half-live
Lazy Node:Half-dead, Half-liveKey = 25
= 5
USENIX FAST ‘14, Santa Clara, CA
Optimization1: Lazy Split
25
5 [3~∞)
10 [2~∞)
12 [4~5)
40 [1~5)
Page 1Key = 25
= 5
USENIX FAST ‘14, Santa Clara, CA
Optimization1: Lazy Split
26
12 [5~∞)
40 [5~∞)
Page 2 5 [3~∞)
10 [2~∞)
Page 1
12 [4~5)
40 [1~5)
Key = 25
= 5
USENIX FAST ‘14, Santa Clara, CA
Optimization1: Lazy Split
27
12 [5~∞)
40 [5~∞)
Page 2 5 [3~∞)
10 [2~∞)
12 [4~5)
40 [1~5)
Page 1
10 [5~∞) : Page 1
∞ [0~5) : Page 1
∞ [5~∞) : Page 2
Page 3
25 [5~∞)
USENIX FAST ‘14, Santa Clara, CA
Optimization1: Lazy Split
Lazy Node Overflow
28
Dead entriesDead entries
12 [5~∞)
40 [5~∞)
Page 2 5 [3~∞)
10 [2~∞)
12 [4~5)
40 [1~5)
Page 1
10 [5~∞) : Page 1
∞ [0~5) : Page 1
∞ [5~∞) : Page 2
Page 3
25 [5~∞)
Key = 8
= 6
USENIX FAST ‘14, Santa Clara, CA
Lazy Node Overflow → Garbage collect dead entries
Optimization1: Lazy Split
29
12 [5~∞)
40 [5~∞)
Page 2
5 [3~∞)
10 [2~∞)
12 [4~5)
40 [1~5)
Page 1
10 [5~∞) : Page 1
∞ [0~5) : Page 1
∞ [5~∞) : Page 2
Page 3
25 [5~∞)
Key = 8
= 6
USENIX FAST ‘14, Santa Clara, CA
Lazy Node Overflow → Garbage collect dead entries
Optimization1: Lazy Split
30
5 [3~∞)
8 [6~∞)
10 [2~∞)
Page 1
10 [5~∞) : Page 1
∞ [0~5) : Page 1
∞ [5~∞) : Page 2
Page 3
12 [5~∞)
40 [5~∞)
Page 2
25 [5~∞)
But, what if the dead entries are being accessed by other transactions?
Key = 8
= 6
USENIX FAST ‘14, Santa Clara, CA
Optimization2: Reserved Buffer Space
Option 1. Wait for read transactions to finish
31
12 [5~∞)
40 [5~∞)
Page 2
5 [3~∞)
10 [2~∞)
12 [4~5)
40 [1~5)
Page 1
10 [5~∞) : Page 1
∞ [0~5) : Page 1
∞ [5~∞) : Page 2
Page 3
25 [5~∞)
Key = 8
= 6
USENIX FAST ‘14, Santa Clara, CA
Optimization2: Reserved Buffer Space
Option 2. Split as in legacy MVBT split
32
10 [5~∞) : Page 1
∞ [0~5) : Page 1
∞ [5~∞) : Page 2
Page 3
12 [5~∞)
40 [5~∞)
Page 2
25 [5~∞)
Key = 8
= 6
5 [6~∞)
10 [6~∞)
Page 4
5 [3~6)
10 [2~6)
12 [4~5)
40 [1~5)
Page 1
USENIX FAST ‘14, Santa Clara, CA
5 [6~∞)
8 [6~∞)
Page 4
Optimization2: Reserved Buffer Space
Option 2. Split as in legacy MVBT split
33
5 [3~6)
10 [2~6)
12 [4~5)
40 [1~5)
Page 1
10 [5~6) : Page 1
∞ [0~5) : Page 1
∞ [5~∞) : Page 2
Page 3
12 [5~∞)
40 [5~∞)
Page 2
25 [5~∞)
10 [6~∞) : Page 3
10 [6~∞)
USENIX FAST ‘14, Santa Clara, CA
Option 3. Pad some buffer space in tree nodes
Optimization2: Reserved Buffer Space
Buffer space is usedwhen lazy node is full
34
40 [5~∞)
Page 2
10 [2~∞)
12 [4~5)
40 [1~5)
Page 1
10 [5~∞) : Page 1
∞ [0~5) : Page 1
∞ [5~∞) : Page 2
Page 3
25 [5~∞)
12 [5~∞)
8 [6~∞)If buffer space is also full,
split as in legacy MVBT
Key = 8
= 6
USENIX FAST ‘14, Santa Clara, CA
• Similar to rollback in MVBT
Rollback in LS-MVBT
35
Txn 5 crashes
12 [5~∞)
40 [5~∞)
Page 2
25 [5~∞)
Rollback
Number of dirty nodes touched by rollback of LS-MVBT is also smaller than that of MVBT.
10 [5~∞) : Page 1
∞ [0~5) : Page 1
∞ [5~∞) : Page 2
Page 3 10 [5~∞) : Page 1
∞ [0~5) : Page 1
∞ [5~∞) : Page 2
Page 3
5 [3~∞)
10 [2~∞)
12 [4~5)
40 [1~5)
Page 1 5 [3~∞)
10 [2~∞)
12 [4~5)
40 [1~5)
Page 1 5 [3~∞)
10 [2~∞)
12 [4~5)
40 [1~5)
Page 1
∞ [0~5) : Page 1
Page 3
5 [3~∞)
10 [2~∞)
12 [4~∞)
40 [1~ ∞)
Page 1
∞ [0~∞) : Page 1
Page 3
USENIX FAST ‘14, Santa Clara, CA
Optimization3: Lazy Garbage Collection
Periodic Garbage Collection
36
P1 P2
P3
P1
Transaction buffer cache
Transaction 1
USENIX FAST ‘14, Santa Clara, CA
Optimization3: Lazy Garbage Collection
Periodic Garbage Collection
37
GarbageCollection
P1
GC buffer cache
Transaction 1
P2 P3
Stopped
P1 P2
P3
fsync()
Extra Dirty Pages
Transaction buffer cache
USENIX FAST ‘14, Santa Clara, CA
Optimization3: Lazy Garbage Collection
Lazy Garbage Collection
38
P2
P3
Transaction buffer cache
Transaction 1
fsync()
P1
P1
Do not garabge collect if no space is needed
Do not garabge collect if no space is needed
Insert
No Extra Dirty Page by GC
USENIX FAST ‘14, Santa Clara, CA
Version = “File Change Counter” in DB header page
25 [5~∞)40 [1~∞)
15 [6~∞)
Optimization4: Metadata embedding
5
Header Page 1
Page 2
...
WriteTransaction 6
ReadTransaction
5
6
Commit
dirty
dirty
39
Page N
++
2 dirty pages
USENIX FAST ‘14, Santa Clara, CA
RAMDISK
Optimization4: Metadata embedding
Flush “File Change Counter” to the last modified page and RAMDISK
WriteTransaction 6
ReadTransaction
5
Commit
5
40
25 [5~∞)40 [1~∞)
15 [6~∞)
Header Page 1
Page 2
...
Page N 5
5
6
6
USENIX FAST ‘14, Santa Clara, CA
Optimization4: Metadata embedding
Flush “File Change Counter” to the last modified page and RAMDISK
RAMDISK
Volatile
66
41
WriteTransaction 6
ReadTransaction
5
Commit
25 [5~∞)40 [1~∞)
15 [6~∞)
Header Page 1
Page 2
...
Page N 6
dirty
5
1 dirty page
CRASH
USENIX FAST ‘14, Santa Clara, CA
Optimization5: Disable Sibling Redistribution
Sibling redistribution hurts insertion performance
• Disable sibling redistribution → avoid dirtying 4 pages
Page 4:
5 10
Page 3: Page 2:
55∞
Insertion of key 20
12
2515
40
20
dirty dirty dirty
4 dirty pages
42
dirty Page 5:
10 : Page 4
40 : Page 3
∞ : Page 2
25
12
Page 5:
10 : Page 4
40 : Page 3
∞ : Page 2
40
10
dirty
USENIX FAST ‘14, Santa Clara, CA
Optimization5: Disable Sibling Redistribution
43
10 : Page 4
5590
Insertion of key 20
12
2515
40
40 : Page 3 90 : Page 2
Page 5:
20
15 : Page 5
dirty
3 dirty pages
Page 5: dirty
Page 2 Page 3: dirty
USENIX FAST ‘14, Santa Clara, CA
Summary
Optimizations
44
Lazy Split
Disabling Sibling
Redistribution
Metadata Embedding
Lazy Garbage
Collection
Reserved Buffer Space
LS-MVBT
Avoid dirtying an extra page when split occurs
Reduce the probability of node split
Delete dead entries on the next mutation
Avoid dirtying header page
Do not touch siblings to make search faster
USENIX FAST ‘14, Santa Clara, CA
Performance Evaluation
Implemented LS-MVBT in SQLite 3.7.12
Testbed: Samsung Galaxy-S4 (JellyBean 4.3)• Exynos 5 Octa 5410 Quad Core 1.6GHz• 2GB Memory• 16GB eMMC
Performance Analysis Tools• Mobibench• MOST
45
USENIX FAST ‘14, Santa Clara, CA
LS-MVBT fsync WAL fsync() MVBT fsync()B-tree computation
Analysis of insert Performance
WAL Checkpointing Interval
SQLite Insert (Response Time)
LS-MVBT: 30~78% faster than WAL mode
46
USENIX FAST ‘14, Santa Clara, CA
Analysis of insert Performance
LS-MVBT flushes only one dirty page
47
SQLite Insert (# of Dirty Pages per Insert)
USENIX FAST ‘14, Santa Clara, CA
I/O Traffic at Block Device Level
49
LS-MVBT saves 67% I/O traffic: 9.9 MB (LS-MVBT) vs 31 MB (WAL)
x3 longer life time of NAND flash
SQLite Insert (I/O Traffic per 10 msec)
1000 insertions (LS-MVBT) 1000 insertions (WAL)
USENIX FAST ‘14, Santa Clara, CA
How about Search performance?
• LS-MVBT is optimized for write
performance.
50
USENIX FAST ‘14, Santa Clara, CA
Search Performance
LS-MVBT wins unless more than 93% of transactions are search
51
Transaction Throughput with varying Search/Insert ratio
USENIX FAST ‘14, Santa Clara, CA
How about recovery latency?
• WAL mode exhibits longer recovery latency
due to log replay.
• How about LS-MVBT?
52
USENIX FAST ‘14, Santa Clara, CA
Recovery
LS-MVBT is x5~x6 faster than WAL.
53
Recovery Time with varying # of insert operations to rollback
USENIX FAST ‘14, Santa Clara, CA
How much performance improvement
for each optimization?
54
USENIX FAST ‘14, Santa Clara, CA
Effect of Disabling Sibling Redistribution
Disabling sibling redistribution
↓20% faster insertion
55
SQLite Insert: (Response Time and # of Dirty Pages)
USENIX FAST ‘14, Santa Clara, CA
Performance Quantification
56
Throughput of SQLite: Combining All Optimizations
40% improvement
51% improvement70% improvemnet
X1.7 performance improvement against WAL
mode
LS-MVBT: 704 insertions/sec WAL : 417 insertions/sec
1220% improvemnet
X13.2 performance improvement against
Truncate mode
LS-MVBT: 704 insertions/sec Truncate : 53 insertions/sec
USENIX FAST ‘14, Santa Clara, CA
Query response time
57
Response time of MVBT is 80% of WAL
Response time of LS-MVBT is 58% of WAL
USENIX FAST ‘14, Santa Clara, CA
Conclusion
LS-MVBT resolves the Journaling of Journal anomaly via
• Reducing the number of fsync()’s with Version based B-
tree.
• Reducing the overhead of single fsync() via lazy split,
disabling sibling redistribution, metadata embedding,
reserved buffer space, lazy garbage collection.
LS-MVBT achieves 70% performance gain against WAL mode
(417 ins/sec 704 ins/sec) solely via software optimization.
58
USENIX FAST ‘14, Santa Clara, CA
Thank you…