Software Management is the Key to Efficient Use of Solid-State Drives
Xiaodong Zhang
The Ohio State University, USA
In collaboration with Intel, Samsung, and VMware
9/21/2014
Flash Memory is Affordable and Widely Used
[Figure: NAND flash spans the consumer, client, and enterprise markets]
• NAND flash memory is widely used: from cell phones to cloud systems
• Adoption is increasingly wide
[Figure: bit cost reduction with process scaling from 100nm toward 10nm over 2007-2014: from $10/GB to $1/GB to $0.35/GB]
Endurance of Flash Memory is Reduced
• The number of P/E cycles in flash memory decreases as chip technology scaling continues
[Figure: a flash block is an array of cells addressed by word lines (WL) and bit lines (BL); in each floating-gate cell, electrons tunnel through the oxide layer, and repeated injection/extraction creates oxide traps]
• Write: injecting electrons into the floating gate; Erase: extracting electrons from the floating gate
• Cell size and the distance between cells shrink
• The oxide layer becomes thinner
• A higher percentage of oxide traps interferes with P/E operations
Error Rate Increases in Flash Memory
• Noise in NAND flash memory increases as chip technology scaling continues
[Figure: a victim cell in the array suffers interference from neighboring cells]
• Cell-to-cell interference
• Random telegraph noise (RTN): "burst noise" caused by the thin interface
• Retention noise: electrons leak from the floating gate
• The result is a higher error rate
The Cost Reduction is not Free
Rapid Reduction of P/E Cycles
SSDs Are Not Widely Used in Enterprise Systems
• SSDs are commonly used as acceleration devices, instead of alone, creating storage systems with "dual" devices, such as Intel RST and Apple Fusion Drive
Source: Flexstar SSD Test Market Analysis, June 2012
http://info.flexstar.com/Portals/161365/docs/SSD_%20Testing_%20Market_Analysis.pdf
Distinguished Merits and Limits
• Merits
  – Random reads are fast at a low cost
  – Low power
• Limits
  – Sequential accesses may not be cost-effective
  – The number of writes is increasingly limited
  – The error rate increases
• A whole-SSD solution may be suboptimal in cost, reliability, and performance
• A hybrid storage system can best utilize the SSD (a minimal dispatch sketch follows)
  – Random reads on SSD; sequential reads/writes on HDD
  – Minimizing unnecessary writes to SSD
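To make this division of labor concrete, here is a minimal C sketch of such a dispatch policy. The request structure, device labels, and the sequentiality threshold are illustrative assumptions, not Hystor's actual design.

#include <stdbool.h>
#include <stdint.h>

enum device { DEV_SSD, DEV_HDD };

struct io_req {
    uint64_t lba;      /* starting logical block address */
    uint32_t nblocks;  /* request size in blocks */
    bool     is_write;
};

/* Hypothetical threshold: a request that continues the previous one,
 * or is very large, is treated as sequential. */
#define SEQ_SIZE_THRESHOLD 256  /* blocks */

enum device dispatch(const struct io_req *req, uint64_t prev_end_lba)
{
    bool sequential = (req->lba == prev_end_lba) ||
                      (req->nblocks >= SEQ_SIZE_THRESHOLD);
    if (req->is_write || sequential)
        return DEV_HDD;  /* sequential I/O and writes stay on the HDD */
    return DEV_SSD;      /* small random reads go to the SSD */
}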
Software Solutions at OS Level
• SSD-cached Disk: Conquest [USENIX'02], SmartSaver [ISLPED'06], ReadyBoost [MS'06], TurboMemory [ToS'08], L2ARC [CACM'08], FlashCache [ISCA'08], and others
• Disk-cached SSD: Soundararajan [FAST'10]
• Hybrid Storage: Hystor
• Cache-based solutions
  – SSD: a second-level cache
  – HDD: the permanent storage
  – Managed by a cache replacement policy
• Limitations
  – Weak-locality memory misses
  – Intensive write traffic
  – Non-trivial system changes
  – High-cost online replacement: frequent on-access updates, 10-20x larger SSD space
Hystor: A Cost-Efficient Hybrid Storage*
• SSD (high-performance, high-cost): a small data set
  – Semantically critical: file-system metadata blocks
  – Performance critical: small, randomly accessed blocks
• HDD (low-cost, high-capacity): a large data set
  – Less frequently accessed
  – Sequentially accessed
• A prototype system developed at Ohio State and Intel® Labs
* Collaborative work with Intel® Labs
Identifying Data Blocks for SSDs
• Candidate metrics highly correlated with latency
  – Latency (optimal)
  – Frequency
  – Request size
  – Reuse distance
  – Seek distance
  – Combinations of the above
[Figure: each metric is evaluated by plotting the percentage of total latency covered against the percentage of total blocks selected; the closer a metric's curve is to the optimal latency curve, the more highly correlated it is with latency]
Identifying the High-Cost Data Blocks
• Among the candidate metrics (latency, frequency, request size, reuse distance, seek distance, and combinations), one stands out
• Frequently used small blocks are the high-cost blocks
• The best metric: Frequency / Request size (a ranking sketch follows)
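As a rough illustration of ranking by this metric, the C sketch below keeps a per-block access count and average request size and sorts blocks by frequency divided by request size, so frequently used small blocks float to the top; all names and structures are illustrative, not Hystor's actual bookkeeping.

#include <stdlib.h>
#include <stdint.h>

struct block_stat {
    uint64_t blkno;          /* block number on the HDD */
    uint32_t freq;           /* accesses observed */
    uint32_t avg_req_size;   /* average request size (sectors) */
};

/* Descending order by freq / avg_req_size; cross-multiplying avoids
 * floating point: a->freq / a->size > b->freq / b->size
 * iff a->freq * b->size > b->freq * a->size. */
static int by_metric(const void *pa, const void *pb)
{
    const struct block_stat *a = pa, *b = pb;
    uint64_t ma = (uint64_t)a->freq * b->avg_req_size;
    uint64_t mb = (uint64_t)b->freq * a->avg_req_size;
    return (ma < mb) - (ma > mb);
}

/* Remap the top-k ranked blocks to the SSD. */
void select_for_ssd(struct block_stat *stats, size_t n, size_t k,
                    uint64_t *ssd_blocks)
{
    qsort(stats, n, sizeof *stats, by_metric);
    for (size_t i = 0; i < k && i < n; i++)
        ssd_blocks[i] = stats[i].blkno;
}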
Hystor Performance Evaluation
• Measurement system
  – Intel® D975BX, 2.66GHz Intel® Core™ 2 Quad, 4GB memory
  – LSI® MegaRAID 8704 SAS card, Seagate® 15k.5 SAS HDD, Intel X25-E SSD
  – Fedora Core 8 Linux, Linux kernel 2.6.25.8
• Experimental results
[Figure: speedup (x) over the HDD-only baseline for Archive, Postmark, Mail, and TPC-H Q1, with SSD capacity from 20% to 100% of the data set and Full-SSD as the optimal case (up to 11.7x); annotated points: 3x/2%, 3.2x/0.9%, 11%/7%, 62.5%/0.4%; the benefit is proportional to working-set size]
Impact of Hystor
• Hystor was presented at ACM ICS 2011 (Best Paper Award)
• Hystor laid a foundation for Apple's hybrid storage product, Fusion Drive
  – A new storage product on the market since October 23, 2012
  – Consists of a small SSD (128 GB) and a large hard drive (1 TB)
  – The hybrid storage is managed by the OS in a single space
• Comments on Hystor from Apple:
  – Hystor is a well-designed system, and its paper discussed several key systems trade-offs in detail. The Apple software engineers carefully and systematically evaluated Hystor. This work had a significant influence on the design of Apple's Fusion Drive. Some design elements and algorithms in Hystor have been directly used in Apple's Fusion Drive.
• Steve Jobs' philosophy on technology transfer for Apple:
  – Picasso had a saying, "Good artists copy, great artists steal." We have always been shameless about stealing great ideas...
hStorage-DB: A Software Solution for Databases
• DBs have different storage QoS requirements
  – Different access patterns
  – Different priorities of data processing requests
  – Dynamic changes of requirements
• Hybrid storage can satisfy the diverse QoS needs of DB requests well
  – It should be automatic and adaptive, with low overhead
  – But there are challenges
Existing Interface between Applications and Storage

read/write(int fd, void *buf, size_t count);

• fd: on-disk location; buf: in-memory data; count: request size
• This interface cannot pass application-specific requirements to storage (a sketch of a tagged interface follows)
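One way to close the gap, sketched below, is to carry a per-request classification tag alongside the usual arguments, in the spirit of the Differentiated Storage Services protocol discussed later in this talk; the function name and enum values here are hypothetical.

#define _XOPEN_SOURCE 500  /* for pread() */
#include <unistd.h>
#include <sys/types.h>

/* Hypothetical classification tag carried with each request. */
enum io_class {
    IO_CLASS_SEQUENTIAL,   /* e.g., bypass the SSD cache */
    IO_CLASS_RANDOM,       /* e.g., cache with high priority */
    IO_CLASS_TEMPORARY     /* e.g., cache, then evict on delete */
};

/* Like pread(), but tagged; this stub just forwards to pread().
 * A real system would deliver `cls` to the storage layer, e.g.,
 * through the DSS protocol. */
ssize_t pread_classified(int fd, void *buf, size_t count,
                         off_t offset, enum io_class cls)
{
    (void)cls;  /* consumed by the storage system in a real design */
    return pread(fd, buf, count, offset);
}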
Challenges for Hybrid Storage Systems to Satisfy Different QoS Requirements
• DBMS (what I/O services do I need as a storage user?)
  – Classification of I/O requests based on QoS
  – hStorage awareness
  – DBMS enhancements to utilize the classifications automatically
• hStorage (what can I do for you as a service provider?)
  – A clear definition of supported QoS classifications
  – Hide device details from the DBMS
  – Efficient data management across hybrid devices
• Communication between the DBMS and hStorage
  – Rich information to deliver, but limited by interface abilities
  – Needs a standard, general-purpose protocol
DBA-Based Approach
• DBAs decide data placement among heterogeneous devices based on experience
• Limitations
  – Significant human effort: expertise in both DB and storage
  – Large granularity, e.g., table/partition-based data placement
  – Static storage layout:
    • Tuned for the "common" case
    • Cannot respond well to execution dynamics
[Figure: the DBA statically places indexes on the SSD and other data on the HDD]
Monitoring-Based Solutions
• Storage systems automatically make data placement and replacement decisions by monitoring access patterns
  – LRU (a basic structure), LIRS (MySQL), ARC (IBM I/O controller)
  – Example products from industry: Solid State Hybrid Drive (Seagate), Easy Tier (IBM)
• Limitations
  – It takes time to recognize access patterns, so dynamics over short periods are hard to handle
  – With concurrency, access patterns cannot be easily detected
  – Certain critical insights are not related to access patterns
  – Domain information (available from the DBMS) is not utilized
What Information from the DBMS Can We Use?
• System catalog
  – Data type: index, regular table
  – Ownership of data sets: e.g., VIP user vs. regular user
• Query optimizer
  – Orders of operations and access paths
  – Estimated frequency of accesses to related data
• Query planner
  – Access patterns
• Execution engine
  – Life cycles of data usage
Semantic information for I/O requests is not yet organized.
Goal: organize and utilize DBMS semantic information.
[Figure: DBMS components — buffer pool, query optimizer, checkpoint, vacuum, background processes, and a connection pool serving multiple users — generate sequential, random, and repeated-scan accesses to system tables, indexes, user tables, and temporary data; a semantic gap separates the DBMS from the storage system]
The mission of hStorage-DB is to fill this gap.
Structure of hStorage-DB
[Figure: the query optimizer, query planner, and execution engine pass each request plus semantic information (Info 1 ... Info N) through the buffer pool manager to the storage manager, which consults a QoS policy assignment table; the storage system control logic receives each I/O request with its QoS policy and directs it to the SSDs or HDDs]
Highlights of hStorage-DB
• Policy assignment table
  – Stores all the rules for assigning a QoS policy to each I/O request
  – Assignments are made on organized DB semantic information
• Communication between a DBMS and hStorage
  – The QoS policy for each I/O request is delivered by the protocol of "Differentiated Storage Services" (DSS, SOSP'11)
  – The hStorage system acts accordingly
• hStorage-DB in practice
  – The first application of the Intel DSS product
Caching Priorities as QoS Policies
• Priorities are enumerated
  – E.g., 1, 2, 3, ..., N
  – Priority 1 is the highest priority
• Data from high-priority requests can evict data cached for low-priority requests (see the sketch after this list)
• Special "priorities"
  – Bypass: requests with this priority do not affect in-cache data
  – Eviction: data accessed by requests with an eviction "priority" is immediately evicted from the cache
  – Write buffer
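A minimal C sketch of these admission semantics, assuming integer priorities with negative values reserved for the special pseudo-priorities; the encoding is illustrative.

#include <stdbool.h>

/* Lower number = higher priority; negative values encode the
 * special pseudo-priorities. All encodings are illustrative. */
enum { PRIO_BYPASS = -1, PRIO_EVICT = -2, PRIO_WBUF = -3 };

struct cache_slot {
    int prio;  /* priority of the data currently cached here */
};

/* May a request of priority req_prio claim this slot? */
bool can_evict(const struct cache_slot *victim, int req_prio)
{
    if (req_prio == PRIO_BYPASS)
        return false;  /* bypass never disturbs cached data */
    /* Only real priorities (1..N) may displace cached data, and only
     * data cached at a numerically larger (i.e., lower) priority. */
    return req_prio >= 1 && req_prio < victim->prio;
}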
Policy Assignment Table
• Sequential accesses → Bypass
• Random accesses → Priority 1 ... Priority N
• Temporary data accesses → a caching priority while the data is live
• Temporary data delete → Eviction
• Updates → Write buffer
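Expressed as code, such a table could be a simple lookup, as in the hypothetical C sketch below; the policy constants are placeholders consistent with the mappings above.

/* QoS policies: real priorities plus the special pseudo-policies.
 * The numeric values are placeholders. */
enum qos_policy {
    QOS_PRIO_1 = 1,        /* highest caching priority */
    QOS_PRIO_2 = 2,
    QOS_BYPASS = 100,
    QOS_EVICT = 101,
    QOS_WRITE_BUFFER = 102
};

/* Request types the DBMS can derive from its semantic information. */
enum req_type {
    REQ_SEQUENTIAL, REQ_RANDOM,
    REQ_TEMP_ACCESS, REQ_TEMP_DELETE, REQ_UPDATE
};

enum qos_policy assign_policy(enum req_type t)
{
    switch (t) {
    case REQ_RANDOM:      return QOS_PRIO_1;       /* keep on SSD */
    case REQ_TEMP_ACCESS: return QOS_PRIO_2;       /* cache while live */
    case REQ_TEMP_DELETE: return QOS_EVICT;        /* reclaim at once */
    case REQ_UPDATE:      return QOS_WRITE_BUFFER; /* absorb writes */
    case REQ_SEQUENTIAL:  return QOS_BYPASS;       /* don't pollute cache */
    }
    return QOS_BYPASS;  /* conservative default */
}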
Experimental Setup
• Dual-machine setup (connected by 10Gb Ethernet)
  – A DBMS: hStorage-DB based on PostgreSQL
  – A dedicated storage system with an SSD cache
• Configuration
  – Xeon, 2-way, quad-core 2.33GHz, 8GB RAM
  – 2 Seagate 15.7K RPM HDDs
  – SSD cache: Intel 320 Series, 300GB (32GB used)
• Workload
  – TPC-H at scale factor 30 (46GB with 7 indexes)
Diverse Request Types in TPC-H
[Figure: breakdown of temporary, random, and sequential requests (0-100%) for each of the 22 TPC-H queries]
• Most queries are dominated by sequential requests
• Queries 2, 8, 9, 20, and 21 have a large number of random requests
• Query 18 has a large number of temporary data requests
Performance Comparisons: HDD, SSD-Caching, hStorage-DB, SSD-Only
• hStorage-DB:
  – Random accesses go to the SSD
  – Temporary data is cached and evicted in a timely manner
  – Sequential accesses are bypassed
  – Data in the SSD is updated in a timely manner
[Figure: execution time of Query 18 (sec): HDD-only 8950, LRU 8694, hStorage-DB 6146, SSD-only 5990]
Virtualization is Basic Infrastructure for the Cloud
[Figure: conventional system setups run each application on a dedicated cluster; with computing virtualization, all applications run on consolidated hardware resources]
• Benefits:
  – Reduced management cost
  – Resource consolidation: e.g., when Hadoop becomes I/O intensive, its CPU resources can be allocated to other applications transparently
The Core of System Virtualization: the Hypervisor
• Virtualization packs the whole stack of hardware + OS + applications into a portable virtual machine (VM) package
[Figure: several App/OS stacks run as VMs on one hypervisor above the physical machine; the physical machine is the host, and each VM is also called a guest]
• The hypervisor provides the supporting environment for each VM package and manages hardware resources among multiple VMs
S-CAVE: Hypervisor-Based SSD Caching
1. The SSD cache is directly managed by the hypervisor
2. The hypervisor makes efforts to best utilize the SSD cache for each VM
[Figure: the VMs run on the hypervisor, which hosts the SSD-cache controller and sits between the VMs and an HDD-based storage system]
What Do We Gain?
• Benefits
  – Transparent to VMs: no modification to guest OSes or applications
  – A "global" view of all VMs' I/O activities
  – Full access privilege to storage devices
• Must address the following challenges
  – Effective and accurate cache space allocation
  – Responsiveness to VMs' I/O dynamics
  – Low-overhead implementation
• In practice
  – S-CAVE is implemented in the VMware hypervisor as a software solution
S-CAVE: System Design Overview
[Figure: each VM has a cache monitor in the hypervisor; a central cache space allocator divides the SSD, exposed through a block interface, among the VMs above the storage system]
• Cache Monitor
  – One for each running VM
  – Watches the usage of the allocated SSD space
  – Reports usage status and space demand
• Cache Space Allocator (a simplified sketch follows)
  – A central control for space allocation
  – Determines how much SSD cache space each VM should be allocated
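As a rough C sketch of the allocator's control structure, assuming each cache monitor reports a single demand number; S-CAVE's actual demand metric is more refined.

#include <stddef.h>
#include <stdint.h>

/* Divide the SSD cache among n VMs (n > 0) in proportion to the
 * demand each per-VM cache monitor reports. This shows only the
 * control structure, not S-CAVE's real allocation algorithm. */
void allocate_cache(const uint64_t *demand, size_t n,
                    uint64_t total_blocks, uint64_t *share)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += demand[i];
    for (size_t i = 0; i < n; i++)
        share[i] = sum ? total_blocks * demand[i] / sum
                       : total_blocks / n;  /* even split when idle */
}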
Summary
• SSD and HDD should each be utilized for their best merits
  – Random reads are fast and cost-effective on SSD
  – Sequential accesses are fast and cost-effective on HDD
  – Writes to SSD must be limited
• Communication between applications and storage systems
  – Detect access patterns, and let hybrid storage serve them accordingly: Hystor => Fusion Drive (Apple) and Hybrid Aggregate (NetApp)
  – Let applications express access patterns and expected storage services, and let the storage system act accordingly: hStorage-DB, S-CAVE
• Other efforts
  – LDPC in SSD: placing advanced ECC in SSDs
Thank you!