a study of memory management for web-based … study of memory management for web-based applications...
TRANSCRIPT
IBM Tokyo Research Laboratory
June 18, 2009 | PLDI’09 at Dublin, Ireland © 2009 IBM Corporation
A Study of Memory Management for Web-based Applications on Multicore Processors
Hiroshi Inoue, Hideaki Komatsu and Toshio NakataniIBM Tokyo Research Laboratory
2
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
Goal
Our main contributions include:1. Showing that a good technique for a single core is not necessarily a
good one on a multicore processor.2. Proposing our new approach for multicore environments
To develop an efficient memory management techniquefor Web-based applications to improve their performance on multicore processors.
Our Goal
3
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
Characteristics of Web-based applications
In Web-based applications, most of the allocated memory blocks are transaction scoped, and live only during one transaction
We exploit their characteristics to reduce the costs of memory management
4
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
An existing approachRegion-based memory management – reduces the cost of memory management by discarding
all blocks in a region at once (instead of freeing individual blocks)
– is widely used (e.g. Apache pool allocator)
time
one transaction of Web application
r = create_region()objA =
alloc_from_region(r)objA =
alloc_from_region(r)A =
alloc_from_region(r)destroy_region(r)
The region-based memory management looks ideal for managing transaction-scoped memory blocks efficiently
5
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
custom memory allocator
Software stack of the Zend’s PHP runtime
Operating System
Zend’sPHP Runtime
PHP Application
Standard C Library
GarbageCollector
for transaction-scoped allocations
for transaction-scoped allocations
for persistent allocations
Experimental setup with the PHP runtime
hardware
6
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
Software stack of the Zend’s PHP runtime
region-based allocator(replacing default allocator)
Operating System
Zend’sPHP Runtime
PHP Application
Standard C Library
GarbageCollector
for transaction-scoped allocations
for transaction-scoped allocations
for persistent allocations
We replaced only thecustom memory allocator
of the PHP runtime
We did not modifyPHP applicationsgarbage collector memory allocator in libc
hardware
Experimental setup with the PHP runtime
7
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
Software stack of the Zend’s PHP runtime
region-based allocator(replacing default allocator)
Operating System (Red Hat Enterprise Linux 5)
Zend’sPHP Runtime(PHP-5.2.1)
PHP Application
Standard C Library
GarbageCollector
for transaction-scoped allocations
for transaction-scoped allocations
for persistent allocations
2x Intel quad-core Xeon (Clovertown)
Experimental setup with the PHP runtime
System usedBlade Center HS212x quad-core Xeon (Clovertown) 1.86 GHz8 GB system memoryRed Hat Enterprise Linux 5
SoftwarePHP-5.2.1 (w/ APC-3.0.14)lighttpd-1.4.13mysql-4.1.20
Benchmarksix server applications
(ezPublish, MediaWiki, SugarCRM, phpBBCakePHP, SPECweb2005)
8
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
on one core
0.0
0.2
0.4
0.6
0.8
1.0
1.2
rela
tive
thro
ughp
ut .
default allocator of PHP runtime
region-based allocator
MediaW
iki
(read
only)
MediaW
iki
(read
/write
)Sug
arCRM
eZ Pub
lish
phpB
BCak
ePHP
SPECwebGeo
. MEAN
8% speedup
high
er is
fast
er
ThroughputPerformance of the region-based allocator:
9
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
0.0
0.2
0.4
0.6
0.8
1.0
1.2
rela
tive
thro
ughp
ut .
default allocator of PHP runtime
region-based allocator
high
er is
fast
er
on one core on eight cores
0.0
0.2
0.4
0.6
0.8
1.0
1.2
rela
tive
thro
ughp
ut .
default allocator of PHP runtime
region-based allocator
MediaW
iki
(read
only)
MediaW
iki
(read
/write
)Sug
arCRM
eZ Publis
h
phpB
BCak
ePHP
SPECwebGeo
. MEAN
MediaW
iki
(read
only)
MediaW
iki
(read
/write
)Sug
arCRM
eZ Publis
h
phpB
BCak
ePHP
SPECwebGeo
. MEAN
8% speedup
13% slowdown!
ThroughputPerformance of the region-based allocator:
10
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
0
0.2
0.4
0.6
0.8
1
1.2
0 2 4 6 8number of cores
rela
tive
thro
ughp
ut
region-based allocatordafult allocator of PHP runtime
The region-based allocator was 12% faster on one core
The region-based allocator was 23% slower on eight cores
high
er is
fast
er
for ezPublish
ScalabilityPerformance of the region-based allocator:
11
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
Execution time breakdown
0
0.2
0.4
0.6
0.8
1
1.2
1.4
defaultallocator of thePHP runtime
region-basedallocator
rela
tive
exec
utio
n tim
e .memory managementothers
0
0.2
0.4
0.6
0.8
1
1.2
1.4
default allocatorof the PHP
runtime
region-basedallocator
rela
tive
exec
utio
n tim
e .
memory managementothers
10x speedupin memory
management40% slowdownin other parts
10x speedup
on one core on eight cores
for ezPublish
Performance of the region-based allocator:
12
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
Cache misses and bus traffic
-40%
-20%
0%
20%
40%
60%
80%
chan
ge in
num
ber o
f cac
he m
isse
s .
eZ PublishMediaWiki (read only)MediaWiki (read/write)SugarCRM phpBBCakePHPSPECweb2005Geo. MEAN
L1I miss L1D miss
DTLB missL2 miss bus traffic
more cache misses than default allocator
fewer cache misses than default allocator(better)
chan
ge in
num
ber o
f eve
nts
on eight cores
115% 103%
Performance of the region-based allocator:
13
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
0
10
20
30
40
50
60
1 2 4 8number of cores
aver
age
L1 m
iss
late
ncy
(cyc
le)
dafult allocator of PHP runtimeregion-based allocator
average memory latency increased
due to bus contention
average L1 cache miss latency: (L1D_PEND_MISS / L1D_REPL)
reduced cost of memory management > increased bus traffic
reduced cost of memory management < increased bus traffic
on one or few cores
on more cores
Memory latency Performance of the region-based allocator:
for ezPublish
14
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
to reduce the cost of memory management without slowing down on multicore processors
Our Goal
0
0.2
0.4
0.6
0.8
1
1.2
1.4
default allocator ofthe PHP runtime
region-basedallocator
Our Goal
rela
tive
exec
utio
n tim
e .
memory managementothers
15
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
Revisiting general-purpose memory allocators
General-purpose memory allocators’ tasks
– malloc(): allocate memory block
– free(): reclaim memory block and prepare for reuse in future allocations
– minimize the heap fragmentation (defragmentation)
For example,
coalescing multiple small blocks into large blocks
splitting large blocks into small blocks
sorting unused blocks in the free lists
typically consumes large amounts of CPU time
16
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
Our approach: Defrag-Dodging
Our new approach: Defrag-Dodging
– reduces the memory management cost by avoiding defragmentation activities in malloc and free
– unlike the region-based memory management, support a free() function to enable fine-grained memory reuse
Key observation
cost of defragmentation > benefits
the transactions in Web-based applications are short enough to ignore heap fragmentation, and so the
17
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
Comparing three approaches
malloc()allocate new memory
block
-free()
reclaim and reuse memory block
--defragmentation
region-based memory
management
our Defrag-Dodging
general-purpose memory
management
18
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
Comparing three approaches
malloc()allocate new memory
block
-free()
reclaim and reuse memory block
--defragmentation
region-based memory
management
our Defrag-Dodging
general-purpose memory
management
reduced memory management cost•simpler allocator code•simpler heap structure (e.g. no per-object metadata)
19
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
Comparing three approaches
malloc()allocate new memory
block
-free()
reclaim and reuse memory block
--defragmentation
region-based memory
management
our Defrag-Dodging
general-purpose memory
management
better scalability on multicore processors by avoiding increases in bus traffic
20
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
DDmalloc: Implementation of Defrag-Dodging
based on a segregated heap allocator– maintains free lists for each size of blocks to keep track of
freed blocks and reuse them in future allocations
reduced cost by keeping malloc and free as simple as possible
– for example, free() function only chains the freed blocks to the corresponding free list (and does nothing else!)
– the allocator code is less than 500 lines of C code
clears all metadata to refresh the heap at the end of each transactionsee the paper for the implementation details
21
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
custom memory allocator
Software stack of the Zend’s PHP runtime
Operating System
Zend’sPHP Runtime
PHP Application
Standard C Library
GarbageCollector
for transaction-scoped allocations
for transaction-scoped allocations
for persistent allocations
Experimental setup with the PHP runtime
hardware
22
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
Software stack of the Zend’s PHP runtime
DDmalloc(replacing default allocator)
Operating System
Zend’sPHP Runtime
PHP Application
Standard C Library
GarbageCollector
for transaction-scoped allocations
for transaction-scoped allocations
for persistent allocations
Again, we replaced only the custom memory
allocator with DDmalloc
We did not modifyPHP applicationsgarbage collector memory allocator in libc
hardware
Experimental setup with the PHP runtime
23
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
0.0
0.2
0.4
0.6
0.8
1.0
1.2
rela
tive
thro
ughp
ut .
default allocator of PHP runtimeregion-based allocatorour DDmalloc
0.0
0.2
0.4
0.6
0.8
1.0
1.2
rela
tive
thro
ughp
ut .
default allocator of PHP runtimeregion-based allocatorour DDmalloc
Throughput
high
er is
fast
er
MediaW
iki
(read
only)
MediaW
iki
(read
/write
)Sug
arCRM
eZ Pub
lish
phpB
BCak
ePHP
SPECwebGeo
. MEAN
MediaW
iki
(read
only)
MediaW
iki
(read
/write
)Sug
arCRM
eZ Pub
lish
phpB
BCak
ePHP
SPECwebGeo
. MEAN
on one core on eight cores
Performance of DDmalloc:
24
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
0
0.2
0.4
0.6
0.8
1
1.2
0 2 4 6 8number of cores
rela
tive
thro
ughp
ut .
our DDmallocregion-based allocatordafult allocator of PHP runtime
Scalability
high
er is
fast
er
for ezPublish
Performance of DDmalloc:
25
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
Execution time breakdown
0
0.2
0.4
0.6
0.8
1
1.2
1.4
default allocatorof the PHP
runtime
region-basedallocator
DDmalloc
rela
tive
exec
utio
n tim
e .
memory managementothers
for ezPublish
10x speedup
2x speedupin memory
management
no slowdown!(2% improved)
Performance of DDmalloc:
26
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
-20%
-15%
-10%
-5%
0%
5%
10%
15%
20%
25%
30%
chan
ge in
num
ber o
f eve
nts
eZ PublishMediaWiki (read only)MediaWiki (read/write)SugarCRM phpBBCakePHPSPECweb2005Geo. MEAN
Cache misses and bus traffic
L1I miss L1D miss DTLB miss L2 miss bus traffic
more cache misses than default allocator
fewer cache misses than default allocator(better)
on eight cores
Performance of DDmalloc:
27
IBM Tokyo Research Laboratory
A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation
Summary
We studied the effects of memory management approaches on the performance of Web-based applications on multicore processors
– region-based allocator: fast on a single core, but slow on multicore processors due to increased bus traffic
– general-purpose allocator: not cost-effective in avoiding heap fragmentation
We proposed the new approach of Defrag-Dodging to reduce memory management costs
More data on the paperevaluation on Niagara
evaluation with Ruby runtimecomparisons with TCmalloc, Hoard
discussions on the GC-based languages
More data on the paperevaluation on Niagara
evaluation with Ruby runtimecomparisons with TCmalloc, Hoard
discussions on the GC-based languages