a study of memory management for web-based … study of memory management for web-based applications...

27
IBM Tokyo Research Laboratory June 18, 2009 | PLDI’09 at Dublin, Ireland © 2009 IBM Corporation A Study of Memory Management for Web-based Applications on Multicore Processors Hiroshi Inoue , Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

Upload: nguyennga

Post on 02-May-2018

215 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

IBM Tokyo Research Laboratory

June 18, 2009 | PLDI’09 at Dublin, Ireland © 2009 IBM Corporation

A Study of Memory Management for Web-based Applications on Multicore Processors

Hiroshi Inoue, Hideaki Komatsu and Toshio NakataniIBM Tokyo Research Laboratory

Page 2: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

2

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

Goal

Our main contributions include:1. Showing that a good technique for a single core is not necessarily a

good one on a multicore processor.2. Proposing our new approach for multicore environments

To develop an efficient memory management techniquefor Web-based applications to improve their performance on multicore processors.

Our Goal

Page 3: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

3

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

Characteristics of Web-based applications

In Web-based applications, most of the allocated memory blocks are transaction scoped, and live only during one transaction

We exploit their characteristics to reduce the costs of memory management

Page 4: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

4

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

An existing approachRegion-based memory management – reduces the cost of memory management by discarding

all blocks in a region at once (instead of freeing individual blocks)

– is widely used (e.g. Apache pool allocator)

time

one transaction of Web application

r = create_region()objA =

alloc_from_region(r)objA =

alloc_from_region(r)A =

alloc_from_region(r)destroy_region(r)

The region-based memory management looks ideal for managing transaction-scoped memory blocks efficiently

Page 5: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

5

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

custom memory allocator

Software stack of the Zend’s PHP runtime

Operating System

Zend’sPHP Runtime

PHP Application

Standard C Library

GarbageCollector

for transaction-scoped allocations

for transaction-scoped allocations

for persistent allocations

Experimental setup with the PHP runtime

hardware

Page 6: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

6

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

Software stack of the Zend’s PHP runtime

region-based allocator(replacing default allocator)

Operating System

Zend’sPHP Runtime

PHP Application

Standard C Library

GarbageCollector

for transaction-scoped allocations

for transaction-scoped allocations

for persistent allocations

We replaced only thecustom memory allocator

of the PHP runtime

We did not modifyPHP applicationsgarbage collector memory allocator in libc

hardware

Experimental setup with the PHP runtime

Page 7: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

7

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

Software stack of the Zend’s PHP runtime

region-based allocator(replacing default allocator)

Operating System (Red Hat Enterprise Linux 5)

Zend’sPHP Runtime(PHP-5.2.1)

PHP Application

Standard C Library

GarbageCollector

for transaction-scoped allocations

for transaction-scoped allocations

for persistent allocations

2x Intel quad-core Xeon (Clovertown)

Experimental setup with the PHP runtime

System usedBlade Center HS212x quad-core Xeon (Clovertown) 1.86 GHz8 GB system memoryRed Hat Enterprise Linux 5

SoftwarePHP-5.2.1 (w/ APC-3.0.14)lighttpd-1.4.13mysql-4.1.20

Benchmarksix server applications

(ezPublish, MediaWiki, SugarCRM, phpBBCakePHP, SPECweb2005)

Page 8: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

8

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

on one core

0.0

0.2

0.4

0.6

0.8

1.0

1.2

rela

tive

thro

ughp

ut .

default allocator of PHP runtime

region-based allocator

MediaW

iki

(read

only)

MediaW

iki

(read

/write

)Sug

arCRM

eZ Pub

lish

phpB

BCak

ePHP

SPECwebGeo

. MEAN

8% speedup

high

er is

fast

er

ThroughputPerformance of the region-based allocator:

Page 9: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

9

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

0.0

0.2

0.4

0.6

0.8

1.0

1.2

rela

tive

thro

ughp

ut .

default allocator of PHP runtime

region-based allocator

high

er is

fast

er

on one core on eight cores

0.0

0.2

0.4

0.6

0.8

1.0

1.2

rela

tive

thro

ughp

ut .

default allocator of PHP runtime

region-based allocator

MediaW

iki

(read

only)

MediaW

iki

(read

/write

)Sug

arCRM

eZ Publis

h

phpB

BCak

ePHP

SPECwebGeo

. MEAN

MediaW

iki

(read

only)

MediaW

iki

(read

/write

)Sug

arCRM

eZ Publis

h

phpB

BCak

ePHP

SPECwebGeo

. MEAN

8% speedup

13% slowdown!

ThroughputPerformance of the region-based allocator:

Page 10: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

10

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

0

0.2

0.4

0.6

0.8

1

1.2

0 2 4 6 8number of cores

rela

tive

thro

ughp

ut

region-based allocatordafult allocator of PHP runtime

The region-based allocator was 12% faster on one core

The region-based allocator was 23% slower on eight cores

high

er is

fast

er

for ezPublish

ScalabilityPerformance of the region-based allocator:

Page 11: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

11

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

Execution time breakdown

0

0.2

0.4

0.6

0.8

1

1.2

1.4

defaultallocator of thePHP runtime

region-basedallocator

rela

tive

exec

utio

n tim

e .memory managementothers

0

0.2

0.4

0.6

0.8

1

1.2

1.4

default allocatorof the PHP

runtime

region-basedallocator

rela

tive

exec

utio

n tim

e .

memory managementothers

10x speedupin memory

management40% slowdownin other parts

10x speedup

on one core on eight cores

for ezPublish

Performance of the region-based allocator:

Page 12: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

12

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

Cache misses and bus traffic

-40%

-20%

0%

20%

40%

60%

80%

chan

ge in

num

ber o

f cac

he m

isse

s .

eZ PublishMediaWiki (read only)MediaWiki (read/write)SugarCRM phpBBCakePHPSPECweb2005Geo. MEAN

L1I miss L1D miss

DTLB missL2 miss bus traffic

more cache misses than default allocator

fewer cache misses than default allocator(better)

chan

ge in

num

ber o

f eve

nts

on eight cores

115% 103%

Performance of the region-based allocator:

Page 13: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

13

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

0

10

20

30

40

50

60

1 2 4 8number of cores

aver

age

L1 m

iss

late

ncy

(cyc

le)

dafult allocator of PHP runtimeregion-based allocator

average memory latency increased

due to bus contention

average L1 cache miss latency: (L1D_PEND_MISS / L1D_REPL)

reduced cost of memory management > increased bus traffic

reduced cost of memory management < increased bus traffic

on one or few cores

on more cores

Memory latency Performance of the region-based allocator:

for ezPublish

Page 14: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

14

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

to reduce the cost of memory management without slowing down on multicore processors

Our Goal

0

0.2

0.4

0.6

0.8

1

1.2

1.4

default allocator ofthe PHP runtime

region-basedallocator

Our Goal

rela

tive

exec

utio

n tim

e .

memory managementothers

Page 15: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

15

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

Revisiting general-purpose memory allocators

General-purpose memory allocators’ tasks

– malloc(): allocate memory block

– free(): reclaim memory block and prepare for reuse in future allocations

– minimize the heap fragmentation (defragmentation)

For example,

coalescing multiple small blocks into large blocks

splitting large blocks into small blocks

sorting unused blocks in the free lists

typically consumes large amounts of CPU time

Page 16: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

16

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

Our approach: Defrag-Dodging

Our new approach: Defrag-Dodging

– reduces the memory management cost by avoiding defragmentation activities in malloc and free

– unlike the region-based memory management, support a free() function to enable fine-grained memory reuse

Key observation

cost of defragmentation > benefits

the transactions in Web-based applications are short enough to ignore heap fragmentation, and so the

Page 17: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

17

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

Comparing three approaches

malloc()allocate new memory

block

-free()

reclaim and reuse memory block

--defragmentation

region-based memory

management

our Defrag-Dodging

general-purpose memory

management

Page 18: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

18

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

Comparing three approaches

malloc()allocate new memory

block

-free()

reclaim and reuse memory block

--defragmentation

region-based memory

management

our Defrag-Dodging

general-purpose memory

management

reduced memory management cost•simpler allocator code•simpler heap structure (e.g. no per-object metadata)

Page 19: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

19

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

Comparing three approaches

malloc()allocate new memory

block

-free()

reclaim and reuse memory block

--defragmentation

region-based memory

management

our Defrag-Dodging

general-purpose memory

management

better scalability on multicore processors by avoiding increases in bus traffic

Page 20: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

20

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

DDmalloc: Implementation of Defrag-Dodging

based on a segregated heap allocator– maintains free lists for each size of blocks to keep track of

freed blocks and reuse them in future allocations

reduced cost by keeping malloc and free as simple as possible

– for example, free() function only chains the freed blocks to the corresponding free list (and does nothing else!)

– the allocator code is less than 500 lines of C code

clears all metadata to refresh the heap at the end of each transactionsee the paper for the implementation details

Page 21: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

21

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

custom memory allocator

Software stack of the Zend’s PHP runtime

Operating System

Zend’sPHP Runtime

PHP Application

Standard C Library

GarbageCollector

for transaction-scoped allocations

for transaction-scoped allocations

for persistent allocations

Experimental setup with the PHP runtime

hardware

Page 22: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

22

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

Software stack of the Zend’s PHP runtime

DDmalloc(replacing default allocator)

Operating System

Zend’sPHP Runtime

PHP Application

Standard C Library

GarbageCollector

for transaction-scoped allocations

for transaction-scoped allocations

for persistent allocations

Again, we replaced only the custom memory

allocator with DDmalloc

We did not modifyPHP applicationsgarbage collector memory allocator in libc

hardware

Experimental setup with the PHP runtime

Page 23: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

23

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

0.0

0.2

0.4

0.6

0.8

1.0

1.2

rela

tive

thro

ughp

ut .

default allocator of PHP runtimeregion-based allocatorour DDmalloc

0.0

0.2

0.4

0.6

0.8

1.0

1.2

rela

tive

thro

ughp

ut .

default allocator of PHP runtimeregion-based allocatorour DDmalloc

Throughput

high

er is

fast

er

MediaW

iki

(read

only)

MediaW

iki

(read

/write

)Sug

arCRM

eZ Pub

lish

phpB

BCak

ePHP

SPECwebGeo

. MEAN

MediaW

iki

(read

only)

MediaW

iki

(read

/write

)Sug

arCRM

eZ Pub

lish

phpB

BCak

ePHP

SPECwebGeo

. MEAN

on one core on eight cores

Performance of DDmalloc:

Page 24: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

24

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

0

0.2

0.4

0.6

0.8

1

1.2

0 2 4 6 8number of cores

rela

tive

thro

ughp

ut .

our DDmallocregion-based allocatordafult allocator of PHP runtime

Scalability

high

er is

fast

er

for ezPublish

Performance of DDmalloc:

Page 25: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

25

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

Execution time breakdown

0

0.2

0.4

0.6

0.8

1

1.2

1.4

default allocatorof the PHP

runtime

region-basedallocator

DDmalloc

rela

tive

exec

utio

n tim

e .

memory managementothers

for ezPublish

10x speedup

2x speedupin memory

management

no slowdown!(2% improved)

Performance of DDmalloc:

Page 26: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

26

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

-20%

-15%

-10%

-5%

0%

5%

10%

15%

20%

25%

30%

chan

ge in

num

ber o

f eve

nts

eZ PublishMediaWiki (read only)MediaWiki (read/write)SugarCRM phpBBCakePHPSPECweb2005Geo. MEAN

Cache misses and bus traffic

L1I miss L1D miss DTLB miss L2 miss bus traffic

more cache misses than default allocator

fewer cache misses than default allocator(better)

on eight cores

Performance of DDmalloc:

Page 27: A Study of Memory Management for Web-based … Study of Memory Management for Web-based Applications on Multicore ... (w/ APC-3.0.14) ... performance of Web-based applications on multicore

27

IBM Tokyo Research Laboratory

A Study of Memory Management for Web-based Applications on Multicore Processors © 2009 IBM Corporation

Summary

We studied the effects of memory management approaches on the performance of Web-based applications on multicore processors

– region-based allocator: fast on a single core, but slow on multicore processors due to increased bus traffic

– general-purpose allocator: not cost-effective in avoiding heap fragmentation

We proposed the new approach of Defrag-Dodging to reduce memory management costs

More data on the paperevaluation on Niagara

evaluation with Ruby runtimecomparisons with TCmalloc, Hoard

discussions on the GC-based languages

More data on the paperevaluation on Niagara

evaluation with Ruby runtimecomparisons with TCmalloc, Hoard

discussions on the GC-based languages