
CS 6091, Michigan Technological University, 3/15/06

Slide 1: Shared Memory Programming for Large Scale Machines

C. Barton (1), C. Cascaval (2), G. Almasi (2), Y. Zheng (3), M. Farreras (4), J. Nelson Amaral (1)

(1) University of Alberta
(2) IBM Watson Research Center
(3) Purdue
(4) Universitat Politecnica de Catalunya

IBM Research Report RC23853, January 27, 2006

Slide 2: Abstract

UPC is scalable and competitive with MPI on hundreds of thousands of processors.

This paper discusses the compiler and runtime system features that achieve this performance on the IBM BlueGene/L.

Three benchmarks are used:
- HPC RandomAccess
- HPC STREAM
- NAS Conjugate Gradient (CG)

Slide 3: 1. BlueGene/L

65,536 x 2-way 700 MHz processors (low power)

280 sustained TFlops on HPL Linpack

64 x 32 x 32 3D packet-switched torus network

XL UPC compiler and UPC runtime system (RTS)

Slide 4: 2.1 XL Compiler Structure

UPC source is translated to W-code

An early version worked the way MuPC does: calls to the RTS were inserted directly into W-code. This prevents optimizations such as copy propagation and common subexpression elimination.

The current version delays the insertion of RTS calls. W-code is extended to represent shared variables and the memory access mode (strict or relaxed).
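To make the difference concrete, here is a minimal UPC fragment; the lowering described in the comments is my reading of the slide, not the actual XL intermediate representation:

    #include <upc_relaxed.h>

    shared int a[100 * THREADS];
    int x;                           /* private, one copy per thread */

    void f(int i)
    {
        /* With early RTS-call insertion, each use of a[i] below becomes a
           separate opaque runtime call, so common subexpression elimination
           cannot merge them.  When shared accesses remain typed W-code
           operations, TPO can reduce this (under relaxed consistency) to a
           single remote load before any RTS call is generated. */
        x = a[i] + a[i];
    }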

Slide 5: XL Compiler (cont’d)

Toronto Portable Optimizer (TPO) can “apply all the classical optimizations” to shared memory accesses.

UPC-specific optimizations are also performed.

Slide 6: 2.2 UPC Runtime System

The RTS targets:
- SMPs, using Pthreads
- Ethernet and LAPI clusters, using LAPI
- BlueGene/L, using the BlueGene/L message layer

TPO does link-time optimizations between the user program and the RTS.

Shared objects are accessed through handles.

Slide 7: Shared objects

The RTS identifies five shared object types:
- shared scalars
- shared structures/unions/enumerations
- shared arrays
- shared pointers [sic] with shared targets
- shared pointers [sic] with private targets

“Fat” pointers increase remote access costs and limit scalability.

(optimizing remote accesses is discussed soon)
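For reference, a conventional pointer-to-shared is "fat" because it carries roughly this much state; the field names below are illustrative, not IBM's actual layout:

    /* Conceptual layout of a "fat" pointer-to-shared. */
    struct fat_ptr {
        unsigned long thread;   /* thread that owns the target */
        unsigned long phase;    /* position within the current block */
        unsigned long addr;     /* address of the target on the owner */
    };

Every remote access has to carry and unpack all of this state, which is the cost the slide refers to; the SVD scheme on the next slide works with handles instead of raw addresses.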

Slide 8: Shared Variable Directory (SVD)

Each thread on a distributed memory machine contains a two-level SVD containing handles pointing to all shared objects.

The SVD in each thread has THREADS+1 partitions.

Partition i contains handles for shared objects with affinity to thread i; the last partition contains handles for statically declared shared arrays.

Local sections of shared arrays do not have to be mapped to the same address on each thread.
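A hypothetical sketch of that two-level structure, with invented type and field names (the real RTS data structures are certainly more elaborate):

    /* Level-2 entry: one handle per shared object. */
    typedef struct {
        void  *local_addr;   /* base of the local section; valid only on the owner */
        size_t size;         /* size of the local section in bytes */
    } svd_entry;

    /* Level-1 partition: the handles owned by one thread. */
    typedef struct {
        svd_entry *entries;
        size_t     count;
    } svd_partition;

    /* Per-thread directory: THREADS + 1 partitions, the extra one holding
       statically declared shared arrays. */
    svd_partition *svd;      /* allocated with THREADS + 1 partitions at startup */

A remote reference travels as a (partition, index) pair rather than as a raw address, which is why the local sections need not be mapped at the same address on every thread.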

Slide 9: SVD benefits

Scalability: pointers-to-shared do not have to span all of shared memory. Only the owner knows the addresses of its shared objects. Remote accesses are made via handles.

Each thread mediates access to its shared objects, so coherence problems are reduced [1].

Only nonblocking synchronization is needed for upc_global_alloc(), for example.

[1] Runtime caching is beyond the scope of this paper.
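For context, upc_global_alloc() is UPC's non-collective allocator: one thread allocates shared space that is distributed across all threads. A minimal usage sketch; the comment about SVD registration is my reading of the slide:

    #include <upc.h>

    shared void * shared p;   /* pointer-to-shared, itself shared with affinity to thread 0 */

    int main(void)
    {
        if (MYTHREAD == 0) {
            /* Non-collective: thread 0 alone allocates THREADS blocks of
               1024 bytes, one block with affinity to each thread.  Per the
               slide, registering the new handle in each thread's SVD needs
               only nonblocking synchronization. */
            p = upc_global_alloc(THREADS, 1024);
        }
        upc_barrier;          /* publish p before other threads use it */
        return 0;
    }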

Slide 10: 2.3 Messaging Library

This topic is beyond the scope of this talk.

Note, however, that the network layer does not support one-sided communication.

Slide 11: 3. Compiler Optimizations

3.1 upc_forall(init; limit; incr; affinity)

3.2 Local memory optimizations

3.3 Update optimizations

Slide 12: 3.1 upc_forall

The affinity parameter may be:
- a pointer-to-shared
- an integer expression
- continue

If the affinity expression is the (unmodified) loop induction variable, the affinity conditional is eliminated.

This is the only upc_forall optimization performed.

“... even this simple optimization captures most of the loops in the existing UPC benchmarks.”
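A sketch of the transformation when the affinity expression is the unmodified induction variable; the strided loop shown in the comment is the standard rewrite, not necessarily the exact code XL emits:

    #include <upc_relaxed.h>
    #define N 1024

    shared double a[N * THREADS];   /* default cyclic layout: a[i] lives on thread i % THREADS */

    void zero(void)
    {
        int i;
        /* Integer affinity equal to the unmodified induction variable:
           iteration i runs only on the thread where i % THREADS == MYTHREAD. */
        upc_forall (i = 0; i < N * THREADS; i++; i)
            a[i] = 0.0;

        /* Naive code tests the affinity on every iteration:
               for (i = 0; i < N * THREADS; i++)
                   if (i % THREADS == MYTHREAD) a[i] = 0.0;
           The optimized loop drops the test and strides instead:
               for (i = MYTHREAD; i < N * THREADS; i += THREADS)
                   a[i] = 0.0;                                      */
    }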

Slide 13: Observations

upc_forall loops cannot be meaningfully nested.

upc_forall loops must be inner loops for this optimization to pay off.

Slide 14: 3.2 Local Memory Operations

Try to turn dereferences of fat pointers into dereferences of ordinary C pointers.

The optimization is attempted only when affinity can be statically determined.

Move the base address calculation to the loop preheader (initialization block).

Generate code to access intrinsic types directly, otherwise use memcpy.
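Done by hand, the transformation looks roughly like this (a sketch with assumed sizes and block size, not the compiler's actual output):

    #include <upc_relaxed.h>
    #define N 1024

    shared [N] double a[N * THREADS];   /* blocked: thread t owns a[t*N .. t*N + N-1] */

    void scale(double s)
    {
        /* Base address computed once (this is what the compiler hoists into
           the loop preheader), then dereferenced as an ordinary C pointer.
           The cast is legal because every element touched below has
           affinity to MYTHREAD. */
        double *la = (double *)&a[MYTHREAD * N];
        for (int i = 0; i < N; i++)
            la[i] *= s;
    }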

Slide 15: 3.3 Update Optimizations

Consider operations of the form r = r op B, where r is a remote shared object and B is local or remote.

Implement this as an active message [Culler, UC Berkeley].

Send the entire operation to the thread with affinity to r, which applies the update locally.
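The shape of the transformation, sketched with an invented message-layer interface (the lookup and send functions below are hypothetical, not the actual BlueGene/L message layer API):

    #include <stddef.h>

    typedef struct { long handle_index; double operand; } update_msg;

    /* Hypothetical helpers standing in for the RTS and message layer. */
    double *svd_lookup_local(long handle_index);
    void send_active_message(int thread, void (*handler)(update_msg *),
                             const update_msg *payload, size_t nbytes);

    /* Runs on the thread that owns r: apply r = r + B locally. */
    void update_handler(update_msg *m)
    {
        double *r = svd_lookup_local(m->handle_index);
        *r += m->operand;
    }

    /* Runs on the initiating thread: instead of a remote get, a local add,
       and a remote put, ship the whole operation to r's owner. */
    void remote_add(int owner_thread, long handle_index, double b)
    {
        update_msg m = { handle_index, b };
        send_active_message(owner_thread, update_handler, &m, sizeof m);
    }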

Slide 16: 4. Experimental Results

4.1 Hardware

4.2 HPC RandomAccess benchmark

4.3 Embarrassingly Parallel (EP) STREAM triad

4.4 NAS CG

4.5 Performance evaluation

Slide 17: 4.1 Hardware

Development was done on 64-processor node cards.

TJ Watson: 20 racks, 40,960 processors

LLNL: 64 racks, 131,072 processors

Slide 18: 4.2 HPC RandomAccess

111 lines of code

Read-modify-write randomly selected remote objects.

Use 50% of memory.

[Seems a good match for the update optimization.]
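The heart of the benchmark is a kernel like the following; this is a sketch in the spirit of the 111-line UPC code, not IBM's source, and it assumes THREADS is a power of two so the index mask works:

    #include <upc_relaxed.h>
    #include <stdint.h>

    #define LTABSIZE 20                          /* log2 of per-thread table size (assumed) */
    #define TABSIZE  (1UL << LTABSIZE)
    #define POLY     0x0000000000000007ULL       /* HPCC RandomAccess polynomial */

    shared uint64_t Table[TABSIZE * THREADS];    /* cyclic distribution */

    static uint64_t next_random(uint64_t ran)
    {
        return (ran << 1) ^ ((int64_t)ran < 0 ? POLY : 0);
    }

    void random_access(uint64_t nupdates)
    {
        uint64_t ran = 1;
        for (uint64_t i = 0; i < nupdates; i++) {
            ran = next_random(ran);
            /* Read-modify-write of a randomly selected, usually remote,
               element: exactly the r = r op B pattern that the update
               optimization turns into a single active message. */
            Table[ran & (TABSIZE * THREADS - 1)] ^= ran;
        }
    }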

Slide 19: 4.3 EP STREAM Triad

105 lines of code

All computation is done locally within a upc_forall loop.

[Seems like a good match for the loop optimization.]
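The triad kernel in UPC looks roughly like this (a sketch with an assumed per-thread size and block size; the benchmark's actual declarations may differ):

    #include <upc_relaxed.h>
    #define N 1000000                  /* per-thread elements (assumed) */

    shared [N] double a[N * THREADS], b[N * THREADS], c[N * THREADS];

    void triad(double scalar)
    {
        long i;
        /* The affinity expression follows a's layout, and b and c have the
           same layout, so every access is local; combined with the forall
           and privatization optimizations this becomes a purely local loop. */
        upc_forall (i = 0; i < (long)N * THREADS; i++; &a[i])
            a[i] = b[i] + scalar * c[i];
    }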

Slide 20: 4.4 NAS CG

GW’s (George Washington University) translation of the MPI code into UPC.

Uses upc_memcpy in place of MPI sends and receives.

It is not clear whether IBM used GW’s hand-optimized version.

IBM mentions that they manually privatized some pointers, which is what is done in GW’s optimized version.
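The communication style being described replaces a matched MPI send/receive pair with a single copy into the partner's shared buffer. A schematic sketch, with buffer size and synchronization as assumptions rather than the actual GW or IBM CG code:

    #include <upc.h>
    #define CHUNK 1024                 /* doubles exchanged per step (assumed) */

    shared [CHUNK] double send_buf[CHUNK * THREADS];
    shared [CHUNK] double recv_buf[CHUNK * THREADS];

    void exchange_with(int partner)
    {
        /* Deposit my chunk directly into the partner's receive buffer;
           no matching receive call is needed on the other side. */
        upc_memcpy(&recv_buf[partner * CHUNK], &send_buf[MYTHREAD * CHUNK],
                   CHUNK * sizeof(double));
        upc_barrier;                   /* stand-in for the benchmark's own synchronization */
    }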

Slide 21: 4.5 Performance Evaluation

Table 1: "FE" is a MuPC-style front end containing some optimizations; the others all use the TPO front end.

- no optimizations
- "indexing" is the shared-to-local pointer reduction
- "update" is active messages
- "forall" is the upc_forall affinity optimization

Speedups are relative to no TPO optimization. The maximum speedup for random and stream is 2.11.

Slide 22: Combined Speedup

The combined stream speedup is 241!

This is attributed to the shared-to-local pointer reductions.

This seems inconsistent with "indexing" speedups of 1.01 and 1.32 for the random and streams benchmarks, respectively.

Slide 23: Table 2: Random Access

This is basically a measurement of how many asynchronous messages can be started up.

It is not known whether the network can do combining.

UPC beats MPI (0.56 vs. 0.45) on 2048 processors.

Slide 24: Table 3: Streams

This is EP.

Slide 25: CG

Speedup tracks MPI through 512 processors.

Speedup exceeds MPI on 1024 and 2048 processors.

This is a fixed-problem-size benchmark, so network latency eventually dominates.

The improvement over MPI is explained: “In the UPC implementation, due to the use of one-sided communication, the overheads are smaller” compared to MPI’s two-sided overhead. [But the BlueGene/L network does not implement one-sided communication.]

Slide 26: Comments? I have some …

Recall slide 12: “... even this simple [upc_forall affinity] optimization captures most of the loops in the existing UPC benchmarks.”

From the abstract: “We demonstrate not only that shared memory programming for hundreds of thousands of processors is possible, but also that with the right support from the compiler and run-time system, the performance of the resulting codes is comparable to MPI implementations.”

The large-scale scalability demonstrated is for two roughly 100-line codes implementing the simplest of all the benchmarks.

The scalability of CG was knowingly limited by fixed problem size. Only two data points are offered that outperform MPI.

Slide 27

Week  Date  Topic                                                               Presenter
2     1/18  Organizational meeting
3     1/25  UPC Tutorial (SC05)                                                 Phil Merkey (slides)
4     2/1   MuPC Runtime System Design (PDP 2006)                               Steve Seidel (slides)
5     2/8   Reuse Distance in UPC (PACT05)                                      Steve Carr (slides)
6     2/15  UPC Memory Model (IPDPS 2004)                                       Øystein Thorsen (slides part A, slides part B)
7     2/22  Planning meeting
8     3/1   Communication Optimizations in the Berkeley UPC Compiler (PACT05)   Weiming Zhao (slides)
9     3/15  UPC on the BlueGene/L (IBM Research Report RC23853)                 Steve Seidel (slides)
10    3/22  Reuse Distances in UPC Applications                                 Phil Merkey
11    3/29
12    4/6
13    4/13  A UPC Performance Model (IPDPS 2006)                                Steve Seidel
14    4/20