TRANSCRIPT
Lawrence Livermore National Laboratory
Sequoia RFP and Benchmarking Status
UNCLASSIFIED
Scott Futral, Mark K. Seager, Tom Spelce
Lawrence Livermore National Laboratory
2008 SciComp Summer Meeting, 5/23/2008
Overview
Sequoia Objectives
• 25-50x BlueGene/L (367 TF/s) on science codes
• 12-24x Purple on integrated design codes

Sequoia Procurement Strategy
• Sequoia is actually a cluster of procurements
• Risk management pervades everything

Sequoia Target Architecture
• Driven by programmatic requirements and technical realities
• Requires innovation on several fronts
Sequoia will deliver petascale computing for the mission and pushes the envelope by 10-100x in every dimension!
By leveraging industry trends, Sequoia will successfully deliver a petascale UQ engine for the stockpile
Sequoia Production Platform Programmatic Drivers
• UQ engine for mission deliverables in the 2011-2015 timeframe

Programmatic drivers require an unprecedented leap forward in computing power

The program needs both capability and capacity
• 25-50x BGL (367 TF/s) for science codes (knob removal)
• 12-24x Purple for capability runs on Purple (8,192 MPI-task UQ engine)

These requirements, combined with current industry trends, drive us to a different target architecture than Purple or BGL
Predicting stockpile performance drives five separate classes of petascale calculations
1. Quantifying uncertainty (for all classes of simulation)
2. Identifying and modeling missing physics
3. Improving accuracy in material property data
4. Improving models for known physical processes
5. Improving the performance of complex models and algorithms in macro-scale simulation codes
Each of these mission drivers requires petascale computing
Sequoia Strategy
Two major deliverables
• Petascale scaling “Dawn” platform in 2009
• Petascale “Sequoia” platform in 2011

Lessons learned from previous capability and capacity procurements
• Leverage best-of-breed for platform, file system, SAN, and storage
• The major Sequoia procurement is for a long-term platform partnership
• Three R&D partnerships to incentivize bidders toward stretch goals
• Risk reduction built into the overall strategy from day one

Drive the procurement with a single peak mandatory requirement
• Target peak + sustained on marquee benchmarks
• Timescale, budget, and technical details as target requirements
• Include TCO factors such as power
To Minimize Risk, Dawn Deployment Extends the Existing Purple and BG/L Integrated Simulation Environment
ASC Dawn is the initial delivery system for Sequoia
Code development platform and scaling for Sequoia
0.5 petaFLOP/s peak for ASC production usage
Target production 2009-2014

Dawn Component Scaling
• Memory B:F = 0.3
• Mem BW B:F = 1.0
• Link BW B:F = 2.0
• Min Bisection B:F = 0.001
• SAN GB/s : PF/s = 384
• F is peak FLOP/s
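As an illustration (not on the slide), these ratios can be read against Dawn's 0.5 petaFLOP/s peak, assuming they are expressed relative to full-system peak:
• Memory: 0.3 B/F x 0.5e15 FLOP/s ≈ 150 TB of memory
• Memory bandwidth: 1.0 B/F x 0.5e15 FLOP/s ≈ 0.5 PB/s aggregate
• Minimum bisection: 0.001 B/F x 0.5e15 FLOP/s ≈ 0.5 TB/s
• SAN: 384 GB/s per PF/s x 0.5 PF/s ≈ 192 GB/s
The same reading applies to the Sequoia ratios on the later slide, scaled to its 14-20 petaFLOP/s peak.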
Sequoia Target Architecture in Integrated Simulation Environment Enables a Diverse Production Workload
Diverse usage models drive platform and simulation environment requirements
• 2D ultra-res and 3D high-res Quantification of Uncertainty engine
• 3D science capability for known unknowns and unknown unknowns
Peak of 14 petaFLOP/s with option for 20 petaFLOP/s
Target production 2011-2016

Sequoia Component Scaling
• Memory B:F = 0.08
• Mem BW B:F = 0.2
• Link BW B:F = 0.1
• Min Bisection B:F = 0.03
• SAN BW GB/s : PF/s = 25.6
• F is peak FLOP/s
Sequoia Targets A Highly Scalable Operating System
Compute nodes (1-N CNs per I/O node): lightweight kernel (LWK)
• Optimized for scalability and reliability
• As simple as possible; full control
• Extremely low OS noise
• Direct access to interconnect hardware
• OS features Linux-compatible, with OS functions forwarded to the I/O node OS
• Support for dynamic libraries and runtime loading
• Shared memory regions
• Open source

I/O node: Linux
• Leverage the huge Linux base and community
• Enhance TCP offload, PCIe, I/O
• Standard file systems: Lustre, NFSv4, etc.
• Factor to simplify: aggregates N CNs for I/O and administration
• Open source
[Figure: software stack diagram. Compute-node (CN) stacks, repeated per node over the Sequoia CN interconnect: Application, MPI, GLIBC, NPTL POSIX threads / OpenMP / SE-TM, glibc dynamic loading, ADI, hardware transport, RAS, futex syscalls, shared memory, function-shipped syscalls. I/O-node (ION) stack over the Sequoia ION interconnect: Linux/Unix, FSD, perf tools, TotalView, Lustre client, NFSv4, SLURMD, LNet, UDP, TCP/IP, serving function-shipped syscalls for N compute nodes.]
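To make the function-shipped syscall idea concrete, here is a purely conceptual, self-contained sketch (the struct, ion_service(), and cn_write() are hypothetical names, not the actual CNK/ION protocol): the lightweight kernel packages an I/O request and ships it to its I/O node, where full Linux executes it and returns the result; here the I/O node side is simulated in-process so the sketch runs on its own.

#include <string.h>
#include <unistd.h>

typedef struct {
    int    fd;            /* descriptor as the application sees it */
    size_t len;           /* number of payload bytes               */
    char   payload[4096]; /* data being written                    */
} ship_request_t;

static ssize_t ion_service(const ship_request_t *req)
{
    /* Stand-in for the I/O node: in a real system this runs on a separate
     * Linux node reached over the CN<->ION network and performs the real
     * syscall on the compute node's behalf. */
    return write(req->fd, req->payload, req->len);
}

static ssize_t cn_write(int fd, const void *buf, size_t count)
{
    /* What write() might reduce to on a compute node under function
     * shipping: package the request, "ship" it, wait for the reply. */
    ship_request_t req;
    if (count > sizeof(req.payload))
        count = sizeof(req.payload);   /* large writes would be chunked */
    req.fd  = fd;
    req.len = count;
    memcpy(req.payload, buf, count);
    return ion_service(&req);
}

int main(void)
{
    const char msg[] = "hello from a compute node\n";
    return cn_write(STDOUT_FILENO, msg, sizeof msg - 1) < 0;
}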
Sequoia Target Application Programming Model Leverages Factor and Simplify to Scale Applications to O(1M) Parallelism
MPI parallelism at the top level
• Static allocation of MPI tasks to nodes and sets of cores+threads
• Allow for MPI everywhere, just in case…

Effectively absorb multiple cores+threads in each MPI task

Support multiple languages
• C/C++/Fortran03/Python

Allow different physics packages to express node concurrency in different ways (a minimal hybrid sketch follows)
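A minimal sketch of this model (illustrative only, not code from the RFP or benchmarks): MPI tasks at the top level, with OpenMP absorbing the node's cores and threads inside each task. The array update and reduction are placeholder work.

/* MPI at the top level, OpenMP expressing node concurrency inside each task. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    int rank, ntasks;
    static double x[N];

    MPI_Init(&argc, &argv);                   /* one MPI task per node (or per core) */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* Node-level concurrency expressed with OpenMP inside the MPI task. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = rank + 0.5 * i;

    /* MPI only at the top level: e.g. a global reduction across tasks. */
    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++)
        local += x[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d tasks x %d threads, sum = %g\n",
               ntasks, omp_get_max_threads(), global);
    MPI_Finalize();
    return 0;
}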
With Careful Use of Node Concurrency We can Support A Wide Variety of Complex Applications
MPI tasks on a node are processes (one shown) with multiple OS threads (Thread0-3 shown)

Thread0 is the “main thread”; Thread1-3 are helper threads that morph from Pthreads to OpenMP workers to TM/SE compiler-generated threads via runtime support

Hardware support to significantly reduce overheads for thread repurposing and for OpenMP loops and locks
[Figure: timeline of the MAIN thread (Thread0) and helper Threads 1-3 between MPI_INIT and MPI_FINALIZE, alternating among MPI calls, OpenMP regions, and TM/SE regions in Funct1 and Funct2.]

1) Pthreads born with MAIN
2) Only Thread0 calls functions to nest parallelism
3) Pthreads-based MAIN calls OpenMP-based Funct1
4) OpenMP Funct1 calls TM/SE-based Funct2
5) Funct2 returns to OpenMP-based Funct1
6) Funct1 returns to Pthreads-based MAIN
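A plain-C sketch of steps 2-6 (illustrative only; it assumes ordinary OpenMP with no TM/SE or hardware thread-repurposing support, which is what Sequoia would add): only the main thread opens the parallel region, the helper threads become OpenMP workers inside Funct1, and Funct2 marks where a TM/SE region would sit.

/* Only Thread0 (MAIN) opens parallel regions; helpers act as OpenMP workers. */
#include <omp.h>
#include <stdio.h>

static void Funct2(int i)          /* would be a TM/SE region on Sequoia */
{
    printf("  Funct2 from OpenMP worker %d, iteration %d\n",
           omp_get_thread_num(), i);
}

static void Funct1(void)           /* OpenMP-based physics package */
{
    #pragma omp parallel for       /* helper threads morph into OpenMP workers */
    for (int i = 0; i < 8; i++)
        Funct2(i);
}

int main(void)                     /* MAIN: Pthreads-based, Thread0 */
{
    omp_set_num_threads(4);        /* Thread0 plus three helper threads */
    printf("MAIN (Thread0) calling OpenMP-based Funct1\n");
    Funct1();                      /* steps 3-6 of the sequence above */
    printf("back in Pthreads-based MAIN\n");
    return 0;
}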
Code Development Tools
Sequoia Distributed Software Stack Targets a Familiar Environment for Easy Application Porting
[Figure: Sequoia software stack on the LWK and Linux kernels. User space: APPLICATION; C/C++/Fortran compilers, Python; Clib/F03 runtime; optimized math libs; parallel math libs; MPI2 over ADI; OpenMP, threads, SE/TM; Lustre client; sockets, TCP, UDP, IP; LNet; SLURM/Moab; code development tools infrastructure. Kernel space: interconnect interface, function-shipped syscalls, RAS and control system, external network.]
Sequoia Platform Target Performance is a Combination of Peak and Application Sustained Performance
“Peak” of the machine is the absolute maximum performance
• FLOP/s = FLoating point OPerations per second

Sustained is a weighted average of the five “marquee” benchmark codes’ “Figures of Merit”
• Four IDC package benchmarks and one “science workload” benchmark from SNL
• FOM chosen to mimic “grind times” and factor out scaling issues
Purple: 0.1 PF/s; BlueGene/L: 0.4 PF/s
Sequoia Benchmarks have already incentivized the industry to work on problems relevant to our mission needs
What’s missing?
• Hydrodynamics
• Structural mechanics
• Quantum MD
ASC Sequoia Benchmarks
Each benchmark is written in some combination of Fortran, Python, C, and C++, and exercises MPI, OpenMP, and/or Pthreads parallelism.

Tier 1 (marquee performance codes)
• UMT: Single physics package code. Unstructured-Mesh deterministic radiation Transport.
• AMG: Algebraic Multi-Grid linear system solver for unstructured-mesh physics packages.
• IRS: Single physics package code. Implicit Radiation Solver for the diffusion equation on a block-structured mesh.
• SPhot: Single physics package code. Monte Carlo Scalar PHOTon transport code.
• LAMMPS: Full-system science code. Classical molecular dynamics simulation code (as used).

Tier 2 (subsystem functionality and performance tests)
• Pynamic: Dummy application that closely models the footprint of an important Python-based multi-physics ASC code.
• CLOMP: Measures OpenMP overheads and other performance impacts due to threading.
• FTQ: Fixed Time Quantum test; measures operating system noise.
• IOR: Interleaved Or Random I/O benchmark; used for testing the performance of parallel file systems with various interfaces and access patterns.
• Phloem MPI Benchmarks: Collection of independent MPI benchmarks measuring the health and stability of various aspects of MPI performance, including interconnect messaging rate, latency, aggregate bandwidth, and collective latencies under heavy network loads.
• Memory Benchmarks: Collection of STREAMS and STRIDE memory benchmarks that measure the memory subsystem under a variety of memory access patterns.

Tier 3 (single-core and compiler tests)
• UMTMk: Threading compiler test and single-core performance.
• AMGMk: Sparse matrix-vector operations; single-core and OpenMP performance.
• IRSMk: Single-core optimization and SIMD compiler challenge.
• SPhotMk: Single-core integer arithmetic and branching performance.
• CrystalMk: Single-core optimization and SIMD compiler challenge.
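As a concrete flavor of the Tier 2 tests, an FTQ-style loop is easy to sketch (illustrative only; this is not the actual FTQ source): do a fixed unit of work repeatedly inside each fixed time quantum and record how many units completed; dips in the counts expose operating-system noise.

/* Illustrative fixed-time-quantum loop for OS noise, in the spirit of FTQ. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define QUANTUM_NS  1000000L   /* 1 ms fixed time quantum */
#define NSAMPLES    1000       /* number of quanta to record */

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void)
{
    static long counts[NSAMPLES];
    volatile double sink = 0.0;

    for (int s = 0; s < NSAMPLES; s++) {
        double end = now_ns() + QUANTUM_NS;
        long work = 0;
        while (now_ns() < end) {           /* fixed quantum ...            */
            for (int i = 0; i < 100; i++)  /* ... fixed unit of work       */
                sink += i * 0.5;
            work++;
        }
        counts[s] = work;
    }
    for (int s = 0; s < NSAMPLES; s++)
        printf("%d %ld\n", s, counts[s]);  /* plot to see noise spikes */
    return 0;
}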
Validation and Benchmark Efforts
Platforms
• Purple (IBM Power5, AIX)
• BGL (IBM PPC440, LWK)
• BGP (IBM PPC450, LWK, SMP)
• Atlas (AMD Opteron, TOSS)
• Red Storm (AMD Opteron, Catamount)
• Franklin (AMD Opteron, CNL)
• Phoenix (Vector, UNICOS)
Weak Scaling on Purple
[Figure: Weighted Figure of Merit (up to ~1e14) vs. number of PEs (thousands) for SPhot, UMT, AMG, and IRS.]
The strategy for aggregating performance incentivizes vendors in two ways.
AMG wFOM = A x (solution vector size) x iterations / sec
IRS wFOM = B x (temperature variables) x iterations / sec
SPhot wFOM = C x tracks / sec
UMT wFOM = D x corners x angles x groups x zones x iterations / sec
LAMMPS wFOM = E x atom updates / sec

awFOM = wFOM_AMG + wFOM_IRS + wFOM_SPhot + wFOM_UMT + wFOM_LAMMPS

1 – Peak (petaFLOP/s)
2 – #MPI tasks per node <= memory per node / 2 GB
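A trivial sketch of the aggregation (the weights A-E and the raw figures below are placeholders, not RFP values): each marquee code reports a raw figure of merit, the weight scales it, and the sustained metric is the sum of the weighted FOMs.

/* Aggregate weighted Figure of Merit; weights and raw FOMs are placeholders. */
#include <stdio.h>

struct fom { const char *code; double weight; double raw_fom; };

int main(void)
{
    struct fom marquee[] = {
        { "AMG",    1.0, 2.5e9  },   /* solution-vector size * iter / sec       */
        { "IRS",    1.0, 6.0e9  },   /* temperature variables * iter / sec      */
        { "SPhot",  1.0, 1.4e10 },   /* tracks / sec                            */
        { "UMT",    1.0, 5.0e10 },   /* corners*angles*groups*zones*iter / sec  */
        { "LAMMPS", 1.0, 1.0e10 },   /* atom updates / sec                      */
    };
    int n = (int)(sizeof marquee / sizeof marquee[0]);
    double awfom = 0.0;
    for (int i = 0; i < n; i++) {
        double w = marquee[i].weight * marquee[i].raw_fom;
        printf("%-7s wFOM = %.3e\n", marquee[i].code, w);
        awfom += w;
    }
    printf("awFOM = %.3e\n", awfom);
    return 0;
}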
AMG Results
[Figure: AMG weak scaling. Raw Figure of Merit (billions) vs. PEs (thousands) for BG/L, Atlas-6A, Red Storm-6A, and Purple-6A.]
AMG message size distribution
[Figure: AMG message count by size (bytes) at 1024, 2048, and 4096 PEs; AMG (4096) MPI characteristics showing message count and time vs. message size (bytes).]
An improved messaging rate would significantly impact AMG communication performance.
UMT and SPhot results
[Figure: UMT weak scaling and SPhot weak scaling. Raw Figure of Merit (billions) vs. PEs (thousands) for Purple, BG/L, Atlas, and Red Storm (with a Purple rerun for SPhot).]
Observations of the messaging rate for UMT indicate that we need to include messaging rate as an interconnect requirement
[Figure: UMT messaging rate: number of occurrences vs. messages/sec for the S6 and S12 problems; maximum message rate (per second) vs. window size (microseconds) for S6 and S12.]
Messaging is very bursty, and most messaging occurs at a high messaging rate.
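One way to reproduce this kind of measurement on any platform (illustrative only, not the Phloem or UMT instrumentation): stream small nonblocking messages between two ranks and report completed messages per second; varying the batch depth roughly corresponds to the window sizes in the plots.

/* Minimal small-message rate probe between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>

#define BATCH 64
#define NMSG  (1600 * BATCH)

int main(int argc, char **argv)
{
    int rank, size, payload = 0, rbuf[BATCH];
    MPI_Request reqs[BATCH];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    if (size >= 2 && rank < 2) {            /* ranks 0 and 1 exchange; others idle */
        for (int sent = 0; sent < NMSG; sent += BATCH) {
            for (int i = 0; i < BATCH; i++) {
                if (rank == 0)
                    MPI_Isend(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &reqs[i]);
                else
                    MPI_Irecv(&rbuf[i], 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &reqs[i]);
            }
            MPI_Waitall(BATCH, reqs, MPI_STATUSES_IGNORE);
        }
    }
    double dt = MPI_Wtime() - t0;
    if (rank == 0 && size >= 2)
        printf("small-message rate: %.0f msgs/sec\n", NMSG / dt);

    MPI_Finalize();
    return 0;
}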
IRS (Implicit Radiation Solver) results
[Figure: IRS weak scaling. Raw FOM (billions) vs. number of PEs (thousands) for Purple, BG/L, Atlas, and Red Storm.]
IMBALANCE (MAX / AVG)

#PE      Model     Power5    BG/L      Red Storm
512      1.1429    1.521     1.061
1,000    1.1111    1.487     1.092     1.064
2,197    1.0833    1.428     1.080     1.052
4,096    1.0667    1.352     1.067     1.030
8,000    1.0526              1.052
IRS Load Imbalance has two components: compute and communications
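The MAX/AVG metric in the table is straightforward to compute for any timed phase; a minimal sketch (illustrative, not the IRS instrumentation; the uneven busy-loop just fakes a compute phase) follows.

/* Max/avg load-imbalance measurement for a timed phase across MPI ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t0 = MPI_Wtime();
    /* The phase being measured would run here; fake uneven work instead. */
    volatile double x = 0.0;
    for (long i = 0; i < 1000000L * (rank % 4 + 1); i++)
        x += i;
    double t = MPI_Wtime() - t0;

    double tmax, tsum;
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("imbalance (max/avg) = %.4f\n", tmax / (tsum / size));
    MPI_Finalize();
    return 0;
}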
IRS "Send/Receive"Load Balance
40%
50%
60%
70%
80%
90%
100%
512 1,000 2,197 4,096 8,000
Processor Count
Percentage of Work
6_Core 5_Face 4_Edge 3_Corner
8,000512
COMPUTECOM PREPMPI B
C
MPI COM PREP
MPI APPLICATION
wire
COMPUTATIONCOMMUNICATION
% of MPI Time
0.01%
0.10%
1.00%
10.00%
100.00%
0 1 2 3 4 5 6 7 8 9Thousands
PEs
MPI_Allreduce
MPI_Waitany
MPI_Waitall
MPI_Bcast
MPI_Recv
MPI_Isend
MPI_Irecv
MPI_Wait
MPI_Send
BG/L
Summary
Sequoia is a carefully choreographed risk mitigation strategy to develop and deliver a huge leap forward in computing power to the National Stockpile Stewardship Program
Sequoia will work for weapons science and integrated design codes when delivered because our evolutionary approach yields a revolutionary advance on multiple fronts

The groundwork on system requirements, benchmarks, and the SOW is in place for the launch of a successful procurement competition for Sequoia