TRANSCRIPT
Lawrence Livermore National Laboratory
Sequoia RFP and Benchmarking Status
UNCLASSIFIED
Scott Futral, Mark K. Seager, Tom Spelce
Lawrence Livermore National Laboratory
2008 SciComp Summer Meeting, 5/23/2008
Overview
Sequoia Objectives
• 25-50x BlueGene/L (367 TF/s) on science codes
• 12-24x Purple on integrated design codes

Sequoia Procurement Strategy
• Sequoia is actually a cluster of procurements
• Risk management pervades everything

Sequoia Target Architecture
• Driven by programmatic requirements and technical realities
• Requires innovation on several fronts
Sequoia will deliver petascale computing for the mission and pushes the envelope by 10-100x in every dimension!
By leveraging industry trends, Sequoia will successfully deliver a petascale UQ engine for the stockpile
Sequoia Production Platform Programmatic Drivers
• UQ engine for mission deliverables in the 2011-2015 timeframe

Programmatic drivers require an unprecedented leap forward in computing power

The program needs both capability and capacity
• 25-50x BGL (367 TF/s) for science codes (knob removal)
• 12-24x Purple for capability runs on Purple (8,192 MPI-task UQ engine)

These requirements, combined with current industry trends, drive us to a different target architecture than Purple or BGL
Predicting stockpile performance drives five separate classes of petascale calculations
1. Quantifying uncertainty (for all classes of simulation)
2. Identifying and modeling missing physics
3. Improving accuracy in material property data
4. Improving models for known physical processes
5. Improving the performance of complex models and algorithms in macro-scale simulation codes
Each of these mission drivers requires petascale computing
Sequoia Strategy
Two major deliverables
• Petascale scaling “Dawn” platform in 2009
• Petascale “Sequoia” platform in 2011

Lessons learned from previous capability and capacity procurements
• Leverage best-of-breed for platform, file system, SAN, and storage
• The major Sequoia procurement is for a long-term platform partnership
• Three R&D partnerships to incentivize bidders toward stretch goals
• Risk reduction built into the overall strategy from day one

Drive the procurement with a single peak mandatory requirement
• Target peak + sustained on marquee benchmarks
• Timescale, budget, and technical details as target requirements
• Include TCO factors such as power
To Minimize Risk, Dawn Deployment Extends the Existing Purple and BG/L Integrated Simulation Environment
ASC Dawn is the initial delivery system for Sequoia
Code development platform and scaling for Sequoia
0.5 petaFLOP/s peak for ASC production usage
Target production 2009-2014

Dawn Component Scaling
• Memory B:F = 0.3
• Mem BW B:F = 1.0
• Link BW B:F = 2.0
• Min Bisection B:F = 0.001
• SAN GB/s : PF/s = 384
• F is peak FLOP/s
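As an illustration (not on the slide), these ratios can be read against Dawn's 0.5 petaFLOP/s peak, assuming they are expressed relative to full-system peak:
• Memory: 0.3 B/F x 0.5e15 FLOP/s ≈ 150 TB of memory
• Memory bandwidth: 1.0 B/F x 0.5e15 FLOP/s ≈ 0.5 PB/s aggregate
• Minimum bisection: 0.001 B/F x 0.5e15 FLOP/s ≈ 0.5 TB/s
• SAN: 384 GB/s per PF/s x 0.5 PF/s ≈ 192 GB/s
The same reading applies to the Sequoia ratios on the later slide, scaled to its 14-20 petaFLOP/s peak.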
Sequoia Target Architecture in Integrated Simulation Environment Enables a Diverse Production Workload
Diverse usage models drive platform and simulation environment requirements
• 2D ultra-res and 3D high-res Quantification of Uncertainty engine
• 3D science capability for known unknowns and unknown unknowns
Peak of 14 petaFLOP/s with option for 20 petaFLOP/s
Target production 2011-2016

Sequoia Component Scaling
• Memory B:F = 0.08
• Mem BW B:F = 0.2
• Link BW B:F = 0.1
• Min Bisection B:F = 0.03
• SAN BW GB/s : PF/s = 25.6
• F is peak FLOP/s
Sequoia Targets A Highly Scalable Operating System
Compute nodes (1-N CNs per I/O node): lightweight kernel (LWK)
• Optimized for scalability and reliability
• As simple as possible; full control
• Extremely low OS noise
• Direct access to interconnect hardware
• OS features Linux-compatible, with OS functions forwarded to the I/O node OS
• Support for dynamic libraries and runtime loading
• Shared memory regions
• Open source

I/O node: Linux
• Leverage the huge Linux base and community
• Enhance TCP offload, PCIe, I/O
• Standard file systems: Lustre, NFSv4, etc.
• Factor to simplify: aggregates N CNs for I/O and administration
• Open source
[Figure: software stack diagram. Compute-node (CN) stacks, repeated per node over the Sequoia CN interconnect: Application, MPI, GLIBC, NPTL POSIX threads / OpenMP / SE-TM, glibc dynamic loading, ADI, hardware transport, RAS, futex syscalls, shared memory, function-shipped syscalls. I/O-node (ION) stack over the Sequoia ION interconnect: Linux/Unix, FSD, perf tools, TotalView, Lustre client, NFSv4, SLURMD, LNet, UDP, TCP/IP, serving function-shipped syscalls for N compute nodes.]
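To make the function-shipped syscall idea concrete, here is a purely conceptual, self-contained sketch (the struct, ion_service(), and cn_write() are hypothetical names, not the actual CNK/ION protocol): the lightweight kernel packages an I/O request and ships it to its I/O node, where full Linux executes it and returns the result; here the I/O node side is simulated in-process so the sketch runs on its own.

#include <string.h>
#include <unistd.h>

typedef struct {
    int    fd;            /* descriptor as the application sees it */
    size_t len;           /* number of payload bytes               */
    char   payload[4096]; /* data being written                    */
} ship_request_t;

static ssize_t ion_service(const ship_request_t *req)
{
    /* Stand-in for the I/O node: in a real system this runs on a separate
     * Linux node reached over the CN<->ION network and performs the real
     * syscall on the compute node's behalf. */
    return write(req->fd, req->payload, req->len);
}

static ssize_t cn_write(int fd, const void *buf, size_t count)
{
    /* What write() might reduce to on a compute node under function
     * shipping: package the request, "ship" it, wait for the reply. */
    ship_request_t req;
    if (count > sizeof(req.payload))
        count = sizeof(req.payload);   /* large writes would be chunked */
    req.fd  = fd;
    req.len = count;
    memcpy(req.payload, buf, count);
    return ion_service(&req);
}

int main(void)
{
    const char msg[] = "hello from a compute node\n";
    return cn_write(STDOUT_FILENO, msg, sizeof msg - 1) < 0;
}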
Sequoia Target Application Programming Model Leverages Factor and Simplify to Scale Applications to O(1M) Parallelism
MPI parallelism at the top level
• Static allocation of MPI tasks to nodes and sets of cores+threads
• Allow for MPI everywhere, just in case…

Effectively absorb multiple cores+threads in each MPI task

Support multiple languages
• C/C++/Fortran03/Python

Allow different physics packages to express node concurrency in different ways (a minimal hybrid sketch follows)
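A minimal sketch of this model (illustrative only, not code from the RFP or benchmarks): MPI tasks at the top level, with OpenMP absorbing the node's cores and threads inside each task. The array update and reduction are placeholder work.

/* MPI at the top level, OpenMP expressing node concurrency inside each task. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    int rank, ntasks;
    static double x[N];

    MPI_Init(&argc, &argv);                   /* one MPI task per node (or per core) */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* Node-level concurrency expressed with OpenMP inside the MPI task. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = rank + 0.5 * i;

    /* MPI only at the top level: e.g. a global reduction across tasks. */
    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++)
        local += x[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d tasks x %d threads, sum = %g\n",
               ntasks, omp_get_max_threads(), global);
    MPI_Finalize();
    return 0;
}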
With Careful Use of Node Concurrency We can Support A Wide Variety of Complex Applications
MPI tasks on a node are processes (one shown) with multiple OS threads (Thread0-3 shown)

Thread0 is the “main thread”; Thread1-3 are helper threads that morph from Pthreads to OpenMP workers to TM/SE compiler-generated threads via runtime support

Hardware support to significantly reduce overheads for thread repurposing and for OpenMP loops and locks
[Figure: timeline of the MAIN thread (Thread0) and helper Threads 1-3 between MPI_INIT and MPI_FINALIZE, alternating among MPI calls, OpenMP regions, and TM/SE regions in Funct1 and Funct2.]

1) Pthreads born with MAIN
2) Only Thread0 calls functions to nest parallelism
3) Pthreads-based MAIN calls OpenMP-based Funct1
4) OpenMP Funct1 calls TM/SE-based Funct2
5) Funct2 returns to OpenMP-based Funct1
6) Funct1 returns to Pthreads-based MAIN
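A plain-C sketch of steps 2-6 (illustrative only; it assumes ordinary OpenMP with no TM/SE or hardware thread-repurposing support, which is what Sequoia would add): only the main thread opens the parallel region, the helper threads become OpenMP workers inside Funct1, and Funct2 marks where a TM/SE region would sit.

/* Only Thread0 (MAIN) opens parallel regions; helpers act as OpenMP workers. */
#include <omp.h>
#include <stdio.h>

static void Funct2(int i)          /* would be a TM/SE region on Sequoia */
{
    printf("  Funct2 from OpenMP worker %d, iteration %d\n",
           omp_get_thread_num(), i);
}

static void Funct1(void)           /* OpenMP-based physics package */
{
    #pragma omp parallel for       /* helper threads morph into OpenMP workers */
    for (int i = 0; i < 8; i++)
        Funct2(i);
}

int main(void)                     /* MAIN: Pthreads-based, Thread0 */
{
    omp_set_num_threads(4);        /* Thread0 plus three helper threads */
    printf("MAIN (Thread0) calling OpenMP-based Funct1\n");
    Funct1();                      /* steps 3-6 of the sequence above */
    printf("back in Pthreads-based MAIN\n");
    return 0;
}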
Code Development Tools
Sequoia Distributed Software Stack Targets a Familiar Environment for Easy Application Porting
[Figure: Sequoia software stack on the LWK and Linux kernels. User space: APPLICATION; C/C++/Fortran compilers, Python; Clib/F03 runtime; optimized math libs; parallel math libs; MPI2 over ADI; OpenMP, threads, SE/TM; Lustre client; sockets, TCP, UDP, IP; LNet; SLURM/Moab; code development tools infrastructure. Kernel space: interconnect interface, function-shipped syscalls, RAS and control system, external network.]
Sequoia Platform Target Performance is a Combination of Peak and Application Sustained Performance
“Peak” of the machine is the absolute maximum performance
• FLOP/s = FLoating point OPerations per second

Sustained is a weighted average of the five “marquee” benchmark codes’ “Figures of Merit”
• Four IDC package benchmarks and one “science workload” benchmark from SNL
• FOM chosen to mimic “grind times” and factor out scaling issues
Purple: 0.1 PF/s; BlueGene/L: 0.4 PF/s
Sequoia Benchmarks have already incentivized the industry to work on problems relevant to our mission needs
What’s missing?
• Hydrodynamics
• Structural mechanics
• Quantum MD
ASC Sequoia Benchmarks
Each benchmark is written in some combination of Fortran, Python, C, and C++, and exercises MPI, OpenMP, and/or Pthreads parallelism.

Tier 1 (marquee performance codes)
• UMT: Single physics package code. Unstructured-Mesh deterministic radiation Transport.
• AMG: Algebraic Multi-Grid linear system solver for unstructured-mesh physics packages.
• IRS: Single physics package code. Implicit Radiation Solver for the diffusion equation on a block-structured mesh.
• SPhot: Single physics package code. Monte Carlo Scalar PHOTon transport code.
• LAMMPS: Full-system science code. Classical molecular dynamics simulation code (as used).

Tier 2 (subsystem functionality and performance tests)
• Pynamic: Dummy application that closely models the footprint of an important Python-based multi-physics ASC code.
• CLOMP: Measures OpenMP overheads and other performance impacts due to threading.
• FTQ: Fixed Time Quantum test; measures operating system noise.
• IOR: Interleaved Or Random I/O benchmark; used for testing the performance of parallel file systems with various interfaces and access patterns.
• Phloem MPI Benchmarks: Collection of independent MPI benchmarks measuring the health and stability of various aspects of MPI performance, including interconnect messaging rate, latency, aggregate bandwidth, and collective latencies under heavy network loads.
• Memory Benchmarks: Collection of STREAMS and STRIDE memory benchmarks that measure the memory subsystem under a variety of memory access patterns.

Tier 3 (single-core and compiler tests)
• UMTMk: Threading compiler test and single-core performance.
• AMGMk: Sparse matrix-vector operations; single-core and OpenMP performance.
• IRSMk: Single-core optimization and SIMD compiler challenge.
• SPhotMk: Single-core integer arithmetic and branching performance.
• CrystalMk: Single-core optimization and SIMD compiler challenge.
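As a concrete flavor of the Tier 2 tests, an FTQ-style loop is easy to sketch (illustrative only; this is not the actual FTQ source): do a fixed unit of work repeatedly inside each fixed time quantum and record how many units completed; dips in the counts expose operating-system noise.

/* Illustrative fixed-time-quantum loop for OS noise, in the spirit of FTQ. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define QUANTUM_NS  1000000L   /* 1 ms fixed time quantum */
#define NSAMPLES    1000       /* number of quanta to record */

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void)
{
    static long counts[NSAMPLES];
    volatile double sink = 0.0;

    for (int s = 0; s < NSAMPLES; s++) {
        double end = now_ns() + QUANTUM_NS;
        long work = 0;
        while (now_ns() < end) {           /* fixed quantum ...            */
            for (int i = 0; i < 100; i++)  /* ... fixed unit of work       */
                sink += i * 0.5;
            work++;
        }
        counts[s] = work;
    }
    for (int s = 0; s < NSAMPLES; s++)
        printf("%d %ld\n", s, counts[s]);  /* plot to see noise spikes */
    return 0;
}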
Validation and Benchmark Efforts
Platforms
• Purple (IBM Power5, AIX)
• BGL (IBM PPC440, LWK)
• BGP (IBM PPC450, LWK, SMP)
• Atlas (AMD Opteron, TOSS)
• Red Storm (AMD Opteron, Catamount)
• Franklin (AMD Opteron, CNL)
• Phoenix (Vector, UNICOS)
Weak Scaling on Purple
[Figure: Weighted Figure of Merit (up to ~1e14) vs. number of PEs (thousands) for SPhot, UMT, AMG, and IRS.]
The strategy for aggregating performance incentivizes vendors in two ways.
AMG wFOM = A x (solution vector size) x iterations / sec
IRS wFOM = B x (temperature variables) x iterations / sec
SPhot wFOM = C x tracks / sec
UMT wFOM = D x corners x angles x groups x zones x iterations / sec
LAMMPS wFOM = E x atom updates / sec

awFOM = wFOM_AMG + wFOM_IRS + wFOM_SPhot + wFOM_UMT + wFOM_LAMMPS

1 – Peak (petaFLOP/s)
2 – #MPI tasks per node <= memory per node / 2 GB
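A trivial sketch of the aggregation (the weights A-E and the raw figures below are placeholders, not RFP values): each marquee code reports a raw figure of merit, the weight scales it, and the sustained metric is the sum of the weighted FOMs.

/* Aggregate weighted Figure of Merit; weights and raw FOMs are placeholders. */
#include <stdio.h>

struct fom { const char *code; double weight; double raw_fom; };

int main(void)
{
    struct fom marquee[] = {
        { "AMG",    1.0, 2.5e9  },   /* solution-vector size * iter / sec       */
        { "IRS",    1.0, 6.0e9  },   /* temperature variables * iter / sec      */
        { "SPhot",  1.0, 1.4e10 },   /* tracks / sec                            */
        { "UMT",    1.0, 5.0e10 },   /* corners*angles*groups*zones*iter / sec  */
        { "LAMMPS", 1.0, 1.0e10 },   /* atom updates / sec                      */
    };
    int n = (int)(sizeof marquee / sizeof marquee[0]);
    double awfom = 0.0;
    for (int i = 0; i < n; i++) {
        double w = marquee[i].weight * marquee[i].raw_fom;
        printf("%-7s wFOM = %.3e\n", marquee[i].code, w);
        awfom += w;
    }
    printf("awFOM = %.3e\n", awfom);
    return 0;
}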
AMG Results
[Figure: AMG weak scaling. Raw Figure of Merit (billions) vs. PEs (thousands) for BG/L, Atlas-6A, Red Storm-6A, and Purple-6A.]
AMG message size distribution
[Figure: AMG message count by size (bytes) at 1024, 2048, and 4096 PEs; AMG (4096) MPI characteristics showing message count and time vs. message size (bytes).]
An improved messaging rate would significantly impact AMG communication performance.
UMT and SPhot results
[Figure: UMT weak scaling and SPhot weak scaling. Raw Figure of Merit (billions) vs. PEs (thousands) for Purple, BG/L, Atlas, and Red Storm (with a Purple rerun for SPhot).]
Observations of the messaging rate for UMT indicate that we need to include messaging rate as an interconnect requirement
[Figure: UMT messaging rate: number of occurrences vs. messages/sec for the S6 and S12 problems; maximum message rate (per second) vs. window size (microseconds) for S6 and S12.]
Messaging is very bursty, and most messaging occurs at a high messaging rate.
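One way to reproduce this kind of measurement on any platform (illustrative only, not the Phloem or UMT instrumentation): stream small nonblocking messages between two ranks and report completed messages per second; varying the batch depth roughly corresponds to the window sizes in the plots.

/* Minimal small-message rate probe between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>

#define BATCH 64
#define NMSG  (1600 * BATCH)

int main(int argc, char **argv)
{
    int rank, size, payload = 0, rbuf[BATCH];
    MPI_Request reqs[BATCH];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    if (size >= 2 && rank < 2) {            /* ranks 0 and 1 exchange; others idle */
        for (int sent = 0; sent < NMSG; sent += BATCH) {
            for (int i = 0; i < BATCH; i++) {
                if (rank == 0)
                    MPI_Isend(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &reqs[i]);
                else
                    MPI_Irecv(&rbuf[i], 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &reqs[i]);
            }
            MPI_Waitall(BATCH, reqs, MPI_STATUSES_IGNORE);
        }
    }
    double dt = MPI_Wtime() - t0;
    if (rank == 0 && size >= 2)
        printf("small-message rate: %.0f msgs/sec\n", NMSG / dt);

    MPI_Finalize();
    return 0;
}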
IRS (Implicit Radiation Solver) results
[Figure: IRS weak scaling. Raw FOM (billions) vs. number of PEs (thousands) for Purple, BG/L, Atlas, and Red Storm.]
IMBALANCE (MAX / AVG)

#PE      Model     Power5    BG/L      Red Storm
512      1.1429    1.521     1.061
1,000    1.1111    1.487     1.092     1.064
2,197    1.0833    1.428     1.080     1.052
4,096    1.0667    1.352     1.067     1.030
8,000    1.0526              1.052
IRS Load Imbalance has two components: compute and communications
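The MAX/AVG metric in the table is straightforward to compute for any timed phase; a minimal sketch (illustrative, not the IRS instrumentation; the uneven busy-loop just fakes a compute phase) follows.

/* Max/avg load-imbalance measurement for a timed phase across MPI ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t0 = MPI_Wtime();
    /* The phase being measured would run here; fake uneven work instead. */
    volatile double x = 0.0;
    for (long i = 0; i < 1000000L * (rank % 4 + 1); i++)
        x += i;
    double t = MPI_Wtime() - t0;

    double tmax, tsum;
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("imbalance (max/avg) = %.4f\n", tmax / (tsum / size));
    MPI_Finalize();
    return 0;
}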
IRS "Send/Receive"Load Balance
40%
50%
60%
70%
80%
90%
100%
512 1,000 2,197 4,096 8,000
Processor Count
Percentage of Work
6_Core 5_Face 4_Edge 3_Corner
8,000512
COMPUTECOM PREPMPI B
C
MPI COM PREP
MPI APPLICATION
wire
COMPUTATIONCOMMUNICATION
% of MPI Time
0.01%
0.10%
1.00%
10.00%
100.00%
0 1 2 3 4 5 6 7 8 9Thousands
PEs
MPI_Allreduce
MPI_Waitany
MPI_Waitall
MPI_Bcast
MPI_Recv
MPI_Isend
MPI_Irecv
MPI_Wait
MPI_Send
BG/L
Summary
Sequoia is a carefully choreographed risk mitigation strategy to develop and deliver a huge leap forward in computing power to the National Stockpile Stewardship Program
Sequoia will work for weapons science and integrated design codes when delivered because our evolutionary approach yields a revolutionary advance on multiple fronts

The groundwork on system requirements, benchmarks, and the SOW is in place for the launch of a successful procurement competition for Sequoia