Keynote presentation, Cray award, SC13
Supercomputing: The Next 10 Years
Marc Snir, Argonne National Laboratory & University of Illinois at Urbana-Champaign
Past
Those who cannot remember the past are condemned to repeat it (Santayana)
November 13
MCS -- Marc Snir
The Last Great Extinction
[Figure: Core count of leading Top500 system, log scale from 1 to 10,000,000. Annotation: the attack of the killer micros]
Shift from bipolar vector processor to clusters of MOS microprocessors
1990: The Attack of the Killer Micros (Eugene Brooks, 1990)
§ Bipolar technology had hit a power wall (nitrogen cooling)
§ Alternative materials were too expensive / not ready (gallium arsenide)
§ An alternative “good enough” technology was ready
– MOS microprocessors had been around 20 years and were a fast growing market
– MOS had a clear evolution path (“Moore’s Law”)
§ MOS was no better than bipolar (in 1991)
Cray C90 • 244 MHz • Vector • Vector registers • 16 shared-memory nodes
CM5 • 32 MHz • Scalar • Cache • 1024 message-passing nodes
§ New paradigm took a while to establish itself (CM1, CM2, KSR…)
§ Change in technology led to change in vendors and business model
§ Technology shift required a long and painful process of code rewrite
Present
The past no longer is and the future is not yet (St. Augustine)
20 Years of (Near) Stability
§ One dominant programming model: Message-Passing (MPI)
§ One major shift: from single core to multicore
– Easy since one can treat each core as a node
[Figure: log-scale axis from 1 to 10,000,000. Annotation: multicore]
Increasing Instability
§ Heterogeneous memory: NUMA, noncoherent shared memory, scratchpads…
§ Heterogeneous processing: GPUs, accelerators, big-small cores (NVIDIA, Xeon Phi, ARM big.LITTLE)
§ Hybrid Memory Cube & near-memory processing
§ No standard programming model
[Figure: log-scale axis from 1 to 10,000,000. Annotations: multicore, accelerators]
On Our Way to the Next Extinction?
§ History repeats itself:
– CMOS technology has hit a power wall
• Clock speed is not rising
– Alternative materials are (too) expensive / not ready (gallium arsenide and other III-V materials; nanowires, nanotubes)
“While power consumption is an urgent challenge, its leakage or static component will become a major industry crisis in the long term, threatening the survival of CMOS technology itself, just as bipolar technology was threatened and eventually disposed of decades ago” (ITRS 2011)
§ History does not repeat itself:
– There is a much larger industrial base
– An alternative “good enough” technology IS NOT ready
– There is much more code that needs to be rewritten if a new model is needed (>200 MLOCs)
Future
It is difficult to make predictions, especially about the future (Yogi Berra)
The End of Moore’s Law is Coming
§ Moore’s Law: The number of transistors per chip doubles every two/three years
§ Stein’s Law: If something cannot go on forever, it will stop
§ The question is not whether but when Moore’s Law will stop
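The doubling rule stated above compounds quickly, and the choice between a two-year and a three-year period matters a great deal over a decade. The following is a minimal sketch of that arithmetic only; the function name and the ten-year horizon are illustrative assumptions, not figures from the talk.

```python
def transistor_growth(years: float, doubling_period_years: float) -> float:
    """Growth factor in transistors per chip after `years`,
    assuming a clean doubling every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


# Compare the two doubling periods quoted on the slide over one decade
# (the 10-year horizon is an illustrative assumption).
for period in (2.0, 3.0):
    factor = transistor_growth(10.0, period)
    print(f"doubling every {period:g} years -> x{factor:.1f} per decade")
```

Under these assumptions a two-year period yields roughly three times as many transistors per decade as a three-year period, so even a modest stretch of the doubling period is a large slowdown.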
The 7nm Wall
19 November 2013
ANL-LBNL-ORNL-PNNL
(courtesy J. Aldun)
The End of the Road (?)
§ Quantum tunneling becomes a major obstacle as devices shrink
– 7-5nm feature size has long been predicted to be the lower limit for CMOS devices
• ITRS predicts 7.5nm will be reached in 2024
§ 7.5nm ~ 30 atoms of silicon
– Not much room for further miniaturization, independent of technology!
– Room for clock increase (new materials, quantum effect gates, cryogenic devices…)
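The "~30 atoms" figure can be checked with a one-line estimate. The sketch below assumes a silicon nearest-neighbour spacing of about 0.235 nm; this value is not stated in the talk, and the function name is hypothetical.

```python
# Assumed Si-Si nearest-neighbour distance in nm (not from the slides).
SI_NEAREST_NEIGHBOR_NM = 0.235


def atoms_across(feature_nm: float,
                 spacing_nm: float = SI_NEAREST_NEIGHBOR_NM) -> float:
    """Rough count of silicon atoms spanning a feature of the given width."""
    return feature_nm / spacing_nm


for feature in (7.5, 5.0):
    print(f"{feature} nm feature ~ {atoms_across(feature):.0f} atoms of silicon")
```

With that spacing, 7.5 nm comes out at a little over 30 atoms, in line with the slide's order-of-magnitude claim.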
The Last Mile is the Most Expensive Mile
§ New technologies are needed
– New materials (e.g., III-V, germanium thin channels, nanowires, nanotubes or graphene)
– New structures (e.g., 3D transistor structures)
– New packages (e.g., HMC, photonics)
– New lithography
– Control or tolerance of large variances (safety margins, resilience, aging)
§ New technologies are expensive
– NRE increases faster than profits, which forces consolidation
– Only two companies can sustain the investments needed to go below 22nm (Intel and Samsung) [Heck, Kaza, Pinner]
§ Less competition & larger investments = slower progress
The Future Is Not What It Was
(courtesy J. Aldun)
The Path of Least Resistance – Other than Moore
§ Industry goal is not increased performance; it is increased ROI. Industry will increasingly invest in alternatives as increasing performance becomes more expensive
– Low power, low cost
– New markets: MEMS, sensors
– System on a chip (smartphone, tablet)
✗ Fewer good commodity building blocks for HPC
– No low-power/high-flops/high-resilience CPU
✔ More opportunities for semi-custom and integration of multiple vendor IP on a chip
§ New business model for supercomputing?
– Semi-custom & system on a chip integrator
Exascale
Identified Issues
§ Scale (billion threads)
§ Power (10’s of MWatts)
– Communication: > 99% of power is consumed by moving operands across the memory hierarchy and across nodes
– Reduced memory size (communication in time)
§ Resilience: Something fails every hour; the machine is never “whole”
– Trade-off between power and resilience
§ Asynchrony: Equal work ≠ equal time
– Power management
– Error recovery
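The resilience point follows from simple scaling: if node failures are independent, the time between failures of the whole machine shrinks in proportion to the node count. The sketch below illustrates this; the node count and the node MTBF are hypothetical numbers chosen for illustration, not values from the talk.

```python
HOURS_PER_YEAR = 8760.0


def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """Mean time between failures of the whole machine, assuming
    independent nodes with identical, constant failure rates."""
    return node_mtbf_hours / nodes


def node_mtbf_needed_years(target_system_mtbf_hours: float, nodes: int) -> float:
    """Per-node MTBF (in years) required to reach a target system MTBF."""
    return target_system_mtbf_hours * nodes / HOURS_PER_YEAR


# Hypothetical machine: 100,000 nodes, each failing once per 10 years on average.
nodes = 100_000
print(f"system MTBF: {system_mtbf_hours(10 * HOURS_PER_YEAR, nodes):.2f} hours")
print(f"node MTBF for one failure per hour: "
      f"{node_mtbf_needed_years(1.0, nodes):.1f} years")
```

Under these assumptions, even nodes that individually last a decade produce a failure somewhere in the machine in under an hour.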
My Main Concerns
§ Uncertainty about underlying HW architecture
– Slower progress of IC will necessitate faster progress of architecture
– May not converge to a new, stable model
– It is not about porting applications to a new programming model; it is about designing applications for portability
§ Increased software complexity
– Simulations of complex systems + uncertainty quantification + optimization…
– Support of complex workflows (e.g., in situ analysis)
– Software management of power and failures
– Heterogeneity
– Scale and tight coupling (tail of distribution matters!)
– Hypothesis: software will continue to be dominant cause of failures
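Why the tail of the distribution matters at scale can be shown with a short calculation. In a tightly coupled step every task waits for the slowest one, so a rare per-task delay becomes a near-certain per-step delay. The sketch assumes independent tasks; the task counts and the 1-in-10,000 tail probability are illustrative assumptions.

```python
def prob_any_straggler(tasks: int, per_task_tail_prob: float) -> float:
    """Probability that at least one of `tasks` independent tasks
    lands in the slow tail during a synchronized step."""
    return 1.0 - (1.0 - per_task_tail_prob) ** tasks


# Illustrative assumption: each task is slow once in 10,000 steps.
for tasks in (100, 10_000, 1_000_000):
    p = prob_any_straggler(tasks, 1e-4)
    print(f"{tasks:>9} tasks -> P(step delayed) = {p:.3f}")
```

The per-task event stays rare, but the chance that some task hits it grows toward certainty as the task count rises, which is why mean-time estimates alone mislead at this scale.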
Conclusion
§ Moore’s Law is slowing down; the slow-down has many fundamental consequences, only a few of them explored in this talk
§ HPC is the “canary in the mine”:
– Issues appear earlier because of size and tight coupling
§ Optimistic view of the next decades: no stasis.
– A frenzy of innovation to continue pushing the current ecosystem, followed by a frenzy of innovation to use totally different compute technologies
§ Pessimistic view: The end is coming
Backup
Do We Care?
§ It’s all about Big Data now, simulations are passé.
§ B***t
§ All science is either physics or stamp collecting. (Ernest Rutherford)
– In Physical Sciences, experiments and observations exist to validate/refute/motivate theory. “Data Mining” not driven by a scientific hypothesis is “stamp collection”.
§ Simulation is needed to go from a mathematical model to predictions on observations.
– If system is complex (e.g., climate) then simulation is expensive
– Often, models are stochastic and predictions are statistical, complicating both simulation and data analysis
Observation Meets Data: Cosmology Computation Meets Data: The Argonne View
[Figure: workflow diagram. Panel labels: Mapping the Sky with Survey Instruments; Observations: Statistical error bars will ‘disappear’ soon!; Supercomputer Simulation Campaign; Emulator based on Gaussian Process Interpolation in High-Dimensional Spaces; Markov chain Monte Carlo; ‘Precision Oracle’; ‘Cosmic Calibration’; LSST Weak Lensing (w = -1, w = -0.9); LSST]
HACC+CCF (Domain science + CS + Math + Stats + Machine learning)
CCF = Cosmic Calibration Framework
HACC = Hardware/Hybrid Accelerated Cosmology Code(s)
(courtesy Salman Habib)
Record-breaking application: 3.6 Trillion particles, 14 Pflop/s