The New Era of Coprocessor in Supercomputing (The New Era of Coprocessor Applications in Parallel Computing)
Post on 23-Feb-2016
© Supermicro 2013
The New Era of Coprocessor in Supercomputing
(The New Era of Coprocessor Applications in Parallel Computing)
5/07/2013 @ BAH! Oil & Gas - Rio de Janeiro, Brazil
Marc XAB, M.A. (J. F. Oberlin University Graduate School), Country Manager
Super Micro Computer Inc.
Rua Funchal, 418. Sao Paulo - SP
www.supermicro.com/brazil
Networking in Rio
Company Overview
Fremont Facility
Revenues: FY10 $721 M, FY11 $942 M, FY12 $1 B
Global Footprint: >70 countries, 700 customers, 6800 SKUs
Production: US, EU and Asia production facilities
Engineering: 70% of workforce in engineering, SSI Member
Market Share: #1 Server Channel
Corporate Focus: Leader in Energy Efficient, HPC & Application-Optimized Systems
San Jose (Headquarters)
Fortune 2012 100 Fastest-Growing Companies
COPROCESSOR
A coprocessor is a computer processor used to supplement the functions of the primary processor (the CPU).
Operations performed by the coprocessor may be floating-point arithmetic, graphics, signal processing, string processing, encryption, or I/O interfacing with peripheral devices.
Math coprocessor – a computer chip that handles the floating-point operations and mathematical computations in a computer.
Graphics Processing Unit (GPU) – a separate card that handles graphics rendering and can improve performance in graphics-intensive applications, like games.
Secure crypto-processor – a dedicated computer on a chip or microprocessor for carrying out cryptographic operations, embedded in packaging with multiple physical security measures, which give it a degree of tamper resistance.
Network coprocessor.
...
Green500
Rank | MFLOPS/W | Site* | Computer* | Total Power (kW)
1 | 2,499.44 | National Institute for Computational Sciences/University of Tennessee | Beacon - Appro GreenBlade GB824M, Xeon E5-2670 8C 2.600GHz, Infiniband FDR, Intel Xeon Phi 5110P | 44.89
2 | 2,351.10 | King Abdulaziz City for Science and Technology | SANAM - Adtech ESC4000/FDR G2, Xeon E5-2650 8C 2.000GHz, Infiniband FDR, AMD FirePro S10000 | 179.15
3 | 2,142.77 | DOE/SC/Oak Ridge National Laboratory | Titan - Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x | 8,209.00
4 | 2,121.71 | Swiss Scientific Computing Center (CSCS) | Todi - Cray XK7, Opteron 6272 16C 2.100GHz, Cray Gemini interconnect, NVIDIA Tesla K20 Kepler | 129.00
5 | 2,102.12 | Forschungszentrum Juelich (FZJ) | JUQUEEN - BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect | 1,970.00
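The two numeric columns of the list above can be combined: efficiency (MFLOPS/W) multiplied by total power gives the performance each entry implies. The following is a minimal Python sketch of that arithmetic, using only the figures from the list; the function and variable names are illustrative assumptions, not part of the original slides.

```python
# Rows copied from the Green500 list above:
# (system name, efficiency in MFLOPS per watt, total power in kW).
GREEN500_ROWS = [
    ("Beacon", 2499.44, 44.89),
    ("SANAM", 2351.10, 179.15),
    ("Titan", 2142.77, 8209.00),
    ("Todi", 2121.71, 129.00),
    ("JUQUEEN", 2102.12, 1970.00),
]


def implied_tflops(mflops_per_watt: float, power_kw: float) -> float:
    """Performance implied by an efficiency figure and a power draw, in TFLOPS."""
    watts = power_kw * 1_000.0           # kW -> W
    mflops = mflops_per_watt * watts     # (MFLOPS/W) * W -> MFLOPS
    return mflops / 1_000_000.0          # 10^6 MFLOPS = 1 TFLOPS


for name, efficiency, power_kw in GREEN500_ROWS:
    print(f"{name:8s} {implied_tflops(efficiency, power_kw):10.1f} TFLOPS")
```

The spread is the point of the slide: Beacon leads on efficiency at roughly 112 TFLOPS and under 45 kW, while Titan delivers roughly 17.6 PFLOPS at a slightly lower efficiency and more than 8 MW.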
The Trend Indicated on Green500
“Submerged Supermicro Servers Accelerated by GPUs”
Supermicro 1U (single CPU) with two coprocessors
No requirement for room-level cooling
Operates at PUE ~ 1.12
25 kilowatts per rack – the breakpoint per rack (between regular air cooling and submerged cooling)
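PUE (Power Usage Effectiveness) is the ratio of total facility power to IT equipment power, so a PUE of about 1.12 means roughly 12% overhead on top of the IT load. The sketch below works that out for one rack, assuming (this is an assumption, the slide does not say so) that the 25 kW per rack figure is the IT load; the function names are illustrative.

```python
def facility_power_kw(it_power_kw: float, pue: float) -> float:
    """Total facility power for a given IT load, from PUE = facility / IT."""
    return it_power_kw * pue


def overhead_power_kw(it_power_kw: float, pue: float) -> float:
    """Power spent on cooling, distribution losses, etc. (everything non-IT)."""
    return facility_power_kw(it_power_kw, pue) - it_power_kw


# Figures from the slide; treating 25 kW as the rack's IT load is an assumption.
RACK_IT_KW = 25.0
PUE = 1.12

print(f"Facility power per rack: {facility_power_kw(RACK_IT_KW, PUE):.1f} kW")
print(f"Overhead per rack:       {overhead_power_kw(RACK_IT_KW, PUE):.1f} kW")
```

Under that assumption, each 25 kW rack draws about 28 kW at the facility level, of which about 3 kW is overhead.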
Case Study – Submerged Liquid Cooling
[Figure: cost efficiency versus kW per rack for air cooling and submerged liquid cooling; ~25 kW per rack marked]
Removed fans and heat sinks
Use SSD & updated BIOS
Reverse the handlers
Tesla: 2-3x Faster Every 2 Years
[Figure: DP GFLOPS per watt (axis 2 to 16) by year, 2008 to 2014, for T10, Fermi, Kepler and Maxwell; annotations: 512 cores, thousands of cores]
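To see what "2-3x every 2 years" compounds to over the 2008 to 2014 span of the chart, the sketch below applies the stated growth factor per two-year period. It is plain compounding arithmetic on the slide's claim, not measured data, and the function name is an illustrative assumption.

```python
def compound_gain(factor_per_period: float, years: float,
                  period_years: float = 2.0) -> float:
    """Cumulative gain after `years` if performance grows by
    `factor_per_period` every `period_years`."""
    return factor_per_period ** (years / period_years)


# Chart span taken from the slide's axis: 2008 to 2014.
SPAN_YEARS = 2014 - 2008

for factor in (2.0, 3.0):
    gain = compound_gain(factor, SPAN_YEARS)
    print(f"{factor:.0f}x every 2 years over {SPAN_YEARS} years -> {gain:.0f}x")
```

Over three two-year periods, the claimed range compounds to somewhere between 8x and 27x.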
GPU Supercomputer Momentum
[Figure: number of GPU-accelerated systems on the Top500 (axis 0 to 60), 2008 to 2013; annotations: First Double Precision GPU, Tesla Fermi Launched, 4x; 52 systems on the June 2012 Top500]
Case Study – PNNL
Expects supercomputer to rank in world's top 20 fastest machines.
Research for climate and environmental science, chemical processes, biology-based fuels that can replace fossil fuels, new materials for energy applications, etc.
Supermicro FatTwin™ with 2x MIC 5110P per node
Theoretical peak processing speed of 3.4 petaflops
42 racks / 195,840 cores
1440 compute nodes with conventional processors and Intel Xeon Phi "MIC" accelerators
128 GB memory per node
FDR Infiniband network
2.7 petabyte shared parallel file system (60 gigabytes per second read/write)
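The system-level figures above can be broken down per node with simple division. The following Python sketch uses only the numbers quoted on the slide; the variable names are illustrative, and decimal units (1 TB = 1000 GB, 1 PFLOPS = 1000 TFLOPS) are an assumption.

```python
# Figures quoted on the PNNL slide.
PEAK_PFLOPS = 3.4
TOTAL_CORES = 195_840
COMPUTE_NODES = 1440
MEMORY_PER_NODE_GB = 128

# Cores per node: the slide's totals divide evenly.
cores_per_node, remainder = divmod(TOTAL_CORES, COMPUTE_NODES)

# Peak per node in TFLOPS (decimal units assumed).
peak_tflops_per_node = PEAK_PFLOPS * 1000.0 / COMPUTE_NODES

# Aggregate memory across compute nodes in TB (decimal units assumed).
total_memory_tb = COMPUTE_NODES * MEMORY_PER_NODE_GB / 1000.0

print(f"Cores per node:       {cores_per_node} (remainder {remainder})")
print(f"Peak per node:        {peak_tflops_per_node:.2f} TFLOPS")
print(f"Total node memory:    {total_memory_tb:.2f} TB")
```

The result of 136 cores per node would be consistent with two 60-core Xeon Phi 5110P cards plus 16 host CPU cores, although the slide does not state the host CPU, so that split is an inference.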
Programming Paradigm
The Xeon Phi programming model and its optimization are shared across the Intel Xeon
CUDA (Compute Unified Device Architecture) – a parallel computing platform and programming model. CUDA provides developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs.
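To illustrate the data-parallel model that CUDA exposes, here is a pure-Python emulation of how a kernel maps threads to array elements. It is not CUDA code and runs sequentially on the CPU; it only mimics the index arithmetic (block index × block size + thread index) that a real kernel performs with `blockIdx.x`, `blockDim.x` and `threadIdx.x`. The names `saxpy_kernel` and `launch` are hypothetical helpers written for this sketch.

```python
def saxpy_kernel(block_idx, block_dim, thread_idx, n, a, x, y, out):
    """Body executed by one emulated thread: out[i] = a * x[i] + y[i]."""
    i = block_idx * block_dim + thread_idx  # global index, as in CUDA
    if i < n:                               # guard: last block may overshoot n
        out[i] = a * x[i] + y[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a 1-D kernel launch by running every (block, thread) pair in turn.
    On a GPU these iterations would execute in parallel."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # ceiling division: enough blocks to cover n

launch(saxpy_kernel, grid_dim, block_dim, n, 2.0, x, y, out)
print(out)
```

The design point is that the kernel body contains no loop over the data: each thread handles one element, and the launch configuration decides how many threads exist.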
Made Easier
Don't Complicate It
Key Notes
This is a new era of hybrid computing: heterogeneous architecture with PCI-E based coprocessors.
Specialized (or application-optimized) design is required for GPU/MIC applications and future HPC scalability.
There is more to come in the industry roadmap, with new technologies, power management and system architecture.
Configurable cooling & power for energy efficiency and performance are more and more critical.
The trend towards heterogeneous architecture poses many challenges for system builders and software developers in making efficient use of the resources.
The programming paradigm, and the investment it requires, is an important part of the selection criteria.
HPC Coprocessor Applications
Massively parallel architecture accelerates Scientific & Engineering Applications
Computational Finance
• Options pricing
• Risk analysis
• Algorithmic trading
Imaging and Computer Vision
• Medical imaging
• Visualization & docking
• Filmmaking & animation
Scientific
• Computational fluid dynamics
• Materials science
• Molecular dynamics
• Quantum chemistry
Simulation & Creation Design
• Mechanical design & simulation
• Structural mechanics
• Electronic Design Automation
Data Mining
• Data parallel mathematics
• Extend Excel with OLAP for planning & analysis
• Database and data analysis acceleration
Weather and Climate
• Weather
• Atmospheric
• Ocean Modeling
• Space Sciences
Oil and Gas/Seismic
• Seismic imaging
• Seismic Interpretation
• Reservoir Modeling
• Seismic Inversion
Tesla S1070
PCI-E x16
1U Twin™: The most powerful PSC
The fastest 1U server in the world
1U 4-GPU Standalone box
2U GPU w/ QDR IB onboard
2U Twin
2U 4-GPU
1U 3-GPU
7U GPU Blades: 20 CPUs + 20 GPUs
X9 (DP) 1U 4-GPU/MIC
X9 2U 6-GPU/MIC
X9 (UP) 1U 2-GPU/MIC
NVIDIA Kepler & Intel Xeon Phi support
Hybrid Computing
FatTwin™ 2-node: 8 GPUs or MICs per node
FatTwin™ 4-node: 3 GPUs or MICs per node
Ultra High Efficiency
2008 2009 2010 2011 2012 2013
4 GPUs or MICs: Workstation / 4U
Hybrid Computing Pioneer
GPGPU
Where it started…
Efficiency
Density
Mainstream
Communication Between Coprocessors
[Diagram labels: IB, IB, IB Switch]
The model used by existing CPU-GPU heterogeneous architectures for GPU-GPU communication: data travels via the CPU and the Infiniband (IB) Host Channel Adapter (HCA) and switch, or another proprietary interconnect.
Data transfer between cooperating GPUs in separate nodes in a TCA cluster enabled by the PEACH2 chip.
Schematic of the PEARL network within a CPU/GPU cluster
Implementation Example
Source: Tsukuba University
Designing GPU/MIC Optimized Systems
Performance: PCI-e lanes arrangement, PCB placement, interconnect
Mechanical design: mounting, location, space utilization
Thermal: air flow, fan speed control, location, noise control
Power support: PSU efficiency, wattage options, power management, number of power connectors (& location)
Summary
Coprocessor and Applications
Performance and Efficiency: Top500 & Green500
Hybrid Computing & HPC
GPU/MIC Optimized Systems Design Considerations: Performance, Mechanical Design, Thermal & Cooling, Power Support
Thank You!
Marc XAB
marc.xab@supermicro.com
Conference Puzzle
How do you put an ELEPHANT in a Refrigerator ?