Martin Tsenkov; ELFE; 221210014; 77B


Slide 1/17

Computing is changing more rapidly than ever before, and scientists have the unprecedented opportunity to change computing directions.

Slide 2/17

• Largest computer at a given time
• Technical use for science and engineering calculations
• Large government (defense, weather, aero) laboratories are the first buyers
• Price is no object
• Market size is 3-5

(Copyright G. Bell & TCM History Center)

Slide 3/17

Major challenges lie ahead for extreme computing:

• Power
• Parallelism
• and many others not discussed here

We will need completely new approaches and technologies to reach the exascale level. This opens up a unique opportunity for science applications to lead extreme-scale systems development.

Slide 4/17

Commercial Parallel Computer Architecture (from loosely coupled to tightly coupled):

• Commodity processor with commodity interconnect (loosely coupled): clusters built from Pentium, Itanium, Opteron, or Alpha processors with GigE, Infiniband, Myrinet, Quadrics, or SCI interconnects; NEC TX7; HP Alpha
• Commodity processor with custom interconnect: SGI Altix (Intel Itanium 2), Cray Red Storm (AMD Opteron)
• Custom processor with custom interconnect (tightly coupled): Cray X1, NEC SX-7, IBM Regatta, IBM Blue Gene/L

Slide 5/17

SGI Altix: the Columbia supercomputer at NASA's Advanced Supercomputing Facility at Ames Research Center. It consists of a 10,240-processor SGI Altix system composed of 20 nodes, each with 512 Intel Itanium 2 processors, running a Linux operating system. Used for black hole simulations.

Other examples: Hitachi SR11000, NEC SX-7, Apple, Cray Red Storm, Cray BlackWidow, IBM Blue Gene/L

http://imagine.gsfc.nasa.gov/Images/news/columbia_computer.jpg
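
A quick check of the Columbia figures, sketched in Python. The per-processor peak is taken from the Itanium 2 entry on slide 7/17 (1.5 GHz, 6 Gflop/s); the resulting system peak is only a rough estimate, not an official figure.

# Node and processor counts from this slide; per-CPU peak assumed from slide 7/17.
nodes = 20
cpus_per_node = 512
peak_per_cpu_gflops = 6.0                     # 1.5 GHz Itanium 2 (assumption)

total_cpus = nodes * cpus_per_node            # 10,240 processors, as stated
system_peak_tflops = total_cpus * peak_per_cpu_gflops / 1000
print(total_cpus, "processors, about", system_peak_tflops, "Tflop/s theoretical peak")
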
Slide 6/17

Time   $M     Structure              Example
1950   1      mainframes             many...
1960   3      instruction //sm       IBM / CDC mainframe SMP
1970   10     pipelining             7600 / Cray 1
1980   30     vectors; SCI           Crays
1990   250    MIMDs: mC, SMP, DSM    Crays / MPP
2000   1,000  ASCI, COTS MPP         Grid, Legion

(Copyright G. Bell & TCM History Center)

Slide 7/17

Intel Pentium Xeon: 3.2 GHz, peak = 6.4 Gflop/s; Linpack 100 = 1.7 Gflop/s; Linpack 1000 = 3.1 Gflop/s
AMD Opteron: 2.2 GHz, peak = 4.4 Gflop/s; Linpack 100 = 1.3 Gflop/s; Linpack 1000 = 3.1 Gflop/s
Intel Itanium 2: 1.5 GHz, peak = 6 Gflop/s; Linpack 100 = 1.7 Gflop/s; Linpack 1000 = 5.4 Gflop/s
HP Alpha EV68: 1.25 GHz, 2.5 Gflop/s
Others: HP PA-RISC, Sun UltraSPARC IV, MIPS R16000

Linpack: a standard benchmark program that tests how fast your computer runs (it solves a dense system of linear equations).
Gflop/s: one billion floating-point operations per second.
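
A short worked example of how these numbers relate, sketched in Python: theoretical peak is clock rate times floating-point operations per cycle, and the Linpack result reaches only a fraction of that peak. The flops-per-cycle values are inferred from the slide's own peak and clock figures, not taken from vendor documentation.

# Clock (GHz), peak (Gflop/s) and Linpack 1000 (Gflop/s), all from this slide.
chips = {
    "Pentium Xeon": (3.2, 6.4, 3.1),
    "Opteron":      (2.2, 4.4, 3.1),
    "Itanium 2":    (1.5, 6.0, 5.4),
}
for name, (ghz, peak, linpack1000) in chips.items():
    flops_per_cycle = peak / ghz        # e.g. 6.4 / 3.2 = 2 for the Xeon
    efficiency = linpack1000 / peak     # fraction of peak reached by Linpack 1000
    print(f"{name}: {flops_per_cycle:.0f} flops/cycle, "
          f"Linpack 1000 at {efficiency:.0%} of peak")
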

Slide 8/17

Interconnect        Switch topology   Cost per NIC   Cost per node   MPI Lat (µs)   1-way speed (MB/s)   Bi-Dir speed (MB/s)
Gigabit Ethernet    Bus               $50            $100            30             100                  150
SCI                 Torus             $1,600         $1,600          5              300                  400
QsNetII (R)         Fat Tree          $1,200         $2,900          3              880                  900
Myrinet (D card)    Clos              $595           $995            6.5            240                  480
Myrinet (E card)    Clos              $995           $1,395          6              450                  900
Infiniband 4X       Fat Tree          $1,000         $1,400          6              820                  790


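A rough transfer-time model for the table above, sketched in Python: the time to deliver a message of m bytes is approximately latency + m / bandwidth. The numbers plugged in are the table's one-way figures; real performance also depends on protocol and contention effects.

def transfer_time_us(msg_bytes, latency_us, bandwidth_mb_per_s):
    """Estimated one-way delivery time in microseconds: latency plus serialization."""
    return latency_us + msg_bytes / (bandwidth_mb_per_s * 1e6) * 1e6

# Gigabit Ethernet vs QsNetII for a 1 MB message (values from the table above).
for name, lat, bw in [("Gigabit Ethernet", 30, 100), ("QsNetII", 3, 880)]:
    print(name, round(transfer_time_us(1_000_000, lat, bw)), "us")
# Latency dominates small messages; bandwidth dominates large ones.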

Slide 9/17

Tree network: there is only one path between any pair of processors.
Fat tree network: increase the number of communication links close to the root, so the root level has more physical connections.
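
A minimal sketch of the fat-tree idea in Python, assuming an idealised binary fat tree of illustrative depth 4 (the depth is not from the slide): link capacity doubles toward the root, so every level carries the same total bandwidth and the root bottleneck of a plain tree disappears.

levels = 4                                    # hypothetical tree depth
for level in range(levels):                   # level 0 is just below the root
    links = 2 ** (level + 1)                  # links at this level of a binary tree
    capacity = 2 ** (levels - 1 - level)      # relative capacity of each link
    print(f"level {level}: {links} links x capacity {capacity} "
          f"= total bandwidth {links * capacity}")
# Every level totals 16 here, whereas a plain tree would bottleneck at the root.
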

Slide 10/17

Also known as a wrapped-around-mesh topology.

[Figures: a three-dimensional mesh and a mesh with wraparound]
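
A small Python sketch of why the wraparound links matter: on a ring of size n the hop distance between positions a and b is min(|a - b|, n - |a - b|), so each dimension contributes at most n/2 hops instead of up to n - 1 on a plain mesh. The 8 x 8 x 8 size below is just an illustrative assumption.

def torus_hops(a, b, dims):
    """Minimal hop count between nodes a and b in a torus with the given dimensions."""
    return sum(min(abs(x - y), n - abs(x - y)) for x, y, n in zip(a, b, dims))

dims = (8, 8, 8)                                  # illustrative torus size
print(torus_hops((0, 0, 0), (7, 7, 7), dims))     # 3 hops, thanks to wraparound
print(torus_hops((0, 0, 0), (4, 4, 4), dims))     # 12 hops, the worst case here
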

Slide 11/17

A Clos network is a kind of multistage switching network:

• Three stages, each consisting of a number of crossbars.
• The middle stage has redundant switching boxes to reduce the blocking probability.
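
A sketch of the classical Clos sizing rule behind that last point, in Python: for a 3-stage Clos network whose ingress crossbars have n inputs and m outputs to the middle stage, m >= 2n - 1 middle switches give a strictly non-blocking network and m >= n a rearrangeably non-blocking one, so adding "redundant" middle-stage switches lowers the blocking probability.

def clos_blocking_class(n, m):
    """Classify a 3-stage Clos network by its middle-stage switch count m."""
    if m >= 2 * n - 1:
        return "strictly non-blocking"
    if m >= n:
        return "rearrangeably non-blocking"
    return "blocking possible"

print(clos_blocking_class(n=4, m=4))   # rearrangeably non-blocking
print(clos_blocking_class(n=4, m=7))   # strictly non-blocking (m >= 2n - 1)
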

Slide 12/17

• By the Myricom company; the first Myrinet appeared in 1994
• An alternative to Ethernet for connecting the nodes in a cluster
• Operated entirely in user space, with no operating system delays
• Myrinet switch: 10 Gbps, $12,800; Clos networks up to 128 host ports
• 10G PCI Express NIC with fiber connectors

Slide 13/17

• By Quadrics (formed in 1996); uses a 'fat tree' topology
• QsNetII scales up to 4,096 nodes; each node might have multiple CPUs
• Designed for use within SMP systems
• MPI latency on a standard AMD Opteron starts at 1.22 µs; bandwidth on Intel Xeon EM64T is 912 Mbytes/s
• QsNetII E-Series 128-way switch

Slide 14/17

• Each chip contains two nodes
• Each node is a PPC440 processor
• Each node has 512 MB of local memory
• Each node runs a lightweight OS with MPI
• Each node runs one user process; no context switching at the node

Slide 15/17

Uses five networks:

• GigE for the I/O nodes and for connections to external systems
• A control network using Fast Ethernet
• A 3-D torus for node-to-node message passing; it handles the majority of application traffic (MPI messaging); longest path: 64 hops

The MPI software is highly customized:

• A collective network for broadcasting
• A barrier network
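
Where the 64-hop figure comes from, sketched in Python: on a torus each dimension contributes at most half its size to a shortest path. Assuming the full Blue Gene/L machine is laid out as a 64 x 32 x 32 torus (65,536 nodes; this dimension split is an assumption, not stated on the slide), the worst case works out to 64 hops.

dims = (64, 32, 32)                        # assumed full-system torus dimensions
longest_path = sum(n // 2 for n in dims)   # 32 + 16 + 16
print(longest_path, "hops")                # 64, matching the slide
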

Slide 16/17

Slide 17/17

System attributes            2010        2015                     2018
System peak                  2 Pflop/s   200 Pflop/s              1 Eflop/s
Power                        6 MW        15 MW                    20 MW
System memory                0.3 PB      5 PB                     32-64 PB
Node performance             125 GF      0.5 TF or 7 TF           1 TF or 10 TF
Node memory BW               25 GB/s     0.1 TB/s or 1 TB/s       0.4 TB/s or 4 TB/s
Node concurrency             12          O(100) or O(1,000)       O(1,000) or O(10,000)
System size (nodes)          18,700      50,000 or 5,000          1,000,000 or 100,000
Total node interconnect BW   1.5 GB/s    20 GB/s                  200 GB/s
MTTI (mean time to interrupt)  days      O(1 day)                 O(1 day)

Where a cell lists two values, they are two alternative design points for that year: many lower-performance nodes or fewer higher-performance ones.
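
One derived figure from this table, sketched in Python: energy efficiency in Gflop/s per watt, i.e. system peak divided by power. Reaching the 2018 target requires roughly a 150x improvement over the 2010 system.

# System peak (flop/s) and power (W), taken from the table above.
systems = {
    "2010": (2e15, 6e6),
    "2015": (200e15, 15e6),
    "2018": (1e18, 20e6),
}
for year, (flops, watts) in systems.items():
    print(year, round(flops / watts / 1e9, 2), "Gflop/s per watt")
# 2010: 0.33, 2015: 13.33, 2018: 50.0
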