multicore architecture ctrinitis

Upload: xsober

Post on 06-Apr-2018


  • 8/2/2019 Multicore Architecture CTrinitis

    1/79

    Technische Universität München

    Multicore Architectures

    CLGrid5 Workshop

    Valparaiso, Chile

    September 29th, 2008

    LRR-TUM, September 9th, 2008

    Carsten Trinitis

    Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR)

    Institut für Informatik, Technische Universität München

    2/79

    How did it all evolve?

    Mechanical devices

    Abacus, 3000 BC (?)

    1642: addition and subtraction, Blaise Pascal

    1822: Charles Babbage

    3/79

    Electromechanical Machines

    Based on relays

    Konrad Zuse (1910-1995)

    4/79

    Zuse Z3 & Z4

    Z1 / 1938, Z3 / 1941:
    First freely programmable machines in the world

    Z3 and its successor Z4 can be seen at Deutsches Museum!

    5/79

    Electronic Computers

    First generation
    No mechanical components any more
    Vacuum tubes

    Basic principle: triode
    Current flow within a diode is controlled by a grid
    On / Off

    1946: ENIAC (Electronic Numerical Integrator And Computer)

    6/79

    ENIAC (1946)

    7/79

    Organization

    Question: How to structure / organize computational machines? How to control and steer execution?

    Original work (1946) by Burks, Goldstine, von Neumann:
    Preliminary discussion of the logical design of an electronic computing instrument.

    Result: von Neumann architecture
    Most dominant architecture even today!

    8/79

    The IAS machine

    Developed 1952 by John von Neumann

    First machine based on his design principle
    Institute for Advanced Study (IAS) computer

    9/79

    Technology Development

    Vacuum tubes replaced by transistors
    Smaller, more power efficient
    DEC PDP-1, IBM 7094
    Still large machines

    Next step: integrated circuits
    Many transistors packed on one die
    High density & reliability, low power
    IBM 360 family & first Intel chips
    Many subsequent improvements

    10/79

    1971: 1st Microprocessor Intel 4004

    ~2300 transistors, 108 kHz, 10000 nm

    11/79

    Intel 4004 First Microprocessor

    12/79

    Pentium 4 (55 Million Transistors)

    13/79

    28.09.08

    Intel Montecito

    1.7 billion transistors, Intel's 1st dual core, 90 nm

    14/79

    Dual Core 2 (Woodcrest)

    290 million transistors, 2.4-3 GHz, 65 nm

    15/79

    Core i7 (Nehalem)

    731 million transistors, 45 nm

    16/79

    AMD Shanghai

    705 million transistors, 45 nm

    17/79

    Intel Larrabee

    ... transistors, 45 nm

    18/79

    And the Future ... ?

    Evolution

    Dual core
    Symmetric multithreading

    Multi-core array
    CMP with ~10 cores

    Many-core array
    CMP with 10s-100s low power cores
    Scalar cores
    Capable of TFLOPS+
    Full System-on-Chip
    Servers, workstations, embedded

    Large, scalar cores for high single-thread performance

    Scalar plus many core for highly threaded workloads

    20/79

    From Single- to Multi-Core

    Netburst: >30 pipeline stages

    No longer feasible...

    2005: Move to dual core (and fewer pipeline stages)

    2, 4, 6, 8, ... cores

    But: The free lunch is over!

    The good news is: This is good for parallel programmers.

    21/79

    Impact

    What does multi-core mean in particular?

    Is it just an SMP system, i.e. programmable with OpenMP, Pthreads, etc.?

    Or does it differ from SMP Systems?

    How do multi-core systems fit into clusters?

    22/79

    Just an SMP system?

    Partly, but those issues will be covered by my colleagues...

    23/79

    Is Multi-Core different?

    Yes, with regard to memory hierarchies and interconnect!

    24/79

    The Memory Wall

    Processor speed is increasing much faster than memory speed

    Microprocessors: 50-100% per year (Moore's law)
    DRAMs: 7-15% per year

    The gap is widening

    [Figure: performance over time, with CPU performance rising faster than memory access performance]
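
The widening gap follows from compound growth. A minimal sketch, assuming illustrative rates picked from the ranges quoted on the slide (60% per year for processors, 10% per year for DRAM); the function name `relative_gap` is hypothetical and not part of the talk:

```python
def relative_gap(cpu_rate: float, dram_rate: float, years: int) -> float:
    """Factor by which CPU performance outgrows DRAM performance after `years`."""
    return ((1.0 + cpu_rate) / (1.0 + dram_rate)) ** years


# Assumed example rates: 60% per year (CPU) versus 10% per year (DRAM).
for years in (1, 5, 10):
    print(years, round(relative_gap(0.60, 0.10, years), 1))
```

Because the ratio is raised to the number of years, even a moderate yearly difference grows into a large factor within a decade.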

    25/79

    Caches

    Main memory: problems with bandwidth & latency
    Memory bus located off-chip / on board
    Physical boundaries
    Result: memory too far away

    Cache: memory closer to the CPU which holds a subset of the main memory
    + Lower latency, higher bandwidth, on-chip
    - Which subset should be present?
    - Can we manage this transparently?

    28/79

    Terminology

    Accesses to memory can be a
    Cache hit: data is in the cache
    Cache miss: data has to be retrieved from memory
    Cache misses are expensive!

    Cache size: total size of the cache

    Cache line size/length:
    Caches do not store individual bytes/words
    Management overhead too high
    Unit of storage: cache lines
    A number of consecutive bytes in memory

    29/79

    Terminology

    Replacement policy:
    Which cache line to evict if new space is needed?
    Optimal: data not used in the near future
    Make prediction from the past
    Often used: least recently used (LRU)

    How are writes treated?
    Write-back caching
    Writes are stored in caches
    Data is written back in case of line eviction
    Write-through caching
    Data is written directly to main memory
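
LRU replacement and the two write policies can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of any real cache: the class `LRUCache` is hypothetical, it models a fully associative cache, addresses are whole cache lines, and a write miss allocates the line (write-allocate).

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with LRU replacement (illustrative only)."""

    def __init__(self, num_lines: int, write_back: bool = True):
        self.num_lines = num_lines
        self.write_back = write_back
        self.lines = OrderedDict()  # line address -> dirty flag, oldest first
        self.hits = self.misses = self.memory_writes = 0

    def access(self, line_addr: int, is_write: bool = False) -> None:
        if line_addr in self.lines:
            self.hits += 1
            self.lines.move_to_end(line_addr)  # now the most recently used line
        else:
            self.misses += 1
            if len(self.lines) >= self.num_lines:
                _, dirty = self.lines.popitem(last=False)  # evict the LRU line
                if dirty:
                    self.memory_writes += 1  # write back on eviction
            self.lines[line_addr] = False
        if is_write:
            if self.write_back:
                self.lines[line_addr] = True  # defer the write to memory
            else:
                self.memory_writes += 1  # write through immediately


wb = LRUCache(2, write_back=True)
wt = LRUCache(2, write_back=False)
for cache in (wb, wt):
    for addr in (0, 0, 0, 1, 2):  # three writes to line 0, then reads of 1 and 2
        cache.access(addr, is_write=(addr == 0))
print(wb.memory_writes, wt.memory_writes)
```

In this access pattern the write-back cache touches main memory once, when the dirty line is evicted, while the write-through cache writes on every store.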

    30/79

    Cache Associativity

    Caches are a collection of cache lines (CLs)
    Equally sized & much smaller than memory

    Question of mapping between CLs and memory
    Where to look for a cache hit?
    Where to put a newly loaded cache line?

    Free mapping is very costly
    Difficult lookup function for cache accesses

    Target CLs for a particular access are restricted
    Only a certain number of CLs possible
    => Associativity of a cache

    31/79

    Cache Structures

    Where is a block stored in the cache?

    [Figure: main memory with blocks numbered 0 to 31, cache with lines numbered 0 to 7]

    Main memory: blocks $B_j$, $j = 0, 1, \ldots, n-1$
    Cache: blocks (cache lines) $Z_i$, $i = 0, 1, \ldots, m-1$

    $n \gg m$, $n = 2^s$, $m = 2^r$
    Each block contains $b$ words, with $b = 2^w$

    Capacity of the cache: $m \cdot b = 2^{r+w}$ words
    Capacity of main memory: $n \cdot b = 2^{s+w}$ words

    Mapping from $\{B_j\}$ to $\{Z_i\}$

    32/79

    Cache Structures

    Direct-Mapped Cache

    Direct mapping of $n/m = 2^{s-r}$ memory blocks into one cache line:

    Mapping: $B_j \to Z_i$, where $i = j \bmod m$

    [Figure: main memory blocks B0 to B15 mapped onto cache lines Z0 to Z3, each line with a tag]
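
The direct mapping i = j mod m can be checked with a few lines of code. A minimal sketch, assuming the sizes suggested by the figure (16 memory blocks, 4 cache lines); the helper name `direct_mapped_line` is hypothetical:

```python
def direct_mapped_line(block: int, num_lines: int) -> int:
    """Cache line index i for memory block j under i = j mod m."""
    return block % num_lines


n, m = 16, 4  # assumed sizes: blocks B0..B15, cache lines Z0..Z3
mapping = {
    line: [block for block in range(n) if direct_mapped_line(block, m) == line]
    for line in range(m)
}
for line, blocks in mapping.items():
    print(f"Z{line}:", ", ".join(f"B{b}" for b in blocks))
```

Each cache line is the only possible home for n/m = 4 memory blocks, which is why two blocks that share a line evict each other even when the rest of the cache is empty.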

    33/79

    Cache Structures

    Direct-Mapped Cache
    Low hardware complexity.
    Fixed mapping block → line yields fixed replacement strategy.

    34/79

    Cache Structures

    Fully Associative Cache

    Any block in main memory can be mapped to any cache line (flexibility).

    Replacement strategy tells which line is to be overwritten when loading the cache (e.g. Least-Recently-Used).

    High hardware complexity.

    35/79

    Cache Structures

    Set Associative Cache

    Compromise between direct-mapped and fully associative cache.

    k-way set associative cache:

    k lines form one set.

    m cache lines are divided into v = m/k sets with k lines each.
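
The set associative scheme can be sketched as follows. The slide only states that the m lines are split into v = m/k sets; selecting the set as j mod v is an assumption made here (it is the common choice), and the function name `candidate_lines` is hypothetical.

```python
def candidate_lines(block: int, num_lines: int, ways: int) -> list:
    """Cache lines that may hold `block` in a k-way set associative cache."""
    num_sets = num_lines // ways  # v = m / k
    set_index = block % num_sets  # assumed set selection: j mod v
    return [set_index * ways + way for way in range(ways)]


m = 8  # assumed number of cache lines
print(candidate_lines(13, m, ways=1))  # direct mapped: exactly one line
print(candidate_lines(13, m, ways=2))  # 2-way: two candidate lines
print(candidate_lines(13, m, ways=m))  # fully associative: any line
```

With k = 1 the function reduces to the direct-mapped case, and with k = m every line is a candidate, which shows the set associative cache as the compromise between the two.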

    36/79

    Programmability

    Caches have no impact (from a logical point of view)
    Designed to be transparent

    BUT: large performance impact
    Need to use caches efficiently
    E.g. try to reuse data in caches

    HPC applications need to be tailored to caches
    Adapt to cache sizes, cache line sizes, and hierarchies
    Good understanding of architecture required
    Significant performance gains possible!

    37/79

    Example

    Parameters (taken from a typical L1 cache)
    32 KB size
    Cache line size 32 bytes
    Cache has 1024 cache lines

    Address format:

    Rest | 10 bit CL select | 5 bit CL offset
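
The address format follows directly from the parameters: 32-byte lines need 5 offset bits and 1024 lines need 10 select bits. A minimal sketch of the decomposition for a direct-mapped cache with these sizes; the function name `split_address` and the sample address are assumptions for illustration:

```python
LINE_SIZE = 32    # bytes per cache line
NUM_LINES = 1024  # cache lines, so 32 * 1024 bytes = 32 KB in total

OFFSET_BITS = LINE_SIZE.bit_length() - 1  # 5
SELECT_BITS = NUM_LINES.bit_length() - 1  # 10


def split_address(addr: int):
    """Split an address into (rest, cache line select, offset within line)."""
    offset = addr & (LINE_SIZE - 1)
    select = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    rest = addr >> (OFFSET_BITS + SELECT_BITS)
    return rest, select, offset


rest, select, offset = split_address(0x12345)  # arbitrary sample address
print(rest, select, offset)
```

The "rest" field is what a cache stores as the tag, so that it can tell which of the many memory blocks sharing a line is currently present.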

    38/79

    Example (cont.)

    Full associativity (m-way associativity)
    New cache line can be stored in any of the 1024 CLs
    Selection e.g. by Least Recently Used (LRU)

    Direct mapped (1-way associativity)
    New cache line can only be placed in a defined line

    2-way associativity
    Only use 9 bits for CL selection, i.e. for 512 sets
    Selection within a set can again be done e.g. using LRU
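
The number of selection bits for each associativity can be derived from the 1024 cache lines of the example. A minimal sketch; the helper name `set_select_bits` is hypothetical and it assumes power-of-two set counts:

```python
def set_select_bits(num_lines: int, ways: int) -> int:
    """Address bits needed to select a set in a k-way set associative cache."""
    num_sets = num_lines // ways
    return num_sets.bit_length() - 1  # log2, valid for power-of-two set counts


for ways in (1, 2, 1024):
    print(ways, 1024 // ways, set_select_bits(1024, ways))
```

Direct mapping uses all 10 bits, 2-way associativity uses 9 bits for 512 sets, and the fully associative case needs no selection bits because the whole cache is a single set.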

    40/79

    Cache Hierarchies

    Caches are layered
    Several levels of caches
    Each level works independently
    Transparency still maintained
    Currently up to 3 levels

    Higher levels
    Slower, but larger

    [Figure: CPU, L1 Cache, L2 Cache, L3 Cache, Main Memory]

    41/79

    Instruction vs. Data Caches

    L1 caches are often split
    I-Cache for instructions
    D-Cache for data

    Reduces conflicts
    Significantly different access patterns

    Allows additional optimizations
    Processor layout (CPU design)
    Make use of the special access patterns

    Example: trace caches for I-Cache
    Store longer instruction sequences/traces

    42/79

    Cache Optimization

    Why does cache architecture have an impact on performance?

    Data should be reused as much as possible!

    Locality of reference:
    Temporal locality: recently accessed data is likely to be accessed again in the near future.
    Spatial locality: data located closely together is likely to be accessed closely together in time.

    43/79

    Cache Optimization

    How can this be optimized?
    Code transformations: change the order of loop iteration executions.

    Must not change numerical results!

    Maintain data dependencies!

    44/79

    Cache Optimization

    Loop interchange:

    [Figure: two array traversals, one with stride = 8 and one with stride = 1]
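
The effect of loop interchange on the access stride can be sketched as follows. The 8 x 8 row-major array is an assumption chosen to reproduce the stride of 8; all function names are hypothetical. The sketch only computes the order of memory positions touched, it does not measure cache behavior.

```python
ROWS, COLS = 8, 8  # assumed array shape, stored row by row


def flat_index(i: int, j: int) -> int:
    """Position of element (i, j) in a row-major array."""
    return i * COLS + j


def column_first_order() -> list:
    """Outer loop over columns, inner loop over rows (before interchange)."""
    return [flat_index(i, j) for j in range(COLS) for i in range(ROWS)]


def row_first_order() -> list:
    """Outer loop over rows, inner loop over columns (after interchange)."""
    return [flat_index(i, j) for i in range(ROWS) for j in range(COLS)]


def inner_stride(order: list) -> int:
    """Distance in memory between two consecutive inner-loop accesses."""
    return order[1] - order[0]


data = list(range(ROWS * COLS))
before = sum(data[k] for k in column_first_order())
after = sum(data[k] for k in row_first_order())
print(inner_stride(column_first_order()), inner_stride(row_first_order()), before == after)
```

Both loop orders visit the same elements and give the same sum here, but only the interchanged version walks through memory with stride 1 and therefore uses every byte of each loaded cache line.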

    45/79

    Other Techniques

    Prefetching
    Try to preload data that will potentially be used
    Pro: data can be requested ahead of time
    Con: may waste bandwidth on loads that are never used

    Controlled by hardware
    Speculative loads

    Controlled by programmer / compiler
    Insert prefetching statements into the code
    Traditionally disturbed the pipeline!
    Can be used with multi-core processors with shared cache!

    46/79

    Early CMPs

    Intel Montecito

    Intel Pentium-D

    AMD Dual Core Opteron

    IBM Cell

    47/79

    Intel Montecito

    48/79

    Intel Pentium-D

    49/79

    Early CMPs

    IBM / Sony / Toshiba Cell Processor:
    1 Power Processor Element (PPE)
    8 Synergistic Processing Elements (SPE)
    Element Interconnect Bus (EIB), 384 GB/s
    25.6 GB/s memory bandwidth
    50-80 Watts energy consumption

    50/79

    Cell Processor

    51/79

    SUN UltraSparc T1

    Eight cores, connected via crossbar
    134 GB/s

    Each core can process 4 threads

    25.6 GB/s memory bandwidth

    70 Watts energy consumption => about 2 Watts/thread (8 cores x 4 threads = 32 threads)

    52/79

    Trends through Multi-Core

    Computers move into chip!

    New memory hierarchies ==> Caches!

    New interconnect topologies.

    Three levels of parallelism:
    On-chip
    On-board
    Cluster

    54/79

    Contemporary Multicore Chips

    Intel Clovertown/Penryn:
    4 Cores
    Split L1 Cache
    Partly Shared L2 Cache!
    FSB

    55/79

    Contemporary Multicore Chips

    AMD Barcelona:
    4 Cores
    Split L1/L2 Cache
    Shared L3 Cache!
    On Chip Crossbar

    56/79

    Contemporary Multicore Chips

    SUN Niagara 2

    8 Cores

    4 Threads / Core

    32 Threads

    On Chip Crossbar

    IBM Power 5 / Power 6

    57/79

    Upcoming Archs: Dunnington

    58/79

    Core i7 (Nehalem)

    731 million transistors, 45 nm

    59/79

    Nehalem: Intel's Next Generation

    60/79

    AMD Shanghai

    705 million transistors, 45 nm

    61/79

    Larrabee: Intel's Many-Core Architecture

    62/79

    Larrabee: Intel's Many-Core Architecture

    Plenty of x86 in-order cores plus standard 64-bit extensions

    16-wide SIMD unit per core

    Fully coherent L1 (32 KB) / L2 (256 KB) caches

    Bidirectional ring bus

    Short in order pipeline

    4-way SMT

    63/79

    Larrabee vs. Core

    64/79

    Larrabee: Intel's Many-Core Architecture

    Shared memory programming model:
    Pthreads
    OpenMP
    Promises to conform to the standards

    C / FORTRAN compiler

    Key advantage: x86 binary compatibility!

    65/79

    autopin: A Tool for Automatic Optimization of Pinning Processes in Multicore Architectures

    67/79

    ... can lead to non-deterministic runtimes...

    69/79

    The autopin Approach

    User-level tool

    Start multi-threaded application under autopin control

    User can specify pinnings of interest

    Pin threads to cores

    Assess performance of chosen pinning using performance counters

    Try alternative pinnings until optimal pinning is found
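
The search loop described on this slide can be sketched in a few lines. This is not autopin's implementation: the names `pin_current_process`, `best_pinning` and `measure` are hypothetical, the sketch pins a whole process to a set of cores rather than individual threads, and the scores in the usage example are made-up numbers standing in for performance counter readings. `os.sched_setaffinity` exists only on some platforms such as Linux, so the helper returns False elsewhere.

```python
import os


def pin_current_process(cores) -> bool:
    """Pin the calling process to `cores`; returns False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):  # not available on every platform
        return False
    try:
        os.sched_setaffinity(0, set(cores))
    except OSError:  # e.g. a core ID that does not exist on this machine
        return False
    return True


def best_pinning(pinnings, measure, pin=pin_current_process):
    """Try each candidate pinning, score it with `measure`, return the best."""
    scores = {}
    for pinning in pinnings:
        pin(pinning)                        # apply the pinning
        scores[pinning] = measure(pinning)  # assess it (higher is better)
    return max(scores, key=scores.get), scores


# Usage with made-up scores and a no-op pin function, so nothing is really pinned.
example_scores = {(0, 1): 1.0, (0, 2): 3.0, (0, 4): 2.0}
best, scores = best_pinning(list(example_scores), example_scores.get,
                            pin=lambda cores: True)
print(best)
```

Passing the measurement as a function keeps the search independent of how performance is assessed; a real tool would read hardware counters over a time interval at that point.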

    70/79

    Performance Counters

    Multiple event sensors:
    ALU utilization
    Branch prediction
    Cache events (L1/L2/TLB)
    Bus utilization

    Two uses:
    Read: get precise count of events in code regions => counting
    Interrupt on overflow => statistical sampling

    Well-known tools:
    OProfile
    Perfctr
    Intel VTune
    Perfmon2
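
The difference between the two uses can be illustrated with a toy simulation. This sketch does not access real hardware counters: the event stream is a made-up list of code region labels, and the names `count_events` and `sample_events` are hypothetical.

```python
from collections import Counter


def count_events(events):
    """Counting mode: exact number of events per code region."""
    return Counter(events)


def sample_events(events, overflow_threshold: int):
    """Sampling mode: note the current region each time the counter overflows."""
    samples = Counter()
    counter = 0
    for region in events:
        counter += 1
        if counter == overflow_threshold:  # stands in for the overflow interrupt
            samples[region] += 1
            counter = 0
    return samples


# Made-up event stream: 900 events in a loop region, then 100 in an init region.
events = ["loop"] * 900 + ["init"] * 100
exact = count_events(events)
sampled = sample_events(events, overflow_threshold=100)
print(exact, sampled)
```

Multiplying each sample count by the overflow threshold estimates the exact counts, at the cost of one interrupt per threshold events instead of instrumentation around every code region.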

    71/79

    autopin Strategy

    72/79

    numOfPinnings = 3; pinning = {"1984", "182B", "58BE"};

    for (i=0; i

    75/79

    Barcelona

    76/79

    Results

    [Figure: results for the benchmarks 310.wupwise, 312.swim, 314.mgrid, 316.applu, 320.equake, 324.apsi, 328.fma3d, 330.art and 332.ammp on Barcelona, Clovertown and Caneland]

    78/79

    Conclusions and Outlook 2

    Pinning is essential!

    We will see more and more GPU features in main processors!

    79/79

    Gracias!

    Thank you!