CS6461 – Computer Architecture, Spring 2015
Morris Lancaster - Instructor
Adapted from Professor Stephen Kaisler's Notes
Lecture 12: Multicore Architectures


Page 1:

CS6461 – Computer Architecture, Spring 2015

Morris Lancaster - Instructor
Adapted from Professor Stephen Kaisler's Notes

Lecture 12

Multicore Architectures

Page 2:

Moore’s Law

Page 3:

Moore’s Law: 2005

Page 4:

Single Thread Performance is Falling Off

Source: SPECint published data

Page 5:

Multiprocessors

• We moved from one processor in a system to multiple processors in a system

• Speedup: near-linear until interprocessor or remote-memory communication overwhelms the performance increase
• Single-core performance is reaching its limit
• So, just as multiple processors improved performance, look for another performance boost from multiple cores

Page 6:

So, What’s the Story …?

• Functional units
  – Superscalar is known territory
  – Diminishing returns from adding more functional blocks
  – Alternatives like VLIW have been considered and rejected by the market
  – Single-threaded architectural performance is pegged
• Data paths
  – Increasing bandwidth between functional units in a core makes a difference
    • Such as a comprehensive 64-bit design, but then where to?
    • Is 128 bits really needed in a processor?
      – Do we know how to use it?

Page 7:

And, the Story ….?

• Pipeline
  – A deeper pipeline buys more instructions in flight, at the expense of increased cache miss penalty and lower instructions per clock
  – A shallow pipeline gives better instructions per clock, at the expense of how far the number of instructions in flight can scale
  – Industry is converging on a middle ground: 9 to 11 stages
    • Successful RISC CPUs are in the same range
• Cache
  – Cache size buys performance at the expense of die size; it's a direct hit to manufacturing cost
  – Deep-pipeline cache miss penalties are reduced by larger caches
  – Not always the best match for shallow-pipeline cores, as their cache miss penalties are not as steep

Page 8:

Manufacturing

• Moore's Law isn't dead: more transistors for everyone!
  – But it doesn't really mention scaling transistor power
  – Transistors are not free!
  – More functional units, deeper pipelines, and larger caches mean more transistors ===> real estate problems!
• Chemistry and physics at the nano-scale
  – Stretching materials science
  – Voltage doesn't scale yet
  – Transistor leakage current is increasing
• As manufacturing economies and frequency increase, power consumption is increasing disproportionately

There are no process or architectural quick fixes

Page 9:

Multicore Processor

• Definition: A multicore processor is a chip with multiple processors (cores). What counts as a "core" is not well defined, so it varies across implementations.

• For example, the Cell has a PowerPC core and 8 synergistic processing elements (SPEs); all of these are counted as cores, although the SPEs have some limits on functionality.

Page 10:

Why Multicore?

• Can't make a single core faster (physics and noise are problems)
• Under Moore's Law, the same core is 2X smaller per generation
  – Need to keep adding value to maintain average selling price
  – More and more cache doesn't cut it
  – More transistors per generation
• Use all those transistors to put multiple processors (cores) on a chip
  – 2X cores per generation
  – Cores can potentially be optimized for power
• But harder to program, except for independent tasks
  – How many independent tasks are there to run at once?

Page 11:

Core Design Parameters - ISA

Legacy – Pro: compiler and software support are well understood. Con: may be inefficient for certain apps requiring higher performance to achieve end-to-end performance objectives.

Custom – Pro: can be optimized for targeted applications. Con: compiler and software support may be nonexistent.

RISC – Pro: easy microarchitecture design; easy compiler design. Con: code size may be large and inefficient for certain types of apps.

CISC – Pro: more instructions may allow for better optimization and smaller code size. Con: complex microarchitecture design to support all instructions; complex compiler design.

Special instructions – Pro: highly optimized code for targeted apps; instructions tailored to app requirements. Con: complex design; often requires hand coding, as there is no compiler support.

Ref: Blake, G., R. G. Dreslinski, and T. Mudge. 2009. "A Survey of Multicore Processors," IEEE Signal Processing Magazine 26(6).

Page 12:

Core Design Parameters - Microarchitecture

In-order – Pro: low to medium complexity; low power; low area, so many can be placed on a die. Con: low to medium single-thread performance.

Out-of-order – Pro: very fast single-thread performance due to dynamic scheduling of instructions. Con: high design complexity; large area; high power.

SIMD – Pro: very efficient for highly data-parallel or vector code. Con: underutilized if code cannot be parallelized; not applicable for control-dominated apps.

VLIW – Pro: may issue many more instructions than out-of-order due to reduced complexity. Con: requires advanced compiler support; may perform poorly if the compiler cannot statically find ILP.

Ref: Blake, G., R. G. Dreslinski, and T. Mudge. 2009. "A Survey of Multicore Processors," IEEE Signal Processing Magazine 26(6).

Page 13:

Memory System Design Parameters – On-Die

Caches – Pro: transparently provide the appearance of low-latency access to main memory; can be configured into multiple levels. Con: no real-time performance guarantee; must use die area to store tags.

Local store – Pro: stores more data per die area than caches; can provide a real-time performance guarantee. Con: must be software controlled (with performance implications).

Ref: Blake, G., R. G. Dreslinski, and T. Mudge. 2009. "A Survey of Multicore Processors," IEEE Signal Processing Magazine 26(6).

Page 14:

Memory System Design Parameters – Coherence

Coherent (yes) – Pro: provides a shared-memory multiprocessor; supports all programming models. Con: hard to implement.

Not coherent (no) – Pro: easy to implement. Con: supports a limited number of programming models.

Ref: Blake, G., R. G. Dreslinski, and T. Mudge. 2009. "A Survey of Multicore Processors," IEEE Signal Processing Magazine 26(6).

Page 15:

Memory System Design Parameters – Interconnect

Bus – Pro: easy to implement; all processors see uniform latencies to other processors and memories. Con: low bisection bandwidth; supports only a small number of cores.

Ring – Pro: higher bisection bandwidth than a bus; supports a larger number of processors. Con: non-uniform access latencies with high variance; requires routing logic.

Network-on-chip – Pro: high bisection bandwidth; supports a large number of cores; non-uniform latencies with lower variance than a ring. Con: requires sophisticated routing and arbitration logic.

Crossbar – Pro: highest bisection bandwidth; supports a large number of cores; uniform access latencies. Con: requires sophisticated arbitration logic; requires a large die area.

Ref: Blake, G., R. G. Dreslinski, and T. Mudge. 2009. "A Survey of Multicore Processors," IEEE Signal Processing Magazine 26(6).

Page 16:

Multicore: Where Processor and System Collide

• Scales performance
  – Dedicated resources for multiple simultaneous threads
  – Multiple cores will contend for memory and I/O bandwidth
    • The Northbridge is the bottleneck: it connects the cores and caches
    • Integrating the Northbridge into the chip eliminates much of the bottleneck
    • Northbridge architecture has a significant impact on performance
    • Cores, cache, and Northbridge must be balanced for optimal performance
  – Most application software doesn't need to do anything to benefit from multicore
  – Be aware that, for a processor within a given power envelope:
    • Fewer cores will clock faster than more cores
      – Better for single-threaded, performance-sensitive applications
    • More cores will out-perform fewer cores for:
      – Multi-threaded applications
      – Multi-tasking response times
      – Transaction processing

Page 17:

Basic Idea: Multicore Architectures

• Replicate multiple processor cores on a single die. The cores fit on a single processor chip utilizing one socket

Page 18:

Basic Idea: Cores Run in Parallel

[Diagram: four cores (core 1 through core 4), each running several threads]

Page 19:

Simultaneous Multithreading (SMT)

• Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core

• Weaving together multiple “threads” on the same core

• Example: if one thread is waiting for a floating point operation to complete, another thread can use the integer units
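A minimal sketch of that example in C with POSIX threads (my illustration, not the slides'): one thread keeps the floating-point units busy while the other keeps the integer units busy, the complementary pairing SMT can overlap on a single core. fp_work and int_work are hypothetical workloads.

    /* Sketch: two threads whose work maps to different functional units. */
    #include <pthread.h>
    #include <stdio.h>

    static void *fp_work(void *arg) {          /* exercises the FP units */
        double x = 1.0;
        (void)arg;
        for (long i = 0; i < 50000000L; i++) x = x * 1.0000001 + 0.5;
        printf("fp result: %f\n", x);
        return NULL;
    }

    static void *int_work(void *arg) {         /* exercises the integer units */
        unsigned long h = 0;
        (void)arg;
        for (long i = 0; i < 50000000L; i++) h = h * 31 + (unsigned long)i;
        printf("int hash: %lu\n", h);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, fp_work, NULL);
        pthread_create(&t2, NULL, int_work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Compile with something like cc -O2 smt_demo.c -lpthread (toolchain details are an assumption); on an SMT core the two loops can make progress in the same cycles.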

Page 20:

[Diagram: a single core pipeline – BTB and I-TLB, decoder, trace cache, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, uCode ROM, BTB, L2 cache and control, bus – occupied by Thread 1 (floating point)]

Without SMT, only a single thread can run at any given time

Page 21:

Without SMT, only a single thread can run at any given time

[Diagram: the same single-core pipeline, now occupied by Thread 2 (integer operation)]

Page 22:

SMT processor: both threads can run concurrently

[Diagram: the same single-core pipeline with Thread 1 (floating point) and Thread 2 (integer operation) active at the same time]

Page 23:

But: Can’t simultaneously use the same functional unit

[Diagram: Thread 1 and Thread 2 both contending for the same functional unit – marked IMPOSSIBLE. This scenario is impossible with SMT on a single core (assuming a single integer unit)]

Page 24:

SMT not a “true” parallel processor

• Enables better threading (e.g., up to 30%)
• The OS and applications perceive each simultaneous thread as a separate "virtual processor"
• The chip has only a single copy of each resource
• Compare to multi-core: each core has its own copy of resources

Page 25:

Multi-core: threads can run on separate cores

[Diagram: two complete core pipelines side by side, with Thread 1 running on one core and Thread 2 on the other]

Page 26:

Multi-core: threads can run on separate cores

[Diagram: the same two core pipelines side by side, with Thread 3 on one core and Thread 4 on the other]

Page 27:

Combining Multi-core and SMT

• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT: our fish machines
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
• Intel calls them "hyper-threads"
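Software can see these combinations directly; a small sketch assuming a POSIX-style system (the count is whatever your machine exposes): the OS reports one logical processor per hardware thread, so a dual-core chip with 2-way SMT shows up as 4.

    /* Sketch: ask the OS how many logical processors it sees. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        long n = sysconf(_SC_NPROCESSORS_ONLN);  /* logical CPUs online now */
        printf("OS sees %ld logical processors\n", n);
        return 0;
    }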

Page 28:

SMT Dual-core: all four threads can run concurrently

[Diagram: two SMT-enabled core pipelines, with Threads 1 and 3 on one core and Threads 2 and 4 on the other, all four running concurrently]

Page 29:

Multicores: Relative Speedup

[Figure: relative speedup versus number of cores]

Source: William Stallings, Computer Organization and Architecture, 8th Edition
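The shape of these curves is what Amdahl's law predicts. As a hedged reconstruction (my notation, not the labels from Stallings' figure), with f the parallelizable fraction of the work and N cores:

    \[
      \mathrm{Speedup}(N) \;=\; \frac{1}{(1-f) + f/N}
    \]

The next slide's curves correspond to adding a per-core overhead term k(N) for scheduling and communication, which makes speedup peak and then fall as cores are added:

    \[
      \mathrm{Speedup}(N) \;=\; \frac{1}{(1-f) + f/N + k(N)}
    \]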

Page 30:

Multicores: Speedup w/ Overhead

[Figure: speedup versus number of cores when overhead is included]

Source: William Stallings, Computer Organization and Architecture, 8th Edition

Page 31:

Multicore Connectivity

[Diagrams: three multicore interconnect topologies. Ring Multicore: processors (p) with caches (c) attached to switches (s) around a ring. Bus Multicore: processors with caches on a shared BUS. Mesh Multicore: a grid of processor/cache/switch tiles.]

We have seen similar topologies before for multiprocessor systems

Page 32:

Multicore Architectures

[Diagram: four cores, each with a private L1; pairs of cores share an L2; two L3 slices connect through a crossbar to Memory Module 1, Memory Module 2, and I/O]

Homogeneous with Shared Caches and a Crossbar

Page 33:

Multicore Architectures

[Diagram: one core (2x SMT) with L1 and L2 caches, plus four cores each with its own local store, connected by a ring bus to a memory module and I/O]

Heterogeneous with caches, local store and ring bus

Page 34:

Multicore Architecture: Alternatives

Page 35:

IBM Cell Processor

• Joint collaboration of IBM/Sony/Toshiba
• Develop a new/next-gen processor
  – Initially for the PlayStation 3
  – Other multimedia applications (Blu-ray, HDTV)
  – Server systems
• Cell designed for vector computations
  – Vector arithmetic faster than scalar arithmetic
• Designed for fast SIMD processing
• PowerPC Processing Element (PPE)
  – PPE register file: 32 x 128-bit vectors
  – PPE: dual-issue in-order processor
  – In-order & out-of-order computation (load instructions)
  – 1 x PPE 64-bit PowerPC (L1: 32 KB I$ + 32 KB D$; L2: 512 KB)
• PPE design goals
  – Maximize performance/power ratio
  – Maximize performance/area ratio
• PPE main tasks
  – Run the OS (Linux)
  – Coordinate with the SPEs

Page 36:

IBM Cell Processor

• Synergistic Processing Element (SPE):
  – An SPE is a self-contained vector processor (SIMD) that acts as a co-processor
  – The SPE's ISA is a cross between VMX and the PS2's Emotion Engine
  – SPE register file: 128 x 128-bit vectors
  – In-order (again, to minimize circuitry and save power)
  – Statically scheduled (the compiler plays a big role)
  – No dynamic prediction hardware either (relies on compiler-generated hints)
  – 8 x SPE cores (LS: 256 KB; SIMD machines)
  – Both the PPE and the SPEs have vector instruction capability
• PPE & SPEs @ 3.2 GHz
• External Rambus XDR memory
  – Two channels @ 3.2 GHz (400 MHz, octal data rate)
• I/O controller @ 5 GHz

Page 37:

IBM Cell Processor

Element Interconnect Bus:
• Connects the various on-chip elements
• Data-ring structure with the control of a bus
• 4 unidirectional rings, with 2 rings running counter to the other 2
• Worst-case maximum latency is only half the distance around the ring
• Each ring is 16 bytes wide and runs at half the core clock frequency (core clock freq ~3.2 GHz)

Page 38:

IBM Cell Processor: Chip Photo

Page 39:

IBM Cell Processor: PPE Architecture

Page 40:

IBM Cell Processor: SPE Architecture

Page 41:

IBM Cell Processor: Programming

• Instructions for the 8 SPEs are written in a different language than for the PowerPC core
• Separate compiler for the SPE (see the PPE-side sketch below)
  – Embed the SPE executable into a library
  – 'extern spe_program_handle_t <program_name>'
  – Compile the main PPU program with the library
• Thread-based model, push/pull data
  – Thread scheduling by the user
  – Five layers of parallelism:
    • Task parallelism (MPMD)
    • Data parallelism (SPMD)
    • Data-streaming parallelism (DMA double buffering)
    • Vector parallelism (SIMD – up to 16-way)
    • Pipeline parallelism (dual-pipelined SPEs)
• Need to think in terms of the SIMD nature of the dataflow to get maximum performance from the SPUs
• The SPU local store needs coherent DMA access for reaching system memory
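A hedged PPE-side sketch of that thread model, using the libspe2 calls as I recall them (spe_context_create, spe_program_load, spe_context_run, spe_context_destroy); verify the exact signatures against the SDK headers. my_spe_prog is a hypothetical name for the embedded SPE handle declared via 'extern spe_program_handle_t' above.

    /* Sketch: PPE side loads and runs an embedded SPE program. */
    #include <libspe2.h>
    #include <stdio.h>

    extern spe_program_handle_t my_spe_prog;   /* embedded SPE executable */

    int main(void) {
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        if (!ctx) { perror("spe_context_create"); return 1; }
        if (spe_program_load(ctx, &my_spe_prog)) {
            perror("spe_program_load");
            return 1;
        }
        /* Runs the SPE program to completion; argp/envp could carry
         * pointers the SPE then pulls in over DMA. */
        if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0)
            perror("spe_context_run");
        spe_context_destroy(ctx);
        return 0;
    }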

Page 42:

IBM Cell Processor: Programming

Page 43:

IBM Cell Processor: Programming

• Manually partition the application into separate code segments and use the compiler that targets the appropriate ISA
• For the SPUs, SIMD code generation can be done by a parallelizing compiler with auto-SIMDization
• Allocate SPE program data in system memory (shared-memory view) and have the SPE compiler automatically manage the movement of data
  – A naive compiler inserts an explicit DMA transfer for each access to shared memory
  – Optimized: employ a software-cache mechanism that permits reuse of the temporary buffers in the LS
• Using the SPE linker and an embedding tool
  – Generate a PPE executable that contains the SPE binary embedded within its data section
• The PPE object is then linked, using a PPE linker, with the runtime libraries required for thread creation and management, to create a bound executable for the Cell BE program

Page 44:

AMD Athlon Barcelona

Page 45:

Basic Idea: Programming Multiple Cores

• Programmer:
  – Programmers must use threads or processes
  – Write parallel algorithms
• Parallel programming is harder than normal programming because it involves:
  – Additional techniques
  – Problem partitioning
  – Synchronization
  – Access control
  – …
• 90% of programmers don't do parallel programming.
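A minimal sketch of problem partitioning with POSIX threads (my example, not the slides'): split an array sum into disjoint slices, one per thread, so the only coordination needed is the final join and combine.

    /* Sketch: partition a sum across threads; slices are disjoint, so
     * the workers need no locking. */
    #include <pthread.h>
    #include <stdio.h>

    #define N_THREADS 4
    #define N 1000000

    static double data[N];

    struct part { int lo, hi; double sum; };

    static void *worker(void *arg) {
        struct part *p = arg;              /* each thread owns one slice */
        for (int i = p->lo; i < p->hi; i++) p->sum += data[i];
        return NULL;
    }

    int main(void) {
        pthread_t tid[N_THREADS];
        struct part parts[N_THREADS];
        for (int i = 0; i < N; i++) data[i] = 1.0;
        for (int t = 0; t < N_THREADS; t++) {
            parts[t] = (struct part){ t * (N / N_THREADS),
                                      (t + 1) * (N / N_THREADS), 0.0 };
            pthread_create(&tid[t], NULL, worker, &parts[t]);
        }
        double total = 0.0;                /* combine after all joins */
        for (int t = 0; t < N_THREADS; t++) {
            pthread_join(tid[t], NULL);
            total += parts[t].sum;
        }
        printf("sum = %f\n", total);
        return 0;
    }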

Page 46:

Basic Idea: Programming Multiple Cores

• Operating System Interaction:
  – Most major OSes support multi-core today
  – The OS perceives each core as a separate processor
  – The OS scheduler maps threads/processes to the different cores
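For illustration, a thread can also request a particular core explicitly; this sketch assumes Linux/glibc's non-portable affinity API (pthread_setaffinity_np, sched_getcpu). Normally you leave this mapping to the OS scheduler, as the slide says.

    /* Sketch: pin the calling thread to core 0 (Linux/glibc only). */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                  /* ask to run only on CPU 0 */
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
            perror("pthread_setaffinity_np");
        printf("now running on CPU %d\n", sched_getcpu());
        return 0;
    }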

Page 47:

Multicore Programming: Shared Memory

• The Shared Memory Model: cores share a single memory
• Typically written using OpenMP (http://openmp.org/wp/)
• Software constructs that allow individual processes to physically share certain portions of the same address space
  – Directives to compilers: FORTRAN, C/C++
• Seems intuitive (the physical memory chips are shared by the cores)
  – Core virtualization?
• Pros
  – Easy to write
  – Communication coordination between processes is built in
  – Allows support of both sequential and parallel processes
  – Easily scalable, to a certain point
• Cons
  – Not very general; geared toward loop-level parallelism
  – Does not support asynchronous events very well
  – Not easily scalable to distributed systems
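A minimal OpenMP sketch of that loop-level parallelism (my example): a single compiler directive splits the loop across the cores and reduces the partial sums; the gcc -fopenmp flag is an assumption about your toolchain.

    /* Sketch: shared-memory loop parallelism via an OpenMP directive. */
    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        enum { N = 1000000 };
        static double a[N];
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)  /* directive, not API calls */
        for (int i = 0; i < N; i++) {
            a[i] = i * 0.5;
            sum += a[i];
        }
        printf("max threads: %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }

Build with, e.g., gcc -fopenmp omp_demo.c; without the flag the directive is ignored and the loop simply runs sequentially, which is part of the model's appeal.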

Page 48:

Multicore Programming: Message Passing

• Often written with the Message Passing Interface (MPI)
  – An API specification that allows computers to communicate with one another
• Allows communication between processes (threads) using specific message-passing system calls
• All shared data is communicated through messages
• Physical memory is not necessarily shared
• Pros
  – Allows for asynchronous events
  – Does not require the programmer to write in terms of loop-level parallelism
  – Operates on multicores AND is scalable to distributed systems
  – A more general model of programming…extremely flexible
• Cons
  – Considered extremely difficult to write
  – Difficult to incrementally increase parallelism
  – Implicitly shared data (in MPI-2.0)

Page 49:

Multicore Programming: Transaction Model

• Instructions are grouped into sets of transactions
• All-or-nothing model of execution and completion…atomicity
• Suitable for certain types of applications (ATMs, bank processing, database applications)
• Scalability!
• Pros
  – Scalable to large distributed systems
  – Applicable to a wide range of consumer-oriented applications
  – Does not necessarily imply a message-passing or shared-memory interface
  – Applicable to many hardware models (assuming support for atomicity)
• Cons
  – Not obviously amenable to all problems
  – Difficult to reason about for many applications

Page 50:

Old Approaches Fall Short

• Pthreads
  – An Intel webinar likens it to the assembly language of parallel programming
  – Data races are hard to analyze
  – No encapsulation or modularity
  – But evolutionary, and OK in the interim
• DMA with external shared memory
  – DSP programmers favor DMA
  – Explicit copying from global shared memory to local store
  – Wastes pin bandwidth and energy
  – But evolutionary, simple, modular, and small core memory footprint
• MPI
  – Province of HPC users
  – Based on sending explicit messages between private memories
  – High overheads and large core memory footprint

Page 51:

Multicore Programming: What’s Best Model?

• The Billion $$$ Question…
• No great general model…(yet)
• Hardware and software issues
• "The vast majority of programmers today don't grok concurrency, just as the vast majority of programmers 15 years ago didn't yet grok objects." (1)
• Economic & cultural forces

(1) "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software." Herb Sutter. Dr. Dobb's Journal, March 2005. (http://www.gotw.ca/publications/concurrency-ddj.htm)

"grok" = to understand deeply (Stranger in a Strange Land, Robert Heinlein)

Page 52:

Shared L2 Cache: Advantages

• Constructive interference reduces the overall miss rate
• Data shared by multiple cores is not replicated at this cache level
• With proper frame replacement algorithms, the amount of shared cache dedicated to each core is dynamic
  – Threads with less locality can have more cache
• Easy inter-process communication through shared memory
• Cache coherency is confined to L1
• A dedicated L2 cache, by contrast, gives each core more rapid access
  – Good for threads with strong locality
• A shared L3 cache may also improve performance

Page 53:

Multicore Challenges

• Relies on effective exploitation of multiple-thread parallelism
  – Need for a parallel computing model and a parallel programming model
• Aggravates the memory wall
  – Memory bandwidth
    • Way to get data out of the memory banks
    • Way to get data into the multi-core processor array
  – Memory latency
  – Fragments the L3 cache
• Pins become a strangle point
  – Rate of pin growth projected to slow and flatten
  – Rate of bandwidth per pin (pair) projected to grow slowly
• Requires mechanisms for efficient inter-processor coordination (see the sketch below)
  – Synchronization
  – Mutual exclusion
  – Context switching
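A minimal sketch of the mutual-exclusion mechanism from that list, using POSIX threads (my example): a mutex serializes updates to a shared counter that threads on different cores would otherwise race on.

    /* Sketch: mutual exclusion around a shared counter. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *bump(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);     /* enter critical section */
            counter++;                     /* safe: one thread at a time */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, bump, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld (expect 400000)\n", counter);
        return 0;
    }

Without the lock, increments from different cores can interleave and lose updates; the lock is exactly the coordination overhead the slide is pointing at.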

Page 54:

Multicore: Advantages

• Cache coherency circuitry can operate at a much higher clock rate than is possible if the signals have to travel off-chip.

• Signals between different CPUs travel shorter distances, so those signals degrade less.

• These higher quality signals allow more data to be sent in a given time period since individual signals can be shorter and do not need to be repeated as often.

• A dual-core processor uses slightly less power than two coupled single-core processors.

Page 55:

Multicore: Disadvantages

• Ability of multi-core processors to increase application performance depends on the use of multiple threads within applications.

• Most current video games will run faster on a 3 GHz single-core processor than on a 2 GHz dual-core processor (of the same core architecture).

• Two processing cores sharing the same system bus and memory bandwidth limits the real-world performance advantage.

• If a single core is close to being memory bandwidth-limited, going to dual-core might only give 30% to 70% improvement.

• If memory bandwidth is not a problem, a 90% improvement can be expected.

Page 56:

Multicore Issues

• How many general-purpose cores is enough?
  – Conjecture: probably no more than 16, based on experience with multiprocessor systems
• Should future systems have homogeneous or heterogeneous cores (e.g., the Cell)?
• What is the best way to connect the cores on a chip?
• Are threads or processes better for programming multicore processors?
• Will software vendors charge a separate license per core, or only a single license per chip?

Page 57:

Page 58:

Additional Material

Page 59:

AMD Opteron Processor

[Block diagram: AMD Opteron processor core. Fetch with branch prediction feeds the 64 KB L1 I-cache; scan/align/decode with a fastpath and a microcode engine produces µops for a 72-entry instruction control unit. Integer decode & rename feeds three reservation stations, each pairing an ALU with an AGU (one ALU also has a MULT). FP decode & rename feeds a 36-entry FP scheduler with FADD, FMUL, and FMISC units. A 44-entry load/store queue connects to the 64 KB L1 D-cache.]

AMD Opteron processor core architecture

AGU = Address Generation Unit; ALU = Arithmetic/Logical Unit; RES = Reservation Station

Page 60:

AMD Athlon 64-bit Lines

http://www.amd.com/gb-uk/Processors/ProductInformation/0,,30_118_9485_13041^13043,00.html

Page 61:

Evolution of AMD Athlon 64-bit Processors

San Diego (L2: 1 MB) → Toledo (L2: 2 x 1 MB) → Windsor (L2: 2 x 1 MB) → Barcelona (L2: 2 MB)

Page 62:

Single-Chip Cloud Computer (SCC)

Page 63:

Inside the SCC

Page 64:

Tilera 64

• Cores connected by a mesh network
• Five physical mesh networks
  – UDN, IDN, SDN, TDN, MDN
  – Each has 32 channels
  – Packet-switched
  – Wormhole routed
  – Point-to-point
• TDN and MDN are used for handling memory traffic:
  – Separate networks improve concurrency by reducing bottlenecks

Page 65:

Tilera 64

• Number of tiles = 64
• On-chip distributed cache = 5 MB
• Operations at 32, 16, 8 bits = 144, 192, 384 BOPS
• On-chip interconnect bandwidth = 32 Tbps
• I/O bandwidth = 40 Gbps
• Memory bandwidth = 200 Gbps
• 3-way, 64-bit VLIW CPU

Page 66:

Tilera 64

• Memory requests transit via the TDN
  – Large store requests, small load requests
• Memory responses transit via the MDN
  – Large load responses, small store responses
  – Includes cache-to-cache transfers and off-chip transfers
• Directory-based cache coherence
  – Directory cache at every node
  – Off-chip directory controller
• Tile-to-tile requests and responses transit the TDN
• Off-chip memory requests and responses transit the MDN

Page 67:

Itanium Dual-Core

Page 68:

Intel Core Duo

• 2 mobile-optimized execution cores
  – No multi-threading
• Cache hierarchy
  – Private 32-KB L1I and L1D caches
  – Shared 2-MB L2 cache
  – Provides efficient data sharing between the two cores
• Power reduction
  – Some states entered individually by each processor
  – Deeper and Enhanced Deeper Sleep states apply only to the whole die
  – Dynamic Cache Sizing feature
    • Flushes the entire cache
    • This enables Enhanced Deeper Sleep at a lower voltage that does not guarantee cache integrity
• 151 million transistors

Page 69:

ARM11 MPCore

• Up to 4 processors, each with its own L1 instruction and data caches
• Distributed interrupt controller
• Timer per CPU
• Watchdog
  – Warning alerts for software failures
  – Counts down from predetermined values
  – Issues a warning at zero
• CPU interface
  – Interrupt acknowledgement, masking, and completion acknowledgement
• CPU
  – A single ARM11 core, called the MP11
• Vector floating-point unit
  – FP co-processor
• L1 cache
• Snoop control unit
  – L1 cache coherency

Page 70:

ARM11 MPCore

Page 71:

ARM11 MPCore Interrupt Handling

• The Distributed Interrupt Controller (DIC) collates interrupts from many sources and provides:
  – Masking
  – Prioritization
  – Distribution to the target MP11 CPUs
  – Status tracking
  – Software interrupt generation
• The number of interrupts is independent of the MP11 CPU design
• Memory mapped
• Accessed by the CPUs via a private interface through the SCU
• Can route interrupts to a single CPU or multiple CPUs
• Provides inter-process communication
  – A thread on one CPU can cause activity by a thread on another CPU

Page 72:

ARM11 MPCore: Cache Coherency

• The Snoop Control Unit (SCU) resolves most shared-data bottleneck issues
• L1 cache coherency is based on MESI
• Direct data intervention
  – Copying clean entries between L1 caches without accessing external memory
  – Reduces read-after-write traffic from L1 to L2
  – Can resolve a local L1 miss from a remote L1 rather than from L2
• Duplicated tag RAMs
  – Cache tags implemented as a separate block of RAM
  – Same length as the number of lines in the cache
  – Duplicates used by the SCU to check data availability before sending coherency commands
  – Commands sent only to the CPUs that must update their coherent data cache
• Migratory lines
  – Allows moving dirty data between CPUs without writing to L2 and reading back from external memory