CS6461 – Computer Architecture, Spring 2015
Morris Lancaster - Instructor
Adapted from Professor Stephen Kaisler's Notes
Lecture 12: Multicore Architectures


Page 1:

CS6461 – Computer Architecture, Spring 2015

Morris Lancaster - Instructor
Adapted from Professor Stephen Kaisler's Notes

Lecture 12

Multicore Architectures

Page 2:

Moore’s Law

Page 3:

Moore’s Law: 2005

Page 4:

Single Thread Performance is Falling Off

Source: SPECint published data

Page 5:

Multiprocessors

• We moved from one processor in a system to multiple processors in a system

• Speedup: near-linear until interprocessor or remote-memory communication overwhelms the performance increase
• Single-core performance is reaching its limit
• So, just as multiple processors improved performance, look for another performance boost from multiple cores

Page 6:

So, What’s the Story …?

• Functional units
  – Superscalar is known territory
  – Diminishing returns from adding more functional blocks
  – Alternatives like VLIW have been considered and rejected by the market
  – Single-threaded architectural performance is pegged
• Data paths
  – Increasing bandwidth between functional units in a core makes a difference
    • Such as a comprehensive 64-bit design, but then where to?
    • Is 128 bits really needed in a processor?
      – Do we know how to use it?

Page 7:

And, the Story ….?

• Pipeline
  – A deeper pipeline buys more instructions in flight, at the expense of increased cache miss penalty and lower instructions per clock
  – A shallow pipeline gives better instructions per clock, at the expense of how far the number of instructions in flight can scale
  – Industry is converging on a middle ground: 9 to 11 stages
    • Successful RISC CPUs are in the same range
• Cache
  – Cache size buys performance at the expense of die size; it's a direct hit to manufacturing cost
  – Deep-pipeline cache miss penalties are reduced by larger caches
  – Not always the best match for shallow-pipeline cores, as their cache miss penalties are not as steep

Page 8:

Manufacturing

• Moore's Law isn't dead: more transistors for everyone!
  – But it doesn't really mention scaling transistor power
  – Transistors are not free!
  – More functional units, deeper pipelines, and larger caches mean more transistors ===> real estate problems!
• Chemistry and physics at the nano-scale
  – Stretching materials science
  – Voltage doesn't scale yet
  – Transistor leakage current is increasing
• As manufacturing economies and frequency increase, power consumption is increasing disproportionately

There are no process or architectural quick fixes

Page 9:

Multicore Processor

• Definition: A multicore processor is a chip with multiple processors (cores). What counts as a "core" is not well defined, so it varies across implementations.

• For example, the Cell has a PowerPC core and 8 synergistic processing elements (SPEs); all of these are counted as cores, although the SPEs have some limits on functionality.

Page 10:

Why Multicore?

• Can't make a single core faster (physics and noise are problems)
• Under Moore's Law, the same core is 2X smaller per generation
  – Need to keep adding value to maintain average selling price
  – More and more cache doesn't cut it
  – More transistors per generation
• Use all those transistors to put multiple processors (cores) on a chip
  – 2X cores per generation
  – Cores can potentially be optimized for power
• But harder to program, except for independent tasks
  – How many independent tasks are there to run at once?

Page 11:

Core Design Parameters - ISA

Legacy – Pro: compiler and software support are well understood. Con: may be inefficient for certain apps requiring higher performance to achieve end-to-end performance objectives.

Custom – Pro: can be optimized for targeted applications. Con: compiler and software support may be nonexistent.

RISC – Pro: easy microarchitecture design; easy compiler design. Con: code size may be large and inefficient for certain types of apps.

CISC – Pro: more instructions may allow for better optimization and smaller code size. Con: complex microarchitecture design to support all instructions; complex compiler design.

Special instructions – Pro: highly optimized code for targeted apps; instructions tailored to app requirements. Con: complex design; often requires hand coding, as there is no compiler support.

Ref: Blake, G., R. G. Dreslinski, and T. Mudge. 2009. "A Survey of Multicore Processors," IEEE Signal Processing Magazine 26(6).

Page 12:

Core Design Parameters - Microarchitecture

In-order – Pro: low to medium complexity; low power; low area, so many can be placed on a die. Con: low to medium single-thread performance.

Out-of-order – Pro: very fast single-thread performance due to dynamic scheduling of instructions. Con: high design complexity; large area; high power.

SIMD – Pro: very efficient for highly data-parallel or vector code. Con: underutilized if code cannot be parallelized; not applicable for control-dominated apps.

VLIW – Pro: may issue many more instructions than out-of-order due to reduced complexity. Con: requires advanced compiler support; may perform poorly if the compiler cannot statically find ILP.

Ref: Blake, G., R. G. Dreslinski, and T. Mudge. 2009. "A Survey of Multicore Processors," IEEE Signal Processing Magazine 26(6).

Page 13:

Memory System Design Parameters – On-Die

Caches – Pro: transparently provide the appearance of low-latency access to main memory; can be configured into multiple levels. Con: no real-time performance guarantee; must use die area to store tags.

Local store – Pro: stores more data per die area than caches; can provide a real-time performance guarantee. Con: must be software controlled (with performance implications).

Ref: Blake, G., R. G. Dreslinski, and T. Mudge. 2009. "A Survey of Multicore Processors," IEEE Signal Processing Magazine 26(6).

Page 14:

Memory System Design Parameters – Coherence

Coherent (yes) – Pro: provides a shared-memory multiprocessor; supports all programming models. Con: hard to implement.

Not coherent (no) – Pro: easy to implement. Con: supports a limited number of programming models.

Ref: Blake, G., R. G. Dreslinski, and T. Mudge. 2009. "A Survey of Multicore Processors," IEEE Signal Processing Magazine 26(6).

Page 15:

Memory System Design Parameters – Interconnect

Bus – Pro: easy to implement; all processors see uniform latencies to other processors and memories. Con: low bisection bandwidth; supports only a small number of cores.

Ring – Pro: higher bisection bandwidth than a bus; supports a larger number of processors. Con: non-uniform access latencies with high variance; requires routing logic.

Network-on-chip – Pro: high bisection bandwidth; supports a large number of cores; non-uniform latencies with lower variance than a ring. Con: requires sophisticated routing and arbitration logic.

Crossbar – Pro: highest bisection bandwidth; supports a large number of cores; uniform access latencies. Con: requires sophisticated arbitration logic; requires a large die area.

Ref: Blake, G., R. G. Dreslinski, and T. Mudge. 2009. "A Survey of Multicore Processors," IEEE Signal Processing Magazine 26(6).

Page 16:

Multicore: Where Processor and System Collide

• Scales performance
  – Dedicated resources for multiple simultaneous threads
  – Multiple cores will contend for memory and I/O bandwidth
    • The Northbridge is the bottleneck: it connects the cores and caches
    • Integrating the Northbridge into the chip eliminates much of the bottleneck
    • Northbridge architecture has a significant impact on performance
    • Cores, cache, and Northbridge must be balanced for optimal performance
  – Most application software doesn't need to do anything to benefit from multicore
  – Be aware that, for a processor within a given power envelope:
    • Fewer cores will clock faster than more cores
      – Better for single-threaded, performance-sensitive applications
    • More cores will out-perform fewer cores for:
      – Multi-threaded applications
      – Multi-tasking response times
      – Transaction processing

Page 17:

Basic Idea: Multicore Architectures

• Replicate multiple processor cores on a single die. The cores fit on a single processor chip utilizing one socket

Page 18:

Basic Idea: Cores Run in Parallel

[Diagram: four cores (core 1 through core 4), each running several threads]

Page 19:

Simultaneous Multithreading (SMT)

• Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core

• Weaving together multiple “threads” on the same core

• Example: if one thread is waiting for a floating point operation to complete, another thread can use the integer units
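A minimal sketch of that example in C with POSIX threads (my illustration, not the slides'): one thread keeps the floating-point units busy while the other keeps the integer units busy, the complementary pairing SMT can overlap on a single core. fp_work and int_work are hypothetical workloads.

    /* Sketch: two threads whose work maps to different functional units. */
    #include <pthread.h>
    #include <stdio.h>

    static void *fp_work(void *arg) {          /* exercises the FP units */
        double x = 1.0;
        (void)arg;
        for (long i = 0; i < 50000000L; i++) x = x * 1.0000001 + 0.5;
        printf("fp result: %f\n", x);
        return NULL;
    }

    static void *int_work(void *arg) {         /* exercises the integer units */
        unsigned long h = 0;
        (void)arg;
        for (long i = 0; i < 50000000L; i++) h = h * 31 + (unsigned long)i;
        printf("int hash: %lu\n", h);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, fp_work, NULL);
        pthread_create(&t2, NULL, int_work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Compile with something like cc -O2 smt_demo.c -lpthread (toolchain details are an assumption); on an SMT core the two loops can make progress in the same cycles.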

Page 20:

[Diagram: a single core pipeline – BTB and I-TLB, decoder, trace cache, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, uCode ROM, BTB, L2 cache and control, bus – occupied by Thread 1 (floating point)]

Without SMT, only a single thread can run at any given time

Page 21:

Without SMT, only a single thread can run at any given time

[Diagram: the same single-core pipeline, now occupied by Thread 2 (integer operation)]

Page 22:

SMT processor: both threads can run concurrently

[Diagram: the same single-core pipeline with Thread 1 (floating point) and Thread 2 (integer operation) active at the same time]

Page 23:

But: Can’t simultaneously use the same functional unit

[Diagram: Thread 1 and Thread 2 both contending for the same functional unit – marked IMPOSSIBLE. This scenario is impossible with SMT on a single core (assuming a single integer unit)]

Page 24:

SMT not a “true” parallel processor

• Enables better threading (e.g., up to 30%)
• The OS and applications perceive each simultaneous thread as a separate "virtual processor"
• The chip has only a single copy of each resource
• Compare to multi-core: each core has its own copy of resources

Page 25:

Multi-core: threads can run on separate cores

[Diagram: two complete core pipelines side by side, with Thread 1 running on one core and Thread 2 on the other]

Page 26:

Multi-core: threads can run on separate cores

[Diagram: the same two core pipelines side by side, with Thread 3 on one core and Thread 4 on the other]

Page 27:

Combining Multi-core and SMT

• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT: our fish machines
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
• Intel calls them "hyper-threads"
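Software can see these combinations directly; a small sketch assuming a POSIX-style system (the count is whatever your machine exposes): the OS reports one logical processor per hardware thread, so a dual-core chip with 2-way SMT shows up as 4.

    /* Sketch: ask the OS how many logical processors it sees. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        long n = sysconf(_SC_NPROCESSORS_ONLN);  /* logical CPUs online now */
        printf("OS sees %ld logical processors\n", n);
        return 0;
    }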

Page 28:

SMT Dual-core: all four threads can run concurrently

[Diagram: two SMT-enabled core pipelines, with Threads 1 and 3 on one core and Threads 2 and 4 on the other, all four running concurrently]

Page 29:

Multicores: Relative Speedup

[Figure: relative speedup versus number of cores]

Source: William Stallings, Computer Organization and Architecture, 8th Edition
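The shape of these curves is what Amdahl's law predicts. As a hedged reconstruction (my notation, not the labels from Stallings' figure), with f the parallelizable fraction of the work and N cores:

    \[
      \mathrm{Speedup}(N) \;=\; \frac{1}{(1-f) + f/N}
    \]

The next slide's curves correspond to adding a per-core overhead term k(N) for scheduling and communication, which makes speedup peak and then fall as cores are added:

    \[
      \mathrm{Speedup}(N) \;=\; \frac{1}{(1-f) + f/N + k(N)}
    \]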

Page 30:

Multicores: Speedup w/ Overhead

[Figure: speedup versus number of cores when overhead is included]

Source: William Stallings, Computer Organization and Architecture, 8th Edition

Page 31:

Multicore Connectivity

[Diagrams: three multicore interconnect topologies. Ring Multicore: processors (p) with caches (c) attached to switches (s) around a ring. Bus Multicore: processors with caches on a shared BUS. Mesh Multicore: a grid of processor/cache/switch tiles.]

We have seen similar topologies before for multiprocessor systems

Page 32:

Multicore Architectures

[Diagram: four cores, each with a private L1; pairs of cores share an L2; two L3 slices connect through a crossbar to Memory Module 1, Memory Module 2, and I/O]

Homogeneous with Shared Caches and a Crossbar

Page 33:

Multicore Architectures

[Diagram: one core (2x SMT) with L1 and L2 caches, plus four cores each with its own local store, connected by a ring bus to a memory module and I/O]

Heterogeneous with caches, local store and ring bus

Page 34:

Multicore Architecture: Alternatives

Page 35:

IBM Cell Processor

• Joint collaboration of IBM/Sony/Toshiba
• Develop a new/next-gen processor
  – Initially for the PlayStation 3
  – Other multimedia applications (Blu-ray, HDTV)
  – Server systems
• Cell designed for vector computations
  – Vector arithmetic faster than scalar arithmetic
• Designed for fast SIMD processing
• PowerPC Processing Element (PPE)
  – PPE register file: 32 x 128-bit vectors
  – PPE: dual-issue in-order processor
  – In-order & out-of-order computation (load instructions)
  – 1 x PPE 64-bit PowerPC (L1: 32 KB I$ + 32 KB D$; L2: 512 KB)
• PPE design goals
  – Maximize performance/power ratio
  – Maximize performance/area ratio
• PPE main tasks
  – Run the OS (Linux)
  – Coordinate with the SPEs

Page 36:

IBM Cell Processor

• Synergistic Processing Element (SPE):
  – An SPE is a self-contained vector processor (SIMD) that acts as a co-processor
  – The SPE's ISA is a cross between VMX and the PS2's Emotion Engine
  – SPE register file: 128 x 128-bit vectors
  – In-order (again, to minimize circuitry and save power)
  – Statically scheduled (the compiler plays a big role)
  – No dynamic prediction hardware either (relies on compiler-generated hints)
  – 8 x SPE cores (LS: 256 KB; SIMD machines)
  – Both the PPE and the SPEs have vector instruction capability
• PPE & SPEs @ 3.2 GHz
• External Rambus XDR memory
  – Two channels @ 3.2 GHz (400 MHz, octal data rate)
• I/O controller @ 5 GHz

Page 37:

IBM Cell Processor

Element Interconnect Bus:
• Connects the various on-chip elements
• Data-ring structure with the control of a bus
• 4 unidirectional rings, with 2 rings running counter to the other 2
• Worst-case maximum latency is only half the distance around the ring
• Each ring is 16 bytes wide and runs at half the core clock frequency (core clock freq ~3.2 GHz)

Page 38:

IBM Cell Processor: Chip Photo

Page 39:

IBM Cell Processor: PPE Architecture

Page 40:

IBM Cell Processor: SPE Architecture

Page 41:

IBM Cell Processor: Programming

• Instructions for the 8 SPEs are written in a different language than for the PowerPC core
• Separate compiler for the SPE (see the PPE-side sketch below)
  – Embed the SPE executable into a library
  – 'extern spe_program_handle_t <program_name>'
  – Compile the main PPU program with the library
• Thread-based model, push/pull data
  – Thread scheduling by the user
  – Five layers of parallelism:
    • Task parallelism (MPMD)
    • Data parallelism (SPMD)
    • Data-streaming parallelism (DMA double buffering)
    • Vector parallelism (SIMD – up to 16-way)
    • Pipeline parallelism (dual-pipelined SPEs)
• Need to think in terms of the SIMD nature of the dataflow to get maximum performance from the SPUs
• The SPU local store needs coherent DMA access for reaching system memory
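A hedged PPE-side sketch of that thread model, using the libspe2 calls as I recall them (spe_context_create, spe_program_load, spe_context_run, spe_context_destroy); verify the exact signatures against the SDK headers. my_spe_prog is a hypothetical name for the embedded SPE handle declared via 'extern spe_program_handle_t' above.

    /* Sketch: PPE side loads and runs an embedded SPE program. */
    #include <libspe2.h>
    #include <stdio.h>

    extern spe_program_handle_t my_spe_prog;   /* embedded SPE executable */

    int main(void) {
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        if (!ctx) { perror("spe_context_create"); return 1; }
        if (spe_program_load(ctx, &my_spe_prog)) {
            perror("spe_program_load");
            return 1;
        }
        /* Runs the SPE program to completion; argp/envp could carry
         * pointers the SPE then pulls in over DMA. */
        if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0)
            perror("spe_context_run");
        spe_context_destroy(ctx);
        return 0;
    }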

Page 42:

IBM Cell Processor: Programming

Page 43:

IBM Cell Processor: Programming

• Manually partition the application into separate code segments and use the compiler that targets the appropriate ISA
• For the SPUs, SIMD code generation can be done by a parallelizing compiler with auto-SIMDization
• Allocate SPE program data in system memory (shared-memory view) and have the SPE compiler automatically manage the movement of data
  – A naive compiler inserts an explicit DMA transfer for each access to shared memory
  – Optimized: employ a software-cache mechanism that permits reuse of the temporary buffers in the LS
• Using the SPE linker and an embedding tool
  – Generate a PPE executable that contains the SPE binary embedded within its data section
• The PPE object is then linked, using a PPE linker, with the runtime libraries required for thread creation and management, to create a bound executable for the Cell BE program

Page 44:

AMD Athlon Barcelona

Page 45:

Basic Idea: Programming Multiple Cores

• Programmer:
  – Programmers must use threads or processes
  – Write parallel algorithms
• Parallel programming is harder than normal programming because it involves:
  – Additional techniques
  – Problem partitioning
  – Synchronization
  – Access control
  – …
• 90% of programmers don't do parallel programming.
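A minimal sketch of problem partitioning with POSIX threads (my example, not the slides'): split an array sum into disjoint slices, one per thread, so the only coordination needed is the final join and combine.

    /* Sketch: partition a sum across threads; slices are disjoint, so
     * the workers need no locking. */
    #include <pthread.h>
    #include <stdio.h>

    #define N_THREADS 4
    #define N 1000000

    static double data[N];

    struct part { int lo, hi; double sum; };

    static void *worker(void *arg) {
        struct part *p = arg;              /* each thread owns one slice */
        for (int i = p->lo; i < p->hi; i++) p->sum += data[i];
        return NULL;
    }

    int main(void) {
        pthread_t tid[N_THREADS];
        struct part parts[N_THREADS];
        for (int i = 0; i < N; i++) data[i] = 1.0;
        for (int t = 0; t < N_THREADS; t++) {
            parts[t] = (struct part){ t * (N / N_THREADS),
                                      (t + 1) * (N / N_THREADS), 0.0 };
            pthread_create(&tid[t], NULL, worker, &parts[t]);
        }
        double total = 0.0;                /* combine after all joins */
        for (int t = 0; t < N_THREADS; t++) {
            pthread_join(tid[t], NULL);
            total += parts[t].sum;
        }
        printf("sum = %f\n", total);
        return 0;
    }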

Page 46:

Basic Idea: Programming Multiple Cores

• Operating System Interaction:
  – Most major OSes support multi-core today
  – The OS perceives each core as a separate processor
  – The OS scheduler maps threads/processes to the different cores
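For illustration, a thread can also request a particular core explicitly; this sketch assumes Linux/glibc's non-portable affinity API (pthread_setaffinity_np, sched_getcpu). Normally you leave this mapping to the OS scheduler, as the slide says.

    /* Sketch: pin the calling thread to core 0 (Linux/glibc only). */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                  /* ask to run only on CPU 0 */
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
            perror("pthread_setaffinity_np");
        printf("now running on CPU %d\n", sched_getcpu());
        return 0;
    }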

Page 47:

Multicore Programming: Shared Memory

• The Shared Memory Model: cores share a single memory
• Typically written using OpenMP (http://openmp.org/wp/)
• Software constructs that allow individual processes to physically share certain portions of the same address space
  – Directives to compilers: FORTRAN, C/C++
• Seems intuitive (the physical memory chips are shared by the cores)
  – Core virtualization?
• Pros
  – Easy to write
  – Communication coordination between processes is built in
  – Allows support of both sequential and parallel processes
  – Easily scalable, to a certain point
• Cons
  – Not very general; geared toward loop-level parallelism
  – Does not support asynchronous events very well
  – Not easily scalable to distributed systems
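A minimal OpenMP sketch of that loop-level parallelism (my example): a single compiler directive splits the loop across the cores and reduces the partial sums; the gcc -fopenmp flag is an assumption about your toolchain.

    /* Sketch: shared-memory loop parallelism via an OpenMP directive. */
    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        enum { N = 1000000 };
        static double a[N];
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)  /* directive, not API calls */
        for (int i = 0; i < N; i++) {
            a[i] = i * 0.5;
            sum += a[i];
        }
        printf("max threads: %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }

Build with, e.g., gcc -fopenmp omp_demo.c; without the flag the directive is ignored and the loop simply runs sequentially, which is part of the model's appeal.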

Page 48:

Multicore Programming: Message Passing

• Often written with the Message Passing Interface (MPI)
  – An API specification that allows computers to communicate with one another
• Allows communication between processes (threads) using specific message-passing system calls
• All shared data is communicated through messages
• Physical memory is not necessarily shared
• Pros
  – Allows for asynchronous events
  – Does not require the programmer to write in terms of loop-level parallelism
  – Operates on multicores AND is scalable to distributed systems
  – A more general model of programming…extremely flexible
• Cons
  – Considered extremely difficult to write
  – Difficult to incrementally increase parallelism
  – Implicitly shared data (in MPI-2.0)

Page 49:

Multicore Programming: Transaction Model

• Instructions are grouped into sets of transactions
• All-or-nothing model of execution and completion…atomicity
• Suitable for certain types of applications (ATMs, bank processing, database applications)
• Scalability!
• Pros
  – Scalable to large distributed systems
  – Applicable to a wide range of consumer-oriented applications
  – Does not necessarily imply a message-passing or shared-memory interface
  – Applicable to many hardware models (assuming support for atomicity)
• Cons
  – Not obviously amenable to all problems
  – Difficult to reason about for many applications

Page 50:

Old Approaches Fall Short

• Pthreads
  – An Intel webinar likens it to the assembly language of parallel programming
  – Data races are hard to analyze
  – No encapsulation or modularity
  – But evolutionary, and OK in the interim
• DMA with external shared memory
  – DSP programmers favor DMA
  – Explicit copying from global shared memory to local store
  – Wastes pin bandwidth and energy
  – But evolutionary, simple, modular, and small core memory footprint
• MPI
  – Province of HPC users
  – Based on sending explicit messages between private memories
  – High overheads and large core memory footprint

Page 51:

Multicore Programming: What’s Best Model?

• The Billion $$$ Question…
• No great general model…(yet)
• Hardware and software issues
• "The vast majority of programmers today don't grok concurrency, just as the vast majority of programmers 15 years ago didn't yet grok objects." (1)
• Economic & cultural forces

(1) "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software." Herb Sutter. Dr. Dobb's Journal, March 2005. (http://www.gotw.ca/publications/concurrency-ddj.htm)

"grok" = to understand deeply (Stranger in a Strange Land, Robert Heinlein)

Page 52:

Shared L2 Cache: Advantages

• Constructive interference reduces the overall miss rate
• Data shared by multiple cores is not replicated at this cache level
• With proper frame replacement algorithms, the amount of shared cache dedicated to each core is dynamic
  – Threads with less locality can have more cache
• Easy inter-process communication through shared memory
• Cache coherency is confined to L1
• A dedicated L2 cache, by contrast, gives each core more rapid access
  – Good for threads with strong locality
• A shared L3 cache may also improve performance

Page 53:

Multicore Challenges

• Relies on effective exploitation of multiple-thread parallelism
  – Need for a parallel computing model and a parallel programming model
• Aggravates the memory wall
  – Memory bandwidth
    • Way to get data out of the memory banks
    • Way to get data into the multi-core processor array
  – Memory latency
  – Fragments the L3 cache
• Pins become a strangle point
  – Rate of pin growth projected to slow and flatten
  – Rate of bandwidth per pin (pair) projected to grow slowly
• Requires mechanisms for efficient inter-processor coordination (see the sketch below)
  – Synchronization
  – Mutual exclusion
  – Context switching
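A minimal sketch of the mutual-exclusion mechanism from that list, using POSIX threads (my example): a mutex serializes updates to a shared counter that threads on different cores would otherwise race on.

    /* Sketch: mutual exclusion around a shared counter. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *bump(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);     /* enter critical section */
            counter++;                     /* safe: one thread at a time */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, bump, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld (expect 400000)\n", counter);
        return 0;
    }

Without the lock, increments from different cores can interleave and lose updates; the lock is exactly the coordination overhead the slide is pointing at.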

Page 54:

Multicore: Advantages

• Cache coherency circuitry can operate at a much higher clock rate than is possible if the signals have to travel off-chip.

• Signals between different CPUs travel shorter distances, so those signals degrade less.

• These higher quality signals allow more data to be sent in a given time period since individual signals can be shorter and do not need to be repeated as often.

• A dual-core processor uses slightly less power than two coupled single-core processors.

Page 55:

Multicore: Disadvantages

• Ability of multi-core processors to increase application performance depends on the use of multiple threads within applications.

• Most current video games will run faster on a 3 GHz single-core processor than on a 2 GHz dual-core processor (of the same core architecture).

• Two processing cores sharing the same system bus and memory bandwidth limits the real-world performance advantage.

• If a single core is close to being memory bandwidth-limited, going to dual-core might only give 30% to 70% improvement.

• If memory bandwidth is not a problem, a 90% improvement can be expected.

Page 56:

Multicore Issues

• How many general-purpose cores is enough?
  – Conjecture: probably no more than 16, based on experience with multiprocessor systems
• Should future systems have homogeneous or heterogeneous cores (e.g., the Cell)?
• What is the best way to connect the cores on a chip?
• Are threads or processes better for programming multicore processors?
• Will software vendors charge a separate license per core, or only a single license per chip?

Page 57:

Page 58:

Additional Material

Page 59:

AMD Opteron Processor

[Block diagram: AMD Opteron processor core. Fetch with branch prediction feeds the 64 KB L1 I-cache; scan/align/decode with a fastpath and a microcode engine produces µops for a 72-entry instruction control unit. Integer decode & rename feeds three reservation stations, each pairing an ALU with an AGU (one ALU also has a MULT). FP decode & rename feeds a 36-entry FP scheduler with FADD, FMUL, and FMISC units. A 44-entry load/store queue connects to the 64 KB L1 D-cache.]

AMD Opteron processor core architecture

AGU = Address Generation Unit; ALU = Arithmetic/Logical Unit; RES = Reservation Station

Page 60:

AMD Athlon 64-bit Lines

http://www.amd.com/gb-uk/Processors/ProductInformation/0,,30_118_9485_13041^13043,00.html

Page 61:

Evolution of AMD Athlon 64-bit Processors

San Diego (L2: 1 MB) → Toledo (L2: 2 x 1 MB) → Windsor (L2: 2 x 1 MB) → Barcelona (L2: 2 MB)

Page 62:

Single-Chip Cloud Computer (SCC)

Page 63:

Inside the SCC

Page 64:

Tilera 64

• Cores connected by a mesh network
• Five physical mesh networks
  – UDN, IDN, SDN, TDN, MDN
  – Each has 32 channels
  – Packet-switched
  – Wormhole routed
  – Point-to-point
• TDN and MDN are used for handling memory traffic:
  – Separate networks improve concurrency by reducing bottlenecks

Page 65:

Tilera 64

• Number of tiles = 64
• On-chip distributed cache = 5 MB
• Operations at 32, 16, 8 bits = 144, 192, 384 BOPS
• On-chip interconnect bandwidth = 32 Tbps
• I/O bandwidth = 40 Gbps
• Memory bandwidth = 200 Gbps
• 3-way, 64-bit VLIW CPU

Page 66:

Tilera 64

• Memory requests transit via the TDN
  – Large store requests, small load requests
• Memory responses transit via the MDN
  – Large load responses, small store responses
  – Includes cache-to-cache transfers and off-chip transfers
• Directory-based cache coherence
  – Directory cache at every node
  – Off-chip directory controller
• Tile-to-tile requests and responses transit the TDN
• Off-chip memory requests and responses transit the MDN

Page 67:

Itanium Dual-Core

Page 68:

Intel Core Duo

• 2 mobile-optimized execution cores
  – No multi-threading
• Cache hierarchy
  – Private 32-KB L1I and L1D caches
  – Shared 2-MB L2 cache
  – Provides efficient data sharing between the two cores
• Power reduction
  – Some states entered individually by each processor
  – Deeper and Enhanced Deeper Sleep states apply only to the whole die
  – Dynamic Cache Sizing feature
    • Flushes the entire cache
    • This enables Enhanced Deeper Sleep at a lower voltage that does not guarantee cache integrity
• 151 million transistors

Page 69:

ARM11 MPCore

• Up to 4 processors, each with its own L1 instruction and data caches
• Distributed interrupt controller
• Timer per CPU
• Watchdog
  – Warning alerts for software failures
  – Counts down from predetermined values
  – Issues a warning at zero
• CPU interface
  – Interrupt acknowledgement, masking, and completion acknowledgement
• CPU
  – A single ARM11 core, called the MP11
• Vector floating-point unit
  – FP co-processor
• L1 cache
• Snoop control unit
  – L1 cache coherency

Page 70:

ARM11 MPCore

Page 71:

ARM11 MPCore Interrupt Handling

• The Distributed Interrupt Controller (DIC) collates interrupts from many sources and provides:
  – Masking
  – Prioritization
  – Distribution to the target MP11 CPUs
  – Status tracking
  – Software interrupt generation
• The number of interrupts is independent of the MP11 CPU design
• Memory mapped
• Accessed by the CPUs via a private interface through the SCU
• Can route interrupts to a single CPU or multiple CPUs
• Provides inter-process communication
  – A thread on one CPU can cause activity by a thread on another CPU

Page 72:

ARM11 MPCore: Cache Coherency

• The Snoop Control Unit (SCU) resolves most shared-data bottleneck issues
• L1 cache coherency is based on MESI
• Direct data intervention
  – Copying clean entries between L1 caches without accessing external memory
  – Reduces read-after-write traffic from L1 to L2
  – Can resolve a local L1 miss from a remote L1 rather than from L2
• Duplicated tag RAMs
  – Cache tags implemented as a separate block of RAM
  – Same length as the number of lines in the cache
  – Duplicates used by the SCU to check data availability before sending coherency commands
  – Commands sent only to the CPUs that must update their coherent data cache
• Migratory lines
  – Allows moving dirty data between CPUs without writing to L2 and reading back from external memory