
ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ([email protected])

ΗΜΥ 656: Advanced Computer Architecture

Spring Semester 2007

Lecture 5: Chip Multiprocessors – The New Era in Processor Architectures

Acknowledgements:

Wen-Mei Hwu, Kunle Olukotun, Shih-Hao Hung, Dezső Sima and the Stanford Hydra Group.

Microarchitecture: Overview

[Figure: Instruction Supply, Execution Mechanism, Data Supply]

Highest performance means generating the highest instruction and data bandwidth you can, and effectively consuming that bandwidth in execution – paraphrased from M. Alsup, AMD Fellow

Microarchitecture, 1990

• Short pipelines
• On-chip I and D caches, blocking
• Simple prediction

Microarchitecture, 2000

• Mechanisms to find parallel instructions
  – dynamic scheduling
  – static scheduling

• On-chip cache hierarchies, with non-blocking, higher-bandwidth caches

• Sophisticated branch prediction

Future Microarchitecture: One Perspective

[Figure: Instruction Supply, Execution Mechanism, Data Supply]

Highest performance means generating the highest instruction and data bandwidth you can, and effectively consuming that bandwidth in execution – paraphrased from M. Alsup, AMD Fellow

Where are we headed?
• More ILP: even wider, deeper
  – enabling technology: speculation, predication, compiler transformations, binary re-optimization, complexity-effective design
• Multithreading
  – enabling technology: speculation, subordinate threads, discovery of thread-level parallelism
• Chip multiprocessors
  – enabling technology: speculation, discovery of thread-level, coarse-grained parallelism

More ILP
• Instruction supply
  – Branches, cache misses, partial fetches
• Data supply
  – Higher bandwidth, lower latency, memory ordering, non-blocking caches
• Execution
  – Reduction of redundant work, design complexity and partitioning
• Tolerating latency
  – Can some things just take a long time?

Multithreading [Burton Smith, 1978]

[Figure: pipeline stages Fetch, Execute, Write Back. A snapshot of the pipeline during a single cycle; each color represents instructions from a different thread.]

B. Smith's original concept was for a single-wide pipeline, but it extends naturally to a multiple-issue pipeline.

Simultaneous Multithreading [W. Yamamoto, 1994 / D. Tullsen, 1995]

[Figure: pipeline stages Fetch, Execute, Write Back]

Simultaneous Multithreading, possible implementation

[Figure: Front End, Back End]

• Intel Hyperthreading in the Pentium 4 [HotChips'14] is the first realization, with two threads
• Small ISA register file minimizes the effect of replication
• Replicated retirement logic
• Minimal hardware overhead but a major increase in verification cost

Chip Multiprocessor [K. Olukotun, 1996]

[Figure: pipeline stages Fetch, Execute, Write Back; processors Proc A, Proc B, Proc C and Proc D around a shared L2 cache]

A single processor die contains multiple CPUs, all of which share some amount of resources, such as an L2 cache and chip pins.

Hardware Accelerators

Existing Solutions…

[Figure: Intel IXP1200 Network Processor and Philips Nexperia (Viper); block labels: ARM, micro-engines, access control, MIPS, MPEG, VLIW, video, MSP]

IBM Cell … what's next? …

Discussion/Thought Exercise
• What are the essential differences between the SMT model of execution and the CMP model?
  – What resources are shared and in what manner?
  – What type of data movement exists in one but not the other?
  – What types of applications/situations are the best-case situations for each model?

The Advent of Superscalar Architecture

Transistor densities increased at a stunning pace.

Is there a way to use those transistors to increase computing performance?

Put more than one ALU on a chip.

The IBM RS6000, released in 1990: the world's first superscalar CPU.

Most general purpose CPUs developed since about 1998 are superscalar

Technology ↔ Architecture

• Transistors are cheap, plentiful and fast
  – Moore's law
  – 100 million transistors by 2000
• Wires are cheap, plentiful and slow
  – Wires get slower relative to transistors
  – Long cross-chip wires are especially slow
• Architectural implications
  – Plenty of room for innovation
  – Single-cycle communication requires localized blocks of logic
  – High communication bandwidth across the chip is easier to achieve than low latency

Exploiting Program Parallelism

[Figure: levels of parallelism (instruction, loop, thread, process) plotted against grain size in instructions, on an axis from 1 to 1M]

Future Processors to use Coarse-Grain Parallelism

• Today's microprocessors exploit instruction-level parallelism through a deep instruction pipeline and through superscalar or VLIW multiple-issue techniques.
• Today's (2001) technology: approx. 40 M transistors per chip. In the future (2012): 1.4 G transistors per chip.

What next?

• Two directions:
  – Increase single-thread performance --> use more speculative instruction-level parallelism
  – Increase multi-thread (multi-task) performance --> use thread-level parallelism in addition to instruction-level parallelism
    A "thread" in this lecture means a "HW thread", which can be a SW (POSIX) thread, a process, ...
• Far future (??): increase single-thread performance by using speculative instruction-level and thread-level parallelism

Advanced Superscalar Processors for Billion Transistor Chips in Year 2005 - Characteristics

• aggressive speculation, such as a very aggressive dynamic branch predictor,
• a large trace cache,
• very-wide-issue superscalar processing (an issue width of 16 or 32 instructions per cycle),
• a large number of reservation stations to accommodate 2,000 instructions,
• 24 to 48 highly optimized, pipelined functional units,
• sufficient on-chip data cache, and
• sufficient resolution and forwarding logic.

See: Yale N. Patt, Sanjay J. Patel, Marius Evers, Daniel H. Friendly, Jaret Stark: One Billion Transistors, One Uniprocessor, One Chip. IEEE Computer, September 1997, pp. 51-57.

Requirements and Solutions
• Delivering optimal instruction bandwidth requires:
  – a minimal number of empty fetch cycles,
  – a very wide (conservatively 16 instructions, aggressively 32), full issue each cycle,
  – and a minimal number of cycles in which the instructions fetched are subsequently discarded.
• Consuming this instruction bandwidth requires:
  – sufficient data supply,
  – and sufficient processing resources to handle the instruction bandwidth.
• Suggestions:
  – an instruction cache system (the I-cache) that provides for out-of-order fetch (fetch, decode, and issue in the presence of I-cache misses),
  – a large trace cache for providing a logically contiguous instruction stream,
  – an aggressive multi-hybrid branch predictor (multiple, separate branch predictors, each tuned to a different class of branches) with support for context switching, indirect jumps, and interference handling.

Future Processors to use Coarse-Grain Parallelism

• Chip multiprocessors (CMPs) or multiprocessor chips
  – integrate two or more complete processors on a single chip,
  – every functional unit of a processor is duplicated
• Simultaneous multithreaded (SMT) processors
  – store multiple contexts in different register sets on the chip,
  – the functional units are multiplexed between the threads,
  – instructions of different contexts are simultaneously executed

Chip Multiprocessors (CMPs)

Principal Chip Multiprocessor Alternatives
• symmetric multiprocessor (SMP),
• distributed shared memory multiprocessor (DSM),
• message-passing shared-nothing multiprocessor.

Organizational principles of multiprocessors

[Figure: three organizations. (SMP) symmetric multiprocessor: processors connected through an interconnection to a shared global memory, with a shared address space. (DSM) distributed-shared-memory multiprocessor: processors with local memories connected through an interconnection, with a shared address space over physically distributed memory. Message-passing (shared-nothing) multiprocessor: processors with local memories communicating by send and receive over an interconnection, with distributed address spaces over physically distributed memory. The combination of global memory with distributed address spaces is marked empty.]

Typical SMP

[Figure: four processors, each with its own primary cache and secondary cache, connected by a bus to global memory]

Shared memory candidates for CMPs

[Figure: two organizations. In one, four processors each have a private primary cache and a private secondary cache and share global memory. In the other, four processors each have a private primary cache and share one secondary cache and global memory.]

Shared-main memory and shared-secondary cache

Shared memory candidates for CMPs

[Figure: four processors sharing one primary cache, a secondary cache and global memory]

Shared-primary cache

Grain-levels for CMPs
• multiple processes in parallel
• multiple threads from a single application
  ⇒ implies a common address space for all threads
• extracting threads of control dynamically from a single instruction stream
  ⇒ see last lecture: multiscalar, trace processors, ...

Hydra: A Single-Chip Multiprocessor

[Figure: Hydra on a single chip. CPU 0 to CPU 3, each with a primary I-cache, a primary D-cache and a memory controller; centralized bus arbitration mechanisms; on-chip secondary cache; Rambus memory interface to DRAM main memory; off-chip L3 interface to a cache SRAM array; I/O bus interface with DMA to I/O devices]

Multithreaded Processors
• Aim: latency tolerance
• What is the problem?
• Load access latencies measured on an Alpha Server 4100 SMP with four 300 MHz Alpha 21164 processors are:
  – 7 cycles for a primary cache miss that hits in the on-chip L2 cache of the 21164 processor,
  – 21 cycles for an L2 cache miss that hits in the L3 (board-level) cache,
  – 80 cycles for a miss that is served by the memory, and
  – 125 cycles for a dirty miss, i.e., a miss that has to be served from another processor's cache memory.
• Multithreaded processors are able to bridge latencies by switching to another thread of control, in contrast to chip multiprocessors.
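To get a feel for what these latencies mean for an average load miss, the following sketch computes a weighted average over the four service levels quoted above. The latency values come from the slide; the miss mix in `example_mix` and all function and key names are hypothetical, chosen only for illustration and not measured data.

```python
# Load latencies in cycles, as quoted for the Alpha Server 4100 above.
LATENCY_CYCLES = {
    "l2_hit": 7,        # primary cache miss that hits in the on-chip L2
    "l3_hit": 21,       # L2 miss that hits in the board-level L3
    "memory": 80,       # miss served by main memory
    "dirty_miss": 125,  # miss served from another processor's cache
}


def average_miss_latency(fractions):
    """Weighted average latency (cycles) of a primary-cache miss.

    `fractions` maps a service level to the share of misses it serves;
    the shares must sum to 1.
    """
    total = sum(fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(LATENCY_CYCLES[level] * share
               for level, share in fractions.items())


# Hypothetical mix of where misses are served (an assumption, not data).
example_mix = {"l2_hit": 0.70, "l3_hit": 0.20, "memory": 0.08, "dirty_miss": 0.02}

if __name__ == "__main__":
    print(f"average miss latency: {average_miss_latency(example_mix):.1f} cycles")
```

Even with most misses served on chip, the rare long-latency cases dominate the average, which is the gap multithreading tries to hide.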

[Figure: four hardware contexts (threads 1 to 4), each with its own register set, PC and PSR; a further block is labeled FP]

Multithreaded Processors
• Multithreading:
  – Provide several program counter registers (and usually several register sets) on chip
  – Fast context switching by switching to another thread of control

Approaches of Multithreaded Processors

• Cycle-by-cycle interleaving
  – An instruction of another thread is fetched and fed into the execution pipeline at each processor cycle.
• Block interleaving
  – The instructions of a thread are executed successively until an event occurs that may cause latency. This event induces a context switch.
• Simultaneous multithreading
  – Instructions are simultaneously issued from multiple threads to the FUs of a superscalar processor.
  – Combines wide-issue superscalar instruction issue with multithreading.

Comparison of Multithreading with Non-Multithreading Approaches:

(a) single-threaded scalar
(b) cycle-by-cycle interleaving multithreaded scalar
(c) block interleaving multithreaded scalar

[Figure: panels (a) to (c), with time in processor cycles on the vertical axis; context switches are marked in panels (b) and (c)]

Comparison of Multithreading with Non-Multithreading Approaches:

(a) superscalar
(b) VLIW
(c) cycle-by-cycle interleaving
(d) cycle-by-cycle interleaving VLIW

[Figure: panels (a) to (d), with issue slots on the horizontal axis and time in processor cycles on the vertical axis; context switches are marked in panels (c) and (d), and some issue slots are marked N]

Comparison of Multithreading with Non-Multithreading:

simultaneous multithreading (SMT) and chip multiprocessor (CMP)

[Figure: issue slots on the horizontal axis and time in processor cycles on the vertical axis, for SMT and CMP]

Cycle-by-Cycle Interleaving
• the processor switches to a different thread after each instruction fetch
• pipeline hazards cannot arise and the processor pipeline can be easily built without the necessity of complex forwarding paths
• context-switching overhead is zero cycles
• memory latency is tolerated by not scheduling a thread until the memory transaction has completed
• requires at least as many threads as pipeline stages in the processor
• degrades single-thread performance if not enough threads are present
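The requirement of at least as many threads as pipeline stages can be illustrated with a small simulation. This is a minimal sketch under a simplifying assumption that is not in the slides: a thread may issue its next instruction only after its previous one has left the pipeline, `pipeline_depth` cycles later. The function and variable names are hypothetical.

```python
def interleaved_utilization(num_threads, pipeline_depth, cycles=1000):
    """Fraction of cycles in which some thread issues an instruction.

    Assumed model: one issue slot per cycle, round-robin thread selection,
    and at most one in-flight instruction per thread.
    """
    ready_at = [0] * num_threads  # first cycle each thread may issue again
    issued = 0
    next_thread = 0
    for cycle in range(cycles):
        # Round-robin search for a thread whose previous instruction is done.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + pipeline_depth
                issued += 1
                next_thread = (t + 1) % num_threads
                break
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 2, 5, 8):
        print(n, "threads:", interleaved_utilization(n, pipeline_depth=5))
```

In this model a single thread on a five-stage pipeline fills only one issue slot in five, and the pipeline is fully used once the thread count reaches the pipeline depth.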

Cycle-by-Cycle Interleaving: Improving single-thread performance
• The dependence look-ahead technique adds several bits to each instruction format in the ISA.
  – The scheduler feeds instructions of the same thread that are not data or control dependent successively into the pipeline.
• The interleaving technique proposed by Laudon et al. adds caching and full pipeline interlocks to the cycle-by-cycle interleaving approach.

Single-core computer

[Figure: single-core CPU chip, with the single core highlighted]

Multi-core architectures

• This lecture is about a new trend in computer architecture: replicate multiple processor cores on a single die.

[Figure: multi-core CPU chip with Core 1, Core 2, Core 3, Core 4]

Multi-core CPU chip
• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)

[Figure: cores 1 to 4]

The cores run in parallel

[Figure: cores 1 to 4 running threads 1 to 4, one thread per core]

Within each core, threads are time-sliced (just like on a uniprocessor)

[Figure: cores 1 to 4, each running several threads]

Interaction with OS
• OS perceives each core as a separate processor
• OS scheduler maps threads/processes to different cores
• Most major operating systems support multi-core today

Why multi-core?
• Difficult to make single-core clock frequencies even higher
• Deeply pipelined circuits:
  – heat problems
  – speed of light problems
  – difficult design and verification
  – large design teams necessary
  – server farms need expensive air-conditioning
• Many new applications are multithreaded
• General trend in computer architecture (shift towards more parallelism)

Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order and pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.
• Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years

Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• A server can serve each client in a separate thread (Web server, database server)
• A computer game can do AI, graphics, and physics in three separate threads
• Single-core superscalar processors cannot fully exploit TLP
• Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP

General context: Multiprocessors
• A multiprocessor is any computer with several processors
• SIMD
  – Single instruction, multiple data
  – Modern graphics cards
• MIMD
  – Multiple instructions, multiple data

[Figure: Lemieux cluster, Pittsburgh Supercomputing Center. http://www.psc.edu/machines/tcs/lemieux.html]

Multiprocessor memory types
• Shared memory: in this model, there is one (large) common shared memory for all processors
• Distributed memory: in this model, each processor has its own (small) local memory, and its content is not replicated anywhere else

Multi-core processor is a special kind of a multiprocessor:

All processors are on the same chip

• Multi-core processors are MIMD: different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data).
• Multi-core is a shared memory multiprocessor: all cores share the same memory

What applications benefit from multi-core?

• Database servers
• Web servers (Web commerce)
• Compilers
• Multimedia applications
• Scientific applications, CAD/CAM
• In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)

[Figure annotation: each can run on its own core]

More examples

• Editing a photo while recording a TV show through a digital video recorder

• Downloading software while running an anti-virus program

• “Anything that can be threaded today will map efficiently to multi-core”

• BUT: some applications are difficult to parallelize

A technique complementary to multi-core: Simultaneous multithreading

• Problem addressed: the processor pipeline can get stalled:
  – Waiting for the result of a long floating point (or integer) operation
  – Waiting for data to arrive from memory
• Other execution units wait unused

[Figure: processor pipeline block diagram with BTB and I-TLB, Decoder, Trace Cache, uCode ROM, BTB, Rename/Alloc, Uop queues, Schedulers, Integer and Floating Point units, L1 D-Cache and D-TLB, L2 Cache and Control, Bus. Source: Intel]

Simultaneous multithreading (SMT)

• Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core

• Weaving together multiple "threads" on the same core

• Example: if one thread is waiting for a floating point operation to complete, another thread can use the integer units

Without SMT, only a single thread can run at any given time

[Figure: two pipeline block diagrams, one occupied only by Thread 1 (floating point), the other only by Thread 2 (integer operation)]

SMT processor: both threads can run concurrently

[Figure: pipeline block diagram with Thread 1 (floating point) and Thread 2 (integer operation) sharing the core]

But: Can’t simultaneously use the same functional unit

[Figure: pipeline block diagram with Thread 1 and Thread 2 both using the integer unit, marked IMPOSSIBLE]

This scenario is impossible with SMT on a single core (assuming a single integer unit).

SMT not a “true” parallel processor

• Enables better threading (e.g. up to 30%)
• OS and applications perceive each simultaneous thread as a separate "virtual processor"
• The chip has only a single copy of each resource
• Compare to multi-core: each core has its own copy of resources

Multi-core: threads can run on separate cores

[Figure: two slides, each showing two cores as pipeline block diagrams; in the first, Thread 1 and Thread 3 run on separate cores, in the second, Thread 2 and Thread 4]

Combining Multi-core and SMT

• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT: our fish machines
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads

SMT Dual-core: all four threads can run concurrently

[Figure: two SMT-enabled cores as pipeline block diagrams, running Thread 1, Thread 2, Thread 3 and Thread 4]

Comparison: multi-core vs SMT

• Multi-core:
  – Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture)
  – However, great with thread-level parallelism
• SMT:
  – Can have one large and fast superscalar core
  – Great performance on a single thread
  – Mostly still only exploits instruction-level parallelism

Concepts – Multithreading

• Data access latency
  – Cache misses (L1, L2)
  – Memory latency (remote, local)
  – Often unpredictable
• Multithreading (MT)
  – Tolerate or mask long and often unpredictable latency operations by switching to another context that is able to do useful work.

Why Multithreading Today?

• ILP is exhausted, TLP is in.
• Large performance gap between memory and processor.
• Too many transistors on chip.
• More existing MT applications today.
• Multiprocessors on a single chip.
• Long network latency, too.

Classical Problem, 1960s and 1970s
• I/O latency prompted multitasking
• IBM mainframes
• Multitasking
• I/O processors
• Caches within disk controllers

Requirements of Multithreading
• Storage is needed to hold each context's PC, registers, status word, etc.
• Coordination to match an event with a saved context
• A way to switch contexts
• Long latency operations must use resources not in use

Processor Utilization vs. Latency

R = the run length to a long latency event

L = the amount of latency
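The slide defines only R and L. One common textbook-style way to relate them to processor utilization is sketched below; the formula, the thread count N (`num_threads`) and the context-switch cost C (`switch_cost`) are assumptions added here for illustration, not taken from the slide. With a single thread the processor is busy for R out of every R + L cycles; with N threads the useful work scales linearly until the processor saturates, where only the switch cost is lost.

```python
def utilization(run_length, latency, num_threads=1, switch_cost=0):
    """Estimated processor utilization under an assumed simple model.

    run_length  -- R, cycles a thread runs before a long-latency event
    latency     -- L, cycles the long-latency event takes
    num_threads -- N, hardware contexts available (assumption)
    switch_cost -- C, cycles lost per context switch (assumption)
    """
    if run_length <= 0 or latency < 0 or num_threads < 1 or switch_cost < 0:
        raise ValueError("invalid parameters")
    # Linear regime: N threads each contribute R useful cycles per R + L.
    linear = num_threads * run_length / (run_length + latency)
    # Saturated regime: the latency is fully hidden, only switches cost time.
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


if __name__ == "__main__":
    print(utilization(20, 80))                                # single thread
    print(utilization(20, 80, num_threads=5))                 # latency hidden
    print(utilization(20, 80, num_threads=10, switch_cost=5))
```

Under this model, roughly (R + L) / R threads are enough to hide the latency when switching is free; beyond that point extra threads do not raise utilization.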

Problem of the 1980s
• The problem was revisited due to the advent of graphics workstations
  – Xerox Alto, TI Explorer
  – Concurrent processes are interleaved to allow the workstations to be more responsive.
  – These processes could drive or monitor display, input, file system, network, user processing
  – Process switching was slow, so the subsystems were microprogrammed to support multiple contexts

Scalable Multiprocessor (1990s)
• Dance hall: a shared interconnect with memory on one side and processors on the other.
• Or processors may have local memory

How do the processors communicate?
• Shared memory
  – Potential long latency on every load
  – Cache coherency becomes an issue
  – Examples include NYU's Ultracomputer, IBM's RP3, BBN's Butterfly, MIT's Alewife, and later Stanford's Dash.
  – Synchronization occurs through shared variables, locks, flags, and semaphores.
• Message passing
  – The programmer deals with latency explicitly. This allows the programmer to minimize the number of messages while maximizing their size, and to minimize delay by sending a message so that it reaches the receiver at the time the receiver expects it.
  – Examples include Intel's iPSC and Paragon, Caltech's Cosmic Cube, and Thinking Machines' CM-5
  – Synchronization occurs through send and receive

Cycle-by-Cycle Interleaved Multithreading

• Denelcor HEP1 (1982), HEP2

• Horizon, which was never built

• Tera, MTA

Cycle-by-Cycle Interleaved Multithreading

• Features
  – An instruction from a different context is launched at each clock cycle
  – No interlocks or bypasses thanks to a non-blocking pipeline
• Optimizations:
  – Leaving context state in the processor (PC, register #, status)
  – Assigning tags to remote requests and then matching them on completion

Challenges with this approach

• I-cache:
  – Instruction bandwidth
  – I-cache misses: since instructions are being grabbed from many different contexts, instruction locality is degraded and the I-cache miss rate rises.
• Register file access time:
  – Register file access time increases because the register file has to grow significantly to accommodate many separate contexts.
  – In fact, the HEP and Tera use SRAM to implement the register file, which means longer access times.
• Single-thread performance
  – Single-thread performance is significantly degraded, since the context is forced to switch to a new thread even if none is available.
• Very high bandwidth network, which is fast and wide
• Retries on load empty or store full

Improving Single Thread Performance

• Do more operations per instruction (VLIW)
• Allow multiple instructions to issue into the pipeline from each context.
  – This could lead to pipeline hazards, so other safe instructions could be interleaved into the execution.
  – For Horizon and Tera, the compiler detects such data dependencies and the hardware enforces them by switching to another context if one is detected.
• Switch on load
• Switch on miss
  – Switching on load or miss will increase the context switch time.

Simultaneous Multithreading (SMT)
• Tullsen et al. (U. of Washington), ISCA '95
• A way to utilize the pipeline with increased parallelism from multiple threads.

Simultaneous Multithreading

SMT Architecture
• Straightforward extension to conventional superscalar design.
  – multiple program counters and some mechanism by which the fetch unit selects one each cycle,
  – a separate return stack for each thread for predicting subroutine return destinations,
  – per-thread instruction retirement, instruction queue flush, and trap mechanisms,
  – a thread id with each branch target buffer entry to avoid predicting phantom branches, and
  – a larger register file, to support logical registers for all threads plus additional registers for register renaming.
• The size of the register file affects the pipeline and the scheduling of load-dependent instructions.

SMT Performance [Tullsen '96]

Commercial Machines w/ MT Support

• Intel Hyperthreading (HT)
  – Dual threads
  – Pentium 4, Xeon
• Sun CoolThreads
  – UltraSPARC T1
  – 4 threads per core
• IBM
  – POWER5

IBM POWER5: http://www.research.ibm.com/journal/rd/494/mathis.pdf

SMT Summary

• Pros:
  – Increased throughput without adding much cost
  – Fast response in a multitasking environment
• Cons:
  – Slower single-processor performance

Multicore
• Multiple processor cores on a chip
  – Chip multiprocessor (CMP)
  – Sun's Chip Multithreading (CMT)
    • UltraSPARC T1 (Niagara)
  – Intel's Pentium D
  – AMD dual-core Opteron
• Also a way to utilize TLP, but
  – 2 cores, 2X costs
  – No good for single-thread performance
• Can be used together with SMT

Chip Multithreading (CMT)

Sun UltraSPARC T1 Processor

http://www.sun.com/servers/wp.jsp?tab=3&group=CoolThreads%20servers

8 Cores vs 2 Cores

• Is 8 cores too aggressive?
  – Good for server applications, given
    • Lots of threads
    • Scalable operating environment
    • Large memory space (64-bit)
  – Good for power efficiency
    • Simple pipeline design for each core
  – Good for availability
  – Not intended for PCs, gaming, etc.


The Case for a Single-Chip Multiprocessor

Kunle [email protected]

Lance [email protected]

Basem A. [email protected]

Computer Systems Laboratory
Stanford University
Stanford, CA 94305-4070
Hydra Group
http://www-hydra.stanford.edu

Ken Wilson

Kunyung Chang


Introduction

Microprocessor performance growth
• Increase in integration density
• Higher clock rates
• New microarchitectural innovations
  – Multiple instruction issue
  – Dynamic scheduling
  – Speculative execution
  – Non-blocking caches



The Limits of the Superscalar Approach

Recent design trend: CPUs with multiple instruction issue (dynamic scheduling)

A dynamic superscalar CPU

[Figure: dynamic superscalar CPU with instruction fetch, instruction issue queue and execution array]

Three factors constrain instruction fetch:
• Mispredicted branches
• Instruction misalignment
• Cache misses

Tracking register dependencies between instructions can be implemented in two ways:
• Use an explicit table for mapping architectural registers to physical registers
• Use a combination reorder buffer/instruction queue

Each instruction is inserted into the instruction issue queue and is issued for execution once all of its operands are ready.




O: the number of operands per instruction

W: the issue width of the machine

O x W: the number of access ports required by the mapping table structure

Example: an eight-wide issue machine with three operands per instruction requires a 24-port mapping table.




n: the number of bits required to encode a register identifier

Q: the size of the instruction issue queue

n x Q x O x W: the number of comparators, which grows with the size of the instruction queue and the issue width

Example: an eight-wide issue machine with three operands per instruction, a 64-entry instruction queue and 6-bit comparisons requires 8 x 3 x 64 x 6 = 9,216 one-bit comparators.
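The two hardware-cost expressions above (O x W mapping-table ports and n x Q x O x W comparators) can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function and parameter names are hypothetical, and the example values are the ones quoted on the slides.

```python
def mapping_table_ports(operands_per_instruction, issue_width):
    """Access ports needed by the register mapping table: O x W."""
    return operands_per_instruction * issue_width


def issue_queue_comparators(register_id_bits, queue_entries,
                            operands_per_instruction, issue_width):
    """Comparators needed by the instruction issue queue: n x Q x O x W."""
    return (register_id_bits * queue_entries
            * operands_per_instruction * issue_width)


if __name__ == "__main__":
    # Eight-wide issue, three operands per instruction.
    print(mapping_table_ports(3, 8))              # 24
    # Same machine with a 64-entry queue and 6-bit register identifiers.
    print(issue_queue_comparators(6, 64, 3, 8))   # 9216
    # Narrower machine with a smaller queue, for comparison (values assumed).
    print(issue_queue_comparators(6, 16, 3, 2))   # 576
```

Because a wider machine also needs a larger queue Q to keep its issue slots full, the comparator count grows much faster than linearly in the issue width W.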




As instruction issue widths increase, larger windows of instructions are required to find independent instructions that can issue in parallel and maintain the full issue bandwidth.

-> A quadratic increase in the size of the instruction issue queue
-> will limit the cycle time of the processor
-> will limit the performance of wide-issue superscalar machines




Motivation: technology push

Technical issues, especially
- the delay of the complex issue queue
- multi-port register files
will limit the performance returns from a wide superscalar execution model.

This creates a need for a decentralized microarchitecture that maintains the performance growth of microprocessors: the single-chip multiprocessor.



Motivation: application pull

The microarchitecture that works best depends on the amount and characteristics of the parallelism in the applications.

With respect to parallelism, two different classes of applications require different execution models.

Application pull towards a single-chip multiprocessor




Application parallelism

Class 1: applications with low to moderate amounts of parallelism (under 10 instructions/cycle)

vs.

Class 2: applications with large amounts of parallelism (greater than 40 instructions/cycle): floating point apps, loop-level parallelism





Class 1: applications with low to moderate amounts of parallelism
• Little parallelism to exploit
• These applications work best on moderately superscalar processors with very high clock rates

Class 2: applications with large amounts of parallelism (greater than 40 instructions/cycle)
• These applications have large amounts of parallelism and see performance benefits from a variety of methods designed to exploit parallelism, such as superscalar, VLIW and vector processing
• However, recent advances in parallel compilers make a multiprocessor an efficient and flexible way to exploit the parallelism in these programs





Ways to use a multiprocessor

1st way: execution of multiple processes in parallel to increase throughput in a multiprogramming environment
• This requires a multiprocessor-aware operating system; a number of commercially available operating systems, such as Windows NT, IRIX and Sun Solaris, have this capability.
• The increasingly widespread use of visualization and multimedia applications tends to increase the number of active processes or independent threads on a desktop machine or server at a particular point in time.




2nd way: execution of multiple threads in parallel that come from a single application

Example 1: transaction processing
• The threads communicate using shared memory.
• These applications are designed to run on parallel machines with communication latencies in the hundreds of CPU clock cycles.
• The threads do not communicate in a very fine-grained manner.





2nd way (continued): execution of multiple threads in parallel that come from a single application

Example 2: hand-parallelized floating point scientific applications (hand code: code parallelized by hand before the program is compiled)
• This applies when the instruction window size is very large and the branch prediction is perfect (because the existing parallelism is widely distributed).
• The parallelism exposed in this fine-grained manner cannot be exploited by a conventional multiprocessor architecture.
-> A single-chip multiprocessor architecture can exploit it.




3rd way: accelerating the execution of sequential applications without manual intervention
• Automatic parallelization technology has been shown to be effective (usually used with Fortran).
• Automatic parallelization derives significant performance benefits from the low-latency inter-processor communication provided by a single-chip multiprocessor.



Key characteristics of the two microarchitectures

• The two-issue CPU will have a higher clock rate than the six-issue CPU,
• but assume that the two processors have the same clock rate.

[Table: the differences between the two systems]


Introduction

Higher Clock rates

Floor plan for the six-issue dynamic superscalar

Big overhead

6-Way Superscalar Architecture



TERMS

TLB :The translation lookaside buffer (TLB) is a table in the processor that contains cross-references between the virtual and real addresses of recently referenced pages of memory. It functions like a "hot list" or quick-lookup index of the pages in main memory that have been most recently accessed.




4x2-Way Superscalar Multiprocessor Architecture

Floor plan for the four-way single chip

Each processor also has its own floating point unit and integer unit, with smaller overhead

L1 caches in each processor

Switch for data communication among processors and memory through the cache




Simulation Environment



Sample Applications

Integer applications: compress, eqntott, m88ksim, MPsim

Floating point applications: applu, apsi, swim, tomcatv

Multiprogramming application: pmake



compress – compresses and uncompresses a file in memory

eqntott – translates logic equations into truth tables

m88ksim – Motorola 88000 CPU simulator

MPsim – VCS compiled Verilog simulation of a multiprocessor

applu – solver for parabolic/elliptic partial differential equations

apsi – solves problems of temperature, wind, velocity, and distribution of pollutants

swim – shallow water model with 1K x 1K grid

tomcatv – mesh generation with Thompson solver

pmake – parallel make of gnuchess using C compiler



How to compare

MPCI: misses per completed instruction – the cache miss rates

IPC: instructions per cycle

BP rate: branch prediction rate
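The three comparison metrics can be computed from raw simulator counters roughly as follows. This is a sketch under assumptions: the function names and counter values are hypothetical, MPCI is taken as misses per completed instruction, IPC as instructions per cycle, and BP rate is interpreted here as the fraction of branches predicted correctly.

```python
def mpci(cache_misses: int, completed_instructions: int) -> float:
    """Misses per completed instruction."""
    return cache_misses / completed_instructions


def ipc(completed_instructions: int, cycles: int) -> float:
    """Instructions completed per clock cycle."""
    return completed_instructions / cycles


def bp_rate(correct_predictions: int, branches: int) -> float:
    """Fraction of branches predicted correctly (assumed interpretation)."""
    return correct_predictions / branches


# Hypothetical counter values, not results from the slides.
counters = {"instructions": 1_000_000, "cycles": 800_000,
            "l1d_misses": 20_000, "branches": 150_000, "bp_correct": 138_000}

print(f"IPC     = {ipc(counters['instructions'], counters['cycles']):.2f}")
print(f"MPCI    = {mpci(counters['l1d_misses'], counters['instructions']):.1%}")
print(f"BP rate = {bp_rate(counters['bp_correct'], counters['branches']):.1%}")
```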

Performance Comparison


Comparison of the two microarchitectures


IPC Breakdown for a single 2-issue processor

Performance of a single 2-issue superscalar processor





IPC Breakdown for the 6-issue processor

Performance of the 6-issue superscalar processor




Performance of the 4x2-issue processor




Performance comparison of SS and MP


SS MP(Hydra)



Conclusion

In selecting the way to design microprocessors:

• A single-chip MP exploits parallelism more effectively at some levels than an SS microprocessor

• A single-chip MP architecture is more efficient than a superscalar architecture in the same physical space

• It provides up to 2x the performance on applications with higher levels of parallelism

Case examples

Intel MCPs (1)

[Figure: The Move to Intel Multi-core. Roadmap by platform (Itanium® processor, MP Server, DP Server / WS, Desktop Client, Mobile Client) across 2005, 2006 and 2007+. All products and dates are preliminary and subject to change without notice. Refer to ‘fact sheet’ for specific product timings.]

Figure 5.1: The move to Intel multi-core

Source: A. Loktu: Itanium 2 for Enterprise Computing

http://h40132.www4.hp.com/upload/se/sv/Itanium2forenterprisecomputing.pps

Intel MCPs (2)

Figure 5.2: Processor specifications of Intel’s Pentium D family (90 nm)

Source: http://www.intel.com/products/processor/index.htm

EIST: Enhanced Intel SpeedStep Technology

First delivered in Intel’s mobile and server platforms, it allows the system to dynamically adjust processor voltage and core frequency, which can result in decreased average power consumption and decreased average heat production.

VT: Virtualization Technology

It is a set of hardware enhancements to Intel’s server and client platforms that can improve the performance and robustness of traditional software-based virtualization solutions.

Virtualization solutions allow a platform to run multiple operating systems and applications in independent partitions. Using virtualization capabilities, one computer system can function as multiple "virtual" systems.

ED: Execute Disable Bit

Malicious buffer overflow attacks pose a significant security threat. In a typical attack, a malicious worm creates a flood of code that overwhelms the processor, allowing the worm to propagate itself to the network and to other computers.

Execute Disable Bit allows the processor to classify areas in memory by where application code can execute and where it cannot. When a malicious worm attempts to insert code in the buffer, the processor disables code execution, preventing damage and worm propagation. It can help prevent certain classes of malicious buffer overflow attacks when combined with a supporting operating system.
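The idea behind the Execute Disable Bit can be sketched with a toy memory model in which every page carries an executable flag that ordinary writes cannot change. This is an illustrative assumption-laden model only: the class and method names are hypothetical, and real hardware enforces the check in the page tables with operating system support.

```python
class ExecuteDisabledError(Exception):
    """Raised when code is fetched from a page marked non-executable."""


class Memory:
    def __init__(self):
        self.pages = {}  # page number -> (executable flag, contents)

    def map_page(self, page, contents, executable):
        self.pages[page] = (executable, contents)

    def write(self, page, contents):
        executable, _ = self.pages[page]
        self.pages[page] = (executable, contents)  # data writes keep the flag

    def fetch_instruction(self, page):
        executable, contents = self.pages[page]
        if not executable:
            raise ExecuteDisabledError(f"page {page} is marked no-execute")
        return contents


memory = Memory()
memory.map_page(0, "application code", executable=True)
memory.map_page(1, "input buffer", executable=False)
memory.write(1, "injected code")  # the overflow itself still succeeds
```

In this model the injected bytes land in the buffer page, but any attempt to fetch instructions from that page raises `ExecuteDisabledError` instead of running them.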

Intel MCPs (3)

Intel MCPs (4)

Figure 5.3: Processor specifications of Intel’s Pentium D family (65 nm)

Source: http://www.intel.com/products/processor/index.htm


Intel MCPs (5)

Figure 5.4: Specifications of Intel’s Pentium Processor Extreme Edition models 840/955/965

Source: http://www.intel.com/products/processor/index.htm


Intel MCPs (6)

Figure 5.5: Processor specifications of Intel’s Yonah Duo (Core Duo) family

Source: http://www.intel.com/products/processor/index.htm


Intel MCPs (7)

Figure 5.6: Specifications of Intel’s Core Processors

Source: http://www.intel.com/products/processor_number/chart/core2duo.htm

Intel MCPs (8)

Category        Code name                 Cores                   Cache                 Market introduction
Desktop         Kentsfield                Dual core, multi-die    4 MB                  Mid 2007
Desktop         Conroe                    Dual core, single die   4 MB shared           End 2006
Desktop         Allendale                 Dual core, single die   2 MB shared           End 2006
Desktop         Cedar Mill (NetBurst/P4)  Single core             512 kB, 1 MB, 2 MB    Early 2006
Desktop         Presler (NetBurst/P4)     Dual core, dual die     4 MB                  Early 2006
Desktop/Mobile  Millville                 Single core             1 MB                  Early 2007
Mobile          Yonah2                    Dual core, single die   2 MB                  Early 2006
Mobile          Yonah1                    Single core             1/2 MB                Mid 2006
Mobile          Stealey                   Single core             512 kB                Mid 2007
Mobile          Merom                     Dual core, single die   2/4 MB shared         End 2006
Enterprise      Sossaman                  Dual core, single die   2 MB                  Early 2006
Enterprise      Woodcrest                 Dual core, single die   4 MB                  Mid 2006
Enterprise      Clovertown                Quad core, multi-die    4 MB                  Mid 2007
Enterprise      Dempsey (NetBurst/Xeon)   Dual core, dual die     4 MB                  Mid 2006
Enterprise      Tulsa                     Dual core, single die   4/8/16 MB             End 2006
Enterprise      Whitefield                Quad core, single die   8 MB, 16 MB shared    Early 2008

Figure 5.7: Future 65 nm processors (overview)

Source: P. Schmid: Top Secret Intel Processor Plans Uncovered

http://www.tomshardware.com/2005/12/04/top_secret_intel_processor_plans_uncovered

Intel MCPs (9)

Category        Code name     Cores                   Cache                 Market introduction
Desktop         Wolfdale      Dual core, single die   3 MB shared           2008
Desktop         Ridgefield    Dual core, single die   6 MB shared           2008
Desktop         Yorkfield     8 cores, multi-die      12 MB shared          2008+
Desktop         Bloomfield    Quad core, single die   -                     2008+
Desktop/Mobile  Perryville    Single core             2 MB                  2008
Mobile          Penryn        Dual core, single die   3 MB, 6 MB shared     2008
Mobile          Silverthorne  -                       -                     2008+
Enterprise      Harpertown    8 cores, multi-die      12 MB shared          2008

Figure 5.8: Future 45 nm processors (overview)

Source: P. Schmid: Top Secret Intel Processor Plans Uncovered

http://www.tomshardware.com/2005/12/04/top_secret_intel_processor_plans_uncovered

Athlon 64 X2

Figure 5.9: AMD Athlon 64 X2 dual-core processor architecture

Source: AMD Athlon 64 X2 Dual-Core Processor for Desktop – Key Architecture Features

http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041.00.html

Sun’s UltraSPARC IV/IV+ (1)

Figure 5.10: UltraSPARC IV (Jaguar)

ARB: Arbiter

Source: C. Boussard: Architecture des processeurs

http://laser.igh.cnrs.fr/IMG/pdf/SUN-CNRS-archi-cpu-3.pdf

Sun’s UltraSPARC IV/IV+ (2)

Figure 5.11: UltraSPARC IV+ (Panther)

Source: C. Boussard: Architecture des processeurs

http://laser.igh.cnrs.fr/IMG/pdf/SUN-CNRS-archi-cpu-3.pdf

POWER4/POWER5 (1)

Figure 5.12: POWER4 chip logical view

Figure labels: Built-In Self-Test; Service Processor; Power-On Reset; Core Interface Unit (crossbar); Non-Cacheable Unit; Multi-Chip Module

Source: J.M. Tendler, S. Dodson, S. Fields, H. Le, B. Sinharoy: Power4 System Microarchitecture, IBM Server, Technical White Paper, October 2001

http://www-03.ibm.coom/servers/eserver/pseries/hardware/whitepapers/power4.pdf

POWER4/POWER5 (2)

Figure 5.13: POWER4 chip

Source: R. Kalla, B. Sinharoy, J. Tendler: Simultaneous Multi-threading Implementation in Power5 – IBM’s Next Generation POWER Microprocessor, 2003

http://www.hotchips.org/archives/hc15/3_Tue/11.ibm.pdf

POWER4/POWER5 (3)

Figure 5.14: POWER4 and POWER5 system structures

Source: R. Kalla, B. Sinharoy, J.M. Tendler: IBM Power5 chip: A Dual-core multithreaded Processor, IEEE. Micro, Vol. 24, No.2, March-April 2004, pp. 40-47.

Figure label: Fabric Controller

Cell

Figure 5.15: Cell (BE) microarchitecture

Source: IBM: „Cell Broadband Engine™ processor – based systems”, IBM corp. 2006

SPE: Synergistic Processing Element

EIB: Element Interconnect Bus

MFC: Memory Flow Controller

PPE: Power Processing Element

AUC: Atomic Update Cache

Cell (2)

Figure 5.16: Cell SPE architecture

Source: Blachford N.: „Cell Architecture Explained Version 2”, http://www.blachford.info/computer/Cell/Cell1_v2.html

Cell (3)

Figure 5.17: Cell floorplan

Source: Blachford N.: „Cell Architecture Explained Version 2”, http://www.blachford.info/computer/Cell/Cell1_v2.html

Issues and Challenges

• Memory Organization
  – Distributed? Shared?

• Interconnect and Communication Protocols

• Coherency and Consistency – Memory/Cache

• Scheduling, Load Balancing and Synchronization

• Reliability? Energy Efficiency?

• Will see all these through the rest of the course!

Next Week

• We will talk about Scheduling and Load Balancing issues, with respect to multiple processing nodes (in effect covers CMPs as well).

• We will also talk about the possibility of quizzes…
