communication-centric design robert mullins computer architecture group computer laboratory,...

30
Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11 th 2006)

Post on 21-Dec-2015

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

Communication-Centric Design

Robert MullinsComputer Architecture GroupComputer Laboratory, University of Cambridge(University of Twente, December 11th 2006)

Page 2: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

2

Convergence to flexible parallel architectures

• Power Efficient– Better match application

characteristics (streaming, coarse-grain parallelism…)

– Constraint-driven execution

• Simple– Increased regularity– S/W programmable– Limited core/tile set – Ease verification issues

• Flexible– Multi-use platform

FPGAsFPGAs

Embedded Embedded ProcessorsProcessorsGPUsGPUs

Multi-CoreMulti-CoreProcessorsProcessors

SoC PlatformsSoC Platforms

?

Page 3: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

3

Performance from Parallelism

• Shift to multi-core processors has begun.• Performance is boosted in complexity and power

effective manner• Latencies on-chip are much lower than in

previous parallel machines. This should make programming easier + new oppurtinities

• Easy to increase number of cores as underlying fabrication technology scales

• Simply an engineering problem?

Page 4: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

4

Multi-core Showstoppers!

• As the number of cores increase current interconnection network designs would require an unacceptable fraction of the chip power budget

• Attempts at exploiting low-latency on-chip communication are mitigated by complex network interfaces (old problem) and routers

• Today’s programming models provide little hope of efficiently exploiting a large degree of on-chip parallelism (for the average programmer) – related to N.I. problem

• Power constraints will be such that only a small percentage of the transistors may be active at once! Why are 100’s of identical cores useful?

• New interconnection network solutions are critical to making multi- and many-core chips work

• Power limits imply heterogeneity in some form?

Page 5: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

5

Our Group’s Research

• Now: support evolution of existing platforms – Low-latency and low-power

on-chip networks– System-timing

considerations– Networking communications

within FPGAs– Flexible networked SoC

systems, virtual IP– On-chip serial interconnects– Multi-wavelength optical

communication (off-chip)– Fault tolerant design

• Future:– Networks of processors to

processing networks– Processing Fabrics

FPGAsFPGAs

Embedded Embedded ProcessorsProcessorsGPUsGPUs

Multi-CoreMulti-CoreProcessorsProcessors

SoC PlatformsSoC Platforms

?

Page 6: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

6

Low-Latency Virtual-Channel Packet-Switched Routers

• Goal was to develop a virtual-channel network for a tiled processor architecture

• Collaboration with Krste Asanović’s SCALE group at MIT

• Problem faced is rising interconnect costs

• Networking communications can increase communication latencies by an order of magnitude or more!

Page 7: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

7

The Lochside test chip (2004/5)

• UMC 0.18um Process• 4x4 mesh network, 25mm2

• Single Cycle Routers (router + link = 1 clock)

• May be clocked by both traditional H-tree and DCG

• 4 virtual-channels/input• 80-bit links

– 64-bit data + 16-bit control

• 250MHz (worst-case PVT) 16Gb/s/channel (~35 FO4)

• Approx 5M transistors

TILE

TrafficGenerator, Debug &

Test

R

Mullins, West and Moore (ISCA’04, ASP-DAC’06)

Page 8: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

8

Virtual-Channel Flow Control

Page 9: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

9

Typical Router Pipeline

• Router pipeline depth limits minimum latency– Even under low traffic conditions– Can make packet buffers less effective– Incurs pipelining overheads

Page 10: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

10

Speculative Router Architecture

• VC and switch allocation may be performed concurrently:– Speculate that waiting packets will be successful in acquiring a VC– Prioritize non-speculative requests over speculative ones

Li-Shiuan Peh and William J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers”, In Proceedings HPCA’01, 2001.

Page 11: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

11

Single Cycle Speculative Router

Page 12: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

12

Page 13: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

13

Page 14: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

14

Single Cycle Router Architecture

• Once speculation mechanism is in place a range of accuracy/cycle-time trade-offs can be made– Blocked VC, pipeline and speculate – use low

priority switch scheduler– Switch and VC next request calculation

• Don’t bother calculating next switch requests just use current set. Safe to be pessimistic about what has been granted.

• Need to be more accurate for VC allocation

– Abort logic accuracy

Page 15: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

15

Single Cycle Router Architecture

• Decreasing accuracy often leads to poorer schedule and more aborts but reduces the router’s cycle time

• Impact of speculation on single cycle router:– 10% more cycles on average– clock period reduced by factor of 1.6– Network latency reduced by a factor of ~1.5

• Need to be careful about updating arbiter state correctly after speculation outcome is known

Page 16: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

16

Lochside Router Clock Period

100% standard cellFF/Clocking 23% (8.3 FO4)FIFOs/Control/Datapath 53% (19 FO4)Link 22% (7.9 FO4) range 4.6-7.9

• Could move to router/link pipeline• Option to pipeline control - maintaining single cycle best case• Impact of technology scaling• Scalability: doubling VCs to 8, only adds ~10% to cycle time

30-35 FO4 delays (~800MHz)

5-port router4 VCs per port64-bit links, ~1.5mm90nm technology

Page 17: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

17

Router Power Optimisation

• Local and global clock gating & signal gating– Global clock gating exploits early-request

signals from neighbouring routers– Slightly pessimistic (based on what is

requested not granted)– Factor 2-4 reduction power consumption

• Peak 0.15mW/Mhz (0.35 unopt.)

• Low Random 0.06mW/Mhz (0.27 unopt.)

Mullins, SoC’06

Page 18: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

18

• 22% Static power• 11% Inter-Router Links• ~1% Global Clock tree• 65% Dynamic Power

– Power Breakdown• ~50% local clock tree and input FIFOs• ~30% on router datapath• ~20% on scheduling and arbitration

(Low random traffic case)

Analysis of Power Consumption

} Due to increase as % as technology scales

Page 19: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

19

Distributed Clock Generator (DCG)

• Exploits self-timed circuitry to generate a clock in a distributed fashion

• Low-skew and low-power solution to providing global synchrony

• Mesh topology

• Simple proof of concept provided by Lochside test chip S. Fairbanks and S. Moore “Self-timed

circuitry for global clocking”, ASYNC’05

Page 20: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

20

Beyond global synchrony

• Clock distribution issues– Challenge as network is physically distributed

• Increasing process variation

• Synchronization– Core clock frequencies may vary, perhaps adaptively– Link and router DVS or other energy/perf. trade-offs

• Selecting a global network clock frequency– Run at maximum frequency continuously?– Use a multitude of network clock frequencies?– Select a global compromise?

Page 21: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

21

Beyond Global Synchrony

Synchronous Delay Insensitive

Global None

Timing Assumptions

Loca

l Rel

ative

Wire

Del

ay

Less Detection

Sub-S

yste

m

Loca

l

Isoc

hron

ic Fo

rks

Mul

tiple

clo

cks

Data-

Driven

and

Pausi

ble

Clock

sBun

dled

Dat

a

Qua

si-Del

ay

Inse

nsitiv

e

Local Clocks, Interaction with data (becoming

aperiodic)

• A complete spectrum of approaches to system-timing exist

Page 22: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

22

Data-Driven and Pausible Clocks

Mullins/Moore, ASYNC’07

Page 23: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

23

Example: AsAP project (UC Davis, 2006)

Yu et al, ISSCC’06

Page 24: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

24

Example: MAIA chip (Berkeley, 2000)

• GALS architecture, data-flow driven processing elements (“satellites”)

Zhang et al, ISSCC’00

Page 25: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

25

Data-Driven Clocking for On-Chip Routers

• Router should be clocked when one or more inputs are valid (or flits are buffered)

• Free running (paternoster) elevator– Chain of open compartments – Must synchronise before you jump on!

• Traditional elevator– Wait for someone to arrive– Close doors, decide who is in and who is out– Metastability issue again (potentially painful!)

Page 26: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

26

Data-Driven Clock Implementation

Local Clock Generator TemplateSample inputs

when at least one input is ready (and clock is low)

Assert Lock

Either admitted or locked out

(Close Lift Doors)

Incoming data

Page 27: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

27

Data-driven clocking benefits

Self-timed power gating?

DI barrier synchronisation and scheduling extensions

NO GLOBAL CLOCK

Page 28: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

28

Networks of processors to processing networks

• How will on-chip networks and cores evolve over time?FPGAsFPGAs

Embedded Embedded ProcessorsProcessorsGPUsGPUs

Multi-CoreMulti-CoreProcessorsProcessors

SoC PlatformsSoC Platforms

?

Page 29: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

29

Evolving towards processing networks– Network of Processors– Number of processors increase– Core architectures tailored to “many-core”

environment– Remove hard tile boundaries

• Why fix granularity of cores, communication and memory hierarchies?

– Move away from processor + router model• Everything is on the network, we need much thinner

network interfaces• Richer interconnection of components, increased

flexibility – Add network-based services

• Network aids collaboration, focuses resources, supports dynamic optimisations, scheduling, …

• Tailor virtual architecture to application– Processing Network or Fabric

Incre

as

ed flex

ibility to

o

ptim

ise c

om

pu

tation

a

nd

co

mm

un

icatio

n

Page 30: Communication-Centric Design Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge (University of Twente, December 11

30

Thank You.

• 2006 Workshop on On- and Off-Chip Interconnection Networks for Multicore Systems http://www.ece.ucdavis.edu/~ocin06/(Videos of talks and working group feedback are on-line)

• Computer Architecture: A Quantitative Approach, 4th Edition, Appendix E, entitled Interconnection Networks. http://ceng.usc.edu/smart/slides/appendixE.html

• Principles and Practices of Interconnection Networks by

William James Dally, Brian Patrick Towles • On-chip network bibliography and resources:

http://www.cl.cam.ac.uk/~rdm34/onChipNetBib/noc.html