The Elusive Metric for Low-Power Architecture Research

Center for Experimental Research in Computer Systems
Georgia Institute of Technology
Atlanta, GA 30332

Workshop for Complexity-Effective Design, San Diego, CA, 2003

Hsien-Hsin “Sean” Lee, Joshua B. Fryman, A. Utku Diril, Yuvraj S. Dhillon

2WCED-03

Background Picture

Energy-Delay product (EDP) [Gonzalez & Horowitz 96]
  “Power” is meaningless (∝ frequency)
  “Energy per instruction” is elusive (∝ CV²)
  “Energy × Delay” (J/SPEC or J IPC) is better
  Use Alpha-power model:
  \[ ED \propto \frac{C V_{dd}^{3}}{(V_{dd} - V_{th})^{\alpha}} \]
  Note that EDP has no “physical” meaning
Widespread adoption
  De facto standard in the community
  Metric for energy and complexity effectiveness
New architectural techniques have arrived
  New hardware exploiting low-power opportunities
  Temperature-aware power detectors
  Voltage & frequency scaling
  Multi-threshold voltage

Outline of the Talk

Potential pitfalls
  Yeah, we all know, it is obvious... but
  Which “E” goes in the ED product?
  Impact of new hardware (more transistors)
  Methodology matters in deep submicron processes
Observations
Summary

Calculating ED Product

New architecture solutions save energy at the expense of (insensitive) performance loss
A number of research results were reported in the following manner:
  Technique “X” for Data Cache
    Reduces Data Cache energy by 50%
    Loses 20% IPC
    EDP = (1-0.5)(1+0.2) = 0.60
    Very energy efficient
  Technique “Y” for Branch Predictor
    Reduces Branch Predictor energy by 10%
    Loses 20% IPC
    EDP = (1-0.1)(1+0.2) = 1.08
    Energy inefficient
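The arithmetic on this slide can be reproduced with a short sketch. It follows the slide's own convention of multiplying (1 - energy saved) by (1 + IPC loss); the function name `relative_edp` is a hypothetical helper introduced only for illustration.

```python
def relative_edp(energy_saved: float, ipc_loss: float) -> float:
    """Relative ED product in the slide's convention: (1 - E_saved) * (1 + IPC_loss).

    A value below 1 is read as energy efficient, above 1 as inefficient.
    """
    return (1.0 - energy_saved) * (1.0 + ipc_loss)


# Technique "X": 50% energy reduction, 20% IPC loss.
technique_x = relative_edp(0.50, 0.20)
# Technique "Y": 10% energy reduction, 20% IPC loss.
technique_y = relative_edp(0.10, 0.20)

print(f"Technique X relative EDP: {technique_x:.2f}")
print(f"Technique Y relative EDP: {technique_y:.2f}")
```

Note that both results treat the quoted energy reduction as if it applied to the whole energy budget, which is the assumption the rest of the talk questions.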

So What is E and What is D in EDP?

Hypothetical black box
  Battery (i.e., E) shared by CPU, DRAM, chipsets, graphics, TFT, Wi-Fi, HDD, flash disk
  D typically accounts for some system effect, such as DRAM latency
Improvement proposed:
  Remove 5% of E from flash disk
  No delay incurred
Is this a good design decision?
  Flash disk is 10% of total E in system
  Improvement amounts to 0.5% system impact
  “In-the-noise” improvement
  Is the “complexity” worth the effort?
So, is EDP used in the right way? And is EDP so important?

[Figure: a battery feeding flash, 802.11, Gfx card, C.S., DDR-DRAM, HDD, and TFT display]
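The 0.5% figure is the component saving scaled by the component's share of system energy. A minimal sketch of that scaling, using only the numbers from the slide; the function name `system_energy_saving` is a hypothetical helper.

```python
def system_energy_saving(component_share: float, component_saving: float) -> float:
    """Fraction of total system energy saved when one component saves energy.

    component_share:  component's fraction of total system energy (0..1)
    component_saving: fraction of the component's own energy removed (0..1)
    """
    return component_share * component_saving


# Flash disk: 10% of system energy, improvement removes 5% of flash energy.
flash_saving = system_energy_saving(0.10, 0.05)

# With no delay incurred, the system-level relative EDP barely moves.
system_relative_edp = (1.0 - flash_saving) * (1.0 + 0.0)

print(f"System energy saved: {flash_saving:.1%}")
print(f"System relative EDP: {system_relative_edp:.3f}")
```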

Energy Efficiency: E versus D

[Plot: Maximum Delay Tolerance (log scale, 0.0001 to 100) versus Power Distribution of a FU w.r.t. target system (0 to 1); curves for Esaved = 99%, 90%, 58%, 50%, 30%, 10%, 5%]
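One way to read these curves is as a break-even condition on the relative ED product. The derivation below is a sketch under that assumption. The symbols are introduced here and do not appear on the slides: p is the functional unit's share of system energy, s is the fraction of that unit's energy saved (Esaved), and d is the relative delay increase.

```latex
\begin{align*}
(1 - p\,s)(1 + d) &\le 1 \\
d_{max} &= \frac{1}{1 - p\,s} - 1 = \frac{p\,s}{1 - p\,s}
\end{align*}
```

For p = 0.43 and s = 0.58 this gives a tolerance of about 0.33, which agrees with the 33% delay tolerance quoted for the Filter Cache at CPU=100%.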

Example: Energy Efficiency: E vs. D

[Plot: Maximum Delay Tolerance (log scale, 0.0001 to 100) versus Energy Distribution w.r.t. target system (0 to 1); curves for Esaved = 99%, 90%, 58%, 50%, 30%, 10%, 5%; annotated point: tolerate ~25% performance loss]

Using EDP: Pentium Pro

Data Source: [Brooks et al. 00]
Assume 100% for CPU
40% IFU power reduction can tolerate < 10% performance loss

[Plot: Maximum Delay Tolerance (0 to 0.3) versus Energy Saved for a functional unit u (0 to 1); curves for IFU (22%), IEU (14%), ROB, DCU (11.1%), RS, FPU, Global Clock (7.9%), RAT, MOB (6.3%), BTB (4.7%)]
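The Pentium Pro claim can be checked numerically. The sketch below assumes the delay tolerance is the break-even point of the relative ED product, (1 - share × saved)(1 + delay) = 1, which is an interpretation rather than a formula stated on the slide. The function name `max_delay_tolerance` is hypothetical; the power shares are the legend values.

```python
# Share of CPU power per Pentium Pro unit, taken from the plot legend.
PENTIUM_PRO_SHARE = {
    "IFU": 0.22,
    "IEU": 0.14,
    "ROB": 0.111,
    "DCU": 0.111,
    "RS": 0.079,
    "FPU": 0.079,
    "Global Clock": 0.079,
    "RAT": 0.063,
    "MOB": 0.063,
    "BTB": 0.047,
}


def max_delay_tolerance(share: float, saved: float) -> float:
    """Largest relative delay increase that keeps relative EDP <= 1.

    Assumed model: (1 - share * saved) * (1 + delay) <= 1.
    """
    reduced = share * saved
    if not 0.0 <= reduced < 1.0:
        raise ValueError("share * saved must lie in [0, 1)")
    return reduced / (1.0 - reduced)


# 40% power reduction in the IFU, CPU assumed to be 100% of the system.
ifu_tolerance = max_delay_tolerance(PENTIUM_PRO_SHARE["IFU"], 0.40)
print(f"IFU, 40% saved: tolerates {ifu_tolerance:.1%} delay increase")

# The same 40% saving applied to every unit in turn.
for unit, share in PENTIUM_PRO_SHARE.items():
    print(f"{unit:12s} {max_delay_tolerance(share, 0.40):.1%}")
```

Under this reading the IFU result stays just below 10%, in line with the slide's statement.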

But CPU is not 100% of a System

[Plot: Maximum Delay Tolerance (%, 0 to 150) versus Energy Saving for a functional unit (0 to 1) and Energy Distribution w.r.t. CPU only (0 to 1); surfaces for CPU=100%, CPU=75%, CPU=50%, CPU=25%]

Case Study: Filter Cache [Kin et al. 97, 00]

The Filter Cache design as reported
  58% energy savings in “L1 Caches”
  21% IPC degradation
ED product as shown
  (1-0.58)(1+0.21) << 1 suggests this is a winning design
Question is “which E?”

Filter Cache: E Values

Use StrongARM 110
  43% of energy consumed by caches
    27% in I-cache
    16% in D-cache
CPU=X% stands for X% of overall power drawn by CPU
Delay tolerance
  33% : CPU=100%
  21% : CPU=70%
  14% : CPU=50%
  6% : CPU=25%
Not energy-efficient if CPU < 70%

[Plot: Maximum Delay Tolerance (0 to 1.4) versus energy distribution for a functional unit u w.r.t. CPU only (0 to 1); legend: FilterCache, CPU=100%, CPU=70%, CPU=50%, CPU=25%; marked point: FilterCache on SA-110 (I$+D$ = 43%), Esaved = 58% [Kin et al. 00], FC slowdown 21%]
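The delay tolerances listed for the Filter Cache can be approximated with a short sketch. It assumes the tolerance is the break-even point of the relative ED product once the cache saving is scaled by the cache share of CPU energy and the CPU share of system power; that model is an interpretation of the slide, and the function and variable names are hypothetical.

```python
CACHE_SHARE_OF_CPU = 0.43   # SA-110: I-cache 27% + D-cache 16%
ENERGY_SAVED = 0.58         # Filter Cache saving in the L1 caches [Kin et al. 00]
FC_SLOWDOWN = 0.21          # reported IPC degradation


def max_delay_tolerance(cpu_share: float) -> float:
    """Break-even delay increase when the CPU draws cpu_share of system power.

    Assumed model: (1 - cpu_share * cache_share * saved) * (1 + delay) <= 1.
    """
    reduced = cpu_share * CACHE_SHARE_OF_CPU * ENERGY_SAVED
    return reduced / (1.0 - reduced)


tolerance = {cpu: max_delay_tolerance(cpu) for cpu in (1.00, 0.70, 0.50, 0.25)}

for cpu, tol in tolerance.items():
    verdict = "efficient" if tol >= FC_SLOWDOWN else "not efficient"
    print(f"CPU={cpu:.0%}: tolerance {tol:.1%}, 21% slowdown is {verdict}")
```

With these inputs the 21% slowdown is tolerated only when the CPU accounts for roughly 70% or more of system power, which is the slide's conclusion.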

Rethinking EDP: Switching Activity vs. New Hardware

Ignore leakage and short-circuit power
  Dynamic switching power is dominant
The “E” would be as below
  T: Transistor count
  f: frequency

\[ P_{dyn} = a \cdot f \cdot C \cdot V_{dd}^{2} = a \cdot f \cdot T \cdot C_{g}^{avg} \cdot V_{dd}^{2} \]

\[ P_{dyn}^{ref} \ge P_{dyn}^{new} \]

\[ a_{ref} \cdot f \cdot T \ge a_{new} \cdot (f + \Delta f)(T + \Delta T) \]

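ED Variables

The elegant ratio governing E…

\[ \frac{a_{ref}}{a_{new}} \ge 1 + \frac{\Delta f}{f} + \frac{\Delta T}{T} + \frac{\Delta f \, \Delta T}{f \, T} \]

To include the application delay, D…

\[ \frac{a_{ref}}{a_{new}} \ge \left(1 + \frac{\Delta f}{f}\right) \left(1 + \frac{\Delta T}{T}\right) \left(1 + \frac{\Delta D}{D}\right)^{2} \]

Can be applied to macromodeling to determine the trade-off between transistor count and performance degradation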
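The trade-off between transistor count and delay can be explored numerically. The sketch below assumes the break-even condition a_ref/a_new ≥ (1 + Δf/f)(1 + ΔT/T)(1 + ΔD/D)², which is one reading of the ratio on this slide (dynamic power times delay squared); the function names are hypothetical and not taken from the talk.

```python
import math


def switching_budget(switching_reduction: float) -> float:
    """Ratio a_ref / a_new for a given fractional reduction in switching activity."""
    if not 0.0 <= switching_reduction < 1.0:
        raise ValueError("switching_reduction must lie in [0, 1)")
    return 1.0 / (1.0 - switching_reduction)


def max_transistor_growth(switching_reduction: float,
                          freq_change: float = 0.0,
                          delay_change: float = 0.0) -> float:
    """Largest relative transistor increase that still breaks even on ED (assumed model)."""
    budget = switching_budget(switching_reduction)
    return budget / ((1.0 + freq_change) * (1.0 + delay_change) ** 2) - 1.0


def max_delay_growth(switching_reduction: float,
                     transistor_change: float,
                     freq_change: float = 0.0) -> float:
    """Largest relative delay increase that still breaks even on ED (assumed model)."""
    budget = switching_budget(switching_reduction)
    return math.sqrt(budget / ((1.0 + freq_change) * (1.0 + transistor_change))) - 1.0


for reduction in (0.30, 0.25, 0.10):
    growth = max_transistor_growth(reduction)
    print(f"{reduction:.0%} switching reduced: up to {growth:.1%} more transistors "
          f"at unchanged frequency and delay")
```

Under this assumed model, a 30% switching reduction leaves room for roughly 43% more transistors when frequency and delay are unchanged.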

Impact of Additional Transistor Count

Given a new avg switching probability of the new architecture
  LHS: Trading transistors with delay given no freq. scaling
  RHS: Delay recovered by freq. scaling

[Plots: two panels with curves for 30%, 25%, and 10% switching reduced. Left: % Impact on D (0 to 50) versus % Impact on T, given freq. unchanged (-35 to 45). Right: % Impact on f (0 to 50) versus % Impact on T, given delay unchanged by frequency scaling (0 to 50)]

Role of Leakage Energy

As the Deep Sub-Micron (DSM) era is upon us...
  More than 50% of power from leakage (Source: Intel Corp., Custom Integrated Circuits Conference 2002)
Ignoring leakage could reverse the conclusion
Early architecture evaluation
  Leakage cannot be isolated from switching during evaluation
  Additional HW can be harmful

Evaluate the Leakage when adding HW in Early Stage of Arch Definition

Example: Dual-speed pipeline [Pyreddy and Tyson 01]
  Idea appears to be plausible
  Identify critical instructions [Tune et al. 01] [Seng et al. 01]
  Two datapaths: fast and slow
    Critical inst to fast pipe; remainder to slow
    Slow pipe consumes less E than fast pipe
    E.g., multi-voltage supply, lower frequency
Let’s evaluate and assume:
  N instructions: x to slow datapath, (N-x) to fast datapath
How does leakage impact efficiency?
What x value is needed to achieve energy efficiency?

[Figure: two datapaths, slow and fast; x% of instructions (non-critical) go to the slow path, (1-x)% (critical) go to the fast path]

Dual Datapath Leakage Impact

[Plot: Minimum instructions to Slow Datapath (0 to 0.5) versus Static-to-Total Energy Ratio (0 to 1); curves for r = 0.9, 0.75, 0.60, 0.5, 0.4, 0.2; markers “Today” and “Soon to be”]

“r” is the power ratio of slow vs. fast
A small r impairs performance
  Slow path becomes critical path
% of non-critical inst needed for slow datapath
  Today: ~17%
  Soon: ~40%

Energy Savings vs. # Inst of Slow Path

[Plots: two panels, r = 75% and r = 50%; curves for Static-to-Total = 1%, 20%, 33%, 50%, 67%, 75%]

X-axis: % of instructions to non-critical datapath (0 to 0.4)
Y-axis: % energy saved (-60 to 20)
If 30% of instructions are sent to the non-critical datapath
  Only ~5% energy is saved (savings only on datapath) in DSM for r=75%
  More energy is consumed in DSM for r=50%
Does the extra complexity pay off?

Observations

It is insufficient to examine ED product on a microscale; the entire system must be examined.
Adding HW complexity for low energy needs to be evaluated thoroughly
  If the target process is not DSM, ED product can be examined via simplified ratio analysis
  For DSM process
    Leakage must be accounted for in local and system E
    Additional HW could be overkill

Summary

Low-power architecture research:
  Metric could be elusive
  Methodology
    More susceptible to reverse conclusions than performance research, if not meticulously applied
  2nd order effect today, 1st order effect tomorrow
  “Complexity” can be ineffective in energy reduction
Purposes of our study
  Provide analytical models and methodology for early evaluation
  No intention to invalidate prior results
    WCED WDDD
  Raise more discussions
  To get it right in education

That’s All Folks!
