TRANSCRIPT
The Elusive Metric for Low-Power Architecture Research
Center for Experimental Research in Computer Systems
Georgia Institute of Technology
Atlanta, GA 30332
Workshop for Complexity-Effective Design, San Diego, CA, 2003
Hsien-Hsin “Sean” Lee Joshua B. Fryman
A. Utku Diril Yuvraj S. Dhillon
2WCED-03
Background Picture
Energy-Delay product (EDP) [Gonzalez & Horowitz 96]
  “Power” is meaningless (∝ frequency)
  “Energy per instruction” is elusive (∝ CV²)
  “Energy Delay” (J/SPEC or J IPC) is better
  Use the alpha-power model: $ED \propto \frac{C\,V_{dd}^{3}}{(V_{dd} - V_{th})^{\alpha}}$
  Note that EDP has no “physical” meaning
Widespread adoption
  De facto standard in the community
  Metric for energy and complexity effectiveness
New architectural techniques have arrived
  New hardware exploiting low-power opportunities
  Temperature-aware power detectors
  Voltage & frequency scaling
  Multi-threshold voltage
3WCED-03
Outline of the Talk
Potential pitfalls
  Yeah, we all know, it is obvious... but
  Which “E” goes in the ED product?
  Impact of new hardware (more transistors)
  Methodology matters in deep submicron processes
Observations
Summary
4WCED-03
Calculating ED Product
New architecture solutions save energy at the expense of (insensitive) performance loss
A number of research results were reported in the following manner:
Technique “X” for Data Cache
  Reduce 50% energy of Data Cache
  Lose 20% IPC
  EDP = (1-0.5)(1+0.2) = 0.60
  Very energy efficient
Technique “Y” for Branch Predictor
  Reduce 10% energy of Branch Predictor
  Lose 20% IPC
  EDP = (1-0.1)(1+0.2) = 1.08
  Energy inefficient
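The arithmetic on this slide can be sketched in a few lines of Python. The function name `edp_ratio` is hypothetical, and the delay factor follows the slide's own convention of treating a 20% IPC loss as a 1.2x delay, which is an approximation rather than an exact 1/IPC conversion.

```python
def edp_ratio(energy_saved, ipc_loss):
    """Relative energy-delay product after applying a technique.

    energy_saved: fraction of energy removed (0.5 means 50% less energy).
    ipc_loss: fractional slowdown, applied as a (1 + ipc_loss) delay factor.
              This mirrors the slide's convention and is an assumption here.
    """
    return (1.0 - energy_saved) * (1.0 + ipc_loss)


# Technique "X" (data cache) and technique "Y" (branch predictor) from the slide.
edp_x = edp_ratio(0.50, 0.20)
edp_y = edp_ratio(0.10, 0.20)

for name, value in (("X", edp_x), ("Y", edp_y)):
    verdict = "energy efficient" if value < 1.0 else "energy inefficient"
    print(f"Technique {name}: EDP = {value:.2f} ({verdict})")
```

A ratio below 1 is read as a win and a ratio above 1 as a loss, but only with respect to whichever energy total the "E" refers to.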
5WCED-03
So What is E and What is D in EDP?
Hypothetical black box
  Battery (i.e. E) shared by CPU, DRAM, chipsets, graphics, TFT, Wi-Fi, HDD, flash disk
  D typically accounts for some system effect, such as DRAM latency
Improvement proposed: remove 5% of E from flash disk
  No delay incurred
Is this a good design decision?
  Flash disk is 10% of total E in system
  Improvement amounts to 0.5% system impact
  “In-the-noise” improvement
  Is the “complexity” worth the effort?
So, is EDP used in the right way? And is EDP so important?
[Figure: a battery powering flash, 802.11, Gfx card, C.S., DDR-DRAM, HDD, and TFT display]
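The flash-disk example reduces to one multiplication. As a minimal sketch (the function name `system_energy_saving` is hypothetical), a local saving is scaled by the component's share of total system energy before it enters the system-level EDP:

```python
def system_energy_saving(component_share, local_saving):
    """Fraction of total system energy saved.

    component_share: the component's share of total system energy (0..1).
    local_saving: fraction of the component's own energy that is removed.
    """
    return component_share * local_saving


flash_share = 0.10   # flash disk is 10% of total E (from the slide)
flash_saving = 0.05  # proposal removes 5% of the flash disk's E (from the slide)

system_saving = system_energy_saving(flash_share, flash_saving)
# No delay is incurred, so the delay factor is 1.0.
system_edp = (1.0 - system_saving) * 1.0

print(f"System energy saved: {system_saving:.1%}")
print(f"System EDP ratio:    {system_edp:.3f}")
```

Measured locally, the same change looks like a 5% energy win at zero delay cost; measured at the battery it is a 0.5% change.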
6WCED-03
Energy Efficiency: E versus D
[Figure: maximum delay tolerance (log scale, 0.0001 to 100) versus power distribution of a FU w.r.t. target system (0 to 1); one curve each for Esaved = 99%, 90%, 58%, 50%, 30%, 10%, 5%]
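One way to read these curves, assuming they mark the break-even point at which the system-level ED product is unchanged: let p be the fraction of system energy drawn by the functional unit, s the fraction of that unit's energy saved, and d the fractional delay increase. All three symbols are introduced here for illustration and do not appear on the slide.

```latex
(1 - s\,p)(1 + d) \le 1
\quad\Longrightarrow\quad
d \le d_{\max} = \frac{1}{1 - s\,p} - 1 = \frac{s\,p}{1 - s\,p}
```

Under this reading, a unit that draws 10% of system energy and saves half of it (s p = 0.05) tolerates only about 5% slowdown, since 0.05 / 0.95 is roughly 0.053.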
7WCED-03
Example: Energy Efficiency: E vs. D
[Figure: maximum delay tolerance (log scale, 0.0001 to 100) versus energy distribution w.r.t. target system (0 to 1); one curve each for Esaved = 99%, 90%, 58%, 50%, 30%, 10%, 5%; annotated point: tolerate ~25% performance loss]
8WCED-03
Using EDP: Pentium Pro
Data Source: [Brooks et al. 00]
Assume 100% for CPU
40% IFU power reduction can tolerate < 10% performance loss
[Figure: maximum delay tolerance (0 to 0.3) versus energy saved for a functional unit u (0 to 1); one curve each for IFU (22%), IEU (14%), ROB, DCU (11.1%), RS, FPU, Global Clock (7.9%), RAT, MOB (6.3%), BTB (4.7%)]
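The IFU claim can be checked numerically. This is a sketch that assumes the curves plot the break-even slowdown at which (1 - share × saved)(1 + delay) = 1; the function name `max_delay_tolerance` is hypothetical, and the shares are copied from the figure legend as labeled.

```python
# Power share of each unit group, as listed in the figure legend.
# The slide assumes the CPU is 100% of the system.
UNIT_SHARE = {
    "IFU": 0.22,
    "IEU": 0.14,
    "ROB, DCU": 0.111,
    "RS, FPU, Global Clock": 0.079,
    "RAT, MOB": 0.063,
    "BTB": 0.047,
}


def max_delay_tolerance(unit_share, energy_saved):
    """Largest fractional slowdown that keeps the ED product at or below 1.

    Assumes break-even at (1 - unit_share * energy_saved) * (1 + delay) = 1.
    """
    saved = unit_share * energy_saved
    return saved / (1.0 - saved)


ifu_tolerance = max_delay_tolerance(UNIT_SHARE["IFU"], 0.40)
print(f"IFU with 40% energy saved tolerates {ifu_tolerance:.1%} slowdown")

# The same 40% local saving applied to each unit group in turn.
for unit, share in UNIT_SHARE.items():
    print(f"{unit:<22} {max_delay_tolerance(share, 0.40):.1%}")
```

The smaller a unit's share, the less slowdown an identical local saving can justify.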
9WCED-03
But CPU is not 100% of a System
[Figure: maximum delay tolerance (0 to 150) versus energy saving for a functional unit (0 to 1) and energy distribution of the functional unit w.r.t. CPU only (0 to 1); one series each for CPU = 100%, 75%, 50%, 25%]
10WCED-03
Case Study: Filter Cache [Kin et al. 97, 00]
The Filter Cache design as reported
  58% energy savings in “L1 Caches”
  21% IPC degradation
  ED product as shown: (1-0.58)(1+0.21) << 1 suggests this is a winning design
Question is “which E?”
11WCED-03
Filter Cache: E Values
Use StrongARM 110
  43% of energy consumed by caches
  27% in I-cache
  16% in D-cache
CPU=X% stands for X% of overall power drawn by CPU
Delay tolerance
  33% : CPU=100%
  21% : CPU=70%
  14% : CPU=50%
  6% : CPU=25%
Not energy-efficient if CPU < 70%
[Figure: maximum delay tolerance (0 to 1.4) versus energy distribution for a functional unit u w.r.t. CPU only (0 to 1); curves for CPU = 100%, 70%, 50%, 25%; marked point: filter cache on SA-110 (I$+D$ = 43%), Esaved = 58% [Kin et al. 00], FC slowdown 21%]
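The delay-tolerance figures quoted for the filter cache can be approximately reproduced with a short sketch. It assumes the tolerance is the break-even slowdown of the system-level ED product, with the cache saving scaled first by the caches' share of CPU energy and then by the CPU's share of system energy. The function name `delay_tolerance` is hypothetical.

```python
CACHE_SHARE_OF_CPU = 0.43  # I$ + D$ share of SA-110 energy (from the slide)
E_SAVED = 0.58             # reported filter cache energy saving in L1 caches
SLOWDOWN = 0.21            # reported filter cache IPC degradation


def delay_tolerance(cpu_share):
    """Break-even slowdown when the CPU draws cpu_share of system energy.

    Assumes (1 - system_saving) * (1 + delay) = 1 at break-even.
    """
    system_saving = cpu_share * CACHE_SHARE_OF_CPU * E_SAVED
    return system_saving / (1.0 - system_saving)


for cpu_share in (1.00, 0.70, 0.50, 0.25):
    tolerance = delay_tolerance(cpu_share)
    verdict = "efficient" if SLOWDOWN <= tolerance else "not efficient"
    print(f"CPU={cpu_share:.0%}: tolerance {tolerance:.1%} -> {verdict}")
```

Under these assumptions the results land close to the slide's 33%, 21%, 14%, and 6%, with the 21% slowdown sitting right at break-even when the CPU draws about 70% of system power.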
12WCED-03
Rethinking EDP: Switching Activity vs. New Hardware
Ignore leakage and short-circuit power
  Dynamic switching power is dominant
The “E” would be as below
  $P_{dyn} = a \cdot f \cdot C \cdot V_{dd}^{2} = a_{avg} \cdot f \cdot T \cdot C_{g} \cdot V_{dd}^{2}$
  $\frac{P_{dyn}^{new}}{P_{dyn}^{ref}} = \frac{a_{new}\,(f + \Delta f)(T + \Delta T)}{a_{ref}\, f\, T}$
  T: transistor count, f: frequency
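To make the switching-versus-hardware trade-off concrete, here is a minimal sketch that assumes dynamic power scales as activity × frequency × transistor count, with supply voltage and per-gate capacitance held fixed. Both function names are hypothetical.

```python
def dynamic_power_ratio(activity_ratio, delta_f=0.0, delta_t=0.0):
    """New-to-reference dynamic power ratio.

    activity_ratio: a_new / a_ref, the ratio of average switching activity.
    delta_f: fractional change in frequency (0.1 means 10% faster).
    delta_t: fractional change in transistor count.
    Assumes V_dd and per-gate capacitance are unchanged.
    """
    return activity_ratio * (1.0 + delta_f) * (1.0 + delta_t)


def break_even_transistor_growth(activity_ratio, delta_f=0.0):
    """Transistor growth at which dynamic power returns to the reference."""
    return 1.0 / (activity_ratio * (1.0 + delta_f)) - 1.0


for reduction in (0.30, 0.25, 0.10):
    growth = break_even_transistor_growth(1.0 - reduction)
    print(f"{reduction:.0%} less switching pays for up to "
          f"{growth:.1%} more transistors at the same frequency")
```

With frequency unchanged, a 30% reduction in switching activity is cancelled out once the added hardware grows the transistor count by roughly 43%.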
13WCED-03
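ED Variables
The elegant ratio governing E…
  $\frac{a_{new}}{a_{ref}} \left(1 + \frac{\Delta f}{f} + \frac{\Delta T}{T} + \frac{\Delta f\,\Delta T}{f\,T}\right)$
To include the application delay, D…
  $\frac{a_{new}}{a_{ref}} \left(1 + \frac{\Delta f}{f}\right)\left(1 + \frac{\Delta T}{T}\right)\left(1 + \frac{\Delta D}{D}\right)^{2}$
Can be applied to macromodeling to determine the trade-off between transistor count and performance degradation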
14WCED-03
Impact of Additional Transistor Count
Given a new avg switching probability of the new architecture
  LHS: trading transistors with delay, given no freq. scaling
  RHS: delay recovered by freq. scaling
[Figure, left panel: % impact on D (0 to 50) versus % impact on T (-35 to 45), given freq. unchanged; curves for 30%, 25%, and 10% switching reduced]
[Figure, right panel: % impact on f (0 to 50) versus % impact on T (0 to 50), given delay unchanged by frequency scaling; curves for 30%, 25%, and 10% switching reduced]
15WCED-03
Role of Leakage Energy
As the Deep Sub-Micron (DSM) era is upon us...
  Ignoring leakage could reverse conclusions
Early architecture evaluation
  Leakage cannot be isolated from switching during evaluation
  Additional HW can be harmful
[Figure: more than 50% of power from leakage. Source: Intel Corp., Custom Integrated Circuits Conference 2002]
16WCED-03
Evaluate the Leakage when adding HW in Early Stage of Arch Definition
Example: Dual-speed pipeline [Pyreddy and Tyson ’01]
  Idea appears to be plausible
  Identify critical instructions [Tune et al. 01] [Seng et al. 01]
  Two datapaths: fast and slow
  Critical inst to fast pipe; remainder to slow
  Slow pipe consumes less E than fast pipe
    E.g. multi-voltage supply, lower frequency
Let’s evaluate and assume:
  N instructions; x to slow datapath, (N-x) to fast datapath
  How does leakage impact efficiency?
  What x value achieves energy efficiency?
[Figure: instruction stream split between the slow datapath (x% inst, non-critical) and the fast datapath (1-x% inst, critical)]
17WCED-03
Dual Datapath Leakage Impact
“r” is power ratio of slow vs. fast
A small r impairs performance
  Slow path becomes critical path
[Figure: minimum instructions to slow datapath (0 to 0.5) versus static-to-total energy ratio (0 to 1); curves for r = 0.9, 0.75, 0.60, 0.5, 0.4, 0.2; markers “Today” and “Soon to be” on the x-axis]
18WCED-03
Dual Datapath Leakage Impact
“r” is power ratio of slow vs. fast
A small r impairs performance
  Slow path becomes critical path
% of non-critical inst needed for slow datapath
  Today: ~17%
  Soon: ~40%
[Figure: minimum instructions to slow datapath (0 to 0.5) versus static-to-total energy ratio (0 to 1); curves for r = 0.9, 0.75, 0.60, 0.5, 0.4, 0.2; markers “Today” and “Soon to be” on the x-axis]
19WCED-03
Energy Savings v. # Inst of Slow Path
X-axis: % of instructions to non-critical datapath
Y-axis: % energy saved
If 30% of instructions are sent to the non-critical datapath
  Only save ~5% energy (savings only on datapath) in DSM for r=75%
  Consume more energy in DSM for r=50%
Does the extra complexity pay off?
[Figure: two panels, r = 75% and r = 50%; % energy saved (-60 to 20) versus fraction of instructions to non-critical datapath (0 to 0.4); curves for Static-to-Total = 1%, 20%, 33%, 50%, 67%, 75%]
20WCED-03
Observations
It is insufficient to examine ED product on a microscale; the entire system must be examined.
Adding HW complexity for low energy needs to be evaluated thoroughly
  If the target process is not DSM, ED product can be examined via simplified ratio analysis
  For DSM process
    Leakage must be accounted for in local and system E
    Additional HW could be an overkill
21WCED-03
Summary
Low-power architecture research:
  Metric could be elusive
  Methodology
    More susceptible to reverse conclusions than performance research, if not meticulously applied
    2nd order effect today, 1st order effect tomorrow
  “Complexity” can be ineffective in energy reduction
Purposes of our study
  Provide analytical models and methodology for early evaluation
  No intention to invalidate prior results
  WCED WDDD
  Raise more discussions
  To get it right in education