TRANSCRIPT

ECE668 Part 1
Csaba Andras Moritz
University of Massachusetts, Dept. of Electrical & Computer Engineering

Computer Architecture ECE 668
Part 1: Introduction
Coping with ECE 668
- Students with varied backgrounds
- Prerequisites: Basic Computer Architecture, VLSI
- 2 projects to choose from, some flexibility beyond that; you need software and/or Verilog/HSPICE skills to complete them
- 2 exams: midterm and final
- Class participation; attend office hours
- About the instructor
- First lectures: review of Performance and Pipelining (Chapter 1 + Appendix A)
- Many lectures will use the whiteboard as well as slides; lectures relate to the textbook and beyond (many lectures are outside the textbook)
- Web: www.ecs.umass.edu/ece/andras/courses/ECE668/
What you should know
- Basic machine structure: processor (datapath, control, arithmetic), memory, I/O
- Read and write in an assembly language, C, C++, ...; MIPS/ARM ISA preferred
- Understand the concepts of pipelining and virtual memory
- Basic VLSI: HSPICE and/or Verilog
Textbook and references
- Textbook: D.A. Patterson and J.L. Hennessy, Computer Architecture: A Quantitative Approach, 4th edition (or later), Morgan Kaufmann.
- Recommended reading: J.P. Shen and M.H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill, 2005.
- Chandrakasan et al., Design of High-Performance Microprocessor Circuits.
- NASIC research papers and Nanoelectronics textbook chapter; SKYBRIDGE, N3ASIC, CMOL, FPNI, SPWF papers if interested.
- Other research papers we bring up in class.
Course Outline
I. Introduction (Ch. 1)
II. Pipeline Design (App. A)
III. Instruction-Level Parallelism, Pipelining (App. A, Ch. 2)
IV. Memory Design: Memory Hierarchy, Cache Memory, Secondary Memory (Ch. 4)
V. Multiprocessors (Ch. 3)
VI. Deep Submicron Implementation: Process Variation, Power-Aware Architectures, the Compiler's Role
VII. Nanoscale Architectures
Administrative Details
Instructor: Prof. Csaba Andras Moritz, KEB 309H
Email: [email protected]
Office Hours: 2:30-3:30 pm Tues. & 2:30-3 pm Thur.
TA: pending
Course web page: details available at http://www.ecs.umass.edu/ece/andras/courses/ECE668
Grading
- Midterm I: 35%
- Project: 30% (two projects to choose from)
- Class Participation: 5%
- Final Exam: 30%
- Homework: exam questions
What is "Computer Architecture"?

Computer Architecture =
  Instruction Set Architecture +
  Machine Organization (e.g., pipelining, memory hierarchy, storage systems, etc.)
  or unconventional organization

Examples: IBM 360 (minicomputer, mainframe, supercomputer); Intel x86 vs. ARM vs. nanoprocessors
Computer Architecture Topics - Processors
[Diagram spanning the system stack:
- Instruction Set Architecture
- Instruction-Level Parallelism: pipelining, hazard resolution, superscalar, reordering, branch prediction, VLIW, vector
- Memory Hierarchy: addressing, L1 cache, L2 cache, DRAM; interleaving, bus protocols; bandwidth, latency
- Input/Output and Storage: disks, tape; RAID performance and reliability
- VLSI]
Advanced CMOS multi-cores & nanoprocessors? (2013)
Scaling
Figure courtesy of Intel; Copyright – Baskaran Ganesan, Intel Higher Education Program
Shrinking geometry
Figure courtesy of Intel; Copyright – Baskaran Ganesan, Intel Higher Education Program
Die

Wafer
CPUs: Archaic (Nostalgic) vs. Semi-Modern vs. Modern?

1982 Intel 80286: 12.5 MHz; 2 MIPS (peak); latency 320 ns; 134,000 transistors in 47 mm2; 16-bit data bus, 68 pins; microcode interpreter, separate FPU chip, no caches

2001 Intel Pentium 4: 1500 MHz (120X); 4500 MIPS peak (2250X); latency 15 ns (20X); 42,000,000 transistors in 217 mm2; 64-bit data bus, 423 pins; 3-way superscalar, dynamic translation to RISC, superpipelined (22 stages), out-of-order execution; on-chip 8KB data cache, 96KB instruction trace cache, 256KB L2 cache

2015?
Multi-core = Network on a chip
Everything you learn as CSE students is applied and integrated in a chip!
Intel Polaris with 80 cores
Figure courtesy of Intel; Copyright – Baskaran Ganesan, Intel Higher Education Program
Tilera processor with 64 cores
MIT startup from the Raw project (I used to be involved in this)
What is next: Nanoprocessors? Molecular memory, NASIC processors, 3D?

[Figure: NASIC ALU - opcode, operanda, operandb, dest, result, rf3~0, 2-4 decoders, adder/multiplier datapath. Copyright: NASIC group, UMASS]
[Figure: Crossed NW devices, courtesy of Prof. Chui's group at UCLA]
From Nanodevices to Nanocomputing
Array-based Circuits with Built-in Fault-tolerance (NASICs)

[Figure: a nanoprocessor built on crossed nanowire arrays; inputs a0-a3 and b0-b3, sums s0-s3, carries c0-c4, with Up/Down/clk streaming signals. Evaluation/cascading: streaming control with surrounding microwires. Device cross-section: n+ source & drain, n+ gate, p-channel]
NASICs Fabric-Based Architectures

Wire Streaming Processor - general-purpose stream processor:
- 5-stage pipeline with minimal feedback
- Built-in fault tolerance: up to 10% device-level defect rates
- 33X density advantage vs. 16nm scaled CMOS
- Simpler manufacturing
- ~9X improved power-per-performance efficiency (rough estimate)

Cellular Architecture - special purpose for image and signal processing:
- Massively parallel array of identical interacting simple functional cells
- Fully programmable from external template signals
- 22X denser than 16nm scaled CMOS
N3ASIC - 3D Nanowire Technology

N3P - Hybrid Spin-Charge Platform

Skybridge 3D Circuits - Vertically Integrated
- 3D circuit concept and 1-bit full adder
- Designed in my group
- FETs are gate-all-around on vertical nanowires
Example ISAs in Processors (Instruction Set Architectures)
- Intel x86 (8086, 80286, 80386, 80486, Pentium, MMX, ...) - 1978
- ARM (32-bit, 64-bit, v8) - 1985
- HP PA-RISC (v1.1, v2.0) - 1986
- MIPS (MIPS I, II, III, IV, V) - 1986
- Sun SPARC (v8, v9) - 1987
- Digital Alpha (v1, v3) - 1992
RISC vs. CISC
Basics
Let us review some basics
RISC ISA Encoding Example
Virtualized ISAs
- BlueRISC TrustGUARD: the ISA is randomly created internally
- Fluid: more than one ISA possible
Characteristics of RISC
- Only load/store instructions access memory
- A relatively large number of registers

Goals of new computer designs:
- Higher performance
- More functionality (e.g., MMX)
- Other design objectives? (examples)
How to measure performance?
- Time to run the task: execution time, response time, latency
  - Performance may be defined as 1 / Ex_Time
- Throughput, bandwidth
Speedup

performance(X) = 1 / execution_time(X)

"Y is n times faster than X" means:

n = speedup = Execution_time(old, brand X) / Execution_time(new, brand Y)

Speedup must be greater than 1: with Tx = 3 and Ty = 2,
speedup = Tx/Ty = 3/2 = 1.5, not Ty/Tx = 2/3 = 0.67.
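The definition above can be checked with a short sketch (Python used here for illustration):

```python
def speedup(time_old, time_new):
    """n = ExTime(old) / ExTime(new); 'Y is n times faster than X'."""
    return time_old / time_new

# The slide's numbers: Tx = 3, Ty = 2
print(speedup(3, 2))     # 1.5 -- old over new
print(round(2 / 3, 2))   # 0.67 -- the inverted ratio is not a speedup
```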
MIPS and MFLOPS
- MIPS (Million Instructions Per Second): can we compare two different CPUs using MIPS?
- MFLOPS (Million Floating-point Operations Per Second): application dependent (e.g., on the compiler); still useful for benchmarks
- Benchmarks: e.g., SPEC CPU 2000: 26 applications (with inputs)
  - SPECint2000: twelve integer, e.g., gcc, gzip, perl
  - SPECfp2000: fourteen floating-point intensive, e.g., equake
SPEC CPU 2000 (www.specbench.org/cpu200)

SPECint2000 (Benchmark / Language / Category):
164.gzip / C / Compression
175.vpr / C / FPGA Circuit Place & Route
176.gcc / C / C Compiler
181.mcf / C / Combinatorial Optimization
186.crafty / C / Game Playing: Chess
197.parser / C / Word Processing
252.eon / C++ / Computer Visualization
253.perlbmk / C / PERL Programming Language
254.gap / C / Group Theory, Interpreter
255.vortex / C / Object-oriented Database
256.bzip2 / C / Compression
300.twolf / C / Place and Route Simulator

SPECfp2000 (Benchmark / Language / Category):
168.wupwise / Fortran77 / Quantum Chromodynamics
171.swim / Fortran77 / Shallow Water Modeling
172.mgrid / Fortran77 / Multi-grid Solver
173.applu / Fortran77 / Partial Differential Equations
177.mesa / C / 3-D Graphics Library
178.galgel / Fortran90 / Fluid Dynamics
179.art / C / Image Recognition (Neural Nets)
183.equake / C / Seismic Wave Propagation
187.facerec / Fortran90 / Face Recognition
188.ammp / C / Computational Chemistry
189.lucas / Fortran90 / Primality Testing
191.fma3d / Fortran90 / Finite-element Crash Simulation
200.sixtrack / Fortran77 / Nuclear Physics Accelerator Design
301.apsi / Fortran77 / Meteorology: Pollutant Distribution
SPEC CPU 2006 (still current)
Other Benchmarks (www.spec.org)

Workload Category / Example Benchmark Suite:
- CPU - Uniprocessor: SPEC CPU 2006, Java Grande Forum Benchmarks, SciMark, ASCI
- CPU - Parallel Processor: SPLASH, NASPAR
- Multimedia: MediaBench
- Embedded: EEMBC benchmarks
- Digital Signal Processing: BDTI benchmarks
- Java - Client side: SPECjvm98, CaffeineMark
- Java - Server side: SPECjBB2000, VolanoMark
- Java - Scientific: Java Grande Forum Benchmarks, SciMark
- On-Line Transaction Processing: TPC-C, TPC-W
- Decision Support Systems: TPC-H, TPC-R
- Web Server: SPECweb99, TPC-W, VolanoMark
- Electronic Commerce: TPC-W, SPECjBB2000
- Mail Server: SPECmail2000
- Network File System: SPEC SFS 2.0
- Personal Computer: SYSMARK, WinBench, 3DMarkMAX99
- Handheld devices: SPEC committee
Synthetic Benchmarks

Whetstone Benchmark (www.cse.clrc.ac.uk/disco/Benchmarks/whetstone.shtml)

Rank / Machine / Mflop ratings N2, N3, N8 (VL=1024) / Total CPU (seconds) / MWIPS:
1. Pentium 4/3066 (ifc): 1966, 529, 1347 / 9.2 / 4071
2. HP Superdome Itanium2/1500: 492, 3441, 2907 / 9.8 / 3826
3. HP RX5670 Itanium2/1500-H: 655, 3441, 2907 / 9.8 / 3855
4. Pentium 4/2666 (ifc): 1966, 444, 1201 / 10.4 / 3532
5. IBM pSeries 690Turbo/1.7: 1996, 475, 1841 / 10.8 / 3472
6. Compaq Alpha ES45/1250: 1679, 815, 1925 / 10.9 / 3441
7. HP RX4640 Itanium2/1300: 492, 2753, 2511 / 11.3 / 3324
8. IBM Regatta-HPC/1300: 492, 444, 1454 / 11.5 / 3281
9. IBM pSeries 690Turbo/1.3: 1996, 353, 1905 / 11.7 / 3260
10. AMD Opteron848/2200: 1966, 1147, 1255 / 11.8 / 3158

Dhrystone Benchmark (MIPS cores)

Core / DMIPS/MHz / Freq. (MHz) / DMIPS / Inline DMIPS/MHz / Inline DMIPS:
4Kc: 1.3 / 300 / 390 / 1.6 / 480
4KEc: 1.35 / 300 / 405 / 1.8 / 540
5Kc: 1.4 / 350 / 490 / 2.0 / 700
5Kf: 1.4 / 320 / 448 / 2.0 / 640
20Kc: 1.7 / 600 / 1020 / 2.2 / 1320
How do we design faster CPUs?
- Faster technology - used to be the main approach, but: (a) getting more expensive, (b) reliability & yield, (c) speed of light (3x10^8 m/sec)
- Larger dies (SOC - System On a Chip): fewer wires between ICs, but low yield (next slide)
- Parallel processing - use n independent processors: limited success
- n-issue superscalar microprocessor (currently n=4): can we expect a speedup of n?
- Pipelining
- Multi-threading
Power consumption
- Dynamic: P = alpha x C_L x Vdd^2 x f (activity factor, load capacitance, supply voltage, clock frequency)
- Leakage: mainly from subthreshold conduction (the FETs leak current); significant at small feature sizes (lower Ion/Ioff)
- Power-aware architectures: the objective is often to minimize activity
  - Role of compilers: control
  - Circuit-level optimizations: make the same work more efficient
  - CAD tools: e.g., clock gating - make it easy to add
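The dynamic-power formula above can be made concrete with a small sketch; the numeric values below (activity factor, capacitance, voltage, frequency) are illustrative assumptions, not from the slides:

```python
# Dynamic power: P = alpha * C_L * Vdd^2 * f
def dynamic_power(alpha, c_load, vdd, freq):
    return alpha * c_load * vdd ** 2 * freq

# Assumed values: alpha = 0.2, C_L = 1 nF switched, Vdd = 1.0 V, f = 2 GHz
print(round(dynamic_power(0.2, 1e-9, 1.0, 2e9), 2))  # 0.4 W
# Quadratic Vdd dependence: halving Vdd cuts dynamic power 4x
print(round(dynamic_power(0.2, 1e-9, 0.5, 2e9), 2))  # 0.1 W
```

The quadratic voltage term is why voltage scaling has been the most effective power-reduction lever.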
Define and quantify power
- Leakage current increases in processors with smaller transistor sizes
- Increasing the number of transistors increases power even if they are turned off
- Leakage is dominant below 90nm
- Very low power systems even gate the voltage to inactive modules to control loss due to leakage
Define and quantify dependability (2/3)
- Module reliability = measure of continuous service accomplishment (or time to failure). Two metrics:
  - Mean Time To Failure (MTTF) measures reliability
  - Failures In Time (FIT) = 1/MTTF, the rate of failures; traditionally reported as failures per billion hours of operation
- Mean Time To Repair (MTTR) measures service interruption
- Mean Time Between Failures (MTBF) = MTTF + MTTR
- Module availability (MA) measures service as alternating between the two states of accomplishment and interruption (a number between 0 and 1, e.g., 0.9):
  MA = MTTF / (MTTF + MTTR)
Example: calculating reliability
If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate is the sum of the failure rates of the modules.

Calculate FIT and MTTF for 10 disks (1M-hour MTTF per disk), 1 disk controller (0.5M-hour MTTF), and 1 power supply (0.2M-hour MTTF):

FailureRate = 10 x (1/1,000,000) + 1/500,000 + 1/200,000
            = (10 + 2 + 5) / 1,000,000
            = 17,000 FIT (failures per billion hours)
MTTF = 1 / FailureRate = 1,000,000,000 / 17,000 ≈ 59,000 hours
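The arithmetic of this example can be sketched directly:

```python
# Failure rates of independent modules add (exponential lifetimes).
mttf_disk, mttf_ctrl, mttf_psu = 1_000_000, 500_000, 200_000  # hours

failure_rate = 10 / mttf_disk + 1 / mttf_ctrl + 1 / mttf_psu  # failures/hour
fit = failure_rate * 1e9           # failures per billion hours
mttf_system = 1 / failure_rate     # hours

print(round(fit))          # 17000 FIT
print(round(mttf_system))  # 58824 hours (~6.7 years)
```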
Integrated Circuits Yield

Die_Yield = Wafer_yield x 1 / (1 + Defect_Density x Die_Area)
Integrated Circuits Costs

IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

Die cost = Wafer cost / (Dies per wafer x Die yield)

Dies per wafer = [pi x (Wafer_diam/2)^2 / Die_Area] - [pi x Wafer_diam / sqrt(2 x Die_Area)]

Die cost goes up roughly with (Die_Area)^2
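The cost and yield formulas on this slide can be sketched together; the wafer numbers below (30 cm wafer, 1 cm^2 die, $5000 per wafer, 0.6 defects/cm^2, 95% wafer yield) are assumed purely for illustration:

```python
import math

def dies_per_wafer(wafer_diam, die_area):
    # Gross dies: wafer area / die area, minus an edge-loss correction term
    return (math.pi * (wafer_diam / 2) ** 2) / die_area \
        - (math.pi * wafer_diam) / math.sqrt(2 * die_area)

def die_yield(wafer_yield, defect_density, die_area):
    # Simple yield model from the slide
    return wafer_yield / (1 + defect_density * die_area)

def die_cost(wafer_cost, wafer_diam, die_area, wafer_yield, defect_density):
    return wafer_cost / (dies_per_wafer(wafer_diam, die_area)
                         * die_yield(wafer_yield, defect_density, die_area))

print(round(dies_per_wafer(30, 1)))                # 640 gross dies
print(round(die_cost(5000, 30, 1, 0.95, 0.6), 2))  # about $13 per good die
```

Doubling the die area both cuts dies-per-wafer roughly in half and lowers yield, which is why die cost rises roughly with the square of die area.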
Amdahl's Law - Basics
Example: executing a program on n independent processors.

Fraction_enhanced = parallelizable part of the program
Speedup_enhanced = n

Speedup_overall = ExTime_old / ExTime_new
                = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

ExTime_new = ExTime_old x ((1 - Fraction_enhanced) + Fraction_enhanced / n)

As n → ∞: Speedup_overall → 1 / (1 - Fraction_enhanced)
Amdahl's Law - Graph: Law of Diminishing Returns
[Figure: overall speedup vs. n, saturating at the limit set by the serial fraction 1 - f_enh]
Amdahl's Law - Extension
Example: improving part of a processor (e.g., multiplier, floating-point unit).

Fraction_enhanced = part of the program to be enhanced

Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
                < 1 / (1 - Fraction_enhanced)

A given signal-processing application consists of 40% multiplications. An enhanced multiplier will execute 5 times faster:

Speedup_overall = 1 / (0.6 + 0.4/5) = 1.47 < 1/0.6 = 1.66
Amdahl's Law - Another Example
Floating-point instructions are improved to run 2X faster, but only 10% of actual run time is used by FP instructions:

ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old

Speedup_overall = 1 / 0.95 = 1.053
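Both worked examples follow from one function, sketched here:

```python
def amdahl(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Multiplier example: 40% of the work sped up 5x
print(round(amdahl(0.4, 5), 2))  # 1.47, bounded above by 1/0.6 = 1.66

# Floating-point example: 10% of run time sped up 2x
print(round(amdahl(0.1, 2), 3))  # 1.053
```

Note how the unenhanced fraction dominates: even an infinitely fast multiplier could not beat 1/0.6.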
Instruction execution
Components of average CPU execution time (the CPI Law):

CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

where Cycles/Instruction = CPI and Seconds/Cycle = 1/clock_rate.

The "End to End Argument" is what RISC was ultimately about: it is the performance of the complete system that matters, not individual components!
Cycles Per Instruction - Another Performance Metric

"Average cycles per instruction":
CPI = Total_No_of_Cycles / Instruction_Count

"CPI of individual instructions":
CPU time = CycleTime x Sum_{j=1..n} (CPI_j x I_j)

"Instruction frequency":
CPI = Sum_{j=1..n} (CPI_j x F_j), where F_j = I_j / Instruction_Count

CPI_j = CPI for instruction type j (j = 1, ..., n)
I_j = number of times instruction type j is executed
Example: Calculating CPI
Base machine (Reg/Reg); typical mix of instruction types in a program:

Op     Freq  Cycles  CPIj x Fj  (% Time)
ALU    50%   1       .5         (33%)
Load   20%   2       .4         (27%)
Store  10%   2       .2         (13%)
Branch 20%   2       .4         (27%)

CPI = 0.5 + 0.4 + 0.2 + 0.4 = 1.5
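The table's weighted sum and time shares can be reproduced in a few lines:

```python
# Instruction mix from the table: (frequency, cycles) per class.
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

cpi = sum(f * c for f, c in mix.values())
print(round(cpi, 2))  # 1.5

# Fraction of execution time per class: (F_j * CPI_j) / CPI
for op, (f, c) in mix.items():
    print(f"{op}: {f * c / cpi:.0%} of time")
```

Note that Load/Store/Branch together consume 2/3 of the time despite being half the instructions, because they take 2 cycles each.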
Pipelining - Basics

Z = F(X,Y) = SqRoot(X^2 + Y^2)

[Figure: 3-stage pipeline - Stage 1: X^2; Stage 2: + Y^2; Stage 3: SqRoot - inputs X, Y; output Z]

4 consecutive operations: if each step takes 1T, then one calculation takes 3T and four take 12T.

Assuming ideally that each stage takes 1T:
- What will be the latency (time to produce the first result)?
- What will be the throughput (pipeline rate in the steady state)?
Pipelining - Timing

[Timing diagram: four operations overlapped, one stage per T]

Total of 6T for the 4 operations; Speedup = 12T/6T = 2

For n operations: total time = 3T + (n-1)T = latency + (n-1)/throughput

Speedup = 3nT / (3T + (n-1)T) = 3n / (n+2) → 3 (the # of stages) as n → ∞
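The timing argument above can be sketched as:

```python
# Timing for the 3-stage pipeline (T = one stage time, taken as 1).
def pipeline_time(n_ops, stages=3, t=1):
    """Fill the pipe (stages*T), then one result per T."""
    return stages * t + (n_ops - 1) * t

def pipeline_speedup(n_ops, stages=3):
    # Unpipelined time is n*stages*T; for 3 stages this is 3n/(n+2)
    return (n_ops * stages) / pipeline_time(n_ops, stages)

print(pipeline_time(4))                    # 6 -> the "total of 6T"
print(pipeline_speedup(4))                 # 2.0 (12T / 6T)
print(round(pipeline_speedup(10_000), 3))  # 2.999 -> approaches 3, the # of stages
```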
Pipelining - Non-ideal

Non-ideal situation:
1. Steps take T1, T2, T3: Rate = 1 / max(Ti) - the slowest unit determines the throughput
2. To allow independent operation, we must add latches: Rate = 1 / (max(Ti) + T_latch)
Rule of Thumb for Latency Lagging BW
- In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4 (and capacity improves faster than bandwidth)
- Stated alternatively: bandwidth improves by more than the square of the improvement in latency
Latency Lags Bandwidth (last ~20 years)

Performance milestones (latency improvement, bandwidth improvement):
- Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
- Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x)
- Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
- Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

[Log-log plot: relative BW improvement (1 to 10000) vs. relative latency improvement (1 to 100) for processor, memory, network, and disk, against the line "latency improvement = bandwidth improvement"; CPU high, memory low ("Memory Wall")]
Summary of Technology Trends
- For disk, LAN, memory, and microprocessor, bandwidth improves by roughly the square of the latency improvement; in the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X
- The lag is probably even larger in real systems, as bandwidth gains are multiplied by replicated components:
  - Multiple processors in a cluster or even in a chip
  - Multiple disks in a disk array
  - Multiple memory modules in a large memory
  - Simultaneous communication in switched LANs
- HW and SW developers should innovate assuming latency lags bandwidth: if everything improves at the same rate, then nothing really changes; when rates vary, real innovation is required
Summary of Architecture Trends
- CMOS microprocessors focus on computing bandwidth with multiple cores
  - Accelerators for specialized support
  - Software to take advantage - von Neumann design
- As nanoscale technologies emerge, new architectural areas are created
  - Unconventional architectures: not programmed - would operate more like the brain, through learning and inference
  - As well as new opportunities for microprocessor design
Backup slides for students
6 Reasons Latency Lags Bandwidth

1. Moore's Law helps BW more than latency
- Faster transistors, more transistors, and more pins help bandwidth:
  - MPU transistors: 0.130 vs. 42 M xtors (300X)
  - DRAM transistors: 0.064 vs. 256 M xtors (4000X)
  - MPU pins: 68 vs. 423 (6X)
  - DRAM pins: 16 vs. 66 (4X)
- Smaller, faster transistors must communicate over (relatively) longer lines, which limits latency:
  - Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X)
  - MPU die size: 35 vs. 204 mm2 (ratio sqrt ≈ 2X)
  - DRAM die size: 47 vs. 217 mm2 (ratio sqrt ≈ 2X)
6 Reasons Latency Lags Bandwidth (cont'd)

2. Distance limits latency
- Size of DRAM block: long bit and word lines account for most of the DRAM access time
- Speed of light and computers on a network

3. Bandwidth is easier to sell ("bigger = better")
- E.g., 10 Gbit/s Ethernet ("10 Gig") vs. 10 microsecond-latency Ethernet
- 4400 MB/s DIMM ("PC4400") vs. 50 ns latency
- Even if just marketing, customers are now trained
- Since bandwidth sells, more resources are thrown at bandwidth, which further tips the balance
6 Reasons Latency Lags Bandwidth (cont'd)

4. Latency helps BW, but not vice versa
- Spinning a disk faster improves both bandwidth and rotational latency:
  - 3600 RPM → 15000 RPM = 4.2X
  - Average rotational latency: 8.3 ms → 2.0 ms
  - Other things being equal, this also helps BW by 4.2X
- Lower DRAM latency → more accesses per second (higher bandwidth)
- Higher linear density helps disk BW (and capacity), but not disk latency:
  - 9,550 BPI → 533,000 BPI: 60X in BW
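The rotational-latency numbers in point 4 can be verified directly (average rotational latency is half a revolution):

```python
def avg_rotational_latency_ms(rpm):
    # Half a revolution, in milliseconds (60,000 ms per minute)
    return 0.5 * 60_000 / rpm

print(round(avg_rotational_latency_ms(3600), 1))   # 8.3 ms
print(avg_rotational_latency_ms(15000))            # 2.0 ms
print(round(15000 / 3600, 1))                      # 4.2x improvement
```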
6 Reasons Latency Lags Bandwidth (cont'd)

5. Bandwidth hurts latency
- Queues help bandwidth but hurt latency (queuing theory)
- Adding chips to widen a memory module increases bandwidth, but higher fan-out on address lines may increase latency

6. Operating system overhead hurts latency more than bandwidth
- Long messages amortize overhead; overhead is a bigger part of short messages