ECE668 Part 1: Introduction – Computer Architecture ECE 668, Csaba Andras Moritz, University of Massachusetts, Dept. of Electrical & Computer Engineering

Upload: felicia-west

Post on 19-Dec-2015

ECE668 Part.1 .1

Csaba Andras Moritz

UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering

Computer Architecture ECE 668

Part 1

Introduction

ECE668 Part.1 .2

Coping with ECE 668

• Students with varied backgrounds
• Prerequisites – Basic Computer Architecture, VLSI
• 2 projects to choose from, some flexibility beyond that
  – You need software and/or Verilog/HSPICE skills to complete it
• 2 exams – midterm and final
• Class participation, attend office hours
• About the instructor
• First lectures – review of Performance and Pipelining (Chapter 1 + Appendix A)
• Many lectures will use the whiteboard and slides
• Lectures related to textbook and beyond
  – Many lectures are outside the textbook
• Web: www.ecs.umass.edu/ece/andras/courses/ECE668/

ECE668 Part.1 .3

What you should know

Basic machine structure: processor (data path, control, arithmetic), memory, I/O

Read and write in an assembly language, C, C++, ... (MIPS/ARM ISA preferred)

Understand the concepts of pipelining and virtual memory

Basic VLSI – HSPICE and/or Verilog

ECE668 Part.1 .4

Textbook and references

Textbook: J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 4th edition (or later), Morgan Kaufmann.

Recommended reading: J.P. Shen and M.H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill, 2005.

Chandrakasan et al, Design of High-Performance Microprocessor Circuits

NASIC research papers and Nanoelectronics textbook chapter; SKYBRIDGE, N3ASIC, CMOL, FPNI, SPWF papers if interested

Other research papers we bring up in class.

ECE668 Part.1 .5

Course Outline

I. Introduction (Ch. 1)
II. Pipeline Design (App. A)
III. Instruction-level Parallelism, Pipelining (App. A, Ch. 2)
IV. Memory Design: Memory Hierarchy, Cache Memory, Secondary Memory (Ch. 4)
V. Multiprocessors (Ch. 3)
VI. Deep Submicron Implementation – Process Variation, Power-aware Architectures, Compiler’s role
VII. Nanoscale architectures

ECE668 Part.1 .6

Administrative Details

Instructor: Prof. Csaba Andras Moritz
Office: KEB 309H
Email: [email protected]
Office Hours: 2:30-3:30 PM Tues. & 2:30-3 PM Thur.
TA – pending
Course web page: details available at http://www.ecs.umass.edu/ece/andras/courses/ECE668

ECE668 Part.1 .7

Grading

• Midterm I – 35%
• Project – 30%: two projects to choose from
• Class Participation – 5%
• Final Exam – 30%
• Homework – exam questions

ECE668 Part.1 .8

What is “Computer Architecture”

Computer Architecture = Instruction Set Architecture + Machine Organization (e.g., Pipelining, Memory Hierarchy, Storage systems, etc.)

Or Unconventional Organization

IBM 360 (minicomputer, mainframe, supercomputer)

Intel X86 vs. ARM vs. Nanoprocessors

ECE668 Part.1 .9

Computer Architecture Topics - Processors

[Figure: topic map with the labels Instruction Set Architecture; Instruction Level Parallelism (Pipelining, Hazard Resolution, Superscalar, Reordering, Branch Prediction, VLIW, Vector); Memory Hierarchy (Addressing, L1 Cache, L2 Cache, DRAM, Interleaving, Bus protocols, Bandwidth, Latency); Input/Output and Storage (Disks, Tape, RAID: performance, reliability); VLSI]

ECE668 Part.1 .10

Advanced CMOS multi-cores & nano proc.? (2013)

ECE668 Part.1 .12

Scaling

Figure courtesy of Intel; Copyright – Baskaran Ganesan, Intel Higher Education Program

ECE668 Part.1 .13

Shrinking geometry

Figure courtesy of Intel; Copyright – Baskaran Ganesan, Intel Higher Education Program

ECE668 Part.1 .14

Die

ECE668 Part.1 .15

Wafer

ECE668 Part.1 .16

CPUs: Archaic (Nostalgic) v. Semi Modern v. Modern?

1982 Intel 80286
• 12.5 MHz
• 2 MIPS (peak)
• Latency 320 ns
• 134,000 xtors, 47 mm²
• 16-bit data bus, 68 pins
• Microcode interpreter, separate FPU chip
• (no caches)

2001 Intel Pentium 4
• 1500 MHz (120X)
• 4500 MIPS (peak) (2250X)
• Latency 15 ns (20X)
• 42,000,000 xtors, 217 mm²
• 64-bit data bus, 423 pins
• 3-way superscalar, dynamic translation to RISC, superpipelined (22 stage), out-of-order execution
• On-chip 8KB data cache, 96KB instr. trace cache, 256KB L2 cache

2015?

ECE668 Part.1 .17

Multi-core = Network on a chip

Everything you learn as CSE students applied/integrated in a chip!

ECE668 Part.1 .18

Intel Polaris with 80 cores

Figure courtesy of Intel; Copyright – Baskaran Ganesan, Intel Higher Education Program

ECE668 Part.1 .19

Tilera processor with 64 cores

MIT startup from Raw project (used to be involved in this)

ECE668 Part.1 .20

What is next: Nanoprocessors? Molecular memory, NASIC processors, 3D?

[Figure: NASIC ALU built from 2-4 decoders and adder/multiplier blocks, with signals opcode, operanda, operandb, rf3~0, dest, and result. Copyright: NASIC group, UMASS]

[Figure: CrossNW devices. Courtesy of Prof. Chui’s group at UCLA]

ECE668 Part.1 .21

From Nanodevices to Nanocomputing

[Figure: progression from a Crossed Nanowire Array (n+ source & drain, n+ gate, p-channel), to Array-based Circuits with Built-in Fault-tolerance (NASICs; example array with inputs a0–a3 and b0–b3, carries c0–c4, sums s0–s3, and Up/Down/clk control signals), to a Nanoprocessor]

Evaluation/Cascading: Streaming Control with Surrounding Microwires

ECE668 Part.1 .22

NASICs Fabric Based Architectures

WIre Streaming Processor
• General purpose stream processor
• 5-stage pipeline with minimal feedback
• Built-in fault tolerance: up to 10% device level defect rates
• 33X density adv. vs. 16nm scaled CMOS
• Simpler manufacturing
• ~9X improved power-per-performance efficiency (rough estimate)

Cellular Architecture
• Special purpose for image and signal processing
• Massively parallel array of identical interacting simple functional cells
• Fully programmable from external template signals
• 22X denser than in 16nm scaled CMOS

ECE668 Part.1 .23

N3ASIC- 3D Nanowire Technology

ECE668 Part.1 .24

N3P – Hybrid Spin-Charge Platform

ECE668 Part.1 .25

Skybridge 3D Circuits – Vertically Integrated

3D Circuit concept and 1 bit full adder

Designed in my group

FETs are gate-all-around on vertical nanowires

ECE668 Part.1 .26

Example ISAs in Processors (Instruction Set Architectures)

ISA                                                     Year
ARM (32, 64-bit, v8)                                    1985
Digital Alpha (v1, v3)                                  1992
HP PA-RISC (v1.1, v2.0)                                 1986
Sun Sparc (v8, v9)                                      1987
MIPS (MIPS I, II, III, IV, V)                           1986
Intel (8086, 80286, 80386, 80486, Pentium, MMX, ...)    1978

RISC vs. CISC

ECE668 Part.1 .27

Basics

Let us review some basics

ECE668 Part.1 .28

RISC ISA Encoding Example

ECE668 Part.1 .29

Virtualized ISAs

• BlueRISC TrustGUARD
• ISA is randomly created internally
• Fluid – more than one ISA possible

ECE668 Part.1 .30

Characteristics of RISC

• Only Load/Store instructions access memory
• A relatively large number of registers

Goals of new computer designs
• Higher performance
• More functionality (e.g., MMX)
• Other design objectives? (examples)

ECE668 Part.1 .31

How to measure performance?

• Time to run the task
  – Execution time, response time, latency
  – Performance may be defined as 1 / Ex_Time
  – Throughput, bandwidth

ECE668 Part.1 .32

Speedup

$\text{performance}(x) = \dfrac{1}{\text{execution\_time}(x)}$

"Y is n times faster than X" means

$n = \text{speedup} = \dfrac{\text{Execution\_time}(\text{old / brand } x)}{\text{Execution\_time}(\text{new / brand } y)}$

Speedup must be greater than 1: $T_x/T_y = 3/2 = 1.5$, but not $T_y/T_x = 2/3 = 0.67$
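The speedup definition can be sketched in a few lines of Python. The function name `speedup` is an illustrative choice, not something from the slides; the times 3 and 2 are the slide's own example.

```python
def speedup(time_old: float, time_new: float) -> float:
    """Return how many times faster the new system is than the old one."""
    if time_old <= 0 or time_new <= 0:
        raise ValueError("execution times must be positive")
    return time_old / time_new


# Slide example: Tx = 3 time units (old / brand X), Ty = 2 (new / brand Y).
n = speedup(3.0, 2.0)
print(f"Y is {n:.2f} times faster than X")
```

Swapping the arguments gives 2/3, which is why the ratio is always taken as old over new.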

ECE668 Part.1 .33

MIPS and MFLOPS

• MIPS (Million Instructions Per Second)
  – Can we compare two different CPUs using MIPS?
• MFLOPS (Million Floating-point Operations Per Second)
  – Application dependent (e.g., compiler)
• Still useful for benchmarks
• Benchmarks: e.g., SPEC CPU 2000: 26 applications (with inputs)
  – SPECint2000: Twelve integer, e.g., gcc, gzip, perl
  – SPECfp2000: Fourteen floating-point intensive, e.g., equake

ECE668 Part.1 .34

SPEC CPU 2000

www.specbench.org/cpu200

SPECint2000

Benchmark      Language    Category
164.gzip       C           Compression
175.vpr        C           FPGA Circuit Place & Route
176.gcc        C           C Compiler
181.mcf        C           Combinatorial Optimization
186.crafty     C           Game Playing: Chess
197.parser     C           Word Processing
252.eon        C++         Computer Visualization
253.perlbmk    C           PERL Prog Language
254.gap        C           Group Theory, Interpreter
255.vortex     C           Object-oriented Database
256.bzip2      C           Compression
300.twolf      C           Place and Route Simulator

SPECfp2000

Benchmark      Language    Category
168.wupwise    Fortran77   Quantum Chromodynamics
171.swim       Fortran77   Shallow Water Modeling
172.mgrid      Fortran77   Multi-grid Solver
173.applu      Fortran77   Partial Differential Equations
177.mesa       C           3-D Graphics Library
178.galgel     Fortran90   Fluid Dynamics
179.art        C           Image Recognition / Neural Nets
183.equake     C           Seismic Wave Propagation
187.facerec    Fortran90   Face Recognition
188.ammp       C           Computational Chemistry
189.lucas      Fortran90   Primality Testing
191.fma3d      Fortran90   Finite-element Crash
200.sixtrack   Fortran77   Nuclear Physics Accelerator Design
301.apsi       Fortran77   Meteorology: Pollutant Distribution

ECE668 Part.1 .35

SPEC 2006 (still current)

ECE668 Part.1 .36

Other Benchmarks

www.spec.org

Workload Category                                   Example Benchmark Suite
CPU Benchmarks - Uniprocessor                       SPEC CPU 2006, Java Grande Forum Benchmarks, SciMark, ASCI
CPU - Parallel Processor                            SPLASH, NASPAR
Multimedia                                          MediaBench
Embedded                                            EEMBC benchmarks
Digital Signal Processing                           BDTI benchmarks
Java - Client side                                  SPECjvm98, CaffeineMark
Java - Server side                                  SPECjBB2000, VolanoMark
Java - Scientific                                   Java Grande Forum Benchmarks, SciMark
Transaction Processing - On-Line Transaction Proc.  TPC-C, TPC-W
Transaction Processing - Decision Support Systems   TPC-H, TPC-R
Web Server                                          SPEC web99, TPC-W, VolanoMark
Electronic commerce                                 TPC-W, SPECjBB2000
Mail-server                                         SPECmail2000
Network File System                                 SPEC SFS 2.0
Personal Computer                                   SYSMARK, WinBench, DMarkMAX99

Handheld device committee SPEC

ECE668 Part.1 .37

Synthetic Benchmarks

Whetstone Benchmark

www.cse.clrc.ac.uk/disco/Benchmarks/whetstone.shtml

Rank  Machine                      Mflop ratings (Vl=1024)   Total CPU   MWIPS
                                   N2      N3      N8        (seconds)
1     Pentium 4/3066 (ifc)         1966    529     1347      9.2         4071
2     HP Superdome Itanium2/1500   492     3441    2907      9.8         3826
3     HP RX5670 Itanium2/1500-H    655     3441    2907      9.8         3855
4     Pentium 4/2666 (ifc)         1966    444     1201      10.4        3532
5     IBM pSeries 690Turbo/1.7     1996    475     1841      10.8        3472
6     Compaq Alpha ES45/1250       1679    815     1925      10.9        3441
7     HP RX4640 Itanium2/1300      492     2753    2511      11.3        3324
8     IBM Regatta-HPC/1300         492     444     1454      11.5        3281
9     IBM pSeries 690Turbo/1.3     1996    353     1905      11.7        3260
10    AMD Opteron848/2200          1966    1147    1255      11.8        3158

Dhrystone Benchmark

Core     DMIPS/MHz   Freq. (MHz)   DMIPS   Inline DMIPS/MHz   Inline DMIPS
4Kc™     1.3         300           390     1.6                480
4KEc™    1.35        300           405     1.8                540
5Kc™     1.4         350           490     2.0                700
5Kf™     1.4         320           448     2.0                640
20Kc™    1.7         600           1020    2.2                1320

ECE668 Part.1 .38

How do we design faster CPUs?

• Faster technology – used to be the main approach
  (a) getting more expensive
  (b) reliability & yield
  (c) speed of light (3×10^8 m/sec)
• Larger dies (SOC – System On a Chip)
  – fewer wires between ICs, but low yield (next slide)
• Parallel processing – use n independent processors
  – limited success
• n-issue superscalar microprocessor (currently n=4)
  – Can we expect a Speedup = n?
• Pipelining
• Multi-threading

ECE668 Part.1 .39

Power consumption

• Dynamic: $\alpha \cdot V_{dd}^2 \cdot f \cdot C_L$
• Leakage
  – Mainly from subthreshold (the FETs leak current)
  – Significant for small feature sizes (lower Ion/Ioff)
• Power-aware architectures
  – Objective is often to minimize activity
  – Role of compilers – control
  – Circuit level optimizations – make the same more efficient
  – CAD tools – e.g., clock gating – make it easy to add
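A minimal sketch of the dynamic power expression, assuming α is the switching activity factor and C_L the switched load capacitance. The operating point below (0.2 activity, 2 GHz, 1 nF) is hypothetical and chosen only to show the quadratic dependence on Vdd.

```python
def dynamic_power(alpha: float, vdd: float, freq_hz: float, c_load: float) -> float:
    """Dynamic switching power in watts: alpha * Vdd^2 * f * C_L."""
    return alpha * vdd ** 2 * freq_hz * c_load


# Hypothetical operating point, for illustration only.
base = dynamic_power(alpha=0.2, vdd=1.0, freq_hz=2e9, c_load=1e-9)
scaled = dynamic_power(alpha=0.2, vdd=0.8, freq_hz=2e9, c_load=1e-9)
print(f"1.0 V: {base:.3f} W, 0.8 V: {scaled:.3f} W, ratio: {scaled / base:.2f}")
```

Lowering Vdd by 20% cuts dynamic power to 64% at the same frequency and activity, which is why voltage scaling is so effective when timing allows it.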

ECE668 Part.1 .40

Define and quantify power

Leakage current increases in processors with smaller transistor sizes

Increasing the number of transistors increases power even if they are turned off

Leakage is dominant below 90 nm

Very low power systems even gate the voltage to inactive modules to control loss due to leakage

ECE668 Part.1 .41

Define and quantify dependability (2/3)

• Module reliability = measure of continuous service accomplishment (or time to failure). 2 metrics:
  – Mean Time To Failure (MTTF) measures Reliability
  – Failures In Time (FIT) = 1/MTTF, the rate of failures
    • Traditionally reported as failures per billion hours of operation
• Mean Time To Repair (MTTR) measures Service Interruption
  – Mean Time Between Failures (MTBF) = MTTF + MTTR
• Module availability (MA) measures service as alternation between the 2 states of accomplishment and interruption (number between 0 and 1, e.g. 0.9)
  – Module availability MA = MTTF / (MTTF + MTTR)
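The dependability metrics above can be sketched as small helper functions. The function names and the sample module (1,000,000-hour MTTF, 24-hour MTTR) are hypothetical and used only for illustration.

```python
def fit(mttf_hours: float) -> float:
    """Failures In Time: failures per billion (1e9) hours of operation."""
    return 1e9 / mttf_hours


def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean Time Between Failures = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Module availability = MTTF / (MTTF + MTTR), a number between 0 and 1."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Hypothetical module: fails on average every 1,000,000 h, takes 24 h to repair.
print(f"FIT = {fit(1_000_000):.0f}")
print(f"MTBF = {mtbf(1_000_000, 24):.0f} hours")
print(f"availability = {availability(1_000_000, 24):.6f}")
```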

ECE668 Part.1 .42

Example calculating reliability

If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), overall failure rate is the sum of failure rates of the modules

Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF):

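$\text{FailureRate} = 10 \times \dfrac{1}{1{,}000{,}000} + \dfrac{1}{500{,}000} + \dfrac{1}{200{,}000} = \dfrac{10 + 2 + 5}{1{,}000{,}000} = \dfrac{17}{1{,}000{,}000}\ \text{per hour} = 17{,}000\ \text{FIT}$

$\text{MTTF} = \dfrac{1{,}000{,}000{,}000}{17{,}000} \approx 59{,}000\ \text{hours}$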
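The calculation for this example can be sketched as follows, using the component MTTF values stated on the slide and the stated assumption that failure rates add. Variable names are illustrative.

```python
# Component MTTFs in hours, as given on the slide:
# 10 disks, 1 disk controller, 1 power supply.
component_mttf_hours = [1_000_000] * 10 + [500_000] + [200_000]

# With exponentially distributed lifetimes, failure rates simply add.
failure_rate_per_hour = sum(1.0 / mttf for mttf in component_mttf_hours)

system_fit = failure_rate_per_hour * 1e9      # failures per billion hours
system_mttf_hours = 1.0 / failure_rate_per_hour

print(f"FIT  = {system_fit:.0f}")
print(f"MTTF = {system_mttf_hours:.0f} hours")
```

The system fails far more often than any single component: 17 failures per million hours, or roughly one every 59,000 hours.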

ECE668 Part.1 .44

Integrated Circuits Yield

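$\text{Die Yield} = \text{Wafer\_yield} \times \left( 1 + \dfrac{\text{Defect\_Density} \times \text{Die\_area}}{\alpha} \right)^{-\alpha}$

where $\alpha$ is a process-dependent parameter.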

ECE668 Part.1 .45

Integrated Circuits Costs

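$\text{IC cost} = \dfrac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}}$

$\text{Die cost} = \dfrac{\text{Wafer cost}}{\text{Dies per Wafer} \times \text{Die Yield}}$

$\text{Dies per wafer} = \dfrac{\pi \times (\text{Wafer\_diam}/2)^2}{\text{Die\_Area}} - \dfrac{\pi \times \text{Wafer\_diam}}{\sqrt{2 \times \text{Die\_Area}}} - \text{Test\_Die}$

Die Cost goes up roughly with (Die_Area)²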
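The yield and cost relations can be sketched together in Python. This assumes the textbook form of the die yield formula with a process parameter `alpha`; the wafer size, defect density, costs, and yields used in the example are hypothetical and serve only to exercise the functions.

```python
import math


def dies_per_wafer(wafer_diam_cm: float, die_area_cm2: float, test_dies: int = 0) -> int:
    """Wafer area over die area, minus dies lost at the edge and test dies."""
    wafer_area = math.pi * (wafer_diam_cm / 2) ** 2
    edge_loss = math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
    return int(wafer_area / die_area_cm2 - edge_loss - test_dies)


def die_yield(wafer_yield: float, defect_density: float, die_area_cm2: float, alpha: float) -> float:
    """Assumed textbook form: wafer_yield * (1 + D * A / alpha) ** -alpha."""
    return wafer_yield * (1 + defect_density * die_area_cm2 / alpha) ** (-alpha)


def die_cost(wafer_cost: float, n_dies: int, yield_fraction: float) -> float:
    return wafer_cost / (n_dies * yield_fraction)


def ic_cost(die: float, testing: float, packaging: float, final_test_yield: float) -> float:
    return (die + testing + packaging) / final_test_yield


# Hypothetical 30 cm wafer, 1 cm^2 die, 0.4 defects/cm^2, alpha = 4, $5000 wafer.
n_dies = dies_per_wafer(30.0, 1.0)
y = die_yield(1.0, 0.4, 1.0, 4.0)
cost = die_cost(5000.0, n_dies, y)
total = ic_cost(cost, testing=1.0, packaging=2.0, final_test_yield=0.95)
print(f"dies/wafer = {n_dies}, die yield = {y:.3f}, die cost = ${cost:.2f}, IC cost = ${total:.2f}")
```

Because a larger die lowers both the dies per wafer and the die yield, the two effects compound in the die cost.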

ECE668 Part.1 .46

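Amdahl’s Law - Basics

Example: Executing a program on n independent processors

$\text{Fraction}_{enhanced}$ = parallelizable part of program

$\text{Speedup}_{enhanced} = n$

$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{n} \right]$

$\text{Speedup}_{overall} = \dfrac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \dfrac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$

$\lim_{n \to \infty} \text{Speedup}_{overall} = \dfrac{1}{1 - \text{Fraction}_{enhanced}}$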
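A short sketch of Amdahl's Law for n processors shows how the overall speedup approaches its limit. The 90% parallel fraction is a hypothetical value chosen for illustration, and `amdahl_speedup` is an illustrative name.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


fraction = 0.9                      # hypothetical parallelizable fraction
limit = 1.0 / (1.0 - fraction)      # upper bound as n grows without limit

for n in (2, 4, 16, 256, 65536):
    print(f"n = {n:6d}: speedup = {amdahl_speedup(fraction, n):.3f}")
print(f"limit = {limit:.1f}")
```

Even with tens of thousands of processors the speedup stays below the limit set by the serial 10%.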

ECE668 Part.1 .47

Amdahl’s Law - Graph

Law of Diminishing Returns

[Figure: graph with the label $1 - f_{enh}$]

ECE668 Part.1 .48

Amdahl’s Law - Extension

Example: Improving part of a processor (e.g., multiplier, floating-point unit)

$\text{Fraction}_{enhanced}$ = part of program to be enhanced

$\text{Speedup}_{overall} = \dfrac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}} < \dfrac{1}{1 - \text{Fraction}_{enhanced}}$

A given signal processing application consists of 40% multiplications. An enhanced multiplier will execute 5 times faster.

$\text{Speedup}_{overall} = \dfrac{1}{0.6 + 0.4/5} = 1.47 < \dfrac{1}{0.6} = 1.67$

ECE668 Part.1 .49

Amdahl’s Law - Another Example

Floating point instructions improved to run 2X; but only 10% of actual run time is used by FP instructions

$\text{ExTime}_{new} = \text{ExTime}_{old} \times (0.9 + 0.1/2) = 0.95 \times \text{ExTime}_{old}$

$\text{Speedup}_{overall} = \dfrac{1}{0.95} = 1.053$
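The worked Amdahl examples from the slides (a 5X faster multiplier on 40% of the work, and 2X faster floating point on 10% of the run time) can be reproduced with a small sketch. The function name is illustrative.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# 40% multiplications, multiplier made 5 times faster.
multiplier_case = amdahl_speedup(0.40, 5.0)

# 10% of run time in FP instructions, FP made 2 times faster.
fp_case = amdahl_speedup(0.10, 2.0)

print(f"enhanced multiplier: {multiplier_case:.2f}")
print(f"enhanced FP:         {fp_case:.3f}")
```

In both cases the overall gain is far smaller than the local improvement, because the untouched fraction dominates the new execution time.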

ECE668 Part.1 .50

Instruction execution

Components of average execution time (CPI Law)

Average CPU time per program:

$\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$

where Cycles/Instruction is the CPI and Seconds/Cycle is 1/clock_rate.

The “End to End Argument” is what RISC was ultimately about - it is the performance of the complete system that matters, not individual components!

ECE668 Part.1 .51

Cycles Per Instruction – Another Performance Metric

“Average Cycles per Instruction”

$\text{CPI} = \dfrac{\text{Total\_No\_of\_Cycles}}{\text{Instruction Count}}$

“CPI of Individual Instructions”

$\text{CPU time} = \text{Cycle Time} \times \sum_{j=1}^{n} \text{CPI}_j \times I_j$

$\text{CPI}_j$ - CPI for instruction j (j = 1, …, n)

$I_j$ - # of times instruction j is executed

“Instruction Frequency”

$\text{CPI} = \sum_{j=1}^{n} \text{CPI}_j \times F_j \quad \text{where} \quad F_j = \dfrac{I_j}{\text{Instruction Count}}$

ECE668 Part.1 .52

Example: Calculating CPI

Typical mix of instruction types in program

Base Machine (Reg / Reg)

Op       Freq   Cycles   CPIj * Fj   (% Time)
ALU      50%    1        .5          (33%)
Load     20%    2        .4          (27%)
Store    10%    2        .2          (13%)
Branch   20%    2        .4          (27%)
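The table's arithmetic can be sketched in Python: the average CPI is the frequency-weighted sum of per-class cycles, and each class's share of time is its term divided by that sum. The instruction count and clock rate used for the CPU-time line are hypothetical values, not from the slides.

```python
# (frequency, cycles) per instruction class, from the table above.
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

# CPI = sum over classes of CPI_j * F_j
cpi = sum(freq * cycles for freq, cycles in mix.values())

# Fraction of execution time spent in each class.
time_share = {op: freq * cycles / cpi for op, (freq, cycles) in mix.items()}

print(f"average CPI = {cpi:.2f}")
for op, share in time_share.items():
    print(f"{op:7s} {share:.0%} of time")

# CPU time = instruction count * CPI / clock rate (hypothetical workload).
instruction_count = 1_000_000
clock_rate_hz = 1e9
cpu_time_s = instruction_count * cpi / clock_rate_hz
print(f"CPU time = {cpu_time_s * 1e3:.2f} ms")
```

Although ALU operations are half of the instructions, they account for only a third of the time, because the two-cycle classes weigh more heavily.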

ECE668 Part.1 .53

Pipelining - Basics

$Z = F(X, Y) = \text{SqRoot}(X^2 + Y^2)$

[Figure: unpipelined unit in which X and Y are squared, added, and passed through a square root to produce Z]

4 consecutive operations

If each step takes 1T then one calculation takes 3T, four take 12T

[Figure: three-stage pipeline with inputs X, Y and output Z. Stage 1: $X^2$, $Y^2$; Stage 2: $X^2 + Y^2$; Stage 3: SqRoot]

Assuming ideally that each stage takes 1T

What will be the latency (time to produce the first result)?

What will be the throughput (pipeline rate in the steady state)?

ECE668 Part.1 .54

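Pipelining - Timing

[Figure: timing diagram of the four operations flowing through the three stages, one T per stage]

Total of 6T; Speedup = ?

For n operations: $3T + (n-1)T = \text{latency} + \dfrac{n-1}{\text{throughput}}$

$\text{Speedup} = \dfrac{3T \times n}{3T + (n-1)T} = \dfrac{3n}{n + 2} \to 3\ (\text{\# of stages}) \text{ as } n \to \infty$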
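The ideal pipeline timing can be sketched as follows, assuming every stage takes exactly one time unit T. Function names are illustrative.

```python
def unpipelined_time(n_ops: int, n_stages: int, t: float = 1.0) -> float:
    """Each operation passes through all stages before the next one starts."""
    return n_ops * n_stages * t


def pipelined_time(n_ops: int, n_stages: int, t: float = 1.0) -> float:
    """Fill latency for the first result, then one result per stage time."""
    return (n_stages + (n_ops - 1)) * t


def pipeline_speedup(n_ops: int, n_stages: int) -> float:
    return unpipelined_time(n_ops, n_stages) / pipelined_time(n_ops, n_stages)


# The slide's case: 4 operations through 3 stages.
print(unpipelined_time(4, 3), pipelined_time(4, 3), pipeline_speedup(4, 3))

# The speedup approaches the number of stages as the operation count grows.
for n in (10, 100, 1000):
    print(f"n = {n:4d}: speedup = {pipeline_speedup(n, 3):.3f}")
```

For four operations the pipeline needs 6T instead of 12T, a speedup of 2; the full factor of 3 is only approached for long streams of operations.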

ECE668 Part.1 .55

Pipelining - Non ideal

Non-ideal situation:

1. Steps take $T_1, T_2, T_3$: Rate = $1 / \max_i T_i$

Slowest unit determines the throughput

2. To allow independent operation must add latches: stage time becomes $\max_i T_i + T_{latch}$
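A minimal sketch of the non-ideal case, assuming the cycle time is set by the slowest stage plus a fixed latch delay. The stage delays and latch delay below are hypothetical numbers in nanoseconds.

```python
def pipeline_cycle_time(stage_delays: list[float], latch_delay: float = 0.0) -> float:
    """Assumed model: cycle time = slowest stage + latch overhead."""
    return max(stage_delays) + latch_delay


def pipeline_rate(stage_delays: list[float], latch_delay: float = 0.0) -> float:
    """Steady-state results per unit time."""
    return 1.0 / pipeline_cycle_time(stage_delays, latch_delay)


stages_ns = [1.0, 1.5, 1.2]   # hypothetical unequal stage delays
latch_ns = 0.1                # hypothetical latch overhead

cycle = pipeline_cycle_time(stages_ns, latch_ns)
latency = cycle * len(stages_ns)
print(f"cycle = {cycle:.1f} ns, rate = {pipeline_rate(stages_ns, latch_ns):.3f} results/ns")
print(f"pipelined latency = {latency:.1f} ns vs. unpipelined {sum(stages_ns):.1f} ns")
```

Unbalanced stages and latch overhead make the latency of a single operation worse than in the unpipelined unit, even though the throughput improves.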

ECE668 Part.1 .56

Rule of Thumb for Latency Lagging BW

In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4

(and capacity improves faster than bandwidth)

Stated alternatively: Bandwidth improves by more than the square of the improvement in Latency

ECE668 Part.1 .57

Latency Lags Bandwidth (last ~20 years)

Performance Milestones
• Processor: ‘286, ‘386, ‘486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
• Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x)
• Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

[Figure: log-log plot of Relative BW Improvement (1 to 10000) against Relative Latency Improvement (1 to 100) for Processor, Memory, Network, and Disk, with a reference line where Latency improvement = Bandwidth improvement]

CPU high, Memory low (“Memory Wall”)
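The rule of thumb can be checked against the milestone figures, reading each pair as (latency improvement, bandwidth improvement). The sketch below assumes both improvements accumulated over the same period at steady exponential rates, which is a simplification made only for this illustration.

```python
import math

# (latency improvement, bandwidth improvement) from the milestones list.
milestones = {
    "Processor": (21, 2250),
    "Ethernet": (16, 1000),
    "Memory Module": (4, 120),
    "Disk": (8, 143),
}

results = {}
for name, (latency_x, bandwidth_x) in milestones.items():
    # Rule: bandwidth improves by more than the square of latency improvement.
    beats_square = bandwidth_x > latency_x ** 2
    # Latency gain during one bandwidth doubling, assuming steady rates.
    latency_per_doubling = latency_x ** (math.log(2) / math.log(bandwidth_x))
    results[name] = (beats_square, latency_per_doubling)
    print(f"{name:14s} BW {bandwidth_x}x > {latency_x ** 2}x: {beats_square}, "
          f"latency gain per BW doubling = {latency_per_doubling:.2f}x")
```

Under that assumption all four technologies land inside the 1.2X to 1.4X band quoted in the rule.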

ECE668 Part.1 .58

Summary of Technology Trends

• For disk, LAN, memory, and microprocessor, bandwidth improves by square of latency improvement
  – In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X
• Lag probably even larger in real systems, as bandwidth gains multiplied by replicated components
  – Multiple processors in a cluster or even in a chip
  – Multiple disks in a disk array
  – Multiple memory modules in a large memory
  – Simultaneous communication in switched LAN
• HW and SW developers should innovate assuming Latency Lags Bandwidth
  – If everything improves at the same rate, then nothing really changes
  – When rates vary, require real innovation

ECE668 Part.1 .59

Summary of Architecture Trends

• CMOS Microprocessors focus on computing bandwidth with multiple cores
  – Accelerators for specialized support
  – Software to take advantage – Von Neumann design
• As nanoscale technologies emerge new architectural areas are created
  – Unconventional architectures
    » Not programmed – would operate more like the brain through learning and inference
  – As well as new opportunities for microprocessor design

ECE668 Part.1 .60

Backup slides for students

ECE668 Part.1 .61

6 Reasons Latency Lags Bandwidth

1. Moore’s Law helps BW more than latency
• Faster transistors, more transistors, more pins help Bandwidth
  » MPU Transistors: 0.130 vs. 42 M xtors (300X)
  » DRAM Transistors: 0.064 vs. 256 M xtors (4000X)
  » MPU Pins: 68 vs. 423 pins (6X)
  » DRAM Pins: 16 vs. 66 pins (4X)
• Smaller, faster transistors but communicate over (relatively) longer lines: limits latency
  » Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X)
  » MPU Die Size: 35 vs. 204 mm² (ratio sqrt ⇒ 2X)
  » DRAM Die Size: 47 vs. 217 mm² (ratio sqrt ⇒ 2X)

ECE668 Part.1 .62

6 Reasons Latency Lags Bandwidth (cont’d)

2. Distance limits latency
• Size of DRAM block ⇒ long bit and word lines ⇒ most of DRAM access time
• Speed of light and computers on network

3. Bandwidth easier to sell (“bigger=better”)
• E.g., 10 Gbits/s Ethernet (“10 Gig”) vs. 10 μsec latency Ethernet
• 4400 MB/s DIMM (“PC4400”) vs. 50 ns latency
• Even if just marketing, customers now trained
• Since bandwidth sells, more resources thrown at bandwidth, which further tips the balance

ECE668 Part.1 .63

6 Reasons Latency Lags Bandwidth (cont’d)

4. Latency helps BW, but not vice versa
• Spinning disk faster improves both bandwidth and rotational latency
  » 3600 RPM ⇒ 15000 RPM = 4.2X
  » Average rotational latency: 8.3 ms ⇒ 2.0 ms
  » Things being equal, also helps BW by 4.2X
• Lower DRAM latency ⇒ more accesses/second (higher bandwidth)
• Higher linear density helps disk BW (and capacity), but not disk Latency
  » 9,550 BPI ⇒ 533,000 BPI ⇒ 60X in BW
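The disk figures follow from the fact that, on average, the head waits half a revolution for the target sector. A small sketch (the function name is illustrative):

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: half of one revolution, in milliseconds."""
    ms_per_revolution = 60_000.0 / rpm
    return 0.5 * ms_per_revolution


slow = avg_rotational_latency_ms(3600)
fast = avg_rotational_latency_ms(15000)
print(f"3600 RPM:  {slow:.1f} ms")
print(f"15000 RPM: {fast:.1f} ms")
print(f"improvement: {slow / fast:.1f}X")
```

The latency ratio equals the RPM ratio (15000 / 3600 ≈ 4.2), and with everything else equal the same factor applies to the rate at which bits pass under the head.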

ECE668 Part.1 .64

6 Reasons Latency Lags Bandwidth (cont’d)

5. Bandwidth hurts latency
• Queues help Bandwidth, hurt Latency (Queuing Theory)
• Adding chips to widen a memory module increases Bandwidth but higher fan-out on address lines may increase Latency

6. Operating System overhead hurts Latency more than Bandwidth
• Long messages amortize overhead; overhead bigger part of short messages