computer architecture lecture 0:...
TRANSCRIPT
-
Computer ArchitectureLecture 0: Introduction
Chih‐Wei Liu 劉志尉National Chiao Tung [email protected]
-
Course Information• Lecture:
– Chih‐Wei Liu 劉志尉 [email protected]– 5731685, ED618
• Teaching Assistant:– 楊承彥 張鈞豪– 54225, ED412
• Course website: http://twins.ee.nctu.edu.tw• Textbook: J. L. Hennessy and D.A. Patterson, Computer Architecture: A
Quantitative Approach, 5th Edition, Morgan Kaufmann Publishers, 2012• Prerequisites:
– Computer organization– Computer programming
CA-Lec0 [email protected] 2
-
This Course Has 5 Modules• Module 1
– Instruction set architecture (ISA)– Pipelining and Hazards
• Module 2– Memory hierarchy – Caches and virtual memory
• Module 3– Instruction‐level Parallelism – Branch prediction– Speculation and Reorder buffer
• Module 4– Thread‐level Parallelism– Cache coherent protocols
• Module 5– Data‐level Parallelism– Vector machines
CA-Lec0 [email protected] 3
Midterm exam.
Final exam.
-
Course Grade (tentative)
• Lectures, Homework, and Quizzes: 25%
– Adapted from Prof. David Patterson’s class notes– Please avoid arriving late or leaving early
– One problem sets with respect to each chapter
– One‐page reading report with respect to each lecture
– Homework should be handed in on time
• LAB & Project: 25%
• Midterm and Final Exams.: 50%
• (Extra points 5%)
CA-Lec0 [email protected] 4
-
Computer ?
CA-Lec0 [email protected] 5
Building Hardwarethat Computes
-
Computing Devices Then…
EDSAC, University of Cambridge, UK, 1949CA-Lec0 [email protected] 6
-
Computing Systems Today …
• Computers & Microprocessors in everything– Vast infrastructure behind them
CA-Lec0 [email protected] 7
Robots
-
Overview of Computer Systems
CA-Lec0 [email protected] 8
Clusters
Massive Cluster
Gigabit Ethernet
Cloud
-
“Mea
ley
Mac
hine
”“M
oore
Mac
hine
”
Computer is a State‐Controlled Machine:Implementation as Comb logic + Latch
Alpha/00
Delta/10
Beta/01
0/0
1/0
1/1
0/1
0/0
1/1La
tch
Com
bina
tion
alLo
gic
Input Stateold Statenew Div
0 0 0
00 01 10
00 10 01
0 0 1
1 1 1
00 01 10
01 00 10
0 1 1
CA-Lec0 [email protected] 9
Finite state machine
-
Computer is a Microprogrammed Controller
• State machine in which part of state is a “micro‐pc”.– Explicit circuitry for incrementing or changing PC
• Includes a ROM with “microinstructions”.– Controlled logic implements at least branches and jumps
ROM
(Instructions)
Addr
BranchPC
+ 1
MUX
Control
0: forw 35 xxx1: b_no_obstacles 0002: back 10 xxx3: rotate 90 xxx4: goto 001
Instruction Branch
Com
bina
tion
al L
ogic
/Co
ntro
lled
Mac
hine
State w/ Address
CA-Lec0 [email protected] 10
-
Instruction Execution Cycle
CA-Lec0 [email protected] 11
InstructionFetch
InstructionDecode
OperandFetch
Execute
ResultStore
NextInstruction
Obtain instruction from program storage
Determine required actions and instruction size
Locate and obtain operand data
Compute result value or status
Deposit results in storage for later use
Determine successor instruction
Processor
regs
F.U.s
Memory
program
Data
von Neumanbottleneck
-
Computer Architecture?
CA-Lec0 [email protected] 12
Application
Physics
Gap too large to bridge in one step
In its general definition, computer architecture is the design of theabstraction layers that allow us to implement information processing applications efficiently using available manufacturing technologies.
-
Abstraction Layers in Modern Systems
CA-Lec0 [email protected] 13
Algorithm
Gates/Register-Transfer Level (RTL)
Application
Instruction Set Architecture (ISA)
Operating System/Virtual Machine
Microarchitecture
Devices
Programming Language
Circuits
Physics
Software
Hardware
-
Example: PC
CA-Lec0 [email protected] 14
Algorithm: Linked list
Gates/Register-Transfer Level (RTL)
Application: Microsoft Powerpoint
Instruction Set Architecture (ISA): x86
Operating System/Virtual Machine: Windows
Microarchitecture: 6-core, L3 Cache
Devices: 22nm Tri-gate Transistors
Programming Language: Microsoft SDK
Circuits
Physics
-
Example: Smart Phone
CA-Lec0 [email protected] 15
Algorithm: Trajectory
Gates/Register-Transfer Level (RTL)
Application: Angry Birds
Instruction Set Architecture (ISA): ARM
Operating System/Virtual Machine: Android/Dalvik
Microarchitecture: Quad-core, L2 Cache
Devices: 32nm CMOS
Programming Language: Android SDK
Circuits
Physics
-
Computer Architecture Course
CA-Lec0 [email protected] 16
Algorithm
Gates/Register-Transfer Level (RTL)
Application
Instruction Set Architecture (ISA)
Operating System/Virtual Machine
Microarchitecture
Devices
Programming Language
Circuits
Physics
Original domain of
the computer architect
(‘50s-’80s)
Domain of recent
computer architecture
(‘90s)Reliability, power, …
Parallel computing, security, …
Reinvigoration of computer architecture,
mid-2000s onward.
-
Concluding Remarks• The abstraction layers provide
– Environmental stability within generation– Environmental stability across generation– Consistency across a large number of units
• Modern computer architecture cannot avoid paying attention to software and compilation issues.
• Computer architecture is engineering design under constraints– Average case & worst case, peak power & energy per operation, soft errors & hard errors, hardware cost & software cost, ….
CA-Lec0 [email protected] 17
-
“Bell’s Law” – new class per decade
year
log
(peo
ple
per c
ompu
ter)
streaming informationto/from physical world
Number CrunchingData Storage
productivityinteractive
• Enabled by technological opportunities• Smaller, more numerous and more intimately connected• Brings in a new kind of application• Used in many ways not previously imagined
CA-Lec0 [email protected] 18
-
Uniprocessor Performance
CA-Lec0 [email protected] 19
1
10
100
1000
10000
1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006
Perfo
rman
ce (v
s. V
AX-1
1/78
0)
25%/year
52%/year
??%/year
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October, 2006
• VAX : 25%/year 1978 to 1986• RISC + x86: 52%/year 1986 to 2002• RISC + x86: ??%/year 2002 to present
Pipelining,Data locality,Parallelism processing
-
Driven Technology: Moore’s Law
• “Cramming More Components onto Integrated Circuits”– Gordon Moore, Electronics, 1965
• # of transistors on cost‐effective integrated circuit double every 18 months
CA-Lec0 [email protected] 20
-
CISC (Complex Instruction Set Computer)
CA-Lec0 [email protected] 21
-
RISC (Reduced Instruction Set Computer)
CA-Lec0 [email protected] 22
-
UniProcessor Performance• Early 1970s, mainframes and minicomputers
– 25%~30% growth per year in performance• Late 1970, microprocessor
– 35% growth per year in performance• Early 1980s, Reduced Instruction Set Computer (RISC) architectures
– 2 critical performance techniques• ILP (initially through pipelining and later through multiple instruction issue) • Cache
– 50% growth per year in performance• 1998~2000, relative performance
– By technology: 1.35per year– By technology + architecture, overall: 1.58 per yearNote: 1.581.35(1+15%), the architecture improvement factor is 15%
CA-Lec0 [email protected] 1-23
-
Pipelined Instruction Execution
Instr.
Order
Time (clock cycles)
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5
CA-Lec0 [email protected] 24
-
Limiting Force: Power Density
CA-Lec0 [email protected] 25
-
µProc60%/yr.(2X/1.5yr)
DRAM9%/yr.(2X/10 yrs)1
10
100
10001980
1981
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
DRAM
CPU
1982
Processor-MemoryPerformance Gap:(grows 50% / year)
Perf
orm
ance
Time
Limiting Force:Processor‐DRAM Memory Gap (latency)
CA-Lec0 [email protected] 26
-
Limiting Force: Clock Speed and ILP
• Chip density is continuing increase ~2x every 2 years– Clock speed is not– # processors/chip (cores)
may double instead• There is little or no more
Instruction Level Parallelism (ILP) to be found– Can no longer allow
programmer to think in terms of a serial programming model
• Conclusion:Parallelism must be exposed to software!
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
CA-Lec0 [email protected] 27
-
• Old Conventional Wisdom: Power is free, Transistors expensive• New Conventional Wisdom: “Power wall” Power expensive, Xtors free
(Can put more on chip than can afford to turn on)• Old CW: Sufficiently increasing Instruction Level Parallelism via compilers,
innovation (Out‐of‐order, speculation, VLIW, …)• New CW: “ILP wall” law of diminishing returns on more HW for ILP • Old CW: Multiplies are slow, Memory access is fast• New CW: “Memory wall” Memory slow, multiplies fast
(200 clock cycles to DRAM memory, 4 clocks for multiply)• Old CW: Uniprocessor performance 2X / 1.5 yrs• New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
– Uniprocessor performance now 2X / 5(?) yrs Sea change in chip design: multiple “cores”
(2X processors per chip / ~ 2 years)• More power efficient to use a large number of simpler processors rather than a small number of complex processors
Crossroads in Computer Architectue
CA-Lec0 [email protected] 28
-
Sea Change in Chip Design• Intel 4004 (1971):
– 4‐bit processor,– 2312 transistors, 0.4 MHz, – 10 m PMOS, 11 mm2 chip
• RISC II (1983): – 32‐bit, 5 stage – pipeline, 40,760 transistors, 3 MHz, – 3 m NMOS, 60 mm2 chip
• 125 mm2 chip, 65 nm CMOS = 2312 RISC II+FPU+Icache+Dcache– RISC II shrinks to ~ 0.02 mm2 at 65 nm– Caches via DRAM or 1 transistor SRAM (www.t‐ram.com) ?– Proximity Communication via capacitive coupling at > 1 TB/s ?
(Ivan Sutherland @ Sun / Berkeley)
• Processor is the new transistor?
CA-Lec0 [email protected] 29
-
ManyCore Chips: The trend is here
• “ManyCore” refers to many processors/chip– 64? 128? Hard to say exact boundary
• How to program these?– Use 2 CPUs for video/audio– Use 1 for word processor, 1 for browser– 76 for virus checking???
• Something new is clearly needed here…
• Intel 80-core multicore chip (Feb 2007)– 80 simple cores– Two FP-engines / core– Mesh-like network– 100 million transistors– 65nm feature size
• Intel Single-Chip Cloud Computer (August 2010)– 24 “tiles” with two IA
cores per tile – 24-router mesh network
with 256 GB/s bisection– 4 integrated DDR3 memory controllers– Hardware support for message-passing
CA-Lec0 [email protected] 30
-
Problems with Sea Change
• Algorithms, programming languages, compilers, operating systems, architectures, libraries, … not ready for 1000 CPUs / chip– Thread‐level parallelism (TLP)– Data‐level parallelism (DLP)
• Four architectures exploit DLP and TLP– Instruction‐level parallelism – chapter 3– Vector architectures and GPUs – chapter 4– Thread‐level parallelism – chapter 5– Request‐level parallelism – chapter 6
CA-Lec0 [email protected] 31
-
Course Information
• Lecture:– Chih‐Wei Liu 劉志尉 [email protected]– 5731685, ED618
• Teaching Assistants:– 楊承彥, 張鈞豪– 54225, ED412
• Course website: http://twins.ee.nctu.edu.tw• Prerequisites:
– Computer organization– Computer programming
CA-Lec0 [email protected] 32
-
Computer Architecture Course
CA-Lec0 [email protected] 33
Algorithm
Gates/Register-Transfer Level (RTL)
Application
Instruction Set Architecture (ISA)
Operating System/Virtual Machine
Microarchitecture
Devices
Programming Language
Circuits
Physics
Original domain of
the computer architect
(‘50s-’80s)
Domain of recent
computer architecture
(‘90s)Reliability, power, …
Parallel computing, security, …
Reinvigoration of computer architecture,
mid-2000s onward.
-
This Course Has 5 Modules• Module 1
– Instruction set architecture (ISA)– Pipelining and Hazards
• Module 2– Memory hierarchy – Caches and virtual memory
• Module 3– Instruction‐level Parallelism – Branch prediction– Speculation and Reorder buffer
• Module 4– Thread‐level Parallelism– Cache coherent protocols
• Module 5– Data‐level Parallelism– Vector machines
CA-Lec0 [email protected] 34
Midterm exam.
Final exam.
-
Course Grade (tentative)
• Lectures, Homework, and Quizzes: 25%
– Adapted from Prof. David Patterson’s class notes– Please avoid arriving late or leaving early
– One problem sets with respect to each chapter
– One‐page reading report with respect to each lecture
– Homework should be handed in on time
• LAB & Project: 25%
• Midterm and Final Exams.: 50%
• (Extra points 5%)
CA-Lec0 [email protected] 35