aca lecture 1
TRANSCRIPT
-
8/6/2019 ACA Lecture 1
Advanced Computer Architecture & Algorithms
Lecture 01: Introduction
-
Introduction
Control
Datapath
Memory
Processor
Input
Output
-
Introduction
High-Level Language Program
(Compiler)
Assembly Language Program
(Assembler)
Machine Language Program
(Machine Interpretation)
Control Signal Specification
-
Introduction
instruction set
software
hardware
-
Introduction
Computer Architecture = ISA + MO (Machine Organization)
How does the hardware implement the ISA?
System Organization (processor, memory, I/O)
Microarchitecture
Learn methods of measuring and improving performance
Metrics
Benchmarks
Performance methods
Learn to think and program concurrently
-
Introduction
Application
Operating System
Compiler
Instruction Set Architecture
Instr. Set Proc.
I/O system
Datapath & Control
Digital Design
Circuit Design
Layout
-
Instruction Set Architecture is the structure of a computer that a machine language programmer (or a
compiler) must understand to write a correct (timing
independent) program for that machine.
The ISA defines:
Operations that the processor can execute
Data Transfer mechanisms + how to access data
Control Mechanisms (branch, jump, etc)
Contract between programmer/compiler + HW
Introduction
-
Specification
Program
ISA (Instruction Set Architecture)
microArchitecture
Logic
Transistors
compute the Fibonacci sequence
for(i=2; i
-
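NAND Gate
[Figure: transistor-level schematic with inputs A and B, output Out, supply Vdd]
A B Out
0 0 1
0 1 1
1 0 1
1 1 0
NOR Gate
[Figure: transistor-level schematic with inputs A and B, output Out, supply Vdd]
A B Out
0 0 1
0 1 0
1 0 0
1 1 0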
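The two truth tables can be reproduced with a short Python sketch. The function names below are illustrative assumptions, not part of the lecture; the code models only the logical behaviour of the gates, not the transistor circuits.

```python
from itertools import product


def nand(a: int, b: int) -> int:
    """NAND: output is 0 only when both inputs are 1."""
    return 0 if (a and b) else 1


def nor(a: int, b: int) -> int:
    """NOR: output is 1 only when both inputs are 0."""
    return 0 if (a or b) else 1


def truth_table(gate):
    """Return rows (A, B, Out) for all four input combinations."""
    return [(a, b, gate(a, b)) for a, b in product((0, 1), repeat=2)]


if __name__ == "__main__":
    for name, gate in (("NAND", nand), ("NOR", nor)):
        print(name)
        print("A B Out")
        for a, b, out in truth_table(gate):
            print(a, b, out)
```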
-
Introduction
Instruction Set Architecture
Addressing, Protection, Exception Handling
Pipelining and Instruction Level Parallelism
Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, VLIW, DSP, Reconfiguration
Memory Hierarchy
L1 Cache
L2 Cache
DRAM
Coherence, Bandwidth, Latency
Input/Output and Storage
Disks, Tape
Emerging Technologies, Interleaving, Bus protocols
VLSI
-
Intel 4004 - 1971
The first microprocessor
2,300 transistors
108 kHz
-
Examples (Throughput/Performance)
Replace the processor with a faster version?
3.8 GHz instead of 3.2 GHz
Add an additional processor to a system?
Core Duo instead of P4
Performance is determined by execution time
Related variables
# of cycles to execute program
# of instructions in program
# of cycles per second (clock rate)
average # of cycles per instruction
average # of instructions per second
-
Relating the Metrics
Performance = 1 / Execution Time
CPU Execution Time = CPU clock cycles for program x Clock cycle time
or
Execution time = # of clock cycles x cycle time
Cycle time (period) = time between ticks = seconds per cycle
CPU clock cycles = Instructions for a program x Average clock cycles per instruction
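The relations on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names and the sample numbers (one million instructions, an average CPI of 2.0, a 10 ns cycle) are assumptions chosen for illustration, not values from the lecture.

```python
def cpu_clock_cycles(instruction_count: int, avg_cpi: float) -> float:
    """CPU clock cycles = instructions for a program x average cycles per instruction."""
    return instruction_count * avg_cpi


def execution_time_s(clock_cycles: float, cycle_time_s: float) -> float:
    """Execution time = number of clock cycles x cycle time."""
    return clock_cycles * cycle_time_s


def performance(exec_time_s: float) -> float:
    """Performance = 1 / execution time."""
    return 1.0 / exec_time_s


if __name__ == "__main__":
    # Hypothetical workload, for illustration only.
    cycles = cpu_clock_cycles(1_000_000, 2.0)
    time_s = execution_time_s(cycles, 10e-9)
    print(f"clock cycles:   {cycles:.0f}")
    print(f"execution time: {time_s * 1e3:.3f} ms")
    print(f"performance:    {performance(time_s):.1f} runs per second")
```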
-
Execution time = seconds / program
= (instructions / program) x (cycles / instruction) x (seconds / cycle)
Influenced by: Program, Compiler, Architecture (ISA), Compiler (Scheduling), Organization, Micro architects, Circuit Designers, Technology
Performance
-
Clocking and Clocked Elements
Typical Clock
1 Hz = 1 cycle per second
period (cycle time)
[Figure: logic on signals A, B, C, D]
Reduce the number of gate levels
-
For some program running on machine X,
Performance_X = 1 / Execution time_X
"X is n times faster than Y"
n = Performance_X / Performance_Y
= Execution time_Y / Execution time_X
Problem: Machine A runs a program in 10 seconds and machine B in 15
seconds. How much faster is A than B?
Answer:
n = Performance_A / Performance_B
= Execution time_B / Execution time_A = 15/10 = 1.5
A is 1.5 times faster than B.
Performance
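The "n times faster" ratio is easy to get upside down, so a small Python sketch may help. The function name is an assumption for illustration; the inputs are the 10 s and 15 s run times from the problem on this slide.

```python
def times_faster(exec_time_x_s: float, exec_time_y_s: float) -> float:
    """How many times faster X is than Y.

    n = Performance_X / Performance_Y = Execution time_Y / Execution time_X
    """
    if exec_time_x_s <= 0 or exec_time_y_s <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_y_s / exec_time_x_s


if __name__ == "__main__":
    # Machine A: 10 s, machine B: 15 s.
    print(f"A is {times_faster(10, 15)} times faster than B")
```

Note that the slower machine's time goes in the numerator: swapping the arguments gives the reciprocal, which answers "how many times faster is B than A".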
-
Suppose we have two implementations of the same instruction set architecture (ISA). For some program,
Machine A has a clock cycle time of 10 ns and a CPI of 2.0
Machine B has a clock cycle time of 20 ns and a CPI of 1.2
Which machine is faster for this program, and by how much?
Time per instruction:
for A 2.0 * 10 ns = 20 ns
for B 1.2 * 20 ns = 24 ns
A is 24/20 = 1.2 times faster
CPI Example
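The comparison works on time per instruction because both machines implement the same ISA and run the same program, so the instruction count cancels out of the ratio. A minimal Python sketch of the calculation, with hypothetical names:

```python
def time_per_instruction_ns(cpi: float, cycle_time_ns: float) -> float:
    """Average time per instruction = CPI x clock cycle time."""
    return cpi * cycle_time_ns


if __name__ == "__main__":
    machine_a = time_per_instruction_ns(cpi=2.0, cycle_time_ns=10.0)
    machine_b = time_per_instruction_ns(cpi=1.2, cycle_time_ns=20.0)
    # Same instruction count on both machines, so the ratio of
    # per-instruction times equals the ratio of execution times.
    ratio = machine_b / machine_a
    print(f"A: {machine_a:.1f} ns, B: {machine_b:.1f} ns")
    print(f"A is {ratio:.2f} times faster than B")
```

The lower CPI of machine B does not make it faster here, because its cycle time is twice as long.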
-
To improve performance (everything else being equal) you
can either
reduce the # of required clock cycles for a program
or
decrease the clock period or, said another way,
increase the clock frequency.
How to Improve Performance
-
Multiplication takes more time than addition
Floating point operations take longer than integer operations
Accessing memory takes more time than accessing registers
Important point: changing the cycle time often changes the number of cycles required
for various instructions (more later)
e.g. memory operations spend time, not cycles
Another point: the same instruction might require a different number of cycles on a
different machine
circuits have been implemented in different ways
How to improve the performance
-
# of Instructions:
A compiler designer has two alternatives for a certain code sequence. There are three
different classes of instructions: A, B, and C, and they require one, two, and three cycles,
respectively.
The first sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.
The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.
Which sequence will be faster? What are the CPI values?
Sequence 1: 2*1+1*2+2*3 = 10 cycles; CPI1 = 10 / 5 = 2
Sequence 2: 4*1+1*2+1*3 = 9 cycles; CPI2 = 9 / 6 = 1.5
Sequence 2 is faster.
Example...
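The two code sequences can be compared in a few lines of Python. The dictionary layout and function names are assumptions for illustration; the cycle costs and instruction counts are the ones given in the example.

```python
# Cycles required by each instruction class (from the example).
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def total_cycles(counts: dict) -> int:
    """Sum of (instructions of a class x cycles for that class)."""
    return sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())


def cpi(counts: dict) -> float:
    """Average cycles per instruction for the sequence."""
    return total_cycles(counts) / sum(counts.values())


if __name__ == "__main__":
    sequence_1 = {"A": 2, "B": 1, "C": 2}  # 5 instructions
    sequence_2 = {"A": 4, "B": 1, "C": 1}  # 6 instructions
    for name, seq in (("Sequence 1", sequence_1), ("Sequence 2", sequence_2)):
        print(f"{name}: {total_cycles(seq)} cycles, CPI = {cpi(seq)}")
```

Sequence 2 executes more instructions but needs fewer cycles, which is why instruction count alone is a poor predictor of execution time.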
-
Cycles Per Instruction (CPI)
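Depends on the instruction type:
$$\mathrm{CPI}_i = \text{Execution time of instruction } i \times \text{Clock Rate}$$
Average cycles per instruction:
$$\mathrm{CPI} = \sum_{i=1}^{n} \mathrm{CPI}_i \times F_i \quad \text{where} \quad F_i = \frac{IC_i}{IC_{tot}}$$
Example: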
Op Freq Cycles CPI(i) %time
ALU 50% 1 0.5 33%
Load 20% 2 0.4 27%
Store 10% 2 0.2 13%
Branch 20% 2 0.4 27%
CPI(total) 1.5
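The table can be recomputed from the frequency and cycle columns alone. The sketch below is illustrative: the names MIX, average_cpi and time_share are assumptions, while the numbers are the instruction mix from the table.

```python
# Instruction mix from the table: class -> (frequency, cycles).
MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}


def average_cpi(mix: dict) -> float:
    """CPI = sum over classes of (frequency x cycles)."""
    return sum(freq * cycles for freq, cycles in mix.values())


def time_share(mix: dict) -> dict:
    """Fraction of execution time spent in each class: CPI(i) / CPI(total)."""
    total = average_cpi(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}


if __name__ == "__main__":
    print(f"CPI(total) = {average_cpi(MIX):.2f}")
    for op, share in time_share(MIX).items():
        print(f"{op:<7}{share:6.0%}")
```

The %time column differs from the Freq column because multi-cycle instructions take a larger share of time than their share of the instruction count.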
-
Amdahl's Law
Speedup due to enhancement E:
$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$
Version 1
Execution Time After Improvement = Execution Time Unaffected +
Execution Time Affected /Amount of Improvement
Version 2
Speedup
= Performance after improvement / Performance before improvement
= Execution time before improvement / Execution time after
improvement
Amdahl's Law
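Both versions can be written directly as functions. This is a sketch with hypothetical names, and the sample run (30 s unaffected, 70 s affected, improvement factor 2) is an assumption for illustration rather than a figure from the lecture.

```python
def time_after_improvement(unaffected_s: float, affected_s: float,
                           improvement: float) -> float:
    """Version 1: unaffected time + affected time / amount of improvement."""
    if improvement <= 0:
        raise ValueError("amount of improvement must be positive")
    return unaffected_s + affected_s / improvement


def speedup(time_before_s: float, time_after_s: float) -> float:
    """Version 2: execution time before / execution time after."""
    return time_before_s / time_after_s


if __name__ == "__main__":
    # Hypothetical program: 100 s total, 70 s of which benefits from the change.
    before = 30.0 + 70.0
    after = time_after_improvement(30.0, 70.0, improvement=2.0)
    print(f"time after improvement: {after:.1f} s")
    print(f"speedup: {speedup(before, after):.3f}")
```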
-
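Amdahl's Law
$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$
$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$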
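The fraction form of Amdahl's Law translates into a one-line function. The sketch below uses a hypothetical function name and adds input checks that are not part of the slide; the sample values in the demo (40% of the time enhanced by a factor of 10) are assumptions for illustration.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup = 1 / ((1 - F) + F / S).

    F is the fraction of the original execution time that benefits,
    S is the speedup of that fraction.
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    # Hypothetical enhancement: 40% of the run time gets 10x faster.
    print(f"overall speedup: {amdahl_speedup(0.4, 10.0):.4f}")
```

Two sanity checks follow from the formula: with F = 0 the overall speedup is 1 (nothing changed), and with F = 1 it equals S.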
-
Example:
A benchmark program spends half of the time executing floating point
instructions.
We improve the performance of the floating point unit by a factor of four.
What is the speedup?
Time before 10s (supposition)
Time after = 5s + 5s/4 = 6.25 s
Speedup = 10/6.25 = 1.6
Amdahl's Law
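A useful follow-up to this example is to ask what happens as the floating point unit gets faster still. The sketch below reuses the 10 s supposition and the one-half floating point fraction from the slide; the function name and the extra improvement factors are assumptions for illustration.

```python
TIME_BEFORE_S = 10.0   # supposition from the example
FP_FRACTION = 0.5      # half of the time is floating point


def speedup_for(fp_improvement: float) -> float:
    """Overall speedup when only the floating point part gets faster."""
    fp_time = TIME_BEFORE_S * FP_FRACTION
    other_time = TIME_BEFORE_S - fp_time
    time_after = other_time + fp_time / fp_improvement
    return TIME_BEFORE_S / time_after


if __name__ == "__main__":
    for factor in (2.0, 4.0, 16.0, 1000.0, float("inf")):
        print(f"FP {factor:>7} times faster -> overall speedup {speedup_for(factor):.3f}")
```

The overall speedup never exceeds 1 / (1 - 0.5) = 2, however fast the floating point unit becomes, because the other half of the run time is untouched.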
-
Example:
Suppose a program runs in 100 seconds on a
machine, with multiply responsible for 80 seconds
of this time. How much do we have to improve the speed of multiplication if we want the program
to run 4 times faster?
100 s/4 = 80 s/n + 20 s
5 s = 80s/n
n= 80 s/ 5 s = 16
Amdahl's Law
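The algebra above can be turned around into a function that solves for the required improvement n. This is a sketch with a hypothetical name; returning None when the target cannot be reached is a design choice of the sketch, not something stated on the slide.

```python
def required_improvement(total_s: float, affected_s: float, target_speedup: float):
    """Solve total/target = affected/n + unaffected for n.

    Returns None when no finite improvement can reach the target,
    i.e. when the unaffected time alone already uses up the target time.
    """
    unaffected_s = total_s - affected_s
    target_time_s = total_s / target_speedup
    remaining_s = target_time_s - unaffected_s
    if remaining_s <= 0:
        return None
    return affected_s / remaining_s


if __name__ == "__main__":
    print("4x target:", required_improvement(100, 80, 4))
    print("5x target:", required_improvement(100, 80, 5))
```

With the slide's numbers a 4x target needs multiplication to be 16 times faster. A 5x target is out of reach: the 20 s that is not multiplication already equals the whole 100 s / 5 = 20 s budget.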
-
Metrics of Performance
Application
Programming Language
Compiler
ISA
Datapath, Control
Function Units
Transistors, Wires, Pins
(millions) of Instructions per second: MIPS
(millions) of (FP) operations per second: MFLOP/s
Cycles per second (clock rate)
Megabytes per second
Answers per month
Operations per second
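The slide names MIPS without giving a formula. Under the standard definition, MIPS = instruction count / (execution time x 10^6), which is the same as clock rate / (CPI x 10^6). The Python sketch below assumes that definition; the function names and the sample inputs in the demo (a 3.2 GHz clock and a CPI of 1.5) are picked only for illustration.

```python
def mips_from_counts(instruction_count: float, exec_time_s: float) -> float:
    """MIPS = instruction count / (execution time x 10^6)."""
    return instruction_count / (exec_time_s * 1e6)


def mips_from_clock(clock_rate_hz: float, cpi: float) -> float:
    """Equivalent form: MIPS = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


if __name__ == "__main__":
    # Hypothetical machine: 3.2 GHz clock, average CPI of 1.5.
    print(f"{mips_from_clock(3.2e9, 1.5):.0f} MIPS")
```

MIPS depends on the instruction mix through CPI, so the same machine reports different MIPS ratings for different programs, and ratings are not comparable across different instruction sets.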
-
Summary
Computer Architecture = Instruction Set Architecture + Machine Organization
All computers consist of five components
Processor: (1) datapath and (2) control
(3) Memory
(4) Input devices and (5) Output devices
Not all memories are created equal
Cache: fast (expensive) memory is placed closer to the processor
Main memory: less expensive memory
Interfaces are where the problems are: between functional units and between the computer and the outside world
Need to design against constraints of performance, power, area and cost
-
Instruction-Set Architecture
Software impact
support OS functions
restartable instructions
memory relocation and protection
a good compiler target
simple
orthogonal
Hardware impact
parallel implementation
OP R1 R2 R3 imm
OP R1M1 im2R2M2
R3M3 im2 ...
Hardware/Software Interface
-
ISA Basics
instruction: Op, Mode, Ra, Rb
Before State: Mem, Regs
After State: Mem, Regs
Instruction formats
Instruction types
Addressing modes
Data types
Operations
Interrupts/Events
Machine state
Memory organization
Register organization
-
System-Level Organization
Design at the level of processors, memories, and interconnect.
More important to application performance than CPU design
Feeds and speeds
constrained by IC pin count, module pin count, and signaling rates
System balance
for a particular application
Driven by
performance/cost goals
available components (cost/perf)
technology constraints
-
Microarchitecture
Register-transfer-level (RTL) design
Implement instruction set
Exploit capabilities of technology
locality and concurrency
Iterative process
generate proposed architecture
estimate cost
measure performance
Current emphasis is on overcoming sequential
nature of programs
deep pipelining
multiple issue
dynamic scheduling
branch prediction/speculation
[Figure: datapath with Regs, Instr. Cache, IR, PC, and A, B, C]
-
Memory
Requires energy to change state
Feedback circuit - SRAM
Capacitors - DRAM
Magnetic media - disk
Required for memories
Storage medium
Write mechanism
Read mechanism
[Figure: 4Gb DRAM Die]
-
Technology Trends
Processor
logic capacity: about 30% per year
clock rate: about 20% per year
Memory
DRAM capacity: about 60% per year (4x every 3 years)
Memory speed: about 10% per year
Cost per bit: improves about 25% per year
Disk
capacity: about 60% per year
Total use of data: 100% per 9 months!
Network Bandwidth
Bandwidth increasing more than 100% per year!
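Annual growth rates compound, which is where the "4x every 3 years" figure for DRAM comes from. The sketch below checks that arithmetic. The function names and the TRENDS dictionary are assumptions for illustration; the rates are the approximate ones listed on the slide.

```python
import math


def growth_factor(annual_rate: float, years: float) -> float:
    """Total growth after compounding an annual rate over a number of years."""
    return (1.0 + annual_rate) ** years


def doubling_time_years(annual_rate: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_rate)


# Approximate annual rates from the slide.
TRENDS = {
    "logic capacity": 0.30,
    "clock rate": 0.20,
    "DRAM capacity": 0.60,
    "memory speed": 0.10,
    "disk capacity": 0.60,
}

if __name__ == "__main__":
    for name, rate in TRENDS.items():
        print(f"{name:<15} x{growth_factor(rate, 3):.2f} in 3 years, "
              f"doubles every {doubling_time_years(rate):.1f} years")
```

The same calculation makes the gap visible: at these rates DRAM capacity grows about fourfold in three years while memory speed grows by only about a third.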
-
Architecture trends
1970s (CISC mainframes)
multi-chip CPUs
semiconductor memory very expensive
microcoded control
complex instruction sets (good code density)
1980s (RISC micros)
single-chip CPUs, on-chip RAM feasible
simple, hard-wired control
simple instruction sets
small on-chip caches
1990s (fast clocks)
lots of transistors
complex control to exploit instruction-level parallelism
2000s (???)
even more transistors
slow wires
BIG SHIFT Here!!!
Parallelism is focus
Power now critical
Open debate
-
Moore's Law
In 1965, Gordon Moore predicted that the number of transistors that can be integrated on a die would double every 18 to 24 months (i.e., grow exponentially with time).
Amazingly visionary: the million transistor/chip barrier was crossed in the 1980s.
2300 transistors, 1 MHz clock (Intel 4004) - 1971
16 Million transistors (Ultra Sparc III)
42 Million transistors, 2 GHz clock (Intel Xeon) - 2001
55 Million transistors, 3 GHz, 130nm technology, 250mm2 die (Intel Pentium 4) - 2004
140 Million transistors (HP PA-8500)
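The doubling rule can be turned into a rough projection and compared with the data points on this slide. The sketch below assumes pure exponential growth from the Intel 4004 (2300 transistors, 1971); the function name is hypothetical, and the projection is a back-of-the-envelope model rather than a claim about any specific chip.

```python
def projected_transistors(start_count: float, start_year: int,
                          year: int, doubling_months: float) -> float:
    """Transistor count if it doubles every `doubling_months` months."""
    doublings = (year - start_year) * 12 / doubling_months
    return start_count * 2 ** doublings


if __name__ == "__main__":
    # Chips listed on the slide: (name, year, transistors).
    chips = [("Intel Xeon", 2001, 42e6), ("Intel Pentium 4", 2004, 55e6)]
    for name, year, actual in chips:
        slow = projected_transistors(2300, 1971, year, 24)
        fast = projected_transistors(2300, 1971, year, 18)
        print(f"{name} ({year}): actual {actual:.2e}, "
              f"24-month projection {slow:.2e}, 18-month projection {fast:.2e}")
```

The choice of doubling period matters a great deal over three decades: the 18-month and 24-month projections differ by more than an order of magnitude by 2001.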
-
Processor Performance Increase
[Figure: performance (SPECInt, log scale from 1 to 10000) versus year, 1987 to 2003. Machines plotted: SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600, DEC Alpha 21264A/667, Intel Xeon/2000, Intel Pentium 4/3000]
-
DRAM Capacity Growth
[Figure: DRAM capacity in Kbit (log scale from 10 to 1000000) versus year of introduction, 1976 to 2002. Generations plotted: 16K, 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M, 512M]
-
Eras in Processor Architecture
[Figure: MIPS (log scale from 0.01 to 1000000) versus year, 1970 to 2015, dated March 2005. Processors plotted: 4004, 8086, 286, 386 (i386), 486, Pentium, Pentium 4, Conroe. Eras marked: Pipelined Architecture; Superscalar, Speculative (Instruction Level Parallelism); HyperThreaded, Multi-Threaded, Multi-Core (Thread & Processor Level Parallelism with Special Purpose HW); Function Level Parallelism]
-
THANK YOU