aca lecture 1

8/6/2019 ACA Lecture 1

1/42

Advanced Computer Architecture & Algorithms

Lecture 01: Introduction


2/42

Introduction

Control

Datapath

Memory

Processor

Input

Output


3/42

Introduction

High Level Language Program

-> (Compiler) ->

Assembly Language Program

-> (Assembler) ->

Machine Language Program

-> (Machine Interpretation) ->

Control Signal Specification


4/42

Introduction

instruction set

software

hardware


5/42

Introduction

Computer Architecture = ISA + MO (Machine Organization)

How does the hardware implement the ISA?

System Organization (processor, memory, I/O)

Microarchitecture

Learn methods of measuring and improving performance

Metrics

Benchmarks

Performance methods

Learn to think and program concurrently


6/42

Introduction..

I/O system

Instr. Set Proc.

Compiler

Operating System

Application

Digital Design

Circuit Design

Instruction Set Architecture

Datapath & Control

Layout


7/42

Instruction Set Architecture is the structure of a computer that a machine language programmer (or a compiler) must understand to write a correct (timing independent) program for that machine.

The ISA defines:

Operations that the processor can execute

Data Transfer mechanisms + how to access data

Control Mechanisms (branch, jump, etc)

Contract between programmer/compiler + HW

Introduction


8/42

Specification

Program

ISA (Instruction Set Architecture)

microArchitecture

Logic

Transistors

compute the Fibonacci sequence

for(i=2; i


9/42

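NAND Gate and NOR Gate

[Figure: transistor-level schematics of a NAND gate and a NOR gate, each with inputs A and B, supply Vdd, and output Out]

NAND truth table:

A B Out
0 0 1
0 1 1
1 0 1
1 1 0

NOR truth table:

A B Out
0 0 1
0 1 0
1 0 0
1 1 0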
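The two truth tables can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `nand`, `nor`, and `truth_table` are assumptions made for this example and do not come from the lecture.

```python
def nand(a: int, b: int) -> int:
    """NAND: output is 0 only when both inputs are 1."""
    return 0 if (a and b) else 1


def nor(a: int, b: int) -> int:
    """NOR: output is 1 only when both inputs are 0."""
    return 0 if (a or b) else 1


def truth_table(gate):
    """Return (A, B, Out) rows for every input combination of a 2-input gate."""
    return [(a, b, gate(a, b)) for a in (0, 1) for b in (0, 1)]


if __name__ == "__main__":
    for name, gate in (("NAND", nand), ("NOR", nor)):
        print(name)
        print("A B Out")
        for a, b, out in truth_table(gate):
            print(a, b, out)
```

Running the script prints both tables in the same row order as the slide (A, B = 00, 01, 10, 11).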


10/42

Introduction..

Instruction Set Architecture

Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, VLIW, DSP, Reconfiguration

Addressing, Protection, Exception Handling

L1 Cache

L2 Cache

DRAM

Disks, Tape

Coherence, Bandwidth, Latency

Emerging Technologies, Interleaving, Bus protocols

VLSI

Input/Output and Storage

Memory Hierarchy

Pipelining and Instruction Level Parallelism


11/42

Intel 4004 - 1971

The first microprocessor

2,300 transistors

108 kHz


12/42

Examples (Throughput/Performance)

Replace the processor with a faster version?

3.8 GHz instead of 3.2 GHz

Add an additional processor to a system?

Core Duo instead of P4

Performance is determined by execution time

Related variables

# of cycles to execute program

# of instructions in program

# of cycles per second (clock rate)

average # of cycles per instruction

average # of instructions per second


13/42

Relating the Metrics

Performance = 1 / Execution Time

CPU Execution Time = CPU clock cycles for program × Clock cycle time

or, Execution time = # of clock cycles × cycle time

Cycle time (period) = time between ticks = seconds per cycle

CPU clock cycles = Instructions for a program × Average clock cycles per instruction
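A minimal Python sketch of these relations follows. The function name `cpu_time_seconds` and the sample workload (one billion instructions, an average CPI of 2.0, a 2 GHz clock) are hypothetical values chosen for illustration, not figures from the lecture.

```python
def cpu_time_seconds(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = instruction count x average CPI x cycle time.

    Cycle time is the reciprocal of the clock rate, so multiplying by the
    cycle time is the same as dividing by the clock rate.
    """
    clock_cycles = instruction_count * cpi  # total cycles for the program
    return clock_cycles / clock_rate_hz     # cycles x seconds per cycle


if __name__ == "__main__":
    # Hypothetical workload, used only to exercise the formula.
    t = cpu_time_seconds(instruction_count=1_000_000_000, cpi=2.0, clock_rate_hz=2e9)
    print(f"CPU time: {t} s")
    print(f"Performance (1 / time): {1 / t} runs per second")
```

Halving the CPI or doubling the clock rate halves the result, which is the lever the later slides on improving performance refer to.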


14/42

Execution time = seconds/program = (instructions/program) × (cycles/instruction) × (seconds/cycle)

Factors that influence these terms: Program; Compiler; Architecture (ISA); Compiler (Scheduling); Microarchitects; Organization; Circuit Designers; Technology

Performance


15/42

Clocking and Clocked Elements

Typical Clock

1 Hz = 1 cycle per second

period (cycle time)

[Figure: typical clock waveform; circuit diagrams with signals A, B, C, D]

Reduce the number of gate levels


16/42


17/42

For some program running on machine X,

Performance_X = 1 / Execution time_X

"X is n times faster than Y"

n = Performance_X / Performance_Y = Execution time_Y / Execution time_X

Problem: Machine A runs a program in 10 seconds and machine B in 15 seconds. How much faster is A than B?

Answer:

n = Performance_A / Performance_B = Execution time_B / Execution time_A = 15/10 = 1.5

A is 1.5 times faster than B.

Performance


18/42

Suppose we have two implementations of the same instruction set architecture (ISA). For some program,

Machine A has a clock cycle time of 10 ns and a CPI of 2.0

Machine B has a clock cycle time of 20 ns and a CPI of 1.2

Which machine is faster for this program, and by how much?

Time per instruction (same ISA and same program, so both machines execute the same number of instructions):

for A 2.0 * 10 ns = 20 ns

for B 1.2 * 20 ns = 24 ns

A is 24/20 = 1.2 times faster
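The comparison above can be checked with a short Python sketch. The helper names `time_per_instruction_ns` and `times_faster` are assumptions introduced for this example; the numbers are the ones given on the slide.

```python
def time_per_instruction_ns(cpi: float, cycle_time_ns: float) -> float:
    """Average time per instruction = CPI x clock cycle time."""
    return cpi * cycle_time_ns


def times_faster(slower_time: float, faster_time: float) -> float:
    """'X is n times faster than Y' means n = time_Y / time_X."""
    return slower_time / faster_time


if __name__ == "__main__":
    machine_a = time_per_instruction_ns(cpi=2.0, cycle_time_ns=10)
    machine_b = time_per_instruction_ns(cpi=1.2, cycle_time_ns=20)
    print(f"A: {machine_a:.1f} ns per instruction")
    print(f"B: {machine_b:.1f} ns per instruction")
    print(f"A is {times_faster(machine_b, machine_a):.2f} times faster than B")
```

Because the instruction count is the same on both machines, it cancels out of the ratio, so comparing time per instruction is enough.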

CPI Example


19/42


20/42

To improve performance (everything else being equal) you

can either

reduce the # of required clock cycles for a program

or

decrease the clock period or, said another way,

increase the clock frequency.

How to Improve Performance


21/42

Multiplication takes more time than addition

Floating point operations take longer than integer operations

Accessing memory takes more time than accessing registers

Important point: changing the cycle time often changes the number of cycles required

for various instructions (more later)

e.g. memory operations spend time, not cycles

Another point: the same instruction might require a different number of cycles on a

different machine

circuits have been implemented in different ways

How to improve the performance


22/42

# of Instructions:

A compiler designer has two alternatives for a certain code sequence. There are three

different classes of instructions: A, B, and C, and they require one, two, and three cycles,

respectively.

The first sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.

The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.

Which sequence will be faster? What are the CPI values?

Sequence 1: 2*1+1*2+2*3 = 10 cycles; CPI1 = 10 / 5 = 2

Sequence 2: 4*1+1*2+1*3 = 9 cycles; CPI2 = 9 / 6 = 1.5

Sequence 2 is faster.
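A small Python sketch of the same calculation is given here. The names `CYCLES_PER_CLASS`, `total_cycles`, and `cpi` are illustrative assumptions; the cycle counts and instruction mixes are the ones from the example.

```python
# Cycles required by each instruction class, as stated in the example.
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def total_cycles(mix: dict) -> int:
    """Sum of (instruction count x cycles per instruction) over all classes."""
    return sum(count * CYCLES_PER_CLASS[cls] for cls, count in mix.items())


def cpi(mix: dict) -> float:
    """Average cycles per instruction for the sequence."""
    return total_cycles(mix) / sum(mix.values())


sequence_1 = {"A": 2, "B": 1, "C": 2}  # 5 instructions
sequence_2 = {"A": 4, "B": 1, "C": 1}  # 6 instructions

if __name__ == "__main__":
    for name, mix in (("Sequence 1", sequence_1), ("Sequence 2", sequence_2)):
        print(name, "cycles:", total_cycles(mix), "CPI:", cpi(mix))
```

The point of the example shows up directly: the sequence with more instructions needs fewer cycles, so instruction count alone does not decide which is faster.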

Example...


23/42

Cycles Per Instruction (CPI)

Depends on the instruction type

Average cycles per instruction

Example:

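$$\mathrm{CPI}_i = \text{Execution time of instruction}_i \times \text{Clock Rate}$$

$$\mathrm{CPI} = \sum_{i=1}^{n} \mathrm{CPI}_i \times F_i \quad \text{where} \quad F_i = \frac{IC_i}{IC_{tot}}$$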

Op          Freq  Cycles  CPI(i)  %time
ALU         50%   1       0.5     33%
Load        20%   2       0.4     27%
Store       10%   2       0.2     13%
Branch      20%   2       0.4     27%
CPI(total)                1.5
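The table can be recomputed with a short Python sketch of the weighted-CPI calculation. The names `MIX`, `average_cpi`, and `time_share` are assumptions for this example; the frequencies and cycle counts are taken from the table.

```python
# (operation, frequency in the instruction mix, cycles per instruction)
MIX = [
    ("ALU", 0.50, 1),
    ("Load", 0.20, 2),
    ("Store", 0.10, 2),
    ("Branch", 0.20, 2),
]


def average_cpi(mix) -> float:
    """Overall CPI = sum over classes of (frequency x cycles)."""
    return sum(freq * cycles for _, freq, cycles in mix)


def time_share(mix) -> dict:
    """Fraction of execution time per class = its CPI contribution / overall CPI."""
    total = average_cpi(mix)
    return {op: freq * cycles / total for op, freq, cycles in mix}


if __name__ == "__main__":
    print(f"CPI(total) = {average_cpi(MIX):.2f}")
    for op, share in time_share(MIX).items():
        print(f"{op}: {share:.0%} of time")
```

Note that ALU operations are half of the instructions but only a third of the time, which is why the %time column, not the Freq column, indicates where an optimization pays off.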


24/42

Amdahl's Law

Speedup due to enhancement E:

Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E


25/42

Version 1

Execution Time After Improvement = Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)

Version 2

Speedup = Performance after improvement / Performance before improvement = Execution time before improvement / Execution time after improvement

Amdahl's Law


26/42

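Amdahl's Law

$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$

$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$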


27/42

Example:

A benchmark program spends half of its time executing floating point instructions.

We improve the performance of the floating point unit by a factor of four.

What is the speedup?

Time before = 10 s (assumed)

Time after = 5 s + 5 s/4 = 6.25 s

Speedup = 10/6.25 = 1.6
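The same result follows from Amdahl's Law written as a function of the enhanced fraction and the enhancement factor. This is a minimal sketch; the name `amdahl_speedup` and its argument checks are assumptions added for illustration.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup = 1 / ((1 - F) + F / S).

    F is the fraction of the original execution time that benefits from the
    enhancement, and S is the speedup of that fraction.
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    # Floating point example: half the time is affected, improved 4x.
    print(f"Overall speedup: {amdahl_speedup(0.5, 4):.2f}")
    # Even an enormous improvement of that half cannot reach 2x overall.
    print(f"Upper bound check: {amdahl_speedup(0.5, 1e12):.6f}")
```

The result does not depend on the assumed 10 s starting time, since only the fraction of time affected enters the formula.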

Amdahl's Law


28/42

Example:

Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?

100 s/4 = 80 s/n + 20 s

5 s = 80 s/n

n = 80 s/5 s = 16
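The algebra above can be sketched as a small Python helper that solves for n. The function name `required_improvement` and the error handling are assumptions for this example; the inputs are the numbers from the problem.

```python
def required_improvement(total_time: float, affected_time: float, target_speedup: float) -> float:
    """Solve total/target = affected/n + unaffected for n.

    Raises ValueError when the target cannot be met, which happens once the
    unaffected time alone is at least as large as the target run time.
    """
    unaffected_time = total_time - affected_time
    target_time = total_time / target_speedup
    time_left_for_affected = target_time - unaffected_time
    if time_left_for_affected <= 0:
        raise ValueError("target speedup is unreachable by improving this part alone")
    return affected_time / time_left_for_affected


if __name__ == "__main__":
    n = required_improvement(total_time=100, affected_time=80, target_speedup=4)
    print(f"Multiplication must become {n:.0f} times faster")
    try:
        required_improvement(total_time=100, affected_time=80, target_speedup=5)
    except ValueError as err:
        print("5x overall:", err)
```

The second call illustrates the limit in Amdahl's Law: with 20 s unaffected, the program can never run 5 times faster (100 s / 5 = 20 s), no matter how fast multiplication becomes.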

Amdahl's Law


29/42

Metrics of Performance

Compiler

Programming

Language

Application

Datapath, Control

Transistors, Wires, Pins

ISA

Function Units

(millions) of Instructions per second: MIPS

(millions) of (FP) operations per second: MFLOP/s

Cycles per second (clock rate)

Megabytes per second

Answers per month

Operations per second


30/42

Summary

Computer Architecture = Instruction Set Architecture + Machine Organization

All computers consist of five components

Processor: (1) datapath and (2) control

(3) Memory

(4) Input devices and (5) Output devices

Not all memory is created equal

Cache: fast (expensive) memory placed closer to the processor

Main memory: less expensive memory

Interfaces are where the problems are: between functional units and between the computer and the outside world

Need to design against constraints of performance, power, area and cost


31/42

Instruction-Set Architecture

Software impact

support OS functions

restartable instructions

memory relocation and protection

a good compiler target

simple

orthogonal

Hardware impact

parallel implementation

OP R1 R2 R3 imm

OP R1M1 im2R2M2

R3M3 im2 ...

Hardware/Software Interface


32/42

ISA Basics

Instruction fields: Op, Mode, Ra, Rb

Before State (Mem, Regs) -> instruction -> After State (Mem, Regs)

Instruction formats

Instruction types

Addressing modes

Data types

Operations

Interrupts/Events

Machine state

Memory organization

Register organization


33/42

System-Level Organization

Design at the level of processors, memories, and interconnect.

More important to application performance than CPU design

Feeds and speeds

constrained by IC pin count, module pin count, and signaling rates

System balance

for a particular application

Driven by

performance/cost goals

available components (cost/perf)

technology constraints


34/42

Microarchitecture

Register-transfer-level (RTL) design

Implement instruction set

Exploit capabilities of technology

locality and concurrency

Iterative process

generate proposed architecture

estimate cost

measure performance

Current emphasis is on overcoming the sequential nature of programs

deep pipelining

multiple issue

dynamic scheduling

branch prediction/speculation

[Figure: microarchitecture block diagram with labels Regs, Instr. Cache, IR, PC, A, B, C]


35/42

Memory

Requires energy to change state

Feedback circuit - SRAM

Capacitors - DRAM

Magnetic media - disk

Required for memories

Storage medium

Write mechanism

Read mechanism

4Gb DRAM Die


36/42

Technology Trends

Processor

logic capacity: about 30% per year

clock rate: about 20% per year

Memory

DRAM capacity: about 60% per year (4x every 3 years)

Memory speed: about 10% per year

Cost per bit: improves about 25% per year

Disk

capacity: about 60% per year

Total use of data: 100% per 9 months!

Network Bandwidth

Bandwidth increasing more than 100% per year!
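Annual growth rates compound, which is how "about 60% per year" corresponds to roughly "4x every 3 years". The Python sketch below shows the arithmetic; the function names `growth_factor` and `doubling_time_years` are assumptions made for this example.

```python
import math


def growth_factor(annual_rate: float, years: float) -> float:
    """Total growth after compounding an annual rate (0.60 means 60% per year)."""
    return (1.0 + annual_rate) ** years


def doubling_time_years(annual_rate: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_rate)


if __name__ == "__main__":
    print(f"DRAM capacity over 3 years at 60%/yr: {growth_factor(0.60, 3):.2f}x")
    print(f"Memory speed over 3 years at 10%/yr: {growth_factor(0.10, 3):.2f}x")
    print(f"Doubling time at 60%/yr: {doubling_time_years(0.60):.2f} years")
```

Comparing the first two lines of output makes the widening gap between memory capacity and memory speed concrete.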


37/42

Architecture trends

1970s (CISC mainframes)

multi-chip CPUs

semiconductor memory very expensive

microcoded control

complex instruction sets (good code density)

1980s (RISC micros)

single-chip CPUs, on-chip RAM feasible

simple, hard-wired control

simple instruction sets

small on-chip caches

1990s (fast clocks)

lots of transistors

complex control to exploit instruction-level parallelism

2000s (???)

even more transistors

slow wires

BIG SHIFT Here!!!

Parallelism is focus

Power now critical

Open debate


38/42

Moore's Law

In 1965, Gordon Moore predicted that the number of transistors that can be integrated on a die would double every 18 to 24 months (i.e., grow exponentially with time).

Amazingly visionary: the million transistor/chip barrier was crossed in the 1980s.

2300 transistors, 1 MHz clock (Intel 4004) - 1971

16 Million transistors (Ultra Sparc III)

42 Million transistors, 2 GHz clock (Intel Xeon) - 2001

55 Million transistors, 3 GHz, 130 nm technology, 250 mm^2 die (Intel Pentium 4) - 2004

140 Million transistors (HP PA-8500)
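As a rough sanity check, the doubling period implied by two of the dated data points above can be computed directly. This is only a sketch: the function name `implied_doubling_years` is an assumption, and the result depends entirely on the transistor counts and years quoted on the slide.

```python
import math


def implied_doubling_years(count_start: float, count_end: float, years_elapsed: float) -> float:
    """Doubling period implied by exponential growth between two data points."""
    doublings = math.log2(count_end / count_start)
    return years_elapsed / doublings


if __name__ == "__main__":
    # Intel 4004 (1971, 2300 transistors) to Intel Xeon (2001, 42 million).
    period = implied_doubling_years(2300, 42e6, 2001 - 1971)
    print(f"Implied doubling period: {period:.2f} years ({period * 12:.0f} months)")
```

With these two points the implied period comes out slightly above two years, close to the upper end of the 18 to 24 month range quoted above.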


39/42

Processor Performance Increase

[Figure: Performance (SPECInt), log scale from 1 to 10000, versus Year, 1987 to 2003. Machines plotted: SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600, DEC Alpha 21264A/667, Intel Xeon/2000, Intel Pentium 4/3000]


40/42

DRAM Capacity Growth

[Figure: DRAM capacity in Kbit, log scale from 10 to 1000000, versus year of introduction, 1976 to 2002. Generations labeled: 16K, 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M, 512M]


41/42

Eras in Processor Architecture

[Figure: MIPS, log scale from 0.01 to 1000000, versus year, 1970 to 2015 (chart dated March 2005). Processors marked: 4004, 8086, 286, 386, i386, 486, Pentium, Pentium 4, Conroe. Era labels: Pipelined Architecture; Superscalar; Speculative; Instruction Level Parallelism; HyperThreaded; Multi-Threaded; Multi-Core; Thread & Processor Level Parallelism with Special Purpose HW; Function Level Parallelism]


42/42

THANK YOU
