aca lecture 1

Upload: krisravi

Post on 07-Apr-2018


  • 8/6/2019 ACA Lecture 1

    1/42

    Advanced Computer Architecture & Algorithms

    Lecture 01: Introduction


    Introduction

    Control

    Datapath

    Memory

    Processor

    Input

    Output


    Introduction

    High Level Language Program

    Assembly Language Program

    Machine Language Program

    Control Signal Specification

    Compiler

    Assembler

    Machine Interpretation


    Introduction

    instruction set

    software

    hardware


    Introduction

    Computer Architecture = ISA + MO (Machine Organization)

    How does the hardware implement the ISA?

    System Organization (processor, memory, I/O)

    Microarchitecture

    Learn methods of measuring and improving performance

    Metrics

    Benchmarks

    Performance methods

    Learn to think and program concurrently


    Introduction..

    I/O system

    Instr. Set Proc.

    Compiler

    Operating System

    Application

    Digital Design

    Circuit Design

    Instruction Set Architecture

    Datapath & Control

    Layout


    Instruction Set Architecture is the structure of a computer that a machine language programmer (or a compiler) must understand to write a correct (timing independent) program for that machine.

    The ISA defines:

    Operations that the processor can execute

    Data Transfer mechanisms + how to access data

    Control Mechanisms (branch, jump, etc)

    Contract between programmer/compiler + HW

    Introduction


    Specification

    Program

    ISA (Instruction Set Architecture)

    microArchitecture

    Logic

    Transistors

    compute the Fibonacci sequence

    for(i=2; i
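The loop on this slide is cut off in the transcript, so the original code cannot be recovered here. As a minimal sketch of what the program level of this stack might look like (not the slide's own code; the function name `fibonacci` is chosen for illustration), the sequence can be computed iteratively:

```python
def fibonacci(n):
    """Return the first n Fibonacci numbers, starting from 0 and 1."""
    seq = [0, 1]
    # Each new term is the sum of the two preceding terms.
    for i in range(2, n):
        seq.append(seq[i - 1] + seq[i - 2])
    return seq[:n]


print(fibonacci(10))
```

The levels below the program (ISA, microarchitecture, logic, transistors) each implement the same computation at a finer grain.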


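    NAND Gate

    [Figure: NAND gate circuit and symbol with supply Vdd, inputs A and B, output Out]

    A B Out
    0 0 1
    0 1 1
    1 0 1
    1 1 0

    NOR Gate

    [Figure: NOR gate circuit and symbol with supply Vdd, inputs A and B, output Out]

    A B Out
    0 0 1
    0 1 0
    1 0 0
    1 1 0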
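The two truth tables can be checked with a short sketch. The helper names `nand`, `nor`, and `truth_table` are chosen here for illustration and model only the logical behavior, not the transistor circuit:

```python
def nand(a, b):
    """Output is 0 only when both inputs are 1."""
    return 0 if (a and b) else 1


def nor(a, b):
    """Output is 1 only when both inputs are 0."""
    return 0 if (a or b) else 1


def truth_table(gate):
    """List (A, B, Out) rows for all four input combinations."""
    return [(a, b, gate(a, b)) for a in (0, 1) for b in (0, 1)]


for name, gate in (("NAND", nand), ("NOR", nor)):
    print(name)
    for a, b, out in truth_table(gate):
        print(a, b, out)
```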


    Introduction..

    Instruction Set Architecture

    Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, VLIW, DSP, Reconfiguration

    Addressing, Protection, Exception Handling

    L1 Cache

    L2 Cache

    DRAM

    Disks, Tape

    Coherence, Bandwidth, Latency

    Emerging Technologies, Interleaving, Bus protocols

    VLSI

    Input/Output and Storage

    Memory Hierarchy

    Pipelining and Instruction Level Parallelism


    Intel 4004 - 1971

    The first microprocessor

    2,300 transistors

    108 kHz


    Examples (Throughput/Performance)

    Replace the processor with a faster version?

    3.8 GHz instead of 3.2 GHz

    Add an additional processor to a system?

    Core Duo instead of P4

    Performance is determined by execution time

    Related variables

    # of cycles to execute program

    # of instructions in program

    # of cycles per second (clock rate)

    average # of cycles per instruction

    average # of instructions per second


    Relating the Metrics

    Performance = 1/Execution Time

    CPU Execution Time = CPU clock cycles for program x Clock cycle time

    or

    Execution time = # of clock cycles x cycle time

    Cycle time (period) = time between ticks = seconds per cycle

    CPU clock cycles = Instructions for a program x Average clock cycles per Instruction
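The relations above can be combined in a small sketch. The function name and the sample numbers are assumptions made for illustration; the clock rate is used in place of the cycle time, since cycle time = 1 / clock rate:

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """Execution time = instructions x CPI x cycle time (cycle time = 1 / clock rate)."""
    clock_cycles = instruction_count * cpi  # CPU clock cycles for the program
    return clock_cycles / clock_rate_hz     # same as clock_cycles x cycle time


# Hypothetical program: 1 billion instructions, average CPI of 2.0, 1 GHz clock.
print(cpu_time_seconds(1_000_000_000, 2.0, 1e9))  # 2.0
```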


    Execution time = seconds/program = (instructions/program) x (cycles/instruction) x (seconds/cycle)

    Influenced by: Program, Compiler, Architecture (ISA), Compiler (Scheduling), Organization, Micro architects, Circuit Designers, Technology

    Performance


    Clocking and Clocked Elements

    Typical Clock

    1 Hz = 1 cycle per second

    period (cycle time)

    [Figure: two gate networks with inputs A, B, C, D]

    Reduce the number of gate levels


    For some program running on machine X,

    Performance_X = 1 / Execution time_X

    "X is n times faster than Y"

    n = Performance_X / Performance_Y = Execution time_Y / Execution time_X

    Problem: Machine A runs a program in 10 seconds and machine B in 15 seconds. How much faster is A than B?

    Answer:

    n = Performance_A / Performance_B = Execution time_B / Execution time_A = 15/10 = 1.5

    A is 1.5 times faster than B.
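The same ratio as a minimal sketch (the function name `times_faster` is an illustrative choice):

```python
def times_faster(exec_time_x, exec_time_y):
    """Return n such that X is n times faster than Y (ratio of execution times)."""
    return exec_time_y / exec_time_x


# Machine A: 10 s, machine B: 15 s
print(times_faster(10, 15))  # 1.5
```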

    Performance


    Suppose we have two implementations of the same instruction set architecture (ISA). For some program,

    Machine A has a clock cycle time of 10 ns and a CPI of 2.0

    Machine B has a clock cycle time of 20 ns and a CPI of 1.2

    Which machine is faster for this program, and by how much?

    Time per instruction:

    for A 2.0 * 10 ns = 20 ns

    for B 1.2 * 20 ns = 24 ns

    A is 24/20 = 1.2 times faster
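Because both machines implement the same ISA, the instruction count is the same and only CPI x cycle time matters. A minimal sketch of the comparison, with illustrative names:

```python
def time_per_instruction_ns(cpi, cycle_time_ns):
    """Average time per instruction = CPI x clock cycle time."""
    return cpi * cycle_time_ns


time_a = time_per_instruction_ns(2.0, 10)  # machine A
time_b = time_per_instruction_ns(1.2, 20)  # machine B

# The instruction count cancels out of the ratio, so it is not needed here.
print(f"A: {time_a:.0f} ns, B: {time_b:.0f} ns, A is {time_b / time_a:.1f} times faster")
```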

    CPI Example


    To improve performance (everything else being equal) you

    can either

    reduce the # of required clock cycles for a program

    or

    decrease the clock period or, said another way,

    increase the clock frequency.

    How to Improve Performance


    Multiplication takes more time than addition

    Floating point operations take longer than integer operations

    Accessing memory takes more time than accessing registers

    Important point: changing the cycle time often changes the number of cycles required for various instructions (more later)

    e.g. a memory access takes a fixed amount of time rather than a fixed number of cycles, so a shorter cycle means more cycles per access

    Another point: the same instruction might require a different number of cycles on a different machine

    because the circuits have been implemented in different ways

    How to improve the performance


    # of Instructions:

    A compiler designer has two alternatives for a certain code sequence. There are three

    different classes of instructions: A, B, and C, and they require one, two, and three cycles,

    respectively.

    The first sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.

    The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.

    Which sequence will be faster? What are the CPI values?

    Sequence 1: 2*1+1*2+2*3 = 10 cycles; CPI1 = 10 / 5 = 2

    Sequence 2: 4*1+1*2+1*3 = 9 cycles; CPI2 = 9 / 6 = 1.5

    Sequence 2 is faster.
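The cycle counts and CPI values can be reproduced with a short sketch; the dictionary layout and function names are illustrative choices:

```python
# Cycles required by each instruction class, as given in the example
CYCLES = {"A": 1, "B": 2, "C": 3}


def total_cycles(counts):
    """Sum of (instruction count x cycles) over all classes."""
    return sum(n * CYCLES[cls] for cls, n in counts.items())


def cpi(counts):
    """Average cycles per instruction for the sequence."""
    return total_cycles(counts) / sum(counts.values())


seq1 = {"A": 2, "B": 1, "C": 2}
seq2 = {"A": 4, "B": 1, "C": 1}

print(total_cycles(seq1), cpi(seq1))  # 10 2.0
print(total_cycles(seq2), cpi(seq2))  # 9 1.5
```

Sequence 2 executes more instructions but needs fewer cycles, which is why instruction count alone is a poor predictor of speed.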

    Example...


    Cycles Per Instruction (CPI)

    Depends on the instruction type

    Average cycles per instruction

    Example:

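    $$\mathrm{CPI}_i = \text{Execution time of instruction } i \times \text{Clock Rate}$$

    $$\mathrm{CPI} = \sum_{i=1}^{n} \mathrm{CPI}_i \times F_i \quad \text{where} \quad F_i = \frac{IC_i}{IC_{tot}}$$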

    Op       Freq   Cycles   CPI(i)   %time
    ALU      50%    1        0.5      33%
    Load     20%    2        0.4      27%
    Store    10%    2        0.2      13%
    Branch   20%    2        0.4      27%
    CPI(total)               1.5
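The table can be recomputed from the frequency and cycle columns. This is a minimal sketch; the list layout and function names are illustrative:

```python
# (operation, frequency, cycles) taken from the table above
MIX = [("ALU", 0.50, 1), ("Load", 0.20, 2), ("Store", 0.10, 2), ("Branch", 0.20, 2)]


def average_cpi(mix):
    """Weighted CPI: sum over classes of frequency x cycles."""
    return sum(freq * cycles for _, freq, cycles in mix)


def time_shares(mix):
    """Fraction of execution time spent in each class: CPI(i) / CPI(total)."""
    total = average_cpi(mix)
    return {op: freq * cycles / total for op, freq, cycles in mix}


print(f"CPI(total) = {average_cpi(MIX):.2f}")
for op, share in time_shares(MIX).items():
    print(f"{op}: {share:.0%} of time")
```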


    Amdahl's Law

    Speedup due to enhancement E:

    Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E


    Version 1

    Execution Time After Improvement = Execution Time Unaffected + Execution Time Affected / Amount of Improvement

    Version 2

    Speedup

    = Performance after improvement / Performance before improvement

    = Execution time before improvement / Execution time after improvement

    Amdahl's Law


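    Amdahl's Law

    $$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$

    $$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$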


    Example:

    A benchmark program spends half of the time executing floating point instructions.

    We improve the performance of the floating point unit by a factor of four.

    What is the speedup?

    Time before = 10 s (assumed)

    Time after = 5s + 5s/4 = 6.25 s

    Speedup = 10/6.25 = 1.6
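The result does not depend on the assumed 10 s, as a small sketch of Amdahl's Law shows. The function and parameter names are illustrative:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Half of the time is floating point, and the FP unit becomes 4x faster.
print(amdahl_speedup(0.5, 4))  # 1.6
```

Even an arbitrarily fast floating point unit could not push this program past a speedup of 2, because the unaffected half remains.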

    Amdahl's Law


    Example:

    Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?

    100 s/4 = 80 s/n + 20 s

    5 s = 80s/n

    n= 80 s/ 5 s = 16
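The same algebra can be written as a sketch that solves for the required improvement. The function name and the error handling are assumptions added for illustration:

```python
def required_improvement(total_time, affected_time, target_speedup):
    """Solve total/target = affected/n + unaffected for n."""
    unaffected = total_time - affected_time
    budget = total_time / target_speedup - unaffected  # time left for the affected part
    if budget <= 0:
        # The unaffected part alone already exceeds the target time.
        raise ValueError("target speedup is not achievable")
    return affected_time / budget


print(required_improvement(100, 80, 4))  # 16.0
```

Asking for a 5x overall speedup with the same numbers raises the error, since the 20 s unaffected part alone already equals the 20 s target.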

    Amdahl's Law


    Metrics of Performance

    Compiler

    Programming

    Language

    Application

    Datapath, Control

    Transistors, Wires, Pins

    ISA

    Function Units

    (millions) of Instructions per second: MIPS

    (millions) of (FP) operations per second: MFLOP/s

    Cycles per second (clock rate)

    Megabytes per second

    Answers per month

    Operations per second


    Summary

    Computer Architecture = Instruction Set Architecture + Machine Organization

    All computers consist of five components

    Processor: (1) datapath and (2) control

    (3) Memory

    (4) Input devices and (5) Output devices

    Not all memory is created equal

    Cache: fast (expensive) memory is placed closer to the processor

    Main memory: less expensive memory

    Interfaces are where the problems are: between functional units and between the computer and the outside world

    Need to design against constraints of performance, power, area and cost


    Instruction-Set Architecture

    Software impact

    support OS functions

    restartable instructions

    memory relocation and protection

    a good compiler target

    simple

    orthogonal

    Hardware impact

    parallel implementation

    OP R1 R2 R3 imm

    OP R1M1 im2R2M2

    R3M3 im2 ...

    Hardware/Software Interface


    ISA Basics

    Op Mode Ra Rb

    Mem, Regs (Before State)

    Mem, Regs (After State)

    instruction

    Instruction formats

    Instruction types

    Addressing modes

    Data types

    Operations

    Interrupts/Events

    Machine state

    Memory organization

    Register organization


    System-Level Organization

    Design at the level of processors, memories, and interconnect.

    More important to application performance than CPU design

    Feeds and speeds

    constrained by IC pin count, module pin count, and signaling rates

    System balance

    for a particular application

    Driven by

    performance/cost goals

    available components (cost/perf)

    technology constraints


    Microarchitecture

    Register-transfer-level (RTL) design

    Implement instruction set

    Exploit capabilities of technology

    locality and concurrency

    Iterative process

    generate proposed architecture

    estimate cost

    measure performance

    Current emphasis is on overcoming sequential nature of programs

    deep pipelining

    multiple issue

    dynamic scheduling

    branch prediction/speculation

    [Figure: datapath sketch with elements labeled Regs, Instr. Cache, IR, PC, A, B, C]


    Memory

    Requires energy to change state

    Feedback circuit - SRAM

    Capacitors - DRAM

    Magnetic media - disk

    Required for memories

    Storage medium

    Write mechanism

    Read mechanism

    4Gb DRAM Die


    Technology Trends

    Processor

    logic capacity: about 30% per year

    clock rate: about 20% per year

    Memory

    DRAM capacity: about 60% per year (4x every 3 years)

    Memory speed: about 10% per year

    Cost per bit: improves about 25% per year

    Disk

    capacity: about 60% per year

    Total use of data: 100% per 9 months!

    Network Bandwidth

    Bandwidth increasing more than 100% per year!
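The "60% per year (4x every 3 years)" figure is compound growth, which a short sketch can confirm. The function name is illustrative, and the rates are the approximate ones quoted above:

```python
def growth_factor(annual_rate, years):
    """Compound growth: (1 + rate) raised to the number of years."""
    return (1.0 + annual_rate) ** years


# DRAM capacity at about 60% per year over 3 years
print(round(growth_factor(0.60, 3), 3))  # 4.096

# Over the same 3 years, memory speed at about 10% per year grows far less
print(round(growth_factor(0.10, 3), 3))  # 1.331
```

The gap between the two factors illustrates why capacity and speed diverge over time.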


    Architecture trends

    1970s (CISC mainframes)

    multi-chip CPUs

    semiconductor memory very expensive

    microcoded control

    complex instruction sets (good code density)

    1980s (RISC micros)

    single-chip CPUs, on-chip RAM feasible

    simple, hard-wired control

    simple instruction sets

    small on-chip caches

    1990s (fast clocks)

    lots of transistors

    complex control to exploit instruction-level parallelism

    2000s (???)

    even more transistors

    slow wires

    BIG SHIFT Here!!!

    Parallelism is focus

    Power now critical

    Open debate


    Moore's Law

    In 1965, Gordon Moore predicted that the number of transistors that can be integrated on a die would double every 18 to 24 months (i.e., grow exponentially with time).

    Amazingly visionary: the million transistor/chip barrier was crossed in the 1980s.

    2300 transistors, 1 MHz clock (Intel 4004) - 1971

    16 Million transistors (Ultra Sparc III)

    42 Million transistors, 2 GHz clock (Intel Xeon) - 2001

    55 Million transistors, 3 GHz, 130nm technology, 250 mm2 die (Intel Pentium 4) - 2004

    140 Million transistors (HP PA-8500)
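As a rough sketch of what "doubling every 18 to 24 months" implies, the projection below starts from the 2300 transistors of the 4004 in 1971 and runs 30 years to 2001. The function name is illustrative and the output is a back-of-the-envelope projection, not measured data:

```python
def projected_transistors(start_count, years, doubling_months):
    """Exponential growth: the count doubles once per doubling period."""
    doublings = years * 12 / doubling_months
    return start_count * 2 ** doublings


for months in (18, 24):
    count = projected_transistors(2300, 30, months)
    print(f"doubling every {months} months: {count:,.0f} transistors")
```

The 24-month projection (about 75 million) lands within a factor of two of the 42 million quoted for 2001, while the 18-month projection overshoots by more than an order of magnitude, so the choice of doubling period matters a great deal over three decades.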


    Processor Performance Increase

    [Figure: Performance (SPECInt, log scale from 1 to 10000) versus Year (1987 to 2003). Labeled machines: SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600, DEC Alpha 21264A/667, Intel Xeon/2000, Intel Pentium 4/3000]


    DRAM Capacity Growth

    [Figure: DRAM capacity in Kbit (log scale from 10 to 1000000) versus Year of introduction (1976 to 2002). Labeled generations: 16K, 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M, 512M]


    Eras in Processor Architecture

    [Figure: MIPS (log scale from 0.01 to 1000000) versus year (1970 to 2015), dated March 2005. Labeled processors: 4004, 8086, 286, 386, i386, 486, Pentium, Pentium 4, Conroe. Labeled eras and techniques: Pipelined Architecture, Superscalar, Speculative, Instruction Level Parallelism, Hyper-Threaded, Multi-Threaded, Multi-Core, Thread & Processor Level Parallelism, Special Purpose HW, Function Level Parallelism]


    THANK YOU