solution vu assignment


  • CONTENTS

    Collected exam questions and solutions from academic years 96, 95, 94, 93, and 92 (several exams per year).

    96

    1. _____ implements the translation of a program's address space to physical

    addresses.

    (A) DRAM

    (B) Main memory

    (C) Physical memory

    (D) Virtual memory

    Answer: (D)

    2. To track whether a page of disk has been written since it was read into the

    memory, a ____ is added to the page table.

    (A) valid bit

    (B) tag index

    (C) dirty bit

    (D) reference bit

    Answer: (C)

    3. (Refer to the CPU architecture of Figure 1 below) Which of the following

    statements is correct for a load word (LW) instruction?

    (A) MemtoReg should be set to 0 so that the correct ALU output can be sent to

    the register file.

    (B) MemtoReg should be set to 1 so that the Data Memory output can be sent to

    the register file.

    (C) We do not care about the setting of MemtoReg. It can be either 0 or 1.

    (D) MemWrite should be set to 1.

    Answer: (B)


    [Figure 1: single-cycle MIPS CPU datapath with PC, instruction memory, register file, sign-extend unit, ALU, ALU control, data memory, and the control signals RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite, and PCSrc (diagram not reproduced)]

    Figure 1

    4. IEEE 754 binary representation of a 32-bit floating number is shown below

    (normalized single-precision representation with bias = 127)

    31 30 ~ 23 22 ~ 0

    S exponent fraction

    1 bit 8 bits 23 bits

    (S) (E) (F)

    What is the correct binary representation of (-0.75)10 in IEEE single-precision float

    format?

    (A) 1011 1111 0100 0000 0000 0000 0000 0000

    (B) 1011 1111 1010 0000 0000 0000 0000 0000

    (C) 1011 1111 1101 0000 0000 0000 0000 0000

    (D) 1011 1110 1000 0000 0000 0000 0000 0000

    Answer: (A)
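    As a quick sanity check of this conversion (an added illustration, not part of the original answer key), the C sketch below prints the raw bit pattern of -0.75 stored as a single-precision float; on an IEEE 754 machine it prints 0xBF400000, which is choice (A).

        #include <stdio.h>
        #include <string.h>
        #include <stdint.h>

        int main(void) {
            float x = -0.75f;                  /* -0.75 = -1.1two x 2^(-1)          */
            uint32_t bits;
            memcpy(&bits, &x, sizeof bits);    /* reinterpret the float's raw bits  */
            /* sign = 1, exponent = 126 (0111 1110), fraction = 100 0000 ... 0      */
            printf("0x%08X\n", (unsigned)bits);/* prints 0xBF400000, choice (A)     */
            return 0;
        }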

    5. According to Question 4, what is the decimal number represented by the word

    below?

    Bit position    31    30 ~ 23      22 ~ 0

    Binary value    1     1000 0011    0110 0000 0000 0000 0000 000

    (A) -10

    (B) -11

    (C) -22

    (D) -44


    Answer: (A)

    6. Assume that the following assembly code is run on a machine with a 2 GHz clock.

    The number of cycles for each assembly instruction is shown in Table 1.

    add $t0, $zero, $zero

    loop: beq $a1, $zero, finish

    add $t0, $t0, $a0

    sub $a1, $a1, 1

    j loop

    finish: addi $t0, $t0, 100

    add $v0, $t0, $zero

    instruction Cycles

    add, addi, sub 1

    lw, beq, j 2

    Table 1

    Assume $a0 = 3 and $a1 = 20 initially; select the correct value of $v0 at the

    final cycle:

    (A) 157

    (B) 160

    (C) 163

    (D) 166

    Answer: (B)

    7. According to Question 6, please calculate the MIPS (millions instructions per

    second) of this assembly code:

    (A) 1342

    (B) 1344

    (C) 1346

    (D) 1348

    Answer: (B)

    The loop body executes 20 times, for a total of 84 instructions and 125 clock cycles, so

    MIPS = instruction count / ((clock cycles / clock rate) x 10^6)
         = clock rate / (CPI x 10^6)
         = 84 / ((125 / (2 x 10^9)) x 10^6)
         = 1344
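    Both answers can be reproduced with the small C sketch below (an added illustration, not part of the original solution): it simulates the loop with the given initial values $a0 = 3 and $a1 = 20 and counts instructions and cycles using the weights from Table 1.

        #include <stdio.h>

        int main(void) {
            long a0 = 3, a1 = 20, t0, v0;         /* initial values from the question */
            long instructions = 0, cycles = 0;

            t0 = 0;                               /* add  $t0, $zero, $zero  */
            instructions++; cycles += 1;
            for (;;) {
                instructions++; cycles += 2;      /* beq  $a1, $zero, finish */
                if (a1 == 0) break;
                t0 += a0; instructions++; cycles += 1;   /* add $t0, $t0, $a0 */
                a1 -= 1;  instructions++; cycles += 1;   /* sub $a1, $a1, 1   */
                instructions++; cycles += 2;             /* j   loop          */
            }
            t0 += 100; instructions++; cycles += 1;      /* addi $t0, $t0, 100   */
            v0 = t0;   instructions++; cycles += 1;      /* add  $v0, $t0, $zero */

            double seconds = cycles / 2e9;               /* 2 GHz clock          */
            double mips    = instructions / (seconds * 1e6);
            printf("$v0 = %ld, instructions = %ld, cycles = %ld, MIPS = %.0f\n",
                   v0, instructions, cycles, mips);      /* 160, 84, 125, 1344   */
            return 0;
        }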


    Questions 8-11. Link the following terms ((1) ~ (4))

    (1) Microsoft Word

    (2) Operating system

    (3) Internet

    (4) CD-ROM

    to the most related terminology shown below (A, B, C, ..., K); choose ONLY the most

    related one (answer format: e.g., "(1) K" maps item (1) to

    terminology K).

    A Applications software F Personal computer

    B High-level programming language G Semiconductor

    C Input device H Super computer

    D Integrated circuit I Systems software

    E Output device K Computer Networking

    Please write down the answers in the answer table together with the choice

    questions.

    8. (1) Microsoft word

    9. (2) Operating system

    10. (3) Internet

    11. (4) CD-ROM

    Answer:

    8. (1) Microsoft word A

    9. (2) Operating system I

    10. (3) Internet K

    11. (4) CD-ROM C

    Questions 12-15. Match the memory hierarchy element on the left with the closest

    phrase on the right: (answer format: e.g., (1) d, for mapping item (1) (left) to

    item d (right))

    (1). L1 cache a. A cache for a cache

    (2). L2 cache b. A cache for disks

    (3). Main memory c. A cache for a main memory

    (4). TLB d. A cache for page table entries

    Please write down the answers in the answer table together with the choice

    questions.

    12. (1) L1 cache

    13. (2) L2 cache

    14. (3) Main memory

    15. (4) TLB


    Answer:

    12. (1) L1 cache a

    13. (2) L2 cache c

    14. (3) Main memory b

    15. (4) TLB d

    Questions 16-25. Based on the function of the seven control signals and the datapath of the MIPS CPU in Figure 1 (the same Figure 1 shown above), complete the settings of the control lines in Table 2 (use 0, 1, and X (don't care) only) for the two MIPS CPU instructions (beq and add). X (don't care) can help to reduce the implementation complexity, so you should put X whenever possible.

    Table 2 (control-line settings to complete):

    beq: Branch = (16), ALUSrc = (17), RegWrite = (18), RegDst = (19), MemtoReg = (20), MemWrite = (21), MemRead = (22), ALUOp1 = 0, ALUOp0 = 1

    add: (23), (24), (25)

    Please write down the answers in the answer table together with the choice

    questions.

    16. (16) =

    17. (17) =

    18. (18) =

    19. (19) =

    20. (20) =

    21. (21) =

    22. (22) =

    23. (23) =

    24. (24) =

    25. (25) =

    Answer:

    16. (16) = 1

    17. (17) = 0

    18. (18) = 0

    19. (19) = X

    20. (20) = X

    21. (21) = 0

    22. (22) = 0

    23. (23) = 1

    24. (24) = 0

    25. (25) = 0


    95

    1-4 Choose ALL the correct answers for each of the following questions 1 to 4. Note that credit will be given only if all choices are correct.

    1. With pipelines:
    (A) Increasing the depth of pipelining increases the impact of hazards.
    (B) Bypassing is a method to resolve a control hazard.
    (C) If a branch is taken, the branch prediction buffer will be updated.
    (D) In a static multiple-issue scheme, the multiple instructions issued in each clock cycle are fixed by the processor at the beginning of the program execution.
    (E) Predication is an approach to guess the outcome of an instruction and to remove the execution dependence.

    Answer: (A)
    (B) False: bypassing resolves data hazards, not control hazards.
    (C) False: the prediction buffer is updated when the guess is wrong.
    (D) False: the instruction packets are fixed by the compiler, not the processor.
    (E) False: that describes speculation, not predication.

    2. Increasing the degree of associativity of a cache scheme will:
    (A) Increase the miss rate.
    (B) Increase the hit time.
    (C) Increase the number of comparators.
    (D) Increase the number of tag bits.
    (E) Increase the complexity of the LRU implementation.

    Answer: (B), (C), (D), (E)
    (A) False: it decreases the miss rate.

    3. With caching:
    (A) A write-through scheme improves the consistency between main memory and the cache.
    (B) A split cache applies parallel caches to improve cache speed.
    (C) A TLB (translation-lookaside buffer) is a cache on the page table, and can help access virtual addresses faster.
    (D) No more than one TLB is allowed in a CPU, to ensure consistency.
    (E) A one-way set associative cache performs the same as a direct-mapped cache.

    Answer: (A), (B), (E)
    (C) False: the TLB helps access physical addresses faster.


    (D) False: the MIPS R3000 and the Pentium 4 have two TLBs.

    4. In a Pentium 4 PC,
    (A) The DMA mechanism can be applied to delegate responsibility from the CPU.
    (B) The AGP bus can be used to connect the MCH (Memory Control Hub) and a graphical output device.
    (C) USB 2.0 is a synchronous bus using a handshaking protocol.
    (D) The CPU can fetch and translate IA-32 instructions.
    (E) The CPU can reduce instruction latency with deep pipelining.

    Answer: (A), (B), (D)
    (C) False: USB 2.0 is an asynchronous bus.
    (E) False: pipelining cannot reduce a single instruction's latency.

    5. Examine the following two CPUs, each running the same instruction set. The first one is a Gallium Arsenide (GaAs) CPU. A 10 cm (about 4 inch) diameter GaAs wafer costs $2000. The manufacturing process creates 4 defects per square cm. The CPU fabricated in this technology is expected to have a clock rate of 1000 MHz, with an average clock cycles per instruction of 2.5 if we assume an infinitely fast memory system. The size of the GaAs CPU is 1.0 cm x 1.0 cm.

    The second one is a CMOS CPU. A 20 cm (about 8 inch) diameter CMOS wafer costs $1000 and has 1 defect per square cm. The 1.0 cm x 2.0 cm CPU executes multiple instructions per clock cycle to achieve an average clock cycles per instruction of 0.75, assuming an infinitely fast memory, while achieving a clock rate of 200 MHz. (The CPU is larger because it has on-chip caches and executes multiple instructions per clock cycle.)

    Assume the exponent in the die-yield equation is 2 for both GaAs and CMOS. Yields for GaAs and CMOS wafers are 0.8 and 0.9, respectively. Most of this information is summarized in the following table:

    Wafer   Diam. (cm)   Wafer Yield   Cost ($)   Defects (1/cm^2)   Freq. (MHz)   CPI    Die Area (cm x cm)   Test Dies (per wafer)
    GaAs    10           0.80          $2000      3.0                1000          2.5    1.0 x 1.0            4
    CMOS    20           0.90          $1000      1.8                200           0.75   1.0 x 2.0            4

    Hint: Here are two equations that may help:

    dies per wafer = pi x (wafer diameter / 2)^2 / die area - pi x wafer diameter / sqrt(2 x die area) - test dies per wafer

    die yield = wafer yield x (1 - defects per unit area x die area / 2)^2

    (a) Calculate the average execution time for each instruction with an infinitely fast memory. Which CPU is faster and by what factor?

    (b) How many seconds will each CPU take to execute a one-billion-instruction program?


    (c) What is the cost of a GaAs die for the CPU? Repeat the calculation for CMOS die. Show your work.

    (d) What is the ratio of the cost of the GaAs die to the cost of the CMOS die? (e) Based on the costs and performance ratios of the CPU calculated above, what

    is the ratio of cost/performance of the CMOS CPU to the GaAs CPU?

    Answer:

    (a) Execution time (GaAs) for one instruction = 2.5 x 1 ns = 2.5 ns
        Execution time (CMOS) for one instruction = 0.75 x 5 ns = 3.75 ns
        The GaAs CPU is faster, by 3.75 / 2.5 = 1.5 times.

    (b) Execution time (GaAs) = 1 x 10^9 x 2.5 ns = 2.5 seconds
        Execution time (CMOS) = 1 x 10^9 x 3.75 ns = 3.75 seconds

    (c) GaAs: dies per wafer = 67
              die yield = 0.8 x (1 - 3.0 x 1.0 / 2)^2 = 0.2
              Cost of a GaAs CPU die = 2000 / (67 x 0.2) = $149.25
        CMOS: dies per wafer = pi x (20/2)^2 / 2.0 - pi x 20 / sqrt(2 x 2.0) - 4 = 121
              die yield = 0.9 x (1 - 1.8 x 2.0 / 2)^2 = 0.576
              Cost of a CMOS CPU die = 1000 / (121 x 0.576) = $14.35

    (d) The cost of a GaAs die is 149.25 / 14.35 = 10.4 times that of a CMOS die.

    (e) The ratio of cost/performance of the CMOS CPU to the GaAs CPU is 10.4 / 1.5 = 6.93.

    6. Given the following 8 possible solutions for a POP or a PUSH operation in a STACK: (1) Read from Mem(SP), Decrement SP; (2) Read from Mem(SP),

    Increment SP; (3) Decrement SP, Read from Mem(SP) (4) Increment SP, Read

    from Mem(SP) (5) Write to Mem(SP), Decrement SP; (6) Write to Mem(SP),

    Increment SP; (7) Decrement SP, Write to Mem(SP); (8) Increment SP, Write to

    Mem(SP).

    Choose only ONE of the above solutions for each of the following questions.

    (a) Solution of a PUSH operation for a Last Full stack that grows ascending.
    (b) Solution of a POP operation for a Next Empty stack that grows ascending.
    (c) Solution of a PUSH operation for a Next Empty stack that grows ascending.
    (d) Solution of a POP operation for a Last Full stack that grows ascending.

    Answer:

    (a) (8) (b) (3) (c) (6) (d) (1)


    [Diagram: stack pointer (SP) positions for a Last Full stack and a Next Empty stack, with addresses growing from small to big]

    Last Full PUSH: (1) Increase SP; (2) Write to MEM(SP)
    Last Full POP: (1) Read from MEM(SP); (2) Decrease SP
    Next Empty PUSH: (1) Write to MEM(SP); (2) Increase SP
    Next Empty POP: (1) Decrease SP; (2) Read from MEM(SP)

    7. Execute the following Copy loop on a pipelined machine:

    Copy: lw    $10, 1000($20)
          sw    $10, 2000($20)
          addiu $20, $20, -4
          bne   $20, $0, Copy

    Assume that the machine datapath neither stalls nor forwards on hazards, so you

    must add nop instructions.

    (a) Rewrite the code inserting as few nop instructions as needed for proper

    execution;

    (b) Use multi-clock-cycle pipeline diagram to show the correctness of your solution.

    Answer: Suppose that register Read and Write could occur in the same clock cycle.

    (a) lw $10, 1000($20)

    Copy: addiu $20, $20, -4
          nop

    sw $10, 2000($20)

    bne $20, $0, Copy

    lw $10, 1000($20)

    (b)

    Cycle    1    2    3    4    5    6    7    8    9    10

    lw       IF   ID   EX   MEM  WB
    addiu         IF   ID   EX   MEM  WB
    nop                IF   ID   EX   MEM  WB
    sw                      IF   ID   EX   MEM  WB
    bne                          IF   ID   EX   MEM  WB
    lw                                IF   ID   EX   MEM  WB


    8. In a Personal Computer, the optical drive has a rotation speed of 7500 rpm, a

    40,000,000 bytes/second transfer rate, and a 60 ms seek time. The drive is served

    with a 16 MHz bus, which is 16-bit wide.

    (a) How long does the drive take to read a random 100,000-byte sector? (b) When transferring the 100,000-byte data, what is the bottleneck?

    Answer:

    (a) Time = seek time + rotational latency + transfer time
             = 60 ms + 0.5 x (60 / 7500) s + 100,000 / 40,000,000 s
             = 60 ms + 4 ms + 2.5 ms = 66.5 ms

    (b) The time for the bus to transfer 100,000 bytes is 100,000 / (2 x 16 x 10^6) = 3.125 ms.

        So the optical drive is the bottleneck.

    8. A processor has a 16 KB, 4-way set-associative data cache with 32-byte blocks.
    (a) What is the number of sets in the L1 cache?
    (b) The memory is byte-addressable and addresses are 35 bits long. Show the breakdown of the address into its cache access components.
    (c) How many total bytes are required for the cache?
    (d) Memory is connected via a 16-bit bus. It takes 100 clock cycles to send a request to memory and to receive a cache block. The cache has a 1-cycle hit time and a 95% hit rate. What is the average memory access time?
    (e) A software program consists of 25% memory-access instructions. What is the average number of memory-stall cycles per instruction if we run this program?

    Answer:

    (a) 16 KB / (32 x 4) = 128 sets

    (b)
        tag       index    block offset   byte offset
        23 bits   7 bits   3 bits         2 bits

    (c) 2^7 x 4 x (1 + 23 + 32 x 8) bits = 140 Kbits = 17.5 KB

    (d) Average memory access time = 1 + 0.05 x 100 = 6 clock cycles

    (e) (6 - 1) x 1.25 = 6.25 clock cycles
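    The arithmetic in (a)-(e) can be checked with the C sketch below (an added illustration); the parameters come from the question, and the 1.25 memory accesses per instruction is the assumption used above (1 instruction fetch plus 0.25 data accesses).

        #include <stdio.h>

        static int log2i(int x) {               /* integer log2 for powers of two */
            int n = 0;
            while (x > 1) { x >>= 1; n++; }
            return n;
        }

        int main(void) {
            /* 16 KB, 4-way, 32-byte blocks, 35-bit addresses, 1-cycle hit,
               95% hit rate, 100-cycle miss penalty.                          */
            int cache_bytes = 16 * 1024, ways = 4, block_bytes = 32, addr_bits = 35;
            int sets        = cache_bytes / (block_bytes * ways);        /* 128       */
            int offset_bits = log2i(block_bytes);                        /* 5 = 3 + 2 */
            int index_bits  = log2i(sets);                               /* 7         */
            int tag_bits    = addr_bits - index_bits - offset_bits;      /* 23        */
            double amat     = 1 + 0.05 * 100;                            /* 6 cycles  */
            double stalls   = (amat - 1) * 1.25;    /* 1 fetch + 0.25 data accesses   */
            printf("sets=%d tag=%d index=%d offset=%d AMAT=%.0f stalls=%.2f\n",
                   sets, tag_bits, index_bits, offset_bits, amat, stalls);
            return 0;
        }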


    94

    1. Compare two memory system designs for a classic 5-stage pipelined processor.

    Both memory systems have a 4-KB instruction cache. But system A has a

    4K-byte data cache, with a miss rate of 10% and a hit time of 1 cycle; and system

    B has an 8K-byte data cache, with a miss rate of 5% and a hit time of 2 cycles

    (the cache is not pipelined). For both data caches, cache lines hold a single word

    (4 bytes), and the miss penalty is 10 cycles. What are the respective average

    memory access times for data retrieved by load instructions for the above two

    memory system designs, measured in clock cycles?

    Answer:

    Average memory access time for system A = 1 + 0.1 x 10 = 2 cycles

    Average memory access time for system B = 2 + 0.05 x 10 = 2.5 cycles

    2. (a) Describe at least one clear advantage a Harvard architecture (separate

    instruction and data caches) has over a unified cache architecture (a single

    cache memory array accessed by a processor to retrieve both instruction and

    data)

    (b) Describe one clear advantage a unified cache architecture has over the Harvard

    architecture

    Answer:

    (a) Cache bandwidth is higher for a Harvard architecture than a unified cache

    architecture

    (b) Hit ratio is higher for a unified cache architecture than a Harvard architecture

    3. (a) What is RAID?

    (b) Match the RAID levels 1, 3, and 5 to the following phrases for the best match.

    Use each level only once.

    Data and parity striped across multiple disks

    Can withstand selective multiple disk failures

    Requires only one disk for redundancy

    Answer:

    (a) An organization of disks that uses an array of small and inexpensive disks so

    as to increase both performance and reliability

    (b) RAID 5 Data and parity striped across multiple disks

    RAID 1 Can withstand selective multiple disk failures

    RAID 3 Requires only one disk for redundancy


    4. (a) Explain the differences between a write-through policy and a write back policy

    (b) Tell which policy cannot be used in a virtual memory system, and describe the

    reason

    Answer:

    (a) Write through: always write the data into both the cache and the memory

    Write back: updating values only to the block in the cache, then writing the

    modified block to the lower level of the hierarchy when the block is replaced

    (b) Write-through will not work for virtual memory, since writes take too long.

    Instead, virtual memory systems use write-back

    5. (a) What is a denormalized number (denorm or subnormal)?

    (b) Show how to use gradual underflow to represent a denorm in a floating point

    number system.

    Answer:

    (a) For an IEEE 754 floating point number, if the exponent is all 0s, but the

    fraction is non-zero then the value is a denormalized number, which does not

    have an assumed leading 1 before the binary point. Thus, this represents a

    number (-1)^s x 0.f x 2^(-126), where s is the sign bit and f is the fraction.

    (b) Denormalized numbers allow a number to degrade in significance gradually until it becomes 0; this is called gradual underflow.

    For example, the smallest positive single precision normalized number is

    1.0000 0000 0000 0000 0000 0000two x 2^(-126),

    but the smallest single-precision denormalized number is

    0.0000 0000 0000 0000 0000 0001two x 2^(-126), or 1.0two x 2^(-149).

    6. Try to show the following structure in the memory map of a 64-bit Big-Endian

    machine, by plotting the answer in a two-row map where each row contains 8

    bytes.

    struct {
        int a;      // 0x11121314
        char *c;    // A, B, C, D, E, F, G
        short e;    // 0x2122
    } s;

    Answer:

    Byte    0    1    2    3    4    5    6    7
            11   12   13   14   A    B    C    D
            E    F    G    21   22

    (int: 4 bytes, short: 2 bytes, char: 1 byte)


    7. Assume we have the following 3 ISA styles:

    (1) Stack: All operations occur on top of stack where PUSH and POP are the only

    instructions that access memory;

    (2) Accumulator: All operations occur between an Accumulator and a memory

    location;

    (3) Load-Store: All operations occur in registers, and register-to-register

    instructions use 3 registers per instruction.

    (a) For each of the above ISAs, write an assembly code for the following

    program segment using LOAD, STORE, PUSH, POP, ADD, and SUB and

    other necessary assembly language mnemonics.

    { A = A + C;

    D = A - B; }

    (b) Some operations are not commutative (e.g., subtraction). Discuss what are

    the advantages and disadvantages of the above 3 ISAs when executing

    non-commutative operations.

    Answer:

    (a)

    (1) Stack      (2) Accumulator    (3) Load-Store
    PUSH A         LOAD A             LOAD R1, A
    PUSH C         ADD C              LOAD R2, C
    ADD            STORE A            ADD R1, R1, R2
    POP A          SUB B              STORE R1, A
    PUSH A         STORE D            LOAD R2, B
    PUSH B                            SUB R1, R1, R2
    SUB                               STORE R1, D
    POP D

    (b) With the Stack and Accumulator ISAs, the operand order of a non-commutative operation is fixed by the stack (or by the accumulator), which limits compile-time instruction scheduling; with the Load-Store ISA, all operands are named explicitly in registers, so the compiler can schedule non-commutative operations more freely.


    8. The program below divides two integers through repeated addition and was

    originally written for a non-pipelined architecture. The divide function takes in as

    its parameter a pointer to the base of an array of three elements where X is the

    first element at 0($a0), Y is the second element at 4 ($a0), and the result Z is to be

    stored into the third element at 8($a0). Line numbers have been added to the left

    for use in answering the questions below.

    1 DIVIDE: add $t3, $zero, $zero

    2 add $t2, $zero, $zero

    3 lw $t1, 4($a0)

    4 lw $t0, 0($a0)

    5 LOOP: beq $t2, $t0, END

    6 addi $t3, $t3, 1

    7 add $t2, $t2, $t1

    8 j LOOP

    9 END: sw $t3, 8($a0)

    (a) Given a pipelined processor as discussed in the textbook, where will data be

    forwarded (e.g., Line 10 EX/MEM to Line 11 EX/MEM)? Assume that

    forwarding is used whenever possible, but that branches have not been

    optimized in any way and are resolved in the EX stage.

    (b) How many data hazard stalls are needed? Between which instructions should

    the stall bubble(s) be introduced (e.g., Line 10 and Line 11)? Again, assume

    that forwarding is used whenever possible, but that branches have not been

    optimized in any way and are resolved in the EX stage.

    (c) If X = 6 and Y = 3,

    (i) How many times is the body of the loop executed?

    (ii) How many times is the branch beq not taken?

    (iii) How many times is the branch beq taken?

    (d) Rewrite the code assuming delayed branches are used. If it helps, you may

    assume that the answer to X/Y is at least 2. Assume that forwarding is used

    whenever possible and that branches are resolved in IF/ID. Do not worry

    about reducing the number of times through the loop, but arrange the code to

    use as few cycles as possible by avoiding stalls and wasted instructions.

    Answer:

    (a) Line 4 MEM/WB

    (b) 1 stall is needed, between Line 4 and Line 5

    (c) (i) 2 (ii) 2 (iii) 1

    (d) DIVIDE: add $t2, $zero, $zero

    lw $t0, 0($a0)

    add $t3, $zero, $zero

    lw $t1, 4($a0)

    LOOP: beq $t2, $t0, END

    add $t2, $t2, $t1


    j LOOP

    addi $t3, $t3, 1

    END: sw $t3, 8($a0)


    93

    1. Explain how each of the following six features contributes to the definition of a

    RISC machine: (a) Single-cycle operation, (b) Load/Store design, (c) Hardwired

    control, (d) Relatively few instructions and addressing modes, (e) Fixed

    instruction format, (f) More compile-time effort.

    Answer:

    (a) Single-cycle operation: most instructions complete in one (pipelined) clock cycle, which keeps the cycle time short and the control simple.

    (b) Load/Store design: only load and store instructions access memory; all other operations work on registers, which simplifies the datapath and makes pipelining easier.

    (c) Hardwired control: control signals are generated by fixed combinational logic rather than microcode, allowing a faster clock.

    (d) Relatively few instructions and addressing modes: simplifies instruction decoding and the control unit.

    (e) Fixed instruction format: instructions can be fetched and decoded uniformly and quickly.

    (f) More compile-time effort: optimization and instruction scheduling are moved from the hardware to the compiler, keeping the hardware simple, which is characteristic of RISC.

    2. (1) Give an example of structural hazard.

    (2) Identify all of the data dependencies in the following code. Show which

    dependencies are data hazards and how they can be resolved via

    forwarding?

    add $2, $5, $4

    add $4, $2, $5

    sw $5, 100($2)

    add $3, $2, $4

    Answer:

    (1) In a datapath with a single memory shared by instructions and data, consider:

        lw  $5, 100($2)
        add $2, $7, $4
        add $4, $2, $5
        sw  $5, 100($2)

        In clock cycle 4, instruction 1 (lw) accesses data memory while instruction 4 is being fetched from the same memory, so the shared memory creates a structural hazard.

    (2) Number the instructions:
        1  add $2, $5, $4
        2  add $4, $2, $5
        3  sw  $5, 100($2)
        4  add $3, $2, $4


    Register   Data dependencies         Data hazards
    $2         (1, 2), (1, 3), (1, 4)    (1, 2), (1, 3)
    $4         (2, 4)                    (2, 4)

    Take instruction pair (1, 2) for example. We don't need to wait for the first instruction to complete before trying to resolve the data hazard. As soon as the ALU creates

    the sum for the first instruction, we can supply it as an input for the second

    instruction.

    3. Explain (1) what is a precise interrupt? (2) what does RAID mean? (3) what does

    TLB mean?

    Answer:

    (1) An interrupt or exception that is always associated with the correct instruction

    in pipelined computers.

    (2) An organization of disks that uses an array of small and inexpensive disks so

    as to increase both performance and reliability.

    (3) A cache that keeps track of recently used address mappings to avoid an access

    to the page table.

    4. Consider a 32-byte direct-mapped write-through cache with 8-byte blocks.

    Assume the cache updates on write hits and ignores write misses. Complete the

    table below for a sequence of memory references which occur from left to right.

    (Redraw the table in your answer sheet)

    address 00 16 48 08 56 16 08 56 32 00 60

    read/write r r r r r r r w w r r

    index 0 2

    tag 0 0

    hit/miss miss

    Answer:

    Suppose the address is 8 bits. A 32-byte direct-mapped cache with 8-byte blocks has 4 blocks in the cache; block offset = 3 bits [2:0]; index = 2 bits [4:3]; tag = 8 - 3 - 2 = 3 bits [7:5].

    address (decimal)   address (binary)   tag (binary)   tag (decimal)   index (binary)   index (decimal)
    00                  000000             0              0               00               0
    16                  010000             0              0               10               2
    48                  110000             1              1               10               2
    08                  001000             0              0               01               1
    56                  111000             1              1               11               3
    16                  010000             0              0               10               2
    08                  001000             0              0               01               1
    56                  111000             1              1               11               3
    32                  100000             1              1               00               0
    00                  000000             0              0               00               0
    60                  111100             1              1               11               3

    address 00 16 48 08 56 16 08 56 32 00 60

    read/write r r r r r r r w w r r

    index 0 2 2 1 3 2 1 3 0 0 3

    tag 0 0 1 0 1 0 0 1 1 0 1

    hit/miss miss miss miss miss miss miss hit hit miss hit hit

    Note: the cache updates on write hits and ignores write misses, so the write miss at address 32 does not allocate a block, and the later read of address 00 still hits; the write hit at address 56 leaves its block in place.
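    The hit/miss row can also be reproduced mechanically. The C sketch below is an added illustration (using the same addresses as the table) of the 4-block direct-mapped cache with write-hit update and write-miss ignore.

        #include <stdio.h>

        int main(void) {
            /* 32-byte direct-mapped cache, 8-byte blocks: 4 sets, 3-bit block
               offset, 2-bit index. Write hits update the block in place; write
               misses are ignored (no allocation), as stated in the question.   */
            int  addr[] = {0, 16, 48, 8, 56, 16, 8, 56, 32, 0, 60};
            char rw[]   = {'r','r','r','r','r','r','r','w','w','r','r'};
            int  valid[4] = {0}, tags[4] = {0};

            for (int i = 0; i < 11; i++) {
                int index = (addr[i] >> 3) & 0x3;      /* address bits [4:3]   */
                int tag   = addr[i] >> 5;              /* bits above the index */
                int hit   = valid[index] && tags[index] == tag;
                if (!hit && rw[i] == 'r') {            /* allocate on read miss only */
                    valid[index] = 1;
                    tags[index]  = tag;
                }
                printf("addr %2d (%c): index=%d tag=%d %s\n",
                       addr[i], rw[i], index, tag, hit ? "hit" : "miss");
            }
            return 0;
        }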

    5. (1) List two Branch Prediction strategies and (2) compare their differences.

    Answer:

    (1) Static prediction and dynamic prediction.

    (2) Static prediction:
        (a) the guess is fixed before execution (e.g., always predict not taken, or a compiler-supplied hint) and does not use run-time information;
        (b) its accuracy is lower, so the total misprediction penalty tends to be higher;
        (c) it needs no extra prediction hardware.

        Dynamic prediction:
        (a) the guess is made at run time using run-time information such as the branch history;
        (b) its accuracy is higher, so the total misprediction penalty is lower;
        (c) it requires extra hardware such as a branch history table.

    6. Explain how the reference bit in a page table entry is used to implement an approximation to the LRU replacement strategy.

    Answer:

    The operating system periodically clears the reference bits and later records them

    so it can determine which pages were touched during a particular time period.

    With this usage information, the operating system can select a page that is among


    the least recently referenced.

    7. Trace Booth's algorithm step by step for the multiplication of 2 x (-6).

    Answer:

    2ten x (-6ten) = 0010two x 1010two = 1111 0100two = -12ten

    Iteration   Step                        Multiplicand   Product
    0           Initial values              0010           0000 1010 0
    1           00: no operation            0010           0000 1010 0
                Shift right product         0010           0000 0101 0
    2           10: prod = prod - Mcand     0010           1110 0101 0
                Shift right product         0010           1111 0010 1
    3           01: prod = prod + Mcand     0010           0001 0010 1
                Shift right product         0010           0000 1001 0
    4           10: prod = prod - Mcand     0010           1110 1001 0
                Shift right product         0010           1111 0100 1
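    For reference, the trace can be reproduced in C. The sketch below is an illustrative 4-bit Booth multiplier (not part of the original answer) whose 9-bit register holds the product high bits, the multiplier, and the extra bit, exactly as in the table above; it returns the 8-bit two's-complement product.

        #include <stdio.h>

        static int booth4(int multiplicand, int multiplier) {
            int m   = multiplicand & 0xF;
            int reg = (multiplier & 0xF) << 1;      /* [prod(4) | multiplier(4) | extra bit] */
            for (int i = 0; i < 4; i++) {
                int pair = reg & 0x3;               /* current bit, previously shifted bit   */
                int high = (reg >> 5) & 0xF;
                if (pair == 0x2) high = (high - m) & 0xF;   /* 10: prod = prod - Mcand */
                if (pair == 0x1) high = (high + m) & 0xF;   /* 01: prod = prod + Mcand */
                reg = (reg & 0x1F) | (high << 5);
                reg = (reg >> 1) | (reg & 0x100);   /* arithmetic shift right of 9 bits */
            }
            int product = (reg >> 1) & 0xFF;        /* drop the extra bit               */
            return (product & 0x80) ? product - 256 : product;
        }

        int main(void) {
            printf("2 x (-6) = %d\n", booth4(2, -6));   /* prints -12 (1111 0100two) */
            return 0;
        }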

    8. What are the differences between Trap and Interrupt?

    Answer:

    An interrupt is caused by an event external to the CPU (processor), such as an I/O device request, and is asynchronous with respect to the running program. A trap (exception) is caused by an event internal to the processor, such as arithmetic overflow or a system call, and is synchronous with the instruction that causes it.


    92

    1. A certain machine with a 10 ns (10 x 10^-9 s) clock period can perform jumps (1 cycle), branches (3 cycles), arithmetic instructions (2 cycles), multiply

    instructions (5 cycles), and memory instructions (4 cycles). A certain program has

    10% jumps, 10% branches, 50% arithmetic, 10% multiply, and 20% memory

    instructions. Answer the following question. Show your derivation in sufficient

    detail.

    (1) What is the CPI of this program on this machine?

    (2) If the program executes 10^9 instructions, what is its execution time?

    (3) A 5-cycle multiply-add instruction is implemented that combines an arithmetic and a multiply instruction. 50% of the multiplies can be turned into multiply-adds. What is the new CPI?

    (4) Following (3) above, if the clock period remains the same, what is the program's new execution time?

    Answer:

    (1) CPI = 1 x 0.1 + 3 x 0.1 + 2 x 0.5 + 5 x 0.1 + 4 x 0.2 = 2.7

    (2) Execution time = 10^9 x 2.7 x 10 ns = 27 s

    (3) CPI = (1 x 0.1 + 3 x 0.1 + 2 x 0.45 + 5 x 0.05 + 4 x 0.2 + 5 x 0.05) / (0.1 + 0.1 + 0.45 + 0.05 + 0.2 + 0.05) = 2.6 / 0.95 = 2.74

    (4) Execution time = 10^9 x 0.95 x 2.74 x 10 ns = 26.03 s

    Note: after 50% of the multiplies are fused into multiply-adds, only 95% of the original instructions remain, so the total cycles are divided by 0.95 to renormalize the CPI.
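    The renormalization can be checked numerically with the C sketch below (an added illustration; the instruction mix and cycle counts are taken from the question).

        #include <stdio.h>

        int main(void) {
            double f_jump = 0.10, f_br = 0.10, f_arith = 0.50, f_mul = 0.10, f_mem = 0.20;
            double cpi  = 1 * f_jump + 3 * f_br + 2 * f_arith + 5 * f_mul + 4 * f_mem;  /* 2.7  */
            double time = 1e9 * cpi * 10e-9;                                            /* 27 s */

            /* 50% of the multiplies become 5-cycle multiply-adds, each absorbing one
               arithmetic instruction, so only 95% of the instructions remain.          */
            double f_madd = 0.05;
            double cycles = 1 * f_jump + 3 * f_br + 2 * (f_arith - f_madd)
                          + 5 * (f_mul - f_madd) + 4 * f_mem + 5 * f_madd;              /* 2.6  */
            double count    = 1.0 - f_madd;                                             /* 0.95 */
            double new_cpi  = cycles / count;                                           /* 2.74 */
            double new_time = 1e9 * count * new_cpi * 10e-9;
            /* prints about 26 s; the answer above rounds the CPI to 2.74 first, giving 26.03 s */
            printf("CPI = %.2f, time = %.1f s, new CPI = %.2f, new time = %.2f s\n",
                   cpi, time, new_cpi, new_time);
            return 0;
        }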

    2. Answer True (O) or False (X) for each of the following. (NO penalty for wrong

    answer.)

    (1) Most computers use direct mapped page tables.

    (2) Increasing the block size of a cache is likely to take advantage of temporal

    locality.

    (3) Increasing the page size tends to decrease the size of the page table.

    (4) Virtual memory typically uses a write-back strategy, rather than a

    write-through strategy.

    (5) If the cycle time and the CPI both increase by 10% and the number of

    instruction deceases by 20%, then the execution time will remain the same.

    (6) A page fault occurs when the page table entry cannot be found in the

    translation lookaside buffer.

    (7) To store a given amount of data, direct mapped caches are typically smaller

    than either set associative or fully associative caches, assuming that the

    block size for each cache is the same.

    (8) The two's complement of a negative number is always a positive number in the same number format.


    (9) A RISC computer will typically require more instructions than a CISC

    computer to implement a given program.

    (10) Pentium 4 is based on the RISC architecture.

    Answer:

    (1) X  (2) X  (3) O  (4) O  (5) X  (6) X  (7) O  (8) X  (9) O  (10) X

    Note: modern CPUs like the Athlon XP and Pentium 4 are based on a mixture of RISC and CISC.

    3. The average memory access time (AMAT) is defined as

    AMAT = hit time + miss_rate x miss_penalty

    Answer the following two questions. Show your derivation in sufficient detail.

    (1) Find the AMAT of a 100MHz machine, with a miss penalty of 20 cycles, a hit

    time of 2 cycles, and a miss rate of 5%.

    (2) Suppose doubling the size of the cache decreases the miss rate to 3%, but causes the hit time to increase to 3 cycles and the miss penalty to increase to 21 cycles. What is the AMAT of the new machine?

    Answer:

    (1) AMAT = (2 + 0.05 x 20) x 10 ns = 30 ns

    (2) AMAT = (3 + 0.03 x 21) x 10 ns = 36.3 ns

    4. A pipelined processor has 5 stages and takes 100 ns to execute N instructions.

    How long will it take to execute 2N instructions, assuming the clock rate is 500

    MHz and no pipeline stalls occur?

    Answer:

    Clock cycle time = 1 / (500 x 10^6) = 2 ns; N + 4 = 100 / 2 = 50, so N = 46.

    The execution time of 2N instructions = 2 x 46 + 4 = 96 clock cycles = 192 ns.


    96

    1. Answer the following questions briefly:

    (a) Typically one CISC instruction, since it is more complex, takes more time to

    complete than a RISC instruction. Assume that an application needs N CISC

    instructions and 2N RISC instructions, and that one CISC instruction takes an

    average 5T ns to complete, and one RISC instruction takes 2T ns. Which

    processor has the better performance?

    (b) Which of the following processors have a CISC instruction set architecture?

    ARM AMD Opteron

    Alpha 21164 IBM PowerPC

    Intel 80x86 MIPS

    Sun UltraSPARC

    (c) True & False questions;

    (1) There are four types of data hazards; RAR, RAW, WAR, and WAW.

    (True or False?)

    (2) AMD and Intel recently added 64-bit capability to their processors

    because most programs run much faster with 64-bit instructions. (True or

    False?)

    (3) With a modern processor capable of dynamic instruction scheduling and

    out-of-order execution, it is better that the compiler does not to optimize

    the instruction sequences, (True or False?)

    Answer:

    (a) CISC time = N x 5T = 5NT ns

    RISC time = 2N x 2T = 4NT ns

    RISC time < CISC time, so the RISC architecture has better performance.

    (b) Intel 80x86, AMD Opteron

    (c) (1) False, RAR does not cause data hazard.

    (2) False, most programs run much faster with 64-bit processors not 64-bit

    instructions

    (3) False, the compiler still tries to help improve the issue rate by placing the

    instructions in a beneficial order.

    2. For commercial applications, it is important to keep data on-line and safe in

    multiple places.

    (a) Suppose we want to backup 100GB of data over the network. How many

    hours does it take to send the data by FTP over the Internet? Assume the

    average bandwidth between the two places is 1Mbits/sec.


    (b) Would it be better if you burn the data onto DVDs and mail the DVDs to the

    other site? Suppose it takes 10 minutes to burn a DVD, which has 4 GB

    capacity and the fast delivery service can deliver in 12 hours.

    Answer:

    (a) 100 GB = 800 Gbits; (800 x 2^30 bits) / (2^20 bits/s) = 800 x 1024 seconds = 227.56 hours

    (b) (100 GB / 4 GB) x 10 minutes = 250 minutes = 4.17 hours

    4.17 + 12 = 16.17 hours < 227.56 hours

    So it is better to burn the data onto DVDs and mail them to the other site.

    3. Suppose we have an application running on a shared-memory multiprocessor.

    With one processor, the application runs for 30 minutes.

    (a) Suppose the processor clock rate is 2GHz. The average CPI (assuming that all

    references hit in the cache) on single processor is 0.5. How many instructions

    are executed in the application?

    (b) Suppose we want to reduce the run time of the application to 5 minutes with 8

    processors. Let's optimistically assume that parallelization adds zero overhead

    to the application, i.e. no extra instructions, no extra cache misses, no

    communications, etc. What fraction of the application must be executed in

    parallel?

    (c) Suppose 100% of our application can be executed in parallel. Let's now

    consider the communication overhead. Assume the multiprocessor has a 200

    ns time to handle reference to a remote memory and processors are stalled on

    a remote request. For this application, assume 0.02% of the instructions

    involve a remote communication reference, no matter how many processors

    are used. How many processors are needed at least to make the run time be

    less than 5 minutes?

    (d) Following the above question, but let's assume the remote communication

    references in the application increases as the number of processors increases.

    With N processors, 0.02 x (N - 1)% of the instructions involve a remote

    communication reference. How many processors will deliver the maximum

    speedup?

    Answer:

    (a) 30 x 60 seconds = instruction count x 0.5 x 0.5 ns
        Instruction count = 1800 / 0.25 ns = 7200 x 10^9 instructions

    (b) Let F be the fraction of the application that must be executed in parallel. Then
        30 x ((1 - F) + F/8) = 5, so F = 20/21 = 0.952.

    (c) Let N be the number of processors that makes the run time less than 5 minutes:
        (30 x 60)/N + 7200 x 10^9 x 0.0002 x 200 ns < 5 x 60
        1800/N + 288 < 300, so N > 150.
        At least 150 processors are needed to make the run time less than 5 minutes.

    (d) Run time with N processors = 1800/N + 7200 x 10^9 x 0.0002 x (N - 1) x 200 ns = 1800/N + 288 x (N - 1).
        Speedup is maximized when this is minimized: -1800/N^2 + 288 = 0, so N = 2.5.
        2.5 processors will deliver the maximum speedup.
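    Parts (b)-(d) can be cross-checked with the C sketch below (an added illustration; the 7200 x 10^9 instruction count and the 0.02% remote-reference rate come from part (a) and the question).

        #include <stdio.h>
        #include <math.h>

        int main(void) {
            /* (b) 30 * ((1 - F) + F/8) = 5  =>  F = (1 - 5/30) / (1 - 1/8) = 20/21 */
            double F = (1.0 - 5.0 / 30.0) / (1.0 - 1.0 / 8.0);

            /* (c) remote-communication time: 7200e9 instructions * 0.02% * 200 ns  */
            double remote = 7200e9 * 0.0002 * 200e-9;            /* 288 s           */
            double n_min  = 1800.0 / (300.0 - remote);           /* 150 processors  */

            /* (d) minimize 1800/N + 288*(N - 1)  =>  N = sqrt(1800 / 288) = 2.5    */
            double n_opt = sqrt(1800.0 / 288.0);

            printf("F = %.3f, Nmin = %.0f, Nopt = %.1f\n", F, n_min, n_opt);
            return 0;
        }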

    4. Number representation.

    (a) What range of integer number can be represented by 16-bit 2's complement

    number?

    (b) Perform the following 8-bit 2's complement number operation and check

    whether arithmetic overflow occurs. Check your answer by converting to

    decimal sign-and-magnitude representation.

      11010011
    - 11101100

    Answer:

    (a) -2^15 to +(2^15 - 1)

    (b) 11010011 - 11101100 = 11010011 + 00010100 = 11100111

    Check: -45 - (-20) = -45 + 20 = -25

    The range for an 8-bit two's-complement number is -2^7 to +(2^7 - 1).

    So, no overflow
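    The same check can be done in C. The sketch below is only an illustration and assumes the usual two's-complement behaviour of int8_t; it performs the 8-bit subtraction and prints both the decimal values and the resulting bit pattern (0xE7 = 1110 0111).

        #include <stdio.h>
        #include <stdint.h>

        int main(void) {
            int8_t a    = (int8_t)0xD3;       /* 1101 0011 = -45               */
            int8_t b    = (int8_t)0xEC;       /* 1110 1100 = -20               */
            int8_t diff = (int8_t)(a - b);    /* -45 - (-20) = -25 = 1110 0111 */
            printf("a = %d, b = %d, a - b = %d (0x%02X)\n",
                   a, b, diff, (unsigned)(uint8_t)diff);
            return 0;
        }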

    5. Bus

    (a) Draw a graph to show the memory hierarchy of a system that consists of CPU,

    Cache, Memory and I/O devices. Mark where memory bus and I/O bus is.

    (b) Assume system 1 has a synchronous 32-bit bus with clock rate = 33 MHz running at 2.5 V. System 2 has a 64-bit bus with clock rate = 66 MHz running at 1.8 V. Assume the average capacitance on each bus line is 2 pF for the bus in system 1. What is the maximum average capacitance allowed for the bus of system 2 so that the peak power dissipation of the system 2 bus will not exceed that of the system 1 bus?

    (c) Serial bus protocol such as SATA has gained popularity in recent years. To

    design a serial bus that supports the same peak throughput as the bus in

    system 2, what is the clock frequency of this serial bus?


    Answer:

    (a) [Diagram omitted: the CPU and cache connect to main memory over the memory bus; the I/O devices attach through the I/O bus.]

    (b) Power dissipation = f x C x V^2

    The peak power dissipation for system 1 = 33 x 10^6 x (2 x 10^-12 x 32) x 2.5^2 = 13.2 mW

    Let C be the capacitance for system 2:

    66 x 10^6 x C x 1.8^2 < 13.2 mW, so C < 61.73 pF.
    The maximum average capacitance for system 2 is 61.73 pF.

    (c) Since SATA uses a single signal path to transmit data serially (bit by bit), the clock frequency should be designed as 66 MHz x 64 = 4.224 GHz to support the same peak throughput as the system 2 bus.

    (Note on (b): the calculation treats C as the total capacitance budget of the system 2 bus; per bus line this corresponds to 61.73 pF / 64, about 0.96 pF.)
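    The numbers in (b) and (c) follow directly from P = f C V^2; the C sketch below (an added illustration) reproduces them.

        #include <stdio.h>

        int main(void) {
            /* Peak power P = f * C * V^2, with the figures used in the answer. */
            double p1 = 33e6 * (2e-12 * 32) * 2.5 * 2.5;   /* system 1: 13.2 mW   */
            double c2 = p1 / (66e6 * 1.8 * 1.8);           /* limit for system 2  */
            double f_serial = 66e6 * 64;                   /* serial clock rate   */
            printf("P1 = %.1f mW, C2 < %.2f pF, serial clock = %.3f GHz\n",
                   p1 * 1e3, c2 * 1e12, f_serial / 1e9);   /* 13.2, 61.73, 4.224  */
            return 0;
        }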


    95

    PART I:

    Please answer the following questions in the format listed below. If you do not follow

    the format, you will get zero points for these questions.

    1. (1) T or F (2) T or F

    (3) T or F

    (4) T or F

    (5) T or F

    2. X = Y =

    Stall cycles =

    3. Option ___ is ___ times faster than the old machine

    4. 1-bit predictor: 2-bit predictor:

    1. True & False Questions (1) If an address translation for a virtual page is present in the TLB, then that

    virtual page must be mapped to a physical memory page.

    (2) The set index decreases in size as cache associativity is increased (assume

    cache size and block size remain the same)

    (3) It is impossible to have a TLB hit and a data cache miss for the same data

    reference.

    (4) An instruction takes less time to execute on a pipelined processor than on a

    nonpipelined processor (all other aspects of the processors being the same).

    (5) A multi-cycle implementation of the MIPS processor requires that a single memory be used for both instructions and data.

    Answer:

    (1) T (2) T (3) F (4) F (5) T

    2. Consider the following program:

    int A[100]; /* size(int) = 1 word */

    for (i = 0; i < 100; i++)

    A[i] = A[i] + 1;

    The code for this program on a MIPS-like load/store architecture looks as

    follows:

    ADDI R1, R0, #X

    ADDI R2, R0, A ; A is the base address of array A

    LOOP: LD R3, 0(R2)


    ADDI R3, R3, #1

    SD R3, 0(R2)

    ADDI R2, R2, #Y

    SUBI R1, R1, #1

    BNE R1, R0, LOOP

    Consider a standard 5-stage MIPS pipeline. Assume that the branch is resolved

    during the instruction decode stage, and full bypassing/register forwarding are

    implemented. Assume that all memory references hit in the cache and TLBs. The

    pipeline does not implement any branch prediction mechanism. What are values

    of #X and #Y, and how many stall cycles are in one loop iteration including stalls

    caused by the branch instruction?

    Answer:

    X = 100

    Y = 4

    Stall cycles = 3 ((1) between LD and ADDI, (2) between SUBI and BNE, (3) one after BNE)

    Since the branch decision is resolved during the ID stage, a one-cycle stall is needed between SUBI and BNE, and one more stall follows the branch itself.

    3. Suppose you had a computer that, on average, exhibited the following properties

    on the programs that you run:

    Instruction miss rate: 2%

    Data miss rate: 4%

    Percentage of memory instructions: 30%

    Miss penalty: 100 cycles

    There is no penalty for a cache hit (i.e. the cache can supply the data as fast as the

    processor can consume it.) You want to update the computer, and your budget will

    allow one of the following:

    Option #1: Get a new processor that is twice as fast as your current

    computer. The new processor's cache is twice as fast too, so it can keep up with the processor.

    Option #2: Get a new memory that is twice as fast.

    Which is a better choice? And what is the speedup of the chosen design compared

    to the old machine?

    Answer:

    Option 2 is 4.2 / 2.6 = 1.62 times faster than the old machine.

    Suppose that the base CPI = 1:

    CPI_old  = 1 + 0.02 x 100 + 0.04 x 0.3 x 100 = 4.2

    CPI_opt1 = 0.5 + 0.02 x 100 + 0.04 x 0.3 x 100 = 3.7

    CPI_opt2 = 1 + 0.02 x 50 + 0.04 x 0.3 x 50 = 2.6


    Note on option #1: the CPIs are expressed in units of the old clock cycle. Doubling the processor and cache speed halves the time for the base work (base CPI 1 becomes 0.5 old cycles), while the memory-stall portion is unchanged, so option #1 gives 3.7 versus 2.6 for option #2.
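    The comparison can be reproduced with the short C sketch below (an added illustration; the base CPI of 1 is the assumption stated above).

        #include <stdio.h>

        int main(void) {
            double imiss = 0.02, dmiss = 0.04, memfrac = 0.30, penalty = 100;
            double cpi_old  = 1.0 + imiss * penalty + dmiss * memfrac * penalty;   /* 4.2 */
            double cpi_opt1 = 0.5 + imiss * penalty + dmiss * memfrac * penalty;   /* 3.7, in old-cycle units */
            double cpi_opt2 = 1.0 + imiss * (penalty / 2)
                            + dmiss * memfrac * (penalty / 2);                     /* 2.6 */
            printf("old = %.1f, option 1 = %.1f, option 2 = %.1f, speedup of option 2 = %.2f\n",
                   cpi_old, cpi_opt1, cpi_opt2, cpi_old / cpi_opt2);               /* 1.62 */
            return 0;
        }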

    4. The following series of branch outcomes occurs for a single branch in a program.

    (T means the branch is taken, N means the branch is not taken).

    TTTNNTTT

    How many instances of this branch instruction are mis-predicted with a 1-bit and a 2-bit local branch predictor, respectively? Assume that the BHT is initialized to the N state. You may assume that this is the only branch in the program.

    Answer:

    1-bit predictor: 3 2-bit predictor: 5

    Note: with the FSM assumed here the 2-bit predictor mispredicts 5 times; with an alternative 2-bit FSM it mispredicts 6 times.

    PART II:

    For the following questions in Part II, please make sure that you summarize all your

    answers in the format listed below. The answers are short, such as letters, numbers, or yes/no. You do not have to show your calculations. There is no partial credit for incorrect answers.

    (5a) (5b)

    (6a) (6b) (6c)

    (7a) (7b) (7c)

    (8a) (8b) (8c) (8d) (8e)

    (9a) (9b) (9c) (9d) (9e)

    5. Consider the following performance measurements for a program:

    Measurement Computer A Computer B Computer C

    Instruction Count 12 billion 12 billion 10 billion

    Clock Rate 4 GHz 3 GHz 2.8 GHz

    Cycles Per Instruction 2 1.5 1.4

    (5a) Which computer is faster?

    (5b) Which computer has the higher MIPS rating?

    Answer:

    (5a) Computer C


    Execution time for Computer A = (12 x 10^9 x 2) / (4 x 10^9) = 6 s

    Execution time for Computer B = (12 x 10^9 x 1.5) / (3 x 10^9) = 6 s

    Execution time for Computer C = (10 x 10^9 x 1.4) / (2.8 x 10^9) = 5 s

    (5b) The MIPS ratings for all three computers are the same.

    MIPS for Computer A = (4 x 10^9) / (2 x 10^6) = 2000

    MIPS for Computer B = (3 x 10^9) / (1.5 x 10^6) = 2000

    MIPS for Computer C = (2.8 x 10^9) / (1.4 x 10^6) = 2000

    6. Consider the following two components in a computer system:

    A CPU that sustains 2 billion instructions per second.

    A memory backplane bus capable of sustaining a transfer rate of 1000

    MB/sec

    If the workload consists of 64 KB reads from the disk, and each read operation

    takes 200,000 user instructions and 100,000 OS instructions.

    (6a) Calculate the maximum I/O rate of CPU.

    (6b) Calculate the maximum I/O rate of memory bus.

    (6c) Which of the two components is likely to be the bottleneck for I/O?

    Answer:

    (6a) 6667

    (6b) 15625

    (6c) The CPU

    Maximum I/O rate of the CPU = (2 x 10^9) / (200,000 + 100,000) = 6667 reads/sec

    Maximum I/O rate of the memory bus = (1000 x 10^6) / (64 x 10^3) = 15625 reads/sec


    7. You are going to enhance a computer, and there are two possible improvements:

    either make multiply instructions run four times faster than before, or make

    memory access instructions run two times faster than before. You repeatedly run

    a program that takes 100 seconds to execute. Of this time, 20% is used for

    multiplication, 50% for memory access instructions, and 30% for other tasks.

    Calculate the speedup:

    (7a) Speedup if we improve only multiplication:

    (7b) Speedup if we only improve memory access:

    (7c) Speedup if both improvements are made:

    Answer:

    (7a) Speedup = 18.1

    8.04

    2.0

    1

    (7b) Speedup = 33.1

    5.02

    5.0

    1

    (7c) Speedup = 67.1

    3.02

    5.0

    4

    2.0

    1
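    These three speedups are plain applications of Amdahl's law; the C sketch below (an added illustration) computes them from the 20%/50%/30% time breakdown in the question.

        #include <stdio.h>

        int main(void) {
            /* Time fractions: 20% multiply, 50% memory access, 30% other. */
            double mul = 0.20, mem = 0.50, other = 0.30;
            double s_mul  = 1.0 / (mul / 4 + mem + other);        /* 1.18 */
            double s_mem  = 1.0 / (mul + mem / 2 + other);        /* 1.33 */
            double s_both = 1.0 / (mul / 4 + mem / 2 + other);    /* 1.67 */
            printf("(7a) %.2f  (7b) %.2f  (7c) %.2f\n", s_mul, s_mem, s_both);
            return 0;
        }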

    8. Multiprocessor designs have become popular for today's desktop and mobile computing. Given a 2-way symmetric multiprocessor (SMP) system where both

    processors use write-back caches, write update cache coherency, and a block size

    of one 32-bit word. Let us examine the cache coherence traffic with the following

    sequence of activities involving shared data. Assume that all the words already

    exist in both caches and are clean. Fill-in the last column (8a)-(8e) in the table to

    identify the coherence transactions that should occur on the bus for the sequence.

    Step   Processor     Memory activity   Memory address   Transaction required (Yes or No)
    1      Processor 1   1-word write      100              (8a)
    2      Processor 2   1-word write      104              (8b)
    3      Processor 1   1-word read       100              (8c)
    4      Processor 2   1-word read       104              (8d)
    5      Processor 1   1-word read       104              (8e)

    Answer:


    (8a) Yes

    (8b) Yes

    (8c) No

    (8d) No

    (8e) No

    9. False sharing can lead to unnecessary bus traffic and delays. Follow the direction

    of Question 8, except change the cache coherency policy to write-invalidate and

    block size to four words (128-bit). Reveal the coherence transactions on the bus

    by filling-in the last column (9a)-(9e) in the table below.

    Step   Processor     Memory activity   Memory address   Transaction required (Yes or No)
    1      Processor 1   1-word write      100              (9a)
    2      Processor 2   1-word write      104              (9b)
    3      Processor 1   1-word read       100              (9c)
    4      Processor 2   1-word read       104              (9d)
    5      Processor 1   1-word read       104              (9e)

    Answer:

    (9a) Yes

    (9b) Yes

    (9c) Yes

    (9d) No

    (9e) No

    Note: depending on the snoopy protocol assumed, (9d) could also be Yes.


    94

    1. Suppose we have a 32 bit MIPS-like RISC processor with the following

    arithmetic and logical instructions (along with their descriptions):

    Addition
      add rd, rs, rt      Put the sum of registers rs and rt into register rd.
    Addition immediate
      add rt, rs, imm     Put the sum of register rs and the sign-extended immediate into register rt.
    Subtract
      sub rd, rs, rt      Register rt is subtracted from register rs and the result is put in register rd.
    AND
      and rd, rs, rt      Put the logical AND of registers rs and rt into register rd.
    AND immediate
      and rt, rs, imm     Put the logical AND of register rs and the zero-extended immediate into register rt.
    Shift left logical
      sll rd, rt, imm     Shift the value in register rt left by the distance (i.e., the number of bits) indicated by the immediate (imm) and put the result in register rd. The vacated bits are filled with zeros.
    Shift right logical
      srl rd, rt, imm     Shift the value in register rt right by the distance (i.e., the number of bits) indicated by the immediate (imm) and put the result in register rd. The vacated bits are filled with zeros.

    Please use at most one instruction to generate assembly code for each of the

    following C statements (assuming variable a and b are unsigned integers). You

    can use the variable names as the register names in your assembly code.

    (a) b = a / 8; /* division operation */

    (b) b = a % 16; /* modulus operation */

    Answer:

    (a) srl b, a, 3

    (b) and b, a, 15

    For example, if a = 10010011: a % 16 keeps the low four bits (1001 0011 -> 0011), while a / 8 discards the low three bits (10010 011 -> 10010).
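    A minimal C check of the shift-and-mask identities used in the answer (the test values below are arbitrary):

        #include <assert.h>
        #include <stdio.h>

        int main(void) {
            /* For unsigned integers, dividing by a power of two is a logical
             * right shift, and a modulus by a power of two is a bit mask.    */
            unsigned int tests[] = { 0x93u, 147u, 1000u, 0xFFFFFFFFu };
            for (int i = 0; i < 4; i++) {
                unsigned int a = tests[i];
                assert(a / 8  == (a >> 3));    /* srl b, a, 3  */
                assert(a % 16 == (a & 15));    /* and b, a, 15 */
            }
            printf("shift/mask identities hold\n");
            return 0;
        }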


    2. Assume a RISC processor has a five-stage pipeline (as shown below) with each

    stage taking one clock cycle to finish. The pipeline will stall when encountering

    data hazards.

    IF ID EXE MEM WB

    IF: Instruction fetch

    ID: Instruction decode and register file read

    EXE: Execution or address calculation

    MEM: Data memory access

    WB: Write back to register file

    (a) Suppose we have an add instruction followed immediately by a subtract

    instruction that uses the add instruction's result:

    add r1, r2, r3
    sub r5, r1, r4

    If there is no forwarding in the pipeline, how many cycle(s) will the pipeline

    stall for?

    (b) If we want to use forwarding (or bypassing) to avoid the pipeline stall caused

    by the code sequence above, choosing from the denoted 6 points (A to F) in

    the following simplified data path of the pipeline, where (from which point to

    which point) should the forwarding path be connected?

    (c) Suppose the first instruction of the above code sequence is a load of r1 instead

    of an add (as shown below).

    load r1, [r2]
    sub r5, r1, r4

    Assuming we have a forwarding path from point E to point C in the pipeline

    data path, will there be any pipeline stall for this code sequence? If so, how

    many cycle(s)? (If your first answer is yes, you have to answer the second

    question correctly to get the 5 pts credit.)

    Answer:

    (a) 2 cycles, assuming the register file can be written in the first half of a clock cycle and read in the second half; if a register cannot be written and read in the same clock cycle, the stall is 3 cycles.

    (b) D to C

    (c) Yes, 1 clock cycle

    3. Cache misses are classified into three categories: compulsory, capacity, and conflict. What types of misses could be reduced if the cache block size is

    increased?

    Answer: compulsory

    [Figure for Question 2(b): simplified data path of the five-stage pipeline, with candidate forwarding points A through F marked at the IF, ID, EXE, MEM, and WB stages.]


    4. Consider three types of methods for transferring data between an I/O device and

    memory: polling, interrupt driven, and DMA. Rank the three techniques in terms

    of lowest impact on processor utilization

    Answer: (1) DMA, (2) Interrupt driven, (3) Polling

    5. Assume an instruction set that contains 5 types of instructions: load, store,

    R-format, branch and jump. Execution of these instructions can be broken into 5

    steps: instruction fetch, register read, ALU operations, data access, and register

    write. Table 1 lists the latency of each step assuming perfect caches.

    Instruction class   Instruction fetch   Register read   ALU operation   Data access   Register write
    Load                2 ns                1 ns            1 ns            2 ns          1 ns
    Store               2 ns                1 ns            1 ns            2 ns
    R-format            2 ns                1 ns            1 ns                          1 ns
    Branch              2 ns                1 ns            1 ns
    Jump                2 ns

    Table 1

    (a) What is the CPU cycle time assuming a multicycle CPU implementation (i.e., each step in Table 1 takes one cycle)?

    (b) Assuming the instruction mix shown below, what is the average CPI of the multicycle processor without pipelining? Assume that the I-cache and

    D-cache miss rates are 3% and 10%, and the cache miss penalty is 12 CPU

    cycles

    Instruction Type Frequency

    Load 40%

    Store 30%

    R-format 15%

    Branch 10%

    Jump 5%

    (c) To reduce the cache miss rate, the architecture team is considering increasing the data cache size. They find that by doubling the data cache size, they can

    eliminate half of data cache misses. However, the data access stage now takes

    4 ns. Would you suggest that they double the data cache size? Explain your

    answer.

    Answer:

    (a) In a multicycle implementation the cycle time is set by the longest step, so the CPU cycle time = 2 ns.

    (b) CPI without considering cache misses = 5 × 0.4 + 4 × 0.3 + 4 × 0.15 + 3 × 0.1 + 1 × 0.05 = 4.15

    Average CPI = 4.15 + 0.03 × 12 + (0.4 + 0.3) × 0.1 × 12 = 5.35 (I-cache misses affect every instruction; D-cache misses affect only the 70% loads and stores)


    (c) With the larger data cache the cycle time becomes 4 ns, so the fixed 24 ns miss penalty is now 24/4 = 6 cycles and the data cache miss rate drops to 5%.

    CPI after doubling the data cache = 4.15 + 0.03 × 6 + (0.4 + 0.3) × 0.05 × 6 = 4.54

    Average instruction execution time before doubling the data cache = 5.35 × 2 ns = 10.7 ns

    Average instruction execution time after doubling the data cache = 4.54 × 4 ns = 18.16 ns

    Since doubling the data cache increases the average instruction execution time from 10.7 ns to 18.16 ns, they should not double the data cache.
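    The arithmetic in (b) and (c) can be reproduced with the short C sketch below (a check only, not part of the original answer):

        #include <stdio.h>

        /* Multicycle CPI for Question 5: cycles per class are load 5, store 4,
         * R-format 4, branch 3, jump 1, with the given miss rates/penalties.  */
        int main(void) {
            double base_cpi = 5*0.40 + 4*0.30 + 4*0.15 + 3*0.10 + 1*0.05;   /* 4.15 */

            double cpi_before = base_cpi + 0.03*12 + (0.40 + 0.30)*0.10*12; /* 5.35 */
            double cpi_after  = base_cpi + 0.03*6  + (0.40 + 0.30)*0.05*6;  /* 4.54 */

            printf("before: CPI = %.2f, time = %.2f ns\n", cpi_before, cpi_before * 2.0);
            printf("after:  CPI = %.2f, time = %.2f ns\n", cpi_after,  cpi_after  * 4.0);
            return 0;
        }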


    93

    1. Consider a system with an average memory access time of 50 nanoseconds, a three level page table (meta-directory, directory, and page table). For full credit,

    your answer must be a single number and not a formula.

    (a) If the system had an average page fault rate of 0.01% for any page accessed (data or page table related), and an average page fault took 1 millisecond to

    service, what is the effective memory access time (assume no TLB or memory

    cache)?

    (b) Now assume the system has no page faults, we are considering adding a TLB that will take 1 nanosecond to lookup an address translation. What hit rate in

    the TLB is required to reduce the effective access time to memory by a factor

    of 2.5?

    Answer:

    (a) Without page faults, each reference requires four memory accesses (meta-directory, directory, page table, and the data itself), so the access time = 4 × 50 = 200 ns. With a page fault rate of 0.01% on each of those accesses and a 1 ms service time, the effective memory access time = 200 + 4 × 0.01% × 1,000,000 ns = 600 ns.

    (b) The target access time is 200 / 2.5 = 80 ns. With a TLB, each reference costs 1 ns (TLB lookup) + 50 ns (data access) plus, on a TLB miss, 150 ns for the three page-table accesses: 80 = 1 + 50 + 150 × (1 − H), which gives H ≈ 0.81.
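    The same arithmetic in C, for reference (a sketch only; the constants come straight from the problem statement):

        #include <stdio.h>

        /* Question 1: effective access time with page faults, and the TLB hit
         * rate needed to cut the 200 ns translation-plus-access down by 2.5x. */
        int main(void) {
            double mem  = 50.0;                      /* ns per memory access    */
            double walk = 4.0 * mem;                 /* 3 table levels + data   */

            /* (a) add the expected page-fault cost for each of the 4 accesses */
            double eat = walk + 4.0 * 0.0001 * 1e6;  /* 0.01% faults, 1 ms each */
            printf("(a) %.0f ns\n", eat);            /* 600 ns                  */

            /* (b) solve 80 = 1 + 50 + 150*(1 - H) for the required hit rate   */
            double target = walk / 2.5;
            double H = 1.0 - (target - 1.0 - mem) / (3.0 * mem);
            printf("(b) H = %.2f\n", H);             /* about 0.81              */
            return 0;
        }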

    2. In this problem set, show your answers in the following format:

    ? CPU cycles

    Derive your answer.

    CPI = ?

    Derive your answer.

    Machine ? is ?% faster than ?

    Derive your answer.

    ? CPU cycles

    Derive your answer.

    Both machine A and B contain one-level on-chip caches. The CPU clock rates

    and cache configurations for these two machines are shown in Table 1. The

    respective instruction/data cache miss rates in executing program P are also

    shown in Table 1. The frequency of load/store instructions in program P is 20%.

    On a cache miss, the CPU stalls until the whole cache block is fetched from the

    main memory. The memory and bus system have the following characteristics:

    1. the bus and memory support 16-byte block transfer;

    2. 32-bit synchronous bus clocked at 200 MHz, with each 32-bit transfer taking

    1 bus clock cycle, and 1 bus clock cycle required to send an address to

    memory (assuming shared address and data lines);

    3. assuming there is no cycle needed between each bus operation;

    4. a memory access time for the first 4 words (16 bytes) is 250 ns, each

    additional set of four words can be read in 25 ns. Assume that a bus transfer


    of the most recently read data and a read of the next four words can be

    overlapped.

                            Machine A                           Machine B
    CPU clock rate          800 MHz                             400 MHz
    I-cache configuration   Direct-mapped, 32-byte block, 8K    2-way, 32-byte block, 128K
    D-cache configuration   2-way, 32-byte block, 16K           4-way, 32-byte block, 256K
    I-cache miss rate       6%                                  1%
    D-cache miss rate       15%                                 4%

    Table 1

    To answer the following questions, you don't need to consider the time required

    for writing data to the main memory:

    (1) What is the data cache miss penalty (in CPU cycles) for machine A?

    (2) What is the average CPI (Cycle per Instruction) for machine A in executing

    program P? The CPI (Cycle per Instruction) is 1 without cache misses.

    (3) Which machine is faster in executing program P and by how much? The CPI

    (Cycle per Instruction) is 1 without cache misses for both machine A and B.

    (4) What is the data cache miss penalty (in CPU cycles) for machine A if the bus

    and memory system support 32-byte block transfer? All the other memory/bus

    parameters remain the same as defined above.

    Answer:

    (a) 440 CPU cycles

    Since bus clock rate = 200 MHz, the cycle time for a bus clock = 5 ns

    Each 32-byte cache block requires two 16-byte bus transfers, so the time to transfer one data block from memory to cache = 2 × (1 + 250/5 + 1 × 4) × 5 ns = 550 ns

    The data miss penalty for machine A = 550ns / (1/800MHz) = 440 CPU cycles

    (b) CPI = 40.6

    Average CPI = 1 + 0.06 × 440 + 0.2 × 0.15 × 440 = 40.6

    (c) Machine B is 4.09 times as fast as (309% faster than) machine A.

    Machines A and B use the same 32-byte cache block size, so the block transfer time is the same 550 ns.
    With machine B's 400 MHz clock, machine B's miss penalty = 550 ns / (1/400 MHz) = 220 clock cycles.
    Machine B's average CPI = 1 + 0.01 × 220 + 0.2 × 0.04 × 220 = 4.96.
    Execution time for machine A = 40.6 × 1.25 ns × IC = 50.75 × IC (ns).
    Execution time for machine B = 4.96 × 2.5 ns × IC = 12.4 × IC (ns).
    Machine B is therefore 50.75 / 12.4 = 4.09 times as fast as machine A.

    (d) 240 CPU cycles

    With 32-byte block transfers, the time to transfer one data block from memory to cache = (1 + 250/5 + 25/5 + 4) × 5 ns = 300 ns



    The data miss penalty for machine A = 300ns / (1/800MHz) = 240 CPU cycles
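    The block-transfer arithmetic for (a) and (d) can be checked with the C sketch below (illustrative only; the cycle counts follow the stated bus and memory parameters):

        #include <stdio.h>

        int main(void) {
            double bus_ns   = 5.0;     /* 200 MHz bus clock               */
            double cpu_a_ns = 1.25;    /* 800 MHz CPU clock for machine A */

            /* 16-byte transfers: (address + 250 ns access + 4 transfer cycles),
             * done twice for one 32-byte cache block                          */
            double t16 = 2.0 * (1 + 250.0/5 + 1*4) * bus_ns;        /* 550 ns  */

            /* 32-byte transfers: the read of the second 4 words (25 ns)
             * overlaps the transfer of the first 4, then 4 more transfers     */
            double t32 = (1 + 250.0/5 + 25.0/5 + 4) * bus_ns;       /* 300 ns  */

            printf("16-byte bus: %.0f CPU cycles\n", t16 / cpu_a_ns);  /* 440  */
            printf("32-byte bus: %.0f CPU cycles\n", t32 / cpu_a_ns);  /* 240  */
            return 0;
        }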

    3. Given the bit pattern 10010011, what does it represent assuming

    (a) It's a two's complement integer? (b) It's an unsigned integer?

    Write down your answer in decimal format.

    Answer:

    (a) -2^7 + 2^4 + 2^1 + 2^0 = -109

    (b) 2^7 + 2^4 + 2^1 + 2^0 = 147
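    The two readings of the bit pattern can be confirmed with a couple of casts in C (a check only):

        #include <stdint.h>
        #include <stdio.h>

        int main(void) {
            uint8_t bits = 0x93;    /* 1001 0011 */
            /* The same 8 bits, read as two's complement and as unsigned: */
            printf("two's complement: %d\n", (int)(int8_t)bits);   /* -109 */
            printf("unsigned:         %u\n", (unsigned)bits);      /*  147 */
            return 0;
        }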

    4. Draw the schematic for a 4-bit 2's complement adder/subtractor that produces A + B if K = 1 and A - B if K = 0. In your design, try to use the minimum number of the following basic logic gates (1-bit adders, AND, OR, INV, and XOR).

    Answer:

    K = 0: S = A + (B XOR 1) + 1 = A + (~B + 1) = A - B
    K = 1: S = A + (B XOR 0) + 0 = A + B + 0 = A + B
    (Each bit of B is XORed with ~K, and ~K also drives the carry-in c0, so K = 1 adds and K = 0 subtracts.)
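    A bit-level C model of this adder/subtractor (the function name and test values are illustrative; the wiring follows the equations above):

        #include <stdio.h>

        /* K = 1 -> A + B, K = 0 -> A - B (two's complement); returns s3..s0
         * and writes the carry-out c4.                                       */
        static unsigned addsub4(unsigned a, unsigned b, unsigned k, unsigned *c4) {
            unsigned notk  = (~k) & 1u;      /* drives the B-input XORs and c0 */
            unsigned carry = notk;           /* c0 = ~K                        */
            unsigned sum   = 0;
            for (int i = 0; i < 4; i++) {    /* four cascaded 1-bit full adders */
                unsigned ai = (a >> i) & 1u;
                unsigned bi = ((b >> i) & 1u) ^ notk;
                sum  |= (ai ^ bi ^ carry) << i;
                carry = (ai & bi) | (ai & carry) | (bi & carry);
            }
            *c4 = carry;
            return sum & 0xFu;
        }

        int main(void) {
            unsigned c4;
            printf("6+3 = %u\n", addsub4(6, 3, 1, &c4));                  /* 9  */
            printf("6-3 = %u\n", addsub4(6, 3, 0, &c4));                  /* 3  */
            printf("2-5 = %u (i.e. -3 mod 16)\n", addsub4(2, 5, 0, &c4)); /* 13 */
            return 0;
        }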

    5. We want to add four 4-bit numbers, A[3:0], B[3:0], C[3:0], and D[3:0], together using

    carry-save addition. Draw the schematic using 1-bit full adders.

    Answer:

    [Schematic fragments: 1-bit full adders with inputs a3 b3 a2 b2 a1 b1 a0 b0, the control signal K, sum outputs s3 s2 s1 s0, and carry-out c4.]
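    Since only fragments of the schematic survive, here is a small behavioral C sketch of the carry-save idea (an illustration of the technique, not a reproduction of the exam's figure): each full-adder level reduces three operands to a sum word and a carry word with no carry propagation, and only the final addition propagates carries.

        #include <stdio.h>

        /* One carry-save level: reduce three operands to (sum, carry) using
         * independent 1-bit full adders.                                     */
        static void csa(unsigned x, unsigned y, unsigned z,
                        unsigned *sum, unsigned *carry) {
            *sum   = x ^ y ^ z;                             /* per-bit sums    */
            *carry = ((x & y) | (x & z) | (y & z)) << 1;    /* shifted carries */
        }

        int main(void) {
            unsigned A = 0x9, B = 0x5, C = 0x3, D = 0xE;    /* four 4-bit inputs */
            unsigned s1, c1, s2, c2;

            csa(A, B, C, &s1, &c1);    /* level 1: A + B + C -> (s1, c1)        */
            csa(s1, c1, D, &s2, &c2);  /* level 2: fold in D  -> (s2, c2)       */

            unsigned result = s2 + c2; /* one carry-propagating add at the end  */
            printf("%u (expected %u)\n", result, A + B + C + D);   /* 31        */
            return 0;
        }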


    6. We have an 8-bit carry ripple adder that is too slow. We want to speed it up by

    adding one pipeline stage. Draw the schematic of the resulting pipelined adder. How many 1-bit pipeline registers do you need? Assuming the delay of a 1-bit adder is 1 ns, what's the maximum clock frequency at which the resulting pipelined adder can operate?

    Answer:

    (1) Schematic: split the 8-bit ripple chain into two 4-bit halves and insert pipeline registers between them (see the figure note below).

    (2) 13 1-bit pipeline registers: a4-a7 and b4-b7 (8 bits), the carry c4 (1 bit), and the first-stage sums s0-s3 (4 bits).

    (3) Each stage contains four 1-bit adders, so the stage delay is 4 ns and the maximum clock frequency is 1 / 4 ns = 250 MHz.

    [Figure: 8-bit ripple-carry adder (inputs a0-a7, b0-b7, carry-in c0; outputs s0-s7, carry-out c8) shown unpipelined and then split into two 4-bit stages, with the 13 pipeline registers inserted between bit 3 and bit 4.]


    92

    1. A pipelined processor architecture consists of 5 pipeline stages: instruction fetch

    (IF), instruction decode and register read (ID), execution or address calculate

    (EX), data memory access (MEM), and register write back(WB). The delay of

    each stage is summarized below: IF = 2 ns, ID = 1.5 ns, EX = 4 ns, MEM = 2.5 ns,

    WB = 2 ns.

    (1) What's the maximum attainable clock rate of this processor?

    (2) What kind of instruction sequence will cause a data hazard that cannot be resolved by forwarding? What's the performance penalty?

    (3) To improve the clock rate of this processor, the architect decided to add one pipeline stage. The location of the existing pipeline registers cannot be changed. Where should this pipeline stage be placed? What's the maximum clock rate of the 6-stage processor? (Assume there is no delay penalty when adding pipeline stages.)

    (4) Repeat the analysis in (2) for the new 6-stage processor. Is there any other type of instruction sequence that causes a data hazard that cannot be resolved by forwarding? Comparing the 5-stage and 6-stage designs, what effect does adding one pipeline stage have on data hazard resolution?

    Answer:

    (1) The slowest stage (EX) has a 4 ns delay, so the maximum clock rate = 1 / (4 × 10^-9) = 250 MHz.

    (2) (a) A load instruction immediately followed by an instruction that uses the loaded value (a load-use sequence) causes a data hazard that forwarding cannot resolve.
    (b) The pipeline must stall one clock cycle even with forwarding, so the performance penalty is a 1-clock-cycle delay.

    (3) (a) The EX stage has the longest delay, so it should be split into two 2 ns stages, EX1 and EX2.
    (b) The slowest stage is then MEM at 2.5 ns, so the maximum clock rate = 1 / (2.5 × 10^-9) = 400 MHz.

    (4) (a) A load-use sequence still causes a data hazard that cannot be resolved by forwarding, and in addition an ALU instruction immediately followed by a dependent instruction now causes a data hazard that forwarding cannot fully hide, because the ALU result is not available until the end of EX2. The load-use hazard now stalls 2 clock cycles, and the ALU-use hazard stalls 1 clock cycle.
    (b) Adding a pipeline stage therefore increases the data hazard penalty.

    2. (1) What type of cache misses (compulsory, conflict, capacity) can be reduced by increasing the cache block size?

    (2) Can increasing the degree of cache associativity always reduce the average memory access time? Explain your answer.

    Answer:


    (1) Compulsory

    (2) No. AMAT = hit time + miss rate × miss penalty. Increasing the degree of cache associativity may decrease the miss rate but lengthens the hit time; therefore, the average memory access time is not necessarily reduced.
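    A small numeric illustration of that trade-off (the hit times and miss rates below are made-up values, not from the exam):

        #include <stdio.h>

        /* AMAT = hit time + miss rate * miss penalty (illustrative numbers).  */
        static double amat(double hit_ns, double miss_rate, double penalty_ns) {
            return hit_ns + miss_rate * penalty_ns;
        }

        int main(void) {
            /* Going from direct-mapped to 4-way: the miss rate drops a little,
             * but the hit time grows, so AMAT can get worse.                  */
            printf("direct-mapped: %.2f ns\n", amat(1.0, 0.040, 50.0));  /* 3.00 */
            printf("4-way:         %.2f ns\n", amat(1.4, 0.034, 50.0));  /* 3.10 */
            return 0;
        }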

    3. List two types of cache write policies. Compare the pros and cons of these two

    polices.

    Answer:

    (1) Write-through: A scheme in which writes always update both the cache and the

    memory, ensuring that data is always consistent between the two.

    Write-back: A scheme that handles writes by updating values only to the

    block in the cache, then writing the modified block to the lower level of the

    hierarchy when the block is replaced.

    (2)

    Write-through
      Pros: simpler to implement; memory always holds an up-to-date copy, so a replaced block never needs to be written back.
      Cons: every CPU write also goes to memory, consuming memory bandwidth (a write buffer is usually needed).
    Write-back
      Pros: CPU writes complete at cache speed; several writes to the same block cost only one memory write when the block is replaced.
      Cons: more complex to implement; memory may be stale, and dirty blocks must be written back on replacement.

    4. Briefly describe the difference between synchronous and asynchronous bus

    transaction

    Answer:

    Synchronous bus: includes a clock in the control lines and uses a fixed protocol for communication relative to that clock.
      Advantages: requires very little logic and can run very fast.
      Disadvantages: every device on the bus must run at the same clock rate, and to avoid clock skew the bus cannot be long if it is fast.

    Asynchronous bus: not clocked.
      Advantages: can accommodate a wide range of devices and can be lengthened without worrying about clock skew.
      Disadvantages: requires a handshaking protocol.


    96

    1. The following MIPS assembly program tries to copy words from the address in

    register $a0 to the address in $a1, counting the number of words copied in

    register $v0. The program stops copying when it finds a word equal to 0. You do

    not have to preserve the contents of registers $v1, $a0, and $a1. This terminating

    word should be copied but not counted.

    loop: lw $v1, 0($a0) # read next word from source

    addi $v0, $v0, 1 # Increment count words copied

    sw $v1, 0($a1) # Write to destination

    addi @a0, $a0, 1 # Advance pointer to next word

    addi @a0, $a1, 1 # Advance pointer to next word

    bne $v1, $zero, loop # Loop if word copied != zero

    There are multiple bugs in this MIPS program; fix them and turn in a bug-free

    version.

    Answer:

          addi $v0, $zero, -1    # Initialize count to -1 so the terminating word is not counted
    Loop: lw   $v1, 0($a0)       # Read next word from source
          addi $v0, $v0, 1       # Increment count of words copied
          sw   $v1, 0($a1)       # Write to destination
          addi $a0, $a0, 4       # Advance source pointer to the next word
          addi $a1, $a1, 4       # Advance destination pointer to the next word
          bne  $v1, $zero, Loop  # Loop if word copied != zero

    2. Carry lookahead is often used to speed up the addition operation in ALU. For a

    4-bit addition with carry lookahead, assuming the two 4-bit inputs are a3a2a1a0

    and b3b2b1b0, and the carry-in is c0,

    (a) First derive the recursive equations of carry-out ci+1 in terms of ai and bi and

    ci, where i = 0, 1,.., 3.

    (b) Then by defining the generate (gi) and propagate (pi) signals, express c1, c2,

    c3, and c4 in terms of only gi's, pi's, and c0. (c) Estimate the speedup for this simple 4-bit carry lookahead adder over the

    4-bit ripple carry adder (assuming each logic gate introduces T delay).

    Answer:

    (a) ci+1 = aibi + aici + bici

    (b) c1 = g0 + (p0 c0)
        c2 = g1 + (p1 g0) + (p1 p0 c0)
        c3 = g2 + (p2 g1) + (p2 p1 g0) + (p2 p1 p0 c0)
        c4 = g3 + (p3 g2) + (p3 p2 g1) + (p3 p2 p1 g0) + (p3 p2 p1 p0 c0)

    (c) The critical path delay of the 4-bit ripple carry adder = 2T × 4 = 8T.
    The critical path delay of the 4-bit carry lookahead adder = T + 2T = 3T (T to form the gi/pi signals, then two gate levels for the lookahead carries).

    Speedup = 8T / 3T = 2.67

    (The critical path delay is determined by the carry chain.)
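    The lookahead equations in (b) can be exercised directly in C (a sketch with arbitrary example operands):

        #include <stdio.h>

        /* 4-bit carry-lookahead carries, computed exactly as in part (b).     */
        int main(void) {
            unsigned a = 0xB, b = 0x6, c0 = 0;     /* example operands          */
            unsigned g[4], p[4], c[5];
            for (int i = 0; i < 4; i++) {
                g[i] = ((a >> i) & 1u) & ((b >> i) & 1u);   /* generate  gi     */
                p[i] = ((a >> i) & 1u) | ((b >> i) & 1u);   /* propagate pi     */
            }
            c[0] = c0;
            c[1] = g[0] | (p[0] & c[0]);
            c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
            c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c[0]);
            c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
                        | (p[3] & p[2] & p[1] & p[0] & c[0]);

            unsigned sum = 0;
            for (int i = 0; i < 4; i++)            /* si = ai xor bi xor ci     */
                sum |= (((a >> i) ^ (b >> i) ^ c[i]) & 1u) << i;
            printf("0x%X + 0x%X = 0x%X (c4 = %u)\n", a, b, sum | (c[4] << 4), c[4]);
            return 0;
        }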

    3. When performing arithmetic addition and subtraction, overflow might occur. Fill

    in the blanks in the following table of overflow conditions for addition and

    subtraction.

    Operation   Operand A   Operand B   Result indicating overflow
    A + B       ≥ 0         ≥ 0         (a)
    A + B       < 0         < 0         (b)
    A - B       ≥ 0         < 0         (c)
    A - B       < 0         ≥ 0         (d)

    Prove that the overflow condition can be determined simply by checking to see if

    the Carryin to the most significant bit of the result is not the same as the CarryOut

    of the most significant bit of the result.

    Answer:

    (1)

    Operation   Operand A   Operand B   Result indicating overflow
    A + B       ≥ 0         ≥ 0         (a) < 0
    A + B       < 0         < 0         (b) ≥ 0
    A - B       ≥ 0         < 0         (c) < 0
    A - B       < 0         ≥ 0         (d) ≥ 0

    (2) Build a table that shows all possible combinations of Sign and CarryIn to the

    sign bit position and derive the CarryOut, Overflow, and related information.

    Thus:

    Sign A  Sign B  CarryIn  CarryOut  Sign of result  Correct sign of result  Overflow?  CarryIn XOR CarryOut  Notes
    0       0       0        0         0               0                       No         0
    0       0       1        0         1               0                       Yes        1                     Carries differ
    0       1       0        0         1               1                       No         0                     |A| < |B|
    0       1       1        1         0               0                       No         0                     |A| > |B|
    1       0       0        0         1               1                       No         0                     |A| > |B|
    1       0       1        1         0               0                       No         0                     |A| < |B|
    1       1       0        1         0               1                       Yes        1                     Carries differ
    1       1       1        1         1               1                       No         0

    From this table an XOR of the CarryIn and CarryOut of the sign bit serves to

    detect overflow.
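    The claim can be verified exhaustively for 8-bit operands with the C sketch below (function names are illustrative):

        #include <stdint.h>
        #include <stdio.h>

        /* Signed overflow occurs exactly when the carry into the sign bit
         * differs from the carry out of the sign bit.                        */
        static int overflow_by_carries(uint8_t a, uint8_t b) {
            unsigned carry_in7  = (((a & 0x7F) + (b & 0x7F)) >> 7) & 1u;
            unsigned carry_out7 = (((unsigned)a + b) >> 8) & 1u;
            return carry_in7 ^ carry_out7;
        }

        /* The sign-based rule from the table above.                          */
        static int overflow_by_signs(uint8_t a, uint8_t b) {
            int8_t sa = (int8_t)a, sb = (int8_t)b, sum = (int8_t)(a + b);
            return (sa >= 0 && sb >= 0 && sum < 0) || (sa < 0 && sb < 0 && sum >= 0);
        }

        int main(void) {
            for (unsigned a = 0; a < 256; a++)
                for (unsigned b = 0; b < 256; b++)
                    if (overflow_by_carries((uint8_t)a, (uint8_t)b) !=
                        overflow_by_signs((uint8_t)a, (uint8_t)b)) {
                        printf("mismatch at %u, %u\n", a, b);
                        return 1;
                    }
            printf("carry rule matches the sign rule for all 8-bit additions\n");
            return 0;
        }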


    4. Assume all memory addresses are translated to physical addresses before the

    cache is accessed. In this case, the cache is physically indexed and physically

    tagged. Also assume a TLB is used. (a) Under what circumstance can a memory

    reference encounter a TLB miss, a page table hit, and a cache miss? Briefly

    explain why. (b) To speed up cache accesses, a processor may index the cache

    with virtual addresses. This is called a virtually addressed cache, and it uses tags

    that are virtual addresses. However, a problem called aliasing may occur. Explain

    what aliasing is and why. (c) In today's computer systems, virtual memory and

    cache work together as a hierarchy. When the operating system decides to move a

    page back to disk, the contents of that page may have been brought into the cache

    already. What should the OS do with the contents that are in the cache?

    Answer: (a) Data/instruction is in memory but not in cache and page table has this

    mapping but TLB has not.

    (b) A situation in which the same object is accessed by two addresses; can occur

    in virtual memory when there are two virtual addresses for the same physical

    page.

    (c) If the contents in the cache are dirty, force them to be written back to memory and invalidate them in the cache; after that, copy the page back to disk. If they are clean, simply invalidate them in the cache and copy the page back to disk.

    5. The following three instructions are executed using MIPS 5-stage pipeline.

    1. lw $2, 20($1)

    2. sub $4, $2, $5

    3. or $4, $2, $6

    Since there is one cycle delay between lw and sub, a hazard detection unit is

    required. Furthermore, by the time the hazard is detected, sub and or may have

    already been fetched into the pipeline. Therefore it is also required to turn sub

    into a nop and delay the execution of sub and or by one cycle as shown below.

    1. lw $2, 20($1)

    2 nop

    3. sub $4, $2, $5

    4. or $4, $2, $6

    (a) In which stage should the hazard detection unit be placed? Why? (b) How can

    you turn sub into a nop in MIPS 5-stage pipeline? (c) How can you prevent sub

    and or from making progress and force these two instructions to repeat in the next

    clock cycle? (d) Explain why there is one cycle delay between lw and sub.

    Answer:

    (a) ID: Instruction Decode and register file read stage.

    (b) Deassert all nine control signals (by zeroing the control fields in the ID/EX pipeline register) so that the instruction does nothing in the EX, MEM, and WB stages.

    (c) Set both control signals PCWrite and IF/IDWrite to 0 to prevent the PC

    register and IF/ID pipeline register from changing.


    (d) As shown in the following diagram, after 1-cycle stall between lw and sub,

    the forwarding logic can handle the dependence and execution proceeds. (If

    there were no forwarding, then 2 cycle delay is needed)

    lw    IF   ID   EX   MEM  WB
    nop        IF   ID   EX   MEM  WB
    sub             IF   ID   EX   MEM  WB

    6. Answer the following questions briefly.

    (a) Will addition "0010 + 1110" cause an overflow using the 4-bit two's

    complement signed-integer form? (Simply answer yes or no).

    (b) What would you get after performing arithmetic right shift by one bit on

    1100two?

    (c) If one wishes to increase the accuracy of the floating-point numbers that can

    be represented, then he/she should increase the size of which part in the

    floating-point format?

    (d) Name one event other than branches or jumps that the normal flow of

    instruction execution will be changed, e.g., by switching to a routine in the

    operating system.

    Answer:

    (a) NO

    (b) 1110two

    (c) Fraction

    (d) Arithmetic overflow

    7. A MIPS instruction takes fives stages in a pipelined CPU design: (1) IF:

    instruction fetch, (2) ID: instruction decode/register fetch, (3) ALU: execution or

    calculate a memory address, (4) MEM: access an operand in data memory, and (5)

    WB: write a result back into the register. Label one appropriate stage in which

    each of the following actions needs to be executed. (Note that A and B are two

    source operands, while ALUOut is the output register of the ALU, PC is the

    program counter, IR is the instruction register. MDR is the memory data register,

    Memory[k] is the k-th word in the memory, and Reg[k] is the k-th registers in the

    register file.)

    (a) Reg[IR[20-16]] = MDR;

    (b) ALUOut = PC + sign-extend(IR[15-0])


    (c) MEM


    95

    1. (1) Can you come up with a MIPS instruction that behaves like a NOP? The

    instruction is executed by the pipeline but does not change any state.

    (2) In a MIPS computer a main program can use "jal procedure address" to make a

    procedure call and the callee can use "jr $ra" to return to the main program.

    What is saved in register $ra during this process?

    (3) Name and explain the three principal components that can be combined to

    yield runtime.

    Answer:

    (1) sll $zero, $zero, 0

    (2) The address of the instruction following the jal (Return address)

    (3) Runtime = instruction count × CPI (cycles per instruction) × clock cycle time

    2. (1) Briefly explain the purpose of having a write buffer in the design of a

    write-through cache.

    (2) Large cache block tends to decrease cache miss rate due to better spatial

    locality. However, it has been observed that too large a cache block actually

    increases miss rate. Especially in a very small cache. Why?

    Answer:

    (1) After writing the data into the write buffer, the processor can continue

    execution without waiting for the memory to be updated. The CPU

    performance can thus be increased.

    (2) The number of blocks that can be held in the cache will become small, and

    there will be a great deal of competition for those blocks. As a result, a block

    will be bumped out of the cache before many of its words are accessed.

    3. (1) Dynamic branch prediction is often used in today's machines. Consider a loop

    branch that branches nine times in a row, and then is not taken once. What is

    the prediction accuracy for this branch, assuming a simple 1-bit prediction

    scheme is used and the prediction bit for this branch remains in the prediction

    buffer? Briefly explain your result.

    (2) What is the prediction accuracy if a 2-bit prediction scheme is used? Again

    briefly explain your result.

    Answer:

    (1) The steady-state prediction behavior will mispredict on the first and last loop

    iterations. Mispredicting the last iteration is inevitable since the prediction bit

    will say taken. The misprediction on the first iteration happens because the bit

    is flipped on prior execution of the last iteration of the loop, since the branch

    was not taken on that exiting iteration. Thus, the prediction accuracy for this

    branch is 80% (two incorrect predictions and eight correct ones).


    (2) The prediction accuracy with a 2-bit prediction scheme is 90%, since only the last loop iteration will be mispredicted.

    4. Answer the following questions briefly:

    (1) In a pipelined CPU design, what kind of problem may occur as it executes

    instructions corresponding to an if-statement in a C program? Name one

    possible scheme to get around this problem more or less.

    (2) Consider the possible actions in the Instruction Decode stage of a pipelined

    CPU. In addition to setting up the two input operands of ALU, what is the

    other possible action? (Hint: consider the execution of a jump instruction)

    (3) What is x if the maximum number of memory words you can use in a 32-bit

    MIPS machine in a single program is expressed as 2x? (Note: MIPS uses a

    byte addressing scheme.)

    Answer:

    (1) Control hazard.

    Solution: Insert Nop instruction, delay branch, branch prediction

    (2) Decode instruction, sign-extend 16 bits immediate constant, jump address

    calculation, branch target calculation, register comparison, load-use data

    hazard detection.

    (3) A single program on a 32-bit MIPS machine can use 256 MB = 2^28 bytes = 2^26 words. So, x = 26.

    5. Consider the following flow chart of a sequential multiplier. We assume that the

    64-bit multiplicand register is initialized with the 32-bit original multiplicand in

    the right half and 0 in the left half. The final result is to be placed in a product

    register. Fill in the missing descriptions in blanks A and B.

    [Flow chart: Start -> test Multiplier[0]. If Multiplier[0] = 1, perform Blank A; if Multiplier[0] = 0, skip it. Then shift the Multiplicand register left by 1 bit, perform Blank B, and check for the 32nd repetition: if fewer than 32 repetitions, go back to the test; after 32 repetitions, Done.]


    Answer:

    Blank A: add Multiplicand to product and place the result in the Product register

    Blank B: shift the Multiplier register right 1 bit
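    A behavioral C sketch of the multiply loop described by this flow chart (64-bit Multiplicand and Product registers, 32 repetitions; the function name is illustrative):

        #include <stdint.h>
        #include <stdio.h>

        static uint64_t multiply(uint32_t multiplicand, uint32_t multiplier) {
            uint64_t mcand   = multiplicand;   /* right half holds the operand    */
            uint64_t product = 0;
            for (int rep = 0; rep < 32; rep++) {
                if (multiplier & 1u)           /* test Multiplier[0]              */
                    product += mcand;          /* Blank A: add, keep in Product   */
                mcand <<= 1;                   /* shift Multiplicand left 1 bit   */
                multiplier >>= 1;              /* Blank B: shift Multiplier right */
            }
            return product;
        }

        int main(void) {
            printf("%llu\n", (unsigned long long)multiply(123456789u, 987654321u));
            printf("%llu\n", 123456789ull * 987654321ull);   /* same value        */
            return 0;
        }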

    6. Schedule the following instruction segment into a superscalar pipeline for MIPS. Assume that the pipeline can execute one ALU or branch instruction and one data

    transfer instruction concurrently. For the best, the instruction segment can be

    executed in four clock cycles. Fill in the instruction identifiers into the table. Note

    that data dependency should be taken into account.

    (Identifier) (Instruction)

    ln-1 Loop: lw $t0, 0($s1)

    ln-2 addu $t0, $t0, $s2

    ln-3 sw $t0, 0($s1)

    ln-4 addi $s1, $s1, 4
    ln-5 bne $s1, $zero, Loop

    Clock Cycle ALU or branch instruction Data transfer instruction

    1

    2

    3

    4

    Answer:

    Clock Cycle ALU or branch instruction Data transfer instruction

    1 ln-1 (lw)

    2 ln-4 (addi)

    3 ln-2 (addu)

    4 ln-5 (bne) ln-3 (sw)

    7. Suppose a computer's address size is k bits (using byte addressing),