Solution: VU Assignment
DESCRIPTION
VU Exam Transcript
CONTENTS
Exam solutions for academic years 96, 95, 94, 93, and 92, with several exam sets per year.
96
1. _____ implements the translation of a program's address space to physical
addresses.
(A) DRAM
(B) Main memory
(C) Physical memory
(D) Virtual memory
Answer: (D)
2. To track whether a page has been written since it was read into memory, a
____ is added to the page table.
(A) valid bit
(B) tag index
(C) dirty bit
(D) reference bit
Answer: (C)
3. (Refer to the CPU architecture of Figure 1 below) Which of the following
statements is correct for a load word (LW) instruction?
(A) MemtoReg should be set to 0 so that the correct ALU output can be sent to
the register file.
(B) MemtoReg should be set to 1 so that the Data Memory output can be sent to
the register file.
(C) We do not care about the setting of MemtoReg. It can be either 0 or 1.
(D) MemWrite should be set to 1.
Answer: (B)
(Figure 1: the single-cycle MIPS datapath, with PC, instruction memory, register file, sign-extend unit, ALU, and data memory, and a control unit driving RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; PCSrc selects the next PC.)
Figure 1
4. The IEEE 754 binary representation of a 32-bit floating-point number is shown
below (normalized single-precision representation with bias = 127).
31 30 ~ 23 22 ~ 0
S exponent fraction
1 bit 8 bits 23 bits
(S) (E) (F)
What is the correct binary representation of (-0.75)10 in IEEE single-precision
floating-point format?
(A) 1011 1111 0100 0000 0000 0000 0000 0000
(B) 1011 1111 1010 0000 0000 0000 0000 0000
(C) 1011 1111 1101 0000 0000 0000 0000 0000
(D) 1011 1110 1000 0000 0000 0000 0000 0000
Answer: (A)
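The encoding can be double-checked with a short Python sketch (not part of the original solution) using the standard struct module:

```python
import struct

def float_to_bits(x: float) -> str:
    """Pack x as an IEEE 754 single and return its 32-bit pattern."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return format(word, "032b")

bits = float_to_bits(-0.75)
# -0.75 = -1.1two x 2^-1, so S = 1, E = -1 + 127 = 126 = 0111 1110, F = 100...0
print(bits[0], bits[1:9], bits[9:])  # 1 01111110 10000000000000000000000
```

Grouping the 32 bits in fours gives 1011 1111 0100 0000 0000 0000 0000 0000, i.e. option (A).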
5. According to Question 4, what is the decimal number represented by the word
below?
Bit position 31 | 30 ~ 23 | 22 ~ 0
Binary value 1 | 10000011 | 0110 0000 0000 0000 0000 000
(A) -10
(B) -11
(C) -22
(D) -44
Answer: (C)
The exponent field 10000011 is 131, so the scale factor is 2^(131-127) = 2^4; the significand is 1.011two = 1.375; the value is -1.375 x 2^4 = -22.
6. Assume that the following assembly code is run on a machine with a 2 GHz clock.
The number of cycles for each assembly instruction is shown in Table 1.
add $t0, $zero, $zero
loop: beq $a1, $zero, finish
add $t0, $t0, $a0
sub $a1, $a1, 1
j loop
finish: addi $t0, $t0, 100
add $v0, $t0, $zero
instruction Cycles
add, addi, sub 1
lw, beq, j 2
Table 1
Assume $a0 = 3, $a1 = 20 at initial time, select the correct value of $v0 at the
final cycle:
(A) 157
(B) 160
(C) 163
(D) 166
Answer: (B)
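As a sanity check, the loop's register-level effect can be simulated with a few lines of Python (a sketch mirroring the assembly, not executing it):

```python
def run(a0: int, a1: int) -> int:
    """Mirror the loop: $t0 accumulates $a0 while $a1 counts down to zero."""
    t0 = 0                  # add $t0, $zero, $zero
    while a1 != 0:          # loop: beq $a1, $zero, finish
        t0 += a0            # add $t0, $t0, $a0
        a1 -= 1             # sub $a1, $a1, 1
    t0 += 100               # finish: addi $t0, $t0, 100
    return t0               # add $v0, $t0, $zero

print(run(3, 20))  # 3 * 20 + 100 = 160, option (B)
```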
7. Following Question 6, calculate the MIPS (million instructions per second)
rating of this assembly code:
(A) 1342
(B) 1344
(C) 1346
(D) 1348
Answer: (B)
MIPS = clock rate / (CPI x 10^6)
     = (instruction count x clock rate) / (clock cycles x 10^6)
     = (84 x 2 x 10^9) / (125 x 10^6)
     = 1344
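The instruction and cycle counts behind this result can be tallied with a small Python sketch (the 20 iterations come from $a1 counting down from 20):

```python
# Cycle costs from Table 1.
CYCLES = {"add": 1, "addi": 1, "sub": 1, "lw": 2, "beq": 2, "j": 2}

iters = 20
# 1 initial add, 20 x (beq, add, sub, j), the final taken beq, then addi and add.
n_instr = 1 + iters * 4 + 1 + 2
n_cycles = (CYCLES["add"]
            + iters * (CYCLES["beq"] + CYCLES["add"] + CYCLES["sub"] + CYCLES["j"])
            + CYCLES["beq"] + CYCLES["addi"] + CYCLES["add"])
mips = 2e9 / ((n_cycles / n_instr) * 1e6)   # clock rate / (CPI x 10^6)
print(n_instr, n_cycles, round(mips))       # 84 125 1344
```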
Questions 8-11. Link the following terms ((1) ~ (4))
(1) Microsoft Word
(2) Operating system
(3) Internet
(4) CD-ROM
to the most related terminology shown below (A, B, C,..., K), choose the most
related one ONLY (answer format: e.g., (1) K, for mapping item (1) to
terminology K).
A Applications software F Personal computer
B High-level programming language G Semiconductor
C Input device H Super computer
D Integrated circuit I Systems software
E Output device K Computer Networking
Please write down the answers in the answer table together with the choice
questions.
8. (1) Microsoft Word
9. (2) Operating system
10. (3) Internet
11. (4) CD-ROM
Answer:
8. (1) Microsoft Word A
9. (2) Operating system I
10. (3) Internet K
11. (4) CD-ROM C
Questions 12-15. Match the memory hierarchy element on the left with the closest
phrase on the right (answer format: e.g., (1) d, for mapping item (1) on the left to
item d on the right):
(1). L1 cache a. A cache for a cache
(2). L2 cache b. A cache for disks
(3). Main memory c. A cache for a main memory
(4). TLB d. A cache for page table entries
Please write down the answers in the answer table together with the choice
questions.
12. (1) L1 cache
13. (2) L2 cache
14. (3) Main memory
15. (4) TLB
Answer:
12. (1) L1 cache a
13. (2) L2 cache c
14. (3) Main memory b
15. (4) TLB d
Questions 41-50. Based on the function of the seven control signals and the datapath
of the MIPS CPU in Figure 1 (the same figure as for Question 28), complete the
settings of the control lines in Table 2 (use 0, 1, and X (don't care) only) for the
two MIPS CPU instructions (beq and add). X (don't care) can help to reduce the
implementation complexity, so you should put X wherever possible.
Instr.  Branch  ALUSrc  RegWrite  RegDst  MemtoReg  MemWrite  MemRead  ALUOp1  ALUOp0
beq     (16)    (17)    (18)      (19)    (20)      (21)      (22)     0       1
add     (23)    (24)    (25)
Table 2
Please write down the answers in the answer table together with the choice
questions.
16. (16) =
17. (17) =
18. (18) =
19. (19) =
20. (20) =
21. (21) =
22. (22) =
23. (23) =
24. (24) =
25. (25) =
Answer:
16. (16) = 1
17. (17) = 0
18. (18) = 0
19. (19) = X
20. (20) = X
21. (21) = 0
22. (22) = 0
23. (23) = 1
24. (24) = 0
25. (25) = 0
95
Questions 1-4. Choose ALL the correct answers for each question. Note that
credit will be given only if all choices are correct.
1. With pipelines:
(A) Increasing the depth of pipelining increases the impact of hazards.
(B) Bypassing is a method to resolve a control hazard.
(C) If a branch is taken, the branch prediction buffer will be updated.
(D) In a static multiple issue scheme, the multiple instructions issued in each clock cycle are fixed by the processor at the beginning of the program execution.
(E) Predication is an approach to guess the outcome of an instruction and to remove the execution dependence.
Answer: (A)
(B) False: bypassing resolves data hazards, not control hazards.
(C) False: the prediction buffer is updated when the guess turns out wrong.
(D) False: the instructions issued together are fixed by the compiler.
(E) False: that describes speculation, not predication.
2. Increasing the degree of associativity of a cache scheme will
(A) Increase the miss rate.
(B) Increase the hit time.
(C) Increase the number of comparators.
(D) Increase the number of tag bits.
(E) Increase the complexity of LRU implementation.
Answer: (B), (C), (D), (E)
(A) False: higher associativity decreases the miss rate.
3. With caching:
(A) A write-through scheme improves the consistency between main memory and the cache.
(B) A split cache applies parallel caches to improve cache speed.
(C) A TLB (translation-lookaside buffer) is a cache on the page table, and could help accessing the virtual addresses faster.
(D) No more than one TLB is allowed in a CPU, to ensure consistency.
(E) A one-way set-associative cache performs the same as a direct-mapped cache.
Answer: (A), (B), (E)
(C) False: the TLB helps access physical addresses faster.
-
9
(D) False, (MIPS R3000 and Pentium 4 have two TLBs)
4. In a Pentium 4 PC,
(A) The DMA mechanism can be applied to delegate responsibility from the CPU.
(B) The AGP bus can be used to connect the MCH (Memory Control Hub) and a graphical output device.
(C) USB 2.0 is a synchronous bus using a handshaking protocol.
(D) The CPU can fetch and translate IA-32 instructions.
(E) The CPU can reduce instruction latency with deep pipelining.
Answer: (A), (B), (D)
(C) False: USB 2.0 is an asynchronous bus.
(E) False: pipelining cannot reduce a single instruction's latency.
5. Examine the following two CPUs, each running the same instruction set.
The first one is a Gallium Arsenide (GaAs) CPU. A 10 cm (about 4 inch) diameter GaAs wafer costs $2000. The manufacturing process creates 4 defects per square cm. The CPU fabricated in this technology is expected to have a clock rate of 1000 MHz, with an average clock cycles per instruction of 2.5 if we assume an infinitely fast memory system. The size of the GaAs CPU is 1.0 cm x 1.0 cm.
The second one is a CMOS CPU. A 20 cm (about 8 inch) diameter CMOS wafer costs $1000 and has 1 defect per square cm. The 1.0 cm x 2.0 cm CPU executes multiple instructions per clock cycle to achieve an average clock cycles per instruction of 0.75, assuming an infinitely fast memory, while achieving a clock rate of 200 MHz. (The CPU is larger because it has on-chip caches and executes multiple instructions per clock cycle.)
Assume the yield-equation exponent alpha for both GaAs and CMOS is 2. Yields for GaAs and CMOS wafers are 0.8 and 0.9 respectively. Most of this information is summarized in the following table:
        Wafer       Wafer  Cost   Defects  Freq.        Die Area   Test Dies
        Diam. (cm)  Yield  ($)    (1/cm2)  (MHz)  CPI   (cm x cm)  (per wafer)
GaAs    10          0.80   2000   3.0      1000   2.5   1.0 x 1.0  4
CMOS    20          0.90   1000   1.8      200    0.75  1.0 x 2.0  4
Hint: Here are two equations that may help:
dies per wafer = pi x (wafer diameter / 2)^2 / die area - pi x wafer diameter / sqrt(2 x die area) - test dies per wafer
die yield = wafer yield x (1 + defects per unit area x die area / alpha)^(-alpha)
(a) Calculate the average execution time for each instruction with an infinitely fast memory. Which CPU is faster, and by what factor?
(b) How many seconds will each CPU take to execute a one-billion-instruction program?
(c) What is the cost of a GaAs die for the CPU? Repeat the calculation for CMOS die. Show your work.
(d) What is the ratio of the cost of the GaAs die to the cost of the CMOS die? (e) Based on the costs and performance ratios of the CPU calculated above, what
is the ratio of cost/performance of the CMOS CPU to the GaAs CPU?
Answer:
(a) Execution time (GaAs) for one instruction = 2.5 x 1 ns = 2.5 ns
Execution time (CMOS) for one instruction = 0.75 x 5 ns = 3.75 ns
The GaAs CPU is faster, by 3.75/2.5 = 1.5 times.
(b) Execution time (GaAs) = 1 x 10^9 x 2.5 ns = 2.5 seconds
Execution time (CMOS) = 1 x 10^9 x 3.75 ns = 3.75 seconds
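The same arithmetic for parts (a) and (b), as a Python sketch:

```python
t_gaas = 2.5 / 1000e6        # CPI / clock rate = 2.5 ns per instruction
t_cmos = 0.75 / 200e6        # 3.75 ns per instruction
speed_ratio = t_cmos / t_gaas   # GaAs is 1.5x faster
prog_gaas = 1e9 * t_gaas        # seconds for a one-billion-instruction program
prog_cmos = 1e9 * t_cmos
print(speed_ratio, prog_gaas, prog_cmos)
```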
(c) GaAs: dies per wafer = 67
die yield = 0.2
Cost of a GaAs CPU die = $2000 / (0.2 x 67) = $149.25
CMOS: dies per wafer = 121
die yield = 0.576
Cost of a CMOS CPU die = $1000 / (0.576 x 121) = $14.35
(d) The cost of a GaAs die is 149.25/14.35 = 10.4 times that of a CMOS die.
(e) The ratio of cost/performance of the CMOS CPU to the GaAs CPU is 10.4/1.5 = 6.93.
6. Given the following 8 possible solutions for a POP or a PUSH operation in a STACK:
(1) Read from Mem(SP), decrement SP
(2) Read from Mem(SP), increment SP
(3) Decrement SP, read from Mem(SP)
(4) Increment SP, read from Mem(SP)
(5) Write to Mem(SP), decrement SP
(6) Write to Mem(SP), increment SP
(7) Decrement SP, write to Mem(SP)
(8) Increment SP, write to Mem(SP)
Choose only ONE of the above solutions for each of the following questions.
(a) Solution of a PUSH operation for a Last Full stack that grows ascending.
(b) Solution of a POP operation for a Next Empty stack that grows ascending.
(c) Solution of a PUSH operation for a Next Empty stack that grows ascending.
(d) Solution of a POP operation for a Last Full stack that grows ascending.
Answer:
(a) (8) (b) (3) (c) (6) (d) (1)
(Diagram: stack pointer (SP) positions for a Last Full stack versus a Next Empty stack, with addresses running from small to big.)
Last Full PUSH: (1) increment SP; (2) write to Mem(SP)
Last Full POP: (1) read from Mem(SP); (2) decrement SP
Next Empty PUSH: (1) write to Mem(SP); (2) increment SP
Next Empty POP: (1) decrement SP; (2) read from Mem(SP)
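The four disciplines can be modeled in a short Python sketch (the class name and starting addresses are illustrative) to confirm that each chosen solution preserves LIFO order on an ascending stack:

```python
class AscendingStack:
    """Stack growing toward higher addresses; memory is a plain dict."""
    def __init__(self, last_full: bool):
        self.mem = {}
        self.sp = -1 if last_full else 0   # arbitrary starting addresses
        self.last_full = last_full

    def push(self, v):
        if self.last_full:       # solution (8): increment SP, write to Mem(SP)
            self.sp += 1
            self.mem[self.sp] = v
        else:                    # solution (6): write to Mem(SP), increment SP
            self.mem[self.sp] = v
            self.sp += 1

    def pop(self):
        if self.last_full:       # solution (1): read from Mem(SP), decrement SP
            v = self.mem[self.sp]
            self.sp -= 1
        else:                    # solution (3): decrement SP, read from Mem(SP)
            self.sp -= 1
            v = self.mem[self.sp]
        return v

for lf in (True, False):
    s = AscendingStack(lf)
    s.push(1); s.push(2)
    print(s.pop(), s.pop())      # 2 1 under either discipline
```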
7. Execute the following Copy loop on a pipelined machine:
Copy: lw $10, 1000($20)
      sw $10, 2000($20)
      addiu $20, $20, -4
      bne $20, $0, Copy
Assume that the machine datapath neither stalls nor forwards on hazards, so you
must add nop instructions.
(a) Rewrite the code inserting as few nop instructions as needed for proper
execution;
(b) Use multi-clock-cycle pipeline diagram to show the correctness of your solution.
Answer: Suppose that a register write and a read of the same register can occur in the same clock cycle.
(a) One nop suffices if the addiu is moved between the lw and the sw; the store offset becomes 2004 because $20 has already been decremented:
Copy: lw $10, 1000($20)
      addiu $20, $20, -4
      nop
      sw $10, 2004($20)
      bne $20, $0, Copy
(b)
         1    2    3    4    5    6    7    8    9    10
lw       IF   ID   EX   MEM  WB
addiu         IF   ID   EX   MEM  WB
nop                IF   ID   EX   MEM  WB
sw                      IF   ID   EX   MEM  WB
bne                          IF   ID   EX   MEM  WB
lw                                IF   ID   EX   MEM  WB
Every register is read three or more instructions after the instruction that writes it (two instructions in between), so with same-cycle register write/read no hazards remain.
8. In a Personal Computer, the optical drive has a rotation speed of 7500 rpm, a
40,000,000 bytes/second transfer rate, and a 60 ms seek time. The drive is served
with a 16 MHz bus that is 16 bits wide.
(a) How long does the drive take to read a random 100,000-byte sector? (b) When transferring the 100,000-byte data, what is the bottleneck?
Answer:
(a) Time = seek + rotational latency + transfer
  = 60 ms + 0.5 x (60/7500) s + 100,000/40,000,000 s
  = 60 ms + 4 ms + 2.5 ms = 66.5 ms
(b) The time for the bus to transfer 100,000 bytes is (100,000 / 2) / (16 x 10^6) s = 3.125 ms.
So, the optical drive is the bottleneck.
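The arithmetic above can be reproduced in Python (a sketch of the same timing model):

```python
seek_ms = 60.0
rotation_ms = 0.5 * (60.0 / 7500) * 1000    # half a revolution at 7500 rpm = 4 ms
transfer_ms = 100_000 / 40_000_000 * 1000   # bytes / (bytes per second) = 2.5 ms
total_ms = seek_ms + rotation_ms + transfer_ms
bus_ms = (100_000 / 2) / 16e6 * 1000        # 16-bit bus: 2 bytes per 16 MHz cycle
print(total_ms, bus_ms)                     # 66.5 3.125
```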
9. A processor has a 16 KB, 4-way set-associative data cache with 32-byte blocks.
(a) What is the number of sets in the L1 cache?
(b) The memory is byte-addressable and addresses are 35 bits long. Show the breakdown of the address into its cache access components.
(c) How many total bytes are required for the cache?
(d) Memory is connected via a 16-bit bus. It takes 100 clock cycles to send a request to memory and to receive a cache block. The cache has a 1-cycle hit time and a 95% hit rate. What is the average memory access time?
(e) 25% of a program's instructions are memory-access instructions. What is the average number of memory-stall cycles per instruction when running this program?
Answer:
(a) 16 KB / (32 x 4) = 128 sets
(b)
tag      index   block offset   byte offset
23 bits  7 bits  3 bits         2 bits
(c) 2^7 x 4 x (1 + 23 + 32 x 8) bits = 140 Kbits = 17.5 KB
(d) Average memory access time = 1 + 0.05 x 100 = 6 clock cycles
(e) (6 - 1) x 1.25 = 6.25 clock cycles
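These answers follow mechanically from the cache parameters; a Python sketch:

```python
cache_bytes = 16 * 1024
block_bytes = 32
ways = 4
sets = cache_bytes // (block_bytes * ways)        # 128 sets
index_bits = sets.bit_length() - 1                # 7
block_offset_bits = 3                             # 8 words per 32-byte block
byte_offset_bits = 2                              # 4 bytes per word
tag_bits = 35 - index_bits - block_offset_bits - byte_offset_bits   # 23
total_bits = sets * ways * (1 + tag_bits + block_bytes * 8)         # valid + tag + data
amat = 1 + 0.05 * 100                             # hit time + miss rate x penalty
stalls = (amat - 1) * (1 + 0.25)                  # 1.25 memory accesses per instruction
print(sets, tag_bits, total_bits, amat, stalls)
```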
94
1. Compare two memory system designs for a classic 5-stage pipelined processor.
Both memory systems have a 4-KB instruction cache. But system A has a
4K-byte data cache, with a miss rate of 10% and a hit time of 1 cycle; and system
B has an 8K-byte data cache, with a miss rate of 5% and a hit time of 2 cycles
(the cache is not pipelined). For both data caches, cache lines hold a single word
(4 bytes), and the miss penalty is 10 cycles. What are the respective average
memory access times for data retrieved by load instructions for the above two
memory system designs, measured in clock cycles?
Answer:
Average memory access time for system A = 1 + 0.1 x 10 = 2 cycles
Average memory access time for system B = 2 + 0.05 x 10 = 2.5 cycles
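The comparison as a two-line Python check (AMAT = hit time + miss rate x miss penalty):

```python
amat_a = 1 + 0.10 * 10   # 4 KB cache: 1-cycle hit, 10% misses, 10-cycle penalty
amat_b = 2 + 0.05 * 10   # 8 KB cache: 2-cycle hit, 5% misses
print(amat_a, amat_b)    # 2.0 2.5
```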
2. (a) Describe at least one clear advantage a Harvard architecture (separate
instruction and data caches) has over a unified cache architecture (a single
cache memory array accessed by a processor to retrieve both instruction and
data)
(b) Describe one clear advantage a unified cache architecture has over the Harvard
architecture
Answer:
(a) Cache bandwidth is higher for a Harvard architecture than for a unified cache, since an instruction fetch and a data access can proceed in parallel.
(b) The hit ratio is higher for a unified cache than for a Harvard architecture, since the single array can allocate capacity between instructions and data as the workload demands.
3. (a) What is RAID?
(b) Match the RAID levels 1, 3, and 5 to the following phrases for the best match.
Use each level only once.
Data and parity striped across multiple disks
Can withstand selective multiple disk failures
Requires only one disk for redundancy
Answer:
(a) An organization of disks that uses an array of small and inexpensive disks so
as to increase both performance and reliability
(b) RAID 5 Data and parity striped across multiple disks
RAID 1 Can withstand selective multiple disk failures
RAID 3 Requires only one disk for redundancy
4. (a) Explain the differences between a write-through policy and a write-back policy.
(b) Tell which policy cannot be used in a virtual memory system, and describe the
reason
Answer:
(a) Write through: always write the data into both the cache and the memory
Write back: updating values only to the block in the cache, then writing the
modified block to the lower level of the hierarchy when the block is replaced
(b) Write-through will not work for virtual memory, since writes take too long.
Instead, virtual memory systems use write-back
5. (a) What is a denormalized number (denorm or subnormal)?
(b) Show how to use gradual underflow to represent a denorm in a floating point
number system.
Answer:
(a) For an IEEE 754 floating point number, if the exponent is all 0s, but the
fraction is non-zero then the value is a denormalized number, which does not
have an assumed leading 1 before the binary point. Thus, this represents a
number (-1)^s x 0.f x 2^(-126), where s is the sign bit and f is the fraction.
(b) Denormalized numbers allow a value to degrade in significance gradually until it
becomes 0; this is called gradual underflow.
For example, the smallest positive single-precision normalized number is
1.0000 0000 0000 0000 0000 000two x 2^(-126)
but the smallest positive single-precision denormalized number is
0.0000 0000 0000 0000 0000 001two x 2^(-126), or 1.0two x 2^(-149)
6. Try to show the following structure in the memory map of a 64-bit Big-Endian
machine, by plotting the answer in a two-row map where each row contains 8
bytes.
struct {
    int a;       // 0x11121314
    char c[7];   // 'A', 'B', 'C', 'D', 'E', 'F', 'G'
    short e;     // 0x2122
} s;
Answer:
offset  0  1  2  3  4  5  6  7
        11 12 13 14 A  B  C  D
offset  8  9  10 11 12 13 14 15
        E  F  G  21 22
(int: 4 bytes; short: 2 bytes; char: 1 byte. The big-endian convention stores the most significant byte at the lowest address, so 0x11121314 appears as 11 12 13 14 and 0x2122 as 21 22.)
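Python's struct module can reproduce this layout; the '>' prefix selects big-endian with no padding, which matches the map above (a real C compiler might insert a padding byte before the short for alignment):

```python
import struct

# int a = 0x11121314, char c[7] = "ABCDEFG", short e = 0x2122
raw = struct.pack(">i7sh", 0x11121314, b"ABCDEFG", 0x2122)
print(raw[:8].hex(" "))   # 11 12 13 14 41 42 43 44  (41..44 are 'A'..'D')
print(raw[8:].hex(" "))   # 45 46 47 21 22           ('E', 'F', 'G', then the short)
```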
7. Assume we have the following 3 ISA styles:
(1) Stack: All operations occur on top of stack where PUSH and POP are the only
instructions that access memory;
(2) Accumulator: All operations occur between an Accumulator and a memory
location;
(3) Load-Store: All operations occur in registers, and register-to-register
instructions use 3 registers per instruction.
(a) For each of the above ISAs, write an assembly code for the following
program segment using LOAD, STORE, PUSH, POP, ADD, and SUB and
other necessary assembly language mnemonics.
{ A = A + C;
  D = A - B; }
(b) Some operations are not commutative (e.g., subtraction). Discuss what are
the advantages and disadvantages of the above 3 ISAs when executing
non-commutative operations.
Answer:
(a)
(1) Stack (2) Accumulator (3) Load-Store
PUSH A LOAD A LOAD R1, A
PUSH C ADD C LOAD R2, C
ADD STORE A ADD R1, R1, R2
POP A SUB B STORE R1, A
PUSH A STORE D LOAD R2, B
PUSH B SUB R1, R1, R2
SUB STORE R1, D
POP D
(b) In the stack and accumulator ISAs the operand order is implicit: one operand is always the top of stack or the accumulator. For a non-commutative operation such as subtraction, the compiler must therefore arrange operands in the required order, possibly inserting extra pushes, loads, or stores, which constrains compile-time instruction scheduling. The load-store ISA names all three registers explicitly, so either operand order can be expressed directly and the compiler keeps full scheduling freedom; the price is longer instructions with more operand fields than in the stack or accumulator ISAs.
8. The program below divides two integers through repeated addition and was
originally written for a non-pipelined architecture. The divide function takes as
its parameter a pointer to the base of an array of three elements, where X is the
first element at 0($a0), Y is the second element at 4($a0), and the result Z is to be
stored into the third element at 8($a0). Line numbers have been added to the left
for use in answering the questions below.
1 DIVIDE: add $t3, $zero, $zero
2 add $t2, $zero, $zero
3 lw $t1, 4($a0)
4 lw $t0, 0($a0)
5 LOOP: beq $t2, $t0, END
6 addi $t3, $t3, 1
7 add $t2, $t2, $t1
8 j LOOP
9 END: sw $t3, 8($a0)
(a) Given a pipelined processor as discussed in the textbook, where will data be
forwarded (answer format, e.g.: Line 10 EX/MEM to Line 11 EX)? Assume that
forwarding is used whenever possible, but that branches have not been
optimized in any way and are resolved in the EX stage.
(b) How many data hazard stalls are needed? Between which instructions should
the stall bubble(s) be introduced (ex. Line 10 and Line 11)? Again, assume
that forwarding is used whenever possible, but that branches have not been
optimized in any way and are resolved in the EX stage.
(c) If X = 6 and Y = 3,
(i) How many times is the body of the loop executed?
(ii) How many times is the branch beq not taken?
(iii) How many times is the branch beq taken?
(d) Rewrite the code assuming delayed branches are used. If it helps, you may
assume that the answer to X/Y is at least 2. Assume that forwarding is used
whenever possible and that branches are resolved in IF/ID. Do not worry
about reducing the number of times through the loop, but arrange the code to
use as few cycles as possible by avoiding stalls and wasted instructions.
Answer:
(a) Line 4 MEM/WB to Line 5 EX (the lw result is forwarded to the beq)
(b) 1 stall is needed, between Line 4 and Line 5
(c) (i) 2 (ii) 2 (iii) 1
(d) DIVIDE: add $t2, $zero, $zero
lw $t0, 0($a0)
add $t3, $zero, $zero
lw $t1, 4($a0)
LOOP: beq $t2, $t0, END
add $t2, $t2, $t1
j LOOP
addi $t3, $t3, 1
END: sw $t3, 8($a0)
93
1. Explain how each of the following six features contributes to the definition of a
RISC machine: (a) Single-cycle operation, (b) Load/Store design, (c) Hardwired
control, (d) Relatively few instructions and addressing modes, (e) Fixed
instruction format, (f) More compile-time effort.
Answer:
(a) Single-cycle operation: most instructions complete in one clock cycle, which keeps throughput high and control simple.
(b) Load/Store design: only load and store instructions access memory; all other operations work on registers, simplifying the datapath and instruction timing.
(c) Hardwired control: control signals come from combinational logic rather than microcode, shortening the cycle time.
(d) Relatively few instructions and addressing modes: simpler decoding and a simpler control unit.
(e) Fixed instruction format: fields sit in fixed positions, so decoding and operand fetch can proceed in parallel.
(f) More compile-time effort: the compiler schedules instructions and manages registers, moving complexity from hardware to software; this combination is what defines a RISC machine.
2. (1) Give an example of structural hazard.
(2) Identify all of the data dependencies in the following code. Show which
dependencies are data hazards and how they can be resolved via
forwarding?
add $2, $5, $4
add $4, $2, $5
sw $5, 100($2)
add $3, $2, $4
Answer:
(1) Consider a datapath with a single memory shared by instructions and data, running:
1 lw $5, 100($2)
2 add $2, $7, $4
3 add $4, $2, $5
4 sw $5, 100($2)
In clock cycle 4, instruction 1 is in its MEM stage while instruction 4 needs its IF stage; both require the single memory in the same cycle, which is a structural hazard.
(2) Number the instructions:
1 add $2, $5, $4
2 add $4, $2, $5
3 sw $5, 100($2)
4 add $3, $2, $4
Register  Data dependencies      Data hazards
$2        (1,2), (1,3), (1,4)    (1,2), (1,3)
$4        (2,4)                  (2,4)
Take the pair (1, 2) for example. We don't need to wait for the first instruction to
complete before trying to resolve the data hazard: as soon as the ALU creates the
sum for the first instruction, we can forward it as an input to the second
instruction.
3. Explain (1) what is a precise interrupt? (2) what does RAID mean? (3) what does
TLB mean?
Answer:
(1) An interrupt or exception that is always associated with the correct instruction
in pipelined computers.
(2) An organization of disks that uses an array of small and inexpensive disks so
as to increase both performance and reliability.
(3) A cache that keeps track of recently used address mappings to avoid an access
to the page table.
4. Consider a 32-byte direct-mapped write-through cache with 8-byte blocks.
Assume the cache updates on write hits and ignores write misses. Complete the
table below for a sequence of memory references which occur from left to right.
(Redraw the table in your answer sheet)
address 00 16 48 08 56 16 08 56 32 00 60
read/write r r r r r r r w w r r
index 0 2
tag 0 0
hit/miss miss
Answer:
Assume 6-bit addresses (enough for the largest reference, 60). A 32-byte direct-mapped cache with 8-byte blocks has 4 blocks, so: block offset = 3 bits [2:0]; index = 2 bits [4:3]; tag = 6 - 3 - 2 = 1 bit [5].
address           tag               index
decimal  binary   binary  decimal  binary  decimal
00 000000 0 0 00 0
16 010000 0 0 10 2
48 110000 1 1 10 2
08 001000 0 0 01 1
56 111000 1 1 11 3
16 010000 0 0 10 2
08 001000 0 0 01 1
56 111000 1 1 11 3
32 100000 1 1 00 0
00 000000 0 0 00 0
60 111100 1 1 11 3
address 00 16 48 08 56 16 08 56 32 00 60
read/write r r r r r r r w w r r
index 0 2 2 1 3 2 1 3 0 0 3
tag 0 0 1 0 1 0 0 1 1 0 1
hit/miss miss miss miss miss miss miss hit hit miss hit hit
Because the cache updates on write hits, the write to 56 (the third reference to it) is a hit and updates the cache. The write to 32 is a miss and is ignored, so the block at index 0 still holds tag 0 and the later read of 00 hits.
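The table can be reproduced mechanically; here is a short Python sketch (not part of the original answer) that replays the reference stream through the cache as specified (write hits update the cache in place, write misses are ignored):

```python
# 4-block direct-mapped cache, 8-byte blocks, write-through, no write-allocate.
refs = [(0, 'r'), (16, 'r'), (48, 'r'), (8, 'r'), (56, 'r'), (16, 'r'),
        (8, 'r'), (56, 'w'), (32, 'w'), (0, 'r'), (60, 'r')]

cache = [None] * 4                  # tag stored per block, None = empty
results = []
for addr, op in refs:
    index = (addr >> 3) & 0b11      # address bits [4:3]
    tag = addr >> 5                 # address bits [7:5]
    hit = cache[index] == tag
    if not hit and op == 'r':       # read miss: allocate the block
        cache[index] = tag
    # a write hit updates the block in place; a write miss is ignored
    results.append('hit' if hit else 'miss')

print(results)
```

The printed list matches the hit/miss row of the table above.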
5. (1) List two Branch Prediction strategies and (2) compare their differences.
Answer:
(1) Static prediction and dynamic prediction.
(2) Static prediction:
(a) The prediction is fixed before the program runs (e.g., always predict taken, or let the compiler decide).
(b) The misprediction penalty is paid whenever the fixed guess is wrong.
(c) It requires little or no prediction hardware.
Dynamic prediction:
(a) The prediction is made at run time using run-time information such as the branch history.
(b) The misprediction penalty is lower on average because the prediction accuracy is higher.
(c) It requires extra hardware such as a branch history table.
6. Explain how the reference bit in a page table entry is used to implement an approximation to the LRU replacement strategy.
Answer:
The operating system periodically clears the reference bits and later records them
so it can determine which pages were touched during a particular time period.
With this usage information, the operating system can select a page that is among
the least recently referenced.
7. Trace Booth's algorithm step by step for the multiplication of 2 × (-6)
Answer:
2ten × (-6ten) = 0010two × 1010two = 1111 0100two = -12ten
Iteration  Step                       Multiplicand  Product
0          Initial values             0010          0000 1010 0
1          00: no operation           0010          0000 1010 0
           Shift right product        0010          0000 0101 0
2          10: prod = prod - Mcand    0010          1110 0101 0
           Shift right product        0010          1111 0010 1
3          01: prod = prod + Mcand    0010          0001 0010 1
           Shift right product        0010          0000 1001 0
4          10: prod = prod - Mcand    0010          1110 1001 0
           Shift right product        0010          1111 0100 1
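The iteration above can be checked with a Python sketch (not from the exam; the register layout and helper name are my own) of 4-bit Booth multiplication using a 9-bit product register (4-bit upper half, 4-bit multiplier, 1 extra bit on the right):

```python
def booth_multiply(mcand, mplier, bits=4):
    """Booth's algorithm on `bits`-bit two's-complement operands."""
    mask = (1 << bits) - 1
    prod = (mplier & mask) << 1              # [0000 | multiplier | 0]
    low = (1 << (bits + 1)) - 1              # mask for the lower 5 bits
    for _ in range(bits):
        pair = prod & 0b11
        if pair == 0b10:                     # 10: subtract multiplicand
            upper = ((prod >> (bits + 1)) - mcand) & mask
            prod = (upper << (bits + 1)) | (prod & low)
        elif pair == 0b01:                   # 01: add multiplicand
            upper = ((prod >> (bits + 1)) + mcand) & mask
            prod = (upper << (bits + 1)) | (prod & low)
        sign = (prod >> (2 * bits)) & 1      # arithmetic shift right by 1
        prod = (prod >> 1) | (sign << (2 * bits))
    result = prod >> 1                       # drop the extra bit
    if result & (1 << (2 * bits - 1)):       # interpret as signed
        result -= 1 << (2 * bits)
    return result

print(booth_multiply(0b0010, 0b1010))        # prints: -12
```

Each intermediate value of `prod` matches a product-column entry in the trace above.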
8. What are the differences between Trap and Interrupt?
Answer:
An interrupt is an event that comes from outside the CPU (the processor), for example from an I/O device requesting service, and it is asynchronous with respect to the running program. A trap is an event raised inside the processor itself, for example arithmetic overflow or a system call, and it is synchronous with instruction execution; a trap is therefore sometimes called an internal interrupt.
92
1. A certain machine with a 10 ns (10 × 10^-9 s) clock period can perform jumps (1 cycle), branches (3 cycles), arithmetic instructions (2 cycles), multiply
instructions (5 cycles), and memory instructions (4 cycles). A certain program has
10% jumps, 10% branches, 50% arithmetic, 10% multiply, and 20% memory
instructions. Answer the following question. Show your derivation in sufficient
detail.
(1) What is the CPI of this program on this machine?
(2) If the program executes 10^9 instructions, what is its execution time?
(3) A 5-cycle multiply-add instruction is implemented that combines an
arithmetic and a multiply instruction. 50% of the multiplies can be turned into
multiply-adds. What is the new CPI?
(4) Following (3) above, if the clock period remains the same, what is the program's new execution time?
Answer:
(1) CPI = 1 × 0.1 + 3 × 0.1 + 2 × 0.5 + 5 × 0.1 + 4 × 0.2 = 2.7
(2) Execution time = 10^9 × 2.7 × 10 ns = 27 s
(3) CPI = (1 × 0.1 + 3 × 0.1 + 2 × 0.45 + 5 × 0.05 + 4 × 0.2 + 5 × 0.05) / (0.1 + 0.1 + 0.45 + 0.05 + 0.2 + 0.05) = 2.6 / 0.95 = 2.74
(4) Execution time = 10^9 × 0.95 × 2.74 × 10 ns = 26.03 s
Note: after fusing, the instruction count is only 95% of the original, so the CPI must be averaged over the new instruction count (hence the division by 0.95).
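A quick numeric check of parts (1) through (4), written as a Python sketch (not part of the original solution):

```python
# Instruction mix as (frequency, cycles); clock period 10 ns, 10^9 instructions.
mix = {"jump": (0.10, 1), "branch": (0.10, 3), "arith": (0.50, 2),
       "mul": (0.10, 5), "mem": (0.20, 4)}
cpi = sum(f * c for f, c in mix.values())
exec_time = 1e9 * cpi * 10e-9                  # seconds

# (3)/(4): half the multiplies fuse with an arithmetic instruction into a
# 5-cycle multiply-add, removing 0.05 of the original instruction count.
new_mix = {"jump": (0.10, 1), "branch": (0.10, 3), "arith": (0.45, 2),
           "mul": (0.05, 5), "mem": (0.20, 4), "muladd": (0.05, 5)}
new_count = sum(f for f, _ in new_mix.values())              # 0.95
new_cpi = sum(f * c for f, c in new_mix.values()) / new_count
new_time = 1e9 * new_count * new_cpi * 10e-9

print(round(cpi, 2), round(exec_time, 2), round(new_cpi, 2), round(new_time, 2))
# prints: 2.7 27.0 2.74 26.0
```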
2. Answer True (O) or False (✗) for each of the following. (NO penalty for wrong
answer.)
(1) Most computers use direct mapped page tables.
(2) Increasing the block size of a cache is likely to take advantage of temporal
locality.
(3) Increasing the page size tends to decrease the size of the page table.
(4) Virtual memory typically uses a write-back strategy, rather than a
write-through strategy.
(5) If the cycle time and the CPI both increase by 10% and the number of
instruction deceases by 20%, then the execution time will remain the same.
(6) A page fault occurs when the page table entry cannot be found in the
translation lookaside buffer.
(7) To store a given amount of data, direct mapped caches are typically smaller
than either set associative or fully associative caches, assuming that the
block size for each cache is the same.
(8) The two's complement of a negative number is always a positive number in the same number format.
(9) A RISC computer will typically require more instructions than a CISC
computer to implement a given program.
(10) Pentium 4 is based on the RISC architecture.
Answer:
(1) ✗ (2) ✗ (3) O (4) O (5) ✗ (6) ✗ (7) O (8) ✗ (9) O (10) ✗
: Modern CPUs like the AthlonXP and Pentium 4 are based on a mixture of RISC and CISC.
3. The average memory access time (AMAT) is defined as
AMAT = hit time + miss_rate × miss_penalty
Answer the following two questions. Show your derivation in sufficient detail.
(1) Find the AMAT of a 100MHz machine, with a miss penalty of 20 cycles, a hit
time of 2 cycles, and a miss rate of 5%.
(2) Suppose doubling the size of the cache decreases the miss rate to 3%, but
causes the hit time to increase to 3 cycles and the miss penalty to increase to 21
cycles. What is the AMAT of the new machine?
Answer:
(1) AMAT = (2 + 0.05 × 20) × 10 ns = 30 ns
(2) AMAT = (3 + 0.03 × 21) × 10 ns = 36.3 ns
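Both computations can be verified with a couple of lines of Python (added here as a sketch, not part of the original answer):

```python
cycle_ns = 10                          # 100 MHz clock, so 10 ns per cycle

amat_old = (2 + 0.05 * 20) * cycle_ns  # hit time + miss rate * miss penalty
amat_new = (3 + 0.03 * 21) * cycle_ns  # after doubling the cache size
print(round(amat_old, 1), round(amat_new, 1))   # prints: 30.0 36.3
```

Despite the lower miss rate, the larger cache is slower here because both the hit time and the miss penalty grew.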
4. If a pipelined processor has 5 stages and takes 100 ns to execute N instructions,
how long will it take to execute 2N instructions, assuming the clock rate is 500
MHz and no pipeline stalls occur?
Answer:
Clock cycle time = 1 / (500 × 10^6) = 2 ns; N + 4 = 100 / 2 = 50, so N = 46
The execution time of 2N instructions = 2 × 46 + 4 = 96 clock cycles = 192 ns
96
1. Answer the following questions briefly:
(a) Typically one CISC instruction, since it is more complex, takes more time to
complete than a RISC instruction. Assume that an application needs N CISC
instructions and 2N RISC instructions, and that one CISC instruction takes an
average 5T ns to complete, and one RISC instruction takes 2T ns. Which
processor has the better performance?
(b) Which of the following processors have a CISC instruction set architecture?
ARM AMD Opteron
Alpha 21164 IBM PowerPC
Intel 80x86 MIPS
Sun UltraSPARC
(c) True & False questions;
(1) There are four types of data hazards; RAR, RAW, WAR, and WAW.
(True or False?)
(2) AMD and Intel recently added 64-bit capability to their processors
because most programs run much faster with 64-bit instructions. (True or
False?)
(3) With a modern processor capable of dynamic instruction scheduling and
out-of-order execution, it is better that the compiler does not optimize
the instruction sequences. (True or False?)
Answer:
(a) CISC time = N × 5T = 5NT ns
RISC time = 2N × 2T = 4NT ns
RISC time < CISC time, so the RISC architecture has better performance.
(b) Intel 80x86, AMD Opteron
(c) (1) False, RAR does not cause data hazard.
(2) False, most programs run much faster with 64-bit processors, not 64-bit
instructions.
(3) False, the compiler still tries to help improve the issue rate by placing the
instructions in a beneficial order.
2. For commercial applications, it is important to keep data on-line and safe in
multiple places.
(a) Suppose we want to backup 100GB of data over the network. How many
hours does it take to send the data by FTP over the Internet? Assume the
average bandwidth between the two places is 1Mbits/sec.
(b) Would it be better if you burn the data onto DVDs and mail the DVDs to the
other site? Suppose it takes 10 minutes to burn a DVD which has 4GB
capacity and the fast delivery service can deliver in 12 hours.
Answer:
(a) 100 GB / 1 Mbit/s = (100 × 1024 × 8) Mbits / 1 Mbit/s = 819,200 seconds ≈ 227.56 hours
(b) (100 GB / 4 GB) × 10 minutes = 250 minutes ≈ 4.17 hours
4.17 + 12 = 16.17 hours < 227.56 hours
So, it is better to burn the data onto DVDs and mail them to the other site.
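A quick check of the arithmetic (a Python sketch, not part of the original answer; 1 GB is taken as 1024 MB, as above):

```python
# 100 GB over a 1 Mbit/s link, versus burning 4 GB DVDs and shipping them.
ftp_hours = 100 * 1024 * 8 / 1 / 3600      # (GB -> Mbit) / (1 Mbit/s), in hours
dvd_hours = (100 / 4) * 10 / 60 + 12       # burn 25 DVDs, then 12 h delivery
print(round(ftp_hours, 2), round(dvd_hours, 2))   # prints: 227.56 16.17
```

This is the classic "never underestimate the bandwidth of a station wagon full of tapes" result: physical shipment wins by more than a factor of ten here.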
3. Suppose we have an application running on a shared-memory multiprocessor.
With one processor, the application runs for 30 minutes.
(a) Suppose the processor clock rate is 2GHz. The average CPI (assuming that all
references hit in the cache) on single processor is 0.5. How many instructions
are executed in the application?
(b) Suppose we want to reduce the run time of the application to 5 minutes with 8
processors. Let's optimistically assume that parallelization adds zero overhead
to the application, i.e. no extra instructions, no extra cache misses, no
communications, etc. What fraction of the application must be executed in
parallel?
(c) Suppose 100% of our application can be executed in parallel. Let's now
consider the communication overhead. Assume the multiprocessor has a 200
ns time to handle reference to a remote memory and processors are stalled on
a remote request. For this application, assume 0.02% of the instructions
involve a remote communication reference, no matter how many processors
are used. How many processors are needed at least to make the run time be
less than 5 minutes?
(d) Following the above question, but let's assume the remote communication
references in the application increases as the number of processors increases.
With N processors, 0.02 × (N - 1)% of the instructions involve a remote
communication reference. How many processors will deliver the maximum
speedup?
Answer:
(a) 30 × 60 seconds = Instruction count × 0.5 × 0.5 ns
Instruction count = 1800 / 0.25 ns = 7200 × 10^9
(b) Suppose the fraction of the application that must be executed in parallel is F.
So, 30 × ((1 - F) + F/8) = 5
F = 20/21 = 0.952
(c) Assume N is the number of processors that will make the run time < 5 minutes:
(30 × 60)/N + 7200 × 10^9 × 0.0002 × 200 ns < 5 × 60, so N > 150
So, at least 150 processors are needed to make the run time < 5 minutes.
(d) Speedup = (30 × 60) / (1800/N + 7200 × 10^9 × 0.0002 × (N - 1) × 200 ns)
= 1800 / (1800/N + 288(N - 1))
Setting the derivative of the denominator to zero: -1800/N^2 + 288 = 0, so N = 2.5
2.5 processors will deliver the maximum speedup.
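The continuous optimum N = 2.5 can be sanity-checked numerically (a Python sketch, not part of the original answer); among whole processor counts, N = 3 gives the shortest run time:

```python
# Run time in seconds as a function of the processor count N:
# 1800/N for compute plus 288*(N-1) for remote-reference stalls.
def run_time(n):
    return 1800 / n + 288 * (n - 1)

best = min(range(1, 20), key=run_time)     # best integer processor count
print(best, run_time(2), run_time(3))      # prints: 3 1188.0 1176.0
```

Note that even at the optimum the run time is far above the 5-minute target, which is why part (c)'s fixed communication rate mattered so much.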
4. Number representation.
(a) What range of integer number can be represented by 16-bit 2's complement
number?
(b) Perform the following 8-bit 2's complement number operation and check
whether arithmetic overflow occurs. Check your answer by converting to
decimal sign-and-magnitude representation.
11010011
11101100
Answer:
(a) -2^15 ~ +(2^15 - 1)
(b) 11010011 - 11101100 = 11010011 + 00010100 = 11100111
check: -45 - (-20) = -45 + 20 = -25
The range for an 8-bit 2's complement number is -2^7 ~ +(2^7 - 1).
So, no overflow.
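The 8-bit arithmetic can be confirmed with a short Python sketch (not part of the original answer; `to_signed8` is a helper name of my own):

```python
def to_signed8(x):
    """Interpret an 8-bit pattern as a two's-complement value."""
    return x - 256 if x & 0x80 else x

a, b = 0b11010011, 0b11101100            # -45 and -20
diff = (a + ((~b + 1) & 0xFF)) & 0xFF    # a - b as addition of the negation
print(format(diff, '08b'), to_signed8(diff))   # prints: 11100111 -25
```

Since -25 lies inside the 8-bit range, the subtraction does not overflow, matching the check above.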
5. Bus
(a) Draw a graph to show the memory hierarchy of a system that consists of CPU,
Cache, Memory and I/O devices. Mark where memory bus and I/O bus is.
(b) Assuming system 1 has a synchronous 32-bit bus with clock rate = 33 MHz
running at 2.5 V. System 2 has a 64-bit bus with clock rate = 66 MHz running
at 1.8V. Assuming the average capacitance on each bus line is 2pF for bus in
system 1. What is the maximum average capacitance allowed for the bus of
system 2 so the peak power dissipation of system 2 bus will not exceed that of
the system 1 bus?
(c) Serial bus protocol such as SATA has gained popularity in recent years. To
design a serial bus that supports the same peak throughput as the bus in
system 2, what is the clock frequency of this serial bus?
Answer:
(a) (Figure: memory hierarchy with the CPU connected to the Cache, the Cache to Memory over the memory bus, and the I/O devices attached through the I/O bus.)
(b) Power dissipation = f × C × V^2
The peak power dissipation for system 1 = 33 × 10^6 × (2 × 10^-12 × 32) × 2.5^2 = 13.2 mW
Suppose the total capacitance of the system 2 bus = C:
66 × 10^6 × C × 1.8^2 < 13.2 mW, so C < 61.73 pF
The maximum total capacitance for the system 2 bus is 61.73 pF.
(c) Since SATA uses a single signal path to transmit data serially (or bit by bit),
the frequency should be designed as 66 MHz × 64 = 4.224 GHz to support the
same peak throughput as the system 2 bus.
Note: in (b), the system 2 bus has 64 lines, so the maximum average capacitance per line is 61.73 pF / 64 ≈ 0.96 pF.
95
PART I:
Please answer the following questions in the format listed below. If you do not follow
the format, you will get zero points for these questions.
1. (1) T or F (2) T or F
(3) T or F
(4) T or F
(5) T or F
2. X = ___ Y = ___
Stall cycles = ___
3. Option ___ is ___ times faster than the old machine
4. 1-bit predictor: ___ 2-bit predictor: ___
1. True & False Questions
(1) If an address translation for a virtual page is present in the TLB, then that
virtual page must be mapped to a physical memory page.
(2) The set index decreases in size as cache associativity is increased (assume
cache size and block size remain the same).
(3) It is impossible to have a TLB hit and a data cache miss for the same data
reference.
(4) An instruction takes less time to execute on a pipelined processor than on a
nonpipelined processor (all other aspects of the processors being the same).
(5) A multi-cycle implementation of the MIPS processor requires that a single
memory be used for both instructions and data.
Answer:
(1) T (2) T (3) F (4) F (5) T
2. Consider the following program:
int A[100]; /* size(int) = 1 word */
for (i = 0; i < 100; i++)
A[i] = A[i] + 1;
The code for this program on a MIPS-like load/store architecture looks as
follows:
ADDI R1, R0, #X
ADDI R2, R0, A ; A is the base address of array A
LOOP: LD R3, 0(R2)
ADDI R3, R3, #1
SD R3, 0(R2)
ADDI R2, R2, #Y
SUBI R1, R1, #1
BNE R1, R0, LOOP
Consider a standard 5-stage MIPS pipeline. Assume that the branch is resolved
during the instruction decode stage, and full bypassing/register forwarding are
implemented. Assume that all memory references hit in the cache and TLBs. The
pipeline does not implement any branch prediction mechanism. What are values
of #X and #Y, and how many stall cycles are in one loop iteration including stalls
caused by the branch instruction?
Answer:
X = 100
Y = 4
Stall cycles = 3: (1) one between LD and ADDI, (2) one between SUBI and BNE, (3) one after BNE for the branch itself.
Since the branch decision is resolved during the ID stage, a stall cycle is needed between SUBI and BNE.
3. Suppose you had a computer that, on average, exhibited the following properties
on the programs that you run:
Instruction miss rate: 2%
Data miss rate: 4%
Percentage of memory instructions: 30%
Miss penalty: 100 cycles
There is no penalty for a cache hit (i.e. the cache can supply the data as fast as the
processor can consume it.) You want to update the computer, and your budget will
allow one of the following:
Option #1: Get a new processor that is twice as fast as your current
computer. The new processor's cache is twice as fast too, so it can keep up with the processor.
Option #2: Get a new memory that is twice as fast.
Which is a better choice? And what is the speedup of the chosen design compared
to the old machine?
Answer:
Option 2 is 4.2/2.6 = 1.62 times faster than the old machine.
Suppose that the base CPI = 1
CPIold = 1 + 0.02 × 100 + 0.04 × 0.3 × 100 = 4.2
CPIopt1 = 0.5 + 0.02 × 100 + 0.04 × 0.3 × 100 = 3.7
CPIopt2 = 1 + 0.02 × 50 + 0.04 × 0.3 × 50 = 2.6
Note: for option #1, both the processor and its cache are twice as fast, so the cycle time is halved. Since execution time = CPI × cycle time × instruction count, expressing the new CPI in units of the old cycle time halves the base-CPI contribution from 1 to 0.5, while the memory stall time is unchanged in absolute terms.
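A numeric check of the three CPI values and the final speedup (a Python sketch, not part of the original solution):

```python
# Memory stall cycles per instruction: instruction misses plus data misses.
stall = 0.02 * 100 + 0.04 * 0.3 * 100      # 3.2 stall cycles
cpi_old = 1 + stall                        # 4.2
cpi_opt1 = 0.5 + stall                     # 3.7, in old-cycle-time units
cpi_opt2 = 1 + stall / 2                   # 2.6, miss penalty halved

print(round(cpi_old, 1), round(cpi_opt1, 1), round(cpi_opt2, 1),
      round(cpi_old / cpi_opt2, 2))        # prints: 4.2 3.7 2.6 1.62
```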
4. The following series of branch outcomes occurs for a single branch in a program.
(T means the branch is taken, N means the branch is not taken).
TTTNNTTT
How many instances of this branch instruction are mis-predicted with a 1-bit and
2-bit local branch predictor, respectively? Assume that the BHT is initialized to
the N state. You may assume that this is the only branch in the program.
Answer:
1-bit predictor: 3 2-bit predictor: 5
Note: the 2-bit answer depends on the FSM used; a saturating-counter FSM gives 5 mispredictions, while another common FSM variant gives 6.
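The counts can be reproduced by simulating both predictors; the sketch below (not part of the original answer) uses a saturating up/down counter, which is the FSM variant that yields 5 mispredictions in the 2-bit case:

```python
# Outcome stream TTTNNTTT; both predictors start in the "not taken" state.
outcomes = [1, 1, 1, 0, 0, 1, 1, 1]        # 1 = taken

def mispredictions(outcomes, nbits):
    """Count mispredictions of an n-bit saturating-counter predictor."""
    counter, top = 0, (1 << nbits) - 1
    miss = 0
    for taken in outcomes:
        predict_taken = counter >= (top + 1) // 2
        miss += predict_taken != taken
        counter = min(counter + 1, top) if taken else max(counter - 1, 0)
    return miss

print(mispredictions(outcomes, 1), mispredictions(outcomes, 2))  # prints: 3 5
```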
PART II:
For the following questions in Part II, please make sure that you summarize all your
answer in the format listed below. The answers are short, such as alphabets, numbers,
or yes/no. You do not have to show your calculations. There is no partial credit for
incorrect answers.
(5a) (5b)
(6a) (6b) (6c)
(7a) (7b) (7c)
(8a) (8b) (8c) (8d) (8e)
(9a) (9b) (9c) (9d) (9e)
5. Consider the following performance measurements for a program:
Measurement Computer A Computer B Computer C
Instruction Count 12 billion 12 billion 10 billion
Clock Rate 4 GHz 3 GHz 2.8 GHz
Cycles Per Instruction 2 1.5 1.4
(5a) Which computer is faster?
(5b) Which computer has the higher MIPS rating?
Answer:
(5a) Computer C
Execution Time for Computer A = 12 × 10^9 × 2 / (4 × 10^9) = 6 seconds
Execution Time for Computer B = 12 × 10^9 × 1.5 / (3 × 10^9) = 6 seconds
Execution Time for Computer C = 10 × 10^9 × 1.4 / (2.8 × 10^9) = 5 seconds
(5b) The MIPS rates for all computers are the same.
MIPS for computer A = 4 × 10^9 / (2 × 10^6) = 2000
MIPS for computer B = 3 × 10^9 / (1.5 × 10^6) = 2000
MIPS for computer C = 2.8 × 10^9 / (1.4 × 10^6) = 2000
6. Consider the following two components in a computer system:
A CPU that sustains 2 billion instructions per second.
A memory backplane bus capable of sustaining a transfer rate of 1000
MB/sec
If the workload consists of 64 KB reads from the disk, and each read operation
takes 200,000 user instructions and 100,000 OS instructions.
(6a) Calculate the maximum I/O rate of CPU.
(6b) Calculate the maximum I/O rate of memory bus.
(6c) Which of the two components is likely to be the bottleneck for I/O?
Answer:
(6a) 6667
(6b) 15625
(6c) CPU
The maximum I/O rate of the CPU = 2 × 10^9 / (200,000 + 100,000) = 6667 per second
The maximum I/O rate of the memory bus = 1000 × 10^6 / (64 × 10^3) = 15625 per second
7. You are going to enhance a computer, and there are two possible improvements:
either make multiply instructions run four times faster than before, or make
memory access instructions run two times faster than before. You repeatedly run
a program that takes 100 seconds to execute. Of this time, 20% is used for
multiplication, 50% for memory access instructions, and 30% for other tasks.
Calculate the speedup:
(7a) Speedup if we improve only multiplication:
(7b) Speedup if we only improve memory access:
(7c) Speedup if both improvements are made:
Answer:
(7a) Speedup = 1 / (0.2/4 + 0.8) = 1.18
(7b) Speedup = 1 / (0.5/2 + 0.5) = 1.33
(7c) Speedup = 1 / (0.2/4 + 0.5/2 + 0.3) = 1.67
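The three speedups follow from Amdahl's law; a quick Python check (not part of the original answer):

```python
# 100-second profile: 20% multiply, 50% memory access, 30% other.
def speedup(mult_factor, mem_factor):
    """Amdahl's law with independent speedups for the two fractions."""
    new_time = 0.2 / mult_factor + 0.5 / mem_factor + 0.3
    return 1 / new_time

print(round(speedup(4, 1), 2),   # multiplication only
      round(speedup(1, 2), 2),   # memory access only
      round(speedup(4, 2), 2))   # both improvements
# prints: 1.18 1.33 1.67
```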
8. Multiprocessor designs have become popular for today's desktop and mobile
computing. Given a 2-way symmetric multiprocessor (SMP) system where both
of one 32-bit word. Let us examine the cache coherence traffic with the following
sequence of activities involving shared data. Assume that all the words already
exist in both caches and are clean. Fill-in the last column (8a)-(8e) in the table to
identify the coherence transactions that should occur on the bus for the sequence.
Step Processor Memory activity Memory address
Transaction
required
(Yes or No)
1 Processor 1 1-word write 100 (8a)
2 Processor 2 1-word write 104 (8b)
3 Processor 1 1-word read 100 (8c)
4 Processor 2 1-word read 104 (8d)
5 Processor 1 1-word read 104 (8e)
Answer:
(8a) Yes
(8b) Yes
(8c) No
(8d) No
(8e) No
9. False sharing can lead to unnecessary bus traffic and delays. Follow the direction
of Question 8, except change the cache coherency policy to write-invalidate and
block size to four words (128-bit). Reveal the coherence transactions on the bus
by filling-in the last column (9a)-(9e) in the table below.
Step Processor Memory activity Memory address
Transaction
required
(Yes or No)
1 Processor 1 1-word write 100 (9a)
2 Processor 2 1-word write 104 (9b)
3 Processor 1 1-word read 100 (9c)
4 Processor 2 1-word read 104 (9d)
5 Processor 1 1-word read 104 (9e)
Answer:
(9a) Yes
(9b) Yes
(9c) Yes
(9d) No
(9e) No
Note: under some variants of the snoopy protocol, the answer to (9d) would be Yes.
94
1. Suppose we have a 32 bit MIPS-like RISC processor with the following
arithmetic and logical instructions (along with their descriptions):
Addition
add rd, rs, rt Put the sum of registers rs and rt into register rd.
Addition immediate
add rt, rs, imm Put the sum of register rs and the sign-extended immediate into register rt.
Subtract
sub rd, rs, rt Register rt is subtracted from register rs and the result is put in register rd.
AND
and rd, rs, rt Put the logical AND of registers rs and rt into register rd.
AND immediate
and rt, rs, imm Put the logical AND of register rs and the zero-extended immediate into register rt.
Shift left logical
sll rd, rt, imm Shift the value in register rt left by the distance (i.e. the number of bits) indicated by the immediate (imm) and
put the result in register rd. The vacated bits are filled
with zeros.
Shift right logical
srl rd, rt, imm Shift the value in register rt right by the distance (i.e. the number of bits) indicated by the immediate (imm) and
put the result in register rd. The vacated bits are filled
with zeros.
Please use at most one instruction to generate assembly code for each of the
following C statements (assuming variable a and b are unsigned integers). You
can use the variable names as the register names in your assembly code.
(a) b = a / 8; /* division operation */
(b) b = a % 16; /* modulus operation */
Answer:
(a) srl b, a, 3
(b) and b, a, 15
For example, if a = 10010011, then a % 16 keeps only the low four bits (1001|0011 → 0011), and a / 8 drops the low three bits (10010|011 → 00010010).
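A two-line check (not part of the original answer) that the shift and mask really match unsigned division and modulus:

```python
a = 0b10010011          # the example value above (147)
q = a >> 3              # srl b, a, 3   (b = a / 8)
r = a & 15              # and b, a, 15  (b = a % 16)
print(q == a // 8, r == a % 16)   # prints: True True
```

This works because 8 and 16 are powers of two and a is unsigned; for signed values the shift would instead round toward negative infinity.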
2. Assume a RISC processor has a five-stage pipeline (as shown below) with each
stage taking one clock cycle to finish. The pipeline will stall when encountering
data hazards.
IF ID EXE MEM WB
IF: Instruction fetch
ID: Instruction decode and register file read
EXE: Execution or address calculation
MEM: Data memory access
WB: Write back to register file
(a) Suppose we have an add instruction followed immediately by a subtract
instruction that uses the add instruction's result:
add r1, r2, r3
sub r5, r1, r4
If there is no forwarding in the pipeline, how many cycle(s) will the pipeline
stall for?
(b) If we want to use forwarding (or bypassing) to avoid the pipeline stall caused
by the code sequence above, choosing from the denoted 6 points (A to F) in
the following simplified data path of the pipeline, where (from which point to
which point) should the forwarding path be connected?
(c) Suppose the first instruction of the above code sequence is a load of r1 instead
of an add (as shown below).
load r1, [r2]
sub r5, r1, r4
Assuming we have a forwarding path from point E to point C in the pipeline
data path, will there be any pipeline stall for this code sequence? If so, how
many cycle(s)? (If your first answer is yes, you have to answer the second
question correctly to get the 5 pts credit.)
Answer:
(a) If the register file can be written in the first half of a clock cycle and read in the second half, the pipeline stalls for 2 cycles; if it cannot, the pipeline stalls for 3 clock cycles.
(b) D to C
(c) Yes, 1 clock cycle
3. Cache misses are classified into three categories-compulsory, capacity, and conflict. What types of misses could be reduced if the cache block size is
increased?
Answer: compulsory
(Figure: simplified pipeline datapath with forwarding points A through F marked across the IF, ID, EXE, MEM, and WB stages.)
4. Consider three types of methods for transferring data between an I/O device and
memory: polling, interrupt driven, and DMA. Rank the three techniques in terms
of lowest impact on processor utilization.
Answer: (1) DMA, (2) Interrupt driven, (3) Polling
5. Assume an instruction set that contains 5 types of instructions: load, store,
R-format, branch and jump. Execution of these instructions can be broken into 5
steps: instruction fetch, register read, ALU operations, data access, and register
write. Table 1 lists the latency of each step assuming perfect caches.
Instruction  Instruction  Register  ALU        Data    Register
class        fetch        read      operation  access  write
Load         2 ns         1 ns      1 ns       2 ns    1 ns
Store        2 ns         1 ns      1 ns       2 ns
R-format     2 ns         1 ns      1 ns               1 ns
Branch       2 ns         1 ns      1 ns
Jump         2 ns
Table 1
(a) What is the CPU cycle time assuming a multicycle CPU implementation (i.e., each step in Table 1 takes one cycle)?
(b) Assuming the instruction mix shown below, what is the average CPI of the multicycle processor without pipelining? Assume that the I-cache and
D-cache miss rates are 3% and 10%, and the cache miss penalty is 12 CPU
cycles.
Instruction Type Frequency
Load 40%
Store 30%
R-format 15%
Branch 10%
Jump 5%
(c) To reduce the cache miss rate, the architecture team is considering increasing the data cache size. They find that by doubling the data cache size, they can
eliminate half of data cache misses. However, the data access stage now takes
4 ns. Do you suggest them to double the data cache size? Explain your
answer.
Answer:
(a) In a multicycle implementation, the CPU cycle time is determined by the slowest step, so the CPU cycle time is 2 ns.
(b) CPI without considering cache misses = 5 × 0.4 + 4 × 0.3 + 4 × 0.15 + 3 × 0.1 + 1 × 0.05 = 4.15
Average CPI = 4.15 + 0.03 × 12 + (0.3 + 0.4) × 0.1 × 12 = 5.35
(c) With the doubled data cache the cycle time becomes 4 ns, so the 24 ns miss penalty is now 6 CPU cycles.
CPI after doubling the data cache = 4.15 + 0.03 × 6 + (0.3 + 0.4) × 0.05 × 6 = 4.54
Average instruction execution time before doubling the data cache = 5.35 × 2 ns = 10.7 ns
Average instruction execution time after doubling the data cache = 4.54 × 4 ns = 18.16 ns
Since doubling the data cache increases the average instruction execution time, we do not suggest doubling the data cache.
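The CPI arithmetic in (b) and (c) can be verified with a short Python sketch (not part of the original answer):

```python
# Instruction frequencies and multicycle step counts from Table 1.
freq = {"load": 0.40, "store": 0.30, "rfmt": 0.15, "branch": 0.10, "jump": 0.05}
cycles = {"load": 5, "store": 4, "rfmt": 4, "branch": 3, "jump": 1}
base = sum(freq[k] * cycles[k] for k in freq)               # 4.15

# Memory references: every fetch (3% miss) plus loads/stores (10% then 5%).
cpi_before = base + 0.03 * 12 + (0.40 + 0.30) * 0.10 * 12   # 5.35
cpi_after = base + 0.03 * 6 + (0.40 + 0.30) * 0.05 * 6      # 4.54

# Average instruction time in ns: cycle time 2 ns before, 4 ns after.
print(round(cpi_before * 2, 2), round(cpi_after * 4, 2))    # prints: 10.7 18.16
```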
93
1. Consider a system with an average memory access time of 50 nanoseconds and a three-level page table (meta-directory, directory, and page table). For full credit,
your answer must be a single number and not a formula.
(a) If the system had an average page fault rate of 0.01% for any page accessed (data or page table related), and an average page fault took 1 millisecond to
service, what is the effective memory access time (assume no TLB or memory
cache)?
(b) Now assume the system has no page faults, we are considering adding a TLB that will take 1 nanosecond to lookup an address translation. What hit rate in
the TLB is required to reduce the effective access time to memory by a factor
of 2.5?
Answer:
(a) With no page faults, each memory access requires 4 references (meta-directory, directory, page table, and the data itself), so the effective memory access time = 4 × 50 = 200 ns. With a page fault rate of 0.01% per reference, the effective memory access time = 200 + 4 × 0.01% × 1,000,000 ns = 600 ns.
(b) The target is 200 / 2.5 = 80 ns. With the TLB: 80 ns = 1 ns + 50 ns + 150 ns × (1 - H), so H = 0.81
2. In this problem set, show your answers in the following format:
? CPU cycles
Derive your answer.
CPI = ?
Derive your answer.
Machine ? is ?% faster than ?
Derive your answer.
? CPU cycles
Derive your answer.
Both machine A and B contain one-level on-chip caches. The CPU clock rates
and cache configurations for these two machines are shown in Table 1. The
respective instruction/data cache miss rates in executing program P are also
shown in Table 1. The frequency of load/store instructions in program P is 20%.
On a cache miss, the CPU stalls until the whole cache block is fetched from the
main memory. The memory and bus system have the following characteristics:
1. the bus and memory support 16-byte block transfer;
2. 32-bit synchronous bus clocked at 200 MHz, with each 32-bit transfer taking
1 bus clock cycle, and 1 bus clock cycle required to send an address to
memory (assuming shared address and data lines);
3. assuming there is no cycle needed between each bus operation;
4. a memory access time for the first 4 words (16 bytes) is 250 ns, each
additional set of four words can be read in 25 ns. Assume that a bus transfer
of the most recently read data and a read of the next four words can be
overlapped.
Machine A Machine B
CPU clock rate 800 MHz 400 MHz
I-cache
configuration
Direct-mapped,
32-byte block, 8K
2-way, 32-byte block,
128K
D-cache
configuration
2-way, 32-byte block,
16K
4-way, 32-byte block,
256K
I-cache miss rate 6% 1%
D-cache miss rate 15% 4%
Table 1
To answer the following questions, you don't need to consider the time required
for writing data to the main memory:
(1) What is the data cache miss penalty (in CPU cycles) for machine A?
(2) What is the average CPI (Cycle per Instruction) for machine A in executing
program P? The CPI (Cycle per Instruction) is 1 without cache misses.
(3) Which machine is faster in executing program P and by how much? The CPI
(Cycle per Instruction) is 1 without cache misses for both machine A and B.
(4) What is the data cache miss penalty (in CPU cycles) for machine A if the bus
and memory system support 32-byte block transfer? All the other memory/bus
parameters remain the same as defined above.
Answer:
(a) 440 CPU cycles
Since the bus clock rate = 200 MHz, the cycle time for a bus clock = 5 ns.
The time to transfer one 32-byte block from memory to cache = 2 × (1 + 250/5 + 1 × 4) × 5 ns = 550 ns
The data miss penalty for machine A = 550 ns / (1 / 800 MHz) = 440 CPU cycles
(b) CPI = 40.6
Average CPI = 1 + 0.06 × 440 + 0.2 × 0.15 × 440 = 40.6
(c) Machine B is 409% faster than A.
Machines A and B have the same cache block size (32 bytes), so the miss penalty for machine B is also 550 ns.
Since machine B's clock rate is 400 MHz, machine B's miss penalty = 220 clock cycles.
Machine B's average CPI = 1 + 0.01 × 220 + 0.2 × 0.04 × 220 = 4.96
Execution time for machine A = 40.6 × 1.25 ns × IC = 50.75 IC
Execution time for machine B = 4.96 × 2.5 ns × IC = 12.4 IC
Machine B is 50.75 / 12.4 = 4.09 times faster than machine A.
(d) 240 CPU cycles
The time to transfer one data block from memory to cache = (1 + 250/5 + 25/5 + 4) × 5 ns = 300 ns
The data miss penalty for machine A = 300ns / (1/800MHz) = 240 CPU cycles
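The four results can be checked together with a Python sketch (not part of the original answer):

```python
# Bus cycle 5 ns; a 32-byte block needs two 16-byte memory transfers, each
# costing 1 address cycle + 250 ns access + 4 transfer cycles.
bus_cycle = 5                                   # ns
block_16B = (1 + 250 // 5 + 4) * bus_cycle      # 275 ns per 16-byte transfer

penalty_a = 2 * block_16B / 1.25                # machine A cycles (800 MHz)
cpi_a = 1 + 0.06 * penalty_a + 0.2 * 0.15 * penalty_a
penalty_b = 2 * block_16B / 2.5                 # machine B cycles (400 MHz)
cpi_b = 1 + 0.01 * penalty_b + 0.2 * 0.04 * penalty_b

print(penalty_a, round(cpi_a, 1), round(cpi_b, 2),
      round(cpi_a * 1.25 / (cpi_b * 2.5), 2))
# prints: 440.0 40.6 4.96 4.09
```

Machine B wins despite its slower clock because its much larger caches cut the miss rates dramatically.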
3. Given the bit pattern 10010011, what does it represent assuming
(a) Its a twos complement integer? (b) Its an unsigned integer?
Write down your answer in decimal format.
Answer:
(a) -2^7 + 2^4 + 2^1 + 2^0 = -109
(b) 2^7 + 2^4 + 2^1 + 2^0 = 147
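A quick Python check (not part of the original answer) of both interpretations:

```python
x = 0b10010011
unsigned = x                            # plain binary weighting
signed = x - 256 if x & 0x80 else x     # two's-complement reinterpretation
print(signed, unsigned)                 # prints: -109 147
```

Note that the two readings always differ by 2^8 = 256 whenever the top bit is set.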
4. Draw the schematic for a 4-bit 2's complement adder/subtractor that produces A + B if K = 1 and A - B if K = 0. In your design try to use a minimum number of the following basic logic gates (1-bit adders, AND, OR, INV, and XOR).
Answer:
K = 0: S = A + (B XOR 1) + 1 = A + (NOT B) + 1 = A - B
K = 1: S = A + (B XOR 0) + 0 = A + B
(Each bit of B passes through an XOR gate driven by NOT K, and NOT K also feeds the carry-in.)
5. We want to add four 4-bit numbers, A[3:0], B[3:0], C[3:0], D[3:0], together using
carry-save addition. Draw the schematic using 1-bit full adders.
Answer:
(Figure: schematic built from 1-bit full adders with inputs a3 b3 ... a0 b0, control input K, sum outputs s3..s0, and carry-out c4; the drawing itself is not recoverable from this transcript.)
6. We have an 8-bit carry-ripple adder that is too slow. We want to speed it up by
adding one pipeline stage. Draw the schematic of the resulting pipelined adder.
How many 1-bit pipeline registers do you need? Assuming the delay of a 1-bit adder
is 1 ns, what's the maximum clock frequency the resulting pipelined adder can operate at?
Answer:
(1) schematic
(2) 13 1-bit pipeline registers
(3) 1/4ns = 250MHz
(Figure: an 8-bit ripple-carry adder built from eight 1-bit full adders, with inputs a0..a7 and b0..b7, carry-in c0, and outputs s0..s7 and carry-out c8; the pipelined version cuts the chain after the fourth adder and inserts pipeline registers between the two 4-bit halves.)
92
1. A pipelined processor architecture consists of 5 pipeline stages: instruction fetch
(IF), instruction decode and register read (ID), execution or address calculate
(EX), data memory access (MEM), and register write back(WB). The delay of
each stage is summarized below: IF = 2 ns, ID = 1.5 ns, EX = 4 ns, MEM = 2.5 ns,
WB = 2 ns.
(1) What's the maximum attainable clock rate of this processor?
(2) What kind of instruction sequence will cause a data hazard that cannot be
resolved by forwarding? What's the performance penalty?
(3) To improve on the clock rate of this processor, the architect decided to add
one pipeline stage. The location of the existing pipeline registers cannot be
changed. Where should this pipeline stage be placed? What's the maximum
clock rate of the 6-stage processor? (Assuming there is no delay penalty when
adding pipeline stages)
(4) Repeat the analysis in (2) for the new 6-stage processor. Is there other type(s)
of instruction sequence that will cause a data hazard, and cannot be resolved
by forwarding? Compare the design of 5-stage and 6-stage processor, what
effect does adding one pipeline stage has on data hazard resolution?
Answer:
(1) The slowest stage (EX) has a 4 ns delay, so the maximum clock rate = 1 / (4 × 10^-9) = 250 MHz.
(2) (a) A load instruction immediately followed by an instruction that uses the loaded value causes a data hazard that forwarding cannot resolve.
(b) One stall cycle must be inserted before forwarding can complete the resolution, so the performance penalty is a 1-clock-cycle delay.
(3) (a) The EX stage has the longest delay (4 ns), so it should be split into two 2 ns stages, EX1 and EX2.
(b) The slowest stage is now MEM at 2.5 ns, so the maximum clock rate = 1 / (2.5 × 10^-9) = 400 MHz.
(4) (a) A load-use sequence now stalls for 2 clock cycles, and back-to-back dependent ALU instructions now form a data hazard that forwarding alone cannot resolve, costing 1 stall cycle.
(b) Compared with the 5-stage design, adding a pipeline stage increases the data-hazard penalty.
2. (1) What type of cache misses (compulsory, conflict, capacity) can be reduced by
increasing the cache block size?
(2) Can increasing the degree of cache associativity always reduce the average
memory access time? Explain your answer.
Answer:
(1) Compulsory
(2) No. AMAT = hit time + miss rate × miss penalty. Increasing the degree of cache
associativity may decrease the miss rate but will lengthen the hit time; therefore, the
average memory access time may not necessarily be reduced.
3. List two types of cache write policies. Compare the pros and cons of these two
polices.
Answer:
(1) Write-through: A scheme in which writes always update both the cache and the
memory, ensuring that data is always consistent between the two.
Write-back: A scheme that handles writes by updating values only to the
block in the cache, then writing the modified block to the lower level of the
hierarchy when the block is replaced.
(2)
Policy: Write-through
Pros: simple to implement; the memory always has the most up-to-date copy of every block, so consistency is easy.
Cons: every CPU write goes to memory, so writes run at memory speed and consume memory bandwidth.
Policy: Write-back
Pros: CPU writes complete at cache speed, and multiple writes to the same block cost only one memory write when the block is replaced.
Cons: more complex to implement, and the memory copy can be stale until the modified block is written back.
4. Briefly describe the difference between synchronous and asynchronous bus
transactions.
Answer:
Bus type: Synchronous Bus vs. Asynchronous Bus
Differences: A synchronous bus includes a clock in the control lines and a fixed protocol for communication relative to the clock. An asynchronous bus is not clocked.
Advantages: Synchronous — requires very little logic and can run very fast. Asynchronous — can accommodate a wide range of devices, and can be lengthened without worrying about clock skew.
Disadvantages: Synchronous — every device on the bus must run at the same clock rate, and to avoid clock skew the bus cannot be long if it is fast. Asynchronous — requires a handshaking protocol.
96
1. The following MIPS assembly program tries to copy words from the address in
register $a0 to the address in $a1, counting the number of words copied in
register $v0. The program stops copying when it finds a word equal to 0. You do
not have to preserve the contents of registers $v1, $a0, and $a1. This terminating
word should be copied but not counted.
loop: lw $v1, 0($a0) # read next word from source
addi $v0, $v0, 1 # Increment count words copied
sw $v1, 0($a1) # Write to destination
addi $a0, $a0, 1 # Advance pointer to next word
addi $a1, $a1, 1 # Advance pointer to next word
bne $v1, $zero, loop # Loop if word copied != zero
There are multiple bugs in this MIPS program; fix them and turn in a bug-free
version.
Answer:
addi $v0, $zero, -1     # initialize count to -1 so the terminating word is not counted
Loop: lw $v1, 0($a0)    # read next word from source
addi $v0, $v0, 1        # increment count of words copied
sw $v1, 0($a1)          # write to destination
addi $a0, $a0, 4        # advance source pointer by one word (4 bytes)
addi $a1, $a1, 4        # advance destination pointer by one word (4 bytes)
bne $v1, $zero, Loop    # loop if word copied != zero
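To double-check the corrected program's semantics, here is a small Python model of the loop (the function name and test data are made up for this illustration; list indices stand in for byte addresses, and the count starts at -1 so the copied terminating word is not counted):

```python
def copy_words(src):
    """Python model of the corrected loop: copy words up to and including
    the terminating 0; the returned count excludes the terminator."""
    dst, count, i = [], -1, 0     # count starts at -1: addi $v0, $zero, -1
    while True:
        word = src[i]             # lw   $v1, 0($a0)
        count += 1                # addi $v0, $v0, 1
        dst.append(word)          # sw   $v1, 0($a1)
        i += 1                    # addi $a0/$a1, ..., 4 (one word)
        if word == 0:             # bne  $v1, $zero, Loop
            break
    return dst, count

print(copy_words([7, 3, 5, 0, 99]))   # terminator is copied but not counted
```

Running it on the sample list copies four words (including the 0) but reports a count of 3, matching the problem statement.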
2. Carry lookahead is often used to speed up the addition operation in ALU. For a
4-bit addition with carry lookahead, assuming the two 4-bit inputs are a3a2a1a0
and b3b2b1b0, and the carry-in is c0,
(a) First derive the recursive equations of carry-out ci+1 in terms of ai and bi and
ci, where i = 0, 1,.., 3.
(b) Then by defining the generate (gi) and propagate (pi) signals, express c1, c2,
c3, and c4 in terms of only gi's, pi's, and c0.
(c) Estimate the speedup for this simple 4-bit carry lookahead adder over the
4-bit ripple carry adder (assuming each logic gate introduces T delay).
Answer:
(a) ci+1 = aibi + aici + bici
(b) c1 = g0 + (p0c0)
    c2 = g1 + (p1g0) + (p1p0c0)
    c3 = g2 + (p2g1) + (p2p1g0) + (p2p1p0c0)
    c4 = g3 + (p3g2) + (p3p2g1) + (p3p2p1g0) + (p3p2p1p0c0)
(c) The critical path delay for the 4-bit ripple carry adder = 2T × 4 = 8T. The critical path delay for the 4-bit carry lookahead adder = T + 2T = 3T (T to compute the gi and pi signals, then 2T through the two-level carry logic).
Speedup = 8T/3T = 2.67
(Note: the critical path delay is determined by the carry propagation chain.)
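The carry equations in (b) can be checked exhaustively against the recurrence in (a). The Python sketch below is a verification aid (not part of the original answer); it compares the two formulations for every 4-bit input combination:

```python
from itertools import product

def ripple_carries(a, b, c0):
    """a, b are bit lists [a0, a1, a2, a3]; returns [c1, c2, c3, c4]
    using the recurrence c(i+1) = ai*bi + ai*ci + bi*ci from part (a)."""
    c = [c0]
    for i in range(4):
        c.append(a[i] & b[i] | a[i] & c[i] | b[i] & c[i])
    return c[1:]

def lookahead_carries(a, b, c0):
    """Two-level carry equations from part (b), with gi = ai*bi, pi = ai + bi."""
    g = [a[i] & b[i] for i in range(4)]   # generate signals
    p = [a[i] | b[i] for i in range(4)]   # propagate signals
    c1 = g[0] | p[0] & c0
    c2 = g[1] | p[1] & g[0] | p[1] & p[0] & c0
    c3 = g[2] | p[2] & g[1] | p[2] & p[1] & g[0] | p[2] & p[1] & p[0] & c0
    c4 = (g[3] | p[3] & g[2] | p[3] & p[2] & g[1]
          | p[3] & p[2] & p[1] & g[0] | p[3] & p[2] & p[1] & p[0] & c0)
    return [c1, c2, c3, c4]

assert all(
    ripple_carries(list(bits[:4]), list(bits[4:8]), bits[8])
    == lookahead_carries(list(bits[:4]), list(bits[4:8]), bits[8])
    for bits in product([0, 1], repeat=9)
)
print("lookahead carries match the ripple recurrence for all 512 inputs")
```

Note that using pi = ai OR bi gives the same carries as pi = ai XOR bi, because the case ai = bi = 1 is already covered by gi.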
3. When performing arithmetic addition and subtraction, overflow might occur. Fill
in the blanks in the following table of overflow conditions for addition and
subtraction.
Operation  Operand A  Operand B  Result indicating overflow
A + B      >= 0       >= 0       (a)
A + B      < 0        < 0        (b)
A - B      >= 0       < 0        (c)
A - B      < 0        >= 0       (d)
Prove that the overflow condition can be determined simply by checking to see if
the CarryIn to the most significant bit of the result is not the same as the CarryOut
of the most significant bit of the result.
Answer:
(1)
Operation  Operand A  Operand B  Result indicating overflow
A + B      >= 0       >= 0       (a) < 0
A + B      < 0        < 0        (b) >= 0
A - B      >= 0       < 0        (c) < 0
A - B      < 0        >= 0       (d) >= 0
(2) Build a table that shows all possible combinations of Sign and CarryIn to the
sign bit position and derive the CarryOut, Overflow, and related information.
Thus
Sign A | Sign B | CarryIn | CarryOut | Sign of result | Correct sign | Overflow? | CarryIn XOR CarryOut | Notes
0 0 0 0 0 0 No 0
0 0 1 0 1 0 Yes 1 Carries differ
0 1 0 0 1 1 No 0 |A| < |B|
0 1 1 1 0 0 No 0 |A| > |B|
1 0 0 0 1 1 No 0 |A| > |B|
1 0 1 1 0 0 No 0 |A| < |B|
1 1 0 1 0 1 Yes 1 Carries differ
1 1 1 1 1 1 No 0
From this table an XOR of the CarryIn and CarryOut of the sign bit serves to
detect overflow.
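The table's conclusion can also be verified exhaustively for 4-bit two's-complement addition. The Python sketch below is an illustration (not part of the original solution); it confirms the XOR rule over all 256 operand pairs:

```python
def add4(a, b):
    """4-bit two's-complement add of bit patterns a, b (ints 0..15).
    Returns (true_overflow, CarryIn(MSB) XOR CarryOut(MSB))."""
    carry, cins = 0, []
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        cins.append(carry)                  # carry into bit i
        s = ai + bi + carry
        carry = s >> 1
    cin_msb, cout_msb = cins[3], carry
    sa = a - 16 if a & 8 else a             # signed value of a
    sb = b - 16 if b & 8 else b             # signed value of b
    overflow = not (-8 <= sa + sb <= 7)     # true result out of 4-bit range
    return overflow, bool(cin_msb ^ cout_msb)

assert all(add4(a, b)[0] == add4(a, b)[1] for a in range(16) for b in range(16))
print("overflow == CarryIn XOR CarryOut for all 256 four-bit additions")
```

For example, add4(7, 1) flags overflow (7 + 1 = 8 is out of range) and the carries into and out of the sign bit indeed differ.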
4. Assume all memory addresses are translated to physical addresses before the
cache is accessed. In this case, the cache is physically indexed and physically
tagged. Also assume a TLB is used. (a) Under what circumstance can a memory
reference encounter a TLB miss, a page table hit, and a cache miss? Briefly
explain why. (b) To speed up cache accesses, a processor may index the cache
with virtual addresses. This is called a virtually addressed cache, and it uses tags
that are virtual addresses. However, a problem called aliasing may occur. Explain
what aliasing is and why. (c) In today's computer systems, virtual memory and
cache work together as a hierarchy. When the operating system decides to move a
page back to disk, the contents of that page may have been brought into the cache
already. What should the OS do with the contents that are in the cache?
Answer: (a) Data/instruction is in memory but not in cache and page table has this
mapping but TLB has not.
(b) A situation in which the same object is accessed by two addresses; can occur
in virtual memory when there are two virtual addresses for the same physical
page.
(c) If the contents in the cache are dirty, force them to be written back to memory
and invalidate them in the cache; after that, copy the page back to disk. If they
are clean, simply invalidate them in the cache and copy the page back to disk.
5. The following three instructions are executed using MIPS 5-stage pipeline.
1. lw $2, 20($1)
2. sub $4, $2, $5
3. or $4, $2, $6
Since there is one cycle delay between lw and sub, a hazard detection unit is
required. Furthermore, by the time the hazard is detected, sub and or may have
already been fetched into the pipeline. Therefore it is also required to turn sub
into a nop and delay the execution of sub and or by one cycle as shown below.
1. lw $2, 20($1)
2. nop
3. sub $4, $2, $5
4. or $4, $2, $6
(a) In which stage should the hazard detection unit be placed? Why? (b) How can
you turn sub into a nop in MIPS 5-stage pipeline? (c) How can you prevent sub
and or from making progress and force these two instructions to repeat in the next
clock cycle? (d) Explain why there is one cycle delay between lw and sub.
Answer:
(a) ID: Instruction Decode and register file read stage.
(b) Deassert all nine control signals (zero the EX, MEM, and WB control fields in
the ID/EX pipeline register), turning the instruction into a bubble.
(c) Set both control signals PCWrite and IF/IDWrite to 0 to prevent the PC
register and IF/ID pipeline register from changing.
-
47
(d) As shown in the following diagram, after a 1-cycle stall between lw and sub,
the forwarding logic can handle the dependence and execution proceeds. (If
there were no forwarding, a 2-cycle delay would be needed.)
lw   IF ID EX MEM WB
nop     IF ID EX  MEM WB
sub        IF ID  EX  MEM WB
6. Answer the following questions briefly.
(a) Will addition "0010 + 1110" cause an overflow using the 4-bit two's
complement signed-integer form? (Simply answer yes or no).
(b) What would you get after performing arithmetic right shift by one bit on
1100two?
(c) If one wishes to increase the accuracy of the floating-point numbers that can
be represented, then he/she should increase the size of which part in the
floating-point format?
(d) Name one event, other than branches or jumps, that changes the normal flow
of instruction execution, e.g., by switching to a routine in the operating
system.
Answer:
(a) NO
(b) 1110two
(c) Fraction
(d) Arithmetic overflow
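Answers (a) and (b) can be sanity-checked in a few lines of Python (the helper name is made up; the 4-bit patterns are handled as small ints):

```python
def asr4(x):
    """Arithmetic right shift by 1 bit on a 4-bit pattern (int 0..15):
    the sign bit is replicated into the vacated position."""
    return (x >> 1) | (x & 0b1000)

# (b) arithmetic right shift of 1100 gives 1110 (sign bit preserved)
print(format(asr4(0b1100), '04b'))

# (a) 0010 + 1110 in 4-bit two's complement is 2 + (-2) = 0: no overflow
print(format((0b0010 + 0b1110) & 0b1111, '04b'))
```

The addition wraps to 0000, a valid result of the same sign behavior as the true sum, confirming the "no overflow" answer.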
7. A MIPS instruction takes five stages in a pipelined CPU design: (1) IF:
instruction fetch, (2) ID: instruction decode/register fetch, (3) ALU: execution or
calculate a memory address, (4) MEM: access an operand in data memory, and (5)
WB: write a result back into the register. Label one appropriate stage in which
each of the following actions needs to be executed. (Note that A and B are two
source operands, while ALUOut is the output register of the ALU, PC is the
program counter, IR is the instruction register. MDR is the memory data register,
Memory[k] is the k-th word in the memory, and Reg[k] is the k-th registers in the
register file.)
(a) Reg[IR[20-16]] = MDR;
(b) ALUOut = PC + (sign-extend(IR[15-0]));
(c) MEM
95
1. (1) Can you come up with a MIPS instruction that behaves like a NOP? The
instruction is executed by the pipeline but does not change any state.
(2) In a MIPS computer a main program can use "jal procedure address" to make a
procedure call and the callee can use "jr $ra" to return to the main program.
What is saved in register $ra during this process?
(3) Name and explain the three principal components that can be combined to
yield runtime.
Answer:
(1) sll $zero, $zero, 0
(2) The address of the instruction following the jal (Return address)
(3) Runtime = instruction count × CPI (cycles per instruction) × clock cycle time
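A quick worked example of this equation (all numbers below are hypothetical, chosen only to show the units):

```python
# Runtime = instruction count x CPI x clock cycle time (hypothetical values).
instruction_count = 2_000_000
cpi = 1.5                      # average cycles per instruction
clock_cycle_ns = 4             # i.e. a 250 MHz clock
runtime_ms = instruction_count * cpi * clock_cycle_ns / 1_000_000
print(runtime_ms)              # 12.0 ms
```

Any change that reduces one factor without increasing the others (fewer instructions, lower CPI, or a faster clock) reduces runtime proportionally.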
2. (1) Briefly explain the purpose of having a write buffer in the design of a
write-through cache.
(2) A large cache block tends to decrease the cache miss rate due to better spatial
locality. However, it has been observed that too large a cache block actually
increases the miss rate, especially in a very small cache. Why?
Answer:
(1) After writing the data into the write buffer, the processor can continue
execution without waiting for the memory update to complete. The CPU
performance can thus be increased.
(2) The number of blocks that can be held in the cache will become small, and
there will be a great deal of competition for those blocks. As a result, a block
will be bumped out of the cache before many of its words are accessed.
3. (1) Dynamic branch prediction is often used in today's machine. Consider a loop
branch that branches nine times in a row, and then is not taken once. What is
the prediction accuracy for this branch, assuming a simple 1-bit prediction
scheme is used and the prediction bit for this branch remains in the prediction
buffer? Briefly explain your result.
(2) What is the prediction accuracy if a 2-bit prediction scheme is used? Again
briefly explain your result.
Answer:
(1) The steady-state prediction behavior will mispredict on the first and last loop
iterations. Mispredicting the last iteration is inevitable since the prediction bit
will say taken. The misprediction on the first iteration happens because the bit
is flipped on prior execution of the last iteration of the loop, since the branch
was not taken on that exiting iteration. Thus, the prediction accuracy for this
branch is 80% (two incorrect predictions and eight correct ones).
(2) The prediction accuracy with a 2-bit prediction scheme is 90%, since only the
last loop iteration will be mispredicted.
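Both accuracies can be reproduced with a small simulation. In the sketch below (an illustration, not part of the original answer) each predictor is started in its steady-state condition — not-taken for the 1-bit scheme, weakly-taken for the 2-bit saturating counter — so every loop execution behaves like the steady state described above:

```python
def accuracy(outcomes, two_bit, init):
    """Simulate a 1-bit (state 0/1) or 2-bit saturating-counter (state 0..3)
    branch predictor over a sequence of actual outcomes (True = taken)."""
    state, correct = init, 0
    for taken in outcomes:
        predict = state >= (2 if two_bit else 1)   # predict taken?
        correct += (predict == taken)
        if two_bit:
            state = min(state + 1, 3) if taken else max(state - 1, 0)
        else:
            state = 1 if taken else 0
    return correct / len(outcomes)

# Branch taken nine times in a row, then not taken, over 100 loop executions.
pattern = ([True] * 9 + [False]) * 100
print(accuracy(pattern, two_bit=False, init=0))   # 0.8
print(accuracy(pattern, two_bit=True, init=2))    # 0.9
```

The 1-bit predictor mispredicts the first and last iterations of every loop (80%), while the 2-bit counter survives the single not-taken outcome and only misses the exit (90%).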
4. Answer the following questions briefly:
(1) In a pipelined CPU design, what kind of problem may occur as it executes
instructions corresponding to an if-statement in a C program? Name one
possible scheme to get around this problem more or less.
(2) Consider the possible actions in the Instruction Decode stage of a pipelined
CPU. In addition to setting up the two input operands of ALU, what is the
other possible action? (Hint: consider the execution of a jump instruction)
(3) What is x if the maximum number of memory words you can use in a 32-bit
MIPS machine in a single program is expressed as 2x? (Note: MIPS uses a
byte addressing scheme.)
Answer:
(1) Control hazard.
Solutions: insert nop instructions, use delayed branches, or use branch prediction.
(2) Decode instruction, sign-extend 16 bits immediate constant, jump address
calculation, branch target calculation, register comparison, load-use data
hazard detection.
(3) A single program on a 32-bit MIPS machine can use 256 MB = 2^28 bytes = 2^26
words. So, x = 26.
5. Consider the following flow chart of a sequential multiplier. We assume that the
64-bit multiplicand register is initialized with the 32-bit original multiplicand in
the right half and 0 in the left half. The final result is to be placed in a product
register. Fill in the missing descriptions in blanks A and B.
start
→ Test Multiplier[0]
    Multiplier[0] = 1: Blank A
    Multiplier[0] = 0: (do nothing)
→ Shift the multiplicand register left by 1 bit
→ Blank B
→ 32nd repetition?
    No: repeat from "Test Multiplier[0]"
    Yes (after 32 repetitions): Done
Answer:
Blank A: add Multiplicand to product and place the result in the Product register
Blank B: shift the Multiplier register right 1 bit
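The flow chart's algorithm, with Blank A and Blank B filled in as above, can be sketched in Python (unsigned 32-bit operands assumed; register widths are emulated with masks):

```python
def sequential_multiply(multiplicand, multiplier):
    """32-bit unsigned shift-and-add multiply following the flow chart:
    the multiplicand starts in the right half of a 64-bit register."""
    mask64 = (1 << 64) - 1
    mcand = multiplicand & 0xFFFFFFFF
    mplier = multiplier & 0xFFFFFFFF
    product = 0
    for _ in range(32):                            # 32 repetitions
        if mplier & 1:                             # test Multiplier[0]
            product = (product + mcand) & mask64   # Blank A: add to Product
        mcand = (mcand << 1) & mask64              # shift multiplicand left 1 bit
        mplier >>= 1                               # Blank B: shift multiplier right 1 bit
    return product

print(sequential_multiply(123456, 654321) == 123456 * 654321)   # True
```

Because a 32 × 32-bit unsigned product always fits in 64 bits, the masked result is exact for any pair of 32-bit inputs.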
6. Schedule the following instruction segment into a superscalar pipeline for MIPS.
Assume that the pipeline can execute one ALU or branch instruction and one data
transfer instruction concurrently. For the best, the instruction segment can be
executed in four clock cycles. Fill in the instruction identifiers into the table. Note
that data dependency should be taken into account.
(Identifier) (Instruction)
ln-1 Loop: lw $t0, 0($s1)
ln-2 addu $t0, $t0, $s2
ln-3 sw $t0, 0($s1)
ln-4 addi $s1, $s1, 4
ln-5 bne $s1, $zero, Loop
Clock Cycle ALU or branch instruction Data transfer instruction
1
2
3
4
Answer:
Clock Cycle ALU or branch instruction Data transfer instruction
1 ln-1 (lw)
2 ln-4 (addi)
3 ln-2 (addu)
4 ln-5 (bne) ln-3 (sw)
(Note: because ln-4 now updates $s1 before ln-3 issues, the store's offset must compensate for the pointer update, e.g., sw $t0, -4($s1).)
7. Suppose a computer's address size is k bits (using byte addressing),