Solution: VU Assignment
DESCRIPTION
VU Exam Transcript
CONTENTS
Exam solutions for academic years 96, 95, 94, 93, and 92, with several exam sets per year.
96
1. _____ implements the translation of a program's address space to physical
addresses.
(A) DRAM
(B) Main memory
(C) Physical memory
(D) Virtual memory
Answer: (D)
2. To track whether a page has been written since it was read into memory, a
____ is added to the page table.
(A) valid bit
(B) tag index
(C) dirty bit
(D) reference bit
Answer: (C)
3. (Refer to the CPU architecture of Figure 1 below) Which of the following
statements is correct for a load word (LW) instruction?
(A) MemtoReg should be set to 0 so that the correct ALU output can be sent to
the register file.
(B) MemtoReg should be set to 1 so that the Data Memory output can be sent to
the register file.
(C) We do not care about the setting of MemtoReg. It can be either 0 or 1.
(D) MemWrite should be set to 1.
Answer: (B)
(Figure 1: the single-cycle MIPS datapath, with PC, instruction memory, register file, sign-extend unit, ALU, and data memory, and a control unit driving RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; PCSrc selects the next PC.)
Figure 1
4. The IEEE 754 binary representation of a 32-bit floating-point number is shown
below (normalized single-precision representation with bias = 127).
31 30 ~ 23 22 ~ 0
S exponent fraction
1 bit 8 bits 23 bits
(S) (E) (F)
What is the correct binary representation of (-0.75)10 in IEEE single-precision
floating-point format?
(A) 1011 1111 0100 0000 0000 0000 0000 0000
(B) 1011 1111 1010 0000 0000 0000 0000 0000
(C) 1011 1111 1101 0000 0000 0000 0000 0000
(D) 1011 1110 1000 0000 0000 0000 0000 0000
Answer: (A)
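The encoding can be double-checked with a short Python sketch (not part of the original solution) using the standard struct module:

```python
import struct

def float_to_bits(x: float) -> str:
    """Pack x as an IEEE 754 single and return its 32-bit pattern."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return format(word, "032b")

bits = float_to_bits(-0.75)
# -0.75 = -1.1two x 2^-1, so S = 1, E = -1 + 127 = 126 = 0111 1110, F = 100...0
print(bits[0], bits[1:9], bits[9:])  # 1 01111110 10000000000000000000000
```

Grouping the 32 bits in fours gives 1011 1111 0100 0000 0000 0000 0000 0000, i.e. option (A).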
5. According to Question 4, what is the decimal number represented by the word
below?
Bit position 31 | 30 ~ 23 | 22 ~ 0
Binary value 1 | 10000011 | 0110 0000 0000 0000 0000 000
(A) -10
(B) -11
(C) -22
(D) -44
Answer: (C)
The exponent field 10000011 is 131, so the scale factor is 2^(131-127) = 2^4; the significand is 1.011two = 1.375; the value is -1.375 x 2^4 = -22.
6. Assume that the following assembly code is run on a machine with a 2 GHz clock.
The number of cycles for each assembly instruction is shown in Table 1.
add $t0, $zero, $zero
loop: beq $a1, $zero, finish
add $t0, $t0, $a0
sub $a1, $a1, 1
j loop
finish: addi $t0, $t0, 100
add $v0, $t0, $zero
instruction Cycles
add, addi, sub 1
lw, beq, j 2
Table 1
Assume $a0 = 3, $a1 = 20 at initial time, select the correct value of $v0 at the
final cycle:
(A) 157
(B) 160
(C) 163
(D) 166
Answer: (B)
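As a sanity check, the loop's register-level effect can be simulated with a few lines of Python (a sketch mirroring the assembly, not executing it):

```python
def run(a0: int, a1: int) -> int:
    """Mirror the loop: $t0 accumulates $a0 while $a1 counts down to zero."""
    t0 = 0                  # add $t0, $zero, $zero
    while a1 != 0:          # loop: beq $a1, $zero, finish
        t0 += a0            # add $t0, $t0, $a0
        a1 -= 1             # sub $a1, $a1, 1
    t0 += 100               # finish: addi $t0, $t0, 100
    return t0               # add $v0, $t0, $zero

print(run(3, 20))  # 3 * 20 + 100 = 160, option (B)
```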
7. Following Question 6, calculate the MIPS (million instructions per second)
rating of this assembly code:
(A) 1342
(B) 1344
(C) 1346
(D) 1348
Answer: (B)
MIPS = clock rate / (CPI x 10^6)
     = (instruction count x clock rate) / (clock cycles x 10^6)
     = (84 x 2 x 10^9) / (125 x 10^6)
     = 1344
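The instruction and cycle counts behind this result can be tallied with a small Python sketch (the 20 iterations come from $a1 counting down from 20):

```python
# Cycle costs from Table 1.
CYCLES = {"add": 1, "addi": 1, "sub": 1, "lw": 2, "beq": 2, "j": 2}

iters = 20
# 1 initial add, 20 x (beq, add, sub, j), the final taken beq, then addi and add.
n_instr = 1 + iters * 4 + 1 + 2
n_cycles = (CYCLES["add"]
            + iters * (CYCLES["beq"] + CYCLES["add"] + CYCLES["sub"] + CYCLES["j"])
            + CYCLES["beq"] + CYCLES["addi"] + CYCLES["add"])
mips = 2e9 / ((n_cycles / n_instr) * 1e6)   # clock rate / (CPI x 10^6)
print(n_instr, n_cycles, round(mips))       # 84 125 1344
```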
Questions 8-11. Link the following terms ((1) ~ (4))
(1) Microsoft Word
(2) Operating system
(3) Internet
(4) CD-ROM
to the most related terminology shown below (A, B, C,..., K), choose the most
related one ONLY (answer format: e.g., (1) K, for mapping item (1) to
terminology K).
A Applications software F Personal computer
B High-level programming language G Semiconductor
C Input device H Super computer
D Integrated circuit I Systems software
E Output device K Computer Networking
Please write down the answers in the answer table together with the choice
questions.
8. (1) Microsoft Word
9. (2) Operating system
10. (3) Internet
11. (4) CD-ROM
Answer:
8. (1) Microsoft Word A
9. (2) Operating system I
10. (3) Internet K
11. (4) CD-ROM C
Questions 12-15. Match the memory hierarchy element on the left with the closest
phrase on the right (answer format: e.g., (1) d, for mapping item (1) on the left to
item d on the right):
(1). L1 cache a. A cache for a cache
(2). L2 cache b. A cache for disks
(3). Main memory c. A cache for a main memory
(4). TLB d. A cache for page table entries
Please write down the answers in the answer table together with the choice
questions.
12. (1) L1 cache
13. (2) L2 cache
14. (3) Main memory
15. (4) TLB
Answer:
12. (1) L1 cache a
13. (2) L2 cache c
14. (3) Main memory b
15. (4) TLB d
Questions 41-50. Based on the function of the seven control signals and the datapath
of the MIPS CPU in Figure 1 (the same figure as for Question 28), complete the
settings of the control lines in Table 2 (use 0, 1, and X (don't care) only) for the
two MIPS CPU instructions (beq and add). X (don't care) can help to reduce the
implementation complexity, so you should put X wherever possible.
Instr.  Branch  ALUSrc  RegWrite  RegDst  MemtoReg  MemWrite  MemRead  ALUOp1  ALUOp0
beq     (16)    (17)    (18)      (19)    (20)      (21)      (22)     0       1
add     (23)    (24)    (25)
Table 2
Please write down the answers in the answer table together with the choice
questions.
16. (16) =
17. (17) =
18. (18) =
19. (19) =
20. (20) =
21. (21) =
22. (22) =
23. (23) =
24. (24) =
25. (25) =
Answer:
16. (16) = 1
17. (17) = 0
18. (18) = 0
19. (19) = X
20. (20) = X
21. (21) = 0
22. (22) = 0
23. (23) = 1
24. (24) = 0
25. (25) = 0
95
Questions 1-4. Choose ALL the correct answers for each question. Note that
credit will be given only if all choices are correct.
1. With pipelines:
(A) Increasing the depth of pipelining increases the impact of hazards.
(B) Bypassing is a method to resolve a control hazard.
(C) If a branch is taken, the branch prediction buffer will be updated.
(D) In a static multiple issue scheme, the multiple instructions issued in each clock cycle are fixed by the processor at the beginning of the program execution.
(E) Predication is an approach to guess the outcome of an instruction and to remove the execution dependence.
Answer: (A)
(B) False: bypassing resolves data hazards, not control hazards.
(C) False: the prediction buffer is updated when the guess turns out wrong.
(D) False: the instructions issued together are fixed by the compiler.
(E) False: that describes speculation, not predication.
2. Increasing the degree of associativity of a cache scheme will
(A) Increase the miss rate.
(B) Increase the hit time.
(C) Increase the number of comparators.
(D) Increase the number of tag bits.
(E) Increase the complexity of LRU implementation.
Answer: (B), (C), (D), (E)
(A) False: higher associativity decreases the miss rate.
3. With caching:
(A) A write-through scheme improves the consistency between main memory and the cache.
(B) A split cache applies parallel caches to improve cache speed.
(C) A TLB (translation-lookaside buffer) is a cache on the page table, and could help accessing the virtual addresses faster.
(D) No more than one TLB is allowed in a CPU, to ensure consistency.
(E) A one-way set-associative cache performs the same as a direct-mapped cache.
Answer: (A), (B), (E)
(C) False: the TLB helps access physical addresses faster.
-
9
(D) False, (MIPS R3000 and Pentium 4 have two TLBs)
4. In a Pentium 4 PC,
(A) The DMA mechanism can be applied to delegate responsibility from the CPU.
(B) The AGP bus can be used to connect the MCH (Memory Control Hub) and a graphical output device.
(C) USB 2.0 is a synchronous bus using a handshaking protocol.
(D) The CPU can fetch and translate IA-32 instructions.
(E) The CPU can reduce instruction latency with deep pipelining.
Answer: (A), (B), (D)
(C) False: USB 2.0 is an asynchronous bus.
(E) False: pipelining cannot reduce a single instruction's latency.
5. Examine the following two CPUs, each running the same instruction set.
The first one is a Gallium Arsenide (GaAs) CPU. A 10 cm (about 4 inch) diameter GaAs wafer costs $2000. The manufacturing process creates 4 defects per square cm. The CPU fabricated in this technology is expected to have a clock rate of 1000 MHz, with an average clock cycles per instruction of 2.5 if we assume an infinitely fast memory system. The size of the GaAs CPU is 1.0 cm x 1.0 cm.
The second one is a CMOS CPU. A 20 cm (about 8 inch) diameter CMOS wafer costs $1000 and has 1 defect per square cm. The 1.0 cm x 2.0 cm CPU executes multiple instructions per clock cycle to achieve an average clock cycles per instruction of 0.75, assuming an infinitely fast memory, while achieving a clock rate of 200 MHz. (The CPU is larger because it has on-chip caches and executes multiple instructions per clock cycle.)
Assume the yield-equation exponent alpha for both GaAs and CMOS is 2. Yields for GaAs and CMOS wafers are 0.8 and 0.9 respectively. Most of this information is summarized in the following table:
        Wafer       Wafer  Cost   Defects  Freq.        Die Area   Test Dies
        Diam. (cm)  Yield  ($)    (1/cm2)  (MHz)  CPI   (cm x cm)  (per wafer)
GaAs    10          0.80   2000   3.0      1000   2.5   1.0 x 1.0  4
CMOS    20          0.90   1000   1.8      200    0.75  1.0 x 2.0  4
Hint: Here are two equations that may help:
dies per wafer = pi x (wafer diameter / 2)^2 / die area - pi x wafer diameter / sqrt(2 x die area) - test dies per wafer
die yield = wafer yield x (1 + defects per unit area x die area / alpha)^(-alpha)
(a) Calculate the average execution time for each instruction with an infinitely fast memory. Which CPU is faster, and by what factor?
(b) How many seconds will each CPU take to execute a one-billion-instruction program?
(c) What is the cost of a GaAs die for the CPU? Repeat the calculation for CMOS die. Show your work.
(d) What is the ratio of the cost of the GaAs die to the cost of the CMOS die? (e) Based on the costs and performance ratios of the CPU calculated above, what
is the ratio of cost/performance of the CMOS CPU to the GaAs CPU?
Answer:
(a) Execution time (GaAs) for one instruction = 2.5 x 1 ns = 2.5 ns
Execution time (CMOS) for one instruction = 0.75 x 5 ns = 3.75 ns
The GaAs CPU is faster, by 3.75/2.5 = 1.5 times.
(b) Execution time (GaAs) = 1 x 10^9 x 2.5 ns = 2.5 seconds
Execution time (CMOS) = 1 x 10^9 x 3.75 ns = 3.75 seconds
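The same arithmetic for parts (a) and (b), as a Python sketch:

```python
t_gaas = 2.5 / 1000e6        # CPI / clock rate = 2.5 ns per instruction
t_cmos = 0.75 / 200e6        # 3.75 ns per instruction
speed_ratio = t_cmos / t_gaas   # GaAs is 1.5x faster
prog_gaas = 1e9 * t_gaas        # seconds for a one-billion-instruction program
prog_cmos = 1e9 * t_cmos
print(speed_ratio, prog_gaas, prog_cmos)
```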
(c) GaAs: dies per wafer = 67
die yield = 0.2
Cost of a GaAs CPU die = $2000 / (0.2 x 67) = $149.25
CMOS: dies per wafer = 121
die yield = 0.576
Cost of a CMOS CPU die = $1000 / (0.576 x 121) = $14.35
(d) The cost of a GaAs die is 149.25/14.35 = 10.4 times that of a CMOS die.
(e) The ratio of cost/performance of the CMOS CPU to the GaAs CPU is 10.4/1.5 = 6.93.
6. Given the following 8 possible solutions for a POP or a PUSH operation in a STACK:
(1) Read from Mem(SP), decrement SP
(2) Read from Mem(SP), increment SP
(3) Decrement SP, read from Mem(SP)
(4) Increment SP, read from Mem(SP)
(5) Write to Mem(SP), decrement SP
(6) Write to Mem(SP), increment SP
(7) Decrement SP, write to Mem(SP)
(8) Increment SP, write to Mem(SP)
Choose only ONE of the above solutions for each of the following questions.
(a) Solution of a PUSH operation for a Last Full stack that grows ascending.
(b) Solution of a POP operation for a Next Empty stack that grows ascending.
(c) Solution of a PUSH operation for a Next Empty stack that grows ascending.
(d) Solution of a POP operation for a Last Full stack that grows ascending.
Answer:
(a) (8) (b) (3) (c) (6) (d) (1)
(Diagram: stack pointer (SP) positions for a Last Full stack versus a Next Empty stack, with addresses running from small to big.)
Last Full PUSH: (1) increment SP; (2) write to Mem(SP)
Last Full POP: (1) read from Mem(SP); (2) decrement SP
Next Empty PUSH: (1) write to Mem(SP); (2) increment SP
Next Empty POP: (1) decrement SP; (2) read from Mem(SP)
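The four disciplines can be modeled in a short Python sketch (the class name and starting addresses are illustrative) to confirm that each chosen solution preserves LIFO order on an ascending stack:

```python
class AscendingStack:
    """Stack growing toward higher addresses; memory is a plain dict."""
    def __init__(self, last_full: bool):
        self.mem = {}
        self.sp = -1 if last_full else 0   # arbitrary starting addresses
        self.last_full = last_full

    def push(self, v):
        if self.last_full:       # solution (8): increment SP, write to Mem(SP)
            self.sp += 1
            self.mem[self.sp] = v
        else:                    # solution (6): write to Mem(SP), increment SP
            self.mem[self.sp] = v
            self.sp += 1

    def pop(self):
        if self.last_full:       # solution (1): read from Mem(SP), decrement SP
            v = self.mem[self.sp]
            self.sp -= 1
        else:                    # solution (3): decrement SP, read from Mem(SP)
            self.sp -= 1
            v = self.mem[self.sp]
        return v

for lf in (True, False):
    s = AscendingStack(lf)
    s.push(1); s.push(2)
    print(s.pop(), s.pop())      # 2 1 under either discipline
```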
7. Execute the following Copy loop on a pipelined machine:
Copy: lw $10, 1000($20)
      sw $10, 2000($20)
      addiu $20, $20, -4
      bne $20, $0, Copy
Assume that the machine datapath neither stalls nor forwards on hazards, so you
must add nop instructions.
(a) Rewrite the code inserting as few nop instructions as needed for proper
execution;
(b) Use multi-clock-cycle pipeline diagram to show the correctness of your solution.
Answer: Suppose that a register write and a read of the same register can occur in the same clock cycle.
(a) One nop suffices if the addiu is moved between the lw and the sw; the store offset becomes 2004 because $20 has already been decremented:
Copy: lw $10, 1000($20)
      addiu $20, $20, -4
      nop
      sw $10, 2004($20)
      bne $20, $0, Copy
(b)
         1    2    3    4    5    6    7    8    9    10
lw       IF   ID   EX   MEM  WB
addiu         IF   ID   EX   MEM  WB
nop                IF   ID   EX   MEM  WB
sw                      IF   ID   EX   MEM  WB
bne                          IF   ID   EX   MEM  WB
lw                                IF   ID   EX   MEM  WB
Every register is read three or more instructions after the instruction that writes it (two instructions in between), so with same-cycle register write/read no hazards remain.
8. In a Personal Computer, the optical drive has a rotation speed of 7500 rpm, a
40,000,000 bytes/second transfer rate, and a 60 ms seek time. The drive is served
with a 16 MHz bus that is 16 bits wide.
(a) How long does the drive take to read a random 100,000-byte sector? (b) When transferring the 100,000-byte data, what is the bottleneck?
Answer:
(a) Time = seek + rotational latency + transfer
  = 60 ms + 0.5 x (60/7500) s + 100,000/40,000,000 s
  = 60 ms + 4 ms + 2.5 ms = 66.5 ms
(b) The time for the bus to transfer 100,000 bytes is (100,000 / 2) / (16 x 10^6) s = 3.125 ms.
So, the optical drive is the bottleneck.
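The arithmetic above can be reproduced in Python (a sketch of the same timing model):

```python
seek_ms = 60.0
rotation_ms = 0.5 * (60.0 / 7500) * 1000    # half a revolution at 7500 rpm = 4 ms
transfer_ms = 100_000 / 40_000_000 * 1000   # bytes / (bytes per second) = 2.5 ms
total_ms = seek_ms + rotation_ms + transfer_ms
bus_ms = (100_000 / 2) / 16e6 * 1000        # 16-bit bus: 2 bytes per 16 MHz cycle
print(total_ms, bus_ms)                     # 66.5 3.125
```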
9. A processor has a 16 KB, 4-way set-associative data cache with 32-byte blocks.
(a) What is the number of sets in the L1 cache?
(b) The memory is byte-addressable and addresses are 35 bits long. Show the breakdown of the address into its cache access components.
(c) How many total bytes are required for the cache?
(d) Memory is connected via a 16-bit bus. It takes 100 clock cycles to send a request to memory and to receive a cache block. The cache has a 1-cycle hit time and a 95% hit rate. What is the average memory access time?
(e) 25% of a program's instructions are memory-access instructions. What is the average number of memory-stall cycles per instruction when running this program?
Answer:
(a) 16 KB / (32 x 4) = 128 sets
(b)
tag      index   block offset   byte offset
23 bits  7 bits  3 bits         2 bits
(c) 2^7 x 4 x (1 + 23 + 32 x 8) bits = 140 Kbits = 17.5 KB
(d) Average memory access time = 1 + 0.05 x 100 = 6 clock cycles
(e) (6 - 1) x 1.25 = 6.25 clock cycles
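These answers follow mechanically from the cache parameters; a Python sketch:

```python
cache_bytes = 16 * 1024
block_bytes = 32
ways = 4
sets = cache_bytes // (block_bytes * ways)        # 128 sets
index_bits = sets.bit_length() - 1                # 7
block_offset_bits = 3                             # 8 words per 32-byte block
byte_offset_bits = 2                              # 4 bytes per word
tag_bits = 35 - index_bits - block_offset_bits - byte_offset_bits   # 23
total_bits = sets * ways * (1 + tag_bits + block_bytes * 8)         # valid + tag + data
amat = 1 + 0.05 * 100                             # hit time + miss rate x penalty
stalls = (amat - 1) * (1 + 0.25)                  # 1.25 memory accesses per instruction
print(sets, tag_bits, total_bits, amat, stalls)
```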
94
1. Compare two memory system designs for a classic 5-stage pipelined processor.
Both memory systems have a 4-KB instruction cache. But system A has a
4K-byte data cache, with a miss rate of 10% and a hit time of 1 cycle; and system
B has an 8K-byte data cache, with a miss rate of 5% and a hit time of 2 cycles
(the cache is not pipelined). For both data caches, cache lines hold a single word
(4 bytes), and the miss penalty is 10 cycles. What are the respective average
memory access times for data retrieved by load instructions for the above two
memory system designs, measured in clock cycles?
Answer:
Average memory access time for system A = 1 + 0.1 x 10 = 2 cycles
Average memory access time for system B = 2 + 0.05 x 10 = 2.5 cycles
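The comparison as a two-line Python check (AMAT = hit time + miss rate x miss penalty):

```python
amat_a = 1 + 0.10 * 10   # 4 KB cache: 1-cycle hit, 10% misses, 10-cycle penalty
amat_b = 2 + 0.05 * 10   # 8 KB cache: 2-cycle hit, 5% misses
print(amat_a, amat_b)    # 2.0 2.5
```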
2. (a) Describe at least one clear advantage a Harvard architecture (separate
instruction and data caches) has over a unified cache architecture (a single
cache memory array accessed by a processor to retrieve both instruction and
data)
(b) Describe one clear advantage a unified cache architecture has over the Harvard
architecture
Answer:
(a) Cache bandwidth is higher for a Harvard architecture than for a unified cache, since an instruction fetch and a data access can proceed in parallel.
(b) The hit ratio is higher for a unified cache than for a Harvard architecture, since the single array can allocate capacity between instructions and data as the workload demands.
3. (a) What is RAID?
(b) Match the RAID levels 1, 3, and 5 to the following phrases for the best match.
Use each level only once.
Data and parity striped across multiple disks
Can withstand selective multiple disk failures
Requires only one disk for redundancy
Answer:
(a) An organization of disks that uses an array of small and inexpensive disks so
as to increase both performance and reliability
(b) RAID 5 Data and parity striped across multiple disks
RAID 1 Can withstand selective multiple disk failures
RAID 3 Requires only one disk for redundancy
4. (a) Explain the differences between a write-through policy and a write-back policy.
(b) Tell which policy cannot be used in a virtual memory system, and describe the
reason
Answer:
(a) Write through: always write the data into both the cache and the memory
Write back: updating values only to the block in the cache, then writing the
modified block to the lower level of the hierarchy when the block is replaced
(b) Write-through will not work for virtual memory, since writes take too long.
Instead, virtual memory systems use write-back
5. (a) What is a denormalized number (denorm or subnormal)?
(b) Show how to use gradual underflow to represent a denorm in a floating point
number system.
Answer:
(a) For an IEEE 754 floating point number, if the exponent is all 0s, but the
fraction is non-zero then the value is a denormalized number, which does not
have an assumed leading 1 before the binary point. Thus, this represents a
number (-1)^s x 0.f x 2^(-126), where s is the sign bit and f is the fraction.
(b) Denormalized numbers allow a value to degrade in significance gradually until it
becomes 0; this is called gradual underflow.
For example, the smallest positive single-precision normalized number is
1.0000 0000 0000 0000 0000 000two x 2^(-126)
but the smallest positive single-precision denormalized number is
0.0000 0000 0000 0000 0000 001two x 2^(-126), or 1.0two x 2^(-149)
6. Try to show the following structure in the memory map of a 64-bit Big-Endian
machine, by plotting the answer in a two-row map where each row contains 8
bytes.
struct {
    int a;       // 0x11121314
    char c[7];   // 'A', 'B', 'C', 'D', 'E', 'F', 'G'
    short e;     // 0x2122
} s;
Answer:
offset  0  1  2  3  4  5  6  7
        11 12 13 14 A  B  C  D
offset  8  9  10 11 12 13 14 15
        E  F  G  21 22
(int: 4 bytes; short: 2 bytes; char: 1 byte. The big-endian convention stores the most significant byte at the lowest address, so 0x11121314 appears as 11 12 13 14 and 0x2122 as 21 22.)
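Python's struct module can reproduce this layout; the '>' prefix selects big-endian with no padding, which matches the map above (a real C compiler might insert a padding byte before the short for alignment):

```python
import struct

# int a = 0x11121314, char c[7] = "ABCDEFG", short e = 0x2122
raw = struct.pack(">i7sh", 0x11121314, b"ABCDEFG", 0x2122)
print(raw[:8].hex(" "))   # 11 12 13 14 41 42 43 44  (41..44 are 'A'..'D')
print(raw[8:].hex(" "))   # 45 46 47 21 22           ('E', 'F', 'G', then the short)
```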
7. Assume we have the following 3 ISA styles:
(1) Stack: All operations occur on top of stack where PUSH and POP are the only
instructions that access memory;
(2) Accumulator: All operations occur between an Accumulator and a memory
location;
(3) Load-Store: All operations occur in registers, and register-to-register
instructions use 3 registers per instruction.
(a) For each of the above ISAs, write an assembly code for the following
program segment using LOAD, STORE, PUSH, POP, ADD, and SUB and
other necessary assembly language mnemonics.
{ A = A + C;
  D = A - B; }
(b) Some operations are not commutative (e.g., subtraction). Discuss what are
the advantages and disadvantages of the above 3 ISAs when executing
non-commutative operations.
Answer:
(a)
(1) Stack (2) Accumulator (3) Load-Store
PUSH A LOAD A LOAD R1, A
PUSH C ADD C LOAD R2, C
ADD STORE A ADD R1, R1, R2
POP A SUB B STORE R1, A
PUSH A STORE D LOAD R2, B
PUSH B SUB R1, R1, R2
SUB STORE R1, D
POP D
(b) In the stack and accumulator ISAs the operand order is implicit: one operand is always the top of stack or the accumulator. For a non-commutative operation such as subtraction, the compiler must therefore arrange operands in the required order, possibly inserting extra pushes, loads, or stores, which constrains compile-time instruction scheduling. The load-store ISA names all three registers explicitly, so either operand order can be expressed directly and the compiler keeps full scheduling freedom; the price is longer instructions with more operand fields than in the stack or accumulator ISAs.
8. The program below divides two integers through repeated addition and was
originally written for a non-pipelined architecture. The divide function takes as
its parameter a pointer to the base of an array of three elements, where X is the
first element at 0($a0), Y is the second element at 4($a0), and the result Z is to be
stored into the third element at 8($a0). Line numbers have been added to the left
for use in answering the questions below.
1 DIVIDE: add $t3, $zero, $zero
2 add $t2, $zero, $zero
3 lw $t1, 4($a0)
4 lw $t0, 0($a0)
5 LOOP: beq $t2, $t0, END
6 addi $t3, $t3, 1
7 add $t2, $t2, $t1
8 j LOOP
9 END: sw $t3, 8($a0)
(a) Given a pipelined processor as discussed in the textbook, where will data be
forwarded (answer format, e.g.: Line 10 EX/MEM to Line 11 EX)? Assume that
forwarding is used whenever possible, but that branches have not been
optimized in any way and are resolved in the EX stage.
(b) How many data hazard stalls are needed? Between which instructions should
the stall bubble(s) be introduced (ex. Line 10 and Line 11)? Again, assume
that forwarding is used whenever possible, but that branches have not been
optimized in any way and are resolved in the EX stage.
(c) If X = 6 and Y = 3,
(i) How many times is the body of the loop executed?
(ii) How many times is the branch beq not taken?
(iii) How many times is the branch beq taken?
(d) Rewrite the code assuming delayed branches are used. If it helps, you may
assume that the answer to X/Y is at least 2. Assume that forwarding is used
whenever possible and that branches are resolved in IF/ID. Do not worry
about reducing the number of times through the loop, but arrange the code to
use as few cycles as possible by avoiding stalls and wasted instructions.
Answer:
(a) Line 4 MEM/WB to Line 5 EX (the lw result is forwarded to the beq)
(b) 1 stall is needed, between Line 4 and Line 5
(c) (i) 2 (ii) 2 (iii) 1
(d) DIVIDE: add $t2, $zero, $zero
lw $t0, 0($a0)
add $t3, $zero, $zero
lw $t1, 4($a0)
LOOP: beq $t2, $t0, END
add $t2, $t2, $t1
j LOOP
addi $t3, $t3, 1
END: sw $t3, 8($a0)
93
1. Explain how each of the following six features contributes to the definition of a
RISC machine: (a) Single-cycle operation, (b) Load/Store design, (c) Hardwired
control, (d) Relatively few instructions and addressing modes, (e) Fixed
instruction format, (f) More compile-time effort.
Answer:
(a) Single-cycle operation: most instructions complete in one clock cycle, which keeps throughput high and control simple.
(b) Load/Store design: only load and store instructions access memory; all other operations work on registers, simplifying the datapath and instruction timing.
(c) Hardwired control: control signals come from combinational logic rather than microcode, shortening the cycle time.
(d) Relatively few instructions and addressing modes: simpler decoding and a simpler control unit.
(e) Fixed instruction format: fields sit in fixed positions, so decoding and operand fetch can proceed in parallel.
(f) More compile-time effort: the compiler schedules instructions and manages registers, moving complexity from hardware to software; this combination is what defines a RISC machine.
2. (1) Give an example of structural hazard.
(2) Identify all of the data dependencies in the following code. Show which
dependencies are data hazards and how they can be resolved via
forwarding?
add $2, $5, $4
add $4, $2, $5
sw $5, 100($2)
add $3, $2, $4
Answer:
(1) Consider a datapath with a single memory shared by instructions and data, running:
1 lw $5, 100($2)
2 add $2, $7, $4
3 add $4, $2, $5
4 sw $5, 100($2)
In clock cycle 4, instruction 1 is in its MEM stage while instruction 4 needs its IF stage; both require the single memory in the same cycle, which is a structural hazard.
(2) Number the instructions:
1 add $2, $5, $4
2 add $4, $2, $5
3 sw $5, 100($2)
4 add $3, $2, $4
Register  Data dependencies      Data hazards
$2        (1,2), (1,3), (1,4)    (1,2), (1,3)
$4        (2,4)                  (2,4)
Take the pair (1, 2) for example. We don't need to wait for the first instruction to
complete before trying to resolve the data hazard: as soon as the ALU creates the
sum for the first instruction, we can forward it as an input to the second
instruction.
3. Explain (1) what is a precise interrupt? (2) what does RAID mean? (3) what does
TLB mean?
Answer:
(1) An interrupt or exception that is always associated with the correct instruction
in pipelined computers.
(2) An organization of disks that uses an array of small and inexpensive disks so
as to increase both performance and reliability.
(3) A cache that keeps track of recently used address mappings to avoid an access
to the page table.
4. Consider a 32-byte direct-mapped write-through cache with 8-byte blocks.
Assume the cache updates on write hits and ignores write misses. Complete the
table below for a sequence of memory references which occur from left to right.
(Redraw the table in your answer sheet)
address 00 16 48 08 56 16 08 56 32 00 60
read/write r r r r r r r w w r r
index 0 2
tag 0 0
hit/miss miss
Answer:
Assume 6-bit addresses (enough for the largest reference, 60). A 32-byte direct-mapped cache with 8-byte blocks has 4 blocks, so: block offset = 3 bits [2:0]; index = 2 bits [4:3]; tag = 6 - 3 - 2 = 1 bit [5].
address           tag               index
decimal  binary   binary  decimal  binary  decimal
00 000000 0 0 00 0
16 010000 0 0 10 2
48 110000 1 1 10 2
08 001000 0 0 01 1
56 111000 1 1 11 3
16 010000 0 0 10 2
08 001000 0 0 01 1
56 111000 1 1 11 3
32 100000 1 1 00 0
00 000000 0 0 00 0
60 111100 1 1 11 3
address 00 16 48 08 56 16 08 56 32 00 60
read/write r r r r r r r w w r r
index 0 2 2 1 3 2 1 3 0 0 3
tag 0 0 1 0 1 0 0 1 1 0 1
hit/miss miss miss miss miss miss miss hit hit miss hit hit
Because the cache updates on write hits, the write to 56 (the third reference to it) is a hit and updates the cache. The write to 32 is a miss and is ignored, so the block at index 0 still holds tag 0 and the later read of 00 hits.
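The table can be reproduced mechanically; here is a short Python sketch (not part of the original answer) that replays the reference stream through the cache as specified (write hits update the cache in place, write misses are ignored):

```python
# 4-block direct-mapped cache, 8-byte blocks, write-through, no write-allocate.
refs = [(0, 'r'), (16, 'r'), (48, 'r'), (8, 'r'), (56, 'r'), (16, 'r'),
        (8, 'r'), (56, 'w'), (32, 'w'), (0, 'r'), (60, 'r')]

cache = [None] * 4                  # tag stored per block, None = empty
results = []
for addr, op in refs:
    index = (addr >> 3) & 0b11      # address bits [4:3]
    tag = addr >> 5                 # address bits [7:5]
    hit = cache[index] == tag
    if not hit and op == 'r':       # read miss: allocate the block
        cache[index] = tag
    # a write hit updates the block in place; a write miss is ignored
    results.append('hit' if hit else 'miss')

print(results)
```

The printed list matches the hit/miss row of the table above.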
5. (1) List two Branch Prediction strategies and (2) compare their differences.
Answer:
(1) Static prediction and dynamic prediction.
(2) Static prediction:
(a) The prediction is fixed before the program runs (e.g., always predict taken, or let the compiler decide).
(b) The misprediction penalty is paid whenever the fixed guess is wrong.
(c) It requires little or no prediction hardware.
Dynamic prediction:
(a) The prediction is made at run time using run-time information such as the branch history.
(b) The misprediction penalty is lower on average because the prediction accuracy is higher.
(c) It requires extra hardware such as a branch history table.
6. Explain how the reference bit in a page table entry is used to implement an approximation to the LRU replacement strategy.
Answer:
The operating system periodically clears the reference bits and later records them
so it can determine which pages were touched during a particular time period.
With this usage information, the operating system can select a page that is among
the least recently referenced.
7. Trace Booth's algorithm step by step for the multiplication of 2 × (-6)
Answer:
2ten × (-6ten) = 0010two × 1010two = 1111 0100two = -12ten
Iteration  Step                       Multiplicand  Product
0          Initial values             0010          0000 1010 0
1          00: no operation           0010          0000 1010 0
           Shift right product        0010          0000 0101 0
2          10: prod = prod - Mcand    0010          1110 0101 0
           Shift right product        0010          1111 0010 1
3          01: prod = prod + Mcand    0010          0001 0010 1
           Shift right product        0010          0000 1001 0
4          10: prod = prod - Mcand    0010          1110 1001 0
           Shift right product        0010          1111 0100 1
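The iteration above can be checked with a Python sketch (not from the exam; the register layout and helper name are my own) of 4-bit Booth multiplication using a 9-bit product register (4-bit upper half, 4-bit multiplier, 1 extra bit on the right):

```python
def booth_multiply(mcand, mplier, bits=4):
    """Booth's algorithm on `bits`-bit two's-complement operands."""
    mask = (1 << bits) - 1
    prod = (mplier & mask) << 1              # [0000 | multiplier | 0]
    low = (1 << (bits + 1)) - 1              # mask for the lower 5 bits
    for _ in range(bits):
        pair = prod & 0b11
        if pair == 0b10:                     # 10: subtract multiplicand
            upper = ((prod >> (bits + 1)) - mcand) & mask
            prod = (upper << (bits + 1)) | (prod & low)
        elif pair == 0b01:                   # 01: add multiplicand
            upper = ((prod >> (bits + 1)) + mcand) & mask
            prod = (upper << (bits + 1)) | (prod & low)
        sign = (prod >> (2 * bits)) & 1      # arithmetic shift right by 1
        prod = (prod >> 1) | (sign << (2 * bits))
    result = prod >> 1                       # drop the extra bit
    if result & (1 << (2 * bits - 1)):       # interpret as signed
        result -= 1 << (2 * bits)
    return result

print(booth_multiply(0b0010, 0b1010))        # prints: -12
```

Each intermediate value of `prod` matches a product-column entry in the trace above.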
8. What are the differences between Trap and Interrupt?
Answer:
An interrupt is an event that comes from outside the CPU (the processor), for example from an I/O device requesting service, and it is asynchronous with respect to the running program. A trap is an event raised inside the processor itself, for example arithmetic overflow or a system call, and it is synchronous with instruction execution; a trap is therefore sometimes called an internal interrupt.
92
1. A certain machine with a 10 ns (10 × 10^-9 s) clock period can perform jumps (1 cycle), branches (3 cycles), arithmetic instructions (2 cycles), multiply
instructions (5 cycles), and memory instructions (4 cycles). A certain program has
10% jumps, 10% branches, 50% arithmetic, 10% multiply, and 20% memory
instructions. Answer the following question. Show your derivation in sufficient
detail.
(1) What is the CPI of this program on this machine?
(2) If the program executes 10^9 instructions, what is its execution time?
(3) A 5-cycle multiply-add instruction is implemented that combines an
arithmetic and a multiply instruction. 50% of the multiplies can be turned into
multiply-adds. What is the new CPI?
(4) Following (3) above, if the clock period remains the same, what is the program's new execution time?
Answer:
(1) CPI = 1 × 0.1 + 3 × 0.1 + 2 × 0.5 + 5 × 0.1 + 4 × 0.2 = 2.7
(2) Execution time = 10^9 × 2.7 × 10 ns = 27 s
(3) CPI = (1 × 0.1 + 3 × 0.1 + 2 × 0.45 + 5 × 0.05 + 4 × 0.2 + 5 × 0.05) / (0.1 + 0.1 + 0.45 + 0.05 + 0.2 + 0.05) = 2.6 / 0.95 = 2.74
(4) Execution time = 10^9 × 0.95 × 2.74 × 10 ns = 26.03 s
Note: after fusing, the instruction count is only 95% of the original, so the CPI must be averaged over the new instruction count (hence the division by 0.95).
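A quick numeric check of parts (1) through (4), written as a Python sketch (not part of the original solution):

```python
# Instruction mix as (frequency, cycles); clock period 10 ns, 10^9 instructions.
mix = {"jump": (0.10, 1), "branch": (0.10, 3), "arith": (0.50, 2),
       "mul": (0.10, 5), "mem": (0.20, 4)}
cpi = sum(f * c for f, c in mix.values())
exec_time = 1e9 * cpi * 10e-9                  # seconds

# (3)/(4): half the multiplies fuse with an arithmetic instruction into a
# 5-cycle multiply-add, removing 0.05 of the original instruction count.
new_mix = {"jump": (0.10, 1), "branch": (0.10, 3), "arith": (0.45, 2),
           "mul": (0.05, 5), "mem": (0.20, 4), "muladd": (0.05, 5)}
new_count = sum(f for f, _ in new_mix.values())              # 0.95
new_cpi = sum(f * c for f, c in new_mix.values()) / new_count
new_time = 1e9 * new_count * new_cpi * 10e-9

print(round(cpi, 2), round(exec_time, 2), round(new_cpi, 2), round(new_time, 2))
# prints: 2.7 27.0 2.74 26.0
```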
2. Answer True (O) or False (✗) for each of the following. (NO penalty for wrong
answer.)
(1) Most computers use direct mapped page tables.
(2) Increasing the block size of a cache is likely to take advantage of temporal
locality.
(3) Increasing the page size tends to decrease the size of the page table.
(4) Virtual memory typically uses a write-back strategy, rather than a
write-through strategy.
(5) If the cycle time and the CPI both increase by 10% and the number of
instruction deceases by 20%, then the execution time will remain the same.
(6) A page fault occurs when the page table entry cannot be found in the
translation lookaside buffer.
(7) To store a given amount of data, direct mapped caches are typically smaller
than either set associative or fully associative caches, assuming that the
block size for each cache is the same.
(8) The two's complement of a negative number is always a positive number in the same number format.
(9) A RISC computer will typically require more instructions than a CISC
computer to implement a given program.
(10) Pentium 4 is based on the RISC architecture.
Answer:
(1) ✗ (2) ✗ (3) O (4) O (5) ✗ (6) ✗ (7) O (8) ✗ (9) O (10) ✗
: Modern CPUs like the AthlonXP and Pentium 4 are based on a mixture of RISC and CISC.
3. The average memory access time (AMAT) is defined as
AMAT = hit time + miss_rate × miss_penalty
Answer the following two questions. Show your derivation in sufficient detail.
(1) Find the AMAT of a 100MHz machine, with a miss penalty of 20 cycles, a hit
time of 2 cycles, and a miss rate of 5%.
(2) Suppose doubling the size of the cache decreases the miss rate to 3%, but
causes the hit time to increase to 3 cycles and the miss penalty to increase to 21
cycles. What is the AMAT of the new machine?
Answer:
(1) AMAT = (2 + 0.05 × 20) × 10 ns = 30 ns
(2) AMAT = (3 + 0.03 × 21) × 10 ns = 36.3 ns
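Both computations can be verified with a couple of lines of Python (added here as a sketch, not part of the original answer):

```python
cycle_ns = 10                          # 100 MHz clock, so 10 ns per cycle

amat_old = (2 + 0.05 * 20) * cycle_ns  # hit time + miss rate * miss penalty
amat_new = (3 + 0.03 * 21) * cycle_ns  # after doubling the cache size
print(round(amat_old, 1), round(amat_new, 1))   # prints: 30.0 36.3
```

Despite the lower miss rate, the larger cache is slower here because both the hit time and the miss penalty grew.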
4. If a pipelined processor has 5 stages and takes 100 ns to execute N instructions,
how long will it take to execute 2N instructions, assuming the clock rate is 500
MHz and no pipeline stalls occur?
Answer:
Clock cycle time = 1 / (500 × 10^6) = 2 ns; N + 4 = 100 / 2 = 50, so N = 46
The execution time of 2N instructions = 2 × 46 + 4 = 96 clock cycles = 192 ns
96
1. Answer the following questions briefly:
(a) Typically one CISC instruction, since it is more complex, takes more time to
complete than a RISC instruction. Assume that an application needs N CISC
instructions and 2N RISC instructions, and that one CISC instruction takes an
average 5T ns to complete, and one RISC instruction takes 2T ns. Which
processor has the better performance?
(b) Which of the following processors have a CISC instruction set architecture?
ARM AMD Opteron
Alpha 21164 IBM PowerPC
Intel 80x86 MIPS
Sun UltraSPARC
(c) True & False questions;
(1) There are four types of data hazards; RAR, RAW, WAR, and WAW.
(True or False?)
(2) AMD and Intel recently added 64-bit capability to their processors
because most programs run much faster with 64-bit instructions. (True or
False?)
(3) With a modern processor capable of dynamic instruction scheduling and
out-of-order execution, it is better that the compiler does not optimize
the instruction sequences. (True or False?)
Answer:
(a) CISC time = N × 5T = 5NT ns
RISC time = 2N × 2T = 4NT ns
RISC time < CISC time, so the RISC architecture has better performance.
(b) Intel 80x86, AMD Opteron
(c) (1) False, RAR does not cause data hazard.
(2) False, most programs run much faster with 64-bit processors, not 64-bit
instructions.
(3) False, the compiler still tries to help improve the issue rate by placing the
instructions in a beneficial order.
2. For commercial applications, it is important to keep data on-line and safe in
multiple places.
(a) Suppose we want to backup 100GB of data over the network. How many
hours does it take to send the data by FTP over the Internet? Assume the
average bandwidth between the two places is 1Mbits/sec.
(b) Would it be better if you burn the data onto DVDs and mail the DVDs to the
other site? Suppose it takes 10 minutes to burn a DVD which has 4GB
capacity and the fast delivery service can deliver in 12 hours.
Answer:
(a) 100 GB / 1 Mbit/s = (100 × 1024 × 8) Mbits / 1 Mbit/s = 819,200 seconds ≈ 227.56 hours
(b) (100 GB / 4 GB) × 10 minutes = 250 minutes ≈ 4.17 hours
4.17 + 12 = 16.17 hours < 227.56 hours
So, it is better to burn the data onto DVDs and mail them to the other site.
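A quick check of the arithmetic (a Python sketch, not part of the original answer; 1 GB is taken as 1024 MB, as above):

```python
# 100 GB over a 1 Mbit/s link, versus burning 4 GB DVDs and shipping them.
ftp_hours = 100 * 1024 * 8 / 1 / 3600      # (GB -> Mbit) / (1 Mbit/s), in hours
dvd_hours = (100 / 4) * 10 / 60 + 12       # burn 25 DVDs, then 12 h delivery
print(round(ftp_hours, 2), round(dvd_hours, 2))   # prints: 227.56 16.17
```

This is the classic "never underestimate the bandwidth of a station wagon full of tapes" result: physical shipment wins by more than a factor of ten here.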
3. Suppose we have an application running on a shared-memory multiprocessor.
With one processor, the application runs for 30 minutes.
(a) Suppose the processor clock rate is 2GHz. The average CPI (assuming that all
references hit in the cache) on single processor is 0.5. How many instructions
are executed in the application?
(b) Suppose we want to reduce the run time of the application to 5 minutes with 8
processors. Let's optimistically assume that parallelization adds zero overhead
to the application, i.e. no extra instructions, no extra cache misses, no
communications, etc. What fraction of the application must be executed in
parallel?
(c) Suppose 100% of our application can be executed in parallel. Let's now
consider the communication overhead. Assume the multiprocessor has a 200
ns time to handle reference to a remote memory and processors are stalled on
a remote request. For this application, assume 0.02% of the instructions
involve a remote communication reference, no matter how many processors
are used. How many processors are needed at least to make the run time be
less than 5 minutes?
(d) Following the above question, but let's assume the remote communication
references in the application increases as the number of processors increases.
With N processors, 0.02 × (N - 1)% of the instructions involve a remote
communication reference. How many processors will deliver the maximum
speedup?
Answer:
(a) 30 × 60 seconds = Instruction count × 0.5 × 0.5 ns
Instruction count = 1800 / 0.25 ns = 7200 × 10^9
(b) Suppose the fraction of the application that must be executed in parallel is F.
So, 30 × ((1 - F) + F/8) = 5
F = 20/21 = 0.952
(c) Assume N is the number of processors that will make the run time < 5 minutes:
(30 × 60)/N + 7200 × 10^9 × 0.0002 × 200 ns < 5 × 60, so N > 150
So, at least 150 processors are needed to make the run time < 5 minutes.
(d) Speedup = (30 × 60) / (1800/N + 7200 × 10^9 × 0.0002 × (N - 1) × 200 ns)
= 1800 / (1800/N + 288(N - 1))
Setting the derivative of the denominator to zero: -1800/N^2 + 288 = 0, so N = 2.5
2.5 processors will deliver the maximum speedup.
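The continuous optimum N = 2.5 can be sanity-checked numerically (a Python sketch, not part of the original answer); among whole processor counts, N = 3 gives the shortest run time:

```python
# Run time in seconds as a function of the processor count N:
# 1800/N for compute plus 288*(N-1) for remote-reference stalls.
def run_time(n):
    return 1800 / n + 288 * (n - 1)

best = min(range(1, 20), key=run_time)     # best integer processor count
print(best, run_time(2), run_time(3))      # prints: 3 1188.0 1176.0
```

Note that even at the optimum the run time is far above the 5-minute target, which is why part (c)'s fixed communication rate mattered so much.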
4. Number representation.
(a) What range of integer number can be represented by 16-bit 2's complement
number?
(b) Perform the following 8-bit 2's complement number operation and check
whether arithmetic overflow occurs. Check your answer by converting to
decimal sign-and-magnitude representation.
11010011
11101100
Answer:
(a) -2^15 ~ +(2^15 - 1)
(b) 11010011 - 11101100 = 11010011 + 00010100 = 11100111
check: -45 - (-20) = -45 + 20 = -25
The range for an 8-bit 2's complement number is -2^7 ~ +(2^7 - 1).
So, no overflow.
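The 8-bit arithmetic can be confirmed with a short Python sketch (not part of the original answer; `to_signed8` is a helper name of my own):

```python
def to_signed8(x):
    """Interpret an 8-bit pattern as a two's-complement value."""
    return x - 256 if x & 0x80 else x

a, b = 0b11010011, 0b11101100            # -45 and -20
diff = (a + ((~b + 1) & 0xFF)) & 0xFF    # a - b as addition of the negation
print(format(diff, '08b'), to_signed8(diff))   # prints: 11100111 -25
```

Since -25 lies inside the 8-bit range, the subtraction does not overflow, matching the check above.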
5. Bus
(a) Draw a graph to show the memory hierarchy of a system that consists of CPU,
Cache, Memory and I/O devices. Mark where memory bus and I/O bus is.
(b) Assuming system 1 has a synchronous 32-bit bus with clock rate = 33 MHz
running at 2.5 V. System 2 has a 64-bit bus with clock rate = 66 MHz running
at 1.8V. Assuming the average capacitance on each bus line is 2pF for bus in
system 1. What is the maximum average capacitance allowed for the bus of
system 2 so the peak power dissipation of system 2 bus will not exceed that of
the system 1 bus?
(c) Serial bus protocol such as SATA has gained popularity in recent years. To
design a serial bus that supports the same peak throughput as the bus in
system 2, what is the clock frequency of this serial bus?
Answer:
(a) (Figure: memory hierarchy with the CPU connected to the Cache, the Cache to Memory over the memory bus, and the I/O devices attached through the I/O bus.)
(b) Power dissipation = f × C × V^2
The peak power dissipation for system 1 = 33 × 10^6 × (2 × 10^-12 × 32) × 2.5^2 = 13.2 mW
Suppose the total capacitance of the system 2 bus = C:
66 × 10^6 × C × 1.8^2 < 13.2 mW, so C < 61.73 pF
The maximum total capacitance for the system 2 bus is 61.73 pF.
(c) Since SATA uses a single signal path to transmit data serially (or bit by bit),
the frequency should be designed as 66 MHz × 64 = 4.224 GHz to support the
same peak throughput as the system 2 bus.
Note: in (b), the system 2 bus has 64 lines, so the maximum average capacitance per line is 61.73 pF / 64 ≈ 0.96 pF.
95
PART I:
Please answer the following questions in the format listed below. If you do not follow
the format, you will get zero points for these questions.
1. (1) T or F (2) T or F
(3) T or F
(4) T or F
(5) T or F
2. X = ___ Y = ___
Stall cycles = ___
3. Option ___ is ___ times faster than the old machine
4. 1-bit predictor: ___ 2-bit predictor: ___
1. True & False Questions
(1) If an address translation for a virtual page is present in the TLB, then that
virtual page must be mapped to a physical memory page.
(2) The set index decreases in size as cache associativity is increased (assume
cache size and block size remain the same).
(3) It is impossible to have a TLB hit and a data cache miss for the same data
reference.
(4) An instruction takes less time to execute on a pipelined processor than on a
nonpipelined processor (all other aspects of the processors being the same).
(5) A multi-cycle implementation of the MIPS processor requires that a single
memory be used for both instructions and data.
Answer:
(1) T (2) T (3) F (4) F (5) T
2. Consider the following program:
int A[100]; /* size(int) = 1 word */
for (i = 0; i < 100; i++)
A[i] = A[i] + 1;
The code for this program on a MIPS-like load/store architecture looks as
follows:
ADDI R1, R0, #X
ADDI R2, R0, A ; A is the base address of array A
LOOP: LD R3, 0(R2)
ADDI R3, R3, #1
SD R3, 0(R2)
ADDI R2, R2, #Y
SUBI R1, R1, #1
BNE R1, R0, LOOP
Consider a standard 5-stage MIPS pipeline. Assume that the branch is resolved
during the instruction decode stage, and full bypassing/register forwarding are
implemented. Assume that all memory references hit in the cache and TLBs. The
pipeline does not implement any branch prediction mechanism. What are values
of #X and #Y, and how many stall cycles are in one loop iteration including stalls
caused by the branch instruction?
Answer:
X = 100
Y = 4
Stall cycles = 3: (1) one between LD and ADDI, (2) one between SUBI and BNE, (3) one after BNE for the branch itself.
Since the branch decision is resolved during the ID stage, a stall cycle is needed between SUBI and BNE.
3. Suppose you had a computer that, on average, exhibited the following properties
on the programs that you run:
Instruction miss rate: 2%
Data miss rate: 4%
Percentage of memory instructions: 30%
Miss penalty: 100 cycles
There is no penalty for a cache hit (i.e. the cache can supply the data as fast as the
processor can consume it.) You want to update the computer, and your budget will
allow one of the following:
Option #1: Get a new processor that is twice as fast as your current
computer. The new processor's cache is twice as fast too, so it can keep up with the processor.
Option #2: Get a new memory that is twice as fast.
Which is a better choice? And what is the speedup of the chosen design compared
to the old machine?
Answer:
Option 2 is 4.2/2.6 = 1.62 times faster than the old machine.
Suppose that the base CPI = 1
CPIold = 1 + 0.02 × 100 + 0.04 × 0.3 × 100 = 4.2
CPIopt1 = 0.5 + 0.02 × 100 + 0.04 × 0.3 × 100 = 3.7
CPIopt2 = 1 + 0.02 × 50 + 0.04 × 0.3 × 50 = 2.6
Note: for option #1, both the processor and its cache are twice as fast, so the cycle time is halved. Since execution time = CPI × cycle time × instruction count, expressing the new CPI in units of the old cycle time halves the base-CPI contribution from 1 to 0.5, while the memory stall time is unchanged in absolute terms.
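A numeric check of the three CPI values and the final speedup (a Python sketch, not part of the original solution):

```python
# Memory stall cycles per instruction: instruction misses plus data misses.
stall = 0.02 * 100 + 0.04 * 0.3 * 100      # 3.2 stall cycles
cpi_old = 1 + stall                        # 4.2
cpi_opt1 = 0.5 + stall                     # 3.7, in old-cycle-time units
cpi_opt2 = 1 + stall / 2                   # 2.6, miss penalty halved

print(round(cpi_old, 1), round(cpi_opt1, 1), round(cpi_opt2, 1),
      round(cpi_old / cpi_opt2, 2))        # prints: 4.2 3.7 2.6 1.62
```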
4. The following series of branch outcomes occurs for a single branch in a program.
(T means the branch is taken, N means the branch is not taken).
TTTNNTTT
How many instances of this branch instruction are mis-predicted with a 1-bit and
2-bit local branch predictor, respectively? Assume that the BHT is initialized to
the N state. You may assume that this is the only branch in the program.
Answer:
1-bit predictor: 3 2-bit predictor: 5
Note: the 2-bit answer depends on the FSM used; a saturating-counter FSM gives 5 mispredictions, while another common FSM variant gives 6.
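The counts can be reproduced by simulating both predictors; the sketch below (not part of the original answer) uses a saturating up/down counter, which is the FSM variant that yields 5 mispredictions in the 2-bit case:

```python
# Outcome stream TTTNNTTT; both predictors start in the "not taken" state.
outcomes = [1, 1, 1, 0, 0, 1, 1, 1]        # 1 = taken

def mispredictions(outcomes, nbits):
    """Count mispredictions of an n-bit saturating-counter predictor."""
    counter, top = 0, (1 << nbits) - 1
    miss = 0
    for taken in outcomes:
        predict_taken = counter >= (top + 1) // 2
        miss += predict_taken != taken
        counter = min(counter + 1, top) if taken else max(counter - 1, 0)
    return miss

print(mispredictions(outcomes, 1), mispredictions(outcomes, 2))  # prints: 3 5
```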
PART II:
For the following questions in Part II, please make sure that you summarize all your
answer in the format listed below. The answers are short, such as alphabets, numbers,
or yes/no. You do not have to show your calculations. There is no partial credit for
incorrect answers.
(5a) (5b)
(6a) (6b) (6c)
(7a) (7b) (7c)
(8a) (8b) (8c) (8d) (8e)
(9a) (9b) (9c) (9d) (9e)
5. Consider the following performance measurements for a program:
Measurement Computer A Computer B Computer C
Instruction Count 12 billion 12 billion 10 billion
Clock Rate 4 GHz 3 GHz 2.8 GHz
Cycles Per Instruction 2 1.5 1.4
(5a) Which computer is faster?
(5b) Which computer has the higher MIPS rating?
Answer:
(5a) Computer C
Execution Time for Computer A = 12 × 10^9 × 2 / (4 × 10^9) = 6 seconds
Execution Time for Computer B = 12 × 10^9 × 1.5 / (3 × 10^9) = 6 seconds
Execution Time for Computer C = 10 × 10^9 × 1.4 / (2.8 × 10^9) = 5 seconds
(5b) The MIPS rates for all computers are the same.
MIPS for computer A = 4 × 10^9 / (2 × 10^6) = 2000
MIPS for computer B = 3 × 10^9 / (1.5 × 10^6) = 2000
MIPS for computer C = 2.8 × 10^9 / (1.4 × 10^6) = 2000
6. Consider the following two components in a computer system:
A CPU that sustains 2 billion instructions per second.
A memory backplane bus capable of sustaining a transfer rate of 1000
MB/sec
If the workload consists of 64 KB reads from the disk, and each read operation
takes 200,000 user instructions and 100,000 OS instructions.
(6a) Calculate the maximum I/O rate of CPU.
(6b) Calculate the maximum I/O rate of memory bus.
(6c) Which of the two components is likely to be the bottleneck for I/O?
Answer:
(6a) 6667
(6b) 15625
(6c) CPU
The maximum I/O rate of the CPU = 2 × 10^9 / (200,000 + 100,000) = 6667 per second
The maximum I/O rate of the memory bus = 1000 × 10^6 / (64 × 10^3) = 15625 per second
7. You are going to enhance a computer, and there are two possible improvements:
either make multiply instructions run four times faster than before, or make
memory access instructions run two times faster than before. You repeatedly run
a program that takes 100 seconds to execute. Of this time, 20% is used for
multiplication, 50% for memory access instructions, and 30% for other tasks.
Calculate the speedup:
(7a) Speedup if we improve only multiplication:
(7b) Speedup if we only improve memory access:
(7c) Speedup if both improvements are made:
Answer:
(7a) Speedup = 1 / (0.2/4 + 0.8) = 1.18
(7b) Speedup = 1 / (0.5/2 + 0.5) = 1.33
(7c) Speedup = 1 / (0.2/4 + 0.5/2 + 0.3) = 1.67
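The three speedups follow from Amdahl's law; a quick Python check (not part of the original answer):

```python
# 100-second profile: 20% multiply, 50% memory access, 30% other.
def speedup(mult_factor, mem_factor):
    """Amdahl's law with independent speedups for the two fractions."""
    new_time = 0.2 / mult_factor + 0.5 / mem_factor + 0.3
    return 1 / new_time

print(round(speedup(4, 1), 2),   # multiplication only
      round(speedup(1, 2), 2),   # memory access only
      round(speedup(4, 2), 2))   # both improvements
# prints: 1.18 1.33 1.67
```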
8. Multiprocessor designs have become popular for today's desktop and mobile
computing. Given a 2-way symmetric multiprocessor (SMP) system where both
of one 32-bit word. Let us examine the cache coherence traffic with the following
sequence of activities involving shared data. Assume that all the words already
exist in both caches and are clean. Fill-in the last column (8a)-(8e) in the table to
identify the coherence transactions that should occur on the bus for the sequence.
Step Processor Memory activity Memory address
Transaction
required
(Yes or No)
1 Processor 1 1-word write 100 (8a)
2 Processor 2 1-word write 104 (8b)
3 Processor 1 1-word read 100 (8c)
4 Processor 2 1-word read 104 (8d)
5 Processor 1 1-word read 104 (8e)
Answer:
(8a) Yes
(8b) Yes
(8c) No
(8d) No
(8e) No
9. False sharing can lead to unnecessary bus traffic and delays. Follow the direction
of Question 8, except change the cache coherency policy to write-invalidate and
block size to four words (128-bit). Reveal the coherence transactions on the bus
by filling-in the last column (9a)-(9e) in the table below.
Step Processor Memory activity Memory address
Transaction
required
(Yes or No)
1 Processor 1 1-word write 100 (9a)
2 Processor 2 1-word write 104 (9b)
3 Processor 1 1-word read 100 (9c)
4 Processor 2 1-word read 104 (9d)
5 Processor 1 1-word read 104 (9e)
Answer:
(9a) Yes
(9b) Yes
(9c) Yes
(9d) No
(9e) No
Note: under some variants of the snoopy protocol, the answer to (9d) would be Yes.
94
1. Suppose we have a 32 bit MIPS-like RISC processor with the following
arithmetic and logical instructions (along with their descriptions):
Addition
add rd, rs, rt Put the sum of registers rs and rt into register rd.
Addition immediate
add rt, rs, imm Put the sum of register rs and the sign-extended immediate into register rt.
Subtract
sub rd, rs, rt Register rt is subtracted from register rs and the result is put in register rd.
AND
and rd, rs, rt Put the logical AND of registers rs and rt into register rd.
AND immediate
and rt, rs, imm Put the logical AND of register rs and the zero-extended immediate into register rt.
Shift left logical
sll rd, rt, imm Shift the value in register rt left by the distance (i.e. the number of bits) indicated by the immediate (imm) and
put the result in register rd. The vacated bits are filled
with zeros.
Shift right logical
srl rd, rt, imm Shift the value in register rt right by the distance (i.e. the number of bits) indicated by the immediate (imm) and
put the result in register rd. The vacated bits are filled
with zeros.
Please use at most one instruction to generate assembly code for each of the
following C statements (assuming variable a and b are unsigned integers). You
can use the variable names as the register names in your assembly code.
(a) b = a / 8; /* division operation */
(b) b = a % 16; /* modulus operation */
Answer:
(a) srl b, a, 3
(b) and b, a, 15
For example, if a = 10010011, then a % 16 keeps only the low four bits (1001|0011 → 0011), and a / 8 drops the low three bits (10010|011 → 00010010).
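A two-line check (not part of the original answer) that the shift and mask really match unsigned division and modulus:

```python
a = 0b10010011          # the example value above (147)
q = a >> 3              # srl b, a, 3   (b = a / 8)
r = a & 15              # and b, a, 15  (b = a % 16)
print(q == a // 8, r == a % 16)   # prints: True True
```

This works because 8 and 16 are powers of two and a is unsigned; for signed values the shift would instead round toward negative infinity.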
2. Assume a RISC processor has a five-stage pipeline (as shown below) with each
stage taking one clock cycle to finish. The pipeline will stall when encountering
data hazards.
IF ID EXE MEM WB
IF: Instruction fetch
ID: Instruction decode and register file read
EXE: Execution or address calculation
MEM: Data memory access
WB: Write back to register file
(a) Suppose we have an add instruction followed immediately by a subtract
instruction that uses the add instruction's result:
add r1, r2, r3
sub r5, r1, r4
If there is no forwarding in the pipeline, how many cycle(s) will the pipeline
stall for?
(b) If we want to use forwarding (or bypassing) to avoid the pipeline stall caused
by the code sequence above, choosing from the denoted 6 points (A to F) in
the following simplified data path of the pipeline, where (from which point to
which point) should the forwarding path be connected?
(c) Suppose the first instruction of the above code sequence is a load of r1 instead
of an add (as shown below).
load r1, [r2]
sub r5, r1, r4
Assuming we have a forwarding path from point E to point C in the pipeline
data path, will there be any pipeline stall for this code sequence? If so, how
many cycle(s)? (If your first answer is yes, you have to answer the second
question correctly to get the 5 pts credit.)
Answer:
(a) If the register file can be written in the first half of a clock cycle and read in the second half, the pipeline stalls for 2 cycles; if it cannot, the pipeline stalls for 3 clock cycles.
(b) D to C
(c) Yes, 1 clock cycle
3. Cache misses are classified into three categories-compulsory, capacity, and conflict. What types of misses could be reduced if the cache block size is
increased?
Answer: compulsory
(Figure: simplified pipeline datapath with forwarding points A through F marked across the IF, ID, EXE, MEM, and WB stages.)
4. Consider three types of methods for transferring data between an I/O device and
memory: polling, interrupt driven, and DMA. Rank the three techniques in terms
of lowest impact on processor utilization.
Answer: (1) DMA, (2) Interrupt driven, (3) Polling
5. Assume an instruction set that contains 5 types of instructions: load, store,
R-format, branch and jump. Execution of these instructions can be broken into 5
steps: instruction fetch, register read, ALU operations, data access, and register
write. Table 1 lists the latency of each step assuming perfect caches.
Instruction  Instruction  Register  ALU        Data    Register
class        fetch        read      operation  access  write
Load         2 ns         1 ns      1 ns       2 ns    1 ns
Store        2 ns         1 ns      1 ns       2 ns
R-format     2 ns         1 ns      1 ns               1 ns
Branch       2 ns         1 ns      1 ns
Jump         2 ns
Table 1
(a) What is the CPU cycle time assuming a multicycle CPU implementation (i.e., each step in Table 1 takes one cycle)?
(b) Assuming the instruction mix shown below, what is the average CPI of the multicycle processor without pipelining? Assume that the I-cache and
D-cache miss rates are 3% and 10%, and the cache miss penalty is 12 CPU
cycles.
Instruction Type Frequency
Load 40%
Store 30%
R-format 15%
Branch 10%
Jump 5%
(c) To reduce the cache miss rate, the architecture team is considering increasing the data cache size. They find that by doubling the data cache size, they can
eliminate half of data cache misses. However, the data access stage now takes
4 ns. Do you suggest them to double the data cache size? Explain your
answer.
Answer:
(a) In a multicycle implementation, the CPU cycle time is determined by the slowest step, so the CPU cycle time is 2 ns.
(b) CPI without considering cache misses = 5 × 0.4 + 4 × 0.3 + 4 × 0.15 + 3 × 0.1 + 1 × 0.05 = 4.15
Average CPI = 4.15 + 0.03 × 12 + (0.3 + 0.4) × 0.1 × 12 = 5.35
(c) With the doubled data cache the cycle time becomes 4 ns, so the 24 ns miss penalty is now 6 CPU cycles.
CPI after doubling the data cache = 4.15 + 0.03 × 6 + (0.3 + 0.4) × 0.05 × 6 = 4.54
Average instruction execution time before doubling the data cache = 5.35 × 2 ns = 10.7 ns
Average instruction execution time after doubling the data cache = 4.54 × 4 ns = 18.16 ns
Since doubling the data cache increases the average instruction execution time, we do not suggest doubling the data cache.
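The CPI arithmetic in (b) and (c) can be verified with a short Python sketch (not part of the original answer):

```python
# Instruction frequencies and multicycle step counts from Table 1.
freq = {"load": 0.40, "store": 0.30, "rfmt": 0.15, "branch": 0.10, "jump": 0.05}
cycles = {"load": 5, "store": 4, "rfmt": 4, "branch": 3, "jump": 1}
base = sum(freq[k] * cycles[k] for k in freq)               # 4.15

# Memory references: every fetch (3% miss) plus loads/stores (10% then 5%).
cpi_before = base + 0.03 * 12 + (0.40 + 0.30) * 0.10 * 12   # 5.35
cpi_after = base + 0.03 * 6 + (0.40 + 0.30) * 0.05 * 6      # 4.54

# Average instruction time in ns: cycle time 2 ns before, 4 ns after.
print(round(cpi_before * 2, 2), round(cpi_after * 4, 2))    # prints: 10.7 18.16
```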
93
1. Consider a system with an average memory access time of 50 nanoseconds and a three-level page table (meta-directory, directory, and page table). For full credit,
your answer must be a single number and not a formula.
(a) If the system had an average page fault rate of 0.01% for any page accessed (data or page table related), and an average page fault took 1 millisecond to
service, what is the effective memory access time (assume no TLB or memory
cache)?
(b) Now assume the system has no page faults, we are considering adding a TLB that will take 1 nanosecond to lookup an address translation. What hit rate in
the TLB is required to reduce the effective access time to memory by a factor
of 2.5?
Answer:
(a) With no page faults, each memory access requires 4 references (meta-directory, directory, page table, and the data itself), so the effective memory access time = 4 × 50 = 200 ns. With a page fault rate of 0.01% per reference, the effective memory access time = 200 + 4 × 0.01% × 1,000,000 ns = 600 ns.
(b) The target is 200 / 2.5 = 80 ns. With the TLB: 80 ns = 1 ns + 50 ns + 150 ns × (1 - H), so H = 0.81
2. In this problem set, show your answers in the following format:
? CPU cycles
Derive your answer.
CPI = ?
Derive your answer.
Machine ? is ?% faster than ?
Derive your answer.
? CPU cycles
Derive your answer.
Both machine A and B contain one-level on-chip caches. The CPU clock rates
and cache configurations for these two machines are shown in Table 1. The
respective instruction/data cache miss rates in executing program P are also
shown in Table 1. The frequency of load/store instructions in program P is 20%.
On a cache miss, the CPU stalls until the whole cache block is fetched from the
main memory. The memory and bus system have the following characteristics:
1. the bus and memory support 16-byte block transfer;
2. 32-bit synchronous bus clocked at 200 MHz, with each 32-bit transfer taking
1 bus clock cycle, and 1 bus clock cycle required to send an address to
memory (assuming shared address and data lines);
3. assuming there is no cycle needed between each bus operation;
4. a memory access time for the first 4 words (16 bytes) is 250 ns, each
additional set of four words can be read in 25 ns. Assume that a bus transfer
of the most recently read data and a read of the next four words can be
overlapped.
Machine A Machine B
CPU clock rate 800 MHz 400 MHz
I-cache
configuration
Direct-mapped,
32-byte block, 8K
2-way, 32-byte block,
128K
D-cache
configuration
2-way, 32-byte block,
16K
4-way, 32-byte block,
256K
I-cache miss rate 6% 1%
D-cache miss rate 15% 4%
Table 1
To answer the following questions, you don't need to consider the time required
for writing data to the main memory:
(1) What is the data cache miss penalty (in CPU cycles) for machine A?
(2) What is the average CPI (Cycle per Instruction) for machine A in executing
program P? The CPI (Cycle per Instruction) is 1 without cache misses.
(3) Which machine is faster in executing program P and by how much? The CPI
(Cycle per Instruction) is 1 without cache misses for both machine A and B.
(4) What is the data cache miss penalty (in CPU cycles) for machine A if the bus
and memory system support 32-byte block transfer? All the other memory/bus
parameters remain the same as defined above.
Answer:
(a) 440 CPU cycles
Since the bus clock rate = 200 MHz, the cycle time for a bus clock = 5 ns.
The time to transfer one 32-byte block from memory to cache = 2 × (1 + 250/5 + 1 × 4) × 5 ns = 550 ns
The data miss penalty for machine A = 550 ns / (1 / 800 MHz) = 440 CPU cycles
(b) CPI = 40.6
Average CPI = 1 + 0.06 × 440 + 0.2 × 0.15 × 440 = 40.6
(c) Machine B is 409% faster than A.
Machines A and B have the same cache block size (32 bytes), so the miss penalty for machine B is also 550 ns.
Since machine B's clock rate is 400 MHz, machine B's miss penalty = 220 clock cycles.
Machine B's average CPI = 1 + 0.01 × 220 + 0.2 × 0.04 × 220 = 4.96
Execution time for machine A = 40.6 × 1.25 ns × IC = 50.75 IC
Execution time for machine B = 4.96 × 2.5 ns × IC = 12.4 IC
Machine B is 50.75 / 12.4 = 4.09 times faster than machine A.
(d) 240 CPU cycles
The time to transfer one data block from memory to cache = (1 + 250/5 + 25/5 + 4) × 5 ns = 300 ns
The data miss penalty for machine A = 300ns / (1/800MHz) = 240 CPU cycles
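The four results can be checked together with a Python sketch (not part of the original answer):

```python
# Bus cycle 5 ns; a 32-byte block needs two 16-byte memory transfers, each
# costing 1 address cycle + 250 ns access + 4 transfer cycles.
bus_cycle = 5                                   # ns
block_16B = (1 + 250 // 5 + 4) * bus_cycle      # 275 ns per 16-byte transfer

penalty_a = 2 * block_16B / 1.25                # machine A cycles (800 MHz)
cpi_a = 1 + 0.06 * penalty_a + 0.2 * 0.15 * penalty_a
penalty_b = 2 * block_16B / 2.5                 # machine B cycles (400 MHz)
cpi_b = 1 + 0.01 * penalty_b + 0.2 * 0.04 * penalty_b

print(penalty_a, round(cpi_a, 1), round(cpi_b, 2),
      round(cpi_a * 1.25 / (cpi_b * 2.5), 2))
# prints: 440.0 40.6 4.96 4.09
```

Machine B wins despite its slower clock because its much larger caches cut the miss rates dramatically.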
3. Given the bit pattern 10010011, what does it represent assuming
(a) Its a twos complement integer? (b) Its an unsigned integer?
Write down your answer in decimal format.
Answer:
(a) -2^7 + 2^4 + 2^1 + 2^0 = -109
(b) 2^7 + 2^4 + 2^1 + 2^0 = 147
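A quick Python check (not part of the original answer) of both interpretations:

```python
x = 0b10010011
unsigned = x                            # plain binary weighting
signed = x - 256 if x & 0x80 else x     # two's-complement reinterpretation
print(signed, unsigned)                 # prints: -109 147
```

Note that the two readings always differ by 2^8 = 256 whenever the top bit is set.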
4. Draw the schematic for a 4-bit 2's complement adder/subtractor that produces A + B if K = 1 and A - B if K = 0. In your design try to use a minimum number of the following basic logic gates (1-bit adders, AND, OR, INV, and XOR).
Answer:
K = 0: S = A + (B XOR 1) + 1 = A + (NOT B) + 1 = A - B
K = 1: S = A + (B XOR 0) + 0 = A + B
(Each bit of B passes through an XOR gate driven by NOT K, and NOT K also feeds the carry-in.)
5. We want to add four 4-bit numbers, A[3:0], B[3:0], C[3:0], D[3:0], together using
carry-save addition. Draw the schematic using 1-bit full adders.
Answer:
(Figure: schematic built from 1-bit full adders with inputs a3 b3 ... a0 b0, control input K, sum outputs s3..s0, and carry-out c4; the drawing itself is not recoverable from this transcript.)
6. We have an 8-bit carry-ripple adder that is too slow. We want to speed it up by
adding one pipeline stage. Draw the schematic of the resulting pipelined adder.
How many 1-bit pipeline registers do you need? Assuming the delay of a 1-bit adder
is 1 ns, what's the maximum clock frequency the resulting pipelined adder can operate at?
Answer:
(1) schematic
(2) 13 1-bit pipeline registers
(3) 1/4ns = 250MHz
(Figure: an 8-bit ripple-carry adder built from eight 1-bit full adders, with inputs a0..a7 and b0..b7, carry-in c0, and outputs s0..s7 and carry-out c8; the pipelined version cuts the chain after the fourth adder and inserts pipeline registers between the two 4-bit halves.)
92
1. A pipelined processor architecture consists of 5 pipeline stages: instruction fetch
(IF), instruction decode and register read (ID), execution or address calculate
(EX), data memory access (MEM), and register write back(WB). The delay of
each stage is summarized below: IF = 2 ns, ID = 1.5 ns, EX = 4 ns, MEM = 2.5 ns,
WB = 2 ns.
(1) What's the maximum attainable clock rate of this processor?
(2) What kind of instruction sequence will cause a data hazard that cannot be
resolved by forwarding? What's the performance penalty?
(3) To improve on the clock rate of this processor, the architect decided to add
one pipeline stage. The location of the existing pipeline registers cannot be
changed. Where should this pipeline stage be placed? What's the maximum
clock rate of the 6-stage processor? (Assuming there is no delay penalty when
adding pipeline stages)
(4) Repeat the analysis in (2) for the new 6-stage processor. Is there other type(s)
of instruction sequence that will cause a data hazard, and cannot be resolved
by forwarding? Compare the design of 5-stage and 6-stage processor, what
effect does adding one pipeline stage has on data hazard resolution?
Answer:
(1) The slowest stage (EX) has a 4 ns delay, so the maximum clock rate = 1 / (4 × 10^-9) = 250 MHz.
(2) (a) A load instruction immediately followed by an instruction that uses the loaded value causes a data hazard that forwarding cannot resolve.
(b) One stall cycle must be inserted before forwarding can complete the resolution, so the performance penalty is a 1-clock-cycle delay.
(3) (a) The EX stage has the longest delay (4 ns), so it should be split into two 2 ns stages, EX1 and EX2.
(b) The slowest stage is now MEM at 2.5 ns, so the maximum clock rate = 1 / (2.5 × 10^-9) = 400 MHz.
(4) (a) A load-use sequence now stalls for 2 clock cycles, and back-to-back dependent ALU instructions now form a data hazard that forwarding alone cannot resolve, costing 1 stall cycle.
(b) Compared with the 5-stage design, adding a pipeline stage increases the data-hazard penalty.
2. (1) What type of cache misses (compulsory, conflict, capacity) can be reduced by
increasing the cache block size?
(2) Can increasing the degree of cache associativity always reduce the average
memory access time? Explain your answer.
Answer:
(1) Compulsory
(2) No. AMAT = hit time + miss rate × miss penalty. Increasing the degree of cache
associativity may decrease the miss rate but will lengthen the hit time; therefore, the
average memory access time may not necessarily be reduced.
3. List two types of cache write policies. Compare the pros and cons of these two
polices.
Answer:
(1) Write-through: A scheme in which writes always update both the cache and the
memory, ensuring that data is always consistent between the two.
Write-back: A scheme that handles writes by updating values only to the
block in the cache, then writing the modified block to the lower level of the
hierarchy when the block is replaced.
(2)
Policy: Write-through
Pros: simple to implement; the memory always has the most up-to-date copy of every block, so consistency is easy.
Cons: every CPU write goes to memory, so writes run at memory speed and consume memory bandwidth.
Policy: Write-back
Pros: CPU writes complete at cache speed, and multiple writes to the same block cost only one memory write when the block is replaced.
Cons: more complex to implement, and the memory copy can be stale until the modified block is written back.
4. Briefly describe the difference between synchronous and asynchronous bus
transactions.
Answer:
Bus type: Synchronous Bus vs. Asynchronous Bus
Differences: A synchronous bus includes a clock in the control lines and a fixed protocol for communication relative to the clock. An asynchronous bus is not clocked.
Advantages: Synchronous — requires very little logic and can run very fast. Asynchronous — can accommodate a wide range of devices, and can be lengthened without worrying about clock skew.
Disadvantages: Synchronous — every device on the bus must run at the same clock rate, and to avoid clock skew the bus cannot be long if it is fast. Asynchronous — requires a handshaking protocol.
96
1. The following MIPS assembly program tries to copy words from the address in
register $a0 to the address in $a1, counting the number of words copied in
register $v0. The program stops copying when it finds a word equal to 0. You do
not have to preserve the contents of registers $v1, $a0, and $a1. This terminating
word should be copied but not counted.
loop: lw $v1, 0($a0) # read next word from source
addi $v0, $v0, 1 # Increment count words copied
sw $v1, 0($a1) # Write to destination
addi $a0, $a0, 1 # Advance pointer to next word
addi $a1, $a1, 1 # Advance pointer to next word
bne $v1, $zero, loop # Loop if word copied != zero
There are multiple bugs in this MIPS program; fix them and turn in a bug-free
version.
Answer:
addi $v0, $zero, -1     # initialize count to -1 so the terminating word is not counted
Loop: lw $v1, 0($a0)    # read next word from source
addi $v0, $v0, 1        # increment count of words copied
sw $v1, 0($a1)          # write to destination
addi $a0, $a0, 4        # advance source pointer by one word (4 bytes)
addi $a1, $a1, 4        # advance destination pointer by one word (4 bytes)
bne $v1, $zero, Loop    # loop if word copied != zero
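To double-check the corrected program's semantics, here is a small Python model of the loop (the function name and test data are made up for this illustration; list indices stand in for byte addresses, and the count starts at -1 so the copied terminating word is not counted):

```python
def copy_words(src):
    """Python model of the corrected loop: copy words up to and including
    the terminating 0; the returned count excludes the terminator."""
    dst, count, i = [], -1, 0     # count starts at -1: addi $v0, $zero, -1
    while True:
        word = src[i]             # lw   $v1, 0($a0)
        count += 1                # addi $v0, $v0, 1
        dst.append(word)          # sw   $v1, 0($a1)
        i += 1                    # addi $a0/$a1, ..., 4 (one word)
        if word == 0:             # bne  $v1, $zero, Loop
            break
    return dst, count

print(copy_words([7, 3, 5, 0, 99]))   # terminator is copied but not counted
```

Running it on the sample list copies four words (including the 0) but reports a count of 3, matching the problem statement.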
2. Carry lookahead is often used to speed up the addition operation in ALU. For a
4-bit addition with carry lookahead, assuming the two 4-bit inputs are a3a2a1a0
and b3b2b1b0, and the carry-in is c0,
(a) First derive the recursive equations of carry-out ci+1 in terms of ai and bi and
ci, where i = 0, 1,.., 3.
(b) Then by defining the generate (gi) and propagate (pi) signals, express c1, c2,
c3, and c4 in terms of only gi's, pi's, and c0.
(c) Estimate the speedup for this simple 4-bit carry lookahead adder over the
4-bit ripple carry adder (assuming each logic gate introduces T delay).
Answer:
(a) ci+1 = aibi + aici + bici
(b) c1 = g0 + (p0c0)
    c2 = g1 + (p1g0) + (p1p0c0)
    c3 = g2 + (p2g1) + (p2p1g0) + (p2p1p0c0)
    c4 = g3 + (p3g2) + (p3p2g1) + (p3p2p1g0) + (p3p2p1p0c0)
(c) The critical path delay for the 4-bit ripple carry adder = 2T × 4 = 8T. The critical path delay for the 4-bit carry lookahead adder = T + 2T = 3T (T to compute the gi and pi signals, then 2T through the two-level carry logic).
Speedup = 8T/3T = 2.67
(Note: the critical path delay is determined by the carry propagation chain.)
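The carry equations in (b) can be checked exhaustively against the recurrence in (a). The Python sketch below is a verification aid (not part of the original answer); it compares the two formulations for every 4-bit input combination:

```python
from itertools import product

def ripple_carries(a, b, c0):
    """a, b are bit lists [a0, a1, a2, a3]; returns [c1, c2, c3, c4]
    using the recurrence c(i+1) = ai*bi + ai*ci + bi*ci from part (a)."""
    c = [c0]
    for i in range(4):
        c.append(a[i] & b[i] | a[i] & c[i] | b[i] & c[i])
    return c[1:]

def lookahead_carries(a, b, c0):
    """Two-level carry equations from part (b), with gi = ai*bi, pi = ai + bi."""
    g = [a[i] & b[i] for i in range(4)]   # generate signals
    p = [a[i] | b[i] for i in range(4)]   # propagate signals
    c1 = g[0] | p[0] & c0
    c2 = g[1] | p[1] & g[0] | p[1] & p[0] & c0
    c3 = g[2] | p[2] & g[1] | p[2] & p[1] & g[0] | p[2] & p[1] & p[0] & c0
    c4 = (g[3] | p[3] & g[2] | p[3] & p[2] & g[1]
          | p[3] & p[2] & p[1] & g[0] | p[3] & p[2] & p[1] & p[0] & c0)
    return [c1, c2, c3, c4]

assert all(
    ripple_carries(list(bits[:4]), list(bits[4:8]), bits[8])
    == lookahead_carries(list(bits[:4]), list(bits[4:8]), bits[8])
    for bits in product([0, 1], repeat=9)
)
print("lookahead carries match the ripple recurrence for all 512 inputs")
```

Note that using pi = ai OR bi gives the same carries as pi = ai XOR bi, because the case ai = bi = 1 is already covered by gi.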
3. When performing arithmetic addition and subtraction, overflow might occur. Fill
in the blanks in the following table of overflow conditions for addition and
subtraction.
Operation  Operand A  Operand B  Result indicating overflow
A + B      >= 0       >= 0       (a)
A + B      < 0        < 0        (b)
A - B      >= 0       < 0        (c)
A - B      < 0        >= 0       (d)
Prove that the overflow condition can be determined simply by checking to see if
the CarryIn to the most significant bit of the result is not the same as the CarryOut
of the most significant bit of the result.
Answer:
(1)
Operation  Operand A  Operand B  Result indicating overflow
A + B      >= 0       >= 0       (a) < 0
A + B      < 0        < 0        (b) >= 0
A - B      >= 0       < 0        (c) < 0
A - B      < 0        >= 0       (d) >= 0
(2) Build a table that shows all possible combinations of Sign and CarryIn to the
sign bit position and derive the CarryOut, Overflow, and related information.
Thus
Sign A | Sign B | CarryIn | CarryOut | Sign of result | Correct sign | Overflow? | CarryIn XOR CarryOut | Notes
0 0 0 0 0 0 No 0
0 0 1 0 1 0 Yes 1 Carries differ
0 1 0 0 1 1 No 0 |A| < |B|
0 1 1 1 0 0 No 0 |A| > |B|
1 0 0 0 1 1 No 0 |A| > |B|
1 0 1 1 0 0 No 0 |A| < |B|
1 1 0 1 0 1 Yes 1 Carries differ
1 1 1 1 1 1 No 0
From this table an XOR of the CarryIn and CarryOut of the sign bit serves to
detect overflow.
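The table's conclusion can also be verified exhaustively for 4-bit two's-complement addition. The Python sketch below is an illustration (not part of the original solution); it confirms the XOR rule over all 256 operand pairs:

```python
def add4(a, b):
    """4-bit two's-complement add of bit patterns a, b (ints 0..15).
    Returns (true_overflow, CarryIn(MSB) XOR CarryOut(MSB))."""
    carry, cins = 0, []
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        cins.append(carry)                  # carry into bit i
        s = ai + bi + carry
        carry = s >> 1
    cin_msb, cout_msb = cins[3], carry
    sa = a - 16 if a & 8 else a             # signed value of a
    sb = b - 16 if b & 8 else b             # signed value of b
    overflow = not (-8 <= sa + sb <= 7)     # true result out of 4-bit range
    return overflow, bool(cin_msb ^ cout_msb)

assert all(add4(a, b)[0] == add4(a, b)[1] for a in range(16) for b in range(16))
print("overflow == CarryIn XOR CarryOut for all 256 four-bit additions")
```

For example, add4(7, 1) flags overflow (7 + 1 = 8 is out of range) and the carries into and out of the sign bit indeed differ.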
4. Assume all memory addresses are translated to physical addresses before the
cache is accessed. In this case, the cache is physically indexed and physically
tagged. Also assume a TLB is used. (a) Under what circumstance can a memory
reference encounter a TLB miss, a page table hit, and a cache miss? Briefly
explain why. (b) To speed up cache accesses, a processor may index the cache
with virtual addresses. This is called a virtually addressed cache, and it uses tags
that are virtual addresses. However, a problem called aliasing may occur. Explain
what aliasing is and why. (c) In today's computer systems, virtual memory and
cache work together as a hierarchy. When the operating system decides to move a
page back to disk, the contents of that page may have been brought into the cache
already. What should the OS do with the contents that are in the cache?
Answer: (a) Data/instruction is in memory but not in cache and page table has this
mapping but TLB has not.
(b) A situation in which the same object is accessed by two addresses; can occur
in virtual memory when there are two virtual addresses for the same physical
page.
(c) If the contents in the cache are dirty, force them to be written back to memory
and invalidate them in the cache; after that, copy the page back to disk. If they
are clean, simply invalidate them in the cache and copy the page back to disk.
5. The following three instructions are executed using MIPS 5-stage pipeline.
1. lw $2, 20($1)
2. sub $4, $2, $5
3. or $4, $2, $6
Since there is one cycle delay between lw and sub, a hazard detection unit is
required. Furthermore, by the time the hazard is detected, sub and or may have
already been fetched into the pipeline. Therefore it is also required to turn sub
into a nop and delay the execution of sub and or by one cycle as shown below.
1. lw $2, 20($1)
2. nop
3. sub $4, $2, $5
4. or $4, $2, $6
(a) In which stage should the hazard detection unit be placed? Why? (b) How can
you turn sub into a nop in MIPS 5-stage pipeline? (c) How can you prevent sub
and or from making progress and force these two instructions to repeat in the next
clock cycle? (d) Explain why there is one cycle delay between lw and sub.
Answer:
(a) ID: Instruction Decode and register file read stage.
(b) Deassert all nine control signals (zero the EX, MEM, and WB control fields in
the ID/EX pipeline register), turning the instruction into a bubble.
(c) Set both control signals PCWrite and IF/IDWrite to 0 to prevent the PC
register and IF/ID pipeline register from changing.
-
47
(d) As shown in the following diagram, after a 1-cycle stall between lw and sub,
the forwarding logic can handle the dependence and execution proceeds. (If
there were no forwarding, a 2-cycle delay would be needed.)
lw   IF ID EX MEM WB
nop     IF ID EX  MEM WB
sub        IF ID  EX  MEM WB
6. Answer the following questions briefly.
(a) Will addition "0010 + 1110" cause an overflow using the 4-bit two's
complement signed-integer form? (Simply answer yes or no).
(b) What would you get after performing arithmetic right shift by one bit on
1100two?
(c) If one wishes to increase the accuracy of the floating-point numbers that can
be represented, then he/she should increase the size of which part in the
floating-point format?
(d) Name one event, other than branches or jumps, that changes the normal flow
of instruction execution, e.g., by switching to a routine in the operating
system.
Answer:
(a) NO
(b) 1110two
(c) Fraction
(d) Arithmetic overflow
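Answers (a) and (b) can be sanity-checked in a few lines of Python (the helper name is made up; the 4-bit patterns are handled as small ints):

```python
def asr4(x):
    """Arithmetic right shift by 1 bit on a 4-bit pattern (int 0..15):
    the sign bit is replicated into the vacated position."""
    return (x >> 1) | (x & 0b1000)

# (b) arithmetic right shift of 1100 gives 1110 (sign bit preserved)
print(format(asr4(0b1100), '04b'))

# (a) 0010 + 1110 in 4-bit two's complement is 2 + (-2) = 0: no overflow
print(format((0b0010 + 0b1110) & 0b1111, '04b'))
```

The addition wraps to 0000, a valid result of the same sign behavior as the true sum, confirming the "no overflow" answer.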
7. A MIPS instruction takes five stages in a pipelined CPU design: (1) IF:
instruction fetch, (2) ID: instruction decode/register fetch, (3) ALU: execution or
calculate a memory address, (4) MEM: access an operand in data memory, and (5)
WB: write a result back into the register. Label one appropriate stage in which
each of the following actions needs to be executed. (Note that A and B are two
source operands, while ALUOut is the output register of the ALU, PC is the
program counter, IR is the instruction register. MDR is the memory data register,
Memory[k] is the k-th word in the memory, and Reg[k] is the k-th registers in the
register file.)
(a) Reg[IR[20-16]] = MDR;
(b) ALUOut = PC + (sign-extend(IR[15-0]));
(c) MEM
95
1. (1) Can you come up with a MIPS instruction that behaves like a NOP? The
instruction is executed by the pipeline but does not change any state.
(2) In a MIPS computer a main program can use "jal procedure address" to make a
procedure call and the callee can use "jr $ra" to return to the main program.
What is saved in register $ra during this process?
(3) Name and explain the three principal components that can be combined to
yield runtime.
Answer:
(1) sll $zero, $zero, 0
(2) The address of the instruction following the jal (Return address)
(3) Runtime = instruction count × CPI (cycles per instruction) × clock cycle time
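A quick worked example of this equation (all numbers below are hypothetical, chosen only to show the units):

```python
# Runtime = instruction count x CPI x clock cycle time (hypothetical values).
instruction_count = 2_000_000
cpi = 1.5                      # average cycles per instruction
clock_cycle_ns = 4             # i.e. a 250 MHz clock
runtime_ms = instruction_count * cpi * clock_cycle_ns / 1_000_000
print(runtime_ms)              # 12.0 ms
```

Any change that reduces one factor without increasing the others (fewer instructions, lower CPI, or a faster clock) reduces runtime proportionally.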
2. (1) Briefly explain the purpose of having a write buffer in the design of a
write-through cache.
(2) A large cache block tends to decrease the cache miss rate due to better spatial
locality. However, it has been observed that too large a cache block actually
increases the miss rate, especially in a very small cache. Why?
Answer:
(1) After writing the data into the write buffer, the processor can continue
execution without waiting for the memory update to complete. The CPU
performance can thus be increased.
(2) The number of blocks that can be held in the cache will become small, and
there will be a great deal of competition for those blocks. As a result, a block
will be bumped out of the cache before many of its words are accessed.
3. (1) Dynamic branch prediction is often used in today's machine. Consider a loop
branch that branches nine times in a row, and then is not taken once. What is
the prediction accuracy for this branch, assuming a simple 1-bit prediction
scheme is used and the prediction bit for this branch remains in the prediction
buffer? Briefly explain your result.
(2) What is the prediction accuracy if a 2-bit prediction scheme is used? Again
briefly explain your result.
Answer:
(1) The steady-state prediction behavior will mispredict on the first and last loop
iterations. Mispredicting the last iteration is inevitable since the prediction bit
will say taken. The misprediction on the first iteration happens because the bit
is flipped on prior execution of the last iteration of the loop, since the branch
was not taken on that exiting iteration. Thus, the prediction accuracy for this
branch is 80% (two incorrect predictions and eight correct ones).
(2) The prediction accuracy with a 2-bit prediction scheme is 90%, since only the
last loop iteration will be mispredicted.
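Both accuracies can be reproduced with a small simulation. In the sketch below (an illustration, not part of the original answer) each predictor is started in its steady-state condition — not-taken for the 1-bit scheme, weakly-taken for the 2-bit saturating counter — so every loop execution behaves like the steady state described above:

```python
def accuracy(outcomes, two_bit, init):
    """Simulate a 1-bit (state 0/1) or 2-bit saturating-counter (state 0..3)
    branch predictor over a sequence of actual outcomes (True = taken)."""
    state, correct = init, 0
    for taken in outcomes:
        predict = state >= (2 if two_bit else 1)   # predict taken?
        correct += (predict == taken)
        if two_bit:
            state = min(state + 1, 3) if taken else max(state - 1, 0)
        else:
            state = 1 if taken else 0
    return correct / len(outcomes)

# Branch taken nine times in a row, then not taken, over 100 loop executions.
pattern = ([True] * 9 + [False]) * 100
print(accuracy(pattern, two_bit=False, init=0))   # 0.8
print(accuracy(pattern, two_bit=True, init=2))    # 0.9
```

The 1-bit predictor mispredicts the first and last iterations of every loop (80%), while the 2-bit counter survives the single not-taken outcome and only misses the exit (90%).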
4. Answer the following questions briefly:
(1) In a pipelined CPU design, what kind of problem may occur as it executes
instructions corresponding to an if-statement in a C program? Name one
possible scheme to get around this problem more or less.
(2) Consider the possible actions in the Instruction Decode stage of a pipelined
CPU. In addition to setting up the two input operands of ALU, what is the
other possible action? (Hint: consider the execution of a jump instruction)
(3) What is x if the maximum number of memory words you can use in a 32-bit
MIPS machine in a single program is expressed as 2x? (Note: MIPS uses a
byte addressing scheme.)
Answer:
(1) Control hazard.
Solutions: insert nop instructions, use delayed branches, or use branch prediction.
(2) Decode instruction, sign-extend 16 bits immediate constant, jump address
calculation, branch target calculation, register comparison, load-use data
hazard detection.
(3) A single program on a 32-bit MIPS machine can use 256 MB = 2^28 bytes = 2^26
words. So, x = 26.
5. Consider the following flow chart of a sequential multiplier. We assume that the
64-bit multiplicand register is initialized with the 32-bit original multiplicand in
the right half and 0 in the left half. The final result is to be placed in a product
register. Fill in the missing descriptions in blanks A and B.
start
→ Test Multiplier[0]
    Multiplier[0] = 1: Blank A
    Multiplier[0] = 0: (do nothing)
→ Shift the multiplicand register left by 1 bit
→ Blank B
→ 32nd repetition?
    No: repeat from "Test Multiplier[0]"
    Yes (after 32 repetitions): Done
Answer:
Blank A: add Multiplicand to product and place the result in the Product register
Blank B: shift the Multiplier register right 1 bit
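The flow chart's algorithm, with Blank A and Blank B filled in as above, can be sketched in Python (unsigned 32-bit operands assumed; register widths are emulated with masks):

```python
def sequential_multiply(multiplicand, multiplier):
    """32-bit unsigned shift-and-add multiply following the flow chart:
    the multiplicand starts in the right half of a 64-bit register."""
    mask64 = (1 << 64) - 1
    mcand = multiplicand & 0xFFFFFFFF
    mplier = multiplier & 0xFFFFFFFF
    product = 0
    for _ in range(32):                            # 32 repetitions
        if mplier & 1:                             # test Multiplier[0]
            product = (product + mcand) & mask64   # Blank A: add to Product
        mcand = (mcand << 1) & mask64              # shift multiplicand left 1 bit
        mplier >>= 1                               # Blank B: shift multiplier right 1 bit
    return product

print(sequential_multiply(123456, 654321) == 123456 * 654321)   # True
```

Because a 32 × 32-bit unsigned product always fits in 64 bits, the masked result is exact for any pair of 32-bit inputs.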
6. Schedule the following instruction segment into a superscalar pipeline for MIPS.
Assume that the pipeline can execute one ALU or branch instruction and one data
transfer instruction concurrently. For the best, the instruction segment can be
executed in four clock cycles. Fill in the instruction identifiers into the table. Note
that data dependency should be taken into account.
(Identifier) (Instruction)
ln-1 Loop: lw $t0, 0($s1)
ln-2 addu $t0, $t0, $s2
ln-3 sw $t0, 0($s1)
ln-4 addi $s1, $s1, 4
ln-5 bne $s1, $zero, Loop
Clock Cycle ALU or branch instruction Data transfer instruction
1
2
3
4
Answer:
Clock Cycle ALU or branch instruction Data transfer instruction
1 ln-1 (lw)
2 ln-4 (addi)
3 ln-2 (addu)
4 ln-5 (bne) ln-3 (sw)
(Note: because ln-4 now updates $s1 before ln-3 issues, the store's offset must compensate for the pointer update, e.g., sw $t0, -4($s1).)
7. Suppose a computer's address size is k bits (using byte addressing),