1/30
Division by Convergence
Instructor: 王立洋
Prepared by (student): M9535204 蔡鐘葳
2/30
Outline
▓ Speedup of Convergence Division
▓ Hardware Implementation
▓ Analysis of Lookup Table Size
▓ Reference
3/30
16.4.
Speedup of Convergence Division
4/30
Introduction
$$q = \frac{z}{d} = \frac{z\,x^{(0)} x^{(1)} \cdots x^{(m-1)}}{d\,x^{(0)} x^{(1)} \cdots x^{(m-1)}}$$
Compute y = 1/d
Do the multiplication yz
Division can be performed via $2\lceil \log_2 k \rceil - 1$ multiplications
This is not yet very impressive
64-bit numbers, 5-ns multiplier ⇒ 55-ns division (11 multiplications)
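The multiplication count can be illustrated with a short sketch. It assumes 1/2 ≤ d < 1 and the multiplicative factor x(i) = 2 - d(i) used in division by repeated multiplications. The function name `divide_by_convergence` and the use of exact `Fraction` arithmetic in place of k-bit hardware words are assumptions made only for illustration.

```python
from fractions import Fraction

def divide_by_convergence(z, d, k):
    """Approximate q = z/d (1/2 <= d < 1) by repeated multiplications.

    Returns the quotient estimate and the number of multiplications used.
    Illustrative sketch: exact fractions stand in for k-bit words.
    """
    m = max(1, (k - 1).bit_length())   # ceil(log2 k) iterations
    multiplications = 0
    for i in range(m):
        x = 2 - d                      # multiplicative factor x(i)
        if i < m - 1:                  # the final d * x is never needed
            d = d * x
            multiplications += 1
        z = z * x
        multiplications += 1
    return z, multiplications
```

For k = 64 the sketch uses 6 iterations and 2 × 6 - 1 = 11 multiplications, which at 5 ns each gives the 55-ns figure quoted on the slide.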
5/30
Three Types of Speedup
Three types of speedup are possible:
Reducing the number of multiplications (reduce m)
Using narrower multiplications (reduce the width of some x(i)s)
Performing the multiplications faster
6/30
Initial Approximation
Convergence is slow in the beginning:
It takes 6 multiplications to get 8 bits of convergence and another 5 to go from 8 bits to 64 bits
Since x(0) x(1) x(2) is essentially an approximation to 1/d, these four initial multiplications can be replaced by a table-lookup step that directly supplies x(0+)
7/30
Initial Approximation via Table Lookup
A $2^w \times w$ lookup table is necessary and sufficient for w bits of convergence after the first pair of multiplications
x(0) alone is an approximation to 1/d; the product x(0) x(1) x(2) is a better approximation, since
d x(0) x(1) x(2) = (0.1111 1111 . . . )two
Read this value, x(0+), directly from a lookup table, thereby reducing 6 multiplications to 2
8/30
Example with 4-bit lookup
Example with 4-bit lookup: d = (0.1011 xxxx . . .)two
11/16 ≤ d < 12/16
Inverses of the two extremes are 16/11 ≈ (1.0111)two and 16/12 ≈ (1.0101)two
So, (1.0110)two is a good estimate for 1/d
1.0110 × 0.1011 = (11/8) × (11/16) = 121/128 = (0.1111001)two
1.0110 × 0.1100 = (11/8) × (3/4) = 33/32 = (1.000010)two
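The arithmetic of the 4-bit example can be reproduced exactly. This is a minimal sketch; the variable names are chosen here for illustration and are not taken from the slides.

```python
from fractions import Fraction

# Table output for the address 1011: x(0+) = (1.0110)two = 11/8
x0_plus = Fraction(0b10110, 16)

# The two extremes of the interval of d covered by this table entry
d_low = Fraction(0b1011, 16)    # 11/16
d_high = Fraction(0b1100, 16)   # 12/16

low_product = x0_plus * d_low    # 121/128, slightly below 1
high_product = x0_plus * d_high  # 33/32, slightly above 1

# Both products are within 2^-4 of 1, i.e. 4 bits of convergence
within_4_bits = all(abs(1 - p) < Fraction(1, 16)
                    for p in (low_product, high_product))
```

The product at the upper extreme exceeds 1, so with a table lookup the convergence is no longer strictly from below.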
9/30
Fig. 16.3
Fig. 16.3 Convergence in division by repeated multiplications with initial table lookup.
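[Figure: horizontal axis "Iterations"; d converges to 1 - ulp and z converges to q - ε. The first step shown is the table lookup plus the first pair of multiplications, which replaces several iterations; the next step is the second pair of multiplications.]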
10/30
Fig. 16.3
In division by repeated multiplications, we saw that convergence to 1 and q occurs from below
If at some point in our iterations d(i) overshoots 1 (becomes 1 + ε),
the next multiplicative factor 2 - d(i) = 1 - ε will lead to a value d(i+1)
that is smaller than 1 but still closer to 1
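A one-line check of the overshoot claim, writing d(i) = 1 + ε with a small ε > 0 (the symbols follow the slide):

```latex
d^{(i+1)} = d^{(i)}\,\bigl(2 - d^{(i)}\bigr) = (1 + \varepsilon)(1 - \varepsilon) = 1 - \varepsilon^{2}
```

So d(i+1) lies below 1, and its distance ε² from 1 is smaller than ε whenever 0 < ε < 1.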
11/30
Analysis of Truncated Multiplicative Factors (1/2)
We begin by noting that
$d\,x^{(0)} x^{(1)} \cdots x^{(i)} = 1 - y^{(i)}$
$x^{(i+1)} = 2 - (1 - y^{(i)}) = 1 + y^{(i)}$
Assume that we truncate $1 - y^{(i)}$ to an a-bit fraction,
thus obtaining $(1 - y^{(i)})_T$ with an error of $\alpha < 2^{-a}$
12/30
Analysis of Truncated Multiplicative Factors (2/2)
With this truncated multiplicative factor, we get
$(x^{(i+1)})_T = 2 - (1 - y^{(i)})_T = 1 + y^{(i)} + \alpha$
where $0 \le \alpha = (x^{(i+1)})_T - x^{(i+1)} < 2^{-a}$
Thus
$d\,x^{(0)} x^{(1)} \cdots x^{(i)} (x^{(i+1)})_T = (1 - y^{(i)})(1 + y^{(i)} + \alpha)$
$= 1 - (y^{(i)})^2 + \alpha(1 - y^{(i)}) = d\,x^{(0)} x^{(1)} \cdots x^{(i)} x^{(i+1)} + \alpha(1 - y^{(i)})$
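The effect of truncating the multiplicative factor can be checked numerically. The sketch below assumes truncation means rounding toward zero to an a-bit fraction; the helper names `truncate` and `one_step` are hypothetical and exact fractions replace fixed-width hardware words.

```python
import math
from fractions import Fraction

def truncate(value, a):
    """Truncate a non-negative value to an a-bit fraction (round toward zero)."""
    return Fraction(math.floor(value * 2 ** a), 2 ** a)

def one_step(product, a):
    """One iteration starting from product = d x(0) ... x(i) = 1 - y(i).

    Returns (A, B): A uses the precise factor 2 - product,
    B uses the factor built from the truncated product.
    """
    x_precise = 2 - product
    x_truncated = 2 - truncate(product, a)   # equals x_precise + alpha
    return product * x_precise, product * x_truncated

# Example: about 8 bits of convergence (y = 1/300), factor truncated to 16 bits
A, B = one_step(Fraction(299, 300), 16)
```

In this example A equals 1 - y², B exceeds A by α(1 - y) < 2^-16, and both end up within 2^-16 of 1.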
13/30
Fig. 16.4
Fig. 16.4 Convergence in division by repeated multiplications with initial table lookup and the use of truncated multiplicative factors.
[Figure: horizontal axis "Iterations"; d converges to 1 ± ulp and z converges to q ± ε.]
14/30
Fig. 16.4
The first pair of multiplications following the table lookup involves a narrow multiplier
It may therefore be faster than a full-width multiplication
When the multiplier (the multiplicative factor) is suitably truncated,
convergence occurs from above or below rather than only from below
15/30
Fig. 16.5
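Fig. 16.5 One step in convergence division with truncated multiplicative factors.
[Figure: going from iteration i to iteration i + 1, the precise iteration takes $d\,x^{(0)} x^{(1)} \cdots x^{(i)}$ to point A, $d\,x^{(0)} \cdots x^{(i)} x^{(i+1)}$, while the approximate iteration takes it to point B, $d\,x^{(0)} \cdots x^{(i)} (x^{(i+1)})_T$; A and B differ by less than $2^{-a}$.]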
16/30
Fig. 16.5
If we aim to go from l bits to 2l bits of convergence,
we can truncate the next multiplicative factor to 2l bits
Consider Fig. 16.5:
A, the result of the precise iteration, is no more than $2^{-2l}$ below 1
With a = 2l, B, arrived at by the approximate iteration, will be no more than $2^{-2l}$ above 1
17/30
Example
64-bit multiplication
Initial step: Table of size 256 × 8 = 2K bits
Middle steps: Multiplication pairs, with 9-, 17-, and 33-bit multipliers
Final step: Full 64 × 64 multiplication
18/30
16.5.
Hardware Implementation
19/30
Hardware Implementation
Fig. 16.6 Two multiplications fully overlapped in a 2-stage pipelined multiplier.
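[Figure: a 2-stage pipelined multiplier alternately processes d(i) x(i) and z(i) x(i), producing d(i+1) and z(i+1); d(i+1) passes through a 2's-complement unit to form x(i+1), which feeds the next pair of products d(i+1) x(i+1) and z(i+1) x(i+1), leading to d(i+2).]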
20/30
Fig. 16.6
As the computation of z(i) x(i) moves from the top to the bottom pipeline stage,
the next iteration begins in the top stage with the computation of d(i+1) x(i+1)
21/30
Implementing Division with Reciprocation
Reciprocation: the multiplications in each pair are data-dependent, so they cannot be pipelined or performed in parallel
In the recurrence x(i+1) = x(i) (2 - x(i)d),
the second multiplication, by x(i), needs the result of the first one
The most promising speedup method relies on deriving a better starting approximation to 1/d
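The data dependence in the reciprocation recurrence is easy to see in code. This is a minimal sketch with exact fractions; the function name `reciprocal` and the explicit starting value `x0` are assumptions for illustration.

```python
from fractions import Fraction

def reciprocal(d, x0, iterations):
    """Refine x0 toward 1/d using x(i+1) = x(i) * (2 - x(i) * d)."""
    x = x0
    for _ in range(iterations):
        t = x * d          # first multiplication
        x = x * (2 - t)    # second multiplication must wait for t
    return x

# Starting from the 4-bit table estimate 11/8 for d = 3/4:
d = Fraction(3, 4)
x = reciprocal(d, Fraction(11, 8), 3)
residual = 1 - x * d       # error squares at every iteration
```

Here the initial error 1 - x(0)d has magnitude 2^-5, so after three iterations the residual is 2^-40. A better starting approximation removes iterations from the front of this chain, which is why the slide singles it out.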
22/30
The Required Lookup Table
The required lookup table can be made smaller, or eliminated entirely, by a variety of methods
Store the reciprocal values for fewer points
Use linear or higher-order interpolation to compute the starting approximation
Formulate the starting approximation as a multi-operand addition problem and compute it with one pass through the multiplier’s CSA tree, suitably augmented
23/30
16.6.
Analysis of Lookup Table Size
24/30
Theorem for Table Size
Theorem 16.1: To get w ≥ 5 bits of convergence after the first iteration of division by repeated multiplications, w bits of d (beyond the mandatory 1) must be inspected. The factor x(0+) read out from the table is of the form (1.xxx . . . xxx)two, with w bits after the radix point
Based on the theorem, the required table size is $2^w \times w$
The cases w < 5:
They allow smaller tables but are of no practical interest
We can ignore them
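To see how quickly a table with 2^w entries of w bits each grows, the size can be tabulated for a few values of w. The function name `table_size_bits` and the chosen values of w are illustrative assumptions.

```python
def table_size_bits(w):
    """Bits in a table with 2^w entries, each w bits wide."""
    return (2 ** w) * w

# Table sizes for a few widths; w = 8 gives 256 x 8 = 2048 bits (2K bits)
sizes = {w: table_size_bits(w) for w in (5, 8, 12, 16)}
```

Each extra bit of initial convergence more than doubles the table, which motivates the methods for shrinking or eliminating it.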
25/30
Analysis of Lookup Table Size (1/4)
Recall that our objective is to have
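$1 - 2^{-w} \le d\,x^{(0+)} \le 1 + 2^{-w}$
Let $d = (0.1\,d_{-2} d_{-3} \ldots d_{-(w+1)} d_{-(w+2)} \ldots d_{-l})_{\text{two}}$, where $d_{-2} d_{-3} \ldots d_{-(w+1)}$ are the w bits to be inspected
Theorem 16.1 postulates the existence of $x^{(0+)} = (1.x^{+}_{-1} x^{+}_{-2} \ldots x^{+}_{-w})_{\text{two}}$ satisfying the objective inequality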
26/30
Analysis of Lookup Table Size (2/4)
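Let $u = (1\,d_{-2} d_{-3} \ldots d_{-(w+1)})_{\text{two}}$, an integer satisfying $2^w \le u < 2^{w+1}$
We have $2^{-(w+1)} u \le d < 2^{-(w+1)}(u+1)$
Similarly, let $v = (1\,x^{+}_{-1} x^{+}_{-2} \ldots x^{+}_{-w})_{\text{two}}$, so that $x^{(0+)} = 2^{-w} v$
The objective inequality can be rewritten as
$2^w - 1 \le d\,v \le 2^w + 1$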
27/30
Analysis of Lookup Table Size (3/4)
We derive the following sufficient conditions
$2^w - 1 \le 2^{-(w+1)}\,u\,v$
$2^{-(w+1)}(u+1)\,v \le 2^w + 1$
The conditions lead to the following restrictions on v
$$\frac{2^{w+1}(2^w - 1)}{u} \le v \le \frac{2^{w+1}(2^w + 1)}{u + 1}$$
28/30
Analysis of Lookup Table Size (4/4)
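Because v must be an integer, the latter condition is equivalent to
$$\left\lceil \frac{2^{w+1}(2^w - 1)}{u} \right\rceil \le \left\lfloor \frac{2^{w+1}(2^w + 1)}{u + 1} \right\rfloor$$
The proof that the last inequality always holds is left as an exercise
This completes the “sufficiency” part of the proof
The remaining (“necessity”) part:
At least w bits of d must be inspected
x(0+) must have at least w bits after the radix point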
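The existence of a suitable integer v can be checked exhaustively for a given w. The sketch below assumes the two sufficient conditions $2^w - 1 \le 2^{-(w+1)} u v$ and $2^{-(w+1)}(u+1) v \le 2^w + 1$ from the preceding slide, solved for v; the function names are hypothetical. It is a numerical check for one value of w, not a proof.

```python
import math
from fractions import Fraction

def v_range(u, w):
    """Smallest and largest integer v allowed by the two sufficient conditions."""
    lo = math.ceil(Fraction(2 ** (w + 1) * (2 ** w - 1), u))
    hi = math.floor(Fraction(2 ** (w + 1) * (2 ** w + 1), u + 1))
    return lo, hi

def all_entries_exist(w):
    """True if every u with 2^w <= u < 2^(w+1) admits at least one integer v."""
    return all(lo <= hi
               for lo, hi in (v_range(u, w)
                              for u in range(2 ** w, 2 ** (w + 1))))
```

For w = 8 and u = 311 the range is 420 to 421, that is, 256 + 164 or 256 + 165.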
29/30
Example
Table 16.2 Sample entries in the lookup table replacing the first four multiplications in division by repeated multiplications
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
Address    d = 0.1 xxxx xxxx    x(0+) = 1. xxxx xxxx
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
55         0011 0111            1010 0101
64         0100 0000            1001 1001
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
Example: Table entry at address 55 (311/512 ≤ d < 312/512)
For 8 bits of convergence, the table entry f must satisfy
(311/512)(1 + .f) ≥ 1 – $2^{-8}$ and (312/512)(1 + .f) ≤ 1 + $2^{-8}$
199/311 ≤ .f ≤ 101/156, or 163.81 ≤ 256 × .f ≤ 165.74
Two choices: 164 = (1010 0100)two or 165 = (1010 0101)two
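The same bounds can be computed for any table address. This is a minimal sketch assuming the convergence requirement 1 - 2^-w ≤ d(1 + .f) ≤ 1 + 2^-w applied at the two ends of the interval of d selected by the address; the function name `entry_choices` is hypothetical.

```python
import math
from fractions import Fraction

def entry_choices(address, w=8):
    """Valid w-bit entries (as bit strings) for the given table address."""
    u = 2 ** w + address                  # d lies in [u, u + 1) / 2^(w+1)
    scale = 2 ** (w + 1)
    eps = Fraction(1, 2 ** w)
    lo = (1 - eps) * Fraction(scale, u) - 1        # lower bound on .f
    hi = (1 + eps) * Fraction(scale, u + 1) - 1    # upper bound on .f
    first = max(0, math.ceil(lo * 2 ** w))
    last = min(2 ** w - 1, math.floor(hi * 2 ** w))
    return [format(f, f"0{w}b") for f in range(first, last + 1)]

choices_55 = entry_choices(55)   # the two choices worked out on the slide
choices_64 = entry_choices(64)
```

For address 55 this yields 1010 0100 and 1010 0101, and for address 64 the list includes 1001 1001, the entry shown in Table 16.2.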
30/30
Reference
[1] Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, 2000.