
1/30

Division by Convergence

Instructor: 王立洋

Prepared by (student): M9535204 蔡鐘葳

2/30

Outline

▓ Speedup of Convergence Division

▓ Hardware Implementation

▓ Analysis of Lookup Table Size

▓ Reference

3/30

16.4. Speedup of Convergence Division

4/30

Introduction

$$q = \frac{z}{d} = \frac{z\,x^{(0)} x^{(1)} \cdots x^{(m-1)}}{d\,x^{(0)} x^{(1)} \cdots x^{(m-1)}}$$

Compute y = 1/d

Do the multiplication yz

Division can be performed via 2⌈log2 k⌉ – 1 multiplications

This is not yet very impressive

64-bit numbers, 5-ns multiplier ⇒ 11 multiplications, i.e., a 55-ns division
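The multiplication count can be checked with a minimal sketch of division by repeated multiplications. It assumes 1/2 ≤ d < 1 and uses exact rational arithmetic instead of fixed-width hardware; the function name `divide_by_convergence` is hypothetical.

```python
from fractions import Fraction
from math import ceil, log2

def divide_by_convergence(z, d, k):
    """Approximate q = z/d for 1/2 <= d < 1 by repeated multiplications.

    Returns (q_approx, number_of_multiplications).
    """
    m = ceil(log2(k))            # iterations needed for k bits of convergence
    mults = 0
    for i in range(m):
        x = 2 - d                # factor x(i): a 2's complement, not a multiplication
        z = z * x
        mults += 1
        if i < m - 1:            # the final d(m) is never used, so skip it
            d = d * x
            mults += 1
    return z, mults

q, mults = divide_by_convergence(Fraction(1, 2), Fraction(11, 16), 64)
print(mults, float(q))
```

For k = 64 the loop runs 6 times and performs 2 × 6 – 1 = 11 multiplications, which at 5 ns each gives the 55-ns figure.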

5/30

Three Types of Speedup

Three types of speedup are possible:

Reducing the number of multiplications (reduce m)

Using narrower multiplications (reduce the width of some x(i)s)

Performing the multiplications faster

6/30

Initial Approximation

Convergence is slow in the beginning:

It takes 6 multiplications to get 8 bits of convergence and another 5 to go from 8 bits to 64 bits

Since x(0) x(1) x(2) is essentially an approximation to 1/d, four of these six initial multiplications can be replaced by a table-lookup step that directly supplies x(0+)

7/30

Initial Approximation via Table Lookup

A $2^w \times w$ lookup table is necessary and sufficient for w bits of convergence after the first pair of multiplications

d x(0) x(1) x(2) = (0.1111 1111 . . . )two

The product x(0) x(1) x(2) is an approximation to 1/d (a better approximation than x(0) alone)

Read this value, x(0+), directly from a table, thereby reducing 6 multiplications to 2

8/30

Example with 4-bit lookup

Consider d = (0.1011 xxxx . . .)two, so that 11/16 ≤ d < 12/16

Inverses of the two extremes are 16/11 ≈ (1.0111)two and 16/12 ≈ (1.0101)two

So, (1.0110)two is a good estimate for 1/d

1.0110 × 0.1011 = (11/8) × (11/16) = 121/128 = (0.1111001)two

1.0110 × 0.1100 = (11/8) × (3/4) = 33/32 = (1.00001)two
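The arithmetic of this example can be reproduced with exact fractions. This is only a check of the numbers above; the variable names are illustrative.

```python
from fractions import Fraction
from math import floor

d_lo, d_hi = Fraction(11, 16), Fraction(12, 16)   # extremes of d = 0.1011 xxxx ...

# Reciprocals of the extremes, truncated to 4 fractional bits
inv_lo = floor(1 / d_lo * 16)    # 16/11 -> 23 = (1.0111)two scaled by 16
inv_hi = floor(1 / d_hi * 16)    # 16/12 -> 21 = (1.0101)two scaled by 16

x0 = Fraction(0b10110, 16)       # chosen estimate (1.0110)two = 11/8
low = d_lo * x0                  # product at the lower extreme of d
high = d_hi * x0                 # product at the upper extreme of d
print(low, high)                 # 121/128 33/32
```

The chosen estimate, 22/16, lies between the two truncated reciprocals, and both products are close to 1.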

9/30

Fig. 16.3

Fig. 16.3 Convergence in division by repeated multiplications with initial table lookup.

[Figure: horizontal axis "Iterations"; d converges to 1 – ulp and z converges to q – ε. Annotations: "After table lookup and first pair of multiplications, replacing several iterations" and "After the second pair of multiplications".]

10/30

Fig. 16.3

In division by repeated multiplications, we saw that convergence to 1 and q occurred from below

If at some point in our iterations d(i) overshoots 1 (becomes 1 + ε), the next multiplicative factor 2 – d(i) = 1 – ε leads to a value d(i+1) = 1 – ε² that is smaller than 1 but still closer to 1
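A small numerical illustration of the overshoot case, using a hypothetical overshoot of ε = 1/256:

```python
from fractions import Fraction

eps = Fraction(1, 256)     # hypothetical overshoot
d_i = 1 + eps              # d(i) has overshot 1
x_i = 2 - d_i              # next multiplicative factor, equal to 1 - eps
d_next = d_i * x_i         # d(i+1) = (1 + eps)(1 - eps)
print(d_next, float(1 - d_next))
```

The new value falls below 1, but its distance from 1 shrinks from ε to ε².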

11/30

Analysis of Truncated Multiplicative Factors (1/2)

We begin by noting that

$$d\,x^{(0)} x^{(1)} \cdots x^{(i)} = 1 - y^{(i)}$$

$$x^{(i+1)} = 2 - (1 - y^{(i)}) = 1 + y^{(i)}$$

Assume that we truncate $1 - y^{(i)}$ to an a-bit fraction, thus obtaining $(1 - y^{(i)})_T$ with an error of $\alpha < 2^{-a}$

12/30

Analysis of Truncated Multiplicative Factors (2/2)

With this truncated multiplicative factor, we get

$$(x^{(i+1)})_T = 2 - (1 - y^{(i)})_T = 1 + y^{(i)} + \alpha$$

where $0 \le (x^{(i+1)})_T - x^{(i+1)} < 2^{-a}$

Thus

$$d\,x^{(0)} x^{(1)} \cdots x^{(i)} (x^{(i+1)})_T = (1 - y^{(i)})(1 + y^{(i)} + \alpha)$$

$$= 1 - (y^{(i)})^2 + \alpha(1 - y^{(i)}) = d\,x^{(0)} x^{(1)} \cdots x^{(i)} x^{(i+1)} + \alpha(1 - y^{(i)})$$

13/30

Fig. 16.4

Fig. 16.4 Convergence in division by repeated multiplications with initial table lookup and the use of truncated multiplicative factors.

[Figure: horizontal axis "Iterations"; d converges to 1 ± ulp and z converges to q ± ε.]

14/30

Fig. 16.4

The first pair of multiplications following the table lookup involves a narrow multiplier and may therefore be faster than a full-width multiplication

If the multiplier (the multiplicative factor) is suitably truncated, the result is that convergence may occur from above or from below, rather than only from below

15/30

Fig. 16.5

Fig. 16.5 One step in convergence division with truncated multiplicative factors.

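[Figure: one step from iteration i to iteration i + 1, starting at d x(0) x(1) . . . x(i). Point A, the precise iteration, is d x(0) x(1) . . . x(i) x(i+1); point B, the approximate iteration, is d x(0) x(1) . . . x(i) (x(i+1))T; A and B differ by less than $2^{-a}$.]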

16/30

Fig. 16.5

If we aim to go from l bits to 2l bits of convergence, we can truncate the next multiplicative factor to 2l bits

To see why, consider Fig. 16.5

Point A, the result of the precise iteration, is no more than $2^{-2l}$ below 1

With a = 2l, point B, arrived at by the approximate iteration, will be no more than $2^{-2l}$ above 1
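The analysis of truncated multiplicative factors and the claim about points A and B can be checked numerically. The sketch below is illustrative only: the helper name `truncated_step` and the starting value are assumptions, and exact fractions stand in for fixed-width hardware.

```python
from fractions import Fraction
from math import floor

def truncated_step(p, a):
    """One iteration step starting from p = d x(0) ... x(i) = 1 - y(i).

    Returns (A, B, alpha): the precise result, the result obtained when
    1 - y(i) is truncated to a fractional bits, and the truncation error.
    """
    p_t = Fraction(floor(p * 2**a), 2**a)   # (1 - y(i))_T
    alpha = p - p_t                          # 0 <= alpha < 2**-a
    x_precise = 2 - p                        # x(i+1) = 1 + y(i)
    x_trunc = 2 - p_t                        # (x(i+1))_T = 1 + y(i) + alpha
    return p * x_precise, p * x_trunc, alpha

l = 8
p = 1 - Fraction(1, 300)     # hypothetical value with l = 8 bits of convergence
A, B, alpha = truncated_step(p, a=2 * l)
print(float(1 - A), float(B - 1))
```

In this run B – A equals α(1 – y(i)) exactly, A is within $2^{-2l}$ of 1 from below, and B does not exceed 1 by more than $2^{-2l}$.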

17/30

Example

64-bit multiplication

Initial step: Table of size 256 × 8 = 2K bits

Middle steps: Multiplication pairs, with 9-, 17-, and 33-bit multipliers

Final step: Full 64 × 64 multiplication
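One way to read these multiplier widths: a factor that yields l bits of convergence needs l fractional bits plus the leading 1, and l doubles at each step until the full width is reached. The helper below is a sketch under that assumption; `multiplier_widths` is a hypothetical name.

```python
def multiplier_widths(k, w):
    """Widths of the narrow multipliers used before the final full-width step.

    k: target number of bits; w: bits of convergence supplied by the table.
    Assumes each factor has one integer bit plus as many fractional bits
    as the convergence it produces.
    """
    widths = []
    bits = w
    while bits < k:
        widths.append(bits + 1)   # 1 integer bit + `bits` fractional bits
        bits *= 2                 # convergence doubles with each pair
    return widths

print(multiplier_widths(64, 8))   # [9, 17, 33]
print(2**8 * 8)                   # table size in bits: 2048
```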

18/30

16.5. Hardware Implementation

19/30

Hardware Implementation

Fig. 16.6 Two multiplications fully overlapped in a 2-stage pipelined multiplier.

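[Figure: a two-stage pipelined multiplier with a 2's-complement unit. The products d(i) x(i) and z(i) x(i) pass through the two stages in overlapped fashion, producing d(i+1) and z(i+1); x(i+1) is formed as the 2's complement of d(i+1); the products d(i+1) x(i+1) and z(i+1) x(i+1) follow, producing d(i+2).]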

20/30

Fig. 16.6

As the computation of z(i) x(i) moves from the top to the bottom pipeline stage, the next iteration begins by computing the top stage of d(i+1) x(i+1)

21/30

Implementing Division with Reciprocation

Reciprocation: Multiplication pairs are data-dependent, so they cannot be pipelined or performed in parallel

In the recurrence x(i+1) = x(i) (2 – x(i) d), the second multiplication by x(i) needs the result of the first one, x(i) d

The most promising speedup method relies on deriving a better starting approximation to 1/d
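The data dependence is visible in a direct transcription of the recurrence. This is a sketch with exact fractions; the starting approximation 11/8 is borrowed from the 4-bit lookup example, and the function name `reciprocal` is hypothetical.

```python
from fractions import Fraction

def reciprocal(d, x0, iterations):
    """Refine x0, an approximation to 1/d, with x(i+1) = x(i) (2 - x(i) d)."""
    x = x0
    for _ in range(iterations):
        t = x * d            # first multiplication
        x = x * (2 - t)      # second multiplication cannot start until t is known
    return x

d = Fraction(11, 16)
x = reciprocal(d, Fraction(11, 8), 3)
print(float(1 - d * x))
```

Each iteration squares the error 1 – x(i) d, so a better starting approximation directly reduces the number of dependent multiplication pairs.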

22/30

The Required Lookup Table

The required lookup table can be made smaller, or totally eliminated, by a variety of methods

Store the reciprocal values for fewer points and use linear or higher-order interpolation to compute the starting approximation

Formulate the starting approximation as a multi-operand addition problem and use one pass through the multiplier’s CSA tree, suitably augmented, to compute it

23/30

16.6. Analysis of Lookup Table Size

24/30

Theorem for Table Size

Theorem 16.1: To get w ≥ 5 bits of convergence after the first iteration of division by repeated multiplications, w bits of d (beyond the mandatory 1) must be inspected. The factor x(0+) read out from the table is of the form (1.xxx . . . xxx)two, with w bits after the radix point

Based on the theorem, the required table size is $2^w \times w$

The cases w < 5 are practically uninteresting (they allow a smaller table), so we can ignore them

25/30

Analysis of Lookup Table Size (1/4)

Recall that our objective is to have

$$1 - 2^{-w} \le d\,x^{(0+)} \le 1 + 2^{-w}$$

Let $d = (0.1\,d_{-2} d_{-3} \cdots d_{-(w+1)} d_{-(w+2)} \cdots d_{-l})_{\text{two}}$, where $d_{-2}, \ldots, d_{-(w+1)}$ are the w bits to be inspected

Theorem 16.1 postulates the existence of $x^{(0+)} = (1.x^{+}_{-1} x^{+}_{-2} \cdots x^{+}_{-w})_{\text{two}}$ satisfying the objective inequality

26/30


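Let $u = (1\,d_{-2} d_{-3} \cdots d_{-(w+1)})_{\text{two}}$, which satisfies $2^w \le u < 2^{w+1}$

We have $2^{-(w+1)} u \le d < 2^{-(w+1)} (u+1)$

Similarly, let $v = (1\,x^{+}_{-1} x^{+}_{-2} \cdots x^{+}_{-w})_{\text{two}}$, so that $v = 2^w x^{(0+)}$

The objective inequality can be rewritten as

$$2^w - 1 \le dv \le 2^w + 1$$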

27/30


We derive the following sufficient conditions

$$2^w - 1 \le 2^{-(w+1)} u v$$

$$2^{-(w+1)} (u+1) v \le 2^w + 1$$

The conditions lead to the following restrictions on v

$$\frac{2^{w+1}(2^w - 1)}{u} \le v \le \frac{2^{w+1}(2^w + 1)}{u + 1}$$

28/30


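Because v is an integer, the latter condition is equivalent to

$$\left\lceil \frac{2^{w+1}(2^w - 1)}{u} \right\rceil \le \left\lfloor \frac{2^{w+1}(2^w + 1)}{u + 1} \right\rfloor$$

Showing that the last inequality always holds is left as an exercise

This completes the “sufficiency” part of the proof

The remaining “necessity” part states that at least w bits of d must be inspected and that x(0+) must have at least w bits after the radix point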
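The inequality left as an exercise can at least be spot-checked by brute force. The following is a numerical check for a few values of w, not a proof; the function names are hypothetical.

```python
def v_range(u, w):
    """Smallest and largest integer v allowed by the restrictions on v."""
    lo_num = 2**(w + 1) * (2**w - 1)
    hi_num = 2**(w + 1) * (2**w + 1)
    v_min = -(-lo_num // u)       # ceiling of lo_num / u
    v_max = hi_num // (u + 1)     # floor of hi_num / (u + 1)
    return v_min, v_max

def always_solvable(w):
    """True if every u with 2**w <= u < 2**(w+1) admits an integer v."""
    for u in range(2**w, 2**(w + 1)):
        v_min, v_max = v_range(u, w)
        if v_min > v_max:
            return False
    return True

print(v_range(311, 8))                               # (420, 421)
print(all(always_solvable(w) for w in range(5, 13)))
```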

29/30

Example

Table 16.2 Sample entries in the lookup table replacing the first four multiplications in division by repeated multiplications

–––––––––––––––––––––––––––––––––––––––––––––––––––––––
Address    d = 0.1 xxxx xxxx    x(0+) = 1. xxxx xxxx
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
55         0011 0111            1010 0101
64         0100 0000            1001 1001
–––––––––––––––––––––––––––––––––––––––––––––––––––––––

Example: Table entry at address 55 (311/512 ≤ d < 312/512)

For 8 bits of convergence, the table entry f must satisfy

(311/512)(1 + .f) ≥ 1 – 2^–8 and (312/512)(1 + .f) ≤ 1 + 2^–8

199/311 ≤ .f ≤ 101/156 or 163.81 ≤ 256 × .f ≤ 165.74

Two choices: 164 = (1010 0100)two or 165 = (1010 0101)two
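A full table can be generated from the same restrictions on v. In the sketch below, picking the largest admissible v (capped so that the entry keeps the form 1.xxxx xxxx) is an assumption of this example rather than a rule stated in the source; it happens to reproduce the two sample entries of Table 16.2. The name `build_table` is hypothetical.

```python
def build_table(w):
    """Return the w-bit fractional parts of x(0+) for all 2**w addresses."""
    table = []
    for address in range(2**w):
        u = 2**w + address                             # (1 d-2 ... d-(w+1))two
        v_max = 2**(w + 1) * (2**w + 1) // (u + 1)     # largest v allowed
        v = min(v_max, 2**(w + 1) - 1)                 # keep the form 1.xxx...x
        table.append(v - 2**w)                         # drop the leading 1
    return table

table = build_table(8)
print(format(table[55], "08b"), format(table[64], "08b"))   # 10100101 10011001
```

With w = 8 the table has 256 entries of 8 bits each, i.e., 2K bits in total.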

30/30

Reference

[1] Behrooz Parhami, “Computer Arithmetic: Algorithms and Hardware Designs,” Oxford University Press, 2000.
