Floating Point in Computers

Comply with standards:

IEEE 754

ISO/IEC 559

Timeline

• Introduction: quite short

• Binary review: not so long

• Integer Arithmetic: 1/3

• Floating Point: 1/3

• Floating Point Arithmetic: 1/3

• Other issues: extra short

Introduction

• Who does computer arithmetic?

• Intel’s spare money

• How is it done in hardware?

• How Integer relates to Floating point

• Now, we go back to “computer structure”

Binary numbers

• What is 1 0 0 1 0 1 1 . 0 0 1 0 1 ?

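Place values: $2^6\ 2^5\ 2^4\ 2^3\ 2^2\ 2^1\ 2^0\ .\ 2^{-1}\ 2^{-2}\ 2^{-3}\ 2^{-4}\ 2^{-5}$

$64 + 8 + 2 + 1 + \frac{1}{8} + \frac{1}{32} = 75\frac{5}{32}$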
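The place-value reading of a binary number with a binary point can be checked with a small sketch. The helper name `binary_to_fraction` is made up for this illustration and is not part of the slides.

```python
from fractions import Fraction

def binary_to_fraction(text):
    """Value of an unsigned binary string with an optional binary point."""
    whole, _, frac = text.replace(" ", "").partition(".")
    value = Fraction(int(whole, 2)) if whole else Fraction(0)
    # Bit i after the point has weight 2**-i.
    for i, bit in enumerate(frac, start=1):
        value += Fraction(int(bit), 2 ** i)
    return value

print(binary_to_fraction("1 0 0 1 0 1 1 . 0 0 1 0 1"))   # 2405/32, i.e. 75 5/32
print(float(binary_to_fraction("1001011.00101")))        # 75.15625
```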

Signed Binary Integers

• Sign-magnitude

• 2’s complement

• 1’s complement

• biased

Sign-Magnitude

• High order bit = Sign

• 0101 = 5

• 1101 = -5

• Two zeros (0000 and 1000)

2’s complement

• Number + Negative = $2^n$

• 0101 = 5

• 1011 = -5

• Easy addition (drop carry)

• Formula: $-a_{n-1}2^{n-1} + a_{n-2}2^{n-2} + \dots + a_1 2^1 + a_0$

1’s Complement

• Negative: complement every bit

• 0101 = 5

• 1010 = -5

• Two zeros (0000 and 1111)

• Number + Negative = $2^n - 1$

Biased

• Binary = Number + Bias

• Bias = 5: 1010 = 5 (5 + 5 = 10)

0000 = -5 ((-5) + 5 = 0)

• Relative order remains
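The four signed encodings can be compared side by side with a short sketch for 4-bit words. The function names are illustrative only, and the code assumes the value fits in the chosen word size.

```python
def sign_magnitude(x, n=4):
    # High-order bit is the sign, the remaining bits hold |x|.
    return ((1 << (n - 1)) | -x) if x < 0 else x

def twos_complement(x, n=4):
    # Number + Negative = 2**n, so -x is stored as 2**n - x.
    return x % (1 << n)

def ones_complement(x, n=4):
    # Negative values are the bitwise complement of the magnitude.
    return x if x >= 0 else ~(-x) & ((1 << n) - 1)

def biased(x, bias):
    # Stored binary = number + bias.
    return x + bias

def show(value, n=4):
    return format(value, f"0{n}b")

print(show(sign_magnitude(-5)))    # 1101
print(show(twos_complement(-5)))   # 1011
print(show(ones_complement(-5)))   # 1010
print(show(biased(5, 5)), show(biased(-5, 5)))   # 1010 0000
```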

Integer Arithmetic

Adding (unsigned) Integers

• Elementary school:

  1 1 0 0 1 1 0 1
+ 1 0 0 0 0 1 1 0
-----------------
1 0 1 0 1 0 0 1 1

• Result has n+1 bits!

Adding Integers - hardware

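Half Adder

Inputs: a, b. Outputs: s, Cout

$s = a\bar{b} + \bar{a}b$, $c_{out} = ab$

Full Adder

Inputs: a, b, Cin. Outputs: s, Cout

2 logical levels

$s = a\bar{b}\bar{c}_{in} + \bar{a}b\bar{c}_{in} + \bar{a}\bar{b}c_{in} + abc_{in}$

$c_{out} = ab + ac_{in} + bc_{in}$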

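Ripple carry Adder

[Figure: a chain of n full adders. Stage i takes $a_i$, $b_i$ and the carry from stage i-1 as Cin and produces $s_i$; the last stage ($a_{n-1}$, $b_{n-1}$) produces $s_{n-1}$ and Cout.]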

• Slow - 2n logical levels

• Small constant (CMOS)

• Other ways exist
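A ripple carry adder can be simulated bit by bit. The sketch below writes the full adder sum as an XOR, which is equivalent to the two-level sum-of-products form; all function names are illustrative.

```python
def half_adder(a, b):
    return a ^ b, a & b                      # (s, c_out)

def full_adder(a, b, c_in):
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out

def ripple_carry_add(a_bits, b_bits):
    """Add two equal-length bit lists (least significant bit first)."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)   # the carry ripples to the next stage
        out.append(s)
    return out, carry

def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

bits, c_out = ripple_carry_add(to_bits(0b11001101, 8), to_bits(0b10000110, 8))
print(c_out, format(from_bits(bits), "08b"))   # 1 01010011 (n+1 bits in total)
```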

Adding Signed Integers

• In 2’s complement:

$b + (-a) = b + (2^n - a) = 2^n + (b - a)$

• hence - add as integers, discard carry out

• Example: 0011 + 1100 = ?

$(-b) + (-a) = (2^n - b) + (2^n - a) = 2^n + (2^n - (b + a))$

Subtracting Integers

• Add the negation

• Negating 2’s complement:

11010100101011000110000 = ?00001001010110101001110

Integer (unsigned) Multiplication

• Elementary school:

        1 1 0 1
      * 1 0 0 1
      ---------
        1 1 0 1
      0 0 0 0
    0 0 0 0
  1 1 0 1
---------------
0 1 1 1 0 1 0 1

• Result has 2n bits!

Hardware Multiplier

• P=0

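• loop:

(i) if A0=1, add B to P

(ii) right-shift P & A

[Figure: the n-bit registers P and A shift right together, with a carry bit feeding into P; the n-bit register B is the other adder input.]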
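The add-and-shift loop can be sketched in software. This is a minimal model of the P, A, B registers under the assumption that both operands fit in n bits; the function name is made up for the illustration.

```python
def shift_add_multiply(a, b, n):
    """Multiply two n-bit unsigned integers with the P/A register scheme."""
    p = 0
    for _ in range(n):
        if a & 1:              # (i) if A0 = 1, add B to P
            p += b             #     P may briefly need n+1 bits (the carry)
        # (ii) right-shift (carry, P, A) as one register: P's low bit enters A
        a = (a >> 1) | ((p & 1) << (n - 1))
        p >>= 1
    return (p << n) | a        # 2n-bit product: high half in P, low half in A

print(format(shift_add_multiply(0b1101, 0b1001, 4), "08b"))   # 01110101
```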

Integer (unsigned) Division

• Elementary school: 1101 / 11

        0 1 0 0
1 1 ) 1 1 0 1
      1 1
      -------
      0 0 0 1

Result: 0100, Rem 1

Dec: 13/3=4, Rem 1

Hardware Divider

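• P=0

• loop:

(i) left-shift P & A

(ii) Sub. B from P:

positive: a0=1

negative: a0=0, restore P (add B)

[Figure: the (n+1)-bit register P and the n-bit register A shift left together, with 0 shifted into A; B is an (n+1)-bit register.]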

Example

• 13 / 3 = 4, remainder 1

• n=4

• A=1101 B=00011 P=00000

Start: P = 00000, A = 1101, B = 00011

End: P = 00001 (Remainder), A = 0100 (Quotient), B = 00011
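The restoring divider loop can be traced with a short sketch. It models P and A as Python integers and assumes an n-bit unsigned dividend; the function name is illustrative.

```python
def restoring_divide(a, b, n):
    """Divide the n-bit unsigned integer a by b; returns (quotient, remainder)."""
    if b == 0:
        raise ZeroDivisionError("check for 0 before dividing")
    p = 0
    for _ in range(n):
        # (i) left-shift (P, A) as one register: A's top bit enters P
        p = (p << 1) | ((a >> (n - 1)) & 1)
        a = (a << 1) & ((1 << n) - 1)
        # (ii) subtract B from P
        p -= b
        if p >= 0:
            a |= 1             # positive: a0 = 1
        else:
            p += b             # negative: a0 = 0, restore P (add B)
    return a, p                # quotient ends up in A, remainder in P

print(restoring_divide(0b1101, 0b0011, 4))   # (4, 1)
```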

Division - remarks

• Non-restoring Algorithm

• Load P only if positive

• Check for 0

• (Total) Result is 2n bits!

Integer arithmetic - remarks

• Signed Multiply and Division

– Algorithms exist

– We will not use them

• What to do with extra bits?

• Faster methods

Floating Point

Non Integers - Other Methods

• Fixed Point

– example: # # # . #

– Binary point shifted

– Integer arithmetic (extra shifting)

– Small number magnitude

• Rational

– $a/b$ ($a, b \in \mathbb{Z}$)

Floating Point

• Exponent + Significand (= Mantissa)

• $x = s \cdot 2^e$

• Example:

s = 101, e = 011: $x = 101_2 \cdot 2^{11_2} = 5 \cdot 2^3 = 40 = 101000_2$

Uniqueness

• Denormal Numbers: $123.456 \cdot 10^7$

$0.123 \cdot 10^4$

• Normalized: #.### $\cdot 10^{\#}$

$1.123 \cdot 10^4$

• What about 0 ?

Floating Point Standard

• Why Standardize?

– Hardware accelerators

– Software compatibility

– Build Software Libraries

– etc.

• IEEE 754-1985 ISO/IEC 559

• Includes: Structure, Arithmetic results

Float Types

• 4 Precision Types:

– Single

– Single extended

– Double

– Double extended

Single Precision

• 32 bits:

• Exponent (e): Biased ( + 127)

• Significand (f): Fixed fraction: 0 . # # # …

• Number: $1.f \cdot 2^{e-127}$

Bit layout: Sign (1) | Exponent (8) | Significand (23)

Single Precision - Example

• 1 10000001 01000000000000000000000

• 10000001 = 129, so the exponent is 129 - 127 = 2

• 01000… = 0.01000…, so the significand is $1.01_2 = 1.25$

• $X = -1.25 \cdot 2^2$

• X = -5
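The decoding steps of this example can be reproduced with a short sketch. The helper names are hypothetical; the last line uses the standard `struct` module only as a cross-check of the hand decoding.

```python
import struct

def decode_single(bits):
    """Split a 32-bit pattern into (sign, biased exponent e, fraction f)."""
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def normal_value(bits):
    """Value of a normalized single: (-1)**sign * 1.f * 2**(e - 127)."""
    sign, e, f = decode_single(bits)
    if not 1 <= e <= 254:
        raise ValueError("e = 0 and e = 255 are reserved for special values")
    return (-1) ** sign * (1 + f / 2 ** 23) * 2.0 ** (e - 127)

bits = int("1" "10000001" "01000000000000000000000", 2)
print(decode_single(bits))                               # (1, 129, 2097152)
print(normal_value(bits))                                # -5.0
print(struct.unpack(">f", bits.to_bytes(4, "big"))[0])   # -5.0
```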

Single Precision - Range

• Emax = 127 (e = 254)

• Emin = -126 (e = 1)

• Why |Emin|<|Emax|?

– $1/2^{E_{min}}$ does not overflow

• Why Biased notation?

• What about 0 and 255 ?

Floating Point Precision

                Single   Single Extended   Double   Double Extended
Format Width    32       43                64       80
Precision       24       32                53       64
Emax            +127     1023              1023     16383
Emin            -126     -1022             -1022    -16382
Exp. Width      8        11                11       15
Exp. Bias       127                        1023

Examples

• We shall use base 10 sometimes:

• f will have 3 digits

• Emax will be 98

• Emin will be -97

• Ex: $5.34 \cdot 10^{70}$

NaN

• Not a Number

• Result of illegal computation:

– $0/0$, $0 \cdot \infty$, $\infty + (-\infty)$, $\mathrm{rem}(x, 0)$, $\mathrm{rem}(\infty, y)$

– Any computation involving a NaN

• e = Emax + 1 & f ≠ 0

• # 11111111 #######################

• Many NaN’s (different f’s)

NaN’s in use

• Zero finder probing outside the domain

– f(x) = sqrt(x) - 1

• Works since every computation involving a NaN yields a NaN

• No exception caused !

Zeros

• 0 00000000 00000000000000000000000 ?

• this is NOT $1.0 \cdot 2^{E_{min}}$

• 1 00000000 00000000000000000000000 ?

0 is signed! +0 and -0 both exist!

• What is the difference?

Signed 0’s

• +0 = -0 BUT:

• Multiply/Divide keep sign rules: $3 \cdot (+0) = (+0)$, $3 \cdot (-0) = (-0)$

• Motivation:

– Using inf correctly (described later)

– log(x): log(0) = -inf, log(negative) = NaN

– log(x) if x = (-0)?

± inf

• More logic: $1/(+0) = +\infty$, $1/(-0) = -\infty$, $1/(+\infty) = +0$, $1/(-\infty) = -0$

• e = Emax + 1 & f = 0

• # 11111111 00000000000000000000000
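Signed zeros and infinities can be observed from Python, with one caveat: Python raises `ZeroDivisionError` for float division by zero instead of returning an infinity, so this sketch starts from `float("inf")` and never divides by zero.

```python
import math

inf = float("inf")
pos_zero = 1 / inf                    # 1/(+inf) = +0
neg_zero = 1 / -inf                   # 1/(-inf) = -0

print(pos_zero == neg_zero)           # True: +0 = -0 when compared
print(math.copysign(1.0, neg_zero))   # -1.0: but the sign bit is still there
print(3 * neg_zero)                   # -0.0: multiply keeps the sign rules
print(math.isnan(inf - inf))          # True: an illegal computation gives NaN
```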

Inf usage Example

$\cos^{-1}(x) = 2\tan^{-1}\sqrt{\frac{1-x}{1+x}}$

(If $\tan^{-1}$ is defined properly)

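More on 0’s and inf’s

• General Rule for 0/inf arithmetic:

– Take appropriate limit: $3/(+\infty) = \lim_{x \to +\infty} 3/x = 0$, $3/(-0) = \lim_{x \to 0^-} 3/x = -\infty$

• 1/(1/x) where x=0 or inf

• Why not Max # instead?

– $x = 3 \cdot 10^{70}$, $y = 4 \cdot 10^{70}$: $\sqrt{x^2 + y^2} \to \sqrt{9.99 \cdot 10^{98}} = 3.16 \cdot 10^{49}$

– Answer: $5 \cdot 10^{70}$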

Zero’s and inf’s - yet again

• $x/(x^2+1)$ is bad! Why?

• $1/(x + x^{-1})$ is better

• Do we need to check for x=0?

• Using 2 zero’s and inf’s saves some special cases checks.
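The difference between the two formulas shows up for large x, where the square overflows to inf. The sketch below uses doubles and illustrative function names. Python itself raises `ZeroDivisionError` for `1 / 0.0`, so the x = 0 case that IEEE arithmetic handles through inf would still need a check in this language.

```python
import math

def naive(x):
    return x / (x * x + 1)       # x * x overflows to inf for large x

def rewritten(x):
    return 1 / (x + 1 / x)       # no intermediate overflow for large x

x = 1e200
print(naive(x))                  # 0.0, although the true value is about 1e-200
print(rewritten(x))              # close to 1e-200
```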

Denormalized numbers

• Example:

– $x = 1.23 \cdot 10^{-98}$, $y = 1.11 \cdot 10^{-98}$

– $x - y = 1.20 \cdot 10^{-99} = 0$

– so: $x - y = 0$ but: $x \neq y$

– think of: if ($x \neq y$) then $z = 1/(x - y)$

• Solution:

– use denormalized numbers!

Denormal Numbers

• Smallest normal: $1.0 \cdot 2^{E_{min}}$

• Below, use denormal: $0.f \cdot 2^{E_{min}}$

• e = Emin - 1 & f ≠ 0

• # 00000000 #######################

• Gradual underflow: $1.23 \cdot 10^{-4}$ ( /10 ) → $0.12 \cdot 10^{-4}$ ( /10 ) → $0.01 \cdot 10^{-4}$ ( /10 ) → 0

Denormal Numbers

• Back to our Example:

– $x = 1.23 \cdot 10^{-98}$, $y = 1.11 \cdot 10^{-98}$

– $x - y = 0.12 \cdot 10^{-98}$

– and this is not 0!
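The same effect can be seen with binary doubles. This sketch picks two numbers just above the smallest normal double, so that their difference is a denormal; the variable names are illustrative.

```python
import sys

smallest_normal = sys.float_info.min   # 2**-1022, the smallest normal double
x = 1.5 * smallest_normal
y = 1.25 * smallest_normal
d = x - y                              # 0.25 * 2**-1022, below the normal range

print(x != y, d != 0.0)                # True True: x != y implies x - y != 0
print(0 < d < smallest_normal)         # True: d is a denormal number
```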

Flush to 0 Vs Gradual Underflow

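[Figure: two number lines with ticks at $0$, $2^{-4}$, $2^{-3}$, $2^{-2}$, $2^{-1}$, one for flush to 0 and one for gradual underflow.]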

Special Values - Summary

Exponent           Fraction   Represents
Emin-1             f = 0      ±0
Emin-1             f ≠ 0      $0.f \cdot 2^{E_{min}}$
Emin ≤ e ≤ Emax    ----       $1.f \cdot 2^{e}$
Emax+1             f = 0      ±∞
Emax+1             f ≠ 0      NaN
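The special-value encodings from the zero, denormal, inf and NaN slides can be turned into a small classifier for single precision bit patterns. The function name and the returned labels are made up for this sketch.

```python
def classify_single(bits):
    """Classify a 32-bit single precision pattern by its e and f fields."""
    e = (bits >> 23) & 0xFF        # biased exponent field
    f = bits & 0x7FFFFF            # fraction field
    if e == 0:                     # stands for Emin - 1
        return "zero" if f == 0 else "denormal"
    if e == 255:                   # stands for Emax + 1
        return "infinity" if f == 0 else "NaN"
    return "normal"

for pattern in (0x00000000, 0x80000000, 0x00000001, 0x7F800000, 0x7FC00000, 0xC0A00000):
    print(f"{pattern:08X}", classify_single(pattern))
```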

Rounding

• Why is rounding needed?

• Infinitely many numbers, finite representation

• Integers only overflow

• Almost all operations need rounding

• IEEE - specifies algorithms for arithmetic

Numbers need rounding

• Out of range:

– $x > 2 \cdot 2^{E_{max}}$, $x < 1 \cdot 2^{E_{min}}$

• Between 2 floats:

– $0.1_{10} = 0.00011001100\ldots_2 = 1.1001100\ldots_2 \cdot 2^{-4}$

– rounded: $1.1001_2 \cdot 2^{-4}$
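That 0.1 falls between two floats can be seen directly in a double. The sketch relies only on `float.hex` and `fractions.Fraction` from the standard library.

```python
from fractions import Fraction

x = 0.1
print(x.hex())                          # 0x1.999999999999ap-4: 1.1001100..._2 * 2**-4, rounded
print(Fraction(x) == Fraction(1, 10))   # False: the stored value is not exactly 1/10
```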

Measuring Error

• ULPS (units in last place)

– $1.12 \cdot 10^{-1}$ vs 0.1124: 0.4 ulps

– $1.12 \cdot 10^{-1}$ vs 0.1118: 0.2 ulps

• Relative Error

– Difference/Original

– $1.12 \cdot 10^{-1}$ vs 0.1124: Err = 0.0004/0.1124 ≈ 0.0036

Calculate Using Rounding

• Benign cancellation

– Calculate 10.1 - 9.93 (= 0.17)

  $1.01 \cdot 10^1$
- $0.99 \cdot 10^1$
= $0.02 \cdot 10^1 = 2.00 \cdot 10^{-1}$

– 30 ulps!

Rounding problems

• Catastrophic cancellation

– $b^2 - 4ac$

– both $b^2$ and $4ac$ are rounded

– the (-) exposes the error

– b=3.34 a=1.22 c=2.28

$b^2 = 11.2$, $4ac = 11.1$, $b^2 - 4ac = 0.10$

correct = 0.0292 (error: 70.8 ulps)
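The three-digit decimal arithmetic of this example can be imitated with the standard `decimal` module. This is only a sketch of the slide's toy format: a context precision of 3 stands in for the 3-digit significand.

```python
from decimal import Decimal, localcontext

a, b, c = Decimal("1.22"), Decimal("3.34"), Decimal("2.28")

with localcontext() as ctx:
    ctx.prec = 3                       # three significant digits
    b2 = b * b                         # 11.1556 rounds to 11.2
    four_ac = 4 * a * c                # 11.1264 rounds to 11.1
    rounded = b2 - four_ac             # 0.1

exact = b * b - 4 * a * c              # 0.0292 at the default precision
ulp = Decimal("0.001")                 # unit in the last place of 1.00e-1
print(rounded, exact, (rounded - exact) / ulp)   # 0.1 0.0292 70.8
```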

IEEE Arithmetic

• Requirement:

+, -, ×, /, √ should be EXACTLY rounded

remainder should be EXACTLY rounded

Integer conv. should be EXACTLY rounded

• Not all operations (transcendental functions, binary to decimal conversion)

• “Tie break” - Round to Even

Round to Even

• How will 1.005 be rounded?

– Round Up: 1.01

– Round Even: 1.00

• Why? Example:

– $x_i = (x_{i-1} + y) - y$, $x_0 = 1.00$, $y = 0.125$

– Round up: 1.00, 1.01, 1.02, …

– Round even: 1.00, 1.00, 1.00, …
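The two tie-breaking rules for 1.005 can be compared with the rounding modes of the standard `decimal` module. Here `ROUND_HALF_UP` plays the role of "round up" on ties, which is an assumption about what the slide means.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

x = Decimal("1.005")
two_places = Decimal("0.01")

print(x.quantize(two_places, rounding=ROUND_HALF_UP))     # 1.01
print(x.quantize(two_places, rounding=ROUND_HALF_EVEN))   # 1.00
```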

Float Multiplication

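$(s_1 \cdot 2^{e_1}) \cdot (s_2 \cdot 2^{e_2}) = (s_1 \cdot s_2) \cdot 2^{e_1 + e_2}$

$s_1 \cdot s_2$: Integer multiply

$e_1 + e_2$: Biased addition

• “Biased addition”: $(e_1 - 127) + (e_2 - 127) = (e_3 - 127)$, so $e_3 = e_1 + e_2 - 127$

detect Overflow: Use n+1 bit adder

detect Underflow: Harder (Denormals)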
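The split into an integer multiply and a biased addition can be sketched on single precision fields. The helpers below are hypothetical, handle normalized inputs only, and skip the rounding and renormalization steps, so the significand product is kept exact as a `Fraction`.

```python
import struct
from fractions import Fraction

BIAS = 127

def fields(x):
    """(sign, biased exponent, fraction) of x stored as a single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def multiply_fields(a, b):
    """Unrounded product of two normalized singles."""
    sa, ea, fa = fields(a)
    sb, eb, fb = fields(b)
    sign = sa ^ sb                                   # sign bit: XOR
    e3 = ea + eb - BIAS                              # biased addition
    sig = Fraction((1 << 23) | fa, 1 << 23) * Fraction((1 << 23) | fb, 1 << 23)
    return sign, e3, sig                             # sig lies in [1, 4)

def value(sign, e, sig):
    return (-1) ** sign * sig * Fraction(2) ** (e - BIAS)

print(multiply_fields(-5.0, 1.5))          # (1, 129, Fraction(15, 8))
print(value(*multiply_fields(-5.0, 1.5)))  # -15/2
```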

Rounding Multiplication

$1.23 \times 6.78 = 8.3394$: round to 8.34

$2.83 \times 4.47 = 12.6501$: round to $1.27 \cdot 10^1$

$1.28 \times 7.81 = 9.9968$: round to $1.00 \cdot 10^1$

1.00011: round bit 0

1.00100: round bit 1, all rest 0

1.00101: round bit 1, rest not all 0

0.11010: shift needed

Round, Guard, Sticky

0 . 1 1 0 1 0 0 0 1 0

number guard round sticky

1 . 0 0 1 0 0 0 1 0 0

number round sticky

Rounding Multiplication

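[Figure: the P, A, B registers (n bits each) with carry and shift, as in the hardware multiplier.]

Product: $x_0 x_1 . x_2 x_3 x_4 x_5\ g\ r\ s\ s\ s\ s$

Case 1: $x_0 = 0$, shift: result is $x_1 . x_2 x_3 x_4 x_5\, g$

Case 2: $x_0 = 1$, inc. exp: result is $x_0 . x_1 x_2 x_3 x_4 x_5$

Product Results: Round digit, Sticky bit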

Rounding rules

• r=0: rounded OK

• r=1, s=1: add 1 to LSB

• r=1, s=0: add 1 if LSB=1

• Denormals: extra shifting
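The three rounding rules can be written as a small function on integer significands. This is a sketch with a made-up name: it drops the low `extra` bits, treats the first dropped bit as r and the OR of the rest as s, and leaves the renormalization after a carry out to the caller.

```python
def round_to_nearest_even(value, extra):
    """Drop the low `extra` bits (extra >= 1) of a non-negative integer."""
    kept = value >> extra
    r = (value >> (extra - 1)) & 1                    # round bit: first dropped bit
    s = (value & ((1 << (extra - 1)) - 1)) != 0       # sticky bit: OR of the rest
    if r and (s or kept & 1):                         # r=1,s=1: add 1; r=1,s=0: add 1 if LSB=1
        kept += 1
    return kept

for pattern in (0b100011, 0b100100, 0b100101, 0b101100):
    print(format(pattern, "06b"), "->", format(round_to_nearest_even(pattern, 3), "03b"))
```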

Float addition

• Compute all digits and round?

– $1.00 \cdot 2^{20} + 1.00 \cdot 2^{-20} = 10000000\ldots0000001$

– too long!

• Use Round and Sticky bits:

– shift to same exponent

– r = first discarded digit

– s = OR of rest discarded

Float addition - example

Calculate: $1.10011 \cdot 2^0 + 1.10001 \cdot 2^{-5}$

Shift exponents: $1.10011 \cdot 2^0 + 0.0000110001 \cdot 2^0$

  1.10011
+  .00001 | 10001 (discarded)
= 1.10100

r=1, s=0|0|0|1=1

r=1, s=1: Round needed! Result: 1.10101
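The align, add and round steps of this example can be sketched with integer significands. The function below is illustrative only: it assumes two positive normalized operands, each given as a p-bit integer significand (standing for 1.xxx) and an exponent.

```python
def fp_add(s1, e1, s2, e2, p):
    """Add two positive floats given as (p-bit significand, exponent) pairs."""
    if e1 < e2:
        s1, e1, s2, e2 = s2, e2, s1, e1
    shift = e1 - e2                                   # shift to same exponent
    kept = s2 >> shift
    dropped = s2 & ((1 << shift) - 1)
    r = (dropped >> (shift - 1)) & 1 if shift > 0 else 0          # first discarded bit
    s = int(shift > 1 and (dropped & ((1 << (shift - 1)) - 1)) != 0)   # OR of the rest
    total = s1 + kept
    if total >> p:                                    # carry out: shift right, inc. exp
        s, r = r | s, total & 1
        total >>= 1
        e1 += 1
    if r and (s or total & 1):                        # round to nearest, ties to even
        total += 1
        if total >> p:                                # rounding overflowed the significand
            total >>= 1
            e1 += 1
    return total, e1

sig, exp = fp_add(0b110011, 0, 0b110001, -5, 6)
print(format(sig, "06b"), exp)                        # 110101 0, i.e. 1.10101 * 2**0
```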

Signed Addition/Subtraction

• Simplest way: convert to 2’s cmpl.

• Cancellation of high order bit: shift

• more bits cancel: How many guard digits?

Example: 1.00000 - 0.00000101111

2’s cmpl of 0.00000101111 = 1.11111010001

1.00000 + 1.11111 = 0.11111 (carry out dropped)

Float Division

$(s_1 \cdot 2^{e_1}) / (s_2 \cdot 2^{e_2}) = (s_1 / s_2) \cdot 2^{e_1 - e_2}$

$s_1 / s_2$: Integer division

$e_1 - e_2$: Biased subtraction

• Very similar to Multiplication

• Dividing using integer divide

• Compute 2 more bits (round, guard)

• Use remainder as sticky bit (Why?)

• Sign bit: XOR
