rafi4

7/28/2019 rafi4


A VLSI architecture for a Run-time Multi-precision

Reconfigurable Booth Multiplier

Zhou Shun, Oliver A. Pfander, Hans-Jorg Pfleiderer and Amine Bermak

Electrical and Computer Engineering Department

Hong Kong University of Science and Technology, Hong Kong

Email: {eevin,eebermak}@ust.hk

Institute for Microelectronics

University of Ulm, Germany

Email: [email protected]

Abstract: In this paper, a reconfigurable multi-precision Radix-4 Booth multiplier structure is presented. The reconfigurable 8×8 bit multiplier unit can be cascaded to form a multiplier that adapts to variable input precision requirements. The number of bits can be extended by concatenating more stages. For example, four 8×8 bit units can be used to build a 16×16 bit Booth multiplier. In the proposed architecture, the multiplier adapts to different bit-lengths by means of external control signals. The performance of the reconfigurable multiplier is compared with that of a parallel array multiplier and a conventional Booth multiplier. The comparison is based on results obtained by synthesizing all multiplier architectures for a Xilinx FPGA. The overhead resulting from the reconfiguration scheme is also evaluated against conventional Booth and array multipliers.

I. INTRODUCTION

Multiplication is a very important operation in many digital signal processing (DSP) applications. Many multiplication algorithms and their VLSI implementations have been reported in the literature [1] [2]. Because the size, power consumption and silicon area of a multiplier depend strongly on its word-length, it is important to evaluate carefully the precision requirements of the application at hand. If a typical application requires n bit precision, using a lower precision degrades its performance, whereas using more than the required precision wastes energy and area. Fast multipliers with a large word-length consume a large amount of power and take up considerable silicon area. To overcome this problem, many scalable multiplier designs have recently been proposed [5] [6] [7]. In [5], a dedicated circuit to handle different bit-lengths is introduced, but it results in a large overhead and its scalability is limited to a small scale. The structure presented in [7] is highly scalable but not speed-optimized.

In this paper, we propose a new reconfigurable Booth multiplier structure. The design makes it possible to use only the number of bits required by the application at hand, while keeping the reconfigurability overhead to a minimum and limiting the power consumption. Our study focuses on the Booth multiplier, which is known to be faster than other multipliers because of its reduced number of partial products. The proposed multiplier can be reconfigured to adapt to different levels of precision and, owing to its regular structure, is suitable for FPGA implementation as well as for a full-custom design methodology.

II. BOOTH MULTIPLIER ALGORITHM AND ARCHITECTURE

Booth encoding has become a standard technique, widely used to speed up multiplication by reducing the number of partial products. The commonly used Radix-4 Booth multiplier halves the number of required partial products by recoding the multiplier bits. This saves multiplier layout area and reduces delay at the same time, both of which are important design advantages.
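The recoding step described above can be illustrated with a minimal behavioural sketch. It assumes the textbook Radix-4 digit rule d_j = b_(2j) + b_(2j-1) - 2·b_(2j+1) with an implicit zero to the right of bit 0; the function names `booth_radix4_digits` and `booth_multiply` are hypothetical and are not taken from the paper.

```python
def booth_radix4_digits(y: int, bits: int = 8) -> list[int]:
    """Recode a signed `bits`-wide multiplier into Radix-4 Booth digits.

    Each digit lies in {-2, -1, 0, 1, 2}; digit j carries the weight 4**j.
    """
    assert bits % 2 == 0
    assert -(1 << (bits - 1)) <= y < (1 << (bits - 1))
    u = y & ((1 << bits) - 1)      # two's complement bit pattern of y
    prev = 0                       # implicit zero to the right of bit 0
    digits = []
    for j in range(0, bits, 2):
        b0 = (u >> j) & 1          # bit 2j
        b1 = (u >> (j + 1)) & 1    # bit 2j+1
        digits.append(b0 + prev - 2 * b1)
        prev = b1                  # overlap bit for the next digit
    return digits


def booth_multiply(x: int, y: int, bits: int = 8) -> int:
    """Add the bits/2 partial products selected by the Booth digits."""
    digits = booth_radix4_digits(y, bits)
    return sum(d * x * 4**j for j, d in enumerate(digits))
```

An 8 bit multiplier yields four digits, so only four partial products are added instead of eight.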

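According to the sign-generate algorithm [3], the result of adding all the sign extension bits of an 8×8 bit multiplier can be written as

$$\mathrm{Sgn} = s_0 \sum_{i=9}^{16} 2^{i-1} \cdot 2^0 + s_1 \sum_{i=9}^{14} 2^{i-1} \cdot 2^2 + s_2 \sum_{i=9}^{12} 2^{i-1} \cdot 2^4 + s_3 \sum_{i=9}^{10} 2^{i-1} \cdot 2^6 \tag{1}$$

where s_i are the sign bits of the partial products and bit column i, counted from 1, carries the weight 2^(i-1). Using the equivalencies

$$\sum_{i=j}^{k} 2^i = 2^{k+1} - 2^j \tag{2}$$

$$s_i = 1 - \bar{s}_i \tag{3}$$

and discarding carries beyond the 16 bit product width, Sgn becomes:

$$\mathrm{Sgn} = \sum_{j=0}^{3} \bar{s}_j \, 2^{8+2j} + 2^8 + 2^9 + 2^{11} + 2^{13} + 2^{15} \tag{4}$$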
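The step from full sign extension to inverted sign bits plus a fixed constant can be checked exhaustively over the four sign bits. The sketch below assumes the layout of Table I: the sign bit of row j has the weight 2^(8+2j) and is replicated up to bit 15 of the 16 bit product. All names (`sgn_full_extension`, `sgn_inverted`, `CONST`, `set_bits`) are hypothetical, and the constant is computed in the code from identities (2) and (3) rather than typed in.

```python
from itertools import product

WIDTH = 16                                  # product width of an 8x8 multiplier
MASK = (1 << WIDTH) - 1
SIGN_POS = [8 + 2 * j for j in range(4)]    # assumed bit position of sign bit s_j


def sgn_full_extension(s):
    """Sum of all replicated sign bits: s_j occupies bits 8+2j .. 15."""
    total = 0
    for sj, pos in zip(s, SIGN_POS):
        total += sj * sum(1 << i for i in range(pos, WIDTH))
    return total & MASK


# Constant left over after substituting s_j = 1 - not(s_j), modulo 2**16.
CONST = (-sum(1 << pos for pos in SIGN_POS)) & MASK


def sgn_inverted(s):
    """Inverted sign bits at weight 2**(8+2j) plus the fixed constant."""
    total = CONST
    for sj, pos in zip(s, SIGN_POS):
        total += (1 - sj) << pos
    return total & MASK


def set_bits(value):
    """Positions of the 1 bits of `value`, lowest first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


# Both forms agree for every combination of the four sign bits.
assert all(sgn_full_extension(s) == sgn_inverted(s)
           for s in product((0, 1), repeat=4))
```

Under these assumptions `CONST` has five set bits, which is in line with the five constant 1 entries of Algorithm B in Table I.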

Table I illustrates an example of the 8×8 Booth multiplication algorithm with two different sign extension strategies. P denotes a partial product term and cor is the correction factor for a negative number. In the first variant, derived from (1), the partial products need sign extension up to the width of the product before summation can take place. To operate on signed numbers, this algorithm introduces 16 extra terms, obtained by duplicating the leftmost bit of each row. The 16 extra full adders that these terms require significantly increase circuit complexity, power consumption and delay. This traditional design can be modified to support signed numbers in two's complement notation by modifying several of the product terms.

1-4244-1378-8/07/$25.00 © 2007 IEEE.

TABLE I

SIGN EXTENSION FOR BOOTH MULTIPLIER

P08  P08  P08  P08  P08  P08  P08  P08  P07  P06  P05  P04  P03  P02  P01  P00
P18  P18  P18  P18  P18  P18  P17  P16  P15  P14  P13  P12  P11  P10       cor0
P28  P28  P28  P28  P27  P26  P25  P24  P23  P22  P21  P20       cor1
P38  P38  P37  P36  P35  P34  P33  P32  P31  P30       cor2
                                             cor3

Sign Extension Algorithm A

                                   1
                              1    ~P08 P07  P06  P05  P04  P03  P02  P01  P00
                    1    ~P18 P17  P16  P15  P14  P13  P12  P11  P10       cor0
          1    ~P28 P27  P26  P25  P24  P23  P22  P21  P20       cor1
1    ~P38 P37  P36  P35  P34  P33  P32  P31  P30       cor2
                                             cor3

Sign Extension Algorithm B (~ marks an inverted sign bit)

Table I (B) illustrates the Booth multiplier algorithm with the modified sign generation derived from (4). This Booth multiplier, introduced in [4], solves the sign extension problem inherent in conventional Booth multipliers. It requires fewer full adders and therefore improves both area and delay: the algorithm uses only 5 extra terms to handle sign extension, so only 5 extra full adders are required. This algorithm (Algorithm B in Table I) was therefore used as the basic computational block of our Booth multiplier, as it has simpler sign extension logic and a more regular structure than the first algorithm.

The Booth multiplier described above can be built hierarchically from 3 major components: a Booth Encoder, a Partial Product Generator and a Full Adder. Each component needs to be scalable so that the components can be combined to form an 8×8 signed Booth multiplier. In our design, a regular Booth encoding and partial product generation scheme is adopted [8]. Note that several more recent approaches could be used to achieve faster encoding and partial product generation [9] [10], but they may not have a regular structure.

III. SCALABLE AND RECONFIGURABLE UNIT

In this section, we present a multiplier unit structure that is both scalable and reconfigurable. The multiplier element can be used either as a fully functional, independent multiplier with an 8×8 bit operand size, or as a building block to form a multiplier of higher bit-length in steps of 8 bits. Assembling a group of these multiplier elements into a larger processing unit provides the desired scalable precision. By reconfigurability we also mean the option to change the behavior of the building blocks in terms of the role that they play in a concatenated multiplier array.

The principal concept of realizing a run-time reconfiguration option for a concatenated multiplier assembled from uniform elements is based on the use of 2-by-1 multiplexers and is explained in [7]. These data exchange interfaces make it possible to influence the signal path, bypass specific circuit elements and feed in zero signals at perimeter positions. The largest benefit is that a limited set of control signals addressing the multiplexers is enough to re-arrange the elements into a multiplier of increased precision while the system is running, without the need to stop the system, e.g. in order to re-program a look-up table or change the routing.

Providing a flexible array of multiplier elements that can operate either separately or as part of a larger multiplier with increased word-length allows two major ways of using this arithmetic structure. The first is to concatenate several (or all) of the multiplier elements, working at a higher (or the maximum) word-length to obtain increased (or full) precision. The second is to configure the elements to work in stand-alone mode, in order to achieve higher parallelism when the precision requirements are lower.
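The multiplexer idea can be shown on a deliberately simpler circuit than the multiplier itself. The sketch below is an analogy, not the paper's partial product addition stage: two 8 bit adder slices whose connecting carry passes through a 2-by-1 multiplexer, so that one control bit selects between a single 16 bit addition and two independent 8 bit additions. The names `mux2`, `segmented_add` and `concat` are hypothetical.

```python
def mux2(sel: int, in0: int, in1: int) -> int:
    """2-by-1 multiplexer: returns in1 when sel is 1, otherwise in0."""
    return in1 if sel else in0


def segmented_add(a: int, b: int, concat: int) -> int:
    """Add two 16 bit words using two 8 bit slices.

    concat = 1: the slices are chained and behave as one 16 bit adder.
    concat = 0: a zero is fed into the upper slice, giving two separate
                8 bit additions packed into one word.
    """
    lo = (a & 0xFF) + (b & 0xFF)
    carry_lo = lo >> 8
    cin_hi = mux2(concat, 0, carry_lo)      # stand-alone mode feeds in a zero
    hi = ((a >> 8) & 0xFF) + ((b >> 8) & 0xFF) + cin_hi
    return ((hi & 0xFF) << 8) | (lo & 0xFF)
```

Switching `concat` changes the behaviour without touching the slices themselves, which is the property the control signals provide at run time.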

A. Bit-length of the Reconfigurable Unit

The bit-length of the reconfigurable unit needs to be chosen carefully to make the overall design efficient in terms of both area and power consumption. A Signed Array Multiplier (SAM) and a Booth Multiplier (BM) have been simulated and compared for different bit-lengths. The CMOS standard cell library used is VTVT TSMC 0.25 µm. Figure 1 shows the comparison in terms of delay and area usage. The delay is given in nanoseconds, while the area is normalized to that of a 4 bit SAM. At a bit-length of 4 bits, the SAM is slightly faster than the BM and occupies much less area. When the bit-length is 8 bits or more, the BM surpasses the SAM in both delay and area. Table II summarizes the comparison by showing the Power-Delay Product. At a bit-length of 8 bits, the BM has a 40% advantage in efficiency over the SAM when power and delay are considered simultaneously. Moreover, a word-length of 8 bits is widely used for low precision applications in most DSP systems. Therefore, we choose 8 bits as the bit-length of the reconfigurable unit.
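The Percentage column of Table II is not defined in the text. A plausible reading, and an assumption of this sketch, is the relative advantage SAM/BM - 1; the values below are copied from Table II and the name `bm_advantage_percent` is hypothetical.

```python
# Products copied from Table II: bit-length -> (SAM, BM).
TABLE_II = {
    4: (2372.0, 3207.96),
    8: (31610.7, 22467.75),
    12: (124423.0, 84252.48),
    16: (329212.8, 207367.2),
}

# Percentage column as printed in Table II.
TABLE_II_REPORTED = {4: -26.0, 8: 40.0, 12: 48.0, 16: 59.0}


def bm_advantage_percent(sam: float, bm: float) -> float:
    """Assumed definition: how much larger the SAM product is than the BM one."""
    return (sam / bm - 1.0) * 100.0


for bits, (sam, bm) in TABLE_II.items():
    print(f"{bits:2d} bit: {bm_advantage_percent(sam, bm):+.1f}%")
```

With this definition every row agrees with the printed column to within one percentage point, which supports the reading but does not prove it.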

[Figure: two panels, "Area Comparison" (area versus bit-length) and "Delay Comparison" (maximum delay in ns versus bit-length), each with curves for SAM and BM]

Fig. 1. Comparison between Signed Array Multiplier & Booth Multiplier

B. Scalability of the Multiplier Unit

The scalable design unit can be cascaded to form an arbitrarily large multiplier, so that a larger multiplier can be built for different word-lengths. The major advantage of this multiplier structure is its highly regular structure and thus the reusability of its basic arithmetic primitives. Using this core and introducing a supplementary peripheral circuit, we are able to build multipliers of different word-lengths. The desired connectivity is achieved by employing multiplexers with corresponding control input signals. We directly amend our multiplier core to handle signed numbers by introducing various modifications and a few extra circuit elements.

TABLE II

AREA-DELAY PRODUCT

Bit-length   SAM        BM          Percentage
4            2372       3207.96     -26%
8            31610.7    22467.75    40%
12           124423     84252.48    48%
16           329212.8   207367.2    59%

Fig. 2 (A) shows an example of scalable 8×8 Booth multiplier units that are cascaded to form a 16×16 Booth multiplier. Following the same approach, an 8n×8n Booth multiplier can be built in the fashion shown in Fig. 2 (B). To obtain such a scalable design, 3 components of the Booth multiplier have to be scalable, namely the Booth encoder, the partial product generator and the partial product addition units.

[Figure: block diagrams (A) and (B) of cascaded multiplier units of types SB1 and SB2, each with ports Ctrl, A, B, I, Co, Cin and P, with "zero" inputs and a "Product" output]

Fig. 2. Architecture View.

The Booth encoder is a very repetitive and modular structure that can be scaled to form a larger unit. The modified Booth encoder takes 3 bits of the multiplier and outputs 3 bits of Booth code. When several bits of the multiplier are grouped together, each group needs to take in the last bit of the previous group as its first bit, as depicted in Fig. 3 (B). The first group feeds in a zero for its first bit. The partial product generator (PPG) has a similarly regular structure, which can be scaled quite easily. Like the Booth encoder, each PPG group takes in several multiplicand bits, see Fig. 3 (A). A group accepts the last bit of the previous group as its first bit. The first group feeds in a zero for its first bit, and the last group duplicates the last bit of the multiplicand for its last bit.
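The grouping rule for the encoder can be sketched as follows. The signal names X1, X2 and NEG appear as labels in Fig. 3; their meaning here (select 1× the multiplicand, select 2× the multiplicand, negate) is an assumption, as are the 8 bit group size and the function names `booth_encode`, `encode_multiplier` and `decode`.

```python
def booth_encode(b_hi: int, b_mid: int, b_lo: int) -> tuple[int, int, int]:
    """Map 3 multiplier bits to an assumed 3 bit Booth code (X1, X2, NEG)."""
    digit = b_lo + b_mid - 2 * b_hi       # Radix-4 digit in {-2, ..., 2}
    return int(abs(digit) == 1), int(abs(digit) == 2), int(digit < 0)


def encode_multiplier(y: int, groups: int) -> list[tuple[int, int, int]]:
    """Encode an 8*groups bit signed multiplier, one 8 bit group at a time."""
    u = y & ((1 << (8 * groups)) - 1)
    codes = []
    prev = 0                              # the first group feeds in a zero
    for g in range(groups):
        chunk = (u >> (8 * g)) & 0xFF
        for j in range(0, 8, 2):
            mid = (chunk >> j) & 1
            hi = (chunk >> (j + 1)) & 1
            codes.append(booth_encode(hi, mid, prev))
            prev = hi                     # last bit carries over, also across groups
    return codes


def decode(codes: list[tuple[int, int, int]]) -> int:
    """Rebuild the multiplier value from its Booth codes (for checking)."""
    total = 0
    for j, (x1, x2, neg) in enumerate(codes):
        magnitude = x1 + 2 * x2
        total += (-magnitude if neg else magnitude) * 4**j
    return total
```

Because the only link between two groups is the single carried-over bit, the same encoder group can be repeated for any multiple of 8 bits.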

The difficulty in designing such a reconfigurable multiplier lies in the partial product addition (PPA) stage. The Booth multiplier operates on 2's complement numbers, which are signed numbers. A sign extension circuit is required to manipulate the MSBs of each row. As shown in Fig. 2, a block in the leftmost position is configured differently to handle sign extension, while a block in any other position should pass

[Figure: partial product generator (PPG) and n-bit Booth Encoder blocks, with signals X1, X2 and NEG]


Fig. 4. Structure of PPA block of 8x8 bits

using Xilinx ISE 9.1i, targeting a Xilinx Virtex-4 FPGA device XC4VLX15 (package SF363, speed grade -12). The results are extracted after the place-and-route procedure.

The 16 bit reconfigurable multiplier is built as depicted in Fig. 2 (A), while the 24 bit and 32 bit implementations follow the structure shown in Fig. 2 (B). Tables III to V show the simulation results. In general, the Booth multiplier takes up more area than the array multiplier because of the Booth encoding and partial product generation circuits. The reconfigurable multiplier requires some overhead logic compared to a conventional Booth Multiplier (BM), estimated at about 1%. When multipliers with a word-length of 16 bits or more are built, only the sign extension circuits of the leftmost blocks are in use. The sign extension circuits of the remaining blocks are redundant, and they contribute most of the area overhead. There is also some extra circuitry for the mode control signal and signal routing.

Table IV shows the delay comparison. The Booth multipliers demonstrate a great advantage over the array multiplier. The reconfigurable multiplier increases the delay by up to 2% compared to the BM. In the partial product addition stage, the signals are routed through some extra multiplexers for mode control, which adds some delay overhead.

Table V reports the power consumption, expressed in nW, for input signal frequencies of about 120 MHz and a supply voltage of 1.2 V. Our reconfigurable design shows a small overhead in power consumption over the normal Booth multiplier. In summary, our approach gains multi-precision capability from its reconfigurable and scalable design, while the Booth encoding scheme inherently provides the improvement in delay. According to Tables III to V, the overhead remains limited to about 1% in area, 2% in delay and 1.2% in power.

TABLE III

AREA IN EQUIVALENT GATE COUNT

Word-length   Array Multiplier   Booth Multiplier   Reconfigurable Multiplier   Overhead vs. Booth
16 bit        3,054              3,255              3,279                       0.74%
24 bit        7,005              7,467              7,509                       0.56%
32 bit        12,429             13,341             13,473                      0.99%

TABLE IV

DELAY IN NS

Word-length   Array Multiplier   Booth Multiplier   Reconfigurable Multiplier   Overhead vs. Booth
16 bit        33.533             28.355             28.921                      2.00%
24 bit        54.445             42.037             42.756                      1.71%
32 bit        72.986             55.700             56.591                      1.60%

TABLE V

POWER IN NW

Word-length   Array Multiplier   Booth Multiplier   Reconfigurable Multiplier   Overhead vs. Booth
16 bit        321.38             303.05             302.01                      0.34%
24 bit        549.92             499.47             499.65                      0.04%
32 bit        825.31             732.00             740.32                      1.14%
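The overhead columns of Tables III to V can be recomputed from the Booth and reconfigurable columns. The sketch below assumes the definition (reconfigurable - Booth) / Booth; the numbers are copied from the tables and the name `overhead_percent` is hypothetical.

```python
# word-length -> (Booth multiplier, reconfigurable multiplier, reported overhead in %)
AREA = {16: (3255, 3279, 0.74), 24: (7467, 7509, 0.56), 32: (13341, 13473, 0.99)}
DELAY = {16: (28.355, 28.921, 2.00), 24: (42.037, 42.756, 1.71), 32: (55.700, 56.591, 1.60)}
POWER = {16: (303.05, 302.01, 0.34), 24: (499.47, 499.65, 0.04), 32: (732.00, 740.32, 1.14)}


def overhead_percent(booth: float, reconfigurable: float) -> float:
    """Assumed definition of the 'Overhead vs. Booth' column."""
    return (reconfigurable - booth) / booth * 100.0


for name, table in (("area", AREA), ("delay", DELAY), ("power", POWER)):
    for bits, (booth, reconf, reported) in table.items():
        computed = overhead_percent(booth, reconf)
        print(f"{name:5s} {bits} bit: computed {computed:+.2f}%, reported {reported:.2f}%")
```

Under this definition the 16 bit power entry comes out negative, since the reconfigurable value in Table V is slightly lower than the Booth value; the table lists its magnitude.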

V. CONCLUSION

In this paper, a scalable, multi-precision Booth multiplier VLSI architecture is proposed. The structure can be configured into different operation modes according to the precision requirements. It can be used to form a multiplier of higher word-length from smaller building blocks. Because of its regular and scalable structure, it can also be configured to accept either several parallel inputs of small word-length or one input of large word-length. Comparisons with Booth and array multipliers at different word-lengths show improved performance in terms of delay over the array multiplier, inherently obtained as a result of the adopted Booth encoding scheme. According to Tables III to V, the overhead remains limited to about 1% in area, 2% in delay and 1.2% in power. The reconfigurability and scalability of the proposed parallel Booth multiplier make it a very good candidate for multi-precision and high speed digital signal processing applications.

ACKNOWLEDGMENT

The work described in this paper is supported by a grant

from the Research Grant Council of HK SAR and the DAAD

German research Grant council (Project No G-HK019/05).

REFERENCES

[1] N. Ohkubo et al., "A 4.4 ns CMOS 54 x 54-b Multiplier Using Pass-Transistor Multiplexer," IEEE J. of Solid-State Circuits, Vol. 30, No. 3, pp. 251-257, March 1995.

[2] K.-S. Cho et al., "54x54-bit Radix-4 Multiplier based on Modified Booth Algorithm," GLSVLSI'03, pp. 233-237, 2003.

[3] R. Fried, "Minimizing Energy Dissipation in High-speed Multipliers," IEEE Int. Symp. Low-Power Electronics and Design, pp. 214-219, 1997.

[4] M. Annaratone, Digital CMOS Circuit Design, Kluwer Academic Publishers, Boston, 1986.

[5] H. Lee, "A Power-Aware Scalable Pipelined Booth Multiplier," IEEE Int. SOC Conference, pp. 123-126, 2004.

[6] Y. Kolla et al., "A Novel 32-bit Scalable Multiplier Architecture," GLSVLSI'03, pp. 241-244, 2003.

[7] O.A. Pfander, R. Hacker and H.-J. Pfleiderer, "A Multiplier-Based Concept for Reconfigurable Multiplier Arrays," Int. Conf. on Field Programmable Logic and its Applications, pp. 938-942, 2004.

[8] F.S. Anderson et al., "The IBM system 360/91 floating point execution unit," IBM J. Res. Develop., vol. 11, pp. 34-53, Jan. 1967.

[9] M.K. Gowan, L.L. Biro, and D.B. Jackson, "Power considerations in the design of the Alpha 21264 microprocessor," in Proc. 35th Design and Automation Conf., pp. 726-731, 1998.

[10] N.H.E. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley Publishing Company, 1993.
