7/28/2019 rafi4
A VLSI architecture for a Run-time Multi-precision
Reconfigurable Booth Multiplier
Zhou Shun, Oliver A. Pfander, Hans-Jorg Pfleiderer and Amine Bermak
Electrical and Computer Engineering Department
Hong Kong University of Science and Technology, Hong Kong
Email: {eevin,eebermak}@ust.hk
Institute for Microelectronics
University of Ulm, Germany
Email: [email protected]
Abstract: In this paper, a reconfigurable multi-precision Radix-4 Booth multiplier structure is presented. The reconfigurable 8×8 bit multiplier unit can be cascaded to form a multiplier that adapts to variable input precision requirements. The number of bits can be extended by concatenating more stages. For example, four 8×8 bit units can be used to build a 16×16 bit Booth multiplier. In our proposed architecture, the multiplier adapts to different bit-lengths by using external control signals. The performance of our reconfigurable multiplier is compared with that of a parallel array multiplier and a conventional Booth multiplier. The comparison is based on synthesis results obtained by synthesizing all multiplier architectures for a Xilinx FPGA. The overhead resulting from our reconfiguration scheme is also evaluated and compared to conventional Booth and array multipliers.
I. INTRODUCTION
Multiplication is a very important operation in many digital signal processing (DSP) applications. Many multiplication algorithms and their VLSI implementations have been reported in the literature [1] [2]. Because the size, power consumption and silicon area of a multiplier depend strongly on its word-length, it is important to carefully evaluate the precision requirements of the application at hand. If a typical application requires n bit precision, then using a lower precision would degrade the performance, whereas using more than the required precision would waste energy and area. Fast multipliers with large word-lengths consume a large amount of power and take considerable silicon area. To overcome this problem, many scalable multiplier designs have recently been proposed [5] [6] [7]. In [5], a dedicated circuit to handle different bit-lengths is introduced, but it results in a large overhead and its scalability is limited to a small scale. The structure presented in [7] is highly scalable but not speed-optimized.
In this paper, we propose a new reconfigurable Booth multiplier structure. Our design makes it possible to use only the number of bits required by the application at hand, while keeping the reconfigurability overhead to a minimum and limiting the power consumption. Our study focuses on the Booth multiplier, which is known to provide a higher speed than other multipliers because of its reduced number of partial products. Our proposed multiplier can also be reconfigured to adapt to different levels of precision and, owing to its regular structure, is suitable for FPGA implementation as well as for a full-custom design methodology.
II. BOOTH MULTIPLIER ALGORITHM AND ARCHITECTURE
Booth encoding has become a standard technique, widely used to speed up multiplication by reducing the number of partial products. The commonly used Radix-4 Booth multiplier halves the number of required partial products by recoding the multiplier bits. This saves layout area and reduces delay at the same time, which yields important design advantages.
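The halving of the partial product count can be illustrated with a short behavioral sketch. It is not the circuit of this paper: it assumes the textbook radix-4 recoding rule, in which each digit is formed from two multiplier bits plus the top bit of the previous pair, and the function names `booth_radix4_digits` and `booth_multiply` are hypothetical.

```python
def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Recode an n-bit two's complement multiplier into n/2 digits in {-2..2}."""
    if n % 2:
        raise ValueError("n must be even")
    y &= (1 << n) - 1              # keep only the n-bit two's complement pattern
    digits = []
    prev = 0                       # implicit bit to the right of the LSB
    for i in range(0, n, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)   # textbook radix-4 Booth digit
        prev = b1
    return digits


def booth_multiply(x: int, y: int, n: int = 8) -> int:
    """Sum one partial product (digit * x, weighted by 4**k) per Booth digit."""
    return sum(d * x * 4**k for k, d in enumerate(booth_radix4_digits(y, n)))
```

For an 8 bit multiplier this yields four digits, so four partial products are summed instead of eight.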
According to the sign-generate algorithm [3], the result of adding all the sign extension bits of an 8×8 bit multiplier can be written as

$$ Sgn = s_0 \sum_{i=9}^{16} 2^i \cdot 2^0 + s_1 \sum_{i=9}^{14} 2^i \cdot 2^2 + s_2 \sum_{i=9}^{12} 2^i \cdot 2^4 + s_3 \sum_{i=9}^{10} 2^i \cdot 2^6 \qquad (1) $$

where $s_i$ are the sign bits of the partial products. Using the equivalences

$$ \sum_{i=j}^{k} 2^i = 2^{k+1} - 2^j \qquad (2) $$

$$ s_i = 1 - \bar{s}_i \qquad (3) $$

Sgn becomes:

$$ Sgn = \sum_{j=0}^{3} \bar{s}_j \, 2^{8+2j} + 2^8 + 2^9 + 2^{10} + 2^{11} + 2^{12} \qquad (4) $$
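The idea behind replacing replicated sign bits with inverted sign bits plus a fixed constant can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' derivation: it assumes 0-indexed bit weights, a 16 bit product, and the sign bit of row j at weight 2^(8+2j); the indexing convention of the printed sums may differ. All names are hypothetical.

```python
N_ROWS = 4                 # partial product rows of an 8x8 radix-4 Booth multiplier
WIDTH = 16                 # product width in bits
MASK = (1 << WIDTH) - 1


def sign_position(j: int) -> int:
    """Assumed weight exponent of the sign bit of row j (0-indexed)."""
    return 8 + 2 * j


def replicated_sign_sum(s: list[int]) -> int:
    """Full sign extension: copy each sign bit up to the product MSB."""
    total = 0
    for j, sj in enumerate(s):
        total += sj * sum(1 << i for i in range(sign_position(j), WIDTH))
    return total & MASK


def compact_sign_sum(s: list[int]) -> int:
    """Inverted sign bit per row plus one constant, modulo 2**WIDTH."""
    const = (-sum(1 << sign_position(j) for j in range(N_ROWS))) & MASK
    return (sum((1 - sj) << sign_position(j) for j, sj in enumerate(s)) + const) & MASK


def extra_terms_full_extension() -> int:
    """Replicated bits beyond each row's own sign bit."""
    return sum(WIDTH - 1 - sign_position(j) for j in range(N_ROWS))


def extra_terms_compact() -> int:
    """Number of constant ones needed by the compact form."""
    const = (-sum(1 << sign_position(j) for j in range(N_ROWS))) & MASK
    return bin(const).count("1")
```

Under these assumptions the two forms agree for every combination of sign bits, and the counts of extra terms come out as 16 and 5, matching the figures quoted in the text for the two sign extension variants.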
Table I illustrates an example of the 8×8 Booth multiplication algorithm with different sign extension strategies. P denotes a partial product term and cor is the correction factor for a negative number. In the first variant, derived from (1), the partial products need sign extension up to the width of the product before summation can take place. In order to operate on signed numbers, this algorithm introduces 16 extra terms obtained by duplicating the leftmost bit of each row. This leads to a significant increase in circuit complexity, power consumption and delay due to the 16 extra full adders. This traditional design can be modified to support signed numbers in two's complement notation by modifying several of the product terms.
1-4244-1378-8/07/$25.00 © 2007 IEEE.
TABLE I
SIGN EXTENSION FOR BOOTH MULTIPLIER
P08  P08  P08  P08  P08  P08  P08  P08  P07  P06  P05  P04  P03  P02  P01  P00
P18  P18  P18  P18  P18  P18  P17  P16  P15  P14  P13  P12  P11  P10       cor0
P28  P28  P28  P28  P27  P26  P25  P24  P23  P22  P21  P20       cor1
P38  P38  P37  P36  P35  P34  P33  P32  P31  P30       cor2
                                             cor3
Sign Extension Algorithm A
                                   1
                              1    P08  P07  P06  P05  P04  P03  P02  P01  P00
                    1    P18  P17  P16  P15  P14  P13  P12  P11  P10       cor0
          1    P28  P27  P26  P25  P24  P23  P22  P21  P20       cor1
1    P38  P37  P36  P35  P34  P33  P32  P31  P30       cor2
                                             cor3
Sign Extension Algorithm B
Table I (B) illustrates the Booth multiplier algorithm with a modified sign generation derived from (4). This Booth multiplier, introduced in [4], was used to solve the sign extension problem inherent in conventional Booth multipliers. It features a reduced number of full adders and therefore optimizes both area and delay. The algorithm uses 5 extra terms to handle the sign extension, so only 5 extra full adders are required. This algorithm (Algorithm B in Table I) was therefore used as the basic computational block in our Booth multiplier, as it offers simpler sign extension logic and a more regular structure than the first algorithm.
The Booth multiplier described above can be built hierarchically from 3 major components: a Booth Encoder, a Partial Product Generator and a Full Adder. Each component needs to be scalable so that they can be combined to form an 8×8 signed Booth multiplier. In our design, a regular Booth encoding and partial product generation scheme is adopted [8]. Note that several more recent approaches could be used to achieve faster encoding and partial product generation [9] [10], but they may not be regularly structured.
III. SCALABLE AND RECONFIGURABLE UNIT
In this section, we present our multiplier unit structure, which is both scalable and reconfigurable. The multiplier element can be used either as a fully functional, independent multiplier with an 8×8 bit operand size, or as a building block to form a multiplier of higher bit-length in steps of 8 bit. Assembling a group of these multiplier elements into a larger processing unit provides the desired scalable precision. By reconfigurability we also mean the option to change the behavior of the building blocks in terms of the role that they play in a concatenated multiplier array.
The principle of realizing a run-time reconfiguration option for a concatenated multiplier assembled from uniform elements is based on the use of 2-by-1 multiplexers and is explained in [7]. These data exchange interfaces make it possible to influence the signal path, bypass specific circuit elements and feed in zero signals at perimeter positions. The largest benefit is that a limited set of control signals addressing the multiplexers is enough to re-arrange the elements into a multiplier of increased precision while the system is running, without the need to stop the system, e.g. in order to re-program a look-up table or change the routing.
By providing a flexible array of multiplier elements that can operate either separately or as part of a larger multiplier with increased word-length, two major approaches to using this arithmetic structure become possible. The first approach is to concatenate several (or all) of the multiplier elements, working at a higher (or maximum) word-length to ensure an increased (or full) precision. The second approach is to configure the elements to work in stand-alone mode, in order to achieve a higher parallelism when the precision requirements are lower.
A. Bit-length of the Reconfigurable Unit
The bit-length of the reconfigurable unit needs to be chosen carefully to make the overall design efficient in terms of both area and power consumption. Simulations and comparisons between a Signed Array Multiplier (SAM) and a Booth Multiplier (BM) have been performed for different bit-lengths. The CMOS standard cell library used is VTVT TSMC 0.25 µm. Figure 1 shows a comparison in terms of delay and area usage. The delay is shown in nanoseconds, while the area is normalized to that of a 4 bit SAM. At a bit-length of 4 bit, the SAM is slightly faster than the BM and occupies much less area. When the bit-length is increased to 8 bit or more, the BM surpasses the SAM in both delay and area. Table II summarizes the comparison by showing the Power-Delay Product. At a bit-length of 8 bit, the BM has a 40% advantage in efficiency over the SAM when power and delay are considered simultaneously. Also, in most DSP systems, a word-length of 8 bit is widely used for low precision applications. Therefore, we choose 8 bits as the bit-length of the reconfigurable unit.
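The Percentage column of Table II is not defined in the text. The sketch below assumes it is the relative difference (SAM - BM) / BM of the tabulated product values; this definition is inferred from the numbers and reproduces the printed entries only to within one percentage point. The names `TABLE_II` and `bm_advantage_percent` are hypothetical.

```python
# Bit-length -> (SAM value, BM value, printed percentage), copied from Table II.
TABLE_II = {
    4: (2372.0, 3207.96, -26),
    8: (31610.7, 22467.75, 40),
    12: (124423.0, 84252.48, 48),
    16: (329212.8, 207367.2, 59),
}


def bm_advantage_percent(sam: float, bm: float) -> float:
    """Assumed definition: relative difference of SAM over BM, in percent."""
    return 100.0 * (sam - bm) / bm


if __name__ == "__main__":
    for bits, (sam, bm, printed) in TABLE_II.items():
        print(f"{bits:2d} bit: computed {bm_advantage_percent(sam, bm):6.2f}%, printed {printed}%")
```

The sign change between 4 bit and 8 bit is what motivates the choice of 8 bit as the unit size.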
[Figure 1 plots: "Area Comparison" (area versus bit-length) and "Delay Comparison" (maximum delay in ns versus bit-length), each with curves for SAM and BM.]
Fig. 1. Comparison between Signed Array Multiplier & Booth Multiplier
B. Scalability of the Multiplier Unit
The scalable design unit can be cascaded to form an arbitrarily large multiplier, so that larger multipliers can be built for different word-lengths. The major advantage of this multiplier structure is its highly regular structure and thus the reusability of its basic arithmetic primitives. Using this core and introducing a supplementary peripheral circuit, we are able to build multipliers with different word-lengths. The desired connectivity is achieved by employing multiplexers with corresponding control input signals. We directly amend our multiplier core to handle signed numbers by introducing various modifications and a few extra circuit elements.
TABLE II
AREA-DELAY PRODUCT
Bit-length   SAM        BM          Percentage
4            2372       3207.96     -26%
8            31610.7    22467.75    40%
12           124423     84252.48    48%
16           329212.8   207367.2    59%
Fig. 2 (A) shows an example of scalable 8×8 Booth multiplier units cascaded to form a 16×16 Booth multiplier. Following the same approach, an 8n×8n Booth multiplier can be built in the fashion shown in Fig. 2 (B). In order to obtain such a scalable design, 3 components of the Booth multiplier have to be scalable, namely the Booth encoder, the partial product generator and the partial product addition units.
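Why four 8×8 units suffice for a 16×16 product, and n² units for an 8n×8n product, follows from a simple arithmetic identity, sketched below. This is only a numerical illustration: it treats the lower bytes as unsigned and the top byte as signed, and it does not model the carry, sign extension or multiplexer wiring between the units. The function names are hypothetical.

```python
def split_bytes(v: int, n: int) -> list[int]:
    """Split a signed 8n-bit integer into n bytes, least significant first.

    All bytes are unsigned except the top one, which carries the sign.
    """
    parts = []
    for _ in range(n - 1):
        parts.append(v & 0xFF)   # unsigned lower byte
        v >>= 8                  # arithmetic shift preserves the sign
    parts.append(v)              # signed top byte
    return parts


def multiply_from_8x8(a: int, b: int, n: int = 2) -> tuple[int, int]:
    """Return (a * b, number of 8x8 products used) for signed 8n-bit operands."""
    a_parts, b_parts = split_bytes(a, n), split_bytes(b, n)
    products = [
        (i + j, ai * bj)                     # (byte shift, 8x8 product)
        for i, ai in enumerate(a_parts)
        for j, bj in enumerate(b_parts)
    ]
    total = sum(p << (8 * shift) for shift, p in products)
    return total, len(products)
```

For example, `multiply_from_8x8(-12345, 6789)` uses four byte-level products, and the same call with `n=4` on 32 bit operands uses sixteen.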
[Figure 2 diagrams: arrays of multiplier units labeled SB1 and SB2, each with ports Ctrl, A, B, I, Co, Cin and P, with zero inputs and a combined Product output; panels (A) and (B).]
Fig. 2. Architecture View.
The Booth encoder is a very repetitive and modular structure that can be scaled to form a larger unit. The modified Booth encoder takes 3 bits of the multiplier and outputs 3 bits of Booth code. When several bits of the multiplier are grouped together, each group needs to take in the last bit of the previous group as its first bit, as depicted in Fig. 3 (B). The first group feeds in a zero for its first bit. The partial product generator (PPG) has a similar regular structure, which can be scaled quite easily. Like the Booth encoder, each PPG group takes in several multiplicand bits, see Fig. 3 (A). A group accepts the last bit of the previous group as its first bit. The first group feeds in a zero for its first bit and the last group duplicates the last bit of the multiplicand for its last bit.
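The hand-over of one bit between encoder groups can be sketched behaviorally. The code below assumes 8 bit groups and the textbook radix-4 digit rule; `encode_group` and `encode_cascaded` are hypothetical names, and the 3 bit Booth code of the actual encoder is replaced by signed digits for readability.

```python
def encode_group(bits8: int, first_bit: int) -> tuple[list[int], int]:
    """Booth-encode one 8-bit group of the multiplier.

    first_bit is the last (most significant) bit of the previous group,
    or 0 for the first group. Returns the four digits and the bit to
    hand over to the next group.
    """
    digits = []
    prev = first_bit
    for i in range(0, 8, 2):
        b0 = (bits8 >> i) & 1
        b1 = (bits8 >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits, prev


def encode_cascaded(y: int, n_groups: int) -> list[int]:
    """Encode an 8*n_groups-bit two's complement multiplier group by group."""
    y &= (1 << (8 * n_groups)) - 1
    digits, hand_over = [], 0          # the first group feeds in a zero
    for g in range(n_groups):
        group_digits, hand_over = encode_group((y >> (8 * g)) & 0xFF, hand_over)
        digits += group_digits
    return digits
```

With the hand-over bit forced to zero, each group behaves as a stand-alone 8 bit encoder; with the bit chained, two groups encode a 16 bit multiplier.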
The difficulty in designing such a reconfigurable multiplier lies in the partial product addition (PPA) stage. The Booth multiplier operates on 2's complement numbers, which are signed numbers. A sign extension circuit is required to manipulate the MSBs of each row. As shown in Fig. 2, a block in the leftmost position is addressed differently in order to handle sign extension, while a block in any other position should pass
[Figure 3 residue: signal labels X1, X2 and NEG; block label recoverable as "n-bit Booth Encoder".]
Fig. 4. Structure of PPA block of 8x8 bits
using Xilinx ISE 9.1i targeting a Xilinx Virtex-4 FPGA device XC4VLX15 (package SF363, speed grade -12), and the results are extracted after the place and route procedure.
The 16 bit reconfigurable multiplier is built as depicted in Fig. 2 (A), while the 24 bit and 32 bit implementations follow the structure shown in Fig. 2 (B). Tables III to V present the simulation results. In general, the Booth multiplier takes up more area than the array multiplier because of the Booth encoding and partial product generation circuits. The reconfigurable multiplier requires some overhead logic compared to a conventional Booth Multiplier (BM), estimated at about 1%. When multipliers with a word-length of 16 bit or more are built, only the sign extension circuits in the leftmost blocks are used. The sign extension circuits in the remaining blocks are redundant and account for most of the overhead. There is also some extra circuitry for the mode control signal and signal routing.
Table IV shows the delay comparison. The Booth multipliers show a clear advantage over the array multiplier. The reconfigurable multiplier increases the delay by up to 2% compared to the BM. In the partial product addition stage, the signals are routed through some extra multiplexers for mode control, which adds some delay overhead.
Table V reports the power consumption expressed in nW for input signal frequencies of about 120 MHz and a supply voltage of 1.2 V. Our reconfigurable design shows a small overhead in power consumption over a normal Booth multiplier. In summary, our approach gains multi-precision from its reconfigurable and scalable design, while the improved delay performance is inherently obtained as a result of its Booth encoding scheme. The overhead in terms of area and power remains limited to about 2% and 1.2%, respectively.
TABLE III
AREA IN EQUIVALENT GATE COUNT
Word-length   Array Multiplier   Booth Multiplier   Reconfigurable Multiplier   Overhead vs. Booth
16 bit        3,054              3,255              3,279                       0.74%
24 bit        7,005              7,467              7,509                       0.56%
32 bit        12,429             13,341             13,473                      0.99%
TABLE IV
DELAY IN NS
Word-length   Array Multiplier   Booth Multiplier   Reconfigurable Multiplier   Overhead vs. Booth
16 bit        33.533             28.355             28.921                      2.00%
24 bit        54.445             42.037             42.756                      1.71%
32 bit        72.986             55.700             56.591                      1.60%
TABLE V
POWER IN NW
Word-length   Array Multiplier   Booth Multiplier   Reconfigurable Multiplier   Overhead vs. Booth
16 bit        321.38             303.05             302.01                      0.34%
24 bit        549.92             499.47             499.65                      0.04%
32 bit        825.31             732.00             740.32                      1.14%
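The overhead columns of Tables III to V can be reproduced from the tabulated values, assuming the overhead is the relative difference between the reconfigurable and the conventional Booth multiplier. The sketch below is a plausibility check only; the names `RESULTS` and `overhead_percent` are hypothetical. One detail it makes visible: for the 16 bit power entry the tabulated reconfigurable value is slightly below the Booth value, so the computed difference is negative and the printed 0.34% matches only in magnitude.

```python
# metric -> word-length -> (Booth, reconfigurable, printed overhead in %)
RESULTS = {
    "area": {16: (3255, 3279, 0.74), 24: (7467, 7509, 0.56), 32: (13341, 13473, 0.99)},
    "delay": {16: (28.355, 28.921, 2.00), 24: (42.037, 42.756, 1.71), 32: (55.700, 56.591, 1.60)},
    "power": {16: (303.05, 302.01, 0.34), 24: (499.47, 499.65, 0.04), 32: (732.00, 740.32, 1.14)},
}


def overhead_percent(booth: float, reconfigurable: float) -> float:
    """Relative overhead of the reconfigurable design over plain Booth, in percent."""
    return 100.0 * (reconfigurable - booth) / booth


if __name__ == "__main__":
    for metric, rows in RESULTS.items():
        for bits, (booth, reconf, printed) in rows.items():
            computed = overhead_percent(booth, reconf)
            print(f"{metric:5s} {bits} bit: computed {computed:+.2f}%, printed {printed:.2f}%")
```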
V. CONCLUSION
In this paper, a scalable and multi-precision Booth multiplier VLSI architecture is proposed. This structure can be configured for different operation modes under different precision requirements. It can be used to form a multiplier of higher word-length based on smaller building blocks. Because of its regular and scalable structure, it can also be configured to accept either parallel inputs of small word-length or one input of large word-length. Comparisons at different word-lengths with Booth and array multipliers show improved performance in terms of delay, inherently obtained as a result of the adopted Booth encoding scheme. The overhead in terms of area and power remains limited to about 2% and 1.2%, respectively. The reconfigurability and scalability features of our proposed parallel Booth multiplier make it a very good candidate for multi-precision and high speed digital signal processing applications.
ACKNOWLEDGMENT
The work described in this paper is supported by a grant
from the Research Grant Council of HK SAR and the DAAD
German research Grant council (Project No G-HK019/05).
REFERENCES
[1] N. Ohkubo et al., "A 4.4 ns CMOS 54 x 54-b Multiplier Using Pass-Transistor Multiplexer," IEEE J. of Solid-State Circuits, Vol. 30, No. 3, pp. 251-257, March 1995.
[2] K.-S. Cho et al., "54x54-bit Radix-4 Multiplier based on Modified Booth Algorithm," GLSVLSI'03, pp. 233-237, 2003.
[3] R. Fried, "Minimizing Energy Dissipation in High-speed Multipliers," IEEE Int. Symp. Low-Power Electronics and Design, pp. 214-219, 1997.
[4] M. Annaratone, Digital CMOS circuit Design, Kluwer Academic Publishers, Boston, 1986.
[5] H. Lee, "A Power-Aware Scalable Pipelined Booth Multiplier," IEEE Int. SOC Conference, pp. 123-126, 2004.
[6] Y. Kolla et al., "A Novel 32-bit Scalable Multiplier Architecture," GLSVLSI'03, pp. 241-244, 2003.
[7] O.A. Pfander, R. Hacker and H.-J. Pfleiderer, "A Multiplier-Based Concept for Reconfigurable Multiplier Arrays," Int. Conf. on Field Programmable Logic and its Applications, pp. 938-942, 2004.
[8] F.S. Anderson et al., "The IBM system 360/91 floating point execution unit," IBM J. Res. Develop., vol. 11, pp. 34-53, Jan. 1967.
[9] M.K. Gowan, L.L. Biro, and D.B. Jackson, "Power considerations in the design of the Alpha 21264 microprocessor," in Proc. 35th Design and Automation Conf., pp. 726-731, 1998.
[10] N.H.E. Weste and K. Eshraghian, Principles of CMOS VLSI Design. Addison-Wesley Publishing Company, 1993.