
  • 7/28/2019 rafi4

    1/4

    A VLSI architecture for a Run-time Multi-precision

    Reconfigurable Booth Multiplier

    Zhou Shun, Oliver A. Pfander, Hans-Jorg Pfleiderer and Amine Bermak

    Electrical and Computer Engineering Department

    Hong Kong University of Science and Technology, Hong Kong

    Email: {eevin,eebermak}@ust.hk

    Institute for Microelectronics

    University of Ulm, Germany

    Email: [email protected]

    Abstract: In this paper, a reconfigurable multi-precision Radix-4 Booth multiplier structure is presented. The reconfigurable 8×8 bit multiplier unit can be cascaded to form a multiplier that adapts to variable input precision requirements. The number of bits can be extended by concatenating more stages. For example, four 8×8 bit units can be used to build a 16×16 bit Booth multiplier. In our proposed architecture, the multiplier adapts to different bit-lengths by means of external control signals. The performance of our reconfigurable multiplier is compared with that of a parallel array multiplier and a conventional Booth multiplier. The comparison is based on synthesis results obtained by synthesizing all multiplier architectures for a Xilinx FPGA. The overhead resulting from our reconfiguration scheme is also evaluated against the conventional Booth and array multipliers.

    I. INTRODUCTION

    Multiplication is a very important operation in many digital signal processing (DSP) applications. Many multiplication algorithms and their VLSI implementations have been reported in the literature [1] [2]. Because the size, power consumption and silicon area of a multiplier depend strongly on its word-length, it is important to evaluate carefully the precision requirements of the application at hand. If a typical application requires n bit precision, using a lower precision degrades the performance, whereas using more than the required precision wastes energy and area. Fast multipliers with a large word-length consume a large amount of power and take considerable silicon area. To overcome this problem, several scalable multiplier designs have recently been proposed [5] [6] [7]. In [5], a dedicated circuit to handle different bit-lengths is introduced, but it results in a large overhead and its scalability is limited to a small scale. The structure presented in [7] is highly scalable but not speed-optimized.

    In this paper, we propose a new reconfigurable Booth multiplier structure. Our design makes it possible to use exactly the number of bits required by the application at hand while keeping the reconfigurability overhead to a minimum and limiting the power consumption at the same time. Our study focuses on the Booth multiplier, which is known to be faster than other multipliers because of its reduced number of partial products. Our proposed multiplier can also be reconfigured to adapt to different levels of precision and, owing to its regular structure, is suitable for FPGA implementation as well as for a full-custom design methodology.

    II. BOOTH MULTIPLIER ALGORITHM AND ARCHITECTURE

    Booth encoding has become a standard technique for speeding up multiplication by reducing the number of partial products. The commonly used Radix-4 Booth multipliers reduce the number of required partial products to 50% by recoding the multiplier bits. This saves multiplier layout area and reduces delay at the same time, both of which are important design advantages.
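The halving of the partial products can be illustrated with a short functional model. The sketch below is not taken from the paper: it uses the textbook radix-4 recoding rule (digit = y[2j-1] + y[2j] - 2*y[2j+1], with an implicit zero to the right of the LSB), and the function names `booth_radix4_digits` and `booth_multiply` are hypothetical.

```python
def booth_radix4_digits(y: int, n: int = 8) -> list[int]:
    """Recode an n-bit two's-complement multiplier into n/2 digits in {-2, ..., 2}."""
    if n % 2:
        raise ValueError("n must be even")
    pattern = y & ((1 << n) - 1)  # two's-complement bit pattern of y
    # bits[0] is the implicit zero to the right of the LSB; bits[k + 1] is y_k.
    bits = [0] + [(pattern >> k) & 1 for k in range(n)]
    return [bits[2 * j] + bits[2 * j + 1] - 2 * bits[2 * j + 2]
            for j in range(n // 2)]


def booth_multiply(x: int, y: int, n: int = 8) -> int:
    """Add one partial product per Booth digit; digit j carries weight 4**j."""
    return sum(d * x * 4**j for j, d in enumerate(booth_radix4_digits(y, n)))


if __name__ == "__main__":
    digits = booth_radix4_digits(-77)
    # An 8-bit multiplier yields 4 digits, i.e. 4 partial products instead of 8.
    print(len(digits), booth_multiply(23, -77) == 23 * -77)
```

Each digit selects 0, ±x or ±2x, so every partial product is obtained from the multiplicand by a shift and an optional negation.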

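    According to the sign-generate algorithm [3], the result of adding all the sign extension bits of an 8×8 bit multiplier can be written as

    $$\mathrm{Sgn} = s_0 \left(\sum_{i=9}^{16} 2^i\right) 2^0 + s_1 \left(\sum_{i=9}^{14} 2^i\right) 2^2 + s_2 \left(\sum_{i=9}^{12} 2^i\right) 2^4 + s_3 \left(\sum_{i=9}^{10} 2^i\right) 2^6 \tag{1}$$

    where $s_i$ are the sign bits of the partial products. Using the equivalences

    $$\sum_{i=j}^{k} 2^i = 2^{k+1} - 2^j \tag{2}$$

    $$s_i = 1 - \bar{s}_i \tag{3}$$

    Sgn becomes:

    $$\mathrm{Sgn} = \sum_{j=0}^{3} \bar{s}_j \, 2^{8+2j} + 2^8 + 2^9 + 2^{10} + 2^{11} + 2^{12} \tag{4}$$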

    Table I illustrates an 8×8 Booth multiplication with two different sign extension strategies. P denotes a partial product term and cor is the correction factor for a negative number. In the first variant, derived from (1), the partial products must be sign-extended up to the width of the product before summation can take place. To operate on signed numbers, this algorithm introduces 16 extra terms obtained by duplicating the leftmost bit of each row. The 16 extra full adders significantly increase circuit complexity, power consumption and delay. This traditional design can be modified to support signed numbers in two's complement notation by modifying several of the product terms.

    1-4244-1378-8/07/$25.00 © 2007 IEEE

    TABLE I
    SIGN EXTENSION FOR BOOTH MULTIPLIER

    P08  P08  P08  P08  P08  P08  P08  P08  P07  P06  P05  P04  P03  P02  P01  P00
    P18  P18  P18  P18  P18  P18  P17  P16  P15  P14  P13  P12  P11  P10       cor0
    P28  P28  P28  P28  P27  P26  P25  P24  P23  P22  P21  P20       cor1
    P38  P38  P37  P36  P35  P34  P33  P32  P31  P30       cor2
                                                 cor3

    Sign Extension Algorithm A

                                       1
                                  1    P08  P07  P06  P05  P04  P03  P02  P01  P00
                        1    P18  P17  P16  P15  P14  P13  P12  P11  P10       cor0
              1    P28  P27  P26  P25  P24  P23  P22  P21  P20       cor1
    1    P38  P37  P36  P35  P34  P33  P32  P31  P30       cor2
                                                 cor3

    Sign Extension Algorithm B
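The equivalence of the two layouts in Table I can be checked numerically. The sketch below makes several assumptions that the text does not state explicitly: bit positions are 0-indexed, row j is a 9-bit two's-complement pattern whose sign bit has weight 2^(8+2j), the sum is truncated to 16 bits, the sign bit of each row is inverted in variant B as (3) suggests, and the cor terms are left out because they are identical in both variants. Under this convention the constant works out to 2^8 + 2^9 + 2^11 + 2^13 + 2^15, which matches the placement of the five 1s in Table I (B); the constant terms listed in (4) may follow a different indexing. All names are hypothetical.

```python
import random

WIDTH, N_ROWS = 16, 4
MASK = (1 << WIDTH) - 1


def sum_variant_a(rows):
    """Variant A: replicate each row's sign bit up to bit 15, then add."""
    total = 0
    for j, p in enumerate(rows):                 # p: 9-bit pattern, sign in bit 8
        sign = (p >> 8) & 1
        extended = p | (0xFE00 if sign else 0)   # copy the sign into bits 9..15
        total += extended << (2 * j)
    return total & MASK


def sum_variant_b(rows):
    """Variant B: inverted sign bit, a 1 to its left, plus one extra 1 at bit 8."""
    total = 1 << 8
    for j, p in enumerate(rows):
        sign = (p >> 8) & 1
        row = (1 << 9) | ((sign ^ 1) << 8) | (p & 0xFF)
        total += row << (2 * j)
    return total & MASK


# Extra terms: replicated sign bits in A versus constant 1s in B.
EXTRA_TERMS_A = sum(WIDTH - 1 - (8 + 2 * j) for j in range(N_ROWS))
EXTRA_TERMS_B = N_ROWS + 1
CONSTANT_B = (1 << 8) + sum(1 << (9 + 2 * j) for j in range(N_ROWS))


def variants_agree(trials=5000, seed=0):
    """Compare both variants on random 9-bit rows and a few corner cases."""
    rng = random.Random(seed)
    cases = [[rng.randrange(512) for _ in range(N_ROWS)] for _ in range(trials)]
    cases += [[0] * N_ROWS, [0x1FF] * N_ROWS, [0x100] * N_ROWS]
    return all(sum_variant_a(r) == sum_variant_b(r) for r in cases)


if __name__ == "__main__":
    print(variants_agree(), EXTRA_TERMS_A, EXTRA_TERMS_B, hex(CONSTANT_B))
```

The counts computed here agree with the 16 and 5 extra terms quoted in the text for the two algorithms.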

    Table I (B) illustrates the Booth multiplier algorithm with the modified sign generation derived from (4). This Booth multiplier, introduced in [4], solves the sign extension problem inherent in conventional Booth multipliers. It requires fewer full adders and therefore improves both area and delay: the algorithm uses 5 extra terms to handle sign extension, so only 5 extra full adders are required. This algorithm (Algorithm B in Table I) is therefore used as the basic computational block in our Booth multiplier, as it has simpler sign extension logic and a more regular structure than the first algorithm.

    The Booth multiplier described above can be built hierarchically from 3 major components: a Booth encoder, a partial product generator and a full adder. Each component needs to be scalable so that the components can be combined to form an 8×8 signed Booth multiplier. In our design, a regular Booth encoding and partial product generation scheme is adopted [8]. Many recent approaches achieve faster encoding and partial product generation [9] [10], but they may not have a regular structure.

    III. SCALABLE AND RECONFIGURABLE UNIT

    In this section, we present a multiplier unit structure that is both scalable and reconfigurable. The multiplier element can be used either as a fully functional, independent multiplier with an 8×8 bit operand size, or as a building block to form a multiplier of higher bit-length in steps of 8 bit. Assembling a group of these multiplier elements into a larger processing unit provides the desired scalable precision. By reconfigurability we also mean the option to change the behavior of the building blocks in terms of the role that they play in a concatenated multiplier array.

    The principle of realizing a run-time reconfiguration option for a concatenated multiplier assembled from uniform elements is based on the use of 2-by-1 multiplexers and is explained in [7]. These data exchange interfaces make it possible to influence the signal path, bypass specific circuit elements and feed in zero signals at perimeter positions. The largest benefit is that a limited set of control signals addressing the multiplexers is enough to re-arrange the elements into a multiplier of increased precision while the system is running, without the need to stop it, e.g. in order to re-program a look-up table or change the routing.

    By providing a flexible array of multiplier elements that can operate either separately or as part of a larger multiplier with increased word-length, two major ways of using this arithmetic structure become possible. The first is to concatenate several (or all) of the multiplier elements, working at a higher (or maximum) word-length to obtain increased (or full) precision. The second is to configure the elements to work in stand-alone mode, in order to achieve higher parallelism when the precision requirements are lower.

    A. Bit-length of the Reconfigurable Unit

    The bit-length of the reconfigurable unit needs to be chosen carefully to make the overall design efficient in terms of both area and power consumption. Simulations and comparisons between a Signed Array Multiplier (SAM) and a Booth Multiplier (BM) have been performed for different bit-lengths. The CMOS standard cell library used is VTVT TSMC 0.25 µm.

    Figure 1 shows a comparison in terms of delay and area usage. The delay is shown in nanoseconds while the area is normalized to a 4 bit SAM. At a bit-length of 4 bit, SAM is slightly faster than BM and occupies much less area. When the bit-length is increased to 8 bit or more, BM surpasses SAM in both delay and area. Table II summarizes the comparison by showing the Power-Delay Product. At a bit-length of 8 bit, BM has a 40% advantage in efficiency over SAM when power and delay are considered simultaneously. Also, in most DSP systems, a word-length of 8 bit is widely used for low precision applications. Therefore, we choose 8 bits as the bit-length of the reconfigurable unit.
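The Percentage column of Table II can be reproduced from its SAM and BM columns. The paper does not state the formula; the sketch below assumes that the advantage is the relative difference (SAM - BM) / BM, which matches the listed values to within one percentage point. The names `TABLE_II`, `LISTED_PERCENT` and `bm_advantage_percent` are hypothetical.

```python
# (SAM, BM) product figures per bit-length, copied from Table II.
TABLE_II = {
    4: (2372.0, 3207.96),
    8: (31610.7, 22467.75),
    12: (124423.0, 84252.48),
    16: (329212.8, 207367.2),
}
# Percentage column as printed in Table II.
LISTED_PERCENT = {4: -26, 8: 40, 12: 48, 16: 59}


def bm_advantage_percent(sam: float, bm: float) -> float:
    """Relative advantage of BM over SAM, in percent (assumed formula)."""
    return 100.0 * (sam - bm) / bm


if __name__ == "__main__":
    for bits, (sam, bm) in TABLE_II.items():
        computed = bm_advantage_percent(sam, bm)
        print(f"{bits:2d} bit: computed {computed:+.1f}%, listed {LISTED_PERCENT[bits]:+d}%")
```

A negative value, as at 4 bit, means that SAM has the smaller product and is therefore the more efficient choice at that bit-length.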

    [Figure: two panels, "Area Comparison" (Area vs. bit-length) and "Delay Comparison" (Maximum Delay (ns) vs. bit-length), each with SAM and BM curves]

    Fig. 1. Comparison between Signed Array Multiplier & Booth Multiplier

    TABLE II
    AREA-DELAY PRODUCT

    Bit-length    SAM         BM           Percentage
    4             2372        3207.96      -26%
    8             31610.7     22467.75     40%
    12            124423      84252.48     48%
    16            329212.8    207367.2     59%

    B. Scalability of the Multiplier Unit

    The scalable design unit can be cascaded to form an arbitrarily large multiplier, so that a larger multiplier can be built for different word-lengths. The major advantage of this multiplier structure is its highly regular structure and thus the reusability of its basic arithmetic primitives. Using this core and introducing a supplementary peripheral circuit, we are able to build a multiplier with different word-lengths. The desired connectivity is achieved by employing multiplexers with corresponding control input signals. We directly amend our multiplier core to handle signed numbers by introducing various modifications and a few extra circuit elements.

    Fig. 2 (A) shows an example in which scalable 8×8 Booth multiplier units are cascaded to form a 16×16 Booth multiplier. Following the same approach, an 8n×8n Booth multiplier can be built in the fashion shown in Fig. 2 (B). To obtain such a scalable design, 3 components of the Booth multiplier have to be scalable, namely the Booth encoder, the partial product generator and the partial product addition units.

    [Figure: arrays of multiplier blocks labelled SB1 and SB2, each with ports Ctrl, A, B, I, Co, Cin and P; a "zero" input and a "Product" output; panels (A.) and (B.)]

    Fig. 2. Architecture View.
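The arithmetic behind building a 16×16 product from four 8×8 products can be sketched functionally. This is only an illustration of the decomposition, not the cascaded circuit of Fig. 2, which links the units through their Co, Cin and Ctrl ports. The sketch assumes that each signed 16-bit operand is split into a signed high byte and an unsigned low byte, so only the blocks that receive a high byte need signed handling. The names `split16`, `mul16_from_8x8` and `multiplier_array`, and the boolean `wide` standing in for the control signals, are hypothetical.

```python
def split16(v: int) -> tuple[int, int]:
    """Split a signed 16-bit value into (signed high byte, unsigned low byte)."""
    return v >> 8, v & 0xFF   # Python's >> is arithmetic, so the sign is kept


def mul16_from_8x8(a: int, b: int) -> int:
    """Combine four 8x8 products into the 16x16 product."""
    ah, al = split16(a)
    bh, bl = split16(b)
    return ((ah * bh) << 16) + ((ah * bl + al * bh) << 8) + al * bl


def multiplier_array(wide: bool, operands):
    """wide=True: one 16x16 product; wide=False: four independent 8x8 products."""
    if wide:
        a, b = operands
        return [mul16_from_8x8(a, b)]
    return [a * b for a, b in operands]


if __name__ == "__main__":
    print(multiplier_array(True, (-300, 1234)))
    print(multiplier_array(False, [(1, 2), (3, 4), (-5, 6), (7, -8)]))
```

The two branches correspond to the two usage modes described earlier: concatenated operation at a higher word-length, or stand-alone operation with higher parallelism.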

    The Booth encoder is a very repetitive and modular structure that can be scaled to form a larger unit. The modified Booth encoder takes 3 bits of the multiplier and outputs 3 bits of Booth code. When several bits of the multiplier are grouped together, each group needs to take in the last bit of the previous group as its first bit, as depicted in Fig. 3 (B). The first group feeds in a zero for the first bit. The partial product generator (PPG) has a similarly regular structure, which can be scaled quite easily. Like the Booth encoder, each PPG group takes in several multiplicand bits, see Fig. 3 (A). A group accepts the last bit of the previous group as its first bit. The first group feeds in a zero for the first bit and the last group duplicates the last bit of the multiplicand for its last bit.
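The grouping rule for the Booth encoder can be sketched as follows. The text does not give the logic equations of the encoder, so the sketch assumes one common 3-bit code, here called (X1, X2, NEG): X1 selects ±1 times the multiplicand, X2 selects ±2 times, and NEG requests negation. The function names and the group size parameter are hypothetical. The point of the example is that chaining 8-bit groups, each fed with the last bit of the previous group, yields the same codes as one 16-bit encoder.

```python
def encode_triplet(b_hi: int, b_mid: int, b_lo: int) -> tuple[int, int, int]:
    """Map three multiplier bits to an assumed (X1, X2, NEG) Booth code."""
    x1 = b_mid ^ b_lo                                          # select +/-1 x
    x2 = int((b_hi, b_mid, b_lo) in ((1, 0, 0), (0, 1, 1)))    # select +/-2 x
    neg = b_hi                                                 # negate
    return x1, x2, neg


def encode_group(group_bits: list[int], first_bit: int) -> list[tuple[int, int, int]]:
    """Encode one group (LSB first); first_bit is the previous group's last bit."""
    bits = [first_bit] + group_bits
    return [encode_triplet(bits[2 * j + 2], bits[2 * j + 1], bits[2 * j])
            for j in range(len(group_bits) // 2)]


def encode_cascaded(y: int, n: int = 16, group: int = 8) -> list[tuple[int, int, int]]:
    """Encode an n-bit multiplier with chained groups of `group` bits."""
    bits = [(y >> k) & 1 for k in range(n)]
    codes, prev = [], 0                       # the first group feeds in a zero
    for start in range(0, n, group):
        chunk = bits[start:start + group]
        codes += encode_group(chunk, prev)
        prev = chunk[-1]                      # handed on to the next group
    return codes


def booth_digit(code: tuple[int, int, int]) -> int:
    """Signed digit in {-2, ..., 2} represented by an (X1, X2, NEG) code."""
    x1, x2, neg = code
    magnitude = x1 + 2 * x2
    return -magnitude if neg else magnitude


if __name__ == "__main__":
    y = -12345
    print(encode_cascaded(y, 16, 8) == encode_cascaded(y, 16, 16))
```

In this encoding the triplet 111 sets NEG while X1 and X2 are both zero, so the selected partial product is still zero.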

    The difficulty in designing such a reconfigurable multiplier lies in the partial product addition (PPA) stage. The Booth multiplier operates on 2's complement numbers, which are signed numbers. A sign extension circuit is required to manipulate the MSBs of each row. As shown in Fig. 2, a block in the leftmost position is addressed differently in order to handle sign extension, while a block in other positions should pass

    [Figure: PPG blocks driven by the signals X1, X2 and NEG, and an n-bit Booth encoder]

    Fig. 4. Structure of PPA block of 8×8 bits

    using Xilinx ISE 9.1i targeting a Xilinx Virtex-4 FPGA device

    XC4VLX15 (package SF363, speed grade -12) and the results

    are extracted after place and route procedure.

    The 16 bit reconfigurable multiplier is built as depicted in Fig. 2 (A), while the 24 bit and 32 bit implementations follow the structure shown in Fig. 2 (B). Tables III to V present the simulation results. In general, the Booth multiplier takes up more area than the array multiplier because of the Booth encoding and partial product generation circuits. The reconfigurable multiplier requires some overhead logic compared to a conventional Booth Multiplier (BM), estimated at about 1%. When multipliers with a word-length of 16 bit or more are built, only the sign extension circuits in the leftmost blocks are needed. The sign extension circuits in the remaining blocks are redundant and account for most of the overhead. There is also some extra circuitry for the mode control signal and signal routing.

    Table IV shows the delay comparison. The Booth multipliers demonstrate a clear advantage over the array multiplier. The reconfigurable multiplier increases the delay by up to 2% compared to BM. In the partial product addition stage, the signals are routed through some extra multiplexers for mode control, which adds some delay overhead.

    Table V reports the power consumption expressed in nW for input signal frequencies of about 120 MHz and a supply voltage of 1.2 V. Our reconfigurable design presents a small overhead in power consumption over the normal Booth multiplier. In summary, our approach gains multi-precision capability from its reconfigurable and scalable design while improving the performance in terms of delay, a benefit inherent in its Booth encoding scheme. The overhead in terms of area and power remains limited to about 2% and 1.2%, respectively.

    TABLE III
    AREA IN EQUIVALENT GATE COUNT

    Word-length   Array Multiplier   Booth Multiplier   Reconfigurable Multiplier   Overhead vs. Booth
    16 bit        3,054              3,255              3,279                       0.74%
    24 bit        7,005              7,467              7,509                       0.56%
    32 bit        12,429             13,341             13,473                      0.99%

    TABLE IV
    DELAY IN NS

    Word-length   Array Multiplier   Booth Multiplier   Reconfigurable Multiplier   Overhead vs. Booth
    16 bit        33.533             28.355             28.921                      2.00%
    24 bit        54.445             42.037             42.756                      1.71%
    32 bit        72.986             55.700             56.591                      1.60%

    TABLE V
    POWER IN NW

    Word-length   Array Multiplier   Booth Multiplier   Reconfigurable Multiplier   Overhead vs. Booth
    16 bit        321.38             303.05             302.01                      0.34%
    24 bit        549.92             499.47             499.65                      0.04%
    32 bit        825.31             732.00             740.32                      1.14%
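The "Overhead vs. Booth" columns of Tables III to V can be recomputed from the Booth and Reconfigurable columns. The sketch below assumes that the overhead is the relative difference (Reconfigurable - Booth) / Booth, which the paper does not state explicitly; the names `TABLES` and `overhead_percent` are hypothetical. With this formula the 16 bit power entry comes out slightly negative, because the listed reconfigurable figure is lower than the Booth figure, so the tables appear to report magnitudes.

```python
# (Booth, Reconfigurable, listed overhead in %) per word-length, from Tables III to V.
TABLES = {
    "area (gate count)": {
        16: (3255, 3279, 0.74), 24: (7467, 7509, 0.56), 32: (13341, 13473, 0.99),
    },
    "delay (ns)": {
        16: (28.355, 28.921, 2.00), 24: (42.037, 42.756, 1.71), 32: (55.700, 56.591, 1.60),
    },
    "power (nW)": {
        16: (303.05, 302.01, 0.34), 24: (499.47, 499.65, 0.04), 32: (732.00, 740.32, 1.14),
    },
}


def overhead_percent(booth: float, reconfigurable: float) -> float:
    """Signed overhead of the reconfigurable design relative to Booth, in percent."""
    return 100.0 * (reconfigurable - booth) / booth


if __name__ == "__main__":
    for metric, rows in TABLES.items():
        for bits, (booth, reconf, listed) in rows.items():
            computed = overhead_percent(booth, reconf)
            print(f"{metric:18s} {bits} bit: computed {computed:+.2f}%, listed {listed:.2f}%")
```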

    V. CONCLUSION

    In this paper, a scalable and multi-precision Booth multiplier VLSI architecture is proposed. The structure can be configured to different operation modes under different precision requirements. It can be used to form a multiplier of higher word-length based on smaller building blocks. Because of its regular and scalable structure, it can also be configured to accept either parallel inputs of small word-length or one input of large word-length. Comparisons at different word-lengths with Booth and array multipliers show improved performance in terms of delay, a benefit inherent in the adopted Booth encoding scheme. The overhead in terms of area and power remains limited to about 2% and 1.2%, respectively. The reconfigurability and scalability of our proposed parallel Booth multiplier make it a very good candidate for multi-precision and high speed digital signal processing applications.

    ACKNOWLEDGMENT

    The work described in this paper is supported by a grant

    from the Research Grant Council of HK SAR and the DAAD

    German research Grant council (Project No G-HK019/05).

    REFERENCES

    [1] N. Ohkubo et al., "A 4.4 ns CMOS 54 x 54-b Multiplier Using Pass-Transistor Multiplexer," IEEE J. of Solid-State Circuits, Vol. 30, No. 3, pp. 251-257, March 1995.

    [2] K.-S. Cho et al., "54x54-bit Radix-4 Multiplier based on Modified Booth Algorithm," GLSVLSI'03, pp. 233-237, 2003.

    [3] R. Fried, "Minimizing Energy Dissipation in High-speed Multipliers," IEEE Int. Symp. Low-Power Electronics and Design, pp. 214-219, 1997.

    [4] M. Annaratone, Digital CMOS Circuit Design, Kluwer Academic Publishers, Boston, 1986.

    [5] H. Lee, "A Power-Aware Scalable Pipelined Booth Multiplier," IEEE Int. SOC Conference, pp. 123-126, 2004.

    [6] Y. Kolla et al., "A Novel 32-bit Scalable Multiplier Architecture," GLSVLSI'03, pp. 241-244, 2003.

    [7] O.A. Pfander, R. Hacker and H.-J. Pfleiderer, "A Multiplier-Based Concept for Reconfigurable Multiplier Arrays," Int. Conf. on Field Programmable Logic and its Applications, pp. 938-942, 2004.

    [8] F.S. Anderson et al., "The IBM system 360/91 floating point execution unit," IBM J. Res. Develop., vol. 11, pp. 34-53, Jan. 1967.

    [9] M.K. Gowan, L.L. Biro, and D.B. Jackson, "Power considerations in the design of the Alpha 21264 microprocessor," in Proc. 35th Design and Automation Conf., pp. 726-731, 1998.

    [10] N.H.E. Weste and K. Eshraghian, Principles of CMOS VLSI Design. Addison-Wesley Publishing Company, 1993.
