[ieee proceedings of the ieee 2006 custom integrated circuits conference - san jose, ca...



A 0.13 µm Low-power Race-free Programmable Logic Array

Abstract- A PLA using NAND and NOR gates for the AND and OR logic planes, respectively, is presented. The circuit design and its timing and power advantages are described. Nearly 50% power savings over a conventional PLA design are achieved on a 130 nm process at less than 10% delay cost. The new PLA circuit has been fabricated on a 130 nm low standby power process, and tested silicon operates at 905 MHz at VDD = 1.5 V.

I. INTRODUCTION

Programmable Logic Arrays (PLAs) implement two-stage logic functions with wide minterms (product terms) with less delay than when implemented using static CMOS gates. PLAs are commonly NOR-NOR circuits, i.e., NOR gates are used for both logic planes. This is a natural choice given the higher speed of NMOS, pre-charged, or pseudo-NMOS NOR gates [1]. The high DC power dissipation of the pseudo-NMOS approach has made it infeasible in modern designs. Unfortunately, dynamic NOR-NOR PLAs require significant design effort due to an inherent timing race that could result in circuit failure at all frequencies.

A replica timing circuit is typically used to enable the OR plane, the activation of which is the critical circuit race. Loss of this race results in accidental discharge of the OR plane if its inputs, i.e., the outputs of the AND plane, are not yet stable before the OR plane is activated. Performance must be sacrificed to provide sufficient timing margin to ensure design robustness at all process, voltage and temperature (PVT) corners. Increasing variation in future sub-100 nm processes [2, 3] will also require greater margin and design effort for conventional PLA circuits.

Power has achieved primary importance in integrated circuits (ICs) [4]. Pre-charged dynamic NOR gates have very high activity factors and hence large power consumption. Difficulty of design, high power consumption, and the relative ease of synthesizing random logic using standard cells have made PLAs increasingly rare in ICs. However, their high speed continues to make them valuable in very high performance designs [5].

In this paper, a PLA circuit based on a hierarchy of dynamic NAND gates implementing the AND logic plane is presented. The greatly reduced NAND activity factor and reduced clock power provide almost 50% power savings over the conventional design. The NOR-NOR PLA circuit race condition is eliminated, making the design robust and amenable to future semiconductor processes, which will exhibit increased leakage and variability [1-3]. NOR plane delay is reduced by the elimination of the stacked clock (footer) device in each output path. High performance and low power are demonstrated through simulation and measured results on a 130 nm CMOS process.

Section II briefly summarizes conventional NOR-NOR PLA designs for comparison purposes. Section III presents the proposed design, including the circuit architecture and physical design of the PLA. Simulation and test results of the fabricated circuits in Section IV demonstrate a 905 MHz clock rate, with the PLA occupying a single clock phase. Section V concludes.

II. CONVENTIONAL PLA DESIGN

A. Circuits and Operation

The conventional PLA circuit is shown in Fig. 1 [1, 6]. The first stage is the input driver stage, which buffers the inputs and the clock from the pins to the PLA. A 32-input NOR-NOR PLA supporting 32 minterms and 32 sums is the baseline for comparison throughout this paper. The buffered inputs drive footed (D1) domino NOR gates in the AND plane. The pull-down transistor gates are connected, or not, to implement the desired function. The clocked footer NMOS transistor controlled by PLACLK allows evaluation to begin at the rising clock edge.
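As a purely logical illustration of how connected pull-downs define the function, the following sketch models a NOR-NOR PLA at the Boolean level. It ignores clocking, pre-charge, and timing entirely. The function names and the example programming are hypothetical and are not taken from the paper.

```python
def nor(bits):
    """Dynamic NOR: the pre-charged node discharges if any pull-down conducts."""
    return 0 if any(bits) else 1


def product_term(inputs, pulldowns):
    """AND-plane NOR gate.

    pulldowns lists (input_index, line) pairs, where line is "true" or
    "comp" and names the buffered input line tied to that pull-down gate.
    """
    lines = [inputs[i] if line == "true" else inputs[i] ^ 1
             for i, line in pulldowns]
    return nor(lines)


def or_plane_output(terms, connected):
    """OR-plane NOR gate over the connected product terms (active-low sum)."""
    return nor([terms[k] for k in connected])


# Hypothetical programming: term0 = a0 AND (NOT a1), term1 = a2.
# A NOR gate realizes an AND by pulling down on the complement of each literal.
inputs = [1, 0, 0]
terms = [
    product_term(inputs, [(0, "comp"), (1, "true")]),
    product_term(inputs, [(2, "comp")]),
]
orout = or_plane_output(terms, [0, 1])
print(terms, orout)
```

Because both planes are NOR gates, the OR-plane node is the complement of the sum of products, so an asserted product term drives the OR-plane node low.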

Giby Samson and Lawrence T. Clark, Department of Electrical Engineering, Arizona State University, Tempe, Arizona 85287, USA.

IEEE 2006 Custom Integrated Circuits Conference (CICC). 1-4244-0076-7/06/$20.00 ©2006 IEEE

Fig. 1. A conventional PLA circuit, showing the NOR gate foot devices, replica clock, and circuit race path. (Signal labels in the figure: PLA INPUTS, PLACLK, REPLICACLK, ORIN(n), ORIN(m), OROUT(x), PLAOUT(x).)

The OR plane is driven by the AND plane. Since the AND plane outputs begin in the asserted condition, a timing signal must be provided so that OR plane operation begins after all AND planes have fully evaluated. The OR plane synchronizing signal is typically asserted by a replica circuit at the edge of the PLA. Timing between the synchronizing signal REPLICACLK in Fig. 1 and the OR plane inputs ORIN(n) constitutes a classical critical timing race condition. The race must be “won” by ORIN(n) transitioning low before REPLICACLK is asserted high at all PVT corners. The replica should track the delay of the AND plane across process corners, an increasingly difficult proposition in deep submicron processes, which exhibit increased variability [3]. The multiple inverter stages add delay guard band and buffer the OR plane clock but, of course, add to the overall delay.
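The race between ORIN(n) and REPLICACLK can be expressed as a margin check at each PVT corner. The sketch below is illustrative only: the corner names and all delay values are hypothetical placeholders, not data from the paper, and `race_margin_ps` and `failing_corners` are assumed helper names.

```python
# Hypothetical delays in picoseconds; none of these values come from the paper.
corners = {
    "slow":    {"and_plane_ps": 300.0, "replica_ps": 340.0},
    "typical": {"and_plane_ps": 220.0, "replica_ps": 250.0},
    "fast":    {"and_plane_ps": 160.0, "replica_ps": 180.0},
    "skewed":  {"and_plane_ps": 240.0, "replica_ps": 230.0},
}


def race_margin_ps(and_plane_ps, replica_ps):
    """Positive margin means ORIN(n) settles low before REPLICACLK rises."""
    return replica_ps - and_plane_ps


def failing_corners(corner_table):
    """Return the corners where the race is lost (margin not positive)."""
    return sorted(
        name for name, delays in corner_table.items()
        if race_margin_ps(**delays) <= 0.0
    )


print(failing_corners(corners))
```

In this made-up data set, only the corner where the replica is faster than the AND plane loses the race. A conventional design has to add guard band until no corner fails, and that guard band is the delay cost described above.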

As in any dynamic circuit, the output of the OR plane must be latched to hold the output logic state while the OR plane pre-charges in the subsequent phase. Logic built into the latch allows the OR plane NOR gates to be split in two and combined in the latch, reducing the NOR gate fan-in and easing the keeper sizing constraints.

III. LOW POWER RACE FREE NAND-NOR PLA

A. Circuits and Operation

The proposed dynamic NAND-NOR PLA utilizes NAND and NOR gates in the AND plane and OR plane, respectively, eliminating the conventional PLA circuit race. The basic circuit configuration is shown in Fig. 2. Clock power is one of the major power dissipation components in a PLA. In order to reduce the clock loading and improve speed, the “footed” (D1) gates are replaced by “footless” (D2) domino gates in both the AND and OR planes. This reduces the clock loading and provides significant power savings. Clock loading is entirely eliminated from the AND plane by using the clock ANDed signals from the previous stage as pre-chargers for the subsequent stages, as evident in Fig. 2.

To minimize delay and avoid problems with charge-sharing noise, the dynamic NAND gates in the AND plane are limited to four inputs each, which also reduces the output activity factor to one in 16. Compared to high fan-in NOR gates, the NAND gates have the inherent advantage of lower leakage as a result of the “stack effect”. This also eliminates the keeper sizing problems inherent in high fan-in NOR configurations [7]. The NAND gate outputs are combined hierarchically, using the same four-input scheme, to generate the AND plane outputs, so that 16 minterms are generated in each AND half-plane. The logical ANDing combines the clock timing with the logically generated output. No OR plane foot device or separate synchronizing input signal is required. The design is fully pseudo static: each four-input domino NAND has a keeper transistor. To emphasize the overall circuit approach, the keepers are not shown in Fig. 2.
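The one-in-16 activity factor follows from a simple counting argument, sketched below. The sketch assumes independent inputs that are each high with probability 1/2; this is a simplifying assumption made here, not a statement from the paper, and the function names are hypothetical.

```python
from fractions import Fraction


def nand_discharge_probability(fan_in):
    """A dynamic NAND discharges only when every series input is high."""
    return Fraction(1, 2 ** fan_in)


def nor_discharge_probability(fan_in):
    """A dynamic NOR discharges when at least one parallel input is high."""
    return 1 - Fraction(1, 2 ** fan_in)


print(nand_discharge_probability(4))  # 1/16
print(nor_discharge_probability(4))   # 15/16
```

Under the same assumption, a NOR gate of equal fan-in discharges on almost every evaluation. This is consistent with the high activity factor attributed to pre-charged NOR gates in the introduction.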

The AND plane is split into two 16 input sections placed above and below the OR plane, reducing parasitic resistances and capacitances due to signal routing. As mentioned, the input clock to the PLA does not directly drive the dynamic NAND gates in the AND plane. The input signals and their complements are logically ANDed with the input clock to generate differential inputs to the AND plane. This is shown in Fig. 3 where the PLA_INPUT node rises after the clock rise and is only asserted in the high phase of the clock.
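At the logic level, the clock gating of the inputs can be sketched as follows. The signal names PLAIN and PLAINb are borrowed from the labels in Fig. 2. The function name is hypothetical, and the model ignores all delays.

```python
def clocked_inputs(input_bit, clk):
    """Return (PLAIN, PLAINb): the input and its complement, each ANDed with the clock."""
    plain = input_bit & clk
    plainb = (input_bit ^ 1) & clk
    return plain, plainb


# Enumerate every input/clock combination.
for clk in (0, 1):
    for bit in (0, 1):
        print(clk, bit, clocked_inputs(bit, clk))
```

While the clock is low, both outputs are low, so no pull-down in the AND plane can conduct. While the clock is high, exactly one of the pair is asserted.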

The AND plane inputs at the top of each four high stack also act as drivers for the PMOS pre-charge devices, which also reduces clock loading. The topmost inputs can be slower than the other inputs, i.e., since the transistors in a stack must discharge from the bottom to top, the added driver capacitive loading does not increase the AND plane delay. Hence the AND plane drivers are all identically sized. The conventional OR plane is also split as described above and terminates at a time-borrowing dynamic-to-static conversion latch.

Fig. 2. Proposed NAND-NOR PLA circuit architecture. The hierarchical NAND and the pre-charge controlled by the clocked inputs are evident. (Signal labels in the figure: INPUT(m), PLACLK, PLAIN(m), PLAINb(m), ANDOUT, REPLICACLK, OROUT(x), PLAOUT(x).)

Fig. 3. Operational waveforms of the NAND-NOR PLA for outputs of logic zero and one. No output glitch is present. (Horizontal axis: time, 2.0 ns to 3.0 ns.)

Fig. 4. Subcells used for constructing the AND and OR planes. (Cells shown: A-cell through G-cell, 0-cell, and 1-cell; terminal labels include A, Ab, S, PRE, and IN.)

A replica clock is used to delay the conditional OR plane precharge. This avoids a power race between the de-assertion of the AND plane outputs and the D2 domino OR plane precharge. This race is not timing critical—losing it costs power but will not cause an incorrect logic output. Again, for brevity, the OR plane PMOS keeper transistors are not shown in Fig. 2. The circuit operation is shown in Fig. 3. As mentioned, the OR plane pull down transistors are enabled directly by the AND hierarchy. The NAND to NOR path is simply a delay and there is no timing constraint.

B. AND Plane Sub-cells

Replacing the NOR gates with NAND gates complicates the AND plane and makes the configuration dependent on the input term placement. In order to generate all the possible combinations of NAND functions using four-input NAND gates, a set of seven basic cells, ‘A-cell’ to ‘G-cell’, is used. The physical and logical designs of these sub-cells are shown in Fig. 4.

The ‘A-cell’ and ‘B-cell’ are used for positive and negative inputs, respectively, and contain pre-charge connections to drive the top of stack PMOS pre-charge transistors. The ‘C-cell’ and ‘D-cell’ use positive and negative inputs, respectively, but lack the pre-charge connection. The ‘A-cell’ and ‘C-cell’ layouts are identical, as are those of the ‘B-cell’ and ‘D-cell’. The precharge connections are made when the AND plane pull-up transistor is placed adjacent to the precharge/keeper cells while building the required product term. The ‘E-cell’ and ‘F-cell’ are used in place of unused inputs and they propagate VSS or the clocked value, respectively. The ‘G-cell’ disables the precharge to save power in empty NAND stacks, producing an output that is always a logic “0” to allow propagation of other inputs to the min-term.
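The choice among the four cells that carry an input can be summarized as a lookup on input polarity and stack position, as in the sketch below. The function name and argument encoding are hypothetical. The rule for choosing between the ‘E-cell’ and ‘F-cell’ for unused inputs is not spelled out in enough detail to model, so the sketch deliberately leaves those cells and the ‘G-cell’ out.

```python
# Mapping taken from the cell descriptions: A/B carry the pre-charge
# connection (top of stack), C/D do not. Encoding is an assumption.
_CELL_TABLE = {
    ("positive", True): "A-cell",
    ("negative", True): "B-cell",
    ("positive", False): "C-cell",
    ("negative", False): "D-cell",
}


def select_input_cell(polarity, top_of_stack):
    """Pick the sub-cell for a used input in a four-high NAND stack."""
    try:
        return _CELL_TABLE[(polarity, top_of_stack)]
    except KeyError:
        raise ValueError(f"unknown polarity: {polarity!r}") from None


print(select_input_cell("positive", True))
print(select_input_cell("negative", False))
```

Since the ‘A-cell’/‘C-cell’ and ‘B-cell’/‘D-cell’ layouts are identical, the `top_of_stack` flag here stands in for the adjacency to the precharge/keeper cells that creates the pre-charge connection.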

Fig. 4 also shows the ‘0-cell’ and ‘1-cell’ used in the OR plane. The ‘0-cell’ is used for unused inputs in the OR plane and maximizes polysilicon regularity without loading the output. The ‘1-cell’ connects a pull-down transistor in the OR plane.

C. Physical Design

The NAND-NOR PLA physical design on a 130 nm technology is shown in Fig. 5. As mentioned, the input drivers are placed between the AND planes (the blocks labeled ‘D’). Placement of the input drivers between the AND planes reduces the wire delay by half. The split AND planes drive the OR planes as indicated in Fig. 5. The pull down transistors in the OR plane are placed symmetrically on either side of the output latches (marked ‘O’). The output buffers at the latch outputs drive the OR plane horizontally as shown. The circuit is 115 µm × 136 µm in size.

Fig. 5. PLA physical design. D indicates input drivers, O indicates the output latches and OR plane keepers. (Block labels in the figure: AND, OR, D, O.)

IV. MEASURED SILICON RESULTS

A. Test Setup

The fabricated NAND-NOR PLA has been tested to be fully functional on the target low standby power 130 nm process. The test setup is shown in Fig. 6. The test chip is socketed on a test board and driven by a Xilinx Virtex-II FPGA, which has a serial interface with the controlling computer. Due to limited pin availability on the test chip, the PLA is interfaced through scan chains. There are two independently controlled scan chains, each 32 flip-flops long, connected to the input and output of the PLA, respectively. For speed testing, the data is serially driven into the input scan chain and the PLA is clocked exactly once. Delay measurements are made by varying the phase between the scan out and PLA execution clocks, which were designed with matched insertion delays. The delay, i.e., the difference between the clock edges, is measured using an Agilent 54832D oscilloscope. One measurement of the fabricated NAND-NOR PLA is shown in Fig. 7. For power measurements, the inputs are left static and the outputs are not captured.

B. Simulated Performance and Power

For comparison, both the NAND-NOR PLA and a conventional PLA were designed on a foundry 130 nm process and simulated using Cadence Spectre with VDD = 1.2 V. Both were programmed to exercise the worst-case, i.e., single asserted pull down, critical timing paths and to determine power dissipation for identical functions. The NAND-NOR PLA delay is 520 ps. Owing to the time-borrowing output latches, a 1 GHz operating frequency is possible if the logic in the following phase is static. The simulated power consumption at 1 GHz is 5.882 mW. The simulated delay of the conventional NOR-NOR PLA is 470 ps, with a total power consumption of 10.935 mW at 1 GHz.
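The simulated figures can be compared directly, as in the short calculation below. The numbers are the simulated values quoted above. The dictionary layout and the `compare` helper are hypothetical conveniences.

```python
def compare(new, base):
    """Return power saving, relative speed, and added delay of `new` versus `base`."""
    return {
        "power_saving": 1.0 - new["power_mw"] / base["power_mw"],
        "relative_speed": base["delay_ps"] / new["delay_ps"],
        "added_delay_ps": new["delay_ps"] - base["delay_ps"],
    }


# Simulated values at VDD = 1.2 V and 1 GHz, as quoted in the text.
nand_nor = {"delay_ps": 520.0, "power_mw": 5.882}
nor_nor = {"delay_ps": 470.0, "power_mw": 10.935}

result = compare(nand_nor, nor_nor)
print(f"power saving:   {result['power_saving']:.1%}")
print(f"relative speed: {result['relative_speed']:.1%}")
print(f"added delay:    {result['added_delay_ps']:.0f} ps")
```

With these inputs the power saving comes to roughly 46%, and the NAND-NOR PLA runs at roughly 90% of the speed of the conventional design, with 50 ps of added delay.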

C. Measured Performance and Power

The fabricated NAND-NOR PLA was measured to have an input clock to output (post-latch) delay of 552 ps with VDD = 1.5 V, allowing a clock rate of 905 MHz since the PLA occupies one clock phase. One OR plane output bit was found to fail at VDD = 1.49 V at this speed. The measured speed is lower than the value expected from simulation, which has been attributed to a circuit sizing error. The measured active energy dissipated per operation at 1.5 V is 17.22 pJ. Since the power supply is shared with many other circuit blocks, which were inactive, the leakage power is subtracted.
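The step from the 552 ps delay to the 905 MHz clock rate can be reproduced with the arithmetic below. The sketch assumes a 50% duty cycle, so that one clock phase is half the period. The conversion from energy per operation to average power further assumes one PLA operation per clock cycle. Both are assumptions made here for illustration, and the function names are hypothetical.

```python
def max_clock_mhz(phase_delay_ps):
    """The PLA occupies one clock phase, so the period is twice its delay.

    Assumes a 50% duty cycle.
    """
    period_ps = 2.0 * phase_delay_ps
    return 1.0e6 / period_ps  # 1 / ps expressed in MHz


def average_power_mw(energy_pj, clock_mhz):
    """Average power, assuming one operation per clock cycle (an assumption)."""
    return energy_pj * clock_mhz * 1.0e-3  # pJ * MHz = microwatts


measured = max_clock_mhz(552.0)
print(f"{measured:.1f} MHz")
print(f"{average_power_mw(17.22, 905.0):.2f} mW")
```

A 552 ps phase gives a period of 1104 ps, which is just under 906 MHz and consistent with the 905 MHz figure in the text.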

V. CONCLUSION

An improved PLA circuit architecture that achieves within 10% of the performance of a conventional PLA while reducing power consumption by over 40% has been described. The design uses a hierarchical dynamic NAND gate based AND plane to greatly reduce the circuit activity factor in that portion. This also eliminates the critical circuit race inherent in the conventional PLA, making the circuit more robust for future processes, which exhibit increasing circuit timing variability due to systematic and random process variation.

ACKNOWLEDGEMENTS

This work was supported by SRC under task 1288.001. The authors gratefully acknowledge contributions of Jon Knudsen, who designed the test chip pad rings and power grids.

REFERENCES

[1] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, 3rd ed., Addison-Wesley, NY, 2005.

[2] International Technology Roadmap for Semiconductors, [online].

[3] Wong, A. Mittal, Y. Cao and G. Starr, Nano-CMOS Circuit and Physical Design, John Wiley & Sons, Inc., 2004.

[4] T. Mudge, “Power: A First-Class Architectural Design Constraint,” Computer, 34, pp. 52-58, Apr. 2001.

[5] O. Takahashi et al., “The Circuits and Physical Design of the Synergistic Processor Element of a CELL Processor,” VLSI Cir. Symp. Tech. Digest, pp. 20-23, 2005.

[6] J. Wang, C. Chang, and C. Yeh, “Analysis and Design of High-Speed and Low-Power CMOS PLAs,” IEEE J. Solid-State Circuits, 36, no. 8, pp. 1250-1262, Aug. 2001.

[7] A. Alvandpour et al., “A sub-130 nm conditional keeper technique,” IEEE J. Solid-State Circuits, 37, no. 5, pp. 633-638, May 2002.

Fig. 7. Measured PLA output. An all-‘1’ pattern is input to the PLA through a scan chain. PLA outputs (0 to 23) are functions of the minterm A(0)•A(1)•A(2)•...•A(31) and hence evaluate to ‘0’. PLA outputs (24 to 31) evaluate to ‘1’. This pattern is observed at the output scan chain. (Traces shown: SCAN_IN CLK, SCAN_OUT CLK, SCAN_IN ENABLE, PLA CLK, SCAN_IN DATA, SCAN_OUT ENABLE, SCAN_OUT DATA.)

Fig. 6. Measurement setup. (Blocks shown: computer, UART interface, FPGA, DUT, power supply, oscilloscope, test board.)