

Layout-Aware and Programmable Memory BIST Synthesis for Nanoscale System-on-Chip Designs

Aman Kokrady†, C.P. Ravikumar†, Nitin ChandrachoodanΨ
†Texas Instruments India, ΨIIT Madras

{koko,Ravikumar}@ti.com, [email protected]

Abstract

Debugging memory test failures in a system-on-chip design is becoming difficult due to the growing number and sizes of the embedded memories. Low-complexity marching tests, which are ideally suited for production testing, are insufficient for debug and diagnostics. On-chip support for multiple memory test algorithms can be prohibitively expensive. Moreover, memory test engineers would like the flexibility to make small changes to the test sequence. Run-time programmability can be provided through the use of programmable finite state machines and/or microcode in the BIST controllers. Since such controllers have higher area requirements, it is difficult to employ multiple controllers and distribute them geographically on the chip. Therefore, the BIST controller can become a routing hot-spot. Existing memory BIST insertion flows operate on a post-synthesis net-list and ignore the constraints that will be posed by the physical design step that follows. These constraints include routing congestion and interconnect timing. Similarly, the synthesis of the BIST logic must also address area, test application time, and test power constraints. In this paper, we formulate the problem of programmable memory BIST synthesis as an optimization problem and describe an implementation. Results show up to 3X improvement in area and wirelength for industrial designs when a layout-aware flow is used instead of manual BIST implementation.

1 Introduction

As more content gets integrated into system-on-chip (SoC) designs, there has been a steady increase in the amount of memory embedded in a SoC [ITR+04]. Memories are used as buffers and queues in communication applications and to store image data in cameras and other image-processing applications. It is not uncommon to find several hundred instances of embedded memories in a modern SoC. Testing of embedded memories poses several challenges.

• Read/write access to embedded memories from an external tester is a major problem, due to which built-in self-test (BIST) has become the standard way to test memories. One or more BIST controllers are included on the SoC to generate read/write sequences according to a memory test algorithm. A collar (or wrapper) included with the memory consists of multiplexers that select between the address, data, and control signals generated by the BIST controller in test mode and the signals generated by the CPU in mission mode. A comparator checks whether the data expected from a read operation matches the data actually read out of the memory.

• Memories are tested using marching test algorithms [KPS+04, V+06], which require O(N) test application time to test a memory with N bits. In scaled CMOS technologies with copper interconnects, memories exhibit many subtle failure mechanisms such as dynamic read-destructive faults, deceptive read-destructive faults, slow write-driver faults, bit-line imbalance faults, etc. [NGL+07, V+06]. Testing for such faults requires the use of several complex test algorithms. Since the cost of testing is sensitive to the complexity of the memory test algorithm, some form of adaptive testing will have to be practiced.

For example, production testing can be done with low-cost test algorithms, whereas debugging and diagnosis can be performed using more elaborate algorithms.

• In order to support the use of alternate test sequences at different stages of testing, one would ideally provide BIST controllers that can run all the test algorithms. However, this would be prohibitively expensive. Run-time Programmable BIST (PBIST) is a way to overcome this problem: PBIST allows us to alter the test algorithm at the time of test application. Run-time programmability comes at the cost of a more complex PBIST controller and a more complex BIST insertion flow. Given that there can be hundreds of memory instances, sharing a single PBIST controller among them can create major routing congestion at the controller, force sequential testing of the memories, and increase test application time. If we counter this with multiple and/or more complex controllers that support concurrent testing of several memories, we incur higher area and test-power overheads.

In this paper, we consider the problem of synthesizing optimal PBIST solutions. Our notion of optimality is based on the following considerations: routing congestion, BIST area overhead, test application time, BIST power, and impact on timing closure. Existing CAD flows treat memory BIST insertion as a post-synthesis step. Today’s synthesis tools are physical-design friendly and make important decisions related to cell placement and wire routing, since circuit timing depends critically on the interconnect architecture. Even the insertion of design-for-test logic such as scan chains is layout-aware. Against this background, it is a handicap to have a memory BIST flow that is not layout-aware. We elevate the memory BIST synthesis problem to a higher level of abstraction (register-transfer level), since many of the design decisions during PBIST synthesis must be taken before gate-level synthesis is completed. At the same time, we make the PBIST synthesis flow layout-aware, since decisions taken during this step directly influence routing congestion. We refer to our Layout-Aware Memory PBIST Synthesis flow as “LAMPS”. The synthesis problem is formulated as a combinatorial optimization problem in which the logical and physical architecture of the PBIST solution is derived through several transformations. Our experimental results indicate that LAMPS can uncover significantly better logical and physical BIST architectures for industry-strength designs. Since the heuristic is a fast greedy algorithm, we can offer the physical design team several locally optimal solutions to choose from, along with their congestion maps. We have organized the paper as follows. Section 2 provides some background on programmable memory BIST and surveys the related literature. Section 3 describes the conventional BIST flow and its drawbacks; we then propose a layout-aware memory PBIST synthesis flow and formulate the synthesis problem in terms of logical/physical transformations. Section 4 describes the optimization framework and a greedy heuristic for solving the PBIST synthesis problem. Section 5 describes the experimental setup and results on some industrial designs. We summarize our results in Section 6.

2 Background

There are two approaches to built-in self-test for embedded memories in a SoC, namely, Programmable Memory BIST (PBIST) and Hardwired Memory BIST (HW-BIST).



As the name indicates, a hardwired scheme uses an embedded controller to generate read/write sequences that test a memory according to a fixed algorithm such as MARCH C+ [V+06]. Commercial memory BIST solutions are based on the HW-BIST methodology and allow a controller to be shared across several memories, which are tested one after another using the same test algorithm. PBIST solutions permit the user to dynamically change the memory test algorithm. They can also be programmed to test several memories concurrently. Concurrent testing is necessary to reduce the test application time in modern system-on-chip designs, which include hundreds of memory instances. Programmability is necessary in chips fabricated in sub-100 nm technologies, which exhibit subtle failure mechanisms. To debug such failures, the test engineer may need to apply several test sequences that are not part of production testing. Compared to HW-BIST, a PBIST data path occupies less area for the same flexibility and test power.

2.1 Previous Work

Zarrineh and Upadhyaya presented early designs for programmable memory BIST [ZU+99] based on a programmable FSM as well as a microprogrammable controller. An FSM-based, fully programmable controller was proposed by Du et al. [DMC+05]. The flexibility to modify the test sequence is highest when a microprogrammable controller is used, but the area overhead is higher than that of the programmable-FSM-based controller. The programmable controllers required about 4x to 5x the area of a static BIST controller based on MARCH algorithms. Microcode-based controllers have also been proposed by Jakobsen et al. [JDP+01] and Appello et al. [ABF+03]. A CPU-based controller uses an on-chip special-purpose processor for the generation of test sequences; such controllers are proposed in [ZS+03]. Several authors have studied the optimization of memory BIST data paths for BIST area, power, and test time, but these are not intended for PBIST. Bodoni et al. proposed a distributed BIST architecture [BBC+00] in which BIST area and routing are minimized. In BRAINS (BIST for RAm IN Seconds), the authors aim to optimize test application time with test power as a constraint [CHH+91]; concurrently testing all memories may violate the power constraint but provides the least test time, whereas a serial schedule results in the least test power but the highest test time. In [RR+04], an optimizer called SAMBA (Simple ASIC Memory BIST Advisor) was described that minimizes BIST area by grouping memories based on physical constraints and placing a BIST controller per group at the best location. The fewest possible controllers are used to minimize area overhead; using a single controller leads to minimal area, but does not permit concurrent schedules and leads to long test application times. Memories are grouped based on their logical hierarchies, their physical locations, and access types. In BIST Share [MYF+06], the authors consider all aspects of BIST cost and present an algorithm to minimize the BIST area under time and power constraints. While the above papers laid the foundation for BIST optimization, they suffer from two limitations: (a) These solutions do not address the aspect of BIST timing.

At-speed testing of memories is necessary to detect delay-related defects. Due to the domination of interconnect delays in nanoscale designs, the timing of the BIST data path is critical. To avoid long wires, we can introduce data-steering logic (explained later in this paper). Similarly, to close timing on the BIST data paths, we can introduce pipeline registers. It is important to account for the area/power penalty of these BIST data path elements during the optimization phase itself. In other words, using an existing scheme such as “BIST Share” and performing timing optimizations later will result in non-optimal solutions. Further, the routability of the solution will be impacted if we attempt such post-facto repair.

(b) The solutions are intended mostly for HW-BIST architectures. Programmable BIST datapaths differ from HW-BIST datapaths in three ways. First, PBIST controllers are area-intensive, and it is impractical to place multiple controllers; as a result, routing congestion and routability become critical. Second, programmability allows more scope for optimization. Memory slicing (proposed originally in [CMMP+99] and explained later) can be used to share the comparator across all the memories within a group. Memories that are not necessarily in the same physical neighborhood can be grouped and a comparator shared across them through slicing, since the latency of the arrival of data from the memory to the controller can be programmed. Finally, PBIST allows us to test memories across clock domains and memory types. PBIST was also proposed in [BBC+00], but that work did not consider layout details for area/routing reduction.

In this paper, we propose an optimization framework called LAMPS (Layout-Aware Memory PBIST Synthesis) that is primarily intended for PBIST synthesis. It applies a variety of physical and logical transformations to the PBIST data path to improve routability, total wirelength, area, timing, and power dissipation. We summarize the key differences between our approach and existing approaches in Table I.

3 PBIST Optimization

Irrespective of the controller architecture, the impact of using PBIST on area, routing congestion, power, and device timing is significant. An implementation that treats the controller as a hard IP can lead to a routing hotspot at the controller (Figure 1). The PBIST controller (PBC) is duplicated in the figure for illustration. The darker lines represent the forward paths (from the PBC to the memories) and the dashed lines represent the return paths (memory to PBC). The bus width of the forward path leading to memory Mj is K + NAj + DWj, where K is the number of control lines, NAj is the number of address lines of memory Mj, and DWj is the width of the data bus of Mj. Clearly, the congestion factor at the PBC is ∑j (K + NAj + DWj). We propose modifications to the PBIST architecture to minimize these overheads.
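As a concrete illustration of this congestion metric (not taken from the paper; the `Memory` record, the value of K, and the helper names below are illustrative assumptions), the following Python sketch computes the forward-path bus width K + NAj + DWj of each memory and the resulting congestion factor at a single flat controller.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    name: str   # instance name
    na: int     # number of address lines (NAj)
    dw: int     # data bus width (DWj)

K = 8  # number of shared control lines (assumed value for illustration)

def forward_bus_width(mem: Memory, k: int = K) -> int:
    """Width of the forward path from the controller to one memory: K + NAj + DWj."""
    return k + mem.na + mem.dw

def congestion_factor(memories, k: int = K) -> int:
    """Congestion at a single flat controller: sum over all memories of (K + NAj + DWj)."""
    return sum(forward_bus_width(m, k) for m in memories)

if __name__ == "__main__":
    mems = [Memory("M1", na=10, dw=64), Memory("M2", na=12, dw=32), Memory("M3", na=11, dw=32)]
    print([forward_bus_width(m) for m in mems])   # [82, 52, 51]
    print(congestion_factor(mems))                # 185
```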

Figure 1 – Simple PBIST solution leading to routing congestion

The present-day SoC design flow (Figure 2) consists of the integration of IP blocks such as microprocessors, microcontrollers, signal converters, digital signal processors, hardware accelerators, and memories at a suitably high level of abstraction, such as register-transfer level (RTL) or electronic system level (ESL). An early floorplanning step is performed to improve the convergence of timing closure. Layout-aware logic synthesis, also called physical synthesis, is used to derive a gate-level representation of the logic. The design-for-test (DFT) modules for logic testing are then added at the gate level of abstraction. To improve timing closure, the scan-chain insertion step is layout-aware, and physical design tools perform scan-flop reordering to improve design timing. In contrast, the memories in the SoC are synthesized and the memory BIST logic is added without cognizance of the physical design. We believe this is a major drawback in the design flow, since memories constitute a major portion of the silicon real estate.



Assume that the initial floorplanning step has placed the center of memory Mj at location (xj, yj). If a single PBIST controller is to be added at location (x0, y0), it must be placed roughly equidistant from the memories. Assuming Manhattan routing, a lower bound on the length of a wire between the controller and memory Mj is given by |x0 − xj| + |y0 − yj|. In the simple PBIST solution of Figure 1, the routing congestion can be substantial. This places an enormous burden on the physical design flow, which attempts to close timing on every signal path. There is a case, therefore, for physically-aware memory BIST synthesis.
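The wirelength lower bound can be made concrete with the short sketch below. The paper does not prescribe a placement procedure for the controller, so the choice of the coordinate-wise median (which minimizes the total Manhattan wirelength to the memories) is only one plausible heuristic; the coordinates and helper names are illustrative.

```python
from statistics import median

def manhattan(p, q):
    """Lower bound on the wire length between two points under Manhattan routing."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def candidate_controller_location(memory_centres):
    """Coordinate-wise median of the memory centres: minimizes the total Manhattan
    distance to the memories. Illustrative heuristic only; minimizing the maximum
    distance instead would generally give a different location."""
    xs = [x for x, _ in memory_centres]
    ys = [y for _, y in memory_centres]
    return (median(xs), median(ys))

if __name__ == "__main__":
    centres = [(100, 200), (900, 250), (400, 800), (650, 120)]   # (xj, yj) in um
    x0y0 = candidate_controller_location(centres)
    bounds = [manhattan(x0y0, c) for c in centres]               # per-memory lower bounds
    print(x0y0, bounds, sum(bounds))
```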

Figure 2 – Conventional SoC design flow (blocks: IP integration at RTL/ESL, floorplanning, physical synthesis, memory synthesis, memory BIST synthesis, logic DFT, physical design)

The simple PBIST solution of Figure 1 can be a starting point for layout-aware BIST. When the number of memories is small, the controller can be located at an approximately equal distance from the memories. This is similar to locating a hospital in a city so that the facility is equally accessible from all parts of the city. Since the number of memories in a modern SoC runs into the hundreds, the simple PBIST solution leads to routing hot-spots and to an impractical solution with excessive area overhead, power, and signal-integrity issues. We propose physically-aware transformations to the memory BIST solution to improve the situation.

3.1 Structuring transformations

The congestion factor at the PBIST controller can be reduced by using data distributors and data concentrators, as shown in Figure 3. The memories are partitioned based on their physical proximity into groups G1, G2, …, Gp. In the figure, we have two groups G1 and G2, with two memories per group. A hierarchical, tree-based interconnect scheme is constructed to route the address, data, and control signals between the controller and the memories. We can limit the maximum degree of the nodes in the tree to a constant d that is much smaller than the number of memories. For example, the figure illustrates a two-level hierarchy using a binary tree structure with d = 3. There are two benefits of this transformation. First, the congestion factor is no more than d × (K + Max(DWj) + Max(NAj)), reducing the routing congestion. Second, the global routes from the controller to the memories are replaced with local routes that join the nodes of the tree; these local routes are shorter and easier to route. The choice of the right logic architecture for the distributor and concentrator blocks impacts the area and performance of the solution. For example, a k-way concentrator can be realized using a k-to-1 multiplexer, which, in turn, can be implemented using a hierarchy of 2^l-to-1 multiplexers, where 1 < l < log2 k. Similarly, the data distributor can be implemented using a hierarchy of demultiplexers.
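A minimal sketch of the structuring idea follows, under simplifying assumptions: memories are grouped by simply sorting their placed centres and chunking them into groups of at most d, and only the single-controller congestion bound d × (K + Max(DWj) + Max(NAj)) is evaluated. The actual LAMPS flow explores a much richer space of tree structures and grouping criteria; all names and numbers here are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    name: str
    x: float      # placed centre, x (um)
    y: float      # placed centre, y (um)
    na: int       # address lines (NAj)
    dw: int       # data bus width (DWj)

def group_memories(memories, d):
    """Partition memories into groups of at most d members.
    Illustrative proximity heuristic: sort by (x, y) and chunk; the paper's flow
    uses richer physical and logical grouping criteria."""
    ordered = sorted(memories, key=lambda m: (m.x, m.y))
    return [ordered[i:i + d] for i in range(0, len(ordered), d)]

def controller_congestion_bound(memories, d, k):
    """Upper bound on the congestion factor at the controller after structuring:
    d * (K + max(DWj) + max(NAj))."""
    return d * (k + max(m.dw for m in memories) + max(m.na for m in memories))

if __name__ == "__main__":
    mems = [Memory("M1", 120, 300, 10, 64), Memory("M2", 140, 320, 12, 32),
            Memory("M3", 900, 880, 11, 32), Memory("M4", 920, 860, 11, 32)]
    groups = group_memories(mems, d=2)
    print([[m.name for m in g] for g in groups])          # [['M1', 'M2'], ['M3', 'M4']]
    print(controller_congestion_bound(mems, d=2, k=8))    # 2 * (8 + 64 + 12) = 168
```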

3.2 Slicing transformations

If a memory Mj has a much larger data bus width than the other memories, it will result in wasted BIST resources, as the following example illustrates. Assume that M1 in Figure 3 is a 64-bit memory whereas the others are 32-bit memories. The comparator in the BIST logic will have to be 64 bits wide, and it is fully utilized only when M1 is tested. We can logically slice the memory M1 into two halves and test one half at a time, in two phases. In Phase 1, the lower 32 bits of the read data are selected for comparison; in Phase 2, the upper 32 bits are compared. At the expense of some multiplexing logic, this slicing transformation reduces the routing complexity and the hardware complexity of the BIST logic. Slicing transformations allow us to explore the trade-off between test time, BIST area, and routing complexity. Although we illustrated two-way slicing, one can also perform multi-way slicing on very wide memories. Slicing transformations are illustrated in Figure 4.
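The slicing trade-off can be sketched as follows (illustrative only; the slice boundaries and phase scheduling in the actual flow are decided by the optimizer). A wide data bus is split into slices no wider than the shared comparator, and the test time for that memory scales with the number of slices.

```python
import math

def slice_memory(data_width, comparator_width):
    """Split a data bus into slices no wider than the shared comparator.
    Returns the bit range of each slice and the number of test phases."""
    phases = math.ceil(data_width / comparator_width)
    slices = [(i * comparator_width, min((i + 1) * comparator_width, data_width) - 1)
              for i in range(phases)]
    return slices, phases

if __name__ == "__main__":
    # 64-bit memory tested through a 32-bit comparator, as in the example above:
    slices, phases = slice_memory(data_width=64, comparator_width=32)
    print(slices)   # [(0, 31), (32, 63)] -> lower half in Phase 1, upper half in Phase 2
    print(phases)   # 2: the test time for this memory roughly doubles
```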

3.3 Pipelining transformations

The wires in the interconnect tree from the PBIST controller to the memories can have significant transport delays. However, for at-speed testing of the memories, it is important to ensure that memory write and read operations complete in one mission-mode clock cycle. Placing one or more pipeline registers along a wire allows the data path between the BIST controller and the memory to operate in a pipelined fashion at high speed. Pipelining transformations are necessary even when the simple PBIST solution of Figure 1 is used and at-speed testing is desired; however, in such a solution the number of pipeline registers can be excessive due to the long routes that result from the routing congestion. In our experience, the number of pipeline registers in a simple PBIST solution for a 200-memory SoC can run into several thousands. Substantial reductions, of the order of 10X, are possible when structuring and slicing transformations are applied. Pipelining transformations are illustrated in Figure 4. A delay calculation engine determines the (negative) timing slack along a path, and a pipeline register is placed at the center of the wire to improve the slack; this procedure is repeated until the slack becomes zero.
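A simplified sketch of this midpoint-insertion procedure is shown below. It assumes a purely linear wire-delay model (delay proportional to length), which stands in for the delay calculation engine used in the real flow; the function name and the numbers in the example are illustrative.

```python
def insert_pipeline_registers(length_um, delay_per_um_ps, period_ps):
    """Repeatedly split a wire at its midpoint until every segment fits in one
    clock period. Returns the register positions (um from the source).
    Linear-delay assumption for illustration only; buffered wires behave differently."""
    segments = 1
    # Keep halving the worst segment (mirroring midpoint insertion) until the
    # per-segment delay no longer exceeds the mission-mode clock period.
    while (length_um / segments) * delay_per_um_ps > period_ps:
        segments *= 2
    seg_len = length_um / segments
    # Registers sit at the internal segment boundaries.
    return [round(seg_len * i, 1) for i in range(1, segments)]

if __name__ == "__main__":
    # 4 mm route, 0.5 ps/um, 600 ps mission-mode clock period (assumed numbers):
    regs = insert_pipeline_registers(length_um=4000, delay_per_um_ps=0.5, period_ps=600)
    print(len(regs), regs)   # 3 registers, at 1000, 2000, 3000 um
```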

4 PBIST Optimization Framework

In this paper, we assume that a single PBIST controller is used, irrespective of the number of memory instances; our work can be extended to the case where multiple controllers are employed. We assume that an initial floorplanning step has identified a location (xj, yj) for every memory instance Mj on the chip. The location of the PBIST controller, (x0, y0), is to be fixed as part of the PBIST synthesis flow.

Figure 3 – Structuring transformation to reduce the PBIST overhead. Solid lines represent forward paths and dotted lines represent return paths.



Figure 4 – Illustration of all PBIST optimizations (distributor/concentrator, splitting transformation, pipelining transformation, shared comparator)

Every memory instance is characterized by the type of memory (DRAM, SRAM, single-port, dual-port, etc.), the width of the data bus, and the width of the control bus. The actual dimensions of a memory are ignored when estimating the length of a wire from the controller to the memory; we use the Manhattan distance measure to obtain a lower bound on the wirelength. The PBIST synthesis flow determines the optimal tree structure, the placement information for the distributor and concentrator logic blocks, slicing decisions for wide memories, the number of pipeline registers to be placed, and the placement information for the pipeline registers. A bus is treated as a logical grouping of wires, but the individual wires of the bus are treated separately for physical design purposes. The optimization framework considers the space of all possible tree structures when performing the structuring transformation. Knuth [K+97] describes a generating-function approach to count the number of rooted trees that can be constructed to connect a given number of nodes n; this number is related to the Catalan number C(2n, n)/(n + 1) and grows very quickly with n. The tree structuring problem in our paper considers an even larger space, since the number of nodes n is itself a variable: at the minimum, the number of nodes is m + 1 (the number of memories plus the controller), and at the maximum the number of nodes is 2m when a binary tree structure is employed. A cost estimator evaluates a combination of metrics such as the area overhead, routing congestion, BIST timing characteristics, test application time, and BIST power dissipation. The area of a PBIST solution is measured by the overhead introduced by the data distributors and concentrators, the size(s) of the comparator(s) required, the lengths of the wires required in the memory test data path, and the number of pipeline registers. The hardware complexity of the controller and the data distributors/concentrators depends on the concurrency in the schedule. The routing congestion is measured by the combination of in-degrees and out-degrees of the nodes in the solution; a node whose value for this metric exceeds a threshold can become a hot spot in the physical design. The placement of pipeline registers is based on the estimated wire length of the data path: if the wire length exceeds a threshold, a pipeline register is placed midway along the path. We work with the premise that BIST timing can always be matched to mission-mode timing through the placement of adequate pipeline registers. The test application time depends on the concurrency in the test schedule and can be evaluated from the schedule array and the knowledge of the test algorithms used to test the memories. The BIST power overhead consists of the dynamic and static power dissipated in the data path; the techniques discussed in [KMR+03] can be used to estimate the power.
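To give a sense of how quickly the structuring search space described above grows, the sketch below evaluates the Catalan number C(2n, n)/(n + 1) for a few values of n. This is only the tree-shape count quoted above; the admissible PBIST trees are further constrained by degree bounds and the variable node count.

```python
from math import comb

def catalan(n: int) -> int:
    """Catalan number: C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        print(n, catalan(n))
    # 5 -> 42, 10 -> 16796, 20 -> 6564120420, 50 -> about 2e27:
    # even modest memory counts rule out exhaustive tree enumeration.
```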

There are two phases in the optimization of the PBIST architecture. During Phase 1, the area overhead and test time are treated as objectives and BIST power is used as a constraint. Structuring transformations are used to find a minimal-area solution, and concurrency transformations are used to find a solution that minimizes test time. In Phase 2, slicing and pipelining optimizations are used to improve the solution further. As a post-processing step, concurrency in the schedule is used to reduce the test time. Several heuristics are used for grouping memories. Memories that are physically far apart and/or belong to different physical hierarchies are placed in different groups. Memories that belong to different clock/voltage domains are not grouped together. Memories that serve the same functionality (video buffer, audio buffer, FIFO, etc.) are preferably grouped together to promote concurrent testing; for the same reason, memories of the same type (single-port RAM, dual-port RAM, etc.) and similar size parameters are grouped. Each phase of the overall optimization framework is an NP-hard problem, so a greedy optimization algorithm is employed at each stage to obtain a near-optimal solution. The initial solution is the simple PBIST solution, where all memories are in the same group and there is no hierarchy in the PBIST architecture. From this initial solution, a set of local neighborhood solutions is generated using the appropriate set of transformations, and the neighborhood solution that minimizes the area overhead and test application time is accepted. This procedure continues until the user-defined limit on run-time is exhausted or the best solution in the neighborhood is inferior to the current solution, signifying that the current solution is a local minimum.

In our implementation of the proposed flow, we made use of a combination of in-house tools and commercial tools. The gate-level netlist (prior to PBIST insertion) is available in Verilog format. The physical design information, such as the placement of the memories, is available in the Magma tool database. An in-house tool is used to extract the memory-related information from the netlist and the physical hierarchies to which the memories are assigned. A Magma TCL script extracts the congestion information for the layout (the number of free routes in a routing window). The congestion score, which is an optimization metric for the datapath, is obtained using the Magma physical design tool. The user specifies a number of constraints to the PBIST synthesis flow, including a bound on the test application time, a margin on routing space for a routing window before it is marked as congested, a bound on the test power, and so on.
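The greedy search described above can be summarized by the following skeleton. It is a sketch only: generate_neighbours and cost stand in for the transformation-specific moves (structuring, slicing, pipelining, concurrency) and the composite cost estimator, neither of which is spelled out as code in the paper.

```python
import time

def greedy_pbist_search(initial_solution, generate_neighbours, cost, time_limit_s=60.0):
    """Greedy local search: repeatedly move to the best neighbouring solution until
    no neighbour improves the cost (local minimum) or the run-time budget is
    exhausted. Mirrors the loop described in the text; the neighbourhood and cost
    functions are placeholders for the PBIST transformations and estimators."""
    current = initial_solution
    current_cost = cost(current)
    deadline = time.monotonic() + time_limit_s
    while time.monotonic() < deadline:
        neighbours = generate_neighbours(current)
        if not neighbours:
            break
        best = min(neighbours, key=cost)
        best_cost = cost(best)
        if best_cost >= current_cost:   # best neighbour is no better: local minimum
            break
        current, current_cost = best, best_cost
    return current, current_cost

if __name__ == "__main__":
    # Toy usage: "solutions" are integers, cost is distance from 7, neighbours are +/- 1.
    sol, c = greedy_pbist_search(0, lambda s: [s - 1, s + 1], lambda s: abs(s - 7))
    print(sol, c)   # 7 0
```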

5 Experimental Results

LAMPS was tested on three industry-strength designs (see Table II). These designs contain several types of memories, such as single-port and dual-port SRAMs and single- and dual-ported register files. In Table II, we show the estimated wirelength, number of pipeline registers, and area overhead of the PBIST solution for the best case (when LAMPS was configured to minimize these parameters) and the worst case (when LAMPS was configured to maximize them). The results indicate that there can be a 4X to 5X improvement in wirelength, a 4X to 9X improvement in the number of pipeline registers, and a 4X to 6X improvement in the area overhead. Table III shows the phase-wise improvement in the solution as LAMPS optimizes the PBIST datapath (Chip A). R indicates whether the solution S is routable. The initial solution S1 is the simple PBIST datapath, which is not routable and does not meet timing. Structuring transformations find a solution S2 that is routable and significantly improves the area (A), the worst negative slack (WNS), and the total negative slack (TNS) on the PBIST datapath, but the timing is still not met (column MT).



Table II – Results of using the LAMPS flow on benchmark designs

Exp     Logic Gates   Memory Instances   Best case: WL (μm) / Pipe Flops / Area (μm2)   Worst case: WL (μm) / Pipe Flops / Area (μm2)
Chip-A  300K          45                 16328 / 1024 / 10504                           97675 / 4672 / 47369
Chip-B  1M            200                83291 / 128 / 5340                             444374 / 1096 / 34683
Chip-C  3M            150                2034951 / 4096 / 69907                         6123762 / 15132 / 147159

The slicing transformations (S3) further improve the area and the slack, at the cost of increased test power (TP) and test time (TT). The pipelining transformations (S4) are able to meet the timing on the PBIST datapath through the addition of 1024 pipeline flops; note that there is a fairly significant area cost due to this transformation. The increase in test power through the addition of pipeline flops is negligible.

Table III – Solution improvement during PBIST optimization

S    R   WL (μm)   A (μm2)   MT   WNS    TNS       PF     TT     TP
S1   N   75446     9532      N    7512   1441042   0      8.9    1X
S2   Y   24419     4810      N    2407   432312    0      8.9    1X
S3   Y   16328     3336      N    2598   451784    0      12.9   1.23X
S4   Y   16328     10504     Y    0      0         1024   12.9   1.23X

Figure 5 shows three different PBIST datapaths and their layout plans for Chip-B, where the (blue) dots represent the memories, the (red) square represents the position of the PBIST controller, and the dark triangles represent the pipeline registers. The x and y axes are in μm. To bring out the impact of PBIST optimization, Figure 5(a) shows the solution that maximizes wirelength and area and minimizes the routability index. Figure 5(b) represents the simple (flat) PBIST architecture with pipeline register placement. In both of these solutions, the PBIST controller is a routing bottleneck. Figure 5(c) shows the optimal solution generated using LAMPS, which includes all the optimizations discussed in Section 3. (In fact, the solution of Figure 5(a) was also generated using LAMPS by changing the minimization problem to one of maximization.) It is easy to see that the LAMPS solution is superior in terms of routability; this example illustrates the importance of a layout-aware PBIST flow for design optimization. Figure 6(a) shows a physical design plan for the PBIST data path on Chip C; this plan was generated manually by a physical designer after several days of painstaking work. Figure 6(b) shows the physical design plan for the optimized PBIST data path obtained by the LAMPS flow in less than 2 seconds of run-time. An examination of the two figures shows that LAMPS is able to better plan the routing and pipelining optimizations. Figure 7(a) shows the abstracted physical design of Chip C for a manually generated PBIST architecture; it may be seen that there is significant routing congestion in the central regions of the chip. The best solution from the LAMPS flow is illustrated in Figure 7(b).

6 Conclusions

Embedded memories in nanometer SoCs have subtle failure modes that require expensive test sequences for debug and diagnostics. A programmable BIST solution allows us to use simple and fast test sequences in production and more complex sequences for debug/diagnostics. Conventional solutions for memory BIST insertion are inadequate when the number of embedded memories is of the order of hundreds, since these flows ignore the physical design constraints and result in routing hotspots. Earlier PBIST solutions attempted to build PBIST cores; in our solution, the PBIST data path is distributed across the chip, which permits the optimizing transformations presented in this paper. We described several constraints and formulated the synthesis problem. A simple greedy optimization algorithm was used in our implementation. This algorithm is quite fast and can provide excellent solutions in short run-times. Our experimental results indicate that our flow can save the designer many days of work by offering good PBIST architectures that are complete in terms of logical and physical attributes. The flow is capable of suggesting a location for the PBIST controller and an optimal routing tree between the PBIST controller and the memories. Optimal configurations are selected for the data-steering logic in the routing tree and the comparators in the PBIST data path. Pipeline registers are placed along long interconnect paths to ensure timing correctness. Future work will consider optimization algorithms other than the greedy search used in this work to obtain better solutions.

References

[ITR+04] International Technology Roadmap for Semiconductors, 2004 update, http://public.itrs.net/Home.htm, 2004.

[ABF+03] D. Appello, et al., A programmable BIST approach for the diagnosis of embedded memory cores. Proceedings of the 8th IEEE European Test Workshop, The Netherlands, May 25-28, 2003, pp. 101-102.

[BBC+00] M. L. Bodoni, et al., An effective distributed BIST architecture for RAMs. European Test Symposium, pp. 119-124.

[CHH+91] K.-L. Cheng, et al., Automatic Generation of Memory Built-In Self-Test Cores for System-on-Chip. Asian Test Symposium, p. 91.

[CMMP+99] A. L. Crouch, et al., The Testability Features of the 3rd Generation Coldfire® Family of Microprocessors. Proceedings of the International Test Conference, 1999.

[DMC+05] X. Du, et al., Full-Speed Field-Programmable Memory BIST Architecture. International Test Conference, 2005, paper 45.3.

[JDP+01] P. Jakobsen, et al., Embedded DRAM Built In Self Test and Methodology for Test Insertion. International Test Conference, 2001, pp. 975-984.

[KPS+04] D.-C. Kang, et al., An Efficient Built-In Self-Test Algorithm for Neighborhood Pattern- and Bit-Line-Sensitive Faults in High-Density Memories. ETRI Journal, vol. 26, no. 6, Dec. 2004, pp. 520-534.

[K+97] D. E. Knuth, Fundamental Algorithms, Third Edition. Reading, Massachusetts: Addison-Wesley, 1997.

[KMR+03] Kokrady, et al., Estimating Test Power Dissipation for Embedded Memories. Proceedings of the VLSI Design and Test Workshops, Bangalore, India, 2003.

[MYF+06] M. Miyazaki, et al., A Memory Grouping Method for Sharing Memory BIST Logic. ASP-DAC 2006, pp. 671-676.

[NGL+07] A. Ney, et al., Slow Write Driver Faults in 65nm SRAM Technology: Analysis and March Test Solution. Design, Automation & Test in Europe Conference & Exhibition, 2007, p. 103.

[RR+04] Roshin L., et al., Simple ASIC Memory BIST Advisor. TI Symposium on Test, 2004.

[V+06] A. J. Van De Goor, Advanced Memory Testing. Tutorial given at the International Test Conference, Santa Clara, 2006.

[ZU+99] K. Zarrineh, et al., On Programmable Memory Built-In Self-Test Architecture. Proceedings of Design, Automation and Test in Europe, 1999, pp. 708-713.

[ZS+03] Y. Zorian, et al., Embedded-Memory Test and Repair: Infrastructure IP for SoC Yield. IEEE Design and Test of Computers, vol. 20, no. 3, pp. 58-66, May/June 2003.



Table I – Comparison of our approach with existing BIST optimization tools

Approach   | Architecture | Number of controllers | Determines controller location | Memory grouping | Memory slicing | Steiner routing of datapath wiring | Comparator sharing within a group | Pipeline register placement for timing
BIST Share | HW-BIST      | Multiple              | No                             | Yes             | No             | No                                 | Yes                               | No
BRAINS     | HW-BIST      | Multiple              | No                             | Yes             | No             | No                                 | Yes                               | No
SAMBA      | HW-BIST      | Multiple              | No                             | Yes             | No             | No                                 | No                                | No
LAMPS      | PBIST        | Single                | Yes                            | Yes             | Yes            | Yes                                | Yes                               | Yes

Figure 5 – (a) Worst-case solution for Chip-B; (b) a flat PBIST architecture that ignores physical design constraints; (c) best-case solution for Chip-B, generated by LAMPS.

Figure 6 – (a) Result of a manual PBIST synthesis for Chip C; (b) result from the LAMPS layout-aware PBIST synthesis flow for Chip C.

Figure 7 – (a) Manual synthesis and layout of PBIST for Chip C: Magma congestion score of 0 (worst) on a scale of 0..10; (b) automated synthesis of PBIST and layout using LAMPS: Magma congestion score of 10 (best) on a scale of 0..10.
