Pulsed-Latch ASIC Synthesis in Industrial Design Flow

Sangmin Kim, Duckhwan Kim, and Youngsoo Shin
Department of Electrical Engineering, KAIST
Daejeon 305-701, Korea

Abstract

The flip-flop has long been the sequencing element of choice in ASIC design, and commercial synthesis tools have been developed in this context. This work is motivated by the question of whether existing CAD tools can be employed from RTL to layout when pulsed latches replace flip-flops as the sequencing element. Two important problems are identified and solutions are proposed: placement of pulse generators and latches to preserve the integrity of the pulse shape, and design of special scan latches together with their selective use to reduce hold violations. A reference design flow is also set up from published documents in order to assess the proposed one. In 40-nm technology, the proposed flow achieves a 20% reduction in circuit area and a 30% reduction in power consumption, averaged over 12 test circuits.

I. INTRODUCTION

A pulsed latch is a latch driven by a short pulse rather than by a normal clock. Because the window in which it can capture input data is very brief, it can be approximated as a faster flip-flop. A pulse can be generated either internally or externally. In the latter approach, a single pulse generator is shared by more than one latch; it therefore has an advantage in area and power consumption over the former approach, and it is the focus of this paper.

Pulsed latches have often been used in high-performance processor designs [1]-[3], but their adoption in ASIC designs is not yet common. Several documents have been published on design methodology [4] and CAD optimization [5]-[8], but a study of an integrated design flow from logic synthesis to layout generation is still missing. This paper is motivated by the question of whether commercial CAD tools, which mostly assume flip-flops as sequencing elements, can be used to design pulsed-latch ASICs; some script programming may be needed where customization is necessary.
The key problems have been identified during the study and solutions are provided; these include placement of pulse generators and latches to preserve the integrity of the pulse shape, and design of special scan latches and their use to reduce hold violations. The proposed design flow has been compared to a reference flow. The result in 40-nm commercial technology is very positive, even though the reference flow may be somewhat arbitrary because it has never been documented in complete form: circuit area is reduced by 20% and power consumption by 30%, averaged over 12 test circuits.

A. Contributions

Our main contributions can be summarized as follows:

- Use of commercial CAD tools (with the support of some scripts for customization) from RTL to layout for pulsed-latch ASIC design.
- An algorithm to insert pulse generators, and the use of properly sized bounding boxes for placement of pulse generators and latches.
- Design of special scan latches and an algorithm for their selective use to reduce hold violations.

Fig. 1. Timing model of a pulsed latch circuit. (Annotations: $T_{cq}$, $T_{hd}$, $W$, $T_{su}$.)

The remainder of this paper is organized as follows. In Section II, we discuss how the timing model of a standard latch is altered so that a pulsed latch can be used right from the logic synthesis stage. A new pulse generator is proposed, in which the output pulse is not distorted even when the input clock transitions slowly; it is further modified to have a pulse enable input, so that it can be used during clock gating synthesis. The problems of pulse generator insertion and automatic placement are addressed in Section III. Special scan latches are proposed in Section IV, together with how they are used alongside standard scan latches to minimize hold violations. Experimental results are presented in Section V, and we draw conclusions in Section VI.

978-1-4673-3030-5/13/$31.00 ©2013 IEEE

II. PRELIMINARIES

We assume a standard ASIC design process as the base of the proposed flow; it consists of logic synthesis (which also performs clock gating synthesis), test synthesis, placement, clock tree synthesis, and routing, in sequence.

A. Timing Model

The foundation of using a pulsed latch in ASIC design is to treat it as a faster flip-flop. This is made possible by forcing data to arrive before the rising edge of the pulse, which implies that the setup margin ($T_{su}$) and clock-to-Q delay ($T_{cq}$) are characterized at the rising edge as shown in Fig. 1, and the data-to-Q delay of a standard latch is not used. Time borrowing is inhibited as a result; the time allocated to each latch-to-latch path is fixed to one clock period, as in flip-flop circuits. Logic synthesis can be performed while latches are treated as flip-flops, and we take advantage of the smaller sequencing overhead.

While the second latch is transparent ($\phi_2 = 1$ in Fig. 1), its input, which is on the path from the first latch, has to stay stable. The hold margin ($T_{hd}$) is thus characterized at the falling edge of the pulse, and the minimum delay between a latch pair, $d_{12}$, has to satisfy

$$T_{cq} + d_{12} \ge W + T_{hd}, \tag{1}$$

where $W$ is the pulse width. Because of the presence of $W$, (1) implies a risk of more hold violations than in a flip-flop circuit and thus more extra delay buffers. This requires special attention and is discussed in more detail in Section IV.

B. Pulse Generator

A pulse generator is a key circuit component. One of its implementations [1], on which we base our design, is shown in Fig. 2. A drawback of this pulse generator is that the pulse is distorted when the input clock transitions slowly, as indicated by the SPICE waveforms; this is because slow transitions directly drive the pMOS transistor $M_1$. The pulse generator was modified by inserting an inverter at the input, so that the clock signal is regenerated, and by adopting a NOR structure accordingly, as illustrated in Fig. 2. The SPICE waveforms confirm that the output pulse is no longer distorted even when the rise time of the input clock is 2 ps; this, however, comes at the cost of a 4.8% increase in cell area and a 29.7% increase in power consumption.

Fig. 2. A pulse generator [1] and its modified form. The waveforms are shown for two different rise times.

1) Pulse Generator with Pulse Enable: The basic pulse generator can be extended to a generator with pulse enable, as illustrated in Fig. 3. The new generator can be used in place of a clock gating cell (CGC), as shown in Fig. 3; this allows us to designate the new generator as a CGC and perform standard clock gating synthesis. Notice that the latch within the CGC is no longer necessary. Glitches at the enable input that arise while the clock is 1 are unlikely to affect the output pulse (see the waveforms of Fig. 3), because a pulse is very short and the minimum delay of the enable signal is usually larger than the pulse width; otherwise, we force the minimum delay to be larger than the pulse width.

Fig. 3. Pulse generator with pulse enable and its use in clock gating in place of a clock gating cell.

III. PULSE GENERATORS INSERTION AND PLACEMENT

It is important to ensure that a pulse is delivered from each pulse generator to its connected latches without distortion. This is accomplished by characterizing the maximum load capacitance that a generator can support, called the load limit, and making sure that the actual load does not exceed this limit. The load of a pulse generator is determined by the number of attached latches and the wiring that connects them. Pulse generator insertion and placement are thus the two steps to be addressed in this context.

A. Design Flow

The overall flow is illustrated in Fig. 4. As explained in the previous section, clock gating synthesis is performed with the pulse generator with pulse enable used in place of the CGC. Thus, the latches to which clock gating is applied, called gated latches, are already attached to generators, while ungated latches are not. A bounding box is assigned to each group of gated latches and their generator, so that they remain inside the box during automatic placement and the load limit of the pulse generator is honored.

Fig. 4. Pulse generators insertion and placement flow. (Box labels: logic synthesis with CG synthesis; set bounding boxes for pulse generators and gated PLs; insert pulse generators to ungated PLs and set bounding boxes; placement; legalization.)

Pulse generators are inserted for ungated latches using a simple greedy algorithm. Initially, each latch belongs to its own group. We determine the two groups $i$ and $j$ with minimum wiring cost and merge them into a single group; the wiring cost is the length of the Steiner tree that spans all the latches of $i$ and $j$, minus the Steiner tree length of $i$ and that of $j$. The process iterates until no two groups can be merged; a pulse generator is then assigned to each group and connected to its member latches. We limit the number of latches in a group to 11. This value is determined empirically: too small a number yields a large number of pulse generators, whereas too large a number causes poor placement, because a pulse generator and its member latches are clumped together owing to the smaller wirelength budget. A bounding box is assigned to each group of ungated latches and its pulse generator, and legalization is performed so that group members are moved inside the box boundary. Fig. 5 is an example layout after placement.

Fig. 5. Example layout of a test circuit wb_dma after placement with bounding boxes.

Fig. 6. Maximum distance between pulse generator and latches under 15 fF of load limit. Maximum, minimum, and mean values are shown for some number of latches. (Axes: max distance [μm] versus # latches.)

Fig. 7. A scan chain. The thick line corresponds to a scan signal path. SI, SO, and SE are scan input, scan output, and scan enable, respectively.

B. Sizing Bounding Boxes

The size of each bounding box should be set so that the load of the pulse generator does not exceed its load limit, irrespective of the locations of the generator and latches inside the box. The pulse generator is loaded by latch and wire capacitance. As it drives more latches, less budget remains for wire capacitance, which implies that a shorter distance is allowed between the pulse generator and the latches. This is experimentally demonstrated in Fig. 6.
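The greedy grouping of Section III-A can be sketched in Python under a few simplifying assumptions that are not taken from the paper: half-perimeter wirelength (HPWL) stands in for Steiner tree length, the load of a group is estimated as latch pin capacitance plus wire capacitance, and a merge is rejected when that estimate exceeds the load limit. The constants `LOAD_LIMIT_FF`, `LATCH_CAP_FF`, and `WIRE_CAP_FF_PER_UM` are hypothetical values chosen only for illustration; the 11-latch group limit and the definition of the merge cost follow the text.

```python
from itertools import combinations

# Illustrative electrical parameters (hypothetical, not from the paper).
LOAD_LIMIT_FF = 15.0       # load limit of one pulse generator, in fF
LATCH_CAP_FF = 1.0         # clock-pin capacitance per latch, in fF
WIRE_CAP_FF_PER_UM = 0.2   # wire capacitance per micron, in fF
MAX_GROUP_SIZE = 11        # limit on latches per group, as in the text


def hpwl(points):
    """Half-perimeter wirelength, a cheap stand-in for Steiner tree length."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def load_ff(points):
    """Estimated group load: latch pins plus estimated wire capacitance."""
    return len(points) * LATCH_CAP_FF + hpwl(points) * WIRE_CAP_FF_PER_UM


def greedy_group(latches):
    """Repeatedly merge the cheapest feasible pair of groups."""
    groups = [[p] for p in latches]  # each latch starts in its own group
    while True:
        best = None
        for i, j in combinations(range(len(groups)), 2):
            merged = groups[i] + groups[j]
            # Assumed feasibility test: group size limit and load limit.
            if len(merged) > MAX_GROUP_SIZE or load_ff(merged) > LOAD_LIMIT_FF:
                continue
            # Wiring cost: length of merged tree minus the two separate trees.
            cost = hpwl(merged) - hpwl(groups[i]) - hpwl(groups[j])
            if best is None or cost < best[0]:
                best = (cost, i, j)
        if best is None:
            return groups  # no two groups can be merged any more
        _, i, j = best
        groups[i] = groups[i] + groups[j]
        del groups[j]  # safe because i < j
```

With two tight clusters placed far apart, for example `greedy_group([(0, 0), (1, 0), (0, 1), (100, 100), (101, 100)])`, the sketch returns one group of three latches and one of two, because bridging the clusters would exceed the assumed load limit. The pairwise search is quadratic per merge, which is adequate for a sketch but would need a spatial index for large designs.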
The y-axis of Fig. 6 corresponds to the maximum distance between the pulse generator and the latches when their connections are made using a Steiner tree (see the thick line between generator and latch in Fig. 6). Note that this value varies for a given number of latches, depending on how the latches are distributed in the plane and how much wire is shared in the Steiner tree. A square bounding box is assumed, whose diagonal length is empirically set to twice the mean value of Fig. 6. After placement, the load of each generator is examined; if it exceeds the load limit, the square is reduced by setting the diagonal length to twice the minimum value of Fig. 6, which is then followed by legalization.

IV. SCAN DESIGN

A pulsed-latch circuit is susceptible to hold violations, as dictated by (1). The scan signal path, illustrated by a thick line in Fig. 7, is particularly susceptible because the path contains few logic gates. The extra buffers inserted to fix hold violations may mask the benefit of using pulsed latches. We have designed two special scan latches so that they, together with the standard scan latch, can be used selectively in a design in a way that minimizes hold violations.

A. Scan Latch Design

Fig. 8(a) is a standard scan latch; Q corresponds to the data output in normal operation (SE = 0) and to the scan output during scan operation (SE = 1) in the setting of Fig. 7. In Fig. 8(b), Q is dedicated to the data output while the scan output is available at an additional pin SQ; the polarity of the scan output is now opposite, which can be taken care of when test patterns are prepared. Notice that SQ is asserted after the falling edge of the pulse, as shown in the SPICE waveforms, which increases the minimum delay along the scan path and thereby reduces the risk of hold violations. This, however, comes at the cost of an 8% increase in cell area. Fig. 8(c) is similar to Fig. 8(b), and Q is used for both data and scan output; the difference is that Q is delayed and available after the falling edge of the pulse, which is conceptually similar to Fig. 8(b). The area of this latch is also 8% larger than that of the standard latch.

Fig. 8. (a) A standard latch, (b) a latch with delayed scan output, and (c) a latch with delayed data output.

Fig. 9. Data paths and a scan path launched from a latch $i$.

B. Scan Latch Selection

In the proposed design flow, logic synthesis, placement, and clock tree synthesis are all performed using latches with delayed scan output (Fig. 8(b)). For each latch $i$, we now determine whether there is a benefit if $i$ is replaced by another latch type (a standard latch or a latch with delayed data output). The objective is to minimize the sum of the area of $i$ and the area of the extra buffers that must be inserted to fix hold violations in the paths launched from $i$.

Consider Fig. 9; a latch $i$ usually launches more than one data path but only one scan path. If there are no hold violations on the data paths, we check the scan path. If the hold slack on the scan path is large enough that even a standard latch does not cause a hold violation, $i$ is replaced by a standard latch for the benefit of smaller latch area; otherwise no action is taken. If the data paths involve hold violations and the setup slack is large enough that a latch with delayed data output can be deployed without causing a setup violation, $i$ is replaced by that latch type; otherwise a standard latch and a latch with delayed scan output are compared in terms of latch area and the resulting buffer area, and the one that results in the smaller area is taken.

V. EXPERIMENTAL RESULTS

A set of test circuits was prepared using ITC benchmarks and OpenCores [9]. They are listed in Table I.

TABLE I
TEST CIRCUITS

Name       # Gates   # Sequencing elements   Clock period (ns)
-          1         121                     1.5
-          3461      247                     4.2
-          21191     1415                    2.9
-          82        2199                    2.2
aes_core   1442      53                      2.8
-          2164      19                      2.3
-          8674      97                      2.6
or1200     1812      788                     6.3
-          2487      229                     2.4
tv80       564       359                     2.3
-          9164      1746                    3.3
wb_dma     2272      563                     2.0

All experiments were performed using a 40-nm industrial library. The proposed design flow was implemented using Tcl scripts, which are executed on commercial CAD tools; specifically, pulse generator insertion, bounding box assignment, and scan latch selection were implemented. A pulse generator was designed for a pulse width of 21 ps in the worst process corner (11 ps in the best corner). Its load limit is 15 fF; the maximum fanout is set to 11 latches, which corresponds to about 1 fF.

A. Reference Design Flow

To assess the proposed design flow, a reference design flow was set up [4], [10]. An initial netlist is generated using flip-flops; it is then submitted to automatic placement. The critical path delay is measured and is taken as the clock period, which is reported in the last column of Table I. During clock gating, the load limit of the clock gating cell (CGC) is deliberately set to 15 fF, which is the same as the load limit of the pulse generator. After placement, all flip-flops are replaced by latches, and each CGC is replaced by a pulse generator with pulse enable (Fig. 3). To insert pulse generators for ungated latches, clock tree synthesis is performed while the load limit of the leaf-stage clock buffer is set to that of the pulse generator; each buffer is then replaced by a pulse generator, and the clock tree beyond the pulse generators is removed. A clock tree is synthesized once again with pulse generators as sinks; hold violations are checked and delay buffers are inserted where necessary; and routing is performed to finalize the layout. Only standard latches (see Fig. 8(a)) are assumed during test synthesis.

Fig. 10. Comparison of reference flow (left bars) and proposed flow (right bars): (a) circuit area and (b) power consumption. (Bars are broken down into pulse generator, pulse generator with enable, combinational logic, pulsed latch, and buffers.)

B. Comparison of Reference and Proposed Design Flow

1) Circuit Area: The sum of standard cell areas is compared between the reference and proposed design flows in Fig. 10(a). The proposed flow achieves a 19.8% reduction on average. This is due to three factors:

- Logic synthesis is performed using flip-flops in the reference flow, but using latches in the proposed flow. Because the sequencing overhead of a latch (57 ps) is much smaller than that of a flip-flop (159 ps), synthesis yields less combinational logic in the proposed flow for the same clock period.
- Fewer pulse generators are used, owing to the more efficient pulse generator insertion procedure (see Section III-A and Section V-A).
- Selective use of the three scan latches yields substantially fewer hold violations and fewer extra buffers.

2) Power Consumption: Power consumption (including both switching and leakage) was measured using a fast transistor-level simulator while 1 randomly generated patterns are provided at the inputs; the power consumption we report thus corresponds to that of an actively switching circuit. The result is shown in Fig. 10(b), which indicates a 29.8% reduction in the proposed flow. The main saving comes from pulse generators as well as from combinational logic. Comparing the portion of generators in Fig. 10(a) and (b) reveals their importance in power consumption. A generator consumes about 1 μW; this is 58 and 12 times the power consumption of a 2-input NAND gate and a latch, respectively.
The pulse generator portion in Fig. 10 is divided into a basic pulse generator and a generator with pulse enable. Because the basic pulse generator is never gated, it is a large source of power consumption even though it occupies a small area.

Fig. 11. The number of pulse generators (normalized) inserted for ungated latches, for the reference flow and the proposed flow.

C. Analysis of Pulse Generators Insertion and Placement

The numbers of pulse generators from the reference and proposed flows are compared in Fig. 11. The pulse generators for gated latches are the same in both flows and are omitted; circuits that use only one or two generators are not included in the comparison. We also obtain the number of pulse generators with zero wire capacitance and regard that number as a (loose) lower bound, which is used to obtain the normalized numbers of Fig. 11. The average of the reference flow is 1.55 while that of the proposed flow is 1.09. Considering that the bound is lower than the actual minimum (which is unknown), the proposed heuristic is efficient even though it is a simple greedy algorithm.

We use bounding boxes during placement to force pulse generators and latches to be placed close together. Another method to achieve this goal is to assign a higher net weight to their connections, although there is then no guarantee that the load limit of the pulse generators is honored. The two placement methods are compared in Fig. 12 in terms of total wirelength, as a way of assessing placement quality. Another placement is created without restriction on the location of pulse generators and latches; its total wirelength is measured and used to normalize the wirelength from the two methods. The average for net weighting is 1.09 while that for bounding boxes is 1.03, which shows the benefit of using bounding boxes.
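The zero-wire-capacitance lower bound used to normalize Fig. 11 can be illustrated with a short Python sketch. It assumes, as one plausible reading rather than the authors' stated procedure, that with no wire load every generator can drive the maximum fanout of 11 latches, so the bound is the number of ungated latches divided by 11, rounded up. The function names and the example counts are hypothetical.

```python
import math

MAX_FANOUT = 11  # maximum number of latches per pulse generator, from the text


def pg_lower_bound(n_ungated_latches, max_fanout=MAX_FANOUT):
    """Fewest generators needed if wires added no load (assumed reading)."""
    return math.ceil(n_ungated_latches / max_fanout)


def normalized_pg_count(n_generators, n_ungated_latches, max_fanout=MAX_FANOUT):
    """Generator count divided by the loose lower bound."""
    return n_generators / pg_lower_bound(n_ungated_latches, max_fanout)


# Hypothetical example: 1000 ungated latches driven by 100 generators.
print(pg_lower_bound(1000))                      # 91
print(round(normalized_pg_count(100, 1000), 2))  # 1.1
```

A normalized value close to 1 means the insertion result is close to the bound; since real wires always add load, the bound itself is generally not achievable.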

Fig. 12. Comparison of total wirelength of placement using net weighting and bounding boxes. (Axis: normalized wirelength.)

TABLE II
THE NUMBERS OF HOLD VIOLATIONS IN REFERENCE AND PROPOSED FLOW; PERCENTAGE USE OF SCAN LATCH TYPES IN PROPOSED FLOW

           # Hold violations       Percentage of latch types
Name       Reference   Proposed    Standard   Delayed SQ   Delayed Q
-          235         162         1.7        20.7         77.7
-          266         93          0.9        83.5         15.8
-          1899        784         1.7        57.5         40.8
-          375         1684        1.8        26.9         71.3
aes_core   489         166         1.1        74.5         24.3
-          316         147         6.3        66.8         26.8
-          1755        634         1.1        41.5         57.4
or1200     1277        623         2.1        27.8         70.2
-          43          122         1.7        25.8         72.5
tv80       576         212         3.3        35.9         60.7
-          2795        184         1.0        28.6         70.4
wb_dma     896         373         1.7        35.8         62.5
Avg        1.00        0.43        2.0        43.8         54.2

D. Analysis of Scan Design

The number of hold violations before delay buffers are inserted is compared in columns 2-3 of Table II; it is apparent that hold violations are substantially reduced by employing the newly designed scan latches. The extent to which hold slack is negative, not simply the number of hold violations, is also important. Fig. 13 reports two hold slack histograms of a test circuit, one at data outputs and the other at scan outputs; the benefit of the new scan latches is also clear in this context.

The last three columns of Table II show the percentage of each scan latch type used in the proposed flow. As expected, standard latches are rarely used, since employing them causes a large amount of negative hold slack in most scan paths. As described in Section IV-B, the initial netlist is made using only latches with delayed SQ, i.e., the data output is available at Q and the scan output is available at SQ. When we determine the scan latch type for each latch, if the hold slack at Q is negative, there is a high chance that the latch is replaced by a latch with delayed Q. This conjecture is well confirmed in Fig. 14.

Fig. 13. Hold slack histograms of a test circuit. (Panels: number of data outputs and number of scan outputs versus slack [ps], for the reference and proposed flows.)

Fig. 14. Latches having negative hold slack at data paths (before scan latch selection) versus percentage use of latches with delayed Q after scan latch selection.

VI. CONCLUSION

We have presented an integrated design flow based on commercial CAD tools, with the support of some scripts to customize pulse generator insertion, placement of pulse generators and latches, and scan chain optimization to reduce hold violations. The benefit of the proposed design flow in circuit area and power consumption has been demonstrated using a 40-nm commercial library. Future work includes validation of the proposed flow through a test chip.

ACKNOWLEDGMENT

This work was supported in part by the Mid-Career Researcher Program through NRF Grant funded by the MEST (211-2987), and by LG Electronics.

REFERENCES

[1] S. Naffziger et al., "The implementation of the Itanium 2 microprocessor," IEEE Journal of Solid-State Circuits, vol. 37, no. 11, pp. 1448-1460, Nov. 2002.
[2] H. Ando et al., "A 1.3-GHz fifth-generation SPARC64 microprocessor," IEEE Journal of Solid-State Circuits, vol. 38, no. 11, pp. 1896-1905, Nov. 2003.
[3] T. Baumann, D. Schmitt-Landsiedel, and C. Pacha, "Architectural assessment of design techniques to improve speed and robustness in embedded microprocessors," in Proc. Design Automation Conf., July 2009, pp. 947-950.
[4] H. Li, M. Chen, and K. Ho, "Integrated circuit design systems for replacing flip-flops with pulsed latches," Dec. 2011, U.S. Patent 8074190.
[5] H. Lee, S. Paik, and Y. Shin, "Pulse width allocation with clock skew scheduling for optimizing pulsed latch-based sequential circuits," in Proc. Int. Conf. on Computer-Aided Design, Nov. 2008, pp. 224-229.
[6] Y. Chuang, S. Kim, Y. Shin, and Y. Chang, "Pulsed-latch-aware placement for timing-integrity optimization," in Proc. Design Automation Conf., June 2010, pp. 280-285.
[7] H. Lin, Y. Chuang, and T. Ho, "Pulsed-latch-based clock tree migration for dynamic power reduction," in Proc. Int. Symp. on Low Power Electronics and Design, Aug. 2011, pp. 39-44.
[8] S. Paik, G. Nam, and Y. Shin, "Implementation of pulsed-latch and pulsed-register circuits to minimize clocking power," in Proc. Int. Conf. on Computer-Aided Design, Nov. 2011, pp. 156-161.
[9] OpenCores, http://www.opencores.org/.
[10] S. Shibatani and A. Li, "Pulse-latch approach reduces dynamic power," EE Times, July 2006.