Comparative study on low-power high-performance standard-cell flip-flops

S. Tahmasbi Oskuii, A. Alvandpour
Electronic Devices, Linköping University, Linköping, Sweden

ABSTRACT

This paper explores the energy-delay space of eight widely referenced flip-flops in a 0.13 µm CMOS technology. The main goal has been to find the smallest set of flip-flop topologies to be included in a high-performance flip-flop cell library covering a wide range of power-performance targets. Based on our comparison results, transmission-gate-based flip-flops show the best power-performance trade-off, with a total delay (clock-to-output + setup time) down to 105 ps. For higher performance, the pulse-triggered flip-flops are the fastest (80 ps) alternatives suitable for inclusion in a flip-flop cell library. However, pulse-triggered flip-flops consume significantly more power (about 2.5x) than other fast but fully dynamic flip-flops such as TSPC and dynamic TG-based flip-flops.

Keywords: flip-flops, latches, low-power, standard-cell, cell library, energy-delay space

1. INTRODUCTION

For high-performance VLSI chip design, the choice of back-end methodology has a significant impact on design time and design cost. Designing every single gate from scratch is not necessarily the best method. Instead, a sufficient set of pre-designed standard cells can be used as building blocks for most of the functional blocks. Semiconductor manufacturers offer standard-cell libraries, which are also supported by CAD tools in automated design flows, including the final physical auto-placement and routing. However, the selection of standard cells, as well as their performance, is often limited. Despite these performance limitations, standard-cell libraries can be useful even in the design of high-performance VLSI chips.
Often, only a small portion of a chip consists of performance-critical units, and the rest of the design can be maximally automated to reduce the design time without degrading the targeted performance. In addition, the concept of a cell library can be extended to support even the full-custom part of the chip. Custom (in-house) cell libraries can be made and shared by the designers of the performance-critical units. This sharply decreases the number of cells to be created and verified, reducing the total chip layout time significantly. Hence, development of an efficient cell library for high-performance chips is essential.

A cell library includes a number of cells with different functionalities, where each cell may be available in several sizes and with different driving capabilities. Two central categories of cells included in cell libraries are flip-flops and latches. These are extremely important circuit elements in any synchronous VLSI chip. They are not only responsible for the correct timing, functionality, and performance of the chip, but their clocked devices also consume a significant portion of the total active power. A universal flip-flop with the best performance, lowest power consumption, and highest robustness against noise would be an ideal component for a cell library. However, as this paper shows, increasing the performance of flip-flops generally involves significant power and robustness trade-offs. Therefore, a set of latches and flip-flops with different performance levels is essential, so that the more power-consuming and noise-sensitive elements are used only in the small, performance-critical portion of the chip. This avoids a global and unnecessary increase in power consumption as well as degraded robustness, which would reduce the overall noise margin and require extra careful and time-consuming design.
390 Microelectronics: Design, Technology, and Packaging, edited by Derek Abbott, Kamran Eshraghian, Charles A. Musca, Dimitris Pavlidis, Neil Weste, Proceedings of SPIE Vol. 5274 (SPIE, Bellingham, WA, 2004) 0277-786X/04/$15 doi: 10.1117/12.530225
The goal of this work is to find a small set (ideally the smallest set) of flip-flop topologies to be included in a library covering a wide range of power-performance targets. Our strategy has been to first explore the capabilities of conventional and simpler transmission-gate (TG) based flip-flop topologies before including other types of flip-flops. Among the large number of flip-flops that have been proposed in the past [1-5], we have selected some of the widely used and/or referenced topologies. Sec. 2 shows the eight flip-flops we have incorporated in our initial benchmark, including static and dynamic edge-triggered master-slave as well as semi-dynamic pulsed flip-flops. In contrast to many previously published results [3], [5], we have explored a wide power-performance space for each of the eight flip-flops. By sizing, we have identified the useful operating ranges of the flip-flops. The design-space exploration not only enables a true comparison, but also reveals potentially large overlaps in the operating ranges of the flip-flops. This in turn provides an opportunity to reduce the number of different circuit topologies in a flip-flop library. Sec. 3 describes our simulation setup, as well as the flip-flop parameters we have considered in our comparisons. In Sec. 4 we show the comparison results, including the energy-delay space for each flip-flop topology, followed by the conclusions in Sec. 5.

2. FLIP-FLOP TOPOLOGIES

As described in Sec. 1, many flip-flop topologies have been proposed in the past. For our comparative study, we have selected some of the widely used and/or referenced topologies for our initial benchmark. Four static master-slave flip-flops are included in our test bench. Figure 1 shows the classic transmission-gate-based flip-flop (TGMS) [1]. Another variation of TGMS is the flip-flop shown in figure 2, which is derived from the PowerPC 603 master-slave flip-flop [6]. In the PowerPC 603 flip-flop, the interrupting feedback in the storage elements is based on C²MOS inverters.
Figure 3 shows the third topology, a modified clocked-inverter flip-flop (mc²mos) [7], in which the dynamic master-slave C²MOS flip-flop is turned into a pseudo-static C²MOS flip-flop by adding C²MOS feedback at the outputs. The fourth master-slave flip-flop (figure 4) is based on the traditional SR latch built from cross-coupled NAND/NOR gates [1], [4]. The next two flip-flops (figures 5, 6) are pulse-triggered latches. They are based on a single latch, which is transparent for a short time (during a pulse) at the edge of the clock. Figure 5 shows a hybrid-latch flip-flop (HLFF) [8], and figure 6 shows a semi-dynamic flip-flop (SDFF) [9]. Both pulse-triggered topologies require and include pulse generators. Further, there are two fully dynamic flip-flops in our benchmark: the TSPC flip-flop [10] in figure 7 and the dynamic transmission-gate flip-flop [1], [2] in figure 8. These fast flip-flops (with floating nodes) are especially sensitive to noise and leakage currents. However, we have included their performance level as a reference against which to evaluate the other flip-flops.

Figure 1: TGMS flip-flop
Figure 2: PowerPC 603 flip-flop
Figure 3: Modified C²MOS flip-flop (mc²mos)
Figure 4: NAND-NOR flip-flop
Figure 5: Hybrid latch flip-flop (HLFF)
Figure 6: Semi-dynamic flip-flop (SDFF)
Figure 7: 9T TSPC flip-flop
Figure 8: Fully dynamic TGMS flip-flop

3. SIMULATION SETUP

All the circuits are designed in a standard 0.13 µm CMOS technology. The supply voltage used for simulations is 1.2 V, and the operating temperature is 27 °C. The simulation condition is shown in figure 9. All of the flip-flops use identical, fixed input drivers (minimum-sized inverters) and are loaded equally by the input capacitances of four minimum-sized inverters.
Figure 9: The simulation test bench

3.1. Energy-delay space exploration

We have explored the energy-delay space of each flip-flop by sizing the internal devices, while keeping the input drivers and the output loads fixed. The best delays achieved at a number of different target energy-consumption points have been selected as the sub-optimal energy-delay points. Our timing and energy metrics are described as follows.

3.1.1. Delay and timing metrics

The performance of a flip-flop is defined by three important time windows and delays: clock-to-output delay, setup time, and hold time. For performance comparison, the setup and hold times require a clear definition. An edge-triggered flip-flop requires the input data to be stable for some time before the edge of the clock. This time window is referred to as the setup time of the flip-flop. The time after the clock edge for which the input has to remain stable is called the hold time.

The setup time can be defined and measured in different ways. The time window could be measured relative to the point at which the flip-flop fails to capture the data. However, this definition might be impractical: before the input reaches the failure limit, the flip-flop responds more slowly to the input data, which increases its delay. Throughout this paper, we use the following definition for setup and hold time, which was also used in [3]: setup time and hold time are the data-to-clock offsets that cause a 5% increase in the clock-to-output delay. This is illustrated in figure 10.

Propagation delay, setup time, and hold time may differ for low-to-high and high-to-low input transitions. For the comparisons, we have chosen the worst-case delays:

$t_{\text{Clock-to-Output}} = \max(t_{\text{Clock-to-Output,LH}},\ t_{\text{Clock-to-Output,HL}})$
$t_{\text{setup}} = \max(t_{\text{setup,LH}},\ t_{\text{setup,HL}})$
$t_{\text{hold}} = \max(t_{\text{hold,LH}},\ t_{\text{hold,HL}})$

Figure 10: Setup time and hold time definitions
Figure 11: Flip-flops at the logic boundaries
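The 5% criterion can be sketched in code as a post-processing step on a simulated sweep of the data-to-clock offset. This is only an illustration under stated assumptions: the function names, the sample sweep values, and the use of linear interpolation between samples are ours and are not taken from the paper's simulation flow.

```python
from typing import Sequence


def setup_time_from_sweep(
    offsets_ps: Sequence[float],
    clk_to_q_ps: Sequence[float],
    degradation: float = 0.05,
) -> float:
    """Offset at which clock-to-output has grown by `degradation` (5% here).

    offsets_ps  -- how long data arrives before the clock edge, swept from
                   large (relaxed) to small (tight)
    clk_to_q_ps -- clock-to-output delay measured at each offset
    The first sample is assumed to represent the nominal (relaxed) delay.
    """
    if len(offsets_ps) != len(clk_to_q_ps) or len(offsets_ps) < 2:
        raise ValueError("need at least two matching samples")
    threshold = clk_to_q_ps[0] * (1.0 + degradation)
    for i in range(1, len(offsets_ps)):
        d0, d1 = clk_to_q_ps[i - 1], clk_to_q_ps[i]
        if d0 < threshold <= d1:
            # Assumption: interpolate linearly between the bracketing samples.
            frac = (threshold - d0) / (d1 - d0)
            return offsets_ps[i - 1] + frac * (offsets_ps[i] - offsets_ps[i - 1])
    raise ValueError("sweep never reaches the degradation threshold")


# Hypothetical sweep data (ps), one sweep per input transition direction.
offsets = [200.0, 100.0, 60.0, 40.0, 20.0]
delay_lh = [60.0, 60.0, 61.0, 65.0, 80.0]
delay_hl = [58.0, 58.0, 58.5, 60.0, 70.0]

t_setup_lh = setup_time_from_sweep(offsets, delay_lh)
t_setup_hl = setup_time_from_sweep(offsets, delay_hl)

# Worst case over both transition directions, as in the definitions above.
t_setup = max(t_setup_lh, t_setup_hl)
print(f"t_setup = {t_setup:.1f} ps")
```

The same routine could be applied to a sweep of the data change after the clock edge to estimate the hold time; a finer sweep step reduces the dependence on the interpolation assumption.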
Further, in a digital system (figure 11) the following condition has to be satisfied:

$\text{Latency} = t_{\text{Clock-to-Output(Max)}} + t_{\text{setup}}$  (1)
$\text{Latency} + t_{\text{logic}} + t_{\text{skew}} \le T$  (2)

where T (the clock period) must be at least the sum of the worst-case clock-to-output delay of flip-flop A, the setup time of flip-flop B, the maximum delay in the combinational logic, and the relative clock skew. Therefore, we have used the sum of the clock-to-output delay and the setup time as the delay imposed by the flip-flops.

3.1.2. Energy metrics

Two energy metrics have been used for the comparisons. The first is energy per transition (EPT), which is the average energy consumed when a transition appears at the input of the flip-flop (the average over high and low transitions). The second is the clock energy, which is the average energy consumed when the data activity is zero.

Since the flip-flops are targeted for use in cell libraries, the input clock and the input data are both single-ended. All additional clock phases are generated inside the flip-flops. The power consumption of the clock and data drivers is included in the total power consumption of the flip-flops.

4. SIMULATION RESULTS AND COMPARISON

Figures 12-19 show the energy-delay space of each flip-flop. Each figure includes two sub-graphs:

1. The upper sub-graph shows the flip-flop energy per transition versus the total delay (clock-to-output + setup time). The energy consumed by the clocked devices is shown in black.
2. The lower sub-graph shows the total delay (clock-to-output + setup time) versus the total flip-flop energy per transition. The setup time and the clock-to-output delay are highlighted in white and black, respectively.

Figures 20-23 summarize the energy-delay space of all the flip-flops. As figure 21 shows, the transmission-gate flip-flops TGMS and PowerPC 603 show the best power-performance trade-off among the fully static flip-flops.
Further, they cover a relatively wide portion of the total energy-delay space. The pulse-triggered flip-flops HLFF and SDFF can support shorter delay targets. Figure 21 shows that HLFF and SDFF are faster mainly because of their shorter setup time. Based on this figure, the SDFF is the fastest flip-flop. However, the pulse-triggered flip-flops consume considerably more power (about 2x compared to the TGMS flip-flops). The TSPC and the dynamic TG-based flip-flops have comparable performance while consuming up to 50% of the energy needed by the SDFF. However, their internal floating nodes are sensitive to leakage currents and other sources of noise [11].

Figure 12: Energy-delay space for TGMS
Figure 13: Energy-delay space for PowerPC 603
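To illustrate what these latency differences mean at the system level, condition (2) of Sec. 3.1.1 (clock period at least latency plus logic delay plus skew) can be turned into a logic-delay budget. The sketch below uses the two best-case total delays quoted in the abstract (105 ps and 80 ps); the clock period, the skew, and all names are hypothetical values chosen only for illustration.

```python
def logic_budget_ps(clock_period_ps: float, latency_ps: float, skew_ps: float) -> float:
    """Largest t_logic that satisfies latency + t_logic + t_skew <= T."""
    budget = clock_period_ps - latency_ps - skew_ps
    if budget < 0:
        raise ValueError("flip-flop latency and skew alone exceed the clock period")
    return budget


T_PS = 500.0     # hypothetical clock period (2 GHz), not from the paper
SKEW_PS = 25.0   # hypothetical relative clock skew, not from the paper

# Best-case total delays (clock-to-output + setup) quoted in the abstract.
latencies_ps = {
    "TG-based master-slave": 105.0,
    "pulse-triggered": 80.0,
}

budgets = {
    name: logic_budget_ps(T_PS, latency, SKEW_PS)
    for name, latency in latencies_ps.items()
}

for name, budget in budgets.items():
    overhead = latencies_ps[name] / T_PS
    print(f"{name}: {budget:.0f} ps for logic, {overhead:.0%} of the cycle in the flip-flop")
```

Under these assumed numbers the faster flip-flop frees 25 ps of the cycle for logic, which is the gain that has to be weighed against its higher energy.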
Figure 14: Energy-delay space for mc²mos
Figure 15: Energy-delay space for NANDNOR
Figure 16: Energy-delay space for HLFF
Figure 17: Energy-delay space for SDFF
Figure 18: Energy-delay space for TSPC
Figure 19: Energy-delay space for dynamic TGMS
Figure 20: Energy per transition versus clock-to-output delay
Figure 21: Energy per transition versus total delay
Figure 22: Clock energy versus clock-to-output delay
Figure 23: Clock energy versus total delay

Figures 20-23 can be used to identify the optimum flip-flop topology for different energy-delay targets. As an example, Table 1 compares the flip-flops at their minimum EPT·delay² points in figure 21.
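The selection procedure can be sketched as follows, using the overall delay and energy-per-transition values reported in Table 1. This is a simplified illustration: each topology is represented by a single operating point rather than by its full energy-delay curve, and the function names and the `NON_DYNAMIC` grouping are our own assumptions.

```python
from typing import Iterable, NamedTuple, Optional, Set


class Point(NamedTuple):
    name: str
    delay_ps: float   # overall delay from Table 1
    ept_fj: float     # energy per transition from Table 1


TABLE1 = [
    Point("SDFF", 83.6, 46.8),
    Point("HLFF", 94.5, 34.7),
    Point("TGMS-dynamic", 98.4, 15.8),
    Point("TSPC", 103.8, 15.6),
    Point("PowerPC", 116.3, 18.9),
    Point("TGMS", 118.7, 18.8),
    Point("mc2mos", 152.8, 29.9),
    Point("NANDNOR", 197.5, 25.1),
]

# Assumed grouping: everything except the two fully dynamic flip-flops.
NON_DYNAMIC: Set[str] = {p.name for p in TABLE1} - {"TGMS-dynamic", "TSPC"}


def ept_delay2(p: Point) -> float:
    """Energy-delay-squared figure of merit, in fJ * ps^2."""
    return p.ept_fj * p.delay_ps ** 2


def lowest_energy_meeting(
    points: Iterable[Point],
    max_delay_ps: float,
    allowed: Optional[Set[str]] = None,
) -> Optional[Point]:
    """Lowest-EPT point whose delay meets the target, or None if none does."""
    candidates = [
        p for p in points
        if p.delay_ps <= max_delay_ps and (allowed is None or p.name in allowed)
    ]
    return min(candidates, key=lambda p: p.ept_fj) if candidates else None


for target in (120.0, 100.0, 90.0):
    pick = lowest_energy_meeting(TABLE1, target, NON_DYNAMIC)
    print(f"delay <= {target:.0f} ps: {pick.name if pick else 'no candidate'}")

best = min(TABLE1, key=ept_delay2)
print(f"lowest EPT*delay^2 among the Table 1 points: {best.name}")
```

With the full curves of figures 12-19 in place of single points, the same filter-then-minimize step would expose the overlaps between topologies that allow the library to be reduced.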
Flip-flop      Overall     Clock-to-    Setup      Hold       Energy per        Clock
               delay [ps]  output [ps]  time [ps]  time [ps]  transition [fJ]   energy [fJ]
SDFF             83.6        65.1         15.0       18.8       46.8              34.4
HLFF             94.5        64.4         26.9       15.6       34.7              21.7
TGMS-dynamic     98.4        49.8         46.1       -6.4       15.8               4.4
TSPC            103.8        59.7         41.1        3.9       15.6               6.7
PowerPC         116.3        60.2         53.1      -17.4       18.9               5.7
TGMS            118.7        63.3         52.2      -17.8       18.8               5.6
mc²mos          152.8        68.6         80.8      -31.7       29.9              10.6
NANDNOR         197.5        94.9         97.9      -30.8       25.1               7.5

Table 1: Performance parameters at minimum EPT·Latency²

5. CONCLUSION

In this paper, we have explored the energy-delay space of eight widely referenced flip-flops to be included in a high-performance flip-flop cell library covering a wide range of power-performance targets. All eight flip-flops have been designed in a standard 0.13 µm CMOS technology at 1.2 V. Based on our simulation results, we have shown that transmission-gate-based flip-flops (such as TGMS and PowerPC 603) exhibit the best power-performance trade-off, with a total delay (clock-to-output + setup time) down to 105 ps. For higher performance, the pulse-triggered semi-dynamic flip-flop SDFF (figure 6) is the fastest (80 ps) alternative suitable for inclusion in a flip-flop cell library. However, pulse-triggered flip-flops consume significantly more power (about 2.5x) than fully dynamic flip-flops such as TSPC and dynamic TG-based flip-flops.

ACKNOWLEDGEMENTS

The authors would like to thank Dr. Ram Krishnamurthy and James Tschanz (Intel Corporation) and Prof. Christer Svensson (Linköping University) for useful technical discussions.

REFERENCES

1. Weste N. H. E., Eshraghian K., Principles of CMOS VLSI design, a systems perspective, second edition, Addison-Wesley, 1994
2. Rabaey J. M., Chandrakasan A., Nikolic B., Digital integrated circuits, a design perspective, second edition, Prentice Hall, 2003
3. Markovic D., Nikolic B., Brodersen R. W., Analysis and design of low-energy flip-flops, Proceedings of the International Symposium on Low Power Electronics and Design, 6-7 Aug. 2001, Pages: 52-55
4. Uyemura J., Circuit Design for CMOS VLSI, Kluwer Academic Publishers, Norwell, Massachusetts, 1992
5. Stojanovic V., Oklobdzija V. G., Comparative analysis of master-slave latches and flip-flops for high-performance and low-power systems, IEEE Journal of Solid-State Circuits, Volume: 34 Issue: 4, April 1999, Pages: 536-548
6. Gerosa G., Gary S., Dietz C., Dac Pham, Hoover K., Alvarez J., Sanchez H., Ippolito P., Tai Ngo, Litch S., Eno J., Golab J., Vanderschaaf N., Kahle J., A 2.2 W, 80 MHz superscalar RISC microprocessor, IEEE Journal of Solid-State Circuits, Volume: 29 Issue: 12, Dec. 1994, Pages: 1440-1454
7. Suzuki Y., Odagawa K., Abe T., Clocked CMOS calculator circuitry, IEEE Journal of Solid-State Circuits, Volume: 8 Issue: 6, Dec. 1973, Pages: 462-469
8. Partovi H., Burd R., Salim U., Weber F., DiGregorio L., Draper D., Flow-through latch and edge-triggered flip-flop hybrid elements, 1996 IEEE International Solid-State Circuits Conference (43rd ISSCC), Digest of Technical Papers, 8-10 Feb. 1996, Pages: 138-139
9. Klass F., Semi-dynamic and dynamic flip-flops with embedded logic, Digest of Technical Papers, 1998 Symposium on VLSI Circuits, Honolulu, HI, USA, 11-13 June 1998, Pages: 108-109
10. Yuan J., Svensson C., High-speed CMOS circuit technique, IEEE Journal of Solid-State Circuits, Volume: 24 Issue: 1, Feb. 1989, Pages: 62-70
11. Larsson P., Svensson C., Noise in digital dynamic CMOS circuits, IEEE Journal of Solid-State Circuits, Volume: 29 Issue: 6, June 1994, Pages: 655-662