Flip-Flop Insertion with Shifted-Phase Clocks for FPGA Power Reduction

Hyeonmin Lim, Kyungsoo Lee, Youngjin Cho and Naehyuck Chang
School of CSE, Seoul National University, Korea

Abstract

Although the LUT (look-up table) size of FPGAs has been optimized for general applications, complicated designs may contain a large number of cascaded LUTs between flip-flops. This results in unwanted glitch propagation along the LUTs, and wastes power. We propose insertion of new flip-flops between adjacent existing flip-flops to minimize glitch propagation and power loss. Each new flip-flop is timed by a phase-shifted clock, with the phase calculated from the delays of LUTs and routing paths. This differs from traditional retiming methods, which use the original clock or a 180-degree clock for the new flip-flops and thus alter the original pipeline structure and synchronization. We start from a post-layout design, retaining its clock frequency and timing behavior. Multiple flip-flop insertion is an NP-complete problem because each new flip-flop affects the delays in the design. We have devised a glitch generation and propagation model for LUT-based FPGAs that takes account of path delays while supporting reasonable complexity. We propose effective heuristics for flip-flop insertion and clock phase selection. Full-chip measurements, including all the overheads associated with the inserted flip-flops, show that our approach saves up to 38% of the total dynamic power. We have analyzed our scheme, showing the mechanics of clock assignment and glitch minimization, and the sources of power reduction.

I. INTRODUCTION

FPGAs (Field Programmable Gate Arrays) are generally more expensive, and have smaller capacity and lower performance, than other target devices that require fabrication.
Despite these drawbacks, FPGA implementations are often retained in final products, thanks to their short time-to-market, low financial risk, and the ease of making design changes. However, as the density of FPGAs increases, their power consumption becomes problematic. Even look-up table (LUT) based FPGAs, which are more frugal than other architectures, consume a significant amount of power. FPGA vendors offer various sizes of LUT, but an FPGA is rarely optimal for a particular design. In fact, complicated designs generally contain a large number of LUT cascades with two or more levels. This results in unwanted glitch propagation down the cascaded LUTs until a flip-flop terminates signal propagation, which causes a serious waste of power unless a well-designed glitch blocking method is applied.

In this paper, we introduce a power reduction technique for LUT-based FPGAs that blocks glitch propagation with inserted flip-flops. Reducing glitch propagation by inserting flip-flops is an established technique, but our approach to clock assignment is new. One of the most distinctive characteristics of LUT-based FPGAs is their interconnect delay. Like deep sub-micron designs, LUT-based FPGAs exhibit path delays due to the high resistance and capacitance of the routing resources. Glitches may contribute up to 70% of the total power consumption of ASICs (application specific integrated circuits) [1] [2]. Power consumption due to glitches is even more serious for FPGAs. Interconnect resources dissipate at least 65% of total power in the Xilinx 4003 family, and more than 60% in the Xilinx Virtex-II family [3] [4]. However, glitch blocking, using a combination of pipelining and retiming, can reduce the amount of energy consumed per operation by between 40% and 90% [5]. We will focus on the glitches that have the highest impact on power consumption.
Pipelining or retiming, or both, involves the insertion of flip-flops between combinational logic blocks. This reduces glitch propagation effectively, but in ASICs it requires more area for the additional flip-flops. Dividing each flip-flop into two latches timed by dual 180-degree phase-shifted clocks is one way to reduce the area requirement [6]. Although retiming by fixed-phase clock assignment is straightforward, it results in architectural modification and altered timing behavior. Consequent changes to factors such as pipeline depth, latency and synchronization with other pipelines may require the entire design to be revisited.

Our approach is completely different from traditional retiming techniques: we reactivate disabled flip-flops in logic blocks that comprise cascaded LUTs in large-scale combinational logic, clocking them with phase-shifted clocks rather than the original (in-phase) clock. We determine the amount of phase shift by considering the delays in LUTs and routing paths. Since the resolution of the phase shift and the number of dedicated clock lines are both limited, we insert flip-flops in an order that maximizes the benefit from glitch blocking, in case the process has to be terminated because clock resources are exhausted. Since the slack in a net is dependent on the slack of other nets, flip-flop insertion is an NP-complete problem. We have proved its NP-completeness and devised an appropriate heuristic solution. Unless a cascaded LUT is on a critical path, we can exploit its slack. Thus, even though the delays of two signal paths are not identical, the inserted flip-flops may share the same phase clock. Our heuristics aim at the maximum power saving by appropriate grouping of the inserted flip-flops.

We formulate a glitch generation and propagation model which can take account of routing path delays while supporting reasonable design complexity. LUT-based logic has different glitch generation and propagation characteristics from ASICs.
Our model of LUTs takes the SRAM structure into account. By the addition of power consumption models, we have built an FPGA dynamic power simulator for flip-flop insertion. Our approach considers all the overheads of flip-flop insertion, taking account of the switching capacitance of each flip-flop and its dedicated clock path. Furthermore, the final result is a real full-chip measurement, and thus shows the actual power saving, including every overhead.

Fig. 1. Glitch minimization by insertion of unused flip-flops: (a) one-level LUT (no glitch propagation); (b) two-level LUT (glitch propagation); (c) glitch block with in-phase clock; (d) glitch block with shifted-phase clock.

Fig. 2. Timing diagram of (a), (b), (c), and (d) in Figure 1.

Measurements are backed up by a complete analysis of the clock assignment process, the extent of glitch reduction and of dynamic power reduction. The results of this analysis accord with the measurement results, and provide further justification of the effectiveness of the proposed scheme.

Our power-reduction scheme starts with a post-layout design, and retains the original clock frequency and timing behavior. Because our approach preserves timing and logical behavior, we can apply it in tandem with other FPGA power-reduction techniques [7] [8]. Because we are using FPGAs and not ASICs, we do not have to sacrifice any area for flip-flop insertion; we can simply activate unused flip-flops [2]. Phase shifting of the original clock and additional dedicated clock lines are already available as built-in functions on FPGAs, and require no additional silicon area. The Virtex-II architecture from Xilinx Inc. provides a logic block which can shift the phase in steps of 50 ps or 1/256 of the input clock period [9]. FPGAs from Altera can also shift the phase of the input clock using PLLs [10]. This functionality makes it possible to generate phase-shifted clocks without using either additional logic blocks on the FPGA or off-chip support. We used the Virtex-II architecture as our target device, and the Xilinx FPGA design tool for functions such as synthesis, timing simulation, placement and routing delay calculation.

The rest of this paper is organized as follows. We clarify the problem in Section II. Section III introduces a LUT-based FPGA energy model, constructed in terms of effective capacitance and switching activity.
Based on this model, we propose an effective phase-shifted clock assignment algorithm, called Solve-FPA, in Section IV. The experimental results are shown in Section V.

II. PROBLEM STATEMENT

A. Concept of glitch blocking

Glitches in LUT-based FPGAs can be blocked by activating unused flip-flops, without performance degradation, by the use of a phase-shifted clock. In Fig. 1(a), a glitch generated from L1 is blocked by FF1; thus there is no additional power consumption due to glitch propagation in this one-level LUT. In Fig. 1(b), a glitch generated by L1 propagates to L2 and causes additional power consumption in the routing path from L1 to L2 and in L2. By activating an unused flip-flop, FF1, we can stop glitch propagation from L1 to L2, as we see in Fig. 1(c). Note that FF1 is triggered by the system clock (i.e. the in-phase clock). Thus insertion of FF1, clocked by the in-phase clock, causes a change in system behavior. The timing diagram of Fig. 1 is shown in Fig. 2. In Figs. 2(a) and 2(b), evaluation of L1 and L2 is finished in one clock cycle. However, as shown in Fig. 2(c), a traditional approach would require an additional clock cycle to finish the evaluation of L2.

B. Shifted-phase clock approach

Now we will use the slack time that exists before the new flip-flop is inserted to negate the change in system behavior. Slack time exists between flip-flops which are not on the critical path, or between all flip-flops when the design is not operated at the maximum clock speed. In Fig. 2(b), the total delay of the two LUTs, L1 and L2, is shorter than the clock cycle, which means that the flip-flop input will arrive sooner than the triggering edge of the in-phase clock. We can use this slack time to insert the flip-flop, without an additional clock cycle, by triggering the new flip-flop with a new phase-shifted clock φ1. Fig. 1(d) shows the insertion of FF1, to be triggered by the shifted-phase clock φ1. The phase of the new clock is chosen to match the logic block and interconnect delay, and thus the original system behavior is restored.

Fig. 2(d) shows how the slack time determines the phase of φ1. When L1 is executed as soon as possible and L2 is executed as late as possible, the slack time between L1 and L2 is maximized. Since we are considering rising-edge flip-flops, the allowable phase of the clock φ1 that triggers FF1 is bounded by the amount of slack time between L1 and L2. Note that the usable width for φ1 is slightly less than the slack time because of the setup and hold times of the flip-flop. Moreover, we cannot insert a flip-flop on the critical path without degrading the maximum operating frequency of the original design, again because of the setup and hold times of the flip-flop. Comparing Fig. 2(d) with Fig. 2(b), we can minimize glitch propagation by triggering the newly inserted flip-flop with φ1, without changing the system behavior.

C. Dynamic power saving

After an unused flip-flop is activated, glitch propagation is blocked, and thus dynamic power is saved. Since the two major elements of dynamic power consumption are switching activity and load capacitance, we compute the number of glitches blocked by the inserted flip-flop and the total load capacitance influenced by each blocked glitch. The dynamic power saved by activating a flip-flop is obtained by multiplying the number of blocked glitches by the total effective capacitance of each.

D. Overhead of flip-flop insertion

While new flip-flops save power by blocking glitch propagation, they also dissipate additional power. After a flip-flop is inserted, the glitch which would otherwise have been propagated now hits the new flip-flop. The resulting power overhead is the product of the total number of blocked glitches and the capacitance of the inserted flip-flop. Using the phase-shifted clock is also a power overhead. Since the clock in an FPGA uses a global net which is routed through the entire chip, the effective capacitance of the global wiring required for a new clock cannot be neglected.

E. Flip-flop insertion as an optimization problem

Given an FPGA design after placement and routing, we need an assignment of the phase-shifted clocks that deploys unused flip-flops for maximum power saving. We call this the flip-flop and phase-shifted clock assignment (FPA) problem. An instance of this problem is characterized by $(D, n_c, r_p)$, where $D$ is the FPGA design, $n_c$ is the available number of new clocks and $r_p$ is the reciprocal of the phase shift resolution. The factors $n_c$ and $r_p$ are dependent on the target FPGA architecture.

Ideally, we would achieve maximum glitch blocking by inserting flip-flops wherever LUTs are directly connected to each other. Each newly inserted flip-flop would be clocked by an appropriate phase-shifted clock, reflecting the delay calculations. In reality this is not feasible due to the limitations of the phase-shifted clocks. The number of phase-shifted clocks is limited by the clock lines in the FPGA, and the phases of the clocks are limited by the resolution of the PLLs or DLLs.
As a result, we have to choose the locations for flip-flop insertion, and the corresponding clock phases, carefully, so as to achieve the maximum power saving by glitch blocking. This problem can be summarized as follows:

Problem 1: FPA: Let us assume that the target FPGA has clock lines $c = (c_1, \ldots, c_{n_c})$ and PLL/DLL phase shift slots $p = (p_1, \ldots, p_{r_p})$. For a given $(D, n_c, r_p)$, determine the phase selection $f_p : c_i \rightarrow p_j$, where $1 \le i \le n_c$ and $1 \le j \le r_p$, such that $f_p$ results in the maximum power saving without changing the logical behavior of $D$.

Once the FPA problem has been solved, flip-flop insertion becomes straightforward using existing resources. We can activate unused flip-flops using the clock lines.

III. POWER CONSUMPTION OF LOOK-UP TABLE BASED FPGAS

A. The Virtex-II architecture

Virtex-II devices are user-programmable gate arrays consisting of input/output blocks (IOBs) and configurable logic blocks (CLBs). Each CLB consists of four slices, each of which includes two 4-input function generators, carry logic, arithmetic logic gates, wide function multiplexers and two storage resources. Each 4-input function generator is configurable as a 4-input LUT, as a 16-bit shift register element, or as a 16-bit distributed SelectRAM memory. The output from the function generator in each slice drives both the slice output and the D input of the storage element. The storage resources are configurable either as edge-triggered D-type flip-flops or as level-sensitive latches. We assume that function generators and storage resources are fixed as 4-input LUTs and D flip-flops, respectively.

TABLE I. The effective capacitance of each resource in the Virtex-II 1000-FG456.

Type                    Resource            Capacitance (pF)
Interconnect per CLB    Input crossbar      9.44
                        Output crossbar
                        Double
                        Hex
                        Long
Logic per CLB           LUT inputs
                        Flip-flop inputs
Clocking                Global
                        Local

All routing resources are segmented into long, hex and double lines. Long lines are bidirectional wires that distribute signals across the device.
Vertical and horizontal long lines span the full height and width of the device. Hex lines route signals to every third or sixth block in all four directions. Double lines route signals to every first or second block in all four directions. In addition, there are two sets of switches that connect the wire segments to the input and output of each CLB: the input and output crossbars.

The digital clock manager (DCM) is a logic block for clock management in the Virtex-II architecture. While its main features are clock de-skew, phase adjustment, and frequency synthesis, we can utilize it effectively for phase shifting of the system clock. The DCM can create a phase-shifted output with a resolution of 50 ps or 1/256 of the input clock period. The number of DCMs, which is the number of available new clocks, is dependent on the device, but the minimum is four [9].

B. Effective capacitance

Dynamic power dissipation is caused by signal transitions in the circuit, and comprises two parts: i) charge and discharge of capacitance; and ii) short-circuit power. The dynamic power consumption caused by charge and discharge of capacitance in a CMOS circuit is given by

$$ \text{Power} = \sum_{x} C_x V_{dd}^2 f_x, \tag{1} $$

where $C_x$ is the load capacitance of the net $x$, $V_{dd}$ is the supply voltage, and $f_x$ is the transition rate of $x$. Since short-circuit power dissipation is mostly caused by switching, it can be modeled as an additional capacitance in Eq. 1. Thus the effective capacitance is defined as the sum of the parasitic effects due to interconnection wires and transistors, and the emulated capacitance due to short-circuit currents [4]. The effective capacitance of each resource is shown in Table I. The effective capacitances of long and global wires are dependent on the size of the target FPGA, and we choose values (from [4]) that are consistent with the Virtex-II 1000-FG456.

C. Switching activity

We measure switching activity using the concept of transition density.
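Eq. 1 can be evaluated directly once each net has an effective capacitance and a transition rate. The following is a minimal sketch, not the authors' simulator; the function name `dynamic_power` and the example net values (other than the 9.44 pF crossbar figure from Table I) are hypothetical and chosen only for illustration.

```python
def dynamic_power(nets, vdd):
    """Evaluate Eq. 1: sum of C_x * Vdd^2 * f_x over all nets.

    nets: iterable of (effective_capacitance_in_farads, transitions_per_second).
    vdd:  supply voltage in volts.
    Returns the dynamic power in watts.
    """
    return sum(c * vdd ** 2 * f for c, f in nets)


# Hypothetical example: a 9.44 pF net toggling 10 million times per second
# and a 5 pF net toggling 2 million times per second, at a 1.5 V core voltage.
example_nets = [(9.44e-12, 10e6), (5.0e-12, 2e6)]
power_watts = dynamic_power(example_nets, vdd=1.5)
```

Because the supply voltage is a common factor, only the product of capacitance and transition rate matters when comparing candidate nets against each other.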
The transition density of a net $x$, $D(x)$, is defined as $\lim_{T \to \infty} n_x(T)/T$, where $n_x(T)$ is the number of transitions of $x$ in the time interval $(-T/2, +T/2]$ [11]. To compute the exact power saved by flip-flop blocking, we need to know the number of transitions which are glitches. To achieve this, we compute the transition density with and without considering the delay, and take the difference between these two values.

First, without considering the delay: when the time interval of the transition density is a clock period, we model the transition density as the probability of a transition in a clock. The Boolean difference, $\partial y/\partial x_i$, is used to compute the propagation rate of the transition from the input to the output of the logic block [11]. The probability of a Boolean difference, $P(\partial y/\partial x_i)$, is calculated by dividing $\partial y/\partial x_i$ by $2^n$, which is the total number of input patterns. We can extend the Boolean difference to accommodate multiple input transitions as

$$ \frac{\partial y}{\partial (x_i x_j)} = y|_{x_i=1,\,x_j=1} \oplus y|_{x_i=0,\,x_j=0} + y|_{x_i=1,\,x_j=0} \oplus y|_{x_i=0,\,x_j=1}, \tag{2} $$

and derive the transition density of the output $y$:

$$
\begin{aligned}
D(y) ={}& P\!\left(\frac{\partial y}{\partial x_1}\right) D(x_1)\,(1-D(x_2))\,(1-D(x_3))\cdots(1-D(x_n)) \\
&+ P\!\left(\frac{\partial y}{\partial x_2}\right) (1-D(x_1))\,D(x_2)\,(1-D(x_3))\cdots(1-D(x_n)) \\
&+ \cdots \\
&+ P\!\left(\frac{\partial y}{\partial (x_1 x_2)}\right) D(x_1)\,D(x_2)\,(1-D(x_3))\cdots(1-D(x_n)) \\
&+ \cdots \\
&+ P\!\left(\frac{\partial y}{\partial (x_1 x_2 \cdots x_n)}\right) D(x_1)\,D(x_2)\,D(x_3)\cdots D(x_n).
\end{aligned}
\tag{3}
$$

The first term of this equation is the probability that the transition occurs only at the input $x_1$. We sum these probabilities over all of the input patterns, assuming no input correlation.

The glitch propagation characteristics of LUTs are different from those of ASICs, and we need to take the SRAM structure of LUTs into account. In a LUT, implemented as an SRAM, the difference in input delays causes skew between input signals. This is shown for a 4-input LUT in Fig. 3, and here the number of transitions for all 16 cases is $\partial y/\partial x_1 + \partial y/\partial x_2 + \partial y/\partial x_4$ if the delays are considered, or $\partial y/\partial (x_1 x_2 x_4)$ if not. Note that these values are independent of the sequence of inputs because the Boolean difference represents the number of transitions for every input pattern. We can now compute the transition density, considering the delay, of output $y$ by replacing $\partial y/\partial (x_i x_j \cdots x_k)$ with $\partial y/\partial x_i + \partial y/\partial x_j + \cdots + \partial y/\partial x_k$ in Eq. 3. We define $D'(y)$ as the transition density of $y$ with delay, and it is given by

$$ D'(y) = \min\!\left( \sum_{i=1}^{n} P\!\left(\frac{\partial y}{\partial x_i}\right) D'(x_i),\ \tau/\mu \right). \tag{4} $$

Because a pulse whose two transitions are closer together than the logic delay of the gate is eliminated, we assign $\tau/\mu$ as the maximum transition density, where $\tau$ is the clock period and $\mu$ is the delay of the LUT, assuming that the transitions are equally spaced. This assumption may slightly overestimate the glitch transitions, but it does not cause appreciable error because we use only the relative values of the transition density in the algorithm to be described in Section IV. We have checked the extent of this overestimation by perturbing the value of $\tau/\mu$. Note that $D'(x)$ is equal to $D(x)$ when $x$ is the input to the logic.

Fig. 3. The number of transitions in $y$ for the transition $x_1, x_2, x_4$ in a 4-input LUT: (a) without delay (no input skew); (b) with delay (input skew).

Solve-FPA
input: FPGA design D, c = (c_1, ..., c_nc), p = (p_1, ..., p_rp)
output: power-minimized FPGA design D_new
begin
    get graph G from FPGA design D
    for each new clock c_i
        get weight w and interval ξ for each vertex v in G
        for each phase p_j
            get maximum phase weight mpw(p_j)
        endfor
        select p_max with maximum mpw(p_j)
        if mpw(p_max) ≤ overhead break
        assign p_max to c_i
        modify G with c_i
    endfor
    get D_new from G
    return D_new
end

Fig. 4. A summary of Solve-FPA.

Although Najm's model [11] looks similar to Eq. 4, it is not capable of pulse absorption, in which a pulse is eliminated when it passes a LUT if it is narrower than the delay of the LUT. This results in significant overestimation of the transition density. Another experiment [12], in which Najm's model was compared with IRSIM, a switch-level simulator, supports this fact. Anderson and Najm [13] estimate the activity of the net in an FPGA by regression analysis. Although this method is more accurate, long computation times and a complicated procedure militate against its use in a fully automated algorithm.

Using $D(y)$ and $D'(y)$, we can compute the number of transitions which are glitches.
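As a rough illustration of Eqs. 3 and 4, the sketch below evaluates both transition densities for a single LUT described by its truth table. It is not the authors' simulator. It assumes that the probability of a Boolean difference is the fraction of input patterns whose output changes when the named inputs flip together, and that inputs are uncorrelated; all function names are hypothetical.

```python
from itertools import combinations


def toggle_probability(truth_table, mask):
    """Fraction of input patterns whose output changes when every input
    selected by the bit mask flips at the same time."""
    n_patterns = len(truth_table)
    changed = sum(truth_table[m] != truth_table[m ^ mask] for m in range(n_patterns))
    return changed / n_patterns


def density_without_delay(truth_table, d_in):
    """Eq. 3: sum over every non-empty subset of inputs that switch together."""
    n = len(d_in)
    total = 0.0
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            mask = sum(1 << i for i in subset)
            # Probability that exactly the inputs in `subset` make a transition.
            prob = 1.0
            for i in range(n):
                prob *= d_in[i] if i in subset else 1.0 - d_in[i]
            total += toggle_probability(truth_table, mask) * prob
    return total


def density_with_delay(truth_table, d_in_delayed, tau, mu):
    """Eq. 4: skewed inputs act one at a time; the result is capped at tau/mu."""
    summed = sum(
        toggle_probability(truth_table, 1 << i) * d
        for i, d in enumerate(d_in_delayed)
    )
    return min(summed, tau / mu)


# Two-input XOR, truth table indexed by the integer (x2 x1).
xor_table = [0, 1, 1, 0]
d_plain = density_without_delay(xor_table, [0.5, 0.5])
d_delay = density_with_delay(xor_table, [0.5, 0.5], tau=10.0, mu=1.0)
glitch_density = d_delay - d_plain
```

For the XOR example, flipping both inputs at once leaves the output unchanged, so the zero-delay model sees fewer output transitions than the skewed-input model; the difference is the glitch activity that an inserted flip-flop could block.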
We define $G_{trans}(y)$ as the glitch transition in $y$. $G_{trans}$ can be computed as

$$ G_{trans}(y) = D'(y) - D(y). \tag{5} $$

There are also arithmetic logic gates and multiplexers in the slices of the Virtex-II. We apply the same model to these dedicated logic blocks as we use for the LUTs.

IV. GLITCH MINIMIZATION TECHNIQUES FOR LOOK-UP TABLE BASED FPGAS

A. Proposed algorithm

Addressing the FPA problem from scratch, we see that activating flip-flops with a shifted-phase clock must yield the maximum power saving by glitch blocking (called the weight in this paper). To achieve this objective, we consider the weights of all the unused flip-flops. Note that two or more flip-flops can be activated by a single phase-shifted clock if the phase $p$ of that clock is in the range of those flip-flops. Thus we need to know the maximum weight that can be obtained for each possible phase. We show that this problem is NP-hard, and address it with a greedy algorithm. Because activating a flip-flop changes the weights and timing values of the downstream components, computing optimal phases for all the available clocks at once is not a tractable problem. Therefore, we set the phase of one clock at a time. Once we have activated flip-flops with a particular clock, we recompute the weight and timing values and repeat the procedure for the next clock, until there remains no clock to insert or no further power saving is produced. We call the resulting phase-shifted clock assignment algorithm for power minimization Solve-FPA (Fig. 4).

Fig. 5. Overall design flow of Solve-FPA: Xilinx design tool (NCD, XDL, SDF), graph generation, weight generation, slack calculation, phase assignment, power saving analysis, and design conversion (new NCD), passing through a directed acyclic graph and a weighted slack interval graph.

Fig. 6. DAG generated from an FPGA design.

The overall procedure of Solve-FPA, shown in Fig. 5, is fully automated. The algorithm starts from an original FPGA design in a post-layout netlist format. The FPGA design is converted to a directed acyclic graph (DAG) at the graph generation stage. Then the timing simulation stage extracts delay information from the FPGA design using the Xilinx design tool. Next, we compute the extent of the power saving: this is the weight generation stage. Depending on the cumulative delay of the flip-flops and nets in front of and behind the new flip-flop, we calculate the maximum permissible range of its phase in the slack calculation stage. Then we select a phase-shifted clock that will give the maximum power saving, using a vertex-weighted DAG with slack times: this is the phase assignment stage. Then we decide whether to iterate or not, while considering the clock insertion overhead, at the power saving analysis stage.
Finally, the new FPGA design is generated at the FPGA design conversion stage by converting the modified graph to the FPGA netlist format.

B. Graph generation and timing simulation

The first step of the algorithm uses the Xilinx FPGA design tool to generate the FPGA design as a binary file in NCD format. We translate it into the Xilinx Description Language (XDL), a text format, for flexibility in editing the design. Timing simulation is also performed using the Xilinx design tool, which generates outputs in Standard Delay Format (SDF). Using the XDL file, we convert the FPGA design to a DAG. Each vertex of the DAG is associated with a LUT and its following flip-flop, or only with a LUT if the flip-flop is not used. The nets are then converted into edges (Fig. 6).

We will now define some additional terms. For every $u, v \in V$, if there is an edge $u \rightarrow v$, then $u$ is the predecessor of $v$, and $v$ is the successor of $u$. If there is a path $u \leadsto v$, then $u$ is the ancestor of $v$, and $v$ the descendant of $u$.

C. Weight generation

The weight of a vertex denotes the possible extent of power saving when the unused flip-flop at that vertex is activated. This step requires the effective capacitance of each vertex to be determined. The effective capacitance of a vertex $v$, $C_{eff}(v)$, is defined as the sum of the effective capacitances of the resources used in the output edges of $v$. We compute $C_{eff}(v)$ by summing the products of the number of resources of each type used and their unit effective capacitances. The effective capacitance of the various resources is shown in Table I.

When a glitch is not blocked at vertex $v$, it will propagate to the descendants of $v$ until it meets an in-phase flip-flop. A glitch first wastes power at $v$, and further power is wasted in its successors. However, at each successor $u$ the extent of the glitch is reduced by a factor of $\partial y/\partial x$, where $x$ and $y$ are the input and output edges of $u$, respectively.
In other words, we can say that the effective capacitance is reduced by a factor of $\partial y/\partial x$ from the viewpoint of $v$. We therefore define the total capacitance, $C_{total}(v)$, as the total effective capacitance of the circuit under the impact of $G_{trans}(v)$. We can compute $C_{total}(v)$ as

$$ C_{total}(v) = C_{eff}(v) + \sum_{u \in fanout(v)} C_{total}(u) \cdot P\!\left(\frac{\partial y}{\partial \overrightarrow{vu}}\right), \tag{6} $$

where $fanout(v)$ is the set of vertices which comprise the fanout of $v$, $\overrightarrow{vu}$ is the edge $v \rightarrow u$, and $y$ is the output of $u$. Finally, the weight of $v$ is given by

$$ w(v) = G_{trans}(v) \cdot C_{total}(v). \tag{7} $$

Since the supply voltage is constant in Eq. 1, this definition of weight is sufficient to represent the dynamic power reduction achieved by activating the flip-flop at $v$. In summary, the output of the weight generation stage is a vertex-weighted DAG $G = (V, E, w)$.

D. Slack calculation

The slack calculation determines the amount of freedom available to each newly inserted flip-flop, which is the permissible range of phases. It utilizes the DAG $G = (V, E, w)$ and the delay information from the SDF file. Within the permissible range, we convert $G$ to the slack interval graph $G_{int}$, which is a weighted interval graph that takes the precedence relations among the vertices into consideration.

We define the slack function $\xi : v \rightarrow (\mathbb{R}, \mathbb{R})$, which maps a vertex $v$ to two real numbers: a starting-point and an end-point. These are calculated under the condition that all the ancestors of $v$ are executed as early as possible and all the descendants of $v$ are executed as late as possible. For instance, let the cumulative delay in front of the new flip-flop be $\tau_f$, and let the delay behind the flip-flop be $\tau_b$. If the clock period is $\tau$, and $\tau > \tau_f + \tau_b$, the permissible phase range will be $[2\pi\tau_f/\tau,\ 2\pi(\tau - \tau_b)/\tau]$, which becomes the slack available to the new flip-flop.

We convert $G = (V, E, w)$ to $G_{int} = (V_{int}, w_{int}, \xi, \Psi)$ by translating each vertex to a slack interval in $V_{int}$, with starting and end-points determined by $\xi(v_{int})$. The weight of a slack interval in $G_{int}$, $w_{int}(v_{int})$, is inherited from the weight of $v$ in $G$, i.e. $w_{int}(v_{int}) = w(v)$. $\Psi$ is a set of links among slack intervals such that $u_{int}$ and $v_{int}$ are linked if and only if there is a path between $v$ and $u$ in the original DAG $G$. Fig. 7 demonstrates a slack interval graph derived from Fig. 6, where the letters a to e denote the available phases.

Fig. 7. Weighted slack interval graph converted from a DAG.

Fig. 8 example data: $E = \{(v_1,v_2), (v_2,v_3), (v_1,v_3), (v_1,v_4)\}$ and $\Psi = \{\psi(v_1,v_2), \psi(v_2,v_3), \psi(v_1,v_3), \psi(v_1,v_4)\}$.

E. Phase assignment

Phase assignment selects the shifted-phase clock that gives the maximum benefit. A slack interval graph $G_{int} = (V_{int}, w_{int}, \xi, \Psi)$ is the input to the phase assignment stage and $(G_{int}, f_p)$ is the output, where $f_p$ is the clock selection function defined in the FPA problem. For every possible phase $p$, there is a set $V_p$ of slack intervals which can be triggered by a $p$-shifted clock, so that for all $v_{int} \in V_{int}$, $v_{int} \in V_p$ if and only if the phase $p$ is between the starting and end-points of $\xi(v_{int})$. We call $V_p$ the $p$-crossing set. We define the maximum benefit of selecting the phase $p$ as the maximum phase weight, $mpw(p)$.
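The permissible phase range from the slack calculation and the $p$-crossing set can be sketched as follows. This is an illustrative sketch only: it assumes the latest permissible clock edge is the clock period minus the delay behind the flip-flop (so that downstream logic still finishes within the period), it ignores setup and hold margins, and the names `phase_range` and `p_crossing_set` and the delay values are hypothetical.

```python
import math


def phase_range(tau_f, tau_b, tau):
    """Return the (start, end) phase in radians for a candidate flip-flop,
    or None when the path has no slack (tau_f + tau_b >= tau).

    tau_f: cumulative delay in front of the flip-flop.
    tau_b: cumulative delay behind the flip-flop.
    tau:   clock period (same time unit as the delays).
    """
    if tau_f + tau_b >= tau:
        return None
    return (2 * math.pi * tau_f / tau, 2 * math.pi * (tau - tau_b) / tau)


def p_crossing_set(intervals, p):
    """Return the vertices whose slack interval contains phase p."""
    return {v for v, (start, end) in intervals.items() if start <= p <= end}


# Hypothetical delays in ns for three candidate vertices, with a 10 ns clock.
delays = {"v1": (2.0, 3.0), "v2": (4.0, 5.0), "v3": (6.0, 5.0)}
intervals = {}
for name, (tau_f, tau_b) in delays.items():
    rng = phase_range(tau_f, tau_b, tau=10.0)
    if rng is not None:  # v3 has no slack and is skipped
        intervals[name] = rng

crossing = p_crossing_set(intervals, p=math.pi)  # a 180-degree phase
```

Only vertices with positive slack receive an interval, which mirrors the observation that a flip-flop cannot be inserted on the critical path without changing timing.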
If there were no link in $G_{int}$, i.e. $\Psi = \emptyset$, we could obtain $mpw(p)$ by simply summing the weights of the slack intervals in $V_p$. However, finding $mpw(p)$ is difficult, due to the links within the set of vertices $V_{int}$. If $\psi(u_{int}, v_{int}) \in \Psi$, then even though $u_{int}, v_{int} \in V_p$, $u_{int}$ and $v_{int}$ cannot both be clocked at phase $p$ without changing the system behavior. Therefore, the links between slack intervals become constraints in computing $mpw(p)$. For example, if $\Psi$ in Fig. 7 were an empty set, then $mpw(d)$ would be $w(v_9) + w(v_7) + w(v_6) + w(v_8)$. However, $\Psi$ is not empty, and so $mpw(d)$ is actually $\mathrm{MAX}(w(v_6),\ w(v_7) + w(v_8),\ w(v_9) + w(v_8))$. We call the optimization that finds $mpw(p)$ with a given $G_{int}$ and phase $p$ the maximum phase weight (MPW) problem, and we claim that it is NP-hard.

Problem 2: MPW decision: Let us assume that a graph $G_{int}$ and a phase $p$, written $(G_{int}, p)$, together with a given positive real number $k$, constitute an instance of the MPW problem. For a given $V_p$, we now seek to determine whether $mpw(p) \ge k$.

Theorem 1: The MPW decision problem is NP-hard.

Proof: We show that the maximum weighted independent set problem, which has been shown to be NP-hard [14], can be reduced in polynomial time to the MPW decision problem. Let us assume that we have a vertex-weighted graph $G = (V, E, w)$. First we convert $G$ to an interval graph $G_{int} = (V_{int}, w_{int}, \xi, \Psi)$, such that $\xi : v \rightarrow (\xi_s, \xi_e)$, where $\xi_s$ and $\xi_e$ are positive real numbers and $\xi_s < \xi_e$ (Fig. 8). $V$ is now translated to the slack intervals $V_{int}$, with starting and end-points determined by the function $\xi$. Because the range of all the intervals is $[\xi_s, \xi_e]$, there exists a phase $\xi_s \le p \le \xi_e$ such that all the intervals are $p$-crossing, i.e. $V_{int} = V_p$.

Fig. 8. Conversion between vertex-weighted graph $G$ and weighted interval graph $G_{int}$: (a) vertex-weighted graph $G$; (b) weighted interval graph $G_{int}$.
$\Psi$ is obtained from the edges $E$ of $G$ such that, for all $u, v \in V$, when $u$ and $v$ are adjacent, the corresponding intervals $u_{int}$ and $v_{int}$ are linked, i.e. $\psi(u_{int}, v_{int}) \in \Psi$. Finally, $w(v)$ is converted to $w_{int}(v_{int})$. Therefore, if a vertex set in $G$ is an independent set, the corresponding interval set in $G_{int}$ has no links within it. Consequently, an independent set in $G$ has a sum of weights which is not less than $k$ if and only if there exists $V_p \subseteq V_{int}$ where $mpw(p) \ge k$.

Having shown the MPW problem to be NP-hard, we apply a greedy algorithm to solve it. In real data, we found that a small number of vertices in $V_p$ usually have dominant weights. To capitalize on this, we sort the vertices before computing $mpw(p)$. Then we select the vertex with the maximum weight and remove all the vertices which are linked to this vertex. We repeat this process until the set $V_p$ is empty.

F. Power saving analysis and FPGA design conversion

The sequence of procedures from weight generation to phase assignment, which we have just described, determines the shifted-phase clock which maximizes the power saved by glitch blocking. We iterate through these procedures until we have used all the $n_c$ clock resources, or until the insertion of more flip-flops does not save any more power because of the overheads, even though some clock resources remain unused. The power overhead of an inserted flip-flop is made up of the power consumption of the new flip-flop and that of the global wires used by the new clock. The power consumption of the flip-flop can be computed by multiplying $G_{gen}(v)$ by the input capacitance of the flip-flop (Table I), and the overhead from clock routing can be computed by multiplying the effective capacitance of both global and local clock routing resources by the transition density of the clock, which is 1. We can then compare $mpw(p)$ with the sum of the two overheads, and insert a phase-shifted clock only when the total overhead is smaller than $mpw(p)$.
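The greedy selection and the overhead comparison described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation. The link set is inferred from the $mpw(d)$ expression given for Fig. 7 and is therefore an assumption, and the weights and the names `greedy_mpw` and `worth_inserting` are hypothetical.

```python
def greedy_mpw(weights, links, crossing):
    """Greedy estimate of mpw(p).

    weights:  dict mapping vertex name to weight.
    links:    set of frozenset({u, v}) pairs that must not share a phase.
    crossing: the p-crossing set V_p.
    Returns (total_weight, chosen_vertices).
    """
    # Sort once by decreasing weight; the name breaks ties deterministically.
    remaining = sorted(crossing, key=lambda v: (-weights[v], v))
    chosen = []
    total = 0.0
    while remaining:
        best = remaining.pop(0)
        chosen.append(best)
        total += weights[best]
        # Drop every vertex linked to the one just selected.
        remaining = [u for u in remaining if frozenset((best, u)) not in links]
    return total, chosen


def worth_inserting(mpw, flip_flop_overhead, clock_overhead):
    """A new clock pays off only when its overhead is smaller than mpw(p)."""
    return flip_flop_overhead + clock_overhead < mpw


# Hypothetical weights for the d-crossing set of Fig. 7; the links are chosen
# so that the feasible groups are {v6}, {v7, v8} and {v9, v8}.
weights = {"v6": 1.0, "v7": 3.0, "v8": 2.0, "v9": 4.0}
links = {
    frozenset(("v6", "v7")),
    frozenset(("v6", "v8")),
    frozenset(("v6", "v9")),
    frozenset(("v7", "v9")),
}
mpw_d, chosen_d = greedy_mpw(weights, links, {"v6", "v7", "v8", "v9"})
```

With these weights the greedy choice of $v_9$ followed by $v_8$ coincides with the exact maximum. When a heavy vertex is linked to several lighter ones whose combined weight is larger, the greedy result can fall below the true $mpw(p)$, which is the price of avoiding the NP-hard search.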
Unfortunately, inserting a new flip-flop may change the weights and the slack intervals of the design. To achieve a global optimum, phase assignment of all the additional clocks would need to be done simultaneously, but this implies an unacceptably complicated calculation. Instead, we perform flip-flop insertion and phase assignment sequentially, i.e. in a greedy manner, which makes the optimization converge to a local optimum. Experimental results and actual measurements of practical benchmark sets show that the greedy approach is effective in blocking glitch propagation and does yield significant power saving.

After selecting the clock phases and flip-flops that save the most power, we modify the XDL file to trigger the flip-flops with the phase-shifted clocks that we have chosen. We then convert the modified XDL file to an NCD file, which is compatible with the Xilinx FPGA design tool.

Fig. 9. Test environment of FPGA cycle-true energy measurement.

Fig. 10. Dynamic energy reduction using Solve-FPA (C6288). Axes: dynamic energy (nJ/clock) versus clock cycle; curves: original and 1st- to 6th-phase FF insertion.

V. EXPERIMENTS

To reinforce the simulation results, we measured a real Xilinx FPGA device: a Virtex-II 1000-FG456 with 5120 slices and a core voltage of 1.5 V, using our in-house FPGA cycle-true energy measurement tool based on switched capacitors [15], which is shown in Fig. 9. Fig. 10 shows the difference in cycle-true energy consumption between the original C6288 and the C6288 logic modified by Solve-FPA. It demonstrates that Solve-FPA achieves more dynamic energy reduction as groups of flip-flops triggered by shifted-phase clocks are added. Fig. 11 shows how the distribution of dynamic energy caused by glitches changes as flip-flops are inserted. Because the addition of shifted-phase clocks incurs an energy overhead, the extent of the energy saving decreases as more phases are used. Depending on the circuit structure, the optimal number of phases for the benchmark circuits varies from one to six (Fig. 12). The greatest saving is achieved on C6288, using six phases to achieve a 31% reduction in dynamic energy consumption. More detailed testbench results are summarized in Table II. For most of the benchmark circuits, Solve-FPA achieves a significant dynamic power reduction with only the first phase of flip-flop insertion.
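The relative reductions quoted for the measurements follow directly from the per-cycle energies. The sketch below recomputes them for three rows of Table II (Addr14, Tap16-D, and the row with the largest saving, which the text identifies as C6288). The function name `reduction_percent` is ours and is introduced only for illustration.

```python
def reduction_percent(original_nj: float, modified_nj: float) -> float:
    """Relative dynamic-energy reduction in percent, from energies in nJ/clock."""
    if original_nj <= 0:
        raise ValueError("original energy must be positive")
    return 100.0 * (original_nj - modified_nj) / original_nj


# (original, Solve-FPA) measured dynamic energy in nJ/clock, taken from Table II.
measured = {
    "Addr14": (3.69, 3.37),
    "C6288": (24.89, 16.99),
    "Tap16-D": (9.03, 7.88),
}

for name, (original, modified) in measured.items():
    print(f"{name}: {reduction_percent(original, modified):.2f}%")
# Addr14: 8.67%
# C6288: 31.74%
# Tap16-D: 12.74%
```

Because the tabulated energies are themselves rounded to two decimals, a recomputed percentage can differ slightly from the printed one for other rows.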
The test cases are of three classes: Xilinx core generator modules, ISCAS85 circuits, and an FIR filter. The Addr14 and Multiplier14 cases shown in the first two rows of Table II are a 14-bit adder and multiplier generated by the Xilinx core generator. The next three are ISCAS85 circuits, which are ASIC benchmarks implemented as LUTs after optimization. In these cases it is less meaningful to apply our algorithm, and therefore we have omitted the results for those circuits. The final circuit is a direct-form FIR filter with 16 pipeline stages. Unlike the other combinational circuits, this filter uses flip-flops for purposes other than feedback. We used real input vectors for the FIR filter to make the experiment as realistic as possible, while the ISCAS85 circuits and the adder and multiplier were driven by random vectors. The number of slices used in these designs is between % and % of the target FPGA. We have duplicated some circuits that occupy only a small number of slices in the interest of more accurate measurement. The second column of Table II, titled "Number of modules", indicates the number of duplications. All the benchmarks were implemented on one or two thousand slices, except the 14-bit adder (Addr14) and C499 from ISCAS85. This is caused by the limited number of input/output signals available on the measurement tool.

Fig. 11. Histograms of net dynamic energy (pJ/clock) versus phase (degrees) caused by glitches in C6288 logic, through a sequence of modifications by Solve-FPA: (a) original design; (b) first-phase flip-flop insertion; (c) third-phase flip-flop insertion; (d) sixth-phase flip-flop insertion.

One of the most important outputs of Solve-FPA is the phase of the clocks for the newly inserted flip-flops. Note that the values in the fifth column of Table II are not measured in degrees but in units of 1/256th of a period of the system clock.
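The conversion from these phase units to degrees is a simple scaling. The sketch below assumes that one clock period is divided into 256 units, which is what the correspondence between a value of 128 and a phase of 180 degrees implies. The function name `phase_units_to_degrees` and its default argument are ours, not part of Solve-FPA.

```python
def phase_units_to_degrees(value: int, units_per_period: int = 256) -> float:
    """Convert a Table II phase value to degrees.

    Assumes the system clock period is divided into `units_per_period`
    equal steps (256 is inferred from the 128 -> 180 degree example).
    """
    if not 0 <= value <= units_per_period:
        raise ValueError("phase value must lie within one clock period")
    return value * 360.0 / units_per_period


print(phase_units_to_degrees(128))  # 180.0
print(phase_units_to_degrees(116))  # 163.125
```

Under this assumption, a selected phase of 116 from Table II corresponds to a clock shifted by 163.125 degrees.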
For example, values of 128 and 256 would correspond to phases of 180° and 360°, respectively. The number of candidate nodes is the number of CLBs present before any flip-flops are allocated. The final action of Solve-FPA is to reactivate enough CLB flip-flops to correspond to the selected number of nodes. Note that selecting more nodes does not necessarily save more energy. Solve-FPA estimates the amount of glitch energy that can be saved. The estimated power reduction takes into account the overhead of the additional flip-flops and of the global wires for clock distribution. Since relative energy is more important, these values may have offsets. The large difference between the measured and estimated energy reduction in Addr14 is due to the small size of the glitch power compared to the power consumed by the logical transitions. Even though almost every glitch is removed by clock insertion, the effect on dynamic power consumption is not large. As shown in the last three columns of Table II, Solve-FPA actually saves between 8% and % of dynamic energy in real applications. The differences between the computed and measured values result from the assumptions made while modeling the capacitance and transition density of the Virtex-II and from inaccuracies in the measurement tool. In any case, experimental values showed around 5% variation, depending on the temperature of the target device. We consider the calculated and measured results to be broadly consistent. We also tried slowing down the clock frequency to extend the slack interval of each vertex. When the clock frequency is reduced by 5%, around 2% of additional power reduction was achieved. This shows that more power saving can indeed be obtained by reducing the clock frequency, which can improve the gain from clock scaling for devices which do not allow a scalable supply voltage.

TABLE II. POWER REDUCTION ACHIEVED BY Solve-FPA.
Column headers, in order: target circuit; number of modules; number of slices; number of phases; selected phase(s); number of nodes (selected/candidates); estimated glitch energy reduction; measured dynamic energy consumption (original, Solve-FPA, reduction).

Target circuit   Original       Solve-FPA      Reduction
Addr14           3.69 nJ/clk    3.37 nJ/clk    8.67%
Multiplier14     14.47 nJ/clk   11.48 nJ/clk   .63%
C                6.27 nJ/clk    6.00 nJ/clk    4.31%
C                5.46 nJ/clk    4.72 nJ/clk    13.41%
C6288            24.89 nJ/clk   16.99 nJ/clk   31.74%
Tap16-D          9.03 nJ/clk    7.88 nJ/clk    12.74%

Selected phases given in the table footnotes: 116, 83, 5, 98, 187, 57; and 107, 117, 231.

Fig. 12. Dynamic energy saving by flip-flop insertion. The energy curves are convex due to the power overheads of the inserted flip-flops. Axes: dynamic energy (nJ/clock) versus the number of phases for flip-flop insertion; curves: C432, C499, Tap16-D and C6288.

VI. CONCLUSIONS

We have shown how to save power in LUT-architecture FPGAs, based on the observation that most of the power is dissipated in the routing paths, and specifically, in glitch propagation.
We propose the insertion of flip-flops between adjacent LUTs when there are more than two cascaded LUTs. Our method is substantially different from traditional retiming or pipelining schemes, in that the inserted flip-flops are timed by clocks which are phase-shifted to correspond to the delays in LUTs and routing paths, and their slack times. To make the most of limited phase-shift resolutions and clock lines, we insert flip-flops in order of the power saving obtained by glitch minimization. This is shown to be an NP-complete problem, but we provide efficient heuristics. The advantages of our method are: 1) we do not change pipeline structures and timing behavior by the use of phase-shifted clocks; 2) we can start from a post-layout design independent of other optimizations during earlier design stages; and 3) flip-flop insertion can easily be automated as a post-optimization process at the final design stage. The figures in this paper allow our glitch minimization techniques to be visualized: we show the distribution of glitches by clock phase, and real-chip cycle-true power measurements. Results show savings of up to 38% of the full-chip power. In future work, we will continue to improve the heuristics for the NP-complete problem of allocating flip-flops with shifted-phase clocks.

REFERENCES

[1] A. Shen, A. Ghosh, S. Devadas, and K. Keutzer, "On average power dissipation and random pattern testability of CMOS combinational logic networks," in Proceedings of the International Conference on Computer Aided Design, 1992.
[2] M. Favalli and L. Benini, "Analysis of glitch power dissipation in CMOS ICs," in Proceedings of the International Symposium on Low Power Electronics and Design, April 1995.
[3] E. Kusse and J. M. Rabaey, "Low-energy embedded FPGA structure," in Proceedings of the International Symposium on Low Power Electronics and Design, 1998.
[4] L. Shang, A. S. Kaviani, and K. Bathala, "Dynamic power consumption in Virtex-II FPGA family," in Proceedings of the International Symposium on Field Programmable Gate Arrays, February 2002.
[5] S. J. Wilton, S.-S. Ang, and W. Luk, "The impact of pipelining on energy per operation in field-programmable gate arrays," in Proceedings of the International Conference on Field-Programmable Logic and Applications, August 2004.
[6] K. N. Lalgudi and M. C. Papaefthymiou, "Fixed-phase retiming for low power design," in Proceedings of the International Symposium on Low Power Electronics and Design.
[7] H. Li and S. Katkoori, "Power minimization algorithms for LUT-based FPGA technology mapping," ACM Transactions on Design Automation of Electronic Systems, vol. 9, no. 1, January 2004.
[8] B. Kumthekar, L. Benini, E. Macii, and F. Somenzi, "Power optimization of FPGA-based designs without rewiring," IEE Proceedings Computers and Digital Techniques, vol. 147, no. 3, May 2000.
[9] Xilinx Inc., Virtex-II Platform FPGA Handbook, 2000.
[10] Altera Corporation, "Using the ClockLock and ClockBoost PLL features," Altera Application Note, no. 1, November 2003.
[11] F. N. Najm, "Transition density: a new measure of activity in digital circuits," IEEE Transactions on CAD of IC and Systems, vol. 12, no. 2, February.
[12] H. Mehta, M. Borah, R. M. Owens, and M. J. Irwin, "Accurate estimation of combinational circuit activity," in Proceedings of the 32nd IEEE/ACM Design Automation Conference, June 1995.
[13] J. H. Anderson and F. N. Najm, "Power estimation techniques for FPGAs," IEEE Transactions on VLSI Systems, vol. 12, no. 10, October 2004.
[14] L. Lovász, "Stable sets and polynomials," Discrete Mathematics, vol. 124.
[15] H. G. Lee, S. Nam, and N. Chang, "Cycle-accurate energy measurement and high-level energy characterization of FPGAs," in Proceedings of the 4th International Symposium on Quality Electronic Design, March 2003.
