Power-driven FPGA to ASIC Conversion


WenHai Fang (a) and Lambert Spaanenburg (b)
(a) SwitchCore AB, Emdalavägen 1, Lund (Sweden)
(b) Dept. of Information Technology, Lund University / LTH, P.O. Box 11, Lund (Sweden)

ABSTRACT

Gate arrays are often presented as a convenient means for ASIC prototyping. Obviously, both can perform the same function and can therefore be built from the same behavioral description. Design development implies a process of successive parameter bindings, leaving steadily less freedom for the remaining implementation choices. On the other hand, the ASIC offers more place & route freedom than the gate array. Hence it is commonly suggested that an optimal prototype will always have an acceptable ASIC realization. But this does not make the gate array an easy stepping-stone in ASIC development. Differences in platform technology induce a different structural sugaring to achieve a reasonable implementation. This cannot easily be ported, unless the implementation is developed while keeping the restrictions of the other technology in mind. This implies that a number of scaling rules must form the foundation of the design transformation process. This paper looks into the platform commonalities of Field-Programmable Gate-Arrays and standard-cell ASICs from fundamental physical principles. These basic considerations are then used to show how the area and speed restrictions in the logic synthesis can be applied to carry power-efficient designs efficiently from prototype to realization. This is illustrated in the design of the SNOW-2 encryption core, where a consistent 3% power reduction is achieved.

Keywords: Application-Specific Integrated Circuit, Field-Programmable Gate-Array, Computational Energy, Dynamic Power Dissipation, Encryption.

1. INTRODUCTION

Most microprocessors are contained in a product with a non-computing purpose, which brings aspects other than pure number-crunching performance to the foreground.
Though these aspects are also of concern in classical computers, the move towards embedded computing probably makes them the prime design aspect. On the other hand, embedded systems tend to be prototyped on Field-Programmable Gate-Arrays (FPGA). These offer a rich set of pre-integrated gates and macros that can be personalized to the desired system, plus a second system that performs this configuration. Compared to Application-Specific Integrated Circuits (ASIC), area consumption has become a matter of gate utilization, while power consumption is largely overshadowed by the static dissipation of the pre-integrated parts. For effective prototyping, it is desirable that the decisions made in designing an FPGA will also hold for the ASIC.

Altera has for some time advertised that an FPGA can be transformed into an ASIC by a technique called HardCopy [1]. This essentially takes away all the unused gates from a design after Place & Route. This is clearly not the most efficient way to do things, though it will reduce the area and the power consumption. The alternative is to start from scratch, i.e. to use the VHDL-coded design for a new Place & Route effort. It is often found that this requires an entirely new set of P&R decisions to be made. This leaves the question whether and under which circumstances the P&R decisions for the FPGA implementation can be re-used.

The problem of re-using design decisions becomes even more pressing when aiming for a low-power design. Techniques for power reduction can be applied at several stages of system design [2]. The software implementation can already be coded for power efficiency. Most of this has to do with handling the memory access, as the communication external to the chip and the distribution of such data streams within the chip are major contributors. This is in turn reflected in architectural decisions that aim to limit the system part that is critical for overall speed.
Having a larger part of the system work at a lower clock speed will always benefit the power efficiency. In the extreme, one may opt for no synchronicity at all. Still, a number of measures can be taken on the logic level: clock gating and parallel processing, in combination with design steps such as re-timing and unfolding. On a lower level of abstraction, a number of circuit design techniques have been proposed, like Dynamic Voltage Scaling [3].

Notably, IP cores for secure portable products have to be low on area and energy. In this paper we report on a further development towards a power-efficient implementation of the SNOW 2.0 standard. We will first discuss some applicable techniques to see their impact on FPGA and ASIC design. Then we treat some specific methods for FPGA designs. Subsequently, the SNOW-2.0 design is introduced [4] and the derived techniques are applied. Finally we evaluate these techniques and the reduction in dynamic power dissipation that they achieve.

2. THE BASICS OF POWER REDUCTION

The power dissipation of a logic gate is composed of a static contribution and a dynamic one. The static or quiescent power is consumed to keep the circuit in a well-defined static state. It was drastically reduced by the advent of CMOS technology, but returns when discussing FPGAs. The dynamic contribution is largely due to the clock frequency $f_{clk}$ and the logic switching rate $\alpha_{0 \to 1}$, formulated by Neil Weste [5] for full-custom design as

$P_{dyn} = \alpha_{0 \to 1} \, f_{clk} \, C_L \, V_{dd}^2$.

Though it is valid for FPGA design also, the interpretation for power reduction methods seems to be different.

Figure 1 Full-custom power/delay curve (power P versus frequency f, with regions I, II and III) in theory (a) and in the practice of FPGA design (b), demonstrating the effect of different P&R efforts on Fang's design with 400,000 bits [4].

In full-custom design, the gate is sized to drive a capacitive load $C_L$ at just the right frequency. As the gate delay depends on $C_L / (V_{dd} \cdot K)$, where $K = W/L$, a higher frequency is reached by raising the geometry ratio $K$. However, doing so raises not only the frequency but also the capacitive load, which is composed of a driver contribution and a driven part: $C_L = K \cdot C_G + C_W$. Consequently the law of diminishing returns takes effect, which can be described by recognizing three regions in the power/frequency curve (Figure 1a):

a. Region I. If the load capacitance is mainly due to the wiring (and fan-out), the gate delay is a function of $C_W / (V_{dd} \cdot K)$, which makes the power dissipation a linear function of $K$ (and therefore of $f$).
b. Region II. If both wiring and logic play a role in the load capacitance, $K$ has to change more than before to raise the frequency. Consequently, the power curve will be slightly non-linear.
c. Region III. If the load capacitance is only slightly caused by wiring and largely driver-dominated, the gate delay becomes geometry-independent. As the logic gate is so large that it is merely loaded by itself, the frequency is almost saturated and can only be changed marginally. But that little change in frequency takes a drastic increase in $K$, making for a steep rise of the power/frequency curve.

We can easily deduce from Figure 1a that reducing the operating frequency for designs with operating points in region III will drastically bring the power dissipation down. A typical example is bank switching. As the power dissipation rises by more than a factor 2 when the frequency doubles, doubling the circuitry while halving the frequency will effectively bring the power consumption down.

In FPGA design the gates will not be sized. Therefore the driver contribution to the load capacitance is fixed and the power dissipation will be linear with the frequency as long as the gates are fast enough. We encounter this voluntary limitation to region I behavior, somewhat relaxed, in semi-custom design also, and even full-custom design can be done this way. For ASIC design, the slow rise of the region I curve precludes the benefit of structural duplication. As the power dissipation rises by less than a factor 2 when the frequency doubles, doubling the circuitry while halving the frequency will only raise the power consumption.

Apparently, the important parameter is the Differential Power Dissipation $\mathrm{DPD} = \partial P / \partial f$. If this value is more than 2, then power dissipation will drop by a larger ratio than the frequency. If it is smaller, then the effect is debatable and may not be an improvement at all. In general, the Differential Power Dissipation is layout dependent. This shows when the Place & Route is performed at different levels of effort. As an example, we show in Figure 1b the effect of different P&R efforts on the best result discussed by Fang [4].

What remains is modifying the circuit structure. For an ASIC, this can be part of the design and can be furthered by also tuning the logic structures. For instance, a circuit with a low admissible clock rate can be lowered in propagation speed, to restrict the power dissipation, by serializing the logic structure. Of specific interest here is tapering, a serialization style aimed at moderating the relation between $K$ and $C$ for a given $f$. It is mostly applied in I/O circuits, but also explains the benefit of using buffers when a relatively large load has to be driven. In FPGA design, the tooling will use pre-placed buffers, as long as the tooling is made aware of the problem. For instance, the clock will be distributed over a balanced arrangement of buffers to ensure timing.

The apparent contrast between ASIC and FPGA design is that in the latter case spatial tapering cannot be done with similar comfort. Fortunately, we can still create a tapering in time by twiddling with externally supplied signals, notably the clock and the reset. Both the clock and the reset are global wires that cross the entire chip and are therefore a major source of dissipation.
Hence multiple clocks and multiple power lines can help to reduce this, while tuning every local circuit to its requirements by adapting the signal frequency (and not the circuit activity).

3. DECENTRALISED CONTROL

There has always been a heavy debate in processor design between centralized and decentralized control. Centralized control evolves from the separate attention paid to control and data-path that results from the Instruction Set Architecture concept. In synchronous design, the rationale is the distribution of the signals that designate what the effect of the continuous stream of clock pulses has to be. In asynchronous design, this issue does not exist and the existence of many control signals becomes the ruling factor. It has been argued that the main advantage of such designs is the lack of a global clock [6]. More recently it has been concluded that this signal causes a large part of the power dissipation [7].

It is easy to see that the power dissipation of a synchronous system can therefore be lowered by distributed control, assuming that this is used to localize the clock. It extends the concept of tapered timing with clock gating and local powering. The beneficial implementation is based on clock gating. Here the clock to a circuit is passed through a gate, where a hold signal is fed to the other input. The output will not pass the clock when the hold signal is active, and consequently the circuit is not clocked. The BUFGCE cell (available in Spartan-3, Virtex-II, Virtex-II Pro and Virtex-II Pro X) can implement such a clock gating block.

Clock gating (or local clocking) goes hand in hand with local powering. An example of this gated powering can be found in the classical microprocessor with on-chip cache. Assuming locality of operation and data, the hierarchical construction of the cache allows most of this memory to be sleeping, as it will not be needed for the current part of the program execution. The only twiddle factor left to play with is the switch factor.
In the case where an interconnect bus shows little logic activity, one may opt for time multiplexing. The question to answer here is: what is the power minimization equivalent of multiplexing? Let us take a look at clock phases in order to answer this question (Figure 2a).

Figure 2 (a) Clock phases and (b) a 2-edged register.
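As a rough illustration of how the dynamic power relation of section 2 combines with the clock gating described in this section, the following Python sketch counts the clock pulses that pass a gating cell and scales the clock-net power accordingly. The function names are hypothetical, and the hold pattern, clock frequency, load capacitance and supply voltage are assumptions chosen only for illustration; they are not measurements from this paper.

```python
def dynamic_power(alpha, f_clk_hz, c_load_f, v_dd):
    """Dynamic power in watts: P_dyn = alpha * f_clk * C_L * V_dd^2."""
    return alpha * f_clk_hz * c_load_f * v_dd ** 2


def gated_clock_pulses(hold):
    """Count the clock pulses that reach a block behind a gating cell.

    hold[i] is True when the hold signal is active in cycle i, in which
    case the gate blocks the clock and the block is not clocked.
    """
    return sum(1 for h in hold if not h)


# Hypothetical block that is needed during 2 out of every 8 cycles.
hold = [False, False, True, True, True, True, True, True] * 100
passed = gated_clock_pulses(hold)
duty = passed / len(hold)

# Assumed values: 100 MHz clock, 10 pF local clock load, 1.5 V supply.
# An ungated clock net switches every cycle, so its activity is 1.
p_free = dynamic_power(1.0, 100e6, 10e-12, 1.5)
# Behind the gate the net only switches in the cycles that pass.
p_gated = dynamic_power(duty, 100e6, 10e-12, 1.5)

print(f"pulses passed: {passed} of {len(hold)}")
print(f"ungated: {p_free * 1e3:.3f} mW, gated: {p_gated * 1e3:.3f} mW")
```

The sketch only models the local clock net; the saving inside the gated block itself depends on its own switching activity.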

Ever since the period in which TTL was the ruling technology, it has been good practice to separate data and control on different clock edges. In the LSI era, the technique lost popularity as new circuitry for clock balancing needed to be developed. But it allows the controller to provide the data path with signals that keep the circuit active when desired, while holding it inactive otherwise. The signal has both a rising and a falling edge; the combination is used to create the read & store functionality that makes a double latch into a flip-flop (Figure 2b). This can easily be extended, for instance to registers that operate on both edges but then at half the frequency.

The above discussion shows that the concept of multiplexing does not change fundamentally. The aim is to raise the operational meaning of the line without raising the activity. In other words, every signal change needs to carry as much operational meaning as possible while the activity rate remains the same. This has always been good practice, but here it also implies that using both edges allows the frequency to be reduced. Consequently, the power dissipation goes down while the noise margin (in terms of time between two events) remains the same.

Of major concern are the I/O gates, which are a main source of power dissipation next to the global (clock) lines. Multiplexing signals on a smaller number of I/O gates has a significant impact on power dissipation. The same is true for the gated powering of the peripherals.

4. THE ASIC/FPGA TRADE-OFF

The Field-Programmable Gate Array (FPGA) is based on standard blocks from which any type of logic can be created. Such standard blocks are actually memory parts, in which functions can be stored such that the block behaves as the intended logic. This is the principle, but the raw reality is that there is one type of block with a specific capacity. The bare components will still be there when less than the block capacity is used.
Using more than the capacity is not possible; in that case simply more blocks will be used. This turns logic synthesis into allocation.

A related issue has to do with macro usage. Macros are blocks optimized for a specific function domain, such as multiplication and data storage. The function can also be built from the logic slices, which are the standard blocks in Xilinx families, leaving the question of when to use the macro and when to use the slices. The crossover point is important, but clearly the exact value depends on the FPGA type.

The Block SelectRAM (BS-RAM) is a configurable memory module, generated by the Core Generator toolset with parameters that are assigned by the user. As a means of data storage it is meant to bring improvement over Distributed RAM. The latter is based on the individual register elements contained in the logic slices, which makes its intrinsic speed higher than that of a macro element. Distributed RAMs are crucial to many high-performance applications that use relatively small, embedded RAM blocks such as FIFOs or small register files. But beyond a certain number of register sets, speed or area will have deteriorated by such an amount that a single macro becomes more efficient. Therefore the question is: when to use what? The answer to this question will typically be different for different design goals.

Figure 3 (a) The Pareto curve (area in slices versus execution speed in ns) for designs with a different BS- versus Distributed RAM ratio and (b) the power dissipation in BS- and Distributed RAM.
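The "when to use what" question can be phrased as a comparison between a constant and a growing cost. The Python sketch below uses the two figures reported in this section for power (a constant 421 mW for a BS-RAM and a crossover around 10 registers of 32 bits). That the Distributed RAM dissipation grows linearly from zero is an assumption made here to derive a per-register figure; the paper only states that it increases with the register count. All names are hypothetical.

```python
BS_RAM_POWER_MW = 421.0      # constant BS-RAM dissipation quoted in the text
CROSSOVER_REGISTERS = 10     # approximate crossover quoted in the text

# Assumption: linear growth through the origin, so that the per-register
# dissipation follows from the two quoted numbers.
PER_REGISTER_MW = BS_RAM_POWER_MW / CROSSOVER_REGISTERS


def distributed_power_mw(registers):
    """Estimated dissipation of `registers` 32-bit words in Distributed RAM."""
    return registers * PER_REGISTER_MW


def cheaper_storage(registers):
    """Pick the storage style with the lower estimated power dissipation."""
    if distributed_power_mw(registers) < BS_RAM_POWER_MW:
        return "distributed"
    return "bs-ram"


for n in (4, 16):
    print(n, "registers ->", cheaper_storage(n))
```

With a different design goal, such as area, the same comparison would be made against the slice equivalent of a BS-RAM instead of its power figure.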

There has been a regular feud about how to relate slices and LUTs on the one hand to logic gates on the other. The use of the equivalent logic gate has been proposed and enjoys a limited popularity. It expresses a design in the number of 2-input NAND gates that are needed to achieve the same functionality, but this does not do justice to technologies where the NAND gate is not an efficient basic building block. Relating slices to memory locations is even more difficult.

It seems justified to use an equivalence ratio within the same technology. This ratio can be derived by starting from a design with many memory blocks and applying a number of successive transformations to decrease the memory usage, or vice versa [4]. From a typical case study, of which the Pareto curve for the different implementations is shown in Figure 3a, it can be deduced that a BS-RAM is the area equivalent of 256 slices. Therefore it is convenient to replace any logic part in excess of 256 slices by a lookup table in BS-RAM. We cannot take this for a universal truth, as a synthesis report will not clarify how much of each slice is actually used. The size of a logic circuit that corresponds to a table of fixed size varies considerably. This underlines that it is very hard to translate an FPGA design into gate equivalents, or to compare designs on FPGAs of different brands. Hence we will restrict our evaluation to Xilinx FPGAs.

In this paper we decide whether to use BS- or Distributed RAM on the basis of power dissipation instead of just area. From Figure 3b we find that a BS-RAM has a constant consumption of 421 mW, while the dissipation of Distributed RAM increases with the register count. The reason is that BS-RAMs internally use a gated clock to activate just one row at a time. In a Distributed RAM, on the other hand, the clock drives all blocks regardless of which block is addressed. Not surprisingly, the crossover point is around 10 registers of 32 bits each.

5. A CASE STUDY

A typical IP core for quality embedded systems finds application in secure communication. Data encryption has always been adequately supported in software, but its introduction into embedded devices necessitates hardware. This raises questions on the relation between algorithmic architecture and efficient hardware implementation. Such relations have been under investigation within the ECRYPT European Network of Excellence over the past years.

The most widespread encryption algorithms are DES and AES. Clearly a lot of work has been spent on finding suitable hardware implementations. AES is not really low in its computational demands and is not really efficient in streaming; therefore ECRYPT has initiated a competition for new concepts that can be suitably integrated into an embedded system, called eSTREAM [8]. The SNOW encryption algorithm for streaming had already been proposed earlier [9]. Though the first version was rapidly broken, the second version [10] has survived the scrutiny of the ECRYPT community and has recently seen its first commercial application in portable telephony. Therefore SNOW does not participate in the competition anymore, but has already moved on to become a standard.

Figure 4 (a) A schematic view of SNOW-2.0 and (b) an initial implementation, with the blocks Key_IV (input Key & IV), LFSR, FSM, the calculation of the new LFSR value and the key stream, and Encrypt_Addition (PlainText to CipherText).

SNOW-2.0 is a word-oriented stream cipher generator with a word size of 32 bits (Figure 4a). It takes only two input values: a secret key of either 128 or 256 bits and a public initialisation value, IV, of 128 bits. The length of the linear feedback shift register (LFSR) is 16 words. The FSM consists of two 32-bit registers, R1 and R2, and a so-called S-box that is used in calculating the output of the FSM. This output is XORed with the first element of the LFSR to provide the desired running key.
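The data flow just described can be sketched in a few lines of Python. This is a structural illustration only, not a usable cipher: the tap positions (elements 0, 2, 5, 11 and 15 of the LFSR) and the register update order are taken from the published SNOW 2.0 description as we recall it and are not stated in this text, and the table-based operations (multiplication by α and α^-1, and the S-box) are replaced by a hypothetical identity placeholder. Key loading and initialization clocking are omitted.

```python
MASK32 = 0xFFFFFFFF


def placeholder(word):
    """Stand-in for a table-based operation; NOT the real SNOW 2.0 tables."""
    return word & MASK32


def snow_like_step(lfsr, r1, r2,
                   mul_alpha=placeholder,
                   div_alpha=placeholder,
                   sbox=placeholder):
    """One clock of a SNOW-2.0-like datapath on 32-bit words.

    lfsr is a list of 16 words; lfsr[0] is the first element.
    Returns (keystream_word, new_lfsr, new_r1, new_r2).
    """
    # FSM output: modular addition with R1, then XOR with R2.
    fsm_out = ((lfsr[15] + r1) & MASK32) ^ r2
    # Running key: FSM output XORed with the first LFSR element.
    keystream = fsm_out ^ lfsr[0]
    # FSM register updates (assumed order, see lead-in).
    new_r1 = (lfsr[5] + r2) & MASK32
    new_r2 = sbox(r1)
    # LFSR feedback from three taps; shift by one word.
    feedback = mul_alpha(lfsr[0]) ^ lfsr[2] ^ div_alpha(lfsr[11])
    return keystream, lfsr[1:] + [feedback], new_r1, new_r2


# Usage with an arbitrary, non-secret starting state.
state = list(range(1, 17))
word, state, r1, r2 = snow_like_step(state, 0, 0)
print(hex(word), len(state))
```

Passing the three operations as parameters mirrors the hardware view taken in this paper, where each of them is a look-up table that can be mapped on either BS-RAM or logic.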

This paper describes the gradual and continuous enhancement of the original SNOW 2.0 IP core in line with the different approaches to power reduction discussed in sections 2 to 4. The power estimation is performed using XPower on the back-annotated design.

The initial design makes use of BS-RAMs, configured to allow for two read or two write accesses simultaneously. Figure 4b shows that 5 LFSR values are used at a time to calculate the new LFSR value, the value of R1 and the key stream. The LFSR itself is created around one BS-RAM. In the feedback loop, multiplication with $\alpha$ and $\alpha^{-1}$ can be implemented as a simple byte shift plus an additional XOR with one of 256 possible patterns, so this can be implemented using a look-up table stored in a BS-RAM. The S-box is also implemented using a look-up table. In other words, the only components in use are XOR, bit adder, bit shifter and look-up table.

Figure 5 The different realizations of the S-box: the small (a) and the large (b) design with operational performance (c).

First of all we have looked at the filling and clocking of the BS-RAMs used to implement the S-box. There are two options, as illustrated in Figures 5a and 5b. By using two BS-RAMs of 512 * 32 bits each, the individual clock frequency is reduced. Alternatively, the two boxes can be merged into a single RAM with 256 entries at the expense of some additional glue logic. Here, the full clock rate needs to be used, while we have more gates. As illustrated in Figure 5c, the double box solution is preferred, reducing the power needs by minimally 7-12%.

The next step is the insertion of a DCM in combination with double-edge triggering, as discussed with Figure 2. This allows the system logic to be separated into full-speed and half-speed parts. The effect is a reduction of minimal 6 - %. Then the point of interest is the LFSR.
It uses only 16 words in a BS-RAM and can easily be replaced by the Xilinx SRL16, which is efficiently mapped on the LUTs. This gives a minimal 3 6% reduction, based on the fact that a BS-RAM dissipates the same amount of power irrespective of its filling grade, while the LUT dissipates only for the parts that are used.

As mentioned in section 3, power consumption can also be reduced by clock gating. The initial key block works only in the beginning. Hence, we use clock gating to shut this block down after use. This brings an additional 4% reduction. Finally, the creation of the key stream is separated from the creation of new LFSR values, which reduces the power dissipation by a further 2%. The block diagram of the final design is shown in Figure 6a. The overall reduction of the dynamic power dissipation in the three BS-RAM designs is frequency dependent and ranges from 16 to 34%.

We have noted that a tool like XPower measures power as an average over the simulation period. In stream encryption, we have an initialization phase followed by execution. To measure the execution power correctly, the simulation has to cover at least 20,000 clock cycles.

6. DISCUSSION

A first design of the SNOW-2.0 IP core has been discussed by Fang [4]. It focuses on the utilization of the FPGA resources to achieve a small footprint and a high speed. This design has been taken as reference for the evaluation of the power reduction techniques discussed above. The power estimations produced by the Xilinx XPower tool are shown in Figure 6b. All values represent a design that is synthesized by Synplify-.1 using a speed restriction that corresponds to the required clock frequency. The original design, on which the modifications have been implemented in sequence, displays a dynamic dissipation that rises by 5 mW per MHz increase in clock frequency.
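A slope such as the 5 mW per MHz quoted above can be estimated from a series of power estimates taken at different clock constraints. The following Python sketch fits a straight line by least squares; the function name is hypothetical and the readings are made up to lie on a 5 mW/MHz line, so they illustrate the procedure only and are not the data behind Figure 6b.

```python
def slope_mw_per_mhz(samples):
    """Least-squares slope of power (mW) against clock frequency (MHz).

    samples is a list of (frequency_mhz, power_mw) pairs.
    """
    n = len(samples)
    mean_f = sum(f for f, _ in samples) / n
    mean_p = sum(p for _, p in samples) / n
    num = sum((f - mean_f) * (p - mean_p) for f, p in samples)
    den = sum((f - mean_f) ** 2 for f, _ in samples)
    return num / den


# Made-up readings: 100 mW offset plus 5 mW for every MHz.
readings = [(f, 100.0 + 5.0 * f) for f in (20, 40, 60, 80, 100)]
print(f"{slope_mw_per_mhz(readings):.2f} mW/MHz")
```

Fitting separate lines below and above a given frequency is one way to expose a change of slope between operating regions.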

Figure 6 The final power-efficient SNOW design (a) and the dynamic power dissipation compared to the original design (b).

The new design starts at a lower dissipation level as it contains 44% fewer equivalent used slices. The effect of our measures can be found from the slope of the curve, showing an increase of 4 mW per MHz. Both designs show a steeper increase at the end of the affordable clock frequency spectrum. When the FPGA architecture reaches the limit of its clock budget, the combinatorial logic gets parallelized.

At this point we would like to draw attention to what happens between 100 and 140 MHz. The increase in dynamic power dissipation (DPD) had been decreased from 5 mW/MHz to 4 mW/MHz and now suddenly drops to 1.4 mW/MHz, without any change in the original VHDL file. The synthesis report reveals no major change in the logic, but a 2% increase in the number of occupied slices. This suggests that the difference is caused by the Place & Route. Checking the logs of previous experiments, we have noticed that the curve above 100 MHz cannot always be reproduced. One out of four physical design attempts ends not with the shown curve but with a linear extension of the curve below 100 MHz, similar to the result for Fang's design. This has a striking similarity to what has been observed earlier in ASIC-oriented design.

Looking back at what we stated in section 2, we can now conclude that the three regions in the power/delay curve can also be distinguished in FPGA design. The difference lies in the role of time domination. Where in ASIC design an additional sizing of the driving gate compensates for the additional loading, in FPGA design a different physical design of the overall circuit will take place.
In region I, the speed requirements are so loose that the tools attempt to place subsequent gates close to one another. When all the space is taken, the remaining gates will be placed far away. In region III, the speed requirements are dominant and hard to fulfill. In the transition, arbitrarily dense packing becomes infeasible and a more balanced arrangement may occur.

The consequence of this observation lies in an eventual conversion of FPGA into ASIC. When a HardCopy is made of the FPGA implementation, the timing-restricted (but not timing-dominated) regime will provide a better starting point, as the layout is less related to the specifics of the FPGA architecture.

7. CONCLUSIONS

We have applied variations on three techniques in an attempt to decrease the power dissipation of an existing design:

a. Optimal division between Distributed and BS-RAM for compact storage;
b. Multiple and gated clocks for distributed and block activation;
c. Adaptation of the logic structure to enable the efficient application of the above two techniques.

As the example stream encryption core operates on a bit per second, it is unlikely to have underused parts. The design operates successfully for internal clock frequencies up to 306 MHz, while displaying power figures for a maximum external frequency of 153 MHz. This is because the process concept in VHDL forbids the use of double-edge triggered registers. We have used the Digital Clock Manager (DCM) to reach the same effect by local frequency doubling. As a consequence, the FPGA-based design is limited in its clock range to half that of the original design. This is only a matter of technology. The noted restriction will disappear when moving to an ASIC.

In the literature, some SNOW implementations are described in which BS-RAM usage is varied to exchange throughput for area consumption [4]. Some typical designs are listed in Table 1 with their power figures, to enable a comparison with the low-power design developed in this paper. All designs give a 32-bit output on every clock cycle and therefore have the same throughput of 3840 Mbps at a 120 MHz clock.

Table 1 Comparison between designs (f = 120 MHz) with a 128-bit secret key.

                        Design A [4]   Design C [4]   Design D [4]   This paper
Slice count / BS-RAMs   72/7           1360/4         1936/0         624/3

Further rows of the table: Dynamic power; Overall power; Throughput/Dynamic power; Throughput/#Slices and Dynamic power; Throughput/#Slices and Power.

The discussed measures lead to a reduction in pure power dissipation of up to 3%. Alternatively, we can use computational power, defined as throughput divided by power [11], to clean the comparison between the designs from irrelevant influences. We have refrained from including the results from investigations on equivalent gate metrics [12], because their power estimates are based on too limited measurement periods. Though a distinct effect of BS-RAM usage on power dissipation cannot be denied, our low-power measures prove to be much more effective. Overall the dynamic computational power has increased by 39%. Normalized on area, the improvement is even 102% compared to the best design published earlier [4].

ACKNOWLEDGEMENTS

The authors would like to thank Thomas Johansson, Chao Chen, Yitao Jia and Suleyman Malki for their support throughout this research.

REFERENCES

1. Altera, 1 March.
2. H. van Gageldonk, An Asynchronous Low-Power 80C51 Microcontroller, Ph.D. Thesis, Eindhoven University of Technology, pp. 33-34, 49-54.
3. D. Shin, J. Kim and S. Lee, Low-energy intra-task voltage scheduling using static timing analysis, Proceedings DAC.
4. W. Fang, T. Johansson and L. Spaanenburg, Snow 2.0 IP Core for Trusted Hardware, Proceedings FPL 2005, Tampere, Finland, August 2005.
5. N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, Addison-Wesley.
6. L. Spaanenburg et al., One-chip microcomputer design based on isochronity and selftesting, Digest EDA'84, Warwick (England), March.
7. K. v. Berkel and M. Rem, VLSI programming of asynchronous circuits for low power, in Asynchronous Digital Circuit Design (G. Birtwistle and A. Davis, eds.), Workshops in Computing, Springer-Verlag.
8. eSTREAM, 1 March.
9. P. Ekdahl and T. Johansson, SNOW a new stream cipher, Proceedings 1st NESSIE workshop.
10. P. Ekdahl, On LFSR based Stream Cipher Analysis and Design, Ph.D. Thesis, Lund Institute of Technology, Lund University.
11. T.A.C.M. Claassen, High Speed: Not the only way to exploit the intrinsic computational power of silicon, Digest IEEE ISSCC.
12. T. Good, W. Chelton and M. Benaissa, Review of stream cipher candidates from a low resource hardware perspective, 1 March 2006.
