An Integrated FPGA Design Framework: Custom Designed FPGA Platform and Application Mapping Toolset Development

V. Kalenteridis (1), H. Pournara (1), K. Siozios (2), K. Tatas (2), G. Koytroympezis (2), I. Pappas (1), S. Nikolaidis (1), S. Siskos (1), D. J. Soudris (2) and A. Thanailakis (2)

(1) Electronics and Computers Div., Department of Physics, Aristotle University of Thessaloniki, 54006, Thessaloniki, Greece
{vkale, hpour}@skiathos.physics.auth.gr, ilpap@auth.gr, {snikolaid, siskos}@physics.auth.gr
(2) VLSI Design and Testing Center, Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100, Xanthi, Greece
{ksiop, ktatas, dsoudris, thanail}@ee.duth.gr

Abstract. A complete system for the implementation of digital logic in a fine-grain reconfigurable platform is introduced. The system is composed of two parts: the fine-grain reconfigurable hardware platform (FPGA) on which the logic is implemented, and the set of CAD tools for mapping logic to the FPGA platform. The novel energy-efficient FPGA architecture was designed and simulated in STM 0.18µm CMOS technology. Concerning the tool flow, each tool can operate as a standalone program as well as part of a complete design framework composed of existing and new tools.

Keywords: Low Power FPGA interconnect architecture, CLB Architecture, Graphical User Interface

1. Introduction and related work

FPGAs have recently benefited from technology process advances to become significant alternatives to Application Specific Integrated Circuits (ASICs). An important feature that has made FPGAs particularly attractive is a logic mapping and implementation flow similar to the ASIC design flow (from VHDL or Verilog down to the configuration bitstream) provided by the industrial sector [1, 2]. Academia has also shown initiative in the development of fine-grain reconfigurable architectures [3, 4, 5, 6, 7, 8, 9].
Many solid efforts to develop a complete tool design flow have also taken place in the academic sector [10, 11, 12, 13, 14]. The main characteristics of the resulting tools are that they require a UNIX operating system (which is quite expensive) and some operating system (OS) knowledge from the designer, as they run from the command line.

2. Motivation and contribution

Despite the above efforts, there is a gap in the complete design flow (from VHDL to configuration bitstream) provided by existing academic tools. This is due, among other reasons, to the lack of an open-source synthesizer and of an FPGA configuration bitstream generation tool. Also, the existing design flows operate in text mode, which means that they have no Graphical User Interface (GUI). Additionally, most of the tools were designed and implemented for different operating systems (SUN OS, Linux, BSD, etc.). Therefore, there is no existing complete academic system capable of implementing logic specified in a hardware description language in an FPGA, just an assortment of various fine-grain architectures and tools that cannot be easily integrated into a complete system.

In this paper, such a complete system is introduced. The design of an efficient FPGA architecture is presented. An exploration in terms of energy, delay and area, at both the Configurable Logic Block and the interconnection architecture level, has been carried out in order to make appropriate architecture decisions. Simulation results are presented for the STM 0.18µm process. The design is mostly focused on minimizing energy dissipation without significantly degrading delay and area. Additionally, a complete tool-supported design flow for mapping logic on this FPGA is presented, starting from a VHDL circuit description and going down to the FPGA configuration bitstream. The tools will be available by mid-March [29].

Section 3 describes the proposed architecture and circuit design for the CLB and the interconnect network.
Section 4 presents the proposed design flow. Finally, conclusions and future work are discussed in Section 5.

3. FPGA Architecture

In this section the FPGA architecture, which can be programmed using the developed toolset, is presented. The main design goal is energy minimization under delay constraints, while maintaining a reasonable silicon area.

3.1 Configurable Logic Block (CLB) Architecture

The design of the CLB architecture is crucial to the CLB granularity, performance, and power consumption. The proposed design flow supports cluster-based FPGAs [13]. Therefore, the CLB consists of a collection of Basic Logic Elements (BLEs), which are interconnected by a local network (Fig. 1). Fig. 1a shows the structure of the BLE, which is formed by a Look-Up Table (LUT), a D-F/F and a 2-to-1 multiplexer, while in Fig. 1b a cluster of BLEs forms the CLB. A number of parameters have to be determined: a) the number of LUT inputs (K), b) the number of BLEs per CLB (cluster size, N) and c) the number of CLB inputs (I).

Fig. 1: Structure of BLE and CLB. a) Basic Logic Element (BLE), b) Proposed CLB

LUT Inputs (K). The Look-Up Table (LUT) is used for the implementation of logic functions. It has been demonstrated in [24] that 4-input LUTs lead to the lowest energy consumption for the FPGA, providing an efficient area-delay product.

CLB Inputs (I). An exploration of the number of CLB inputs that provides 98% utilization of all the BLEs [17] shows an almost linear dependency on the number of LUT inputs (K) and the cluster size (N), given by the formula:

I = (K/2)(N + 1)    (1)

This affects the tools seen in the next section.

Cluster Size (N). The cluster size corresponds to the number of BLEs within a CLB. Taking into account mostly the minimization of energy consumption, our design exploration showed that a cluster size of 5 BLEs leads to the minimization of energy consumption. With K = 4 and N = 5, Equation (1) gives I = 12 CLB inputs.

3.2 Circuit Design

The CLB was designed at transistor level in order to obtain the maximum power savings. It is well known that minimizing the effective capacitance of a circuit lowers its power requirement. This is achieved by using minimum-size transistors, at the cost of delay time. Power consumption minimization also involves techniques such as logic threshold adjustment in critical buffers and clock gating. Simulations were performed in the Cadence framework using STM 0.18µm technology.

LUT and Multiplexer Design. The 4-input Look-Up Table (LUT) in the BLE is implemented by using a multiplexer (MUX), as shown in Fig. 2. The main difference from a typical MUX is that the control signals are the inputs to the LUT, and the inputs to the multiplexer are stored in memory cells (S0-S15). The LUT and MUX structures with minimum-sized transistors were adopted, since they lead to the lowest energy consumption without degradation in the delay. Minimum-size transistors are also used for the 2-to-1 multiplexer at the output of the BLE.

Fig. 2: Circuit design of the LUT

D-Flip/Flop. A significant reduction in power consumption can be achieved by using a Double Edge-Triggered Flip-Flop (DETFF), since it keeps the same data rate while working at half the clock frequency, so the power dissipation on the clock network is halved. Five alternative implementations of the most popular DETFFs in the literature were designed and simulated in the STM 0.18µm process in order to determine the optimum one. Two versions of each of the F/Fs proposed in [20] (Chung1 and Chung2) and in [19] (Llopis1 and Llopis2) are considered, depending on the tri-state inverter type, as shown in Fig. 3. Another DETFF type has been proposed in [15] (Strollo). The total energy consumed during the application of the input sequence shown in Fig. 4, and the worst-case delay over all combinations of clock signal and data inputs, are given in Table 1.

Fig. 3: Type of tri-state inverters

Fig. 4: Input pulses to the Flip/Flops for simulation

Table 1: Energy consumption, delay and energy-delay product of DET F/Fs

Cell            Total Energy (fJ)   Delay (ps)   Energy-Delay Product (J·s)
Chung 1 [20]    433.1               163.5        70.8 × 10^-24
Chung 2 [20]    457.2               135.3        61.9 × 10^-24
Llopis 1 [19]   409.0               217.2        88.8 × 10^-24
Llopis 2 [19]   429.7               241.8        104 × 10^-24
Strollo [15]    413.5               270.0        112 × 10^-24

As observed, the F/Fs which exhibit the most favourable characteristics are Llopis 1 and Chung 2. In particular, the former has the lowest energy consumption, while the latter exhibits the lowest energy-delay product. Although Llopis 1 does not exhibit the lowest energy-delay product, it has a simpler structure, leading to smaller area and lower total energy consumption. Therefore, it was selected as the optimal solution.

Gated-clock. Clock gating is applied at both BLE and CLB level.

a) Gated clock at BLE level. At BLE level, when the clock enable signal, Clock_enable, is 0, the F/F is not triggered. The circuit structures that are used for simulation are given in Fig. 5, where the shaded inverters in the chain are included in order to measure the effect of the input capacitance of the NAND gate on the energy consumption. The average energy consumed for a positive and a negative output transition of the F/F with a single clock (Fig. 5a) and with a gated clock (Fig. 5b), considering both 0 and 1 for the Clock_enable signal, is given in Table 2.

Fig. 5: a) Single clock signal b) Gated clock signal

Table 2: Energy consumption for single and gated clock

Configuration                      Energy
Single clock                       40.76 fJ
Gated clock, Clock_enable = "1"    43.44 fJ
Gated clock, Clock_enable = "0"    9.31 fJ

It can be seen that energy savings of about 77% can be achieved when Clock_enable is 0.
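The figures in Tables 1 and 2 can be cross-checked with a few lines of arithmetic. The following is only a sketch: the numbers are copied from the two tables, while the dictionary layout and the function and variable names are assumptions made for illustration.

```python
# Table 1 values: cell -> (total energy in fJ, worst-case delay in ps).
TABLE_1 = {
    "Chung 1": (433.1, 163.5),
    "Chung 2": (457.2, 135.3),
    "Llopis 1": (409.0, 217.2),
    "Llopis 2": (429.7, 241.8),
    "Strollo": (413.5, 270.0),
}


def energy_delay_product(energy_fj, delay_ps):
    """Return the energy-delay product in J*s from energy in fJ and delay in ps."""
    return (energy_fj * 1e-15) * (delay_ps * 1e-12)


# Recompute the last column of Table 1 and find the best cell per metric.
edp = {cell: energy_delay_product(e, d) for cell, (e, d) in TABLE_1.items()}
lowest_energy = min(TABLE_1, key=lambda cell: TABLE_1[cell][0])
lowest_edp = min(edp, key=edp.get)

# Table 2 values in fJ: single clock, and gated clock with Clock_enable = 0.
E_SINGLE = 40.76
E_GATED_DISABLED = 9.31
saving_disabled = (E_SINGLE - E_GATED_DISABLED) / E_SINGLE

if __name__ == "__main__":
    for cell, value in edp.items():
        print(f"{cell:9s} EDP = {value / 1e-24:6.1f} x 10^-24 J*s")
    print("lowest energy:", lowest_energy)
    print("lowest energy-delay product:", lowest_edp)
    print(f"saving with Clock_enable = 0: {saving_disabled:.1%}")
```

Under these assumptions the recomputed products agree with the last column of Table 1 to the precision shown there, and the relative saving with the clock disabled comes out at roughly 77%.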
However, when Clock_enable is 1 there is a slight increase in energy consumption (6.2% of the gated-clock energy, or about 6.6% relative to the single clock) due to the larger input capacitance of the NAND gate compared with the inverter's.

b) Gated clock at CLB level. A gated clock at CLB level can minimize the energy in the local clock network when all F/Fs of the CLB are idle. In this case, the gated clock inputs of the F/Fs and the local clock network of the CLB are at logic level 0 and no dynamic energy is consumed. The circuit structures that are used to measure the energy consumption for the single and the gated clock cases are shown in Fig. 6.

Fig. 6: a) Single clock circuit at CLB level b) Gated clock array at CLB level

The energy consumption is measured for various conditions. The simulation results are given in Table 3.

Table 3: Energy consumption for single and gated clock at CLB level

Condition         Single clock   Gated clock (NAND)
All F/Fs "OFF"    23.1 fJ        3.9 fJ
One F/F "ON"      24.1 fJ        32.1 fJ
All F/Fs "ON"     27.8 fJ        35.8 fJ

As can be observed, the gated clock technique results in an 83% reduction in energy consumption when all the F/Fs are OFF, a 33% increase when only one F/F is ON, and a 29% increase when all the F/Fs are ON. From these results it is clear that adopting the gated clock at the CLB level is reasonable as long as the probability that all the F/Fs in the CLB are OFF is higher than about 1/3 (with the Table 3 values, gating saves 19.2 fJ in that case and costs 8.0 fJ otherwise, so the break-even probability is 8.0/27.2, or about 0.29). The final adoption of the gated clock at CLB level is to be determined by experiments using the physical design of these structures, so that the wire capacitance is taken into account.

Selected CLB architecture. Based on the results of the previous sections and those reported in the literature, a decision on the CLB architecture was made. The features of the selected CLB are:
a) A cluster of 5 BLEs
b) One 4-input LUT per BLE
c) One double edge-triggered Flip-Flop per BLE
d) One gated clock signal per BLE and per CLB
e) 12 inputs and 5 outputs provided by each CLB
f) All 5 outputs can be registered
g) A fully connected CLB, resulting in 17-to-1 multiplexing at every LUT input
h) One asynchronous Clear signal for the whole CLB
i) One Clock signal for the whole CLB

The placement and routing tool described in the next section is indifferent to the exact low-level (transistor-level) implementation, allowing us to employ several transistor-level low-power techniques.

3.3 Interconnect Network Architecture

An SRAM-based, island-style interconnection architecture [12] was designed; this style of FPGA is employed by Xilinx, Lucent Technologies and the Vantis VF1. In this interconnection style, the logic blocks are surrounded by vertical and horizontal metal routing tracks, which connect the logic blocks via programmable routing switches. These switches contribute significant capacitance and, together with the metal wire capacitance, are responsible for the greatest share of the dissipated energy. Therefore, an investigation of the size of the routing switches when driving wire segments of different lengths is needed, while the delay and the required area should remain within acceptable values. Routing switches are either pass transistors or pairs of tri-state buffers (one in each direction) and allow wire segments to be joined in order to form longer connections [22].
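To make the role of the routing switches more concrete, the sketch below models a single switch block of the disjoint topology that the sizing experiments of Section 3.3.1 assume, in which a wire entering on track i can only be joined to track i on each of the other three sides. The side names, function names and data layout are assumptions made for illustration and are not taken from the paper or its tools.

```python
from itertools import combinations

SIDES = ("left", "top", "right", "bottom")


def disjoint_switch_block(num_tracks):
    """Map every (side, track) wire end to the wire ends it can be switched to.

    In the disjoint topology a wire on track i connects only to track i
    on the three other sides of the switch block.
    """
    return {
        (side, track): [(other, track) for other in SIDES if other != side]
        for side in SIDES
        for track in range(num_tracks)
    }


def programmable_switches(num_tracks):
    """List each bidirectional switch once, as an unordered pair of wire ends."""
    return [
        ((a, track), (b, track))
        for track in range(num_tracks)
        for a, b in combinations(SIDES, 2)
    ]


# Example with a hypothetical channel width of 4 tracks.
block = disjoint_switch_block(4)
fs_values = {len(targets) for targets in block.values()}

if __name__ == "__main__":
    print("switch block flexibility per wire end:", fs_values)
    print("switches in the block:", len(programmable_switches(4)))
```

Every wire end reaches exactly three others, and with pass-transistor switches this model needs six switches per track, which is why the switch size has such a direct influence on switch box area.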
3.3.1 Sizing Pass Transistor Routing Switches

In this section the best routing pass transistor width is determined by evaluating the effect of different pass transistor widths on the energy-delay-area product. All experimental values have been derived from simulations in STM's 0.18µm, 6-metal-layer CMOS technology process. Fig. 7 shows a typical routing interconnection with a wire of logical length equal to one, connecting four logic blocks (CLBs) via pass transistors. The logical length is defined as the number of logic blocks spanned by a routing wire. It is assumed that the connection box flexibility, Fc (the fraction of routing tracks in a channel to which each CLB pin can connect), is equal to 1 for both input and output pins of the CLB. This value represents the worst-case scenario. The disjoint switch block topology, which gives Fs = 3 (the number of other wire segments to which each wire entering the switch block can be connected via routing switches), is used, as in most commercial implementations.

The pass transistors that connect the logic block output pin to each routing track add extra parasitic capacitance. These transistors have the same size as the other routing switches [13]. The wire is also loaded by the buffers that drive the data signals into the logic block input pins. In addition, the routing wire is laid out in metal 3, because it has the lowest capacitance among the routing metals of the technology used. Moreover, it is investigated whether FPGA routing wires benefit from greater than minimum metal width or spacing.

Fig. 8 plots the energy-delay-area product as a function of the pass transistor width for four different wire lengths, in the case of minimum-width wires and minimum spacing between two metal wires. The width of the routing transistors is shown relative to the minimum contacted transistor width of the 0.18µm technology.
Fig. 7: FPGA routing experiment circuitry

Fig. 8: Energy-Delay-Area product vs. routing pass transistor width

The plotted curves show that for wire lengths of 1, 2 and 4 logic blocks, transistor widths of 10 and 16 times the minimum are essentially tied for the optimal energy-delay-area product. For a wire length of 8 the optimal width is 64 times the minimum, but this leads to an unacceptable silicon area.

The energy-delay-area product as a function of the pass transistor width for different wire lengths is plotted in Fig. 9 for the case in which minimum width and double spacing are used. The energy-delay-area product is improved in this case, which means that this wire configuration is more efficient than the previous one: an increase in spacing between metal wires decreases the stray capacitance, leading to energy reduction. The optimum routing pass transistor width is ten times the minimum for wire lengths of 1, 2 and 4 logic blocks, while it remains 64 times the minimum width for a wire length of 8.

Fig. 9: Energy-Delay-Area product vs. routing pass transistor width

The curves of the energy-delay-area product for different routing switch widths with double-width, double-spacing wires are illustrated in Fig. 10. It can be seen that a transistor width of 10 times the minimum yields the best energy-delay-area product for wire lengths of 1, 2 and 4, and a width of 16 times the minimum does so for a wire length of 8.

Fig. 10: Energy-Delay-Area product vs. routing pass transistor width

Increasing the metal width reduces the metal resistance and hence the circuit delay, but it also somewhat increases the metal capacitance and hence the energy dissipation of the system. Note that the metal pitch is increased by increasing either the metal width or the spacing. Although the interconnection area is increased when double spacing between two metal wires is used, the total area is not affected significantly, since it is limited by the area occupied by the switch box.

Summarizing, the optimum width from an energy, delay and area perspective is ten times the minimum for wire lengths of 1, 2 and 4 in all cases. For a wire length of 8, the best routing pass transistor size reaches 64 times the minimum. However, such a large pass transistor width would lead to an unacceptable switch box area and, consequently, channel width, significantly affecting the whole FPGA area. Therefore, a transistor with ten times the minimum width is selected for this application.

3.3.2 Sizing Tri-state Buffer Routing Switches

In order to determine the best size of the tri-state buffers, a procedure identical to that described in the previous section is followed. The configuration of Fig. 7 is used, except that each pass transistor between two routing wire segments is replaced by two tri-state buffers, one in each direction. In the tri-state buffer sizing exploration, two stages have been used in order to minimize the energy-delay-area product of the buffer [16], and the maximum transistor width is limited to 16 times the minimum, because energy dissipation becomes prohibitive beyond this size. The first stage consists of an inverter with minimum contactable width (0.28µm) NMOS and PMOS transistors in order to achieve logic threshold adjustment. Due to lack of space, the results, which are analogous to those of the previous subsection, are omitted. They indicated that pass-transistor routing switches with a wire length of 1 and minimum-width, double-spacing wires should be used in order to achieve a low-energy fine-grain FPGA.

It should be noted that the exact transistor-level implementation of the interconnect network does not affect the function of the placement and routing tool, but the overall topology that was selected does. Therefore, the design of the FPGA platform and the development of the tool flow presented in the following section is an interactive task.

4. Proposed Design Flow

Equally important to an FPGA platform is a tool set that supports the implementation of digital logic on the proposed FPGA. Therefore, such a design flow was realized. The proposed design flow comprises a sequenced set of steps employed in programming an FPGA chip, as shown in Fig. 11. The circuit is first described in VHDL, while the output of the CAD flow is the bitstream file that can be used to program the FPGA. Three different types of tools comprise the flow: i) non-modified existing tools, ii) modified existing tools, and iii) new tools.

Fig. 11: The proposed design flow
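The sequenced set of steps can be pictured as a dependency graph over the files that the tools exchange. The sketch below derives one valid execution order from the inputs and outputs that Section 4.1 lists for each tool. It does not invoke any of the tools: the artifact labels are hypothetical names, DRUID's output is assumed to be the generic EDIF netlist that E2FMT reads, PowerModel's BLIF input is assumed to be the LUT-mapped netlist, and the "PowerModel" entry among DAGGER's inputs is read as the power report.

```python
# Each stage: (tool name, input artifacts, output artifacts).
# Artifact labels are hypothetical names chosen for this sketch.
STAGES = [
    ("VHDL Parser", {"vhdl_source"}, {"syntax_report"}),
    ("DIVINER", {"vhdl_source"}, {"edif_commercial"}),
    ("DRUID", {"edif_commercial"}, {"edif_generic"}),  # assumed to feed E2FMT
    ("E2FMT", {"edif_generic"}, {"blif_generic"}),
    ("SIS", {"blif_generic"}, {"blif_lut_ff"}),
    ("T-VPack", {"blif_lut_ff"}, {"tvpack_netlist"}),
    ("DUTYS", {"fpga_features"}, {"arch_file"}),
    ("PowerModel", {"blif_lut_ff", "place_route"}, {"power_report"}),
    ("VPR", {"tvpack_netlist", "arch_file"}, {"place_route"}),
    # "power_report" as a DAGGER input is an interpretation (assumption).
    ("DAGGER",
     {"power_report", "place_route", "arch_file", "tvpack_netlist"},
     {"bitstream"}),
]


def schedule(stages, initial_artifacts):
    """Return (execution order, final artifacts), running a stage once its inputs exist."""
    available = set(initial_artifacts)
    pending = list(stages)
    order = []
    while pending:
        ready = [stage for stage in pending if stage[1] <= available]
        if not ready:
            missing = {s[0]: sorted(s[1] - available) for s in pending}
            raise ValueError(f"unsatisfied inputs: {missing}")
        name, _inputs, outputs = ready[0]
        order.append(name)
        available |= outputs
        pending.remove(ready[0])
    return order, available


order, artifacts = schedule(STAGES, {"vhdl_source", "fpga_features"})

if __name__ == "__main__":
    print(" -> ".join(order))
```

Because the order is derived from the declared inputs rather than hard-coded, a tool can be dropped or replaced (for example, starting from an existing EDIF netlist) by changing the stage list and the initial artifacts, which mirrors the modularity claimed for the flow.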

4.1. Tools of the proposed design flow

In this section, the tools that form the complete design framework are described. All of them can be executed both from the command line and from the GUI presented in the next subsection. The proposed design flow possesses the following attractive features:

i) Technology independence: The proposed CAD flow provides process technology independence in order to allow designers to easily implement their design in different process technologies.
ii) Portability: The proposed flow has been designed to run on several hardware platforms, from i386-based microcomputers to SPARC stations.
iii) Modularity: Each tool can operate as a standalone program as well as part of the complete design framework. For this reason, most of the tools support several different standard VLSI circuit description formats (VHDL, EDIF, BLIF).
iv) Compactness: Unlike commercial CAD tools, the proposed CAD framework suits the limited resources of low-cost PCs. For PC operation, the minimum requirements for a Linux system are an i486 PC with 32 Mbytes of memory, appropriate disk storage (350 Mbytes), and graphic capabilities (X-Windows). Alternatively, the design flow can be used through the Internet or from a CD-ROM. In the first case, users only need a web browser installed on their PC and a connection to the Internet. If the CD-ROM is chosen, the user only has to boot the PC with the appropriate Live-CD, which includes the Linux operating system as well as the executables of the tools, and start using the design flow.
v) Ease of use: All the tools, as well as the proposed design flow, are simple to use with no experience of the Linux OS required, as the Graphical User Interface (GUI) provides a user-friendly interface.
Furthermore, the on-line documentation, in combination with the papers and manuals that are provided with each tool, helps the non-experienced user to program the FPGA by using the proposed design packages.

It should be noted that this tool set in its current version supports the island-style FPGA architecture described in Section 3.

VHDL Parser. VHDL Parser [26] is a tool that performs syntax checking of VHDL input files. Input: VHDL source. Output: Syntax check message. Usage: This tool is used to check the correctness of the VHDL file against the VHDL-93 standard [25].

DIVINER. Input: VHDL source. Output: EDIF netlist (commercial tool format). Usage: The DIVINER tool is used as a synthesizer of behavioral VHDL.

DRUID. Input: EDIF netlist (commercial tool format). Output: EDIF netlist (T-VPack format). Usage: The DRUID tool is used to modify the EDIF [28] output file that is produced during the synthesis step, so that it can be used by the following tools of the design flow.

E2FMT. Input: EDIF generic netlist. Output: BLIF generic netlist. Usage: Translation of the netlist from EDIF to BLIF [27] format.

SIS [30]. Input: BLIF generic netlist. Output: BLIF netlist (LUTs and F/Fs). Usage: LUT mapping.

T-VPack. Input: BLIF netlist (LUTs and F/Fs). Output: T-VPack netlist (LUTs and F/Fs packed to CLBs). Usage: The T-VPack tool [21] is used to group a LUT and an F/F to form a Basic Logic Element (BLE), or several BLEs to form a cluster.

DUTYS. Input: FPGA features. Output: FPGA architecture file. Usage: Generates the architecture file description of the target FPGA.

PowerModel. Input: BLIF netlist, placement and routing file. Output: Power estimation report. Usage: The PowerModel tool [14] estimates the dynamic, short-circuit, and leakage power consumption of an island-style FPGA.

VPR. Input: T-VPack netlist (LUTs and F/Fs), FPGA architecture file. Output: Placement and routing file. Usage: Placement and routing of the target circuit into the FPGA.
DAGGER. Input: PowerModel, placement and routing file, FPGA architecture file, T-VPack netlist. Output: FPGA configuration bitstream file. Usage: The DAGGER tool is used to generate the bitstream file.

4.2. Graphical User Interface

The Graphical User Interface (GUI) allows the designer to easily use all (or some) of the tools that are included in the proposed design flow. The GUI is shown in Fig. 12. It consists of six independent stages: i) File Upload, ii) Synthesis, iii) Format Translation, iv) Power Estimation, v) Placement and Routing and vi) FPGA Program. Until now, there has been no other academic implementation of such a complete graphical design chain. The main advantage of the GUI is that it is friendly to the non-experienced designer, who does not need to be familiar with the Linux OS. It can be run from a local PC or through the Internet/Intranet, and the source code can be easily modified in order to add more tools. Regardless of how it is executed (locally or through the network), the proposed interface runs in the web browser and can program an FPGA that is attached to the user's PC.

Fig. 12: The graphical user interface (GUI) for the tool

5. Conclusions

This paper demonstrated the first complete system for implementing digital logic on a fine-grain reconfigurable platform. It includes the design of both the FPGA architecture and the complete design flow (from VHDL to bitstream), consisting entirely of academic tools, which allows the mapping of logic on the presented novel FPGA architecture. The novel FPGA architecture was designed and implemented in STM 0.18µm CMOS technology. The simulation results obtained demonstrate the attractive features of the proposed architecture. Moreover, in contrast to commercial CAD systems, the proposed design flow can accomplish an FPGA design while being publicly available and very friendly to non-experienced designers.

Acknowledgement

This work was partially supported by the project IST-34379-AMDREL, which is funded by the European Commission.

References

1. http://direct.xilinx.com/bvdocs/publications/ds003.pdf
2. http://www.altera.com/products/devices/dev-index.jsp
3. E. Tau, D. Chen, I. Eslick, J. Brown, A. DeHon, "A First Generation FPGA Implementation", FPD 95, Third Canadian Workshop of Field-Programmable Devices, May 29-June 1, 1995, Montreal, Canada
4. C. Ebeling, G. Borriello, S. A. Hauck, D. Song, and E. A. Walkup, "TRIPTYCH: A new FPGA architecture", in FPGAs, W. Moore and W. Luk, Eds. Abingdon, U.K.: Abingdon, 1991, ch. 3.1, pp. 75-90
5. G. Borriello, C. Ebeling, S. A. Hauck, and S. Burns, "The Triptych FPGA architecture", IEEE Trans. VLSI Syst., vol. 3, pp. 491-500, Dec. 1995
6. S. Hauck, G. Borriello, S. Burns, and C. Ebeling, "MONTAGE: An FPGA for synchronous and asynchronous circuits", in Proc. 2nd Int. Workshop Field-Programmable Logic Applicat., Vienna, Austria, Sept. 1992
7. P. Chow, S. O. Seo, D. Au, T. Choy, B. Fallah, D. Lewis, C. Li, and J. Rose, "A 1.2 µm CMOS FPGA using cascaded logic blocks and segmented routing", in FPGAs, W. Moore and W. Luk, Eds. Abingdon, U.K.: Abingdon, 1991, ch. 3.2, pp. 91-102
8. V. George, H. Zhang, J. Rabaey, "The Design of a Low Energy FPGA", ISLPED 1999
9. K. Leijten-Nowak, J. L. van Meerbergen, "Embedded Reconfigurable Logic Core for DSP Applications", Field-Programmable Logic and Applications (FPL) 2002, Montpellier, France, 2002, pp. 89-101
10. http://www-cad.eecs.berkeley.edu/
11. http://ballade.cs.ucla.edu/
12. G. Varghese, J. M. Rabaey, "Low-Energy FPGAs: Architecture and Design", Kluwer Academic Publishers, 2001
13. V. Betz, J. Rose and A. Marquardt, "Architecture and CAD for Deep-Submicron FPGAs", Kluwer Academic Publishers, 1999
14. K. Poon, A. Yan, S. Wilton, "A Flexible Power Model for FPGAs", Field-Programmable Logic and Applications (FPL) 2002, Montpellier, France, 2002, pp. 312-321
15. A. G. M. Strollo, E. Napoli, C. Cimino, "Analysis of Power Dissipation in Double Edge Triggered Flip-Flops", IEEE Transactions on VLSI Systems, Vol. 8, No. 5, October 2000, pp. 624-628
16. B. S. Cherkauer, E. G. Friedman, "Unification of Speed, Power, Area and Reliability in CMOS Tapered Buffer Design", International Symposium on Circuits and Systems, ISCAS 1994, pp. 111-114
17. E. Ahmed, J. Rose, "The effect of LUT and cluster size on deep submicron FPGA performance and density", ACM International Symposium on FPGAs, Monterey, 2000, pp. 3-12
18. K. McElvain, "Benchmarks tests files", MCNC International Workshop on Logic Synthesis 1993, ftp://ftp.mcnc.org/pub/benchmark/benchmark_dirs/lgsynth93/LGSynth93.tar
19. R. Peset Llopis and M. Sachdev, "Low Power, Testable Dual Edge Triggered Flip-Flops", Proceedings of IEEE International Symposium on Low Power Electronics and Design, August 1996, Monterey, USA
20. T. Lo, W. Man Chung, and M. Sachdev, "A Comparative Analysis of Dual Edge Triggered Flip-flops", IEEE Transactions on VLSI Systems, Vol. 10, No. 6, December 2002, pp. 913-918
21. V. Betz, "VPR and T-VPack User's Manual", ver. 4.30, March 2000 (http://www.eecg.toronto.edu/~vaughn/vpr/vpr.html)
22. V. Betz, J. Rose, "Circuit Design, Transistor Sizing and Wire Layout of FPGA Interconnect", IEEE Custom Integrated Circuits Conference (CICC), San Diego, California, 1999
23. V. Betz, J. Rose, "Cluster-Based Logic Blocks for FPGAs: Area-Efficiency vs. Input Sharing and Size", IEEE Custom Integrated Circuits Conference, Santa Clara, CA, 1997, pp. 551-554

24. V. Betz, J. Rose, FPGA Routing Architecture: Segmentation and Buffering to Optimize Speed and Density, ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Monterey, CA, 1999, pp. 59-68.
25. http://opensource.ethz.ch/emacs/vhdl93_syntax.html
26. http://search.cpan.org/author/gslondon/hardware-vhdl-Parser-0.12/
27. http://www.bdd-portal.org/docu/blif/blif.html
28. http://www.edif.org
29. http://www.vlsi.ee.duth.gr/amdrel
30. M. Sentovich, K. J. Singh, L. Lavagno, et al., SIS: A System for Sequential Circuit Synthesis, UCB/ERL M92/41, 1992.