Investigation of Look-Up Table Based FPGAs Using Various IDCT Architectures

Jörn Gause

Abstract

This paper presents an investigation of Look-Up Table (LUT) based Field Programmable Gate Arrays (FPGAs) using various architectures of the Inverse Discrete Cosine Transform (IDCT). To compare FPGA architectures of different vendors, a generic FPGA model is developed and used in architecture independent modelling software. When the IDCT architectures are mapped to LUTs of different sizes, LUTs with three inputs yield the best results in terms of area. After placing and routing, FPGAs with a granularity of eight or sixteen LUTs and flip-flops per logic block were most efficient in terms of area and speed.

1. Introduction

Due to the increasing economic importance of image processing, the demands on image and video signal processing procedures keep rising. It is important to evaluate the influence of modified algorithm parameters on the image quality at an early stage and under real-time conditions. The traditional use of software simulation for the verification of algorithms and circuits generally cannot meet these demands. Field Programmable Gate Arrays (FPGAs), based on Look-Up Tables (LUTs) which are configured by Static RAM cells, provide a well-suited alternative for a reprogrammable and real-time implementation of signal processing procedures.

This paper presents an investigation of commercial LUT based FPGAs for their use in image processing. The Inverse Discrete Cosine Transform (IDCT) is an important element of video image processing schemes, e.g. H.263 [1] or MPEG [2], and its implementation is used in this project as an example of a typical image processing algorithm. To compare FPGAs of different vendors, dedicated programs for architecture independent modelling of FPGAs are used instead of vendor specific software tools to map the IDCT implementations into the LUTs and to place and route the logic blocks.
It is therefore necessary to develop a generic FPGA model and an appropriate design flow.

2. Implementations of the IDCT

Two different architectures of a two dimensional IDCT (2-D IDCT) are implemented in this project. Both are based on the Row-Column approach, which splits the 8x8 IDCT into two 8x1 IDCTs and a matrix transposition. The wordwidth of the data is 12 bits at the input and 9 bits at the output of the circuit, as used in the H.263 codec [1].

The first IDCT architecture is a so called Fast IDCT based on the algorithm of Zhang and Bergmann [3]. It uses only 11 multiplications and 29 additions for a 1-D IDCT and has a data rate of eight pixels per clock cycle.

The second IDCT architecture uses Distributed Arithmetic and is based on an algorithm of Sun, Wu, and Liou [4]. No multipliers are needed to perform the transform. The data rate is one pixel per clock cycle.

3. Design Mapping

3.1. Design Flow

The design flow used in this project, from the hardware description in Verilog HDL to placing and routing, is shown in Figure 3-1. The Logic Synthesis is performed by the Synopsys FPGA Compiler [5]. The result is a Xilinx Gate Netlist (name.xnf) without any logic hierarchies. Since all Xilinx specific elements are removed and, therefore, the netlist only contains the gates used and their connections, it is still architecture independent.

[Figure 3-1 Design Flow: HDL Description (name.v), Logic Synthesis with FPGA Compiler, Gate Netlist (name.xnf), LUT Synthesis with SIS and RASP for LUT Size K, LUT Netlist (name.blif, # LUTs), Clustering with VPACK for Cluster Size N, Partitioned LUT Netlist (name.net, # Clusters), Place & Route with VPR using FPGA Data, Placement Data (name.p), Routing Data (name.r), Statistics.]

To map the gates used for the implementation of the IDCT architectures into the LUTs of the FPGA logic blocks (Clusters), a logic synthesis system for SRAM based FPGAs called RASP (RApid Systems Prototyping) [6] is used. RASP contains a number of synthesis and optimisation algorithms for technology independent logic synthesis and for transforming gate netlists into LUTs of various sizes. The input file for RASP is a Xilinx netlist of the circuit, which is then transformed into a BLIF file (Berkeley Logic Interchange Format). RASP contains SIS [7], an interactive tool for synthesis and optimisation of sequential circuits, which separates the BLIF file into combinational and sequential logic. The combinational part of the design, represented as an acyclic graph, is then mapped into LUTs with K inputs. For this LUT synthesis, different algorithms, all based on FlowMap [8], can be used to optimise the design for area, speed, or a trade-off of both. The result is a BLIF file which contains the circuit description as a K-input LUT netlist for a given K.

The last steps in the design flow are the partitioning of LUTs with K inputs into logic blocks (Clustering) and the subsequent placement and routing of the logic blocks (Clusters) on the FPGA. The two tools VPACK [9] and VPR (Versatile Place and Route) [9] are used in this project to perform these tasks. VPACK packs the LUTs and flip-flops of the circuit into Clusters. The number of LUTs and flip-flops per Cluster (N) must be the same for all Clusters. The number of different inputs per Cluster (I) can be smaller than K*N.
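The Cluster limits described above (LUT size K, Cluster size N, Cluster inputs I) can be illustrated with a minimal Python sketch. It is not VPACK's algorithm; it only checks whether a given group of BLEs would fit into one Cluster. The `Ble` class, the net names, and the assumption that a net driven inside the Cluster does not consume one of the I external inputs are illustrative choices, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Ble:
    """One K-LUT plus flip-flop; net names are hypothetical labels."""
    inputs: Tuple[str, ...]
    output: str


def external_inputs(cluster: List[Ble]) -> Set[str]:
    """Distinct nets that must enter the Cluster from the routing channels."""
    # Assumption: nets driven by a BLE inside the Cluster are fed back locally.
    driven_inside = {ble.output for ble in cluster}
    used: Set[str] = set()
    for ble in cluster:
        used.update(ble.inputs)
    return used - driven_inside


def fits(cluster: List[Ble], k: int, n: int, i: int) -> bool:
    """Check the three model limits: at most N BLEs, K inputs per LUT, I Cluster inputs."""
    if len(cluster) > n:
        return False
    if any(len(set(ble.inputs)) > k for ble in cluster):
        return False
    return len(external_inputs(cluster)) <= i


# Two 4-LUTs that share net "a"; the second one also uses the first one's output.
EXAMPLE = [
    Ble(inputs=("a", "b", "c", "d"), output="x"),
    Ble(inputs=("x", "a", "e", "f"), output="y"),
]

if __name__ == "__main__":
    print(sorted(external_inputs(EXAMPLE)))  # six distinct nets, although K*N = 8
    print(fits(EXAMPLE, k=4, n=2, i=6))
```

In this example the two LUTs have K*N = 8 input pins but need only six distinct external nets, which is why a Cluster can get by with I smaller than K*N.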
After clustering, the VPR tool places the circuit onto an FPGA and tries to route it with a minimum number of wires (Tracks) per routing channel. FPGA specific features of the different routing architectures of the FPGAs examined in this project are needed as an input of the tool. The Simulated Annealing algorithm [10] is used for global routing in order to keep connected clusters as close to each other as possible. The detailed routing, where the nets are distributed on routing channels, is based on the Pathfinder Algorithm [9]. The tool outputs statistics such as the number of Tracks per channel, the average and maximum net length, and the average and maximum number of bends per net. This data can be used to distinguish between different FPGAs in terms of area and speed efficiency. The routing results can also be displayed by the graphic output of the VPR tool.

3.2. Generic FPGA Model

Due to the limitations of the modelling software, not all of the features of different FPGAs can be examined. Therefore, a simple generic model needed to be developed and used.

[Figure 3-2 Basic Logic Element (BLE): K Inputs, Clock, K-LUT, D-FF, Output.]

One LUT with K inputs (K-LUT) and one flip-flop are combined into a so called Basic Logic Element (BLE), as shown in Figure 3-2. One logic block, or Logic Cluster, consists of N BLEs which are connected to each other. Another parameter is the number of inputs per Cluster (I). As shown in Figure 3-3, not all of the K*N LUT inputs need to be accessible from the outside of the logic block.

[Figure 3-3 Logic Cluster: I Inputs, Clock, BLE 1 to BLE N, N Outputs.]

The Logic Clusters are aligned in rows and columns on the FPGA. The number of rows and columns must be equal. Logic Clusters and Input / Output Blocks (IOBs) are connected by Tracks which surround the logic blocks in channels. There are Switch Blocks at the junction of two channels, as shown in Figure 3-4. All routing segments have the same length. Inside a Switch Block, every segment can be connected to three other segments, which is also the case in most commercial FPGAs. Every input or output of a logic block can be connected to any track in the routing channel.

[Figure 3-4 Switch Block: Tracks and Programmable Switches.]

There are a number of restrictions in the FPGA model. All LUTs have the same number of inputs, and the number of LUTs equals the number of flip-flops in every Cluster. Due to the limitations of the modelling tools, special features like additional RAM blocks or non square architectures could not be considered. Furthermore, no timing analysis is possible with the software used.

4. Investigated FPGAs

Table 4-1 shows the LUT based FPGAs investigated in this project and the parameters used for them in the modelling tools. K is the number of inputs per LUT, N the number of LUTs and flip-flops per Cluster, I the number of inputs per Cluster, and IO Rate the number of IOBs at the end of every row or column of Clusters. The parameter I must be smaller than or equal to K*N.
FPGA       K   N    I    IO Rate
AT 40K     3   1    3    2
XC 3000    5   2    5    2
XC 4000    4   2    8    2
XC 5200    4   4    16   3
ORCA 2C    4   4    10   4
ORCA 3C    4   8    32   4
FLEX 8K    4   8    24   4
VF 1       4   16   64   3

Table 4-1 Investigated FPGAs

Due to the limitations of the software, some of the parameters used differ from those of the actual FPGAs. The Atmel 40K is the finest grained FPGA with two 3-LUTs but only one flip-flop per Cluster (therefore N=1, I=3). The logic block of a Xilinx XC3000 has one 5-input LUT, but two flip-flops and two outputs (therefore N=2). For the XC4000 model, only the two 4-LUTs could be used, since all LUTs have to have the same number of inputs (therefore K=4, N=2, I=8). Both the Xilinx XC5200 and the Lucent ORCA2C FPGA have four 4-LUTs per logic block, but the latter FPGA has only ten inputs per Cluster. There are eight 4-input LUTs in one Cluster of an ORCA3C and in one Cluster of an Altera FLEX8K. Since only FPGAs with the same number of Clusters per row and column can be modelled, the Altera model differs immensely from the actual FPGA, which has far more columns than rows of logic blocks. The most coarse grain FPGA architecture is the new Vantis VF1, which has been modelled with K=4 and N=16.
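As a small illustration, the model parameters of Table 4-1 can be written down as data and checked against the constraint I <= K*N. The Python sketch below does this; the `FpgaModel` type and the helper names are hypothetical, and the per-Cluster bit count uses the general fact that a K-input LUT stores 2^K configuration bits. The numbers describe the generic models only, not the real devices.

```python
from typing import List, NamedTuple


class FpgaModel(NamedTuple):
    """Generic model parameters as listed in Table 4-1."""
    name: str
    k: int        # inputs per LUT
    n: int        # LUTs and flip-flops per Cluster
    i: int        # inputs per Cluster
    io_rate: int  # IOBs at the end of every row or column


MODELS: List[FpgaModel] = [
    FpgaModel("AT 40K", 3, 1, 3, 2),
    FpgaModel("XC 3000", 5, 2, 5, 2),
    FpgaModel("XC 4000", 4, 2, 8, 2),
    FpgaModel("XC 5200", 4, 4, 16, 3),
    FpgaModel("ORCA 2C", 4, 4, 10, 4),
    FpgaModel("ORCA 3C", 4, 8, 32, 4),
    FpgaModel("FLEX 8K", 4, 8, 24, 4),
    FpgaModel("VF 1", 4, 16, 64, 3),
]


def inputs_valid(model: FpgaModel) -> bool:
    """The model requires I <= K*N."""
    return model.i <= model.k * model.n


def lut_bits_per_cluster(model: FpgaModel) -> int:
    """LUT configuration bits in one Cluster: N LUTs with 2^K bits each."""
    return model.n * 2 ** model.k


if __name__ == "__main__":
    for m in MODELS:
        fully_accessible = m.i == m.k * m.n
        print(m.name, inputs_valid(m), lut_bits_per_cluster(m), fully_accessible)
```

The last column printed marks the models in which every LUT input is reachable from outside the Cluster (I = K*N); in the others, such as ORCA 2C or FLEX 8K, the LUTs have to share inputs or use local feedback.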

5. Results

The LUT Synthesis using SIS [7] and RASP [6] was performed for both IDCT architectures and for LUTs of sizes from two to nine inputs. Figure 5-1 shows the number of LUTs with K inputs needed to implement the two IDCT architectures, dependent on K.

[Figure 5-1 Results of LUT Synthesis (I): # LUTs over K, for K = 2 to 9.]

The number of LUTs needed to implement the IDCT architectures decreases as K increases. The decrease is steep for small values of K, whereas the number of LUTs remains almost constant for larger values of K. A better cost function for the logic area is the number of SRAM cells (#LUTs * 2^K) needed to implement the designs in K-LUTs. This is shown in Figure 5-2.

[Figure 5-2 Results of LUT Synthesis (II): # SRAM cells over K, for K = 2 to 9.]

The minimum logic area (number of SRAM cells) is needed for LUTs with three inputs. For K=2, the design has to be distributed into a great number of LUTs, whereas for K>3 more and more logic area is unused because not all inputs of the LUTs are needed. It should be noted that no routing costs and no flip-flop costs are included in those results.

The statistics provided by the VPR tool are used to evaluate the results for the different FPGA models after partitioning, placing, and routing the IDCT implementations. The number of Tracks required to route the circuits onto the different FPGA models can be seen in Figure 5-3. The number of Tracks is the product of the number of Clusters per row or column and (channel width + 1), since all Clusters are surrounded by routing channels of the same width. Hence, the number of Tracks gives a fair cost function regarding the routing area.

[Figure 5-3 Number of Tracks: # Tracks for the models VF1, FLEX 8K, ORCA 3C, ORCA 2C, XC5200, XC4000, XC3000, and AT 40K.]

For both IDCT architectures, the smallest number of Tracks is needed for the fine grained AT40K FPGA.
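The two area cost functions used in this section, the number of SRAM cells (#LUTs * 2^K) and the number of Tracks (Clusters per row or column times (channel width + 1)), can be written as a short Python sketch. The function names are hypothetical, all numbers in the example are made up for illustration and are not results from the paper, and `grid_side` assumes that the smallest square grid holding all Clusters is used and ignores any limit imposed by the number of IOBs.

```python
import math


def sram_cells(num_luts: int, k: int) -> int:
    """Logic area cost: every K-input LUT needs 2^K SRAM cells."""
    return num_luts * 2 ** k


def grid_side(num_clusters: int) -> int:
    """Clusters per row or column of the smallest square grid (assumption: IOBs ignored)."""
    side = math.isqrt(num_clusters)
    return side if side * side >= num_clusters else side + 1


def number_of_tracks(clusters_per_row: int, channel_width: int) -> int:
    """Routing area cost as defined in the text."""
    return clusters_per_row * (channel_width + 1)


if __name__ == "__main__":
    # Hypothetical LUT counts: fewer but larger LUTs can still cost more SRAM cells.
    print(sram_cells(10000, 3))  # 10000 * 8
    print(sram_cells(7000, 4))   # 7000 * 16
    # Hypothetical design with 101 Clusters routed at a channel width of 12.
    side = grid_side(101)
    print(side, number_of_tracks(side, 12))
```

The first two calls show how a drop in LUT count from one K to the next can be outweighed by the doubling of SRAM cells per LUT, which is the effect behind a minimum at a small K.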
The relatively poor results of the XC3000 and FLEX8K models are mainly caused by the models' restrictions in connecting Cluster inputs and outputs to only certain routing channels (not all four directions). In general, the number of Tracks is higher for more coarse grained FPGA architectures. Even though the number of Clusters per row or column is small if the number of LUTs per Cluster (N) is high, the number of Tracks per channel (channel width) is considerably higher in a coarse grained architecture than in a fine grained FPGA.

Even though the delay behaviour of different FPGA architectures cannot be modelled explicitly with the VPR tool, it can be approximated using an appropriate cost function. The major share of the delay in an FPGA is caused by the capacitance and resistance of the programmable switches (pass transistors). Since the maximum net length between connected logic blocks is proportional to the number of switch boxes, and therefore to the number of programmable switches, it can be used to approximate the delay of the design. The maximum net length for the different FPGA models for both IDCT implementations can be seen in Figure 5-4.

[Figure 5-4 Maximum Net Length: Max. Net Length for the models VF1, FLEX 8K, ORCA 3C, ORCA 2C, XC5200, XC4000, XC3000, and AT 40K.]

The maximum net length, and hence the largest delay, occurs for the finest grained FPGAs. The shortest delay is to be expected in the coarse grain Vantis VF1.

The product of the number of Tracks (~ area) and the maximum net length (~ delay) is now used as an overall cost function for the efficiency of the routing architectures of the different FPGA models. The result is shown in Figure 5-5. It can be seen that the model of the Lucent ORCA 3C is most efficient for the IDCT using Distributed Arithmetic, whereas the most coarse grain VF1 is best suited for the larger architecture of the Fast-IDCT. In general, coarse grain FPGAs seem to be more appropriate for applications like the IDCT.

[Figure 5-5 # Tracks * Maximum Net Length: # Tracks * Max. Net Length (scaled) for the models VF1, FLEX 8K, ORCA 3C, ORCA 2C, XC5200, XC4000, XC3000, and AT 40K.]

6. Conclusion

In this project, commercial LUT based FPGAs were investigated for their use in image processing. Two architectures of the 2-D IDCT were used as examples of typical image processing algorithms. A generic FPGA model was developed to examine eight commercial FPGAs in an architecture independent design flow.
The tools RASP and SIS were used to map the IDCT architectures into LUTs of various sizes, and VPACK and VPR were used to place and route the designs. With this project, an entire design flow now exists to analyse FPGAs using real circuits. It could be shown that LUTs with three inputs were best suited in terms of area for mapping the logic into look-up tables. The product of the number of routing tracks and the maximum net length was used to analyse the routing efficiency of the FPGAs. The most coarse grain FPGA models (eight or sixteen 4-input LUTs per logic block) yielded the best results. It could also be shown that the most appropriate granularity of an FPGA architecture depends on the implemented circuit. The granularity should be coarser for larger circuits.
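The routing cost function from Section 5, the product of the number of Tracks and the maximum net length, can be sketched in a few lines of Python. The model names and all numbers below are hypothetical placeholders, not measurements from this project. The paper does not state how the plotted values were scaled, so normalising to the best model is an assumption made here for illustration.

```python
from typing import Dict, List, Tuple

# (number of Tracks, maximum net length) per model; values are made up.
Results = Dict[str, Tuple[int, int]]


def routing_cost(tracks: int, max_net_length: int) -> int:
    """Overall cost: routing area (~ Tracks) times delay (~ maximum net length)."""
    return tracks * max_net_length


def rank(results: Results) -> List[str]:
    """Model names ordered from most to least efficient routing."""
    return sorted(results, key=lambda name: routing_cost(*results[name]))


def relative_cost(results: Results) -> Dict[str, float]:
    """Cost of every model divided by the cost of the best one (assumed scaling)."""
    best = min(routing_cost(*v) for v in results.values())
    return {name: routing_cost(*v) / best for name, v in results.items()}


HYPOTHETICAL: Results = {
    "fine_grained": (400, 900),
    "medium_grained": (800, 300),
    "coarse_grained": (1200, 150),
}

if __name__ == "__main__":
    print(rank(HYPOTHETICAL))
    print(relative_cost(HYPOTHETICAL))
```

With these placeholder numbers the coarse grained model needs the most Tracks but still has the lowest product, which mirrors the kind of trade-off the cost function is meant to expose.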

References

[1] ITU-T Rec. H.263, Video Coding for Low Bit Rate Communication, Dec. 1995.
[2] ISO / IEC, Generic Coding of Moving Pictures and Associated Audio Systems (MPEG-2 Systems Specification), ISO / IEC 13818-1, Nov. 1994.
[3] J. Zhang, N.W. Bergmann, A New 8*8 Fast DCT Algorithm for Image Compression, IEEE Visual Signal Processing and Communications, Workshop Proceedings, Melbourne, Australia, Sep. 1993, pp. 57-60.
[4] M.T. Sun, L. Wu, M.L. Liou, A Concurrent Architecture for VLSI Implementation of Discrete Cosine Transform, IEEE Trans. on Circuits and Systems, vol. CAS-34, no. 8, Aug. 1987, pp. 992-994.
[5] Synopsys Inc., FPGA Compiler User Guide, v1998.2, 1998.
[6] J. Cong, J. Peck, Y. Ding, RASP: A General Logic Synthesis System for SRAM-based FPGAs, Proc. ACM/SIGDA Int. Symp. on FPGAs, Monterey, California, Feb. 1996, pp. 137-143.
[7] E.M. Sentovich et al., SIS: A System for Sequential Circuit Synthesis, Tech. Report No. UCB/ERL M92/41, University of California, Berkeley, 1992.
[8] J. Cong, Y. Ding, FlowMap: An Optimal Technology Mapping Algorithm for Delay Optimization in LUT Based FPGA Designs, IEEE Trans. on Computer-Aided Design, vol. 13, no. 1, 1994, pp. 1-12.
[9] V. Betz, J. Rose, VPR: A New Packing, Placement and Routing Tool for FPGA Research, 7th Int. Workshop on Field-Programmable Logic, London, August 1997, pp. 213-222.
[10] S. Kirkpatrick et al., Optimization by Simulated Annealing, Science, vol. 220, no. 4598, May 1983, pp. 671-680.
[11] Atmel Inc., AT40K FPGAs, 1997.
[12] Xilinx Inc., The Programmable Logic Data Book, 1998.
[13] Lucent Technologies Inc., ORCA Data Sheets, 1998.
[14] Altera Inc., Data Book, 1998.
[15] Vantis Inc., Vantis VF1 FPGA, 1998.