Investigation of Look-Up Table Based FPGAs Using Various IDCT Architectures

Jörn Gause

Abstract

This paper presents an investigation of Look-Up Table (LUT) based Field Programmable Gate Arrays (FPGAs) using various architectures of the Inverse Discrete Cosine Transform (IDCT). To compare FPGA architectures from different vendors, a generic FPGA model is developed and used in architecture independent modelling software. When the IDCT architectures are mapped to LUTs of different sizes, LUTs with three inputs yield the best results in terms of area. After placing and routing, FPGAs with a granularity of eight or sixteen LUTs and flip-flops per logic block were the most efficient in terms of area and speed.

1. Introduction

Due to the increasing economic importance of image processing, the demands on image and video signal processing procedures are rising steadily. It is important to evaluate how modifications of algorithm parameters influence the image quality at an early stage and under real-time conditions. The traditional use of software simulation for the verification of algorithms and circuits generally cannot meet these demands. Field Programmable Gate Arrays (FPGAs) based on Look-Up Tables (LUTs), which are configured by static RAM cells, provide a well-suited alternative for a reprogrammable, real-time implementation of signal processing procedures.

This paper presents an investigation of commercial LUT based FPGAs for their use in image processing. The Inverse Discrete Cosine Transform (IDCT) is an important element of video image processing schemes, e.g. H.263 [1] or MPEG [2], and its implementation is used in this project as an example of a typical image processing algorithm. To compare FPGAs from different vendors, dedicated programmes for architecture independent modelling of FPGAs are used instead of vendor specific software tools to map the IDCT implementations into the LUTs and to place and route the logic blocks.
It is therefore necessary to develop a generic FPGA model and an appropriate design flow.

2. Implementations of the IDCT

Two different architectures of a two-dimensional IDCT (2-D IDCT) are implemented in this project. Both are based on the Row-Column approach, which splits the 8x8 IDCT into two 8x1 IDCTs and a matrix transposition. The word width of the data is 12 bits at the input and 9 bits at the output of the circuit, as used in the H.263 codec [1].

The first IDCT architecture is a so-called Fast IDCT based on the algorithm of Zhang and Bergmann [3]. It uses only 11 multiplications and 29 additions for a 1-D IDCT and has a data rate of eight pixels per clock cycle.

The second IDCT architecture uses Distributed Arithmetic and is based on an algorithm of Sun, Wu, and Liou [4]. No multipliers are needed to perform the transform. The data rate is one pixel per clock cycle.

3. Design Mapping

3.1. Design Flow

The design flow used in this project, from the hardware description in Verilog HDL to placing and routing, is shown in Figure 3-1.

The Logic Synthesis is performed by the Synopsys FPGA Compiler [5]. The result is a Xilinx Gate Netlist (name.xnf) without any logic hierarchies. Since all Xilinx specific elements are removed and, therefore, the netlist
only contains the gates used and their connections, it is still architecture independent.

[Figure 3-1: Design Flow. HDL Description (name.v) -> Logic Synthesis (FPGA Compiler) -> Gate Netlist (name.xnf) -> LUT Synthesis (SIS, RASP; parameter: LUT Size K) -> LUT Netlist (name.blif; statistic: # LUTs) -> Clustering (VPACK; parameter: Cluster Size N) -> Partitioned LUT Netlist (name.net; statistic: # Clusters) -> Place & Route (VPR; input: FPGA Data) -> Placement Data (name.p), Routing Data (name.r), Statistics.]

To map the gates used for the implementation of the IDCT architectures into the LUTs of the FPGA logic blocks (Clusters), a logic synthesis system for SRAM based FPGAs called RASP (RApid Systems Prototyping) [6] is used. RASP contains a number of synthesis and optimisation algorithms for technology independent logic synthesis and for transforming gate netlists into LUTs of various sizes. The input file for RASP is a Xilinx netlist of the circuit, which is then transformed into a BLIF (Berkeley Logic Interchange Format) file. RASP contains SIS [7], an interactive tool for the synthesis and optimisation of sequential circuits, which separates the BLIF file into combinational and sequential logic. The combinational part of the design, represented as an acyclic graph, is then mapped into LUTs with K inputs. For this LUT synthesis, different algorithms, all based on FlowMap [8], can be used to optimise the design for area, for speed, or for a trade-off between the two. The result is a BLIF file which contains the circuit description as a K-input LUT netlist for a given K.

The last steps in the design flow are the partitioning of the K-input LUTs into logic blocks (Clustering) and the subsequent placement and routing of the logic blocks (Clusters) on the FPGA. The two tools VPACK [9] and VPR (Versatile Place and Route) [9] are used in this project to perform these tasks.

VPACK packs the LUTs and flip-flops of the circuit into Clusters. The number of LUTs and flip-flops per Cluster (N) must be the same for all Clusters. The number of different inputs per Cluster (I) can be smaller than K*N.

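The clustering constraints described above (at most N LUT/flip-flop pairs and at most I distinct external inputs per Cluster) can be illustrated with a small greedy packer. This is only a sketch under stated assumptions and not VPACK's actual algorithm: the function names, the netlist representation (each LUT/flip-flop pair named after its output net), and the "most shared nets first" selection rule are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Set


def external_inputs(members: List[str], bles: Dict[str, Set[str]]) -> Set[str]:
    """Nets a cluster must receive from outside: LUT inputs not driven inside it."""
    used = set().union(*(bles[m] for m in members))
    return used - set(members)


def pack_bles(bles: Dict[str, Set[str]], n: int, i: int) -> List[List[str]]:
    """Greedily pack LUT/flip-flop pairs into clusters (illustrative, not VPACK).

    `bles` maps each pair (named after its output net) to the nets feeding its
    LUT. Each cluster holds at most `n` pairs and needs at most `i` external inputs.
    """
    for name in bles:
        if len(external_inputs([name], bles)) > i:
            raise ValueError(f"{name} alone needs more than {i} cluster inputs")

    unpacked = list(bles)                    # dicts keep insertion order
    clusters: List[List[str]] = []
    while unpacked:
        cluster = [unpacked.pop(0)]          # seed: next unpacked element
        while len(cluster) < n:
            cluster_nets = set(cluster).union(*(bles[m] for m in cluster))
            best, best_shared = None, -1
            for cand in unpacked:
                if len(external_inputs(cluster + [cand], bles)) > i:
                    continue                 # would exceed the input limit I
                shared = len(cluster_nets & (bles[cand] | {cand}))
                if shared > best_shared:     # prefer candidates sharing more nets
                    best, best_shared = cand, shared
            if best is None:
                break                        # nothing else fits this cluster
            cluster.append(best)
            unpacked.remove(best)
        clusters.append(cluster)
    return clusters


# Hypothetical four-element netlist: "b" reads the output of "a", "d" that of "c".
example = {"a": {"x1", "x2"}, "b": {"a", "x3"}, "c": {"y1", "y2"}, "d": {"c", "y3"}}
print(pack_bles(example, n=2, i=3))
```

In this toy netlist, a net produced and consumed inside the same Cluster does not count towards I, which is why I can be smaller than K*N without preventing a Cluster from being filled.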
Afterwards, the VPR tool places the circuit onto an FPGA and tries to route it with a minimum number of wires (Tracks) per routing channel. The tool requires the FPGA specific features of the routing architectures examined in this project as input. The Simulated Annealing algorithm [10] is used for placement in order to keep connected Clusters as close to each other as possible. The detailed routing, in which the nets are distributed over the routing channels, is based on the Pathfinder algorithm [9]. The output data of the tool are statistics such as the number of Tracks per channel, the average and maximum net length, and the average and maximum number of bends per net. These data can be used to distinguish between different FPGAs in terms of area and speed efficiency. The routing results can also be displayed using the graphical output of the VPR tool.

3.2. Generic FPGA Model

Due to the limitations of the modelling software, not all of the features of the different FPGAs can be examined. Therefore, a simple generic model had to be developed and used.

[Figure 3-2: Basic Logic Element (BLE). Labels: K Inputs, K-LUT, D-FF, Clock, Output.]

One LUT with K inputs (K-LUT) and one flip-flop are combined to form a so-called Basic Logic Element (BLE), as shown in Figure 3-2. One logic block, or Logic Cluster, consists of N BLEs which are connected to each other. Another parameter is the number of inputs per Cluster (I). As shown in Figure 3-3, not all of the K*N LUT inputs need to be accessible from outside the logic block.

[Figure 3-3: Logic Cluster. Labels: I Inputs, Clock, BLE 1 to BLE N, N Outputs.]

The Logic Clusters are arranged in rows and columns on the FPGA. The number of rows and columns must be equal. Logic Clusters and Input/Output Blocks (IOBs) are connected by Tracks which surround the logic blocks in channels. There are Switch Blocks at the junction of two channels, as shown in Figure 3-4. All routing segments have the same length. Inside a Switch Block, every segment can be connected to three other segments, which is also the case in most commercial FPGAs. Every input or output of a logic block can be connected to any Track in the routing channel.

[Figure 3-4: Switch Block. Labels: Track, Programmable Switch.]

There are a number of restrictions in the FPGA model. All LUTs have the same number of inputs, and the number of LUTs and the number of flip-flops per Cluster are equal. Due to the limitations of the modelling tools, special features such as additional RAM blocks or non-square architectures could not be considered. Furthermore, no timing analysis is possible with the software used.

4. Investigated FPGAs

Table 4-1 shows the LUT based FPGAs investigated in this project and their parameters as used in the modelling tools. K is the number of inputs per LUT, N the number of LUTs and flip-flops per Cluster, I the number of inputs per Cluster, and IO Rate the number of IOBs at the end of every row or column of Clusters. The parameter I must be smaller than or equal to K*N.

FPGA       K    N    I    IO Rate
AT 40K     3    1    3    2
XC 3000    5    2    5    2
XC 4000    4    2    8    2
XC 5200    4    4    16   3
ORCA 2C    4    4    10   4
ORCA 3C    4    8    32   4
FLEX 8K    4    8    24   4
VF 1       4    16   64   3

Table 4-1 Investigated FPGAs

Due to the limitations of the software, some of the parameters used differ from those of the actual FPGAs. The Atmel AT40K is the finest grained FPGA, with two 3-LUTs but only one flip-flop per Cluster (therefore N=1, I=3). The logic block of a Xilinx XC3000 has one 5-input LUT, but two flip-flops and two outputs (therefore N=2). For the XC4000 model, only the two 4-LUTs could be used, since all LUTs must have the same number of inputs (therefore K=4, N=2, I=8). Both the Xilinx XC5200 and the Lucent ORCA 2C have four 4-LUTs per logic block, but the latter has only ten inputs per Cluster. There are eight 4-input LUTs in one Cluster of an ORCA 3C and in one Cluster of an Altera FLEX 8K. Since only FPGAs with the same number of Clusters per row and per column can be modelled, the Altera model differs immensely from the actual FPGA, which has far more columns than rows of logic blocks. The most coarse grained FPGA architecture is the new Vantis VF1, which has been modelled with K=4 and N=16.

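The model parameters of Table 4-1 can be captured in a small data structure, which also makes the constraint I <= K*N easy to check. The following is a minimal sketch; the class and field names are assumptions introduced here for illustration and do not correspond to the input format of VPACK or VPR. The numeric values are copied from Table 4-1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FpgaModel:
    """Generic FPGA model parameters as listed in Table 4-1 (names are illustrative)."""
    name: str
    k: int        # inputs per LUT
    n: int        # LUTs and flip-flops per Cluster
    i: int        # inputs per Cluster
    io_rate: int  # IOBs at the end of every row or column

    def lut_inputs(self) -> int:
        """Total number of LUT inputs inside one Cluster (K*N)."""
        return self.k * self.n

    def is_valid(self) -> bool:
        """The model requires I <= K*N."""
        return 1 <= self.i <= self.lut_inputs()


MODELS = [
    FpgaModel("AT 40K", 3, 1, 3, 2),
    FpgaModel("XC 3000", 5, 2, 5, 2),
    FpgaModel("XC 4000", 4, 2, 8, 2),
    FpgaModel("XC 5200", 4, 4, 16, 3),
    FpgaModel("ORCA 2C", 4, 4, 10, 4),
    FpgaModel("ORCA 3C", 4, 8, 32, 4),
    FpgaModel("FLEX 8K", 4, 8, 24, 4),
    FpgaModel("VF 1", 4, 16, 64, 3),
]

# Models in which not every LUT input is reachable from outside the Cluster.
restricted = [m.name for m in MODELS if m.i < m.lut_inputs()]
print(restricted)
```

With the values of Table 4-1, all eight models satisfy the constraint, and three of them (XC 3000, ORCA 2C, FLEX 8K) expose fewer Cluster inputs than they have LUT inputs.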
5. Results

The LUT synthesis using SIS [7] and RASP [6] was performed for both IDCT architectures and for LUT sizes from two to nine inputs. Figure 5-1 shows the number of K-input LUTs needed to implement the two IDCT architectures as a function of K.

[Figure 5-1: Results of LUT Synthesis (I). # LUTs (axis 0 to 35000) versus K (2 to 9).]

The number of LUTs needed to implement the IDCT architectures decreases as K increases. The decrease is large for small values of K, whereas the number of LUTs remains almost constant for larger values of K. A better cost function for the logic area is the number of SRAM cells (#LUTs * 2^K) needed to implement the designs in K-LUTs. This is shown in Figure 5-2.

[Figure 5-2: Results of LUT Synthesis (II). # SRAM cells (axis 0 to 2500000) versus K (2 to 9).]

The minimum logic area (number of SRAM cells) is needed for LUTs with three inputs. For K=2, the design has to be distributed over a great number of LUTs, whereas for K>3 more and more logic area is unused because not all inputs of the LUTs are needed. It should be noted that neither routing costs nor flip-flop costs are included in these results.

The statistics provided by the VPR tool are used to evaluate the results for the different FPGA models after partitioning, placing, and routing the IDCT implementations. The number of Tracks required to route the circuits on the different FPGA models can be seen in Figure 5-3. The number of Tracks is the product of the number of Clusters per row or column and (channel width + 1), since all Clusters are surrounded by routing channels of the same width. Hence, the number of Tracks gives a fair cost function for the routing area.

[Figure 5-3: Number of Tracks. # Tracks (axis 0 to 2000) for each FPGA model: VF1, FLEX 8K, ORCA 3C, ORCA 2C, XC5200, XC4000, XC3000, AT 40K.]

For both IDCT architectures, the smallest number of Tracks is needed for the fine grained AT40K FPGA.

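The two area cost functions used above can be written down directly. The sketch below is only illustrative: the function names are assumptions, and the numbers in the example calls are hypothetical placeholders, not results from this investigation. The third function states a consequence of the SRAM cost: since each additional LUT input doubles the number of SRAM cells per LUT, moving from K to K+1 inputs reduces the logic area only if the LUT count drops to less than half.

```python
def sram_cells(num_luts: int, k: int) -> int:
    """Logic area cost: every K-input LUT needs 2^K SRAM cells."""
    return num_luts * 2 ** k


def total_tracks(clusters_per_side: int, channel_width: int) -> int:
    """Routing area cost: Clusters per row or column times (channel width + 1)."""
    return clusters_per_side * (channel_width + 1)


def larger_lut_pays_off(luts_at_k: int, luts_at_k_plus_1: int) -> bool:
    """True if going from K to K+1 inputs lowers the SRAM cell count.

    n_(K+1) * 2^(K+1) < n_K * 2^K  is equivalent to  2 * n_(K+1) < n_K.
    """
    return 2 * luts_at_k_plus_1 < luts_at_k


# Hypothetical values for illustration only.
print(sram_cells(1000, 4))             # 1000 four-input LUTs
print(total_tracks(20, 11))            # 20 Clusters per side, channel width 11
print(larger_lut_pays_off(1000, 600))  # LUT count falls by less than half
```

This makes the shape of the SRAM cost curve plausible: once the LUT count stops falling quickly with K, the factor 2^K dominates and the logic area grows again.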
The relatively poor results of the XC3000 and FLEX 8K models are mainly caused by the models' restriction that Cluster inputs and outputs can be connected only to certain routing channels rather than to the channels on all four sides. In general, the number of Tracks is higher for more coarse grained FPGA architectures. Even though the number of Clusters per row or column is small if the number of LUTs per Cluster (N) is high, the number of Tracks per channel (channel width) is considerably higher in a coarse
grained architecture than in a fine grained FPGA.

Even though the delay behaviour of different FPGA architectures cannot be modelled explicitly with the VPR tool, it can be approximated using an appropriate cost function. The major share of the delay in an FPGA is caused by the capacitance and resistance of the programmable switches (pass transistors). Since the maximum net length between connected logic blocks is proportional to the number of Switch Blocks, and therefore to the number of programmable switches, it can be used to approximate the delay of the design. The maximum net length for the different FPGA models and both IDCT implementations can be seen in Figure 5-4.

[Figure 5-4: Maximum Net Length. Max. Net Length (axis 0 to 1500) for each FPGA model: VF1, FLEX 8K, ORCA 3C, ORCA 2C, XC5200, XC4000, XC3000, AT 40K.]

One can see that the maximum net length, and hence the largest delay, occurs for the finest grained FPGAs. The shortest delay is to be expected in the coarse grained Vantis VF1.

The product of the number of Tracks (~ area) and the maximum net length (~ delay) is now used as an overall cost function for the efficiency of the routing architectures of the different FPGA models. The result is shown in Figure 5-5. It can be seen that the model of the Lucent ORCA 3C is the most efficient for the IDCT using Distributed Arithmetic, whereas the most coarse grained VF1 is best suited for the larger architecture of the Fast IDCT. In general, coarse grained FPGAs seem to be more appropriate for applications like the IDCT.

[Figure 5-5: # Tracks * Maximum Net Length. # Tracks * Max. Net Length, scaled (axis 0 to 2000), for each FPGA model: VF1, FLEX 8K, ORCA 3C, ORCA 2C, XC5200, XC4000, XC3000, AT 40K.]

6. Conclusion

In this project, commercial LUT based FPGAs were investigated for their use in image processing. Two architectures of the 2-D IDCT were used as examples of typical image processing algorithms. A generic FPGA model was developed to examine eight commercial FPGAs in an architecture independent design flow.
The tools RASP and SIS were used to map the IDCT architectures into LUTs of various sizes, and VPACK and VPR were used to cluster, place, and route the designs. This project thus provides a complete design flow for analysing FPGAs using real circuits.

It was shown that LUTs with three inputs were best suited in terms of area for mapping the logic into look-up tables. The product of the number of routing Tracks and the maximum net length was used to analyse the routing efficiency of the FPGAs. The most coarse grained FPGA models (eight or sixteen 4-input LUTs per logic block) yielded the best results. It was also shown that the most appropriate granularity of an FPGA architecture depends on the implemented circuit. The granularity should be coarser for larger circuits.

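The routing efficiency measure summarised above, the product of the number of Tracks and the maximum net length, can be sketched as a small ranking routine. The function names are assumptions, and so is the scaling relative to the worst model; the model names and statistics in the example are hypothetical placeholders and are not taken from Figures 5-3 to 5-5.

```python
from typing import Dict, List, Tuple


def routing_cost(num_tracks: int, max_net_length: int) -> int:
    """Area-delay product: number of Tracks (~ area) times max. net length (~ delay)."""
    return num_tracks * max_net_length


def rank_models(stats: Dict[str, Tuple[int, int]]) -> List[Tuple[str, float]]:
    """Return (model, scaled cost) pairs, most efficient first.

    `stats` maps a model name to (number of Tracks, maximum net length).
    Costs are scaled so that the least efficient model has cost 1.0
    (an assumed normalisation, chosen here for readability).
    """
    costs = {name: routing_cost(t, length) for name, (t, length) in stats.items()}
    worst = max(costs.values())
    scaled = [(name, cost / worst) for name, cost in costs.items()]
    return sorted(scaled, key=lambda pair: pair[1])


# Hypothetical placeholder statistics, for illustration only.
example_stats = {
    "fine": (300, 900),
    "medium": (600, 400),
    "coarse": (900, 200),
}
for name, cost in rank_models(example_stats):
    print(f"{name}: {cost:.2f}")
```

A product of this kind rewards an architecture that needs more Tracks only if its longest net shrinks by a larger factor, which is the trade-off behind the comparison of fine and coarse grained models.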
References

[1] ITU-T Rec. H.263, Video Coding for Low Bit Rate Communication, Dec. 1995.
[2] ISO/IEC, Generic Coding of Moving Pictures and Associated Audio Systems (MPEG-2 Systems Specification), ISO/IEC 13818-1, Nov. 1994.
[3] J. Zhang, N.W. Bergmann, A New 8*8 Fast DCT Algorithm for Image Compression, IEEE Visual Signal Processing and Communications, Workshop Proceedings, Melbourne, Australia, Sep. 1993, pp. 57-60.
[4] M.T. Sun, L. Wu, M.L. Liou, A Concurrent Architecture for VLSI Implementation of Discrete Cosine Transform, IEEE Trans. on Circuits and Systems, vol. CAS-34, no. 8, Aug. 1987, pp. 992-994.
[5] Synopsys Inc., FPGA Compiler User Guide, v1998.2, 1998.
[6] J. Cong, J. Peck, Y. Ding, RASP: A General Logic Synthesis System for SRAM-based FPGAs, Proc. ACM/SIGDA Int. Symp. on FPGAs, Monterey, California, Feb. 1996, pp. 137-143.
[7] E.M. Sentovich et al., SIS: A System for Sequential Circuit Synthesis, Tech. Report No. UCB/ERL M92/41, University of California, Berkeley, 1992.
[8] J. Cong, Y. Ding, FlowMap: An Optimal Technology Mapping Algorithm for Delay Optimization in LUT Based FPGA Designs, IEEE Trans. on Computer-Aided Design, vol. 13(1), 1994, pp. 1-12.
[9] V. Betz, J. Rose, VPR: A New Packing, Placement and Routing Tool for FPGA Research, 7th Int. Workshop on Field-Programmable Logic, London, August 1997, pp. 213-222.
[10] S. Kirkpatrick et al., Optimization by Simulated Annealing, Science, vol. 220, no. 4598, May 1983, pp. 671-680.
[11] Atmel Inc., AT40K FPGAs, 1997.
[12] Xilinx Inc., The Programmable Logic Data Book, 1998.
[13] Lucent Technologies Inc., ORCA Data Sheets, 1998.
[14] Altera Inc., Data Book, 1998.
[15] Vantis Inc., Vantis VF1 FPGA, 1998.