A Fast Constant Coefficient Multiplier for the XC6200


Tom Kean, Bernie New and Bob Slous
Xilinx Inc.

Abstract

We discuss the design of a high performance constant coefficient multiplier on the Xilinx XC6200 FPGA. The dynamic reconfiguration capabilities of the device are used to allow the constant coefficient to be changed rapidly. The design also provides better performance and density than similar multipliers on state of the art conventional FPGAs, which require complete reconfiguration to change the coefficient.

Introduction

The Xilinx XC6200 [1] is the first commercial FPGA specifically designed to work in close cooperation with a microprocessor. To the microprocessor the XC6200 appears as a block of RAM: configuration data and registers within the user design appear within this memory space. The data bus width is programmable to 8, 16 or 32 bits. Additional features support writing many words of configuration memory with the same data simultaneously (to reduce configuration time) and grouping nonadjacent registers within the user design into a single word for transfer to and from the processor. XC6200 application development and debug is supported by the XACTStep Series 6000 CAD software package and by a PCI bus resident development system [2]. The XC6200 architecture is derived from the Algotronix CAL technology, which Xilinx acquired in 1992 [3] [4].

Constant coefficient multiplication is a common function in many Digital Signal Processing (DSP) applications such as Finite Impulse Response (FIR) filtering. In this case the constant coefficients represent filter parameters and are changed much less frequently than the input data values. The multiplier described here multiplies an 8-bit input number by a constant 8-bit coefficient to give a 16-bit result; the numbers are unsigned.

An efficient technique for implementing fixed coefficient multipliers on Look Up Table (LUT) based FPGAs was developed by Xilinx [5]. This relies on lookup tables, rather than a network of adders, to perform most of the multiplication. For example, there are 16 possible results when a four bit number is multiplied by an 8 bit fixed number (because there are 16 different four bit numbers). Thus a four bit variable times eight bit constant multiplier can be implemented by a 16 entry lookup table. Each entry must be 12 bits, the width of the largest possible output. An 8-bit by 8-bit constant multiplier can be built using two of these 4-bit by 8-bit constant multipliers in the configuration shown in figure 1. Figure 2 explains the simple mathematics behind this arrangement.
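The decomposition described above can be modelled in a few lines of Python. This is a minimal behavioural sketch, not the authors' implementation; the function names `build_lut` and `multiply` are illustrative assumptions. Because both tables store the coefficient times a 4-bit number, the model uses one table for both LUT-A and LUT-B.

```python
def build_lut(coefficient):
    """Return the 16-entry table of coefficient * n for every 4-bit n."""
    if not 0 <= coefficient <= 0xFF:
        raise ValueError("coefficient must be an unsigned 8-bit value")
    table = [coefficient * n for n in range(16)]
    # The largest entry is 255 * 15 = 3825, which fits in 12 bits.
    assert all(entry < (1 << 12) for entry in table)
    return table


def multiply(data, lut):
    """Multiply an unsigned 8-bit value by the constant encoded in lut."""
    if not 0 <= data <= 0xFF:
        raise ValueError("data must be an unsigned 8-bit value")
    low = lut[data & 0xF]           # LUT-A: coefficient * low nibble
    high = lut[(data >> 4) & 0xF]   # LUT-B: coefficient * high nibble
    # The high-nibble product carries a weight of 16, hence the shift.
    return (high << 4) + low
```

Changing the coefficient in this model only means rebuilding the 16-entry table, which mirrors why reprogramming the LUT contents is sufficient in hardware.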

Figure 1: LUT Based Multiplier (block diagram: 8-bit data, 4-bit inputs to LUT-A and LUT-B, 12-bit LUT outputs, adder)

Physical Implementation

Figure 3 shows the hardware components required to implement the 8x8-bit multiplier. Pipeline registers have been added to increase the throughput. Note that not only the adder but also the LUTs are pipelined.

Unlike XC4000 FPGAs, which use LUTs to implement logic functions, XC6200s have a simpler function unit built around a 2:1 multiplexer. This function unit can implement a register and either a 2:1 multiplexer or any one of the 16 logical functions of two input variables. The ability to implement a register and a combinational logic function makes pipelining inexpensive in this technology.

Figure 4 shows a naive approach to implementing a 4-input lookup table using 2:1 multiplexers and registers. Figure 5 shows how this can be collapsed by encoding the leaf nodes of the tree into gate functions. It takes four bits of configuration memory to encode the truth table of a 2-input logic gate, so what we are doing is moving the 16 bits of memory provided by user registers in figure 4 into bits of configuration memory.

It is possible to obtain a very regular layout for the individual 4-input LUTs as a single row of 8 cells. These slices can be stacked vertically to form the 12-bit output, 4-bit input LUTs required. The fact that the layout is so regular makes it easy to determine which parts of the chip need to be reconfigured at run time. The layout of the multiplier LUT is shown in figure 6. Note that the pipelining scheme used requires that some slices, corresponding to more significant bits, have additional registers.
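The collapsed construction of figure 5 can be illustrated with a small behavioural model: four 2-input gates of A0 and A1 feed a tree of three 2:1 multiplexers selected by A2 and A3. This is a sketch under stated assumptions. The bit ordering of the truth table (bit i holds F for address i = 8*A3 + 4*A2 + 2*A1 + A0) and all function names are hypothetical and are not taken from the XC6200 data sheet; pipeline registers are omitted.

```python
def mux(select, when0, when1):
    """2:1 multiplexer: the basic XC6200 function unit in this model."""
    return when1 if select else when0


def split_truth_table(truth_table):
    """Split a 16-bit truth table into four 4-bit gate functions of (A0, A1)."""
    return [(truth_table >> (4 * k)) & 0xF for k in range(4)]


def gate(function_bits, a0, a1):
    """Evaluate a 2-input gate described by its 4-bit truth table."""
    return (function_bits >> (2 * a1 + a0)) & 1


def lut4(truth_table, a0, a1, a2, a3):
    """4-input LUT built from four leaf gates and three multiplexers."""
    g = [gate(bits, a0, a1) for bits in split_truth_table(truth_table)]
    lower = mux(a2, g[0], g[1])   # half of the table where A3 = 0
    upper = mux(a2, g[2], g[3])   # half of the table where A3 = 1
    return mux(a3, lower, upper)
```

In this model each LUT uses four gates and three multiplexers, so only the four 4-bit gate functions depend on the table contents; the multiplexer tree is fixed.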

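Figure 2: Maths of LUT Based Multiplier (rows: Constant, LUT-B Input, LUT-A Input, LUT-B Output, LUT-A Output, Adder Output)

Figure 3: Multiplier Architecture (M[7:0] is split into M[7:4] for LUT-B and M[3:0] for LUT-A; A[3:0] passes through a register to P[3:0]; A[7:4] and B[3:0] feed a 4-bit full adder producing P[7:4]; A[11:8] and B[7:4] feed a 4-bit full adder producing P[11:8]; B[11:8] feeds a 4-bit half adder producing P[15:12])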
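The arithmetic behind figures 2 and 3 is that P = K*M = 16*(K*M[7:4]) + K*M[3:0], so the low four bits of the LUT-A output need no addition at all. The following sketch models the adder partition suggested by the figure 3 labels. Reading "4 bit half" as a stage that only adds the incoming carry to B[11:8] is an interpretation of those labels, and the function name `combine` is an assumption; pipeline registers are not modelled.

```python
def combine(a, b):
    """Combine 12-bit LUT outputs a (low nibble product) and b (high nibble product)."""
    p0 = a & 0xF                                    # A[3:0] goes straight to P[3:0]

    s1 = ((a >> 4) & 0xF) + (b & 0xF)               # 4-bit full adder: A[7:4] + B[3:0]
    p1, carry1 = s1 & 0xF, s1 >> 4

    s2 = ((a >> 8) & 0xF) + ((b >> 4) & 0xF) + carry1   # 4-bit full adder: A[11:8] + B[7:4]
    p2, carry2 = s2 & 0xF, s2 >> 4

    # Half-adder stage: B[11:8] plus carry only. The product of two
    # 8-bit values is below 2**16, so this stage cannot overflow.
    p3 = (((b >> 8) & 0xF) + carry2) & 0xF

    return (p3 << 12) | (p2 << 8) | (p1 << 4) | p0
```

For a coefficient K and input M, `combine(K * (M & 0xF), K * (M >> 4))` reproduces K*M in this model.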

Figure 4: Building LUTs from Muxes (inputs A0, A1, A2, A3; output F(A0,A1,A2,A3))

Figure 5: Optimised LUT Construction (inputs A0, A1, A2, A3; output F(A0,A1,A2,A3); annotation: the LUT function is encoded into the functions implemented by these gates)

Figure 6: LUT layout on XC6200 (columns Func1, Func2, Func3, Func4)

Changing Coefficients

The XC6216 memory map is designed so that the control memory for functionally related resources on the chip occurs in logically adjacent bits and words of control memory. The bits which control the gate function of a cell occur in the same byte of memory. With a 32-bit data bus this byte in four vertically adjacent cells can be written simultaneously. In figure 6 the vertical columns labelled Func1 through Func4 contain the gates whose functions are changed according to the lookup table; each horizontal slice implements one 4-input LUT. There are 12x4=48 cells in each multiplier LUT which may need to be reconfigured if the coefficient changes, and 2 LUTs, for a total of 96 cells. Thus 24 write cycles are required to completely reprogram the LUTs, or 960ns with a 50MHz clock.
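The write count above can be reproduced with a short model that derives the gate functions for a coefficient and groups them four cells per write. This is a sketch only: the 4-bit gate codes are plain truth tables with a hypothetical bit ordering, not the real XC6200 configuration encoding, and the names `slice_gate_codes` and `reconfiguration_writes` are assumptions. The per-write time is inferred by dividing the quoted 960ns by 24; the text does not state how many clock cycles one write takes.

```python
def slice_gate_codes(coefficient, output_bit):
    """Four 4-bit gate truth tables (Func1..Func4) for one output-bit slice."""
    truth_table = 0
    for n in range(16):
        truth_table |= ((coefficient * n >> output_bit) & 1) << n
    return [(truth_table >> (4 * k)) & 0xF for k in range(4)]


def reconfiguration_writes(coefficient, slices=12, cells_per_write=4):
    """Group gate codes so each write covers four vertically adjacent cells."""
    codes = [slice_gate_codes(coefficient, bit) for bit in range(slices)]
    writes = []
    for column in range(4):                           # Func1 .. Func4
        for base in range(0, slices, cells_per_write):
            writes.append([codes[base + s][column] for s in range(cells_per_write)])
    return writes


writes_per_lut = len(reconfiguration_writes(0xA5))    # 12 writes for 48 cells
total_writes = 2 * writes_per_lut                     # two LUTs: 24 writes
ns_per_write = 960 / total_writes                     # implied by the quoted 960ns
print(total_writes, ns_per_write)
```

The implied 40ns per write equals two 20ns periods of a 50MHz clock, which is consistent with the quoted figure if each write takes two clock cycles.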

Multiplier Performance

The multiplier has been simulated at 75MHz using the -2 speed grade. The equivalent design on the XC4000E -3 grade runs at 65MHz (note that the speed grade numbering system is different for the two families so this is a fair comparison). The cost is around 25 CLBs in the XC4000E and 280 used cells in the XC6200. Normalising for silicon area, approximately 10.6 XC6200 cells take the same area on the same process as one XC4000E CLB, so 25 CLBs correspond to about 265 cells and the areas are almost equivalent.

The big difference between the designs is that the XC6200 allows the coefficients to be changed in less than 1 µs, whereas the XC4000E would require the entire chip to be reconfigured, which takes several ms. XC4000 devices can support changing of LUT configuration data without reconfiguring the chip by using the feature which allows user logic to address LUT RAM. If this technique is used, however, significant additional logic must be added to the user design to control the LUT RAM and to interface with the external source of data (e.g. a microprocessor). Routing resources will also be tied up routing between IOBs and the LUT RAM control logic. This circuitry is supplied for free by the dedicated processor interface on the XC6200.

Increasing Density

The design presented above is laid out to preserve regularity in order to make dynamic reconfiguration to change coefficients fast and efficient. If longer reconfiguration times can be tolerated, it is possible to reduce area substantially by co-optimising the logic of the groups of 12 LUTs which share the same 4 inputs.

A simple way of doing this is to notice that only 5 unique functions of the first two input variables must be generated. There are 16 possible functions, but the second 8 are the logical inverses of the first, and the inverse can be calculated by inverting the corresponding multiplexer input in the second level of the tree. Of the remaining 8 functions, 3 are trivial (ZERO, A0 and A1) and do not require a logic gate. Thus we need only calculate the 5 non-trivial functions and route these 5 functions and the two input variables to the 12 sets of muxes to build the LUT. This requires 3*12+5=41 cells rather than the 7*12=84 cells of the previous design, at the expense of having a different routing pattern for each possible LUT. This will almost halve the size of the XC6216 design while still allowing coefficients to be changed much faster than on the XC4000E. Further reduction in gate count to 30 gates for the worst case coefficient is possible with more complex logic optimisation algorithms.

Summary

This paper has shown that the XC6200 fine grained multiplexer based FPGAs can implement LUT based functions almost as efficiently as state of the art LUT based FPGAs. In fact the fine grained FPGA can be more efficient when several lookup tables have the same set of input variables, because it allows common logic used by all LUTs to be implemented only once.
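The function count used in the Increasing Density section can be checked by enumeration. The sketch below lists all 16 functions of two inputs as 4-bit truth tables, pairs each with its logical inverse, and removes the three trivial cases. The truth-table bit ordering (bit i holds f for i = 2*A1 + A0) and the function name are illustrative assumptions.

```python
def shared_leaf_functions():
    """Return (representatives, nontrivial) for 2-input functions up to inversion."""
    trivial = {0b0000, 0b1010, 0b1100}        # ZERO, A0, A1
    representatives = set()
    for table in range(16):
        # Of each function and its inverse, keep the one with f(0, 0) = 0.
        canonical = table if table & 1 == 0 else table ^ 0xF
        representatives.add(canonical)
    nontrivial = sorted(r for r in representatives if r not in trivial)
    return sorted(representatives), nontrivial


representatives, nontrivial = shared_leaf_functions()
muxes_per_lut, luts_per_group, gates_per_lut = 3, 12, 4
shared_cells = muxes_per_lut * luts_per_group + len(nontrivial)     # 3*12 + 5
regular_cells = (muxes_per_lut + gates_per_lut) * luts_per_group    # 7*12
print(len(representatives), len(nontrivial), shared_cells, regular_cells)
```

With this ordering the five non-trivial functions are A0 AND NOT A1, NOT A0 AND A1, XOR, AND and OR, which matches the count of 5 in the text.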

XC6200 devices implement the adder more efficiently than coarse grain devices, which allows the fine grain implementation of the multiplier to take almost identical area to the coarse grain implementation even when common LUT logic is not optimised. The LUT implementation techniques described here, and the results reported, are specific to the XC6200. The available fine grain FPGAs have significantly different functional and routing resources and each may require a different approach to LUT implementation.

In addition, the ability of the XC6200 architecture to dynamically reconfigure small portions of the device very quickly makes the look up table based technique much more attractive, since coefficients or other parameters stored in the lookup tables can be changed rapidly. The configuration values required for a particular coefficient can be calculated rapidly on the fly from input data by a microprocessor; no complex CAD tools are required.

References

[1] Xilinx Inc., XC6200 FPGA Family Advanced Product Description. Available from Xilinx Inc., 2100 Logic Drive, San Jose, CA.
[2] Wayne Luk and Nabeel Shirazi, Modelling and Optimising Run Time Reconfigurable Systems, Proc. IEEE Symposium on FPGAs for Custom Computing Machines, Napa, CA, 1996.
[3] Tom Kean, Configurable Logic: A Dynamically Programmable Cellular Architecture and its VLSI Implementation, PhD Thesis CST-62-89, University of Edinburgh, Dept. of Computer Science.
[4] Tom Kean and John Gray, Configurable Hardware: A New Paradigm for Computation, Advanced Research in VLSI, Proc. Decennial Caltech Conference, MIT Press, 1989.
[5] Ken Chapman, Fast Integer Multipliers fit in FPGAs, EDN 1993 Design Idea Winner, EDN, May 12th 1994.