Dynamically Reconfigurable FIR Filter Architectures with Fast Reconfiguration

Dynamically Reconfigurable FIR Filter Architectures with Fast Reconfiguration Martin Kumm, Konrad Möller and Peter Zipf University of Kassel, Germany

FIR FILTER
- Fundamental component in digital signal processing
- Computationally complex due to numerous multiply/accumulate operations

WHY RECONFIGURATION?
- Many applications require the coefficients to change...
- ...but only from time to time
- This opens the possibility to reduce complexity

METHODS OF RECONFIGURATION
1. Integrating multiplexers into the design
2. Partial reconfiguration (e.g., using ICAP)
3. Reconfigurable LUTs

MULTIPLEXER BASED RECONFIGURATION
- Multiplexers are integrated into add/shift networks
- Extremely fast reconfiguration (single clock cycle)
- Only a limited set of coefficients possible!
[Faust et al. 10]

PARTIAL RECONFIGURATION
- Partial regions of the FPGA are reconfigured via ICAP
- Least resources
- Arbitrary coefficients...
- ...but synthesis is needed for each coefficient set
- Slow reconfiguration (μs/ms)!

RECONFIGURABLE LUTS
- Only the LUT content is changed
- Routing has to be fixed
- First academic tool available (TLUT flow, [Bruneel et al. 11])
- Fast reconfiguration (a few clock cycles, ns/μs)
- Arbitrary coefficients...
- ...but (again) synthesis is needed for each coefficient set
- Not if a generic architecture is mapped to a fixed routing

RECONFIGURABLE LUTS
FPGA components to realize reconfigurable LUTs:
- Older Xilinx FPGAs (Virtex 1-4): Shift-Register LUT (SRL16)
- Newer Xilinx FPGAs (Virtex 5/6, Spartan 6, 7-Series): CFGLUT5 (similar to SRLC32E but with two output functions)
- Other FPGA vendors: distributed RAM or block RAM
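To make the idea of a shift-register LUT concrete, here is a minimal behavioral sketch in Python. It is not vendor code: the class name `ShiftRegisterLUT`, the method names, and the bit order (data enters at bit 0 and is loaded MSB first) are assumptions for illustration. It models a 32-bit truth table that is rewritten serially, one bit per clock cycle, with a 5-input output and a 4-input output taken from the lower 16 bits.

```python
class ShiftRegisterLUT:
    """Simplified behavioral model of a 5-input LUT with serial reconfiguration."""

    def __init__(self, init=0):
        self.bits = init & 0xFFFFFFFF  # 32 configuration bits (the truth table)

    def shift_in(self, cdi):
        """One configuration clock cycle: shift one bit in at position 0."""
        cdo = (self.bits >> 31) & 1  # bit that falls out at the top (cascade output)
        self.bits = ((self.bits << 1) | (cdi & 1)) & 0xFFFFFFFF
        return cdo

    def load(self, table):
        """Serially load a 32-entry truth table, MSB first. Returns the cycle count."""
        for i in range(31, -1, -1):
            self.shift_in((table >> i) & 1)
        return 32

    def o6(self, addr):
        """5-input function: select one of the 32 stored bits."""
        return (self.bits >> (addr & 0x1F)) & 1

    def o5(self, addr):
        """4-input function: select one of the lower 16 stored bits."""
        return (self.bits >> (addr & 0x0F)) & 1


# Example: reconfigure the LUT to a 5-input parity function.
parity5 = sum((bin(a).count("1") & 1) << a for a in range(32))
lut = ShiftRegisterLUT()
cycles = lut.load(parity5)
print(cycles, lut.o6(0b00111), lut.o6(0b00011))
```

In this model a full reconfiguration takes 32 shift cycles regardless of the function being loaded, which is why the reconfiguration time scales with the clock period rather than with the coefficient values.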

METHODS OF RECONFIGURATION
1. Integrating multiplexers into the design: logic fixed, routing flexible
2. Partial reconfiguration (e.g., using ICAP): logic flexible, routing flexible
3. Reconfigurable LUTs: logic flexible, routing fixed

LUT BASED FIR FILTER
Two well-known methods that employ LUTs in a fixed structure, suitable for FIR filters:
1. Distributed Arithmetic [Croisier et al. 73] [Zohar 73] ... [Kumm et al. 13]
2. LUT based multipliers [Chapman 96] [Wiatr et al. 01]

The main question is: "Which architecture performs best?"

DISTRIBUTED ARITHMETIC
- Main idea is rearranging the underlying inner product
- Resulting function (realized as LUT) is identical for each bit b
- Less configuration memory

$$
y = \vec{c} \cdot \vec{x}
  = \sum_{n=0}^{N-1} c_n x_n
  = \sum_{n=0}^{N-1} c_n \sum_{b=0}^{B_x-1} 2^b x_{n,b}
  = \sum_{b=0}^{B_x-1} 2^b \underbrace{\sum_{n=0}^{N-1} c_n x_{n,b}}_{= f(\vec{x}_b)\ \text{(LUT)}}
$$

$$
\vec{x}_b = (x_{0,b}, x_{1,b}, \ldots, x_{N-1,b})^T
$$
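The rearrangement above can be checked with a small software sketch. The following Python example is only an illustration, not the hardware architecture from the slides: the function names `build_da_lut` and `da_inner_product` are hypothetical, the samples are assumed to be unsigned B_x-bit integers (two's-complement inputs would need a negative weight on the sign bit), and a single LUT with all N inputs is used instead of a partitioned one.

```python
def build_da_lut(coeffs):
    """Precompute f(x_b) = sum_n c_n * x_{n,b} for all 2^N bit vectors x_b."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << n)]


def da_inner_product(lut, samples, bx):
    """Inner product of unsigned bx-bit samples using LUT reads, shifts and adds."""
    y = 0
    for b in range(bx):
        # Collect bit b of every sample into one LUT address (the vector x_b).
        addr = 0
        for i, x in enumerate(samples):
            addr |= ((x >> b) & 1) << i
        # The same LUT content is reused for every bit position b.
        y += lut[addr] << b
    return y


coeffs = [3, -5, 7, 2]          # example coefficient set (arbitrary values)
samples = [200, 17, 255, 64]    # unsigned 8-bit input samples
lut = build_da_lut(coeffs)
y = da_inner_product(lut, samples, bx=8)
print(y, sum(c * x for c, x in zip(coeffs, samples)))
```

Changing the coefficient set only requires rebuilding `lut`; the loop structure, which corresponds to the fixed routing, stays the same.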

DISTRIBUTED ARITHMETIC OVERALL ARCHITECTURE
[Figure: block diagram consisting of pre-processing to exploit coefficient symmetry, reconfigurable LUTs, output adder tree, and reconfiguration circuit]

DISTRIBUTED ARITHMETIC MAPPING TO CFGLUT5

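LUT MULTIPLIER FIR FILTER
Basic idea: split a multiplication into smaller chunks which fit into the FPGA LUT:

$$
\underbrace{c_n x_n}_{B_c \times B_x\ \text{mult.}}
  = \underbrace{c_n \sum_{b=0}^{L-1} 2^b x_{n,b}}_{B_c \times L\ \text{mult.}}
  + 2^L \underbrace{c_n \sum_{b=0}^{L-1} 2^b x_{n,b+L}}_{B_c \times L\ \text{mult.}}
  + \ldots
$$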
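The chunked multiplication can be sketched in a few lines of Python. This is an illustrative model under stated assumptions: the names `build_mult_lut` and `lut_multiply` are hypothetical, the input x is treated as an unsigned B_x-bit value, and the chunk width L is a free parameter (L = 4 is used here only as an example).

```python
def build_mult_lut(c, chunk_bits):
    """Table of c * v for every chunk value v of chunk_bits bits."""
    return [c * v for v in range(1 << chunk_bits)]


def lut_multiply(lut, x, bx, chunk_bits):
    """Compute c * x for an unsigned bx-bit x from per-chunk LUT reads."""
    mask = (1 << chunk_bits) - 1
    y = 0
    for shift in range(0, bx, chunk_bits):
        chunk = (x >> shift) & mask   # next L bits of x
        y += lut[chunk] << shift      # partial product weighted by 2^shift
    return y


c = -23                               # example coefficient (arbitrary value)
lut = build_mult_lut(c, chunk_bits=4)
product = lut_multiply(lut, 0xB7, bx=8, chunk_bits=4)
print(product, c * 0xB7)
```

As in the distributed arithmetic case, a new coefficient only changes the table content, while the chunk extraction and the shifted additions remain fixed.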

LUT MULTIPLIER MAPPING TO CFGLUT5

LUT MULTIPLIER OVERALL ARCHITECTURE
[Figure: FIR filter structure in which the multipliers are replaced by reconfigurable multipliers]

CONTROL ARCHITECTURE

RESOURCE COMPARISON
M = ⌈N/2⌉: number of unique taps; B_x / B_c: input / coefficient bit width

Distributed Arithmetic:
- B_x + 1 LUTs with M inputs
- CFGLUTs: $(B_x + 1) \lceil M/4 \rceil \lceil B_c/2 + 1 \rceil \approx \tfrac{1}{4} (B_x + 1) M (B_c/2 + 1)$

LUT Multiplier FIR:
- M LUTs with B_x inputs
- CFGLUTs: $M \lceil B_x/4 \rceil \lceil B_c/2 + 2 \rceil \approx \tfrac{1}{4} B_x M (B_c/2 + 2)$

RESOURCE COMPARISON (continued)
Surprisingly, the CFGLUT requirements of both architectures are very similar!

RESOURCE COMPARISON
Adders (with M = N/2):
- Distributed Arithmetic: $M + B_x + (B_x + 1) \lceil M/4 \rceil$
- LUT Multiplier FIR: $2M - 1 + M \lceil B_x/4 \rceil$

So, LUT multiplier based FIR filters are better when

$$
2M - 1 + \frac{M B_x}{4} < M + B_x + \frac{(B_x + 1) M}{4}
\quad\Longleftrightarrow\quad
\frac{3}{4} M - 1 < B_x ,
$$

i.e., when the input word size B_x is greater than approximately half the number of coefficients.
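The resource formulas can be evaluated directly for a given filter. The sketch below assumes the ceiling form of the slide formulas as read from the transcript; the function names and the example parameters (M = 21, B_x = 16, B_c = 16) are hypothetical and chosen only to illustrate the comparison.

```python
from math import ceil


def da_resources(m, bx, bc):
    """CFGLUT and adder estimates for the distributed arithmetic architecture."""
    cfgluts = (bx + 1) * ceil(m / 4) * ceil(bc / 2 + 1)
    adders = m + bx + (bx + 1) * ceil(m / 4)
    return cfgluts, adders


def lut_mult_resources(m, bx, bc):
    """CFGLUT and adder estimates for the LUT multiplier architecture."""
    cfgluts = m * ceil(bx / 4) * ceil(bc / 2 + 2)
    adders = 2 * m - 1 + m * ceil(bx / 4)
    return cfgluts, adders


def lut_multiplier_preferred(m, bx):
    """Rule of thumb from the adder comparison: 3/4 * M - 1 < B_x."""
    return 0.75 * m - 1 < bx


# Hypothetical example: M = 21 unique taps, 16-bit inputs and coefficients.
m, bx, bc = 21, 16, 16
print("DA      :", da_resources(m, bx, bc))
print("LUT mult:", lut_mult_resources(m, bx, bc))
print("LUT multiplier preferred:", lut_multiplier_preferred(m, bx))
```

These are first-order counts only; the slice numbers after synthesis depend on packing and pipelining, which the formulas do not model.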

RESULTS: 1ST EXPERIMENT
- Synthesis experiment for Virtex 6
- Nine benchmark filters with length N = 6...151
- Input word size B_x ∈ {8, 16, 24, 32}
- Very fast reconfiguration times: 49...106 ns
- High clock frequencies: 472 MHz / 494 MHz (DA / LUT mult.)

RESULTS: 1ST EXPERIMENT
LUT multiplier improvement compared to DA:
[Figure: slice improvement [%] versus filter length N (N = 6, 10, 13, 20, 28, 41, 61, 119, 151) in four panels: (a) input word size B_x = 8 bit, (b) B_x = 16 bit, (c) B_x = 24 bit, (d) B_x = 32 bit]
As expected, the LUT multiplier architecture is best for low N.

RESULTS: 1ST EXPERIMENT (continued)
Choosing the right architecture can save up to 40% of the slices.

RESULTS: 2ND EXPERIMENT
- Comparison with partial reconfiguration via ICAP
- Ten different filters with N = 41 were highly optimized using the PMCM optimization RPAG [Kumm et al. 12]

Method            S [bit]   Slices      f_clk [MHz]     T_rec [ns]
RPAG with ICAP    746496    502...569   386.7...448.8   233280
Reconf. FIR DA    1920      1071        521.9           61.3
Reconf. FIR LUT   14784     1108        487.8           65.6

Configuration memory is reduced by a factor of 388 (DA) and 50 (LUT mult.).

RESULTS: 2ND EXPERIMENT (continued)
Slice requirements are roughly doubled: 1071 (DA) and 1108 (LUT mult.) versus 502...569 slices for RPAG with ICAP.

RESULTS: 2ND EXPERIMENT (continued)
Performance is similar: f_clk of 521.9 MHz (DA) and 487.8 MHz (LUT mult.) versus 386.7...448.8 MHz for RPAG with ICAP.

RESULTS: 2ND EXPERIMENT (continued)
Reconfiguration time is drastically reduced, by a factor of 3556 or more: 61.3 ns (DA) and 65.6 ns (LUT mult.) versus 233280 ns for RPAG with ICAP!
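The quoted reduction factors follow directly from the figures reported for the second experiment, as the short Python calculation below shows. The dictionary layout and variable names are only illustrative. The last column, reconfiguration time multiplied by clock frequency, is an inference from the reported numbers and not a statement from the slides: it comes out at about 32 clock cycles for both architectures.

```python
# Values as reported for the second experiment (N = 41).
icap = {"S_bit": 746496, "T_rec_ns": 233280.0}
reconf = {
    "Reconf. FIR DA": {"S_bit": 1920, "f_clk_MHz": 521.9, "T_rec_ns": 61.3},
    "Reconf. FIR LUT": {"S_bit": 14784, "f_clk_MHz": 487.8, "T_rec_ns": 65.6},
}

summary = {}
for name, r in reconf.items():
    summary[name] = {
        # How many times less configuration memory than ICAP.
        "mem_ratio": icap["S_bit"] / r["S_bit"],
        # How many times faster than reconfiguration via ICAP.
        "time_ratio": icap["T_rec_ns"] / r["T_rec_ns"],
        # Reconfiguration time expressed in clock cycles (ns * MHz / 1000).
        "cycles": r["T_rec_ns"] * r["f_clk_MHz"] / 1000.0,
    }

for name, s in summary.items():
    print(f"{name}: memory x{s['mem_ratio']:.1f}, "
          f"time x{s['time_ratio']:.0f}, cycles {s['cycles']:.1f}")
```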

CONCLUSION
- Two different reconfigurable FIR filter architectures for arbitrary coefficient sets were analyzed
- Both are implemented using reconfigurable LUTs (CFGLUTs)
- The LUT multiplier architecture typically needs fewer slices when the input word size is greater than approximately half the number of coefficients (and vice versa)
- Both architectures reconfigure about 3500 times faster than partial reconfiguration using ICAP
- This comes at the cost of roughly twice the slice resources

RECOSOC CONCLUSION
If you have a reconfigurable FPGA circuit which allows a fixed routing: use reconfigurable LUTs!

THANK YOU!