LUT Optimization for Distributed Arithmetic-Based Block Least Mean Square Adaptive Filter

Abstract:
In this paper, we analyze the contents of the lookup tables (LUTs) of the distributed arithmetic (DA)-based block least mean square (BLMS) adaptive filter (ADF) and, based on that analysis, propose intra-iteration LUT sharing to reduce its hardware resources, energy consumption, and iteration period. The proposed LUT optimization scheme offers a saving of 60% of the LUT content for block size 8, and still higher savings for larger block sizes, over the conventional design approach. The logic size, area, and power consumption of the proposed architecture are analyzed using Xilinx 14.2.

Enhancement of the project:

Existing System:
A distributed arithmetic (DA)-based design approach has been proposed to derive low-complexity hardware structures for ADFs. The DA-based ADF uses lookup tables (LUTs) for the calculation of the filter output and the weight-increment terms, and these LUTs constitute most of its hardware resources. One existing DA-based LMS ADF structure uses two separate LUTs for the calculation of the filter output and the weight-increment terms.

A few design schemes have been suggested in the recent past for efficient realization of the LMS ADF in FPGA. A DA-based pipelined structure has been proposed for the realization of the delayed LMS ADF with low adaptation delay. Subsequently, another DA-based design was proposed for LMS ADFs, in which a single LUT performs both filtering and weight updating, and a parallel LUT-update method reduces the LUT-update time. Carry-save accumulation is used to further reduce the iteration period of the DA-based LMS structure.

A few DA-based designs have also been proposed for the FPGA realization of the BLMS ADF. We have proposed a DA structure for the BLMS ADF. Although many DA-based designs have been suggested for LMS- and BLMS-based ADFs, we do not find any LUT optimization scheme in the literature specific to the BLMS DA-LUT. In this paper, we analyze the intra-iteration LUT contents of the DA-based BLMS ADF design to find the redundant LUT words that could be shared to minimize the hardware resources, the number of LUT accesses, the energy consumption, and the iteration period.

Disadvantages:
The LUT size is large.
LUT-update is complex.

Proposed System:
Allred et al. have identified the LUT redundancy corresponding to successive iterations of the DA-based LMS ADF, and based on that only half of the auxiliary LUT contents is updated. No LUT optimization scheme, however, has been proposed to take advantage of redundant LUT values in the DA-LMS computation. We observe that, in the DA-based LMS ADF, the redundant LUT values belong to different processing cycles, so they must be stored either inside or outside the LUT, which consumes the same amount of resources either way. Therefore, the redundant LUT values of the DA-based LMS ADF offer no LUT optimization apart from reducing the number of LUT words to be updated. In the DA-based BLMS ADF, however, the redundant LUT values of L successive iterations are created within one processing cycle, where L is the block size, and this opens up the possibility of LUT optimization.

Conventionally, 16NP LUT words are required to implement the NP LUTs of the LU matrix. For filter length N = 16, 256 LUT words are required to implement the LU matrix for L = 4. The contents of the LU matrix of the BLMS filter for block size L = 4 are shown in Fig. 1. The LUT content is represented by the function E(.), which enumerates the 16 possible partial sums of the elements of an input vector.
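As a rough illustration of what one conventional LUT holds, the sketch below enumerates the 16 partial sums of a 4-sample input vector, which is how the function E(.) is described above. The function names `build_da_lut` and `conventional_lut_words`, the address-bit ordering, and the sample values are assumptions made for this example and do not come from the paper.

```python
def build_da_lut(samples):
    """Enumerate all partial sums of `samples` (assumed bit ordering).

    Word at address a is the sum of samples[j] for every bit j set in a,
    so a 4-sample vector yields 2**4 = 16 words.
    """
    n = len(samples)
    return [
        sum(samples[j] for j in range(n) if (a >> j) & 1)
        for a in range(1 << n)
    ]


def conventional_lut_words(N, P):
    """Word count 16*N*P quoted in the text for the whole LU matrix."""
    return 16 * N * P


# Hypothetical input vector, chosen only for illustration.
lut = build_da_lut([3, -1, 4, 2])
print(len(lut))                       # 16 words per LUT
print(lut[0b0101])                    # samples[0] + samples[2] = 7
print(conventional_lut_words(16, 1))  # 256, the N = 16, L = 4 case in the text
```

The value P = 1 used in the last line is inferred from the figures in the text (16NP = 256 for N = 16), not stated there directly.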

Fig. 1. LUT content of the LU matrix of block size L = 4 for four consecutive iterations [kth, (k + 1)th, (k + 2)th, and (k + 3)th]. Light gray: LUTs of successive iterations with identical content. Gray: succeeding LUTs with overlapped input vectors. The input argument $s_{i,0}^{k}$, for 0 ≤ i ≤ 3, of the first column of LU is defined for the kth-iteration input block {x(n) … x(n − 3)}, where n = kL. Here {x(n) … x(n − 3)} denotes the input sequence {x(n), x(n − 1), x(n − 2), x(n − 3)}.

Intra-iteration LUT Sharing
The LUT content depends on the argument $s_{ij}^{k,p}$ of the LUT enumeration function E, which does not change during an iteration. We analyze the arguments $s_{ij}^{k,p}$ corresponding to one column of the LU matrix to find the redundant values in the LUTs of that column.

Inter-iteration LUT Reuse
As shown in Fig. 1, the LUT contents of the first (M − 1) columns of LUs of any given iteration can be reused by the last (M − 1) columns of LUs during the next iteration, so they need not be updated.

Proposed Design Strategy
The entire LUT content needs to be available in the same cycle for LUT words to be shared. Conventional RAM-based LUTs are not suitable for LUT sharing, since in any given cycle they allow access to only one of the stored LUT values (or a few, in the case of multiported RAM). A register-based LUT (REG-LUT) could be used instead for the proposed DA-based design. Based on these facts, we have arrived at the following design strategy to derive an area-delay-power efficient structure for the DA-based BLMS ADF.
1) A register-based shared LUT is used instead of the conventional RAM-based LUT to exploit intra-iteration LUT sharing.
2) Based on the inter-iteration LUT reuse provision of the BLMS ADF, only one column out of the (N/L) columns of the LU matrix is updated in every iteration.
3) A fully parallel design of the LUT-update unit is used to generate the update values of one LU column, so that its contents are updated in one cycle.
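The redundancy that intra-iteration sharing exploits can be made concrete with a small counting sketch. It assumes, as an illustration only, that every LUT of one LU column is addressed by 4 consecutive input samples, that successive rows shift the window by one sample, and that each row is split across P = L/4 LUTs. The names `window_starts` and `distinct_nonzero_words` are hypothetical.

```python
from itertools import combinations

WIDTH = 4  # samples addressed by one LUT, giving 16 words per LUT


def window_starts(L):
    """Start offsets of the 4-sample windows in one LU column.

    Assumption: row i, slice p covers samples i + 4p .. i + 4p + 3,
    with P = L // 4 slices per row.
    """
    P = L // WIDTH
    return sorted({i + WIDTH * p for i in range(L) for p in range(P)})


def distinct_nonzero_words(L):
    """Count distinct non-empty sample subsets over all windows.

    Each subset stands for one partial sum; a subset that appears in
    several overlapping windows only needs to be stored once.
    """
    words = set()
    for start in window_starts(L):
        idx = range(start, start + WIDTH)
        for r in range(1, WIDTH + 1):
            words.update(combinations(idx, r))
    return len(words)


print(distinct_nonzero_words(4))  # 39 distinct words, versus 64 stored words
print(distinct_nonzero_words(8))  # 103 distinct words, versus 256 stored words
```

Under these assumptions, most of the words stored in neighbouring LUTs are duplicates, which is why a structure that exposes every word in the same cycle can store each one only once.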

At block level, the proposed structure is similar to the existing structure. However, the internal structures of the LUT-update block and of the processing element (PE) of the DA module differ from those of the existing design because of the shared LUTs used in the proposed design. The structure of the DA module of the proposed structure is shown in Fig. 2. Each PE of the DA module uses REG-LUTs, instead of the RAM-LUTs of the existing design, in order to exploit the LUT-sharing property. It requires only (16L − 25) registers instead of the 16PL RAM words required by the existing design. The LUT-update unit of the DA module computes a set of (16L − 25) values to update the LUTs of a PE in one cycle, against the 16 cycles required by the existing design.
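The 60% saving quoted in the abstract can be checked against the two word counts given for a PE. The sketch below takes the register count as 16L − 25 and the conventional count as 16PL, and assumes P = L/4, a relation that is not stated explicitly in this text but is consistent with its figures. The function names are illustrative.

```python
def conventional_words(L, width=4):
    """RAM words per PE in the conventional design: 16 * P * L.

    Assumption: P = L // width, i.e. one 16-word LUT per 4 inputs.
    """
    P = L // width
    return 16 * P * L


def shared_registers(L):
    """Register count per PE with shared LUTs: 16L - 25."""
    return 16 * L - 25


def saving(L):
    """Fraction of LUT content saved by sharing."""
    return 1 - shared_registers(L) / conventional_words(L)


for L in (4, 8, 16):
    print(L, conventional_words(L), shared_registers(L), round(saving(L), 3))
# L = 8 gives 103 registers against 256 words, a saving of about 0.598
```

The saving grows with L under these assumptions, which agrees with the statement that larger block sizes benefit more.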

Fig. 2. Structure of the DA module of the proposed DA-BLMS ADF of filter length N and block size L, where N = ML.

Advantages:
Reduces the LUT size.
Reduces the LUT-update complexity.

Software implementation:
ModelSim
Xilinx ISE