Design and Implementation of LUT Optimization DSP Techniques


D. Srinivasa Rao (1) & C. Amala (2)
(1) M.Tech Research Scholar, Priyadarshini Institute of Technology & Science, Chintalapudi
(2) Associate Professor, Priyadarshini Institute of Technology & Science, Chintalapudi

Abstract: Recently, we have proposed the antisymmetric product coding (APC) and odd-multiple-storage (OMS) techniques for lookup-table (LUT) design for memory-based multipliers to be used in digital signal processing applications. Each of these techniques reduces the LUT size by a factor of two. In this brief, we present a different form of APC and a modified OMS scheme, in order to combine them for efficient memory-based multiplication. The proposed combined approach reduces the LUT size to one-fourth of the conventional LUT. We have also suggested a simple technique for selective sign reversal to be used in the proposed design. It is shown that the proposed LUT design for small input sizes can be used for efficient implementation of high-precision multiplication by input operand decomposition. It is found that the proposed LUT-based multiplier involves comparable area and time complexity for a word size of 8 bits, but for higher word sizes, it involves significantly less area and less multiplication time than the canonical-signed-digit (CSD)-based multipliers. For 16- and 32-bit word sizes, respectively, it offers savings of more than 30% and 50% in area-delay product over the corresponding CSD multipliers.

Keywords: Antisymmetric product coding; odd-multiple-storage; canonical-signed-digit; lookup-table

1. INTRODUCTION
Digital signal processing algorithms typically require a large number of mathematical operations to be performed quickly and repetitively on a set of data. Signals are constantly converted from analog to digital, manipulated digitally, and then converted again to analog form, as diagrammed below.
Many DSP applications have constraints on latency; that is, for the system to work, the DSP operation must be completed within some fixed time, and deferred processing is not viable.

Fig 1. DSP Framework

To meet such constraints, memory-based computation plays a vital role in digital signal processing (DSP) applications.

FILTER DESIGNING:
The finite impulse response (FIR) digital filter is widely used as a basic tool in various signal processing and image processing applications. The order of an FIR filter primarily determines the width of the transition band: the higher the filter order, the sharper the transition between a pass-band and the adjacent stop-band. Many applications in digital communication (channel equalization, frequency channelization), speech processing (adaptive noise cancellation), seismic signal processing (noise elimination), and several other areas of signal processing require large-order FIR filters. Since the number of multiply-accumulate (MAC) operations required per filter output increases linearly with the filter order, real-time implementation of these filters of large orders is a challenging task. Several attempts have, therefore, been made to develop low-complexity dedicated VLSI systems for these filters.

As the scaling in silicon devices has progressed over the last four decades, semiconductor memory has become cheaper, faster and more power-efficient. According to the projections of the International Technology Roadmap for Semiconductors (ITRS), embedded memories will continue to have a dominating presence in the system-on-chip (SoC), which may exceed 90% of total SoC content. It has also been found that the transistor packing density of SRAM is not only high, but also increasing much faster than the transistor density of logic devices.

1.1 BINARY MULTIPLICATION:
Multiplication in binary is similar to its decimal counterpart. Two numbers A and B can be multiplied by partial products: for each digit in B, the product of that digit with A is calculated and written on a new line, shifted leftward so that its rightmost digit lines up with the digit in B that was used. The sum of all these partial products gives the final result.

1.2 FIR filter architecture:
The objectives of this work are:
1. Multiplying two binary numbers, a 5-bit input X[4:0] and a coefficient A.
2. Using the APC-OMS combined LUT design for the multiplication of a W-bit fixed coefficient A with a 5-bit input X.
3. Reducing the number of calculations and the memory required to perform the multiplication.
4. For 16- and 32-bit word sizes, respectively, offering savings of more than 30% and 50% in area-delay product over the corresponding CSD multipliers.
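The partial-product procedure of Section 1.1 can be illustrated in software. The following is a minimal Python sketch, assuming non-negative integer operands; the function name multiply_by_partial_products is chosen here for illustration and does not come from the paper.

```python
def multiply_by_partial_products(a: int, b: int) -> int:
    """Multiply non-negative integers by summing shifted partial products."""
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative operands only")
    result = 0
    shift = 0
    while b:
        if b & 1:
            # The current bit of B is 1, so the partial product is A,
            # shifted left to line up with that bit position.
            result += a << shift
        b >>= 1
        shift += 1
    return result


if __name__ == "__main__":
    # 1011 (11) times 1101 (13)
    print(multiply_by_partial_products(0b1011, 0b1101))
```

Each set bit of B costs one addition, which is the per-multiplication work that a memory-based multiplier replaces with a table lookup.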
Fig 2. FIR filter architecture

1.3 ANTI-SYMMETRIC PRODUCT CODING:
Antisymmetric product coding (APC) is a technique for LUT-based multiplication that reduces the size of the conventional LUT by 50%. It relies on the two's complement relationship between pairs of input words to halve the LUT. For simplicity of presentation, we assume both X and A to be positive integers. The product words for different values of X for L = 5 are shown in Table I. It may be observed in this table that the input word X on the first column of each row is the two's complement of that on the third column of the same row. In addition, the sum of product values corresponding to these two input values on the same row is 32A. Let the product values on the second and fourth columns of a row be u and v, respectively. Since one can write u = [(u + v)/2 - (v - u)/2] and v = [(u + v)/2 + (v - u)/2], for (u + v) = 32A, we can have

u = 16A - (v - u)/2 and v = 16A + (v - u)/2.

The product values on the second and fourth columns of Table I therefore have a negative mirror symmetry. This behavior of the product words can be used to reduce the LUT size, where, instead of storing u and v, only [(v - u)/2] is stored for a pair of inputs on a given row.

The 4-bit LUT addresses and corresponding coded words are listed on the fifth and sixth columns of the table, respectively. Since the representation of the product is derived from the antisymmetric behavior of the products, we can name it the antisymmetric product code. The 4-bit address X' = (x3'x2'x1'x0') of the APC word is given by

X' = XL,  if x4 = 1
X' = XL', if x4 = 0

where XL = (x3x2x1x0) is the four less significant bits of X, and XL' is the two's complement of XL.

TABLE I: APC WORDS FOR DIFFERENT INPUT VALUES FOR L = 5

Input X   Product   Input X   Product   Address x3'x2'x1'x0'   APC word
00001     A         11111     31A       1111                   15A
00010     2A        11110     30A       1110                   14A
00011     3A        11101     29A       1101                   13A
00100     4A        11100     28A       1100                   12A
00101     5A        11011     27A       1011                   11A
00110     6A        11010     26A       1010                   10A
00111     7A        11001     25A       1001                   9A
01000     8A        11000     24A       1000                   8A
01001     9A        10111     23A       0111                   7A
01010     10A       10110     22A       0110                   6A
01011     11A       10101     21A       0101                   5A
01100     12A       10100     20A       0100                   4A
01101     13A       10011     19A       0011                   3A
01110     14A       10010     18A       0010                   2A
01111     15A       10001     17A       0001                   A
10000     16A       10000     16A       0000                   0

For X = (00000), the encoded word to be stored is 16A.

The APC approach, although providing a reduction in LUT size by a factor of two, incorporates substantial overhead of area and time to perform the two's complement operation of the LUT output for sign modification and that of the input operand for input mapping. However, we find that when the APC approach is combined with the OMS technique, the two's complement operations can be very much simplified, since the input address and LUT output can always be transformed into odd integers. However, the OMS technique in [9] cannot be combined with the APC scheme in [10], since the APC words generated according to [10] are odd numbers. Moreover, the OMS scheme in [9] does not provide an efficient implementation when combined with the APC technique. In this brief, we therefore present a different form of APC and combine it with a modified form of the OMS scheme for efficient memory-based multiplication.
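The APC mapping of this section can be checked numerically. The sketch below is an illustrative Python model under stated assumptions, not the authors' hardware: the names apc_address, build_apc_lut and apc_multiply are hypothetical, and the input X = 00000 is handled with an explicit branch instead of the dedicated stored word described in the text.

```python
def apc_address(x: int) -> int:
    """Map a 5-bit input X to the 4-bit APC address X'."""
    x_low = x & 0xF                 # XL = (x3 x2 x1 x0)
    if (x >> 4) & 1:                # x4 = 1: X' = XL
        return x_low
    return (-x_low) & 0xF           # x4 = 0: X' = two's complement of XL


def build_apc_lut(a: int) -> list:
    """16-word APC LUT: address X' holds (v - u)/2, which equals X' * A."""
    return [addr * a for addr in range(16)]


def apc_multiply(a: int, x: int) -> int:
    """Compute A * X for a 5-bit X from the APC word and the 16A offset."""
    if not 0 <= x < 32:
        raise ValueError("x must be a 5-bit value")
    if x == 0:
        return 0                    # assumption: special case handled outside the LUT
    word = build_apc_lut(a)[apc_address(x)]
    # v = 16A + (v - u)/2 when x4 = 1, u = 16A - (v - u)/2 when x4 = 0
    return 16 * a + word if (x >> 4) & 1 else 16 * a - word


if __name__ == "__main__":
    a = 7
    print(all(apc_multiply(a, x) == a * x for x in range(32)))
```

The model stores 16 words for 32 possible inputs, which is the factor-of-two saving attributed to APC.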

Fig 3. Optimized implementation of the sign modification of the odd LUT output.

1.4 LUT-BASED MULTIPLICATION USING APC-OMS MODIFIED OPTIMIZATION TECHNIQUE
The APC approach, although providing a reduction in LUT size by a factor of two, incorporates substantial overhead of area and time to perform the two's complement operation of the LUT output for sign modification and that of the input operand for input mapping. However, we find that when the APC approach is combined with the OMS technique, the two's complement operations can be very much simplified, since the input address and LUT output can always be transformed into odd integers.

1.5 LUT COMBINED APC-OMS BASED MULTIPLICATION TECHNIQUE

TABLE II

Input X'   Product value   No. of shifts   Shifted input X''   Stored APC word   Address
0001       A               0               0001                P0 = A            0000
0010       2 x A           1               0001                P0 = A            0000
0100       4 x A           2               0001                P0 = A            0000
1000       8 x A           3               0001                P0 = A            0000
0011       3A              0               0011                P1 = 3A           0001
0110       2 x 3A          1               0011                P1 = 3A           0001
1100       4 x 3A          2               0011                P1 = 3A           0001
0101       5A              0               0101                P2 = 5A           0010
1010       2 x 5A          1               0101                P2 = 5A           0010
0111       7A              0               0111                P3 = 7A           0011
1110       2 x 7A          1               0111                P3 = 7A           0011
1001       9A              0               1001                P4 = 9A           0100
1011       11A             0               1011                P5 = 11A          0101
1101       13A             0               1101                P6 = 13A          0110
1111       15A             0               1111                P7 = 15A          0111

The proposed APC-OMS combined design of the LUT for L = 5 and for any coefficient width W is shown in Fig. 2.4. It consists of an LUT of nine words of (W + 4)-bit width, a four-to-nine-line address decoder, a barrel shifter, an address generation circuit, and a control circuit for generating the RESET signal and control word (s1s0) for the barrel shifter. The precomputed values of A x (2i + 1) are stored as Pi, for i = 0, 1, 2,..., 7, at the eight consecutive locations of the memory array, as specified in Table II, while 2A is stored for input X = (00000) at LUT address "1000," as specified in Table III. The decoder takes the 4-bit address from the address generator and generates nine word-select signals, i.e., {wi, for 0 <= i <= 8}, to select the referenced word from the LUT.
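The odd-multiple storage described above can be sketched in Python as follows. This is an illustrative model only: the helper names oms_address, build_oms_lut and apc_word are hypothetical, the shift count is obtained by stripping trailing zeros in software instead of through the control-circuit logic, and returning 0 for X' = 0000 is an assumption consistent with the last row of Table I.

```python
def oms_address(x_prime: int):
    """Return (LUT location i, number of shifts) for a nonzero 4-bit X'."""
    if not 1 <= x_prime <= 15:
        raise ValueError("x_prime must be a nonzero 4-bit value")
    shifts = 0
    while x_prime % 2 == 0:          # strip trailing zeros to reach the odd X''
        x_prime >>= 1
        shifts += 1
    return (x_prime - 1) // 2, shifts  # odd value 2i + 1 is stored at location i


def build_oms_lut(a: int) -> list:
    """Nine-word LUT: P0..P7 = A * (2i + 1), then 2A at address 1000."""
    lut = [a * (2 * i + 1) for i in range(8)]
    lut.append(2 * a)                # word used for input X = 00000
    return lut


def apc_word(a: int, x_prime: int) -> int:
    """Recover the APC word X' * A from the odd-multiple LUT and a left shift."""
    if x_prime == 0:
        return 0                     # assumption: output cleared for X' = 0000
    location, shifts = oms_address(x_prime)
    return build_oms_lut(a)[location] << shifts


if __name__ == "__main__":
    print(oms_address(0b1100))       # location of 3A and the shift that gives 12A
    print(all(apc_word(5, xp) == 5 * xp for xp in range(16)))
```

Because every nonzero 4-bit address is an odd number times a power of two, eight odd multiples and at most three shifts cover all fifteen nonzero APC words.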
The 4-to-9-line decoder is a simple modification of a 3-to-8-line decoder. The control bits s0 and s1, used by the barrel shifter to produce the desired number of shifts of the LUT output, are generated by the control circuit.

2. LUT OPTIMIZATION
2.1 Basic Components of LUT Optimization:
The modules contributing to the combined APC-OMS based LUT optimization technique are:
1. Xin generation module (based on the antisymmetric process)
2. Address generation module
3. 4-to-9 line decoder
4. 9*(W+4) LUT
   > line selector module
   > multiplier result module
   > resultant multiplier module
5. Barrel shifter
6. Add/Subtractor (sign determination) module

Xin generation module (based on the antisymmetric process): A 5-bit input is applied to this module. When the MSB of Xin, i.e. Xin(4), is 0, the module generates the two's complement of the lower 4 bits Xin(3 to 0); when the MSB of Xin is 1, it passes the lower 4 bits unchanged. Hence only 16 address combinations arise for the 5-bit input, as in Table I.

3. IMPLEMENTATION
A barrel shifter is often implemented as a cascade of parallel 2x1 multiplexers. For a 4-bit barrel shifter, an intermediate signal is used which shifts by two bits, or passes the same data, based on the value of S[1]. This signal is then shifted by another multiplexer, which is controlled by S[0]:

im  = IN,      if S[1] == 0
    = IN << 2, if S[1] == 1
OUT = im,      if S[0] == 0
    = im << 1, if S[0] == 1

In digital circuits, an adder-subtractor is a circuit that is capable of adding or subtracting numbers. In the circuit of Fig. 4, a control input D selects the operation. Subtraction works because when D = 1 the A input to the adder is really the bitwise complement of A and the carry in is 1. Adding B to the complement of A and 1 yields the desired subtraction B - A. The adder-subtractor could easily be extended to include more functions. For example, a 2-to-1 multiplexer could be introduced on each Bi that would switch between zero and Bi; this could be used (in conjunction with D = 1) to yield the two's complement of A, since -A = (complement of A) + 1.

Fig 4. 4-bit ripple-carry adder-subtractor

In this design the adder-subtractor is used to add the intermediate result to 16A to get the final output. Its output can be forced to 0 when clr is high. Since u = [(u + v)/2 - (v - u)/2] and v = [(u + v)/2 + (v - u)/2], with (u + v) = 32A, the sign is selected as follows: when Xin(4) = 1 the sign value is 1, and when Xin(4) = 0 the sign value is 0.

LUT APC-OMS Optimization Top Model
The APC approach, although providing a reduction in LUT size by a factor of two, incorporates substantial overhead of area and time to perform the two's complement operation of the LUT output for sign modification and that of the input operand for input mapping. The proposed APC-OMS combined design of the LUT for L = 5 and for any coefficient width W is shown in Fig. 2.4. It consists of an LUT of nine words of (W + 4)-bit width, a four-to-nine-line address decoder, a barrel shifter, an address generation circuit, and a control circuit for generating the RESET signal and control word (s1s0) for the barrel shifter. The precomputed values of A x (2i + 1) are stored as Pi, for i = 0, 1, 2,..., 7, at the eight consecutive locations of the memory array, as specified in Table II, while 2A is stored for input X = (00000) at LUT address "1000," as specified in Table III. The decoder takes the 4-bit address from the address generator and generates nine word-select signals, i.e., {wi, for 0 <= i <= 8}, to select the referenced word from the LUT.

Fig 5. LUT combined APC-OMS based multiplication technique

4. RTL SCHEMATIC:
Here we observe antisymmetry in the address formed from the 4 LSBs: the inputs 0 to 31 map onto the addresses 0 to 15. Thus we reduce the memory locations required to store coefficients by half. We then store only the odd multiples of the coefficient in the look-up table, which halves the number of stored words again. In total, the number of stored words is reduced to one quarter.

Fig 6. RTL Diagram
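The module chain described in Sections 2 and 3 (Xin generation, address generation, nine-word LUT, barrel shifter, and add/subtract stage) can be modelled end to end in software before simulation. The Python sketch below is a behavioral approximation under stated assumptions, not the RTL of this work: all function names are hypothetical, the shift amount is computed by counting trailing zeros, the LUT output is assumed to be cleared for X = 10000, and the adder-subtractor control D is assumed to be 1 (subtract) when Xin(4) = 0.

```python
def barrel_shift(word: int, s1: int, s0: int) -> int:
    """Two-stage left shifter: shift by 2 if s1 is set, then by 1 if s0 is set."""
    im = word << 2 if s1 else word
    return im << 1 if s0 else im


def add_sub(b: int, a: int, d: int, width: int) -> int:
    """Return b + a when d = 0 or b - a when d = 1, modulo 2**width."""
    mask = (1 << width) - 1
    a_in = (a ^ mask) if d else a        # invert every bit of A when D = 1
    return (b + a_in + d) & mask         # D also serves as the carry-in


def lut_multiply(a: int, x: int, w: int) -> int:
    """Multiply a W-bit coefficient A by a 5-bit input X via the nine-word LUT."""
    if not (0 <= x < 32 and 0 <= a < (1 << w)):
        raise ValueError("expected 0 <= x < 32 and 0 <= a < 2**w")
    lut = [a * (2 * i + 1) for i in range(8)] + [2 * a]   # P0..P7, then 2A
    x4, x_low = (x >> 4) & 1, x & 0xF
    x_prime = x_low if x4 else (-x_low) & 0xF             # Xin generation
    if x == 0:
        word = barrel_shift(lut[8], 1, 1)                 # 2A shifted by 3 = 16A
    elif x_prime == 0:
        word = 0                                          # X = 10000: output cleared
    else:
        shifts = 0
        while (x_prime >> shifts) & 1 == 0:               # count trailing zeros
            shifts += 1
        addr = ((x_prime >> shifts) - 1) // 2             # odd value 2i + 1 -> location i
        word = barrel_shift(lut[addr], shifts >> 1, shifts & 1)
    d = 0 if x4 else 1                                    # subtract from 16A when x4 = 0
    return add_sub(16 * a, word, d, w + 5)


if __name__ == "__main__":
    w = 8
    ok = all(lut_multiply(a, x, w) == a * x
             for a in range(1 << w) for x in range(32))
    print(ok)
```

The exhaustive loop in the usage example compares the model against ordinary multiplication for every 8-bit coefficient and every 5-bit input, using nine stored words per coefficient.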

5. SIMULATION RESULTS:

Fig. 7: Simulation Results of LUT of 6 bit

6. CONCLUSION:
This paper deals with the design of an LUT prototype that can be applied to DSP filter techniques and operations. The design was implemented on a Spartan-3E device, where its overall power is about 36 mW.

REFERENCES
[1] International Technology Roadmap for Semiconductors. [Online]. Available: http://public.itrs.net/
[2] J.-I. Guo, C.-M. Liu, and C.-W. Jen, "The efficient memory-based VLSI array design for DFT and DCT," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 10, pp. 723-733, Oct. 1992.
[3] H.-R. Lee, C.-W. Jen, and C.-M. Liu, "On the design automation of the memory-based VLSI architectures for FIR filters," IEEE Trans. Consum. Electron., vol. 39, no. 3, pp. 619-629, Aug. 1993.
[4] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, "A systolic array architecture for the discrete sine transform," IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2347-2354, Sep. 2002.
[5] H.-C. Chen, J.-I. Guo, T.-S. Chang, and C.-W. Jen, "A memory-efficient realization of cyclic convolution and its application to discrete cosine transform," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 3, pp. 445-453, Mar. 2005.
[6] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, "Systolic algorithms and a memory-based design approach for a unified architecture for the computation of DCT/DST/IDCT/IDST," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 6, pp. 1125-1137, Jun. 2005.
[7] P. K. Meher, "Systolic designs for DCT using a low-complexity concurrent convolutional formulation," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 9, pp. 1041-1050, Sep. 2006.
[8] P. K. Meher, "Memory-based hardware for resource-constrained digital signal processing systems," in Proc. 6th Int. Conf. ICICS, Dec. 2007, pp. 1-4.
[9] P. K. Meher, "New approach to LUT implementation and accumulation for memory-based multiplication," in Proc. IEEE ISCAS, May 2009, pp. 453-456.
[10] P. K. Meher, "New look-up-table optimizations for memory-based multiplication," in Proc. ISIC, Dec. 2009, pp. 663-666.