Design of Memory Based Implementation Using LUT Multiplier

Charan Kumar K. 1, S. Vikrama Narasimha Reddy 2, Neelima Koppala 3
1,2 M.Tech (VLSI) Student, 3 Assistant Professor, ECE Department, Sree Vidyanikethan Engineering College (Autonomous), A. Rangampet, Tirupati.

Abstract - Multiplication is a major arithmetic operation in signal processing. In ALUs, the multiplier can use a lookup table (LUT) as memory for its computations. We did not find any significant work on LUT optimization for memory-based multiplication. In this project, antisymmetric product coding (APC) and odd-multiple storage (OMS) are used in the LUT design of memory-based multipliers for signal processing applications such as filter design. Each of these techniques reduces the LUT size by a factor of two. A different form of APC and a modified OMS scheme can be combined for efficient memory implementation, which reduces the LUT size to one-fourth of the conventional LUT. Owing to operand decomposition, the proposed LUT-based multiplier involves a smaller area-delay product at higher word sizes than canonical-signed-digit (CSD)-based multipliers. The design is coded in Verilog HDL, synthesized using Xilinx ISE 10.1i, and implemented on a Spartan-3E FPGA.

Key words - digital signal processing (DSP) chip, lookup-table (LUT)-based computing, memory-based computing, very large scale integration (VLSI).

I. INTRODUCTION

Due to the rapid development of technology, semiconductor devices are now prominent in every field. These devices operate very fast, consume less power, occupy less area, and have become more efficient with respect to factors such as reliability, flexibility, and scaling; as a result, they have improved significantly and become cheaper. Embedded memories have a dominating presence in SoCs, exceeding 90% of the total SoC [2]. Compared with logic components, semiconductor memories have a high transistor packing density, which is increasing at a fast rate [1]. In addition, memory-based computing structures offer advantages over multiply-accumulate structures, such as greater potential for high-throughput and low-latency implementation and lower dynamic power consumption. Memory-based computing is well suited to many digital signal processing (DSP) algorithms that involve multiplication with a fixed set of coefficients. Fig. 1 shows the block diagram of the conventional LUT-based multiplier.

ISSN: 2231-5381 http://www.internationaljournalssrg.org Page 46

Fig. 1: Conventional LUT-based multiplier.

Here X is the input, which serves as the LUT address, and A is the fixed coefficient that multiplies X; the resulting product is taken as the output. If X is a positive binary number of word length L, there are $2^L$ possible values of X and, correspondingly, $2^L$ possible values of the product $C = A \cdot X$. For memory-based multiplication, a conventional LUT of $2^L$ words holds the precomputed product values for all possible values of X. For an L-bit binary input address $X_i$, the LUT outputs the corresponding product $A \cdot X_i$; that is, $A \cdot X_i$ is stored at location $X_i$ for $0 \le X_i \le 2^L - 1$.

Several architectures have been reported for the memory-based implementation of DSP algorithms involving orthogonal transforms and digital filters [5]-[12], but they include no significant work on LUT optimization. Recently, a new approach to LUT optimization was introduced in which only the odd multiples of the fixed coefficient are stored; it is termed the odd-multiple-storage (OMS) scheme [3]. The LUT size can also be halved by another approach, the antisymmetric product coding (APC) scheme, in which the product words are recoded as antisymmetric pairs [4]. Although the APC approach reduces the LUT size by a factor of two, it costs additional time and area for the two's complement operations needed for sign modification of the LUT output and for the corresponding input. We find that, by combining the APC and OMS techniques, the two's complement operations can be simplified, because the input address and the LUT output can always be transformed into odd integers. The OMS scheme of [3], however, does not give an efficient implementation when combined directly with the APC approach, since the APC of [4] operates on odd multiples only. Therefore, for efficient memory-based multiplication, a modified form of the OMS scheme is combined with a different form of APC. The APC technique and its combination with the modified OMS scheme are discussed in Section II, and the implementation of the LUT-based multiplier using the combined scheme is described in Section III. The synthesis results and the conclusion are presented in the final sections.

II. LUT OPTIMIZATIONS FOR MEMORY-BASED MULTIPLICATION

This section describes the APC technique and its optimization by combining it with a modified form of OMS.

A. APC for LUT Optimization

For simplicity, both X and A are assumed to be positive integers. Table I shows the product values for the different values of the input X for L = 5.
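The rows of Table I can be regenerated with a few lines of Python, which may help when checking the pairing that the APC scheme relies on. This is only an illustrative sketch: the coefficient A = 3 and the helper name `table_rows` are assumptions made here, not values from the paper. Each row pairs an input X with its 5-bit two's complement 32 - X, lists both conventional LUT products, and gives the 4-bit address and the single word (v - u)/2 kept for the pair.

```python
A = 3   # illustrative fixed coefficient (assumption; any positive integer works)
L = 5   # input word length used in Table I

def table_rows(a=A, l=L):
    """Yield (X, u, X_complement, v, address, apc_word) for X = 1 .. 2**(l-1)."""
    half = 1 << (l - 1)               # 16 for L = 5
    for x in range(1, half + 1):
        xc = (1 << l) - x             # two's complement of X in l bits
        u, v = a * x, a * xc          # conventional LUT entries at X and at its complement
        address = xc & (half - 1)     # four less significant bits of the larger input
        apc_word = (v - u) // 2       # the single word kept for the pair
        yield x, u, xc, v, address, apc_word

for x, u, xc, v, address, apc_word in table_rows():
    assert u + v == 32 * A            # products on one row always sum to 32A
    print(f"{x:05b} {u:4d}   {xc:05b} {v:4d}   {address:04b} {apc_word:4d}")
```

With A left symbolic, the last column reads 15A, 14A, ..., A, 0 from the first row to the last.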

Table I: APC words for different input values for L = 5.

Table I shows that the input word X in the third column of each row is the two's complement of the input word X in the first column of the same row. In addition, the product values of the two inputs on the same row sum to 32A. Let u and v be the product values in the second and fourth columns of a row, respectively. Since one can write $u = \frac{u+v}{2} - \frac{v-u}{2}$ and $v = \frac{u+v}{2} + \frac{v-u}{2}$, for $u + v = 32A$ we have

$$u = 16A - \frac{v-u}{2}, \qquad v = 16A + \frac{v-u}{2}. \tag{1}$$

The product values in the second and fourth columns of Table I therefore have negative mirror symmetry. This symmetry is used to reduce the LUT size: instead of storing u and v, only $(v-u)/2$ is stored for the pair of inputs on a given row. The fifth and sixth columns of the table show the 4-bit LUT addresses and the corresponding coded words, respectively. Because this representation is derived from the antisymmetric behavior of the products, it is termed the antisymmetric product code. The 4-bit address $X' = (x'_3 x'_2 x'_1 x'_0)$ of the APC word is given by

$$X' = \begin{cases} X_L, & \text{if } x_4 = 1 \\ X'_L, & \text{if } x_4 = 0 \end{cases} \tag{2}$$

where $X_L = (x_3 x_2 x_1 x_0)$ is the four less significant bits of X and $X'_L$ is the two's complement of $X_L$. The required product is obtained by adding the stored value $(v-u)/2$ to, or subtracting it from, the fixed value 16A when $x_4$ is 1 or 0, respectively, i.e.,

$$\text{product word} = 16A + (\text{sign value}) \times (\text{APC word}) \tag{3}$$

where sign value = 1 for $x_4 = 1$ and sign value = -1 for $x_4 = 0$. The product for X = (10000) corresponds to the APC value zero, which can be produced by resetting the LUT output instead of storing it in the LUT. For X = (00000), the product is 0 = 16A - 16A, so the encoded word to be stored is 16A.

B. Modified OMS for LUT Optimization

As the name indicates, the OMS scheme stores only the odd multiples of the fixed coefficient.
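The OMS idea rests on a simple arithmetic fact: any nonzero multiple kA equals an odd multiple of A shifted left by some number of bit positions. The short Python sketch below makes that decomposition explicit and counts how many distinct odd multipliers an L-bit input needs. The helper name `split_odd_shift` is hypothetical and used only for illustration.

```python
def split_odd_shift(k):
    """Split a positive integer k into (odd, shift) with k == odd << shift."""
    if k <= 0:
        raise ValueError("k must be positive")
    shift = (k & -k).bit_length() - 1   # number of trailing zero bits of k
    return k >> shift, shift

L = 5
# Odd parts of every nonzero L-bit input: these are the multiples that must be stored.
odd_parts = {split_odd_shift(k)[0] for k in range(1, 1 << L)}
print(sorted(odd_parts))
print(len(odd_parts))   # 2**L / 2 distinct odd multipliers
```

For example, 12A is recovered as 3A shifted left by two positions, so only 3A needs to be stored.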
For the multiplication of a binary word X of size L with a fixed coefficient A, instead of storing all $2^L$ possible product values, the LUT stores only $2^L/2$ words corresponding to the odd multiples of A, since every even multiple of A can be derived from an odd multiple by left-shift operations. On this basis, the LUT for the multiplication of an L-bit input with a W-bit coefficient can be designed by the following strategy.

1) A memory unit of $[(2^L/2) + 1]$ words of (W + L)-bit width is used to store the product values, where the first $2^L/2$ words are odd multiples of A and the last word is zero.
2) A barrel shifter producing a maximum of (L - 1) left shifts is used to derive all the even multiples of A.
3) The L-bit input word is mapped to the (L - 1)-bit address of the LUT by an address encoder, and the control bits for the barrel shifter are derived by a control circuit.

Table II: OMS-based design of the LUT of APC words for L = 5.

Table II shows that the eight odd multiples $A \times (2i + 1)$ are stored as $P_i$, for i = 0, 1, ..., 7, in eight memory locations. The even multiples 2A, 4A, and 8A are derived by left-shift operations of A. Similarly, 6A and 12A are derived by left shifting 3A, while 10A and 14A are derived by left shifting 5A and 7A, respectively. All the even multiples of A are thus derived by a barrel shifter that produces a maximum of three left shifts.

According to (3), the word to be stored for X = (00000) is not 0 but 16A, which can be obtained from A by four left shifts using a barrel shifter. If 16A is not derived from A, however, a maximum of only three left shifts is required to obtain all the other even multiples of A. A maximum of three shifts can be implemented by a two-stage logarithmic barrel shifter, whereas four shifts would require a three-stage barrel shifter. For input X = (00000), it is therefore more efficient in the modified OMS scheme to store 2A, so that the product 16A can be obtained by three arithmetic left shifts.

Table III: Products and encoded words for X = (00000) and (10000).

Table III shows the product values and encoded words for the input words X = (00000) and (10000). For X = (00000), the required encoded word 16A is obtained by a 3-bit left shift of 2A [stored at address (1000)]. For X = (10000), the APC word 0 is derived by resetting the LUT output with an active-high RESET signal given by

$$\text{RESET} = \overline{(x_0 + x_1 + x_2 + x_3)} \cdot x_4. \tag{4}$$
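A behavioral sketch of the modified OMS storage for L = 5 may make Tables II and III easier to follow. It assumes A = 3 purely for illustration, and the names `memory` and `oms_lookup` are hypothetical. The nine-word memory holds $P_0$ to $P_7$ plus the extra word 2A at address (1000); the loop prints, for every 4-bit APC address, the LUT address and shift count that recover the required APC word.

```python
A = 3  # illustrative coefficient (assumption)

# Nine stored words: P_i = A*(2i+1) at addresses 0..7, and 2A at address 8 (1000).
memory = [A * (2 * i + 1) for i in range(8)] + [2 * A]

def oms_lookup(x_apc):
    """Map a 4-bit APC address to (LUT address, left shifts) under modified OMS."""
    if x_apc == 0:
        return 8, 3                        # 16A is produced as 2A << 3
    shifts = (x_apc & -x_apc).bit_length() - 1
    odd = x_apc >> shifts                  # odd part, equal to 2i + 1
    return odd >> 1, shifts                # address i selects P_i

for x_apc in range(16):
    address, shifts = oms_lookup(x_apc)
    word = memory[address] << shifts
    # APC address 0000 arises for X = (00000), which needs 16A;
    # the X = (10000) case is handled separately by RESET.
    expected = 16 * A if x_apc == 0 else A * x_apc
    assert word == expected
    print(f"{x_apc:04b} -> address {address:04b}, shifts {shifts}, word {word}")
```

No entry needs more than three shifts, which is what allows the two-stage barrel shifter described above.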

From Tables II and III, the 5-bit input word X can be mapped into a 4-bit LUT address $(d_3 d_2 d_1 d_0)$ by the simple set of mapping relations

$$d_i = x''_{i+1} \ \text{for } i = 0, 1, 2, \quad \text{and} \quad d_3 = \overline{x''_0} \tag{5}$$

where $X'' = (x''_3 x''_2 x''_1 x''_0)$ is generated by shifting out all the trailing (least significant) zeros of the APC address $X'$ of (2) by a right shift, followed by address mapping, i.e.,

$$X'' = \begin{cases} Y_L, & \text{if } x_4 = 1 \\ Y'_L, & \text{if } x_4 = 0 \end{cases} \tag{6}$$

where $Y_L$ and $Y'_L$ are derived by circularly shifting out all the trailing zeros of $X_L$ and $X'_L$, respectively.

III. IMPLEMENTATION OF THE LUT-BASED MULTIPLIER USING THE PROPOSED LUT OPTIMIZATION SCHEME

This section deals with the implementation of the LUT-based multiplier using the proposed scheme, where the LUT is optimized by a combination of the APC scheme and a modified OMS technique.

A. Implementation of the LUT Multiplier Using APC for L = 5

Fig. 2 shows the structure and function of the LUT-based multiplier for L = 5 using the APC technique. It consists of a four-input LUT of 16 words that stores the APC values of the product words given in the sixth column of Table I, except on the last row, where 2A is stored for input X = (00000) instead of 0 for input X = (10000). It also contains an address-mapping circuit and an add/subtract circuit. The address-mapping circuit generates the desired address $(x'_3 x'_2 x'_1 x'_0)$ according to (2). A straightforward implementation of the address mapping multiplexes $X_L$ and $X'_L$ using $x_4$ as the control bit. The address-mapping circuit can, however, be optimized to three XOR gates, three AND gates, two OR gates, and a NOT gate, as shown in Fig. 2. The RESET signal can be generated by a control circuit (not shown in the figure) according to (4). The add/subtract cell adds the LUT output to 16A, or subtracts it from 16A, for $x_4$ = 1 or 0, respectively, according to (3); hence $x_4$ is used as the control for the add/subtract cell.

Fig. 2: LUT-based multiplier using the APC technique for L = 5.

B. Implementation of the Optimized LUT Using Modified OMS

Fig. 3: APC-OMS combined LUT design for the multiplication of a W-bit fixed coefficient.
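The combined scheme summarized by relations (2) to (6) can be exercised end to end with a small behavioral model. The Python sketch below is not the authors' Verilog design; the function name `apc_oms_multiply` is hypothetical, the shift count is computed directly as the number of trailing zeros rather than by gate-level control logic, and the circular shift of (6) is modeled as a plain right shift, which is an assumption that gives the same address bits here.

```python
def apc_oms_multiply(x, a):
    """Behavioral model of the combined APC-OMS multiplier for a 5-bit input x."""
    if not 0 <= x < 32:
        raise ValueError("x must fit in 5 bits")
    # Nine stored words: P_0..P_7 = a*(2i+1), plus 2a at address (1000).
    lut = [a * (2 * i + 1) for i in range(8)] + [2 * a]

    x4 = (x >> 4) & 1
    x_low = x & 0xF
    # Eq. (2): the APC address is X_L when x4 = 1, else its 4-bit two's complement.
    x_apc = x_low if x4 else (-x_low) & 0xF
    # Eq. (4): RESET is high only for X = (10000).
    reset = x4 == 1 and x_low == 0

    # Shift count: trailing zeros of the APC address (three when it is all zeros).
    shifts = 3 if x_apc == 0 else (x_apc & -x_apc).bit_length() - 1
    x_norm = x_apc >> shifts                      # X'' of Eq. (6), modeled as a right shift
    # Eq. (5): d2 d1 d0 = x''3 x''2 x''1, and d3 = NOT x''0.
    address = (x_norm >> 1) | ((~x_norm & 1) << 3)

    apc_word = 0 if reset else lut[address] << shifts
    # Eq. (3): add to or subtract from 16a depending on x4.
    return 16 * a + apc_word if x4 else 16 * a - apc_word

# Exhaustive check against ordinary multiplication for a few sample coefficients.
for a in (1, 3, 11):
    assert all(apc_oms_multiply(x, a) == a * x for x in range(32))
```

The exhaustive loop covers all 32 inputs, including the two special cases X = (00000) and X = (10000) of Table III.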

Fig. 3 shows the proposed APC-OMS combined design of the LUT for L = 5 and any coefficient width W. It consists of an LUT of nine words of (W + 4)-bit width, a four-to-nine-line address decoder, a barrel shifter, an address-generation circuit, and a control circuit that generates the RESET signal and the control word $(s_1 s_0)$ for the barrel shifter.

The precomputed values of $A \times (2i + 1)$ are stored as $P_i$, for i = 0, 1, 2, ..., 7, in eight consecutive locations of the memory array, as specified in Table II, while 2A is stored for input X = (00000) at LUT address (1000), as specified in Table III. The decoder takes the 4-bit address and generates nine word-select lines to select the required word from the LUT. The 4-to-9-line decoder is obtained by a simple modification of a 3-to-8-line decoder, as shown in Fig. 4(a). The control bits $s_0$ and $s_1$ that make the barrel shifter produce the desired number of shifts are generated according to the relations

$$s_0 = \overline{x'_0 + \overline{(x'_1 + \overline{x'_2})}} \tag{7a}$$

$$s_1 = \overline{(x'_0 + x'_1)} \tag{7b}$$

where $x'_0$, $x'_1$, and $x'_2$ are bits of the APC address $X'$ of (2). As noted in Tables II and III, $(s_1 s_0)$ is the 2-bit binary equivalent of the required number of shifts. The RESET signal of (4) can alternatively be generated as ($d_3$ AND $x_4$). The generation of the control signals and the RESET signal is shown in Fig. 4(b). The address-generator circuit receives the 5-bit input operand X and maps it onto the 4-bit address word $(d_3 d_2 d_1 d_0)$ according to (5) and (6).

Fig. 4: (a) Four-to-nine-line decoder. (b) Control circuit.

IV. RESULTS AND DISCUSSION

Table IV

Comparison factor             Word size of the LUT multiplier
                              4-bit    5-bit    6-bit
No. of 4-input LUTs (9312)    15       10       14
No. of slices (4656)          8        6        8
No. of IOs                    46       60       67
No. of bonded IOs (232)       19       30       56
Delay (ns)                    7.376    6.736    6.736

Fig. 5: Simulated results for L = 4.

In Fig. 5, the input X = 4'h0 is applied and the output response is q = 8'h03.

Fig. 6: Simulated results for L = 5.

In Fig. 6, the input X = 5'h00 is applied and the output response is q = 9'h003.

Fig. 7: Simulated results for L = 6.

In Fig. 7, the input X = 6'h00 is applied and the output response is q = 10'h003.

As Table IV shows, when the word size of the LUT multiplier increases from L = 4 to L = 5, the delay falls from 7.376 ns to 6.736 ns, and for L = 6 the delay is unchanged with respect to L = 5, with optimum utilization of memory. The LUT multipliers for L = W = 4, 5, and 6 bits are coded in Verilog HDL and synthesized in the Xilinx ISE 10.1i environment for a Spartan-3E FPGA (device XC3S500E, FG320 package, speed grade -5).

V. CONCLUSION

The LUTs are implemented as arrays of constants for efficient utilization of the area-delay product. The area and delay complexities of the multipliers estimated from the synthesis results are listed in Table IV. The proposed LUT design involves comparable area and time complexities for a word size of 4 bits, but for higher word sizes it has comparatively less delay. In this brief, we have shown the possibility of using LUT-based multipliers for constant-coefficient operations such as multiplication, especially in DSP applications. Future work is the implementation of the derived OMS-APC-based LUTs for higher input sizes, with different forms of decomposition, for a suitable area-delay product.

REFERENCES

[1] P. K. Meher, "LUT optimization for memory-based computation," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 57, no. 4, Apr. 2010.
[2] International Technology Roadmap for Semiconductors. [Online]. Available: http://public.itrs.net/
[3] P. K. Meher, "New approach to LUT implementation and accumulation for memory-based multiplication," in Proc. IEEE ISCAS, May 2009, pp. 453-456.
[4] P. K. Meher, "New look-up-table optimizations for memory-based multiplication," in Proc. Int. Symp. Integr. Circuits (ISIC 09), Dec. 2009, to be published.
[5] P. K. Meher, "Memory-based hardware for resource-constrained digital signal processing systems," in Proc. 6th Int. Conf. ICICS, Dec. 2007, pp. 1-4.
[6] P. K. Meher, "Systolic designs for DCT using a low-complexity concurrent convolutional formulation," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 9, pp. 1041-1050, Sep. 2006.
[7] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, "Systolic algorithms and a memory-based design approach for a unified architecture for the computation of DCT/DST/IDCT/IDST," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 6, pp. 1125-1137, Jun. 2005.
[8] H.-C. Chen, J.-I. Guo, T.-S. Chang, and C.-W. Jen, "A memory-efficient realization of cyclic convolution and its application to discrete cosine transform," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 3, pp. 445-453, Mar. 2005.
[9] A. K. Sharma, Advanced Semiconductor Memories: Architectures, Designs, and Applications. Piscataway, NJ: IEEE Press, 2003.
[10] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, "A systolic array architecture for the discrete sine transform," IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2347-2354, Sep. 2002.
[11] H.-R. Lee, C.-W. Jen, and C.-M. Liu, "On the design automation of the memory-based VLSI architectures for FIR filters," IEEE Trans. Consum. Electron., vol. 39, no. 3, pp. 619-629, Aug. 1993.
[12] J.-I. Guo, C.-M. Liu, and C.-W. Jen, "The efficient memory-based VLSI array design for DFT and DCT," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 10, pp. 723-733, Oct. 1992.