LUT Optimization for Memory Based Computation using Modified OMS Technique

Indrajit Shankar Acharya & Ruhan Bevi
Dept. of ECE, SRM University, Chennai, India
E-mail: indrajitac123@gmail.com, ruhanmady@yahoo.co.in

Abstract - The continual need for high performance in compute-bound scientific applications motivates the study of LUT optimization. LUT optimization replaces a complex expression with an access to pre-computed LUT data containing the expression values, which results in faster expression evaluation and a high performance gain. This manuscript describes a comprehensive methodology for LUT optimization and shows that LUT methods can improve the performance of scientific applications. Two main techniques for achieving LUT optimization are discussed: anti-symmetric product coding (APC) and odd multiple storage (OMS). Each of these techniques reduces the LUT size by a factor of two. A modified OMS combined LUT design for multiplication is proposed for both signed and unsigned input values, and how this combined approach affects efficiency is discussed. Such models of memory based computing are well suited for many digital signal processing applications. The proposed LUT multiplier is coded in VHDL and synthesized in Xilinx ISE version 8.2i.

Keywords - Memory based computing, Look Up Table (LUT), LUT Optimization, anti-symmetric product coding (APC), Odd Multiple Storage (OMS), FPGA.

I. INTRODUCTION

Efforts to improve scientific computing performance have been ongoing since the early days of computing. The LUT approach can result in significant improvement in terms of low-latency implementation and lower dynamic power consumption. Memory based computing structures are more regular than multiply-accumulate (MAC) structures.

A conventional lookup-table (LUT) based multiplier is shown in Fig. 1. Here, A is a fixed coefficient, and X is an input word to be multiplied with A. Let us suppose that X is a positive binary number of word length L. Thus, there can be $2^L$ possible values of X, and $2^L$ possible values of the product $C = A \cdot X$. Therefore, for conventional memory based multiplication, an LUT of $2^L$ words, consisting of the pre-computed product values corresponding to all possible values of X, is used.

Fig. 1: Conventional LUT-based multiplier

The product word $A \cdot X_i$ is stored at the address location $X_i$, for $0 \le X_i \le 2^L - 1$, such that if the L-bit binary value of $X_i$ is used as the address for the LUT, then the corresponding product value $A \cdot X_i$ is available as its output.

Look-up tables have been the basis of FPGA blocks. An FPGA is a stand-alone programmable logic device that allows rapid implementation of complex logic systems. The FPGA utilizes look-up tables (LUTs) to implement multi-level functions. In modern FPGAs, large LUTs are decomposed into smaller LUTs. Many architectures have been suggested in the literature for memory based implementation of DSP algorithms, but comparatively little work on LUT optimization has been reported.

In this paper, a new approach to LUT design for memory based multiplication is presented, which can potentially reduce the memory size by half for small input widths. The main objective is to propose a new scheme for optimization of the LUT with lower area and time overhead. The anti-symmetric (negative mirror symmetry) product coding, i.e. the APC technique, reduces the LUT size by a factor of two, but it is accompanied by a substantial overhead of area and time to perform the 2's complement operation on the LUT output for sign modification and on the input operand for input mapping. The proposed approach makes these operations much simpler. It basically consists of only three modules, namely an address decoder, a barrel shifter and an LUT component comprising only odd multiples of a fixed coefficient. The LUT component module facilitates multiplication of input values with a fixed coefficient, wherein the LUT holds pre-computed product values from which the product for every possible value of the input multiplicand can be obtained. The barrel shifter component is mainly related to the OMS technique and is used to derive all the even multiples of the fixed coefficient A by left-shift operations.

TABLE I: OMS-BASED DESIGN OF THE LUT OF APC WORDS

II. MODULES USED IN THE PROPOSED DESIGN FOR LUT BASED MULTIPLIER

A. Address Generator and Control Circuit

The address generation and control circuit is used to produce the address $d_3 d_2 d_1 d_0$, which is given as the input to the LUT component. The address generation circuit is generally used in conjunction with the control circuit, which produces the control signals $s_0$ and $s_1$. The control signals are used in the subsequent blocks, as can be seen from Fig. 3. For the multiplication of any binary word of size L with a fixed coefficient A, instead of storing all the $2^L$ possible values of $C = A \cdot X$, only $2^L/2$ words corresponding to the odd multiples of A may be stored in the LUT, while all the even multiples of A can be derived by left-shift operations on one of those odd multiples. This is achieved with one of the modules, the barrel shifter.

B. Barrel Shifter Module

In Table 1, the eight odd multiples $A \cdot (2i + 1)$ are stored at eight memory locations. The even multiples 2A, 4A and 8A are derived by left-shift operations on A. 6A and 12A are derived by left-shifting 3A. 10A and 14A are derived by left-shifting 5A and 7A, respectively. Left shifts of up to three positions, produced by a barrel shifter, are sufficient to derive all the even multiples of A.
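The derivation of even multiples from stored odd multiples can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the authors' VHDL: the function names `odd_part_and_shift` and `derivation_table` are hypothetical, and the input width is assumed to be L = 4.

```python
def odd_part_and_shift(k):
    """Split a positive integer k into (odd, shift) such that k == odd << shift."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    shift = 0
    while k % 2 == 0:
        k //= 2
        shift += 1
    return k, shift


def derivation_table(word_length=4):
    """For every even multiple k*A with 0 < k < 2**word_length, return the
    stored odd multiple and the left-shift count that reproduce it."""
    return {k: odd_part_and_shift(k) for k in range(2, 2 ** word_length, 2)}
```

Calling `derivation_table()` maps, for example, 12 to the pair (3, 2), meaning that 12A is obtained by shifting the stored word 3A left by two positions. No entry needs more than three shifts for a 4-bit input.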
As is evident from the table, instead of storing all the $2^L$ possible values of $C = A \cdot X$, only $2^L/2$ words corresponding to the odd multiples of A may be stored in the LUT, while all the even multiples of A can be derived by left-shift operations on one of those odd multiples. In the case of unsigned input, i.e. 4-bit input, only eight such unsigned (i.e. MSB is 0) 8-bit odd multiple values need to be stored. In the case of signed input, i.e. 5-bit input, sixteen such 8-bit odd multiple values are to be stored in total. Of the sixteen, eight product values are the same as the ones stored for unsigned input, and the remaining eight are signed (i.e. MSB is 1) LUT values. This module, which has clock, reset and read signals, gives the correct output corresponding to the address generated by the address decoder.

The remaining contents of the paper are organized as follows. In Section 2, the modules used in the LUT design for memory based multiplication are discussed. Section 3 gives the optimized LUT design for signed and unsigned operands using the OMS technique. The synthesis of the LUT-based multiplier using the optimization scheme, as programmed in Xilinx ISE, is described in Section 4. Conclusions are presented in Section 5.

Fig. 2: The address-generation and control circuit used in conjunction with the barrel shifter module.

As shown in Fig. 2, the address generation circuit receives a 4-bit (or 5-bit) input operand X and maps it, depending on whether the operation is unsigned or signed respectively, to a 4-bit output in the $d_3 d_2 d_1 d_0$ format. This output is accompanied by the selection lines $s_1$ and $s_0$. Note that $s_1 s_0$ is the 2-bit binary equivalent of the required number of shifts in Table 1.

C. LUT Component Module

The LUT component for multiplication of a 4-bit unsigned input consists of a set of eight odd multiple values of a fixed coefficient, say 4, i.e. 4, 12, 20, 28, and so on.
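For the unsigned 4-bit case with the example coefficient 4, the address and control generation can be modelled in a few lines. This is a hedged sketch rather than the circuit of Fig. 2: the address encoding (odd part minus one, divided by two) and the separate handling of a zero input are assumptions made here for illustration, and the names `ODD_LUT`, `address_and_control` and `lut_multiply` are hypothetical.

```python
A = 4  # example fixed coefficient taken from the text
L = 4  # unsigned input word length

# Eight stored odd multiples A*(2i + 1), i = 0..7: 4, 12, 20, ..., 60
ODD_LUT = [A * (2 * i + 1) for i in range(2 ** (L - 1))]


def address_and_control(x):
    """Return (address, s1, s0) for a non-zero 4-bit unsigned input x.

    Assumption: the address is the index of the odd part of x in ODD_LUT,
    and s1 s0 is the 2-bit binary shift count.
    """
    if not 1 <= x < 2 ** L:
        raise ValueError("x must be in 1..15")
    shift = 0
    while x & 1 == 0:
        x >>= 1
        shift += 1
    address = (x - 1) >> 1
    return address, (shift >> 1) & 1, shift & 1


def lut_multiply(x):
    """Compute A*x from the odd-multiple LUT followed by a left shift."""
    if x == 0:
        return 0  # assumption: a zero input is handled separately, e.g. by a reset
    address, s1, s0 = address_and_control(x)
    return ODD_LUT[address] << (2 * s1 + s0)
```

For instance, the input 12 yields address 1 and control bits $s_1 s_0$ = 10, so the stored word 12 is shifted left twice to give 48. Only eight words are stored instead of the sixteen needed by a conventional LUT.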
Also, for 5-bit signed input, the LUT component has the above values as well as another set of stored odd-multiple values, 196, 204, and so on up to 252, which are the 8-bit two's-complement forms of the negative odd multiples. As a result, the LUT size is considerably reduced. The LUT component module can thus replace the Anti-symmetric Product Coding (APC) technique. The proposed LUT design and its block diagram are discussed in the next section.

III. OPTIMIZED LUT DESIGN FOR SIGNED AND UNSIGNED OPERANDS

The proposed block diagram consisting of all the modules is shown in Fig. 3. As discussed earlier, for word length L = 5 and any coefficient width W, the design consists of an LUT of nine words of (W + 4)-bit width, an address generator, an LUT component, a barrel shifter and a control circuit for generating the RESET signal and the control word $s_1 s_0$ for the barrel shifter.

Fig. 3: Combined LUT design for the multiplication of a W-bit fixed coefficient A with a 5-bit input X.

The pre-computed values of $A \cdot (2i + 1)$ are stored at eight consecutive locations of the memory array, as specified in Table II, while 2A is stored for input X = (00000) at LUT address 1000, as specified in Table III. The decoder takes the 4-bit address from the address generator and generates nine word-select signals, i.e. $\{w_i, \; 0 \le i \le 8\}$, to select the reference word from the LUT. The control bits $s_0$ and $s_1$, used by the barrel shifter to produce the desired number of left shifts of the LUT output, are generated by the control circuit. Note again that, except for the last word, all words in the LUT are odd multiples of A. The fixed coefficient can be even or odd. Since there is no need to add the fixed value of 16A in this case, and because the product values are naturally in anti-symmetric form, the APC technique does not need to be included. Hence, the add/subtract circuit also becomes redundant, and a simplified block diagram is obtained. These features are instrumental in reducing important measures such as area-time complexity and Area-Delay Product (ADP).

IV. CODING AND SYNTHESIS OF LUT BASED MULTIPLIER USING OPTIMIZATION SCHEME AND COMPARISON

The simulation outputs for unsigned and signed input words are shown in Fig. 4 and Fig. 5 respectively. The synthesis results obtained after individually programming each module in Xilinx ISE 8.2i are given in Table 2. It can be seen from Table 2 that the delay of the proposed LUT optimization scheme, only 5.59 ns in each case, is less than the delays of the individual APC and OMS product modules. Also, the utilization of bonded IOBs is lower (11% and 10%) compared with 14% and 15% for the individual modules. The Implement Design feature of the Xilinx ISE suite can also be used to obtain the further data shown in Table 3.

Fig. 4: Simulation result for unsigned input word.

Fig. 5: Simulation result for signed input word.
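The signed-input behaviour can also be checked with a software model. The sketch below is an illustration under stated assumptions and does not reproduce the authors' VHDL or the nine-word LUT of Fig. 3: it follows the description of sixteen stored 8-bit odd-multiple words for a 5-bit signed input, uses the example coefficient 4, treats a zero input as a special case, and uses the hypothetical names `SIGNED_LUT`, `to_signed` and `signed_lut_multiply`.

```python
A = 4                      # example fixed coefficient taken from the text
WIDTH = 8                  # stored word width in bits
MASK = (1 << WIDTH) - 1    # 0xFF, keeps results within 8 bits

# Sixteen stored words: 8-bit two's-complement forms of A*k for odd k in -15..15
SIGNED_LUT = {k: (A * k) & MASK for k in range(-15, 16, 2)}


def to_signed(word):
    """Interpret an 8-bit word as a two's-complement integer."""
    return word - (1 << WIDTH) if word & (1 << (WIDTH - 1)) else word


def signed_lut_multiply(x):
    """Compute A*x for a 5-bit signed x using only odd-multiple words and shifts."""
    if not -16 <= x <= 15:
        raise ValueError("x must fit in 5-bit two's complement")
    if x == 0:
        return 0  # assumption: zero is handled outside the LUT
    shift = 0
    while x % 2 == 0:      # strip factors of two, keeping the sign of the odd part
        x //= 2
        shift += 1
    word = (SIGNED_LUT[x] << shift) & MASK  # barrel-shifter step, truncated to 8 bits
    return to_signed(word)
```

In this model the eight words with MSB equal to 1 are 196, 204, ..., 252, and every input from -16 to 15 returns the same value as direct multiplication by 4.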

TABLE II: SYNTHESIS REPORT OF VARIOUS TECHNIQUES IN TABULATED FORMAT

TABLE III: IMPLEMENTATION REPORT IN XILINX ISE 8.2i OF VARIOUS TECHNIQUES IN TABULATED FORMAT

For simulation and implementation purposes, ModelSim 6.3 and Xilinx ISE 8.2i are used. The desired outputs, i.e. the product words, are obtained in the simulations as shown. Xilinx ISE 8.2i is used to download the design onto the Spartan-3E kit. Xilinx ISE thus forms the interface between ModelSim and the FPGA kit, converting the code simulated in ModelSim into a form that can be downloaded to the kit. A comparison with reference to parameters such as the total equivalent gate count for the design and the additional JTAG gate count for IOBs is given in Table 3. Again, it is evident that the proposed LUT optimization scheme has a far lower additional JTAG gate count for IOBs than the individual APC and OMS modules. This in turn contributes to a reduction in delay and memory size for memory based computation.

V. CONCLUSION

The proposed LUT optimization scheme for L = W = 4 or 5 bits is coded in VHDL and simulated in ModelSim 6.3. The synthesis and implementation reports are obtained from Xilinx ISE 8.2i. On comparison, it is found that the proposed LUT design involves significantly less multiplication time than the individual modules and other designs. It offers a greater saving in terms of Area-Delay Product compared with other methods such as the APC, OBC and OPC techniques. Thus, this paper shows the possibility of using LUT-based multipliers to implement constant multiplication for DSP algorithms and applications. The full advantages of the proposed LUT-based design could be derived if the LUTs are implemented as NAND or NOR read-only memories and the arithmetic shifts are implemented by an array barrel shifter using metal oxide semiconductor transistors.

VI. FURTHER WORK

FPGAs and other programmable logic arrays are highly configurable. Further work could be carried out to derive such modified OMS based LUTs for higher input sizes with different decomposition forms. Other parallel and pipelined addition schemes could also be explored for suitable area-delay trade-offs.