Memory efficient Distributed architecture LUT Design using Unified Architecture

Research Article

Authors: S.M.L.V.K. Durga (1), N.S. Govind (2)

Address for Correspondence: (1) M.Tech II Year, ECE Dept., ASR Institute of Technology; (2) Assistant Professor, ECE Dept., ASR Institute of Technology.

Abstract: This paper presents an efficient algorithm for optimizing the size of the LUT required for the direct storage of complex computational values, together with an FIR system built on the optimized LUT. Many algorithms have already been implemented for optimizing the look-up tables that substitute for the multiply-and-accumulate structures of DSP cores in FPGAs. Here a new method, A-OMS LUT, is presented that provides better performance than the previously specified methods [3, 5, 6]. In addition, a simple FIR filter is implemented with the A-OMS algorithm using look-up tables (LUTs) for high-speed computation in FPGAs. It is applicable to communication technologies, i.e. wireless technology, and especially to spectrum-sensing techniques in the cognitive radio of a Software Defined Radio (SDR). Further, the memory optimization process based on the A-OMS LUT algorithm is shown; it improves system performance in terms of speed and area, doubling the transmission rate and increasing the overall throughput. Finally, the experimental results show a saving of more than 30% in area-delay product, with a transmission speed twice that of the conventional methods. The entire design is implemented with Xilinx synthesis tools and simulated using Xilinx ISE 7.1 Project Navigator.

KEY WORDS: Xilinx ISE, LUT, FIR System, SDR, Spectrum-Sensing, FPGA, Memory optimization, A-OMS LUT.

1. Introduction:

In most DSP processors, memory-based computing structures are of greater concern than multiply-accumulate structures.
Computational or functional operations performed in the DSP blocks of an FPGA for a particular task are time-consuming and require additional components such as adders and multipliers. In processors such as the DSP cores in FPGAs, multiply-and-accumulate structures are replaced with look-up tables. Instead of using conventional multipliers for complex multiplication, the operations are simplified by LUTs that directly store the complex computational values [1, 3]. Further optimization of the look-up tables improves speed and area utilization. In this paper, LUT optimization using the A-OMS methodology is the primary concern.

Several studies in the past have examined the effect of logic block functionality on the area and performance of field-programmable gate arrays (FPGAs). The focus of this paper is to determine the effect of the number of inputs to the LUT. In terms of the algorithms employed, mappers are divided into structural and functional. Structural mappers take the circuit graph as given and find a covering of the graph with K-input subgraphs corresponding to LUTs. Functional approaches perform Boolean decomposition of the logic functions of the nodes into sub-functions of limited support size realizable by individual LUTs. Since functional mappers explore a larger solution space, they tend to be time-consuming, which limits their use to small designs [1, 3]. In practice, FPGA mapping for large designs is done using structural mappers, whereas functional mappers are used for resynthesis after technology mapping.

Secondly, spectrum sensing is a foremost concern in current communication technologies. Most applications require the spectrum to be used efficiently. However, according to recent survey reports, there is still a need for efficient spectrum-sensing techniques. Consider a Software Defined Radio (SDR) that uses the concept of cognitive radio to sense spectrum holes (white spaces) for channel or band reuse. In wireless, multimedia, and digital signal processing applications, this efficiency helps to increase overall system performance. Cognitive radio is considered in SDR as a solution to spectrum under-utilization, where spectrum holes allow the channel or band to be reused for authorized spectrum stealing. Matched filters are considered a good approach here, and their structure resembles the FIR filter structure [2, 4]. This applies to different fields of technology, including wireless and multimedia communication, where digital signal processing is now the primary concern.

In this paper, an A-OMS LUT based FIR filter (reflecting a simple matched-filter design as an efficient method for spectrum sensing) is designed for high-speed signal transmission. A combined approach is defined from two methods, antisymmetric product coding and odd multiple storage, which were used previously to optimize LUTs within DSP cores. The input address and LUT output can always be transformed into odd integers.
Previously [5, 6] it was observed that, when antisymmetric product coding is combined with the odd multiple storage technique, the two's complement operations can be greatly simplified, since the input address and LUT output can always be transformed into odd integers; however, the two schemes cannot be combined directly, since the words generated are odd numbers. Consequently, a different form of antisymmetric product coding is combined with a modified form of the odd multiple storage scheme to form the A-OMS LUT method, which aims mainly to provide efficient memory-based computation. The modified approach is described briefly in section two of this paper. Section three presents an FIR filter based system design using the A-OMS LUT method. Sections four and five cover the memory optimization process, results, and conclusion.

2. A-OMS LUT METHOD:

Conventional LUT-based multipliers, with a fixed coefficient and an input word, have been used for simple memory-based multiplication, with the products held in a memory core [3]. The LUT size grows with the input word length, which is area inefficient. To provide an area-efficient look-up table for large data operations, some optimization schemes have been presented. In one method, only the odd multiples are stored instead of all the values; in another, the LUT size is reduced to half of the original by recording the product words as antisymmetric pairs. Combining modified forms of these two methods, odd multiple storage and antisymmetric product coding, gives the A-OMS LUT method, which further optimizes the LUT.

2(a) Method 1: Modified antisymmetric product coding scheme:

In this method, the 32 possible 5-bit input words are considered.
Computing the product word (PW) values (i.e., the input word X of length L = 5 multiplied by the fixed coefficient A) shows negative mirror symmetry between the two halves of the input range, which allows a smaller LUT. Hence, with 4-bit addresses, the number of code words to be stored is reduced to half.
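The mirror symmetry described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the coefficient value 11 and the helper name `mirror_table` are assumptions chosen only for illustration. It shows that the products for an input x and its mirrored input 32 - x lie at equal distances on either side of 16A, so only the distance needs to be stored.

```python
L = 5                    # input word length, as in the text
FULL = 1 << L            # 32 possible input words
HALF = FULL >> 1         # 16, the centre of the mirror symmetry


def mirror_table(a):
    """Return rows (x, u, v, mean, half_diff) for x = 1..15.

    u is the product for input x, v the product for the mirrored
    input 32 - x. The helper name is hypothetical.
    """
    rows = []
    for x in range(1, HALF):
        u = a * x                 # product word for input x
        v = a * (FULL - x)        # product word for the mirrored input
        rows.append((x, u, v, (u + v) // 2, (v - u) // 2))
    return rows


A = 11  # hypothetical fixed coefficient, for illustration only
for x, u, v, mean, half_diff in mirror_table(A):
    # Every pair is centred on 16A, so u and v differ from 16A only in sign.
    assert mean == HALF * A
    assert u == mean - half_diff and v == mean + half_diff
```

Because each pair shares one stored distance, 16 words are enough to cover all 32 inputs, which is the halving the text refers to.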

The reduction is derived from the antisymmetric behavior of the products, which gives antisymmetric product coding. The address bits are represented by $X' = (x'_0, x'_1, x'_2, x'_3)$ such that

$$X' = \begin{cases} X_L, & \text{if } x_4 = 1 \\ X_L', & \text{if } x_4 = 0 \end{cases} \qquad (1)$$

where $X_L = (x_0, x_1, x_2, x_3)$ is the four less significant bits of $X$, and $X_L'$ is the two's complement of $X_L$. The product word can be denoted as

$$PW = 16A + (\text{sign value}) \times (\text{derived word}) \qquad (2)$$

where the sign value is equal to 1 for $x_4 = 1$ and equal to $-1$ for $x_4 = 0$. The product value for X = (10000) corresponds to a derived word (antisymmetric product code value) of zero, which can be obtained by resetting the LUT output instead of storing it in the LUT. A simplified LUT-M circuit for an input word of length L = 5 is shown in figure 1; it describes both the structure and function of the LUT-M (look-up-table based multiplier).

Figure 1. Antisymmetric product coding based LUT-M circuit.

The address mapping circuit generates the desired address $(x'_0, x'_1, x'_2, x'_3)$, and $x_4$ is the control bit for the (+/-)_cell. The generated address selects the required value in the LUT, whose output is then added to or subtracted from 16A by the (+/-)_cell, as shown in figure 1.

2(b) Method 2: Modified odd multiple storage scheme:

In this method a barrel shifter performs the shift operations by which the even multiples are computed from the stored odd multiples. Storing only the odd multiples, rather than all the values, gives a LUT optimized in terms of area.

Figure 2. (a) Decoder circuit. (b) Control circuit and address generation circuit (AGC).

In addition to the shifter circuit, a memory unit for the product values, a decoder circuit for mapping the bits, a control circuit, and an address generation circuit are required. The 5-bit input word is mapped to a 4-bit LUT address $(d_0, d_1, d_2, d_3)$ by a simple set of mapping relations. The address bits are generated by the AGC shown in figure 2(b) using equations (3) and (4). Here $Y_L = (y_0, y_1, y_2)$ denotes the shifted odd-integer address bits. The mapping relations are given by equation (3), where $X' = (x'_0, x'_1, x'_2, x'_3)$ is generated by address mapping of the values after arithmetically right-shifting the leading zeros of X, similar to equation (1), i.e.,

$$X' = \begin{cases} Y_L, & \text{if } x_4 = 1 \\ Y_L', & \text{if } x_4 = 0 \end{cases} \qquad (4)$$

For a given L-bit input word, an address encoder maps the (L-1)-bit addresses of the LUT, which consists of nine words of (W+4)-bit width. Modifying a simple 3-to-8 line decoder circuit (figure 2(a)) produces a 4-to-9 line decoder that generates the word-select signals by which the required word is selected from the LUT. Figure 2(b) shows the control circuit that provides the control bits and a reset signal bit. The basic shift operation of the barrel shifter is governed by the control bits $s_0, s_1$ from the control circuit. From figure 2(b), the control bits $s_0, s_1$ are derived as in equation (5), and the reset signal is derived as

$$\mathrm{RESET} = d_3 \cdot x_4 \qquad (6)$$

where $d_3$ is defined in equation (3).

The optimized LUT circuit with the modified schemes is shown in figure 3, in which the address generation and control block is as shown in figure 2(b). The main reason for adopting this technique is to optimize the implementation of the sign modification of the odd LUT output, which in the previously defined methods [6] does not support the OMS scheme.
It also provides the 2's complement representation of the product words, supporting computation with both signed and unsigned values, by modifying the (+/-)_cell (which performs the add/subtract operations) of figure 1.
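The two schemes of this section can be sketched together in software. This is a behavioral illustration under stated assumptions, not the circuit of figure 3: the function names are hypothetical, the shift amount is found by a loop rather than by the control bits of equation (5), and the choice of 2A as the ninth stored word (shifted left by three to give 16A for the all-zero input) is an assumption, since the text only states that the LUT holds nine words.

```python
def build_lut(a):
    """Nine stored words: the odd multiples A, 3A, ..., 15A plus 2A.

    Using 2A as the ninth word is an assumption made for this sketch.
    """
    lut = {m: a * m for m in range(1, 16, 2)}
    lut[2] = 2 * a
    return lut


def derived_word(lut, addr):
    """Return A * addr for addr in 1..15 from an odd word and a left shift."""
    shift = 0
    while addr % 2 == 0:      # strip factors of two to reach an odd multiple
        addr >>= 1
        shift += 1
    return lut[addr] << shift  # the barrel shifter restores the even multiple


def apc_oms_multiply(a, x):
    """Multiply coefficient a by a 5-bit input x (0..31) using equation (2)."""
    lut = build_lut(a)
    x4 = (x >> 4) & 1
    x_low = x & 0xF
    # Equation (1): use X_L when x4 = 1, its 4-bit two's complement otherwise.
    addr = x_low if x4 else (-x_low) & 0xF
    if addr == 0:
        # X = 10000: LUT output is reset to zero.
        # X = 00000: derived word 16A, taken here as 2A << 3 (assumption).
        word = 0 if x4 else lut[2] << 3
    else:
        word = derived_word(lut, addr)
    sign = 1 if x4 else -1
    return 16 * a + sign * word


# Every 5-bit input reproduces the ordinary product.
assert all(apc_oms_multiply(13, x) == 13 * x for x in range(32))
```

The sketch stores 9 words instead of 32, which mirrors the storage reduction the section describes; the hardware cost of the address mapping and shifter is not modeled.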

Figure 3. (a) A-OMS LUT System Block Diagram.

3. OPTIMIZED LUT BASED FIR SYSTEM DESIGN:

The optimization of the LUT using the A-OMS method is defined in the section above. As noted in the introduction, white spaces (spectrum holes) are detected using matched filters at the receiver end, and the FIR filter structure resembles the matched filter structure. This section therefore describes the implementation of an FIR filter and the design of a system based on the A-OMS method.

3(a) FIR Filter Design:

The A-OMS method is a different approach to implementing digital filters. The basic idea is to replace all multiplications and additions by a table and a shifter-accumulator. An optimized FIR filter is designed, and the basic block diagram of the matched filter resembles the basic architecture of the FIR filter. An FIR filter is an LTI digital filter characterized by the non-recursive difference equation in the time domain:

$$y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k] \qquad (7)$$

where x[n] are the samples of the input sequence, y[n] the samples of the output sequence, h[k] the impulse response of the filter, and N the number of taps. Its z-domain representation is

$$H(z) = \sum_{k=0}^{N-1} h[k]\, z^{-k} \qquad (8)$$

To reduce the number of registers and pipeline stages with respect to the direct form, the transposed structure is considered. The direct-form and transposed-form realizations of the FIR filter are shown in figures 4(a) and 4(b).

Figure 4. (a) Direct form realization of FIR filter. (b) Transposed structure realization of FIR filter.
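The equivalence of the direct and transposed realizations can be illustrated with a short behavioral model. This is a sketch only: the function names and the sample coefficient values are assumptions, and ordinary multiplication stands in for the LUT multipliers of the actual design.

```python
def fir_direct(h, x):
    """Direct form: y[n] = sum over k of h[k] * x[n - k]."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(len(h)):
            if n - k >= 0:            # samples before x[0] are taken as zero
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


def fir_transposed(h, x):
    """Transposed form: each input sample feeds all multipliers at once,
    and partial sums move through a chain of delay registers."""
    state = [0] * (len(h) - 1)        # registers between the adders
    y = []
    for sample in x:
        products = [c * sample for c in h]
        out = products[0] + (state[0] if state else 0)
        # Each register takes its own product plus the next register's old value.
        for i in range(len(state) - 1):
            state[i] = products[i + 1] + state[i + 1]
        if state:
            state[-1] = products[-1]
        y.append(out)
    return y


h = [1, 2, 3]          # hypothetical impulse response
x = [1, 0, 0, 1]       # hypothetical input samples
assert fir_direct(h, x) == fir_transposed(h, x) == [1, 2, 3, 1]
```

In the transposed form every multiplier sees the same input sample in a given cycle, which is why a shared address word can drive all the LUT multipliers.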

3(b) Design Process:

The transfer function of the FIR filter is the ratio of the output sequence to the input sequence in the z-domain. A simple flow of the A-OMS based FIR filter design is shown in figure 5, where the input samples are represented by x[n], the multiplication operation by M, a simple arithmetic function by A, and the delay operator by D. The method specified in section 2 is applied to produce the desired output.

Figure 5. A-OMS based FIR filter.

The SA cells and AS cells perform the arithmetic add/subtract operations and the arithmetic-shift operations between the input samples and the impulse-response values, and store the results in a memory core that is accessed through the LUT. One input sample is processed in each clock cycle, with the same number of cycles of latency as the optimized LUT, since the same pair of address words is used by all the LUT multipliers. The impulse sequences h[0], h[1], h[2], ..., h[n-2], h[n-1] are the inputs of the SA cell shown in figure 6.

Figure 6. FIR filter using LUT optimised structure.

The decoder and control operations are similar to those described in section 2. The input values are the input sequence (x[0], x[1], x[2], ..., x[n-1]) of the desired filter, and the coefficient values are the impulse sequences h[0], h[1], h[2], ..., h[n-2], h[n-1]. The coefficients may be fixed, or may be generic, i.e. a set of impulse sequences. The A-OMS method supports multiple sequences and generic coefficient values as well as fixed values. The desired output responses are derived through the mapping process explained in section 2 of this paper. Finally, the A-OMS method reduces the computational delay. The FIR filter design based on the optimized LUT using the A-OMS method is more efficient than the DA-based approach used previously [2, 4] in terms of area complexity for a given throughput, and has lower implementation latency.
As the input sample size increases, high-precision multiplication is handled by coding the input code words into the required word sizes, after which the A-OMS operations are performed in the same way as before. This coding method is known as the input coding technique and provides high-precision operation.
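One common way to realize such input coding is to split a wide input word into short segments, look up each partial product, and combine them by shift-and-add. The sketch below assumes that interpretation; the 5-bit segment width follows section 2, while the function names, the unsigned inputs, and the use of a plain 32-word table in place of the optimized LUT are assumptions made to keep the example short.

```python
SEG_BITS = 5  # segment width, matching the 5-bit words of section 2


def build_full_lut(a):
    """Plain 32-word table of a * x; stands in for the optimized LUT."""
    return [a * x for x in range(1 << SEG_BITS)]


def multiply_by_segments(a, x, word_bits):
    """Multiply a by an unsigned word_bits-wide input using 5-bit segments."""
    lut = build_full_lut(a)
    mask = (1 << SEG_BITS) - 1
    total = 0
    for shift in range(0, word_bits, SEG_BITS):
        segment = (x >> shift) & mask     # next 5-bit slice of the input
        total += lut[segment] << shift    # weight the partial product
    return total


# A 10-bit input needs two lookups and one shifted addition.
assert multiply_by_segments(7, 1000, 10) == 7000
```

The table size stays fixed at 32 words per segment regardless of the input width, at the cost of one lookup and one addition per segment.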

4. RESULTS:

The comparisons of latency and area complexity between the LUT-based design and designs based on other methods are shown in Table 1 and Table 2. The FIR filter design using the high-speed LUT structure was implemented and simulated using the Xilinx ISE simulator and synthesis tools.

Table 1. Latencies for different filter orders and input word sizes.

Table 2. Area complexity comparisons using different methods.

The simulated results are shown in figure 7. The overall design process improves system performance in terms of speed and area, doubling the transmission rate and increasing the overall throughput. The results show a saving of more than 20% in area-delay product, with a transmission speed twice that of the conventional methods. The design requires N times fewer decoders, and the memory requirement is reduced to half that of the conventional design; it therefore occupies nearly 20% less area than the conventional design methods (DA, conventional multiplier) for the implementation of a 16-tap FIR filter with the same throughput per cycle.

Figure 7. Xilinx ISE simulated waveform.

This could be used for memory-based implementation of cyclic and linear convolutions, sinusoidal transforms, and inner-product computation. Comparison of the implemented FIR filter with previous work also shows a reduction in adders/subtractors and slices, and hence a reduced memory core size.

5. CONCLUSION:

An efficient approach to optimizing LUTs is specified and implemented through the A-OMS method, resulting in improved performance. An FIR filter resembling the matched filter structure is implemented with this design, which is applicable to many applications, especially spectrum sensing in the cognitive radio of an SDR. The specified method and design process provide 20% savings in area-delay product, with twice the throughput of the previous methods. This is shown in the tables and explained briefly in the sections above. The overall implementation is simulated using Xilinx ISE Project Navigator, and the simulated result is shown. The FIR filter is implemented at the receiver part of the system. To further enhance system performance by using the available spectrum efficiently, the specified method will be implemented at the transmitter part of the system.

6. REFERENCES:

[1] P. K. Meher, LUT Optimization for Memory-Based Computation, IEEE Trans. on Circuits & Systems-II, pp. 285-289, April 2010.
[2] S. K. Sanyal and Wasim Arif, Designing of a fast LUT based DDA FIR system with adaptive coefficient for Spectrum Sensing in Cognitive Radio, ICGST AIML-11 Conference, Dubai, UAE, 12-14 April 2011.
[3] P. K. Meher, New Approach to Look-up-Table Design and Memory-Based Realization of FIR Digital Filter, IEEE Trans. on Circuits & Systems-I, pp. 592-603, March 2010.
[4] Tevfik Yücek and Hüseyin Arslan, A Survey of Spectrum Sensing Algorithms for Cognitive Radio Applications, IEEE Communications Surveys & Tutorials, vol. 11, no. 1, First Quarter 2009.
[5] H. H. Dam, A. Cantoni, K. L. Teo, and S. Nordholm, FIR variable digital filter with signed power-of-two coefficients, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 6, pp. 1348-1357, Jun. 2007.
[6] R. Mahesh and A. P. Vinod, A new common subexpression elimination algorithm for realizing low-complexity higher order digital filters, IEEE Trans. Computer-Aided Des. Integr. Circuits Syst., vol. 27, no. 2, pp. 217-229, Feb. 2008.
[7] D. Sundararajan, M. O. Ahmad, and M. N. S. Swamy, Vector computation of the discrete Fourier transform, IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 45, no. 4, pp. 449-461, Apr. 1998.
[8] V. Britanak, DCT/DST universal computational structure and its impact on VLSI design, in Proc. IEEE DSP Workshop, Hunt, TX, Oct. 15-18, 2000.
[9] L-P. Chau and W-C. Siu, Direct formulation for the realization of discrete cosine transform using recursive filter structure, IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 42, no. 1, pp. 50-52, Jan. 1995.
[10] J. F. Yang and C-P. Fang, Compact recursive structures for discrete cosine transform, IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 47, no. 4, pp. 314-321, Apr. 2000.
[11] W. H. Fang and M. L. Wu, An efficient unified systolic architecture for the computation of discrete trigonometric transforms, in Proc. IEEE Symp. Circuits and Systems, vol. 3, 1997, pp. 2092-2095.