Design of Memory Based Implementation Using LUT Multiplier

Charan Kumar.k (1), S. Vikrama Narasimha Reddy (2), Neelima Koppala (3)
(1, 2) M.Tech (VLSI) Student, (3) Assistant Professor, ECE Department, Sree Vidyanikethan Engineering College (Autonomous), A. Rangampet, Tirupati.
ISSN: 2231-5381 http://www.internationaljournalssrg.org

Abstract - Multiplication is a major arithmetic operation in signal processing. In ALUs, a memory-based multiplier uses a lookup table (LUT) as memory for its computations. We do not find any significant work on LUT optimization for memory-based multiplication. In this project, anti-symmetric product coding (APC) and odd-multiple storage (OMS) are used for the LUT design of memory-based multipliers used in signal processing applications such as filter design. Each of these techniques reduces the LUT size by a factor of two. A different form of APC and a modified OMS scheme can be combined for efficient memory implementation, which reduces the LUT size to one-fourth of the conventional LUT. Owing to operand decomposition, the proposed LUT-based multiplier involves a lower area-delay product for higher word sizes than canonical-signed-digit (CSD)-based multipliers. The design is coded in Verilog HDL, synthesized using Xilinx ISE 10.1i, and implemented on a Spartan-3E FPGA.

Key words - digital signal processing (DSP) chip, lookup-table (LUT)-based computing, memory-based computing, very large scale integration (VLSI).

I. INTRODUCTION

Due to the rapid development of technology, semiconductor devices are now prominent in every field. These devices operate very fast, consume less power, occupy less area, and have become more efficient with respect to factors such as reliability, flexibility, and scaling. This has led to significant growth and improvement, and the devices have become cheaper. Embedded semiconductor memory has a dominating presence in systems-on-chip (SoCs), exceeding 90% of the total SoC [2]. Compared with logic components, semiconductor memory devices have a high transistor packing density that is increasing at a fast rate [1].

Apart from that, memory-based computing structures offer further advantages over multiply-accumulate structures, such as greater potential for high-throughput, low-latency implementation and lower dynamic power consumption. Memory-based computing is well suited to many digital signal processing (DSP) algorithms that involve multiplication with a fixed set of coefficients. Fig. 1 shows the block diagram of the conventional LUT-based multiplier.

Fig. 1: Conventional LUT-based multiplier.

Here X is the input address and A is the fixed coefficient that multiplies X; the resulting product is taken as the output. Suppose X is a positive binary number of word length L. There are $2^L$ possible values of X and, correspondingly, $2^L$ possible values of the product $C = A \cdot X$. In memory-based multiplication, a conventional LUT of $2^L$ words provides the pre-computed product values for all possible values of X. For an L-bit binary input address $X_i$, the LUT outputs the corresponding product $A \cdot X_i$. The product $A \cdot X_i$ is therefore stored at location $X_i$ for $0 \le X_i \le 2^L - 1$.

Several architectures have been reported for memory-based implementation of DSP algorithms involving orthogonal transforms and digital filters [5]-[12], but we do not find any significant work on LUT optimization among them. Recently, a new approach to LUT optimization was introduced in which only the odd multiples of the fixed coefficient are stored; it is termed the odd-multiple-storage (OMS) scheme [3]. The LUT size can also be halved by another approach, the anti-symmetric product coding (APC) scheme, in which the product words are recoded as antisymmetric pairs [4].

Although the APC approach reduces the LUT size by a factor of two, it takes extra time and area to perform the two's-complement operation on the LUT output for sign modification and on the corresponding input. By combining the APC and OMS techniques, the two's-complement operations can be simplified, because the input address and the LUT output can always be transformed into odd integers. However, the OMS scheme of [3] does not combine efficiently with the APC approach of [4], since the APC scheme of [4] works for odd multiples only. Therefore, for efficient memory-based multiplication, a modified form of the OMS scheme is combined with a different form of APC.

The APC scheme and the modified OMS scheme are discussed in Section II, and the implementation of the LUT-based multiplier using the combined scheme is described in Section III. The synthesis results of the proposed multiplier are presented in Section IV, followed by the conclusion.

II. LUT OPTIMIZATIONS FOR MEMORY-BASED MULTIPLICATION

This section describes the APC technique and its optimization by combining it with a modified form of OMS.

A. APC for LUT Optimization

For simplicity, we assume that both X and A are positive integers. Table I shows the product values for different values of the input X for L = 5.
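As a baseline for the optimizations discussed in this paper, the conventional LUT multiplier of Fig. 1 can be modelled in a few lines of Python. This is only an illustrative sketch: the coefficient value A = 3 and the function names are assumptions made for the example and do not come from the paper.

```python
def build_conventional_lut(a, l):
    """Pre-compute A * X for every L-bit input X, giving 2**L stored words."""
    return [a * x for x in range(2 ** l)]


def lut_multiply(lut, x):
    """Multiplication reduces to a single memory read at address X."""
    return lut[x]


A = 3  # illustrative coefficient (assumption, not from the paper)
L = 5  # input word length used in the paper's examples

lut = build_conventional_lut(A, L)
print("stored words:", len(lut))
print("A * 21 =", lut_multiply(lut, 21))
```

The table holds one word per possible input, which is the storage cost that the APC and OMS schemes each reduce by a factor of two.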

Table I: APC words for L = 5 for different input values.

From Table I it is clear that the input word X in the third column of each row is the two's complement of the input word X in the first column of the same row. In addition, the sum of the product values corresponding to the two input values on the same row is 32A. Let u and v be the product values in the second and fourth columns of a row, respectively. We can then write

$$u = \frac{u+v}{2} - \frac{v-u}{2}, \qquad v = \frac{u+v}{2} + \frac{v-u}{2},$$

and, since $u + v = 32A$,

$$u = 16A - \frac{v-u}{2}, \qquad v = 16A + \frac{v-u}{2}. \tag{1}$$

The product values in the second and fourth columns of Table I therefore show negative mirror symmetry. This symmetry can be used to reduce the LUT size: instead of storing u and v, only $(v-u)/2$ is stored for the pair of inputs on a given row. The fifth and sixth columns of the table show the 4-bit LUT address values and the corresponding coded words, respectively. Since this product representation is derived from the antisymmetric behavior of the products, we term it the antisymmetric product code. The 4-bit address $X' = (x'_3 x'_2 x'_1 x'_0)$ of the APC word is given by

$$X' = \begin{cases} X_L, & \text{if } x_4 = 1 \\ X'_L, & \text{if } x_4 = 0 \end{cases} \tag{2}$$

where $X_L = (x_3 x_2 x_1 x_0)$ is the four less significant bits of X and $X'_L$ is the two's complement of $X_L$. The required product is obtained by adding the stored value $(v-u)/2$ to, or subtracting it from, the fixed value 16A when $x_4$ is 1 or 0, respectively, i.e.,

$$\text{Product word} = 16A + (\text{sign value}) \times (\text{APC word}) \tag{3}$$

where sign value = 1 for $x_4 = 1$ and sign value = -1 for $x_4 = 0$. For X = (00000), the encoded word to be stored is 16A. The product value for X = (10000) corresponds to the APC value zero, which can be derived by resetting the LUT output instead of storing it in the LUT.

B. Modified OMS for LUT Optimization

As the name indicates, the OMS scheme stores only the odd multiples of the fixed coefficient.
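As a numerical check on the APC relations (1)-(3) before moving on to the OMS details, the following Python sketch builds the 16 APC words for L = 5 and recovers every product from them. It is a hedged illustration: the coefficient A = 13 and all function names are assumptions, and the choice to hold 16A at address 0000 (the word needed for X = 00000) with a reset for X = 10000 is derived here from (3).

```python
def build_apc_lut(a):
    """Return 16 APC words indexed by the 4-bit address X'.

    Address k (k = 1..15) holds k*A, i.e. (v - u)/2 for the input pair
    that maps to it. Address 0 holds 16A, the word needed for X = 00000.
    """
    return [16 * a] + [k * a for k in range(1, 16)]


def apc_address(x):
    """Map a 5-bit input to the 4-bit APC address following (2)."""
    x4 = (x >> 4) & 1
    xl = x & 0xF
    return xl if x4 else (-xl) & 0xF  # 4-bit two's complement when x4 = 0


def apc_multiply(lut, a, x):
    """Recover A*x as 16A plus or minus the APC word, following (3)."""
    x4 = (x >> 4) & 1
    word = 0 if x == 0b10000 else lut[apc_address(x)]  # reset for X = 10000
    sign = 1 if x4 else -1
    return 16 * a + sign * word


A = 13  # illustrative coefficient (assumption)
apc_lut = build_apc_lut(A)

# Each pair of inputs k and 32 - k has products that sum to 32A, as in (1).
for k in range(1, 16):
    u, v = A * k, A * (32 - k)
    assert u + v == 32 * A and (v - u) // 2 == apc_lut[16 - k]

print(all(apc_multiply(apc_lut, A, x) == A * x for x in range(32)))
```

The LUT here has 16 words instead of the 32 needed by a conventional table for L = 5, at the cost of one add/subtract step on the output.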
For the multiplication of an L-bit binary word X with a fixed coefficient A, instead of storing all $2^L$ possible product values, the LUT stores only $2^L/2$ words corresponding to the odd multiples of A, while all the even multiples of A are derived from the odd multiples by left-shift operations. On this basis, the LUT for the multiplication of an L-bit input with a W-bit coefficient can be designed by the following strategy.

1) A memory unit of $(2^L/2) + 1$ words of (W + L)-bit width is used to store the product values, where the first $2^L/2$ words are odd multiples of A, and the last word is zero.
2) A barrel shifter producing a maximum of (L - 1) left shifts is used to derive all the even multiples of A.
3) The L-bit input word is mapped to the (L - 1)-bit address of the LUT by an address encoder, and the control bits for the barrel shifter are derived by a control circuit.

Table II shows that the eight odd multiples $A \times (2i + 1)$ are stored as $P_i$, for i = 0, 1, ..., 7, in eight memory locations. The even multiples 2A, 4A, and 8A are derived by left-shift operations of A. Similarly, 6A and 12A are derived by left shifting 3A, while 10A and 14A are derived by left shifting 5A and 7A, respectively. All the even multiples of A are thus derived by a barrel shifter that produces a maximum of three left shifts.

Table II: OMS-based design of the LUT of APC words for L = 5.

According to (3), the word to be stored for X = (00000) is not 0 but 16A, which can be obtained from A by four left shifts using a barrel shifter. However, if 16A is not derived from A, a maximum of only three left shifts is required to obtain all the other even multiples of A. A maximum of three shifts can be implemented by a two-stage logarithmic barrel shifter, whereas four shifts require a three-stage barrel shifter. In the modified OMS scheme it is therefore more efficient to store 2A for input X = (00000), so that the product 16A can be obtained by three arithmetic left shifts. Table III shows the product values and encoded words for the input words X = (00000) and (10000). For X = (00000), the required encoded word 16A is obtained by a 3-bit left shift of 2A [stored at address (1000)]. For X = (10000), the APC word 0 is derived by resetting the LUT output with an active-high RESET signal given by

$$\text{RESET} = \overline{(x_0 + x_1 + x_2 + x_3)} \cdot x_4. \tag{4}$$

Table III: Products and encoded words for X = (00000) and (10000).
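The OMS storage strategy (odd multiples in memory, even multiples obtained by left shifts) can be sketched in Python as follows. The sketch covers the basic three-step strategy only, not the modified scheme with 2A at address (1000); the coefficient A = 13, the word length L = 4, and the function names are illustrative assumptions.

```python
def build_oms_lut(a, l):
    """Store the 2**l / 2 odd multiples A*(2i + 1), then a final zero word."""
    half = 2 ** l // 2
    return [a * (2 * i + 1) for i in range(half)] + [0]


def oms_multiply(lut, x):
    """Return A*x from the odd-multiple LUT plus a left shift.

    The while loop plays the role of the address encoder and control
    circuit: it strips the trailing zeros of x to find the odd factor
    and counts how many left shifts the barrel shifter must apply.
    """
    if x == 0:
        return lut[-1]  # the last word is zero
    shifts = 0
    while (x & 1) == 0:
        x >>= 1
        shifts += 1
    return lut[x >> 1] << shifts  # odd x = 2i + 1 is stored at address i


A = 13  # illustrative coefficient (assumption)
L = 4   # illustrative input word length (assumption)

oms_lut = build_oms_lut(A, L)
print("stored words:", len(oms_lut))
print(all(oms_multiply(oms_lut, x) == A * x for x in range(2 ** L)))
```

For L = 4 the largest shift occurs for x = 8, which needs three left shifts of A, in line with the (L - 1)-shift bound in step 2.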

From Tables II and III, the 5-bit input word X can be mapped to a 4-bit LUT address $(d_3 d_2 d_1 d_0)$ by a simple set of mapping relations

$$d_i = x''_{i+1} \ \text{for } i = 0, 1, 2, \qquad d_3 = \overline{x''_0} \tag{5}$$

where $X'' = (x''_3 x''_2 x''_1 x''_0)$ is generated by shifting out all the leading zeros of $X'$ (the zeros at its least-significant end, which a right shift removes) by an arithmetic right shift followed by address mapping, i.e.,

$$X'' = \begin{cases} Y_L, & \text{if } x_4 = 1 \\ Y'_L, & \text{if } x_4 = 0 \end{cases} \tag{6}$$

where $Y_L$ and $Y'_L$ are derived by circularly shifting out all the leading zeros of $X_L$ and $X'_L$, respectively.

III. IMPLEMENTATION OF THE LUT-BASED MULTIPLIER USING THE PROPOSED LUT OPTIMIZATION SCHEME

This section deals with the implementation of the LUT-based multiplier using the proposed scheme, in which the LUT is optimized by a combination of the APC scheme and a modified OMS technique.

A. Implementation of the LUT Multiplier Using APC for L = 5

Fig. 2 shows the structure and function of the LUT-based multiplier for L = 5 using the APC technique. It consists of a four-input LUT of 16 words that stores the APC values of the product words given in the sixth column of Table I, except on the last row, where 2A is stored for input X = (00000) instead of storing 0 for input X = (10000). In addition, it consists of an address-mapping circuit and an add/subtract circuit. The address-mapping circuit generates the desired address $(x'_3 x'_2 x'_1 x'_0)$ according to (2). A straightforward implementation of address mapping multiplexes $X_L$ and $X'_L$ using $x_4$ as the control bit. The address-mapping circuit can be optimized so that it is realized by three XOR gates, three AND gates, two OR gates, and a NOT gate, as shown in Fig. 2. RESET can be generated according to (4) by a control circuit (not shown in the figure). The output of the LUT is added to or subtracted from 16A, for $x_4$ = 1 or 0, respectively, according to (3) by the add/subtract cell. Hence, $x_4$ is used as the control for the add/subtract cell.

Fig. 2: LUT-based multiplier using the APC technique for L = 5.

B. Implementation of the Optimized LUT Using Modified OMS

Fig. 3: APC-OMS combined LUT design for the multiplication of a W-bit fixed coefficient.
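The combined APC-OMS datapath of Fig. 3 can be modelled behaviorally in Python to confirm that nine stored words are enough for L = 5. This is a sketch under stated assumptions rather than the authors' Verilog: it takes $d_3$ in (5) to be the complement of $x''_0$, takes the shift count to be the number of zeros stripped from the least-significant end of $X'$ (at most three), and generates RESET as ($d_3$ AND $x_4$). The coefficient A = 13 and all function names are illustrative.

```python
def build_apc_oms_lut(a):
    """Nine words: A*(2i + 1) at addresses 0..7 and 2A at address 8 (1000)."""
    return [a * (2 * i + 1) for i in range(8)] + [2 * a]


def address_and_control(x):
    """Return (address d, shift count, reset, x4) for a 5-bit input x."""
    x4 = (x >> 4) & 1
    xl = x & 0xF
    xp = xl if x4 else (-xl) & 0xF        # X' from (2)
    xpp, shifts = xp, 0
    while shifts < 3 and (xpp & 1) == 0:  # strip up to three low-order zeros
        xpp >>= 1
        shifts += 1
    # (5): d2 d1 d0 are the upper bits of X''; d3 is assumed to be NOT x''0
    d = ((xpp >> 1) & 0b111) | ((~xpp & 1) << 3)
    reset = bool((d >> 3) & x4)           # alternative RESET: d3 AND x4
    return d, shifts, reset, x4


def apc_oms_multiply(lut, a, x):
    """LUT read, barrel shift, optional reset, then add/subtract with 16A."""
    d, shifts, reset, x4 = address_and_control(x)
    word = 0 if reset else lut[d] << shifts
    return 16 * a + word if x4 else 16 * a - word


A = 13  # illustrative coefficient (assumption)
lut9 = build_apc_oms_lut(A)
print("stored words:", len(lut9))
print(all(apc_oms_multiply(lut9, A, x) == A * x for x in range(32)))
```

In this model the input X = (00000) reads 2A from address (1000) and shifts it three places to obtain 16A, while X = (10000) reads the same address but has its output reset to zero.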

Fig. 3 shows the proposed APC-OMS combined design of the LUT for L = 5 and any coefficient width W. It consists of an LUT of nine words of (W + 4)-bit width, a four-to-nine-line address decoder, a barrel shifter, an address-generation circuit, and a control circuit for generating the RESET signal and the control word $(s_1 s_0)$ for the barrel shifter.

The pre-computed values of $A \times (2i + 1)$ are stored as $P_i$, for i = 0, 1, 2, ..., 7, in the eight consecutive locations of the memory array, as specified in Table II, while 2A is stored for input X = (00000) at LUT address (1000), as specified in Table III. The decoder takes the 4-bit address and generates nine word-select lines to select the required word from the LUT. The 4-to-9-line decoder is a simple modification of a 3-to-8-line decoder, as shown in Fig. 4(a). The control signals $s_0$ and $s_1$, which produce the desired number of shifts in the barrel shifter, are generated according to the relations

$$s_0 = \overline{x'_0 + \overline{\left(x'_1 + \overline{x'_2}\right)}} \tag{7a}$$

$$s_1 = \overline{\left(x'_0 + x'_1\right)} \tag{7b}$$

where $x'_0$, $x'_1$, and $x'_2$ are bits of the 4-bit APC address given by (2). As noted in Tables II and III, $(s_1 s_0)$ is the 2-bit binary equivalent of the required number of shifts. The RESET signal of (4) can alternatively be generated as ($d_3$ AND $x_4$). The generation of the control signals and the RESET signal is shown in Fig. 4(b). The address-generator circuit receives the 5-bit input operand X and maps it onto the 4-bit address word $(d_3 d_2 d_1 d_0)$ according to (5) and (6).

Fig. 4: (a) Four-to-nine-line decoder. (b) Control circuit.

IV. RESULTS AND DISCUSSION

Table IV: Synthesis results for different LUT word sizes.

Comparison factor                  4-bit    5-bit    6-bit
No. of 4-input LUTs (of 9312)      15       10       14
No. of slices (of 4656)            8        6        8
No. of IOs                         46       60       67
No. of bonded IOs (of 232)         19       30       56
Delay (ns)                         7.376    6.736    6.736

Fig. 5: Simulated results for L = 4.

In Fig. 5, the input bit sequence X = 4'h0 is applied and the output response q = 8'h03 is obtained.

Fig. 6: Simulated results for L = 5.

Fig. 7: Simulated results for L = 6.

In Fig. 6, the input bit sequence X = 5'h00 is applied and the output response q = 9'h003 is obtained. In Fig. 7, the input bit sequence X = 6'h00 is applied and the output response q = 10'h003 is obtained.

As shown in Table IV, when the word size of the LUT multiplier increases from L = 4 to L = 5, the delay decreases from 7.376 ns to 6.736 ns, and for L = 6 the delay is unchanged with respect to L = 5, with optimum utilization of memory. The LUT multipliers for L = W = 4, 5, and 6 bits are coded in Verilog HDL and synthesized in the Xilinx ISE 10.1i environment for a Spartan-3E FPGA. The device used is the XC3S500E in the FG320 package with a speed grade of -5.

V. CONCLUSION

The LUTs are implemented as arrays of constants for efficient utilization of the area-delay product. The area and delay complexities of the multipliers estimated from the synthesis results are listed in Table IV. It is found that the proposed LUT design involves comparable area and time complexities for a word size of 4 bits, but for higher word sizes it has a comparatively lower delay. In this brief, we have shown the possibility of using LUT-based multipliers to implement constant multiplication, especially for DSP applications. Future work will implement the derived OMS-APC-based LUTs for higher input sizes, using different forms of decomposition to obtain a suitable area-delay product.

REFERENCES

[1] P. K. Meher, "LUT optimization for memory-based computation," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 4, Apr. 2010.
[2] International Technology Roadmap for Semiconductors. [Online]. Available: http://public.itrs.net/
[3] P. K. Meher, "New approach to LUT implementation and accumulation for memory-based multiplication," in Proc. IEEE ISCAS, May 2009, pp. 453-456.
[4] P. K. Meher, "New look-up-table optimizations for memory-based multiplication," in Proc. Int. Symp. Integr. Circuits (ISIC'09), Dec. 2009, to be published.
[5] P. K. Meher, "Memory-based hardware for resource-constrained digital signal processing systems," in Proc. 6th Int. Conf. ICICS, Dec. 2007, pp. 1-4.
[6] P. K. Meher, "Systolic designs for DCT using a low-complexity concurrent convolutional formulation," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 9, pp. 1041-1050, Sep. 2006.
[7] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, "Systolic algorithms and a memory-based design approach for a unified architecture for the computation of DCT/DST/IDCT/IDST," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 6, pp. 1125-1137, Jun. 2005.
[8] H.-C. Chen, J.-I. Guo, T.-S. Chang, and C.-W. Jen, "A memory-efficient realization of cyclic convolution and its application to discrete cosine transform," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 3, pp. 445-453, Mar. 2005.
[9] A. K. Sharma, Advanced Semiconductor Memories: Architectures, Designs, and Applications. Piscataway, NJ: IEEE Press, 2003.
[10] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, "A systolic array architecture for the discrete sine transform," IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2347-2354, Sep. 2002.
[11] H.-R. Lee, C.-W. Jen, and C.-M. Liu, "On the design automation of the memory-based VLSI architectures for FIR filters," IEEE Trans. Consum. Electron., vol. 39, no. 3, pp. 619-629, Aug. 1993.
[12] J.-I. Guo, C.-M. Liu, and C.-W. Jen, "The efficient memory-based VLSI array design for DFT and DCT," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 10, pp. 723-733, Oct. 1992.