Design of an Area-Efficient Interpolated FIR Filter Based on LUT Partitioning

This paper describes the design of an area-efficient interpolation FIR filter with a partitioned lookup table (LUT) structure. Since the LUT block occupies a large portion of the area in the FIR filter, the proposed filter structure reduces the LUT size by partitioning, by exploiting coefficient symmetry, and by sharing the partitioned LUTs through multiplexing of the input data streams. Experimental results for several benchmark examples show that the proposed filter reduces the area by over 40% compared to the popular single-architecture dual-channel filter, while the power consumption is comparable to or less than that of conventional filter structures. The proposed FIR filter was designed for the 3-channel W-CDMA mobile station modulator.

Index Terms: Interpolation FIR filter, Lookup table partitioning, Pulse-shaping.

I. INTRODUCTION

With ongoing research to introduce multimedia capabilities into digital mobile communication systems, communication standards specify the use of enabling technologies such as the 3-channel W-CDMA mobile station modulator [1]. Pulse-shaping 1:4 interpolated FIR filters are employed in each band-limited Quadrature Phase Shift Keying (QPSK) modulator to provide in-band spectral shaping while minimizing intersymbol interference (ISI) [2],[3]. Each channel of the QPSK modulator requires a filtering operation for the in-phase (I data) and quadrature-phase (Q data) signal components of the input data stream [4]. Hence, a total of six band-limited FIR filters are needed in the 3-channel W-CDMA modulator. An FIR filter using the transversal computation structure [5] adopts a polyphase structure to effectively pipeline the input data streams across a register chain prior to performing the main filtering operation.
Despite the simplicity of the structure, it requires a prohibitively large number of registers and incurs area overhead due to the added complexity of the pipeline structure [6]. An alternative FIR filter structure suitable for high-speed filtering employs the LUT for the core filtering operation. All possible filter outputs are pre-calculated and tabulated in memory for every input transition pattern. The input data stream constitutes the address to the LUT, and therefore the filter output samples are evaluated by simply reading off constant values; slower dedicated arithmetic operations are replaced by a faster memory reference. LUT-based FIR filter designs have been widely implemented with ROM-based LUTs or hardwired LUTs [7]. A single-chip implementation using ROM-based LUTs incurs large area, high power consumption, and added fabrication complexity. The ROM-based LUT occupies about 99% of the area of the entire filter design due to the large number of transition patterns; hence, any area optimization in the filter design centers on a more compact LUT design. To overcome the drawbacks of the ROM-based LUTs, the table is hardwired, as in the popularly used single-architecture dual-channel filter [7]. However, the hardwired LUT still occupies about 60% of the total area, necessitating further area reductions.

Sun Young Hwang, Jong Kwan Choi, and Sik Kim are with Sogang University.
In this paper, we propose the design and implementation of an area-efficient, single-channel LUT-based pulse-shaping 1:4 interpolation FIR filter. Since the LUT block occupies a large portion of the area in the FIR filter, the proposed filter structure reduces the LUT size by partitioning, by exploiting coefficient symmetry, and by sharing the partitioned LUTs through multiplexing of the input data streams. The proposed FIR filter has been designed for the 3-channel W-CDMA mobile station modulator. The rest of the paper is organized as follows. Section II provides an overview of the pulse-shaping 1:4 interpolated FIR filter structure employing LUTs. Section III describes the partitioned LUT structure, the symmetric properties of the low-pass filter coefficients, and techniques for sharing the partitioned LUTs by multiplexing input data streams. Section IV highlights the implementation of the proposed single-channel dual-filter architecture adopting the features addressed in Section III, followed by experimental results in Section V. The final section presents concluding remarks and future research topics.

II. LUT-BASED FIR FILTER

As mentioned in the previous section, FIR filters are employed in the QPSK modulator for spectral shaping while minimizing ISI. For each channel of the QPSK modulator, two 1:4 interpolation FIR filters are used for the pulse-shaping operation. Hence, a total of six band-limited FIR filters are needed in the 3-channel W-CDMA modulator. Interpolation FIR filter designs can be implemented using either the transversal structure or the LUT-based structure. Equation 1 shows the operation of the 48-tap 1:4 interpolation FIR filter. The input signals to each filter, x = {x(n), x(n-1), ..., x(n-11)}, are multiplied with the filter coefficients h_m = {h_0, h_1, ..., h_47} to produce four output signals, y(4n+3), y(4n+2), y(4n+1), and y(4n). The filter coefficients are stored in ROM in 2's complement representation.
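As a rough illustration of the four-output computation of Equation 1, the following Python sketch evaluates y(4n+3), y(4n+2), y(4n+1), and y(4n) from twelve input samples and 48 coefficients. The function name `interp_outputs` and the coefficient values are assumptions made for illustration; they are not the coefficients of the actual pulse-shaping filter.

```python
def interp_outputs(x, h):
    """Return [y(4n+3), y(4n+2), y(4n+1), y(4n)] following Equation 1.

    x: 12 samples [x(n), x(n-1), ..., x(n-11)], each -1, 0, or +1.
    h: 48 coefficients [h_0, h_1, ..., h_47].
    """
    if len(x) != 12 or len(h) != 48:
        raise ValueError("expected 12 input samples and 48 coefficients")
    # Output number q uses the coefficients h_q, h_(q+4), ..., h_(q+44).
    return [sum(x[k] * h[4 * k + q] for k in range(12)) for q in range(4)]


# Placeholder symmetric coefficients (hypothetical, not the real filter).
h = [min(m, 47 - m) + 1 for m in range(48)]
x = [1, -1, 1, 1, -1, -1, 1, -1, 1, 1, 1, -1]
y = interp_outputs(x, h)
```

Because every sample is -1, 0, or +1, each product in the sum is just a coefficient, its negation, or zero, so a hardware version needs only adders.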
Multipliers are not necessary because the input data are binary-valued; hence, the result of each multiplication is either the coefficient itself or its inverse [8]. Figure 1 shows the transversal FIR filter implementation of Equation 1. It adopts a polyphase structure to effectively pipeline the input data streams across a register chain prior to performing the main filtering operation. Compared to the direct method [5], the transversal structure is simple and requires fewer multipliers and adders, thereby reducing the area. Despite the reductions in the number of functional units and the simplicity of the structure, the pipeline structure requires a prohibitively large number of registers.

\[
\begin{aligned}
y(4n+3) &= x(n)\,h_0 + x(n-1)\,h_4 + x(n-2)\,h_8 + \cdots + x(n-11)\,h_{44} \\
y(4n+2) &= x(n)\,h_1 + x(n-1)\,h_5 + x(n-2)\,h_9 + \cdots + x(n-11)\,h_{45} \\
y(4n+1) &= x(n)\,h_2 + x(n-1)\,h_6 + x(n-2)\,h_{10} + \cdots + x(n-11)\,h_{46} \\
y(4n) &= x(n)\,h_3 + x(n-1)\,h_7 + x(n-2)\,h_{11} + \cdots + x(n-11)\,h_{47}
\end{aligned}
\tag{1}
\]

Figure 2 shows the LUT-based FIR filter structure. The operation of a 48-tap 1:4 interpolation FIR filter using the LUT-based structure can be performed without customary arithmetic operations. Instead of storing the filter coefficients in ROM as in classical FIR filter designs, each of the four filter outputs in Equation 1 is pre-calculated and tabulated in ROM for each input transition pattern. Therefore, with a 12-bit input x = {x(n), x(n-1), ..., x(n-11)}, directly implementing Equation 1 in LUT form requires a (2^12 × 11 × 4)-bit ROM table for storing all possible 11-bit output results, y(4n+3), y(4n+2), y(4n+1), and y(4n). Each of the input data bits holds a value of -1, 0, or 1. The input data values serve as direct 12-bit address values to the LUT; hence, the four pre-generated 11-bit output values of the filtering operation are available upon request by a simple memory read operation. Since the filter outputs can be generated using only the input data values, the overall filter processing time can be dramatically reduced. Thus, LUT-based filters are suited for high-speed filtering applications, since uninterrupted filtering can be performed on a streamlined input. However, LUT-based filter designs are plagued by large LUT sizes; hence, LUT area minimization is required for efficient FIR filter realization. To obtain the designed outputs, each of the 11-bit outputs is generated at four times the chip clock rate (four phases).

III. PROPOSED FIR FILTER DESIGN

Despite the fast filtering operation made possible by the use of LUTs, post-synthesis area measurements indicate that the LUT block in these FIR filters still occupies 60% to 99% of the total filter area; thus, reductions in LUT area are required. Since the LUT block occupies a large portion of the area in the FIR filter, the proposed filter structure reduces the LUT size by partitioning, by exploiting coefficient symmetry, and by sharing the partitioned LUTs through multiplexing of the input data streams. With the output stream size fixed, the size of the original LUT can be adjusted by varying the input stream size. Every input bit removed halves the LUT address space, which implies a reduction in LUT size by half. By the same token, by splitting a larger input bit stream into smaller bit-clusters, a single LUT can be partitioned into
multiple smaller LUTs. Table 1 shows various partitioning solutions with respect to a 12-bit input stream.

Table 1
Input data cluster size*         6 bits   3 bits   2 bits   1 bit
Partitioned LUT unit area (a)    32       4        2        1
# LUTs used (b)                  2        4        6        12
Total area (a) × (b)             64       16       12       12
* Cluster sizes from a 12-bit stream input.

The multiway partitioning of the LUT with respect to various data cluster sizes results in reduced total LUT area. Partitioning the LUT by 1-bit clusters yields six partitioned LUTs, resulting in maximal total LUT area reductions. However, due to the increased complexity of the glue logic added by LUT partitioning, the overall area may increase. Moreover, logic synthesis of LUTs with very small bit stream clusters fails to produce the anticipated area reductions, since the input products and SOP terms are not likely to be shared by the outputs during synthesis [9]. Figure 3 shows a single-filter block design with three different LUT partitioning solutions. Each of the n-bit data clusters from the 12-bit shift register is assigned to the appropriate LUT for filtering, and the partial results are added at the output.

Due to the inherent symmetric nature of the low-pass filter coefficients, the upper-half partitioned LUT blocks form mirror images of the lower half. Figure 4 shows a 24-tap 1:4 interpolation FIR filter operation over the input data D_n = {D_0, D_1, D_2, D_3, D_4, D_5}. The 24-element low-pass filter coefficients, h_m = {h_0, h_1, ..., h_23}, are symmetrical across the center [3]. Thus D_n can be split into two 3-data sets, with each set performing a separate filtering operation. Note that prior to performing the filtering operation, data with phase index not equal to '00' are padded with zeros. The proposed 1:4 interpolation FIR filter utilizes 48 filter coefficients, requiring the input of twelve data elements. The twelve data elements are split into two 6-data sets that are filtered separately and added together at the output [2],[5].
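The partitioning idea can be sketched in Python as follows: one filter output is evaluated once with a single table addressed by all 12 bits and once with four smaller tables, each addressed by a 3-bit cluster, whose partial results are added. The mapping of address bits to samples (bit 1 to +1, bit 0 to -1), the helper names, and the coefficient values are assumptions made only for illustration.

```python
from itertools import product


def sample(bit):
    # Assumed mapping: address bit 1 -> +1, address bit 0 -> -1.
    return 1 if bit else -1


def build_lut(coeffs):
    """Tabulate sum(sample(b_k) * c_k) for every bit pattern of len(coeffs)."""
    return {bits: sum(sample(b) * c for b, c in zip(bits, coeffs))
            for bits in product((0, 1), repeat=len(coeffs))}


def build_partitioned_luts(coeffs, cluster):
    """Split the coefficients into clusters and build one small LUT for each."""
    return [build_lut(coeffs[i:i + cluster])
            for i in range(0, len(coeffs), cluster)]


def lookup_partitioned(luts, bits, cluster):
    """Add the partial results read from each small LUT."""
    return sum(lut[tuple(bits[i * cluster:(i + 1) * cluster])]
               for i, lut in enumerate(luts))


# Twelve placeholder coefficients feeding one output (hypothetical values).
phase_coeffs = [3, -5, 7, 11, 20, 41, 41, 20, 11, 7, -5, 3]
full = build_lut(phase_coeffs)                    # 2**12 = 4096 entries
parts = build_partitioned_luts(phase_coeffs, 3)   # four LUTs of 2**3 entries
entries = sum(len(lut) for lut in parts)          # 32 entries in total
```

In this sketch the four 3-bit tables hold 32 entries in total against 4096 for the single table, and the final addition in `lookup_partitioned` plays the role of the glue logic whose cost grows with the number of partitions.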
Generally, the number of multiplications and the coefficient storage can be halved by first adding D_n and its reflected counterpart and then multiplying the sum by a single common coefficient. However, for FIR filters with 1-bit input, this technique is not suitable due to the increased hardware complexity. As mentioned in Sections 1(III) and 2(III), due to the symmetric nature of the filter coefficients, the proposed single-channel filter blocks in Figure 3 employ two sets of partitioned LUT blocks, where one set of LUTs is the mirror image of the other. Since duplicated sets of LUTs signify wasted area, further area optimizations are feasible. Instead of performing a single filtering operation with separate LUT pairs, the FIR filter design has been extended to a more efficient dual-filter design by judiciously using only one set of the mirror-imaged LUT blocks, thereby reducing the LUT area by a fourth. Figure 5 shows the proposed single-channel dual-filter structure. Two 12-bit input data streams (Q data and I data) are divided and clustered into smaller n-bit packets (i.e., 6-bit Q_r, Q_f, I_r, and I_f data clusters) and multiplexed for optimal sharing of the partitioned LUTs. After the LUT-based filtering operation, which takes the form of memory references as mentioned in Section 2(II), the partial results pertaining to Q data and I data are piped to the adders for the final result.

IV. IMPLEMENTATION OF THE PROPOSED FIR FILTER

Figure 6 shows the proposed single-channel dual-filter block implemented by adopting the features addressed in Section III. The proposed FIR filter structure consists of three blocks: the input stage, the partitioned LUT blocks, and the output stage.
[Figure 3. Single-filter block design with three LUT partitioning solutions: (a) two LUTs, (b) four LUTs, (c) six LUTs.]
As mentioned earlier, in each input channel of the QPSK modulator, the quadrature-phase and in-phase signal components form two independent input streams. The Q and I data streams are fed into the proposed single-channel dual-filter block, and FIR-filtered Q and I data streams are output. The input stage of the proposed FIR filter block consists of a set of pipeline registers for the Q and I data streams. Q and I data sets are shifted into the registers as shown in Figure 6. When the register chain is full, the 12 input bits are packed into smaller data clusters and dispatched to the appropriate LUTs via selection multiplexers for filtering. As reported in Table 1, the twelve input bits can be split to share 1~3, and 6 partitioned LUTs. The area of each LUT decreases exponentially with the number of partitioned LUTs; the greater the number of LUT partitions, the smaller the total LUT area. However, the overall area may increase due to adders, multiplexers, intermediate latches, and other glue logic. Thus, exact area estimation requires not just the area of the partitioned LUTs but also that of the glue logic added with each partitioning solution. The proposed single-channel dual-filter block has been implemented using three different partitioning solutions: tri-partitioning, bi-partitioning, and a single LUT. Empirical results indicate that the optimal area reduction for the LUT and surrounding logic is obtained by bi-partitioning the LUT with 3-bit cluster sizes, as shown in the shaded region in Table 1. Figure 6 (b) shows the proposed single-channel dual-filter block obtained by bi-partitioning the LUT. Note that implementing a dual-filter block by directly merging two single-filter structures of Figure 3 requires four times the number of LUTs. However, by exploiting the symmetric properties of the filter coefficients and by sharing the LUTs through multiplexing of clustered data, a more efficient dual-filter scheme can be devised.
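The mirror-image property behind this sharing can be checked with a short Python sketch. Assuming symmetric coefficients h_m = h_(47-m) and the indexing of Equation 1, the contribution of the rear six samples to output number q equals the front-half table read at output number 3-q with the sample order reversed, so one table can serve both halves and, taken in turn, both the I and Q streams. The names, the coefficient values, and the sequential loop standing in for the selection multiplexers are assumptions, not the circuit of Figure 5 or Figure 6.

```python
from itertools import product

# Placeholder symmetric 48-tap coefficients (hypothetical values).
half = [(m * m) % 17 - 8 for m in range(24)]
h = half + half[::-1]  # h[m] == h[47 - m]

# Front-half LUT: for each 6-sample pattern, the four partial outputs
# sum(xs[k] * h[4k + q]) over k = 0..5, for q = 0..3.
front_lut = {
    xs: tuple(sum(xs[k] * h[4 * k + q] for k in range(6)) for q in range(4))
    for xs in product((-1, 1), repeat=6)
}


def filter_shared(x):
    """Four outputs for x = [x(n), ..., x(n-11)] using only the front LUT."""
    front = front_lut[tuple(x[:6])]
    # Rear half: reverse the sample order and read the mirrored output number.
    rear = front_lut[tuple(x[:5:-1])]
    return [front[q] + rear[3 - q] for q in range(4)]


def filter_direct(x):
    """Reference: direct evaluation of Equation 1."""
    return [sum(x[k] * h[4 * k + q] for k in range(12)) for q in range(4)]


def dual_filter(i_data, q_data):
    """I and Q streams take turns on the same shared LUT."""
    return {name: filter_shared(x) for name, x in (("I", i_data), ("Q", q_data))}


i_data = [1, 1, -1, 1, -1, -1, 1, -1, -1, 1, 1, 1]
q_data = [-1, 1, 1, -1, 1, -1, -1, -1, 1, 1, -1, 1]
out = dual_filter(i_data, q_data)
```

Under these assumptions a single 64-entry table replaces the four tables that two independent single-filter blocks would need, at the cost of the extra multiplexing.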
The output stage of the proposed dual-filter block consists of pipelined latches, a tree of adders (10-bit CSA and 11-bit CLA), and a pair of filter output selector latches. After the constant values are fetched from the preceding LUT stage, the partial results pertaining to Q data and I data are piped to the tree of adders in alternating order. The 10-bit CSA and 11-bit CLA accumulate the partial results to form the final filtered value, and the filter output selector latches direct the filtered Q and I data to the corresponding filter outputs.

[Figure 6. Proposed single-channel dual-filter block: (a) single LUT, (b) bi-partitioned LUT, (c) tri-partitioned LUT.]

V. EXPERIMENTAL RESULTS

The proposed 1:4 interpolation FIR filter has been designed for different LUT partitioning solutions. The design has been simulated with Verilog-XL and synthesized with Design Compiler. The power measurements were performed with the PowerMill circuit simulator at the logic level using Samsung's STD70-0.6µm process standard cell library. Post-synthesis sessions have been carried out for the proposed FIR filter implementation, obtaining area and power measures for 3-channel 6-filter configurations. These evaluation results have been compared against three conventional FIR filter implementations under equivalent configurations: the 4-bank filter block employing ROM-based LUTs, the transversal filter block, and the single-architecture dual-channel filter block.

Table 2 reports the area profiles of the proposed FIR filter compared against conventional pulse-shaping 1:4 interpolated FIR filter implementations. The relative costs of all the implementations are given with respect to the single-architecture dual-channel filter design, which has been popularly employed in 3-channel W-CDMA mobile station modulators. The proposed LUT-partitioned FIR filter design shows an area reduction of 40%. Note that implementing a 3-channel 6-filter modulator requires 2 blocks of the single-architecture dual-channel filter. The proposed single-channel dual-filter block design is readily scalable to an N-channel 2N-filter ensemble, easily meeting the 3-channel 6-filter specification of the W-CDMA modulator. Among the proposed filter implementations shown in Table 2, type 2 (bi-partitioning the LUT) shows the best area savings.

Table 2. Area measures
Filter type                                  Gate count   # blocks used   Total gate count   Relative cost
4-bank filter employing ROM-based LUTs       15,213.0     6               91,278.0           1,612.1
Transversal filter                           4,656.0      6               27,936.0           493.4
Single-architecture dual-channel filter      2,831.0      2               5,662.0            100.0*
Proposed, Type 1: Tri-partitioned LUT        1,301.5      3               3,904.0            69.0
Proposed, Type 2: Bi-partitioned LUT         1,118.5      3               3,355.5            59.3
Proposed, Type 3: Single LUT                 1,361.0      3               4,083.0            72.1
* Experimental results are compared against the single-architecture dual-channel filter.

Table 3 reports the total power consumption of the proposed FIR filter compared against conventional filters. To measure the average power consumption, the synthesized cells were analyzed and characterized with EPIC's PowerMill for 24,000 ns. An overall decrease in power consumption is observed for all types of the proposed filter implementations despite little power optimization effort. Note that even at twice the clock speed, the proposed filters consume power comparable to or less than that of conventional designs. Among the proposed filter implementations shown in Table 3, type 2 (bi-partitioning the LUT) shows the best power and area savings.

Table 3. Power measures
Filter type                                  Clock frequency (MHz)   Filter power consumption (mW)   # blocks used   Total power (mW)
4-bank filter employing ROM-based LUTs       19.68                   46.2                            6               277.2
Transversal filter                           19.68                   49.1                            6               294.6
Single-architecture dual-channel filter      19.88                   31.0                            2               62.0
Proposed, Type 1: Tri-partitioned LUT        39.76                   42.5                            3               127.5
Proposed, Type 2: Bi-partitioned LUT         39.76                   26.8                            3               80.4
Proposed, Type 3: Single LUT                 39.76                   31.6                            3               94.8
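The relative costs in the area table and the totals in the power table follow from simple arithmetic, which the Python sketch below reproduces. The dictionary keys are shorthand labels chosen here for illustration, and the numbers are the table entries as read with the missing digits restored from the row totals.

```python
# Total gate counts (all blocks included) from the area table.
total_gates = {
    "4-bank ROM-based LUT": 91278.0,
    "Transversal": 27936.0,
    "Single-architecture dual-channel": 5662.0,
    "Proposed, tri-partitioned": 3904.0,
    "Proposed, bi-partitioned": 3355.5,
    "Proposed, single LUT": 4083.0,
}

# (power per block in mW, number of blocks) from the power table.
block_power = {
    "4-bank ROM-based LUT": (46.2, 6),
    "Transversal": (49.1, 6),
    "Single-architecture dual-channel": (31.0, 2),
    "Proposed, tri-partitioned": (42.5, 3),
    "Proposed, bi-partitioned": (26.8, 3),
    "Proposed, single LUT": (31.6, 3),
}

# Relative cost: total gate count as a percentage of the reference design.
baseline = total_gates["Single-architecture dual-channel"]
relative_cost = {name: round(100.0 * gates / baseline, 1)
                 for name, gates in total_gates.items()}

# Total power: per-block power multiplied by the number of blocks.
total_power = {name: round(power * blocks, 1)
               for name, (power, blocks) in block_power.items()}

# Area saving of the bi-partitioned design relative to the reference.
area_saving = round(100.0 - relative_cost["Proposed, bi-partitioned"], 1)
```

With these inputs the bi-partitioned design comes out at a relative cost of 59.3, that is, an area saving of 40.7 with respect to the single-architecture dual-channel filter.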
The low-energy profile can be attributed to shorter critical paths due to smaller LUTs and to retiming effects induced by the latches lying between the LUTs and the filter output.

VI. CONCLUSION AND FUTURE RESEARCH

In this paper, we described the design of an area-efficient 1:4 interpolation FIR filter based on LUT partitioning. Six filtering blocks (3 channels of in-phase and quadrature-phase input signals) reside within the core blocks of the W-CDMA mobile station modulator. To obtain an area-efficient FIR filter implementation, conventional designs have centered their efforts on reducing the LUT size, since the LUT block occupies 60% to 99% of the total area of each FIR filter. The proposed filter structure reduces the LUT size by partitioning, by exploiting coefficient symmetry, and by sharing the partitioned LUTs through multiplexing of the input data streams. Reductions of more than 40% in gate area have been obtained for the proposed pulse-shaping filter when compared to conventional FIR filter blocks. Despite no low-power design effort, the proposed FIR filter consumes power comparable to that of the popular single-architecture dual-channel filter. For portable communication system applications, future research warrants a power-conscious design effort employing signal transition analysis.

Acknowledgement: This research was supported by the Sogang University Research Grants in 2000.

REFERENCES
[1] K. Yeon et al., 'Design of Chip Set for CDMA Mobile System,' ETRI Journal, Vol. 19, No. 3, Oct. 1997, pp. 228~241.
[2] A. Oppenheim and D. Manolakis, Discrete-Time Signal Processing, Prentice-Hall: Englewood Cliffs, New Jersey, 1989.
[3] J. Holmes, Coherent Spread Spectrum Systems, Wiley: New York, 1982.
[4] S. Glisic and P. Leppanen, Code Division Multiple Access Communications, Kluwer Academic Publishers, 1995.
[5] J. Proakis and D. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications, Prentice-Hall: Upper Saddle River, New Jersey, 1996.
[6] R. Peterson, R. Ziemer, and D. Borth, Introduction to Spread-Spectrum Communications, Prentice Hall: Englewood Cliffs, New Jersey, 1995.
[7] I. Kang, K. Yeon, H. Jo, J. Chong, and K. Kim, 'Multiple 1:N Interpolation FIR Filter Design Based on a Single Architecture,' in Proc. IEEE Int. Symposium on Circuits and Systems, Vol. 2, May 1998, pp. 16~19.
[8] G. Do and K. Feher, 'Efficient Filter Design for IS-95 CDMA Systems,' IEEE Trans. on Consumer Electronics, Vol. 42, No. 4, Nov. 1996, pp. 1011~1020.
[9] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, 2nd Edition, Addison-Wesley, 1993.
Sun-Young Hwang received the B.S. degree in electronic engineering from Seoul National University, Seoul, Korea, in 1976, the M.S. degree from Korea Advanced Institute of Science in 1978, and the Ph.D. degree in electrical engineering from Stanford University, California, U.S.A., in 1986. Since 1986, he has been with the Center for Integrated Systems at Stanford University, working on design of a high-level synthesis and simulation system. In 1986 and 1987, he held a consulting position at Palo Alto Research Center of Fairchild Semiconductor Corporation. In 1989, he joined the Department of Electronic Engineering at Sogang University, where he is now a professor. His current research interests include hardware/software co-design and DSP/VLSI systems design. E-mail: hwang@ccs.sogang.ac.kr Tel: +82-2-705-8469 Fax: +82-2-272-220

Jong-Kwan Choi received the B.S. degree in electronic engineering from Sogang University, Seoul, Korea, in 1992. From 1992 to 1995, he was a senior engineer at the ASIC center (Semiconductor division) of Daewoo Telecom., LTD. He joined IAE as a research engineer in 1996, and he is currently working towards the M.S. degree in electrical engineering at Sogang University. His current research interests include communication/network (ATM & Ethernet) system IC design and mobile communication/modem design. E-mail: cjksgu@yahoo.com Tel: +82-2-705-8469 Fax: +82-2-272-220

Sik Kim received the B.S. and M.S. degrees in electronic engineering from Sogang University, Seoul, Korea, in 1994 and 1996, respectively. He is currently working towards the Ph.D. degree in electronic engineering at Sogang University. His current research interests include digital VLSI system design, high-speed computer architecture, and CAD system development. E-mail: cosmos@eecad.sogang.ac.kr Tel: +82-2-705-8469 Fax: +82-2-272-220