
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 61, NO. 4, FEBRUARY 15, 2013

A High-Performance Energy-Efficient Architecture for FIR Adaptive Filter Based on New Distributed Arithmetic Formulation of Block LMS Algorithm

Basant K. Mohanty, Senior Member, IEEE, and Pramod Kumar Meher, Senior Member, IEEE

Abstract — In this paper, we present an efficient distributed-arithmetic (DA) formulation for the implementation of the block least mean square (BLMS) algorithm. The proposed DA-based design uses a novel look-up table (LUT)-sharing technique for the computation of the filter outputs and the weight-increment terms of the BLMS algorithm. Besides, it offers significant saving of adders, which constitute a major component of DA-based structures. Also, we have suggested a novel LUT-based weight-updating scheme for the BLMS algorithm, where only one set of LUTs out of $N/L$ sets needs to be modified in every iteration, where $N$ and $L$ are, respectively, the filter length and the input block-size. Based on the proposed DA formulation, we have derived a parallel architecture for the implementation of the BLMS adaptive digital filter (ADF). Compared with the best of the existing DA-based LMS structures, the proposed one trades a larger number of adders and LUT words for nearly $L$ times the throughput of the other. It requires nearly 25% more flip-flops and does not involve variable shifters like those of the existing structures. It involves less LUT access per output (LAPO) than the existing structure for block-sizes higher than 4. For block-size 8 and filter length 64, the proposed structure involves 2.47 times more adders, 15% more flip-flops, and 43% less LAPO than the best of the existing structures, and offers 5.22 times higher throughput. The number of adders of the proposed structure does not increase proportionately with the block-size, and the number of flip-flops is independent of the block-size.
This is a major advantage of the proposed structure for reducing its area-delay product (ADP), particularly when a large-order ADF is implemented for higher block-sizes. ASIC synthesis results show that the proposed structure for filter length 64 has almost 14% and 30% less ADP, and 25% and 37% less energy per output (EPO), than the best of the existing structures for block-sizes 4 and 8, respectively.

Index Terms — Adaptive filters, block LMS, distributed arithmetic, VLSI.

I. INTRODUCTION

ADAPTIVE DIGITAL FILTERS (ADFs) are widely used in various signal-processing applications, such as echo cancellation, system identification, noise cancellation, and channel equalization [1]. Amongst the existing ADFs, the least mean square (LMS)-based finite impulse response (FIR) adaptive filter is the most popular one due to its inherent simplicity and satisfactory convergence performance. However, the delay in the availability of the feedback error for updating the weights according to the LMS algorithm does not favor its pipelined implementation when the sampling rate is high. Haimi et al. [2] have proposed the delayed LMS (DLMS) algorithm for pipelined implementation of LMS-based ADFs.

Manuscript received June 18, 2012; accepted October 07, 2012. Date of publication October 25, 2012; date of current version January 25, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Zhiyuan Yan. B. K. Mohanty is with the Department of Electronics and Communication Engineering, Jaypee University of Engineering and Technology, Raghogarh, Guna, Madhya Pradesh, India-473226 (e-mail: bk.mohanti@juet.ac.in). P. K. Meher is with the Institute for Infocomm Research, 1 Fusionopolis Way, Singapore-138632 (e-mail: pkmeher@i2r.a-star.edu.sg; url: http://www1.i2r.astar.edu.sg/~pkmeher/). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSP.2012.2226453
The delayed LMS is similar to the LMS algorithm except that the correction terms for updating the filter weights of the current iteration are calculated from the error corresponding to a past iteration. Several schemes have been proposed to implement DLMS-based ADFs efficiently in systolic VLSI with minimum adaptation delay [2]–[4], [7], [8]. To avoid adaptation delay in pipelined LMS ADFs, Poltmann [5] has proposed a modified DLMS algorithm, which was used by Douglas et al. [6] to derive a systolic architecture. But the structure of [6] involves a large amount of hardware resources compared with the earlier one [2]. The block LMS (BLMS) ADF [9] is one of the useful derivatives of the LMS ADF for fast and computationally efficient implementation of ADFs. Unlike the conventional LMS ADF, the BLMS ADF accepts a block of input samples to compute a block of output samples, and updates the weights using a block of errors in every training cycle. The BLMS ADF has convergence performance similar to that of the LMS ADF, but a BLMS ADF of block-length $L$ offers an $L$-fold higher throughput compared with the other. Keeping this in view, many variants of the BLMS algorithm, such as the time- and frequency-domain block filtered-x LMS (BFXLMS), have been proposed for specific applications [20]. Das et al. [21] have proposed an efficient BFXLMS using the FFT and the fast Hartley transform (FHT), which is computationally more efficient. We have proposed a delayed block LMS (DBLMS) algorithm [15], and a concurrent multiplier-based architecture for high-throughput pipelined implementation of BLMS ADFs. The structure of [15] provides an $L$-fold higher throughput rate and demands proportionately more resources compared with those of the DLMS ADF. Baghel et al. [17], [18] have suggested a distributed-arithmetic (DA)-based structure for FPGA implementation of BLMS ADFs. A low-complexity design has been proposed in [19] for BLMS ADFs.
This structure, however, supports only a very low sampling rate, since it uses a single multiply-accumulate (MAC) cell for the computation of the filter output and the weight-increment term. To take advantage of DA-based hardware designs [12], Allred et al. [10] have suggested a scheme to derive a DA-based design for the LMS ADF. The structure of [10] requires separate

look-up tables (LUTs) for the calculation of the filter output and the weight-increment terms. The LUTs used for the computation of the filter output and the weight-increment term of the DA LMS ADF are named DA-F-LUT and DA-A-LUT, respectively. In every iteration, the entire content of the DA-F-LUT is updated to compute the weight-increment term, while half the content of the DA-A-LUT is updated to accommodate the new input sample arriving at the current iteration. Updating the LUTs is the most time-consuming operation in the DA-based LMS ADF, since the updating is performed sequentially at different LUT locations. The LUT-update time, therefore, depends on the size of the LUT to be updated. For most practical adaptive filters, we need to use a decomposition scheme, so that small LUTs can be used in the DA-based LMS ADF, which helps in reducing not only the LUT size but also the LUT-update time. Recently, Guo et al. [16] have suggested a scheme to avoid the DA-A-LUT in the DA-based LMS ADF, where both filtering and weight updating are performed using the DA-F-LUT. On the other hand, the throughput rate of existing DA LMS ADFs could be too slow for real-time applications due to the bit-serial nature of DA computation. Although there is some interesting work on DA-based LMS ADFs [10], [16], we find that the potential application of DA to the implementation of BLMS ADFs is yet to be explored. In order to reduce the power consumption of DA-based designs, we aim at reducing both the number of words in the LUTs and the number of LUT accesses. A DA-based BLMS ADF structure could be derived by extending the scheme of [10], but such a structure would demand proportionately more hardware (memory and combinational logic) for its higher throughput rate.
The scheme of [16] offers sharing of the LUT for the computation of both the filter output and the weight-increment term, but this scheme cannot be applied to derive a DA-based structure for BLMS ADFs, because separate inner-product computations (IPCs) are performed to calculate the filter output and the weight-increment term of a BLMS ADF, whereas in the case of the LMS ADF, an IPC is performed to calculate the filter output only. In this paper, we have formulated the DA-BLMS algorithm for sharing of LUTs between the computation of the filter outputs and the weight-increment terms. The key contributions of this paper are:

- A DA-based formulation of the BLMS algorithm where both the convolution operation to compute the filter output and the correlation operation to compute the weight-increment term can be performed using the same LUT.

- A novel approach for minimization of the number of LUT words to be updated per output, which helps to save external logic and power consumption.

- A DA-based structure for the BLMS ADF, derived using the proposed DA formulation and a novel LUT-updating scheme.

The most remarkable aspect of the proposed scheme is that the number of adders required by the structure does not increase proportionately with the filter order, and the number of flip-flops required by the structure is independent of the block-size. Apart from that, the proposed structure has significantly fewer LUT accesses than the existing DA LMS structure for higher block-sizes. The rest of this paper is organized as follows: the mathematical formulation is presented in Section II. The new LUT-update scheme is discussed in Section III, and the proposed structure for the DA-based BLMS ADF is presented in Section IV. The hardware and time complexities of the proposed structure are discussed in Section V. The conclusion is presented in Section VI.

II.
MATHEMATICAL FORMULATION

The BLMS algorithm for updating the filter weights in the $k$-th iteration is given by

$$\mathbf{w}_{k+1} = \mathbf{w}_k + \mu\,\Delta\mathbf{w}_k \qquad (1)$$

where $\Delta\mathbf{w}_k$ is defined as

$$\Delta\mathbf{w}_k = \mathbf{X}_k^T \mathbf{e}_k \qquad (2)$$

$\mathbf{w}_k$ and $\mathbf{e}_k$ are, respectively, the weight-vector and the error-vector of the $k$-th iteration, defined as $\mathbf{w}_k = [w_k(0), w_k(1), \ldots, w_k(N-1)]^T$ and $\mathbf{e}_k = [e_k(0), e_k(1), \ldots, e_k(L-1)]^T$, where $\mu$ is the step-size, and the $L \times N$ input matrix $\mathbf{X}_k$ is derived from the current input block of length $L$ and the $N-1$ past samples. The error-vector is computed as

$$\mathbf{e}_k = \mathbf{d}_k - \mathbf{y}_k \qquad (3)$$

where the desired-response vector is defined as $\mathbf{d}_k = [d_k(0), d_k(1), \ldots, d_k(L-1)]^T$. The $k$-th block of filter output is computed by the matrix-vector product

$$\mathbf{y}_k = \mathbf{X}_k \mathbf{w}_k \qquad (4)$$

A. Computation of Filter Output

The input matrix $\mathbf{X}_k$ of size $L \times N$ can be decomposed into $P$ square matrices of size $L \times L$ each, where $P = N/L$. Similarly, the weight vector can be decomposed into $P$ short weight-vectors of size $L$, for $0 \le m \le P-1$. The computation of (4) can then be expressed as the sum of $P$ matrix-vector products

$$\mathbf{y}_k = \sum_{m=0}^{P-1} \mathbf{S}_k^m \mathbf{w}_k^m \qquad (5)$$

where $\mathbf{S}_k^m$ is the $m$-th $L \times L$ sub-matrix of $\mathbf{X}_k$ and $\mathbf{w}_k^m$ is the $m$-th $L$-point sub-vector of $\mathbf{w}_k$.
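As a concrete reference for the recursion above, the following is a minimal NumPy sketch of BLMS; the variable names, the explicit construction of the input matrix, and the system-identification usage below are our own illustration, not the paper's implementation:

```python
import numpy as np

def blms(x, d, N, L, mu, iters):
    """Block LMS: filter length N, block-size L, step-size mu.
    Processes input x against desired response d in blocks of L."""
    w = np.zeros(N)                      # weight-vector w_k
    x = np.asarray(x, dtype=float)
    y_out = []
    for k in range(iters):
        # Input matrix X_k: row j holds the N most recent samples
        # ending at time kL + j (zero-padded before the first sample).
        X = np.zeros((L, N))
        for j in range(L):
            t = k * L + j
            for i in range(N):
                if t - i >= 0:
                    X[j, i] = x[t - i]
        y = X @ w                        # block of L filter outputs, per (4)
        e = d[k * L:(k + 1) * L] - y     # error block, per (3)
        w = w + mu * (X.T @ e)           # block weight update, per (1)-(2)
        y_out.extend(y)
    return w, np.array(y_out)
```

Run on a noiseless system-identification task (hypothetical test setup), the weight-vector converges to the unknown FIR response, which is the convergence behavior the paper attributes to BLMS.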

Each filter output can now be written as the sum of $P$ inner-products as

$$y_k(j) = \sum_{m=0}^{P-1} t_j^m \qquad (6)$$

where $t_j^m$ is an $L$-point inner-product of an input-vector $\mathbf{x}_j^m$ and the weight-vector $\mathbf{w}^m$, given by

$$t_j^m = (\mathbf{x}_j^m)^T \mathbf{w}^m \qquad (7)$$

and $\mathbf{x}_j^m$ is the $j$-th row of $\mathbf{S}_k^m$, for $0 \le j \le L-1$ and $0 \le m \le P-1$. Note that we have dropped the subscript $k$ of $\mathbf{w}_k^m$ in (7) only for convenience of further discussion, without loss of generality.

B. Computation of Weight-Increment Term

The weight-increment vector $\Delta\mathbf{w}_k$ can be decomposed into $P$ short vectors of size $L$ each, for $0 \le m \le P-1$. The computation of (2) can be performed through $P$ independent matrix-vector multiplications using the relation

$$\Delta\mathbf{w}^m = (\mathbf{S}_k^m)^T \mathbf{e}_k \qquad (8)$$

where $\Delta\mathbf{w}^m$ is defined as

$$\Delta\mathbf{w}^m = [\Delta w^m(0), \Delta w^m(1), \ldots, \Delta w^m(L-1)]^T \qquad (9)$$

Using (8), the individual weight-increment terms can be evaluated as

$$\Delta w^m(j) = r_j^m \qquad (10)$$

where $r_j^m$ is the inner-product between an input-vector and the error-vector $\mathbf{e}_k$, given by

$$r_j^m = (\tilde{\mathbf{x}}_j^m)^T \mathbf{e}_k \qquad (11)$$

where $\tilde{\mathbf{x}}_j^m$ denotes the $j$-th column of $\mathbf{S}_k^m$. Here also we have dropped the subscript $k$ for convenience of further discussion. As shown in (7) and (11), due to the shift structure of the input matrix, each column of $\mathbf{S}_k^m$ consists of the same set of samples as one of its rows, so that the same input-vector serves a pair of inner-products $t$ and $r$. This is a major advantage for optimizing the LUTs when the inner-products of (7) and (11) are performed using the DA principle.

C. DA-Formulation

Let $w_i$ and $e_i$, respectively, be the $i$-th components of the $L$-point vectors $\mathbf{w}^m$ and $\mathbf{e}_k$, assumed to be $B$-bit numbers in 2's-complement representation:

$$w_i = -w_{i,B-1}\,2^{B-1} + \sum_{b=0}^{B-2} w_{i,b}\,2^{b} \qquad (12a)$$

$$e_i = -e_{i,B-1}\,2^{B-1} + \sum_{b=0}^{B-2} e_{i,b}\,2^{b} \qquad (12b)$$

where $w_{i,b}$ and $e_{i,b}$ are the $b$-th bits of $w_i$ and $e_i$, respectively. Substituting (12a) in (7), we have

$$t_j^m = \sum_{i=0}^{L-1} x_j^m(i)\left(-w_{i,B-1}\,2^{B-1} + \sum_{b=0}^{B-2} w_{i,b}\,2^{b}\right) \qquad (13)$$

Rearranging the order of summations, (13) may otherwise be expressed as

$$t_j^m = \sum_{b=0}^{B-1} s_b\,2^{b}\left(\sum_{i=0}^{L-1} x_j^m(i)\,w_{i,b}\right) \qquad (14)$$

where $s_b = -1$ for $b = B-1$ and $s_b = 1$ otherwise. Each term of the inner sum in (14) represents the inner-product of the input-vector with a bit-vector (or bit-slice) of the weight-vector. Corresponding to the $2^L$ possible values of a bit-vector of length $L$, there could be $2^L$ possible values of such inner-products of the input-vector with any possible bit-vector of length $L$. All those $2^L$ possible inner-products could be pre-computed and stored in an LUT, such that when the $b$-th bit-vector (or bit-slice) of the weight-vector, for $0 \le b \le B-1$, is fed to the LUT as an address, its inner-product with the input-vector is read from the LUT.
The computation of the inner sum of (14), therefore, could be expressed in the form of a memory-read operation as

$$\sum_{i=0}^{L-1} x_j^m(i)\,w_{i,b} = F(\mathbf{w}_b) \qquad (15)$$

where $F(\cdot)$ is a memory-read operation, and its argument $\mathbf{w}_b = [w_{0,b}, w_{1,b}, \ldots, w_{L-1,b}]$, for $0 \le b \le B-1$, is used as the LUT-address. The inner-product of (11) may, similarly, be expressed in the form of a memory-read operation as

$$\sum_{i=0}^{L-1} \tilde{x}_j^m(i)\,e_{i,b} = F(\mathbf{e}_b) \qquad (16)$$

where $\mathbf{e}_b$ is the $b$-th bit-vector of the error-vector, defined as $\mathbf{e}_b = [e_{0,b}, e_{1,b}, \ldots, e_{L-1,b}]$, which is used as the address of an LUT to read its inner-product with the input-vector. The LUT contents for the computation of (15) and (16) are exactly the same, since the LUT content depends only on the input-vector, and is generated for all possible bit-slices of $L$-bit length, irrespective of whether a bit-slice comes from the weight-vector or the error-vector. When the bit-vector $\mathbf{w}_b$ is used as an address, the partial results of the filter output are read from the LUT, and when $\mathbf{e}_b$ is used as an address, the partial results of the weight-increment term are read from the same LUT. Therefore, using the proposed scheme, a common set of LUTs could be used for the computation of the filter outputs and the weight-increment terms. Since the block of input samples changes in every iteration, the LUTs are required to be updated in every iteration to accommodate the new input-block. In the next section, we present a novel LUT-updating scheme for DA-based BLMS ADFs.
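The LUT mechanism of (14)-(16) can be emulated in software: precompute the $2^L$ candidate sums for one input-vector, then form the inner-product by addressing that table with bit-slices of the weights, adding the sign (MSB) slice negatively. A minimal sketch, with our own function names and an integer 2's-complement weight format:

```python
def build_lut(x):
    """Precompute all 2^L partial inner-products of the L-point
    input-vector x with every possible L-bit bit-vector, per (15)."""
    L = len(x)
    return [sum(x[i] for i in range(L) if (addr >> i) & 1)
            for addr in range(1 << L)]

def da_inner_product(x, w_int, B):
    """Inner product of x with weights w_int (stored as B-bit
    two's-complement patterns) using only LUT reads and
    shift-accumulation, per (14)."""
    L = len(x)
    lut = build_lut(x)
    acc = 0.0
    for b in range(B):                      # LSB-to-MSB bit-slices
        addr = 0
        for i in range(L):
            addr |= ((w_int[i] >> b) & 1) << i
        term = lut[addr]
        # The MSB slice carries the sign bits: subtract instead of add.
        acc += (-term if b == B - 1 else term) * (1 << b)
    return acc
```

Feeding the bit-slices of an error-vector instead of the weight-vector to the same `lut` reproduces the sharing claimed for (16).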

Fig. 1. (a) Inner-products of an FIR filter of length $N = 6$ and block-size $L = 2$. The input-vector corresponding to each inner-product is shown inside the box. (b) LUT arrangement for DA-based computation of the FIR filter of $N = 6$ and $L = 2$. Each LUT here stores the $2^L$ possible values of the partial inner-product of an input-vector and a bit-vector of length $L$, for $0 \le j \le L-1$ and $0 \le m \le P-1$.

III. LUT-UPDATING SCHEME

Before we discuss the proposed LUT-updating scheme, we summarize here the proposed decomposition of the input-matrix and the weight-vector into small vectors, and their participation in the inner-product computation for the filtering operation. The input-matrix of size $L \times N$ is decomposed into $P$ square matrices of size $L \times L$, and the weight-vector is decomposed into $P$ short-vectors of size $L$, for $0 \le m \le P-1$, where $P = N/L$. Each of the $L$ rows of $\mathbf{S}_k^m$ represents an input-vector, so that $L$ such input-vectors are derived from each $\mathbf{S}_k^m$, and $N$ such input-vectors in total are derived from the $P$ sub-matrices. All these input-vectors are arranged in $L$ rows and $P$ columns such that the input-vectors of $\mathbf{S}_k^m$ belong to the $m$-th column. According to (5), the $P$ weight-vectors are multiplied independently with the $P$ matrices, which, in total, involves $N$ inner-products. According to (6), the results of the $P$ inner-products corresponding to each row of input-vectors are added together to obtain a filter output. From the $L$ such rows of inner-products, $L$ filter outputs are obtained. We illustrate the aforementioned scheme for the implementation of an FIR filter of length $N = 6$ and block-size $L = 2$. Suppose, during the $k$-th iteration, the filter receives an input-block and computes a block of output. As discussed above, the input-matrix of size 2 × 6 is decomposed into 3 square matrices $\mathbf{S}_k^0$, $\mathbf{S}_k^1$, and $\mathbf{S}_k^2$ of size 2 × 2, each consisting of a pair of input-vectors. The 6-point weight-vector is decomposed into three 2-point weight-vectors. Fig.
1(a) shows the arrangement of the input-vectors and weight-vectors; the corresponding inner-products are shown on top of the rectangular boxes for clarity. Results of the odd-numbered inner-products (on the upper row) and the even-numbered inner-products (on the lower row) are added separately (not shown in the figure) to obtain the two filter outputs, respectively.

Fig. 2. DA-based computation of the block FIR filter for $N = 6$ and $L = 2$: (a) for the $(k+1)$-th iteration; (b) for the $(k+2)$-th iteration.

As shown in Fig. 1(a), the same weight-vector is used for the computation of the inner-products of a particular column of input-vectors. For DA realization, the LUT corresponding to each input-vector stores the partial inner-products generated by the inner-product of that input-vector with all possible values of a bit-vector of length $L$. The DA-based parallel computation of the filter outputs of Fig. 1(a) for the $k$-th iteration is shown in Fig. 1(b). As shown in Fig. 2(a), the DA-based structure receives an input-block during the $(k+1)$-th iteration, so that two new samples enter into the set of 7 samples, and the two oldest samples are discarded. Consequently, the samples of all 6 input-vectors are changed, but this occurs in a particular order. We can find from Fig. 1(b) and Fig. 2(a) that the contents of only the first column of LUTs of Fig. 2(a) are changed by the new samples, while in the other columns, the LUT values remain the same; only the positions of those unchanged LUTs are shifted right by one column. For instance, the values stored in the LUTs of the second column of Fig. 2(a) are the same as the values stored in the LUTs of the first column of Fig. 1(b), and similarly, the values stored in the LUTs of the third column of Fig. 2(a) are the same as those of the second column of Fig. 1(b). This feature can be observed in the LUT contents of Fig. 2(b) for the $(k+2)$-th iteration also. In other words, the contents of a particular column of LUTs during a particular iteration are simply transferred to the adjacent column of LUTs on its right during the next iteration.
In this way, the oldest input samples of a particular set are shifted out through the $P$-th column (the third column in the example) of LUTs, and new values are entered at the first column of LUTs. Shifting values physically from one LUT to the next across the array of LUTs is highly time- and power-consuming. Therefore, we have proposed a novel LUT-updating scheme where the LUT contents need not be shifted. Since each column of LUTs uses the same weight-vector as the LUT-address, the column-wise right-shift of LUT values can be achieved by a left-shift of the weight-vectors. This technique could save a lot of time and power, since the shifting of weight-vectors is significantly less expensive than the shifting of LUT contents. In the proposed LUT-update scheme, the contents of only one column

of LUTs out of the 3 such columns (for $P = 3$) need to be updated in every iteration. We can find from Fig. 1(b) and Fig. 2(a) that the values of the third-column LUTs of the $k$-th iteration are not used during the $(k+1)$-th iteration, since they correspond to the oldest block of samples. The LUTs of the third column are therefore updated, as shown in grey in Fig. 3(a). To feed the weight-vectors to the LUTs of Fig. 3(a) in the same order as that of Fig. 2(a), the weight-vectors of Fig. 1(b) are simply left-shifted by one position. As shown in Fig. 3(a), the second column of LUTs contains the values corresponding to the oldest block of samples of the $(k+2)$-th iteration; that input-block is discarded, and the corresponding LUTs are updated with the partial inner-products of the new input-block. The weight-vectors of Fig. 3(a) are left-shifted by one column and fed to the LUTs of Fig. 3(b) as addresses.

Fig. 3. (a) Equivalent DA-based structure of Fig. 2(a), derived from the structure of Fig. 1(b) by changing the contents of the 5th and 6th LUTs (shown in grey) and left-shifting the weight-vectors by one position. (b) Equivalent DA-based structure of Fig. 2(b), derived from the structure of Fig. 3(a) by changing the contents of the 3rd and 4th LUTs (shown in grey) and left-shifting the weight-vectors by one position.

In the following, we summarize the proposed scheme for updating the LUTs of the BLMS-based adaptive filter: the LUTs are updated column-by-column, one column per iteration, in a cyclic order. The LUTs which store the values of partial inner-products corresponding to the samples of the oldest input-block are overwritten by those of the new input-block. The weight-vectors are circularly left-shifted after every iteration, so that the columns of LUTs are read in the correspondingly rotated order.
The values required for updating a column of LUTs for any particular iteration are calculated from the $L$ samples of the current input-block and the $L-1$ most recent samples of the previous block. Based on the above scheme, the LUT-matrix is updated column-by-column, from right to left, after every iteration, proceeding cyclically over the columns. Hence, the LUTs of one particular column are updated once in a period of $P$ iterations.

IV. PROPOSED ARCHITECTURE

The proposed DA-BLMS structure is comprised of one DA-module, one error bit-slice generator (EBSG), and one weight-update cum bit-slice generator (WBSG). The WBSG updates the filter weights and generates the required bit-vectors in accordance with the DA formulation. The EBSG computes the error block according to (3) and generates its bit-vectors. The DA-module updates the LUTs and makes use of the bit-vectors generated by the WBSG and the EBSG to compute the filter output and the weight-increment terms according to (15) and (16).

Fig. 4. Proposed DA-based structure for the implementation of BLMS adaptive FIR filters (for $N = 16$ and $L = 4$).

A. Structure for Block-Size 4

The proposed structure of the DA-based BLMS adaptive filter for $N = 16$ and $L = 4$ is shown in Fig. 4. The DA-module receives a block of $L$ input samples in every iteration and computes a block of $L$ filter outputs. It also receives a block of $L$ errors in every iteration and computes the weight-increment terms for all the components of the weight-vector. The structure of the proposed DA-module is shown in Fig. 5. It consists of 4 identical processing elements (PEs), one LUT-update block, and one MUX-array. The structure of a PE is shown in Fig. 6. It consists of 4 identical subcells (SCs). The internal structure and function of an SC are shown in Fig. 7.
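The column-wise update policy summarized in Section III behaves like a circular buffer of LUT columns: one column is overwritten per iteration, and the mapping from weight-vectors to columns is rotated, in place of physically moving table contents. A toy emulation of that bookkeeping (the class and its method names are our own illustration, not the paper's hardware):

```python
class CyclicLUTBank:
    """P LUT columns; each iteration overwrites only the column that
    holds the oldest block, and reads are redirected by a rotating
    index instead of shifting table contents between columns."""

    def __init__(self, P):
        self.P = P
        self.cols = [None] * P   # cols[c] holds the table for one block
        self.newest = -1         # physical column of the newest block

    def update(self, table):
        """One write per iteration: reuse the oldest block's column."""
        self.newest = (self.newest + 1) % self.P
        self.cols[self.newest] = table

    def read(self, age):
        """Table for the block that is `age` iterations old (0 = newest),
        i.e. the column a physical right-shift would have placed it in."""
        return self.cols[(self.newest - age) % self.P]
```

Only one `update` write occurs per iteration, matching the once-in-$P$-iterations refresh of each column described above.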
As required by (15), the LUT of each SC of a PE stores the 16 possible values of the partial inner-product corresponding to its $L$-point input-vector. The LUT-update block of the DA-module generates the required values to update the LUTs of a particular PE. The structure of the LUT-update block is shown in Fig. 8. It consists of one adder-block and an input delay unit (IDU), which stores samples of the previous block. During each iteration, the adder

Fig. 5. Structure of the DA-module of the proposed DA-BLMS ADF (for $N = 16$ and $L = 4$). The signal subscripts vary cyclically from 0 to $P-1$.

Fig. 6. Internal structure of a processing element (PE) of the DA-module for block-size $L = 4$.

Fig. 7. Internal structure and function of a subcell (SC) of a PE. The convergence factor $\mu$ is assumed to be a power of 2.

block receives $2L-1$ samples ($L$ samples from the current input-block and $L-1$ past samples from the IDU), and feeds these samples to the adder-cells (ACs) (see Fig. 8), such that each AC receives $L$ samples, and the input sample-sets of adjacent ACs overlap by $L-1$ samples. For block-size $L = 4$, each AC receives a block of four samples in every iteration (shown in Fig. 9). As shown in the figure, each of the four inputs of the AC is ANDed with a bit of the four-bit address by four AND cells. Each AND cell consists of $W$ AND gates, where $W$ is the word-length of the input samples. All the AND gates of an AND cell are fed with one bit of the address, while the other inputs of the AND gates are fed with the bits of the corresponding input sample. The outputs of the AND cells are fed to an adder-tree (AT). Each AC receives the 16 possible values of the 4-bit address in 16 clock cycles, and calculates the 16 values to be stored in the LUT, where the equivalent integer value of the address is used as the address of the LUT location. All the ACs of the adder-block (see Fig. 8) work in parallel, and generate all the required values to update the LUTs of the SCs of a PE. According to the proposed LUT-update scheme, the LUTs of one PE out of the $P$ PEs are updated in every iteration. The LUTs of all the PEs are updated once in $P$ iterations

Fig. 8. Internal structure of the LUT-update block for block-size $L = 4$.

Fig. 9. Internal structure of an adder-cell (AC) of the LUT-update block for block-size $L = 4$.

Fig. 10. Internal structure of the MUX-array for $N = 16$ and $L = 4$.

Fig. 11. Structure of the error-computation cum bit-slice generator (EBSG) for block-size $L = 4$.

in a cyclic order. Each PE uses a separate control signal to enable the specific column of LUTs to be updated. The LUT-update operation of the proposed structure is completed during the first 16 clock cycles of every iteration. Each PE receives the bit-vectors through the MUX-array (shown in Fig. 10) for updating the LUTs, or for the computation of the filter outputs or the weight-increment terms, respectively. After completion of the LUT-update, the filtering computation follows immediately for the next $B$ clock cycles, by a series of LUT-read operations using the bit-slices of the corresponding weight-vectors, in LSB-to-MSB order, as successive addresses according to (15). During each cycle of filtering, the WBSG generates $P$ parallel bit-vectors of width $L$ bits each for the PEs to perform the filtering operation. Each SC receives a sequence of bit-vectors $\mathbf{w}_b$ (for $0 \le b \le B-1$, where $B$ is the word-length of the filter coefficients) from the WBSG in $B$ clock cycles. The LUT-read values are shift-accumulated in an accumulator (ACC) to obtain a partial filter output. During the last cycle, the LUT output is subtracted from the accumulated result, since the bit-vector during this cycle contains the sign-bits of the weight-vector. Each SC uses the control signal CTR1 to control the add/subtract operation in the ACC. At the end of the last cycle, the ACC contents are sent to the DMUX as input, and the ACC register (not shown in Fig. 7) is cleared so that it can be used for the computation of the weight-increment term from the next cycle (CTR1 is used for clearing the register).
Finally, the DMUX sends the computed partial results of the inner-products to the output line using the select signal CTR6. The partial results are obtained in parallel from the SCs of each PE, and the corresponding outputs of the SCs across the PEs are added by an AT (Fig. 5) to obtain one component of the current block of filter output, so that a block of parallel filter outputs is obtained from the ATs of the DA-module in each iteration. The EBSG receives one block of filter outputs from the DA-module, and in every iteration calculates a block of errors from one block of the desired response, according to (3). As shown in Fig. 11, the error values are loaded into parallel-in serial-out (PISO) shift-registers of the bit-slice generator (BSG) to generate the bit-vectors of the error-vector. CTR4 enables the clock for the BSG and CTR2 controls the load/shift operation of each SR. The bit-vectors are fed serially, in LSB-to-MSB order, to the DA-module in successive clock cycles to compute the weight-increment terms of the current iteration. According to (16), the LUT values used for the current block of filter outputs are also used to compute the weight-increment terms of the same iteration; in general, the LUT values of a given SC of a given PE are used to compute the corresponding weight-increment term, so that each PE computes its own subset of the weight-increment terms. The computation of the weight-increment terms is similar to that of the partial filter outputs, except that the same bit-vector is used by all the PEs of the DA-module.

TABLE I. LUT UPDATING SCHEME FOR THE FIRST ITERATIONS (BLOCK SIZE 4, FILTER ORDER 16)

In each SC (see Fig. 7), the ACC contents corresponding to the weight-increment term are sent to the output line of the DMUX. The weight-increment terms are scaled by the convergence factor, which we have assumed to be a power of 2, so that the scaling is realized by a right-shift operation using a fixed shifter (see Fig. 7). According to (1), the WBSG of the proposed DA-BLMS structure requires only the weight-increment terms of the current iteration to update the weight-vector for the next iteration; it does not require the LUT values of the current iteration. Therefore, once the weight-increment terms of the current iteration are computed, the LUT-updating operation for the next iteration can start immediately in the next clock cycle. As discussed earlier, the filter computation follows the LUT-update operation, and the first clock cycles of every iteration are used to complete the LUT-update; during this period, the weight-update operation of the WBSG can be performed concurrently. A bit-parallel (word-serial) structure of the WBSG requires one clock cycle to complete the weight-update operation, while a bit-serial structure requires as many clock cycles as the word-length of the filter coefficients. If the word-length of the filter coefficients is less than or equal to the LUT size, a bit-serial realization of the WBSG does not increase the iteration period of the DA-BLMS structure, and it reduces the hardware complexity of the structure; we therefore use a bit-serial structure for the WBSG. The bit-serial WBSG receives the weight-increment terms from the DA-module in bit-serial LSB-to-MSB order, and updates the weight-vector accordingly.
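A bit-level sketch of the error and weight-update path, under assumed fixed-point conventions: the block error is sliced LSB-first (the PISO registers of Fig. 11), and each weight-increment is scaled by a power-of-2 convergence factor via an arithmetic right shift (the fixed shifter of Fig. 7). All names and the chosen step size are illustrative, not the paper's notation.

```python
def error_bit_slices(desired, outputs, wordlen):
    """EBSG sketch: block error e = d - y, then its bit-vectors emitted
    LSB-first, as the PISO shift-registers of the BSG would
    (two's-complement bits)."""
    errors = [d - y for d, y in zip(desired, outputs)]
    slices = [[(e >> t) & 1 for e in errors] for t in range(wordlen)]
    return errors, slices

MU_SHIFT = 3   # assumed step size mu = 2**-3, so scaling is a right shift

def update_weights(weights, increments, shift=MU_SHIFT):
    """Weight update w <- w + mu * dw, with mu a power of 2 realized by
    an arithmetic right shift (rounds toward -inf, a fixed-point choice)."""
    return [w + (dw >> shift) for w, dw in zip(weights, increments)]

errors, slices = error_bit_slices([5, 0], [2, 3], wordlen=4)
# errors == [3, -3]; slices[0] (the LSB slice) == [1, 1]
new_w = update_weights([10, -4], [16, -8])   # -> [12, -5]
```

Because the scaling is a wire shift, the update costs only one addition per weight per iteration, matching the CSFA-per-weight structure of the WBSG.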
For the bit-serial realization of the WBSG, the weight-increment terms computed by each PE of the DA-module are finally loaded into a separate BSG (see Fig. 5) to generate the weight-increment terms in bit-serial order. All the BSGs of the DA-module use the common control signals CTR6 and CTR5 to perform the loading and shifting operations, respectively. The WBSG is an important block of the proposed structure. It performs three operations: (i) it updates the filter weights using the weight-increment values calculated by the DA-module, (ii) it generates the bit-vectors for the DA-module to compute the current block of filter output, and (iii) it gives one circular left-shift to the weight-vectors, as necessitated by the proposed LUT-update scheme. The LUT updating of the DA-BLMS ADF under the proposed scheme is shown in Table I for the first 5 iterations. As shown in Table I, the LUT-matrix has 4 columns, and the LUTs of all 4 columns are updated once in a period of 4 iterations. At any given iteration, the LUT-matrix contains the values corresponding to the recent past input samples needed to compute a block of 4 filter outputs. For example, during the 5th iteration the LUT-matrix contains the values corresponding to a set of 19 input samples, and this set of 19 samples is exactly what is required to compute the corresponding block of 4 filter outputs. Similarly, during the 6th iteration the LUT-matrix contains the values corresponding to the next set of samples, which are exactly those required to compute the next block of filter outputs. The bit-serial structure of the WBSG is shown in Fig. 12. It consists of serial-in serial-out (SISO) SRs and carry-save full-adders (CSFAs) corresponding to the filter weights. The SRs are arranged in matrix form, and the filter weights are stored in the SR matrix column-wise, such that each weight-vector is stored in one column of SRs.
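The cyclic update order and the circular shift of the weight columns can be mimicked in a few lines; `num_pes` and the list representation of the columns are assumptions for illustration.

```python
def lut_update_schedule(num_pes, iterations):
    """In iteration i only PE (i mod P) rewrites its LUTs, so every
    LUT column is refreshed exactly once per P iterations."""
    return [i % num_pes for i in range(iterations)]

def rotate_weight_columns(columns):
    """One circular left-shift of the weight-vector columns per
    iteration, keeping the column due for update aligned with PE-1."""
    return columns[1:] + columns[:1]

# For 4 PEs (block-size 4, as in Table I):
assert lut_update_schedule(4, 6) == [0, 1, 2, 3, 0, 1]
assert rotate_weight_columns(["w0", "w1", "w2", "w3"]) == ["w1", "w2", "w3", "w0"]
```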
As shown in Table I, the bit-slices received by the PE whose LUTs are to be updated in a given iteration are generated from the first column of filter weights, so each weight-vector must be aligned with its corresponding PE. If the LUTs of PE-1 are to be updated during the current iteration, then the first column of the SR-matrix must contain the components of the corresponding weight-vector, and every other column of SRs must contain the components of the weight-vector assigned to its PE. As shown in Fig. 12, the weight-increment values for the filter coefficients held in a given column of the SR-matrix are obtained from the corresponding PE, and these values are added to the corresponding filter weights bit-serially using a carry-save full-adder (CSFA). The results of the CSFAs of a column constitute a bit-vector of the updated weight-vector. The SR contents are shifted left to generate the shifted weight-vectors in accordance with the proposed LUT-update scheme; the shifting starts at a fixed clock cycle of every iteration and continues for the required number of cycles, enabled by control signal CTR5 in the WBSG. The D flip-flop of each CSFA is cleared during the first clock cycle of every iteration, to flush out the final carry of the previous iteration's weight-update operation.

B. Structure for Higher Block-Size

To derive DA-based BLMS structures for higher block-sizes using LUTs of 16 words, we can take the block-size to be a multiple of 4.

Fig. 12. Bit-serial structure of the weight-update cum bit-slice generator (WBSG).

The structures of the EBSG and WBSG of the DA-BLMS filter for such block-sizes are the same as those for block-size 4, shown in Fig. 11 and Fig. 12, respectively. However, the AC of the LUT-update block and the SC of each PE of the DA-module need to be modified according to the block-size. Each SC in this case is comprised of several LUTs of 16 words each. The bit-vectors of the weight-vectors and error-vectors are split into segments of 4-bit size, which are fed to the LUTs of each SC so that the LUTs are read in parallel. The values read from the LUTs are added using an AT and subsequently shift-accumulated in the ACC to obtain a partial output. To generate the update values for the LUTs, each AC of the LUT-update block in this case is comprised of several AND-AT blocks of size 4 (as shown in Fig. 9). For higher block-sizes, each SC accordingly involves proportionately more RAM words and adders, along with one ACC and 2 DMUXes; similarly, the LUT-update block involves proportionately more AND-gates and adders.

V. HARDWARE-TIME COMPLEXITY AND PERFORMANCE COMPARISON

A. Hardware Complexity

The proposed structure is comprised of one DA-module, one WBSG, one EBSG and a control unit. The DA-module consists of one LUT-update block, the PEs, the adder-trees, one MUX-array and the BSGs. The LUT-update block consists of one IDU and the ACs, where the IDU is comprised of registers and each AC is comprised of AND-gates and adders; the LUT-update block therefore involves a corresponding number of registers, adders and AND-gates. Each PE consists of SCs, where each SC is comprised of LUTs of 16 words each, adders, one ACC, one 1-to-2-line DMUX, and a number of 2-input XOR-gates (used by the ACC (not shown in Fig. 7) to compute the 1's complement of the LUT outputs when the bit-vector contains the sign-bits), where the ACC involves one adder, one register and one 2-to-1-line DMUX.
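The segment decomposition used for the larger block-sizes can be sketched as follows: the long address is split into 4-bit segments, one 16-word LUT is read per segment, and the reads are summed as by the adder-tree inside the SC. The sketch assumes an LSB-first list of address bits; the names are illustrative.

```python
def read_decomposed_luts(luts, addr_bits):
    """SC sketch for block-sizes above 4: split the address bits into
    4-bit segments, read one 16-word LUT per segment, and add the
    reads (the adder-tree inside the SC)."""
    assert len(addr_bits) == 4 * len(luts)
    total = 0
    for i, lut in enumerate(luts):
        seg = addr_bits[4 * i: 4 * i + 4]            # 4 address bits for LUT i
        idx = sum(b << j for j, b in enumerate(seg)) # segment's LUT address
        total += lut[idx]
    return total

luts = [list(range(16)), [10 * v for v in range(16)]]
bits = [1, 0, 1, 0,  0, 1, 0, 0]   # segment addresses 0b0101=5 and 0b0010=2
assert read_decomposed_luts(luts, bits) == 5 + 20
```

Keeping every LUT at 16 words is what lets the memory grow linearly, rather than exponentially, with the block-size.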
Each PE, therefore, involves memory words, adders, registers, 2-to-1-line DMUXes and XOR-gates in numbers determined by the filter length and block-size. Each BSG is comprised of bit-level SRs, and the MUX-array involves 2-to-1-line MUXes. The DA-module, therefore, involves memory words, adders, D-type flip-flops (FFs), word-level 2-to-1-line MUXes/DMUXes, AND-gates and XOR-gates. The WBSG involves D-type FFs and FAs, and the EBSG involves D-type FFs and adders. The complete structure, therefore, requires memory words, adders, FAs, D-type FFs, word-level MUXes/DMUXes, AND-gates and XOR-gates, as summarized in Table II.

B. Time Complexity

The proposed structure performs four operations sequentially in every iteration: (i) LUT update, (ii) filter-output computation, (iii) error calculation, and (iv) computation of the weight-increment terms. It involves 16 clock cycles to complete the LUT-update operation, a number of clock cycles equal to the coefficient word-length to calculate the partial results of a block of filter outputs, one clock cycle to obtain the block of filter outputs from the partial results and then the block of errors, and a similar number of clock cycles to compute the weight-increment terms. In every iteration, the proposed structure processes one block of samples, and the duration of one clock cycle is set by the delay of one adder. For comparison purposes, we have also estimated the number of clock cycles required by the structures of [10] and [16] for one iteration. We assumed that the read and write operations in a LUT are performed in two separate clock cycles, to maintain uniformity in the comparison. The structure of [10] requires 16 clock cycles to update the DA-A-LUT of size 16 words, a number of clock cycles to compute one filter output, and 32 clock cycles to update the DA-F-LUT of size 16 words; it involves 48 clock cycles for one iteration and computes one output per iteration. Since the structure of [16] does not involve a DA-F-LUT, it requires 16 clock cycles for updating the DA-A-LUT plus the clock cycles needed to compute one filter output; it therefore involves fewer clock cycles per iteration, with the same clock period as that of [10].

TABLE II. GENERAL COMPARISON OF HARDWARE COMPLEXITY OF THE PROPOSED STRUCTURE, THE STRUCTURES OF [10] AND [16] (WITH DECOMPOSITION FACTOR 4), AND THE DA-BLMS STRUCTURE OF [18]. LEGEND: ADD: adder; MULT: multiplier; FF: flip-flop; VSH: variable shifter; TR: throughput rate; LAPO: LUT accesses per output. In addition to the listed components, the proposed structure involves FAs, 2-input AND-gates and 2-input XOR-gates. In the case of [18], the block-size is a product of two relatively prime factors.

C. Number of LUT Accesses

During every iteration, the proposed structure computes a block of filter outputs, and performs write operations for updating the LUTs, LUT-read operations for the filter-output computation, and LUT-read operations for the computation of the weight-increment terms. The number of LUT accesses per output (LAPO) is the total number of accesses per iteration divided by the block-size. The LAPO of [10] and [16] is found similarly, in terms of the bit-width of the input samples and the bit-width of the intermediate and output samples. Note that the LUTs of a DA-based ADF are required to be implemented in RAM, so the total energy consumption of the structure increases significantly with LAPO.

D. Performance Comparison

The hardware and time complexities of the proposed structure, the DA-LMS structures of [10] and [16], and the DA-BLMS structure of [18] are listed in Table II for comparison. The structure of [16] is the most efficient among the existing DA-LMS structures. Compared with [16], the proposed structure requires more LUT words and adders and 4/3 times more FFs, but offers a proportionately higher throughput rate. It involves 16 more LAPO for block-size 4, and fewer LAPO for block-size 8, than [16] for 16-bit internal precision. Interestingly, the number of adders of the proposed structure does not increase proportionately with the block-size, and the number of flip-flops is independent of the block-size. Besides, it does not require variable shifters, unlike the structures of [10] and [16]. We have estimated the hardware and time complexity of the proposed structure for block-sizes 4 and 8, and that of [10] and [16], for filter sizes 16, 32 and 64 using the complexity counts of Table II; the estimated values are listed in Table III for comparison. Compared with the structure of [16], the proposed structure for block-size 8 involves 8 times more LUT words and 3.27 times more adders on average over the different filter orders, and offers 5.22 times higher throughput. It involves, respectively, 37.5%, 24.4% and 17.8% more flip-flops, and 25%, 37.5% and 47.6% fewer LAPO, than [16] for filter orders 16, 32 and 64.

E. Simulation Result

To validate the proposed design, we have coded it in VHDL for filter orders 16, 32 and 64 with block-sizes 4 and 8, and we have also coded the designs of [10] and [16] for the same filter orders. All the designs were synthesized using Synopsys Design Compiler with the TSMC 90 nm CMOS library. The synthesis reports obtained from the Design Compiler are listed in Table IV, and they are in accordance with the theoretical estimates of Table III.
The minimum clock period of the proposed structure and of the structure of [16] is slightly higher than that of [10], due to the extra MUX/DMUX in the critical path. As shown in Table IV, the structure of [16] is the most efficient among the existing structures. Compared with [16], the proposed structure for block-sizes 4 and 8 involves, respectively, 2.13 and 3.69 times more area on average over the different filter orders, and offers nearly 2.61 and 5.22 times higher throughput rate, respectively.

TABLE III. HARDWARE AND TIME COMPLEXITY OF THE PROPOSED STRUCTURE AND THE STRUCTURES OF [10] AND [16] FOR DIFFERENT FILTER SIZES.
TABLE IV. COMPARISON OF AREA, DELAY, AND POWER OBTAINED FROM SYNTHESIS OF THE PROPOSED STRUCTURE AND THE STRUCTURES OF [10] AND [16].

We have estimated the ADP, PPO and energy per output (EPO) at a 20 MHz clock. As shown in Table IV, for block-size 4 the proposed structure has 17.47%, 18.49% and 13.66% less ADP than [16] for filter orders 16, 32 and 64, respectively, and for block-size 8 it has 31.6% less ADP than [16] on average over the different filter orders. For block-size 4, it consumes 27.5%, 28.8% and 24.6% less EPO than [16] for filter orders 16, 32 and 64, respectively; for block-size 8, it consumes, respectively, 40%, 39.8% and 37.4% less EPO for the same filter orders. One can extrapolate these results to obtain approximate estimates of the ADP, PPO and EPO advantages of the proposed structure for filter orders greater than 64.

VI. CONCLUSION

We have derived a DA formulation of the BLMS algorithm in which both convolution and correlation are performed using a common LUT, for the computation of the filter outputs and the weight-increment terms, respectively. This results in a significant saving of LUT words and adders, which constitute the major hardware components of DA-based computing structures. We have also suggested a novel LUT-update scheme for the DA-based BLMS ADF, where only one set of LUTs needs to be modified in every iteration, so that the LUT contents are modified once over a number of iterations determined by the filter length and the input block-size.
Using the proposed scheme, we have derived a parallel architecture for the implementation of the DA-based BLMS ADF. Unlike the existing DA-based LMS structures, the number of adders required by the proposed structure does not increase linearly with the filter length. Compared with the best of the existing DA-based LMS designs, the proposed one involves more adders and more LUT words, and offers a correspondingly higher throughput. It requires nearly 25% more flip-flops irrespective of the block-size, but does not involve variable shifters, unlike the others. It involves fewer LUT accesses per output than the existing structures for block-sizes higher than 4, which is a major advantage in reducing its ADP and EPO when implemented for large-order ADFs and for higher block-sizes. For block-size 8 and filter length 64, the proposed structure involves 2.47 times more adders, 15% more flip-flops and 43% fewer LAPO than the best of the existing structures, and offers 5.22 times higher throughput. The ASIC synthesis results show that the proposed structure for filter order 64 has almost 14% and 30% less ADP, and 25% and 37% less EPO, than the best of the existing structures for block-sizes 4 and 8, respectively.

REFERENCES

[1] S. Haykin and B. Widrow, Least-Mean-Square Adaptive Filters. Hoboken, NJ: Wiley-Interscience, 2003.
[2] R. Haimi-Cohen, H. Herzberg, and Y. Beery, "Delayed adaptive LMS filtering: Current results," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Albuquerque, NM, Apr. 1990, pp. 1273-1276.
[3] M. D. Meyer and D. P. Agrawal, "A modular pipelined implementation of a delayed LMS transversal adaptive filter," in Proc. IEEE Int. Symp. Circuits Syst., New Orleans, LA, May 1990, pp. 1943-1946.
[4] V. Visvanathan and S. Ramanathan, "A modular systolic architecture for delayed least mean square adaptive filtering," in Proc. IEEE Int. Conf. VLSI Des., Bangalore, 1995, pp. 332-337.
[5] R. D. Poltmann, "Conversion of the delayed LMS algorithm into the LMS algorithm," IEEE Signal Process. Lett., vol. 2, p. 223, Dec. 1995.
[6] S. C. Douglas, Q. Zhu, and K. F. Smith, "A pipelined LMS adaptive FIR filter architecture without adaptive delay," IEEE Trans. Signal Process., vol. 46, pp. 775-779, Mar. 1998.
[7] L. D. Van and W. S. Feng, "Efficient systolic architectures for 1-D and 2-D DLMS adaptive digital filters," in Proc. IEEE Asia Pacific Conf. Circuits Syst., Tianjin, China, Dec. 2000, pp. 399-402.
[8] L. D. Van and W. S. Feng, "An efficient architecture for the DLMS adaptive filters and its applications," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 48, no. 4, pp. 359-366, Apr. 2001.
[9] G. A. Clark, S. K. Mitra, and S. R. Parker, "Block implementation of adaptive digital filters," IEEE Trans. Circuits Syst., vol. 28, pp. 584-592, Jun. 1981.
[10] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, "LMS adaptive filters using distributed arithmetic for high throughput," IEEE Trans. Circuits Syst., vol. 52, no. 7, pp. 1327-1337, Jul. 2005.
[11] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, "A novel high performance distributed arithmetic adaptive filter implementation on an FPGA," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2004, vol. 5, pp. V-161-V-164.
[12] S. A. White, "Applications of distributed arithmetic to digital signal processing: A tutorial review," IEEE ASSP Mag., vol. 6, pp. 4-19, Jul. 1989.
[13] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, "An FPGA implementation for a high throughput adaptive filter using distributed arithmetic," in Proc. 12th Annu. IEEE Symp. Field-Programmable Custom Comput. Mach., 2004, pp. 324-325.
[14] W. Huang and D. V. Anderson, "Adaptive filters using modified sliding-block distributed arithmetic with offset binary coding," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2009, pp. 545-548.
[15] B. K. Mohanty and P. K. Meher, "Delayed block LMS algorithm and concurrent architecture for high-speed implementation of adaptive FIR filters," presented at the IEEE Region 10 TENCON 2008 Conf., Hyderabad, India, Nov. 2008.
[16] R. Guo and L. S. DeBrunner, "Two high-performance adaptive filter implementation schemes using distributed arithmetic," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 9, pp. 600-604, Sep. 2011.
[17] S. Baghel and R. Shaik, "FPGA implementation of fast block LMS adaptive filter using distributed arithmetic for high throughput," in Proc. Int. Conf. Commun. Signal Process. (ICCSP), Feb. 10-12, 2011, pp. 443-447.
[18] S. Baghel and R. Shaik, "Low power and less complex implementation of fast block LMS adaptive filter using distributed arithmetic," in Proc. IEEE Students' Technol. Symp., Jan. 14-16, 2011, pp. 214-219.
[19] R. Jayashri, H. Chitra, H. Kusuma, A. V. Pavitra, and V. Chandrakanth, "Memory based architecture to implement simplified block LMS algorithm on FPGA," in Proc. Int. Conf. Commun. Signal Process. (ICCSP), Feb. 10-12, 2011, pp. 179-183.
[20] Q. Shen and A. S. Spanias, "Time and frequency domain X block LMS algorithm for single channel active noise control," Control Eng. J., vol. 44, no. 6, pp. 281-293, 1996.
[21] D. P. Das, G. Panda, and S. M. Kuo, "New block filtered-X LMS algorithms for active noise control systems," IET Signal Process., vol. 1, no. 2, pp. 73-81, Jun. 2007.
[22] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.
[23] C. S. Burrus, "Index mappings for multidimensional formulation of the DFT and convolution," IEEE Trans. Acoust., Speech, Signal Process., vol. 25, pp. 239-242, Jun. 1977.

Basant K. Mohanty (M'06-SM'11) received the M.Sc. degree in physics from Sambalpur University, India, in 1989, and the Ph.D. degree in the field of VLSI for digital signal processing from Berhampur University, Orissa, in 2000. In 2001, he joined as a Lecturer in the Electrical and Electronic Engineering Department, BITS Pilani, Rajasthan. He then joined as an Assistant Professor in the Department of Electronics and Communication Engineering, Mody Institute of Education and Research (Deemed University), Rajasthan. In 2003, he joined Jaypee University of Engineering and Technology, Guna, Madhya Pradesh, where he became Associate Professor in 2005 and full Professor in 2007.
His research interests include the design and implementation of low-power and high-performance systems for multimedia applications, multi-core processor design, and algorithms for concurrent processing. He has published nearly 40 technical papers. Dr. Mohanty is a lifetime member of the Institution of Electronics and Telecommunication Engineers, New Delhi, India. He was the recipient of the Rashtriya Gaurav Award conferred by the India International Friendship Society, New Delhi, India, for 2012.

Pramod Kumar Meher (SM'03) received the M.Sc. degree in physics and the Ph.D. degree in science from Sambalpur University, India, in 1978 and 1996, respectively. Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore, and an Adjunct Professor with the School of Electrical Sciences, Indian Institute of Technology Bhubaneswar, India. Previously, he was a Professor of Computer Applications with Utkal University, India, from 1997 to 2002, and a Reader in electronics with Berhampur University, India, from 1993 to 1997. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and video processing, communication, bio-informatics, and intelligent computing. He has contributed nearly 200 technical papers to various reputed journals and conference proceedings. Dr. Meher has served as a speaker for the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society and as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS. Currently, he is serving as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and the Journal of Circuits, Systems, and Signal Processing. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for 1999.