854 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 62, NO. 3, MARCH 2015

Efficient Subquadratic Space Complexity Architectures for Parallel MPB Single- and Double-Multiplications for All Trinomials Using Toeplitz Matrix-Vector Product Decomposition

Chiou-Yng Lee, Senior Member, IEEE, and Pramod Kumar Meher, Senior Member, IEEE

Abstract: Subquadratic multiplication algorithms have received significant attention from cryptographic hardware researchers for efficient implementation of public-key cryptosystems. In this paper, we derive a new shifted MPB (SMPB) representation based on the modified polynomial basis (MPB). We show that, by using MPB and SMPB, the proposed double basis multiplication can be transformed into a Toeplitz matrix-vector product (TMVP) structure. Furthermore, by employing this formulation of double basis multiplication, we show that three-operand multiplication over $GF(2^m)$ for all trinomials can be realized efficiently by the recursive TMVP (RTMVP) formulation. To perform the three-operand multiplication with the RTMVP formulation, we have derived a new RTMVP decomposition scheme. The proposed single- and double-multiplications can, respectively, use TMVP and RTMVP decompositions to achieve subquadratic space complexity architectures. By theoretical analysis, it is shown that the proposed subquadratic multipliers involve significantly less space complexity and less computation time compared to the existing subquadratic multipliers using TMVP and Karatsuba algorithms. Moreover, our proposed double-multiplication design can be used in several applications involving successive multiplications, such as exponentiation, inversion, and elliptic curve point multiplication.

Index Terms: Binary extension field, elliptic curve cryptography, finite field, Galois field, modified polynomial basis multiplication, Toeplitz matrix-vector product.

Manuscript received August 28, 2014; revised November 02, 2014; accepted November 20. Date of current version February 23. This paper was recommended by Associate Editor V. Chandra. P. K. Meher is with the School of Computer Engineering, Nanyang Technological University, Singapore (aspkmeher@ntu.edu.sg). C.-Y. Lee is with the Department of Computer Information and Network Engineering, Lunghwa University of Science and Technology, Taoyuan 33306, Taiwan (pp010@mail.lhu.edu.tw). Color versions of one or more of the figures in this paper are available online. Digital Object Identifier /TCSI

I. INTRODUCTION

Finite field multiplication over $GF(2^m)$ is a basic field operation, which is frequently used in elliptic curve cryptography (ECC) to perform point-addition and point-doubling operations on an elliptic curve. The multiplication over $GF(2^m)$ can further be used to perform division, exponentiation, and inversion operations. In finite field arithmetic, addition is the simplest operation because the addition of any two bits can be performed by a logical XOR operation, and there is no carry propagation. Division operations, on the other hand, can be implemented by a series of multiplications. The area and time complexity involved in performing the multiplications, consequently, contribute most of the area and the time required for the implementation of ECC. It is, therefore, required to design finite field multipliers with greater efficiency in terms of area consumption and speed performance for ECC. The Weil and Tate pairings [1] based on elliptic curve arithmetic involve extensive computations of multiplication involving operands in large finite fields. This generates further interest in exploring hardware-efficient designs for high-performance multiplication over large finite fields.

A Toeplitz matrix is a matrix in which the elements of each descending diagonal from left to right are identical. It is encountered in many signal processing and image processing applications.
It is shown that the Toeplitz matrix-vector product (TMVP) approach can lead to efficient hardware architectures for multiplication in finite fields based on normal basis (NB) [2]-[4], shifted polynomial basis (SPB) [5], [6], and dual basis (DB) (or modified polynomial basis, MPB) [7]-[9]. In binary extension fields, multiplication is a two-step operation: naive polynomial multiplication followed by reduction using the irreducible polynomial that generates the field. The naive polynomial multiplication involves $O(m^2)$ space complexity and $O(\log_2 m)$ delay. When the chosen polynomial is a low-weight irreducible polynomial, such as a trinomial or a pentanomial, the reduction is a simple operation. Therefore, naive polynomial multiplication is generally considered the major contributor to the hardware cost of multiplication. The complexity of naive polynomial multiplication is reduced by employing divide-and-conquer techniques, such as the Karatsuba-Ofman algorithm (KA) [10], [11], the Toom-Cook algorithm [12], and TMVP decomposition [13], from $O(m^2)$ to $O(m^{1+\varepsilon})$, $0 < \varepsilon < 1$. Recently, Lee et al. [14] have proposed a generalized -way KA decomposition with , which is suitable for implementing subquadratic digit-serial multiplication. Based on KA decomposition, a multi-partite digit-serial multiplier is introduced in [15].

Recently, it has been shown that three-operand multiplication can provide area-delay efficient architectures for applications involving successive multiplications, e.g., exponentiation, inversion, and elliptic curve point multiplication. Three-operand multiplication based on KA decomposition is suggested in [16], where it is also shown that M-ary exponentiation using three-operand multiplication can be performed in nearly multiplication steps. Fast inversion based on Gaussian normal basis double-multiplication is suggested in [17]. Multi-operand multiplication has been found to be useful for hardware implementation of high-performance applications.
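The two-step operation described above (carry-less polynomial multiplication followed by reduction modulo a trinomial) can be sketched in software as follows. This is only an illustrative bit-level model and not the hardware architecture of the paper. The function names and the integer encoding (bit $i$ of an integer holds the coefficient of $x^i$) are assumptions made for this example, and the caller is assumed to supply an irreducible trinomial $x^m + x^k + 1$.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two GF(2)[x] polynomials stored as integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition over GF(2) is XOR, with no carry
        a <<= 1
        b >>= 1
    return result


def reduce_trinomial(c: int, m: int, k: int) -> int:
    """Reduce c modulo x^m + x^k + 1, assuming 0 < k < m."""
    for i in range(c.bit_length() - 1, m - 1, -1):
        if (c >> i) & 1:
            # x^i = x^(i-m+k) + x^(i-m)  (mod x^m + x^k + 1)
            c ^= (1 << i) | (1 << (i - m + k)) | (1 << (i - m))
    return c


def gf2m_mul(a: int, b: int, m: int, k: int) -> int:
    """Two-step field multiplication: polynomial product, then reduction."""
    return reduce_trinomial(clmul(a, b), m, k)


# Small usage example in GF(2^3) generated by x^3 + x + 1:
# x * x^2 = x^3 = x + 1
print(bin(gf2m_mul(0b010, 0b100, 3, 1)))  # 0b11
```

The schoolbook `clmul` loop is the quadratic part that the divide-and-conquer techniques aim to shrink, while the trinomial reduction touches only three bit positions per high-order term.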
In this paper, we extend the TMVP decomposition of [13] to derive 2-way and 3-way recursive TMVP (RTMVP) decomposition schemes for the implementation of three-operand multiplication. Using the RTMVP decomposition approach, in the well-known MPB field representation, we explore a novel area-time efficient multiplication scheme in binary extension fields. We show here that three-operand MPB multiplication can be efficiently realized by the proposed RTMVP formulation. The proposed RTMVP decomposition is found to be a suitable match for a three-operand modified polynomial basis (MPB) multiplier with subquadratic space complexity. By theoretical analysis, we show that the proposed subquadratic single- and double-multiplications have less computation time and less space complexity compared to the existing subquadratic multipliers.

The rest of this paper is organized as follows. Section II presents the preliminaries regarding 2-way and 3-way TMVP. In Section III, we present our proposed new RTMVP decompositions. In Section IV, we extend the well-known MPB representation to derive a new shifted MPB (SMPB) representation. In the same section, we present the architecture of the proposed three-operand multiplication based on MPB and SMPB to be used in the RTMVP formulation. In Section V, time and space complexities are analyzed. Finally, we conclude the paper in Section VI.

II. REVIEW OF TOEPLITZ MATRIX-VECTOR PRODUCT DECOMPOSITION

In linear algebra, an $n \times n$ Toeplitz matrix $T = [t_{i,j}]$ is a matrix with the property $t_{i,j} = t_{i-1,j-1}$. The Toeplitz matrix-vector product is widely applied to compute multiplications in finite fields based on dual basis (DB), shifted polynomial basis, and normal basis. A Toeplitz matrix has the following property:

Proposition 1: An $n \times n$ Toeplitz matrix is determined by the $2n-1$ entries in the first row and the first column.

We can use the vector of these $2n-1$ entries to define a Toeplitz matrix. Using such a vector representation, the addition of two $n \times n$ Toeplitz matrices requires $2n-1$ XOR gates. In the following, we briefly review the 2-way and 3-way TMVP decompositions of [13].

A. 2-Way TMVP Decomposition

Let $V$ be a column vector and let the matrix vector be used to define a Toeplitz matrix $T$, where $V_0$ and $V_1$ are two column vectors, and $T_0$, $T_1$, and $T_2$ are three Toeplitz matrices. A Toeplitz matrix-vector product in this case is given by (1). Using the divide-and-conquer method, (1) can be expressed as (2).

The original TMVP involves four sub-TMVPs, but the TMVP in (2) involves three sub-TMVPs. The TMVP in (2), therefore, provides better performance than the original TMVP in (1). Based on the decomposition scheme of (1), we can recursively generate four components [component matrix point (CMP), component vector point (CVP), point-wise multiply (PWM), and reconstruction (R)] of reduced-size matrices.

As mentioned above, Fig. 1 shows the subquadratic TMVP architecture, which involves three stages: the evaluation point generation (EPG) stage, the point-wise multiplication (PWM) stage, and the reconstruction (R) stage. The EPG stage performs EMP(T) and EVP(V), the PWM stage computes , and the R stage performs the operation .

Fig. 1. The subquadratic TMVP multiplier architecture [13].

Let symbols and denote space and delay, respectively. Let and in the case of denote the number of bit-multiplications and the number of bit-additions required for TMVP multiplication, and let and denote the number of AND gate delays and the number of XOR gate delays required for TMVP multiplication. In [13], Fan and Hasan have shown that, for 2-way TMVP decomposition, the CMP component involves XOR gates and delay, the CVP component involves XOR gates and delay, the PWM component involves space complexity and delay, and the R component involves XOR gates and delay.
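Proposition 1 and the three-product 2-way split can be illustrated with a small software model. The sketch below stores an $n \times n$ Toeplitz matrix over GF(2) as the list of its $2n-1$ defining bits, so that adding two Toeplitz matrices costs $2n-1$ XORs. The index convention `T[i][j] = t[i - j + n - 1]`, the block naming $T = [[T_1, T_0], [T_2, T_1]]$, and all function names are assumptions chosen for this example; the products $P_0 = (T_0+T_1)V_1$, $P_1 = T_1(V_0+V_1)$, $P_2 = (T_1+T_2)V_0$ are one standard way to write a three-product identity of the kind reviewed above, and the labels may differ from those in (2).

```python
def xor(a, b):
    """Element-wise XOR of two equal-length bit lists (addition over GF(2))."""
    return [x ^ y for x, y in zip(a, b)]


def toeplitz_dense(t):
    """Expand the 2n-1 defining entries into a full n x n Toeplitz matrix."""
    n = (len(t) + 1) // 2
    return [[t[i - j + n - 1] for j in range(n)] for i in range(n)]


def tmvp_direct(t, v):
    """Schoolbook Toeplitz matrix-vector product over GF(2)."""
    n = len(v)
    return [sum(t[i - j + n - 1] & v[j] for j in range(n)) & 1
            for i in range(n)]


def tmvp_2way_once(t, v):
    """One level of the 2-way split: three half-size products instead of four."""
    h = len(v) // 2                      # len(v) is assumed to be even
    t0, t1, t2 = t[0:2 * h - 1], t[h:3 * h - 1], t[2 * h:4 * h - 1]
    v0, v1 = v[:h], v[h:]
    p0 = tmvp_direct(xor(t0, t1), v1)    # (T0 + T1) V1
    p1 = tmvp_direct(t1, xor(v0, v1))    # T1 (V0 + V1)
    p2 = tmvp_direct(xor(t1, t2), v0)    # (T1 + T2) V0
    return xor(p0, p1) + xor(p1, p2)     # upper half, then lower half


t = [1, 0, 1, 1, 0, 0, 1]   # defines a 4 x 4 Toeplitz matrix
v = [1, 1, 0, 1]
print(tmvp_2way_once(t, v) == tmvp_direct(t, v))  # True
```

Because sub-blocks of a Toeplitz matrix are again Toeplitz, each block is just a slice of the defining vector, which is what makes the evaluation step cheap.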
Accordingly, we have obtained the following recurrences on complexities: (3)

Lemma 1: Assuming and to be two positive integers which satisfy , the solution of the recurrence relation and is , where and are integer constants.

Lemma 2: Let , , and be three integers with and ; the solution of the recurrence relation and is .

We utilize Lemmas 1 and 2 to solve the recurrence equations in (3). The time and space complexities of 2-way TMVP decomposition can be expressed as follows:
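To make the effect of the recurrence concrete, the following sketch applies the 2-way split recursively and counts the bit-multiplications (AND operations) performed at the leaves. It is a software illustration only; the function names, the defining-vector convention `T[i][j] = t[i - j + n - 1]`, and the restriction to $n$ being a power of two are assumptions of this example. Since every level replaces one product by three half-size products, the count for size $n$ is $3^{\log_2 n} = n^{\log_2 3}$, compared with $n^2$ for the schoolbook product.

```python
def xor(a, b):
    """Element-wise XOR of two equal-length bit lists."""
    return [x ^ y for x, y in zip(a, b)]


def tmvp_direct(t, v):
    """Schoolbook Toeplitz matrix-vector product over GF(2) (n^2 ANDs)."""
    n = len(v)
    return [sum(t[i - j + n - 1] & v[j] for j in range(n)) & 1
            for i in range(n)]


def tmvp_2way(t, v, counter):
    """Recursive 2-way TMVP; counter[0] accumulates the leaf AND operations."""
    n = len(v)
    assert n & (n - 1) == 0, "this sketch assumes n is a power of two"
    if n == 1:
        counter[0] += 1
        return [t[0] & v[0]]
    h = n // 2
    t0, t1, t2 = t[0:2 * h - 1], t[h:3 * h - 1], t[2 * h:4 * h - 1]
    v0, v1 = v[:h], v[h:]
    p0 = tmvp_2way(xor(t0, t1), v1, counter)
    p1 = tmvp_2way(t1, xor(v0, v1), counter)
    p2 = tmvp_2way(xor(t1, t2), v0, counter)
    return xor(p0, p1) + xor(p1, p2)


t = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0]   # 8 x 8 Toeplitz matrix
v = [1, 1, 0, 1, 0, 0, 1, 1]
count = [0]
assert tmvp_2way(t, v, count) == tmvp_direct(t, v)
print(count[0])  # 27, i.e. 3**3, versus 64 for the schoolbook product
```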

B. 3-Way TMVP Decomposition

Let $V$ be a column vector and let the matrix vector be a Toeplitz matrix $T$, where each of the $V_i$'s ($i = 0, 1, 2$) is a column vector, and each of the $T_i$'s ($i = 0, 1, 2, 3, 4$) is a Toeplitz matrix. The product can be rewritten as in (4), where denotes the sum of . From (4), we can find that the 3-way TMVP decomposition involves 6 sub-TMVPs in each recurrence. Accordingly, we have obtained the complexities of 3-way TMVP decomposition as follows:

III. PROPOSED RECURSIVE TOEPLITZ MATRIX-VECTOR PRODUCT DECOMPOSITION

In this section, we extend the TMVP decomposition to derive two new recursive TMVP (RTMVP) decompositions for computing three-operand multiplication. For convenience of presentation, we use the term 3-mult to refer to the three-operand multiplication in the rest of the paper. We have derived the proposed 2-way and 3-way RTMVP decompositions as follows.

A. 2-Way RTMVP Decomposition

Let and be two Toeplitz matrices defined by and , where the 's and 's are Toeplitz matrices. Let be a column vector, where and are column vectors. Here we consider the 3-mult of the form as a recursive TMVP multiplication. To realize the 3-mult computation, let us define a two-step operation: and . Based on 2-way TMVP decomposition, the intermediate product is obtained in the first step as (5). From the TMVP computation in (5), we get the intermediate product as a vector. Next, we again use the 2-way TMVP decomposition to compute , and the product can then be obtained as (6).

The 3-mult in (6) involves 6 three-operand sub-RTMVPs. We can use this formula to iteratively decompose the 3-mult. Based on the proposed multiplication using three individual stages (the evaluation point generation (EPG) stage, the point-wise multiplication (PWM) stage, and the reconstruction (R) stage), Fig. 2 shows the implementation of a 3-mult using the 2-way RTMVP identity.
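As a companion to the 3-way split reviewed in Section II-B, the sketch below checks one level of a six-product decomposition against the schoolbook product. The block layout $T = [[T_2, T_1, T_0], [T_3, T_2, T_1], [T_4, T_3, T_2]]$, the defining-vector convention `T[i][j] = t[i - j + n - 1]`, the labels `p0` to `p5`, and the function names are assumptions of this example; the identity is one standard six-product form and its labelling may differ from (4).

```python
def xor(*vs):
    """Element-wise XOR of two or more equal-length bit lists."""
    return [sum(bits) & 1 for bits in zip(*vs)]


def tmvp_direct(t, v):
    """Schoolbook Toeplitz matrix-vector product over GF(2)."""
    n = len(v)
    return [sum(t[i - j + n - 1] & v[j] for j in range(n)) & 1
            for i in range(n)]


def tmvp_3way_once(t, v):
    """One level of the 3-way split: six third-size products instead of nine."""
    h = len(v) // 3                          # len(v) assumed divisible by 3
    T = [t[i * h:i * h + 2 * h - 1] for i in range(5)]   # blocks T0..T4
    V = [v[i * h:(i + 1) * h] for i in range(3)]         # blocks V0..V2
    p0 = tmvp_direct(xor(T[0], T[1], T[2]), V[2])
    p1 = tmvp_direct(xor(T[1], T[2], T[3]), V[1])
    p2 = tmvp_direct(xor(T[2], T[3], T[4]), V[0])
    p3 = tmvp_direct(T[1], xor(V[1], V[2]))
    p4 = tmvp_direct(T[2], xor(V[0], V[2]))
    p5 = tmvp_direct(T[3], xor(V[0], V[1]))
    return xor(p0, p3, p4) + xor(p1, p3, p5) + xor(p2, p4, p5)


t = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0]   # defines a 6 x 6 Toeplitz matrix
v = [1, 0, 1, 1, 0, 1]
print(tmvp_3way_once(t, v) == tmvp_direct(t, v))  # True
```

Applied recursively, six sub-products of size $n/3$ give $6^{\log_3 n} = n^{\log_3 6}$ bit-multiplications.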
Using a cascaded product approach, the six 3-mults in (6) can be rewritten in terms of shared intermediate products. Accordingly, the six 3-mults in the PWM stage in Fig. 2 involve three multiplier-1 units and six multiplier-2 units: a multiplier-1 computes a first-step (intermediate) product, and a multiplier-2 computes a second-step product that consumes it. Now, we define the complexity parameters in Table I to calculate the complexity of the proposed algorithm. In the following, we estimate the complexities of the three individual stages in Fig. 2.

EPG Stage. Based on the 6 three-operand sub-RTMVPs in this stage, we compute two evaluation matrix points (EMP) and one evaluation vector point (EVP).

EMP component: For any two matrices and , we can define and , respectively, where each of and is a Toeplitz matrix. In [13], it is shown that XOR gates are required to generate , which involves two additions: and . Accordingly, the computation of involves the space complexity of XOR gates and the computation time of delay. The complexity of is the same as that of . Therefore, and in total require XOR gates and involve delay.

EVP component: The column vector is split into two parts, , where and are two column vectors. From (6), the EVP component is generated by . Therefore, the complexity of involves XOR gates and delay.

PWM Stage. The proposed splitting method (as given in (6)) involves three multiplier-1 units and six multiplier-2 units in each step. Note that we use a cascaded structure to implement the 3-mult, and each multiplier-1 is associated with two multiplier-2 units. For example, multiplier-1 computes , and the result is used as the input operand of two multiplier-2 units computing and . Based on this, the complexity of computing a 3-mult can be expressed as XOR gates and AND gates. Therefore, the complexity of the PWM unit involves AND gates and XOR gates. It involves delays.

R Stage. Each subproduct with is an RTMVP. The width of each of these subproducts is therefore bits. The R stage needs XOR gates to evaluate , and requires delay.
Based on the above analysis, we have estimated the XOR and the AND complexities involved in different steps as follows: (7) (8)
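The cascaded structure of the PWM stage can be sketched in software as follows. This is an illustrative reconstruction under stated assumptions, not necessarily the exact form of (6): the operand names `u`, `t`, `v`, the block naming, the defining-vector convention `T[i][j] = t[i - j + n - 1]`, and the function names are all chosen for this example. It computes $U \cdot (T \cdot V)$ over GF(2) with one level of 2-way splitting in both steps, which gives three first-step products (the role of multiplier-1) and six second-step products (the role of multiplier-2), with each first-step product feeding exactly two second-step products, in agreement with the counts stated above.

```python
def xor(a, b):
    """Element-wise XOR of two equal-length bit lists."""
    return [x ^ y for x, y in zip(a, b)]


def tmvp_direct(t, v):
    """Schoolbook Toeplitz matrix-vector product over GF(2)."""
    n = len(v)
    return [sum(t[i - j + n - 1] & v[j] for j in range(n)) & 1
            for i in range(n)]


def split2(t, h):
    """Slice a defining vector into the blocks T0, T1, T2 of the 2-way split."""
    return t[0:2 * h - 1], t[h:3 * h - 1], t[2 * h:4 * h - 1]


def rtmvp_2way_once(u, t, v):
    """Three-operand product U (T V) with one 2-way level in each step."""
    h = len(v) // 2                       # len(v) is assumed to be even
    t0, t1, t2 = split2(t, h)
    u0, u1, u2 = split2(u, h)
    v0, v1 = v[:h], v[h:]
    # Step 1: three intermediate products (multiplier-1 role).
    p0 = tmvp_direct(xor(t0, t1), v1)
    p1 = tmvp_direct(t1, xor(v0, v1))
    p2 = tmvp_direct(xor(t1, t2), v0)
    # Evaluated blocks of the second Toeplitz matrix.
    a, b, c = xor(u0, u1), u1, xor(u1, u2)
    # Step 2: six products (multiplier-2 role); each p_i is used twice.
    q0 = xor(tmvp_direct(a, p1), tmvp_direct(a, p2))   # (U0+U1)(P1+P2)
    q1 = xor(tmvp_direct(b, p0), tmvp_direct(b, p2))   # U1 (P0+P2)
    q2 = xor(tmvp_direct(c, p0), tmvp_direct(c, p1))   # (U1+U2)(P0+P1)
    return xor(q0, q1) + xor(q1, q2)


u = [0, 1, 1, 0, 1, 0, 1]
t = [1, 0, 1, 1, 0, 0, 1]
v = [1, 1, 0, 1]
print(rtmvp_2way_once(u, t, v) == tmvp_direct(u, tmvp_direct(t, v)))  # True
```

The intermediate vector $T \cdot V$ is never assembled explicitly; the reconstruction of the first step is folded into the inputs of the second step, which is what lets the two multiplications share one pipeline of point-wise products.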

For estimating time complexity, the evaluation step requires delay, the PWM step requires delay, and the reconstruction step requires delays. Consequently, we can obtain the following recursive relation of the time complexity: (9)

Using Lemmas 1 and 2 to solve the recurrences (7), (8), and (9), we can find the following complexities for 3-mult using 2-way RTMVP decomposition: (10) (11) (12)

Fig. 2. The proposed 2-way RTMVP decomposition.

TABLE I. LISTS THE SYMBOL PARAMETERS FOR ESTIMATING THE COMPLEXITY OF THE PROPOSED MULTIPLIER. Note: denotes the number of bit-length polynomial.

B. 3-Way RTMVP Decomposition

Let and be two Toeplitz matrices defined by and , where the 's and 's are Toeplitz matrices. Let be a column vector, where , , and are column vectors. The 3-mult of the form uses a recursive TMVP decomposition, such as and . Based on 3-way TMVP decomposition, the intermediate product is obtained as (13). Again we use 3-way TMVP decomposition to compute as (14). Substituting (13) into (14), the product can be obtained according to the following formulation: (15)

In the following, we analyze the complexity of the three stages of 3-mult according to (15). In the EPG stage, we find three components as (16) (17) (18)

Using Lemma 3 in the Appendix, the complexity of involves XOR gates. In [13], it is shown that the matrix additions ( , , and ) in (17) involve XOR gates and delays, and the vector additions in (18) involve XOR gates and delays. Thus, the EPG unit in total requires XOR gates and delay. The PWM stage needs 18 3-mults, and the R stage involves 15 vector additions. Out of the 18 3-mults involved in (15), each multiplier-1 is associated with three multiplier-2 units. Therefore, the 18 3-mults can be clustered into 6 groups in the PWM stage. For example, , , and can be clustered to form a group of 3-mults. In this group, multiplier-1 is used to compute , and three multiplier-2 units are used to compute , , and . In this case, the complexity of 3-mult can be defined as XOR gates and AND gates. Therefore, the PWM unit in each step involves XOR gates and AND gates, and requires gate delays for the computation. Since each 3-mult produces a -bit product word, we need XOR gates for the R stage, which involves delay. Therefore, we obtain the following expressions of the complexities: (19)

Using Lemmas 1 and 2 to solve the recurrences in (19), the subquadratic 3-mult based on 3-way RTMVP decomposition can be found to have the following complexities: (20)

IV. NEW SUBQUADRATIC MPB SINGLE- AND DOUBLE-MULTIPLICATIONS BASED ON TMVP AND RTMVP DECOMPOSITIONS

The Toeplitz matrix-vector product approach can be used for efficient realization of multiplications in binary extension fields for some special classes of basis representation, such as shifted polynomial basis, modified polynomial basis, and dual basis.

In general, multiplication using double basis representation involves a TMVP multiplier and basis conversion from/to the polynomials, and the cost of basis conversion depends on the chosen irreducible polynomial. Among them, the MPB representation involves significantly less space complexity if is a trinomial. For the sake of simplicity, we use trinomials to derive a new double basis representation and make use of it for efficient implementation of three-operand multiplication.

A. Formulation of New Shifted MPB Representation

Let the field $GF(2^m)$ be constructed from an irreducible trinomial of the form ; the MPB in [18] is defined as follows.

Definition 1: Let be the polynomial basis (PB) of $GF(2^m)$, where the intermediate is the root of the irreducible trinomial . The ordered set is called the modified polynomial basis with respect to the set .

In the context of this basis representation, the MPB is equivalent to the revised order sequence of the triangular basis in [19]. From the formulation of the MPB, we can find that can be represented by mod , and mod . For this reason, we can define a new shifted MPB as follows.

Definition 2: If a given set is the MPB for the trinomial , then we can define the ordered set to be the shifted MPB (SMPB).

Example 1: Let the field be constructed from the irreducible trinomial . We can find that the set is the MPB, and the set is the SMPB.

Here, we discuss the basis conversion between MPB and SMPB. Assume that a field is constructed from . Let and be two elements in represented by MPB and SMPB representations, respectively. From Definitions 1 and 2, we have obtained and . Moreover, from , we have obtained . Thus, by basis conversion from SMPB to MPB, we can obtain (21).

It can be noted that the basis conversion from SMPB to MPB is given by a permutation of the coordinate coefficients of the element. Based on the basis conversion in (21), we can define the transformation matrix as (22). We use the matrix to perform the basis conversion as (23), where is the invertible matrix defined as (24).

B. MPB Multiplication

In this subsection, we use the two bases MPB and SMPB to derive a new double basis multiplication. We have shown that the basis conversion in (21) does not involve any cost for hardware implementation, and, therefore, the double basis multiplication is equivalent to MPB multiplication. Let and be two polynomials in represented by MPB and SMPB, respectively. From Definition 2, we obtain , and the polynomial can be represented by (25). Assume that the polynomial is represented by MPB and is the product of and . It can be rewritten as (26). Based on the definition of MPB (Definition 1), we have obtained the algebraic relations in (27). Let us denote that mod , . By using the algebra of (27), can be computed as (28). Thus, based on (26), the product using matrix-vector representation can be obtained according to the formulation in (29).

The matrix in (29) can be transformed into a Toeplitz matrix. For clarity, we illustrate a double basis multiplication mod in Example 2.

Example 2: Let a field be generated by . Assume that and are two elements in represented by MPB and SMPB, respectively. We can pre-compute

, and . Using (29), the product mod can be computed as follows. Referring to the above, to obtain the matrix , it is required to compute the four terms, which involves 4 XOR gates and delay. Fig. 3(b) shows the proposed MPB multiplier, which involves a TMVP multiplier, a matrix transform unit, and a pre-computation circuit. Note that, in our proposed architecture, the matrix transformation unit for the implementation of the basis conversion from MPB to SMPB does not involve any hardware cost, but the traditional MPB multiplier [7] requires XOR gates to perform the basis conversion from the MPB to the PB.

Fig. 3. (a) Traditional MPB multiplier [7]; (b) The proposed MPB multiplier based on the subquadratic TMVP multiplier architecture of Fig. 1.

C. Three-Operand MPB Multiplier Using RTMVP Scheme

In Section III, we have derived the proposed three-operand multiplication algorithm using RTMVP decomposition. Based on MPB multiplication, we derive here a new three-operand multiplication architecture using the proposed RTMVP decomposition. Let , , , and be four elements in , where , , and are represented by MPB, is represented by SMPB, and mod . For three-operand multiplication, the product requires two multipliers in cascade: the first multiplier computes mod , and the second multiplier computes an MPB multiplication mod . Based on the structure of the double basis multiplier, we can use Lemma 4 (in the Appendix) to obtain the MPB multiplication with respect to the Toeplitz matrix-vector product. Thus, the first multiplier directly uses the proposed double basis multiplication in (29) to produce the intermediate result, which can be used for the computation of mod to perform the three-operand multiplication, where provides the first multiplication result to convert the basis transformation from the MPB to the SMPB.
Therefore, based on Lemma 4, the three-operand multiplication can be expressed alternatively as (30).

As mentioned above, it is shown that three-operand multiplication using the proposed double basis multiplier can be realized by an architecture with the RTMVP approach. Therefore, 3-mult using the proposed RTMVP decomposition can be used to derive the architecture of a multiplier with subquadratic space complexity. Fig. 4(a) shows the proposed three-operand multiplier, which involves one RTMVP multiplier and two pre-computation units. Each pre-computation unit involves XOR gates. The proposed architecture for multi-operand multiplication can be used for hardware- and time-efficient realization of applications involving successive multiplications, such as inversion, exponentiation, and pairing computation.

Fig. 4. (a) The proposed three-operand MPB multiplier based on the RTMVP decomposition of Fig. 2. (b) The three-operand multiplier using the traditional MPB multiplier approach with the subquadratic TMVP multiplier of Fig. 1.

TABLE II. COMPLEXITIES OF BCC AND PCC UNITS FOR THE PROPOSED AND THE EXISTING MPB MULTIPLIERS

Typically, dual basis multiplication is realized efficiently by the Toeplitz matrix-vector product structure, as shown in Fig. 3(a). The traditional MPB multiplier [7] involves a TMVP multiplier, a basis conversion circuit, and a pre-computation circuit. In [7], it is shown that the MPB is a dual basis formulation with respect to PB, and the implementation of double basis multiplication requires basis conversion from MPB to PB if the input operands are in MPB representation. The proposed dual basis multiplication has the following advantages compared to the traditional dual basis multiplier [7].

1) The proposed double basis multiplication is based on the two bases MPB and SMPB, while traditional double basis multiplication is based on the two bases MPB and PB.
2) The basis conversion circuit (BCC) of our proposed method performs conversion from MPB to SMPB, which does not involve any cost for hardware implementation.
The BCC of [7] performs conversion from MPB to PB. In Table II, we have listed the complexity of the BCC for all trinomials.

3) The pre-computation circuit (PCC) is used to compute all entries of the Toeplitz matrix. As shown in Example 2, our proposed PCC involves XOR gates and delay, while the complexity of the PCC of [7] depends on the selected trinomial, as shown in Table II.
4) The proposed RTMVP approach can be used for a multi-operand multiplication scheme, while the traditional MPB multiplication scheme [as shown in Fig. 4(b)] cannot be used for efficient multi-operand multiplication.

TABLE III. COMPARISON OF SELECTED SUBQUADRATIC PARALLEL MULTIPLIERS FOR TRINOMIALS WITH AND

TABLE IV. COMPARISON OF SELECTED SUBQUADRATIC PARALLEL MULTIPLIERS FOR COMPUTING THREE-OPERAND MULTIPLICATION FOR TRINOMIALS WITH AND

V. COMPLEXITY ANALYSIS

In this section, we analyze the complexity of MPB single- and double-multipliers based on TMVP and RTMVP decompositions, respectively, and compare the corresponding subquadratic multipliers.

A. Comparison of Subquadratic Single-Multipliers

We use two different bases, MPB and SMPB, to build an efficient single-multiplier [Fig. 3(b)] for the field based on MPB, while the traditional MPB multiplier [7] [Fig. 3(a)] is based on PB and MPB. Both multipliers have similar architectures, which involve three units: a TMVP multiplier, a basis conversion circuit (BCC), and a pre-computation circuit (PCC). Although the traditional MPB multiplier is derived for a quadrinomial basis, its architecture is suitable for a trinomial basis. Here, we analyze the complexity of both multipliers for a trinomial basis. We assume that the TMVP multiplier is realized by the non-recursive TMVP decomposition approach to develop a subquadratic multiplier. Therefore, we compare only the BCC and PCC units for both multipliers, as shown in Table II. The BCC unit of our proposed MPB multiplier does not involve any cost for hardware implementation, while the traditional MPB multiplier involves additional hardware for basis conversion.
The delay of our architecture for the case of with is less by than that of the traditional MPB multiplier. Other subquadratic parallel multipliers are suggested in [20] for the Winograd algorithm, and in [21] for the Karatsuba algorithm. Those algorithms are based on naive polynomial multiplication to explore an efficient subquadratic space complexity architecture. In [22], it is shown that the polynomial reduction stage for a trinomial with involves XOR gates and requires gate delays. We add the reduction stage of [22] to the Winograd and Karatsuba algorithms for evaluating the complexity of multiplication. Table III compares the complexities of the proposed and the existing subquadratic parallel multipliers. As shown in this table, our proposed MPB multiplier has significantly less computation time and less space complexity compared to the existing subquadratic multipliers.

B. Comparison of Subquadratic Double-Multipliers

It is well known that the subquadratic algorithms are derived from two-operand multiplication. For fast three-operand multiplication, the subquadratic algorithms require two separate multipliers in cascade. The multiplier of Lee et al. [16] is based on the recursive Karatsuba algorithm to explore subquadratic three-operand multiplication. Our proposed double-multiplication scheme is based on RTMVP decomposition to derive a subquadratic space complexity architecture. Table IV lists the complexity of the proposed and the existing subquadratic multipliers [13], [16], [20], [21] for three-operand multiplication. As shown in this table, the TMVP-based multiplier [13] is better than the other existing subquadratic architectures. The proposed 2-way RTMVP-based multiplier has nearly 25% less delay and about 9% less space complexity compared to those of the existing 2-way subquadratic multipliers. The proposed 3-way RTMVP-based multiplier has less delay and slightly more space

complexity compared to those of the existing 3-way TMVP-based multiplier [13].

We consider three fields based on the existing trinomials of degree , such as , , and , for the comparison of our proposed and the existing subquadratic multipliers for three-operand multiplication. In order to reduce the complexity, a hybrid subquadratic multiplication approach is introduced in [10], which combines 2-way and 3-way decomposition schemes. By this approach, we use the hybrid of 2-way and 3-way decompositions to construct the proposed and the existing subquadratic multipliers based on the field order of the form , , , and , respectively, for synthesis purposes. We have used NanGate's Library Creator and the 45-nm FreePDK Base Kit from North Carolina State University (NCSU) [23] to synthesize the proposed double-multiplier and the corresponding existing multipliers. From the synthesis results, we obtain the delay, the number of gates, and the total GE (gate equivalent), as shown in Table V. In this table, the total GE is estimated based on the used cell area, i.e., a NAND gate, an AND gate ( and ), and an XOR gate ( and ). We find that our proposed double-multiplier has less space complexity compared to the best of the existing subquadratic multipliers when the selected field order is of the form , with significantly less time complexity.

TABLE V. COMPARISON OF VARIOUS SUBQUADRATIC DOUBLE-MULTIPLIERS OVER IN TERMS OF DELAY, NUMBER OF GATES, AND TOTAL GE. GE denotes gate equivalent in terms of number of 2-input NAND gates.

VI. CONCLUSIONS

In this paper, we have derived a new SMPB representation. The proposed MPB multiplication is constructed from MPB and SMPB, while the traditional MPB multiplier is constructed from PB and MPB. We have shown that the basis conversion from MPB to SMPB does not involve any cost for hardware implementation. We have proposed three-operand MPB multiplication using a new formulation of the RTMVP decomposition.
We have used the traditional TMVP decomposition to propose two new 2-way and 3-way RTMVP decompositions. The proposed MPB single- and double-multiplications use TMVP and RTMVP decompositions, respectively, to achieve subquadratic space complexity architectures. Note that, based on the proposed MPB multiplication, we can derive a subquadratic multiplier using RTMVP decomposition, while traditional MPB multiplication cannot utilize RTMVP decomposition computing three-operand multiplication. From the theoretical analysis, it is shown that the proposed subquadratic multipliers have significantly less computation time and less space complexity compared to the existing subquadratic multipliers based on Karatsuba algorithm, Winograd algorithm, and splitting TMVP algorithm. Moreover, our proposed three-operand multiplier can be used several applications such as exponentiation, inversion, and elliptic curve point multiplication. APPENDIX Lemma 3: The matrix additions (,,,,,,and ) in (16) can be permed using XOR gates. Proof: Let. Based on Proposition 1, we can use polynomial to represent the corresponding matrix. The matrix is split into block Toeplitz matrix, such as. Using polynomial representation, the five block matrices (,1,2,3,4) can be represented by. We can theree have (31) Note that the matrix additions of this m are involved in computation of,,,,,,and. Based on (31), we can find the reused terms as follows: 1) The term in also appears 2) The term in also appears 3) The term in also appears 4) The term in also appears 5) The term in also appears The seven matrices,,,,,,and can be computed by XOR gates. Lemma 4: Assume that the MPB multiplication is based on matrix-vector product approach and expressed in the m, and we have,, is generated by,and is a matrix transmation from the SMPB to thempb. Proof: in are represented by the MPB polynomials,,andthematrix is generatedby. Using (23), we can have, perms the basis conversion from SMPB into MPB. 
Thus, the MPB multiplication can be written as. Since double basis multiplication is a Toeplitz matrix-vector product given by, we can find.
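The TMVP decompositions referred to above replace one Toeplitz matrix-vector product by several products of smaller size. As an illustration only, the following is a minimal sketch of the standard 2-way TMVP split over GF(2), in which a product of size n is computed from three products of size n/2. It is not the RTMVP scheme derived in this paper, and it does not model gate counts or delay. The function names (`xor_vec`, `tmvp_schoolbook`, `tmvp_two_way`) and the storage convention (an n x n Toeplitz matrix held as a list `t` of 2n-1 bits with entry (i, j) equal to `t[i - j + n - 1]`) are assumptions made for this example.

```python
def xor_vec(a, b):
    """Element-wise XOR (addition over GF(2)) of two equal-length bit lists."""
    return [x ^ y for x, y in zip(a, b)]


def tmvp_schoolbook(t, v):
    """Direct Toeplitz matrix-vector product over GF(2).

    Assumed convention: entry (i, j) of the n x n matrix is t[i - j + n - 1].
    """
    n = len(v)
    return [
        sum(t[i - j + n - 1] & v[j] for j in range(n)) & 1
        for i in range(n)
    ]


def tmvp_two_way(t, v):
    """2-way TMVP split: three half-size products instead of four.

    With T = [[T1, T0], [T2, T1]] and V = [V0, V1]:
        P0 = (T0 + T1) V1
        P1 = (T1 + T2) V0
        P2 = T1 (V0 + V1)
        T V = [P0 + P2, P1 + P2]
    Odd sizes (including n = 1) fall back to the direct product.
    """
    n = len(v)
    assert len(t) == 2 * n - 1, "t must hold 2n-1 defining entries"
    if n % 2 == 1:
        return tmvp_schoolbook(t, v)

    h = n // 2
    # Each half-size Toeplitz block is again described by 2h-1 entries.
    t0 = t[: 2 * h - 1]       # upper-right block
    t1 = t[h : 3 * h - 1]     # both diagonal blocks
    t2 = t[2 * h :]           # lower-left block
    v0, v1 = v[:h], v[h:]

    p0 = tmvp_two_way(xor_vec(t0, t1), v1)
    p1 = tmvp_two_way(xor_vec(t1, t2), v0)
    p2 = tmvp_two_way(t1, xor_vec(v0, v1))
    return xor_vec(p0, p2) + xor_vec(p1, p2)
```

Because the sums T0 + T1 and T1 + T2 are themselves Toeplitz, they can be formed on the 2h-1 defining entries alone, which is why the additions cost only a linear number of XOR operations at each level of the recursion.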


REFERENCES

[1] U. Bose, A. K. Bhattacharya, and A. Das, "GPU-based implementation of 128-bit secure eta pairing over a binary field," in Proc. AFRICACRYPT, 2013.
[2] C.-Y. Lee and C. W. Chiou, "Scalable Gaussian normal basis multipliers over GF(2^m) using Hankel matrix-vector representation," Signal Process. Syst., vol. 69, no. 2, pp. 197-211, 2012.
[3] C.-Y. Lee, Y.-H. Chen, C. W. Chiou, and J.-M. Lin, "Unified parallel systolic multiplier over GF(2^m)," J. Comput. Sci. Technol., vol. 22, no. 1, 2007.
[4] C.-Y. Lee and C. W. Chiou, "Efficient design of low-complexity bit-parallel systolic Hankel multipliers to implement multiplication in normal and dual bases of GF(2^m)," IEICE Trans., vol. 88-A, no. 11.
[5] C.-Y. Lee, "Low-complexity parallel systolic Montgomery multipliers over GF(2^m) using Toeplitz matrix-vector representation," IEICE Trans., vol. 91-A, no. 6.
[6] J. Han and H. Fan, "Toeplitz matrix-vector product based shifted polynomial basis multipliers for all irreducible pentanomials," IACR Cryptol. ePrint Archive, vol. 2013, p. 427, 2013.
[7] M. A. Hasan, A. H. Namin, and C. Nègre, "Toeplitz matrix approach for binary field multiplication using quadrinomials," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 3.
[8] J.-S. Pan, R. Azarderakhsh, M. M. Kermani, C.-Y. Lee, W.-Y. Lee, C. W. Chiou, and J.-M. Lin, "Low-latency digit-serial systolic double basis multiplier over GF(2^m) using subquadratic Toeplitz matrix-vector product approach," IEEE Trans. Comput., vol. 63, no. 5.
[9] S.-M. Park and K.-Y. Chang, "Fast bit-parallel shifted polynomial basis multiplier using weakly dual basis over GF(2^m)," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 12.
[10] A. Weimerskirch and C. Paar, "Generalizations of the Karatsuba algorithm for efficient implementations," Univ. Ruhr, Bochum, Germany, Tech. Rep., 2003.
[11] Y. Li, G. Chen, and J. Li, "Speedup of bit-parallel Karatsuba multiplier in GF(2^m) generated by trinomials," Inf. Process. Lett., vol. 111, no. 8.
[12] M. Bodrato, "Towards optimal Toom-Cook multiplication for univariate and multivariate polynomials in characteristic 2 and 0," in Proc. WAIFI, 2007.
[13] H. Fan and M. Hasan, "A new approach to subquadratic space complexity parallel multipliers for extended binary fields," IEEE Trans. Comput., vol. 56, no. 2.
[14] C.-Y. Lee, C.-S. Yang, B. K. Meher, P. K. Meher, and J.-S. Pan, "Low-complexity digit-serial and scalable SPB/GPB multipliers over large binary extension fields using (b, 2)-way Karatsuba decomposition," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 11, 2014.
[15] J.-S. Pan, C.-Y. Lee, and P. K. Meher, "Low-latency digit-serial and digit-parallel systolic multipliers for large binary extension fields," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 12.
[16] C.-Y. Lee, P. K. Meher, and C.-P. Chang, "Efficient M-ary exponentiation over GF(2^m) using subquadratic KA-based three-operand Montgomery multiplier," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 11.
[17] R. Azarderakhsh, K. Järvinen, and V. S. Dimitrov, "Fast inversion in GF(2^m) with normal basis using hybrid-double multipliers," IEEE Trans. Comput., vol. 63, no. 4.
[18] C. Nègre, "Quadrinomial modular arithmetic using modified polynomial basis," in Proc. ITCC, 2005.
[19] R. Furness, S. Fenn, and M. Benaissa, "Multiplication using the triangular basis representation over GF(2^m)," in Proc. Global Telecommun. Conf. (GLOBECOM'96), vol. 2.
[20] B. Sunar, "A generalized method for constructing subquadratic complexity multipliers," IEEE Trans. Comput., vol. 53, no. 9.
[21] C. Paar, "A new architecture for a parallel finite field multiplier with low complexity based on composite fields," IEEE Trans. Comput., vol. 45, no. 7.
[22] B. Sunar and C. K. Koç, "Mastrovito multiplier for all trinomials," IEEE Trans. Comput., vol. 48, no. 5.
[23] Nangate Standard Cell Library [Online]. Available: org/openeda.si2.org/projects/nangatelib/

Chiou-Yng Lee (SM'07) received the B.S. degree in medical engineering and the M.S. degree in electronic engineering, both from the Chung Yuan Christian University, Taiwan, in 1986 and 1992, respectively, and the Ph.D. degree in electrical engineering from Chang Gung University, Taiwan. From 1988 to 2005, he was a Research Associate with Chunghwa Telecommunication Laboratory, Taiwan. From 2001 to 2005, he taught courses related to finite fields at Ching Yun University. Currently, he is a Professor in the Department of Computer Information and Network Engineering at Lunghwa University of Science and Technology, Taiwan. His research interests include computations in finite fields, error-control coding, signal processing, and digital transmission systems. He is a Senior Member of the IEEE and the IEEE Computer Society.

Pramod Kumar Meher (SM'03) received the B.Sc. (Honours) and M.Sc. degrees in physics, and the Ph.D. degree in science from Sambalpur University, India, in 1976, 1978, and 1996, respectively. Currently, he is a Senior Research Scientist with Nanyang Technological University, Singapore. Previously, he was a Professor of Computer Applications with Utkal University, India, from 1997 to 2002, and a Reader in electronics with Berhampur University, India, from 1993. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and video processing, communication, bio-informatics, and intelligent computing. He has contributed more than 200 technical papers to various reputed journals and conference proceedings. Dr. Meher has served as a speaker for the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society during 2011 and 2012, Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS during 2008 to 2011, and Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS. Currently, he is serving as Associate Editor for the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, the Journal of Circuits, Systems, and Signal Processing, and Integration, the VLSI Journal. Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology. He has received the 2013 Sydney R. Parker Best Paper Award in the area of signal processing, and the 2013 M.N.S. Swamy Award for being the best paper amongst all the papers published in the Journal of Circuits, Systems, and Signal Processing.
