A High-Speed Low-Power Modulo 2^n+1 Multiplier Design Using Carbon-Nanotube Technology

A High-Speed Low-Power Modulo 2^n+1 Multiplier Design Using Carbon-Nanotube Technology
A Thesis Presented by He Qi to The Department of Electrical and Computer Engineering in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering in the field of Electronic Circuits and Semiconductor Devices
Northeastern University, Boston, Massachusetts
April, 2012


NORTHEASTERN UNIVERSITY Graduate School of Engineering
Thesis Title: A High-Speed Low-Power Modulo 2^n+1 Multiplier Design Using Carbon-Nanotube Technology
Author: He Qi
Department: Department of Electrical and Computer Engineering
Approved for Thesis Requirements of the Master of Science Degree
Thesis Advisor: Prof. Yong-Bin Kim
Thesis Reader: Prof. Fabrizio Lombardi
Thesis Reader: Prof. Minsu Choi
Department Chair: Prof. Ali Abur
Graduate School Notified of Acceptance: Dean: Prof. Sara Wadia-Fascetti
Copy Deposited in Library: Reference Librarian

Abstract
The modulo 2^n+1 multiplier is one of the critical components in digital signal processing, residue arithmetic, and data encryption, all of which demand high-speed and low-power operation. In this thesis, a new circuit implementation of a high-speed low-power modulo 2^n+1 multiplier is proposed. It has three major stages: the partial product generation stage, the partial product reduction stage, and the final adder stage. The first major technical contribution of the thesis is that the partial product reduction stage introduces a new MUX-based compressor to reduce power and increase speed. Second, in the final adder stage, the sparse-tree-based inverted end-around-carry adder reduces the number of circuit blocks on the critical path. Finally, the proposed design is implemented using both 32nm CNTFET (Carbon-Nanotube FET) and bulk CMOS technology for comparison. The CNTFET-based design dramatically decreases the PDP (Power Delay Product) of the circuit. The simulation results demonstrate that the MUX-based compressor reduces the PDP of the partial product reduction stage by a factor of 4.24 compared to the traditional full-adder-based design. The sparse-tree architecture solves the wire interconnection problem while slightly reducing the PDP of the final adder stage compared to the Kogge-Stone design. The power consumption of the CNTFET-based multiplier is on average 5.72 times lower than that of its conventional bulk CMOS counterpart, while the PDP of the CNTFET design is 94 times lower than that of the CMOS design. The proposed multiplier circuit and its implementation demonstrate the viability of the ultra-low-power and high-performance features of the promising CNTFET technology.
Index Terms: Modulo 2^n+1 Multiplier, MUX-based Compressor, Sparse-tree Adder, Carbon-Nanotube Technology

Acknowledgements
First of all, I would like to thank Prof. Yong-Bin Kim, my research advisor. His constructive suggestions and encouragement led me to make progress in my master's research. In addition, his guidance helped me realize where my passion lies and which research area I am going to concentrate on in the future. Thank you so much! I would also like to thank the members of the committee for reviewing my research results and offering valuable advice.
He Qi
Boston, MA

For my parents

CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
I. INTRODUCTION
1.1 BACKGROUND
1.2 PROBLEM AND WORK STATEMENT
1.3 OUTLINE OF THE THESIS
II. ALGORITHM
2.1 ALGORITHM OF THE PARTIAL PRODUCT GENERATION STAGE
2.2 ALGORITHM OF THE PARTIAL PRODUCT REDUCTION STAGE
2.3 ALGORITHM OF THE FINAL ADDITION STAGE
2.4 AN EXAMPLE
III. CIRCUIT IMPLEMENTATION
3.1 CIRCUIT DESIGN OF THE PARTIAL PRODUCT GENERATION STAGE
3.2 CIRCUIT DESIGN OF THE PARTIAL PRODUCT REDUCTION STAGE
3.2.1 INTRODUCTION OF A MUX-BASED COMPRESSOR DESIGN
3.2.2 DIFFERENT TYPES OF THE MUX-BASED COMPRESSORS
3.2.2.1 Circuit Design of the 3:2 Compressor
3.2.2.2 Circuit Design of the 4:2 Compressor
3.2.2.3 Circuit Design of the 5:2 Compressor
3.2.2.4 Circuit Design of the 7:2 Compressor
3.2.3 DETAILED SUBCIRCUIT DESIGN OF THE COMPRESSORS
3.2.3.1 MUX Subcircuit Design
3.2.3.2 Complementary MUX Subcircuit Design
3.2.3.3 XOR-XNOR Subcircuit Design
3.2.3.4 CGEN Subcircuit Design
3.2.4 THE ARCHITECTURE OF THE PARTIAL PRODUCT REDUCTION STAGE
3.2.4.1 Architecture Designed for an 8-bit Modulo 2^n+1 Multiplier
3.2.4.2 Architecture Designed for a 16-bit Modulo 2^n+1 Multiplier
3.3 CIRCUIT DESIGN OF THE FINAL ADDITION STAGE
IV. SIMULATION RESULTS OF THE PROPOSED DESIGN AND TECHNOLOGY COMPARISON
4.1 PERFORMANCE COMPARISON BETWEEN THE FULL ADDER BASED COMPRESSOR AND THE MUX BASED COMPRESSOR
4.2 SIMULATION RESULTS OF DIFFERENT COMPRESSOR ARCHITECTURES IN THE PARTIAL PRODUCT REDUCTION STAGE
4.3 SIMULATION RESULTS OF THE SPARSE-TREE ARCHITECTURE AND THE KOGGE-STONE ARCHITECTURE
4.4 SIMULATION RESULTS OF THE CNT-BASED DESIGN AND THE BULK CMOS-BASED DESIGN
4.4.1 FEATURES OF THE CNT TECHNOLOGY
4.4.2 POWER, DELAY AND AREA
4.4.3 PVT VARIATION
V. CONCLUSION
REFERENCE
APPENDIX: HSPICE INPUT FILES

List of Figures
Fig. 1 Initial Partial Product Matrix
Fig. 2 Modified Partial Product Matrix
Fig. 3 Final n×n Partial Product Matrix
Fig. 4 8-bit Kogge-Stone Adder
Fig. 5 16-bit Kogge-Stone Adder
Fig. 6 8-bit Kogge-Stone Diminished-1 Adder
Fig. 7 Revised Diminished-1 Kogge-Stone Adder with Stages
Fig. 8 16-bit Kogge-Stone Adder with Sparsity of 4
Fig. 9 Inverted EAC Adder with Sparsity of 4
Fig. 10 Inverted EAC Adder with Sparsity of 4 in Stages
Fig. 11 The Initial Output of the Partial Product Generation Stage
Fig. 12 The n×n Partial Product Matrix
Fig. 13 The Final Partial Product Matrix with the Correction Factor
Fig. 14 The Initial Output of the Partial Product Reduction Stage
Fig. 15 The Output of the Partial Product Reduction Stage after Repositioning
Fig. 16 Proposed Inverter
Fig. 17 NAND Gate with 2 Inputs
Fig. 18 NOR Gate with 2 Inputs
Fig. 19 Traditional Design of the Partial Product Reduction Stage
Fig. 20 A New Design of the Partial Product Reduction Stage
Fig. 21 Traditional MUX-based Design of the 3:2 Compressor
Fig. 22 A New MUX-based Design of the 3:2 Compressor
Fig. 23 Traditional MUX-based Design of the 4:2 Compressor
Fig. 24 A New MUX-based Design of the 4:2 Compressor
Fig. 25 Existing Architectures of the 5:2 Compressor
Fig. 26 A New MUX-based Design of the 5:2 Compressor
Fig. 27 A New MUX-based Design of the 7:2 Compressor
Fig. 28 Original Design of the MUX Subcircuit
Fig. 29 Modified Design of the MUX Subcircuit
Fig. 30 Proposed Design of the MUX Subcircuit
Fig. 31 Existing Designs of the Complementary-output MUX Subcircuit
Fig. 32 Proposed Design of the Complementary-output MUX Subcircuit
Fig. 33 Original Design of the XOR-XNOR Subcircuit
Fig. 34 Modified Designs of the XOR-XNOR Subcircuit
Fig. 35 Proposed Design of the XOR-XNOR Subcircuit
Fig. 36 Proposed Design of the CGEN Subcircuit
Fig. 37 Possible Compressor Architectures for an 8-bit Modulo 2^n+1 Multiplier
Fig. 38 Possible Compressor Architectures for a 16-bit Modulo 2^n+1 Multiplier
Fig. 39 The 4-bit Conditional Sum Generator
Fig. 40 Delay of the Full Adder Based Compressor
Fig. 41 Delay of the MUX Based Compressor
Fig. 42 Critical Path Delay of the Sparse-tree Adder
Fig. 43 Noncritical Path Delay of the Sparse-tree Adder
Fig. 44 Critical Path Delay of the Kogge-Stone Adder
Fig. 45 Delay and Rise-time of the Proposed Multiplier Based on CMOS Technology
Fig. 46 Delay and Rise-time of the Proposed Multiplier Based on CNTFET Technology
Fig. 47 Power Consumption of the Proposed Multiplier Based on Two Technologies
Fig. 48 Temperature Variation
Fig. 49 Voltage Variation
Fig. 50 Process Variation

List of Tables
Table 1 Truth Table of the CGEN Subcircuit
Table 2 Comparison between the Kogge-Stone Adder and the Sparse-tree Adder
Table 3 Performance Comparison between the Full Adder Based Compressor and the MUX-based Compressor
Table 4 Performance and Power Comparison between Different Types of Compressors
Table 5 Performance and Power Comparison among Different Compressor Architectures for an 8-bit Modulo 2^n+1 Multiplier
Table 6 Performance and Power Comparison among Different Compressor Architectures for a 16-bit Modulo 2^n+1 Multiplier
Table 7 Performance and Power Comparison between the Kogge-Stone Architecture and the Sparse-tree Architecture
Table 8 Performance Comparison between the Proposed Multiplier Based on Two Different Technologies
Table 9 Delay Comparison between Two Technologies with Different Temperatures
Table 10 Rise-time Comparison between Two Technologies with Different Temperatures
Table 11 Delay Comparison between Two Technologies with Different Supply Voltages
Table 12 Delay Comparison between Two Technologies with Different Process Corners
Table 13 Rise-time Comparison between Two Technologies with Different Process Corners

I. Introduction
1.1 BACKGROUND
Modulo arithmetic is widely used in many areas. In cryptography, it is the foundation of public-key systems and is used in a number of symmetric-key algorithms such as the International Data Encryption Algorithm (IDEA) and the Advanced Encryption Standard (AES). A variety of modulo operations also appear in computer science, such as the XOR operation in programming languages. Modulo arithmetic further has applications in music and chemistry, for example the modulo 12 operations used in electronic instruments to implement twelve-tone equal temperament. Nowadays, modulo arithmetic is frequently used in the fault-tolerant design of ad-hoc networks and in digital and linear convolution architectures [1]. In recent years, information security, especially the confidentiality of data transmitted through signal channels, has become increasingly important because of the growing popularity and maturing functionality of the internet, which gives cryptography a significant role in the information age. Modulo 2^n and modulo 2^n+1 multipliers are key blocks in the circuit implementation of cryptographic algorithms such as IDEA [1]. The residue number system (RNS) is another important application of modulo arithmetic. In recent years, the RNS has been widely used in arithmetic computation and signal processing applications such as fast Fourier transforms, digital filtering, and image processing [2]. The RNS is popular because it decomposes a large integer into several small integers, so that a calculation on the large integer becomes several small integer calculations performed in parallel. This effectively increases the operating speed [3]. Among popular moduli sets, (2^n-1, 2^n, 2^n+1) draws

the most attention and has been studied for several decades because of its easy conversion between binary and residue representations. Such conversion is based on the conventional Chinese remainder theorem [2]. The modulo 2^n-1 and modulo 2^n operations take n-bit wide inputs, while the modulo 2^n+1 operation takes (n+1)-bit wide inputs [1]. This makes the modulo 2^n+1 block the most difficult and complex of the three to implement in hardware, and it has therefore received much attention. Many architectures and circuit implementations of the modulo 2^n+1 block have been proposed and compared in the past decades. In Curiger's work [4], three multiplication architectures are proposed. The first architecture uses an (n+1)×(n+1)-bit multiplier followed by modulo adders to correct the errors caused by the carry. The second architecture takes advantage of a modulo 2^n+1 adder, where the multiplier consists of a carry-save adder and a final carry-select addition unit to reduce design complexity [1]. The third architecture modifies the second by correcting errors in the carry-select adder. Furthermore, the circuit area is significantly reduced and the operating speed is increased by introducing a bit-pair recoding scheme in the carry-save adder block [4]. Although the last two architectures are suitable for full-custom design [1], they increase not only layout and fabrication complexity but also design challenges. In the work of Hiasat [5], a very high speed modulo (2^n+1) multiplier is proposed. The circuit implementation combines a binary multiplier stage, an adder stage, and several logic gates. The main contribution of this work is reducing the hardware requirement and supporting very large dynamic ranges.
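The residue number system with the moduli set (2^n-1, 2^n, 2^n+1) described above can be illustrated with a minimal Python sketch. The function names (`moduli_set`, `to_rns`, `from_rns`) and the sample operands are assumptions made for illustration only and do not come from the cited works; the reconstruction uses the textbook Chinese remainder theorem rather than any specific converter circuit.

```python
from math import prod


def moduli_set(n):
    """Return the moduli set (2^n - 1, 2^n, 2^n + 1); the three are pairwise coprime."""
    return (2**n - 1, 2**n, 2**n + 1)


def to_rns(x, moduli):
    """Decompose an integer into its residues, one small integer per modulus."""
    return tuple(x % m for m in moduli)


def from_rns(residues, moduli):
    """Rebuild the integer from its residues with the Chinese remainder theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        # pow(m_i, -1, m) is the modular inverse of m_i modulo m (Python 3.8+)
        total += r * m_i * pow(m_i, -1, m)
    return total % big_m


n = 4
moduli = moduli_set(n)                      # (15, 16, 17)
x, y = 123, 29                              # hypothetical operands, x*y < 15*16*17
rx, ry = to_rns(x, moduli), to_rns(y, moduli)
# Each channel multiplies independently, so the three products could run in parallel.
rz = tuple((a * b) % m for a, b, m in zip(rx, ry, moduli))
z = from_rns(rz, moduli)
# Bits needed to hold any residue: n, n and n+1 for the three channels.
widths = [(m - 1).bit_length() for m in moduli]
print(z, widths)
```

The `widths` list shows why the 2^n+1 channel is the awkward one: it alone needs n+1 bits, because the residue 2^n can occur.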

Later, in the work of Wrzyszcz and Milford [6], a new partial product matrix is introduced to reduce the design and hardware complexity of the previous designs while adding very little hardware overhead. Furthermore, their design achieves a regular VLSI layout because the whole structure is composed almost entirely of full adders and half adders, which also improves parallel computing performance, speed, and the maximum operating frequency. Finally, because of the periodicity that occurs in every row of the partial product array, only bits with weight less than 2^n remain to compose the final (n+1)×(n+1) partial product matrix after repositioning. The correction process is also easy to realize because of these characteristics. In the work of Zimmermann [7], a new implementation of the modulo (2^n+1) multiplier is proposed, which has three major parts: the modulo-reduced partial product generation block, the modulo carry-save adder, and the modulo final adder. To implement the final modulo addition, a fast and simple end-around-carry adder is needed. Zimmermann introduces a new parallel prefix adder to realize this function, which dramatically increases the operating speed. Furthermore, conventional Booth coding in the partial product generation stage and the Wallace tree structure in the final adder stage can also be used to speed up Zimmermann's algorithm. The highly regular structure of this implementation also reduces the complexity of the layout process and is very suitable for VLSI implementation and modularization. Chaves and Sousa [8] later realized the idea of Zimmermann. The Booth coder and the Wallace tree structure made their implementation the fastest modulo (2^n+1) multiplier at that time. From a panoramic point of view, a lot of work on the diminished-1 algorithm has been

done to solve the problem of the (n+1)-bit input length in a modulo (2^n+1) multiplier implementation. For example, Yutai Ma [9] introduces a bit-pair Booth recoding technique and a carry-save adder to reduce the number of partial products, with different counts for even n and for odd n. In the work of Zimmermann [7], a weighted operand representation is introduced to implement the diminished-1 function at the cost of an additional correction circuit. Wang's work [10] eliminates the conversion circuit between binary and diminished-1 representations, which reduces power and circuit complexity. Chaves and Sousa [8] compare ordinary and diminished-1 implementations of the modulo (2^n+1) multiplier, and also optimize the Booth recoding scheme to speed up the multiplier. Vergos and Efstathiou [11] improve on the work of Wrzyszcz and Milford [6] by reducing the correction factor from 3 to 1, which reduces circuit complexity and increases speed.
1.2 PROBLEM AND WORK STATEMENT
To sum up, today's modulo (2^n+1) multipliers are characterized by high speed, low power, small area, and a regular structure that is suitable for VLSI implementation. However, further improvements to the circuit implementation can still be achieved. Enhancements are possible in the partial product reduction stage and the final adder stage, because these two stages form the critical path of the multiplier. Thus, new efficient hardware designs of the partial product reduction block and the final adder block that achieve higher speed and lower power are needed. To further improve the modulo (2^n+1) multiplier, a new circuit implementation is proposed in this thesis. It has three major stages: the partial product generation stage, the partial product

reduction stage, and the final adder stage. The last two stages determine the speed and power of the whole circuit. The conventional compressor in the partial product reduction stage is built from cascaded full adders and half adders. However, adders consume a lot of power and have a large delay. In this thesis, a new compressor based on the combination of MUX and XOR-XNOR gates is proposed to reduce the PDP [1]. For the final adder stage, the conventional Kogge-Stone adder is the fastest parallel-prefix carry look-ahead adder [13]. However, the performance of parallel prefix adders is limited by the large number of carry-merge cells and excessive inter-stage wiring tracks. In this thesis, a sparse-tree-based inverted EAC adder is used to solve this problem [14]. The sparse-tree architecture dramatically reduces the number of blocks in the last stage compared to the Kogge-Stone adder, which greatly simplifies the layout process. The sparse-tree architecture also reduces the delay of the last stage, because the sparse-tree path is not the critical path and the fan-out of the critical path is reduced. Additionally, the limitations of the technology itself restrict further improvement of the circuit implementation of the modulo (2^n+1) multiplier. Transistors in the popular CMOS technology can be scaled down to very small sizes to achieve a very high integration capacity in VLSI implementations. Nowadays, 32nm CMOS technology is widely used and has dramatically increased the speed of the multiplier. However, as feature sizes scale down toward 25nm in the near future, the leakage current of the transistor will increase significantly. Sensitivity to process variation also increases to a level that cannot be ignored, which tightens the accuracy required of the manufacturing process [12]. Furthermore, the intrinsic capacitance of nodes becomes smaller as transistor size and supply voltage decrease, reducing the amount of charge that can be stored at each node. This makes

instantaneous voltage changes, such as those caused by cosmic particle strikes, a serious problem that can destroy the device under some conditions [12]. Thus, robust technologies whose properties remain stable as transistors shrink will be required in the near future. Among the variety of modern technologies, cylindrical carbon molecules have beneficial properties for applications in electronics and nanotechnology [12]. The carbon nanotube (CNT) is a tube-shaped allotrope of carbon. CNTs benefit from a length-to-diameter ratio of as high as over 130 billion, which is far larger than that of any other material under study. One of the advantageous properties of the CNT is its extreme hardness and stiffness. The only limitation of this property is that it is sensitive to high-energy electron irradiation. The particular structure of the CNT allows its conductivity to vary between semiconducting and metallic. For a given (n,m) CNT, if n = m, the CNT is metallic; if n-m is a nonzero multiple of 3, the CNT is quasi-metallic with a very small band gap; otherwise the CNT is a semiconductor. Furthermore, the CNT has very good thermal properties, such as thermal conductivity and thermal stability. Based on CNT technology, a new CNT transistor (CNTFET) has been introduced in recent years with the advantages of lower leakage power, better frequency response, lower PVT variation, and extremely low PDP, which makes the CNTFET a very competitive substitute for the traditional MOSFET.
1.3 OUTLINE OF THE THESIS
The rest of the thesis is organized as follows. In section II, the algorithm used to implement the multiplier is presented. Section III describes the proposed circuit implementation of the modulo 2^n+1 multiplier, including the novel sparse-tree-based inverted EAC adder and the MUX-based compressor. The simulation results of the CNTFET-based

design and the comparison with the traditional CMOS-based design are given in section IV, followed by the conclusion in section V.

II. Algorithm
Among the various existing algorithms for A×B mod (2^n+1), the one presented by Vergos and Efstathiou [1] is considered to be the best. The proposed circuit implementation based on this algorithm can be adapted to various applications such as the IDEA cipher mentioned in section I. Some problems might occur when this algorithm is used for the IDEA cipher, because the work of Vergos and Efstathiou [1] uses (n+1)-bit wide inputs while in the IDEA application the input width is n. However, this problem can easily be solved by connecting the MSB of the two inputs to ground and neglecting the MSB of the output.
2.1 ALGORITHM OF THE PARTIAL PRODUCT GENERATION STAGE
Assume A and B are two inputs represented as A = a_n a_{n-1} a_{n-2} ... a_1 a_0 and B = b_n b_{n-1} b_{n-2} ... b_1 b_0; then A×B modulo (2^n+1) can be represented as follows [1]:

$$ |A \times B|_{2^n+1} = \left| \sum_{i=0}^{n} \sum_{j=0}^{n} p_{i,j}\, 2^{i+j} \right|_{2^n+1} \qquad (1) $$

where p_{i,j} = a_i AND b_j. The A×B operation can be achieved by adding a group of partial products together in a certain order. On inspection, the partial product matrix can be divided into four groups, A, B, C and D, as shown in Fig. 1 (where P_{i,j} = a_i AND b_j). Only one group of them can be different

from 0 at any given time. Thus, partial products in different groups can be ORed instead of being added together. First, we perform the logic OR operation on the terms of groups A, B, and D in the columns with weight 2^n up to 2^{2n-2}, and on the two terms of groups B and D with weight 2^{2n-1}. Since |2^{2n-1}|_{2^n+1} = 2^{n-1} + 1, the term weighted 2^{2n-1}, q_{n-1}, can be substituted by two terms q_{n-1} in the columns with weight 2^{n-1} and 1, respectively, each ORed with any term of group A there. Moreover, since |2^{2n}|_{2^n+1} = 1, the term p_{n,n} can be repositioned to the rightmost column and ORed with p_{0,0} [1, 11]. The modified partial product matrix after the OR operations is shown in Fig. 2 (where q_i = p_{i,n} OR p_{n,i}).
Fig. 1 Initial Partial Product Matrix (the partial products P_{i,j}, i, j = 0..n, in columns weighted 2^{2n} down to 2^0, divided into groups A, B, C and D)
Fig. 2 Modified Partial Product Matrix (n rows in columns weighted 2^{2n-2} down to 2^0; row 0 holds P_{n-1,0} OR q_{n-1} ... P_{0,0} OR P_{n,n} OR q_{n-1}, and row j, for j = 1..n-1, holds P_{n-1,j} OR q_{j-1} ... P_{0,j})
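The OR-based merging described above can be checked numerically. The following Python sketch is an illustration under stated assumptions, not the thesis circuit: the function name `modified_matrix_sum` is hypothetical, the row layout follows the description of Fig. 2 given in the text, and n is assumed to be at least 2 so that the two copies of q_{n-1} fall in different columns.

```python
def bit(x, i):
    """Return bit i of the integer x."""
    return (x >> i) & 1


def modified_matrix_sum(a, b, n):
    """Sum the terms of the OR-merged n-row matrix modulo 2^n + 1.

    a and b are (n+1)-bit operands in the range [0, 2^n]; n >= 2 is assumed.
    """
    def p(i, j):
        return bit(a, i) & bit(b, j)

    def q(i):
        return p(i, n) | p(n, i)

    total = 0
    for j in range(n):            # row index: bit of b
        for i in range(n):        # bit of a inside the row
            term = p(i, j)
            if i == n - 1:
                # Leftmost term of row j absorbs q_{j-1}; row 0 absorbs q_{n-1},
                # the copy of the 2^(2n-1) term that lands at weight 2^(n-1).
                term |= q(j - 1) if j > 0 else q(n - 1)
            if i == 0 and j == 0:
                # Rightmost column absorbs p_{n,n} (2^(2n) = 1 mod 2^n+1)
                # and the second copy of q_{n-1}.
                term |= p(n, n) | q(n - 1)
            total += term << (i + j)
    return total % (2**n + 1)


n = 4
m = 2**n + 1
# The two weight identities used in the text.
print(pow(2, 2 * n - 1, m) == 2**(n - 1) + 1, pow(2, 2 * n, m) == 1)
# Replacing additions by ORs must not change the product for any legal operand pair.
print(all(modified_matrix_sum(a, b, n) == (a * b) % m
          for a in range(2**n + 1) for b in range(2**n + 1)))
```

The exhaustive comparison works because an operand with its bit n set must equal 2^n exactly, which forces all of its lower bits to zero and makes the ORed terms mutually exclusive.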

An observation can be made regarding the repositioning of the partial product terms with weight greater than 2^{n-1} in the n×n partial product matrix, based on the following equation [11]:

$$ |s\,2^{i}\,2^{n}|_{2^n+1} = |-s\,2^{i}|_{2^n+1} = |\bar{s}\,2^{i} + 2^{n}\,2^{i}|_{2^n+1} \qquad (2) $$

Equation (2) shows that repositioning a bit to the i-th bit position requires a correction factor to make sure that the resulting partial product matrix is equivalent to the initial partial product matrix before repositioning. For the i-th partial product vector, which has i repositioned bits, the correction factor is derived as 2^n(2^i - 1). Hence, the correction factor of the entire partial product matrix is given by [11]:

$$ COR_1 = \left| \sum_{i=0}^{n-1} 2^{n}\,(2^{i} - 1) \right|_{2^n+1} = |2^{n}\,(2^{n} - 1 - n)|_{2^n+1} = |n + 2|_{2^n+1} \qquad (3) $$

The final n×n partial product matrix after the reposition operation is shown in Fig. 3.
Fig. 3 Final n×n Partial Product Matrix (n rows in columns weighted 2^{n-1} down to 2^0)
2.2 ALGORITHM OF THE PARTIAL PRODUCT REDUCTION STAGE
Another observation concerns the compressors in the partial product reduction stage, which

perform like a carry save adder (CSA). Since this CSA works as a modulo 2^n+1 adder, the carry-out bit of each level of the CSA has to be fed back as the carry-in bit of the next level [1]. Supposing that the carry-out bit of the n-th column at the i-th stage of the CSA is c_i with weight 2^n, the carry-out can be rewritten as [11]:

$$ |c_i\,2^{n}|_{2^n+1} = |-c_i|_{2^n+1} = |\bar{c}_i + 2^{n}|_{2^n+1} \qquad (4) $$

Thus, in an (n-1)-stage CSA, a second correction factor arising from the carry-out bits of the CSA according to equation (4) is [1]:

$$ COR_2 = |2^{n}\,(n - 1)|_{2^n+1} = |1 - n|_{2^n+1} \qquad (5) $$

The final correction factor is the sum of COR1 and COR2:

$$ COR = |COR_1 + COR_2|_{2^n+1} = |n + 2 + 1 - n|_{2^n+1} = 3 \qquad (6) $$

For an n-bit modulo (2^n+1) multiplier, the constant 3 is the final correction factor. A 2 is added in the partial product reduction stage, while a 1 is added in the final adder stage because of the inverted carry feedback discussed later in this thesis.
2.3 ALGORITHM OF THE FINAL ADDITION STAGE
When two 1-bit wide inputs A and B are added together, A and B are said to generate if the carry-out of A+B is always 1, regardless of the value of the input carry. In practice, A and B generate only when both A and B are logic 1. We use g to represent the generate relationship, denoted as g = A AND B. Similarly, A and B are said to propagate if the carry-out of A+B is 1 whenever the carry-in bit is 1. In practice, A and B propagate only when at least one of A or B is logic 1. We use p to represent the propagate relationship, denoted as p = A OR B.

Fig. 4 8-bit Kogge-Stone Adder
The final adder stage is an inverted End-Around-Carry (EAC) adder revised from the conventional Kogge-Stone adder. An 8-bit Kogge-Stone adder is shown in Fig. 4. The algorithm of the Kogge-Stone adder is as follows. Each input bit pair produces a propagate bit and a generate bit as defined above. Next, the prefix operator combines adjacent (generate, propagate) pairs, (g, p) ∘ (g', p') = (g + p·g', p·p'), in the subsequent stages in the vertical direction. The final generate bits are produced in the last stage. These bits are XORed with the initial propagate bits to produce the final sum bits. For example, the LSB of the sum vector is P_0 XORed with the carry-in bit. The second LSB of the sum vector is P_1 XORed with the rightmost carry-out bit in the last stage of the prefix operation. The 16-bit Kogge-Stone adder operates in the same manner, as shown in Fig. 5.
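The prefix computation described above can be sketched in Python as follows. This is a behavioural illustration only, not the circuit of Fig. 4: the function name `kogge_stone_add` is hypothetical, and the sketch assumes the XOR (half-sum) form of the propagate signal, since that is the form that yields the sum bit when XORed with the incoming carry; the OR form gives the same carries.

```python
def kogge_stone_add(a, b, n, cin=0):
    """Add two n-bit integers with a Kogge-Stone style prefix tree.

    Returns (sum mod 2^n, carry_out). Assumption: propagate is a_i XOR b_i.
    """
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]   # propagate / half-sum
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]   # generate
    G, P = g[:], p[:]
    dist = 1
    while dist < n:                       # about log2(n) prefix stages
        new_g, new_p = G[:], P[:]
        for i in range(dist, n):
            # Prefix operator: (g, p) o (g', p') = (g + p*g', p*p')
            new_g[i] = G[i] | (P[i] & G[i - dist])
            new_p[i] = P[i] & P[i - dist]
        G, P = new_g, new_p
        dist *= 2
    # After the last stage G[i], P[i] cover bits i..0; carry[i] is the carry into bit i.
    carry = [cin] + [G[i] | (P[i] & cin) for i in range(n)]
    total = 0
    for i in range(n):
        total |= (p[i] ^ carry[i]) << i   # sum bit = propagate XOR incoming carry
    return total, carry[n]


print(kogge_stone_add(119, 87, 8))        # 119 + 87 = 206, no carry out of 8 bits
```

Every stage doubles the span each (G, P) pair covers, which is why all n carries are available after roughly log2(n) stages, at the price of the dense wiring the thesis discusses later.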

Although the conventional Kogge-Stone adder is thought to be the fastest adder available today, it needs some structural revision to realize the modulo (2^n+1) function. The partial product reduction stage generates an n-bit sum vector and an n-bit carry vector, which are added in the final adder stage. To achieve the modulo (2^n+1) addition function, the carry-out bit of this addition has to be fed back to the LSB of the final adder stage, as shown in the work of Zimmermann [7]:

$$ |S + C + 1|_{2^n+1} = |S + C + \bar{c}_{out}|_{2^n} \qquad (7) $$

From (7) we can observe that the inverted carry-out bit of the addition of the sum and carry vectors has to be fed back to achieve the modulo (2^n+1) function in the revised Kogge-Stone adder architecture shown later.
Fig. 5 16-bit Kogge-Stone Adder
The parallel prefix computation is retained in the revised architecture. Instead of directly XORing the propagate of each n-th bit with the (n-1)-th

carry-out bit, the new architecture inverts the (n-1)-th carry-out bit, and this new inverted (p_i*, g_i*) set is then combined with the (p_i, g_i) set of each bit to generate the final (G, P) set. Finally, the sum vector of the final adder stage is generated by XORing the final carry-out bit g_i* with the initial propagate p_i. The revised 8-bit EAC Kogge-Stone adder is shown in Fig. 6.
Fig. 6 8-bit Kogge-Stone Diminished-1 Adder
Because the final sum and carry vectors depend mainly on the generate-propagate set in every stage, the derivation of (G, P) and some of its characteristics should

be discussed. Furthermore, to reduce the logic depth of the architecture in Fig. 6, a new architecture is introduced based on the algorithmic improvement shown below. The carry-out bit of a carry-look-ahead (CLA) adder is logic 1 in either of the following cases: A and B generate, or the next less significant carry-out bit is 1 and A and B propagate. The carry-out bit of the CLA can then be written as:

$$ c_i = g_i + p_i\,c_{i-1} \qquad (8) $$

According to (8), the final generate-propagate set in the last stage can be expressed as follows [1]: (9)
There are several observations regarding the equation above. First, the inverted EAC adder simply takes the inverted logic of the generate bit and keeps the value of the propagate bit, that is, the inverse of (G, P) is (NOT G, P). The second observation is: (10) (11)
The third observation concerns the following derivation [15]:

(12)
In some cases, building the whole architecture in the target number of stages based on (12) is not possible. To solve this problem, (12) can be transformed into another form [15]: (13)
The newly designed final stage adder based on this algorithm is shown in Fig. 7. However, this implementation has an obvious wire interconnection problem because of the complexity of the cells [1]. One possible solution to the wire interconnection problem is to introduce a sparse-tree architecture. The sparsity of a Kogge-Stone adder refers to how often the adder generates a carry-out bit. For example, a sparsity of 1 means that the adder generates a carry-out for every bit, a sparsity of 2 means generating a carry-out for every other bit, and a sparsity of 4 means generating a carry-out for every fourth bit. A much shorter ripple-carry adder is then introduced, which takes the carry-out of the sparse-tree adder as its carry-in. Because this shorter ripple-carry adder is not on the critical path, the delay of the final adder stage is reduced, while the wire interconnection problem is solved. There is a trade-off between the sparsity and the effectiveness of solving the wire interconnection problem. Increasing the sparsity increases the speed of the sparse-tree adder;

however, the delay of the short ripple-carry adder grows as well. Eventually, the critical path is no longer the sparse-tree adder but the short ripple-carry adder.
Fig. 7 Revised Diminished-1 Kogge-Stone Adder with Stages (inputs a7,b7 ... a0,b0; outputs s7 ... s0)
An example of a 16-bit Kogge-Stone adder with a sparsity of 4 is shown in Fig. 8, while the inverted EAC adder with a sparsity of 4 is shown in Fig. 9.
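The idea of a sparsity-4 inverted end-around-carry adder can be illustrated with a short behavioural sketch in Python. It is not the circuit of Fig. 9: the names `inverted_eac_reference` and `sparse4_inverted_eac` are hypothetical, the group (G, P) values are obtained by a plain sequential fold that merely stands in for the prefix tree, and the inner 4-bit blocks use a simple ripple where the thesis may use a different block design. The sketch assumes the inverted EAC sum is defined as (S + C + NOT c_out) mod 2^n, where c_out is the carry-out of S + C.

```python
def inverted_eac_reference(s, c, n):
    """(s + c + NOT carry_out) mod 2^n, computed arithmetically."""
    t = s + c
    cout = t >> n
    return (t + (1 - cout)) & ((1 << n) - 1)


def sparse4_inverted_eac(s, c, n):
    """Same result, using one tree carry per 4-bit block (n a multiple of 4)."""
    a = [(s >> i) & 1 for i in range(n)]
    b = [(c >> i) & 1 for i in range(n)]
    g = [x & y for x, y in zip(a, b)]
    p = [x ^ y for x, y in zip(a, b)]
    # Group generate/propagate over bits i..0. A real sparse tree would only
    # build the entries at block boundaries; the fold here is a stand-in.
    G, P = [g[0]], [p[0]]
    for i in range(1, n):
        G.append(g[i] | (p[i] & G[i - 1]))
        P.append(p[i] & P[i - 1])
    eac = 1 - G[n - 1]                    # inverted carry-out, fed back as carry-in
    result = 0
    for k in range(n // 4):
        lo = 4 * k
        # Only this one carry per block is taken from the tree.
        carry = eac if k == 0 else G[lo - 1] | (P[lo - 1] & eac)
        for i in range(lo, lo + 4):       # short ripple inside the block
            result |= (p[i] ^ carry) << i
            carry = g[i] | (p[i] & carry)
    return result


# Sum and carry vectors of the worked example in Section 2.4 (Fig. 15).
print(sparse4_inverted_eac(0b11010100, 0b01110101, 8))
```

The arithmetic reference equals (S + C + 1) mod (2^n+1) whenever S + C differs from 2^n - 1; in that one case the true result is 2^n, which does not fit in n bits, and the n-bit adder returns 0.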

Fig. 8 16-bit Kogge-Stone Adder with Sparsity of 4 (inputs a15,b15 ... a0,b0; carries C13, C9, C5, C1)
Fig. 9 Inverted EAC Adder with Sparsity of 4 (inputs a15,b15 ... a0,b0; carries C15 = C-1, C11, C7, C3)
For 8-bit and 16-bit adders, a sparsity of 4 is usually chosen [14]. The carry-out equations for the 16-bit sparse-tree inverted EAC adder are as follows:

(14)
Based on the derivation shown in (12), the equations become: (15)
Based on the derivation shown in (13), the equations become: (16)
The final equations in (16) limit the number of stages needed for the modulo addition in the final adder stage, as shown in Fig. 10. This architecture solves the wire interconnection problem and reduces the non-critical path delay.
Fig. 11 The Initial Output of the Partial Product Generation Stage (A = 119, B = 87)

Fig. 10 Inverted EAC adder with sparsity of 4 in stages

2.4 AN EXAMPLE

Take a 9-bit modulo 2^n+1 multiplier (n = 8) as an example. Assume the two inputs are A = 119 = 001110111 and B = 87 = 001010111. The initial output of the partial product generation stage is shown in Fig. 11.

0 1 1 1 0 1 1 1
1 1 1 0 1 1 1 1
1 1 0 1 1 1 1 0
0 0 0 0 0 1 1 1
0 1 1 1 1 0 0 0
0 0 0 1 1 1 1 1
1 1 1 1 0 0 0 1
0 0 1 1 1 1 1 1

Fig. 12 The n × n Partial Product Matrix

The left half (to the left of the dashed line) of the initial partial products shown in Fig. 11 needs to be repositioned using the principle illustrated in Fig. 3. The final n × n partial product matrix after repositioning is shown in Fig. 12. A correction factor of 2, in the form of the correction vector shown in the block in Fig. 13, is added to the bottom of the n × n partial product matrix. The total correction factor of the modulo 2^n+1 multiplier is 3; the remaining 1 is added in the final adder stage.

0 1 1 1 0 1 1 1
1 1 1 0 1 1 1 1
1 1 0 1 1 1 1 0
0 0 0 0 0 1 1 1
0 1 1 1 1 0 0 0
0 0 0 1 1 1 1 1
1 1 1 1 0 0 0 1
0 0 1 1 1 1 1 1
0 0 0 0 0 0 1 0   (correction vector)

Fig. 13 The Final Partial Product Matrix with the Correction Factor

1 1 0 1 0 1 0 0   Sum Vector
0 0 1 1 1 0 1 0   Carry Vector

Fig. 14 The Initial Output of the Partial Product Reduction Stage

1 1 0 1 0 1 0 0   Sum Vector
0 1 1 1 0 1 0 1   Carry Vector

Fig. 15 The Output of the Partial Product Reduction Stage after Repositioning

The partial product reduction stage compresses the partial product matrix in Fig. 13 to a final sum vector and a carry vector, as shown in Fig. 14. This initial output of the partial product reduction stage also needs to be repositioned. The final sum vector and carry vector after repositioning, together with the remaining 1, are then added modulo 2^n+1. In this example, 119 × 87 modulo (2^8+1) equals 73.
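The last two steps of this example can be checked numerically. The sketch below is a behavioural illustration under stated assumptions: the helper names `reposition_carry` and `inverted_eac_add` are hypothetical, the carry repositioning is modelled as a one-bit left shift whose outgoing MSB re-enters the LSB complemented (which reproduces the change from Fig. 14 to Fig. 15), and the final adder is modelled as an n-bit addition with an inverted end-around carry.

```python
N = 8
MOD = (1 << N) + 1  # 2^8 + 1 = 257


def reposition_carry(c, n=N):
    """Shift left by one bit; the bit leaving the MSB re-enters the LSB complemented."""
    msb = (c >> (n - 1)) & 1
    return ((c << 1) & ((1 << n) - 1)) | (msb ^ 1)


def inverted_eac_add(x, y, n=N):
    """n-bit addition where the complement of the carry-out is added back in."""
    total = x + y
    cout = total >> n
    return (total + (1 - cout)) & ((1 << n) - 1)


a, b = 119, 87
sum_vec = 0b11010100    # sum vector from Fig. 14
carry_vec = 0b00111010  # carry vector from Fig. 14, before repositioning

carry_rep = reposition_carry(carry_vec)
result = inverted_eac_add(sum_vec, carry_rep)

print(format(carry_rep, "08b"), result, (a * b) % MOD)  # -> 01110101 73 73
```

The inverted end-around carry supplies the remaining correction of 1 whenever the n-bit addition does not overflow, which is why the model agrees with a direct evaluation of 119 × 87 mod 257.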

III. Circuit Implementation

The proposed implementation of the modulo 2^n+1 multiplier consists of three stages: the partial product generation stage, the partial product reduction stage, and the final addition stage. The possible circuit configurations for each stage are discussed in this section.

3.1 CIRCUIT DESIGN OF THE PARTIAL PRODUCT GENERATION STAGE

This is the simplest stage in the circuit design of the entire multiplier. The traditional 2-input NAND gate, 2-input NOR gate, and inverter need to be optimized to meet the power and speed demands of this stage.

Fig. 16 Proposed Inverter

The structure and transistor sizes of the proposed inverter, 2-input NAND gate, and 2-input NOR gate are shown in Fig. 16, Fig. 17, and Fig. 18, respectively. The NAND gates are used

for generating the initial partial product terms, while the NOR gates and inverters are the key circuit components that implement the repositioning operations to obtain the final n × n partial product matrix. The most complex logic functions in the reposition operations are, and, where and [1].

Fig. 17 NAND Gate with 2 Inputs

Fig. 18 NOR Gate with 2 Inputs

3.2 CIRCUIT DESIGN OF THE PARTIAL PRODUCT REDUCTION STAGE

3.2.1 INTRODUCTION OF A MUX-BASED COMPRESSOR DESIGN

The partial product reduction stage is considered the most important stage in determining the power and speed of the entire modulo 2^n+1 multiplier [1]. Thus, this stage must be designed with a group of low-power, high-speed compressors.

Fig. 19 Traditional Design of the Partial Product Reduction Stage

In this stage, the n × n partial product matrix and a correction factor of 2 are compressed to a final sum vector and a carry vector. The remaining correction factor of 1 is added in the final

addition stage by using the inverted EAC adder. Traditional compressors are designed with full adders. However, these designs consume too much power and occupy too much chip area, and they cannot meet today's requirements for very high speed. For example, to compress a single column of an 8 × 8 partial product matrix, a total of 7 full adders are needed, while in the worst case of the possible new designs proposed in this thesis, only one 7:2 compressor and two 3:2 compressors are needed. The traditional full-adder-based compressor and the worst case of the possible new design are shown in Fig. 19 and Fig. 20, respectively.

Fig. 20 A New Design of the Partial Product Reduction Stage

The compressor architecture shown in Fig. 20 is designed with MUX and XOR-XNOR subcircuits. The MUX-based compressors use far fewer transistors than the full-adder-based

compressors, and the total number of compressors used in the traditional full-adder-based design is much higher than in the new MUX-based design. Thus, the new compressor architecture is better suited to meeting the requirements of low power and high speed.

3.2.2 DIFFERENT TYPES OF THE MUX-BASED COMPRESSORS

Several basic MUX-based compressors are discussed below.

3.2.2.1 Circuit Design of the 3:2 compressor:

A 3:2 compressor takes 3 inputs x1, x2, and x3 and generates two outputs, Sum and Carry. The logic relationship between inputs and outputs is given in equation (17) [16]:

$x_1 + x_2 + x_3 = \mathrm{Sum} + 2 \cdot \mathrm{Carry}$ (17)

Fig. 21 Traditional MUX-based Design of the 3:2 Compressor

Fig. 21 shows an existing design of the MUX-based 3:2 compressor [16]. However, this design is not fast enough because X1 and X2 must be added first, and only then is their sum added to X3. The second addition has to wait for the result of the first. The

total delay of this design is 2 ΔXOR. To reduce the critical path delay of the 3:2 compressor, a new MUX-based design is shown in Fig. 22. In the proposed design, X3 can set the MUX select inputs before the data signals arrive. Thus, the time taken to switch the transistors in the critical path is reduced, increasing circuit efficiency [16]. The total delay of the proposed design is ΔXOR + ΔMUX. The output equations of the proposed design are shown below [16]:

(18)

(19)

Fig. 22 A New MUX-based Design of the 3:2 Compressor

3.2.2.2 Circuit Design of the 4:2 compressor:

A 4:2 compressor takes 4 inputs x1, x2, x3, and x4 along with a carry-in bit Cin and generates three outputs, Sum, Carry, and Cout, where Sum is weighted at 2^0 and Carry and Cout are weighted at 2^1. The logic relationship between inputs and outputs is given in equation (20) [16]:

$x_1 + x_2 + x_3 + x_4 + C_{in} = \mathrm{Sum} + 2\,(\mathrm{Carry} + C_{out})$ (20)
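The arithmetic behaviour of the 3:2 and 4:2 compressors can be checked exhaustively with a small model. This is a sketch under assumptions: it uses one common MUX formulation of the 3:2 compressor and a cascade of two such cells for the 4:2 compressor, which is not necessarily the gate arrangement of Fig. 22 or Fig. 24; the function names are hypothetical.

```python
from itertools import product


def mux(sel, d0, d1):
    """2-to-1 multiplexer: returns d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def compressor_3_2(x1, x2, x3):
    x = x1 ^ x2                # first-level XOR
    s = mux(x3, x, x ^ 1)      # x3 selects between XOR and XNOR of x1, x2
    carry = mux(x, x1, x3)     # if x1 == x2 the carry is x1, otherwise x3
    return s, carry


def compressor_4_2(x1, x2, x3, x4, cin):
    s1, cout = compressor_3_2(x1, x2, x3)  # Cout does not depend on Cin
    s, carry = compressor_3_2(s1, x4, cin)
    return s, carry, cout


def check_all():
    """Verify the weighted-sum identities for every input combination."""
    for x1, x2, x3 in product((0, 1), repeat=3):
        s, c = compressor_3_2(x1, x2, x3)
        if x1 + x2 + x3 != s + 2 * c:
            return False
    for bits in product((0, 1), repeat=5):
        s, c, cout = compressor_4_2(*bits)
        if sum(bits) != s + 2 * (c + cout):
            return False
    return True


print(check_all())  # -> True
```

Because Cout is computed from x1, x2, and x3 only, neighbouring 4:2 cells in a row do not form a ripple chain through Cin, which is the property that makes such compressors attractive for column reduction.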

An existing circuit design of the MUX-based 4:2 compressor is shown in Fig. 23 [16]. As in the traditional 3:2 compressor, the second and third XOR operations must wait for the result of the previous one. This limits the speed of the compressor (3 ΔXOR). In Fig. 24, a new design of the MUX-based 4:2 compressor is proposed. In this design, each output and its complement are generated at the same time, avoiding the race-hazard problem. The power consumed by the inverters that generate the complementary signals is also reduced. Furthermore, the MUX connected to Cin can be selected in advance. The total delay of the proposed design is ΔXOR + 2 ΔMUX.

Fig. 23 Traditional MUX-based Design of the 4:2 Compressor

The output equations of the proposed design are shown below [16]:

(21)

(22)

(23)

Fig. 24 A New MUX-based Design of the 4:2 Compressor

3.2.2.3 Circuit Design of the 5:2 compressor:

The 5:2 compressor has 7 inputs (x1, x2, x3, x4, x5, Cin1, and Cin2) and 4 outputs (Carry, Sum, Cout1, and Cout2). The relationship between inputs and outputs is shown below [16]:

$x_1 + x_2 + x_3 + x_4 + x_5 + C_{in1} + C_{in2} = \mathrm{Sum} + 2\,(\mathrm{Carry} + C_{out1} + C_{out2})$ (24)

Several existing circuit implementations of the MUX-based 5:2 compressor are shown in Fig. 25 (a), (b), and (c), respectively [16]. In Fig. 25, the delay of the compressor is reduced to 5 ΔXOR. The delay of the original full-adder-based design is 6 ΔXOR if all the full-adder blocks are replaced by their constituent XOR blocks [16]. However, the delay of the MUX-based 5:2 compressor can be further reduced by replacing some XOR gates with MUX blocks. The proposed implementation is shown in Fig. 26. In the first stage, 2 XOR-XNOR blocks are introduced to generate each output and its complement at the same time, reducing the

power of additional inverters and avoiding the race-hazard problem. In the second and fourth stages, the MUXs controlled by X3, Cin1, and Cin2 can be selected before the input signals arrive. The remaining MUX blocks also make efficient use of the outputs of the blocks in the previous stage. Thanks to all the features mentioned above, the critical path delay of the proposed design is reduced to ΔXOR + 3 ΔMUX. The equations for the outputs are shown below:

(25)

(26)

(27)

(28)
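As a reference point for the 5:2 compressor, the sketch below models the full-adder-based arrangement mentioned in the text (three cascaded full adders, each worth two XOR delays, hence 6 ΔXOR on the critical path). It is an assumed behavioural model for checking the input/output weighting, not the proposed MUX-based circuit of Fig. 26, and the function names are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) of three input bits."""
    s = a ^ b ^ c
    carry = (a & b) | (c & (a ^ b))
    return s, carry


def compressor_5_2(x1, x2, x3, x4, x5, cin1, cin2):
    s1, cout1 = full_adder(x1, x2, x3)     # Cout1 is independent of both carry-ins
    s2, cout2 = full_adder(s1, x4, cin1)   # Cout2 is independent of Cin2
    s, carry = full_adder(s2, x5, cin2)
    return s, carry, cout1, cout2


def check_5_2():
    """Sum weighted at 2^0; Carry, Cout1, and Cout2 weighted at 2^1."""
    for bits in product((0, 1), repeat=7):
        s, carry, cout1, cout2 = compressor_5_2(*bits)
        if sum(bits) != s + 2 * (carry + cout1 + cout2):
            return False
    return True


print(check_5_2())  # -> True
```

Any faster implementation, including a MUX-based one, has to satisfy the same exhaustive check, so a model like this can serve as a golden reference when validating a new gate-level design.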

Fig. 25 Existing Architectures of the 5:2 Compressor: (a), (b), (c)

Fig. 26 A New MUX-based Design of the 5:2 Compressor

3.2.2.4 Circuit Design of the 7:2 compressor:

The 7:2 compressor has 9 inputs (x1, x2, x3, x4, x5, x6, x7, Cin1, and Cin2) and 4 outputs (Sum, Carry, Cout1, and Cout2). Unlike the 5:2 compressor, where Carry, Cout1, and Cout2 are all weighted at 2^1, the 7:2 compressor has a Cout1 output weighted at 2^2. In summary, the relationship between the inputs and the outputs of a 7:2 compressor is [1]:

$x_1 + x_2 + \dots + x_7 + C_{in1} + C_{in2} = \mathrm{Sum} + 2\,(\mathrm{Carry} + C_{out2}) + 4\,C_{out1}$ (29)

The MUX-based 7:2 compressor is an entirely new design in this thesis. The principle of the design is to replace XOR with MUX wherever possible to reduce delay, and to generate each output and its complement at the same time to reduce power. The output equations shown below [17] can then be transformed into the circuit implementation of the MUX-based 7:2 compressor shown in Fig. 27, realized with some additional logic gates such as NAND. The total delay of the proposed design is ΔXOR + 5 ΔMUX.

(30)

(31)

(32)

(33)

where

Fig. 27 A New MUX-based Design of the 7:2 Compressor

3.2.3 DETAILED SUBCIRCUIT DESIGN OF THE COMPRESSORS

To realize the circuit implementations mentioned above, detailed transistor-level designs also need to be discussed and compared. The MUX subcircuit, the complementary-output MUX subcircuit, the XOR-XNOR subcircuit, and the CGEN subcircuit are discussed one by one.

3.2.3.1 MUX Subcircuit Design:

Fig. 28 Original Design of the MUX Subcircuit

First, consider the MUX subcircuit. The original 2-1 MUX is shown in Fig. 28 [18]. This is the most widespread MUX cell today, especially in low-power applications. However, this structure has no driving ability to drive the large input capacitance of the following stages, especially when many stages are cascaded. This introduces a large delay and worsens the performance of the entire modulo multiplier. Thus, this implementation is not chosen.

To solve the weak-driving-ability problem, another circuit implementation of the MUX, shown in Fig. 29, was later introduced. The modified structure solves the driving problem by adding two cascaded inverters at the output of the original design. This method is highly effective. However, the inverters consume a lot of power and enlarge the MUX block to more than twice the size of the one in Fig. 28. So this is also not a desirable design for low-power applications.

Fig. 29 Modified Design of the MUX Subcircuit

The proposed design of the MUX subcircuit is shown in Fig. 30. This design takes advantage of complementary CMOS technology, which is robust against both voltage scaling and transistor sizing [18]. Compared to the modified MUX circuit shown in Fig. 29, the proposed design has only one inverter, which saves considerable power. Reducing the number of inverters does not reduce the driving ability of the proposed design, because its remaining transistors are also connected to vdd/gnd, which provide driving strength. The total number of transistors in the proposed design is 2 more than in the one in Fig. 29. However, the total silicon area of the transistors in the two designs is the same. Thus, based on the discussion above, the circuit design in Fig. 30 is chosen in this research as the best overall compromise among low power, small silicon area, and high speed.

Fig. 30 Proposed Design of the MUX Subcircuit

3.2.3.2 Complementary MUX Subcircuit Design:

Second, the complementary-output MUX subcircuit needs to be designed. Two existing designs of the complementary-output MUX are shown in Fig. 31 (a) and (b), respectively [18]. Design (a) has some driving ability because two compensation transistors, both driven by vdd, are introduced. For the same reason, structure (a) can also obtain a full voltage swing at the output. However, the driving ability of (a) is not strong enough to drive many cascaded stages. Unlike (a), structure (b) has no driving ability at all. Additionally, in some cases, the output and its complement will not have a full swing.

Fig. 31 Existing Designs of the Complementary-output MUX Subcircuit: (a), (b) (transistor widths: 64 nm and 128 nm)

To solve the problems mentioned above, the complementary-output MUX needs to be redesigned. In the circuit design of Fig. 31 (a), an inverter needs to be added to each of the two outputs to improve driving ability. In the circuit design of Fig. 31 (b), two cascaded inverters are needed, and all other pass-gates need to be replaced by complementary CMOS pass-gates to obtain a full swing. After these improvements, (b) occupies much more silicon area than (a), so the proposed design, shown in Fig. 32, builds on (a).

Fig. 32 Proposed Design of the Complementary-output MUX Subcircuit (transistor widths: 64 nm, 128 nm, and 256 nm)

3.2.3.3 XOR-XNOR Subcircuit Design:

Third, the XOR-XNOR subcircuit needs to be designed. The original design of the XOR-XNOR subcircuit is shown in Fig. 33 [18]. This design suffers from weak driving ability, especially when the XNOR node is at logic 0. This dramatically reduces speed. Another problem concerns the complementary outputs: a skew occurs between the XOR node and the XNOR node. Additionally, this design generates a weak logic 1 at the XNOR node because an NMOS-based pass-gate has a Vth voltage drop when passing logic 1. Thus, this

design cannot be used with a low supply voltage.

Fig. 33 Original Design of the XOR-XNOR Subcircuit (transistor widths: 128 nm and 256 nm)

To solve these problems, other XOR-XNOR subcircuits have been designed, as shown in Fig. 34 (a), (b), and (c), respectively [18]. The modified XOR-XNOR block shown in (a) can be used with a low supply voltage because complementary CMOS pass-gates replace the original ones. However, the weak-driving-ability problem and the output skew problem remain. Unlike (a), design (b) solves the output skew problem by adding a group of complementary transistors to the circuit shown in Fig. 33. But it generates a weak 0 at the XOR node and a weak 1 at the XNOR node.

Fig. 34 Modified Designs of the XOR-XNOR Subcircuit: (a), (b), (c) (transistor widths: 64 nm, 128 nm, and 256 nm)

So this design is also not a good choice for low-power applications. The circuit implementation in (c) solves the weak-logic problem and the weak-driving-ability problem at the same time because of the feedback NMOS-PMOS transistors in the middle of the circuit. However, it is still not a good choice for low-power applications, for the following reason. When the input changes from any other pattern to 00 or 11, the feedback NMOS-PMOS transistors, which are initially turned off, are turned on by a weak logic driver and a high-impedance driver. This transition therefore takes a long time, worsens the performance of the entire circuit, and consumes considerable dynamic power [18].

Fig. 35 Proposed Design of the XOR-XNOR Subcircuit (transistor widths: 32 nm, 64 nm, 128 nm, and 256 nm)

The proposed design of the XOR-XNOR subcircuit is shown in Fig. 35. It combines all the desired features, simultaneously solving the weak-logic problem, the output skew problem, the weak-driving-ability problem, and the long-transition-time problem that occurs in Fig. 34 (c).

Fig. 36 Proposed Design of the CGEN Subcircuit (transistor widths: 64 nm, 128 nm, and 256 nm)

3.2.3.4 CGEN Subcircuit Design:

Finally, the proposed CGEN subcircuit is shown in Fig. 36 [18]. The CGEN subcircuit works like a full adder without the Sum output. The truth table of the CGEN block is shown in Table 1.
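The carry-only behaviour of the CGEN block can be expressed as a three-input majority function. The sketch below is a behavioural illustration only (the function name `cgen` and the dictionary `TABLE_1` are hypothetical names); it says nothing about the transistor-level structure and simply compares the majority expression against the rows of Table 1.

```python
from itertools import product


def cgen(a, b, cin):
    """Carry output of a full adder: 1 when at least two inputs are 1."""
    return (a & b) | (cin & (a | b))


# Rows of Table 1, keyed by (A, B, Cin) with the Carry value.
TABLE_1 = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 0, (0, 1, 1): 1,
    (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 1, (1, 1, 1): 1,
}

matches = all(cgen(*row) == TABLE_1[row] for row in product((0, 1), repeat=3))
print(matches)  # -> True
```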

This circuit implementation takes advantage of complementary CMOS logic, providing good driving ability (small delay) with a relatively small silicon area.

Table 1 Truth Table of the CGEN Subcircuit

A  B  Cin  Carry
0  0  0    0
0  0  1    0
0  1  0    0
0  1  1    1
1  0  0    0
1  0  1    1
1  1  0    1
1  1  1    1

Fig. 37 Possible Compressor Architectures for an 8-bit Modulo 2^n+1 Multiplier

3.2.4 THE ARCHITECTURE OF THE PARTIAL PRODUCT REDUCTION STAGE

3.2.4.1 Architecture Designed for an 8-bit Modulo 2^n+1 Multiplier