A New Family of High-Performance Parallel Decimal Multipliers*


Alvaro Vázquez, Elisardo Antelo
Dept. of Electronic and Computer Science, University of Santiago de Compostela, Spain
alvaro@dec.usc.es, elisardo@dec.usc.es

Paolo Montuschi
Dept. of Computer Engineering, Politecnico di Torino, Italy
montuschi@polito.it

*A. Vázquez and E. Antelo were supported in part by the Ministry of Science and Technology of Spain under contract TIN2004-07797-C02 and by Xunta de Galicia under contract PGIDT03TIC10502PR.

ARITH 18, Montpellier, France. June 25-27, 2007

Outline

Introduction.
Previous work.
Implementation of decimal parallel multiplication:
    Fast carry-save addition using non-conventional BCD.
    Design of high-performance decimal p:2 CSAs.
    Parallel partial product generation.
Architectures:
    Signed-digit (SD) radix-10.
    SD radix-4/radix-5 (combined binary/decimal).
Evaluation and comparison.
Conclusions.

Introduction

High-performance decimal floating-point units.
Parallel multiplier: performance scales by pipelining.
Multiplication stages:
    1. Generation of partial products (PPG).
    2. Reduction of partial products (PPR).
    3. Conversion to non-redundant representation.
Problems of decimal implementation:
    High value range of decimal digits (0-9): affects PPG.
    Inefficiency of conventional BCD coding: affects PPG and PPR.

Previous Work on Decimal Multiplication

Previous proposals for PPG:
    1. Direct generation of partial products (digit by digit).
    2. Using multiplicand multiples (X, 2X, 3X, 4X, ..., 9X):
        Direct implementation.
        SD multiplier. [Ex.: 2 radix-5 digits: (-5X, 5X), (-2X, -X, X, 2X)]
Previous proposals for PPR:
    1. Carry-save BCD-8421:
        a. Full BCD operands (3:2 CSAs + correction).
        b. Carry operand of 1 bit per 4-bit digit (4-bit decimal CPAs).
    2. Signed-digit representation for decimal digits. SD adders are more complex than CSA-based implementations.

Proposed techniques

X multiplicand, Y multiplier: BCD integer words.
A BCD digit is represented as
$Z_i = \sum_{j=0}^{3} z_{i,j}\, r_j$
with the weights $r_j$ depending on the code:
    BCD-8421: $r_j = 2^j$
    BCD-4221: $(r_3, r_2, r_1, r_0) = (4, 2, 2, 1)$
    BCD-5211: $(r_3, r_2, r_1, r_0) = (5, 2, 1, 1)$

1. Decimal carry-save addition using BCD-4221.
2. Implementation of decimal CSAs for PPR.
3. Implementation of PPG using multiplier recoding:
    SD radix-10.
    SD radix-4.
    SD radix-5.

Decimal carry-save addition (BCD-8421)

Add 3 decimal digits to produce 2 decimal digits (sum and carry digits):
$A_i + B_i + C_i = S_i + 2H_i$
The goal is $A_i, B_i, C_i, S_i, H_i \in [0,9]$, so that $2H_i \in [0,18]$ and is even.

Bit-level 3:2 CSA with inputs $a_{i,j}, b_{i,j}, c_{i,j}$ (juxtaposition is AND, + is OR):
$s_{i,j} = \mathrm{Xor}(a_{i,j}, b_{i,j}, c_{i,j})$
$h_{i,j} = a_{i,j} b_{i,j} + (a_{i,j} + b_{i,j})\, c_{i,j}$

Example ($A_i + B_i + C_i = S_i + 2H_i = 20$):

    Digit   Value   8 4 2 1
    A_i       5     0 1 0 1
    B_i       6     0 1 1 0
    C_i       9     1 0 0 1
    S_i      10     1 0 1 0
    H_i       5     0 1 0 1
    2H_i     10     carry-out 1, digit 0 0 0 - (the last position receives the carry-in)

PROBLEM WITH BCD-8421
Input digits are in [0,9], BUT the sum digit falls outside the decimal range: [0,9] -> [0,15].
Sum digits require correction.
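The range problem on this slide can be checked with a few lines of Python. This is a minimal sketch, not the authors' code: `csa_3to2` is a hypothetical helper name, and digits are held as plain integers because in BCD-8421 the 4-bit pattern equals the digit value.

```python
from itertools import product

def csa_3to2(a, b, c):
    """Bitwise binary 3:2 carry-save adder on 4-bit patterns.

    Returns (s, h): s holds the XOR (sum) bits, h the majority (carry) bits.
    """
    s = a ^ b ^ c
    h = (a & b) | ((a | b) & c)
    return s, h

# Slide example: 5 + 6 + 9 = 20.
s, h = csa_3to2(5, 6, 9)
print(s, h, s + 2 * h)  # 10 5 20

# Sweep every triple of valid BCD digits and record the largest sum digit.
max_sum = max(csa_3to2(a, b, c)[0] for a, b, c in product(range(10), repeat=3))
print(max_sum)  # 15
```

The identity A + B + C = S + 2H always holds, but S is only guaranteed to fit in 4 bits, not in [0,9], which is why a BCD-8421 carry-save adder needs a correction step.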

Decimal carry-save addition (BCD-4221)

Add 3 decimal digits to produce 2 decimal digits (sum and carry digits):
$A_i + B_i + C_i = S_i + 2H_i = S_i + \text{L1-shift}(W_i)$
where $W_i$ is $H_i$ recoded to BCD-5211.

Bit-level 3:2 CSA with inputs $a_{i,j}, b_{i,j}, c_{i,j}$ (juxtaposition is AND, + is OR):
$s_{i,j} = \mathrm{Xor}(a_{i,j}, b_{i,j}, c_{i,j})$
$h_{i,j} = a_{i,j} b_{i,j} + (a_{i,j} + b_{i,j})\, c_{i,j}$

Example ($A_i + B_i + C_i = S_i + 2H_i = 20$):

    Digit   Value   4 2 2 1
    A_i       5     1 0 0 1
    B_i       6     1 1 0 0
    C_i       9     1 1 1 1
    S_i       6     1 0 1 0
    H_i       7     1 1 0 1
    W_i       7     1 1 0 0   (BCD-5211)
    2H_i     14     carry-out 1, digit 1 0 0 - (L1-shift of W_i; the last position receives the carry-in)

SOLUTION WITH BCD-4221
$A_i, B_i, C_i, S_i, H_i, W_i \in [0,9]$: input digits are in [0,9] and the sum digit is always in range [0,9].
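The following Python sketch reproduces the slide's example and checks the claim exhaustively. It is an illustration under stated assumptions: the helper names (`value`, `csa_3to2`, `recode_4221_to_5211`, `times2`) are hypothetical, and the recoder simply picks the first BCD-5211 pattern with the right value, whereas a hardware recoder would fix one specific mapping.

```python
from itertools import product

W4221 = (4, 2, 2, 1)
W5211 = (5, 2, 1, 1)

def value(bits, weights):
    """Decimal value of a 4-bit pattern under a weighted code (weights MSB first)."""
    return sum(w for k, w in enumerate(weights) if (bits >> (3 - k)) & 1)

def csa_3to2(a, b, c):
    """Bitwise binary 3:2 CSA: XOR gives the sum bits, majority the carry bits."""
    return a ^ b ^ c, (a & b) | ((a | b) & c)

def recode_4221_to_5211(h):
    """Some BCD-5211 pattern with the same value as BCD-4221 pattern h (arbitrary choice)."""
    v = value(h, W4221)
    return next(p for p in range(16) if value(p, W5211) == v)

def times2(h):
    """2*H as (carry_out, BCD-4221 digit bits): recode to BCD-5211, shift left by one bit."""
    w = recode_4221_to_5211(h)
    return w >> 3, (w << 1) & 0xF

# Slide example: A = 5 (1001), B = 6 (1100), C = 9 (1111).
a, b, c = 0b1001, 0b1100, 0b1111
s, h = csa_3to2(a, b, c)
carry_out, low = times2(h)
print(value(s, W4221), value(h, W4221), 10 * carry_out + value(low, W4221))  # 6 7 14

# Every 4-bit pattern is a valid BCD-4221 digit, so sweep all 16^3 input triples.
ok = all(
    value(x, W4221) + value(y, W4221) + value(z, W4221)
    == value(csa_3to2(x, y, z)[0], W4221)
    + 10 * times2(csa_3to2(x, y, z)[1])[0]
    + value(times2(csa_3to2(x, y, z)[1])[1], W4221)
    for x, y, z in product(range(16), repeat=3)
)
print(ok)  # True
```

Because the weights 4, 2, 2, 1 add up to 9, any 4-bit output of the binary CSA is already a legal decimal digit, so no correction is needed. The left shift works because doubling the BCD-5211 weights (5, 2, 1, 1) gives (10, 4, 2, 2), which are exactly the BCD-4221 weights one bit position to the left.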

Decimal carry-save addition (BCD-5211)

Add 3 decimal digits to produce 2 decimal digits (sum and carry digits):
$A_i + B_i + C_i = S_i + 2H_i = S_i + \text{L1-shift}(H_i)$
The L1-shift of a BCD-5211 operand yields a BCD-4221 operand, which is then recoded back to BCD-5211.

Bit-level 3:2 CSA with inputs $a_{i,j}, b_{i,j}, c_{i,j}$ (juxtaposition is AND, + is OR):
$s_{i,j} = \mathrm{Xor}(a_{i,j}, b_{i,j}, c_{i,j})$
$h_{i,j} = a_{i,j} b_{i,j} + (a_{i,j} + b_{i,j})\, c_{i,j}$

Example ($A_i + B_i + C_i = S_i + 2H_i = 20$):

    Digit   Value   5 2 1 1
    A_i       5     1 0 0 0
    B_i       6     1 0 0 1
    C_i       9     1 1 1 1
    S_i       8     1 1 1 0
    H_i       6     1 0 0 1
    2H_i     12     carry-out 1, digit 0 0 1 -   (BCD-4221, after L1-shift)
    2H_i     12     carry-out 1, digit 0 1 0 -   (BCD-5211, after recoding)

SOLUTION WITH BCD-5211
$A_i, B_i, C_i, S_i, H_i \in [0,9]$: input digits are in [0,9] and the sum digit is always in range [0,9].

Decimal multiplication by $\pm 2^n$ and $\pm 5^n$

Multiplication by 2 (BCD-4221 -> digit recoding -> BCD-5211 -> L1-shift -> BCD-4221):

    25 (BCD-4221)                 0100 1001
    25 (BCD-5211, recoded)        0100 1000
    50 (BCD-4221, after L1-shift) 1001 0000

Multiplication by 5 (BCD-4221 -> L3-shift -> BCD-5211 -> digit recoding -> BCD-4221):

    25  (BCD-4221)                 0000 0100 1001
    125 (BCD-5211, after L3-shift) 0010 0100 1000
    125 (BCD-4221, recoded)        0001 0100 1001

Negative operands (10's complement) are obtained by bit inversion, as in 2's complement:

    0596 (BCD-4221)                  0000 1001 1111 1100
    9403 (BCD-4221, bit-complement)  1111 0110 0000 0011
    -596 = -10000 + 9403 + 1         (the +1 is added as a hot-one)
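The three word-level tricks on this slide can be sketched in Python as follows. This is an illustrative model, not the hardware: words are Python integers with 4 bits per decimal digit, all helper names are hypothetical, and `encode` picks the numerically smallest bit pattern for each digit, which is an arbitrary choice among the valid encodings (so some intermediate patterns differ from the slide's while the values agree).

```python
W4221, W5211 = (4, 2, 2, 1), (5, 2, 1, 1)

def digit_value(p, w):
    """Value of a 4-bit pattern p under weights w (MSB first)."""
    return sum(wk for k, wk in enumerate(w) if (p >> (3 - k)) & 1)

def encode(n, ndigits, w):
    """Pack n into ndigits 4-bit digits of code w (smallest valid pattern per digit)."""
    word = 0
    for i in range(ndigits):
        d = (n // 10**i) % 10
        word |= min(q for q in range(16) if digit_value(q, w) == d) << (4 * i)
    return word

def decode(word, ndigits, w):
    """Integer value of an ndigits-digit word in code w."""
    return sum(digit_value((word >> (4 * i)) & 0xF, w) * 10**i for i in range(ndigits))

def times2(x4221, ndigits):
    """2*X: recode each digit to BCD-5211, shift left 1 bit; result is BCD-4221, ndigits+1 digits."""
    w = encode(decode(x4221, ndigits, W4221), ndigits, W5211)  # digit-wise recoding
    return w << 1

def times5(x4221, ndigits):
    """5*X: shift the BCD-4221 word left 3 bits; result is BCD-5211, ndigits+1 digits."""
    return x4221 << 3

def nines_complement(x4221, ndigits):
    """9's complement by inverting every bit (the BCD-4221 weights sum to 9)."""
    return x4221 ^ ((1 << (4 * ndigits)) - 1)

x = encode(25, 2, W4221)
print(decode(times2(x, 2), 3, W4221))  # 50
print(decode(times5(x, 2), 3, W5211))  # 125

y = encode(596, 4, W4221)
print(decode(nines_complement(y, 4), 4, W4221))  # 9403
```

The 10's complement is then the 9's complement plus one, matching -596 = -10000 + 9403 + 1 on the slide; in a multiplier that +1 can be deferred and injected into the reduction tree as a hot-one.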

Proposed decimal 3:2 CSA (BCD-4221)

$A_i + B_i + C_i = S_i + 2H_i = S_i + \text{L1-shift}(W_i)$

Proposed decimal 3:2 CSA (BCD-4221)

Digit recoding from BCD-4221 to BCD-5211:

    Digit   BCD-4221      BCD-5211
    0       0000          0000
    1       0001          0001
    2       0010, 0100    0100
    3       0011, 0101    0101
    4       1000, 0110    0111
    5       1001, 0111    1000
    6       1100, 1010    1010
    7       1101, 1011    1011
    8       1110          1110
    9       1111          1111

Digit recoder BCD-4221 to BCD-5211 (on the critical path):
    AREA: 18 NAND2 (0.35 times 4-bit 3:2 CSA area).
    DELAY: 4 FO4 (0.9 times binary 3:2 CSA delay).
Decimal (digit) 3:2 CSA:
    AREA: 66 NAND2 (1.35 times 4-bit 3:2 CSA area).
    DELAY*: 1.4 times on the carry path, the same on the sum path.

*Ratio with respect to the sum-path (critical path) delay of the binary 3:2 CSA.

Decimal CSA tree (BCD-4221)

[Figure: digit slice of a 9:2 decimal CSA built from 4-bit 3:2 CSAs and digit recoders, with the critical path marked; a 2:1 mux is added for the combined decimal/binary CSA.]

Example: 9:2 decimal CSA (digit slice).
    1.35 area ratio with respect to binary CSA.
    1.40 delay ratio with respect to binary CSA.
Hardware complexity (1 digit):
    4-bit 3:2 CSA: 7 x 48 NAND2.
    Digit recoder: 7 x 18 NAND2.
Critical path delay:
    1-bit 3:2 CSA: 4.5/2.2 FO4 (2/1 XOR).
    Recoder: 4 FO4 (1.75 XOR).
    9:2 decimal CSA: 25 FO4.
    9:2 binary CSA: 18 FO4.

Decimal CSA tree BCD-4221 (area-optimized)

[Figure: digit slice of an area-optimized 9:2 decimal CSA built from 4-bit 3:2 CSAs, with the critical path marked.]

Example: 9:2 decimal CSA (digit slice).
Area optimization: group inputs with similar multiplicative factor.
    1.20 area ratio with respect to binary CSA.
    1.40 delay ratio with respect to binary CSA.
Hardware complexity (1 digit):
    4-bit 3:2 CSA: 7 x 48 NAND2.
    Digit recoder: 5 x 18 NAND2.
Critical path delay:
    9:2 decimal CSA: 25 FO4.
    9:2 binary CSA: 18 FO4.

SD radix-10 multiplier recoding

[Figure: multiplicand X (BCD-4221, 4d bits) feeds the multiplicand multiples generator, which produces X, 2X, 3X, 4X and 5X using a 4d-bit decimal adder; each multiplier digit $Y_i$ (BCD-8421, 4 bits) feeds an SD radix-10 digit recoder whose 5-bit hot-one code and recoded sign bit drive a Mux-5 that selects a 4d-bit partial product.]

Multiplier digits: $Y_i \in [0,9]$ (BCD-8421).
Recoded digits: $Yb_i \in [-5,5]$ (hot-one code plus recoded sign).
Integer d-digit precision operands.
1 SD radix-10 digit per BCD digit of the multiplier.
d+1 partial products (one additional encoded SD radix-10 digit).
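A minimal Python sketch of a signed-digit radix-10 recoding with the stated digit ranges follows. The rule used here (a digit of 5 or more borrows 10 from the next position) is an assumption chosen because it is carry-free and yields digits in [-5,5]; the slide does not spell out the exact rule, and the function names are hypothetical.

```python
def sd_radix10_recode(y_digits):
    """Recode BCD digits (least significant first) into d+1 signed digits in [-5, 5].

    Assumed rule: a digit >= 5 becomes digit - 10 and passes +1 to the next
    position. The transfer depends only on the digit itself, so nothing ripples.
    """
    out, transfer = [], 0
    for y in y_digits:
        sign = 1 if y >= 5 else 0
        out.append(y - 10 * sign + transfer)
        transfer = sign
    out.append(transfer)  # the extra (d+1)-th digit, 0 or 1
    return out

def multiply(x, y, d):
    """Sum the d+1 partial products selected by the recoded multiplier digits."""
    digits = sd_radix10_recode([(y // 10**i) % 10 for i in range(d)])
    # Each |digit| selects one of the precomputed multiples X..5X; the sign negates it.
    return sum(dig * x * 10**i for i, dig in enumerate(digits))

print(sd_radix10_recode([6, 9, 5, 0]))  # [-4, 0, -4, 1, 0]  (Y = 596)
print(multiply(1234, 596, 4))           # 735464
```

Only the multiples X through 5X are needed, and only 3X requires a carry-propagate addition, which is why the generator on the slide contains a single 4d-bit decimal adder.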

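SD radix-4 multiplier recoding

[Figure: multiplicand X (BCD-4221, 4d bits) feeds the multiplicand multiples generator, which produces 8X, 4X, 2X and X; each multiplier digit $Y_i$ (BCD-8421, 4 bits) feeds an SD radix-4 digit recoder whose hot-one codes drive two Mux-2 selectors, the lower one with a recoded sign, each producing a 4d-bit partial product.]

Multiplier digits: $Y_i \in [0,9]$ (BCD-8421).
Recoded digits: $Yb_i = Y^U_i \cdot 4 + Y^L_i$, with $Y^U_i \in [0,2]$ and $Y^L_i \in [-2,2]$.
Integer d-digit precision operands.
2 SD radix-4 digits per BCD digit of the multiplier.
2d partial products.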
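The decomposition of one BCD digit into an upper and a lower radix-4 digit can be sketched as below. With the stated ranges the split is not unique (2 and 6 each have two valid forms), so the rounding rule in `sd_radix4_split` is an assumption made for illustration, and the function names are hypothetical.

```python
def sd_radix4_split(y):
    """Split a BCD digit y in [0, 9] as y = 4*upper + lower.

    upper is in {0, 1, 2} and selects 0, 4X or 8X;
    lower is in {-2, ..., 2} and selects 0, +-X or +-2X.
    Assumed tie-break: upper = (y + 1) // 4.
    """
    upper = (y + 1) // 4
    return upper, y - 4 * upper

def multiply(x, y, d):
    """Accumulate the 2d partial products (two per multiplier digit)."""
    total = 0
    for i in range(d):
        upper, lower = sd_radix4_split((y // 10**i) % 10)
        total += (upper * 4 * x + lower * x) * 10**i
    return total

print([sd_radix4_split(y) for y in range(10)])
# [(0, 0), (0, 1), (0, 2), (1, -1), (1, 0), (1, 1), (1, 2), (2, -1), (2, 0), (2, 1)]
print(multiply(1234, 596, 4))  # 735464
```

All required multiples (X, 2X, 4X, 8X) are powers of two times X, so they can be produced by repeated recode-and-shift steps without any carry-propagate addition.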

SD radix-5 multiplier recoding

[Figure: multiplicand X (BCD-4221, 4d bits) feeds the multiplicand multiples generator, which produces 10X (4-bit left wired shift), 5X, 2X and X; each multiplier digit $Y_i$ (BCD-8421, 4 bits) feeds an SD radix-5 digit recoder whose hot-one codes drive two Mux-2 selectors, the lower one with a recoded sign, each producing a 4d-bit partial product.]

Multiplier digits: $Y_i \in [0,9]$ (BCD-8421).
Recoded digits: $Yb_i = Y^U_i \cdot 5 + Y^L_i$, with $Y^U_i \in [0,2]$ and $Y^L_i \in [-2,2]$.
Integer d-digit precision operands.
2 SD radix-5 digits per BCD digit of the multiplier.
2d partial products.
Simple PPG: area and latency figures similar to Booth radix-4.
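A short Python sketch of the radix-5 split follows. Given the stated ranges, each digit in [0,9] has exactly one decomposition, so the only assumptions here are the hypothetical function names and the arithmetic shortcut used to compute it.

```python
def sd_radix5_split(y):
    """Split a BCD digit y in [0, 9] as y = 5*upper + lower.

    upper is in {0, 1, 2} and selects 0, 5X or 10X;
    lower is in {-2, ..., 2} and selects 0, +-X or +-2X.
    """
    upper = (y + 2) // 5
    return upper, y - 5 * upper

def multiply(x, y, d):
    """Accumulate the 2d partial products (two per multiplier digit)."""
    total = 0
    for i in range(d):
        upper, lower = sd_radix5_split((y // 10**i) % 10)
        total += (upper * 5 * x + lower * x) * 10**i
    return total

print([sd_radix5_split(y) for y in range(10)])
# [(0, 0), (0, 1), (0, 2), (1, -2), (1, -1), (1, 0), (1, 1), (1, 2), (2, -2), (2, -1)]
print(multiply(1234, 596, 4))  # 735464
```

The multiples involved are cheap in the proposed codes: 10X is a wired 4-bit shift, 5X is a 3-bit shift of a BCD-4221 word, and 2X is a digit recoding followed by a 1-bit shift, which is consistent with the slide's remark that the PPG cost is similar to Booth radix-4.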

Radix-10 architecture

[Figure: X (64 bits) feeds the multiplicand multiples generator (X, 2X, 3X, 4X, 5X); Y (64 bits) feeds the SD radix-10 recoder (17 x 5 selection bits plus 16 recoded signs); Mux-5 selectors produce 17 partial products of 64 bits, which are reduced by a decimal 17:2 CSA tree to two 128-bit operands and added by a 128-bit decimal adder to give Z.]

Z = X x Y, decimal multiplications only.
16 BCD-digit (64-bit) significands (IEEE 754r Decimal64 format).
SD radix-10 multiplier recoding.
17 partial products generated.
Easily pipelined.

Radix-4/5 architecture

[Figure: X (64 bits) feeds the multiplicand multiples generator (10X/8X, 5X/4X, 2X, X); Y (64 bits) feeds the SD radix-4/5 recoder (two groups of 32 x 5 selection bits plus 16 recoded signs); two banks of Mux-2 selectors each produce 16 partial products of 64 bits, which are reduced by a decimal 32:2 CSA tree to two 128-bit operands and added by a 128-bit decimal adder to give Z.]

Can perform binary and decimal multiplications, Z = X x Y.
SD radix-5/4 multiplier recoding (2 SD digits per BCD digit).
32 partial products generated.
Easily pipelined.

Evaluation results

Area-delay model based on logical effort (delay in FO4; area in NAND2).

    Architecture (64 bits)   Delay (FO4)   Ratio     Area (NAND2)   Ratio
    Bin. radix-4             50            1.0       43000          1.0
    Bin. radix-8             57            1.15      39500          0.90
    Dec. radix-4             70            1.4       49500          1.15
    Dec. radix-5             65            1.3       49000          1.10
    Bin/dec. radix-4         59/75         1.2/1.5   54000          1.25
    Bin/dec. radix-4/5       61/71         1.2/1.4   53500          1.25
    Dec. radix-10            72            1.45      40000          0.90
    Proposed in [8]          92            1.85      69000          1.60

[8] T. Lang and A. Nannarelli. A radix-10 combinational multiplier. Proc. 40th Asilomar Conf. on Signals, Systems, and Computers, pp. 313-317, Oct. 2006.

Comparison of decimal carry-free trees

                      Architecture (carry-free adder)        Delay ratio   Area ratio
    Binary            Binary 16:2 CSA                        0.70          0.85
    Our proposal      Decimal 16:2 CSA (area optimized)      1.00          1.00
    Other proposals   SD tree [5,14]                         2.00          2.90
                      4-bit CLA tree [4,7]                   1.45          1.40
                      BCD-8421 CSA [11]                      1.50          1.30
                      Non spec. CSA [6]                      2.60          1.45

[4] M. A. Erle and M. J. Schulte. Decimal multiplication via carry-save addition. In Proc. IEEE Int'l Conference on Application-Specific Systems, Architectures, and Processors, pp. 348-358, June 2003.
[5] M. A. Erle, E. M. Schwarz, and M. J. Schulte. Decimal multiplication with efficient partial product generation. Proc. IEEE 17th Symposium on Computer Arithmetic, pp. 21-28, June 2005.
[6] R. D. Kenney and M. J. Schulte. High-speed multioperand decimal adders. IEEE Trans. on Computers, 54(8):953-963, Aug. 2005.
[7] R. D. Kenney, M. J. Schulte, and M. A. Erle. High-frequency decimal multiplier. In Proc. IEEE Int'l Conference on Computer Design: VLSI in Computers and Processors, pp. 26-29, Oct. 2004.
[11] T. Ohtsuki. Apparatus for decimal multiplication. U.S. Patent No. 4,677,583, June 1987.
[14] B. Shirazi, D. Y. Y. Yun, and C. N. Zhang. RBCD: Redundant binary coded decimal adder. IEE Proc. - Computers and Digital Techniques, 136(2):156-160, Mar. 1989.

Conclusions

New family of parallel decimal multipliers: decimal radix-10 and combined radix-4/5 architectures.
Decimal carry-save addition algorithm using BCD-4221 (also valid for BCD-5211).
Efficient designs of decimal p:2 CSA trees for PPR.
Parallel PPG using multiplicand multiples and three different SD recodings of the multiplier.
Area-delay figures outperform other proposals and are comparable to those of binary parallel multipliers (1.3/1.1 latency/area ratios for decimal SD radix-5 with respect to binary Booth radix-4).
Future work: decimal floating-point VLSI implementations.