Design and Analysis of Modified Fast Compressors for MAC Unit


Anusree T U (1), Bonifus P L (2)
(1) PG Student, Dept. of ECE, Rajagiri School of Engineering & Technology
(2) Assistant Professor, Dept. of ECE, Rajagiri School of Engineering & Technology
Cochin, Kerala, India

Abstract
Multiplication and addition are basic arithmetic operations that are important in many microprocessor and digital signal processing (DSP) applications. As the demand for high speed multipliers keeps increasing, research on multipliers and adders remains active and significant. Compressors built from only multiplexers and basic gates can reduce the power dissipation of multipliers without compromising their speed. In this work, different topologies of 4:2 and 5:2 compressors are compared in terms of power delay product and number of transistors. The compressor topologies are simulated in 90nm technology using the Cadence Virtuoso schematic editor at a 700mV power supply. The improved designs can be used in multipliers with lower delay than conventional ones, which in turn can be used in MAC units for DSP applications.

Keywords
4:2 compressor, 5:2 compressor, Pass transistor logic, MAC unit

I. INTRODUCTION
New trends in handheld mobile communication and portable devices demand speed, area and power efficiency. Even before the mobile era, power consumption was a fundamental problem. Different ideas, from the device level to the architectural level and above, have already been proposed. However, there is no universal way to avoid tradeoffs between power, delay and area, so the techniques chosen by a designer must suit the application and its needs. Multipliers are among the critical components that dictate overall circuit performance when power consumption and computation speed are constrained.
In the multiplication process, the reduction of partial products contributes most to the overall delay, power and area. Compressors are employed to reduce the latency of this step. Hence compressors are a critical component of the multiplier circuit and greatly influence the overall multiplier speed. For high speed applications, a large number of compressors are used in multiplication to perform the partial product addition. Thus research on multipliers and adders remains active and significant, and it is of great interest to develop high speed and low power compressors with a minimum number of transistors.

Conventionally, partial product reduction has been carried out using carry save adders consisting of rows of 3:2 counters, otherwise known as full adders. To increase the speed of multiplication, higher order reduction schemes have been adopted. As an alternative to 3:2 counters, several higher order compressors were proposed. The 4:2 compressors are more popular because they form a regular structure of interconnected cells. Higher order compressors such as 5:2, 6:2, etc., have also been employed in high precision multipliers to achieve greater performance. Fast 5:2 compressors are widely used for large word-size multipliers and multiply accumulators. As these compressors are used repeatedly in larger systems, an improved design with the lowest transistor count contributes considerably to overall system performance [2]-[4].

In this paper, section II briefly reviews the compressor architecture and concepts. The architectural details of the compressor designs and modified designs are given in section III. Simulated waveforms and comparison results are explained in section IV. Finally, section V concludes the paper.

II. PREVIOUS WORK
Due to the reduction of switching energy per device caused by continually shrinking feature sizes, and because static power dissipation is negligible compared to dynamic power dissipation, logic styles have prevailed as the means of implementing low power digital systems. Different techniques have been used for reducing power and the number of transistors. Compressors are used in multipliers to increase the speed, and these architectures can also lead to significant savings in the power consumed by the entire multiplier [1].

ISSN: 2231-2803 http://www.ijcttjournal.org Page 213
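The role of rows of 3:2 counters in partial product reduction, as mentioned in the Introduction, can be illustrated with a small behavioural sketch. It is not taken from the paper: the function names `carry_save_add`, `reduce_partial_products` and `multiply` are hypothetical, and operands are modelled as non-negative Python integers rather than as gate-level circuits.

```python
def carry_save_add(a, b, c):
    """Apply one row of 3:2 counters bitwise to three non-negative operands.

    Returns (sum_vector, carry_vector) such that
    sum_vector + carry_vector == a + b + c.
    """
    sum_vector = a ^ b ^ c                           # per-column sum bit
    carry_vector = ((a & b) | (b & c) | (a & c)) << 1  # per-column carry, weight n + 1
    return sum_vector, carry_vector


def reduce_partial_products(rows):
    """Reduce a list of at least two operands to two vectors using 3:2 rows only."""
    rows = list(rows)
    while len(rows) > 2:
        s, c = carry_save_add(rows[0], rows[1], rows[2])
        rows = rows[3:] + [s, c]
    return rows


def multiply(x, y, width=8):
    """Illustrative multiplier: AND-array partial products, then carry-save reduction."""
    partial_products = [(x << i) if (y >> i) & 1 else 0 for i in range(width)]
    s, c = reduce_partial_products(partial_products)
    return s + c  # single final carry-propagate addition


if __name__ == "__main__":
    print(carry_save_add(5, 3, 6))
    print(multiply(13, 11))
```

No carry ripples between columns inside `carry_save_add`; only the last line of `multiply` performs a carry-propagate addition, which is why the reduction step dominates the design effort.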

A 3:2 compressor adds carries and sums separately, so that all of the columns can be added in parallel without relying on the result of the previous column, creating a two-output adder whose delay is independent of the input size. The sum and carry can then be recombined in a normal addition to form the actual result. This final addition takes additional time and is more complicated.

Fig. 1 Reduction tree

Compressors are used for accumulating the partial products in multiplication so as to speed up the operation. Full adders, or 3:2 compressors, were used for accumulation, in which 3 equally weighted bits are combined to produce two bits: one carry with weight n + 1 and one sum with weight n. Each layer of the tree reduces the number of vectors in the ratio 3:2. The 4:2 compressors are more popular because they form a regular structure of interconnected cells. Higher order compressors such as 5:2, 6:2, etc., have also been employed in high precision multipliers to achieve greater performance. Fast 5:2 compressors are widely used for large word-size multipliers and multiply accumulators [2]-[4].

A. 4:2 Compressors
The 4:2 compressor has 4 inputs X1, X2, X3 and X4 and 2 outputs, Sum and Carry, along with a carry-in (Cin) and a carry-out (Cout). The input Cin is the output from the previous, less significant compressor. Cout is the output to the compressor in the next significant stage. The simplest implementation of the 4:2 compressor is obtained by cascading two full adders in a hierarchical structure. Similar to the 3:2 compressor, the 4:2 compressor is governed by the basic equation

$$x_1 + x_2 + x_3 + x_4 + C_{in} = \mathrm{Sum} + 2\,(\mathrm{Carry} + C_{out})$$

The standard implementation of the 4:2 compressor uses two full adder cells as shown in Fig. 2 and has a transistor count of 124 [2]-[4]. The disadvantages of this design are removed in enhanced designs. When the individual full adders are broken down into XOR blocks, the overall delay depends on four XOR gates. The block diagram in Fig. 3 shows the implementation of the 4:2 compressor using MUX and XOR gates. Conventional CMOS implementations of the MUX and XOR gates are used in this design. It has fewer transistors than conventional ones [4].

Fig. 2 Conventional 4:2 compressor
Fig. 3 4:2 compressor design II

B. 5:2 Compressors

The 5:2 compressor block has five inputs X1, X2, X3, X4, X5 and 2 outputs, Sum and Carry, along with two input carry bits (Cin1, Cin2) and 2 output carry bits (Cout1, Cout2). Five inputs are primary inputs and the remaining two are input carries, which receive their values from the previous stage of one bit lower significance. All seven inputs as well as the output Sum bit have the same weight. The input carry bits are the outputs of the previous, less significant compressor block, and the output carries are passed on to the next, more significant compressor block. The other three output bits have a weight one bit higher. Various structures for 5:2 compressors have already been proposed. The equation below is the basic equation that governs the function of the 5:2 compressor block.

$$x_1 + x_2 + x_3 + x_4 + x_5 + C_{in1} + C_{in2} = \mathrm{Sum} + 2\,(\mathrm{Carry} + C_{out1} + C_{out2})$$

Fig. 5 5:2 compressor design II

The conventional implementation of the 5:2 compressor is shown in Fig. 4. This conventional circuit uses a hierarchical structure of cascaded full adder cells. Its speed and transistor count are not favourable for a fast and area efficient design. In the design of 5:2 compressors, Cout1 must be independent of Cin1 as well as Cin2. In addition, Cout2 must be independent of Cin2. The delay increases as the signal propagates from one compressor to the next, so limiting this propagation is the main idea behind improving compressor performance. The dependency of Cout2 on Cin1 causes the carry to propagate to the third compressor. The method in Fig. 5 is to make Cout2 independent of Cin1. This limits the carry propagation to one compressor. Here the XOR* gate is replaced by pass transistor logic. CGEN blocks are used to generate the Cout1 and Cout2 signals using CMOS technology [4].

III. MODIFIED COMPRESSORS
The block diagram in Fig. 6 shows the implementation of the 4:2 compressor using MUX and XOR. Here the XOR is implemented using pass transistor logic as shown in Fig. 7.
It is a six-transistor implementation, which reduces the transistor count.

Fig. 4 Conventional 5:2 compressor
Fig. 6 4:2 compressor design III
Fig. 7 XOR gate with pass transistor logic
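The 4:2 compressor equation from Section II can be checked exhaustively against both a cascaded full adder model and an XOR/MUX model. The sketch below is behavioural only and rests on assumptions: the XOR/MUX wiring shown is a commonly used arrangement and is not claimed to match the schematics of Fig. 3 or Fig. 6, and all function names are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) of three equally weighted bits."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def mux(select, d0, d1):
    """2:1 multiplexer: d0 when select is 0, d1 when select is 1."""
    return d1 if select else d0


def compressor42_full_adders(x1, x2, x3, x4, cin):
    """Conventional structure: two cascaded full adders."""
    s1, cout = full_adder(x1, x2, x3)
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout


def compressor42_xor_mux(x1, x2, x3, x4, cin):
    """Assumed XOR/MUX arrangement (illustrative, not the paper's netlist)."""
    x12 = x1 ^ x2
    x34 = x3 ^ x4
    cout = mux(x12, x1, x3)   # majority(x1, x2, x3), independent of cin
    s = x12 ^ x34
    total = s ^ cin
    carry = mux(s, x4, cin)   # select bit s does not depend on cin
    return total, carry, cout


def satisfies_equation(compressor):
    """Check x1 + x2 + x3 + x4 + Cin == Sum + 2 * (Carry + Cout) for all inputs."""
    for bits in product((0, 1), repeat=5):
        total, carry, cout = compressor(*bits)
        if sum(bits) != total + 2 * (carry + cout):
            return False
    return True


if __name__ == "__main__":
    print(satisfies_equation(compressor42_full_adders))
    print(satisfies_equation(compressor42_xor_mux))
```

In the XOR/MUX model the select input of the final multiplexer is ready before Cin arrives, which is the kind of timing advantage the MUX-based designs in this section aim for.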

The block diagram in Fig. 8 shows the implementation of the 4:2 compressor using MUX and XOR-XNOR. As in the case of the 3:2 compressor, existing architectures neglect the fact that both the output and its complement are available at every stage. Here the XOR is replaced with MUX*, which improves the speed of the design. In addition, the MUX block at the Sum output receives its select bit before the inputs arrive, so the transistors are already switched by the time the inputs arrive.

Fig. 8 4:2 compressor design IV
Fig. 9 5:2 compressor design III

The 5:2 compressor architecture shown in Fig. 9 is an improved version of existing architectures. The internal equations of the 5:2 compressor are changed to eliminate the final NOT gates of the CMOS full adder (FA), which can improve both the power dissipation and the operational speed. To achieve this goal, XNOR gates are used instead of XORs in the second stage of the architecture. This design uses 82 transistors. In this architecture, FA-NOT is a CMOS full adder whose final NOT gates have been eliminated. Carry generator modules, which are also used in design II, are used here to produce the Cout1 and Cout2 output signals. In addition, the outputs of the XOR gates are fed to the inputs of the XNOR gates. In this way, the outputs of the XNOR gates are the negation of what they were in conventional 5:2 compressors, and by using an FA-NOT in place of an FA, valid Sum and Carry signals are obtained. The XNOR module is shown in Fig. 10.

Fig. 10 CMOS implementation of XNOR gate

Design IV is a modified architecture, shown in Fig. 11, in which changes have been made to use the outputs generated at every stage efficiently by replacing a few XOR blocks with MUX blocks. The select bits to the multiplexers in the critical path are made available well ahead of the inputs so as to reduce the critical delay. If the output of a multiplexer is used as the select bit for another multiplexer, it can be used efficiently in a similar manner, because the negation of the select bit is also required in the design and an extra stage to compute the negation can be saved. Similarly, replacing the XOR block in the second stage with a MUX block, as shown in Fig. 12, reduces the delay because the select bit X3 is already available and the transistor switching takes place in parallel with the computation of the inputs of the block.

In all general implementations of the XOR or MUX block, both the output and its complement are generated. Existing architectures do not use this advantage. In this design the outputs are utilized efficiently by using multiplexers at select stages in the circuit, and additional inverter stages are eliminated. This in turn reduces the delay and power consumption, and the transistor count is considerably lower.
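The 5:2 compressor equation and the carry-independence requirements stated in Section II can also be checked by brute force. The sketch below is a behavioural model under stated assumptions: the conventional compressor is modelled as three cascaded full adders with an assumed input ordering, and `compressor52_reordered` is only an illustrative rewiring that removes the Cin1 dependency of Cout2. It is not the CGEN-based design II, nor designs III or IV of this paper. All names are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) of three equally weighted bits."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def compressor52_conventional(x1, x2, x3, x4, x5, cin1, cin2):
    """Assumed conventional wiring: three cascaded full adders."""
    s1, cout1 = full_adder(x1, x2, x3)
    s2, cout2 = full_adder(s1, x4, cin1)   # Cout2 depends on Cin1 here
    total, carry = full_adder(s2, x5, cin2)
    return total, carry, cout1, cout2


def compressor52_reordered(x1, x2, x3, x4, x5, cin1, cin2):
    """Illustrative rewiring: both input carries enter only the last stage."""
    s1, cout1 = full_adder(x1, x2, x3)
    s2, cout2 = full_adder(s1, x4, x5)     # Cout2 depends on primary inputs only
    total, carry = full_adder(s2, cin1, cin2)
    return total, carry, cout1, cout2


def satisfies_equation(compressor):
    """Check sum of the seven inputs == Sum + 2 * (Carry + Cout1 + Cout2)."""
    for bits in product((0, 1), repeat=7):
        total, carry, cout1, cout2 = compressor(*bits)
        if sum(bits) != total + 2 * (carry + cout1 + cout2):
            return False
    return True


def depends_on(compressor, output_index, input_index):
    """True if flipping the given input can change the given output."""
    for bits in product((0, 1), repeat=7):
        flipped = list(bits)
        flipped[input_index] ^= 1
        if compressor(*bits)[output_index] != compressor(*flipped)[output_index]:
            return True
    return False


# Output positions: 0 = Sum, 1 = Carry, 2 = Cout1, 3 = Cout2.
# Input positions: 0..4 = x1..x5, 5 = Cin1, 6 = Cin2.
if __name__ == "__main__":
    for name, comp in (("conventional", compressor52_conventional),
                       ("reordered", compressor52_reordered)):
        print(name, satisfies_equation(comp), depends_on(comp, 3, 5))
```

Both models satisfy the governing equation; they differ only in whether a carry entering as Cin1 can reach Cout2, which is the propagation path the text identifies as the source of extra delay.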

IV. RESULTS AND DISCUSSIONS
Four different architectures each of the 4:2 and 5:2 compressors are simulated in the Cadence Virtuoso schematic editor at a 700mV supply voltage. Design I of each compressor type is the conventional design. Designs III and IV of the 4:2 and 5:2 compressors are modified designs. The average power, worst case delay and transistor count were calculated. The output waveforms of the optimized designs are shown in Fig. 11 and Fig. 12. The results and inferences obtained from the simulations are explained in this section with the help of tabular and graphical comparisons.

TABLE I
COMPARISON TABLE OF 4:2 COMPRESSOR

Types of compressor   Power (μW)   Delay (ns)   PDP (fJ)   Transistor count
Design I              16.98        336          5.705      124
Design II             22.17        110          2.482      128
Design III            5.672        62.1         0.353      72
Design IV             8.89         155.6        1.383      96

TABLE II
COMPARISON TABLE OF 5:2 COMPRESSOR

Types of compressor   Power (μW)   Delay (ns)   PDP (fJ)   Transistor count
Design I              16.98        336          5.705      124
Design II             22.17        110          2.482      128
Design III            5.672        62.1         0.353      72
Design IV             8.89         155.6        1.383      96

Fig. 11 Waveform of 4:2 compressor design III
Fig. 12 Waveform of 5:2 compressor design IV
Fig. 13 Comparison table of 4:2 compressors
Fig. 14 Comparison table of 5:2 compressors

The power in microwatts, the delay in nanoseconds, the power delay product (PDP) and the number of transistors of each design of the 4:2 and 5:2 compressors are compared in Table I and Table II, and the corresponding graphs are shown in Fig. 13 and Fig. 14, respectively. The results for the 4:2 and 5:2 compressors obtained from Cadence simulation are plotted in Fig. 13 and Fig. 14, respectively (PDP is scaled by multiplying with 10

in the plot for proper view). Table I and Table II show that there exists a trade-off between power, delay and number of transistors. Among the observed designs, the third design of the 4:2 compressor has the minimum PDP (power delay product). Its transistor count is also lower than that of the other designs. The use of pass transistor logic reduces the transistor count in design III. Among the observed designs of the 5:2 compressor, design IV has the minimum PDP. Its transistor count is also lower than that of the other designs, so the area can be optimized.

V. CONCLUSIONS
In this work, the PDP and transistor count of four different designs each of 4:2 and 5:2 compressors are compared in Cadence with 90nm technology at 700mV. The Cadence simulation results show that the third design of the 4:2 compressor and the fourth design of the 5:2 compressor, which are modified versions of existing designs, are the most efficient. It is concluded that among the four simulated 4:2 compressor designs, the third design is the most energy efficient one; there exists a tradeoff between PDP and transistor count, but the hardware cost is reduced. Among the four simulated 5:2 compressor designs, the third design has the minimum transistor count and consequently is the most energy efficient one with minimum PDP. That is, by lowering the supply voltage and optimizing the design, efficiency can be improved without much area overhead. As future work, these compressors can be used in multipliers, which in turn can be used in MAC units for high speed applications like DSP, WSN etc.

ACKNOWLEDGMENT
The authors would like to acknowledge the professors and assistant professors of the ECE department, RSET, Rajagiri Valley, and the many students of the VLSI and Embedded System branch who have contributed to the work.

REFERENCES
[1] Ming-Bo Lin, Introduction to VLSI Systems, CRC Press, November 2011, pp. 256-300.
[2] O. Kwon, K. Nowka and E. E.
Swartzlander, A 16-Bit by 16-Bit MAC Design Using Fast 5:3 Compressor Cells, The Journal of VLSI Signal Processing, vol. 31, 2002.
[3] Amir Momeni and Paolo Montuschi, Design and Analysis of Approximate Compressors for Multiplication, IEEE Transactions on Computers, vol. 64, no. 4, 2015.
[4] A. Najafi, S. Timarchi and A. Najafi, High-speed Energy-efficient 5:2 Compressor, Proceedings of MIPRO, Opatija, Croatia, 2014.
[5] S. Veeramachaneni, K. M. Krishna, L. Avinash, S. R. Puppala and M. B. Srinivas, Novel Architectures for High-Speed and Low-Power 3-2, 4-2 and 5-2 Compressors, International Conference on VLSI Design, 2007.
[6] Shanthala S, Cyril Prasanna Raj and Dr. S Y Kulkarni, Design and VLSI Implementation of Pipelined Multiply Accumulate Unit, Second International Conference on Emerging Trends in Engineering Technology, ICETET, 2009.
[7] Amir Najafi, Ardalan Najafi and Sattar Mirzakuchaki, Low-power and High-performance 5:2 Compressors, 22nd Iranian Conference on Electrical Engineering, May 20-22, 2014.
[8] Teffi Francis, Tera Joseph and Jobin K Antony, Modified MAC Unit for Low Power High Speed DSP Application Using Multiplier with Bypassing Technique and Optimized Adders, IEEE-31661, 4th ICCCNT, 2013.
[9] Geoff Knagge homepage on carry save addition. (2010) [Online]. Available: http://www.geoffknagge.com/fyp/carrysave.shtml
[10] K.N.V.S Vijaya Lakshmi and D.R. Sandeep, Low Power 32-Bit DADDA Multiplier, International Journal of Computer Trends and Technology (IJCTT), volume 17, November 2014.
[11] Chandra K and Kumar P, Optimization of RSA Processors Using Multiplier, International Journal of Computer Trends and Technology (IJCTT), volume 4, Issue 5, May 2013, Page 1089.