Clock Gating Aware Low Power ALU Design and Implementation on FPGA

Bishwajeet Pandey and Manisha Pattanaik

Abstract: This paper deals with the design and implementation of a Clock Gating Aware Low Power Arithmetic and Logic Unit (ALU), developed as part of a low power processor design in Xilinx ISE 14.2 and synthesized on a 90nm Spartan-3 FPGA. Clock power contributes 45-60 percent of total dynamic power, so clock power reduction is necessary in low power design. In this paper, we analyze a theoretical 93.75% clock power reduction in the ALU using clock gating techniques. In simulation, we achieved 88.23% clock power reduction using latch based clock gating and 70.58% clock power reduction using latch free clock gating.

Index Terms: Clock gate, ALU, FPGA, LUT, clock power, register transfer level, dynamic power, leakage power.

Manuscript received December 25, 2012; revised February 23, 2013. The authors are with the Indian Institute of Information Technology, Gwalior and Centre for Development of Advanced Computing (CDAC), Noida, India (e-mail: gyancity@gyancity.com, manishapattanaik@iiitm.ac.in). DOI: 10.7763/IJFCC.2013.V2.206

I. INTRODUCTION

Low Power ALU Design is based on applying a clock gate to turn off each sub-module of the ALU that is not used by the currently executing instruction, as decided by the instruction decoder unit. According to [1]-[3], the clock consumes 50-70 percent of total chip power, and this share will increase in the coming generation of hardware designs at 32nm and below. Hence, reducing clock power is very important. Clock gating is a key power reduction technique used by hardware designers and is typically implemented by RTL-level HDL simulators or gate level power analyser tools.

$$\text{Power} = C \times \text{Voltage}^{2} \times \text{frequency} \qquad (1)$$

In equation (1), power is directly proportional to the square of the voltage and to the frequency of the clock.

A. Statement of the Problem

Clock gating is used in VLSI circuit design to reduce dynamic power by gating off the functional units that are not used by the currently executing instruction, as decided by the instruction decoder unit.

II. LITERATURE REVIEW

Clock enable consumed more power and clock gating consumed less power [1]. According to reference [2], power optimization, traditionally relegated to the synthesis and circuit level, has now shifted to the System Level and Register-Transfer Level. This is possible due to clock gating, which switches off the inactive units of the design and reduces overall power consumption. There are many clock gating styles available to optimize power in VLSI circuits. They can be:

Latch-free based CG design.
Latch-based CG design.
Flip-flop based CG design.
Intelligent CG design.

A. Latch-Free Clock Gated ALU Design

We use an AND gate in the clock gate if the clock is active on the rising edge, and an OR gate if the clock is active on the falling edge. Using the idea given in [2] and [4], we develop the ALU design shown in Fig. 1.

Fig. 1. Latch free clock gated design

B. Problem in Latch-Free Clock Gated Design

If the enable signal goes inactive in the middle of a clock pulse, the gated clock pulse is terminated before its full duration, as shown in Fig. 2.

Fig. 2. Problem in latch-free clock gated design

C. Latch Based Clock Gated ALU Design

The latch-based clock gate includes a level sensitive latch that holds the enable signal from the active edge to the inactive edge of the clock, as shown in Fig. 3.

Fig. 3. Latch based clock gated ALU design

D. Flip-Flop Based Clock Gated ALU Design

The flip-flop based clock gate consists of a level sensitive latch in design to hold the enable signal from the active edge to the inactive edge of the clock, as shown in Fig. 4.

Fig. 4. Flip-flop based clock gated ALU design

Reference [5] presents the design and implementation of a self-timed arithmetic logic unit (ALU) developed as part of an asynchronous microprocessor. The design in [5] displays an inherent operational characteristic of low power consumption, owing to synchronization signals that stop when the execution of an operation finishes (stoppable clock); this is a precursor of clock gating. Our clock gating work is an extension of the work done in [5], i.e. switching off a functional unit when the unit is not in use.

The ALU in [6] performs 16 instructions and has a two-stage pipelined architecture. For low power consumption, [6] proposes a new ALU architecture with an efficient ELM adder based on a propagation (P) and generation (G) block scheme. The adder of the proposed ALU is disabled while a logical operation is performed, and vice versa. This concept is the same as our clock gating approach: we also switch off the arithmetic functions when a logical function is in use, and vice versa, using a clock gate. In [6], the outputs of the P block are separated into a dual bus to reduce switching capacitance during ALU operation.

III. INTRODUCTION OF CLOCK GATED ALU

A and B are the two input buses; carry in, Sel[3:0], and CLK are the other inputs. The output bus ALU Out returns the result of the ALU operation. OC[7:4], port-mapped to Sel[3:0], determines which operation is performed, as shown in Fig. 5.

A. Operations Performed in ALU

The 4-bit opcode, i.e. OC[7:4], is port-mapped to Sel[3:0] to decode all instructions in the range 0000-1111. The first eight are unary functions, the next four are arithmetic functions, and the last four are logic functions. All functions are listed in Table I.

TABLE I: FUNCTION OF ALU
Functions of Arithmetic and Logic Unit

Unary           Sel     Arithmetic & Logic   Sel
Clear           0000    Add                  1000
Hold B          0001    Subtract             1001
Complement B    0010    Add Carry            1010
Hold A          0011    Subtract Borrow      1011
Complement A    0100    Logical AND          1100
Decrement A     0101    Logical OR           1101
Increment A     0110    Logical XOR          1110
Shift Left A    0111    Logical XNOR         1111

All flags are unaffected in the execution of unary functions, except the Carry flag in Shift. All flags are set in every operation from 1000-1111.

The ALU generates 4 flags: Zero (Z), Carry (C), Sign (S), and Parity (P). Flags are not affected by the unary logic functions. Only the C flag is affected by the Shift function. All flags are affected by the other ALU functions.

Fig. 5. Arithmetic and logic unit

IV. METHODOLOGY

A. Clear Function of ALU

The clear function resets the output of the ALU to 8'h00. If we add a clock gate instead of de-multiplexing the clock signal to all 16 sub-modules of the ALU, power is reduced by 93.75%, as shown in Fig. 6 (a-b).

Fig. 6. Without clock gate (a), With clock gate (b)

B. Save Operand Register Value in ALU

ALU_Out = B; Pass the value of B to the ALU output. In clock gating, we turn off the supply of the clock signal to the remaining 15 modules other than Save B. Hence power is reduced by 93.75%, as shown in Fig. 7 (a-b).

Fig. 7. Without clock gate (a), With clock gate (b)

C. Invert Operand Register Value in ALU

ALU_Out = ~B; Pass the complemented value of B to the ALU output without considering the value of A. In clock gating, we turn off the 15 functional units as shown in Fig. 8. Hence power is reduced by 93.75%.

Fig. 8. Without clock gate (a), With clock gate (b)

D. Hold Data Bus Value

ALU_Out = A; Pass the value of A to the ALU output. In clock gating, we turn off the 15 functional units as shown in Fig. 9. Hence power is reduced by 93.75%.

Fig. 9. Without clock gate (a), With clock gate (b)

E. Invert Data Bus Value

ALU_Out = ~A; Pass the inverted value of A to the ALU output. In clock gating, we turn off the 15 functional units as shown in Fig. 10. Hence power is reduced by 93.75%.

Fig. 10. Without clock gate (a), With clock gate (b)

F. Decrement Data Bus Value

ALU_Out = A - 1; Pass the decremented value of A to the ALU output. In clock gating, we turn off the 15 functional units as shown in Fig. 11. Hence power is reduced by 93.75%.

Fig. 11. Without clock gate (a), With clock gate (b)

G. Increment Data Bus Value

ALU_Out = A + 1; Pass the incremented value of A to the ALU output. In clock gating, we turn off the 15 functional units as shown in Fig. 12. Hence power is reduced by 93.75%.

Fig. 12. Without clock gate (a), With clock gate (b)

H. Left Shift Data Bus Value

ALU_Out = A << 1; Left shift A by 1 bit and pass that value to the ALU output. In clock gating, we turn off the 15 functional units as shown in Fig. 13. Hence power is reduced by 93.75%.

Fig. 13. Without clock gate (a), With clock gate (b)

I. Addition Operation in ALU

ALU_Out = A + B; Add the value of B to the value of A and pass the result to ALU Out. In clock gating, we turn off the 15 functional units as shown in Fig. 14. Hence power is reduced by 93.75%.

Fig. 14. Without clock gate (a), With clock gate (b)

J. Subtraction in ALU

ALU_Out = A - B; Subtract the value of B from the value of A and pass the result to ALU Out. In clock gating, we turn off the 15 functional units as shown in Fig. 15. Hence power is reduced by 93.75%.

Fig. 15. Without clock gate (a), With clock gate (b)

Fig. 16. Without clock gate (a), With clock gate (b)

K. Addition with Carry

ALU_Out = A + B + carry_in; Add the three values A, B, and Carry_in and store the result in ALU Out.
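The operations in this section follow the select codes of Table I, with one module active per instruction. The following is a minimal behavioural sketch in Python, not the authors' HDL: the names `OPS`, `clock_enables`, and `alu` are hypothetical, the 8-bit width is taken from the 8-bit ALU reported in the results, and flag generation is deliberately left out.

```python
MASK = 0xFF  # assumption: 8-bit datapath, matching the paper's 8-bit ALU

# Sel code -> (name, function of (a, b, carry_in)), following Table I.
OPS = {
    0b0000: ("Clear",           lambda a, b, c: 0x00),
    0b0001: ("Hold B",          lambda a, b, c: b),
    0b0010: ("Complement B",    lambda a, b, c: ~b),
    0b0011: ("Hold A",          lambda a, b, c: a),
    0b0100: ("Complement A",    lambda a, b, c: ~a),
    0b0101: ("Decrement A",     lambda a, b, c: a - 1),
    0b0110: ("Increment A",     lambda a, b, c: a + 1),
    0b0111: ("Shift Left A",    lambda a, b, c: a << 1),
    0b1000: ("Add",             lambda a, b, c: a + b),
    0b1001: ("Subtract",        lambda a, b, c: a - b),
    0b1010: ("Add Carry",       lambda a, b, c: a + b + c),
    0b1011: ("Subtract Borrow", lambda a, b, c: a - b - c),
    0b1100: ("Logical AND",     lambda a, b, c: a & b),
    0b1101: ("Logical OR",      lambda a, b, c: a | b),
    0b1110: ("Logical XOR",     lambda a, b, c: a ^ b),
    0b1111: ("Logical XNOR",    lambda a, b, c: ~(a ^ b)),
}


def clock_enables(sel):
    """One-hot enable vector: only the selected module keeps its clock."""
    return [1 if i == sel else 0 for i in range(16)]


def alu(sel, a, b, carry_in=0):
    """Return (operation name, 8-bit result, number of gated-off modules)."""
    name, fn = OPS[sel]
    result = fn(a & MASK, b & MASK, carry_in & 1) & MASK  # truncate to 8 bits
    gated = clock_enables(sel).count(0)
    return name, result, gated
```

For example, `alu(0b1010, 200, 100, 1)` returns `("Add Carry", 45, 15)`: the sum 301 is truncated to 8 bits, and 15 of the 16 modules have their clock gated off, which corresponds to the 15/16 = 93.75% figure used throughout this section.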
In clock gating, we turn off the 15 functional units other than ADDC, as shown in Fig. 16. Hence power is reduced by 93.75% (15/16 × 100).

L. Subtraction with Carry

ALU_Out = A - B - carry_in; Subtract the value of B from the value of A, then subtract Carry In from that result, and pass it to ALU Out. In clock gating, we turn off the 15 functional units as shown in Fig. 17. Hence power is reduced by 93.75%.

Fig. 17. Without clock gate (a), With clock gate (b)

M. Logical AND Operation

ALU_Out = A & B; Calculate the logical AND of A and B and pass that value to ALU Out. In clock gating, we turn off the 15 functional units as shown in Fig. 18. Hence power is reduced by 93.75%.

Fig. 18. Without clock gate (a), With clock gate (b)

N. Logical OR Operation

ALU_Out = A | B; Calculate the logical OR of A and B and pass that value to ALU Out. In clock gating, we turn off the 15 functional units as shown in Fig. 19. Hence power is reduced by 93.75%.

Fig. 19. Without clock gate (a), With clock gate (b)

O. Logical XOR Operation

ALU_Out = A ^ B; Calculate the logical XOR of A and B and pass that value to ALU Out. In clock gating, we turn off the 15 functional units as shown in Fig. 20. Hence power is reduced by 93.75%.

Fig. 20. Without clock gate (a), With clock gate (b)

P. Logical XNOR Operation

ALU_Out = ~(A ^ B); Calculate the logical XNOR of A and B and pass that value to ALU Out. In clock gating, we turn off the 15 functional units as shown in Fig. 21. Hence power is reduced by 93.75%.

Fig. 21. Without clock gate (a), With clock gate (b)

Q. RTL Technology Schematic

The RTL schematic is generated by Xilinx Synthesis Technology and saved with the extension .ngr, as shown in Fig. 22. The RTL schematic shows a schematic representation of the optimized design in terms of digital logic symbols that are not related to the targeted Xilinx FPGA device.

Fig. 22. RTL Schematic of ALU

V. RESULTS

An 8-bit ALU is designed and developed for a low power processor in Xilinx ISE 14.4 and synthesized on a 90nm Spartan-3 FPGA.

ALU power affected by clock frequency: all dynamic power (Clock, Logic, Signal, IOs) is directly proportional to clock frequency. The power calculation by Xilinx XPower 14.4 is affected by the settings of the alucg.ncd and alucg.ucf files of Xilinx ISE 14.4.

Frequency   Clock      Logic     Signal    IOs
100MHz      2mW        1mW       1mW       0mW
1000MHz     17mW       9mW       10mW      4mW
10GHz       168mW      48mW      88mW      41mW
100GHz      1679mW     153mW     802mW     410mW
1000GHz     16795mW    1198mW    7983mW    4099mW

In the next phase, using clock gating, we turn off the remaining 15 modules whenever any one module is executing, so the theoretical power reduction is assumed to be 93.75%.

Table II shows 88.23% clock power reduction using latch based clock gating.

TABLE II: LATCH BASED CLOCK GATING

Latch Based Clock Gating   Total Power   Dynamic Power   Clock Power
Without Clock Gate         94mW          41mW            17mW
With Clock Gate            77mW          25mW            2mW

Table III shows 70.58% clock power reduction using latch free clock gating.

TABLE III: LATCH FREE BASED CLOCK GATING

Latch Free Based Clock Gating   Total Power   Dynamic Power   Clock Power
Without Clock Gate              94mW          41mW            17mW
With Clock Gate                 80mW          28mW            5mW

VI. CONCLUSION

Power reduction, which traditionally dealt with synthesis, circuit-level design, and the placement and routing stages, has now moved to the System Level and Register Transfer Level. This is possible due to clock gating, which switches off the inactive units of the design and reduces overall power consumption. The Register Transfer Level approach is important because hardware designers generally verify power only at the gate level, and any change to the Register Transfer Level needs many design iterations to reduce power. Our ALU has 16 functions, and each function has one dedicated module. When an instruction executes in its module, the other modules not used by the currently executing instruction must be gated off by the clock gate. From the formula

$$\text{Reduction}\,\% = \frac{\text{Number of Units Gated}}{\text{Total Number of Units}} \times 100$$

when any one module executes, clock gating turns off the remaining 15 modules and hence reduces power by (15/16) × 100 = 93.75%.

VII. FUTURE SCOPE

Clock gating is one of the best techniques to reduce dynamic power. There is a need to extend the clock gating technique to reduce leakage power consumption. The Virtex-6 FPGA is based on 40-nm technology. The latest FPGAs, such as Virtex-7, Kintex-7, and Artix-7, are based on 28-nm technology and contribute significant leakage power consumption. There is a need to optimize clock gating to reduce leakage power along with dynamic power.

ACKNOWLEDGMENT

Thanks and appreciation to the helpful people at ABV-IIITM and CDAC Noida for their support.

REFERENCES

[1] J. P. Oliver, J. Curto, D. Bouvier, M. Ramos, and E. Boemo, "Clock gating and clock enable for FPGA power reduction," in Proc. 8th Southern Conference on Programmable Logic (SPL), pp. 1-5, 2012.
[2] J. Shinde and S. S. Salankar, "Clock gating-a power optimizing technique for VLSI circuits," in Proc. Annual IEEE India Conference (INDICON), pp. 1-4, 2011.
[3] J. Castro, P. Parra, and A. J. Acosta, "Optimization of clock-gating structures for low-leakage high-performance applications," in Proceedings of IEEE International Symposium on Efficient Embedded Computing, pp. 3220-3223, 2010.
[4] V. Khorasani, B. V. Vahdat, and M. Mortazavi, "Design and implementation of floating point ALU on a FPGA processor," IEEE International Conference on Computing, Electronics and Electrical Technologies (ICCEET), pp. 772-776, 2012.
[5] S. Cisneros, J. J. Panduro, J. Muro, and E. Boemo, "Rapid prototyping of a self-timed ALU with FPGAs," in Proc. International Conference on Reconfigurable Computing and FPGAs, pp. 26-33, 2012.
[6] B. S. Ryu, J. S. Yi, K. Y. Lee, and T. W. Cho, "A design of low power 16-bit ALU," in Proceedings of the IEEE TENCON Conference, pp. 868-871, 1999.

Bishwajeet Pandey received the Integrated BCA-MCA degree from The People's University, Delhi in 2009. He is pursuing a Master of Technology in Computer Science Engineering with specialisation in VLSI from the Indian Institute of Information Technology (IIIT), Gwalior. He is working in a joint research project of the Centre for Development of Advanced Computing (C-DAC) Noida and the VLSI Design Lab of IIIT, Gwalior. His area of interest is low power research in hardware design for energy efficient green computing. Pandey has over 8 years of work experience in different domains as website developer, lecturer, trainer, cloud manager, and application software developer.

Manisha Pattanaik received ME degree in Electronic Systems and Communications from NIT, Rourkela, India in 1993 and 1997 respectively. She received the PhD degree from IIT Kharagpur, India in 2005. Dr. Manisha Pattanaik joined as Faculty at ABV-Indian Institute of Information Technology and Management, Gwalior, India in 2007 and is currently an Associate Professor. She is working with more than 60 co-researchers from industry and academia to create a global educational excellence. She has authored and coauthored over 80 papers in journals and conference proceedings in various areas of VLSI design, applications, and Electronics Design Automation. She is a member of IEEE, the Institute of Electronics, Information and Communication Engineers (IEICE), the World Scientific and Engineering Academy and Society (WSEAS), Greece, and ISTE.
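As a worked check of the reduction figures in Section V and the formula in Section VI, the following minimal Python sketch recomputes the percentages from the clock power values in Tables II and III. The helper names `reduction_percent` and `gated_fraction_percent` are hypothetical and introduced only for illustration.

```python
def reduction_percent(before_mw, after_mw):
    """Percentage reduction when power drops from before_mw to after_mw."""
    return (before_mw - after_mw) / before_mw * 100.0


def gated_fraction_percent(units_gated, total_units):
    """Section VI formula: share of units whose clock is gated off."""
    return units_gated / total_units * 100.0


# Theoretical figure: 15 of the 16 ALU modules are gated off.
theoretical = gated_fraction_percent(15, 16)

# Clock power column of Table II (latch based): 17mW -> 2mW.
latch_based = reduction_percent(17, 2)

# Clock power column of Table III (latch free): 17mW -> 5mW.
latch_free = reduction_percent(17, 5)

print(f"theoretical: {theoretical:.2f}%")
print(f"latch based: {latch_based:.3f}%")
print(f"latch free:  {latch_free:.3f}%")
```

The theoretical value is exactly 93.75%. The two measured values come out at roughly 88.235% and 70.588%, which match the reported 88.23% and 70.58% if the reported figures are truncated rather than rounded to two decimal places.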