Tutorial Outline. Design Levels


ISCA Tutorial: Low Power Design

Tutorial Outline

8:30-8:45    Introduction and motivation
8:45-9:05    Sources of power in CMOS designs
9:05-9:30    Power analysis tools and techniques
9:30-10:30   Gate & functional unit design issues & techniques
10:30-10:50  BREAK
10:50-12:15  Architectural level issues and techniques
12:15-1:30   LUNCH
1:30-2:30    Low power memory system design
2:30-3:30    Software level issues and techniques
3:30-3:50    BREAK
3:50-4:30    Software level issues and techniques, con't
4:30-4:45    Future challenges

Design Levels

Abstraction Level       Analysis   Analysis   Analysis   Analysis    Energy
                        Capacity   Accuracy   Speed      Resources   Savings
Application             Most       Worst      Fastest    Least       Most
Behavioral
Architectural (RTL)
Logic (Gate)
Transistor (Circuit)    Least      Best       Slowest    Most        Least

Basic Principles of Low Power Design

$P = C_L V_{DD}^2 f_{0 \to 1} + t_{sc} V_{DD} I_{peak} f_{0 \to 1} + V_{DD} I_{leakage}$

Reduce switching (supply) voltage
  » quadratic effect -> dramatic savings
  » negative effect on performance
Reduce capacitance
Reduce switching frequency
  » switching activity
  » clock rate
Reduce glitching
Reduce short circuit currents (slope engineering)
Reduce leakage currents

Low Energy Gates: Transistor Sizing

Use the smallest transistors that satisfy the delay constraints
  » slack time: difference between required time and arrival time of a signal at a gate output
  » positive slack: size down
  » negative slack: size up
Make gates that toggle more frequently smaller
Size for slope engineering to reduce short circuit currents
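The power equation on the Basic Principles slide can be explored numerically. The sketch below evaluates its three terms separately and shows the quadratic effect of the supply voltage on the switching term. The function name and all component values are hypothetical, chosen only for illustration.

```python
def cmos_power(c_load, v_dd, f_switch, t_sc, i_peak, i_leak):
    """Return (switching, short_circuit, leakage, total) power in watts.

    c_load   : switched load capacitance C_L (farads)
    v_dd     : supply voltage V_DD (volts)
    f_switch : rate of 0 -> 1 output transitions (hertz)
    t_sc     : duration of the short circuit current pulse (seconds)
    i_peak   : peak short circuit current (amperes)
    i_leak   : leakage current (amperes)
    """
    switching = c_load * v_dd ** 2 * f_switch
    short_circuit = t_sc * v_dd * i_peak * f_switch
    leakage = v_dd * i_leak
    return switching, short_circuit, leakage, switching + short_circuit + leakage


# Hypothetical operating point, not taken from the tutorial.
params = dict(c_load=50e-15, f_switch=100e6, t_sc=100e-12, i_peak=50e-6, i_leak=1e-9)

nominal = cmos_power(v_dd=2.5, **params)
scaled = cmos_power(v_dd=1.25, **params)

# Halving V_DD cuts the switching term to one quarter (quadratic effect).
switching_ratio = scaled[0] / nominal[0]
print(f"switching power ratio at half supply: {switching_ratio:.2f}")
```

The short circuit and leakage terms are held at the same currents here for simplicity; in a real circuit they also change with the supply voltage.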

Low Energy Gates: Transistor Pin Ordering

Logically equivalent inputs may not have identical energy/delay characteristics
[Figure: gate with inputs A and B, output Out, output capacitance C_out, and internal node capacitance C_i]
To conserve energy (and improve speed), connect inputs so that the most active input is nearest the output
Need to know signal statistics

Low Energy Gates: Dynamic Gate Pin Ordering

Dynamic gates exhibit higher switching activity (and add to clock load) but are fast
[Figure: two dynamic gates with select inputs SelA, SelB, SelC, one using data inputs !A, !B, !C and the other A, B, C]
If A, B, and C have low signal probability

Low Energy Gates: Gate Restructuring

Logically equivalent CMOS gates may not have identical energy/delay characteristics

Low Energy Gate Networks: Balanced Delay Paths

Reduce glitching by balancing the delay paths
[Figure: gate networks built from blocks F1, F2, and F3 with their path delays]
Equalize lengths of timing paths through logic

Low Energy Gate Networks: Network Restructuring

Consider logic topology alternatives
[Figure: chain and tree implementations of F from inputs A, B, C, D, each with signal probability 0.5. Chain: switching activity 3/16 at W, 7/64 at X, 15/256 at F. Tree: 3/16 at Y, 3/16 at Z, 15/256 at F.]
Chain implementation has a lower overall switching activity than the tree implementation
Ignores glitching effects

Network Restructuring, con't

Logically equivalent gate networks may not have identical energy/delay characteristics
[Figure: technology mapping alternatives for F = ABCD, compared by delay, area, and energy]
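The chain and tree activity figures on the Network Restructuring slide can be reproduced with a few lines of arithmetic. The sketch below assumes that every gate is a 2-input AND, that the inputs are independent, and that the switching activity of a node is the probability of a 0 -> 1 transition, (1 - p) x p. These assumptions are not stated on the slide, but they give the same fractions.

```python
from fractions import Fraction


def and_probability(p_a, p_b):
    """Probability that a 2-input AND output is 1, assuming independent inputs."""
    return p_a * p_b


def activity(p_one):
    """Probability of a 0 -> 1 transition between two independent cycles."""
    return (1 - p_one) * p_one


half = Fraction(1, 2)

# Chain: W = A.B, X = W.C, F = X.D
w = and_probability(half, half)
x = and_probability(w, half)
f_chain = and_probability(x, half)
chain = [activity(node) for node in (w, x, f_chain)]

# Tree: Y = A.B, Z = C.D, F = Y.Z
y = and_probability(half, half)
z = and_probability(half, half)
f_tree = and_probability(y, z)
tree = [activity(node) for node in (y, z, f_tree)]

print("chain:", chain, "total", sum(chain))
print("tree: ", tree, "total", sum(tree))
```

The totals come out to 91/256 for the chain and 111/256 for the tree, which is the sense in which the chain has the lower overall switching activity. Glitching is ignored, as the slide notes.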

Low Energy Gate Networks: Network Input Ordering

Input ordering
[Figure: two orderings of a three-input network F with signal probabilities A = 0.5, B = 0.2, C = 0.1. Combining A and B first gives an activity at X of (1 - 0.5 x 0.2) x (0.5 x 0.2) = 0.09. Combining B and C first gives (1 - 0.2 x 0.1) x (0.2 x 0.1) = 0.0196.]
Beneficial to postpone the introduction of signals with a high transition rate (signals with signal probability close to 0.5)

Dual Supply Voltages

Use two V_DDs (e.g., 2.5V and 1.5V)
  » use the higher supply for gates on the critical path
  » use the lower supply for gates off the critical path
Reduces energy without a performance loss
Cons
  » slight area penalty
  » increased design time
  » need level converters to interconnect gates on different supplies (to avoid static currents)
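The input ordering comparison on the Network Input Ordering slide can be checked by evaluating every choice of first-stage input pair. The sketch below assumes 2-input AND gates and independent inputs, and uses the signal probabilities from the slide; the helper and variable names are illustrative only.

```python
from fractions import Fraction
from itertools import combinations


def and_activity(p_a, p_b):
    """0 -> 1 transition probability at the output of a 2-input AND gate."""
    p_one = p_a * p_b
    return (1 - p_one) * p_one


# Signal probabilities from the slide: A = 0.5, B = 0.2, C = 0.1.
signal_probability = {
    "A": Fraction(1, 2),
    "B": Fraction(1, 5),
    "C": Fraction(1, 10),
}

# Activity at the internal node X for each pair that could drive the first gate.
activity_at_x = {
    pair: and_activity(signal_probability[pair[0]], signal_probability[pair[1]])
    for pair in combinations(sorted(signal_probability), 2)
}

for pair, act in sorted(activity_at_x.items(), key=lambda item: item[1]):
    print(pair, float(act))

best_pair = min(activity_at_x, key=activity_at_x.get)
```

Pairing B with C first leaves A, the signal closest to probability 0.5, for the last gate, which matches the slide's recommendation to postpone high transition rate signals.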

Dual Threshold Voltages

Use two V_Ts (e.g., 0.6V and 0.3V for V_DD = 2.5V)
  » use the lower threshold for gates on the critical path
  » use the higher threshold for gates off the critical path
Improves performance without an increase in power
Cons
  » increased fabrication complexity
  » increased design time
  » beware of increased leakage in the low V_T portion of the circuit: could end up with increased power!

Functional Unit Energy Optimization

Key processor core functional units
  » latches and (pipeline) registers
  » ALUs: adders, multipliers, barrel shifters
  » control logic (FSMs)
  » interconnect
  » multi-ported register file
On-chip memories (ROMs, caches, SRAMs, eDRAMs)
MMU, TLB
Clock generation and distribution
Off-chip interconnect (pads)

Flipflops and Pipeline Registers

Consume a lot of energy because they are clocked every cycle
  » Clock energy (E_c): energy dissipated when the ff is clocked with stable data
  » Data energy (E_d): energy dissipated when the ff is clocked and the data has changed so that the ff changes state
  » Typically the data rate (f_d) is much lower than the clock rate (f_c)
Also impacts clock energy since a large portion of clock energy is used to drive the sequential elements

Power Consumption in Latches

[Figure: percentage of latch power due to data and clock versus latch data activity factor. From Tiwari, 1998]
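One way to read the E_c and E_d definitions on the Flipflops and Pipeline Registers slide is as a per-cycle average: a fraction f_d/f_c of cycles costs E_d and the rest cost E_c. The sketch below follows that reading, which is an assumption rather than a formula from the tutorial, and the energy values are hypothetical.

```python
def ff_average_power(e_clock, e_data, f_clock, data_activity):
    """Average flip-flop power in watts.

    e_clock       : energy per cycle when clocked with stable data (joules)
    e_data        : energy per cycle when the stored value changes (joules)
    f_clock       : clock rate f_c (hertz)
    data_activity : f_d / f_c, fraction of cycles in which the data changes
    """
    if not 0.0 <= data_activity <= 1.0:
        raise ValueError("data_activity must lie in [0, 1]")
    energy_per_cycle = (1.0 - data_activity) * e_clock + data_activity * e_data
    return f_clock * energy_per_cycle


# Hypothetical values for illustration only.
E_CLOCK = 20e-15
E_DATA = 60e-15
F_CLOCK = 200e6

for data_activity in (0.0, 0.1, 0.25, 0.5):
    power = ff_average_power(E_CLOCK, E_DATA, F_CLOCK, data_activity)
    # Share of power spent in cycles where the data did not change.
    stable_share = (1.0 - data_activity) * E_CLOCK * F_CLOCK / power
    print(f"activity {data_activity:.2f}: {power * 1e6:.2f} uW, "
          f"{stable_share:.0%} spent with stable data")
```

With a low data activity most of the power goes to cycles that do no useful state change, which is the motivation for the gating techniques on the later slides.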

Some Typical CMOS FFs

[Figure: schematics of four flip-flops with input D and output Q: static TG FF, dynamic C2MOS FF, dynamic precharged TSPC FF, dynamic non-precharged TSPC FF]

FF Power Comparison

[Figure: relative power consumption versus latch data activity factor for TGFF, C2MOS, PTSPC, and NPTSPC. From Svenson, 1996]

Energy Efficient Flipflops

[Figure: schematics of the PowerPC 603 FF (16 transistors, 4 clock loads each) and the StrongArm SA11 FF (20 transistors, 3 clock loads)]

EDP of Some Low Power FFs

[Figure: EDPtot (fJ) with High, Low, and Average series for HLFF, SDFF, PowerPC, mC2MOS, SA11FF, and K6ETL. From Stojanovic, 1998]

Self-Gating FF

When ff input is equal to its output, suppress internal clocking to conserve energy
  » gating function is derived within the FF
[Figure: self-gating flip-flop schematic with input D, output Q, and clock Φ]
Strict rules on when D can change wrt the clock (Φ)

Power of Self-Gated FF

[Figure: power dissipation versus data switching rate f_d/f_c for the self-gated FF (SG FF) and a regular FF (Reg FF). From Reyes, 1996]

Double Edge Triggered FF

Loads data at both rising and falling clock edges
[Figure: double edge triggered flip-flop schematic with input D and output Q]

DETFF Pros and Cons

Advantages
  » Clock frequency can be halved to achieve the same computational throughput: P_d = 0.84 P_s
  » Also get a 2X energy savings in the clock network
Disadvantages
  » About 15% larger in transistor count
  » Lower maximum operating frequency
  » Strict requirements on clock skew
  » Requires a strict 50% duty cycle
  » Larger clock load
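The throughput claim for the DETFF can be illustrated with a small behavioral model. The sketch below is not a circuit model: it simply counts the values captured by a single edge triggered flip-flop on a full-rate clock and by a double edge triggered flip-flop on a half-rate clock. The waveforms and function names are made up for the example.

```python
def captured_values(clock, data, edges):
    """Return the data values sampled at the selected clock edges.

    clock, data : equal-length lists, one entry per time step
    edges       : set containing "rising", "falling", or both
    """
    captured = []
    for t in range(1, len(clock)):
        rising = clock[t - 1] == 0 and clock[t] == 1
        falling = clock[t - 1] == 1 and clock[t] == 0
        if (rising and "rising" in edges) or (falling and "falling" in edges):
            captured.append(data[t])
    return captured


def clock_transitions(clock):
    """Count the clock toggles, a rough proxy for clock network energy."""
    return sum(1 for t in range(1, len(clock)) if clock[t] != clock[t - 1])


STEPS = 17
data = [t // 2 for t in range(STEPS)]               # new value every 2 steps
fast_clock = [t % 2 for t in range(STEPS)]          # period of 2 steps
slow_clock = [(t // 2) % 2 for t in range(STEPS)]   # period of 4 steps

set_ff = captured_values(fast_clock, data, {"rising"})
det_ff = captured_values(slow_clock, data, {"rising", "falling"})

print("single edge, full-rate clock:", set_ff, clock_transitions(fast_clock))
print("double edge, half-rate clock:", det_ff, clock_transitions(slow_clock))
```

Both flip-flops capture eight values over the same interval, while the half-rate clock toggles half as often, which corresponds to the 2X clock network saving listed among the advantages.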

Adders (Subtractors)

[Figure: taxonomy of synchronous word parallel adders. Ripple carry adders (RCA): T = O(n), A = O(n). Other classes shown: carry prop min adders, signed-digit adders, fast carry prop adders, residue adders. Fast carry propagation schemes shown: Manchester carry chain, carry select, carry lookahead, conditional sum, carry skip. Complexities listed in the figure: T = O(1), A = O(n); T = O(n), A = O(n); T = O(log n), A = O(n log n); T = O(n**1/2), A = O(n).]

PDP of Different Adders

[Figure: power-delay product of RCA, MCCA, CSkA, VSkA, CSlA, CLA, BKA, and ELMA adders at 8, 16, 32, 48, and 64 bits. From Nagendra, 1996]

Parallel Prefix Computation

Brent-Kung (CLA) Adder
[Figure: 16-bit Brent-Kung prefix network taking (g_15, p_15) through (g_0, p_0) and producing carries c_16 through c_1. Annotations in the figure: T = log2 n - 1, A = 2 log2 n, T = log2 n, A = n/2.]

BK and ELM Adder Optimization

[Figure: EDP (pJ) versus number of bits (16, 32, 64) for BK Classic, BK Hybrid, ELM Classic, and ELM Hybrid adders]
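The parallel prefix computation behind the Brent-Kung adder can be sketched in software. The function below is a bit-level model under the usual textbook formulation, with generate g_i = a_i AND b_i, propagate p_i = a_i XOR b_i, a forward tree followed by an inverse tree, and carry-in fixed at 0. It is not taken from the tutorial and says nothing about gate sizing or energy.

```python
def brent_kung_add(a, b, n=16):
    """Add two n-bit unsigned integers using a Brent-Kung prefix network.

    n must be a power of two. Returns the (n + 1)-bit sum.
    """
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")

    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]   # bit propagate
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]   # bit generate
    G, P = g[:], p[:]                                   # group signals

    def combine(hi, lo):
        # Prefix operator: (G, P)[hi] <- (G, P)[hi] o (G, P)[lo]
        G[hi] = G[hi] | (P[hi] & G[lo])
        P[hi] = P[hi] & P[lo]

    levels = n.bit_length() - 1
    # Forward tree: build groups of width 2, 4, 8, ...
    for d in range(levels):
        half, step = 1 << d, 1 << (d + 1)
        for i in range(step - 1, n, step):
            combine(i, i - half)
    # Inverse tree: fill in the remaining prefixes.
    for d in range(levels - 2, -1, -1):
        half, step = 1 << d, 1 << (d + 1)
        for i in range(step + half - 1, n, step):
            combine(i, i - half)

    carries = [0] + G            # c_0 = 0 and c_(i+1) = G[i:0]
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total | (carries[n] << n)


print(brent_kung_add(0xBEEF, 0x1234))
```

For n = 16 the forward tree has 4 levels and the inverse tree has 3, which lines up with the log2 n and log2 n - 1 annotations in the figure.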

Parallel Multipliers

Form partial product array in parallel and add it in parallel
  » can use multiplier recoding to reduce the height of the partial product array by half
  » recoding may cost more energy than it saves!
  » use delay balancing to reduce glitching
Array multipliers (regularity)
Pipelined multipliers (higher throughput, longer latency, less glitching but adds to clock load)

Parallel Multiplier Structure

[Figure: the multiplier (Q) and multiplicand (D) feed the multiple forming circuits (muxes), then the partial product array reduction tree (tree reduction, log n), then a fast CPA that produces the product P]

PP Array Reduction Process

[Figure: the partial product array formed from the multiplicand and multiplier is reduced by (4,2) counters to a reduced partial product array that goes to the CPA]

(4,2) Counters

Built out of (3,2) counters (FAs)
Tiles with neighboring (4,2) counters
Can use delay balancing in cell design and interconnect to reduce glitching
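The construction of a (4,2) counter from (3,2) counters can be checked exhaustively. The sketch below uses the common arrangement of two full adders per cell, which is an assumption about the slide's figure, and verifies that the five input bits and the three output bits always carry the same arithmetic weight.

```python
from itertools import product


def full_adder(a, b, c):
    """(3,2) counter: three input bits in, (sum, carry) out."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def counter_4_2(x1, x2, x3, x4, c_in):
    """(4,2) counter built from two full adders.

    c_out depends only on x1..x3, never on c_in, so neighboring cells
    can be tiled side by side without a rippling carry.
    """
    s1, c_out = full_adder(x1, x2, x3)
    s, carry = full_adder(s1, x4, c_in)
    return s, carry, c_out


# Exhaustive check: inputs of weight 1 equal sum (weight 1) plus
# carry and c_out (weight 2 each).
all_ok = all(
    sum(bits) == s + 2 * (carry + c_out)
    for bits in product((0, 1), repeat=5)
    for s, carry, c_out in [counter_4_2(*bits)]
)
print("(4,2) counter preserves the bit count:", all_ok)
```

Because c_out is independent of c_in, the horizontal carry between neighboring cells travels only one cell, which is what lets the cells tile across a row of the partial product array.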

PP Array Reduction Tree Structure

[Figure: multiple generators, driven by the multiplicand and by multiple selection signals from the multiplier, feed successive levels of (4,2) counter slices, followed by the CPA]

Glitch Reduction by Pipelining

Glitches are dependent on the logic depth of the circuit
Nodes logically deeper are more prone to glitching
  » arrival times of the gate inputs are more spread due to delay imbalances
  » usually affected by more primary input switching
Reduce depth by adding pipeline registers

Pipelined Parallel Multiplier

[Figure: the parallel multiplier structure (multiplier Q, multiplicand D, multiple forming circuits, partial product array reduction tree, fast CPA, product P) with pipeline registers clocked by clk]
Pipelining helps to reduce glitching but adds to the clock load

CSA Array Multiplier

[Figure: 4 x 4 carry save adder array with cells M_00 through M_33, multiplier bits q_3 through q_0, multiplicand bits d_0 through d_3, and product bits p_7 through p_0. Each CSA cell has a sum input, a carry in, a carry out, and a sum output.]
Longest delay path: n + n - 1 = 2n - 1

Multiplier Cell Structure

[Figure: multiplier cell built around a full adder with inputs A_i and B_j, sum input, carry in, and carry out, with delay elements 1D and 2D on its inputs]
Add delay elements to minimize glitching

Pipelined CSA Array Multiplier

[Figure: pipelined CSA array clocked by clk, with multiplier bits q_3 through q_0, multiplicand bits d_0 through d_3, cells M_ij, and product bits p_0 through p_7]

Barrel Shifters

[Figure: average power (mW) and delay (ns) for log PT, Array PT, log static, and log dynamic barrel shifters. From Acken, 1996]
Influence of architecture (logarithmic, array) and gate type (pass transistor, dynamic/static mux)

Control Unit Design

[Figure: inputs and state FFs feed the combinational logic, which produces the outputs and the next state; example state transition graph]
State Encoding
One of the most important factors determining area, speed, and energy of the resulting control logic
n! different possible encodings (n states)

Energy State Encoding Heuristic

Area driven -> try to reduce the distance in Boolean n-space between related states
Energy driven -> try to minimize number of bit transitions in the state register
  » fewer transitions in state register
  » fewer transitions propagated to combinational logic
[Figure: state transition graph with each edge labeled by the probability that the transition will occur (sum of all edges equals unity)]

Caveat

Lowest E[M] may not be lowest in energy: it could require more gates and/or signal transitions in the combinational logic
Experiments show that the area and energy dissipation of a state machine are correlated when the state encoding is varied
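The energy driven heuristic can be made concrete by scoring an encoding with the expected number of state register bit transitions per clock, which is how E[M] is read here (the slides do not define it explicitly). The sketch below uses a hypothetical four-state machine with made-up edge probabilities and searches all assignments of 2-bit codes.

```python
from fractions import Fraction
from itertools import permutations


def expected_transitions(encoding, edge_probability):
    """Sum over edges of probability x Hamming distance between state codes."""
    return sum(
        prob * bin(encoding[src] ^ encoding[dst]).count("1")
        for (src, dst), prob in edge_probability.items()
    )


# Hypothetical state transition graph; the edge probabilities sum to 1.
edge_probability = {
    ("S0", "S1"): Fraction(4, 10),
    ("S1", "S0"): Fraction(3, 10),
    ("S1", "S2"): Fraction(1, 10),
    ("S2", "S3"): Fraction(1, 10),
    ("S3", "S0"): Fraction(1, 10),
}
states = ["S0", "S1", "S2", "S3"]

binary = dict(zip(states, (0b00, 0b01, 0b10, 0b11)))
binary_cost = expected_transitions(binary, edge_probability)

# Exhaustive search over all assignments of the four 2-bit codes.
best_cost, best_encoding = min(
    (
        (expected_transitions(dict(zip(states, codes)), edge_probability),
         dict(zip(states, codes)))
        for codes in permutations(range(4))
    ),
    key=lambda item: item[0],
)

print("binary encoding E[M]:", float(binary_cost))
print("best encoding E[M]:  ", float(best_cost), best_encoding)
```

Exhaustive search is only practical for very small machines, and as the Caveat slide warns, the encoding with the lowest E[M] is not guaranteed to give the lowest total energy once the combinational logic is included.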

State Encoding Effects

[Figure: power versus area for different state encodings. From Yeap, 1997]

Practical Considerations

Balance area-energy by forced encoding of only a subset of states that span the high probability edges
  » leave assignment of remaining states to the logic synthesis system for area optimization
  » fortunately, in practice, most state machines have this characteristic
Unlike area encoding, energy encoding requires knowledge of probabilities of state transitions and input signals

A Low Power Processor Core Example

M CORE Architecture

[Figure: M CORE block diagram. Register files: GP reg file (32 bit x 16), Alt reg file (32 bit x 16), Control reg file (32 bit x 13). Datapath: X port, Y port, scale, immed, sign ext, barrel shift and FF1, ALU, priority encode, detect, PC increment, branch adder. Control: instr pipeline, instr decoder. Buses: address bus, data bus, writeback bus, H/W acc bus.]

M CORE Power Distribution

[Figure: two pie charts. One splits power among Datapath, Clock, and Control (slices of 28%, 36%, and 36%). The other splits power among Reg File, Addr/Data Bus, Inst Reg, Barrel Shifter, X MUX, Y MUX, Addr Gen, and Other (slices of 5%, 9%, 6%, 7%, 8%, 9%, 14%, and 42%).]

Key References

Hossain, Low power design using double edge triggered flipflop, IEEE Trans. on VLSI Systems, 2(2):261-265, 1994.
Motorola, M CORE Architecture microRISC Engine, MCORE 1/D, www.mot.com/sps/mcore/info_documentation.htm
Mutsunori, Low power design method using multiple supply voltages, ISLPED, 1997.
Rabaey, Digital Integrated Circuits, Prentice-Hall, 1996.
Reyes, Low Power FF Circuit and Method Thereof, Patent No 5,498,988, 1996.
Roy, Power analysis and design at the system level, Low Power Design in Deep Submicron Electronics, Nebel and Mermet, Ed., Kluwer, 1997.
Sakuta, Delay balanced multipliers for low power, SLPE, 1995.
Scott, Designing the Low-Power M CORE Architecture, Proc. Inter. Symp. Computer Architecture Power Driven Microarchitecture Workshop, June 1998.
Stojanovic, A unified approach in the analysis of latches and FFs for low power systems, ISLPED, 1998.
Tiwari, Reducing power in high-performance microprocessors, DAC, 1998.
Yeap, CPU controller optimization for HDL logic synthesis, CICC, 1997.
Yeap, Practical Low Power Digital VLSI Design, KAP, 1998.