Slack Redistribution for Graceful Degradation Under Voltage Overscaling

Andrew B. Kahng, Seokhyeong Kang, Rakesh Kumar and John Sartori
VLSI CAD Laboratory, UCSD; PASSAT Group, UIUC
ASPDAC, Jan. 21, 2010

Outline
Background and motivation: voltage scaling and BTWC designs; limitations of the traditional CAD flow
Power-aware slack redistribution: our design optimization goal; related work (BlueShift); our heuristic
Experimental framework and results: design methodology; testbed; results and analysis
Conclusions and ongoing work
(2/25)

Reducing Power with Voltage Scaling
Power is a first-order design constraint. Voltage scaling can significantly reduce power, but it may cause timing violations: below a certain voltage, timing errors begin to occur. Voltage scaling is therefore limited by timing errors.
[Figure: power versus voltage, marking the point at which timing errors begin to occur as the voltage is lowered.]
(3/25)
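To make "voltage scaling can significantly reduce power" concrete, the sketch below uses the standard first-order CMOS switching-power relation P = alpha * C * V^2 * f. This textbook model is not taken from the slides, and the parameter values are hypothetical; leakage is ignored.

```python
def dynamic_power(activity: float, cap_farad: float, vdd: float, freq_hz: float) -> float:
    """First-order CMOS switching power P = alpha * C * V^2 * f, in watts."""
    return activity * cap_farad * vdd ** 2 * freq_hz


def relative_power(v_scaled: float, v_nominal: float) -> float:
    """Switching power at v_scaled relative to v_nominal, with activity, C and f held fixed."""
    return (v_scaled / v_nominal) ** 2


if __name__ == "__main__":
    # Hypothetical operating point: 10% activity, 1 nF switched capacitance, 1 GHz.
    p_nom = dynamic_power(0.1, 1e-9, 1.0, 1e9)
    p_low = dynamic_power(0.1, 1e-9, 0.9, 1e9)  # 10% lower supply voltage
    print(f"nominal {p_nom:.3f} W, scaled {p_low:.3f} W, ratio {relative_power(0.9, 1.0):.2f}")
```

Under this model a 10% supply reduction cuts switching power by about 19% at the same frequency, which is why designers want to keep scaling until timing errors intervene.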

Better-Than-Worst-Case Design
The better-than-worst-case (BTWC) design approach optimizes for normal operating conditions, trades off reliability against power/performance, and relies on an error detection/correction mechanism (e.g., Razor*).
Traditional IC design: does not allow timing errors in STA; uses a fixed target frequency and operating voltage.
BTWC design: an error-correction architecture allows timing errors ("CPU, heal thyself"), enabling overclocking or voltage overscaling.
BTWC design allows tradeoffs between reliability and power.
* Ernst et al., "Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation," Proc. MICRO 2003.
(4/25)

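Voltage Scaling with Error Correction
Error correction incurs a power overhead, so lowering the voltage reduces total power only up to a point: minimum power is reached at point b.
[Figure: pwr(v), the power consumption at voltage v, and P_E(v), the error rate at voltage v, plotted against voltage with operating points a and b.]
Overscaling is possible for better-than-worst-case designs.
(5/25)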
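The trade-off on this slide can be reproduced numerically. The sketch below assumes made-up models, none of which come from the paper: power proportional to v^2, an error rate that grows quadratically below a hypothetical onset voltage, and a fixed recovery cost per error. It shows how the minimum-power point can lie below the voltage at which errors first appear.

```python
def base_power(v: float) -> float:
    """Hypothetical normalized pwr(v) without error-correction overhead (~ v^2)."""
    return v ** 2


def error_rate(v: float, v_onset: float = 0.85, steepness: float = 2.0) -> float:
    """Hypothetical P_E(v): zero at or above v_onset, growing quadratically below it."""
    if v >= v_onset:
        return 0.0
    return min(1.0, steepness * (v_onset - v) ** 2)


def total_power(v: float, recovery_cost: float = 10.0) -> float:
    """pwr(v) inflated by an assumed correction overhead of recovery_cost * P_E(v)."""
    return base_power(v) * (1.0 + recovery_cost * error_rate(v))


def min_power_voltage(voltages):
    """Return the voltage in the sweep with the lowest total power."""
    return min(voltages, key=total_power)


if __name__ == "__main__":
    sweep = [round(0.60 + 0.01 * i, 2) for i in range(41)]  # 0.60 V .. 1.00 V
    v_best = min_power_voltage(sweep)
    print(v_best, round(total_power(v_best), 4), error_rate(v_best))
```

With these assumed parameters the minimum lands at 0.78 V, below the 0.85 V error onset: scaling past the first error pays off until the correction overhead outgrows the v^2 savings.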

Limitations of Traditional CAD Flow
Conventional designs exhibit a critical operating point (COP): many paths have near-critical slack, forming a "wall of (critical) slack". Scaling beyond the COP causes massive errors that cannot be corrected, so conventional designs fail critically when the voltage is scaled down. Instead, the error rate should increase gracefully, which calls for a gradual-slope slack distribution.
[Figure: number of paths versus timing slack, contrasting a wall of slack at zero slack with a gradual-slope distribution; the COP is marked along the direction of lower voltage or higher frequency.]
(6/25)
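The contrast between a wall of slack and a gradual slope can be illustrated with two hypothetical slack distributions. As a simplifying assumption, voltage scaling is modeled as a uniform slack loss on every path (real delays scale roughly proportionally), and the metric is the fraction of paths that go negative rather than an activity-weighted error rate.

```python
from typing import List

# Hypothetical slack distributions (ps), 100 paths each.
WALL: List[int] = [5] * 90 + [100] * 10       # conventional: most paths near-critical
GRADUAL: List[int] = list(range(5, 505, 5))   # redistributed: 5, 10, ..., 500 ps


def failing_fraction(slacks: List[int], slack_loss_ps: int) -> float:
    """Fraction of paths whose slack becomes negative after losing slack_loss_ps."""
    return sum(1 for s in slacks if s - slack_loss_ps < 0) / len(slacks)


if __name__ == "__main__":
    for loss in (4, 10, 50, 120):
        print(f"loss {loss:3d} ps: wall {failing_fraction(WALL, loss):.2f}, "
              f"gradual {failing_fraction(GRADUAL, loss):.2f}")
```

A 10 ps loss already breaks 90% of the wall-shaped paths but only 1% of the gradual ones, which is the abrupt failure versus graceful degradation the slide describes.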

Our Design Optimization Goal
Problem: minimize power for a given error rate.
Goal: achieve a gradual-slope slack distribution.
Approach: upsize cells on frequently exercised paths; downsize cells on rarely exercised paths.
[Figure: number of paths versus timing slack. The wall of slack near zero is reshaped into a gradual-slope distribution, giving a gradual failure characteristic when voltage scaling pushes paths below zero slack.]
(8/25)

Related Work: BlueShift
BlueShift* maximizes frequency for a given error rate. It runs gate-level simulation and computes the error rate; if ER < target, it finishes. Otherwise it speeds up the paths with the highest frequency of timing errors, using FBB (forward body biasing) and timing override, and repeats.
Limitations: repetitive gate-level simulation is impractical, and FBB incurs design overhead. BlueShift is therefore impractical for modern SoC designs.
* Greskamp et al., "BlueShift: Designing Processors for Timing Speculation from the Ground Up," HPCA 2009.
(9/25)

Our Heuristic
We optimize the slack distribution by cell swaps, exploiting switching-activity information. The target voltage is scaled iteratively until the error rate exceeds a target, and negative-slack paths are optimized at each step.
[Flowchart: set initial voltage → optimize paths → error-rate estimation → ER < ER_target? If YES, scale the voltage down and return to path optimization; if NO, perform power reduction and finish.]
The heuristic thus has three parts: voltage scaling, path optimization, and power reduction.
(10/25)
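The flowchart's outer loop can be sketched with the three steps passed in as callbacks; `slack_redistribution` and the callback names are hypothetical, not the authors' code. One assumption is added that the slide does not state: when the error rate first reaches the target, power reduction runs at the last voltage that still met it. Voltages are integers in millivolts to avoid floating-point drift.

```python
from typing import Callable, List, Optional


def slack_redistribution(
    v_init_mv: int,
    v_step_mv: int,
    er_target: float,
    optimize_paths: Callable[[int], None],
    estimate_error_rate: Callable[[int], float],
    reduce_power: Callable[[int], None],
    v_floor_mv: int = 0,
) -> int:
    """Outer loop: optimize paths, estimate ER, scale voltage down, then reduce power.

    Assumption (not on the slide): once ER reaches the target, power reduction
    runs at the last voltage whose ER was still below the target.
    """
    v = v_init_mv
    last_ok: Optional[int] = None
    while v >= v_floor_mv:
        optimize_paths(v)                    # upsize cells on frequently exercised paths
        if estimate_error_rate(v) >= er_target:
            break                            # "NO" branch: stop scaling
        last_ok = v                          # "YES" branch: this voltage meets the target
        v -= v_step_mv                       # scale the voltage down and iterate
    if last_ok is None:
        raise ValueError("error-rate target not met at the initial voltage")
    reduce_power(last_ok)                    # downsize cells on rarely exercised paths
    return last_ok


if __name__ == "__main__":
    def toy_error_rate(v_mv: int) -> float:
        # Hypothetical: no errors at or above 850 mV, then linear growth.
        return 0.0 if v_mv >= 850 else (850 - v_mv) / 1000

    calls: List[int] = []
    v_min = slack_redistribution(1000, 25, 0.05, lambda v: None, toy_error_rate, calls.append)
    print(v_min, calls)
```

With the toy error model, 800 mV reaches the 5% target, so the loop stops and power reduction runs at 825 mV.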

Heuristic Implementation: Voltage Scaling
[Flowchart as on slide 10, with the voltage-scaling step highlighted.]
Optimizing directly against a fixed target voltage can cause unnecessary cell sizing. For example, path A has negative slack at the fixed target voltage, but the actual voltage at the target error rate may be higher, so sizing cells for the fixed target is wasted effort. Instead, we lower the voltage incrementally and optimize at each step, loading a pre-characterized library at each voltage point.
[Figure: slack of paths A, B and C at the nominal voltage, the fixed target voltage, and the actual voltage at the target error rate.]
With iterative voltage scaling, we can find the minimum operating voltage.
(11/25)

Heuristic Implementation: Optimize Paths
[Flowchart as on slide 10, with the path-optimization step highlighted.]
Main idea: increase the slack of frequently exercised paths, in order of decreasing switching activity.
Procedure:
1. Pick the critical path p with maximum switching activity.
2. Resize cell instance c_i in p.
3. If the slack of path p does not improve, restore the original cell.
4. Repeat steps 2-3 for all cell instances in path p.
5. Repeat steps 1-4 for all remaining critical paths.
The OptimizePaths procedure reduces error rates and enables further voltage scaling.
(12/25)
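The OptimizePaths procedure can be sketched against a hypothetical toy timer in which each cell's delay depends only on its own size index; a real flow would query an incremental timer such as PrimeTime instead. Because the toy ignores load effects, one cell (`a`) is given a largest size that is slower, standing in for the self-loading that makes the revert step in the procedure necessary. All names here are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Path:
    name: str
    activity: float          # switching activity, e.g. derived from SAIF toggle rates
    cells: List[str]


@dataclass
class ToyTimer:
    """Hypothetical timer: a cell's delay depends only on its own size index."""
    clock_ps: int
    delay_ps: Dict[str, List[int]]              # cell -> delay for each size index
    size: Dict[str, int] = field(default_factory=dict)

    def slack(self, path: Path) -> int:
        return self.clock_ps - sum(self.delay_ps[c][self.size.get(c, 0)] for c in path.cells)


def optimize_paths(timer: ToyTimer, paths: List[Path]) -> None:
    """Upsize cells on negative-slack paths, visiting paths by decreasing activity."""
    critical = sorted((p for p in paths if timer.slack(p) < 0),
                      key=lambda p: p.activity, reverse=True)
    for p in critical:
        for c in p.cells:
            # Keep upsizing while the path slack improves; revert the first non-improving swap.
            while timer.size.get(c, 0) + 1 < len(timer.delay_ps[c]):
                cur = timer.size.get(c, 0)
                before = timer.slack(p)
                timer.size[c] = cur + 1
                if timer.slack(p) <= before:
                    timer.size[c] = cur
                    break


if __name__ == "__main__":
    t = ToyTimer(clock_ps=20, delay_ps={"a": [10, 7, 8], "b": [12, 9], "c": [5, 5]})
    p1 = Path("P1", 0.3, ["a", "b"])
    optimize_paths(t, [p1, Path("P2", 0.1, ["c"])])
    print(t.size, t.slack(p1))
```

Path P1 starts at -2 ps slack. Cell `a` is upsized once (its second upsize is reverted because it hurts slack) and cell `b` once, leaving P1 at +4 ps, while the non-critical path P2 is untouched.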

Heuristic Implementation: Power Reduction
[Flowchart as on slide 10, with the power-reduction step highlighted.]
Main idea: downsize cells on rarely exercised paths, in order of increasing toggle rate.
Procedure:
1. Pick the cell c with minimum toggle rate.
2. Downsize cell c to a logically equivalent cell.
3. Run incremental timing analysis and check the error rate.
4. If the error rate increases, restore the original cell.
5. Repeat steps 1-4.
The PowerReduction procedure reduces power without affecting the error rate.
(13/25)
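A matching sketch of the PowerReduction procedure follows. As a simplification, it replaces the slide's error-rate estimator with a hypothetical proxy: the summed activity of negative-slack paths. It also uses a toy delay table in place of a real timer and tries one downsize step per cell. The function names are illustrative only.

```python
from typing import Dict, List, Tuple

Paths = List[Tuple[float, List[str]]]  # (switching activity, cells on the path)


def error_rate_proxy(clock_ps: int, delay_ps: Dict[str, List[int]],
                     size: Dict[str, int], paths: Paths) -> float:
    """Hypothetical ER proxy: summed activity of paths whose slack is negative."""
    return sum(act for act, cells in paths
               if clock_ps - sum(delay_ps[c][size[c]] for c in cells) < 0)


def reduce_power(clock_ps: int, delay_ps: Dict[str, List[int]], size: Dict[str, int],
                 toggle: Dict[str, float], paths: Paths) -> None:
    """Downsize each cell one step, lowest toggle rate first; revert if ER rises."""
    for c in sorted(toggle, key=toggle.get):
        if size[c] == 0:
            continue                                  # already the smallest variant
        before = error_rate_proxy(clock_ps, delay_ps, size, paths)
        size[c] -= 1                                  # swap to a smaller equivalent cell
        if error_rate_proxy(clock_ps, delay_ps, size, paths) > before:
            size[c] += 1                              # error rate increased: restore


if __name__ == "__main__":
    delay = {"x": [9, 6], "y": [9, 6], "z": [12, 10]}  # index 0 = smallest, slowest
    sizes = {"x": 1, "y": 1, "z": 1}
    toggles = {"x": 0.05, "y": 0.4, "z": 0.01}
    paths: Paths = [(0.4, ["y", "z"]), (0.05, ["x", "z"])]
    reduce_power(20, delay, sizes, toggles, paths)
    print(sizes)
```

Here only the rarely toggling cell `z` stays downsized. Downsizing `x` or `y` afterwards would push a path to negative slack, so those swaps are reverted.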

Heuristic Implementation: Error Rate Estimation
[Flowchart as on slide 10, with the error-rate estimation step highlighted.]
Error-rate estimation uses toggle rates from SAIF (Switching Activity Interchange Format). The error-rate contribution of one flip-flop ff is its toggle rate weighted by the share of toggling on its negative-slack paths, and the error rate of the entire design is the scaled sum of these contributions:
$$ER_{ff} = TG_{ff} \cdot \frac{\sum_{P \in P_{NEG}} TG(P)}{\sum_{P \in P_{ALL}} TG(P)}, \qquad ER = \alpha \sum_{ff} ER_{ff}$$
where P_ALL is the set of timing paths ending at ff, P_NEG is the subset of P_ALL with negative slack, TG denotes toggle rate, and α is a compensation parameter.
Example: flip-flop X has incoming paths P1, P2 and P3 with TG(P1) = 0.3, TG(P2) = 0.2, TG(P3) = 0.1, and TG(X) = 0.6. Slack(P1) and Slack(P3) are positive and Slack(P2) is negative, so
$$ER(X) = TG(X) \cdot \frac{TG(P_2)}{TG(P_1) + TG(P_2) + TG(P_3)} = 0.6 \cdot \frac{0.2}{0.6} = 0.2$$
We estimate error rates without functional simulation.
(14/25)
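The two estimation formulas translate directly into code. The sketch below reproduces the slide's flip-flop X example. The slack magnitudes are hypothetical, since only their signs matter, and the α value is a placeholder because the slides do not give one.

```python
from typing import Dict, Iterable


def flop_error_rate(tg_ff: float, path_tg: Dict[str, float], path_slack: Dict[str, float]) -> float:
    """ER_ff = TG_ff * sum(TG of negative-slack paths) / sum(TG of all paths into ff)."""
    total = sum(path_tg.values())
    if total == 0:
        return 0.0
    negative = sum(tg for p, tg in path_tg.items() if path_slack[p] < 0)
    return tg_ff * negative / total


def design_error_rate(flop_rates: Iterable[float], alpha: float) -> float:
    """ER = alpha * sum of per-flip-flop contributions."""
    return alpha * sum(flop_rates)


if __name__ == "__main__":
    tg = {"P1": 0.3, "P2": 0.2, "P3": 0.1}
    slack = {"P1": 15.0, "P2": -8.0, "P3": 30.0}  # hypothetical ps values; only the sign matters
    er_x = flop_error_rate(0.6, tg, slack)
    print(round(er_x, 6))  # 0.2, matching the slide
    print(design_error_rate([er_x], alpha=1.0))  # alpha = 1.0 is a placeholder
```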

Power Reduction Through Slack Redistribution
Under BTWC operation, the minimum power P_min is obtained at the minimum operating voltage V_min.
1. OptimizePaths minimizes the error rate, enabling the voltage to be scaled further.
2. ReducePower downsizes cells, obtaining additional power reduction.
[Figure: power consumption and error rate versus voltage, showing P_min, V_min, the maximum error rate, and operating points 1 and 2 after OptimizePaths and ReducePower.]
(15/25)

Design Methodology
Benchmark generation: Virtutech Simics full-system simulation, capturing test vectors.
Functional simulation: Cadence NC-Verilog gate-level simulation.
Library characterization: Cadence SignalStorm, generating a Synopsys Liberty library for each voltage.
P&R and ECO: Cadence SOC Encounter, including ECO implementation.
Heuristic (slack optimization): implemented in C++, using Synopsys PrimeTime through a Tcl socket interface.
(17/25)

Testbed
Target design: sub-modules of OpenSPARC T1.
Benchmarks: ammp, bzip2, equake, sort and twolf; test vectors of 1 billion cycles are generated for each sub-module.
Implementation: TSMC 65GP technology with a standard SP&R (synthesis, placement and routing) flow.
(18/25)

List of Experiments
Design techniques:
1. SP&R at 0.8 GHz (loose constraints)
2. SP&R at 1.2 GHz (tight constraints)
3. BlueShift: timing override
4. Slack optimizer
The experiments compare all design techniques with respect to:
1. power consumption at each voltage point;
2. actual error rates from gate-level simulation;
3. power consumption at each target error rate;
4. estimated processor-wide power consumption.
(19/25)

Error Rate and Power Results
[Figures: error rate at each operating voltage (test case: lsu_dctl); power consumption at each operating voltage.]
(20/25)

Comparison of Power and Slack Results
[Figures: power consumption at each target error rate; slack distribution.]
(21/25)

Power Reduction and Area Overhead
[Figures: power reduction after optimization at a 2% error rate; area overhead of each design approach.]
Power reduction: maximum 32.8%, average 12.5%.
(22/25)

Processor-wide Results*
[Figure: processor-wide results.]
Slack optimization extends the range of voltage scaling and reduces Razor recovery cost.
* Kahng et al., "Designing a Processor From the Ground Up to Allow Voltage/Reliability Tradeoffs," HPCA 2010.
(23/25)

Conclusions and Ongoing Work
Showed the limitations of conventional designs in a BTWC setting.
Presented a design technique, slack redistribution, which optimizes frequently exercised critical paths and de-optimizes rarely exercised paths.
Demonstrated significant power benefits of a gradual-slack design: power reduced by up to 33% (12.5% on average).
Ongoing work: reliability-power tradeoffs for embedded memory; application to heterogeneous multi-core architectures.
(24/25)

THANK YOU

BACKUP

CPU, Heal Thyself
Razor* system: timing errors can be corrected, which lets the system manage the trade-off between supply voltage and error rate. A new design methodology is needed.
* Ernst et al., "Razor: A Low Power Pipeline Based on Circuit-Level Timing Speculation," International Symposium on Microarchitecture, December 2003.
(27/25)

Razor: How It Works
[Figure: Razor implementation, from "Razor: A Low Power Pipeline Based on Circuit-Level Timing Speculation," International Symposium on Microarchitecture, December 2003.]
The main flip-flop latches at T, but the shadow latch latches at T + skew. If a timing violation occurs, the main flip-flop latches an incorrect value, while the shadow latch should latch the correct value. The comparator signals an error, and the late-arriving value is fed back into the main flip-flop.
(28/25)
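The capture-and-compare behavior described above can be modeled behaviorally in a few lines. This is a hypothetical timing-only sketch: pipeline-level recovery such as stalling or flushing is omitted, and times are in arbitrary units.

```python
from dataclasses import dataclass


@dataclass
class RazorCapture:
    main: int      # value held by the main flip-flop after any correction
    shadow: int    # value captured by the shadow latch at T + skew
    error: bool    # comparator output


def razor_capture(old: int, new: int, arrival: float, t_clk: float, skew: float) -> RazorCapture:
    """Behavioral model of one Razor capture (timing only; no pipeline recovery)."""
    main = new if arrival <= t_clk else old            # main flip-flop samples at T
    shadow = new if arrival <= t_clk + skew else old   # shadow latch samples at T + skew
    error = main != shadow                             # comparator flags a mismatch
    if error:
        main = shadow                                  # late value fed back into main FF
    return RazorCapture(main, shadow, error)


if __name__ == "__main__":
    print(razor_capture(0, 1, arrival=0.9, t_clk=1.0, skew=0.3))  # on time
    print(razor_capture(0, 1, arrival=1.2, t_clk=1.0, skew=0.3))  # late, caught
    print(razor_capture(0, 1, arrival=1.5, t_clk=1.0, skew=0.3))  # too late, missed
```

The third case shows why the shadow latch's window matters. If data arrives even after T + skew, both elements hold the stale value and the comparator stays silent, so overscaling must keep the longest path delay within that window.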

BTWC: Voltage Scaling and Overclocking
[Figure, two cases. Voltage-scaling case: pwr(v), the power consumption at voltage v, and P_E(v), the error rate at v, with operating points a, b and c and the minimum-power point marked. Overclocking case: perf(f), the performance at frequency f, and P_E(f), the error rate at f, with the maximum-performance point marked.]
Error correction needs additional clock cycles and incurs power overhead.
(29/25)
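For the overclocking case, "error correction needs additional clock cycles" implies a simple throughput model. If each error costs a fixed number of recovery cycles, performance is perf(f) = f / (1 + P_E(f) * penalty). The error-rate curve, onset frequency, and 10-cycle penalty below are hypothetical.

```python
def error_rate(f_mhz: int, f_onset_mhz: int = 1000, steepness: float = 0.5) -> float:
    """Hypothetical P_E(f): zero up to f_onset, then growing quadratically."""
    if f_mhz <= f_onset_mhz:
        return 0.0
    x = (f_mhz - f_onset_mhz) / f_onset_mhz
    return min(1.0, steepness * x * x)


def performance(f_mhz: int, penalty_cycles: int = 10) -> float:
    """perf(f) in operations per microsecond, assuming one operation per error-free
    cycle and penalty_cycles extra cycles per error."""
    return f_mhz / (1.0 + error_rate(f_mhz) * penalty_cycles)


if __name__ == "__main__":
    sweep = range(900, 1305, 5)  # 900 .. 1300 MHz
    best = max(sweep, key=performance)
    print(best, round(performance(best), 1), round(performance(1000), 1))
```

Under these assumptions, performance peaks at 1095 MHz, somewhat beyond the error-free limit, before recovery cycles erase the gain from the faster clock.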

Limitation of Voltage Scaling
At some voltage, the circuit breaks down.
[Figure: errors per cycle (0.0 to 0.5) versus voltage (1.0 down to 0.5).]
Voltage scaling must halt after only 10% scaling.
(30/25)

Reason for Steep Error Degradation
Critical paths are bunched up in traditional designs.
(31/25)

Slack Re-distribution Example
[Figure: slack distributions from negative to positive slack, annotated with error rates of 1% and 25%.]
(32/25)

Heuristic Implementation: Error Rate Estimation
Error-rate contribution of one flip-flop:
$$ER_{ff} = TG_{ff} \cdot \frac{\sum_{P \in P_{NEG}} TG(P)}{\sum_{P \in P_{ALL}} TG(P)} \qquad (1)$$
Error rate of an entire design:
$$ER = \alpha \sum_{ff} ER_{ff} \qquad (2)$$
where α is a compensation parameter.
[Figure: actual vs. estimated error rates.]
(33/25)

Gradual Slack Distribution
Slack optimization achieves a gradual slack distribution.
(34/25)

Processor Error Rate and Power
Alternative designs with comparable error rates have much higher power/area overheads.
(35/25)

Reliability/Power Tradeoff
The slack-optimized design enjoys continued power reduction as the error rate increases.
(36/25)

Enhancing Razor-based Design
Slack optimization extends the range of voltage scaling and reduces Razor recovery cost.
(37/25)

Moore's Law
Power consumption of a processor node doubles every 18 months.
(38/25)

Power Scaling
With current design techniques, processor power will soon be on par with that of a nuclear power plant.
(39/25)