On the Rules of Low-Power Design

On the Rules of Low-Power Design (and How to Break Them)
Prof. Todd Austin
Advanced Computer Architecture Lab
University of Michigan
austin@umich.edu

Once upon a time

Rules of Low-Power Design
$P = aCV^2 f + V I_{leak}$
1. Minimize switching activity
2. Design for lower load capacitance
3. Reduce frequency
4. Reduce leakage
and the most important of all:
5. Decrease supply voltage!
[Figure: supply-voltage scale (1.2 V marked) showing V_th, the critical voltage (determined by the critical path), and the noise, ambient, and process margins.]

Overclockers Break the Rules
[Figure: the same supply-voltage scale with V_th and the noise, ambient, and process margins.]
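The power equation on the rules slide can be sketched numerically to show why rule 5 gets the emphasis. This is a minimal sketch, assuming `a` is the switching activity, `c` the load capacitance, and `i_leak` a leakage current that stays fixed as the voltage changes (a simplification). The function name and every numeric value are hypothetical and are not taken from the slides.

```python
def power_terms(a, c, v, f, i_leak):
    """Return (dynamic, leakage) power in watts for P = a*C*V^2*f + V*I_leak."""
    dynamic = a * c * v ** 2 * f   # switching power, quadratic in V
    leakage = v * i_leak           # leakage power, linear in V for a fixed I_leak
    return dynamic, leakage


# Hypothetical operating point; none of these values come from the slides.
params = dict(a=0.1, c=1e-9, f=100e6, i_leak=1e-3)

dyn_hi, leak_hi = power_terms(v=1.2, **params)
dyn_lo, leak_lo = power_terms(v=0.6, **params)

print(f"1.2 V: dynamic {dyn_hi * 1e3:.2f} mW, leakage {leak_hi * 1e3:.2f} mW")
print(f"0.6 V: dynamic {dyn_lo * 1e3:.2f} mW, leakage {leak_lo * 1e3:.2f} mW")
print(f"dynamic ratio {dyn_hi / dyn_lo:.1f}x, leakage ratio {leak_hi / leak_lo:.1f}x")
```

Under these assumptions, halving the supply voltage cuts the dynamic term by four and the leakage term by two, whereas halving the frequency or the activity only halves the dynamic term.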

Goals of This Presentation
Review some of the rules of low-power design
Show how clever designs can break these rules
  Razor resilient circuits
  Subliminal subthreshold voltage processor
Highlight the benefits of taking a rule-breaking approach to technical research

Investigating Overclocking

Two Slow Pipelines Check a Fast Pipeline
[Figure: test setup. Slow Pipeline A and Slow Pipeline B each run an 18x18 multiplier at 45 MHz (clk/2) on LFSR-generated operands. The Fast Pipeline runs the same multiplier at 90 MHz (clk). The 36-bit results are stabilized and compared (!=), and mismatches increment a 40-bit error counter.]

Observation: Voltage Margins Are Plentiful
18x18-bit multiplier block at 90 MHz and 27 C
[Figure: error rate (log scale) versus supply voltage (V). Annotations: environmental margin @ 1.69 V; zero margin @ 1.54 V; 20% energy savings; 35% energy savings with 1.3% errors; one error every 20 seconds.]
Margin grows if a few (~1%) errors can be tolerated

Razor Resilient Circuits
[Figure: Razor flip-flop. A main FF clocked by clk is paired with a shadow latch clocked by the delayed clock clk_del.]
Double-sampling, metastability-tolerant latches detect timing errors
  Second sample is correct by design
Microarchitectural support restores program state
  Timing errors are treated like branch mispredictions

Distributed Pipeline Recovery
[Figure: pipeline timeline by cycle. The PC, IF, ID, EX, and MEM (read-only) stages are separated by Razor FFs, with a stabilizer FF before WB (reg/mem). Signals are labeled error, bubble, recover, and flushID.]
Builds on existing branch prediction framework
Multiple-cycle penalty for timing failure
Scalable design, as all communication is local
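The double-sampling idea can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the circuit from the talk: logic delay is a single number, the main flip-flop samples at `clk_period`, the shadow latch samples `clk_del` later, and metastability is ignored. The names `razor_sample` and `RazorSample` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RazorSample:
    main: int       # value captured by the main flip-flop
    shadow: int     # value captured by the shadow latch on the delayed clock
    error: bool     # True when the two samples disagree
    restored: int   # value used after recovery (always the shadow sample)


def razor_sample(old_value, new_value, logic_delay, clk_period, clk_del):
    """Model one clock edge of a double-sampling flip-flop."""
    if logic_delay > clk_period + clk_del:
        # The scheme relies on the second sample being correct by design.
        raise ValueError("logic is too slow even for the shadow latch")
    # The main FF sees the new value only if the logic settled in time.
    main = new_value if logic_delay <= clk_period else old_value
    shadow = new_value
    return RazorSample(main, shadow, main != shadow, shadow)


on_time = razor_sample(3, 7, logic_delay=0.9, clk_period=1.0, clk_del=0.5)
late = razor_sample(3, 7, logic_delay=1.2, clk_period=1.0, clk_del=0.5)
print(on_time)
print(late)
```

In the late case the main flip-flop holds the stale value, the comparison raises the error flag, and the shadow value is what the recovery mechanism would restore.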

Razor Prototype Design
Six-stage 64-bit Alpha pipeline
200 MHz in 0.18 µm @ 1.8 V
  Tunable via software from 200-50 MHz, 1.8-1.1 V
32-entry, 3-port RF, 8K I-Cache/8K D-Cache
Branch-not-taken branch predictor
Full scan capability
Razor overhead:
  192 Razor FF out of 240 (9%)
Error-free power overhead:
  Razor flip-flops: < 1%
  Short path buffer: 2.1%
Recovery power overhead:
  1x an inst, for pipeline recovery
[Figure: die photo, 3 mm by 3.3 mm, with labeled blocks I-Cache, Register File, WB, IF, ID, EX, D-Cache, MEM.]

Razor Prototype Testbed

Razor-Based Dynamic Voltage Scaling
$E_{diff} = E_{ref} - E_{sample}$
[Figure: control loop. Pipeline error signals are summed (Σ) into E_sample, which is subtracted from E_ref to form E_diff. E_diff drives the voltage control function and the voltage regulator that sets V_dd for the pipeline. A reset input is also shown.]
Current design uses a very simple proportional control function implemented in software

Example Voltage Controller Response
[Figure: percentage error rate and controller output voltage (V) versus time (seconds).]
Two-minute snapshot of a 15-minute run
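A proportional controller built around E_diff = E_ref - E_sample might look like the sketch below. The slide gives only the difference equation and says the control function is proportional. The sign convention (lower the voltage when errors are below the target, raise it when they are above), the gain, the target error rate, and the function name `next_voltage` are all assumptions made for illustration. The 1.1 V and 1.8 V limits are borrowed from the prototype's tunable range.

```python
def next_voltage(v_dd, e_sample, e_ref, gain, v_min=1.1, v_max=1.8):
    """One step of a proportional supply-voltage controller.

    e_sample: measured error rate over the last interval (fraction)
    e_ref:    target error rate (fraction)
    gain:     volts of adjustment per unit of error-rate difference (assumed)
    """
    e_diff = e_ref - e_sample
    # Assumed sign convention: positive e_diff means too few errors,
    # so there is slack to lower the voltage; negative means raise it.
    v_next = v_dd - gain * e_diff
    return min(v_max, max(v_min, v_next))


# Hypothetical numbers: 1% target error rate, gain of 1 V per unit difference.
v_quiet = next_voltage(1.50, e_sample=0.00, e_ref=0.01, gain=1.0)
v_noisy = next_voltage(1.50, e_sample=0.05, e_ref=0.01, gain=1.0)
v_clamped = next_voltage(1.79, e_sample=0.50, e_ref=0.01, gain=1.0)
print(v_quiet, v_noisy, v_clamped)
```

Calling the function once per sampling interval with the accumulated error count gives the kind of closed loop the slide describes. A real controller would also have to account for regulator response time, which this sketch ignores.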

Effects of Razor DVS
Total energy: $E_{total} = E_{proc} + E_{recovery}$
[Figure: energy and IPC versus decreasing supply voltage. Curves: total energy E_total with its optimal point, energy of processor operations E_proc, energy of processor without Razor support, energy of pipeline recovery E_recovery, and pipeline throughput (annotations: 1%, 50%).]

Razor Also Improves Yield
[Figure: scatter plot of chips, voltage at 0.1% error rate versus voltage at first failure, with linear fit y = 0.765x + 0.22117.]
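The trade-off in E_total = E_proc + E_recovery can be made concrete with a toy sweep. Everything except that sum is an assumption: E_proc is taken as proportional to V^2, the error rate is modeled as an exponential that is negligible above the 1.54 V zero-margin point quoted earlier, and each error is charged a fixed recovery cost in units of instruction energy. The curve shapes and constants are hypothetical and are not fitted to the measured data.

```python
import math


def error_rate(v, v_zero_margin=1.54, steepness=40.0, floor=1e-7):
    """Toy error-rate model: tiny at the zero-margin voltage, exponential below it."""
    return min(1.0, floor * math.exp(steepness * (v_zero_margin - v)))


def total_energy(v, recovery_cost=10.0):
    """Normalized E_total = E_proc + E_recovery per instruction at supply voltage v."""
    e_proc = v ** 2                                     # dynamic energy scales with V^2
    e_recovery = error_rate(v) * recovery_cost * e_proc  # energy spent replaying errors
    return e_proc + e_recovery


# Sweep 1.10 V to 1.80 V in 10 mV steps (integer millivolts avoid float drift).
voltages = [mv / 1000 for mv in range(1100, 1801, 10)]
v_opt = min(voltages, key=total_energy)

print(f"optimal voltage in this toy model: {v_opt:.2f} V")
print(f"energy at optimum relative to 1.80 V: {total_energy(v_opt) / total_energy(1.8):.2f}")
```

The minimum lands below the zero-margin voltage: a small, nonzero error rate is cheaper overall than running error-free, until recovery energy starts to dominate.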

How Razor Breaks the Rules
Traditional worst-case design techniques must observe margin rules for reliable operation
Incorporating timing-error correction mechanisms allows margins to be erased
Infrequent use of critical paths allows for even deeper cuts in V_dd
[Figure: supply-voltage scale (1.2 V marked) showing V_th and the noise, ambient, and process margins.]

Back to the Rules
$P = aCV^2 f + V I_{leak}$
1. Minimize switching activity
2. Design for lower load capacitance
3. Reduce frequency
4. Reduce leakage
and the most important of all:
5. Decrease supply voltage!
[Figure: supply-voltage scale (1.2 V marked) showing V_th, the critical voltage (determined by the critical path), and the noise, ambient, and process margins.]

Subthreshold Circuits Break The Rules
[Figure: an inverter (P and N devices, IN to OUT) shown superthreshold with a 1.2 V supply and subthreshold with a 0.2 V supply.]
Static logic still works below V_th
Differences in I_leak continue to (dis)charge outputs
But diminished I_on/I_off results in big delays
Approach works if the apps are not too demanding

Sensing Applications
Security
Biomedical
Environmental
Industrial

Sensor Processing Data Rates

Sensor Processor
Sensing
Communication
Computation
Storage
Power Supply

Sensing Performance Demands are Low
[Figure: xRT (number of times faster than real time), log scale from 1.00 to 10000.00, per platform. Bar labels: 2965.01, 3943.47, 036.77, 296.37.]

Platform     Voltage (V)   Speed (Hz)
ARM 720T     1.2           100M
ARM 7TDMI    1.2           133M
ARM 920T     1.2           250M
ARM 1020T    1.2           325M

Fast Growing Leakage Complicates Design
$E_{inst} = E_{cycle} \times CPI$
  E_inst: energy per instruction
  E_cycle: energy per cycle
  CPI: cycles per instruction
$E_{cycle} = N\left(\tfrac{1}{2}\alpha C_s V_{dd}^2 + V_{dd} I_{leak} t_{clk}\right)$
  α: activity factor, the average number of transistor switches per transistor per cycle
  C_s: total circuit capacitance
  V_dd: supply voltage
  I_leak: leakage current
  t_clk: clock period

Fast Growing Leakage Complicates Design
$E_{cycle} = N\left(\tfrac{1}{2}\alpha C_s V_{dd}^2 + V_{dd} I_{leak} t_{clk}\right)$

Impact of voltage reduction:

                 I_leak   t_clk    E_leak    E_dyn   E_cycle
Superthreshold   linear   linear   ~const.   quad.   quad.
Subthreshold     linear   exp.     ~exp.     quad.   ???

Tension: in subthreshold, E_leak grows roughly exponentially as the voltage drops while E_dyn keeps falling quadratically, so the net trend of E_cycle is not obvious.
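The tension in the subthreshold row can be explored with a toy evaluation of the E_cycle expression. This sketch follows the table's trends, taking I_leak as linear in V_dd and t_clk as exponential in the voltage reduction. The specific functional forms, the threshold voltage, the slope factor, and all constants are hypothetical, and the model is only meant for V_dd at or below V_th.

```python
import math


def energy_per_cycle(v_dd, n=1.0, alpha=0.1, c_s=10.0,
                     k_leak=0.01, t0=1.0, v_th=0.4, slope=0.04):
    """Toy E_cycle = N * (0.5*alpha*C_s*V_dd^2 + V_dd*I_leak*t_clk), arbitrary units."""
    i_leak = k_leak * v_dd                          # leakage current: linear in V_dd
    t_clk = t0 * math.exp((v_th - v_dd) / slope)    # clock period: exponential below V_th
    e_dyn = 0.5 * alpha * c_s * v_dd ** 2           # switching energy per cycle
    e_leak = v_dd * i_leak * t_clk                  # leakage energy over one clock period
    return n * (e_dyn + e_leak)


def energy_per_instruction(v_dd, cpi=1.0):
    """E_inst = E_cycle * CPI."""
    return energy_per_cycle(v_dd) * cpi


# Sweep 0.15 V to 0.40 V in 10 mV steps (integer millivolts avoid float drift).
voltages = [mv / 1000 for mv in range(150, 401, 10)]
v_min = min(voltages, key=energy_per_cycle)

print(f"energy-minimizing voltage in this toy model: {v_min:.2f} V")
print(f"E_cycle at V_min: {energy_per_cycle(v_min):.4f}")
print(f"E_cycle at V_th:  {energy_per_cycle(0.4):.4f}")
```

With these assumptions the energy per cycle has an interior minimum: lowering the voltage helps until the exponentially longer clock period lets leakage energy overtake the quadratic savings in switching energy.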

Lessons from Architectural Studies
To minimize energy at subthreshold voltages, architects must:
  Minimize area, to reduce leakage energy per cycle
  Maximize transistor utility, to reduce V_min and energy per cycle
  Minimize CPI, to reduce energy per instruction
Winning designs tend to be compromising designs that balance area, transistor utility, and CPI
Memory comprises the single largest factor of leakage energy; therefore, efficient designs must reduce memory storage requirements

Subliminal Architectural Overview
[Figure: block diagram with IF/ID, EX/MEM, and WB stages. Labeled blocks: Imem, prefetch buffer, register file, 32-bit timer, OpA, OpB, ALU, carry, zero, register write, flag, μoperation decoder, external interrupts, scheduler, page, Dmem, fetch, jump.]

First Subliminal Chip
[Figure: annotated die photo. Labels: large solar cell, solar cell for processor, custom memories, solar cell for discrete cells, discrete cells, mux-based memories, test memory, level converter array, test module, solar cell for adders, discrete adders, Subliminal processors.]

Pareto Analysis of Sensor Network Processors
[Figure: energy per instruction (pJ) versus MIPS (log scale, 0.01 to 10). Points: Hempstead (Harvard), 0.5 pJ/inst @ 0.0 4 MIPS; CleverDust (Berkeley), 2.25 pJ/inst @ 1 MIPS; SNAP/LE (Cornell); Subliminal (Michigan).]

How Subliminal Breaks the Rules
Traditional circuit design relies on transistor switching to perform computation
Static logic circuits continue to operate below V_th by modulating leakage currents
Approach lends itself to low-demand sensor apps, as long as care is taken to build an efficient processor

What I Really Learned
A rule-breaking approach to technical research is effective and engaging
You will find yourself on very fertile ground
  It is that which everyone knows is certainly true, that is indeed false.
  The early bird gets the worm.
  If you are not failing some of the time, you are not trying hard enough.
You will more fully engage your colleagues
  One half will think your crazy idea will never work
  One half will be intrigued (with your crazy idea)

Questions?