Performance Driven Reliable Link Design for Network on Chips


Performance Driven Reliable Link Design for Network on Chips
Rutuparna Tamhankar, Srinivasan Murali, Prof. Giovanni De Micheli
Stanford University

Outline
- Introduction
- Objective
- Logic design and implementation
- Alternative approaches
- Transistor level design
- Timing overheads
- Simulation results
- Conclusion

Introduction
- The contribution of wire delay to total delay is increasing.
- 6-10 cycles are needed to cross the chip diagonally in 50 nm technology.
- Faster clock cycles, scaled voltages, and a noisy environment make wires unreliable.
- Unpredictability in wire characteristics results in variation in wire delay.
- Noise-induced wire delay consumes a greater percentage of the useful clock cycle.
- Conservative design approaches that assume worst-case operating conditions result in poor performance.

Objective of work
Motivational example
- A 3-bit adversarial switching pattern (101->010, 010->101) increases wire delay by 50%.
- A conservative approach results in a larger clock cycle to account for such delay variations.
Use an aggressive design approach
- Design for normal conditions (ignoring noise) and tolerate errors caused by the noisy environment.
- Use a higher clock rate than the conservative design, or increase the spacing between adjacent buffers.
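The adversarial pattern in the motivational example can be sketched as a check on two consecutive bus words. This is a minimal illustration, assuming that "adversarial" means every wire toggles and each wire switches in the opposite direction to its neighbours; that generalisation and the function name `is_adversarial` are assumptions, not part of the slides.

```python
def is_adversarial(prev_bits: str, next_bits: str) -> bool:
    """Return True if every wire toggles and adjacent wires toggle in opposite directions.

    Bus words are given as bit strings such as "101" (hypothetical encoding).
    """
    if len(prev_bits) != len(next_bits) or len(prev_bits) < 2:
        raise ValueError("bus words must have equal length of at least 2 bits")
    # +1 = rising edge, -1 = falling edge, 0 = wire holds its value
    toggles = [int(n) - int(p) for p, n in zip(prev_bits, next_bits)]
    all_toggle = all(t != 0 for t in toggles)
    neighbours_opposed = all(a == -b for a, b in zip(toggles, toggles[1:]))
    return all_toggle and neighbours_opposed


if __name__ == "__main__":
    for prev, nxt in [("101", "010"), ("010", "101"), ("000", "111"), ("101", "101")]:
        print(prev, "->", nxt, "adversarial:", is_adversarial(prev, nxt))
```

Both patterns from the slide are flagged, while a transition where all wires switch in the same direction is not.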

Timing Error-Tolerant System
- A timing error-tolerant (Terror) system detects and corrects timing errors with minimal impact on latency.
- Terror systems can tolerate delay variations, giving large latency savings (up to 35%) over a traditional retransmission scheme.
- Only transient timing errors are corrected; static errors due to logic faults, soft errors, etc. are not corrected.

Buffered link design
[Figure: typical pipelined link design with b stages. Labels: SENDER, pipeline buffer 1, pipeline buffer 2, ..., pipeline buffer b, RECEIVER]
[Figure: proposed design. Each pipeline buffer is changed to a data pipeline buffer and an error control circuit, linked by a control signal]

Terror Design Principle
[Figure: Terror element. Labels: input data, MUX, main flip-flop, delayed flip-flop, XOR, sel, ck, ckd, errq, output]
- In the normal state, data is captured and sent by the main flip-flop.
- The XOR detects a difference between the data captured by the delayed flip-flop and the main flip-flop.
- The delay between ck and ckd ensures that data bits get sufficient time to reach the delayed flip-flop.
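The detection principle can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not the authors' circuit: time is measured as a fraction of the clock cycle after the ck edge, a late bit is modelled as the old value still being present at the sampling instant, and the names `TerrorSample` and `sample_bit` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TerrorSample:
    main: int     # value captured by the main flip-flop at ck
    delayed: int  # value captured by the delayed flip-flop at ckd
    errq: int     # XOR of the two captured values


def sample_bit(old_value: int, new_value: int, arrival: float, ckd_delay: float) -> TerrorSample:
    """Model one bit of a Terror element.

    arrival:   time at which new_value settles, in cycles after the ck edge
               (<= 0 means the bit met timing).
    ckd_delay: delay of ckd relative to ck, in cycles.
    """
    main = new_value if arrival <= 0.0 else old_value
    delayed = new_value if arrival <= ckd_delay else old_value
    return TerrorSample(main, delayed, main ^ delayed)


if __name__ == "__main__":
    print(sample_bit(0, 1, arrival=-0.1, ckd_delay=0.5))  # on time: no error flagged
    print(sample_bit(0, 1, arrival=0.3, ckd_delay=0.5))   # late but inside the ckd window: errq = 1
    print(sample_bit(0, 1, arrival=0.7, ckd_delay=0.5))   # later than ckd: not detected by this element
```

The third case corresponds to errors that the element cannot see, which the slides leave to a retransmission scheme.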

Normal to Delayed state
[Figure: Terror element with control circuit. Labels: input, delayed flip-flop, MUX, main flip-flop, XOR, control circuit, ck, ckd, corr_out, output. Waveforms: ck, ckd, input, output]

Terror enabled link design
[Figure: w-bit link (bit 1 to bit w) with Terror stages 1 to b. In each stage the errq1..errqw signals feed an OR gate whose err output goes to that stage's control circuit (Control 1 to Control b), which drives sel1..selb]
- One error correction circuit is common to w buffers, which decreases the overall cost of the control circuitry.
- The errq signals of all buffers in a stage (vertically) are ORed and fed to the control circuit.
- This avoids a synchronisation circuit at the end.

Error control circuit
[Figure: error control circuit. Labels: errq1, errq2, ..., errqw, OR, AND, err, prev_corr, SR latch, clock generation circuit, correction flip-flop, ck, ckd, ckdd, sel, corr_out]
- The corr_out signal indicates that the data sent on the previous cycle was incorrect (due to a timing error).
- Clocks ckd and ckdd are generated locally by a delay chain (a chain of inverters).
- The SR latch is set when err = 1 and is reset when prev_corr = 1.
- prev_corr is the corr_out signal of the previous stage.
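The set/reset behaviour described above can be sketched as a cycle-level state machine. This is only an illustration: the class name `ErrorControl`, the use of the latch state directly as `sel`, and giving reset priority when err and prev_corr arrive together are assumptions that the slides do not specify. The corr_out path and the clock generation are not modelled.

```python
from typing import Iterable, Tuple


class ErrorControl:
    """Cycle-level sketch of one stage's shared error control circuit."""

    def __init__(self) -> None:
        self.delayed_state = False  # SR latch output; False = normal, True = delayed

    def step(self, errq_bits: Iterable[int], prev_corr: bool) -> Tuple[bool, bool]:
        """Advance one cycle and return (err, sel)."""
        err = any(errq_bits)  # errq signals of all w buffers are ORed
        if prev_corr:
            # Assumption: reset wins if both arrive in the same cycle.
            self.delayed_state = False
        elif err:
            self.delayed_state = True
        return err, self.delayed_state


if __name__ == "__main__":
    ctl = ErrorControl()
    print(ctl.step([0, 0, 0], prev_corr=False))  # normal operation
    print(ctl.step([0, 1, 0], prev_corr=False))  # one bit late: stage enters delayed state
    print(ctl.step([0, 0, 0], prev_corr=False))  # stays in delayed state
    print(ctl.step([0, 0, 0], prev_corr=True))   # previous stage corrected: back to normal
```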

Delayed to Normal state
[Figure: Terror element with control circuit. Labels: input, delayed flip-flop, MUX, main flip-flop, XOR, control circuit, ck, ckd, prev_corr, output. Waveforms: ck, ckd, input, output]

Alternative approaches
- Teatime [Comp. '04] tracks logic delay and avoids errors by changing the clock frequency. Requires a complex frequency controller and tracking logic.
- Razor [Micro '04] monitors the error rate to control power. Uses a gated clock or flushes the pipeline to correct errors.
- Favalli et al. [DFT/VLSI '97] use encoded data and a decoder at the flip-flop input to detect errors. Incurs the overhead of a decoder at each flip-flop input.
- Mousetrap [ICCD '01] uses a high-speed asynchronous pipeline. Acknowledge and request signals to ensure correct

Comparison of latency
- Previous approaches give large latency overheads at high error rates and do not scale well with bus width.
- Error penalty increases linearly with error rate.
- Difficult to correct multiple errors.
- For example, if errors occur every cycle, for N bits:
  Razor [Micro '04]: % of useful cycles = N / 2N = 50%
  Terror: % of useful cycles = N / (N + b), where b = number of buffers on the link
- For b << N, the benefits are substantial.
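The two expressions above can be evaluated directly. The sketch below uses N = 1000 and b = 4, values that appear elsewhere in the slides; the function names are illustrative only.

```python
def razor_useful_fraction(n_bits: int) -> float:
    """Fraction of useful cycles when an error occurs every cycle: N / 2N."""
    return n_bits / (2 * n_bits)


def terror_useful_fraction(n_bits: int, n_buffers: int) -> float:
    """Fraction of useful cycles for Terror: N / (N + b)."""
    return n_bits / (n_bits + n_buffers)


if __name__ == "__main__":
    n, b = 1000, 4
    print(f"Razor:  {razor_useful_fraction(n):.1%}")
    print(f"Terror: {terror_useful_fraction(n, b):.1%}")
```

With b much smaller than N, the Terror fraction stays close to 100%, while the Razor expression is fixed at 50%.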

Transistor level design
- The Terror element was designed for a 32-bit bus in 100 nm technology, targeting a 1 GHz clock frequency.
- Transistor-level optimisations were included:
  - Include the 2:1 mux inside the main flip-flop.
  - Use a domino OR instead of a static OR for the errq signals.
  - Combine the AND-OR gates into the correction flip-flop.
  - Use a simple SR latch design.
- Timing overheads were calculated based on SPICE simulations.

Transistor level design
[Figure: schematic showing the 2:1 mux inside a flip-flop. Labels: ck, sel, d0, d1, out, vdd, gnd]

Timing Overheads

Parameter          % Overhead
Hold time          10.0
Normal->Delayed    27.0
Delayed->Normal    9.7

- Ideally, ckd can be delayed by one cycle.
- In practice, the ckd delay is limited by finite timing delays: the flip-flop hold time and the logic delay involved in going from the normal to the delayed state and vice versa.
- Hence the range of variation of ckd is 100 - 9.7 - 27 - 10 = 53.3% of the cycle.
- To simplify, we use 50% delay variation for ckd: clock ckd can be delayed by half a clock cycle.
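The usable ckd range follows from subtracting the three overheads in the table from a full cycle. A short check of that arithmetic, with illustrative variable names:

```python
# Overheads from the table, as a percentage of the clock cycle.
overheads = {
    "hold_time": 10.0,
    "normal_to_delayed": 27.0,
    "delayed_to_normal": 9.7,
}

# Range over which ckd can be delayed, as a percentage of the cycle.
ckd_range = 100.0 - sum(overheads.values())

if __name__ == "__main__":
    print(f"ckd range: {ckd_range:.1f}% of the cycle")
```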

Simulation results
- We considered an SoC with an on-chip link length of 12 mm operating at a frequency of 1 GHz.
- With the conservative design approach, we assume the distance between successive stages is 2 mm, so 6 stages are required.
- A Terror-enabled pipeline will ideally have 3 stages (since ckd can be delayed by one cycle).
- In practice, ckd can be delayed by 50%, so the number of stages with Terror is 4.
- We plot the receiver latency at different error rates for 1000 data bits in both cases.
- Errors not detected by Terror are corrected by a retransmission scheme.
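The stage counts quoted above can be reproduced with a simple calculation. The sketch assumes that delaying ckd by a fraction f of a cycle allows the inter-buffer spacing to grow by the same fraction, which is consistent with the slide's numbers but is an interpretation rather than a stated formula; `stages_needed` is a hypothetical helper.

```python
import math


def stages_needed(link_mm: float, spacing_mm: float, ckd_fraction: float) -> int:
    """Number of pipeline stages for a link of link_mm.

    spacing_mm:   inter-buffer spacing of the conservative design.
    ckd_fraction: ckd delay as a fraction of the clock cycle (0 = conservative design).
    Assumption: the allowed spacing scales as spacing_mm * (1 + ckd_fraction).
    """
    return math.ceil(link_mm / (spacing_mm * (1.0 + ckd_fraction)))


if __name__ == "__main__":
    print("conservative:    ", stages_needed(12, 2, 0.0))
    print("Terror, ideal:   ", stages_needed(12, 2, 1.0))
    print("Terror, practical:", stages_needed(12, 2, 0.5))
```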

Simulation plots
[Plot: receiver latency variation with the delay between clocks ck and ckd, ideal case]
[Plot: receiver latency, practical case]

Aggressiveness Vs Latency
- We define aggressiveness as the percentage increase in inter-buffer spacing over the conservative design.
- Latency reduces by 33% for 50% aggressiveness.
- The plot terminates at 50%, since the distance between pipeline stages can only be increased by 50%.
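One way to see where the 33% figure comes from is sketched below. It assumes, beyond what the slide states, that latency is proportional to the number of pipeline stages and that the stage count is inversely proportional to the inter-buffer spacing. The symbols L (link length), d (conservative spacing), a (aggressiveness as a fraction), and b (stage count) are introduced here for illustration.

```latex
\[
b_{\mathrm{cons}} = \frac{L}{d}, \qquad
b_{\mathrm{terror}} = \frac{L}{d\,(1 + a)}, \qquad
\text{latency reduction} = 1 - \frac{b_{\mathrm{terror}}}{b_{\mathrm{cons}}} = \frac{a}{1 + a}
\]
\[
a = 0.5 \;\Rightarrow\; \frac{0.5}{1.5} \approx 33\%
\]
```

This agrees with the 12 mm example, where the stage count drops from 6 to 4.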

Retransmission Vs Terror Scheme
- Receiver latency for receiving 1000 bits at 1% and 5% error rates is plotted for both schemes.
- The Terror scheme gives a 35% reduction in latency.

Maximum Penalty Vs Data size
- Variation in penalty for different error rates and data sizes is plotted for a 4-stage pipeline.
- The maximum penalty is 4 cycles.

Error Penalty
- Maximum latency is bounded by the number of buffer stages and is not affected by the error rate.
- Latency overhead does not increase for large data sizes or high error rates.
- Typical error-correcting schemes degrade at high error rates.
- Terror-enabled systems are suitable for designs with high error rates, which can occur in high-noise environments or at low voltage levels.
- They can be used in aggressive designs, where the clock frequency is higher than the conservative value or the physical spacing between buffers is increased.

Receiver design
[Figure: receiver design. Labels: receiver, look-ahead stage, ck, data, corr_in, rec_out]
[Figure: look-ahead stage operation. Waveforms: ck, data, corr_in, rec_out]
- The receiver looks ahead 1 cycle for the data.
- Only the first bit incurs a 1-cycle penalty, since the other bits follow in pipeline fashion.
- The 1-cycle penalty can be hidden in a switch; it only occurs at the end receiver.

Area Overhead in typical SoCs

Design      Area overhead
Merlot      0.12%
DSP         0.6%
MIT RAW     0.86%
Alpha MP    0.9%
Average     0.62%

- We estimate the area overhead in typical MPSoCs based on the increase in gate count due to Terror.
- At a 0.6% increase in area, we get large latency savings (up to 35%).
- Power overhead is negligible.

Conclusion
- Reliability, delay, and throughput of a link are affected by unpredictability in link characteristics, and increasingly by crosstalk and other interference.
- Terror-enabled systems tolerate such variation in link delay and encourage an aggressive design approach.
- They provide a 35% saving in latency over the traditional approach.
- Networks on Chips (NoCs), a communication-centric approach, will be required for future SoCs.
- NoCs provide scalability and reliability for efficient communication between cores.

THANK YOU

EXTRA

Normal to delayed state
[Figure: timing waveforms for ck, data, ckd, err, ckdd, sel, out, corr_out]
- Terror goes into the delayed state when err = 1.
- In the delayed state, data is captured by the delayed flip-flop and sent by the main flip-flop.
- A one-cycle penalty occurs for the first occurrence of an error.
- For subsequent errors, there is no penalty.

Delayed to normal state
[Figure: timing waveforms for ck, ckdd, data, prev_corr, sel, out]
- Terror returns to the normal state when the prev_corr signal is received.
- Incorrect data captured by the delayed flip-flop is not sent.
- For proper operation, the prev_corr signal is made error free by shielding from other wires, routing the

Analysis of Penalty
- Maximum latency is bounded by the number of pipeline stages.
- Average penalty depends on when and where timing errors occur along the pipeline.
- 1 <= Penalty <= b (for b link buffers)
- A one-cycle penalty occurs when an error in the (b-1)th stage is absorbed by the bth stage, since the bth stage goes from the delayed to the normal state.
- The worst-case (b-cycle) penalty occurs when an error occurs first in pipeline stage 1, then in stage 2, and so on up to the bth stage.

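Timing Analysis
- Setup time increases due to the 2:1 mux: $T_{\text{setup}} = t_{\text{setup(nominal)}} + t_{\text{d(mux)}}$
- Setup time of the correction flip-flop: $T_{\text{setup}} = t_{\text{and}} + t_{\text{or}} + t_{\text{setup(nominal)}}$
- Minimum ckd delay is the path delay of the err signal: $T_{\text{ckd}} = t_{\text{ck-q}} + t_{\text{xor}} + t_{\text{domino-or}} + t_{\text{and}} + t_{\text{or}} + t_{\text{setup(nominal)}}$
- Minimum ckdd delay is the path delay of the sel signal: $T_{\text{ckdd}} = t_{\text{SRlatch}} + t_{\text{mux}} - t_{\text{setup(nominal)}}$
- Hold time condition of the correction flip-flop: $T_{\text{hold}} < t_{\text{SRlatch}} + t_{\text{and}} + t_{\text{or}}$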
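The constraints above can be evaluated for a given set of gate delays. The sketch below simply transcribes the five expressions. The delay values in the usage example are hypothetical placeholders in picoseconds, not figures from the slides or from the SPICE simulations, and the names `GateDelays` and `timing_constraints` are illustrative.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GateDelays:
    """Component delays in picoseconds (all values supplied by the caller)."""
    setup_nominal: int
    mux: int
    and_gate: int
    or_gate: int
    ck_to_q: int
    xor: int
    domino_or: int
    sr_latch: int


def timing_constraints(d: GateDelays) -> Dict[str, int]:
    """Evaluate the timing expressions from the Timing Analysis slide."""
    return {
        # Setup time of the main flip-flop, increased by the 2:1 mux
        "main_setup": d.setup_nominal + d.mux,
        # Setup time of the correction flip-flop
        "correction_setup": d.and_gate + d.or_gate + d.setup_nominal,
        # Minimum ckd delay: path delay of the err signal
        "min_ckd": d.ck_to_q + d.xor + d.domino_or + d.and_gate + d.or_gate + d.setup_nominal,
        # Minimum ckdd delay: path delay of the sel signal
        "min_ckdd": d.sr_latch + d.mux - d.setup_nominal,
        # Hold time of the correction flip-flop must stay below this bound
        "hold_bound": d.sr_latch + d.and_gate + d.or_gate,
    }


if __name__ == "__main__":
    # Hypothetical delays, chosen only to exercise the formulas.
    example = GateDelays(setup_nominal=30, mux=20, and_gate=15, or_gate=15,
                         ck_to_q=40, xor=25, domino_or=20, sr_latch=35)
    for name, value in timing_constraints(example).items():
        print(f"{name}: {value} ps")
```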