High Performance Microprocessor Design and Automation: Overview, Challenges and Opportunities IBM Corporation



Introduction
- About myself
- What to expect out of this lecture
- Understand the current trend in IC design
- Challenges and opportunities
2012 IBM Corporation

Agenda
- Different eras
  - Technology Era
  - Multi-Core Era (Design Era)
  - Innovation Era (EDA Era)
- Innovation
  - Technology innovation
  - Productivity innovation

The Technology Era: Frequency Scaling
Once upon a time, life was great: technology was the superman, design tagged along for the ride, and even EDA grabbed onto the designers' legs for the fun!

Characteristics of the Single-Thread Era
- Dennard scaling
- Optical scaling / node migration
- Exponential frequency growth
- Expanding microarchitecture (uarch) complexity
[Figure: frequency (GHz) for POWER4, POWER5 and POWER6]
[Figure: transistors (TXs) per core for POWER4, POWER5 and POWER6]

Single-Thread Era EDA
- Static timing analysis of complex circuits
- Transistor analysis & optimization
- Transistor-level timing optimization
[Figure: histogram of number of paths versus slack (ps), pre-tuning and post-tuning]
[Figure: timing diagram with precharge and evaluate phases; signal labels clkin, fbk, w0, w2, w3, w3_int]
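To make the slack axis of the histogram concrete, here is a minimal sketch of late-mode static timing analysis on a small combinational graph. It is not the tool described on the slide: the function name `slacks`, the edge-list format and the example delays are all assumptions chosen for illustration.

```python
from collections import defaultdict

def slacks(edges, clock_period_ps):
    """Return {endpoint: slack_ps} for a combinational DAG.

    edges: list of (src, dst, delay_ps). Nodes without fan-in are treated as
    launch points (arrival time 0); nodes without fan-out are capture points.
    """
    fanout = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst, delay in edges:
        fanout[src].append((dst, delay))
        indegree[dst] += 1
        nodes.update((src, dst))

    # Propagate worst-case (latest) arrival times in topological order.
    arrival = {n: 0.0 for n in nodes}
    ready = [n for n in nodes if indegree[n] == 0]
    while ready:
        n = ready.pop()
        for dst, delay in fanout[n]:
            arrival[dst] = max(arrival[dst], arrival[n] + delay)
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)

    # Slack = required time (one clock period) minus latest arrival.
    return {n: clock_period_ps - arrival[n] for n in nodes if not fanout[n]}

# Hypothetical netlist: two inputs, two gates, one output.
example = [("a", "g1", 30), ("b", "g1", 20), ("g1", "g2", 40),
           ("b", "g2", 10), ("g2", "out", 25)]
print(slacks(example, 100))  # {'out': 5.0}
print(slacks(example, 90))   # {'out': -5.0}
```

A negative slack marks a path that misses the cycle time; tuning shifts the whole histogram toward positive slack.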

End of Frequency Scaling: The Power Wall
[Figure: power density (W/cm²) versus gate length (microns), log scales, 1994 to 2004, showing active power, passive power and the air-cooling limit]
The inability to scale oxide thickness and lower the supply voltage resulted in a power wall for single-thread performance.
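The effect of a stalled supply voltage can be sketched with the standard dynamic-power relation P ∝ C·V²·f. The snippet below is a first-order illustration only: the function name and the scale factor `k` are assumptions, and it models active power alone, not the passive (leakage) power that also appears in the figure.

```python
def relative_power_density(k, voltage_scales=True):
    """Active power density after shrinking dimensions by 1/k, relative to before.

    Classic Dennard scaling: capacitance and voltage shrink by 1/k,
    frequency rises by k, and area per transistor shrinks by 1/k**2.
    """
    cap = 1.0 / k
    volt = 1.0 / k if voltage_scales else 1.0  # voltage stuck when scaling ends
    freq = k
    area = 1.0 / k**2
    power = cap * volt**2 * freq               # P ~ C * V^2 * f
    return power / area

print(relative_power_density(2))                        # 1.0: density stays flat
print(relative_power_density(2, voltage_scales=False))  # 4.0: density grows as k**2
```

With voltage scaling, each generation keeps power density constant; without it, density grows with the square of the shrink factor, which is the wall the slide describes.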

Frequency Scaling: POWER6 (65nm, 2007)
- 5+ GHz operation, >790M transistors, 341 mm² die
- 65nm SOI with 10 levels of Cu interconnect
- Same pipeline depth and power at 2x the frequency of POWER5
[Figure: POWER6 die floorplan with labels including Core 1, 2 MB L2, L2 Dir, Mem. Cntl., IFU / IDU, LSU, FXU, RU, BFU, DFU, VMX, SMP Fabric and L3 Controller]

Technology Tantrums
- End of frequency scaling with technology
- Technology squeezing the design hard
- Shock and awe of 65nm: wire delays overtaking gate delays
[Figure labels: Technology, Design, Designers]

Multi-Core Era
The end of frequency scaling ushered in a new era of innovation with multi-core design.

POWER Processors Began the Multi-Core / Multi-Thread Era
- POWER4 (2001): introduced first dual core
- POWER5 (2004): dual core, introduces SMT (4 threads)
- POWER6 (2007): dual core, 4 threads, enhances SMT efficiency

Life Starts to Become Interesting: Technology Ride Very Bumpy
[Figure: relative % improvement from traditional scaling versus from innovation, by technology node from 180nm to 32nm]
[Figure: memory cell schematic and cross-section with labels 3fF, BL (32 cells), 4.0um, Node, WL, Passing WL, BOX, Deep Trench Cap, 18fF, Storage Node, High-K Metal Gate]

Multi-Core Era Limiters
- SW parallelism
- Socket BW
- Technology complexity & rising costs
- Power
[Figure: log(performance) versus technology node (90 down to 10), comparing ideal growth with the likely multi-core path; data labels 1, 2, 4, 8, 16, 32, 64]

Multi-Core Advantage
Need to amplify effective socket throughput to achieve potential.
[Figure labels: Compute Throughput Potential; Socket Throughput Limitation (power, memory bandwidth)]
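Two of the limiters named on these slides, software parallelism and socket bandwidth, can be illustrated with Amdahl's law plus a simple throughput cap. This is a rough sketch, not the model behind the figures: the function names, the 95% parallel fraction and the bandwidth cap of 12 are all assumed values.

```python
def amdahl_speedup(cores, parallel_fraction):
    """Amdahl's law: speedup over one core when only part of the work scales."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

def socket_limited_speedup(cores, parallel_fraction, bandwidth_cap):
    """Cap the compute speedup at what the socket can feed (hypothetical cap)."""
    return min(amdahl_speedup(cores, parallel_fraction), bandwidth_cap)

# Compare ideal growth with the two limited curves for doubling core counts.
for cores in (1, 2, 4, 8, 16, 32, 64):
    ideal = float(cores)
    amdahl = amdahl_speedup(cores, 0.95)
    capped = socket_limited_speedup(cores, 0.95, bandwidth_cap=12.0)
    print(f"{cores:2d} cores: ideal {ideal:5.1f}  amdahl {amdahl:5.2f}  capped {capped:5.2f}")
```

The gap between the ideal and capped columns is the distance between "compute throughput potential" and the effective socket throughput.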

Innovation Drive: Architecture & Productivity Innovation

High-Performance µP Designs: Extending Multi-Core Gains (POWER processor)
- Coherence innovation to minimize socket-to-socket communication
- Low-power off-chip signaling technology
- High-bandwidth memory buffer
- EDRAM = large, low-power cache
[Figure labels: Compute Throughput Potential; Socket Throughput Limitation (power, memory bandwidth)]

Innovation Drive: System-Level Technologies
- 3D stacking with through-silicon vias
- Silicon photonics
- Single processor memory socket
- FPGA accelerators
- Heterogeneous systems on chip
  - Specialized functions
  - Specialized cores: single-thread focused, throughput focused
- Flash memory / SSD

Innovation at the Technology/Design Interface: Double/Triple Patterning
[Figure: device pitch and metal pitch (nm, 0 to 150) by technology node (32nm, 20nm, future), plotted against the single-exposure limit and the double-patterning limit; annotations: "Need for double / triple patterning", "EUV?"]
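Why pitch below the single-exposure limit forces a design-side change can be sketched as a graph-colouring problem: features closer than the limit must land on different masks. The code below is a one-dimensional toy under stated assumptions, not a production decomposition flow; the function name, the 64 nm limit and the feature positions are hypothetical.

```python
from collections import deque

def assign_masks(positions_nm, single_exposure_pitch_nm):
    """Two-colour features so that no two on the same mask are too close.

    Returns a list of mask ids (0 or 1), or None when two masks are not
    enough (an odd conflict cycle), i.e. triple patterning would be needed.
    """
    n = len(positions_nm)
    conflicts = [
        [j for j in range(n)
         if j != i and abs(positions_nm[i] - positions_nm[j]) < single_exposure_pitch_nm]
        for i in range(n)
    ]
    mask = [None] * n
    for start in range(n):
        if mask[start] is not None:
            continue
        mask[start] = 0
        queue = deque([start])
        while queue:  # breadth-first two-colouring of the conflict graph
            i = queue.popleft()
            for j in conflicts[i]:
                if mask[j] is None:
                    mask[j] = 1 - mask[i]
                    queue.append(j)
                elif mask[j] == mask[i]:
                    return None
    return mask

print(assign_masks([0, 40, 80, 120], 64))  # [0, 1, 0, 1]
print(assign_masks([0, 30, 60], 64))       # None: three mutually conflicting features
```

The second call shows the step from double to triple patterning: once three features all conflict with each other, no two-mask assignment exists.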

Productivity Innovation: Structured Synthesis and Large Block Synthesis
- Customs take a large amount of resources, and productivity is key.
- Merge the domains of custom design and synthesis, targeting design productivity and improved quality by merging the custom and synthesis hierarchies and adding structure to synthesis (no longer random logic).
- Global optimization view; targeted structured data paths and synthesis.
- A methodology with numerous algorithmic and practical innovations, spanning incremental logic design processing, data paths, structured clocking and merged custom-synthesis techniques.
[Figure labels: P/Z server, Macro, Quad FPU]

Productivity Innovation: Reduce Custom Design (Structured Synthesis)
[Figure: number of customs over time, normalized from 0 to 1; >10x reduction over 5 generations]
Synthesis results with custom-like data flow alignment.

Productivity Innovation: Reduced # of Design Partitions (Large Block Synthesis)
[Figure: number of macros over time, normalized from 0 to 1]
- From 60 logic macros, 25 customs, 14 unique arrays/RFs
- To 1 macro, 0 customs, 9 unique arrays/RFs
- Reduced area & power; equal cycle time

Productivity & TAT Innovation: Gate Level Analysis & Signoff
[Figure: TX-level versus gate-level analysis compared on runtime, cleanup work and accuracy (arbitrary units); annotations: large speedup, reduced cleanup, similar accuracy]

Productivity & TAT Innovation: Hierarchical Abstraction & Multi-Threading
[Figure: projected chip timing runtime (hrs., 0 to 50) for base, cleanup, coarse parallelism, hierarchical abstracts and multithreading]
- Fast global analysis tools allow designers to iterate more often, resulting in improved final designs.
- Hierarchical abstraction & multi-threading are the most promising ways to minimize TAT.
- Applies to all disciplines (timing, verification, etc.)
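The combination of hierarchical abstraction and multi-threading can be sketched as follows: each block is timed independently and reduced to a single worst-case delay (its "abstract"), and the chip-level pass then works only on those abstracts. This is a deliberately simplified illustration; the function names, block names and delays are hypothetical, and a real abstract would keep per-pin timing rather than one number.

```python
from concurrent.futures import ThreadPoolExecutor

def block_abstract(internal_paths_ps):
    """Collapse a block's internal paths into one worst-case delay (ps)."""
    return max(sum(path) for path in internal_paths_ps)

def chip_worst_delay(blocks, top_level_paths, workers=4):
    """Time blocks in parallel, then analyse the chip using only the abstracts."""
    names = list(blocks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(block_abstract, (blocks[n] for n in names))
        abstracts = dict(zip(names, results))
    return max(sum(abstracts[b] for b in path) for path in top_level_paths)

# Hypothetical blocks, each a list of internal paths given as gate delays.
blocks = {
    "ifu": [[10, 20], [15, 30]],
    "lsu": [[40], [5, 5, 5]],
    "fxu": [[25, 25]],
}
top_level_paths = [["ifu", "lsu"], ["ifu", "fxu"]]
print(chip_worst_delay(blocks, top_level_paths))  # 95
```

Python threads are used here only to show the structure; for CPU-bound analysis in CPython, a process pool or native threads would be needed to get an actual runtime reduction.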

Productivity Innovation: Retiming
[Figure: candidate latch positions in a logic cone, annotated "Area/power too high", "Optimal" and "Doesn't meet cycle time"]
- A significant fraction of logic designer effort is spent optimizing cycle boundaries.
- Retiming enables physical synthesis to optimally place latches in logic cones to balance timing/area/power.
- Invention is required to seamlessly handle the divergence between functional RTL (Verilog/VHDL) and physical implementation throughout the methodology.
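The idea of moving a latch to balance the logic on either side of it can be shown on the simplest possible case: one latch in a linear chain of gates. This sketch considers timing only and ignores the area and power side of the trade-off; the function name and the delays are assumptions, and general retiming on arbitrary graphs (for example the Leiserson-Saxe formulation) is considerably more involved.

```python
def best_latch_position(gate_delays_ps):
    """Pick the cut in a chain of gates that minimizes the longer stage.

    Returns (cut, cycle_ps): the latch sits before gate index `cut`, and
    cycle_ps is the delay of the slower of the two resulting stages.
    Requires at least two gates.
    """
    best = None
    for cut in range(1, len(gate_delays_ps)):
        before = sum(gate_delays_ps[:cut])
        after = sum(gate_delays_ps[cut:])
        cycle = max(before, after)
        if best is None or cycle < best[1]:
            best = (cut, cycle)
    return best

# Hypothetical chain of five gates (delays in ps).
print(best_latch_position([30, 20, 40, 10, 25]))  # (2, 75)
```

Placing the latch one gate later would give stages of 90 ps and 35 ps, which is the "doesn't meet cycle time" case if the target is below 90 ps.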

Innovation: The Sweet Spot in This New Era
[Figure: two stacked bars of designer time (0% to 100%), each split into Wait for Tools, Implement, Plan and Innovate]

Hardware Programming
[Figure: two design flows targeting hardware (LUT, FF, RAM)]
- Traditional: 1000s of RTL designers → VHDL / Verilog → Synthesis → Place & Route → Hardware
- High-level: millions of software designers → HLL (C/C++, LiMe, OpenCL) → HL compiler → VHDL / Verilog → Synthesis → Place & Route → Hardware

Architectural Synthesis: Successive Refinement
- Functional: C/C++ model. Metric: cache miss rate, etc.
- Cycle accurate: C/C++ model. Metrics: performance models, CPI, etc.
- RTL (VHDL/Verilog): back-end design implementation and analysis. Metrics: electrical, timing, area, noise, etc.

Summary
- The information technology landscape is changing dramatically.
- Value is in innovating across the entire stack, and increasingly higher up in the stack.
- Key problems remain to be solved in technology, design and automation as technology continues to scale.
- Significant opportunities are emerging in new ways to solve system bottlenecks at every level: logic, architecture, memory.
- In the last several years, life has become very challenging but also very interesting as the ride has gotten a lot choppier.
- With challenges and opportunities abounding, organizations that grab these challenges and innovate their way out of the current dilemmas will be the winners.
[Figure: two stacked bars of designer time (0% to 100%), each split into Wait for Tools, Implement, Plan and Innovate; labels: IP design content creation innovation, IP design process innovation, design implementation innovation; system value moving up the stack]