ECE552 / CPS550 Advanced Computer Architecture I. Lecture 1 Introduction


ECE552 / CPS550 Advanced Computer Architecture I
Lecture 1: Introduction
Benjamin Lee, Electrical and Computer Engineering, Duke University
www.duke.edu/~bcl15
www.duke.edu/~bcl15/class/class_ece552fall12.html

Computing Devices (Then)
- Mark I: Harvard University, 1944
- EDSAC: University of Cambridge, 1949
ECE 552 / CPS 550

Computing Devices (Now)
- iPad: Apple/ARM, 2010
- Blue Gene/P: IBM, 2007

Computer Architecture
- Application and Physics: gap too large to bridge in one step
- Computer architecture is the design of abstraction layers, which allow efficient implementations of computational applications on available technologies

Abstraction Layers (top to bottom)
- Application
- Algorithm
- Programming Language
- Operating System/Virtual Machines
- Instruction Set Architecture (ISA)
- Microarchitecture
- Gates/Register-Transfer Level (RTL)
- Circuits
- Devices
- Physics
Domain of early computer architecture ('50s-'80s)
Domain of recent computer architecture (since '90s)

An Integrated Approach
Architect Systems
- Coordinate technology, hardware, run-time software, compilers, apps
- Responsible for end-to-end functionality
Design and Analyze
- Search the space of possible designs at all levels in the computer system
- Evaluate designs with quantitative metrics (performance, power, cost)
Navigate Computing Landscape
- Architects work at the hardware-software interface
- Technologies are emerging
- Applications are demanding
- Systems are scaling

ECE 552 Executive Summary
- In-order Datapath (built, ECE152)
- Chip Multiprocessors (understand, experiment, ECE552)

ECE 552 Administrivia
Instructor
- Prof. Benjamin Lee, benjamin.c.lee@duke.edu
- Office Hours: Tu 4-5pm, Fr 4-5pm, 210 Hudson
Teaching Assistants
- Marisabel Guevara, mg@cs.duke.edu. Office Hours: Tu 12:25-1:25pm, W 4-5pm, TBD
- Weidan Wu, ww53@duke.edu. Office Hours: M 4-5pm, Th 2:40-3:40pm, TBD
Lectures
- Tu/Th 1:25-2:40pm, 208 Hudson
Text
- Computer Architecture: A Quantitative Approach, 5th Edition (2012). Do not use earlier editions.
Web
- http://www.duke.edu/~bcl15/class/class_ece552fall12.html

ECE 552 Prerequisites
Participation
- Electrical and Computer Engineering, Computer Science
- PhD, MS, Undergraduates
Prerequisites
- Introduction to computer architecture (CPS 104, ECE 152, or equiv.)
- Programming (homework/projects in C, C++)
Background Knowledge
- Instruction sets, computer arithmetic, assembly programming
- D.A. Patterson and J.L. Hennessy. Computer Organization and Design: The Hardware/Software Interface, 5th Edition.
Dropping the Course
- If you are going to drop, please do so early

ECE 552 Syllabus
1. Design Metrics
   1. Performance
   2. Power
   3. Early machines
2. Simple Pipelining
   1. Multi-cycle machines
   2. Branch Prediction
   3. In-order Superscalar
   4. Optimizations
3. Complex Pipelining
   1. Score-boarding, Tomasulo Algorithm
   2. Out-of-order Superscalar
Midterm Exam
Fall Break
4. Memory Systems
   1. Caches
   2. DRAM
   3. Virtual Memory
5. Explicitly Parallel Architectures
   1. VLIW
   2. Vector machines
   3. Multi-threading
6. Multiprocessors
   1. Memory Models
   2. Coherence Protocols
7. Advanced Topics
   1. Emerging Technologies
   2. Specialized Architectures
   3. Datacenter Architectures

ECE 552 Components
30% Homework and Readings
- Homework done in teams of 3
- 5 classes dedicated to paper discussions
15% Midterm exam
- 75 minutes (in class), closed book
25% Final exam
- 3 hours, closed book
- Based on lectures, problem sets, readings
30% Term project/paper
- Project done in teams of 3
Academic Policy
- University policy as codified by the Duke Undergraduate Honor Code will be strictly enforced. Zero tolerance for cheating and/or plagiarism.

ECE 552 Academic Policy
University policy as codified by the Duke Undergraduate Honor Code will be strictly enforced. Zero tolerance for cheating and/or plagiarism.
If a student is suspected of academic dishonesty (e.g., cheating on an exam, copying a lab report, collaborating inappropriately on an assignment), faculty are required to report the matter to the Office of Student Conduct.
A student found responsible for academic dishonesty faces formal disciplinary action, which may include suspension. A student suspended twice for academic dishonesty automatically faces a minimum 5-year separation from Duke University.

ECE 552 Term Project
Scope
- Semester-long research project
- Teams of 3
- Students propose project ideas (Oct 14)
Final Paper
- 6-12 page research paper
- Evaluate research idea quantitatively
- Survey and cite related work

ECE 552 Upcoming Deadlines
11 September: Homework #1 Due
- Assignment on web page. Teams of 2-3.
- Submit hard copy in class. Email code to TAs.
11 September: Class Discussion
- Roughly one reading per class. Do not wait until the day before!
1. Hill et al. Classic machines: Technology, implementation, and economics
2. Moore. Cramming more components onto integrated circuits
3. Radin. The 801 minicomputer
4. Patterson et al. The case for the reduced instruction set computer
5. Colwell et al. Instruction sets and beyond: Computers, complexity, controversy

Performance
Definitions
- Latency: time to finish a given task (a.k.a. execution time)
- Throughput: number of tasks in a given time (a.k.a. bandwidth)
- Throughput can exploit parallelism while latency cannot
Example: Move people from Duke to UNC, 10 miles
- Car: capacity = 5, speed = 60 miles/hour
- Bus: capacity = 60, speed = 20 miles/hour
- Latency(car) = (10 miles at 60 miles/hour) = 10 minutes
- Latency(bus) = (10 miles at 20 miles/hour) = 30 minutes
- Throughput(car) = (3 round trips per hour at 60 miles/hour) = 15 people/hour
- Throughput(bus) = (1 round trip per hour at 20 miles/hour) = 60 people/hour
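The car and bus numbers above can be reproduced with a short sketch. It assumes, as the trip counts on the slide imply, that each vehicle must drive back empty before carrying its next load; the function names are illustrative only.

```python
def latency_minutes(distance_miles, speed_mph):
    """One-way travel time for a single passenger, in minutes."""
    return 60.0 * distance_miles / speed_mph


def throughput_people_per_hour(capacity, distance_miles, speed_mph):
    """People delivered per hour, assuming an empty return trip after each load."""
    round_trip_hours = 2.0 * distance_miles / speed_mph
    return capacity / round_trip_hours


car_latency = latency_minutes(10, 60)
bus_latency = latency_minutes(10, 20)
car_throughput = throughput_people_per_hour(5, 10, 60)
bus_throughput = throughput_people_per_hour(60, 10, 20)

print(f"car: {car_latency:.0f} min, {car_throughput:.0f} people/hour")
print(f"bus: {bus_latency:.0f} min, {bus_throughput:.0f} people/hour")
```

The car wins on latency and the bus wins on throughput, which is the point of the example: the two metrics can rank the same designs in opposite order.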

Benchmarking
Measuring Performance
- Target Workload: accurate but not portable
- Representative Benchmark: portable but not accurate
- Microbenchmark: small, fast code sequences but incomplete
Representative Benchmarks
- SPEC (Standard Performance Evaluation Corporation, www.spec.org)
  - Collects, standardizes, distributes benchmark programs
- Parallel Benchmarks
  - Scientific and commercial computing
  - SPLASH-2, NAS, SPEC OpenMP, SPECjbb
- Transaction Processing Performance Council (TPC)
  - Online transaction processing (OLTP) with heavy I/O, memory
  - TPC-C, TPC-H, TPC-W

Aggregating Performance
Addition
- Latency is additive but throughput is not
- Example: Consider applications A1 and A2 on processor P
- Latency(A1,A2) = Latency(A1) + Latency(A2)
- Throughput(A1,A2) = 1/[1/Throughput(A1) + 1/Throughput(A2)]
Averages
- Arithmetic Mean: $\frac{1}{N} \sum_{P=1}^{N} \mathrm{Latency}(P)$
  - For measures that are proportional to time (e.g., latency)
- Harmonic Mean: $N \big/ \sum_{P=1}^{N} \frac{1}{\mathrm{Throughput}(P)}$
  - For measures that are inversely proportional to time (e.g., throughput)
- Geometric Mean: $\left( \prod_{P=1}^{N} \mathrm{Speedup}(P) \right)^{1/N}$
  - For ratios (e.g., speed-ups)
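A minimal sketch of the three averages follows. The sample latencies, throughputs, and speedups are made-up values chosen so the results are easy to check by hand; they do not come from the slides.

```python
import math


def arithmetic_mean(values):
    """Average for quantities proportional to time (e.g., latency)."""
    return sum(values) / len(values)


def harmonic_mean(values):
    """Average for quantities inversely proportional to time (e.g., throughput)."""
    return len(values) / sum(1.0 / v for v in values)


def geometric_mean(values):
    """Average for ratios (e.g., speedups)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def combined_throughput(t1, t2):
    """Throughput of running A1 then A2 back to back: latencies add, rates do not."""
    return 1.0 / (1.0 / t1 + 1.0 / t2)


latencies = [2.0, 4.0]                      # seconds per task (hypothetical)
throughputs = [1.0 / x for x in latencies]  # tasks per second

mean_latency = arithmetic_mean(latencies)
mean_throughput = harmonic_mean(throughputs)
mean_speedup = geometric_mean([2.0, 8.0])   # hypothetical speedups
both = combined_throughput(*throughputs)

print(mean_latency, mean_throughput, mean_speedup, both)
```

With these inputs the harmonic mean of the throughputs equals the reciprocal of the arithmetic mean of the latencies, which is why each mean is paired with its kind of measure.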

Processor Performance
[Figure: performance relative to the VAX-11/780 on SPECint benchmarks (log scale, 1 to 10000), 1978-2006. Annotated growth rates: 25%/year, 52%/year, ??%/year. Source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th Edition, 2006.]

Performance Factors
Latency = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
Seconds / Cycle
- Technology and architecture
- Transistor scaling
- Processor microarchitecture
Cycles / Instruction (CPI)
- Architecture and systems
- Processor microarchitecture
- System balance (processor, memory, network, storage)
Instructions / Program
- Algorithm and applications
- Compiler transformations, optimizations
- Instruction set architecture
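The latency equation can be sketched as a three-factor product. The instruction count, CPI, and clock period below are hypothetical values chosen for illustration.

```python
def latency_seconds(instructions_per_program, cycles_per_instruction, seconds_per_cycle):
    """Latency = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)."""
    return instructions_per_program * cycles_per_instruction * seconds_per_cycle


# Hypothetical workload: 1e9 dynamic instructions, CPI of 2, 500 MHz clock (2 ns period).
baseline = latency_seconds(1e9, 2.0, 2e-9)

# Improving any single factor scales latency proportionally.
better_cpi = latency_seconds(1e9, 1.0, 2e-9)
fewer_instructions = latency_seconds(0.5e9, 2.0, 2e-9)

print(baseline, better_cpi, fewer_instructions)
```

Because the factors multiply, a compiler change that halves the instruction count and a microarchitecture change that halves CPI have the same effect on latency in this model.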

Moore's Law
- Moore. Cramming more components onto integrated circuits. Electronics, Vol. 38, No. 8, 1965.
- As integration increases and packaging costs decrease
- How does Moore's Law impact performance?

MOSFET
Field-Effect Transistors
- MOS: metal-oxide semiconductor
- FET: field-effect transistor
- Charge carriers flow between source and drain
- Flow controlled by gate voltage
- Abstract MOSFET as electrical switch
[Figure: MOSFET switch symbol (gate, source, drain) and device structure (gate, source, drain, bulk, channel, width, length)]

Complementary MOS (CMOS)
- Voltages map to logical values (Vdd=1, Gnd=0)
- Implement complementary Boolean logic
- nfet: conducts charge when Vg = Vdd, used in pull-down network
- pfet: conducts charge when Vg = Gnd, used in pull-up network
- Examples: Inverter, NAND (universal, any logic function via De Morgan's Law)
[Figure: CMOS inverter (A to !A) and NAND gate (A, B to !(AB)), built from pfets connected to Vdd and nfets connected to Gnd]

Transistor Dimensions
- Process defined by feature size (F), layout design (λ = F/2)
- Example: F = 2λ = 45nm process technology
- Transistor dimensions determine technology performance
- Transistor drive strength (i.e., performance) increases as channel length shrinks
[Figure: transistor layout with minimum length = 2λ and width = 4λ, showing gate, source, drain, and bulk]

Dennard Scaling
- Dennard et al. Design of ion-implanted MOSFET's with very small physical dimensions. IEEE Journal of Solid-State Circuits, 1974.
- Scale not only dimensions but also doping concentration and voltage
- Transistors become faster (1.4x)
- Applied to Moore's Law: k = 1.4, 1/k = 0.7 every 18-24 months
[Figure: transistor structure showing gate, source, drain, bulk, width, and length]

Dennard Scaling Limits
- Horowitz et al. Scaling, power, and the future of CMOS. IEDM, 2005.
- Classical Dennard scaling ended at 130nm in 2000-2001.
- Oxide Thickness: How to manage increasing leakage? Use high-k dielectrics
- Channel Length: How to manage increasing leakage? Stop scaling L
- Doping Concentration: How to handle imprecise doping? Manage variability
- Voltage: How to manage increasing leakage? Stop scaling V
- Current: How to increase current with shrinking channels? Stress silicon
- Example: Intel 22nm process technology with FinFET
[Image: Courtesy Intel Corp.]

Cycles per Instruction (CPI)
Average Instruction Latency
- Examine instruction frequency
- Different instructions require different numbers of cycles
- Example: Integer instructions (1 cy), Floating-point instructions (>10 cy)
- CPI is slightly easier to calculate than IPC (time versus rate)
Example
- Instruction frequency: 1/3 INT, 1/3 FP, 1/3 MEM operations
- Instruction cycles: 1cy INT, 3cy FP, 2cy MEM
- CPI = (1/3 x 1) + (1/3 x 3) + (1/3 x 2) = 2
Caveat
- CPI provides high-level, quick estimates of performance
- Does not account for details (e.g., instruction dependences)

CPI and Design
Baseline Processor / Application
- Integer ALU: 50%, 1 cycle
- Load: 20%, 5 cycles
- Store: 10%, 1 cycle
- Branch: 20%, 2 cycles
Possible Enhancements
- Option 1: Branch prediction to reduce branch cost to 1 cycle
- Option 2: Bigger data cache to reduce load cost to 3 cycles
- Which enhancement would we prefer?
Cycles Per Instruction
- Base = (0.5 x 1) + (0.2 x 5) + (0.1 x 1) + (0.2 x 2) = 2 cycles
- Option 1 = (0.5 x 1) + (0.2 x 5) + (0.1 x 1) + (0.2 x 1) = 1.8 cycles
- Option 2 = (0.5 x 1) + (0.2 x 3) + (0.1 x 1) + (0.2 x 2) = 1.6 cycles
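The comparison of the two enhancements can be checked with a small sketch that computes CPI as a frequency-weighted sum of per-class cycle counts. The dictionary layout and names are illustrative choices, not part of the lecture.

```python
def cpi(mix):
    """Weighted average cycles per instruction; mix maps class -> (frequency, cycles)."""
    return sum(freq * cycles for freq, cycles in mix.values())


base = {
    "alu": (0.5, 1),
    "load": (0.2, 5),
    "store": (0.1, 1),
    "branch": (0.2, 2),
}
option1 = dict(base, branch=(0.2, 1))  # branch prediction: branch cost 2 -> 1 cycle
option2 = dict(base, load=(0.2, 3))    # bigger data cache: load cost 5 -> 3 cycles

cpi_base = cpi(base)
cpi_option1 = cpi(option1)
cpi_option2 = cpi(option2)

print(f"base={cpi_base:.1f} option1={cpi_option1:.1f} option2={cpi_option2:.1f}")
```

Option 2 gives the lower CPI because loads contribute more total cycles than branches: both classes are 20% of instructions, but the cache saves two cycles per load while prediction saves one cycle per branch.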

Measuring CPI
Physical Measurements
- Measure wall clock time as application runs
- Multiply time by clock frequency to get cycles
- Profile application with hardware counters (e.g., Intel VTune)
Simulated Measurements
- Cycle-level, microarchitectural simulation (e.g., SimpleScalar)
- Run applications on simulated hardware
- Track instructions as they progress through the design
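The physical-measurement recipe amounts to two lines of arithmetic, sketched below. The wall-clock time, clock frequency, and instruction count are hypothetical; in practice the instruction count would come from a hardware counter or profiler.

```python
def measured_cpi(wall_seconds, clock_hz, instructions):
    """Estimate CPI from a timed run: cycles = time x frequency, CPI = cycles / instructions."""
    cycles = wall_seconds * clock_hz
    return cycles / instructions


# Hypothetical run: 4 seconds on a 500 MHz processor, retiring 1e9 instructions.
estimate = measured_cpi(4.0, 500e6, 1e9)
print(estimate)
```

This estimate assumes the clock frequency is constant for the whole run and that wall-clock time is dominated by the application itself.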

Pitfall: Partial Performance Metrics
Ignoring Instructions per Program
- Neglects dynamic instruction count
- Misleading if working in algorithms, compilers, or ISA
Using Instructions per Second
- MIPS = (Instructions / Cycle) x (Cycles / Second) x 1E-6
- FLOPS: considers only floating-point instructions
- Example: CPI = 2, clock frequency = 500MHz, 250 MIPS
- Example: compiler removes instructions, latency falls, MIPS increases
Using Clock Frequency
- Cannot equate clock frequency with performance
- Proc A: CPI = 2, f = 500MHz
- Proc B: CPI = 1, f = 300MHz
- Given the same ISA and compiler, B is faster
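The clock-frequency pitfall can be verified numerically. The sketch below uses the Proc A and Proc B parameters from the slide; the shared instruction count is a hypothetical value standing in for "same ISA and compiler".

```python
def mips(cpi, clock_hz):
    """MIPS = (Instructions / Cycle) x (Cycles / Second) x 1E-6."""
    return (1.0 / cpi) * clock_hz * 1e-6


def run_time_seconds(instructions, cpi, clock_hz):
    """Execution time for a fixed dynamic instruction count."""
    return instructions * cpi / clock_hz


INSTRUCTIONS = 1e9  # hypothetical; identical for both processors

mips_a = mips(2, 500e6)
mips_b = mips(1, 300e6)
time_a = run_time_seconds(INSTRUCTIONS, 2, 500e6)
time_b = run_time_seconds(INSTRUCTIONS, 1, 300e6)

print(f"A: {mips_a:.0f} MIPS, {time_a:.2f} s")
print(f"B: {mips_b:.0f} MIPS, {time_b:.2f} s")
```

Proc B finishes first despite its slower clock because it completes one instruction per cycle while Proc A needs two.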

Pitfall & Amdahl's Law
- Amdahl. Validity of the single-processor approach. AFIPS, 1967.
Make Common Case Fast
- Consider improving fraction F of the system with a speedup S.
- T(new) = T(base) x (1-F) + T(base) x F / S = T(base) x [(1-F) + F/S]
- Speedup = T(base) / T(new) = 1 / [(1-F) + F/S]
- Max Speedup = 1 / (1-F)
Example
- Suppose FP computation is 1/4 of an application's execution time
- Maximum benefit from optimizing FP unit is 1.3x (= 1/0.75)
- Multiprocessor systems were the original application of this law
- Accounts for diminishing marginal returns
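Amdahl's Law and its limit can be sketched directly from the formulas above. The fraction F = 1/4 is the slide's floating-point example; the specific speedup values S tried below are illustrative.

```python
def amdahl_speedup(fraction, speedup):
    """Overall speedup when a fraction F of execution time is sped up by S."""
    return 1.0 / ((1.0 - fraction) + fraction / speedup)


def amdahl_limit(fraction):
    """Upper bound on overall speedup as S grows without bound."""
    return 1.0 / (1.0 - fraction)


F = 0.25  # FP computation is 1/4 of execution time
limit = amdahl_limit(F)
for s in (2, 10, 100):
    print(f"S={s:>3}: overall speedup {amdahl_speedup(F, s):.3f}x (limit {limit:.3f}x)")
```

Even a 100x faster FP unit yields less than the 1.33x bound, which illustrates the diminishing marginal returns noted above.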

Processor Power
[Figure: performance relative to the VAX-11/780 on SPECint benchmarks (log scale, 1 to 10000), 1978-2006. Annotated growth rates: 25%/year, 52%/year, ??%/year. Source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th Edition, 2006.]

Power and Energy
Definitions
- Energy (Joules) = a x C x V^2
- Power (Watts) = a x C x V^2 x f
Power Factors and Trends
- activity (a): function of application resource usage
- capacitance (C): function of design; scales with area
- voltage (V): constrained by leakage, which increases as V falls
- frequency (f): varies with pipelining and transistor speeds
- Models in cycle-accurate simulators (e.g., Princeton Wattch)
Dynamic Voltage and Frequency Scaling (DVFS)
- P-states: move between operational modes with different V, f
- Intel TurboBoost: increase V, f for short durations without violating thermal design point (TDP)
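The power equation and the effect of moving between P-states can be sketched as follows. The P-state table is entirely hypothetical (real processors define their own voltage and frequency pairs), and the activity and capacitance values are placeholders that cancel in the ratio.

```python
def dynamic_power(activity, capacitance, voltage, frequency):
    """Power (Watts) = a x C x V^2 x f."""
    return activity * capacitance * voltage ** 2 * frequency


# Hypothetical P-states as (relative voltage, relative frequency) pairs.
P_STATES = {"P0": (1.00, 1.00), "P1": (0.90, 0.90), "P2": (0.80, 0.80)}


def relative_power(state, activity=0.1, capacitance=1e-9):
    """Dynamic power in a P-state relative to P0, with a and C held fixed."""
    v, f = P_STATES[state]
    v0, f0 = P_STATES["P0"]
    return (dynamic_power(activity, capacitance, v, f)
            / dynamic_power(activity, capacitance, v0, f0))


for name in P_STATES:
    print(name, round(relative_power(name), 3))
```

When voltage and frequency are lowered together by the same factor, dynamic power falls with the cube of that factor, since V contributes quadratically and f linearly.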

Power and Temperature
Temperature
- Power density (Watts / sq-mm) is a proxy for thermal effects
- Estimate thermal conductivity and resistance to understand processor hot spots (e.g., University of Virginia, HotSpot simulator)
Power Budgets
- Higher power budgets increase packaging cost
- 130W servers, 65W desktops, 10-30W laptops, 1-2W hand-held

Power and Chip-Multiprocessors
Definitions
- Historically, multiprocessors use multiple packages (e.g., IBM Power 3)
- Chip multiprocessor integrates multiple cores on the same die
Multiprocessor Efficiency
- Reduce power with simpler cores
- Recover lost performance with many-core parallelism (e.g., IBM Power 4)

Power and Chip-Multiprocessors
Lower voltages, frequencies
- Voltage, frequency scale together (approximately)
- Power proportional to V^2, f (falls cubically)
- Performance proportional to f (falls linearly)
Example
- Baseline: 1 core at V, f
- Multiprocessor: 4 cores at 0.85V, 0.85f; program is 75% parallel
- 1-core power: 0.61x = 0.85^3
- 1-core performance: 0.85x
- Power impact: 2.44x = 0.61x x 4
- Performance adjusted for parallelism: 2.28x = 1/[0.25 + (0.75 / 4)]
- Performance adjusted for frequency slowdown: 1.94x = 2.28 x 0.85
- Multiprocessor: 1.5% power per 1% performance (= 144%/94%)
- Higher V, f: 3% power per 1% performance (= (1.01^3 - 1)/(1.01 - 1))
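The four-core example can be reproduced step by step. This sketch uses the slide's parameters (4 cores, 0.85 scaling, 75% parallel) and carries full precision, so the final ratio comes out slightly above the 1.5 obtained from the slide's rounded intermediate values.

```python
CORES = 4
SCALE = 0.85      # both voltage and frequency scaled to 0.85x
PARALLEL = 0.75   # fraction of the program that is parallel

core_power = SCALE ** 3                  # V^2 x f, all scaled by 0.85
total_power = CORES * core_power         # relative to the 1-core baseline
parallel_speedup = 1.0 / ((1.0 - PARALLEL) + PARALLEL / CORES)
performance = parallel_speedup * SCALE   # each core also runs at 0.85f

# Percent extra power paid per percent extra performance.
multicore_cost = (total_power - 1.0) / (performance - 1.0)
higher_vf_cost = (1.01 ** 3 - 1.0) / (1.01 - 1.0)

print(f"power {total_power:.2f}x, performance {performance:.2f}x")
print(f"multiprocessor: {multicore_cost:.2f}% power per 1% performance")
print(f"higher V, f:    {higher_vf_cost:.2f}% power per 1% performance")
```

Under these assumptions the multiprocessor buys performance at roughly half the marginal power cost of raising voltage and frequency on a single core.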

Cost
Non-recurring Engineering (NRE)
- Dominated by engineer-years ($200K per engineer-year)
- Mask costs (>$1M per spin)
Chip Cost
- Depends on wafer and chip size, process maturity
Packaging Cost
- Depends on number of pins (e.g., signal + power/ground)
- Depends on thermal design point (e.g., heat sink)
Total Cost of Ownership
- Capital costs (e.g., server procurement cost)
- Operating costs (e.g., electricity)

Yield
Wafers
- Integrated circuits built with multi-step chemical process on wafers
- Cost per wafer depends on wafer size, number of steps
Chip (a.k.a. Die)
- If chips are large, fewer chips per wafer
- Larger chips have lower yield
- Uniform defect density
- Chip cost is proportional to area^2 to area^3
Process Variability
- Yield is non-binary
- Binning for speed grades
- Binning for core count
- Post-fabrication tuning with spares

Acknowledgements
These slides contain material developed and copyrighted by
- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- Alvin Lebeck (Duke)
- David Patterson (UCB)
- Daniel Sorin (Duke)