Amdahl's Law in the Multicore Era


Amdahl's Law in the Multicore Era
Mark D. Hill and Michael R. Marty, University of Wisconsin-Madison
August 2008 @ Semiahmoo Workshop
IBM's Dr. Thomas Puzak: "Everyone knows Amdahl's Law, but quickly forgets it!"
2008 Multifacet Project, University of Wisconsin-Madison

Executive Summary
Develop a corollary to Amdahl's Law: a simple model of multicore hardware that complements Amdahl's software model. Fixed chip resources for cores; core performance improves sub-linearly with resources.
Research implications:
(1) Need dramatic increases in parallelism (no surprise): 99% parallel limits 256 cores to speedup 72. New Moore's Law: double parallelism every two years?
(2) Many larger chips need increased core performance.
(3) HW/SW for asymmetric designs (one/few cores enhanced).
(4) HW/SW for dynamic designs (serial ↔ parallel).
8/6/2008

Outline
Multicore Motivation & Research Paper Trends
Recall Amdahl's Law
A Model of Multicore Hardware
Symmetric Multicore Chips
Asymmetric Multicore Chips
Dynamic Multicore Chips
Caveats & Wrap Up

How Has Architecture Research Prepared?
[Chart: percent multiprocessor papers in ISCA, 1973-2007, showing an "SMP bulge," the lead-up to multicore, and the question "What next?"]
Source: Hill & Rajwar, The Rise & Fall of Multiprocessor Papers in ISCA, http://www.cs.wisc.edu/~markhill/mp2001.html (3/2001)

How Has Architecture Research Prepared? Reacted?
[Chart: percent multiprocessor papers in ISCA, 1973-2007, with the multicore ramp marked. Will architecture research overreact?]
Source: Hill, 2/2008

What About PL/Compilers (PLDI) Research?
[Chart: percent multiprocessor papers in PLDI, 1988-2007: the end of a small SMP bulge, then a gentle multicore ramp in the lead-up to multicore. What next?]
Source: Steve Jackson, 3/2008

What About Systems (SOSP/OSDI) Research?
[Chart: percent multiprocessor papers in SOSP/OSDI, 1967-2007 (SOSP odd years only; later OSDI even & SOSP odd): a small SMP bulge, the lead-up to multicore, but NO multicore ramp (yet). What next?]
Source: Michael Swift, 3/2008

Outline
Multicore Motivation & Research Paper Trends
Recall Amdahl's Law
A Model of Multicore Hardware
Symmetric Multicore Chips
Asymmetric Multicore Chips
Dynamic Multicore Chips
Caveats & Wrap Up

Recall Amdahl's Law
Begins with a simple software assumption (a limit argument):
Fraction F of execution time is perfectly parallelizable, with no overhead for scheduling, communication, synchronization, etc.
Fraction 1 - F is completely serial.
Time on 1 core = (1 - F)/1 + F/1 = 1
Time on N cores = (1 - F)/1 + F/N

Recall Amdahl's Law [1967]
Amdahl's Speedup = 1 / ((1 - F) + F/N)
For mainframes, Amdahl expected 1 - F = 35%:
For a 4-processor system, speedup = 2
For an infinite-processor system, speedup < 3
Therefore, stay with mainframes with one/few processors.
Amdahl's Law applied through the minicomputer and PC eras. What about the multicore era?
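The slide's law is a one-liner; a minimal sketch in Python (the function name is my own) reproduces Amdahl's 1967 argument and the talk's headline number:

```python
def amdahl_speedup(f, n):
    """Speedup when fraction f of execution time is perfectly parallel on n processors."""
    return 1.0 / ((1.0 - f) + f / n)

# Amdahl's mainframe estimate: serial fraction 1 - F = 35%, so F = 0.65
print(round(amdahl_speedup(0.65, 4), 2))      # 1.95 (about 2 on 4 processors)
print(round(amdahl_speedup(0.65, 10**9), 2))  # 2.86 (< 3 even with "infinite" processors)

# The talk's later headline: 99% parallel limits 256 cores to speedup ~72
print(round(amdahl_speedup(0.99, 256)))
```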

Designing Multicore Chips Is Hard
Designers must confront single-core design options:
Instruction fetch, wakeup, select
Execution unit configuration & operand bypass
Load/store queue(s) & data cache
Checkpoint, log, runahead, commit
As well as additional design degrees of freedom:
How many cores? How big is each?
Shared caches: how many levels? How many banks?
Memory interface: how many banks?
On-chip interconnect: bus, switched, ordered?

Want a Simple Multicore Hardware Model
To complement Amdahl's simple software model.
(1) Chip hardware roughly partitioned into:
Multiple cores (with L1 caches)
The rest (L2/L3 cache banks, interconnect, pads, etc.)
Changing core size/number does NOT change the rest.
(2) Resources for multiple cores bounded:
Bound of N resources per chip for cores, due to area, power, cost ($$$), or multiple factors.
Bound = power? (But our pictures use area.)

Want a Simple Multicore Hardware Model, cont.
(3) Micro-architects can improve single-core performance using more of the bounded resource.
A simple base core: consumes 1 Base Core Equivalent (BCE) of resources; provides performance normalized to 1.
An enhanced core (in the same process generation): consumes R BCEs; provides performance Perf(R).
What does the function Perf(R) look like?

More on Enhanced Cores (performance Perf(R), consuming R BCEs of resources)
If Perf(R) > R, always enhance the core: it cost-effectively speeds up both sequential & parallel work. Therefore:
Equations assume Perf(R) < R.
Graphs assume Perf(R) = square root of R: 2x performance for 4 BCEs, 3x for 9 BCEs, etc.
Why? It models diminishing returns with no coefficients, and is consistent with Alpha EV4/5/6 data [Kumar 11/2005] & Intel's Pollack's Law.
How to speed up the enhanced core? <Insert favorite or TBD micro-architectural ideas here>

Outline
Multicore Motivation & Research Paper Trends
Recall Amdahl's Law
A Model of Multicore Hardware
Symmetric Multicore Chips
Asymmetric Multicore Chips
Dynamic Multicore Chips
Caveats & Wrap Up

How Many (Symmetric) Cores per Chip?
Each chip is bounded to N BCEs (for all cores); each core consumes R BCEs.
Assume symmetric multicore = all cores identical.
Therefore, N/R cores per chip ((N/R) * R = N).
For an N = 16 BCE chip: sixteen 1-BCE cores, four 4-BCE cores, or one 16-BCE core.

Performance of Symmetric Multicore Chips
Serial fraction 1 - F uses 1 core at rate Perf(R): serial time = (1 - F) / Perf(R)
Parallel fraction F uses N/R cores at rate Perf(R) each: parallel time = F / (Perf(R) * (N/R)) = F*R / (Perf(R)*N)
Therefore, w.r.t. one base core:
Symmetric Speedup = 1 / ((1 - F)/Perf(R) + F*R/(Perf(R)*N))
Implications? Enhanced cores speed up both the serial & parallel phases.
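The symmetric model above can be sketched directly (function names are my own; Perf(R) = sqrt(R) is the talk's assumed core-performance model):

```python
import math

def perf(r):
    """Assumed core-performance model from the talk: Perf(R) = sqrt(R)."""
    return math.sqrt(r)

def symmetric_speedup(f, n, r):
    """N-BCE chip built as N/R identical R-BCE cores."""
    serial_time = (1.0 - f) / perf(r)
    parallel_time = f * r / (perf(r) * n)
    return 1.0 / (serial_time + parallel_time)

# Slide example: F = 0.5, N = 16, R = 16 (one big core) gives speedup 4
print(symmetric_speedup(0.5, 16, 16))  # 4.0
```

With F = 0.9 the same sweep favors R = 2 (eight cores, speedup about 6.7), matching the charts that follow.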

Symmetric Multicore Chip, N = 16 BCEs
[Chart: symmetric speedup vs. R (1 to 16 BCEs; 16, 8, 4, 2, or 1 cores) for F = 0.5.]
F = 0.5: optimal R = 16, cores = 1, speedup = 4 = 1/(0.5/4 + 0.5*16/(4*16)).
Need to increase parallelism to make multicore optimal!

Symmetric Multicore Chip, N = 16 BCEs
[Chart adds F = 0.9: R = 2, cores = 8, speedup = 6.7 (vs. F = 0.5: R = 16, cores = 1, speedup = 4).]
At F = 0.9, multicore is optimal, but speedup is limited. Need to obtain even more parallelism!

Symmetric Multicore Chip, N = 16 BCEs
[Chart: curves for F = 0.5, 0.9, 0.975, 0.99, 0.999. As F → 1: R = 1, cores = 16, speedup → 16.]
F matters: Amdahl's Law applies to multicore chips. MANY researchers should target parallelism F first.

Need a Third Moore's Law?
Technologist's Moore's Law: double transistors per chip every 2 years. Slows or stops: TBD.
Microarchitect's Moore's Law: double performance per core every 2 years. Slowed or stopped: early 2000s.
Multicore's Moore's Law: double cores per chip every 2 years & double parallelism per workload every 2 years, aided by architectural support for parallelism = double performance per chip every 2 years. Starting now.
Software as producer, not consumer, of performance gains!

Symmetric Multicore Chip, N = 16 BCEs
[Chart: curves for F = 0.5 to 0.999; recall F = 0.9, R = 2, cores = 8, speedup = 6.7.]
As Moore's Law enables N to go from 16 to 256 BCEs: more cores? Enhanced cores? Or both?

Symmetric Multicore Chip, N = 256 BCEs
[Chart: symmetric speedup vs. R (1 to 256 BCEs) for F = 0.5 to 0.999.]
As F → 1: R = 1 (vs. 1), cores = 256 (vs. 16), speedup = 204 (vs. 16). MORE CORES!
F = 0.99: R = 3 (vs. 1), cores = 85 (vs. 16), speedup = 80 (vs. 13.9). MORE CORES & ENHANCE CORES!
F = 0.9: R = 28 (vs. 2), cores = 9 (vs. 8), speedup = 26.7 (vs. 6.7). ENHANCE CORES!
As Moore's Law increases N, we often need enhanced core designs. Some architecture researchers should target single-core performance.

Software for Large Symmetric Multicore Chips
F matters: Amdahl's Law applies to multicore chips.
N = 256: F = 0.9 → speedup 27 @ R = 28; F = 0.99 → speedup 80 @ R = 3; F = 0.999 → speedup 204 @ R = 1.
N = 1024: F = 0.9 → speedup 53 @ R = 114; F = 0.99 → speedup 161 @ R = 10; F = 0.999 → speedup 506 @ R = 1.
Researchers must target parallelism F first.
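The table above comes from sweeping the enhanced-core size R for each (F, N). A sketch of that sweep (names my own; integer R is allowed to be any value from 1 to N, as in the slides' 85-core R = 3 design; Perf(R) = sqrt(R) assumed):

```python
import math

def symmetric_speedup(f, n, r):
    p = math.sqrt(r)  # assumed Perf(R) = sqrt(R)
    return 1.0 / ((1.0 - f) / p + f * r / (p * n))

def best_symmetric(f, n):
    """Sweep core size R over 1..N BCEs; return (best R, best speedup)."""
    best_r = max(range(1, n + 1), key=lambda r: symmetric_speedup(f, n, r))
    return best_r, symmetric_speedup(f, n, best_r)

for n in (256, 1024):
    for f in (0.9, 0.99, 0.999):
        r, s = best_symmetric(f, n)
        print(f"N={n}, F={f}: speedup {s:.0f} at R={r}")
```

Running this reproduces the slide's numbers, e.g. speedup 27 at R = 28 for F = 0.9, N = 256.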

Aside: Cost-Effective Parallel Computing
Isn't Speedup(C) < C inefficient? (C = #cores)
Much of a computer's cost is OUTSIDE the processor cores [Wood & Hill, IEEE Computer 2/1995].
Let Costup(C) = Cost(C)/Cost(1). Parallel computing is cost-effective when Speedup(C) > Costup(C).
1995 SGI PowerChallenge w/ 500MB: Costup(32) = 8.6.
Multicores have even lower costups!
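The Wood & Hill criterion is simple enough to state as code. A minimal sketch (names my own; the 32-processor speedup of 10 is a hypothetical value for illustration, not from the slide; only Costup(32) = 8.6 is):

```python
def costup(cost_c, cost_1):
    """Costup(C) = Cost(C) / Cost(1)."""
    return cost_c / cost_1

def cost_effective(speedup_c, costup_c):
    """Wood & Hill: parallelism pays off when speedup exceeds costup."""
    return speedup_c > costup_c

# 1995 SGI PowerChallenge example: Costup(32) = 8.6, so even a
# hypothetical 32-processor speedup of only 10 is cost-effective.
print(cost_effective(10.0, 8.6))  # True
```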

Outline
Multicore Motivation & Research Paper Trends
Recall Amdahl's Law
A Model of Multicore Hardware
Symmetric Multicore Chips
Asymmetric Multicore Chips
Dynamic Multicore Chips
Caveats & Wrap Up

Asymmetric (Heterogeneous) Multicore Chips
Symmetric multicore required all cores equal. Why not enhance some (but not all) cores?
For Amdahl's simple software assumptions: one enhanced core; the others are base cores.
How? <Fill in favorite micro-architecture techniques here>
The model ignores the design cost of an asymmetric design.
How does this affect our hardware model?

How Many Cores per Asymmetric Chip?
Each chip is bounded to N BCEs (for all cores).
One R-BCE core leaves N - R BCEs; use them for N - R base cores.
Therefore, 1 + N - R cores per chip.
For an N = 16 BCE chip: symmetric: four 4-BCE cores; asymmetric: one 4-BCE core & twelve 1-BCE base cores.

Performance of Asymmetric Multicore Chips
Serial fraction 1 - F is the same, so serial time = (1 - F) / Perf(R).
Parallel fraction F: one core at rate Perf(R) plus N - R cores at rate 1, so parallel time = F / (Perf(R) + N - R).
Therefore, w.r.t. one base core:
Asymmetric Speedup = 1 / ((1 - F)/Perf(R) + F/(Perf(R) + N - R))
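The asymmetric formula above, as a sketch (names my own; Perf(R) = sqrt(R) is the talk's assumption; note the enhanced core contributes to the parallel phase too):

```python
import math

def asymmetric_speedup(f, n, r):
    """One R-BCE enhanced core plus N - R single-BCE base cores."""
    p = math.sqrt(r)  # assumed Perf(R) = sqrt(R)
    serial_time = (1.0 - f) / p
    parallel_time = f / (p + n - r)  # enhanced core helps in the parallel phase as well
    return 1.0 / (serial_time + parallel_time)

# Slide's key point for N = 256: F = 0.99 at R = 41 gives speedup ~166 (vs. 80 symmetric)
print(round(asymmetric_speedup(0.99, 256, 41)))  # 166
```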

Asymmetric Multicore Chip, N = 256 BCEs
[Chart: asymmetric speedup vs. R (1 to 256 BCEs) for F = 0.5 to 0.999; configurations range from 1 + 255 cores down to 1 core.]
Number of cores = 1 (enhanced) + 256 - R (base).
How do asymmetric & symmetric speedups compare?

Recall Symmetric Multicore Chip, N = 256 BCEs
[Chart: symmetric speedup curves for F = 0.5 to 0.999; recall F = 0.9, R = 28, cores = 9, speedup = 26.7.]

Asymmetric Multicore Chip, N = 256 BCEs
[Chart: asymmetric speedup curves for F = 0.5 to 0.999.]
F = 0.99: R = 41 (vs. 3), cores = 216 (vs. 85), speedup = 166 (vs. 80).
F = 0.9: R = 118 (vs. 28), cores = 139 (vs. 9), speedup = 65.6 (vs. 26.7).
Asymmetric offers greater speedup potential than symmetric. In the paper: as Moore's Law increases N, asymmetric gets better.
Some architecture researchers should target asymmetric multicores.

Asymmetric Multicore: 3 Software Issues
1. Schedule computation (e.g., when to use the bigger core)
2. Manage locality (e.g., sending code or data can sap gains)
3. Synchronize (e.g., asymmetric cores reaching a barrier)
At what level? Application programmer, library author, compiler, runtime system, operating system, hypervisor (virtual machine monitor), or hardware? Higher levels offer more leverage (?); lower levels have more info (?).

Outline
Multicore Motivation & Research Paper Trends
Recall Amdahl's Law
A Model of Multicore Hardware
Symmetric Multicore Chips
Asymmetric Multicore Chips
Dynamic Multicore Chips
Caveats & Wrap Up

Dynamic Multicore Chips, Take 1
Why NOT have your cake and eat it too?
N base cores for best parallel performance; harness R cores together for serial performance.
How? DYNAMICALLY harness cores together. <Insert favorite or TBD techniques here>
[Figure: parallel mode vs. sequential mode]

Dynamic Multicore Chips, Take 2
Let POWER provide the limit of N BCEs, while area is unconstrained (to first order).
[Figure: parallel mode vs. sequential mode]
How to model these two chips? Result: N base cores for parallel; a large core for serial, when the Simultaneous Active Fraction (SAF) < 1/2.
[Chakraborty, Wells, & Sohi, Wisconsin CS-TR-2007-1607]

Performance of Dynamic Multicore Chips
N base cores, with R BCEs harnessed together for serial work.
Serial fraction 1 - F uses R BCEs at rate Perf(R): serial time = (1 - F) / Perf(R)
Parallel fraction F uses N base cores at rate 1 each: parallel time = F / N
Therefore, w.r.t. one base core:
Dynamic Speedup = 1 / ((1 - F)/Perf(R) + F/N)
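The dynamic formula above, as a sketch (names my own; Perf(R) = sqrt(R) is the talk's assumption):

```python
import math

def dynamic_speedup(f, n, r):
    """R BCEs fused into one big core for serial work; N base cores in parallel mode."""
    p = math.sqrt(r)  # assumed Perf(R) = sqrt(R)
    return 1.0 / ((1.0 - f) / p + f / n)

# Slide's key point for N = 256: F = 0.99 with all R = 256 BCEs harnessed
# serially gives speedup ~223 (vs. 166 asymmetric)
print(round(dynamic_speedup(0.99, 256, 256)))  # 223
```

Because the serial term only shrinks as R grows (the harnessed cores cost nothing in parallel mode), the model always favors harnessing all N BCEs.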

Recall Asymmetric Multicore Chip, N = 256 BCEs
[Chart: asymmetric speedup curves for F = 0.5 to 0.999; recall F = 0.99, R = 41, cores = 216, speedup = 166.]
What happens with a dynamic chip?

Dynamic Multicore Chip, N = 256 BCEs
[Chart: dynamic speedup curves for F = 0.5 to 0.999.]
F = 0.99: R = 256 (vs. 41), cores = 256 (vs. 216), speedup = 223 (vs. 166).
Dynamic offers greater speedup potential than asymmetric.
Architecture researchers should target dynamically harnessing cores.

Dynamic Asymmetric Multicore: 3 Software Issues
1. Schedule computation (e.g., when to use the bigger core)
2. Manage locality (e.g., sending code or data can sap gains)
3. Synchronize (e.g., asymmetric cores reaching a barrier)
At what level? Application programmer, library author, compiler, runtime system, operating system, hypervisor (virtual machine monitor), or hardware? Higher levels offer more leverage (?); lower levels have more info (?).
Dynamic challenges exceed asymmetric ones, and dynamic chips are likely, due to power.

Outline
Multicore Motivation & Research Paper Trends
Recall Amdahl's Law
A Model of Multicore Hardware
Symmetric Multicore Chips
Asymmetric Multicore Chips
Dynamic Multicore Chips
Caveats & Wrap Up

Three Multicore Amdahl's Laws
Symmetric Speedup = 1 / ((1 - F)/Perf(R) + F*R/(Perf(R)*N))   [N/R enhanced cores]
Asymmetric Speedup = 1 / ((1 - F)/Perf(R) + F/(Perf(R) + N - R))   [1 enhanced & N - R base cores]
Dynamic Speedup = 1 / ((1 - F)/Perf(R) + F/N)   [N base cores]
(In each, the sequential section runs on the enhanced core(s) and the parallel section uses all available cores.)
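The three laws side by side make the deck's ordering concrete. A comparison sketch (names my own; Perf(R) = sqrt(R) assumed; each model is optimized over the core size R):

```python
import math

def perf(r):
    return math.sqrt(r)  # assumed Perf(R) = sqrt(R)

def symmetric(f, n, r):
    return 1 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def asymmetric(f, n, r):
    return 1 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

def dynamic(f, n, r):
    return 1 / ((1 - f) / perf(r) + f / n)

def best(model, f, n):
    """Best achievable speedup for a model, sweeping R over 1..N."""
    return max(model(f, n, r) for r in range(1, n + 1))

f, n = 0.99, 256
print(round(best(symmetric, f, n)),
      round(best(asymmetric, f, n)),
      round(best(dynamic, f, n)))  # 80 166 223
```

For F = 0.99 and N = 256 this recovers the deck's progression: symmetric 80, asymmetric 166, dynamic 223.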

Software Model Charges, 1 of 2
"The serial fraction is not totally serial": can extend the software model to tree algorithms, etc.
"The parallel fraction is not totally parallel": can extend for varying or bounded parallelism.
"The serial/parallel fraction may change": can extend for weak scaling [Gustafson, CACM '88], i.e., run a larger, more parallel problem in constant time. But prudent architectures support strong scaling.

Software Model Charges, 2 of 2
"Synchronization, communication, and scheduling effects?": can extend for overheads and imbalance.
"Software challenges for asymmetric multicore are worse": can extend for asymmetric scheduling, etc.
"Software challenges for dynamic multicore are greater": can extend to model the overheads required to facilitate it.
"Future software will be totally parallel (see my work)": I'm skeptical; that's not even true for MapReduce.

Hardware Model Charges, 1 of 2
"Naïve to consider total resources for cores fixed": can extend the hardware model to how core changes affect the rest.
"Naïve to bound cores by one resource (esp. area)": can extend for a Pareto-optimal mix of area, dynamic/static power, complexity, reliability, ...
"Naïve to ignore the challenges of off-chip bandwidth limits & the benefits of last-level caching": can extend for modeling these.

Hardware Model Charges, 2 of 2
"Naïve to use performance = square root of resources": can extend, as the equations can use any function.
"We architects can't scale Perf(R) for very large R": true, not yet.
"We architects can't dynamically harness very large R": true, not yet.
So what should computer scientists do about it?

Three-Part Charge
Architects: build more-effective multicore hardware. Don't lament what we can't do; do it! Play with & trash our models [IEEE Computer, July 2008], www.cs.wisc.edu/multifacet/amdahl
Computer scientists: implement a 3rd Moore's Law: double parallelism every two years. Consider symmetric, asymmetric, & dynamic chips.
Finally, we must all work together to keep (cost-)performance gains progressing: parallel programming & parallel computers.

Dynamic Multicore Chip, N = 1024 BCEs
[Chart: dynamic speedup vs. R (1 to 1024 BCEs) for F = 0.5 to 0.999.]
As F → 1 and R → 1024: cores → 1024, speedup → 1024!
NOT possible today. NOT possible EVER, unless we dream & act.


Backup Slides

[Backup charts: symmetric, asymmetric, and dynamic speedup vs. R for N = 16, 256, and 1024 BCEs, each with curves for F = 0.5, 0.9, 0.975, 0.99, and 0.999.]