Outline. 1 Reiteration. 2 Dynamic scheduling - Tomasulo. 3 Superscalar, VLIW. 4 Speculation. 5 ILP limitations. 6 What we have done so far.

Lecture 5: EIT090 Computer Architecture
Anders Ardö, EIT Electrical and Information Technology, Lund University
Sept. 30, 2009

1 Reiteration

Instruction Level Parallelism - ILP
  - ILP: overlap the execution of unrelated instructions (pipelining).
  - Two main approaches:
      DYNAMIC = hardware detects parallelism
      STATIC = software detects parallelism
  - Often a mix of both.
  - Pipeline CPI = Ideal CPI + Structural stalls + Data hazard stalls + Control stalls

Why loop unrolling works
  - Longer sequences of straight-line code without branches (longer basic blocks) allow easier static rescheduling by the compiler.
  - Longer basic blocks also facilitate dynamic rescheduling, such as scoreboarding and Tomasulo's algorithm.
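The pipeline CPI relation above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name and the stall rates are made up for illustration and are not taken from the lecture. Each stall term is assumed to be expressed in stall cycles per instruction.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Effective CPI; every stall term is in stall cycles per instruction."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls


# Hypothetical stall rates, chosen only to illustrate the arithmetic.
cpi = pipeline_cpi(ideal_cpi=1.0, structural_stalls=0.05,
                   data_hazard_stalls=0.25, control_stalls=0.20)
print(f"Pipeline CPI = {cpi:.2f}")
```

Under these assumed numbers, removing all data hazard stalls (for example by rescheduling) would lower the CPI by 0.25, which is why the techniques in this lecture target the individual stall terms.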

Dynamic Branch Prediction
  - Branches limit performance because of:
      branch penalties
      limits to the available Instruction Level Parallelism
  - Solution: dynamic branch prediction to predict the outcome of conditional branches.
  - Benefits:
      reduces the time until the branch condition is known
      reduces the time to calculate the branch target address

Dependencies
  - Two instructions must be independent in order to execute in parallel.
  - There are three general types of dependencies that limit parallelism:
      data dependencies
      name dependencies
      control dependencies
  - Dependencies are properties of the program.
  - Whether a dependency leads to a hazard or not is a property of the pipeline implementation.

Scoreboard pipeline
  - The goal of scoreboarding is to maintain an execution rate of one instruction per clock cycle by executing each instruction as early as possible.
  - Instructions execute out of order when there are sufficient resources and no data dependencies.
  - A scoreboard is a hardware unit that keeps track of the instructions that are in the process of being executed, the functional units that are doing the executing, and the registers that will hold the results of those units.
  - A scoreboard centrally performs all hazard detection and resolution and thus controls the instruction progression from one step to the next.

Summary
  - ILP: rescheduling and loop unrolling are important to take advantage of potential Instruction Level Parallelism.
  - Dynamic instruction scheduling:
      an alternative to compile-time scheduling
      does not need recompilation to increase performance
      used in most new processor implementations
  - Dynamic branch prediction reduces branch penalties by early prediction of conditional branch outcomes.
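The data and name dependencies listed above can be made concrete with a small classifier. This is a sketch under stated assumptions: the `Instr` tuple and the `dependencies` function are hypothetical helpers, only register operands are considered, and control dependencies are not modelled. RAW is a true data dependence, while WAR and WAW are the two kinds of name dependence.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str      # register written
    srcs: tuple    # registers read


def dependencies(first: Instr, second: Instr) -> set:
    """Classify the register dependencies of `second` on the earlier `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")   # data dependence: second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")   # name dependence (anti): second overwrites what first reads
    if second.dest == first.dest:
        found.add("WAW")   # name dependence (output): both write the same register
    return found


# Illustrative instruction sequence.
ld = Instr("LD", "F6", ("R2",))
divd = Instr("DIVD", "F10", ("F0", "F6"))
addd = Instr("ADDD", "F6", ("F8", "F2"))

print(dependencies(ld, divd))    # DIVD reads the F6 that LD writes
print(dependencies(divd, addd))  # ADDD overwrites the F6 that DIVD reads
print(dependencies(ld, addd))    # LD and ADDD both write F6
```

The classifier only reports properties of the program. Whether any of them turns into a stall depends on the pipeline that runs the code.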

Lecture 5 agenda
  - Chapters 2.4-2.8, 3.1-3.4 in "Computer Architecture"

2 Dynamic scheduling - Tomasulo

Scoreboard pipeline
  - Issue: decode and check for structural hazards.
  - Read operands: wait until there are no data hazards, then read the operands.
  - All data hazards are handled by the scoreboard.

Limitations with Scoreboard
  - The number of scoreboard entries (window size)
  - The number and types of functional units
  - The number of datapaths to registers
  - The presence of name dependencies
  Tomasulo's algorithm addresses the last two limitations.

Tomasulo's Algorithm
  - Another dynamic instruction scheduling algorithm
  - Developed for the IBM 360/91, a few years after the CDC 6600 (Scoreboard)
  - Goal: high performance without compiler support
  - Differences between Tomasulo and Scoreboard:
      control and buffers are distributed with the FUs (called reservation stations) vs. centralized in the Scoreboard
      register names in instructions are replaced by pointers to reservation station buffers (HW register renaming)
      the Common Data Bus broadcasts results to all FUs
      loads and stores are treated as FUs as well

Tomasulo Organization [figure]

Three Stages of Tomasulo Alg.
  1. Issue: get instruction from FP Op Queue
     If a reservation station is free (no structural hazard), the instruction is issued together with its operands (renames registers).
  2. Execution: operate on operands (EX)
     When both operands are ready, execute; if not ready, watch the Common Data Bus (CDB) for operands (snooping).
  3. Write result: finish execution (WB)
     Write on the CDB to all awaiting functional units; mark the reservation station available.
  - Normal bus: data + destination
  - Common Data Bus: data + source (snooping)

Tomasulo example, cycle 0 [figure]
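The three stages above can be sketched as a small cycle-by-cycle simulation. This is only an illustration under simplifying assumptions, not the IBM 360/91 design: there is one shared pool of reservation stations, a single CDB, no loads or stores, and the latencies, the `RSn` tag names and the `tomasulo` function itself are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed execution latencies in cycles (illustrative, not from the lecture).
LATENCY = {"ADDD": 2, "SUBD": 2, "MULTD": 10}
OPS = {"ADDD": lambda a, b: a + b,
       "SUBD": lambda a, b: a - b,
       "MULTD": lambda a, b: a * b}


@dataclass
class Station:
    tag: str                 # station name; doubles as the renamed register
    op: str
    dest: str
    vj: Optional[float]      # operand values, once known
    vk: Optional[float]
    qj: Optional[str]        # tags of the stations producing missing operands
    qk: Optional[str]
    remaining: int           # execution cycles left


def tomasulo(program, regs, n_stations=3):
    regs = dict(regs)
    producer = {}            # register -> tag of the station that will write it
    stations, queue = [], list(program)
    cycle, next_tag, finished = 0, 0, {}
    while queue or stations:
        cycle += 1
        # 3. Write result: one finished station broadcasts (tag, value) on the CDB.
        done = next((s for s in stations if s.remaining == 0), None)
        if done:
            value = OPS[done.op](done.vj, done.vk)
            for s in stations:                       # waiting stations snoop the bus
                if s.qj == done.tag:
                    s.vj, s.qj = value, None
                if s.qk == done.tag:
                    s.vk, s.qk = value, None
            if producer.get(done.dest) == done.tag:  # only the latest writer updates the register
                regs[done.dest] = value
                del producer[done.dest]
            stations.remove(done)                    # station becomes available again
            finished[done.tag] = cycle
        # 2. Execute: stations holding both operands count down their latency.
        for s in stations:
            if s.qj is None and s.qk is None and s.remaining > 0:
                s.remaining -= 1
        # 1. Issue: in program order, if a station is free; operands are renamed here.
        if queue and len(stations) < n_stations:
            op, dest, src1, src2 = queue.pop(0)
            next_tag += 1
            tag = f"RS{next_tag}"
            qj, qk = producer.get(src1), producer.get(src2)
            stations.append(Station(tag, op, dest,
                                    None if qj else regs[src1],
                                    None if qk else regs[src2],
                                    qj, qk, LATENCY[op]))
            producer[dest] = tag
    return regs, finished


program = [("MULTD", "F0", "F2", "F4"),
           ("ADDD", "F6", "F0", "F8"),    # must wait for MULTD (RAW on F0)
           ("ADDD", "F8", "F10", "F2")]   # WAR on F8 with the previous ADDD
regs, finished = tomasulo(program, {"F2": 2.0, "F4": 3.0, "F8": 1.0, "F10": 5.0})
print(regs["F0"], regs["F6"], regs["F8"], finished)
```

In this run the second ADDD completes before the MULTD, yet the first ADDD still computes with the old value of F8 because it copied that value into its reservation station at issue.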

Tomasulo example, cycles 1 to 4 [figures]

Tomasulo example, cycles 5 to 8 [figures]

Tomasulo example, cycle 10 [figure]

Elimination of WAR hazards
  Example:
      LD   F6, 34(R2)
      ...
      DIVD F10,F0,F6
      ADDD F6,F8,F2
  - ADDD can safely finish before DIVD has read register F6 because:
      DIVD has renamed register F6 to point at the reservation station
      LD broadcasts its result on the Common Data Bus
  - Register renaming can thus be done:
      statically by the compiler
      dynamically by the hardware

Tomasulo example, cycle 11 [figure]

Tomasulo example, cycle 15 [figure]
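The WAR example above can also be resolved statically, in the way a compiler might rename registers. The sketch below is a hypothetical helper, not a real compiler pass: it gives every register write a fresh name and redirects later reads to it. It assumes that an unlimited supply of fresh names (`T0`, `T1`, ...) exists and that they do not clash with architectural registers. The elided instructions of the example are left out.

```python
def rename(program, fresh_prefix="T"):
    """Give every register write a fresh name and redirect later reads to it."""
    current = {}      # architectural register -> its latest renamed version
    renamed = []
    counter = 0
    for op, dest, *srcs in program:
        new_srcs = [current.get(s, s) for s in srcs]   # read the latest version
        new_dest = f"{fresh_prefix}{counter}"          # fresh name for each write
        counter += 1
        current[dest] = new_dest
        renamed.append((op, new_dest, *new_srcs))
    return renamed, current


program = [("LD", "F6", "34(R2)"),
           ("DIVD", "F10", "F0", "F6"),
           ("ADDD", "F6", "F8", "F2")]
renamed, current = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, DIVD reads `T0` while ADDD writes `T2`, so the two no longer share a register name and ADDD may complete first. Only the true dependence from LD to DIVD remains.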

Tomasulo example, cycles 16, 56 and 57 [figures]

Benefits of Tomasulo
  - Distributed hazard detection logic
      distributed reservation stations
      Common Data Bus (CDB) with snooping
  - Elimination of WAR and WAW hazards (register renaming)

Dynamic scheduling - summary
  - tolerates unpredictable delays
  - compile for one pipeline - run effectively on another
  - significant increase in HW complexity
  - out-of-order execution, completion
  - register renaming

3 Superscalar, VLIW

Getting CPI < 1!
  Issuing multiple instructions per clock cycle
  - Superscalar:
      varying number of instructions/cycle (1-8), scheduled by compiler or HW
      IBM Power5, Pentium 4, Sun SuperSparc, DEC Alpha
      either simple hardware with a complicated compiler, or very complex hardware that is simple for the compiler
  - Very Long Instruction Word (VLIW):
      fixed number of instructions (3-5), scheduled by the compiler
      HP/Intel IA-64, Itanium
      simple hardware, difficult for the compiler
      high performance through extensive compiler optimization

Approaches for multiple issue
  Approach     Issue    Hazard detection  Scheduling     Characteristics / examples
  Superscalar  dynamic  HW                static         in-order execution; ARM
  Superscalar  dynamic  HW                dynamic        out-of-order execution
  Superscalar  dynamic  HW                dynamic        speculation; Pentium 4, IBM Power5
  VLIW         static   compiler          static         TI C6x
  EPIC         static   compiler          mostly static  Itanium

Very Long Instruction Word (VLIW)
  - A number of functional units that independently execute instructions in parallel.
  - The compiler decides which instructions can execute in parallel.
  - No hazard detection needed.

Itanium instruction format [figure]

Itanium architecture [figure]

Limits of VLIW
  - Limited Instruction Level Parallelism
      With n functional units and k pipeline stages we need n x k independent instructions to utilize the hardware.
  - Memory and register bandwidth
      With an increasing number of functional units, the number of ports needed at the memory or register file must increase to prevent structural hazards.
  - Code size
      Compiler-scheduled pipeline bubbles take up space in the instruction.
      More aggressive loop unrolling is needed to work well, which also increases code size.
  - No binary code compatibility
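The ILP and code size limits above can be illustrated with a toy bundle packer. This is a rough sketch under stated assumptions, not how a production VLIW compiler schedules: every operation has unit latency, any slot can hold any operation, the input is already free of name dependencies, and both function names are hypothetical.

```python
def required_independent_instructions(n_units, k_stages):
    """Independent instructions needed to keep n pipelined units of depth k busy."""
    return n_units * k_stages


def pack_bundles(instrs, width):
    """Greedily pack (name, dest, srcs) operations into bundles of `width` slots."""
    bundles, ready_at = [], {}     # ready_at: register -> first bundle that may read it
    for name, dest, srcs in instrs:
        slot = max((ready_at.get(s, 0) for s in srcs), default=0)
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1              # bundle is full, try the next one
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(name)
        ready_at[dest] = slot + 1  # result is usable from the next bundle on
    # Unused slots are explicit NOPs that still occupy instruction memory.
    return [b + ["NOP"] * (width - len(b)) for b in bundles]


# A dependent chain a -> b -> c plus one independent operation d.
code = [("a", "r1", []), ("b", "r2", ["r1"]), ("c", "r3", ["r2"]), ("d", "r4", [])]
bundles = pack_bundles(code, width=3)
nops = sum(slot == "NOP" for bundle in bundles for slot in bundle)
print(bundles, nops)
print(required_independent_instructions(n_units=3, k_stages=4))
```

With so little parallelism in the input, five of the nine slots are NOPs. This is the code size cost the slide refers to, and it is the reason aggressive loop unrolling is used to find more independent operations.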

4 Speculation

HW supported speculation
  - A combination of three main ideas:
      dynamic instruction scheduling; takes advantage of ILP
      dynamic branch prediction; allows instruction scheduling across branches
      speculative execution; executes instructions before all control dependencies are resolved
  - Hardware-based speculation uses data-flow execution: instructions execute when their operands are available.

HW vs. SW speculation
  - Advantages:
      dynamic runtime disambiguation of memory addresses
      dynamic branch prediction is often better than static prediction, which limits the performance of SW speculation
      HW speculation can maintain a precise exception model
      can achieve higher performance on older code (without recompilation)
  - Main disadvantage:
      extremely complex implementation and extensive need for hardware resources

Tomasulo extended to handle speculation [figure]

Re-order buffer - ROB
  Data structure:
      entry  instruction type  destination  value  ready
      1
      2
      ...
      n
  - supports speculative execution
  - instructions commit in order
  - precise exceptions

Four steps of Speculative Tomasulo
  1. Issue: get instruction from FP Op Queue
     If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination.
  2. Execution: operate on operands (EX)
     If both operands are ready, execute; if not, watch the CDB for the result and execute when both operands are in the reservation station.
  3. Write result: complete execution
     Write on the Common Data Bus to all awaiting FUs and the reorder buffer; mark the reservation station available.
  4. Commit: update register with reorder result
     When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer (handle misspeculations and precise exceptions).

Misspeculation!
  - Commit with wrong branch prediction:
      When a branch instruction is at the head of the reorder buffer and its prediction was incorrect, remove all instructions from the reorder buffer (flush) and restart execution at the correct instruction.
  - Expensive, so try to recover as early as possible.
  - Performance is sensitive to the branch prediction/speculation mechanism.

Multiple issue and speculation
  - It is possible to extend Tomasulo with both multiple issue and speculation.
  - Major issues: instruction issue and monitoring the CDB.
  - Must be able to handle multiple commits.
  - An alternative to Tomasulo is to use extra physical registers for both architecturally visible registers and temporary values, with register renaming.
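The commit step and the misspeculation flush described above can be sketched with a queue. This is a minimal illustration, assuming a ROB entry with the fields from the data structure slide plus a hypothetical `mispredicted` flag for branches. Stores, exceptions and the restart of instruction fetch are not modelled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    kind: str                      # instruction type: "alu" or "branch"
    dest: Optional[str] = None     # destination register
    value: Optional[int] = None    # result, once written on the CDB
    ready: bool = False
    mispredicted: bool = False     # hypothetical flag, only meaningful for branches


def commit(rob, regs):
    """Commit ready entries from the head, in order; flush on a mispredicted branch."""
    committed = []
    while rob and rob[0].ready:
        entry = rob.popleft()
        committed.append(entry)
        if entry.kind == "branch" and entry.mispredicted:
            rob.clear()            # discard all younger, speculative instructions
            break
        if entry.kind == "alu":
            regs[entry.dest] = entry.value   # architectural state changes only here
    return committed


# In-order commit: F4 is ready but must wait behind the unfinished F2.
regs = {}
rob = deque([RobEntry("alu", "F0", 1, True),
             RobEntry("alu", "F2"),
             RobEntry("alu", "F4", 3, True)])
first = commit(rob, regs)
rob[0].value, rob[0].ready = 2, True
second = commit(rob, regs)

# Misspeculation: the result computed after the branch never reaches the registers.
spec_regs = {}
spec_rob = deque([RobEntry("branch", ready=True, mispredicted=True),
                  RobEntry("alu", "F6", 9, True)])
commit(spec_rob, spec_regs)
print(regs, spec_regs, len(spec_rob))
```

Because registers are written only at commit, a flush needs no undo step: the speculative results simply never become architecturally visible.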

Tomasulo speculation - increased complexity [figure]

Dynamic scheduling, speculation - summary
  - tolerates unpredictable delays
  - compile for one pipeline - run effectively on another
  - allows speculation
  - multiple branches
  - in-order commit
  - precise exceptions
  - time, energy; recovery
  - significant increase in HW complexity
  - out-of-order execution, completion
  - register renaming

5 ILP limitations

ILP
  - How much performance can we get by utilizing ILP?

A model of an ideal processor
  Provides a base for ILP measurements.
  - No structural hazards
  - Register renaming: infinite virtual registers, and all WAW and WAR hazards avoided
  - Machine with perfect speculation
  - Branch prediction: perfect, no mispredictions
  - Jump prediction: all jumps perfectly predicted
  - Memory-address alias analysis: addresses are known, and a store can be moved before a load provided the addresses are not equal
  - Perfect caches
  There are only true data dependencies left!

Upper Limit to ILP [figure]

Impact window size [figure]

More realistic HW: Branch impact [figure]
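With only true data dependencies left, the ILP of the ideal processor is bounded by the longest dependence chain in the program. The sketch below estimates that bound for a short instruction list. It assumes unit latency for every instruction, register operands only and an unlimited window. The function name and the input format are hypothetical.

```python
def dataflow_ilp(program):
    """Return (critical_path_length, ilp) for a list of (dest, srcs), unit latency."""
    if not program:
        return 0, 0.0
    finish = {}                       # register -> cycle in which its value is produced
    depth = 0
    for dest, srcs in program:
        # An instruction starts as soon as its last true producer has finished.
        start = max((finish.get(s, 0) for s in srcs), default=0)
        finish[dest] = start + 1      # a rewrite of dest replaces the old producer (renaming)
        depth = max(depth, finish[dest])
    return depth, len(program) / depth


program = [("r1", []), ("r2", []),
           ("r3", ["r1", "r2"]),
           ("r4", ["r3"]),
           ("r5", []),
           ("r6", ["r5"])]
depth, ilp = dataflow_ilp(program)
print(depth, ilp)
```

Six instructions with a longest chain of three give an ILP of 2 in this model. Window size, branch prediction and register count can only lower that figure on realistic hardware.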

More realistic HW: Register impact [figure]

Summary
  - Software (compiler) tricks:
      loop unrolling
      static instruction scheduling (with register renaming)
      ... and more
  - Hardware tricks:
      dynamic instruction scheduling
      dynamic branch prediction
      multiple issue: superscalar, VLIW
      speculative execution
      ... and more

6 What we have done so far

AMD Phenom CPU [figure]

Intel Core2 [figure]

Intel Core2 chip (Nehalem) [figure]