Lecture 16: Instruction Level Parallelism -- Dynamic Scheduling (OOO) via Tomasulo's Approach

CSE 564 Computer Architecture, Summer 2017
Department of Computer Science and Engineering
Yonghong Yan
yan@oakland.edu
www.secs.oakland.edu/~yan

Topics for Instruction Level Parallelism
- ILP Introduction, Compiler Techniques and Branch Prediction: 3.1, 3.2, 3.3
- Dynamic Scheduling (OOO): 3.4, 3.5 and C.5, C.6 and C.7 (FP pipeline and scoreboard)
- Hardware Speculation and Static Superscalar/VLIW: 3.6, 3.7
- Dynamic Scheduling, Multiple Issue and Speculation: 3.8, 3.9
- ILP Limitations and SMT: 3.10, 3.11, 3.12

Acknowledgment and Copyright
Slides adapted from:
- UC Berkeley course Computer Science 252: Graduate Computer Architecture, by David E. Culler. Copyright (C) 2005 UCB
- UC Berkeley course Computer Science 252: Graduate Computer Architecture, Spring 2012, by John Kubiatowicz. Copyright (C) 2012 UCB
- Computer Science 152: Computer Architecture and Engineering, Spring 2016, by Dr. George Michelogiannakis, UC Berkeley
https://passlab.github.io/cse564/copyrightack.html

Complex Pipelining: Motivation
Why would we want more than our in-order pipeline?
[Figure: in-order pipeline with PC, Inst. Cache, Decode (D), Execute (E), Memory (M), Data Cache and Writeback (W), connected by physical addresses to a Memory Controller and Main Memory (DRAM)]

Complex Pipelining: Motivation
Pipelining becomes complex when we want high performance in the presence of:
- Long latency or partially pipelined floating-point units
  - Not all instructions are floating point or integer
- Memory systems with variable access time
  - For example, cache misses
- Multiple arithmetic and memory units

Floating Point Representation
- IEEE standard 754
- $\text{Value} = (-1)^{s} \times 1.\text{mantissa} \times 2^{(\text{exp} - 127)}$
- Exponent = 0 has special meaning
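The value formula above can be checked on a concrete number. The following is a minimal sketch, assuming the 32-bit single-precision layout (1 sign bit, 8 exponent bits, 23 mantissa bits) that the bias of 127 implies; the function names are illustrative and not part of the lecture.

```python
import struct

def decode_float32(x: float) -> tuple[int, int, int]:
    """Split x (rounded to single precision) into sign, biased exponent, 23-bit mantissa."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

def value_of_normal(sign: int, exponent: int, mantissa: int) -> float:
    """Apply Value = (-1)^s * 1.mantissa * 2^(exp-127); valid for normal numbers only."""
    if not 0 < exponent < 255:
        raise ValueError("exponent 0 and 255 encode special cases")
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)

s, e, m = decode_float32(-6.5)   # -6.5 = -1.625 * 2^2
print(s, e, m)
print(value_of_normal(s, e, m))
```

The guard in `value_of_normal` reflects the slide's remark that an exponent of 0 has a special meaning, so the plain formula does not apply there.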

Floating-Point Unit (FPU)
- Much more hardware than an integer unit
  - A simple FPU takes 150,000 gates; an integer FU is on the order of thousands
  - Verification is complex
  - Some exceptions are specific to floating point
- Common to have several functional units: some integer, some floating point
- Common to have different types of FPUs, such as Fadd, Fmul and Fdiv
- An FPU may be pipelined, partially pipelined or not pipelined
- To operate several FPUs concurrently, the FP register file needs to have more read and write ports

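Unpipelined FP EXE Stage
- An FP operation loops in the EX stage for several cycles to compute its result
- Completing it in one cycle would require a much longer clock period
- A single-cycle FPU is a bad idea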

Latency and Interval
- Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses the result
  - Usually the number of stages after EX in which an instruction produces a result
  - Integer ALU: latency 0; load: latency 1
- Initiation or repeat interval: the number of cycles that must elapse between issuing two operations of a given type; this is a source of structural hazards
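As a rough illustration of how these two numbers turn into stalls, here is a minimal sketch. It assumes full forwarding and an otherwise ideal pipeline; the function names and the sample FP numbers are hypothetical, chosen only to exercise the definitions.

```python
def raw_stall_cycles(producer_latency: int, independent_between: int) -> int:
    """Stall cycles seen by a consumer, given how many unrelated instructions
    were scheduled between it and its producer."""
    return max(0, producer_latency - independent_between)

def earliest_reissue(prev_issue_cycle: int, initiation_interval: int) -> int:
    """First cycle at which another operation of the same type may issue."""
    return prev_issue_cycle + initiation_interval

print(raw_stall_cycles(0, 0))   # integer ALU result used by the very next instruction
print(raw_stall_cycles(1, 0))   # load result used by the very next instruction
print(raw_stall_cycles(3, 1))   # hypothetical latency-3 FP op with one filler instruction
print(earliest_reissue(10, 4))  # hypothetical unit with an initiation interval of 4
```

Latency drives RAW stalls, while the initiation interval drives structural stalls on a unit that is not fully pipelined.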

Pipelined FP EXE
Increased stall for RAW hazards

Breaking Our Assumption of Integer Pipeline
- The divide unit is not fully pipelined: structural hazards can occur
  - They need to be detected and a stall incurred
- The instructions have varying running times: the number of register writes required in a cycle can be > 1
- Instructions no longer reach WB in order: write after write (WAW) hazards are possible
  - Note that write after read (WAR) hazards are not possible, since the register reads always occur in ID
- Instructions can complete in a different order than they were issued (out-of-order completion), causing problems with exceptions
- Longer latency of operations: stalls for RAW hazards will be more frequent

Hazards and Forwarding for Longer-Latency Pipeline
[Figure omitted]

Stalls of FP Operations (SPEC89 FP)
- FP add, subtract, or convert: 1.7 stall cycles on average, or 56% of the latency (3 cycles)
- Multiplies and divides: 2.8 and 14.2 stall cycles, respectively, or 46% and 59% of the corresponding latency
- Structural hazards for divides are rare, since the divide frequency is low

Stalls per FP Operation
- The total number of stalls per instruction ranges from 0.65 for su2cor to 1.21 for doduc, with an average of 0.87
- FP result stalls dominate in all cases, with an average of 0.71 stalls per instruction, or 82% of the stalled cycles

Problems Arising From Writes
If we issue one instruction per cycle, how can we avoid structural hazards at the writeback stage and out-of-order writeback issues?
[Figure: pipeline diagram with the WAW hazards marked]

Complex In-Order Pipeline
- Delay writeback so all operations have the same latency to the W stage
  - Write ports never oversubscribed (one inst. in & one inst. out every cycle)
  - Stall pipeline on long latency operations, e.g., divides, cache misses
  - Handle exceptions in-order at commit point
- How to prevent increased writeback latency from slowing down single-cycle integer operations? Bypassing
[Figure: PC, Inst. Mem and Decode feed an integer path (GPRs, X1, +, X2, Data Mem, X3, W) and an FP path (FPRs, X1, X2, FAdd or FMul, X3, W), plus an unpipelined FDiv divider; the commit point is at the W stage]

Floating-Point ISA
- Interaction between the floating-point datapath and the integer datapath is determined by the ISA
- RISC-V ISA
  - Separate register files for FP and integer instructions
    - The only interaction is via a set of move/convert instructions (some ISAs don't even permit this)
  - Separate load/store for FPRs and GPRs (general-purpose registers), but both use GPRs for address calculation
  - FP compares write integer registers, then use an integer branch

Realistic Memory Systems
Common approaches to improving memory performance:
- Caches: single cycle except in case of a miss => stall
- Banked memory: multiple memory accesses => bank conflicts
- Split-phase memory operations (separate memory request from response), many in flight => out-of-order responses
Latency of access to the main memory is usually much greater than one cycle and often unpredictable. Solving this problem is a central issue in computer architecture.

Multiple-Cycle MEM Stage: MIPS R4000
- IF: First half of instruction fetch; PC selection actually happens here, together with initiation of instruction cache access.
- IS: Second half of instruction fetch, complete instruction cache access.
- RF: Instruction decode and register fetch, hazard checking, and instruction cache hit detection.
- EX: Execution, which includes effective address calculation, ALU operation, and branch-target computation and condition evaluation.
- DF: Data fetch, first half of data cache access.
- DS: Second half of data fetch, completion of data cache access.
- TC: Tag check, to determine whether the data cache access hit.
- WB: Write-back for loads and register-register operations.

2-Cycle Load Delay
[Figure omitted]

3-Cycle Branch Delay when Taken
[Figure omitted]

Data Hazards, Control Hazards, Dynamic Scheduling

Types of Data Hazards
Consider executing a sequence of instructions of the type rk <= ri op rj
- Data-dependence: Read-after-Write (RAW) hazard
    r3 <= r1 op r2
    r5 <= r3 op r4
- Anti-dependence: Write-after-Read (WAR) hazard
    r3 <= r1 op r2
    r1 <= r4 op r5
- Output-dependence: Write-after-Write (WAW) hazard
    r3 <= r1 op r2
    r3 <= r6 op r7

Register vs. Memory Dependence
Data hazards due to register operands can be determined at the decode stage, but data hazards due to memory operands can be determined only after computing the effective address.
    Store: M[r1 + disp1] <= r2
    Load:  r3 <= M[r4 + disp2]
Does (r1 + disp1) = (r4 + disp2)?

Data Hazards: An Example
    I1  FDIV.D  f6, f6, f4
    I2  FLD     f2, 45(x3)
    I3  FMUL.D  f0, f2, f4
    I4  FDIV.D  f8, f6, f2
    I5  FSUB.D  f10, f0, f6
    I6  FADD.D  f6, f8, f2
The slide marks the RAW, WAR and WAW hazards among these instructions.
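The hazards in this six-instruction example can be enumerated mechanically. The sketch below is a simple pairwise check on register names, not how decode hardware is built; it ignores intervening writes, which happens not to matter for this sequence. The tuple encoding of the instructions is an assumption made for illustration.

```python
from itertools import combinations

# (name, destination, sources) for the six-instruction example
PROGRAM = [
    ("I1", "f6",  ("f6", "f4")),
    ("I2", "f2",  ("x3",)),
    ("I3", "f0",  ("f2", "f4")),
    ("I4", "f8",  ("f6", "f2")),
    ("I5", "f10", ("f0", "f6")),
    ("I6", "f6",  ("f8", "f2")),
]

def find_hazards(program):
    """Classify every (earlier, later) instruction pair by the register they share."""
    hazards = {"RAW": [], "WAR": [], "WAW": []}
    for (a, a_dst, a_src), (b, b_dst, b_src) in combinations(program, 2):
        if a_dst in b_src:       # later instruction reads what the earlier one writes
            hazards["RAW"].append((a, b, a_dst))
        if b_dst in a_src:       # later instruction overwrites what the earlier one reads
            hazards["WAR"].append((a, b, b_dst))
        if a_dst == b_dst:       # both write the same register
            hazards["WAW"].append((a, b, a_dst))
    return hazards

for kind, pairs in find_hazards(PROGRAM).items():
    print(kind, pairs)
```

Every WAR and WAW hazard it reports involves I6 writing f6, which is the kind of name conflict that register renaming removes.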

Instruction Scheduling
    I1  FDIV.D  f6, f6, f4
    I2  FLD     f2, 45(x3)
    I3  FMUL.D  f0, f2, f4
    I4  FDIV.D  f8, f6, f2
    I5  FSUB.D  f10, f0, f6
    I6  FADD.D  f6, f8, f2
[Figure: dependence graph over I1 to I6]
Valid orderings:
    in-order:      I1 I2 I3 I4 I5 I6
    out-of-order:  I2 I1 I3 I4 I5 I6
    out-of-order:  I1 I2 I3 I5 I4 I6

Out-of-order Completion, In-order Issue
                                Latency
    I1  FDIV.D  f6, f6, f4      4
    I2  FLD     f2, 45(x3)      1
    I3  FMUL.D  f0, f2, f4      3
    I4  FDIV.D  f8, f6, f2      4
    I5  FSUB.D  f10, f0, f6     1
    I6  FADD.D  f6, f8, f2      1
Each number appears twice in a timeline: first when the instruction issues, then (marked * here, underlined on the slide) when it completes.
    in-order comp:      1 2 1* 2* 3 4 3* 5 4* 6 5* 6*
    out-of-order comp:  1 2 2* 3 1* 4 3* 5 5* 4* 6 6*

Dynamic Scheduling
Rearrange the order of instructions to reduce stalls while maintaining data flow
- Minimize RAW hazards
- Minimize WAW and WAR hazards via register renaming
- Applies to both register and memory hazards
Advantages:
- Compiler doesn't need to have knowledge of the microarchitecture
- Handles cases where dependencies are unknown at compile time
Disadvantages:
- Substantial increase in hardware complexity
- Complicates exceptions

Dynamic Scheduling
Dynamic scheduling implies:
- Out-of-order execution
- Out-of-order completion
- Creates more possibility for WAR and WAW hazards
Scoreboard: C.6
- CDC 6600 in 1963
Tomasulo's Approach
- Tracks when operands are available
- Introduces register renaming in hardware
  - Minimizes WAW and WAR hazards

Register Renaming
Example:
    DIV.D  F0,F2,F4
    ADD.D  F6,F0,F8
    S.D    F6,0(R1)
    SUB.D  F8,F10,F14
    MUL.D  F6,F10,F8
- Anti-dependence on F8 (ADD.D reads it, SUB.D later writes it)
- Output dependence on F6 (ADD.D and MUL.D both write it)

Register Renaming
Example, before renaming:
    DIV.D  F0,F2,F4
    ADD.D  F6,F0,F8
    S.D    F6,0(R1)
    SUB.D  F8,F10,F14
    MUL.D  F6,F10,F8
After renaming with temporaries S and T:
    DIV.D  F0,F2,F4
    ADD.D  S,F0,F8
    S.D    S,0(R1)
    SUB.D  T,F10,F14
    MUL.D  F6,F10,T
Now only RAW hazards remain, which can be strictly ordered
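The renaming transformation can be sketched in a few lines. This is an illustrative version that gives every destination a fresh name (P0, P1, ...), which is more aggressive than the slide's hand renaming with S and T; the tuple encoding and the `rename` helper are assumptions, not part of the lecture.

```python
from itertools import count

def rename(program):
    """Give every destination a fresh name and redirect later readers to it."""
    fresh = (f"P{i}" for i in count())
    latest = {}    # architectural register -> most recent fresh name
    renamed = []
    for op, dst, srcs in program:
        # Sources are looked up before the destination mapping is updated,
        # so an instruction that reads and writes the same register stays correct.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        new_dst = None
        if dst is not None:
            new_dst = next(fresh)
            latest[dst] = new_dst
        renamed.append((op, new_dst, new_srcs))
    return renamed

PROGRAM = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),    # a store has no register destination
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]

for instruction in rename(PROGRAM):
    print(instruction)
```

After renaming, no two instructions share a destination and no instruction writes a name that an earlier one reads, so only the true (RAW) dependences are left.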

Tomasulo Algorithm
- For the IBM 360/91, about 3 years after the CDC 6600 (1966)
- Goal: high performance without special compilers
- Differences between IBM 360 & CDC 6600 ISA
  - IBM has only 2 register specifiers/instr vs. 3 in CDC 6600
  - IBM has 4 FP registers vs. 8 in CDC 6600
  - IBM has memory-register ops
- Why study? It led to the Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, ...

Organizations of Tomasulo's Algorithm
- Load/Store buffer
- Reservation station
- Common data bus
[Figure omitted]

Tomasulo Algorithm vs. Scoreboard
- Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard
  - FU buffers are called "reservation stations"; they hold pending operands
- Registers in instructions replaced by values or pointers to reservation stations (RS); called "register renaming"
  - Avoids WAR, WAW hazards
  - More reservation stations than registers, so can do optimizations compilers can't
- Results go to FUs from RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs
- Loads and stores treated as FUs with RSs as well
- Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue

Register Renaming
- Register renaming is done by reservation stations (RS)
  - Each entry contains:
    - The instruction
    - Buffered operand values (when available)
    - Reservation station number of the instruction providing the operand values
- An RS fetches and buffers an operand as soon as it becomes available (not necessarily involving the register file)
- Pending instructions designate the RS to which they will send their output
  - Result values are broadcast on the common data bus (CDB)
- Only the last output updates the register file
- As instructions are issued, the register specifiers are renamed with the reservation station
- There may be more reservation stations than registers

Reservation Station Components
- Op: operation to perform in the unit (e.g., + or -)
- Vj, Vk: values of the source operands
  - Store buffers have a V field: the result to be stored
- Qj, Qk: reservation stations producing the source registers (value to be written)
  - Note: no ready flags as in the scoreboard; Qj, Qk = 0 => ready in Vj or Vk
  - Store buffers only have Qi, for the RS producing the result
- Busy: indicates the reservation station or FU is busy
- Register result status (Qi): indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
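The fields listed above map naturally onto a small record type. The following is a sketch only: the class name, the use of `None` in place of "Q = 0", and the sample operand values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of the first source operand, once known
    vk: Optional[float] = None   # value of the second source operand, once known
    qj: Optional[str] = None     # producing station for Vj; None stands for "Qj = 0"
    qk: Optional[str] = None     # producing station for Vk; None stands for "Qk = 0"

    def ready(self) -> bool:
        """An occupied station may execute once it waits on no producer."""
        return self.busy and self.qj is None and self.qk is None

# MULTD with its second operand available and its first still pending on Load2
mult1 = ReservationStation("Mult1", busy=True, op="MULTD", vk=2.5, qj="Load2")
print(mult1.ready())

# Load2's result arrives: store the value and clear the tag
mult1.vj, mult1.qj = 4.0, None
print(mult1.ready())
```

There is no separate ready flag per operand: readiness is derived from the Q fields alone, as the note on the slide says.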

Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If a reservation station is free (no structural hazard), control issues the instruction and sends operands (renames registers).
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark the reservation station available.
- Normal data bus: data + destination ("go to" bus)
- Common data bus: data + source ("come from" bus)
  - 64 bits of data + 4 bits of Functional Unit source address
  - Write if matches expected Functional Unit (produces result)
  - Does the broadcast
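The issue and write-result steps can be sketched with plain dictionaries. This is a simplified illustration, not the 360/91 logic: the helper names `issue` and `broadcast`, the dictionary layout and the register values are all assumptions, and the execute stage and structural checks are left out.

```python
def issue(station, op, dst, src_j, src_k, regs, reg_status):
    """Issue: copy ready operands, record producer tags for pending ones, rename dst."""
    for src, v, q in ((src_j, "Vj", "Qj"), (src_k, "Vk", "Qk")):
        tag = reg_status.get(src)            # station that will produce src, if any
        station[v], station[q] = (None, tag) if tag else (regs[src], None)
    station.update(Busy=True, Op=op)
    reg_status[dst] = station["Name"]        # later readers of dst wait on this station

def broadcast(tag, value, stations, regs, reg_status):
    """Write result: everything waiting on the source tag captures the value."""
    for st in stations:
        for v, q in (("Vj", "Qj"), ("Vk", "Qk")):
            if st.get(q) == tag:
                st[v], st[q] = value, None
        if st["Name"] == tag:
            st["Busy"] = False               # the producing station becomes available
    for reg, producer in list(reg_status.items()):
        if producer == tag:                  # only the latest writer updates the register
            regs[reg] = value
            del reg_status[reg]

regs = {"F0": 0.0, "F2": 3.0, "F4": 2.0, "F6": 1.0, "F8": 0.0}  # hypothetical values
reg_status = {"F2": "Load2"}                 # F2 is still being loaded by Load2
mult1, add1 = {"Name": "Mult1"}, {"Name": "Add1"}

issue(mult1, "MULTD", "F0", "F2", "F4", regs, reg_status)
issue(add1, "SUBD", "F8", "F6", "F2", regs, reg_status)
broadcast("Load2", 7.0, [mult1, add1], regs, reg_status)
print(mult1, add1, reg_status, sep="\n")
```

The broadcast is keyed by the producer's name, not by a destination register, which is the "come from" property of the common data bus.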

Tomasulo Organization for the Example
[Figure: the FP Op Queue and six load buffers (Load1 to Load6, filled from memory) feed the FP registers, the store buffers (to memory) and the reservation stations: Add1 to Add3 in front of the FP adders, Mult1 and Mult2 in front of the FP multipliers. The Common Data Bus (CDB) carries results back to all of them.]

Tomasulo Example
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     -      -          -
    LD    F2, 45+R3     -      -          -
    MULTD F0, F2, F4    -      -          -
    SUBD  F8, F6, F2    -      -          -
    DIVD  F10, F0, F6   -      -          -
    ADDD  F6, F8, F2    -      -          -
Load buffers (Busy, Address): Load1, Load2, Load3 not busy
Reservation stations (Time, Name, Busy, Op, Vj, Vk, Qj, Qk): Add1, Add2, Add3, Mult1, Mult2 not busy
Register result status (clock 0): no pending writers for F0 to F30

Tomasulo Example Cycle 1
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      -          -
    LD    F2, 45+R3     -      -          -
    MULTD F0, F2, F4    -      -          -
    SUBD  F8, F6, F2    -      -          -
    DIVD  F10, F0, F6   -      -          -
    ADDD  F6, F8, F2    -      -          -
Load buffers: Load1 busy (34+R2); Load2, Load3 not busy
Reservation stations: Add1, Add2, Add3, Mult1, Mult2 not busy
Register result status (clock 1): F6 = Load1

Tomasulo Example Cycle 2
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      -          -
    LD    F2, 45+R3     2      -          -
    MULTD F0, F2, F4    -      -          -
    SUBD  F8, F6, F2    -      -          -
    DIVD  F10, F0, F6   -      -          -
    ADDD  F6, F8, F2    -      -          -
Load buffers: Load1 busy (34+R2); Load2 busy (45+R3); Load3 not busy
Reservation stations: Add1, Add2, Add3, Mult1, Mult2 not busy
Register result status (clock 2): F2 = Load2, F6 = Load1
Note: multiple outstanding loads are allowed.

Tomasulo Example Cycle 3
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          -
    LD    F2, 45+R3     2      -          -
    MULTD F0, F2, F4    3      -          -
    SUBD  F8, F6, F2    -      -          -
    DIVD  F10, F0, F6   -      -          -
    ADDD  F6, F8, F2    -      -          -
Load buffers: Load1 busy (34+R2); Load2 busy (45+R3); Load3 not busy
Reservation stations:
    Mult1: busy, MULTD, Vk = R(F4), Qj = Load2
    Add1, Add2, Add3, Mult2: not busy
Register result status (clock 3): F0 = Mult1, F2 = Load2, F6 = Load1
Notes:
- Register names are removed ("renamed") in the reservation stations; MULTD is issued here, unlike in the scoreboard
- Load1 is completing; what is waiting for Load1?

Tomasulo Example Cycle 4
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          -
    MULTD F0, F2, F4    3      -          -
    SUBD  F8, F6, F2    4      -          -
    DIVD  F10, F0, F6   -      -          -
    ADDD  F6, F8, F2    -      -          -
Load buffers: Load1 not busy; Load2 busy (45+R3); Load3 not busy
Reservation stations:
    Add1:  busy, SUBD, Vj = M(A1), Qk = Load2
    Mult1: busy, MULTD, Vk = R(F4), Qj = Load2
    Add2, Add3, Mult2: not busy
Register result status (clock 4): F0 = Mult1, F2 = Load2, F6 = M(A1), F8 = Add1
Notes:
- M(A1) is the data fetched from memory by the instruction originally in Load1
- Load2 is completing; what is waiting for Load2?

Tomasulo Example Cycle 5
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          5
    MULTD F0, F2, F4    3      -          -
    SUBD  F8, F6, F2    4      -          -
    DIVD  F10, F0, F6   5      -          -
    ADDD  F6, F8, F2    -      -          -
Load buffers: Load1, Load2, Load3 not busy
Reservation stations:
    Add1:  Time 2, busy, SUBD, Vj = M(A1), Vk = M(A2)
    Mult1: Time 10, busy, MULTD, Vj = M(A2), Vk = R(F4)
    Mult2: busy, DIVD, Vk = M(A1), Qj = Mult1
    Add2, Add3: not busy
Register result status (clock 5): F0 = Mult1, F2 = M(A2), F6 = M(A1), F8 = Add1, F10 = Mult2
Note: M(A2) is the data fetched from memory by the instruction originally in Load2.

Tomasulo Example Cycle 6
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          5
    MULTD F0, F2, F4    3      -          -
    SUBD  F8, F6, F2    4      -          -
    DIVD  F10, F0, F6   5      -          -
    ADDD  F6, F8, F2    6      -          -
Load buffers: Load1, Load2, Load3 not busy
Reservation stations:
    Add1:  Time 1, busy, SUBD, Vj = M(A1), Vk = M(A2)
    Add2:  busy, ADDD, Vk = M(A2), Qj = Add1
    Mult1: Time 9, busy, MULTD, Vj = M(A2), Vk = R(F4)
    Mult2: busy, DIVD, Vk = M(A1), Qj = Mult1
    Add3: not busy
Register result status (clock 6): F0 = Mult1, F2 = M(A2), F6 = Add2, F8 = Add1, F10 = Mult2
Note: ADDD is issued here; compare with the scoreboard.

Tomasulo Example Cycle 7
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          5
    MULTD F0, F2, F4    3      -          -
    SUBD  F8, F6, F2    4      7          -
    DIVD  F10, F0, F6   5      -          -
    ADDD  F6, F8, F2    6      -          -
Load buffers: Load1, Load2, Load3 not busy
Reservation stations:
    Add1:  Time 0, busy, SUBD, Vj = M(A1), Vk = M(A2)
    Add2:  busy, ADDD, Vk = M(A2), Qj = Add1
    Mult1: Time 8, busy, MULTD, Vj = M(A2), Vk = R(F4)
    Mult2: busy, DIVD, Vk = M(A1), Qj = Mult1
    Add3: not busy
Register result status (clock 7): F0 = Mult1, F2 = M(A2), F6 = Add2, F8 = Add1, F10 = Mult2
Note: Add1 is completing; what is waiting for it?

Tomasulo Example Cycle 8
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          5
    MULTD F0, F2, F4    3      -          -
    SUBD  F8, F6, F2    4      7          8
    DIVD  F10, F0, F6   5      -          -
    ADDD  F6, F8, F2    6      -          -
Load buffers: Load1, Load2, Load3 not busy
Reservation stations:
    Add2:  Time 2, busy, ADDD, Vj = (M-M), Vk = M(A2)
    Mult1: Time 7, busy, MULTD, Vj = M(A2), Vk = R(F4)
    Mult2: busy, DIVD, Vk = M(A1), Qj = Mult1
    Add1, Add3: not busy
Register result status (clock 8): F0 = Mult1, F2 = M(A2), F6 = Add2, F8 = (M-M), F10 = Mult2

Tomasulo Example Cycle 9
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          5
    MULTD F0, F2, F4    3      -          -
    SUBD  F8, F6, F2    4      7          8
    DIVD  F10, F0, F6   5      -          -
    ADDD  F6, F8, F2    6      -          -
Load buffers: Load1, Load2, Load3 not busy
Reservation stations:
    Add2:  Time 1, busy, ADDD, Vj = (M-M), Vk = M(A2)
    Mult1: Time 6, busy, MULTD, Vj = M(A2), Vk = R(F4)
    Mult2: busy, DIVD, Vk = M(A1), Qj = Mult1
    Add1, Add3: not busy
Register result status (clock 9): F0 = Mult1, F2 = M(A2), F6 = Add2, F8 = (M-M), F10 = Mult2

Tomasulo Example Cycle 10
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          5
    MULTD F0, F2, F4    3      -          -
    SUBD  F8, F6, F2    4      7          8
    DIVD  F10, F0, F6   5      -          -
    ADDD  F6, F8, F2    6      10         -
Load buffers: Load1, Load2, Load3 not busy
Reservation stations:
    Add2:  Time 0, busy, ADDD, Vj = (M-M), Vk = M(A2)
    Mult1: Time 5, busy, MULTD, Vj = M(A2), Vk = R(F4)
    Mult2: busy, DIVD, Vk = M(A1), Qj = Mult1
    Add1, Add3: not busy
Register result status (clock 10): F0 = Mult1, F2 = M(A2), F6 = Add2, F8 = (M-M), F10 = Mult2
Note: Add2 is completing; what is waiting for it?

Tomasulo Example Cycle 11
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          5
    MULTD F0, F2, F4    3      -          -
    SUBD  F8, F6, F2    4      7          8
    DIVD  F10, F0, F6   5      -          -
    ADDD  F6, F8, F2    6      10         11
Load buffers: Load1, Load2, Load3 not busy
Reservation stations:
    Mult1: Time 4, busy, MULTD, Vj = M(A2), Vk = R(F4)
    Mult2: busy, DIVD, Vk = M(A1), Qj = Mult1
    Add1, Add2, Add3: not busy
Register result status (clock 11): F0 = Mult1, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2
Notes:
- The result of ADDD is written here; compare with the scoreboard
- All quick instructions complete in this cycle!

Tomasulo Example Cycle 12
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          5
    MULTD F0, F2, F4    3      -          -
    SUBD  F8, F6, F2    4      7          8
    DIVD  F10, F0, F6   5      -          -
    ADDD  F6, F8, F2    6      10         11
Load buffers: Load1, Load2, Load3 not busy
Reservation stations:
    Mult1: Time 3, busy, MULTD, Vj = M(A2), Vk = R(F4)
    Mult2: busy, DIVD, Vk = M(A1), Qj = Mult1
    Add1, Add2, Add3: not busy
Register result status (clock 12): F0 = Mult1, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2

Tomasulo Example Cycle 13
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          5
    MULTD F0, F2, F4    3      -          -
    SUBD  F8, F6, F2    4      7          8
    DIVD  F10, F0, F6   5      -          -
    ADDD  F6, F8, F2    6      10         11
Load buffers: Load1, Load2, Load3 not busy
Reservation stations:
    Mult1: Time 2, busy, MULTD, Vj = M(A2), Vk = R(F4)
    Mult2: busy, DIVD, Vk = M(A1), Qj = Mult1
    Add1, Add2, Add3: not busy
Register result status (clock 13): F0 = Mult1, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2

Tomasulo Example Cycle 14
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          5
    MULTD F0, F2, F4    3      -          -
    SUBD  F8, F6, F2    4      7          8
    DIVD  F10, F0, F6   5      -          -
    ADDD  F6, F8, F2    6      10         11
Load buffers: Load1, Load2, Load3 not busy
Reservation stations:
    Mult1: Time 1, busy, MULTD, Vj = M(A2), Vk = R(F4)
    Mult2: busy, DIVD, Vk = M(A1), Qj = Mult1
    Add1, Add2, Add3: not busy
Register result status (clock 14): F0 = Mult1, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2

Tomasulo Example Cycle 15
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          5
    MULTD F0, F2, F4    3      15         -
    SUBD  F8, F6, F2    4      7          8
    DIVD  F10, F0, F6   5      -          -
    ADDD  F6, F8, F2    6      10         11
Load buffers: Load1, Load2, Load3 not busy
Reservation stations:
    Mult1: Time 0, busy, MULTD, Vj = M(A2), Vk = R(F4)
    Mult2: busy, DIVD, Vk = M(A1), Qj = Mult1
    Add1, Add2, Add3: not busy
Register result status (clock 15): F0 = Mult1, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2

Tomasulo Example Cycle 16
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          5
    MULTD F0, F2, F4    3      15         16
    SUBD  F8, F6, F2    4      7          8
    DIVD  F10, F0, F6   5      -          -
    ADDD  F6, F8, F2    6      10         11
Load buffers: Load1, Load2, Load3 not busy
Reservation stations:
    Mult2: Time 40, busy, DIVD, Vj = M*F4, Vk = M(A1)
    Add1, Add2, Add3, Mult1: not busy
Register result status (clock 16): F0 = M*F4, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2

Faster than light computation (skip a couple of cycles)

Tomasulo Example Cycle 55
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          5
    MULTD F0, F2, F4    3      15         16
    SUBD  F8, F6, F2    4      7          8
    DIVD  F10, F0, F6   5      -          -
    ADDD  F6, F8, F2    6      10         11
Load buffers: Load1, Load2, Load3 not busy
Reservation stations:
    Mult2: Time 1, busy, DIVD, Vj = M*F4, Vk = M(A1)
    Add1, Add2, Add3, Mult1: not busy
Register result status (clock 55): F0 = M*F4, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2

Tomasulo Example Cycle 56
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          5
    MULTD F0, F2, F4    3      15         16
    SUBD  F8, F6, F2    4      7          8
    DIVD  F10, F0, F6   5      56         -
    ADDD  F6, F8, F2    6      10         11
Load buffers: Load1, Load2, Load3 not busy
Reservation stations:
    Mult2: Time 0, busy, DIVD, Vj = M*F4, Vk = M(A1)
    Add1, Add2, Add3, Mult1: not busy
Register result status (clock 56): F0 = M*F4, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Mult2
Note: Mult2 is completing; what is waiting for it?

Tomasulo Example Cycle 57
Instruction status:
    Instruction         Issue  Exec Comp  Write Result
    LD    F6, 34+R2     1      3          4
    LD    F2, 45+R3     2      4          5
    MULTD F0, F2, F4    3      15         16
    SUBD  F8, F6, F2    4      7          8
    DIVD  F10, F0, F6   5      56         57
    ADDD  F6, F8, F2    6      10         11
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1, Add2, Add3, Mult1, Mult2 not busy (Mult2 is released once DIVD writes its result)
Register result status (clock 57): F0 = M*F4, F2 = M(A2), F6 = (M-M+M), F8 = (M-M), F10 = Result
Once again: in-order issue, out-of-order execution and completion.
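The Issue, Exec Comp and Write Result columns of this walkthrough can be reproduced with a very small timing model. The sketch below is an assumption-laden simplification: latencies are read off the Time column of the example (2 cycles for loads and adds, 10 for multiply, 40 for divide), one instruction issues per cycle, execution starts the cycle after the last operand is broadcast, and reservation-station and CDB contention are ignored. The `schedule` helper is hypothetical.

```python
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (op, destination, FP source registers); load address registers are omitted
PROGRAM = [
    ("LD",    "F6",  ()),
    ("LD",    "F2",  ()),
    ("MULTD", "F0",  ("F2", "F4")),
    ("SUBD",  "F8",  ("F6", "F2")),
    ("DIVD",  "F10", ("F0", "F6")),
    ("ADDD",  "F6",  ("F8", "F2")),
]

def schedule(program, latency):
    """Return (op, dst, issue, exec_comp, write_result) for each instruction."""
    producer_write = {}   # register -> cycle in which its latest producer broadcasts
    rows = []
    for issue, (op, dst, srcs) in enumerate(program, start=1):
        operands_ready = max((producer_write.get(s, 0) for s in srcs), default=0)
        start = max(issue, operands_ready) + 1   # first cycle of execution
        exec_comp = start + latency[op] - 1
        write = exec_comp + 1
        producer_write[dst] = write              # later readers wait on this producer
        rows.append((op, dst, issue, exec_comp, write))
    return rows

for row in schedule(PROGRAM, LATENCY):
    print(row)
```

Because each source is bound to its producer at issue time, DIVD waits only for the first load's F6 and is unaffected by ADDD overwriting F6 later, which mirrors the effect of renaming in the tables above.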

Compare to Scoreboard Cycle 62
Instruction status:
                        Scoreboard                              Tomasulo
    Instruction         Issue  Read Oper  Exec Comp  Write      Issue  Exec Comp  Write
    LD    F6, 34+R2     1      2          3          4          1      3          4
    LD    F2, 45+R3     5      6          7          8          2      4          5
    MULTD F0, F2, F4    6      9          19         20         3      15         16
    SUBD  F8, F6, F2    7      9          11         12         4      7          8
    DIVD  F10, F0, F6   8      21         61         62         5      56         57
    ADDD  F6, F8, F2    13     14         16         22         6      10         11
Why does it take longer on the scoreboard/6600?
- Structural hazards
- Lack of forwarding

Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)
    Tomasulo (IBM 360/91)                 Scoreboard (CDC 6600)
    Pipelined functional units            Multiple functional units
    (6 load, 3 store, 3 +, 2 x/÷)         (1 load/store, 1 +, 2 x, 1 ÷)
    Window size: 14 instructions          5 instructions
    No issue on structural hazard         Same
    WAR: renaming avoids                  Stall completion
    WAW: renaming avoids                  Stall issue
    Broadcast results from FU             Write/read registers
    Control: reservation stations         Central scoreboard

SUMMARY

Not Every Stage Takes Only One Cycle
- FP EXE stage
  - Multi-cycle Add/Mul
  - Nonpipelined for DIV
- MEM stage

Issues of Multi-Cycle in Some Stages
- The divide unit is not fully pipelined: structural hazards can occur
  - They need to be detected and a stall incurred
- The instructions have varying running times: the number of register writes required in a cycle can be > 1
- Instructions no longer reach WB in order: write after write (WAW) hazards are possible
  - Note that write after read (WAR) hazards are not possible, since the register reads always occur in ID
- Instructions can complete in a different order than they were issued (out-of-order completion), causing problems with exceptions
- Longer latency of operations: stalls for RAW hazards will be more frequent


Problems Arising From Writes
If we issue one instruction per cycle, how can we avoid structural hazards at the writeback stage and out-of-order writeback issues?
[Figure: pipeline diagram with the WAW hazards marked]



Register Renaming
Example:
    DIV.D  F0,F2,F4
    ADD.D  S,F0,F8
    S.D    S,0(R1)
    SUB.D  T,F10,F14
    MUL.D  F6,F10,T
Now only RAW hazards remain, which can be strictly ordered

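How important is renaming? Consider execution without it
         Instruction          Latency
    1    LD    F2, 34(R2)     1
    2    LD    F4, 45(R3)     long
    3    MULTD F6, F4, F2     3
    4    SUBD  F8, F2, F2     1
    5    DIVD  F4, F2, F8     4
    6    ADDD  F10, F6, F4    1
[Figure: dependence graph over instructions 1 to 6]
Each number appears twice in a timeline: first when the instruction issues, then when it completes; a dot is a cycle with no event.
    In-order:      1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
    Out-of-order:  1 (2,1) 4 4 . . . . 2 3 . . 3 5 . . . 5 6 6
Out-of-order execution did not allow any significant improvement!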

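Instruction-level Parallelism via Renaming
         Instruction          Latency
    1    LD    F2, 34(R2)     1
    2    LD    F4, 45(R3)     long
    3    MULTD F6, F4, F2     3
    4    SUBD  F8, F2, F2     1
    5    DIVD  F4, F2, F8     4
    6    ADDD  F10, F6, F4    1
[Figure: the same dependence graph with the anti-dependence between instructions 3 and 5 (on F4) crossed out]
Each number appears twice in a timeline: first when the instruction issues, then when it completes; a dot is a cycle with no event.
    In-order:      1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
    Out-of-order:  1 (2,1) 4 4 5 . . . 2 (3,5) 3 6 6
- Any antidependence can be eliminated by renaming (renaming requires additional storage)
- Can be done either in software or hardware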

Hardware Solution: Dynamic Scheduling
- Out-of-order execution and completion
- Data hazards handled via register renaming
  - Dynamic RAW hazard detection and scheduling in data-flow fashion
  - Register renaming for WAR and WAW hazards (name conflicts)
- Implementations
  - Scoreboard (CDC 6600, 1963)
    - Centralized control, without register renaming (WAR and WAW hazards stall)
  - Tomasulo's Approach (IBM 360/91, 1966)
    - Distributed control and renaming via reservation stations, load/store buffers and the common data bus (data + source)


Register Renaming Summary
- Purpose of renaming: removing anti-dependencies
  - Get rid of WAR and WAW hazards, since these are not real dependencies
- Implicit renaming: i.e., Tomasulo
  - Registers changed into values or response tags
  - We call this implicit because space in the register file may or may not be used by results!
- Explicit renaming: more physical registers than needed by the ISA
  - Rename table: tracks the current association between architectural registers and physical registers
  - Uses a translation table to perform compiler-like transformation on the fly
- With explicit renaming:
  - All registers concentrated in a single register file
  - Can utilize a bypass network that looks more like a 5-stage pipeline
  - Introduces a register-allocation problem
    - Need to handle branch misprediction and precise exceptions differently, but ultimately makes things simpler