Instruction Level Parallelism and Its Exploitation (Part II) ECE 154B

1 Instruction Level Parallelism and Its Exploitation (Part II) ECE 154B Dmitri Strukov

2 ILP techniques not covered last week this week next week

3 Scoreboard Technique Review
- Allows out-of-order execution by processing several instructions simultaneously
- In-order issue; out-of-order execution and completion
- Pipe stages: issue, read registers, execute, write back
- Bookkeeping resolves RAW, WAW, and WAR hazards
- Resolve RAW at the read-registers stage
- Stall issue for WAW; stall completion (WB) for WAR

4 Scoreboard Example

Instruction status:
Instruction   j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6     34+  R2
LD     F2     45+  R3
MULTD  F0     F2   F4
SUBD   F8     F6   F2
DIVD   F10    F0   F6
ADDD   F6     F8   F2

Functional unit status (Fi = dest; Fj, Fk = sources S1, S2; Qj, Qk = FUs producing them; Rj, Rk = Fj, Fk ready?):
Time  Name     Busy  Op  Fi  Fj  Fk  Qj  Qk  Rj  Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status:
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
       FU

5 Another Dynamic Algorithm: Tomasulo's Algorithm
- For the IBM 360/91, about 3 years after the CDC 6600 (1966)
- Goal: high performance without special compilers
- Differences between the IBM 360 and CDC 6600 ISAs:
  - IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600
  - IBM has 4 FP registers vs. 8 in the CDC 6600
  - IBM has memory-register ops
- The small number of floating-point registers prevented interesting compiler scheduling of operations
  - This led Tomasulo to work out how to get more effective registers: renaming in hardware!
- Why study it? Its descendants have flourished: Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604

6 Tomasulo vs. Scoreboard
- Control and buffers are distributed with the functional units (FUs) vs. centralized in the scoreboard
  - FU buffers are called "reservation stations"; they hold pending operands
- Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming
  - Avoids WAR and WAW hazards
  - More reservation stations than registers, so it can do optimizations compilers can't
- Results go to the FUs from the RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs
- Loads and stores are treated as FUs with RSs as well
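A minimal sketch of the renaming idea, assuming each destination simply gets a fresh tag (the `T1`, `T2`, ... names and the `rename` helper are hypothetical, standing in for reservation-station names). It uses four instructions from the running example to show that the WAR hazard between DIVD (reads F6) and ADDD (writes F6) disappears once ADDD writes a new tag instead of F6.

```python
from itertools import count


def rename(program):
    """Give every destination a fresh tag; sources read the newest tag."""
    latest = {}          # architectural register -> tag of its newest producer
    fresh = count(1)
    renamed = []
    for op, dest, src1, src2 in program:
        # A source with no pending producer is read straight from the register.
        s1 = latest.get(src1, src1)
        s2 = latest.get(src2, src2)
        tag = f"T{next(fresh)}"
        latest[dest] = tag
        renamed.append((op, tag, s1, s2))
    return renamed


program = [
    ("MULTD", "F0", "F2", "F4"),
    ("SUBD", "F8", "F6", "F2"),
    ("DIVD", "F10", "F0", "F6"),
    ("ADDD", "F6", "F8", "F2"),   # WAR on F6 with DIVD and SUBD before renaming
]
renamed = rename(program)
for row in renamed:
    print(*row)
```

After renaming, ADDD writes `T4`, so it can finish before DIVD without clobbering the F6 value DIVD still needs.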

7 Tomasulo Organization
[Figure: block diagram with FP Op Queue, FP Registers, Load Buffers (Load1-Load6, from memory), Store Buffers (to memory), Reservation Stations (Add1-Add3 feeding the FP adders, Mult1-Mult2 feeding the FP multipliers), and the Common Data Bus (CDB)]

8 Reservation Station Components
- Op: operation to perform in the unit (e.g., + or -)
- Vj, Vk: values of the source operands
  - Store buffers have V fields holding the results to be stored
- Qj, Qk: reservation stations producing the source registers (values to be written)
  - No ready flags as in the scoreboard; Qj, Qk = 0 => ready
  - Store buffers only have Qi, for the RS producing the result
- Busy: indicates the reservation station or FU is busy
- Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.

9 Three Stages of Tomasulo Algorithm
1. Issue: get an instruction from the FP Op Queue
   - If a reservation station is free (no structural hazard), control issues the instruction and sends operands (renames registers)
2. Execute: operate on operands (EX)
   - When both operands are ready, execute; if not ready, watch the Common Data Bus for the result
3. Write result: finish execution (WB)
   - Write on the Common Data Bus to all awaiting units; mark the reservation station available
- Normal data bus: data + destination ("go to" bus)
- Common data bus: data + source ("come from" bus)
  - 64 bits of data + 4 bits of functional unit source address
  - Write if it matches the expected functional unit (the one producing the result)
  - Does the broadcast
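The issue and write-result stages can be sketched as follows. This is an illustrative model, not the 360/91 hardware: the `ReservationStation` class and the `issue` and `broadcast` helpers are hypothetical names, `None` plays the role of Qj/Qk = 0, and the operand values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None   # producing station, None means "value is in Vj"
    qk: Optional[str] = None

    def ready(self):
        return self.busy and self.qj is None and self.qk is None

    def snoop(self, tag, value):
        """Capture a CDB broadcast if it comes from a station we wait on."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None


def issue(rs, op, dest, src_j, src_k, regs, reg_status):
    """Issue stage: read values that exist, record producer tags for the rest."""
    rs.busy, rs.op = True, op
    rs.vj = rs.vk = rs.qj = rs.qk = None
    if src_j in reg_status:
        rs.qj = reg_status[src_j]
    else:
        rs.vj = regs[src_j]
    if src_k in reg_status:
        rs.qk = reg_status[src_k]
    else:
        rs.vk = regs[src_k]
    reg_status[dest] = rs.name      # rename: dest will come from this station


def broadcast(tag, value, stations, regs, reg_status):
    """Write-result stage: data + source tag goes to every waiting unit."""
    for rs in stations:
        rs.snoop(tag, value)
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            regs[reg] = value
            del reg_status[reg]


# Cycle 3 of the example: MULTD F0,F2,F4 issues while F2 is pending on Load2.
regs = {"F4": 2.5}
reg_status = {"F6": "Load1", "F2": "Load2"}
mult1 = ReservationStation("Mult1")
issue(mult1, "MULTD", "F0", "F2", "F4", regs, reg_status)
waiting = mult1.ready()                       # False: Qj = Load2
broadcast("Load2", 4.0, [mult1], regs, reg_status)
```

After the broadcast, `mult1` holds both values and `F2` no longer appears in the register result status.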

10 Tomasulo Example

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD     F6     34+  R2
LD     F2     45+  R3
MULTD  F0     F2   F4
SUBD   F8     F6   F2
DIVD   F10    F0   F6
ADDD   F6     F8   F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (Vj, Vk = values of sources S1, S2; Qj, Qk = RSs producing them):
Time  Name   Busy  Op  Vj  Vk  Qj  Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
0      FU

11 Tomasulo Example Cycle 1 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 Load1 Yes 34+R2 LD F2 45+ R3 Load2 No MULTD F0 F2 F4 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 1 FU Load1

12 Tomasulo Example Cycle 2 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 Load1 Yes 34+R2 LD F2 45+ R3 2 Load2 Yes 45+R3 MULTD F0 F2 F4 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 2 FU Load2 Load1 Note: Unlike 6600, can have multiple loads outstanding (This was not an inherent limitation of scoreboarding)

13 Tomasulo Example Cycle 3 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 Load1 Yes 34+R2 LD F2 45+ R3 2 Load2 Yes 45+R3 MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 Yes MULTD R(F4) Load2 Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 3 FU Mult1 Load2 Load1 Note: registers names are removed ( renamed ) in Reservation Stations; MULT issued vs. scoreboard Load1 completing; what is waiting for Load1?

14 Tomasulo Example Cycle 4 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R3 2 4 Load2 Yes 45+R3 MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 Yes SUBD M(A1) Load2 Add2 No Add3 No Mult1 Yes MULTD R(F4) Load2 Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 4 FU Mult1 Load2 M(A1) Add1 Load2 completing; what is waiting for Load2?

15 Tomasulo Example Cycle 5 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 DIVD F10 F0 F6 5 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk 2 Add1 Yes SUBD M(A1) M(A2) Add2 No Add3 No 10 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 5 FU Mult1 M(A2) M(A1) Add1 Mult2

16 Tomasulo Example Cycle 6 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk 1 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 9Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 6 FU Mult1 M(A2) Add2 Add1 Mult2 Issue ADDD here?

17 Tomasulo Example Cycle 7 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk 0 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 8Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 7 FU Mult1 M(A2) Add2 Add1 Mult2 Add1 completing; what is waiting for it?

18 Tomasulo Example Cycle 8 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No 2 Add2 Yes ADDD (M-M) M(A2) Add3 No 7Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 8 FU Mult1 M(A2) Add2 (M-M) Mult2

19 Tomasulo Example Cycle 9 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No 1 Add2 Yes ADDD (M-M) M(A2) Add3 No 6Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 9 FU Mult1 M(A2) Add2 (M-M) Mult2

20 Tomasulo Example Cycle 10 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No 0 Add2 Yes ADDD (M-M) M(A2) Add3 No 5Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 10 FU Mult1 M(A2) Add2 (M-M) Mult2 Add2 completing; what is waiting for it?

21 Tomasulo Example Cycle 11 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 4Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 11 FU Mult1 M(A2) (M-M+M(M-M) Mult2 Write result of ADDD here? All quick instructions complete in this cycle!

22 Tomasulo Example Cycle 12 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 3Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 12 FU Mult1 M(A2) (M-M+M(M-M) Mult2

23 Tomasulo Example Cycle 13 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 2Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 13 FU Mult1 M(A2) (M-M+M(M-M) Mult2

24 Tomasulo Example Cycle 14 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 1Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 14 FU Mult1 M(A2) (M-M+M(M-M) Mult2

25 Tomasulo Example Cycle 15 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 0Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 15 FU Mult1 M(A2) (M-M+M(M-M) Mult2

26 Tomasulo Example Cycle 16 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 40 Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 16 FU M*F4 M(A2) (M-M+M(M-M) Mult2

27 Faster than light computation (skip a couple of cycles)

28 Tomasulo Example Cycle 55 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 1Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 55 FU M*F4 M(A2) (M-M+M(M-M) Mult2

29 Tomasulo Example Cycle 56 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F Load3 No SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 0Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 56 FU M*F4 M(A2) (M-M+M(M-M) Mult2 Mult2 is completing; what is waiting for it?

30 Tomasulo Example Cycle 57

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD     F6     34+  R2   1      3          4
LD     F2     45+  R3   2      4          5
MULTD  F0     F2   F4   3      15         16
SUBD   F8     F6   F2   4      7          8
DIVD   F10    F0   F6   5      56         57
ADDD   F6     F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations:
Time  Name   Busy  Op    Vj    Vk     Qj  Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  Yes   DIVD  M*F4  M(A1)

Register result status:
Clock  F0    F2     F4  F6       F8     F10     F12  ...  F30
57     M*F4  M(A2)      (M-M+M)  (M-M)  Result

Once again: in-order issue, out-of-order execution and completion
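The issue, completion, and write-back cycles of this example can be reproduced with a small timing model. It is a sketch under simplifying assumptions that happen to hold here: a reservation station or load buffer is always free, the CDB is never contended, and one instruction issues per cycle. The latencies (2 cycles for loads and adds, 10 for multiply, 40 for divide) are read off the Time column of the slides; the `schedule` helper itself is hypothetical.

```python
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (op, destination, FP source registers); load addresses need no FP sources.
PROGRAM = [
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def schedule(program, latency):
    """Return (op, dest, issue, exec_complete, write_result) per instruction."""
    last_write = {}   # register -> cycle its newest producer broadcasts on the CDB
    rows = []
    for i, (op, dest, srcs) in enumerate(program):
        issue = i + 1
        # Sources are captured at issue, so only earlier producers matter.
        operands_ready = max([last_write.get(s, 0) for s in srcs], default=0)
        start = max(issue, operands_ready) + 1   # execute the cycle after
        comp = start + latency[op] - 1
        write = comp + 1
        last_write[dest] = write
        rows.append((op, dest, issue, comp, write))
    return rows


for row in schedule(PROGRAM, LATENCY):
    print(*row)
```

Because DIVD captures its F6 source at issue, the later ADDD write to F6 in cycle 11 does not affect it, which is the WAR case renaming removes.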

31 Compare to Scoreboard Cycle 62
Instruction status (scoreboard columns: Issue, Read Oper, Exec Comp, Write Result; Tomasulo columns: Issue, Exec Comp, Write Result) for LD F6 34+ R2; LD F2 45+ R3; MULTD F0 F2 F4; SUBD F8 F6 F2; DIVD F10 F0 F6; ADDD F6 F8 F2
Why does it take longer on the scoreboard/6600?
- Structural hazards
- Lack of forwarding

32 Tomasulo vs. Scoreboard (IBM 360/91 vs. CDC 6600)

IBM 360/91 (Tomasulo)               CDC 6600 (Scoreboard)
Pipelined functional units          Multiple functional units
(6 load, 3 store, 3 +, 2 x/÷)       (1 load/store, 1 +, 2 x, 1 ÷)
Window size: 14 instructions        5 instructions
No issue on structural hazard       Same
WAR: renaming avoids                Stall completion
WAW: renaming avoids                Stall issue
Broadcast results from FU           Write/read registers
Control: reservation stations       Central scoreboard

33 Tomasulo Drawbacks
- Complexity: delays of 360/91, MIPS 10000, IBM 620?
- Many associative stores (CDB) at high speed
- Performance limited by the Common Data Bus
  - Each CDB must go to multiple functional units: high capacitance, high wiring density
  - Number of functional units that can complete per cycle limited to one!
  - (Multiple CDBs: more FU logic for parallel associative stores)
- Non-precise interrupts!

34 Summary on Tomasulo
- Reservation stations: implicit register renaming to a larger set of registers + buffering of source operands
  - Prevents registers from becoming a bottleneck
  - Avoids the WAR and WAW hazards of the scoreboard
  - Allows loop unrolling in HW (when the branch can be quickly resolved)
- Helps with cache misses as well
- Lasting contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP PA 8000; Alpha 21264

35 Exploiting More ILP
- Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP (we will look at branch prediction implementations next week)
- One way to overcome control dependencies is speculation
  - Make a guess and execute the program as if the guess is correct
  - Need mechanisms to handle the case when the speculation is incorrect
- Can do some speculation in the compiler
  - We saw this previously with reordered / duplicated instructions around branches

36 Hardware Based Speculation
- Extends the idea of dynamic scheduling with three key ideas:
  1. Dynamic branch prediction
  2. Speculation to allow the execution of instructions before control dependencies are resolved
  3. Dynamic scheduling to deal with scheduling different combinations of basic blocks
     - What we saw earlier was within a basic block
- Modern processors started using speculation around the introduction of the PowerPC 603 and Intel Pentium II, and extend Tomasulo's approach to support speculation

37 Speculating with Tomasulo
- Separate execution from completion
  - Allow instructions to execute speculatively, but do not let them update registers or memory until they are no longer speculative
- Instruction commit
  - After an instruction is no longer speculative, it is allowed to make register and memory updates
- Allow instructions to execute and complete out of order, but force them to commit in order
- Add a hardware buffer, called the reorder buffer (ROB), with registers to hold the result of an instruction between completion and commit
  - Acts as a FIFO queue in the order issued

38 Original Tomasulo Architecture

39 Tomasulo and Reorder Buffer
- Sits between execution and the register file
- Source of operands
- In this case integrated with the store buffer
- Reservation stations use the ROB slot as a tag
- Instructions commit at the head of the ROB FIFO queue
- Easy to undo speculated instructions on mispredicted branches or on exceptions

40 ROB Data Structure
- Instruction type field: indicates whether the instruction is a branch, store, or register operation
- Destination field: register number for loads and ALU ops, or memory address for stores
- Value field: holds the value of the instruction result until the instruction commits
- Ready field: indicates whether the instruction has completed execution and the value is ready

41 Instruction Execution
1. Issue: get an instruction from the instruction queue
   - If a reservation station and the ROB both have a free slot (no structural hazard), issue the instruction to the reservation station and the ROB, and send operands to the reservation station if they are available in the register file or the ROB. The allocated ROB slot number is sent to the reservation station to use as a tag when placing data on the CDB.
2. Execute: operate on operands (EX)
   - When both operands are ready, execute; if not ready, watch the CDB for the result
3. Write result: finish execution (WB)
   - Write on the CDB to all awaiting units and to the ROB using the tag; mark the reservation station available
4. Commit: update register or memory with the ROB result
   - When an instruction reaches the head of the ROB and its result is present, update the register with the result or store to memory, and remove the instruction from the ROB
   - If an incorrectly predicted branch reaches the head of the ROB, flush the ROB and restart at the correct successor of the branch
(In the original slides, blue text marks the changes from Tomasulo.)
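The commit step can be sketched with a FIFO of ROB entries. This is an illustrative model only: the `ROBEntry` fields follow the four fields on the ROB Data Structure slide, while the `mispredicted` flag, the `commit` helper, and retiring several entries per call are assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    kind: str                     # "reg", "store", or "branch"
    dest: Optional[str]           # register name, or memory address for a store
    value: Optional[float] = None
    ready: bool = False
    mispredicted: bool = False    # hypothetical flag, set when a branch resolves


def commit(rob, regs, memory):
    """Retire entries from the head of the ROB, in order, while they are ready."""
    while rob and rob[0].ready:
        entry = rob.popleft()
        if entry.kind == "branch":
            if entry.mispredicted:
                rob.clear()       # squash every younger, speculative entry
                return "flush"
        elif entry.kind == "store":
            memory[entry.dest] = entry.value
        else:
            regs[entry.dest] = entry.value
    return "ok"


regs, memory = {}, {}
rob = deque([
    ROBEntry("reg", "F0", 1.5, ready=True),
    ROBEntry("reg", "F10"),                                  # still executing
    ROBEntry("branch", None, ready=True, mispredicted=True),
    ROBEntry("store", "0+R3", 9.0, ready=True),              # younger than the branch
])
first = commit(rob, regs, memory)    # only F0 retires; F10 blocks the head
```

The store behind the mispredicted branch has completed, but it never reaches memory because it is squashed before it gets to the head.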

42 FP Op Queue Tomasulo With ROB State ROB7 ROB6 Newest Reorder Buffer ROB5 ROB4 ROB3 ROB2 F0 LD F0, 10(R2) I ROB1 Oldest Registers To Memory From Memory FP adders Reservation Stations FP multipliers li li 1 10+R2

43 FP Op Queue Tomasulo With ROB State ROB7 ROB6 Newest Reorder Buffer ROB5 ROB4 ROB3 ROB2 F0 LD F0, 10(R2) E ROB1 Oldest Registers To Memory From Memory FP adders Reservation Stations FP multipliers li li 1 10+R2

44 FP Op Queue Tomasulo With ROB State ROB7 ROB6 Newest Reorder Buffer ROB5 ROB4 ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Oldest Registers 2 ADDD R(F4),1 To Memory From Memory FP adders Reservation Stations FP multipliers li li 1 10+R2

45 FP Op Queue Tomasulo With ROB State ROB7 ROB6 Newest Reorder Buffer ROB5 ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Oldest Registers 2 ADDD R(F4),1 3 MULD 2,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers li li 1 10+R2

46 FP Op Queue Tomasulo With ROB State ROB7 F0 ADDD F0,F4,F6 I ROB6 F4 LD F4,0(R3) E ROB5 Reorder Buffer -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Newest Oldest Registers To Memory From 2 ADDD R(F4),1 3 MULD 2,R(F6) Memory 6 ADDD 5,R(F6) Reservation 1 10+R2 Stations FP adders FP multipliers li li 5 0+R3

47 FP Op Queue Tomasulo With ROB State [R3] ROB5 ST F4, 0(R3) I ROB7 F0 ADDD F0,F4,F6 I ROB6 F4 LD F4,0(R3) E ROB5 Reorder Buffer -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Newest Oldest Registers To Memory From 2 ADDD R(F4),1 3 MULD 2,R(F6) Memory 6 ADDD 5,R(F6) Reservation 1 10+R2 Stations FP adders FP multipliers li li 5 0+R3

48 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 ADDD F0,F4,F6 I ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Newest Oldest Registers 2 ADDD R(F4),1 6 ADDD V1,R(F6) FP adders Reservation Stations 3 MULD 2,R(F6) FP multipliers li li To Memory From Memory 1 10+R2

49 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 ADDD F0,F4,F6 E ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Newest Oldest Registers 2 ADDD R(F4),1 3 MULD 2,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers li li 1 10+R2

50 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Newest Oldest Registers 2 ADDD R(F4),1 3 MULD 2,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers li li 1 10+R2

51 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 V3 LD F0, 10(R2) W ROB1 Newest Oldest Registers 2 ADDD R(F4),V3 3 MULD 2,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers li li

52 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L E ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 E ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Registers F0=V3 3 MULD 2,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers li li

53 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L W ROB4 F2 MULD F2,F10,F6 I ROB3 F10 V4 ADDD F10,F4,F0 W ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Registers F0=V3 3 MULD V4,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers li li

54 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L W ROB4 F2 MULD F2,F10,F6 E ROB3 F10 V4 ADDD F10,F4,F0 C ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Registers F0=V3 F10=V4 To Memory From Memory FP adders Reservation Stations FP multipliers li li

55 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L W ROB4 F2 V5 MULD F2,F10,F6 W ROB3 F10 V4 ADDD F10,F4,F0 C ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Registers F0=V3 F10=V4 To Memory From Memory FP adders Reservation Stations FP multipliers li li

56 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L W ROB4 F2 V5 MULD F2,F10,F6 C ROB3 F10 V4 ADDD F10,F4,F0 C ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Registers F0=V3 F10=V4 F2=V5 To Memory From Memory FP adders Reservation Stations FP multipliers li li

57 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L C ROB4 F2 V5 MULD F2,F10,F6 C ROB3 F10 V4 ADDD F10,F4,F0 C ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Registers F0=V3 F10=V4 F2=V5 To Memory From Memory FP adders Reservation Stations FP multipliers li li

58 Avoiding Memory Hazards
- A store only updates memory when it reaches the head of the ROB
  - Otherwise WAW and WAR hazards are possible
  - By waiting to reach the head, memory is updated in order and no earlier loads or stores can still be pending
- If a load accesses a memory location written to by an earlier store, it cannot perform the memory access until the store has written the data
  - Prevents RAW hazards through memory
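The load rule above can be sketched as a scan over the older ROB entries. The list-of-tuples ROB and the `load_may_access_memory` helper are hypothetical, and treating a store whose address is not yet computed (`None`) as a conflict is a conservative assumption added here, not something the slide states.

```python
def load_may_access_memory(rob, load_index):
    """rob lists (kind, address) in program order, oldest first.

    A store still in the ROB has not written memory yet, so a load must wait
    behind any older store to the same (or a still unknown) address.
    """
    load_addr = rob[load_index][1]
    for kind, addr in rob[:load_index]:
        if kind == "store" and (addr is None or addr == load_addr):
            return False
    return True


rob = [("store", 0x100), ("alu", None), ("load", 0x100), ("load", 0x200)]
blocked = load_may_access_memory(rob, 2)      # same address as the older store
free = load_may_access_memory(rob, 3)         # different address

after_commit = rob[1:]                        # the store reached the head and wrote
unblocked = load_may_access_memory(after_commit, 1)
```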

59 Reorder Buffer Implementation
- In practice, try to recover as early as possible after a branch is mispredicted rather than wait until the branch reaches the head
- Performance in speculative processors is more sensitive to branch prediction
  - Higher cost of misprediction
- Exceptions (we will look into these next week)
  - Don't recognize the exception until it is ready to commit
  - Could try to handle exceptions as they arise and earlier branches are resolved, but this is more challenging

60 Acknowledgements
Some of the slides contain material developed and copyrighted by Sally A. McKee and K. Mock (Cornell University) and instructor material for the textbook
