Slide Set 8 for ENCM 501 in Winter Term, 2017
Steve Norman, PhD, PEng
Electrical & Computer Engineering, Schulich School of Engineering, University of Calgary

ENCM 501 W17 Lectures: Slide Set 8 slide 2/74 Contents
- Pipelines with long-latency instructions
- What does program order mean?
- Out-of-order execution, WAW and WAR hazards
- Data hazards related to memory locations
- Managing hazards in OoO processor designs
- Overview of Tomasulo's algorithm
- Tomasulo's algorithm and name dependencies
- Tomasulo's algorithm and memory hazards
- Load and store buffers
- A loop example
- Costs of the CDB (common data bus)
- Tomasulo's algorithm, branches, and exceptions
- Out-of-order execution, in-order completion

ENCM 501 W17 Lectures: Slide Set 8 slide 4/74 Pipelines with long-latency instructions
Trying to execute instructions with long latencies in-order using a single pipeline can have very bad effects when data hazards arise. The costs in lost cycles are much worse than they might appear to be from study of a simple pipeline that can always do the EX step in one clock cycle.
Let's look at some examples using the pipeline of textbook Figure C.35 as a reference. That pipeline is fairly realistic about variable latency in EX.
The Figure C.35 pipeline model is unreasonably simplistic about instruction fetch and data memory access, but problems waiting for EX results will make the point that needs to be made.

ENCM 501 W17 Lectures: Slide Set 8 slide 5/74
[Figure: pipeline in which IF and ID feed four parallel execution paths (integer unit: EX; FP/integer multiply: M1 to M7; FP adder: A1 to A4; FP/integer divider: DIV), all followed by MEM and WB.]
Image is Figure C.35 from Hennessy J. L. and Patterson D. A., Computer Architecture: A Quantitative Approach, 5th ed., © 2012, Elsevier, Inc.

ENCM 501 W17 Lectures: Slide Set 8 slide 6/74
Let's look at the effects of RAW data hazards concerning use of FPRs (floating-point registers).
Such a hazard involving two instructions can be detected in the ID stage of the later instruction: a source register number of the later instruction will match the destination register number of the earlier instruction.
The later instruction will not be allowed to enter EX until the result of the earlier instruction can be forwarded. If this requires one or more stall cycles, that blocks other instructions from even entering the ID stage.

ENCM 501 W17 Lectures: Slide Set 8 slide 7/74 Example:
    MUL.D  F0, F20, F22
    S.D    F0, (R8)
    L.D    F2, (R9)
    DADDIU R9, R9, 8
    ADD.D  F24, F24, F2
S.D can't leave ID until MUL.D leaves M7...
    MUL.D   IF-ID-M1-M2-M3-M4-M5-M6-M7-Me-W
    S.D        IF-ID-ID-ID-ID-ID-ID-ID-EX-Me-W
    L.D           IF-IF-IF-IF-IF-IF-IF-ID-EX-Me-W
    DADDIU                             IF-ID-EX-Me-W
L.D, DADDIU and ADD.D don't depend on the MUL.D result, but are all seriously delayed because in an in-order system, they all have to wait for S.D to leave ID.

ENCM 501 W17 Lectures: Slide Set 8 slide 8/74 A worse example:
    MUL.D  F0, F20, F22
    ADD.D  F24, F0, F24
    L.D    F2, (R9)
    SUB.D  F26, F26, F2
ADD.D can't leave ID until M7 of MUL.D is done, and then L.D can't enter Mem until Mem of ADD.D is done...
    MUL.D   IF-ID-M1-M2-M3-M4-M5-M6-M7-Me-W
    ADD.D      IF-ID-ID-ID-ID-ID-ID-ID-A1-A2-A3-A4-Me-W
    L.D           IF-IF-IF-IF-IF-IF-IF-ID-EX-EX-EX-EX-Me-W
    SUB.D                              IF-ID-ID-ID-ID-ID-A1-

ENCM 501 W17 Lectures: Slide Set 8 slide 9/74 An even worse example:
    DIV.D  F0, F20, F22
    ADD.D  F24, F0, F24
    L.D    F2, (R9)

ENCM 501 W17 Lectures: Slide Set 8 slide 10/74 Mitigation of long stalls due to RAW hazards
In-order pipelined execution used to be common even in reasonably high-end processors, and is still common in embedded processors.
What criticism could be made of a compiler that emitted the following sequence of instructions?
    MUL.D  F0, F20, F22
    S.D    F0, (R8)
    L.D    F2, (R9)
    DADDIU R9, R9, 8
    ADD.D  F24, F24, F2

ENCM 501 W17 Lectures: Slide Set 8 slide 11/74
[Figure: the Figure C.35 pipeline again. IF and ID feed four parallel execution paths (integer unit: EX; FP/integer multiply: M1 to M7; FP adder: A1 to A4; FP/integer divider: DIV), all followed by MEM and WB.]
The above model is simplistic about memory accesses. With a more realistic model for memory access, let's come up with another source of stalls due to RAW hazards with long-latency instructions, not involving complicated arithmetic.

ENCM 501 W17 Lectures: Slide Set 8 slide 13/74 What does program order mean?
Most (maybe all?) current mainstream ISAs guarantee that results (register and memory writes, branch decisions, etc.) produced by a single stream of instructions are what you would predict if each instruction completed before the next instruction was fetched.
Program order refers to the order in which instructions would be processed in a hypothetical computer with no ILP.
ILP schemes aim to get the effects of execution in program order, without the massive performance penalty of actually waiting to finish one instruction before starting the next.

ENCM 501 W17 Lectures: Slide Set 8 slide 14/74
If a scalar in-order pipeline has proper hazard detection, it is pretty much guaranteed to generate correct program order results...
- Forwarding, preceded if necessary by stalling, ensures that instructions don't work with stale versions of source operands.
- A later instruction can't pass an earlier instruction in the pipeline: key stages such as decode and data memory access can be occupied by only one instruction. That means that the earlier instruction will always write its result before the later instruction does.
Warning: The above material isn't perfectly correct! But it does provide a decent explanation about the main benefits of in-order instruction processing.

ENCM 501 W17 Lectures: Slide Set 8 slide 16/74 Out-of-order execution, WAW and WAR hazards
The next slide shows a hypothetical variation of textbook Figure C.35. As in Figure C.35, instructions are sent one per clock cycle to one of the four paths that lead to instruction completion.
Transfer of an instruction from the fetch-and-decode unit to one of the four functional units is called instruction issue.
Let's assume that in each cycle the fetch-and-decode unit can inspect a window of four or so instructions that need to be issued soon, and picks one, not necessarily in program order, but instead with the goal of optimizing instruction throughput.

ENCM 501 W17 Lectures: Slide Set 8 slide 17/74
[Figure: hypothetical variation of Figure C.35. An instruction fetch and decode unit feeds four paths (integer unit: EX; FP/integer multiplier: M1 to M7; FP adder: A1 to A4; FP/integer divider), followed by MEM and WB.]

ENCM 501 W17 Lectures: Slide Set 8 slide 18/74
Further, as the paths through the execution units have different latencies, our hypothetical system will allow not only out-of-order issue, but also out-of-order completion: instructions may arrive at the MEM stage not in program order.
(And, two or more instructions could arrive at MEM in the same cycle, which would require some sort of arbitration and queueing system.)
OoO (out-of-order) issue and OoO completion may create new kinds of hazards, ones that would be impossible with in-order issue and in-order completion.

ENCM 501 W17 Lectures: Slide Set 8 slide 19/74 WAW (Write-After-Write) hazard example
    MUL.D  F0, F20, F20
    ADD.D  F2, F2, F0
    L.D    F0, (R8)
    ADD.D  F4, F4, F0
    (several more instructions, but no more writes to F0)
    SUB.D  F6, F6, F0
The first ADD.D has to wait for the MUL.D result, so it makes sense to issue L.D and the second ADD.D before the first ADD.D.
What is the potential bad consequence for the SUB.D instruction?
Suppose the instruction sequence was produced by a compiler. How could the compiler have avoided the WAW hazard?

ENCM 501 W17 Lectures: Slide Set 8 slide 20/74 WAR (Write-After-Read) hazard example
    MUL.D  F2, F22, F22
    S.D    F2, 40(R29)
    L.D    F2, 32(R29)
    ADD.D  F20, F20, F2
In program order, L.D writes to F2 after S.D reads from F2. But if L.D is issued out of order, S.D could store the L.D result to memory instead of storing the MUL.D result.

ENCM 501 W17 Lectures: Slide Set 8 slide 21/74 Data dependencies and name dependencies A RAW hazard is a data dependency, sometimes also called a true dependency. A later instruction has a source that was a destination of an earlier instruction. The earlier instruction may not have written its result to a register or to memory when the later instruction tries to read that result.

ENCM 501 W17 Lectures: Slide Set 8 slide 22/74
In contrast, WAW and WAR hazards are sometimes called name dependencies. Generally, the situation is like this...
- Some instruction B receives information from an earlier instruction A, in some register or memory location.
- Some instruction D receives unrelated information from an earlier instruction C, in the same register or memory location.
In program order, there is no problem: the first use of the storage location is over before the second use starts. But in an out-of-order environment, communication between one pair of instructions may interfere with communication between another pair.

ENCM 501 W17 Lectures: Slide Set 8 slide 23/74 The term name dependency comes from the idea that two or more pairs of instructions are making conflicting use of a name (register number or memory address) used for inter-instruction communication. WAW name dependencies are also called output dependencies: The correctness of a write depends on the write overwriting any writes to the same location that should occur earlier in program order. WAR name dependencies are also called anti-dependencies: The correctness of a read depends on the read result not being contaminated by a write that should occur later in program order.
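
The three hazard categories can be told apart just by comparing register names between an older and a younger instruction. Here is a minimal Python sketch of that comparison; it is not from the textbook. The `Instr` tuple and the `classify` function are hypothetical names invented for illustration, only register operands are considered (no memory addresses), and instructions between the pair are ignored.

```python
from typing import NamedTuple, Optional, Set, Tuple

class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None (e.g. for a store)
    srcs: Tuple[str, ...]    # registers read

def classify(older: Instr, younger: Instr) -> Set[str]:
    """Return the register hazards the younger instruction has on the older one."""
    hazards = set()
    if older.dest is not None and older.dest in younger.srcs:
        hazards.add("RAW")   # true (data) dependency
    if older.dest is not None and older.dest == younger.dest:
        hazards.add("WAW")   # output dependency
    if younger.dest is not None and younger.dest in older.srcs:
        hazards.add("WAR")   # anti-dependency
    return hazards

# The S.D / L.D pair from the WAR example on slide 20.
store = Instr("S.D", None, ("F2", "R29"))
load = Instr("L.D", "F2", ("R29",))
print(classify(store, load))   # {'WAR'}
```

Applied to the MUL.D and L.D instructions of slide 19, the same function reports only a WAW hazard, because both write F0 and neither reads the other's destination.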

ENCM 501 W17 Lectures: Slide Set 8 slide 24/74 Name dependencies are a real problem
The example WAW and WAR hazards given earlier in the Slide Set are easy to fix: the hazards go away if better choices are made about FPRs used for intermediate results.
But what if a compiler has very few registers to allocate? This could happen with a register-poor ISA (like x86!), or with code that for some reason needs to use many registers.
A more common and important problem arises from short loops with long-latency instructions...

ENCM 501 W17 Lectures: Slide Set 8 slide 25/74
Here's a loop to multiply each element in a vector by a factor in register F12...
    L1: L.D    F2, (R8)
        MUL.D  F4, F2, F12
        DADDIU R8, R8, 8
        S.D    F4, (R9)
        DADDIU R9, R9, 8
        BNE    R8, R10, L1
MUL.D instructions have long latencies but they can be pipelined. It would be good to start the second MUL.D before the first MUL.D finishes, the third MUL.D before the second MUL.D finishes, and so on.
Where is there a name dependency in this sequence? (By the way, there are several RAW hazards in this example, too!)

ENCM 501 W17 Lectures: Slide Set 8 slide 27/74 Data hazards related to memory locations Example RAW, WAW and WAR hazards given earlier in this slide set have been related to sequences of writes and reads to registers. Similar hazards can arise related to stores to and loads from memory locations.

ENCM 501 W17 Lectures: Slide Set 8 slide 28/74
Is there any kind of data hazard that could arise from OoO issue or completion of the two load instructions here? If so, what kind of hazard is it?
    L.D    F0, (R8)
    (instructions, but no loads or stores)
    L.D    F2, (R9)

ENCM 501 W17 Lectures: Slide Set 8 slide 29/74
What kind of data hazard is possible involving the store instruction and the load instruction, in an out-of-order system?
    MUL.D  F2, F0, F0
    S.D    F2, 40(R29)
    (instructions, but no loads or stores)
    L.D    F4, (R9)

ENCM 501 W17 Lectures: Slide Set 8 slide 30/74
What kind of data hazard is possible involving the two store instructions, in an out-of-order system?
    MUL.D  F0, F2, F2
    S.D    F0, (R8)
    (instructions, but no loads or stores)
    S.D    F4, (R9)

ENCM 501 W17 Lectures: Slide Set 8 slide 31/74
What kind of data hazard is possible involving the L.D instruction and the S.D instruction, in an out-of-order system?
    LD     R9, (R8)    # read address from memory
    L.D    F0, (R9)    # use just-read address
    (instructions, but no loads or stores)
    S.D    F2, (R10)

ENCM 501 W17 Lectures: Slide Set 8 slide 33/74 Managing hazards in OoO processor designs
Several different design approaches have been used. Because of time constraints in ENCM 501, we'll focus on one design approach, called Tomasulo's algorithm.
Textbook Section 3.4 introduces the algorithm.
Textbook Section 3.5 provides detailed examples of how the algorithm works.
Textbook Section 3.6 shows how the algorithm can be extended to ensure correct processing in the face of problems such as exceptions and branch mispredictions.

ENCM 501 W17 Lectures: Slide Set 8 slide 34/74
The details in Sections 3.4-3.6 focus mainly on data hazards related to writes to and reads from floating-point registers.
There is some discussion, less detailed, of data hazards related to writes to and reads from memory locations.
To keep things as simple as possible (and as simple as possible is still quite complicated, as we'll see), details are left out regarding writes to and reads from general purpose registers. In a practical design, hazards related to GPR reads and writes would be handled in a similar fashion to FPR read and write hazards.

ENCM 501 W17 Lectures: Slide Set 8 slide 36/74 Overview of Tomasulo's algorithm
It's interesting that this approach to instruction scheduling was developed around 50 years ago (in 1966!) for very high-end expensive computers, then re-adopted around 20 years ago for consumer-level processors (Intel Pentium Pro, 1995). It's been in continual use for a huge number of out-of-order processor designs for the last two decades.
The key ideas are:
- try to get execution of an instruction started as soon as the source operands are ready; often, this will require out-of-order issue;
- as soon as an instruction result has been produced, try to broadcast that result to all the instructions that want to consume it.

ENCM 501 W17 Lectures: Slide Set 8 slide 37/74 Goals and non-goals for H&P Sections 3.4 and 3.5
Goals:
- high instruction throughput
- dealing effectively with RAW hazards (which, remember, may occur even with in-order processing)
- dealing effectively with WAW and WAR hazards (which tend to occur with out-of-order processing)
Non-goals (which become goals in Section 3.6):
- high throughput in the face of code with lots of branches
- correct behaviour in the face of exceptions, e.g., hardware interrupts, TLB misses handled by software, various other exceptions

ENCM 501 W17 Lectures: Slide Set 8 slide 38/74 Key components
- instruction fetch and decode unit
- enhanced register files
- reservation stations
- functional units
- common data bus (CDB), or multiple CDBs for superscalar systems

ENCM 501 W17 Lectures: Slide Set 8 slide 39/74 Instruction fetch and decode unit
This unit merges
- an L1 I-cache
- some sort of facility for managing branches and jumps (kind of fuzzy in Sections 3.4-3.5, but definitely dynamic branch prediction in Section 3.6)
- instruction decode capability: when an instruction leaves, it's known exactly what kind of instruction it is, and which registers, offsets and immediate operands are involved
For textbook Sections 3.4-3.6, this unit is scalar: in any one clock cycle, the maximum output is one instruction. (Section 3.8 looks at output of two or more instructions per cycle.)

ENCM 501 W17 Lectures: Slide Set 8 slide 40/74 Instruction fetch and decode unit: key property Instructions are issued from the instruction unit in program order. This is critical for correct avoidance of data hazards!

ENCM 501 W17 Lectures: Slide Set 8 slide 41/74 Enhanced register files
In an in-order processor, a register file is precisely what we've modeled already: If the number of registers is M and the width of a register is N, there must be M × N cells to contain the state of the register file; beyond that, there must also be whatever logic is needed to support parallel reads and/or parallel writes.
In a Tomasulo-based processor, each one of the M registers requires
- N bits for register data
- more bits for register status: is the data up-to-date, and if not, which reservation station will supply the data in the future?

ENCM 501 W17 Lectures: Slide Set 8 slide 42/74
Example: FPR file for MIPS with 16 64-bit FPRs...
Qi = 0 indicates that an FPR is up-to-date; Qi ≠ 0 indicates that an FPR is not up-to-date and is waiting for a result from a reservation station. This example works for a system with up to fifteen reservation stations...
    FPR   data                                                                      Qi
    F0    00110011 00110011 00110011 00110011 00111111 11010011 00110011 00110011   0101
    F2    00000000 00000000 00000000 00000000 10111111 11010100 00000000 00000000   0000
    ...
    F30   01001001 00100100 10010010 01001001 01000000 00001001 00100100 10010010   0110

ENCM 501 W17 Lectures: Slide Set 8 slide 43/74
What do all of the bits on Slide 42 mean? Let's write out the FPR file state in a somewhat more human-friendly format.

ENCM 501 W17 Lectures: Slide Set 8 slide 44/74 Reservation stations
A reservation station
- receives an instruction from the instruction unit;
- waits for source operand data to be ready before starting the execution of the instruction;
- broadcasts the result of the instruction on the CDB, when the result is ready.
Example reservation station: An RS capable of processing either ADD.D or SUB.D needs 6 fields: Busy, Op, Vj, Vk, Qj and Qk. (The textbook also shows an A field, but that field is not needed for an RS for ADD.D / SUB.D.)
Let's make some notes about how the 6 fields are used.

ENCM 501 W17 Lectures: Slide Set 8 slide 45/74
Each reservation station has a unique, nonzero identification number.
Examples in the textbook have three RSs for ADD.D / SUB.D instructions, and have two RSs for MUL.D / DIV.D instructions. So a possible numbering scheme would be 0001, 0010, 0011 for the first three RSs (called Add1, Add2, and Add3 in the book) and 0100, 0101 for the next two (called Mult1 and Mult2).
(Warning: RS stands for reservation station, not for register status. In descriptions of various versions of Tomasulo's algorithm, the textbook uses RegisterStat[x] as an abbreviation for the status of register x.)
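
With that numbering scheme, the Qi fields shown on Slide 42 can be read off directly. The following Python sketch is only an illustration: `RS_NAMES` and `describe_qi` are hypothetical names, and the slides do not say which station has number 0110, so it is reported by number only.

```python
# Station numbers 1 to 5 follow the possible numbering scheme described above.
RS_NAMES = {1: "Add1", 2: "Add2", 3: "Add3", 4: "Mult1", 5: "Mult2"}

def describe_qi(qi_bits: str) -> str:
    """Translate a 4-bit Qi field into a readable register status."""
    tag = int(qi_bits, 2)
    if tag == 0:
        return "up-to-date"
    # Stations without a name in RS_NAMES are reported by number.
    return "waiting for " + RS_NAMES.get(tag, f"RS #{tag}")

for reg, qi in [("F0", "0101"), ("F2", "0000"), ("F30", "0110")]:
    print(reg, describe_qi(qi))
# F0 waiting for Mult2
# F2 up-to-date
# F30 waiting for RS #6
```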

ENCM 501 W17 Lectures: Slide Set 8 slide 46/74 Simple example of instruction issue
Suppose a program has been running for a while, and these will be the next two instructions to leave the instruction unit:
    ADD.D  F4, F0, F2
    SUB.D  F6, F6, F4
Suppose also that registers F0, F2, F4 and F6 are up-to-date, and that RSs Add1 and Add2 are not busy.
Let's make notes about how the two instructions will be issued to the RSs.

ENCM 501 W17 Lectures: Slide Set 8 slide 47/74 Functional units
Functional units are the circuits that perform the execution steps for an instruction. Example functional units are FP adders, FP multipliers, integer ALUs, shifters, and so on.
To keep things relatively simple, we can imagine a one-to-one correspondence between reservation stations and functional units. For example, each RS for ADD.D / SUB.D could be thought of as guarding the entrance of an FP adder-subtractor circuit, and watching the exit of that circuit for a result.
In reality, things will be more complicated: multiple RSs will be set up to feed their operands into a single pipelined functional unit.

ENCM 501 W17 Lectures: Slide Set 8 slide 48/74 Latencies of functional units
Tomasulo's algorithm is designed to manage the effects of multiple-cycle latencies of functional units.
Further, the algorithm is designed to deal with the fact that the latency of a functional unit might vary from one use of the unit to the next. What are some examples of functional units with variable latencies?

ENCM 501 W17 Lectures: Slide Set 8 slide 49/74 CDB: Common data bus
Most reservation stations are capable of broadcasting results on the CDB. RSs for arithmetic instructions and RSs for loads certainly need to be able to do this, but RSs for store instructions don't.
Each reservation station must snoop the CDB, that is, watch the CDB for results needed by that RS.
The register file must also snoop the CDB to grab results that are needed by registers that are currently not up-to-date. (Remember, such a register has some nonzero Qi value to indicate which RS will produce the result the register is waiting for.)

ENCM 501 W17 Lectures: Slide Set 8 slide 50/74 Example instruction completion via CDB
This sequence got started a few slides back...
    ADD.D  F4, F0, F2
    SUB.D  F6, F6, F4
Let's make some notes about how these instructions will get completed. For simplicity, let's assume that the given instructions are followed by a lengthy sequence of instructions that don't use F0, F2, F4 or F6.
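
One way to picture the completion steps is a small Python model of the CDB broadcast. This is a sketch under several assumptions, not the textbook's pseudocode: the dictionary layout, the `broadcast` and `execute` names, and the initial register values (1.5, 2.0, 10.0) are all made up for illustration. The starting state is the one you would expect just after both instructions have been issued to Add1 (tag 1) and Add2 (tag 2).

```python
# State just after issue. Tag 0 means "no pending producer".
regs = {
    "F0": {"value": 1.5, "Qi": 0},
    "F2": {"value": 2.0, "Qi": 0},
    "F4": {"value": 0.0, "Qi": 1},    # stale; waiting for Add1
    "F6": {"value": 10.0, "Qi": 2},   # stale; waiting for Add2
}
stations = {
    1: {"op": "ADD.D", "Vj": 1.5, "Vk": 2.0, "Qj": 0, "Qk": 0},
    2: {"op": "SUB.D", "Vj": 10.0, "Vk": None, "Qj": 0, "Qk": 1},
}

def broadcast(tag, value):
    """Put (tag, value) on the CDB; every snooper with a matching tag captures it."""
    for rs in stations.values():
        if rs["Qj"] == tag:
            rs["Vj"], rs["Qj"] = value, 0
        if rs["Qk"] == tag:
            rs["Vk"], rs["Qk"] = value, 0
    for reg in regs.values():
        if reg["Qi"] == tag:
            reg["value"], reg["Qi"] = value, 0

def execute(tag):
    """Run the instruction held by station `tag`, free the station, broadcast."""
    rs = stations.pop(tag)
    assert rs["Qj"] == 0 and rs["Qk"] == 0, "operands not ready"
    if rs["op"] == "ADD.D":
        result = rs["Vj"] + rs["Vk"]
    else:
        result = rs["Vj"] - rs["Vk"]
    broadcast(tag, result)

execute(1)   # Add1 broadcasts 3.5; Add2 and F4 both capture it
execute(2)   # Add2 now has both operands and broadcasts 6.5
print(regs["F4"]["value"], regs["F6"]["value"])   # 3.5 6.5
```

Note that SUB.D never reads F4 from the register file in this model; it captures the ADD.D result straight off the CDB.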

ENCM 501 W17 Lectures: Slide Set 8 slide 52/74 Tomasulo's algorithm and name dependencies
The algorithm eliminates WAW and WAR hazards in a really elegant way. To understand how this works, it's sufficient to consider a short example. Here the repeated use of F0 creates both a potential WAW hazard and a potential WAR hazard:
    DIV.D  F0, F20, F2
    ADD.D  F22, F22, F0
    MUL.D  F0, F24, F24
    SUB.D  F26, F26, F0
Let's make notes about how the hazards are eliminated.
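
The renaming that happens at issue can be sketched in a few lines of Python. This is an illustration only: the `issue` function, the `qi` and `stations` dictionaries, and the choice of which station gets which instruction are assumptions, and operand values (the V fields) are left out to keep the focus on tags.

```python
TAGS = {"Add1": 1, "Add2": 2, "Mult1": 4, "Mult2": 5}   # numbering from slide 45

qi = {}         # register name -> tag of its pending producer (absent: up-to-date)
stations = {}   # station name -> issued entry

def issue(station, op, dest, src_j, src_k):
    """Issue one instruction in program order, renaming sources to RS tags."""
    stations[station] = {
        "op": op,
        "Qj": qi.get(src_j, 0),   # 0: the value can be read from the register now
        "Qk": qi.get(src_k, 0),
    }
    qi[dest] = TAGS[station]      # younger readers of dest will wait on this station

issue("Mult1", "DIV.D", "F0", "F20", "F2")
issue("Add1", "ADD.D", "F22", "F22", "F0")
issue("Mult2", "MUL.D", "F0", "F24", "F24")
issue("Add2", "SUB.D", "F26", "F26", "F0")

print(stations["Add1"]["Qk"])   # 4: ADD.D waits for the DIV.D result
print(stations["Add2"]["Qk"])   # 5: SUB.D waits for the MUL.D result
print(qi["F0"])                 # 5: register F0 only accepts the MUL.D result
```

In this model ADD.D waits on tag 4 rather than on register F0, so an early MUL.D result cannot reach it (no WAR problem), and because F0 is marked with tag 5, a late DIV.D broadcast with tag 4 would not be captured by the register (no WAW problem).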

ENCM 501 W17 Lectures: Slide Set 8 slide 54/74 Tomasulo's algorithm and memory hazards
The examples given in textbook Sections 3.5 and 3.6 are excellent regarding hazards involving communication of instruction results through floating-point registers (FPRs).
Communication of instruction results through general-purpose registers (GPRs, aka integer registers) is very similar to communication via FPRs, so it's reasonable not to discuss that topic at length.

ENCM 501 W17 Lectures: Slide Set 8 slide 55/74
However, the textbook is, uh, less than excellent about the topic of
- allowing memory accesses (loads and stores) to complete out-of-order when doing so is harmless and helps with instruction throughput;
- forcing in-order completion of loads and stores when necessary to avoid RAW, WAW and WAR hazards.

ENCM 501 W17 Lectures: Slide Set 8 slide 56/74 Older and younger instructions
It's handy to use the words older and younger as adjectives for instructions in an out-of-order system.
Instruction A is older than Instruction B if A came before B in program order. In that case, you could also say that B is younger than A.

ENCM 501 W17 Lectures: Slide Set 8 slide 57/74 Rules for ordering execution of loads and stores
To avoid RAW hazards: It is acceptable to access memory for a load instruction if it is known that there are no incomplete older store instructions that will use the same address as the load.
What similar rules would be needed for avoidance of WAW and WAR hazards?
ENCM 501 won't go into details of hardware solutions for memory data hazards, but here are a couple of key features:
- some sort of queue in which program order of loads and stores is remembered after loads and stores leave the instruction fetch/decode unit
- a capability to quickly compare the address a load or store will use with addresses to be used by older loads or stores
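
The RAW rule for loads can be written as a small predicate. The Python sketch below is an assumption-laden illustration, not a description of real hardware: the `Store` record and `load_may_access_memory` are hypothetical names, addresses are compared for exact equality only (partial overlaps are ignored), and an older store whose address is not yet computed is conservatively treated as a possible conflict.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Store:
    address: Optional[int]   # None until the store's address has been computed
    completed: bool          # True once the store has written memory

def load_may_access_memory(load_address: int, older_stores: List[Store]) -> bool:
    """Conservative RAW check for a load against every older store."""
    for st in older_stores:
        if st.completed:
            continue         # a completed store cannot be passed by the load
        if st.address is None or st.address == load_address:
            return False     # same address, or not yet known to be different
    return True

queue = [Store(0x2000, False), Store(0x1000, True)]
print(load_may_access_memory(0x1000, queue))   # True
queue.append(Store(None, False))
print(load_may_access_memory(0x1000, queue))   # False
```

The list of older stores plays the role of the program-order queue mentioned above, and the loop stands in for the fast address comparison.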

ENCM 501 W17 Lectures: Slide Set 8 slide 59/74 Load and store buffers
Load buffer and store buffer are names given to reservation stations dedicated to handling loads or stores.
Remark: The load buffers and store buffers provide an interface between the execution unit of the processor and the data caches. That's an interesting design problem we don't have time to study in this course.

ENCM 501 W17 Lectures: Slide Set 8 slide 60/74 Vj, Vk, Qj, Qk, A for store buffers
    Busy  Op  Vj  Vk  Qj  Qk  A
As with the FP math stations, Vj is ready if and only if Qj = 0, and the same applies for Vk and Qk.
Vk is used for the FP data to be written in an S.D instruction. So what does it mean if Qk ≠ 0?
Vj, Qj, and A have to do with memory address calculations. Let's not worry about the details for now.

ENCM 501 W17 Lectures: Slide Set 8 slide 61/74 Vj, Vk, Qj, Qk, A for load buffers
    Busy  Op  Vj  Vk  Qj  Qk  A
Again, Vj is ready if and only if Qj = 0, and the same applies for Vk and Qk.
Vj, Qj, and A have to do with memory address calculations. As with store buffers, let's not worry about the details for now.

ENCM 501 W17 Lectures: Slide Set 8 slide 63/74 A loop example
This is from page 179 of the textbook:
    Loop: L.D    F0, 0(R1)
          MUL.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDIU R1, R1, -8
          BNE    R1, R2, Loop
Let's make some notes about the DADDIU and BNE instructions.
Let's assume the loop starts with R1 = 0x600040 and R2 = 0x600000. Let's trace how Tomasulo's algorithm might handle the first two passes through the loop.
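
Before tracing the hardware, it helps to know which addresses the loop touches. The short Python sketch below just models the DADDIU and the branch arithmetic with the starting values given above; the variable names are made up, and the branch is modelled as a plain loop test (no branch delay slot).

```python
r1, r2 = 0x600040, 0x600000
addresses = []

while True:
    addresses.append(r1)   # L.D and S.D both use address 0(R1)
    r1 -= 8                # DADDIU R1, R1, -8
    if r1 == r2:           # BNE falls through once R1 equals R2
        break

print(len(addresses))                     # 8
print([hex(a) for a in addresses[:2]])    # ['0x600040', '0x600038']
print(hex(addresses[-1]))                 # 0x600008
```

So with these starting values there are 8 passes, and the first two passes use addresses 0x600040 and 0x600038.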

ENCM 501 W17 Lectures: Slide Set 8 slide 65/74 Costs of the CDB (common data bus)
In a typical clock cycle, some reservation station will broadcast a result on the CDB, and other reservation stations and the register file will look at the result to see if it's useful. Transmitting the result and receiving the result both have energy costs.
A complex instruction unit, reservation stations, and related hardware require lots of transistors. If Moore's law had not applied for so many decades, we would not see Tomasulo's algorithm used as a basis for design of modestly priced processor chips.
It's possible, in some cycles, that two or more reservation stations will simultaneously try to broadcast their results. Why is this not a fatal defect in Tomasulo's algorithm?

ENCM 501 W17 Lectures: Slide Set 8 slide 67/74 Tomasulo's algorithm and branch prediction
Consider this code fragment:
    BEQ    R8, R0, L99
    S.D    F0, (R10)
    ADD.D  F2, F2, F4
Suppose the branch is incorrectly predicted as not taken, and S.D and ADD.D get issued while BEQ waits for some earlier instruction to provide a value for R8.
If Tomasulo's algorithm does nothing beyond what has been presented so far in lectures, what will prevent S.D from making an incorrect update to memory, and what will prevent ADD.D from making an incorrect update to F2?

ENCM 501 W17 Lectures: Slide Set 8 slide 68/74 Tomasulo's algorithm and exceptions
    MUL.D  F2, F4, F6
    S.D    F2, 0(R8)
    SUB.D  F0, F12, F14
    L.D    F2, 0(R9)
    ADD.D  F8, F8, F2
Suppose MUL.D gets delayed because it has to wait until a result for F6 is ready. That will delay the execution of S.D.
Meanwhile, Tomasulo's algorithm may allow completion of SUB.D, L.D, and ADD.D.
What kind of problem is created if S.D eventually results in a page fault exception?

ENCM 501 W17 Lectures: Slide Set 8 slide 70/74 Out-of-order execution, in-order completion
The version of Tomasulo's algorithm presented in textbook Section 3.5 has scalar issue (that is, at most one instruction issued per clock cycle), out-of-order execution, and out-of-order completion.
Section 3.6 modifies the algorithm to include a circuit called a reorder buffer (ROB), which will enforce in-order completion.
Use of a reorder buffer solves the branch prediction and exception problems described on slides 67 and 68.

ENCM 501 W17 Lectures: Slide Set 8 slide 71/74
In a processor with a reorder buffer, issue of an instruction sends information related to the instruction both to a reservation station and to the reorder buffer.
A reservation station for a store is responsible for address computation only; it is not allowed to write to memory.
The reorder buffer is a FIFO queue: instructions enter in program order, and leave in program order.
When an instruction gets to the head of the ROB, it can be committed as soon as its results are known. Examples:
- An ADD.D can be committed if a reservation station has provided the sum to the reorder buffer.
- An S.D can be committed if both the data to be stored and the address to be used are ready.
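
The FIFO behaviour of the ROB can be sketched with a Python `deque`. This is a toy model under stated assumptions: the entry layout and the `issue` and `commit_ready` names are invented here, and "ready" stands for whatever condition the instruction type needs (a result value, or store data plus address).

```python
from collections import deque

rob = deque()   # entries enter at the tail in program order

def issue(name):
    """Allocate a ROB entry for a newly issued instruction."""
    entry = {"name": name, "ready": False}
    rob.append(entry)
    return entry

def commit_ready():
    """Commit from the head only; stop at the first entry that is not ready."""
    committed = []
    while rob and rob[0]["ready"]:
        committed.append(rob.popleft()["name"])
    return committed

mul, st, sub = issue("MUL.D"), issue("S.D"), issue("SUB.D")

sub["ready"] = True        # SUB.D finishes execution first
first = commit_ready()     # nothing commits: SUB.D is behind MUL.D and S.D

mul["ready"] = True
st["ready"] = True
second = commit_ready()    # all three commit, in program order

print(first, second)       # [] ['MUL.D', 'S.D', 'SUB.D']
```

Execution order here is SUB.D first, but the commit order is still MUL.D, S.D, SUB.D.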

ENCM 501 W17 Lectures: Slide Set 8 slide 72/74
Register file changes: The Qi field for each register is replaced by a Busy flag and a Reorder # field. Busy = 0 means the register is up-to-date; Busy = 1 means the register is waiting for a result from whatever entry in the reorder buffer matches the Reorder #.
The register file does not watch the CDB for results.
The ROB must watch the CDB for results for all of the instructions within the ROB that don't yet have results.

ENCM 501 W17 Lectures: Slide Set 8 slide 73/74
The reservation stations and functional units work very much as before, except:
- the Qj and Qk fields hold ROB entry numbers instead of reservation station numbers;
- each reservation station has a Dest field to hold an ROB entry number;
- when a reservation station broadcasts its result on the CDB, it includes the Dest field value to help both the ROB and the other reservation stations.

ENCM 501 W17 Lectures: Slide Set 8 slide 74/74 The reorder buffer and safe speculation
The key point about the ROB is that it can collect a large number of results without knowing whether those results should really be written to registers or memory.
Consider a branch instruction that is mispredicted as taken.
What happens to all the instructions that got into the ROB before the branch?
What happens to the branch target instruction, the successor of the branch target instruction, etc., which got into the ROB after the branch?
The bad effect of the above scenario is a waste of time and energy. What are the important bad effects that were prevented?