Slide Set 9 for ENCM 501 in Winter 2018
Steve Norman, PhD, PEng
Electrical & Computer Engineering
Schulich School of Engineering
University of Calgary
March 2018

ENCM 501 Winter 2018 Slide Set 9 slide 2/42
Contents
- Overview of Tomasulo's algorithm
- Tomasulo's algorithm and name dependencies
- Tomasulo's algorithm and memory hazards
- Load and store buffers
- A loop example
- Costs of the CDB (common data bus)
- Tomasulo's algorithm, branches, and exceptions
- Out-of-order execution, in-order completion

ENCM 501 Winter 2018 Slide Set 9 slide 4/42
Overview of Tomasulo's algorithm
It's interesting that this approach to instruction scheduling was developed around 50 years ago (in 1966!) for very high-end, expensive computers, then re-adopted around 20 years ago for consumer-level processors (Intel Pentium Pro, 1995). It's been in continual use for a huge number of out-of-order processor designs for the last two decades.
The key ideas are:
- try to get execution of an instruction started as soon as the source operands are ready; often, this will require out-of-order execution;
- as soon as an instruction result has been produced, try to broadcast that result to all the instructions that want to consume it.

ENCM 501 Winter 2018 Slide Set 9 slide 5/42
Goals and non-goals for H&P Sections 3.4 and 3.5
Goals:
- high instruction throughput
- dealing effectively with RAW hazards (which, remember, may occur even with in-order processing)
- dealing effectively with WAW and WAR hazards (which tend to occur with out-of-order processing)
Non-goals (which become goals in Section 3.6):
- high throughput in the face of code with lots of branches
- correct behaviour in the face of exceptions, e.g., hardware interrupts, TLB misses handled by software, various other exceptions

ENCM 501 Winter 2018 Slide Set 9 slide 6/42
Key components
- instruction fetch and decode unit
- enhanced register files
- reservation stations
- functional units
- common data bus (CDB), or multiple CDBs for superscalar systems

ENCM 501 Winter 2018 Slide Set 9 slide 7/42
Instruction fetch and decode unit
This unit merges
- an L1 I-cache
- some sort of facility for managing branches and jumps (kind of fuzzy in Sections 3.4-3.5, but definitely dynamic branch prediction in Section 3.6)
- instruction decode capability: when an instruction leaves, it's known exactly what kind of instruction it is, and which registers, offsets and immediate operands are involved
For textbook Sections 3.4-3.6, this unit is scalar: in any one clock cycle, the maximum output is one instruction. (Section 3.8 looks at output of two or more instructions per cycle.)

ENCM 501 Winter 2018 Slide Set 9 slide 8/42
Instruction fetch and decode unit: key property
Instructions are issued from the instruction unit in program order. This is critical for correct avoidance of data hazards!

ENCM 501 Winter 2018 Slide Set 9 slide 9/42
Enhanced register files
In an in-order processor, a register file is precisely what we've modeled already: if the number of registers is M and the width of a register is N, there must be M × N cells to contain the state of the register file; beyond that, there must also be whatever logic is needed to support parallel reads and/or parallel writes.
In a Tomasulo-based processor, each one of the M registers requires
- N bits for register data
- more bits for register status: is the data up-to-date, and if not, which reservation station will supply the data in the future?

ENCM 501 Winter 2018 Slide Set 9 slide 10/42
Example: FPR file for MIPS with 16 64-bit FPRs...
Qi = 0 indicates that an FPR is up-to-date; Qi ≠ 0 indicates that an FPR is not up-to-date and is waiting for a result from a reservation station. This example works for a system with up to fifteen reservation stations...
FPR   data                                                                      Qi
F0    00110011 00110011 00110011 00110011 00111111 11010011 00110011 00110011   0101
F2    00000000 00000000 00000000 00000000 10111111 11010100 00000000 00000000   0000
...   ...                                                                       ...
F30   01001001 00100100 10010010 01001001 01000000 00001001 00100100 10010010   0110

ENCM 501 Winter 2018 Slide Set 9 slide 11/42
What do all of the bits on Slide 10 mean?
Let's write out the FPR file state in a somewhat more human-friendly format.
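
One way to produce such a human-friendly listing is sketched below in Python. This is only an illustration, not something from the slides. It assumes that, in each data field on Slide 10, the 32-bit half containing the sign and exponent bits (the half beginning 00111111 for F0, 10111111 for F2, and 01000000 for F30) is the high-order half of the register. The names fpr_file, as_double and describe are made up for this sketch.

```python
import struct

# Hypothetical snapshot of the three registers shown on Slide 10:
# name -> (64-bit data pattern, 4-bit Qi tag).
# Assumption: the 32-bit half holding the sign and exponent is the high half.
fpr_file = {
    "F0": (0x3FD3333333333333, 0b0101),
    "F2": (0xBFD4000000000000, 0b0000),
    "F30": (0x4009249249249249, 0b0110),
}


def as_double(bits):
    """Reinterpret a 64-bit integer as an IEEE 754 double-precision value."""
    return struct.unpack(">d", bits.to_bytes(8, "big"))[0]


def describe(name, bits, qi):
    """Return a one-line, human-friendly summary of one FPR."""
    if qi == 0:
        return f"{name}: up-to-date, value {as_double(bits)}"
    # The data bits of a register with nonzero Qi are stale, so only the tag matters.
    return f"{name}: not up-to-date, waiting for RS number {qi}"


for name, (bits, qi) in fpr_file.items():
    print(describe(name, bits, qi))
```

Under that assumption, F2 is the only one of the three registers whose data bits are currently meaningful; F0 and F30 are waiting for reservation stations 5 and 6.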

ENCM 501 Winter 2018 Slide Set 9 slide 12/42
Reservation stations
A reservation station
- receives an instruction from the instruction unit;
- waits for source operand data to be ready before starting the execution of the instruction;
- broadcasts the result of the instruction on the CDB, when the result is ready.
Example reservation station: An RS capable of processing either ADD.D or SUB.D needs 6 fields: Busy, Op, Vj, Vk, Qj and Qk. (The textbook also shows an A field, but that field is not needed for an RS for ADD.D / SUB.D.)
Let's make some notes about how the 6 fields are used.

ENCM 501 Winter 2018 Slide Set 9 slide 13/42
Each reservation station has a unique, nonzero identification number.
Examples in the textbook have three RSs for ADD.D / SUB.D instructions, and have two RSs for MUL.D / DIV.D instructions. So a possible numbering scheme would be 0001, 0010, 0011 for the first three RSs (called Add1, Add2, and Add3 in the book) and 0100, 0101 for the next two (called Mult1 and Mult2).
(Warning: RS stands for reservation station, not for register status. In descriptions of various versions of Tomasulo's algorithm, the textbook uses RegisterStat[x] as an abbreviation for the status of register x.)

ENCM 501 Winter 2018 Slide Set 9 slide 14/42
Simple example of instruction issue
Suppose a program has been running for a while, and these will be the next two instructions to leave the instruction unit:
    ADD.D F4, F0, F2
    SUB.D F6, F6, F4
Suppose also that registers F0, F2, F4 and F6 are up-to-date, and that RSs Add1 and Add2 are not busy.
Let's make notes about how the two instructions will be issued to the RSs.
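
The issue step for these two instructions can be sketched in Python as follows. This is a rough model, not the textbook's exact bookkeeping: the register values are invented, Add1 and Add2 are assumed to be numbered 1 and 2, and the names ReservationStation, regs, qi, stations and issue are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    busy: bool = False
    op: str = ""
    vj: Optional[float] = None  # first operand value, valid only when qj == 0
    vk: Optional[float] = None  # second operand value, valid only when qk == 0
    qj: int = 0                 # number of the RS that will produce vj, or 0
    qk: int = 0                 # number of the RS that will produce vk, or 0


# Register values are made up for illustration; Qi = 0 means up-to-date.
regs = {"F0": 1.5, "F2": 2.0, "F4": 9.0, "F6": 10.0}
qi = {name: 0 for name in regs}
stations = {1: ReservationStation(), 2: ReservationStation()}  # Add1, Add2


def issue(rs_number, op, dest, src1, src2):
    """Issue one instruction to a free RS, reading each source value or its tag."""
    rs = stations[rs_number]
    if rs.busy:
        raise RuntimeError("reservation station is busy")
    rs.busy, rs.op = True, op
    rs.qj, rs.qk = qi[src1], qi[src2]
    rs.vj = regs[src1] if rs.qj == 0 else None
    rs.vk = regs[src2] if rs.qk == 0 else None
    # Only after the sources are read does the destination get the new tag.
    qi[dest] = rs_number


issue(1, "ADD.D", "F4", "F0", "F2")
issue(2, "SUB.D", "F6", "F6", "F4")
print(stations[1])
print(stations[2])
print(qi)
```

In this sketch Add1 gets both operand values right away, while Add2 gets a value for F6 but only a tag (Qk = 1) for F4, because F4 is now waiting for Add1.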

ENCM 501 Winter 2018 Slide Set 9 slide 15/42
Functional units
Functional units are the circuits that perform the execution steps for an instruction. Example functional units are FP adders, FP multipliers, integer ALUs, shifters, and so on.
To keep things relatively simple, we can imagine a one-to-one correspondence between reservation stations and functional units. For example, each RS for ADD.D / SUB.D could be thought of as guarding the entrance of an FP adder-subtractor circuit, and watching the exit of that circuit for a result.
In reality, things will be more complicated: multiple RSs will be set up to feed their operands into a single pipelined functional unit.

ENCM 501 Winter 2018 Slide Set 9 slide 16/42
Latencies of functional units
Tomasulo's algorithm is designed to manage the effects of multiple-cycle latencies of functional units. Further, the algorithm is designed to deal with the fact that the latency of a functional unit might vary from one use of the unit to the next.
What are some examples of functional units with variable latencies?

ENCM 501 Winter 2018 Slide Set 9 slide 17/42
CDB: Common data bus
Most reservation stations are capable of broadcasting results on the CDB. RSs for arithmetic instructions and RSs for loads certainly need to be able to do this, but RSs for store instructions don't.
Each reservation station must snoop the CDB: watch the CDB for results needed by that RS.
The register file must also snoop the CDB to grab results that are needed by registers that are currently not up-to-date. (Remember, such a register has some nonzero Qi value to indicate which RS will produce the result the register is waiting for.)

ENCM 501 Winter 2018 Slide Set 9 slide 18/42
Example instruction completion via CDB
This sequence got started a few slides back...
    ADD.D F4, F0, F2
    SUB.D F6, F6, F4
Let's make some notes about how these instructions will get completed. For simplicity, let's assume that the given instructions are followed by a lengthy sequence of instructions that don't use F0, F2, F4 or F6.
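
A minimal Python sketch of completion via the CDB is given below. It starts from a hand-written state that the two instructions could be in just after issue; the operand values are invented, Add1 and Add2 are assumed to be RS numbers 1 and 2, and the helper names ready, execute and broadcast are hypothetical. Functional-unit latency is ignored.

```python
# State just after issue, with made-up operand values.
regs = {"F0": 1.5, "F2": 2.0, "F4": 9.0, "F6": 10.0}
qi = {"F0": 0, "F2": 0, "F4": 1, "F6": 2}  # F4 waits for Add1, F6 for Add2
stations = {
    1: {"busy": True, "op": "ADD.D", "vj": 1.5, "vk": 2.0, "qj": 0, "qk": 0},
    2: {"busy": True, "op": "SUB.D", "vj": 10.0, "vk": None, "qj": 0, "qk": 1},
}


def ready(rs):
    """An RS may start execution once both of its source tags are zero."""
    return rs["busy"] and rs["qj"] == 0 and rs["qk"] == 0


def execute(rs):
    """Compute the result for an ADD.D or SUB.D reservation station."""
    if rs["op"] == "ADD.D":
        return rs["vj"] + rs["vk"]
    return rs["vj"] - rs["vk"]


def broadcast(tag, value):
    """Put (tag, value) on the CDB; every snooper with a matching tag grabs it."""
    for rs in stations.values():
        if rs["busy"] and rs["qj"] == tag:
            rs["vj"], rs["qj"] = value, 0
        if rs["busy"] and rs["qk"] == tag:
            rs["vk"], rs["qk"] = value, 0
    for name in regs:
        if qi[name] == tag:
            regs[name], qi[name] = value, 0
    stations[tag]["busy"] = False  # the broadcasting RS is now free


print(ready(stations[1]), ready(stations[2]))  # Add1 can start, Add2 cannot
broadcast(1, execute(stations[1]))             # ADD.D result goes to F4 and Add2
print(ready(stations[2]))                      # now Add2 has both operands
broadcast(2, execute(stations[2]))             # SUB.D result goes to F6
print(regs, qi)
```

The point to notice is that SUB.D never reads F4 from the register file: it picks the sum up from the CDB at the same moment the register file does.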

ENCM 501 Winter 2018 Slide Set 9 slide 20/42
Tomasulo's algorithm and name dependencies
The algorithm eliminates WAW and WAR hazards in a really elegant way. To understand how this works, it's sufficient to consider a short example. Here the repeated use of F0 creates both a potential WAW hazard and a potential WAR hazard:
    DIV.D F0, F20, F2
    ADD.D F22, F22, F0
    MUL.D F0, F24, F24
    SUB.D F26, F26, F0
Let's make notes about how the hazards are eliminated.
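
The tag bookkeeping behind this can be sketched in a few lines of Python. The sketch tracks only Qi tags, not data, and it assumes (hypothetically) that the four instructions are issued to Mult1, Add1, Mult2 and Add2 and that all registers start out up-to-date. The names qi, program, waits_on and registers_updated_by are made up.

```python
# Register status: register name -> name of the RS that will produce it,
# or None if the register is up-to-date.
qi = {r: None for r in ("F0", "F2", "F20", "F22", "F24", "F26")}

# (reservation station, op, dest, src1, src2); the RS assignment is an assumption.
program = [
    ("Mult1", "DIV.D", "F0", "F20", "F2"),
    ("Add1", "ADD.D", "F22", "F22", "F0"),
    ("Mult2", "MUL.D", "F0", "F24", "F24"),
    ("Add2", "SUB.D", "F26", "F26", "F0"),
]

waits_on = {}  # RS name -> tags captured for its sources at issue time
for rs, op, dest, src1, src2 in program:
    waits_on[rs] = [qi[s] for s in (src1, src2) if qi[s] is not None]
    qi[dest] = rs  # a later writer of the same register replaces the tag


def registers_updated_by(tag):
    """Registers that will accept a CDB result carrying this tag."""
    return [r for r, t in qi.items() if t == tag]


print(waits_on)
print(registers_updated_by("Mult1"), registers_updated_by("Mult2"))
```

In this model ADD.D waits for the tag Mult1 rather than for register F0, so MUL.D can finish early without disturbing it (no WAR hazard). And when DIV.D finally broadcasts, no register is still waiting for Mult1, so its result cannot overwrite the newer value of F0 (no WAW hazard).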

ENCM 501 Winter 2018 Slide Set 9 slide 22/42
Tomasulo's algorithm and memory hazards
The examples given in textbook Sections 3.5 and 3.6 are excellent regarding hazards involving communication of instruction results through floating-point registers (FPRs).
Communication of instruction results through general-purpose registers (GPRs, aka integer registers) is very similar to communication via FPRs, so it's reasonable not to discuss that topic at length.

ENCM 501 Winter 2018 Slide Set 9 slide 23/42
However, the textbook is, uh, less than excellent about the topic of
- allowing memory accesses (loads and stores) to complete out-of-order when doing so is harmless and helps with instruction throughput;
- forcing in-order completion of loads and stores when necessary to avoid RAW, WAW and WAR hazards.

ENCM 501 Winter 2018 Slide Set 9 slide 24/42
Older and younger instructions
It's handy to use the words "older" and "younger" as adjectives for instructions in an out-of-order system.
Instruction A is older than Instruction B if A came before B in program order. In that case, you could also say that B is younger than A.

ENCM 501 Winter 2018 Slide Set 9 slide 25/42
Rules for ordering execution of loads and stores
To avoid RAW hazards: It is acceptable to access memory for a load instruction if it is known that there are no incomplete older store instructions that will use the same address as the load.
What similar rules would be needed for avoidance of WAW and WAR hazards?
ENCM 501 won't go into details of hardware solutions for memory data hazards, but here are a couple of key features:
- some sort of queue in which program order of loads and stores is remembered after loads and stores leave the instruction fetch/decode unit
- a capability to quickly compare the address a load or store will use with addresses to be used by older loads or stores
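
The RAW rule above, together with the two hardware features, can be sketched as a check over a program-order queue. This is a software illustration only; real hardware does the comparisons in parallel. It makes one conservative assumption that the slides do not spell out: an older store whose address has not been computed yet is treated as a possible conflict. The names MemOp and load_may_access_memory are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemOp:
    kind: str                # "load" or "store"
    address: Optional[int]   # None while the address is still being computed
    done: bool = False       # True once the memory access has completed


def load_may_access_memory(queue, index):
    """Check the RAW rule for the load at queue[index].

    queue holds loads and stores in program order, oldest first.
    """
    load = queue[index]
    if load.address is None:
        return False  # the load's own address is not ready yet
    for older in queue[:index]:
        if older.kind == "store" and not older.done:
            # Assumption: an unknown store address counts as a possible match.
            if older.address is None or older.address == load.address:
                return False
    return True


queue = [
    MemOp("store", 0x1000),
    MemOp("load", 0x2000),   # different address from the older store
    MemOp("load", 0x1000),   # same address as the incomplete older store
    MemOp("store", None),
    MemOp("load", 0x3000),   # an older store has an unknown address
]
print([load_may_access_memory(queue, i) for i in (1, 2, 4)])
```

Only the load at index 1 is allowed to go ahead; the other two must wait until the older stores complete or have known, different addresses.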

ENCM 501 Winter 2018 Slide Set 9 slide 27/42
Load and store buffers
"Load buffer" and "store buffer" are names given to reservation stations dedicated to handling loads or stores.
Remark: The load buffers and store buffers provide an interface between the execution unit of the processor and the data caches. That's an interesting design problem we don't have time to study in this course.

ENCM 501 Winter 2018 Slide Set 9 slide 28/42
Vj, Vk, Qj, Qk, A for store buffers
Fields: Busy, Op, Vj, Vk, Qj, Qk, A
As with the FP math stations, Vj is ready if and only if Qj = 0, and the same applies for Vk and Qk.
Vk is used for the FP data to be written in an S.D instruction. So what does it mean if Qk ≠ 0?
Vj, Qj, and A have to do with memory address calculations. Let's not worry about the details for now.

ENCM 501 Winter 2018 Slide Set 9 slide 29/42
Vj, Vk, Qj, Qk, A for load buffers
Fields: Busy, Op, Vj, Vk, Qj, Qk, A
Again, Vj is ready if and only if Qj = 0, and the same applies for Vk and Qk.
Vj, Qj, and A have to do with memory address calculations. As with store buffers, let's not worry about the details for now.

ENCM 501 Winter 2018 Slide Set 9 slide 31/42
A loop example
This is from page 179 of the textbook:
    Loop: L.D    F0, 0(R1)
          MUL.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDIU R1, R1, -8
          BNE    R1, R2, Loop
Let's make some notes about the DADDIU and BNE instructions.
Let's assume the loop starts with R1 = 0x600040 and R2 = 0x600000. Let's trace how Tomasulo's algorithm might handle the first two passes through the loop.
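
As a quick check on what the integer instructions do with those starting values, here is a small Python sketch that follows only R1 and the branch, ignoring the floating-point work. The variable names are made up for the sketch.

```python
r1, r2 = 0x600040, 0x600000

addresses = []  # address used by L.D and S.D in each pass
while True:
    addresses.append(r1)  # L.D F0, 0(R1) and S.D F4, 0(R1)
    r1 -= 8               # DADDIU R1, R1, -8
    if r1 == r2:          # BNE R1, R2, Loop falls through when R1 == R2
        break

print(len(addresses))
print([hex(a) for a in addresses])
```

With these values the loop makes 8 passes, touching addresses 0x600040 down to 0x600008 in steps of 8 bytes, so consecutive passes never use the same memory address.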

ENCM 501 Winter 2018 Slide Set 9 slide 33/42
Costs of the CDB (common data bus)
In a typical clock cycle, some reservation station will broadcast a result on the CDB, and other reservation stations and the register file will look at the result to see if it's useful. Transmitting the result and receiving the result both have energy costs.
A complex instruction unit, reservation stations, and related hardware require lots of transistors. If Moore's law had not applied for so many decades, we would not see Tomasulo's algorithm used as a basis for design of modestly priced processor chips.
It's possible, in some cycles, that two or more reservation stations will simultaneously try to broadcast their results. Why is this not a fatal defect in Tomasulo's algorithm?

ENCM 501 Winter 2018 Slide Set 9 slide 35/42
Tomasulo's algorithm and branch prediction
Consider this code fragment:
    BEQ   R8, R0, L99
    S.D   F0, (R10)
    ADD.D F2, F2, F4
Suppose the branch is incorrectly predicted as not taken, and S.D and ADD.D get issued while BEQ waits for some earlier instruction to provide a value for R8.
If Tomasulo's algorithm does nothing beyond what has been presented so far in lectures, what will prevent S.D from making an incorrect update to memory, and what will prevent ADD.D from making an incorrect update to F2?

ENCM 501 Winter 2018 Slide Set 9 slide 36/42
Tomasulo's algorithm and exceptions
    MUL.D F2, F4, F6
    S.D   F2, 0(R8)
    SUB.D F0, F12, F14
    L.D   F2, 0(R9)
    ADD.D F8, F8, F2
Suppose MUL.D gets delayed because it has to wait until a result for F6 is ready. That will delay the execution of S.D. Meanwhile, Tomasulo's algorithm may allow completion of SUB.D, L.D, and ADD.D.
What kind of problem is created if S.D eventually results in a page fault exception?

ENCM 501 Winter 2018 Slide Set 9 slide 38/42
Out-of-order execution, in-order completion
The version of Tomasulo's algorithm presented in textbook Section 3.5 has scalar issue (that is, at most one instruction issued per clock cycle), out-of-order execution, and out-of-order completion.
Section 3.6 modifies the algorithm to include a circuit called a reorder buffer (ROB), which will enforce in-order completion.
Use of a reorder buffer solves the branch prediction and exception problems described on slides 35 and 36.

ENCM 501 Winter 2018 Slide Set 9 slide 39/42
In a processor with a reorder buffer, issue of an instruction sends information related to the instruction both to a reservation station and to the reorder buffer.
A reservation station for a store is responsible for address computation only; it is not allowed to write to memory.
The reorder buffer is a FIFO queue: instructions enter in program order, and leave in program order.
When an instruction gets to the head of the ROB, it can be committed as soon as its results are known. Examples:
- An ADD.D can be committed if a reservation station has provided the sum to the reorder buffer.
- An S.D can be committed if both the data to be stored and the address to be used are ready.
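
The FIFO commit rule can be sketched with a Python deque. This is a simplified model: each ROB entry just carries a ready flag, and the helper commits every ready entry at the head in one call, whereas real hardware would have some limit on commits per cycle. The names rob, issue and commit_ready are hypothetical.

```python
from collections import deque

rob = deque()  # entries in program order; the left end is the head


def issue(name):
    """Add an instruction to the tail of the ROB and return its entry."""
    entry = {"name": name, "ready": False}
    rob.append(entry)
    return entry


def commit_ready():
    """Commit from the head while the head's results are known."""
    committed = []
    while rob and rob[0]["ready"]:
        committed.append(rob.popleft()["name"])
    return committed


mul = issue("MUL.D")
store = issue("S.D")
sub = issue("SUB.D")

sub["ready"] = True    # the youngest instruction finishes first
first = commit_ready()
mul["ready"] = True
second = commit_ready()
store["ready"] = True
third = commit_ready()
print(first, second, third)
```

In this run SUB.D finishes executing first but cannot commit until MUL.D and S.D, which are older, have both committed.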

ENCM 501 Winter 2018 Slide Set 9 slide 40/42
Register file changes: The Qi field for each register is replaced by a Busy flag and a Reorder # field.
Busy = 0 means the register is up-to-date; Busy = 1 means the register is waiting for a result from whatever entry in the reorder buffer matches the Reorder #.
The register file does not watch the CDB for results. The ROB must watch the CDB for results for all of the instructions within the ROB that don't yet have results.

ENCM 501 Winter 2018 Slide Set 9 slide 41/42
The reservation stations and functional units work very much as before, except:
- the Qj and Qk fields hold ROB entry numbers instead of reservation station numbers;
- each reservation station has a Dest field to hold an ROB entry number;
- when a reservation station broadcasts its result on the CDB, it includes the Dest field value to help both the ROB and the other reservation stations.

ENCM 501 Winter 2018 Slide Set 9 slide 42/42
The reorder buffer and safe speculation
The key point about the ROB is that it can collect a large number of results without knowing whether those results should really be written to registers or memory.
Consider a branch instruction that is mispredicted as taken. What happens to all the instructions that got into the ROB before the branch? What happens to the branch target instruction, the successor of the branch target instruction, etc., which got into the ROB after the branch?
The bad effect of the above scenario is a waste of time and energy. What are the important bad effects that were prevented?