DYNAMIC INSTRUCTION SCHEDULING WITH TOMASULO


Slides by: Pedro Tomás
Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
Course: Advanced Computer Architectures (Arquitecturas Avançadas de Computadores, AAC)

Outline
Dynamic instruction scheduling:
- Revision of the scoreboard
- Tomasulo's algorithm
- Example execution using Tomasulo's algorithm

Dynamic scheduling: Scoreboard revision
Divide the ID/OF stage in two parts:
- ISSUE: instruction decoding and verification of structural and WAW hazards. Once all structural and WAW conflicts are solved, issue the instruction.
- READ OPERANDS (Dispatch): wait until all data hazards are solved, then read the operands from the register file and dispatch the instruction to execution.
[Figure: pipeline with IF stage, ISSUE stage, DISP. (controlled by the scoreboard), EX/MEM stage and WB stage. The IF and ISSUE stages run in order; dispatch, execution and write back run out of order.]

Scoreboard revision: Issue stage
Issue the next instruction if no WAW or structural hazard is found:
- WAW hazard if the destination register is already going to be written by an instruction
- Structural hazard if the FU is already busy
Scoreboard tables:
- Instruction status: the cycle in which each instruction completed Issue, Disp., EX and WB. Instructions in the example: L.D F6,34(R2); L.D F2,45(R3); MUL.D F0,F2,F4; SUB.D F8,F6,F2; DIV.D F10,F0,F6; ADD.D F6,F8,F2.
- FU status, one row per FU (MULT1, MULT2, ADD, DIV): Busy, Op, Fi (DR), Fj (SA), Fk (SB), Qj and Qk (generated by FU?), Rj and Rk (data ready?). Fill the correct row if no hazard is found.
- Register results status (F0, F2, F4, ..., F30): assign the FU that will write to the register.
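The issue test described above can be sketched in a few lines. This is only an illustration: the function names `can_issue` and `issue`, and the dictionary layout of the two tables, are assumptions made for the example and do not come from the slides.

```python
def can_issue(fu, dest, fu_busy, reg_writer):
    """Return True when the scoreboard may issue an instruction.

    fu:         name of the functional unit the instruction needs
    dest:       destination register of the instruction
    fu_busy:    dict mapping FU name to True when the FU is busy
    reg_writer: dict mapping register to the FU that will write it
    """
    structural_hazard = fu_busy.get(fu, False)
    waw_hazard = reg_writer.get(dest) is not None
    return not (structural_hazard or waw_hazard)


def issue(fu, dest, fu_busy, reg_writer):
    """Fill the scoreboard tables for an instruction that passed can_issue."""
    fu_busy[fu] = True
    reg_writer[dest] = fu  # assign the FU that will write to the register


# Hypothetical snapshot: MULT1 is busy and will write F0.
fu_busy = {"MULT1": True, "MULT2": False, "ADD": False, "DIV": False}
reg_writer = {"F0": "MULT1"}

print(can_issue("ADD", "F8", fu_busy, reg_writer))    # no hazard
print(can_issue("MULT1", "F2", fu_busy, reg_writer))  # structural hazard
print(can_issue("ADD", "F0", fu_busy, reg_writer))    # WAW hazard on F0
```

Both hazards are checked before any table is modified, so a stalled instruction leaves the scoreboard untouched and can simply be retried in the next cycle.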

Scoreboard revision: Dispatch stage
Dispatch all instructions that have valid operands to execution.
Example (FU status, data ready? columns):
  FU      Rj    Rk    Action
  MULT1   YES   YES   Dispatch and set Rj, Rk to No
  ADD     NO    YES   Don't dispatch

Scoreboard revision: Execute stage
Wait for the instruction to complete execution and inform the scoreboard on finish.

Scoreboard revision: Write back stage
Write the result to the destination register if no WAR hazard is found:
- WAR hazard if an instruction still requires the value on the register; this happens if a preceding instruction is stuck on the dispatch stage waiting for some other value
On write:
- Clear the slot in the FU status table
- Set the register value as valid in the register results status
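A minimal sketch of the WAR test at write back follows. It assumes the usual textbook encoding in which Rj/Rk are true while an operand is ready but has not been read yet (they are cleared at dispatch); the function name `can_write_back` and the dictionary layout are hypothetical.

```python
def can_write_back(fu, fu_status):
    """Return True when `fu` may write its result without a WAR hazard.

    fu_status maps an FU name to a dict with the keys busy, fi (destination),
    fj and fk (sources), rj and rk (True while the operand is ready but has
    not been read yet; this encoding is an assumption of the sketch).
    """
    dest = fu_status[fu]["fi"]
    for other, st in fu_status.items():
        if other == fu or not st["busy"]:
            continue
        if (st["fj"] == dest and st["rj"]) or (st["fk"] == dest and st["rk"]):
            return False  # another instruction still needs the old value
    return True


# DIV.D F10,F0,F6 is stuck at dispatch waiting for F0; F6 is ready but unread.
# ADD.D F6,F8,F2 has finished executing and wants to overwrite F6.
fu_status = {
    "DIV": {"busy": True, "fi": "F10", "fj": "F0", "fk": "F6",
            "rj": False, "rk": True},
    "ADD": {"busy": True, "fi": "F6", "fj": "F8", "fk": "F2",
            "rj": False, "rk": False},
}
print(can_write_back("ADD", fu_status))  # ADD.D must wait for DIV.D

# Once DIV.D dispatches, it has read both operands.
fu_status["DIV"]["rj"] = fu_status["DIV"]["rk"] = False
print(can_write_back("ADD", fu_status))  # ADD.D may now write F6
```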

Scoreboard update example
Instruction status (completed in cycle?):
  Instruction       Issue  Disp.  EX      WB   State at cycle 18
  L.D F6,34(R2)     1      2      3-4     5    Completed
  L.D F2,45(R3)     6      7      8-9     10   Completed
  MUL.D F0,F2,F4    7      11     12-21   22   Executing
  SUB.D F8,F6,F2    8      11     12-13   14   Completed
  DIV.D F10,F0,F6   9      23     24-63   64   At Dispatch
  ADD.D F6,F8,F2    15     16     17-18   24   Ending EX
CYCLE 18: The ADD.D has finished executing, but will stall on cycle 19 because of DIV.D:
- DIV.D precedes ADD.D
- DIV.D was stalled at the dispatch stage because of a RAW hazard on the value of F0
- DIV.D reads both operands at the same time

Tomasulo algorithm
Proposed by Robert Tomasulo in 1966:
- Initially proposed to overcome the long latencies in both memory accesses and floating point operations
- First implemented on the IBM 360/91
- The algorithm proved to be far more powerful than anticipated and is used in almost all modern superscalar processors

Tomasulo's algorithm: General idea
Instead of centralizing the control in a scoreboard, distribute it amongst the different components:
- Instructions no longer wait on a dispatch stage; instead they are issued directly to reservation stations associated with functional units
- Once instructions are issued, the operand values are copied directly to the reservation station (this works as a form of register renaming)
- If an instruction operand is not available, store which instruction generates the result (given by the reservation station holding that instruction)
- When busy, the reservation stations hold instructions
- An instruction can be identified by the reservation station where it is being held
[Figure: IF and ISSUE feed the reservation stations for FU 1 (e.g., ALU) and for FU 2 (e.g., LD/ST); both functional units write to the Common Data Bus (CDB).]

Tomasulo's algorithm: General idea
- Instruction issue stalls if all reservation stations for the given operation are busy
- Functional units (FUs) can be pipelined and may have different numbers of reservation stations
- All units write to a CDB, which forwards the results to the reservation stations and the RF
[Figure: IF and ISSUE, together with the Register File, feed the reservation stations: S1-S4 and L1-L4 (address calculation, MEMORY), I1-I4 for FU 2 (ALU), A1-A4 for FU 3 (FP ADD), M1-M3 for FU 4 (FP MULT), D1-D2 for FU 5 (/FP DIV). All units write to the Common Data Bus (CDB).]

Tomasulo's algorithm: Reservation stations
Information on reservation stations (each station has a label Q n):
  Busy     Station availability
  Op       Operation to execute
  Vj, Vk   Value of operands j, k (valid if operands are ready)
  Qj, Qk   Readiness of operands j, k (label of the reservation station with the instruction that will generate the result)
Load/store operations have an additional field for indexed load/stores, e.g., M[R[AA] + Imm] <- R[BA]:
  A        Used to store the immediate and later the effective load/store address
Additional information stored in the RF:
- Each integer register R0, R1, ..., Rn holds its data (Data 0 ... Data n) and a readiness field (Q 0 ... Q n)
- Each FP register F0, F1, ..., Fn holds its FP data (FP Data 0 ... FP Data n) and a readiness field (Q 0 ... Q n)
- Label each register as ready (value of zero) or not ready (indicating the reservation station holding the instruction that generates the value)
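The fields listed above map naturally onto a small record type. The sketch below is one possible encoding, not the one used in the slides: the class name `ReservationStation` is hypothetical, and `None` stands in for the "ready" label that the slides encode as zero.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str                   # station label, e.g. "A3"
    busy: bool = False          # station availability
    op: Optional[str] = None    # operation to execute
    vj: Optional[float] = None  # value of operand j (valid when qj is None)
    vk: Optional[float] = None  # value of operand k (valid when qk is None)
    qj: Optional[str] = None    # station producing operand j, None if ready
    qk: Optional[str] = None    # station producing operand k, None if ready
    a: Optional[int] = None     # immediate, later the effective address

    def ready(self) -> bool:
        """An instruction may start executing once both operands are ready."""
        return self.busy and self.qj is None and self.qk is None


# Register status: register name -> label of the producing station,
# or None when the value in the register file is valid.
register_status = {"F0": "A3", "F2": None, "F4": "D2", "F6": "D1"}

a3 = ReservationStation("A3", busy=True, op="DADD.D", qj="D2", qk="D1")
print(a3.ready())  # still waiting for D2 and D1
```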

Tomasulo's algorithm: Issue stage
1. Decode the instruction
   - Identify both the operation and the operands
2. Verify if the required functional unit has at least one reservation station available (i.e., one which is not busy)
   - If no reservation station is available (structural hazard), stall
   - If there is a reservation station available, issue the instruction indicating:
     a) the operation to execute;
     b) the value of all operands that are available, i.e., the value stored in the register file (RF);
     c) if an operand is not available, the reservation station holding the instruction that will generate the corresponding value
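The issue steps above can be sketched as follows. This is a simplified illustration under stated assumptions: stations are plain dictionaries, `None` marks a ready operand, and the names `issue`, `empty_station`, `regs` and `reg_q` are hypothetical. The last line of `issue` (recording the station as the producer of the destination register) follows the register-status description given for the RF.

```python
def empty_station():
    return {"busy": False, "op": None,
            "vj": None, "vk": None, "qj": None, "qk": None}


def issue(op, dest, src_j, src_k, stations, regs, reg_q):
    """Try to issue one instruction to the stations of its functional unit.

    Returns the label of the station used, or None on a structural stall.
    regs holds register values; reg_q maps a register to the station that
    will produce it (missing or None when the register value is valid).
    """
    free = next((name for name, rs in stations.items() if not rs["busy"]), None)
    if free is None:
        return None  # structural hazard: every station is busy, stall
    rs = stations[free]
    rs.update(empty_station(), busy=True, op=op)
    for src, v, q in ((src_j, "vj", "qj"), (src_k, "vk", "qk")):
        producer = reg_q.get(src)
        if producer is None:
            rs[v] = regs[src]   # operand available: copy the value
        else:
            rs[q] = producer    # operand pending: remember who produces it
    reg_q[dest] = free          # this station now produces dest
    return free


mult_div = {"M1": empty_station(), "M2": empty_station()}
regs = {"F0": 0.0, "F2": 0.0, "F4": 2.5, "F6": 1.5}
reg_q = {"F2": "L2"}            # F2 is still being loaded by station L2

print(issue("MUL.D", "F0", "F2", "F4", mult_div, regs, reg_q))
print(issue("DIV.D", "F10", "F0", "F6", mult_div, regs, reg_q))
print(issue("MUL.D", "F8", "F4", "F6", mult_div, regs, reg_q))
```

The source tags are read before the destination tag is updated, so an instruction that reads and writes the same register still waits on the previous producer rather than on itself.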

Tomasulo's algorithm: Execute stage
1. If a reservation station has all operands available and there is a functional unit available, start executing the instruction
2. Monitor (snoop) writes to the common data bus (CDB); if a value is written on the CDB and that value is required by an instruction on a reservation station, retrieve it and store it in the corresponding field of the reservation station
Example:
  ...
  DIV.D  F4,F0,F2
  DIV.D  F6,F4,F2
  DADD.D F0,F4,F6
  ...
- The functional unit FU5 (floating point division) writes a value to the CDB: the result of the instruction on reservation station D2
- The reservation stations D1 and A3 hold instructions that require that value; the reservation stations take the result and store it in the corresponding fields
[Figure: same organization as on the previous slide (IF, ISSUE, Register File, reservation stations S1-S4, L1-L4, I1-I4, A1-A4, M1-M3, D1-D2 and the CDB), highlighting the write of the result from the instruction on reservation station D2.]

Tomasulo's algorithm: Writing on the CDB
1. When writing a value on the CDB, write:
   - The value, plus
   - The label of the reservation station where the instruction was stored
Whenever a reservation station (or register) needs a value, it takes it from the CDB.
Example:
  ...
  DIV.D  F4,F0,F2
  DIV.D  F6,F4,F2
  DADD.D F0,F4,F6
  ...
- The functional unit FU5 (floating point division) writes a value to the CDB
- The reservation stations D1 and A3 hold instructions that require that value; the reservation stations take the result and store it in the corresponding fields
- The DADD.D instruction is waiting for values produced by reservation stations D2 and D1: reservation station D2 holds the first division and reservation station D1 holds the second division
Reservation station A3:
  Busy   Op       Vj               Vk               Qj   Qk
  Yes    DADD.D   (invalid data)   (invalid data)   D2   D1
- Qj = D2: wait for the value being produced by the instruction on reservation station D2
- Qk = D1: wait for the value being produced by the instruction on reservation station D1
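The CDB write described above can be sketched with the same three instructions. The function name `broadcast`, the dictionary layout and all numeric values are made up for the illustration; `None` marks a ready operand.

```python
def broadcast(tag, value, stations, regs, reg_q):
    """Write (value, tag) on the CDB; every waiting consumer captures it."""
    for rs in stations.values():
        if rs["qj"] == tag:
            rs["vj"], rs["qj"] = value, None
        if rs["qk"] == tag:
            rs["vk"], rs["qk"] = value, None
    for reg, producer in reg_q.items():
        if producer == tag:      # register still expects this result
            regs[reg] = value
            reg_q[reg] = None
    stations[tag]["busy"] = False  # the producing station is released


stations = {
    # DIV.D F4,F0,F2: both operands were available at issue
    "D2": {"busy": True, "op": "DIV.D", "vj": 8.0, "vk": 2.0, "qj": None, "qk": None},
    # DIV.D F6,F4,F2: waits for F4 from D2
    "D1": {"busy": True, "op": "DIV.D", "vj": None, "vk": 2.0, "qj": "D2", "qk": None},
    # DADD.D F0,F4,F6: waits for F4 from D2 and F6 from D1
    "A3": {"busy": True, "op": "DADD.D", "vj": None, "vk": None, "qj": "D2", "qk": "D1"},
}
regs = {"F0": 8.0, "F2": 2.0, "F4": 0.0, "F6": 0.0}
reg_q = {"F0": "A3", "F4": "D2", "F6": "D1"}

broadcast("D2", 4.0, stations, regs, reg_q)
print(stations["D1"])  # both operands ready, D1 can start executing
print(stations["A3"])  # still waiting for D1
```

Because the consumers match on the station label rather than on the register name, a later instruction that reuses F4 as a destination would not disturb D1 or A3.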

Tomasulo's algorithm: Load/Store unit
- The load/store unit is seen as a functional unit with read/write (load/store) buffers to the memory
- The load/store buffers can be seen as reservation stations
[Figure: address calculation feeds the store buffers S1-S4 and the load buffers L1-L4, which access MEMORY; results are written to the Common Data Bus (CDB).]

Tomasulo's algorithm: Solving hazards
- RAW hazards: solved by letting an instruction wait for the corresponding value on a reservation station
- WAR / WAW hazards: solved by renaming the registers (use of reservation stations)

Tomasulo's algorithm: Example
Consider the execution of the following instructions:
  L.D   F6,34(R2)
  L.D   F2,45(R3)
  MUL.D F0,F2,F4
  SUB.D F8,F6,F2
  DIV.D F10,F0,F6
  ADD.D F6,F8,F2
on a processor with:
- Non-pipelined functional units:
  - 1x Integer ALU, with 1 cycle latency
  - 1x FP multiplier, with 10 cycles latency
  - 1x FP adder/subtractor, with 2 cycles latency
  - 1x /FP division, with 40 cycles latency
- Load/store unit with 2 cycles latency (address calculation + memory access)
- Reservation stations:
  - 3 load/store buffers
  - 1 slot for integer operations
  - 2 slots for FP multiplication/division
  - 2 slots for FP addition/subtraction
Similar architecture to the CDC6600, except that we are now using Tomasulo's algorithm instead of a Scoreboard.
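The timing of this example can be estimated with a very small model. The sketch below is not a full Tomasulo simulator: it assumes one instruction is issued per cycle, that no structural hazard (busy reservation station, busy functional unit or CDB conflict) occurs, that an instruction starts executing in the cycle after its last operand is written on the CDB, and that the result is written one cycle after execution finishes. The names `LATENCY`, `PROGRAM` and `schedule` are hypothetical.

```python
LATENCY = {"L.D": 2, "MUL.D": 10, "SUB.D": 2, "ADD.D": 2, "DIV.D": 40}

# (operation, destination, FP source registers)
PROGRAM = [
    ("L.D",   "F6",  []),
    ("L.D",   "F2",  []),
    ("MUL.D", "F0",  ["F2", "F4"]),
    ("SUB.D", "F8",  ["F6", "F2"]),
    ("DIV.D", "F10", ["F0", "F6"]),
    ("ADD.D", "F6",  ["F8", "F2"]),
]


def schedule(program, latency):
    """Return (op, dest, issue, ex_done, wb) for every instruction."""
    last_wb = {}  # register -> WB cycle of its most recently issued producer
    rows = []
    for issue, (op, dest, srcs) in enumerate(program, start=1):
        # Operands are captured at issue, so only producers that were
        # issued earlier matter (this is the effect of renaming).
        ready = max([issue] + [last_wb[s] for s in srcs if s in last_wb])
        ex_done = ready + latency[op]  # execution starts in cycle ready + 1
        wb = ex_done + 1
        rows.append((op, dest, issue, ex_done, wb))
        last_wb[dest] = wb
    return rows


for row in schedule(PROGRAM, LATENCY):
    print(row)
```

Under these assumptions DIV.D depends on the F6 produced by the first load, not on the F6 written later by ADD.D, which is why ADD.D can write back long before DIV.D without causing a WAR hazard.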

Tomasulo execution example
Initial state, before any instruction is issued.
Instruction status (not required in Tomasulo, used only for illustration), columns Issue, EX, WB, all empty:
  L.D F6,34(R2)
  L.D F2,45(R3)
  MUL.D F0,F2,F4
  SUB.D F8,F6,F2
  DIV.D F10,F0,F6
  ADD.D F6,F8,F2
Reservation stations, columns Busy, Op, Vj, Vk (OpA, OpB), Qj, Qk (res. station), A (address), all free:
  LD/ST buffer 1, LD/ST buffer 2, LD/ST buffer 3, FP Mult/Div 1, FP Mult/Div 2, FP Adder 1, FP Adder 2
Register status (Q) for R1, R2, R3, F0, F2, F4, F6, F8, F10: all empty.

Tomasulo execution example
Instruction status (not required in Tomasulo, used only for illustration):
  Instruction       Issue  EX   WB   Note
  L.D F6,34(R2)     1
  L.D F2,45(R3)
  MUL.D F0,F2,F4
  SUB.D F8,F6,F2
  DIV.D F10,F0,F6
  ADD.D F6,F8,F2
Reservation stations (stations not listed are free):
  Station         Busy  Op     Vj  Vk  Qj      Qk      A
  LD/ST buffer 1  Yes   L.D    R2  0   Ready   Ready   34
Register status (Q): F6 = LD/ST1

Tomasulo execution example
Instruction status (not required in Tomasulo, used only for illustration):
  Instruction       Issue  EX   WB   Note
  L.D F6,34(R2)     1                Calculated effective address
  L.D F2,45(R3)     2
  MUL.D F0,F2,F4
  SUB.D F8,F6,F2
  DIV.D F10,F0,F6
  ADD.D F6,F8,F2
Reservation stations (stations not listed are free):
  Station         Busy  Op     Vj  Vk  Qj      Qk      A
  LD/ST buffer 1  Yes   L.D    R2  0   Ready   Ready   34+R2
  LD/ST buffer 2  Yes   L.D    R3  0   Ready   Ready   45
Register status (Q): F2 = LD/ST2, F6 = LD/ST1

Tomasulo execution example
Instruction status (not required in Tomasulo, used only for illustration):
  Instruction       Issue  EX   WB   Note
  L.D F6,34(R2)     1      3         Finish loading the value
  L.D F2,45(R3)     2                Calculated effective address
  MUL.D F0,F2,F4    3
  SUB.D F8,F6,F2
  DIV.D F10,F0,F6
  ADD.D F6,F8,F2
Reservation stations (stations not listed are free):
  Station         Busy  Op     Vj  Vk  Qj      Qk      A
  LD/ST buffer 1  Yes   L.D    R2  0   Ready   Ready   34+R2
  LD/ST buffer 2  Yes   L.D    R3  0   Ready   Ready   45+R3
  FP Mult/Div 1   Yes   MUL.D  -   F4  LD/ST2  Ready
Register status (Q): F0 = FP MULT1, F2 = LD/ST2, F6 = LD/ST1
Note: the value of F4 is copied, which is equivalent to register renaming.

Tomasulo execution example
Instruction status (not required in Tomasulo, used only for illustration):
  Instruction       Issue  EX   WB   Note
  L.D F6,34(R2)     1      3    4    Write the result
  L.D F2,45(R3)     2      4
  MUL.D F0,F2,F4    3
  SUB.D F8,F6,F2    4
  DIV.D F10,F0,F6
  ADD.D F6,F8,F2
Reservation stations (stations not listed are free):
  Station         Busy  Op     Vj  Vk  Qj      Qk      A
  LD/ST buffer 2  Yes   L.D    R3  0   Ready   Ready   45+R3
  FP Mult/Div 1   Yes   MUL.D  -   F4  LD/ST2  Ready
  FP Adder 1      Yes   SUB.D  F6  -   Ready   LD/ST2
Register status (Q): F0 = FP MULT1, F2 = LD/ST2, F6 = LD/ST1, F8 = FP ADD1
Note: the value of F6 is forwarded from the CDB.

Tomasulo execution example
Instruction status (not required in Tomasulo, used only for illustration):
  Instruction       Issue  EX   WB   Note
  L.D F6,34(R2)     1      3    4
  L.D F2,45(R3)     2      4    5    Write the result
  MUL.D F0,F2,F4    3                10 cycles left
  SUB.D F8,F6,F2    4                2 cycles left
  DIV.D F10,F0,F6   5
  ADD.D F6,F8,F2
Reservation stations (stations not listed are free):
  Station         Busy  Op     Vj  Vk  Qj      Qk      A
  FP Mult/Div 1   Yes   MUL.D  F2  F4  Ready   Ready
  FP Mult/Div 2   Yes   DIV.D  -   F6  FP M1   Ready
  FP Adder 1      Yes   SUB.D  F6  F2  Ready   Ready
Register status (Q): F0 = FP MULT1, F2 = LD/ST2, F8 = FP ADD1, F10 = FP MULT2
Note: the value of F2 is forwarded from the CDB; the waiting instructions become ready and start executing.

Tomasulo execution example
Instruction status (not required in Tomasulo, used only for illustration):
  Instruction       Issue  EX   WB   Note
  L.D F6,34(R2)     1      3    4
  L.D F2,45(R3)     2      4    5
  MUL.D F0,F2,F4    3                9 cycles left
  SUB.D F8,F6,F2    4                1 cycle left
  DIV.D F10,F0,F6   5
  ADD.D F6,F8,F2    6
Reservation stations (stations not listed are free):
  Station         Busy  Op     Vj  Vk  Qj      Qk      A
  FP Mult/Div 1   Yes   MUL.D  F2  F4  Ready   Ready
  FP Mult/Div 2   Yes   DIV.D  -   F6  FP M1   Ready
  FP Adder 1      Yes   SUB.D  F6  F2  Ready   Ready
  FP Adder 2      Yes   ADD.D  -   F2  FP A1   Ready
Register status (Q): F0 = FP MULT1, F6 = FP ADD2, F8 = FP ADD1, F10 = FP MULT2

Tomasulo execution example
Instruction status (not required in Tomasulo, used only for illustration):
  Instruction       Issue  EX   WB   Note
  L.D F6,34(R2)     1      3    4
  L.D F2,45(R3)     2      4    5
  MUL.D F0,F2,F4    3                8 cycles left
  SUB.D F8,F6,F2    4      7         Finished execution
  DIV.D F10,F0,F6   5
  ADD.D F6,F8,F2    6
Reservation stations (stations not listed are free):
  Station         Busy  Op     Vj  Vk  Qj      Qk      A
  FP Mult/Div 1   Yes   MUL.D  F2  F4  Ready   Ready
  FP Mult/Div 2   Yes   DIV.D  -   F6  FP M1   Ready
  FP Adder 1      Yes   SUB.D  F6  F2  Ready   Ready
  FP Adder 2      Yes   ADD.D  -   F2  FP A1   Ready
Register status (Q): F0 = FP MULT1, F6 = FP ADD2, F8 = FP ADD1, F10 = FP MULT2

Tomasulo execution example
Instruction status (not required in Tomasulo, used only for illustration):
  Instruction       Issue  EX   WB   Note
  L.D F6,34(R2)     1      3    4
  L.D F2,45(R3)     2      4    5
  MUL.D F0,F2,F4    3                7 cycles left
  SUB.D F8,F6,F2    4      7    8    Write the result
  DIV.D F10,F0,F6   5
  ADD.D F6,F8,F2    6                2 cycles left
Reservation stations (stations not listed are free):
  Station         Busy  Op     Vj  Vk  Qj      Qk      A
  FP Mult/Div 1   Yes   MUL.D  F2  F4  Ready   Ready
  FP Mult/Div 2   Yes   DIV.D  -   F6  FP M1   Ready
  FP Adder 2      Yes   ADD.D  F8  F2  Ready   Ready
Register status (Q): F0 = FP MULT1, F6 = FP ADD2, F8 = FP ADD1, F10 = FP MULT2
Note: the value of F8 is forwarded from the CDB; the instruction becomes ready and starts executing.

Tomasulo execution example
Instruction status (not required in Tomasulo, used only for illustration):
  Instruction       Issue  EX   WB   Note
  L.D F6,34(R2)     1      3    4
  L.D F2,45(R3)     2      4    5
  MUL.D F0,F2,F4    3                5 cycles left
  SUB.D F8,F6,F2    4      7    8
  DIV.D F10,F0,F6   5
  ADD.D F6,F8,F2    6      10        Finished execution
Reservation stations (stations not listed are free):
  Station         Busy  Op     Vj  Vk  Qj      Qk      A
  FP Mult/Div 1   Yes   MUL.D  F2  F4  Ready   Ready
  FP Mult/Div 2   Yes   DIV.D  -   F6  FP M1   Ready
  FP Adder 2      Yes   ADD.D  F8  F2  Ready   Ready
Register status (Q): F0 = FP MULT1, F6 = FP ADD2, F10 = FP MULT2

Tomasulo execution example
Instruction status (not required in Tomasulo, used only for illustration):
  Instruction       Issue  EX   WB   Note
  L.D F6,34(R2)     1      3    4
  L.D F2,45(R3)     2      4    5
  MUL.D F0,F2,F4    3                4 cycles left
  SUB.D F8,F6,F2    4      7    8
  DIV.D F10,F0,F6   5
  ADD.D F6,F8,F2    6      10   11   Write the result
Reservation stations (stations not listed are free):
  Station         Busy  Op     Vj  Vk  Qj      Qk      A
  FP Mult/Div 1   Yes   MUL.D  F2  F4  Ready   Ready
  FP Mult/Div 2   Yes   DIV.D  -   F6  FP M1   Ready
Register status (Q): F0 = FP MULT1, F6 = FP ADD2, F10 = FP MULT2

Tomasulo execution example
Instruction status (not required in Tomasulo, used only for illustration):
  Instruction       Issue  EX   WB   Note
  L.D F6,34(R2)     1      3    4
  L.D F2,45(R3)     2      4    5
  MUL.D F0,F2,F4    3      15        Finished execution
  SUB.D F8,F6,F2    4      7    8
  DIV.D F10,F0,F6   5
  ADD.D F6,F8,F2    6      10   11
Reservation stations (stations not listed are free):
  Station         Busy  Op     Vj  Vk  Qj      Qk      A
  FP Mult/Div 1   Yes   MUL.D  F2  F4  Ready   Ready
  FP Mult/Div 2   Yes   DIV.D  -   F6  FP M1   Ready
Register status (Q): F0 = FP MULT1, F10 = FP MULT2

Tomasulo execution example
Instruction status (not required in Tomasulo, used only for illustration):
  Instruction       Issue  EX   WB   Note
  L.D F6,34(R2)     1      3    4
  L.D F2,45(R3)     2      4    5
  MUL.D F0,F2,F4    3      15   16   Write the result
  SUB.D F8,F6,F2    4      7    8
  DIV.D F10,F0,F6   5                40 cycles left
  ADD.D F6,F8,F2    6      10   11
Reservation stations (stations not listed are free):
  Station         Busy  Op     Vj  Vk  Qj      Qk      A
  FP Mult/Div 2   Yes   DIV.D  F0  F6  Ready   Ready
Register status (Q): F0 = FP MULT1, F10 = FP MULT2
Note: the value of F0 is forwarded from the CDB; the instruction becomes ready and starts executing.

Tomasulo execution example
Instruction status (not required in Tomasulo, used only for illustration):
  Instruction       Issue  EX   WB   Note
  L.D F6,34(R2)     1      3    4
  L.D F2,45(R3)     2      4    5
  MUL.D F0,F2,F4    3      15   16
  SUB.D F8,F6,F2    4      7    8
  DIV.D F10,F0,F6   5      56        Finished execution
  ADD.D F6,F8,F2    6      10   11
Reservation stations (stations not listed are free):
  Station         Busy  Op     Vj  Vk  Qj      Qk      A
  FP Mult/Div 2   Yes   DIV.D  F0  F6  Ready   Ready
Register status (Q): F10 = FP MULT2

Tomasulo execution example
Instruction status (not required in Tomasulo, used only for illustration):
  Instruction       Issue  EX   WB   Note
  L.D F6,34(R2)     1      3    4
  L.D F2,45(R3)     2      4    5
  MUL.D F0,F2,F4    3      15   16
  SUB.D F8,F6,F2    4      7    8
  DIV.D F10,F0,F6   5      56   57   Write the result
  ADD.D F6,F8,F2    6      10   11
Reservation stations: all free.
Register status (Q): F10 = FP MULT2

Tomasulo execution example
Instruction status (Tomasulo):
  Instruction       Issue  EX   WB
  L.D F6,34(R2)     1      3    4
  L.D F2,45(R3)     2      4    5
  MUL.D F0,F2,F4    3      15   16
  SUB.D F8,F6,F2    4      7    8
  DIV.D F10,F0,F6   5      56   57
  ADD.D F6,F8,F2    6      10   11
  In order: Issue. Out of order: EX, WB.
Instruction status (Scoreboard):
  Instruction       Issue  Disp.  EX   WB
  L.D F6,34(R2)     1      2      3    4
  L.D F2,45(R3)     5      6      7    8
  MUL.D F0,F2,F4    6      9      19   20
  SUB.D F8,F6,F2    7      9      11   12
  DIV.D F10,F0,F6   8      21     61   62
  ADD.D F6,F8,F2    13     14     16   22
  In order: Issue. Out of order: Disp, EX, WB.
ISSUE: Speedup = 13/6 = 2.17
WB: Speedup = 62/57 = 1.09
Note: Additional gains are achieved by easing the implementation of other architectural changes.
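The two speedup figures are ratios of the cycle in which the last instruction is issued (13 for the scoreboard, 6 for Tomasulo) and of the cycle of the last write back (62 and 57). A short check of the arithmetic, with a hypothetical helper name `speedup`:

```python
def speedup(baseline_cycles, improved_cycles):
    """Ratio of the baseline cycle count to the improved cycle count."""
    return baseline_cycles / improved_cycles


issue_speedup = speedup(13, 6)   # last issue: scoreboard vs. Tomasulo
wb_speedup = speedup(62, 57)     # last write back: scoreboard vs. Tomasulo
print(f"ISSUE: {issue_speedup:.2f}  WB: {wb_speedup:.2f}")
```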

Tomasulo vs Scoreboard
- Structural hazards. Scoreboard: stalls the pipeline. Tomasulo: stalls the pipeline.
- WAW hazards. Scoreboard: stalls the pipeline. Tomasulo: solved by applying renaming (use of reservation stations).
- WAR hazards. Scoreboard: delays writing the result. Tomasulo: solved by applying renaming (use of reservation stations).
- Control structure. Scoreboard: centralized in the scoreboard. Tomasulo: distributed in reservation stations.
- Forwarding. Scoreboard: hard to apply. Tomasulo: automatically applied through the CDB.
- Simultaneous writings. Scoreboard: delayed writing may lead to structural hazards. Tomasulo: simultaneous access to the CDB may lead to structural hazards.
- Instruction window. Scoreboard: smaller. Tomasulo: larger.

Next lesson
Dynamic techniques to extract parallelism:
- More on Tomasulo
- Dynamic branch prediction