Tomasulo Algorithm. Developed at IBM and first implemented in IBM's 360/91

Tomasulo Algorithm
Developed at IBM and first implemented in IBM's 360/91. IBM wanted to use the existing compiler instead of a specialized compiler for high-end machines.
Tracks when operands are available to minimize RAW hazards, and uses register renaming to minimize WAW and WAR hazards.
Used in Alpha 21264, HP 8600, PowerPC G4, MIPS R12000, x86 (with a RISC core): AMD Athlon, PIII, Xeon.
In-order issue, but out-of-order execution.
The original IBM design uses pipelined FUs; in the example we will use multiple FUs (same idea, but with a RISC machine). (In Chapter 3.2)

Tomasulo Algorithm
RAW hazards are avoided by executing an instruction only when its operands are ready. WAR and WAW hazards are removed by register renaming: all destination registers are renamed.

Original code (anti-dependences on F8 between ADD.D and SUB.D and on F6 between S.D and MUL.D; output dependence on F6 between ADD.D and MUL.D):
    DIV.D F0,F2,F4
    ADD.D F6,F0,F8
    S.D   F6,0(R1)
    SUB.D F8,F10,F14
    MUL.D F6,F10,F8

After renaming with S and T:
    DIV.D F0,F2,F4
    ADD.D S,F0,F8
    S.D   S,0(R1)
    SUB.D T,F10,F14
    MUL.D F6,F10,T

Register renaming (S, T) eliminates the name dependences. Note that any subsequent use of F8 must be replaced by T, which is complicated by the fact that there may be branches.
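The renaming idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fresh names P0, P1, ... and the tuple encoding of instructions are hypothetical, and unlike the slide (which renames only F6 to S and F8 to T) the sketch gives every destination a fresh name.

```python
from itertools import count

def rename(program):
    """Give every destination a fresh name; sources follow the latest mapping."""
    fresh = (f"P{i}" for i in count())   # hypothetical renaming registers
    latest = {}                          # architectural register -> current name
    renamed = []
    for op, dest, srcs in program:
        # Sources are read through the map BEFORE the destination is renamed.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            latest[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed

# The five-instruction sequence from the slide: (op, destination, sources).
program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),    # a store has no register destination
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]

renamed = rename(program)
for instruction in renamed:
    print(instruction)
```

After renaming, no two instructions write the same name (no WAW), and SUB.D no longer writes a register that ADD.D still has to read (no WAR); only the true RAW chains remain.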

Tomasulo Algorithm
Control and buffers are distributed with the functional units (FUs), as opposed to centralized in the scoreboard. The FU buffers are called reservation stations; they hold pending (issued) instructions, their operands, and other instruction status information (including data dependences).
Register renaming is done by the reservation stations, which fetch and buffer the operands of pending instructions. Reservation stations are sometimes referred to as physical registers or renaming registers, as opposed to the architectural registers specified by the ISA.
Register renaming eliminates WAR and WAW hazards.
If an operand is not ready yet, the name of the reservation station that will provide it is kept instead. (In Chapter 3.2)

Tomasulo's Algorithm
When there are successive writes to the same register, only the last write is performed.
The information held at the reservation stations of a functional unit determines when an instruction can start execution.
Instruction results are sent directly from reservation stations to functional units via a common result bus (the Common Data Bus, CDB, in the IBM 360/91), thus bypassing intermediate registers.
In pipelines with multiple execution units that issue multiple instructions per clock, more than one result bus will be needed.

Dynamic Scheduling: The Tomasulo Approach
Figure: a Tomasulo-based MIPS, including the FP unit and the load/store unit.

Tomasulo's Algorithm
Each reservation station holds an instruction that has been issued, together with its operands if they are available, or otherwise the names of the reservation stations that will produce them.
The load buffers and store buffers hold data and addresses going to or coming from memory, and behave exactly like reservation stations.
The FP registers are connected by a pair of buses to the FUs, and by a single bus to the store buffers.
All results from the FUs are sent on the CDB, which goes everywhere except the load buffers.

Steps of Execution
Issue: get the next instruction from the instruction queue (FIFO). If there is a matching RS that is empty, issue the instruction with its operands if they are available; otherwise stall (structural hazard). If the operands are not in the registers, keep track of the reservation stations that will produce them. This renaming eliminates WAR and WAW hazards.
Execute: if one or more operands are not available, monitor the CDB until the reservation station that produces the operand sends it, then copy it. When all operands are ready, start execution (one instruction per cycle per FU). It is possible that more than one instruction at the same FU becomes ready in the same cycle!
Loads and stores first calculate the effective address once the base register is ready. A load goes ahead as soon as the memory unit is free; a store waits for the data to be stored. Loads and stores go through this step in program order.

Steps of Execution
To preserve exception behavior, no instruction can start execution until all preceding branches are resolved; that is, we have to be sure the branch prediction was correct before execution starts. Alternatively, it is possible to record the exception instead of actually raising it.
Write result: when the result is available, write it on the CDB, and from there into the registers and any RS (including store buffers) waiting for it. Stores also write their data to memory during this step.

Data Structure
The data structures used to detect and eliminate hazards are attached to the reservation stations, the register file, and the load/store unit. These units are tagged; the tags are names of an extended set of virtual registers used in renaming.
In the example, 4 bits are enough to designate one of the 5 reservation stations or one of the 6 load buffers. (Note that in the original 360 there were only 4 FP registers.)
Once an instruction has been issued to a reservation station, it refers to an operand by the number of the RS that produces it. 0 means the operand is already available.

Reservation Station Fields
Op: operation to perform in the unit (e.g., + or -).
Vj, Vk: values of the source operands S1 and S2. Store buffers have a single V field holding the result to be stored.
Qj, Qk: reservation stations producing the source operands (the value to be written). There are no ready flags as in the scoreboard; Qj, Qk = 0 means ready. Store buffers only have Qi, for the RS producing the result.
Only one of V or Q is valid for each operand.
A: address information for loads or stores. Initially the immediate field of the instruction, then the effective address once it has been calculated.
Busy: indicates that the reservation station is busy.
Register result status, Qi: indicates which functional unit will write each register, if one exists. Blank (or 0) when no pending instruction will write to that register.
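As a rough illustration, the fields above map naturally onto a small record type, and the issue step becomes a function that fills it. The sketch below is an assumption-laden model, not the hardware: the class and function names are hypothetical, `None` stands in for the slide's 0 or blank, and the register values are made up.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of source operand 1, if available
    vk: Optional[float] = None   # value of source operand 2, if available
    qj: Optional[str] = None     # station producing operand 1 (None = ready)
    qk: Optional[str] = None     # station producing operand 2 (None = ready)
    a: Optional[int] = None      # address field, used by loads/stores only

    def ready(self) -> bool:
        """An instruction may start executing once no Q tag is pending."""
        return self.busy and self.qj is None and self.qk is None

def issue(rs, op, dest, src_j, src_k, regs, reg_status):
    """Issue into `rs`: copy values that are ready, record producer tags otherwise."""
    if rs.busy:
        return False             # structural hazard: the caller must stall
    rs.busy, rs.op = True, op
    for src, v_field, q_field in ((src_j, "vj", "qj"), (src_k, "vk", "qk")):
        producer = reg_status.get(src)
        if producer is None:
            setattr(rs, v_field, regs[src])   # operand is in the register file
        else:
            setattr(rs, q_field, producer)    # operand will arrive on the CDB
    reg_status[dest] = rs.name                # this station now renames `dest`
    return True

# MUL.D F0, F2, F4 issued while Load2 still owes F2 (register values are made up).
regs = {"F0": 0.0, "F2": 0.0, "F4": 2.0}
reg_status = {"F2": "Load2"}                  # the slide's Qi table
mult1 = ReservationStation("Mult1")
issued = issue(mult1, "MUL.D", "F0", "F2", "F4", regs, reg_status)
print(issued, mult1, reg_status)
```

Exactly one of V or Q ends up set for each operand, which mirrors the rule that only one of the two is valid at a time.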

Three Stages of the Tomasulo Algorithm
1. Issue: get an instruction from the pending instruction queue. The instruction is issued to a free reservation station (RS), so there is no structural hazard. The selected RS is marked busy. Control sends the available operand values (from the ISA registers) to the assigned RS. Operands that are not yet available are renamed to the RSs that will produce them (register renaming).
2. Execution (EX): operate on the operands. When both operands are ready, start executing on the assigned FU. If any operand is not ready, watch the Common Data Bus (CDB) for the needed result (forwarding is done via the CDB).
3. Write result (WB): finish execution. Write the result on the CDB to all waiting units (RSs) and mark the reservation station as available.
Common Data Bus (CDB): data + source (a "come from" bus): 64 bits of data + 4 bits for the functional unit source address. Data is written to a waiting RS if the source matches the RS it expects to produce its result. The CDB thus forwards results by broadcasting them to the waiting RSs.
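The write-result broadcast can be sketched as a tag match over every waiting station and register. This is a minimal model assuming a single CDB and plain dictionaries for the stations; the function name and the numeric values are hypothetical, and the state roughly follows the point in the example where Load2 delivers F2 to Add1 and Mult1.

```python
def write_result(tag, value, stations, reg_status, regs):
    """Broadcast (tag, value) on one CDB; every consumer waiting on `tag` copies it."""
    for rs in stations.values():
        for q_field, v_field in (("Qj", "Vj"), ("Qk", "Vk")):
            if rs[q_field] == tag:
                rs[v_field] = value      # capture the forwarded operand
                rs[q_field] = None       # operand is now ready
    for reg, producer in reg_status.items():
        if producer == tag:              # only the latest renamer updates the register
            regs[reg] = value
            reg_status[reg] = None
    if tag in stations:
        stations[tag]["Busy"] = False    # the producing station becomes available

# Hypothetical state: Add1 and Mult1 both wait for the value loaded by Load2.
stations = {
    "Add1":  {"Busy": True, "Op": "SUBD",  "Vj": 7.0,  "Vk": None, "Qj": None,    "Qk": "Load2"},
    "Mult1": {"Busy": True, "Op": "MULTD", "Vj": None, "Vk": 2.0,  "Qj": "Load2", "Qk": None},
}
reg_status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
regs = {"F0": 0.0, "F2": 0.0, "F8": 0.0}

write_result("Load2", 3.0, stations, reg_status, regs)
print(stations["Add1"], stations["Mult1"], regs["F2"])
```

Because a register is written only if its status entry still names the broadcasting station, an older producer whose destination has since been renamed never overwrites the register, which is how only the last of several writes is performed.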

Tomasulo's Algorithm
Instruction status
Instruction          Issue  EX  Write
L.D    F6, 34(R2)
L.D    F2, 45(R3)
MUL.D  F0, F2, F4
SUB.D  F8, F6, F2
DIV.D  F10, F0, F6
ADD.D  F6, F8, F2

Tomasulo's Algorithm
Reservation stations
Name  Busy  Op  Vj           Vk           Qj    Qk    A
LD1   N
LD2   Y                                               45+Reg[3]
Add1  Y     -   Mem[34+R2]                      LD2
Add2  Y     +                             Add1  LD2
Add3  N
MUL1  Y     *                Reg[F4]      LD2
MUL2  Y     /                Mem[34+..]   MUL1

Register result status
Field  F0    F2   F4  F6    F8    F10   F12
Qi     Mul1  LD2      Add2  Add1  Mul2

Steps in the Tomasulo Approach and the Requirements of Each Step (In Chapter 3.2)

Drawbacks of the Tomasulo Approach
Implementation complexity. Example: the implementation of the Tomasulo algorithm may have caused delays in the introduction of the 360/91, MIPS 10000, and IBM 620, among other CPUs.
Many high-speed associative result stores (using the CDB) are required.
Performance is limited by the Common Data Bus. Possible solution: multiple CDBs, at the cost of more functional unit and RS logic for parallel associative stores. (In Chapter 3.2)

Tomasulo Approach Example
Using the same code as in the scoreboard example, run on the Tomasulo configuration given earlier:

Functional unit                 # of RSs  EX cycles
Integer                         1         1
Floating-point multiply/divide  2         10/40
Floating-point add              3         2
Pipelined functional units.

    L.D    F6, 34(R2)
    L.D    F2, 45(R3)
    MUL.D  F0, F2, F4
    SUB.D  F8, F6, F2
    DIV.D  F10, F0, F6
    ADD.D  F6, F8, F2

Dependences marked in this code: real data dependence (RAW), anti-dependence (WAR), output dependence (WAW). (In Chapter 3.3)
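The three kinds of dependence in this six-instruction sequence can be enumerated mechanically. The sketch below is a deliberately naive pairwise check, assuming the tuple encoding shown (which is hypothetical): it compares every earlier instruction with every later one and does not account for an intervening redefinition of a register, which does not matter for this short sequence.

```python
def dependences(program):
    """Return (kind, earlier index, later index, register) for every pair."""
    found = []
    for i, (_, dest_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            _, dest_j, srcs_j = program[j]
            if dest_i in srcs_j:
                found.append(("RAW", i, j, dest_i))   # later reads what earlier writes
            if dest_j in srcs_i:
                found.append(("WAR", i, j, dest_j))   # later writes what earlier reads
            if dest_i == dest_j:
                found.append(("WAW", i, j, dest_i))   # both write the same register
    return found

# (op, destination, sources) for the example code, in program order.
program = [
    ("L.D",   "F6",  ("R2",)),
    ("L.D",   "F2",  ("R3",)),
    ("MUL.D", "F0",  ("F2", "F4")),
    ("SUB.D", "F8",  ("F6", "F2")),
    ("DIV.D", "F10", ("F0", "F6")),
    ("ADD.D", "F6",  ("F8", "F2")),
]

deps = dependences(program)
for kind, i, j, reg in deps:
    print(f"{kind}: {program[i][0]} -> {program[j][0]} on {reg}")
```

Only the RAW entries force an instruction to wait in Tomasulo's scheme; the WAR and WAW entries on F6 are the ones that renaming through the reservation stations removes.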

Tomasulo Example: Cycle 0
FP EX cycles: Add = 2, Multiply = 10, Divide = 40

Instruction status
Instruction   j    k    Issue  Exec complete  Write result
L.D    F6     34+  R2
L.D    F2     45+  R3
MUL.D  F0     F2   F4
SUB.D  F8     F6   F2
DIV.D  F10    F0   F6
ADD.D  F6     F8   F2

Load buffers: Load1 not busy; Load2 not busy; Load3 not busy.

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  No
0     Mult2  No

Register result status (FU)
Clock  F0     F2        F4  F6         F8       F10     F12 ... F30
0

Tomasulo Example: Cycle 1
FP EX cycles: Add = 2, Multiply = 10, Divide = 40

Instruction status
Instruction   j    k    Issue  Exec complete  Write result
L.D    F6     34+  R2   1
L.D    F2     45+  R3
MUL.D  F0     F2   F4
SUB.D  F8     F6   F2
DIV.D  F10    F0   F6
ADD.D  F6     F8   F2

Load buffers: Load1 busy, address 34+R2; Load2 not busy; Load3 not busy.

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  No
0     Mult2  No

Register result status (FU)
Clock  F0     F2        F4  F6         F8       F10     F12 ... F30
1                           Load1

Tomasulo Example: Cycle 2

Instruction status
Instruction   j    k    Issue  Exec complete  Write result
L.D    F6     34+  R2   1      2,-
L.D    F2     45+  R3   2
MUL.D  F0     F2   F4
SUB.D  F8     F6   F2
DIV.D  F10    F0   F6
ADD.D  F6     F8   F2

Load buffers: Load1 busy, address 34+R2; Load2 busy, address 45+R3; Load3 not busy.

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  No
0     Mult2  No

Register result status (FU)
Clock  F0     F2        F4  F6         F8       F10     F12 ... F30
2             Load2         Load1

Tomasulo Example: Cycle 3

Instruction status
Instruction   j    k    Issue  Exec complete  Write result
L.D    F6     34+  R2   1      2,3
L.D    F2     45+  R3   2
MUL.D  F0     F2   F4   3
SUB.D  F8     F6   F2
DIV.D  F10    F0   F6
ADD.D  F6     F8   F2

Load processing takes 2 cycles (EX, Mem).

Load buffers: Load1 busy, address 34+R2; Load2 busy, address 45+R3; Load3 not busy.

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  Yes   MULTD             R(F4)      Load2
0     Mult2  No

Register result status (FU)
Clock  F0     F2        F4  F6         F8       F10     F12 ... F30
3      Mult1  Load2         Load1

Tomasulo Example: Cycle 4

Instruction status
Instruction   j    k    Issue  Exec complete  Write result
L.D    F6     34+  R2   1      2,3            4
L.D    F2     45+  R3   2      3,4
MUL.D  F0     F2   F4   3
SUB.D  F8     F6   F2   4
DIV.D  F10    F0   F6
ADD.D  F6     F8   F2

Load buffers: Load1 not busy; Load2 busy, address 45+R3; Load3 not busy.

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   Yes   SUBD   M(34+R2)                     Load2
0     Add2   No
0     Add3   No
0     Mult1  Yes   MULTD             R(F4)      Load2
0     Mult2  No

Register result status (FU)
Clock  F0     F2        F4  F6         F8       F10     F12 ... F30
4      Mult1  Load2         M(34+R2)   Add1

Load2 completing; what is waiting for it?

Tomasulo Example: Cycle 5
FP EX cycles: Add = 2, Multiply = 10, Divide = 40

Instruction status
Instruction   j    k    Issue  Exec complete  Write result
L.D    F6     34+  R2   1      3              4
L.D    F2     45+  R3   2      4              5
MUL.D  F0     F2   F4   3
SUB.D  F8     F6   F2   4
DIV.D  F10    F0   F6   5
ADD.D  F6     F8   F2

Load buffers: Load1 not busy; Load2 not busy; Load3 not busy.

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
2     Add1   Yes   SUBD   M(34+R2)   M(45+R3)
0     Add2   No
0     Add3   No
10    Mult1  Yes   MULTD  M(45+R3)   R(F4)
0     Mult2  Yes   DIVD              M(34+R2)   Mult1

Register result status (FU)
Clock  F0     F2        F4  F6         F8       F10     F12 ... F30
5      Mult1  M(45+R3)      M(34+R2)   Add1     Mult2

Load2 result forwarded via the CDB to Add1 and Mult1. SUB.D and MUL.D execution will start next cycle (6).

Tomasulo Example: Cycle 6
FP EX cycles: Add = 2, Multiply = 10, Divide = 40

Instruction status
Instruction   j    k    Issue  Exec complete  Write result
L.D    F6     34+  R2   1      3              4
L.D    F2     45+  R3   2      4              5
MUL.D  F0     F2   F4   3      Start 6
SUB.D  F8     F6   F2   4      Start 6
DIV.D  F10    F0   F6   5
ADD.D  F6     F8   F2   6

Load buffers: Load1 not busy; Load2 not busy; Load3 not busy.

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
1     Add1   Yes   SUBD   M(34+R2)   M(45+R3)
0     Add2   Yes   ADDD              M(45+R3)   Add1
0     Add3   No
9     Mult1  Yes   MULTD  M(45+R3)   R(F4)
0     Mult2  Yes   DIVD              M(34+R2)   Mult1

Register result status (FU)
Clock  F0     F2        F4  F6         F8       F10     F12 ... F30
6      Mult1  M(45+R3)      Add2       Add1     Mult2

Tomasulo Example: Cycle 7
FP EX cycles: Add = 2, Multiply = 10, Divide = 40

Instruction status
Instruction   j    k    Issue  Exec complete  Write result
L.D    F6     34+  R2   1      3              4
L.D    F2     45+  R3   2      4              5
MUL.D  F0     F2   F4   3      Start 6
SUB.D  F8     F6   F2   4      6,7
DIV.D  F10    F0   F6   5
ADD.D  F6     F8   F2   6

Load buffers: Load1 not busy; Load2 not busy; Load3 not busy.

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   Yes   SUBD   M(34+R2)   M(45+R3)
0     Add2   Yes   ADDD              M(45+R3)   Add1
0     Add3   No
8     Mult1  Yes   MULTD  M(45+R3)   R(F4)
0     Mult2  Yes   DIVD              M(34+R2)   Mult1

Register result status (FU)
Clock  F0     F2        F4  F6         F8       F10     F12 ... F30
7      Mult1  M(45+R3)      Add2       Add1     Mult2

RS Add1 completing; what is waiting for it?

Tomasulo Example: Cycle 10

Instruction status
Instruction   j    k    Issue  Exec complete  Write result
L.D    F6     34+  R2   1      3              4
L.D    F2     45+  R3   2      4              5
MUL.D  F0     F2   F4   3
SUB.D  F8     F6   F2   4      6,7            8
DIV.D  F10    F0   F6   5
ADD.D  F6     F8   F2   6      9,10

Load buffers: Load1 not busy; Load2 not busy; Load3 not busy.

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   No
0     Add2   Yes   ADDD   M()-M()    M(45+R3)
0     Add3   No
5     Mult1  Yes   MULTD  M(45+R3)   R(F4)
0     Mult2  Yes   DIVD              M(34+R2)   Mult1

Register result status (FU)
Clock  F0     F2        F4  F6         F8       F10     F12 ... F30
10     Mult1  M(45+R3)      Add2       M()-M()  Mult2

RS Add2 completing; what is waiting for it?

Tomasulo Example: Cycle 11

Instruction status
Instruction   j    k    Issue  Exec complete  Write result
L.D    F6     34+  R2   1      3              4
L.D    F2     45+  R3   2      4              5
MUL.D  F0     F2   F4   3
SUB.D  F8     F6   F2   4      7              8
DIV.D  F10    F0   F6   5
ADD.D  F6     F8   F2   6      10             11

Load buffers: Load1 not busy; Load2 not busy; Load3 not busy.

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   No
0     Add2   No
0     Add3   No
4     Mult1  Yes   MULTD  M(45+R3)   R(F4)
0     Mult2  Yes   DIVD              M(34+R2)   Mult1

Register result status (FU)
Clock  F0     F2        F4  F6         F8       F10     F12 ... F30
11     Mult1  M(45+R3)      (M-M)+M()  M()-M()  Mult2

Write back the result of ADD.D in this cycle.

Tomasulo Example: Cycle 15

Instruction status
Instruction   j    k    Issue  Exec complete  Write result
L.D    F6     34+  R2   1      2,3            4
L.D    F2     45+  R3   2      3,4            5
MUL.D  F0     F2   F4   3      6,15
SUB.D  F8     F6   F2   4      7              8
DIV.D  F10    F0   F6   5
ADD.D  F6     F8   F2   6      10             11

Load buffers: Load1 not busy; Load2 not busy; Load3 not busy.

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  Yes   MULTD  M(45+R3)   R(F4)
0     Mult2  Yes   DIVD              M(34+R2)   Mult1

Register result status (FU)
Clock  F0     F2        F4  F6         F8       F10     F12 ... F30
15     Mult1  M(45+R3)      (M-M)+M()  M()-M()  Mult2

Mult1 completing; what is waiting for it?

Tomasulo Example: Cycle 16
FP EX cycles: Add = 2, Multiply = 10, Divide = 40

Instruction status
Instruction   j    k    Issue  Exec complete  Write result
L.D    F6     34+  R2   1      3              4
L.D    F2     45+  R3   2      4              5
MUL.D  F0     F2   F4   3      15             16
SUB.D  F8     F6   F2   4      6,7            8
DIV.D  F10    F0   F6   5
ADD.D  F6     F8   F2   6      9,10           11

Load buffers: Load1 not busy; Load2 not busy; Load3 not busy.

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  No
40    Mult2  Yes   DIVD   M*F4       M(34+R2)

Register result status (FU)
Clock  F0     F2        F4  F6         F8       F10     F12 ... F30
16     M*F4   M(45+R3)      (M-M)+M()  M()-M()  Mult2

Only the divide instruction remains. DIV.D execution will start next cycle (17).

Tomasulo Example: Cycle 57 (vs 62 cycles for scoreboard)

Instruction status
Instruction   j    k    Issue  Exec complete  Write result
L.D    F6     34+  R2   1      3              4
L.D    F2     45+  R3   2      4              5
MUL.D  F0     F2   F4   3      15             16
SUB.D  F8     F6   F2   4      7              8
DIV.D  F10    F0   F6   5      17,56          57
ADD.D  F6     F8   F2   6      10             11

Load buffers: Load1 not busy; Load2 not busy; Load3 not busy.

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  No
0     Mult2  No

Register result status (FU)
Clock  F0     F2        F4  F6         F8       F10     F12 ... F30
57     M*F4   M(45+R3)      (M-M)+M()  M()-M()  M*F4/M

Instruction block done. Again we have: in-order issue, out-of-order execution and completion.
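The cycle numbers in this walkthrough can be reproduced with a little arithmetic. The sketch below is a simplified timing model, not a simulator: it assumes one instruction issues per cycle starting at cycle 1, a 2-cycle load, that an instruction starts executing the cycle after both its issue and the last write of its producers, and that the result is written one cycle after execution completes. It ignores contention for the CDB, reservation stations, and functional units, none of which arises in this example.

```python
# Execution latencies in cycles; the FP values come from the slides,
# the 2-cycle load matches "Load processing takes 2 cycles (EX, Mem)".
LATENCY = {"L.D": 2, "ADD.D": 2, "SUB.D": 2, "MUL.D": 10, "DIV.D": 40}

program = [  # (op, destination, sources) in program order
    ("L.D",   "F6",  ("R2",)),
    ("L.D",   "F2",  ("R3",)),
    ("MUL.D", "F0",  ("F2", "F4")),
    ("SUB.D", "F8",  ("F6", "F2")),
    ("DIV.D", "F10", ("F0", "F6")),
    ("ADD.D", "F6",  ("F8", "F2")),
]

def schedule(program):
    """Compute issue, execution start/complete, and write cycles per instruction."""
    last_writer = {}   # register -> index of the most recent instruction writing it
    rows = []
    for i, (op, dest, srcs) in enumerate(program):
        issue = i + 1
        # Write cycles of the producers this instruction has to wait for (RAW only).
        waits = [rows[last_writer[s]]["write"] for s in srcs if s in last_writer]
        start = max([issue] + waits) + 1
        complete = start + LATENCY[op] - 1
        rows.append({"op": op, "issue": issue, "start": start,
                     "complete": complete, "write": complete + 1})
        last_writer[dest] = i
    return rows

rows = schedule(program)
for r in rows:
    print(r)
```

Under these assumptions DIV.D starts at cycle 17, completes at 56, and writes at 57, while ADD.D, issued after it, has already written its result at cycle 11: in-order issue with out-of-order completion.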