Scoreboard Limitations

Transcription:

Scoreboard Limitations
- No forwarding: operands are read from the register file
- Structural hazards stall at issue
- WAW hazards stall at issue
- WAR hazards stall at write
Inf3 Computer Architecture - 2015-2016

Dynamic Scheduling reloaded: Motivation
- IBM 360/91: ~3 years after CDC 6600
- Had very few registers
  - 4 floating-point registers in IBM 360 vs 8 in CDC 6600
  - Resulted in frequent data dependencies
  → Needed a way to efficiently resolve WAR & WAW dependencies to maximize opportunity for instruction reordering
- Had longer memory & functional unit latencies
  → Needed to find independent instructions in the presence of long-latency stalls
- Solution: Tomasulo's Algorithm for improved dynamic scheduling

Tomasulo's Algorithm: key ideas
- Controls and buffers distributed with functional units (scoreboard centralizes this functionality)
  - Called reservation stations
  - Prevents front-end blocking due to a structural hazard
- Register names replaced by pointers to reservation station entries: register renaming
  - Register renaming avoids WAR & WAW hazards by renaming all destination registers
  - Older readers no longer endangered by younger writers (avoids WAR hazard)
  - Newly issued readers always get the value from most recent (in program order) writer (avoids WAW hazard)
- Common data bus broadcasts results to all functional units
  - Provides forwarding functionality

Register Renaming
- Register renaming accomplished through reservation stations (RS) containing:
  - The instruction
  - Operand values (when available)
  - RS number(s) of instruction(s) providing the operand values
Example: LD r1, 8(r7) → RS2; MUL.D r4, r0, r1 → RS3
RS entry fields: Op | Val Src1 | RS Src1 | Val Src2 | RS Src2
RS3 entry: Op = the operation; Val Src1 = 0xABC.. (value of R0 from RF); RS Src2 = RS2

Avoiding Data Hazards w/ Register Renaming
Example:
  LD r0, 0(r7)       →  RS1: LD RS1, 0, 0x1000
  LD r1, 8(r7)       →  RS2: LD RS2, 8, 0x1000
  MUL.D r4, r0, r1   →  RS3: MUL.D RS3, RS1, RS2
RAW dependence preserved

Avoiding Data Hazards w/ Register Renaming
Example:
  LD r0, 0(r7)       →  RS1: LD RS1, 0, 0x1000
  LD r1, 8(r7)       →  RS2: LD RS2, 8, 0x1000
  MUL.D r4, r0, r1   →  RS3: MUL.D RS3, RS1, RS2
  ADD.D r1, r0, r3   →  RS4: ADD.D RS4, RS1, 0x16
WAW dependence avoided through renaming
Q: Which r1 should be written into the register file?
A: Only the last (ADD.D → RS4), thus ensuring that the register file holds the correct register value even if instructions are reordered

Register Renaming Mechanics
- As each instruction is issued to an RS:
  - Available values are fetched (from register file) and buffered at the instruction's RS
  - Dataflow (RAW) dependencies resolved by changing source register specifiers to the RS producing those register values
- A result status register (or rename table) maps each architectural register to the most recent RS producing its value
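The renaming mechanics can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design: the function name `rename`, the tuple layout, and the simplification that every instruction gets a fresh station tag (RS1, RS2, ...) that is never freed are assumptions made for the example. The register values (r7 = 0x1000, r3 = 0x16) are the ones used in the slide example.

```python
def rename(program, reg_values):
    """Rename (op, dest, src1, src2) tuples into (tag, op, operand1, operand2)."""
    rename_table = {}   # architectural register -> tag of its most recent producer
    renamed = []
    for n, (op, dest, src1, src2) in enumerate(program, start=1):
        tag = f"RS{n}"  # assumption: one fresh station per instruction
        operands = []
        for src in (src1, src2):
            if src in rename_table:
                # Value still pending: point at the producing station (RAW).
                operands.append(rename_table[src])
            else:
                # Value available now: copy it from the register file.
                # Immediates (e.g. the offsets 0 and 8) pass through unchanged.
                operands.append(reg_values.get(src, src))
        rename_table[dest] = tag  # later readers of dest see this writer
        renamed.append((tag, op, *operands))
    return renamed, rename_table


program = [
    ("LD", "r0", 0, "r7"),
    ("LD", "r1", 8, "r7"),
    ("MUL.D", "r4", "r0", "r1"),
    ("ADD.D", "r1", "r0", "r3"),
]
reg_values = {"r7": 0x1000, "r3": 0x16}

renamed, rename_table = rename(program, reg_values)
for entry in renamed:
    print(entry)
print(rename_table)
```

MUL.D ends up waiting on RS1 and RS2, while the rename table maps r1 to RS4, so only the ADD.D result is allowed to reach the register file.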

Dynamic Scheduling 2: Tomasulo's Algorithm
- Handles RAW with proper stalls and eliminates WAR and WAW through register renaming
- Step 1: Issue
  - Get next instruction from the fetch queue and issue it to the reservation stations if there is a free reservation station
  - Read operands from register file if available or rename operands if pending (resolve RAW)
- Step 2: Execute
  - Monitor the CDB for operand(s). Once available, store into all reservation stations waiting for it
  - Execute instruction when both operands are ready in the reservation station (RAW)
  - Loads & stores maintained in a separate queue that preserves program order by tracking effective addresses
  - All preceding branches must resolve before an instruction can execute
- Step 3: Write result
  - Put the result on CDB and write it into the register file (if last producer) and all reservation stations waiting on it (RAW)

IBM S/360 model 91 used Tomasulo's Algorithm
- Dynamic O-O-O execution
- Tags (RS #'s) used to name flow dependencies
  - 5 reservation stations
  - 6 load buffers
- Issue instructions to reservation stations, load buffers and store buffers
- Instructions wait in reservation stations or store buffers until all their operands are collected
- Functional units broadcast result and tag on the Common Data Bus (CDB) for all reservation stations, store buffers and FP register file
- Reservation stations associated with functional units: simplifies scheduling & management of structural hazards
[Figure: instruction queue (fed from the instruction fetch unit) holding ld f1, 4(r1); mul f3, f1, f2; add f4, f5, f3; st f4, 8(r2); address unit; load buffers (6 ... 11); store buffers; memory unit; FP registers; reservation stations 1-3 feeding the FP adders and 4-5 feeding the FP multipliers]

Handling Loads & Stores in Tomasulo's
- Loads and stores placed in a dedicated set of buffers in program order
- Re-ordering across buffers is not allowed
  - i.e., loads and stores serviced strictly in program order
  - WHY?

Handling Loads & Stores in Tomasulo's
- Loads and stores placed in a dedicated set of buffers in program order
- Re-ordering across buffers is not allowed
  - i.e., loads and stores serviced strictly in program order
- Avoid a potential dependency violation through memory
  - Memory addresses are computed dynamically (unlike a register specifier, which is part of the instruction)
  - A younger load may be able to compute its address before an older store to the same address
  - Issuing the load out-of-order will violate a RAW dependency
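The memory RAW violation can be shown with a toy model. This is only a sketch: the helper `run`, the address 0x1000, and the stored values are hypothetical, chosen to show what a younger load observes when it is serviced before an older store to the same address.

```python
def run(memory, ops, order):
    """Service memory ops in the given order; return the value each load saw."""
    mem = dict(memory)  # work on a copy so both runs start from the same state
    loaded = {}
    for i in order:
        kind, addr, value = ops[i]
        if kind == "store":
            mem[addr] = value
        else:
            loaded[i] = mem[addr]
    return loaded


memory = {0x1000: 1}             # initial contents (hypothetical)
ops = [
    ("store", 0x1000, 42),       # older store
    ("load", 0x1000, None),      # younger load to the same address
]

in_order = run(memory, ops, [0, 1])    # program order
reordered = run(memory, ops, [1, 0])   # load serviced before the store

print(in_order)    # the load sees the stored value
print(reordered)   # the load sees the stale value: RAW through memory violated
```

Servicing the two operations strictly in program order is the simplest way to rule this out, at the cost of delaying loads that would in fact have been independent.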

Common Data Bus (CDB)
- Normal bus: data + destination (write address): "go to" bus
- CDB: data + source (RS producing the result): "come from" bus
- CDB allows dependent instructions to match the RS# they are waiting on with the values on the bus
- The matching is also used to guarantee that only the last writer of a register will update the RF
  - Each register in the RF is tagged with RS# of the youngest instruction that will write it
  - Ensures correct architectural state despite reordering
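The tag matching on the CDB can be sketched as follows. The class names `Station` and `Register`, the function `broadcast`, and the numeric values are assumptions for illustration. The tags follow the earlier renaming example, where r1 is written by both RS2 (LD) and RS4 (ADD.D) and the register file tags r1 with RS4, the youngest writer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Station:
    name: str
    qj: Optional[str] = None    # tag of the station producing operand j, if pending
    qk: Optional[str] = None    # tag of the station producing operand k, if pending
    vj: Optional[float] = None  # operand values once captured
    vk: Optional[float] = None


@dataclass
class Register:
    value: float = 0.0
    qi: Optional[str] = None    # tag of the youngest pending writer, None if current


def broadcast(tag, value, stations, regs):
    """Put (tag, value) on the CDB; every listener compares the tag with its own."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg in regs.values():
        if reg.qi == tag:       # only the last writer's tag matches
            reg.value, reg.qi = value, None


stations = [Station("RS3", qj="RS1", qk="RS2")]        # MUL.D waits on both loads
regs = {"r1": Register(qi="RS4"), "r4": Register(qi="RS3")}

broadcast("RS4", 9.0, stations, regs)   # younger writer of r1 happens to finish first
broadcast("RS2", 7.0, stations, regs)   # older writer of r1 finishes afterwards

print(regs["r1"])     # holds the RS4 value; the RS2 broadcast did not overwrite it
print(stations[0])    # RS3 captured the RS2 value and still waits on RS1
```

Because the bus carries the source tag rather than a destination address, the late RS2 result reaches the waiting multiply but cannot clobber r1.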

Reservation station components
- Op: Operation to be performed
- Qj, Qk: Reservation stations producing the source registers
- Vj, Vk: Values of source operands
- Busy: indicates whether reservation station is busy
- Register result status Qi: indicates which RS will write each register, if one exists. Blank otherwise.

Operation of Tomasulo's Algorithm
Instruction Issue:
  Get next instruction from head of the issue queue
  If reservation station RS is available then:
    For each p in { j, k } representing operand register u
      If Reg[u].Qi == 0 then RS.Vp = Reg[u].value   // value ready now
      If Reg[u].Qi != 0 then RS.Qp = Reg[u].Qi      // value not yet ready
    RS.Busy = 1                  // reserve this RS
    RS.Op = instruction opcode   // set the operation
    Reg[d].Qi = RS.id            // destination register d now waits on this RS
Execution:
  Wait until (RS.Qj == 0) and (RS.Qk == 0), and whilst waiting:
    For each p in { j, k }
      If CDB.tag == RS.Qp then { RS.Vp = CDB.value; RS.Qp = 0 }
  When (RS.Qj == 0) and (RS.Qk == 0), perform operation in RS.Op
Write Result:
  When CDB is free, broadcast CDB = { tag = RS.id, value = RS.result } and clear RS.Busy
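The three steps can be turned into a small runnable Python sketch. It follows the pseudocode's field names (Vj, Vk, Qj, Qk, Busy, Qi, with tag 0 meaning "no producer"). The class and function names, the three-entry `OPS` table, and the register contents are assumptions for illustration. Tagging the destination register at issue and updating matching registers at write-back follow the register-tagging rule described on the Common Data Bus slide. There is no timing here, only the dataflow.

```python
from dataclasses import dataclass


@dataclass
class RS:
    id: int              # tag broadcast on the CDB; 0 is reserved for "no producer"
    busy: bool = False
    op: str = ""
    vj: float = 0.0
    vk: float = 0.0
    qj: int = 0
    qk: int = 0
    result: float = 0.0


@dataclass
class Reg:
    value: float = 0.0
    qi: int = 0          # tag of the RS that will write this register, 0 if none


OPS = {"ADD": lambda a, b: a + b, "SUB": lambda a, b: a - b, "MUL": lambda a, b: a * b}


def issue(rs, op, dest, src_j, src_k, regs):
    for p, u in (("j", src_j), ("k", src_k)):
        if regs[u].qi == 0:                       # value ready now
            setattr(rs, "v" + p, regs[u].value)
            setattr(rs, "q" + p, 0)
        else:                                     # value not yet ready
            setattr(rs, "q" + p, regs[u].qi)
    rs.busy, rs.op = True, op
    regs[dest].qi = rs.id                         # youngest writer of dest


def ready(rs):
    return rs.busy and rs.qj == 0 and rs.qk == 0


def execute(rs):
    rs.result = OPS[rs.op](rs.vj, rs.vk)


def write_result(rs, stations, regs):
    for other in stations:                        # CDB broadcast to waiting stations
        if other.qj == rs.id:
            other.vj, other.qj = rs.result, 0
        if other.qk == rs.id:
            other.vk, other.qk = rs.result, 0
    for reg in regs.values():                     # register file listens too
        if reg.qi == rs.id:
            reg.value, reg.qi = rs.result, 0
    rs.busy = False


regs = {name: Reg(value) for name, value in
        {"r1": 2.0, "r2": 3.0, "r3": 0.0, "r4": 10.0, "r5": 0.0}.items()}
stations = [RS(1), RS(2)]

issue(stations[0], "MUL", "r3", "r1", "r2", regs)   # r3 = r1 * r2
issue(stations[1], "ADD", "r5", "r3", "r4", regs)   # r5 = r3 + r4 (RAW on r3)
waiting_before = not ready(stations[1])             # ADD must wait for tag 1

execute(stations[0])
write_result(stations[0], stations, regs)
ready_after = ready(stations[1])                    # operand arrived via the CDB

execute(stations[1])
write_result(stations[1], stations, regs)
print(regs["r3"].value, regs["r5"].value)
```

The ADD never reads r3 from the register file. It receives the product directly from the broadcast, which is the forwarding behaviour the scoreboard lacks.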

Tomasulo Example
- LDs: 2 cycles
- ADDDs and SUBDs: 2 cycles
- MULTDs: 10 cycles
- DIVDs: 40 cycles

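Tomasulo Example: Cycle 0
Instruction status (issue, exec complete, write result): LD F6 34+R2 (-, -, -); LD F2 45+R3 (-, -, -); MULTD F0 F2 F4 (-, -, -); SUBD F8 F6 F2 (-, -, -); DIVD F10 F0 F6 (-, -, -); ADDD F6 F8 F2 (-, -, -)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1, Add2, Mult1, Mult2 not busy
Register result status (clock 0): no pending writers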

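Tomasulo Example: Cycle 1
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, -, -); LD F2 45+R3 (-, -, -); MULTD F0 F2 F4 (-, -, -); SUBD F8 F6 F2 (-, -, -); DIVD F10 F0 F6 (-, -, -); ADDD F6 F8 F2 (-, -, -)
Load buffers: Load1 busy, address 34+R2; Load2, Load3 not busy
Reservation stations: Add1, Add2, Mult1, Mult2 not busy
Register result status (clock 1): F6 = Load1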

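Tomasulo Example: Cycle 2
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, -, -); LD F2 45+R3 (2, -, -); MULTD F0 F2 F4 (-, -, -); SUBD F8 F6 F2 (-, -, -); DIVD F10 F0 F6 (-, -, -); ADDD F6 F8 F2 (-, -, -)
Load buffers: Load1 busy, address 34+R2; Load2 busy, address 45+R3; Load3 not busy
Reservation stations: Add1, Add2, Mult1, Mult2 not busy
Register result status (clock 2): F2 = Load2; F6 = Load1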

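Tomasulo Example: Cycle 3
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, -); LD F2 45+R3 (2, -, -); MULTD F0 F2 F4 (3, -, -); SUBD F8 F6 F2 (-, -, -); DIVD F10 F0 F6 (-, -, -); ADDD F6 F8 F2 (-, -, -)
Load buffers: Load1 busy, address 34+R2; Load2 busy, address 45+R3; Load3 not busy
Reservation stations: Add1, Add2 not busy; Mult1 busy, MULTD, Vk = R(F4), Qj = Load2; Mult2 not busy
Register result status (clock 3): F0 = Mult1; F2 = Load2; F6 = Load1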

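Tomasulo Example: Cycle 4
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, -); MULTD F0 F2 F4 (3, -, -); SUBD F8 F6 F2 (4, -, -); DIVD F10 F0 F6 (-, -, -); ADDD F6 F8 F2 (-, -, -)
Load buffers: Load1 not busy; Load2 busy, address 45+R3; Load3 not busy
Reservation stations: Add1 busy, SUBD, Vj = M(A1), Qk = Load2; Add2 not busy; Mult1 busy, MULTD, Vk = R(F4), Qj = Load2; Mult2 not busy
Register result status (clock 4): F0 = Mult1; F2 = Load2; F6 = M(A1); F8 = Add1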

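Tomasulo Example: Cycle 5
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, 5); MULTD F0 F2 F4 (3, -, -); SUBD F8 F6 F2 (4, -, -); DIVD F10 F0 F6 (5, -, -); ADDD F6 F8 F2 (-, -, -)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1 busy, SUBD, Vj = M(A1), Vk = M(A2), 2 cycles left; Add2 not busy; Mult1 busy, MULTD, Vj = M(A2), Vk = R(F4), 10 cycles left; Mult2 busy, DIVD, Vk = M(A1), Qj = Mult1
Register result status (clock 5): F0 = Mult1; F2 = M(A2); F6 = M(A1); F8 = Add1; F10 = Mult2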

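Tomasulo Example: Cycle 6
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, 5); MULTD F0 F2 F4 (3, -, -); SUBD F8 F6 F2 (4, -, -); DIVD F10 F0 F6 (5, -, -); ADDD F6 F8 F2 (6, -, -)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1 busy, SUBD, Vj = M(A1), Vk = M(A2), 1 cycle left; Add2 busy, ADDD, Vk = M(A2), Qj = Add1; Mult1 busy, MULTD, Vj = M(A2), Vk = R(F4), 9 cycles left; Mult2 busy, DIVD, Vk = M(A1), Qj = Mult1
Register result status (clock 6): F0 = Mult1; F2 = M(A2); F6 = Add2; F8 = Add1; F10 = Mult2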

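Tomasulo Example: Cycle 7
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, 5); MULTD F0 F2 F4 (3, -, -); SUBD F8 F6 F2 (4, 7, -); DIVD F10 F0 F6 (5, -, -); ADDD F6 F8 F2 (6, -, -)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1 busy, SUBD, Vj = M(A1), Vk = M(A2), 0 cycles left; Add2 busy, ADDD, Vk = M(A2), Qj = Add1; Mult1 busy, MULTD, Vj = M(A2), Vk = R(F4), 8 cycles left; Mult2 busy, DIVD, Vk = M(A1), Qj = Mult1
Register result status (clock 7): F0 = Mult1; F2 = M(A2); F6 = Add2; F8 = Add1; F10 = Mult2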

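Tomasulo Example: Cycle 8
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, 5); MULTD F0 F2 F4 (3, -, -); SUBD F8 F6 F2 (4, 7, 8); DIVD F10 F0 F6 (5, -, -); ADDD F6 F8 F2 (6, -, -)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1 not busy; Add2 busy, ADDD, Vj = (M-M), Vk = M(A2), 2 cycles left; Mult1 busy, MULTD, Vj = M(A2), Vk = R(F4), 7 cycles left; Mult2 busy, DIVD, Vk = M(A1), Qj = Mult1
Register result status (clock 8): F0 = Mult1; F2 = M(A2); F6 = Add2; F8 = (M-M); F10 = Mult2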

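Tomasulo Example: Cycle 9
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, 5); MULTD F0 F2 F4 (3, -, -); SUBD F8 F6 F2 (4, 7, 8); DIVD F10 F0 F6 (5, -, -); ADDD F6 F8 F2 (6, -, -)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1 not busy; Add2 busy, ADDD, Vj = (M-M), Vk = M(A2), 1 cycle left; Mult1 busy, MULTD, Vj = M(A2), Vk = R(F4), 6 cycles left; Mult2 busy, DIVD, Vk = M(A1), Qj = Mult1
Register result status (clock 9): F0 = Mult1; F2 = M(A2); F6 = Add2; F8 = (M-M); F10 = Mult2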

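Tomasulo Example: Cycle 10
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, 5); MULTD F0 F2 F4 (3, -, -); SUBD F8 F6 F2 (4, 7, 8); DIVD F10 F0 F6 (5, -, -); ADDD F6 F8 F2 (6, 10, -)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1 not busy; Add2 busy, ADDD, Vj = (M-M), Vk = M(A2), 0 cycles left; Mult1 busy, MULTD, Vj = M(A2), Vk = R(F4), 5 cycles left; Mult2 busy, DIVD, Vk = M(A1), Qj = Mult1
Register result status (clock 10): F0 = Mult1; F2 = M(A2); F6 = Add2; F8 = (M-M); F10 = Mult2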

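Tomasulo Example: Cycle 11
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, 5); MULTD F0 F2 F4 (3, -, -); SUBD F8 F6 F2 (4, 7, 8); DIVD F10 F0 F6 (5, -, -); ADDD F6 F8 F2 (6, 10, 11)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1, Add2 not busy; Mult1 busy, MULTD, Vj = M(A2), Vk = R(F4), 4 cycles left; Mult2 busy, DIVD, Vk = M(A1), Qj = Mult1
Register result status (clock 11): F0 = Mult1; F2 = M(A2); F6 = (M-M+M); F8 = (M-M); F10 = Mult2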

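Tomasulo Example: Cycle 12
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, 5); MULTD F0 F2 F4 (3, -, -); SUBD F8 F6 F2 (4, 7, 8); DIVD F10 F0 F6 (5, -, -); ADDD F6 F8 F2 (6, 10, 11)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1, Add2 not busy; Mult1 busy, MULTD, Vj = M(A2), Vk = R(F4), 3 cycles left; Mult2 busy, DIVD, Vk = M(A1), Qj = Mult1
Register result status (clock 12): F0 = Mult1; F2 = M(A2); F6 = (M-M+M); F8 = (M-M); F10 = Mult2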

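Tomasulo Example: Cycle 13
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, 5); MULTD F0 F2 F4 (3, -, -); SUBD F8 F6 F2 (4, 7, 8); DIVD F10 F0 F6 (5, -, -); ADDD F6 F8 F2 (6, 10, 11)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1, Add2 not busy; Mult1 busy, MULTD, Vj = M(A2), Vk = R(F4), 2 cycles left; Mult2 busy, DIVD, Vk = M(A1), Qj = Mult1
Register result status (clock 13): F0 = Mult1; F2 = M(A2); F6 = (M-M+M); F8 = (M-M); F10 = Mult2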

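Tomasulo Example: Cycle 14
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, 5); MULTD F0 F2 F4 (3, -, -); SUBD F8 F6 F2 (4, 7, 8); DIVD F10 F0 F6 (5, -, -); ADDD F6 F8 F2 (6, 10, 11)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1, Add2 not busy; Mult1 busy, MULTD, Vj = M(A2), Vk = R(F4), 1 cycle left; Mult2 busy, DIVD, Vk = M(A1), Qj = Mult1
Register result status (clock 14): F0 = Mult1; F2 = M(A2); F6 = (M-M+M); F8 = (M-M); F10 = Mult2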

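Tomasulo Example: Cycle 15
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, 5); MULTD F0 F2 F4 (3, 15, -); SUBD F8 F6 F2 (4, 7, 8); DIVD F10 F0 F6 (5, -, -); ADDD F6 F8 F2 (6, 10, 11)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1, Add2 not busy; Mult1 busy, MULTD, Vj = M(A2), Vk = R(F4), 0 cycles left; Mult2 busy, DIVD, Vk = M(A1), Qj = Mult1
Register result status (clock 15): F0 = Mult1; F2 = M(A2); F6 = (M-M+M); F8 = (M-M); F10 = Mult2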

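Tomasulo Example: Cycle 16
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, 5); MULTD F0 F2 F4 (3, 15, 16); SUBD F8 F6 F2 (4, 7, 8); DIVD F10 F0 F6 (5, -, -); ADDD F6 F8 F2 (6, 10, 11)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1, Add2, Mult1 not busy; Mult2 busy, DIVD, Vj = M*F4, Vk = M(A1), 40 cycles left
Register result status (clock 16): F0 = M*F4; F2 = M(A2); F6 = (M-M+M); F8 = (M-M); F10 = Mult2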

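Tomasulo Example: Cycle 55
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, 5); MULTD F0 F2 F4 (3, 15, 16); SUBD F8 F6 F2 (4, 7, 8); DIVD F10 F0 F6 (5, -, -); ADDD F6 F8 F2 (6, 10, 11)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1, Add2, Mult1 not busy; Mult2 busy, DIVD, Vj = M*F4, Vk = M(A1), 1 cycle left
Register result status (clock 55): F0 = M*F4; F2 = M(A2); F6 = (M-M+M); F8 = (M-M); F10 = Mult2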

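Tomasulo Example: Cycle 56
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, 5); MULTD F0 F2 F4 (3, 15, 16); SUBD F8 F6 F2 (4, 7, 8); DIVD F10 F0 F6 (5, 56, -); ADDD F6 F8 F2 (6, 10, 11)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1, Add2, Mult1 not busy; Mult2 busy, DIVD, Vj = M*F4, Vk = M(A1), 0 cycles left
Register result status (clock 56): F0 = M*F4; F2 = M(A2); F6 = (M-M+M); F8 = (M-M); F10 = Mult2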

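Tomasulo Example: Cycle 57
Instruction status (issue, exec complete, write result): LD F6 34+R2 (1, 3, 4); LD F2 45+R3 (2, 4, 5); MULTD F0 F2 F4 (3, 15, 16); SUBD F8 F6 F2 (4, 7, 8); DIVD F10 F0 F6 (5, 56, 57); ADDD F6 F8 F2 (6, 10, 11)
Load buffers: Load1, Load2, Load3 not busy
Reservation stations: Add1, Add2, Mult1 not busy; Mult2 busy, DIVD, Vj = M*F4, Vk = M(A1)
Register result status (clock 57): F0 = M*F4; F2 = M(A2); F6 = (M-M+M); F8 = (M-M); F10 = Result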
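The issue, completion and write-back cycles in the final table can be reproduced with a short timing model. This is a sketch under simplifying assumptions that happen to hold for this example but are not stated on the slides: one instruction issues per cycle, a free station is always available, there is never contention for the CDB, execution finishes `latency` cycles after the later of issue and the last operand's broadcast, and the result is written one cycle after execution completes. The names `LATENCY`, `program` and `schedule` are illustrative.

```python
# Latencies from the example slide (cycles).
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (op, destination, source registers) in program order.
program = [
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def schedule(program):
    written = {}  # register -> cycle in which its most recent producer writes the CDB
    rows = []
    for issue, (op, dest, srcs) in enumerate(program, start=1):
        # Sources are looked up at issue time, in program order (renaming):
        # a register with no pending producer (e.g. F4) is ready immediately.
        operands_ready = max([issue] + [written.get(s, 0) for s in srcs])
        complete = operands_ready + LATENCY[op]
        write = complete + 1
        rows.append((op, dest, issue, complete, write))
        written[dest] = write
    return rows


rows = schedule(program)
for row in rows:
    print(row)
```

DIVD picks up F6 from the first load (written in cycle 4) because it is looked up before ADDD renames F6, which is why ADDD can write F6 in cycle 11 long before DIVD finishes in cycle 56.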

Summary of Tomasulo's
- Advantages
  - Register renaming:
    - No need to wait on WAR and WAW (notice that ADD.D has issued before DIV.D has read its F6 operand and will execute as soon as the SUB.D finishes)
    - Can have many more reservation stations than registers
  - Parallel release of all dependent instructions as soon as the earlier instruction completes (both SUB.D and MUL.D get the value from Load2)
  - CDB is a forwarding mechanism
- Limitation
  - Branches stall execution of later instructions until the branch is resolved
    - Same limitation exists with scoreboarding
    - This effectively limits the reorder window to the current basic block (4-6 insts)
- Extending Tomasulo's beyond just floating point operations introduces the risk of imprecise exceptions
  - Complicates exception recovery