Instruction Level Parallelism Part III

Course on: Advanced Computer Architectures
Instruction Level Parallelism Part III
Prof. Cristina Silvano, Politecnico di Milano
email: cristina.silvano@polimi.it

Outline of Part III
- Dynamic Scheduling Techniques: Tomasulo Algorithm
- Scoreboard vs Tomasulo

Tomasulo Dynamic Scheduling Algorithm

Tomasulo Algorithm
- Another dynamic scheduling algorithm: enables the execution of instructions behind a stall to proceed
- Invented at IBM 3 years after the CDC 6600, for the IBM 360/91
- Same goal: high performance without special compilers
- Led to Alpha 21264, HP-PA 8000, MIPS R10000, Pentium II, PowerPC 604

Tomasulo Algorithm vs. Scoreboard
- Control and buffers are distributed with the Functional Units (FUs) instead of being centralized in the Scoreboard; the FU buffers, called Reservation Stations (RSs), hold the pending operands
- Registers in instructions are replaced by values or by pointers to reservation stations, which enables Register Renaming
- Avoids WAR and WAW hazards by naming results with RS numbers instead of RF numbers
- There are more reservation stations than registers, so the hardware can do optimizations compilers can't
- Basic idea: results go to the FUs from the RSs, not through registers, over a Common Data Bus (CDB) that broadcasts results to all FUs
- Loads and Stores are treated as FUs with RSs as well
- Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue

Tomasulo Architecture
[Figure: block diagram showing the PC, instruction cache, data cache, branch prediction, instruction queue, decode/dispatch unit, register file, six reservation stations in front of the execution units (branch, integer, integer, floating point, store, complex integer, load, load/store), commit unit and reorder buffer]

Tomasulo Architecture for an FPU

Reservation Station Components
- Tag: identifies the RS
- Busy: indicates whether the RS is busy
- OP: type of operation to perform on the source operands
- Vj, Vk: values of the source operands (Vj holds the offset for loads)
- Qj, Qk: pointers to the RSs that will produce Vj, Vk; a zero value means the source operand is already available in Vj or Vk
- Note: only one of the V field or the Q field is valid for each operand
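As a minimal sketch of the fields listed above, a reservation station entry can be modeled as a small record. The Python names (`ReservationStation`, `ready`, `capture`) and the operand values are illustrative assumptions, not part of the original slides; `None` plays the role of the zero Q pointer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    """Hypothetical model of one RS entry (field names follow the slide)."""
    tag: str                    # identifies the RS, e.g. "mult1"
    busy: bool = False          # True while an instruction occupies the RS
    op: Optional[str] = None    # operation to perform on the source operands
    vj: Optional[float] = None  # value of the first source operand
    vk: Optional[float] = None  # value of the second source operand
    qj: Optional[str] = None    # tag of the RS producing vj (None = value ready)
    qk: Optional[str] = None    # tag of the RS producing vk (None = value ready)

    def ready(self) -> bool:
        # Execution may start only when no Q pointer is pending.
        return self.busy and self.qj is None and self.qk is None

    def capture(self, tag: str, value: float) -> None:
        # Replace a pending pointer with the value broadcast under that tag.
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None


# MULTD F0, F2, F4 waiting for Load2 to deliver F2 (values are made up).
mult1 = ReservationStation("mult1", busy=True, op="MULTD", qj="load2", vk=4.0)
```

Calling `mult1.capture("load2", 2.0)` fills `vj` and clears `qj`, after which `mult1.ready()` holds: for each operand, exactly one of the V field or the Q field is meaningful at any time.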

Register File and Load/Store Buffers
- The RF and the store buffers have a Value (V) field and a Pointer (Q) field
- The Pointer (Q) field holds the number of the reservation station producing the result to be stored in the RF (or store buffer); if it is zero, no active instruction is producing the result (the RF or store buffer content is the correct value)
- Load buffers have an Address field (A) and a Busy field
- Store buffers also have an Address field (A)
- A: holds the information for the memory address calculation of the load/store. Initially it contains the instruction offset (immediate field); after the address calculation it stores the effective address

First stage of Tomasulo Algorithm: ISSUE
- Get an instruction I from the head of the instruction queue (maintained in FIFO order to ensure in-order issue)
- Check whether an RS is empty (i.e., check for structural hazards in the RSs); if no RS is empty, there is a structural hazard and the instruction stalls
- If the operands are not in the RF, keep track of the FUs that will produce them (Q pointers)

First stage of Tomasulo Algorithm: ISSUE
- Rename registers
- WAR resolution: if I writes Rx, which is read by an already issued instruction K, then K already holds the value of Rx in its RS buffer or knows which instruction will write it. So the RF can be linked to I
- WAW resolution: since issue is in order, the RF can be linked to I
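The issue rules can be sketched in a few lines, under simplifying assumptions: reservation stations are plain dictionaries, one pool of stations is passed per functional-unit class, and `None` stands for the zero Q pointer. The helper names (`free_station`, `issue`) and the register values are hypothetical.

```python
def free_station():
    return {"busy": False, "op": None, "vj": None, "vk": None, "qj": None, "qk": None}


def issue(op, dest, src1, src2, stations, reg_value, reg_q):
    """Issue one instruction into `stations`; return the RS tag, or None on a stall."""
    tag = next((t for t, rs in stations.items() if not rs["busy"]), None)
    if tag is None:
        return None  # structural hazard: no empty RS, the instruction stalls
    rs = stations[tag]
    rs.update(busy=True, op=op)
    for src, v, q in ((src1, "vj", "qj"), (src2, "vk", "qk")):
        producer = reg_q.get(src)
        if producer is None:
            rs[v], rs[q] = reg_value[src], None  # operand is in the RF: copy the value
        else:
            rs[v], rs[q] = None, producer        # operand pending: keep a Q pointer
    reg_q[dest] = tag  # renaming: the RF entry of dest is now linked to this RS
    return tag


reg_value = {"F0": 0.0, "F2": 2.0, "F6": 6.0, "F8": 8.0, "F10": 0.0}
reg_q = {"F0": "mult1"}  # F0 is still being produced by mult1
mult = {"mult1": free_station(), "mult2": free_station()}
mult["mult1"]["busy"] = True
add = {"add1": free_station()}  # a single add station, to force a stall below

divd = issue("DIVD", "F10", "F0", "F6", mult, reg_value, reg_q)
addd = issue("ADDD", "F6", "F8", "F2", add, reg_value, reg_q)
stalled = issue("SUBD", "F8", "F6", "F2", add, reg_value, reg_q)
```

Here DIVD copies the current value of F6 into its RS at issue, so the later ADDD can take over the F6 entry of the RF without creating a WAR hazard; the third instruction finds no empty add station and stalls.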

Second stage of Tomasulo Algorithm: EXECUTION
- When both operands are ready and the execution unit is available, start execution; if an operand is not ready, watch the Common Data Bus for the result
- By delaying execution until the operands are available, RAW hazards are avoided at this stage
- Several instructions could become ready in the same clock cycle for the same FU (need to check whether the execution unit is available)
- RAW stalls are usually shorter because operands are supplied directly to the RSs without waiting for the RF write back (a sort of forwarding)
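A possible sketch of the start-execution check for one functional unit follows. It assumes a non-pipelined unit (so a busy unit blocks every station) and an oldest-first choice when several stations are ready in the same cycle; the slides do not state a selection policy, and the `issued` field and `select_ready` name are illustrative.

```python
def select_ready(stations, unit_busy):
    """Return the tag of the RS to start executing this cycle, or None."""
    if unit_busy:
        return None  # structural hazard on the execution unit
    ready = [
        tag for tag, rs in stations.items()
        if rs["busy"] and rs["qj"] is None and rs["qk"] is None
    ]
    # Assumed policy: among the ready instructions, start the oldest one.
    return min(ready, key=lambda tag: stations[tag]["issued"], default=None)


# Add stations at cycle 8 of the example: SUBD has both values, ADDD waits for add1.
add_stations = {
    "add1": {"busy": True, "op": "SUBD", "qj": None, "qk": None, "issued": 4},
    "add2": {"busy": True, "op": "ADDD", "qj": "add1", "qk": None, "issued": 6},
}
chosen = select_ready(add_stations, unit_busy=False)
```

With this state `chosen` is `"add1"`: ADDD cannot start because its first operand is still a pointer to add1, which is how the RAW hazard on F8 is respected.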

Second stage of Tomasulo Algorithm: EXECUTION
- Loads and Stores use a two-step execution process:
  - First step: compute the effective address when the base register is available and place it in the load or store buffer
  - Second step: loads in the load buffer execute as soon as the memory unit is available; stores in the store buffer wait for the value to be stored before being sent to the memory unit
- Loads and Stores are kept in program order through the effective address calculation, which helps prevent hazards through memory
- To preserve exception behavior, no instruction can initiate execution until all branches preceding it in program order have completed
- If branch prediction is used, the CPU must know whether the prediction was correct before beginning execution of the following instructions (speculation allows better results!)

Third stage of Tomasulo Algorithm: WRITE RESULT
- When the result is available, write it on the Common Data Bus and from there into the RF and into all RSs (including store buffers) waiting for this result
- Stores also write data to memory during this stage
- Mark the reservation station as available

The Common Data Bus
- A common data bus is a data + source bus
- In the IBM 360/91: Data = 64 bits, Source = 4 bits
- The FUs must perform an associative lookup in the RSs
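The write-result broadcast can be sketched as follows, assuming dictionary-based reservation stations and `None` as the zero Q pointer. The sequential loops stand in for the associative (parallel) tag comparison done in hardware; the function name `broadcast` and the numeric values are illustrative.

```python
def broadcast(tag, value, stations, reg_value, reg_q):
    """Write result: put (tag, value) on the CDB and update every waiting consumer."""
    for rs in stations.values():
        if not rs["busy"]:
            continue
        if rs["qj"] == tag:
            rs["vj"], rs["qj"] = value, None
        if rs["qk"] == tag:
            rs["vk"], rs["qk"] = value, None
    for reg, producer in reg_q.items():
        if producer == tag:  # the RF entry is still linked to this producer
            reg_value[reg] = value
            reg_q[reg] = None
    if tag in stations:
        stations[tag]["busy"] = False  # mark the producing RS as available


def station(qj=None, qk=None, vj=None, vk=None):
    return {"busy": True, "vj": vj, "vk": vk, "qj": qj, "qk": qk}


# State just before Load2 delivers F2 in the example (numeric values are made up).
stations = {
    "load2": station(vj=45.0),
    "add1": station(vj=6.0, qk="load2"),     # SUBD  F8, F6, F2
    "add2": station(qj="add1", qk="load2"),  # ADDD  F6, F8, F2
    "mult1": station(qj="load2", vk=4.0),    # MULTD F0, F2, F4
}
reg_value = {"F2": 0.0, "F6": 6.0}
reg_q = {"F0": "mult1", "F2": "load2", "F6": "add2", "F8": "add1"}
broadcast("load2", 2.5, stations, reg_value, reg_q)
```

After the call, every station that pointed to load2 holds the value, and only the RF entry still linked to load2 (F2) is written; F6 stays linked to add2.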

Tomasulo Algorithm (some details)
- Loads and stores go through a functional unit for effective address computation before proceeding to the load and store buffers
- Loads take a second execution step to access memory, then go to Write Result to send the value from memory to the RF and/or RSs
- Stores complete their execution in their Write Result stage (which writes data to memory)
- All writes occur in Write Result, which simplifies the Tomasulo algorithm

Tomasulo Algorithm (some details)
- A Load and a Store can be done in a different order, provided they access different memory locations; otherwise a WAR hazard (interchanging a load-store sequence) or a RAW hazard (interchanging a store-load sequence) may result (WAW if two stores are interchanged)
- Loads can be reordered freely
- To detect such hazards, the data memory addresses of all earlier memory operations must have been computed by the CPU (e.g., address computation executed in program order)

Tomasulo Algorithm (some details)
- Load executed out of order with respect to a previous store: assume addresses are computed in program order. When the Load address has been computed, it can be compared with the A fields of the active Store buffers; in the case of a match, the Load is not sent to the Load buffer until the conflicting store completes
- Stores must check for matching addresses in both Load and Store buffers (dynamic disambiguation, an alternative to the static disambiguation performed by the compiler)
- Drawback: amount of hardware required. Each RS must contain a fast associative buffer; a single CDB may limit performance
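The address checks used for dynamic disambiguation can be sketched as set lookups. This assumes that effective addresses are computed in program order and that two accesses conflict only when their addresses are equal (same-size, aligned accesses); the function names and addresses are hypothetical.

```python
def load_may_access_memory(load_addr, earlier_store_addrs):
    """A load only has to wait for earlier stores to the same address (RAW)."""
    return load_addr not in earlier_store_addrs


def store_may_access_memory(store_addr, earlier_load_addrs, earlier_store_addrs):
    """A store has to wait for earlier loads (WAR) and earlier stores (WAW)."""
    return (store_addr not in earlier_load_addrs
            and store_addr not in earlier_store_addrs)


# A fields of the active buffers (made-up addresses).
active_stores = {0x1000, 0x1008}
active_loads = {0x2000}

load_ok = load_may_access_memory(0x2000, active_stores)       # no matching store
load_blocked = not load_may_access_memory(0x1008, active_stores)
```

A load is compared only against the store buffers, which is why loads can be reordered freely among themselves.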

Tomasulo's example: Cycle 1

Instruction status
Instruction          Issue  Start Execute  Write Result
LD    F6, 34+R2      1
LD    F2, 45+R3
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Reservation stations and execution units (fields v1, q1, v2, q2; entries not listed are empty)
Load1: v1 = 34, v2 = v(R2)

Register status (q): F6 = Load1

Tomasulo's example: Cycle 2

Instruction status
Instruction          Issue  Start Execute  Write Result
LD    F6, 34+R2      1      2
LD    F2, 45+R3      2
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Reservation stations and execution units (fields v1, q1, v2, q2; entries not listed are empty)
Load1: v1 = 34, v2 = v(R2)
Load2: v1 = 45, v2 = v(R3)
EXLoad: 34, v(R2)

Register status (q): F2 = Load2, F6 = Load1

Tomasulo's example: Cycle 3

Instruction status
Instruction          Issue  Start Execute  Write Result
LD    F6, 34+R2      1      2
LD    F2, 45+R3      2
MULTD F0, F2, F4     3
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Reservation stations and execution units (fields v1, q1, v2, q2; entries not listed are empty)
Load1: v1 = 34, v2 = v(R2)
Load2: v1 = 45, v2 = v(R3)
EXLoad: 34, v(R2)
mult1: q1 = Load2, v2 = v(F4)

Register status (q): F0 = mult1, F2 = Load2, F6 = Load1

Tomasulo's example: Cycle 4

Instruction status
Instruction          Issue  Start Execute  Write Result
LD    F6, 34+R2      1      2              4
LD    F2, 45+R3      2
MULTD F0, F2, F4     3
SUBD  F8, F6, F2     4
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Reservation stations and execution units (fields v1, q1, v2, q2; entries not listed are empty)
Load1: v1 = 34, v2 = v(R2)
Load2: v1 = 45, v2 = v(R3)
EXLoad: 34, v(R2) (result on the CDB)
add1: v1 = v(F6), q2 = Load2
mult1: q1 = Load2, v2 = v(F4)

Register status (q): F0 = mult1, F2 = Load2, F6 = v(F6), F8 = add1

Note: forwarding is provided. The result is written to the RF (F6) and to the add1 RS through the CDB.

Tomasulo's example: Cycle 5

Instruction status
Instruction          Issue  Start Execute  Write Result
LD    F6, 34+R2      1      2              4
LD    F2, 45+R3      2      5
MULTD F0, F2, F4     3
SUBD  F8, F6, F2     4
DIVD  F10, F0, F6    5
ADDD  F6, F8, F2

Reservation stations and execution units (fields v1, q1, v2, q2; entries not listed are empty)
Load2: v1 = 45, v2 = v(R3)
EXLoad: 45, v(R3)
add1: v1 = v(F6), q2 = Load2
mult1: q1 = Load2, v2 = v(F4)
mult2: q1 = mult1, v2 = v(F6)

Register status (q): F0 = mult1, F2 = Load2, F6 = v(F6), F8 = add1, F10 = mult2

Tomasulo's example: Cycle 6

Instruction status
Instruction          Issue  Start Execute  Write Result
LD    F6, 34+R2      1      2              4
LD    F2, 45+R3      2      5
MULTD F0, F2, F4     3
SUBD  F8, F6, F2     4
DIVD  F10, F0, F6    5
ADDD  F6, F8, F2     6

Reservation stations and execution units (fields v1, q1, v2, q2; entries not listed are empty)
Load2: v1 = 45, v2 = v(R3)
EXLoad: 45, v(R3)
add1: v1 = v(F6), q2 = Load2
add2: q1 = add1, q2 = Load2
mult1: q1 = Load2, v2 = v(F4)
mult2: q1 = mult1, v2 = v(F6)

Register status (q): F0 = mult1, F2 = Load2, F6 = add2, F8 = add1, F10 = mult2

Note: the WAR hazard on F6 has been eliminated. ADDD will write F6, but DIVD has already read v(F6) as v2 of its RS buffer at cycle 5, and SUBD has already read v(F6) as v1 of its RS buffer at cycle 4.

Tomasulo's example: Cycle 7

Instruction status
Instruction          Issue  Start Execute  Write Result
LD    F6, 34+R2      1      2              4
LD    F2, 45+R3      2      5              7
MULTD F0, F2, F4     3
SUBD  F8, F6, F2     4
DIVD  F10, F0, F6    5
ADDD  F6, F8, F2     6

Reservation stations and execution units (fields v1, q1, v2, q2; entries not listed are empty)
Load2: v1 = 45, v2 = v(R3)
EXLoad: 45, v(R3) (result on the CDB)
add1: v1 = v(F6), v2 = v(F2)
add2: q1 = add1, v2 = v(F2)
mult1: v1 = v(F2), v2 = v(F4)
mult2: q1 = mult1, v2 = v(F6)

Register status (q): F0 = mult1, F2 = v(F2), F6 = add2, F8 = add1, F10 = mult2

Note: forwarding is provided. The result is written to the RF (F2) and to the waiting RSs through the CDB.

Tomasulo's example: Cycle 8

Instruction status
Instruction          Issue  Start Execute  Write Result
LD    F6, 34+R2      1      2              4
LD    F2, 45+R3      2      5              7
MULTD F0, F2, F4     3      8
SUBD  F8, F6, F2     4      8
DIVD  F10, F0, F6    5
ADDD  F6, F8, F2     6

Reservation stations and execution units (fields v1, q1, v2, q2; entries not listed are empty)
add1: v1 = v(F6), v2 = v(F2)
add2: q1 = add1, v2 = v(F2)
EXADD: v(F6), v(F2)
mult1: v1 = v(F2), v2 = v(F4)
mult2: q1 = mult1, v2 = v(F6)
EXMUL: v(F2), v(F4)

Register status (q): F0 = mult1, F2 = v(F2), F6 = add2, F8 = add1, F10 = mult2

Tomasulo's example: Cycle 10

Instruction status
Instruction          Issue  Start Execute  Write Result
LD    F6, 34+R2      1      2              4
LD    F2, 45+R3      2      5              7
MULTD F0, F2, F4     3      8
SUBD  F8, F6, F2     4      8              10
DIVD  F10, F0, F6    5
ADDD  F6, F8, F2     6

Latency MULTD: 10 cycles. Latency SUBD: 2 cycles.

Reservation stations and execution units (fields v1, q1, v2, q2; entries not listed are empty)
add1: v1 = v(F6), v2 = v(F2)
add2: v1 = v(F8), v2 = v(F2)
EXADD: v(F6), v(F2) (result on the CDB)
mult1: v1 = v(F2), v2 = v(F4)
mult2: q1 = mult1, v2 = v(F6)
EXMUL: v(F2), v(F4)

Register status (q): F0 = mult1, F2 = v(F2), F6 = add2, F8 = v(F8), F10 = mult2

Tomasulo's example: Cycle 11

Instruction status
Instruction          Issue  Start Execute  Write Result
LD    F6, 34+R2      1      2              4
LD    F2, 45+R3      2      5              7
MULTD F0, F2, F4     3      8
SUBD  F8, F6, F2     4      8              10
DIVD  F10, F0, F6    5
ADDD  F6, F8, F2     6      11

MULTD: 7 cycles remaining.

Reservation stations and execution units (fields v1, q1, v2, q2; entries not listed are empty)
add2: v1 = v(F8), v2 = v(F2)
EXADD: v(F8), v(F2)
mult1: v1 = v(F2), v2 = v(F4)
mult2: q1 = mult1, v2 = v(F6)
EXMUL: v(F2), v(F4)

Register status (q): F0 = mult1, F2 = v(F2), F6 = add2, F8 = v(F8), F10 = mult2

Tomasulo's example: Cycle 13

Instruction status
Instruction          Issue  Start Execute  Write Result
LD    F6, 34+R2      1      2              4
LD    F2, 45+R3      2      5              7
MULTD F0, F2, F4     3      8
SUBD  F8, F6, F2     4      8              10
DIVD  F10, F0, F6    5
ADDD  F6, F8, F2     6      11             13

MULTD: 5 cycles remaining. Latency ADDD: 2 cycles.

Reservation stations and execution units (fields v1, q1, v2, q2; entries not listed are empty)
add2: v1 = v(F8), v2 = v(F2)
EXADD: v(F8), v(F2) (result on the CDB)
mult1: v1 = v(F2), v2 = v(F4)
mult2: q1 = mult1, v2 = v(F6)
EXMUL: v(F2), v(F4)

Register status (q): F0 = mult1, F2 = v(F2), F6 = v(F6), F8 = v(F8), F10 = mult2

Note: the WAR hazard on F6 has already been eliminated. ADDD writes its result on the CDB and in F6, not in v2 of the mult2 RS for DIVD, which has already read v(F6).

Tomasulo's example: Cycle 18

Instruction status
Instruction          Issue  Start Execute  Write Result
LD    F6, 34+R2      1      2              4
LD    F2, 45+R3      2      5              7
MULTD F0, F2, F4     3      8              18
SUBD  F8, F6, F2     4      8              10
DIVD  F10, F0, F6    5
ADDD  F6, F8, F2     6      11             13

Reservation stations and execution units (fields v1, q1, v2, q2; entries not listed are empty)
mult1: v1 = v(F2), v2 = v(F4)
mult2: v1 = v(F0), v2 = v(F6)
EXMUL: v(F2), v(F4) (result on the CDB)

Register status (q): F0 = v(F0), F2 = v(F2), F6 = v(F6), F8 = v(F8), F10 = mult2

Tomasulo's example: Cycle 19

Instruction status
Instruction          Issue  Start Execute  Write Result
LD    F6, 34+R2      1      2              4
LD    F2, 45+R3      2      5              7
MULTD F0, F2, F4     3      8              18
SUBD  F8, F6, F2     4      8              10
DIVD  F10, F0, F6    5      19
ADDD  F6, F8, F2     6      11             13

Reservation stations and execution units (fields v1, q1, v2, q2; entries not listed are empty)
mult2: v1 = v(F0), v2 = v(F6)
EXMUL: v(F0), v(F6)

Register status (q): F0 = v(F0), F2 = v(F2), F6 = v(F6), F8 = v(F8), F10 = mult2

Tomasulo's example: Cycle 59

Instruction status
Instruction          Issue  Start Execute  Write Result
LD    F6, 34+R2      1      2              4
LD    F2, 45+R3      2      5              7
MULTD F0, F2, F4     3      8              18
SUBD  F8, F6, F2     4      8              10
DIVD  F10, F0, F6    5      19             59
ADDD  F6, F8, F2     6      11             13

Latency DIVD: 40 cycles.

Reservation stations and execution units (fields v1, q1, v2, q2; entries not listed are empty)
mult2: v1 = v(F0), v2 = v(F6)
EXMUL: v(F0), v(F6) (result on the CDB)

Register status (q): F0 = v(F0), F2 = v(F2), F6 = v(F6), F8 = v(F8), F10 = v(F10)

Compare Scoreboard vs Tomasulo

Instruction status
                     Scoreboard                                 Tomasulo
Instruction          Issue  Read Oper  Exec Comp  Write Result  Issue  Start Exec  Write Result
LD    F6, 34+R2      1      2          3          4             1      2           4
LD    F2, 45+R3      5      6          7          8             2      5           7
MULTD F0, F2, F4     6      9          19         20            3      8           18
SUBD  F8, F6, F2     7      9          11         12            4      8           10
DIVD  F10, F0, F6    8      21         61         62            5      19          59
ADDD  F6, F8, F2     13     14         16         22            6      11          13
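The Tomasulo columns of this comparison can be reproduced with a very small timing model. This is a sketch under assumptions that are inferred from the tables rather than stated in the slides: one instruction issues per cycle and never stalls for a missing RS, each execution unit (load, add, mult) is non-pipelined and stays busy until the cycle after its instruction writes the result, a load takes 2 cycles, a consumer starts the cycle after its last operand is written on the CDB, and CDB conflicts are ignored. All Python names are illustrative.

```python
# Execution latencies; the load latency of 2 is inferred from the tables.
LATENCY = {"LD": 2, "MULTD": 10, "SUBD": 2, "DIVD": 40, "ADDD": 2}
UNIT = {"LD": "load", "MULTD": "mult", "DIVD": "mult", "SUBD": "add", "ADDD": "add"}

# (opcode, destination register, source registers) in program order.
PROGRAM = [
    ("LD", "F6", ["R2"]),
    ("LD", "F2", ["R3"]),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def schedule(program):
    """Return (op, dest, issue, start_execute, write_result) for each instruction."""
    written = {}    # register -> cycle in which its latest producer writes the CDB
    unit_free = {}  # unit -> first cycle in which it can start a new operation
    rows = []
    for i, (op, dest, srcs) in enumerate(program):
        issue = i + 1  # in-order issue, one instruction per cycle
        operands_ready = max(written.get(s, 0) for s in srcs) + 1
        start = max(issue + 1, operands_ready, unit_free.get(UNIT[op], 0))
        write = start + LATENCY[op]
        unit_free[UNIT[op]] = write + 1
        # Renaming: later readers of dest wait for this producer only.
        written[dest] = write
        rows.append((op, dest, issue, start, write))
    return rows


timeline = schedule(PROGRAM)
for op, dest, issue, start, write in timeline:
    print(f"{op:5} {dest:3} issue={issue:2} start={start:2} write={write:2}")
```

Because instructions are processed in program order, DIVD looks up the producer of F6 before ADDD replaces it, which mirrors how renaming removes the WAR hazard on F6. The model is deliberately too simple for code that runs out of reservation stations or competes for the CDB.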

Tomasulo (IBM) versus Scoreboard (CDC)

Tomasulo (IBM):
- Issue window size = 5
- No issue on structural hazards in RS
- WAR, WAW avoided with renaming
- Broadcast results from FU
- Control distributed on RS
- Allows loop unrolling in HW

Scoreboard (CDC):
- Issue window size = 12
- No issue on structural hazards in FU
- Stall the completion for WAW and WAR hazards
- Results written back on registers
- Control centralized through the Scoreboard

Limits to the Instruction Level Parallelism
- Branches
- Exceptions
  - (non-)precise: operand integrity for the exception handler
  - (non-)exact: handler modifications are seen by instructions after the exception

Tomasulo Drawbacks
- Complexity
  - Large amount of hardware
  - Delays of 360/91, MIPS R10000, IBM 620?
- Many associative stores (CDB) at high speed
- Performance limited by the Common Data Bus
  - Multiple CDBs: more FU logic for parallel associative stores

Summary (1)
- HW exploiting ILP
  - Works when dependences can't be known at compile time
  - Code for one machine runs well on another
- Key idea of Scoreboard: allow instructions behind a stall to proceed (Decode is split into Issue Instruction and Read Operands)
  - Enables out-of-order execution => out-of-order completion
  - ID stage checks both for structural and data dependencies
  - Original version didn't handle forwarding
  - No automatic register renaming

Summary (2)
- Reservation Stations: renaming to a larger set of registers + buffering of source operands
  - Prevents registers from becoming a bottleneck
  - Avoids the WAR and WAW hazards of the Scoreboard
  - Allows loop unrolling in HW
- Not limited to basic blocks (the integer unit gets ahead, beyond branches)
- Helps with cache misses as well
- Lasting contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- IBM 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

Dynamic Scheduling Techniques: Scoreboard vs. Tomasulo

SCOREBOARD BASIC SCHEME
- IN-ORDER ISSUE
- OUT-OF-ORDER READ OPERANDS
- OUT-OF-ORDER EXECUTION
- OUT-OF-ORDER COMPLETION
- NO FORWARDING
- Control is centralized into the Scoreboard

SCOREBOARD STAGES
- ISSUE (IN-ORDER):
  - Check for structural hazards
  - Check for WAW hazards on destination operands
- READ OPERANDS (OUT-OF-ORDER):
  - Check for RAW hazards
  - Check for structural hazards in reading the RF
- EXECUTION (OUT-OF-ORDER):
  - Execution completion depends on the latency of the FUs
  - Execution completion of LD/ST depends on cache hit/miss latencies
- WRITE RESULTS (OUT-OF-ORDER):
  - Check for WAR hazards on destination operands
  - Check for structural hazards in writing the RF
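The two stall conditions that register renaming removes in Tomasulo can be sketched as simple predicates on the scoreboard state. The set-based state and the function names are assumptions made for illustration; a real scoreboard keeps this information in its functional-unit and register-result status tables.

```python
def can_issue(unit, dest, busy_units, pending_dests):
    """ISSUE: stall on a busy FU (structural hazard) or a pending write to dest (WAW)."""
    return unit not in busy_units and dest not in pending_dests


def can_write(dest, unread_sources_of_earlier_instrs):
    """WRITE RESULTS: stall while an earlier instruction still has to read dest (WAR)."""
    return dest not in unread_sources_of_earlier_instrs


# ADDD F6, F8, F2 has finished executing, but DIVD F10, F0, F6 has not read F6 yet.
war_stall = not can_write("F6", {"F0", "F6"})
# A second instruction writing F6 cannot issue while the first write is pending.
waw_stall = not can_issue("add", "F6", busy_units=set(), pending_dests={"F6"})
```

Both flags are true here. In the comparison table this WAR stall is visible in the Scoreboard columns: ADDD completes execution at cycle 16 but writes its result only at cycle 22, after DIVD has read its operands at cycle 21.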

SCOREBOARD optimisations
- Check for WAW in the WRITE stage instead of in the ISSUE stage
- Forwarding

TOMASULO BASIC SCHEME
- IN-ORDER ISSUE
- OUT-OF-ORDER EXECUTION
- OUT-OF-ORDER COMPLETION
- REGISTER RENAMING based on Reservation Stations to avoid WAR and WAW hazards
- Results dispatched to RESERVATION STATIONS and to RF through the Common Data Bus
- Control is distributed on Reservation Stations
- Reservation Stations offer a sort of data forwarding!

TOMASULO STAGES
- ISSUE (IN-ORDER):
  - Check for structural hazards in RESERVATION STATIONS (not in FU)
- START EXECUTE (OUT-OF-ORDER):
  - When operands are ready (RAW hazards solved)
  - When FU is available (check for structural hazards in FU)
- WRITE RESULTS (OUT-OF-ORDER):
  - Execution completion depends on the latency of the FUs
  - Execution completion of LD/ST depends on cache hit/miss latencies
  - Write results on the Common Data Bus to Reservation Stations and RF