Out-of-Order Execution

Autumn 2006 CSE P548 - Tomasulo

Several implementations
- out-of-order completion
  - CDC 6600 with scoreboarding
  - IBM 360/91 with Tomasulo's algorithm & reservation stations
  - out-of-order completion leads to:
    - imprecise interrupts
    - WAR hazards
    - WAW hazards
- in-order completion
  - MIPS R10000/R12000 & Alpha 21264/21364 with large physical register file & register renaming
  - Intel Pentium Pro/Pentium III with the reorder buffer

Out-of-order Hardware

In order to compute correct results, need to keep track of:
- which instruction is in which stage of the pipeline
- which registers are being used for reading/writing & by which instructions
- which operands are available
- which instructions have completed
Each scheme has different hardware structures & different algorithms to do this

Tomasulo's Algorithm

Tomasulo's Algorithm (IBM 360/91)
- out-of-order execution capability plus register renaming
Motivation
- long FP delays
- only 4 FP registers
- wanted common compiler for all implementations

Key features & hardware structures
- reservation stations
- distributed hazard detection & execution control
  - forwarding to eliminate RAW hazards
  - register renaming to eliminate WAR & WAW hazards
  - deciding which instruction to execute next
- common data bus
- dynamic memory disambiguation

Hardware for Tomasulo's Algorithm [figure]

Tomasulo's Algorithm: Key Features

Reservation stations
- buffers for functional units that hold instructions stalled for RAW hazards & their operands
- source operands can be values or names of other reservation station entries or load buffer entries that will produce the value
- both operands don't have to be available at the same time
- when both operand values have been computed, an instruction can be dispatched to its functional unit
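The reservation station entry described above can be sketched as a small record. This is only an illustrative model, not the 360/91 hardware: the class name, the field names (Vj/Vk for operand values, Qj/Qk for producer names) and the capture method are assumptions chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    """Hypothetical model of one reservation station entry."""
    name: str
    op: Optional[str] = None
    vj: Optional[float] = None   # value of source operand 1, if already known
    vk: Optional[float] = None   # value of source operand 2, if already known
    qj: Optional[str] = None     # name of the entry that will produce operand 1
    qk: Optional[str] = None     # name of the entry that will produce operand 2
    busy: bool = False

    def ready(self) -> bool:
        # Dispatch is possible only once both operand values are present.
        return self.busy and self.qj is None and self.qk is None

    def capture(self, producer: str, value: float) -> None:
        # Operands may arrive at different times; fill whichever one matches.
        if self.qj == producer:
            self.vj, self.qj = value, None
        if self.qk == producer:
            self.vk, self.qk = value, None


# An addf whose first operand is known and whose second comes from a load.
rs = ReservationStation(name="Add1", op="addf", vj=1.5, qk="Load1", busy=True)
stalled_before = rs.ready()
rs.capture("Load1", 2.5)
ready_after = rs.ready()
```

The entry holds either a value or a producer name for each source, never both, which is what lets the two operands arrive independently.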

Reservation Stations

RAW hazards eliminated by forwarding
- source operand values that are computed after the registers are read are known by the functional unit or load buffer that will produce them
- results are immediately forwarded to functional units on the common data bus
- don't have to wait until the value is written into the register file

Eliminate WAR & WAW hazards by register renaming
- name-dependent instructions refer to reservation station or load buffer locations for their sources, not the registers (as above)
- the last writer to the register updates it
- more reservation stations than registers, so eliminates more name dependences than a compiler can & exploits more parallelism
- examples on next slide

Reservation Stations

Register renaming eliminates WAR & WAW hazards
- Tag in the reservation station/register file/store buffer indicates where the result will come from

Handling WAW hazards
    addf F1,F0,F8     F1's tag originally specifies addf's entry in the reservation station
    ...
    subf F1,F8,F14    F1's tag now specifies subf's entry in the reservation station
- no register will claim the addf result if it completes last

Handling WAR hazards
    ld F1,_           register F1's tag originally specifies the entry in the load buffer for the ld
    addf _,F1,_       addf's reservation station entry specifies the ld's entry in the load buffer for source operand 1
    subf F1,_         register F1's tag now specifies the reservation station that holds subf
- Does not matter if ld finishes after subf; F1 will no longer claim it & addf will use the load buffer to get the loaded value
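The two cases above can be replayed with a small model of the register tags. This is a sketch under assumptions: the class RenameState, its method names and the numeric results are invented for illustration; only the tag behaviour follows the slides.

```python
class RenameState:
    """Hypothetical model: one tag per register plus waiting consumers."""

    def __init__(self, registers):
        self.value = dict(registers)
        self.tag = {r: None for r in registers}  # producer of the pending write
        self.waiting = {}    # consumer -> producer it is waiting on
        self.captured = {}   # consumer -> operand value it obtained

    def read_source(self, consumer, reg):
        # A source is either a value or the name of its producer.
        if self.tag[reg] is None:
            self.captured[consumer] = self.value[reg]
        else:
            self.waiting[consumer] = self.tag[reg]

    def rename_dest(self, reg, producer):
        # The newest writer overwrites the tag; older writers lose their claim.
        self.tag[reg] = producer

    def broadcast(self, producer, result):
        for reg, t in self.tag.items():
            if t == producer:
                self.value[reg], self.tag[reg] = result, None
        for consumer, t in list(self.waiting.items()):
            if t == producer:
                self.captured[consumer] = result
                del self.waiting[consumer]


# WAW: addf then subf both write F1; addf completes last.
waw = RenameState({"F1": 0.0})
waw.rename_dest("F1", "Add1")   # addf F1,F0,F8
waw.rename_dest("F1", "Add2")   # subf F1,F8,F14
waw.broadcast("Add2", 5.0)      # subf result (value made up)
waw.broadcast("Add1", 9.0)      # late addf result, claimed by no register

# WAR: ld writes F1, addf reads F1, subf writes F1 and finishes before ld.
war = RenameState({"F1": 0.0})
war.rename_dest("F1", "Load1")  # ld F1,_
war.read_source("addf", "F1")   # addf _,F1,_ now waits on Load1
war.rename_dest("F1", "Add2")   # subf F1,_
war.broadcast("Add2", 7.0)      # subf finishes first
war.broadcast("Load1", 3.0)     # ld finishes last
```

In both runs F1 ends up holding the subf result, and in the WAR run the addf still receives the loaded value because it waits on the load buffer entry rather than on F1.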

Tomasulo's Algorithm: More Key Features

Common data bus (CDB)
- connects functional units & load buffer to reservation stations, registers, store buffer
- ships results to all hardware that could want an updated value
- eliminates RAW hazards: do not have to wait until registers are written before consuming a value

Distributed hazard detection & execution control
- each reservation station decides when to dispatch instructions to its functional unit
- each hardware data structure entry that needs a value from the common data bus grabs the value itself: snooping
  - reservation stations, store buffer entries & registers have a tag saying where their data should come from
  - when it matches the data producer's tag on the bus, reservation stations, store buffer entries & registers grab the data

Dynamic memory disambiguation
- the issue: don't want loads to bypass stores to the same location
- the solution: loads associatively check addresses in store buffer
- if an address matches, grab the value
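The store buffer check for dynamic memory disambiguation can be sketched as follows. The function name, the list-of-pairs store buffer and the dictionary memory are assumptions for illustration. The sketch also assumes every buffered store already has its value, and that the youngest matching store wins; real hardware compares all entries in parallel and must also handle stores whose data is still pending.

```python
def load(addr, store_buffer, memory):
    """Return the value a load should see (hypothetical helper).

    store_buffer is a list of (address, value) pairs, oldest first.
    """
    # Associative check: compare the load address against every buffered store.
    for st_addr, st_value in reversed(store_buffer):
        if st_addr == addr:
            return st_value          # grab the value from the store buffer
    return memory[addr]              # no match: memory holds the current data


memory = {0x100: 1.0, 0x108: 2.0}
store_buffer = [(0x100, 10.0), (0x100, 11.0)]   # two stores not yet written to memory

forwarded = load(0x100, store_buffer, memory)
from_memory = load(0x108, store_buffer, memory)
```

Without the check, the load from 0x100 would read 1.0 from memory and bypass both pending stores.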

Tomasulo's Algorithm: Execution Steps

Tomasulo functions (assume the instruction has been fetched)

issue & read
- structural hazard detection for reservation stations & load/store buffers
  - issue if no hazard
  - stall if hazard
- read registers for source operands
  - put into reservation stations if values are in them
  - put tag of producing functional unit or load buffer if not (renaming the registers to eliminate WAR & WAW hazards)

execute
- RAW hazard detection
  - snoop on common data bus for missing operands
  - dispatch instruction to a functional unit when both operand values have been obtained
- execute the operation
- calculate effective address & start memory operation

write
- broadcast result & reservation station id (tag) on the common data bus
- reservation stations, registers & store buffer entries obtain the value through snooping
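The three steps above can be put together in a toy simulator. This is a sketch under heavy assumptions, not the 360/91 design: it handles only register-to-register FP operations (no loads, stores or branches), uses one shared pool of stations, issues at most one instruction and broadcasts at most one result per cycle, and the latencies are made up.

```python
import operator

OPS = {"addf": operator.add, "subf": operator.sub, "mulf": operator.mul}
LATENCY = {"addf": 2, "subf": 2, "mulf": 4}   # assumed cycle counts


def run(program, regs, num_stations=3):
    value = dict(regs)                  # register values
    tag = {r: None for r in value}      # station that will produce each register
    stations = [None] * num_stations    # None means the station is free
    pc = cycle = 0
    while pc < len(program) or any(s is not None for s in stations):
        cycle += 1
        # write: one finished station broadcasts (tag, result) on the CDB
        done = [i for i, s in enumerate(stations) if s and s["left"] == 0]
        if done:
            i = done[0]
            s = stations[i]
            result = OPS[s["op"]](s["vj"], s["vk"])
            stations[i] = None
            for w in stations:          # waiting stations snoop the bus
                if w:
                    if w["qj"] == i:
                        w["vj"], w["qj"] = result, None
                    if w["qk"] == i:
                        w["vk"], w["qk"] = result, None
            for r in tag:               # registers snoop the bus too
                if tag[r] == i:
                    value[r], tag[r] = result, None
        # execute: stations holding both operands count down their latency
        for s in stations:
            if s and s["qj"] is None and s["qk"] is None and s["left"] > 0:
                s["left"] -= 1
        # issue & read: in program order, stall if no station is free
        if pc < len(program) and None in stations:
            op, dest, src1, src2 = program[pc]
            i = stations.index(None)
            stations[i] = {
                "op": op, "left": LATENCY[op],
                "vj": value[src1] if tag[src1] is None else None, "qj": tag[src1],
                "vk": value[src2] if tag[src2] is None else None, "qk": tag[src2],
            }
            tag[dest] = i               # rename the destination register
            pc += 1
    return value, cycle


program = [
    ("mulf", "F0", "F1", "F2"),   # F0 = F1 * F2
    ("addf", "F4", "F0", "F1"),   # RAW on F0: waits for the multiply
    ("subf", "F1", "F2", "F2"),   # WAR on F1: may finish before the addf
]
final, cycles = run(program, {"F0": 1.0, "F1": 2.0, "F2": 3.0, "F4": 4.0})
```

The subf completes before the addf here, yet the addf still uses the old F1 because it copied the value at issue.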

Tomasulo's Algorithm: State

Tomasulo state: the information that the hardware needs to control distributed execution
- operation of the issued instructions waiting for execution (Op)
  - located in reservation stations
- tags that indicate the producer for a source operand (Q)
  - located in reservation stations, registers, store buffer entries
  - what unit (reservation station or load buffer) will produce the operand
  - special value (blank for us) if value already there
- operand values in reservation stations & load/store buffers (V)
- reservation station & load/store buffer busy fields (Busy)
- addresses in load/store buffers (for memory disambiguation)

Example in the Book: 1
Instruction Status Table: first load has executed

Example in the Book: 2
Instruction Status Table: second load has executed

Example in the Book: 3
Instruction Status Table: subtract has executed (Add1)

Example in the Book: 4
Instruction Status Table: add has executed (Add2) (Add1)

Example in the Book: 5
Instruction Status Table: multiply has executed (Mult1) (Add2) (Add1)

Tomasulo's Algorithm

Dynamic loop unrolling
- addf and st in each iteration have a different tag for the F0 value
- only the last iteration writes to F0
- effectively completely unrolling the loop

    LOOP: ld   F0, 0(R1)
          addf F0, F0, F1
          st   F0, 0(R1)
          sub  R1, R1, #8
          bnez R1, LOOP

Nice features relative to static loop unrolling
- effectively increases number of registers (# reservation stations, load buffer entries, registers) but without register pressure
- dynamic memory disambiguation to prevent loads after stores with the same address from getting old data if they execute first
- simpler compiler
Downside
- loop control instructions still executed
- much more complex hardware
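The effect of renaming on the loop above can be sketched by tracking only the tag of F0 across iterations. This is an illustration under assumptions: the function name and the tag names Load1, Add1, ... are invented, every iteration is assumed to get a fresh load buffer entry and adder station, and the integer instructions (sub, bnez) are left out.

```python
def rename_loop(iterations):
    """Trace the F0 tags for the ld / addf / st body (hypothetical helper)."""
    reg_tag = {"F0": None}
    trace = []   # (instruction, tag it produces, tag it consumes for F0)
    for k in range(1, iterations + 1):
        load_tag = f"Load{k}"
        reg_tag["F0"] = load_tag                    # ld F0, 0(R1)
        trace.append(("ld", load_tag, None))

        add_src = reg_tag["F0"]                     # addf reads this iteration's load
        add_tag = f"Add{k}"
        reg_tag["F0"] = add_tag                     # addf F0, F0, F1
        trace.append(("addf", add_tag, add_src))

        trace.append(("st", None, reg_tag["F0"]))   # st F0, 0(R1) reads this addf
    return trace, reg_tag


trace, final_tags = rename_loop(3)
```

Each iteration's addf and st name a different producer, so the iterations are independent of one another even though they all use the architectural name F0, and only the tag from the last iteration is left on the register.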

Dynamic Scheduling

Advantages over compiler code scheduling
- more places to hold register values
- makes dispatch decisions dynamically, based on when instructions actually complete & operands are available
- can completely disambiguate memory references
Effects of these advantages
- more effective at exploiting parallelism (especially given compiler technology at the time)
  - increased instruction throughput
  - increased functional unit utilization
- efficient execution of code compiled for a different pipeline
- simpler compiler in theory
Use both!