Dynamic Scheduling: CDC 6600 Scoreboard and the Tomasulo Algorithm

Transcription:

Dynamic Scheduling (or out-of-order execution)

Dynamic Scheduling
Or ydanicm ceshuldngi

- CDC 6600 scoreboard
  - Instruction storage added to each functional execution unit (FU).
  - Instructions issue to an FU when there are no structural hazards and begin execution when their dependences are satisfied. Thus, instructions issued to different FUs can execute out of order.
  - The scoreboard tracks RAW, WAR, and WAW hazards and tells each instruction when to proceed.
  - No forwarding.
  - No register renaming.
- Tomasulo (IBM 360/91)
- Instruction queue (MIPS R10000, Alpha 21264, ...)

Tomasulo Algorithm

- For the IBM 360/91, about 3 years after the CDC 6600.
- Goal: high performance without special compilers.
- Differences between the IBM 360 and CDC 6600 ISAs:
  - IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600.
  - IBM has 4 FP registers vs. 8 in the CDC 6600.
- Implications?

Differences between Tomasulo Algorithm & Scoreboard

- Control and buffers are distributed with the function units rather than centralized in a scoreboard. The buffers are called reservation stations, so instructions schedule themselves.
- Registers in instructions are replaced by pointers to reservation station buffers.
  - Scoreboard => registers are the primary operand storage.
  - Tomasulo => reservation stations serve as operand storage.
- Hardware renaming of registers avoids WAR and WAW hazards.
  - Scoreboard => both source registers are read together.
  - Tomasulo => each register is read as soon as it is available.
- The Common Data Bus (CDB) broadcasts results to all FUs.
- Reservation stations (FUs), registers, etc. are responsible for collecting their own data off the CDB.
- Load and store queues are treated as FUs as well.
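The three hazard classes that the scoreboard tracks can be illustrated with a small classifier. This is only a sketch: the `hazards` function and the tuple encoding `(dest, src1, src2)` are assumptions made for illustration, and the three instructions are illustrative rather than taken from any particular machine.

```python
def hazards(first, second):
    """Return the hazards that `second` has with respect to the earlier `first`.

    Each instruction is encoded as (dest, src1, src2); this encoding is an
    assumption made for this sketch.
    """
    d1, *s1 = first
    d2, *s2 = second
    found = set()
    if d1 in s2:
        found.add("RAW")  # second reads what first writes (true dependence)
    if d2 in s1:
        found.add("WAR")  # second overwrites a register first still has to read
    if d1 == d2:
        found.add("WAW")  # both write the same register
    return found


muld = ("F8", "F4", "F2")  # MULD F8, F4, F2
addd = ("F6", "F8", "F6")  # ADDD F6, F8, F6
subd = ("F8", "F2", "F0")  # SUBD F8, F2, F0

print(hazards(muld, addd))  # RAW on F8
print(hazards(addd, subd))  # WAR on F8
print(hazards(muld, subd))  # WAW on F8
```

Only the RAW case reflects real data flow. The WAR and WAW cases come from reusing the name F8, which is why hardware renaming can remove them.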

Tomasulo Organization

[Figure: Tomasulo organization]

Reservation Station Components

- Op: operation to perform in the unit (e.g., + or -)
- Qj, Qk: reservation stations producing the source registers
- Vj, Vk: values of the source operands
- Rj, Rk: flags indicating when Vj, Vk are ready
- Busy: indicates the reservation station is busy

Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.

Three Stages of Tomasulo Algorithm

1. Issue: get an instruction from the FP instruction queue. If a reservation station is free, the IQ issues the instruction and sends the operands (renames registers).
2. Execution: operate on operands (EX). When both operands are ready, execute; if not ready, watch the CDB for the result.
3. Write result: finish execution (WB). Write on the Common Data Bus to all waiting units; mark the reservation station available.

Tomasulo Example

ADDD F4, F2, F0
MULD F8, F4, F2
ADDD F6, F8, F6
SUBD F8, F2, F0
ADDD F2, F8, F0

Multiply takes 10 clocks, add/sub take 4.
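The three stages can be sketched as a tiny cycle-by-cycle simulation. Everything here is a sketch under stated assumptions, not the 360/91 design: the `Station` class and `simulate` function are hypothetical names, three add stations and two multiply stations are assumed, an instruction starts executing the cycle after its last operand arrives, and the result is broadcast in its last execution cycle. The program is the five-instruction example as best it can be read from the slides (F0 = 0.0, F2 = 2.0, F4 = 4.0, F6 = 6.0, F8 = 8.0 initially).

```python
from dataclasses import dataclass
from typing import Optional

LATENCY = {"ADDD": 4, "SUBD": 4, "MULD": 10}  # clocks, as given in the example
APPLY = {"ADDD": lambda a, b: a + b,
         "SUBD": lambda a, b: a - b,
         "MULD": lambda a, b: a * b}


@dataclass
class Station:
    name: str
    kind: str                   # "add" or "mult"
    busy: bool = False
    op: str = ""
    vj: Optional[float] = None  # operand values
    vk: Optional[float] = None
    qj: Optional[str] = None    # tags of the stations producing Vj / Vk
    qk: Optional[str] = None
    left: int = 0               # execution cycles remaining
    ready_since: int = 0        # cycle in which the last operand arrived


def simulate(program, regs):
    regs = dict(regs)
    status = {}  # register result status: register -> tag of pending writer
    stations = [Station("add1", "add"), Station("add2", "add"),
                Station("add3", "add"), Station("mult1", "mult"),
                Station("mult2", "mult")]  # station counts are an assumption
    queue, broadcasts, cycle = list(program), [], 0
    while queue or any(s.busy for s in stations):
        cycle += 1
        # 1. Issue: in order, only if a matching station is free.
        if queue:
            op, dst, src1, src2 = queue[0]
            kind = "mult" if op == "MULD" else "add"
            free = next((s for s in stations
                         if s.kind == kind and not s.busy), None)
            if free:
                queue.pop(0)
                free.busy, free.op, free.left = True, op, LATENCY[op]
                free.qj, free.qk = status.get(src1), status.get(src2)
                free.vj = None if free.qj else regs[src1]
                free.vk = None if free.qk else regs[src2]
                free.ready_since = cycle
                status[dst] = free.name  # renaming: dst now means this tag
        # 2. Execute: stations whose operands were ready before this cycle.
        done = []
        for s in stations:
            if (s.busy and s.qj is None and s.qk is None
                    and s.ready_since < cycle):
                s.left -= 1
                if s.left == 0:
                    done.append(s)
        # 3. Write result: broadcast on the CDB to stations and registers.
        for s in done:
            value = APPLY[s.op](s.vj, s.vk)
            broadcasts.append((cycle, s.name, value))
            for w in stations:
                if w.qj == s.name:
                    w.vj, w.qj, w.ready_since = value, None, cycle
                if w.qk == s.name:
                    w.vk, w.qk, w.ready_since = value, None, cycle
            for r in [r for r, tag in status.items() if tag == s.name]:
                regs[r] = value
                del status[r]
            s.busy = False
    return broadcasts, regs


program = [("ADDD", "F4", "F2", "F0"),
           ("MULD", "F8", "F4", "F2"),
           ("ADDD", "F6", "F8", "F6"),
           ("SUBD", "F8", "F2", "F0"),
           ("ADDD", "F2", "F8", "F0")]
regs = {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F8": 8.0}
broadcasts, final = simulate(program, regs)
for cycle, tag, value in broadcasts:
    print(f"cycle {cycle}: {tag} broadcasts {value}")
print(final)
```

Under these assumptions the multiply result never reaches F8, because by then the register result status for F8 points at the later SUBD; only the waiting ADDD picks the value up from the CDB. The sketch ignores contention for the single CDB, which does not arise in this example.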

[Figures: Tomasulo cycles 0 to 6, 8, 9, 12, 15, 16, and 19. Each snapshot shows the instruction queue, the register file (initially F0 = 0.0, F2 = 2.0, F4 = 4.0, F6 = 6.0, F8 = 8.0), the register result status, and the add and multiply reservation stations. Results marked in the snapshots: add1 result 2.0 (cycle 5), add3 result 2.0 (cycle 8), add1 result 2.0 (cycle 12), mult1 result 4.0 (cycles 15 to 16), add2 result 10.0 (cycle 19). Final register values: F0 = 0.0, F2 = 2.0, F4 = 2.0, F6 = 10.0, F8 = 2.0.]

Tomasulo Summary

- Prevents the register file from becoming a bottleneck.
- Avoids the WAR and WAW hazards of the scoreboard.
- Allows loop unrolling in hardware.
- Not limited to basic blocks (provided branch prediction).
- Lasting contributions:
  - Dynamic scheduling
  - Register renaming (in what way does the register name change?)
  - Load/store disambiguation

Modern Architectures

- Alpha 21264+, MIPS R10K+, and Pentium 4 use an instruction queue.
- They use explicit register renaming.
- Registers are not read until the instruction dispatches (begins execution). Register renaming ensures no conflicts.

MIPS R10000, some detail

[Figures: a sequence of snapshots of the register map, instruction queue, and active list as four instructions (I1: Div, I2: Add, I3: Sub, I4: Lw) are renamed from architectural registers (R) to physical registers (PR) and then executed. Only part of the register numbering is legible, e.g. the Div writes PR37 and the Add reads PR37.]

- The active list maintains original instruction order and determines when a physical register can be freed.
- In the snapshots, I3 completes and broadcasts a completion signal to the IQ, followed by I4, then I1 (producing physical register 37), then I2. Finally, I1 commits.
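Explicit renaming with a register map, a free list, and an active list can be sketched as follows. This is a minimal illustration, not the R10000 implementation: the `Renamer` class, its method names, the register counts, and the register numbers in the usage example are all hypothetical and are not the values from the slides. Stalling on an empty free list is not modeled.

```python
from collections import deque


class Renamer:
    def __init__(self, num_arch, num_phys):
        # Assumption: architectural register Ri initially lives in PRi.
        self.map = {f"R{i}": f"PR{i}" for i in range(num_arch)}
        self.free = deque(f"PR{i}" for i in range(num_arch, num_phys))
        # Active list entries, in program order: (dest, new phys, old phys).
        self.active = deque()

    def rename(self, op, dst, srcs):
        # Sources are looked up before the destination mapping changes.
        new_srcs = [self.map[s] for s in srcs]
        new = self.free.popleft()
        self.active.append((dst, new, self.map[dst]))
        self.map[dst] = new
        return (op, new, new_srcs)

    def commit(self):
        # The oldest instruction commits; the physical register that held
        # the previous value of its destination can now be reused.
        dst, new, old = self.active.popleft()
        self.free.append(old)
        return old


r = Renamer(num_arch=8, num_phys=12)
i1 = r.rename("div", "R1", ["R4", "R2"])
i2 = r.rename("add", "R7", ["R1", "R3"])  # reads the Div's new register
i3 = r.rename("sub", "R1", ["R3", "R2"])  # second write to R1 gets its own PR
i4 = r.rename("lw", "R7", ["R1"])         # reads the Sub's register, not the Div's
freed = r.commit()                        # I1 commits, old mapping of R1 is freed
print(i1, i2, i3, i4, freed, sep="\n")
```

The old physical register is freed only at commit because, until the overwriting instruction commits, an earlier instruction or an exception recovery may still need the previous value.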

Dynamic Scheduling Key Points

- Dynamic scheduling is code motion in hardware.
- Dynamic scheduling can do things software scheduling (static scheduling) cannot.
- Register renaming eliminates WAW and WAR dependences.
- To get cross-iteration parallelism, we need to eliminate WAW and WAR dependences.
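The last two points can be checked on a small example: two back-to-back iterations of a loop body reuse the same register names, and renaming removes the resulting WAR and WAW conflicts. The loop body, the `(dest, sources)` encoding, and the helper names `name_hazards` and `rename` are assumptions made for this sketch. The pairwise check is deliberately conservative and the renamer hands out an unlimited supply of fresh names.

```python
from itertools import count


def name_hazards(code):
    """List (i, j, kind) for every WAR/WAW conflict between instructions i < j."""
    found = []
    for j, (dj, _) in enumerate(code):
        if dj is None:  # instruction j writes no register (e.g. a store)
            continue
        for i in range(j):
            di, si = code[i]
            if dj in si:
                found.append((i, j, "WAR"))
            if dj == di:
                found.append((i, j, "WAW"))
    return found


def rename(code):
    """Give every destination a fresh name; sources follow the latest mapping."""
    fresh = (f"P{n}" for n in count())
    current = {}  # architectural name -> latest physical name
    out = []
    for dst, srcs in code:
        new_srcs = tuple(current.get(s, s) for s in srcs)  # live-ins keep names
        new_dst = None
        if dst is not None:
            new_dst = current[dst] = next(fresh)
        out.append((new_dst, new_srcs))
    return out


body = [("F0", ("R1",)),       # LD   F0, 0(R1)
        ("F4", ("F0", "F2")),  # MULD F4, F0, F2
        (None, ("F4", "R1")),  # SD   F4, 0(R1)
        ("R1", ("R1",))]       # SUBI R1, R1, 8
two_iterations = body + body

before = name_hazards(two_iterations)
after = name_hazards(rename(two_iterations))
print(len(before), "name conflicts before renaming,", len(after), "after")
```

After renaming, the second iteration depends on the first only through the true dependence on the updated R1, so its load and multiply are free to overlap with the first iteration's work.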