Another Dynamic Algorithm: Tomasulo Algorithm. Differences between Tomasulo Algorithm & Scoreboard. Tomasulo Organization. Reservation Station Components


Another Dynamic Algorithm: Tomasulo Algorithm
For the IBM 360/91, about 3 years after the CDC 6600
Goal: high performance without special compilers
Differences between the IBM 360 and CDC 6600 ISAs:
  IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600
  IBM has 4 FP registers vs. 8 in the CDC 6600
Implications?

Differences between Tomasulo Algorithm & Scoreboard
Control & buffers are distributed with the function units vs. centralized in the scoreboard; the buffers are called reservation stations => instructions schedule themselves
Registers in instructions are replaced by pointers to reservation station buffers
  Scoreboard => registers are the primary operand storage
  Tomasulo => reservation stations serve as operand storage
HW renaming of registers avoids WAR and WAW hazards
  Scoreboard => both source registers are read together (thus one could not be overwritten while we wait for the other)
  Tomasulo => each register is read as soon as it is available
The Common Data Bus (CDB) broadcasts results to all FUs
  RSs (FUs), registers, etc. are responsible for collecting their own data off the CDB
Load and store queues are treated as FUs as well

Tomasulo Organization
[Figure: Tomasulo organization]

Reservation Station Components
  Op: operation to perform in the unit (e.g., + or -)
  Qj, Qk: reservation stations producing the source registers
  Vj, Vk: values of the source operands
  Rj, Rk: flags indicating when Vj, Vk are ready
  Busy: indicates the reservation station is busy
Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
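The reservation station fields and the register result status described above can be sketched as a small data structure. This is a minimal illustration, not the 360/91 hardware: the names `ReservationStation`, `issue`, and `broadcast`, and the station tags such as "add1", are assumptions made for this example. A `None` in Qj or Qk stands in for the Rj/Rk "ready" flags.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ReservationStation:
    name: str                     # tag broadcast on the CDB, e.g. "add1" (hypothetical)
    busy: bool = False
    op: Optional[str] = None
    qj: Optional[str] = None      # tag of the station producing operand j; None = value present
    qk: Optional[str] = None
    vj: Optional[float] = None    # operand values, valid only when the matching Q is None
    vk: Optional[float] = None

    def ready(self) -> bool:
        # Plays the role of the Rj/Rk flags: both operands have arrived.
        return self.busy and self.qj is None and self.qk is None


def _read(src: str, regs: Dict[str, float], reg_status: Dict[str, str]):
    """Return (tag, None) if a station will still write src, else (None, value)."""
    producer = reg_status.get(src)
    return (producer, None) if producer else (None, regs[src])


def issue(rs: ReservationStation, op: str, dest: str, src_j: str, src_k: str,
          regs: Dict[str, float], reg_status: Dict[str, str]) -> None:
    """Fill a free station; sources are read before dest is renamed."""
    rs.busy, rs.op = True, op
    rs.qj, rs.vj = _read(src_j, regs, reg_status)
    rs.qk, rs.vk = _read(src_k, regs, reg_status)
    reg_status[dest] = rs.name    # register result status: who writes dest


def broadcast(tag: str, value: float, stations: List[ReservationStation],
              regs: Dict[str, float], reg_status: Dict[str, str]) -> None:
    """CDB broadcast: every waiting station and register picks up its own data."""
    for rs in stations:
        if rs.qj == tag:
            rs.qj, rs.vj = None, value
        if rs.qk == tag:
            rs.qk, rs.vk = None, value
    for reg in [r for r, t in reg_status.items() if t == tag]:
        regs[reg] = value
        del reg_status[reg]       # blank again: no pending writer
```

A station that issues while a source is still pending holds a tag instead of a value, and becomes ready only when that tag appears on the bus.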

Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If a reservation station is free, control issues the instruction & sends operands (renames registers).
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the result.
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all waiting units; mark the reservation station available.

Tomasulo Example
Multiply takes 10 clocks, add/sub take 4.
[Figures: cycle-by-cycle snapshots, starting at cycle 0, of a short FP instruction sequence (including MULD and SUBD) over registers such as F4, F6 and F8. Each snapshot shows the instructions still waiting to issue, the register file with the tag of the station that will write each register, and the add and mult reservation stations (Op, Qj, Qk, Vj, Vk, Busy). The surviving annotations mark, in order, an add result, the add3 result, another add result, the mult result, and a final add result.]
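The three stages can be exercised with a small simulator. This is a sketch under stated assumptions, not a reproduction of the slides: the operands of the example were lost in transcription, so the five-instruction program and the initial register values in the usage note are assumed for illustration; the per-cycle ordering (write result, then execute, then issue) and the station counts are also choices made here, so cycle numbers need not match the original figures. Only the latencies (10 for multiply, 4 for add/sub) come from the text.

```python
from dataclasses import dataclass
from typing import Optional

LATENCY = {"ADDD": 4, "SUBD": 4, "MULD": 10}          # from the example slide
UNIT = {"ADDD": "add", "SUBD": "add", "MULD": "mult"}
APPLY = {"ADDD": lambda a, b: a + b,
         "SUBD": lambda a, b: a - b,
         "MULD": lambda a, b: a * b}


@dataclass
class RS:
    name: str
    busy: bool = False
    op: str = ""
    qj: Optional[str] = None
    qk: Optional[str] = None
    vj: float = 0.0
    vk: float = 0.0
    remaining: int = 0            # execution cycles left once operands are ready


def simulate(program, regs, n_add=3, n_mult=2, max_cycles=1000):
    """Run (op, dest, src_j, src_k) tuples; return (final registers, cycles)."""
    regs = dict(regs)
    stations = ([RS(f"add{i}") for i in range(1, n_add + 1)] +
                [RS(f"mult{i}") for i in range(1, n_mult + 1)])
    status = {}                   # register -> tag of the station that will write it
    pc = cycle = 0
    while pc < len(program) or any(s.busy for s in stations):
        cycle += 1
        if cycle > max_cycles:
            raise RuntimeError("simulation did not finish")
        # Write result: one finished station per cycle uses the single CDB.
        done = next((s for s in stations if s.busy and s.qj is None
                     and s.qk is None and s.remaining == 0), None)
        if done is not None:
            value = APPLY[done.op](done.vj, done.vk)
            for s in stations:
                if s.qj == done.name:
                    s.qj, s.vj = None, value
                if s.qk == done.name:
                    s.qk, s.vk = None, value
            # Only registers still tagged with this station are written (no WAW).
            for reg in [r for r, t in status.items() if t == done.name]:
                regs[reg] = value
                del status[reg]
            done.busy = False
        # Execute: stations with both operands count down their latency.
        for s in stations:
            if s.busy and s.qj is None and s.qk is None and s.remaining > 0:
                s.remaining -= 1
        # Issue: in order, only if a station of the right kind is free.
        if pc < len(program):
            op, dest, src_j, src_k = program[pc]
            free = next((s for s in stations
                         if not s.busy and s.name.startswith(UNIT[op])), None)
            if free is not None:
                free.busy, free.op, free.remaining = True, op, LATENCY[op]
                tj, tk = status.get(src_j), status.get(src_k)
                free.qj, free.vj = (tj, 0.0) if tj else (None, regs[src_j])
                free.qk, free.vk = (tk, 0.0) if tk else (None, regs[src_k])
                status[dest] = free.name      # rename dest to this station
                pc += 1
    return regs, cycle
```

With the assumed program `ADDD F4,F2,F0; MULD F8,F4,F2; ADDD F6,F8,F6; SUBD F8,F2,F0; ADDD F2,F8,F0` and initial values F0=0.0, F2=2.0, F4=4.0, F6=6.0, F8=8.0, the SUBD finishes long before the MULD, yet F8 ends up holding the SUBD result because the later MULD broadcast no longer matches F8's tag. The MULD also keeps the old value of F2 it captured at issue, so the later write to F2 causes no WAR hazard.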

Tomasulo Summary
Prevents registers from becoming a bottleneck
Avoids the WAR, WAW hazards of the scoreboard
Allows loop unrolling in HW
Not limited to basic blocks (provided branch prediction)
Lasting contributions:
  Dynamic scheduling
  Register renaming (in what way does the register name change?)
  Load/store disambiguation

Scoreboard vs. Tomasulo, the score
                             Scoreboard                   Tomasulo
issue                        when FU free                 when RS free
read operands                from reg file                from reg file, CDB
write operands               to reg file                  to CDB
structural hazards           functional units             reservation stations
WAW, WAR hazards             problem                      no problem
register renaming            no                           yes
instructions completing      no limit                     1/cycle (per CDB)
instructions beginning ex.   1 (per set of read ports)    no limit

Modern Architectures
Alpha 21264+, MIPS R10K+, Pentium 4 use an instruction queue.
They use explicit register renaming.
Registers are not read until the instruction dispatches (begins execution).
Register renaming ensures no conflicts.

MIPS R10000, some detail
[Figures: a four-instruction example (Div, Add, Sub, Lw, in which R5 is written by both Div and Sub and R7 by both Add and Lw) is renamed step by step onto physical registers (PRnn). Each step shows the register map, the instruction queue, and the active list with its head pointer.]

Dynamic Scheduling Key Points
Dynamic scheduling is code motion in HW.
Dynamic scheduling can do things SW scheduling (static scheduling) cannot.
Scoreboard and Tomasulo have various tradeoffs.
Register renaming eliminates WAW, WAR dependencies.
To get cross-iteration parallelism, we need to eliminate WAW, WAR dependencies.
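Explicit renaming with a register map and a free list can be sketched as follows. This is an illustrative model only: the function `rename`, the tuple layout, the source registers R1 and R2, and all physical register numbers are assumptions (the numbers in the transcript are incomplete), and the active list here simply records the previous mapping of each destination so that it could be freed when the instruction graduates.

```python
from collections import deque


def rename(program, reg_map, free_list):
    """Rename (op, dest, [srcs]) tuples onto physical registers.

    Returns the renamed program, the updated register map, and an
    active list of (logical dest, previous physical register) pairs.
    """
    reg_map = dict(reg_map)
    free = deque(free_list)
    renamed, active_list = [], []
    for op, dest, srcs in program:
        phys_srcs = [reg_map[s] for s in srcs]   # sources use the current map
        new_dest = free.popleft()                # every write gets a fresh register
        active_list.append((dest, reg_map.get(dest)))
        reg_map[dest] = new_dest                 # later readers see the new name
        renamed.append((op, new_dest, phys_srcs))
    return renamed, reg_map, active_list


# Hypothetical instance shaped like the slide's example.
program = [
    ("Div", "R5", ["R4", "R2"]),
    ("Add", "R7", ["R5", "R1"]),
    ("Sub", "R5", ["R3", "R2"]),
    ("Lw",  "R7", ["R5"]),
]
initial_map = {f"R{i}": f"PR{i}" for i in range(1, 8)}
renamed, final_map, active = rename(program, initial_map,
                                    ["PR10", "PR11", "PR12", "PR13"])
for op, dest, srcs in renamed:
    print(op, dest, ", ".join(srcs))
```

After renaming, the two writes to R5 target different physical registers, so the Sub no longer has a WAW dependence on the Div or a WAR dependence on the Add; only the true dependences (Add on Div, Lw on Sub) remain.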