Computer Architecture Spring 2016


Lecture 12: Dynamic Scheduling: Tomasulo's Algorithm
Shuai Wang
Department of Computer Science and Technology, Nanjing University
[Slides adapted from CS252, UC Berkeley and CS 246, Harvard University]

Tomasulo's Algorithm
- Used in IBM 360/91 machines (late 1960s)
- Similar to scoreboarding, but adds register renaming in hardware
- Key concept: reservation stations (RS)
- Can eliminate WAW and WAR hazards
- Very important topic: its scheduling ideas led to the Alpha 21264, HP PA-8000, MIPS R10K, Pentium III, Pentium 4, PowerPC 604, etc.


Tomasulo's Algorithm
- Distributed (rather than centralized) control scheme
- Bypassing is allowed via the Common Data Bus (CDB) to the RS
- Register renaming eliminates WAR/WAW hazards
- Scoreboard/instruction buffer => reservation stations (RS)
  - Fetch and buffer operands as soon as they are available
  - Eliminates the need to always get values from registers at execute
  - Pending instructions designate the reservation stations that will provide their inputs
  - Successive writes to a register cause only the last one to update the register

Register Renaming with Tomasulo
- At instruction issue, register specifiers for source operands are renamed to the names of the reservation stations
- Values can exist in a reservation station or in the register file
- To eliminate WAR hazards, register file values are copied to reservation stations at issue
- Other methods use pointer-based renaming (map table), for example
  - Technique used in Pentium III, Pentium M, PowerPC 604
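
The renaming step at issue can be sketched in a few lines of Python. This is an illustrative model only, not the 360/91 hardware: the `Operand` class, the `issue` function, and the station names (`Mult1`, `Mult2`, `Add1`) are assumptions made for the example. Each source operand becomes either a copied value or the name of the station that will produce it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Operand:
    value: Optional[float] = None   # operand value, if it was available at issue
    tag: Optional[str] = None       # otherwise: name of the producing station

def issue(dest, src1, src2, station, regs, reg_status):
    """Rename the sources of one instruction being issued to `station` (hypothetical helper)."""
    def rename(r):
        producer = reg_status.get(r)
        if producer is not None:
            return Operand(tag=producer)    # value pending: remember who produces it
        return Operand(value=regs[r])       # value ready: copy it now (removes WAR)
    ops = (rename(src1), rename(src2))
    reg_status[dest] = station              # later readers of dest wait on this station
    return ops

regs = {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F8": 8.0}
reg_status = {}
mult = issue("F0", "F2", "F4", "Mult1", regs, reg_status)   # MULTD F0, F2, F4
div = issue("F10", "F0", "F6", "Mult2", regs, reg_status)   # DIVD  F10, F0, F6
add = issue("F6", "F8", "F2", "Add1", regs, reg_status)     # ADDD  F6, F8, F2
```

In this sketch the DIVD holds the tag `Mult1` in place of F0 and a private copy of the old F6, so the later ADDD can overwrite F6 without disturbing it.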

Tomasulo Organization

Tomasulo Implementation
Reservation station fields:
- Op: operation to perform in the unit
- Vj, Vk: values of the source operands
  - Store buffers have a V field: the result to be stored
- Qj, Qk: reservation stations producing the source registers (value to be written)
  - Note: Qj, Qk = 0 => ready
  - Store buffers only have Qi, for the RS producing the result
- Busy: indicates the reservation station or FU is busy
Register result status (FU): indicates which functional unit will write each register. Blank when no active instruction will write that register.
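
As a rough illustration, the fields above map onto a small record type. The class below is a sketch under stated assumptions: the name `ReservationStation` and the `ready` method are invented for the example, and `None` stands in for the slide's "Q = 0 means ready" convention.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False              # Busy
    op: Optional[str] = None        # Op: operation to perform
    vj: Optional[float] = None      # Vj, Vk: source operand values
    vk: Optional[float] = None
    qj: Optional[str] = None        # Qj, Qk: producing stations (None = value is ready)
    qk: Optional[str] = None

    def ready(self) -> bool:
        """True once the station is in use and both operands have arrived."""
        return self.busy and self.qj is None and self.qk is None

# SUBD waiting on Load2 for its second operand
add1 = ReservationStation("Add1", busy=True, op="SUBD", vj=6.0, qk="Load2")
before = add1.ready()
add1.vk, add1.qk = 2.0, None        # Load2's result arrives
after = add1.ready()
```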

Three Stages of Tomasulo's Algorithm
1. Issue: get an instruction from the FP Op Queue
   - If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renames registers).
2. Execute: operate on operands (EX)
   - When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write Result: finish execution (WB)
   - Write on the Common Data Bus to all awaiting units; mark the reservation station available.
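
The three stages can be tied together in a toy cycle-by-cycle loop. This is a minimal sketch, not the lecture's timing model: the latencies, the single shared pool of stations named `RS1`..`RSn`, one issue and one CDB write per cycle, and the `simulate` function itself are all assumptions. The stages are processed in reverse order inside each cycle so that a result advances at most one stage per cycle.

```python
LATENCY = {"ADDD": 2, "MULTD": 10}   # assumed execution latencies in cycles
OPS = {"ADDD": lambda a, b: a + b, "MULTD": lambda a, b: a * b}

def simulate(program, regs, n_stations=3):
    """Run (op, dest, src1, src2) tuples; return final registers and completion order."""
    regs = dict(regs)
    status = {}                                   # register -> tag of its pending writer
    rs = {}                                       # tag -> busy reservation station
    free = [f"RS{i}" for i in range(1, n_stations + 1)]
    pc, cycle, finished = 0, 0, []
    while pc < len(program) or rs:
        cycle += 1
        # 3. Write result: one station per cycle broadcasts (tag, value) on the CDB.
        done = [t for t, s in rs.items() if s["left"] == 0]
        if done:
            tag = done[0]
            s = rs.pop(tag)
            value = OPS[s["op"]](s["vj"], s["vk"])
            for w in rs.values():                 # waiting stations capture the value
                if w["qj"] == tag:
                    w["vj"], w["qj"] = value, None
                if w["qk"] == tag:
                    w["vk"], w["qk"] = value, None
            if status.get(s["dest"]) == tag:      # only the latest writer updates the register
                regs[s["dest"]] = value
                del status[s["dest"]]
            free.append(tag)                      # mark the station available
            finished.append((s["dest"], cycle))
        # 2. Execute: stations holding both operands count down their latency.
        for s in rs.values():
            if s["qj"] is None and s["qk"] is None and s["left"] > 0:
                s["left"] -= 1
        # 1. Issue: in program order, only if a station is free.
        if pc < len(program) and free:
            op, dest, a, b = program[pc]
            pc += 1
            tag = free.pop(0)
            rs[tag] = {"op": op, "dest": dest, "left": LATENCY[op],
                       "qj": status.get(a), "vj": None if a in status else regs[a],
                       "qk": status.get(b), "vk": None if b in status else regs[b]}
            status[dest] = tag
    return regs, finished

program = [("MULTD", "F0", "F2", "F4"),
           ("ADDD",  "F6", "F0", "F8"),   # RAW on F0: must wait for the MULTD
           ("ADDD",  "F8", "F2", "F2")]   # WAR on F8 against the previous ADDD
final, finished = simulate(program, {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F8": 8.0})
order = [dest for dest, _ in finished]
```

Under these assumptions the third instruction finishes first, while the second still computes with the old value of F8 that it copied at issue: in-order issue, out-of-order completion.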

Data Buses in Tomasulo's Algorithm
- Normal data bus: data + destination ("go to" bus)
- Common data bus: data + source ("come from" bus)
  - 64 bits of data + 4 bits of functional unit source address
  - Write if it matches the expected functional unit (the one producing the result)
  - Does the broadcast
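
A "come from" bus can be modeled as a message carrying the source tag plus the data, with every waiting operand comparing its Q field against that tag. The sketch below is hypothetical: `CDBMessage`, `broadcast`, and the tag numbers are invented for illustration, and tag 0 is reserved to mean "ready", following the Qj, Qk = 0 convention.

```python
from typing import NamedTuple

class CDBMessage(NamedTuple):
    source: int     # 4-bit tag of the unit that produced the result
    data: float     # stands in for the 64-bit result

def broadcast(msg, stations):
    """Deliver msg to every matching operand; return stations that became ready."""
    assert 0 < msg.source < 16          # 0 is reserved for "operand ready"
    released = []
    for name, s in stations.items():
        hit = False
        for q, v in (("qj", "vj"), ("qk", "vk")):
            if s[q] == msg.source:      # associative tag compare
                s[v], s[q] = msg.data, 0
                hit = True
        if hit and s["qj"] == 0 and s["qk"] == 0:
            released.append(name)
    return released

stations = {
    "Add1":  {"qj": 0, "vj": 1.5,  "qk": 3, "vk": None},
    "Mult1": {"qj": 3, "vj": None, "qk": 0, "vk": 2.0},
    "Add2":  {"qj": 3, "vj": None, "qk": 5, "vk": None},
}
released = broadcast(CDBMessage(source=3, data=10.0), stations)
```

Here one broadcast releases both `Add1` and `Mult1`, while `Add2` captures the value but keeps waiting on tag 5.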

Tomasulo Example

Tomasulo Example: Cycle 1

Tomasulo Example: Cycle 2

Tomasulo Example: Cycle 3
- Note: register names are removed ("renamed") in the reservation stations; MULT issued vs. scoreboard
- Load1 is complete! What is waiting for it?

Tomasulo Example: Cycle 4
- Load2 is complete! What is waiting for it?

Tomasulo Example: Cycle 5

Tomasulo Example: Cycle 6
- Issue ADD here vs. scoreboard?

Tomasulo Example: Cycle 7
- Add1 completing; what is waiting for it?

Tomasulo Example: Cycle 8

Tomasulo Example: Cycle 9

Tomasulo Example: Cycle 10
- Add2 completing; what is waiting for it?

Tomasulo Example: Cycle 11
- Write result of ADDD here vs. scoreboard?
- All quick instructions complete in this cycle!

Tomasulo Example: Cycle 12

Tomasulo Example: Cycle 13

Tomasulo Example: Cycle 14

Tomasulo Example: Cycle 15

Tomasulo Example: Cycle 16

Tomasulo Example: Cycle 55 (Way Later!)

Tomasulo Example: Cycle 56
- Mult2 is completing; what is waiting for it?

Tomasulo Example: Cycle 57
- Once again: in-order issue, out-of-order execution and completion.

Compare to Scoreboard: Cycle 62
- Why does it take longer on the scoreboard/6600?
  - Structural hazards
  - Lack of forwarding

Advantages of Tomasulo
- Distribution of the hazard detection logic, via distributed reservation stations and the CDB
  - If multiple instructions are waiting on a single result, and each already has its other operand, they can all be released simultaneously by the broadcast on the CDB
  - If a centralized register file were used, the units would have to read their results from the registers when register buses are available
- Elimination of stalls for WAW and WAR hazards

Tomasulo vs. Scoreboarding
- No explicit checking for WAW or WAR hazards
- Distributed RAW hazard detection
- Renaming eliminates WAW hazards
- Buffering values in reservation stations removes WAR hazards
- CDB broadcasts results rather than waiting on registers
- Loads/stores are treated like basic FUs
- Distributed vs. centralized control

Tomasulo Drawbacks
- Performance limited by the Common Data Bus
  - Tag match on the CDB requires many associative compares
  - Each CDB must go to multiple functional units => high capacitance, high wiring density
  - Number of functional units that can complete per cycle is limited to one!
  - Multiple CDBs => more FU logic for parallel associative stores
- Need load/store reordering
  - A load checks the A field of all active stores
  - A store checks the A field of earlier loads and stores
- Non-precise exceptions!
  - We will address these in later lectures
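
The load/store address checks can be sketched as a single predicate. This is one possible reading of the two rules, assuming effective addresses are already computed and that "earlier" means earlier in program order and still in flight; the function name `can_access_memory` and the tuple layout are hypothetical.

```python
def can_access_memory(kind, addr, earlier_ops):
    """Return True if a memory op may go ahead of the in-flight ops before it.

    earlier_ops: (kind, addr) pairs for loads/stores earlier in program order
    that have not completed yet; addr plays the role of the A field.
    """
    for other_kind, other_addr in earlier_ops:
        if other_addr != addr:
            continue                  # different address: no conflict
        if kind == "load" and other_kind == "store":
            return False              # RAW through memory
        if kind == "store":
            return False              # WAR (earlier load) or WAW (earlier store)
    return True

in_flight = [("store", 0x100), ("load", 0x200)]
load_same_as_store = can_access_memory("load", 0x100, in_flight)
load_same_as_load = can_access_memory("load", 0x200, in_flight)
store_same_as_load = can_access_memory("store", 0x200, in_flight)
store_elsewhere = can_access_memory("store", 0x300, in_flight)
```

In this model a load may pass an earlier load to the same address, since two reads do not conflict, but a store must wait for any earlier access to its address.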