Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

ILP vs. Parallel Computers
Dynamic Scheduling (Sections 3.4, 3.5)
Dynamic Branch Prediction (Section 3.3)
Hardware Speculation and Precise Interrupts (Section 3.6)
Multiple Issue (Section 3.7)
Static Techniques (Section 3.2, Appendix H)
Limitations of ILP
Multithreading (Section 3.11)
Putting it Together (Mini-projects)

ILP vs. Parallel Computers

Instruction-Level Parallelism (ILP)
  Instructions of a single process (or thread) executed in parallel
  Parallel components must appear to execute in sequential program order
Parallel Computers or Multiprocessors
  Program divided into multiple processes (or threads)
  Instructions of multiple threads executed in parallel
  Typically also involves ILP within each thread
  No a priori sequential order between parallel threads

Dynamic Scheduling - Basics

The situation:
  DIV.D  F0, F2, F4
  ADD.D  F10, F0, F8
  MULT.D F6, F6, F14

The problem:
  ADD stalls due to RAW hazard
  MULT stalls because ADD stalls

Example:
  Cycle    1   2   3   4   5   6   7    8
  DIV.D    IF  ID  E/  E/  E/  E/  MEM  WB
  ADD.D        IF  ID  **  **  **  E+   E+
  MULT.D           IF  **  **  **  ID   E*    (why stall?)

In-order execution limits performance

Dynamic Scheduling - Basics (Cont.)

Solutions
  Static Scheduling
  Dynamic Scheduling
Static Scheduling (Software)
  Compiler reorganizes instructions
  +
  +
  (Will see more later)
Dynamic Scheduling (Hardware)
  Hardware reorganizes instructions
  +
  +

Dynamic Scheduling - Basics (Cont.)

In-order execution - Static
  Instructions sent to execution units sequentially
  Stall instruction i + 1 if instruction i stalls for lack of operands
Out-of-order execution - Dynamic
  Send independent instructions to execution units as soon as possible

Dynamic Scheduling - Basics (Cont.)

Original simple pipeline
  ID: decode, check all hazards, read operands
  EX: execute
Dynamic pipeline
  Split ID ("issue to execution unit") into two parts
    Check for structural hazards
    Wait for data dependences
New organization (conceptual):
  Issue: decode, check structural hazards, read ready operands
  ReadOps: wait until data hazards clear, read operands, begin execution
Issue stays in-order; ReadOps/beginning of EX is out-of-order

Dynamic Scheduling - Basics (Cont.)

Dynamic scheduling can create WAW hazards, WAR hazards, and imprecise exceptions

WAW hazards with dynamic scheduling
  DIV.D F0, F2, F4
  ADD.D F10, F0, F8
  MUL.D F10, F8, F14
WAR hazards with dynamic scheduling
  DIV.D F0, F2, F4
  ADD.D F10, F0, F8
  MUL.D F8, F8, F14
Can always stall, but register renaming is a more aggressive solution
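
The hazard types in the two sequences above can be checked mechanically. The following is a minimal sketch in Python, not part of the original slides: the `Instr` tuple and the `hazards` helper are hypothetical names introduced only for illustration, and each instruction is reduced to one destination and a tuple of source registers.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    """Hypothetical instruction record: opcode, destination, source registers."""
    op: str
    dest: str
    srcs: tuple

def hazards(earlier, later):
    """Return the hazard types that `later` has with respect to `earlier`."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")   # later reads what earlier writes
    if later.dest == earlier.dest:
        found.add("WAW")   # both write the same register
    if later.dest in earlier.srcs:
        found.add("WAR")   # later overwrites a register earlier still reads
    return found

div = Instr("DIV.D", "F0", ("F2", "F4"))
add = Instr("ADD.D", "F10", ("F0", "F8"))
mul_waw = Instr("MUL.D", "F10", ("F8", "F14"))  # third instruction of the WAW example
mul_war = Instr("MUL.D", "F8", ("F8", "F14"))   # third instruction of the WAR example

print(hazards(div, add))      # {'RAW'}
print(hazards(add, mul_waw))  # {'WAW'}
print(hazards(add, mul_war))  # {'WAR'}
```

In both sequences the ADD.D waits on the long DIV.D through a RAW dependence, so an independent MUL.D that runs ahead can reach its write first (WAW case) or overwrite F8 before ADD.D has read it (WAR case).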

Register Renaming - Tomasulo's Algorithm

Registers are names for data values
  Think of register specifiers as tags, NOT storage locations
Tomasulo's algorithm exploited the above in the IBM 360/91
WAW hazards:
  DIV.D F0, F2, F4
  ADD.D F10, F0, F8
  MUL.D F10, F8, F14
WAR hazards:
  DIV.D F0, F2, F4
  ADD.D F10, F0, F8
  MUL.D F8, F8, F14
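
To make the "registers are tags" idea concrete, here is a simplified renaming sketch in Python. It is an illustration under stated assumptions rather than the 360/91 mechanism: the `rename` function and the fresh tags `T1`, `T2`, ... are hypothetical, and the supply of tags is unbounded, whereas Tomasulo's hardware uses reservation-station names as tags.

```python
from itertools import count

def rename(instrs):
    """Give every destination a fresh tag; sources use the latest tag of their register.

    instrs: list of (op, dest, src1, src2) tuples. Tags T1, T2, ... are hypothetical.
    """
    fresh = (f"T{n}" for n in count(1))
    latest = {}            # architectural register -> tag of its most recent producer
    renamed = []
    for op, dest, src1, src2 in instrs:
        s1 = latest.get(src1, src1)   # not yet renamed: read the register itself
        s2 = latest.get(src2, src2)
        tag = next(fresh)
        latest[dest] = tag
        renamed.append((op, tag, s1, s2))
    return renamed

waw = [("DIV.D", "F0", "F2", "F4"),
       ("ADD.D", "F10", "F0", "F8"),
       ("MUL.D", "F10", "F8", "F14")]
war = [("DIV.D", "F0", "F2", "F4"),
       ("ADD.D", "F10", "F0", "F8"),
       ("MUL.D", "F8", "F8", "F14")]

for instr in rename(waw):
    print(instr)
for instr in rename(war):
    print(instr)
```

After renaming, ADD.D and MUL.D write different tags (T2 and T3), so the WAW conflict on F10 disappears, and MUL.D writes T3 rather than F8, so it can no longer clobber the F8 value ADD.D reads. Only the true RAW dependence of ADD.D on DIV.D (through T1) remains.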

Some History - IBM 360/91

Fast 360 for scientific code
Completed in 1967
Predates cache memories
Pipelined, rather than multiple, functional units (FUs)
  We will assume multiple functional units
360 had register-memory instructions; we don't

Register Renaming - Tomasulo's Algorithm

Tomasulo's algorithm uses reservation stations for register renaming
Instruction is issued to a reservation station
  A pending operand is designated via a tag
  Tag = reservation station that will provide the operand
Reservation station with pending instruction fetches and buffers the operand when it becomes available
  All FUs place output on the common data bus (CDB) with tag
  Waiting reservation station gets the data from the CDB (register bypass)

Tomasulo's Algorithm - Implementation

Extend simple pipeline as example for Tomasulo's algorithm
Assume multiple FUs
Copyright 2019, Elsevier Inc. All rights Reserved.

Our Tomasulo Pipeline

3-stage execution (ignore IF and MEM)
Issue
  Get instruction from queue
  ALU op: check for available reservation station
  Load/Store: check for available load/store buffer
  If not, stall due to structural hazard
Execute
  If operands available, execute operation
  If not, monitor CDB for operand
Write
  If CDB available, write result on CDB
  If not, stall

Our Tomasulo Pipeline, cont.

Reservation Stations
  Handle distributed hazard detection and instruction control
  Everything, except store buffers, has a tag
    4-bit tag specifies reservation station or load buffer
    Specifies which FU will produce result
  Register specifier is used to assign tags
    THEN IT'S DISCARDED!
    Register specifiers are ONLY used in ISSUE

Our Tomasulo Pipeline, cont.

Reservation Stations
  Op       Opcode
  Qj, Qk   Tag fields
  Vj, Vk   Operand values
  Busy     Currently in use
Register File and Store Buffer
  Qi       Tag field
  Busy     Currently in use
Load and Store Buffers
  Busy     Currently in use
  A        Address
Latencies: FP+ = 2, FP* = 10, FP/ = 40, Load/int = 1
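
The reservation-station fields above map naturally onto a small record type. The sketch below, in Python, shows how a station could capture a CDB broadcast by comparing the broadcast tag with its Qj/Qk fields. The class, the `snoop` and `broadcast` helpers, the operand values, and the use of `None` to mean "no pending tag" are all assumptions made for illustration, not details from the slides.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, valid once the matching Q is None
    vk: Optional[float] = None
    qj: Optional[str] = None     # tag of the producer; None means no pending tag (assumed convention)
    qk: Optional[str] = None

    def ready(self):
        """An instruction may begin execution once both operands are present."""
        return self.busy and self.qj is None and self.qk is None

    def snoop(self, tag, value):
        """Capture a CDB broadcast if its tag matches a pending operand."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None

def broadcast(stations, tag, value):
    """Deliver one (tag, value) result to every busy station, as the CDB would."""
    for rs in stations:
        if rs.busy:
            rs.snoop(tag, value)

# MULT.D F0,F2,F4 waits for F2 from Load2; F4 is already available (value 3.0 is made up).
mult1 = ReservationStation("Mult1", busy=True, op="MULT.D", vk=3.0, qj="Load2")
# SUB.D F8,F6,F2 waits on both loads.
add1 = ReservationStation("Add1", busy=True, op="SUB.D", qj="Load1", qk="Load2")

broadcast([mult1, add1], "Load2", 2.5)
print(mult1.ready(), add1.ready())  # True False
```

A single broadcast of Load2's result satisfies both waiting stations at once, which is the "register bypass" role of the CDB: Mult1 becomes ready, while Add1 still waits for Load1.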

Tomasulo Example

Example code
  L.D    F6,34(R2)
  L.D    F2,45(R3)
  MULT.D F0,F2,F4
  SUB.D  F8,F6,F2
  DIV.D  F10,F0,F6
  ADD.D  F6,F8,F2

Tomasulo Example

Instruction Status (for illustration ONLY)
  Instruction          Issue  Execute  Write
  L.D    F6,34(R2)
  L.D    F2,45(R3)
  MULT.D F0,F2,F4
  SUB.D  F8,F6,F2
  DIV.D  F10,F0,F6
  ADD.D  F6,F8,F2

  FU  Name   Busy  Op  Vj  Vk  Qj  Qk
  1   Add1
  2   Add2
  3   Add3
  4   Mult1
  5   Mult2

Register Result Status
         F0  F2  F4  F6  F8  F10  F12  ...  F30
  Qi
  Busy
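
The tables above can be filled in for the issue stage alone. The sketch below, in Python, issues the six example instructions in order and records which tags each station waits on, assuming that no result has been written back yet. The station names Add1-Add3 and Mult1-Mult2 come from the table; the two load buffers `Load1` and `Load2`, the `issue_all` function, and the use of an exception to signal a structural stall are assumptions for illustration.

```python
# Which pool of stations each opcode uses (Load1/Load2 are assumed buffer names).
KIND = {"L.D": "load", "ADD.D": "add", "SUB.D": "add",
        "MULT.D": "mult", "DIV.D": "mult"}

def issue_all(program):
    """Issue instructions in order; return station contents and register status (Qi)."""
    free = {"load": ["Load1", "Load2"],
            "add": ["Add1", "Add2", "Add3"],
            "mult": ["Mult1", "Mult2"]}
    reg_status = {}   # register -> tag of the station that will write it
    stations = {}     # station name -> (op, (Qj, Qk)); None means operand is ready
    for op, dest, srcs in program:
        pool = free[KIND[op]]
        if not pool:
            # Real hardware would stall issue here; the sketch just reports it.
            raise RuntimeError(f"structural hazard: no free station for {op}")
        name = pool.pop(0)
        q_tags = tuple(reg_status.get(r) for r in srcs)
        stations[name] = (op, q_tags)
        reg_status[dest] = name   # the register specifier is replaced by a tag
    return stations, reg_status

program = [
    ("L.D", "F6", ("R2",)),
    ("L.D", "F2", ("R3",)),
    ("MULT.D", "F0", ("F2", "F4")),
    ("SUB.D", "F8", ("F6", "F2")),
    ("DIV.D", "F10", ("F0", "F6")),
    ("ADD.D", "F6", ("F8", "F2")),
]
stations, reg_status = issue_all(program)
for name, (op, q_tags) in stations.items():
    print(name, op, q_tags)
print(reg_status)
```

Under these assumptions DIV.D lands in Mult2 waiting on Mult1 and Load1, and F6 ends up mapped to Add2 even though DIV.D still reads the older F6 value from Load1. That is the WAR hazard between DIV.D and ADD.D being removed by renaming.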


Tomasulo, cont.

Out-of-order loads and stores?
CDB is a bottleneck
  Could duplicate
  Increases the required hardware
Complex implementation

Tomasulo, cont.

Advantages
  Distribution of hazard detection
  Elimination of WAR and WAW stalls
Common Data Bus
  + Broadcasts results to multiple instructions, bypasses registers
  - Central bottleneck
    Could duplicate (increases required hardware)
Register Renaming
  + Eliminates WAR and WAW hazards
  + Allows dynamic loop unrolling
    Especially important with only 4 registers
  - Requires many associative lookups
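
The CDB bottleneck can be illustrated with a small arbitration sketch in Python. It assigns each finished result the earliest cycle in which a bus is free; the `cdb_schedule` function, the oldest-first tie-breaking, and the example finish cycles are assumptions for illustration, not details from the slides.

```python
def cdb_schedule(finish_cycles, num_buses=1):
    """Return {tag: write cycle} given (tag, finish cycle) pairs.

    A result writes in the first cycle at or after its finish cycle that still
    has a free bus. Ties go to the earlier list entry (an assumed policy).
    """
    used = {}       # cycle -> number of buses already claimed
    schedule = {}
    for tag, cycle in sorted(finish_cycles, key=lambda pair: pair[1]):
        while used.get(cycle, 0) >= num_buses:
            cycle += 1          # bus busy: the result stalls in the Write stage
        used[cycle] = used.get(cycle, 0) + 1
        schedule[tag] = cycle
    return schedule

# Three functional units finish in the same cycle (made-up numbers).
results = [("Add1", 5), ("Mult1", 5), ("Load1", 5)]

print(cdb_schedule(results))               # {'Add1': 5, 'Mult1': 6, 'Load1': 7}
print(cdb_schedule(results, num_buses=2))  # {'Add1': 5, 'Mult1': 5, 'Load1': 6}
```

With one bus the three results serialize over three cycles; a second bus shortens the wait, at the cost of every reservation station comparing its tags against two broadcasts per cycle.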

Loops with Tomasulo's Algorithm

Consider the following example:
FORTRAN:
  DO I = 1, N
    C[I] = A[I] + s * B[I]
ASSEMBLY:
  L.D   F0, A(R1)
  L.D   F2, B(R1)
  MUL.D F2, F2, F4    /* s in F4 */
  ADD.D F2, F2, F0
  S.D   C(R1), F2
  Branch code
What would Tomasulo's algorithm do?
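
One way to explore the question is to issue two iterations of the loop body back to back and look at the tags each instruction receives. The Python sketch below does this under simplifying assumptions: the `unroll` function and the unbounded tags `RS1`, `RS2`, ... are hypothetical (real hardware has a finite set of stations and buffers), the branch is assumed to be resolved or predicted taken, and the address arithmetic on R1 is omitted.

```python
from itertools import count

# Loop body from the slide: (op, destination or None, FP source registers).
body = [
    ("L.D", "F0", ()),
    ("L.D", "F2", ()),
    ("MUL.D", "F2", ("F2", "F4")),
    ("ADD.D", "F2", ("F2", "F0")),
    ("S.D", None, ("F2",)),
]

def unroll(body, iterations):
    """Issue `iterations` copies of the body, renaming each destination to a fresh tag."""
    tags = count(1)
    latest = {}    # register -> tag of its most recent producer
    trace = []
    for it in range(iterations):
        for op, dest, srcs in body:
            operands = tuple(latest.get(r, r) for r in srcs)
            tag = None
            if dest is not None:
                tag = f"RS{next(tags)}"
                latest[dest] = tag
            trace.append((it, op, tag, operands))
    return trace

for entry in unroll(body, 2):
    print(entry)
```

In this sketch the second iteration's instructions depend only on tags created in that iteration (plus F4, which holds s), even though both iterations name F0 and F2. The iterations can therefore overlap in the functional units, which is the dynamic loop unrolling effect attributed to register renaming.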