Outline. 1 Reiteration. 2 Dynamic scheduling - Tomasulo. 3 Superscalar, VLIW. 4 Speculation. 5 ILP limitations. 6 What we have done so far.


Lecture 5: EIT090 Computer Architecture
Anders Ardö, EIT Electrical and Information Technology, Lund University
Sept. 30, 2009

Outline
1 Reiteration
2 Dynamic scheduling - Tomasulo
3 Superscalar, VLIW
4 Speculation
5 ILP limitations
6 What we have done so far

Instruction Level Parallelism - ILP
- ILP: overlap execution of unrelated instructions: pipelining
- Two main approaches:
  - DYNAMIC = hardware detects parallelism
  - STATIC = software detects parallelism
- Often a mix of both.
- Pipeline CPI = Ideal CPI + Structural stalls + Data hazard stalls + Control stalls

Why loop unrolling works
- Longer sequences of straight-line code without branches (longer basic blocks) allow easier static rescheduling by the compiler.
- Longer basic blocks also facilitate dynamic rescheduling such as Scoreboard and Tomasulo's algorithm.
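The Pipeline CPI relation above is a plain sum, which a few lines of Python make concrete. This is only an illustrative sketch: the function name and the stall figures are made up for the example and are not taken from the lecture.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipeline CPI = ideal CPI plus the average stall cycles per instruction."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls


# Hypothetical stall figures (cycles per instruction), chosen only for illustration.
cpi = pipeline_cpi(ideal_cpi=1.0,
                   structural_stalls=0.25,
                   data_hazard_stalls=0.5,
                   control_stalls=0.25)
print(cpi)              # 2.0
print(cpi / 1.0)        # slowdown relative to the ideal CPI of 1.0: 2.0
```

Removing any one stall term (for example by rescheduling to hide data hazards) lowers the sum directly, which is why each technique in this lecture targets a specific term.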

Dynamic Branch Prediction
- Branches limit performance because of:
  - Branch penalties
  - Limits to the available Instruction Level Parallelism
- Solution: dynamic branch prediction to predict the outcome of conditional branches.
- Benefits:
  - Reduce the time until the branch condition is known
  - Reduce the time to calculate the branch target address

Dependencies
- Two instructions must be independent in order to execute in parallel.
- There are three general types of dependencies that limit parallelism:
  - Data dependencies
  - Name dependencies
  - Control dependencies
- Dependencies are properties of the program.
- Whether a dependency leads to a hazard or not is a property of the pipeline implementation.

Scoreboard pipeline
- The goal of scoreboarding is to maintain an execution rate of one instruction per clock cycle by executing an instruction as early as possible.
- Instructions execute out of order when there are sufficient resources and no data dependencies.
- A scoreboard is a hardware unit that keeps track of the instructions that are in the process of being executed, the functional units that are doing the executing, and the registers that will hold the results of those units.
- A scoreboard centrally performs all hazard detection and resolution and thus controls the instruction progression from one step to the next.

Summary
- ILP: rescheduling and loop unrolling are important to take advantage of potential Instruction Level Parallelism.
- Dynamic instruction scheduling:
  - An alternative to compile-time scheduling
  - Does not need recompilation to increase performance
  - Used in most new processor implementations
- Dynamic branch prediction reduces branch penalties by early prediction of conditional branch outcomes.
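To make the data and name dependency types listed under Dependencies concrete, here is a minimal Python sketch that classifies the dependencies between two instructions from the registers they read and write. The `Instr` type and `classify` function are hypothetical helpers, and control dependencies are not modelled. The three instructions are the ones this lecture uses later in its WAR example.

```python
from typing import List, NamedTuple


class Instr(NamedTuple):
    text: str
    writes: frozenset   # registers written
    reads: frozenset    # registers read


def classify(first: Instr, second: Instr) -> List[str]:
    """Dependencies of `second` on `first`, where `first` is earlier in program order."""
    found = []
    if first.writes & second.reads:
        found.append("data (RAW)")          # true data dependency
    if first.reads & second.writes:
        found.append("name: anti (WAR)")    # second overwrites what first still reads
    if first.writes & second.writes:
        found.append("name: output (WAW)")  # both write the same register
    return found


ld = Instr("LD F6, 34(R2)", frozenset({"F6"}), frozenset({"R2"}))
divd = Instr("DIVD F10,F0,F6", frozenset({"F10"}), frozenset({"F0", "F6"}))
addd = Instr("ADDD F6,F8,F2", frozenset({"F6"}), frozenset({"F8", "F2"}))

print(classify(ld, divd))    # ['data (RAW)']
print(classify(divd, addd))  # ['name: anti (WAR)']
print(classify(ld, addd))    # ['name: output (WAW)']
```

The classification depends only on the program text, which matches the point that dependencies are properties of the program; whether any of them stalls the pipeline depends on the implementation.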

Lecture 5 agenda
Chapters 2.4-2.8, 3.1-3.4 in "Computer Architecture"
1 Reiteration
2 Dynamic scheduling - Tomasulo
3 Superscalar, VLIW
4 Speculation
5 ILP limitations
6 What we have done so far

Scoreboard pipeline
- Issue: decode and check for structural hazards
- Read operands: wait until no data hazards, then read operands
- All data hazards are handled by the scoreboard

Limitations with Scoreboard
- The number of scoreboard entries (window size)
- The number and types of functional units
- The number of datapaths to registers
- The presence of name dependencies
Tomasulo's algorithm addresses the last two limitations.

Tomasulo's Algorithm
- Another dynamic instruction scheduling algorithm
- For IBM 360/91, a few years after the CDC 6600 (Scoreboard)
- Goal: high performance without compiler support
- Differences between Tomasulo & Scoreboard:
  - Control & buffers distributed with FUs (called reservation stations) vs. centralized in Scoreboard
  - Register names in instructions replaced by pointers to reservation station buffers (HW register renaming)
  - Common Data Bus broadcasts results to all FUs
  - Loads and stores treated as FUs as well

[Figure: Tomasulo Organization]

Three Stages of Tomasulo Alg.
1. Issue: get instruction from FP Op Queue
   If a reservation station is free (no structural hazard), the instruction is issued together with its operands (renames registers).
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the Common Data Bus (CDB) for operands (snooping).
3. Write result: finish execution (WB)
   Write on the CDB to all awaiting functional units; mark the reservation station available.
Normal bus: data + destination
Common Data Bus: data + source (snooping)

[Figure: Tomasulo example, cycle 0]
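The three stages (issue, execution, write result) can be sketched in Python as follows. This is a simplified illustration, not the IBM 360/91 design: timing, functional-unit latencies, loads and stores are left out, and every name here (`Station`, `Tomasulo`, `OPS`, the `vj`/`qj` style fields, station tags `RS1`/`RS2`) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

OPS = {"ADD": lambda a, b: a + b, "MUL": lambda a, b: a * b}


@dataclass
class Station:
    name: str
    busy: bool = False
    op: str = ""
    dest: str = ""
    vj: Optional[float] = None  # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # tag of the station that will produce the operand
    qk: Optional[str] = None


class Tomasulo:
    def __init__(self, regs: Dict[str, float], station_names: List[str]):
        self.regs = dict(regs)
        self.producer: Dict[str, str] = {}  # register -> station tag (renaming)
        self.stations = {n: Station(n) for n in station_names}

    def _operand(self, reg):
        # Pending result: capture the producer's tag; otherwise read the value now.
        if reg in self.producer:
            return None, self.producer[reg]
        return self.regs[reg], None

    def issue(self, op, dest, src1, src2):
        free = next((s for s in self.stations.values() if not s.busy), None)
        if free is None:
            return None  # structural hazard: no free reservation station
        free.busy, free.op, free.dest = True, op, dest
        free.vj, free.qj = self._operand(src1)
        free.vk, free.qk = self._operand(src2)
        self.producer[dest] = free.name  # later readers of dest wait on this tag
        return free.name

    def ready(self, name):
        s = self.stations[name]
        return s.busy and s.qj is None and s.qk is None

    def execute_and_write(self, name):
        s = self.stations[name]
        if not self.ready(name):
            raise RuntimeError(f"{name} is still waiting for operands")
        result = OPS[s.op](s.vj, s.vk)
        # CDB broadcast: data + source tag; waiting stations snoop for their tag.
        for other in self.stations.values():
            if other.qj == name:
                other.vj, other.qj = result, None
            if other.qk == name:
                other.vk, other.qk = result, None
        if self.producer.get(s.dest) == name:  # still the latest writer of dest
            self.regs[s.dest] = result
            del self.producer[s.dest]
        s.busy = False  # mark reservation station available
        return result


cpu = Tomasulo({"F0": 0.0, "F2": 3.0, "F4": 4.0, "F6": 0.0, "F8": 2.0}, ["RS1", "RS2"])
print(cpu.issue("ADD", "F0", "F2", "F4"))  # RS1
print(cpu.issue("MUL", "F6", "F0", "F8"))  # RS2 (waits on tag RS1 for F0)
print(cpu.issue("ADD", "F8", "F2", "F4"))  # None: both stations are busy
print(cpu.execute_and_write("RS1"))        # 7.0, broadcast to RS2
print(cpu.execute_and_write("RS2"))        # 14.0
```

Note that the broadcast carries the source tag rather than a destination register, which is the "data + source" property of the Common Data Bus.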

[Figures: Tomasulo example, cycles 1, 2, 3 and 4]

[Figures: Tomasulo example, cycles 5, 6, 7 and 8]

[Figure: Tomasulo example, cycle 10]

Elimination of WAR hazards
Example:
    LD   F6, 34(R2)
    ......
    DIVD F10,F0,F6
    ADDD F6,F8,F2
ADDD can safely finish before DIVD has read register F6 because:
- DIVD has renamed register F6 to point at the reservation station
- LD broadcasts its result on the Common Data Bus
Register renaming can thus be done:
- statically by the compiler
- dynamically by the hardware

[Figures: Tomasulo example, cycles 11 and 15]
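The renaming argument in the WAR example can be traced with a small Python sketch. It models only the rename table at issue time. The station tags `Load1`, `Mult1` and `Add1` and the `issue` helper are hypothetical, and the instructions elided by "......" in the example are simply skipped, so F0 appears here as a plain register.

```python
def issue(rename, station, dest, sources):
    """Issue one instruction: capture its operands, then rename its destination.

    For each source, the instruction records the tag of the reservation station
    that will produce it, or the register name itself if no result is pending
    (standing in for reading the value at issue time).
    """
    operands = [rename.get(src, src) for src in sources]
    rename[dest] = station  # later readers of dest now wait on this station
    return operands


rename = {}
issue(rename, "Load1", "F6", ["R2"])                    # LD   F6, 34(R2)
divd_ops = issue(rename, "Mult1", "F10", ["F0", "F6"])  # DIVD F10,F0,F6
addd_ops = issue(rename, "Add1", "F6", ["F8", "F2"])    # ADDD F6,F8,F2

print(divd_ops)      # ['F0', 'Load1']: DIVD waits on the load, not on register F6
print(rename["F6"])  # Add1: F6 now names ADDD's result
```

Because DIVD holds the tag `Load1` instead of the name F6, ADDD may write F6 at any time without changing what DIVD will read.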

[Figures: Tomasulo example, cycles 16, 56 and 57]

Benefits Tomasulo
- Distributed hazard detection logic
- Distributed reservation stations
- Common Data Bus (CDB) with snooping
- Elimination of WAR and WAW hazards (renaming registers)

Dynamic scheduling - summary
- Tolerates unpredictable delays
- Compile for one pipeline, run effectively on another
- Significant increase in HW complexity:
  - out-of-order execution, completion
  - register renaming

Getting CPI < 1!
Issuing multiple instructions per clock cycle
- Superscalar: varying number of instructions/cycle (1-8), scheduled by compiler or HW
  - IBM Power5, Pentium 4, Sun SuperSparc, DEC Alpha
  - Simple hardware, complicated compiler, or...
  - Very complex hardware but simple for the compiler
- Very Long Instruction Word (VLIW): fixed number of instructions (3-5), scheduled by the compiler
  - HP/Intel IA-64, Itanium
  - Simple hardware, difficult for the compiler
  - High performance through extensive compiler optimization

Approaches for multiple issue

Approach     Issue    Hazard detection  Scheduling     Characteristics / examples
Superscalar  dynamic  HW                static         in-order execution; ARM
Superscalar  dynamic  HW                dynamic        out-of-order execution
Superscalar  dynamic  HW                dynamic        speculation; Pentium 4, IBM Power5
VLIW         static   compiler          static         TI C6x
EPIC         static   compiler          mostly static  Itanium

Very Long Instruction Word (VLIW)
- A number of functional units that independently execute instructions in parallel.
- The compiler decides which instructions can execute in parallel.
- No hazard detection needed.

[Figure: Itanium instruction format]

[Figure: Itanium architecture]

Limits of VLIW
- Limited Instruction Level Parallelism
  - With n functional units and k pipeline stages we need n x k independent instructions to utilize the hardware.
- Memory and register bandwidth
  - With an increasing number of functional units, the number of ports needed at the memory or register file must increase to prevent structural hazards.
- Code size
  - Compiler-scheduled pipeline bubbles take up space in the instruction.
  - More aggressive loop unrolling is needed to work well, which also increases code size.
- No binary code compatibility
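Two of the VLIW limits above, the n x k requirement and the code-size cost of compiler-scheduled bubbles, can be illustrated with a short Python sketch. The greedy packer below is a stand-in for a real VLIW compiler scheduler, not a description of one: it assumes unit latency, considers only true data dependencies, and uses hypothetical names throughout.

```python
def independent_instructions_needed(n_units, k_stages):
    """Independent instructions required to keep n units with k stages each busy."""
    return n_units * k_stages


def pack_bundles(instrs, width):
    """Greedily pack instructions into fixed-width bundles, padding with nops.

    instrs: list of (name, [names it depends on]) in program order.
    An instruction goes into the earliest bundle that comes after all of its
    producers and still has a free slot (unit latency assumed).
    """
    bundles = []  # each bundle is a list of instruction names
    placed = {}   # instruction name -> index of its bundle
    for name, deps in instrs:
        i = max((placed[d] + 1 for d in deps), default=0)
        while i < len(bundles) and len(bundles[i]) >= width:
            i += 1
        while len(bundles) <= i:
            bundles.append([])
        bundles[i].append(name)
        placed[name] = i
    # Empty slots are explicit nops: they occupy space in the instruction word.
    return [b + ["nop"] * (width - len(b)) for b in bundles]


print(independent_instructions_needed(4, 5))  # 20

# A dependent chain a -> b -> c plus one independent instruction d.
bundles = pack_bundles([("a", []), ("b", ["a"]), ("c", ["b"]), ("d", [])], width=3)
for bundle in bundles:
    print(bundle)
# ['a', 'd', 'nop']
# ['b', 'nop', 'nop']
# ['c', 'nop', 'nop']
```

In this made-up example five of the nine slots are nops, which shows how limited ILP translates directly into wasted instruction space.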

HW supported speculation
A combination of three main ideas:
- Dynamic instruction scheduling: take advantage of ILP
- Dynamic branch prediction: allows instruction scheduling across branches
- Speculative execution: execute instructions before all control dependencies are resolved
Hardware-based speculation uses data-flow execution: instructions execute when their operands are available.

HW vs. SW speculation
Advantages:
- Dynamic runtime disambiguation of memory addresses
- Dynamic branch prediction is often better than static prediction, which limits the performance of SW speculation
- HW speculation can maintain a precise exception model
- Can achieve higher performance on older code (without recompilation)
Main disadvantage:
- Extremely complex implementation and extensive need for hardware resources

[Figure: Tomasulo extended to handle speculation]

Re-order buffer - ROB
Data structure:

entry  instruction type  destination  value  ready
1
2
...
n

- supports speculative execution
- instructions commit in order
- precise exceptions

Four steps of Speculative Tomasulo
1. Issue: get instruction from FP Op Queue
   If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination.
2. Execution: operate on operands (EX)
   If both operands are ready, execute; if not, watch the CDB for the result; when both operands are in the reservation station, execute.
3. Write result: complete execution
   Write on the Common Data Bus to all awaiting FUs and the reorder buffer; mark the reservation station available.
4. Commit: update register with reorder result
   When the instruction is at the head of the reorder buffer and the result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer (handle misspeculations and precise exceptions).

Misspeculation!
- Commit with wrong branch prediction:
  When a branch instruction is at the head of the reorder buffer and the prediction was incorrect, remove all instructions from the reorder buffer (flush) and restart execution at the correct instruction.
- Expensive, so try to recover as early as possible
- Performance is sensitive to the branch prediction/speculation mechanism

Multiple issue and speculation
- Possible to extend Tomasulo with both multiple issue and speculation.
- Major issues: instruction issue and monitoring the CDB
- Must be able to handle multiple commits
- An alternative to Tomasulo is to use extra physical registers for both architecturally visible registers and temporary values, with register renaming
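The reorder buffer behaviour described above (results arrive out of order, commit happens in order from the head, and a mispredicted branch at the head flushes the buffer) can be sketched in Python. The class and field names are hypothetical, mirroring the entry fields listed in the ROB data structure, and reservation stations and the CDB are left out.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    kind: str                     # instruction type, e.g. "alu" or "branch"
    dest: Optional[str] = None    # destination register, if any
    value: Optional[float] = None
    ready: bool = False
    mispredicted: bool = False    # only meaningful for branches


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()    # head of the buffer is entries[0]

    def issue(self, kind, dest=None):
        if len(self.entries) == self.size:
            return None           # no free slot: issue must stall
        entry = RobEntry(kind, dest)
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        entry.value, entry.ready, entry.mispredicted = value, True, mispredicted

    def commit(self, regs):
        """Commit ready entries from the head, in program order."""
        committed = []
        while self.entries and self.entries[0].ready:
            entry = self.entries.popleft()
            committed.append(entry)
            if entry.kind == "branch" and entry.mispredicted:
                self.entries.clear()   # flush all younger, speculative work
                break
            if entry.dest is not None:
                regs[entry.dest] = entry.value
        return committed


regs = {"F0": 0.0, "F2": 0.0}
rob = ReorderBuffer(size=4)
first = rob.issue("alu", "F0")
branch = rob.issue("branch")
younger = rob.issue("alu", "F2")      # speculative: issued past the branch

rob.write_result(younger, 9.0)        # finishes first, out of order
print(len(rob.commit(regs)), regs)    # 0 {'F0': 0.0, 'F2': 0.0}: head not ready

rob.write_result(first, 5.0)
print(len(rob.commit(regs)), regs)    # 1 {'F0': 5.0, 'F2': 0.0}

rob.write_result(branch, mispredicted=True)
print(len(rob.commit(regs)), regs)    # 1 {'F0': 5.0, 'F2': 0.0}: F2 never written
print(len(rob.entries))               # 0: buffer flushed
```

The speculative result 9.0 never reaches the register file, which is what keeps exceptions precise: architectural state only changes at commit.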

[Figure: Tomasulo speculation - increased complexity]

Dynamic scheduling, speculation - summary
- Tolerates unpredictable delays
- Compile for one pipeline, run effectively on another
- Allows speculation:
  - multiple branches
  - in-order commit
  - precise exceptions
  - time, energy; recovery
- Significant increase in HW complexity:
  - out-of-order execution, completion
  - register renaming

ILP
How much performance can we get by utilizing ILP?

A model of an ideal processor
Provides a base for ILP measurements
- No structural hazards
- Register renaming: infinite virtual registers and all WAW & WAR hazards avoided
- Machine with perfect speculation
- Branch prediction: perfect; no mispredictions
- Jump prediction: all jumps perfectly predicted
- Memory-address alias analysis: addresses are known & a store can be moved before a load provided the addresses are not equal
- Perfect caches
There are only true data dependencies left!

[Figure: Upper Limit to ILP]

[Figure: Impact window size]

[Figure: More realistic HW: Branch impact]
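Under the ideal-processor model above, only true data dependencies constrain execution, so the available ILP of an instruction trace is its length divided by the length of its longest dependency chain. The Python sketch below computes that bound. It assumes unit latency for every instruction and an unbounded window, and the function name and the trace are invented for the example.

```python
def ideal_ilp(instrs):
    """Upper bound on ILP when only true (RAW) dependencies remain.

    instrs: list of (dest, [sources]) in program order.
    Each instruction executes one cycle after its latest producer; WAR and WAW
    are ignored, as if registers were perfectly renamed.
    """
    if not instrs:
        return 0.0
    produced_in = {}  # register -> cycle in which its current value is produced
    depth = 0         # length of the longest dependency chain, in cycles
    for dest, sources in instrs:
        cycle = 1 + max((produced_in.get(s, 0) for s in sources), default=0)
        produced_in[dest] = cycle
        depth = max(depth, cycle)
    return len(instrs) / depth


trace = [
    ("r1", ["r0"]),
    ("r2", ["r1"]),   # RAW on r1
    ("r3", ["r0"]),
    ("r4", ["r3"]),   # RAW on r3
    ("r1", ["r0"]),   # WAW with the first instruction: ignored under renaming
    ("r5", ["r1"]),   # RAW on the new r1
]
print(ideal_ilp(trace))  # 3.0: six instructions, longest chain is two cycles
```

A fully serial chain gives an ILP of 1.0 regardless of how much hardware is available, which is why true data dependencies set the upper limit.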

[Figure: More realistic HW: Register impact]

Summary
- Software (compiler) tricks:
  - Loop unrolling
  - Static instruction scheduling (with register renaming)
  - ... and more
- Hardware tricks:
  - Dynamic instruction scheduling
  - Dynamic branch prediction
  - Multiple issue: Superscalar, VLIW
  - Speculative execution
  - ... and more

[Figure: AMD Phenom CPU]

[Figure: Intel Core2]

[Figure: Intel Core2 chip (Nehalem)]