Instruction Level Parallelism

Instruction Level Parallelism: Pipelining, Hazards (Appendix C, HPe)

Outline
  Pipelining, Hazards
  Branch prediction
  Static and Dynamic Scheduling
  Speculation
  Compiler techniques, VLIW
  Limits of ILP

Pipelining Basics

Implementation of RISC ISA: Stages
  Instruction Fetch (IF)
  Instruction Decode/Register Fetch (ID): fixed-field decoding
  Execution/Effective Address (EX)
  Memory Access (MEM)
  Write Back (WB)

[Figure: MIPS datapath, divided into Instruction Fetch (IF), Instruction Decode/Register Fetch (ID), Execute/Address Calculation (EX), Memory Access (MEM), and Write Back (WB)]

[Figure: Multiple Issue Integer Pipeline, with stages IF, ID, EX, MEM, WB]

Pipeline Performance

An unpipelined processor has a 1 ns clock cycle. ALU operations and branches take 4 cycles, and memory operations take 5 cycles. The relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, because of clock skew and setup, pipelining adds 0.2 ns of overhead to the clock. What is the speedup?

$$\text{Average instruction execution time} = \text{Clock cycle} \times \text{Average CPI}$$

$$\text{CPI} = \sum_{i=1}^{n} \frac{IC_i}{\text{Instruction count}} \times CPI_i$$
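The question above can be worked through numerically. The following is a minimal sketch, assuming (as the slide does not state explicitly) that the pipelined processor reaches an ideal CPI of 1 and that its clock cycle is the unpipelined 1 ns cycle plus the 0.2 ns overhead. The function and variable names are illustrative only.

```python
def average_cpi(mix):
    """Weighted CPI for a list of (instruction fraction, cycles) pairs."""
    return sum(fraction * cycles for fraction, cycles in mix)


# Instruction mix from the problem statement: ALU, branch, memory.
mix = [(0.40, 4), (0.20, 4), (0.40, 5)]

unpipelined_clock_ns = 1.0
pipeline_overhead_ns = 0.2

# Unpipelined: average instruction time = clock cycle * average CPI.
unpipelined_time_ns = unpipelined_clock_ns * average_cpi(mix)

# Pipelined (assumption): CPI = 1, clock = 1 ns + 0.2 ns overhead.
pipelined_time_ns = (unpipelined_clock_ns + pipeline_overhead_ns) * 1.0

speedup = unpipelined_time_ns / pipelined_time_ns

print(f"Average CPI (unpipelined): {average_cpi(mix):.2f}")
print(f"Unpipelined time: {unpipelined_time_ns:.2f} ns")
print(f"Pipelined time:   {pipelined_time_ns:.2f} ns")
print(f"Speedup:          {speedup:.2f}")
```

Under these assumptions the average CPI is 4.4, so the unpipelined instruction time is 4.4 ns against 1.2 ns pipelined, a speedup of roughly 3.7.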

Dependences and Pipeline Hazards: Structural & Data

Outline
  Data dependences
  Name dependences
  Structural hazards
  Data hazards
  Stalling, Forwarding

Basic Block

A straight-line code sequence with no branches in except to the entry and no branches out except at the exit.

  Loop: L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        DADDUI R1, R1, #-8
        BNE    R1, R2, Loop

Dependence

  for (i=0; i<=999; i=i+1)
      x[i] = x[i] + a;

  Loop: L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        DADDUI R1, R1, #-8
        BNE    R1, R2, Loop

Data dependence
Name dependence: antidependence, output dependence
  Register renaming

Hazard

  ADD.D F4, F0, F2
  ADD.D F4, F6, F8

Overlap during execution could change the order of access to the operand involved in the dependence.

Hazards

Program order: ILP preserves program order only where it affects the outcome of the program.

Structural hazards: resource conflicts.
Data hazards: RAW, WAW, WAR.
Control hazards: whether or not an instruction should be executed depends on a control decision made by an earlier instruction.
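The three data hazard classes can be told apart by comparing the registers an earlier and a later instruction read and write. The sketch below is illustrative only: the `Instr` tuple and the `classify` helper are hypothetical names, and the check reports potential hazards between a pair of instructions. Whether one actually occurs depends on the pipeline.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical minimal instruction record: one destination, some sources."""
    text: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def classify(first: Instr, second: Instr) -> set:
    """Return the potential data hazards when `second` follows `first`."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")  # second reads what first writes (true data dependence)
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")  # second writes what first reads (antidependence)
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")  # both write the same register (output dependence)
    return found


load = Instr("L.D F0, 0(R1)", "F0", ("R1",))
add1 = Instr("ADD.D F4, F0, F2", "F4", ("F0", "F2"))
add2 = Instr("ADD.D F4, F6, F8", "F4", ("F6", "F8"))
store = Instr("S.D F4, 0(R1)", None, ("F4", "R1"))
dec = Instr("DADDUI R1, R1, #-8", "R1", ("R1",))

print(classify(load, add1))   # RAW on F0
print(classify(add1, add2))   # WAW on F4
print(classify(store, dec))   # WAR on R1
```

The WAR and WAW cases are name dependences, which is why register renaming removes them while the RAW case remains.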

Structural Hazard

[Figure: pipeline diagram over clock cycles 1 to 9 for instructions i1 to i5. With a unified memory, an instruction fetch and an earlier instruction's MEM access need the memory in the same cycle: HAZARD.]

Examples: unified memory; register file (WB).

Cost of a Load Structural Hazard

Data references constitute 40% of the instruction mix. Ideal CPI = 1 (with no structural hazards). Assume that the processor with the structural hazard has a clock rate that is 1.1 times higher than the clock rate of the processor without the hazard. Which processor is faster, and by how much?

$$\text{Avg. instruction time} = \text{CPI} \times \text{Clock cycle time}$$

$$\text{Avg. instruction time}_{\text{ideal}} = \text{CPI} \times \text{Clock cycle time}_{\text{ideal}}$$

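Cost of a Load Structural Hazard (continued)

$$\text{Avg. instruction time} = \text{CPI} \times \text{Clock cycle time}$$

$$\text{Avg. instruction time} = (1 + 0.4 \times 1) \times \frac{\text{Clock cycle time}_{\text{ideal}}}{1.1}$$

$$\text{Avg. instruction time} = 1.27 \times \text{Clock cycle time}_{\text{ideal}}$$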
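The same arithmetic can be checked with a short script. This is a sketch under the slide's stated numbers: 40% data references, one stall cycle per data reference, and a clock 1.1 times faster on the machine with the hazard. The ideal clock cycle is normalised to 1, and the function name is illustrative.

```python
def avg_instruction_time(ideal_cpi, stall_fraction, stall_cycles, clock_cycle):
    """Average instruction time = (ideal CPI + stalls per instruction) * clock cycle."""
    return (ideal_cpi + stall_fraction * stall_cycles) * clock_cycle


ideal_clock = 1.0  # normalised clock cycle of the machine without the hazard

# Machine without the structural hazard: CPI = 1, no stalls.
time_ideal = avg_instruction_time(1.0, 0.0, 0, ideal_clock)

# Machine with the hazard: each data reference (40%) stalls 1 cycle,
# but the clock rate is 1.1x higher, so the cycle is 1/1.1 as long.
time_hazard = avg_instruction_time(1.0, 0.4, 1, ideal_clock / 1.1)

ratio = time_hazard / time_ideal
print(f"Hazard machine time / ideal machine time = {ratio:.2f}")
```

The ratio comes out at about 1.27, so the processor without the structural hazard is faster, by roughly a factor of 1.27, despite its slower clock.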

ALU Data Hazards

  R1 ← R2 + R3
  R4 ← R1 + R5

R1 is updated in the WB stage.

[Figure: MIPS datapath with the instruction registers (IR) of successive pipeline stages]

Stalled Stages and Pipeline Bubbles

I1 is R1 ← R2 + R3 and I2 is R4 ← R1 + R5. Time runs left to right, one column per clock cycle:

  IF   I1   I2   I3   I3   I3   I3   I4   I5
  ID        I1   I2   I2   I2   I2   I3   I4   I5
  EX             I1   nop  nop  nop  I2   I3   I4   I5
  MA                  I1   nop  nop  nop  I2   I3   I4   I5
  WB                       I1   nop  nop  nop  I2   I3   I4   I5

I2 is held in ID and I3 in IF (stalled stages) while nops (bubbles) flow through EX, MA, and WB.

How to overcome this hazard?

Resolving Data Hazards
  Stalling one of the instructions
  Data forwarding (bypassing)
  Scheduling hazardous instructions away from each other

Stalling (Interlocking)

  R1 ← R2 + R3
  R4 ← R1 + R5

[Figure: MIPS datapath with interlock logic: when the stall condition holds, a NOP is inserted into the pipeline]

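Pipeline Performance

$$\text{Speedup}_{\text{pipelining}} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}}$$

$$\text{Speedup}_{\text{pipelining}} = \frac{\text{Pipeline depth}}{1 + \text{Stall cycles per instruction}}$$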
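As a quick illustration of the second formula, the sketch below evaluates it for a five-stage pipeline. The stall rates used are illustrative assumptions, not figures from the slides. The formula itself assumes that the unpipelined CPI equals the pipeline depth and that the ideal pipelined CPI is 1.

```python
def pipeline_speedup(depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + stall cycles per instruction)."""
    return depth / (1.0 + stall_cycles_per_instruction)


# Five-stage pipeline (IF, ID, EX, MEM, WB) at a few assumed stall rates.
for stalls in (0.0, 0.2, 0.4, 1.0):
    print(f"stalls/instr = {stalls:.1f} -> speedup = {pipeline_speedup(5, stalls):.2f}")
```

With no stalls the speedup equals the pipeline depth. Every additional stall cycle per instruction increases the denominator, so at one stall per instruction the speedup is halved.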

Forwarding

  DADD R1,R2,R3
  DSUB R4,R1,R5
  AND  R6,R1,R7
  OR   R8,R1,R9
  XOR  R10,R1,R11

[Figure: pipeline diagram over time (clock cycles); DADD, DSUB, and AND each pass through IM, REG, ALU, DM, REG]

Forwarding

  R1 ← R2 + R3
  R4 ← R1 + R5

Before bypassing: the second instruction waits for R1 and the instruction behind it is held in IF (stalled stages), so CPI > 1.

After bypassing: no stages are stalled, so CPI = 1.

Cost of Forwarding
  In longer pipelines?
  In multiple issue pipelines?
  Have all the dependences been resolved?

Forwarding

Forwarding cannot solve all data dependence problems.

  LD  R2, 4(R1)
  ADD R4, R2, R3

[Figure: pipeline diagram over time (clock cycles); LD and ADD each pass through IM, REG, ALU, DM, REG]

Forwarding - Stall Condition

Forwarding cannot solve all data dependence problems.

  LD  R2, 4(R1)
  ADD R4, R2, R3

[Figure: the same pipeline diagram over time (clock cycles), with a STALL inserted so that ADD waits for the value loaded by LD]
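The effect of forwarding on stall cycles can be sketched with a small in-order timing model. This is a simplified illustration, not a full pipeline simulator. It assumes a five-stage pipeline in which every source operand is needed at the start of EX. Without forwarding, a result can be used four cycles after its producer's EX stage, matching the three nops in the stall table earlier. With forwarding, an ALU result can be used in the next cycle and a loaded value one cycle after that. The `count_stalls` name and the tuple layout are hypothetical.

```python
def count_stalls(program, forwarding):
    """Count stall cycles for a list of (opcode, dest, sources) tuples.

    Tracks, per register, the earliest cycle at which a consumer may enter EX.
    """
    ready = {}    # register name -> earliest EX cycle for a consumer
    prev_ex = 0   # EX cycle of the previous instruction
    for opcode, dest, sources in program:
        ex = prev_ex + 1                   # in-order: at best one cycle later
        for reg in sources:
            ex = max(ex, ready.get(reg, 0))  # wait for each operand
        if dest is not None:
            if not forwarding:
                ready[dest] = ex + 4       # must be written back, then re-read
            elif opcode == "LD":
                ready[dest] = ex + 2       # value exists only after MEM
            else:
                ready[dest] = ex + 1       # ALU result bypassed to next EX
        prev_ex = ex
    # With no stalls the last instruction would reach EX in cycle len(program).
    return prev_ex - len(program)


alu_chain = [
    ("DADD", "R1", ["R2", "R3"]),
    ("DSUB", "R4", ["R1", "R5"]),
    ("AND", "R6", ["R1", "R7"]),
    ("OR", "R8", ["R1", "R9"]),
    ("XOR", "R10", ["R1", "R11"]),
]
load_use = [
    ("LD", "R2", ["R1"]),
    ("ADD", "R4", ["R2", "R3"]),
]

print("ALU chain, no forwarding:", count_stalls(alu_chain, forwarding=False))
print("ALU chain, forwarding:   ", count_stalls(alu_chain, forwarding=True))
print("Load-use, no forwarding: ", count_stalls(load_use, forwarding=False))
print("Load-use, forwarding:    ", count_stalls(load_use, forwarding=True))
```

In this model the ALU chain needs three stall cycles without forwarding and none with it, while the load followed by a dependent ADD still needs one stall cycle even with forwarding, which is the case the slide highlights.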