Pipeline design. Mehran Rezaei

[Figure: single-cycle datapath with its control unit. Legible labels: pc, Opcode, Funct, ExtOp, RegDst, ALUSrc, ALUOp, ALUCtr, Branch, MemtoReg, OVF, Shift Left 2, Extension.]

[Figure: the same datapath divided into five stages: IF, ID, EXE, MEM, WB.]

Pipeline registers between the stages:
IF/ID Registers: PC+4, Inst.
ID/EXE Registers: PC+4, RegA, RegB, IMM, Rt, Rd
EXE/MEM Registers: Br. Tr. Add., ALUres, RegB, Rt/Rd
MEM/WB Registers: Mem, ALUres, Rt/Rd

Example
Run the following code on our pipeline machine:
    add $,$0,$3
    lw $,20($2)
    sub $5,$6,$6
    sw $7,8($8)
    add $9,$,$3

[Figures: five datapath snapshots, one per clock cycle, showing the register file contents (R0 to R9) as each instruction of the example is fetched and the earlier ones advance one stage. In the fifth snapshot all five stages are occupied.]

Recall: Single cycle control!
[Figure: single-cycle datapath (Next PC, ideal instruction memory, 32 32-bit registers, ALU, ideal data memory) with one control block that issues the control signals and receives the ALU conditions.]

Stationary Control
The Main Control generates the control signals during Reg/Dec.
Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later.
Control signals for Mem (MemWr, Branch) are used 2 cycles later.
Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later.
Signals carried by each pipeline register:
ID/Ex Register: ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, MemtoReg, RegWr
Ex/Mem Register: MemWr, Branch, MemtoReg, RegWr
Mem/Wr Register: MemtoReg, RegWr
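The idea of stationary control can be sketched in a few lines of Python: the control word is produced once in decode, and each stage peels off the signals it uses before passing the rest on. This is only an illustration. The helper name `advance`, the table `USED_IN` and the control values chosen for the load are assumptions, not part of the slides.

```python
# Control signals consumed by each stage (grouping taken from the slide).
USED_IN = {
    "Exec": ("ExtOp", "ALUSrc", "ALUOp", "RegDst"),
    "Mem": ("MemWr", "Branch"),
    "Wr": ("MemtoReg", "RegWr"),
}

def advance(control, stage):
    """Return (signals used by `stage`, signals passed to the next pipeline register)."""
    used = {k: v for k, v in control.items() if k in USED_IN[stage]}
    remaining = {k: v for k, v in control.items() if k not in USED_IN[stage]}
    return used, remaining

# Illustrative control word for a load, generated once during Reg/Dec (values assumed).
id_ex = {"ExtOp": 1, "ALUSrc": 1, "ALUOp": "add", "RegDst": 0,
         "MemWr": 0, "Branch": 0, "MemtoReg": 1, "RegWr": 1}

_, ex_mem = advance(id_ex, "Exec")   # what the Ex/Mem register still carries
_, mem_wr = advance(ex_mem, "Mem")   # what the Mem/Wr register still carries
print(sorted(ex_mem))
print(sorted(mem_wr))
```

Each pipeline register therefore gets narrower from left to right, which is why RegWr has to travel through all of them.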

[Figure: datapath + stationary control. Each pipeline register carries, next to the data, the control fields (ex, me, wb) still needed by the stages that follow.]

Pipeline timing diagram
Cycle:          1    2    3    4    5    6    7    8    9
add $,$0,$3     IF   ID   EXE  MEM  WB
lw $,20($2)          IF   ID   EXE  MEM  WB
sub $5,$6,$6              IF   ID   EXE  MEM  WB
sw $7,8($8)                    IF   ID   EXE  MEM  WB
add $9,$,$3                         IF   ID   EXE  MEM  WB
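A small Python sketch can generate the same kind of diagram for an ideal pipeline (no hazards, one instruction entering per cycle). The function names `timing_rows` and `total_cycles` are made up for this illustration.

```python
STAGES = ["IF", "ID", "EXE", "MEM", "WB"]

def timing_rows(n_instructions):
    """For an ideal pipeline, map each instruction to {cycle: stage}."""
    return [{i + s + 1: name for s, name in enumerate(STAGES)}
            for i in range(n_instructions)]

def total_cycles(n_instructions):
    # The first instruction needs len(STAGES) cycles; each later one adds one cycle.
    return len(STAGES) + n_instructions - 1

rows = timing_rows(5)
for i, row in enumerate(rows):
    cells = [row.get(c, "").ljust(5) for c in range(1, total_cycles(5) + 1)]
    print(f"instr {i + 1}: " + "".join(cells))
```

For the five instructions of the example this gives 9 cycles in total, compared with 5 x 5 = 25 stage-times if each instruction had to finish before the next one started.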

Hazards
What are they?
How do you detect them?
How do you deal with them?

[Figure: pipelined datapath with the pipeline register fields named: PC+4 and instruction; PC+4, valA, valB, IMM and dest; target, ALUres, eq?, valB and dest; mdata, ALUres and dest.]

Pipeline cycles for add
IF - Fetch: read instruction from memory
ID - Decode: read source operands from the register file
EXE - Execute: calculate sum
MEM - Memory: pass results to next stage
WB - Write back: write sum (ALUres) into register file

Hazard
add $,$2,$3    IF ID EXE MEM WB       (register one is written)
sub $,$5,$        IF ID EXE MEM WB    (register one is read)
If we are not careful, sub will read the wrong value! If sub is supposed to read the updated value (not the stale one), how many instructions should there be between add and sub?

[Figure: datapath snapshot with add followed by sub in the pipeline, showing the register file contents.]

Hazard
Cycle:          1       2       3       4       5       6       7       8
add $,$2,$3     IF      ID      EXE     MEM     WB
sub $,$5,$              IF      hazard  hazard  ID      EXE     MEM     WB
add writes the register in WB; sub reads it in ID.

Class work
What are the data hazards in this piece of code?
    add $,$2,$3
    sub $2,$,$3
    xor $,$3,$5
    nor $5,$2,$
    add $5,$3,$5
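One way to check an answer to this kind of exercise is to search mechanically for read-after-write pairs that are too close together. The Python sketch below assumes the register file is written in WB and read in ID of the same cycle, so a consumer is safe once two other instructions separate it from its producer (`window=2`). The function name `raw_hazards` and the register numbers in the sample program are hypothetical and chosen only for illustration.

```python
def raw_hazards(program, window=2):
    """program: list of (dest, src1, src2) register numbers.
    Returns (producer_index, consumer_index, register) triples for every
    read that follows a write by at most `window` instructions."""
    hazards = []
    for j, (_, *srcs) in enumerate(program):
        for i in range(max(0, j - window), j):
            dest = program[i][0]
            if dest != 0 and dest in srcs:   # $0 is hard-wired to zero in MIPS
                hazards.append((i, j, dest))
    return hazards

# Hypothetical program (register numbers chosen for illustration only):
prog = [
    (10, 2, 3),    # add $10,$2,$3
    (11, 10, 3),   # sub $11,$10,$3   reads $10 one instruction later
    (12, 3, 5),    # xor $12,$3,$5
    (13, 11, 10),  # nor $13,$11,$10  reads $11 two instructions later
]
print(raw_hazards(prog))
```

The read of $10 by the last instruction is not reported, because three instructions after the producer the value is already in the register file.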

What to do with them?
Avoid: make sure there are no hazards in the code.
Detect and Stall: if hazards exist, stall the processor until they go away.
Detect and Forward: if hazards exist, fix up the pipeline to get the correct value (if possible).

First approach: avoid all hazards
Assume the programmer (or the compiler) knows about the processor implementation and makes sure no hazards exist.
Suppose there is an instruction called noop. Put noops between any dependent instructions.
Cycle:          1    2    3    4    5    6    7    8
add $,$2,$3     IF   ID   EXE  MEM  WB
noop
noop
sub $,$5,$                     IF   ID   EXE  MEM  WB

What is the problem with this solution?
Old programs (legacy code) may not run correctly on new implementations.
    Longer pipelines need more noops.
Programs get larger as noops are included.
    Especially a problem for machines that try to execute more than one instruction every cycle.
    Intel EPIC: often 25% - 40% of instructions are noops.
Program execution is slower.
    CPI is 1, but some instructions are noops.
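The last point can be made concrete with a short calculation: if the pipeline completes one instruction per cycle but a fraction of those instructions are noops, the cost per useful instruction is higher than 1. The Python sketch below is a back-of-the-envelope model; the function name `effective_cpi` and the two sample fractions are illustrative assumptions.

```python
def effective_cpi(noop_fraction, cpi=1.0):
    """Cycles per useful (non-noop) instruction when `noop_fraction`
    of the executed instructions are noops."""
    if not 0 <= noop_fraction < 1:
        raise ValueError("noop_fraction must be in [0, 1)")
    return cpi / (1.0 - noop_fraction)

for frac in (0.25, 0.40):
    print(f"{frac:.0%} noops: {effective_cpi(frac):.2f} cycles per useful instruction")
```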

The second solution
Detect:
    Compare regA with previous DestRegs (5-bit operand fields).
    Compare regB with previous DestRegs (5-bit operand fields).
Stall:
    Keep current instructions in fetch and decode.
    Pass a noop to execute.
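The detect and stall rules above can be sketched as two small Python functions. This is an illustration under simplifying assumptions: instructions are modelled as `(name, regA, regB)` tuples, `previous_dests` holds the destination registers of the instructions still ahead in the pipeline, and the names `must_stall` and `decode_stage` are hypothetical.

```python
def must_stall(reg_a, reg_b, previous_dests):
    """True if the instruction in decode reads a register that an earlier,
    still-unfinished instruction will write.
    previous_dests: destination registers of the instructions ahead in the
    pipeline (None for noops or instructions without a destination)."""
    pending = {d for d in previous_dests if d not in (None, 0)}
    return reg_a in pending or reg_b in pending

def decode_stage(instr_in_decode, previous_dests):
    """Return what is sent to execute this cycle."""
    _, reg_a, reg_b = instr_in_decode
    if must_stall(reg_a, reg_b, previous_dests):
        return "noop"            # fetch and decode keep their instructions
    return instr_in_decode

print(decode_stage(("sub", 10, 5), [10, None]))   # producer of $10 still in flight
print(decode_stage(("sub", 10, 5), [None, 7]))    # no pending writer
```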


Hazard
[Figures: cycle-by-cycle datapath snapshots, split into first and second half of each cycle, for add followed by the dependent sub. In the first half of cycle 3, with sub in decode, the hazard is detected.]

[Figure: hazard detection logic. Comparators match the regA and regB fields of the instruction in IF/ID against the destination registers of the instructions ahead of it (ID/EX).]

What next?
Detect: compare regA and regB with previous DestRegs (5-bit operand fields).
Stall: keep current instructions in fetch and decode; pass a noop to execute.

[Figures: datapath snapshots from the second half of cycle 3 to the second half of cycle 5. While the hazard is detected, sub stays in decode and noops are passed to execute; sub proceeds once add has reached write back in cycle 5.]

Timing graph
Time runs from cycle 1 to cycle 13. Stage sequence of each instruction:
add $,$2,$3    IF ID EX ME WB
sub $,$,$5     IF no-op no-op ID EX ME WB
add $6,$,$7    IF ID EX ME WB
lw $6,0($8)    IF ID EX ME WB
sw $6,3($)     IF no-op no-op ID EX ME
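The cost of the stalls can be estimated with a short Python model of detect and stall. It assumes no forwarding, in-order decode, and a register file that is written in the first half of WB and read in the second half of ID, so a consumer can decode in the producer's WB cycle at the earliest. The function names and the register numbers in the sample program are hypothetical; the program only mirrors the shape of the sequence above (one dependent pair at the start, one at the end).

```python
def schedule(program):
    """program: list of (dest, srcs); dest may be None, srcs is a tuple.
    Returns the cycle in which each instruction is decoded."""
    id_cycle = []
    for j, (_, srcs) in enumerate(program):
        earliest = 2 if j == 0 else id_cycle[j - 1] + 1    # in-order decode
        for i in range(j):
            dest = program[i][0]
            if dest not in (None, 0) and dest in srcs:
                earliest = max(earliest, id_cycle[i] + 3)   # producer's WB cycle
        id_cycle.append(earliest)
    return id_cycle

def total_cycles(program):
    return schedule(program)[-1] + 3    # ID, then EX, ME, WB

# Hypothetical program (register numbers are assumptions):
prog = [
    (10, (2, 3)),     # add $10,$2,$3
    (11, (10, 5)),    # sub $11,$10,$5   depends on the add
    (6, (12, 7)),     # add $6,$12,$7
    (6, (8,)),        # lw  $6,0($8)
    (None, (6, 9)),   # sw  $6,0($9)     depends on the lw
]
print(schedule(prog), total_cycles(prog))
```

Under these assumptions the five instructions take 13 cycles instead of the 9 an ideal pipeline would need.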

Problems with the second solution
CPI is still the same as before, so there is no improvement in performance.
The only improvements are in code size and in the compiler no longer being responsible for detecting data hazards.
In fact, the system now runs slower. Why?

The third solution
Detect the data hazard.
The add instruction calculates its result in the execute stage.
Forward the result to the decode stage of the sub instruction.
Therefore sub does not need to wait until the result is written back into the register file.
More control is needed: the result must be taken from somewhere other than the register file.

The third solution
Detect: same as detect and stall, except that the hazards are all treated differently, i.e., you can't logical-OR the hazard signals.
Forward: new bypass datapaths route computed data to where it is needed; new MUX and control pick the right data.
Beware: stalling may still be required even in the presence of forwarding.
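The MUX control described above can be sketched in Python as a selection function: each hazard gets its own answer (which bypass path to use), which is why the hazard signals cannot simply be OR-ed together. The names `forward_select` and `load_use_stall` and the path labels are assumptions made for this illustration, not the signal names of the datapath in the figures.

```python
def forward_select(src_reg, ex_dest, mem_dest, wb_dest):
    """Choose where the operand `src_reg` should come from.
    ex_dest / mem_dest / wb_dest: destination registers of the instructions
    one, two and three stages ahead (None if they write no register).
    The youngest matching producer wins, since it holds the latest value."""
    if src_reg in (0, None):
        return "regfile"         # $0 always reads as zero
    if src_reg == ex_dest:
        return "from_EX"
    if src_reg == mem_dest:
        return "from_MEM"
    if src_reg == wb_dest:
        return "from_WB"
    return "regfile"

def load_use_stall(src_regs, ex_is_load, ex_dest):
    """A load's data only exists after the memory stage, so an instruction
    that needs it right behind the load must still stall."""
    return ex_is_load and ex_dest not in (None, 0) and ex_dest in src_regs

print(forward_select(10, 10, 10, None))       # two producers of $10: take the newest
print(load_use_stall((6, 9), True, 6))        # consumer right behind a load
```

The second function is one example of the "stalling may still be required" case.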

[Figures: datapath snapshots with forwarding, from the first half of cycle 3 to the first half of cycle 5, for the sequence add, sub, add $6, lw $6, sw $6. The hazard on sub is detected in cycle 3 and handled through the forwarding paths (FW) instead of stalling; new hazards are flagged in cycle 4 and in cycle 5 as the following instructions reach decode.]

What else can go wrong in our pipelined CPU?
Control hazards
Exceptions: first of all, what are exceptions? And how do you handle exceptions in a pipelined processor with 5 instructions in flight?

Control Hazard What is a control hazard? How does the pipelined CPU handle control hazards?

[Figure: pipelined datapath extended for beq and bne, with the control unit, the ALU unit, the branch target and the eq? result feeding the pc.]

What happens in executing BEQ?
Fetch: read instruction from memory
Decode: read source operands from the register file
Execute: calculate target address and test for equality
Memory: send target to PC if test is equal
Write back: nothing left to do

Example
    y=y*2;
    x=0;
    for(j=00;j>0;j--){
        x++;
        z--;
    }
    y--;
    x=x*3;
    z=z+x;

    00 add $3,$3,$3
    04 add $2,$0,$0
    08 li $5,00
    12 addi $2,$2,1
    16 addi $,$,-1
    20 addi $5,$5,-1
    24 bne $5,$0,-
    28 addi $3,$3,-1
    32 add $5,$2,$0
    36 add $2,$2,$2
    40 add $2,$2,$5
    44 add $,$,$2

What do you observe from the example?
How many times is the branch taken? How many times is it not taken?
What happens each time the branch instruction is executed?
What happens next?
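The first two questions can be answered by simulating the loop-closing branch. The Python sketch below takes the iteration count `n` as a parameter (assumed to be at least 1); the function name `run_loop` is made up for this illustration.

```python
def run_loop(n):
    """Count the outcomes of the loop-closing `bne j, 0` for j starting at n (n >= 1)."""
    j, taken, not_taken = n, 0, 0
    while True:
        j -= 1                  # loop body, then j--
        if j != 0:              # bne: branch back while j != 0
            taken += 1
        else:
            not_taken += 1      # fall through and leave the loop
            return taken, not_taken

print(run_loop(10))
```

For a loop that runs n times the branch is taken n - 1 times and falls through exactly once, so a loop branch is taken far more often than not.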

Surprise!
    12 addi $2,$2,1
    ...
    24 bne $5,$0,-
    28 addi $3,$3,-1
    32 add $5,$2,$0
    36 add $2,$2,$2

Cycle:   1    2    3    4    5    6    7    8    9
24       IF   ID   EXE  MEM  WB
28            IF   ID   EXE  MEM  WB
32                 IF   ID   EXE  MEM  WB
36                      IF   ID   EXE  MEM  WB
12                           IF   ID   EXE  MEM  WB

Solutions
Avoid: make sure there are no hazards in the code.
Detect and Stall: delay fetch until the branch is resolved.
Speculate and Squash-if-Wrong: go ahead and fetch more instructions in case the guess is correct, but stop them if they shouldn't have been executed.

Avoid
Don't have branch instructions! Maybe a little impractical.
Delay taking the branch:
    dbeq R1,R2,offset
    dbne R1,R2,offset
Instructions at PC+4, PC+8, etc. will execute before deciding whether to fetch from PC+4+offset. (If no useful instructions can be placed after dbeq, noops must be inserted.)

Consider our example again

Original code:
    00 add $3,$3,$3
    04 add $2,$0,$0
    08 li $5,00
    12 addi $2,$2,1
    16 addi $,$,-1
    20 addi $5,$5,-1
    24 bne $5,$0,-
    28 addi $3,$3,-1
    32 add $5,$2,$0
    36 add $2,$2,$2
    40 add $2,$2,$5
    44 add $,$,$2

With noops after the branch:
    00 add $3,$3,$3
    04 add $2,$0,$0
    08 li $5,00
    12 addi $2,$2,1
    16 addi $,$,-1
    20 addi $5,$5,-1
    24 bne $5,$0,-
    28 noop
    32 noop
    36 noop
    40 addi $3,$3,-1
    44 add $5,$2,$0
    48 add $2,$2,$2
    52 add $2,$2,$5
    56 add $,$,$2

Can we do better?

First attempt:
    00 add $3,$3,$3
    04 add $2,$0,$0
    08 li $5,00
    12 addi $5,$5,-1
    16 dbne $5,$0,-2
    20 addi $,$,-1
    24 addi $2,$2,1
    28 noop
    32 addi $3,$3,-1
    36 add $5,$2,$0
    40 add $2,$2,$2
    44 add $2,$2,$5
    48 add $,$,$2

Second attempt:
    00 add $3,$3,$3
    04 add $2,$0,$0
    08 li $5,00
    12 dbne $5,$0,-
    16 addi $5,$5,-1
    20 addi $,$,-1
    24 addi $2,$2,1
    28 addi $3,$3,-1
    32 add $5,$2,$0
    36 add $2,$2,$2
    40 add $2,$2,$5
    44 add $,$,$2
This code generates wrong results.
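One plausible reading of why the second attempt is wrong: the delayed branch now tests the counter before it is decremented, so the loop body in the delay slots runs one time too many. The Python sketch below models only that ordering difference; it is not an instruction-level simulation, and the function names are made up for this illustration.

```python
def loop_decrement_first(n):
    """addi j,-1 ; dbne j,0 ; loop body in the delay slots."""
    j, body_runs = n, 0
    while True:
        j -= 1
        taken = (j != 0)     # dbne tests the decremented j
        body_runs += 1       # delay-slot instructions always execute
        if not taken:
            return body_runs

def loop_branch_first(n):
    """dbne j,0 ; then addi j,-1 and the loop body in the delay slots."""
    j, body_runs = n, 0
    while True:
        taken = (j != 0)     # dbne tests j before it is decremented
        j -= 1
        body_runs += 1
        if not taken:
            return body_runs

print(loop_decrement_first(10), loop_branch_first(10))
```

With the decrement ahead of the branch the body runs n times, as in the original loop; with the branch first it runs n + 1 times.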

Problems with this solution
Old programs (legacy code) may not run correctly on new implementations.
    Longer pipelines need more instructions/noops after a delayed beq.
Programs get larger as noops are included.
    Especially a problem for machines that try to execute more than one instruction every cycle.
    Intel EPIC: often 25% - 40% of instructions are noops.
Program execution is slower.
    CPI equals 1, but some instructions are noops.

Detect and Stall (hardware approach)
Detection: must wait until decode.
    Compare opcode to beq.
    Alternatively, this is just another control signal.
Stall: keep current instructions in fetch.
    Pass noop to decode stage (not execute!).

Our example again (the same code as in the example above)
[Figures: datapath snapshots for detect and stall. Once bne at address 24 is recognized in decode, fetch is held at address 28 and noops are passed down the pipeline behind the branch. When the branch resolves as taken, addi $2,$2,1 is fetched with three noops ahead of it.]

What seems to be the problem?
CPI increases every time a branch is detected!
Is that necessary? Not always!
    The branch is taken only about ½ of the time.
    Let's assume that it is NOT taken.
        In this case, we can ignore the beq or bne (treat it like a noop).
        Keep fetching from PC+4.
    What if we are wrong?
        OK, as long as we do not COMPLETE any instructions we mistakenly executed (i.e., don't perform writeback).

Speculate and Squash
Speculate: assume not equal.
    Keep fetching from PC+4 until we know that the branch is really taken.
Squash: stop bad instructions if taken.
    Send a noop to Decode, Execute and Memory.
    Send target address to PC.
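The benefit of speculating over stalling can be estimated with a simple CPI model. The Python sketch below assumes a base CPI of 1 and a penalty of 3 cycles per stalled or squashed branch (three noops, since the target reaches the PC in the memory stage). The function name and the 20% branch frequency in the example are assumptions; the ½ taken rate is the figure quoted in the slides.

```python
def cpi_with_branches(branch_fraction, taken_fraction, penalty=3, speculate=True):
    """Average CPI for a pipeline with base CPI 1.
    penalty: cycles lost per stalled or squashed branch."""
    if speculate:
        lost = branch_fraction * taken_fraction * penalty   # only taken branches squash
    else:
        lost = branch_fraction * penalty                     # every branch stalls
    return 1.0 + lost

stall = cpi_with_branches(0.2, 0.5, speculate=False)
spec = cpi_with_branches(0.2, 0.5, speculate=True)
print(f"detect and stall: {stall:.2f}, speculate and squash: {spec:.2f}")
```

Under these assumptions speculation halves the branch overhead, because the penalty is paid only when the guess turns out to be wrong.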

Our example again (the same code as in the example above)
[Figure: speculate and squash. The instructions at addresses 28, 32 and 36 are fetched behind bne at address 24; when the branch turns out to be taken, they are replaced by noops in decode, execute and memory.]

Performance problem, again
CPI increases every time a branch is taken! That is about ½ of the time.
Is that necessary? No! But how can you fetch from the target before you even know the previous instruction is a branch, much less whether it is taken?

[Figures: datapath with a small table in the fetch stage that stores the PC of a branch (bpc) together with its target, shown for bne at address 24.]

Branch Prediction
Predict not taken: ~50% accurate
Predict backward taken: ~65% accurate
Predict same as last time: ~80% accurate
Pentium: ~85% accurate
Pentium Pro: ~92% accurate
Best paper designs: ~96% accurate
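The three simplest schemes in this list can be compared on a toy branch trace. The Python sketch below uses a hypothetical trace (the loop-closing branch of a 10-iteration loop that is executed three times) and assumes the "same as last time" predictor starts out predicting not taken. The accuracies it prints describe only this trace; the percentages above are measurements over real programs.

```python
def accuracy(predict, outcomes):
    """outcomes: list of booleans (True = branch taken), in execution order."""
    correct, last = 0, False          # 'last' starts as not taken (an assumption)
    for actual in outcomes:
        correct += (predict(last) == actual)
        last = actual
    return correct / len(outcomes)

def not_taken(last):
    return False

def backward_taken(last):
    return True                       # the loop branch jumps backward

def same_as_last(last):
    return last

# Hypothetical trace: a 10-iteration loop executed 3 times.
trace = ([True] * 9 + [False]) * 3

for name, predictor in [("not taken", not_taken),
                        ("backward taken", backward_taken),
                        ("same as last time", same_as_last)]:
    print(f"{name}: {accuracy(predictor, trace):.0%}")
```

On a pure loop trace "backward taken" wins, while "same as last time" mispredicts twice per loop execution: once on entry and once on exit.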