Pipeline design. Mehran Rezaei

[Figure: single-cycle datapath with its control unit. Control signals shown: ExtOp, RegDst, ALUSrc, ALUOp, ALUCtr, Branch, MemtoReg, OVF.]

[Figure: the datapath divided into five stages: IF, ID, EXE, MEM, WB.]

IF: writes PC+4 and Inst. into the IF/ID registers.

ID: reads PC+4 and Inst. from the IF/ID registers; writes PC+4, RegA, RegB, IMM, Rt and Rd into the ID/EXE registers.

EXE: reads PC+4, RegA, RegB, IMM, Rt and Rd from the ID/EXE registers; writes Br. Tr. Add. (branch target address), ALUres, RegB and Rt/Rd into the EXE/MEM registers.

MEM: reads Br. Tr. Add., ALUres, RegB and Rt/Rd from the EXE/MEM registers; writes Mem, ALUres and Rt/Rd into the MEM/WB registers.

WB: reads Mem, ALUres and Rt/Rd from the MEM/WB registers.

Example
Run the following code on our pipeline machine:
add $,$0,$3
lw $,20($2)
sub $5,$6,$6
sw $7,8($8)
add $9,$,$3

[Figures: datapath snapshots for cycles 1 to 5. The five instructions enter the pipeline one per cycle, so by cycle 5 add, lw, sub, sw and the second add occupy WB, MEM, EXE, ID and IF respectively.]

Recall: single-cycle control!
[Figure: single-cycle datapath (next PC logic, ideal instruction memory, 32 32-bit registers, ALU, ideal data memory). The control block sends control signals to the datapath and receives conditions from it.]

Stationary Control
The Main Control generates the control signals during Reg/Dec.
Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later.
Control signals for Mem (MemWr, Branch) are used 2 cycles later.
Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later.
[Figure: the control signals travel through the pipeline registers alongside the instruction.
    ID/Ex register: ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, MemtoReg, RegWr
    Ex/Mem register: MemWr, Branch, MemtoReg, RegWr
    Mem/Wr register: MemtoReg, RegWr]
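The way control signals ride along with an instruction can be sketched in Python. This is only an illustration under assumptions: the function name `advance`, the dictionary layout and the control word for a load are hypothetical, and RegWr is assumed to be consumed in Wr, as the pipeline-register drawing suggests.

```python
# Control signals consumed in each stage, grouped as on the slide.
EXEC_SIGNALS = ("ExtOp", "ALUSrc", "ALUOp", "RegDst")
MEM_SIGNALS = ("MemWr", "Branch")
WR_SIGNALS = ("MemtoReg", "RegWr")  # assumption: RegWr is used in Wr


def advance(pipeline_regs, decoded):
    """Model one clock edge: every group of signals moves one register on.

    pipeline_regs maps "ID/Ex", "Ex/Mem", "Mem/Wr" to dicts of signals.
    decoded is the full signal dict produced by the main control in Reg/Dec.
    Each register keeps only the signals that are still needed downstream.
    """
    keep_after_exec = MEM_SIGNALS + WR_SIGNALS
    return {
        "ID/Ex": dict(decoded),
        "Ex/Mem": {k: v for k, v in pipeline_regs["ID/Ex"].items()
                   if k in keep_after_exec},
        "Mem/Wr": {k: v for k, v in pipeline_regs["Ex/Mem"].items()
                   if k in WR_SIGNALS},
    }


# Hypothetical control word for a load instruction.
lw_ctrl = {"ExtOp": 1, "ALUSrc": 1, "ALUOp": "add", "RegDst": 0,
           "MemWr": 0, "Branch": 0, "MemtoReg": 1, "RegWr": 1}
bubble = {name: 0 for name in lw_ctrl}
regs = {"ID/Ex": {}, "Ex/Mem": {}, "Mem/Wr": {}}

regs = advance(regs, lw_ctrl)  # lw is in Exec, 1 cycle after decode
regs = advance(regs, bubble)   # lw is in Mem, 2 cycles after decode
regs = advance(regs, bubble)   # lw is in Wr, 3 cycles after decode
print(regs["Mem/Wr"])
```

After three clock edges only MemtoReg and RegWr of the load are left, which is exactly what the Wr stage needs.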

[Figure: datapath + stationary control. The decoded control fields (ex, me, wb) are stored in the pipeline registers next to the data (IR, A, B, S, M, D) and are consumed stage by stage: Exec, Mem Ctrl, WB Ctrl.]

Pipeline timing diagram
Cycle:         1    2    3    4    5    6    7    8    9
add $,$0,$3    IF   ID   EXE  MEM  WB
lw $,20($2)         IF   ID   EXE  MEM  WB
sub $5,$6,$6             IF   ID   EXE  MEM  WB
sw $7,8($8)                   IF   ID   EXE  MEM  WB
add $9,$,$3                        IF   ID   EXE  MEM  WB
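A diagram like this can be generated mechanically when there are no hazards. The sketch below assumes an ideal pipeline in which instruction number i enters IF in cycle i+1; the function name `timing_diagram` and the placeholder instruction names are made up for illustration.

```python
STAGES = ["IF", "ID", "EXE", "MEM", "WB"]


def timing_diagram(instructions):
    """Return (text, cells) rows for an ideal, hazard-free pipeline.

    cells[c] holds the stage the instruction occupies in cycle c+1,
    or "" when it is not in the pipeline during that cycle.
    """
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, text in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # one stage per cycle, shifted by i
        rows.append((text, cells))
    return rows


program = ["i1", "i2", "i3", "i4", "i5"]  # placeholder instruction names
rows = timing_diagram(program)
for text, cells in rows:
    print(f"{text:<4}" + "".join(f"{c:<5}" for c in cells))
```

Five instructions need 5 + 4 = 9 cycles in total, which matches the width of the diagram.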

Hazards
What are they?
How do you detect them?
How do you deal with them?

[Figure: pipelined datapath. Pipeline-register fields: PC+4 and instruction (IF/ID); PC+4, vala, valb, IMM, dest (ID/EXE); target, eq?, ALUres, valb, dest (EXE/MEM); mdata, ALUres, dest (MEM/WB).]

Pipeline cycles for add
IF - Fetch: read the instruction from memory
ID - Decode: read the source operands from the register file
EXE - Execute: calculate the sum
MEM - Memory: pass the result to the next stage
WB - Write back: write the sum (ALUres) into the register file

Hazard
add $,$2,$3    IF   ID   EXE  MEM  WB         (register one is written in WB)
sub $,$5,$          IF   ID   EXE  MEM  WB    (register one is read in ID)
If we are not careful, we will read the wrong value!
If sub is supposed to read the updated value (not the stale one), how many instructions should there be between add and sub?

[Figure: datapath snapshot with sub $,$5,$ in the pipeline directly behind add $,$2,$3.]

Hazard
add $,$2,$3    IF      ID      EXE     MEM     WB (write)
sub $,$5,$             IF      hazard  hazard  ID (read)  EXE  MEM  WB

Class work
What are the data hazards in this piece of code?
add $,$2,$3
sub $2,$,$3
xor $,$3,$5
nor $5,$2,$
add $5,$3,$5
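One way to check an answer is to search for read-after-write pairs mechanically. The sketch below is an assumption-laden illustration, not the course's method: `find_raw_hazards` is a hypothetical helper, instructions are reduced to (dest, src1, src2) register numbers, and the demo program is invented rather than being the class-work code.

```python
def find_raw_hazards(program, window=2):
    """List (producer_index, consumer_index, register) read-after-write pairs.

    program is a list of (dest, src1, src2) register numbers.
    window is how many following instructions can still read a stale
    value. 2 fits a 5-stage pipeline whose register file is written in
    the first half of a cycle and read in the second half (assumption).
    """
    hazards = []
    for i, (dest, _, _) in enumerate(program):
        if dest == 0:  # MIPS register $0 is hard-wired to zero
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            for src in program[j][1:]:
                if src == dest:
                    hazards.append((i, j, dest))
    return hazards


# Hypothetical program, not the class-work code:
demo = [
    (1, 2, 3),  # add $1,$2,$3
    (4, 1, 5),  # sub $4,$1,$5  reads $1 one instruction later
    (6, 7, 1),  # or  $6,$7,$1  reads $1 two instructions later
    (8, 1, 4),  # and $8,$1,$4  $1 is safe by now, $4 is not
]
hazards = find_raw_hazards(demo)
print(hazards)
```

Each tuple names the writing instruction, the reading instruction and the register involved.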

What to do with them?
Avoid
    Make sure there are no hazards in the code.
Detect and Stall
    If hazards exist, stall the processor until they go away.
Detect and Forward
    If hazards exist, fix up the pipeline to get the correct value (if possible).

First Approach: avoid all hazards
Assume the programmer (or the compiler) knows about the processor implementation.
Make sure no hazards exist.
Suppose there is an instruction called noop. Put noops between any dependent instructions.
add $,$2,$3    IF   ID   EXE  MEM  WB
noop
noop
sub $,$5,$                    IF   ID   EXE  MEM  WB
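What the compiler would have to do can be sketched as a small padding pass. This is a minimal illustration under assumptions: `insert_noops`, the tuple layout (text, dest, sources) and the register numbers in the demo are all hypothetical, and a gap of 2 instructions is assumed to be enough for this 5-stage pipeline.

```python
NOOP = ("noop", None, ())


def insert_noops(program, gap=2):
    """Pad so that at least `gap` instructions separate a write and its read.

    program is a list of (text, dest, sources) tuples, where dest is a
    register number or None and sources is a tuple of register numbers.
    """
    out = []
    for instr in program:
        sources = instr[2]
        needed = 0
        # Inspect the last `gap` instructions already emitted, newest first.
        recent = out[max(0, len(out) - gap):]
        for distance, prev in enumerate(reversed(recent), start=1):
            if prev[1] is not None and prev[1] in sources:
                needed = max(needed, gap - distance + 1)
        out.extend([NOOP] * needed)
        out.append(instr)
    return out


# Hypothetical register numbers, chosen only for the demonstration.
demo = [
    ("add $1,$2,$3", 1, (2, 3)),
    ("sub $4,$5,$1", 4, (5, 1)),  # reads $1 right after it is written
    ("or $6,$7,$2", 6, (7, 2)),   # independent of the instructions above
]
padded = [text for text, _, _ in insert_noops(demo)]
print(padded)
```

The dependent pair is pushed apart by two noops, while the independent instruction is left alone.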

What is the problem with this solution?
Old programs (legacy code) may not run correctly on new implementations
    Longer pipelines need more noops
Programs get larger as noops are included
    Especially a problem for machines that try to execute more than one instruction every cycle
    Intel EPIC: often 25% - 40% of instructions are noops
Program execution is slower
    CPI is 1, but some instructions are noops

The second solution
Detect:
    Compare rega with previous DestRegs (5-bit operand fields)
    Compare regb with previous DestRegs (5-bit operand fields)
Stall:
    Keep current instructions in fetch and decode
    Pass a noop to execute
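The detect step amounts to a handful of equality comparisons. The sketch below is an assumption: `must_stall` is a hypothetical name, and it compares against the destinations of the two instructions in EXE and MEM only, on the premise that an instruction in WB writes the register file early enough for decode to read the new value in the same cycle.

```python
def must_stall(rega, regb, dest_in_exe, dest_in_mem):
    """Return True when the instruction in decode must be held back.

    rega and regb are the 5-bit source register fields in decode.
    dest_in_exe and dest_in_mem are the destination registers of the
    older instructions in EXE and MEM; None means "writes nothing"
    (noop, store, branch). Register 0 never causes a hazard.
    """
    pending = {d for d in (dest_in_exe, dest_in_mem) if d not in (None, 0)}
    return rega in pending or regb in pending


# Hypothetical register numbers:
print(must_stall(1, 5, dest_in_exe=1, dest_in_mem=None))  # rega matches EXE
print(must_stall(2, 3, dest_in_exe=1, dest_in_mem=4))     # no match
```

While the result is True, fetch and decode keep their instructions and a noop is passed to execute.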

Hazard
Addr 0x00: add $,$2,$3    IF      ID      EXE     MEM     WB (write)
           sub $,$5,$             IF      hazard  hazard  ID (read)  EXE  MEM  WB

[Figures: datapath snapshots for the first and second halves of cycles 1 to 3. add $,$2,$3 is fetched in cycle 1 and decoded in cycle 2 while sub $,$,$5 is fetched. In the first half of cycle 3, with add in execute and sub in decode, the hazard is detected.]

[Figure: hazard detection logic. Four comparators check rega and regb from the IF/ID register against the destination register fields held further down the pipeline (ID/EX and later); any match raises "Hazard detected".]

What Next?
Detect:
    Compare rega with previous DestRegs (5-bit operand fields)
    Compare regb with previous DestRegs (5-bit operand fields)
Stall:
    Keep current instructions in fetch and decode
    Pass a noop to execute

[Figures: datapath snapshots from the second half of cycle 3 to the second half of cycle 5. While the hazard is detected, sub $,$,$5 is held in decode, the PC stays at 0x0c, and noops are passed to execute. Once add $,$2,$3 has written its result back, sub proceeds with two noops ahead of it in the pipeline.]

Timing graph
Time:          1    2    3    4    5    6    7    8    9    10   11   12   13
add $,$2,$3    IF   ID   EX   ME   WB
sub $,$,$5          IF   noop noop ID   EX   ME   WB
add $6,$,$7                        IF   ID   EX   ME   WB
lw $6,0($8)                             IF   ID   EX   ME   WB
sw $6,3($)                                   IF   noop noop ID   EX   ME   WB
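The stall timing can be reproduced with a few lines of bookkeeping. The sketch below rests on assumptions: `stall_schedule` is a hypothetical helper, the register numbers in the demo are invented stand-ins for the slide's operands, and a register written back in cycle c is taken to be readable by a decode in that same cycle c.

```python
def stall_schedule(program):
    """Return (fetch, decode_done, write_back) cycles per instruction.

    program is a list of (dest, sources); dest is a register number or
    None, sources is a tuple of register numbers.
    """
    ready = {}  # register -> cycle in which its newest value is written
    result = []
    fetch, prev_enter, prev_done = 1, 0, 0
    for n, (dest, sources) in enumerate(program):
        if n > 0:
            fetch = prev_enter  # IF frees up when the previous instr enters ID
        enter = max(fetch + 1, prev_done + 1)  # ID must be free as well
        done = max([enter] + [ready.get(s, 0) for s in sources])
        wb = done + 3  # EX, ME and WB follow without further stalls
        if dest is not None:
            ready[dest] = wb
        result.append((fetch, done, wb))
        prev_enter, prev_done = enter, done
    return result


# Hypothetical register numbers standing in for the slide's operands.
demo = [
    (1, (2, 3)),     # add $1,$2,$3
    (4, (1, 5)),     # sub $4,$1,$5  waits for $1
    (6, (2, 7)),     # add $6,$2,$7
    (6, (8,)),       # lw  $6,0($8)
    (None, (6, 9)),  # sw  $6,3($9)  waits for $6 from lw
]
schedule = stall_schedule(demo)
print(schedule)
```

With these dependences the last write-back lands in cycle 13, against 9 cycles for five instructions in a hazard-free pipeline.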

Problems with the second solution
CPI is still the same as before, so there is no improvement in performance.
The only improvements are in code size and in the fact that the compiler is no longer responsible for detecting data hazards.
In fact, the system now runs slower. Why?

The third solution
Detect the data hazard.
The add instruction calculates its result in the execute cycle.
Forward the result to the decode stage of the sub instruction.
    Therefore sub does not need to wait until the result is written back into the register file.
More control is needed, and the result must be held somewhere other than the register file.

The third solution
Detect: same as detect and stall
    Except that all hazards are treated differently
    (i.e., you can't logical-OR the hazard signals)
Forward:
    New bypass datapaths route computed data to where it is needed
    New MUX and control to pick the right data
Beware: stalling may still be required even in the presence of forwarding
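The MUX control for forwarding can be sketched as follows. Note the assumptions: this is the common textbook arrangement that forwards into the ALU inputs, which is not necessarily the bypass wiring drawn on these slides, and the names `forward_select` and `load_use_stall` are hypothetical.

```python
def forward_select(src, exe_mem_dest, mem_wb_dest):
    """Pick the source of one ALU operand.

    src is the register the instruction in EXE wants to read.
    exe_mem_dest / mem_wb_dest are the destination registers of the two
    older instructions (None when they write nothing).
    The newest matching value wins, so EXE/MEM is checked first.
    """
    if src != 0 and src == exe_mem_dest:
        return "EXE/MEM"
    if src != 0 and src == mem_wb_dest:
        return "MEM/WB"
    return "REGFILE"


def load_use_stall(exe_is_load, exe_dest, rega, regb):
    """Forwarding cannot cover a load followed at once by a user of its
    result: the data only exists after MEM, so one stall cycle remains."""
    return exe_is_load and exe_dest != 0 and exe_dest in (rega, regb)


# Hypothetical register numbers:
print(forward_select(1, exe_mem_dest=1, mem_wb_dest=1))  # newest value wins
print(forward_select(1, exe_mem_dest=4, mem_wb_dest=1))
print(load_use_stall(True, 6, 6, 9))
```

Because each operand gets its own selection, the individual hazard signals cannot simply be OR-ed together as they were for stalling.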

[Figures: datapath snapshots with forwarding (FW) paths, from the first half of cycle 3 to the first half of cycle 5, running add $,$2,$3, sub $,$,$5, add $6,$,$7, lw $6,0($8) and sw $6,3($). The hazard between add and sub is detected in the first half of cycle 3 and the result is forwarded instead of stalling. New hazards are flagged in cycles 4 and 5 (marked H and H2) and handled the same way.]

What else can go wrong in our pipelined CPU?
Control hazards
Exceptions:
    First of all, what are exceptions?
    And how do you handle exceptions in a pipelined processor with 5 instructions in flight?

Control Hazard
What is a control hazard?
How does the pipelined CPU handle control hazards?

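[Figure: pipelined datapath with beq/bne support. The branch target is computed from PC+4 and the shifted immediate, the eq? test result is latched with it, and the control unit and ALU control unit drive the stages.]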

What happens in executing BEQ?
Fetch: read the instruction from memory
Decode: read the source operands from the register file
Execute: calculate the target address and test for equality
Memory: send the target to the PC if the test is equal
Write back: nothing left to do
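The execute and memory steps can be written out as arithmetic. The sketch below assumes the standard MIPS convention that the 16-bit immediate is a signed word offset relative to PC+4 (hence the "Shift Left 2" block in the datapath); the function names and the example addresses are hypothetical.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, imm16):
    """Execute stage: target = PC + 4 + (sign-extended immediate << 2)."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def next_pc(pc, imm16, vala, valb):
    """Memory stage of beq: take the target only if the operands are equal."""
    return branch_target(pc, imm16) if vala == valb else pc + 4


# Hypothetical branch at address 24 with word offset -4 (0xFFFC):
print(branch_target(24, 0xFFFC))
print(next_pc(24, 0xFFFC, vala=7, valb=7))
print(next_pc(24, 0xFFFC, vala=7, valb=8))
```

For bne the same target is used and only the condition is inverted.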

Example
y=y*2;
x=0;
for(j=00;j>0;j--){
    x++;
    z--;
}
y--;
x=x*3;
z=z+x;

00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $2,$2,1
16 addi $,$,-1
20 addi $5,$5,-1
24 bne $5,$0,-4
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2
40 add $2,$2,$5
44 add $,$,$2

What do you observe from the example?
How many times is the branch taken? How many times is it not taken?
What happens each time the branch instruction is executed?
What happens next?

Surprise!
12 addi $2,$2,...
24 bne $5,$0,-4
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2

Cycle: 1    2    3    4    5    6    7    8    9
24     IF   ID   EXE  MEM  WB
28          IF   ID   EXE  MEM  WB
32               IF   ID   EXE  MEM  WB
36                    IF   ID   EXE  MEM  WB
12                         IF   ID   EXE  MEM  WB

Solutions
Avoid
    Make sure there are no hazards in the code.
Detect and Stall
    Delay fetch until the branch is resolved.
Speculate and Squash-if-Wrong
    Go ahead and fetch more instructions in case the guess is correct, but stop them if they shouldn't have been executed.

Avoid
Don't have branch instructions!
    Maybe a little impractical
Delay taking the branch:
    dbeq R1,R2,offset
    dbne R1,R2,offset
    Instructions at PC+4, PC+8, etc. will execute before deciding whether to fetch from PC+4+offset.
    (If no useful instructions can be placed after dbeq, noops must be inserted.)

Consider our example again
Original code:
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $2,$2,1
16 addi $,$,-1
20 addi $5,$5,-1
24 bne $5,$0,-4
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2
40 add $2,$2,$5
44 add $,$,$2

With noops after the branch:
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $2,$2,1
16 addi $,$,-1
20 addi $5,$5,-1
24 bne $5,$0,-4
28 noop
32 noop
36 noop
40 addi $3,$3,-1
44 add $5,$2,$0
48 add $2,$2,$2
52 add $2,$2,$5
56 add $,$,$2

Can we do better?
Fill the slots after the delayed branch with useful instructions:
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $5,$5,-1
16 dbne $5,$0,-2
20 addi $,$,-1
24 addi $2,$2,1
28 noop
32 addi $3,$3,-1
36 add $5,$2,$0
40 add $2,$2,$2
44 add $2,$2,$5
48 add $,$,$2

Moving the branch one instruction earlier:
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 dbne $5,$0,-1
16 addi $5,$5,-1
20 addi $,$,-1
24 addi $2,$2,1
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2
40 add $2,$2,$5
44 add $,$,$2
This code generates wrong results.

Problems with this solution
Old programs (legacy code) may not run correctly on new implementations
    Longer pipelines need more instructions/noops after a delayed beq
Programs get larger as noops are included
    Especially a problem for machines that try to execute more than one instruction every cycle
    Intel EPIC: often 25% - 40% of instructions are noops
Program execution is slower
    CPI equals 1, but some instructions are noops

Detect and Stall (hardware approach)
Detection:
    Must wait until decode
    Compare the opcode to beq
    Alternatively, this is just another control signal
Stall:
    Keep the current instruction in fetch
    Pass a noop to the decode stage (not execute!)

Our example again
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $2,$2,1
16 addi $,$,-1
20 addi $5,$5,-1
24 bne $5,$0,-4
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2
40 add $2,$2,$5
44 add $,$,$2

[Figures: datapath snapshots for detect and stall. Once bne $5,$0,-4 is recognized in decode, the PC is held at 28 and a noop is passed into the pipeline behind the branch in each following cycle. When the branch reaches the memory stage and is resolved as taken, the target is loaded into the PC and addi $2,$2,1 is fetched, with three noops ahead of it.]

What seems to be the problem?
CPI increases every time a branch is detected!
Is that necessary? Not always!
    The branch is taken only about ½ of the time
    Let's assume that it is NOT taken
        In this case, we can ignore the beq or bne (treat them like a noop)
        Keep fetching from PC + 4
    What if we are wrong?
        OK, as long as we do not COMPLETE any instructions we mistakenly executed (i.e., don't perform writeback)

Speculate and Squash
Speculate: assume not equal
    Keep fetching from PC+4 until we know that the branch is really taken
Squash: stop bad instructions if taken
    Send a noop to: Decode, Execute and Memory
    Send the target address to the PC

Our example again
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $2,$2,1
16 addi $,$,-1
20 addi $5,$5,-1
24 bne $5,$0,-4
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2
40 add $2,$2,$5
44 add $,$,$2

[Figure: squash on a taken branch. When bne $5,$0,-4 (address 24) is resolved as taken, the instructions fetched behind it from addresses 28, 32 and 36 are replaced by noops in decode, execute and memory, and the target address is sent to the PC.]

Performance problem, again
CPI increases every time a branch is taken!
    About ½ of the time
Is that necessary?
    No! But how can you fetch from the target before you even know the previous instruction is a branch, much less whether it is taken?
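The cost of each strategy can be estimated with a one-line formula: CPI = 1 + (fraction of instructions that pay the penalty) x (penalty in cycles). The sketch below is an illustration under assumptions: `effective_cpi` is a hypothetical helper, the 20% branch frequency is an invented figure, the ½ taken rate comes from the slide, and the 3-cycle penalty corresponds to the three noops sent to decode, execute and memory.

```python
def effective_cpi(branch_fraction, taken_fraction, penalty, stall_always):
    """CPI of a pipeline whose hazard-free CPI is 1.

    stall_always=True models detect and stall: every branch pays.
    stall_always=False models speculate and squash with a not-taken
    guess: only taken branches pay.
    """
    paying = branch_fraction if stall_always else branch_fraction * taken_fraction
    return 1 + paying * penalty


# Assumed workload: 20% branches (hypothetical), half of them taken.
cpi_stall = effective_cpi(0.20, 0.5, penalty=3, stall_always=True)
cpi_squash = effective_cpi(0.20, 0.5, penalty=3, stall_always=False)
print(round(cpi_stall, 2), round(cpi_squash, 2))
```

Under these assumed numbers, squashing only on taken branches halves the extra cycles, but the remaining penalty is still paid on every taken branch.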

[Figures: fetch stage extended with a small table of branch PC (bpc) and target pairs. The table remembers that the bne at address 24 branches to its target. On later fetches the PC is compared against bpc (eq?), and on a match the stored target is selected as the next PC instead of PC+4.]

Branch Prediction
Predict not taken: ~50% accurate
Predict backward taken: ~65% accurate
Predict same as last time: ~80% accurate
Pentium: ~85% accurate
Pentium Pro: ~92% accurate
Best paper designs: ~96% accurate
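The first three schemes are simple enough to try on a loop-closing branch. The sketch below is a toy experiment under assumptions: the helper `accuracy`, the three predictor functions and the 10-iteration loop run twice are all invented for illustration, so its numbers apply to this single branch only and are not the averages quoted above.

```python
def accuracy(outcomes, predictor):
    """Fraction of correct guesses; predictor sees only the last outcome."""
    correct = 0
    last = False  # assumption: no history yet, start as "not taken"
    for taken in outcomes:
        if predictor(last) == taken:
            correct += 1
        last = taken
    return correct / len(outcomes)


def predict_not_taken(last):
    return False


def predict_backward_taken(last):
    return True  # a loop-closing branch jumps backward, so guess taken


def predict_same_as_last(last):
    return last


# Hypothetical loop branch: taken 9 times, then falls through; run twice.
outcomes = ([True] * 9 + [False]) * 2

acc_not_taken = accuracy(outcomes, predict_not_taken)
acc_backward = accuracy(outcomes, predict_backward_taken)
acc_last = accuracy(outcomes, predict_same_as_last)
print(acc_not_taken, acc_backward, acc_last)
```

For this branch, "same as last time" misses twice per loop execution: once on entry and once on exit.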