Pipeline design. Mehran Rezaei

1 Pipeline design Mehran Rezaei

2 [Figure: single-cycle datapath with the control unit; control signals shown include ExtOp, RegDst, ALUSrc, MemtoReg, Branch, ALUOp and ALUCtr]

3 [Figure: the datapath divided into the five stages IF, ID, EXE, MEM, WB]

4 [Figure: the datapath divided into the five stages IF, ID, EXE, MEM, WB]

5 IF stage. IF/ID registers hold: PC+4, Inst.

6 ID stage. Reads the IF/ID registers (PC+4, Inst.). ID/EXE registers hold: PC+4, RegA, RegB, IMM, Rt, Rd.

7 EXE stage. Reads the ID/EXE registers (PC+4, RegA, RegB, IMM, Rt, Rd). EXE/MEM registers hold: Br. Tr. Add., ALUres, RegB, Rt/Rd.

8 MEM stage. Reads the EXE/MEM registers (Br. Tr. Add., ALUres, RegB, Rt/Rd). MEM/WB registers hold: Mem, ALUres, Rt/Rd.

9 WB stage. Reads the MEM/WB registers (Mem, ALUres, Rt/Rd).

10 [Figure: the datapath divided into the five stages IF, ID, EXE, MEM, WB]

11 Example
Run the following code on our pipeline machine:
add $,$0,$3
lw $,20($2)
sub $5,$6,$6
sw $7,8($8)
add $9,$,$3

12-16 [Figures: five consecutive datapath snapshots as the example instructions (add, lw, sub, sw, add) enter the pipeline one per cycle]

17 Recall: Single cycle control!
[Figure: single-cycle machine with Next PC logic, ideal memories, the register file (Rs, Rt, Rd fields; ports Ra, Rb, Rw) and the control unit, which sends control signals to the datapath and receives conditions from it]

18 Stationary Control
The Main Control generates the control signals during Reg/Dec.
Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later.
Control signals for Mem (MemWr, Branch) are used 2 cycles later.
Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later.
[Figure: control signals carried down the pipeline registers. The Main Control reads the IF/ID register. ID/Ex register: ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, MemtoReg, RegWr. Ex/Mem register: MemWr, Branch, MemtoReg, RegWr. Mem/Wr register: MemtoReg, RegWr.]

19 Datapath + Stationary Control
[Figure: pipelined datapath (Next PC, Inst. Mem, Decode, Reg. File, Exec, Mem Access, Reg. File) with the control fields (ex, me, wb, rw, v) moving from stage to stage alongside the instruction]

20 [Figure: single-cycle datapath with the control unit; control signals shown include ExtOp, RegDst, ALUSrc, MemtoReg, Branch, ALUOp and ALUCtr]

21 [Figure: the datapath divided into the five stages IF, ID, EXE, MEM, WB]

22 [Figure: the datapath divided into the five stages IF, ID, EXE, MEM, WB]

23 Pipeline timing diagram
add $,$0,$3    IF  ID  EXE MEM WB
lw $,20($2)        IF  ID  EXE MEM WB
sub $5,$6,$6           IF  ID  EXE MEM WB
sw $7,8($8)                IF  ID  EXE MEM WB
add $9,$,$3                    IF  ID  EXE MEM WB
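The staggered pattern in the timing diagram can be generated mechanically. The following is a minimal sketch, assuming an ideal pipeline with no hazards; the function names and the placeholder instruction labels are illustrative and do not come from the slides.

```python
STAGES = ["IF", "ID", "EXE", "MEM", "WB"]

def timing_diagram(instructions):
    """Return one text row per instruction for an ideal (hazard-free) pipeline.

    Instruction i (0-based) enters IF in cycle i + 1 and then advances
    one stage per cycle, so each row is shifted one column to the right.
    """
    width = max(len(text) for text in instructions)
    rows = []
    for i, text in enumerate(instructions):
        cells = ["   "] * i + [f"{stage:<3}" for stage in STAGES]
        rows.append(f"{text:<{width}}  " + " ".join(cells).rstrip())
    return rows

def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed to push n instructions through an ideal pipeline."""
    return n_stages + n_instructions - 1

if __name__ == "__main__":
    program = ["i1", "i2", "i3", "i4", "i5"]  # placeholder labels
    print("\n".join(timing_diagram(program)))
    print("total cycles:", total_cycles(len(program)))
```

With five instructions and five stages the sketch reports 9 cycles, compared with 25 if each instruction had to finish before the next one started.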

24 Hazards
What are they?
How do you detect them?
How do you deal with them?

25 [Figure: pipelined datapath with the pipeline-register fields labeled: PC+4, instruction, valA, valB, IMM, dest, target, ALUres, eq?, mdata]

26 Pipeline cycles for add
IF - Fetch: read instruction from memory
ID - Decode: read source operands from the register file
EXE - Execute: calculate sum
MEM - Memory: pass results to next stage
WB - Write back: write sum (ALUres) into register file

27 Hazard
add $1,$2,$3   IF  ID  EXE MEM WB        (register one is written)
sub $,$5,$1        IF  ID  EXE MEM WB    (register one is read)
If we are not careful, we will read the wrong value!
If sub is supposed to read the updated value (not the stale one), how many instructions should there be between add and sub?

28 [Figure: datapath snapshot with add $,$2,$3 followed by sub $,$5,$ in the pipeline]

29 Hazard
add $1,$2,$3   IF  ID     EXE    MEM WB                  (write)
sub $,$5,$1        IF     hazard hazard ID  EXE MEM WB   (read)

30 Class work
What are the data hazards in this piece of code?
add $,$2,$3
sub $2,$,$3
xor $,$3,$5
nor $5,$2,$
add $5,$3,$5

31 What to do with them?
Avoid: make sure there are no hazards in the code.
Detect and Stall: if hazards exist, stall the processor until they go away.
Detect and Forward: if hazards exist, fix up the pipeline to get the correct value (if possible).

32 First Approach: avoid all hazards
Assume the programmer (or the compiler) knows about the processor implementation.
Make sure no hazards exist.
Suppose we have an instruction called noop. Put noops between any dependent instructions.
add $1,$2,$3   IF  ID  EXE MEM WB
noop
noop
sub $,$5,$1                IF  ID  EXE MEM WB

33 What is the problem with this solution?
Old programs (legacy code) may not run correctly on new implementations.
  Longer pipelines need more noops.
Programs get larger as noops are included.
  Especially a problem for machines that try to execute more than one instruction every cycle.
  Intel EPIC: often 25% - 40% of instructions are noops.
Program execution is slower.
  CPI is 1, but some instructions are noops.
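The last point, that every slot is filled but not every slot does useful work, can be put into numbers. This is a small sketch under a simplifying assumption that is not in the slides: the pipeline issues one instruction (useful or noop) per cycle, and a fraction `noop_fraction` of the issued instructions are noops. The function name is illustrative.

```python
def cpi_per_useful_instruction(noop_fraction, cpi_per_slot=1.0):
    """Cycles spent per useful instruction when some issued instructions are noops.

    If a fraction f of all issued instructions are noops, only (1 - f) of the
    issue slots do useful work, so the effective CPI is cpi_per_slot / (1 - f).
    """
    if not 0.0 <= noop_fraction < 1.0:
        raise ValueError("noop_fraction must be in [0, 1)")
    return cpi_per_slot / (1.0 - noop_fraction)

if __name__ == "__main__":
    for f in (0.0, 0.25, 0.5):
        print(f"noop fraction {f:.2f}: effective CPI {cpi_per_useful_instruction(f):.2f}")
```

Under this model, a program in which a quarter of the instructions are noops takes about a third longer per useful instruction, even though the measured CPI stays at 1.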

34 The second solution
Detect:
  Compare regA with previous DestRegs (5 bit operand fields)
  Compare regB with previous DestRegs (5 bit operand fields)
Stall:
  Keep current instructions in fetch and decode
  Pass a noop to execute
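The detect and stall rules above can be sketched in software. This is an illustrative model, not the hardware from the slides: the `Instr` tuple, the function names and the register numbers in the usage example are all assumptions. Register 0 is skipped because it always reads as zero in MIPS.

```python
from typing import NamedTuple, Optional

class Instr(NamedTuple):
    op: str
    dest: Optional[int]   # destination register number, or None
    reg_a: Optional[int]  # first source register, or None
    reg_b: Optional[int]  # second source register, or None

NOOP = Instr("noop", None, None, None)

def hazard_detected(decoding, in_flight):
    """Compare regA and regB of the decoding instruction with previous DestRegs.

    in_flight holds the older instructions that have not yet written back.
    """
    pending = {i.dest for i in in_flight if i.dest is not None and i.dest != 0}
    return decoding.reg_a in pending or decoding.reg_b in pending

def decode_step(decoding, in_flight):
    """Return (instruction passed to execute, whether fetch and decode are held)."""
    if hazard_detected(decoding, in_flight):
        return NOOP, True   # stall: keep the instruction in decode, pass a noop
    return decoding, False

if __name__ == "__main__":
    add = Instr("add", dest=1, reg_a=2, reg_b=3)  # hypothetical register numbers
    sub = Instr("sub", dest=4, reg_a=1, reg_b=5)
    print(decode_step(sub, in_flight=[add]))      # noop is passed, pipeline held
    print(decode_step(sub, in_flight=[]))         # sub proceeds once add is done
```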

35-37 [Figures: pipelined datapath with the pipeline-register fields labeled: PC+4, instruction, valA, valB, IMM, dest, target, ALUres, eq?, mdata]

38 Hazard
add $1,$2,$3   IF  ID     EXE    MEM WB                  (write)
sub $,$5,$1        IF     hazard hazard ID  EXE MEM WB   (read)

39-43 [Figures: datapath snapshots for the first and second halves of cycles 1 through 3, as add and then sub enter the pipeline; the hazard is detected in the first half of cycle 3]

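44-45 [Figures: hazard detection logic: comparators check regA and regB of the instruction in the IF/ID register against the destination registers of the instructions further down the pipeline, and raise "Hazard detected"]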

46 What Next?
Detect:
  Compare regA with previous DestRegs (5 bit operand fields)
  Compare regB with previous DestRegs (5 bit operand fields)
Stall:
  Keep current instructions in fetch and decode
  Pass a noop to execute

47-51 [Figures: datapath snapshots from the second half of cycle 3 through the second half of cycle 5; sub is held in decode while noops are passed to execute, and it proceeds once add has written its result]

52 Timing graph
add $,$2,$3    IF ID EX ME WB
sub $,$,$5     IF noop noop ID EX ME WB
add $6,$,$7    IF ID EX ME WB
lw $6,0($8)    IF ID EX ME WB
sw $6,3($)     IF noop noop ID EX ME
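The stall counts in such a timing graph can be reproduced with a small scheduler. This is a sketch under stated assumptions: instructions issue in order, a register written in WB can be read by an instruction that is in ID during the same cycle, and WB comes three cycles after ID. The program encoding and the register numbers in the usage example are hypothetical.

```python
def stall_schedule(program):
    """Schedule a program under detect-and-stall.

    program is a list of (dest, sources) register-number tuples in program
    order. Returns one dict per instruction with the cycle in which it reads
    its registers (id), the cycle of its write back (wb), and the number of
    noops injected ahead of it (stalls).
    """
    write_cycle = {}  # register -> cycle in which its newest value is written
    rows = []
    prev_id = 1       # so that the first instruction decodes in cycle 2
    for dest, sources in program:
        earliest_id = prev_id + 1
        ready = max((write_cycle.get(r, 0) for r in sources if r != 0), default=0)
        decode = max(earliest_id, ready)
        rows.append({"id": decode, "wb": decode + 3, "stalls": decode - earliest_id})
        if dest is not None and dest != 0:
            write_cycle[dest] = decode + 3
        prev_id = decode
    return rows

if __name__ == "__main__":
    program = [
        (1, (2, 3)),  # hypothetical: r1 = r2 + r3
        (4, (1, 5)),  # reads r1 right away, so it stalls
        (6, (7, 8)),  # independent
    ]
    for row in stall_schedule(program):
        print(row)
```

A back-to-back dependence costs two noops in this model, which matches the two "noop" slots shown for the dependent instruction in the timing graph.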

53 Problems with the second solution
CPI is still the same as before, so there is no improvement in performance.
The only improvements are in code size and in the fact that the compiler is no longer responsible for detecting data hazards.
In fact, the system now runs slower. Why?

54 The third solution
Detect the data hazard.
The add instruction calculates its result in the execute cycle.
Forward the result to the decode stage of the sub instruction.
Therefore sub does not need to wait until the result is written back into the register file.
More control is needed: the result has to be taken from somewhere other than the register file.

55 The third solution
Detect: same as detect and stall
  Except that all hazards are treated differently, i.e., you can't logical-OR the hazard signals
Forward:
  New bypass datapaths route computed data to where it is needed
  New MUX and control to pick the right data
Beware: stalling may still be required even in the presence of forwarding
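The "new MUX and control" can be sketched as a selection function. This is an illustrative model, not the datapath from the slides: `select_operand`, the bypass tuple layout and the example values are assumptions. It shows why the hazard signals cannot simply be OR-ed together (each one selects a different data source) and why a stall can still be needed when the matching value is not ready yet, for example a load that has not reached memory.

```python
def select_operand(src_reg, regfile_value, bypasses):
    """Pick the value for one source operand.

    bypasses lists the older in-flight instructions, youngest first, as
    (dest_reg, value, ready) tuples. The youngest matching producer wins
    because it holds the most recent value of the register.
    Returns ("forward", value), ("stall", None) or ("regfile", value).
    """
    for dest_reg, value, ready in bypasses:
        if dest_reg == src_reg and dest_reg != 0:  # register 0 is never forwarded
            if not ready:
                return ("stall", None)             # value not computed yet
            return ("forward", value)
    return ("regfile", regfile_value)

if __name__ == "__main__":
    # Hypothetical situation: r1 is being produced by the two previous instructions.
    bypasses = [(1, 10, True), (1, 7, True)]
    print(select_operand(1, 99, bypasses))            # youngest producer: 10
    print(select_operand(5, 42, bypasses))            # no match: register file
    print(select_operand(6, 0, [(6, None, False)]))   # producer not ready: stall
```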

56-60 [Figures: datapath snapshots with forwarding (FW) paths, from the first half of cycle 3 to the first half of cycle 5, for the sequence add, sub, add, lw, sw; a hazard is detected in cycle 3 and new hazards in cycles 4 and 5, each resolved by selecting a forwarded value]

61 What else can go wrong in our pipelined CPU?
Control hazards
Exceptions: First of all, what are exceptions? And how do you handle exceptions in a pipelined processor with 5 instructions in flight?

62 Control Hazard What is a control hazard? How does the pipelined CPU handle control hazards?

63 [Figure: pipelined datapath for beq and bne, with the Control Unit, the ALU Unit, the branch target adder (Shift Left 2) and the eq? test]

64 What happens in executing BEQ?
Fetch: read instruction from memory
Decode: read source operands from reg
Execute: calculate target address and test for equality
Memory: send target to PC if test is equal
Write back: nothing left to do

65 Example
y=y*2;
x=0;
for(j=00;j>0;j--){ x++; z--; }
y--;
x=x*3;
z=z+x;

00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $2,$2,1
16 addi $,$,-1
20 addi $5,$5,-1
24 bne $5,$0,-
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2
40 add $2,$2,$5
44 add $,$,$2

66 What do you observe from the example?
How many times is the branch taken? How many times is it not taken?
What happens each time the branch instruction is executed?
What happens next?

67 Surprise!
12 addi $2,$2,...
24 bne $5,$0,-
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2

24  IF  ID  EXE MEM WB
28      IF  ID  EXE MEM WB
32          IF  ID  EXE MEM WB
36              IF  ID  EXE MEM WB
12                  IF  ID  EXE MEM WB

68 Solutions
Avoid: make sure there are no hazards in the code.
Detect and Stall: delay fetch until the branch is resolved.
Speculate and Squash-if-Wrong: go ahead and fetch more instructions in case the guess is correct, but stop them if they shouldn't have been executed.

69 Avoid
Don't have branch instructions! Maybe a little impractical.
Delay taking the branch:
  dbeq R1,R2,offset
  dbne R1,R2,offset
Instructions at PC+4, PC+8, etc. will execute before deciding whether to fetch from PC+4+offset. (If no useful instructions can be placed after dbeq, noops must be inserted.)

70 Consider our example again

Original code:
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $2,$2,1
16 addi $,$,-1
20 addi $5,$5,-1
24 bne $5,$0,-
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2
40 add $2,$2,$5
44 add $,$,$2

With noops after the branch:
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $2,$2,1
16 addi $,$,-1
20 addi $5,$5,-1
24 bne $5,$0,-
28 noop
32 noop
36 noop
40 addi $3,$3,-1
44 add $5,$2,$0
48 add $2,$2,$2
52 add $2,$2,$5
56 add $,$,$2

71 Can we do better? 00 add $3,$3,$3 0 add $2,$0,$0 08 li $5,00 2 addi $5,$5,- 6 dbne $5,$0,-2 20 addi $,$,- 2 addi $2,$2, 28 noop 32 addi $3,$3,- 36 add $5,$2,$0 0 add $2,$2,$2 add $2,$2,$5 8 add $,$,$2 00 add $3,$3,$3 0 add $2,$0,$0 08 li $5,00 2 dbne $5,$0,- 6 addi $5,$5,- 20 addi $,$,- 2 addi $2,$2, 28 addi $3,$3,- 32 add $5,$2,$0 36 add $2,$2,$2 0 add $2,$2,$5 add $,$,$2 This code generates wrong results.

72 Problems with this solution
Old programs (legacy code) may not run correctly on new implementations.
  Longer pipelines need more instructions/noops after delayed beq.
Programs get larger as noops are included.
  Especially a problem for machines that try to execute more than one instruction every cycle.
  Intel EPIC: often 25% - 40% of instructions are noops.
Program execution is slower.
  CPI equals 1, but some instructions are noops.

73 Detect and Stall (hardware approach)
Detection:
  Must wait until decode
  Compare opcode to beq
  Alternately, this is just another control signal
Stall:
  Keep current instructions in fetch
  Pass noop to decode stage (not execute!)

74 Our example again
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $2,$2,1
16 addi $,$,-1
20 addi $5,$5,-1
24 bne $5,$0,-
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2
40 add $2,$2,$5
44 add $,$,$2

75-79 [Figures: datapath snapshots of detect and stall for the bne $5,$0 instruction: fetch is held while the branch moves down the pipeline with noops behind it; once the branch resolves, fetch resumes at the loop's addi $2,$2 instruction with three noops in the pipeline]

80 What seems to be the problem?
CPI increases every time a branch is detected!
Is that necessary? Not always! The branch is taken only about ½ of the time.
Let's assume that it is NOT taken.
  In this case, we can ignore the beq or bne (treat it like a noop) and keep fetching from PC+4.
What if we are wrong?
  That is OK, as long as we do not COMPLETE any instructions we mistakenly executed (i.e., don't perform writeback).

81 Speculate and Squash
Speculate: assume not equal
  Keep fetching from PC+4 until we know that the branch is really taken
Squash: stop bad instructions if taken
  Send a noop to: Decode, Execute and Memory
  Send target address to PC
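The cost of the two hardware approaches can be compared with a simple CPI model. This is a sketch under assumptions that go beyond the slides: the base CPI is 1, a penalized branch wastes three slots (the three noops sent to Decode, Execute and Memory), and the 20% branch frequency in the usage example is made up for illustration.

```python
def cpi_with_branch_penalty(branch_fraction, penalized_fraction, penalty_cycles=3):
    """Average CPI when some branches waste penalty_cycles pipeline slots.

    branch_fraction:    fraction of executed instructions that are branches
    penalized_fraction: fraction of those branches that pay the penalty
                        (1.0 for detect and stall, the taken rate for
                        speculate-not-taken and squash)
    """
    return 1.0 + branch_fraction * penalized_fraction * penalty_cycles

if __name__ == "__main__":
    branches = 0.2  # hypothetical branch frequency
    print("detect and stall:    ", cpi_with_branch_penalty(branches, 1.0))
    print("speculate and squash:", cpi_with_branch_penalty(branches, 0.5))
```

With these numbers, stalling on every branch gives a CPI of about 1.6, while squashing only when the branch is taken (half of the time) gives about 1.3.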

82 Our example again
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $2,$2,1
16 addi $,$,-1
20 addi $5,$5,-1
24 bne $5,$0,-
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2
40 add $2,$2,$5
44 add $,$,$2

83 [Figure: datapath snapshot of the squash: the bne at 24 turns out to be taken, and the instructions fetched after it (28, 32, 36) are replaced by noops in the decode, execute and memory stages]

84 Performance problem, again
CPI increases every time a branch is taken! (About ½ of the time.)
Is that necessary? No! But how can you fetch from the target before you even know that the previous instruction is a branch, much less whether it is taken?

85-87 [Figures: pipelined datapath extended with a table of bpc/target entries next to the PC, shown for the bne $5,$0 instruction at address 24]

88 Branch Prediction
Predict not taken: ~50% accurate
Predict backward taken: ~65% accurate
Predict same as last time: ~80% accurate
Pentium: ~85% accurate
Pentium Pro: ~92% accurate
Best paper designs: ~96% accurate
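The simpler schemes in this list are easy to try out on a loop branch. The following sketch measures the accuracy of a fixed prediction and of "predict same as last time" on a list of branch outcomes; the function names and the ten-iteration loop are illustrative, and the resulting numbers describe only this toy trace, not the accuracies quoted above.

```python
def accuracy_static(outcomes, predict_taken):
    """Accuracy of always predicting taken (True) or not taken (False)."""
    return sum(taken == predict_taken for taken in outcomes) / len(outcomes)

def accuracy_last_time(outcomes, initial_prediction=False):
    """Accuracy of predicting that the branch does what it did last time."""
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        correct += (prediction == taken)
        prediction = taken  # remember the latest outcome
    return correct / len(outcomes)

if __name__ == "__main__":
    # A loop-closing branch in a 10-iteration loop: taken 9 times, then falls through.
    loop_branch = [True] * 9 + [False]
    print("predict not taken: ", accuracy_static(loop_branch, False))
    print("predict taken:     ", accuracy_static(loop_branch, True))
    print("same as last time: ", accuracy_last_time(loop_branch))
```

On this trace the last-time predictor misses twice, once on the first iteration and once on loop exit, which is why it trails a fixed "taken" guess for a single short loop but adapts to branches whose behavior is not known in advance.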
