Fundamentals of Computer Systems: A Pipelined MIPS Processor
Stephen A. Edwards, Columbia University, Summer 2015
Technical illustrations copyright © 2007 Elsevier
Sequential Laundry
[Figure: timeline of Alice, Bob, and Cindy doing their laundry one after another; horizontal axis is time.]
Pipelined Laundry
[Figure: timeline of Alice, Bob, and Cindy with overlapping loads; horizontal axis is time.]
Why let the washer or dryer sit idle when both can be running?
Single-Cycle Datapath Timing
[Figure: timing diagram with traces for Clk, PC, Register File, and Data.]
Single-Cycle vs. Pipelined Datapath
[Figure: the single-cycle datapath (PC, Instr, SrcA, SrcB, Zero, ReadData, Result) next to its pipelined counterpart. Pipeline registers split the datapath into stages, and each signal carries a stage suffix: PCF, InstrD, SrcAE, SrcBE, ZeroM, ReadDataW, ResultW.]
Corrected Pipelined Datapath
The register number to write (WriteReg) must stay synchronized with the result.
[Figure: pipelined datapath in which WriteRegE, WriteRegM, and WriteRegW travel through the pipeline registers alongside the data, so that WriteRegW arrives at the register file together with ResultW.]
Pipeline Control
Same control unit as the single-cycle processor; the control signals are delayed across the pipeline stages.
[Figure: pipelined datapath with the control unit decoding Op and Funct in Decode. Control signals pass through the pipeline registers: RegWriteD/E/M/W, MemtoRegD/E/M/W, MemWriteD/E/M, BranchD/E/M, ALUControlD/E, ALUSrcD/E, RegDstD/E. PCSrcM is derived from BranchM and ZeroM.]
Single-Cycle vs. Pipeline Timing
[Figure: instruction timelines against time in ps. Single-cycle: each instruction completes Fetch, Decode / Read Reg, Execute, Memory Read / Write, and Write Reg before the next one starts. Pipelined: instructions 1, 2, and 3 overlap, each starting one stage after the previous one.]
Pipelining Abstraction
[Figure: pipeline diagram over cycles 1 to 10; each instruction below occupies one stage per cycle and starts one cycle after the previous one.]
    lw  $s2, 40($0)
    add $s3, $t1, $t2
    sub $s4, $s1, $s5
    and $s5, $t5, $t6
    sw  $s6, 20($s1)
    or  $s7, $t3, $t4
Data Hazard
[Figure: pipeline diagram over cycles 1 to 8 for the sequence below.]
    add $s0, $s2, $s3
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5
The first instruction produces a result (in $s0) that later instructions need. The ALU has computed the value in cycle 3, but it won't be written to the register file until cycle 5.
[Slide images: Dukes of Hazzard, hazard lights, water hazard, biohazard.]
Eliminating Hazards at Compile-Time
[Figure: pipeline diagram over cycles 1 to 10 for the sequence below.]
    add $s0, $s2, $s3
    nop
    nop
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5
Insert nops to delay later instructions; it is sometimes possible to put useful work in those slots instead.
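The nop-insertion idea can be sketched as a small scheduling pass. This is only an illustration, not a real assembler pass: the tuple representation of instructions and the helper name insert_nops are hypothetical, and the gap of two slots assumes the register file is written early enough in the Writeback cycle that an instruction in Decode during the same cycle reads the new value (which is what the two nops on the slide imply).

```python
# Hypothetical representation: (text, destination register, source registers).
NOP = ("nop", None, ())

def insert_nops(program, gap=2):
    """Insert nops so that no instruction reads a register written by
    one of the `gap` instructions immediately before it.
    gap=2 is an assumption matching the two nops shown on the slide."""
    out = []
    for instr in program:
        sources = instr[2]
        # Look back at the most recent instructions, nearest first.
        for back in range(1, gap + 1):
            if back > len(out):
                break
            dest = out[-back][1]
            # Writes to $0 never create a hazard; nops have no destination.
            if dest not in (None, "$0") and dest in sources:
                out.extend([NOP] * (gap - back + 1))
                break
        out.append(instr)
    return out

program = [
    ("add $s0, $s2, $s3", "$s0", ("$s2", "$s3")),
    ("and $t0, $s0, $s1", "$t0", ("$s0", "$s1")),
    ("or $t1, $s4, $s0",  "$t1", ("$s4", "$s0")),
    ("sub $t2, $s0, $s5", "$t2", ("$s0", "$s5")),
]
scheduled = [text for text, _, _ in insert_nops(program)]
print("\n".join(scheduled))
```

A real compiler would first try to move independent instructions into those slots and fall back to nops only when none are available.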
Eliminating Data Hazards through Forwarding
[Figure: pipeline diagram over cycles 1 to 8 for the sequence below, with the ALU result of add forwarded to the ALU inputs of and and or.]
    add $s0, $s2, $s3
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5
Add logic to send data directly between instructions; the register file is still written eventually. Here, the result is available at the end of cycle 3 and needed in cycles 4 and 5.
Datapath with Data Forwarding
[Figure: pipelined datapath and control unit extended with a Hazard Unit. The Hazard Unit takes RsE, RtE, WriteRegM, WriteRegW, RegWriteM, and RegWriteW and drives ForwardAE and ForwardBE, the selects of the forwarding multiplexers in front of SrcAE and SrcBE.]
\[
\mathit{ForwardAE} =
\begin{cases}
10 & \text{if } \mathit{rsE} \neq 0 \;\land\; \mathit{rsE} = \mathit{WriteRegM} \;\land\; \mathit{RegWriteM}, \\
01 & \text{if } \mathit{rsE} \neq 0 \;\land\; \mathit{rsE} = \mathit{WriteRegW} \;\land\; \mathit{RegWriteW}, \\
00 & \text{otherwise.}
\end{cases}
\]
ForwardBE follows the same rule with rtE in place of rsE.
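The forwarding rule can be written out as a small function. This is a sketch of the selection logic only, not the hardware; the function name and argument names are illustrative, and the encoding (0b10 for the Memory stage, 0b01 for the Writeback stage, 0b00 for the register file value) is the conventional one and is assumed here.

```python
def forward_select(src_e, write_reg_m, reg_write_m, write_reg_w, reg_write_w):
    """Return the 2-bit select for one Execute-stage forwarding mux.

    src_e is rsE for ForwardAE or rtE for ForwardBE.
    Register 0 is never forwarded because $0 always reads as zero.
    """
    if src_e != 0 and src_e == write_reg_m and reg_write_m:
        return 0b10  # take the result sitting in the Memory stage (most recent)
    if src_e != 0 and src_e == write_reg_w and reg_write_w:
        return 0b01  # take the result sitting in the Writeback stage
    return 0b00      # no hazard: use the value read from the register file

# add $s0, ... is in Memory while and $t0, $s0, $s1 is in Execute ($s0 = 16).
forward_ae = forward_select(16, write_reg_m=16, reg_write_m=True,
                            write_reg_w=0, reg_write_w=False)
```

The Memory-stage check comes first so that, when both older instructions write the same register, the more recent value wins.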
Data Hazard that Demands a Stall
[Figure: pipeline diagram over cycles 1 to 8 for the sequence below; the dependence from lw to and is marked "Trouble!".]
    lw  $s0, 40($0)
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5
This data hazard can't be solved with forwarding because the loaded value is only available at the end of cycle 4, yet it is needed at the beginning of that cycle.
Stalling a Pipeline
[Figure: the lw, and, or, sub sequence over cycles 1 to 9. The and instruction repeats its Decode stage and the or instruction repeats its Fetch stage for one cycle (marked "Stall"), so the loaded $s0 can be forwarded when and executes.]
A stall tells an instruction to wait for a cycle before proceeding.
Stalling Hardware
[Figure: pipelined datapath with forwarding, extended for stalls. The PC register and the Fetch/Decode pipeline register gain enable inputs (EN) driven by StallF and StallD; the Decode/Execute register gains a clear input (CLR) driven by FlushE. The Hazard Unit additionally takes RsD, RtD, and MemtoRegE.]
\[
\mathit{lwstall} = \mathit{MemtoRegE} \;\land\; \bigl( (\mathit{rsD} = \mathit{rtE}) \;\lor\; (\mathit{rtD} = \mathit{rtE}) \bigr)
\]
\[
\mathit{StallF} = \mathit{StallD} = \mathit{FlushE} = \mathit{lwstall}
\]
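As a rough software model of the load-use stall condition, the check below compares the source registers of the instruction in Decode with the destination (rt) of a load in Execute. The function name and the example register numbers are illustrative only.

```python
def lw_stall(mem_to_reg_e, rs_d, rt_d, rt_e):
    """True when the instruction in Execute is a load (MemtoRegE set)
    whose destination rtE is a source of the instruction in Decode."""
    return bool(mem_to_reg_e and (rs_d == rt_e or rt_d == rt_e))

# lw $s0, 40($0) in Execute, and $t0, $s0, $s1 in Decode ($s0 = 16, $s1 = 17).
stall = lw_stall(mem_to_reg_e=True, rs_d=16, rt_d=17, rt_e=16)
stall_f = stall_d = flush_e = stall  # all three signals share the same value
```

Holding Fetch and Decode while clearing the Decode/Execute register is what turns the stalled cycle into a bubble in Execute.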
Control Hazards
[Figure: pipeline diagram over cycles 1 to 9 for the code below (addresses in hex). The three instructions following the branch are marked "Flush these instructions".]
    20  beq $t1, $t2, 40
    24  and $t0, $s0, $s1
    28  or  $t1, $s4, $s0
    2C  sub $t2, $s0, $s5
    30  ...
    64  slt $t3, $s2, $s3
Whether to branch isn't determined until the beginning of the fourth cycle; three instructions would be executed erroneously if the branch were taken.
Early Branch Resolution
[Figure: pipelined datapath in which the branch is resolved in Decode. An equality comparator on the register file outputs produces EqualD, which together with BranchD gives PCSrcD; the branch target PCBranchD is also computed in Decode, and the Fetch/Decode register gains a CLR input.]
Moving the comparison into Decode introduces another data hazard in the decode stage.
Control Hazards w/ Early Branch Resolution
[Figure: pipeline diagram over cycles 1 to 9 for the code below (addresses in hex). Only the instruction at 24 is marked "Flush this instruction".]
    20  beq $t1, $t2, 40
    24  and $t0, $s0, $s1
    28  or  $t1, $s4, $s0
    2C  sub $t2, $s0, $s5
    30  ...
    64  slt $t3, $s2, $s3
Handling Data and Control Hazards
[Figure: pipelined datapath with early branch resolution and the complete Hazard Unit. Inputs: RsD, RtD, RsE, RtE, BranchD, MemtoRegE, RegWriteE, RegWriteM, RegWriteW, and the WriteReg signals. Outputs: StallF, StallD, FlushE, ForwardAD, ForwardBD, ForwardAE, ForwardBE.]
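Forwarding and Stalling Logic
Forward a result to the branch-if-equal comparator if the branch would read its destination register:
\[
\mathit{ForwardAD} = (\mathit{rsD} \neq 0) \;\land\; (\mathit{rsD} = \mathit{WriteRegM}) \;\land\; \mathit{RegWriteM}
\]
\[
\mathit{ForwardBD} = (\mathit{rtD} \neq 0) \;\land\; (\mathit{rtD} = \mathit{WriteRegM}) \;\land\; \mathit{RegWriteM}
\]
Stall if the branch would test the result of an operation still in Execute or of a memory read still in Memory:
\[
\begin{aligned}
\mathit{branchstall} ={} & \bigl(\mathit{BranchD} \land \mathit{RegWriteE} \land (\mathit{WriteRegE} = \mathit{rsD} \lor \mathit{WriteRegE} = \mathit{rtD})\bigr) \\
{}\lor{} & \bigl(\mathit{BranchD} \land \mathit{MemtoRegM} \land (\mathit{WriteRegM} = \mathit{rsD} \lor \mathit{WriteRegM} = \mathit{rtD})\bigr)
\end{aligned}
\]
Stall for either a load-use hazard or a branch hazard:
\[
\mathit{StallF} = \mathit{StallD} = \mathit{FlushE} = \mathit{lwstall} \lor \mathit{branchstall}
\]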
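The decode-stage forwarding and branch-stall conditions can be modeled the same way in software. This is a sketch under the signal definitions given on this slide; the function names are illustrative, and lwstall is passed in as an already computed boolean.

```python
def forward_d(src_d, write_reg_m, reg_write_m):
    """ForwardAD (src_d = rsD) or ForwardBD (src_d = rtD): forward the
    Memory-stage result to the Decode-stage equality comparator."""
    return bool(src_d != 0 and src_d == write_reg_m and reg_write_m)

def branch_stall(branch_d, rs_d, rt_d,
                 reg_write_e, write_reg_e,
                 mem_to_reg_m, write_reg_m):
    """Stall a branch in Decode whose operand is still being computed
    in Execute, or is still being loaded in Memory."""
    pending_in_execute = reg_write_e and write_reg_e in (rs_d, rt_d)
    pending_load = mem_to_reg_m and write_reg_m in (rs_d, rt_d)
    return bool(branch_d and (pending_in_execute or pending_load))

def stall_signals(lwstall, branchstall):
    """StallF, StallD, and FlushE are all asserted for either kind of stall."""
    stall = lwstall or branchstall
    return {"StallF": stall, "StallD": stall, "FlushE": stall}

# A beq on $t1 (register 9) in Decode, with an instruction writing $t1 in Execute.
bs = branch_stall(True, rs_d=9, rt_d=10,
                  reg_write_e=True, write_reg_e=9,
                  mem_to_reg_m=False, write_reg_m=0)
signals = stall_signals(lwstall=False, branchstall=bs)
```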
Pipeline Performance CPI Example
Ideal CPI = 1; stalls increase this, but by how much?
Instructions in the SPECINT2000 benchmark:
    52% R-type
    25% Loads
    10% Stores
    11% Branches
     2% Jumps
If 40% of loads are used by the next instruction and 25% of branches are mispredicted, what is the average CPI?
Pipeline Performance CPI Example (solution)
Load CPI = 2 when the next instruction uses the loaded value; 1 otherwise
Branch CPI = 2 when mispredicted; 1 otherwise
Jump CPI = 2
\[
\begin{aligned}
\text{Average CPI} ={} & 0.52 \times 1 && \text{(R-type)} \\
{}+{} & 0.25 \times (2 \times 0.4 + 1 \times 0.6) && \text{(Loads)} \\
{}+{} & 0.10 \times 1 && \text{(Stores)} \\
{}+{} & 0.11 \times (2 \times 0.25 + 1 \times 0.75) && \text{(Branches)} \\
{}+{} & 0.02 \times 2 && \text{(Jumps)} \\
={} & 1.1475
\end{aligned}
\]
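The same weighted average can be checked numerically. The sketch below assumes the instruction mix of this example is 52% R-type, 25% loads, 10% stores, 11% branches, and 2% jumps, with 40% of loads used immediately and 25% of branches mispredicted; the variable names are illustrative.

```python
# Fraction of dynamic instructions in each class (assumed mix, sums to 1).
mix = {"r_type": 0.52, "load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02}

load_use_rate = 0.40    # loads whose value is used by the next instruction
mispredict_rate = 0.25  # branches that are mispredicted

# Cycles per instruction for each class: one extra cycle per stall or flush.
cpi = {
    "r_type": 1.0,
    "load": 2 * load_use_rate + 1 * (1 - load_use_rate),
    "store": 1.0,
    "branch": 2 * mispredict_rate + 1 * (1 - mispredict_rate),
    "jump": 2.0,
}

average_cpi = sum(mix[kind] * cpi[kind] for kind in mix)
print(f"average CPI = {average_cpi:.4f}")
```

Loads contribute the largest share of the extra cycles here because they are both frequent and often used immediately.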
Fully Bypassed Processor
[Figure: complete pipelined datapath with the control unit, early branch resolution (EqualD, PCSrcD, PCBranchD), enable and clear inputs on the pipeline registers, and the Hazard Unit driving StallF, StallD, FlushE, ForwardAD, ForwardBD, ForwardAE, and ForwardBE.]
Pipelined Processor Critical Path
[Figure: fully bypassed pipelined datapath with control unit and Hazard Unit.]

    Element               Parameter    Delay (ps)
    Register clk-to-q     t_pcq         30
    Register setup        t_setup       20
    Multiplexer           t_mux         25
    ALU                   t_ALU        200
    Memory read           t_memread    250
    Register file read    t_RFread     150
    Register file setup   t_RFsetup     20
    Equality comparator   t_eq          40
    AND gate              t_AND         15
    Memory write          t_memwrite   220
    Register file write   t_RFwrite    100

\[
T_c = \max \left\{
\begin{array}{ll}
t_{pcq} + t_{memread} + t_{setup} & \text{(Fetch)} \\
2\,(t_{RFread} + t_{mux} + t_{eq} + t_{AND} + t_{mux} + t_{setup}) & \text{(Decode)} \\
t_{pcq} + t_{mux} + t_{mux} + t_{ALU} + t_{setup} & \text{(Execute)} \\
t_{pcq} + t_{memwrite} + t_{setup} & \text{(Memory)} \\
2\,(t_{pcq} + t_{mux} + t_{RFwrite}) & \text{(Writeback)}
\end{array}
\right.
\]
\[
T_c = 2\,(150 + 25 + 40 + 15 + 25 + 20)\ \text{ps} = 550\ \text{ps}
\]
Why the factor of 2? We assume it takes half a cycle for a newly written register's value (WD3) to propagate to RD1 or RD2, i.e., when an earlier instruction writes a register used by the current one. The Decode and Writeback work must therefore each fit in half a cycle.
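The cycle-time calculation can be reproduced with a few lines of arithmetic. The element delays below are the values assumed for this example (in ps); the dictionary keys are illustrative names, not signal names from the datapath.

```python
# Element delays in picoseconds (assumed values for this example).
t = {
    "pcq": 30, "setup": 20, "mux": 25, "alu": 200, "memread": 250,
    "rf_read": 150, "rf_setup": 20, "eq": 40, "and": 15,
    "memwrite": 220, "rf_write": 100,
}

# Required clock period for each stage. Decode and Writeback share the
# register file within one cycle, so each must fit in half a cycle.
stage_period = {
    "fetch": t["pcq"] + t["memread"] + t["setup"],
    "decode": 2 * (t["rf_read"] + t["mux"] + t["eq"] + t["and"]
                   + t["mux"] + t["setup"]),
    "execute": t["pcq"] + t["mux"] + t["mux"] + t["alu"] + t["setup"],
    "memory": t["pcq"] + t["memwrite"] + t["setup"],
    "writeback": 2 * (t["pcq"] + t["mux"] + t["rf_write"]),
}

critical_stage = max(stage_period, key=stage_period.get)
cycle_time_ps = stage_period[critical_stage]
print(critical_stage, cycle_time_ps)
```

With these delays the Decode stage sets the clock period, which is the price of resolving branches early: the register read, forwarding mux, comparator, and PC select all sit in one half-cycle path.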
Pipelined Processor Performance
For a 100 billion-instruction task on our pipelined processor, each instruction takes 1.15 cycles on average. With a 550 ps clock period,
\[
\text{time} = 100 \times 10^{9} \times 1.15 \times 550\ \text{ps} \approx 63\ \text{seconds}
\]

    Processor       Execution Time (s)   Speedup
    Single-Cycle     92.5                1.0 (by definition)
    Multi-Cycle     133.9                0.7
    Pipelined        63.25               1.46
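The execution time and speedup follow from instruction count × CPI × clock period. The sketch below assumes a task of 100 billion instructions, a CPI of 1.15, a 550 ps clock, and a single-cycle baseline of 92.5 s; the variable names are illustrative.

```python
instructions = 100e9          # assumed task size: 100 billion instructions
cpi = 1.15                    # average cycles per instruction with stalls
clock_period_s = 550e-12      # 550 ps

pipelined_time_s = instructions * cpi * clock_period_s

single_cycle_time_s = 92.5    # baseline execution time for the same task
speedup = single_cycle_time_s / pipelined_time_s

print(f"pipelined time = {pipelined_time_s:.2f} s, speedup = {speedup:.2f}")
```

The speedup falls well short of the five-fold ideal for a five-stage pipeline because the stages are unbalanced, the pipeline registers add sequencing overhead, and hazards push the CPI above 1.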