Instruction Level Parallelism: Pipelining, Hazards (Appendix C, HPe)
Outline: Pipelining, Hazards; Branch prediction; Static and Dynamic Scheduling; Speculation; Compiler techniques, VLIW; Limits of ILP.
Pipelining Basics
Implementation of RISC ISA - Stages: Instruction Fetch (IF); Instruction Decode/Register Fetch (ID), with fixed-field decoding; Execution/Effective Address (EX); Memory Access (MEM); Write Back (WB).
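MIPS Datapath. [Figure: five-stage datapath: Instruction Fetch (IF), Instruction Decode/Register Fetch, Execute/Address Calculation (EX), Memory Access (MEM), Write Back (WB). Labeled elements: PC, +4 adder, NPC, instruction memory (IM), IR, register file (rs, rt, rd), A and B latches, sign extend (16 to 32 bits), Imm, multiplexers, ALU, Zero?, Cond, ALU Output, data memory (DM), LMD.]
Multiple Issue Integer Pipeline. [Figure: stages IF, EX, MEM, WB. Labeled elements: IM, IR0, IR1, RF Read, A, B, Zero?, DM, RF Write.]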
Pipeline Performance: An unpipelined processor has a 1 ns clock cycle. ALU operations and branches take 4 cycles, and memory operations take 5 cycles. The relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, due to clock skew and setup, pipelining adds 0.2 ns of overhead to the clock. What is the speedup? Average instruction execution time = Clock cycle × Average CPI, where $\mathrm{CPI} = \sum_{i=1}^{n} \frac{IC_i}{\text{Instruction count}} \times \mathrm{CPI}_i$.
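The arithmetic for this exercise can be sketched in a few lines of Python. This is a worked illustration, not part of the original slides; the variable names are made up here, and it assumes the pipelined processor ideally completes one instruction per (lengthened) clock cycle.

```python
# Worked sketch of the pipeline speedup exercise.
# Assumption (not stated on the slide): the pipelined machine has an ideal CPI of 1,
# so its average instruction time equals its clock cycle plus the overhead.

CLOCK_NS = 1.0      # unpipelined clock cycle
OVERHEAD_NS = 0.2   # clock skew and setup overhead added by pipelining

# (relative frequency, cycles) for ALU operations, branches, memory operations
mix = [(0.40, 4), (0.20, 4), (0.40, 5)]

# CPI = sum over classes of (IC_i / instruction count) * CPI_i
average_cpi = sum(freq * cycles for freq, cycles in mix)

unpipelined_time_ns = CLOCK_NS * average_cpi
pipelined_time_ns = CLOCK_NS + OVERHEAD_NS
speedup = unpipelined_time_ns / pipelined_time_ns

print(f"average CPI      = {average_cpi:.2f}")
print(f"unpipelined time = {unpipelined_time_ns:.2f} ns")
print(f"pipelined time   = {pipelined_time_ns:.2f} ns")
print(f"speedup          = {speedup:.2f}")
```

Under these assumptions the unpipelined average instruction time is 4.4 ns, the pipelined time is 1.2 ns, and the speedup is roughly 3.7.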
Dependences; Pipeline Hazards: Structural & Data
Outline: Data dependences; Name dependences; Structural hazards; Data hazards: Stalling, Forwarding.
Basic Block: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
Dependence. Source loop: for (i=0; i<=999; i=i+1) x[i] = x[i] + a;
Data dependence and name dependence in the compiled loop:
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
Name dependence: antidependence, output dependence. Register renaming.
Hazard:
    ADD.D  F4, F0, F2
    ADD.D  F4, F6, F8
Overlap during execution could change the order of access to the operand involved in the dependence.
Hazards. Program order: ILP preserves program order only where it affects the outcome of the program. Structural hazards: resource conflicts. Data hazards: RAW, WAW, WAR. Control hazards: whether or not an instruction should be executed depends on a control decision made by an earlier instruction.
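The three data hazard classes can be told apart by comparing the registers two instructions read and write. The following is a minimal sketch, not from the slides: the `Instr` type and `classify` helper are hypothetical names, only register operands are modeled (memory dependences are ignored), and the read/write sets are entered by hand rather than decoded from the assembly text.

```python
from typing import FrozenSet, NamedTuple, Set


class Instr(NamedTuple):
    """Hypothetical record: assembly text plus the registers written and read."""
    text: str
    writes: FrozenSet[str]
    reads: FrozenSet[str]


def classify(first: Instr, second: Instr) -> Set[str]:
    """Return the hazard classes possible when `second` follows `first` in program order."""
    found = set()
    if first.writes & second.reads:
        found.add("RAW")  # true data dependence: second reads what first writes
    if first.reads & second.writes:
        found.add("WAR")  # antidependence: second overwrites what first reads
    if first.writes & second.writes:
        found.add("WAW")  # output dependence: both write the same register
    return found


load = Instr("L.D F0, 0(R1)", frozenset({"F0"}), frozenset({"R1"}))
add = Instr("ADD.D F4, F0, F2", frozenset({"F4"}), frozenset({"F0", "F2"}))
store = Instr("S.D F4, 0(R1)", frozenset(), frozenset({"F4", "R1"}))
bump = Instr("DADDUI R1, R1, #-8", frozenset({"R1"}), frozenset({"R1"}))
add2 = Instr("ADD.D F4, F6, F8", frozenset({"F4"}), frozenset({"F6", "F8"}))

print(classify(load, add))    # {'RAW'} on F0
print(classify(store, bump))  # {'WAR'} on R1
print(classify(add, add2))    # {'WAW'} on F4
```

The WAR and WAW cases are name dependences, which is why renaming the destination register of the second instruction makes them disappear while the RAW case remains.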
Structural Hazard. [Figure: pipeline timing diagram over clock cycles 1 to 9 for instructions i1 to i5. With a unified memory, an instruction fetch and the MEM stage of an earlier instruction need the memory in the same cycle: HAZARD.] Examples: unified memory; register file in WB.
Cost of a Load Structural Hazard: Data references constitute 40% of the instruction mix. Ideal CPI = 1 (with no structural hazards). Assume that the processor with the structural hazard has a clock rate that is 1.1 times that of the processor without the hazard. Which processor is faster, and by how much? Avg. instruction time = CPI × Clock cycle time. For the processor without the hazard: $\text{Avg. instruction time}_{ideal} = \mathrm{CPI} \times \text{Clock cycle time}_{ideal}$, with CPI = 1.
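Cost of a Load Structural Hazard: Avg. instruction time = CPI × Clock cycle time. For the processor with the hazard: $\text{Avg. instruction time} = (1 + 0.4 \times 1) \times \frac{\text{Clock cycle time}_{ideal}}{1.1} \approx 1.27 \times \text{Clock cycle time}_{ideal}$. The processor without the structural hazard is therefore about 1.27 times faster.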
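The comparison can be checked numerically. This is an illustrative sketch with made-up variable names; it assumes, as the formula on the slide does, that each data reference costs exactly one stall cycle on the processor with the hazard.

```python
# Sketch of the load structural hazard cost calculation.
# Times are expressed in units of the ideal processor's clock cycle.

DATA_REF_FRACTION = 0.4   # data references in the instruction mix
STALL_CYCLES = 1          # assumed stall per data reference (from the slide's formula)
CLOCK_RATE_RATIO = 1.1    # hazard processor's clock rate relative to the ideal one
IDEAL_CPI = 1.0

time_ideal = IDEAL_CPI * 1.0

cpi_with_hazard = IDEAL_CPI + DATA_REF_FRACTION * STALL_CYCLES
cycle_time_with_hazard = 1.0 / CLOCK_RATE_RATIO  # faster clock, shorter cycle
time_with_hazard = cpi_with_hazard * cycle_time_with_hazard

ratio = time_with_hazard / time_ideal
print(f"CPI with hazard       = {cpi_with_hazard:.2f}")
print(f"time with hazard      = {time_with_hazard:.2f} x ideal cycle time")
print(f"ideal processor is {ratio:.2f}x faster")
```

Even with a 10% faster clock, the extra stall cycle on 40% of instructions leaves the processor with the hazard slower overall.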
Data Hazards: R1 is updated in the WB stage. Example instructions: R1 ← R2 + R3 followed by R4 ← R1 + R5. [Figure: MIPS datapath with an IR in each pipeline stage; labeled elements as in the MIPS Datapath figure.]
Stalled Stages and Pipeline Bubbles. Time (clock cycles) runs left to right. R1 ← R2 + R3 proceeds through IF, ID, EX, MA, WB. The dependent R4 ← R1 + R5 is held in stalled stages until R1 has been written, and the following instruction waits in IF.
Resource usage per clock cycle:
    cycle  1    2    3    4    5    6    7    8    9    10   11   12
    IF     I1   I2   I3   I3   I3   I3   I4   I5
    ID          I1   I2   I2   I2   I2   I3   I4   I5
    EX               I1   nop  nop  nop  I2   I3   I4   I5
    MA                    I1   nop  nop  nop  I2   I3   I4   I5
    WB                         I1   nop  nop  nop  I2   I3   I4   I5
How to overcome this hazard?
Resolving Data Hazards: stalling one of the instructions; data forwarding (bypassing); scheduling hazardous instructions away from each other.
Stalling (Interlocking). Example instructions: R1 ← R2 + R3 followed by R4 ← R1 + R5. [Figure: MIPS datapath with an IR in each pipeline stage and a Stall Condition block that inserts a NOP; other labeled elements as in the MIPS Datapath figure.]
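The effect of interlocking on the two-instruction example can be sketched with a small cycle-counting model. This is an illustration only, not the slide's hardware: `stall_cycles` is a hypothetical helper, the pipeline is the five-stage IF, ID, EX, MA, WB sequence with no forwarding, and it assumes a register written in WB cannot be read until the following cycle.

```python
def stall_cycles(program):
    """Return the stall cycles each instruction spends waiting in ID.

    program: list of (dest, srcs) tuples, where dest is a register name or None
    and srcs is a tuple of register names.
    Assumptions: in-order five-stage pipeline, no forwarding, and a register
    written in WB is readable only from the next cycle on.
    """
    wb_cycle_of = {}     # register -> cycle in which its latest producer is in WB
    prev_id_cycle = 1    # so that the first instruction leaves ID in cycle 2
    stalls = []
    for dest, srcs in program:
        earliest = prev_id_cycle + 1
        # The instruction can leave ID only after every source has been written back.
        ready = max((wb_cycle_of.get(reg, 0) + 1 for reg in srcs), default=0)
        id_cycle = max(earliest, ready)
        stalls.append(id_cycle - earliest)
        if dest is not None:
            wb_cycle_of[dest] = id_cycle + 3   # ID -> EX -> MA -> WB
        prev_id_cycle = id_cycle
    return stalls


# R1 <- R2 + R3 followed immediately by R4 <- R1 + R5
back_to_back = [("R1", ("R2", "R3")), ("R4", ("R1", "R5"))]
print(stall_cycles(back_to_back))   # [0, 3]

# Same pair with one independent instruction scheduled in between
scheduled = [("R1", ("R2", "R3")), ("R6", ("R7", "R8")), ("R4", ("R1", "R5"))]
print(stall_cycles(scheduled))      # [0, 0, 2]
```

In this model the back-to-back pair costs three bubbles, and moving an independent instruction between the two removes one of them, which is the idea behind scheduling hazardous instructions away from each other.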
Pipeline Performance: $\text{Speedup}_{pipelining} = \frac{\mathrm{CPI}_{unpipelined}}{\mathrm{CPI}_{pipelined}}$, which gives $\text{Speedup}_{pipelining} = \frac{\text{Pipeline depth}}{1 + \text{Stall cycles per instruction}}$.
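A quick numeric reading of the second formula, as a sketch: the function name and the example stall rates below are illustrative and not taken from the slides. The formula itself assumes equal clock cycle times, an ideal pipelined CPI of 1, and an unpipelined CPI equal to the pipeline depth.

```python
def pipeline_speedup(depth: int, stalls_per_instruction: float) -> float:
    """Speedup = pipeline depth / (1 + stall cycles per instruction)."""
    return depth / (1.0 + stalls_per_instruction)


# Hypothetical stall rates for a five-stage pipeline
for stalls in (0.0, 0.4, 1.0):
    print(f"stalls/instr = {stalls:.1f} -> speedup = {pipeline_speedup(5, stalls):.2f}")
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction eats directly into it.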
Forwarding
    DADD R1, R2, R3
    DSUB R4, R1, R5
    AND  R6, R1, R7
    OR   R8, R1, R9
    XOR  R10, R1, R11
[Figure: pipeline timing diagram over clock cycles; DADD, DSUB, and AND each pass through IM, REG, ALU, DM, REG.]
Forwarding. Before bypassing: R1 ← R2 + R3 runs IF, ID, EX, MA, WB; R4 ← R1 + R5 is fetched but waits in stalled stages before its EX, MA, WB, so CPI > 1. After bypassing: R4 ← R1 + R5 runs IF, ID, EX, MA, WB without stalling, so CPI = 1.
Cost of Forwarding: In longer pipelines? In multiple issue pipelines? Have all the dependences been resolved?
Forwarding: Forwarding cannot solve all data dependence problems.
    LD  R2, 4(R1)
    ADD R4, R2, R3
[Figure: pipeline timing diagram over clock cycles; LD and ADD each pass through IM, REG, ALU, DM, REG.]
Forwarding - Stall Condition: Forwarding cannot solve all data dependence problems.
    LD  R2, 4(R1)
    ADD R4, R2, R3
[Figure: pipeline timing diagram over clock cycles; LD passes through IM, REG, ALU, DM, REG; ADD passes through IM, REG, a STALL cycle, then ALU, DM, REG.]
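The load-use case can be expressed as a simple check. The sketch below is not from the slides: `load_use_stalls` is a hypothetical helper that assumes a five-stage pipeline with full forwarding, where the only remaining stall is one cycle for an instruction that uses the result of the load immediately before it. It treats every consumer as needing the value in EX, so it does not model cases such as forwarding loaded data directly to a following store.

```python
def load_use_stalls(program):
    """Count one-cycle load-use stalls in a pipeline with full forwarding.

    program: list of (opcode, dest, srcs) tuples.
    Assumption: an instruction stalls one cycle only when the instruction
    directly before it is a load ("LD") whose destination it reads.
    """
    total = 0
    prev = None
    for opcode, dest, srcs in program:
        if prev is not None and prev[0] == "LD" and prev[1] in srcs:
            total += 1  # loaded value is available only after MEM, too late for EX
        prev = (opcode, dest, srcs)
    return total


load_then_use = [
    ("LD", "R2", ("R1",)),        # LD  R2, 4(R1)
    ("ADD", "R4", ("R2", "R3")),  # ADD R4, R2, R3
]
alu_chain = [
    ("DADD", "R1", ("R2", "R3")),  # DADD R1, R2, R3
    ("DSUB", "R4", ("R1", "R5")),  # DSUB R4, R1, R5
]

print(load_use_stalls(load_then_use))  # 1
print(load_use_stalls(alu_chain))      # 0
```

The ALU-to-ALU dependence is fully covered by forwarding, while the load followed by a dependent ADD still costs one bubble.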