Fundamentals of Computer Systems

Fundamentals of Computer Systems: A Pipelined MIPS Processor
Stephen A. Edwards, Columbia University, Summer 2015
Technical Illustrations Copyright © 2007 Elsevier

Sequential Laundry
[Figure: laundry timeline (Time axis) with rows for Alice, Bob, and Cindy.]

Pipelined Laundry
[Figure: laundry timeline (Time axis) with rows for Alice, Bob, and Cindy.]
Why let the washer or dryer sit idle when both can be running?

Single-Cycle Datapath Timing
[Figure: timing diagram with rows Clk, PC, Register File, and Data Memory.]

Single-Cycle vs. Pipelined Datapath
[Figure: the single-cycle datapath shown next to the pipelined datapath. The pipelined version is divided into Fetch, Decode, Execute, Memory, and Writeback stages, and its signal names carry a stage suffix (F, D, E, M, W), e.g. PCF, InstrD, SrcAE, WriteDataM, ResultW.]

Corrected Pipelined Datapath
The register number to write (WriteReg) must stay synchronized with the result.
[Figure: pipelined datapath in which WriteReg is carried through the pipeline registers as WriteRegE, WriteRegM, and WriteRegW.]

Pipeline Control
Same control unit as the single-cycle processor; control signals are delayed across pipeline stages.
[Figure: pipelined datapath with the control unit. Each control signal is registered per stage, e.g. RegWriteD, RegWriteE, RegWriteM, RegWriteW.]

Single-Cycle vs. Pipeline Timing
[Figure: two timing diagrams against Time (ps). Single-cycle: instructions 1 and 2 each complete Fetch, Decode / Read Reg, Execute, Memory Read / Write, and Write Reg before the next begins. Pipelined: instructions 1, 2, and 3 pass through the same five steps with their stages overlapped.]

Pipelining Abstraction
[Figure: pipeline diagram, Time (cycles) 1 to 10, one row per instruction.]
    lw  $s2, 40($0)
    add $s3, $t1, $t2
    sub $s4, $s1, $s5
    and $s5, $t5, $t6
    sw  $s6, 20($s1)
    or  $s7, $t3, $t4

Data Hazard
[Figure: pipeline diagram, Time (cycles) 1 to 8.]
    add $s0, $s2, $s3
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5
The first instruction produces a result (in $s0) that later instructions need. The ALU has computed the value in cycle 3, but it won't be written to the register file until cycle 5.
[Images: Dukes of Hazzard, hazard lights, water hazard, biohazard.]

Eliminating Hazards at Compile-Time
[Figure: pipeline diagram, Time (cycles) 1 to 10.]
    add $s0, $s2, $s3
    nop
    nop
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5
Insert nops to delay later instructions; it is sometimes possible to put useful work in those slots.
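The nop insertion described on this slide can be sketched as a small scheduling pass. This is only an illustration under stated assumptions, not the method used by any particular compiler: the tuple encoding of instructions and the name insert_nops are hypothetical, register names are written with their full numbers ($s0, $t0, ...), and the pipeline is assumed to have five stages, no forwarding, and a register file that can be written and then read within the same cycle, so a consumer must trail its producer by at least three instruction slots.

```python
# Hypothetical encoding: (mnemonic, destination register or None, [source registers]).
NOP = ("nop", None, [])
MIN_DISTANCE = 3  # assumed: consumer's Decode may not come before producer's Writeback


def insert_nops(program):
    """Return a copy of `program` padded with nops so that every
    read-after-write dependence spans at least MIN_DISTANCE slots."""
    out = []
    for instr in program:
        sources = instr[2]
        needed = 0
        # Inspect the most recently emitted slots for a producer of our sources.
        for back in range(1, MIN_DISTANCE):
            if back > len(out):
                break
            dest = out[-back][1]
            if dest is not None and dest != "$0" and dest in sources:
                needed = max(needed, MIN_DISTANCE - back)
        out.extend([NOP] * needed)
        out.append(instr)
    return out


program = [
    ("add", "$s0", ["$s2", "$s3"]),
    ("and", "$t0", ["$s0", "$s1"]),
    ("or", "$t1", ["$s4", "$s0"]),
    ("sub", "$t2", ["$s0", "$s5"]),
]
print([instr[0] for instr in insert_nops(program)])
```

For this four-instruction sequence the pass places two nops between add and and, which matches the slide; a real compiler would first try to move independent instructions into those slots.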

Eliminating Data Hazards through Forwarding
[Figure: pipeline diagram, Time (cycles) 1 to 8.]
    add $s0, $s2, $s3
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5
Add logic to send data between instructions; the register file is eventually written. Here, the result is available at the end of cycle 3 and needed in cycles 4 and 5.

Datapath with Data Forwarding
[Figure: pipelined datapath with a Hazard Unit. The unit takes RsE, RtE, WriteRegM, WriteRegW, RegWriteM, and RegWriteW as inputs and drives the ForwardAE and ForwardBE multiplexer selects in the Execute stage.]
Hazard Unit:
    ForwardAE = 10  if (rsE != 0) AND (rsE == WriteRegM) AND RegWriteM
                01  if (rsE != 0) AND (rsE == WriteRegW) AND RegWriteW
                00  otherwise
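The forwarding rule for one Execute-stage operand can be written as a short function. This is a minimal sketch: the function name and argument names are hypothetical, and the select encodings (10 for the Memory-stage result, 01 for the Writeback-stage result, 00 for the register file) are assumed to follow the usual textbook convention. The same function would serve for ForwardBE by passing rtE instead of rsE.

```python
def forward_select(rs_e, write_reg_m, reg_write_m, write_reg_w, reg_write_w):
    """Return the 2-bit mux select for one Execute-stage source operand."""
    # Register 0 is hard-wired to zero in MIPS, so it is never forwarded.
    if rs_e != 0 and rs_e == write_reg_m and reg_write_m:
        return 0b10  # newest value: result sitting in the Memory stage
    if rs_e != 0 and rs_e == write_reg_w and reg_write_w:
        return 0b01  # older value: result sitting in the Writeback stage
    return 0b00      # no hazard: use the register file output


# Register 16 is being written by the instruction in Memory: forward from there.
print(format(forward_select(16, 16, True, 8, True), "02b"))
```

Checking the Memory stage first matters: if both later stages write the same register, the Memory-stage value is the more recent one.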

Data Hazard that Demands a Stall
[Figure: pipeline diagram, Time (cycles) 1 to 8, with the and instruction marked "Trouble!".]
    lw  $s0, 40($0)
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5
This data hazard can't be solved with forwarding because the value is only available at the end of cycle 4, yet it is needed at the beginning of that cycle.

Stalling a Pipeline
[Figure: pipeline diagram, Time (cycles) 1 to 9. The and instruction repeats its register read and the or instruction repeats its fetch for one cycle, marked "Stall".]
    lw  $s0, 40($0)
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5
A stall tells an instruction to wait for a cycle before proceeding.

Stalling Hardware
[Figure: pipelined datapath with Hazard Unit. StallF and StallD drive the enable (EN) inputs of the Fetch and Decode pipeline registers, and FlushE drives the clear (CLR) input of the Execute pipeline register.]
Hazard Unit:
    lwstall = MemtoRegE AND ((rsD == rtE) OR (rtD == rtE))
    StallF = StallD = FlushE = lwstall
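The load-use stall condition is small enough to state directly in code. The sketch below assumes the logical operators are AND between MemtoRegE and the register comparison and OR between the two comparisons (the operators did not survive in the transcript); the function names are hypothetical.

```python
def lw_stall(rs_d, rt_d, rt_e, mem_to_reg_e):
    """True when the instruction in Decode reads the register that a load
    currently in Execute will write (its rt field)."""
    return bool(mem_to_reg_e and (rs_d == rt_e or rt_d == rt_e))


def hazard_outputs(rs_d, rt_d, rt_e, mem_to_reg_e):
    """StallF, StallD, and FlushE all follow lwstall."""
    stall = lw_stall(rs_d, rt_d, rt_e, mem_to_reg_e)
    return {"StallF": stall, "StallD": stall, "FlushE": stall}


# lw writes register 16 in Execute while the next instruction reads 16 in Decode.
print(hazard_outputs(rs_d=16, rt_d=17, rt_e=16, mem_to_reg_e=True))
```

Holding Fetch and Decode while clearing the Execute register inserts a one-cycle bubble behind the load, after which ordinary forwarding can supply the loaded value.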

Control Hazards
[Figure: pipeline diagram, Time (cycles) 1 to 9. The instructions at 24, 28, and 2C are marked "Flush these instructions".]
    20  beq $t1, $t2, 40
    24  and $t0, $s0, $s1
    28  or  $t1, $s4, $s0
    2C  sub $t2, $s0, $s5
    30  ...
    ...
    64  slt $t3, $s2, $s3
Whether to branch isn't determined until the beginning of the fourth cycle; three instructions would be executed erroneously if the branch were taken.

Early Branch Resolution
[Figure: pipelined datapath with the branch comparison (EqualD), PCSrcD, and the branch target (PCBranchD) computed in the Decode stage.]
Resolving the branch in the Decode stage introduces another data hazard in that stage.

Control Hazards w/ Early Branch Resolution
[Figure: pipeline diagram, Time (cycles) 1 to 9. Only the instruction at 24 is marked "Flush this instruction".]
    20  beq $t1, $t2, 40
    24  and $t0, $s0, $s1
    28  or  $t1, $s4, $s0
    2C  sub $t2, $s0, $s5
    30  ...
    ...
    64  slt $t3, $s2, $s3

Handling Data and Control Hazards
[Figure: pipelined datapath with early branch resolution. The Hazard Unit additionally takes BranchD and RegWriteE as inputs and drives ForwardAD and ForwardBD for the Decode-stage comparator, alongside StallF, StallD, FlushE, ForwardAE, and ForwardBE.]

Forwarding and Stalling Logic
Forward the result to the branch-if-equal comparator if we would read its destination register:
    ForwardAD = (rsD != 0) AND (rsD == WriteRegM) AND RegWriteM
    ForwardBD = (rtD != 0) AND (rtD == WriteRegM) AND RegWriteM
Stall if the branch would test the result of an ALU operation or a memory read:
    branchstall = (BranchD AND RegWriteE AND (WriteRegE == rsD OR WriteRegE == rtD))
                  OR (BranchD AND MemtoRegM AND (WriteRegM == rsD OR WriteRegM == rtD))
Stall if we need to read the result of a memory read or of a branch:
    StallF = StallD = FlushE = lwstall OR branchstall
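The Decode-stage forwarding and branch-stall conditions can be expressed the same way. This is a sketch under the assumption that the missing operators are AND within each condition and OR between the two stall cases; all function and argument names are hypothetical.

```python
def forward_d(reg_d, write_reg_m, reg_write_m):
    """ForwardAD / ForwardBD: forward the Memory-stage result to the
    Decode-stage equality comparator."""
    return bool(reg_d != 0 and reg_d == write_reg_m and reg_write_m)


def branch_stall(branch_d, rs_d, rt_d,
                 write_reg_e, reg_write_e,
                 write_reg_m, mem_to_reg_m):
    """True when a branch in Decode depends on a result that is not yet
    available: an ALU result still in Execute, or a load still in Memory."""
    alu_pending = branch_d and reg_write_e and write_reg_e in (rs_d, rt_d)
    load_pending = branch_d and mem_to_reg_m and write_reg_m in (rs_d, rt_d)
    return bool(alu_pending or load_pending)


def stall_signals(lwstall, branchstall):
    """StallF, StallD, and FlushE are all asserted by either stall cause."""
    stall = bool(lwstall or branchstall)
    return {"StallF": stall, "StallD": stall, "FlushE": stall}


# A beq in Decode compares register 9 while an add in Execute is about to write it.
print(branch_stall(True, 9, 10, write_reg_e=9, reg_write_e=True,
                   write_reg_m=0, mem_to_reg_m=False))
```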

Pipeline Performance: CPI Example
Ideal CPI = 1; stalls increase it, but by how much?
Instructions in the SPECINT2000 benchmark:
    52% R-type
    25% Loads
    10% Stores
    11% Branches
     2% Jumps
If 40% of loads are used by the next instruction and 25% of branches are mispredicted, what is the average CPI?
    Load CPI = 2 when the next instruction uses the result; 1 otherwise
    Branch CPI = 2 when mispredicted; 1 otherwise
    Jump CPI = 2
    Average CPI = 1 × 0.52 (R-type)
                + (2 × 0.4 + 1 × 0.6) × 0.25 (Loads)
                + 1 × 0.10 (Stores)
                + (2 × 0.25 + 1 × 0.75) × 0.11 (Branches)
                + 2 × 0.02 (Jumps)
                = 1.1475
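The weighted average can be checked numerically. The sketch below assumes the instruction mix is 52% R-type, 25% loads, 10% stores, 11% branches, and 2% jumps, that 40% of loads stall the next instruction, and that 25% of branches are mispredicted; the digits are reconstructed from the transcript and chosen so that the mix sums to 100%. Variable names are illustrative.

```python
# Fraction of executed instructions in each class (assumed, sums to 1).
mix = {"r_type": 0.52, "load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02}

# Per-class CPI: a stalled load or mispredicted branch costs 2 cycles, else 1.
load_cpi = 2 * 0.40 + 1 * 0.60
branch_cpi = 2 * 0.25 + 1 * 0.75
cpi = {"r_type": 1.0, "load": load_cpi, "store": 1.0, "branch": branch_cpi, "jump": 2.0}

average_cpi = sum(mix[kind] * cpi[kind] for kind in mix)
print(f"average CPI = {average_cpi:.4f}")
```

Under these assumptions the result is about 1.15 cycles per instruction, i.e. roughly 15% above the ideal CPI of 1.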

Fully Bypassed Processor
[Figure: complete pipelined datapath with control unit and Hazard Unit, including the stall and flush signals (StallF, StallD, FlushE) and forwarding to both the Decode stage (ForwardAD, ForwardBD) and the Execute stage (ForwardAE, ForwardBE).]

Pipelined Processor Critical Path
[Figure: complete pipelined datapath with control unit and Hazard Unit.]

Element                Parameter     Delay (ps)
Register clk-to-q      t_pcq         30
Register setup         t_setup       20
Multiplexer            t_mux         25
ALU                    t_ALU         200
Memory read            t_memread     250
Register file read     t_read        150
Register file setup    t_setup       20
Equality               t_eq          40
AND gate               t_AND         15
Memory write           t_memwrite    220
Register file write    t_write       100

T_c = max of
    Fetch:      t_pcq + t_memread + t_setup
    Decode:     2(t_read + t_mux + t_eq + t_AND + t_mux + t_setup)
    Execute:    t_pcq + t_mux + t_mux + t_ALU + t_setup
    Memory:     t_pcq + t_memwrite + t_setup
    Writeback:  2(t_pcq + t_mux + t_write)

T_c = 2(150 + 25 + 40 + 15 + 25 + 20) ps = 550 ps

Why 2? We assume it takes half a cycle for a newly written register's value (WD3) to propagate to RD1 or RD2, i.e., when an earlier instruction writes a register used by the current one.
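The cycle-time calculation amounts to evaluating each stage's path and taking the largest. The sketch below assumes the element delays are 30, 20, 25, 200, 250, 150, 20, 40, 15, 220, and 100 ps in the order they are listed on the slide (several digits are reconstructed, and the 100 ps register file write in particular is an assumption). Dictionary keys are illustrative names, not identifiers from the slides.

```python
# Element delays in picoseconds (assumed values, see lead-in).
d = {
    "pcq": 30, "setup": 20, "mux": 25, "alu": 200, "memread": 250,
    "rf_read": 150, "rf_setup": 20, "eq": 40, "and": 15,
    "memwrite": 220, "rf_write": 100,
}

# Decode and Writeback are doubled because each is assumed to get half a cycle.
stage_paths_ps = {
    "Fetch": d["pcq"] + d["memread"] + d["setup"],
    "Decode": 2 * (d["rf_read"] + d["mux"] + d["eq"] + d["and"] + d["mux"] + d["rf_setup"]),
    "Execute": d["pcq"] + d["mux"] + d["mux"] + d["alu"] + d["setup"],
    "Memory": d["pcq"] + d["memwrite"] + d["setup"],
    "Writeback": 2 * (d["pcq"] + d["mux"] + d["rf_write"]),
}

critical_stage = max(stage_paths_ps, key=stage_paths_ps.get)
t_c_ps = stage_paths_ps[critical_stage]
print(critical_stage, t_c_ps)
```

With these numbers the Decode stage sets the clock period, mainly because the early branch comparison sits behind the half-cycle register file read.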

Pipelined Processor Performance
For a 100-billion-instruction task on our pipelined processor, each instruction takes 1.15 cycles on average. With a 550 ps clock period,
    time = (100 × 10^9)(1.15)(550 ps) = 63 seconds

Processor       Execution Time (s)    Speedup
Single-Cycle    92.5                  1.0 (by definition)
Multi-Cycle     133.9                 0.7
Pipelined       63.25                 1.46
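The execution time and speedups follow from time = instructions × CPI × clock period. The sketch below assumes a task of 100 billion instructions, a CPI of 1.15, and a 550 ps clock (digits reconstructed from the transcript), and takes the single-cycle and multi-cycle times of 92.5 s and 133.9 s as given; the names are illustrative.

```python
def execution_time_s(instructions, cpi, clock_period_s):
    """Execution time = instruction count x cycles per instruction x seconds per cycle."""
    return instructions * cpi * clock_period_s


SINGLE_CYCLE_S = 92.5   # baseline, taken as given
MULTI_CYCLE_S = 133.9   # taken as given

pipelined_s = execution_time_s(100e9, 1.15, 550e-12)


def speedup(time_s):
    """Speedup relative to the single-cycle processor."""
    return SINGLE_CYCLE_S / time_s


print(f"pipelined: {pipelined_s:.2f} s, speedup {speedup(pipelined_s):.2f}")
print(f"multi-cycle speedup {speedup(MULTI_CYCLE_S):.2f}")
```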