EECS150 - Digital Design
Lecture 9 - CPU Microarchitecture
Feb 17, 2009
John Wawrzynek
Spring 2009 EECS150 - Lec9-cpu

CMOS Devices Review: Transistor switch-level models
The gate acts like a capacitor. A high voltage on the gate attracts charge into the channel. If a voltage exists between the source and drain, a current will flow.
In its simplest approximation, the device acts like a switch.
[Figure: switch-level models of the nfet and pfet]

Transistor-level Logic Circuits
Simple rule for wiring up MOSFETs:
  nfet is used only to pass logic zero.
  pfet is used only to pass logic one.
For example, consider the NAND gate:
[Figure: NAND gate schematic]
Note: This rule is sometimes violated by expert designers under special conditions.

Transistor-level Logic Circuits
NOR gate:
[Figure: NOR gate schematic]
Note: out = 0 iff a OR b = 1, therefore out = NOT(a + b).
Again, the pfet network and the nfet network are duals of one another.
Other, more complex functions are possible. Ex: out = NOT(a + bc)
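
The duality between the pfet and nfet networks can be checked by brute force. The following is a minimal sketch, not the lecture's own material: each gate is described by two hypothetical predicates (`pull_down`, `pull_up`) that say when the nfet or pfet network conducts, and the check confirms that exactly one network conducts for every input combination.

```python
from itertools import product

# Hypothetical gate descriptions (names are assumptions for illustration).
# pull_down(...) is True when the nfet network connects out to GND.
# pull_up(...) is True when the pfet network connects out to Vdd.
GATES = {
    "NAND": {
        "inputs": 2,
        "pull_down": lambda a, b: a and b,            # nfets in series
        "pull_up": lambda a, b: (not a) or (not b),   # pfets in parallel
    },
    "NOR": {
        "inputs": 2,
        "pull_down": lambda a, b: a or b,             # nfets in parallel
        "pull_up": lambda a, b: (not a) and (not b),  # pfets in series
    },
    "NOT(a + bc)": {
        "inputs": 3,
        "pull_down": lambda a, b, c: a or (b and c),
        "pull_up": lambda a, b, c: (not a) and ((not b) or (not c)),
    },
}


def truth_table(gate):
    """Return {input tuple: output bit}, checking that exactly one network conducts."""
    table = {}
    for bits in product([False, True], repeat=gate["inputs"]):
        down = bool(gate["pull_down"](*bits))
        up = bool(gate["pull_up"](*bits))
        if down == up:
            # Both conducting is a short circuit; neither leaves the output floating.
            raise ValueError(f"short circuit or floating output at {bits}")
        table[bits] = 1 if up else 0
    return table
```

Calling `truth_table(GATES["NAND"])` returns the NAND truth table; a description whose networks are not complementary raises an error instead.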

CMOS Logic Gates in General
Pull-up network conducts under conditions to generate a logic 1 output.
Pull-down network conducts for logic 0 output.
Conductance must be mutually exclusive - else, short circuit!
Pull-up and pull-down networks are topological duals.

Transmission Gate
Transmission gates are the way to build switches in CMOS.
In general, both transistor types are needed:
  nfet to pass zeros.
  pfet to pass ones.
The transmission gate is bi-directional (unlike logic gates).
Does not directly connect to Vdd and GND, but can be combined with logic gates or buffers to simplify many logic structures.

Transmission-gate Multiplexor
2-to-1 multiplexor: c = sa + s'b   (s' is the complement of s)
Switches simplify the implementation:
[Figure: two transmission gates, controlled by s and s', pass a or b to c]
Compare the cost to logic gate implementation.

Tri-state Buffers
Tri-state Buffer: when disabled, the output is high impedance (output disconnected).
Variations:
  Inverting buffer
  Inverted enable
Transmission gate useful in implementation.

Tri-state Buffers
Tri-state buffers enable bidirectional connections.
Tri-state buffers are used when multiple circuits all connect to a common wire. Only one circuit at a time is allowed to drive the bus. All others disconnect their outputs, but can listen.
[Figure: tri-state buffers with example enable values sharing a bus]

Tri-state Based Multiplexor
Multiplexor:
  If s=1 then c=a else c=b
Transistor circuit for inverting multiplexor:
[Figure: transistor-level inverting multiplexor]
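
The bus rule (only one driver at a time) and the multiplexor behavior (if s=1 then c=a else c=b) can be modeled at the switch level. This is a rough sketch under stated assumptions: `Z`, `tristate`, `resolve_bus`, and `tristate_mux` are names invented here, and high impedance is represented by `None`. It models the non-inverting behavior, not the inverting transistor circuit.

```python
Z = None  # assumed marker for a high-impedance (disconnected) output


def tristate(data, enable):
    """Tri-state buffer: drive data when enabled, otherwise disconnect."""
    return data if enable else Z


def resolve_bus(drivers):
    """Value seen on a shared wire, given every circuit's output."""
    active = [v for v in drivers if v is not Z]
    if len(active) > 1:
        # More than one circuit driving the wire at once.
        raise ValueError("bus contention: multiple drivers enabled")
    return active[0] if active else Z


def tristate_mux(a, b, s):
    """2-to-1 multiplexor built from two buffers with complementary enables."""
    return resolve_bus([tristate(a, s == 1), tristate(b, s == 0)])
```

Because the two enables are complementary, exactly one buffer drives `c`, which is why the contention check never fires inside `tristate_mux`.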

Latches and Flip-flops
Positive level-sensitive latch:
[Figure: latch symbol]
Positive edge-triggered flip-flop built from two level-sensitive latches:
[Figure: two latches in series]
Latch implementation:
[Figure: latch circuit, clocked by clk]

Processor Microarchitecture Introduction
Microarchitecture: how to implement an architecture in hardware.
Good examples of how to put principles of digital design into practice.
Introduction to final project.
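
The flip-flop construction from the Latches and Flip-flops slide can be sketched behaviorally. The sketch assumes the usual master-slave arrangement (first latch transparent while clk is 0, second while clk is 1); the slide text itself does not spell out the clock phases, and the class names are invented for illustration.

```python
class Latch:
    """Positive level-sensitive latch: transparent while enable is 1."""

    def __init__(self):
        self.q = 0

    def update(self, enable, d):
        if enable:
            self.q = d  # transparent: output follows input
        return self.q   # otherwise hold the stored value


class FlipFlop:
    """Positive edge-triggered flip-flop built from two latches (assumed master-slave)."""

    def __init__(self):
        self.master = Latch()
        self.slave = Latch()

    def update(self, clk, d):
        # Master is transparent while clk is 0, slave while clk is 1,
        # so q only changes when clk goes from 0 to 1.
        m = self.master.update(1 - clk, d)
        return self.slave.update(clk, m)
```

Changing `d` while `clk` stays at 1 has no effect on the output, because the master latch is holding during that phase.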

MIPS Processor Architecture
For now we consider a subset of MIPS instructions:
  R-type instructions: and, or, add, sub, slt
  Memory instructions: lw, sw
  Branch instructions: beq
Later we'll add addi and j.

MIPS Microarchitecture Organization
Datapath + Controller + External Memory
[Figure: controller, datapath, and external memory block diagram]

How to Design a Processor: step-by-step
1. Analyze instruction set architecture (ISA) to derive datapath requirements
     meaning of each instruction is given by the data transfers (register transfers)
     datapath must include storage element for ISA registers
     datapath must support each data transfer
2. Select set of datapath components and establish clocking methodology
3. Assemble datapath meeting requirements
4. Analyze implementation of each instruction to determine setting of control points that effects the data transfer.
5. Assemble the control logic.

Review: The MIPS Instruction
R-type: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | rd (15-11, 5 bits) | shamt (10-6, 5 bits) | funct (5-0, 6 bits)
I-type: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | address/immediate (15-0, 16 bits)
J-type: op (bits 31-26, 6 bits) | target address (25-0, 26 bits)
The different fields are:
  op: operation ("opcode") of the instruction
  rs, rt, rd: the source and destination register specifiers
  shamt: shift amount
  funct: selects the variant of the operation in the op field
  address / immediate: address offset or immediate value
  target address: target address of jump instruction
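
The field layout above maps directly onto shifts and masks. The following is a small sketch (the function name `decode` and the dictionary keys are choices made here, not part of the lecture) that extracts every field from a 32-bit instruction word; which fields are meaningful depends on the instruction format.

```python
def decode(instr):
    """Split a 32-bit MIPS instruction word into its fields.

    All fields are extracted unconditionally; the opcode decides which
    ones apply (R-type uses rd/shamt/funct, I-type uses imm16,
    J-type uses target).
    """
    return {
        "op": (instr >> 26) & 0x3F,      # bits 31-26
        "rs": (instr >> 21) & 0x1F,      # bits 25-21
        "rt": (instr >> 16) & 0x1F,      # bits 20-16
        "rd": (instr >> 11) & 0x1F,      # bits 15-11
        "shamt": (instr >> 6) & 0x1F,    # bits 10-6
        "funct": instr & 0x3F,           # bits 5-0
        "imm16": instr & 0xFFFF,         # bits 15-0
        "target": instr & 0x3FFFFFF,     # bits 25-0
    }
```

For example, the word `0x012A4020` decodes to op 0, rs 9, rt 10, rd 8, funct `0x20`, which is an R-type add.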

Subset for Lecture
add, sub, or, slt
  addu rd,rs,rt
  subu rd,rs,rt
  op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | rd (15-11, 5 bits) | shamt (10-6, 5 bits) | funct (5-0, 6 bits)
lw, sw
  lw rt,rs,imm16
  sw rt,rs,imm16
  op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)
beq
  beq rs,rt,imm16
  op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)

Register Transfer Descriptions
All start with instruction fetch:
  {op, rs, rt, rd, shamt, funct} <- IMEM[ PC ]  OR  {op, rs, rt, Imm16} <- IMEM[ PC ]
THEN
inst   Register Transfers
add    R[rd] <- R[rs] + R[rt]; PC <- PC + 4
sub    R[rd] <- R[rs] - R[rt]; PC <- PC + 4
or     R[rd] <- R[rs] | R[rt]; PC <- PC + 4
slt    R[rd] <- (R[rs] < R[rt]) ? 1 : 0; PC <- PC + 4
lw     R[rt] <- DMEM[ R[rs] + sign_ext(imm16) ]; PC <- PC + 4
sw     DMEM[ R[rs] + sign_ext(imm16) ] <- R[rt]; PC <- PC + 4
beq    if ( R[rs] == R[rt] ) then PC <- PC + 4 + {sign_ext(imm16), 00} else PC <- PC + 4
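
The register transfers above can be exercised with a tiny interpreter. This is a sketch under several assumptions that are not in the slides: instructions are given as already-decoded tuples, DMEM is a Python dict keyed by byte address, values wrap at 32 bits, and the hardwired zero register is not modeled. The names `step`, `sign_ext`, and `to_signed` are invented here.

```python
MASK = 0xFFFFFFFF  # keep values to 32 bits


def sign_ext(imm16):
    """Sign-extend a 16-bit immediate to a Python int."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def to_signed(x):
    """Interpret a 32-bit value as a signed integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def step(pc, R, DMEM, inst):
    """Apply one instruction's register transfers and return the new PC."""
    name = inst[0]
    if name in ("add", "sub", "or", "slt"):
        _, rd, rs, rt = inst                      # e.g. ("add", rd, rs, rt)
        a, b = R[rs], R[rt]
        if name == "add":
            R[rd] = (a + b) & MASK
        elif name == "sub":
            R[rd] = (a - b) & MASK
        elif name == "or":
            R[rd] = a | b
        else:
            R[rd] = 1 if to_signed(a) < to_signed(b) else 0
        return pc + 4
    if name in ("lw", "sw"):
        _, rt, rs, imm16 = inst                   # e.g. ("lw", rt, rs, imm16)
        addr = (R[rs] + sign_ext(imm16)) & MASK
        if name == "lw":
            R[rt] = DMEM.get(addr, 0)             # unwritten memory reads as 0 (assumption)
        else:
            DMEM[addr] = R[rt]
        return pc + 4
    if name == "beq":
        _, rs, rt, imm16 = inst                   # ("beq", rs, rt, imm16)
        if R[rs] == R[rt]:
            return pc + 4 + (sign_ext(imm16) << 2)  # {sign_ext(imm16), 00}
        return pc + 4
    raise ValueError(f"instruction not in the lecture subset: {name}")
```

Each branch of `step` corresponds to one row of the register transfer table, with the PC update folded into the return value.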

Microarchitecture
Multiple implementations for a single architecture:
  Single-cycle
    Each instruction executes in a single clock cycle.
  Multicycle
    Each instruction is broken up into a series of shorter steps with one step per clock cycle.
  Pipelined
    Each instruction is broken up into a series of steps with one step per clock cycle.
    Multiple instructions execute at once.

CPU clocking (1/2)
Single Cycle CPU: All stages of an instruction are completed within one long clock cycle.
The clock cycle is made sufficiently long to allow each instruction to complete all stages without interruption and within one cycle.
  1. Instruction Fetch
  2. Decode/Register Read
  3. Execute
  4. Memory
  5. Reg. Write

CPU clocking (2/2)
Multiple-cycle CPU: Only one stage of instruction per clock cycle.
The clock is made as long as the slowest stage.
  1. Instruction Fetch
  2. Decode/Register Read
  3. Execute
  4. Memory
  5. Reg. Write
Several significant advantages over single cycle execution: Unused stages in a particular instruction can be skipped OR instructions can be pipelined (overlapped).

MIPS State Elements
Determines everything about the execution status of a processor:
  PC register
  32 registers
  Memory
Note: for these state elements, clock is used for write but not for read (asynchronous read, synchronous write).
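
The asynchronous-read, synchronous-write behavior of the state elements can be illustrated with a register file model. This is only a sketch: the class and method names (`RegisterFile`, `read`, `write`, `clock_edge`) are invented here, and the clock is modeled as an explicit method call.

```python
class RegisterFile:
    """32-entry register file: reads are combinational, writes wait for the clock."""

    def __init__(self):
        self.regs = [0] * 32
        self._pending = None  # write request waiting for the next clock edge

    def read(self, addr):
        # Asynchronous read: returns the stored value immediately, no clock involved.
        return self.regs[addr]

    def write(self, addr, data, write_enable):
        # Synchronous write: only record the request; nothing changes yet.
        self._pending = (addr, data) if write_enable else None

    def clock_edge(self):
        # The stored state changes only here, at the clock edge.
        if self._pending is not None:
            addr, data = self._pending
            self.regs[addr] = data
            self._pending = None
```

A value presented to `write` is not visible through `read` until `clock_edge` has been called, which is the property the single-cycle datapath relies on.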

Single-Cycle Datapath: lw fetch
First consider executing lw:
  R[rt] <- DMEM[ R[rs] + sign_ext(imm16) ]
STEP 1: Fetch instruction

Single-Cycle Datapath: lw register read
  R[rt] <- DMEM[ R[rs] + sign_ext(imm16) ]
STEP 2: Read source operands from register file

Single-Cycle Datapath: lw immediate
  R[rt] <- DMEM[ R[rs] + sign_ext(imm16) ]
STEP 3: Sign-extend the immediate

Single-Cycle Datapath: lw address
  R[rt] <- DMEM[ R[rs] + sign_ext(imm16) ]
STEP 4: Compute the memory address

Single-Cycle Datapath: lw memory read
  R[rt] <- DMEM[ R[rs] + sign_ext(imm16) ]
STEP 5: Read data from memory and write it back to register file

Single-Cycle Datapath: lw PC increment
STEP 6: Determine the address of the next instruction
  PC <- PC + 4

Single-Cycle Datapath: sw
  DMEM[ R[rs] + sign_ext(imm16) ] <- R[rt]
Write data in rt to memory

Single-Cycle Datapath: R-type instructions
  R[rd] <- R[rs] op R[rt]
Read from rs and rt
Write ALUResult to register file
Write to rd (instead of rt)

Single-Cycle Datapath: beq
  if ( R[rs] == R[rt] ) then PC <- PC + 4 + {sign_ext(imm16), 00}
Determine whether values in rs and rt are equal
Calculate branch target address:
  BTA = (sign-extended immediate << 2) + (PC+4)

Complete Single-Cycle Processor
[Figure: complete single-cycle processor]

Control Unit
[Figure: control unit]

Review: ALU
F2:0   Function
000    A & B
001    A | B
010    A + B
011    not used
100    A & ~B
101    A | ~B
110    A - B
111    SLT
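
The ALU function table can be written out as a behavioral model. This is a sketch with assumptions labeled: operands are treated as 32-bit values, SLT is taken to be a signed comparison (as MIPS slt is), and the names `alu` and `to_signed` are invented here.

```python
MASK = 0xFFFFFFFF  # 32-bit datapath width


def to_signed(x):
    """Interpret a 32-bit value as a signed integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def alu(a, b, f):
    """Behavioral ALU; f is the 3-bit F2:0 control value from the table."""
    a &= MASK
    b &= MASK
    not_b = ~b & MASK
    if f == 0b000:
        return a & b
    if f == 0b001:
        return a | b
    if f == 0b010:
        return (a + b) & MASK
    if f == 0b100:
        return a & not_b
    if f == 0b101:
        return a | not_b
    if f == 0b110:
        return (a - b) & MASK
    if f == 0b111:
        return 1 if to_signed(a) < to_signed(b) else 0  # SLT (signed, assumed)
    raise ValueError("F = 011 is not used")
```

Results wrap at 32 bits, so `alu(3, 5, 0b110)` yields `0xFFFFFFFE`, the two's complement encoding of -2.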

Control Unit: ALU Decoder
ALUOp1:0   Meaning
00         Add
01         Subtract
10         Look at Funct
11         Not Used

ALUOp1:0   Funct          ALUControl2:0
00         X              010 (Add)
X1         X              110 (Subtract)
1X         100000 (add)   010 (Add)
1X         100010 (sub)   110 (Subtract)
1X         100100 (and)   000 (And)
1X         100101 (or)    001 (Or)
1X         101010 (slt)   111 (SLT)

Control Unit: Main Decoder
Instruction  Op5:0   RegWrite  RegDst  AluSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
R-type       000000
lw           100011
sw           101011
beq          000100
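
The ALU decoder table translates directly into a lookup. A minimal sketch, assuming the rows are matched in the order shown in the table (so ALUOp = 11 falls under the X1 row); the names `FUNCT_TO_ALUCONTROL` and `alu_decoder` are invented for illustration.

```python
# Funct field -> ALUControl2:0, used only when ALUOp = 1X.
FUNCT_TO_ALUCONTROL = {
    0b100000: 0b010,  # add -> Add
    0b100010: 0b110,  # sub -> Subtract
    0b100100: 0b000,  # and -> And
    0b100101: 0b001,  # or  -> Or
    0b101010: 0b111,  # slt -> SLT
}


def alu_decoder(alu_op, funct):
    """Return ALUControl2:0 for the given ALUOp1:0 and funct field."""
    if alu_op == 0b00:
        return 0b010                       # Add (used by lw, sw)
    if alu_op & 0b01:
        return 0b110                       # X1: Subtract (used by beq)
    return FUNCT_TO_ALUCONTROL[funct]      # 1X: look at funct (R-type)
```

An unsupported funct value raises `KeyError`, which stands in for the rows the table does not define.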

Control Unit: Main Decoder
Instruction  Op5:0   RegWrite  RegDst  AluSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
R-type       000000  1         1       0       0       0         0         10
lw           100011  1         0       1       0       0         1         00
sw           101011  0         X       1       0       1         X         00
beq          000100  0         X       0       1       0         X         01

Single-Cycle Datapath Example: or
[Figure: single-cycle datapath with control values for the or instruction]

Extended Functionality: addi
No change to datapath
[Figure: single-cycle datapath]

Control Unit: addi
Instruction  Op5:0   RegWrite  RegDst  AluSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
R-type       000000  1         1       0       0       0         0         10
lw           100011  1         0       1       0       0         1         00
sw           101011  0         X       1       0       1         X         00
beq          000100  0         X       0       1       0         X         01
addi         001000

Control Unit: addi
Instruction  Op5:0   RegWrite  RegDst  AluSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
R-type       000000  1         1       0       0       0         0         10
lw           100011  1         0       1       0       0         1         00
sw           101011  0         X       1       0       1         X         00
beq          000100  0         X       0       1       0         X         01
addi         001000  1         0       1       0       0         0         00

Extended Functionality: j
[Figure: single-cycle datapath extended for j]

Control Unit: Main Decoder
Instruction  Op5:0   RegWrite  RegDst  AluSrc  Branch  MemWrite  MemtoReg  ALUOp1:0  Jump
R-type       000000  1         1       0       0       0         0         10        0
lw           100011  1         0       1       0       0         1         00        0
sw           101011  0         X       1       0       1         X         00        0
beq          000100  0         X       0       1       0         X         01        0
j            000010  0         X       X       X       0         X         XX        1
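
The main decoder is a pure lookup from opcode to control signals. The sketch below encodes the table rows, plus the addi row from the earlier addi slide with Jump = 0 assumed for it. Don't-care entries are represented by `None`; the names `X`, `FIELDS`, `MAIN_DECODER`, and `main_decoder` are choices made here.

```python
X = None  # don't care

FIELDS = ("RegWrite", "RegDst", "AluSrc", "Branch",
          "MemWrite", "MemtoReg", "ALUOp", "Jump")

# opcode -> control values, in FIELDS order
MAIN_DECODER = {
    0b000000: (1, 1, 0, 0, 0, 0, 0b10, 0),  # R-type
    0b100011: (1, 0, 1, 0, 0, 1, 0b00, 0),  # lw
    0b101011: (0, X, 1, 0, 1, X, 0b00, 0),  # sw
    0b000100: (0, X, 0, 1, 0, X, 0b01, 0),  # beq
    0b001000: (1, 0, 1, 0, 0, 0, 0b00, 0),  # addi (Jump = 0 is an assumption)
    0b000010: (0, X, X, X, 0, X, X, 1),     # j
}


def main_decoder(op):
    """Return the control signals for a 6-bit opcode as a dict."""
    return dict(zip(FIELDS, MAIN_DECODER[op]))
```

For instance, `main_decoder(0b100011)` gives the lw settings, with `MemtoReg` set so that the memory data, not the ALU result, is written to the register file.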

Review: Processor Performance
Program Execution Time
  = (# instructions)(cycles/instruction)(seconds/cycle)
  = # instructions x CPI x T_C

Single-Cycle Performance
T_C is limited by the critical path (lw)
[Figure: single-cycle datapath with the lw critical path highlighted]

Single-Cycle Performance
Single-cycle critical path:
  T_c = t_pcq_PC + t_mem + max(t_RFread, t_sext + t_mux) + t_ALU + t_mem + t_mux + t_RFsetup
In most implementations, limiting paths are: memory, ALU, register file.
Assuming t_RFread exceeds t_sext + t_mux, this reduces to:
  T_c = t_pcq_PC + 2 t_mem + t_RFread + t_mux + t_ALU + t_RFsetup

Single-Cycle Performance Example
Element               Parameter   Delay (ps)
Register clock-to-q   t_pcq_PC    30
Register setup        t_setup     20
Multiplexer           t_mux       25
ALU                   t_ALU       200
Memory read           t_mem       250
Register file read    t_RFread    150
Register file setup   t_RFsetup   20
T_c =

Single-Cycle Performance Example
Element               Parameter   Delay (ps)
Register clock-to-q   t_pcq_PC    30
Register setup        t_setup     20
Multiplexer           t_mux       25
ALU                   t_ALU       200
Memory read           t_mem       250
Register file read    t_RFread    150
Register file setup   t_RFsetup   20
T_c = t_pcq_PC + 2 t_mem + t_RFread + t_mux + t_ALU + t_RFsetup
    = [30 + 2(250) + 150 + 25 + 200 + 20] ps
    = 925 ps

Single-Cycle Performance Example
For a program with 100 billion instructions executing on a single-cycle MIPS processor,
Execution Time =
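
The cycle-time arithmetic from the delay table can be checked with a few lines of code. A minimal sketch: the dictionary keys mirror the parameter names in the table, and the function name `single_cycle_tc` is invented here. It uses the simplified critical-path formula, since the table gives no value for t_sext.

```python
# Element delays in picoseconds, copied from the table.
DELAYS_PS = {
    "t_pcq_PC": 30,
    "t_setup": 20,
    "t_mux": 25,
    "t_ALU": 200,
    "t_mem": 250,
    "t_RFread": 150,
    "t_RFsetup": 20,
}


def single_cycle_tc(d):
    """Cycle time (ps) from the simplified lw critical path."""
    return (d["t_pcq_PC"] + 2 * d["t_mem"] + d["t_RFread"]
            + d["t_mux"] + d["t_ALU"] + d["t_RFsetup"])
```

With the table values, `single_cycle_tc(DELAYS_PS)` evaluates to 925, matching the worked result; the two memory accesses alone account for 500 ps of it.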

Single-Cycle Performance Example
For a program with 100 billion instructions executing on a single-cycle MIPS processor,
Execution Time = # instructions x CPI x T_C
               = (100 x 10^9)(1)(925 x 10^-12 s)
               = 92.5 seconds

Pipelined MIPS Processor
Temporal parallelism
Divide single-cycle processor into 5 stages:
  Fetch
  Decode
  Execute
  Memory
  Writeback
Add pipeline registers between stages
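
The single-cycle execution-time calculation is a one-line product. A small sketch, with the function name `execution_time_s` invented here and the cycle time passed in picoseconds:

```python
def execution_time_s(n_instructions, cpi, tc_ps):
    """Execution time in seconds = # instructions x CPI x T_C (T_C given in ps)."""
    return n_instructions * cpi * tc_ps / 1e12


# 100 billion instructions, CPI of 1, 925 ps cycle time (values from the example)
single_cycle_time = execution_time_s(100 * 10**9, 1, 925)
```

Keeping the instruction count, CPI, and cycle time as separate arguments makes it easy to see which factor a different microarchitecture changes.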

Single-Cycle vs. Pipelined Performance
[Figure: timing comparison of single-cycle and pipelined execution]

Pipelining Abstraction
[Figure: pipelining abstraction]

Single-Cycle and Pipelined Datapath
[Figure: single-cycle and pipelined datapaths]

Corrected Pipelined Datapath
WriteReg must arrive at the same time as Result
[Figure: corrected pipelined datapath]

Pipelined Control
Same control unit as single-cycle processor
Control delayed to proper pipeline stage
[Figure: pipelined datapath with control signals]

Pipeline Hazards
A hazard occurs when an instruction depends on results from a previous instruction that hasn't completed.
Types of hazards:
  Data hazard: register value not written back to register file yet
  Control hazard: next instruction not decided yet (caused by branches)
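
A data hazard of the kind described above can be detected by comparing an instruction's destination register with the source registers of the instructions that follow it. This is only an illustrative sketch: the instruction tuples, the function name `find_data_hazards`, and the `window` parameter (how many following instructions can read a register before it has been written back) are all assumptions, since the slides do not specify them.

```python
def find_data_hazards(program, window):
    """List (producer index, consumer index, register) for each data hazard.

    program: list of (name, dest_register_or_None, source_registers)
    window:  assumed number of following instructions that would read the
             register file before the producer's write-back has happened
    """
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None:
            continue  # instruction writes no register (e.g. sw, beq)
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if dest in program[j][2]:
                hazards.append((i, j, dest))
    return hazards


# Hypothetical four-instruction sequence for illustration.
example = [
    ("add", 1, (2, 3)),     # writes register 1
    ("sub", 4, (1, 5)),     # reads register 1 right away
    ("or", 6, (7, 8)),
    ("sw", None, (1, 9)),   # reads register 1 three instructions later
]
```

With `window=2` only the add/sub pair is flagged; widening the window to 3 also flags the sw, which shows how the hazard depends on pipeline depth.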