CSE4: Components and Design Techniques for Digital Systems Register Transfer Level (RTL) Design Cont. Tajana Simunic Rosing
Where we are now
  What we are covering today: RTL design examples, RTL critical path analysis, CPU design
  CAPEs are out! https://cape.ucsd.edu/students/
    Your feedback is very important; please take the time to fill out the survey.
    I read all your feedback carefully and use it to guide the design of future courses.
    If at least 25 students do CAPEs, I will drop the lowest quiz grade!
  Deadlines:
    HW#6 due today: the last HW!
    Exam#3 during finals week: the last exam!
      8 minutes long, comprehensive
      Bring one 8 ½ x paper with handwritten notes, but nothing else
      Sample midterm 3 has been posted:
        Problem 4: we did not cover the three guidelines heuristic; all else is OK
        Problem 5: we did not cover PLAs, so just skip it
  Extra prof. office hour during finals week on Monday :3-2:3pm
  Extra TA/tutor office hours starting this week on Friday morning through Tuesday at :3am
Data vs. Control Dominated RTL Design
  Data dominant design: extensive datapath, simple controller
  Control dominant design: complex controller, simple datapath
  Example of data dominant design: simple filter
    Converts a digital input stream to a new digital output stream, e.g., to remove noise
      8, 8, 8, 8, 24, 8, 8: 24 is probably noise; the filter might replace it by 8
    Simple filter: output the average of the last N values
      Small N: less filtering
      Large N: more filtering, but less sharp output
  [Figure: digital filter block with input X, output Y, and clock clk]
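The averaging filter described above can be sketched in a few lines of Python. This is an illustrative software model only, not the hardware design; the function name `moving_average_filter` and the use of integer division are assumptions made for the example.

```python
from collections import deque


def moving_average_filter(samples, n):
    """Return, for each input sample, the average of the last n samples seen.

    Until n samples have arrived, the average is taken over what is available.
    Integer division is used to mimic fixed-width hardware (an assumption).
    """
    window = deque(maxlen=n)  # oldest sample drops out automatically
    outputs = []
    for x in samples:
        window.append(x)
        outputs.append(sum(window) // len(window))
    return outputs


stream = [8, 8, 8, 8, 24, 8, 8]           # the input stream from the slide
print(moving_average_filter(stream, 2))   # small N: the spike is still large
print(moving_average_filter(stream, 4))   # larger N: the spike is smoothed more
```

With N = 2 the noisy sample still produces an output of 16, while with N = 4 it is reduced to 12, at the cost of the disturbance lasting for more output samples.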
Data Dominated RTL Design Example: FIR Filter
  FIR filter: Finite Impulse Response
    A configurable weighted sum of past input values
      y(t) = c0*x(t) + c1*x(t-1) + c2*x(t-2)
    The above is known as a 3-tap filter; tens of taps are more common
    Very general filter: the user sets the constants (c0, c1, c2) to define a specific filter
  RTL design, Step 1: Create HLSM
    Very simple states/transitions
  HLSM for the FIR filter
    Inputs: X (2 bits)
    Outputs: Y (2 bits)
    Local storage: xt0, xt1, xt2, c0, c1, c2 (2 bits); Yreg (2 bits)
    State Init: Yreg := 0, xt0 := 0, xt1 := 0, xt2 := 0, c0 := 3, c1 := 2, c2 := 2
    State FC: Yreg := c0*xt0 + c1*xt1 + c2*xt2, xt0 := X, xt1 := xt0, xt2 := xt1
    Assumes constants set to 3, 2, and 2
  [Figure: digital filter block with input X, output Y, and clock clk, next to the HLSM state diagram]
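A minimal sketch of how the FIR high-level state machine behaves cycle by cycle, assuming the registers are named xt0, xt1, xt2 and Yreg, that they reset to 0, and that the constants are 3, 2, and 2 as the slide states. The class name `Fir3` and the method `clock` are hypothetical names for this illustration.

```python
class Fir3:
    """Software model of a 3-tap FIR filter HLSM (illustrative sketch)."""

    def __init__(self, c0=3, c1=2, c2=2):
        # Init state: clear the sample registers and output, load the constants.
        self.c0, self.c1, self.c2 = c0, c1, c2
        self.xt0 = self.xt1 = self.xt2 = 0
        self.yreg = 0

    def clock(self, x):
        """One pass through the compute state: all updates use the OLD values."""
        next_y = self.c0 * self.xt0 + self.c1 * self.xt1 + self.c2 * self.xt2
        # Shift the sample chain: xt0 := X, xt1 := xt0, xt2 := xt1.
        self.xt0, self.xt1, self.xt2 = x, self.xt0, self.xt1
        self.yreg = next_y
        return self.yreg


fir = Fir3()
impulse = [1, 0, 0, 0, 0]
print([fir.clock(x) for x in impulse])  # the constants appear one cycle later
```

Feeding a single 1 followed by zeros makes the constants appear on the output in order, delayed by one clock because Yreg is computed from the registered samples rather than from X directly.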
FIR Filter: Create datapath
  y(t) = c0*x(t) + c1*x(t-1) + c2*x(t-2)
  Begin by creating a chain of xt registers to hold past values of X
  Instantiate registers for c0, c1, c2
  Instantiate multipliers to compute the c*x values
  Instantiate adders
  Add circuitry to allow loading of a particular c register
  Steps 3 & 4: Connect to controller, create FSM: no controller needed
  [Figure: 3-tap FIR filter datapath: registers xt0, xt1, xt2 holding x(t), x(t-1), x(t-2); registers c0, c1, c2 loaded through a 2x4 decoder; three multipliers, two adders, and output register yreg driving Y]
FIR Filter: Design the Circuit
  HLSM
    Inputs: X (2 bits)
    Outputs: Y (2 bits)
    Local storage: xt0, xt1, xt2, c0, c1, c2 (2 bits); Yreg (2 bits)
    State Init: Yreg := 0, xt0 := 0, xt1 := 0, xt2 := 0, c0 := 3, c1 := 2, c2 := 2
    State FC: Yreg := c0*xt0 + c1*xt1 + c2*xt2, xt0 := X, xt1 := xt0, xt2 := xt1
  Create datapath
  Connect Ctrlr/DP
  Derive FSM
    Set clr and ld lines appropriately
  [Figure: datapath for the 3-tap FIR filter, with control lines xt_clr, xt_ld, c0_ld, c1_ld, c2_ld, Yreg_clr, and Yreg_ld]
Comparing the FIR circuit to a software implementation
  Circuit
    Adder has 2-gate delay, multiplier has 20-gate delay
    Longest path goes through one multiplier and two adders
      20 + 2 + 2 = 24-gate delay
    A 100-tap filter would have about a 34-gate delay: 1 multiplier and 7 adders on the longest path
  Software
    100-tap filter: 100 multiplications, 100 additions
    Say 2 instructions per multiplication, 2 per addition, and a 10-gate delay per instruction
      (100*2 + 100*2)*10 = 4000 gate delays
  [Figure: 3-tap FIR filter datapath for y(t) = c0*x(t) + c1*x(t-1) + c2*x(t-2)]
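The comparison above can be reproduced with a rough delay model. The sketch below assumes a 20-gate multiplier delay, a 2-gate adder delay, a balanced adder tree, and a 10-gate delay per instruction; these values are assumptions chosen to be consistent with the 24-gate and 34-gate totals on the slide, and the function names are hypothetical.

```python
def circuit_delay(taps, mult_delay=20, add_delay=2):
    """Gate delays on the longest path: one multiplier, then an adder tree.

    A balanced tree that sums `taps` products needs ceil(log2(taps)) adder
    levels; (taps - 1).bit_length() computes that without floating point.
    """
    adder_levels = (taps - 1).bit_length()
    return mult_delay + adder_levels * add_delay


def software_delay(taps, instr_per_mult=2, instr_per_add=2, gates_per_instr=10):
    """Gate delays when every multiply and add runs sequentially as instructions."""
    instructions = taps * instr_per_mult + taps * instr_per_add
    return instructions * gates_per_instr


print(circuit_delay(3))      # 3-tap filter: multiplier plus two adder levels
print(circuit_delay(100))    # 100-tap filter: multiplier plus seven adder levels
print(software_delay(100))   # same filter executed one instruction at a time
```

The circuit delay grows with the logarithm of the tap count because the multiplications run in parallel, while the software delay grows linearly with it.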
RTL: Determining Clock Frequency
  Frequency is limited by the longest register-to-register delay
    Known as the critical path
    In practice the critical path also includes wire delays, setup/hold constraints, etc.
  Longest path is 7 ns: Max(2, 7, 7, 5) = 7 ns
  Fastest frequency: 1 / 7 ns = 142 MHz
  [Figure: registers a and b feeding an adder (2 ns delay) and a multiplier (5 ns delay) into registers c and d, with register-to-register path delays of 2, 7, 7, and 5 ns]
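As a worked version of the calculation above, the following sketch picks the critical path from a list of register-to-register delays and converts it to a maximum clock frequency. It ignores wire delay and setup time, and the function name `max_clock_mhz` is an assumption for this example.

```python
def max_clock_mhz(path_delays_ns):
    """Return (critical path in ns, maximum clock frequency in MHz).

    The clock period can be no shorter than the slowest register-to-register
    path, so f_max = 1 / critical_path. A 1 ns period equals 1000 MHz.
    """
    critical_ns = max(path_delays_ns)
    return critical_ns, 1000.0 / critical_ns


critical, f_max = max_clock_mhz([2, 7, 7, 5])  # path delays from the slide
print(f"critical path = {critical} ns, f_max = {f_max:.1f} MHz")
```

For the four paths on the slide this gives a 7 ns critical path and a maximum frequency just under 143 MHz, which the slide rounds down to 142 MHz.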
RTL: A Circuit May Have Numerous Paths
  Paths can exist
    In the datapath
    In the controller
    Between the controller and datapath
  May be hundreds or thousands of paths
  Timing analysis tools need to evaluate all possible paths
  [Figure: controller (state register and combinational logic) connected to a datapath (8-bit comparator, 8-bit adder, tot register) through tot_ld, tot_clr, and tot_lt_s, with example paths marked (a), (b), and (c)]
RTL Summary
  Datapath and control design
  RTL design steps
    1. Define the high-level state machine
    2. Create datapath
    3. Connect datapath with control
    4. Implement the FSM
  Timing analysis: critical path in more complex circuits
    Watch out for all possible long paths (e.g., datapath to FSM, FSM control logic, datapath logic, etc.)
CSE4: Components and Design Techniques for Digital Systems Single Cycle CPU Design Tajana Simunic Rosing
MIPS Single-Cycle Datapath & Control
  [Figure: MIPS single-cycle datapath and control: PC, instruction memory, register file, sign extend, ALU with ALU control, data memory, an adder for PC + 4 and an adder for the branch target (offset shifted left by 2), and multiplexers driven by the control signals REG_DST, REG_WRITE, ALU_SRC, ALU_OP, BRANCH, MEM_READ, MEM_WRITE, and MEM_TO_REG]
CPU Components
  Combinational logic:
    Boolean equations, logic gates
    Multiplexors and decoders
    ALU: executes arithmetic/logical operations
2-input, 32-bit MUX
  Selects one input as the output
  [Figure: 32-bit MUX with inputs I0 and I1, select S, and output O; implemented as 32 one-bit MUXes that share the same select line S]
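A 2-input, 32-bit MUX can be modeled with bitwise operations, which mirrors the gate-level idea of ANDing each input with the select line (or its complement) and ORing the results. This is a behavioral sketch; the function name `mux2_32` is an assumption.

```python
MASK32 = 0xFFFFFFFF  # all 32 bits set


def mux2_32(i0, i1, s):
    """Return i0 when s is 0 and i1 when s is 1, limited to 32 bits."""
    sel = MASK32 if s else 0  # replicate the select bit across all 32 bit slices
    # Each bit slice computes (i0 AND NOT s) OR (i1 AND s).
    return (i0 & (sel ^ MASK32)) | (i1 & sel)


print(hex(mux2_32(0x12345678, 0xDEADBEEF, 0)))
print(hex(mux2_32(0x12345678, 0xDEADBEEF, 1)))
```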
Decoder
  2 inputs, 2^2 = 4 outputs
  Translates the input into a binary number B and turns on output B
  [Figure: 2-to-4 decoder with inputs I1, I0 and outputs O0, O1, O2, O3, its truth table, and its gate-level implementation]
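The decoder behavior described above fits in a few lines. In this sketch the inputs are named i1 (most significant) and i0, and the outputs are returned as a list ordered O0 to O3; both naming choices are assumptions for illustration.

```python
def decoder_2to4(i1, i0):
    """Treat (i1, i0) as a 2-bit binary number B and turn on only output B.

    Returns [O0, O1, O2, O3], where exactly one entry is 1.
    """
    b = (i1 << 1) | i0
    return [1 if k == b else 0 for k in range(4)]


for i1 in (0, 1):
    for i0 in (0, 1):
        print(i1, i0, decoder_2to4(i1, i0))
```

The loop prints the full truth table: each input combination activates exactly one output.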
Full 32-bit ALU
  Performs: AND, OR, NOT, ADD, SUB, Overflow Detection, GTE
  [Figure: 32-bit ALU with 32-bit inputs A and B, OP CODE, and CarryIn; outputs are the 32-bit Result, Overflow, and CarryOut]
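A behavioral sketch of some of the listed ALU operations, assuming two's-complement operands and subtraction implemented as A + NOT(B) + 1 (inverting B and setting the carry-in). The operation names passed as strings and the function name `alu32` are hypothetical; the actual op code encoding is not given here, and GTE is left out.

```python
MASK32 = 0xFFFFFFFF


def alu32(op, a, b=0):
    """Return (result, overflow, carry_out) for a 32-bit ALU operation."""
    a &= MASK32
    b &= MASK32
    overflow, carry_out = False, 0
    if op == "AND":
        result = a & b
    elif op == "OR":
        result = a | b
    elif op == "NOT":
        result = ~a & MASK32
    elif op in ("ADD", "SUB"):
        # SUB inverts B and injects a carry-in of 1: A - B = A + ~B + 1.
        operand = b if op == "ADD" else (~b & MASK32)
        carry_in = 0 if op == "ADD" else 1
        total = a + operand + carry_in
        result = total & MASK32
        carry_out = total >> 32
        # Signed overflow: both adder inputs share a sign the result lacks.
        sign_a, sign_o, sign_r = a >> 31, operand >> 31, result >> 31
        overflow = (sign_a == sign_o) and (sign_r != sign_a)
    else:
        raise ValueError(f"unsupported operation: {op}")
    return result, overflow, carry_out


print(alu32("ADD", 0x7FFFFFFF, 1))  # largest positive + 1 overflows
print(alu32("SUB", 7, 5))
```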
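MSB ALU
  [Figure: most-significant-bit slice of the ALU: inputs for the A and B bits, Binvert, CarryIn, GTEin, and OP; an adder and XOR gates produce result, CarryOut, GTEout, and overflow. GTEin and GTEout carry the greater-than-or-equal comparison of A and B.]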
CPU Components
  Combinational logic:
    Boolean equations, logic gates
    Multiplexors and decoders
    ALU: executes arithmetic/logical operations
  Sequential logic:
    Storage (memory) elements
    Counters
Memory elements: D-Latch
  Sets the SR-latch (Q) to the value of D when the clock (C) is high; otherwise the last Q is retained
  [Figure: D latch built from a NOR-based SR latch: D and C drive the R (reset) and S (set) inputs, and Q is the stored state value]
Memory elements: Flip-Flop
  Stores the new value of D in Q when C falls; otherwise the current stored value of Q is retained: falling edge-triggered FF
  [Figure: flip-flop built from two cascaded D latches, with timing waveforms for C (clock), D (data), and the latch outputs]
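The falling-edge behavior can be illustrated by cascading two level-sensitive latches. This sketch assumes the first latch is transparent while C is high and the second while C is low (the inverted clock); the class names `DLatch` and `FallingEdgeFlipFlop` are hypothetical.

```python
class DLatch:
    """Level-sensitive latch: Q follows D while C is 1, otherwise Q is held."""

    def __init__(self):
        self.q = 0

    def update(self, d, c):
        if c:
            self.q = d
        return self.q


class FallingEdgeFlipFlop:
    """Two cascaded latches: the first samples while C is high, the second
    passes that sample to the output once C goes low."""

    def __init__(self):
        self.first = DLatch()
        self.second = DLatch()

    def update(self, d, c):
        captured = self.first.update(d, c)
        return self.second.update(captured, 1 - c)  # second latch sees NOT C


ff = FallingEdgeFlipFlop()
inputs = [(1, 1), (1, 0), (0, 0), (0, 1), (0, 0)]  # (D, C) at successive times
print([ff.update(d, c) for d, c in inputs])
```

In the trace, Q becomes 1 only when C falls after D was 1, ignores the change in D while C is low, and becomes 0 at the next falling edge.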
Read/Write Register File
  Read: input the Read Reg #. The MUX selects Q of that set of FFs as the output.
  Write: input the Write Reg # and Value. The Write Value goes to every FF; the Write Reg # turns on C for only one register's FFs, where the Value is stored.
  [Figure: register file: a decoder driven by Write Reg # (5 bits) and Clock enables one of the registers, Write Value feeds the D inputs of all registers, and a MUX driven by Read Reg # (5 bits) selects the Read Value]
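A behavioral sketch of the register file described above, with the decoder modeled as a one-hot list of enables and the read MUX as list indexing. It assumes 32 registers (matching the 5-bit register numbers) of 32 bits each; the class and method names are hypothetical.

```python
class RegisterFile:
    """One write port, one read port (illustrative model)."""

    def __init__(self, num_regs=32, width=32):
        self.regs = [0] * num_regs
        self.mask = (1 << width) - 1

    def read(self, read_reg):
        # The MUX: the register number selects which stored value is output.
        return self.regs[read_reg]

    def write(self, write_reg, value):
        # The decoder: exactly one enable line is on, so although the value
        # is presented to every register, only the selected one stores it.
        enables = [k == write_reg for k in range(len(self.regs))]
        for k, enabled in enumerate(enables):
            if enabled:
                self.regs[k] = value & self.mask


rf = RegisterFile()
rf.write(5, 0xCAFE)
print(hex(rf.read(5)), rf.read(6))
```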
Comparing Processor Memory
  Register file
    Intermediate data storage within the CPU
    Fastest
    Biggest area per cell
  SRAM
    Fast
    More compact
    Used for caches
  DRAM
    Slowest but very compact, and refreshing takes time
    Different technology due to large capacitors
    Used for main memory
  [Figure: a register file block with W_data, W_addr, W_en, R_data, R_addr, and R_en ports; an MxN memory implemented as a register file, SRAM, or DRAM; size comparison for the same number of bits (not to scale)]
RAM Internal Structure
  Similar internal structure to a register file
  Let A = log2 M for a RAM with M words
  Decoder enables the appropriate word based on the address inputs
  rw controls whether the cell is written or read
  [Figure: RAM block with data, addr, rw, and en ports; internally, an address decoder drives one word enable line per word, each word is a row of bit storage blocks (cells), and wdata/rdata lines run through every word]
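The decoder-plus-cells structure can be sketched behaviorally as below. The polarity of rw (1 for write, 0 for read), the `access` method, and the class name `Ram` are assumptions made for this illustration; the slide does not specify them.

```python
class Ram:
    """M words of N bits, accessed through an address decoder (sketch)."""

    def __init__(self, words, width):
        self.cells = [0] * words
        self.mask = (1 << width) - 1
        # A = log2(M) address bits when M is a power of two.
        self.addr_bits = (words - 1).bit_length()

    def access(self, addr, rw, en, wdata=0):
        """Assumed convention: rw=1 writes wdata, rw=0 returns the stored word."""
        if not en:
            return None  # chip disabled: no word enable line is active
        word_enable = [k == addr for k in range(len(self.cells))]  # decoder
        word = word_enable.index(True)
        if rw:
            self.cells[word] = wdata & self.mask
            return None
        return self.cells[word]


ram = Ram(words=1024, width=32)
ram.access(addr=3, rw=1, en=1, wdata=0xABCD)
print(ram.addr_bits, hex(ram.access(addr=3, rw=0, en=1)))
```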
Static RAM (SRAM) - writing
  Static RAM cell
    6 transistors (recall that an inverter is 2 transistors)
  Writing this cell
    The word enable input comes from the decoder
    When word enable is 0, the value d loops around the inverters
      That loop is where a bit stays stored
    When word enable is 1, the data bit value enters the loop
      data is the bit to be stored in this cell
      data' enters on the other side
    The example shows a bit being written into the cell
  [Figure: SRAM cell with a word enable line, bit lines data and data', and the stored value d held in a cross-coupled inverter loop]
Static RAM (SRAM) - reading
  Static RAM cell - reading
    When rw is set to read, the RAM logic sets both data and data' to 1
    The stored bit d will pull either the left line or the right line down slightly below 1
    Sense amplifiers detect which side is slightly pulled down
  [Figure: SRAM cell during a read, with the data and data' lines going to sense amplifiers]
Dynamic RAM (DRAM)
  Dynamic RAM cell
    1 transistor (rather than 6)
    Relies on a large capacitor to store the bit
      Write: the transistor conducts, and the data voltage level gets stored on the top plate of the capacitor
      Read: look at the value of d
    Problem: the capacitor discharges over time
      Must refresh regularly, by reading d and then writing it right back
  [Figure: (a) DRAM cell with word enable, data line, and storage capacitor holding d; (b) cell capacitor voltage slowly discharging over time]
Memory Storage Permanence
  Traditional ROM/RAM distinctions
    ROM: read only, bits stored without power
    RAM: read and write, lose stored bits without power
  Distinctions blurred
    Advanced ROMs can be written to, e.g., EEPROM, FLASH
    Advanced RAMs can hold bits without power, e.g., NVRAM
  [Figure: write ability and storage permanence of memories, showing relative degrees along each axis (not to scale). Storage permanence axis: near zero, battery life, tens of years, life of product. Write ability axis: during fabrication only; external programmer, one time only; external programmer; external programmer or in-system; in-system, fast writes, unlimited cycles. Memories shown: mask-programmed ROM, OTP ROM, EPROM, EEPROM, FLASH (block-oriented writes), NVRAM, SRAM/DRAM, and the ideal memory. Nonvolatile memories sit at the high-permanence end.]
ROM & Non-volatile memory
  Erasable Programmable ROM (EPROM)
    Uses a floating-gate transistor in each cell
    The programmer uses a higher-than-normal voltage so electrons tunnel into the gate
      Electrons become trapped in the gate
      Only done for cells that should store 0; the rest are 1
    To erase, shine ultraviolet light onto the chip
  Electrically Erasable Programmable ROM (EEPROM)
    Programming similar to EPROM
    Erasing is done electronically, one word at a time
  Flash memory
    Large blocks can be erased simultaneously
  Non-volatile memory (NVM)
    Phase-change memory (PCM)
      Material changes phase (liquid to solid) to program
    STT-RAM & MRAM
      Use magnetic properties to program
      Similar to RAM, but with slower writes
  [Figure: PCM cell connected to a word line and a bit line]