CS 152 Midterm 2 May 2, 2002 Bob Brodersen


Name: Solutions

Show your work if you want partial credit! Try all the problems; don't get stuck on one of them. Each one is worth 10 points.

Question 1: Delay, Capacitance and Energy

Use the graphs below for the delay characteristics of the inverter and NAND gate. Co is the capacitance connected to the output.

[Figure: a compound gate and its equivalent compound block, with terminals W, X, Y and Z. Two graphs plot the delay A->O (or B->O) against Co. High-to-low transitions: labeled 0.1 ns, slope = 0.01 ns/fF. Low-to-high transitions: labeled 0.2 ns, slope = 0.04 ns/fF. Annotations: "Load at A and B = 10 fF" and "Load at A = 10 fF".]

a) Fill in this table for the composite block (W to Y). Assume the wiring load is negligible and X = 1.

| W -> Y | Internal delay | Load-dependent delay | Input load at W |
|---|---|---|---|
| Low to High at Y | 0.2 ns | 0.04 ns/fF | 20 fF |
| High to Low at Y | 0.9 ns | 0.01 ns/fF | 20 fF |

b) If W and X start high and go low at the same time, how much energy does the logic in this compound gate use if the supply is at 3 volts? (Assume no load at X and Y, and do not include the energy used to drive this gate.) Assume the only capacitance being switched is the input capacitance of each gate.

CV^2 = (40 fF of internal node capacitance) * 3^2 = 360 femtojoules.
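The energy figure in part b) can be checked with a one-line calculation. This is a minimal sketch that follows the CV^2 convention used in the solution; the function name `switching_energy_fj` is hypothetical, and the units (femtofarads in, femtojoules out) are chosen so the arithmetic stays in whole numbers.

```python
def switching_energy_fj(cap_ff, vdd_volts):
    """Return C * V^2 in femtojoules for a capacitance given in femtofarads.

    fF * V^2 = fJ, so no unit scaling is needed.
    """
    return cap_ff * vdd_volts ** 2


# 40 fF of internal node capacitance switched with a 3 V supply.
energy = switching_energy_fj(40, 3)
print(energy, "fJ")
```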

Question 2: CPI

Assume a processor has a clock rate of 500 MHz and an ideal CPI (no memory misses) of 1.0. What is the effective CPI if a program with a mix of 50% arithmetic and logic, 30% load/store, and 20% control instructions is run, if 10% of the data memory operations and 1% of the instructions have a miss penalty of 50 cycles? Show the equation you used to get your answer.

Effective CPI = Base CPI + data memory miss cycles + instruction memory miss cycles
= 1 + (0.3)(0.1)(50) + (0.01)(1)(50) = 1 + 1.5 + 0.5 = 3.0
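The same CPI equation can be written as a small function, which makes it easy to vary the miss rates. This is a sketch only; the function and parameter names are hypothetical and not part of the exam.

```python
def effective_cpi(base_cpi, mem_frac, data_miss_rate, inst_miss_rate, penalty):
    """Base CPI plus stall cycles per instruction from data and instruction misses."""
    # Only loads/stores access data memory, so scale by their share of the mix.
    data_stalls = mem_frac * data_miss_rate * penalty
    # Every instruction is fetched, so the instruction miss rate applies to all.
    inst_stalls = 1.0 * inst_miss_rate * penalty
    return base_cpi + data_stalls + inst_stalls


cpi = effective_cpi(base_cpi=1.0, mem_frac=0.3, data_miss_rate=0.10,
                    inst_miss_rate=0.01, penalty=50)
print(round(cpi, 2))
```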

Question 3: Pipelining hazards

The "=" block after the registers is a comparator; assume its output is available to the controller.

(a) How many branch delay slots does this datapath need? Explain why.

One delay slot. By the time the branch is decoded and a decision is made by the comparator, there is already one instruction following it in the IF stage.

(b)
    add   $2, $1, $3
    addiu $1, $2, 1234
    beq   $1, $0, label

This code demonstrates that this datapath has a hazard problem. What is the hazard, what kind is it, and what changes to the datapath are needed to eliminate it?

Read-after-write hazard between the addiu and beq instructions. The datapath has no way to forward register $1 to the ID stage, so the beq will not get the proper value for register $1. It can be fixed by moving the forwarding muxes to the ID stage before the comparator, or by adding an extra set of forwarding muxes there.

Question 4: Tomasulo scheduling

| Functional unit type | Cycles |
|---|---|
| Loads | 1 |
| Integer | 1 |
| FP adder | 3 |
| FP multiplier | 6 |

    ld1:  ld    $f0, 0($r1)
    mlt1: multd $f4, $f0, $f2
    ld2:  ld    $f6, 0($r2)
    mlt2: multd $f6, $f4, $f6
    mlt3: multd $f2, $f4, $f6
    sd:   sd    0($r2), $f2

Consider the single-issue Tomasulo processor and program shown above. Assume there is an integer unit that can process all integer operations in one clock cycle. Enter into the table below the clock cycle of the issue, the start of execution, and when the result is posted to the CDB. Also fill in the entries in the table below which show the value in each FP register and what the entries are in the Register Result Status table at the clock cycle right after issue of each instruction.

| Instruction | Issue | Exec | CDB | Register result status entries | FP register entries |
|---|---|---|---|---|---|
| ld1 | 1 | 2 | 3 | ld1 | |
| mlt1 | 2 | 4 | 10 | ld1, mult1 | |
| ld2 | 3 | 4 | 5 | mult1, ld2 | M($r1) |
| mlt2 | 4 | 11 | 17 | mult1, mult2 | M($r1) |
| mlt3 | 11 | 18 | 24 | mult1, mult2 | M($r1) |
| sd | 12 | 25 | N/A | mult1, mult2 | M($r1) |

The Register Result Status table and the FP Registers table each have columns F0, F2, F4 and F6. The FP register entries also include $r1*$f2 (two entries) and M($r2) (four entries).
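The Issue, Exec and CDB cycles in the solution can be reproduced with a small timing model. The sketch below is not the exam's method: it assumes two FP multiplier reservation stations (which would explain why the third multiply cannot issue until cycle 11), that a station is freed in the cycle after its result is posted, that execution starts the cycle after issue and after the last operand appears on the CDB, and that integer base registers are always ready. Functional-unit contention is not modeled, and all names are hypothetical.

```python
LATENCY = {"ld": 1, "multd": 6, "sd": 1}
MULT_STATIONS = 2  # assumption: not stated in the text above

# (label, opcode, destination FP register or None, source FP registers)
PROGRAM = [
    ("ld1", "ld", "f0", []),
    ("mlt1", "multd", "f4", ["f0", "f2"]),
    ("ld2", "ld", "f6", []),
    ("mlt2", "multd", "f6", ["f4", "f6"]),
    ("mlt3", "multd", "f2", ["f4", "f6"]),
    ("sd", "sd", None, ["f2"]),
]


def schedule(program):
    """Return {label: (issue, exec_start, cdb)} for in-order single issue."""
    result_cycle = {}  # register -> CDB cycle of its most recent producer
    mult_cdb = []      # CDB cycles of the multiplies issued so far
    timing = {}
    cycle = 0
    for label, op, dest, srcs in program:
        cycle += 1  # at most one instruction issues per cycle
        if op == "multd":
            # Stall issue while every multiplier station is still occupied.
            while sum(1 for c in mult_cdb if c >= cycle) >= MULT_STATIONS:
                cycle += 1
        operands_ready = max((result_cycle.get(r, 0) for r in srcs), default=0)
        start = max(cycle, operands_ready) + 1
        cdb = start + LATENCY[op] if dest is not None else None
        if dest is not None:
            result_cycle[dest] = cdb
        if op == "multd":
            mult_cdb.append(cdb)
        timing[label] = (cycle, start, cdb)
    return timing


for label, cycles in schedule(PROGRAM).items():
    print(label, cycles)
```

Under these assumptions the model gives the same cycle numbers as the table; changing `MULT_STATIONS` shows how the issue cycle of the third multiply depends on the number of stations.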

Question 5: Logic delay

[Figure: datapath with register A (32 bits, clocked by clk1, controlled by WriteA), an ALU with an ALU Op input from the control logic, and register B (64 bits, clocked by clk2, controlled by Shift right and WriteB, with its LSB marked). Clock skew is marked between clk1 and clk2.]

Registers A and B:
- setup time = 3 ns
- hold time = 2 ns
- clock-to-q time = 3 ns
- shift time = 1 ns

ALU:
- shortest delay path = 1 ns
- longest delay path = 6 ns

Assume delays through the control logic are negligible.

a) Suppose that WriteA and WriteB are always asserted. What is the maximum allowable skew between clk1 and clk2, and in which direction?

A clk-to-q + ALU (shortest) - Skew > B hold
3 + 1 - Skew > 2
Skew < 2 ns, with clk1 being skewed to be before clk2.

b) Suppose that WriteA is asserted once to load a value into register A and then the circuit is allowed to run for a number of cycles. While running, the control unit will alternately assert the Shift right signal and then the WriteB signal. What is the maximum clock frequency at which we can run this circuit?

B clk-to-q + ALU (longest) + B setup = 3 + 6 + 3 = 12 ns; f_clk = 83 MHz
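Both timing constraints are simple sums, so they can be captured in two helper functions. This is a sketch under the same assumptions as the solution (negligible control delay, hold check on the shortest ALU path, setup check on the longest); the function names are hypothetical.

```python
def max_skew_ns(clk_to_q, shortest_logic, hold):
    """Largest skew (capture clock later than launch clock) that still meets hold."""
    return clk_to_q + shortest_logic - hold


def min_period_ns(clk_to_q, longest_logic, setup):
    """Smallest clock period that still meets setup on the longest path."""
    return clk_to_q + longest_logic + setup


skew_limit = max_skew_ns(clk_to_q=3, shortest_logic=1, hold=2)
period = min_period_ns(clk_to_q=3, longest_logic=6, setup=3)
f_max_mhz = 1000 / period  # a period in ns gives a frequency in MHz

print("skew must stay below", skew_limit, "ns")
print("minimum period", period, "ns, about", round(f_max_mhz, 1), "MHz")
```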

Question 6: Branch Prediction

Suppose we have a deeply pipelined processor, for which we implement a branch-target buffer for the conditional branches, which are 15% of the instructions. Assume that the misprediction penalty is always 3 cycles and the buffer miss penalty is always 6 cycles. Assume a 90% hit rate in the buffer and 75% accuracy of the buffer prediction. Assume a base CPI without branch stalls of 1.

a) What is the CPI? Explain your answer in words as well as equations.

CPI = 1 + (probability of branch) * (miss in buffer + in buffer but mispredicted)
= 1 + 0.15 * ((0.1)(6) + (0.9)(0.25)(3)) = 1.19

b) What are the entries in a branch history table and how are they indexed?

The entries are bits that indicate whether past branches were taken or not, and the table is indexed by the lower bits of the instruction address.
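The CPI expression in part a) splits branch stalls into two cases: the branch misses in the buffer, or it hits but is mispredicted. A minimal sketch of that calculation follows; the function and parameter names are hypothetical.

```python
def btb_cpi(base_cpi, branch_frac, hit_rate, accuracy,
            mispredict_penalty, miss_penalty):
    """CPI with branch stalls from buffer misses and from mispredicted hits."""
    miss_cycles = (1 - hit_rate) * miss_penalty
    mispredict_cycles = hit_rate * (1 - accuracy) * mispredict_penalty
    return base_cpi + branch_frac * (miss_cycles + mispredict_cycles)


cpi = btb_cpi(base_cpi=1.0, branch_frac=0.15, hit_rate=0.90, accuracy=0.75,
              mispredict_penalty=3, miss_penalty=6)
print(round(cpi, 2))
```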

Question 7: Caches

Cache C1 is direct-mapped, C2 is fully associative, and C3 is 2-way set associative. Each has 4 one-word blocks (4 total words). Assume that the miss penalty for each is 10 clock cycles. Assume that the caches are initially empty. Using word addresses, fill in the chart below with whether each memory reference hits or misses and which block it would be in, for all of the caches. At the bottom of the chart, compute the hit rate and the total miss penalty. Use an LRU strategy for replacement when appropriate.

| Memory reference | Cache 1 (direct) H/M | Block # | Cache 2 (assoc) H/M | Block # | Cache 3 (2-way set assoc) H/M | Set | Block # |
|---|---|---|---|---|---|---|---|
| 0 | M | 0 | M | 0 | M | 0 | 0 |
| 4 | M | 0 | M | 1 | M | 0 | 1 |
| 8 | M | 0 | M | 2 | M | 0 | 0 |
| 0 | M | 0 | H | 0 | M | 0 | 1 |
| 4 | M | 0 | H | 1 | M | 0 | 0 |
| 8 | M | 0 | H | 2 | M | 0 | 1 |

| | Cache 1 (direct) | Cache 2 (assoc) | Cache 3 (2-way set assoc) |
|---|---|---|---|
| Hit rate | 0 | 50% | 0 |
| Miss penalty | 6 * 10 = 60 cycles | 30 cycles | 60 cycles |
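The chart can be reproduced with a short LRU cache model in which all three organizations are special cases of a set-associative cache (1 way, 4 ways, or 2 ways). This is an illustrative sketch, not part of the exam: the function name `simulate` is hypothetical, and it assumes word addresses map to sets by `address mod number_of_sets` and that empty ways are filled in order.

```python
from collections import OrderedDict


def simulate(addresses, num_blocks, ways):
    """Return a list of (hit_or_miss, set_index, way) for each reference."""
    num_sets = num_blocks // ways
    # One OrderedDict per set: address -> way, ordered from LRU to MRU.
    sets = [OrderedDict() for _ in range(num_sets)]
    trace = []
    for addr in addresses:
        index = addr % num_sets
        lines = sets[index]
        if addr in lines:
            lines.move_to_end(addr)  # mark as most recently used
            trace.append(("H", index, lines[addr]))
            continue
        if len(lines) < ways:
            way = len(lines)  # fill the next empty way
        else:
            _, way = lines.popitem(last=False)  # evict LRU, reuse its way
        lines[addr] = way
        trace.append(("M", index, way))
    return trace


refs = [0, 4, 8, 0, 4, 8]
for name, ways in (("direct", 1), ("fully associative", 4), ("2-way", 2)):
    trace = simulate(refs, num_blocks=4, ways=ways)
    misses = sum(1 for hm, _, _ in trace if hm == "M")
    hit_rate = (len(refs) - misses) / len(refs)
    print(name, trace, "hit rate", hit_rate, "penalty", misses * 10, "cycles")
```

Addresses 0, 4 and 8 all map to set 0 in the direct-mapped and 2-way caches, which is why only the fully associative cache gets any hits on this reference stream.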

Question 8: Multicycle datapath

[Figure: datapath built around two buses, BusA and BusB. Labels include nPC, PC, IR, SX, ZX, the Register File (addressed by Rs, Rt, Rd), registers A, B, S and M, and the memory MEM.]

The datapath above forms a multicycle processor which uses two time-multiplexed buses for communication rather than point-to-point connections and muxes. SX and ZX are the sign-extended and zero-extended immediates.

a) For this datapath draw an FSM (with bubbles and arcs) for Fetch, Decode and the operations ADDI and LW.

Fetch -> Decode/Dispatch, then one of two branches:
- ADDI: Execute: S <- A + SX; Writeback: Regfile <- S; return to Fetch
- LW: Execute: S <- A + SX; Memory: Mem <- S; Writeback: Regfile <- M; return to Fetch

b) Fill out the microprogram table below to implement this FSM. The SrcA and SrcB fields specify which signals will be assigned to BusA and BusB; the WrtA and WrtB fields specify which components receive inputs from the buses. The SrcX and WrtX fields can be any one of the state registers (IR, A, B, S, M), the register file (RegFile), or memory (Mem). Sequence specifies the function of a jump counter (next for next instruction, fetch for go to 00, or dispatch).

| µAddr | Instruction | Entries (SrcA, SrcB, ALUOp, WrtA, WrtB; blank fields omitted) | Sequence |
|---|---|---|---|
| 00 | Fetch | PC, Mem, IR | Next |
| 01 | Decode | | Dispatch |
| 02 | ADDI | A, SX, ADD | Next |
| 03 | | S, RegFile | Fetch |
| 04 | LW | A, SX, ADD | Next |
| 05 | | S, Mem | Next |
| 06 | | M, RegFile | Fetch |
| 07 | | | |
| 08 | | | |