Fill-in the following to understand stalling needs and forwarding opportunities


Instruction: ADD4, ADD

Receiving forwarding help:
  Insists on receiving in EX
  Doesn't mind receiving in EX
  Doesn't mind receiving in EX2

Providing forwarding help:
  Capable of providing from EX2
  Capable of providing from WB

Based on the above, if an instruction is dependent on a senior instruction which is not just above (just above = just before), there is never a need to stall the dependent instruction. True / False

If the dependent instruction is either [...] or [...], then it needs help at the beginning of the clock when it is in EX, as it needs to process the data using [...] in EX. And if the senior instruction (donor instruction) is just above it (just before it) in the EX2 stage, and if it is either [...] or [...], then it can't help at the beginning of the clock, as it is still producing the data using the ADD4 in EX2. Hence this dependency hazard should be detected when the dependent instruction is in the ID stage, and the dependent instruction should be stalled. The stall is for (just [...] clock, minimum for [...] clock).

(Unlike/Like) the MIPS 5 stage pipeline, where the instructions (have only one source register / can have two source registers), here the instructions (have only one source register / can have two source registers). Hence it (is / isn't) possible to stall the dependent instruction in the EX stage instead of the ID stage.

Draw logic to go into HDU and FU2.
[Figure labels: HDU, STALL, FU2, FORW2]

ee457_lab7_p3_simple_pipeline.fm 3/5/ 4 C Copyright 2 Gandhi Puvvada
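The stall condition described above can be tried out in software before drawing the HDU gates. The Python sketch below is only an illustration under stated assumptions: `Instr`, the flags `needs_src_in_ex` and `result_made_in_ex2`, and the function `hdu_stall` are hypothetical names that do not come from the lab, a single source register field is assumed for brevity, and the sketch deliberately leaves open which instructions set each flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    """Hypothetical view of an instruction as the hazard logic sees it."""
    src: Optional[int]          # source register number, None if unused
    dst: Optional[int]          # destination register number, None if no write
    needs_src_in_ex: bool       # operand must arrive at the start of EX
    result_made_in_ex2: bool    # result is still being produced during EX2


def hdu_stall(in_id: Instr, in_ex: Optional[Instr]) -> bool:
    """Return True if the instruction in ID must be held for a clock.

    The check runs while the dependent instruction is in ID and its
    senior (the instruction just before it) is in EX. One clock later
    the senior would be in EX2, still computing, while the dependent
    would already need the value at the start of EX.
    """
    if in_ex is None or in_id.src is None or in_ex.dst is None:
        return False
    depends = in_id.src == in_ex.dst
    return depends and in_id.needs_src_in_ex and in_ex.result_made_in_ex2


# Assumed example: the senior writes register 5 late, the junior reads it early.
senior = Instr(src=6, dst=5, needs_src_in_ex=False, result_made_in_ex2=True)
junior = Instr(src=5, dst=4, needs_src_in_ex=True, result_made_in_ex2=False)
print(hdu_stall(junior, senior))
```

If either flag is False, the function reports no stall, which corresponds to the cases where forwarding alone is enough.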

Draw logic to go into FU as per the diagram on page 2.
[Figure labels: FR_HP, EX_XD, EX_PRIO_XD, EX_ADDER_IN, FORW, EX, EX_ADDER_OUT, EX_XD_OUT, EX2, WB, FR_LP, PRIORITY, FU]

Redesign the logic to go into FU if the arrangement of the forwarding muxes is changed as shown.
[Figure labels: EX, EX2, WB, FR_LP, FR_HP, NEW_FU; FR_HP (forward High_Priority), FR_LP (forward Low_Priority)]

Any advantage of the NEW_FU over the original FU? Note: If logic is reduced, then it is cheaper and faster!

Questions (individual effort, paper submission, submit pages 2/4, 4/4, 5/4 and also pages 8/4 to 4/4)

Please consider the following questions before designing and implementing your control. You need to think about who can wait for forwarding data, and until when at the latest, and who can provide forwarding data, and from when at the earliest.

Q 1 Can an instruction postpone receiving forwarding data until it reaches the EX2 stage? If an instruction can postpone receiving forwarding help until reaching EX2, would it still try to receive help while it is in EX (maybe because the donor instruction cannot wait)?

Q 2 Are there occasions where you end up stalling an instruction because you could not provide the needed forwarding data to it in EX?

Q 3 Can an instruction in EX2 provide forwarding help to an instruction behind it? Or is it too early for any instruction to start providing forwarding help while it is in EX2?

Q 4 Which instructions need to wait until they reach the WB stage to provide forwarding help? And why do they need to wait until then?

Q 5 Priority in Forwarding: Recall that, in the MIPS pipelined CPU design, if the instructions in both the MEM stage and the WB stage are willing to provide forwarding help to the instruction in the EX stage, we exercise priority and accept help from the (MEM/WB) stage. Do we have such a situation here? If so, explain with an example instruction sequence.

Q 6 Normally forwarding is done at the beginning of a clock so that the recipient instruction can process the information during the clock. However, sometimes it may make sense (as in Lab 7 Part [...]) to forward information at the end of the clock. Can an instruction such as ADD4 or ADD in EX2 provide forwarding help to an ADD4 or [...] instruction in EX towards the end of the clock? If yes, did you provide such an arrangement in your design? If not, is it desirable to provide such an arrangement? Does it cost extra? Does it avoid any stalls, thereby improving the pipeline performance? Or is it that the particular help we plan to offer at the end of the clock will be available at the beginning of the next clock anyway, so that it makes no difference whether you provide the data at the end of the current clock or at the beginning of the next clock? Explain briefly.
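As a reminder of the priority rule that Q 5 recalls, here is a minimal Python sketch of the forwarding select for one ALU operand in the textbook five-stage MIPS pipeline. It models only the classic MIPS case, not the Lab 7 design. The function name `forward_select` and its parameter names are made up for this illustration; the rule itself (the more recent producer wins, and register $0 is never forwarded) is the standard one.

```python
def forward_select(ex_rs: int,
                   mem_rd: int, mem_reg_write: bool,
                   wb_rd: int, wb_reg_write: bool) -> str:
    """Pick the source of an ALU operand for the instruction in EX.

    Returns "MEM" (forward from the instruction now in MEM), "WB"
    (forward from the instruction now in WB) or "REG" (use the value
    read from the register file).
    """
    # The instruction in MEM is the more recent writer, so it is checked first.
    if mem_reg_write and mem_rd != 0 and mem_rd == ex_rs:
        return "MEM"
    if wb_reg_write and wb_rd != 0 and wb_rd == ex_rs:
        return "WB"
    return "REG"


# Three back-to-back instructions that all write and then read $1:
# when the third is in EX, both older instructions offer a value for $1.
print(forward_select(ex_rs=1, mem_rd=1, mem_reg_write=True,
                     wb_rd=1, wb_reg_write=True))
```

Checking the MEM stage first is what implements the priority: the value in WB is stale whenever a younger instruction has rewritten the same register.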

Q 7 The following is a slightly modified version of Q#3 from the Spring 22 Final Exam. Please answer it as part of the lab questions.

7.1 Suppose the current design is working at 500 MHz (clock period = 2 ns). Due to VLSI technology improvements, you can either (1) double the clock rate to 1 GHz (clock period = 1 ns) or (2) keep the clock rate at the same level (500 MHz) and combine the IF and ID stages into one stage called IFID and also the EX and EX2 stages into one stage called the EX2 stage. Circle your choice and explain.
(a) Both options are equally good
(b) Option 1 is better than option 2
(c) Option 2 is better than option 1

7.2 Given below are four flip-flop hook-ups and five statements describing their operation. You need to find a matching statement for each of the hook-ups.
(a) Once SET, it remains set.
(b) Once RESET, it remains reset.
(c) If it is currently SET, it will RESET on the next clock.
(d) If it is currently RESET, it will SET on the next clock.
(e) None of the above
[Figures: four flip-flop hook-ups numbered 1 to 4, each with an input IN and an output Q, and a space for the matching statement]

7.3 Let us go back to the original (slow) VLSI technology. Let us still combine the two stages EX and EX2 into one stage EX2, as shown in the incomplete design on page 4/4. The [...] and ADD4 instructions require only one of the two resources in EX2 and take only one clock to pass through EX2. [...] and NOP do not need any computation. Only the ADD instruction requires both the subtract_three and add_four operations and takes two clocks through EX2. At that time we need to stall the entire pipe, including the WB stage, for one clock so that the ADD completes using the EX2 stage.

7.3.1 Explain why we need to stall the WB stage also and why we cannot send a bubble into the WB stage. Use the sequence on this page (where the entire pipeline is stalled during clock 2) to explain.

  ADD4 $5, $6 ; ($5) <= ($6) + 4
  ADD $4, $5 ; ($4) <= ($5) +
  [...] $3, $5 ; ($3) <= ($5) - 3
  ADD4 $, $2 ; ($) <= ($2) + 4

[Time-space diagram: columns Clock, IF, ID, EX2, WB, with the clock in which the pipe is stalled marked "Pipe Stalled"]

7.3.2 Notice that in the incomplete design, we have provided a flip-flop in the EX2 stage to help you stall the entire pipe for one clock (no more, no less) when ADD is passing through the EX2 stage. Complete the design after answering the following questions.

Do you need to stall the pipe to resolve any dependency problem? Yes / No. Explain.

Notice that we have removed the hazard detection unit. Notice that
-- we removed one of the two comparators
-- we removed the forwarding mux X2_Mux and FU2
-- we removed the prioritization mux.
Explain why it is appropriate to remove these.
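The "one clock, no more no less" behaviour asked for in 7.3.2 can be prototyped in a few lines before it is committed to gates. The Python sketch below is one possible reading, not the intended solution: `ex2_trace` and `second_clock` are hypothetical names, `second_clock` stands in for the flip-flop in EX2, and only the EX2 occupant is modelled.

```python
from typing import List, Tuple


def ex2_trace(program: List[str]) -> List[Tuple[str, bool]]:
    """Return one (instruction_in_EX2, pipe_stalled) pair per clock."""
    trace = []
    second_clock = False          # models the flip-flop provided in EX2
    i = 0
    while i < len(program):
        op = program[i]
        # ADD needs two clocks in EX2: stall the pipe on its first clock only.
        stall = (op == "ADD") and not second_clock
        trace.append((op, stall))
        # The flip-flop is set during the stalled clock and cleared afterwards,
        # so the stall can never last longer than one clock.
        second_clock = stall
        if not stall:
            i += 1                # pipe enabled: the next instruction enters EX2
    return trace


for clock, (op, stalled) in enumerate(ex2_trace(["ADD4", "ADD", "ADD4"]), start=1):
    print(clock, op, "stalled" if stalled else "advances")
```

Because the flip-flop clears itself after one clock, two ADD instructions in a row each get exactly one extra clock.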

7.4 Performance: Carefully compare the original Lab 7 Part 3 subpart 1 design with the design in 7.3 above (Lab 7 Part 3 subpart 2). Both are running at 500 MHz. Does one of them always perform better? Or, depending on the code, could either of them perform better? Decide after considering the two code sequences below and completing the time-space diagrams on this page.

A: A sequence of dependent ADD4's

  ADD4 $5, $6 ; ($5) <= ($6) + 4; ADD4#1
  [...] $4, $5 ; ($4) <= ($5) - 3; [...]#2
  ADD4 $3, $4 ; ($3) <= ($4) + 4; ADD4#3

B: A sequence of independent ADD's

  ADD $5, $ ; ($5) <= ($) + ; ADD#1
  ADD $3, $4 ; ($3) <= ($4) + ; ADD#2
  ADD $, $2 ; ($) <= ($2) + ; ADD#3

[Time-space diagrams to complete:
 Code A running on Lab #7 Part 3 (columns Clock, IF, ID, EX, EX2, WB)
 Code B running on Lab #7 Part 3 (columns Clock, IF, ID, EX, EX2, WB)
 Code A running on the design in section 7.3 above (columns Clock, IF, ID, EX2, WB)
 Code B running on the design in section 7.3 above (columns Clock, IF, ID, EX2, WB)]

One of the two designs is always better. TRUE / FALSE. Explain.
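When filling in the time-space diagrams, the totals can be cross-checked with the usual pipeline count: the first instruction needs one clock per stage, and every later instruction or stalled clock adds one more. The Python sketch below is a rough aid under stated assumptions: `total_clocks` is a hypothetical helper, the 2 ns period is the one given in question 7.1, and the stall counts passed in are this sketch's reading of Code B (no stalls with separate EX and EX2 stages, one extra clock per ADD in the merged design). Dependency stalls for Code A still have to be worked out by hand.

```python
def total_clocks(n_instr: int, n_stages: int, stall_clocks: int = 0) -> int:
    """Clocks needed for n_instr instructions to leave an n_stages pipeline."""
    if n_instr == 0:
        return 0
    # (n_stages - 1) clocks to fill the pipe, then one instruction per clock,
    # plus every clock in which the pipe was held.
    return (n_stages - 1) + n_instr + stall_clocks


CLOCK_PERIOD_NS = 2  # clock period stated in question 7.1

# Code B: three independent ADD instructions (assumed stall counts, see above).
five_stage = total_clocks(3, n_stages=5, stall_clocks=0)
merged = total_clocks(3, n_stages=4, stall_clocks=3)

print("separate EX and EX2:", five_stage, "clocks,", five_stage * CLOCK_PERIOD_NS, "ns")
print("merged EX2 stage:   ", merged, "clocks,", merged * CLOCK_PERIOD_NS, "ns")
```

The same helper can be reused for Code A once the number of stalled clocks in each design has been read off the completed diagrams.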

7.5 Multicycle implementation:

7.5.1 The datapath below is complete. Complete the state diagram and produce the two outputs, R_Write and PC_EN (draw the combinational logic necessary to produce R_Write and PC_EN).

7.5.2 The design below corresponds to
(i) the Lab 7 P3 Subpart #1 design (with separate EX and EX2 stages) only
(ii) the Lab 7 P3 Subpart #2 design (with EX and EX2 merged) only
(iii) both of the above two designs.

7.5.3 If an IR (Instruction Register) is available, the PC can be incremented early. TRUE / FALSE
Performance improves if we add an IR. TRUE / FALSE

[Figure for Question 7.5: state diagram with states (IF), (ID), (EX2_), (EX2_2), (WB) and transitions labeled with ADD4 and ADD + ADD4 + [...] + [...]. Note: NOP = (ADD + ADD4 + [...] + [...])]

Figure for Question 7.3: LAB 7 P 3 with EX and EX2 merged, Block Diagram
[Figure labels: PC, IF, ID, EX2, WB, Comp Station in ID Stage, EN, FU, Qualifying signals, XMEX2, Reg. File, I-MEM, XA, RA, RD, R-Write, XD, X_Mux, A-3, R_Mux, A+4, R2_Mux, WB_RD, WB_Write, WB_RA, EX2_RA, EX2_ADD4, EX2_ADD, ADD, ADD4, SKIP, SKIP2, FORW, CLR]

Comp Station in ID Stage: ID_XMEX2 = ID_XA matched with EX2_RA (a comparator with inputs P and Q and output P=Q, comparing ID_XA and EX2_RA).

1. Complete the missing connections to the register file.
2. Design the forwarding unit. Generate the SKIP and SKIP2 signals.
3. Use the flip-flop in the EX2 stage to get one extra clock for the ADD instruction.
4. Control the EN (ENABLE) control signal on the PC and the three stage registers IF/ID, ID/EX2, and EX2/WB.