Very Short Answer:

(1) (1) Peak performance does or does not track observed performance.
(2) (1) Which is more effective, dynamic or static branch prediction?
(3) (1) Do benchmarks remain valid indefinitely?
(4) (2) Issuing multiple instructions per cycle puts tremendous pressure on what two parts of the machine?
(5) (2) In class we mentioned VLIW and Superscalar as two ways to circumvent the Flynn Limit of 1. We also talked about two other approaches - what were they?
(6) (2) Out of Order completion makes supporting what very difficult?
(7) (2) Decoupled architectures split a program into two streams. What are they?
(8) (2) Are wire delays or transistors more likely to be the most significant limit on clock frequency in the future? Why?
(9) (2) What is Amdahl's law (in words)?
(10) (2) What is the relationship between speculation and power consumption?
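For reference on question (9), Amdahl's law is usually written as overall speedup = 1 / ((1 - f) + f / s). The sketch below evaluates that formula; the function and parameter names are illustrative assumptions and do not come from the exam.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the run time benefits from an enhancement.

    fraction_enhanced: share of the original run time that is sped up (0..1).
    speedup_enhanced:  factor by which that share gets faster (> 0).
    Both names are hypothetical, chosen for this illustration.
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    # The unenhanced share keeps its original time; the enhanced share shrinks.
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    # Speeding up half of the run time by 2x gives well under 2x overall.
    print(amdahl_speedup(0.5, 2.0))
```

Even a very large `speedup_enhanced` cannot push the result past 1 / (1 - f), which is the usual point of the law.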

Short Answers:

(10) (3) What is the primary difference between Scoreboarding and Tomasulo's algorithm? What hardware feature makes Tomasulo's work?
(11) (3) Why are there multiple dies per silicon wafer? Why not just fabricate one huge die per wafer?
(12) (3) The book lists several things that limit the amount of achievable ILP. List 3 of them.
(13) (4) Understanding the hardware can influence how you write programs. Give at least 2 examples of how you might write software differently for a heavily pipelined machine versus a non-pipelined one.

(14) (4) What is a predicated instruction? What are the advantages to using predicated instructions? When would you not want to use one?
(15) (4) What is the definition of a basic block? Why is there a desire to create larger ones?
(16) (3) There are at least two types of control flow changes that standard dynamic branch predictors have trouble with. There is a technique that works well for one of these types... name the two types of branches, and the technique used to successfully deal with one of them.
(17) (4) Supporting precise interrupts in machines that allow out of order completion is a challenge. Briefly explain why, and give three different techniques that can be used to provide precise interrupts.

(18) (5) Why is branch prediction important? What performance enhancing techniques have made it so? List 3 examples of existing Branch Prediction strategies in order of (average) increasing effectiveness.
(19) (5) What does SMT stand for? What is SMT trying to accomplish? What is the difference between Superscalar, coarse MT, fine MT, and SMT?
(20) (6) Compare and contrast Superscalar and VLIW. Describe each, and list the advantages and disadvantages of each approach.

(21) (10) Draw a basic high-level picture of what Tomasulo's hardware looks like, when the ROB is included. (In other words, sketch out all the hardware involved, and how things are connected.) The emphasis is on conveying knowledge - do not worry about how pretty it is, but do make sure I can read it and understand what you have done.

(22) (10) You are given the following code sequence:

    ADDF  F1,F2,F3
    SUBF  F1,F4,F5
    MULTF F2,F6,F7
    DIVF  F1,F8,F9

Assume there are 8 logical and 16 physical registers. On the left below is the register mapping upon entering the code sequence. Your job is to fill in the mappings after the execution of the DIVF instruction, including what is on the free list. (Assume that during the execution of this code, no registers are released - in other words, the free list will be shorter at the end than at the beginning.)

    BEFORE                              AFTER
    Logical  Physical                   Logical  Physical
    0        2                          0
    1        4                          1
    2        6                          2
    3        8                          3
    4        10                         4
    5        12                         5
    6        14                         6
    7        0                          7

    Free Pool: 0,2,4,9,10,13,14,15      Free Pool:

Now, rewrite the code sequence below using the actual physical register names instead of the logical ones.

    ADDF  P__,P__,P__
    SUBF  P__,P__,P__
    MULTF P__,P__,P__
    DIVF  P__,P__,P__
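As a reminder of the mechanics behind question (22), here is a minimal sketch of register renaming with a map table and a free list. It deliberately uses a small made-up mapping and instruction pair, not the exam's table. It also assumes that source operands are looked up before the destination is renamed and that physical registers are taken from the head of the free list in order; the function name `rename` is hypothetical.

```python
def rename(instructions, mapping, free_list):
    """Rename logical registers to physical ones.

    instructions: list of (op, dest, src1, src2) using logical register numbers.
    mapping:      dict logical -> physical on entry (made-up values in the demo).
    free_list:    physical registers available for allocation, used front first
                  (allocation order is an assumption of this sketch).
    Returns (renamed instructions, final mapping, remaining free list).
    """
    mapping = dict(mapping)      # work on copies so the caller's data is untouched
    free = list(free_list)
    renamed = []
    for op, dest, src1, src2 in instructions:
        # Sources read the mapping as it stands before this instruction.
        p_src1, p_src2 = mapping[src1], mapping[src2]
        # The destination always gets a fresh physical register.
        p_dest = free.pop(0)
        mapping[dest] = p_dest
        renamed.append((op, p_dest, p_src1, p_src2))
    return renamed, mapping, free


if __name__ == "__main__":
    demo_map = {0: 10, 1: 11, 2: 12, 3: 13}   # hypothetical starting map
    demo_free = [0, 1, 2, 3]                  # hypothetical free list
    demo_code = [("ADD", 1, 2, 3), ("SUB", 1, 0, 1)]
    print(rename(demo_code, demo_map, demo_free))
```

In the demo, the second instruction reads logical register 1 through the physical register that the first instruction just allocated, which is how renaming keeps the true dependence while removing the name reuse on the destination.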

(23) (15) Given the following loop:

    LOOP: Load  F0,0($1)
          Add   F4,F0,F2
          Store F4,0(F1)
          Sub   R1,R1,#4
          Bne   R1,R2,Loop

There is a 1 cycle Load Delay Slot, a 1 cycle Branch Delay Slot, and a 2 cycle Add Delay Slot. Your machine has 16 registers.

a) Calculate how many cycles this loop requires in order to execute 9 times.
b) Now unroll the loop 3 times, schedule the code, and calculate how many cycles your unrolled, scheduled loop requires to execute.
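One way to check a hand count for part (a) is a small stall model. The sketch below rests on several assumptions that the exam does not state: a single-issue, in-order machine; a delay of d cycles means a dependent instruction issues at least d + 1 cycles after its producer; the branch delay slot is left unfilled and costs one cycle per iteration; and the base registers printed as $1 and F1 are both read as R1. All names in the code are hypothetical.

```python
def cycles_per_iteration(instructions, delay, branch_delay=1):
    """Count issue cycles for one pass through an unscheduled loop body.

    instructions: list of (op, dest, sources); dest is None when nothing is written.
    delay:        dict op -> delay cycles imposed on a dependent instruction.
    branch_delay: cycles lost after the closing branch (assumed unfilled slot).
    """
    ready = {}   # register -> earliest cycle at which a consumer may issue
    cycle = 0
    for op, dest, sources in instructions:
        cycle += 1                                # next in-order issue slot
        for src in sources:
            cycle = max(cycle, ready.get(src, 0))  # stall until operands are ready
        if dest is not None:
            ready[dest] = cycle + delay.get(op, 0) + 1
    return cycle + branch_delay


# Loop body as read under the assumptions in the lead-in.
LOOP_BODY = [
    ("Load", "F0", ["R1"]),
    ("Add", "F4", ["F0", "F2"]),
    ("Store", None, ["F4", "R1"]),
    ("Sub", "R1", ["R1"]),
    ("Bne", None, ["R1", "R2"]),
]
DELAYS = {"Load": 1, "Add": 2}

if __name__ == "__main__":
    per_iteration = cycles_per_iteration(LOOP_BODY, DELAYS)
    print(per_iteration, 9 * per_iteration)
```

The same function can be applied to an unrolled and rescheduled body for part (b), since it only depends on the instruction order and the delay table. A different reading of the delay slots would change the counts.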

(24) (4) In class, we talked about the cycle by cycle steps that occur on different interrupts. For example, here is what happens if there is an illegal operand interrupt generated by instruction i+1:

           1    2    3    4    5    6    7    8    9
    i      IF   ID   EX   MEM  WB
    i+1         IF   ID   EX   MEM  WB                      <- Interrupt detected
    i+2              IF   ID   EX   MEM  WB                 <- Instruction Squashed
    i+3                   IF   ID   EX   MEM  WB            <- Trap Handler fetched
    i+4                        IF   ID   EX   MEM  WB

Fill out the following table if instruction i+1 experiences a fault in the EX stage:

           1    2    3    4    5    6    7    8    9    10
    i      IF   ID   EX   MEM  WB
    i+1         IF   ID   EX   MEM  WB
    i+2              IF   ID   EX   MEM  WB
    i+3                   IF   ID   EX   MEM  WB
    i+4                        IF   ID   EX   MEM  WB
    i+5                             IF   ID   EX   MEM  WB

What happens in this case?

           1    2    3    4    5    6    7    8    9    10
    i      IF   ID   EX   MEM  WB                                <- Data write causes Page Fault
    i+1         IF   ID   EX   MEM  WB                           <- Divide by Zero
    i+2              IF   ID   EX   MEM  WB                      <- Illegal Opcode
    i+3                   IF   ID   EX   MEM  WB
    i+4                        IF   ID   EX   MEM  WB
    i+5                             IF   ID   EX   MEM  WB
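To reason about the last diagram, it can help to compute the cycle in which each fault is raised. The sketch below does this under some assumptions: a classic five-stage pipeline with one instruction entering IF per cycle and no stalls; the page fault arising in MEM, the divide by zero in EX, and the illegal opcode in ID; and precise interrupts meaning that the oldest faulting instruction in program order is serviced first. The names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def detection_cycle(offset, stage):
    """Cycle in which instruction i+offset occupies the given stage.

    Assumes instruction i enters IF in cycle 1 and nothing stalls.
    """
    return 1 + offset + STAGES.index(stage)


def first_to_service(faults):
    """Pick the fault belonging to the oldest instruction in program order.

    faults: list of (offset, stage, description) tuples.
    """
    return min(faults, key=lambda fault: fault[0])


# Faults from the last diagram; the stage assignments are this sketch's assumption.
EXAMPLE_FAULTS = [
    (0, "MEM", "page fault on data write"),
    (1, "EX", "divide by zero"),
    (2, "ID", "illegal opcode"),
]

if __name__ == "__main__":
    for offset, stage, description in EXAMPLE_FAULTS:
        print(f"i+{offset}: {description} raised in cycle {detection_cycle(offset, stage)}")
    print("serviced first:", first_to_service(EXAMPLE_FAULTS))
```

Under these assumptions all three faults are raised in the same cycle, so the cycle number alone cannot decide the order and program order has to.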