PIPELINING: BRANCH AND MULTICYCLE INSTRUCTIONS

Mahdi Nazm Bojnordi, Assistant Professor
School of Computing, University of Utah
CS/ECE 6810: Computer Architecture

Overview

Announcement
- Homework 1 submission deadline: Jan. 30th

This lecture
- Control hazards in the five-stage pipeline
- Multicycle instructions
  - Pipelined
  - Unpipelined
- Reorder buffer

Control Hazards

Example C/C++ code:

    for (i = 100; i > 0; i--) {
        sum = sum + i;
    }
    total = total + sum;

How many branches are in this code?

Corresponding assembly:

          add r1, r0, #100
    for:  beq r0, r1, next
          add r2, r2, r1
          sub r1, r1, #1
          j   for
    next: add r3, r3, r2

What are the possible target instructions? What happens inside the pipeline?

Handling Control Hazards

1. Introducing stall cycles and delay slots

How many cycles/slots? On average, one in every six instructions is a branch.

Two slots after the branch (2 additional delay slots per 6 cycles):

          add r1, r0, #100
    for:  beq r0, r1, next
          nothing
          nothing
          add r2, r2, r1
          sub r1, r1, #1
          j   for

One slot after each control instruction (1 additional delay slot, but longer path):

          add r1, r0, #100
    for:  beq r0, r1, next
          nothing
          add r2, r2, r1
          sub r1, r1, #1
          j   for
          nothing

Reordering instructions may help. Here the sub fills the slot after the jump:

          add r1, r0, #100
    for:  beq r0, r1, next
          nothing
          add r2, r2, r1
          j   for
          sub r1, r1, #1
    next: add r3, r3, r2

Jumps and function calls can be resolved in the decode stage.
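The cost of these slots can be estimated with a back-of-the-envelope CPI calculation. The sketch below is only illustrative: it assumes an ideal CPI of 1, that one in six instructions is a branch (the figure quoted above), and that every branch pays the full penalty. The function name `effective_cpi` is made up for this example.

```python
from fractions import Fraction


def effective_cpi(branch_fraction, stall_cycles_per_branch, base_cpi=1):
    """Average cycles per instruction when each branch adds a fixed stall penalty.

    Assumption: no other hazards, so the base CPI is 1.
    """
    return base_cpi + branch_fraction * stall_cycles_per_branch


one_in_six = Fraction(1, 6)
two_slots = effective_cpi(one_in_six, 2)  # two empty slots per branch
one_slot = effective_cpi(one_in_six, 1)   # one empty slot per branch
print(two_slots, one_slot)  # 4/3 7/6
```

Under these assumptions, removing one slot per branch lowers the CPI from 4/3 to 7/6; whether that pays off depends on how much the longer path stretches the clock period.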

Handling Control Hazards

1. Introducing stall cycles and delay slots
2. Predicting the branch outcome
   - simply assume the branch is taken or not taken
   - predict the next PC

The processor may need to cancel the wrong path.
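To see how the two static choices compare on the example loop, the following sketch replays the outcomes of `beq r0, r1, next` and counts wrong guesses. It is an illustration, not a model of any particular processor: the helper names are hypothetical, and the unconditional `j for` (always taken) is left out.

```python
def loop_branch_outcomes(n=100):
    """Outcomes of `beq r0, r1, next`: taken only once r1 has counted down to 0."""
    outcomes = []
    r1 = n
    while True:
        taken = (r1 == 0)
        outcomes.append(taken)
        if taken:
            break
        r1 -= 1  # sub r1, r1, #1
    return outcomes


def mispredictions(outcomes, predict_taken):
    """Number of branches whose outcome differs from a fixed static prediction."""
    return sum(1 for taken in outcomes if taken != predict_taken)


outcomes = loop_branch_outcomes(100)
not_taken_misses = mispredictions(outcomes, predict_taken=False)
taken_misses = mispredictions(outcomes, predict_taken=True)
print(len(outcomes), not_taken_misses, taken_misses)  # 101 1 100
```

For this loop the exit test sits at the top, so "assume not taken" is wrong only on the final check; each wrong guess is a case where the wrong path must be cancelled.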

Multicycle Instructions

Not all ALU operations complete in one cycle. Typically, FP operations need more time.

Multicycle functional units can be pipelined or un-pipelined. Pipelined vs. un-pipelined?
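One way to frame the pipelined vs. un-pipelined question is throughput on a stream of independent operations. The sketch below assumes a pipelined unit can accept a new operation every cycle, while an un-pipelined unit stays busy for the full latency; the function names and the 7-cycle latency are illustrative choices.

```python
def cycles_unpipelined(num_ops, latency):
    """Each operation occupies the whole unit for `latency` cycles."""
    return num_ops * latency


def cycles_pipelined(num_ops, latency):
    """A new operation enters every cycle; the last one finishes `latency` cycles after it starts."""
    if num_ops == 0:
        return 0
    return latency + (num_ops - 1)


# Ten independent multiplies on a unit with a 7-cycle latency (assumed numbers).
unpipelined_total = cycles_unpipelined(10, 7)
pipelined_total = cycles_pipelined(10, 7)
print(unpipelined_total, pipelined_total)  # 70 16
```

The latency of a single operation is the same in both cases; pipelining only helps when several independent operations are in flight.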

Multicycle Instructions: Structural hazards

Potentially multiple RF writes: several instructions may try to write to the Register File in the same cycle.
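A small sketch can show how instructions of different lengths collide at the register file. It assumes the stage sequence IF, ID, a variable number of execute stages, MA, WB, with no stalls and a single write port; the instruction strings and fetch cycles are made up for illustration.

```python
from collections import defaultdict


def writeback_cycle(fetch_cycle, exec_stages):
    """Cycle in which WB happens: IF, ID, exec_stages execute cycles, MA, then WB."""
    return fetch_cycle + 3 + exec_stages


def write_port_conflicts(instrs, write_ports=1):
    """Map each cycle to the instructions writing back in it, keeping only oversubscribed cycles."""
    by_cycle = defaultdict(list)
    for name, fetch_cycle, exec_stages in instrs:
        by_cycle[writeback_cycle(fetch_cycle, exec_stages)].append(name)
    return {c: names for c, names in by_cycle.items() if len(names) > write_ports}


# (instruction, fetch cycle, number of execute stages); hypothetical program
program = [
    ("mul f0, f4, f6", 1, 7),   # 7-stage multiply
    ("add r1, r2, r3", 2, 1),   # 1-stage integer op
    ("sub r4, r5, r6", 3, 1),
    ("add f2, f8, f10", 4, 4),  # 4-stage FP add
]
conflicts = write_port_conflicts(program)
print(conflicts)  # {11: ['mul f0, f4, f6', 'add f2, f8, f10']}
```

Here the 7-stage multiply fetched in cycle 1 and the 4-stage add fetched in cycle 4 both reach WB in cycle 11, so one of them would have to stall or the register file would need a second write port.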

Multicycle Instructions: Data hazards

More read-after-write hazards (stage sequence per instruction; cycle alignment not shown):

    load  f4, 0(r2)     IF ID EX MA WB
    mul   f0, f4, f6    IF ID M1 M2 M3 M4 M5 M6 M7 MA WB
    add   f2, f0, f8    IF ID A1 A2 A3 A4 MA WB
    store f2, 0(r2)     IF ID EX MA WB
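The read-after-write chain in this sequence can be found mechanically. The sketch below is a minimal dependence scan, assuming each instruction is described by hand as a destination register and a list of source registers; the tuple layout and function name are hypothetical.

```python
def raw_dependences(instrs):
    """Return (producer_index, consumer_index, register) for each read-after-write dependence."""
    deps = []
    last_writer = {}  # register -> index of the most recent instruction that writes it
    for i, (_text, dest, srcs) in enumerate(instrs):
        for reg in srcs:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        if dest is not None:
            last_writer[dest] = i
    return deps


# (text, destination register, source registers)
program = [
    ("load f4, 0(r2)", "f4", ["r2"]),
    ("mul f0, f4, f6", "f0", ["f4", "f6"]),
    ("add f2, f0, f8", "f2", ["f0", "f8"]),
    ("store f2, 0(r2)", None, ["f2", "r2"]),
]
deps = raw_dependences(program)
print(deps)  # [(0, 1, 'f4'), (1, 2, 'f0'), (2, 3, 'f2')]
```

Every instruction depends on the one before it, and because the producers are multicycle, each consumer waits longer than it would in the simple five-stage pipeline.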

Multicycle Instructions: Data hazards

Potential write-after-write hazards (stage sequence per instruction; cycle alignment not shown):

    load  f4, 0(r2)     IF ID EX MA WB
    mul   f2, f4, f6    IF ID M1 M2 M3 M4 M5 M6 M7 MA WB
    add   f2, f0, f8    IF ID A1 A2 A3 A4 MA WB
    store f2, 0(r2)     IF ID EX MA WB

Out-of-order write-back: the add is shorter than the mul, so it can write f2 before the mul does, and the older mul result then overwrites the newer value. The remedy is to enforce in-order writes.
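The out-of-order write-back can be checked numerically. This sketch assumes one instruction is fetched per cycle with no stalls (the RAW stall on f4 is deliberately ignored) and uses the stage counts from the listing above; the helper names are illustrative.

```python
def writeback_cycle(fetch_cycle, exec_stages):
    """Cycle in which WB happens: IF, ID, exec_stages execute cycles, MA, then WB."""
    return fetch_cycle + 3 + exec_stages


def waw_violations(instrs):
    """Pairs (older, younger) that write the same register but reach WB out of program order."""
    violations = []
    for i, (name_i, dest_i, fetch_i, stages_i) in enumerate(instrs):
        if dest_i is None:
            continue
        for name_j, dest_j, fetch_j, stages_j in instrs[i + 1:]:
            if dest_j == dest_i and (
                writeback_cycle(fetch_j, stages_j) <= writeback_cycle(fetch_i, stages_i)
            ):
                violations.append((name_i, name_j))
    return violations


# (text, destination, fetch cycle, execute stages)
program = [
    ("load f4, 0(r2)", "f4", 1, 1),
    ("mul f2, f4, f6", "f2", 2, 7),
    ("add f2, f0, f8", "f2", 3, 4),
    ("store f2, 0(r2)", None, 4, 1),
]
violations = waw_violations(program)
mul_wb = writeback_cycle(2, 7)
add_wb = writeback_cycle(3, 4)
stall_needed = max(0, mul_wb + 1 - add_wb)  # cycles to delay add so it writes after mul
print(violations, mul_wb, add_wb, stall_needed)
# [('mul f2, f4, f6', 'add f2, f0, f8')] 12 10 3
```

Under these assumptions the add would write f2 in cycle 10 and the mul in cycle 12; delaying the add by three cycles is one simple way to restore in-order writes.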

Multicycle Instructions: Imprecise exceptions

Instructions do not necessarily complete in program order (stage sequence per instruction; cycle alignment not shown):

    load  f4, 0(r2)     IF ID EX MA WB
    mul   f2, f4, f6    IF ID M1 M2 M3 M4 M5 M6 M7 MA WB    (overflow!)
    add   f3, f0, f8    IF ID A1 A2 A3 A4 MA WB
    store f2, 0(r2)     IF ID EX MA WB

The state of the processor must be updated in program order, which calls for in-order register file updates.

Reorder Buffer

Multicycle instructions:

    mul f2, f4, f6
    add f4, f0, f1
    sub f6, f3, f7

Reorder buffer contents:

    Inst.   Dest.
    mul     f2
    add     f4
    sub     f6
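The idea behind the table can be sketched in a few lines: instructions enter the buffer in program order, may finish in any order, and update the register file only from the head. This is a simplified illustration, not the slides' design. The class, its fields (`value`, `done`, `exception`), and the result values are assumptions added for the example; only the instruction and destination columns come from the table above.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: out-of-order completion, in-order commit."""

    def __init__(self):
        self.entries = deque()  # oldest instruction at the left
        self.regfile = {}       # architectural register state

    def issue(self, inst, dest):
        """Allocate an entry at the tail, in program order."""
        entry = {"inst": inst, "dest": dest, "value": None,
                 "done": False, "exception": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value, exception=False):
        """Record a result; the register file is not touched yet."""
        entry["value"] = value
        entry["done"] = True
        entry["exception"] = exception

    def commit(self):
        """Retire finished instructions from the head only, in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries[0]
            if head["exception"]:
                # Precise exception: drop the faulting instruction and all younger ones.
                self.entries.clear()
                break
            self.entries.popleft()
            self.regfile[head["dest"]] = head["value"]
            committed.append(head["inst"])
        return committed


rob = ReorderBuffer()
mul = rob.issue("mul f2, f4, f6", "f2")
add = rob.issue("add f4, f0, f1", "f4")
sub = rob.issue("sub f6, f3, f7", "f6")

rob.complete(add, 3.0)   # add finishes first (placeholder value)
first = rob.commit()     # nothing retires: mul at the head is not done
rob.complete(sub, 1.0)
rob.complete(mul, 8.0)
second = rob.commit()    # all three retire in program order
print(first, second)
```

Because results reach the register file only at commit, an exception raised by the mul would be seen before the add or sub had changed any architectural state.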