Slide Set 9 for ENCM 501 in Winter 2018, Steve Norman, PhD, PEng

Slide Set 9 for ENCM 501 in Winter 2018 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary March 2018

ENCM 501 Winter 2018 Slide Set 9 slide 2/42
Contents
- Overview of Tomasulo's algorithm
- Tomasulo's algorithm and name dependencies
- Tomasulo's algorithm and memory hazards
- Load and store buffers
- A loop example
- Costs of the CDB (common data bus)
- Tomasulo's algorithm, branches, and exceptions
- Out-of-order execution, in-order completion

ENCM 501 Winter 2018 Slide Set 9 slide 4/42
Overview of Tomasulo's algorithm
It's interesting that this approach to instruction scheduling was developed around 50 years ago (in 1966!) for very high-end expensive computers, then re-adopted around 20 years ago for consumer-level processors (Intel Pentium Pro, 1995). It's been in continual use for a huge number of out-of-order processor designs for the last two decades.
The key ideas are:
- try to get execution of an instruction started as soon as the source operands are ready; often, this will require out-of-order issue;
- as soon as an instruction result has been produced, try to broadcast that result to all the instructions that want to consume it.

ENCM 501 Winter 2018 Slide Set 9 slide 5/42
Goals and non-goals for H&P Sections 3.4 and 3.5
Goals:
- high instruction throughput
- dealing effectively with RAW hazards (which, remember, may occur even with in-order processing)
- dealing effectively with WAW and WAR hazards (which tend to occur with out-of-order processing)
Non-goals (which become goals in Section 3.6):
- high throughput in the face of code with lots of branches
- correct behaviour in the face of exceptions, e.g., hardware interrupts, TLB misses handled by software, various other exceptions

ENCM 501 Winter 2018 Slide Set 9 slide 6/42
Key components
- instruction fetch and decode unit
- enhanced register files
- reservation stations
- functional units
- common data bus (CDB), or multiple CDBs for superscalar systems

ENCM 501 Winter 2018 Slide Set 9 slide 7/42
Instruction fetch and decode unit
This unit merges
- an L1 I-cache
- some sort of facility for managing branches and jumps (kind of fuzzy in Sections 3.4 to 3.5, but definitely dynamic branch prediction in Section 3.6)
- instruction decode capability: when an instruction leaves, it's known exactly what kind of instruction it is, and which registers, offsets and immediate operands are involved
For textbook Sections 3.4 to 3.6, this unit is scalar: in any one clock cycle, the maximum output is one instruction. (Section 3.8 looks at output of two or more instructions per cycle.)

ENCM 501 Winter 2018 Slide Set 9 slide 8/42
Instruction fetch and decode unit: key property
Instructions are issued from the instruction unit in program order. This is critical for correct avoidance of data hazards!

ENCM 501 Winter 2018 Slide Set 9 slide 9/42
Enhanced register files
In an in-order processor, a register file is precisely what we've modeled already: if the number of registers is M and the width of a register is N, there must be M × N cells to contain the state of the register file; beyond that, there must also be whatever logic is needed to support parallel reads and/or parallel writes.
In a Tomasulo-based processor, each one of the M registers requires
- N bits for register data
- more bits for register status: is the data up-to-date, and if not, which reservation station will supply the data in the future?

ENCM 501 Winter 2018 Slide Set 9 slide 10/42
Example: FPR file for MIPS with 16 64-bit FPRs...
Qi = 0 indicates that an FPR is up-to-date; Qi ≠ 0 indicates that an FPR is not up-to-date and is waiting for a result from a reservation station. This example works for a system with up to fifteen reservation stations...

FPR   data                                                                      Qi
F0    00110011 00110011 00110011 00110011 00111111 11010011 00110011 00110011   0101
F2    00000000 00000000 00000000 00000000 10111111 11010100 00000000 00000000   0000
...
F30   01001001 00100100 10010010 01001001 01000000 00001001 00100100 10010010   0110

ENCM 501 Winter 2018 Slide Set 9 slide 11/42
What do all of the bits on Slide 10 mean?
Let's write out the FPR file state in a somewhat more human-friendly format.
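One possible way to do that decoding is sketched below in Python. The bit strings are copied from the Slide 10 table. The sketch assumes (this is not stated on the slide) that the first four bytes shown for each register are the low 32-bit word and the last four are the high word of an IEEE 754 double. The names FPR_ROWS, decode_row and describe are made up for this illustration.

```python
import struct

# Bit patterns copied from the Slide 10 table, as (data bits, Qi bits).
# ASSUMPTION: the first four bytes shown are the low 32-bit word and the
# last four are the high word of an IEEE 754 double.
FPR_ROWS = {
    "F0": ("00110011 00110011 00110011 00110011 "
           "00111111 11010011 00110011 00110011", "0101"),
    "F2": ("00000000 00000000 00000000 00000000 "
           "10111111 11010100 00000000 00000000", "0000"),
    "F30": ("01001001 00100100 10010010 01001001 "
            "01000000 00001001 00100100 10010010", "0110"),
}


def decode_row(data_bits, qi_bits):
    """Return (data as a float, Qi as an int) for one register-file row."""
    octets = [int(b, 2) for b in data_bits.split()]
    low_word, high_word = bytes(octets[:4]), bytes(octets[4:])
    # Reassemble as a big-endian double: high word first, then low word.
    (value,) = struct.unpack(">d", high_word + low_word)
    return value, int(qi_bits, 2)


def describe(name):
    """Summarize one register in words."""
    value, qi = decode_row(*FPR_ROWS[name])
    if qi == 0:
        return f"{name}: up-to-date, value {value!r}"
    return f"{name}: waiting for reservation station {qi} (stale data {value!r})"


for reg in FPR_ROWS:
    print(describe(reg))
```

Under that word-order assumption, F2 is the only up-to-date register of the three shown; F0 and F30 hold stale data and are waiting on reservation stations 5 and 6.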

ENCM 501 Winter 2018 Slide Set 9 slide 12/42
Reservation stations
A reservation station
- receives an instruction from the instruction unit;
- waits for source operand data to be ready before starting the execution of the instruction;
- broadcasts the result of the instruction on the CDB, when the result is ready.
Example reservation station: An RS capable of processing either ADD.D or SUB.D needs 6 fields: Busy, Op, Vj, Vk, Qj and Qk. (The textbook also shows an A field, but that field is not needed for an RS for ADD.D / SUB.D.)
Let's make some notes about how the 6 fields are used.

ENCM 501 Winter 2018 Slide Set 9 slide 13/42
Each reservation station has a unique, nonzero identification number. Examples in the textbook have three RS's for ADD.D / SUB.D instructions, and have two RS's for MUL.D / DIV.D instructions. So a possible numbering scheme would be 0001, 0010, 0011 for the first three RS's (called Add1, Add2, and Add3 in the book) and 0100, 0101 for the next two (called Mult1 and Mult2).
(Warning: RS stands for reservation station, not for register status. In descriptions of various versions of Tomasulo's algorithm, the textbook uses RegisterStat[x] as an abbreviation for the status of register x.)

ENCM 501 Winter 2018 Slide Set 9 slide 14/42
Simple example of instruction issue
Suppose a program has been running for a while, and these will be the next two instructions to leave the instruction unit:

    ADD.D  F4, F0, F2
    SUB.D  F6, F6, F4

Suppose also that registers F0, F2, F4 and F6 are up-to-date, and that RS's Add1 and Add2 are not busy.
Let's make notes about how the two instructions will be issued to the RS's.
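The issue step for these two instructions can be sketched in Python as follows. This is a simplified illustration, not the textbook's pseudocode: the class ReservationStation, the function issue, and the register values are all hypothetical, and the RS numbers follow the Add1 = 1, Add2 = 2 scheme from slide 13.

```python
from dataclasses import dataclass


@dataclass
class ReservationStation:
    number: int          # unique nonzero ID, e.g. 1 for Add1
    busy: bool = False
    op: str = ""
    vj: float = 0.0      # source operand values (meaningful when Q field is 0)
    vk: float = 0.0
    qj: int = 0          # ID of the RS that will produce the operand, or 0
    qk: int = 0


def issue(op, dest, src1, src2, rs, reg_value, reg_qi):
    """Issue one instruction to a free reservation station."""
    assert not rs.busy
    rs.busy, rs.op = True, op
    # For each source: copy the value if the register is up-to-date,
    # otherwise remember which RS will produce it.
    rs.qj = reg_qi[src1]
    rs.vj = reg_value[src1] if rs.qj == 0 else 0.0
    rs.qk = reg_qi[src2]
    rs.vk = reg_value[src2] if rs.qk == 0 else 0.0
    # Only after the sources are read does the destination start
    # waiting for this RS.
    reg_qi[dest] = rs.number


# Hypothetical register contents; all four registers start up-to-date.
reg_value = {"F0": 1.5, "F2": 2.5, "F4": 0.0, "F6": 10.0}
reg_qi = {name: 0 for name in reg_value}
add1, add2 = ReservationStation(1), ReservationStation(2)

issue("ADD.D", "F4", "F0", "F2", add1, reg_value, reg_qi)
issue("SUB.D", "F6", "F6", "F4", add2, reg_value, reg_qi)
print(add1)
print(add2)
print(reg_qi)
```

In this sketch Add1 ends up with both operands in hand, while Add2 holds the old F6 value in Vj and the tag of Add1 in Qk, because F4 is no longer up-to-date when SUB.D issues.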

ENCM 501 Winter 2018 Slide Set 9 slide 15/42
Functional units
Functional units are the circuits that perform the execution steps for an instruction. Example functional units are FP adders, FP multipliers, integer ALUs, shifters, and so on.
To keep things relatively simple, we can imagine a one-to-one correspondence between reservation stations and functional units. For example, each RS for ADD.D / SUB.D could be thought of as guarding the entrance of an FP adder-subtractor circuit, and watching the exit of that circuit for a result.
In reality, things will be more complicated: multiple RS's will be set up to feed their operands into a single pipelined functional unit.

ENCM 501 Winter 2018 Slide Set 9 slide 16/42
Latencies of functional units
Tomasulo's algorithm is designed to manage the effects of multiple-cycle latencies of functional units. Further, the algorithm is designed to deal with the fact that the latency of a functional unit might vary from one use of the unit to the next.
What are some examples of functional units with variable latencies?

ENCM 501 Winter 2018 Slide Set 9 slide 17/42
CDB: Common data bus
Most reservation stations are capable of broadcasting results on the CDB. RS's for arithmetic instructions and RS's for loads certainly need to be able to do this, but RS's for store instructions don't.
Each reservation station must snoop the CDB, that is, watch the CDB for results needed by that RS.
The register file must also snoop the CDB to grab results that are needed by registers that are currently not up-to-date. (Remember, such a register has some nonzero Qi value to indicate which RS will produce the result the register is waiting for.)

ENCM 501 Winter 2018 Slide Set 9 slide 18/42
Example instruction completion via CDB
This sequence got started a few slides back...

    ADD.D  F4, F0, F2
    SUB.D  F6, F6, F4

Let's make some notes about how these instructions will get completed. For simplicity, let's assume that the given instructions are followed by a lengthy sequence of instructions that don't use F0, F2, F4 or F6.
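A rough Python sketch of that completion sequence follows. It starts from the state right after both instructions have issued, with ADD.D in Add1 (tag 1) and SUB.D in Add2 (tag 2). The dictionaries, the helper names ready, execute and broadcast, and the register values are hypothetical; timing and functional-unit latency are not modeled.

```python
# State right after both instructions have issued (values are hypothetical).
reg_value = {"F0": 1.5, "F2": 2.5, "F4": 0.0, "F6": 10.0}
reg_qi = {"F0": 0, "F2": 0, "F4": 1, "F6": 2}   # F4 waits on Add1, F6 on Add2
stations = {
    1: {"op": "ADD.D", "vj": 1.5, "vk": 2.5, "qj": 0, "qk": 0, "busy": True},
    2: {"op": "SUB.D", "vj": 10.0, "vk": None, "qj": 0, "qk": 1, "busy": True},
}


def ready(rs):
    """An RS may start execution once both Q fields are zero."""
    return rs["busy"] and rs["qj"] == 0 and rs["qk"] == 0


def execute(rs):
    """Stand-in for the FP adder-subtractor."""
    return rs["vj"] + rs["vk"] if rs["op"] == "ADD.D" else rs["vj"] - rs["vk"]


def broadcast(tag, result):
    """Put (tag, result) on the CDB; every snooper compares tags."""
    for rs in stations.values():
        if rs["busy"] and rs["qj"] == tag:
            rs["vj"], rs["qj"] = result, 0
        if rs["busy"] and rs["qk"] == tag:
            rs["vk"], rs["qk"] = result, 0
    for reg, qi in reg_qi.items():
        if qi == tag:
            reg_value[reg], reg_qi[reg] = result, 0
    stations[tag]["busy"] = False   # the broadcasting RS is free again


assert not ready(stations[2])            # SUB.D is stuck waiting for F4
broadcast(1, execute(stations[1]))       # ADD.D finishes: F4 = 4.0
assert ready(stations[2])                # SUB.D now has both operands
broadcast(2, execute(stations[2]))       # SUB.D finishes: F6 = 6.0
print(reg_value, reg_qi)
```

The point of the sketch is that SUB.D never reads F4 from the register file; it picks the value up from the CDB in the same broadcast that updates F4.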

ENCM 501 Winter 2018 Slide Set 9 slide 20/42
Tomasulo's algorithm and name dependencies
The algorithm eliminates WAW and WAR hazards in a really elegant way. To understand how this works, it's sufficient to consider a short example. Here the repeated use of F0 creates both a potential WAW hazard and a potential WAR hazard:

    DIV.D  F0, F20, F2
    ADD.D  F22, F22, F0
    MUL.D  F0, F24, F24
    SUB.D  F26, F26, F0

Let's make notes about how the hazards are eliminated.
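The following Python sketch traces one possible execution of those four instructions in which MUL.D finishes before DIV.D. It is an illustration under assumptions: the initial register values are invented, the RS numbers follow the scheme from slide 13, and the helpers issue and complete are hypothetical names.

```python
RS_ID = {"Add1": 1, "Add2": 2, "Add3": 3, "Mult1": 4, "Mult2": 5}
OPS = {"ADD.D": lambda a, b: a + b, "SUB.D": lambda a, b: a - b,
       "MUL.D": lambda a, b: a * b, "DIV.D": lambda a, b: a / b}

# Hypothetical initial register values; Qi = 0 means up-to-date.
value = {"F0": 0.0, "F2": 2.0, "F20": 8.0, "F22": 1.0, "F24": 3.0, "F26": 20.0}
qi = {reg: 0 for reg in value}
stations = {}   # RS number -> dict of fields for busy stations


def issue(station, op, dest, src1, src2):
    """Capture each source as a value (Q = 0) or as a tag (Q != 0)."""
    tag = RS_ID[station]
    stations[tag] = {"op": op,
                     "vj": value[src1], "qj": qi[src1],
                     "vk": value[src2], "qk": qi[src2]}
    qi[dest] = tag          # renaming: dest now refers to this RS's result


def complete(station):
    """Execute, then broadcast (tag, result) to stations and registers."""
    tag = RS_ID[station]
    rs = stations.pop(tag)
    assert rs["qj"] == 0 and rs["qk"] == 0, "operands not ready"
    result = OPS[rs["op"]](rs["vj"], rs["vk"])
    for other in stations.values():          # waiting stations snoop the CDB
        for v, q in (("vj", "qj"), ("vk", "qk")):
            if other[q] == tag:
                other[v], other[q] = result, 0
    for reg in qi:                           # so does the register file
        if qi[reg] == tag:
            value[reg], qi[reg] = result, 0


issue("Mult1", "DIV.D", "F0", "F20", "F2")
issue("Add1", "ADD.D", "F22", "F22", "F0")    # captures tag 4, not F0's value
issue("Mult2", "MUL.D", "F0", "F24", "F24")   # F0 now waits on tag 5
issue("Add2", "SUB.D", "F26", "F26", "F0")    # captures tag 5

complete("Mult2")   # MUL.D finishes first and writes F0 = 9.0
complete("Mult1")   # DIV.D result goes to Add1 only; F0 is left alone
complete("Add1")
complete("Add2")
print(value)
```

In this trace the early MUL.D write to F0 cannot disturb ADD.D (no WAR problem), because ADD.D is waiting on the tag of Mult1 rather than on F0 itself. The late DIV.D result cannot clobber F0 (no WAW problem), because by then no register is waiting on tag 4.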

ENCM 501 Winter 2018 Slide Set 9 slide 22/42
Tomasulo's algorithm and memory hazards
The examples given in textbook Sections 3.5 and 3.6 are excellent regarding hazards involving communication of instruction results through floating-point registers (FPRs).
Communication of instruction results through general-purpose registers (GPRs, aka integer registers) is very similar to communication via FPRs, so it's reasonable not to discuss that topic at length.

ENCM 501 Winter 2018 Slide Set 9 slide 23/42
However, the textbook is, uh, less than excellent about the topic of
- allowing memory accesses (loads and stores) to complete out-of-order when doing so is harmless and helps with instruction throughput;
- forcing in-order completion of loads and stores when necessary to avoid RAW, WAW and WAR hazards.

ENCM 501 Winter 2018 Slide Set 9 slide 24/42
Older and younger instructions
It's handy to use the words older and younger as adjectives for instructions in an out-of-order system. Instruction A is older than Instruction B if A came before B in program order. In that case, you could also say that B is younger than A.

ENCM 501 Winter 2018 Slide Set 9 slide 25/42
Rules for ordering execution of loads and stores
To avoid RAW hazards: It is acceptable to access memory for a load instruction if it is known that there are no incomplete older store instructions that will use the same address as the load.
What similar rules would be needed for avoidance of WAW and WAR hazards?
ENCM 501 won't go into details of hardware solutions for memory data hazards, but here are a couple of key features:
- some sort of queue in which program order of loads and stores is remembered after loads and stores leave the instruction fetch/decode unit
- a capability to quickly compare the address a load or store will use with addresses to be used by older loads or stores
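The two key features above (a program-order queue and address comparison against older entries) can be sketched in Python as follows. This is an illustration, not a description of any particular hardware: MemOp and may_access_memory are hypothetical names, the addresses are made up, and treating an older access with a not-yet-computed address as a conflict is a conservative assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemOp:
    kind: str                 # "load" or "store"
    address: Optional[int]    # None until the address has been computed
    done: bool = False        # True once the memory access has happened


def may_access_memory(queue, index):
    """Decide whether queue[index] may access memory now.

    queue holds loads and stores in program order (oldest first).
    """
    op = queue[index]
    if op.address is None:
        return False
    for older in queue[:index]:
        if older.done:
            continue
        # A load only conflicts with older stores (RAW); a store conflicts
        # with older loads (WAR) and with older stores (WAW).
        if op.kind == "load" and older.kind == "load":
            continue
        # ASSUMPTION: an unknown older address is treated as a conflict.
        if older.address is None or older.address == op.address:
            return False
    return True


queue = [
    MemOp("store", 0x600040),
    MemOp("load", 0x600038),   # different address: may pass the store
    MemOp("load", 0x600040),   # same address as the older store: must wait
    MemOp("store", 0x600038),  # older load of 0x600038 is incomplete: must wait
]
print([may_access_memory(queue, i) for i in range(len(queue))])
```

The sketch suggests one answer to the question about WAW and WAR: a store should wait for any incomplete older load or store to the same address, while a load only needs to worry about older stores.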

ENCM 501 Winter 2018 Slide Set 9 slide 27/42
Load and store buffers
Load buffer and store buffer are names given to reservation stations dedicated to handling loads or stores.
Remark: The load buffers and store buffers provide an interface between the execution unit of the processor and the data caches. That's an interesting design problem we don't have time to study in this course.

ENCM 501 Winter 2018 Slide Set 9 slide 28/42
Vj, Vk, Qj, Qk, A for store buffers
Fields: Busy, Op, Vj, Vk, Qj, Qk, A
As with the FP math stations, Vj is ready if and only if Qj = 0, and the same applies for Vk and Qk.
Vk is used for the FP data to be written in an S.D instruction. So what does it mean if Qk ≠ 0?
Vj, Qj, and A have to do with memory address calculations. Let's not worry about the details for now.

ENCM 501 Winter 2018 Slide Set 9 slide 29/42
Vj, Vk, Qj, Qk, A for load buffers
Fields: Busy, Op, Vj, Vk, Qj, Qk, A
Again, Vj is ready if and only if Qj = 0, and the same applies for Vk and Qk.
Vj, Qj, and A have to do with memory address calculations. As with store buffers, let's not worry about the details for now.

ENCM 501 Winter 2018 Slide Set 9 slide 31/42
A loop example
This is from page 179 of the textbook:

    Loop: L.D     F0, 0(R1)
          MUL.D   F4, F0, F2
          S.D     F4, 0(R1)
          DADDIU  R1, R1, -8
          BNE     R1, R2, Loop

Let's make some notes about the DADDIU and BNE instructions.
Let's assume the loop starts with R1 = 0x600040 and R2 = 0x600000. Let's trace how Tomasulo's algorithm might handle the first two passes through the loop.
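As a small aid for that trace, here is a Python sketch that works out which memory address each pass of the loop uses, given the starting values of R1 and R2 from the slide. It models only the integer side of the loop (the decrement and the branch); the function name loop_addresses is made up for this illustration.

```python
def loop_addresses(r1, r2, step=-8):
    """Yield the address used by L.D and S.D on each pass through the loop."""
    while True:
        yield r1              # L.D and S.D both use offset 0 from R1
        r1 += step            # DADDIU R1, R1, -8
        if r1 == r2:          # BNE falls through when R1 equals R2
            return


addresses = list(loop_addresses(0x600040, 0x600000))
print(len(addresses), [hex(a) for a in addresses[:2]])
# prints: 8 ['0x600040', '0x600038']
```

So the first two passes touch different addresses, which is why the second pass's L.D does not conflict with the first pass's S.D.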

ENCM 501 Winter 2018 Slide Set 9 slide 33/42
Costs of the CDB (common data bus)
In a typical clock cycle, some reservation station will broadcast a result on the CDB, and other reservation stations and the register file will look at the result to see if it's useful. Transmitting the result and receiving the result both have energy costs.
A complex instruction unit, reservation stations, and related hardware require lots of transistors. If Moore's law had not applied for so many decades, we would not see Tomasulo's algorithm used as a basis for design of modestly priced processor chips.
It's possible, in some cycles, that two or more reservation stations will simultaneously try to broadcast their results. Why is this not a fatal defect in Tomasulo's algorithm?
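One plausible line of answer to the last question is arbitration: only one result gets the bus in a given cycle, and any other finished result simply waits and is broadcast later. The Python sketch below illustrates that idea for a single CDB. The tie-breaking policy (earliest finish first, then lowest RS number) and the name schedule_broadcasts are assumptions of this sketch, not something taken from the slides or the textbook.

```python
def schedule_broadcasts(finish_cycle):
    """Map each RS tag to the cycle in which it gets the single CDB.

    finish_cycle maps an RS tag to the cycle (1 or later) in which its
    result becomes ready.
    ASSUMPTION: earlier results go first, ties go to the lowest tag, and
    a result that loses arbitration waits for the next free cycle.
    """
    order = sorted(finish_cycle, key=lambda tag: (finish_cycle[tag], tag))
    granted, last_used = {}, 0
    for tag in order:
        cycle = max(last_used + 1, finish_cycle[tag])
        granted[tag] = cycle
        last_used = cycle
    return granted


# Add1 (tag 1) and Mult1 (tag 4) both finish in cycle 5; Add2 finishes in 6.
print(schedule_broadcasts({1: 5, 4: 5, 2: 6}))
# prints: {1: 5, 4: 6, 2: 7}
```

The cost of a collision in this model is delay, not incorrect results: consumers keep waiting on the tag until it actually appears on the bus.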

ENCM 501 Winter 2018 Slide Set 9 slide 35/42
Tomasulo's algorithm and branch prediction
Consider this code fragment:

    BEQ    R8, R0, L99
    S.D    F0, (R10)
    ADD.D  F2, F2, F4

Suppose the branch is incorrectly predicted as not taken, and S.D and ADD.D get issued while BEQ waits for some earlier instruction to provide a value for R8.
If Tomasulo's algorithm does nothing beyond what has been presented so far in lectures, what will prevent S.D from making an incorrect update to memory, and what will prevent ADD.D from making an incorrect update to F2?

ENCM 501 Winter 2018 Slide Set 9 slide 36/42
Tomasulo's algorithm and exceptions

    MUL.D  F2, F4, F6
    S.D    F2, 0(R8)
    SUB.D  F0, F12, F14
    L.D    F2, 0(R9)
    ADD.D  F8, F8, F2

Suppose MUL.D gets delayed because it has to wait until a result for F6 is ready. That will delay the execution of S.D. Meanwhile, Tomasulo's algorithm may allow completion of SUB.D, L.D, and ADD.D.
What kind of problem is created if S.D eventually results in a page fault exception?

ENCM 501 Winter 2018 Slide Set 9 slide 38/42
Out-of-order execution, in-order completion
The version of Tomasulo's algorithm presented in textbook Section 3.5 has scalar issue (that is, at most one instruction issued per clock cycle), out-of-order execution, and out-of-order completion.
Section 3.6 modifies the algorithm to include a circuit called a reorder buffer (ROB), which will enforce in-order completion.
Use of a reorder buffer solves the branch prediction and exception problems described on slides 35 and 36.

ENCM 501 Winter 2018 Slide Set 9 slide 39/42
In a processor with a reorder buffer, issue of an instruction sends information related to the instruction both to a reservation station and to the reorder buffer.
A reservation station for a store is responsible for address computation only; it is not allowed to write to memory.
The reorder buffer is a FIFO queue: instructions enter in program order, and leave in program order.
When an instruction gets to the head of the ROB, it can be committed as soon as its results are known. Examples:
- An ADD.D can be committed if a reservation station has provided the sum to the reorder buffer.
- An S.D can be committed if both the data to be stored and the address to be used are ready.

ENCM 501 Winter 2018 Slide Set 9 slide 40/42
Register file changes: The Qi field for each register is replaced by a Busy flag and a Reorder # field. Busy = 0 means the register is up-to-date; Busy = 1 means the register is waiting for a result from whatever entry in the reorder buffer matches the Reorder #.
The register file does not watch the CDB for results.
The ROB must watch the CDB for results for all of the instructions within the ROB that don't yet have results.

ENCM 501 Winter 2018 Slide Set 9 slide 41/42
The reservation stations and functional units work very much as before, except:
- the Qj and Qk fields hold ROB entry numbers instead of reservation station numbers;
- each reservation station has a Dest field to hold an ROB entry number;
- when a reservation station broadcasts its result on the CDB, it includes the Dest field value to help both the ROB and the other reservation stations.
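The in-order commit behaviour of the reorder buffer described on slides 39 to 41 can be sketched in Python as shown below. This is a deliberately simplified model: entry numbers, values and the address 0x1000 are invented, a store's data and address are delivered in a single step, and the helper names issue, cdb_result and commit are hypothetical.

```python
from collections import deque

rob = deque()          # FIFO: the head of the ROB is rob[0]
registers = {"F0": 0.0, "F2": 0.0}
memory = {}


def issue(entry_id, kind, dest):
    """Append an entry at the tail; dest is a register name or None."""
    rob.append({"id": entry_id, "kind": kind, "dest": dest,
                "value": None, "address": None, "ready": False})


def cdb_result(entry_id, value, address=None):
    """The ROB snoops the CDB and records a result tagged with entry_id."""
    for entry in rob:
        if entry["id"] == entry_id:
            entry.update(value=value, address=address, ready=True)


def commit():
    """Retire ready entries from the head only; stop at the first unready one."""
    committed = []
    while rob and rob[0]["ready"]:
        entry = rob.popleft()
        if entry["kind"] == "store":
            memory[entry["address"]] = entry["value"]    # memory written here
        else:
            registers[entry["dest"]] = entry["value"]    # register written here
        committed.append(entry["id"])
    return committed


issue(1, "fp", "F2")       # e.g. a slow MUL.D
issue(2, "store", None)    # an S.D that needs the MUL.D result
issue(3, "fp", "F0")       # an independent SUB.D
cdb_result(3, 7.0)         # the youngest instruction finishes first ...
first = commit()           # ... but nothing commits: entry 1 is not ready
cdb_result(1, 2.5)
cdb_result(2, 2.5, address=0x1000)
second = commit()          # now all three commit, in program order
print(first, second)
# prints: [] [1, 2, 3]
```

Execution is still out of order in this model (entry 3 produces its result first), but registers and memory only change at commit, and commit follows program order.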

ENCM 501 Winter 2018 Slide Set 9 slide 42/42
The reorder buffer and safe speculation
The key point about the ROB is that it can collect a large number of results without knowing whether those results should really be written to registers or memory.
Consider a branch instruction that is mispredicted as taken. What happens to all the instructions that got into the ROB before the branch? What happens to the branch target instruction, the successor of the branch target instruction, etc., which got into the ROB after the branch?
The bad effect of the above scenario is a waste of time and energy. What are the important bad effects that were prevented?
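One way to picture the recovery step is sketched below in Python: once a branch is found to be mispredicted, every ROB entry younger than the branch is discarded before it can commit, while older entries are left alone. The entry contents reuse the instructions from slide 35, the entry numbers are invented, and squash_after is a hypothetical name. The sketch assumes the branch is still in the ROB, and it does not model repairing the register status fields or redirecting instruction fetch.

```python
from collections import deque


def squash_after(rob, branch_id):
    """Drop every ROB entry younger than the mispredicted branch.

    rob is a deque of dicts in program order (oldest at index 0).
    ASSUMPTION: an entry with id == branch_id is present in rob.
    Returns the ids of the discarded entries, oldest first.
    """
    discarded = []
    while rob and rob[-1]["id"] != branch_id:
        discarded.append(rob.pop()["id"])   # remove from the tail (youngest)
    return discarded[::-1]


rob = deque([
    {"id": 10, "what": "older ADD.D"},
    {"id": 11, "what": "BEQ, mispredicted"},
    {"id": 12, "what": "S.D on the wrong path"},
    {"id": 13, "what": "ADD.D on the wrong path"},
])
squashed = squash_after(rob, 11)
print(squashed, [entry["id"] for entry in rob])
# prints: [12, 13] [10, 11]
```

Because wrong-path entries never reach the head of the ROB in a committable state, their results never reach the register file or memory; the only cost in this model is the work that was thrown away.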