CpE 442. Designing a Pipeline Processor (lect. II)

Outline of Today's Lecture
- Recap and Introduction (5 minutes)
- Introduction to Hazards (15 minutes)
- Forwarding (25 minutes)
- 1 cycle Load Delay (5 minutes)
- 1 cycle Branch Delay (15 minutes)
- What makes pipelining hard
- Summary (5 minutes)

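Review: Single Cycle, Multiple Cycle, vs. Pipeline
Single Cycle Implementation: Load uses Cycle 1 and Store uses Cycle 2. Each cycle must be long enough for the slowest instruction (Load), so Store wastes the end of its cycle.
Multiple Cycle Implementation (Cycles 1 to 10 of a shorter clock): Load takes five cycles (Ifetch, Reg, Exec, Mem, Wr), Store takes four (Ifetch, Reg, Exec, Mem), and the following R-type instruction begins its Ifetch in Cycle 10.
Pipeline Implementation: Load, Store, and R-type each pass through Ifetch, Reg, Exec, Mem, Wr, each starting one cycle after the previous instruction, so their stages overlap.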
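As a rough illustration of the pipeline row of this comparison, the Python sketch below prints an ideal five-stage timing chart and counts cycles. It assumes no hazards and one new instruction per cycle; `pipeline_chart` and `total_cycles` are hypothetical helpers, not part of the lecture.

```python
# Sketch: ideal 5-stage pipeline timing (no hazards, one issue per cycle).
STAGES = ["Ifetch", "Reg", "Exec", "Mem", "Wr"]


def pipeline_chart(instrs: list[str]) -> list[str]:
    """One text row per instruction; instruction i starts in cycle i + 1."""
    rows = []
    for i, name in enumerate(instrs):
        cells = ["."] * i + STAGES  # i idle cycles, then the five stages
        rows.append(f"{name:<7}" + " ".join(f"{c:<6}" for c in cells))
    return rows


def total_cycles(n_instrs: int, depth: int = len(STAGES)) -> int:
    """Fill the pipe once (depth cycles), then finish one instruction per cycle."""
    return depth + n_instrs - 1 if n_instrs > 0 else 0


if __name__ == "__main__":
    for row in pipeline_chart(["Load", "Store", "R-type"]):
        print(row)
    print("cycles:", total_cycles(3))
```

For the slide's three instructions, `total_cycles(3)` gives 5 + 3 - 1 = 7 cycles: five to fill the pipe, then one more for each additional instruction.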

Review: A Pipelined Datapath
[Figure: five-stage datapath (Ifetch, Reg/Dec, Exec, Mem, Wr) with the IF/ID, ID/Ex, Ex/Mem, and Mem/Wr pipeline registers between stages. Elements shown: clock, PC with PC+4, instruction unit, register file (Ra, Rb, Rw, Di; busA, busB), Imm16, Exec unit with Zero output, data memory (RA, WA, Di, Do), and the control signals RegWr, ExtOp, ALUOp, Branch, RegDst, ALUSrc, MemWr, MemtoReg.]

Review: Pipeline Control, "Data Stationary Control"
- The Main Control generates the control signals during Reg/Dec.
- Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later.
- Control signals for Mem (MemWr, Branch) are used 2 cycles later.
- Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later.
[Figure: the Main Control decodes Op in Reg/Dec. The ID/Ex register carries ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, MemtoReg, RegWr; the Ex/Mem register carries MemWr, Branch, MemtoReg, RegWr; the Mem/Wr register carries MemtoReg, RegWr.]
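The data stationary idea can be sketched in Python: decode once, then let each pipeline register carry only the control groups its downstream stages still need. The register and signal names follow the slide; the dataclasses and `clock_edge` are hypothetical, and the signal values chosen for the load are illustrative assumptions.

```python
# Sketch: control signals decoded in Reg/Dec travel with the instruction.
from dataclasses import dataclass, field


@dataclass
class ExecCtl:  # used in Exec, 1 cycle after decode
    ExtOp: int = 0
    ALUSrc: int = 0
    ALUOp: int = 0
    RegDst: int = 0


@dataclass
class MemCtl:  # used in Mem, 2 cycles after decode
    MemWr: int = 0
    Branch: int = 0


@dataclass
class WrCtl:  # used in Wr, 3 cycles after decode
    MemtoReg: int = 0
    RegWr: int = 0


@dataclass
class IdExReg:
    ex: ExecCtl = field(default_factory=ExecCtl)
    mem: MemCtl = field(default_factory=MemCtl)
    wr: WrCtl = field(default_factory=WrCtl)


@dataclass
class ExMemReg:
    mem: MemCtl = field(default_factory=MemCtl)
    wr: WrCtl = field(default_factory=WrCtl)


@dataclass
class MemWrReg:
    wr: WrCtl = field(default_factory=WrCtl)


def clock_edge(decoded: IdExReg, id_ex: IdExReg, ex_mem: ExMemReg):
    """Advance control one stage; each register drops the group already consumed."""
    return (
        decoded,                               # new ID/Ex: everything just decoded
        ExMemReg(mem=id_ex.mem, wr=id_ex.wr),  # Exec group used up, pass the rest
        MemWrReg(wr=ex_mem.wr),                # Mem group used up, keep only Wr group
    )


if __name__ == "__main__":
    # Hypothetical load decode: sign-extend, ALU takes the immediate, write back from memory.
    load = IdExReg(ExecCtl(ExtOp=1, ALUSrc=1), MemCtl(), WrCtl(MemtoReg=1, RegWr=1))
    id_ex, ex_mem, mem_wr = IdExReg(), ExMemReg(), MemWrReg()
    for decoded in [load, IdExReg(), IdExReg()]:  # the load, then two bubbles
        id_ex, ex_mem, mem_wr = clock_edge(decoded, id_ex, ex_mem)
    print(mem_wr)  # the load's Wr group has reached the Mem/Wr register
```

Zeroed dataclasses double as bubbles, which is also how a stall can be injected: feed an all-zero `IdExReg` instead of the decoded signals.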

Review: Pipeline Summary
- Pipeline Processor: a natural enhancement of the multiple clock cycle processor.
- Each functional unit can only be used once per instruction.
- If an instruction is going to use a functional unit, it must use it at the same stage as all other instructions.
- Pipeline Control: each stage's control signal depends ONLY on the instruction that is currently in that stage.

Introduction to Hazards
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
- Structural hazards: the hardware cannot support this combination of instructions.
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline.
- Control hazards: pipelining of branches and other instructions that change the PC.
- Common solution: stall the pipeline until the hazard is resolved, inserting "bubbles" in the pipeline.

A Single Memory is a Structural Hazard
[Figure: time (clock cycles) vs. instruction order for Load, Inst 1, Inst 2, Inst 3, Inst 4, each using Mem, Reg, ALU, Mem, Reg. In cycle 4, Load's data access and Inst 3's instruction fetch both need the single memory.]
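A small Python sketch of why one memory is a structural hazard: it lists the cycles in which a data access (MEM stage) and an instruction fetch (IF stage) would need the single memory at once. It assumes an ideal pipeline with instruction i (0-based) in IF during cycle i + 1 and in MEM during cycle i + 4, and it only detects conflicts; it does not model the stall that resolves them. `memory_conflicts` is a hypothetical name.

```python
# Sketch: detect single-memory conflicts between MEM and IF in an ideal pipeline.
def memory_conflicts(instrs: list[tuple[str, bool]]) -> list[tuple[int, str, str]]:
    """instrs: (name, uses data memory?). Returns (cycle, data-access instr, fetching instr)."""
    fetch_at = {i + 1: name for i, (name, _) in enumerate(instrs)}  # IF cycle -> instr
    conflicts = []
    for i, (name, uses_data_mem) in enumerate(instrs):
        mem_cycle = i + 4  # IF, Reg, ALU, then Mem
        if uses_data_mem and mem_cycle in fetch_at:
            conflicts.append((mem_cycle, name, fetch_at[mem_cycle]))
    return conflicts


if __name__ == "__main__":
    seq = [("Load", True), ("Inst 1", False), ("Inst 2", False),
           ("Inst 3", False), ("Inst 4", False)]
    print(memory_conflicts(seq))
```

On the slide's sequence this reports a single conflict in cycle 4 between Load and Inst 3, the collision shown in the figure.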

Option 1: Stall to Resolve Memory Structural Hazard
[Figure: same sequence; Inst 3 is stalled for one cycle, a bubble takes its slot, and Inst 3 and Inst 4 proceed one cycle later.]

Option 2: Duplicate to Resolve Structural Hazard
Separate Instruction Cache (Im) and Data Cache (Dm).
[Figure: Load and Inst 1 to Inst 4 each use Im, Reg, ALU, Dm, Reg; instruction fetches and data accesses no longer compete, so no stall is needed.]

Data Hazard on r1
    add r1, r2, r3
    sub r4, r1, r3
    and r6, r1, r7
    or  r8, r1, r9
    xor r10, r1, r11

Data Hazard on r1 (Figure 6.30, page 397, P&H)
Dependencies backwards in time are hazards.
[Figure: time (clock cycles) vs. instruction order, stages IF, ID/RF, EX, MEM, WB (Im, Reg, ALU, Dm, Reg) for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11. sub and and need r1 before add writes it back.]
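The "backwards in time" test can be written as a short Python check. This sketch assumes no forwarding, that a result is written to the register file at the end of WB (the producer's 5th cycle) and read in ID (the consumer's 2nd cycle), and a hypothetical (opcode, dest, src1, src2) tuple encoding. The `write_then_read` flag models a register file that writes in the first half of a cycle and reads in the second half, which shrinks the hazard window from three instructions to two.

```python
# Sketch: RAW hazard detection without forwarding (hypothetical encoding).
Instr = tuple[str, str, str, str]  # (opcode, dest, src1, src2)


def raw_hazards(code: list[Instr], write_then_read: bool = False) -> list[tuple[int, int, str]]:
    """Return (producer index, consumer index, register) for each RAW hazard.

    Producer i writes at the end of its WB (cycle i + 5); consumer j reads in
    its ID (cycle j + 2). Without a split-cycle register file the read must come
    at least one cycle after the write (j - i >= 4); with write_then_read the
    read may happen in the same cycle (j - i >= 3).
    """
    window = 2 if write_then_read else 3  # distances that are still too close
    hazards = []
    for i, (_, dest, _, _) in enumerate(code):
        for j in range(i + 1, min(i + 1 + window, len(code))):
            _, _, src1, src2 = code[j]
            if dest in (src1, src2):
                hazards.append((i, j, dest))
    return hazards


SEQUENCE: list[Instr] = [
    ("add", "r1", "r2", "r3"),
    ("sub", "r4", "r1", "r3"),
    ("and", "r6", "r1", "r7"),
    ("or", "r8", "r1", "r9"),
    ("xor", "r10", "r1", "r11"),
]

if __name__ == "__main__":
    print(raw_hazards(SEQUENCE))
    print(raw_hazards(SEQUENCE, write_then_read=True))
```

With the default timing, sub, and, and or are flagged, consistent with the three-cycle stall used in this lecture; with `write_then_read=True` only sub and and are flagged.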

Option 1: HW Stalls to Resolve Data Hazard
Dependencies backwards in time are hazards.
[Figure: same sequence; after add r1,r2,r3, three bubbles delay sub r4,r1,r3 so it reads r1 only after add has written it, and the following instructions are delayed accordingly.]

But recall the use of "Data Stationary Control"
The Main Control generates all control signals during Reg/Dec, and they travel with the instruction through the ID/Ex, Ex/Mem, and Mem/Wr registers: control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later, for Mem (MemWr, Branch) 2 cycles later, and for Wr (MemtoReg, RegWr) 3 cycles later.

Option 1: How HW really stalls the pipeline
HW doesn't change the PC, so it keeps fetching the same instruction, and sets the control signals to benign values (0).
[Figure: add r1,r2,r3 is followed by three stall slots, each a bubble passing through the pipeline, before sub r4,r1,r3 and and r6,r1,r7 proceed.]

Option 2: SW inserts independent instructions
Worst case, it inserts NOP instructions.
[Figure: add r1,r2,r3; nop; nop; nop; sub r4,r1,r3; and r6,r1,r7.]
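The worst case of this option can be sketched as a Python pass over straight-line code that pads each RAW dependence with NOPs. It assumes the same no-forwarding timing as the stall diagrams in this lecture (a consumer must sit at least four slots after its producer, hence three NOPs after add r1,r2,r3) and a hypothetical (opcode, dest, src1, src2) tuple format; a real compiler would first try to move independent instructions into those slots.

```python
# Sketch: pad RAW hazards with NOPs (no forwarding, no hazard-detection hardware).
Instr = tuple[str, str, str, str]  # (opcode, dest, src1, src2); "" means unused
NOP: Instr = ("nop", "", "", "")
MIN_DISTANCE = 4  # consumer must be >= 4 slots after its producer (3 slots between)


def insert_nops(code: list[Instr]) -> list[Instr]:
    out: list[Instr] = []
    for instr in code:
        _, _, src1, src2 = instr
        needed = 0
        # Look back over the last MIN_DISTANCE - 1 emitted slots for a producer.
        for back in range(1, min(MIN_DISTANCE, len(out) + 1)):
            _, dest, _, _ = out[-back]
            if dest and dest in (src1, src2):
                needed = max(needed, MIN_DISTANCE - back)
        out.extend([NOP] * needed)
        out.append(instr)
    return out


if __name__ == "__main__":
    code = [("add", "r1", "r2", "r3"), ("sub", "r4", "r1", "r3"), ("and", "r6", "r1", "r7")]
    for op, *regs in insert_nops(code):
        print(op, ",".join(r for r in regs if r))
```

Only sub needs padding here: once three NOPs separate it from add, the later and is already far enough away.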

Option 3 Insight: Data is available! (Figure 6.35, page 415, P&H)
Pipeline registers already contain the needed data.
[Figure: same add/sub/and/or/xor sequence; the add result is already held in the pipeline registers after add's EX stage, before the later instructions need it in their own EX stages.]

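HW Change for Forwarding (Bypassing)
Enlarge the ALU input multiplexors to add paths from the pipeline registers.
Assumes a register read during a write gets the new value (otherwise more results would have to be forwarded).
[Figure: ID/EX, EX/MEM, and MEM/WB registers, the ALU with its Zero? output, and data memory.]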
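The forwarding unit's selection logic can be sketched in Python using the conditions from Patterson and Hennessy's textbook treatment: forward from EX/MEM if that instruction writes a nonzero register matching the source, else from MEM/WB under the same test, else read the register file. The function name and string return values are assumptions; the textbook encodes the three choices as a 2-bit mux select (10, 01, 00).

```python
# Sketch of the forwarding-unit decision for one ALU operand.
def forward_select(src_reg: int,
                   ex_mem_reg_write: bool, ex_mem_rd: int,
                   mem_wb_reg_write: bool, mem_wb_rd: int) -> str:
    """Pick the source of the ALU operand for register src_reg (ID/EX.Rs or ID/EX.Rt)."""
    # Register 0 is hardwired to zero in MIPS, so it is never forwarded.
    # EX hazard: the instruction one ahead (now in MEM) produces src_reg.
    # Checked first because it holds the most recent value.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"   # textbook encoding 10
    # MEM hazard: the instruction two ahead (now in WB) produces src_reg.
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"   # textbook encoding 01
    return "REGFILE"      # textbook encoding 00: no forwarding needed


if __name__ == "__main__":
    # sub r4,r1,r3 in EX while add r1,r2,r3 is in MEM: take r1 from EX/MEM.
    print(forward_select(1, True, 1, False, 0))
```

Checking EX/MEM before MEM/WB matters when both older instructions write the same register: the younger producer's value is the correct one.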

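Complete Datapath with Hazard Detection and Forwarding (Figure 6.41 in the text)
[Figure: IF.Flush; hazard detection unit; control with a 0 input for inserting bubbles; IF/ID, ID/EX, EX/MEM, MEM/WB registers carrying WB, M, EX control fields; PC with +4 adder; instruction memory; shift left 2; registers with an = comparator; sign extend; data memory; forwarding unit.]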

From Last Lecture: The Delay Load Phenomenon
[Figure: Cycles 1 to 8; I0: Load, then Plus 1 to Plus 4, each passing through Ifetch, Reg/Dec, Exec, Mem, Wr, one cycle apart.]
Although Load is fetched during Cycle 1:
- The data is NOT written into the Reg File until the end of Cycle 5.
- We cannot read this value from the Reg File until Cycle 6.
- So there is a 3-instruction delay before the load takes effect.

Forwarding reduces the Data Hazard to 1 cycle (Figure 6.47, page 420, P&H)
[Figure: lw r1, 0(r2); sub r4,r1,r6; and r6,r1,r7; or r8,r1,r9. The loaded value exists only at the end of lw's MEM stage, one cycle too late to forward into sub's EX stage.]

Option 1: HW Stalls to Resolve Data Hazard
"Interlock": checks for the hazard and stalls.
[Figure: lw r1, 0(r2); stall (one bubble); sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9.]
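The interlock for the load case can be sketched in Python following the hazard detection unit condition described in Patterson and Hennessy: stall when the instruction in EX is a load (MemRead) whose destination rt matches either source register of the instruction in ID. On a stall the PC and the IF/ID register are not written and the control fields entering ID/EX are zeroed, which creates the bubble. The dataclasses and function names are hypothetical.

```python
# Sketch: load-use hazard detection and the resulting stall controls.
from dataclasses import dataclass


@dataclass
class IfIdFields:
    rs: int  # source register numbers of the instruction being decoded
    rt: int


@dataclass
class IdExFields:
    mem_read: bool  # true only for loads
    rt: int         # a load's destination register


def load_use_hazard(id_ex: IdExFields, if_id: IfIdFields) -> bool:
    """True when the load in EX writes a register the instruction in ID reads."""
    return id_ex.mem_read and id_ex.rt in (if_id.rs, if_id.rt)


def stall_controls(hazard: bool) -> dict:
    """Freeze PC and IF/ID, and turn the instruction entering ID/EX into a bubble."""
    return {
        "PCWrite": not hazard,
        "IFIDWrite": not hazard,
        "zero_ID_EX_control": hazard,
    }


if __name__ == "__main__":
    lw_r1 = IdExFields(mem_read=True, rt=1)  # lw r1, 0(r2) in EX
    sub = IfIdFields(rs=1, rt=3)             # sub r4,r1,r3 in ID
    print(stall_controls(load_use_hazard(lw_r1, sub)))
```

After one bubble the load reaches MEM/WB, and the forwarding path can supply r1 to sub, which matches the single stall in the figure.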

Option 2: SW inserts independent instructions
Worst case, it inserts NOP instructions.
MIPS I solution: no HW checking.
[Figure: lw r1, 0(r2); nop; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9.]

Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf
    SW  d,Rd

Fast code (no instruction uses a register loaded by the LW immediately before it):
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    SW  d,Rd
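To compare the two schedules, the Python sketch below counts load-use stalls, assuming forwarding so that only an instruction reading the register loaded by the immediately preceding LW costs one stall cycle. The tuple encoding and helper names are hypothetical, and memory operands such as b are treated as labels rather than registers.

```python
# Sketch: count load-use stalls, assuming full forwarding (1 stall per load-use pair).
Instr = tuple[str, str, tuple[str, ...]]  # (opcode, dest register or "", source registers)

SLOW: list[Instr] = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")), ("SW", "", ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")), ("SW", "", ("Rd",)),
]
FAST: list[Instr] = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("LW", "Rf", ()), ("SW", "", ("Ra",)), ("SUB", "Rd", ("Re", "Rf")), ("SW", "", ("Rd",)),
]


def load_use_stalls(code: list[Instr]) -> int:
    """One stall whenever an instruction reads the register loaded just before it."""
    stalls = 0
    for (op, dest, _), (_, _, srcs) in zip(code, code[1:]):
        if op == "LW" and dest in srcs:
            stalls += 1
    return stalls


def total_cycles(code: list[Instr], depth: int = 5) -> int:
    """Ideal pipeline fill, one cycle per instruction, plus stall cycles."""
    return depth - 1 + len(code) + load_use_stalls(code)


if __name__ == "__main__":
    for name, code in (("slow", SLOW), ("fast", FAST)):
        print(name, load_use_stalls(code), "stalls,", total_cycles(code), "cycles")
```

Under these assumptions the slow code has 2 stalls (after LW Rc and after LW Rf) and takes 14 cycles, while the fast code has none and takes 12.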

From Last Lecture: The Delay Branch Phenomenon
[Figure: Cycles 4 to 11; 12: Beq (target is 1000); 16: R-type; 20: R-type; 24: R-type; 1000: Target of Br; each passes through Ifetch, Reg/Dec, Exec, Mem, Wr, one cycle apart.]
Although Beq is fetched during Cycle 4:
- The target address is NOT written into the PC until the end of Cycle 7.
- The branch's target is NOT fetched until Cycle 8.
- So there is a 3-instruction delay before the branch takes effect.

Control Hazard on Branches: 3 stage stall
[Figure: time (CC 1 to CC 9) vs. program execution order: 40 beq $1,$3,36; 44 and $12,$2,$5; 48 or $13,$6,$2; 52 add $14,$2,$2; 80 ld $4,$7,100. The three instructions at 44, 48, and 52 are already in the pipeline before the branch reaches the branch target at 80.]

Branch Stall Impact
If CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles, then the new CPI = 1 + 0.3 × 3 = 1.9!
Two-part solution:
- Determine whether the branch is taken or not sooner, AND
- Compute the taken branch address earlier.
Solution Option 1:
- Move the Zero test to the ID/RF stage.
- Add an adder to calculate the new PC in the ID/RF stage.
- Result: a 1 clock cycle penalty for a branch vs. 3.
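The branch-stall arithmetic on this slide is base CPI plus branch frequency times stall cycles. A minimal Python version, assuming every branch pays the full penalty (the function name is hypothetical):

```python
# Sketch: effective CPI when every branch adds a fixed number of stall cycles.
def cpi_with_branch_stalls(base_cpi: float, branch_fraction: float, stall_cycles: int) -> float:
    """Average CPI = base CPI + (fraction of branches) x (stall cycles per branch)."""
    return base_cpi + branch_fraction * stall_cycles


if __name__ == "__main__":
    print(round(cpi_with_branch_stalls(1.0, 0.30, 3), 2))  # branch resolved in Mem
    print(round(cpi_with_branch_stalls(1.0, 0.30, 1), 2))  # branch resolved in ID/RF
```

It reproduces the slide's 1.9 for a 3-cycle stall; with the 1-cycle penalty of Option 1 the CPI drops to 1.3.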

Option 1: move HW forward to reduce branch delay
Datapath before the change:
[Figure: stages Instruction Fetch, Instr. Decode/Reg. Fetch, Execute/Addr. Calc., Memory Access, Write Back, separated by IF/ID, ID/EX, EX/MEM, MEM/WB; the Zero? test is in the Execute stage. Labels include PC, +4 adder, ADD, instruction memory, IR fields 6..10 and 11..15, MEM/WB.IR, registers, sign extend (16 to 32 bits), data memory.]

Branch Delay now 1 clock cycle
Datapath after the change:
[Figure: same five stages; a second ADD (for the branch target) and the Zero? test now sit in the Instr. Decode/Reg. Fetch stage, alongside the registers and sign extend (16 to 32 bits).]
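A Python sketch of what the moved hardware computes in ID: an equality comparison of the two register values and the target PC + 4 + (sign-extended offset × 4). It assumes the word-offset encoding used in the Figure 6.52 example (beq $1, $3, 7 at address 40 branches to 72); the 3-stage-stall slide instead writes the offset in bytes (beq $1,$3,36 at 40 reaching 80). The function names are hypothetical.

```python
# Sketch: resolve beq in the ID stage (equality compare + target adder).
def sign_extend16(imm16: int) -> int:
    """Interpret the low 16 bits as a two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def resolve_beq_in_id(pc: int, imm16: int, rs_value: int, rt_value: int) -> tuple[bool, int]:
    """Return (taken, next PC) for a beq at address pc with a word offset imm16."""
    pc_plus_4 = pc + 4
    target = pc_plus_4 + (sign_extend16(imm16) << 2)  # "shift left 2" feeding the adder
    taken = rs_value == rt_value                     # "=" comparator on register values
    return taken, (target if taken else pc_plus_4)


if __name__ == "__main__":
    print(resolve_beq_in_id(40, 7, 5, 5))  # taken: fetch continues at the target
    print(resolve_beq_in_id(40, 7, 5, 6))  # not taken: fall through to PC + 4
```

Because only equality is needed, a comparator is enough in ID; a full ALU subtraction stays in EX.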

Option 2: No Stalls. Define the branch as delayed: insert an instruction after the branch and allow it to execute always. Worst case, SW inserts a NOP into the branch delay slot if no instruction can be found.
Where to get instructions to fill the branch delay slot?
- Before the branch instruction: for example, change sw r1,0(r2); beqd r0,r2,T to beqd r0,r2,T; sw r1,0(r2).
- From the target address: only valuable when the branch is taken.
- From fall through: only valuable when the branch is not taken.
Compiler effectiveness for a single branch delay slot:
- Fills about 60% of branch delay slots.
- About 80% of instructions executed in branch delay slots are useful in computation.
- About 50% (60% × 80% = 48%) of slots are usefully filled.
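The "fill from before the branch" strategy can be sketched in Python. This deliberately conservative version only considers the single instruction just ahead of the branch: it moves that instruction into the delay slot when the branch does not read any register it writes, and otherwise falls back to a NOP. The tuple format is hypothetical, and the sketch assumes the preceding instruction is not itself a branch or jump.

```python
# Sketch: fill one branch delay slot from the instruction before the branch.
Instr = tuple[str, tuple[str, ...], tuple[str, ...]]  # (text, regs written, regs read)
NOP: Instr = ("nop", (), ())


def fill_delay_slot(block: list[Instr]) -> list[Instr]:
    """block ends with a delayed branch; return it with the delay slot filled.

    Assumes the instruction before the branch is not itself a branch or jump.
    """
    *body, branch = block
    if body:
        prev = body[-1]
        if not set(prev[1]) & set(branch[2]):  # branch must not read prev's result
            return body[:-1] + [branch, prev]
    return body + [branch, NOP]  # worst case: nothing movable, so insert a NOP


if __name__ == "__main__":
    sw = ("sw r1,0(r2)", (), ("r1", "r2"))
    beqd = ("beqd r0,r2,T", (), ("r0", "r2"))
    print([text for text, _, _ in fill_delay_slot([sw, beqd])])
```

The slide's example is moved into the slot because sw writes no register; an instruction that writes r2 would have to stay put, leaving a NOP in the slot.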

Example (Text Figure 6.52): the branch beq $1, $3, 7 is resolved in the ID stage.
- Clock 3: and $12, $2, $5 is in IF, beq $1, $3, 7 is in ID, sub $10, $4, $8 is in EX, and before<1> and before<2> occupy MEM and WB. In ID, registers $1 and $3 are read and compared, and the shifted offset (7 × 4 = 28) is added to PC + 4 = 44 to form the target 72; IF.Flush is asserted.
- Clock 4: lw $4, 50($7), the instruction at the target 72, is in IF; the flushed and has become a bubble (nop) in ID; beq $1, $3, 7 is in EX; sub $10, ... is in MEM; before<1> is in WB. The PC now holds 76.

When is pipelining hard?
Interrupts: 5 instructions are executing in a 5-stage pipeline.
- How to stop the pipeline?
- Restart?
- Who caused the interrupt?
Stage: problem interrupts occurring
- IF: page fault on instruction fetch; misaligned memory access; memory-protection violation
- ID: undefined or illegal opcode
- EX: arithmetic interrupt
- MEM: page fault on data fetch; misaligned memory access; memory-protection violation
Load with a data page fault, Add with an instruction page fault?
Solution 1: interrupt vector per instruction. Solution 2: interrupt ASAP, restart everything incomplete.
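The load-plus-add question can be made concrete in Python. Under ideal pipeline timing the add's instruction page fault (IF, cycle 2) is raised before the load's data page fault (MEM, cycle 4), even though the load is the older instruction. This sketch reads "Solution 1" as carrying a per-instruction exception status vector and acting on it only when the instruction reaches WB, so the oldest faulting instruction is handled first; that interpretation, the timing model, and the function names are assumptions.

```python
# Sketch: faults are raised out of program order but handled in program order.
from typing import Optional

STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def detection_cycle(index: int, stage: str) -> int:
    """Instruction `index` (0-based) is in `stage` during this cycle (1-based)."""
    return index + STAGES.index(stage) + 1


def detection_order(faults: dict[int, tuple[str, str]]) -> list[int]:
    """Faulting instruction indices in the order their faults are raised."""
    return sorted(faults, key=lambda i: (detection_cycle(i, faults[i][0]), i))


def precise_choice(faults: dict[int, tuple[str, str]]) -> Optional[int]:
    """Record each fault with its instruction and act on it only at WB:
    the oldest faulting instruction reaches WB first, so it is handled."""
    return min(faults) if faults else None


if __name__ == "__main__":
    faults = {0: ("MEM", "data page fault"),          # the load
              1: ("IF", "instruction page fault")}    # the add right after it
    print("raised in order:", detection_order(faults))
    print("handled first:", precise_choice(faults))
```

Acting on faults as soon as they are raised would service the add first, even though the load ahead of it has not completed; deferring the check to a single stage keeps the exception precise.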

Datapath with Exception Handling (Text Figure 6.55)
Add a Cause register, an Exception PC (EPC), and the constant address of the exception-handling routine (40000040).
[Figure: IF.Flush, ID.Flush, EX.Flush; hazard detection unit; control with 0 inputs for flushing; Cause and EPC registers; IF/ID, ID/EX, EX/MEM, MEM/WB registers; PC with +4 adder; instruction memory; shift left 2; registers with an = comparator; sign extend; data memory; forwarding unit.]

Review: Summary of Pipelining Basics
Speedup ≤ Pipeline Depth (number of stages); if the ideal CPI is 1, then:
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$
Hazards limit performance on computers:
- Structural: need more HW resources
- Data: need forwarding, compiler scheduling
- Control: early evaluation and PC, delayed branch, prediction
Increasing the length of the pipe increases the impact of hazards, since pipelining helps instruction bandwidth, not latency.
Compilers are key to reducing the cost of data and control hazards:
- load delay slots
- branch delay slots
Exceptions, the Instruction Set, and FP make pipelining harder.
Longer pipelines => Branch prediction, more instruction parallelism?
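The speedup formula can be evaluated directly. The Python sketch below assumes the ideal CPI of 1 stated on the slide; the function name and the default of equal clock cycles are assumptions.

```python
# Sketch: pipeline speedup over an unpipelined machine with ideal CPI 1.
def pipeline_speedup(depth: int, stall_cycles_per_instr: float,
                     clock_unpipelined: float = 1.0, clock_pipelined: float = 1.0) -> float:
    """Speedup = depth / (1 + stall cycles per instruction) x clock ratio."""
    return depth / (1 + stall_cycles_per_instr) * (clock_unpipelined / clock_pipelined)


if __name__ == "__main__":
    print(pipeline_speedup(5, 0.0))            # no stalls
    print(round(pipeline_speedup(5, 0.9), 2))  # 30% branches x 3 stall cycles
```

A 5-stage pipeline with no stalls and equal clock cycles gives a speedup of 5, while 0.9 stall cycles per instruction (30% branches with a 3-cycle stall) cuts it to 5 / 1.9, about 2.6.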