Instruction Level Parallelism and Its. (Part II) ECE 154B

Size: px
Start display at page:

Download "Instruction Level Parallelism and Its. (Part II) ECE 154B"

Transcription

1 Instruction Level Parallelism and Its Exploitation (Part II) ECE 154B Dmitri Strukov

2 ILP techniques not covered last week this week next week

3 Scoreboard Technique Review Allow for out of order execution by processing several instructions simultaneously In order issue / out of order of order ex/completion Pipe stages: Issue, read registers, ex, write back Booking to resolve RAW, WAW, and RAW Resolve RAW at read registers stage Stall issue for WAW, stall completion (WB) for WAR

4 Instruction status: Scoreboard Example Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 FU

5 Another Dynamic Algorithm: Tomasulo s s Algorithm For IBM 360/91 about 3 years after CDC 6600 (1966) Goal: High Performance without special compilers Differences between IBM 360 & CDC 6600 ISA IBM has only 2 register specifiers/instr vs. 3 in CDC 6600 IBM has 4 FP registers vs. 8 in CDC 6600 IBMhasmemory memory register register ops Small number of floating point registers prevented interesting compiler scheduling of operations This led Tomasulo to try to figure out how to get more effective registers renaming in hardware! Why Study? The descendants d of this have flourished! Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604,

6 Tomasulo vs. Scoreboard Control & buffers distributed with Function Units (FU) vs. centralized din scoreboard FU buffers called reservation stations ; have pending operands Registers in instructions replaced by values or pointers to reservation stations(rs); called register renaming avoids WAR, WAW hazards More reservation stations than registers, so can do optimizations compilers can t Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs Load and Stores treated as FUs with RSs as well

7 Tomasulo Organization From Mem FP Op Queue Load Buffers Load1 Load2 Load3 Load4 Load5 Load6 FP Registers Store Buffers Add1 Add2 Add3 Mult1 Mult2 FP adders Reservation Stations FP multipliers To Mem Common Data Bus (CDB)

8 Reservation Station Components Op: Operation to perform in the unit (e.g., + or ) Vj, Vk: Value of source operands Store buffers have V fields, results to be stored Qj, Qk: Reservation stations producing source registers (value to be written) No ready flags as in Scoreboard; Qj,Qk=0 Qk=0 => ready Store buffers only have Qi for RS producing result Busy: Indicates reservation station or FU is busy Register result status Indicates which functional unit will write each register, if one exists. Blank when no pending instructions i that will write that register.

9 Three Stages of Tomasulo Algorithm 1. Issue get instruction from FP Op Queue If reservation station free (no structural hazard), control issues instr & sends operands (renames registers) 2. Execute operate on operands (EX) When both operands ready then execute; if not ready, watch Common Data Bus for result 3. Write result finish execution (WB) Write on Common Data Bus to all awaiting units; markreservation reservation station available Normal data bus: data + destination ( go to bus) Common data bus: data + source ( come from bus) 64 bits of data + 4 bits of Functional Unit source address Write if matches expected Functional Unit (produces result) Does the broadcast

10 Tomasulo Example Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 Load1 No LD F2 45+ R3 Load2 No MULTD F0 F2 F4 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 0 FU

11 Tomasulo Example Cycle 1 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 Load1 Yes 34+R2 LD F2 45+ R3 Load2 No MULTD F0 F2 F4 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 1 FU Load1

12 Tomasulo Example Cycle 2 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 Load1 Yes 34+R2 LD F2 45+ R3 2 Load2 Yes 45+R3 MULTD F0 F2 F4 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 2 FU Load2 Load1 Note: Unlike 6600, can have multiple loads outstanding (This was not an inherent limitation of scoreboarding)

13 Tomasulo Example Cycle 3 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 Load1 Yes 34+R2 LD F2 45+ R3 2 Load2 Yes 45+R3 MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 Yes MULTD R(F4) Load2 Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 3 FU Mult1 Load2 Load1 Note: registers names are removed ( renamed ) in Reservation Stations; MULT issued vs. scoreboard Load1 completing; what is waiting for Load1?

14 Tomasulo Example Cycle 4 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R3 2 4 Load2 Yes 45+R3 MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 Yes SUBD M(A1) Load2 Add2 No Add3 No Mult1 Yes MULTD R(F4) Load2 Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 4 FU Mult1 Load2 M(A1) Add1 Load2 completing; what is waiting for Load2?

15 Tomasulo Example Cycle 5 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 DIVD F10 F0 F6 5 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk 2 Add1 Yes SUBD M(A1) M(A2) Add2 No Add3 No 10 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 5 FU Mult1 M(A2) M(A1) Add1 Mult2

16 Tomasulo Example Cycle 6 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk 1 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 9Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 6 FU Mult1 M(A2) Add2 Add1 Mult2 Issue ADDD here?

17 Tomasulo Example Cycle 7 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk 0 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 8Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 7 FU Mult1 M(A2) Add2 Add1 Mult2 Add1 completing; what is waiting for it?

18 Tomasulo Example Cycle 8 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No 2 Add2 Yes ADDD (M-M) M(A2) Add3 No 7Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 8 FU Mult1 M(A2) Add2 (M-M) Mult2

19 Tomasulo Example Cycle 9 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No 1 Add2 Yes ADDD (M-M) M(A2) Add3 No 6Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 9 FU Mult1 M(A2) Add2 (M-M) Mult2

20 Tomasulo Example Cycle 10 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No 0 Add2 Yes ADDD (M-M) M(A2) Add3 No 5Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 10 FU Mult1 M(A2) Add2 (M-M) Mult2 Add2 completing; what is waiting for it?

21 Tomasulo Example Cycle 11 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 4Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 11 FU Mult1 M(A2) (M-M+M(M-M) Mult2 Write result of ADDD here? All quick instructions complete in this cycle!

22 Tomasulo Example Cycle 12 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 3Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 12 FU Mult1 M(A2) (M-M+M(M-M) Mult2

23 Tomasulo Example Cycle 13 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 2Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 13 FU Mult1 M(A2) (M-M+M(M-M) Mult2

24 Tomasulo Example Cycle 14 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 1Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 14 FU Mult1 M(A2) (M-M+M(M-M) Mult2

25 Tomasulo Example Cycle 15 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 0Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 15 FU Mult1 M(A2) (M-M+M(M-M) Mult2

26 Tomasulo Example Cycle 16 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 40 Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 16 FU M*F4 M(A2) (M-M+M(M-M) Mult2

27 Faster than light computation (skip a couple of cycles)

28 Tomasulo Example Cycle 55 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 1Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 55 FU M*F4 M(A2) (M-M+M(M-M) Mult2

29 Tomasulo Example Cycle 56 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F Load3 No SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 0Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 56 FU M*F4 M(A2) (M-M+M(M-M) Mult2 Mult2 is completing; what is waiting for it?

30 Tomasulo Example Cycle 57 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F Load3 No SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 56 FU M*F4 M(A2) (M-M+M(M-M) Result Once again: In-order issue, out-of-order execution and completion

31 Compare to Scoreboard Cycle 62 Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Why take longer on scoreboard/6600? Structural hazards Lack of forwarding

32 Tomasulo vs. Scoreboard (IBM 360/91 vs. CDC 6600) Pipelined Functional Units Multiple Functional Units (6 load, 3 store, 3+, 2x/ ) (1 load/store, 1+, 2x, 1 ) window size: 14 instructions 5 instructions No issue on structural hazard same WAR: renaming avoids stall completion WAW: renaming avoids stall issue Broadcast results from FU Write/read registers Control: reservation stations central scoreboard

33 Tomasulo Drawbacks Complexity delays of 360/91, MIPS 10000, IBM 620? Many associative stores (CDB) at high speed Performance limited by Common Data Bus Each hcdb must go to multiple l functional units high h capacitance, high wiring density Number of functional units that can complete per cycle limited to one! (Multiple CDBs more FU logic for parallel assoc stores) Non precise interrupts!

34 Summary on Tomasulo Reservations stations: implicit register renaming to larger set of registers + buffering source operands Prevents registers as bottleneck Avoids WAR, WAW hazards of Scoreboard Allows loop unrolling in HW (when branch can be quickly resolved) Helps cache misses as well Lasting Contributions Dynamic scheduling Register renaming Load/store disambiguation 360/91 descendants d are Pentium II; PowerPC PC604; MIPS R10000; HP PA 8000; Alpha 21264

35 Exploiting More ILP Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP (will look at branch prediction implementations next week) One way to overcome control dependencies is with speculation Mk Make a guess and execute program as if our guess is correct Need mechanisms to handle the case when the speculation is incorrect Can do some speculation in the compiler We saw this previously with reordered / duplicated instructions around branches

36 Hardware Based Speculation Extends the idea of dynamic scheduling with three key ideas: 1. Dynamic branch prediction 2. Speculation to allow the execution of instructions before control dependencies are resolved 3. Dynamic scheduling to deal with scheduling different combinations of basic blocks What we saw earlier was within a basic block Modern processors started using speculation around the introduction of the PowerPC 603, Intel Pentium II and extend Tomasulo s approach to support speculation

37 Speculating with Tomasulo Separate execution from completion Allow instructions to execute speculatively but do not let instructions update registers or memory until they are no longer speculative Instruction Commit After an instruction is no longer speculative it is allowed to make register and memory updates Allow instructions i to execute and complete out of order but force them to commit in order Add a hardware buffer, called the reorder buffer (ROB), with registers to hold the result of an instruction between completion and commit Acts as a FIFO queue in order issued

38 Original Tomasulo Architecture

39 Tomasulo and Reorder Buffer Sits between Execution and Register File Source of operands In this case integrated with Store buffer Reservation stations use ROB slot as a tag Instructions commit at head of ROB FIFO queue Easy to undo speculated instructions on mispredicted d branches or on exceptions

40 ROB Data Structure Instruction Type Field Indicates whether the instruction is a branch, store, or register operation ination Field Register number for loads, ALU ops, or memory address for stores Value Field Holds the value of the instruction result until instruction commits Ready Field Indicates if instruction has completed execution and the value is ready

41 Instruction Execution 1. Issue: Get an instruction from the Instruction Queue If the reservation station and the ROB has a free slot (no structural hazard), issue the instruction to the reservation station and the ROB, send operands to the reservation station if available in the register file or the ROB. The allocated ROB slot number is sent to the reservation station to use as a tag when placing data on the CDB. 2. Execution: Operate on operands (EX) When both operands ready then execute; if not ready, watch CDB for result 3. Write result: Finish execution (WB) Write on CDB to all awaiting units and to the ROB using the tag; mark reservation station available 4. Commit: Update register or memory with the ROB result When an instruction reaches the head of the ROB and results are present, update the register with the result or store to memory and remove the instruction from the ROB If an incorrectly predicted branch reaches es the head of the ROB, flush the ROB, and restart at the correct successor of the branch Blue text = Change from Tomasulo

42 FP Op Queue Tomasulo With ROB State ROB7 ROB6 Newest Reorder Buffer ROB5 ROB4 ROB3 ROB2 F0 LD F0, 10(R2) I ROB1 Oldest Registers To Memory From Memory FP adders Reservation Stations FP multipliers li li 1 10+R2

43 FP Op Queue Tomasulo With ROB State ROB7 ROB6 Newest Reorder Buffer ROB5 ROB4 ROB3 ROB2 F0 LD F0, 10(R2) E ROB1 Oldest Registers To Memory From Memory FP adders Reservation Stations FP multipliers li li 1 10+R2

44 FP Op Queue Tomasulo With ROB State ROB7 ROB6 Newest Reorder Buffer ROB5 ROB4 ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Oldest Registers 2 ADDD R(F4),1 To Memory From Memory FP adders Reservation Stations FP multipliers li li 1 10+R2

45 FP Op Queue Tomasulo With ROB State ROB7 ROB6 Newest Reorder Buffer ROB5 ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Oldest Registers 2 ADDD R(F4),1 3 MULD 2,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers li li 1 10+R2

46 FP Op Queue Tomasulo With ROB State ROB7 F0 ADDD F0,F4,F6 I ROB6 F4 LD F4,0(R3) E ROB5 Reorder Buffer -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Newest Oldest Registers To Memory From 2 ADDD R(F4),1 3 MULD 2,R(F6) Memory 6 ADDD 5,R(F6) Reservation 1 10+R2 Stations FP adders FP multipliers li li 5 0+R3

47 FP Op Queue Tomasulo With ROB State [R3] ROB5 ST F4, 0(R3) I ROB7 F0 ADDD F0,F4,F6 I ROB6 F4 LD F4,0(R3) E ROB5 Reorder Buffer -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Newest Oldest Registers To Memory From 2 ADDD R(F4),1 3 MULD 2,R(F6) Memory 6 ADDD 5,R(F6) Reservation 1 10+R2 Stations FP adders FP multipliers li li 5 0+R3

48 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 ADDD F0,F4,F6 I ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Newest Oldest Registers 2 ADDD R(F4),1 6 ADDD V1,R(F6) FP adders Reservation Stations 3 MULD 2,R(F6) FP multipliers li li To Memory From Memory 1 10+R2

49 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 ADDD F0,F4,F6 E ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Newest Oldest Registers 2 ADDD R(F4),1 3 MULD 2,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers li li 1 10+R2

50 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Newest Oldest Registers 2 ADDD R(F4),1 3 MULD 2,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers li li 1 10+R2

51 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 V3 LD F0, 10(R2) W ROB1 Newest Oldest Registers 2 ADDD R(F4),V3 3 MULD 2,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers li li

52 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L E ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 E ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Registers F0=V3 3 MULD 2,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers li li

53 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L W ROB4 F2 MULD F2,F10,F6 I ROB3 F10 V4 ADDD F10,F4,F0 W ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Registers F0=V3 3 MULD V4,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers li li

54 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L W ROB4 F2 MULD F2,F10,F6 E ROB3 F10 V4 ADDD F10,F4,F0 C ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Registers F0=V3 F10=V4 To Memory From Memory FP adders Reservation Stations FP multipliers li li

55 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L W ROB4 F2 V5 MULD F2,F10,F6 W ROB3 F10 V4 ADDD F10,F4,F0 C ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Registers F0=V3 F10=V4 To Memory From Memory FP adders Reservation Stations FP multipliers li li

56 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L W ROB4 F2 V5 MULD F2,F10,F6 C ROB3 F10 V4 ADDD F10,F4,F0 C ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Registers F0=V3 F10=V4 F2=V5 To Memory From Memory FP adders Reservation Stations FP multipliers li li

57 FP Op Queue Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 Reorder Buffer -- BNE F0, 0, L C ROB4 F2 V5 MULD F2,F10,F6 C ROB3 F10 V4 ADDD F10,F4,F0 C ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Registers F0=V3 F10=V4 F2=V5 To Memory From Memory FP adders Reservation Stations FP multipliers li li

58 Avoiding Memory Hazards A store only updates memory when it reaches the head of the ROB Otherwise WAW and WAR hazards are possible By waiting to reach the head memory is updated in order and no earlier loads orstores can still be pending If a load accesses a memory location written to by an earlier store then it cannot perform the memory access until the store haswritten the data Prevents RAW hazard through memory

59 Reorder Buffer Implementation In practice Try to recover as early as possible after a branch is mispredicted rather than wait until branch reaches the head Performance in speculative processors more sensitive to branch prediction Highercost of misprediction Exceptions (will look into that next week) Don t recognize the exception until it is ready to commit Could try to handle exceptions as they arise and earlier branchesresolved, resolved, butmore challenging

60 Acknowledgements Some of the slides contain material developed and copyrighted by Sally A. McKee and K. Mock (Cornell University) and instructor material for the textbook 60

Advanced Pipelining and Instruction-Level Paralelism (2)

Advanced Pipelining and Instruction-Level Paralelism (2) Advanced Pipelining and Instruction-Level Paralelism (2) Riferimenti bibliografici Computer architecture, a quantitative approach, Hennessy & Patterson: (Morgan Kaufmann eds.) Tomasulo s Algorithm For

More information

CS152 Computer Architecture and Engineering Lecture 17 Advanced Pipelining: Tomasulo Algorithm

CS152 Computer Architecture and Engineering Lecture 17 Advanced Pipelining: Tomasulo Algorithm CS152 Computer Architecture and Engineering Lecture 17 Advanced Pipelining: Tomasulo Algorithm 2003-10-23 Dave Patterson (www.cs.berkeley.edu/~patterson) www-inst.eecs.berkeley.edu/~cs152/ CS 152 L17 Adv.

More information

Tomasulo Algorithm. Developed at IBM and first implemented in IBM s 360/91

Tomasulo Algorithm. Developed at IBM and first implemented in IBM s 360/91 Tomasulo Algorithm Developed at IBM and first implemented in IBM s 360/91 IBM wanted to use the existing compiler instead of a specialized compiler for high end machines. Tracks when operands are available

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 12: Dynamic Scheduling: Tomasulo s Algorithm Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CS252, UC Berkeley

More information

Lecture 16: Instruction Level Parallelism -- Dynamic Scheduling (OOO) via Tomasulo s Approach

Lecture 16: Instruction Level Parallelism -- Dynamic Scheduling (OOO) via Tomasulo s Approach Lecture 16: Instruction Level Parallelism -- Dynamic Scheduling (OOO) via Tomasulo s Approach CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

More information

Dynamic Scheduling. Differences between Tomasulo. Tomasulo Algorithm. CDC 6600 scoreboard. Or ydanicm ceshuldngi

Dynamic Scheduling. Differences between Tomasulo. Tomasulo Algorithm. CDC 6600 scoreboard. Or ydanicm ceshuldngi Dynamic Scheduling (or out-of-order execution) Dynamic Scheduling Or ydanicm ceshuldngi CDC 6600 scoreboard Instruction storage added to each functional execution unit Instructions issue to FU when no

More information

Differences between Tomasulo. Another Dynamic Algorithm: Tomasulo Organization. Reservation Station Components

Differences between Tomasulo. Another Dynamic Algorithm: Tomasulo Organization. Reservation Station Components Another Dynamic Algorithm: Tomasulo Algorithm Differences between Tomasulo Algorithm & Scoreboard For IBM 360/9 about 3 years after CDC 6600 Goal: High Performance without special compilers Differences

More information

Instruction Level Parallelism Part III

Instruction Level Parallelism Part III Course on: Advanced Computer Architectures Instruction Level Parallelism Part III Prof. Cristina Silvano Politecnico di Milano email: cristina.silvano@polimi.it 1 Outline of Part III Tomasulo Dynamic Scheduling

More information

Instruction Level Parallelism Part III

Instruction Level Parallelism Part III Course on: Advanced Computer Architectures Instruction Level Parallelism Part III Prof. Cristina Silvano Politecnico di Milano email: cristina.silvano@polimi.it 1 Outline of Part III Dynamic Scheduling

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise

More information

Scoreboard Limitations!

Scoreboard Limitations! Scoreboard Limitations! No forwarding read from register! Structural hazards stall at issue! WAW hazard stall at issue!! WAR hazard stall at write! Inf3 Computer Architecture - 2015-2016 1 Dynamic Scheduling

More information

EEC 581 Computer Architecture. Instruction Level Parallelism (3.4 & 3.5 Dynamic Scheduling)

EEC 581 Computer Architecture. Instruction Level Parallelism (3.4 & 3.5 Dynamic Scheduling) 1 EEC 581 Computer Architecture Instruction Level Parallelism (3.4 & 3.5 Dynamic Scheduling) Chansu Yu Electrical and Computer Engineering Cleveland State University Overview of Chap. 3 (again) Pipelined

More information

DYNAMIC INSTRUCTION SCHEDULING WITH TOMASULO

DYNAMIC INSTRUCTION SCHEDULING WITH TOMASULO DYNAMIC INSTRUCTION SCHEDULING WITH TOMASULO Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson,

More information

Scoreboard Limitations

Scoreboard Limitations Scoreboard Limitations! No forwarding read from register! Structural hazards stall at issue! WAW hazard stall at issue! WAR hazard stall at write Inf3 Computer Architecture - 2016-2017 1 Dynamic Scheduling

More information

Outline. 1 Reiteration. 2 Dynamic scheduling - Tomasulo. 3 Superscalar, VLIW. 4 Speculation. 5 ILP limitations. 6 What we have done so far.

Outline. 1 Reiteration. 2 Dynamic scheduling - Tomasulo. 3 Superscalar, VLIW. 4 Speculation. 5 ILP limitations. 6 What we have done so far. Outline 1 Reiteration Lecture 5: EIT090 Computer Architecture 2 Dynamic scheduling - Tomasulo Anders Ardö 3 Superscalar, VLIW EIT Electrical and Information Technology, Lund University Sept. 30, 2009 4

More information

Slide Set 9. for ENCM 501 in Winter Steve Norman, PhD, PEng

Slide Set 9. for ENCM 501 in Winter Steve Norman, PhD, PEng Slide Set 9 for ENCM 501 in Winter 2018 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary March 2018 ENCM 501 Winter 2018 Slide Set 9 slide

More information

Slide Set 8. for ENCM 501 in Winter Term, Steve Norman, PhD, PEng

Slide Set 8. for ENCM 501 in Winter Term, Steve Norman, PhD, PEng Slide Set 8 for ENCM 501 in Winter Term, 2017 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary Winter Term, 2017 ENCM 501 W17 Lectures: Slide

More information

Out-of-Order Execution

Out-of-Order Execution 1 Out-of-Order Execution Several implementations out-of-order completion CDC 6600 with scoreboarding IBM 360/91 with Tomasulo s algorithm & reservation stations out-of-order completion leads to: imprecise

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism Pipelining, Hazards Appendix C, HPe Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Pipelining

More information

CS 152 Midterm 2 May 2, 2002 Bob Brodersen

CS 152 Midterm 2 May 2, 2002 Bob Brodersen CS 152 Midterm 2 May 2, 2002 Bob Brodersen Name Solutions Show your work if you want partial credit! Try all the problems, don t get stuck on one of them. Each one is worth 10 points. 1) 2) 3) 4) 5) 6)

More information

Enhancing Performance in Multiple Execution Unit Architecture using Tomasulo Algorithm

Enhancing Performance in Multiple Execution Unit Architecture using Tomasulo Algorithm Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,

More information

Very Short Answer: (1) (1) Peak performance does or does not track observed performance.

Very Short Answer: (1) (1) Peak performance does or does not track observed performance. Very Short Answer: (1) (1) Peak performance does or does not track observed performance. (2) (1) Which is more effective, dynamic or static branch prediction? (3) (1) Do benchmarks remain valid indefinitely?

More information

PIPELINING: BRANCH AND MULTICYCLE INSTRUCTIONS

PIPELINING: BRANCH AND MULTICYCLE INSTRUCTIONS PIPELINING: BRANCH AND MULTICYCLE INSTRUCTIONS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Homework 1 submission

More information

06 1 MIPS Implementation Pipelined DLX and MIPS Implementations: Hardware, notation, hazards.

06 1 MIPS Implementation Pipelined DLX and MIPS Implementations: Hardware, notation, hazards. 06 1 MIPS Implementation 06 1 Material from Chapter 3 of H&P (for DLX). Material from Chapter 6 of P&H (for MIPS). line: (In this set.) Unpipelined DLX Implementation. (Diagram only.) Pipelined DLX and

More information

On the Rules of Low-Power Design

On the Rules of Low-Power Design On the Rules of Low-Power Design (and How to Break Them) Prof. Todd Austin Advanced Computer Architecture Lab University of Michigan austin@umich.edu Once upon a time 1 Rules of Low-Power Design P = acv

More information

Pipeline design. Mehran Rezaei

Pipeline design. Mehran Rezaei Pipeline design Mehran Rezaei Shift Left 2 pc Opcode ExtOp Cont Unit RegDst Addr Addr2 Addr npcsle Reg ALUSrc Mem 2 OVF Branch ALUCtr MemtoReg Mem Funct Extension ALUOp ALU Cont Shift Left 2 ID EXE MEM

More information

Contents Slide Set 6. Introduction to Chapter 7 of the textbook. Outline of Slide Set 6. An outline of the first part of Chapter 7

Contents Slide Set 6. Introduction to Chapter 7 of the textbook. Outline of Slide Set 6. An outline of the first part of Chapter 7 CM 69 W4 Section Slide Set 6 slide 2/9 Contents Slide Set 6 for CM 69 Winter 24 Lecture Section Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary

More information

Slide Set 6. for ENCM 369 Winter 2018 Section 01. Steve Norman, PhD, PEng

Slide Set 6. for ENCM 369 Winter 2018 Section 01. Steve Norman, PhD, PEng Slide Set 6 for ENCM 369 Winter 2018 Section 01 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary February 2018 ENCM 369 Winter 2018 Section

More information

EECS150 - Digital Design Lecture 9 - CPU Microarchitecture. CMOS Devices

EECS150 - Digital Design Lecture 9 - CPU Microarchitecture. CMOS Devices EECS150 - Digital Design Lecture 9 - CPU Microarchitecture Feb 17, 2009 John Wawrzynek Spring 2009 EECS150 - Lec9-cpu Page 1 CMOS Devices Review: Transistor switch-level models The gate acts like a capacitor.

More information

Review C program: foo.c Compiler Assembly program: foo.s Assembler Object(mach lang module): foo.o. Lecture #14

Review C program: foo.c Compiler Assembly program: foo.s Assembler Object(mach lang module): foo.o. Lecture #14 CS61C L14 Introduction to Synchronous Digital Systems (1) inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture #14 Introduction to Synchronous Digital Systems 2007-7-18 Scott Beamer, Instructor

More information

CS61C : Machine Structures

CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture #14 Introduction to Synchronous Digital Systems 2007-7-18 Scott Beamer, Instructor CS61C L14 Introduction to Synchronous Digital Systems

More information

Pipelining. Improve performance by increasing instruction throughput Program execution order. Data access. Instruction. fetch. Data access.

Pipelining. Improve performance by increasing instruction throughput Program execution order. Data access. Instruction. fetch. Data access. Chapter 6 Pipelining Improve performance by increasing instrction throghpt Program eection order Time (in instrctions) lw $, ($) Instrction fetch 2 4 6 8 2 4 6 8 ALU Data access lw $2, 2($) 8 ns Instrction

More information

Go BEARS~ What are Machine Structures? Lecture #15 Intro to Synchronous Digital Systems, State Elements I C

Go BEARS~ What are Machine Structures? Lecture #15 Intro to Synchronous Digital Systems, State Elements I C CS6C L5 Intro to SDS, State Elements I () inst.eecs.berkeley.edu/~cs6c CS6C : Machine Structures Lecture #5 Intro to Synchronous Digital Systems, State Elements I 28-7-6 Go BEARS~ Albert Chae, Instructor

More information

6.3 Sequential Circuits (plus a few Combinational)

6.3 Sequential Circuits (plus a few Combinational) 6.3 Sequential Circuits (plus a few Combinational) Logic Gates: Fundamental Building Blocks Introduction to Computer Science Robert Sedgewick and Kevin Wayne Copyright 2005 http://www.cs.princeton.edu/introcs

More information

An Adaptive Technique for Reducing Leakage and Dynamic Power in Register Files and Reorder Buffers

An Adaptive Technique for Reducing Leakage and Dynamic Power in Register Files and Reorder Buffers An Adaptive Technique for Reducing Leakage and Dynamic Power in Register Files and Reorder Buffers Shadi T. Khasawneh and Kanad Ghose Department of Computer Science State University of New York, Binghamton,

More information

CS/ECE 250: Computer Architecture. Basics of Logic Design: ALU, Storage, Tristate. Benjamin Lee

CS/ECE 250: Computer Architecture. Basics of Logic Design: ALU, Storage, Tristate. Benjamin Lee CS/ECE 25: Computer Architecture Basics of Logic esign: ALU, Storage, Tristate Benjamin Lee Slides based on those from Alvin Lebeck, aniel, Andrew Hilton, Amir Roth, Gershon Kedem Homework #3 ue Mar 7,

More information

Lecture 0: Organization

Lecture 0: Organization 581365 Tietokoneen rakenne Computer Organization II Spring 2010 Tiina Niklander Matemaattis-luonnontieteellinen tiedekunta Computer Organization II Advanced (master) level course! Prerequisite: Computer

More information

Low Power VLSI Circuits and Systems Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Low Power VLSI Circuits and Systems Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Low Power VLSI Circuits and Systems Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No. # 29 Minimizing Switched Capacitance-III. (Refer

More information

Registers. Unit 12 Registers and Counters. Registers (D Flip-Flop based) Register Transfers (example not out of text) Accumulator Registers

Registers. Unit 12 Registers and Counters. Registers (D Flip-Flop based) Register Transfers (example not out of text) Accumulator Registers Unit 2 Registers and Counters Fundamentals of Logic esign EE2369 Prof. Eric Maconald Fall Semester 23 Registers Groups of flip-flops Can contain data format can be unsigned, 2 s complement and other more

More information

VLSI Design: 3) Explain the various MOSFET Capacitances & their significance. 4) Draw a CMOS Inverter. Explain its transfer characteristics

VLSI Design: 3) Explain the various MOSFET Capacitances & their significance. 4) Draw a CMOS Inverter. Explain its transfer characteristics 1) Explain why & how a MOSFET works VLSI Design: 2) Draw Vds-Ids curve for a MOSFET. Now, show how this curve changes (a) with increasing Vgs (b) with increasing transistor width (c) considering Channel

More information

OUT-OF-ORDER processors with precise exceptions

OUT-OF-ORDER processors with precise exceptions TRANSACTIONS ON COMPUTER, VOL. X, NO. Y, FEBRUARY 2009 1 Store Buffer Design for Multibanked Data Caches Enrique Torres, Member, IEEE, Pablo Ibáñez, Member, IEEE, Víctor Viñals-Yúfera, Member, IEEE, and

More information

Tomasulo Algorithm Based Out of Order Execution Processor

Tomasulo Algorithm Based Out of Order Execution Processor Tomasulo Algorithm Based Out of Order Execution Processor Bhavana P.Shrivastava MAaulana Azad National Institute of Technology, Department of Electronics and Communication ABSTRACT In this research work,

More information

RAZOR: CIRCUIT-LEVEL CORRECTION OF TIMING ERRORS FOR LOW-POWER OPERATION

RAZOR: CIRCUIT-LEVEL CORRECTION OF TIMING ERRORS FOR LOW-POWER OPERATION RAZOR: CIRCUIT-LEVEL CORRECTION OF TIMING ERRORS FOR LOW-POWER OPERATION Shohaib Aboobacker TU München 22 nd March 2011 Based on Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation Dan

More information

EECS150 - Digital Design Lecture 10 - Interfacing. Recap and Topics

EECS150 - Digital Design Lecture 10 - Interfacing. Recap and Topics EECS150 - Digital Design Lecture 10 - Interfacing Oct. 1, 2013 Prof. Ronald Fearing Electrical Engineering and Computer Sciences University of California, Berkeley (slides courtesy of Prof. John Wawrzynek)

More information

A few questions to test your familiarity of Lab7 at the end of finishing all assigned parts of Lab 7

A few questions to test your familiarity of Lab7 at the end of finishing all assigned parts of Lab 7 EE457 Lab7 Questions page A few questions to test your familiarity of Lab7 at the end of finishing all assigned parts of Lab 7 1. A. In which parts or subparts of Lab 7 does the STALL signal cause the

More information

A VLIW Processor for Multimedia Applications

A VLIW Processor for Multimedia Applications A VLIW Processor for Multimedia Applications E. Holmann T. Yoshida A. Yamada Y. Shimazu Mitsubishi Electric Corporation, System LSI Laboratory 4-1 Mizuhara, Itami, Hyogo 664, Japan Outline Objective System

More information

EITF35: Introduction to Structured VLSI Design

EITF35: Introduction to Structured VLSI Design EITF35: Introduction to Structured VLSI Design Part 4.2.1: Learn More Liang Liu liang.liu@eit.lth.se 1 Outline Crossing clock domain Reset, synchronous or asynchronous? 2 Why two DFFs? 3 Crossing clock

More information

Sequential Elements con t Synchronous Digital Systems

Sequential Elements con t Synchronous Digital Systems ecture 15 Computer Science 61C Spring 2017 February 22th, 2017 Sequential Elements con t Synchronous Digital Systems 1 Administrivia I Good news: Waitlist students: You are in! Concurrent Enrollment students:

More information

HW#3 - CSE 237A. 1. A scheduler has three queues; A, B and C. Outgoing link speed is 3 bits/sec

HW#3 - CSE 237A. 1. A scheduler has three queues; A, B and C. Outgoing link speed is 3 bits/sec HW#3 - CSE 237A 1. A scheduler has three queues; A, B and C. Outgoing link speed is 3 bits/sec a. (Assume queue A wants to transmit at 1 bit/sec, and queue B at 2 bits/sec and queue C at 3 bits/sec. What

More information

Modeling Digital Systems with Verilog

Modeling Digital Systems with Verilog Modeling Digital Systems with Verilog Prof. Chien-Nan Liu TEL: 03-4227151 ext:34534 Email: jimmy@ee.ncu.edu.tw 6-1 Composition of Digital Systems Most digital systems can be partitioned into two types

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 1-Bus Architecture and Datapath 10262011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline 1-Bus Microarchitecture and

More information

CS 110 Computer Architecture. Finite State Machines, Functional Units. Instructor: Sören Schwertfeger.

CS 110 Computer Architecture. Finite State Machines, Functional Units. Instructor: Sören Schwertfeger. CS 110 Computer Architecture Finite State Machines, Functional Units Instructor: Sören Schwertfeger http://shtech.org/courses/ca/ School of Information Science and Technology SIST ShanghaiTech University

More information

CS3350B Computer Architecture Winter 2015

CS3350B Computer Architecture Winter 2015 CS3350B Computer Architecture Winter 2015 Lecture 5.2: State Circuits: Circuits that Remember Marc Moreno Maza www.csd.uwo.ca/courses/cs3350b [Adapted from lectures on Computer Organization and Design,

More information

BUSES IN COMPUTER ARCHITECTURE

BUSES IN COMPUTER ARCHITECTURE BUSES IN COMPUTER ARCHITECTURE The processor, main memory, and I/O devices can be interconnected by means of a common bus whose primary function is to provide a communication path for the transfer of data.

More information

Sequential Logic. Introduction to Computer Yung-Yu Chuang

Sequential Logic. Introduction to Computer Yung-Yu Chuang Sequential Logic Introduction to Computer Yung-Yu Chuang with slides by Sedgewick & Wayne (introcs.cs.princeton.edu), Nisan & Schocken (www.nand2tetris.org) and Harris & Harris (DDCA) Review of Combinational

More information

ECE 250 / CPS 250 Computer Architecture. Basics of Logic Design ALU and Storage Elements

ECE 250 / CPS 250 Computer Architecture. Basics of Logic Design ALU and Storage Elements ECE 25 / CPS 25 Computer Architecture Basics of Logic esign ALU and Storage Elements Benjamin Lee Slides based on those from Andrew Hilton (uke), Alvy Lebeck (uke) Benjamin Lee (uke), and Amir Roth (Penn)

More information

Fill-in the following to understand stalling needs and forwarding opportunities

Fill-in the following to understand stalling needs and forwarding opportunities Fill-in the following to understand stalling needs and forwarding opportunities Instruction ADD4 ADD Receiving forwarding help Providing forwarding help Insists on Doesn t mind Doesn t mind Capable of

More information

CS61C : Machine Structures

CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture #21 State Elements: Circuits that Remember 2008-3-14 Scott Beamer, Guest Lecturer www.piday.org 3.14159265358979323 8462643383279502884

More information

CSE140L: Components and Design Techniques for Digital Systems Lab. CPU design and PLDs. Tajana Simunic Rosing. Source: Vahid, Katz

CSE140L: Components and Design Techniques for Digital Systems Lab. CPU design and PLDs. Tajana Simunic Rosing. Source: Vahid, Katz CSE140L: Components and Design Techniques for Digital Systems Lab CPU design and PLDs Tajana Simunic Rosing Source: Vahid, Katz 1 Lab #3 due Lab #4 CPU design Today: CPU design - lab overview PLDs Updates

More information

UNIVERSITY OF TORONTO JOÃO MARCUS RAMOS BACALHAU GUSTAVO MAIA FERREIRA HEYANG WANG ECE532 FINAL DESIGN REPORT HOLE IN THE WALL

UNIVERSITY OF TORONTO JOÃO MARCUS RAMOS BACALHAU GUSTAVO MAIA FERREIRA HEYANG WANG ECE532 FINAL DESIGN REPORT HOLE IN THE WALL UNIVERSITY OF TORONTO JOÃO MARCUS RAMOS BACALHAU GUSTAVO MAIA FERREIRA HEYANG WANG ECE532 FINAL DESIGN REPORT HOLE IN THE WALL Toronto 2015 Summary 1 Overview... 5 1.1 Motivation... 5 1.2 Goals... 5 1.3

More information

Logic Devices for Interfacing, The 8085 MPU Lecture 4

Logic Devices for Interfacing, The 8085 MPU Lecture 4 Logic Devices for Interfacing, The 8085 MPU Lecture 4 1 Logic Devices for Interfacing Tri-State devices Buffer Bidirectional Buffer Decoder Encoder D Flip Flop :Latch and Clocked 2 Tri-state Logic Outputs

More information

ESE (ESE534): Computer Organization. Last Time. Today. Last Time. Align Data / Balance Paths. Retiming in the Large

ESE (ESE534): Computer Organization. Last Time. Today. Last Time. Align Data / Balance Paths. Retiming in the Large ESE680-002 (ESE534): Computer Organization Day 20: March 28, 2007 Retiming 2: Structures and Balance Last Time Saw how to formulate and automate retiming: start with network calculate minimum achievable

More information

ESE534: Computer Organization. Today. Image Processing. Retiming Demand. Preclass 2. Preclass 2. Retiming Demand. Day 21: April 14, 2014 Retiming

ESE534: Computer Organization. Today. Image Processing. Retiming Demand. Preclass 2. Preclass 2. Retiming Demand. Day 21: April 14, 2014 Retiming ESE534: Computer Organization Today Retiming Demand Folded Computation Day 21: April 14, 2014 Retiming Logical Pipelining Physical Pipelining Retiming Supply Technology Structures Hierarchy 1 2 Image Processing

More information

Chapter 4 (Part I) The Processor. Baback Izadi Division of Engineering Programs

Chapter 4 (Part I) The Processor. Baback Izadi Division of Engineering Programs EGC442 Introdction to Compter Architectre Chapter 4 (Part I) The Processor Baback Izadi Division of Engineering Programs bai@engr.newpaltz.ed Introdction CPU performance factors Instrction cont Determined

More information

CS 250 VLSI System Design

CS 250 VLSI System Design CS 250 VLSI System Design Lecture 3 Timing 2013-9-5 Professor Jonathan Bachrach today s lecture by John Lazzaro TA: Ben Keller www-insteecsberkeleyedu/~cs250/ 1 everything doesn t happen at once Timing,

More information

An automatic synchronous to asynchronous circuit convertor

An automatic synchronous to asynchronous circuit convertor An automatic synchronous to asynchronous circuit convertor Charles Brej Abstract The implementation methods of asynchronous circuits take time to learn, they take longer to design and verifying is very

More information

CS 61C: Great Ideas in Computer Architecture

CS 61C: Great Ideas in Computer Architecture CS 6C: Great Ideas in Computer Architecture Combinational and Sequential Logic, Boolean Algebra Instructor: Alan Christopher 7/23/24 Summer 24 -- Lecture #8 Review of Last Lecture OpenMP as simple parallel

More information

Chapter 05: Basic Processing Units Control Unit Design Organization. Lesson 11: Multiple Bus Organisation

Chapter 05: Basic Processing Units Control Unit Design Organization. Lesson 11: Multiple Bus Organisation Chapter 05: Basic Processing Units Control Unit Design Organization Lesson 11: Multiple Bus Organisation Objective Understand multiple bus organisation Learn how the number of independent steps can be

More information

Sequencing and Control

Sequencing and Control Sequencing and Control Lan-Da Van ( 范倫達 ), Ph. D. Department of Computer Science National Chiao Tung University Taiwan, R.O.C. Spring, 2016 ldvan@cs.nctu.edu.tw http://www.cs.nctu.edu.tw/~ldvan/ Source:

More information

Objectives. Combinational logics Sequential logics Finite state machine Arithmetic circuits Datapath

Objectives. Combinational logics Sequential logics Finite state machine Arithmetic circuits Datapath Objectives Combinational logics Sequential logics Finite state machine Arithmetic circuits Datapath In the previous chapters we have studied how to develop a specification from a given application, and

More information

CS61C : Machine Structures

CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 24 State Circuits : Circuits that Remember Senior Lecturer SOE Dan Garcia www.cs.berkeley.edu/~ddgarcia Bio NAND gate Researchers at Imperial

More information

PRACE Autumn School GPU Programming

PRACE Autumn School GPU Programming PRACE Autumn School 2010 GPU Programming October 25-29, 2010 PRACE Autumn School, Oct 2010 1 Outline GPU Programming Track Tuesday 26th GPGPU: General-purpose GPU Programming CUDA Architecture, Threading

More information

DC Ultra. Concurrent Timing, Area, Power and Test Optimization. Overview

DC Ultra. Concurrent Timing, Area, Power and Test Optimization. Overview DATASHEET DC Ultra Concurrent Timing, Area, Power and Test Optimization DC Ultra RTL synthesis solution enables users to meet today s design challenges with concurrent optimization of timing, area, power

More information

A New Family of High-Performance Parallel Decimal Multipliers*

A New Family of High-Performance Parallel Decimal Multipliers* A New Family of High-Performance Parallel Decimal Multipliers* Alvaro Vázquez, Elisardo Antelo Dept. of Electronic and Computer Science University of Santiago de Compostela Spain alvaro@dec.usc.es elisardo@dec.usc.es

More information

High Performance Microprocessor Design and Automation: Overview, Challenges and Opportunities IBM Corporation

High Performance Microprocessor Design and Automation: Overview, Challenges and Opportunities IBM Corporation High Performance Microprocessor Design and Automation: Overview, Challenges and Opportunities Introduction About Myself What to expect out of this lecture Understand the current trend in the IC Design

More information

Timing EECS141 EE141. EE141-Fall 2011 Digital Integrated Circuits. Pipelining. Administrative Stuff. Last Lecture. Latch-Based Clocking.

Timing EECS141 EE141. EE141-Fall 2011 Digital Integrated Circuits. Pipelining. Administrative Stuff. Last Lecture. Latch-Based Clocking. EE141-Fall 2011 Digital Integrated Circuits Lecture 2 Clock, I/O Timing 1 4 Administrative Stuff Pipelining Project Phase 4 due on Monday, Nov. 21, 10am Homework 9 Due Thursday, December 1 Visit to Intel

More information

AN ABSTRACT OF THE THESIS OF

AN ABSTRACT OF THE THESIS OF AN ABSTRACT OF THE THESIS OF Licheng Zhang for the degree of Master of Science in Electrical and Computer Engineering presented on June 7, 1989. Title: The Design of A Reduced Instruction Set Computer

More information

Technical Note PowerPC Embedded Processors Video Security with PowerPC

Technical Note PowerPC Embedded Processors Video Security with PowerPC Introduction For many reasons, digital platforms are becoming increasingly popular for video security applications. In comparison to traditional analog support, a digital solution can more effectively

More information

Microprocessor Design

Microprocessor Design Microprocessor Design Principles and Practices With VHDL Enoch O. Hwang Brooks / Cole 2004 To my wife and children Windy, Jonathan and Michelle Contents 1. Designing a Microprocessor... 2 1.1 Overview

More information

Administrative issues. Sequential logic

Administrative issues. Sequential logic Administrative issues Midterm #1 will be given Tuesday, October 29, at 9:30am. The entire class period (75 minutes) will be used. Open book, open notes. DDPP sections: 2.1 2.6, 2.10 2.13, 3.1 3.4, 3.7,

More information

Lecture 2: Digi Logic & Bus

Lecture 2: Digi Logic & Bus Lecture 2 http://www.du.edu/~etuttle/electron/elect36.htm Flip-Flop (kiikku) Sequential Circuits, Bus Online Ch 20.1-3 [Sta10] Ch 3 [Sta10] Circuits with memory What moves on Bus? Flip-Flop S-R Latch PCI-bus

More information

ASIC = Application specific integrated circuit

ASIC = Application specific integrated circuit ASIC = Application specific integrated circuit CS 2630 Computer Organization Meeting 19: Building a MIPS processor Brandon Myers University of Iowa The goal: implement most of MIPS So far Implementing

More information

A Low-cost, Radiation-Hardened Method for Pipeline Protection in Microprocessors

A Low-cost, Radiation-Hardened Method for Pipeline Protection in Microprocessors 1 A Low-cost, Radiation-Hardened Method for Pipeline Protection in Microprocessors Yang Lin, Mark Zwolinski, Senior Member, IEEE, and Basel Halak Abstract The aggressive scaling of semiconductor technology

More information

L11/12: Reconfigurable Logic Architectures

L11/12: Reconfigurable Logic Architectures L11/12: Reconfigurable Logic Architectures Acknowledgements: Materials in this lecture are courtesy of the following people and used with permission. - Randy H. Katz (University of California, Berkeley,

More information

Outline. EECS150 - Digital Design Lecture 27 - Asynchronous Sequential Circuits. Cross-coupled NOR gates. Asynchronous State Transition Diagram

Outline. EECS150 - Digital Design Lecture 27 - Asynchronous Sequential Circuits. Cross-coupled NOR gates. Asynchronous State Transition Diagram EECS150 - Digital Design Lecture 27 - Asynchronous Sequential Circuits Nov 26, 2002 John Wawrzynek Outline SR Latches and other storage elements Synchronizers Figures from Digital Design, John F. Wakerly

More information

Logic Design II (17.342) Spring Lecture Outline

Logic Design II (17.342) Spring Lecture Outline Logic Design II (17.342) Spring 2012 Lecture Outline Class # 03 February 09, 2012 Dohn Bowden 1 Today s Lecture Registers and Counters Chapter 12 2 Course Admin 3 Administrative Admin for tonight Syllabus

More information

Digital Integrated Circuits EECS 312. Review. Remember the ENIAC? IC ENIAC. Trend for one company. First microprocessor

Digital Integrated Circuits EECS 312. Review. Remember the ENIAC? IC ENIAC. Trend for one company. First microprocessor 14 12 10 8 6 IBM ES9000 Bipolar Fujitsu VP2000 IBM 3090S Pulsar 4 IBM 3090 IBM RY6 CDC Cyber 205 IBM 4381 IBM RY4 2 IBM 3081 Apache Fujitsu M380 IBM 370 Merced IBM 360 IBM 3033 Vacuum Pentium II(DSIP)

More information

Sharif University of Technology. SoC: Introduction

Sharif University of Technology. SoC: Introduction SoC Design Lecture 1: Introduction Shaahin Hessabi Department of Computer Engineering System-on-Chip System: a set of related parts that act as a whole to achieve a given goal. A system is a set of interacting

More information

Laboratory Exercise 4

Laboratory Exercise 4 Laboratory Exercise 4 Polling and Interrupts The purpose of this exercise is to learn how to send and receive data to/from I/O devices. There are two methods used to indicate whether or not data can be

More information

Digital Integrated Circuits EECS 312

Digital Integrated Circuits EECS 312 14 12 10 8 6 Fujitsu VP2000 IBM 3090S Pulsar 4 IBM 3090 IBM RY6 CDC Cyber 205 IBM 4381 IBM RY4 2 IBM 3081 Apache Fujitsu M380 IBM 370 Merced IBM 360 IBM 3033 Vacuum Pentium II(DSIP) 0 1950 1960 1970 1980

More information

UC Berkeley CS61C : Machine Structures

UC Berkeley CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c UC Berkeley CS61C : Machine Structures Lecture 21 State Elements : Circuits that Remember 2007-03-07 Mocha sipping TA Valerie Ishida inst.eecs.berkeley.edu/~cs61c-td 161 Exabytes

More information

Register Files and Memories

Register Files and Memories Register Files and Memories ECE 554 Digital Engineering Laboratory C. R. Kime 2/18/2002 Register Files and Memories Register Files Issues and Objectives Register File Concepts Implementation of Register

More information

Day 21: Retiming Requirements. ESE534: Computer Organization. Relative Sizes. Today. State. State Size

Day 21: Retiming Requirements. ESE534: Computer Organization. Relative Sizes. Today. State. State Size ESE534: Computer Organization Day 22: November 16, 2016 Retiming 1 Day 21: Retiming Requirements Retiming requirement depends on parallelism and performance Even with a given amount of parallelism Will

More information

Profiling techniques for parallel applications

Profiling techniques for parallel applications Profiling techniques for parallel applications Analyzing program performance with HPCToolkit 03/10/2016 PRACE Autumn School 2016 2 Introduction Focus of this session Profiling of parallel applications

More information

(12) United States Patent (10) Patent No.: US 6,249,855 B1

(12) United States Patent (10) Patent No.: US 6,249,855 B1 USOO6249855B1 (12) United States Patent (10) Patent No.: Farrell et al. (45) Date of Patent: *Jun. 19, 2001 (54) ARBITER SYSTEM FOR CENTRAL OTHER PUBLICATIONS PROCESSING UNIT HAVING DUAL DOMINOED ENCODERS

More information

Chapter 3: Sequential Logic

Chapter 3: Sequential Logic Elements of Computg Systems, Nisan & Schocken, MIT Press, 2005 www.idc.ac.il/tecs Chapter 3: Sequential Logic Usage and Copyright Notice: Copyright 2005 Noam Nisan and Shimon Schocken This presentation

More information

First Name Last Name November 10, 2009 CS-343 Exam 2

First Name Last Name November 10, 2009 CS-343 Exam 2 CS-343 Exam 2 Instructions: For multiple choice questions, circle the letter of the one best choice unless the question explicitly states that it might have multiple correct answers. There is no penalty

More information

Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation

Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation Dan Ernst, Nam Sung Kim, Shidhartha Das, Sanjay Pant, Rajeev Rao, Toan Pham, Conrad Ziesler, David Blaauw, Todd Austin, Krisztian Flautner

More information

EE241 - Spring 2005 Advanced Digital Integrated Circuits

EE241 - Spring 2005 Advanced Digital Integrated Circuits EE241 - Spring 2005 Advanced Digital Integrated Circuits Lecture 21: Asynchronous Design Synchronization Clock Distribution Self-Timed Pipelined Datapath Req Ack HS Req Ack HS Req Ack HS Req Ack Start

More information

Why FPGAs? FPGA Overview. Why FPGAs?

Why FPGAs? FPGA Overview. Why FPGAs? Transistor-level Logic Circuits Positive Level-sensitive EECS150 - Digital Design Lecture 3 - Field Programmable Gate Arrays (FPGAs) January 28, 2003 John Wawrzynek Transistor Level clk clk clk Positive

More information