Slide Set 8. for ENCM 501 in Winter Term, Steve Norman, PhD, PEng


1 Slide Set 8 for ENCM 501 in Winter Term, 2017
Steve Norman, PhD, PEng
Electrical & Computer Engineering, Schulich School of Engineering, University of Calgary
Winter Term, 2017

2 ENCM 501 W17 Lectures: Slide Set 8 slide 2/74 Contents
  Pipelines with long-latency instructions
  What does program order mean?
  Out-of-order execution, WAW and WAR hazards
  Data hazards related to memory locations
  Managing hazards in OoO processor designs
  Overview of Tomasulo's algorithm
  Tomasulo's algorithm and name dependencies
  Tomasulo's algorithm and memory hazards
  Load and store buffers
  A loop example
  Costs of the CDB (common data bus)
  Tomasulo's algorithm, branches, and exceptions
  Out-of-order execution, in-order completion


4 ENCM 501 W17 Lectures: Slide Set 8 slide 4/74 Pipelines with long-latency instructions Trying to execute long-latency instructions in order using a single pipeline can have very bad effects when data hazards arise. The costs in lost cycles are much worse than they might appear to be from study of a simple pipeline that can always do the EX step in one clock cycle. Let's look at some examples using the pipeline of textbook Figure C.35 as a reference. That pipeline is fairly realistic about variable latency in EX. The Figure C.35 pipeline model is unreasonably simplistic about instruction fetch and data memory access, but problems waiting for EX results will make the point that needs to be made.

5 ENCM 501 W17 Lectures: Slide Set 8 slide 5/74 [Figure: pipeline with shared IF and ID stages, four parallel execution paths (integer unit EX; FP/integer multiply M1 to M7; FP adder A1 to A4; FP/integer divider DIV), and shared MEM and WB stages.] Image is Figure C.35 from Hennessy J. L. and Patterson D. A., Computer Architecture: A Quantitative Approach, 5th ed., © 2012, Elsevier, Inc.

6 ENCM 501 W17 Lectures: Slide Set 8 slide 6/74 Let's look at the effects of RAW data hazards concerning use of FPRs (floating-point registers). Such a hazard involving two instructions can be detected in the ID stage of the later instruction: a source register number of the later instruction will match the destination register number of the earlier instruction. The later instruction will not be allowed to enter EX until the result of the earlier instruction can be forwarded. If this requires one or more stall cycles, the stalled instruction blocks other instructions from even entering the ID stage.

7 ENCM 501 W17 Lectures: Slide Set 8 slide 7/74 Example:
    MUL.D   F0, F20, F22
    S.D     F0, (R8)
    L.D     F2, (R9)
    DADDIU  R9, R9, 8
    ADD.D   F24, F24, F2
S.D can't leave ID until MUL.D leaves M7...
    MUL.D   IF-ID-M1-M2-M3-M4-M5-M6-M7-Me-W
    S.D        IF-ID-ID-ID-ID-ID-ID-ID-EX-Me-W
    L.D           IF-IF-IF-IF-IF-IF-IF-ID-EX-Me-W
    DADDIU                             IF-ID-EX-Me-W
L.D, DADDIU and ADD.D don't depend on the MUL.D result, but are all seriously delayed because in an in-order system, they all have to wait for S.D to leave ID.
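The stall pattern in the example above can be reproduced with a small timing model. This is a minimal sketch, not the textbook's model: the names schedule and EX_LATENCY are hypothetical, and the rules are assumptions chosen to match the diagram (all sources are needed at the start of EX, FP and integer results can be forwarded after the last EX cycle, load data after MEM, and instructions pass through IF, ID and MEM strictly in order).

```python
# Hypothetical timing model of an in-order pipeline in the style of Figure C.35.
# Assumed EX latencies in cycles; any op not listed takes one EX cycle.
EX_LATENCY = {"MUL.D": 7, "ADD.D": 4, "SUB.D": 4}

def schedule(program):
    """program: list of (op, dest, sources) in program order.
    Returns one dict per instruction with the first cycle spent in each stage."""
    ready = {}   # register -> first cycle in which a consumer may start EX
    rows = []
    for op, dest, sources in program:
        prev = rows[-1] if rows else None
        # IF becomes free when the previous instruction moves on to ID.
        fetch = prev["ID"] if prev else 1
        # ID becomes free when the previous instruction moves on to EX.
        decode = max(fetch + 1, prev["EX"]) if prev else fetch + 1
        # Stay in ID until every source operand can be forwarded.
        ex = max([decode + 1] + [ready.get(r, 0) for r in sources])
        ex_end = ex + EX_LATENCY.get(op, 1) - 1
        # MEM is entered in program order, one instruction per cycle.
        mem = max(ex_end + 1, prev["MEM"] + 1) if prev else ex_end + 1
        rows.append({"op": op, "IF": fetch, "ID": decode, "EX": ex,
                     "MEM": mem, "WB": mem + 1})
        if dest is not None:
            # Load data is available after MEM; other results after the last EX cycle.
            ready[dest] = (mem if op == "L.D" else ex_end) + 1
    return rows

program = [
    ("MUL.D",  "F0",  ["F20", "F22"]),
    ("S.D",    None,  ["F0", "R8"]),
    ("L.D",    "F2",  ["R9"]),
    ("DADDIU", "R9",  ["R9"]),
    ("ADD.D",  "F24", ["F24", "F2"]),
]
for row in schedule(program):
    print(row["op"], "starts EX in cycle", row["EX"], "and reaches WB in cycle", row["WB"])
```

Under these assumptions the first four instructions reach WB in cycles 11, 12, 13 and 14, which agrees with the diagram: three instructions that are independent of MUL.D still finish seven cycles later than they would without the stall.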

8 ENCM 501 W17 Lectures: Slide Set 8 slide 8/74 A worse example:
    MUL.D   F0, F20, F22
    ADD.D   F24, F0, F24
    L.D     F2, (R9)
    SUB.D   F26, F26, F2
ADD.D can't leave ID until M7 of MUL.D is done, and then L.D can't enter Mem until Mem of ADD.D is done...
    MUL.D   IF-ID-M1-M2-M3-M4-M5-M6-M7-Me-W
    ADD.D      IF-ID-ID-ID-ID-ID-ID-ID-A1-A2-A3-A4-Me-W
    L.D           IF-IF-IF-IF-IF-IF-IF-ID-EX-EX-EX-EX-Me-W
    SUB.D                              IF-ID-ID-ID-ID-ID-A1-...

9 ENCM 501 W17 Lectures: Slide Set 8 slide 9/74 An even worse example:
    DIV.D   F0, F20, F22
    ADD.D   F24, F0, F24
    L.D     F2, (R9)

10 ENCM 501 W17 Lectures: Slide Set 8 slide 10/74 Mitigation of long stalls due to RAW hazards In-order pipelined execution used to be common even in reasonably high-end processors, and is still common in embedded processors. What criticism could be made of a compiler that emitted the following sequence of instructions?
    MUL.D   F0, F20, F22
    S.D     F0, (R8)
    L.D     F2, (R9)
    DADDIU  R9, R9, 8
    ADD.D   F24, F24, F2

11 ENCM 501 W17 Lectures: Slide Set 8 slide 11/74 [Figure: the Figure C.35 pipeline again, with IF and ID feeding the integer unit (EX), the FP/integer multiplier (M1 to M7), the FP adder (A1 to A4) and the FP/integer divider (DIV), followed by MEM and WB.] The above model is simplistic about memory accesses. With a more realistic model for memory access, let's come up with another source of stalls due to RAW hazards with long-latency instructions, not involving complicated arithmetic.


13 ENCM 501 W17 Lectures: Slide Set 8 slide 13/74 What does program order mean? Most (maybe all?) current mainstream ISAs guarantee that results (register and memory writes, branch decisions, etc.) produced by a single stream of instructions are what you would predict if each instruction completed before the next instruction was fetched. Program order refers to the order in which instructions would be processed in a hypothetical computer with no ILP. ILP schemes aim to get the effects of execution in program order, without the massive performance penalty of actually waiting to finish one instruction before starting the next.

14 ENCM 501 W17 Lectures: Slide Set 8 slide 14/74 If a scalar in-order pipeline has proper hazard detection, it is pretty much guaranteed to generate correct program order results...
  Forwarding, preceded if necessary by stalling, ensures that instructions don't work with stale versions of source operands.
  A later instruction can't pass an earlier instruction in the pipeline: key stages such as decode and data memory access can be occupied by only one instruction. That means that the earlier instruction will always write its result before the later instruction does.
Warning: The above material isn't perfectly correct! But it does provide a decent explanation about the main benefits of in-order instruction processing.


16 ENCM 501 W17 Lectures: Slide Set 8 slide 16/74 Out-of-order execution, WAW and WAR hazards The next slide shows a hypothetical variation of textbook Figure C.35. As in Figure C.35, instructions are sent one per clock cycle to one of the four paths that lead to instruction completion. Transfer of an instruction from the fetch-and-decode unit to one of the four functional units is called instruction issue. Let's assume that in each cycle the fetch-and-decode unit can inspect a window of four or so instructions that need to be issued soon, and picks one, not necessarily in program order, but instead with the goal of optimizing instruction throughput.

17 ENCM 501 W17 Lectures: Slide Set 8 slide 17/74 [Figure: an instruction fetch and decode unit feeding four paths (integer unit EX; FP/integer multiplier M1 to M7; FP adder A1 to A4; FP/integer divider), all leading to MEM and WB.]

18 ENCM 501 W17 Lectures: Slide Set 8 slide 18/74 Further, as the paths through the execution units have different latencies, our hypothetical system will allow not only out-of-order issue, but also out-of-order completion: instructions may arrive at the MEM stage not in program order. (And, two or more instructions could arrive at MEM in the same cycle, which would require some sort of arbitration and queueing system.) OoO (out-of-order) issue and OoO completion may create new kinds of hazards, ones that would be impossible with in-order issue and in-order completion.

19 ENCM 501 W17 Lectures: Slide Set 8 slide 19/74 WAW (Write-After-Write) hazard example
    MUL.D   F0, F20, F20
    ADD.D   F2, F2, F0
    L.D     F0, (R8)
    ADD.D   F4, F4, F0
    (several more instructions, but no more writes to F0)
    SUB.D   F6, F6, F0
The first ADD.D has to wait for the MUL.D result, so it makes sense to issue L.D and the second ADD.D before the first ADD.D. What is the potential bad consequence for the SUB.D instruction? Suppose the instruction sequence was produced by a compiler. How could the compiler have avoided the WAW hazard?

20 ENCM 501 W17 Lectures: Slide Set 8 slide 20/74 WAR (Write-After-Read) hazard example
    MUL.D   F2, F22, F22
    S.D     F2, 40(R29)
    L.D     F2, 32(R29)
    ADD.D   F20, F20, F2
In program order, L.D writes to F2 after S.D reads from F2. But if L.D is issued out of order, S.D could store the L.D result to memory instead of storing the MUL.D result.

21 ENCM 501 W17 Lectures: Slide Set 8 slide 21/74 Data dependencies and name dependencies A RAW hazard is a data dependency, sometimes also called a true dependency. A later instruction has a source that was a destination of an earlier instruction. The earlier instruction may not have written its result to a register or to memory when the later instruction tries to read that result.

22 ENCM 501 W17 Lectures: Slide Set 8 slide 22/74 In contrast, WAW and WAR hazards are sometimes called name dependencies. Generally, the situation is like this...
  Some instruction B receives information from an earlier instruction A, in some register or memory location.
  Some instruction D receives unrelated information from an earlier instruction C, in the same register or memory location.
In program order, there is no problem: the first use of the storage location is over before the second use starts. But in an out-of-order environment, communication between one pair of instructions may interfere with communication between another pair.

23 ENCM 501 W17 Lectures: Slide Set 8 slide 23/74 The term name dependency comes from the idea that two or more pairs of instructions are making conflicting use of a name (register number or memory address) used for inter-instruction communication. WAW name dependencies are also called output dependencies: The correctness of a write depends on the write overwriting any writes to the same location that should occur earlier in program order. WAR name dependencies are also called anti-dependencies: The correctness of a read depends on the read result not being contaminated by a write that should occur later in program order.
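The three kinds of dependency can be told apart by comparing register names only. The following is a minimal sketch under that assumption; the function name classify and the (dest, sources) tuple layout are hypothetical, and memory locations are ignored. It reports potential hazards between an older and a younger instruction, using the registers from the WAR example on slide 20.

```python
def classify(older, younger):
    """older, younger: (dest, sources) pairs, with older first in program order.
    Returns the set of potential register hazards between the two instructions."""
    o_dest, o_srcs = older
    y_dest, y_srcs = younger
    found = set()
    if o_dest is not None:
        if o_dest in y_srcs:
            found.add("RAW")   # true (data) dependency
        if o_dest == y_dest:
            found.add("WAW")   # output dependency
    if y_dest is not None and y_dest in o_srcs:
        found.add("WAR")       # anti-dependency
    return found

mul_d = ("F2", ["F22"])          # MUL.D F2, F22, F22
s_d   = (None, ["F2", "R29"])    # S.D   F2, 40(R29)
l_d   = ("F2", ["R29"])          # L.D   F2, 32(R29)

print(classify(mul_d, s_d))      # RAW: S.D reads the MUL.D result
print(classify(s_d, l_d))        # WAR: L.D overwrites F2 after S.D reads it
print(classify(mul_d, l_d))      # WAW: both write F2
```

Only the RAW case reflects real data flow; the other two exist purely because the name F2 is reused.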

24 ENCM 501 W17 Lectures: Slide Set 8 slide 24/74 Name dependencies are a real problem The example WAW and WAR hazards given earlier in the Slide Set are easy to fix: the hazards go away if better choices are made about FPRs used for intermediate results. But what if a compiler has very few registers to allocate? This could happen with a register-poor ISA (like x86!), or with code that for some reason needs to use many registers. A more common and important problem arises from short loops with long-latency instructions...

25 ENCM 501 W17 Lectures: Slide Set 8 slide 25/74 Here's a loop to multiply each element in a vector by a factor in register F12...
    L1: L.D     F2, (R8)
        MUL.D   F4, F2, F12
        DADDIU  R8, R8, 8
        S.D     F4, (R9)
        DADDIU  R9, R9, 8
        BNE     R8, R10, L1
MUL.D instructions have long latencies but they can be pipelined. It would be good to start the second MUL.D before the first MUL.D finishes, the third MUL.D before the second MUL.D finishes, and so on. Where is there a name dependency in this sequence? (By the way, there are several RAW hazards in this example, too!)


27 ENCM 501 W17 Lectures: Slide Set 8 slide 27/74 Data hazards related to memory locations Example RAW, WAW and WAR hazards given earlier in this slide set have been related to sequences of writes and reads to registers. Similar hazards can arise related to stores to and loads from memory locations.

28 ENCM 501 W17 Lectures: Slide Set 8 slide 28/74 Is there any kind of data hazard that could arise from OoO issue or completion of the two load instructions here? If so, what kind of hazard is it?
    L.D     F0, (R8)
    (other instructions, but no loads or stores)
    L.D     F2, (R9)

29 ENCM 501 W17 Lectures: Slide Set 8 slide 29/74 What kind of data hazard is possible involving the store instruction and the load instruction, in an out-of-order system?
    MUL.D   F2, F0, F0
    S.D     F2, 40(R29)
    (other instructions, but no loads or stores)
    L.D     F4, (R9)

30 ENCM 501 W17 Lectures: Slide Set 8 slide 30/74 What kind of data hazard is possible involving the two store instructions, in an out-of-order system?
    MUL.D   F0, F2, F2
    S.D     F0, (R8)
    (other instructions, but no loads or stores)
    S.D     F4, (R9)

31 ENCM 501 W17 Lectures: Slide Set 8 slide 31/74 What kind of data hazard is possible involving the L.D instruction and the S.D instruction, in an out-of-order system?
    LD      R9, (R8)     # read address from memory
    L.D     F0, (R9)     # use just-read address
    (other instructions, but no loads or stores)
    S.D     F2, (R10)


33 ENCM 501 W17 Lectures: Slide Set 8 slide 33/74 Managing hazards in OoO processor designs Several different design approaches have been used. Because of time constraints in ENCM 501, we'll focus on one design approach, called Tomasulo's algorithm. Textbook Section 3.4 introduces the algorithm. Textbook Section 3.5 provides detailed examples of how the algorithm works. Textbook Section 3.6 shows how the algorithm can be extended to ensure correct processing in the face of problems such as exceptions and branch mispredictions.

34 ENCM 501 W17 Lectures: Slide Set 8 slide 34/74 The details in Sections focus mainly on data hazards related to writes to and reads from floating-point registers. There is some discussion, less detailed, of data hazards related to writes to and reads from memory locations. To keep things as simple as possible (as simple as possible is still quite complicated, as we'll see), details are left out regarding writes to and reads from general-purpose registers. In a practical design, hazards related to GPR reads and writes would be handled in a similar fashion to FPR read and write hazards.


36 ENCM 501 W17 Lectures: Slide Set 8 slide 36/74 Overview of Tomasulo's algorithm It's interesting that this approach to instruction scheduling was developed around 50 years ago (in 1966!) for very high-end expensive computers, then re-adopted around 20 years ago for consumer-level processors (Intel Pentium Pro, 1995). It's been in continual use for a huge number of out-of-order processor designs for the last two decades. The key ideas are:
  try to get execution of an instruction started as soon as the source operands are ready (often, this will require out-of-order issue);
  as soon as an instruction result has been produced, try to broadcast that result to all the instructions that want to consume it.

37 ENCM 501 W17 Lectures: Slide Set 8 slide 37/74 Goals and non-goals for H&P Sections 3.4 and 3.5
Goals:
  high instruction throughput
  dealing effectively with RAW hazards (which, remember, may occur even with in-order processing)
  dealing effectively with WAW and WAR hazards (which tend to occur with out-of-order processing)
Non-goals (which become goals in Section 3.6):
  high throughput in the face of code with lots of branches
  correct behaviour in the face of exceptions, e.g., hardware interrupts, TLB misses handled by software, various other exceptions

38 ENCM 501 W17 Lectures: Slide Set 8 slide 38/74 Key components
  instruction fetch and decode unit
  enhanced register files
  reservation stations
  functional units
  common data bus (CDB), or multiple CDBs for superscalar systems

39 ENCM 501 W17 Lectures: Slide Set 8 slide 39/74 Instruction fetch and decode unit This unit merges:
  an L1 I-cache
  some sort of facility for managing branches and jumps (kind of fuzzy in Sections , but definitely dynamic branch prediction in Section 3.6)
  instruction decode capability: when an instruction leaves, it's known exactly what kind of instruction it is, and which registers, offsets and immediate operands are involved
For textbook Sections , this unit is scalar: in any one clock cycle, the maximum output is one instruction. (Section 3.8 looks at output of two or more instructions per cycle.)

40 ENCM 501 W17 Lectures: Slide Set 8 slide 40/74 Instruction fetch and decode unit: key property Instructions are issued from the instruction unit in program order. This is critical for correct avoidance of data hazards!

41 ENCM 501 W17 Lectures: Slide Set 8 slide 41/74 Enhanced register files In an in-order processor, a register file is precisely what we've modeled already: If the number of registers is M and the width of a register is N, there must be M × N cells to contain the state of the register file; beyond that, there must also be whatever logic is needed to support parallel reads and/or parallel writes. In a Tomasulo-based processor, each one of the M registers requires:
  N bits for register data
  more bits for register status: is the data up-to-date, and if not, which reservation station will supply the data in the future?

42 ENCM 501 W17 Lectures: Slide Set 8 slide 42/74 Example: FPR file for MIPS with bit FPRs... Qi = 0 indicates that an FPR is up-to-date; Qi ≠ 0 indicates that an FPR is not up-to-date and is waiting for a result from a reservation station. This example works for a system with up to fifteen reservation stations... [Table: columns FPR, data, Qi; one row per FPR.]
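The limit of fifteen reservation stations follows from the width of the Qi field: a 4-bit field has sixteen codes, and code 0 is reserved for "up-to-date". A minimal sketch of that arithmetic follows; the function name qi_width is hypothetical, and the 64-bit data width is an assumption made only for illustration.

```python
def qi_width(num_stations):
    """Bits needed for a Qi field that holds 0 (up-to-date) or a tag in 1..num_stations."""
    return num_stations.bit_length()

DATA_BITS = 64   # assumption: each FPR holds 64 bits of data

print(qi_width(15))               # 4 bits are enough for tags 1..15
print(qi_width(16))               # a sixteenth station would need a fifth bit
print(DATA_BITS + qi_width(15))   # storage bits per register under the assumptions above
```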

43 ENCM 501 W17 Lectures: Slide Set 8 slide 43/74 What do all of the bits on Slide 42 mean? Let's write out the FPR file state in a somewhat more human-friendly format.

44 ENCM 501 W17 Lectures: Slide Set 8 slide 44/74 Reservation stations A reservation station:
  receives an instruction from the instruction unit;
  waits for source operand data to be ready before starting the execution of the instruction;
  broadcasts the result of the instruction on the CDB, when the result is ready.
Example reservation station: An RS capable of processing either ADD.D or SUB.D needs 6 fields: Busy, Op, Vj, Vk, Qj and Qk. (The textbook also shows an A field, but that field is not needed for an RS for ADD.D / SUB.D.) Let's make some notes about how the 6 fields are used.

45 ENCM 501 W17 Lectures: Slide Set 8 slide 45/74 Each reservation station has a unique, nonzero identification number. Examples in the textbook have three RSs for ADD.D / SUB.D instructions, and have two RSs for MUL.D / DIV.D instructions. So a possible numbering scheme would be 0001, 0010, 0011 for the first three RSs (called Add1, Add2, and Add3 in the book) and 0100, 0101 for the next two (called Mult1 and Mult2). (Warning: RS stands for reservation station, not for register status. In descriptions of various versions of Tomasulo's algorithm, the textbook uses RegisterStat[x] as an abbreviation for the status of register x.)

46 ENCM 501 W17 Lectures: Slide Set 8 slide 46/74 Simple example of instruction issue Suppose a program has been running for a while, and these will be the next two instructions to leave the instruction unit:
    ADD.D   F4, F0, F2
    SUB.D   F6, F6, F4
Suppose also that registers F0, F2, F4 and F6 are up-to-date, and that RSs Add1 and Add2 are not busy. Let's make notes about how the two instructions will be issued to the RSs.
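One way to picture the issue step for these two instructions is the sketch below. It is an illustration under stated assumptions, not the textbook's pseudocode: the helper names make_station and issue, the dictionary layout, and the register values 1.5, 2.5, 0.0 and 10.0 are all made up for the example. Tags 1, 2 and 3 stand for Add1, Add2 and Add3, following the numbering on slide 45.

```python
def make_station(tag, name):
    """An empty reservation station with the six fields named on slide 44."""
    return {"tag": tag, "name": name, "busy": False, "op": None,
            "Vj": None, "Vk": None, "Qj": 0, "Qk": 0}

def issue(op, dest, src_j, src_k, regs, stations):
    """Place one instruction in the first free station; return it, or None to stall."""
    rs = next((s for s in stations if not s["busy"]), None)
    if rs is None:
        return None
    rs["busy"], rs["op"] = True, op
    for src, v, q in ((src_j, "Vj", "Qj"), (src_k, "Vk", "Qk")):
        if regs[src]["Qi"] == 0:
            # Register is up-to-date: copy its value into the station.
            rs[v], rs[q] = regs[src]["value"], 0
        else:
            # Value is still being computed: remember which station will produce it.
            rs[v], rs[q] = None, regs[src]["Qi"]
    # From now on the destination register waits for this station's result.
    regs[dest]["Qi"] = rs["tag"]
    return rs

# Arbitrary illustrative register contents; all four registers start up-to-date.
regs = {name: {"value": value, "Qi": 0}
        for name, value in (("F0", 1.5), ("F2", 2.5), ("F4", 0.0), ("F6", 10.0))}
stations = [make_station(1, "Add1"), make_station(2, "Add2"), make_station(3, "Add3")]

add1 = issue("ADD.D", "F4", "F0", "F2", regs, stations)
add2 = issue("SUB.D", "F6", "F6", "F4", regs, stations)
print(add1)
print(add2)
print(regs)
```

In this sketch Add1 receives both operand values immediately, while Add2 receives the old F6 value in Vj and records tag 1 in Qk, because F4 is now waiting for Add1. Reading the sources before updating the destination's Qi matters for SUB.D, where F6 is both a source and the destination.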

47 ENCM 501 W17 Lectures: Slide Set 8 slide 47/74 Functional units Functional units are the circuits that perform the execution steps for an instruction. Example functional units are FP adders, FP multipliers, integer ALUs, shifters, and so on. To keep things relatively simple, we can imagine a one-to-one correspondence between reservation stations and functional units. For example, each RS for ADD.D / SUB.D could be thought of as guarding the entrance of an FP adder-subtractor circuit, and watching the exit of that circuit for a result. In reality, things will be more complicated: multiple RSs will be set up to feed their operands into a single pipelined functional unit.

48 ENCM 501 W17 Lectures: Slide Set 8 slide 48/74 Latencies of functional units Tomasulo's algorithm is designed to manage the effects of multiple-cycle latencies of functional units. Further, the algorithm is designed to deal with the fact that the latency of a functional unit might vary from one use of the unit to the next. What are some examples of functional units with variable latencies?

49 ENCM 501 W17 Lectures: Slide Set 8 slide 49/74 CDB: Common data bus Most reservation stations are capable of broadcasting results on the CDB. RSs for arithmetic instructions and RSs for loads certainly need to be able to do this, but RSs for store instructions don't. Each reservation station must snoop the CDB, that is, watch the CDB for results needed by that RS. The register file must also snoop the CDB to grab results that are needed by registers that are currently not up-to-date. (Remember, such a register has some nonzero Qi value to indicate which RS will produce the result the register is waiting for.)

50 ENCM 501 W17 Lectures: Slide Set 8 slide 50/74 Example instruction completion via CDB This sequence got started a few slides back...
    ADD.D   F4, F0, F2
    SUB.D   F6, F6, F4
Let's make some notes about how these instructions will get completed. For simplicity, let's assume that the given instructions are followed by a lengthy sequence of instructions that don't use F0, F2, F4 or F6.
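Completion of the two instructions can be sketched as two CDB broadcasts. As before, this is an illustration under assumptions: the helper names make_station, broadcast and ready, the dictionary layout, and the register values are hypothetical, and the starting state is what one would expect just after ADD.D went to Add1 (tag 1) and SUB.D went to Add2 (tag 2).

```python
def make_station(tag, op, vj, vk, qj, qk):
    return {"tag": tag, "busy": True, "op": op,
            "Vj": vj, "Vk": vk, "Qj": qj, "Qk": qk}

def broadcast(tag, value, regs, stations):
    """Put (tag, value) on the CDB; every snooping consumer that waits on tag captures it."""
    for rs in stations:
        if rs["tag"] == tag:
            rs["busy"] = False              # the producer's station is free again
        elif rs["busy"]:
            if rs["Qj"] == tag:
                rs["Vj"], rs["Qj"] = value, 0
            if rs["Qk"] == tag:
                rs["Vk"], rs["Qk"] = value, 0
    for reg in regs.values():
        if reg["Qi"] == tag:                # the register file snoops the CDB too
            reg["value"], reg["Qi"] = value, 0

def ready(rs):
    """A station may start execution once both operands have arrived."""
    return rs["busy"] and rs["Qj"] == 0 and rs["Qk"] == 0

# Assumed state just after both instructions have been issued (illustrative values).
regs = {"F0": {"value": 1.5, "Qi": 0}, "F2": {"value": 2.5, "Qi": 0},
        "F4": {"value": 0.0, "Qi": 1}, "F6": {"value": 10.0, "Qi": 2}}
add1 = make_station(1, "ADD.D", 1.5, 2.5, 0, 0)
add2 = make_station(2, "SUB.D", 10.0, None, 0, 1)
stations = [add1, add2]

print(ready(add1), ready(add2))                          # only ADD.D can start
broadcast(1, add1["Vj"] + add1["Vk"], regs, stations)    # ADD.D result goes out on the CDB
print(ready(add2), regs["F4"])                           # SUB.D now has both operands
broadcast(2, add2["Vj"] - add2["Vk"], regs, stations)    # SUB.D result goes out on the CDB
print(regs["F6"])
```

A single broadcast of the ADD.D result updates both F4 in the register file and the Vk field of Add2, which is the forwarding path that lets SUB.D start without reading F4 from the register file.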


52 ENCM 501 W17 Lectures: Slide Set 8 slide 52/74 Tomasulo's algorithm and name dependencies The algorithm eliminates WAW and WAR hazards in a really elegant way. To understand how this works, it's sufficient to consider a short example. Here the repeated use of F0 creates both a potential WAW hazard and a potential WAR hazard:
    DIV.D   F0, F20, F2
    ADD.D   F22, F22, F0
    MUL.D   F0, F24, F24
    SUB.D   F26, F26, F0
Let's make notes about how the hazards are eliminated.
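A compact way to see the effect is to track only tags, not values. The sketch below is a simplified illustration, assuming the station assignment shown in the program list (DIV.D to Mult1, ADD.D to Add1, MUL.D to Mult2, SUB.D to Add2), the tag numbering from slide 45, and that all registers are up-to-date beforehand. The names TAGS, qi, waits_on and registers_updated_by are hypothetical.

```python
TAGS = {"Add1": 1, "Add2": 2, "Add3": 3, "Mult1": 4, "Mult2": 5}

# (op, destination, sources, station assumed to receive the instruction)
program = [
    ("DIV.D", "F0",  ["F20", "F2"],  "Mult1"),
    ("ADD.D", "F22", ["F22", "F0"],  "Add1"),
    ("MUL.D", "F0",  ["F24", "F24"], "Mult2"),
    ("SUB.D", "F26", ["F26", "F0"],  "Add2"),
]

qi = {}         # register -> tag of the station that will write it; absent means up-to-date
waits_on = {}   # station -> producer tag recorded for each source (0 = value copied at issue)
for op, dest, sources, station in program:      # issue happens in program order
    waits_on[station] = [qi.get(r, 0) for r in sources]
    qi[dest] = TAGS[station]

def registers_updated_by(tag):
    """Registers that would capture a CDB broadcast carrying this tag."""
    return [r for r, t in qi.items() if t == tag]

print(waits_on["Add1"])          # ADD.D waits for tag 4 (DIV.D), not for the name F0
print(waits_on["Add2"])          # SUB.D waits for tag 5 (MUL.D)
print(registers_updated_by(4))   # no register captures the DIV.D result
print(registers_updated_by(5))   # F0 captures only the MUL.D result
```

In this sketch ADD.D is tied to the DIV.D station rather than to F0, so MUL.D finishing early cannot contaminate its operand (no WAR problem). F0 is tied to the MUL.D station, so a late DIV.D broadcast cannot overwrite the newer value (no WAW problem).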


54 ENCM 501 W17 Lectures: Slide Set 8 slide 54/74 Tomasulo's algorithm and memory hazards The examples given in textbook Sections 3.5 and 3.6 are excellent regarding hazards involving communication of instruction results through floating-point registers (FPRs). Communication of instruction results through general-purpose registers (GPRs, aka integer registers) is very similar to communication via FPRs, so it's reasonable not to discuss that topic at length.

55 ENCM 501 W17 Lectures: Slide Set 8 slide 55/74 However, the textbook is, uh, less than excellent about the topic of:
  allowing memory accesses (loads and stores) to complete out-of-order when doing so is harmless and helps with instruction throughput;
  forcing in-order completion of loads and stores when necessary to avoid RAW, WAW and WAR hazards.

56 ENCM 501 W17 Lectures: Slide Set 8 slide 56/74 Older and younger instructions It's handy to use the words "older" and "younger" as adjectives for instructions in an out-of-order system. Instruction A is older than Instruction B if A came before B in program order. In that case, you could also say that B is younger than A.

57 ENCM 501 W17 Lectures: Slide Set 8 slide 57/74 Rules for ordering execution of loads and stores To avoid RAW hazards: It is acceptable to access memory for a load instruction if it is known that there are no incomplete older store instructions that will use the same address as the load. What similar rules would be needed for avoidance of WAW and WAR hazards? ENCM 501 won't go into details of hardware solutions for memory data hazards, but here are a couple of key features:
  some sort of queue in which program order of loads and stores is remembered after loads and stores leave the instruction fetch/decode unit
  a capability to quickly compare the address a load or store will use with addresses to be used by older loads or stores
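The RAW rule for loads can be written as a check against a program-ordered queue. This is a minimal sketch, not a description of real hardware: the function name load_may_access_memory and the queue layout are hypothetical, and treating a store whose address is not yet known as a conflict is a conservative assumption added here.

```python
def load_may_access_memory(load_index, queue):
    """queue: loads and stores in program order, each a dict with keys
    'kind' ('load' or 'store'), 'address' (None if not yet computed) and 'done'."""
    load = queue[load_index]
    for older in queue[:load_index]:
        if older["kind"] == "store" and not older["done"]:
            # An incomplete older store blocks the load if its address matches,
            # or (conservatively) if its address is still unknown.
            if older["address"] is None or older["address"] == load["address"]:
                return False
    return True

def entry(kind, address, done=False):
    return {"kind": kind, "address": address, "done": done}

different = [entry("store", 0x1000), entry("load", 0x2000)]
same      = [entry("store", 0x1000), entry("load", 0x1000)]
unknown   = [entry("store", None),   entry("load", 0x1000)]
finished  = [entry("store", 0x1000, done=True), entry("load", 0x1000)]

print(load_may_access_memory(1, different))   # True: addresses differ
print(load_may_access_memory(1, same))        # False: must wait for the older store
print(load_may_access_memory(1, unknown))     # False under the conservative assumption
print(load_may_access_memory(1, finished))    # True: the older store is complete
```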


59 ENCM 501 W17 Lectures: Slide Set 8 slide 59/74 Load and store buffers Load buffer and store buffer are names given to reservation stations dedicated to handling loads or stores. Remark: The load buffers and store buffers provide an interface between the execution unit of the processor and the data caches. That's an interesting design problem we don't have time to study in this course.

60 ENCM 501 W17 Lectures: Slide Set 8 slide 60/74 Vj, Vk, Qj, Qk, A for store buffers
  Fields: Busy, Op, Vj, Vk, Qj, Qk, A
As with the FP math stations, Vj is ready if and only if Qj = 0, and the same applies for Vk and Qk. Vk is used for the FP data to be written in an S.D instruction. So what does it mean if Qk ≠ 0? Vj, Qj, and A have to do with memory address calculations. Let's not worry about the details for now.

61 ENCM 501 W17 Lectures: Slide Set 8 slide 61/74 Vj, Vk, Qj, Qk, A for load buffers
  Fields: Busy, Op, Vj, Vk, Qj, Qk, A
Again, Vj is ready if and only if Qj = 0, and the same applies for Vk and Qk. Vj, Qj, and A have to do with memory address calculations. As with store buffers, let's not worry about the details for now.


63 ENCM 501 W17 Lectures: Slide Set 8 slide 63/74 A loop example This is from page 179 of the textbook:
    Loop: L.D     F0, 0(R1)
          MUL.D   F4, F0, F2
          S.D     F4, 0(R1)
          DADDIU  R1, R1, -8
          BNE     R1, R2, Loop
Let's make some notes about the DADDIU and BNE instructions. Let's assume the loop starts with R1 = 0x and R2 = 0x. Let's trace how Tomasulo's algorithm might handle the first two passes through the loop.


65 ENCM 501 W17 Lectures: Slide Set 8 slide 65/74 Costs of the CDB (common data bus) In a typical clock cycle, some reservation station will broadcast a result on the CDB, and other reservation stations and the register file will look at the result to see if it's useful. Transmitting the result and receiving the result both have energy costs. A complex instruction unit, reservation stations, and related hardware require lots of transistors. If Moore's law had not applied for so many decades, we would not see Tomasulo's algorithm used as a basis for design of modestly priced processor chips. It's possible, in some cycles, that two or more reservation stations will simultaneously try to broadcast their results. Why is this not a fatal defect in Tomasulo's algorithm?
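One possible answer to the question above is arbitration: a station that loses the contest simply holds its result and tries again in a later cycle. The sketch below illustrates that idea only. The function name arbitrate, the request tuple layout, and the oldest-first policy are assumptions made for this example; a real design could use a different policy or provide more than one CDB.

```python
def arbitrate(requests):
    """requests: list of (issue_order, tag, value) from stations whose results are ready.
    Returns (winner, losers); the losers keep their results and retry next cycle."""
    if not requests:
        return None, []
    ordered = sorted(requests)       # assumed policy: the oldest instruction wins
    return ordered[0], ordered[1:]

# Two stations finish in the same cycle (illustrative tags and values).
pending = [(7, 2, 6.0), (5, 4, 3.0)]
cycle = 1
while pending:
    winner, pending = arbitrate(pending)
    print("cycle", cycle, "CDB carries tag", winner[1], "value", winner[2])
    cycle += 1
```

The cost of a conflict in this sketch is one extra cycle of latency for the losing result, not an incorrect result.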


67 ENCM 501 W17 Lectures: Slide Set 8 slide 67/74 Tomasulo's algorithm and branch prediction Consider this code fragment:
    BEQ     R8, R0, L99
    S.D     F0, (R10)
    ADD.D   F2, F2, F4
Suppose the branch is incorrectly predicted as not taken, and S.D and ADD.D get issued while BEQ waits for some earlier instruction to provide a value for R8. If Tomasulo's algorithm does nothing beyond what has been presented so far in lectures, what will prevent S.D from making an incorrect update to memory, and what will prevent ADD.D from making an incorrect update to F2?

68 ENCM 501 W17 Lectures: Slide Set 8 slide 68/74 Tomasulo's algorithm and exceptions

    MUL.D  F2, F4, F6
    S.D    F2, 0(R8)
    SUB.D  F0, F12, F14
    L.D    F2, 0(R9)
    ADD.D  F8, F8, F2

Suppose MUL.D gets delayed because it has to wait until a result for F6 is ready. That will delay the execution of S.D. Meanwhile, Tomasulo's algorithm may allow completion of SUB.D, L.D, and ADD.D. What kind of problem is created if S.D eventually results in a page fault exception?

70 ENCM 501 W17 Lectures: Slide Set 8 slide 70/74 Out-of-order execution, in-order completion The version of Tomasulo's algorithm presented in textbook Section 3.5 has scalar issue (that is, at most one instruction issued per clock cycle), out-of-order execution, and out-of-order completion. Section 3.6 modifies the algorithm to include a circuit called a reorder buffer (ROB), which will enforce in-order completion. Use of a reorder buffer solves the branch prediction and exception problems described on slides 67 and 68.

71 ENCM 501 W17 Lectures: Slide Set 8 slide 71/74 In a processor with a reorder buffer, issue of an instruction sends information related to the instruction both to a reservation station and to the reorder buffer. A reservation station for a store is responsible for address computation only; it is not allowed to write to memory. The reorder buffer is a FIFO queue: instructions enter in program order, and leave in program order. When an instruction gets to the head of the ROB, it can be committed as soon as its results are known. Examples: An ADD.D can be committed if a reservation station has provided the sum to the reorder buffer. An S.D can be committed if both the data to be stored and the address to be used are ready.
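The FIFO behaviour described above can be sketched as a small Python model. This is a simplified illustration only: the class and method names are hypothetical, and each entry has a single ready flag (for a store, that flag stands in for "both data and address are ready").

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: entries enter at the tail and leave only from the head."""

    def __init__(self):
        self.entries = deque()  # head of the queue is at the left
        self.next_tag = 0

    def issue(self, op):
        """Allocate an entry in program order; return its ROB entry number."""
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "op": op, "ready": False, "value": None})
        return tag

    def write_result(self, tag, value):
        """Record a result seen on the CDB for the matching entry."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["ready"] = True
                entry["value"] = value

    def commit(self):
        """Commit ready entries from the head; stop at the first unready one."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            committed.append(self.entries.popleft()["op"])
        return committed
```

In this model, a SUB.D that finishes early still cannot commit until the MUL.D and S.D ahead of it in the queue have committed.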

72 ENCM 501 W17 Lectures: Slide Set 8 slide 72/74 Register file changes: The Qi field for each register is replaced by a Busy flag and a Reorder # field. Busy = 0 means the register is up-to-date; Busy = 1 means the register is waiting for a result from whatever entry in the reorder buffer matches the Reorder #. The register file does not watch the CDB for results. The ROB must watch the CDB for results for all of the instructions within the ROB that don't yet have results.
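A minimal Python sketch of the Busy flag and Reorder # field follows. The class and method names are hypothetical, and the model is simplified: a fuller design would also check at issue time whether the ROB entry already holds the needed value.

```python
class RegisterStatus:
    """Toy register file with a Busy flag and a Reorder # per register."""

    def __init__(self, names):
        self.value = {r: 0.0 for r in names}
        self.busy = {r: False for r in names}
        self.reorder = {r: None for r in names}

    def rename_on_issue(self, dest, rob_tag):
        """An issued instruction will write dest; remember its ROB entry."""
        self.busy[dest] = True
        self.reorder[dest] = rob_tag

    def read_operand(self, src):
        """Return ('value', v) if up to date, else ('tag', rob_entry)."""
        if self.busy[src]:
            return ("tag", self.reorder[src])
        return ("value", self.value[src])

    def commit(self, dest, rob_tag, value):
        """Called when the ROB commits; the register file never snoops the CDB."""
        self.value[dest] = value
        if self.reorder[dest] == rob_tag:  # no newer writer is pending
            self.busy[dest] = False
            self.reorder[dest] = None
```

With two pending writers of F2, as with MUL.D and L.D on slide 68, the register stays busy until the later writer commits.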

73 ENCM 501 W17 Lectures: Slide Set 8 slide 73/74 The reservation stations and functional units work very much as before, except: the Qj and Qk fields hold ROB entry numbers instead of reservation station numbers; each reservation station has a Dest field to hold an ROB entry number; when a reservation station broadcasts its result on the CDB, it includes the Dest field value to help both the ROB and the other reservation stations.
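The modified reservation station fields can be sketched in Python as below. The class is a hypothetical illustration that models only ADD.D; the field names follow the slide (Qj, Qk, Dest), while the method names are assumptions.

```python
class ReservationStation:
    """Toy station: Qj/Qk hold ROB entry numbers, or None once a value is present."""

    def __init__(self, op, dest, vj=None, qj=None, vk=None, qk=None):
        self.op = op
        self.dest = dest  # ROB entry number that will receive the result
        self.vj, self.qj = vj, qj
        self.vk, self.qk = vk, qk

    def snoop(self, cdb_tag, cdb_value):
        """Capture a CDB broadcast whose ROB entry number matches Qj or Qk."""
        if self.qj == cdb_tag:
            self.vj, self.qj = cdb_value, None
        if self.qk == cdb_tag:
            self.vk, self.qk = cdb_value, None

    def ready(self):
        return self.qj is None and self.qk is None

    def broadcast(self):
        """Return the (Dest, result) pair that would be placed on the CDB."""
        assert self.ready() and self.op == "ADD.D"
        return self.dest, self.vj + self.vk
```

Because the broadcast carries the Dest value, the ROB can match it to an entry and waiting stations can match it against their Qj and Qk fields.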

74 ENCM 501 W17 Lectures: Slide Set 8 slide 74/74 The reorder buffer and safe speculation The key point about the ROB is that it can collect a large number of results without knowing whether those results should really be written to registers or memory. Consider a branch instruction that is mispredicted as taken. What happens to all the instructions that got into the ROB before the branch? What happens to the branch target instruction, the successor of the branch target instruction, etc., which got into the ROB after the branch? The bad effect of the above scenario is a waste of time and energy. What are the important bad effects that were prevented?
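One way to think about the misprediction scenario is sketched below in Python, assuming the ROB is represented as a list of entry numbers in program order. The function name is hypothetical, and the sketch ignores when the squash happens (for example, whether it waits until the branch reaches the head of the ROB).

```python
def squash_after(rob_tags, branch_tag):
    """Split ROB entries around a mispredicted branch.

    rob_tags: ROB entry numbers in program order, head first.
    Returns (kept, squashed): entries up to and including the branch
    are kept and can still commit; younger entries are discarded
    before they can update registers or memory.
    """
    idx = rob_tags.index(branch_tag)
    return rob_tags[: idx + 1], rob_tags[idx + 1 :]
```

For example, with entries `[10, 11, 12, 13]` and the branch in entry 11, entries 10 and 11 are kept, and entries 12 and 13 are thrown away without ever being committed.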


Sequential Elements con t Synchronous Digital Systems ecture 15 Computer Science 61C Spring 2017 February 22th, 2017 Sequential Elements con t Synchronous Digital Systems 1 Administrivia I Good news: Waitlist students: You are in! Concurrent Enrollment students:

More information

Application Note PG001: Using 36-Channel Logic Analyzer and 36-Channel Digital Pattern Generator for testing a 32-Bit ALU

Application Note PG001: Using 36-Channel Logic Analyzer and 36-Channel Digital Pattern Generator for testing a 32-Bit ALU Application Note PG001: Using 36-Channel Logic Analyzer and 36-Channel Digital Pattern Generator for testing a 32-Bit ALU Version: 1.0 Date: December 14, 2004 Designed and Developed By: System Level Solutions,

More information

Laboratory Exercise 4

Laboratory Exercise 4 Laboratory Exercise 4 Polling and Interrupts The purpose of this exercise is to learn how to send and receive data to/from I/O devices. There are two methods used to indicate whether or not data can be

More information

IMPLEMENTATION OF X-FACTOR CIRCUITRY IN DECOMPRESSOR ARCHITECTURE

IMPLEMENTATION OF X-FACTOR CIRCUITRY IN DECOMPRESSOR ARCHITECTURE IMPLEMENTATION OF X-FACTOR CIRCUITRY IN DECOMPRESSOR ARCHITECTURE SATHISHKUMAR.K #1, SARAVANAN.S #2, VIJAYSAI. R #3 School of Computing, M.Tech VLSI design, SASTRA University Thanjavur, Tamil Nadu, 613401,

More information

Implementation of Memory Based Multiplication Using Micro wind Software

Implementation of Memory Based Multiplication Using Micro wind Software Implementation of Memory Based Multiplication Using Micro wind Software U.Palani 1, M.Sujith 2,P.Pugazhendiran 3 1 IFET College of Engineering, Department of Information Technology, Villupuram 2,3 IFET

More information

An Efficient Reduction of Area in Multistandard Transform Core

An Efficient Reduction of Area in Multistandard Transform Core An Efficient Reduction of Area in Multistandard Transform Core A. Shanmuga Priya 1, Dr. T. K. Shanthi 2 1 PG scholar, Applied Electronics, Department of ECE, 2 Assosiate Professor, Department of ECE Thanthai

More information

Bubble Razor An Architecture-Independent Approach to Timing-Error Detection and Correction

Bubble Razor An Architecture-Independent Approach to Timing-Error Detection and Correction 1 Bubble Razor An Architecture-Independent Approach to Timing-Error Detection and Correction Matthew Fojtik, David Fick, Yejoong Kim, Nathaniel Pinckney, David Harris, David Blaauw, Dennis Sylvester mfojtik@umich.edu

More information

Sequencing and Control

Sequencing and Control Sequencing and Control Lan-Da Van ( 范倫達 ), Ph. D. Department of Computer Science National Chiao Tung University Taiwan, R.O.C. Spring, 2016 ldvan@cs.nctu.edu.tw http://www.cs.nctu.edu.tw/~ldvan/ Source:

More information

CS 61C: Great Ideas in Computer Architecture

CS 61C: Great Ideas in Computer Architecture CS 6C: Great Ideas in Computer Architecture Combinational and Sequential Logic, Boolean Algebra Instructor: Alan Christopher 7/23/24 Summer 24 -- Lecture #8 Review of Last Lecture OpenMP as simple parallel

More information

UNIT V 8051 Microcontroller based Systems Design

UNIT V 8051 Microcontroller based Systems Design UNIT V 8051 Microcontroller based Systems Design INTERFACING TO ALPHANUMERIC DISPLAYS Many microprocessor-controlled instruments and machines need to display letters of the alphabet and numbers. Light

More information

Sequential Logic. Introduction to Computer Yung-Yu Chuang

Sequential Logic. Introduction to Computer Yung-Yu Chuang Sequential Logic Introduction to Computer Yung-Yu Chuang with slides by Sedgewick & Wayne (introcs.cs.princeton.edu), Nisan & Schocken (www.nand2tetris.org) and Harris & Harris (DDCA) Review of Combinational

More information

Good afternoon! My name is Swetha Mettala Gilla you can call me Swetha.

Good afternoon! My name is Swetha Mettala Gilla you can call me Swetha. Good afternoon! My name is Swetha Mettala Gilla you can call me Swetha. I m a student at the Electrical and Computer Engineering Department and at the Asynchronous Research Center. This talk is about the

More information

AN INTRODUCTION TO DIGITAL COMPUTER LOGIC

AN INTRODUCTION TO DIGITAL COMPUTER LOGIC SUPPLEMENTRY HPTER 1 N INTRODUTION TO DIGITL OMPUTER LOGI J K J K FREE OMPUTER HIPS FREE HOOLTE HIPS I keep telling you Gwendolyth, you ll never attract today s kids that way. S1.0 INTRODUTION 1 2 Many

More information

TV Synchronism Generation with PIC Microcontroller

TV Synchronism Generation with PIC Microcontroller TV Synchronism Generation with PIC Microcontroller With the widespread conversion of the TV transmission and coding standards, from the early analog (NTSC, PAL, SECAM) systems to the modern digital formats

More information

More Digital Circuits

More Digital Circuits More Digital Circuits 1 Signals and Waveforms: Showing Time & Grouping 2 Signals and Waveforms: Circuit Delay 2 3 4 5 3 10 0 1 5 13 4 6 3 Sample Debugging Waveform 4 Type of Circuits Synchronous Digital

More information

FPGA Implementation of DA Algritm for Fir Filter

FPGA Implementation of DA Algritm for Fir Filter International Journal of Computational Engineering Research Vol, 03 Issue, 8 FPGA Implementation of DA Algritm for Fir Filter 1, Solmanraju Putta, 2, J Kishore, 3, P. Suresh 1, M.Tech student,assoc. Prof.,Professor

More information

DEDICATED TO EMBEDDED SOLUTIONS

DEDICATED TO EMBEDDED SOLUTIONS DEDICATED TO EMBEDDED SOLUTIONS DESIGN SAFE FPGA INTERNAL CLOCK DOMAIN CROSSINGS ESPEN TALLAKSEN DATA RESPONS SCOPE Clock domain crossings (CDC) is probably the worst source for serious FPGA-bugs that

More information