Tomasulo Algorithm Based Out of Order Execution Processor

Bhavana P. Shrivastava
Maulana Azad National Institute of Technology, Department of Electronics and Communication

ABSTRACT
In this research work, a Tomasulo algorithm based out-of-order execution processor is implemented. Tomasulo's algorithm is the basic technique used to implement out-of-order (OOO) execution in modern microprocessors. This thesis explains the idea behind OOO execution and how Tomasulo's algorithm implements it. The algorithm describes the working of the instruction dispatch and handling techniques in a processor. It allows sequential instructions that would normally be stalled due to certain dependencies to execute instead of stalling. In Tomasulo's algorithm, reservation stations are used to solve data hazards. The processor's performance is improved and the available memory bandwidth is used more effectively. The processor is built in the hardware description language Verilog. There are two phases to this thesis. Firstly, the various stages of research are outlined, focusing on dependencies and hazards. Secondly, a detailed design description is given, outlining the specifications, requirements, design procedure and simulation results. Design and verification of the processor have been done successfully using Verilog on the Xilinx 13.2 platform. The processor is verified in both simulation and synthesis with the help of test programs. The design targets the Xilinx Spartan 3E XC3S1500E FPGA.

Keywords: Tomasulo Algorithm, Register Renaming, Common Data Bus, Out of Order Execution.

1. INTRODUCTION

1.1 Tomasulo Algorithm
This work presents the design of an out-of-order processing unit based on Tomasulo's algorithm. The algorithm and related techniques such as register renaming are used in modern microprocessors to keep multiple or deeply pipelined execution units busy by executing instructions in data-flow order rather than sequential order.
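As a rough illustration of register renaming and data-flow-order execution, the following Python sketch models a toy machine in which each issued instruction receives a tag, captures its operands as either real values or tags, and broadcasts its result on a common data bus when it finishes. It is not the Verilog design described in this paper: the names (`Tag`, `run`, `PROGRAM`, `REGS`), the latencies, the one-instruction-per-cycle issue rate and the unlimited number of functional units are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed execution latencies in cycles (illustrative, not taken from the paper).
LATENCY = {"ADD": 1, "SUB": 1, "MUL": 4}
OPS = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
    "MUL": lambda a, b: a * b,
}


@dataclass(frozen=True)
class Tag:
    """Virtual value: names the reservation-station slot that will produce a result."""
    name: str


def run(program, regs):
    """Issue in program order, execute in data-flow order.

    Hypothetical model: one issue per cycle, unlimited functional units.
    Returns (final registers, completion order of slot names, cycles used).
    """
    regs = dict(regs)      # register -> real value or Tag
    slots = []             # reservation-station slots still in flight
    completed = []         # slot names in the order their results were broadcast
    pending = list(program)
    issued = 0
    cycle = 0
    while pending or slots:
        cycle += 1
        # Write-back: finished slots broadcast (tag, value) on the common data bus.
        for slot in [s for s in slots if s["done_at"] == cycle]:
            value = OPS[slot["op"]](*slot["src"])
            for other in slots:
                other["src"] = [value if x == slot["tag"] else x for x in other["src"]]
            for reg in regs:
                if regs[reg] == slot["tag"]:
                    regs[reg] = value
            slots.remove(slot)
            completed.append(slot["tag"].name)
        # Issue: capture operands (value or tag), then rename the destination.
        if pending:
            op, dest, a, b = pending.pop(0)
            tag = Tag(f"RS{issued}")
            issued += 1
            slots.append({"op": op, "tag": tag, "src": [regs[a], regs[b]], "done_at": None})
            regs[dest] = tag
        # Execute: a slot starts as soon as all of its operands are real values.
        for slot in slots:
            if slot["done_at"] is None and not any(isinstance(x, Tag) for x in slot["src"]):
                slot["done_at"] = cycle + LATENCY[slot["op"]]
    return regs, completed, cycle


# Hypothetical test program on four registers.
REGS = {"R0": 0, "R1": 2, "R2": 3, "R3": 0}
PROGRAM = [
    ("MUL", "R0", "R1", "R2"),   # long-latency producer of R0
    ("ADD", "R3", "R0", "R1"),   # must wait for R0 (true dependence)
    ("SUB", "R1", "R2", "R1"),   # independent; overwrites the R1 that the ADD still needs
]
```

Calling `run(PROGRAM, REGS)` on this example yields R0 = 6, R1 = 1 and R3 = 8, the same values that sequential execution would produce, although the SUB finishes before both the MUL and the ADD. The ADD still sees the old R1 because it captured that operand at issue time, which is how renaming removes the write-after-read hazard in this sketch.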
The complex variability of instruction flow in out-of-order processors presents a significant opportunity for undetected errors, compared to an in-order pipelined machine where the flow of instructions is fixed and orderly. A Tomasulo based processor solves the prominent problems of dependencies, hazards and stalls. To implement Tomasulo's algorithm, the following steps have to be followed.

1. Instructions are issued sequentially so that the effects of a sequence of instructions, such as exceptions raised by these instructions, occur in the same order as they would in a non-pipelined processor, regardless of the fact that they are being executed non-sequentially.

2. All general-purpose and reservation station registers hold either real or virtual values (called tags here). If a real value is unavailable to a destination register during the issue stage, a virtual value (tag) is initially used. The tag identifies the functional unit that is computing the real value. Virtual register values are converted to real values as soon as the designated functional unit completes its computation and puts the result on the bus.

3. Functional units use reservation stations with multiple slots. Each slot holds the information needed to execute a single instruction, including the operation and the operands. The functional unit begins processing when it is free and when all source operands needed for an instruction are real.

The design implemented in this paper can decode four types of instructions, namely ADD/SUB, MUL, FETCH and WRITE. There are therefore four functional blocks in the design, which perform addition/subtraction, multiplication, fetching from memory and writing to memory. Data is communicated through registers. Four registers are available for use in instructions. The registers are arranged and handled using the completion file. The completion file is used to handle special conditions such as overflow and page fault. The instruction dispatch unit reads the instructions from the instruction queue in order and decodes them. The instruction dispatch unit is designed to dispatch two instructions in parallel, on two instruction buses.

Fig. 1.1: A model of an implementation of Tomasulo's algorithm.

1.2 OUT OF ORDER EXECUTION
This algorithm is the basic technique used to implement out-of-order (OOO) execution in modern microprocessors. To achieve greater instruction throughput, superscalar microprocessors use several functional units that can execute instructions in parallel. However, if two instructions depend on each other, one of them has to wait until the other has finished. In this case, one functional unit is idle. But if a different instruction, potentially following the other two in the instruction sequence, does not depend on their results, then it can be executed in parallel on the free functional unit. Data hazards must be handled properly. In general, a data hazard arises when changing the instruction execution order influences the result of the computation. Tomasulo's algorithm was designed to avoid such problems.

2. LITERATURE SURVEY
The System/360 computer family is where Tomasulo's algorithm originated. An overview of how, why and when Tomasulo's algorithm was developed is given here [1,2]. The IBM System/360 is a family of computer systems developed in the 1960s, whose chief architect was the well-known Gene Amdahl [16]. Prior to the announcement of this family, computers were custom made and designed independently. This development indicated that a revolution was underway that would change the computer industry forever. Initially only 6 models were announced: 30, 40, 50, 60, 62 and 70, whereas in actual fact 14 models were produced: 20, 22, 25, 30, 40, 44, 50, 65, 67, 75, 85, 91, 95 and 195 [16]. Despite the models' individual differences, the System/360 family employed the same user instruction set.
The larger machines dealt with complex instructions through hardware whilst the smaller ones dealt with them in micro-code, where an instruction such as multiplication would be completed by repeated addition. As we know today, this was an extremely inefficient way to execute a multiplication instruction [10]. (It was also rumored that the smaller 360 machines performed addition by repeated increments, i.e. x + 5 would add a 1 bit five times [13].)

The System/360 employed a variety of operating systems [14], such as DOS/360, OS/360, CP-67 (later VM/370), MTS, CRJE, TSO and Amdahl's UTS. OS/360 proved to be the most popular. The 360 computer family had a very limited number of registers, initially consisting of only four double precision floating-point registers. Consequently, compiler scheduling was not particularly effective. On top of this, even the more optimal 360 designs took considerable time to access memory and to compute long floating-point equations. These constraining factors prompted programmers to develop a solution so as to attain maximum efficiency [10,11,12]. The ultimate solution to these problems came in the form of Tomasulo's algorithm.

3. IMPLEMENTATION OF THE LOGIC DESIGN
Xilinx ISE 13.2 is used to implement, in Verilog HDL, all the modules used in the architecture of the Tomasulo based out-of-order execution processor.

3.1 SIMULATION AND SYNTHESIS RESULTS
RTL schematic of the Tomasulo based processor is shown in Fig. 3.1. Device utilization summary is given below.

Number of Slices: 537 out of 4656 11%
Number of Slice Flip Flops: 407 out of 9312 4%
Number of 4 input LUTs: 838 out of 9312 8%
Number of IOs: 8
Number of bonded IOBs: 8 out of 232 3%
IOB Flip Flops: 8
Number of GCLKs: 1 out of 24 4%

Fig. 3.1: Tomasulo block

3.1.1 Fetch station
RTL schematic of the Fetch block is shown in Fig. 3.2. Device utilization summary is given below.

Number of Slices: 13 out of 4656 0%
Number of Slice Flip Flops: 8 out of 9312 0%
Number of 4 input LUTs: 24 out of 9312 0%
Number of IOs: 34
Number of bonded IOBs: 34

Fig. 3.2: Fetch block

3.1.2 Instruction Decode Unit
RTL schematic of the Instruction Decode Unit is shown in Fig. 3.3. Device utilization summary is given below.

Fig. 3.3: Instruction Decode Unit

Number of Slices: 67 out of 4656 1%
Number of Slice Flip Flops: 57 out of 9312 0%
Number of 4 input LUTs: 129 out of 9312 1%
Number of IOs: 186
Number of bonded IOBs: 186 out of 232 80%

3.1.3 Reservation station
RTL schematic of the Reservation station is shown in Fig. 3.4. Device utilization summary is also given below.

Fig. 3.4: Reservation station

Number of Slices: 0 out of 4656 0%
Number of IOs: 20
Number of bonded IOBs: 1 out of 232 0%

3.1.4 Register bank
RTL schematic of the Register bank is shown in Fig. 3.5. Device utilization summary is given below.

Fig. 3.5: Register bank

Number of Slices: 141 out of 4656 3%
Number of Slice Flip Flops: 128 out of 9312 1%
Number of 4 input LUTs: 168 out of 9312 1%
Number of IOs: 60
Number of bonded IOBs: 60 out of 232 25%
Number of GCLKs: 1 out of 24 4%

3.1.5 Write Block
RTL schematic of the Write Block is shown in Fig. 3.6. Device utilization summary is given below.

Fig. 3.6: Write Block

Number of Slices: 9 out of 4656 0%
Number of 4 input LUTs: 16 out of 9312 0%
Number of IOs: 60
Number of bonded IOBs: 60 out of 232 25%

The presented Tomasulo based processor avoids the instruction stalls that different types of data hazards can cause, which improves the performance of the processor (shown in Fig. 3.7 and Fig. 3.8). The Tomasulo based processor completes execution of test program 1 in 150 ns (shown in Fig. 3.8), whereas the processor with stalling cannot complete execution in the same time and takes longer (shown in Fig. 3.7). The presented Tomasulo based processor therefore improves performance.

Fig. 3.7: Simulation result of test program 1 with stalling

Fig. 3.8: Simulation result of test program 1 without stalling

4. CONCLUSION
This work presented a Tomasulo algorithm based out-of-order execution processor. The processor has been synthesized in Xilinx 13.2 and simulated in the simulation environment of Xilinx ISE. The device chosen for synthesis was the XC3S1500E. Coding is done in Verilog HDL. The processor improves performance and avoids the stalling of instructions due to different hazards. It uses register renaming to overcome the hazard problem. The Tomasulo algorithm based out-of-order execution processor is therefore more efficient than the stalling processor and is useful in modern processor design.

REFERENCES
[1] K. Aasaraai and A. Moshovos, Towards a viable out-of-order soft core: copy-free, checkpointed register renaming, Proceedings of the Field-Programmable Logic and Applications, pp. 79-85, 2009.
[2] S. Petit, J. Sahuquillo, P. López, R. Ubal, and J. Duato, A Complexity-Effective Out-of-Order Retirement Microarchitecture, IEEE Trans. Computers, vol. 58, no. 12, pp. 1626-1639, Dec. 2009.
[3] R. Plyaskin and A. Herkersdorf, Context-aware compiled simulation of out-of-order processor behavior based on atomic traces, in 2011 IEEE/IFIP 19th International Conference on VLSI and System-on-Chip (VLSI-SoC), pp. 386-391, IEEE, Oct. 2011.
[4] F.J. Mesa-Martinez et al., SCOORE: Santa Cruz out-of-order RISC engine, FPGA design issues, Proceedings of the Workshop on Architectural Research Prototyping, pp. 61-70, 2006.
[5] S. Berezin, A. Biere, E. Clarke, and Y. Zhu, Combining symbolic model checking with uninterpreted functions for out-of-order processor verification, in FMCAD '98, Lecture Notes in Computer Science, Vol. 1522, pp. 369-386, Springer-Verlag, Berlin, 1998.
[6] A. Biere, A. Cimatti, E.M. Clarke, and Y. Zhu, Symbolic model checking without BDDs, in TACAS '99, Lecture Notes in Computer Science, Vol. 1579, Springer-Verlag, Amsterdam, The Netherlands, 1999.
[7] R. M. Tomasulo, An efficient algorithm for exploiting multiple arithmetic units, IBM J. Research and Development 11:1, pp. 25-33, January 1967.
[8] W. Damm and A. Pnueli, Verifying out-of-order executions, in D. Probst (Ed.), CHARME '97, Chapman & Hall, 1997.
[9] D. Sima, B. Polytech, The design space of register renaming techniques, IEEE Micro, 20(5), pp. 70-83, 2000.
[10] S. Palacharla et al., Complexity-Effective Superscalar Processors, Proceedings of the International Symposium on Computer Architecture, pp. 206-218, 1997.
[11] F.J. Mesa-Martinez et al., SCOORE: Santa Cruz out-of-order RISC engine, FPGA design issues, Proceedings of the Workshop on Architectural Research Prototyping, pp. 61-70, 2006.
[12] K. Aasaraai and A. Moshovos, Towards a viable out-of-order soft core: copy-free, checkpointed register renaming, Proceedings of the Field-Programmable Logic and Applications, pp. 79-85, 2009.
[13] W. Damm and A. Pnueli, Verifying out-of-order executions, in D. Probst (Ed.), CHARME '97, Chapman & Hall, London, 1997.
[14] L. Gwennap, Intel's P6 uses decoupled superscalar design, Microprocessor Report, Vol. 9, No. 2, pp. 9-15, 1995.
[15] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, CA, 1996.
[16] Peter Dell, Die Auswirkung von Mechanismen zur out-of-order Ausführung auf den Cyclecount von RISC-Architekturen, Master's thesis, Universität des Saarlandes, FB Informatik, 1998.