Tomasulo Algorithm Based Out of Order Execution Processor

Bhavana P. Shrivastava
Maulana Azad National Institute of Technology, Department of Electronics and Communication

ABSTRACT
In this research work, an out-of-order execution processor based on Tomasulo's algorithm is implemented. Tomasulo's algorithm is the basic technique used to implement out-of-order (OOO) execution in modern microprocessors. This thesis explains the idea behind OOO execution and how Tomasulo's algorithm implements it. The algorithm describes the instruction dispatch and handling techniques in a processor. It allows instructions in a sequential stream that would normally be stalled by certain dependencies to execute. In Tomasulo's algorithm, reservation stations are used to resolve data hazards. Processor performance is improved and the available memory bandwidth is used more effectively. The processor is built in the hardware description language Verilog. There are two phases to this thesis. First, the various stages of research are outlined, focusing on dependencies and hazards. Second, a detailed design description is given, outlining the specifications, requirements, design procedure and simulation results. Design and verification of the processor have been completed successfully using Verilog on the Xilinx ISE 13.2 platform. The processor is verified in both simulation and synthesis with the help of test programs. The design is intended for implementation on the Xilinx Spartan 3E XC3S1500E FPGA.

Keywords: Tomasulo Algorithm, Register Renaming, Common Data Bus, Out of Order Execution.

1. INTRODUCTION

1.1 Tomasulo Algorithm
This section describes the design of an out-of-order processing unit based on Tomasulo's algorithm. Tomasulo's algorithm and related techniques such as register renaming are used in modern microprocessors to keep multiple or deeply pipelined execution units busy by executing instructions in data-flow order rather than sequential order.
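As a rough illustration of data-flow order, the sketch below models tag-based register renaming in Python (not the Verilog used for the actual design). It is a minimal behavioral sketch under stated assumptions, not the implemented processor: the names `run`, `Station`, `LATENCY` and `OPS`, the latency values, the unlimited number of reservation stations, one issue per cycle, and a bus that can carry several results in the same cycle are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Operand = Union[int, str]          # int = real value, str = tag (virtual value)
Instr = Tuple[str, str, str, str]  # (op, dest, src1, src2)

LATENCY = {"ADD": 2, "SUB": 2, "MUL": 4}   # assumed cycle counts, not from the paper
OPS = {"ADD": lambda a, b: a + b,
       "SUB": lambda a, b: a - b,
       "MUL": lambda a, b: a * b}


@dataclass
class Station:
    tag: str                # name of this reservation station slot
    op: str
    src: List[Operand]      # operands: real values or tags still being computed
    remaining: int          # execution cycles left once all operands are real


def run(program: List[Instr], regs: Dict[str, Operand]):
    """Issue in program order, execute in data-flow order.

    Returns the final register values and the tags in completion order.
    """
    regs = dict(regs)
    stations: List[Station] = []
    finished: List[str] = []
    pc = 0
    while pc < len(program) or stations:
        # 1. Broadcast finished results on the (assumed multi-result) bus.
        for s in [s for s in stations if s.remaining == 0]:
            value = OPS[s.op](*s.src)
            stations.remove(s)
            finished.append(s.tag)
            for other in stations:      # waiting stations capture the value
                other.src = [value if x == s.tag else x for x in other.src]
            for r, v in regs.items():   # registers still holding this tag
                if v == s.tag:
                    regs[r] = value
        # 2. Issue one instruction per cycle, in program order.
        if pc < len(program):
            op, dest, a, b = program[pc]
            tag = f"RS{pc}"
            stations.append(Station(tag, op, [regs[a], regs[b]], LATENCY[op]))
            regs[dest] = tag            # rename: dest now refers to the tag
            pc += 1
        # 3. Execute every station whose operands are all real values.
        for s in stations:
            if all(isinstance(x, int) for x in s.src):
                s.remaining -= 1
    return regs, finished


program = [
    ("MUL", "R0", "R1", "R2"),   # long-latency producer
    ("ADD", "R3", "R0", "R1"),   # must wait for the MUL result (tag RS0)
    ("SUB", "R1", "R2", "R2"),   # independent, so it need not wait
]
final_regs, order = run(program, {"R0": 2, "R1": 3, "R2": 4, "R3": 5})
print(final_regs, order)
```

With these inputs the SUB (tag RS2) completes before the ADD (tag RS1), even though it was issued later, and the ADD still reads the old value of R1 because that value was copied into its station at issue time.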
The complex variability of instruction flow in out-of-order processors presents a significant opportunity for undetected errors, compared to an in-order pipelined machine where the flow of instructions is fixed and orderly. A Tomasulo-based processor addresses the prominent problems of dependencies, hazards and stalls. The implementation of Tomasulo's algorithm follows these steps:

1. Instructions are issued sequentially, so that the effects of a sequence of instructions, such as exceptions raised by these instructions, occur in the same order as they would in a non-pipelined processor, even though the instructions are executed non-sequentially.

2. All general-purpose registers and reservation station registers hold either real or virtual values (called tags here). If a real value is unavailable to a destination register during the issue stage, a virtual value (tag) is used initially. The tag identifies the functional unit that is computing the real value. Virtual register values are converted to real values as soon as the designated functional unit completes its computation and puts the result on the bus.

3. Functional units use reservation stations with multiple slots. Each slot holds the information needed to execute a single instruction, including the operation and the operands. The functional unit begins processing when it is free and when all source operands needed for an instruction are real.

The design implemented in this paper can decode four types of instructions: ADD/SUB, MUL, FETCH and WRITE. Accordingly, there are four functional blocks in the design, which perform addition/subtraction, multiplication, fetching from memory and writing to memory. Data is communicated through registers. Four registers are available for use in instructions. The registers are arranged and handled using the completion file, which is also used to handle special conditions such as overflow and page fault. The instruction dispatch unit reads the instructions from the instruction queue in order and decodes them. It is designed to dispatch two instructions in parallel, on two instruction buses.

Fig. 1.1: A model of an implementation of Tomasulo's algorithm.

1.2 OUT OF ORDER EXECUTION
This algorithm is the basic technique used to implement out-of-order (OOO) execution in modern microprocessors. To achieve greater instruction throughput, superscalar microprocessors use several functional units that can execute instructions in parallel. However, if two instructions depend on each other, one of them has to wait until the other has finished, and in this case one functional unit is idle. If a different instruction, potentially following the other two in the instruction sequence, does not depend on their results, it can be executed in parallel on the free functional unit. Data hazards must be handled properly. In general, a data hazard arises when changing the instruction execution order influences the result of the computation. Tomasulo's algorithm was designed to avoid such problems.

2. LITERATURE SURVEY
Tomasulo's algorithm originated in the IBM System/360 computer family. This section gives an overview of how, why and when the algorithm was developed [1,2]. The IBM System/360 is a family of computer systems developed in the 1960s, whose chief architect was the well-known Gene Amdahl [16]. Prior to the announcement of this family, computers were custom made and designed independently. This development signalled that a revolution was underway that would change the computer industry forever. Initially only 6 models were announced: 30, 40, 50, 60, 62 and 70, whereas in actual fact 14 models were produced: 20, 22, 25, 30, 40, 44, 50, 65, 67, 75, 85, 91, 95 and 195 [16]. Despite the models' individual differences, the System/360 family employed the same user instruction set.
The larger machines handled complex instructions in hardware, whilst the smaller ones handled them in microcode, where an instruction such as multiplication would be completed by repeated addition. As we know today, this was an extremely inefficient way to execute a multiplication instruction [10]. (It was also rumored that the smaller 360 machines performed addition by repeated increments, i.e. x + 5: add a 1 bit five times [13].) The System/360 employed a variety of operating systems [14], such as DOS/360, OS/360, CP-67 (later VM/370), MTS, CRJE, TSO and Amdahl's UTS. OS/360 proved to be the most popular. The 360 computer family had a very limited number of registers, initially consisting of only four double-precision floating-point registers. Consequently, compiler scheduling was not particularly effective. On top of this, even the more optimal 360 designs took considerable time to access memory and to compute long floating-point equations. These constraining factors prompted programmers to develop a solution so as to attain maximum efficiency [10,11,12]. The ultimate solution to these problems came in the form of Tomasulo's algorithm.

3. IMPLEMENTATION OF THE LOGIC DESIGN
Xilinx ISE 13.2 is used to implement all the modules in the architecture of the Tomasulo-based out-of-order execution processor, using Verilog HDL.

3.1 SIMULATION AND SYNTHESIS RESULTS
The RTL schematic of the Tomasulo-based processor is shown in Fig. 3.1. The device utilization summary is given below.

Number of Slices: 537 out of 4656 11%
Number of Slice Flip Flops: 407 out of 9312 4%
Number of 4 input LUTs: 838 out of 9312 8%
Number of IOs: 8
Number of bonded IOBs: 8 out of 232 3%
IOB Flip Flops: 8
Number of GCLKs: 1 out of 24 4%

Fig. 3.1: Tomasulo block

3.1.1 Fetch station
The RTL schematic of the Fetch block is shown in Fig. 3.2. The device utilization summary is given below.

Number of Slices: 13 out of 4656 0%
Number of Slice Flip Flops: 8 out of 9312 0%
Number of 4 input LUTs: 24 out of 9312 0%
Number of IOs: 34
Number of bonded IOBs: 34

Fig. 3.2: Fetch block

3.1.2 Instruction Decode Unit
The RTL schematic of the Instruction Decode Unit is shown in Fig. 3.3. The device utilization summary is given below.

Fig. 3.3: Instruction Decode Unit

Number of Slices: 67 out of 4656 1%
Number of Slice Flip Flops: 57 out of 9312 0%
Number of 4 input LUTs: 129 out of 9312 1%
Number of IOs: 186
Number of bonded IOBs: 186 out of 232 80%

3.1.3 Reservation station
The RTL schematic of the Reservation station is shown in Fig. 3.4. The device utilization summary is also given below.

Fig. 3.4: Reservation station

3.1.4 Register bank
The RTL schematic of the Register bank is shown in Fig. 3.5. The device utilization summary is given below.

Fig. 3.5: Register bank

Number of Slices: 0 out of 4656 0%
Number of IOs: 20
Number of bonded IOBs: 1 out of 232 0%

Number of Slices: 141 out of 4656 3%
Number of Slice Flip Flops: 128 out of 9312 1%
Number of 4 input LUTs: 168 out of 9312 1%
Number of IOs: 60
Number of bonded IOBs: 60 out of 232 25%
Number of GCLKs: 1 out of 24 4%

3.1.5 Write Block
The RTL schematic of the Write Block is shown in Fig. 3.6. The device utilization summary is given below.

Fig. 3.6: Write Block

Number of Slices: 9 out of 4656 0%
Number of 4 input LUTs: 16 out of 9312 0%
Number of IOs: 60
Number of bonded IOBs: 60 out of 232 25%

The presented Tomasulo-based processor avoids the instruction stalls that different types of data hazards can cause, which improves processor performance (Fig. 3.7 and Fig. 3.8). The Tomasulo-based processor completes execution of the test program in 150 ns (Fig. 3.8), whereas the processor with stalling cannot complete execution in the same time and takes longer (Fig. 3.7). The presented Tomasulo-based processor therefore improves performance on this test program.

4. CONCLUSION
This work presented an out-of-order execution processor based on Tomasulo's algorithm. The processor has been synthesized in Xilinx ISE 13.2 and simulated in the Xilinx ISE simulation environment. The device chosen for synthesis was the XC3S1500E. Coding was done in Verilog HDL. The processor improves performance by avoiding instruction stalls caused by different hazards, using register renaming to overcome them. The Tomasulo algorithm based out-of-order execution processor is therefore more efficient than a stalling design and is useful in modern processor design.

Fig. 3.7: Simulation result of test program 1 with stalling

Fig. 3.8: Simulation result of test program 1 without stalling
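The contrast between Fig. 3.7 and Fig. 3.8 can be illustrated with a toy timing model. The Python sketch below is an assumption-laden illustration, not the test program or the timing of the implemented design: the function `finish_time`, the instruction sequence, the latencies and the cycle-based time unit are all hypothetical, only read-after-write dependencies are modeled, and functional units and reservation stations are assumed to be always available.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, str, str]   # (op, dest, src1, src2)

LATENCY = {"ADD": 2, "SUB": 2, "MUL": 4}   # assumed cycle counts


def finish_time(program: List[Instr], latency: Dict[str, int], stall: bool) -> int:
    """Cycle at which the last result becomes available.

    stall=True : issue blocks until the operands of the current instruction
                 are ready, so every later instruction waits as well.
    stall=False: one instruction is issued per cycle and waits for its
                 operands on its own, without holding up later instructions.
    """
    ready: Dict[str, int] = {}   # register -> cycle its newest value is ready
    issue = 0                    # cycle at which the next instruction issues
    last = 0
    for op, dest, a, b in program:
        operands = max(ready.get(a, 0), ready.get(b, 0))
        if stall:
            issue = max(issue, operands)    # the whole pipeline waits
            start = issue
        else:
            start = max(issue, operands)    # only this instruction waits
        done = start + latency[op]
        ready[dest] = done
        last = max(last, done)
        issue += 1
    return last


program = [
    ("MUL", "R0", "R1", "R2"),   # long-latency producer
    ("ADD", "R3", "R0", "R1"),   # depends on the MUL result
    ("SUB", "R1", "R2", "R2"),   # independent of the MUL
    ("ADD", "R2", "R2", "R2"),   # independent of the MUL
]
print(finish_time(program, LATENCY, stall=True))
print(finish_time(program, LATENCY, stall=False))
```

Under these assumptions the stalling model needs 8 cycles and the non-stalling model 6, because the two independent instructions no longer queue behind the dependent ADD. The numbers only show the direction of the effect and are not comparable to the 150 ns measured in simulation.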

REFERENCES
[1] K. Aasaraai and A. Moshovos, Towards a viable out-of-order soft core: copy-free, checkpointed register renaming, Proceedings of the Field-Programmable Logic and Applications, pp. 79-85, 2009.
[2] S. Petit, J. Sahuquillo, P. López, R. Ubal, and J. Duato, A Complexity-Effective Out-of-Order Retirement Microarchitecture, IEEE Trans. Computers, vol. 58, no. 12, pp. 1626-1639, Dec. 2009.
[3] R. Plyaskin and A. Herkersdorf, Context-aware compiled simulation of out-of-order processor behavior based on atomic traces, in 2011 IEEE/IFIP 19th International Conference on VLSI and System-on-Chip (VLSI-SoC), pp. 386-391, IEEE, Oct. 2011.
[4] F.J. Mesa-Martinez et al., SCOORE: Santa Cruz out-of-order RISC engine, FPGA design issues, Proceedings of the Workshop on Architectural Research Prototyping, pp. 61-70, 2006.
[5] S. Berezin, A. Biere, E. Clarke, and Y. Zhu, Combining symbolic model checking with uninterpreted functions for out-of-order processor verification, in FMCAD '98, Lecture Notes in Computer Science, Vol. 1522, pp. 369-386, Springer-Verlag, Berlin, 1998.
[6] A. Biere, A. Cimatti, E.M. Clarke, and Y. Zhu, Symbolic model checking without BDDs, in TACAS '99, Lecture Notes in Computer Science, Vol. 1579, Springer-Verlag, Amsterdam, The Netherlands, 1999.
[7] R. M. Tomasulo, An efficient algorithm for exploiting multiple arithmetic units, IBM J. Research and Development, 11:1, pp. 25-33, January 1967.
[8] W. Damm and A. Pnueli, Verifying out-of-order executions, in D. Probst (Ed.), CHARME '97, Chapman & Hall, 1997.
[9] D. Sima, B. Polytech, The design space of register renaming techniques, IEEE Micro, 20(5), pp. 70-83, 2000.
[10] S. Palacharla et al., Complexity-Effective Superscalar Processors, Proceedings of the International Symposium on Computer Architecture, pp. 206-218, 1997.
[11] F.J. Mesa-Martinez et al., SCOORE: Santa Cruz out-of-order RISC engine, FPGA design issues, Proceedings of the Workshop on Architectural Research Prototyping, pp. 61-70, 2006.
[12] K. Aasaraai and A. Moshovos, Towards a viable out-of-order soft core: copy-free, checkpointed register renaming, Proceedings of the Field-Programmable Logic and Applications, pp. 79-85, 2009.
[13] W. Damm and A. Pnueli, Verifying out-of-order executions, in D. Probst (Ed.), CHARME '97, Chapman & Hall, London, 1997.
[14] L. Gwennap, Intel's P6 uses decoupled superscalar design, Microprocessor Report, Vol. 9, No. 2, pp. 9-15, 1995.
[15] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, CA, 1996.
[16] Peter Dell, Die Auswirkung von Mechanismen zur out-of-order Ausführung auf den Cyclecount von RISC-Architekturen, Master's thesis, Universität des Saarlandes, FB Informatik, 1998.