A VLIW Processor for Multimedia Applications

E. Holmann, T. Yoshida, A. Yamada, Y. Shimazu
Mitsubishi Electric Corporation, System LSI Laboratory
4-1 Mizuhara, Itami, Hyogo 664, Japan

Outline
  Objective
  System Architecture
  Performance
  Conclusions

System Architecture
  [Block diagram: single-chip system built around the D30V core]
  Inputs and outputs: MPEG2 bitstream (serial input); video and audio (display & audio output)
  On-chip blocks: VLD process, block loader, bus controller, D30V core, Inst RAM (32KB), Data RAM (32KB), System Bus I/F, DRAM I/F
  Buses: DD/DA bus, ID/IA bus, ED/EA bus (data 64b, address 32b)
  Off-chip: external ROM, RAM, I/O, etc.; external DRAM (2MB)

Processor Core Diagram
  [Block diagram of the D30V core]
  Instruction RAM (32KB)
  Instruction Decode Unit: Decoder 0, Decoder 1 (generate control signals)
  Memory Unit: MEM control, ALU, PC control, Shift
  Integer Unit: Mul, ALU, Shift, 2 x 64b accumulators
  Register file: 64 x 32b GPRs
  Data RAM (32KB)
  Data paths shown as 64 and 32 bits wide

Instruction Formats
  Two types of instructions
    Two short RISC sub-instructions (28 bits each)
      » Short sub-instruction L
      » Short sub-instruction R
    One long RISC sub-instruction (54 bits)
      » Long sub-instruction

Instruction Issuing
  64-bit instruction word (bit positions 0, 4, 32, 36, 63 marked):
    FM0 | CC | L-container | FM1 | CC | R-container

  FM   Issue format                                               Issue
  00   Short sub-instruction L, short sub-instruction R           parallel
  01   Short sub-instruction L, then short sub-instruction R      serial
  10   Short sub-instruction R, then short sub-instruction L      serial
  11   Long sub-instruction                                       long inst
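The word layout and FM table on this slide can be sketched as a small decoder. This is only an illustration under stated assumptions: bit 0 is taken to be the most significant bit, FM0 sits at bit 0, FM1 at bit 32, each CC field is 3 bits, and each container is 28 bits. The names `IssueWord`, `field`, `build`, and `decode` are hypothetical.

```python
from typing import NamedTuple


class IssueWord(NamedTuple):
    fm: int      # 2-bit format field, FM0 is the high bit (assumed)
    cc_l: int    # 3-bit condition code of the L container
    left: int    # 28-bit L container
    cc_r: int    # 3-bit condition code of the R container
    right: int   # 28-bit R container


# Issue behaviour per FM value, as read from the slide's table.
ISSUE = {
    0b00: "L and R in parallel",
    0b01: "L then R (serial)",
    0b10: "R then L (serial)",
    0b11: "long sub-instruction",
}


def field(word: int, first: int, last: int) -> int:
    """Extract bits first..last (inclusive); bit 0 is the MSB of the 64-bit word."""
    width = last - first + 1
    return (word >> (63 - last)) & ((1 << width) - 1)


def build(fm: int, cc_l: int, left: int, cc_r: int, right: int) -> int:
    """Pack the fields into a 64-bit word using the assumed layout."""
    return (((fm >> 1) & 1) << 63 | (cc_l & 0x7) << 60 | (left & 0xFFFFFFF) << 32
            | (fm & 1) << 31 | (cc_r & 0x7) << 28 | (right & 0xFFFFFFF))


def decode(word: int) -> IssueWord:
    """Split a 64-bit instruction word into its format, CC, and container fields."""
    fm = (field(word, 0, 0) << 1) | field(word, 32, 32)
    return IssueWord(fm, field(word, 1, 3), field(word, 4, 31),
                     field(word, 33, 35), field(word, 36, 63))


word = build(0b01, 0b101, 0x1234567, 0b010, 0x0ABCDEF)
print(ISSUE[decode(word).fm])
```

How a 54-bit long sub-instruction is distributed over the two containers is not stated on the slide, so the sketch leaves that case undecoded.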

Speculative Execution
  64-bit instruction word (bit positions 0, 4, 32, 36, 63 marked):
    FM0 | CC | L-container | FM1 | CC | R-container
  Every sub-instruction is speculatively executed
  3 bits (CC) define the condition for execution
  Conditions are based on the status of user flags
  PSW has 8 user flags
  2 user flags are used for speculative execution

ALU Special Operations
  Added video operations
    Variable length saturation instruction: SAT, SATZ, SATHL, SATHH
      » SAT ra, rb, 24 -> ra = saturate(rb, 24)
      » SATHH ra, rb, 12 -> rah = saturate(rb, 12)
    Flexible join instruction: JOINLL, JOINLH, JOINHL, JOINHH
      » JOINLH ra, rb, rc -> ra = rbl || rch (concatenation)
    Add sign instruction: ADDS
      » ADDS ra, rb, rc -> ra = rb + sign(rc)
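A minimal Python model of the three video operations may help make the register-transfer notation concrete. It rests on assumptions the slide does not confirm: registers are 32 bits, `saturate(rb, n)` clamps to the signed n-bit range, JOINLH concatenates the low half of rb with the high half of rc, and `sign(rc)` is -1, 0, or +1. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def to_signed(value: int, bits: int = 32) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def sat(rb: int, n: int) -> int:
    """SAT ra, rb, n: clamp rb to the signed n-bit range (assumed semantics)."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(lo, min(hi, to_signed(rb)))


def joinlh(rb: int, rc: int) -> int:
    """JOINLH ra, rb, rc: low half of rb in the upper 16 bits, high half of rc below."""
    return ((rb & 0xFFFF) << 16) | ((rc >> 16) & 0xFFFF)


def adds(rb: int, rc: int) -> int:
    """ADDS ra, rb, rc: add the sign of rc (-1, 0, or +1, assumed) to rb."""
    s = to_signed(rc)
    sign = (s > 0) - (s < 0)
    return (to_signed(rb) + sign) & MASK32


print(sat(0x01000000, 24))                   # value above the 24-bit range is clamped
print(hex(joinlh(0x11112222, 0x33334444)))
print(adds(10, 0xFFFFFFFF))                  # rc is negative, so rb is decremented
```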

ALU Special Operations (cont.)
  Added sub-word operations
    ALU operations on dual half-word data: ADD2H, SUB2H, ADDS2H, AVG2H, SAT2H, SATZ2H, MUL2H, MULX2H
      » ADD2H ra, rb, rc -> rah = rbh + rch
                            ral = rbl + rcl
    Shifter operations on dual half-word data: SRA2H, SRL2H, ROT2H
      » SRA2H ra, rb, 3 -> rah = rbh >> 3
                           ral = rbl >> 3
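The dual half-word operations treat one 32-bit register as two independent 16-bit lanes. The sketch below models ADD2H and SRA2H that way. It assumes each lane wraps modulo 2^16 with no carry into the other lane (the slide does not say whether ADD2H saturates), and that SRA2H is an arithmetic shift on signed lanes. The helper names are hypothetical.

```python
def halves(reg: int) -> tuple[int, int]:
    """Split a 32-bit register value into (high, low) 16-bit halves."""
    return (reg >> 16) & 0xFFFF, reg & 0xFFFF


def pack(high: int, low: int) -> int:
    """Combine two 16-bit lanes into one 32-bit value, truncating each lane."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def s16(half: int) -> int:
    """Interpret a 16-bit lane as a signed number."""
    return half - 0x10000 if half & 0x8000 else half


def add2h(rb: int, rc: int) -> int:
    """ADD2H: rah = rbh + rch, ral = rbl + rcl, lanes kept independent."""
    bh, bl = halves(rb)
    ch, cl = halves(rc)
    return pack(bh + ch, bl + cl)


def sra2h(rb: int, n: int) -> int:
    """SRA2H: arithmetic right shift applied to each lane separately."""
    bh, bl = halves(rb)
    return pack(s16(bh) >> n, s16(bl) >> n)


print(hex(add2h(0x0001FFFF, 0x00010001)))  # low lane overflows without touching the high lane
print(hex(sra2h(0x80000010, 3)))
```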

ALU Special Operations (cont.)
  Added single half-word operations
    ALU operations on single half-word operands: ADDHppp, SUBHppp, JOINpp, MULHXpp, SATHp
      » ADDHLHH ra, rb, rc -> ral = rbh + rch
      » SUBHHLH ra, rb, rc -> rah = rbl - rch
      » JOINLH ra, rb, rc -> ra = rbl || rch (concatenation)
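Judging from the two examples on this slide, the three `p` letters in ADDHppp and SUBHppp select the half (H or L) of the destination, of rb, and of rc, in that order. The sketch below follows that reading. It also assumes the unselected half of ra is left unchanged, which the slide does not state. `half_op` and its helpers are hypothetical names.

```python
def get_half(reg: int, which: str) -> int:
    """Return the high ('H') or low ('L') 16 bits of a 32-bit register value."""
    return (reg >> 16) & 0xFFFF if which == "H" else reg & 0xFFFF


def set_half(reg: int, which: str, value: int) -> int:
    """Write a 16-bit value into one half of reg, keeping the other half (assumed)."""
    value &= 0xFFFF
    if which == "H":
        return (reg & 0x0000FFFF) | (value << 16)
    return (reg & 0xFFFF0000) | value


def half_op(mnemonic: str, ra: int, rb: int, rc: int) -> int:
    """Model ADDHppp / SUBHppp: suffix letters pick the dest, rb, and rc halves."""
    base, (dest, b_sel, c_sel) = mnemonic[:4], mnemonic[4:]
    if base not in ("ADDH", "SUBH"):
        raise ValueError(f"unsupported mnemonic: {mnemonic}")
    x, y = get_half(rb, b_sel), get_half(rc, c_sel)
    result = x + y if base == "ADDH" else x - y
    return set_half(ra, dest, result)


# ADDHLHH: ral = rbh + rch
print(hex(half_op("ADDHLHH", 0xAAAABBBB, 0x00010002, 0x00030004)))
# SUBHHLH: rah = rbl - rch
print(hex(half_op("SUBHHLH", 0xAAAABBBB, 0x00010002, 0x00030004)))
```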

Memory Unit Special Features
  Flexible operand types:
    byte (signed, unsigned)
    half-word (signed, unsigned)
    word
    double word
  Multiple operand accessing:
    four byte data load with packing
    four byte data store with unpacking
    two half-word data load with packing
    two half-word data store with unpacking
  Post-increment/decrement
  Register indexed
  Modulo addressing
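Post-increment combined with modulo addressing is the usual way to walk a circular buffer without explicit wrap checks. The sketch below shows the idea in Python. The buffer bounds `start` and `end`, the step size, and both function names are assumptions for illustration; the slide does not describe how the D30V specifies the modulo range.

```python
def post_increment_modulo(addr: int, step: int, start: int, end: int) -> int:
    """Return the next address after a post-increment, wrapping inside [start, end]."""
    size = end - start + 1
    return start + (addr - start + step) % size


def read_circular(memory: list[int], addr: int, step: int,
                  start: int, end: int, count: int) -> tuple[list[int], int]:
    """Read `count` values, updating the address after each access."""
    values = []
    for _ in range(count):
        values.append(memory[addr])          # access uses the current address
        addr = post_increment_modulo(addr, step, start, end)
    return values, addr


memory = list(range(100, 116))
values, next_addr = read_circular(memory, addr=6, step=1, start=4, end=7, count=6)
print(values, next_addr)
```

A negative `step` models post-decrement with the same wrap-around rule.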

Branch Unit Special Features
  Destination address calculated in second pipe stage
  Variable number of delay slots
  Block repeat with zero delay penalty
  Additional conditional branches
    Test zero and branch instruction
    Test not-zero and branch instruction

Instruction Examples
  Short_M (bit positions 0, 7, 9, 15, 21, 27 marked): opcode | X | Ra | Rb | src
    » LDBU R7, @(R6, 20)
    » LDW R4, @(R5, R7)
    » LD2W R8, @(R7+, R22)
  Short_A (bit positions 0, 7, 9, 15, 21, 27 marked): opcode | Y0 | Ra | Rb | src
    » ADD R7, R6, R8
    » SUB R10, R6, 20
    » ADD2H R4, R5, R7
  Long (bit positions 0, 7, 9, 15, 21, 53 marked): opcode | 1 0 | Ra | Rb | imm:32
    » LD2H R7, @(R6, 0x00001000)
    » AVG2H R8, R9, 0x00010001
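One way to read the bit markers on the short formats is that each number is the last bit of a field: opcode in bits 0-7, a 2-bit field in bits 8-9, then three 6-bit fields ending at bits 15, 21, and 27. That adds up to 28 bits and matches 6-bit register numbers for 64 GPRs, but it is an inference from the slide, not a documented encoding. The sketch below packs and unpacks fields under that assumption, with bit 0 as the most significant bit. The opcode value used is made up.

```python
# Assumed field boundaries (first bit, last bit), bit 0 = MSB of the 28-bit container.
SHORT_FIELDS = {
    "opcode": (0, 7),
    "x": (8, 9),
    "ra": (10, 15),
    "rb": (16, 21),
    "src": (22, 27),
}
SHORT_BITS = 28


def encode_short(**values: int) -> int:
    """Pack named field values into a 28-bit short sub-instruction."""
    word = 0
    for name, (first, last) in SHORT_FIELDS.items():
        width = last - first + 1
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << (SHORT_BITS - 1 - last)
    return word


def decode_short(word: int) -> dict[str, int]:
    """Unpack a 28-bit short sub-instruction into its named fields."""
    return {
        name: (word >> (SHORT_BITS - 1 - last)) & ((1 << (last - first + 1)) - 1)
        for name, (first, last) in SHORT_FIELDS.items()
    }


# Field values for something shaped like "ADD R7, R6, R8"; 0x12 is a placeholder opcode.
word = encode_short(opcode=0x12, x=0, ra=7, rb=6, src=8)
print(decode_short(word))
```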

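Instruction Examples (Branches)
  Short_B1 (bit positions 0, 7, 9, 15, 21, 27 marked): opcode | 00 | 0 | 0 | src
    » BRA R3
    » JMP R4
  Short_B2 (bit positions 0, 7, 9, 27 marked): opcode | 10 | disp:18
    » BRA 0x1234
    » JMP 0x10000
  Short_B3 (bit positions 0, 7, 9, 15, 27 marked): opcode | WZ | Ra | src
    » BSRTZR R2, R5
    » JMPTNZ R52, 0x200
  Short_D1 (bit positions 0, 7, 9, 15, 27 marked): opcode | W0 | Ra | src
    » DBRA R3, R17
    » DBRA R3, 0x39A
  Short_D2 (bit positions 0, 7, 9, 15, 27 marked): opcode | W0 | d:6 | src
    » DJMP 3, R50
    » DBSR 41, 0x300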

Pipeline Specification
  Class    Stage 1  Stage 2  Stage 3  Stage 4
  ALU      IF       D        EX       WB
  LD/ST    IF       DA       M        WB
  BRA      IF       DA       EX       WB
  MUL16    IF       D        EX       WB

  IF : Instruction Fetch
  D  : Decode
  DA : Decode & Address
  EX : Execute (for BRA: delayed branch)
  M  : Memory
  WB : Write Back

Conditional Branch Instructions
  Instructions: BRAT, BSRT, JMPT, JSRT

  Address        Pipeline
  PC             IF DA EX WB              (BRAT Ra, OFFSET)
  PC + 8            IF D  EX WB           (Squashed)
  PC + OFFSET          IF DA EX WB

  Branch handling:
    Decode instruction
    Calculate new PC
    Speculative execution (CC bits and user flags)
    Test register for zero/not zero (conditional execution)

Delayed Branch Instructions
  Instructions: DBRA, DBSR, DJMP, DJSR

  Address        Pipeline
  PC             IF DA EX WB              (DBRA DELAY, OFFSET)
  PC + 8            IF D  EX WB
  PC + 16              IF D  EX WB
  PC + 24                 IF D  EX WB
  PC + OFFSET                IF D  EX WB

  Branch handling:
    Decode instruction
    Calculate PC + offset
    Speculative execution
    Calculate PC + delay
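The fetch order in the delayed-branch diagram can be reproduced with a few lines of Python. This sketch assumes the delay operand is expressed as a count of 64-bit instruction words that execute after the branch, which matches the PC + 8, PC + 16, PC + 24 sequence shown but is not confirmed as the actual encoding of DELAY. `fetch_trace` is a hypothetical helper.

```python
WORD_BYTES = 8  # one 64-bit instruction word, matching the PC + 8 steps on the slide


def fetch_trace(pc: int, offset: int, delay_slots: int) -> list[int]:
    """List fetch addresses: the branch, its delay slots, then the branch target."""
    trace = [pc]
    trace += [pc + WORD_BYTES * i for i in range(1, delay_slots + 1)]
    trace.append(pc + offset)
    return trace


# Three delay slots, as drawn for DBRA DELAY, OFFSET.
print([hex(a) for a in fetch_trace(0x100, 0x40, 3)])
```

With `delay_slots=0` the target follows the branch directly, which is the zero-delay case the slides list among the branch unit features.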

Processor Parameters and Figures of Merit
  Clock frequency           250 MHz
  Parallelism               2-way VLIW, 2-way SIMD
  Peak performance          1000 MIPS
  Register file             64 x 32 bits
  RAM                       32KB data RAM (DRAM), 32KB instruction RAM (IRAM)
  8x8 IDCT                  < 2 µs
  256-point complex IFFT    ~ 40 µs
  MPEG-2 macroblock         < 800 cycles (real time)
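The figures on this slide can be cross-checked with simple arithmetic. The sketch below reads the 1000 MIPS peak as clock rate times VLIW width times SIMD width, which reproduces the number but is an interpretation. The real-time check assumes a 720 x 480 picture at 30 frames per second with 16 x 16 macroblocks; that picture format is an assumption, not something the slide states.

```python
CLOCK_HZ = 250_000_000  # 250 MHz, from the slide

# Peak rate: 2 sub-instructions per cycle, each operating on 2 half-word lanes.
peak_mips = CLOCK_HZ * 2 * 2 // 1_000_000

# Cycle budgets implied by the quoted times.
idct_cycle_budget = 2 * CLOCK_HZ // 1_000_000    # "< 2 µs" for an 8x8 IDCT
ifft_cycle_budget = 40 * CLOCK_HZ // 1_000_000   # "~ 40 µs" for a 256-point IFFT

# Real-time check for MPEG-2 (assumed picture format: 720x480 at 30 frames/s).
macroblocks_per_second = (720 // 16) * (480 // 16) * 30
cycles_per_macroblock = 800                      # upper bound from the slide
core_load = cycles_per_macroblock * macroblocks_per_second / CLOCK_HZ

print(peak_mips, idct_cycle_budget, ifft_cycle_budget)
print(macroblocks_per_second, round(core_load, 4))
```

Under these assumptions the macroblock work counted on the slide occupies well under a quarter of the available cycles, which is consistent with the "real time" label.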

Conclusions
  High performance dual-issue RISC system
    zero delay branches
    zero delay repeat loops
    speculative execution
  Multimedia Processor
    sub-word operations
    half-word operations
    special video operations
  Single chip system for DSP applications
    D30V serves functions of DSP and MCU chip

Application areas for D30V
  2D/3D graphics
  AC-3 decode
  modem V.34 (28.8kbps)
  H.263 codec
  MPEG-1 decode
  MPEG-2 decode