Multicore Design Considerations


Multicore: The Forefront of Computing Technology

"We're not going to have faster processors. Instead, making software run faster in the future will mean using parallel programming techniques. This will be a huge shift." -- Katherine Yelick, Lawrence Berkeley National Laboratory, quoted in The Economist, "Parallel Bars"

Multicore is a term associated with parallel processing, which refers to the use of simultaneous processors to execute an application or multiple computational threads. Parallel programming/processing can be implemented on TI's KeyStone multicore architecture.

Parallel Processing

Parallel processing divides big applications into smaller ones and distributes tasks across multiple cores. The goal is to speed up processing of computationally intensive applications.

Characteristics of computationally intensive applications:
- Large amount of data to process
- Complex algorithms that require many computations

Goals of task partitioning:
- Computational load balancing: evenly divides the effort among all available cores
- Minimal contention for system resources:
  - Memory (DDR, shared L2)
  - Transport (TeraNet, peripherals)

Parallel Processing: Use Cases

Network gateway, speech/voice processing
- Typically hundreds or thousands of channels
- Each channel consumes about 30 MIPS

Large, complex, floating-point FFT (1M)
- Multiple-size, short FFTs

Video processing
- Slice-based encoder
- Video transcoder (low quality)
- High-quality decoder

Parallel Processing: Use Cases (continued)

Medical imaging
- Filtering > reconstruction > post-filtering
- Edge detection

LTE channel, excluding turbo decoder/encoder
- Two cores uplink
- Two cores downlink

LTE channel, including turbo decoder
- Equal to the performance of 30 cores
- Each core works on a package of bits

Scientific processing
- Large, complex matrix manipulations
- Use case: oil exploration

Parallel Processing: Control Models

Master-Slave Model (a master core distributes work to slave cores):
- Multiple speech processing
- Variable-size, short FFT
- Video encoder slice processing
- VLFFT

Data Flow Model (cores form a processing pipeline):
- High-quality video encoder
- Video decoder
- Video transcoder
- LTE physical layer

[Diagram: a Master core dispatching to several Slave cores (Core 0, Core 1, Core 2)]
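The master-slave control model above can be sketched in ordinary Python, with worker threads standing in for slave DSP cores; the queue mechanics and the `process_channel` task are illustrative assumptions, not TI APIs.

```python
# Master-slave sketch: a master queues work items (e.g., speech channels);
# slave workers pull and process them independently. Threads stand in for
# slave cores; process_channel is a made-up placeholder task.
import queue
import threading

def process_channel(samples):
    # Placeholder "speech processing": just compute signal energy.
    return sum(s * s for s in samples)

def run_master_slave(channels, num_slaves=3):
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def slave():
        while True:
            item = tasks.get()
            if item is None:            # Sentinel: master says "done"
                break
            chan_id, samples = item
            out = process_channel(samples)
            with lock:
                results[chan_id] = out

    workers = [threading.Thread(target=slave) for _ in range(num_slaves)]
    for w in workers:
        w.start()
    for chan_id, samples in channels.items():   # Master distributes work
        tasks.put((chan_id, samples))
    for _ in workers:                           # One sentinel per slave
        tasks.put(None)
    for w in workers:
        w.join()
    return results
```

Every slave runs the same loop; the master only dispatches and collects, which mirrors the master-slave topology in the diagram.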

Parallel Processing: Partitioning Considerations

Function driven
- Large tasks are divided into function blocks
- Function blocks are assigned to each core
- The output of one core is the input of the next core
- Use cases: H.264 high-quality encoding and decoding, LTE

Data driven
- Large data sets are divided into smaller data sets
- All cores perform the same process on different blocks of data
- Use cases: image processing, multi-channel speech processing, slice-based encoder
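The data-driven case above can be sketched as follows: the same operation runs on different stripes of a frame, one stripe per core. A thread pool stands in for the DSP cores; the stripe layout and the pixel-sum task are illustrative assumptions.

```python
# Data-driven partitioning sketch: identical processing on different
# blocks of data, one block per "core", then a combine step.
from concurrent.futures import ThreadPoolExecutor

def sum_block(frame, row_start, row_end):
    # Each "core" processes only its horizontal stripe of the frame.
    return sum(sum(row) for row in frame[row_start:row_end])

def partitioned_sum(frame, num_cores=4):
    rows = len(frame)
    stripe = (rows + num_cores - 1) // num_cores     # Rows per core (rounded up)
    bounds = [(i * stripe, min((i + 1) * stripe, rows))
              for i in range(num_cores)]
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        partials = pool.map(lambda b: sum_block(frame, *b), bounds)
    return sum(partials)     # Combine the per-core partial results
```

The function-driven model would instead chain different stages core-to-core, each core's output feeding the next.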

Parallel Processing: System Recommendations

- Ability to perform many operations
  - Fixed-point AND floating-point processing
  - SIMD instructions, multicore architecture
- Ability to communicate with the external world
  - Fast two-way peripherals that support high bit-rate traffic
  - Fast response to external events
- Ability to address large external memory
  - Fast and efficient save and retrieve methods
  - Transparent resource sharing between cores
- Efficient communication between cores
  - Synchronization
  - Messaging
  - Data sharing

Parallel Processing: Recommended Tools

- Easy-to-use IDE (Integrated Development Environment)
  - Advanced debug features (system trace, CP tracer)
  - Simultaneous, core-specific debug monitoring
- Real-time operating system (e.g., SYS/BIOS)
- Multicore software development kit
  - Standard APIs simplify programming
  - Layered abstraction hides physical details from the application
  - System-optimized capabilities
- Full-featured compiler, optimizer, linker
- Third-party support

Example: High-Definition 1080i60 H.264 Video Encoder

- A short introduction to video encoding
  - Pixel format
  - Macroblocks
- Performance numbers and limitations
  - Motion estimation
  - Encoding
  - Entropy encoder
  - Reconstruction
- Data in and out of the system
  - DDR bandwidth
  - Synchronization, data movement
- System architecture

Macroblock and Pixel Data

- RGB and YUV color formats; 4:4:4 and 4:2:0 sampling
  - 4:4:4: every pixel has Y, Cr, and Cb values
  - 4:2:0: every pixel has a Y value; Cr and Cb are subsampled (one chroma pair shared per 2x2 block of pixels)
- Samples are typically 8-bit values (10-, 12-, and 14-bit are also used)
- Macroblock = 16x16 pixels
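With 8-bit samples, 4:2:0 sampling works out to 1.5 bytes per pixel (one Y byte plus a Cr/Cb pair shared by four pixels). A minimal sketch of that arithmetic, with the helper names being my own:

```python
# Frame and macroblock sizes for 8-bit 4:2:0 video.
def yuv420_frame_bytes(width, height, bytes_per_sample=1):
    luma = width * height * bytes_per_sample                       # One Y per pixel
    chroma = (width // 2) * (height // 2) * 2 * bytes_per_sample   # Cb + Cr, each 2x2-subsampled
    return luma + chroma

def macroblock_count(width, height, mb_size=16):
    # Dimensions round up to whole macroblocks (hence 1080 -> 1088 rows).
    return ((width + mb_size - 1) // mb_size) * ((height + mb_size - 1) // mb_size)
```

For 1920x1080 this gives 3,110,400 bytes per frame (the figure the input-bandwidth slide uses) and 120 x 68 = 8160 macroblocks.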

Video Encoder Flow (per Macroblock)

Pipeline: Motion estimation > Intra prediction and motion compensation > Integer transform > Quantization > Entropy encoder (CABAC or CAVLC) > Out. A parallel reconstruction path performs Inverse quantization > Inverse integer transform and reconstruction > De-blocking filter, producing the reference frame for later motion estimation.

Encoder performance by resolution:

  Coder      Width  Height       Frames/Second  MCycles/Second
  D1 (NTSC)   720    480         30               660
  D1 (PAL)    720    576         25               660
  720P30     1280    720         30              1850
  1080i      1920   1080 (1088)  60 fields       3450

Cycle budget by module (1080i):

  Module                           Percentage  Approximate MIPS  Number of Cores
  Motion estimation                ~50%        1750              2
  IP, MC, transform, quantization  ~12.5%       437.5            0.5
  Entropy encoder                  ~25%         875              1
  IT, IQ, and reconstruction       ~12.5%       437.5            0.5

Video Coding Algorithm Limitations

- Motion estimation
  - Depends on the reconstruction of previous (and future) frames
  - Shortcuts can be taken (e.g., the first row of frame N does not need the last row of frame N-1)
- Intra prediction
  - Depends on the macroblocks above and to the left
  - Must be done consecutively, or encoding efficiency is lost (i.e., lower quality for the same number of bits)
- Entropy encoding (CABAC, CAVLC)
  - Must be processed in macroblock order
  - Each frame is independent of other frames

How Many Channels Can One TMS320C6678 Process?

It looks like two channels, with each channel using four cores:
- Two cores for motion estimation
- One core for entropy encoding
- One core for everything else

What other resources are needed?
- Streaming data in and out of the system
- Storing and loading data to and from DDR
- Internal bus bandwidth
- DMA availability
- Synchronization between cores, especially if trying to minimize delay

What Are the System Input Requirements?

Stream data in and out of the system:
- Raw data: 1920 * 1080 * 1.5 = 3,110,400 bytes per frame = 24,883,200 bits per frame (~25 Mbits per frame)
- At 30 frames per second, the input is about 750 Mbps
- NOTE: The order of raw data for a frame is the Y component first, followed by U and V

750 Mbps of input requires one of the following:
- One SRIO lane (5 Gbps raw, about 3.5 Gbps of payload)
- One PCIe lane (5 Gbps raw)
NOTE: KeyStone devices provide four SRIO lanes and two PCIe lanes.

Compressed data (e.g., 10 to 20 Mbps) can use SGMII (10M/100M/1G), SRIO, or PCIe.
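The raw-input figure above can be reproduced with a few lines of arithmetic; this is just a sketch of the slide's own numbers (1080p-sized frames at 30 full frames per second):

```python
# Raw 4:2:0 input bandwidth for the 1080i60 example (30 full frames/s).
def input_bandwidth_mbps(width=1920, height=1080, fps=30):
    bytes_per_frame = width * height * 3 // 2   # 1.5 bytes/pixel in 4:2:0
    bits_per_frame = bytes_per_frame * 8        # 24,883,200 bits (~25 Mbit)
    return bits_per_frame * fps / 1e6           # Megabits per second
```

This evaluates to ~746.5 Mbps, i.e., the ~750 Mbps the slide rounds to, which fits in a single 5 Gbps SRIO or PCIe lane with headroom.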

How Many Accesses to the DDR?

For the purposes of this example, only frame-size accesses are considered. All other accesses (ME vectors, parameters, compressed data, etc.) are negligible.

Requirements for processing a single frame:
- Moving data from the peripheral to DDR: 25 Mbits = 3.125 MB
- The motion estimation phase reads the current frame (Y only) and older Y components of the reconstructed frame(s). A good ME algorithm may read up to 6x older frame(s): 7 * 1920 * 1088 = ~15 MB
- The encoding phase reads the current frame and one old frame, for a total of about 6.25 MB
- The reconstruction phase reads one frame and writes one frame, for a total of 6.25 MB
- Frame compression before or after the entropy encoder is negligible

Total DDR access for a single frame is less than 32 MB.
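Summing the per-frame traffic items listed above gives the total; this sketch uses the slide's own figures (4:2:0 frames, the padded 1088-row height for the luma reads, up to 6x reference reads for ME):

```python
# Per-frame DDR traffic budget for the 1080i encoder example.
def ddr_bytes_per_frame(width=1920, height=1080, padded_height=1088):
    luma = width * padded_height          # Y plane, 1 byte/pixel
    frame_420 = width * height * 3 // 2   # Full 4:2:0 frame (3,110,400 B)
    input_dma = frame_420                 # Peripheral -> DDR (~3.1 MB)
    motion_est = 7 * luma                 # Current Y + up to 6x reference reads (~15 MB)
    encode = 2 * frame_420                # Current frame + one old frame (~6.25 MB)
    reconstruct = 2 * frame_420           # One frame read + one frame written (~6.25 MB)
    return input_dma + motion_est + encode + reconstruct
```

The total comes to about 30 MB per frame, consistent with the slide's "less than 32 MB" bound.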

How Does This Access Avoid Contention?

- Total DDR access for a single frame is less than 32 MB, so total DDR access for 30 frames per second (60 fields) is less than 32 * 30 = 960 MBps.
- The DDR3 raw bandwidth is more than 10 GBps (1333 MHz clock and 64 bits), so the encoder uses about 10% of it, which reduces the possibility of contention.
- DDR3 DMA uses TeraNet at clock/3 with 128 bits; TeraNet bandwidth is 400 MHz * 16 B = 6.4 GBps.
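The utilization claim is a one-line ratio; this sketch reuses the slide's round numbers (32 MB per frame, 30 frames per second, a 64-bit DDR3 interface at 1333 MHz):

```python
# DDR utilization estimate: encoder traffic vs. raw DDR3 bandwidth.
def ddr_utilization(mb_per_frame=32, fps=30, ddr_mhz=1333, bus_bytes=8):
    traffic_mbps = mb_per_frame * fps   # 960 MBps of encoder traffic
    raw_mbps = ddr_mhz * bus_bytes      # 1333 MT/s * 8 B = 10,664 MBps raw
    return traffic_mbps / raw_mbps
```

This gives roughly 9% utilization, the ~10% figure the slide cites as low enough to keep DDR contention unlikely.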

KeyStone SoC Architecture Resources

- 10 EDMA controllers with 144 EDMA channels and 1152 PaRAM (parameter) blocks
  - The EDMA scheme must be designed by the user; the LLD provides easy EDMA usage
- In addition, the Navigator has its own PKTDMA for each master
  - Data in and out of the system (SRIO, PCIe, or SGMII) is moved using the Navigator
  - All synchronization between cores, and passing of data pointers between cores, is done using the Navigator
  - IPC provides easy access to the Navigator

Conclusion: Two high-quality H.264 1080i encoder channels can be processed on a single TMS320C6678.

System Architecture

Stream data enters through the SGMII driver, SRIO, or PCIe and moves over TeraNet to the cores:

Channel 1:
- Core 0: Motion estimation, upper half of frame
- Core 1: Motion estimation, lower half of frame
- Core 2: Compression and reconstruction
- Core 3: Entropy encoder

Channel 2:
- Core 4: Motion estimation, upper half of frame
- Core 5: Motion estimation, lower half of frame
- Core 6: Compression and reconstruction
- Core 7: Entropy encoder