Scalability of MB-level Parallelism for H.264 Decoding


Scalability of Macroblock-level Parallelism for H.264 Decoding
Mauricio Alvarez Mesa (1), Alex Ramírez (1,2), Mateo Valero (1,2), Arnaldo Azevedo (3), Cor Meenderinck (3), Ben Juurlink (3)
(1) Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
(2) Barcelona Supercomputing Center (BSC), Barcelona, Spain
(3) Delft University of Technology (TUD), Delft, The Netherlands
December 15, 2009


Trends in digital video
- Towards high quality systems
  - High definition video and high quality video codecs
  - High computational complexity
- Towards mobile and integrated systems
  - Convergence of mobile and multimedia systems
  - Real-time, area and power constraints
- Towards multiple formats and extensions
  - H.264 extensions and different video codecs (MPEG-2, VC-1)
  - Programmable processors instead of application specific hardware

Trends in multicore computer architecture
- Towards manycore systems
  - Hundreds of cores on a chip
  - Power and complexity wall: simpler cores
  - Massive Thread Level Parallelism
- Towards heterogeneous/asymmetric architectures
  - Asymmetric cores: same ISA but different performance
  - Heterogeneous cores: accelerators for different application domains
  - Specialized architectures have better performance/power/area benefits

Challenges of video applications in the multicore era
Requirements of digital video applications:
- Performance: high quality video translates into high computational complexity.
- Efficiency: embedded environments impose real-time and power constraints.
- Flexibility: multiple video formats and new extensions require programmability.
Opportunities of multicore architectures:
- Scalability: applications can benefit from multicores only if they can be parallelized.


H.264 Decoder
- Block based: MBs are the basic coding unit.
- Hybrid: motion compensation + transform coding (DCT)

Macroblock-level parallelism
- The 2D-wave processing order satisfies MB dependencies and allows thread-level parallelism (TLP) to be exploited.
- Scalability depends on frame resolution.

Parallel model
Each frame in a video sequence can be represented with a finite Directed Acyclic Graph (DAG):
- Each node in the DAG represents the decoding of one MB by one processor.

Theoretical maximum performance [1]
[Figure: number of parallel macroblocks (0 to 60) versus time (0 to 250).]
The maximum speedup cannot be reached because:
- MB processing time is variable and input dependent
- Thread synchronization time is not negligible
The result is load imbalance and synchronization overhead.
[1] Meenderinck et al., Parallel Scalability of Video Decoders
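The idealized parallelism profile can be sketched in a few lines, assuming every MB takes exactly one time slot, synchronization is free, and a Full HD frame is 120x68 MBs (1920x1088 samples). The function name `wavefront_profile` is hypothetical and not taken from the authors' tools.

```python
from collections import Counter

def wavefront_profile(width_mb, height_mb):
    """Idealized 2D-wave profile: unit MB time, no synchronization cost.

    MB (x, y) can start in slot x + 2*y, because it has to wait for its
    left neighbour (x-1, y) and its upper-right neighbour (x+1, y-1).
    Returns a list whose entry t is the number of MBs decodable in slot t.
    """
    slots = Counter(x + 2 * y
                    for y in range(height_mb)
                    for x in range(width_mb))
    return [slots[t] for t in range(max(slots) + 1)]

# Assumption: Full HD is coded as 120x68 macroblocks of 16x16 samples.
profile = wavefront_profile(120, 68)
total_mbs = sum(profile)
max_speedup = total_mbs / len(profile)  # sequential slots / parallel slots
print(len(profile), max(profile), round(max_speedup, 1))
```

Under these assumptions the frame needs 254 time slots, at most 60 MBs are decodable at once, and the speedup bound is about 32, which is consistent with the axis ranges of the figure.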

Effects of variable decoding time
[Figure: speedup versus number of frames (0 to 100) for constant MB decoding time and for the sequences blue_sky, pedestrian, riverbed and rush_hour.]
- Average speedup reduction: 33%
- Actual performance depends on input content

Effects of thread synchronization overhead
[Figure: speedup versus overhead as a factor of MB decoding time (0 to 10) for the sequences blue_sky, pedestrian, riverbed and rush_hour.]
- Speedup reduction: 38% when overhead = MB decoding time
- Observed overhead is greater than MB decoding time


Parallel architecture: SGI Altix
- Base module: 2 dual-core Intel Itanium2 [2]
- Distributed Shared Memory (cc-NUMA)
[2] Rusu S., Circuit Technologies for Multi-Core Processor Design

Benchmark: HD-VideoBench [3]
- HD-VideoBench test sequences: Full High Definition (FHD), 100 frames, 25 fps.
- H.264 decoder: FFmpeg modified for MB-level parallelization.
[3] http://people.ac.upc.edu/alvarez/hdvideobench/

Programming model
- Single Program Multiple Data (SPMD)
- N+2 threads: 1 master, 1 CABAC, N workers
- Task pool for dynamic load balancing

Scheduling strategies
Static scheduling:
- Master thread: checks the dependencies and inserts work in the task queue.
- Worker threads: process tasks and update dependencies.
Dynamic scheduling:
- Master thread: inserts the first MB in the task queue and waits for the last MB.
- Worker threads: take work from the task queue, process tasks, update dependencies and insert ready MBs in the task queue.
- Tail-submit: if at least one MB is ready, process it directly.
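The dynamic strategy with tail-submit can be illustrated with a minimal single-threaded sketch. It is not the authors' FFmpeg code: `decode_frame`, `tail_submit` and `right_first` are hypothetical names, the dependency set is reduced to the left and upper-right neighbours (an assumption that requires a frame at least 2 MBs wide), and a real implementation would need atomic dependency counters and a thread-safe queue.

```python
from collections import deque

def decode_frame(width_mb, height_mb, tail_submit=True, right_first=True):
    """Simulate one worker draining the task queue; return (order, submissions)."""
    # Outstanding dependencies per MB: left and upper-right neighbour.
    remaining = {}
    for y in range(height_mb):
        for x in range(width_mb):
            deps = (1 if x > 0 else 0)
            deps += (1 if y > 0 and x + 1 < width_mb else 0)
            remaining[(x, y)] = deps

    queue = deque([(0, 0)])   # the master inserts the first MB
    submissions = 1           # number of task-queue insertions
    order = []                # decoding order, for inspection

    while queue:
        mb = queue.popleft()
        while mb is not None:
            order.append(mb)  # "decode" the macroblock
            x, y = mb
            successors = [(x + 1, y), (x - 1, y + 1)]  # right, down-left
            if not right_first:
                successors.reverse()
            ready = []
            for s in successors:
                if s in remaining:
                    remaining[s] -= 1
                    if remaining[s] == 0:
                        ready.append(s)
            # Tail-submit: keep one ready MB for this worker, queue the rest.
            mb = ready.pop(0) if (tail_submit and ready) else None
            for s in ready:
                queue.append(s)
                submissions += 1
    return order, submissions

order, with_tail = decode_frame(120, 68, tail_submit=True)
_, without_tail = decode_frame(120, 68, tail_submit=False)
print(len(order), with_tail, without_tail)
```

Without tail-submit every MB passes through the queue once; with it, a worker continues directly with a ready successor, so fewer queue insertions are needed. The `right_first` flag mirrors the right-first and down-left-first variants compared in the speedup results.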

Speedup and scalability
[Figure: average speedup (0 to 10) versus number of threads (2 to 32) for tail submit right-first, tail submit down-left-first, static scheduling and dynamic scheduling.]
- Static scheduling: suffers from load imbalance
- Dynamic scheduling: suffers from synchronization overhead
- Tail submit: reduces synchronization overhead and exploits data locality

Profiling analysis
[Figure: synchronization overhead ratio (as a factor of MB decoding time, 0 to 20) versus number of threads (0 to 35) for dynamic_scheduling_wo_tailsubmit and dynamic_scheduling_w_tailsubmit.]
- Significant reduction of synchronization overhead
- Submitting new tasks to the task queue is the main source of overhead

Impact of the CABAC entropy decoder
[Figure: execution time in us/frame (0 to 180000) versus number of threads (2 to 32), broken down into control_thread, hl_decode_mb and decode_cabac.]
- CABAC should be executed sequentially.
- The behavior of the CABAC execution time is a side effect of the cc-NUMA architecture.

Identifying the acceleration requirements
A scalable MB-level parallelization requires:
- Removing the CABAC bottleneck
- Low-latency synchronization primitives
These limiting factors offer a potential for multicore acceleration.
Multicore acceleration evaluation:
- Dedicated and accelerated CABAC processor
- On-chip hardware-supported synchronization
- Evaluated using a fast trace-driven multicore simulation

Accelerating CABAC entropy decoding
[Figure: speed-up (left axis) and frames per second (right axis) versus number of processors (+1 master, 1 to 128) for CABAC acceleration factors of 1.0x, 1.5x, 2.0x, 3.0x, 4.0x, 5.0x and 10.0x, with real-time thresholds at 25, 50 and 100 fps.]
- FHD 25 fps: CABAC 1X, 7 worker processors
- FHD 50 fps: CABAC 1.5X, 16 worker processors
- FHD 100 fps: not enough parallelism

Accelerating thread synchronization
[Figure: speed-up (left axis) and frames per second (right axis) versus number of processors (+1 master, 1 to 128) for synchronization latencies of 1 ns, 10 ns, 100 ns, 500 ns, 1000 ns, 5000 ns and 10000 ns, plus sync-altix-1p and sync-altix-sw, with real-time thresholds at 25, 50 and 100 fps.]
- Altix synchronization time: 1000-5000 ns; without contention: 500-1000 ns
- FHD 25 fps: synchronization latency 500 ns, 7 workers
- FHD 50 fps: synchronization latency 100 ns, 16 workers


Limitations to scalability:
- Load imbalance
- Synchronization overhead
- CABAC sequential bottleneck
Implementation on a cc-NUMA machine:
- Best scheduling strategy: dynamic scheduling + tail-submit
Acceleration potential:
- Estimation of the required CABAC acceleration
- Limits on the latency of thread synchronization

Acknowledgements
This work has been supported by:
- HiPEAC, European Network of Excellence on High Performance and Embedded Architecture and Compilation
- The European Commission in the context of the SARC project (contract no. 27648)
- The Spanish Ministry of Education (contract no. TIN2007-60625)

Backup slides: Trace-driven DAG simulator
The DAG simulator:
- Creates the DAG for each frame in the video using real execution traces
- Calculates the Task Processing Time (TPT) of every node as:

    TPT(n) = w_n + s_n + \max(TFT(pr_n))    (1)

  where
  - w_n is the time required to process the task;
  - s_n is the time required for thread synchronization;
  - \max(TFT(pr_n)) is the maximum task finish time (TFT) of the immediate predecessor tasks of n.
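Equation (1) can be read as a recurrence over the DAG: a task finishes after its own processing and synchronization time, counted from the moment its latest predecessor finishes. The sketch below applies that reading to the 2D-wave DAG of one frame. It assumes an unlimited number of processors, a per-MB cost function supplied by the caller instead of real execution traces, and the four H.264 neighbour dependencies (left, upper-left, upper, upper-right); `simulate_frame` is a hypothetical name, not the authors' simulator.

```python
def simulate_frame(width_mb, height_mb, work, sync):
    """Return (finish_times, makespan) for one frame.

    work(x, y) -> processing time w_n of MB (x, y)
    sync       -> synchronization time s_n, taken as constant here
    """
    finish = {}
    # Raster order guarantees that all predecessors are computed first.
    for y in range(height_mb):
        for x in range(width_mb):
            preds = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
            latest = max((finish[p] for p in preds if p in finish),
                         default=0.0)
            finish[(x, y)] = work(x, y) + sync + latest
    return finish, max(finish.values())

def unit_work(x, y):
    # Assumption: every MB takes one time unit (no trace data available).
    return 1.0

_, ideal = simulate_frame(120, 68, unit_work, sync=0.0)
_, with_sync = simulate_frame(120, 68, unit_work, sync=1.0)
sequential = 120 * 68 * 1.0
print(ideal, with_sync, sequential / ideal, sequential / with_sync)
```

With constant unit work and no synchronization cost the makespan is 254 units for a 120x68 MB frame; setting the synchronization time equal to the MB time doubles it. Replacing `unit_work` with measured per-MB times is what would make such a simulation trace-driven.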

Backup slides: Base architecture: dual-core Itanium2 processor
- Intel Itanium2 processor, 1.6 GHz, 90 nm
- 16 KB I-L1, 16 KB D-L1 cache per core
- 1 MB I-L2, 256 KB D-L2 cache per core
- Shared 8 MB (I+D)-L3
- 8 GB of RAM