Scalability of Macroblock-level Parallelism for H.264 Decoding
Mauricio Alvarez Mesa¹, Alex Ramírez¹², Mateo Valero¹², Arnaldo Azevedo³, Cor Meenderinck³, Ben Juurlink³
¹ Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
² Barcelona Supercomputing Center (BSC), Barcelona, Spain
³ Delft University of Technology (TUD), Delft, The Netherlands
December 15, 2009
Outline
1. Introduction
2. H.264 video decoder
3. Experimental evaluation
4. Conclusions
Trends in digital video
- Towards high-quality systems
  - High-definition video and high-quality video codecs
  - High computational complexity
- Towards mobile and integrated systems
  - Convergence of mobile and multimedia systems
  - Real-time, area, and power constraints
- Towards multiple formats and extensions
  - H.264 extensions and different video codecs (MPEG-2, VC-1)
- Programmable processors instead of application-specific hardware
Trends in multicore computer architecture
- Towards manycore systems
  - Hundreds of cores on a chip
  - Power and complexity wall: simpler cores
  - Massive thread-level parallelism
- Towards heterogeneous/asymmetric architectures
  - Asymmetric cores: same ISA but different performance
  - Heterogeneous cores: accelerators for different application domains
  - Specialized architectures offer better performance, power, and area efficiency
Challenges of video applications in the multicore era
- Requirements of digital video applications:
  - Performance: high-quality video translates into high computational complexity
  - Efficiency: embedded environments impose real-time and power constraints
  - Flexibility: multiple video formats and new extensions require programmability
- Opportunities of multicore architectures:
  - Scalability
- Applications can benefit from multicores only if they can be parallelized
Outline
1. Introduction
2. H.264 video decoder
   - Macroblock-level parallelism
   - Theoretical maximum speed-up
   - Abstract trace-driven simulation
3. Experimental evaluation
4. Conclusions
H.264 Decoder
- Block based: macroblocks (MBs) are the basic coding unit
- Hybrid: motion compensation + transform coding (DCT)
Macroblock-level parallelism
- The 2D-wave processing order satisfies MB dependencies and allows exploiting TLP
- Scalability depends on frame resolution
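The resolution dependence can be illustrated with a short sketch (an illustration, not code from this work). Assuming unit MB decoding time and the usual reduced intra-frame dependencies on the left and upper-right MBs, MB (x, y) can start no earlier than time step x + 2y, which bounds the number of MBs in flight at any instant:

```python
# Sketch: available parallelism of the 2D-wave order, assuming unit MB
# decoding time and dependencies on the left and upper-right MBs.
# Under those assumptions, MB (x, y) can start at time step x + 2*y.

def max_parallel_mbs(mb_width: int, mb_height: int) -> int:
    """Largest number of MBs decodable in parallel at any time step."""
    start = [[x + 2 * y for x in range(mb_width)] for y in range(mb_height)]
    last_step = (mb_width - 1) + 2 * (mb_height - 1)
    return max(
        sum(1 for row in start for t in row if t == step)
        for step in range(last_step + 1)
    )

# Full HD: 1920x1088 pixels -> 120x68 macroblocks of 16x16 pixels
print(max_parallel_mbs(120, 68))  # 60, i.e. min(ceil(120/2), 68)
```

The bound equals min(ceil(mb_width/2), mb_height), so a higher-resolution frame exposes proportionally more MB-level parallelism.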
Parallel model
- Each frame in a video sequence can be represented as a finite Directed Acyclic Graph (DAG)
- Each node in the DAG represents the decoding of one MB by one processor
Theoretical maximum performance¹
[Figure: parallel macroblocks vs. time — the available parallelism ramps up to a maximum of 60 MBs, plateaus, and ramps down within each frame]
The maximum speedup cannot be reached because:
- MB processing time is variable and input dependent → load unbalance
- Thread synchronization time is not negligible → synchronization overhead
¹ Meenderinck et al., Parallel Scalability of Video Decoders
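Under the same unit-time, zero-synchronization assumptions, the per-frame upper bound on speedup can be sketched directly (a simplification in the spirit of the bound cited above, not the exact model from Meenderinck et al.): sequential time is the total MB count, parallel time is the critical-path length, since the last MB starts at step (W-1) + 2(H-1).

```python
# Sketch: theoretical maximum per-frame speedup of the 2D wave, assuming
# every MB takes one time unit and synchronization is free.

def max_speedup(mb_width: int, mb_height: int) -> float:
    total_mbs = mb_width * mb_height                           # sequential time
    critical_path = (mb_width - 1) + 2 * (mb_height - 1) + 1   # parallel time
    return total_mbs / critical_path

# Full HD frame: 120x68 macroblocks
print(round(max_speedup(120, 68), 1))  # 32.1
```

This ~32x figure for FHD is consistent with the constant-time curve in the next slide's measurements.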
Effects of variable decoding time
[Figure: speedup vs. number of frames (0-100) for constant-time MBs and the blue_sky, pedestrian, riverbed, and rush_hour sequences]
- Average speedup reduction: 33%
- Actual performance depends on input content
Effects of thread synchronization overhead
[Figure: speedup vs. synchronization overhead (as a factor of MB decoding time, 0-10) for blue_sky, pedestrian, riverbed, and rush_hour]
- Speedup reduction: 38% when overhead equals the MB decoding time
- The observed overhead is greater than the MB decoding time
Outline
1. Introduction
2. H.264 video decoder
3. Experimental evaluation
   - Experimental platform
   - Performance analysis
   - Removing the bottlenecks
4. Conclusions
Parallel architecture: SGI Altix
- Base module: 2 dual-core Intel Itanium2²
- Distributed shared memory (cc-NUMA)
² Rusu S., Circuit Technologies for Multi-Core Processor Design
Benchmark: HD-VideoBench³
- Test sequences: Full High Definition (FHD), 100 frames, 25 fps
- H.264 decoder: FFmpeg modified for MB-level parallelization
³ http://people.ac.upc.edu/alvarez/hdvideobench/
Programming model
- Single Program Multiple Data (SPMD)
- N+2 threads: 1 master, 1 CABAC, N workers
- Task pool for dynamic load balancing
Scheduling strategies
- Static scheduling
  - Master thread: checks the dependencies and inserts work into the task queue
  - Worker threads: process tasks and update dependencies
- Dynamic scheduling
  - Master thread: inserts the first MB into the task queue and waits for the last MB
  - Worker threads: take work from the task queue, process tasks, update dependencies, and insert ready MBs into the task queue
  - Tail-submit: if at least one MB is ready, process it directly
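The dynamic-scheduling worker loop with tail-submit can be sketched as follows (an illustrative model, not the modified FFmpeg code; frame size, thread count, and all names are invented). Each worker pops a ready MB, "decodes" it, decrements the dependency counters of its right and down-left neighbours, and keeps one newly ready MB for itself instead of going back through the queue:

```python
import threading
import time
from queue import Queue

W, H = 8, 6          # frame size in macroblocks (illustrative)
NUM_WORKERS = 4

def predecessors(x, y):
    """Intra-frame 2D-wave dependencies, reduced to left and upper-right MBs."""
    preds = []
    if x > 0:
        preds.append((x - 1, y))
    if y > 0 and x < W - 1:
        preds.append((x + 1, y - 1))
    return preds

def successors(x, y):
    succs = []
    if x < W - 1:
        succs.append((x + 1, y))
    if x > 0 and y < H - 1:
        succs.append((x - 1, y + 1))
    return succs

pending = {(x, y): len(predecessors(x, y)) for x in range(W) for y in range(H)}
lock = threading.Lock()
ready = Queue()
decoded = []         # stands in for the actual MB decoding work

def worker():
    while True:
        mb = ready.get()
        if mb is None:                       # poison pill from the master
            return
        while mb is not None:
            newly = []
            with lock:
                decoded.append(mb)           # "decode" the MB
                for s in successors(*mb):    # update dependency counters
                    pending[s] -= 1
                    if pending[s] == 0:
                        newly.append(s)
            mb = newly.pop() if newly else None  # tail-submit: keep one ready MB
            for other in newly:                  # queue any remaining ready MBs
                ready.put(other)

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
ready.put((0, 0))                  # master: insert the first MB
while True:                        # master: wait for the last MB
    with lock:
        if len(decoded) == W * H:
            break
    time.sleep(0.001)
for _ in threads:
    ready.put(None)
for t in threads:
    t.join()
print(len(decoded))                # 48: every MB decoded exactly once
```

Tail-submit also favors data locality: the MB a worker keeps is a neighbour of the one it just decoded, so its reference data is likely still in that worker's cache.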
Speedup and scalability
[Figure: average speedup vs. number of threads (2-32) for tail-submit right-first, tail-submit down-left-first, static scheduling, and dynamic scheduling]
- Static scheduling: load unbalance
- Dynamic scheduling: suffers from synchronization overhead
- Tail submit: reduces synchronization overhead and exploits data locality
Profiling analysis
[Figure: synchronization overhead ratio (factor of MB decoding time) vs. number of threads, for dynamic scheduling with and without tail-submit]
- Significant reduction of synchronization overhead
- Submitting new tasks to the task queue is the main source of overhead
Impact of the CABAC entropy decoder
[Figure: execution time per frame (µs) vs. number of threads (2-32), broken down into control_thread, hl_decode_mb, and decode_cabac]
- CABAC must be executed sequentially
- The CABAC execution time behavior is a side effect of the cc-NUMA architecture
Identifying the acceleration requirements
- A scalable MB-level parallelization requires:
  - Removing the CABAC bottleneck
  - Low-latency synchronization primitives
- These limiting factors offer a potential for multicore acceleration
- Multicore acceleration evaluation:
  - Dedicated and accelerated CABAC processor
  - On-chip hardware-supported synchronization
  - Evaluated using a fast trace-driven multicore simulation
Accelerating CABAC entropy decoding
[Figure: speed-up and frames per second vs. number of processors (+1 master, 1-128) for CABAC accelerations of 1.0x to 10.0x, against real-time targets of 25, 50, and 100 fps]
- FHD 25 fps: CABAC 1x, 7 worker processors
- FHD 50 fps: CABAC 1.5x, 16 worker processors
- FHD 100 fps: not enough parallelism
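The shape of these requirements can be reproduced with a toy pipeline model (the per-frame times below are invented illustrative numbers, not measurements from this work): with CABAC decoupled onto a dedicated processor, the sequential CABAC stage must fit within the per-frame budget of the target frame rate, and the parallel MB reconstruction stage needs enough workers to do the same.

```python
import math

# Toy model of the acceleration requirements. Both per-frame times are
# ILLUSTRATIVE assumptions, not measurements from this work.
T_CABAC = 35.0      # ms of sequential CABAC decoding per FHD frame (assumed)
T_MB_TOTAL = 160.0  # ms of parallelizable MB reconstruction per frame (assumed)

def min_cabac_speedup(target_fps: float, t_cabac: float = T_CABAC) -> float:
    """CABAC acceleration needed so the sequential stage fits the frame budget."""
    frame_budget_ms = 1000.0 / target_fps
    return t_cabac / frame_budget_ms

def min_workers(target_fps: float, t_total: float = T_MB_TOTAL) -> int:
    """Workers needed for the parallel stage, assuming perfect scaling."""
    frame_budget_ms = 1000.0 / target_fps
    return math.ceil(t_total / frame_budget_ms)

for target in (25, 50, 100):
    print(target, round(min_cabac_speedup(target), 2), min_workers(target))
```

With these assumed numbers, doubling the target frame rate doubles the required CABAC acceleration, which matches the qualitative trend in the slide (1x sufficing at 25 fps but not at 50 fps); the real figures above additionally reflect imperfect worker scaling.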
Accelerating thread synchronization
[Figure: speed-up and frames per second vs. number of processors (+1 master, 1-128) for synchronization latencies from 1 ns to 10000 ns, compared with the measured Altix latencies (sync-altix-1p, sync-altix-sw) and real-time targets of 25, 50, and 100 fps]
- Altix synchronization time: 1000-5000 ns; without contention: 500-1000 ns
- FHD 25 fps: synchronization latency 500 ns, 7 workers
- FHD 50 fps: synchronization latency 100 ns, 16 workers
Outline
1. Introduction
2. H.264 video decoder
3. Experimental evaluation
4. Conclusions
Backup slides
Conclusions
- Limitations to scalability:
  - Load unbalance
  - Synchronization overhead
  - CABAC sequential bottleneck
- Implementation on a cc-NUMA machine
  - Best scheduling strategy: dynamic scheduling + tail-submit
- Acceleration potential
  - Estimation of the required CABAC acceleration
  - Limits on the latency of thread synchronization
Acknowledgements
This work has been supported by:
- HiPEAC, the European Network of Excellence on High Performance and Embedded Architecture and Compilation
- The European Commission in the context of the SARC project (contract no. 27648)
- The Spanish Ministry of Education (contract no. TIN2007-60625)
Trace-driven DAG simulator
- Creates the DAG for each frame in the video using real execution traces
- Calculates the Task Processing Time (TPT) of every node n as:
  TPT(n) = w_n + s_n + MAX(TFT(pr_n))   (1)
  - w_n: the time required to process the task
  - s_n: the time required for thread synchronization
  - MAX(TFT(pr_n)): the maximum task finish time (TFT) over the immediate predecessors of the task
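Equation (1) can be turned into a small executable sketch (a simplification of the simulator described here; the example DAG and all time values are invented): finish times are computed in dependency order, and the simulated speedup is the total work divided by the latest finish time.

```python
# Sketch of evaluating equation (1) over a per-frame DAG.
# w[n] is the work time of node n, s[n] its synchronization time,
# preds[n] its immediate predecessors. Example values are invented.

def finish_times(w, s, preds):
    """TFT(n) = w_n + s_n + max(TFT(p) for p in preds[n]), in dependency order."""
    tft = {}
    remaining = set(w)
    while remaining:
        for n in list(remaining):
            if all(p in tft for p in preds[n]):
                earliest = max((tft[p] for p in preds[n]), default=0.0)
                tft[n] = w[n] + s[n] + earliest
                remaining.remove(n)
    return tft

# Tiny diamond-shaped DAG: a -> {b, c} -> d (b and c can run in parallel)
w = {'a': 2.0, 'b': 3.0, 'c': 4.0, 'd': 1.0}
s = {n: 0.5 for n in w}
preds = {'a': [], 'b': ['a'], 'c': ['a'], 'd': ['b', 'c']}

tft = finish_times(w, s, preds)
critical_path = max(tft.values())
speedup = sum(w.values()) / critical_path
print(critical_path, round(speedup, 2))  # 8.5 1.18
```

Note how the per-node synchronization time s_n lengthens the critical path directly, which is why the measured speedups fall below the zero-overhead bound.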
Base architecture: dual-core Itanium2 processor
- Intel Itanium2 processor, 1.6 GHz, 90 nm
- 16 KB I-L1, 16 KB D-L1 cache per core
- 1 MB I-L2, 256 KB D-L2 cache per core
- Shared 8 MB (I+D) L3
- 8 GB of RAM