A High-Performance Parallel CAVLC Encoder on a Fine-Grained Many-core System
Zhibin Xiao and Bevan M. Baas
VLSI Computation Lab, ECE Department
University of California, Davis
Outline
- Introduction to H.264 CAVLC Encoder
- Features of Target Fine-Grained Many-Core System
- The Proposed Parallel CAVLC Encoder
- Results and Performance Analysis
- Summary
Advanced Video Processing
- Video applications are everywhere: high-definition video, real-time video conferencing, portable handsets
Introduction to H.264/AVC Standard
- Drafted in May 2003 by the JVT, formed by the ITU and the ISO MPEG organization
- Targets range from high-definition TV to low-resolution mobile video
- Huge computational complexity, with more data dependencies and irregular processing
[Figure: H.264 encoder block diagram. Blocks: Coder Control, Transform/Quantizer, Deq./Inv. Transform, Decoder, Motion-Compensated Intra/Inter Predictor, Motion Estimator, Entropy Coding. Signals: Video Input, Video Output, Control Data, Quant. Transf. coeffs, Motion Data.]
Introduction to H.264 CAVLC Encoder
- Context-adaptive variable-length coding (CAVLC)
- Adopted in the H.264 baseline profile
- Reverse zigzag-scanned run-length coding with adaptive coding-table selection
- Up to 27 4x4 or 2x2 blocks within a macroblock, processed in order
- Less processing regularity
- Serial at the pixel level
- A SIMD approach is not feasible in this case
- Task-level parallelism is available
[Figure: 16x16 macroblock CAVLC processing order.
 (a) Luma: block -1, then 4x4 blocks 0-15 laid out as
      0  1  4  5
      2  3  6  7
      8  9 12 13
     10 11 14 15
 (b) Chroma Cb/Cr: blocks 16 and 17, then blocks 18-21 (18 19 / 20 21) and 22-25 (22 23 / 24 25).]
Introduction to H.264 CAVLC Encoder
- Five parameters of each 4x4 block are coded separately: coeff_token, sign_trail, levels, total_zeros, run_before
- The CAVLC data-flow graph has two phases: a serial scanning phase and a parameter coding phase
[Figure: CAVLC data-flow graph. Input: residual data. Serial scanning phase: data receiver, zigzag, predict nc, CAVLC scanning. Parameter coding phase: coeff_token encoder, sign_trail encoder, levels encoder, total_zeros encoder, run_before encoder, VLC packer. Output: encoded bitstream.]
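The scanning phase can be illustrated with a minimal Python sketch. It is not the authors' assembly-level implementation: the function name `cavlc_scan`, the dictionary keys, and the sample block are assumptions made for illustration. It takes one block of coefficients already in zigzag order and derives the values behind the five coded parameters.

```python
def cavlc_scan(zigzag_coeffs):
    """Derive CAVLC parameter values for one block of zigzag-ordered coefficients."""
    # Indices of nonzero coefficients, visited in reverse zigzag order
    # (highest frequency first).
    nz_pos = [i for i in range(len(zigzag_coeffs) - 1, -1, -1)
              if zigzag_coeffs[i] != 0]
    total_coeff = len(nz_pos)
    levels = [zigzag_coeffs[i] for i in nz_pos]

    # Trailing ones: at most three consecutive +/-1 values at the
    # high-frequency end; only their signs are coded (1 means negative).
    trailing_ones = 0
    for v in levels[:3]:
        if abs(v) != 1:
            break
        trailing_ones += 1
    sign_trail = [1 if v < 0 else 0 for v in levels[:trailing_ones]]

    # Zeros located before the last nonzero coefficient in zigzag order.
    total_zeros = (nz_pos[0] + 1 - total_coeff) if total_coeff else 0

    # Zeros immediately preceding each nonzero coefficient, highest frequency first.
    run_before = []
    for k, pos in enumerate(nz_pos):
        prev = nz_pos[k + 1] if k + 1 < total_coeff else -1
        run_before.append(pos - prev - 1)

    return {
        "total_coeff": total_coeff,
        "trailing_ones": trailing_ones,
        "sign_trail": sign_trail,
        "levels": levels[trailing_ones:],  # remaining levels after the trailing ones
        "total_zeros": total_zeros,
        "run_before": run_before,
    }


# Hypothetical 4x4 block, already zigzag-scanned (low frequency first).
example = [0, 3, 0, 1, -1, -1, 0, 1] + [0] * 8
params = cavlc_scan(example)
```

The sketch returns a run for every nonzero coefficient. In an actual bitstream the run of the lowest-frequency coefficient is implied by total_zeros, and run coding stops once no zeros remain.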
Target Many-core System Architecture
- Key features
  - 164 enhanced programmable processors
  - 3 dedicated-purpose processors
  - 3 shared memories
  - Long-distance circuit-switched communication network
  - Dynamic voltage and frequency scaling (DVFS)
[Figure: chip diagram. Labels: tile with core, comm, osc, and DVFS; Motion Estimation, FFT, and Viterbi Decoder; 16 KB shared memories.]
Project Motivation and Mapping Methodology
- Fine-grained many-core system for DSP applications
  - Energy efficient
  - Scalable performance
  - Highly flexible
- Mapping methodology: sequential C code, then parallel C code, then fine-grained assembly-level code
Parallel CAVLC: Memory Optimization
- coeff_token table selection
  - Encodes the number of nonzero coefficients (nnz) in the current 4x4 block
  - The table index depends on the top and left 4x4 blocks
  - A row of nnz values from previous blocks has to be stored in the large shared memory
  - 720p HDTV: 324-word memory for nnz
- Table elimination and compression
  - Levels are encoded at runtime
  - Table memory for coeff_token, total_zeros, and run_before is reduced by more than 75%
  - Width compression
  - Zero-value reduction
[Figure: 720p frame as a grid of 80 x 45 macroblocks, each containing 4x4 blocks numbered 0-15.]
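The dependence of the coeff_token table on the top and left blocks can be sketched as follows. This follows the nC rule of the H.264 standard for blocks other than chroma DC; the function names and the use of `None` for an unavailable neighbor are assumptions made for illustration, not the authors' code.

```python
def predict_nc(left_nnz, top_nnz):
    """Predict nC from the nnz of the left and top 4x4 blocks.

    Each argument is the neighbor's nonzero-coefficient count,
    or None when that neighbor is unavailable.
    """
    if left_nnz is not None and top_nnz is not None:
        return (left_nnz + top_nnz + 1) >> 1  # rounded average
    if left_nnz is not None:
        return left_nnz
    if top_nnz is not None:
        return top_nnz
    return 0


def coeff_token_table(nc):
    """Map nC to one of the four coeff_token code tables."""
    if nc < 2:
        return 0
    if nc < 4:
        return 1
    if nc < 8:
        return 2
    return 3  # fixed-length 6-bit codes


# Hypothetical neighbors: 2 nonzero coefficients on the left, 3 on top.
table_index = coeff_token_table(predict_nc(2, 3))
```

Because the top neighbor belongs to the previous row of blocks, its nnz value must outlive the current macroblock, which is why a row of nnz values is kept in the shared memory.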
CAVLC Partition and Dataflow Mapping
- A 20-processor mapping
  - No long-distance links
  - 8 routing processors
- Every encoding block fits into a fine-grained core (128-word instruction and data memory)
[Figure: 20-processor mapping. Labels: data_in, Data Receiver, Zigzag Reorder, CAVLC Scanning, Luma nc Predicting, Chroma nc Predicting, Coeff_token, Sign_trail, Levels P1, Levels P2, Total_zeros, Run_before, VLC Binary Packing, data_out, 8 Router processors, and a 16 KB shared memory.]
Mapping and Throughput Optimization
- 15-processor mapping
  - 4 long-distance links
  - 5 routing processors removed
- Throughput optimization
  - Readjust workload
  - Code optimization
[Figure: 15-processor mapping. Labels: data_in, Data Receiver, Zigzag Reorder, CAVLC Scanning, Luma nc Predicting, Chroma nc Predicting, Coeff_token, Sign_trail, Levels P1, Levels P2, Total_zeros, Run_before, VLC Binary Packing, data_out, Router1, Router2, Router3, and a 16 KB shared memory.]
[Figure: the mapping annotated with 15 numeric labels before and after throughput optimization.
 Before: 550 760 172 132 245 34 53 11 19 19 29 34 555 1203 1548
 After: 377 600 172 44 153 34 53 160 11 44 29 34 426 463 420]
Parallel CAVLC Encoder Performance
- Throughput
  - Five QCIF video test sequences with the quantization parameter (QP) varying from 10 to 40
  - Scaled performance reaches 30 fps for 720p HDTV (1280x720)
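As a rough worked example of what 30 fps at 720p demands, the sketch below computes the macroblock rate and the cycle budget per macroblock. It uses the 1.07 GHz clock quoted in the power estimation slide. The helper names are hypothetical, and scaling QCIF results to 720p by macroblock count is an assumption: the slides do not state how the scaling was done.

```python
def macroblocks_per_frame(width, height, mb_size=16):
    """Number of 16x16 macroblocks in one frame."""
    return (width // mb_size) * (height // mb_size)


def cycles_per_macroblock(width, height, fps, clock_hz):
    """Clock cycles available per macroblock at a given frame rate."""
    mbs_per_second = macroblocks_per_frame(width, height) * fps
    return clock_hz / mbs_per_second


mbs_720p = macroblocks_per_frame(1280, 720)   # 80 * 45 macroblocks
mbs_qcif = macroblocks_per_frame(176, 144)    # QCIF is 176x144
budget = cycles_per_macroblock(1280, 720, 30, 1.07e9)

# Assumed scaling rule: a 720p frame costs as much as this many QCIF frames.
qcif_frames_per_720p_frame = mbs_720p / mbs_qcif
```

Under these assumptions the slowest pipeline stage has a little under 10,000 cycles to spend on each macroblock.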
Performance Comparison with General-Purpose CPUs
- Compared against Intel Core 2 Duo, Intel Pentium 4, and Pentium 4 HT
- Throughput is 4.86-6.83 times higher
- Scaled area is 20.2 times smaller
Performance Comparison: Traditional DSPs
- Performance estimation on DSPs: CAVLC takes 18.2% of the computation time of an H.264 baseline encoder
- 1.0-6.15 times higher throughput and 6.2 times smaller area compared to the TI C642 DSP
- Scaled to 65nm
- More demanding test for our design

| Platform | Target App. | Processor Type | Tech. | Area (mm^2) | Freq. (MHz) | Scaled Area to 65nm (mm^2) | Scaled Freq. to 65nm (MHz) | Test Sequence | CAVLC Performance (fps 720p) |
| TI C642 | CIF 24fps | 8-way VLIW | 130nm CMOS | 72 | 600 | 18 | 1200 | 50 frames IPPP...P QP=25 | 28 |
| ADSP BF561 | CIF 30fps | Dual-core DSP | 130nm CMOS | N/A | 600 | N/A | 1200 | N/A | 36 |
| TI C641 | QCIF 24.5fps | 8-way VLIW | 130nm CMOS | 72 | 600 | 18 | 1200 | 100 frames IPPP P QP=28 | 7.4 |
| This work (AsAP) | 720p HDTV 30fps | Array (15 cores) | 65nm CMOS | 2.89 * | 1070 | 2.89 * | 1070 | 2 frames IP QP=20 | 36.0-41.3 |
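The scaled columns are consistent with a simple technology scaling rule, sketched below: area shrinks with the square of the feature size and frequency grows in inverse proportion to it. This rule is inferred from the table values (72 to 18 mm^2, 600 to 1200 MHz) rather than stated in the slides, and the function names are hypothetical.

```python
def scale_area(area_mm2, from_nm, to_nm):
    """Scale die area with the square of the feature-size ratio."""
    return area_mm2 * (to_nm / from_nm) ** 2


def scale_freq(freq_mhz, from_nm, to_nm):
    """Scale clock frequency inversely with feature size."""
    return freq_mhz * (from_nm / to_nm)


dsp_area_65nm = scale_area(72, 130, 65)    # TI DSP area scaled to 65nm
dsp_freq_65nm = scale_freq(600, 130, 65)   # TI DSP frequency scaled to 65nm
area_ratio = dsp_area_65nm / 2.89          # versus the 15-core array
```

With these inputs the area ratio comes out near the 6.2 times figure quoted for the DSP comparison.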
Processor Activity Analysis & Power
- Processor activity types
  - Execution
  - Stalls on input or output
- Analysis
  - Data receiving: stall on output
  - 7%-65% active time for most processors
  - Bottleneck: zigzag reorder and CAVLC scanning, over 94% active time
- Power estimation
  - One processor: 59 mW at 1.07 GHz, 1.3 V, 65nm, 100% active
  - Nearly zero leakage when a processor is idle
  - 15 processors plus memory: 323 mW at 1.07 GHz, 1.3 V
Summary
- Fine-grained many-core system
  - Energy efficient, scalable, and flexible
  - Exploits task-level parallelism
- The proposed parallel CAVLC encoder
  - 15 processors plus a 324-word memory, 720p HDTV at 30 fps
  - 4.86-6.83 times higher scaled throughput than the latest general-purpose processors
  - 1.0-6.15 times higher scaled throughput and 6.2 times smaller area compared with traditional DSPs
- Future work
  - Further power reduction using DVFS
  - A complete parallel H.264 baseline encoder
Acknowledgments
- Intellasys Inc.
- SRC GRC Grant 1598 and CSR Grant 1659
- ST Microelectronics
- NSF Grant 0430090 and CAREER Award 0546907
- Intel and S Machines Corporation
- UC Micro
- UCD Faculty Research Grant
The End Thank You!