Design Challenge of a QuadHDTV Video Decoder
Youn-Long Lin
Department of Computer Science, National Tsing Hua University
MPSOC27, Japan
More Pixels YLLIN NTHU-CS 2
NHK Proposes UHD TV Broadcast
- Super HiVision: 7680x4320 pixels at 60 fps (XHDTV)
- Baseband signal is 24 Gbps. Using MPEG-2 encoding chips, the signal was compressed to 25 Mbps for transmission.
- HDTV signals at present are 1.5 Gbps for baseband and 2 Mbps for compressed signals.
- High-performance compression/decompression and transmission/storage are needed for 24 Gbps → 3 Mbps
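
As a rough check on the 24 Gbps baseband figure, the sketch below computes an uncompressed bit rate from resolution and frame rate. It assumes 7680x4320 at 60 fps and 12 bits per pixel (8-bit 4:2:0 sampling); the sampling format is an assumption for illustration and is not stated on the slide.

```python
def raw_bit_rate_gbps(width, height, fps, bits_per_pixel=12):
    """Uncompressed video bit rate in Gbps (1e9 bits per second).

    bits_per_pixel=12 corresponds to 8-bit 4:2:0 sampling, which is an
    assumption made for this sketch, not a value taken from the slide.
    """
    return width * height * fps * bits_per_pixel / 1e9


# Super HiVision frame size and rate, under the assumed sampling format.
uhd_gbps = raw_bit_rate_gbps(7680, 4320, 60)
print(f"7680x4320 at 60 fps: {uhd_gbps:.1f} Gbps")
```

Under these assumptions the result lands just below 24 Gbps, which is in line with the baseband figure quoted above.
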
[Figure: relative frame sizes] UHD TV (7680x4320), QFHD TV (3840x2160), HDTV (1920x1080), SDTV
Video Coding Technology Trend
[Chart: H.264 share; data labels 5% and 69%]
Features of Video Coding Standards
Standard           MPEG-1          MPEG-2          MPEG-4             H.264
MB size            16*16           16*16 (frame)   16*16              16*16
Block size         8*8             8*8             16*16, 8*8         16*16, 16*8, 8*16, 8*8, 8*4, 4*8, 4*4
Transform          DCT             DCT             DCT/Wavelet        4*4 int transform
Entropy coding     VLC             VLC             VLC                VLC, CAVLC and CABAC
ME, MC             Yes             Yes             Yes                41 MVs per MB
Pixel accuracy     ½ pel           ½ pel           ¼ pel              ¼ pel
Reference frames   One frame       One frame       One frame          Multiple (5) frames
Picture type       I, P, B         I, P, B         I, P, B            I, P, B
Transmission rate  Up to 1.5 Mbps  2-15 Mbps       64 kbps to 2 Mbps  64 kbps to 15 Mbps
Not all H.264/AVC systems are equal
Relative Computational Complexity, by #Ref Frames and Search Range
[Table values as printed: 8, 32, 5.9, 24.6, 55.7, 1, 1, 2.54, 8.87]
Source: Video Coding with H.264/AVC: Tools, Performance and Complexity, J. Ostermann et al., IEEE CAS Mag., Q1 2004.
Quality vs Bit-rate vs Decoding Throughput
Decoding Capability of a 600 MHz CPU
QP               16    21  26
Bit Rate (Kbps)  1723  74  37
Fps              44    55  65
Source: H.264/AVC Baseline Profile Decoder Complexity Analysis, M. Horowitz, IEEE T-CSVT, July 2003
Our Target
Single-Chip Decoder for QFHD (3840x2160) H.264/AVC High Profile Video
- CABAD
- 8x8 Transform
- Commodity DDR External Memory
- Platform-Based Design
Performance
Resolution            Size   Clock Frequency  Application
SQCIF (128 x 96)      1.0    0.4 MHz          Video phone
QCIF (176 x 144)      2.0    0.8 MHz
CIF (352 x 288)       8.3    3 MHz            Mobile TV
D2 (720 x 480)        28.1   10 MHz           Car TV, Surveillance
720HD (1280 x 720)    75.0   30 MHz           Home theater
1080HD (1920 x 1088)  170.0  62 MHz
QFHD (3840 x 2160)    675.0  249 MHz          Digital signage, Medical video, Satellite image, Space exploration
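
To relate the clock frequencies in the table to per-macroblock work, the sketch below estimates how many clock cycles are available for each 16x16 macroblock. The frame rate of 30 fps is an assumption for illustration (the slide does not state it), and the function names are hypothetical.

```python
def macroblocks_per_frame(width, height):
    """Number of 16x16 macroblocks, rounding partial macroblocks up."""
    return ((width + 15) // 16) * ((height + 15) // 16)


def cycles_per_macroblock(width, height, clock_hz, fps=30):
    """Clock cycles available per macroblock; fps=30 is an assumed rate."""
    return clock_hz / (macroblocks_per_frame(width, height) * fps)


# Clock values taken from the table: 249 MHz for QFHD, 62 MHz for 1080HD.
qfhd_budget = cycles_per_macroblock(3840, 2160, 249e6)
hd_budget = cycles_per_macroblock(1920, 1088, 62e6)
print(f"QFHD:   {macroblocks_per_frame(3840, 2160)} MB/frame, "
      f"{qfhd_budget:.0f} cycles/MB")
print(f"1080HD: {macroblocks_per_frame(1920, 1088)} MB/frame, "
      f"{hd_budget:.0f} cycles/MB")
```

Under the 30 fps assumption both rows work out to roughly 250 cycles per macroblock, which suggests the table scales clock frequency linearly with frame size.
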
Essential Issues
- Memory: Tradeoff Between the Size of Internal Memory and Bandwidth of External Access
- Massive Parallelism
- Macroblock Decoding Scheduling
NTHU H.264 Decoder Architecture
[Block diagram]
- System (AHB): CPU, Display, Memory Controller, Ethernet, MAU & AMBA Interface Translator
- H.264 Video Decoder blocks: Parser, CAVLD/CABAD, IQ & IT, MVG, IPRED, INTERP, BSG, DF
- Signals between blocks: coeff, mvdinfo, residual, mv & ridx, recon, bs, para & predinfo
Memory
Memory size vs. bandwidth in ME
Full HD, 30 fps, # of rf = 1, SRV = SRH = 64
- Level A: 24 Bytes, 19658 MB/s
- Level B: 12 Bytes, 15 MB/s
- Level C: 4977 Bytes, 317 MB/s
- Level D: 124,929 Bytes, 62 MB/s
[Plot: Memory Size (Bytes) vs. Memory Bandwidth (MB/s), points A to D]
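
The Level C and Level D sizes can be reproduced with a short calculation. The formulas below are an assumption: they follow the usual data-reuse classification for full-search block matching (Level C keeps one search-window strip, Level D keeps full-width rows of the reference frame), and they happen to match the figures on the slide for a 16x16 macroblock. The function names are hypothetical.

```python
MB_SIZE = 16  # macroblock width in pixels (assumed)


def level_c_bytes(sr_h, sr_v, n=MB_SIZE):
    """On-chip bytes if one search-window strip is kept (assumed formula)."""
    return (sr_h + n - 1) * (sr_v - 1)


def level_d_bytes(frame_width, sr_h, sr_v):
    """On-chip bytes if full-width reference rows are kept (assumed formula)."""
    return (frame_width + sr_h - 1) * (sr_v - 1)


def read_once_bandwidth_mb_s(width, height, fps):
    """Bandwidth when every reference luma pixel is fetched exactly once."""
    return width * height * fps / 1e6


c_size = level_c_bytes(64, 64)
d_size = level_d_bytes(1920, 64, 64)
d_bw = read_once_bandwidth_mb_s(1920, 1080, 30)
print(c_size, d_size, f"{d_bw:.1f} MB/s")
```

With SRH = SRV = 64 and a 1920-pixel-wide frame this gives 4977 and 124,929 bytes, and about 62 MB/s for the read-once bandwidth, which illustrates the tradeoff: the largest buffer brings external traffic down to one read per reference pixel.
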
IME block diagram
[Block diagram] Blocks: rf mem, rf1 mem, CB mem, rf AG, CB AG, rf router, rf reg array, CMB reg (x4), comparator (x4), MVGen rf (x8), MV AG, MV mem
Memory size vs. bandwidth in ME
[Plot: Memory Size (Bytes) vs. Memory Bandwidth (MB/s), points A to D plus the proposed design, marked "ours"]
Reference-data Pre-fetch System
- No redundant fetching: collect the motion vectors of several MBs and read each shared location with a single operation
- Minimize the number of burst initials: on average 2 burst initials per MB (1 for luma, 1 for chroma)
- Figure legend: each marked group is one run of sequential reads (a burst read)
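
The idea of collecting several requests and reading shared data only once can be sketched as interval merging. This is a simplified one-dimensional illustration, not the decoder's actual logic: the address ranges, the 21-pixel row length (16 pixels plus 5 extra for 6-tap interpolation) and the function name are all assumptions.

```python
def merge_fetches(intervals):
    """Merge overlapping or touching [start, end) address ranges.

    Each merged range can then be read as one burst, so no address is
    fetched twice. Hypothetical helper for illustration only.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous burst: extend it instead of re-reading.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


# One reference row requested by four neighbouring macroblocks (made-up data).
requests = [(0, 21), (16, 37), (32, 53), (100, 121)]
bursts = merge_fetches(requests)
pixels_naive = sum(end - start for start, end in requests)
pixels_merged = sum(end - start for start, end in bursts)
print(bursts, pixels_naive, pixels_merged)
```

In this made-up case four separate reads collapse into two bursts, and the number of pixels fetched drops from 84 to 74.
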
Reference-data Pre-fetch System (Cont)
[Block diagram] Data path: CABAC, Motion Vector Generator, Translator, Reference Region & Index Register (region entries R1 to R7, each listing the MBs that reference it), Region Analyzer / Searcher, OES manager, MAU Interface, Buffer, Interp
Labels on the arrows: MB7 MV, Region Information, R2 Information, MB7 Information, Data from SDRAM
Massive Parallelism
RLD/IQ/IDCT Timing Diagram
[Timing diagram] Rows: coeflag_mem read, coeff_mem read, IQ stage 1, IQ stage 2, IDCT stage 1, IDCT stage 2, residual_mem write
Data order in each row: luma ac blocks (through luma ac_14_15), dc, chroma ac blocks (through chroma ac_6_7)
DF Timing Diagram
Dual Pipelined Edge Filter
- Stage 1: Read Pixels
- Stage 2: Strong filter (Bs=4) / Left delta calculation / R21 delta calculation
- Stage 3: Left Weak Filter (Bs<4) / Right delta calculation / R21 filter
- Stage 4: Right Weak filter (Bs<4)
- Stage 5: Write Pixels
[Figure: pixel blocks L, M and R moving through the five stages]
System-Level Optimization Cyclic-Queue-Based IP Interface
Sequential Decoder Timing Diagram (I Frame)
[Timing diagram] Rows: PARSER, CABAD, IQ/IT, BSG, IPRED, DF; horizontal axis: time
Events: Header information decode; Initial context table and condition offset; MB 0 decode, MB 1 decode, MB 2 decode
Elastic Pipeline Decoder Timing Diagram (I Frame)
[Timing diagram] Rows: PARSER, CABAD, IQ/IT, BSG, IPRED, DF; horizontal axis: time
Events: Header information decode; Initial context table and condition offset; MB 0 decode through MB 6 decode
ASAP Decode with Cyclic Queue Timing Diagram (I frame)
[Timing diagram] Rows: PARSER, CABAD, IQ/IT, BSG, IPRED, DF; horizontal axis: time
Events: Header information decode; Initial context table and condition offset; MB 0 decode through MB 8 decode
Comparison of Different Scheduling Methods
[Bar chart] Methods: Sequential, Elastic Pipeline, ASAP Ping-Pong, ASAP Cyclic-queue
Series: SRAM Usage (KB), Turnaround Cycle and Processing Cycle (Cycles/MB)
Test Pattern: pedestrian; Resolution: 720*480; QP: 28; GOP: III; Frame #: 3
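
Why deeper buffering between stages can reduce cycles per macroblock is easier to see with a toy model. The sketch below is a simplified simulation under stated assumptions, not the scheduler used in the decoder: each stage handles macroblocks in order, adjacent stages are connected by a FIFO with `depth` slots (depth 2 stands in for ping-pong buffering, larger depths for a cyclic queue), and the per-stage cycle counts are invented.

```python
def total_cycles_sequential(stage_cycles):
    """One macroblock at a time, one stage at a time."""
    return sum(sum(mb) for mb in stage_cycles)


def total_cycles_pipelined(stage_cycles, depth):
    """stage_cycles[i][s] is the cycles stage s spends on macroblock i.

    depth is the number of buffer slots between adjacent stages
    (a hypothetical model of ping-pong or cyclic-queue buffering).
    """
    n_stages = len(stage_cycles[0])
    finish = []
    for i, mb in enumerate(stage_cycles):
        finish.append([0] * n_stages)
        for s in range(n_stages):
            t = 0
            if i > 0:
                t = max(t, finish[i - 1][s])          # stage still busy
            if s > 0:
                t = max(t, finish[i][s - 1])          # input not ready yet
            if s + 1 < n_stages and i >= depth:
                t = max(t, finish[i - depth][s + 1])  # no free output slot
            finish[i][s] = t + mb[s]
    return finish[-1][-1]


# Two stages, four macroblocks, with the load shifting between stages.
workload = [(1, 4), (1, 4), (4, 1), (4, 1)]
seq = total_cycles_sequential(workload)
ping_pong = total_cycles_pipelined(workload, depth=2)
cyclic = total_cycles_pipelined(workload, depth=4)
print(seq, ping_pong, cyclic)
```

With this invented workload the totals are 20, 14 and 11 cycles. The deeper queue lets a fast stage run ahead while its neighbour is slow, at the cost of more buffer memory, which is the same SRAM-versus-cycles tradeoff the comparison above reports.
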
Verification Environment
[Figure: project directory tree, annotated "Easy Bug Tracing"]
- H264: filelist, tbench, fpga_lib, rtl_sim, asic_lib, gate_sim, vn, nlint, netlist, syn, def
- Top-level modules: mfu, amba_wrap, top, lm_wrap, main_ctrl, hd_amba
- Sub IP: mvg, bsg, parser, mem, cabad, idct, ipred, interp, df
- Per-IP layout: filelist, tbench, rtl_sim, rtl, syn, vn, nlint, gate_sim
- Memory libraries: xilinx_mem, altera_mem, artisan_mem
- Reference software: jm11.
A Multimedia SOC Platform
[Block diagram]
- High-Speed Bus: CPU, Accelerator (FPGA), VIC, USB 2.0, Static memory, SDRAM Controller (4-CH), JPEG Codec, DMA, SRAM, Capture, Display Controller, APB Bridge
- Peripheral Bus: PWM, WDT, TIMER, DAI, SSI, SD, SM, UART, GPIO, I2C
- Off-chip: USB (PHY), Daughter Board, ROM/Flash Memory, SRAM, SDRAM, FPGA, Audio Codec (I2S), Flash memory with SSI, Video-In (CCIR601), TV/LCD, Flash Card, Button, LED
Summary
- Super High Definition Video Capturing, Delivery and Display are on the Horizon
- Massive Parallelism is Essential for Making Consumer Applications Possible
- Tradeoff Among Memory Usage, Bandwidth and Logic Has Profound Impact on the Overall System Performance
- System Design Should Be Adaptable to Content and Quality Variation