
A Low-Power 0.7-V H.264 720p Video Decoder
D. Finchelstein, V. Sze, M.E. Sinangil, Y. Koken, A.P. Chandrakasan
A-SSCC 2008

Outline
- Motivation for low-power video decoders
- Low-power techniques
  - pipelining and parallelism
  - independent voltage/clock domains
  - efficient memory accessing
- ASIC results
- Comparison with state of the art
- Summary

Motivation
[Pictures: iPhone, DVC, digital camera, PSP]
- High demand for video capture and playback on mobile devices
- H.264: state-of-the-art video coding standard
- Goal: ultra-low-power H.264 decoder in 65 nm, 1280x720 @ 30 fps

H.264 Decoder Architecture
[Block diagram: bitstream input -> ED -> FIFOs (MVS, MODES, COEFFS) -> parallel luma and chroma paths, each with MC, INTRA, MUX, adder, and DB; shared IT and MEM blocks; off-chip: frame buffer (ZBT SRAM) and YUV->RGB conversion (FPGA)]
- Pipelined, highly parallel architecture to reduce voltage (and power)

Pipeline FIFO Sizing
[Plot: normalized system throughput (1 to 1.5) vs. FIFO depth (1, 2, 4, 8, 16, 32, 256), with the sizes used on this chip marked]
- Pipeline stages have variable latencies
  - e.g., ED latency is 0-33 cycles per 4x4 block
- Larger FIFOs help average out the workload
  - increase performance by up to 45%
- FIFOs of depths 1-4 chosen to reduce area
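The effect of FIFO depth on throughput can be illustrated with a small two-stage model. This is only a sketch under stated assumptions: the producer latency range (0-33 cycles) comes from the slide, but the uniform distribution, the consumer latency range, and the two-stage structure are hypothetical, so the numbers will not reproduce the chip's measured 45%.

```python
import random

def total_cycles(prod_lat, cons_lat, depth):
    """Cycles needed to push all items through producer -> FIFO -> consumer."""
    n = len(prod_lat)
    deposit = [0] * n  # cycle at which item i enters the FIFO
    start = [0] * n    # cycle at which the consumer pops item i
    end = [0] * n      # cycle at which the consumer finishes item i
    for i in range(n):
        prev_deposit = deposit[i - 1] if i > 0 else 0
        produced = prev_deposit + prod_lat[i]
        # The FIFO is full until the consumer has popped item i - depth.
        slot_free = start[i - depth] if i >= depth else 0
        deposit[i] = max(produced, slot_free)
        prev_end = end[i - 1] if i > 0 else 0
        start[i] = max(deposit[i], prev_end)
        end[i] = start[i] + cons_lat[i]
    return end[-1]

rng = random.Random(0)  # fixed seed so the run is repeatable
N_BLOCKS = 20000
# Producer: 0-33 cycles per 4x4 block (range from the slide, uniform is assumed).
prod_lat = [rng.randint(0, 33) for _ in range(N_BLOCKS)]
# Consumer: hypothetical 8-24 cycles per block, chosen only for illustration.
cons_lat = [rng.randint(8, 24) for _ in range(N_BLOCKS)]

DEPTHS = [1, 2, 4, 8, 16, 32, 256]
cycles = {d: total_cycles(prod_lat, cons_lat, d) for d in DEPTHS}
# Throughput relative to a depth-1 FIFO.
normalized = {d: cycles[1] / cycles[d] for d in DEPTHS}

for d in DEPTHS:
    print(f"depth {d:3d}: normalized throughput {normalized[d]:.3f}")
```

In this model a deeper FIFO can never reduce throughput, and the gain flattens once the FIFO is deep enough to absorb the latency variation, which is the qualitative shape the slide's plot describes.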

Deblocking Filter Parallelism
- 4 edges in parallel
- Process entire 4x4 edge (4 filters) in parallel
- Filter luma and chroma in parallel
- 192 cycles reduced to ~46 cycles per 16x16 block

Deblocking Filter Architecture
[Block diagram: 4x4 block in (P IN, Q IN) -> 4 parallel filters, with datapath and control for bs=1 to 3 (threshold calc, clip), bs=4 (threshold), and bs=0, selected by boundary strength (bs) -> 4x4 block out (P OUT, Q OUT); last-line cache 104 kb [SRAM]; internal memory 4x(4x4x8b) [DFF]]

Motion Compensation (MC)
[Figure: 4x4 block in current frame, with reference blocks for vector (1, -1) and vector (0.5, -0.5)]
- Use two interpolators in parallel
- Interpolate luma and chroma in parallel
- 176 cycles reduced to ~72 cycles per 16x16 block
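A fractional vector such as (0.5, -0.5) means the reference block must be interpolated. As a minimal sketch, the H.264 standard's one-dimensional luma half-sample filter, with taps (1, -5, 20, 20, -5, 1), can be written as below. The function names are hypothetical, 8-bit samples are assumed, and this is not the chip's datapath. The diagonal half-sample position applies the same taps in both dimensions at higher intermediate precision, which the sketch omits.

```python
def clip_pixel(value):
    """Clamp to the 8-bit sample range (assumes 8-bit video)."""
    return max(0, min(255, value))

def half_pel(e, f, g, h, i, j):
    """Half-sample value between g and h from six integer samples in a row."""
    acc = e - 5 * f + 20 * g + 20 * h - 5 * i + j
    # The taps sum to 32, so add 16 for rounding and shift right by 5.
    return clip_pixel((acc + 16) >> 5)

print(half_pel(100, 100, 100, 100, 100, 100))  # flat area stays flat
print(half_pel(0, 0, 0, 255, 255, 255))        # midpoint of a step edge
```

Each output sample needs six input samples, so a 4x4 block with a fractional vector reads a 9x9 reference area.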

Parallel MC Interpolators
[Diagram: Entropy Decoding (ED) supplies MV0 and MV1 to interpolators MC0 and MC1; the Frame Buffer (FB) supplies column0 and column1; 6:1 filter stages; 9 columns per 4x4 block; the 4x4 block rows of a 16x16 block, (0 1 4 5), (2 3 6 7), (8 9 12 13), (10 11 14 15), alternate between MC0 and MC1]
- Interpolators can run in the same cycle when
  - motion vectors are all available
  - memory interface supplies 2 columns per cycle
- Interpolators are synchronized
  - MC0: even 4x4 rows, MC1: odd 4x4 rows
  - shared interpolation data reused

Clock & Voltage Domains
[Bar chart: average cycles per block (0 to 20) for MEM_luma, MEM_chroma, ED, DB_chroma, DB_luma, MC_luma, IT, and MC_chroma, labeled by domain (memory controller, core)]
[Diagram: core domain (V low, CLK slow) and memory controller domain (V high, CLK fast), linked by level shifters and a dual-clock FIFO built from a DFF array and FIFO logic]
- Decouple voltage / clock domains
  - lower core voltage and frequency
  - 25% power savings vs. single domain
  - 4% further savings if we used 3 domains
- Dual-clock FIFOs and level shifters link the domains

Workload Variation

                                P-frame      I-frame
Core Domain [MHz @ V]           14 @ 0.70    53 @ 0.90
Memory Controller [MHz @ V]     50 @ 0.84    25 @ 0.76

Relative power @ 720p, 1 I-frame and 14 P-frames:

        Core Domain [MHz @ V]     Memory Controller [MHz @ V]    Power [%]
MAX     53 @ 0.90                 50 @ 0.84                      100
DVFS    14 @ 0.70, 53 @ 0.90      25 @ 0.76, 50 @ 0.84           73
FA      17 @ 0.72                 48 @ 0.83                      73

[Plot: power vs. workload (up to 100%), no DVFS compared with perfect DVFS; the gap is marked ΔP]
- INTER-INTRA workload variation
- MAX: maximum frequency on each domain
- DVFS: 1 frame every 33 ms
- Frame Averaging (FA): 15 frames every 15 * 33 ms
  - switches less often than DVFS, but needs output buffer
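The FA operating points appear consistent with a frame-count-weighted mean of the per-frame-type frequencies. The sketch below checks that arithmetic. Reading FA as a plain weighted average is an assumption, and the variable and function names are hypothetical.

```python
# Frames per group, from the slide: 1 I-frame and 14 P-frames.
FRAME_COUNTS = {"P": 14, "I": 1}

# Per-frame-type clock frequencies in MHz, from the slide.
CORE_MHZ = {"P": 14, "I": 53}
MEM_MHZ = {"P": 50, "I": 25}

def frame_average(freq_mhz, counts):
    """Frame-count-weighted mean frequency (assumed meaning of FA)."""
    total_frames = sum(counts.values())
    return sum(freq_mhz[kind] * counts[kind] for kind in counts) / total_frames

core_fa = frame_average(CORE_MHZ, FRAME_COUNTS)  # (14*14 + 53) / 15
mem_fa = frame_average(MEM_MHZ, FRAME_COUNTS)    # (50*14 + 25) / 15

print(f"core FA frequency: {core_fa:.1f} MHz")
print(f"memory controller FA frequency: {mem_fa:.1f} MHz")
```

The results round to 17 MHz and 48 MHz, matching the FA row. Running every frame at the average rate is why FA needs an output buffer: the I-frame takes longer than one 33 ms frame period to decode.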

MC Data Overlap
[Figure: current 4x4 block and its interpolation area, horizontal and vertical neighbors, and the overlapped interpolation area]
- Neighboring 4x4 blocks with the same MVs
  - overlap area shared horizontally and vertically
- Reduced MC read bandwidth
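The size of the overlap can be estimated by counting reference pixels. The sketch below assumes a 6-tap luma interpolation filter, as in the H.264 standard, so a WxH block with a fractional vector reads a (W+5)x(H+5) area. It compares fetching neighboring 4x4 blocks separately with fetching their merged area once. The function name is hypothetical and the chip's actual sharing scheme may differ.

```python
FILTER_TAPS = 6  # H.264 luma interpolation filter length

def reference_pixels(width, height):
    """Reference pixels read to interpolate a width x height block."""
    margin = FILTER_TAPS - 1
    return (width + margin) * (height + margin)

single = reference_pixels(4, 4)         # one 4x4 block: 9x9 area
pair_separate = 2 * single              # two neighbors fetched independently
pair_shared = reference_pixels(8, 4)    # same MV: one merged 13x9 area
mb_separate = 16 * single               # all 16 blocks of a 16x16 block
mb_shared = reference_pixels(16, 16)    # same MV everywhere: one 21x21 area

print(f"4x4 pair: {pair_separate} -> {pair_shared} pixels")
print(f"16x16 block: {mb_separate} -> {mb_shared} pixels")
print(f"pair saving: {1 - pair_shared / pair_separate:.0%}")
```

These counts are an upper bound on what sharing can save, since they apply only when neighbors have identical motion vectors.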

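Last-line On-chip Caches
[Figure: 16x16 block with its top-left, top, top-right, and left neighbors, used for deblocking and intra prediction; intra prediction pixels labeled M and A-L]
[Bar chart: cache size (0 to 120 kb) for DB, INTRA, MC, and ED, with bandwidth annotations 807 Mbps, 122 Mbps, 26 Mbps, and 21 Mbps]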

Off-chip Bandwidth
[Bar chart: P-frame off-chip BW (0 to 2 Gbps) for original, caching, and cache & fewer reads, with reductions of 26% and 19% marked]
- Frame buffer off-chip (1.4 MByte per frame)
- P-frames more common than I-frames
- P-frame off-chip BW larger due to MC
- 40% (0.9 Gbps) total reduction
  - last-line caches
  - avoided redundant reads in MC
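The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes 8-bit 4:2:0 video (1.5 bytes per pixel) and assumes that the 19% step is relative to the bandwidth remaining after caching. Neither assumption is stated on the slide, and the original bandwidth is derived here rather than quoted.

```python
WIDTH, HEIGHT = 1280, 720

# Assumption: 8-bit 4:2:0, i.e. one luma byte plus half a byte of chroma per pixel.
frame_bytes = WIDTH * HEIGHT * 3 // 2
print(f"frame buffer: {frame_bytes / 1e6:.2f} MByte per frame")

# Reductions read from the bar chart.
caching_cut = 0.26      # last-line caches
fewer_reads_cut = 0.19  # avoided redundant MC reads (assumed relative to the cached BW)

remaining = (1 - caching_cut) * (1 - fewer_reads_cut)
total_cut = 1 - remaining
print(f"combined reduction: {total_cut:.0%}")

# The slide gives the saving as 0.9 Gbps; back out the implied original bandwidth.
saved_gbps = 0.9
original_gbps = saved_gbps / total_cut
print(f"implied original P-frame BW: {original_gbps:.2f} Gbps")
```

Compounding the two steps gives about 40%, which agrees with the slide's total. Adding them would give 45%, so the compounded reading seems the more likely one.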

Voltage Scalable SRAM
[Circuit: 8T SRAM cell; pseudo-differential sense amplifier with global snsref (signals RDBL, snsen, snsref, DOUT)]
- Low-voltage SRAM needed
  - typical 6T SRAMs fail at low voltages
  - 8T SRAMs work down to 0.5 V
- Write assist to improve writability at low voltages
- Extra 2 Tx ensure read stability at low voltages

H.264 Decoder ASIC
[Die photo: 3.3 mm x 3.3 mm, 176 I/O pads; regions: caches, core domain, memory controller domain]

Decoder statistics
Area (w/o pads):     2.76 x 2.76 mm²
Area utilization:    31%
Technology:          65 nm
I/O pads:            176
On-chip SRAM:        17 kB

Area Breakdown
- Cache area 3x larger than logic
  - caches 75%, logic 25%
  - standard cells: 134k
- By module: DB 56%, MC 16.6%, INTRA 15.6%, IT 8.2%, ED 1.6%, MISC 1.4%
- Parallelism overhead: 1.5% of active chip area
  - 4 luma + 2 chroma filters: 1.5% of DB
  - 2 luma + 4 chroma interpolators: 9% of MC

Power Measurements

720p video               mobcal       shields      parkrun
Input bitrate [Mbps]     5.4          7.0          26
Core [MHz @ V]           14 @ 0.70    14 @ 0.70    25 @ 0.80
MEM [MHz @ V]            50 @ 0.84    50 @ 0.84    50 @ 0.84
Power [mW]               1.8          1.8          3.2

[Histogram: distribution across 15 dies (count 0 to 10) of minimum core Vdd @ 720p, 0.69 V to 0.72 V]
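Two derived figures help put these measurements in context: the core cycle budget per 16x16 block and the energy per decoded pixel. The sketch below computes both from the slide's numbers for the 1.8 mW operating point. The variable names are hypothetical, and the energy figure covers only the measured chip power.

```python
WIDTH, HEIGHT, FPS = 1280, 720, 30

pixels_per_second = WIDTH * HEIGHT * FPS
macroblocks_per_frame = (WIDTH // 16) * (HEIGHT // 16)  # 80 x 45 blocks of 16x16

# Core clock and power for the mobcal and shields sequences, from the table.
core_hz = 14e6
power_w = 1.8e-3

cycles_per_mb = core_hz / (macroblocks_per_frame * FPS)
pj_per_pixel = power_w / pixels_per_second * 1e12

print(f"throughput: {pixels_per_second / 1e6:.1f} Mpixels/s")
print(f"core cycle budget: {cycles_per_mb:.1f} cycles per 16x16 block")
print(f"energy: {pj_per_pixel:.1f} pJ per pixel")
```

The budget of roughly 130 cycles per 16x16 block is consistent with the earlier slides, where the parallel deblocking filter (~46 cycles) and parallel motion compensation (~72 cycles) each fit within it.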

Power Breakdown
- P-frame: MC 42%, DB 26%, pipeline control & FIFOs 19%, MEM write 7%, ED 3%, IDCT 1%, INTRA 1%
- Within MC: MEM read 75%, interpolators 20%, motion vector predictor 5%
- P-frame power dominated by:
  - MC (frame buffer reads)
  - deblocking filter

Survey of Other Decoders
[Plot: power (0.01 mW to 1 W) vs. throughput (0.1 to 100 Mpixels/s), with resolutions QCIF, CIF, D1, 720p, and 1080p marked at 15 fps and 30 fps; operating points of this work are annotated with core domain and memory controller supply voltages: 0.5 V, 0.5 V, 0.55 V, 0.66 V, 0.68 V, 0.70 V, 0.74 V, 0.85 V, 0.85 V, 1.15 V]
Legend ([work] - process, profile):
- [3] - 130-nm, Baseline
- [4] - 180-nm, Baseline
- [5] - 180-nm, Baseline
- [6] - 180-nm, Main
- This work - 65-nm, Baseline

Summary
- Pipeline and parallelism
  - concurrency allows 14 MHz @ 720p
  - parallelism: luma DB = 4x, luma MC = 2x
- Separate voltage/clock domains
  - 25% P-frame power savings
  - DVFS on each domain for I/P-frame differences
- Efficient memory accesses
  - low-voltage on-chip caches and data reuse
  - off-chip BW lowered by 40%

Acknowledgements
- Funding: Nokia, TI, and NSERC
- Chip fabrication: TI
- Valuable feedback:
  - Nokia: J. Hicks, G. Raghavan, J. Ankcorn
  - TI: M. Budagavi, D. Buss, M. Zhou
  - MIT: Arvind, E. Fleming

Video Demo