Low Power Design of the Next-Generation High Efficiency Video Coding Authors: Muhammad Shafique, Jörg Henkel CES Chair for Embedded Systems
Outline
  Introduction to the High Efficiency Video Coding (HEVC)
  HEVC Analysis: complexity, memory access, thermal
  Power-Efficient HEVC System Design
  Conclusion
High Efficiency Video Coding (HEVC)
  Ultra-HD (or Super Hi-Vision): 7680×4320, 33 million pixels per frame
  By 2017: 80%-90% of global internet traffic is expected to be video
  New video compression standards/techniques required
  JCT-VC's High Efficiency Video Coding (HEVC): ~2× compression efficiency compared to H.264
  Full HD @ 30fps: 1 second = 712 Mbits; 1 hour = 2.4 Tbits
  [Figure, panels (a) and (b): normalized time and bitrate, and memory bandwidth [GB/s], HEVC vs. H.264/AVC; sequences Basketball, Kimono, PeopleOnStreet; resolutions HD720, HD1080, 2K]
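The Full HD figures on this slide can be checked with a few lines of arithmetic. This is only a sketch under two assumptions that the slide does not state: 8-bit 4:2:0 sampling (12 bits per pixel on average) and binary prefixes (1 Mbit = 2^20 bits, 1 Tbit = 2^40 bits). The function name is illustrative.

```python
def raw_bits_per_second(width, height, fps, bits_per_pixel=12):
    """Uncompressed video data rate in bits per second.

    bits_per_pixel=12 assumes 8-bit 4:2:0 sampling (an assumption,
    not stated on the slide).
    """
    return width * height * bits_per_pixel * fps


full_hd = raw_bits_per_second(1920, 1080, 30)

mbits_per_second = full_hd / 2**20      # about 712 with binary prefixes
tbits_per_hour = full_hd * 3600 / 2**40  # about 2.4 with binary prefixes
uhd_pixels = 7680 * 4320                 # about 33 million pixels per frame

print(round(mbits_per_second), round(tbits_per_hour, 1), uhd_pixels)
```

Under these assumptions the sketch reproduces the 712 Mbits per second and 2.4 Tbits per hour quoted above, which suggests the slide refers to uncompressed video.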
Challenges for Developing HEVC-based Multimedia Systems
  Challenges & Requirements
    Compute Complexity: Content-Awareness, HW-SW Collaboration, Many-core Systems
    Power Efficiency: Accelerator Design, Content-Awareness, Power-Gating
    Thermal Management: Thermal Analysis, Configurations, Content-Adaptive
    Parallelization: Workload Balancing, Arch.-Awareness, Power Budgeting
    Video Memory: Memory Hierarchy Design, Content-Aware
HEVC Overview: Encoding Flow
  [Block diagram: Input Video in CTUs; Transform and Quantization; Inverse Transform and Quantization; Recursive TU Size Reduction; Intra Prediction; Recursive CU/PU Size Reduction; Inter Prediction; Bitstream Headers; CABAC Entropy Coder; Decoded Picture Buffer; Deblocking and SAO Filter; Output Reconstructed Video; Output Bitstream]
HEVC Overview: Slices and Tiles
  [Figure: a frame partitioned into slices (Slice 0 to Slice 3) and into tiles (Tile 0 to Tile 5)]
  [Figure: HEVC Parallel Encoding: frames F_0 ... F_M-1 of a GOP are split into tiles T_0 ... T_K-1, which are mapped to Core 0 ... Core K-1 with per-core labels f_0 ... f_K-1]
HEVC Overview: Tree-Block Structure
  [Figure: CTUs (CTU 0, CTU 1, CTU 2) recursively split into smaller blocks; block sizes 64×64, 32×32, 16×16, 8×8 and 4×4; panels: Example CU Configuration, Example PU Configuration, Tested TU Configurations]
CTU Distribution
HEVC Overview: Intra and Inter Prediction
  HEVC Intra Prediction: modes 0 (Planar) and 1 (DC), plus horizontal and vertical angular predictors
  HEVC Inter Prediction: block sizes 64×64, 32×32, 16×16
  [Figure: intra prediction mode directions and inter prediction partitions, annotated with the summation formulas that count the mode decisions]
  HEVC-Intra: ~2.56× more mode decisions than H.264
  HEVC-Inter: ~2.2× more complex than H.264
HEVC Overview: Motion Estimation
  Block Matching (BM) or Motion Estimation (ME)
    Compression by searching temporal neighbors
    High energy/time, high compression efficiency (H.264-Inter, HEVC-Inter)
  [Figure: the current block of the current frame is matched inside a search window of the reference (previous) frame; the best match yields the motion vector and the residue frame]
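Block matching as described on this slide can be sketched as an exhaustive search over a window of the reference frame. The sketch below is a minimal illustration, not the search used by HM or by the authors: it assumes a full search with the sum of absolute differences (SAD) as the cost, frames stored as lists of rows, and function names chosen only for this example.

```python
def sad(cur, ref, bx, by, rx, ry, n):
    """Sum of absolute differences between the n x n block of `cur`
    at (bx, by) and the n x n block of `ref` at (rx, ry)."""
    return sum(abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))


def full_search(cur, ref, bx, by, n, search_range):
    """Exhaustive block matching (illustrative only).

    Returns ((dx, dy), cost): the motion vector with the lowest SAD
    within +/- search_range pixels, and that SAD.
    """
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), sad(cur, ref, bx, by, bx, by, n)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, n)
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

The cost grows with the square of the search range and with the block area, which is why the search window size dominates both the computation and the reference data that must be fetched.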
HEVC Overview: Search Data Fetching
  External Memory (DRAM): high leakage, high dynamic power
  External Memory Bus: high bus power
  On-Chip Memory (SRAM): very high leakage
  A memory subsystem with low power consumption and high efficiency is required
  [Figure: the search window of the reference frame and the current block of the current frame are fetched from external memory into on-chip memory for block matching]
HEVC Analysis: Computational Complexity
  CU/PU Partitioning
    Large partitions for low-variance and homogeneous image areas and vice-versa
    Early PU size prediction may provide significant reduction in computational and energy requirements
    Smooth texture (due to larger QP or resolution) is usually captured by larger sized PUs
  [Figure: percentage of area covered by each PU size (64×64, 32×32, 16×16, 8×8, 4×4) for BasketballDrill 832×480, ParkScene 1920×1080 and PeopleOnStreet 2560×1600, with high-variance and low-variance regions marked]
HEVC Analysis: CTU Distribution
HEVC Analysis: Memory Accesses
  Memory Access for Motion Estimation
    Memory accesses of HEVC: 3.86× those of H.264
    Most of the on-chip memory is wasted (leakage power)
    Only a part of the full search window is utilized
    Adapting the search window size at run-time provides increased potential for leakage power savings
  [Figure, panels (a) and (b): H.264 vs. HEVC memory accesses, and percentage utilization of the search window (maximum, 75%, median, 25%, minimum) for Keiba, BasketballDrill, RaceHorses, KristenAndSara]
Using a thermal camera setup
  Intel Atom 45nm dual-core processor (1.8 GHz), Linux Ubuntu kernel (Src: Intel)
  DIAS Pyroview thermal camera operates at 50 Hz with a spatial resolution of 50 µm
  Peltier-based cooling
  [Figure: bottom view of the setup: voltage supply, IR camera, CPU chip, thermal pad, copper plate, thermoelectric device, water heat sink, water-cooling unit to cool down the thermoelectric device, and the resulting thermal map]
  Copyright: Chair for Embedded Systems (CES), Karlsruhe Institute of Technology (KIT), Germany
Temperature Measurements for HEVC [RaceHorses@37QP vs. 22QP]
  Temp max.: 55.0 °C, Temp min.: 36.0 °C, Temp avg.: 53.0 °C
  Temp max.: 53.0 °C, Temp min.: 35.0 °C, Temp avg.: 49.0 °C
  hevcdtm @ DATE'14
  Copyright: Chair for Embedded Systems (CES), Karlsruhe Institute of Technology (KIT), Germany
HEVC Analysis: Temperature
  [Figure: Frequency Dependence: temperature (°C) over time (sec) for Keiba (1.8 GHz) and Basketball (1.35 GHz), with thermal maps on a 44 ºC to 62 ºC color scale]
  [Figure: Content Dependence: temperature (°C) over time (sec) for Keiba and Basketball, with thermal maps on a 44 ºC to 62 ºC color scale]
  So What is Required?
    Interplay between Software and Hardware needs to be explored for power/energy optimization
    1. Optimized Algorithms for Fast Intra- and Inter-Prediction
    2. Energy-Efficient Hardware Accelerators
    3. Energy-Efficient Video Memory Hierarchy
    4. Content-Adaptive Power Management
Power Efficient HEVC Design: Hardware Architecture
  HEVC Software Layer
    Application Driven Adaptive Power/Thermal Manager
    Video Tile Formation
    HEVC Encoding (Intra/Inter)
    Energy to Quality Tradeoff
    Complexity Reduction Scheme
    Data Analysis and Statistics
    Adaptive Workload Budgeting
  HEVC Hardware Processing Architecture
    [Block diagram: array of CT and R blocks, Feedback Monitors to Software, Battery, Off-chip DRAM]
Analysis and Statistics
  [Figure: PDFs (frequency) of variance and of distortion for 8x8, 16x16, 32x32 and 64x64 blocks]
  Variance and Motion based Classification
  Parameter       Value   SAD      SSE      SATD    Kbps
  Max. CU Depth   4       771      263      51      3001.7
                  3       659.15   229.08   42.1    3320.9
                  2       372.92   153.37   28.6    3363.1
  Search Range    64      771      263      51      3001.7
                  32      553.92   263.26   50.9    3080.44
                  16      472.37   262.78   52.82   3738.83
  AMP             1       771      263      51      3001.7
                  0       665.74   237.1    44.27   3072.92
Complexity Reduction: PU Size Estimation
  CTU variance computation at 4×4 granularity: variance v over the n pixels x_i of each block
  Recursive merge of 4 neighbors
    v_c = CombineVariances(v_i), i ∈ {1,2,3,4}
    MergeBlocks if v_c is below the threshold v_Th OR the v_i, i ∈ {1,2,3,4}, are below v_Th
  Threshold v_Th: obtained from a Rayleigh CDF analysis and an empirical analysis, as a function of µ_v, Δ, H and QP
    µ_v = Mean of variance curve
    Δ = CDF threshold (0.8)
    H = Size of PU to combine
  [Figure: HEVC CTU Compressor with PU Map (PUM) and PU Map Above (PUMA)]
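The variance-based merge on this slide can be illustrated with a small bottom-up sketch. Several points are assumptions made for the example and are not taken from the slides: `combine` uses the law of total variance for four equally sized blocks, the threshold `v_th` is passed in as a plain constant instead of being derived from the Rayleigh CDF analysis and QP, the condition on the v_i is read as "all four are below the threshold", and all function names are hypothetical.

```python
def block_stats(pixels, x0, y0, size):
    """Mean and population variance of a size x size block."""
    vals = [pixels[y][x]
            for y in range(y0, y0 + size) for x in range(x0, x0 + size)]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, var


def combine(stats):
    """Mean and variance of the union of four equally sized blocks
    (law of total variance; an assumed form of CombineVariances)."""
    mean = sum(m for m, _ in stats) / 4
    var = sum(v + m * m for m, v in stats) / 4 - mean * mean
    return mean, var


def estimate_pu_map(ctu, v_th):
    """Return {(x, y): size}: the estimated PU layout of a square CTU.

    Starts from 4x4 blocks and repeatedly merges four aligned
    neighbors while the merge condition holds.
    """
    n = len(ctu)
    size = 4
    blocks = {(x, y): block_stats(ctu, x, y, 4)
              for y in range(0, n, 4) for x in range(0, n, 4)}
    final = {}
    while blocks:
        parent = size * 2
        merged, used = {}, set()
        for (x, y) in blocks:
            if x % parent or y % parent or parent > n:
                continue
            kids = [(x, y), (x + size, y), (x, y + size), (x + size, y + size)]
            if all(k in blocks for k in kids):
                stats = [blocks[k] for k in kids]
                mean_c, var_c = combine(stats)
                # Merge condition as read from the slide (assumption).
                if var_c < v_th or all(v < v_th for _, v in stats):
                    merged[(x, y)] = (mean_c, var_c)
                    used.update(kids)
        for k in blocks:
            if k not in used:
                final[k] = size  # block could not be merged further
        blocks, size = merged, parent
    return final
```

For a flat 16×16 input this returns a single 16×16 entry, while a textured quadrant stays split into 4×4 blocks. Note that with the OR condition, four flat children with different means still merge, which may or may not match the authors' intent.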
Time Savings and Video Quality Results
  [Figure: normalized time per sequence (Traffic, BasketballDrive, BasketballDrill, BQSquare, RaceHorses, Johnny, BasketballDrillText); data labels 39.5, 44.3, 37.8, 37.0, 42.1, 57.4, 39.5]
  Sequence              Class   Size    BD-PSNR [dB]   BD-Rate [%]
  Traffic               A       4K      -0.03048       0.5611
  BasketballDrive       B       1080p   -0.04966       3.1834
  BasketballDrill       C       WVGA    -0.05175       1.0846
  BQSquare              D       WQVGA   -0.03365       0.3802
  RaceHorses            D       WQVGA   -0.03009       0.4482
  Johnny                E       720p    -0.08711       2.1241
  BasketballDrillText   F       WVGA    -0.05827       1.1123
Tile Mapping and Parallelization
  Workload is not equal for tiles
  Inputs: cores (CPUs), max. frequency f_max, frame rate f_p, user bit-rate tolerance n
  [Block diagram: Video Input; Tile Formation and Maximum Workload Estimator (Tile 0, Tile 1, Tile 3, Tile 4); Workload Allocator (workload per core); Core Frequency Selector (frequency per CPU); Workload Manager with Monitoring Unit, Threshold Generator and Workload Adaptation; Offline Tuning; Intra Mode Prediction (total intra angles θ); Core 0 ... Core K-1; Output]
  [Figure: time [msec] per frame for frames 1 to 49]
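One plausible reading of the core frequency selection on this slide is that each core runs at the lowest frequency that still finishes its tile within the frame period. The sketch below illustrates only that idea. The function name, the cycle counts and the list of available frequencies are hypothetical; the actual policy of the workload manager is not specified on the slide.

```python
def select_core_frequencies(tile_cycles, frame_rate, available_freqs_hz):
    """Pick, per tile/core, the lowest frequency meeting the frame deadline.

    tile_cycles: estimated cycles needed per frame for each tile
                 (hypothetical workload estimates).
    frame_rate:  target frames per second.
    Returns one frequency per tile, or None where even the highest
    available frequency is too slow (the workload would then have to
    be reduced).
    """
    freqs = sorted(available_freqs_hz)
    selected = []
    for cycles in tile_cycles:
        required_hz = cycles * frame_rate  # cycles per second needed
        choice = next((f for f in freqs if f >= required_hz), None)
        selected.append(choice)
    return selected
```

For example, `select_core_frequencies([20_000_000, 45_000_000], 30, [800_000_000, 1_350_000_000, 1_800_000_000])` assigns the lightly loaded tile the lowest frequency and the heavier tile the middle one, which is where the power saving from unequal tile workloads would come from.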
HEVC Thermal Management
  Application-Driven DTM
  [Flowchart: Extract Motion Intensity; HEVC Encoder (Execute HM) on Core0 and Core1, each with a temperature sensor; decision T_current > T_critical (YES/NO); actions: Application-driven DTM, Frequency scaling]
HEVC Thermal Management
  [Figure: thermal maps (color scale 38 ºC to 56 ºC) and maximum, average and minimum temperature (ºC) for No DTM, DTM 54 ºC, DTM 50 ºC and DTM 46 ºC, together with PSNR (dB) and bit rate (kbps) for each setting]
HEVC Thermal Management
  [Figure: peak temperature (ºC) over the number of frames for Keiba and BasketballDrill, comparing No DTM, DTM 54 ºC, DTM 50 ºC and DTM 46 ºC]
Power Efficient HEVC Design: Hardware Architecture
  [Block diagram: HEVC Hardware Processing Architecture: array of CT and R blocks, Feedback Monitors to Software, Battery]
Hardware Accelerators
  [Block diagram: intra predictor hardware: CTU row, HW 0, HW 1, HW 2 ... HW 8, PPC 2N/8]
  [Figure: resource usage (0 to 1000) versus number of datapaths in parallel (0 to 6); legend: slice LUTs, slice registers and occupied slices, each for luma and chroma]
AMBER: Memory Subsystem
  External Memory holds the current frame
    High density, low read and write power
  On-chip SRAM memory (FIFO) holds only the current block
    High read and write speed and low dynamic write power
    Hides latencies from HEVC engine
  [Block diagram: External Memory (Current Frame) and External Memory Controller; MRAM Buffers (N Reference Frames); Reference Write Master (low write amount, fast write); Current Read Master (reads current frame data, writes SRAM buffer); On-chip Current Data (Block) SRAM with SRAM Block FIFO; Reference Read Master (reads reference frames, low-latency read); Block Matching Engine; HEVC Encoder (Transform loop); Power Control; HEVC Video Compression Control]
AMBER: MRAM Reference Buffers
  One MRAM buffer holds a full reference frame
  Each column (sector) of reference buffer is power-gated
  Reference read and write masters read and write data to the MRAM buffer
  [Block diagram: Reference Write Master; MRAM Reference Buffers of size W×H for Reference F_1 ... Reference F_N; MRAM Power Gate Control; Row Buffer; SRAM FIFO; Reference Read Master; HEVC Encoder Block Matching]
AMBER: Reference Buffer Power Management
  Observation: Not all of the search window is used
    Block matching algorithm accesses only a small percentage of reference buffer sectors
  Power-gate unused sectors
    Reduce leakage
  Prediction of Unused Sectors is based on:
    1. Self-Organizing Map
    2. Content Properties
  [Figure: reference buffer sectors s_1, s_2, ... around the CU; sectors between x_min and x_max used by block matching are turned ON, the others are turned OFF]
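The sector power-gating decision can be sketched as a mapping from a predicted horizontal access range to per-sector ON/OFF states. This sketch does not implement the prediction itself (the slide names a self-organizing map and content properties for that); it assumes the predictor already delivers the column range x_min to x_max, and the function name and the sector width are hypothetical.

```python
def sector_power_states(frame_width, sector_width, x_min, x_max):
    """Return one boolean per column sector: True = powered ON.

    x_min, x_max: predicted first and last pixel column that block
    matching will access (assumed to come from an external predictor).
    """
    num_sectors = -(-frame_width // sector_width)  # ceiling division
    first = max(0, x_min // sector_width)
    last = min(num_sectors - 1, x_max // sector_width)
    return [first <= s <= last for s in range(num_sectors)]


states = sector_power_states(frame_width=1920, sector_width=128,
                             x_min=300, x_max=520)
gated_fraction = states.count(False) / len(states)
```

In this example only 3 of the 15 sectors stay powered, so the remaining 80% of the buffer can be power-gated. The saving depends entirely on how tight and how accurate the predicted range is.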
Power Consumption (4 reference frames)
  [Figure: power [W] of the search window approach (129×129, 193×193, 257×257) versus AMBER for Keiba 832x480, ChinaSpeed 1024x768, FourPeople 1280x720, BasketballDrive 1920x1080 and People 2560x1600]
  Increasing the number of reference frames increases the power advantage of the AMBER system over the search window approach
Conclusion
  Comprehensive analysis of HEVC
    Architecture, power, thermal and complexity
  Challenges posed by HEVC
    Architectural (memory, reconfiguration, accelerators)
    Power/thermal (power-gating, configuration control)
    Complexity (parallelization, many-core, workload balancing)
  Both hardware and software need to be optimized while leveraging application-specific knowledge
  Our approach
    Adaptive complexity management
      Video tiling, workload budgeting, CU/PU partitioning
    Power and thermal aware HEVC configuration
    Hybrid video memory hierarchy with content-driven power-gating
ces265: Multi-threaded HEVC Encoder
  Open-source, C++ based
  Multithreading via pthread API
  One thread of ces265 is 13.2× faster than HM-9.2
  Tile Formation and Workload Curtailing
  [Block diagram: Proposed HEVC Intra Encoder: System Configuration, YUV Read/Write, GOP Compressor, Slice Compressor, Tile Compressor Threads, CTU Compressor, Workload Queue, Workload Manager, Workload Allocator, Encoder Statistics; Sniper many-core x86 simulator (simulator statistics); McPAT power simulator (power statistics)]
  Web: http:///ces265/
  Download: https://sourceforge.net/projects/ces265/
Acknowledgement
  Muhammad Usman Karim Khan, Daniel Palomino, Claudio M. Diniz, Felipe Sampaio
Thank you! Questions?
  Web: http:///ces265/
  Download: https://sourceforge.net/projects/ces265/