Low Power Design of the Next-Generation High Efficiency Video Coding


Low Power Design of the Next-Generation High Efficiency Video Coding
Authors: Muhammad Shafique, Jörg Henkel
Chair for Embedded Systems (CES)

Outline
Introduction to the High Efficiency Video Coding (HEVC)
HEVC Analysis: complexity, memory access, thermal
Power-Efficient HEVC System Design
Conclusion

High Efficiency Video Coding (HEVC)
Ultra-HD (or Super Hi-Vision): 7680×4320, 33 million pixels per frame
By 2017: video is expected to make up 80% to 90% of global internet traffic
New video compression standards/techniques required
JCT-VC's High Efficiency Video Coding (HEVC): ~2× compression efficiency compared to H.264
Full HD @ 30 fps: 1 second = 712 Mbits, 1 hour = 2.4 Tbits
[Figure, panels (a) and (b): HEVC vs. H.264/AVC. Labels: time, bitrate, normalized memory BW [GB/s]; sequences Basketball, Kimono, PeopleOnStreet; resolutions HD720, HD1080, 2K]
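The Full HD figures on this slide can be reproduced with a few lines of arithmetic. The sketch below is only a plausibility check: it assumes uncompressed 8-bit 4:2:0 video (12 bits per pixel) and binary prefixes, neither of which is stated on the slide.

```python
# Raw (uncompressed) bit count for Full HD video at 30 fps.
# Assumptions (not stated on the slide): 8-bit 4:2:0 sampling, i.e. 12 bits
# per pixel, and binary prefixes (1 Mbit = 2**20 bits, 1 Tbit = 2**40 bits).

def raw_bits(width, height, fps, seconds, bits_per_pixel=12):
    """Number of bits needed to store `seconds` of uncompressed video."""
    return width * height * bits_per_pixel * fps * seconds

one_second = raw_bits(1920, 1080, 30, 1)
one_hour = raw_bits(1920, 1080, 30, 3600)

print(f"1 second: {one_second / 2**20:.0f} Mbits")          # 712 Mbits
print(f"1 hour:   {one_hour / 2**40:.1f} Tbits")            # 2.4 Tbits
print(f"8K frame: {7680 * 4320 / 1e6:.1f} million pixels")  # 33.2 million pixels
```

Under these assumptions the results match the slide's 712 Mbits and 2.4 Tbits, which suggests the slide refers to raw video before compression.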

Challenges for Developing HEVC-based Multimedia Systems
Challenges and requirements:
Compute Complexity: Content-Awareness, HW-SW Collaboration, Many-core Systems
Power Efficiency: Accelerator Design, Content-Awareness, Power-Gating
Thermal Management: Thermal Analysis, Configurations, Content-Adaptive
Parallelization: Workload Balancing, Arch.-Awareness, Power Budgeting
Video Memory: Memory Hierarchy Design, Content-Aware

HEVC Overview: Encoding Flow
[Block diagram: Input Video in CTUs; Transform and Quantization; Inverse Transform and Quantization; Recursive TU Size Reduction; Intra Prediction; Recursive CU/PU Size Reduction; Inter Prediction; Bitstream Headers; CABAC Entropy Coder; Decoded Picture Buffer; Deblocking and SAO Filter; Output Reconstructed Video; Output Bitstream]

HEVC Overview: Slices and Tiles
[Figure: a frame partitioned into Slices 0 to 3 and into Tiles 0 to 5]
[Figure: HEVC parallel encoding. Frames F_0 ... F_{M-1} of GOP 0 are split into tiles T_0, T_1, ..., T_{K-1}; the tiles are processed on Core 0 ... Core K-1 running at frequencies f_0 ... f_{K-1}]
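As an illustration of the tile idea, the sketch below splits a frame into a grid of near-equal tiles measured in CTUs and maps one tile to one core. The function names, the integer-division boundary rule and the round-robin mapping are assumptions chosen for illustration; the slides do not specify how tiles are formed or assigned.

```python
# Hypothetical sketch: uniform tile grid in CTU units, one tile per core.

def uniform_boundaries(size_in_ctus, parts):
    """Boundaries of `parts` near-equal spans covering `size_in_ctus` CTUs."""
    return [(i * size_in_ctus) // parts for i in range(parts + 1)]


def make_tiles(pic_w, pic_h, ctu, cols, rows):
    """Return tiles as dicts with CTU coordinates [x0, x1) x [y0, y1)."""
    w_ctus = -(-pic_w // ctu)  # ceiling division: partial CTUs still count
    h_ctus = -(-pic_h // ctu)
    xs = uniform_boundaries(w_ctus, cols)
    ys = uniform_boundaries(h_ctus, rows)
    return [
        {"id": r * cols + c, "x0": xs[c], "x1": xs[c + 1],
         "y0": ys[r], "y1": ys[r + 1]}
        for r in range(rows)
        for c in range(cols)
    ]


def map_tiles_to_cores(tiles, num_cores):
    """Round-robin mapping of tile id to core id (an assumed policy)."""
    return {t["id"]: t["id"] % num_cores for t in tiles}


# Full HD frame, 64x64 CTUs, 3x2 grid: six tiles, as in the slide's figure.
tiles = make_tiles(1920, 1080, 64, cols=3, rows=2)
mapping = map_tiles_to_cores(tiles, num_cores=6)
for t in tiles:
    print(t, "-> core", mapping[t["id"]])
```

Because 1080 is not a multiple of 64, the two tile rows are 8 and 9 CTUs tall, so even a uniform grid does not give every core the same amount of work.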

HEVC Overview: Tree-Block Structure
[Figure: CTU 0, CTU 1, CTU 2; a 64×64 CTU recursively split into 32×32, 16×16, 8×8 and 4×4 blocks. Panels: Example CU Configuration, Example PU Configuration, Tested TU Configurations]

CTU Distribution

HEVC Overview: Intra and Inter Prediction
HEVC intra prediction: modes 0 (Planar) and 1 (DC), plus horizontal and vertical angular predictors; block sizes 64×64, 32×32, 16×16 and smaller
Intra: $\sum_{i=0}^{\log_2 M - 2} 2^{2i} N_i$
HEVC inter prediction
Inter: $\sum_{i=0}^{\log_2 M - 3} 13 \cdot 2^{2i}$
HEVC-Intra: ~2.56× more mode decisions than H.264
HEVC-Inter: ~2.2× more complex than H.264
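The two sums on this slide are badly garbled in the transcript. One plausible reading is that they count candidate evaluations per M×M CTU: for intra, the sum over depths i = 0 ... log2(M) - 2 of 2^(2i) · N_i (blocks at depth i times modes tested per block); for inter, the sum over i = 0 ... log2(M) - 3 of 13 · 2^(2i). The sketch below evaluates that reading. The reading itself, the function names and the choice N_i = 35 for every depth are assumptions, not values taken from the slides.

```python
from math import log2

# Assumed reading of the slide's sums (the transcript is garbled):
#   intra: sum_{i=0}^{log2(M)-2} 4**i * N_i
#   inter: sum_{i=0}^{log2(M)-3} 13 * 4**i


def intra_mode_decisions(ctu_size, modes_per_depth):
    """Mode evaluations for one CTU when every block size down to 4x4 is tried."""
    max_depth = int(log2(ctu_size)) - 2
    return sum(4**i * modes_per_depth[i] for i in range(max_depth + 1))


def inter_pu_evaluations(ctu_size, pus_per_cu=13):
    """PU evaluations for one CTU when every CU size down to 8x8 is tried."""
    max_depth = int(log2(ctu_size)) - 3
    return sum(pus_per_cu * 4**i for i in range(max_depth + 1))


# Illustration only: 35 modes at every depth (HEVC defines 35 intra modes,
# but an encoder need not test all of them for every block size).
print(intra_mode_decisions(64, [35] * 5))  # 35 * (1 + 4 + 16 + 64 + 256)
print(inter_pu_evaluations(64))            # 13 * (1 + 4 + 16 + 64)
```

The point of the exercise is the growth rate: each extra depth level multiplies the number of blocks by four, which is why early termination of the CU/PU search pays off.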

HEVC Overview: Motion Estimation
Block Matching (BM) or Motion Estimation (ME): compression by searching temporal neighbors
High energy/time, high compression efficiency (H.264-Inter, HEVC-Inter)
[Figure: the current block of the current frame is matched against a search window in the reference (previous) frame; the best match gives the motion vector, and the difference gives the residue frame]

HEVC Overview: Search Data Fetching
External Memory (DRAM): high leakage, high dynamic power
External Memory Bus: high bus power
On-Chip Memory (SRAM): very high leakage
A memory subsystem with low power consumption and high efficiency is required
[Figure: the current block (current frame) and the search window (reference frame) are fetched for block matching]


HEVC Analysis: Computational Complexity
CU/PU partitioning: large partitions for low-variance, homogeneous image areas and small partitions for high-variance areas
Early PU size prediction may provide a significant reduction in computational and energy requirements
Smooth texture (due to larger QP or resolution) is usually captured by larger PUs
[Figure: percentage area covered by PU sizes 64×64, 32×32, 16×16, 8×8 and 4×4 for BasketballDrill (832×480), ParkScene (1920×1080) and PeopleOnStreet (2560×1600); high-variance and low-variance regions marked]

HEVC Analysis: CTU Distribution

HEVC Analysis: Memory Accesses
Memory access for motion estimation: memory accesses of HEVC are ~3.86× those of H.264
Most of the on-chip memory is wasted (leakage power): only a part of the full search window is utilized
Adapting the search window size at run-time provides increased potential for leakage power savings
[Figure, panels (a) and (b): H.264 vs. HEVC; percentage utilization (0% to 100%) with maximum, 75%, median, 25% and minimum for Keiba, BasketballDrill, RaceHorses and KristenAndSara]

Using a thermal camera setup
Setup (bottom view): IR camera, thermal pad, CPU chip, copper plate, thermoelectric device (Peltier-based cooling), water heat sink, water-cooling unit to cool down the thermoelectric device, voltage supply; result: thermal map
Platform: Intel Atom 45 nm dual-core processor (1.8 GHz), Linux Ubuntu kernel (Src: Intel)
DIAS Pyroview thermal camera operates at 50 Hz with a spatial resolution of 50 µm
Copyright: Chair for Embedded Systems (CES), Karlsruhe Institute of Technology (KIT), Germany

Temperature Measurements for HEVC [RaceHorses @ 37QP vs. 22QP]
Temp max.: 55.0 °C, Temp min.: 36.0 °C, Temp avg.: 53.0 °C
Temp max.: 53.0 °C, Temp min.: 35.0 °C, Temp avg.: 49.0 °C
hevcdtm @ DATE'14

HEVC Analysis: Temperature
[Figure, frequency dependence: temperature (°C) over time (sec) for Keiba (1.8 GHz) and Basketball (1.35 GHz), with thermal maps from 44 °C to 62 °C]
[Figure, content dependence: temperature (°C) over time (sec) for Keiba and Basketball, with thermal maps from 44 °C to 62 °C]
So what is required? The interplay between software and hardware needs to be explored for power/energy optimization:
1. Optimized Algorithms for Fast Intra- and Inter-Prediction
2. Energy-Efficient Hardware Accelerators
3. Energy-Efficient Video Memory Hierarchy
4. Content-Adaptive Power Management


Power Efficient HEVC Design: Hardware Architecture
HEVC Software Layer: Application-Driven Adaptive Power/Thermal Manager; Video Tile Formation; HEVC Encoding (Intra/Inter); Energy to Quality Tradeoff; Complexity Reduction Scheme; Data Analysis and Statistics; Adaptive Workload Budgeting
HEVC Hardware Processing Architecture: grid of CT and R elements; Feedback Monitors to Software; Battery; Off-chip DRAM

Analysis and Statistics
[Figure: PDF (frequency) of variance and of distortion for block sizes 8×8, 16×16, 32×32 and 64×64]
Variance and Motion based Classification

Parameter      Value  SAD     SSE     SATD   Kbps
Max. CU Depth  4      771     263     51     3001.7
Max. CU Depth  3      659.15  229.08  42.1   3320.9
Max. CU Depth  2      372.92  153.37  28.6   3363.1
Search Range   64     771     263     51     3001.7
Search Range   32     553.92  263.26  50.9   3080.44
Search Range   16     472.37  262.78  52.82  3738.83
AMP            1      771     263     51     3001.7
AMP            0      665.74  237.1   44.27  3072.92

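Complexity Reduction: PU Size Estimation
Flow: HEVC CTU Compressor; CTU variance computation at 4×4; recursive 4-neighbor merge; PU Map (PUM); PU Map Above (PUMA)
CTU variance computation at 4×4: $v = \frac{1}{n-1} \sum_{i=0}^{n-1} (x_i - \bar{x})^2$
Recursive merge of 4 neighbors: $v_c = \mathrm{CombineVariances}(v_i),\ i \in \{1,2,3,4\}$; if $v_c < v_{Th}$ OR $v_i < v_{Th},\ i \in \{1,2,3,4\}$, then MergeBlocks
Threshold $v_{Th}$: obtained from a Rayleigh CDF analysis and an empirical analysis [equation in terms of $\mu_v$, $\Delta$, $H$ and QP; garbled in the transcript]
$\mu_v$ = mean of variance curve
$\Delta$ = CDF threshold (0.8)
$H$ = size of PU to combine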
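The merge procedure on this slide (variance at 4×4, then recursive merging of four neighbors when the variance is below a threshold) can be sketched as follows. Several details are assumptions made for the sketch: the comparison direction ("below threshold means merge"), the use of population variance so that four equal-sized blocks can be combined exactly from their means and variances, and a fixed threshold `v_th` standing in for the slide's threshold formula, which is not recoverable from the transcript.

```python
def leaf_stats(img, x, y, size=4):
    """(mean, population variance) of the size x size block at (x, y)."""
    px = [img[y + j][x + i] for j in range(size) for i in range(size)]
    mean = sum(px) / len(px)
    var = sum((p - mean) ** 2 for p in px) / len(px)
    return mean, var


def combine_variances(stats):
    """Exact (mean, variance) of the union of equal-sized blocks."""
    mean_c = sum(m for m, _ in stats) / len(stats)
    var_c = sum(v + m * m for m, v in stats) / len(stats) - mean_c ** 2
    return mean_c, var_c


def pu_map(img, x, y, size, v_th):
    """Return (list of (x, y, size) PUs, (mean, var)) for one square block."""
    if size == 4:
        return [(x, y, 4)], leaf_stats(img, x, y)
    half = size // 2
    pus, stats = [], []
    for oy in (0, half):
        for ox in (0, half):
            child_pus, child_stats = pu_map(img, x + ox, y + oy, half, v_th)
            pus.extend(child_pus)
            stats.append(child_stats)
    combined = combine_variances(stats)
    all_whole = len(pus) == 4  # every child ended up as one undivided PU
    low = combined[1] < v_th or all(v < v_th for _, v in stats)
    if all_whole and low:
        return [(x, y, size)], combined  # merge the four neighbors
    return pus, combined


# 16x16 toy block: flat top-left quadrant, checkerboard texture elsewhere.
img = [[100 if (x < 8 and y < 8) else (255 if (x + y) % 2 else 0)
        for x in range(16)] for y in range(16)]
pus, _ = pu_map(img, 0, 0, 16, v_th=50.0)
print(len(pus), pus[0])  # 13 (0, 0, 8)
```

In the toy block the flat quadrant collapses into a single 8×8 PU while the textured quadrants stay at 4×4, which is the behavior the slide's "large partitions for low variance" observation calls for.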

Time Savings and Video Quality Results
[Figure: normalized time (0 to 0.7) for Traffic, BasketballDrive, BasketballDrill, BQSquare, RaceHorses, Johnny and BasketballDrillText; bar labels 39.5, 44.3, 37.8, 37.0, 42.1, 57.4, 39.5]

Sequence             Class  Size   BD-PSNR [dB]  BD-Rate [%]
Traffic              A      4K     -0.03048      0.5611
BasketballDrive      B      1080p  -0.04966      3.1834
BasketballDrill      C      WVGA   -0.05175      1.0846
BQSquare             D      WQVGA  -0.03365      0.3802
RaceHorses           D      WQVGA  -0.03009      0.4482
Johnny               E      720p   -0.08711      2.1241
BasketballDrillText  F      WVGA   -0.05827      1.1123

Tile Mapping and Parallelization
Workload is not equal for tiles
Inputs: video input, number of cores/CPUs, max. frequency f_max, frame rate f_p, user bit-rate tolerance n
Blocks: Tile Formation and Maximum Workload Estimator; Workload Allocator; Workload Manager; Monitoring Unit; Threshold Generator; Workload Adaptation; Offline Tuning; Core Frequency Selector; Intra Mode Prediction
Outputs: workload (per core) and frequency (per CPU) for Core 0, Core 1, Core 2, Core 3, ..., Core K-2, Core K-1
[Figure: time [msec] for Tile 0 to Tile 4; total intra angles (θ) per frame, frames 1 to 49]
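One way to picture the workload allocator and core frequency selector named on this slide is the sketch below: tiles with unequal estimated workloads are assigned greedily to the least-loaded core, and each core then picks the lowest frequency that still finishes its tiles within one frame period. The greedy policy, the cycle estimates and the frequency list are all hypothetical; the slides do not describe the actual algorithms.

```python
def allocate_tiles(tile_cycles, num_cores):
    """Greedy longest-first assignment of tiles to the least-loaded core.
    `tile_cycles` maps tile id -> estimated cycles per frame (hypothetical)."""
    loads = [0] * num_cores
    assignment = {}
    for tile, cycles in sorted(tile_cycles.items(), key=lambda kv: -kv[1]):
        core = loads.index(min(loads))
        assignment[tile] = core
        loads[core] += cycles
    return assignment, loads


def select_frequency(load_cycles, frame_rate, available_hz):
    """Lowest available frequency that processes `load_cycles` within one
    frame period; falls back to the maximum if the deadline cannot be met."""
    required_hz = load_cycles * frame_rate
    for f in sorted(available_hz):
        if f >= required_hz:
            return f
    return max(available_hz)


# Five tiles with unequal workloads, two cores, 30 frames per second.
tile_cycles = {0: 30_000_000, 1: 10_000_000, 2: 20_000_000,
               3: 25_000_000, 4: 15_000_000}
levels_hz = [900_000_000, 1_350_000_000, 1_800_000_000]

assignment, loads = allocate_tiles(tile_cycles, num_cores=2)
freqs = [select_frequency(load, 30, levels_hz) for load in loads]
print(assignment)
print(loads, freqs)
```

In this example the lighter core can run at 1.35 GHz while the heavier one needs 1.8 GHz, which shows how uneven tile workloads translate directly into per-core frequency (and therefore power) differences.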

HEVC Thermal Management
Application-Driven DTM
[Flow chart: HEVC encoder; extract motion intensity; application-driven DTM; frequency scaling; execute HM on Core0 and Core1, each with a sensor; decision T_current > T_critical (YES/NO)]
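The decision at the heart of this flow chart, comparing T_current against T_critical and scaling frequency, can be sketched as a simple control step. This is a minimal illustration only: the frequency levels, the "step down one level" policy and the choice to leave the frequency unchanged on the NO branch are assumptions, and the motion-intensity input shown on the slide is not modeled.

```python
# Hypothetical frequency levels in GHz; not taken from the slides.
FREQ_LEVELS_GHZ = [0.9, 1.35, 1.8]


def dtm_step(core_temps, t_critical, level):
    """One DTM decision: if the hottest core exceeds T_critical, step the
    frequency down one level (assumed policy); otherwise keep it."""
    t_current = max(core_temps)
    if t_current > t_critical:  # YES branch
        return max(level - 1, 0)
    return level                # NO branch


def run_dtm(trace, t_critical, level):
    """Apply dtm_step to a trace of (core0, core1) sensor readings and
    return the frequency used after each reading."""
    history = []
    for temps in trace:
        level = dtm_step(temps, t_critical, level)
        history.append(FREQ_LEVELS_GHZ[level])
    return history


# Made-up sensor trace in degrees Celsius for Core0 and Core1.
trace = [(50, 51), (55, 53), (56, 54), (52, 50)]
history = run_dtm(trace, t_critical=54, level=2)
print(history)  # [1.8, 1.35, 0.9, 0.9]
```

A real manager would also need a rule for raising the frequency again once the chip has cooled; that part is left out because the slide does not show it.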

HEVC Thermal Management
[Figure: thermal maps (38 °C to 56 °C) and max/average/min temperature (°C), PSNR (dB) and bit rate (kbps) for No DTM and for our DTM at 54 °C, 50 °C and 46 °C]

HEVC Thermal Management
[Figure: peak temperature (°C, 40 to 60) vs. number of frames (0 to 20) for Keiba and BasketballDrill; curves: No DTM, DTM 54 °C, DTM 50 °C, DTM 46 °C]

Power Efficient HEVC Design: Hardware Architecture
HEVC Hardware Processing Architecture: grid of CT and R elements; Feedback Monitors to Software; Battery

Hardware Accelerators
[Block diagram labels: M CTU Row; HW 0, HW 1, HW 2, HW 8; Intra Predictor; PPC 2N/8]
[Figure: resource usage (0 to 1000) vs. number of datapaths in parallel (0 to 6). Legend: Slice LUTs (luma), Slice LUTs (chroma), Slice registers (luma), Slice registers (chroma), Occupied Slices (luma), Occupied Slices (chroma)]

AMBER: Memory Subsystem
External Memory holds the current frame
High density, low read and write power
On-chip SRAM memory (FIFO) holds only the current block
High read and write speed and low dynamic write power
Hides latencies from HEVC engine
[Block diagram: External Memory (Current Frame); External Memory Controller; MRAM Buffers (N Reference Frames); Reference Write Master (low write amount, fast write); Current Read Master (reads current frame data, writes SRAM buffer); On-chip Current Data (Block) SRAM; SRAM Block FIFO; Reference Read Master (reads reference frames, low latency read); Block Matching Engine; HEVC Encoder (Transform loop); Power Control; HEVC Video Compression Control]

AMBER: MRAM Reference Buffers
One MRAM buffer holds a full reference frame
Each column (sector) of the reference buffer is power-gated
Reference read and write masters read and write data to the MRAM buffer
[Block diagram: Reference Write Master; MRAM Reference Buffers (Reference F_1 ... Reference F_N, each W × H); Row Buffer; SRAM FIFO; MRAM Power Gate Control; Reference Read Master; HEVC Encoder Block Matching]

AMBER: Reference Buffer Power Management
Observation: not all of the search window is used
The block matching algorithm accesses only a small percentage of reference buffer sectors
Power-gate unused sectors to reduce leakage
Prediction of unused sectors is based on:
1. Self-Organizing Map
2. Content Properties
[Figure: sectors s_1 ... s_2 between x_min and x_max around the CU sector s_CU are turned ON for block matching; the remaining sectors are turned OFF]
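The sector gating idea can be illustrated with a small sketch that, given the horizontal extent of the region block matching is expected to touch, decides which column sectors of the reference buffer stay powered. The sector width, the frame width and the way x_min and x_max are obtained are assumptions here; on the slide the prediction comes from a self-organizing map and content properties, which this sketch does not attempt to reproduce.

```python
def active_sectors(x_min, x_max, sector_width, num_sectors):
    """Boolean mask over column sectors: True = powered ON.
    A sector stays ON if it overlaps pixel columns x_min..x_max (inclusive)."""
    first = max(x_min // sector_width, 0)
    last = min(x_max // sector_width, num_sectors - 1)
    return [first <= s <= last for s in range(num_sectors)]


def gated_fraction(mask):
    """Fraction of sectors that are power-gated (turned OFF)."""
    return mask.count(False) / len(mask)


# Hypothetical numbers: 1920-pixel-wide reference frame, 64-pixel sectors,
# and a predicted search region spanning pixel columns 500..756.
mask = active_sectors(x_min=500, x_max=756, sector_width=64, num_sectors=30)
print([i for i, on in enumerate(mask) if on])  # [7, 8, 9, 10, 11]
print(f"{gated_fraction(mask):.0%} of sectors gated")  # 83% of sectors gated
```

The fraction of gated sectors is only a proxy for leakage savings; the actual saving depends on the memory technology and on the cost of waking a sector when the prediction turns out to be wrong.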

Power Consumption (4 reference frames)
[Figure: power [W] (0 to 2) for search windows of 129×129, 193×193 and 257×257 and for AMBER; sequences: Keiba (832×480), ChinaSpeed (1024×768), FourPeople (1280×720), BasketballDrive (1920×1080), People (2560×1600)]
Increasing the number of reference frames increases the power advantage of the AMBER system over the search window approach

Conclusion
Comprehensive analysis of HEVC: architecture, power, thermal and complexity
Challenges posed by HEVC:
Architectural (memory, reconfiguration, accelerators)
Power/thermal (power-gating, configuration control)
Complexity (parallelization, many-core, workload balancing)
Both hardware and software need to be optimized while leveraging the application-specific knowledge
Our approach:
Adaptive complexity management: video tiling, workload budgeting, CU/PU partitioning
Power and thermal aware HEVC configuration
Hybrid video memory hierarchy with content-driven power-gating

ces265: Multi-threaded HEVC Encoder
Open-source, C++ based
Multithreading via pthread API
One thread of ces265 is 13.2× faster than HM-9.2
Tile Formation and Workload Curtailing
[Block diagram: proposed HEVC intra encoder (HEVC-Intra Encoder's Top, GOP Compressor, Slice Compressor, Tile Compressor Threads, CTU Compressor, Workload Queue, Workload Manager, Workload Allocator, System Configuration, YUV Read/Write, Encoder Statistics); Sniper many-core x86 simulator (simulator statistics); McPAT power simulator (power statistics)]
Web: http:///ces265/
Download: https://sourceforge.net/projects/ces265/

Acknowledgement
Muhammad Usman Karim Khan, Daniel Palomino, Claudio M. Diniz, Felipe Sampaio

Thank you! Questions?
Web: http:///ces265/
Download: https://sourceforge.net/projects/ces265/