Conference object, Postprint version This version is available at


Benjamin Bross, Valeri George, Mauricio Álvarez-Mesa, Tobias Mayer, Chi Ching Chi, Jens Brandenburg, Thomas Schierl, Detlev Marpe, Ben Juurlink

HEVC performance and complexity for 4K video

Conference object, Postprint version
This version is available at http://dx.doi.org/.79/depositonce-78.

Suggested Citation
Bross, Benjamin; George, Valeri; Álvarez-Mesa, Mauricio; Mayer, Tobias; Chi, Chi Ching; Brandenburg, Jens; Schierl, Thomas; Marpe, Detlev; Juurlink, Ben: HEVC performance and complexity for 4K Video. In: IEEE International Conference on Consumer Electronics : ICCE. - New York, NY [u.a.] : IEEE,. - ISBN: 978--799--. - pp. -7. - DOI:.9/ICCE-Berlin..6698. (Postprint version is cited, page numbers differ.)

Terms of Use
IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

HEVC Performance and Complexity for 4K Video

Benjamin Bross, Valeri George, Mauricio Alvarez-Mesa, Tobias Mayer, Chi Ching Chi, Jens Brandenburg, Thomas Schierl, Detlev Marpe, and Ben Juurlink

Image Processing Department, Fraunhofer HHI, 10587 Berlin, Germany
Embedded Systems Architecture Group, Technical University of Berlin, 10587 Berlin, Germany

Abstract

The recently finalized High-Efficiency Video Coding (HEVC) standard was jointly developed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) to improve the compression performance of current video coding standards by 50%. This 50% bitrate reduction is essential for transmitting high-resolution video such as 4K over the internet or in broadcast. This paper shows that real-time decoding of 4K video with a frame-level parallel decoding approach using four desktop CPU cores is feasible.

I. INTRODUCTION

In January 2013, ten years after the widely used H.264/MPEG-4 AVC video coding standard [1] was published, the first version of the HEVC standard was finalized by ITU-T consent and issued as ISO/IEC Final Draft International Standard (FDIS) [2]. A design overview of the new HEVC standard can be found in [3]. The coding efficiency of HEVC was analyzed in [4] and compared with previous video coding standards like H.264/MPEG-4 AVC and H.262/MPEG-2 Video. Bitrate reductions of 50% for the same subjective quality compared to H.264/MPEG-4 AVC are reported. Since this coding efficiency gain comes with increased complexity, the complexity aspects of HEVC encoding and decoding have been studied in [5] and [6], which report encoding and decoding times for HD (1920×1080) video sequences. One of the targeted applications of HEVC is coding of ultra-high resolution video; hence, this paper reviews and reports results for real-time HEVC decoding of 4K (3840×2160) video sequences. First approaches to enable real-time decoding of HEVC coded 4K video sequences have been analyzed and presented in [7], [8], [9], and [10].
In these studies, the HEVC test model (HM) reference software decoder code was optimized and modified to support multithreading. The first analysis uses multi-threading in combination with entropy slices in version. of the HM software [7]. Entropy slices are not part of the final standard, but Wavefront Parallel Processing (WPP) allows multi-threaded decoding similar to that with entropy slices. A slightly modified version of WPP, called Overlapped Wavefront (OWF), and Tiles have been studied in [8] and [9] based on HM.. The most recent publication shows results for OWF based on HM 8. and further reports the speedup obtained from Single Instruction Multiple Data (SIMD) code optimizations [10]. The 4K (3840×2160) sequences used in these publications are from the Sveriges Television (SVT) High Definition Multi Format Test Set. Although WPP and Tiles allow low-delay parallel decoding, both techniques have to be enabled and signaled in the bitstream.

II. REAL-TIME HEVC DECODING OF 4K VIDEO

To provide the required speedup for 4K decoding using parallel processing without putting constraints on the bitstream, e.g. having WPP or Tiles enabled, a frame-level parallel processing approach has been chosen for this paper. In the initial version of this approach presented here, each frame to be processed in parallel is assigned a worker thread. Therefore, the number of worker threads controls the number of frames processed in parallel. The frame-level parallelism has been integrated into an HEVC decoder implementation developed from scratch at Fraunhofer HHI, and results are provided for all sequences from the 4K (3840×2160) 50 Hz UHD-1 test set provided by the European Broadcast Union (EBU) [11].
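The one-worker-thread-per-frame scheme described above can be illustrated with a minimal sketch. It is not the Fraunhofer HHI decoder: the function `decode_frames`, the `(poc, refs)` frame description and the `decode_one` callback are hypothetical names introduced here, and a frame is assumed to wait until its reference frames are completely decoded, whereas a real decoder may synchronize at a finer granularity.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def decode_frames(frames, num_workers, decode_one):
    """Decode frames with one worker thread per in-flight frame.

    frames: list of (poc, refs) tuples in decoding order, where refs lists
            the frames that must be finished first (hypothetical format).
    decode_one: callback that decodes a single frame (assumption).
    Returns the order in which frames finished decoding.
    """
    seen = set()
    for poc, refs in frames:
        # References must precede the frame in decoding order; together with
        # the FIFO task queue this rules out deadlock.
        if any(r not in seen for r in refs):
            raise ValueError(f"frame {poc} references a later frame")
        seen.add(poc)

    done = {poc: threading.Event() for poc, _ in frames}
    finished = []
    lock = threading.Lock()

    def worker(poc, refs):
        for r in refs:
            done[r].wait()        # block until the reference frame is ready
        decode_one(poc)
        with lock:
            finished.append(poc)  # record completion before signalling
        done[poc].set()

    # num_workers bounds the number of frames processed in parallel.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for poc, refs in frames:
            pool.submit(worker, poc, refs)
    return finished


if __name__ == "__main__":
    # Small hierarchical structure: frame 2 depends on 0 and 4, and so on.
    gop = [(0, []), (4, [0]), (2, [0, 4]), (1, [0, 2]), (3, [2, 4])]
    print(decode_frames(gop, num_workers=4, decode_one=lambda poc: None))
```

Because a blocked worker still occupies its thread in this sketch, using more worker threads than cores can keep the cores busy while some frames wait for their references.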
The sequences have been encoded with version of the HM reference software (HM) [12] using the Intra Main, Intra High Efficiency 10 bit (Main 10), Random Access and Random Access High Efficiency 10 bit (Main 10) configurations described in the common test conditions [13] and decoded with the Fraunhofer HHI HEVC software decoder.

III. RESULTS ON A WORKSTATION CPU

All runtime measurements have been performed on the same type of computer, which has an eight-core Intel Xeon E5-2687W CPU running at . GHz. Simultaneous Multithreading (SMT, called Hyper-Threading by Intel) is disabled to limit the number of hardware threads to eight, and dynamic overclocking (Turbo Boost) is disabled to obtain reproducible results. Fig. 1 shows the speedup factor that can be achieved for different numbers of threads used for frame-level parallel decoding. The speedup for the Intra configurations increases relative to the Random Access speedup as the number of threads increases, since all frames can be processed independently in parallel. Because the Random Access configuration uses inter-picture prediction, the frame-level parallelism provides a non-linear speedup: the threads have to synchronize more frequently to account for inter-picture prediction sample referencing. The speedup saturates when the number of worker threads reaches the number of CPU cores, which is eight. Only for the Random Access configurations does the speedup grow further when the number of threads is increased to ten. This can be explained by the initial, still sub-optimal implementation of frame-level parallelism, in which the number of frames processed in parallel is set equal to the number of worker threads. When a worker thread is idle, it cannot start decoding another picture if this would increase the number of simultaneously decoded frames. Especially for the hierarchical coding structure in the Random Access configuration, where frames inside the group of pictures (GOP) are coded with different quantization parameters, decoding times of frames vary much more than for the Intra configuration. Choosing more worker threads than CPU cores helps in these cases since it increases the number of frames that are allowed to be processed in parallel.

Fig. 1. Decoding speedup on an Intel Xeon E5-2687W workstation CPU at . GHz averaged over the complete 4K EBU UHD-1 test set for Intra and Random Access configurations. (Plot: speedup over number of worker threads; curves intra-main, intra-main10, randomaccess-main, randomaccess-main10.)

Looking in more detail at the Random Access Main 10 configuration with 10 bit video, Fig. 2a, Fig. 2b and Fig. 2c show the execution time of the Fraunhofer HHI decoder for all the UHD-1 50 Hz sequences when one, four and ten worker threads are used. According to Fig. 1, the performance peaks when using ten worker threads and saturates from this point on. The horizontal dashed line represents the real-time limit for 50 Hz, which is 1000/50 = 20 ms/frame. Whether real-time decoding is possible depends on the sequence and the bitrate. For example, when four threads are used, Lupo boa can be decoded in real-time up to 7. MBits/s while veggie fruits passes the 20 ms/frame line at MBits/s. Looking at the objective quality for the different sequences at different bitrates as shown in Fig. 3, it can be seen that Lupo boa provides a Peak Signal to Noise Ratio (PSNR) of 9. dB at 7. MBits/s and veggie fruits already reaches dB at MBits/s.
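The real-time limit drawn as a dashed line follows directly from the frame rate. The short sketch below shows the arithmetic, assuming the 50 Hz frame rate of the test set; the helper names are hypothetical, and the per-frame decoding times used in the usage lines are made-up illustrative values, not measurements from this paper.

```python
def frame_budget_ms(frame_rate_hz):
    """Time available to decode one frame in real time, in milliseconds."""
    return 1000.0 / frame_rate_hz


def is_real_time(decode_ms_per_frame, frame_rate_hz):
    """True if the average decoding time per frame fits the frame budget."""
    return decode_ms_per_frame <= frame_budget_ms(frame_rate_hz)


if __name__ == "__main__":
    print(frame_budget_ms(50))     # budget per frame at 50 Hz
    print(is_real_time(18.5, 50))  # hypothetical decoder below the limit
    print(is_real_time(25.0, 50))  # hypothetical decoder above the limit
```

A sequence and bitrate combination counts as decodable in real time when its average decoding time per frame stays at or below this budget.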
Hence, real-time decoding for both sequences at a good objective quality is feasible using four threads on four cores.

Fig. 2. Decoding time for each sequence of the 4K EBU UHD-1 test set for the Random Access Main configuration with 1, 4 and 8 cores: (a) Xeon E5-2687W workstation CPU at . GHz using 1 core - 1 thread; (b) Xeon E5-2687W workstation CPU at . GHz using 4 cores - 4 threads; (c) Xeon E5-2687W workstation CPU at . GHz using 8 cores - 10 threads. (Plots: one curve per sequence, i.e. Lupoboa, Lupocandlelight, Lupoconfetti, candlesmoke, fountainlady, parkdancers, penduluswide, rainfruits, studiodancer, veggiefruits, waterfallpan and windwool, with a dashed line marking the 50 Hz real-time limit.)

IV. RESULTS ON A DESKTOP CPU

In addition to the Xeon workstation CPU, the Random Access Main configuration bitstreams have also been decoded on a state-of-the-art four-core Intel i7-9xm desktop CPU running at . GHz. This configuration is considered to be more representative of the systems that people have at home. Here, SMT is enabled, giving a maximum of eight hardware threads for the software to use. As for the Xeon workstation CPU, Turbo Boost is disabled so that varying CPU clock rates do not distort the runtime measurements. Similarly to Fig. 1, Fig. 4 shows the speedup achieved when more than one worker thread is used. Although SMT provides eight hardware threads, the speedup when using more than four worker threads does not increase as much as it would with eight cores available. Therefore, the four additional hardware threads or virtual cores cannot be counted as full cores for frame-level parallel decoding. Fig. 5a, Fig. 5b and Fig. 5c show the execution time over the bitrate for all EBU UHD-1 test sequences. The performance for one and four worker threads is comparable to that of the Xeon workstation CPU. In the best performing configuration, i.e. when all CPU resources are used with ten worker threads, all sequences can be decoded at least up to Mbits/s. When mapping the maximum bitrates again to the PSNR values representing objective quality in Fig. 3, the coded bitstreams have at least a decent objective quality. The sequence pendulus wide, for example, which has the worst coding performance according to Fig. 3, can be decoded in real-time up to Mbits/s. At Mbits/s, its PSNR value is around 7. dB, which is quite good considering that the rate-distortion curve saturates around 8 dB.

Fig. 3. Rate-distortion performance of the 4K EBU UHD-1 test set for the Random Access Main configuration. (Plot: PSNR Y [dB] for each sequence.)

V. CONCLUSION

It has been shown that real-time software decoding of 4K 50 Hz video with HEVC is feasible on current desktop CPUs using four CPU cores. Encoding 4K video in real-time, on the other hand, remains a challenge.
Therefore, first use cases of 4K video coded with HEVC are expected to be limited to offline encoded material for internet services like video on demand.

Fig. 4. Decoding speedup on an Intel i7-9xm desktop CPU at . GHz averaged over the complete 4K EBU UHD-1 test set for Random Access Main configuration. (Plot: speedup over number of worker threads; curve randomaccess-main.)

REFERENCES

[1] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, 2003.
[2] B. Bross, W.-J. Han, J.-R. Ohm, G. J. Sullivan, Y.-K. Wang, and T. Wiegand, "High Efficiency Video Coding (HEVC) text specification draft 10 (for FDIS & Last Call)," document JCTVC-L1003 of JCT-VC, Geneva, CH, Jan. 2013.
[3] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, Dec. 2012.
[4] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, "Comparison of the Coding Efficiency of Video Coding Standards Including High Efficiency Video Coding (HEVC)," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1669-1684, Dec. 2012.
[5] Y. J. Ahn, W. J. Han, and D. G. Sim, "Study of decoder complexity for HEVC and AVC standards based on tool-by-tool comparison," in Proceedings of SPIE 8499, Applications of Digital Image Processing XXXV, October 2012, paper 84990X.
[6] F. Bossen, B. Bross, K. Sühring, and D. Flynn, "HEVC Complexity and Implementation Analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1685-1696, Dec. 2012.
[7] M. Alvarez-Mesa, C. C. Chi, B. Juurlink, V. George, and T. Schierl, "Parallel Video Decoding in the Emerging HEVC Standard," in Proceedings of the 37th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2012.
[8] C. C. Chi, M. Alvarez-Mesa, B. Juurlink, V. George, and T. Schierl, "Improving the Parallelization Efficiency of HEVC Decoding," in Proceedings of IEEE International Conference on Image Processing (ICIP), Oct. 2012.
[9] C. C. Chi, M. Alvarez-Mesa, B. Juurlink, G. Clare, F. Henry, S. Pateux, and T. Schierl, "Parallel Scalability and Efficiency of HEVC Parallelization Approaches," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1827-1838, Dec. 2012.
[10] C. C. Chi, M. Alvarez-Mesa, J. Lucas, B. Juurlink, and T. Schierl, "Parallel HEVC Decoding on Multi- and Many-core Architectures," Journal of Signal Processing Systems, pp. , Dec. 2012.
[11] European Broadcast Union, "EBU UHD-1 Test Set,". [Online]. Available: http://tech.ebu.ch/testsequences/uhd-
[12] JCT-VC, "Subversion Repository for the HEVC Test Model version HM,". [Online]. Available: https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-/
[13] F. Bossen, "Common HM test conditions and software reference configurations," document JCTVC-L1100 of JCT-VC, Geneva, CH, Jan. 2013.

Fig. 5. Decoding time for each sequence of the 4K EBU UHD-1 test set for the Random Access Main configuration with 1 and 4 cores: (a) Intel i7-9xm desktop CPU at . GHz using 1 core - 1 thread; (b) Intel i7-9xm desktop CPU at . GHz using 4 cores - 4 threads; (c) Intel i7-9xm desktop CPU at . GHz using 4 cores - 10 threads. (Plots: one curve per test sequence, with a dashed line marking the 50 Hz real-time limit.)