A parallel HEVC encoder scheme based on Multi-core platform
Shu Jun1,2,3,a, Hu Dong1,2,3,b

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

1 Education Ministry's Key Lab of Broadband Wireless Communication and Sensor Network Technology
2 Education Ministry's Engineering Research Center of Ubiquitous Network and Health Service System
3 Jiangsu Province's Key Lab of Image Processing and Image Communications, Nanjing University of Posts and Telecommunications, Nanjing, 210003, China
a email: 1013010606@njupt.edu.cn, b email: hud@njupt.edu.cn

Keywords: HEVC; multi-core platform; parallel processing; frame-level; CTB-level

Abstract. In this paper, we propose a parallel HEVC encoder scheme for multi-core platforms that maximizes parallel scalability by exploiting two levels of parallelism: frame-level parallelism and CTB-level parallelism. Inspired by the intra-CTB row level parallelism of WPP in HEVC, we investigate the inter-frame prediction dependency of a CTB on its reference CTBs and identify the inter-CTB correlation. Using this correlation, we divide a frame into CTB units and create CTB-row level coding threads as soon as their corresponding reference CTBs are available. Each thread is bound to a processing core, so both intra- and inter-CTB rows can be encoded in parallel. Moreover, we introduce a priority scheduling mechanism to control the coding threads. Experiments on the Tilera-Gx36 multi-core platform show that, compared with serial execution, the proposed method achieves speedups of 3.6 and 4.3 times for 1080P and 720P video sequences, respectively.

1. Introduction
Increasing demand for video coding support at higher resolutions in consumer devices is driving video coding development toward higher compression rates.
To meet these demands, the Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T and the ISO/IEC Moving Picture Experts Group has developed a new video coding standard, High Efficiency Video Coding (HEVC) [1]. The HEVC project aims to reduce the bitrate by a further 50% compared to the previous H.264/AVC standard. However, the cost of higher coding efficiency is much higher computational complexity. HEVC encoders are therefore expected to be more complex than H.264/AVC encoders [2], and parallelism has to be considered for real-time HEVC encoding. With the development of multi-core Digital Signal Processor (DSP) platforms, parallelizing HEVC encoding on such platforms is an important way to deal with this problem [3].

In the video coding layer of HEVC, the same hybrid approach (intra-/inter-picture prediction and 2-D transform coding) as that of H.264/AVC [4] is employed. Fig.1 depicts the block diagram of a hybrid video encoder that can create a bitstream conforming to the HEVC standard. As Fig.1 shows, the main modules of an HEVC encoder include intra-/inter-prediction, transformation and quantization, inverse transformation and inverse quantization, entropy coding, and loop filtering.

Compared with H.264/AVC, the complexity of HEVC is mainly reflected as follows. Motion compensation in HEVC uses the same quarter-pixel motion resolution, but the derivation of interpolated pixels is generalized by using larger 8-tap and 4-tap interpolation filters for luma and chroma, respectively. Intra-prediction is generalized as well by parameterizing the prediction angle, allowing 33 different angles. The transform is still an integer transform but allows more block sizes, ranging from 4×4 to 32×32, with higher internal processing precision. HEVC also defines a more efficient block structure, called the Coding Tree Block (CTB). The sequence is coded in CTBs of size 16×16, 32×32, or 64×64 pixels.
Each CTB can be recursively subdivided using quad-tree segmentation into coding units (CUs), which can in turn be further subdivided into prediction units (PUs) and transform units (TUs). Coding units can be subdivided down to a minimum CU size of 8×8. The minimum PU sizes are 4×8 and 8×4, and the minimum TU size is 4×4 pixels [5].

© 2016. The authors - Published by Atlantis Press

As we can see, the computational complexity of HEVC encoding is huge. A single-core processor is unlikely to encode a 1080P or higher resolution HEVC video in real time. This paper presents a new parallel scheme for multi-core platforms. In particular, our contributions can be summarized as follows:

1) Inspired by Wavefront Parallel Processing (WPP) of HEVC and its intra-CTB row level parallelism, we investigate the inter-frame prediction dependency of a CTB on its reference CTBs and identify the inter-CTB correlation.

2) Based on the intra-CTB and inter-CTB correlation, we divide a frame into CTB units and create CTB-row level coding threads as soon as their corresponding reference CTBs are available. Each thread is bound to a single processing core, so intra- and inter-CTB rows can be encoded in parallel. Moreover, we introduce a priority scheduling mechanism to control these coding threads. Frames and CTBs are thus processed in parallel, and both frame-level and CTB-row-level parallelism are realized.

3) We test the proposed parallel encoding scheme on a multi-core platform, the Tilera-Gx36 system with 36 cores running at 1.2 GHz, for 1080P and 720P video sequences. We compare the proposed approach with serial execution of the x265 software in terms of both speedup and PSNR to demonstrate the efficiency of the proposed parallel scheme.

The rest of this paper is organized as follows. Section 2 introduces the parallelization strategies of HEVC and analyzes the dependencies of inter-frame CTBs. Section 3 describes the proposed parallel scheme in detail. Experimental results and their analysis are presented in Section 4, followed by a short conclusion in Section 5.

2. Parallelization Strategies in HEVC

Fig.1 General diagram of the HEVC encoder

The current HEVC standard contains several strategies aimed at better parallel processing. H.264/AVC offers frame-level, slice-level, and macroblock-level parallelism [6]. Taking slice-level parallelism as an example, a picture can be partitioned into multiple arbitrarily sized slices for independent processing. Having multiple slices in a picture, however, degrades objective and subjective quality because of slice boundary discontinuities and causes significant coding losses. To overcome these shortcomings of the parallelization strategies employed in H.264/AVC, two tools have been included in the HEVC standard: Wavefront Parallel Processing (WPP) and Tiles. Both tools allow each picture to be subdivided into multiple partitions that can be processed in parallel.
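Because these tools partition a picture along the CTB grid, the grid dimensions bound how many CTB rows exist to be processed in parallel. The following is a minimal sketch in Python, not part of the original paper. It assumes luma resolutions of 1920×1080 for 1080P and 1280×720 for 720P (the paper does not state them), and `ctb_grid` is a hypothetical helper name.

```python
def ctb_grid(width, height, ctb_size):
    """Return (columns, rows) of CTBs needed to cover a width x height picture.

    Partial CTBs at the right and bottom borders still count, hence the
    ceiling division (written with integers to avoid float rounding).
    """
    cols = -(-width // ctb_size)
    rows = -(-height // ctb_size)
    return cols, rows


# Assumed resolutions; the paper only names the sequences as 1080P and 720P.
RESOLUTIONS = {"1080P": (1920, 1080), "720P": (1280, 720)}

for name, (w, h) in RESOLUTIONS.items():
    for size in (16, 32, 64):
        cols, rows = ctb_grid(w, h, size)
        print(f"{name}, CTB {size}x{size}: {cols} columns x {rows} rows")
```

Under these assumptions, a 32×32 CTB yields 34 rows for 1080P and 23 rows for 720P, roughly twice as many as a 64×64 CTB.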

2.1 Wavefront Parallel Processing
When WPP is enabled, a picture is divided into CTB-rows, and every row can be assigned to a core [7]. The first row is processed in the ordinary way; the second row can start after two CTBs of the first row have been encoded, the third row can start after two CTBs of the second row have been encoded, and so on. Unlike slices, WPP breaks no coding dependences at row boundaries. Additionally, CABAC probabilities are propagated from the second CTB of the previous row to further reduce coding losses (Fig.2). WPP also does not change the regular raster scan order. Furthermore, a WPP bitstream can be losslessly transcoded to and from a nonparallel bitstream with only an entropy-level conversion [8].

2.2 Investigation on Inter-Frame Dependency
Although WPP makes intra-frame parallelism easier, the achievable parallelism is limited. Because of the intra-frame dependency, WPP does not allow all rows to start encoding simultaneously, so the rows do not finish at the same time, which makes the parallelization inefficient. Wavefront parallelism can therefore still be improved. To maximize parallel scalability, we combine inter-frame level parallelism with intra-CTB row level parallelism. It is therefore necessary to analyze the dependency of a CTB in an inter-coded frame on its reference CTBs, i.e., the inter-frame dependency.

To improve encoding performance, HEVC uses reconstructed frames as reference pictures to remove temporal redundancy, so inter-CTB row level parallelism must take the inter-frame dependency into account. Before a CTB in the current frame can be encoded, all of its reference CTBs within the search range must be available. The inter-CTB row level parallel approach is shown in Fig.3. The first row of Image 1 cannot be encoded immediately after the first row of Image 0 has been encoded, because the motion estimation search range is larger than a CTB-row.
To predict accurately, we have to wait until the required coding units have been reconstructed. If the motion estimation search range is twice the size of a CTB, the first CTB in the upper left corner of Image 1 cannot be encoded until the third row of Image 0 has been reconstructed [9]. As shown in Fig.3, the current CTB in Image 1 can then be encoded simultaneously with a CTB in Image 0. Therefore, as long as no inter-frame CTB prediction dependency exists between them, some CTBs in Image 1 and Image 0 can be encoded simultaneously. Introducing this inter-frame parallelism improves the parallel speedup significantly.

Fig.2 WPP processes rows of CTBs in parallel
Fig.3 Inter-CTB level parallelism

3. Proposed Parallel Encoding Method
Based on the data dependence analysis of inter-frame parallelism, this section presents the proposed parallel method. We divide our work into two parts: the first part describes the proposed method in detail, and the second part presents the priority scheduling mechanism that controls the coding threads, which is essential to our scheme.
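The two dependency rules analyzed in Sections 2.1 and 2.2 can be written as simple readiness checks. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names and the per-row progress counters are hypothetical, the search range is expressed in whole CTB rows, and effects such as in-loop filtering delay are ignored.

```python
def wpp_ctb_ready(done, row, col, cols):
    """Intra-frame (WPP) rule for CTB (row, col).

    done[r] is the number of CTBs already encoded in row r of the frame.
    A CTB may start when it is the next CTB of its own row and the row
    above is two CTBs ahead (or already complete near the right border).
    """
    if done[row] != col:      # CTBs inside a row are coded left to right
        return False
    if row == 0:              # the first row has no row above it
        return True
    return done[row - 1] >= min(col + 2, cols)


def inter_row_ready(ref_rows_done, row, search_rows, total_rows):
    """Inter-frame rule for CTB-row `row` of the current frame.

    ref_rows_done is the number of CTB rows (counted from the top) already
    reconstructed in the reference frame. The row needs every reference
    row down to row + search_rows, clipped at the bottom picture border.
    """
    last_needed = min(row + search_rows, total_rows - 1)
    return ref_rows_done >= last_needed + 1


# Second row may start once two CTBs of the first row are encoded.
print(wpp_ctb_ready([2, 0], row=1, col=0, cols=5))
# With a search range of two CTB rows, row 0 of Image 1 needs three
# reconstructed rows of Image 0.
print(inter_row_ready(3, row=0, search_rows=2, total_rows=34))
```

A CTB-row of a non-I frame would be eligible only when both checks pass; for I frames the second check does not apply.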

3.1 Combining Intra-CTB Row Level Parallelism with Inter-Frame Parallelism
Our proposed scheme combines intra-CTB row level parallel processing with inter-frame parallel processing. Compared to a single granularity, multi-granularity division can achieve higher speedup. In fact, the inter-frame level parallelism does not operate on whole frames; it relies on inter-CTB row level parallelism to complete the final coding process.

The proposed scheme is based on a multi-core platform. Using parallel programming techniques such as task pools and thread pools, we assign tasks to frames that can be coded. Each task has a separate memory space to store the parameter information of its frame, and the frame-level tasks share a common thread pool. Several frames can therefore be encoded in parallel, which is called inter-frame level parallelism. Then, according to the received parameters, we create multiple CTB-row level tasks that call threads to complete the coding, and we bind each thread to the corresponding CPU core. The proposed scheme further uses the shared memory model of the homogeneous multi-core platform to achieve multi-thread synchronization. Intra- and inter-CTB rows can be encoded in parallel only when their dependencies have been resolved; by checking the intra- and inter-dependency flags, we achieve CTB-row level parallelism. The specific process is shown in Fig.4.

Fig.4 The scheme combining intra-CTB level parallelism with inter-frame level parallelism

3.2 Priority Scheduling Mechanism
As Fig.4 shows, two priority queue structures are introduced to implement the proposed scheme. The main coding thread first reads a picture, adds it to the frame-level task queue, and continues reading frames while the queue is not full. When the queue is full, we use the task pool to assign tasks to the frames in the queue that can be encoded. Each task has a separate memory space to store the parameter information of its frame, and the frame-level tasks share a common thread pool. As noted above, the inter-frame level parallelism does not operate on whole frames; it relies on inter-CTB row units to complete the final coding process. In this paper, the CTB-row is the smallest parallel granularity: each CTB-row is added to the CTB-row level task queue and is encoded by an idle thread from the thread pool.

To determine the CTB-row coding order, we define two priority levels for the CTB-row task queue. The first is the inter-level and the second is the intra-level. The inter-level priority is higher than the intra-level priority, which means that if both an inter-CTB row and an intra-CTB row are ready to be encoded, the inter-CTB row enters the queue first. Specifically, the inter-level priority specifies that if several CTB-rows in different frames are ready to be encoded, the CTB-row with the smallest frame number in the frame-level task queue joins the queue first. Similarly, the intra-level priority specifies that if several CTB-rows in the same frame are ready to be encoded, the CTB-row with the smallest row number joins the queue first.

The two-level task queue is depicted in Fig.5. Each CTB-row task in the CTB-row level task queue calls an idle thread to encode it until no threads are available in the thread pool; the remaining CTB-row tasks then have to wait for a new idle thread. Once a CTB-row has been encoded, its coding thread rejoins the thread pool and becomes available to other CTB-row level tasks. It is worth noting that the coding within a CTB-row is carried out serially, one CTB after another.
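The two-level priority rule described above maps naturally onto a heap keyed by (frame number, row number). The following Python sketch is only an illustration of that ordering, assuming ready CTB-rows are identified by such a pair; the class `CtbRowTaskQueue` and the function `dispatch` are hypothetical names, and the thread pool is reduced to a count of idle threads.

```python
import heapq


class CtbRowTaskQueue:
    """Ready CTB-row tasks ordered by (frame number, row number).

    Tuples compare element by element, so the smallest frame number wins
    first (inter-level priority) and, within one frame, the smallest row
    number wins (intra-level priority).
    """

    def __init__(self):
        self._heap = []

    def push(self, frame_no, row_no):
        heapq.heappush(self._heap, (frame_no, row_no))

    def pop(self):
        return heapq.heappop(self._heap)

    def __len__(self):
        return len(self._heap)


def dispatch(queue, idle_threads):
    """Give queued tasks to idle threads; the remaining tasks keep waiting."""
    started = []
    while idle_threads > 0 and len(queue) > 0:
        started.append(queue.pop())
        idle_threads -= 1
    return started


queue = CtbRowTaskQueue()
for frame_no, row_no in [(1, 0), (0, 3), (0, 2), (1, 1)]:
    queue.push(frame_no, row_no)

# Three idle threads take the three highest-priority rows.
print(dispatch(queue, idle_threads=3))  # [(0, 2), (0, 3), (1, 0)]
```

In a real encoder, a finished thread would return to the pool and trigger another `dispatch` call; that bookkeeping is omitted here.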
For CTB-rows in an I frame, multiple CTB-rows can be encoded in parallel once the intra-frame data dependency is resolved. For CTB-rows in non-I frames, the inter-frame data dependency must also be considered, because reconstructed pixels are needed to exploit temporal redundancy. A CTB-row thread can therefore be started only after the corresponding CTB-rows in the reference region have been reconstructed, which achieves intra- and inter-CTB parallel processing.

4. Experimental results
To compare our proposed parallel method with serial execution, we adopt the open-source HEVC encoder x265 as reference software, which supports WPP and includes all features of the Main profile [10]. The experimental platform is the Tilera-Gx36, a member of the TILERA multi-core processor family with 36 cores. To avoid platform-specific effects, we do not use any Tilera-Gx36 platform-dependent optimizations. The test sequences are Kimono at 1080P resolution and Fourpeople at 720P resolution. We implement the proposed parallel method in x265 with the CTU size set to 32×32, so that more CTB-rows are available to be encoded in parallel. The QP is fixed at 27; similar results are observed for other QPs. The detailed experimental environment and conditions are listed in Table 1.

Table 1. Experimental Platform and Test Conditions
Processor                 TILE-Gx8036
Architecture              TILE-Gx
Number of cores           36
Frequency (single core)   1.2 GHz
Operating system          Tilera MDE-4.0.3.1415127
Compiler                  GCC 4.6.3
Test sequences            Kimono, Fourpeople
Reference software        x265
Encoding conditions       QP: 27

To evaluate the efficiency of our proposed parallel method, we use four indices to compare it with the original method in x265. The fps denotes the coding rate, the PSNR and bitrate measure the change in picture quality, and the speedup of our proposed method is calculated as follows:

$$\text{Speedup} = \frac{T_{serial}}{T_{proposed}} \qquad (1)$$

where $T_{serial}$ and $T_{proposed}$ are the coding times of serial execution and of our proposed method, respectively. Table 2 shows the parallel performance of our proposed method compared to serial execution, and Fig.6 shows the speedup for the two sequences.

Table 2. Experimental Results of Our Proposed Parallel Method

Original method
Class               fps     PSNR (dB)   Bitrate (kbps)
Kimono (1080P)      2.65    38.403      1771.60
Fourpeople (720P)   3.94    27.870      25006.86

Proposed method
Class               Threads number (cores)   fps     PSNR (dB)   Bitrate (kbps)
Kimono (1080P)      2                        3.03    38.394      1792.63
                    4                        4.44    38.394      1792.63
                    8                        6.31    38.394      1792.63
                    16                       7.84    38.394      1792.63
                    24                       8.52    38.394      1792.63
                    32                       9.46    38.394      1792.63
                    36                       9.31    38.394      1792.63
Fourpeople (720P)   2                        5.76    30.112      25117.76
                    4                        8.32    27.857      25117.76
                    8                        10.71   27.857      25117.76
                    16                       13.55   27.857      25117.76
                    24                       15.01   27.857      25117.76
                    32                       16.75   27.857      25117.76
                    36                       16.69   27.857      25117.76

From Fig.6 and Table 2, we make three major observations:

1) Compared to the original method in x265, our proposed method changes PSNR and bitrate only slightly. For Kimono, for example, the PSNR drops from 38.403 dB to 38.394 dB and the bitrate rises from 1771.60 kbps to 1792.63 kbps (about 1.2%). This may result from encoding more CTB-rows in parallel, which weakens the inter-frame correlation that can be exploited, so that the image quality declines slightly.

2) The speedup of our proposed method scales well below 32 cores, but the upward trend flattens from this point on. This may be due to inter-core synchronization and communication costs.

3) Compared with serial execution, our proposed method achieves speedups of 3.6 and 4.3 times for 1080P and 720P video sequences, respectively.

Fig.5 The two-level task queue model
Fig.6 Speedup of the proposed method using different numbers of cores
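Equation (1) can be checked against the fps values of Table 2. For a fixed number of frames, coding time is inversely proportional to fps, so the ratio of coding times equals the ratio of the proposed fps to the original fps. The sketch below is a worked check rather than part of the paper; it assumes the original-method row of Table 2 is the serial baseline, and the names `SERIAL_FPS`, `PROPOSED_FPS`, `speedup`, and `best_thread_count` are hypothetical.

```python
# fps values transcribed from Table 2 (original method taken as the serial baseline).
SERIAL_FPS = {"Kimono": 2.65, "Fourpeople": 3.94}
PROPOSED_FPS = {
    "Kimono": {2: 3.03, 4: 4.44, 8: 6.31, 16: 7.84, 24: 8.52, 32: 9.46, 36: 9.31},
    "Fourpeople": {2: 5.76, 4: 8.32, 8: 10.71, 16: 13.55, 24: 15.01, 32: 16.75, 36: 16.69},
}


def speedup(serial_fps, proposed_fps):
    """Equation (1): T_serial / T_proposed, which equals the fps ratio
    proposed / serial when both runs encode the same number of frames."""
    return proposed_fps / serial_fps


def best_thread_count(sequence):
    """Thread count with the highest fps for the given sequence."""
    per_thread = PROPOSED_FPS[sequence]
    return max(per_thread, key=per_thread.get)


for sequence, serial in SERIAL_FPS.items():
    threads = best_thread_count(sequence)
    value = speedup(serial, PROPOSED_FPS[sequence][threads])
    print(f"{sequence}: best at {threads} threads, speedup {value:.2f}")
```

Under this reading, both sequences peak at 32 threads, with ratios of about 3.57 and 4.25, which round to the 3.6 and 4.3 reported in the text.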

5. Conclusion
In this paper, a parallel scheme based on a multi-core platform is proposed to improve the speedup of the HEVC encoder. On the basis of intra-frame CTB-row level parallelism, we exploit inter-frame parallelism to achieve multi-grain intra- and inter-CTB parallelization. We also introduce a priority scheduling mechanism to control the coding threads. Experimental results show that the new scheme improves the parallel speedup with only a small change in PSNR and bitrate. However, we only tested a CTU size of 32×32 and did not consider the effect of other CTU sizes such as 64×64. How to choose a suitable CTU size for different video sequences is the direction of our future research.

References
[1] G. J. Sullivan and J.-R. Ohm. Recent developments in standardization of high efficiency video coding (HEVC). Proc. SPIE, Aug. 2010, p. 77980V.
[2] F. Bossen, B. Bross, K. Sühring and D. Flynn. HEVC complexity and implementation analysis. IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1684-1695, Dec. 2012.
[3] S. Borkar and A. A. Chien. The future of microprocessors. Commun. ACM, vol. 54, pp. 67-77, May 2011.
[4] G. J. Sullivan, J.-R. Ohm, W.-J. Han and T. Wiegand. Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1648-1667, Dec. 2012.
[5] C. C. Chi, M. Alvarez-Mesa, B. Juurlink, G. Clare, F. Henry, S. Pateux and T. Schierl. Parallel scalability and efficiency of HEVC parallelization approaches. IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1826-1837, Dec. 2012.
[6] B. Juurlink, M. Alvarez-Mesa, C. C. Chi, A. Azevedo, C. Meenderinck and A. Ramirez. Scalable Parallel Programming Applied to H.264/AVC Decoding. Berlin, Germany: Springer, 2012.
[7] F. Henry and S. Pateux. Wavefront parallel processing. Tech. Rep. JCTVC-E196, Mar. 2011.
[8] G. Clare and F. Henry. An HEVC transcoder converting non-parallel bitstreams to/from WPP. Tech. Rep. JCTVC-J0032, May 2012.
[9] Wei Fei-fei, Liang Jiu-zhen and Han Jun. A Parallel X264 Encoder Algorithm Based on the Inter-Frame and Intra-Frame Macroblock-Level. Computer Engineering & Science, vol. 33, no. 7, 2011.
[10] x265 project, MulticoreWare, https://bitbucket.org/multicoreware/x265/src/e7424e0cb60f4bb08e7d519a49ff9ab77d6fe713/source/common/vec/dct-sse3.cpp?at=default.