A Novel Parallel-friendly Rate Control Scheme for HEVC

Jianfeng Xie, Li Song, Rong Xie, Zhengyi Luo, Min Chen
Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University
Cooperative Medianet Innovation Center, Shanghai, China
School of Electronics and Information Engineering, Shanghai University of Electric Power
Multicoreware
Email: {richrd, song li, xierong}@sjtu.edu.cn, lzy@shiep.edu.cn, chenm003@163.com

Abstract

Rate control plays a key role in video coding and has a significant effect on encoder performance. As parallel video coding frameworks become increasingly popular, rate control suited to parallel coding is highly desirable. However, most rate control algorithms focus only on rate-distortion performance and ignore the data dependencies that arise in parallel coding. In this paper, based on the parallel framework of the x265 encoder, we propose a parallel-friendly rate control scheme for HEVC coding that supports both frame-level and slice-level parallelism. Experimental results show that the algorithm achieves not only highly accurate rate control but also excellent rate-distortion performance under parallel coding.

I. INTRODUCTION

In recent years, high-resolution video applications have become more and more popular. To satisfy the demand for high-quality video services, several parallel technologies, such as frame parallelism, slice/tile parallelism and Wavefront Parallel Processing (WPP) [1], have been designed to achieve real-time video coding. However, a video encoder built on a parallel framework faces the challenge of data dependency. Data dependency affects not only the speedup but also the rate control (RC) algorithm in the parallel framework. Rate control, which is closely related to RD performance, is adopted to minimize distortion under a target bitrate constraint. It roughly consists of two steps. First, appropriate target bits, or bit budgets, are allocated at the different coding levels.
Second, appropriate coding parameters are set so that the encoder produces the allocated number of bits. The bit budgets should be adjusted dynamically according to previous coding information, such as the bits actually used, the content complexity and the buffer status. However, with parallel coding, multiple coding units may be encoded at the same time, which usually makes the coding information of the immediately preceding units unavailable and degrades rate control performance. Therefore, conventional rate control algorithms have to be adapted to parallel coding frameworks.

In [2][3], a parallel rate control algorithm based on spatial division of the image was proposed for MPEG-2 MP@HL (Main Profile at High Level). Every frame is divided into several parts of equal size, which are encoded independently at the same time. After encoding, all bit streams are merged into a single stream. A global rate control algorithm allocates target bits to the different parts according to previous coding information. In [4], a parallel rate control algorithm for H.264 SVC (Scalable Video Coding) was proposed based on the dependency between different layers. For every slice in a frame, target bits are allocated according to the coding complexity of the co-located slice in the previous frame, where coding complexity is defined as the product of the actual bits and the quantization step. A similar strategy is adopted to allocate target bits at the MB level. In [5], a parallel rate control algorithm for H.264/AVC was proposed. This scheme performs target bit allocation at the GOP level and the frame level. At the GOP level, target bits are allocated based on the buffer occupancy rate. For every frame in a GOP, target bits are calculated in advance according to the frame type, which enables parallel encoding of multiple frames.

All of the above methods were designed for previous generations of coding standards. As far as we know, no rate control algorithm has been designed specifically for parallel HEVC coding.
In this paper, we propose a novel parallel-friendly rate control scheme that supports parallel coding at both the frame level and the slice level. When applied, the scheme requires few modifications to the original parallel framework. In addition, its computational complexity remains low, which makes real-time applications possible. The rest of this paper is organized as follows. Section II describes the parallel-friendly rate control algorithm. Section III presents the experimental results and discusses the coding performance. Section IV concludes the paper.

II. PROPOSED PARALLEL-FRIENDLY RATE CONTROL SCHEME

The rate control algorithm can be applied to parallel coding down to the slice level. As shown in Fig. 1, every frame is divided into multiple parts of equal size, i.e. slices, which are encoded independently at the same time. The output stream is composed of the bit streams from the different sub-coders. In addition to the internal rate control of each coder, a global rate control module periodically synchronizes the internal controls and reallocates bit rates to the slices based on previous coding information. This method adapts efficiently to slice parallelism and minimizes the target bit estimation error caused by content variation.

Fig. 1. Joint rate control scheme of parallel video coding architecture

This parallel rate control scheme consists mainly of four modules: frame level bit allocation, slice level bit allocation, the rate-lambda (R-λ) model and rate control status update. The bit allocation modules allocate bits at different levels of granularity according to the actually coded bits, content complexity, buffer status, frame type, etc. The R-λ model is used to set appropriate encoding parameters, i.e. lambda and QP, so that the allocated bits are produced. The rate control status update module covers the R-λ model parameter update and the buffer status update.

A. Frame level target bit allocation

All frames are classified into four types: I, P, B and b, where B denotes a reference B frame and b denotes a non-reference B frame. Frames of different types receive different rate control. For non-I frames, QP is determined by the R-λ model. For I frames, the interval between them is usually large, so the correlation between different I frames is assumed to be weak and is neglected. Hence the original R-λ model used to set encoding parameters is no longer applicable, and a simple QP estimation method is designed for I frames.

1) Non-I frame: The bit allocation depends mainly on three conditions: the global target bitrate, the virtual buffer status and the frame type. First, the average number of bits per frame is calculated as

$$ T_{avg} = \frac{R_{tar}}{fps} \tag{1} $$

where $R_{tar}$ is the global target bitrate and $fps$ is the frame rate. The average frame bits serve as the benchmark for adaptive bit allocation. Second, the virtual buffer occupancy is calculated by

$$ V_i = \begin{cases} L, & i = 0 \\ V_{i-1} + b_i - T_{avg}, & \text{otherwise} \end{cases} \tag{2} $$

where $b_i$ is the number of bits actually generated by the $i$-th frame and $L$ denotes the target buffer occupancy, which is set to 0.5 times $R_{tar}$:

$$ L = 0.5 \times R_{tar} \tag{3} $$

In other words, the buffer can tolerate at most 0.5 seconds' worth of bitrate fluctuation. A larger target occupancy means that the virtual buffer can tolerate more bitrate fluctuation, at the cost of a larger delay.
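The buffer model of Eqs. (1)-(3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the x265 implementation: the class name `VirtualBuffer` and its attribute names are hypothetical, and bit counts are kept as plain floats.

```python
class VirtualBuffer:
    """Illustrative virtual buffer following Eqs. (1)-(3); names are hypothetical."""

    def __init__(self, target_bitrate, fps):
        # Eq. (1): average bits per frame, T_avg = R_tar / fps
        self.avg_frame_bits = target_bitrate / fps
        # Eq. (3): target occupancy, L = 0.5 * R_tar
        self.target_occupancy = 0.5 * target_bitrate
        # Eq. (2), case i = 0: the buffer starts at the target occupancy
        self.occupancy = self.target_occupancy

    def update(self, actual_bits):
        # Eq. (2), otherwise: V_i = V_{i-1} + b_i - T_avg
        self.occupancy += actual_bits - self.avg_frame_bits
        return self.occupancy


buf = VirtualBuffer(target_bitrate=6_000_000, fps=30)
buf.update(250_000)  # a frame that overshoots the average raises the occupancy
buf.update(150_000)  # a frame that undershoots by the same amount restores it
```

A frame that spends exactly `avg_frame_bits` leaves the occupancy unchanged, so the distance between `occupancy` and `target_occupancy` accumulates the over- or under-spend of the frames coded so far.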
In our experiments, 0.5 times $R_{tar}$ proved to be a good empirical value that balances bitrate fluctuation against delay. The virtual buffer is initialized to the target buffer occupancy $L$, and after each frame is encoded the buffer occupancy is updated using the actually generated bits $b_i$. The buffer status directs bit allocation in two ways. When the buffer occupancy lies between 10% and 90% of the target buffer occupancy, a low risk of overflow or underflow is assumed and the frame target bits are slightly corrected by

$$ B = \frac{L - V}{SW} \tag{4} $$

where $SW$ is the size of the sliding window, which is used to smooth the bit rate adjustment. $SW$ is set to 40 in our experiments. When the buffer occupancy is less than 10% or larger than 90% of the buffer size, a high risk of underflow or overflow is assumed, so the frame target bits need a further adjustment to avoid that situation. The target bits that take the buffer status into account are calculated by

$$ T_{norm} = \alpha \times T_{avg} + B \tag{5} $$

where $\alpha$ is defined as

$$ \alpha = \begin{cases} 0.9, & V > 0.9 \times T_{avg} \\ 1, & 0.1 \times T_{avg} \le V \le 0.9 \times T_{avg} \\ 1.1, & V < 0.1 \times T_{avg} \end{cases} \tag{6} $$

Third, the frame type should also be taken into account. The final target bits are defined as

$$ T = T_{norm} \times \omega_p = (\alpha \times T_{avg} + B) \times \omega_p \tag{7} $$

where $\omega_p$ is the frame-type-dependent weight, which can be fitted by pre-analysis of coding information. It is determined by the GOP structure settings, mainly the key frame interval keyint and the number of consecutive b-frames bframes. For example, when bframes is set to the default value 4, the frame weight can be determined as

$$ \omega_p = a \times keyint^{\,b} + c \tag{8} $$

where the parameters $a$, $b$ and $c$ are given in Table I.

TABLE I
PARAMETER OF FRAME WEIGHT CALCULATION

Frame type    a         b          c
P             -7.272    -0.451     3.589
B             -1.333    -0.627     0.6468
b             -0.39     -0.4974    0.2842

For other GOP structures, the corresponding parameters can be fitted with a similar method.

2) I frame: The coding pattern of I frames is quite different from that of non-I frames, so an I frame typically consumes many more bits than other frames.
Accurate bit allocation and QP estimation remain a challenge for I frames, mainly because of the weak correlation between neighboring I frames caused by the large interval between them. Fortunately, since rate control usually runs over a number of frames rather than a single one, the bit rate can still be regulated afterwards despite a possible inaccuracy. Here we use a simple QP estimation method to determine the quantization parameter for I frames. In view of the roles the different frame types play, the QP of B frames should be larger than that of P frames, while the QP of P frames should be larger than that of the periodically inserted I frames. The QP of I and B frames can be expressed as a rough conversion of the QP of P frames:

$$ QP_I = QP_P - 6 \times ipfactor \tag{9} $$

$$ QP_B = QP_P + 6 \times pbfactor \tag{10} $$

where $ipfactor$ and $pbfactor$ are conversion factors set to 1.4 and 1.3, respectively. The QP of every frame can thus be expressed in an equivalent P frame format, and the estimate is updated via an exponential average of the QPs of the coded frames with a forgetting factor of 0.95:

$$ QP_n = \frac{\sum_{i=0}^{n-1} QP_i \times 0.95^{\,n-1-i} + 0.24 \times 0.95^{\,n}}{\sum_{i=1}^{n} 0.95^{\,n-i} + 0.01 \times 0.95^{\,n}} \tag{11} $$

where $QP_i$ is the QP of the $i$-th coded frame in equivalent P frame format and $QP_n$ is the estimated QP of the current frame in equivalent P frame format. If the current frame is an I frame, its QP is estimated through the above equation. Note that the above equation applies only to I frames. Given the limited number of I frames, the method is feasible despite a slight inaccuracy.

B. Slice level target bit allocation

As shown in Fig. 1, the slices that make up one image are encoded independently. Because the content of different slices may differ, allocating bits uniformly to all slices may lead to significantly different quality; for example, banding artifacts may appear at the slice boundaries. To avoid this, a global rate control module periodically synchronizes the rate control status of all coders and reallocates the target bitrate among the slices according to previous coding information.
Specifically, three parameters are calculated for every slice before a frame is encoded: the average target bits, the target buffer occupancy and the actual buffer occupancy. Let $m$ denote the number of slices. The average target bits $T_{avg}^{j}$, the target buffer occupancy $L^{j}$ and the actual buffer occupancy $V^{j}$ of the $j$-th slice are recalculated through (12), (13) and (14), respectively:

$$ T_{avg}^{j} = \frac{SATD^{j}}{\sum_{k=1}^{m} SATD^{k}} \times T_{avg} \tag{12} $$

$$ L^{j} = \frac{SATD^{j}}{\sum_{k=1}^{m} SATD^{k}} \times L \tag{13} $$

$$ V^{j} = \frac{SATD^{j}}{\sum_{k=1}^{m} SATD^{k}} \times V \tag{14} $$

where $SATD^{j}$ is the Sum of Absolute Transformed Differences (SATD) of the $j$-th slice, weighted over the previous co-located slices with a forgetting factor of 0.5. This makes the reallocation smoother. The weight indicates how important the SATD of a past frame is to the current frame; a frame far from the current frame receives a low weight:

$$ SATD_{n}^{j} = \sum_{i=0}^{n} w_i \times SATD_{i}^{j} \tag{15} $$

$$ w_i = \frac{0.5^{\,n-i}}{\sum_{k=0}^{n} 0.5^{\,n-k}} \tag{16} $$

After this reallocation, the target bits of each slice are calculated in the same way as in the frame level bit allocation.

C. λ and QP determination with R-λ model

Whereas the QP of I frames is determined by the method above, the R-λ model is adopted to determine the QP of non-I frames. The R-λ model is the latest rate control model in HEVC and has been adopted by the HEVC reference software HM. Based on an analysis of the RD relationship in HEVC, Li [6] establishes an exponential relationship between the rate and the Lagrange multiplier λ, modeled as

$$ \lambda = \alpha \times bpp^{\beta} \tag{17} $$

where $bpp$ denotes the bits per pixel. If the target bits are $T$ and the number of pixels is $N$, then

$$ bpp = \frac{T}{N} \tag{18} $$

The model parameters $\alpha$ and $\beta$ are updated according to the actually used bits after every frame is coded:

$$ \lambda_{comp} = \alpha_{old} \times bpp_{real}^{\beta_{old}} \tag{19} $$

$$ \alpha_{new} = \alpha_{old} + \delta_{\alpha} \times (\ln \lambda_{real} - \ln \lambda_{comp}) \times \alpha_{old} \tag{20} $$

$$ \beta_{new} = \beta_{old} + \delta_{\beta} \times (\ln \lambda_{real} - \ln \lambda_{comp}) \times \ln bpp_{real} \tag{21} $$

where $bpp_{real}$ is the actual bits per pixel.
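The R-λ model and its per-frame update, Eqs. (17)-(21), can be sketched as follows. This is a hedged illustration rather than the code of HM or x265: the class name `RLambdaModel` is hypothetical, and the initial values used in the example (α = 3.2003, β = -1.367, δ_α = 0.1, δ_β = 0.05) are assumptions chosen for illustration and are not stated in this paper.

```python
import math


class RLambdaModel:
    """Illustrative R-lambda model following Eqs. (17)-(21); names are hypothetical."""

    def __init__(self, alpha, beta, delta_alpha, delta_beta):
        self.alpha = alpha
        self.beta = beta
        self.delta_alpha = delta_alpha
        self.delta_beta = delta_beta

    def lambda_for(self, target_bits, num_pixels):
        bpp = target_bits / num_pixels          # Eq. (18)
        return self.alpha * bpp ** self.beta    # Eq. (17)

    def update(self, lambda_real, actual_bits, num_pixels):
        bpp_real = actual_bits / num_pixels
        lambda_comp = self.alpha * bpp_real ** self.beta            # Eq. (19)
        err = math.log(lambda_real) - math.log(lambda_comp)
        alpha_old = self.alpha
        self.alpha = alpha_old + self.delta_alpha * err * alpha_old  # Eq. (20)
        self.beta += self.delta_beta * err * math.log(bpp_real)      # Eq. (21)


# Assumed starting parameters, for illustration only.
model = RLambdaModel(alpha=3.2003, beta=-1.367, delta_alpha=0.1, delta_beta=0.05)
pixels = 1920 * 1080
lam = model.lambda_for(target_bits=200_000, num_pixels=pixels)
# Suppose the frame coded with this lambda used more bits than targeted.
model.update(lambda_real=lam, actual_bits=260_000, num_pixels=pixels)
```

When a frame overshoots its target, the λ predicted from the actual bits falls below the λ that was used, so the update raises α; the next frame with the same target then receives a larger λ and is expected to spend fewer bits.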
QP can be determined through the empirical relationship between λ and QP:

$$ QP = 3 \log_2 \frac{\lambda}{[\ldots]} + 12 \tag{22} $$

To keep the quality of the coded video consistent, QP is clipped to an appropriate range as follows. First, the difference from the QP of the last frame of a different frame type should not exceed 10:

$$ QP_{last\_diff\_type} - 10 \le QP \le QP_{last\_diff\_type} + 10 \tag{23} $$

Second, the difference from the QP of the last frame of the same frame type should not exceed 3:

$$ QP_{last\_same\_type} - 3 \le QP \le QP_{last\_same\_type} + 3 \tag{24} $$
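The two clipping rules of Eqs. (23) and (24) can be sketched as a small helper. The function name `clip_qp` and the use of `None` for a missing history are illustrative assumptions. The rules are applied in the order given in the text, so when the two ranges do not overlap the same-type constraint of Eq. (24) takes precedence; that tie-breaking is also an assumption of this sketch.

```python
def clip_qp(qp, last_qp_diff_type=None, last_qp_same_type=None):
    """Clip qp per Eqs. (23)-(24); a None reference skips that constraint."""
    if last_qp_diff_type is not None:
        # Eq. (23): stay within +/-10 of the last frame of a different type
        qp = min(max(qp, last_qp_diff_type - 10), last_qp_diff_type + 10)
    if last_qp_same_type is not None:
        # Eq. (24): stay within +/-3 of the last frame of the same type
        qp = min(max(qp, last_qp_same_type - 3), last_qp_same_type + 3)
    return qp


clipped = clip_qp(40, last_qp_diff_type=25, last_qp_same_type=30)
```

In this example the proposed QP of 40 is first limited to 35 by Eq. (23) and then to 33 by Eq. (24).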

III. EXPERIMENTAL RESULTS

Experiments are conducted to test the performance of the proposed rate control scheme. The main indicators are R-D performance and rate control accuracy, where R-D performance is measured by PSNR and SSIM, and rate control accuracy is measured by the error between the target bitrate and the actual bitrate. The benchmark used in our experiments is x265 1.6, which supports frame parallel and WPP encoding. Two rate control modes of the original x265, ABR and VBV, are used as anchors, where ABR means average bit rate and VBV means video buffer verifier. ABR mode has good RD performance but poor rate control accuracy. VBV mode is a plug-in mode that can be combined with most rate control schemes to further adapt the QP and achieve better rate control accuracy, but the RD performance degrades considerably. Our goal is to obtain a rate control accuracy close to that of VBV mode together with an improvement in RD performance.

The x265 preset is set to medium. For the sake of rate stability, scene cut detection is turned off, because the unpredictable I frames it introduces lead to drastic rate fluctuation and harm rate control performance. In fact, scene cut detection is usually not used in real-time coding applications. The key frame interval is set to 30 frames. The number of consecutive b-frames is set to 4. The hierarchical depth is two by default. All the 1080p HD sequences in Class B of the HEVC standard test sequences (Kimono1, ParkScene, Cactus, BasketballDrive and BQTerrace) are used. The target bitrates are set according to the HEVC call for proposals [7]. For VBV mode, the VBV buffer size is set to one second's worth of bits at the target rate. The VBV max rate and VBV init size are left at the encoder defaults.

A. Performance comparing to x265 anchor

First, to validate the RD performance improvement, two quality metrics, PSNR and SSIM, are used in our experiments.
A growing body of research agrees that SSIM is a more effective video quality metric than PSNR and provides a good approximation of perceptual visual quality degradation.

TABLE II
BD-RATE TO ORIGINAL RC ALGORITHM

BD-Rate            vs. ABR              vs. VBV
                   psnr      ssim       psnr      ssim
Kimono1            6.00%     6.39%      -0.34%    -1.32%
ParkScene          1.29%     1.77%      -3.13%    -3.78%
Cactus             -1.50%    -2.54%     -2.95%    -4.32%
BasketballDrive    1.77%     -0.70%     0.11%     -3.38%
BQTerrace          -2.40%    -1.88%     -4.44%    -5.06%
Average            1.03%     0.61%      -2.15%    -3.57%

The BD-Rate relative to the original x265 ABR mode and VBV mode is listed in Table II. The psnr and ssim columns list the BD-rate with PSNR and SSIM as the quality metric, respectively. From this table, we can see that the RD performance of the proposed algorithm is slightly worse than that of ABR mode but significantly better than that of VBV mode. In particular, for the SSIM metric, the proposed algorithm performs close to ABR mode and even better on some sequences. Compared with VBV mode, the proposed method achieves a gain of 3.57% on average.

Fig. 2. R-D curves: (a) Rate-PSNR curve of Cactus, (b) Rate-SSIM curve of Cactus, (c) Rate-PSNR curve of BQTerrace, (d) Rate-SSIM curve of BQTerrace

Fig. 2 shows the Rate-PSNR and Rate-SSIM curves of two sequences, Cactus and BQTerrace. On these sequences, the RD performance improves over both the original ABR and VBV algorithms. This is largely because we adopt a more reasonable target bit allocation method and a more accurate R-Q model than the original rate control module.

Second, to compare the rate control accuracy of our algorithm, a mismatch ratio is defined by

$$ M\% = \frac{|R_{actual} - R_{target}|}{R_{target}} \times 100\% \tag{25} $$

where $R_{target}$ and $R_{actual}$ denote the target bit rate and the actual bit rate, respectively. As stated before, each sequence uses the same target bitrates for the anchor algorithms and the proposed algorithm. Table III compares the bit rate mismatch of the two x265 anchor rate control algorithms and the proposed rate control algorithm. It shows that the rate control accuracy of the proposed algorithm is much better than that of ABR mode and slightly worse than that of VBV mode. Across all sequences, the worst-performing sequence is Kimono1. The main reason is that this sequence contains a scene cut, which degrades rate control performance. Even so, the proposed method still performs much better than ABR mode, whose maximum mismatch reaches 18%.
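As a small worked example of the mismatch ratio in Eq. (25), the following sketch computes it for a single encode. The function name `mismatch_percent` and the sample bitrates are hypothetical and are not measurements from this paper.

```python
def mismatch_percent(actual_kbps, target_kbps):
    """Bit rate mismatch of Eq. (25), in percent of the target bitrate."""
    return abs(actual_kbps - target_kbps) / target_kbps * 100.0


# Hypothetical encode: 6000 kbps requested, 6150 kbps produced.
m = mismatch_percent(actual_kbps=6150, target_kbps=6000)
```

Because the absolute difference is used, overshooting and undershooting the target by the same amount give the same mismatch.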
TABLE III
MISMATCH COMPARING

Sequence           Target/kbps    Rate control mismatch
                                  ABR       VBV      Proposed
Kimono1            6000           9.05%     2.49%    6.10%
                   4000           13.65%    3.47%    5.30%
                   1600           17.73%    5.17%    3.25%
                   1000           18.09%    5.67%    1.83%
ParkScene          6000           5.%       1.%      4.68%
                   4000           4.99%     1.82%    3.79%
                   1600           5.20%     2.78%    2.10%
                   1000           5.10%     3.26%    1.52%
Cactus             10000          1.41%     0.90%    1.79%
                   7000           0.69%     0.95%    1.72%
                   3000           0.54%     0.93%    1.48%
                   2000           0.97%     1.01%    1.30%
BasketballDrive    10000          2.15%     1.80%    3.17%
                   7000           1.62%     1.73%    3.23%
                   3000           0.24%     1.31%    3.08%
                   2000           1.04%     1.14%    2.86%
BQTerrace          10000          3.02%     1.60%    0.16%
                   7000           2.42%     1.40%    0.72%
                   3000           0.33%     0.79%    1.28%
                   2000           0.52%     0.74%    1.70%
Average                           4.71%     2.02%    2.55%

Fig. 3. Frame actual bit versus frame number: (a) P frame bits, (b) B frame bits

To illustrate the rate control performance more intuitively, Fig. 3 shows the actual frame bits of the three rate control methods. The sequence used is the concatenation of all test sequences mentioned above, which brings it closer to a real video with scene cuts. Fig. 3(a) shows the bits of the P frames and Fig. 3(b) shows the bits of the B frames. We can see that the frame bits of the proposed method vary more smoothly than those of the original algorithm. To sum up, the proposed algorithm obtains a rate control accuracy close to that of VBV mode, while achieving a significant improvement in RD performance.

B. Performance of the joint rate control module

One important aspect that needs to be validated is the performance of the joint rate control module. The test conditions are designed as follows. Two image division strategies are used in our experiments: one divides each image into 2 equal parts and the other divides each image into 4 equal parts. For each division strategy, two bit allocation schemes are used for the slices.
The first allocates the frame bits equally to each slice and is labeled "parts equal" below. The second uses the slice target bit allocation scheme described in Section II and is labeled "parts satd" below. Table IV shows the RD performance improvement of the proposed slice bit allocation scheme relative to equal slice bit allocation. The psnr and ssim columns list the BD-rate with PSNR and SSIM as the quality metric, respectively. The RD performance gains about 1% on PSNR for both division strategies, while on SSIM a gain of about 2% is achieved.
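For reference, the difference between the two allocation schemes compared here can be sketched as follows, using the SATD weighting of Eqs. (15)-(16) and the proportional split of Eqs. (12)-(14). The function names and the sample SATD values are hypothetical, and the fallback to an equal split when all SATD values are zero is an assumption added to keep the sketch well defined.

```python
def weighted_satd(history, forget=0.5):
    """Eqs. (15)-(16): exponentially weighted SATD of co-located slices.

    history lists SATD values oldest first, with the current frame last.
    """
    n = len(history) - 1
    weights = [forget ** (n - i) for i in range(n + 1)]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, history)) / total


def split_by_satd(frame_value, slice_satds):
    """Eqs. (12)-(14): split a frame-level quantity in proportion to SATD."""
    total = sum(slice_satds)
    if total == 0:
        # Assumed fallback: equal split when no complexity information exists
        return [frame_value / len(slice_satds)] * len(slice_satds)
    return [frame_value * s / total for s in slice_satds]


# Two slices; the upper slice has been consistently more complex.
satds = [weighted_satd([320.0, 290.0, 300.0]), weighted_satd([90.0, 110.0, 100.0])]
satd_split = split_by_satd(200_000, satds)
equal_split = [200_000 / 2] * 2
```

The same proportional split applies to the target buffer occupancy and the actual buffer occupancy, so each slice keeps a consistent share of all three frame-level quantities.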

TABLE IV
BD-RATE TO EQUAL BIT ALLOCATION

BD-Rate            2parts               4parts
                   psnr      ssim       psnr      ssim
Kimono1            -1.52%    -2.53%     -1.71%    -4.49%
ParkScene          0.63%     1.24%      0.47%     1.%
Cactus             -0.09%    1.21%      0.16%     1.06%
BasketballDrive    -2.77%    -1.85%     -4.58%    -4.58%
BQTerrace          -1.64%    -6.21%     0.64%     -5.12%
Average            -1.08%    -1.63%     -1.00%    -2.%

Note that three sequences, Kimono1, BasketballDrive and BQTerrace, show a larger RD performance improvement than the others. In these sequences the content differs more between the divided slices than in the others, which confirms that the proposed slice bit allocation scheme adapts better to content. Fig. 4 shows the Rate-PSNR and Rate-SSIM curves of the sequence BasketballDrive when the frames are divided into 2 parts and 4 parts, respectively. Compared with equal slice bit allocation, the proposed method achieves a significant performance improvement.

TABLE V
MISMATCH COMPARING

Sequence           Target bitrate/kbps    Rate control mismatch
                                          2parts equal    2parts satd    4parts equal    4parts satd
Kimono1            6000                   6.59%           6.25%          6.58%           6.10%
                   4000                   5.73%           5.63%          5.88%           5.57%
                   1600                   3.46%           3.43%          3.51%           3.55%
                   1000                   2.02%           2.07%          2.33%           2.39%
ParkScene          6000                   5.03%           5.05%          4.91%           4.96%
                   4000                   3.93%           3.88%          3.83%           3.80%
                   1600                   1.59%           1.24%          1.29%           0.90%
                   1000                   0.38%           0.10%          0.68%           0.14%
Cactus             10000                  2.30%           2.29%          2.22%           2.21%
                   7000                   2.13%           2.09%          2.03%           1.97%
                   3000                   1.45%           1.45%          1.32%           1.28%
                   2000                   1.08%           0.98%          0.90%           0.68%
BasketballDrive    10000                  2.95%           2.93%          2.85%           2.79%
                   7000                   2.82%           2.83%          2.67%           2.68%
                   3000                   2.59%           2.70%          2.43%           2.57%
                   2000                   2.%             2.47%          2.32%           2.40%
BQTerrace          10000                  0.06%           0.14%          0.09%           0.29%
                   7000                   0.46%           0.66%          0.43%           0.79%
                   3000                   1.09%           1.15%          0.97%           1.34%
                   2000                   0.92%           1.28%          1.22%           1.57%
Average                                   2.45%           2.43%          2.42%           2.40%

Table V lists the bit rate mismatch of the two slice bit allocation schemes. The proposed scheme yields a slight improvement on average, which also confirms that the slice bit allocation scheme adapts better to content.
Fig. 4. R-D curves of BasketballDrive: (a) Rate-PSNR curve of 2 parts, (b) Rate-SSIM curve of 2 parts, (c) Rate-PSNR curve of 4 parts, (d) Rate-SSIM curve of 4 parts

IV. CONCLUSION

In this paper, we propose a novel parallel-friendly rate control scheme that supports parallel coding at both the frame level and the slice level. Experimental results show that the algorithm obtains a rate accuracy close to that of the original x265 with VBV mode while significantly improving RD performance. In addition, SATD-based slice bit allocation provides better content adaptation, which makes the algorithm more robust to content variation than the other schemes.

ACKNOWLEDGMENT

This work was supported by the Shanghai Zhangjiang national independent innovation demonstration zone development fund (201501-pd-sb-b201-001) and NSFC (61671296, 61527804, 61521062).

REFERENCES

[1] Chi C C, Alvarez-Mesa M, Juurlink B, et al. Parallel scalability and efficiency of HEVC parallelization approaches[J]. Circuits and Systems for Video Technology, IEEE Transactions on, 2012, 22(12): 1827-1838.
[2] Nakamura K, Ikeda M, Yoshitome T, et al. Global rate control scheme for MPEG-2 HDTV parallel encoding system[C]//Information Technology: Coding and Computing, 2000. Proceedings. International Conference on. IEEE, 2000: 195-200.
[3] Nog S. A study on rate control method for MP@HL encoder with parallel encoder architecture using picture partitioning[C]//Image Processing, 1999. ICIP 99. Proceedings. 1999 International Conference on. IEEE, 1999, 4: 261-265.
[4] Sanz-Rodriguez S, Mayer T, Alvarez-Mesa M, et al. A low-complexity parallel-friendly rate control algorithm for ultra-low delay high definition video coding[C]//Multimedia and Expo Workshops (ICMEW), 2013 IEEE International Conference on. IEEE, 2013: 1-4.
[5] Wang J, Gao Z, Zhang X. Efficient parallel-friendly rate control for realtime UHD video encoder on many-core platform[C]//Multimedia and Expo Workshops (ICMEW), 2014 IEEE International Conference on. IEEE, 2014: 1-6.
[6] Li B, Li H, Li L, Zhang J. λ Domain Rate Control Algorithm for High Efficiency Video Coding[J]. Image Processing, IEEE Transactions on, 2014, 23(9): 3841-3854.
[7] ITU-T Q6/16, ISO/IEC JTC1/SC29/WG11, VCEG-AM91. Joint Call for Proposals on Video Compression Technology, 22 January 2010, Kyoto, Japan.