A QFHD 30 fps HEVC Decoder Design


Pai-Tse Chiang, Yi-Ching Ting, Hsuan-Ku Chen, Shiau-Yu Jou, I-Wen Chen, Hang-Chiu Fang and Tian-Sheuan Chang, Senior Member, IEEE

Abstract: The HEVC video standard provides superior compression with large and variable-sized coding units and advanced prediction modes, which leads to high buffer costs, high memory bandwidth and irregular computation for ultra-high-definition (UHD) video decoding hardware. Thus, this paper presents an HEVC decoder with a four-stage mixed block size pipeline that reduces the pipeline stage buffer size by approximately 91% compared with a 64x64 block-based pipeline. The high memory bandwidth caused by motion compensation was addressed with block-based data access, precision-based data access and a smart buffer, which together reduce the data bandwidth by 88%. In addition, for irregular computation, a reconfigurable architecture was adopted to unify the variable-size transform. A common intra prediction module was also designed with a 4x4 block-based bottom-up computation to handle the variable-size intra prediction and all modes in a regular manner. Furthermore, corner position computation for the motion vector predictor (MVP) was applied to handle variable-size motion compensation. Finally, the implementation with the TSMC 90 nm CMOS process used 467 K logic gates and 15.778 KB of on-chip memory and supported 4096x2160@30fps video decoding at a 270 MHz operating frequency.

Index Terms: High efficiency video coding, decoder, VLSI implementation

Copyright (c) 2014 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org. Manuscript received Jan. 28, 2014; revised May 17, 2014, Aug. 21, 2014, Nov. 28, 2014, and Feb. 1, 2015; accepted Feb. 18, 2015. This work was supported by the Ministry of Science and Technology, R.O.C., under Grant 103-2221-E-009-224. All the authors are with the Department of Electronics Engineering, National Chiao Tung University, Hsinchu 300, Taiwan, R.O.C. (e-mail: tschang@mail.nctu.edu.tw).

I. INTRODUCTION

To meet the ultra-high-definition video compression demand, high efficiency video coding (HEVC) [1][2], the latest video coding standard, has recently been standardized to provide a 50% bit rate reduction over the previously popular H.264/AVC standard. Although HEVC uses a hybrid coding scheme similar to that of H.264, consisting of inter- and intra-frame prediction, transform units, an in-loop filter, and entropy coding, it displays a marked improvement over H.264 in several aspects.

1) Large hierarchical units: Instead of the fixed 16x16 macroblocks in H.264, HEVC partitions a frame into raster-scanned coding tree units (CTUs) of a fixed size of 64x64, 32x32 or 16x16. Each CTU is further recursively split into smaller coding units (CUs), down to the size of 8x8. Each CU is split into one, two or four prediction units (PUs), from 64x64 down to 4x4 (intra) or 8x4/4x8 (inter), with mixed intra or inter prediction instead of the all-inter or all-intra prediction of H.264. This approach results in data dependency between the intra and the inter predictions. Each CU is also recursively split into transform units (TUs) from 32x32 to 4x4. These large, hierarchical and flexible CUs, PUs and TUs can encode dynamic content very well, but the largest CTU is 16x larger than the previous macroblock, which significantly increases the required buffer size.
2) Advanced predictions: HEVC uses 35 intra-prediction modes, including DC, planar and 33 angular modes for all PU sizes, many more than the 9 modes in H.264. In addition, the inter-prediction in HEVC uses an 8-tap interpolation filter instead of the 6-tap one found in H.264. Generally speaking, these predictions can model content better; however, they also lead to irregular computation and high memory bandwidth.

3) Simplified structure: The in-loop filter uses a simpler deblocking filter, operating on an 8x8 grid instead of the small 4x4 grid of H.264. This approach enables parallel filtering of different edges. HEVC also introduces a new loop filter, sample adaptive offset (SAO), to reduce the distortion between the reconstructed and original data. Moreover, HEVC uses only one type of entropy coding to simplify the design.

The above-mentioned improvements certainly facilitate better coding efficiency but, at the same time, also require significant computational complexity [3], large on-chip storage and high memory bandwidth, especially for real-time ultra HD video processing, which in turn demands a hardware implementation to meet the real-time requirements.

To meet the above requirements, several decoder designs have been proposed recently [4]-[6]. An FPGA prototype was proposed in [4] to decode 1080p@60fps with a 7-stage pipeline architecture. To avoid complex synchronization of the variable PU and TU sizes within a CTU, it adopts the largest CTU level (64x64) as its basic pipeline unit, which leads to simple control but incurs a significant pipeline buffer cost.

[5] proposed a 3840x2160@30fps decoder that uses a two-stage sub-pipeline scheme. Its pipeline scheme was also a CTU-based design but was able to adapt to different CTU sizes. The design adopted variable-sized pipeline blocks for processing, with a fixed height (64) and CTU-dependent widths (16, 32 or 64), to reduce the data access switching between luma and chroma pixels. Although this approach unified the control flow, the pipeline buffer size was still based on 64x64 for the worst-case condition. Furthermore, [6] proposed a 1080p@30fps decoder that used a four-stage pipeline with embedded compression to reduce bandwidth. Several single-module designs have also been proposed [7]-[14]. However, all these module designs still require suitable tailoring to meet the needs of an entire decoder.

This paper presents an efficient HEVC decoder design with a four-stage pipeline to address, in particular, the buffer, irregular computation and memory bandwidth issues. For the buffer size, we analyzed the data dependency of the decoder and propose a mixed block size pipeline that uses 32x32 for the first stage and 16x16 for the rest of the stages instead of 64x64 or a fixed height for the entire pipeline. The complex control of the pipeline was avoided through a 4x4 block-based prediction structure that adapts to variable PU sizes. This approach reduces the total buffer size by 87.2% compared to the previous design [5]. The memory bandwidth in the motion compensation (MC) was reduced by block-based data access and a reference data cache with a simplified addressing scheme. Moreover, to handle irregular computations, a reconfigurable transform architecture, common intra-prediction modules with a 4x4 bottom-up structure, and corner position computation for motion vector prediction (MVP) were used. All these optimizations led to a 35% gate count reduction compared to the previously reported design [5].

The rest of the paper is organized as follows. Section II presents the design analysis and pipeline overview of the proposed decoder. Section III describes the details of the component designs. Next, the implementation results and design comparisons are shown in Section IV. Finally, conclusions are drawn in Section V.

II. MIXED BLOCK SIZE PIPELINE

A. Overview of the four-stage pipeline

Fig. 1 shows the pipeline of the proposed HEVC decoder. For the target specification, 4096x2160@30fps at 270 MHz with the 90 nm CMOS process, the decoder was divided into four pipeline stages: entropy decoder, IQ/IT and reference data loading, reconstruction of prediction, and loop filters. The overall functions of the four stages are as follows. The first stage, the entropy decoder, decodes residual coefficients and other information with the context adaptive binary arithmetic decoder (CABAD). Then, the second stage reconstructs the residual data through inverse quantization (IQ) and inverse transform (IT). At the same stage, the corresponding motion vector (MV) is generated to load reference data into a smart buffer. The third stage reconstructs pixels by adding the residual data to the prediction values. Finally, the reconstructed data are filtered by the two in-loop filters, deblocking and SAO, for the final output.
B. Analysis of the pipeline

The pipeline block size directly affects the final scheduling and the cost of both the stage buffers and the intermediate buffers. The stage buffer size can be estimated as follows. Assume a 2Nx2N pipeline block. The buffer size for the transform coefficient buffer, residual buffer, or pre-filter buffer is given by

(Size for Y + Size for Cb and Cr) x 2 = (2Nx2N + NxNx2) x 2 = 12N^2    (1)

This size consists of buffers for Y, Cb, and Cr in the 4:2:0 format with the ping-pong buffer style to match the throughput differences between the stages.

Fig. 1 Pipeline of the proposed HEVC decoder

The size of the smart buffer (the reference data buffer for MC) can be roughly estimated as

(no. of 8x4 Y blocks x 8x4 interpolation window + no. of 4x2 CbCr blocks x 4x2 interpolation window) x 2
= [ (2Nx2N)/(8x4) x (15x11) + (NxNx2)/(4x2) x (7x5) ] x 2 = 58.75N^2 bytes    (2)

This size considers only the data for one pipeline block for simplicity and consists of buffers for Y, Cb, and Cr in the 4:2:0 format with the ping-pong buffer style. Thus, the total stage buffer size is 12N^2 x (16/8 + 9/8 + 1) + 58.75N^2 = 108.25N^2 for 16-b/9-b/8-b precision for the transform coefficient buffer, residual buffer, and pre-filter buffer, respectively. A direct design with a 64x64 block size (N = 32) would need 110.848 KB of SRAM. The buffer size can be simplified to only the Y buffer size, i.e., 74.25N^2, if the Y and CbCr data are interleaved. The optimized buffer size should consider the trade-off among the computational dependency and efficiency within and between the modules, the buffer cost and the external memory bandwidth, which is derived below.

For the first stage, the only restriction for the buffer between CABAD and IQ/IT is to accommodate the maximum IT size, 32x32. A buffer larger than 32x32, say 64x64, is unnecessary and does not provide any remarkable benefit. However, a smaller buffer cannot compute a larger size transform without complicated scheduling. In addition, the transform coefficients decoded from CABAD have to be reordered before the inverse transform. Hence, we set the transform coefficient buffer size to 32x32x2 instead of 64x64x2. The corresponding IT pipeline block was set to 32x32 for the first 1-D IT. However, the second 1-D IT could be decomposed into two units with even-odd decomposition. Thus, the maximum required size of the IT output buffer, the residual buffer, would be 16x16x2 instead of 32x32x2, which saves 3/4 of the SRAM.

For the second stage, Fig. 2 shows the ratios of the external data access increase and the smart buffer size for MC interpolation relative to a 64x64 block for a typical video sequence. The data access amount depends on the selected PU size and the MC pipeline block size. If the PU size is larger than the MC pipeline block size, it is necessary to split the PU access into smaller ones to fit the pipeline block size, which results in accessing duplicated data. Thus, a smaller pipeline size will lead to much higher data access even though it has a smaller buffer cost. Therefore, we choose 16x16 as a tradeoff. The above-mentioned analysis ignored the reuse of duplicated data between blocks. If we consider the possible data reuse schemes discussed in Section III, the smart buffer size can be increased to (32x32)x2 = 2048 bytes to save memory bandwidth for list 0 and list 1.

Fig. 2 Ratio of data access increase and buffer size relative to the 64x64 pipeline block

For the final stage, the pre-filter pixel buffer stores the reconstructed output for the following deblocking filter. Thus, this stage buffer size was set to 16x16x2 to store the data from the previous stage with a 16x16 pipeline block. Its internal processing is decomposed into 8x8 sub-pipeline blocks to fit the 8x8-grid edges of the deblocking filter.
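To make the buffer arithmetic concrete, the short C sketch below (ours, not part of the original paper) evaluates the 108.25N^2 and 74.25N^2 estimates for the candidate pipeline block sizes and re-derives the mixed-pipeline total reported in the summary below and in Table V; the assumption that the quoted 62%/91% savings are measured against the interleaved (74.25N^2) figures is ours.

```c
#include <stdio.h>

int main(void)
{
    /* Eq. (1)-(2): for a 2Nx2N pipeline block, one 8-bit stage buffer
       (Y+Cb+Cr, ping-pong) costs 12*N^2 bytes and the MC smart buffer
       costs 58.75*N^2 bytes; the 16-b/9-b/8-b precisions scale the three
       stage buffers to 108.25*N^2 (or 74.25*N^2 with Y/CbCr interleaved). */
    for (int blk = 16; blk <= 64; blk *= 2) {
        double n      = blk / 2.0;
        double full   = 12.0 * n * n * (16.0/8 + 9.0/8 + 1.0) + 58.75 * n * n;
        double interl =  8.0 * n * n * (16.0/8 + 9.0/8 + 1.0) + 41.25 * n * n;
        printf("%2dx%-2d block: %8.0f B (4:2:0), %8.0f B (interleaved)\n",
               blk, blk, full, interl);
    }

    /* Mixed-size pipeline of this design (32x32 coefficients, 16x16 for the
       rest), matching the per-buffer figures listed later in Table V.       */
    int mixed = 32*32*2*2          /* 16-b coefficient ping-pong buffer = 4096 B */
              + 16*16*2*9/8        /* 9-b residual ping-pong buffer     =  576 B */
              + 16*16*2            /* 8-b pre-filter ping-pong buffer   =  512 B */
              + 32*32*2;           /* smart buffer (list 0 and list 1)  = 2048 B */
    printf("mixed pipeline total: %d B\n", mixed);
    printf("saving vs 32x32: %.0f%%, vs 64x64: %.0f%%\n",
           100.0 * (1.0 - mixed / (74.25 * 16 * 16)),
           100.0 * (1.0 - mixed / (74.25 * 32 * 32)));
    return 0;
}
```

Evaluated this way, the mixed pipeline comes to 7232 bytes and the savings land at roughly 62% and 90%, close to the figures quoted in the summary below.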
For the line buffers that store neighboring information, such as neighboring MVs, prediction information, reference data for intra prediction, and deblocking buffers, we only store part of them on-chip (e.g., three-LCU-wide buffers) and keep the other data off-chip to reduce the buffer sizes, which is similar to previous H.264 decoder chips and an HEVC decoder design [6].

In summary, the pipeline size was set to 32x32 for CABAD and IQ/IT and to 16x16 for the rest of the modules. The stage buffer size was 7.232 KB, which saves 62% and 91% of the buffer cost when compared with the 32x32 and 64x64 cases, respectively.

C. Scheduling of the proposed design

Fig. 3 shows the mixed block size scheduling of the proposed design with four pipeline stages and interleaved luma and chroma block processing. First, this design decodes the bitstream with CABAD at the first stage in a 32x32 block size and then applies IQ and the first 1-D IT at the second stage in a 32x32 block size for the luma blocks. The second 1-D IT operates on 16x16 blocks, and the chroma blocks are processed in correspondingly smaller block sizes. Once the required data are available, the subsequent operations can be started. The cycle count in each module is not fixed because of the variable PU sizes and different complexities.

III. COMPONENT DESIGNS

A. CABAD

For the targeted HEVC decoder specification of 4096x2160@30fps, the number of bins required is 110 Mbins/sec to 120 Mbins/sec based on our simulations for the target bit rate specification [1]. The limits for particular CTUs and 32x32 units could be much higher under worst-case conditions (e.g., a CTB coded with 4x4 blocks), which rarely occur in practice. Thus, 270 Mbins/sec is enough to absorb the bit rate variations observed in the simulations, and a single-bin-per-cycle architecture was adopted, as shown in Fig. 4, with a throughput of 270 Mbins/s when operating at 270 MHz. The input bitstream is first decoded through binary arithmetic decoding, and the decoded bin is then directly passed to the context selection because of the smaller critical delay. The context selection produces one context address from two parallel options of the decoded bin to avoid the dependency and ensure a single bin per cycle. The context address and the context state from the context modeling are then sent to the binary arithmetic decoding for the next bin decoding.

Fig. 3 Schedule of the decoder

Fig. 4 Single bin CABAD architecture

B. Inverse transform

The inverse transform matrix in HEVC possesses the property of coefficient symmetry; the same coefficients are present in the even rows, and similar coefficients are in the odd rows for all of the TU sizes. Thus, this paper proposes a reconfigurable inverse transform architecture [7] that is well suited to the various TU sizes. This architecture adopts the commonly used row-column decomposition to convert a 2-D transform into two consecutive 1-D transforms. Each 1-D 32-point transform is further decomposed into two 1-D 16-point transforms with even-odd decomposition. The even part of a 16-point transform can again be decomposed into smaller size transforms to adapt to the various TU sizes by reusing the same coefficients. However, the odd part has different but similar coefficients. Thus, we decompose the coefficients in the odd part into a base coefficient for one TU size and a refined term for another TU size to compensate for the difference. The scheduling of the proposed design involves row-by-row processing of each of the 32x32 units, regardless of the TU combination within a single 32x32 unit. Moreover, one 32-point row is reconfigured into any legal TU combination from (4, 8, 16, 32) to maintain regular processing and full hardware utilization.
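To illustrate the even-odd decomposition that the reconfigurable datapath builds on, the following C sketch shows the 1-D 4-point HEVC inverse transform written as a partial butterfly; the coefficients (64, 83, 36) are the standard HEVC core-transform values, while the recursion to 8/16/32 points and the paper's base-plus-refinement coefficient sharing are only described here, not implemented.

```c
#include <stdint.h>
#include <stdio.h>

/* 1-D 4-point HEVC inverse transform as an even-odd partial butterfly.
   The even half of an N-point inverse transform is an N/2-point inverse
   transform, so the 32-point case recurses down to this kernel; clipping
   to the 16-bit range is omitted for brevity.                            */
static void inv_transform4(const int16_t src[4], int16_t dst[4], int shift)
{
    const int add = 1 << (shift - 1);

    int e0 = 64 * src[0] + 64 * src[2];   /* even part: a 2-point transform */
    int e1 = 64 * src[0] - 64 * src[2];
    int o0 = 83 * src[1] + 36 * src[3];   /* odd part                        */
    int o1 = 36 * src[1] - 83 * src[3];

    dst[0] = (int16_t)((e0 + o0 + add) >> shift);
    dst[1] = (int16_t)((e1 + o1 + add) >> shift);
    dst[2] = (int16_t)((e1 - o1 + add) >> shift);
    dst[3] = (int16_t)((e0 - o0 + add) >> shift);
}

int main(void)
{
    int16_t coeff[4] = { 64, 0, 0, 0 };   /* a DC-only column, as an example   */
    int16_t res[4];
    inv_transform4(coeff, res, 7);        /* 7 = first-pass shift, 8-bit video */
    printf("%d %d %d %d\n", res[0], res[1], res[2], res[3]);
    return 0;
}
```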

C. Deblocking filter and SAO

Fig. 5 shows the interleaved scheduling of the proposed non-pipelined deblocking filter (see the details in Table I), which operates on a 4x4 block basis. The processing of the luma and chroma blocks is also interleaved. The number of clock cycles for a 16x16 block is 182. The proposed filter design first takes one line of pixels from two adjacent 4x4 blocks as inputs every cycle, computes the strength and finally applies the suitable filters. Each edge of a 4x4 block takes a total of 8 cycles: 2 cycles for the boundary strength, 1 cycle for the filter strength, 4 cycles for the edge filtering, and 1 cycle to write back the remaining output.

In the interleaved scheduling, horizontal filtering is first performed over the two vertical edges, from the top four-line unit to the bottom four-line unit (edge 1 and edge 2). Because all the data are then available for horizontal edge 3, vertical filtering is next performed over the top horizontal edges (edge 3 to edge 4). This horizontal-vertical interleaved approach is repeated for every 8x8 block in the Z-scan order and reuses all the available data immediately to reduce the intermediate buffer cost.

Fig. 5 Interleaved scheduling of the proposed deblocking filter

Table I. Scheduling of the deblocking filter (one 8-cycle unit per edge)
Cycle unit | P block | Q block | Processed edge | Filtered block to SAO
C1         | B4      | B5      | E1             | -
C2         | B8      | B9      | E2             | -
C3         | B0      | B4      | E3             | B0, B4
C4         | B1      | B5      | E4             | B1, B5

Fig. 6 shows the deblocking filter architecture, which shares the same filters between horizontal and vertical filtering. The input comes from the pre-filter buffer or from the line buffer that stores partially filtered data, depending on the edge locations. All memory access schemes follow the interleaved scheduling.

Fig. 6 Architecture of the proposed deblocking filter

Fig. 7 shows the proposed SAO design, which is tightly coupled with the deblocking filter. This design receives four columns of 8-pixel input from the deblocking filter every 16 cycles, computes and then accumulates the offset costs: band offset (BO) and edge offset (EO). Once the data are ready, it selects the minimum as the final offset operation for every 8x8 block. BO has no dependency on different rows of pixels. EO has four patterns: horizontal, vertical, 45 degree and 135 degree; except for the horizontal pattern, they require the neighboring upper and lower rows of reference pixels for computation. The cost computation is a direct implementation, as these equations have already been optimized in the standard.

Fig. 7 Architecture of the proposed SAO
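For reference, the sign-based comparison behind the four EO patterns (defined by the HEVC standard, not specific to this chip) can be sketched as below; the standard's further index remapping and the hardware's offset accumulation and minimum-cost selection are omitted.

```c
#include <stdio.h>

/* SAO edge-offset classification for one pixel: compare the current sample c
   against its two neighbours a and b along the chosen pattern (horizontal,
   vertical, 45 or 135 degrees) and sum the signs of the differences.        */
static int sign3(int v) { return (v > 0) - (v < 0); }

static int eo_category(int a, int c, int b)
{
    /* -2: local minimum, -1: concave edge, 0: none, +1: convex edge, +2: local maximum */
    return sign3(c - a) + sign3(c - b);
}

int main(void)
{
    printf("%d\n", eo_category(100,  90, 100));   /* -2: local minimum   */
    printf("%d\n", eo_category( 90, 100,  90));   /* +2: local maximum   */
    printf("%d\n", eo_category( 90,  95, 100));   /*  0: monotonic slope */
    return 0;
}
```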
D. Reconstruction of intra prediction

The main challenges in the design of the intra prediction were the various PU sizes, ranging from 64x64 to 4x4, and the 35 prediction modes, which complicate the computation and the on-chip data fetching. A previously reported design [11] used several parallel data paths to calculate the prediction equations for different PU sizes, but the only supported PU sizes were 4x4 and 8x8; extending this method to more PU sizes results in low hardware utilization. Therefore, we propose an efficient hardware architecture suitable for all intra PU sizes, as shown in Fig. 8, which consists of a common intra prediction module with a 4x4 bottom-up structure and an adaptive sample fetch controller to compute the predictions of the different PU sizes in a regular manner. The proposed intra prediction module also implements the DC mode with tree-based average buffers, the directional modes with pixel-based prediction mapping, and PU-size-adaptive reconstruction buffers for the different PU sizes. Fig. 9 displays the corresponding schedule for each PU size with a two-stage sub-pipeline.

In this schedule, each 4x4 block requires 9 cycles to load the reference data, calculate the prediction values and wait for reconstruction. In addition, based on the decoding order, the chroma blocks are computed after the luma blocks are completed for each PU. Therefore, the total number of cycles is 216 per 16x16 block.
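A quick tally of that schedule, using the per-unit figures from Fig. 9 (144 cycles for the sixteen luma 4x4 units and 36 cycles for each chroma component), is sketched below; the overlap between the load and predict sub-stages is not modelled.

```c
#include <stdio.h>

int main(void)
{
    /* Every 4x4 unit occupies 9 cycles in the two-stage sub-pipeline; a 16x16
       luma area holds 16 such units and the two 8x8 chroma areas hold 4 each. */
    const int cycles_per_4x4 = 9;
    int luma   = 16 * cycles_per_4x4;          /* 144 cycles          */
    int chroma = 2 * 4 * cycles_per_4x4;       /*  72 cycles (Cb + Cr) */
    printf("cycles per 16x16 block: %d\n", luma + chroma);   /* 216 */
    return 0;
}
```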

Fig. 8 Architecture of the intra prediction

Fig. 9 Decoding schedule of intra prediction for each PU size

Fig. 10 The intra prediction datapath

1) 4x4 bottom-up structure and PU-size-adaptive sample fetching

To handle the recursive combination of PUs with simple control, a 4x4 block was selected as the basic unit of a bottom-up structure. That is, each 16x16 block is processed in a double-z-scan order with 16 basic units instead of row-by-row. Therefore, the processing order of each 16x16 block is regular, regardless of the PU sizes.

For the above processing, we implemented a PU-size-adaptive sample fetch controller, which fetches the corresponding reference samples adaptively based on the PU size and the coordinates of the current block, because the prediction equations of the different PU sizes are similar. The reconstructed data are stored in the PU-size-adaptive reconstruction buffer (Fig. 11), which can adapt to different PU sizes. With this buffer, the use of an extra buffer for each PU size is avoided.

Fig. 11 An example of the PU size-adaptive reconstruction buffer

2) Common intra prediction modules for all PU sizes

Fig. 10 shows the common intra prediction data path for the different PU sizes with a 2-pixel-per-cycle throughput to satisfy the processing rate. In this data path, the planar, DC, and directional modes are separated into independent flows because of their different prediction equations. The corresponding reference sample filterings are also integrated into the prediction datapath but are not shown here for figure clarity.

Planar mode: Fig. 10 and Fig. 12 show the prediction data path and buffer for the planar mode. With the 2-pixel-per-cycle throughput, the adaptive sample fetch controller updates the top reference pixels, T_0 and T_1, every cycle and the left reference pixel, L, every two cycles. The values of T_N and L_N are updated only for a new PU. Next, the values of the bottom reference pixels, B_0 and B_1, and the right reference pixel, R, for the following blocks are calculated with subtractors, as displayed in Fig. 12. Finally, the prediction values are computed with a position-related linear combination, which is simplified with shift operators as follows:

Pred_0(x_0, y) = {[(T_0 << n) + (B_0 * y)] + [(L << n) + (R * x_0) + 2^n]} >> (n + 1)    (3)

Pred_1(x_1, y) = {[(T_1 << n) + (B_1 * y)] + [(L << n) + (R * x_1) + 2^n]} >> (n + 1)    (4)

Fig. 12 The prediction buffers of the planar mode

DC mode with tree-based average buffers: The DC mode requires all the reference samples to estimate the average value for different PU sizes and positions. Thus, to share these average values among the different cases, two four-level tree-based average buffers are implemented for the left and above reference samples, respectively.

Directional modes with pixel-based prediction mapping: In the HEVC reference software, a single function is used to cover all 33 prediction modes to simplify the software implementation. To fit into this function, the top reference and the left reference samples are exchanged first when the prediction mode belongs to the horizontal modes (modes 2 through 17). Then, the prediction values are calculated with the corresponding vertical modes (modes 19 through 33). After all the prediction values in a PU are computed, the block is flipped back again. However, this flow is a block-based operation that, in hardware, requires additional buffers to store the unflipped block, and time is wasted on the flipping.
To solve the above problems, the common prediction equation was decomposed into a pixel-based mapping equation with the angle parameters shown in Table II, which does not require flipping the PU:

Pred(x, y) = [(32 - f(x, y)) * S_0 + f(x, y) * S_1 + 16] >> 5    (5)

f(x, y) = [(x_ang * x) + (y_ang * y)] % 32    (6)

where x and y are the coordinates of the pixel, and S_0 and S_1 are the corresponding reference samples. In addition, the equation that calculates the extension of the main reference samples was also modified as follows:

Ext_Index(x, y) = [128 + invang * ang(x, y)] >> 8    (7)

ang(x, y) = [(x_ang * x) + (y_ang * y)] >> 5,  for mode < 11 or mode >= 26
          = [(y_ang * y) - (x_ang * x)] >> 5,  for 11 <= mode < 18
          = [(x_ang * x) - (y_ang * y)] >> 5,  for 18 <= mode < 26    (8)

Table II. The angle parameters of the directional modes
Mode  x_ang  y_ang  invang
 2     32     32     256
 3     26     32     256
 4     21     32     256
 5     17     32     256
 6     13     32     256
 7      9     32     256
 8      5     32     256
 9      2     32     256
10      0     32     256
11      2     32    4096
12      5     32    1638
13      9     32     910
14     13     32     630
15     17     32     482
16     21     32     390
17     26     32     315
18     32     32     256
19     32     26     315
20     32     21     390
21     32     17     482
22     32     13     630
23     32      9     910
24     32      5    1638
25     32      2    4096
26     32      0     256
27     32      2     256
28     32      5     256
29     32      9     256
30     32     13     256
31     32     17     256
32     32     21     256
33     32     26     256
34     32     32     256
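As an illustration of the per-pixel weighting in Eqs. (5)-(6), a small C sketch follows; the reference-sample indexing of Eqs. (7)-(8) and Table II is left out, so the two samples S_0/S_1 and the angle products are passed in directly, and the variable names are ours.

```c
#include <stdint.h>
#include <stdio.h>

/* Pixel-based directional prediction weighting of Eqs. (5)-(6): the angle
   products select a 5-bit fraction f, and the prediction is a weighted
   average of the two neighbouring reference samples s0 and s1.           */
static uint8_t angular_pred(int x_ang_x, int y_ang_y, uint8_t s0, uint8_t s1)
{
    int f = (x_ang_x + y_ang_y) & 31;                    /* Eq. (6): modulo 32   */
    int p = ((32 - f) * s0 + f * s1 + 16) >> 5;          /* Eq. (5): round, >> 5 */
    return (uint8_t)p;
}

int main(void)
{
    /* Example: a fractional position 5/32 of the way from s0 = 100 to s1 = 132. */
    printf("%u\n", angular_pred(5, 0, 100, 132));        /* prints 105 */
    return 0;
}
```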

E. Motion compensation

The MC design in HEVC is more complex than in the prior standards because of the various PU sizes, the more complicated interpolation, and the MVP, which increase the computation irregularity, the buffer size and the bandwidth to a large extent. In the proposed MC design, the bandwidth and buffer size were reduced first with block-based access, precision-based access, and a smart buffer. Then, the computation irregularity was avoided and the area cost was reduced with corner-index-based MVP generation and 4x4 interpolators with common subexpression sharing.

1) Bandwidth reduction

The memory bandwidth reduction method in MC involves reusing the reference samples in the highly overlapped data between different blocks, which is similar to the previous approaches [15]-[17] for H.264. Other approaches, such as efficient DRAM data mapping [24] and embedded compression [25], can also be combined directly to save more bandwidth. In the current work, the sole focus was on data reuse, which was based on three approaches: partition-based access, precision-based access and a cache-based design.

The partition-based access loads reference data according to the different partition sizes. A straightforward method to load reference data is to decompose a partition into 4x4 blocks and load an 11x11 block for each 4x4 interpolation. However, if the partition size is larger than 4x4, the corresponding reference data of the 4x4 blocks in the same partition are highly overlapped, and they can be reused to reduce the bandwidth. Thus, in the previous work for H.264 [15]-[17], the upper and upper-left blocks were saved to reuse the overlapped part of the current block [15], or (M+5)x(N+5) blocks were accessed for MxN blocks with M, N = 4 or 8 [16], or up to 16 in [17]. In general, a larger access size gives a lower bandwidth but increases the buffer cost. This method also works for HEVC. However, the maximum PU size of HEVC is 64x64 instead of the 16x16 of H.264, which would lead to significant data storage. In the current design, 16x16 block-based data access was adopted instead of 64x64, which reuses the overlapped samples for PU sizes <= 16x16. Moreover, the proposed smart buffer is also capable of reusing data for larger PUs, as described later. Thus, this size selection is compatible with the processing unit in MC and provides a good tradeoff between the buffer cost and the bandwidth.

The precision-based data access loads the reference data according to the different fractional MV positions, similar to previous H.264 work [18]-[21]. For HEVC, an MxN PU will load MxN, (M+7)xN, Mx(N+7) or (M+7)x(N+7) data according to its fractional MV position, which avoids loading the (M+7)x(N+7) data for all types of MV positions.

A cache in MC can further increase the data reuse of adjacent block accesses (e.g., [21]-[23] for H.264) to reduce the bandwidth.
The design in [22] proposed a 6-way cache to reduce the bandwidth by more than 70%. However, the authors used a large-sized memory (6x Bytes), a large number of registers and a complex address mapping to implement the cache. These issues were also found in the HEVC design of [5]. Moreover, the PU size in HEVC is up to 64x64, which is much larger than the 16x16 of H.264. To address these issues, this paper proposes a cache design with a simplified addressing scheme, denoted as a smart buffer, as shown in Fig. 13. Table III lists the percent savings of the data loading amount compared with the original 4x4 block-based loading. An average savings of 88% was found for the proposed approach; other sequences show similar results. The table also shows that the savings for cache sizes larger than 32x32 are quite similar. Thus, 32x32 was chosen for the current method, which fits our selection of the processing unit and access size.

The proposed design partitions the 32x32 buffer into sixteen 8x8 blocks with a double-z-scan (DZS) index, as shown in Fig. 14, to fit the DZS computing order in HEVC. The same partition method and its DZS index are also applied to a 64x64 block (Fig. 15). An 8x8 block in a 64x64 block can thus be easily mapped to a block in the smart buffer by using the DZS index modulo 16 as the mapping function. This addressing scheme provides an easy way to update the smart buffer blocks without any complex address control.
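A minimal sketch of that mapping is given below; the bit-interleaving order (with the horizontal block coordinate as the least significant bit) is inferred from the block numbering in Fig. 14/15, and the tag handling is reduced to a single compare, so treat the details as an illustration rather than the chip's exact address logic.

```c
#include <stdio.h>

/* Double-z-scan (Morton) index of an 8x8 block at block coordinates
   (bx, by) inside a 64x64 area (bx, by in 0..7). Bits of bx and by are
   interleaved with bx as the least significant bit, matching the block
   numbering shown in Fig. 14/15.                                       */
static unsigned dzs_index(unsigned bx, unsigned by)
{
    unsigned idx = 0;
    for (unsigned b = 0; b < 3; b++) {
        idx |= ((bx >> b) & 1u) << (2 * b);
        idx |= ((by >> b) & 1u) << (2 * b + 1);
    }
    return idx;                                  /* 0..63 */
}

int main(void)
{
    /* The smart buffer holds sixteen 8x8 blocks, so an 8x8 block of the 64x64
       reference area lands in slot (DZS index mod 16); a per-slot tag records
       which block is currently resident, turning a lookup into one compare.  */
    unsigned bx = 5, by = 2;                     /* hypothetical block position */
    unsigned idx = dzs_index(bx, by);
    printf("block (%u,%u): DZS index %u -> smart-buffer slot %u\n",
           bx, by, idx, idx % 16);
    return 0;
}
```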

Table III. Percent savings of the data loading amount compared with 4x4 block-based access
Method     | BasketballDrive | RaceHorsesC | BasketballPass | Avg.
B          | 24.7%           | 24.3%       | 68.1%          | 39.0%
C          | 79.9%           | 78.9%       | 84.2%          | 81.0%
D (16x16)  | 82.1%           | 80.9%       | 86.1%          | 83.0%
D (32x32)  | 89.1%           | 86.9%       | 88.1%          | 88.0%
D (64x32)  | 90.6%           | 88.2%       | 88.7%          | 89.1%
D (64x64)  | 91.5%           | 89.2%       | 89.4%          | 90.0%
Note: B = precision-based request (4x4 based); C = B + 16x16 block-size request; D = C + smart buffer (of the listed size).

Fig. 13 The smart buffer architecture

Fig. 14 Double-z scan order

Fig. 15 A smart buffer example (8x8 pixels for each word)

2) MC design

The proposed MC architecture is divided into two stages. The first stage computes the final MV from the MVP and the motion vector difference (MVD) and accesses the reference data. After the reference pixels of a 4x4 block are ready in the stage buffer, the second stage starts to interpolate the reference data and reconstruct the pixels.

To generate the MVs in a regular way that adapts to the variable PU sizes, the basic block of MVs is set to 4x4. Thus, the neighboring MVs are stored on a 4x4 basis instead of a PU basis to provide regular access to the neighboring MVs. Next, the corner index of the PU is computed to access the corresponding MVP candidates for the different PU sizes, as shown in Fig. 16. If the size of the PU is larger than 4x4, only the top-leftmost 4x4 block goes through the MVP generation.

For the interpolation, the 4x4 block was chosen as the basic unit to process the recursive PU combinations with a bottom-up structure, as shown in Fig. 17. The area cost of the filters was reduced through hardware sharing for the same coefficients and common subexpression sharing for the different coefficients. Thus, the proposed interpolator saves 19.77% and 55% of the area cost compared with the original costs for the luma and chroma components, respectively.

Fig. 16 Corner position for different partition sizes
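To give a flavour of that coefficient sharing, the following C sketch evaluates the HEVC 8-tap half-pel luma filter {-1, 4, -11, 40, 40, -11, 4, -1} with shifts and adds only, pairing the symmetric taps first; it shows the general idea rather than the paper's actual shared-datapath netlist.

```c
#include <stdint.h>
#include <stdio.h>

/* HEVC 8-tap luma half-pel filter written without multipliers: symmetric
   taps are paired first so each distinct coefficient is applied once, and
   each coefficient is expanded into a short shift-add expression.         */
static int luma_halfpel(const int16_t s[8])
{
    int p07 = s[0] + s[7];          /* taps with coefficient  -1 */
    int p16 = s[1] + s[6];          /* taps with coefficient   4 */
    int p25 = s[2] + s[5];          /* taps with coefficient -11 */
    int p34 = s[3] + s[4];          /* taps with coefficient  40 */
    return -p07 + (p16 << 2)
           - ((p25 << 3) + (p25 << 1) + p25)      /* 11*x = 8x + 2x + x */
           + ((p34 << 5) + (p34 << 3));           /* 40*x = 32x + 8x    */
}

int main(void)
{
    int16_t samples[8] = { 100, 102, 104, 110, 112, 108, 104, 100 }; /* made-up pixels */
    printf("half-pel intermediate = %d\n", luma_halfpel(samples));   /* before shift/clip */
    return 0;
}
```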

To further reduce the temporary buffer cost, all the interpolations are scheduled at the earliest opportunity (Fig. 17). The original scheduling computes all the horizontal filters first, stores their outputs, and finally computes the vertical filters, which requires saving 44 intermediate results (11 rows of 4 horizontal FIR outputs) in the local buffer. The proposed scheduling computes the left half of the horizontal filters first, directly consumes those results with the left-half vertical filters, and then repeats the two steps for the right half. With this order, only 22 local results need to be saved.

Fig. 17 Proposed computational order of the luma interpolator

IV. IMPLEMENTATION RESULTS AND COMPARISONS

The proposed decoder has been implemented and synthesized with the TSMC 90 nm 1P9M CMOS technology, as shown in Tables IV and V, with support for the main profile tools. The prediction part occupies the largest area because of the various prediction modes and the complex memory data access, while the inverse transform occupies the second largest area because of its size of up to 32x32. Table VI summarizes the techniques used in the current design to reduce the buffer size and the area cost while, at the same time, increasing the adaptability to variable-size processing.

Tables VII and VIII show comparisons with the previously published designs [4]-[6][26]. All of these designs support the main profile tools with B-frames, except [6], which implemented an earlier HEVC version with CAVLD as the entropy decoder and without SAO. [4] requires the largest area and buffer cost because of its 64x64 pipeline. When compared with the intra prediction in [4], the proposed intra prediction reduces the gate count by 71.5% and the buffer size by 83% because of the common prediction units and the smaller pipeline size. The other modules in [4] also need a higher area cost than the proposed ones. The design in [6] has a similar gate count and buffer cost as the proposed design but is only capable of 1920x1080@30fps video processing; a detailed comparison is not possible because detailed design information is unavailable. Both the proposed design and [5] support similar real-time decoding capabilities with variable-size pipelines. However, [5] uses a 64x64 pipeline size for the worst case, which significantly increases the area cost. By contrast, our pipeline size is 32x32 in the first stage and is reduced to 16x16 for the remaining stages. The smaller pipeline size and the module optimizations also help to reduce the gate count. The proposed design reduces the gate count by 35% and the buffer size by 87.2% in comparison to [5]. The detailed module comparison in Table VII shows that every module in the proposed design needs a lower gate count than that in [5]. The major gate count savings come from the module optimizations, such as the single-bin CABAD, the reconfigurable prediction datapath, the interleaved deblocking filter, and the smart buffer scheme for the MC cache (which avoids the complicated register files of a cache tag). A commercial IP from [26] supports 4Kp120 processing with a 700K gate count, with no other design details disclosed; however, its on-chip memory requirement is as high as 151 KB.
Table IV. Detailed gate count of the proposed decoder
Pipeline stage | Module                    | Gate count (270 MHz, 90 nm)
-              | Control                   | 21,342
1st stage      | CABAD                     | 48,430
2nd stage      | Inverse Quantization      | 48,556
               | Inverse Transform         | 63,844
               | Motion Vector Generation  | 84,730
               | Reference Pixel Accessing | 15,841
               | Smart buffer              | 47,130
3rd stage      | Interpolation             | 18,742
               |   - luma interpolator     | 14,462
               |   - chroma interpolator   | 4,280
               | Intra Prediction          | 76,812
4th stage      | Deblocking Filter         | 20,551
               | SAO                       | 21,752
Total          |                           | 467,730

Table V. Buffer size of the proposed design
Module                    | On-chip memory (Byte)
Inverse transform         | 2048
MC                        | 1230
Intra prediction          | 704
SAO                       | 384
Others                    | 4180
Pipeline ping-pong buffer | 7232
  - Coefficients          | 4096
  - Smart buffer          | 2048
  - Residual buffer       | 576
  - Pre-filter pixel      | 512
Total                     | 15778

Table VI. Summary of the design techniques
Module            | Techniques
CABAD             | Single bin per cycle
IT                | Reconfigurable for all TU sizes
Intra prediction  | Common intra prediction module for the various PU sizes; 4x4-based bottom-up structure; adaptive sample fetch controller
MC                | 4x4-based structure for the various PU sizes; cache with simplified addressing scheme; corner position computation for MVP for the various PU sizes
Deblocking filter | Interleaved processing to save the intermediate buffer
Pipeline          | Mixed block sizes (32x32 and 16x16) to save the pipeline buffer

Table VII. Module comparison with other designs
Design feature    | [5]                                                   | [4]                        | Proposed
Standard          | HEVC WD4                                              | HEVC WD6                   | HEVC
Gate count        | 715K                                                  | 1763K                      | 467K
- Entropy decoder | 94.5K (CAVLD)                                         | 138K (CABAC)               | 48.4K (CABAC)
- IT + IQ         | 121.1K                                                | 779K                       | 112.4K
- Prediction      | 191.9K (inter + intra)                                | 270K (intra)               | 76.8K (intra)
- MC cache        | 166.4K (inter cache)                                  | 440K (inter cache)         | 126K
- Deblocking      | 49.9K                                                 | 84K                        | 20.5K
- SAO             | NA                                                    | 32K                        | 21.7K
- Others          | 131.2K (register files for cache tag, MEM interface)  | 20K (signal communication) | 21K

V. CONCLUSION

This paper has presented a real-time HEVC decoder that is capable of decoding 4096x2160@30fps at 270 MHz. The proposed design adopts a four-stage mixed-block-size pipeline to trade off the buffer cost against the bandwidth, reducing the stage buffer cost by approximately 90%. The required bandwidth was reduced by approximately 88% through block-based access, precision-based access and a smart buffer scheme. The complex irregular computation was unified by a reconfigurable design and 4x4 block-based processing for the inverse transform and the predictions, respectively. With the above approaches, the gate count and on-chip memory were reduced to 467 K and 15.778 KB, respectively, a savings of 35% in gate count and 87.2% in buffer cost compared to the previously published design with the same processing capability [5].

REFERENCES

[1] High Efficiency Video Coding, ITU-T Rec. H.265, Apr. 2013.
[2] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649-1668, Dec. 2012.
[3] B. Li, G. J. Sullivan, and J. Xu, "Comparison of compression performance of HEVC Draft 9 with AVC High Profile and performance of HM9.0 with temporal scalability characteristics," JCTVC-L0322, Jan. 2013.
[4] S. Cho and H. Kim, "Hardware implementation of a HEVC decoder," JCTVC-L0096, Jan. 2013.
[5] C.-T. Huang, M. Tikekar, C. Juvekar, V. Sze, and A. Chandrakasan, "A 249Mpixel/s HEVC video-decoder chip for Quad Full HD applications," in ISSCC Dig. Tech. Papers, 2013, pp. 162-163.
[6] C.-H. Tsai, H.-T. Wang, C.-L. Liu, Y. Li, and C.-Y. Lee, "A 446.6K-gates 0.55-1.2V H.265/HEVC decoder for next generation video applications," in Proc. ASSCC, 2013, pp. 305-308.
[7] P.-T. Chiang and T. S. Chang, "A reconfigurable inverse transform architecture design for HEVC decoder," in Proc. ISCAS, 2013, pp. 1006-1009.
[8] J.-S. Park, W.-J. Nam, S.-M. Han, and S. Lee, "2-D large inverse transform (16x16, 32x32) for HEVC (High Efficiency Video Coding)," J. Semicond. Technol. Sci., vol. 12, no. 2, pp. 203-211, June 2012.
[9] J. Zhu, Z. Liu, and D. Wang, "Fully pipelined DCT/IDCT/Hadamard unified transform architecture for HEVC codec," in Proc. ISCAS, 2013, pp. 677-680.
[10] W. Zhao, T. Onoye, and T. Song, "High-performance multiplierless transform architecture for HEVC," in Proc. ISCAS, 2013, pp. 1668-1671.

Table VIII. Decoder comparison with other designs
Design feature        | [5]             | [4]              | [6]                                                     | [26]                   | Proposed
Standard (version)    | HEVC (WD4)      | HEVC (WD6)       | HEVC, earlier draft (no SAO, CAVLC, fewer intra modes)  | HEVC main/main10       | HEVC main (final)
Technology            | 40 nm           | 65 nm            | 90 nm                                                   | NA                     | 90 nm
Frequency             | 200 MHz         | 266 MHz          | 224 MHz                                                 | 600 MHz                | 270 MHz
Resolution            | 3840x2160       | 1920x1080        | 1920x1080                                               | 4096x2160              | 4096x2160
Frame rate            | 30 fps          | 60 fps           | 35 fps                                                  | 120 fps                | 30 fps
Pipeline              | Variable        | 64x64 (7 stages) | 4 stages                                                | NA                     | Variable (4 stages)
Gate count            | 715K            | 1763K            | 446.6K                                                  | 700K (8b) / 840K (10b) | 467K
Memory (KB)           | 124             | 177.9            | 10.21                                                   | 151 (8b) / 179 (10b)   | 15.778
Throughput (Mpixel/s) | 248             | 124              | 72                                                      | 1061                   | 265

[11] E. Kalali, Y. Adibelli, and I. Hamzaoglu, "A high performance and low energy intra prediction hardware for HEVC video decoding," in Proc. DASIP, 2012.
[12] Z. Guo, D. Zhou, and S. Goto, "An optimized MC interpolation architecture for HEVC," in Proc. ICASSP, 2012, pp. 1117-1120.
[13] V. Afonso, H. Maich, L. Agostini, and D. Franco, "Low cost and high throughput FME interpolation for the HEVC emerging video coding standard," in Proc. LASCAS, 2013, pp. 1-4.
[14] W. Shen, Q. Shang, S. Shen, Y. Fan, and X. Zeng, "A high-throughput VLSI architecture for deblocking filter in HEVC," in Proc. ISCAS, 2013, pp. 673-676.
[15] D.-Y. Shen and T.-H. Tsai, "A 4x4-block level pipeline and bandwidth optimized motion compensation hardware design for H.264/AVC decoder," in Proc. ICME, 2009, pp. 1106-1109.
[16] C.-C. Lin, J.-I. Guo, H.-C. Chang, Y.-C. Yang, J.-W. Chen, M.-C. Tsai, and J.-S. Wang, "A 160kGate 4.5kB SRAM H.264 video decoder for HDTV applications," in ISSCC Dig. Tech. Papers, 2006, pp. 406-407.
[17] C. Li, K. Huang, X. Yan, J. Feng, D. Ma, and H. Ge, "A high efficient memory architecture for H.264/AVC motion compensation," in Proc. ASAP, 2010, pp. 239-245.
[18] R.-G. Wang, J.-T. Li, and C. Huang, "Motion compensation memory access optimization strategies for H.264/AVC decoder," in Proc. ICASSP, 2005, vol. 5, pp. 97-100.
[19] Y. Li and Y. He, "Bandwidth optimized and high performance interpolation architecture in motion compensation for H.264/AVC HDTV decoder," J. Signal Process. Syst., vol. 52, no. 2, pp. 111-126, Aug. 2008.
[20] E. Matei, C. Praet, J. Bauweklinck, P. Cautereels, and E. Lumley, "Novel data storage for H.264 motion compensation: system architecture and hardware implementation," EURASIP J. Image and Video Process., pp. 1-12, Dec. 2011.
[21] Y. Li, Y. Qu, and Y. He, "Memory cache based motion compensation architecture for HDTV H.264/AVC decoder," in Proc. ISCAS, 2007, pp. 2906-2909.
[22] T.-D. Chuang, L.-H. Chang, T.-W. Chiu, Y.-H. Chen, and L.-G. Chen, "Bandwidth-efficient cache-based motion compensation architecture with DRAM-friendly data access control," in Proc. ICASSP, 2009, pp. 2009-2012.
[23] X. Chen, L. Peilin, Z. Jiayi, Z. Dajiang, and S. Goto, "Block-pipelining cache for motion compensation in high definition H.264/AVC video decoder," in Proc. ISCAS, 2009, pp. 1069-1072.
[24] G.-S. Yu and T. S. Chang, "Optimal data mapping for motion compensation in H.264 video decoding," in Proc. SiPS, 2007, pp. 505-508.
[25] L. C. Chiu and T. S. Chang, "A lossless embedded compression codec engine for HD video decoding," in Proc. VLSI-DAT, 2012, pp. 1-4.
[26] Ovics, ViC-1 HEVC 4Kp120 Decoder, http://ovics.com/wp-content/uploads/2014/01/vic-1-ProdBrief-2014-011.pdf, 2014.

Pai-Tse Chiang received the B.S. and M.S. degrees in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2011 and 2013, respectively.
His current research interests include digital signal processing, video coding and system-on-chip design.

Yi-Ching Ting received the B.S. and M.S. degrees in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2011 and 2013, respectively. He is an engineer with PixArt Imaging Inc., Hsinchu, Taiwan. His current research interests include image processing and multimedia applications.

Hsuan-Ku Chen received the B.S. and M.S. degrees in electronics engineering from National Chiao Tung University (NCTU), Hsinchu, Taiwan, R.O.C., in 2011 and 2013, respectively. He is an engineer with PixArt Imaging Inc., Hsinchu, Taiwan. His current research interests are video coding and digital integrated circuits.

Shiau-Yu Jou received the B.S. and M.S. degrees in electronics engineering from National Chiao Tung University (NCTU), Hsinchu, Taiwan, R.O.C., in 2011 and 2013, respectively. He is an engineer with PixArt Imaging Inc., Hsinchu, Taiwan. His current research interests are image processing and digital integrated circuits.

I-Wen Chen received the B.S. and M.S. degrees in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in 2012 and 2014, respectively. She is an engineer with PixArt Imaging Inc., Hsinchu, Taiwan. Her current research interests are image processing and digital integrated circuits.

Hang-Chiu Fang received the B.S. degree in electrical engineering from National Chung Hsing University, Taichung, Taiwan, R.O.C., in 2012, and the M.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in 2014. He is an engineer with ELan Inc., Hsinchu, Taiwan. His current research interests are video coding and digital integrated circuits.

Tian-Sheuan Chang (S'93-M'06-SM'07) received the B.S., M.S., and Ph.D. degrees in electronic engineering from National Chiao Tung University (NCTU), Hsinchu, Taiwan, in 1993, 1995, and 1999, respectively. From 2000 to 2004, he was a Deputy Manager with Global Unichip Corporation, Hsinchu, Taiwan. In 2004, he joined the Department of Electronics Engineering, NCTU, where he is currently a Professor. In 2009, he was a visiting scholar at IMEC, Belgium. His current research interests include system-on-a-chip design, VLSI signal processing, and computer architecture. Dr. Chang received the Excellent Young Electrical Engineer award from the Chinese Institute of Electrical Engineering in 2007 and the Outstanding Young Scholar award from the Taiwan IC Design Society in 2010. He has been actively involved in many international conferences as an organizing or technical program committee member. He is currently an Editorial Board Member of the IEEE Transactions on Circuits and Systems for Video Technology.