A QFHD 30 fps HEVC Decoder Design


Pai-Tse Chiang, Yi-Ching Ting, Hsuan-Ku Chen, Shiau-Yu Jou, I-Wen Chen, Hang-Chiu Fang and Tian-Sheuan Chang, Senior Member, IEEE

Abstract: The HEVC video standard provides superior compression with large, variable-sized coding units and advanced prediction modes, which lead to high buffer cost, high memory bandwidth and irregular computation in ultra high definition (HD) video decoding hardware. This paper presents an HEVC decoder with a four-stage mixed block size pipeline that reduces the pipeline stage buffer size by approximately 91% compared with a 64x64 block-based pipeline. The high memory bandwidth of motion compensation is addressed by block-based data access, precision-based data access and a smart buffer, which together reduce the data bandwidth by 88%. For irregular computation, a reconfigurable architecture is adopted to unify the variable-size transforms. A common intra prediction module with a 4x4 block-based bottom-up computation handles variable-size intra prediction and its modes in a regular manner. Furthermore, corner position computation for the motion vector predictor (MVP) is applied to handle variable-size motion compensation. Finally, the implementation with the TSMC 90 nm CMOS process used 467 K logic gates and KBytes of on-chip memory and supported @30fps video decoding at a 270 MHz operation frequency.

Index Terms: High efficiency video coding, decoder, VLSI implementation

Copyright (c) 2014 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org. Manuscript received Jan. 28, 2014; revised May , Aug , Nov. 28, 2014, and Feb. 1, 2015; and accepted Feb . This work was supported by Ministry of Science and Technology, R. O. C., under Grant E . All the authors are with Department of Electronics Engineering, National Chiao Tung University, Hsinchu 300, Taiwan, R. O. C. (e-mail: tschang@mail.nctu.edu.tw)

I. INTRODUCTION

To meet the ultra high definition (HD) video compression demand, high efficiency video coding (HEVC) [1][2], the latest video coding standard, has recently been standardized to provide a 50% bit rate reduction over the previously popular H.264/AVC standard. Although HEVC uses a hybrid coding scheme similar to that of H.264, consisting of inter- and intra-frame prediction, transform units, an in-loop filter, and entropy coding, it displays a marked improvement over H.264 in several aspects.

1) Large hierarchical units: Instead of the fixed macroblocks in H.264, HEVC partitions a frame into raster-scanned coding tree units (CTUs) that are fixed to 64x64, 32x32 or 16x16. Each CTU is further recursively split into smaller coding units (CUs), down to the size of 8x8. Each CU is split into one, two or four prediction units (PUs) from 64x64 to 4x4 (intra) or 8x4/4x8 (inter), with mixed intra or inter prediction instead of all-inter or all-intra predictions as in H.264. This approach results in the data dependency of the intra and the inter predictions. Each CU is recursively split into transform units (TUs) from 32x32 to 4x4. These large, hierarchical and flexible CUs, PUs and TUs can encode dynamic content very well, but the size of the largest CTU is 16x larger than the previous macroblock, which significantly increases the required buffer size.

2) Advanced predictions: HEVC uses 35 intra-prediction modes, including DC, planar and 33 angular modes for all PU sizes, which is many more than the 9 modes in H.264. In addition, the inter-prediction in HEVC uses an 8-tap interpolation filter instead of the 6-tap one found in H.264. Generally speaking, these predictions can model content better; however, they also lead to irregular computation and high memory bandwidth.
3) Simplified structure: The in-loop filter uses a simpler deblocking filter, operating on an 8x8 grid instead of a small 4x4 grid as in H.264. This approach enables parallel filtering of different edges. HEVC also introduces a new loop filter, sample adaptive offset (SAO), to reduce any distortion between the reconstructed and original data. Moreover, HEVC uses only one type of entropy coding to simplify the design. The above-mentioned improvements certainly facilitate better coding efficiency but, at the same time, also require significant computational complexity [3], large on-chip storage and high memory bandwidth, especially for real-time ultra HD video processing, which in turn demands a hardware implementation to meet the real-time requirements. To meet the above requirements, several decoder designs have been recently proposed [4]-[6]. An FPGA prototype has been proposed in [4] to decode 1080p@60fps with a 7-stage pipeline architecture. To avoid complex synchronization of the variable PU and TU size within a CTU, they adopted the

largest CTU level (64x64) as their basic pipeline unit, which leads to a simple control but incurs a significant pipeline buffer cost. The design in [5] is a 3840x2160@30fps decoder that uses a two-stage sub-pipeline scheme. Their pipeline scheme was also a CTU-based design but was able to adapt to different CTU sizes. Their design adopted variable-sized pipeline blocks with a fixed height (64) and CTU-dependent widths (16, 32 or 64) to reduce the data access switching between luma and chroma pixels. Although this approach unified the control flow, the pipeline buffer size was still based on 64x64 for the worst-case condition. Furthermore, [6] proposed a 1080p@30fps decoder that used a four-stage pipeline with embedded compression to reduce bandwidth. Several single-module designs have also been proposed [7]-[14]. However, all these module designs still need suitable tailoring to meet the needs of an entire decoder.

This paper presents an efficient HEVC decoder design with a four-stage pipeline to address, in particular, the buffer, irregular computation and memory bandwidth issues. For the buffer size, we analyzed the data dependency of the decoder and proposed a mixed block size pipeline that uses 32x32 for the first stage and for the rest of the stages instead of 64x64 or a fixed height for the entire pipeline. The complex control of the pipeline was avoided through a 4x4 block-based prediction structure that adapts to variable PU sizes. This approach can reduce the total buffer size by 87.2% compared to the previous design [5]. The memory bandwidth in the motion compensation (MC) was reduced by block-based data access and a reference data cache with a simplified addressing scheme. Moreover, to handle irregular computations, a reconfigurable transform architecture, common intra-prediction modules with a 4x4 bottom-up structure, and corner position computation for motion vector prediction (MVP) were used.
All these optimizations led to a 35% gate count reduction, compared to the previously reported design [5]. The rest of the paper is organized as follows. In Section II, the design analysis and pipeline overview of the proposed decoder will be presented. In Section III, the details of the component designs will be described. Next, the implementation results and design comparisons will be shown in Section IV. Finally, conclusions will be made in Section V. II. MIXED BLOCK SIZE PIPELINE A. Overview of the four-stage pipeline Fig. 1 shows the pipeline of the proposed HEVC decoder. For the target specification, 4096x2160@30fps at 270 MHz with the 90 nm CMOS process, the decoder was divided into four pipeline stages: entropy decoder, IQ/IT and reference data loading, reconstruction of prediction, and loop filters. The overall functions of the four stages are as follows. The first stage, the entropy decoder, decodes residual coefficients and other information with the context adaptive binary arithmetic decoder (CABAD). Then, the second stage reconstructs the residual data through inverse quantization (IQ) and inverse transform (IT). At the same stage, the corresponding motion vector (MV) is generated to load reference data into a smart buffer. The third stage reconstructs pixels by adding the residual data to the prediction values. Finally, the reconstruction data are filtered by the two in-loop filters, deblocking and SAO, for the final output. B. Analysis of the pipeline The pipeline block size directly affects the final scheduling and the cost of both stage buffers and intermediate buffers. The stage buffer size can be estimated as follows. Assume a 2Nx2N pipeline block. 
The buffer size for the transform coefficient buffer, residual buffer, or pre-filter buffer is given by

$$(\text{Size for Y} + \text{Size for Cb and Cr}) \times 2 = (2N \times 2N + N \times N \times 2) \times 2 = 12N^2 \qquad (1)$$

This size consists of buffers for Y, Cb, and Cr in the 4:2:0 format with the ping-pong buffer style to match the throughput differences between the stages.

Fig. 1 Pipeline of the proposed HEVC decoder (1st stage: CABAD; 2nd stage: inverse quantization, inverse transform, MV generation and reference pixel access through the smart buffer; 3rd stage: intra prediction, interpolation and reconstruction; 4th stage: deblocking filter and SAO)

(c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

The size of the smart buffer (reference data buffer for MC) can be roughly estimated as

$$(\text{no. of 8x4 Y blocks} \times \text{8x4 interpolation} + \text{no. of 4x2 CbCr blocks} \times \text{4x2 interpolation}) \times 2$$
$$= \left[\frac{2N \times 2N}{8 \times 4} \times (15 \times 11) + \frac{N \times N \times 2}{4 \times 2} \times (7 \times 5)\right] \times 2 = 58.75N^2 \text{ Byte} \qquad (2)$$

Here 15x11 is the reference window of an 8x4 luma block under the 8-tap filter, (8+7)x(4+7), and 7x5 is the window of a 4x2 chroma block under the 4-tap filter, (4+3)x(2+3). This size considers only the data for one pipeline block for simplicity and consists of buffers for Y, Cb, and Cr in the 4:2:0 format with the ping-pong buffer style. Thus, the total stage buffer size is 12N^2 ( ) 58.75N^2 = N^2 for 16-b/9-b/8-b precision for the transform coefficient buffer, residual buffer, and pre-filter buffer, respectively. A direct design with a 64x64 block size (N=32) will need KB SRAM. The buffer size can be simplified to only the Y buffer size, i.e., N^2, if Y and CbCr data are interleaved.

The optimized buffer size should consider the trade-off among the computational dependency and efficiency within and between the modules, the buffer cost and the external memory bandwidth, as derived below.

For the first stage, the only restriction for the buffer between CABAD and IQ/IT is to accommodate the maximum IT size, 32x32. A buffer size larger than 32x32, say, 64x64, is unnecessary and does not provide any remarkable benefit. However, a smaller buffer cannot compute a larger size transform without complicated scheduling. In addition, the transform coefficients decoded from CABAD have to be reordered before the inverse transform. Hence, we set the transform coefficient buffer size to 32x32x2 instead of 64x64x2. The corresponding IT pipeline block was set to 32x32 for the first 1-D IT. However, the second 1-D IT could be decomposed into two units with even-odd decomposition. Thus, the maximum required size of the IT output buffer, the residual buffer, would be x2 instead of 32x32x2, which saves 3/4 of the SRAM.

For the second stage, Fig.
2 shows the ratios of the external data access increase and the smart buffer size for MC interpolation relative to a 64x64 block for a typical video sequence. The data access amount depends on the selected PU size and the MC pipeline block size. If the PU size is larger than the MC pipeline block size, the PU access must be split into smaller ones to fit the pipeline block size, which results in accessing duplicated data. Thus, a smaller pipeline size will lead to much higher data access even though it has a smaller buffer cost. Therefore, we choose as a tradeoff. The above-mentioned analysis ignored reuse of duplicated data between blocks. If we consider possible data reuse schemes as discussed in Section III, the smart buffer size could be increased to (32x32)x2 = 2048 Bytes to save memory bandwidth for list 0 and list 1.

Fig. 2 Ratio of data access increase and buffer size relative to the 64x64 pipeline block.

For the final stage, the pre-filter pixel buffer stores the reconstructed output for the following deblocking filter. Thus, the stage buffer size was set to x2 to store data from the previous stage with a pipeline block. Its internal processing is decomposed into an 8x8 sub-pipeline block to fit 8x8-grid edges at the deblocking filter.

For line buffers that store neighboring information, such as neighboring MVs, prediction information, reference data for intra prediction, and deblocking buffers, we only store part of them on-chip (e.g., three LCU-wide buffers) and the other data off-chip to reduce buffer sizes, which is similar to previous H.264 decoder chips and an HEVC decoder design [6].

In summary, the pipeline size was set to 32x32 for CABAD, IQ/IT and for the rest of the modules. The stage buffer size was 7.232 KB, which saves 62% and 91% of the buffer cost when compared with the 32x32 and 64x64 cases, respectively.

C. Scheduling of the proposed design
Fig.
3 shows the mixed block size scheduling of the proposed design with four pipeline stages and interleaved luma and chroma block processing. First, this design decoded the bitstream by CABAD at the first stage in a 32x32 block size and then applied IQ and the first 1-D IT at the second stage in a 32x32 block size for luma blocks. The second 1-D IT was operated in a block. The chroma blocks were operated in a block size. Once the required data were available, the subsequent operations could be started. The cycle count in each module was not fixed because of variable PU size and different complexity.

III. COMPONENT DESIGNS

A. CABAD

When the targeted HEVC decoder design was 4096x2160@30fps, the number of bins required was 110 Mbins/sec to 120 Mbins/sec based on our simulations for the target bit rate specification [1]. The limits on particular CTUs and 32x32 units could be much higher for the worst-case condition (e.g., 4x4 block with CTB), which rarely

occurs in practical conditions. Thus, 270 Mbins/sec is enough for bit rate variations based on the simulations. A single-bin-per-cycle architecture was therefore proposed, as shown in Fig. 4, with a throughput of 270 Mbins/s when operating at 270 MHz. The input bitstream was first decoded through binary arithmetic decoding, and the decoded bin was then directly passed to the context selection because of the smaller critical delay. The context selection produced one context address from two parallel options of the decoded bin to avoid dependency and ensure a single bin per cycle. The context address and the context state from the context modeling were then sent to the binary arithmetic decoding for the next bin decoding.

B. Inverse transform

The inverse transform matrix in HEVC possesses the property of coefficient symmetry; the same coefficients are present in the even rows and similar coefficients are in the odd rows for all of the TU sizes. Thus, this paper proposes a reconfigurable inverse transform architecture [7] that is well suited to various TU sizes. This architecture adopts the commonly used row-column decomposition to convert a 2-D transform into two consecutive 1-D transforms. Each 1-D 32-point transform is further decomposed into two 1-D 16-point transforms with even-odd decomposition. The even part of a 16-point transform can again be decomposed into smaller size transforms to adapt to various TU sizes by reusing the same coefficients. However, the odd part has different but similar coefficients. Thus, we decompose the coefficients in the odd part into a base coefficient for one TU size and a refined term for another TU size to compensate for the difference. The scheduling of the proposed design involves row-by-row processing of each of the 32x32 units, regardless of the TU combination within a single 32x32 unit.
Moreover, one 32-point row is reconfigured into any legal TU combination from (4, 8, 16, 32) to maintain regular processing and full hardware utilization.

Fig. 3 Schedule of the decoder (32x32 blocks B0 to B3 passing through CABAD, IQ/IT, MC, intra prediction and in-loop filtering, with the 1st-D and 2nd-D IT and the Y and CbCr blocks interleaved; the figure is annotated with a 1098-cycle interval)

Fig. 4 Single bin CABAD architecture

C. Deblocking filter and SAO

Fig. 5 shows the interleaved scheduling of the proposed non-pipelined deblocking filter (see the details in Table I), which operates on a 4x4 block base. The processing of luma and chroma blocks is also interleaved. The number of clock cycles for a block is 182. The proposed filter design first takes one line of pixels from two adjacent 4x4 blocks as inputs for every cycle, computes its strength and finally applies the suitable filters. Each edge in a 4x4 block takes a total of 8 cycles: 2 cycles for boundary strength, 1 cycle for filter strength, 4 cycles for edge filtering, and 1 cycle to write back the remaining output. In the interleaved scheduling, horizontal filtering is first performed over its two vertical edges, from the top four-line unit to the bottom four-line unit (edge 1 and edge 2). Because all the data are available for the horizontal edge 3, vertical filtering is then performed over the top horizontal edge (edge 3 to edge 4). This horizontal-vertical interleaved approach is repeated for every 8x8 block in the Z-scan order. This interleaved approach reuses all the available data immediately to reduce the intermediate buffer cost.

Table I. Scheduling of the deblocking filter

Cycle unit (8 cycles)   P block   Q block   Processing edge   Filtered block to SAO
C1                      B4        B5        E1
C2                      B8        B9        E2
C3                      B0        B4        E3                B0, B4
C4                      B1        B5        E4                B1, B5

Fig. 5 Interleaved scheduling of the proposed deblocking filter

Fig. 6 shows the deblocking filter architecture, in which the horizontal and vertical filtering share the same filters. The input is from the pre-filter buffer or the line buffer that stores partially filtered data, depending on the edge locations. All memory access schemes follow the interleaved scheduling.

Fig. 7 shows the proposed SAO design, which is tightly coupled with the deblocking filter. This design receives its four columns of 8-pixel input from the deblocking filter every 16 cycles, computes and then accumulates its offset cost: band offset (BO) or edge offset (EO). Once the data are ready, it selects the minimum as the final offset operation for every 8x8 block. BO has no dependency on different rows of pixels. EO has four patterns: horizontal, vertical, 45 degree and 135 degree, which require the neighboring upper and lower rows of reference pixels for computation, except the horizontal pattern. The cost computation is a direct implementation as these equations have been optimized in the standard.
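The two SAO offset types described above can be illustrated with a short sketch. It follows the sample classification of the HEVC standard rather than the authors' hardware: `eo_category` compares a sample with its two neighbors along the selected pattern, and `bo_band` uses the five most significant bits of a sample as its band index. The function names, the `EO_NEIGHBORS` table and the 8-bit default are assumptions made for illustration.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def eo_category(a, c, b):
    """Edge-offset category of sample c with neighbors a and b.

    0: no offset, 1: local minimum, 2: concave corner,
    3: convex corner, 4: local maximum.
    """
    s = sign(c - a) + sign(c - b)  # ranges over -2..2
    return {-2: 1, -1: 2, 0: 0, 1: 3, 2: 4}[s]


# (dx, dy) positions of the two neighbors for each of the four EO patterns.
# Only the horizontal pattern stays within the current row of pixels.
EO_NEIGHBORS = {
    "horizontal": ((-1, 0), (1, 0)),
    "vertical": ((0, -1), (0, 1)),
    "135_degree": ((-1, -1), (1, 1)),
    "45_degree": ((1, -1), (-1, 1)),
}


def bo_band(sample, bit_depth=8):
    """Band-offset index: one of 32 equal-width intensity bands."""
    return sample >> (bit_depth - 5)
```

Because three of the four patterns reference samples at dy = -1 and dy = +1, an EO unit needs the rows above and below the current one, which matches the dependency noted in the text.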
Fig. 6 Architecture of the proposed deblocking filter

Fig. 7 Architecture of the proposed SAO

D. Reconstruction of intra prediction

The main challenges in the design of intra prediction were the various PU sizes, ranging from 64x64 to 4x4, and the 35 prediction modes, which complicate computation and on-chip data fetching. A previously reported design [11] used several parallel data paths to calculate prediction equations for different PU sizes, but the only supported PU sizes were 4x4 and 8x8. Extending this method to more PU sizes would result in low hardware utilization. Therefore, we have proposed an efficient hardware architecture suitable for all intra PU sizes, as shown in Fig. 8, which involves a common intra prediction module with the 4x4 bottom-up structure and an adaptive sample fetch controller to compute different PU size predictions in a regular manner. The proposed intra prediction module also implements the DC mode with tree-based average buffers, the directional mode with pixel-based prediction mapping, and PU size adaptive reconstruction buffers for different PU sizes.

Fig. 9 displays the corresponding schedule for each PU size with a two-stage sub-pipeline. In this schedule, each 4x4 block requires 9 cycles to load the reference data, calculate prediction values and wait for reconstruction. In addition, based on the decoding order, the chroma blocks are computed after the luma block is completed for each PU. Therefore, the total necessary cycles are 216 per block.
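The cycle counts of the intra prediction schedule follow directly from the 9-cycle cost of a 4x4 block. The sketch below reproduces that arithmetic under two assumptions: 4:2:0 sampling, and luma, U and V blocks processed one after another as the text describes. The name `intra_cycles` is ours, not the paper's.

```python
CYCLES_PER_4X4 = 9  # load reference data, predict, wait for reconstruction


def intra_cycles(luma_size):
    """Cycles to predict one luma_size x luma_size block and its chroma."""
    luma_blocks = (luma_size // 4) ** 2
    chroma_blocks = (luma_size // 8) ** 2  # per chroma plane in 4:2:0
    return CYCLES_PER_4X4 * (luma_blocks + 2 * chroma_blocks)


print(intra_cycles(8))   # 54
print(intra_cycles(16))  # 216
```

The 216-cycle figure quoted in the text therefore corresponds to a 16x16 luma block (144 cycles) plus its two 8x8 chroma blocks (36 cycles each), in line with the schedule drawn in Fig. 9.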

Fig. 8 Architecture of the intra prediction

Fig. 9 Decoding schedule of intra prediction for each PU size (two-stage sub-pipeline of data loading and 2-pixel prediction, 9 cycles per 4x4 block; an 8x8 block takes 36 cycles for luma plus 9 cycles each for U and V, 54 cycles in total; a 16x16 block takes 144 cycles for luma plus 36 cycles each for U and V, 216 cycles in total)

Fig. 10 The intra prediction datapath

1) 4x4 bottom-up structure and PU size adaptive sample fetching

To satisfy the recursive combination of PUs with a simple control, a 4x4 block was selected as a basic unit to form a bottom-up structure. That is, each block is processed in a double-z-scan order with 16 basic units instead of row-by-row. Therefore, the processing order of each block is regular, regardless of the PU sizes. For the above processing, we implemented a PU size-adaptive sample fetch controller, which fetches the corresponding reference samples adaptively based on the PU size and the coordinates of the current block because the prediction equations of different PU sizes are similar. The reconstructed data are stored in the PU size adaptive reconstruction buffer (Fig. 11), which can adapt to different PU sizes. With this buffer, the use of any extra buffer for each PU size can be avoided.

Fig. 11 An example of the PU size-adaptive reconstruction buffer

2) Common intra prediction modules for all PU sizes

Fig. 10 shows a common intra prediction data path for different PU sizes with 2-pixel-per-cycle throughput to satisfy the processing rate. In this data path, the planar, DC, and directional modes are separated into independent flows because of different prediction equations. The corresponding reference sample filters are also integrated into the prediction datapath but are not shown here for figure clarity.

Planar mode: Fig. 10 and Fig. 12 show the prediction data path and buffer for the planar mode. With 2-pixel-per-cycle throughput, the adaptive sample fetch controller updates the top reference pixels, T_0 and T_1, every cycle and the left reference pixels, L, every two cycles. The values of T_N and L_N are updated only for a new PU. Next, the values of the bottom reference pixels, B_0 and B_1, and right reference pixels, R, for the following blocks are calculated with subtractors, as displayed in Fig. 12. Finally, for a PU of size N x N = 2^n x 2^n, the prediction values are computed with a position-related linear combination, which is simplified with shift operators as follows

$$\mathrm{Pred}_0(x_0, y) = \{[(T_0 \ll n) + (B_0 \times y)] + [(L \ll n) + (R \times x_0) + 2^n]\} \gg (n+1) \qquad (3)$$

$$\mathrm{Pred}_1(x_1, y) = \{[(T_1 \ll n) + (B_1 \times y)] + [(L \ll n) + (R \times x_1) + 2^n]\} \gg (n+1) \qquad (4)$$

Fig. 12 The prediction buffers of the planar mode

DC mode with tree-based average buffers: The DC mode requires all the reference samples to estimate the average value of different PU sizes and positions. Thus, to share these average values for different cases, two four-level tree-based average buffers are implemented for the left and above reference samples, respectively.

Directional mode with pixel-based prediction mapping: In the current HEVC reference software, a single function is used to cover all 33 prediction modes to simplify the software implementation. To fit into this function, the top reference and the left reference samples are exchanged first when the prediction mode belongs to the horizontal modes (modes 2 through 17). Then, the prediction values are calculated with the corresponding vertical modes (modes 19 through 33). After all prediction values in a PU are computed, the block flips back again. However, in hardware this flow is a block-based operation that requires additional buffers to store the unflipped block, and time is wasted on flipping. To solve the above problems, the common prediction equation was decomposed into a pixel-based mapping equation with the angle parameter shown in Table II, which does not need to flip the PU.

$$\mathrm{Pred}(x, y) = [(32 - f(x, y)) \times S_0 + f(x, y) \times S_1 + 16] \gg 5 \qquad (5)$$

$$f(x, y) = [(x\_ang \times x) + (y\_ang \times y)] \bmod 32 \qquad (6)$$

where x and y are the coordinates of pixels, and S_0 and S_1 are the corresponding reference samples. In addition, the equation to calculate the extension of the main reference samples was also modified as follows.

$$\mathrm{Ext\_Index}(x, y) = [128 + invang \times ang(x, y)] \gg 8 \qquad (7)$$

$$ang(x, y) = \begin{cases} [(x\_ang \times x) + (y\_ang \times y)] \gg 5, & \text{mode} < 11,\ \text{mode} \ge 26 \\ [(y\_ang \times y) - (x\_ang \times x)] \gg 5, & \text{mode} < 18 \\ [(x\_ang \times x) - (y\_ang \times y)] \gg 5, & \text{mode} < 26 \end{cases} \qquad (8)$$

Table II. The angle parameter of the directional modes (columns: Mode, x_ang, y_ang, invang)

E. Motion compensation

The MC design in HEVC is more complex than in the previously reported standards because of various PU sizes, more complicated interpolation, and MVP, which increase computation irregularity, buffer size and bandwidth to a large extent. In the proposed MC design, the bandwidth and buffer size were reduced first with block access, precision-based access, and a smart buffer. Then, the computation irregularity was avoided and the area cost was reduced with the corner index-based MVP generation and 4x4 interpolators with common subexpression sharing.

1) Bandwidth reduction

The memory bandwidth reduction method in MC involves reuse of the reference samples in the highly overlapped data between different blocks, which is similar to the previous approaches [15]-[17] for H.264. Other approaches, such as efficient DRAM data mapping [24] and embedded compression [25], can also be combined directly to save more bandwidth. In the current work, the sole focus was on data reuse, which was based on three approaches: partition-based access, precision-based access and cache-based design.

The partition-based access loads reference data according to different partition sizes. A straightforward method to load reference data is to decompose a partition into 4x4 blocks and load an 11x11 block for each 4x4 interpolation. However, if the partition size is larger than 4x4, the corresponding reference data of 4x4 blocks in the same partition would be highly overlapped, and they can be reused to reduce the bandwidth.
Thus, in the previous work for H.264 [15]-[17], the upper and upper left blocks were saved to reuse the overlapped part of the current block [15], or (M+5)x(N+5) blocks were accessed for MxN blocks with M, N = 4 or 8 [16] or up to 16 [17]. In general, a larger access size will have a lower bandwidth but will increase the buffer cost. This method also works for HEVC. However, the maximum PU size of HEVC is 64x64 instead of 16x16 in H.264, which would lead to significant data storage. In the current design, the block-based data access was adopted instead of 64x64, which would reuse overlapped samples for PU size <=. Moreover, the proposed smart buffer was also capable of reusing data for larger PUs, as described later. Thus, this size selection is compatible with the processing unit in MC and provides a good tradeoff between the buffer cost and the bandwidth.

The precision-based data access loads the reference data according to different fractional MV positions, similar to previous H.264 work [18]-[21]. For HEVC, an MxN PU will load MxN, (M+7)xN, Mx(N+7) or (M+7)x(N+7) data according to its fractional MV positions, which avoids loading the (M+7)x(N+7) data for all types of MV positions.

A cache in MC can further increase data reuse of adjacent block access (e.g., [21]-[23] for H.264) to reduce the bandwidth. The design in [22] proposed a 6-way cache to reduce the bandwidth by more than 70%. However, the authors used a large-sized memory (6x Bytes), a large number of registers and complex address mapping to implement a cache. These issues were also found in the design of HEVC [5]. Moreover, the PU size in HEVC is up to 64x64, which is much larger than the 16x16 in H.264. To address these issues, this paper proposes a cache design with a simplified addressing scheme, denoted as a smart buffer, as shown in Fig. 13. Table III lists the percent savings of the data loading amount compared with the original 4x4 block-based loading.
An average savings of 88% was found for the current proposed approach. Other sequences show similar results. The table also shows that the savings for cache size larger than 32x32 are quite similar. Thus, 32x32 was chosen for the current method, which can fit our selection of the processing unit and access size.
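The precision-based access rule described above can be written down compactly. The following is a minimal sketch, assuming luma motion vectors in quarter-sample units (as in HEVC) and an 8-tap filter; the function name `ref_window` and its parameters are hypothetical and do not come from the paper.

```python
def ref_window(m, n, mv_x, mv_y, taps=8):
    """Reference window (width, height) to load for an m x n luma PU.

    mv_x and mv_y are in quarter-sample units. A dimension only needs the
    extra (taps - 1) samples when its MV component has a fractional part.
    """
    extra = taps - 1
    width = m + extra if mv_x & 3 else m
    height = n + extra if mv_y & 3 else n
    return width, height


# Integer MV in both directions: no interpolation margin is loaded.
print(ref_window(16, 16, 8, -4))   # (16, 16)
# Fractional MV only horizontally: margin in x only.
print(ref_window(16, 16, 9, -4))   # (23, 16)
# Fractional MV in both directions: the full (M+7)x(N+7) window.
print(ref_window(16, 16, 9, -3))   # (23, 23)
```

For a 16x16 PU with an integer MV this loads 256 samples instead of the 529 samples of the worst-case window, which is where the saving of this access mode comes from.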

The proposed design partitions a 32x32 buffer into 16 8x8 blocks with a double-z-scan (DZS) index, as shown in Fig. 14, to fit the DZS computing order in HEVC. The same partition method and its DZS index were also applied to a 64x64 block (Fig. 15). An 8x8 block in a 64x64 block can, thus, be easily mapped to a block in the smart buffer using the DZS index modulo 16 as the mapping function. This addressing scheme provides an easy way to update the smart buffer block without any complex address control.

Table III. Percent savings of data loading amount compared with the 4x4 block-based access

Method      BasketballDrive   RaceHorsesC   BasketballPass   Avg.
B           24.7%             24.3%         68.1%            39.0%
C           79.9%             78.9%         84.2%            81.0%
D()         82.1%             80.9%         86.1%            83.0%
D(32x32)    89.1%             86.9%         88.1%            88.0%
D(64x32)    90.6%             88.2%         88.7%            89.1%
D(64x64)    91.5%             89.2%         89.4%            90.0%

Note: B: precision-based request (4x4 based); C: B + block size request; D: C + smart buffer.

Fig. 15 A smart buffer example (8x8 pixels for each word)

2) MC design

The proposed MC architecture is divided into two stages. The first stage computes the final MV from the MVP and the motion vector difference (MVD) and accesses the reference data. After the reference pixels of a 4x4 block are ready in the stage buffer, the second stage starts to interpolate the reference data and reconstruct the pixels.

To generate MVs in a regular way adaptable to variable PU sizes, the basic block of MVs is set to 4x4. Thus, the neighboring MVs are stored on a 4x4 base instead of a PU base to provide a regular way of accessing the neighboring MVs. Next, the corner index of the PU is computed to access the corresponding MVP candidates in different PU sizes, as shown in Fig. 16. If the size of the PU is larger than, only the top leftmost 4x4 block will then go through the MVP generation.
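The modulo-16 mapping of the smart buffer can be illustrated as follows. This sketch assumes that the DZS index of an 8x8 block is the usual z-order (bit-interleaved) index of its block coordinates, with the column as the lower bit; the paper does not spell out the bit order, and the names `dzs_index` and `smart_buffer_slot` are ours.

```python
def dzs_index(bx, by, bits=3):
    """Z-order index of the 8x8 block at column bx, row by.

    bits=3 covers the 8x8 grid of 8x8 blocks inside a 64x64 block.
    """
    idx = 0
    for i in range(bits):
        idx |= ((bx >> i) & 1) << (2 * i)      # column bit
        idx |= ((by >> i) & 1) << (2 * i + 1)  # row bit
    return idx


def smart_buffer_slot(bx, by):
    """Slot (0..15) of the 32x32 smart buffer that holds this 8x8 block."""
    return dzs_index(bx, by) % 16


# The block at column 5, row 1 of a 64x64 block has DZS index 19
# and lands in slot 3, the same slot as block (1, 1).
print(dzs_index(5, 1), smart_buffer_slot(5, 1))  # 19 3
```

Under this assumption the slot depends only on the two low bits of each block coordinate, so any 4x4 window of 8x8 blocks occupies 16 distinct slots and a new block simply overwrites the slot with the same index, with no further address translation.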
For interpolation, the 4x4 block was chosen as the basic unit to process the recursive PU combinations with a bottom-up structure, as shown in Fig. 17. The area cost of the filters was reduced through hardware sharing for identical coefficients and common subexpression sharing for different coefficients. As a result, the proposed interpolator saves 19.77% and 55% of the original area cost for the luma and chroma components, respectively.

Fig. 13 The smart buffer architecture

Fig. 14 Double-z scan order

Fig. 16 Corner position for different partition sizes
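To illustrate what sharing hardware for identical coefficients can look like, the sketch below uses the HEVC half-sample luma filter (-1, 4, -11, 40, 40, -11, 4, -1). Because the taps are symmetric, mirrored pixels can be added first, so each distinct coefficient is applied once and realized with shifts and adds. This is only an illustration of the general idea; the paper's actual sharing scheme, including the common subexpression sharing across different coefficients, is not detailed here, and the function names are hypothetical. The final rounding and normalization shift is omitted.

```python
# HEVC luma half-sample interpolation taps (sum = 64).
HALF_PEL = (-1, 4, -11, 40, 40, -11, 4, -1)


def half_pel_direct(p):
    """Reference form: eight multiplications, one per tap."""
    return sum(c * x for c, x in zip(HALF_PEL, p))


def half_pel_shared(p):
    """Shared form: pair mirrored pixels, then shift-and-add.

    Each of the four distinct coefficients is applied once:
    4 = 1<<2, 11 = 8 + 2 + 1, 40 = 32 + 8.
    """
    s0 = p[0] + p[7]
    s1 = p[1] + p[6]
    s2 = p[2] + p[5]
    s3 = p[3] + p[4]
    return (-s0
            + (s1 << 2)
            - ((s2 << 3) + (s2 << 1) + s2)
            + ((s3 << 5) + (s3 << 3)))


if __name__ == "__main__":
    row = [10, 20, 30, 40, 50, 60, 70, 80]
    print(half_pel_direct(row), half_pel_shared(row))
```

Both functions return the same unnormalized value for any eight input pixels; the shared form simply needs fewer multiplier-equivalent operations.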

To further reduce the temporary buffer cost, all the interpolations are scheduled at the earliest opportunity (Fig. 17). The original schedule computes all the horizontal filters first, stores the results, and then computes the vertical filter, which requires 44 results to be held in the local buffer (consistent with 11 rows of 4 horizontal results for a 4x4 block under an 8-tap luma filter). The proposed schedule computes the left half of the horizontal filters first, immediately consumes those results with the left-half vertical filters, and then repeats both steps for the right half. With this order, only 22 local results need to be saved.

Fig. 17 Proposed computational order of the luma interpolator (legend: pixel in current 4x4, reference pixel, horizontal FIR result, vertical FIR result, compute order)

IV. IMPLEMENTATION RESULTS AND COMPARISONS

The proposed decoder was implemented and synthesized in TSMC 90 nm 1P9M CMOS technology with support for the main profile tools; the results are shown in Tables IV and V. The prediction part occupies the largest area because of the various prediction modes and complex memory data access, while the inverse transform occupies the second largest area because it supports sizes up to 32x32. Table VI summarizes the techniques used in the current design to reduce the buffer size and the area cost while also increasing the adaptability to variable size processing.

Tables VII and VIII compare the proposed design with previously published designs [4]-[6], [26]. All of these designs support the main profile tools with B-frames, except [6], which implemented an earlier HEVC version with CAVLD as the entropy decoder and without any SAO. [4] requires the largest area and buffer cost because of its 64x64 pipeline. Compared with the intra prediction in [4], the proposed intra prediction reduces the gate count by 71.5% and the buffer size by 83% because of the common prediction units and the smaller pipeline size.
Other modules in [4] also have a higher area cost than those in the proposed design. The design in [6] has a gate count and buffer cost similar to the proposed design but is only capable of 1920x1080@30fps video processing. A detailed comparison was not possible because detailed design information is unavailable.

Both the proposed design and [5] support similar real-time decoding capability with variable-size pipelines. However, [5] uses a 64x64 pipeline size as the worst case, which significantly increases the area cost. By contrast, our pipeline size is 32x32 in the first stage, and it is reduced to for the remaining stages. The smaller pipeline size and the module optimizations also help to reduce the gate count. The proposed design reduces the gate count by 35% and the buffer size by 87.2% in comparison to [5]. A detailed module comparison in Table VII shows that every module in the proposed design needs a lower gate count than that in [5]. The major gate count savings come from the module optimizations, such as the single-bin CABAD, the reconfigurable prediction datapath, the interleaved deblocking filter, and the smart buffer scheme for the MC cache (which avoids complicated register files in the cache tag).

A commercial IP [26] supports 4Kp120 processing with a 700K gate count; no other details are disclosed. However, its on-chip memory requirement is as high as 151 KB.

Table IV. Detailed gate count of the proposed decoder
Pipeline stage | Module | Gate count (270 MHz, 90 nm)
 | Control | 21,342
1st stage | CABAD | 48,430
2nd stage | Inverse Quantization | 48,556
 | Inverse Transform | 63,844
 | Motion Vector Generation | 84,730
 | Reference Pixel Accessing | 15,841
 | Smart buffer | 47,130
3rd stage | Interpolation | 18,742
 | - luma interpolator | 14,462
 | - chroma interpolator | 4,280
 | Intra Prediction | 76,812
4th stage | Deblocking Filter | 20,551
 | SAO | 21,752
Total | | 467,730

Table V. Buffer size of the proposed design
Module | On-chip memory (Byte)
Inverse transform | 2048
MC | 1230
Intra prediction | 704
SAO | 384
Others | 4180
Pipeline ping-pong buffer |
- Coefficients |
- Smart buffer |
- Residual buffer | 576
- Pre-filter pixel | 512
Total | 15778
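As a rough sanity check on the operating point reported for this design (4096x2160 at 30 fps with a 270 MHz clock), the cycle budget per luma pixel can be worked out directly. The sketch below is only back-of-the-envelope arithmetic on those reported figures; the helper names are hypothetical, and chroma samples and any blanking overhead are ignored.

```python
WIDTH, HEIGHT, FPS = 4096, 2160, 30   # reported resolution and frame rate
CLOCK_HZ = 270e6                      # reported operating frequency


def luma_pixel_rate(width, height, fps):
    """Luma pixels that must be decoded per second."""
    return width * height * fps


def cycles_per_pixel(clock_hz, pixel_rate):
    """Average clock cycles available per luma pixel."""
    return clock_hz / pixel_rate


rate = luma_pixel_rate(WIDTH, HEIGHT, FPS)
budget = cycles_per_pixel(CLOCK_HZ, rate)

if __name__ == "__main__":
    print(f"{rate / 1e6:.1f} Mpixel/s")
    print(f"{budget:.3f} cycles per luma pixel")
    print(f"{16 * budget:.1f} cycles per 4x4 luma block")
```

The result is about 265.4 Mpixel/s, which leaves only slightly more than one cycle per luma pixel, or roughly 16 cycles per 4x4 block. That tight budget is one way to see why regular 4x4-based processing matters for every pipeline stage.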

Table VI. Summary of the design techniques
Module | Techniques
CABAD | Single bin per cycle
IT | Reconfigurable for all TU sizes
Intra prediction | A common intra prediction module for various PU sizes; 4x4-based bottom-up structure; adaptive sample fetch controller
MC | 4x4-based structure for various PU sizes; cache with simplified addressing scheme; corner position computation for MVP for various PU sizes
Deblocking filter | Interleaved processing to save intermediate buffer
Pipeline | Mixed block size (32x32 and ) to save pipeline buffer

Table VII. Module comparison with other designs
Design feature | [5] | [4] | Proposed
Standard | HEVC WD4 | HEVC WD6 | HEVC
Gate count | 715K | 1763K | 467K
- Entropy decoder | 94.5K (CAVLD) | 138K (CABAC) | 48.4K (CABAC)
- ITIQ | 121.1K | 779K | 112.4K
- Prediction | K (inter + intra) | 270K (intra) | 76.8K (intra)
- MC | 126K (cache) | 440K (inter + cache) | 166.4K (inter + cache)
- Deblocking | K | 84K | 20.5K
- SAO | NA | 32K | 21.7K
- Others | 131.2K (register files for cache tag, MEM interface) | 20K (signal communication) | 21K

V. CONCLUSION

This paper has presented a real-time HEVC decoder that is capable of decoding 4096x2160@30fps at 270 MHz. The proposed design adopts a 4-stage mixed-block-size pipeline to balance the buffer cost against the bandwidth, and it reduces the stage buffer cost by approximately 90%. The required bandwidth was reduced by approximately 88% through block access, precision-based access and a smart buffer scheme. The complex irregular computation was unified by a reconfigurable design for the inverse transform and by 4x4 block-based processing for prediction. With these approaches, the gate count and on-chip memory were reduced to 467 K and KB, respectively. These values represent savings of 35% in gate count and 87.2% in buffer cost compared with the other published designs with the same processing capabilities.

REFERENCES

[1] High Efficiency Video Coding, ITU-T Rec. H.265, Apr
[2] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, Dec. 2012, pp
[3] B. Li, G. J. Sullivan, and J. Xu, "Comparison of compression performance of HEVC Draft 9 with AVC high Profile and performance of HM9.0 with temporal scalability characteristics," JCTVC-L0322, Jan
[4] S. Cho and H. Kim, "Hardware implementation of a HEVC decoder," JCTVC-L0096, Jan
[5] C.-T. Huang, M. Tikekar, C. Juvekar, V. Sze, and A. Chandrakasan, "A 249Mpixel/s HEVC video-decoder chip for Quad Full HD applications," in ISSCC Dig. Tech. Papers, 2013, pp
[6] C.-H. Tsai, H.-T. Wang, C.-L. Liu, Y. Li, and C.-Y. Lee, "A 446.6K-gates V H.265/HEVC decoder for next generation video applications," in Proc. ASSCC, 2013, pp
[7] P.-T. Chiang and T. S. Chang, "A reconfigurable inverse transform architecture design for HEVC decoder," in Proc. ISCAS, 2013, pp
[8] J.-S. Park, W.-J. Nam, S.-M. Han, and S. Lee, "2-D large inverse Transform (, 32x32) for HEVC (High Efficiency Video Coding)," J. Semi. Tech. Sci., vol. 12, no. 2, pp , June
[9] J. Zhu, Z. Liu, and D. Wang, "Fully pipelined DCT/IDCT/Hadamard unified transform architecture for HEVC Codec," in Proc. ISCAS, 2013, pp
[10] W. Zhao, T. Onoye, and T. Song, "High-performance multiplierless transform architecture for HEVC," in Proc. ISCAS, 2013, pp

Table VIII. Decoder comparison with other designs
Design feature | [5] | [4] | [6] | [26] | Proposed
Standard | HEVC main | HEVC WD6 | HEVC | HEVC main/main10 | HEVC main
Version | WD4 (no SAO, CAVLC, fewer intra modes) | WD6 | NA | Final | Final
Technology | 40nm | 65nm | 90nm | NA | 90nm
Frequency | 200MHz | 266MHz | 224MHz | 600MHz | 270MHz
Resolution | 3840x x x x x2160
Frame rate | 30 fps | 60 fps | 35 fps | 120 fps | 30 fps
Pipeline | Variable | 64x64 (7 stages) | 4 stages | NA | Variable (4 stages)
Gate count | 715K | 1763K | 446.6K | 700K (8b) / 840K (10b) | 467K
Memory (KB) | | | | (8b) / 179 (10b) |
Throughput (Mpixel/sec) | | | | |

[11] E. Kalali, Y. Adibelli, and I. Hamzaoglu, "A High performance and low energy intra prediction hardware for HEVC video decoding," in Proc. DASIP,
[12] Z. Guo, D. Zhou, and S. Goto, "An optimized MC interpolation architecture for HEVC," in Proc. ICASSP, 2012, pp
[13] V. Afonso, H. Maich, L. Agostini, and D. Franco, "Low cost and high throughput FME interpolation for the HEVC emerging video coding standard," in Proc. LASCAS, 2013, pp
[14] W. Shen, Q. Shang, S. Shen, Y. Fan, and X. Zeng, "A high-throughput VLSI architecture for deblocking filter in HEVC," in Proc. ISCAS, 2013, pp
[15] D.-Y. Shen and T.-H. Tsai, "A 4x4-block level pipeline and bandwidth optimized motion compensation hardware design for H.264/AVC decoder," in Proc. ICME, 2009, pp
[16] C.-C. Lin, J.-I. Guo, H.-C. Chang, Y.-C. Yang, J.-W. Chen, M.-C. Tsai, and J.-S. Wang, "A 160kGate 4.5kB SRAM H.264 video decoder for HDTV applications," in ISSCC Dig. Tech. Papers, 2006, pp
[17] C. Li, K. Huang, X. Yan, J. Feng, D. Ma, and H. Ge, "A high efficient memory architecture for H.264/AVC motion compensation," in Proc. ASAP, 2010, pp
[18] R.-G. Wang, J.-T. Li, and C. Huang, "Motion compensation memory access optimization strategies for H.264/AVC decoder," in Proc. ICASSP, 2005, vol. 5, pp
[19] Y. Li and Y. He, "Bandwidth optimized and high performance interpolation architecture in motion compensation for H.264/AVC HDTV decoder," J. Signal Proc. Syst., vol. 52, no. 2, pp , Aug
[20] E. Matei, C. Praet, J. Bauweklinck, P. Cautereels, and E. Lumley, "Novel data storage for H.264 motion compensation: system architecture and hardware implementation," EURASIP J. Image and Video Proc., pp. 1-12, Dec
[21] Y. Li, Y. Qu, and Y. He, "Memory cache based motion compensation architecture for HDTV H.264/AVC decoder," in Proc. ISCAS, 2007, pp
[22] T.-D. Chuang, L.-H. Chang, T.-W. Chiu, Y.-H. Chen, and L.-G. Chen, "Bandwidth-efficient cache-based motion compensation architecture with DRAM-friendly data access control," in Proc. ICASSP, 2009, pp
[23] X. Chen, L. Peilin, Z. Jiayi, Z. Dajiang, and S. Goto, "Block-pipelining cache for motion compensation in high definition H.264/AVC video decoder," in Proc. ISCAS, pp
[24] G.-S. Yu and T. S. Chang, "Optimal Data Mapping for Motion Compensation in H.264 Video Decoding," in Proc. SiPS, 2007, pp
[25] L. C. Chiu and T. S. Chang, "A lossless embedded compression codec engine for HD video decoding," in Proc. VLSI-DAT, 2012, pp
[26] ovics, ViC-1 HEVC 4Kp120 Decoder, ProdBrief pdf, 2014

Pai-Tse Chiang received the B.S. and M.S. degrees in electronics engineering from National Chiao-Tung University, Hsinchu, Taiwan, in 2011 and 2013, respectively. His current research interests include digital signal processing, video coding and system-on-chip design.

Yi-Ching Ting received the B.S. and M.S. degrees in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2011 and 2013, respectively. He is an engineer with PixArt Imaging Inc., Hsinchu, Taiwan. His current research interests include image processing and multimedia applications.

Hsuan-Ku Chen received the B.S. and M.S. degrees in Electronics Engineering from National Chiao Tung University (NCTU), Hsinchu, Taiwan, R.O.C., in 2011 and 2013, respectively. He is an engineer with PixArt Imaging Inc., Hsinchu, Taiwan. His current research interests are video coding and digital integrated circuits.

Shiaw-Yu Jou received the B.S. and M.S. degrees in Electronics Engineering from National Chiao Tung University (NCTU), Hsinchu, Taiwan, R.O.C., in 2011 and 2013, respectively. He is an engineer with PixArt Imaging Inc., Hsinchu, Taiwan. His current research interests are image processing and digital integrated circuits.

I-Wen Chen received the B.S. and M.S. degrees in Electronics Engineering from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in 2012, and She is an engineer with PixArt Imaging Inc., Hsinchu, Taiwan. Her current research interests are image processing and digital integrated circuits.

Hang-Chiu Fang received the B.S. degree in Electrical Engineering from National Chung Hsing University, Taichung, Taiwan, R.O.C., in 2012, and the M.S. degree in Electronics Engineering from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in He is an engineer with ELan Inc., Hsinchu, Taiwan. His current research interests are video coding and digital integrated circuits.

Tian-Sheuan Chang (S'93-M'06-SM'07) received the B.S., M.S., and Ph.D. degrees in electronic engineering from National Chiao-Tung University (NCTU), Hsinchu, Taiwan, in 1993, 1995, and 1999, respectively. From 2000 to 2004, he was a Deputy Manager with Global Unichip Corporation, Hsinchu, Taiwan. In 2004, he joined the Department of Electronics Engineering, NCTU, where he is currently a Professor. In 2009, he was a visiting scholar at IMEC, Belgium. His current research interests include system-on-a-chip design, VLSI signal processing, and computer architecture. Dr. Chang received the Excellent Young Electrical Engineer from the Chinese Institute of Electrical Engineering in 2007, and the Outstanding Young Scholar from the Taiwan IC Design Society in He has been actively involved in many international conferences as an organizing committee or technical program committee member. He is currently an Editorial Board Member of IEEE Transactions on Circuits and Systems for Video Technology.


More information

A Highly Parallel and Scalable CABAC Decoder for Next Generation Video Coding

A Highly Parallel and Scalable CABAC Decoder for Next Generation Video Coding 8 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 47, NO. 1, JANUARY 2012 A Highly Parallel and Scalable CABAC Decoder for Next Generation Video Coding Vivienne Sze, Member, IEEE, and Anantha P. Chandrakasan,

More information

Visual Communication at Limited Colour Display Capability

Visual Communication at Limited Colour Display Capability Visual Communication at Limited Colour Display Capability Yan Lu, Wen Gao and Feng Wu Abstract: A novel scheme for visual communication by means of mobile devices with limited colour display capability

More information

FRAME RATE BLOCK SELECTION APPROACH BASED DIGITAL WATER MARKING FOR EFFICIENT VIDEO AUTHENTICATION USING NETWORK CONDITIONS

FRAME RATE BLOCK SELECTION APPROACH BASED DIGITAL WATER MARKING FOR EFFICIENT VIDEO AUTHENTICATION USING NETWORK CONDITIONS FRAME RATE BLOCK SELECTION APPROACH BASED DIGITAL WATER MARKING FOR EFFICIENT VIDEO AUTHENTICATION USING NETWORK CONDITIONS A. Kirthika 1 and A. Senthilkumar 2 1 Department of Electronics and Communication

More information

SCALABLE video coding (SVC) is currently being developed

SCALABLE video coding (SVC) is currently being developed IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 16, NO. 7, JULY 2006 889 Fast Mode Decision Algorithm for Inter-Frame Coding in Fully Scalable Video Coding He Li, Z. G. Li, Senior

More information

Multimedia Communications. Video compression

Multimedia Communications. Video compression Multimedia Communications Video compression Video compression Of all the different sources of data, video produces the largest amount of data There are some differences in our perception with regard to

More information

THE TRANSMISSION and storage of video are important

THE TRANSMISSION and storage of video are important 206 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 2, FEBRUARY 2011 Novel RD-Optimized VBSME with Matching Highly Data Re-Usable Hardware Architecture Xing Wen, Student Member,

More information

Hardware Decoding Architecture for H.264/AVC Digital Video Standard

Hardware Decoding Architecture for H.264/AVC Digital Video Standard Hardware Decoding Architecture for H.264/AVC Digital Video Standard Alexsandro C. Bonatto, Henrique A. Klein, Marcelo Negreiros, André B. Soares, Letícia V. Guimarães and Altamiro A. Susin Department of

More information

OL_H264e HDTV H.264/AVC Baseline Video Encoder Rev 1.0. General Description. Applications. Features

OL_H264e HDTV H.264/AVC Baseline Video Encoder Rev 1.0. General Description. Applications. Features OL_H264e HDTV H.264/AVC Baseline Video Encoder Rev 1.0 General Description Applications Features The OL_H264e core is a hardware implementation of the H.264 baseline video compression algorithm. The core

More information

HIGH Efficiency Video Coding (HEVC) version 1 was

HIGH Efficiency Video Coding (HEVC) version 1 was 1 An HEVC-based Screen Content Coding Scheme Bin Li and Jizheng Xu Abstract This document presents an efficient screen content coding scheme based on HEVC framework. The major techniques in the scheme

More information

Advanced Screen Content Coding Using Color Table and Index Map

Advanced Screen Content Coding Using Color Table and Index Map 1 Advanced Screen Content Coding Using Color Table and Index Map Zhan Ma, Wei Wang, Meng Xu, Haoping Yu Abstract This paper presents an advanced screen content coding solution using Color Table and Index

More information

Jun-Hao Zheng et al.: An Efficient VLSI Architecture for MC of AVS HDTV Decoder 371 ture for MC which contains a three-stage pipeline. The hardware ar

Jun-Hao Zheng et al.: An Efficient VLSI Architecture for MC of AVS HDTV Decoder 371 ture for MC which contains a three-stage pipeline. The hardware ar May 2006, Vol.21, No.3, pp.370 377 J. Comput. Sci. & Technol. An Efficient VLSI Architecture for Motion Compensation of AVS HDTV Decoder Jun-Hao Zheng 1;3 (ΨΞ ), Lei Deng 2 ( Π), Peng Zhang 1;3 (Φ ±),

More information

IN DIGITAL transmission systems, there are always scramblers

IN DIGITAL transmission systems, there are always scramblers 558 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 53, NO. 7, JULY 2006 Parallel Scrambler for High-Speed Applications Chih-Hsien Lin, Chih-Ning Chen, You-Jiun Wang, Ju-Yuan Hsiao,

More information

Final Report Time Optimization of HEVC Encoder over X86 Processors using SIMD. Spring 2013 Multimedia Processing EE5359

Final Report Time Optimization of HEVC Encoder over X86 Processors using SIMD. Spring 2013 Multimedia Processing EE5359 Final Report Time Optimization of HEVC Encoder over X86 Processors using SIMD Spring 2013 Multimedia Processing Advisor: Dr. K. R. Rao Department of Electrical Engineering University of Texas, Arlington

More information

THE USE OF forward error correction (FEC) in optical networks

THE USE OF forward error correction (FEC) in optical networks IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 52, NO. 8, AUGUST 2005 461 A High-Speed Low-Complexity Reed Solomon Decoder for Optical Communications Hanho Lee, Member, IEEE Abstract

More information

IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 19, NO. 3, MARCH GHEVC: An Efficient HEVC Decoder for Graphics Processing Units

IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 19, NO. 3, MARCH GHEVC: An Efficient HEVC Decoder for Graphics Processing Units IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 19, NO. 3, MARCH 2017 459 GHEVC: An Efficient HEVC Decoder for Graphics Processing Units Diego F. de Souza, Student Member, IEEE, Aleksandar Ilic, Member, IEEE, Nuno

More information

An Overview of Video Coding Algorithms

An Overview of Video Coding Algorithms An Overview of Video Coding Algorithms Prof. Ja-Ling Wu Department of Computer Science and Information Engineering National Taiwan University Video coding can be viewed as image compression with a temporal

More information

IMAGE SEGMENTATION APPROACH FOR REALIZING ZOOMABLE STREAMING HEVC VIDEO ZARNA PATEL. Presented to the Faculty of the Graduate School of

IMAGE SEGMENTATION APPROACH FOR REALIZING ZOOMABLE STREAMING HEVC VIDEO ZARNA PATEL. Presented to the Faculty of the Graduate School of IMAGE SEGMENTATION APPROACH FOR REALIZING ZOOMABLE STREAMING HEVC VIDEO by ZARNA PATEL Presented to the Faculty of the Graduate School of The University of Texas at Arlington in Partial Fulfillment of

More information

Multimedia Communications. Image and Video compression

Multimedia Communications. Image and Video compression Multimedia Communications Image and Video compression JPEG2000 JPEG2000: is based on wavelet decomposition two types of wavelet filters one similar to what discussed in Chapter 14 and the other one generates

More information

University of Bristol - Explore Bristol Research. Peer reviewed version. Link to published version (if available): /ISCAS.2005.

University of Bristol - Explore Bristol Research. Peer reviewed version. Link to published version (if available): /ISCAS.2005. Wang, D., Canagarajah, CN., & Bull, DR. (2005). S frame design for multiple description video coding. In IEEE International Symposium on Circuits and Systems (ISCAS) Kobe, Japan (Vol. 3, pp. 19 - ). Institute

More information

AN IMPROVED ERROR CONCEALMENT STRATEGY DRIVEN BY SCENE MOTION PROPERTIES FOR H.264/AVC DECODERS

AN IMPROVED ERROR CONCEALMENT STRATEGY DRIVEN BY SCENE MOTION PROPERTIES FOR H.264/AVC DECODERS AN IMPROVED ERROR CONCEALMENT STRATEGY DRIVEN BY SCENE MOTION PROPERTIES FOR H.264/AVC DECODERS Susanna Spinsante, Ennio Gambi, Franco Chiaraluce Dipartimento di Elettronica, Intelligenza artificiale e

More information

Parallel Implementation of Sample Adaptive Offset Filtering Block for Low-Power HEVC Chip. Luis A. Fernández Lara

Parallel Implementation of Sample Adaptive Offset Filtering Block for Low-Power HEVC Chip. Luis A. Fernández Lara Parallel Implementation of Sample Adaptive Offset Filtering Block for Low-Power HEVC Chip by Luis A. Fernández Lara B.S., Massachusetts Institute of Technology (2014) Submitted to the Department of Electrical

More information

Into the Depths: The Technical Details Behind AV1. Nathan Egge Mile High Video Workshop 2018 July 31, 2018

Into the Depths: The Technical Details Behind AV1. Nathan Egge Mile High Video Workshop 2018 July 31, 2018 Into the Depths: The Technical Details Behind AV1 Nathan Egge Mile High Video Workshop 2018 July 31, 2018 North America Internet Traffic 82% of Internet traffic by 2021 Cisco Study

More information

Introduction to Video Compression Techniques. Slides courtesy of Tay Vaughan Making Multimedia Work

Introduction to Video Compression Techniques. Slides courtesy of Tay Vaughan Making Multimedia Work Introduction to Video Compression Techniques Slides courtesy of Tay Vaughan Making Multimedia Work Agenda Video Compression Overview Motivation for creating standards What do the standards specify Brief

More information

The H.263+ Video Coding Standard: Complexity and Performance

The H.263+ Video Coding Standard: Complexity and Performance The H.263+ Video Coding Standard: Complexity and Performance Berna Erol (bernae@ee.ubc.ca), Michael Gallant (mikeg@ee.ubc.ca), Guy C t (guyc@ee.ubc.ca), and Faouzi Kossentini (faouzi@ee.ubc.ca) Department

More information

FAST SPATIAL AND TEMPORAL CORRELATION-BASED REFERENCE PICTURE SELECTION

FAST SPATIAL AND TEMPORAL CORRELATION-BASED REFERENCE PICTURE SELECTION FAST SPATIAL AND TEMPORAL CORRELATION-BASED REFERENCE PICTURE SELECTION 1 YONGTAE KIM, 2 JAE-GON KIM, and 3 HAECHUL CHOI 1, 3 Hanbat National University, Department of Multimedia Engineering 2 Korea Aerospace

More information

COMP 249 Advanced Distributed Systems Multimedia Networking. Video Compression Standards

COMP 249 Advanced Distributed Systems Multimedia Networking. Video Compression Standards COMP 9 Advanced Distributed Systems Multimedia Networking Video Compression Standards Kevin Jeffay Department of Computer Science University of North Carolina at Chapel Hill jeffay@cs.unc.edu September,

More information

LUT Optimization for Memory Based Computation using Modified OMS Technique

LUT Optimization for Memory Based Computation using Modified OMS Technique LUT Optimization for Memory Based Computation using Modified OMS Technique Indrajit Shankar Acharya & Ruhan Bevi Dept. of ECE, SRM University, Chennai, India E-mail : indrajitac123@gmail.com, ruhanmady@yahoo.co.in

More information

H.261: A Standard for VideoConferencing Applications. Nimrod Peleg Update: Nov. 2003

H.261: A Standard for VideoConferencing Applications. Nimrod Peleg Update: Nov. 2003 H.261: A Standard for VideoConferencing Applications Nimrod Peleg Update: Nov. 2003 ITU - Rec. H.261 Target (1990)... A Video compression standard developed to facilitate videoconferencing (and videophone)

More information

THE High Efficiency Video Coding (HEVC) standard is

THE High Efficiency Video Coding (HEVC) standard is IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 22, NO. 12, DECEMBER 2012 1649 Overview of the High Efficiency Video Coding (HEVC) Standard Gary J. Sullivan, Fellow, IEEE, Jens-Rainer

More information

Authors: Glenn Van Wallendael, Sebastiaan Van Leuven, Jan De Cock, Peter Lambert, Joeri Barbarien, Adrian Munteanu, and Rik Van de Walle

Authors: Glenn Van Wallendael, Sebastiaan Van Leuven, Jan De Cock, Peter Lambert, Joeri Barbarien, Adrian Munteanu, and Rik Van de Walle biblio.ugent.be The UGent Institutional Repository is the electronic archiving and dissemination platform for all UGent research publications. Ghent University has implemented a mandate stipulating that

More information

Project Interim Report

Project Interim Report Project Interim Report Coding Efficiency and Computational Complexity of Video Coding Standards-Including High Efficiency Video Coding (HEVC) Spring 2014 Multimedia Processing EE 5359 Advisor: Dr. K. R.

More information

Implementation of Memory Based Multiplication Using Micro wind Software

Implementation of Memory Based Multiplication Using Micro wind Software Implementation of Memory Based Multiplication Using Micro wind Software U.Palani 1, M.Sujith 2,P.Pugazhendiran 3 1 IFET College of Engineering, Department of Information Technology, Villupuram 2,3 IFET

More information

An Efficient Low Bit-Rate Video-Coding Algorithm Focusing on Moving Regions

An Efficient Low Bit-Rate Video-Coding Algorithm Focusing on Moving Regions 1128 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 11, NO. 10, OCTOBER 2001 An Efficient Low Bit-Rate Video-Coding Algorithm Focusing on Moving Regions Kwok-Wai Wong, Kin-Man Lam,

More information

WHITE PAPER. Perspectives and Challenges for HEVC Encoding Solutions. Xavier DUCLOUX, December >>

WHITE PAPER. Perspectives and Challenges for HEVC Encoding Solutions. Xavier DUCLOUX, December >> Perspectives and Challenges for HEVC Encoding Solutions Xavier DUCLOUX, December 2013 >> www.thomson-networks.com 1. INTRODUCTION... 3 2. HEVC STATUS... 3 2.1 HEVC STANDARDIZATION... 3 2.2 HEVC TOOL-BOX...

More information

On Complexity Modeling of H.264/AVC Video Decoding and Its Application for Energy Efficient Decoding

On Complexity Modeling of H.264/AVC Video Decoding and Its Application for Energy Efficient Decoding 1240 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 13, NO. 6, DECEMBER 2011 On Complexity Modeling of H.264/AVC Video Decoding and Its Application for Energy Efficient Decoding Zhan Ma, Student Member, IEEE, HaoHu,

More information

ISSCC 2006 / SESSION 14 / BASEBAND AND CHANNEL PROCESSING / 14.6

ISSCC 2006 / SESSION 14 / BASEBAND AND CHANNEL PROCESSING / 14.6 ISSCC 2006 / SESSION 14 / BASEBAND AND CHANNEL PROSSING / 14.6 14.6 A 1.8V 250mW COFDM Baseband Receiver for DVB-T/H Applications Lei-Fone Chen, Yuan Chen, Lu-Chung Chien, Ying-Hao Ma, Chia-Hao Lee, Yu-Wei

More information

MPEG has been established as an international standard

MPEG has been established as an international standard 1100 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 9, NO. 7, OCTOBER 1999 Fast Extraction of Spatially Reduced Image Sequences from MPEG-2 Compressed Video Junehwa Song, Member,

More information

OL_H264MCLD Multi-Channel HDTV H.264/AVC Limited Baseline Video Decoder V1.0. General Description. Applications. Features

OL_H264MCLD Multi-Channel HDTV H.264/AVC Limited Baseline Video Decoder V1.0. General Description. Applications. Features OL_H264MCLD Multi-Channel HDTV H.264/AVC Limited Baseline Video Decoder V1.0 General Description Applications Features The OL_H264MCLD core is a hardware implementation of the H.264 baseline video compression

More information

Fast Mode Decision Algorithm for Intra prediction in H.264/AVC Video Coding

Fast Mode Decision Algorithm for Intra prediction in H.264/AVC Video Coding 356 IJCSNS International Journal of Computer Science and Network Security, VOL.7 No.1, January 27 Fast Mode Decision Algorithm for Intra prediction in H.264/AVC Video Coding Abderrahmane Elyousfi 12, Ahmed

More information

PHASE-LOCKED loops (PLLs) are widely used in many

PHASE-LOCKED loops (PLLs) are widely used in many IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 52, NO. 5, MAY 2005 233 A Portable Digitally Controlled Oscillator Using Novel Varactors Pao-Lung Chen, Ching-Che Chung, and Chen-Yi Lee

More information

Study of AVS China Part 7 for Mobile Applications. By Jay Mehta EE 5359 Multimedia Processing Spring 2010

Study of AVS China Part 7 for Mobile Applications. By Jay Mehta EE 5359 Multimedia Processing Spring 2010 Study of AVS China Part 7 for Mobile Applications By Jay Mehta EE 5359 Multimedia Processing Spring 2010 1 Contents Parts and profiles of AVS Standard Introduction to Audio Video Standard for Mobile Applications

More information

Memory efficient Distributed architecture LUT Design using Unified Architecture

Memory efficient Distributed architecture LUT Design using Unified Architecture Research Article Memory efficient Distributed architecture LUT Design using Unified Architecture Authors: 1 S.M.L.V.K. Durga, 2 N.S. Govind. Address for Correspondence: 1 M.Tech II Year, ECE Dept., ASR

More information

Standardized Extensions of High Efficiency Video Coding (HEVC)

Standardized Extensions of High Efficiency Video Coding (HEVC) MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Standardized Extensions of High Efficiency Video Coding (HEVC) Sullivan, G.J.; Boyce, J.M.; Chen, Y.; Ohm, J-R.; Segall, C.A.: Vetro, A. TR2013-105

More information

International Journal of Engineering Trends and Technology (IJETT) - Volume4 Issue8- August 2013

International Journal of Engineering Trends and Technology (IJETT) - Volume4 Issue8- August 2013 International Journal of Engineering Trends and Technology (IJETT) - Volume4 Issue8- August 2013 Design and Implementation of an Enhanced LUT System in Security Based Computation dama.dhanalakshmi 1, K.Annapurna

More information

Design and Implementation of Partial Reconfigurable Fir Filter Using Distributed Arithmetic Architecture

Design and Implementation of Partial Reconfigurable Fir Filter Using Distributed Arithmetic Architecture Design and Implementation of Partial Reconfigurable Fir Filter Using Distributed Arithmetic Architecture Vinaykumar Bagali 1, Deepika S Karishankari 2 1 Asst Prof, Electrical and Electronics Dept, BLDEA

More information