206 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 2, FEBRUARY 2011

Novel RD-Optimized VBSME with Matching Highly Data Re-Usable Hardware Architecture

Xing Wen, Student Member, IEEE, Oscar C. Au, Senior Member, IEEE, Jiang Xu, Member, IEEE, Lu Fang, Student Member, IEEE, Run Cha, Student Member, IEEE, and Jiali Li, Student Member, IEEE

Abstract: To achieve superior performance, rate-distortion optimized motion estimation (ME) for variable block size (RDO-VBSME) is often used in state-of-the-art video coding systems such as the H.264 JM software. However, the complexity of RDO-VBSME is very high for both software and hardware implementations. In this paper, we propose a hardware-friendly ME algorithm called RDOMFS with a novel hardware-friendly rate-distortion (RD)-like cost function and a hardware-friendly modified motion vector predictor. Simulation results suggest that the proposed RDOMFS can achieve essentially the same RD performance as RDO-VBSME in JM. We also propose a matching hardware architecture with a novel Smart Snake Scanning order which can achieve a very high data re-use ratio and data throughput. It is also reconfigurable because it can achieve a variable data re-use ratio and can process variable frame sizes. The design is implemented with TSMC 0.18 µm CMOS technology and costs 103k gates. At a clock frequency of 63 MHz, the architecture achieves real-time RDO-VBSME at 30 frames/s. At a maximum clock frequency of 250 MHz, it can process at 30 frames/s.

Index Terms: Data re-use, hardware, motion estimation, scanning order, software-hardware co-design, VHDL.

I. Introduction

THE TRANSMISSION and storage of video are important for many applications. Raw video sequences, however, are well known to be huge, so video compression is needed. Over the years, many video coding standards such as MPEG-1/2/4 and ITU-T H.261/263/264 have been developed to achieve efficient compression.
They achieve compression by exploiting temporal redundancy using motion estimation (ME) and compensation, spatial redundancy using the discrete cosine transform, statistical redundancy using entropy coding, and perceptual irrelevancy using quantization. This paper is about efficient hardware-software co-design of rate-distortion optimized (RDO) ME to achieve good rate-distortion performance and a real-time implementation with high data throughput, regular data flow, good parallelism, and a high degree of memory re-use.

Manuscript received April 8, 2010; revised July 27, 2010; accepted September 12, Date of publication January 17, 2011; date of current version March 2, This work was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region, China, under GRF Projects and This paper was recommended by Associate Editor L.-P. Chau. The authors are with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong (wxxab@ust.hk; eeau@ust.hk; eexu@ust.hk; fanglu@ust.hk; charun@ust.hk; jiali@ust.hk). Color versions of one or more of the figures in this paper are available online at Digital Object Identifier /TCSVT

It is well known that ME is computationally very expensive. ME consists mainly of two parts: integer ME (IME) and fractional ME. Runtime profiling of the H.264 JM encoder reveals that IME consumes close to 60% of the total encoder time, and up to 90% when fractional ME is included. Thus efficient ME algorithms and hardware architectures for IME are needed. This paper is about an IME algorithm and its hardware implementation.

In IME, the current frame is divided into non-overlapping macroblocks (MB) of size N × N (N = 16). For each MB, a search window is defined around a point (e.g., the collocated point or some predicted location) in the reference frame. In this paper, we assume the search range is [−P, P) in both the horizontal and vertical directions.
Each point in the search window corresponds to a candidate MB to predict the current MB. A distortion measure is defined to measure the similarity between the candidate MB and the current MB. A search is performed within the search window for the best matched candidate MB with maximum similarity. The displacement of the best matched MB from the current MB is the motion vector (MV). There are many common mismatch measures such as the sum of absolute difference (SAD), sum of squared difference (SSD), and sum of absolute transformed difference (SATD). SAD is the most common due to its simplicity and effectiveness

$$\mathrm{SAD}_{k,l}(m,n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| X_t(k+i,\, l+j) - X_{t-1}(k+m+i,\, l+n+j) \right| \tag{1}$$

where (m, n) is the motion vector with −P ≤ m, n < P, X_t(i, j) and X_{t−1}(i, j) are the pixel values at location (i, j) in the current frame at time t and the reference frame at time t − 1, respectively, and (k, l) is the location of the current block in the current frame. SAD computation is very regular and is suitable for efficient hardware implementation. Most existing hardware ME architectures are based on SAD.

In recent years, an alternative measure called the rate-distortion (RD) cost function has become increasingly popular. It was first introduced by Everett in 1963 [1] with the general form being

$$\mathrm{RDCost} = D + \lambda R \tag{2}$$

where D is the distortion such as SSD, SATD, or SAD, R is the associated bit rate (e.g., those of MV and/or those of the residue), and λ is the Lagrangian multiplier.
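As a rough illustration of the SAD measure in (1) and a brute-force search over the range [−P, P), the following Python sketch may help. It is not the authors' implementation: the function names, the nested-list frame layout, and the rule of skipping candidates that fall outside the reference frame are assumptions made for this example only.

```python
def sad(cur, ref, k, l, m, n, N):
    """SAD of equation (1): current block at (k, l) vs. reference block displaced by (m, n)."""
    return sum(
        abs(cur[k + i][l + j] - ref[k + m + i][l + n + j])
        for i in range(N)
        for j in range(N)
    )


def full_search_sad(cur, ref, k, l, N, P):
    """Return (best_sad, (m, n)) over the search range [-P, P) in both directions."""
    rows, cols = len(ref), len(ref[0])
    best = None
    for m in range(-P, P):
        for n in range(-P, P):
            # Assumption: candidates that fall outside the reference frame are skipped.
            if k + m < 0 or l + n < 0 or k + m + N > rows or l + n + N > cols:
                continue
            cost = sad(cur, ref, k, l, m, n, N)
            if best is None or cost < best[0]:
                best = (cost, (m, n))
    return best


# Toy frames (hypothetical data): the current block at (4, 4) is a copy of the
# reference block displaced by (1, -2), so that displacement gives SAD = 0.
ref = [[3 * r * r + 5 * c + r * c for c in range(12)] for r in range(12)]
cur = [[0] * 12 for _ in range(12)]
for i in range(4):
    for j in range(4):
        cur[4 + i][4 + j] = ref[5 + i][2 + j]

print(full_search_sad(cur, ref, k=4, l=4, N=4, P=2))
```

The doubly nested loop over (m, n) visits every candidate, which is what makes the data flow regular, and also what makes a software full search expensive.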

In H.264, λ is different for each of the 51 values of the quantization parameter Qp. When RD cost is used, the ME is called rate-distortion optimized or RDO. However, it is difficult to implement the RD cost in hardware because the RD cost computation requires floating-point multiplication and/or a large hardware cost for the lookup table.

A common ME method is full search (FS), which examines all points in the search window in a brute-force manner. It is zero-biased, with its search center collocated with the current MB. FS can achieve the global minimum and thus good visual quality, but requires much computation in a software implementation. FS can be efficiently implemented in hardware to achieve good data throughput because its data flow is regular and is suitable for pipelining. Also, data can be re-used between neighboring search locations.

Besides FS, many fast ME (FME) algorithms have been developed. Most FMEs perform some search around a search center which may be zero-biased or MVP-biased. The zero-biased search center is the (0, 0) motion vector. Some common zero-biased FMEs include NTSS [2], diamond search [3], FTS [4], and cross search [5]. An MVP-biased search center is chosen from a number of MVPs according to certain criteria. The MVPs are typically obtained by using the MVs of spatially and temporally neighboring blocks. Some common MVP-biased FMEs include PMVFAST [6], UMHexagonS [7], and EPZS [8]. Often some local search is performed around the search center in the FMEs, leading to a local minimum (as opposed to the global minimum achieved by FS). However, it is often difficult to implement MVP-biased ME in hardware because the consideration of multiple MVPs and the often irregular local search patterns can easily break the hardware pipeline, leading to low hardware efficiency, low data re-use, and high memory access.
Early video coding standards such as MPEG-1, MPEG-2, H.263 [9], and MPEG-4 [10] use a macroblock as a unit and perform fixed block-size (FBS) motion estimation. The latest H.264 [11] allows an MB to be partitioned into seven kinds of sub-blocks as shown in Fig. 1 (16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, and 4 × 4), each with its own motion vector, and performs variable block-size (VBS) ME for all possible sub-blocks. VBSME allows different MVs for different sub-blocks and thus can achieve better matching for all sub-blocks and higher coding efficiency than FBSME. VBSME is especially useful for MBs containing multiple objects, each with possibly different motion. It can also be useful for MBs with rotation and even deformation. While VBSME has good RD performance compared with FBSME, it has a huge computational requirement and irregular memory access, making efficient hardware implementation hard. This paper is about efficient VBSME.

Fig. 1. Variable block size in H.264/AVC.

While early ME algorithms tend to use SAD due to its low complexity, recent ME algorithms tend to use RD cost due to its superior RD performance. We use -SAD and -RD to distinguish two different versions of any algorithm: the one using SAD and the one using RD cost, respectively. For example, FS-SAD is FS using SAD and FS-RD is FS using RD cost. Similarly, we use -zero and -mvp to mean the zero-biased and MVP-biased versions, respectively. And we use -var and -fix to mean VBS and FBS, respectively. For example, FS-SAD-fix-zero is zero-biased FBS FS using SAD, and FS-RD-var-mvp is MVP-biased VBS FS using RD cost.

Among existing ME architectures, some are for FME but most are for FS. Some of those for FS are MVP-biased and some use RD cost, but most are for zero-biased FS using SAD. While early ME architectures tend to do FBS ME, the new ones are predominantly for VBS ME. A good overview of ME architectures can be found in [12].
In [13], a 1-D systolic array [14], [15] architecture with 16 processing elements (PE) for full-search VBSME (FSVBSME) was proposed. The authors in [16]-[18] proposed three 2-D systolic array architectures with 256 PEs for FS-SAD-var-zero, which has lower RD performance than the FS-RD-var-mvp in the H.264 JM software. A few such as [12] used FS-RD-var-mvp, but they needed a significant amount of extra on-chip memory to store all the MVs required to generate the MVPs. All these architectures incur redundant loading inside the search window, leading to huge latency and considerable power consumption. Here we use redundant loading to mean data being loaded more than once.

In this paper, we propose a novel RDO-like MVP-biased VBS ME algorithm called RDOMFS with a matching reconfigurable architecture. RDOMFS, introduced in Section II with simulation results, uses a hardware-friendly single MVP (SMVP) and a hardware-friendly RD-like cost function to achieve essentially the same coding efficiency as FS-RD-var-mvp. The matching architecture, introduced in Section III, uses a 2-D systolic array to implement the proposed RDOMFS. It re-uses MVPs in the current frame to eliminate the need to store the MVs in the on-chip memory. It uses a novel scanning order in the search window to minimize redundant loading and to achieve different tradeoffs between power, data throughput, and data re-use ratio. Implementation results of the proposed architecture and comparisons with others are shown in Section IV. A conclusion is given in Section V.

II. Proposed Motion Estimation Algorithm

In this section, we propose a novel hardware-oriented ME algorithm called RD optimized single-MVP-biased FS (RDOMFS), which has a hardware-friendly single-MVP bias and uses a hardware-friendly RD cost function.
As will be shown in Section II-C, the proposed RDOMFS can achieve RD performance similar to that of FS-RD-var-mvp (the default motion estimation algorithm in the H.264 JM software) and better performance than FS-SAD-var-zero.

Fig. 2. (a) Median MVP definition in H.264 for various sub-block sizes. (b) Spatially varying definitions of left, top, and top-right sub-blocks. (c) Proposed unified SMVP for all sub-blocks.

Fig. 3. Original and modified lambda values.

A. Unified Single MVP (SMVP)

Nowadays, many state-of-the-art FME methods, such as [6]-[8], are used in the H.264 JM software to achieve good RD performance with low complexity. Most of these FMEs are MVP-biased in the sense that they perform a local search such as small diamond search [6] or hexagonal search [7] around a search center chosen from a few MVPs. Typically they compute the similarity measure or cost function for a few highly probable MVPs, select one as the best, and use it as the center for the local search. It is well known that a local search yields a local minimum which may or may not be the global minimum. But the use of multiple MVPs to find the search center, as opposed to a single MVP, helps to increase the probability that the local minimum is the global minimum.

Typically, the MVPs used in these FMEs include temporally and spatially neighboring MVs. The temporally neighboring MV is the MV of the collocated block in the reference frame, which requires the storage of the MVs of the reference frames. The spatially neighboring MVs include the MVs of the left, top, and top-right sub-blocks. Functions of these three MVs such as the median can also be used. Note that, in H.264, the definitions of the left, top, and top-right sub-blocks are different for different sub-block sizes and can be different for sub-blocks of the same size at different locations. Thus, a large amount of memory is required to store the MVs for sub-blocks of all sizes. Some examples are shown in Fig. 2(a) for sub-block sizes of 4 and 16. The data flow to compute the median is also irregular.
In particular, the use of multiple MVPs tends to be inefficient for hardware implementation for four reasons. First, the spatially varying definitions of left, top, and top-right sub-blocks, as shown in Fig. 2(b), result in irregular data flow which makes hardware implementation inefficient. Second, the hardware utilization would be low during the examination of the multiple MVPs because the random nature of the final chosen MVP makes a pipelined implementation hard. Third, recall that the multiple MVPs are different points in the search area. Thus, the reference pixels associated with the MVPs need to be loaded separately and can hardly be re-used, which causes high memory bandwidth and high latency. Fourth, a great number of past MVs need to be saved on-chip for a hardware implementation of multiple MVPs. This leads to significant on-chip memory requirement and cost.

For efficient hardware implementation, we choose to use an SMVP in the proposed RDOMFS rather than multiple MVPs. We use a unified SMVP definition for all sub-blocks of all sizes within an MB. Using the symbol MV_{16×16} to denote the best MV of an MB for the 16 × 16 sub-block size, we define our SMVP as

$$\mathrm{SMVP} = \mathrm{Median}\left(MV^{\mathrm{left}}_{16\times16},\; MV^{\mathrm{top}}_{16\times16},\; MV^{\mathrm{top\text{-}right}}_{16\times16}\right) \tag{3}$$

which is the median of the MVs of the MBs on the left, top, and top-right, as shown in Fig. 2(c). In other words, all sub-blocks of all sizes at any location within the MB use the same SMVP. Although the use of one MVP as opposed to multiple MVPs tends to result in a lower probability that the local minimum is the global minimum, our experiments in Section II-C suggest that the performance drop is not significant. Most importantly, our SMVP can address the four problems mentioned above. First, even though our SMVP is based on spatially neighboring MVs, the identical SMVP definition for all sub-blocks of all sizes makes the data flow regular. Second, the MVP selection is deterministic so that the data flow becomes regular and a pipelined design can be used.
In fact, there is no longer an MVP selection stage as there is only one candidate. Third, latency is much lower due to the absence of the MVP selection stage. Fourth, our choice of SMVP requires the storage of far fewer past MVs, and our MVP re-use scheme, as will be explained later, can further reduce the on-chip memory requirement. With the SMVP as search center, RDOMFS performs a local full search for all sub-block sizes, which is highly regular and is suitable for hardware implementation.

B. Distortion Measure

Recall the tradeoff between two common distortion measures: SAD and RD. SAD computation is very regular and is suitable for efficient hardware implementation. Most existing ME hardware architectures are designed based on SAD. On the

other hand, RD is defined as

$$RD(MV) = SAD(MV) + \lambda(Qp)\, R(MV_{\mathrm{median}} - MV) \tag{4}$$

where MV is the candidate MV, λ is the Lagrange multiplier which changes with the quantization parameter Qp, MV_median is the median motion vector predictor for MV coding in H.264, and R is the bit rate to encode the motion vector difference. In the H.264 JM software, the λ is λ_ME defined as

$$\lambda_{ME} = \sqrt{\lambda_{\mathrm{mode}}} \tag{5}$$

where λ_mode is

$$\lambda_{\mathrm{mode},I,P} = 0.85 \times 2^{(Qp-12)/3} \tag{6}$$

for I-block and P-block and is

$$\lambda_{\mathrm{mode},B} = \max\left(2,\, \min\left(4,\, \frac{Qp-12}{6}\right)\right) \times \lambda_{\mathrm{mode},I,P} \tag{7}$$

for B-block. While RD can give significantly better RD performance than SAD, it is hard to design efficient hardware for RD for at least two reasons. First, RD computation requires floating-point operations for the multiplication of λ and R, which is time and resource consuming. If this is to be relieved by using lookup tables, it would require a huge chip area for the lookup tables. Second, the data flow in the computation of MV_median is irregular and requires a large amount of on-chip memory to store the required past MVs, as noted before.

In RDOMFS, we wish to approach the performance of RD and the hardware-friendly nature of SAD. Here, we propose a novel hardware-friendly RD-like cost function RD_smvp, defined as

$$RD_{smvp}(MV) = SAD(MV) + \lambda_{smvp}(Qp)\, R(\mathrm{SMVP} - MV) \tag{8}$$

where the irregular MV_median is replaced by our regular SMVP as defined in (3), and the floating-point λ is replaced by a hardware-friendly power-of-2 λ_smvp

$$\lambda_{smvp} = 2^{n}; \qquad n = \left\lfloor \frac{\ln \lambda_{ME}}{\ln 2} \right\rfloor. \tag{9}$$

The proposed RD_smvp can address the problems mentioned above. First, with our λ_smvp, the multiplication of λ_smvp and R is simply a left or right shift of R by n bits and can be easily implemented in hardware. Fig. 3 shows the λ_smvp and original λ values for I, P, and B frames.
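The combination of (3), (8), and (9) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the median in (3) is taken component-wise (as in H.264 median prediction), the rate R is modelled as the signed Exp-Golomb code length of each MV-difference component, and λ_ME is passed in as a plain number rather than derived from Qp. All function names are hypothetical.

```python
import math


def median3(a, b, c):
    """Median of three scalars."""
    return max(min(a, b), min(max(a, b), c))


def smvp(mv_left, mv_top, mv_top_right):
    """Equation (3), assuming a component-wise median of the three 16x16 MVs."""
    return (
        median3(mv_left[0], mv_top[0], mv_top_right[0]),
        median3(mv_left[1], mv_top[1], mv_top_right[1]),
    )


def se_golomb_bits(v):
    """Assumed rate model: length in bits of the signed Exp-Golomb code of v."""
    code_num = 2 * abs(v) - (1 if v > 0 else 0)
    return 2 * ((code_num + 1).bit_length() - 1) + 1


def lambda_shift(lambda_me):
    """Equation (9): n = floor(log2(lambda_ME)), so that lambda_smvp = 2**n."""
    return math.floor(math.log2(lambda_me))


def rd_smvp(sad_value, mv, predictor, lambda_me):
    """Equation (8) with the multiplication replaced by a shift of R by n bits."""
    rate = se_golomb_bits(predictor[0] - mv[0]) + se_golomb_bits(predictor[1] - mv[1])
    n = lambda_shift(lambda_me)
    weighted_rate = rate << n if n >= 0 else rate >> -n
    return sad_value + weighted_rate


predictor = smvp((1, 5), (3, -2), (2, 0))
print(predictor, rd_smvp(100, (3, -1), predictor, lambda_me=5.3))
```

Because n is an integer, the weighting needs no multiplier: a non-negative n is a left shift and a negative n (λ_ME below 1) is a right shift, which truncates the rate term toward zero.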
While the approximation error between λ and λ_smvp can degrade the performance, our simulation results in Section II-C suggest that the performance degradation is minimal. Second, the identical SMVP definition for all sub-blocks of all sizes makes the data flow in the computation of RD_smvp regular, and the amount of required on-chip memory is greatly decreased, if not eliminated.

C. Simulation Results

Experiments are performed to compare the RD performance of the proposed RDOMFS with three algorithms: FS-RD-var-mvp, FS-SAD-var-zero, and UMHexagonS [7]. FS-RD-var-mvp is the default ME method in the H.264 JM software and should have very good, if not the best, RD performance in spite of its high computational complexity. UMHexagonS is one of the state-of-the-art multiple-MVP FMEs included in JM and should have slightly lower, if not similar, RD performance compared with FS-RD-var-mvp but at a significantly lower computational complexity. FS-SAD-var-zero is included in the experiment because most existing ME architectures are based on it. FS-SAD-var-zero should have lower RD performance than FS-RD-var-mvp.

The experiments are done on the JM14.1 reference software with various search ranges. Many sequences with different resolutions and motion levels are tested. The PSNR is shown against the bit rate for six challenging sequences, three being CIF and three being 720P (1280 × 720), all at 30 frames/s. In Fig. 4, we show the RD curves for three out of the six test sequences. The detailed values are shown in Tables I and II. Some 1080P sequences are also tested, but not shown due to limited space. (Some partial results for 1080P are shown in Table III.) Using FS-RD-var-mvp as the reference, the corresponding BD-PSNR and BD-bitrate [19], [20] are shown in Table I. Similarly, BD-PSNR and BD-bitrate are computed in Table II using RDOMFS with SR = 32 as reference. (A method with positive BD-PSNR and negative BD-bitrate has better RD performance than the reference.)
The CIF sequences are Foreman, Soccer, and Mobile. The 720P sequences are Jets, Raven, and Crew. In the figures, we use SR = n to mean a search range of [−n/2, n/2). A wide range of search range (SR) values, namely 4, 8, 16, 32, 64, and 128, are used in the simulation, though not all are shown due to limited space. We study the effect of SR on the RD performance because a smaller SR has many advantages including lower latency, lower memory bandwidth, and lower power consumption. But a smaller SR may have the disadvantage of lower RD performance due to poor motion compensation as a result of out-of-range motion. Thus, we want to study this tradeoff for the proposed RDOMFS.

In Table I and Fig. 4, the RD performance of UMHexagonS is slightly lower than that of FS-RD-var-mvp, while FS-SAD-var-zero is significantly lower than both FS-RD-var-mvp and UMHexagonS, as expected. The proposed RDOMFS manages to achieve RD performance similar to that of FS-RD-var-mvp in spite of the hardware-friendly modifications: the SMVP and the RD-like cost function. In general, SR = 4 is too small, as all methods with SR = 4 are found to have a considerable, if not significant, RD drop compared with SR = 64. For FS-RD-var-mvp and UMHexagonS, RD performance is almost the same for the rest of the SR values. For FS-SAD-var-zero, which is used in most existing architecture designs, the RD performance at SR = 64 is significantly worse than that of the other three methods at SR = 64, and a considerable RD drop is observed as SR decreases. For the proposed RDOMFS, RD performance is very similar for SR = 16, 32, 64, and 128, but some drop is observed for SR = 8. We will choose SR = 32 for RDOMFS in the hardware implementation in the next section.

RDOMFS contains mainly two simplification tools: the hardware-friendly RD function (tool1) and the use of the SMVP search center (tool2). Here, we do a partial experiment to investigate the effect of the two tools using two 1080P test sequences, Traffic and ParkJoy, and the results are shown in

Fig. 4. RD performance of FS-RD-var-mvp, FS-SAD-var-zero, and RDOMFS for three typical test sequences with various SR. (a) Foreman. (b) Crew. (c) Jets.

TABLE I
Performance of FS-RD-var-mvp, FS-SAD-var-zero, UMHexagonS, and RDOMFS

Methods: FS-RD-var-mvp (SR=32), FS-SAD-var-zero (SR=32), RDOMFS (SR=32), UMHexagonS (SR=32)
Columns per method: Bitrate (kbit/s), PSNR (dB); rows per sequence: Qp, BD-PSNR (dB), BD-bitrate

Foreman (CIF at 30 frames/s): BD-PSNR (dB) BD-bitrate % 1.12% 1.95%
Soccer (CIF at 30 frames/s): BD-PSNR (dB) BD-bitrate % 0.4% 5.63%
Mobile (CIF at 30 frames/s): BD-PSNR (dB) BD-bitrate % 2.72% 0.07%
Jets (720P at 30 frames/s): BD-PSNR (dB) BD-bitrate % 4.75% 14.48%
Raven (720P at 30 frames/s): BD-PSNR (dB) BD-bitrate % 6.55% 6.77%
Crew (720P at 30 frames/s): BD-PSNR (dB) BD-bitrate % 0.17% 0.83%

TABLE II
Performance of RDOMFS with Various SR

Methods: RDOMFS (SR=32), RDOMFS (SR=16), RDOMFS (SR=8), RDOMFS (SR=4)
Columns per method: Bitrate (kbit/s), PSNR (dB); rows per sequence: Qp, BD-PSNR (dB), BD-bitrate

Foreman (CIF at 30 frames/s): BD-PSNR (dB) BD-bitrate % 1.01% 4.40%
Soccer (CIF at 30 frames/s): BD-PSNR (dB) BD-bitrate % 4.8% 21.01%
Mobile (CIF at 30 frames/s): BD-PSNR (dB) BD-bitrate % 0.12% 1.00%
Jets (720P at 30 frames/s): BD-PSNR (dB) BD-bitrate % 6.95% 14.48%
Raven (720P at 30 frames/s): BD-PSNR (dB) BD-bitrate % 8.06% 9.82%
Crew (720P at 30 frames/s): BD-PSNR (dB) BD-bitrate % 0.15% 0.27%

TABLE III
Performance of RDOMFS

Methods: FS-RD-var-mvp (SR = 128), UMHexagonS (SR = 128), RDOMFS1 (SR = 32), RDOMFS2 (SR = 32), RDOMFS (SR = 32)
Columns per method: Bitrate (kbit/s), PSNR (dB); rows per sequence: Qp, BD-PSNR (dB), BD-bitrate

Traffic (1080 at 50 frames/s): BD-PSNR (dB) BD-bitrate % 0.48% 1.10% 0.77%
ParkJoy (1080 at 30 frames/s): BD-PSNR (dB) BD-bitrate % 0.16% 0.07% 0.06%

Table III. RDOMFS1 is RDOMFS with tool1 but not tool2. RDOMFS2 is RDOMFS with tool2 but not tool1. RDOMFS, RDOMFS1, and RDOMFS2 are simulated with SR = 32. For the sake of comparison, we also simulate FS-RD-var-mvp and UMHexagonS, both with SR = 128. From Table III, we can observe that RDOMFS with SR = 32 has very similar BD-PSNR and BD-bitrate to FS-RD-var-mvp with SR = 128. Both RDOMFS1 and RDOMFS2 have performance similar to that of FS-RD-var-mvp, which verifies that the performance degradation due to the two simplification tools in RDOMFS is rather negligible.

III. Proposed Reconfigurable ME Architecture for RDOMFS

In this section, we propose a reconfigurable architecture for RDOMFS based on a 2-D systolic PE array. Recall that RDOMFS is much more regular than FS-RD-var-mvp due to the use of the SMVP and the hardware-friendly RD-like cost function. In terms of regularity, RDOMFS is slightly worse than FS-SAD-var-zero due to the need to compute the product of λ_smvp and R and to generate the SMVP.

A straightforward hardware design for RDOMFS would include several major components: a 2-D systolic PE array with one PE to process one pixel, an adder tree to calculate the 41 possible SADs for all the possible sub-blocks, and on-chip or off-chip memory to store all the past MVs required for the computation of the SMVP. For the local FS, the common scanning order is the Raster Scan order, which can achieve a good data re-use ratio.

Such a design typically has several problems. First, it is not reconfigurable and cannot achieve a different data re-use ratio without significant hardware changes. Second, while Raster Scan can give a good data re-use ratio, it does not fully exploit the potential for data re-use. It has good data re-use in the horizontal direction, but not in the vertical direction. Ineffective data re-use results in high power consumption, more on-chip memory, and high latency.
Third, past MVs need to be stored in on-chip or off-chip memory for the calculation of the SMVP in RDOMFS. Although the memory requirement of the SMVP is already rather small with our use of MV^left_{16×16}, MV^top_{16×16}, and MV^top-right_{16×16}, loading the required MVs would still cause considerable latency and power consumption.

While the straightforward design can achieve good performance, we seek to develop a novel design to address these three issues. First, it should be reconfigurable. Second, it should achieve a higher data re-use ratio, especially in the vertical direction. Third, it should not store past MVs. In this section, we present a novel architecture with several special features: a novel Smart Snake (SS) scanning order instead of Raster Scan, special hardware to achieve different data re-use ratios and to avoid redundant data loading, and a multi-resolution MVP re-use scheme based on SS to avoid the storage of past MVs.

A. System Overview

The top-level block diagram of the proposed architecture is shown in Fig. 5. It contains a 2-D PE array with one PE to compute the SAD for one pixel. The pixel-wise SADs are combined in the 2-D adder tree (2DAT) to compute the 41 possible SADs for sub-blocks of different sizes. The MVs of past MBs are propagated in the adaptive shift register array (ASRA) and are used to compute the SMVP, which in turn is used to compute the MV cost. Finally, RD_smvp is computed by adding the product of λ_smvp and the MV cost to the SAD, and the best sub-block combination with its corresponding best MV is selected.

The 2-D PE array contains 256 PEs. Each PE stores a pixel of the current MB. Reference pixels are propagated into the PE array to calculate the SAD. Conceptually, the PE array has 16 sub-arrays, each with 4 × 4 PEs corresponding to a 4 × 4 sub-block. A reconfigurable register array (RRA) is introduced to help achieve reconfigurable capability and a higher data re-use ratio.
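The way an adder tree can derive the 41 sub-block SADs from the sixteen 4 × 4 SADs can be sketched in software as follows. This is an illustrative model only, not the 2DAT circuit: the function name, the dictionary layout, and the width × height naming convention (so that 8 × 4 is a horizontal pair of 4 × 4 blocks) are assumptions for this example.

```python
def subblock_sads(s):
    """Combine a 4x4 grid of 4x4-block SADs s[row][col] into all 41 sub-block SADs.

    Sizes are written width x height (assumed convention), so "8x4" joins two
    horizontally adjacent 4x4 blocks and "4x8" joins two vertically adjacent ones.
    """
    out = {}
    out["4x4"] = [s[r][c] for r in range(4) for c in range(4)]                       # 16
    out["8x4"] = [s[r][2 * c] + s[r][2 * c + 1] for r in range(4) for c in range(2)]  # 8
    out["4x8"] = [s[2 * r][c] + s[2 * r + 1][c] for r in range(2) for c in range(4)]  # 8
    # Each 8x8 SAD re-uses two 8x4 results instead of re-adding four 4x4 values.
    e = out["8x4"]  # index = r * 2 + c
    out["8x8"] = [e[(2 * r) * 2 + c] + e[(2 * r + 1) * 2 + c]
                  for r in range(2) for c in range(2)]                                # 4
    q = out["8x8"]  # index = r * 2 + c
    out["16x8"] = [q[0] + q[1], q[2] + q[3]]   # top half, bottom half                # 2
    out["8x16"] = [q[0] + q[2], q[1] + q[3]]   # left half, right half                # 2
    out["16x16"] = [out["16x8"][0] + out["16x8"][1]]                                  # 1
    return out


grid = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]  # hypothetical SAD values
sads = subblock_sads(grid)
print(sum(len(v) for v in sads.values()))
```

Each larger size is formed from results of the previous level, which mirrors why larger sub-blocks become available a few cycles later than the 4 × 4 SADs in a pipelined tree.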
The RRA is a key module for the proposed SS scanning order, which is introduced in the next subsection.

After the pixels of the current MB are loaded, reference pixels are propagated into the PE array. In each clock cycle, the 256 PEs compute the 256 pixel-wise SADs for a search location in the search window and pass them to the 2DAT. Among the 41 SADs to be computed, there are sixteen 4 × 4 SADs, eight 8 × 4, eight 4 × 8, four 8 × 8, two 16 × 8, two 8 × 16, and one 16 × 16. The 2DAT takes two clock cycles to compute the sixteen 4 × 4 SADs, three clock cycles for the eight 8 × 4 and eight 4 × 8, four clock cycles for the four 8 × 8, five clock cycles for the two 16 × 8 and two 8 × 16, and six clock cycles for the 16 × 16.

The MVs of past MBs propagated in the ASRA are re-used to compute the SMVP during the loading phase of the current MB pixels, and thus SMVP computation does not require extra clock cycles. A reconfigurable feature is that the delay cycles in the ASRA can be adjusted so that it can be easily adapted to different frame sizes. The SMVP is used to compute the MV cost, which is passed to the best MV selector along with the 41 SADs from the 2DAT. The proposed hardware-friendly RD-like cost functions of all the candidate MVs are compared, and the best MV is selected after scanning all possible locations in the search window.

B. Smart Snake Scanning Order

Consider a search window of size 2P × 2Q and a macroblock size of N × N. When full search is performed in the window, most traditional architectures use Raster Scan as shown in Fig. 6(a). In Raster Scan, the search locations in the first row are scanned from left to right, followed by the second row from left to right, and so on. Raster Scan is effective in re-using data horizontally with a relatively high data re-use ratio. For example, when the upper-left search location is processed, all the pixels in the N × N reference block are loaded into the PE array.
When the next search location in Raster Scan order is processed, 1 × N pixels are loaded with (N − 1) × N pixels re-used. However, there is no data re-use between adjacent rows. Thus many pixels are loaded up to N times, with the second time onward being redundant loading. The data re-usability is improved slightly in some architectures by another scanning order called Snake Scan, as shown in Fig. 6(b). Snake Scan processes the first row from left to right, then the second row from right to left, then the third row from left to right, and so on. During horizontal scanning

along a row, Snake Scan re-uses (N − 1) × N pixels from one search point to another. After row k (for any k) is processed, (N − 1) × N pixels of the last search location in row k are re-used in the first search point in row k + 1. In other words, Snake Scan is slightly better than Raster Scan by re-using data between adjacent rows. However, when processing the subsequent search points (in the horizontal direction) in row k + 1, the other pixels loaded during row k processing are not re-used, leading to a lot of redundant loading. Note that in both Raster Scan and Snake Scan, the data re-use ratio is fixed for a fixed search window size.

Fig. 5. Top-level block diagram of proposed architecture.

Fig. 6. (a) Traditional Raster Scan. (b) Traditional Snake Scan. (c) Proposed Smart Snake Scanning order (SS). (d) Two sub-regions enlarged.

The architecture in [12] adopted a Modified Snake Scanning order to achieve a higher data re-use ratio than Raster Scan. And [21] proposed a novel scanning order, which we call ASAP Raster Scan, to get the SAD for each partition as soon as possible. Although it can reduce the number of registers needed to hold the SADs temporarily while generating all partitions, it incurs more redundant loading than Raster Scan.

Here we propose a novel scanning order called Smart Snake (SS) which can achieve variable data re-use ratios and minimum redundant data loading. In particular, in each search window, each reference pixel is loaded once and only once in SS. In the proposed SS scan, we divide the search window into an array of non-overlapping rectangular sub-regions that span the search window. An example with two rows and three columns of sub-regions is shown in Fig. 6(c). Basically, in each

rectangular sub-region, we perform Snake Scan with some tricks to achieve significantly higher data re-use. After one sub-region is searched, the scan moves into an adjacent sub-region and Snake Scan is applied again. In different sub-regions, the Snake Scan may be performed from top to bottom (e.g., sub-region L_1) or from bottom to top (e.g., L_2). It may start from the left and end at the right (e.g., L_1, L_2, L_3), or start from the right and end at the left (e.g., L_4, L_5, L_6). It may be horizontal (e.g., L_1, L_2) or vertical (e.g., L_3, L_4). Here we use horizontal to mean the original Snake Scan, which processes the search points row by row, and vertical to mean column-by-column Snake Scan. We restrict the width (or the height) of each sub-region to be less than or equal to a parameter M. Then we construct a structure called the reconfiguration register array (RRA), which is an array of (M − 1) × (N − 1) registers.

We now describe the data loading behavior of SS in L_1, which contains two initialization steps (A and B) unique to L_1 and two steady-state steps (C and D) common to all sub-regions. These steps are labeled in Fig. 6(d). We assume the size of L_1 is W × H, with W ≤ M. Step A processes the top-left search location in L_1. Step B is performed W − 1 times to process the rest of the top row (row 1) of L_1, moving from left to right. After one row has been processed in one direction, Step C moves to the next row and Step D is performed W − 1 times to process the rest of that row in the opposite direction.

In Step A, the N × N reference pixels corresponding to the upper-left search location are propagated into the PE array, one clock cycle for each column of N pixels. This takes N set-up clock cycles with a data loading rate of N pixels/cycle.
After the pixelwise SAD computation, the right N − 1 columns are propagated within the PE array as they are needed for the following search locations in row 1. The lower N − 1 pixels of the remaining (left) column will be needed for future search points (in rows 2, 3, and so on) and are propagated from the PE array to the RRA. The top pixel of the left column is discarded as it is no longer needed by the algorithm.

Step B is applied after Step A. Step B uses W − 1 clock cycles to process the remaining W − 1 search locations in row 1. In each clock cycle, a new search location is processed, for which a new column of N pixels is loaded (at N pixels/cycle). The right N − 1 columns are propagated within the PE array and the bottom N − 1 pixels of the remaining column are propagated from the PE array to the RRA.

After a row of search locations is processed, Step C is applied to move down one search point to the next row in one clock cycle. The bottom N − 1 rows are propagated within the PE array and a new row of N pixels is loaded (at N pixels/cycle). The RRA remains unchanged.

Step D is applied after Step C. It uses W − 1 clock cycles to process the remaining W − 1 search locations in the current row. In each clock cycle, only one new pixel is loaded, N − 1 pixels are propagated back from the RRA, and N − 1 columns are propagated within the PE array. Thus, the data loading rate is reduced greatly, from N pixels/cycle in Steps A, B, and C to only 1 pixel/cycle in Step D. The bottom N − 1 pixels of the last column will be needed for future search points and thus are propagated from the PE array to the RRA. Steps C and D are applied repeatedly until the last search point in L_1 is reached. Note that there is no redundant data loading in L_1.

After one sub-region is processed, a neighboring sub-region is processed next, using Snake Scan. Four steps A′, B′, C′, and D′, similar to the corresponding steps A, B, C, and D in L_1, are used.
Essentially, Steps B′, C′, and D′ are similar to Steps B, C, and D, respectively, except that their processing directions may be different (width-wise processing to the right or to the left, or length-wise processing to the top or bottom). But Step A′ is significantly different from Step A, in the sense that Step A′ uses only one clock cycle to process the first search point, and it performs either Step B or Step C depending on the relative locations of the two sub-regions and their Snake Scan directions (vertical or horizontal). This is applied recursively until all the sub-regions are processed.

In the special case of only one sub-region such that W = 2P = M, there is no redundant data loading and hardware utilization can reach 100%. But if the search window is large (e.g., for HDTV), the required RRA would be large. In such a situation, the use of sub-regions allows the size of the RRA to be reduced significantly, from (2P − 1) × (N − 1) to (M − 1) × (N − 1), at the expense of redundant loading and a lower data re-use ratio, because when there is more than one sub-region, some reference pixels are loaded more than once. For example, between L_1 and L_2, (N − 1) × H reference pixels are loaded twice.

Another advantage of sub-regions is that, by adjusting the size of each sub-region, we can achieve different trade-offs between the size of the active RRA and the data re-use ratio. When bandwidth is the most critical issue, the system can turn on the whole RRA and use the maximum sub-region size to achieve the maximum data re-use ratio. However, if power is most critical, the RRA can be partially turned off and a smaller sub-region size can be used to achieve lower power at the expense of lower data throughput and data re-use ratio.
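The loading behavior described so far can be tallied directly. The following is a minimal sketch, not the authors' implementation: all function names are hypothetical, the sub-regions are assumed to be M × M with M dividing both 2P and 2Q, and the multi-region count is a worst case in which every sub-region reloads its whole footprint. It checks that Steps A to D load each pixel of a single sub-region exactly once, and counts the loads per search window for Raster Scan, Snake Scan, and SS.

```python
def loads_raster(two_p, two_q, n):
    """Raster Scan: every row of search points starts again from scratch."""
    per_row = n * n + n * (two_p - 1)  # first N x N block, then one column per step
    return two_q * per_row


def loads_snake(two_p, two_q, n):
    """Snake Scan: later rows load one new row of N pixels, then one column per step."""
    first_row = n * n + n * (two_p - 1)
    later_row = n + n * (two_p - 1)
    return first_row + (two_q - 1) * later_row


def loads_ss_subregion(w, h, n):
    """Smart Snake inside one W x H sub-region, tallied step by step."""
    step_a = n * n              # N set-up cycles at N pixels/cycle
    step_b = (w - 1) * n        # rest of the first row, one column per cycle
    step_c = (h - 1) * n        # one new row of N pixels per row change
    step_d = (h - 1) * (w - 1)  # one new pixel per remaining search point
    return step_a + step_b + step_c + step_d


def footprint(w, h, n):
    """Distinct reference pixels covered by a W x H block of search points."""
    return (w + n - 1) * (h + n - 1)


def loads_ss_worst_case(two_p, two_q, n, m):
    """Assumed worst case: each M x M sub-region reloads its whole footprint."""
    regions = (two_p // m) * (two_q // m)
    return regions * loads_ss_subregion(m, m, n)


def rra_registers(m, n):
    """Size of the reconfiguration register array for a given M."""
    return (m - 1) * (n - 1)


if __name__ == "__main__":
    for m in (16, 32):
        print(m, rra_registers(m, 16), loads_ss_worst_case(32, 32, 16, m))
```

With 2P = 2Q = 32 and N = 16, the sketch gives 24064 loads per search window for Raster Scan, 16624 for Snake Scan, 3844 for SS with M = 16, and 2209 (the footprint itself) for SS with M = 32. Halving M cuts the RRA from 465 to 225 registers, which is the trade-off between register count and redundant loading discussed above.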
To investigate the data re-usability of different scanning orders, we define the redundant loading ratio R to be

$$R = \frac{L_{\text{actual}} - S}{S} \tag{10}$$

where S = (2P + N − 1) × (2Q + N − 1) is the number of reference pixels inside a search window and L_actual is the number of reference pixel loads actually performed to finish the search inside the search window. Note that R ≥ 0 by definition. R is equal to zero when each pixel inside the search window is loaded only once and no redundant loading occurs. For Raster Scan and Snake Scan, the redundant loading ratios R_raster and R_snake are

$$R_{\text{raster}} = \frac{2Q\,[N^2 + N(2P - 1)] - S}{S} \tag{11}$$

$$R_{\text{snake}} = \frac{[N^2 + N(2P - 1)] + 2NP(2Q - 1) - S}{S}. \tag{12}$$

For the proposed SS scan, the worst-case redundant loading ratio R_SS is

$$R_{SS} = \frac{4PQ(M + N - 1)^2 / M^2 - S}{S}. \tag{13}$$

These redundant loading ratios for various scanning methods, including Modified Snake Scan [12] and ASAP Raster Scan [21], are computed for two representative cases: small video resolution (CIF) with a small search window (32 × 32), and large video resolution (1080P) with a large search window (128 × 96). They are shown in Table IV together with the corresponding data loading rates (pixels per second). As expected, Snake Scan has a considerably better (smaller) redundant loading ratio than Raster Scan, especially for low-resolution video. But the proposed SS scan achieves a much better redundant loading ratio than any other method. In particular, when one sub-region is used (M = 2P = 2Q = 32), SS achieves zero redundant loading, as expected. The ASAP Raster Scan [21] has the largest R, i.e., the worst redundant loading. The Modified Snake Scan tends to have a similar R to the traditional Snake Scan.

Fig. 7. Hardware diagram of proposed multi-resolution MVP re-use scheme.

C. Multi-Resolution MVP Re-Use Scheme

If there is no MVP re-use in the architecture design (including RDOMFS), all required past MVs would need to be stored in on-chip or off-chip memory and then loaded back to generate the MVPs of the current MB (and to compute the RD cost when needed). This would result in increased on-chip memory size, huge latency, and considerable power consumption. Thus MVP re-use is highly desirable.

We use Fig. 8 to illustrate the proposed MVP re-use method. Here we use the subscripts L, T, and TR to indicate the block to the left, top, and top-right of a particular MB, respectively. For example, R_TR is the block to the top-right of macroblock R. When we process the macroblock R (highlighted in red), we require the best MVs of three neighboring MBs: R_L, R_T, and R_TR.
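The neighbor relationship just described can be made concrete with a small sketch. It assumes MBs are indexed in raster order starting from 0, and it simply reports a neighbor outside the frame as `None`; the substitution rules a real encoder applies at frame borders are not modeled. The function names are hypothetical.

```python
def mvp_neighbours(mb_index, mbs_per_row):
    """Raster-order indices of the left, top, and top-right MBs of an MB.

    A neighbour that falls outside the frame is reported as None
    (a simplification made for this illustration).
    """
    row, col = divmod(mb_index, mbs_per_row)
    left = mb_index - 1 if col > 0 else None
    top = mb_index - mbs_per_row if row > 0 else None
    has_top_right = row > 0 and col < mbs_per_row - 1
    top_right = mb_index - mbs_per_row + 1 if has_top_right else None
    return {"L": left, "T": top, "TR": top_right}


def mvp_delays(mb_index, mbs_per_row):
    """How many MBs earlier each available neighbour was processed."""
    neighbours = mvp_neighbours(mb_index, mbs_per_row)
    return {k: mb_index - v for k, v in neighbours.items() if v is not None}


if __name__ == "__main__":
    # A CIF frame is 352 pixels wide, i.e., 22 MBs per row.
    print(mvp_neighbours(30, 22))
    print(mvp_delays(30, 22))
```

For MB 30 in a frame that is 22 MBs wide, the left neighbor was processed 1 MB earlier, the top neighbor 22 MBs earlier, and the top-right neighbor 21 MBs earlier. The distance to the top and top-right neighbors therefore grows with the frame width, while the distance to the left neighbor does not.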
Similarly, the macroblock G (highlighted in green) requires the MVs of G_L, G_T, and G_TR. Recall that MBs are processed in Raster Scan order (though the locations in the search window of a particular MB are processed in SS order). After the motion estimation is finished for one row of MBs (for example, the row containing R), the next row of MBs is processed (for example, the row containing G). It is obvious that R_L is the same as G_T, so the MV of R_L can be re-used for G after a certain delay. The delay (in clock cycles) depends on the width of the current frame. So our main idea is that, rather than storing the MVs of all MBs in memory, we simply propagate the MV of the current MB to an ASRA, which can propagate the MV with a variable delay that differs for different resolutions (frame sizes). The variable delay helps to make this design reconfigurable.

Fig. 7 shows the hardware architecture for our multi-resolution MVP re-use scheme. It consists of the ASRA and three MVP registers, MVPR-L, MVPR-T, and MVPR-TR, which store the MVs of the left, top, and top-right MBs, respectively. After an MB is processed, its MV is propagated into MVPR-L during the initialization step (Step A) of SS of the next MB. This achieves high hardware utilization and low latency because it does not take additional clock cycles. The ASRA contains shift registers with multiple outputs (corresponding to different delays) for different resolutions. A MUX after the ASRA selects the output corresponding to the resolution. To support a video width of 4096, we use 512 registers in the ASRA, with two registers for the MV of each MB in a row. Although the size of the ASRA is fixed, the MUX selects a different data bypass to achieve a different delay for each frame size.

Fig. 8. MVP re-use relationship between MBs in two MB rows.

D. 2-D Adder Tree (2DAT)

Fig. 9 shows the concept diagram and hierarchical architecture of the proposed 2DAT. The 2DAT takes a total of two clock cycles to get sixteen 4 × 4 SADs and, similarly, three clock cycles to get eight 8 × 4 and eight 4 × 8 SADs, four clock cycles to get four 8 × 8 SADs, five clock cycles for two 16 × 8 and two 8 × 16 SADs, and six clock cycles to get the final SAD.

Fig. 9. (a) Concept diagram. (b) Hierarchical architecture of proposed 2-D Adder Tree.

TABLE IV
Data Re-Use for Various Scanning Orders
Case 1: 2P = 32, 2Q = 32, N = 16, CIF at 30 frames/s. Case 2: 2P = 128, 2Q = 96, N = 16, 1080P at 30 frames/s.

Scanning Method       | Case 1: S | L_actual | R (%) | Pixel/s | Case 2: S | L_actual | R (%) | Pixel/s
Raster Scan           |           |          |       | 285.8M  |           |          |       | 53.4G
Snake Scan            |           |          |       | 197.5M  |           |          |       | 47.8G
Modified Snake [12]   |           |          |       | 196M    |           |          |       | 47.7G
ASAP Raster Scan [21] |           |          |       |         |           |          |       | 334G
SS (M = 16)           |           |          |       | 45.6M   |           |          |       | 11G
SS (M = 32)           |           |          |       |         |           |          |       | 6.4G

E. PE Design

In this section, we use a 2-D PE array, which consists of multiple single PEs as shown in Fig. 11, together with an RRA, as shown in Fig. 12, to illustrate the PE design. Without loss of generality, we assume M = 3 and N = 4. This PE array contains N² single PEs, with one PE processing one pixel of the current N × N block. The red, blue, green, and black arrows are the leftward, rightward, upward, and downward reference data paths, respectively. So each single PE has a MUX to select reference data from four possible directions. The blue dashed arrows represent the propagation of the reference pixels to the RRA and the black dashed arrows represent the propagation of the pixelwise SADs to the 2DAT.

Fig. 11 shows the architecture of a single PE. It contains a current pixel register to store the current pixel, a reference pixel register to store the reference pixel, an absolute difference calculator, a MUX to select reference data from four possible directions (left, right, top, and bottom), and four latches (L1, L2, L3, and L4) for propagating the reference pixel in the four directions.

Fig. 10. Data flow of proposed architecture for M = W = 3.

Fig. 11. Single PE design.

F. Data Flow

Table V shows the data flow of the proposed architecture for the case of N = 4 and M = 3. As M = 3, the RRA contains two (M − 1) columns, column 1 and column 2.
After the initialization cycles in Step A, the current pixels are stored inside the PE array and the reference pixels are propagated into the PE array. Here, R_ij is the ijth reference pixel in the search window. After the SAD of the first search point is calculated, a new column of reference pixels is loaded into the PE array, three (N − 1) reference pixels are propagated into column 1 of the RRA, and the SAD of a new search point is calculated in Step B_1. Step B_2 is similar to B_1, except that column 1 of the RRA is propagated to column 2 and another three reference pixels are propagated into column 1 from the PE array. In Step C, a new row of reference pixels is loaded into the PE array and the RRA is not changed. In Steps D_1 and D_2, rather than loading one row (or column) of pixels, only one reference pixel is loaded per clock cycle, and data re-use (with ratio (N − 1)/N) is achieved by moving data from the RRA into the PE array.
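The cycle-by-cycle schedule of Steps A to D can be sketched as a small generator. This is an illustration only: it assumes a horizontal Snake Scan that starts at the top-left search point of a W × H sub-region, writes search points as (row, column), and produces only the step label, the number of pixels read from memory, and the search point for each cycle. The contents of the PE array and the RRA are not modeled, and the name `ss_schedule` is hypothetical.

```python
def ss_schedule(w, h, n, origin=(0, 0)):
    """Yield (cycle, step, pixels_loaded, search_point) for one W x H sub-region.

    search_point is (row, col); it is None during the set-up cycles of Step A,
    in which no SAD is produced yet.
    """
    row0, col0 = origin
    cycle = 0
    for k in range(1, n + 1):  # Step A: N set-up cycles, one column each
        point = (row0, col0) if k == n else None
        yield cycle, f"A{k}", n, point
        cycle += 1
    for k in range(1, w):  # Step B: rest of the first row, one column each
        yield cycle, f"B{k}", n, (row0, col0 + k)
        cycle += 1
    for r in range(1, h):
        # Rows alternate direction: odd rows run right to left.
        cols = range(w - 1, -1, -1) if r % 2 else range(w)
        for k, c in enumerate(cols):
            step = "C" if k == 0 else f"D{k}"  # C changes row, D walks along it
            loaded = n if k == 0 else 1        # C loads a row, D loads one pixel
            yield cycle, step, loaded, (row0 + r, col0 + c)
            cycle += 1


if __name__ == "__main__":
    for entry in ss_schedule(3, 4, 4, origin=(-2, -2)):
        print(entry)
```

Called with W = 3, N = 4, and a top-left search point of (−2, −2), the generator emits the step sequence A1 to A4, B1, B2, C, D1, D2, C, D1, D2, C, with four pixels read in every A, B, and C cycle and one pixel in every D cycle.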

Fig. 12. PE array and RRA for M = 3.

TABLE V
Data Flow of the Proposed Architecture When N = 4 and M = 3 = W

Cycle | Step | Read From the Memory    | Data in RRA Column 1 | Data in RRA Column 2 | Search Point
0     | A_1  | R_00, R_10, R_20, R_30  |                      |                      |
1     | A_2  | R_01, R_11, R_21, R_31  |                      |                      |
2     | A_3  | R_02, R_12, R_22, R_32  |                      |                      |
3     | A_4  | R_03, R_13, R_23, R_33  |                      |                      | (−2, −2)
4     | B_1  | R_04, R_14, R_24, R_34  | R_10, R_20, R_30     |                      | (−2, −1)
5     | B_2  | R_05, R_15, R_25, R_35  | R_11, R_21, R_31     | R_10, R_20, R_30     | (−2, 0)
6     | C    | R_42, R_43, R_44, R_45  | R_11, R_21, R_31     | R_10, R_20, R_30     | (−1, 0)
7     | D_1  | R_41                    | R_10, R_20, R_30     | R_25, R_35, R_45     | (−1, −1)
8     | D_2  | R_40                    | R_25, R_35, R_45     | R_24, R_34, R_44     | (−1, −2)
9     | C    | R_50, R_51, R_52, R_53  | R_25, R_35, R_45     | R_24, R_34, R_44     | (0, −2)
10    | D_1  | R_54                    | R_30, R_40, R_50     | R_25, R_35, R_45     | (0, −1)
11    | D_2  | R_55                    | R_31, R_41, R_51     | R_30, R_40, R_50     | (0, 0)
12    | C    | R_62, R_63, R_64, R_65  | R_31, R_41, R_51     | R_30, R_40, R_50     | (1, 0)

TABLE VI
Performance Comparison of Various Architectures

Architecture        | [13]            | [16]            | [17]            | [18]            | [12]           | Proposed RDOMFS (M = 32)
No. of PE           |                 |                 |                 |                 |                |
Block size          | to 4 × 4        | to 4 × 4        | to 4 × 4        | to 4 × 4        | to 4 × 4       | to 4 × 4
Search method       | FS-SAD-var-zero | FS-SAD-var-zero | FS-SAD-var-zero | FS-SAD-var-zero | FS-RD-var-mvp  | RDOMFS
Technology          | 0.13 µm         | 0.18 µm         | 0.18 µm         | 0.18 µm         | 0.18 µm        | 0.18 µm
Gate count          | 61k             | 210k            | 160k            | 597k            | 330k           | 103k
Max frequency (MHz) |                 |                 |                 |                 |                |
Power (mW)          |                 |                 |                 |                 |                |
SRAM size (bytes)   |                 |                 |                 |                 |                |
Max video size      | P 1080P 720P
Frames/s            |                 |                 |                 |                 |                |
Scanning order      | Raster          | Raster          | Raster          | Raster          | Modified Snake | Smart Snake
Search range        |                 |                 |                 |                 |                |

TABLE VII
Performance Comparison of Various Architectures

Architecture        | [22]             | [23]            | [21]            | [24]            | Proposed RDOMFS (M = 32)
No. of PE           |                  |                 |                 |                 |
Block size          | to 8 × 8         | to 4 × 4        | to 4 × 4        | to 4 × 4        | to 4 × 4
Search method       | FME-SAD-var-zero | FS-SAD-var-zero | FS-SAD-var-zero | FS-SAD-var-zero | RDOMFS
Technology          | 0.18 µm          | 0.13 µm         | 0.18 µm         | 0.18 µm         | 0.18 µm
Gate count          | 485.7k           | 453k            | 39k             | 12k             | 103k
Max frequency (MHz) |                  |                 |                 |                 |
Power (mW)          |                  |                 |                 |                 |
SRAM size (bytes)   |                  |                 |                 |                 |
Max video size      | 1080P            | 1080P           | 720P            | CIF             |
Frames/s            |                  |                 |                 |                 |
Scanning order      | Raster Raster ASAP Raster Smart Snake
Search range        |                  |                 |                 |                 |

IV. Implementation Results and Comparison

The proposed architecture was described in VHDL and synthesized by Synopsys Design Compiler with the TSMC 0.18 µm CMOS standard cell library. Table VI shows the details of the implementation results. The design contains about 103k gates excluding SRAM. The total size of SRAM is 1271 bytes. The circuit can operate at frequencies up to 250 MHz allowing the processing of blocks per second when SR = 16. Under a clock frequency of 63 MHz, the architecture allows the real-time processing of 1080P video at 30 frames/s and SR = 32 (with similar RD performance to FS-RD-var-mvp at SR = 128).

A comparison between the proposed circuit for RDOMFS (with M = 32) and some typical existing VBSME circuits [12], [13], [16]–[18], [21]–[24] for H.264 is presented in Tables VI and VII. Among these architectures, the proposed RDOMFS circuit provides the lowest redundant loading ratio, which means every pixel in the search window is loaded only once. Furthermore, except for [12], all architectures use SAD as the criterion of similarity, which would result in a significant RD drop compared with FS-RD-var-mvp.
With the hardware-friendly SMVP and RD-like cost function, the proposed RDOMFS circuit with a small search range can achieve better RD performance than FS-SAD-var-zero, and RD performance similar to that of UMHexagonS and FS-RD-var-mvp with a bigger search range, especially at high resolution, as shown by the simulation results in Section II. We can also observe that the power consumption of the proposed architecture is significantly lower than that of the other architectures. This is because the SS and the multi-resolution MVP re-use scheme help the proposed architecture to achieve the highest data re-use ratio, leading to low latency, high data throughput, and a significant reduction of data loading from the memory.

V. Conclusion

In this paper, we proposed a hardware-friendly MVP-biased motion estimation algorithm, RDOMFS, with a unified single MVP and an RD-like cost function. Simulation results suggest that it achieves RD performance comparable to that of the FS-RD-var-mvp and UMHexagonS used in the JM software, and is significantly better than the FS-SAD-var-zero commonly used in hardware implementations. We also proposed a matching architecture with a novel SS scanning order and a multi-resolution MVP re-use scheme. The design is implemented with TSMC 0.18 µm CMOS technology and costs 103k gates. At a clock frequency of 63 MHz, the architecture achieves real-time RDO-VBSME at 30 frames/s.

Acknowledgment

The authors appreciate the contribution from E. Ueda.

References

[1] H. Everett, III, "Generalized Lagrange multiplier method for solving problems of optimum allocation of resources," Oper. Res., vol. 11, no. 3.
[2] R. Li, B. Zeng, and M. Liou, "A new three-step search algorithm for block motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 4, no. 4, Aug.
[3] S. Zhu and K. K. Ma, "A new diamond search algorithm for fast block-matching motion estimation," IEEE Trans. Image Process., vol. 9, no. 2, Feb.
[4] M. Rehan et al., "Block-based motion estimation using an enhanced flexible triangle search algorithm," in Proc. Can. Conf. Electr. Comput. Eng., May 2005.
[5] M. Ghanbari, "The cross-search algorithm for motion estimation [image coding]," IEEE Trans. Commun., vol. 38, no. 7, Jul.
[6] A. Tourapis, O. Au, and M. Liou, "Predictive motion vector field adaptive search technique (PMVFAST): enhancing block based motion estimation," in Proc. SPIE Conf. Visual Commun. Image Process.
[7] Z. Chen, J. Xu, Y. He, and J. Zheng, "Fast integer-pel and fractional-pel motion estimation for H.264/AVC," J. Visual Commun. Image Representation, vol. 17, no. 2, Apr.
[8] A. Tourapis, O. Au, and M. Liou, "Highly efficient predictive zonal algorithms for fast block-matching motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 10, Oct.
[9] G. Cote, B. Erol, M. Gallant, and F. Kossentini, "H.263+: Video coding at low bit rates," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 7, Nov.
[10] Information Technology: Coding of Audio-Visual Objects, Part 2: Visual, ISO/IEC.
[11] S. Kwon, A. Tamhankar, and K. R. Rao, "Overview of H.264/MPEG-4 part 10," J. Visual Commun. Image Representation, vol. 17, no. 2, Apr.
[12] C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, T.-C. Wang, and L.-G. Chen, "Analysis and architecture design of variable block-size motion estimation for H.264/AVC," IEEE Trans. Circuits Syst.-I, vol. 53, no. 3, Mar.
[13] S. Y. Yap and J. McCanny, "A VLSI architecture for variable block size video motion estimation," IEEE Trans. Circuits Syst.-II, vol. 51, no. 7, Jul.
[14] H. Kung, "Why systolic architectures," Computer, vol. 15, no. 1.
[15] D. Moldovan, "On the design of algorithms for VLSI systolic arrays," Proc. IEEE, vol. 71, no. 1, Jan.
[16] L. Deng, W. Gao, M. Z. Hu, and Z. Z. Ji, "An efficient hardware implementation for motion estimation of AVC standard," IEEE Trans. Consumer Electron., vol. 51, no. 4, Nov.
[17] W. Cao, H. Hui, J. Tong, J. Lai, and H. Min, "A high-performance reconfigurable VLSI architecture for VBSME in H.264," IEEE Trans. Consumer Electron., vol. 54, no. 3, Aug.
[18] C.-M. Ou, C.-F. Le, and W.-J. Hwang, "An efficient VLSI architecture for H.264 variable block size motion estimation," IEEE Trans. Consumer Electron., vol. 51, no. 4, Nov.
[19] G. Bjontegaard, Calculation of Average PSNR Differences Between RD-Curves, ITU-T SG16 Q.6 document, vol. VCEG-M33, Austin, TX, Apr.
[20] G. Bjontegaard, Improvements of the BD-PSNR Model, ITU-T SG16 Q.6 document, vol. VCEG-AI11, Berlin, Germany, Jul.
[21] J. Kim and T. Park, "A novel VLSI architecture for full-search variable block-size motion estimation," IEEE Trans. Consumer Electron. I, vol. 55, no. 2, May.
[22] Z. Liu, Y. Song, M. Shao, S. Li, L. Li, S. Goto, and T. Ikenaga, "32-parallel SAD tree hardwired engine for variable block size motion estimation in HDTV1080P real-time encoding application," in Proc. IEEE Workshop Signal Process. Syst., Oct. 2007.
[23] C.-Y. Kao and Y.-L. Lin, "A high-performance and memory-efficient architecture for H.264/AVC motion estimation," in Proc. IEEE Int. Conf. Multimedia Expo, Jun. 2008.
[24] H. Parandeh-Afshar, P. Brisk, and P. Ienne, "Scalable and low cost design approach for variable block size motion estimation (VBSME)," in Proc. Int. Symp. VLSI Design Autom. Test, 2009.

Xing Wen (S'XX) received the B.E. degree in electronic engineering from Xidian University, Xi'an, China. He is currently pursuing the Ph.D. degree from the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong. His current research interests include video coding standards, motion estimation, HW/SW co-design, hardware implementation of multimedia algorithms, high throughput very large scale integration design, multiple description video coding, and visual quality assessment.

Oscar C. Au (SM'XX) received the B.A.Sc. degree from the University of Toronto, Toronto, ON, Canada, in 1986, and the M.A. and Ph.D. degrees from Princeton University, Princeton, NJ, in 1988 and 1991, respectively. After being a Post-Doctoral Researcher with Princeton University for one year, he joined the Hong Kong University of Science and Technology (HKUST), Clear Water Bay, Hong Kong, as an Assistant Professor. He is/has been a Professor with the Department of Electronic and Computer Engineering, Director of the Multimedia Technology Research Center, and Director of the Computer Engineering Program at HKUST. He has published about 280 technical journal and conference papers. His fast motion estimation algorithms were accepted into the ISO/IEC MPEG-4 international video coding standard and the China AVS-M standard. His light-weight encryption and error resilience algorithms were accepted into the China AVS standard. He has four U.S. patents and is applying for more than 60 on his signal processing techniques. He has performed forensic investigation and stood as an expert witness in the Hong Kong courts many times. His main research contributions include video and image coding and processing, watermarking and light-weight encryption, and speech and audio processing.
Research topics include fast motion estimation for MPEG-1/2/4, H.261/3/4 and AVS, optimal and fast sub-optimal rate control, mode decision, transcoding, denoising, deinterlacing, post-processing, multiview coding, scalable video coding, distributed video coding, subpixel rendering, JPEG/JPEG2000, HDR imaging, compressive sensing, halftone image data hiding, GPU-processing, software-hardware co-design, and so on.
Dr. Au is/was an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology, the IEEE Transactions on Image Processing, and the IEEE Transactions on Circuits and Systems, Part I. He is on the editorial boards of the Journal of Signal Processing Systems, Journal of Multimedia, and the Journal of Franklin Institute. He is/was the Chairman of the CAS Technical Committee on Multimedia Systems and Applications and a member of CAS TC on Video Signal Processing and Communications, CAS TC on DSP, SP TC on Multimedia Signal Processing, and SP TC on Image, Video and Multidimensional Signal Processing. He served on the Steering Committee of the IEEE Transactions on Multimedia and the IEEE International Conference of Multimedia and Expo (ICME). He served on the organizing committee of the IEEE International Symposium on Circuits and Systems in 1997, the IEEE International Conference on Acoustics, Speech and Signal Processing in 2003, the ISO/IEC MPEG 71st Meeting in 2004, the International Conference on Image Processing in 2010, and other conferences. He was the General Chair of the Pacific-Rim Conference on Multimedia (PCM) in 2007, and chaired both IEEE ICME and the Packet Video Workshop. He won Best Paper Awards in SiPS 2007 and PCM.

Jiang Xu (M'XX) received the B.S. and M.S. degrees in electrical engineering from the Harbin Institute of Technology, Harbin, China, and the M.A. and Ph.D. degrees in electrical engineering from Princeton University, Princeton, NJ. From 2001 to 2002, he was a Research Associate with Bell Laboratories, Murray Hill, NJ. He was a Research Associate with NEC Laboratories America, Inc., Princeton, from 2003 to 2005, and worked on networks-on-chip. He joined a startup company, Sandbridge Technologies, Lowell, MA. Since 2007, he has been with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, as an Assistant Professor, and has established the Mobile Computing System Laboratory. He has published more than 30 papers in peer-reviewed journals and conferences. His current research interests include multiprocessor systems-on-chip, computer architecture, low-power very large scale integration design, and HW/SW co-design.
Dr. Xu received one Best Paper Award. He serves on the organizing and technical committees of many international conferences, including ICCD, CASES, ISVLSI, VLSI, EMSOFT, VLSI-SoC, ICESS, RTCSA, NOCS, ESO, and so on. He currently serves as an Associate Editor of the ACM Transactions on Embedded Computing Systems.

Lu Fang (S'XX) received the B.E. degree from the Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China. She is currently pursuing the Ph.D. degree from the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong.

Run Cha (S'XX) received the B.S. degree in electronic and information engineering from Tianjin University, Tianjin, China. She is currently pursuing the M.Phil. degree from the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong. Her current research interests include sub-pixel interpolation filter design, motion estimation algorithms, and combined intra and inter prediction.

Jiali Li (S'XX) received the B.S. degree in electrical engineering and information science from the University of Science and Technology of China, Hefei, China. She is currently pursuing the Ph.D. degree from the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong. Her current research interests include video coding, GPU acceleration in multimedia, and HW/SW co-design.


More information

Temporal Error Concealment Algorithm Using Adaptive Multi- Side Boundary Matching Principle

Temporal Error Concealment Algorithm Using Adaptive Multi- Side Boundary Matching Principle 184 IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 Temporal Error Concealment Algorithm Using Adaptive Multi- Side Boundary Matching Principle Seung-Soo

More information

OL_H264e HDTV H.264/AVC Baseline Video Encoder Rev 1.0. General Description. Applications. Features

OL_H264e HDTV H.264/AVC Baseline Video Encoder Rev 1.0. General Description. Applications. Features OL_H264e HDTV H.264/AVC Baseline Video Encoder Rev 1.0 General Description Applications Features The OL_H264e core is a hardware implementation of the H.264 baseline video compression algorithm. The core

More information

Fast Mode Decision Algorithm for Intra prediction in H.264/AVC Video Coding

Fast Mode Decision Algorithm for Intra prediction in H.264/AVC Video Coding 356 IJCSNS International Journal of Computer Science and Network Security, VOL.7 No.1, January 27 Fast Mode Decision Algorithm for Intra prediction in H.264/AVC Video Coding Abderrahmane Elyousfi 12, Ahmed

More information

Chapter 10 Basic Video Compression Techniques

Chapter 10 Basic Video Compression Techniques Chapter 10 Basic Video Compression Techniques 10.1 Introduction to Video compression 10.2 Video Compression with Motion Compensation 10.3 Video compression standard H.261 10.4 Video compression standard

More information

Research Topic. Error Concealment Techniques in H.264/AVC for Wireless Video Transmission in Mobile Networks

Research Topic. Error Concealment Techniques in H.264/AVC for Wireless Video Transmission in Mobile Networks Research Topic Error Concealment Techniques in H.264/AVC for Wireless Video Transmission in Mobile Networks July 22 nd 2008 Vineeth Shetty Kolkeri EE Graduate,UTA 1 Outline 2. Introduction 3. Error control

More information

AN IMPROVED ERROR CONCEALMENT STRATEGY DRIVEN BY SCENE MOTION PROPERTIES FOR H.264/AVC DECODERS

AN IMPROVED ERROR CONCEALMENT STRATEGY DRIVEN BY SCENE MOTION PROPERTIES FOR H.264/AVC DECODERS AN IMPROVED ERROR CONCEALMENT STRATEGY DRIVEN BY SCENE MOTION PROPERTIES FOR H.264/AVC DECODERS Susanna Spinsante, Ennio Gambi, Franco Chiaraluce Dipartimento di Elettronica, Intelligenza artificiale e

More information

Design of a Fast Multi-Reference Frame Integer Motion Estimator for H.264/AVC

Design of a Fast Multi-Reference Frame Integer Motion Estimator for H.264/AVC http://dx.doi.org/10.5573/jsts.2013.13.5.430 JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.13, NO.5, OCTOBER, 2013 Design of a Fast Multi-Reference Frame Integer Motion Estimator for H.264/AVC Juwon

More information

PACKET-SWITCHED networks have become ubiquitous

PACKET-SWITCHED networks have become ubiquitous IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 13, NO. 7, JULY 2004 885 Video Compression for Lossy Packet Networks With Mode Switching and a Dual-Frame Buffer Athanasios Leontaris, Student Member, IEEE,

More information

Reduced complexity MPEG2 video post-processing for HD display

Reduced complexity MPEG2 video post-processing for HD display Downloaded from orbit.dtu.dk on: Dec 17, 2017 Reduced complexity MPEG2 video post-processing for HD display Virk, Kamran; Li, Huiying; Forchhammer, Søren Published in: IEEE International Conference on

More information

A Novel VLSI Architecture of Motion Compensation for Multiple Standards

A Novel VLSI Architecture of Motion Compensation for Multiple Standards A Novel VLSI Architecture of Motion Compensation for Multiple Standards Junhao Zheng, Wen Gao, Senior Member, IEEE, David Wu, and Don Xie Abstract Motion compensation (MC) is one of the most important

More information

Principles of Video Compression

Principles of Video Compression Principles of Video Compression Topics today Introduction Temporal Redundancy Reduction Coding for Video Conferencing (H.261, H.263) (CSIT 410) 2 Introduction Reduce video bit rates while maintaining an

More information

Memory interface design for AVS HD video encoder with Level C+ coding order

Memory interface design for AVS HD video encoder with Level C+ coding order LETTER IEICE Electronics Express, Vol.14, No.12, 1 11 Memory interface design for AVS HD video encoder with Level C+ coding order Xiaofeng Huang 1a), Kaijin Wei 2, Guoqing Xiang 2, Huizhu Jia 2, and Don

More information

SCALABLE video coding (SVC) is currently being developed

SCALABLE video coding (SVC) is currently being developed IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 16, NO. 7, JULY 2006 889 Fast Mode Decision Algorithm for Inter-Frame Coding in Fully Scalable Video Coding He Li, Z. G. Li, Senior

More information

FAST SPATIAL AND TEMPORAL CORRELATION-BASED REFERENCE PICTURE SELECTION

FAST SPATIAL AND TEMPORAL CORRELATION-BASED REFERENCE PICTURE SELECTION FAST SPATIAL AND TEMPORAL CORRELATION-BASED REFERENCE PICTURE SELECTION 1 YONGTAE KIM, 2 JAE-GON KIM, and 3 HAECHUL CHOI 1, 3 Hanbat National University, Department of Multimedia Engineering 2 Korea Aerospace

More information

AUDIOVISUAL COMMUNICATION

AUDIOVISUAL COMMUNICATION AUDIOVISUAL COMMUNICATION Laboratory Session: Recommendation ITU-T H.261 Fernando Pereira The objective of this lab session about Recommendation ITU-T H.261 is to get the students familiar with many aspects

More information

WITH the rapid development of high-fidelity video services

WITH the rapid development of high-fidelity video services 896 IEEE SIGNAL PROCESSING LETTERS, VOL. 22, NO. 7, JULY 2015 An Efficient Frame-Content Based Intra Frame Rate Control for High Efficiency Video Coding Miaohui Wang, Student Member, IEEE, KingNgiNgan,

More information

Color Quantization of Compressed Video Sequences. Wan-Fung Cheung, and Yuk-Hee Chan, Member, IEEE 1 CSVT

Color Quantization of Compressed Video Sequences. Wan-Fung Cheung, and Yuk-Hee Chan, Member, IEEE 1 CSVT CSVT -02-05-09 1 Color Quantization of Compressed Video Sequences Wan-Fung Cheung, and Yuk-Hee Chan, Member, IEEE 1 Abstract This paper presents a novel color quantization algorithm for compressed video

More information

Constant Bit Rate for Video Streaming Over Packet Switching Networks

Constant Bit Rate for Video Streaming Over Packet Switching Networks International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Constant Bit Rate for Video Streaming Over Packet Switching Networks Mr. S. P.V Subba rao 1, Y. Renuka Devi 2 Associate professor

More information

Lecture 2 Video Formation and Representation

Lecture 2 Video Formation and Representation 2013 Spring Term 1 Lecture 2 Video Formation and Representation Wen-Hsiao Peng ( 彭文孝 ) Multimedia Architecture and Processing Lab (MAPL) Department of Computer Science National Chiao Tung University 1

More information

A High Performance VLSI Architecture with Half Pel and Quarter Pel Interpolation for A Single Frame

A High Performance VLSI Architecture with Half Pel and Quarter Pel Interpolation for A Single Frame I J C T A, 9(34) 2016, pp. 673-680 International Science Press A High Performance VLSI Architecture with Half Pel and Quarter Pel Interpolation for A Single Frame K. Priyadarshini 1 and D. Jackuline Moni

More information

Contents. xv xxi xxiii xxiv. 1 Introduction 1 References 4

Contents. xv xxi xxiii xxiv. 1 Introduction 1 References 4 Contents List of figures List of tables Preface Acknowledgements xv xxi xxiii xxiv 1 Introduction 1 References 4 2 Digital video 5 2.1 Introduction 5 2.2 Analogue television 5 2.3 Interlace 7 2.4 Picture

More information

MPEG has been established as an international standard

MPEG has been established as an international standard 1100 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 9, NO. 7, OCTOBER 1999 Fast Extraction of Spatially Reduced Image Sequences from MPEG-2 Compressed Video Junehwa Song, Member,

More information

Multimedia Communications. Image and Video compression

Multimedia Communications. Image and Video compression Multimedia Communications Image and Video compression JPEG2000 JPEG2000: is based on wavelet decomposition two types of wavelet filters one similar to what discussed in Chapter 14 and the other one generates

More information

ALONG with the progressive device scaling, semiconductor

ALONG with the progressive device scaling, semiconductor IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 57, NO. 4, APRIL 2010 285 LUT Optimization for Memory-Based Computation Pramod Kumar Meher, Senior Member, IEEE Abstract Recently, we

More information

University of Bristol - Explore Bristol Research. Peer reviewed version. Link to published version (if available): /ISCAS.2005.

University of Bristol - Explore Bristol Research. Peer reviewed version. Link to published version (if available): /ISCAS.2005. Wang, D., Canagarajah, CN., & Bull, DR. (2005). S frame design for multiple description video coding. In IEEE International Symposium on Circuits and Systems (ISCAS) Kobe, Japan (Vol. 3, pp. 19 - ). Institute

More information

Modeling and Optimization of a Systematic Lossy Error Protection System based on H.264/AVC Redundant Slices

Modeling and Optimization of a Systematic Lossy Error Protection System based on H.264/AVC Redundant Slices Modeling and Optimization of a Systematic Lossy Error Protection System based on H.264/AVC Redundant Slices Shantanu Rane, Pierpaolo Baccichet and Bernd Girod Information Systems Laboratory, Department

More information

An Overview of Video Coding Algorithms

An Overview of Video Coding Algorithms An Overview of Video Coding Algorithms Prof. Ja-Ling Wu Department of Computer Science and Information Engineering National Taiwan University Video coding can be viewed as image compression with a temporal

More information

Algorithm and architecture design of the motion estimation for the H.265/HEVC 4K-UHD encoder

Algorithm and architecture design of the motion estimation for the H.265/HEVC 4K-UHD encoder J Real-Time Image Proc (216) 12:517 529 DOI 1.17/s11554-15-516-4 SPECIAL ISSUE PAPER Algorithm and architecture design of the motion estimation for the H.265/HEVC 4K-UHD encoder Grzegorz Pastuszak Maciej

More information

COMP 249 Advanced Distributed Systems Multimedia Networking. Video Compression Standards

COMP 249 Advanced Distributed Systems Multimedia Networking. Video Compression Standards COMP 9 Advanced Distributed Systems Multimedia Networking Video Compression Standards Kevin Jeffay Department of Computer Science University of North Carolina at Chapel Hill jeffay@cs.unc.edu September,

More information

LUT Optimization for Distributed Arithmetic-Based Block Least Mean Square Adaptive Filter

LUT Optimization for Distributed Arithmetic-Based Block Least Mean Square Adaptive Filter LUT Optimization for Distributed Arithmetic-Based Block Least Mean Square Adaptive Filter Abstract: In this paper, we analyze the contents of lookup tables (LUTs) of distributed arithmetic (DA)- based

More information

Mauricio Álvarez-Mesa ; Chi Ching Chi ; Ben Juurlink ; Valeri George ; Thomas Schierl Parallel video decoding in the emerging HEVC standard

Mauricio Álvarez-Mesa ; Chi Ching Chi ; Ben Juurlink ; Valeri George ; Thomas Schierl Parallel video decoding in the emerging HEVC standard Mauricio Álvarez-Mesa ; Chi Ching Chi ; Ben Juurlink ; Valeri George ; Thomas Schierl Parallel video decoding in the emerging HEVC standard Conference object, Postprint version This version is available

More information

THE USE OF forward error correction (FEC) in optical networks

THE USE OF forward error correction (FEC) in optical networks IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 52, NO. 8, AUGUST 2005 461 A High-Speed Low-Complexity Reed Solomon Decoder for Optical Communications Hanho Lee, Member, IEEE Abstract

More information

The Multistandard Full Hd Video-Codec Engine On Low Power Devices

The Multistandard Full Hd Video-Codec Engine On Low Power Devices The Multistandard Full Hd Video-Codec Engine On Low Power Devices B.Susma (M. Tech). Embedded Systems. Aurora s Technological & Research Institute. Hyderabad. B.Srinivas Asst. professor. ECE, Aurora s

More information

Joint Optimization of Source-Channel Video Coding Using the H.264/AVC encoder and FEC Codes. Digital Signal and Image Processing Lab

Joint Optimization of Source-Channel Video Coding Using the H.264/AVC encoder and FEC Codes. Digital Signal and Image Processing Lab Joint Optimization of Source-Channel Video Coding Using the H.264/AVC encoder and FEC Codes Digital Signal and Image Processing Lab Simone Milani Ph.D. student simone.milani@dei.unipd.it, Summer School

More information

Frame Processing Time Deviations in Video Processors

Frame Processing Time Deviations in Video Processors Tensilica White Paper Frame Processing Time Deviations in Video Processors May, 2008 1 Executive Summary Chips are increasingly made with processor designs licensed as semiconductor IP (intellectual property).

More information

1022 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 19, NO. 4, APRIL 2010

1022 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 19, NO. 4, APRIL 2010 1022 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 19, NO. 4, APRIL 2010 Delay Constrained Multiplexing of Video Streams Using Dual-Frame Video Coding Mayank Tiwari, Student Member, IEEE, Theodore Groves,

More information

A Study of Encoding and Decoding Techniques for Syndrome-Based Video Coding

A Study of Encoding and Decoding Techniques for Syndrome-Based Video Coding MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com A Study of Encoding and Decoding Techniques for Syndrome-Based Video Coding Min Wu, Anthony Vetro, Jonathan Yedidia, Huifang Sun, Chang Wen

More information

Robust 3-D Video System Based on Modified Prediction Coding and Adaptive Selection Mode Error Concealment Algorithm

Robust 3-D Video System Based on Modified Prediction Coding and Adaptive Selection Mode Error Concealment Algorithm International Journal of Signal Processing Systems Vol. 2, No. 2, December 2014 Robust 3-D Video System Based on Modified Prediction Coding and Adaptive Selection Mode Error Concealment Algorithm Walid

More information

A High-Performance Parallel CAVLC Encoder on a Fine-Grained Many-core System

A High-Performance Parallel CAVLC Encoder on a Fine-Grained Many-core System A High-Performance Parallel CAVLC Encoder on a Fine-Grained Many-core System Zhibin Xiao and Bevan M. Baas VLSI Computation Lab, ECE Department University of California, Davis Outline Introduction to H.264

More information

The H.263+ Video Coding Standard: Complexity and Performance

The H.263+ Video Coding Standard: Complexity and Performance The H.263+ Video Coding Standard: Complexity and Performance Berna Erol (bernae@ee.ubc.ca), Michael Gallant (mikeg@ee.ubc.ca), Guy C t (guyc@ee.ubc.ca), and Faouzi Kossentini (faouzi@ee.ubc.ca) Department

More information

MULTI-STATE VIDEO CODING WITH SIDE INFORMATION. Sila Ekmekci Flierl, Thomas Sikora

MULTI-STATE VIDEO CODING WITH SIDE INFORMATION. Sila Ekmekci Flierl, Thomas Sikora MULTI-STATE VIDEO CODING WITH SIDE INFORMATION Sila Ekmekci Flierl, Thomas Sikora Technical University Berlin Institute for Telecommunications D-10587 Berlin / Germany ABSTRACT Multi-State Video Coding

More information

Error Resilient Video Coding Using Unequally Protected Key Pictures

Error Resilient Video Coding Using Unequally Protected Key Pictures Error Resilient Video Coding Using Unequally Protected Key Pictures Ye-Kui Wang 1, Miska M. Hannuksela 2, and Moncef Gabbouj 3 1 Nokia Mobile Software, Tampere, Finland 2 Nokia Research Center, Tampere,

More information

OL_H264MCLD Multi-Channel HDTV H.264/AVC Limited Baseline Video Decoder V1.0. General Description. Applications. Features

OL_H264MCLD Multi-Channel HDTV H.264/AVC Limited Baseline Video Decoder V1.0. General Description. Applications. Features OL_H264MCLD Multi-Channel HDTV H.264/AVC Limited Baseline Video Decoder V1.0 General Description Applications Features The OL_H264MCLD core is a hardware implementation of the H.264 baseline video compression

More information

Multimedia Communications. Video compression

Multimedia Communications. Video compression Multimedia Communications Video compression Video compression Of all the different sources of data, video produces the largest amount of data There are some differences in our perception with regard to

More information

Dual Frame Video Encoding with Feedback

Dual Frame Video Encoding with Feedback Video Encoding with Feedback Athanasios Leontaris and Pamela C. Cosman Department of Electrical and Computer Engineering University of California, San Diego, La Jolla, CA 92093-0407 Email: pcosman,aleontar

More information

A Novel Macroblock-Level Filtering Upsampling Architecture for H.264/AVC Scalable Extension

A Novel Macroblock-Level Filtering Upsampling Architecture for H.264/AVC Scalable Extension 05-Silva-AF:05-Silva-AF 8/19/11 6:18 AM Page 43 A Novel Macroblock-Level Filtering Upsampling Architecture for H.264/AVC Scalable Extension T. L. da Silva 1, L. A. S. Cruz 2, and L. V. Agostini 3 1 Telecommunications

More information

Visual Communication at Limited Colour Display Capability

Visual Communication at Limited Colour Display Capability Visual Communication at Limited Colour Display Capability Yan Lu, Wen Gao and Feng Wu Abstract: A novel scheme for visual communication by means of mobile devices with limited colour display capability

More information

DWT Based-Video Compression Using (4SS) Matching Algorithm

DWT Based-Video Compression Using (4SS) Matching Algorithm DWT Based-Video Compression Using (4SS) Matching Algorithm Marwa Kamel Hussien Dr. Hameed Abdul-Kareem Younis Assist. Lecturer Assist. Professor Lava_85K@yahoo.com Hameedalkinani2004@yahoo.com Department

More information

Intra-frame JPEG-2000 vs. Inter-frame Compression Comparison: The benefits and trade-offs for very high quality, high resolution sequences

Intra-frame JPEG-2000 vs. Inter-frame Compression Comparison: The benefits and trade-offs for very high quality, high resolution sequences Intra-frame JPEG-2000 vs. Inter-frame Compression Comparison: The benefits and trade-offs for very high quality, high resolution sequences Michael Smith and John Villasenor For the past several decades,

More information

A CYCLES/MB H.264/AVC MOTION COMPENSATION ARCHITECTURE FOR QUAD-HD APPLICATIONS

A CYCLES/MB H.264/AVC MOTION COMPENSATION ARCHITECTURE FOR QUAD-HD APPLICATIONS 9th European Signal Processing Conference (EUSIPCO 2) Barcelona, Spain, August 29 - September 2, 2 A 6-65 CYCLES/MB H.264/AVC MOTION COMPENSATION ARCHITECTURE FOR QUAD-HD APPLICATIONS Jinjia Zhou, Dajiang

More information

Optimization of Multi-Channel BCH Error Decoding for Common Cases. Russell Dill Master's Thesis Defense April 20, 2015

Optimization of Multi-Channel BCH Error Decoding for Common Cases. Russell Dill Master's Thesis Defense April 20, 2015 Optimization of Multi-Channel BCH Error Decoding for Common Cases Russell Dill Master's Thesis Defense April 20, 2015 Bose-Chaudhuri-Hocquenghem (BCH) BCH is an Error Correcting Code (ECC) and is used

More information

MPEGTool: An X Window Based MPEG Encoder and Statistics Tool 1

MPEGTool: An X Window Based MPEG Encoder and Statistics Tool 1 MPEGTool: An X Window Based MPEG Encoder and Statistics Tool 1 Toshiyuki Urabe Hassan Afzal Grace Ho Pramod Pancha Magda El Zarki Department of Electrical Engineering University of Pennsylvania Philadelphia,

More information

Jun-Hao Zheng et al.: An Efficient VLSI Architecture for MC of AVS HDTV Decoder 371 ture for MC which contains a three-stage pipeline. The hardware ar

Jun-Hao Zheng et al.: An Efficient VLSI Architecture for MC of AVS HDTV Decoder 371 ture for MC which contains a three-stage pipeline. The hardware ar May 2006, Vol.21, No.3, pp.370 377 J. Comput. Sci. & Technol. An Efficient VLSI Architecture for Motion Compensation of AVS HDTV Decoder Jun-Hao Zheng 1;3 (ΨΞ ), Lei Deng 2 ( Π), Peng Zhang 1;3 (Φ ±),

More information

Design and Analysis of Modified Fast Compressors for MAC Unit

Design and Analysis of Modified Fast Compressors for MAC Unit Design and Analysis of Modified Fast Compressors for MAC Unit Anusree T U 1, Bonifus P L 2 1 PG Student & Dept. of ECE & Rajagiri School of Engineering & Technology 2 Assistant Professor & Dept. of ECE

More information

Robust Transmission of H.264/AVC Video using 64-QAM and unequal error protection

Robust Transmission of H.264/AVC Video using 64-QAM and unequal error protection Robust Transmission of H.264/AVC Video using 64-QAM and unequal error protection Ahmed B. Abdurrhman 1, Michael E. Woodward 1 and Vasileios Theodorakopoulos 2 1 School of Informatics, Department of Computing,

More information

12-bit Wallace Tree Multiplier CMPEN 411 Final Report Matthew Poremba 5/1/2009

12-bit Wallace Tree Multiplier CMPEN 411 Final Report Matthew Poremba 5/1/2009 12-bit Wallace Tree Multiplier CMPEN 411 Final Report Matthew Poremba 5/1/2009 Project Overview This project was originally titled Fast Fourier Transform Unit, but due to space and time constraints, the

More information

Motion Re-estimation for MPEG-2 to MPEG-4 Simple Profile Transcoding. Abstract. I. Introduction

Motion Re-estimation for MPEG-2 to MPEG-4 Simple Profile Transcoding. Abstract. I. Introduction Motion Re-estimation for MPEG-2 to MPEG-4 Simple Profile Transcoding Jun Xin, Ming-Ting Sun*, and Kangwook Chun** *Department of Electrical Engineering, University of Washington **Samsung Electronics Co.

More information

PAPER A Fine-Grain Scalable and Low Memory Cost Variable Block Size Motion Estimation Architecture for H.264/AVC

PAPER A Fine-Grain Scalable and Low Memory Cost Variable Block Size Motion Estimation Architecture for H.264/AVC 1928 PAPER A Fine-Grain Scalable and Low Memory Cost Variable Block Size Motion Estimation Architecture for H.264/AVC Zhenyu LIU a), Nonmember,YangSONG, Student Member,TakeshiIKENAGA, Member, and Satoshi

More information

Video Over Mobile Networks

Video Over Mobile Networks Video Over Mobile Networks Professor Mohammed Ghanbari Department of Electronic systems Engineering University of Essex United Kingdom June 2005, Zadar, Croatia (Slides prepared by M. Mahdi Ghandi) INTRODUCTION

More information

ROBUST REGION-OF-INTEREST SCALABLE CODING WITH LEAKY PREDICTION IN H.264/AVC. Qian Chen, Li Song, Xiaokang Yang, Wenjun Zhang

ROBUST REGION-OF-INTEREST SCALABLE CODING WITH LEAKY PREDICTION IN H.264/AVC. Qian Chen, Li Song, Xiaokang Yang, Wenjun Zhang ROBUST REGION-OF-INTEREST SCALABLE CODING WITH LEAKY PREDICTION IN H.264/AVC Qian Chen, Li Song, Xiaokang Yang, Wenjun Zhang Institute of Image Communication & Information Processing Shanghai Jiao Tong

More information

Speeding up Dirac s Entropy Coder

Speeding up Dirac s Entropy Coder Speeding up Dirac s Entropy Coder HENDRIK EECKHAUT BENJAMIN SCHRAUWEN MARK CHRISTIAENS JAN VAN CAMPENHOUT Parallel Information Systems (PARIS) Electronics and Information Systems (ELIS) Ghent University

More information

Motion Video Compression

Motion Video Compression 7 Motion Video Compression 7.1 Motion video Motion video contains massive amounts of redundant information. This is because each image has redundant information and also because there are very few changes

More information

Drift Compensation for Reduced Spatial Resolution Transcoding

Drift Compensation for Reduced Spatial Resolution Transcoding MERL A MITSUBISHI ELECTRIC RESEARCH LABORATORY http://www.merl.com Drift Compensation for Reduced Spatial Resolution Transcoding Peng Yin Anthony Vetro Bede Liu Huifang Sun TR-2002-47 August 2002 Abstract

More information

A Low-Power 0.7-V H p Video Decoder

A Low-Power 0.7-V H p Video Decoder A Low-Power 0.7-V H.264 720p Video Decoder D. Finchelstein, V. Sze, M.E. Sinangil, Y. Koken, A.P. Chandrakasan A-SSCC 2008 Outline Motivation for low-power video decoders Low-power techniques pipelining

More information

A low-power portable H.264/AVC decoder using elastic pipeline

A low-power portable H.264/AVC decoder using elastic pipeline Chapter 3 A low-power portable H.64/AVC decoder using elastic pipeline Yoshinori Sakata, Kentaro Kawakami, Hiroshi Kawaguchi, Masahiko Graduate School, Kobe University, Kobe, Hyogo, 657-8507 Japan Email:

More information

VLSI Design: 3) Explain the various MOSFET Capacitances & their significance. 4) Draw a CMOS Inverter. Explain its transfer characteristics

VLSI Design: 3) Explain the various MOSFET Capacitances & their significance. 4) Draw a CMOS Inverter. Explain its transfer characteristics 1) Explain why & how a MOSFET works VLSI Design: 2) Draw Vds-Ids curve for a MOSFET. Now, show how this curve changes (a) with increasing Vgs (b) with increasing transistor width (c) considering Channel

More information

Memory Efficient VLSI Architecture for QCIF to VGA Resolution Conversion

Memory Efficient VLSI Architecture for QCIF to VGA Resolution Conversion Memory Efficient VLSI Architecture for QCIF to VGA Resolution Conversion Asmar A Khan and Shahid Masud Department of Computer Science and Engineering Lahore University of Management Sciences Opp Sector-U,

More information

Research Article. ISSN (Print) *Corresponding author Shireen Fathima

Research Article. ISSN (Print) *Corresponding author Shireen Fathima Scholars Journal of Engineering and Technology (SJET) Sch. J. Eng. Tech., 2014; 2(4C):613-620 Scholars Academic and Scientific Publisher (An International Publisher for Academic and Scientific Resources)

More information

High Performance Carry Chains for FPGAs

High Performance Carry Chains for FPGAs High Performance Carry Chains for FPGAs Matthew M. Hosler Department of Electrical and Computer Engineering Northwestern University Abstract Carry chains are an important consideration for most computations,

More information

Multicore Design Considerations

Multicore Design Considerations Multicore Design Considerations Multicore: The Forefront of Computing Technology We re not going to have faster processors. Instead, making software run faster in the future will mean using parallel programming

More information

PERCEPTUAL QUALITY OF H.264/AVC DEBLOCKING FILTER

PERCEPTUAL QUALITY OF H.264/AVC DEBLOCKING FILTER PERCEPTUAL QUALITY OF H./AVC DEBLOCKING FILTER Y. Zhong, I. Richardson, A. Miller and Y. Zhao School of Enginnering, The Robert Gordon University, Schoolhill, Aberdeen, AB1 1FR, UK Phone: + 1, Fax: + 1,

More information

Robust Transmission of H.264/AVC Video Using 64-QAM and Unequal Error Protection

Robust Transmission of H.264/AVC Video Using 64-QAM and Unequal Error Protection Robust Transmission of H.264/AVC Video Using 64-QAM and Unequal Error Protection Ahmed B. Abdurrhman, Michael E. Woodward, and Vasileios Theodorakopoulos School of Informatics, Department of Computing,

More information

H.261: A Standard for VideoConferencing Applications. Nimrod Peleg Update: Nov. 2003

H.261: A Standard for VideoConferencing Applications. Nimrod Peleg Update: Nov. 2003 H.261: A Standard for VideoConferencing Applications Nimrod Peleg Update: Nov. 2003 ITU - Rec. H.261 Target (1990)... A Video compression standard developed to facilitate videoconferencing (and videophone)

More information

Introduction to Video Compression Techniques. Slides courtesy of Tay Vaughan Making Multimedia Work

Introduction to Video Compression Techniques. Slides courtesy of Tay Vaughan Making Multimedia Work Introduction to Video Compression Techniques Slides courtesy of Tay Vaughan Making Multimedia Work Agenda Video Compression Overview Motivation for creating standards What do the standards specify Brief

More information

Scalable multiple description coding of video sequences

Scalable multiple description coding of video sequences Scalable multiple description coding of video sequences Marco Folli, and Lorenzo Favalli Electronics Department University of Pavia, Via Ferrata 1, 100 Pavia, Italy Email: marco.folli@unipv.it, lorenzo.favalli@unipv.it

More information

Overview: Video Coding Standards

Overview: Video Coding Standards Overview: Video Coding Standards Video coding standards: applications and common structure ITU-T Rec. H.261 ISO/IEC MPEG-1 ISO/IEC MPEG-2 State-of-the-art: H.264/AVC Video Coding Standards no. 1 Applications

More information

Understanding PQR, DMOS, and PSNR Measurements

Understanding PQR, DMOS, and PSNR Measurements Understanding PQR, DMOS, and PSNR Measurements Introduction Compression systems and other video processing devices impact picture quality in various ways. Consumers quality expectations continue to rise

More information

LUT Optimization for Memory Based Computation using Modified OMS Technique

LUT Optimization for Memory Based Computation using Modified OMS Technique LUT Optimization for Memory Based Computation using Modified OMS Technique Indrajit Shankar Acharya & Ruhan Bevi Dept. of ECE, SRM University, Chennai, India E-mail : indrajitac123@gmail.com, ruhanmady@yahoo.co.in

More information

Highly Efficient Video Codec for Entertainment-Quality

Highly Efficient Video Codec for Entertainment-Quality Highly Efficient Video Codec for Entertainment-Quality Seyoon Jeong, Sung-Chang Lim, Hahyun Lee, Jongho Kim, Jin Soo Choi, and Haechul Choi We present a novel video codec for supporting entertainment-quality

More information

FPGA Laboratory Assignment 4. Due Date: 06/11/2012

FPGA Laboratory Assignment 4. Due Date: 06/11/2012 FPGA Laboratory Assignment 4 Due Date: 06/11/2012 Aim The purpose of this lab is to help you understanding the fundamentals of designing and testing memory-based processing systems. In this lab, you will

More information

International Journal of Engineering Research-Online A Peer Reviewed International Journal

International Journal of Engineering Research-Online A Peer Reviewed International Journal RESEARCH ARTICLE ISSN: 2321-7758 VLSI IMPLEMENTATION OF SERIES INTEGRATOR COMPOSITE FILTERS FOR SIGNAL PROCESSING MURALI KRISHNA BATHULA Research scholar, ECE Department, UCEK, JNTU Kakinada ABSTRACT The

More information

Memory efficient Distributed architecture LUT Design using Unified Architecture

Memory efficient Distributed architecture LUT Design using Unified Architecture Research Article Memory efficient Distributed architecture LUT Design using Unified Architecture Authors: 1 S.M.L.V.K. Durga, 2 N.S. Govind. Address for Correspondence: 1 M.Tech II Year, ECE Dept., ASR

More information

A video signal processor for motioncompensated field-rate upconversion in consumer television

A video signal processor for motioncompensated field-rate upconversion in consumer television A video signal processor for motioncompensated field-rate upconversion in consumer television B. De Loore, P. Lippens, P. Eeckhout, H. Huijgen, A. Löning, B. McSweeney, M. Verstraelen, B. Pham, G. de Haan,

More information