THE TRANSMISSION and storage of video are important


206 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 2, FEBRUARY 2011

Novel RD-Optimized VBSME with Matching Highly Data Re-Usable Hardware Architecture

Xing Wen, Student Member, IEEE, Oscar C. Au, Senior Member, IEEE, Jiang Xu, Member, IEEE, Lu Fang, Student Member, IEEE, Run Cha, Student Member, IEEE, and Jiali Li, Student Member, IEEE

Abstract: To achieve superior performance, rate-distortion optimized motion estimation (ME) for variable block size (RDO-VBSME) is often used in state-of-the-art video coding systems such as the H.264 JM software. However, the complexity of RDO-VBSME is very high for both software and hardware implementations. In this paper, we propose a hardware-friendly ME algorithm called RDOMFS with a novel hardware-friendly rate-distortion (RD)-like cost function and a hardware-friendly modified motion vector predictor. Simulation results suggest that the proposed RDOMFS can achieve essentially the same RD performance as RDO-VBSME in JM. We also propose a matching hardware architecture with a novel Smart Snake Scanning order which can achieve a very high data re-use ratio and data throughput. It is also reconfigurable because it can achieve a variable data re-use ratio and can process variable frame sizes. The design is implemented with TSMC 0.18 µm CMOS technology and costs 103k gates. At a clock frequency of 63 MHz, the architecture achieves real-time RDO-VBSME at 30 frames/s. At a maximum clock frequency of 250 MHz, it can process at 30 frames/s.

Index Terms: Data re-use, hardware, motion estimation, scanning order, software-hardware co-design, VHDL.

I. Introduction

The transmission and storage of video are important for many applications. But raw video sequences are well known to be huge in size. Thus video compression is needed. Over the years, many video coding standards such as MPEG-1/2/4 and ITU-T H.261/263/264 have been developed to achieve efficient compression.
They achieve compression by exploiting temporal redundancy using motion estimation (ME) and compensation, spatial redundancy using the discrete cosine transform, statistical redundancy using entropy coding, and perceptual irrelevancy using quantization. This paper is about efficient hardware-software co-design of rate-distortion optimized (RDO) ME to achieve good rate-distortion performance, real-time implementation with high data throughput, regular data flow, good parallelism, and a high degree of memory re-use.

Manuscript received April 8, 2010; revised July 27, 2010; accepted September 12, Date of publication January 17, 2011; date of current version March 2, This work was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region, China, under GRF Projects and This paper was recommended by Associate Editor L.-P. Chau. The authors are with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong (wxxab@ust.hk; eeau@ust.hk; eexu@ust.hk; fanglu@ust.hk; charun@ust.hk; jiali@ust.hk). Color versions of one or more of the figures in this paper are available online at Digital Object Identifier /TCSVT

It is well known that ME has very high computational complexity. ME consists mainly of two parts: integer ME (IME) and fractional ME. Runtime profiling of the H.264 JM encoder reveals that IME consumes close to 60% of total encoder time, and up to 90% when fractional ME is included. Thus efficient ME algorithms and hardware architectures for IME are needed. This paper is about an IME algorithm and its hardware implementation.

In IME, the current frame is divided into non-overlapping macroblocks (MB) of size N × N (N = 16). For each MB, a search window is defined around a point (e.g., the collocated point or some predicted location) in the reference frame. In this paper, we assume the search range is [−P, P) in both horizontal and vertical directions.
Each point in the search window corresponds to a candidate MB to predict the current MB. A distortion measure is defined to measure the similarity between the candidate MB and the current MB. A search is performed within the search window for the best matched candidate MB with maximum similarity. The displacement of the best matched MB from the current MB is the motion vector (MV). There are many common mismatch measures such as sum of absolute difference (SAD), sum of squared difference (SSD), and sum of absolute transformed difference (SATD). SAD is the most common due to its simplicity and effectiveness

$$\mathrm{SAD}_{k,l}(m,n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| X_t(k+i,\, l+j) - X_{t-1}(k+m+i,\, l+n+j) \right| \qquad (1)$$

where (m, n) is the motion vector with −P ≤ m, n < P, X_t(i, j) and X_{t−1}(i, j) are the pixel values at location (i, j) in the current frame at time t and the reference frame at time t − 1, respectively, and (k, l) is the location of the current block in the current frame. SAD computation is very regular and is suitable for efficient hardware implementation. Most existing hardware ME architectures are based on SAD.

In recent years, an alternative measure called the rate-distortion (RD) cost function has become increasingly popular. It was first introduced by Everett in 1963 [1] with the general form being

$$\mathrm{RDCost} = D + \lambda R \qquad (2)$$

where D is the distortion such as SSD, SATD, or SAD, R is the associated bit rate (e.g., that of the MV and/or that of the residue), and λ is the Lagrangian multiplier.
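As an illustration of the SAD in (1), the following is a minimal software sketch, not the hardware datapath. It assumes frames are stored as nested lists indexed as `frame[x][y]` to mirror X(x, y); the function name `sad` and the synthetic test frames are hypothetical and chosen only for this example.

```python
def sad(cur, ref, k, l, m, n, N=16):
    """Sum of absolute differences of equation (1).

    cur, ref : current and reference frames, indexed as frame[x][y]
    (k, l)   : location of the current block in the current frame
    (m, n)   : candidate motion vector
    N        : block size (16 for a macroblock)
    """
    total = 0
    for i in range(N):
        for j in range(N):
            total += abs(cur[k + i][l + j] - ref[k + m + i][l + n + j])
    return total


# Synthetic frames (an assumption for the demo): the reference frame is the
# current frame displaced by (1, 2), so the candidate MV (1, 2) matches exactly.
SIZE, N = 8, 4
cur = [[x * SIZE + y for y in range(SIZE)] for x in range(SIZE)]
ref = [[(x - 1) * SIZE + (y - 2) for y in range(SIZE)] for x in range(SIZE)]

print(sad(cur, ref, 2, 2, 1, 2, N))  # 0: perfect match at the true displacement
print(sad(cur, ref, 2, 2, 0, 0, N))  # 160: every pixel differs by 10 at MV (0, 0)
```

Every term of the double sum is independent of the others, which is the regularity that makes SAD attractive for a parallel processing-element array.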

In H.264, λ is different for each of the 51 values of the quantization parameter Qp. When RD cost is used, the ME is called rate-distortion optimized or RDO. However, it is difficult to implement the RD cost in hardware because the RD cost computation requires floating-point multiplication and/or large hardware cost for the lookup table.

A common ME method is full search (FS), which examines all points in the search window in a brute-force manner. It is zero-biased, with its search center collocated with the current MB. FS can achieve the global minimum and thus good visual quality, but requires much computation in software implementation. FS can be efficiently implemented in hardware to achieve good data throughput because its dataflow is regular and is suitable for pipelining. Also, data can be re-used between neighboring search locations.

Besides FS, many fast ME (FME) algorithms have been developed. Most FMEs perform some search around a search center which may be zero-biased or MVP-biased. The zero-biased search center is the (0, 0) motion vector. Some common zero-biased FMEs include NTSS [2], diamond search [3], FTS [4], and cross search [5]. An MVP-biased search center is chosen from a number of MVPs according to certain criteria. The MVPs are typically obtained by using MVs of spatially and temporally neighboring blocks. Some common MVP-biased FMEs include PMVFAST [6], UMHexagonS [7], and EPZS [8]. Often some local search is performed around the search center in the FMEs, leading to a local minimum (as opposed to the global minimum achieved by FS). However, it is often difficult to implement MVP-biased ME in hardware because the consideration of multiple MVPs and the often irregular local search patterns can easily break the hardware pipeline, leading to low hardware efficiency, low data re-use, and high memory access.
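The difference between a zero-biased and an MVP-biased full search comes down to where the search window is centered. The sketch below is a plain software illustration under stated assumptions: frames are nested lists indexed as `frame[x][y]`, the names `sad` and `full_search` are hypothetical, candidates falling outside the reference frame are simply skipped, and SAD is used as the cost.

```python
def sad(cur, ref, k, l, m, n, N):
    """SAD between the current block at (k, l) and the candidate displaced by (m, n)."""
    return sum(abs(cur[k + i][l + j] - ref[k + m + i][l + n + j])
               for i in range(N) for j in range(N))


def full_search(cur, ref, k, l, P, center=(0, 0), N=16):
    """Examine every MV in [center - P, center + P) in both directions.

    center=(0, 0) gives a zero-biased search; passing a predictor as the
    center gives an MVP-biased search over a window of the same size.
    """
    best_mv, best_cost = None, None
    for m in range(center[0] - P, center[0] + P):
        for n in range(center[1] - P, center[1] + P):
            # Skip candidates that would read outside the reference frame.
            if not (0 <= k + m and k + m + N <= len(ref)):
                continue
            if not (0 <= l + n and l + n + N <= len(ref[0])):
                continue
            cost = sad(cur, ref, k, l, m, n, N)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (m, n), cost
    return best_mv, best_cost


# Synthetic frames (an assumption for the demo): true displacement is (1, 2).
SIZE, N = 12, 4
cur = [[x * SIZE + y for y in range(SIZE)] for x in range(SIZE)]
ref = [[(x - 1) * SIZE + (y - 2) for y in range(SIZE)] for x in range(SIZE)]

print(full_search(cur, ref, 4, 4, P=3, N=N))                 # ((1, 2), 0)
print(full_search(cur, ref, 4, 4, P=1, center=(1, 2), N=N))  # ((1, 2), 0)
```

In this toy case a good predictor lets a window of range 1 find the same MV that the zero-biased search needs range 3 to reach, which is the motivation for biasing the search center.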
Early video coding standards such as MPEG-1, MPEG-2, H.263 [9], and MPEG-4 [10] use a macroblock as a unit and perform fixed block-size (FBS) motion estimation. The latest H.264 [11] allows an MB to be partitioned into seven kinds of sub-blocks as shown in Fig. 1 (16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, and 4 × 4), each with its own motion vector, and performs variable block-size (VBS) ME for all possible sub-blocks. VBSME allows different MVs for different sub-blocks and thus can achieve better matching for all sub-blocks and higher coding efficiency than FBSME. VBSME is especially useful for MBs containing multiple objects, each with possibly different motion. It can also be useful for MBs with rotation and even deformation. While VBSME has good RD performance compared with FBSME, it has a huge computational requirement and irregular memory access, making efficient hardware implementation hard. This paper is about efficient VBSME.

Fig. 1. Variable block size in H.264/AVC.

While early ME algorithms tend to use SAD due to its low complexity, recent ME algorithms tend to use RD cost due to its superior RD performance. We use -SAD and -RD to distinguish two different versions of any algorithm: the one using SAD and the one using RD cost, respectively. For example, FS-SAD is FS using SAD and FS-RD is FS using RD cost. Similarly, we use -zero and -mvp to mean the zero-biased and MVP-biased versions, respectively. And we use -var and -fix to mean VBS and FBS, respectively. For example, FS-SAD-fix-zero is zero-biased FBS FS using SAD, and FS-RD-var-mvp is MVP-biased VBS FS using RD cost.

Among existing ME architectures, some are for FME but most are for FS. Some of those for FS are MVP-biased and some use RD cost, but most are for zero-biased FS using SAD. While early ME architectures tend to do FBS ME, the new ones are predominantly for VBS ME. A good overview of ME architectures can be found in [12].
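To make the seven partition sizes of Fig. 1 concrete: one 16 × 16 MB contains 1 + 2 + 2 + 4 + 8 + 8 + 16 = 41 sub-blocks in total, and the SAD of every one of them is a sum of the SADs of the 4 × 4 blocks it covers. The sketch below illustrates that composition in software. It is an assumption-laden example rather than any particular architecture: the name `subblock_sads` is hypothetical, partition sizes are written as (width, height), and each sub-block is summed directly, whereas a hardware adder tree would typically re-use partial sums.

```python
# Partition sizes as (width, height) in pixels, following the H.264 naming.
PARTITIONS = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]


def subblock_sads(sad4x4):
    """Combine sixteen 4x4 SADs into the SADs of all sub-blocks of one MB.

    sad4x4[r][c] is the SAD of the 4x4 block in row r, column c of the MB.
    Returns a dict mapping (width, height) to the list of sub-block SADs
    in raster order.
    """
    result = {}
    for w, h in PARTITIONS:
        bw, bh = w // 4, h // 4          # sub-block size in units of 4x4 blocks
        sads = []
        for r in range(0, 4, bh):
            for c in range(0, 4, bw):
                sads.append(sum(sad4x4[r + dr][c + dc]
                                for dr in range(bh) for dc in range(bw)))
        result[(w, h)] = sads
    return result


# Demo with hypothetical 4x4 SAD values 0..15 laid out in raster order.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
sads = subblock_sads(grid)
print(sum(len(v) for v in sads.values()))  # 41 sub-block SADs per MB
print(sads[(16, 16)])                      # [120]
print(sads[(8, 8)])                        # [10, 18, 42, 50]
```

Because all 41 values derive from the same sixteen 4 × 4 SADs, the pixel-level work for one search location is shared by every block size.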
In [13], a 1-D systolic array [14], [15] architecture with 16 processing elements (PE) for full-search VBSME (FSVBSME) was proposed. The authors in [16]-[18] proposed three 2-D systolic array architectures with 256 PEs for FS-SAD-var-zero, which has lower RD performance than the FS-RD-var-mvp in the H.264 JM software. A few, such as [12], used FS-RD-var-mvp, but they needed a significant amount of extra on-chip memory to store all the MVs required to generate the MVPs. All these architectures incur redundant loading inside the search window, leading to huge latency and considerable power consumption. Here we use redundant loading to mean data being loaded more than once.

In this paper, we propose a novel RDO-like MVP-biased VBS ME algorithm called RDOMFS with a matching reconfigurable architecture. RDOMFS, introduced in Section II with simulation results, uses a hardware-friendly single MVP (SMVP) and a hardware-friendly RD-like cost function to achieve essentially the same coding efficiency as FS-RD-var-mvp. The matching architecture, introduced in Section III, uses a 2-D systolic array to implement the proposed RDOMFS. It re-uses MVPs in the current frame to eliminate the need to store the MVs in on-chip memory. It uses a novel scanning order in the search window to minimize redundant loading and achieve different tradeoffs between power, data throughput, and data re-use ratio. Implementation results of the proposed architecture and comparisons with others are shown in Section IV. A conclusion is given in Section V.

II. Proposed Motion Estimation Algorithm

In this section, we propose a novel hardware-oriented ME algorithm called RD optimized single-mvp-biased FS (RDOMFS), which has a hardware-friendly single-mvp bias and uses a hardware-friendly RD cost function.
As will be shown in Section II-C, the proposed RDOMFS achieves RD performance similar to that of FS-RD-var-mvp (the default motion estimation algorithm in the H.264 JM software) and better than that of FS-SAD-var-zero.


Fig. 2. (a) Median MVP definition in H.264 for various sub-block sizes. (b) Spatially varying definitions of left, top, and top-right sub-blocks. (c) Proposed unified SMVP for all sub-blocks.

Fig. 3. Original and modified lambda values.

A. Unified Single MVP (SMVP)

Nowadays, many state-of-the-art FME methods, such as [6]-[8], are used in the H.264 JM software to achieve good RD performance with low complexity. Most of these FMEs are MVP-biased in the sense that they perform a local search such as small diamond search [6] or hexagonal search [7] around a search center chosen from a few MVPs. Typically they compute the similarity measure or cost function for a few highly probable MVPs, select one as the best, and use it as the center for the local search. It is well known that a local search yields a local minimum which may or may not be the global minimum. But the use of multiple MVPs to find the search center, as opposed to a single MVP, helps to increase the probability that the local minimum is the global minimum.

Typically, the MVPs used in these FMEs include temporally and spatially neighboring MVs. The temporally neighboring MV is the MV of the collocated block in the reference frame, which requires the storage of the MVs of the reference frames. The spatially neighboring MVs include the MVs of the left, top, and top-right sub-blocks. Functions of these three MVs, such as the median, can also be used. Note that, in H.264, the definitions of the left, top, and top-right sub-blocks are different for different sub-block sizes and can be different for sub-blocks of the same size at different locations. Thus, a large amount of memory is required to store the MVs for sub-blocks of all sizes. Some examples are shown in Fig. 2(a) for sub-block sizes of 4 and 16. And the data flow to compute the median is also irregular.
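For reference, the median of three neighboring MVs can be sketched as follows. This is a minimal example assuming the median is taken component-wise, as in H.264 MV prediction; the names `median3` and `median_mvp` and the sample vectors are hypothetical.

```python
def median3(a, b, c):
    """Median of three scalars."""
    return sorted((a, b, c))[1]


def median_mvp(mv_left, mv_top, mv_top_right):
    """Component-wise median of three neighboring motion vectors.

    Each MV is an (x, y) tuple. The horizontal and vertical components are
    handled independently, so the result need not equal any one input MV.
    """
    return (median3(mv_left[0], mv_top[0], mv_top_right[0]),
            median3(mv_left[1], mv_top[1], mv_top_right[1]))


print(median_mvp((2, -1), (5, 3), (-4, 0)))  # (2, 0)
```

The arithmetic itself is trivial. The cost discussed in the text comes from locating and fetching the three neighbor MVs, whose positions depend on the sub-block size and location.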
In particular, the use of multiple MVPs tends to be inefficient for hardware implementation for four reasons. First, the spatially varying definitions of left, top, and top-right sub-blocks, as shown in Fig. 2(b), result in irregular data flow which makes hardware implementation inefficient. Second, the hardware utilization would be low during the examination of the multiple MVPs because the random nature of the final chosen MVP makes pipelined implementation hard. Third, recall that the multiple MVPs are different points in the search area. Thus, the reference pixels associated with the MVPs need to be loaded separately and can hardly be re-used, which causes high memory bandwidth and high latency. Fourth, a great number of past MVs need to be saved on-chip for hardware implementation of multiple MVPs. This leads to significant on-chip memory requirement and cost.

For efficient hardware implementation, we choose to use an SMVP in the proposed RDOMFS rather than multiple MVPs. We use a unified SMVP definition for all sub-blocks of all sizes within an MB. Using the symbol MV_{16×16} to mean, for an MB, its best MV for the 16 × 16 sub-block size, we define our SMVP as

$$\mathrm{SMVP} = \mathrm{Median}\left(MV^{\text{left}}_{16\times16},\; MV^{\text{top}}_{16\times16},\; MV^{\text{top-right}}_{16\times16}\right) \qquad (3)$$

which is the median of the MVs of the MBs on the left, top, and top-right, as shown in Fig. 2(c). In other words, all sub-blocks of all sizes at any location within the MB use the same SMVP. Although the use of one MVP as opposed to multiple MVPs tends to result in a lower probability that the local minimum is the global minimum, our experiments in Section II-C suggest that the performance drop is not significant.

Most importantly, our SMVP can address the four problems mentioned above. First, even though our SMVP is based on spatially neighboring MVs, the identical SMVP definition for all sub-blocks of all sizes makes the data flow regular. Second, the MVP selection is deterministic so that the data flow becomes regular and pipeline design can be used.
Actually, there is no longer an MVP selection stage as there is only one candidate. Third, latency is much lower due to the absence of the MVP selection stage. Fourth, our choice of SMVP requires the storage of far fewer past MVs, and our MVP re-use scheme, as will be explained later, can further reduce the on-chip memory requirement. With the SMVP as search center, RDOMFS performs a local full search for all sub-block sizes, which is highly regular and is suitable for hardware implementation.

B. Distortion Measure

Recall the tradeoff between two common distortion measures: SAD and RD. SAD computation is very regular and is suitable for efficient hardware implementation. Most existing ME hardware architectures are designed based on SAD. On the other hand, RD is defined as

$$RD(MV) = \mathrm{SAD}(MV) + \lambda(Qp)\, R(MV_{\text{median}} - MV) \qquad (4)$$

where MV is the candidate MV, λ is the Lagrange multiplier which changes with the quantization parameter Qp, MV_median is the median motion vector predictor for MV coding in H.264, and R is the bit rate to encode the motion vector difference. In the H.264 JM software, the λ is λ_ME defined as

$$\lambda_{ME} = \sqrt{\lambda_{\text{mode}}} \qquad (5)$$

where λ_mode is

$$\lambda_{\text{mode},I,P} = 0.85 \times 2^{(Qp-12)/3} \qquad (6)$$

for I-block and P-block and is

$$\lambda_{\text{mode},B} = \max\left(2, \min\left(4, \frac{Qp-12}{6}\right)\right) \times \lambda_{\text{mode},I,P} \qquad (7)$$

for B-block.

While RD can give significantly better RD performance than SAD, it is hard to design efficient hardware for RD for at least two reasons. First, RD computation requires floating-point operations for the multiplication of λ and R, which is time and resource consuming. If this is to be relieved by using lookup tables, it would require huge chip area for the lookup tables. Second, the data flow in the computation of MV_median is irregular and requires a large amount of on-chip memory to store the required past MVs, as noted before.

In RDOMFS, we wish to approach the performance of RD and the hardware-friendly nature of SAD. Here, we propose a novel hardware-friendly RD-like cost function RD_smvp, defined as

$$RD_{\text{smvp}}(MV) = \mathrm{SAD}(MV) + \lambda_{\text{smvp}}(Qp)\, R(\mathrm{SMVP} - MV) \qquad (8)$$

where the irregular MV_median is replaced by our regular SMVP as defined in (3), and the floating-point λ is replaced by a hardware-friendly power-of-2 λ_smvp

$$\lambda_{\text{smvp}} = 2^{n}; \quad n = \left\lfloor \frac{\ln \lambda_{ME}}{\ln 2} \right\rfloor. \qquad (9)$$

The proposed RD_smvp can address the problems mentioned above. First, with our λ_smvp, the multiplication of λ_smvp and R is simply a left or right shift of R by n bits and can be easily implemented in hardware. Fig. 3 shows the λ_smvp and original λ values for I, P, and B frames.
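The cost in (8) and the power-of-2 multiplier in (9) can be sketched in software as follows. Several points are assumptions made only for this illustration: λ_ME is computed for I/P blocks with the JM form λ_mode = 0.85 × 2^((Qp−12)/3); the rate R is approximated by the signed Exp-Golomb code length of each MV difference component, which the text does not specify; and all function names are hypothetical.

```python
import math


def lambda_me(qp):
    """lambda_ME for I/P blocks: square root of lambda_mode."""
    lambda_mode = 0.85 * 2.0 ** ((qp - 12) / 3.0)
    return math.sqrt(lambda_mode)


def lambda_shift(qp):
    """Exponent n of equation (9), so that lambda_smvp = 2**n."""
    return math.floor(math.log2(lambda_me(qp)))


def se_golomb_bits(v):
    """Bit length of the signed Exp-Golomb code of v (assumed rate model)."""
    code_num = 2 * abs(v) - (1 if v > 0 else 0)
    return 2 * (code_num + 1).bit_length() - 1


def mv_rate(mv, smvp):
    """Bits needed for the MV difference between a candidate MV and the SMVP."""
    return se_golomb_bits(mv[0] - smvp[0]) + se_golomb_bits(mv[1] - smvp[1])


def rd_smvp(sad, mv, smvp, qp):
    """Equation (8) with the multiplication replaced by a bit shift."""
    n = lambda_shift(qp)
    rate = mv_rate(mv, smvp)
    weighted = rate << n if n >= 0 else rate >> -n   # left or right shift by |n|
    return sad + weighted


print(lambda_shift(28))                     # 2, i.e. lambda_smvp = 4
print(lambda_shift(12))                     # -1, i.e. lambda_smvp = 1/2
print(rd_smvp(100, (3, -1), (1, 0), 28))    # 100 + (8 << 2) = 132
```

For Qp = 28 the exact λ_ME is about 5.85, which (9) rounds down to 4. The cost comparison then needs only an integer adder and a shifter.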
While the approximation error between λ and λ_smvp can degrade the performance, our simulation results in Section II-C suggest that the performance degradation is minimal. Second, the identical SMVP definition for all sub-blocks of all sizes makes the data flow in the computation of RD_smvp regular, and the amount of required on-chip memory is greatly decreased, if not eliminated.

C. Simulation Results

Experiments are performed to compare the RD performance of the proposed RDOMFS with three algorithms: FS-RD-var-mvp, FS-SAD-var-zero, and UMHexagonS [7]. FS-RD-var-mvp is the default ME method in the H.264 JM software and should have very good, if not the best, RD performance in spite of its high computational complexity. UMHexagonS is one of the state-of-the-art multiple-mvp FMEs included in JM and should have similar or slightly lower RD performance than FS-RD-var-mvp, but at a significantly lower computational complexity. FS-SAD-var-zero is included in the experiment because most existing ME architectures are based on it. FS-SAD-var-zero should have lower RD performance than FS-RD-var-mvp.

The experiments are done on the JM14.1 reference software with various search ranges. Many sequences with different resolutions and motion levels are tested. The PSNR is shown against the bit rate for six challenging sequences, three being CIF and three being 720P (1280 × 720), all at 30 frames/s. In Fig. 4, we show the RD curves for three of the six test sequences. The detailed values are shown in Tables I and II. Some 1080P sequences are also tested, but not shown due to limited space. (Some partial results for 1080P are shown in Table III.) Using FS-RD-var-mvp as the reference, the corresponding BD-PSNR and BD-bitrate [19], [20] are shown in Table I. Similarly, BD-PSNR and BD-bitrate are computed in Table II using RDOMFS with SR = 32 as the reference. (A method with positive BD-PSNR and negative BD-bitrate has better RD performance than the reference.)
The CIF sequences are Foreman, Soccer, and Mobile. The 720P sequences are Jets, Raven, and Crew. In the figures, we use SR = n to mean a search range of [−n/2, n/2). A wide range of search range (SR) values, namely 4, 8, 16, 32, 64, and 128, are used in the simulation, though not all are shown due to limited space. We study the effect of SR on the RD performance because a smaller SR has many advantages including lower latency, lower memory bandwidth, and lower power consumption. But a smaller SR may have the disadvantage of lower RD performance due to poor motion compensation as a result of out-of-range motion. Thus, we want to study this tradeoff for the proposed RDOMFS.

In Table I and Fig. 4, the RD performance of UMHexagonS is slightly lower than that of FS-RD-var-mvp, while that of FS-SAD-var-zero is significantly lower than both FS-RD-var-mvp and UMHexagonS, as expected. The proposed RDOMFS manages to achieve RD performance similar to that of FS-RD-var-mvp in spite of the hardware-friendly modifications: the SMVP and the RD-like cost function. In general, SR = 4 is too small, as all methods with SR = 4 are found to have a considerable, if not significant, RD drop compared with SR = 64. For FS-RD-var-mvp and UMHexagonS, RD performance is almost the same for the rest of the SR values. For FS-SAD-var-zero, which is used in most existing architecture designs, the RD performance at SR = 64 is significantly worse than that of the other three methods at SR = 64, and a considerable RD drop is observed as SR decreases. For the proposed RDOMFS, RD performance is very similar for SR = 16, 32, 64, and 128, but some drop is observed for SR = 8. We will choose SR = 32 for RDOMFS in the hardware implementation in the next section.

RDOMFS contains mainly two simplification tools: the hardware-friendly RD function (tool1) and the use of the SMVP search center (tool2). Here, we do a partial experiment to investigate the effect of the two tools using the 1080p test sequences Traffic and ParkJoy, and the results are shown in

Fig. 4. RD performance of FS-RD-var-mvp, FS-SAD-var-zero, and RDOMFS for three typical test sequences with various SR. (a) Foreman. (b) Crew. (c) Jets.

TABLE I
Performance of FS-RD-var-mvp, FS-SAD-var-zero, UMHexagonS, and RDOMFS

Methods (column groups, in order): FS-RD-var-mvp (SR=32), FS-SAD-var-zero (SR=32), RDOMFS (SR=32), UMHexagonS (SR=32)
Columns within each group: Bitrate (kbit/s), PSNR (dB); rows are listed by Sequence and Qp

Foreman (CIF at 30 frames/s): BD-PSNR (dB); BD-bitrate % 1.12% 1.95%
Soccer (CIF at 30 frames/s): BD-PSNR (dB); BD-bitrate % 0.4% 5.63%
Mobile (CIF at 30 frames/s): BD-PSNR (dB); BD-bitrate % 2.72% 0.07%
Jets (720P at 30 frames/s): BD-PSNR (dB); BD-bitrate % 4.75% 14.48%
Raven (720P at 30 frames/s): BD-PSNR (dB); BD-bitrate % 6.55% 6.77%
Crew (720P at 30 frames/s): BD-PSNR (dB); BD-bitrate % 0.17% 0.83%

TABLE II
Performance of RDOMFS with Various SR

Methods (column groups, in order): RDOMFS (SR=32), RDOMFS (SR=16), RDOMFS (SR=8), RDOMFS (SR=4)
Columns within each group: Bitrate (kbit/s), PSNR (dB); rows are listed by Sequence and Qp

Foreman (CIF at 30 frames/s): BD-PSNR (dB); BD-bitrate % 1.01% 4.40%
Soccer (CIF at 30 frames/s): BD-PSNR (dB); BD-bitrate % 4.8% 21.01%
Mobile (CIF at 30 frames/s): BD-PSNR (dB); BD-bitrate % 0.12% 1.00%
Jets (720P at 30 frames/s): BD-PSNR (dB); BD-bitrate % 6.95% 14.48%
Raven (720P at 30 frames/s): BD-PSNR (dB); BD-bitrate % 8.06% 9.82%
Crew (720P at 30 frames/s): BD-PSNR (dB); BD-bitrate % 0.15% 0.27%

TABLE III
Performance of RDOMFS

Methods (column groups, in order): FS-RD-var-mvp (SR = 128), UMHexagonS (SR = 128), RDOMFS1 (SR = 32), RDOMFS2 (SR = 32), RDOMFS (SR = 32)
Columns within each group: Bitrate (kbit/s), PSNR (dB); rows are listed by Sequence and Qp

Traffic (1080 at 50 frames/s): BD-PSNR (dB); BD-bitrate % 0.48% 1.10% 0.77%
ParkJoy (1080 at 30 frames/s): BD-PSNR (dB); BD-bitrate % 0.16% 0.07% 0.06%

Table III. RDOMFS1 is RDOMFS with tool1 but not tool2. RDOMFS2 is RDOMFS with tool2 but not tool1. RDOMFS, RDOMFS1, and RDOMFS2 are simulated with SR = 32. For the sake of comparison, we also simulate FS-RD-var-mvp and UMHexagonS, both with SR = 128. From Table III, we can observe that RDOMFS with SR = 32 has very similar BD-PSNR and BD-bitrate to FS-RD-var-mvp with SR = 128. Both RDOMFS1 and RDOMFS2 have performance similar to that of FS-RD-var-mvp, which verifies that the performance degradation due to the two simplification tools in RDOMFS is rather negligible.

III. Proposed Reconfigurable ME Architecture for RDOMFS

In this section, we propose a reconfigurable architecture for RDOMFS based on a 2-D systolic PE array. Recall that RDOMFS is much more regular than FS-RD-var-mvp due to the use of the SMVP and the hardware-friendly RD-like cost function. In terms of regularity, RDOMFS is slightly worse than FS-SAD-var-zero due to the need to compute the product of λ_smvp and R and to generate the SMVP.

A straightforward hardware design for RDOMFS would include several major components: a 2-D systolic PE array with one PE to process one pixel, an adder tree to calculate the 41 possible SADs for all the possible sub-blocks, and on-chip or off-chip memory to store all the past MVs required for the computation of the SMVP. For the local FS, the common scanning order is the Raster Scan order, which can achieve a good data re-use ratio.

Such a design typically has several problems. First, it is not reconfigurable and cannot achieve a different data re-use ratio without significant hardware changes. Second, while Raster Scan can give a good data re-use ratio, it does not fully exploit the potential for data re-use. It has good data re-use in the horizontal direction, but not in the vertical direction. Ineffective data re-use results in high power consumption, more on-chip memory, and high latency.
Third, past MVs need to be stored in on-chip or off-chip memory for the calculation of the SMVP in RDOMFS. Although the memory requirement of the SMVP is already rather small with our use of MV^left_{16×16}, MV^top_{16×16}, and MV^top-right_{16×16}, it would still cause considerable latency and power consumption to load the required MVs.

While the straightforward design can achieve good performance, we seek to develop a novel design to address these three issues. First, it should be reconfigurable. Second, it should achieve a higher data re-use ratio, especially in the vertical direction. Third, it should not store past MVs. In this section, we present a novel architecture with several special features: a novel Smart Snake (SS) scanning order instead of Raster Scan, special hardware to achieve different data re-use ratios and to avoid redundant data loading, and a multi-resolution MVP re-use scheme based on SS to avoid the storage of past MVs.

A. System Overview

The top-level block diagram of the proposed architecture is shown in Fig. 5. It contains a 2-D PE array with one PE to compute the SAD for one pixel. The pixel-wise SADs are combined in the 2-D adder tree (2DAT) to compute the 41 possible SADs for sub-blocks of different sizes. The MVs of past MBs are propagated in the adaptive shift register array (ASRA) and are used to compute the SMVP, which in turn is used to compute the MV cost. Finally, the RD_smvp is computed by adding the product of λ_smvp and the MV cost to the SAD, and the best sub-block combination with its corresponding best MV is selected.

The 2-D PE array contains 256 PEs. Each PE stores a pixel of the current MB. Reference pixels are propagated into the PE array to calculate the SAD. Conceptually, the PE array has 16 sub-arrays, each with 4 × 4 PEs corresponding to a 4 × 4 sub-block. A reconfigurable register array (RRA) is introduced to help achieve reconfigurable capability and a higher data re-use ratio.
This is a key module for the proposed SS scanning order, which will be introduced in the next subsection.

After the pixels of the current MB are loaded, reference pixels are propagated into the PE array. In each clock cycle, the 256 PEs compute the 256 pixel-wise SADs for a search location in the search window and pass them to the 2DAT. Among the 41 SADs to be computed, there are sixteen 4 × 4 SADs, eight 8 × 4, eight 4 × 8, four 8 × 8, two 16 × 8, two 8 × 16, and one 16 × 16. The 2DAT takes two clock cycles to compute the sixteen 4 × 4 SADs, three clock cycles for the eight 8 × 4 and eight 4 × 8, four clock cycles for the four 8 × 8, five clock cycles for the two 16 × 8 and two 8 × 16, and six clock cycles for the 16 × 16.

The MVs of past MBs propagated in the ASRA are re-used to compute the SMVP during the loading phase of the current MB pixels, and thus SMVP computation does not require extra clock cycles. A reconfigurable feature is that the delay cycles in the ASRA can be adjusted so that it can be easily adapted to different frame sizes. The SMVP is used to compute the MV cost, which is passed to the best MV selector along with the 41 SADs from the 2DAT. The proposed hardware-friendly RD-like cost functions of all the candidate MVs are compared, and the best MV is selected after scanning all possible locations in the search window.

B. Smart Snake Scanning Order

Consider a search window of size 2P × 2Q and a macroblock size of N × N. When full search is performed in the window, most traditional architectures use Raster Scan as shown in Fig. 6(a). In Raster Scan, the search locations in the first row are scanned from left to right, followed by the second row from left to right, and so on. Raster Scan is effective in re-using data horizontally with a relatively high data re-use ratio. For example, when the upper-left search location is processed, all the pixels in the N × N reference block are loaded into the PE array.
When the next search location in Raster Scan order is processed, 1 × N pixels are loaded with (N − 1) × N pixels re-used. However, there is no data re-use between adjacent rows. Thus many pixels are loaded up to N times, with the second time onward being redundant loading.

The data re-usability is improved slightly in some architectures by another scanning order called Snake Scan, as shown in Fig. 6(b). Snake Scan processes the first row from left to right, then the second row from right to left, then the third row from left to right, and so on. During horizontal scanning

along a row, Snake Scan re-uses (N − 1) × N pixels from one search point to another. After row k (for any k) is processed, (N − 1) × N pixels of the last search location in row k are re-used in the first search point in row k + 1. In other words, Snake Scan is slightly better than Raster Scan by re-using data between adjacent rows. However, when processing the subsequent search points (in the horizontal direction) in row k + 1, the other pixels loaded during row k processing are not re-used, leading to a lot of redundant loading. Note that in both Raster Scan and Snake Scan, the data re-use ratio is fixed for a fixed search window size. In [12], a Modified Snake Scanning order was adopted to achieve a higher data re-use ratio than Raster Scan. And [21] proposed a novel scanning order, which we call ASAP Raster Scan, to get the SAD for each partition as soon as possible. Although it can reduce the number of registers needed to save the SADs temporarily for generating all partitions, it incurs more redundant loading than Raster Scan.

Fig. 5. Top-level block diagram of proposed architecture.

Fig. 6. (a) Traditional Raster Scan. (b) Traditional Snake Scan. (c) Proposed Smart Snake Scanning order (SS). (d) Two sub-regions enlarged.

Here we propose a novel scanning order called Smart Snake (SS) which can achieve variable data re-use ratios and minimum redundant data loading. In particular, in each search window, each reference pixel is loaded once and only once in SS. In the proposed SS scan, we divide the search window into an array of non-overlapping rectangular sub-regions that span the search window. An example with two rows and three columns of sub-regions is shown in Fig. 6(c). Basically, in each

rectangular sub-region, we perform Snake Scan with some tricks to achieve significantly higher data re-use. After one sub-region is searched, the scan moves into an adjacent sub-region and Snake Scan is applied again. In different sub-regions, the Snake Scan may be performed from top to bottom (e.g., sub-region L_1) or from bottom to top (e.g., L_2). It may start from the left and end at the right (e.g., L_1, L_2, L_3), or start from the right and end at the left (e.g., L_4, L_5, L_6). It may be horizontal (e.g., L_1, L_2) or vertical (e.g., L_3, L_4). Here we use horizontal to mean the original Snake Scan, which processes the search points row by row, and vertical to mean column-by-column Snake Scan.

We restrict the width (or the height) of each sub-region to be less than or equal to a parameter M. Then we construct a structure called the reconfiguration register array (RRA), which is an array of (M − 1) × (N − 1) registers. We now describe the data loading behavior of SS in L_1, which contains two initialization steps (A and B) unique to L_1 and two steady-state steps (C and D) common to all sub-regions. These steps are labeled in Fig. 6(d). We assume the size of L_1 is W × H, with W ≤ M. Step A is used to process the top-left search location in L_1. Step B is performed W − 1 times to process the rest of the top row (row 1) of L_1, moving from left to right. After one row has been processed in one direction, Step C moves to the next row and Step D is performed W − 1 times to process the rest of that row in the opposite direction.

In Step A, the N × N reference pixels corresponding to the upper-left search location are propagated into the PE array, one clock cycle for each column of N pixels. This takes N set-up clock cycles with a data loading rate of N pixels/cycle.
After the pixelwise SAD computation, the right N − 1 columns are propagated within the PE array as they are needed for the following search locations in row 1. The lower N − 1 pixels of the remaining (left) column will be needed for future search points (in rows 2, 3, and so on) and are propagated from the PE array to the RRA. The top pixel of the left column is discarded as it is no longer needed by the algorithm.

Step B is applied after Step A. Step B uses W − 1 clock cycles to process the remaining W − 1 search locations in row 1. In each clock cycle, a new search location is processed, for which a new column of N pixels is loaded (at N pixels/cycle). The right N − 1 columns are propagated within the PE array and the bottom N − 1 pixels of the remaining column are propagated from the PE array to the RRA.

After a row of search locations is processed, Step C is applied to move down one search point to the next row in one clock cycle. The bottom N − 1 rows are propagated within the PE array and a new row of N pixels is loaded (at N pixels/cycle). The RRA remains unchanged.

Step D is applied after Step C. It uses W − 1 clock cycles to process the remaining W − 1 search locations in the current row. In each clock cycle, only one new pixel is loaded, N − 1 pixels are propagated back from the RRA, and N − 1 columns are propagated within the PE array. Thus, the data loading rate is reduced greatly from N pixels/cycle in Steps A, B, and C to only 1 pixel/cycle in Step D. The bottom N − 1 pixels of the last column will be needed for future search points and thus are propagated from the PE array to the RRA. Steps C and D are applied repeatedly until the last search point in L_1 is reached. Note that there is no redundant data loading in L_1.

After one sub-region is processed, a neighboring sub-region is processed next, using Snake Scan. Four steps A′, B′, C′, and D′, similar to the corresponding Steps A, B, C, and D in L_1, are used.
Essentially, Steps B′, C′, and D′ are similar to Steps B, C, and D, respectively, except that their processing directions may be different (width-wise processing to the right or to the left, or length-wise processing to the top or bottom). But Step A′ is significantly different from Step A, in the sense that Step A′ uses only one clock cycle to process the first search point, and it performs either Step B or Step C depending on the relative locations of the two sub-regions and their Snake Scan directions (vertical or horizontal). This is applied repeatedly until all the sub-regions are processed.

In the special case of only one sub-region such that W = 2P = M, there is no redundant data loading and hardware utilization can reach 100%. But if the search window size is large (e.g., HDTV), the size of the required RRA would be large. In such a situation, the use of sub-regions allows the size of the RRA to be reduced significantly from (2P − 1) × (N − 1) to (M − 1) × (N − 1), at the expense of redundant loading and a lower data re-use ratio because, when there is more than one sub-region, some reference pixels will be loaded more than once. For example, between L_1 and L_2, (N − 1) × H reference pixels will be loaded twice.

Another advantage of the use of sub-regions is that, by adjusting the size of each sub-region, we can achieve different trade-offs between the size of the active RRA and the data re-use ratio. When bandwidth is the most critical issue, the system can turn on the whole RRA and use the maximum sub-region size to achieve the maximum data re-use ratio. However, if power is most critical, the RRA can be partially turned off and a smaller sub-region size can be used to achieve lower power at the expense of lower data throughput and data re-use ratio.
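The loading behavior described above can be checked with a little arithmetic. The following Python sketch is only an illustration, not the authors' implementation: all function names are hypothetical, and the worst-case SS count assumes that M divides both search window dimensions and that every M × M sub-region reloads its whole footprint. It counts the reference pixels loaded by Steps A to D in one sub-region and compares the result with Raster Scan and Snake Scan as described in the text.

```python
def ss_subregion_loads(width, height, n):
    """Reference pixels loaded for one width x height sub-region (Steps A-D)."""
    loads = n * n                 # Step A: N columns of N pixels each
    loads += (width - 1) * n      # Step B: one new column of N pixels per search point
    for _ in range(height - 1):   # every remaining row of search points
        loads += n                # Step C: one new row of N pixels
        loads += (width - 1) * 1  # Step D: a single new pixel per search point
    return loads


def footprint(width, height, n):
    """Distinct reference pixels covered by width x height search points."""
    return (width + n - 1) * (height + n - 1)


def raster_loads(cols, rows, n):
    """Raster Scan: every row of search points starts from scratch."""
    return rows * (n * n + n * (cols - 1))


def snake_loads(cols, rows, n):
    """Snake Scan: only the first row starts from scratch; each search
    point in a later row loads N new pixels."""
    return (n * n + n * (cols - 1)) + (rows - 1) * n * cols


def ss_worst_case_loads(cols, rows, n, m):
    """Worst case for SS (assumption): each M x M sub-region reloads its
    whole footprint; M is assumed to divide cols and rows."""
    subregions = (cols // m) * (rows // m)
    return subregions * footprint(m, m, n)


def redundant_ratio(loads, cols, rows, n):
    """(loads - S) / S, where S is the footprint of the whole search window."""
    s = footprint(cols, rows, n)
    return (loads - s) / s


if __name__ == "__main__":
    cols, rows, n = 32, 32, 16    # 32 x 32 search points, 16 x 16 block
    cases = [
        ("Raster Scan", raster_loads(cols, rows, n)),
        ("Snake Scan", snake_loads(cols, rows, n)),
        ("SS (M = 16)", ss_worst_case_loads(cols, rows, n, 16)),
        ("SS (M = 32)", ss_worst_case_loads(cols, rows, n, 32)),
    ]
    for name, loads in cases:
        print(name, loads, f"{redundant_ratio(loads, cols, rows, n):.2%}")
```

For a 32 × 32 search window with N = 16, the sketch gives 24064 loads for Raster Scan, 16624 for Snake Scan, 3844 for SS with M = 16, and 2209 for SS with M = 32. The last value equals the number of distinct pixels in the window, which is the "loaded once and only once" property claimed for a single sub-region. Multiplying by 11 880 macroblocks per second (CIF at 30 frames/s with 396 macroblocks per frame, an assumption made here) gives roughly 285.9M, 197.5M, and 45.7M pixels/s, in line with the Pixel/s values that survive in Table IV.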
To investigate the data re-usability of different scanning orders, we define the redundant loading ratio R to be

$$R = \frac{L_{\text{actual}} - S}{S} \tag{10}$$

where S = (2P + N − 1) × (2Q + N − 1) is the number of reference pixels inside a search window and L_actual is the number of reference pixel loads actually performed to finish the search inside the search window. Note that R ≥ 0 by definition. R is equal to zero when each pixel inside the search window is loaded only once and no redundant loading occurs. For Raster Scan and Snake Scan, the redundant loading ratios R_raster and R_snake are

$$R_{\text{raster}} = \frac{2Q\,[N^2 + N(2P - 1)] - S}{S} \tag{11}$$

$$R_{\text{snake}} = \frac{[N^2 + N(2P - 1)] + 2NP(2Q - 1) - S}{S}. \tag{12}$$

For the proposed SS Scan, the worst-case redundant loading


ratio R_SS is

$$R_{SS} = \frac{4PQ\,(M + N - 1)^2 / M^2 - S}{S}. \tag{13}$$

These redundant loading ratios for various scanning methods, including Modified Snake Scan [12] and ASAP Raster Scan [21], are computed for two representative cases: small video resolution (CIF) with a small search window (32 × 32), and large video resolution (1080P) with a large search window (128 × 96). They are shown in Table IV together with the corresponding data loading rates (pixels per second). As expected, Snake Scan has a considerably better (smaller) redundant loading ratio than Raster Scan, especially for low-resolution video. But the proposed SS scan achieves a much better redundant loading ratio than any of the other methods. In particular, when one sub-region is used (M = 2P = 2Q = 32), SS achieves zero redundant loading, as expected. The ASAP Raster Scan [21] has the largest R, i.e., the worst redundant loading. The Modified Snake Scan tends to have a similar R to the traditional Snake Scan.

Fig. 7. Hardware diagram of proposed multi-resolution MVP re-use scheme.

C. Multi-Resolution MVP Re-Use Scheme

If there is no MVP re-use in the architecture design (including RDOMFS), all required past MVs need to be stored in on-chip or off-chip memory and then loaded back to generate the MVPs of the current MB (and to compute the RD cost when needed). This results in increased on-chip memory size, huge latency, and considerable power consumption. Thus MVP re-use is highly desirable. We use Fig. 8 to illustrate the proposed MVP re-use method. Here we use subscripts L, T, and TR to indicate the block to the left, top, and top-right of a particular MB, respectively. E.g., R_TR is the block to the top-right of macroblock R. When we process the macroblock R (highlighted in red), it requires the best MVs of three neighboring MBs: R_L, R_T, and R_TR.
Similarly, the macroblock G (highlighted in green) requires the MVs of G_L, G_T, and G_TR. Recall that MBs are processed in Raster Scan order (though the locations in the search window of a particular MB are processed in SS order). After motion estimation is finished for one row of MBs (for example, the row containing R), the next row of MBs is processed (for example, the row containing G). It is obvious that R_L is the same as G_T and the MV of R_L can be re-used for G after a certain delay. The delay (in terms of clock cycles) depends on the width of the current frame. So our main idea is that, rather than storing all the MVs of the MBs in memory, we simply propagate the MV of the current MB to an ASRA, which can propagate the MV with a variable delay that can be different for different resolutions (frame sizes). The variable delay helps to make this design reconfigurable.

Fig. 7 shows the hardware architecture for our multi-resolution MVP re-use scheme. It consists of the ASRA and three MVP registers, MVPR-L, MVPR-T, and MVPR-TR, which store the MVs of the left, top, and top-right MBs, respectively. After an MB is processed, its MV is propagated into MVPR-L during the initialization step (Step A) of SS of the next MB. This achieves high hardware utilization and low latency because it does not take additional clock cycles. The ASRA contains shift registers with multiple outputs (corresponding to different delays) for different resolutions. A MUX is used after the ASRA to select the output corresponding to the resolution. To support a video width of 4096, we use 512 registers in the ASRA, with two registers for the MV of each MB in a row. Although the size of the ASRA is fixed, the MUX selects a different data bypass to achieve a different delay for each frame size.

Fig. 8. MVP re-use relationship between MBs in two MB rows.

Fig. 9. (a) Concept diagram. (b) Hierarchical architecture of proposed 2-D Adder Tree.

D. 2-D Adder Tree (2DAT)

Fig.
9 shows the concept diagram and hierarchical architecture of the proposed 2DAT. The 2DAT takes a total of two clock cycles to get sixteen 4 × 4 SADs and, similarly, three clock cycles to get eight 8 × 4 and eight 4 × 8 SADs, four clock cycles to get four 8 × 8 SADs, five clock cycles for two 16 × 8 and two 8 × 16 SADs, and six clock cycles to get the final SAD.

TABLE IV
Data Re-Use for Various Scanning Orders
(Left group: 2P = 32, 2Q = 32, N = 16, CIF at 30 frames/s. Right group: 2P = 128, 2Q = 96, N = 16, 1080P at 30 frames/s.)

Scanning Method | S | L_actual | R (%) | Pixel/s | S | L_actual | R (%) | Pixel/s
Raster Scan | | | | 285.8M | | | | 53.4G
Snake Scan | | | | 197.5M | | | | 47.8G
Modified Snake [12] | | | | 196M | | | | 47.7G
ASAP Raster Scan [21] | | | | | | | | 334G
SS (M = 16) | | | | 45.6M | | | | 11G
SS (M = 32) | | | | | | | | 6.4G

Fig. 10. Data flow of proposed architecture for M = W = 3.

E. PE Design

In this section, we use a 2-D PE array, which consists of multiple single PEs as shown in Fig. 11, together with an RRA as shown in Fig. 12, to illustrate the PE design. Without loss of generality, we assume M = 3 and N = 4. This PE array contains N² single PEs, with one PE to process one pixel in the current N × N block. The red, blue, green, and black arrows are the leftward, rightward, upward, and downward reference data paths, respectively. So each single PE has a MUX to select reference data from four possible directions. The blue dashed arrows represent the propagation of the reference pixels to the RRA and the black dashed arrows represent the propagation of the pixelwise SADs to the 2DAT. Fig. 11 shows the architecture of a single PE. It contains a current pixel register to store the current pixel, a reference pixel register to store the reference pixel, an absolute difference calculator, a MUX to select reference data from four possible directions (left, right, top, and bottom), and four latches (L1, L2, L3, and L4) for propagating the reference pixel in the four directions.

Fig. 11. Single PE design.

F. Data Flow

Table V shows the data flow of the proposed architecture for the case of N = 4 and M = 3. As M = 3, the RRA contains two (M − 1) columns, column 1 and column 2.
After the initialization cycles in Step A, the current pixels are stored inside the PE array and the reference pixels are propagated into the PE array. Here, R_ij is the ijth reference pixel in the search window. After the SAD of the first search point is calculated, a new column of reference pixels is loaded into the PE array, three (N − 1) reference pixels are propagated into column 1 of the RRA, and the SAD of a new search point is calculated in Step B1. Step B2 is similar to B1, except that column 1 of the RRA is propagated to column 2 and another three reference pixels are propagated into column 1 from the PE array. In Step C, a new row of reference pixels is loaded into the PE array and the RRA is not changed. In Steps D1 and D2, rather than loading one row (or column) of pixels, the architecture loads only one reference pixel per clock cycle, and data re-use (with ratio (N − 1)/N) is achieved by moving the data from the RRA into the PE array.
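The cycle-by-cycle behavior described above can be sketched as a small schedule generator. This Python sketch is illustrative only: the function name `ss_schedule` and its parameters are hypothetical, it models a single horizontally scanned sub-region, and it assumes the first search point is (−2, −2) in (row, column) form, as read from Table V with the minus signs restored.

```python
def ss_schedule(n, width, rows, top=-2, left=-2):
    """Return (cycle, step, pixels_loaded, search_point) tuples for one
    sub-region of `width` columns and `rows` rows of search points."""
    schedule = []
    cycle = 0

    # Step A: N set-up cycles, one column of N pixels per cycle.
    # The SAD of the first search point is available after the last column.
    for i in range(1, n + 1):
        point = (top, left) if i == n else None
        schedule.append((cycle, f"A{i}", n, point))
        cycle += 1

    col = 0          # column offset of the current search point
    direction = 1    # +1: left to right, -1: right to left

    # Step B: rest of the first row, one new column of N pixels per cycle.
    for i in range(1, width):
        col += direction
        schedule.append((cycle, f"B{i}", n, (top, left + col)))
        cycle += 1

    for r in range(1, rows):
        direction = -direction
        # Step C: move down one row, loading one new row of N pixels.
        schedule.append((cycle, "C", n, (top + r, left + col)))
        cycle += 1
        # Step D: rest of the row, one new pixel per cycle (RRA supplies the rest).
        for i in range(1, width):
            col += direction
            schedule.append((cycle, f"D{i}", 1, (top + r, left + col)))
            cycle += 1

    return schedule


if __name__ == "__main__":
    for entry in ss_schedule(n=4, width=3, rows=4):
        print(entry)
```

With N = 4 and W = 3, the first thirteen entries follow the Step and Search Point columns of Table V, and the per-cycle load counts (four pixels in Steps A, B, and C, one pixel in Step D) agree with the number of pixels listed under "Read From the Memory". The sketch does not model the contents of the RRA.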

Fig. 12. PE array and RRA for M = 3.

TABLE V
Data Flow of the Proposed Architecture When N = 4 and M = 3 = W

Cycle | Step | Read From the Memory | Data in RRA Column 1 | Data in RRA Column 2 | Search Point
0 | A1 | R_00, R_10, R_20, R_30 | | |
1 | A2 | R_01, R_11, R_21, R_31 | | |
2 | A3 | R_02, R_12, R_22, R_32 | | |
3 | A4 | R_03, R_13, R_23, R_33 | | | (−2, −2)
4 | B1 | R_04, R_14, R_24, R_34 | R_10, R_20, R_30 | | (−2, −1)
5 | B2 | R_05, R_15, R_25, R_35 | R_11, R_21, R_31 | R_10, R_20, R_30 | (−2, 0)
6 | C | R_42, R_43, R_44, R_45 | R_11, R_21, R_31 | R_10, R_20, R_30 | (−1, 0)
7 | D1 | R_41 | R_10, R_20, R_30 | R_25, R_35, R_45 | (−1, −1)
8 | D2 | R_40 | R_25, R_35, R_45 | R_24, R_34, R_44 | (−1, −2)
9 | C | R_50, R_51, R_52, R_53 | R_25, R_35, R_45 | R_24, R_34, R_44 | (0, −2)
10 | D1 | R_54 | R_30, R_40, R_50 | R_25, R_35, R_45 | (0, −1)
11 | D2 | R_55 | R_31, R_41, R_51 | R_30, R_40, R_50 | (0, 0)
12 | C | R_62, R_63, R_64, R_65 | R_31, R_41, R_51 | R_30, R_40, R_50 | (1, 0)

TABLE VI
Performance Comparison of Various Architectures

Architecture | [13] | [16] | [17] | [18] | [12] | Proposed RDOMFS (M = 32)
No. of PE | | | | | |
Block size | to 4 × 4 | to 4 × 4 | to 4 × 4 | to 4 × 4 | to 4 × 4 | to 4 × 4
Search method | FS-SAD-var-zero | FS-SAD-var-zero | FS-SAD-var-zero | FS-SAD-var-zero | FS-RD-var-mvp | RDOMFS
Technology | 0.13 µm | 0.18 µm | 0.18 µm | 0.18 µm | 0.18 µm | 0.18 µm
Gate count | 61k | 210k | 160k | 597k | 330k | 103k
Max frequency (MHz) | | | | | |
Power (mW) | | | | | |
SRAM size (bytes) | | | | | |
Max video size | P 1080P 720P
Frames/s | | | | | |
Scanning order | Raster | Raster | Raster | Raster | Modified Snake | Smart Snake
Search range

TABLE VII
Performance Comparison of Various Architectures

Architecture | [22] | [23] | [21] | [24] | Proposed RDOMFS (M = 32)
No. of PE | | | | |
Block size | to 8 × 8 | to 4 × 4 | to 4 × 4 | to 4 × 4 | to 4 × 4
Search method | FME-SAD-var-zero | FS-SAD-var-zero | FS-SAD-var-zero | FS-SAD-var-zero | RDOMFS
Technology | 0.18 µm | 0.13 µm | 0.18 µm | 0.18 µm | 0.18 µm
Gate count | 485.7k | 453k | 39k | 12k | 103k
Max frequency (MHz) | | | | |
Power (mW) | | | | |
SRAM size (bytes) | | | | |
Max video size | 1080P 1080P 720P CIF
Frames/s | | | | |
Scanning order | Raster Raster ASAP Raster Smart Snake
Search range

IV. Implementation Results and Comparison

The proposed architecture was described in VHDL and synthesized by Synopsys Design Compiler with the TSMC 0.18 µm CMOS standard cell library. Table VI shows the details of the implementation results. The design contains about 103k gates excluding SRAM. The total size of SRAM is 1271 bytes. The circuit can operate at frequencies up to 250 MHz allowing the processing of blocks per second when SR = 16. Under a clock frequency of 63 MHz, the architecture allows the real-time processing of (1080P) at 30 frames/s and SR = 32 (with similar RD performance to FS-RD-var-mvp at SR = 128). A comparison between the proposed circuit for RDOMFS (with M = 32) and some typical existing VBSME circuits [12], [13], [16]–[18], [21]–[24] for H.264 is presented in Tables VI and VII. Among the architectures, the proposed RDOMFS circuit provides the lowest redundant loading ratio, which means that every pixel in the search window is loaded only once. Furthermore, except for [12], all architectures use SAD as the similarity criterion, which results in a significant RD drop compared with FS-RD-var-mvp.
With the hardware-friendly SMVP and RD-like cost function, the proposed RDOMFS circuit with a small search range can achieve better RD performance than FS-SAD-var-zero and similar RD performance to UMHexagonS and FS-RD-var-mvp with a bigger search range, especially in high-resolution situations, as shown in the simulation results in Section II. We also observe that the power consumption of the proposed architecture is significantly lower than that of the other architectures. This is because the SS and the multi-resolution MVP re-use scheme help the proposed architecture to achieve the highest data re-use ratio, leading to low latency, high data throughput, and a significant reduction of data loading from the memory.

V. Conclusion

In this paper, we proposed a hardware-friendly MVP-biased motion estimation algorithm, RDOMFS, with a unified single MVP and an RD-like cost function. Simulation results suggest that it achieves RD performance comparable to the FS-RD-var-mvp and UMHexagonS used in the JM software, and is significantly better than the FS-SAD-var-zero commonly used in hardware implementations. We also proposed a matching architecture with a novel SS scanning order and a multi-resolution MVP re-use scheme. The design is implemented with TSMC 0.18 µm CMOS technology and costs 103k gates. At a clock frequency of 63 MHz, the architecture achieves real-time RDO-VBSME at 30 frames/s.

Acknowledgment

The authors appreciate the contribution from E. Ueda.

References

[1] H. Everett, III, "Generalized Lagrange multiplier method for solving problems of optimum allocation of resources," Oper. Res., vol. 11, no. 3.
[2] R. Li, B. Zeng, and M. Liou, "A new three-step search algorithm for block motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 4, no. 4, Aug.
[3] S. Zhu and K. K. Ma, "A new diamond search algorithm for fast block-matching motion estimation," IEEE Trans. Image Process., vol. 9, no. 2, Feb.
[4] M. Rehan et al., "Block-based motion estimation using an enhanced flexible triangle search algorithm," in Proc. Can. Conf. Electr. Comput. Eng., May 2005.
[5] M. Ghanbari, "The cross-search algorithm for motion estimation [image coding]," IEEE Trans. Commun., vol. 38, no. 7, Jul.
[6] A. Tourapis, O. Au, and M. Liou, "Predictive motion vector field adaptive search technique (PMVFAST)-enhancing block based motion estimation," in Proc. SPIE Conf. Visual Commun. Image Process.
[7] Z. Chen, J. Xu, Y. He, and J. Zheng, "Fast integer-pel and fractional-pel motion estimation for H.264/AVC," J. Visual Commun. Image Representation, vol. 17, no. 2, Apr.
[8] A. Tourapis, O. Au, and M. Liou, "Highly efficient predictive zonal algorithms for fast block-matching motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 10, Oct.
[9] G. Cote, B. Erol, M. Gallant, and F. Kossentini, "H.263+: Video coding at low bit rates," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 7, Nov.
[10] Information Technology-Coding of Audio-Visual Objects-Part 2: Visual, ISO/IEC.
[11] S. Kwon, A. Tamhankar, and K. R. Rao, "Overview of H.264/MPEG-4 part 10," J. Visual Commun. Image Representation, vol. 17, no. 2, Apr.
[12] C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, T.-C. Wang, and L.-G. Chen, "Analysis and architecture design of variable block-size motion estimation for H.264/AVC," IEEE Trans. Circuits Syst.-I, vol. 53, no. 3, Mar.

[13] S. Y. Yap and J. McCanny, "A VLSI architecture for variable block size video motion estimation," IEEE Trans. Circuits Syst.-II, vol. 51, no. 7, Jul.
[14] H. Kung, "Why systolic architectures," Computer, vol. 15, no. 1, pp. 37–46, 1982.
[15] D. Moldovan, "On the design of algorithms for VLSI systolic arrays," Proc. IEEE, vol. 71, no. 1, pp. 113–120, Jan. 1983.
[16] L. Deng, W. Gao, M. Z. Hu, and Z. Z. Ji, "An efficient hardware implementation for motion estimation of AVC standard," IEEE Trans. Consumer Electron., vol. 51, no. 4, Nov.
[17] W. Cao, H. Hui, J. Tong, J. Lai, and H. Min, "A high-performance reconfigurable VLSI architecture for VBSME in H.264," IEEE Trans. Consumer Electron., vol. 54, no. 3, pp. 1338–1345, Aug. 2008.
[18] C.-M. Ou, C.-F. Le, and W.-J. Hwang, "An efficient VLSI architecture for H.264 variable block size motion estimation," IEEE Trans. Consumer Electron., vol. 51, no. 4, Nov.
[19] G. Bjontegaard, Calculation of Average PSNR Differences Between RD-Curves, ITU-T SG16 Q.6 document, vol. VCEG-M33, Austin, TX, Apr. 2001.
[20] G. Bjontegaard, Improvements of the BD-PSNR Model, ITU-T SG16 Q.6 document, vol. VCEG-AI11, Berlin, Germany, Jul.
[21] J. Kim and T. Park, "A novel VLSI architecture for full-search variable block-size motion estimation," IEEE Trans. Consumer Electron. I, vol. 55, no. 2, pp. 728–733, May 2009.
[22] Z. Liu, Y. Song, M. Shao, S. Li, L. Li, S. Goto, and T. Ikenaga, "32-parallel SAD tree hardwired engine for variable block size motion estimation in HDTV1080P real-time encoding application," in Proc. IEEE Workshop Signal Process. Syst., Oct. 2007, pp. 675–680.
[23] C.-Y. Kao and Y.-L. Lin, "A high-performance and memory-efficient architecture for H.264/AVC motion estimation," in Proc. IEEE Int. Conf. Multimedia Expo, Jun. 2008.
[24] H. Parandeh-Afshar, P. Brisk, and P. Ienne, "Scalable and low cost design approach for variable block size motion estimation (VBSME)," in Proc. Int. Symp.
VLSI Design Autom. Test, 2009.

Xing Wen (S'XX) received the B.E. degree in electronic engineering from Xidian University, Xi'an, China. He is currently pursuing the Ph.D. degree from the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong. His current research interests include video coding standards, motion estimation, HW/SW co-design, hardware implementation of multimedia algorithms, high-throughput very large scale integration design, multiple description video coding, and visual quality assessment.

Oscar C. Au (SM'XX) received the B.A.Sc. degree from the University of Toronto, Toronto, ON, Canada, in 1986, and the M.A. and Ph.D. degrees from Princeton University, Princeton, NJ, in 1988 and 1991, respectively. After being a Post-Doctoral Researcher with Princeton University for one year, he joined the Hong Kong University of Science and Technology (HKUST), Clear Water Bay, Hong Kong, as an Assistant Professor. He is/has been a Professor with the Department of Electronic and Computer Engineering, Director of the Multimedia Technology Research Center, and Director of the Computer Engineering Program at HKUST. He has published about 280 technical journal and conference papers. His fast motion estimation algorithms were accepted into the ISO/IEC MPEG-4 international video coding standard and the China AVS-M standard. His light-weight encryption and error resilience algorithms were accepted into the China AVS standard. He has four U.S. patents and is applying for more than 60 on his signal processing techniques. He has performed forensic investigation and stood as an expert witness in the Hong Kong courts many times. His main research contributions include video and image coding and processing, watermarking and light-weight encryption, and speech and audio processing. Research topics include fast motion estimation for MPEG-1/2/4, H.261/3/4 and AVS, optimal and fast sub-optimal rate control, mode decision, transcoding, denoising, deinterlacing, post-processing, multiview coding, scalable video coding, distributed video coding, subpixel rendering, JPEG/JPEG2000, HDR imaging, compressive sensing, halftone image data hiding, GPU processing, software-hardware co-design, and so on. Dr. Au is/was an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology, the IEEE Transactions on Image Processing, and the IEEE Transactions on Circuits and Systems, Part I. He is on the editorial boards of the Journal of Signal Processing Systems, the Journal of Multimedia, and the Journal of the Franklin Institute. He is/was the Chairman of the CAS Technical Committee on Multimedia Systems and Applications and a member of the CAS TC on Video Signal Processing and Communications, the CAS TC on DSP, the SP TC on Multimedia Signal Processing, and the SP TC on Image, Video and Multidimensional Signal Processing. He served on the Steering Committee of the IEEE Transactions on Multimedia and the IEEE International Conference on Multimedia and Expo (ICME). He served on the organizing committee of the IEEE International Symposium on Circuits and Systems in 1997, the IEEE International Conference on Acoustics, Speech and Signal Processing in 2003, the ISO/IEC MPEG 71st Meeting in 2004, the International Conference on Image Processing in 2010, and other conferences. He was the General Chair of the Pacific-Rim Conference on Multimedia (PCM) in 2007, and chaired both IEEE ICME and the Packet Video Workshop. He won Best Paper Awards in SiPS 2007 and PCM.

Jiang Xu (M'XX) received the B.S. and M.S. degrees in electrical engineering from the Harbin Institute of Technology, Harbin, China, and the M.A. and Ph.D. degrees in electrical engineering from Princeton University, Princeton, NJ. From 2001 to 2002, he was a Research Associate with Bell Laboratories, Murray Hill, NJ. He was a Research Associate with NEC Laboratories America, Inc., Princeton, from 2003 to 2005, and worked on networks-on-chip. He joined a startup company, Sandbridge Technologies, Lowell, MA. Since 2007, he has been with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, as an Assistant Professor, and has established the Mobile Computing System Laboratory. He has published more than 30 papers in peer-reviewed journals and conferences. His current research interests include multiprocessor systems-on-chip, computer architecture, low-power very large scale integration design, and HW/SW co-design. Dr. Xu received one Best Paper Award. He serves on the organizing and technical committees of many international conferences, including ICCD, CASES, ISVLSI, VLSI, EMSOFT, VLSI-SoC, ICESS, RTCSA, NOCS, ESO, and so on. He currently serves as an Associate Editor of the ACM Transactions on Embedded Computing Systems.

Lu Fang (S'XX) received the B.E. degree from the Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China. She is currently pursuing the Ph.D. degree from the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong.

Run Cha (S'XX) received the B.S. degree in electronic and information engineering from Tianjin University, Tianjin, China. She is currently pursuing the M.Phil. degree from the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong. Her current research interests include sub-pixel interpolation filter design, motion estimation algorithms, and combined intra and inter prediction.

Jiali Li (S'XX) received the B.S. degree in electrical engineering and information science from the University of Science and Technology of China, Hefei, China. She is currently pursuing the Ph.D. degree from the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong. Her current research interests include video coding, GPU acceleration in multimedia, and HW/SW co-design.
