Video Encoder Design for High-Definition 3D Video Communication Systems


INTEGRATED CIRCUITS FOR COMMUNICATIONS

Pei-Kuei Tsung, Li-Fu Ding, Wei-Yin Chen, Tzu-Der Chuang, Yu-Han Chen, Pai-Heng Hsiao, Shao-Yi Chien, and Liang-Gee Chen, National Taiwan University

ABSTRACT

VLSI realization of video compression is the key to real-time high-definition 3D communication systems. The newly established multiview video coding (MVC) standard, an extension profile of H.264/AVC, draws more and more attention for its high compression ratio and free-viewpoint support. Besides providing the 3D experience, multiview video can also give users complete scene perception. However, the multiple-viewpoint throughput requirement of MVC increases the complexity and hardware cost dramatically. The system memory bandwidth, on-chip memory size, and processing data throughput of each module all need to be optimized in an MVC encoder. Therefore, efficient hardware solutions for MVC architecture design are needed. In this article an overview of 3D video coding standard developments and the design challenges of an MVC encoder are discussed. Then the algorithm and architecture optimization schemes are proposed. For the trade-off between system memory bandwidth and on-chip memory size, a cache-based prediction engine is proposed to ease both design challenges. Moreover, the hybrid open-closed loop intra prediction scheme and the frame-parallel pipeline-doubled dual CABAC solve the throughput requirement problem. At the end of this article, based on all the proposed solutions, a prototype single-chip MVC encoder design with processing ability from 4096 × 2160 single-view to 1280 × 720 seven-view is presented.

INTRODUCTION

For advanced TV applications, vivid perception quality is required. Therefore, higher and higher video resolutions, like high-definition (HD) 720p (1280 × 720 pixels) and 1080p (1920 × 1080 pixels), are recommended. In addition, 3D video can bring a realistic 3D perceptual experience to viewers by projecting different views to users' left and right eyes simultaneously. As the technology evolves, many 3D-related applications, such as 3D-TV and free-viewpoint TV, are emerging [1]. In a real-time HD 3D video communication system, three key technologies make it feasible. The first is the stereo or multiview capturing and display device. The second is the coding standard. Since 3D video contains many different view angles, coding algorithms that are different from and more efficient than the conventional single-view video coding standards are required to further reduce the bit rate for communication. Third, an efficient hardware architecture is required to accelerate the coding speed and meet the real-time constraint. Because of the multiple-view-angle characteristic, the data to be processed in a 3D video is several times that of a conventional single-view video. Thus, if a conventional architecture is adopted, the computation complexity and hardware cost are multiplied. In order to transmit and store 3D/multiview contents, an efficient multiview video coding (MVC) scheme is needed. The MPEG 3D Audio/Video (3DAV) Group has been working on the standardization of MVC. In July 2008 MVC was standardized as the Multiview High Profile in H.264/AVC by the MPEG 3DAV Group [2]. The joint MVC (JMVC) software was released by the MPEG 3DAV Group as the reference software and research platform [3]. In the JMVC, H.264/AVC is adopted as the base layer.
In addition, disparity estimation (DE) and disparity compensation (DC), the most significant features in JMVC, can effectively exploit the inter-view redundancy of a multiview video and save 20 to 30 percent of the bit rate. With this bit rate reduction, an HD MVC sequence can be stored on high-end portable multimedia storage such as a Blu-ray disc. However, the coding complexity increases dramatically in MVC because of the hybrid inter-view DE and intra-view motion estimation (ME) prediction schemes. Furthermore, the processing throughput requirement of HD MVC is many times larger than that of the current HDTV specification. Thus, a new and efficient encoder architecture for MVC is desired. In this article the mainstream 3D video coding standards, the design challenges in MVC encoder design, and the proposed solutions are briefly introduced.

The video coding standard development is introduced in the next section. The hardware resource analysis is then presented. Then the proposed MVC architecture design is shown. The final section concludes this article.

FROM 2D TO 3D: VIDEO CODING STANDARD DEVELOPMENT

2D VIDEO CODING: FROM MPEG-1, H.261 TO H.264/AVC

Uncompressed video is impractical to transmit directly because of the enormous size of the raw data. Since 1990 many video coding standards have been defined for storage and transmission. Among the criteria for these coding standards, coding efficiency is the most important. There are two main series of video coding standards: the International Organization for Standardization (ISO) MPEG-x standards and the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) H.26x standards. The MPEG-x series contains MPEG-1, MPEG-2, and MPEG-4. On the other side, the H.26x series runs from H.261 in 1990 through H.263, H.263+, and H.26L. Furthermore, some standards are the result of joint work between these two groups. For example, MPEG-2 is also called H.262 and is the result of a common project. H.264 was then delivered by ISO and ITU-T together, working as the Joint Video Team (JVT). Therefore, H.264 can also be called MPEG-4 Advanced Video Coding (AVC) or H.264/AVC. Being the latest finalized advanced video coding standard from these two main streams, H.264/AVC has the best coding performance. It provides more than 50 percent bit rate reduction over the previous MPEG-2 standard. In order to provide still better rate-distortion (R-D) performance in the future, at the last MPEG meeting a new Joint Collaborative Team between MPEG and ITU-T was created to work on a new standard.

3D VIDEO CODING: FROM MPEG-2 MULTIVIEW PROFILE TO MVC

3D video has always played an important role in the video processing research field, including, of course, 3D video coding. The first finalized 3D video coding standard was the MPEG-2 Multiview Profile. A stereo video sequence can be compressed into a bitstream containing a base layer and an enhancement layer. In addition to the stereo-view representation, another approach to 3D video is the single-view-plus-depth, or so-called 2D + Z, format. The Advanced Three-Dimensional Television System Technologies (ATTEST) project from European Information Society Technologies (IST) and MPEG-C Part 3 from MPEG both focus on this format. The depth information can be captured by a depth sensor. With the depth map, virtual views can be generated by depth image-based rendering (DIBR). However, the technology of depth map generation is not mature enough. It directly causes quality degradation of the rendered virtual views on the receiver side. In order to solve the problems of the previous standards, MVC is proposed as an extension profile of H.264/AVC. In contrast to the single-view-plus-depth format, MVC encodes video data from multiple viewing angles into a single bitstream by hybrid motion- and disparity-compensated prediction. Figure 1 illustrates the overview of an MVC system and the corresponding block diagram of an MVC encoder. The multiview video is captured by a camera array, and the MVC encoder then compresses the multiview video data for transmission or storage. On the decoder side, the reconstructed multiview video can be displayed on various displays such as the currently commercialized HDTV, or newly developed stereo and multiview 3DTV. In an MVC encoder, video frames from the first view channel are compressed by a typical H.264/AVC encoder.
On the other hand, DE and DC are applied to the other view channels to further reduce the inter-view redundancy. This multiple-viewpoint characteristic of MVC avoids the quality degradation caused by an inaccurate depth map. Furthermore, the H.264/AVC-based encoding flow reduces the bit rate overhead for each view. However, the complexity of an MVC encoder is also much higher than that of a single H.264/AVC encoder due to its multichannel characteristic. Therefore, an efficient hardware architecture is urgently required.

Figure 1. Overview of an MVC system and the block diagram of an MVC encoder.

DESIGN CHALLENGES OF AN HD MVC ENCODER

MVC outperforms previous 3D video coding standards by use of an H.264/AVC-based coding scheme. The multiple-view-angle characteristic also avoids the quality uncertainty due to the depth map. However, these features also bring larger complexity and hardware cost than previous standards, especially when the resolution requirement is as high as the HDTV specifications. The main design challenges of an MVC encoder are shown in Fig. 2 and discussed below.

ULTRA HIGH COMPUTATION COMPLEXITY AND THROUGHPUT REQUIREMENT

MVC has large computational requirements because it needs to compress data from multiple viewpoints. In a video coding system, inter-frame redundancy elimination causes most of the complexity. For a single-view video, ME is used to find the inter-frame relationship and reduce the data redundancy in the temporal domain. In MVC, DE is used as well for inter-frame prediction in the inter-view domain. For an N-view multiview sequence, this hybrid ME/DE encoding scheme requires more than N times the computation of a single-view sequence. Figure 2a shows the integer ME/DE (IMDE) computation analysis under different resolutions and view numbers, where different search algorithms used in integer ME/DE and the corresponding computation requirements are listed. Two hardware-oriented algorithms are considered in Fig. 2a. The full search algorithm evaluates all possible candidates over the entire search window (SW) and thus provides the best rate-distortion (R-D) performance.

However, full search requires huge computation. Hierarchical search is a fast algorithm to reduce the computation. By hierarchically downsampling the SW, the required number of search candidates can be reduced to roughly one-tenth that of full search. However, the computation is still too large for the HD MVC specifications. As shown in Fig. 2a, the required computation is over 1000 giga-instructions per second (GIPS) even when hierarchical search is adopted. Meanwhile, a high-end quad-core CPU from Intel, the QX9770, can only provide about 60 GIPS. According to this analysis, hardware acceleration is needed for an HD MVC encoder design.
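As a rough cross-check of these magnitudes, the sketch below recomputes the IMDE workload for several resolution/view combinations. The per-candidate cost and frame rate are illustrative assumptions, so the absolute values will not match the article's curves; what carries over is the linear growth with pixel rate and view count, the roughly tenfold gap between full and hierarchical search, and the distance to a roughly 60 GIPS CPU.

```cpp
// Back-of-envelope IMDE workload in the spirit of Fig. 2a. Constants are
// illustrative assumptions, not the article's exact cost model.
#include <cstdio>

int main() {
    const double fps = 30.0;            // assumed frame rate
    const double ops_per_cand = 256.0;  // assumed: one op per pixel of a 16x16 MB
    struct Spec { const char* name; int w, h, views; };
    const Spec specs[] = {
        {"1280x720  x2", 1280, 720, 2}, {"1280x720  x4", 1280, 720, 4},
        {"1920x1080 x2", 1920, 1080, 2}, {"1920x1080 x3", 1920, 1080, 3},
    };
    for (const Spec& s : specs) {
        // SW width/height ~10 percent of the frame dimensions, as the text states.
        double srx = s.w * 0.10 / 2.0, sry = s.h * 0.10 / 2.0;
        double full_cands = (2 * srx + 1) * (2 * sry + 1);
        double mbs_per_s = (s.w / 16.0) * (s.h / 16.0) * fps * s.views;
        double full = mbs_per_s * full_cands * ops_per_cand / 1e9;
        printf("%s  full search: %7.0f GIPS  hierarchical (~1/10): %6.0f GIPS\n",
               s.name, full, full / 10.0);
    }
    return 0;  // compare: a high-end quad-core CPU delivers ~60 GIPS
}
```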

Figure 2. Design challenges in an HD MVC encoder: (a) IMDE computation analysis; (b) throughput analysis; (c) system bandwidth analysis; (d) on-chip SRAM size analysis.

Another design challenge of an HD MVC encoder is the large data throughput requirement. To encode an N-view MVC sequence, the throughput requirement is about N or more times that of encoding a conventional single-view sequence. However, the throughput of some modules cannot be enlarged by simply duplicating them and processing in parallel. Take entropy coding, for example: the entropy coder in H.264/AVC and MVC, context-based adaptive binary arithmetic coding (CABAC), has very strong data dependence since it needs to consider the previous symbols when generating the current symbol. Therefore, most existing CABAC coder designs can only provide a throughput of one symbol per clock cycle. However, this processing ability is far from the target HD MVC throughput. Figure 2b illustrates the frame-by-frame symbol count analysis on an HDTV sequence. The red line is the largest throughput of a one-symbol CABAC coder. This throughput limit is calculated from the operating frequency and video resolution. Take our target HD MVC specifications, for example. Considering system-on-chip (SoC) integration compatibility, the operating frequency of previous H.264/AVC encoders is selected as no more than 200 MHz [4-6]. However, when the target specifications are as high as HD MVC, the available processing budget for a macroblock (MB) is only about 350 cycles even if the operating frequency is increased to 300 MHz. As shown in Fig. 2b, the symbol count varies greatly between frames because the symbol counts of I-frames and P-frames are much higher than those of B-frames. A conventional CABAC coder can barely deal with the average case, but is infeasible for the maximum symbol rate. Unfortunately, the symbol rate cannot be raised by increasing the parallelism because of the data dependence issues mentioned above. Therefore, a new and efficient architecture is required.
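The roughly 350-cycle budget quoted above follows from simple arithmetic on MB counts. A minimal sketch, assuming 30 frames/s; the raw quotient comes out near 400 cycles at 300 MHz, so the quoted 350 presumably leaves margin for pipeline overhead:

```cpp
// Cycle budget per macroblock under the target HD MVC specifications.
#include <cstdio>

int main() {
    const double freq_hz = 300e6;  // operating frequency from the text
    const double fps = 30.0;       // assumed frame rate
    struct Spec { const char* name; int w, h, views; };
    const Spec specs[] = {
        {"1920x1080 3-view", 1920, 1080, 3},
        {"1280x720  7-view", 1280, 720, 7},
    };
    for (const Spec& s : specs) {
        double mbs_per_s = (s.w / 16.0) * (s.h / 16.0) * fps * s.views;
        double cycles_per_mb = freq_hz / mbs_per_s;
        printf("%s: %.0f MB/s -> %.0f cycles per MB pipeline stage\n",
               s.name, mbs_per_s, cycles_per_mb);
        // A one-symbol CABAC engine can code at most this many symbols per MB,
        // far below the I/P-frame peaks of over 1000 symbols/MB in Fig. 2b.
    }
    return 0;
}
```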

HIGH SYSTEM MEMORY BANDWIDTH AND LARGE ON-CHIP MEMORY SIZE

Since hardware acceleration is needed according to the analysis above, further system and memory analysis is required before implementation. In architecture design, the system memory bandwidth and the on-chip memory size are two major limitations, and the trade-off between them is classic: larger on-chip memory allows lower system memory bandwidth. In a video encoder design, IME, or IMDE in MVC, requires most of the bandwidth and on-chip memory because a large SW must be loaded onto the chip for IMDE. Typically, the width and height of the SW are set to about 10 percent of the frame width and height, respectively. Furthermore, more than one SW is loaded when the frame type is B-frame or the multiple-reference-frame scheme is enabled. In order to reduce the hardware cost, various data reuse schemes, including level C, level C+, level D, and hierarchical search, have been proposed in recent years [7]. The system memory analysis of these algorithms for MVC with different numbers of views and resolutions is shown in Figs. 2c and 2d. Different algorithms select different trade-offs between bandwidth and memory size. For example, level D data reuse has the largest on-chip SRAM size and the lowest memory bandwidth. From the bandwidth point of view, a high-end SoC with a fairly wide 128-bit bus can only support about 4 Gbytes/s of bandwidth even under 100 percent bus utilization and a 250 MHz operating frequency. Meanwhile, the required bandwidth is over 5 Gbytes/s for nearly all algorithms listed in Fig. 2c for a 1080p three-view MVC sequence. On the other hand, if the TSMC 90LP process is used, the lowest point in Fig. 2d, which is about 60 kbytes, occupies an equivalent gate count of 0.57 to 1.94 million under different memory compiler configurations. From Fig. 2d, as the target specification grows, the maximum memory requirement may be as high as dozens or even hundreds of millions of gates, which is far beyond what a high-end SoC system can support. Therefore, a smart strategy to reduce both on-chip memory size and system memory bandwidth is desired.
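To make the level C versus level D trade-off concrete, the following sketch estimates per-reference-frame bandwidth and SW SRAM from simplified textbook data-reuse formulas. The search-range and frame-rate constants are illustrative assumptions, not the article's measured points:

```cpp
// Level C vs. level D data reuse for the IMDE search window (SW), per
// reference frame, luma only, 1 byte/pixel. A sketch under assumed geometry.
#include <cstdio>

int main() {
    const int w = 1920, h = 1080, views = 3;
    const double fps = 30.0;
    const int N = 16;              // MB size
    const int sw_w = 2 * 64 + N;   // assumed SW width  (+/-64 plus the MB)
    const int sw_h = 2 * 32 + N;   // assumed SW height (+/-32 plus the MB)
    const double mbs = (w / (double)N) * (h / (double)N);

    // Level C: horizontally overlapped SWs are reused; each MB loads only
    // one N-pixel-wide stripe of new SW data.
    double c_bw   = mbs * N * sw_h * fps * views / 1e9;   // Gbytes/s
    double c_sram = sw_w * (double)sw_h / 1024.0;         // kbytes

    // Level D: every reference pixel enters the chip once, but a whole
    // frame-wide stripe of SW rows must stay on chip.
    double d_bw   = (double)w * h * fps * views / 1e9;
    double d_sram = (double)w * sw_h / 1024.0;

    printf("Level C: %.2f Gbytes/s, %5.0f kbytes SRAM per reference frame\n", c_bw, c_sram);
    printf("Level D: %.2f Gbytes/s, %5.0f kbytes SRAM per reference frame\n", d_bw, d_sram);
    return 0;
}
```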

PROPOSED MVC ENCODER SOLUTIONS

SYSTEM ARCHITECTURE

Figure 3 shows the system architecture of the proposed MVC encoder. The encoder contains seven kinds of computation engines: integer ME/DE (IMDE), fractional ME/DE (FMDE), intra prediction (IP), motion and disparity compensation (MDC), reconstruction (REC), entropy coding (EC), and the deblocking filter (DB). According to the design challenges described above, instead of simply raising the parallelism of the conventional three- or four-stage MB pipelined architecture used in previous H.264/AVC encoder designs [4-6], an eight-stage MB pipeline is proposed. In order to ease the hardware cost of IMDE, the inter-frame prediction part is split into five MB pipeline stages, and a cache-based prediction engine is adopted. With the cache memory in place, two SW prefetching stages for IMDE and FMDE are added to load SWs into on-chip SRAM prior to the processing stages. They not only reduce the burden on the pipeline-cycle budget but also enhance the hardware utilization of the IMDE and FMDE engines. A no-operation (NOP) stage is inserted to deal with the data dependence between the prefetching and processing stages. After the inter-frame prediction is done, the intra-frame prediction and motion/disparity compensation are performed in parallel in the sixth stage. The reconstruction stage reconstructs the compressed frame as the reference for the following frames. Finally, two EC modules and one DB module run simultaneously in the eighth MB pipeline stage. According to the analysis in the previous section, each pipeline stage only has about 350 cycles under the target HD MVC specifications.

Figure 3. Proposed eight-stage MB pipelined MVC encoder architecture. Note that each stage has about 350 processing cycles if the processing frequency is 300 MHz under the HD MVC specifications.
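The steady-state behavior of the proposed pipeline can be pictured with a toy schedule: eight consecutive MBs in flight, one per stage, advancing every 350-cycle slot. This is a schematic sketch only; the real controller additionally interleaves MBs from different views (the VPMBI scheduling mentioned later):

```cpp
// Occupancy of the eight-stage MB pipeline over ten 350-cycle time slots.
#include <cstdio>

int main() {
    const char* stage[8] = {
        "IMDE Pf", "IMDE", "NOP", "FMDE Pf",
        "FMDE", "IP+MDC", "REC", "dual EC+DB",
    };
    for (int slot = 0; slot < 10; ++slot) {
        printf("slot %2d:", slot);
        for (int s = 0; s < 8; ++s) {
            int mb = slot - s;                 // MB index currently in stage s
            if (mb >= 0) printf("  [%s: MB%d]", stage[s], mb);
        }
        printf("\n");                          // steady state from slot 7 onward
    }
    return 0;
}
```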
PREDICTOR-CENTERED CACHE-BASED MOTION/DISPARITY ESTIMATION

According to the analysis in the previous section, the IMDE part accounts for most of the hardware cost. One major reason is that it requires a large SW buffer, which grows proportionally to the frame resolution. Based on previous work on fast search, only 30 percent of the SW area is really used in common intermediate format (CIF) video, and this utilization decreases to 15 percent in D1 video. That is, much data is loaded into the on-chip SW buffer unnecessarily. However, if we directly shrink the search range, the R-D performance drops greatly. These two characteristics indicate that we only need a small part of the data in the SW, but we cannot assume that the location of this part is always close to the zero MV. Therefore, a predictor-centered cache-based IMDE is proposed. The SW is centered on the predictor, so the search range can be reduced with little quality degradation. The cache memory trades the possibility of cache misses for a much smaller on-chip memory capacity, and is still able to handle the varying and dynamic data access pattern. Figures 4a and 4b compare the conventional ME algorithm and the proposed predictor-centered algorithm. Figure 4a shows the concept of previous hardware-oriented algorithms. In order to find the relationship between frames, an SW is set on the reference frame around the relative location of the current MB. That is, the center of the SW is the zero motion vector (MV). Since the length of the MV grows proportionally to the dimensions of the video frames, the size of the SW also needs to be enlarged to keep the best-matching MV inside, or the quality drops greatly. To prevent this from raising the SW cost, the proposed algorithm shown in Fig. 4b takes the relationship between MVs into consideration. Since MBs inside the same object should have similar MVs, MVs from the neighboring MBs can be set as the initial search hints of the current ME process. If we put the SW around the best hint instead of the zero MV, the required SW size can be dramatically reduced because of the inter-MB MV similarity.

Based on this concept, the detailed flow of the proposed predictor-centered algorithm is as follows (a behavioral sketch is given below). First, several initial hints are set, and each has a tiny SW. The window size is 4 × 4 in our implementation. Second, each candidate in these windows is sent to the IMDE module, and a corresponding R-D cost is calculated. The candidate with the best R-D cost is chosen as the refinement center, and a larger refinement range is defined around it. However, this multiple-hints-and-refinement flow may cause a larger quality drop in cases with non-uniform motion fields. A motion information preserving scheme is proposed to maintain the quality on complex motion fields by obtaining more accurate initial hints and refinement centers. In the proposed scheme, motion information is saved and reused in the intra-coded MBs. The MV predictor defined in the H.264/AVC standard is derived from the MV field. As a result, when an MB is intra-coded, its motion information is not encoded, and no MV is available. However, if the MV pointing to the best-matched block is stored, even when the intra mode wins the inter/intra mode decision, the MV can still be used as a hint for neighboring MBs. Therefore, motion information is reused instead of being discarded even if the block is intra-coded. With the proposed scheme, the R-D performance on all the test sequences used in JVT H.264/AVC meetings can be maintained within a 0.1 dB drop even when the SW size is as small as ±16 × ±16 under our target HD MVC specifications. Based on this multiple-hints-with-refinement scheme, the SW can be retargeted MB by MB dynamically, and therefore the requirement on the SW size is reduced.

Figures 4c and 4d illustrate how the predictors are generated. The performance of this predictor-centered algorithm highly depends on the accuracy of the hints. If a hint targets a wrong region, a larger refinement range is needed to compensate for the quality loss, and the benefit of the predictor-centered algorithm is decreased. Two kinds of hints are used to exploit the spatial and temporal correlation of MVs inside the same object. The first kind is the intra-frame predictors, which are MVs/DVs from the neighboring MBs. Since video processing is done in raster-scan order, only MBs above or to the left of the current MB have MVs available. Thus, the MVs/DVs from the top, top-left, top-right, and left MBs are used as the intra-frame predictors. Furthermore, the zero MV and the motion vector predictor (MVP) defined in H.264/AVC and MVC, which is the median-filtered result of the top, top-right, and left MVs, are also allocated as intra-frame predictors. The inter-view predictors are the other kind of predictors. Because an MVC sequence consists of more than one viewpoint, one object may be captured in more than one view at the same time. Since the object is the same, the motion captured by different cameras is also similar. Therefore, the MVs from the neighboring views are very strong predictors [8]. In fact, after including the inter-view predictors, the required refinement range can be shrunk to 4 × 4, the same size as the tiny search window of a hint. That is, the refinement step in ME can be canceled for those views that use both ME and DE [9].
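The following sketch renders the two-step hint-then-refine flow in software terms. It is a behavioral illustration under assumed parameters (a plain 16 × 16 SAD instead of the full R-D cost, hypothetical window radii), not the chip's actual search engine:

```cpp
// Predictor-centered IMDE, schematically: evaluate a tiny window around each
// hint, then refine around the best hint. Frames are flat 8-bit luma arrays.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <climits>
#include <vector>

struct MV { int x, y; };

static int sad16x16(const uint8_t* cur, const uint8_t* ref,
                    int w, int h, int cx, int cy, MV mv) {
    int sad = 0;
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x) {
            int rx = cx + x + mv.x, ry = cy + y + mv.y;
            if (rx < 0 || ry < 0 || rx >= w || ry >= h) return INT_MAX; // off-frame
            sad += std::abs(cur[(cy + y) * w + cx + x] - ref[ry * w + rx]);
        }
    return sad;
}

// hints: zero MV, median MVP, neighboring MBs' MVs/DVs, inter-view predictor.
MV predictorCenteredSearch(const uint8_t* cur, const uint8_t* ref, int w, int h,
                           int cx, int cy, const std::vector<MV>& hints,
                           int hintRadius, int refineRadius) {
    MV best{0, 0};
    int bestSad = INT_MAX;
    auto scan = [&](MV center, int r) {          // exhaustive scan of a small window
        for (int dy = -r; dy <= r; ++dy)
            for (int dx = -r; dx <= r; ++dx) {
                MV mv{center.x + dx, center.y + dy};
                int s = sad16x16(cur, ref, w, h, cx, cy, mv);
                if (s < bestSad) { bestSad = s; best = mv; }
            }
    };
    for (MV hint : hints) scan(hint, hintRadius); // step 1: tiny window per hint
    scan(best, refineRadius);                     // step 2: refine around the winner
    return best;  // kept even if intra wins, so neighbors can reuse it as a hint
}

int main() {
    const int W = 64, H = 64;
    std::vector<uint8_t> ref(W * H), cur(W * H);
    for (int i = 0; i < W * H; ++i) ref[i] = (uint8_t)(i * 7);
    // Current frame = reference shifted by (+3, +2): the true MV is (3, 2).
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            cur[y * W + x] = ref[((y + 2) % H) * W + (x + 3) % W];
    std::vector<MV> hints = {{0, 0}, {4, 4}};     // e.g., zero MV and a neighbor MV
    MV mv = predictorCenteredSearch(cur.data(), ref.data(), W, H, 16, 16, hints, 2, 4);
    printf("best MV: (%d, %d)\n", mv.x, mv.y);    // prints (3, 2)
    return 0;
}
```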

Figure 4. Proposed IMDE algorithm: a) concept of previous hardware-oriented algorithms; b) the proposed predictor-centered algorithm; c) intra-frame predictor generation and reuse; d) inter-view predictor generation and reuse.

In order to support the dynamic hint refinement access pattern without loading all the pixels in all possible SW locations, a cache system is implemented as the SW buffer. Unlike the conventional cache memory system in the computer architecture field, cache memory used in video processing has several different features. The most significant difference is that video data has 2D spatial coherence rather than the 1D addressing of a general cache memory design. To fully utilize this coherence, the internal index wraps in two dimensions. The three-tuple vector (x, y, frame index) is translated to the tag address and the tag. The tag address points to a tag set, and the tag is compared against that set. Upon a cache hit, the word address locates the word in a five-banked on-chip SRAM. The cache system provides flexible data access. However, the cache miss penalty is considerable. Every time the wanted data is not in the cache, the system needs to be stalled, and the required data is reloaded from the external memory. This stall-and-reload waiting time lowers the hardware utilization. Therefore, two new MB pipeline stages, IMDE prefetch and FMDE prefetch, are added to the proposed MVC system architecture to lower the cache miss rate. After this scheduling optimization and other proposed cache architecture optimizations, including a priority-based replacement policy and a concurrent SW prefetching and reading scheme, the total cycles of cache miss penalty are reduced by 93 percent. That is, on average only 1.2 misses occur during one MB pipeline stage of 350 cycles.
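The (x, y, frame) translation can be pictured as below. The block size, set counts, and tag packing are invented for illustration, since the article does not give the chip's cache geometry; only the two-dimensional wrapping of the set index and the five-banked data store come from the text:

```cpp
// 2D-coherent cache indexing for the SW buffer: the (x, y, frame) tuple is
// split into a set index that wraps in both dimensions and a tag compared
// on lookup. All geometry parameters below are illustrative assumptions.
#include <cstdint>
#include <cstdio>

struct CacheAddr {
    uint32_t set;   // tag address: which tag set to probe
    uint32_t tag;   // compared against the tags stored in that set
    uint32_t bank;  // one of the five SRAM banks holding the data word
};

CacheAddr mapAddress(int x, int y, int frame) {
    const int BLK = 8;       // assumed 8x8-pixel cache blocks
    const int SETS_X = 16;   // index wraps every 16 blocks horizontally
    const int SETS_Y = 8;    // ...and every 8 blocks vertically
    int bx = x / BLK, by = y / BLK;
    CacheAddr a;
    a.set  = (uint32_t)((by % SETS_Y) * SETS_X + (bx % SETS_X));
    a.tag  = (uint32_t)(((frame * 1024) + (by / SETS_Y)) * 1024 + (bx / SETS_X));
    a.bank = a.set % 5;      // five-banked SRAM, as in the text
    return a;
}

int main() {
    CacheAddr a = mapAddress(700, 340, 1);  // a pixel in reference frame 1
    printf("set %u, tag %u, bank %u\n", a.set, a.tag, a.bank);
    return 0;
}
```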

HYBRID OPEN-CLOSED LOOP INTRA PREDICTION

Besides the inter-frame prediction, intra-frame prediction is also used to reduce the spatial redundancy within a frame. Pixels are predicted from the neighboring pixels. In the H.264/AVC high profile and MVC, there are three kinds of intra prediction: intra4 × 4 (I4) mode, intra8 × 8 (I8) mode, and intra16 × 16 (I16) mode. The 4 × 4 discrete cosine transform (DCT) is used in I4 and I16, while the 8 × 8 DCT is used in I8 mode to further improve the coding efficiency. In previous H.264/AVC designs, intra prediction for the baseline and main profiles is well developed for lower specifications like D1 (720 × 480 pixels) and HD 720p. However, two main design challenges lower the efficiency of previous designs. The first issue comes from the data dependence between subblocks. According to the definition of the I4 and I8 modes in the H.264/AVC standard, the subblocks should be processed in zig-zag scan order. Since the predictor pixels in intra prediction are generated from neighboring blocks and are not available until those blocks are reconstructed, each subblock must be processed sequentially. This data dependence also causes the other design challenge, low hardware utilization. As Fig. 5a shows, the sequential processing schedule makes it difficult to increase the parallelism. Thus, it costs about 1300 cycles to finish the intra prediction of one MB in a D1-size video under single-view encoding. However, as mentioned before, the cycle count available for one MB is only around 350 cycles under the target HD multiview specifications. In order to improve the throughput, the hybrid open-closed loop intra prediction scheme is proposed to break the data dependence described above [10]. It is illustrated in Fig. 5b. For subblock boundaries, the original pixels instead of the reconstructed pixels are used as the intra predictors; this is the open-loop part. This modification is based on the assumption that the difference between the original and reconstructed pixels is very small if the target peak signal-to-noise ratio (PSNR) is higher than 35 dB. In our target HD multiview environment, this assumption works well. For MB boundaries, the reconstructed pixels are still used as predictors since these pixels are already reconstructed in the previous MB pipeline stages. The proposed processing schedule is shown in Fig. 5c. Intra prediction on Blk0 and Blk1 in Fig. 5c can start simultaneously because Blk1 does not need the reconstructed pixels from Blk0. Therefore, the parallelism of intra prediction can be largely improved to meet the target HD MVC specifications with little quality loss. However, this open-loop scheme cannot be applied to the reconstruction step, because the original pixels are not available on the decoder side, and a mismatch between the encoder and decoder would break standard compliance. For this reason, the reconstruction step is split off as a standalone stage. MBs are reconstructed in a closed-loop manner in the reconstruction stage. Because of the splitting of the intra prediction and reconstruction stages, only the one best-matched mode needs to be reconstructed. Consequently, the cycle budget is enough even under the closed-loop scheme.

Figure 5. Issues and solutions in the intra stage: a) illustration of the throughput bottleneck due to data dependence; b) the proposed hybrid open-closed loop intra prediction; c) the corresponding processing schedule.
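In code terms, the prediction stage changes only where each predictor pixel is fetched from; a minimal sketch of that rule follows (the stage split and the two pixel sources follow the text, while the helper itself is hypothetical):

```cpp
// Predictor-pixel selection in the hybrid open-closed loop intra scheme:
// inside the current MB, subblock-boundary predictors come from ORIGINAL
// pixels (open loop, breaking the zig-zag dependence); across MB boundaries
// the already-reconstructed pixels are used (closed loop).
#include <cstdio>

enum class Source { Original, Reconstructed };

// (px, py): predictor pixel position; (mbx, mby): top-left of the current MB.
// With this rule every subblock can be predicted in parallel; the standalone
// reconstruction stage later redoes the single best mode closed-loop.
Source predictorSource(int px, int py, int mbx, int mby) {
    bool insideCurrentMB = px >= mbx && px < mbx + 16 &&
                           py >= mby && py < mby + 16;
    return insideCurrentMB ? Source::Original       // open-loop part
                           : Source::Reconstructed; // MB boundary: closed loop
}

int main() {
    // MB at (32, 48): a predictor at (35, 47) lies in the MB row above.
    printf("inside-MB predictor: %s pixels\n",
           predictorSource(35, 60, 32, 48) == Source::Original ? "original" : "reconstructed");
    printf("above-MB predictor:  %s pixels\n",
           predictorSource(35, 47, 32, 48) == Source::Original ? "original" : "reconstructed");
    return 0;
}
```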
FRAME-PARALLEL PIPELINE-DOUBLED DUAL CABAC

Entropy coding compresses data based on the probability distribution of symbols, and it plays an important role in video coding. In the baseline profile, H.264/AVC adopts context-based adaptive variable length coding (CAVLC) as the entropy coder.

In the main profile and other advanced profiles, including MVC, CABAC is adopted. CABAC achieves 9 to 14 percent bit rate savings over CAVLC, but its computation is much more complicated. Furthermore, due to the sequential nature of arithmetic coding, it is extremely difficult for a hardware design to exploit pipelining or parallel techniques. Figure 6a shows the block diagram of CABAC. The inputs of CABAC are syntax elements (SEs) and side information. Syntax elements are the essential data to be coded, such as the MB type, prediction mode, and residues. Side information, usually the information of neighboring coded blocks, helps to estimate the probability of symbols. The SEs must be transformed into binary symbols before binary arithmetic encoding. The adaptive effect is achieved through the context (ctx) assigned to each symbol. These ctxs are modeled according to the SE type, side information, and binary index. Symbols with the same ctx have similar statistical properties and use the same adaptive probability state for estimation. Besides normal arithmetic coding, a bypass mode is introduced to speed up the encoding process. The symbol, along with its associated ctx and bypass flag, enters the binary arithmetic coder. Finally, the arithmetic coder generates the output bitstream.

Figure 6. Issues and solutions in the CABAC stage: a) system overview of CABAC; b) the proposed two-symbol arithmetic coder; c) the frame-parallel scheme of the CABAC stage, which further improves the symbol rate; d) chip photo of the proposed MVC encoder.

Due to the limited cycle budget in the MB pipeline architecture, an EC engine with a one-symbol arithmetic encoder can only process about 350 symbols in one MB pipeline stage. As discussed earlier, this throughput is far below the target HD MVC specifications. Therefore, a multisymbol architecture is proposed [11]. The arithmetic coder is duplicated as in Fig. 6b. For the range stage, low stage, and output stage, two one-symbol PEs are directly cascaded. However, we cannot simply cascade two one-symbol state stages, because the contexts of the two symbols may be the same, in which case the second symbol must see the probability state already updated by the first. The two-symbol state stage is shown on the right of Fig. 6b. The proposed two-symbol arithmetic coder may not provide exactly doubled throughput since the throughput depends on the ctx types. Based on our simulation, the actual throughput of the proposed two-symbol coder is 1.94 times that of the conventional one-symbol-per-clock-cycle architecture. Applying the two-symbol architecture can thus nearly double the throughput. However, for some textured MBs, the two-symbol architecture still does not meet the throughput requirement. Based on the analysis in previous work, the critical path increases with the number of concurrently processed symbols in the arithmetic coder.
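The state-stage hazard is the crux of the two-symbol design, and a toy model makes it concrete. The forward-on-equal-context behavior mirrors the ctx comparators of Fig. 6b; the state update itself is reduced to a +/-1 counter rather than the real CABAC probability tables:

```cpp
// Why two one-symbol state stages cannot simply be cascaded: if both symbols
// share a context, the second must see the state already updated by the first.
#include <cstdio>

struct CtxState { int state; };   // stand-in for the 6-bit probability state

static CtxState updateState(CtxState s, int bin) {
    s.state += bin ? 1 : -1;      // toy transition; real CABAC uses 64-entry LUTs
    return s;
}

void encodeTwoSymbols(CtxState ctxTable[], int ctx1, int bin1, int ctx2, int bin2) {
    CtxState s1 = ctxTable[ctx1];
    CtxState n1 = updateState(s1, bin1);
    // Context comparator: forward the first symbol's new state when ctx1 == ctx2,
    // otherwise read the second context independently.
    CtxState s2 = (ctx1 == ctx2) ? n1 : ctxTable[ctx2];
    CtxState n2 = updateState(s2, bin2);
    ctxTable[ctx1] = n1;          // two writes back to the state memory
    ctxTable[ctx2] = n2;          // (the ctx2 write wins if the contexts collide)
    printf("ctx%d -> %d, ctx%d -> %d\n", ctx1, n1.state, ctx2, n2.state);
}

int main() {
    CtxState table[8] = {};
    encodeTwoSymbols(table, 3, 1, 3, 1);  // same context: state forwarded
    encodeTwoSymbols(table, 1, 0, 2, 1);  // different contexts: independent reads
    return 0;
}
```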

For our target operating frequency of 250 to 300 MHz, architectures processing more than two symbols in parallel are therefore not feasible. Instead, the frame-parallel pipeline-doubled dual CABAC (FPPDD) is proposed to utilize frame-level parallelism. Dual CABAC computation units are adopted, and each has a doubled pipeline cycle budget of 700 cycles. These computation units process frames in an interleaved manner as shown in Fig. 6c. Thus, the frame-parallel scheduling scheme can be adopted to avoid data dependence between the two CABAC units. With circuit- and frame-level optimization, the throughput of the proposed FPPDD is 3.88 times that of the one-symbol design.

CHIP IMPLEMENTATION

Besides the above algorithm and architecture optimizations, all the other modules, including the view-parallel MB-interleaved (VPMBI) scheduling controller, fractional motion/disparity estimation [12], and motion/disparity compensation, are also optimized. After adopting all the proposed solutions, a prototype MVC single-chip encoder was fabricated by Taiwan Semiconductor Manufacturing Company (TSMC) in a 90 nm 1P9M process [13]. The chip photo is shown in Fig. 6d. The size of the chip is 11.46 mm² (3.95 mm × 2.90 mm), and it contains 1732 kgates. The chip supports both the H.264/AVC Multiview High Profile and the High Profile at level 5.1. For multiview video coding, the proposed MVC chip can support from full HD 1080p three views to HDTV 720p seven views. With this view scalability, the processing ability can be as high as 4096 × 2160p if the view number is only one. Thus, the proposed chip can support not only HD MVC encoding, but also quad full HD (QFHD) H.264/AVC single-view encoding.

CONCLUSION

In this article several issues in video encoder design for 3DTV applications are discussed. First, the video coding standard development from 2D to 3D video is introduced. Among these standards, MVC, an extension profile of H.264/AVC, provides the best coding efficiency at the cost of a dramatically higher computation requirement. Therefore, very large-scale integration (VLSI) hardware acceleration is required to enable real-time applications. Moreover, the system analysis shows that the previous design methods used in single-view video coding have excessive hardware resource requirements and cannot be employed directly. In order to deal with these design challenges, solutions for each module in the MVC encoder, including the cache-based and predictor-centered IMDE, the hybrid open-closed loop intra prediction, and the FPPDD, are proposed. After adopting all the proposed algorithm and architecture optimizations, an MVC single-chip encoder is implemented in the TSMC 90 nm process. With the proposed MVC encoder design, the target HD MVC specifications can be supported with view scalability from 1920 × 1080p full HD three views to 1280 × 720 HDTV seven views. Furthermore, single-view QFHD H.264/AVC encoding is also supported. With the proposed VLSI techniques, real-time 3D video applications become feasible, and we believe more and more 3D video consumer products can be realized in the near future.

REFERENCES

[1] ISO/IEC MPEG Video and Requirements Group, "Applications and Requirements on 3D Video Coding," ISO/IEC JTC1/SC29/WG11 N10857, 2009.
[2] Joint Video Team of ISO/IEC MPEG and ITU-T VCEG, "Joint Draft 7.0 on Multiview Video Coding," ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6 JVT-AA209, Apr. 2008.
[3] Joint Video Team of ISO/IEC MPEG and ITU-T VCEG, "WD 1 Reference Software for MVC," ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6 JVT-AA212, Apr. 2008.
[4] Y.-W. Huang et al., "A 1.3TOPS H.264/AVC Single-Chip Encoder for HDTV Applications," IEEE ISSCC Dig. Tech. Papers, 2005.
[5] Y. K. Lin et al., "A 242mW 10mm² 1080p H.264/AVC High-Profile Encoder Chip," IEEE ISSCC Dig. Tech. Papers, 2008.
[6] Z. Liu et al., "A Real-Time 1.41W H.264/AVC Encoder SoC for HDTV 1080p," IEEE Int'l. Symp. VLSI Circuits Dig. Tech. Papers, 2007.
[7] C.-Y. Chen et al., "Level C+ Data Reuse Scheme for Motion Estimation with Corresponding Coding Orders," IEEE Trans. Circuits Sys. Video Tech., vol. 16, no. 4, Apr. 2006, pp. 553–58.
[8] L.-F. Ding et al., "Content-Aware Prediction Algorithm with Inter-View Mode Decision for Multiview Video Coding," IEEE Trans. Multimedia, vol. 10, no. 8, Dec. 2008, pp. 1553–64.
[9] P.-K. Tsung et al., "Cache-Based Integer Motion/Disparity Estimation for Quad-HD H.264/AVC and HD Multiview Video Coding," Proc. IEEE Int'l. Conf. Acoustics, Speech, Signal Process., 2009, pp. 2013–16.
[10] T.-D. Chuang et al., "Algorithm and Architecture Design for Intra Prediction in H.264/AVC High Profile," Proc. Picture Coding Symp., 2007.
[11] Y.-J. Chen, C.-H. Tsai, and L.-G. Chen, "Architecture Design of Area-Efficient SRAM-Based Multi-Symbol Arithmetic Encoder in H.264/AVC," Proc. IEEE Symp. Circuits Sys., 2006, pp. 2621–24.
[12] P.-K. Tsung et al., "Single-Iteration Full-Search Fractional Motion Estimation for Quad Full HD H.264/AVC Encoding," Proc. IEEE Int'l. Conf. Multimedia Expo., 2009, pp. 9–12.
[13] L.-F. Ding et al., "A 212MPixels/s 4096 × 2160p Multiview Video Encoder Chip for 3D/Quad HDTV Applications," IEEE ISSCC Dig. Tech. Papers, 2009, pp. 154–55.

BIOGRAPHIES

PEI-KUEI TSUNG (iceworm@video.ee.ntu.edu.tw) received his B.S. degree in electrical engineering and M.S. degree in electronics engineering from National Taiwan University, Taipei, Taiwan, in 2006 and 2008, respectively, where he is working toward his Ph.D. degree in electronics engineering. His major research interests include stereo and multiview video coding, motion estimation algorithms, view synthesis algorithms, and associated VLSI architectures.

LI-FU DING received his B.S. degree in electrical engineering, and M.S. and Ph.D. degrees in electronics engineering from National Taiwan University in 2003, 2005, and 2008, respectively. In 2009 he joined Taiwan Semiconductor Manufacturing Company as a principal engineer. His major research interests include stereo and multiview video coding, motion estimation algorithms, and associated VLSI architectures.

WEI-YIN CHEN received his B.S. degree in electrical engineering and M.S. degree in electronics engineering from National Taiwan University in 2005 and 2008, respectively. In 2007 he was with MIT as a visiting graduate student. His major research interests include super-high-definition and multiview video coding, associated VLSI architectures, high-level synthesis, and computer architecture.

TZU-DER CHUANG received his B.S.E.E. degree from the Department of Electrical Engineering, National Taiwan University in 2005. He is now working toward his Ph.D. degree in the Graduate Institute of Electronics Engineering, National Taiwan University. His major research interests include the algorithms and related VLSI architectures of H.264/AVC and scalable video coding.

YU-HAN CHEN received his B.S. degree from the Department of Electrical Engineering, National Taiwan University in 2003. He is currently pursuing his Ph.D. degree at the Graduate Institute of Electronics Engineering, National Taiwan University. His research interests include image/video signal processing, motion estimation, algorithm and architecture design of H.264 video coders, and low-power and power-aware video coding systems.

PAI-HENG HSIAO received his B.S.E.E. degree from the Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, in 2007. He is now working toward his Master's degree in the Graduate Institute of Electronics Engineering, National Taiwan University. His major research interests include the algorithms and architectures of video coding and neural signal processing.

SHAO-YI CHIEN [S'99, M'04] received B.S. and Ph.D. degrees from the Department of Electrical Engineering, National Taiwan University in 1999 and 2003, respectively. From 2003 to 2004 he was a research staff member at Quanta Research Institute, Tao Yuan County, Taiwan. In 2004 he joined the Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University, as an assistant professor. Since 2008 he has been an associate professor. His research interests include video segmentation algorithms, intelligent video coding technology, perceptual coding technology, image processing for digital still cameras and display devices, computer graphics, and the associated VLSI and processor architectures. He has published more than 120 papers in these areas. He serves as an Associate Editor for IEEE Transactions on Circuits and Systems for Video Technology and Springer Circuits, Systems, and Signal Processing, and served as a Guest Editor for Springer Journal of Signal Processing Systems in 2008. He also serves on the Technical Program Committees of several conferences, including ISCAS, A-SSCC, and VLSI-DAT.

LIANG-GEE CHEN [S'84, M'86, SM'94, F'01] received B.S., M.S., and Ph.D. degrees in electrical engineering from National Cheng Kung University, Taiwan, in 1979, 1981, and 1986, respectively. He was an instructor (1981-1986) and an associate professor (1986-1988) in the Department of Electrical Engineering, National Cheng Kung University. During his military service in 1987 and 1988, he was an associate professor in the Institute of Resource Management, Defense Management College. In 1988 he joined the Department of Electrical Engineering, National Taiwan University. From 1993 to 1994 he was a visiting consultant at the DSP Research Department, AT&T Bell Labs, Murray Hill, New Jersey. In 1997 he was a visiting scholar in the Department of Electrical Engineering, University of Washington, Seattle. Currently, he is a professor at National Taiwan University. Since 2004 he has also been the executive vice president and general director of the Electronics Research and Service Organization (ERSO) of the Industrial Technology Research Institute (ITRI). His current research interests are DSP architecture design, video processor design, and video coding systems. He is a member of the honor society Phi Tau Phi. He was the general chairman of the 7th VLSI Design CAD Symposium and of the 1999 IEEE Workshop on Signal Processing Systems: Design and Implementation. He has served as an Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology since June 1996 and as an Associate Editor of IEEE Transactions on VLSI Systems since January 1999. He has been an Associate Editor of the Journal of Circuits, Systems, and Signal Processing since 1999, and served as a Guest Editor of the Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology in November 2001. He is also an Associate Editor of IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, and since 2002 has been an Associate Editor of Proceedings of the IEEE. He received the Best Paper Award from the R.O.C. Computer Society in 1990 and 1994. From 1991 to 1999 he received Long-Term (Acer) Paper Awards annually. In 1992 he received the Best Paper Award of the 1992 Asia-Pacific Conference on Circuits and Systems in the VLSI design track. In 1993 he received the Annual Paper Award of the Chinese Engineer Society. In 1996 he received the Outstanding Research Award from NSC and the Dragon Excellence Award from Acer. He was elected an IEEE Circuits and Systems Distinguished Lecturer for 2001-2002.