IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 10, NO. 1, JANUARY 2008


A Highly Efficient VLSI Architecture for H.264/AVC CAVLC Decoder

Heng-Yao Lin, Student Member, IEEE, Ying-Hong Lu, Bin-Da Liu, Fellow, IEEE, and Jar-Ferr Yang, Fellow, IEEE

Abstract: In this paper, an efficient algorithm is proposed to improve the decoding efficiency of the context-based adaptive variable length coding (CAVLC) procedure. Due to the data dependency among symbols in the decoding flow, the CAVLC decoder requires a large computation time, which dominates the overall decoder system performance. To expedite decoding, the critical path in the CAVLC decoder is first analyzed and then reduced by forwarding the adaptive detection for succeeding symbols. With a shortened critical path, the CAVLC architecture is further divided into two segments, which can be easily implemented as a pipeline structure. Consequently, the overall performance is effectively improved. In the hardware implementation, a low-power combined LUT and a single output buffer are adopted to reduce area and power consumption without affecting the decoding performance. Experimental results show that the proposed architecture surpasses other recent designs: it reduces power consumption by approximately 40% and achieves three times the decoding speed of the original decoding procedure suggested in the H.264 standard. The maximum frequency can exceed 210 MHz, which easily supports the real-time requirement for resolutions higher than the HD1080 format.

Index Terms: Context-based adaptive variable length coding (CAVLC), H.264/AVC, variable length coding.

I. INTRODUCTION

In recent years, with the rapid growth of multimedia and communication techniques, multimedia systems have become indispensable. However, rich multimedia services result in problems of limited data storage and transmission bandwidth.
There are several important video coding standards that have been developed to effectively compress multimedia information while maintaining a high quality. The Moving Picture Experts Group (MPEG) was established to build compression techniques, such as MPEG-1 [1], MPEG-2 [2], and MPEG-4 [3], for enabling many important multimedia services. At the same time, the International Telecommunication Union (ITU) also established a series of video compression standards, such as H.261 [4], H.263 [5], H.263+, and H.263++. In 2003, the Joint Video Team (JVT), consisting of experts from MPEG and ITU, approved a new video standard, H.264/AVC, with many advanced features to achieve effective video compression [6].

Manuscript received January 29, 2007; revised July 5. This work was supported in part by the National Science Council of Taiwan, R.O.C., under Grants NSC E and E. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Anna Hac. The authors are with the Department of Electrical Engineering, National Cheng Kung University, Tainan 70101, Taiwan, R.O.C. (lhy92@spic.ee.ncku.edu.tw; y_h_lu@novatek.com.tw; bdliu@mail.ncku.edu.tw; jfyang@ee.ncku.edu.tw). Digital Object Identifier /TMM

However, the computational complexity of the H.264/AVC encoders and decoders is dramatically increased. Variable length coding (VLC) is a well-known lossless entropy coding technique, widely adopted in image/video compression standards. However, since the codeword length is variable, the boundary of the next codeword in the cascaded bitstream cannot be determined until the previous codeword is decoded, which limits decoding throughput. In addition, entropy coding is normally based on pre-defined tables of variable-length codes, so adaptation to actual input symbol statistics is difficult. Thus, Jeon et al.
proposed a joint use of adaptive codebook selection and dynamic codeword re-association to achieve better coding performance than pre-defined Huffman tables [7]. Lakhani also suggested that run-length coding should encode the run-length of subsequent zeros instead of preceding zeros of nonzero AC coefficients [8]. To achieve the maximum compression ratio under reasonable hardware cost, the context-based adaptive variable length coding (CAVLC) method, which includes all the above-mentioned features, is adopted in the latest H.264/AVC standard. The CAVLC algorithm uses the trend among AC coefficients in each block to predict the next codeword. The prediction mechanism can significantly improve decoding performance. Compared to the previous entropy coding method, the CAVLC introduces the context model concept to model symbol probability more accurately so that the compression ratio can be further increased. However, the cost of the high compression ratio achieved by the CAVLC coder is high computational complexity. Thus, developing an efficient CAVLC decoder implementation is of practical importance.

Several recent design studies on efficient VLC decoding hardware [9]–[13] have emerged. These architectures can be mainly classified into two groups: tree-based and parallel decoding approaches. In traditional VLC decoding algorithms, one level of the coding tree is searched per operation. The throughput rate is therefore limited by the search depth. Although the tree-based method is simple, it is not suitable for real-time processing applications. Therefore, parallel VLC decoding approaches are mostly adopted in VLSI hardware designs.

How to implement the CAVLC decoding hardware has been of great interest in recent years [14]–[25]. To realize the H.264 decoder, a decoding hardware design to parse the CAVLC codeword was proposed in [14], [15]. In [16], the VLSI implementation of the CAVLC decoder for H.264/AVC was presented.
Moreover, several low-cost, high-performance VLSI architectures for realizing the CAVLC decoder were proposed in [17]–[19]. Nevertheless, the proposed methods mainly focused on improving look-up table processing speed. Moon proposed an efficient algorithm based on arithmetic operations instead of

memory access [20], which was further developed by Kim [21]. However, these schemes are not well suited to hardware implementation, since a straightforward implementation of the arithmetic algorithms requires a large number of computational resources. The designs in [22]–[24] reduce the number of clock cycles through skipping methods and multiple-symbol decoding to speed up processing.

In this paper, after analyzing the CAVLC coding algorithm, we propose an efficient decoding architecture to deal with the critical path of CAVLC decoders. The decoder critical path delay can be divided into two pipelined stages to improve the processing speed. Finally, a low-cost and low-power CAVLC decoder integrated with recently developed techniques is synthesized while maintaining the designed fast decoding performance.

The rest of this paper is organized as follows. Section II reviews the basic decoding procedure of the CAVLC framework. Section III presents the algorithm-level optimization that fits the hardware design and speeds up decoding. Section IV gives a detailed description of the designed CAVLC decoder. The simulation and synthesis results, which demonstrate the performance improvement of the proposed CAVLC decoder, are presented in Section V. Section VI concludes the paper.

Fig. 1. Reverse scan magnitudes and their transmitted bitstream.

TABLE I CAVLC DECODING PROCEDURE FOR THE EXAMPLE DEPICTED IN FIG. 1

II. CONTEXT-BASED ADAPTIVE VLC PRINCIPLES

In H.264/AVC, the context-based adaptive variable length coding (CAVLC) technique is designed to encode the quantized residual coefficients of 4 × 4 (and 2 × 2) blocks. It makes effective use of several characteristics of block-based video compression. After prediction, transformation, and quantization, the quantized DCT coefficients are mostly zeros.
The CAVLC uses run-level coding to compactly represent strings of zeros. Since the tendencies of Run and Level are not strongly correlated, the CAVLC encodes Run and Level information separately, with better adaptation, to achieve better coding efficiency. In transformed blocks, the nonzero high-frequency coefficients after the zig-zag scan are often sequences of ±1. The CAVLC signals the number of these high-frequency ±1 coefficients by using trailing 1s (T1s) to achieve a compact representation. Nonzero coefficients can be encoded effectively by using different VLC tables according to the decoding status. The CAVLC encoder adaptively selects the best-matched VLC table to improve the coding performance. The choice of look-up table depends on the number of encoded coefficients and the magnitude of the nonzero coefficients. This mechanism, called context-based adaptive (CA) coding, achieves better coding performance than a traditional fixed VLC table. In a 4 × 4 block, the levels (magnitudes) of nonzero coefficients tend to be higher near the DC coefficient. Hence, the CAVLC encodes them in reverse order, from higher to lower frequencies. By using another CA concept, the CAVLC adaptively selects the VLC look-up table for the next level parameter depending on recently coded level magnitudes.

In the CAVLC encoder, the quantized coefficients are zig-zag scanned and then encoded by five syntax elements. These syntax elements are defined as follows.

a) coeff_token: Both the number of all nonzero coefficients (total_coeff) and the number of trailing ones are encoded using this syntax element.
b) Sign of T1s: This syntax element encodes the sign bit of each trailing one in reverse zig-zag scan order.
c) Level: The value of each nonzero coefficient, except for the trailing ones, is encoded using this syntax element.
d) total_zeros: This syntax element encodes the total number of zero coefficients preceding the last nonzero coefficient in zig-zag scan order.
e) run_before: This syntax element encodes the number of successive zero coefficients preceding each nonzero coefficient in reverse zig-zag scan order.

A. CAVLC Decoding Procedure

Fig. 1 shows an example of the reverse scan values of a 4 × 4 block and its associated transmitted CAVLC bitstream from the encoder point of view. The CAVLC decoder decodes the CAVLC bitstream into syntax elements. Table I lists the parsed syntax elements and decoded information bits of the CAVLC bitstream shown in Fig. 1. The corresponding 4 × 4 block, given in Fig. 1 as reverse zig-zag scan magnitudes, is finally reconstructed. The details of the CAVLC process can be found in the H.264 standard [6].

According to the H.264/AVC standard, Sign of T1s and Levels are decoded using simple arithmetic operations since these two syntax elements are encoded by regular VLC codes. However, the other syntax elements, namely coeff_token, total_zeros, and run_before, are encoded by content-dependent VLC codes. Decoding the content-dependent symbols is mostly realized by multiple look-up tables.
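To make the five syntax elements concrete, the following minimal Python sketch derives the quantities they carry from one zig-zag scanned block. It is an illustration only: the function name `cavlc_symbols`, the returned dictionary keys, and the sample block are assumptions introduced here, and no VLC table look-up (the actual bit-level coding) is modeled.

```python
def cavlc_symbols(zigzag):
    """Derive the quantities carried by the five CAVLC syntax elements
    from a block of quantized coefficients given in zig-zag scan order.
    (Hypothetical helper for illustration; no VLC tables are modeled.)"""
    nz_pos = [i for i, c in enumerate(zigzag) if c != 0]
    total_coeff = len(nz_pos)

    # Nonzero coefficients in reverse zig-zag order (highest frequency first).
    rev = [zigzag[i] for i in reversed(nz_pos)]

    # Trailing ones: at most three consecutive +/-1 values at the start of rev.
    trailing_ones = 0
    for c in rev:
        if abs(c) != 1 or trailing_ones == 3:
            break
        trailing_ones += 1

    t1_signs = ["+" if c > 0 else "-" for c in rev[:trailing_ones]]
    levels = rev[trailing_ones:]  # remaining nonzero values, reverse order

    # Zeros located before the last nonzero coefficient in scan order.
    total_zeros = (nz_pos[-1] + 1 - total_coeff) if nz_pos else 0

    # Zeros immediately preceding each nonzero coefficient, reverse order.
    # (A real bitstream does not transmit the entry for the last coefficient.)
    rev_pos = nz_pos[::-1]
    run_before = []
    for k, pos in enumerate(rev_pos):
        prev = rev_pos[k + 1] if k + 1 < len(rev_pos) else -1
        run_before.append(pos - prev - 1)

    return {
        "total_coeff": total_coeff,
        "trailing_ones": trailing_ones,
        "t1_signs": t1_signs,
        "levels": levels,
        "total_zeros": total_zeros,
        "run_before": run_before,
    }


# Illustrative block (not the block of Fig. 1).
block = [0, 3, 0, 1, -1, -1, 0, 1] + [0] * 8
symbols = cavlc_symbols(block)
print(symbols)
```

For this sample block the fourth ±1 value in reverse order is not counted as a trailing one, because at most three are allowed; it is carried by the Level element instead.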

TABLE II REAL-TIME REQUIREMENT FOR DIFFERENT VIDEO RESOLUTIONS

B. Real-Time Requirement

Generally speaking, the number of clock cycles required for each block depends on the block coefficients. However, in CAVLC, the average number of symbols is about twice that of the previous VLC method. To guarantee that our architecture is suitable for most resolutions and satisfies the real-time requirement for most applications, the worst case, which requires the most processing time, should be considered when analyzing the real-time requirement. The CAVLC decoder operating frequencies needed to satisfy the real-time requirement for various video formats (resolutions) are listed in Table II.

III. HARDWARE-ORIENTATED ALGORITHM OPTIMIZATION

In most applications, such as broadcasting and video conferencing, real-time processing is the most important issue. In the VLC decoding procedure, the next symbol cannot be processed until the current symbol is decoded, due to the data dependency in variable length codes. The system performance is mainly confined by the critical path. In the CAVLC decoder, the critical path occurs in decoding the level information since this syntax element is decoded based on arithmetic operations. Moreover, two parameters, the current code length and suffixlength, must be obtained to decode the subsequent level coefficients. Unfortunately, the suffixlength value, which is processed at the end of the level decoding procedure, cannot be decided until the level coefficient is resolved. Therefore, the original decoding procedure cannot be divided into a pipeline structure. In order to improve overall performance, a modified suffixlength detector (MSD) algorithm is presented to advance the suffixlength computation ahead of the determination of the level coefficient. In addition to shortening the critical path, this allows the level decoding process to be realized with a pipeline structure.
The detailed description of the MSD algorithm is presented in the following subsections.

A. Original Level Decoding Process

The level coefficient includes the sign and the magnitude of each remaining nonzero coefficient in the 4 × 4 block. The code for each level is composed of a prefix part (level_prefix) and a suffix part (level_suffix) as

\[ \text{Level codeword} = \underbrace{0 \cdots 0\,1}_{\text{prefix part}}\ \underbrace{x \cdots x}_{\text{suffix part}} \tag{1} \]

where the number of leading zeros in the prefix part is represented as level_prefix, and level_suffix contains the value of the suffix part. Unlike Exp-Golomb entropy coding, in which the leading zeros determine the number of trailing information bits after the first 1 bit, there is no relationship between level_prefix and level_suffix in the Level bitstream. The level_prefix also carries level information. In the decoding procedure, these variables are used to compute the level value.

Fig. 2. Level decoding process in the H.264 standard.

Fig. 2 shows the flow diagram of the level decoding procedure, which can be divided into two stages [6]. The decoding process in the first stage starts from level_prefix decoding, which is physically a first-1 detector that counts the number of zeros preceding the first 1 bit. After the level_prefix value has been decided, the size of the suffix part, represented as levelsuffixsize, is examined. If the variable levelsuffixsize is equal to 0, the syntax element level_suffix is inferred to be equal to 0. Otherwise, the level_suffix based on the corresponding suffixlength is obtained. The variable suffixlength, ranging from 0 to 6, is used to decide the corresponding magnitude range of level_suffix. The probability model entity used in the level code can be

modified depending on the previous level information, such that each model can be identified by a unique suffixlength value; this is the context-based adaptive feature in CAVLC. A large suffixlength value is suitable for higher magnitude levels, whereas a small suffixlength value is appropriate for levels with lower magnitudes. In general, the levelsuffixsize value is equal to the variable suffixlength, with the exception of the Escape_code, and determines the range that the level_suffix codeword can represent. Afterward, the intermediate levelcode value, represented in italics, is obtained from the level_prefix and level_suffix values and transferred to the second stage.

In the second stage, levelcode is refined and applied to calculate the level value in the level processing unit (LPU). Before the LPU, there are two special conditions that cause levelcode to be incremented, as follows. The first special condition occurs when level_prefix is equal to 15 and suffixlength is equal to zero. The Escape_code adopted for a large CAVLC level value occurs in this condition, and the variable levelcode should be increased by 15. The second special condition appears if the current symbol is the first Level symbol and the number of trailing ones in this block is less than 3. In this case the absolute value of the first Level symbol cannot be 1; otherwise, the symbol would have been counted as a trailing one. Hence, the levelcode, which has been modified to improve compression efficiency in this condition, should be increased by 2 in the decoding procedure. After the two special conditions are checked, the levelcode value is completely decoded and transferred into the LPU, where the level value is derived from levelcode. The equations of the LPU are as follows:

\[ \mathit{level} = \begin{cases} (\mathit{levelcode} + 2) \gg 1, & \text{if } \mathit{levelcode} \text{ is even} \\ (-\mathit{levelcode} - 1) \gg 1, & \text{if } \mathit{levelcode} \text{ is odd} \end{cases} \tag{2} \]

From these equations, it is obvious that if levelcode is even, the level value is positive, and if levelcode is odd, the level value is negative.
Finally, the absolute value of the level should be checked against the boundary defined in Table III. If the magnitude exceeds the boundary, the suffixlength value must be updated to decode the next Level symbol. For example, suppose that the current suffixlength is equal to 2 and that the input bitstream carries the prefix zeros (000), the terminating 1 bit, and the suffix part (01). The level_prefix and level_suffix are then equal to 3 and 1, respectively. As shown in Fig. 2, following the calculation procedure in the first part, the output levelcode is equal to 13 and is then passed to the second part. By substituting the levelcode into (2) in the second part, the final level value is equal to −7. After checking the level value against Table III, the suffixlength for the next Level symbol should be incremented by 1.

TABLE III THRESHOLD VALUE FOR DETERMINING NEXT SUFFIXLENGTH

B. Critical Path Improvement in the Decoding Level

In the level code decoding procedure, two parameters should be derived and delivered to decode the next piece of information. One is the current codeword length, and the other is the suffixlength value. Generally speaking, the VLC decoding bottleneck occurs at the codeword boundary. The starting point of the next codeword remains unknown until the current codeword length is decoded. In the level decoder, the Level symbol length is defined by the following equation:

\[ \text{Level symbol length} = \mathit{level\_prefix} + 1 + \mathit{levelsuffixsize} \tag{3} \]

In the original level decoding procedure shown in Fig. 2, level_prefix and levelsuffixsize are available in the first stage; thus, the codeword length can be obtained in the first stage. Although the length is derived in the first stage, the decoding process cannot proceed to the next symbol immediately, since the suffixlength, which identifies the magnitude range of the next symbol, is not updated until the end of the second stage. The suffixlength updating step depends on the complete level calculation after the LPU in the second stage.
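The original two-stage flow described above (first-1 detection, suffix extraction, the two special conditions, the LPU, and the suffixlength update) can be sketched in Python as follows. This is a minimal software model under stated assumptions, not the authors' hardware: the bitstream is a string of '0'/'1' characters, level_prefix is assumed not to exceed 15, the levelsuffixsize rules and the threshold 3 << (suffixlength − 1) are taken from the H.264 standard, and the names `decode_level` and `first_level_few_t1s` are hypothetical.

```python
def decode_level(bits, pos, suffix_length, first_level_few_t1s=False):
    """Decode one Level codeword starting at bits[pos].

    Returns (level, next_suffix_length, codeword_length).
    Sketch only: assumes level_prefix <= 15 and a well-formed bit string.
    """
    # ---- Stage 1: first-1 detector and suffix extraction ----
    level_prefix = 0
    while bits[pos + level_prefix] == "0":
        level_prefix += 1

    if level_prefix == 14 and suffix_length == 0:
        suffix_size = 4
    elif level_prefix >= 15:          # Escape_code
        suffix_size = level_prefix - 3
    else:                             # normal case
        suffix_size = suffix_length

    start = pos + level_prefix + 1    # skip the zeros and the terminating 1
    level_suffix = int(bits[start:start + suffix_size], 2) if suffix_size else 0
    codeword_length = level_prefix + 1 + suffix_size

    level_code = (min(15, level_prefix) << suffix_length) + level_suffix

    # ---- Stage 2: special conditions, then the LPU ----
    if level_prefix >= 15 and suffix_length == 0:
        level_code += 15
    if first_level_few_t1s:           # first Level symbol, TrailingOnes < 3
        level_code += 2

    if level_code % 2 == 0:
        level = (level_code + 2) >> 1
    else:
        level = (-level_code - 1) >> 1

    # ---- suffixlength update: needs the finished level value ----
    if suffix_length == 0:
        suffix_length = 1
    if abs(level) > (3 << (suffix_length - 1)) and suffix_length < 6:
        suffix_length += 1

    return level, suffix_length, codeword_length


# Worked example from the text: suffixlength = 2, level_prefix = 3, level_suffix = 1.
result = decode_level("000101", 0, 2)
print(result)
```

In this model the update of `suffix_length` is the last step and reads `level`, which is the data dependency that keeps the original flow from being pipelined.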
Furthermore, the LPU calculates the level value from the levelcode, which is derived from level_prefix and level_suffix. The data dependency in the original decoding process leads to an unavoidably long critical path. In order to improve decoding performance, a new Level decoding algorithm is proposed to break the data dependency and reduce the critical path. The threshold value in the suffixlength detector is defined in Table III, which can be described by the following threshold function:

\[ \left| \mathit{level}[i] \right| > 3 \cdot 2^{\,\mathit{suffixlength} - 1} \tag{4} \]

where level[i] represents the level value. If condition (4) is satisfied, suffixlength is increased by 1. Since the level value is derived from levelcode as shown in (2), we can rewrite (4) as

\[ \begin{cases} \dfrac{\mathit{levelcode} + 2}{2} > 3 \cdot 2^{\,\mathit{suffixlength} - 1}, & \mathit{level}[i] > 0 \\[6pt] \dfrac{\mathit{levelcode} + 1}{2} > 3 \cdot 2^{\,\mathit{suffixlength} - 1}, & \mathit{level}[i] < 0 \end{cases} \tag{5} \]

Since levelcode changes with different conditions, the proposed algorithm is explained in three cases: Normal mode, Escape_code mode, and TrailingOnes mode.

1) Normal Mode With SuffixLength > 0: First, we consider the variable levelsuffixsize larger than 0, which also means that the suffixlength is not equal to 0. Simultaneously, we also ignore

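the special conditions in the second stage that will increment the levelcode value. Thus, levelcode can be replaced with the combination of level_prefix and level_suffix as in the following equation:

\[ \mathit{levelcode} = \mathit{level\_prefix} \cdot 2^{\,\mathit{suffixlength}} + \mathit{level\_suffix} \tag{6} \]

By substituting (6) into (5), the upper equation in (5) is rewritten as

\[ \frac{\mathit{level\_prefix} \cdot 2^{\,\mathit{suffixlength}} + \mathit{level\_suffix} + 2}{2} > 3 \cdot 2^{\,\mathit{suffixlength} - 1} \tag{7} \]

when level[i] is positive. After arranging terms, we can represent (7) as

\[ \mathit{level\_prefix} > 3 - \frac{\mathit{level\_suffix} + 2}{2^{\,\mathit{suffixlength}}} \tag{8} \]

Since level[i] is positive, levelcode must be even. In (6), if levelcode is even, the level_suffix value must be even and equal to 0, 2, ..., 2^suffixlength − 2, depending on the levelsuffixsize, i.e., suffixlength. Thus, the last term of (8) ranges between 0 and 1 as

\[ 0 < \frac{\mathit{level\_suffix} + 2}{2^{\,\mathit{suffixlength}}} \le 1 \tag{9} \]

The level_prefix value must be an integer. Therefore, for any level_suffix value, (8) is equivalent to the following condition:

\[ \mathit{level\_prefix} > 2 \tag{10} \]

Similarly, when level[i] is negative, the lower equation in (5) can be rewritten as

\[ \frac{\mathit{level\_prefix} \cdot 2^{\,\mathit{suffixlength}} + \mathit{level\_suffix} + 1}{2} > 3 \cdot 2^{\,\mathit{suffixlength} - 1} \tag{11} \]

\[ \mathit{level\_prefix} > 3 - \frac{\mathit{level\_suffix} + 1}{2^{\,\mathit{suffixlength}}} \tag{12} \]

Since level[i] is negative, levelcode must be odd, and the level_suffix value can be 1, 3, ..., 2^suffixlength − 1. Then, we can represent (12) as

\[ \mathit{level\_prefix} > 2 \tag{13} \]

It is obvious that no matter whether the Level symbol is positive or negative, if the level_prefix value is larger than 2, the suffixlength must be incremented by 1 to decode the next Level symbol.

2) Normal Mode With SuffixLength = 0: If suffixlength is equal to 0, there is no level_suffix value. The levelcode can be described as

\[ \mathit{levelcode} = \mathit{level\_prefix} \tag{14} \]

In the H.264 standard, when suffixlength is equal to 0, suffixlength is updated to 1 to decode the next level of information. Moreover, if the absolute level value is larger than the threshold value for suffixlength equal to 1 defined in Table III, which is equal to 3, suffixlength is set equal to 2 directly. Under this condition, we can rewrite (5) as

\[ \frac{\mathit{level\_prefix} + 2}{2} > 3 \;\Longleftrightarrow\; \mathit{level\_prefix} > 4 \tag{15} \]

for an even levelcode and

\[ \frac{\mathit{level\_prefix} + 1}{2} > 3 \;\Longleftrightarrow\; \mathit{level\_prefix} > 5 \tag{16} \]

for an odd levelcode. In brief, since an even level_prefix larger than 4 is also larger than 5, we can summarize (15) and (16) as follows:

\[ \mathit{level\_prefix} > 5 \tag{17} \]

If level_prefix is larger than 5, suffixlength is incremented by 2; otherwise, suffixlength is incremented by 1.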
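The two Normal-mode results can be checked exhaustively in software. The sketch below compares the original update rule, which needs the decoded level, against a detector that looks at level_prefix only. It is an illustrative check under assumptions introduced here: the function names are hypothetical, the Escape_code prefixes are excluded, the special +2 adjustment for the first Level symbol is ignored (as in the derivation above), and suffixlength is capped at 6 as in the H.264 standard.

```python
def next_suffix_length_original(level_prefix, level_suffix, suffix_length):
    """Original rule: the update is made after the level value is known."""
    level_code = (level_prefix << suffix_length) + level_suffix
    if level_code % 2 == 0:
        level = (level_code + 2) >> 1
    else:
        level = (-level_code - 1) >> 1
    if suffix_length == 0:
        suffix_length = 1
    if abs(level) > (3 << (suffix_length - 1)) and suffix_length < 6:
        suffix_length += 1
    return suffix_length


def next_suffix_length_msd(level_prefix, suffix_length):
    """Prefix-only rule for the Normal mode (level_suffix is not needed)."""
    if suffix_length == 0:
        return 2 if level_prefix > 5 else 1
    if level_prefix > 2 and suffix_length < 6:
        return suffix_length + 1
    return suffix_length


def normal_mode_mismatches():
    """Enumerate every Normal-mode codeword and collect disagreements."""
    mismatches = []
    for suffix_length in range(7):
        # Prefixes that switch to the longer escape-style suffix are excluded.
        max_prefix = 13 if suffix_length == 0 else 14
        for level_prefix in range(max_prefix + 1):
            for level_suffix in range(1 << suffix_length):
                expected = next_suffix_length_original(
                    level_prefix, level_suffix, suffix_length)
                got = next_suffix_length_msd(level_prefix, suffix_length)
                if expected != got:
                    mismatches.append((suffix_length, level_prefix, level_suffix))
    return mismatches


mismatches = normal_mode_mismatches()
print("mismatches:", mismatches)
```

Because the prefix-only rule never reads level_suffix or the level value, it can be evaluated as soon as the first-1 detector has finished.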
3) Escape Code Mode: Next, the two special cases, which modify the levelcode value and may cause suffixlength to be incremented, are considered in the threshold detection conditions. First, if the Escape_code is adopted for the level value, the level value is very large and definitely larger than the threshold value. The suffixlength should therefore be updated for the next Level symbol.

4) TrailingOnes Mode: Another special case occurs when the current symbol is the first Level symbol and the number of trailing ones is less than 3. In this case, levelcode should be increased by 2 in the decoding procedure. In the H.264 standard, the variable suffixlength is initialized as follows: if there are more than 10 residual coefficients and the TrailingOnes value is less than 3, suffixlength is set equal to 1; otherwise, suffixlength is set equal to 0. Thus, the possible suffixlength for the first Level symbol is 0 or 1. Under this condition, if suffixlength is equal to 0, levelcode can be given by

\[ \mathit{levelcode} = \mathit{level\_prefix} + 2 \tag{18} \]

which can be substituted into (5) to solve for the threshold function as

\[ \mathit{level\_prefix} > 3 \tag{19} \]

On the other hand, if suffixlength is equal to 1, levelcode can be described as

\[ \mathit{levelcode} = \mathit{level\_prefix} \cdot 2 + \mathit{level\_suffix} + 2 \tag{20} \]

Substituting (20) into (5), we can get the following result:

\[ \mathit{level\_prefix} > 1 \tag{21} \]

5) Summary: The results of all conditions mentioned above are summarized in Fig. 3. The input signal of the modified suffixlength detector (MSD) is level_prefix instead of the level value. The suffixlength value for decoding the next Level symbol can be obtained from the current symbol information and a recent

level_prefix value. For this reason, the suffixlength detector can be placed after the level_prefix unit in the first stage. The modified decoding procedure, using the MSD, is shown in Fig. 4. With the proposed decoding process, the MSD can be integrated into the first stage and the level decoding path can be reduced. Moreover, since the two parameters needed for decoding the next codeword can be derived in the first stage, the decoding process can proceed to the next symbol as soon as the first stage is completed. The level decoder critical path delay can be further improved with a pipeline structure. The Level value is determined in the second stage in the following clock cycle. In the meantime, the first stage of the pipeline structure can directly decode the next Level symbol.

Fig. 3. Proposed MSD decoding procedure.

Fig. 4. Modified Level decoding process.

Following the example described in Section III-A, the current suffixlength is equal to 2 and the level_prefix is equal to 3. After checking the detecting function in Fig. 3, the suffixlength

should be incremented by 1 for decoding the next Level symbol, which can be detected in the first stage of the level decoding process. The result is the same as that of the original level decoding process.

TABLE IV MERGED CATEGORIES IN THE PROPOSED LPCLUT

Fig. 5. Overview of the proposed CAVLD architecture.

IV. DESIGN OF CAVLC DECODING ARCHITECTURE

Based on the proposed algorithm and the CAVLC decoding flow, a highly effective VLSI architecture for the CAVLC decoder (CAVLD) was designed. Fig. 5 shows the proposed CAVLC decoder architecture, which is mainly composed of a Combined_LUTs decoder, a sign decoder, a Level VLC decoder, a Barrel shifter, a Control unit, and an Output buffer. The Combined_LUTs decoder is the most important VLC component; it supports the coeff_token, total_zeros, and run_before decoding processes. In addition to the Level VLC decoder, the remaining functional blocks should be considered when designing an effective CAVLC decoder. In this research, several techniques related to hardware implementation are introduced to reduce the hardware cost and power consumption and to increase the data throughput rate. A more detailed description of the realization of these functional units is provided in the following subsections.

A. Low-Power Combined Look-Up Table (LPCLUT)

The VLC tables use the most area and consume the most power of the chip if the syntax elements coeff_token, total_zeros, and run_before in the CAVLC decoder are decoded by their own look-up tables (LUTs). In [25], a low-power CAVLC decoder architecture with prefix-predecoding and table partitioning methods was presented to reduce the power consumption of LUTs without affecting the overall performance. However, adding a latch before each sub-table to prevent the unselected tables from consuming unnecessary dynamic power requires extra area overhead.
To overcome the drawbacks of conventional methods, the correlation of unstructured VLC tables for different syntax elements is analyzed. The detailed procedures for decoding these three symbols are discussed in the following.

The first VLC symbol to be decoded in the CAVLC decoder is coeff_token. As defined in the coding standard [6], there are five sub-tables for decoding this symbol. The choice of table depends on the number of nonzero coefficients in the neighboring blocks. A separate set of VLC LUTs is used for total_zeros; the sub-table index is based on the total_coeff value, except when total_coeff equals the maximum block size. The total_zeros table is partitioned into 15 sub-tables for luminance and three sub-tables for chrominance. During the run_before decoding flow, the zerosleft symbol is first checked to ascertain whether the number of zero coefficients should be decoded or not. If zerosleft is larger than zero, the run_before symbol is decoded by searching the corresponding table. The decoding procedure is terminated when zerosleft is equal to zero or the decoded run_before symbol indicates the last coefficient. The run_before table is divided into seven sub-tables, and only one of them is selected for the decoding procedure based on the zerosleft value.

In the CAVLC decoding procedure, each symbol is decoded by a specific decoding unit, which traditionally requires three individual LUTs in the hardware implementation. Because of the CAVLC decoding flow, the coeff_token, total_zeros, and run_before symbols never occur at the same time, so we suggest a combined LUT that can decode these three symbols in the proposed CAVLD architecture. Using VLC features such as bit arrangement, value, and code length, the sub-tables within different syntax elements are grouped into multiple categories. Each category can then be combined with a category from a different, disjoint symbol table.
For example, the first total_zeros VLC table is merged with the first coeff_token table due to the variance of codeword distribution. The combined table has fewer latches than two separate tables, which results in a smaller hardware cost. The merged categories from the coeff_token, total_zeros, and run_before symbols in the combined LUT are listed in Table IV. The number in parentheses indicates the bits required for the latch. There are fifteen VLC tables and one fixed-length coding (FLC) table in the architecture of one combined LUT, as shown in Fig. 6. The Control

unit outputs enable signals to activate the latches, and only one of these latches is enabled during the decoding process.

Fig. 6. Proposed architecture of the LPCLUT.

B. Throughput Improvement in Decoding Sign of T1s

In the H.264 standard, the number of trailing ones can be any value from 0 to 3. For each trailing one, the sign is coded with a single bit. Therefore, at most three bits are needed to represent the trailing_one_sign_flag elements, in which the codeword 0 expresses positive one and the codeword 1 expresses negative one. Generally speaking, due to data dependencies, one cycle is needed to decode one symbol in a simple VLC decoder. Since the trailing_one_sign_flag is not a VLC symbol, there are no data dependencies between trailing_one_sign_flag elements. The throughput (symbols/cycle) of the CAVLC decoder can therefore be improved by decoding multiple sign symbols per cycle. The proposed architecture for decoding the sign of trailing ones is shown in Fig. 7. In the trailing_one_sign_flag decoding process, the controller simultaneously reads three bits from the barrel shifter. The syntax element Num_Trail, which is decoded in coeff_token, indicates the number of trailing ones that should be decoded in the block. Decoded symbols are then stored in the level register for reconstructing the 4 × 4 block in the next step. Since the occurrence probability of the number of trailing ones varies with the QP, the performance improvement versus QP is shown in Table V. The normal VLC decoder for trailing_one_sign_flag spends about two cycles; the proposed method, however, decodes all trailing_one_sign_flag elements in one clock cycle, achieving an approximately 88% performance improvement.

C. Output Buffer

After the CAVLC decoding, the quantized transform residues are obtained in the output buffer and transmitted to inverse quantization for further processing.
Conventionally, all level/run symbols of a 4 × 4 block are first decoded and stored in buffers. Then, the level/run symbols are converted into one 16-element buffer and transferred to a 4 × 4 block of quantized transform residues according to the inverse scan. Normally, some clock cycles are required to reconstruct and reorder the residual coefficients. The next CAVLC decoding procedure cannot be initiated until the whole current CAVLC decoding process is completed. Thus, an extra output buffer has to be introduced to increase the decoding performance. In [17], an output buffer using the double-stack architecture with a block pipelining scheme was proposed to speed up the data transition between the CAVLC decoder and the inverse quantization (IQ). However, the additional output buffer takes more hardware area and power to store the level/run symbols. In this paper, only a single output buffer is adopted in the proposed CAVLC decoder, which keeps the same decoding speed as the double-stack architecture.

Fig. 7. Proposed architecture for trailing_one_sign_flag.

In the CAVLC decoding procedure, the 16 entries of the output buffer are first reset to zero, and the nonzero coefficients are decoded and written into the level buffer during the cycles for decoding the Sign of T1s and Level symbols. In the total_zeros and run_before cycles, the position index of the first nonzero coefficient in the reverse order is total_coeff + total_zeros. Then, the other nonzero coefficients are consecutively transmitted to the corresponding positions in the buffer whenever run_before symbols are decoded. Since the quantized transform residues are immediately stored in the buffer after the CAVLC decoding procedure, the run_before buffer is no longer needed. Moreover, during the reordering step, only one coefficient is transmitted into its corresponding position at a time, and it will not overwrite the other nonzero coefficients. At the same time, a zero value is written into the original position.
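The in-place reordering described above can be modeled with a single 16-entry list. The Python sketch below is an illustration under assumptions that are not spelled out in the text: indices are 0-based (so the first move targets total_coeff + total_zeros − 1), the decoded nonzero coefficients are held compactly at the front of the buffer in forward scan order, and only the transmitted run_before values are supplied. The function name `place_coefficients` and the sample values are hypothetical.

```python
def place_coefficients(levels_rev, total_zeros, runs):
    """Reorder decoded coefficients inside one 16-entry buffer.

    levels_rev : nonzero coefficients in decoding (reverse scan) order
    total_zeros: decoded total_zeros value
    runs       : decoded run_before values, in decoding order
    Returns the buffer holding the block in zig-zag scan order.
    """
    total_coeff = len(levels_rev)
    buf = [0] * 16                       # buffer reset to zero
    if total_coeff == 0:
        return buf

    # Assumed storage: coefficient decoded i-th sits at index total_coeff-1-i,
    # i.e. the nonzero values are packed at the front in forward scan order.
    for i, value in enumerate(levels_rev):
        buf[total_coeff - 1 - i] = value

    pos = total_coeff + total_zeros - 1  # target of the first (highest) coeff
    zeros_left = total_zeros
    run_iter = iter(runs)

    for k in range(total_coeff - 1, -1, -1):
        if pos != k:
            buf[pos] = buf[k]            # move one coefficient ...
            buf[k] = 0                   # ... and clear its original slot
        if zeros_left == 0 or k == 0:
            break                        # the rest is already in place
        run = next(run_iter)
        zeros_left -= run
        pos -= run + 1
    return buf


# Illustrative values (not the block of Fig. 9).
out = place_coefficients([1, -1, -1, 1, 3], total_zeros=3, runs=[1, 0, 0, 1])
print(out)
```

Because every target index is at least as large as the source index and the coefficients are moved from the highest frequency downward, a move never lands on a coefficient that is still waiting to be placed, which is why one buffer is sufficient in this model.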
If there are no more zeros left (i.e., $\sum \mathrm{run\_before} = \mathrm{total\_zeros}$), it is unnecessary to decode any further run_before symbols, and the remaining coefficients in the level buffer are already located in the corresponding positions without any movement. Therefore, only one level buffer is required, and the cycle counts, hardware area, and power consumption can all be reduced with one output buffer.

TABLE V. PERFORMANCE IMPROVEMENT VERSUS THE NUMBER OF TRAILING ONES VARIED WITH QP

TABLE VI. CRITICAL PATH IMPROVEMENT

TABLE VII. HARDWARE COST PROFILE FOR DIFFERENT FUNCTIONAL BLOCK (GATE COUNT)

Fig. 8. Proposed architecture with a single output buffer.

Fig. 9. Example of data movement in a single output buffer.

The output buffer architecture is shown in Fig. 8. Since level coefficient values can be represented with 13 bits, a 16-entry deep and 13-bit wide memory is adopted to store the level coefficients and to buffer the reconstructed quantized coefficients.

As an example, the data movement in the output buffer is illustrated in Fig. 9. Following the decoding procedure shown in Table I, six nonzero coefficients are successively stored in the level buffer during the Sign of T1s and Level cycles. After the total_zeros cycle, the last coefficient, 1, is moved directly to the location index total_coeff + total_zeros, and its original position in the level buffer is filled with 0. Moreover, zero_left is initialized to the value of total_zeros. When the first run_before value is decoded, zero_left is equal to 2, so the possible run_before value lies between 0 and 2. Hence, the second nonzero coefficient in the reverse scan, -1, is moved into its corresponding position without overwriting the other nonzero coefficients. Then, after the second run_before value is decoded, the sum of the run_before values is equal to total_zeros. The remaining coefficients are already located in the correct positions and can be transmitted directly to the next functional block.

V. SIMULATION AND VERIFICATION

The proposed hardware architecture was synthesized with a 0.18-μm CMOS standard cell-based library for performance evaluation. The experimental results for the critical path of the optimized algorithm are shown in Table VI. When the modified suffixlength detector (MSD) is shifted to the first stage, the proposed architecture shortens the critical path delay by 45% compared with the original level decoder. Moreover, by forwarding the MSD, the proposed architecture can be easily implemented using a pipeline structure. The critical path delay is further reduced to one-third of that of the original level decoder,
which allows the maximum working frequency to be about 213 MHz. Compared with the worst case described above, the processing capability of the proposed architecture can easily guarantee the real-time requirement for resolutions higher than the 1080 HD ( ) video format.

Fig. 10. Power consumption MHz.

TABLE VIII. SIMULATION RESULTS AND COMPARISONS OF H.264 CAVLC DECODERS

The hardware cost profile of the individual functional blocks of the CAVLC decoders, in terms of gate count, is shown in Table VII. Four kinds of CAVLC decoders are compared. The original CAVLC decoder is implemented as specified in the H.264 standard. Low Power LUTs [25] introduces table partitioning and prefix predecoding to reduce power consumption. Since extra latches are adopted to prevent the unselected tables from consuming unnecessary dynamic power, the areas for coeff_token, total_zeros, and run_before are larger than the original ones. Because the prefix predecoding method is used to reduce the table size, the coeff_token area does not increase as much as those of total_zeros and run_before. Low Power Combined LUT (LPCLUT) combines the tables within coeff_token, total_zeros, and run_before. Since the number of latches for low power LUTs can be further reduced by the combining procedure, the LPCLUT requires less hardware than three separate tables. The analysis shows that the proposed combined LUT method reduces area by 11% on average. Moreover, the proposed architecture with the LPCLUT and a single output buffer is labeled Improved LPCLUT, in which only half the register size is required. Although the self-controller in the output buffer is more complex than the original one, the hardware saving in the output buffer part reaches 33% compared with the double buffer architecture. Consequently, the overall hardware cost is about 19% less than that of the previous CAVLD architecture.

Fig. 10 shows the experimental results for the average power consumption of the proposed Improved LPCLUT CAVLC decoder and the original decoder. A previous low power design [25], marked as LP LUTs, is also included in the power evaluation. Three sequences with four different quantization parameters were applied to verify the low power architecture. Since the quantized coefficients become smaller as the QP gets larger, the power reduction declines for large QP. The sequence Mobile, which is the most complex sequence, shows a larger power reduction than the other sequences. The results show that the proposed architecture reduces the power consumption by approximately 40% on average.

A comparison of the hardware cost and processing speed of the proposed design with other existing designs is shown in Table VIII. The proposed architecture has the smallest area of all designs except the one suggested in [16]. Note that the area reported in [16] does not include the output buffer unit used to reconstruct the CAVLD elements into a 4 × 4 residual block in zig-zag scan order. In [17], extra memory is required for the IDS (Interleave Double Stacks) buffer. In addition, it can only decode one symbol per cycle. The proposed design outperforms the method addressed in [23] in terms of both hardware cost and operation speed while providing comparable throughput. Alle's design [19] achieves slightly higher performance owing to a superior CMOS technology; however, it requires roughly three times the area.

VI. CONCLUSION

A high-performance CAVLC decoding architecture for the H.264 decoder is proposed in this paper. To improve overall performance, the CAVLC decoding process was precisely analyzed.

An effective algorithm for decoding Level symbols was proposed to improve the suffixlength detector. The critical path was then reduced by forwarding the suffixlength detector to the first stage in the Level decoding procedure. With the improved algorithm, the proposed architecture was implemented using a pipeline structure, which triples the decoding speed of the conventional decoding procedure suggested in the H.264 standard. Using parallel realization in all syntax element decoders, the proposed CAVLD architecture can decode more than one syntax element per cycle. Moreover, the VLC tables within the different syntax elements were combined into one combined LUT, where latches were added to avoid unnecessary dynamic power consumption. A single output buffer was used in the hardware implementation. The hardware cost and power consumption were reduced without affecting the decoding performance. Experimental results show that the proposed architecture achieves approximately 19% area reduction and 40% power savings compared to conventional CAVLC decoding methods while maintaining triple decoding speed. The maximum frequency of the proposed architecture is 213 MHz, which is fast enough to decode the 1080 HD ( ) video format.

REFERENCES

[1] Information Technology-Coding of Moving Pictures and Associated Audio for Digital Storage Media at Up to About 1.5 Mbit/s: Video, in ISO/IEC (MPEG-1 Video), ISO/IEC Std
[2] Information Technology-Generic Coding of Moving Pictures and Associated Audio Information: Video, in ITU-T Rec. H.262 (MPEG-2 Video), ISO/IEC Std
[3] Information Technology-Generic Coding of Audio-Visual Objects Part 2: Visual, in ISO/IEC (MPEG-4 Video), ISO/IEC Std
[4] Video codec for audiovisual services at p×64 kbits/s, in ITU-T Recommend. H.261 Version
[5] Video Coding for Low Bitrate Communication, in ITU-T Recommend. H.263, 1995, Version 1, Version 2 Sep
[6] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC AVC), in Joint Video Team, Mar. 2003, Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-G050.
[7] B. Jeon, J. Park, and J. Jeong, "Huffman coding of DCT coefficients using dynamic codeword assignment and adaptive codebook selection," Signal Process. Image Commun., vol. 12, no. 3, pp , Jun
[8] G. Lakhani, "Optimal Huffman coding of DCT blocks," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 4, pp , Apr
[9] S. M. Lei and M. T. Sun, "An entropy coding system for digital HDTV applications," IEEE Trans. Circuits Syst. Video Technol., vol. 1, no. 1, pp , Mar
[10] D. S. Ma, J. F. Yang, and J. Y. Lee, "Programmable and parallel variable-length decoder for video systems," IEEE Trans. Consum. Electron., vol. 39, no. 3, pp , Jun
[11] B. J. Shieh, Y. S. Lee, and C. Y. Lee, "A new approach of group-based VLC codec system with full table programmability," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 2, pp , Feb
[12] R. Hashemian, "Design and hardware implementation of a memory efficient Huffman decoding," IEEE Trans. Consum. Electron., vol. 40, no. 3, pp , Aug
[13] J. Nikara, S. Vassiliadis, J. Takala, and P. Liuha, "Multiple-symbol parallel decoding for variable length codes," IEEE Trans. Very Large Scale Integrat. Syst., vol. 12, no. 7, pp , Jul
[14] H. Y. Kang, K. A. Jeong, J. Y. Bae, Y. S. Lee, and S. H. Lee, "MPEG4 AVC/H.264 decoder with scalable bus architecture and dual memory controller," in Proc. IEEE Int. Symp. Circuits Syst., May 2004, pp
[15] S. H. Wang, W. H. Peng, Y. He, G. Y. Lin, C. Y. Lin, S. C. Chang, C. N. Wang, and T. Chiang, "A platform-based MPEG-4 advanced video coding (AVC) decoder with block level pipelining," in Proc. IEEE Int. Conf. Inform. Commun. Security, Huhehaote, China, Dec. 2003, pp
[16] D. Wu, W. Gao, M. Hu, and Z. Ji, "A VLSI architecture design of CAVLC decoder," in Proc. IEEE Int. Conf. ASIC, Beijing, China, Oct. 2003, vol. 2, pp
[17] H. C. Chang, C. C. Lin, and J. I. Guo, "A novel low-cost high-performance VLSI architecture for MPEG-4 AVC/H.264 CAVLC decoding," in Proc. IEEE Int. Symp. Circuits Syst., Kobe, Japan, May 2005, pp
[18] Y. M. Lin and P. Y. Chen, "An efficient implementation of CAVLC for H.264/AVC," in Proc. Int. Conf. Innovative Comput. Inform. Contr., Beijing, China, Aug. 2006, pp
[19] M. Alle, J. Biswas, and S. K. Nandy, "High performance VLSI architecture design for H.264 CAVLC decoder," in Proc. IEEE 17th Int. Conf. Application-Specific Systems, Architectures Processors, Steamboat Springs, CO, Sep. 2006, pp
[20] Y. H. Moon, G. Y. Kim, and J. H. Kim, "An efficient decoding of CAVLC in H.264/AVC video coding standard," IEEE Trans. Consum. Electron., vol. 51, no. 3, pp , Aug
[21] Y.-H. Kim, Y.-J. Yoo, J. Shin, B. Choi, and J. Paik, "Memory-efficient H.264/AVC CAVLC for fast decoding," IEEE Trans. Consum. Electron., vol. 52, no. 3, pp , Aug
[22] S.-Y. Tseng and T.-W. Hsieh, "A pattern-search method for H.264/AVC CAVLC decoding," in 2006 IEEE Int. Conf. Multimedia Expo, Toronto, ON, Canada, Jul. 2006, pp
[23] G.-S. Yu and T.-S. Chang, "A zero-skipping multi-symbol CAVLC decoder for MPEG-4 AVC/H.264," in Proc. Int. Symp. Circuits Syst., Island of Kos, Greece, May 2006, pp
[24] Y.-N. Wen, G.-L. Wu, S.-J. Chen, and Y.-H. Hu, "Multiple-symbol parallel CAVLC decoder for H.264/AVC," in Proc. APCCAS IEEE Asia Pacific Conf. Circuits Syst., Singapore, Dec. 2006, pp
[25] H. Y. Lin, Y. H. Lu, B. D. Liu, and J. F. Yang, "Low power design of H.264 CAVLC decoder," in Proc IEEE Int. Symp. Circuits Syst., Island of Kos, Greece, May 2006, pp

Heng-Yao Lin (S'06) was born in Tainan, Taiwan, R.O.C., in He received the B.S. and M.S. degrees in electrical engineering from National Cheng Kung University, Tainan, in 2001 and 2003, respectively. He is currently pursuing the Ph.D. degree in electrical engineering at National Cheng Kung University, Tainan. His major research interests include fast algorithms, low power designs, and VLSI architectures for H.264/AVC and multimedia applications.

Ying-Hung Lu received the B.S. and M.S. degrees in electrical engineering from National Cheng Kung University, Tainan, Taiwan, R.O.C., in 2003 and 2005, respectively. Since 2005, he has been with NOVATEK Microelectronics Corp., Hsinchu, Taiwan. His current research interests include low power designs and VLSI architectures for H.264/AVC and multimedia applications.

Bin-Da (Brian) Liu (S'79-M'82-SM'95-F'06) received the B.S., M.S., and Ph.D. degrees, all in electrical engineering, from the National Cheng Kung University, Tainan, Taiwan, R.O.C., in 1973, 1975, and 1983, respectively. From 1975 to 1977, he was Electrical Officer in the Combined Service Forces. Since 1977, he has been on the faculty of the National Cheng Kung University, where he is currently Distinguished Professor in the Department of Electrical Engineering and Director of the SoC Research Center. During , he was a Visiting Assistant Professor in the Department of Computer Science, University of Illinois at Urbana-Champaign. During , he was the Director of Electrical Laboratories, National Cheng Kung University. He was the Associate Chair of the Electrical Engineering Department during and the Chair during He has been a Consultant of the Chip Implementation Center, National Applied Research Laboratories, and the VLSI Educational Program, Ministry of Education, Taiwan, since 1995 and 1997, respectively. He has published more than 240 technical papers. He also contributed chapters to the books Neural Networks and Systolic Array Design (Singapore: World Scientific, 2002, D. Zhang, Ed.), Accuracy Improvements in Linguistic Fuzzy Modeling (Heidelberg, Germany: Springer-Verlag, 2003, J. Casillas, O. Cordón, F. Herrera, and L. Magdalena, Eds.), and VLSI Handbook (Boca Raton, FL: CRC, 2006, W. K. Chen, Ed.). His current research interests include low power circuits, neural network circuits, sensory and biomedical circuits, and VLSI implementation of fuzzy/neural circuits and audio/video signal processors.

Dr. Liu is on the Board of Directors of Taiwan IC Design Society and IEEE Tainan Section, and a member of Phi Tau Phi, Taiwan SOC Consortium, International Union of Radio Science, Chinese Fuzzy Systems Association, Chinese Institute of Electrical Engineering (CIEE), and the Institute of Electronics, Information and Communication Engineers (IEICE). He was the Chair of IEEE Circuits and Systems Society-Taipei Chapter during and the Vice President of Region 10, IEEE Circuits and Systems Society, during He served as a CAS Associate Editor for the IEEE Circuits and Devices Magazine during , and an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I: REGULAR PAPERS during He received the Dragon Distinguished Paper Award from the Acer Foundation in 1991, 1997, and 2004, the Best Paper Award from the CIEE in 1995 and 2002, the Golden Silicon Award from the Macronix Foundation in 2001, 2002, 2003, and 2006, the MPC Chip Design Award from the National Chip Implementation Center in 2002, 2003, and 2004, the Low Power Design Contest Award from the ACM/IEEE in 2003, the Shen Wen-Zen Memorial Paper Award from the Taiwan IC Design Society in 2004, the Outstanding Electrical Engineering Professor Award from the CIEE in 2004, the Lam Research Thesis Award from Lam Research Corporation in 2005, the Best Paper Award from the Fourth Regional Inter-University Postgraduate Electrical and Electronics Engineering Conference in 2006, the Best Paper Award from 2006 IEEE Asia-Pacific Conference on Circuits and Systems, and the Research Award from the National Science Council annually since He organized the Taiwan Student VLSI Design Contest from 1998, to Since 1992 he has served as a Member of the Steering Committee of VLSI Design/CAD Symposium and served as the General Chair in He served as a member of the Program Committee for many international conferences, including 1998 and 1999 IEEE Workshop on VLSI Signal Processing Systems, 1998 and 2000 IEEE Asia Pacific Conference on Circuits and Systems, 1999 to 2001 IEEE Asia Pacific Conference on ASICs, 1997 to 2003 International Symposium on VLSI Technology, Systems, and Applications, 2005 to 2007 International Symposium on VLSI Design, Automation and Test, 2006 and 2007 IEEE International Conference on Multimedia and Expo, IEEE TENCON and IEEE International Conference on Systems, Man, and Cybernetics. He was a member of International Advisory Committee of the 2003 IEEE International Conference on Neural Networks & Signal Processing, and the International Steering Committee of the IEEE Asia-Pacific Conference on Circuits and Systems from 2001 to He also served as the Technical Program Chair of the 2003 Workshop on Consumer Electronics, the General Co-Chair of the First and Second International Meeting on Microsensors and Microsystems in 2003 and 2006, the General Chair of the 2004 IEEE Asia-Pacific Conference on Circuits and Systems, and the Meeting Chair of the IEEE 9th International Workshop of Cellular Neural Networks and Their Applications in He is currently serving as an Associate Editor for the IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, the IEEE TRANSACTIONS ON FUZZY SYSTEMS, and the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS.

Jar-Ferr Yang (S'84-M'88-SM'98-F'07) was born in Keelung, Taiwan, R.O.C., on September 15, He received his B.S. degree from the Chung-Yuan Christian University, Taiwan, in 1977, the M.S. degree from the National Taiwan University, Taiwan, in 1979, and the Ph.D. degree from the University of Minnesota, Minneapolis, in 1988, all in electrical engineering. He was an Instructor in the Chinese Naval Engineering School for his Navy ROTC service in As an Assistant Researcher, he worked in the Data Transmission and Network Design Research Group, Telecommunication Laboratories, during From 1982 to 1984, he was an Adjunct Lecturer in the Chung-Yuan Christian University.
From 1984 to 1988, he received the Government Study Abroad Scholarship, which supported his advanced study at the University of Minnesota. In 1988, he joined the National Cheng Kung University as an Associate Professor and was promoted to Full Professor and Distinguished Professor in 1994 and 2004, respectively. He was the Chairman of the Center for Computer and Communication Research, National Cheng Kung University, from 1997 to In 2002, he was a Visiting Scholar at the Department of Electrical Engineering, University of Washington, Seattle. Currently, he is the Chairperson of the Graduate Institute of Computer and Communication Engineering and the Director of the Electrical and Information Technology Center. Dr. Yang was selected as a speaker in the Distinguished Lecturer Program by the IEEE Circuits and Systems Society in He was the Technical Program Co-chair of the 2004 IEEE Asia Pacific Conference on Circuits and Systems and the 9th 2005 IEEE International Workshop on Cellular Neural Networks and Their Applications. From 2004 to 2006, he was Chair of IEEE Signal Processing Society, Tainan Chapter, and Treasurer of IEEE Tainan Section. Since 2006, he has served on the Chair Committee of the IEEE Signal Processing Society. During , he was the Secretary (Chair-elect) of the IEEE Multimedia Systems and Applications Technical Committee in the IEEE Circuits and Systems Society. Currently, he is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and IEEE Circuits and Devices Magazine. He is an Associate Editor of the Journal of Applied Signal Processing and was a Guest Editor of the Special Issue on Advanced Video Technologies and Applications for H.264/AVC and Beyond in this journal. During , he is an Editorial Board Member of IET Signal Processing. He has published over 75 journal and 120 conference papers.
His teaching and research areas primarily include video, audio, and speech signal processing and coding, adaptive signal processing, and digital life system design and integration. He is a Fellow of IEEE for his contributions to fast algorithms and efficient realization of video and audio coding.
