A Low Power Implementation of H.264 Adaptive Deblocking Filter Algorithm

Mustafa Parlak and Ilker Hamzaoglu
Faculty of Engineering and Natural Sciences, Sabanci University, Tuzla, 34956, Istanbul, Turkey
Email: mparlak@su.sabanciuniv.edu, hamzaoglu@sabanciuniv.edu

Abstract

In this paper, we present a low power implementation of the H.264 adaptive deblocking filter (DBF) algorithm on the ARM Versatile / PB926EJ-S Development Board. The DBF hardware is implemented in Verilog HDL. An AHB bus interface is designed and integrated into the DBF hardware so that it can communicate with the ARM processor and SRAM through the AHB bus. An efficient memory hierarchy and data transfer scheme is also implemented. The DBF hardware implementation works at 72 MHz in a Xilinx Virtex II FPGA and it can code 30 CIF frames (352x288) per second. The power consumption of the DBF hardware is analyzed, and up to 13% power savings are achieved by applying clock gating and glitch reduction techniques to the DBF datapath.

1. Introduction

Video compression systems are used in many commercial products, from consumer electronic devices such as digital camcorders and cellular phones to video teleconferencing systems. These applications make video compression hardware an essential part of many commercial products. To improve the performance of existing applications and to extend video compression to new real-time applications, a new international standard for video compression has recently been developed. This standard, which offers significantly better video compression efficiency than previous international standards, was developed through the collaboration of the ITU and ISO standardization organizations. Hence it is known by two different names, H.264 and MPEG-4 Part 10. The video compression efficiency achieved by the H.264 standard is not the result of any single feature but rather of a combination of a number of encoding tools.
As shown in the top-level block diagram of an H.264 encoder in Figure 1, one of these tools is the adaptive deblocking filter (DBF) algorithm [1, 2, 3]. The deblocking filter is applied to each decoded macroblock (MB), a 16x16 pixel array, after inverse quantization and inverse transform. The deblocking filter improves the visual quality of decoded frames by reducing the visually disturbing blocking artifacts and discontinuities in a frame that are caused by coarse quantization of MBs and by motion compensated prediction. Since the filtered frame is used as a reference frame for motion compensated prediction of future frames, the deblocking filter also increases coding efficiency, resulting in bit rate savings [4].

The deblocking filter algorithm used in the H.264 standard is more complex than the deblocking filter algorithms used in previous video compression standards. First, the H.264 deblocking filter algorithm is highly adaptive. Second, it is applied to each edge of all the 4x4 luma and chroma blocks in a MB. Third, it can update up to 3 pixels on each side of an edge in the direction in which filtering takes place. Fourth, in order to decide whether the deblocking filter will be applied to an edge, the related pixels in the current and neighboring 4x4 blocks must be read from memory and processed. Because of these complexities, the deblocking filter algorithm can easily account for one-third of the computational complexity of an H.264 video decoder [4].

We presented an efficient hardware architecture for real-time implementation of the H.264 adaptive DBF algorithm in [5]. In this paper, we present a low power implementation of the DBF algorithm on the ARM Versatile / PB926EJ-S Development Board. The DBF hardware is implemented in Verilog HDL. An AHB bus interface is designed and integrated into the DBF hardware so that it can communicate with the ARM processor and SRAM through the AHB bus. An efficient memory hierarchy and data transfer scheme is also implemented.

The DBF hardware implementation works at 72 MHz in a Xilinx Virtex II FPGA and it can code 30 CIF frames (352x288) per second. The power consumption of DBF hardware is analyzed using Xilinx XPower tool and up to 13% power savings is achieved by applying clock gating and glitch reduction techniques to DBF datapath.

Several hardware architectures for real-time implementation of H.264 adaptive deblocking filter algorithm are presented in the literature [6, 7]. These architectures achieve higher performance than our hardware design at the expense of a much higher hardware cost. Our hardware design is a more cost effective solution for portable applications. We achieved real-time performance for portable applications by only using one 12-bit adder, one 12-bit comparator, a few shifters, two's complementers and multiplexers in our datapath. In addition, we used low power techniques for reducing power consumption of the DBF hardware.

The rest of the paper is organized as follows. Section 2 presents a brief overview of adaptive deblocking filter algorithm used in H.264. Section 3 describes the proposed hardware architecture in detail. The memory hierarchy and data transfer scheme is explained in Section 4. The implementation of DBF hardware on ARM Versatile / PB926EJ-S Development Board is given in Section 5. Section 6 presents power consumption analysis of DBF hardware. The application of low power techniques to DBF datapath is explained in Section 7. Finally, Section 8 presents the conclusions.

Figure 1 H.264 Encoder Block Diagram

Figure 2 4x4 Blocks in a MB and Filtering Order

2. Overview of H.264 Adaptive Deblocking Filter Algorithm

H.264 adaptive deblocking filter removes visually disturbing block boundaries created by coarse quantization of MBs and motion compensated prediction. Filtering is applied to each edge of all the 4x4 luma and chroma blocks in a MB. The 4x4 luma and chroma blocks in a MB are shown in Figure 2.
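To give a feel for the workload behind "each edge of all the 4x4 luma and chroma blocks", the following sketch counts the edge sample lines in one MB and combines that with the 72 MHz clock and 30 CIF frames per second stated above. It is only a back-of-the-envelope estimate: 4:2:0 chroma sampling is assumed, frame-boundary edges (which are not filtered) are counted anyway, and the resulting cycle budget is an upper bound that assumes every clock cycle is available for filtering. The function names are illustrative and do not come from the design.

```python
# Rough workload estimate for deblocking one CIF stream (illustrative sketch).
MB_SIZE = 16                    # a macroblock is a 16x16 pixel array
CIF_WIDTH, CIF_HEIGHT = 352, 288
FRAME_RATE = 30                 # frames per second stated in the paper
CLOCK_HZ = 72_000_000           # clock frequency stated in the paper


def macroblocks_per_frame(width=CIF_WIDTH, height=CIF_HEIGHT):
    """Number of 16x16 macroblocks in one frame."""
    return (width // MB_SIZE) * (height // MB_SIZE)


def edge_lines_per_macroblock():
    """Count lines of samples that cross a 4x4 block edge in one MB.

    Luma: 4 vertical + 4 horizontal edges, each 16 samples long.
    Chroma (4:2:0 assumed): two 8x8 planes, each with 2 vertical +
    2 horizontal edges that are 8 samples long.
    """
    luma = (4 + 4) * 16
    chroma = 2 * (2 + 2) * 8
    return luma + chroma


def cycle_budget_per_macroblock():
    """Upper bound on clock cycles available per MB at real-time rate."""
    mbs_per_second = macroblocks_per_frame() * FRAME_RATE
    return CLOCK_HZ // mbs_per_second


if __name__ == "__main__":
    print(macroblocks_per_frame())        # macroblocks in a CIF frame
    print(edge_lines_per_macroblock())    # edge sample lines per MB
    print(cycle_budget_per_macroblock())  # cycles available per MB
```

Under these assumptions a CIF frame holds 396 MBs, each with 192 edge sample lines, and the clock leaves roughly 6060 cycles per MB.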
The 4x4 block edges in a MB are filtered in the order specified in the H.264 standard [1]. First, the vertical edges in the MB are filtered in the order a, b, c, d, i and j. Then, the horizontal edges in the MB are filtered in the order h, g, f, e, l and k. Several conditions determine whether a 4x4 block edge will be filtered or not. Additional conditions determine the strength of the filtering for the 4x4 block edges that will be filtered. The boundary strength (BS) parameter, the α and β threshold values, and the values of the pixels at the edge determine the outcomes of these conditions, and the values of up to 3 pixels on both sides of an edge can be changed depending on these outcomes. The deblocking filter algorithm is adaptive at three levels: slice level, edge level and sample level [3, 4]. Slice level adaptivity is used to adjust the filtering strength in a slice to the characteristics of the

slice data. The filtering strength in a slice is adjusted by the encoder using the offset-a and offset-b parameters. The α and β threshold values, which determine whether a 4x4 block edge will be filtered and how strong the filtering will be for that edge, are a function of the quantization parameter (QP) and these two offset parameters.

Edge level adaptivity is used to adjust the filtering strength for an edge to the characteristics of that edge. The filtering strength for an edge is adjusted using the BS parameter. Every edge is assigned a BS value depending on the coding modes and conditions of the 4x4 blocks. The conditions used for determining the BS value for an edge between two neighboring 4x4 blocks are specified in the H.264 standard [1]. The strength of the filtering done for an edge increases with its BS value. No filtering is done for the edges with a BS value of 0, whereas the strongest filtering is done for the edges with a BS value of 4.

Sample level adaptivity is used to adjust the filtering strength for a sample to the characteristics of the pixels in that sample, in order to distinguish the true edges from those created by quantization. The filtering strength for a sample is therefore determined by comparing pixel gradients in that sample with the α and β threshold values for that edge.

Figure 3 DBF Hardware Block Diagram

3. Proposed Hardware Architecture

The proposed DBF hardware architecture is shown in Figure 3. It includes a datapath, a control unit, an address generator, one 384x8 register file and two dual-port internal SRAMs to store partially filtered pixels. There is also an input buffer to store the unfiltered pixels and an output buffer to store the filtered pixels. In a complete H.264 video encoder, the input buffer is loaded with the reconstructed MB generated by the inverse transform and quantization unit. This unit generates the reconstructed MB one 4x4 block at a time [8]. The datapath is implemented as a two-stage pipeline to improve speed and throughput.
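The kind of work this datapath performs can be made concrete with a sketch of the sample-level decision summarized in Section 2. The inequalities and the p0/q0 update below are the ones commonly given for the H.264 deblocking filter for BS values below 4 (see [1] and [4]); they are not spelled out in this paper, and the sketch says nothing about how the hardware schedules them. The α, β and tc values in the usage example are hypothetical, since in the standard they are looked up from tables indexed by QP and the slice offsets.

```python
# Sample-level filtering decision and normal-strength p0/q0 update (sketch).

def clip3(low, high, value):
    """Clamp value to the inclusive range [low, high]."""
    return max(low, min(high, value))


def should_filter(bs, p1, p0, q0, q1, alpha, beta):
    """Return True if the samples across an edge should be filtered.

    p1, p0 lie on one side of the edge and q0, q1 on the other.
    Large gradients are treated as true image edges and left alone.
    """
    return (bs > 0
            and abs(p0 - q0) < alpha
            and abs(p1 - p0) < beta
            and abs(q1 - q0) < beta)


def filter_p0_q0(p1, p0, q0, q1, tc):
    """Normal-strength (BS < 4) update of the two samples next to the edge."""
    delta = clip3(-tc, tc, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3)
    new_p0 = clip3(0, 255, p0 + delta)
    new_q0 = clip3(0, 255, q0 - delta)
    return new_p0, new_q0


if __name__ == "__main__":
    # Hypothetical thresholds, for illustration only.
    alpha, beta, tc = 20, 6, 2
    if should_filter(2, 100, 102, 110, 111, alpha, beta):
        print(filter_p0_q0(100, 102, 110, 111, tc))
```

The operations involved are additions, shifts, negations, comparisons and selections, which matches the adder, shifters, two's complementers, comparator and multiplexers that the paper lists for its datapath.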
As the DBF algorithm is highly adaptive, the control unit and address generator designs are quite complex. The address generator is implemented as a two-stage pipeline to improve the clock frequency. Since the DBF algorithm includes several conditional branches, the control unit sometimes has to wait for a branch outcome before it can continue execution. To avoid datapath pipeline stalls, pre-computations are executed in these cycles.

Figure 4 Data Order in Input Buffer

4. Memory Hierarchy and Data Transfer

The DBF algorithm requires significant data transfer between the frame memory and the DBF datapath. This data transfer is very important for both performance and power consumption. The DBF algorithm is very complex, and a pixel might be read and written more than four times before the filtering process is complete. We tried to minimize power consumption by reusing data and avoiding unnecessary data accesses. This section describes the data transfer scheme we used in our DBF implementation.

There are five memory units in this architecture, as shown in Figure 3. The input buffer stores the pixels of a new macroblock that is ready for filtering. Pixels in a macroblock are stored in the order shown in Figure 4. Since the H.264 standard processes each video frame in macroblock units, we implemented a 384x8 input buffer which can store 256 luma and 128 chroma pixels. This buffer is a dual-port memory and it is filled from frame memory through the AHB bus.
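The 384x8 figure follows directly from the macroblock structure, as the short sketch below shows. The linear block-by-block address mapping and the 32-bit bus width are assumptions made purely for illustration: the actual data order is the one defined in Figure 4, and the paper does not state the AHB data width.

```python
# Input buffer sizing for one macroblock (illustrative sketch).
LUMA_BLOCKS = 16     # 16x16 luma samples = sixteen 4x4 blocks
CHROMA_BLOCKS = 8    # two 8x8 chroma planes = eight 4x4 blocks
BLOCK_PIXELS = 16    # a 4x4 block holds 16 pixels of 8 bits each


def input_buffer_depth():
    """Number of 8-bit entries needed to hold one macroblock."""
    return (LUMA_BLOCKS + CHROMA_BLOCKS) * BLOCK_PIXELS


def input_buffer_address(block_index, row, col):
    """Hypothetical address of a pixel, with 4x4 blocks stored back to back.

    This layout is an assumption; the real order is given by Figure 4.
    """
    if not 0 <= block_index < LUMA_BLOCKS + CHROMA_BLOCKS:
        raise ValueError("block_index out of range")
    if not (0 <= row < 4 and 0 <= col < 4):
        raise ValueError("row and col must be in 0..3")
    return block_index * BLOCK_PIXELS + row * 4 + col


def bus_words_to_fill(bus_width_bits=32):
    """Bus transfers needed to load one macroblock (bus width assumed)."""
    return input_buffer_depth() * 8 // bus_width_bits


if __name__ == "__main__":
    print(input_buffer_depth())            # entries per macroblock
    print(input_buffer_address(16, 0, 0))  # first chroma pixel in this layout
    print(bus_words_to_fill())             # transfers at the assumed width
```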

Since each 4x4 block in a MB has 4 edges, a pixel in a 4x4 block may be read or updated four times before the filtering process finishes completely. Therefore some pixels, unfiltered or partially filtered, have to be stored in local memories to be accessed later. SRAMs and register files are used for this purpose to temporarily store intermediate results. The filtered macroblocks are stored in the output buffer, and the filtering order is shown in Figure 2.

The neighboring 4x4 blocks in the upper and left macroblocks are used for filtering the edges a, e, i and k of a macroblock. Therefore, these 4x4 blocks are stored in local buffers after they are filtered. Since filtering is not done at the frame boundary and some 4x4 blocks have to be stored for future filtering, there are nine different writing schemes for macroblocks, which are shown in Figure 5. In the figure, the shaded regions in a macroblock indicate the filtered 4x4 blocks, and the white 4x4 blocks are stored for future filtering. Filtered macroblocks are transferred from the output buffer according to these writing schemes and placed in frame memory through the AHB bus interface.

Type-5 is the most common scheme and is used for the macroblocks in the center of the frame. Type-1, type-3, type-7 and type-9 schemes are used for the macroblocks in the upper-left, upper-right, lower-left and lower-right corners of the frame, respectively. Type-2, type-4, type-6 and type-8 schemes are used for the remaining macroblocks in the first row, first column, last column and last row of the frame, respectively. There are two reasons for using these writing schemes. The first is to keep the output buffer small, because small memories consume less power. The second is that address generation for the most common writing scheme has a very regular structure and can be implemented with a small amount of hardware.

5. DBF ARM Versatile / PB926EJ-S Development Board Implementation

The proposed architecture is implemented in Verilog HDL.
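Since the design itself is written in Verilog, the Python sketch below is only meant to illustrate the selection logic behind the nine writing schemes of Section 4. The type numbers follow the text (type-1 in the upper-left corner through type-9 in the lower-right corner, type-5 in the center), which amounts to numbering a 3x3 grid of position classes row by row. The function name and its coordinate convention (macroblock column and row counted from the upper-left corner) are assumptions, and the frame is assumed to be at least three macroblocks wide and high.

```python
# Selecting a writing scheme from the macroblock position (sketch).

def writing_scheme(mb_x, mb_y, mbs_wide, mbs_high):
    """Return the writing scheme type (1..9) for a macroblock.

    mb_x, mb_y are the macroblock column and row, starting at 0 in the
    upper-left corner of the frame.
    """
    if not (0 <= mb_x < mbs_wide and 0 <= mb_y < mbs_high):
        raise ValueError("macroblock position outside the frame")
    # 0 = first column/row, 1 = interior, 2 = last column/row
    col_class = 0 if mb_x == 0 else (2 if mb_x == mbs_wide - 1 else 1)
    row_class = 0 if mb_y == 0 else (2 if mb_y == mbs_high - 1 else 1)
    return 3 * row_class + col_class + 1


if __name__ == "__main__":
    # A CIF frame (352x288) is 22 macroblocks wide and 18 high.
    counts = {}
    for y in range(18):
        for x in range(22):
            scheme = writing_scheme(x, y, 22, 18)
            counts[scheme] = counts.get(scheme, 0) + 1
    print(counts)  # how often each scheme occurs in one CIF frame
```

For a CIF frame this gives 320 of the 396 macroblocks as type-5, which is consistent with the paper calling it the most common scheme.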
The implementation is verified with RTL simulations using Mentor Graphics ModelSim SE. The Verilog RTL is then synthesized to a 2V8000ff1157 Xilinx Virtex II FPGA with speed grade 5 using Mentor Graphics Leonardo Spectrum. The resulting netlist is placed and routed to the same FPGA using Xilinx ISE Series 7.1i. The DBF hardware implementation works at 72 MHz and it can code 30 CIF frames (352x288) per second.

Figure 5 Writing Schemes in Output Buffer

Figure 6 ARM Versatile / PB926EJ-S Development Board

The development environment is shown in Figure 6 [9]. It consists of a PC connected to the ARM Versatile/PB926EJ-S board through the ARM Multi-ICE debugger, a logic tile mounted on the Versatile/PB926EJ-S board and a color LCD panel. The PC is used to create the FPGA bit stream, which is loaded to the FPGA on the logic tile. ARM Multi-ICE Server V.2.2.6 is used to communicate with the development board, and the AXD Debugger from ARM Developer Suite 1.2 is used to debug the system. The color LCD panel is used to display the original and reconstructed images for visual verification. The Versatile/PB926EJ-S board contains a development chip including an ARM9 processor, a bus matrix and a number of peripheral interfaces. The board has a JTAG connector which is used for configuring the

FPGAs on the board and for debugging the system. The Versatile board offers the possibility of using one or more RealView logic tiles, which include Xilinx Virtex II 8 million gate FPGAs, to implement additional custom-designed logic in the system [9].

Before implementing the H.264 deblocking filter algorithm on this platform, an AHB bus interface is designed and integrated into the DBF hardware so that it can communicate with the ARM processor and SRAM through the AHB bus. A video frame is loaded into the SRAM on the board using software. This video frame is used as the input to the DBF hardware running on the Virtex II FPGA. The DBF hardware applies the DBF algorithm to this video frame and writes the resulting frame back to SRAM. The resulting video frame is shown on the color LCD panel. Figure 7 shows an example unfiltered video frame and Figure 8 shows the same frame filtered by the H.264 DBF algorithm running on the ARM Versatile development board.

Figure 7 Unfiltered Raw Video Frame

6. Power Consumption Analysis

Power consumption analysis of the DBF hardware is performed using the Xilinx XPower tool. XPower needs the activity rate of each signal in a design to accurately estimate its power consumption. XPower uses the activity rates stored in a VCD file, which is generated by timing simulation of the placed and routed netlist using a simulator such as Mentor Graphics ModelSim. Since our main concern for the DBF hardware is internal power dissipation, the input and output power consumptions have not been considered.

As shown in Figure 9, apart from the power due to input/output, the power consumption of the DBF hardware can be divided into three main categories: clock power, logic power and signal power. Signal power is the power dissipated in the routing tracks between logic blocks. A significant amount of power is dissipated in routing tracks; it accounts for 47% of total power consumption. This is an expected result because of the long interconnects, programmable switches and heavy capacitive loads in an FPGA.
Logic power is the power dissipated in the parts where logic functions and computations take place. Clock power is due to the clock tree used in the FPGA. Logic accounts for 27% of the total power and the clock tree consumes the remaining 26%.

Figure 8 Video Frame Filtered by H.264 Deblocking Filter Algorithm

Figure 9 DBF Hardware Power Consumption Distribution
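The per-module shares quoted in this section (13% for the datapath and 4% for the address generator) can be cross-checked against the values of Table 1. The short sketch below does this; the dictionary keys simply mirror the table rows, and the helper name is illustrative.

```python
# Cross-check of the module power shares using the Table 1 values (mW).
MODULE_POWER_MW = {
    "Datapath": 33.49,
    "Control Unit": 56.61,
    "Address Generator": 9.91,
    "Distributed SelectRAM": 135.08,
    "Block SelectRAM": 24.04,
}


def power_shares(power_by_module):
    """Return each module's share of the total power, in percent."""
    total = sum(power_by_module.values())
    return {name: 100.0 * mw / total for name, mw in power_by_module.items()}


if __name__ == "__main__":
    print(f"Total: {sum(MODULE_POWER_MW.values()):.2f} mW")
    for name, share in power_shares(MODULE_POWER_MW).items():
        print(f"{name}: {share:.1f}%")
```

The rows sum to the 259.13 mW total given in the table, with the distributed SelectRAM accounting for a little over half of it.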

Table 1 DBF Modules Power Consumption Distribution

Module                   Power (mW)
Datapath                 33.49
Control Unit             56.61
Address Generator        9.91
Distributed SelectRAM    135.08
Block SelectRAM          24.04
Total                    259.13

The DBF hardware consists of four main modules: the datapath, the control unit, the address generator and the memories, as shown in Figure 3. Xilinx Virtex-II features a large number of 18 Kb block SelectRAMs and distributed SelectRAMs. The DBF architecture has two block SelectRAMs and two distributed SelectRAMs used as local memory to store intermediate results. The power consumption of each DBF module is given in Table 1. In this analysis, each module is handled individually.

As shown in Table 1, the largest contribution to power consumption comes from the distributed SelectRAM. Distributed SelectRAM is formed by look-up tables (LUTs) in CLBs and is synthesized to a circuit which uses many LUTs placed in a distributed fashion in the FPGA. Therefore the routing tracks of distributed RAMs are typically long, which results in increased power dissipation. The second largest contributor to power consumption is the control unit. Since the DBF algorithm is adaptive and complex, the control unit includes many registers, multiplexers, comparators and counters. This increases both glitch power and clock power. The datapath accounts for 13% of the total power consumption. The address generator consumes the least power among these modules and accounts for only 4% of total power.

7. Datapath Power Reduction

We applied clock gating and glitch reduction techniques to the DBF datapath to reduce its power consumption. Clock gating provides the clock signal to modules only when they are active. The power consumption of synchronous systems can be reduced by minimizing unnecessary logic transitions. Clock gating causes registers to keep their contents unchanged and therefore reduces switching activity. The DBF datapath is implemented as a two-stage pipeline to improve the clock frequency and throughput.
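The effect of clock gating on switching activity can be illustrated with a toy behavioural model. The sketch below is not the Verilog used in the design and does not estimate power; it only counts how many register output bits toggle when a register is clocked every cycle compared with when its clock is enabled only in active cycles. The input values and the enable pattern are made up for the example.

```python
# Toy model: register output toggles with and without clock gating.

def register_trace(inputs, enables, reset_value=0):
    """Return the register contents after reset and after each cycle.

    The register captures its input only in cycles where the clock is
    enabled; otherwise it keeps its previous contents.
    """
    state = reset_value
    trace = [state]
    for data, enabled in zip(inputs, enables):
        if enabled:
            state = data
        trace.append(state)
    return trace


def count_bit_toggles(trace):
    """Count the bits that change between consecutive register values."""
    return sum(bin(prev ^ cur).count("1") for prev, cur in zip(trace, trace[1:]))


if __name__ == "__main__":
    inputs = [0b1010, 0b0101, 0b1111, 0b0000, 0b0011, 0b1100]
    active = [True, False, False, True, False, True]  # cycles with useful work

    ungated = register_trace(inputs, [True] * len(inputs))
    gated = register_trace(inputs, active)
    print(count_bit_toggles(ungated), count_bit_toggles(gated))
```

In hardware, gating also removes the clock transitions seen by the gated registers, which this model does not capture.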
As shown in Figure 10, the first pipeline stage includes one 12-bit adder and two shifters to perform numerical calculations such as multiplication and addition. The second pipeline stage includes one 12-bit comparator, several two's complementers and multiplexers to determine conditional branch results. The first stage performs numerical calculations every cycle, but the second stage is inactive for a considerable amount of time. Therefore, we shut off the second stage through clock gating when it is inactive. Table 2 shows the impact of clock gating on datapath power consumption. The datapath power consumption is reduced by 13% using clock gating.

Figure 10 DBF Datapath

Table 2 Impact of Clock Gating on Datapath Power Consumption

          Datapath (mW)    Datapath with Clock Gating (mW)
Clocks    7.46             6.71
Logic     7.2              5.88
Signal    18.1             16.56
Total     33.9             29.15

A glitch is a spurious transition at a node within a single cycle before the node settles to the correct logic value [10]. Unlike ASICs, in which signals can be routed using any available silicon, FPGAs implement interconnects using fixed metal tracks and programmable switches. The relative scarcity of

programmable switches often forces signals to take longer routes than would be seen in an ASIC. As a result, unequal delays among signals, and hence glitches, are more likely than in an ASIC. Thus, reducing glitches by pipelining is an effective power reduction technique for FPGAs.

The impact of glitches on DBF datapath power consumption can be seen by simulating the datapath under a zero delay model and analyzing its power consumption. The glitch-free power consumption of the DBF datapath is shown in Table 3; it indicates the maximum reduction in power consumption that can be obtained by reducing glitches. Table 3 also shows the impact of reducing glitches by pipelining on datapath power consumption. We inserted two pipeline registers immediately before the inputs of the adder. This reduced the datapath power consumption by 4.7%. We therefore obtained about 50% of the maximum possible power reduction that can be obtained by reducing glitches.

Table 3 Impact of Glitch Reduction on Datapath Power Consumption

          Datapath (mW)    Datapath without Glitches (mW)    Pipelined Datapath (mW)
Clocks    7.46             7.37                              9.25
Logic     7.62             6.60                              6.07
Signal    18.41            16.47                             16.59
Total     33.49            30.44                             31.91

8. Conclusions

In this paper, we presented a low power implementation of the H.264 adaptive DBF algorithm on the ARM Versatile / PB926EJ-S Development Board. The DBF hardware is implemented in Verilog HDL. An AHB bus interface is designed and integrated into the DBF hardware so that it can communicate with the ARM processor and SRAM through the AHB bus. An efficient memory hierarchy and data transfer scheme is also implemented. The DBF hardware implementation works at 72 MHz in a Xilinx Virtex II FPGA and it can code 30 CIF frames (352x288) per second. The power consumption of the DBF hardware is analyzed using the Xilinx XPower tool, and up to 13% power savings are achieved by applying clock gating and glitch reduction techniques to the DBF datapath.
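The savings quoted for the two techniques can be re-derived from the reported totals. The sketch below assumes that the 33.49 mW datapath total of Tables 1 and 3 is the baseline for both techniques, and takes the clock-gated total (29.15 mW) from Table 2 and the glitch-free and pipelined totals (30.44 mW and 31.91 mW) from Table 3. The helper name is illustrative.

```python
# Re-deriving the reported datapath power savings from the table totals (mW).
BASELINE_MW = 33.49      # datapath total in Tables 1 and 3 (assumed baseline)
CLOCK_GATED_MW = 29.15   # datapath with clock gating, Table 2
GLITCH_FREE_MW = 30.44   # zero-delay (glitch-free) datapath, Table 3
PIPELINED_MW = 31.91     # datapath with two extra pipeline registers, Table 3


def reduction_percent(before_mw, after_mw):
    """Relative power reduction in percent."""
    return 100.0 * (before_mw - after_mw) / before_mw


if __name__ == "__main__":
    gating = reduction_percent(BASELINE_MW, CLOCK_GATED_MW)
    pipelining = reduction_percent(BASELINE_MW, PIPELINED_MW)
    # Share of the glitch-related power that pipelining actually removed.
    achieved = (BASELINE_MW - PIPELINED_MW) / (BASELINE_MW - GLITCH_FREE_MW)
    print(f"Clock gating: {gating:.1f}%")
    print(f"Pipelining:   {pipelining:.1f}%")
    print(f"Fraction of maximum glitch reduction: {achieved:.2f}")
```

With these inputs the clock gating reduction rounds to 13% and the pipelining reduction to 4.7%, and pipelining recovers roughly half of the glitch-related power, in line with the figures stated in Section 7.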
As future work, we will apply other low power techniques to DBF hardware and assess their impact on the power consumption.

9. Acknowledgement

This research was supported in part by the Scientific and Technological Research Council of Turkey (TUBITAK) under the contract 106E153.

10. References

[1] Joint Video Team (JVT) of ITU-T VCEG and ISO/IEC MPEG, Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC, May 2003.
[2] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC Video Coding Standard", IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, July 2003.
[3] I. Richardson, H.264 and MPEG-4 Video Compression, Wiley, 2003.
[4] Peter List, Anthony Joch, Jani Lainema, Gisle Bjøntegaard, and Marta Karczewicz, "Adaptive Deblocking Filter", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, pp. 614-619, 2003.
[5] Mustafa Parlak and Ilker Hamzaoglu, "An Efficient Hardware Architecture for H.264 Adaptive Deblocking Filter Algorithm", NASA/ESA Conference on Adaptive Hardware and Systems, June 2006.
[6] Yu-Wen Huang, To-Wei Chen, Bing-Yu Hsieh, Tu-Chih Wang, Te-Hao Chang, and Liang-Gee Chen, "Architecture Design for Deblocking Filter in H.264/JVT/AVC", IEEE Conf. on Multimedia and Expo, pp. 693-696, 2003.
[7] Bin Sheng, Wen Gao and Di Wu, "An Implemented Architecture of Deblocking Filter for H.264/AVC", IEEE International Conference on Image Processing (ICIP'04), Vol. 1, pp. 665-668, 24-27 October 2004.
[8] Ozgur Tasdizen and Ilker Hamzaoglu, "A High Performance and Low Cost Hardware Architecture for H.264 Transform and Quantization Algorithms", European Signal Processing Conference, September 2005.
[9] Versatile Platform Baseboard for ARM926EJ-S User Guide, http://www.arm.com, May 2004.
[10] S. J. E. Wilton, S-S. Ang and W. Luk, "The Impact of Pipelining on Energy per Operation in Field-Programmable Gate Arrays", International Conference on Field-Programmable Logic and Applications, August 2004.