A low-power portable H.264/AVC decoder using elastic pipeline


Chapter 3
A low-power portable H.264/AVC decoder using elastic pipeline

Yoshinori Sakata, Kentaro Kawakami, Hiroshi Kawaguchi, Masahiko Yoshimoto
Graduate School, Kobe University, Kobe, Hyogo, 657-8507 Japan
Email: {yoshi, kawakami, kawapy, yosimoto}@cs8.cs.kobe-u.ac.jp

Abstract. We propose an elastic pipeline architecture that makes dynamic voltage scaling (DVS) applicable to dedicated hardware, and apply it to a portable H.264/AVC decoder LSI with an embedded frame buffer SRAM. The supply voltage and operating frequency are lowered by a feedback-type voltage/frequency control algorithm. In a portable H.264/AVC decoder, embedded SRAM can serve as the frame buffer, because the frame buffer is small enough that no external DRAM is required. In the proposed pipeline architecture, the power of the embedded SRAM, and even of the local bus connecting to the frame buffer SRAM, can be controlled by DVS. We carried out simulations with the 320 × 180 pixel baseline profile and the 320 × 240 pixel main profile. The total power reductions for 320 × 180 pixels and 320 × 240 pixels are 30% and 3%, respectively.

Keywords. Dynamic voltage scaling, Elastic pipeline, Embedded SRAM, H.264/AVC, Memory bandwidth

3.1 Introduction

Dynamic voltage scaling (DVS) is a low-power technique that controls the operating frequency and supply voltage of an LSI according to the application workload. DVS is widely used on general-purpose processors to achieve both high peak performance and low average power [1, 2]. Figure 3.1 shows the relationship between operating frequency and power, with DVS and without DVS (= clock gating). With DVS, if a workload does not need a high operating frequency, we can choose a combination of a lower operating frequency and a lower supply voltage. Owing to this power optimization, the power at a low workload is lower than in the conventional scheme without DVS. If the maximum performance is needed momentarily, the highest supply voltage and the highest operating frequency are used, so DVS can still accommodate the peak performance.

Applying DVS to hardwired logic circuits for real-time processing is problematic, however. Dedicated hardware is often built with a pipeline architecture for high performance. To cover the worst-case workload, the pipeline processes are started at fixed intervals of the worst-case execution cycles (WCEC). The required operating frequency is therefore uniquely fixed, and there is no room to apply DVS in the hardwired logic circuits.

To realize DVS in hardwired logic circuits, we propose an elastic pipeline architecture. Depending on the characteristics of the input data, this architecture can save processing cycles in the pipeline operation. The resulting slack time is exploited for DVS, which achieves lower power in hardwired logic circuits.

The rest of this paper is organized as follows. Section 3.2 reviews the conventional pipeline architecture. Section 3.3 describes the proposed elastic pipeline architecture for DVS in hardwired logic circuits. Section 3.4 presents the experimental results of the proposed architecture. Section 3.5 summarizes this paper.

Fig. 3.1. Relationships between power and frequency in DVS and clock gating (horizontal axis: normalized frequency; vertical axis: normalized power; curves: without DVS (clock gating) and DVS)
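The shape of the two curves in the figure can be reproduced with a first-order model. The sketch below is an illustration and not the model used in this chapter: it assumes that dynamic power scales as f × V², that the supply voltage can be lowered in proportion to the frequency, and that leakage is ignored. The function names and the `v_min` floor are hypothetical.

```python
def normalized_power(f_norm, v_norm=1.0):
    """Dynamic CMOS power relative to the peak operating point: P ~ f * V^2."""
    return f_norm * v_norm ** 2


def power_without_dvs(f_norm):
    """Clock gating only: the supply voltage stays at its maximum value."""
    return normalized_power(f_norm, v_norm=1.0)


def power_with_dvs(f_norm, v_min=0.0):
    """Idealized DVS (assumption): the voltage tracks the frequency,
    but never drops below the floor v_min."""
    v_norm = max(f_norm, v_min)
    return normalized_power(f_norm, v_norm)
```

Under these assumptions, halving the frequency halves the power with clock gating but reduces it to one eighth with DVS. A realistic voltage floor (for example `v_min=0.8`) narrows the gap considerably, which is why the achievable saving depends on the available voltage range.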

3.2 Conventional pipeline architecture

Figure 3.2 illustrates a timing diagram of the conventional pipeline architecture. The WCEC is the maximum number of execution cycles required for one pipeline process. A gray area in the figure shows the number of cycles that a pipeline stage spends processing a datum. A hatched area shows the common idle cycles in a pipeline process after all pipeline stages have completed. To cover the worst-case workload, the starting time of each pipeline process is fixed at intervals of the WCEC in the conventional pipeline architecture. Hence, all the pipeline stages have to idle until the next starting time, even if all of them finish earlier than the WCEC.

Fig. 3.2. Timing diagram of the conventional pipeline (gray: processing cycles; hatched: common idle cycles in a process; WCEC: worst-case execution cycles)

3.3 Elastic pipeline architecture

3.3.1 Concept

We propose the elastic pipeline architecture as the solution to this issue of the conventional pipeline architecture [3]. Figure 3.3 (a) and (b) show the concept and a timing diagram of the proposed elastic pipeline architecture. When a stage in the elastic pipeline completes, it sends a completion signal to the pipeline controller. As soon as the pipeline controller has collected the completion signals from all the pipeline stages, it issues the start signal and every pipeline stage proceeds to the next pipeline process. In the proposed architecture, the common idle cycles of the individual pipeline processes accumulate into a single block of time. As illustrated in Figure 3.3 (b), a pipeline process in the elastic pipeline takes less time than in the conventional pipeline, since the common idle cycles are deferred. The elastic pipeline architecture thereby produces the slack time, ΔH, compared to the conventional pipeline architecture.

Fig. 3.3. Proposed elastic pipeline architecture: (a) concept and (b) timing diagram

3.3.2 Feedback-type voltage/frequency control algorithm

For DVS in the proposed elastic pipeline, the supply voltage and the operating frequency are changed by a feedback-type voltage/frequency control algorithm, as illustrated in Figure 3.4 [4]. In an H.264 codec, data are processed macroblock by macroblock (MB: 16 × 16 pixels). In this algorithm, a frame is divided into slots, and a set of MBs is assigned to each slot. The first and second slots are always processed at the maximum frequency (= f in Figure 3.4). However, these slots may complete early, since the elastic pipeline reduces the number of processing cycles. Now consider the third slot, for which the slack time, ΔH, is left. In the example of Figure 3.4, even after accounting for the voltage/frequency transition time, T_td, the third slot has twice T_slot (the processing time for a slot) available, which allows it to be processed at half of f. Note that real-time operation is guaranteed in this feedback algorithm. As described, we prepare two operating frequencies, f and f/2, in this study [3, 4].

Fig. 3.4. Feedback-type voltage/frequency control algorithm (T_slot: processing time for a slot; T_td: time for voltage/frequency transition)
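The control loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: time is measured from the start of the frame, slot k must finish by (k + 1) × T_slot, a slot is run at f/2 only when 2 × T_slot + T_td fits in the time remaining before its deadline, and one transition time is charged for every half-speed slot. All function and parameter names are hypothetical.

```python
def elastic_slot_cycles(stage_cycles_per_process):
    """Cycles a slot needs in the elastic pipeline: each pipeline process ends
    as soon as its slowest stage has signalled completion (fill/drain ignored)."""
    return sum(max(stages) for stages in stage_cycles_per_process)


def select_frequencies(slot_cycles, wcec_per_slot, f_max, t_td):
    """Choose f_max or f_max/2 for each slot without missing a slot deadline.

    slot_cycles   -- cycles each slot actually needs (each <= wcec_per_slot)
    wcec_per_slot -- worst-case cycles budgeted for one slot
    f_max         -- maximum operating frequency in Hz
    t_td          -- voltage/frequency transition time in seconds
    Returns a list of (frequency, finish_time) tuples.
    """
    t_slot = wcec_per_slot / f_max        # worst-case slot time at f_max
    now = 0.0
    schedule = []
    for k, cycles in enumerate(slot_cycles):
        deadline = (k + 1) * t_slot
        available = deadline - now        # T_slot plus the accumulated slack
        if 2 * t_slot + t_td <= available:
            freq = f_max / 2              # worst case still meets the deadline
            now += t_td                   # assumed: one transition per half-speed slot
        else:
            freq = f_max
        now += cycles / freq
        schedule.append((freq, now))
    return schedule
```

With this condition the first two slots always run at f_max, because less than 2 × T_slot is available to them, which matches the behaviour described in the text. For example, with a budget of 1000 cycles per slot at 1 MHz, a 50 µs transition time, and slots that each need only 400 cycles, the third and fourth slots are selected to run at half speed.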

3.3.3 Architecture

To estimate the power reduction achieved by the proposed pipeline architecture, we designed an H.264 decoder architecture as shown in Figure 3.5. The SRAM used for reference images is embedded on the chip. If the resolution is as large as HDTV, an external DRAM is used as the bit stream buffer. In that case, the supply voltage and operating frequency should be fixed, since it is preferable that the DRAM interface operate at a fixed supply voltage and operating frequency for compatibility with other hardware cores. In this study, however, we handle the small resolutions used in portable products, so the SRAM used as the bit stream buffer is embedded as well. We can therefore apply DVS not only to the decoder core but also to the SRAM and the local bus connecting to the frame buffer SRAM.

Fig. 3.5. H.264 decoder block diagram (decoder core: entropy decode, IQ/IDCT, inter prediction, intra prediction, prediction error adder, loop filter; level shifters; DC-DC converter & PLL supplying the controlled Vdd / f; local bus; memory)

3.4 Experimental results

3.4.1 Test sequence

Assuming portable H.264 image sequences, we handle two resolutions (320 × 180 pixels and 320 × 240 pixels). As test sequences, we chose six sequences: Bus (BUS), Cheer (CHER), Flower (FLOW), Foreman (FORE), Football (FTBL), and Girl (GIRL). We encoded the six sequences under the configurations in Table 3.1 (a) to prepare the test streams. The baseline profile with 320 × 180 pixels complies with the Japanese portable television standard, and the main profile with 320 × 240 pixels is adopted by the Sony PSP [5].

Table 3.1. (a) Encoding configuration, (b) Simulation parameter

(a)
Frame size (resolution)      320 × 180 pixels    320 × 240 pixels
# of test sequences          6 sequences         6 sequences
Profile                      Baseline profile    Main profile
Frame rate                   15                  30
# of reference frames        1                   2
Reference software           JM9.6 [6]           JM9.6 [6]

(b)
Frame size (resolution)      320 × 180           320 × 240
# of slots                   75                  30
Operating frequency (MHz)    6.75/3.38           15.84/7.92
Supply voltage (V)           0.6/0.6             0.8/0.63
SRAM (Mbits)                 2                   4
WCEC                         760                 760
# of logic gates             60039               60039

Table 3.1 (b) lists the simulation parameters. A separate pair of supply voltages and operating frequencies is prepared for each of the two resolutions. The capacities of the embedded SRAMs are 2 Mbits and 4 Mbits, respectively. The SRAM capacity for the 320 × 240 pixel main profile is twice as large, since the number of reference frames is two. Note, however, that the WCECs and the numbers of logic gates are equal for the two resolutions. In other words, the operating frequencies differ, but the decoder cores are the same size.

3.4.2 The optimum number of slots per frame

Since the elastic pipeline architecture reduces the number of processing cycles, we can apply DVS to the decoder LSI. This subsection discusses the optimum number of slots. Figure 3.6 shows the simulated relationship between the power of the decoder core and the number of slots, using the BUS sequence. The power reduction factor depends on the number of slots. In this simulation, the transition time is assumed to be 50 μs []. The baseline profile with 320 × 180 pixels has its power minimum when the number of slots is 75. In the main profile with 320 × 240 pixels, on the other hand, the optimum number of slots is 30. If the number of slots is smaller than the optimum, the power reduction deteriorates drastically. For instance, with only a few slots, there are few chances to lower the operating frequency and the supply voltage, which causes high power consumption. Conversely, with many slots, there are many chances to change the operating frequency and supply voltage. The total voltage/frequency transition time, however, becomes longer, so the power reduction gradually deteriorates as the number of slots increases.

Fig. 3.6. The number of slots in a frame vs. normalized power using the test sequence Bus: (a) 320 × 180 pixels and (b) 320 × 240 pixels

3.4.3 Power saving

As well as the decoder core, we can apply DVS to the embedded frame buffer SRAM and the local bus connecting to the SRAM. Figure 3.7 (a) shows the respective power reduction factors in the local bus, the embedded SRAM, and the decoder core. For the frame size of 320 × 180 pixels, the proposed elastic pipeline reduces the power by 7%, 5%, and 4% in the local bus, the embedded SRAM, and the decoder core, respectively. For the frame size of 320 × 240 pixels, the respective factors are 38%, 7%, and 33%. The overall power reduction is 30% and 3% on average for the two resolutions, as shown in Figure 3.7 (b).

Fig. 3.7. Power reduction ratio: (a) decoder core, embedded SRAM, local bus, (b) overall decoder (normalized power per video sequence: BUS, CHER, FLOW, FORE, FTBL, GIRL)
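The cost of using many slots, discussed above for Figure 3.6, can be illustrated with simple arithmetic. The sketch below assumes that a slot may run at half speed only when 2 × T_slot + T_td fits before its deadline, so the slack it needs is T_slot + T_td; this is a reading of Figure 3.4 and not a formula stated in the text. The function names are hypothetical, and the values in the usage note (30 frames/s and a 50 μs transition time as printed above) are used only as example inputs.

```python
def slot_time(frame_rate, slots_per_frame):
    """Worst-case time budget T_slot for one slot, in seconds."""
    return 1.0 / (frame_rate * slots_per_frame)


def required_slack_ratio(frame_rate, slots_per_frame, t_td):
    """Slack a slot needs before it may run at half speed, in units of T_slot.

    Assumption: a half-speed slot takes up to 2*T_slot + T_td, which is
    T_slot + T_td more than the slot's own budget of T_slot.
    """
    t_slot = slot_time(frame_rate, slots_per_frame)
    return (t_slot + t_td) / t_slot
```

At 30 frames/s with 30 slots per frame, a 50 μs transition adds about 4.5% to the slack required per half-speed slot; with 300 slots per frame the same transition adds 45%, so the fixed transition time consumes a growing share of the slack as slots get shorter. With very few slots the ratio approaches 1, but there are correspondingly few points at which the frequency can be lowered.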

3.5 Summary

We proposed the elastic pipeline architecture, which makes DVS applicable to a hardwired circuit. We implemented an H.264 decoder LSI and controlled the embedded frame buffer SRAM and the local bus connecting to the embedded SRAM with DVS, as well as the decoder core. We verified that, for the 320 × 180 pixel baseline profile, the proposed elastic pipeline reduces the power of the H.264 decoder LSI by 7% in the local bus, by 5% in the frame buffer SRAM, and by 4% in the decoder core. For the 320 × 240 pixel main profile, the power is reduced by 38%, 7%, and 33% in the local bus, the frame buffer SRAM, and the decoder core, respectively. The total power reductions in the baseline profile and the main profile are 30% and 3%, respectively.

References

1. Nowka KJ, Carpenter GD, MacDonald EW, Ngo HC, Brock BC, Ishii KI, Nguyen TY, Burns JL (2002) A 32-bit PowerPC system-on-a-chip with support for dynamic voltage scaling and dynamic frequency scaling. IEEE J. Solid-State Circuits 37(11):1441-1447
2. Kawakami K, Kanamori M, Morita Y, Takemura J, Miyama M, Yoshimoto M (2005) Power-minimum frequency/voltage cooperative management method for VLSI processor in leakage-dominant technology era. IEICE Trans. Fundamentals E88-A(12):3290-3297
3. Kawakami K, Kuroda M, Kawaguchi H, Yoshimoto M (2007) Power and memory bandwidth reduction of an H.264/AVC HDTV decoder LSI with elastic pipeline architecture. In: Proceedings of Asia and South Pacific Design Automation Conference (ASP-DAC)
4. Kawaguchi H, Shin Y, Sakurai T (2005) µITRON-LP: power-conscious real-time OS based on cooperative voltage scaling for multimedia applications. IEEE Trans. Multimedia 7(1):67-74
5. http://manuals.playstation.net/document/jp/psp/current/video/filetypes.html
6. Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (2003) ISO/IEC 14496-10