A Low-Power 0.7-V H.264 720p Video Decoder

A Low-Power 0.7-V H.264 720p Video Decoder
D. Finchelstein, V. Sze, M.E. Sinangil, Y. Koken, A.P. Chandrakasan
A-SSCC 2008

Outline
- Motivation for low-power video decoders
- Low-power techniques
  - pipelining and parallelism
  - independent voltage/clock domains
  - efficient memory accessing
- ASIC results
- Comparison with state of the art
- Summary

Motivation
- High demand for video capture and playback on mobile devices (pictured: iPhone, DVC, digital camera, PSP)
- H.264: state-of-the-art video coding standard
- Goal: ultra-low-power H.264 decoder in 65 nm, 1280x720 @ 30 fps

H.264 Decoder Architecture
[Block diagram. Blocks: bitstream input, ED, FIFOs, MC, INTRA, MUX, adder, DB, MEM, and a shared IT, arranged as parallel luma and chroma paths; signals: MVS, MODES, COEFFS; off-chip: frame buffer (ZBT SRAM) and YUV->RGB conversion (FPGA)]
- Pipelined, highly parallel architecture to reduce voltage (and power)

Pipeline FIFO Sizing
[Plot: normalized system throughput (1 to 1.5) versus FIFO depth (1 to 256), with the sizes used on this chip marked]
- Pipeline stages have variable latencies (e.g., ED latency is 0-33 cycles per 4x4 block)
- Larger FIFOs help average out the workload and increase performance by up to 45%
- FIFOs of depths 1-4 chosen to reduce area
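The way FIFO depth absorbs latency variation can be illustrated with a small simulation. This is a minimal sketch, not the authors' model: it assumes a two-stage pipeline whose producer takes a uniformly random 0-33 cycles per block (the ED range quoted on the slide), while the consumer latency range `consumer_range` is an invented placeholder. All function names are hypothetical.

```python
import random

def pipeline_cycles(producer_lat, consumer_lat, depth):
    """Cycles needed to push every block through two stages joined by a FIFO."""
    n = len(producer_lat)
    enter = [0] * n  # cycle at which block i is written into the FIFO
    start = [0] * n  # cycle at which the consumer pops block i
    done = [0] * n   # cycle at which the consumer finishes block i
    for i in range(n):
        ready = (enter[i - 1] if i else 0) + producer_lat[i]
        # The producer stalls until a slot is free, i.e. block i-depth was popped.
        slot_free = start[i - depth] if i >= depth else 0
        enter[i] = max(ready, slot_free)
        start[i] = max(enter[i], done[i - 1] if i else 0)
        done[i] = start[i] + consumer_lat[i]
    return done[-1]

def normalized_throughput(depths, n_blocks=20000, consumer_range=(8, 24), seed=0):
    """Throughput for each FIFO depth, relative to a depth-1 FIFO."""
    rng = random.Random(seed)
    producer = [rng.randint(0, 33) for _ in range(n_blocks)]  # ED-like latency
    consumer = [rng.randint(*consumer_range) for _ in range(n_blocks)]  # assumed
    base = pipeline_cycles(producer, consumer, 1)
    return {d: base / pipeline_cycles(producer, consumer, d) for d in depths}

if __name__ == "__main__":
    for depth, gain in normalized_throughput([1, 2, 4, 8, 16, 32]).items():
        print(f"depth {depth:3d}: {gain:.3f}x")
```

The gain grows with depth and then flattens, which is the qualitative shape of the curve on the slide; the exact numbers depend on the assumed latency distributions.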

Deblocking Filter Parallelism
- 4 edges in parallel
- Process an entire 4x4 edge (4 filters) in parallel
- Filter luma and chroma in parallel
- 192 cycles reduced to ~46 cycles per 16x16 block

Deblocking Filter Architecture
[Block diagram: P IN and Q IN (4x4 block inputs) feed 4 parallel filters selected by boundary strength (bs): a bs=0 path, filters for bs=1 to 3 (threshold calculation, clip), and filters for bs=4 (threshold); P OUT and Q OUT form the 4x4 block output. Storage: last-line cache, 104 kb [SRAM]; internal memory, 4x(4x4x8b) [DFF]. Each unit is split into datapath and control.]

Motion Compensation (MC)
[Figure: a 4x4 block in the current frame, with reference blocks for vector (1, -1) and vector (0.5, -0.5)]
- Use two interpolators in parallel
- Interpolate luma and chroma in parallel
- 176 cycles reduced to ~72 cycles per 16x16 block
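A fractional vector such as (0.5, -0.5) means the reference block has to be interpolated. As a rough illustration, the sketch below applies the H.264 luma half-sample filter, with taps (1, -5, 20, 20, -5, 1) and rounding to 8 bits, along one row. It covers only the horizontal half-sample case and says nothing about how the chip's interpolators are organized; the function names are hypothetical.

```python
def half_pel(samples):
    """Half-sample value midway between samples[2] and samples[3]."""
    e, f, g, h, i, j = samples
    acc = e - 5 * f + 20 * g + 20 * h - 5 * i + j
    # Round, divide by 32, and clip to the 8-bit sample range.
    return min(255, max(0, (acc + 16) >> 5))

def half_pel_row(row):
    """Horizontal half-sample positions for one row of integer samples.

    A 4-sample-wide block needs 4 + 5 = 9 input samples per row.
    """
    return [half_pel(row[x:x + 6]) for x in range(len(row) - 5)]

if __name__ == "__main__":
    ramp = list(range(0, 90, 10))  # 9 samples: 0, 10, ..., 80
    print(half_pel_row(ramp))
```

On a linear ramp the filter returns the midpoints between neighbouring samples, which is a quick sanity check of the tap values.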

Parallel MC Interpolators
[Block diagram: Entropy Decoding (ED) supplies MV0 and MV1; the Frame Buffer (FB) supplies column0 and column1 to interpolators MC0 and MC1 (6:1 filters, x9). Within a 16x16 block, the 4x4 blocks are assigned by row: MC0: 0 1 4 5; MC1: 2 3 6 7; MC0: 8 9 12 13; MC1: 10 11 14 15]
- Interpolators can run in the same cycle when
  - motion vectors are all available
  - memory interface supplies 2 columns per cycle
- Interpolators are synchronized
  - MC0: even 4x4 rows, MC1: odd 4x4 rows
  - shared interpolation data reused

Clock & Voltage Domains
[Bar chart: average cycles per block (0 to 20). Memory controller: MEM_luma, MEM_chroma. Core domain: ED, DB_chroma, DB_luma, MC_luma, IT, MC_chroma]
[Diagram: core domain (V low, CLK slow) and memory controller (V high, CLK fast), linked by level shifters and a dual-clock FIFO built from a DFF array and FIFO logic]
- Decouple voltage/clock domains to lower the core voltage and frequency
- 25% power savings vs. single domain
- 4% further savings if we used 3 domains
- Dual-clock FIFOs and level shifters link the domains

Workload Variation

                                P-frame      I-frame
Core Domain [MHz @ V]           14 @ 0.70    53 @ 0.90
Memory Controller [MHz @ V]     50 @ 0.84    25 @ 0.76

Relative power @ 720p, 1 I-frame, 14 P-frames:

                                MAX          DVFS                     FA
Core Domain [MHz @ V]           53 @ 0.90    14 @ 0.70 / 53 @ 0.90    17 @ 0.72
Memory Controller [MHz @ V]     50 @ 0.84    25 @ 0.76 / 50 @ 0.84    48 @ 0.83
Relative Power [%]              100          73                       73

[Plot: power versus workload (up to 100%), comparing no DVFS with perfect DVFS; the gap is marked ΔP]
- INTER-INTRA workload variation
- MAX: maximum frequency on each domain
- DVFS: 1 frame every 33 ms
- Frame Averaging (FA): 15 frames every 15 * 33 ms
  - switches less often than DVFS, but needs output buffer
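The Frame Averaging operating points are consistent with a plain frame-weighted average of the per-frame frequencies. The sketch below checks that arithmetic under the assumption that FA runs at the mean frequency over the 15-frame window (1 I-frame, 14 P-frames). The function name is hypothetical, and the matching voltages come from measurement and are not derived here.

```python
def frame_averaged_mhz(i_frame_mhz, p_frame_mhz, i_frames=1, p_frames=14):
    """Mean clock frequency over a window of I- and P-frames."""
    total = i_frames * i_frame_mhz + p_frames * p_frame_mhz
    return total / (i_frames + p_frames)

if __name__ == "__main__":
    core = frame_averaged_mhz(53, 14)  # core: 53 MHz for I-frames, 14 MHz for P-frames
    mem = frame_averaged_mhz(25, 50)   # memory controller: 25 MHz for I, 50 MHz for P
    print(f"core: {core:.1f} MHz, memory controller: {mem:.1f} MHz")
```

This gives 16.6 MHz and 48.3 MHz, which round to the 17 MHz and 48 MHz listed for FA.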

MC Data Overlap
[Figure: current 4x4 block with its horizontal and vertical neighbors, their interpolation areas, and the overlapped interpolation area]
- Neighboring 4x4 blocks with the same MVs
- Overlap area shared horizontally and vertically
- Reduced MC read bandwidth
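The size of the overlap can be estimated by counting reference pixels. The sketch below assumes the 6-tap H.264 luma filter, so a block needs 5 extra rows and 5 extra columns around it, and assumes the vector is fractional in both directions; `ref_pixels` is a hypothetical helper, not something taken from the design.

```python
def ref_pixels(blocks_wide, blocks_high, block=4, taps=6):
    """Reference pixels needed for a group of blocks sharing one motion vector."""
    margin = taps - 1  # extra samples required by the interpolation filter
    return (blocks_wide * block + margin) * (blocks_high * block + margin)

if __name__ == "__main__":
    single = ref_pixels(1, 1)  # one 4x4 block read on its own: 9 x 9
    pair = ref_pixels(2, 1)    # two horizontal neighbors read together: 13 x 9
    print(f"separate reads: {2 * single} pixels, shared read: {pair} pixels")
    print(f"saving: {1 - pair / (2 * single):.0%}")
```

Under these assumptions, two neighboring blocks fetched together need 117 pixels instead of 162, and the saving grows as more neighbors share the same vector.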

Last-line On-chip Caches
[Figure: a 16x16 block with its top-left, top, top-right, and left neighbors (deblocking), and the neighboring pixels A-M used for intra prediction]
[Bar chart: cache size in kb (0 to 120) for DB, INTRA, MC, ED; bandwidth annotations: 807 Mbps, 122 Mbps, 26 Mbps, 21 Mbps]

Off-chip Bandwidth
[Bar chart: P-frame off-chip BW in Gbps (0 to 2) for original, caching, and cache & fewer reads, with reductions of 26% and 19% marked]
- Frame buffer off-chip (1.4 MByte per frame)
- P-frames more common than I-frames
- P-frame off-chip BW larger due to MC
- 40% (0.9 Gbps) total reduction
  - last-line caches
  - avoided redundant reads in MC
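The frame-buffer figure can be checked with simple arithmetic. The sketch below assumes 8-bit 4:2:0 video, which is what the Baseline profile uses. It also assumes the 19% step is measured relative to the already-cached bandwidth; that is one reading of the chart under which the two steps compound to the quoted 40%. The function names are hypothetical.

```python
def frame_bytes(width, height):
    """Bytes per 8-bit 4:2:0 frame: full-size luma plus two quarter-size chroma planes."""
    return width * height * 3 // 2

def write_gbps(width, height, fps):
    """Bandwidth needed just to write each decoded frame once."""
    return frame_bytes(width, height) * 8 * fps / 1e9

def compound_reduction(*steps):
    """Total reduction when each step applies to what the previous step left."""
    remaining = 1.0
    for step in steps:
        remaining *= 1.0 - step
    return 1.0 - remaining

if __name__ == "__main__":
    print(f"frame size: {frame_bytes(1280, 720) / 1e6:.2f} MByte")
    print(f"frame write BW: {write_gbps(1280, 720, 30):.2f} Gbps")
    print(f"combined reduction: {compound_reduction(0.26, 0.19):.1%}")
```

A 720p frame comes to about 1.38 MByte, in line with the 1.4 MByte on the slide, and writing it 30 times per second takes about 0.33 Gbps, so most of the P-frame bandwidth is read traffic.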

Voltage Scalable SRAM
[Schematic: 8T SRAM cell and pseudo-differential sense amplifier with global snsref; signals: RDBL, snsen, snsref, DOUT]
- Low-voltage SRAM needed
  - Typical 6T SRAMs fail at low voltages
  - 8T SRAMs work down to 0.5 V
- Write assist to improve writability at low voltages
- Extra 2 transistors ensure read stability at low voltages

H.264 Decoder ASIC
[Die photo: 3.3 mm x 3.3 mm, 176 I/O pads; regions: caches, core domain, memory controller domain]

DECODER STATISTICS
Area (w/o pads)      2.76 x 2.76 mm^2
Area Utilization     31 %
Technology           65-nm
I/O Pads             176
On-chip SRAM         17kB

Area Breakdown
- Cache area 3x larger than logic (caches 75%, logic 25%)
- Standard cells: 134k
- Parallelism overhead: 1.5% of active chip area
  - 4 luma + 2 chroma filters: 1.5% of DB
  - 2 luma + 4 chroma interpolators: 9% of MC
[Pie chart: DB 56%, MC 16.6%, INTRA 15.6%, IT 8.2%, ED 1.6%, MISC 1.4%]

Power Measurements

720p Video              mobcal       shields      parkrun
Input Bitrate [Mbps]    5.4          7.0          26
Core [MHz @ V]          14 @ 0.70    14 @ 0.70    25 @ 0.80
MEM [MHz @ V]           50 @ 0.84    50 @ 0.84    50 @ 0.84
Power [mW]              1.8          1.8          3.2

[Histogram: distribution of minimum core Vdd @ 720p across 15 dies, from 0.69 V to 0.72 V]
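The operating points in the table translate into a cycle budget per macroblock and an energy per decoded pixel. The following is a back-of-the-envelope sketch with hypothetical function names. It uses only the table values and the 1280x720 @ 30 fps target, and treats the power figure as the whole measured number without splitting it between domains.

```python
def macroblocks_per_second(width=1280, height=720, fps=30):
    """16x16 macroblocks that must be decoded each second."""
    return (width // 16) * (height // 16) * fps

def cycles_per_macroblock(core_mhz, width=1280, height=720, fps=30):
    """Core clock cycles available per macroblock at real-time rate."""
    return core_mhz * 1e6 / macroblocks_per_second(width, height, fps)

def energy_per_pixel_pj(power_mw, width=1280, height=720, fps=30):
    """Energy spent per decoded pixel, in picojoules."""
    return power_mw * 1e-3 / (width * height * fps) * 1e12

if __name__ == "__main__":
    for name, mhz, mw in [("mobcal", 14, 1.8), ("shields", 14, 1.8), ("parkrun", 25, 3.2)]:
        print(f"{name}: {cycles_per_macroblock(mhz):.0f} cycles/MB, "
              f"{energy_per_pixel_pj(mw):.0f} pJ/pixel")
```

At 14 MHz the core has roughly 130 cycles for each macroblock, which shows how much per-block concurrency real-time 720p decoding requires at that clock rate.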

Power Breakdown
[Pie chart, P-frame: MC 42%, DB 26%, pipeline control & FIFOs 19%, MEM write 7%, ED 3%, IDCT 1%, INTRA 1%]
[MC breakdown: MEM read 75%, interpolators 20%, motion vector predictor 5%]
- P-frame power dominated by:
  - MC (frame buffer reads)
  - deblocking filter

Survey of Other Decoders
[Plot: power (0.01 mW to 1 W) versus throughput (0.1 to 100 Mpixels/s), with resolutions QCIF, CIF, D1, 720p, and 1080p marked at 15 fps and 30 fps. Points for this work carry core domain and memory controller voltage labels: 0.5 V, 0.5 V, 0.55 V, 0.68 V, 0.66 V, 0.74 V, 0.70 V, 0.85 V, 0.85 V, 1.15 V]
Legend ([work] - process, profile):
[3] - 130-nm, Baseline
[4] - 180-nm, Baseline
[5] - 180-nm, Baseline
[6] - 180-nm, Main
This work - 65-nm, Baseline

Summary
- Pipeline and parallelism
  - Concurrency allows 14 MHz @ 720p
  - Parallelism: luma DB = 4x, luma MC = 2x
- Separate voltage/clock domains
  - 25% P-frame power savings
  - DVFS on each domain for I/P-frame differences
- Efficient memory accesses
  - Low-voltage on-chip caches and data reuse
  - Off-chip BW lowered by 40%

Acknowledgements
- Funding: Nokia, TI, and NSERC
- Chip fabrication: TI
- Valuable feedback:
  - Nokia: J. Hicks, G. Raghavan, J. Ankcorn
  - TI: M. Budagavi, D. Buss, M. Zhou
  - MIT: Arvind, E. Fleming

Video Demo