A video signal processor for motion-compensated field-rate upconversion in consumer television


B. De Loore, P. Lippens, P. Eeckhout, H. Huijgen, A. Löning, B. McSweeney, M. Verstraelen, B. Pham, G. de Haan, J. Kettenis

Philips Semiconductors, Philips Research, Philips Consumer Electronics

Author return address: B. De Loore, Philips GmbH, RHW, Stresemannallee 101, D-22529 Hamburg, Germany
Phone: +49 40 5613 3691
Fax: +49 40 5613 3392
Email: deloore@hhcich01.serigate.philips.nl

24 October 1995

1.0 Abstract

The four embedded video signal processors on this IC provide a processing power of 10 Gops. Their architecture was generated from an algorithm description using behavioural synthesis. The required 25 Gbit/s memory bandwidth was realized by embedding 24 single/dual port SRAM/DRAM instances. The test approach includes full scan, boundary scan, functional test, built-in self-test and IDDq test.

2.0 Paper summary

In today's 100 Hz television sets, the display rate is doubled by displaying each incoming field twice. As a result, moving objects are displayed at an incorrect position in the interpolated fields (Fig. 1). In the new generation of 100 Hz television sets, this artefact is removed by calculating the motion vectors for all objects and performing motion-compensated interpolation. Known algorithms for motion estimation and compensation require a huge number of computations. In [1], a 3D-recursive block matching algorithm is presented that makes a one-chip solution possible. The presented IC is also capable of judder-free motion portrayal of movie material (25 Hz to 50 Hz upconversion), noise reduction and vertical zoom.

Fig. 2 shows the chip architecture and its application. The IC consists of three subprocessors and one top-level processor.

Motion estimation is performed by two recursive block matchers, using only 8 candidate vectors per block. The search spaces of the current and previous fields have to be accessed randomly at the pixel frequency, which is only possible with on-chip cache memory. Candidate vector evaluation and selection of the best vector are also performed in the motion estimation processor.

In the vector processor, the selected vectors are stored in a vector field memory, to be used for interpolation and as one of the candidates in the next field. The vector processor organizes the storage and access of the vector field. In order to obtain motion vectors for blocks of 4*2 pixels, it also postprocesses the originally calculated motion vectors.

The interpolation processor generates the motion-compensated interpolated output video field. Here, too, two video search spaces have to be accessed randomly. The vertical delays could be reused, but the horizontal caches had to be duplicated.
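The candidate-based vector selection described above can be illustrated with a minimal sketch. It assumes a sum-of-absolute-differences (SAD) match criterion, a 4x2 block and a hand-written candidate list; the text states only that 8 candidate vectors per block are evaluated, not how they are chosen or scored. The names `sad` and `best_vector` are hypothetical.

```python
# Illustrative sketch only: the SAD criterion, the block size and the
# candidate list are assumptions, not the algorithm implemented on the IC.

def sad(cur, prev, x, y, dx, dy, bw, bh):
    """Sum of absolute differences between the block at (x, y) in `cur`
    and the block displaced by (dx, dy) in `prev`."""
    total = 0
    for j in range(bh):
        for i in range(bw):
            total += abs(cur[y + j][x + i] - prev[y + j + dy][x + i + dx])
    return total


def best_vector(cur, prev, x, y, candidates, bw=4, bh=2):
    """Return the candidate (dx, dy) with the lowest match error."""
    return min(candidates,
               key=lambda v: sad(cur, prev, x, y, v[0], v[1], bw, bh))


# Test picture: a bright 4x2 patch that moved 3 pixels to the right
# between the previous and the current field.
W, H = 16, 6
prev = [[0] * W for _ in range(H)]
cur = [[0] * W for _ in range(H)]
for j in range(2, 4):
    for i in range(4):
        prev[j][2 + i] = 200
        cur[j][5 + i] = 200

# Eight hypothetical candidates; a vector points from the current block
# to the matching position in the previous field.
candidates = [(0, 0), (-1, 0), (1, 0), (-3, 0), (3, 0), (0, -1), (0, 1), (-2, 1)]
vector = best_vector(cur, prev, 5, 2, candidates)
```

Evaluating a handful of candidates instead of every displacement in the search window is what keeps the operation count per block low; in a recursive scheme the candidates would typically come from previously estimated neighbouring vectors.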

The top-level processor synchronizes the data flow between the three subprocessors and peripheral functions such as noise reduction, formatters and I/O. Externally, a video field memory and a microprocessor are required. Programmability was implemented to allow user-specific taste settings and picture sizes, and to allow fine tuning of those parts of the algorithm that could only be evaluated in real time. For this purpose, there are 28 on-chip 8-bit registers that can be read and/or written by the microprocessor.

An implementation of the algorithm on existing microprocessors or DSPs is not feasible with the current state of microprocessor and DSP technology. The recursive block matching, which is only a small part of the algorithm, requires 14 operations per sample. With a sample frequency of 33 MHz, this results in a required computing power of 462 MOPS. The integrated function requires a total computing power of 10 GOPS. The required memory bandwidth can also only be offered by a dedicated processor: the chosen architecture requires a memory bandwidth of 25 Gbit/s.

Time to market was a major issue in the design of an application-specific processor for this algorithm. Register Transfer Level (RTL) design using logic synthesis does not offer a solution. The conversion of the behavioural specification into an RTL specification (Table 1) takes too much time: the designer has to take care of clock-level signal timing, memory organisation and addressing, and the generation of all control signals. Because of the complex nature of the algorithm (recursive, quincunx) and the more than 100 modes of operation, such an approach would not have been possible in the given time frame.
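The computing-power figures quoted above follow from simple arithmetic, which the short sketch below reproduces. The derived values (operations per sample for the whole function, and the share taken by block matching) assume that the 10 GOPS total refers to the same 33 MHz sample rate; the text does not state this explicitly.

```python
# Figures quoted in the text.
OPS_PER_SAMPLE = 14            # recursive block matching, operations per sample
SAMPLE_RATE_HZ = 33_000_000    # 33 MHz sample frequency
TOTAL_OPS_PER_S = 10e9         # 10 GOPS for the integrated function

# Block matching alone: 14 operations at 33 MHz.
block_matching_mops = OPS_PER_SAMPLE * SAMPLE_RATE_HZ / 1e6

# Derived under the assumption stated above.
total_ops_per_sample = TOTAL_OPS_PER_S / SAMPLE_RATE_HZ
matching_share = block_matching_mops * 1e6 / TOTAL_OPS_PER_S
```

Under that assumption the complete function needs roughly 300 operations per sample, of which block matching accounts for under 5%.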

For this IC, the four video processors were generated using in-house behavioural synthesis tools from the Phideo toolset [2]. These tools take a behavioural specification and generate an architecture at the RT level. Arithmetic and logic operations are the functional primitives of the specification. A scheduler determines in which clock cycle each function is performed. From this time schedule, memories are inferred to store intermediate processing results, the memory organisation is determined and the addressing hardware is generated. A controller is then synthesized to control and synchronize the complete architecture. The designer selects architecture alternatives, while the tools perform bookkeeping and detailed optimization tasks. The output of the behavioural synthesis toolbox is synthesizable RTL VHDL, which can be translated to a gate-level netlist using logic synthesis tools. Retiming at the gate level is performed to let the circuit run at the desired operating frequency of 33 MHz.

An overview of the embedded memories is given in Table 2. Four different memory types had to be used to arrive at an acceptable area/performance ratio. Dual-port 33 MHz random access is achieved using register-file generators. For single-port 33 MHz instances, an embedded-SRAM module generator is used. The vector field, characterized by a single-port 16.5 MHz bandwidth, is implemented using an embedded-DRAM module generator. Area and power constraints imposed the use of DRAM technology for the 15 line-delay instances. A dedicated line memory was designed with doubled data throughput at 33 MHz. This was realized by making simultaneous read and write access to the same memory address possible within one memory cycle. The typical incremental access of these line delays also allowed their power dissipation to be reduced by a factor of three, through the addition of a so-called page mode.
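The read-and-write-to-the-same-address behaviour of the line delays can be pictured with a small behavioural model. This is a sketch under assumptions: it models only the data flow of a FIFO-type delay with incremental addressing, not the DRAM circuit or its page mode, and the class name `LineDelay` is hypothetical.

```python
class LineDelay:
    """Behavioural model of a FIFO-type line delay: in every cycle the pixel
    stored one line earlier is read and the incoming pixel is written to the
    same address."""

    def __init__(self, length):
        self.mem = [0] * length
        self.addr = 0

    def step(self, pixel_in):
        pixel_out = self.mem[self.addr]    # read the old pixel ...
        self.mem[self.addr] = pixel_in     # ... and overwrite it in the same cycle
        self.addr = (self.addr + 1) % len(self.mem)  # incremental addressing
        return pixel_out


# A 4-pixel line is used here purely to keep the example short.
delay = LineDelay(4)
outputs = [delay.step(p) for p in range(1, 9)]
```

Because each address is visited once per line and serves both the read and the write, one memory cycle per pixel is enough to sustain an input stream and a delayed output stream at the same time.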
A single line memory instance stores 896x8 pixels, measures 1.1 mm² and dissipates 0.5 mW/MHz at 5 V. In total, 170 Kbit of data are stored in embedded memory.

A single-phase clocking scheme was adopted for the core of the IC. To limit clock skew, the clock scheme of Fig. 3 was adopted. The clock generator was placed near the center of the IC. Its output is connected by equally long wires to four large clock buffers, placed in the center of the four sides of the IC's periphery. The outputs of these clock buffers are in turn connected (ck_core) in a ring that surrounds the core. The core circuitry connects to this clock ring locally. The clock generator also generates clock signals (ck_sync_in, ck_sync_out) with a small phase difference relative to the core clock signal. This was required to meet the timing requirements for the IC's inputs and outputs.

In the datapath, a full scan approach was adopted. The overall fault coverage, based on a stuck-at fault model, exceeds 99%. This test may leave bridging faults undetected. These faults can be identified by measuring the quiescent current (IDDq) in the absence of a clock signal for different vector sets. Care was taken in the design of I/O and memory cells to switch off all possible current leakage paths during this test. Timing problems remain undetected by the scan test and the IDDq test. A functional test that activates the timing-critical path is therefore included in the vector set.

Memories require special attention, as they are extremely dense. In addition to stuck-at and bridging faults, coupling faults and specific decoder faults also have to be detected. Specific test pattern sequences (6N, 9N algorithms) were used to test the SRAM and DRAM modules properly.
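To make the notion of a 6N memory test concrete, the sketch below runs MATS++, a well-known march test that applies 6 operations per cell, on a software memory model. The text names only "6N, 9N algorithms", so using MATS++ here is an assumption, and the `Memory` class with its stuck-at fault injection is hypothetical.

```python
def mats_plus_plus(mem, n):
    """Run the MATS++ march test (6 operations per cell) on `mem`.
    Returns a list of (address, expected, observed) mismatches."""
    errors = []

    def check(addr, expected):
        observed = mem.read(addr)
        if observed != expected:
            errors.append((addr, expected, observed))

    for a in range(n):               # element 1: write 0 to every cell
        mem.write(a, 0)
    for a in range(n):               # element 2: ascending, read 0 then write 1
        check(a, 0)
        mem.write(a, 1)
    for a in reversed(range(n)):     # element 3: descending, read 1, write 0, read 0
        check(a, 1)
        mem.write(a, 0)
        check(a, 0)
    return errors


class Memory:
    """Software memory model with optional stuck-at cells (illustration only)."""

    def __init__(self, n, stuck_at=None):
        self.cells = [0] * n
        self.stuck_at = stuck_at or {}   # maps address -> forced value

    def write(self, addr, value):
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


good = mats_plus_plus(Memory(8), 8)
faulty = mats_plus_plus(Memory(8, stuck_at={5: 0}), 8)
```

Marching through the addresses in both directions is what lets such tests catch address decoder faults as well as stuck cells; longer sequences such as 9N tests add further read and write elements to cover more coupling faults.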

Finally, board-level testing is made possible by the implementation of a boundary-scan-like test concept.

The high quality standards must be met at the lowest possible cost. Test time is the major parameter determining the test cost. Test time was minimized by operating 28 scan chains in parallel, with the lengths of these chains well balanced. When applied serially, the memory tests take too much time and the required vectors occupy too much of the tester's pin memory. Four built-in self-test modules have been implemented to avoid this. Each module consists of a test-pattern generator for both data and addresses, a signature analyser and a state machine to control the complete test sequence.

Table 3 shows the main characteristics of the realized chip. 65% of the transistors are in embedded memory. The IC specification was written and verified by a system designer. A team of three IC designers performed the synthesis. Functional samples were available one year after the start of the project. Fig. 4 shows a chip photograph.
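The built-in self-test structure described above (pattern generator, signature analyser, controlling state machine) can be sketched in software. The text does not say how the generator or analyser is built; the LFSR-based arrangement below is a common choice and is an assumption, as are the polynomial and all function names.

```python
POLY = 0x1D  # taps of x^8 + x^4 + x^3 + x^2 + 1 (assumed polynomial)


def lfsr_step(state):
    """Advance an 8-bit Galois LFSR by one cycle."""
    state <<= 1
    if state & 0x100:
        state ^= 0x100 | POLY   # drop the overflow bit and apply the feedback
    return state


def patterns(seed, count):
    """Yield `count` pseudo-random 8-bit test patterns."""
    state = seed
    for _ in range(count):
        yield state
        state = lfsr_step(state)


def signature(responses):
    """Compact a stream of 8-bit responses into one 8-bit signature."""
    sig = 0
    for word in responses:
        sig = lfsr_step(sig) ^ word
    return sig


def run_bist(memory_size, fault=None):
    """Stand-in for the test state machine: write patterns to a modelled
    memory, optionally corrupt one word, then read back and compact."""
    mem = list(patterns(seed=1, count=memory_size))   # write phase
    if fault is not None:
        addr, mask = fault
        mem[addr] ^= mask                             # injected bit error
    return signature(mem)                             # read phase and compaction


golden = run_bist(64)
observed = run_bist(64, fault=(10, 0x04))
```

Only the final signature has to be compared with a known-good value, which is why this kind of arrangement relieves the tester of storing long memory test vectors. A compacting signature can in principle alias when several errors cancel, so a matching signature is strong evidence rather than proof.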

3.0 References

[1] G. de Haan, J. Kettenis, B. De Loore, "IC for motion-compensated 100Hz TV with smooth-motion movie-mode," Digest of the ICCE 95, Chicago.
[2] P. E. R. Lippens, J. L. van Meerbergen, A. van der Werf, W. F. J. Verhaegh, B. T. McSweeney, J. O. Huisken, and O. P. McArdle, "PHIDEO: A silicon compiler for high speed algorithms," in Proc. EDAC, Amsterdam, The Netherlands, Feb. 1991, pp. 436-441.

Fig. 1: Field repetition causes moving objects to be perceived twice by the object-tracking observer. (Plot of position against time in field numbers 1 to 8, showing the object in original fields, the object in interpolated fields, and the expected position of the object in field 6.)

Fig. 2: The IC architecture: one top-level processor organizes the data transfer between three processors and peripheral functions. (Block diagram. Labels: input video, external field memory, external microprocessor, µP interface, pixel line delays for the current and previous fields, horizontal caches, two block matchers, best match selection, vector line delays, motion estimation processor, temporal vector prediction, field delay, vector postprocessing, vector processor, interpolator, interpolation processor, noise reduction, formatting, reformatting, colour processing, top-level processor, output video.)

Table 1: Behavioural versus RT level spec

                     behavioural spec    RT level spec
abstraction level    algorithm           architecture
video image          3D-array            serial pixel stream
time base            video frame         clock cycle
language             C                   VHDL, VERILOG
simulation time      1                   500
code size            1                   50

Table 2: Embedded memory instances

                     instances   R/W ports                   # words    circuit style
pixel line delay     15          2 ports, 8 bit, 33 MHz      896        FIFO-type DRAM
horizontal cache     6           2 ports, 40 bit, 33 MHz     15...30    2-port register file
vector field delay   1           1 port, 10 bit, 16.5 MHz    4096       1-port DRAM
vector line delay    2           1 port, 10 bit, 33 MHz      54         1-port SRAM
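As a plausibility check, the memory bandwidth can be estimated from the entries of Table 2. The sketch assumes that every port transfers one word per clock cycle at the listed frequency; that accounting is an assumption and not a calculation given in the text.

```python
# (name, instances, ports, bits per word, clock in MHz), as read from Table 2.
MEMORIES = [
    ("pixel line delay",   15, 2,  8, 33.0),
    ("horizontal cache",    6, 2, 40, 33.0),
    ("vector field delay",  1, 1, 10, 16.5),
    ("vector line delay",   2, 1, 10, 33.0),
]


def bandwidth_gbit_s(instances, ports, bits, mhz):
    """Peak bandwidth, assuming one word per port per clock cycle."""
    return instances * ports * bits * mhz * 1e6 / 1e9


per_type = {name: bandwidth_gbit_s(n, p, b, f) for name, n, p, b, f in MEMORIES}
total_instances = sum(n for _, n, _, _, _ in MEMORIES)
total_gbit_s = sum(per_type.values())
```

Under this assumption the 24 instances add up to about 24.6 Gbit/s, in line with the 25 Gbit/s quoted in the text, with the horizontal caches contributing the largest share.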

Fig. 3: Clock distribution scheme. (Diagram labels: ck_ext, clk generator, ck_core, ck_sync_in, ck_sync_out, ck.)

Table 3: IC characteristics

process               0.8 µm CMOS
dissipation           1.8 W
compute power         10 Gops
memory bandwidth      25 Gbit/s
die size              97 mm²
embedded memory       170 Kbit
transistor count      980,000
test vector length    260,000
max. clock rate       33 MHz
package               PLCC84

Fig. 4: Chip photograph
